This workshop aims to foster collaborations between researchers across multiple disciplines through a set of central questions and techniques for algorithm design for large data.
By bringing together experts from diverse fields, we seek to explore innovative approaches and shared challenges in this rapidly evolving area.
Due to a generous grant from SIGACT, the workshop is free to attend for all interested participants.
The workshop is now over. Thanks for your interest in WALDO 2025!
See below for talk details, including abstracts.
12:00 pm ET
Opening Remarks by SIGACT
12:05 pm ET
Anupam Gupta: The Price of Explainability for Clustering
[Abstract]
A recent line of work asks: given a clustering optimization problem (like k-means or k-medians), how does the cost of the best explainable clustering compare to that of the best clustering?
We consider the model where a clustering is called explainable if it can be defined by a tree of axis-parallel cuts. Building on a recent sequence of works, we pin down the optimal price of explainability
for k-median clustering, and also improve on existing results for k-means clustering. In this talk, I will discuss some of these results (both ours and prior results), the techniques underlying them,
and point out some of the open questions in the area.
This talk is based on this paper (https://arxiv.org/abs/2304.09743) with Madhusudhan Pittu (CMU), Ola Svensson (EPFL), and Rachel Yuan (CMU).
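For readers unfamiliar with the model, "explainable by a tree of axis-parallel cuts" means that cluster membership is determined by a small threshold tree over the coordinates, so each cluster is an axis-parallel box. A toy sketch of such an assignment rule in Python (illustrative only, not the paper's construction; the tree below is hypothetical):

def assign(point, node):
    """Walk a threshold tree: each internal node tests a single coordinate
    against a threshold, so every leaf (cluster) is an axis-parallel box."""
    while isinstance(node, dict):                     # internal node
        branch = "left" if point[node["axis"]] <= node["threshold"] else "right"
        node = node[branch]
    return node                                       # leaf = cluster id

# Three clusters defined by two axis-parallel cuts.
tree = {"axis": 0, "threshold": 0.5,
        "left": 0,
        "right": {"axis": 1, "threshold": 2.0, "left": 1, "right": 2}}

print(assign((0.2, 9.0), tree), assign((0.9, 1.0), tree), assign((0.9, 5.0), tree))  # 0 1 2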
12:30 pm ET
Mark Braverman: Optimality of Frequency Moment Estimation
[Abstract]
Estimating the second frequency moment of a stream up to (1 ± ε) multiplicative error requires at most O(log n / ε²) bits of
space, due to a seminal result of Alon, Matias, and Szegedy. It is
also known that at least Ω(log n + 1/ε²) space is needed.
We prove a tight lower bound of Ω(log(nε²) / ε²) for all ε = Ω(1/√n).
When ε > n^(-1/2 + c), where c > 0, our lower bound matches the
classic upper bound of AMS. For smaller values of ε, we also introduce
a revised algorithm that improves the classic AMS bound and matches
our lower bound.
Our lower bound also applies to the more general problem of p-th
frequency moment estimation for the range of p in (1, 2], providing a
tight bound in the only remaining range to settle the optimal space
complexity of estimating frequency moments.
Based on joint work with Or Zamir.
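As background for the AMS bound mentioned above, here is a toy version of the classic second-moment estimator (illustrative only; it uses fully random signs rather than the 4-wise independent hashing needed for small space, and it is not the revised algorithm from the talk):

import random
from collections import defaultdict

def ams_f2_estimate(stream, num_estimators=2000):
    """Toy AMS-style estimator for F2 = sum_i f_i^2.
    Each estimator keeps a single counter z = sum_i s(i) * f_i with random
    signs s(i); E[z^2] = F2, and averaging reduces the variance."""
    estimates = []
    for _ in range(num_estimators):
        signs = defaultdict(lambda: random.choice((-1, 1)))
        z = 0
        for item in stream:
            z += signs[item]
        estimates.append(z * z)
    return sum(estimates) / num_estimators

stream = [1, 2, 1, 3, 1, 2]   # frequencies 3, 2, 1, so F2 = 9 + 4 + 1 = 14
print(ams_f2_estimate(stream))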
12:55 pm ET
Coffee Break (15 mins)
1:10 pm ET
David P. Woodruff: Lifting Linear Sketches: Optimal Bounds and Adversarial Robustness
[Abstract]
We introduce a novel technique for "lifting" dimension lower bounds for linear sketches in the continuous setting to dimension lower bounds for
linear sketches with polynomially-bounded integer entries when the input is a polynomially-bounded integer vector. Using this technique,
we obtain the first optimal sketching lower bounds for discrete inputs in a data stream, for classical problems such as approximating
the frequency moments, estimating the operator norm, and compressed sensing. Additionally, we lift the adaptive attack of Hardt and Woodruff (STOC, 2013)
for breaking any real-valued linear sketch via a sequence of real-valued queries, and show how to obtain an attack on any integer-valued linear sketch using
integer-valued queries. This shows that there is no linear sketch in a data stream with insertions and deletions that is adversarially robust for approximating
any $L_p$ norm of the input. This resolves a central open question for adversarially robust streaming algorithms. To do so, we introduce a
new pre-processing technique of independent interest which, given an integer-valued linear sketch, increases the dimension of the sketch
by only a constant factor in order to make the orthogonal lattice to its row span smooth, enabling us to leverage results in lattice theory on
discrete Gaussian distributions and reason that efficient discrete sketches imply efficient continuous sketches.
Our work resolves open questions from the Banff '14 and '17 workshops on Communication Complexity and Applications,
as well as the STOC '21 and FOCS '23 workshops on adaptivity and robustness.
Based on joint work with Elena Gribelyuk, Honghao Lin, Huacheng Yu, and Samson Zhou.
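As background, a linear sketch maintains S·x for a fixed sketching matrix S, which is exactly why it handles streams with both insertions and deletions. A minimal illustration of that update rule (a generic sketch, unrelated to the attacks or lower bounds in the talk; the ±1 matrix below is a placeholder):

import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 50
S = rng.choice((-1, 1), size=(k, n)).astype(float)   # fixed sketching matrix, k << n
sketch = np.zeros(k)

def update(i, delta):
    """Insertion (delta = +1) or deletion (delta = -1) of coordinate i:
    by linearity, the sketch of x is maintained without ever storing x."""
    global sketch
    sketch += delta * S[:, i]

update(3, +1); update(3, +1); update(7, -1)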
1:35 pm ET
Slobodan Mitrović: Computing Graph Cuts Privately
[Abstract]
With the increasing availability of publicly shared statistics derived from private datasets, safeguarding users' personal information has become crucial. Differential privacy (DP)
has emerged as a widely adopted framework for quantifying the level of individual privacy an algorithm preserves.
Over the past decade, numerous fundamental algorithms have been studied within the context of DP. This presentation will focus on
recent advances in the differentially private computation of graph cuts. We will also touch on multiple interesting open problems in this area of research.
2:00 pm ET
Monday Poster Session
3:00 pm ET
Madhur Tulsiani: Expander Graphs and Optimally List-Decodable Codes
[Abstract]
We construct a new family of explicit codes that are list decodable to capacity and achieve an optimal list size of O(1/ϵ).
In contrast to existing explicit constructions of codes achieving list decoding capacity, our arguments do not rely on
algebraic structure but utilize simple combinatorial properties of expander graphs.
Our construction is based on a celebrated distance amplification procedure due to Alon, Edmonds, and Luby [FOCS'95],
which transforms any high-rate code into one with near-optimal rate-distance tradeoff. We generalize it to show that the same procedure can be used
to transform any high-rate code into one that achieves list decoding capacity. Our proof can be interpreted as a "local-to-global" phenomenon
for (a slight strengthening of) the generalized Singleton bound.
As a corollary of our result, we also obtain the first explicit construction of LDPC codes achieving list decoding capacity,
and in fact arbitrarily close to the generalized Singleton bound.
Based on joint work with Fernando Granha Jeronimo, Tushant Mittal, and Shashank Srivastava.
3:25 pm ET
Meghal Gupta: Stream-Decodable Error-Correcting Codes
[Abstract]
In the standard noisy communication model, Alice encodes a message using an error-correcting code and sends it to Bob,
who decodes it after receiving the entire message and storing it in memory. In this talk, we'll explore what happens when Bob
doesn't have enough memory to store the whole message and must instead decode it bit by bit as it arrives. We'll define what it
means for a code to be stream-decodable and present nearly matching upper and lower bounds on the code length required in this setting.
This is based on joint works with Venkat Guruswami, Mihir Singhal, and Rachel Zhang.
3:50 pm ET
Coffee Break (15 mins)
4:05 pm ET
Rocco Servedio: Is Nasty Noise Actually Harder than Malicious Noise?
[Abstract]
We consider the relative abilities and limitations of computationally efficient algorithms for learning in the presence of noise, under two well-studied and challenging adversarial noise models for learning Boolean functions:
* malicious noise, in which an adversary can arbitrarily corrupt a random subset of examples given to the learner; and
* nasty noise, in which an adversary can arbitrarily corrupt an adversarially chosen subset of examples given to the learner.
We consider both the distribution-independent and fixed-distribution settings. Our main results highlight a dramatic difference between these two settings:
1. For distribution-independent learning, we prove a strong equivalence between the two noise models: If a class C of functions is efficiently learnable in the presence of η-rate malicious noise, then it is also efficiently learnable in the presence of η-rate nasty noise.
2. In sharp contrast, for the fixed-distribution setting we show an arbitrarily large separation: Under a standard cryptographic assumption, for any arbitrarily large value r there exists a concept class for which there is a ratio of r between the rate η_malicious of
malicious noise that polynomial-time learning algorithms can tolerate, versus the rate η_nasty of nasty noise that such learning algorithms can tolerate.
Joint work with Guy Blanc, Yizhi Huang, and Tal Malkin.
4:30 pm ET
Pravesh Kothari: The Quasi-polynomial Low-Degree Conjecture is False
[Abstract]
There is a growing body of work on proving hardness results for average-case optimization problems by bounding
the low-degree likelihood ratio (LDLR) --- a quantitative estimate of the closeness of low-degree moments --- between a null distribution
and a related planted distribution. Such hardness results are now ubiquitous for foundational problems in algorithms, statistics, and cryptography.
This line of work is supported by the low-degree conjecture of Hopkins, which postulates that a vanishing degree-D LDLR implies
the absence of any noise-tolerant distinguishing algorithm with runtime n^{\widetilde{O}(D)} whenever 1) the null distribution is
product on $\{0,1\}^{\binom{n}{k}}$, and 2) the planted distribution is permutation invariant, that is, invariant under any relabeling $[n] \rightarrow [n]$.
In this talk, I'll present our refutation of this conjecture. Specifically, we show that for all sufficiently small $\epsilon>0$ and
any $k\geq 2$, there is a permutation-invariant planted distribution on $\{0,1\}^{\binom{n}{k}}$ that has a vanishing degree-$n^{1-O(\epsilon)}$
LDLR with respect to the uniform distribution on $\{0,1\}^{\binom{n}{k}}$ even as an $n^{O(\log^{1/(k-1)}(n))}$-time algorithm solves
the corresponding $\epsilon$-noisy distinguishing problem. Our construction relies on algorithms for list decoding for noisy polynomial interpolation in the high-error regime.
We also give another construction of a planted distribution and a (non-product) null distribution on $\mathbb{R}^{n \times n}$ with a
vanishing $n^{\Omega(1)}$-degree LDLR even as simply the largest eigenvalue serves as an efficient noise-tolerant distinguisher.
Our results suggest that while a vanishing LDLR bound may still be interpreted as evidence of hardness, developing a theory of average-case
complexity based on such heuristics requires a more careful approach.
12:00 pm ET
Opening Remarks by Steering Committee
12:05 pm ET
Michael Kapralov: Improved Algorithms for Kernel Matrix-Vector Multiplication Under Sparsity Assumptions
[Abstract]
Motivated by the problem of fast processing of attention matrices, we study fast algorithms for computing matrix-vector products for asymmetric Gaussian Kernel matrices $K\in \mathbb{R}^{n\times n}$.
$K$'s columns are indexed by a set of $n$ keys $k_1, k_2, \ldots, k_n \in \mathbb{R}^d$, rows by a set of $n$ queries $q_1, q_2, \ldots, q_n \in \mathbb{R}^d$, and
its $i,j$ entry is $K_{ij} = e^{-\|q_i-k_j\|_2^2/2\sigma^2}$ for some bandwidth parameter $\sigma>0$. Given a vector $x\in \mathbb{R}^n$ and error parameter $\epsilon>0$,
our task is to output a $y\in \mathbb{R}^n$ such that $\|Kx-y\|_2\leq \epsilon \|x\|_2$ in time subquadratic in $n$ and linear in $d$.
Our algorithms rely on the following modelling assumption about the matrices $K$: the sum of the entries of $K$ scales linearly in $n$,
as opposed to worst case quadratic growth. We validate this assumption experimentally, for Gaussian kernel matrices encountered in
various settings such as fast attention computation in LLMs. We obtain the first subquadratic-time algorithm that works under this assumption,
for unrestricted vectors.
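For reference, computing the product exactly in the notation above takes Θ(n²d) time; the subquadratic algorithms in the talk aim to beat this baseline under the linear-mass assumption. A direct (quadratic-time) baseline looks like the following sketch, with synthetic data standing in for real keys and queries:

import numpy as np

def gaussian_kernel_matvec(queries, keys, x, sigma=1.0):
    """Exact baseline: y = K x with K_ij = exp(-||q_i - k_j||^2 / (2 sigma^2)).
    Forming K explicitly costs Theta(n^2 d) time and Theta(n^2) memory."""
    sq_dists = ((queries[:, None, :] - keys[None, :, :]) ** 2).sum(axis=-1)  # n x n
    K = np.exp(-sq_dists / (2 * sigma ** 2))
    return K @ x

rng = np.random.default_rng(0)
n, d = 500, 16
q, k, x = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=n)
y = gaussian_kernel_matvec(q, k, x)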
12:30 pm ET
Sanjeev Khanna: Correlation Clustering and (De)Sparsification: Graph Sketches Can Match Classical Algorithms
[Abstract]
Correlation clustering is a widely-used approach for clustering large data sets based only on pairwise similarity information.
In recent years, there has been a steady stream of better and better classical algorithms for approximating this problem.
Meanwhile, another line of research has focused on porting the classical advances to various sublinear algorithm models, including semi-streaming,
Massively Parallel Computation (MPC), and distributed computing. Yet, these latter works typically rely on
ad-hoc approaches that may not always keep up with advances in improved approximation ratios achieved by classical algorithms.
This raises the following natural question: can the gains made by classical algorithms for correlation clustering be ported over to
sublinear algorithms in a black-box manner? We answer this question in the affirmative via the paradigm of graph de-sparsification that may be of independent interest.
This is joint work with Sepehr Assadi and Aaron (Louie) Putterman.
12:55 pm ET
Coffee Break (15 mins)
1:10 pm ET
Nicole Wein: Covering Approximate Shortest Paths with DAGs
[Abstract]
I will talk about a new notion that we define, called a "DAG cover". It is a directed analog of a tree cover,
which is closely related to a probabilistic tree embedding. A DAG cover of a general directed graph G is a
small collection of DAGs so that for all pairs of vertices s,t, some DAG in the collection provides low distortion for dist(s,t).
I will discuss upper and lower bounds for DAG covers in various parameter regimes, and pose some open problems.
Joint work with Sepehr Assadi and Gary Hoppenworth.
1:35 pm ET
Karthik C.S.: Near-Optimal Lower Bound for Parameterized Euclidean k-means
[Abstract]
The Euclidean k-means problem is a fundamental problem in computational geometry, where given a set of points in high-dimensional space,
the goal is to find k representative points so as to minimize the sum of the squared distances from each point to its closest representative.
In seminal works, de la Vega, Karpinski, Kenyon, and Rabani [STOC'03] and Kumar, Sabharwal, and Sen [JACM'10] showed how to obtain a (1+eps)-approximation for
high-dimensional Euclidean k-means in time poly(nd) · 2^{(k/eps)^{O(1)}}.
In this talk, we introduce a new fine-grained hypothesis called Exponential Time for Expanders Hypothesis (XXH), which roughly asserts that there are no
non-trivial exponential-time approximation algorithms for the vertex cover problem on near-perfect vertex expanders. Assuming XXH,
we close the above long line of work on approximating Euclidean k-means by showing that there is no poly(nd) 2^(o(k/eps)) time algorithm
achieving a (1+eps)-approximation for k-means in Euclidean space.
This lower bound is tight as it matches the algorithm given by Feldman, Monemizadeh, and Sohler [SoCG'07] up to log factors in the exponent.
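As a reminder, the objective itself is simple to evaluate; the hardness concerns the time needed to approximately minimize it. A brute-force evaluation of the cost of a fixed set of representatives (illustrative only, with synthetic data) is:

import numpy as np

def kmeans_cost(points, centers):
    """Euclidean k-means objective: sum over points of the squared distance
    to the nearest representative."""
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # n x k
    return sq_dists.min(axis=1).sum()

rng = np.random.default_rng(1)
P = rng.normal(size=(200, 5))
C = P[rng.choice(len(P), size=3, replace=False)]   # a candidate set of k = 3 representatives
print(kmeans_cost(P, C))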
2:00 pm ET
Soheil Behnezhad: Vizing's Theorem in Near-Linear Time
[Abstract]
In his classic result, Vizing (1964) proved that any graph of maximum degree ∆ can be edge colored using at most ∆+1 different colors.
Vizing’s original proof is algorithmic and shows that such an edge coloring can be found in O(mn) time where m and n denote the number of
edges and vertices respectively. This was subsequently improved to O~(m \sqrt{n}) independently by [Arjomandi ’82; Gabow et al. ’85].
This bound remained state-of-the-art for four decades, and only recently got improved to O~(min{n^2, mn^{1/4}}) [Assadi ’24; Bhattacharya et al. ’24].
In this talk, I will present a completely new approach to edge coloring that leads to the
first near-linear—in fact O(m log ∆)—time algorithm for Vizing’s theorem.
Based on a recent joint work with Sepehr Assadi, Sayan Bhattacharya, Martín Costa, Shay Solomon, and Tianyi Zhang (to appear in STOC 2025).
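For context, a single greedy pass already edge-colors any graph with at most 2∆ − 1 colors; the difficulty the talk addresses is achieving Vizing's ∆ + 1 bound in near-linear time. A toy sketch of the greedy baseline (not the talk's algorithm):

def greedy_edge_coloring(edges):
    """Greedy baseline: give each edge the smallest color not already used at
    either endpoint. Each edge sees at most 2*(Delta - 1) forbidden colors, so
    at most 2*Delta - 1 colors are ever used (compare Vizing's Delta + 1)."""
    used = {}        # vertex -> set of colors incident to it
    coloring = {}
    for u, v in edges:
        taken = used.setdefault(u, set()) | used.setdefault(v, set())
        color = next(c for c in range(len(taken) + 1) if c not in taken)
        coloring[(u, v)] = color
        used[u].add(color)
        used[v].add(color)
    return coloring

print(greedy_edge_coloring([(1, 2), (2, 3), (1, 3), (3, 4)]))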
2:25 pm ET
Break (35 mins)
3:00 pm ET
Vincent Cohen-Addad: Solving the Correlation Cluster LP in Sublinear Time
[Abstract]
Correlation Clustering is a fundamental and widely-studied problem in unsupervised learning and data mining. The input is a graph and the goal is to construct a
clustering minimizing the number of inter-cluster edges plus the number of missing intra-cluster edges.
[CCL+24] introduced the cluster LP for Correlation Clustering, which they argued captures the problem much more succinctly than previous linear programming formulations.
However, the cluster LP has exponential size, with a variable for every possible set of vertices in the input graph. Nevertheless, [CCL+24] showed how to find a feasible
solution for the cluster LP in time n^{poly(1/ε)} with objective value at most (1 + ε) times the value of an optimal solution for the respective Correlation Clustering instance.
Furthermore, they showed how to round a solution to the cluster LP, yielding a (1.437 + ε)-approximation algorithm for the Correlation Clustering problem.
The main technical result of this paper is a new approach to find a feasible solution for the cluster LP with objective value at most (1 + ε) times the optimum in time Õ(2^{poly(1/ε)} · n),
where n is the number of vertices in the graph. We also show how to implement the rounding within the same time bounds, thus achieving a fast (1.437 + ε)-approximation
algorithm for the Correlation Clustering problem. This bridges the gap between state-of-the-art methods for approximating Correlation Clustering and the recent focus on fast
algorithms.
Joint work with Nairen Cao, Euiwoong Lee, Shi Li, David Rasmussen Lolck, Alantha Newman, Mikkel Thorup, Lukas Vogl, Shuyi Yan, and Hanwen Zhang.
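For concreteness, the objective value of a candidate clustering is easy to evaluate directly (a toy sketch, not the LP-based algorithm from the talk; the small graph below is hypothetical):

from itertools import combinations

def correlation_clustering_cost(vertices, edges, clusters):
    """Number of disagreements: edges cut between clusters plus missing edges
    inside clusters. `edges` is a set of frozensets and `clusters` maps each
    vertex to its cluster id."""
    cost = 0
    for u, v in combinations(vertices, 2):
        same_cluster = clusters[u] == clusters[v]
        is_edge = frozenset((u, v)) in edges
        if is_edge != same_cluster:
            cost += 1
    return cost

V = [1, 2, 3, 4]
E = {frozenset(e) for e in [(1, 2), (2, 3), (3, 4)]}
print(correlation_clustering_cost(V, E, {1: 0, 2: 0, 3: 1, 4: 1}))   # -> 1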
3:25 pm ET
Elena Grigorescu: Differential Privacy and Sublinear-Time are Incompatible Sometimes
[Abstract]
Differential privacy and sublinear algorithms have both been well-studied paradigms in big data analysis. Although recent works have shown the existence of differentially private
sublinear algorithms for many problems, little is known regarding hardness results on these algorithms. In this paper, we initiate the study of lower bounds for problems that
aim for both differentially private and sublinear-time algorithms. Our main result shows that the two desiderata are incompatible in general.
In particular, we prove that a simple problem based on one-way marginals admits both a differentially private algorithm and a sublinear-time algorithm,
but does not admit a "strictly" sublinear-time algorithm that is also differentially private.
Joint work with Jeremiah Blocki, Hendrik Fichtenberger,
Tamalika Mukherjee.
3:50 pm ET
Coffee Break (15 mins)
4:05 pm ET
Jane Lange: Lifting Uniform Learners via Distributional Decomposition
[Abstract]
We show how any PAC learning algorithm that works under the uniform distribution can be transformed, in a blackbox fashion, into one that works under an
arbitrary and unknown distribution $\mathcal{D}$. The efficiency of our transformation scales with the inherent complexity of $\mathcal{D}$,
running in $\mathrm{poly}(n, (md)^d)$ time for distributions over $\{\pm 1\}^n$ whose pmfs are computed by depth-$d$ decision trees,
where $m$ is the sample complexity of the original algorithm. For monotone distributions our transformation uses only samples from $\mathcal{D}$,
and for general ones it uses subcube conditioning samples.
A key technical ingredient is an algorithm which, given the aforementioned access to $\mathcal{D}$, produces an \emph{optimal} decision tree
decomposition of $\mathcal{D}$: an approximation of $\mathcal{D}$ as a mixture of uniform distributions over disjoint subcubes.
With this decomposition in hand, we run the uniform-distribution learner on each subcube and combine the hypotheses using the decision tree.
This algorithmic decomposition lemma also yields new algorithms for learning decision tree distributions with runtimes that exponentially
improve on the prior state of the art---results of independent interest in distribution learning.
4:30 pm ET
Josh Brakensiek: Redundancy Is All You Need (for Sparsification)
[Abstract]
The seminal work of Benczúr and Karger demonstrated cut sparsifiers of near-linear size, with several applications throughout theoretical computer science. Subsequent extensions have yielded sparsifiers for hypergraph cuts and more recently linear codes over Abelian groups.
A decade ago, Kogan and Krauthgamer asked about the sparsifiability of arbitrary constraint satisfaction problems (CSPs). For this question, a trivial lower bound is the size of a non-redundant CSP instance, which admits, for each constraint, an assignment satisfying only
that constraint (so that no constraint can be dropped by the sparsifier). For graph cuts, spanning trees are non-redundant instances.
Our main result is that redundant clauses are sufficient for sparsification: for any CSP predicate R, every unweighted instance of CSP(R) has a sparsifier of size at most its non-redundancy (up to polylog factors). For weighted instances, we similarly pin down the sparsifiability
to the so-called chain length of the predicate. These results precisely determine the extent to which any CSP can be sparsified. A key technical ingredient in our work is a novel application of the entropy method from Gilmer's recent breakthrough on the union-closed sets conjecture.
As an immediate consequence of our main theorem, a number of results in the non-redundancy literature immediately extend to CSP sparsification. We also contribute new techniques for understanding the non-redundancy of CSP predicates. In particular, we give an explicit family of
predicates whose non-redundancy roughly corresponds to the structure of matching vector families in coding theory. By adapting methods from the matching vector codes literature, we are able to construct an explicit predicate whose non-redundancy has a non-integral exponent.
Joint work with Venkatesan Guruswami.
12:00 pm ET
Opening Remarks by Organizers
12:30 pm ET
Sofya Raskhodnikova: Fully Dynamic Algorithms for Graphs with Edge Differential Privacy
[Abstract]
We study differentially private algorithms for analyzing graphs in the challenging setting of continual release with fully dynamic updates, where edges are inserted and deleted over time,
and the algorithm is required to update the solution at every time step. Previous work has presented differentially private algorithms for many graph problems that can handle insertions only or deletions only
(called partially dynamic algorithms) and obtained some hardness results for the fully dynamic setting. The only algorithms in the latter setting were for the edge count, given by Fichtenberger, Henzinger, and Ost (ESA '21),
and for releasing the values of all graph cuts, given by Fichtenberger, Henzinger, and Upadhyay (ICML '23). We provide the first differentially private and fully dynamic graph algorithms for several
other fundamental graph statistics (including the triangle count, the number of connected components, the size of the maximum matching, and the degree histogram), analyze their error, and show strong
lower bounds on the error for all algorithms in this setting.
In the talk, we will discuss two variants of edge differential privacy for fully dynamic graph algorithms and our current understanding of the error achievable under both variants: event-level and item-level.
Under the former notion, two graph update sequences are considered neighboring if, roughly speaking, they differ in at most one update; under the latter notion, they can differ only in updates pertaining to one edge.
Differential privacy requires that for any two neighboring inputs, the output distributions of the algorithm are close. We give upper and lower bounds on the error of both---event-level and item-level---fully dynamic algorithms
for several fundamental graph problems. No fully dynamic algorithms that are private at the item-level (the more stringent of the two notions) were known before. In the case of item-level privacy, for several problems,
our algorithms match our lower bounds.
Joint work with Teresa Anna Steiner.
12:55 pm ET
Coffee Break (15 mins)
1:10 pm ET
Piotr Indyk: A Bi-metric Framework for Fast Similarity Search
[Abstract]
We propose a new "bi-metric" framework for designing nearest neighbor data structures. Our framework assumes two dissimilarity functions:
a ground-truth metric that is accurate but expensive to compute (e.g., a cross-encoder that runs a large neural network to compare two sentences),
and a proxy metric that is cheaper but less accurate (e.g., the Euclidean distance between feature vectors representing the sentences, possibly sketched).
In both theory and practice, we show how to construct data structures using only the proxy metric such that the query procedure achieves
the accuracy of the expensive metric, while only using a limited number of calls to both metrics.
Our theoretical results instantiate this framework for two popular nearest neighbor search algorithms: DiskANN and Cover Tree.
In both cases we show that, as long as the proxy metric used to construct the data structure approximates the ground-truth metric up to a bounded factor,
our data structure achieves arbitrarily good approximation guarantees with respect to the ground-truth metric. On the empirical side, we apply the
framework to the text retrieval problem with two dissimilarity functions evaluated by ML models with vastly different computational costs.
We observe that for almost all data sets in the MTEB benchmark, our approach achieves a considerably better accuracy-efficiency tradeoff than the alternatives, such as re-ranking.
Joint work with Haike Xu and Sandeep Silwal.
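For context, the simplest bi-metric baseline mentioned at the end of the abstract is re-ranking: retrieve candidates under the cheap proxy metric, then re-score only those candidates with the expensive metric. A hedged sketch of that baseline (the distance functions and data below are placeholders, not the models used in the paper):

def rerank_search(query, items, proxy_dist, exact_dist, num_candidates=100, k=10):
    """Re-ranking baseline: filter with the cheap proxy metric, then order a
    small candidate set with the expensive ground-truth metric."""
    candidates = sorted(items, key=lambda v: proxy_dist(query, v))[:num_candidates]
    return sorted(candidates, key=lambda v: exact_dist(query, v))[:k]

items = [(float(i), float(i % 7)) for i in range(1000)]
proxy = lambda q, v: abs(q[0] - v[0])                                  # cheap surrogate
exact = lambda q, v: ((q[0] - v[0]) ** 2 + (q[1] - v[1]) ** 2) ** 0.5  # "expensive" metric
print(rerank_search((3.0, 2.0), items, proxy, exact, k=3))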
1:35 pm ET
Sumegha Garg: A New Information Complexity Measure for Multi-Pass Streaming Algorithms
[Abstract]
In this talk, we will introduce a new notion of information complexity for multi-pass streaming problems, and use it to prove memory lower bounds for the coin problem.
In the coin problem, one sees a stream of n i.i.d. uniform bits and one would like to compute the majority (or sum) with constant advantage.
We show that any constant-pass algorithm must use Ω(log n) bits of memory. This information complexity notion is also useful to prove tight
space complexity for the needle problem, which in turn implies tight bounds for the problem of approximating higher frequency moments in a data stream.
Joint work with Mark Braverman, Qian Li, Shuo Wang, David P. Woodruff and Jiapeng Zhang.
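For context, the trivial one-pass algorithm that this Ω(log n) bound matches is simply an exact counter, which uses O(log n) bits of memory (a toy illustration, not part of the talk's contribution):

def stream_majority(bits):
    """One-pass exact majority: two counters of O(log n) bits each, matching
    (up to constants) the lower bound for constant-pass algorithms that only
    need a constant advantage over guessing."""
    ones = total = 0
    for b in bits:          # each b is 0 or 1
        ones += b
        total += 1
    return int(2 * ones > total)

print(stream_majority([1, 0, 1, 1, 0, 1]))   # -> 1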
2:00 pm ET
Wednesday Poster Session
3:00 pm ET
Sepehr Assadi: Distributed Triangle Detection is Hard in Few Rounds
[
Abstract]
We prove that any distributed algorithm for detecting the existence of triangles in synchronous n-vertex networks with O(log n) bandwidth per edge (namely, the CONGEST model)
requires Omega(log log n) rounds. This establishes the first multi-round lower bound for this problem.
It has been known that the standard communication complexity arguments in the CONGEST model are provably incapable of establishing any meaningful lower bounds for the triangle detection problem.
Our main technical contribution is thus a new information theoretic argument which combines recent advances on multi-pass graph streaming lower bounds with the point-to-point
communication aspects of distributed models, and can be of independent interest.
Based on joint work with Janani Sundaresan (available on arXiv: https://arxiv.org/abs/2504.01802).
3:25 pm ET
Huacheng Yu: Near-Optimal Relative Error Streaming Quantile Estimation via Elastic Compactors
[Abstract]
Computing the approximate quantiles or ranks of a stream is a fundamental task in data monitoring. Given a stream of elements x_1, x_2, ..., x_n and a query x,
a relative-error quantile estimation algorithm can estimate the rank of x with respect to the stream up to a multiplicative (1 ± eps) error, i.e., an additive error of at most eps · rank(x).
Notably, this requires the sketch to obtain more precise estimates for the ranks of elements on the tails of the distribution, as compared to the
additive ± eps·n error regime. This is particularly favorable for some practical applications, such as anomaly detection.
Previously, the best known algorithms for relative error achieved space O(eps^-1 log^1.5(eps n)) (Cormode, Karnin, Liberty, Thaler, Veselý, 2021) and
O(eps^-2 log(eps n)) (Zhang, Lin, Xu, Korn, Wang, 2006). In this work, we present a nearly-optimal streaming algorithm for the relative-error
quantile estimation problem using O~(eps^-1 log(eps n)) space, which almost matches the trivial offline Omega(eps^-1 log(eps n)) space lower bound.
To surpass the eps^-1 log^1.5(eps n) barrier of the previous approach, our algorithm crucially relies on a new data structure, called the elastic compactor,
which can be dynamically resized over the course of the stream. Interestingly, we design a space allocation scheme which adaptively allocates space to
each compactor based on the "hardness" of the input stream. This approach allows us to avoid using the maximal space simultaneously for every compactor
and facilitates the improvement in the total space complexity. Along the way, we also propose and study a new problem called the Top Quantiles Problem,
which only requires the sketch to provide estimates for the ranks of elements in a fixed-length tail of the distribution.
This problem serves as an important subproblem in our algorithm, though it is also an interesting problem in its own right.
Joint work with Elena Gribelyuk, Pachara Sawettamalya, Hongxun Wu.
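To make the error guarantee concrete, the quantity being approximated and the allowed deviation can be spelled out directly (a toy non-streaming check, not the sketch from the talk):

def rank(stream, x):
    """Exact rank: the number of stream elements that are at most x."""
    return sum(1 for v in stream if v <= x)

def within_relative_error(estimate, stream, x, eps):
    """Relative-error guarantee: the estimate may deviate by at most
    eps * rank(x), so tail elements must be estimated more precisely
    than under an additive eps * n guarantee."""
    r = rank(stream, x)
    return abs(estimate - r) <= eps * r

data = list(range(1, 1001))
print(rank(data, 10), within_relative_error(10.4, data, 10, eps=0.05))   # -> 10 True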
3:50 pm ET
Coffee Break (15 mins)
4:05 pm ET
Maryam Aliakbarpour: Leveraging Predictions for Efficient Hypothesis Testing in Discrete Distributions
[Abstract]
Prior knowledge—whether from historical data, domain expertise, or predictive models—can enhance statistical inference by reducing sample complexity.
We introduce a general framework for leveraging such predictions and apply it to hypothesis testing for discrete distributions. In the standard setting,
optimal sample complexity bounds are known for identity, closeness, and uniformity testing. We show that access to a predicted distribution can reduce the required samples,
with gains determined by its total variation distance from the true distribution. Our algorithms are adaptive, adjusting to prediction accuracy without prior calibration,
and robust, never exceeding the standard sample complexity when predictions are uninformative. We establish information-theoretic lower bounds confirming optimality and
present experimental results demonstrating significant practical improvements.
4:30 pm ET
Clément Canonne: Better Private Distribution Testing by Leveraging Unverified Auxiliary Data
[Abstract]
We extend the framework of augmented distribution testing (Aliakbarpour, Indyk, Rubinfeld, and Silwal, NeurIPS 2024) to the differentially private setting.
This captures scenarios where a data analyst must perform hypothesis testing tasks on sensitive data, but is able to leverage prior knowledge
(public, but possibly erroneous or untrusted) about the data distribution.
We design private algorithms in this augmented setting for three flagship distribution testing tasks, uniformity, identity, and closeness testing,
whose sample complexity smoothly scales with the claimed quality of the auxiliary information. We complement our algorithms with
information-theoretic lower bounds, showing that their sample complexity is optimal (up to logarithmic factors).
Joint work with Maryam Aliakbarpour, Arnav Burudgunte, and Ronitt Rubinfeld.
Posters
Monday Poster Session:
Wednesday Poster Session:
Support
WALDO 2025 is generously supported by a community grant from SIGACT.
Web design by Pedro Paredes.
Steering Committee