Abstract:
Given data from some experiment, inferring information from
the underlying distribution is of prime importance, and has been
extensively studied. However, due to the huge size of the data,
traditional methods are often no longer applicable. Thus new tools and
techniques are being developed for inferring useful information from
large amounts of data. This thesis makes progress in this direction.
The primary goal is to design efficient randomized algorithms aka.
testers that can distinguish whether a given unknown object is
"close'' or "far'' from a property of interest with as few accesses as
possible. This is referred to as distribution testing when the unknown
object is a probability distribution, and graph property testing when
it is a graph. The minimum number of samples required to decide a
property in distribution testing is referred to as sample complexity,
while in graph property testing, it is referred to as query
complexity.
In this thesis, we study several fundamental problems in distribution
and graph property testing such as (i) Can one design a tolerant
tester for any distribution property with only black-box access to a
non-tolerant tester? (ii) Does there exist distribution properties
with global structure that can be learnt efficiently? (iii) the role
of adaptivity in distribution testing, and tolerant testing for (iv)
graph isomorphism and (v) bipartiteness.
The results of the thesis are divided into three parts. In the first
part, we study the connection between the sample complexities of
non-tolerant and tolerant testing of distributions and prove a tight
quadratic gap for label-invariant (symmetric) properties, while
providing lower bounds for non-concentrated properties. We also
present an algorithm that can learn a concentrated distribution even
when its support set is unknown apriori.
In the second part, we investigate problems (ii) and (iii) in huge
object model, where distributions are defined over n-dimensional
Hamming cube and the tester obtains n-bit strings as samples. Since
reading the string in its entirety may not be feasible for large n,
the tester has query access to the sampled strings. We define the
notion of index-invariant properties, properties that are invariant
under the permutations of the indices {1,......,n} and prove that any
index-invariant property whose VC-dimension is bounded has a tester
whose query complexity is independent of n and depends only on
VC-dimension. Moreover, the dependencies of sample and query
complexities of our tester on the VC-dimension are tight. We also
study the power of adaptiveness in this model and prove a tight
quadratic separation between query complexities of adaptive and
non-adaptive testers for index-invariant properties, compared to tight
exponential separation for its non-index-invariant counterpart.
In the third part, we study property testing of dense graphs and give
positive answers to problems (iv) and (v). We prove that tolerant
graph isomorphism testing is equivalent to the problem of estimating
the Earth Mover Distance of two distributions, constructed from the
graphs. Moreover, our equivalence proof is model independent. Finally
we design a tester for tolerant bipartiteness testing whose query and
time complexities are significantly better compared to previous works.