In this post we will further expand the setting in which concentration inequalities are stated. Remember that we started from the arithmetical mean

$$\bar X_n = \frac{1}{n}\sum_{i=1}^n X_i,$$

where all $X_i$ were independent. We first showed that if $X_i$ are sufficiently “light-tailed” (subgaussian), then $\bar X_n$ concentrates around its mean, with a remarkable special case when $X_i$ are bounded. We then obtained another result (concentration of quadratic forms) for Gaussian $X$, when $f(X)$ is a quadratic form of a Gaussian vector (actually it can also be extended to the subgaussian case); again, we by no means relaxed the independence of $X_1, \dots, X_n$. The goal of this post is to relax this assumption to some extent.
Recall Hoeffding's inequality:
Theorem (Hoeffding’s inequality)
For independent $X_1, \dots, X_n$ with $X_i \in [a_i, b_i]$ almost surely, it holds that

$$\mathbb P\left(\sum_{i=1}^n (X_i - \mathbb E X_i) \ge t\right) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$
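To see the bound in action, here is a quick numerical sketch (the setup and variable names are my own, not from the post): we simulate sums of independent uniform variables on $[0,1]$ and compare the empirical tail with the Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials, t = 50, 100_000, 10.0
# X_i uniform on [0, 1]: a_i = 0, b_i = 1, E X_i = 1/2
X = rng.uniform(0.0, 1.0, size=(trials, n))
dev = X.sum(axis=1) - n * 0.5

emp_tail = (dev >= t).mean()
hoeffding = np.exp(-2 * t**2 / (n * 1.0))  # sum of (b_i - a_i)^2 equals n here

print(emp_tail, hoeffding)  # empirical tail is far below the bound
```

The Hoeffding bound is conservative here: a deviation of $t = 10$ is about five standard deviations of the sum, so the true tail is orders of magnitude smaller than $e^{-4}$.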
As we will see in a while, it is the boundedness of $X_i$ that allows us to relax their independence. The proof idea was that, due to the independence of $X_1, \dots, X_n$, the cumulant of the sum was the sum of cumulants:

$$\psi_{\sum_i X_i}(\lambda) = \log \mathbb E\, e^{\lambda \sum_{i=1}^n X_i} = \sum_{i=1}^n \log \mathbb E\, e^{\lambda X_i} = \sum_{i=1}^n \psi_{X_i}(\lambda). \quad (1)$$

After this we were done by proving that each individual $X_i$ was subgaussian with variance proxy $(b_i - a_i)^2/4$, and this latter fact by no means depends on the relations between $X_1, \dots, X_n$. Therefore, if we manage to prove something like (1) for a sequence of non-independent random variables, we will get an analogue of Hoeffding's inequality. As a side note, let us mention that when some functional which in a way controls the ‘scattering’ of $f(X_1, \dots, X_n)$ features behaviour similar to (1), we say that this functional tensorizes. Proving that tensorization holds for some functional (more frequently in the form of an inequality, and not only for sums but for wider classes of functions $f$) is a usual route to obtain concentration results. Specifically, the most advanced results are obtained using the tensorization of entropy.
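A minimal Monte Carlo sanity check of (1), with my own choice of distribution: for independent variables, the empirical log-moment-generating function of the sum should match the sum of the individual ones.

```python
import numpy as np

rng = np.random.default_rng(3)

trials, lam = 1_000_000, 0.7
# three independent bounded variables, uniform on [-1, 1]
X = rng.uniform(-1.0, 1.0, size=(trials, 3))

# cumulant (log-MGF) of the sum vs sum of cumulants, both estimated empirically
psi_sum = np.log(np.exp(lam * X.sum(axis=1)).mean())
sum_psi = sum(np.log(np.exp(lam * X[:, i]).mean()) for i in range(3))

print(psi_sum, sum_psi)  # the two estimates agree up to Monte Carlo error
```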
Now we come back to our setting, where $f$ is simply a sum, but $X_1, \dots, X_n$ may be dependent. We prove not (1) but a weaker property. To begin, let us limit the ‘degree’ to which $X_i$ are dependent.
A filtration is a sequence of nested sigma-algebras $(\mathcal F_k)_{k \ge 0}$: $\mathcal F_j \subseteq \mathcal F_k$ whenever $j \le k$, with $\mathcal F_0$ trivial.

The simplest (and most common) example of a filtration, which will be enough for us, is the one generated by a random sequence $X_1, X_2, \dots$ In this case $\mathcal F_k = \sigma(X_1, \dots, X_k)$, i.e. the sigma-algebra generated by the first $k$ random variables.
Consider a random sequence $(D_k)_{k \ge 1}$ and a filtration $(\mathcal F_k)_{k \ge 0}$ such that $D_k$ is measurable with respect to $\mathcal F_k$. Then $(D_k)$ is called a martingale difference sequence (MDS) with respect to $(\mathcal F_k)$ if $\mathbb E[D_k \mid \mathcal F_{k-1}] = 0$ for all $k \ge 1$.

Often $\mathcal F_k$ is generated by $D_1, \dots, D_k$, and then $(D_k)$ is simply called an MDS, without pointing out with respect to which filtration. Besides, sometimes the filtration is clear from the context, and then it is omitted as well.

I could also give the definition of a martingale here, but I will not, since I won't use it in this course. Anyway, note that if $(M_k)$ is a martingale, then the sequence of its increments $D_k = M_k - M_{k-1}$ is an MDS, and vice versa.
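The definition is easy to probe numerically. Below is a small sketch (the particular MDS is my own toy construction): the increments are clearly dependent, since the magnitude of $D_k$ is determined by the previous sign, yet the conditional mean given the past is zero.

```python
import numpy as np

rng = np.random.default_rng(1)

trials, n = 200_000, 5
eps = rng.choice([-1.0, 1.0], size=(trials, n))  # i.i.d. random signs

# An MDS that is NOT an independent sequence:
# D_k = eps_k * (1 + 0.5 * eps_{k-1}); the magnitude depends on the past,
# but the conditional mean given the past is zero.
D = eps.copy()
D[:, 1:] = eps[:, 1:] * (1 + 0.5 * eps[:, :-1])

# Empirical check of E[D_k | F_{k-1}] = 0: condition on the previous sign
k = 2
m_plus = D[eps[:, k - 1] == 1, k].mean()
m_minus = D[eps[:, k - 1] == -1, k].mean()
print(m_plus, m_minus)  # both close to 0

# ... but |D_k| given the past is 1.5 or 0.5: the sequence is dependent
print(np.abs(D[eps[:, k - 1] == 1, k]).mean(),
      np.abs(D[eps[:, k - 1] == -1, k]).mean())
```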
Lemma (Azuma's lemma)

Let $(D_k)_{k=1}^n$ be an MDS such that each $D_k$ is conditionally (absolutely) subgaussian, i.e.

$$\mathbb E\big[e^{\lambda D_k} \mid \mathcal F_{k-1}\big] \le e^{\lambda^2 \sigma_k^2/2} \quad \text{a.s. for all } \lambda \in \mathbb R,$$

with constants $\sigma_1, \dots, \sigma_n > 0$. Then $\sum_{k=1}^n D_k$ is $\sigma$-subgaussian with $\sigma^2 = \sum_{k=1}^n \sigma_k^2$, and

$$\mathbb P\left(\sum_{k=1}^n D_k \ge t\right) \le \exp\left(-\frac{t^2}{2\sum_{k=1}^n \sigma_k^2}\right).$$
Applying Hoeffding's lemma to the differences $D_k$ conditioned on $\mathcal F_{k-1}$, we get the equivalent of Hoeffding's inequality for the case when a sum of (dependent) bounded differences forms a martingale:
Theorem (Azuma-Hoeffding inequality)
Let $(D_k)_{k=1}^n$ be an MDS such that each $D_k$ is bounded: $a_k \le D_k \le b_k$ almost surely. Then $\sum_{k=1}^n D_k$ is $\sigma$-subgaussian with $\sigma^2 = \frac14 \sum_{k=1}^n (b_k - a_k)^2$, and

$$\mathbb P\left(\sum_{k=1}^n D_k \ge t\right) \le \exp\left(-\frac{2t^2}{\sum_{k=1}^n (b_k - a_k)^2}\right).$$
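Reusing the toy MDS from before (my own construction, with $|D_k| \le 3/2$, so $b_k - a_k = 3$), we can check the Azuma-Hoeffding tail bound empirically:

```python
import numpy as np

rng = np.random.default_rng(2)

trials, n, t = 100_000, 50, 15.0
eps = rng.choice([-1.0, 1.0], size=(trials, n))

# Bounded MDS: D_k = eps_k * (1 + 0.5 * eps_{k-1}), hence |D_k| <= 1.5
D = eps.copy()
D[:, 1:] = eps[:, 1:] * (1 + 0.5 * eps[:, :-1])

S = D.sum(axis=1)
emp_tail = (S >= t).mean()
azuma = np.exp(-2 * t**2 / (n * 3.0**2))  # b_k - a_k = 3 for each k

print(emp_tail, azuma)  # empirical tail stays below the Azuma-Hoeffding bound
```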
Proof of Azuma's lemma. Whereas we will not directly use that $(D_k)$ is an MDS, it actually follows from the conditional subgaussian property by considering $\lambda \to 0\pm$, so anyway we consider only MDS. Now,

$$\mathbb E\, e^{\lambda \sum_{k=1}^n D_k} = \mathbb E\left[ e^{\lambda \sum_{k=1}^{n-1} D_k}\, \mathbb E\big[e^{\lambda D_n} \mid \mathcal F_{n-1}\big]\right] \le e^{\lambda^2 \sigma_n^2/2}\, \mathbb E\, e^{\lambda \sum_{k=1}^{n-1} D_k},$$

where we used first the “tower property” of conditional expectation and then conditional subgaussianity. Iterating this, we bound the cumulant by $\frac{\lambda^2}{2}\sum_{k=1}^n \sigma_k^2$, as needed, and obtain the claim. ♦
In the post to follow, we conclude the ‘first tier’ of the course with McDiarmid's inequality, which controls the deviations of a function with bounded differences.
1. Usually one deals with bounded MDS rather than general conditionally subgaussian ones, at least in the case of combinatorial problems (on graphs) and statistical learning theory. However, one may encounter other, simpler processes with dependent and conditionally subgaussian increments (generally, with respect to some filtration, i.e. a sequence of nested sigma-algebras), not necessarily bounded. In particular, consider two simple covariance-like processes:

$$V_n = \sum_{1 \le i < j \le n} a_{ij}\, \varepsilon_i \varepsilon_j, \quad (2a)$$

where $(\varepsilon_k)$ is a sequence of i.i.d. Bernoulli random variables, and its “decoupled” Gaussian counterpart

$$V_n' = \sum_{1 \le i < j \le n} a_{ij}\, g_i g_j', \quad (2b)$$

where $(g_k)$ and $(g_k')$ are two mutually independent i.i.d. Gaussian sequences. Processes like these are sometimes called order 2 Gaussian chaos. Note that in (2a), $D_j = \varepsilon_j \sum_{i < j} a_{ij} \varepsilon_i$ is an MDS with respect to $\mathcal F_j = \sigma(\varepsilon_1, \dots, \varepsilon_j)$, with $V_n = \sum_j D_j$. Sequentially conditioning, we can upper-bound the exponential moments of the increments (with a non-random coefficient $\sum_{i<j} a_{ij}\varepsilon_i$ under the conditioning!) and obtain the deviation bounds. If in (2a) there were $\sum_{i,j}$ instead of $\sum_{i<j}$, we couldn't properly condition, and hence should use the bounds for a quadratic form of a subgaussian (Bernoulli) random vector. We would also be in the same trouble if in (2a) there were (non-bounded) Gaussians instead of Bernoulli's, essentially because the exponential moments of the increments are not absolutely bounded, although for this case there is an alternative, as described here and in Proposition 34 of [Tao]. It seems that finally it amounts to paying an additional log-term, but I might be mistaken.
Meanwhile, (2b) is easily controlled by conditioning on the whole sequence $(g_k')$ at once, without any of these problems. By the way, intuition prompts that (2b) shouldn't concentrate substantially better than its non-decoupled counterpart, where $g_j'$ is substituted with $g_j$. Whereas for this example this can be verified directly, for more complex ones there is a machinery called comparison inequalities.
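This intuition can be probed numerically. The sketch below (coefficients and names are my own choices, assuming the chaos has the form $\sum_{i<j} a_{ij} g_i g_j$) compares the spread of a Gaussian chaos with that of its decoupled version built from two independent Gaussian sequences.

```python
import numpy as np

rng = np.random.default_rng(4)

trials, n = 50_000, 30
A = np.triu(rng.standard_normal((n, n)), 1)  # coefficients a_ij for i < j only

g = rng.standard_normal((trials, n))
g2 = rng.standard_normal((trials, n))   # independent copy for decoupling

# non-decoupled chaos sum_{i<j} a_ij g_i g_j  vs  decoupled sum_{i<j} a_ij g_i g'_j
V = ((g @ A) * g).sum(axis=1)
V_dec = ((g @ A) * g2).sum(axis=1)

print(V.std(), V_dec.std())  # comparable scales: both variances equal sum of a_ij^2
```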
2. Just as we obtained the analogue of Hoeffding’s inequality for the case when the increments are conditionally (absolutely) subgaussian, we may obtain the analogue of Bernstein’s inequality for conditionally (absolutely) subexponential increments.