## Azuma-Hoeffding inequality

In this post we will further expand the setting in which concentration inequalities are stated. Remember that we started from the arithmetical mean $\displaystyle F(X) = \frac{1}{n}\sum_{\i=1}^n X_i,$

where all $X_i$  were independent. We first showed that if $X_i$  are sufficiently “light-tailed” (subgaussian), then $F(X)$  concentrates above its mean, with a remarkable special case when $X_i$  are bounded. We then obtained another result (concentration of quadratic forms) for Gaussian $X$  when $F(X)$  is a quadratic form of a Gaussian vector (actually they can also be expanded to subgaussian case); again we by no means relaxed the independency of $X.$  The goal of this post is to relax this assumption in some extent.

Recall Hoeffding inequality:

Theorem  (Hoeffding’s inequality)

For $X_1, \,\dotsc,\, X_n$  independent, $X_i \in [a_i, b_i],$  it holds $\displaystyle P\left( \sum_{i = 1}^n X_i -\mu_i \ge t \right) \le \exp \left(-\frac{2 t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).$

As we will see in a while, it is the boundness of $X_i$  that allows to relax the independence of $\xi_i.$  The proof idea was that due to independency of $X_i,$  the cumulant of the sum was the sum of cumulants: $L(\lambda) :=\displaystyle \log \mathbb{E} \exp \lambda\left(\sum_{i = 1}^n X_i - \mu_i \right) = \sum_{i=1}^n \log \mathbb{E} \exp \lambda\left(X_i - \mu_i \right). \,\,\,\,\,\,\,\,\,\,\,\, (1)$

After this we were done by proving that each individual $X_i$  was subgaussian with variance proxy $\displaystyle \sigma_i^2 = (b_i - a_i)^2/4,$  and this latter fact by no means depends on the relations between $X_i.$  Therefore, if we manage to prove something like (1) for a sequence of non-independent random variables, we will get an analogue of Hoeffding’s inequality. As a side note, let us mention that when some functional which in a way controls the ‘scattering’ of $F,$  features the behaviour similar to (1), we say that this functional  tensorizes.  Proving that tensorization holds for some functional (more frequently in the form of an inequality, and only for i.i.d. $X_i$  but for wider classes of functions $F$)  is a usual route to obtain concentration results.  Specifically, the most advanced results are obtained using the tensorization of entropy.

Now we come back to our setting, where $F(\cdot)$  is simply a sum, but $X_i$  may be dependent. We prove not (1) but a weaker property. To begin, let us limit the ‘degree’ in which $X_i$  are dependent.

Definition

A filtration $\left(\mathcal{F}_i\right)_{i=0}^n$  is a sequence of nested sigma algebras, $\mathcal{F}_i \subseteq \mathcal{F}_j$  whenever $i \le j,$  with $\mathcal{F}_0 = \{\omega\}.$

The simplest (and most common) example of filtration, which will be enough for us, is one generated by any random sequence $\left(X_i\right)_{i=1}^n.$  In this case $\mathcal{F}_i = \mathcal{F}(X_1,\, ...,\, X_i),$  i.e. a sigma-algebra generated by the first $i$  random variables.

Definition

Consider a random sequence $\left(\Delta_i\right)_{i=1}^n$  and filtration $\left(\mathcal{F}_i\right)_{0=1}^n$  such that $\Delta_i$  is measurable with respect to $\mathcal{F}_i.$ $\left(\Delta_i\right)_{i=1}^n$  is called a  martingale difference sequence (MDS) with respect to $\left(\mathcal{F}_i\right)_{i=0}^n$  if  for $1 \le i \le n$ $\mathbb{E}(\Delta_i| \mathcal{F}_{i-1}) = 0\,\,\,\, a.s.$

Often $\mathcal{F}_i$  is generated by $\Delta_1,\, ...,\, \Delta_i,$  and then $\left(\Delta_i\right)_{i=1}^n$  is called an MDS, without pointing out with respect to which filtration. Besides, sometimes the filtration is clear from the context and then it is omitted either.

I may give also the definition of  martingale   here but I will not since I won’t use it in this course. Anyway, note that if $(Y_i)_{i=1}^n$  is a martingale, then $\Delta_i = Y_i - Y_{i-1}$  is an MDS, and vice versa.

Azuma’s lemma

Let $\Delta_i = X_i - \mu_i,$ $1 \le i \le n,$  be an MDS such that $\Delta_i$  is conditionally (absolutely) subgaussian, i.e. $\displaystyle\log\mathbb{E} \left(e^{\lambda \Delta_i} |\Delta_1, ..., \Delta_{i-1}\right) \le \frac{\lambda^2 \sigma_i^2}{2},$

with constant $\sigma_i.$  Then, $\sum_{i=1}^n \Delta_i = \sum_{i=1}^n X_i - \mu_i$  is $\sum_{i=1}^n\sigma_i^2$-subgaussian, and $\displaystyle P\left(\sum_{i=1}^n X_i - \mu_i \ge t\right) \le \exp\left(-\frac{t^2}{2\sum_{i=1}^n\sigma_i^2}\right)$

Applying Hoeffding’s lemma to differences $\Delta_i$  conditioned on $(\Delta_1, ..., \Delta_{i-1}),$ $1 \le i \le n,$ we get the equivalent of Hoeffding’s inequality for the case when a sum of (dependent) bounded differences form a martingale:

Theorem  (Azuma-Hoeffding inequality)

Let $\Delta_i = X_i - \mu_i,$ $1 \le i \le n,$  be an MDS such that $\Delta_i$  is bounded: $\displaystyle a_i \le \Delta_i \le b_i \,\,\,\, a.s.$

Then, $\sum_{i=1}^n \Delta_i = \sum_{i=1}^n X_i - \mu_i$  is $\sum_{i=1}^n\sigma_i^2$-subgaussian, and $\displaystyle P\left(\sum_{i=1}^n X_i - \mu_i \ge t\right) \le \exp\left(-\frac{2 t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right)$

Proof of Azuma’s lemma.  Whereas we will not directly use that $\left(\Delta_i\right)$  is an MDS, it actually follows from the conditional subgaussian property by taking $\lambda \to 0,$  so anyway we consider only MDS. Now, $\displaystyle \mathbb{E} \left[ e^{\lambda\sum_{i = 1}^n \Delta_i}\right] = \mathbb{E} \left[ \mathbb{E}\left[e^{\lambda\sum_{i = 1}^n\Delta_i} \middle|\Delta_1, ..., \Delta_{n-1}\right] \right] = \mathbb{E} \left[ e^{\lambda \sum_{i = 1}^{n-1} \Delta_i} \cdot \mathbb{E} \middle[ e^{\lambda \Delta_n} \middle| \Delta_1, ..., \Delta_{n-1} \middle] \right]\\ \le\mathbb{E} \left[ e^{\lambda\sum_{i = 1}^{n-1} \Delta_i}\right] \cdot e^{\lambda^2 \sigma_n^2/2},$

where we used first the “tower property” of conditional expectation and then conditional subgaussianity. Iterating this, we bound the cumulant as needed and obtain the claim. ♦

In the post to follow, we conclude the ‘first tier’ of the course with McDiarmid’s inequality which controlles the deviations of a function with bounded differences.

Remarks

1.  Usually one deals with bounded MDS rather than general conditionally subgaussian ones, at least in the case of combinatorial problems (on graphs) and statistical learning theory. However, one may encounter another, simpler processes with dependent and conditionally subgaussian increments (generally, with respect to some filtration, i.e. a sequence of nested sigma-algebras), not necessarily bounded. In particular, consider two simple covariance-like processes: $\displaystyle \Delta_i = B_i \,\sum_{1 \le k \le \tau} B_{i-k}\,\,\,\,\,\,\,\,\,\,\,\,(2a),$

where $B$  is a sequence of i.i.d. Bernoulli random variables, and its “decoupled” Gaussian counterpart $\displaystyle \Delta_i = X_i \,\sum_{|k| \le \tau} X'_{i-k}\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,(2b),$

where $X, X'$ are two mutually independent i.i.d. Gaussian sequences. Processes like these are sometimes called  order 2 Gaussian chaos.  Note that in (2a), $\Delta_i$  is an MDS with respect to $\mathcal{F}(X_{  with $\sigma_i \le 1.$  Sequentially conditioning, we can upper-bound the exponential moments of the increments (with non-random $\sigma_i$!) and obtain the deviation bounds. If in (2a) there were $|k| \le \tau$  instead of $1 \le k \le \tau,$  we couldn’t properly condition, and hence should use the bounds for a quadratic form of a subgaussian (Bernoulli) random vector. We would also be in the same trouble  if in (2a) there were (non-bounded) Gaussians instead of Bernoulli’s — essentially because the exponential moments of the increments are not absolutely bounded — although for this case there is an alternative, as described here and in Proposition 34 of [Tao]. It seems that finally it amounts to paying additional log-term but I might have mistaken.

Meanwhile, (2b) is easily controlled conditioning on the whole $X'$  at once without any of these problems. By the way, intuition prompts that (2b) shouldn’t concentrate substantially better than its non-decoupled counterpart where $X'$  is substituted with $X.$ Whereas for this example it can be verified directly, for more complex ones there is a machinery called comparison inequalities.

2.  Just as we obtained the analogue of Hoeffding’s inequality for the case when the increments are conditionally (absolutely) subgaussian, we may obtain the analogue of Bernstein’s inequality for conditionally (absolutely) subexponential increments. 