Azuma-Hoeffding inequality

In this post we will further expand the setting in which concentration inequalities are stated. Remember that we started from the arithmetic mean

\displaystyle F(X) = \frac{1}{n}\sum_{i=1}^n X_i,

where all  X_i  were independent. We first showed that if the  X_i  are sufficiently “light-tailed” (subgaussian), then  F(X)  concentrates around its mean, with a remarkable special case when the  X_i  are bounded. We then obtained another result (concentration of quadratic forms) for Gaussian  X,  when  F(X)  is a quadratic form of a Gaussian vector (in fact, these results extend to the subgaussian case as well); again, we did not relax the independence of  X  at all. The goal of this post is to relax this assumption to some extent.

Recall Hoeffding's inequality:

Theorem  (Hoeffding’s inequality) 

For  X_1, \,\dotsc,\, X_n  independent, with  X_i \in [a_i, b_i]  and  \mu_i = \mathbb{E} X_i,  it holds that

\displaystyle P\left( \sum_{i = 1}^n (X_i - \mu_i) \ge t \right) \le \exp \left(-\frac{2 t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).
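As a quick sanity check, here is a small Monte Carlo sketch (the values of  n  and  t  are illustrative choices of mine, not from the post) comparing the empirical upper tail of a sum of i.i.d. uniform variables against the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 50, 200_000, 3.0

# X_i uniform on [0, 1]: a_i = 0, b_i = 1, mu_i = 1/2
X = rng.uniform(0.0, 1.0, size=(trials, n))
deviations = X.sum(axis=1) - n * 0.5

empirical = np.mean(deviations >= t)
hoeffding = np.exp(-2 * t**2 / n)  # sum_i (b_i - a_i)^2 = n here

print(empirical, "<=", hoeffding)
```

The bound is loose for uniform variables (it only uses boundedness, not the actual variance), so the empirical tail sits well below it.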

As we will see in a while, it is the boundedness of  X_i  that allows us to relax the independence of  X_i.  The proof idea was that, due to the independence of the  X_i,  the cumulant of the sum was the sum of the cumulants:

L(\lambda) :=\displaystyle \log \mathbb{E} \exp \left(\lambda\sum_{i = 1}^n (X_i - \mu_i) \right) = \sum_{i=1}^n \log \mathbb{E} \exp \left(\lambda (X_i - \mu_i) \right). \,\,\,\,\,\,\,\,\,\,\,\, (1)

After this we were done by proving that each individual  X_i  is subgaussian with variance proxy  \displaystyle \sigma_i^2 = (b_i - a_i)^2/4,  and this latter fact in no way depends on the relations between the  X_i.  Therefore, if we manage to prove something like (1) for a sequence of non-independent random variables, we will get an analogue of Hoeffding's inequality. As a side note, when some functional which in a way controls the 'scattering' of  F  exhibits behaviour similar to (1), we say that this functional  tensorizes.  Proving that tensorization holds for some functional (more often in the form of an inequality, and only for i.i.d.  X_i,  but for wider classes of functions  F)  is a usual route to concentration results. In particular, the most advanced results are obtained via the tensorization of entropy.

Now we come back to our setting, where  F(\cdot)  is simply a sum but the  X_i  may be dependent. We will prove not (1) itself but a weaker property. To begin, let us limit the 'degree' to which the  X_i  are dependent.


A filtration  \left(\mathcal{F}_i\right)_{i=0}^n  is a sequence of nested sigma-algebras,  \mathcal{F}_i \subseteq \mathcal{F}_j  whenever  i \le j,  with  \mathcal{F}_0 = \{\emptyset, \Omega\}  the trivial sigma-algebra.

The simplest (and most common) example of a filtration, which will be enough for us, is the one generated by a random sequence  \left(X_i\right)_{i=1}^n.  In this case  \mathcal{F}_i = \sigma(X_1,\, ...,\, X_i),  i.e. the sigma-algebra generated by the first  i  random variables.


Consider a random sequence  \left(\Delta_i\right)_{i=1}^n  and a filtration  \left(\mathcal{F}_i\right)_{i=0}^n  such that  \Delta_i  is measurable with respect to  \mathcal{F}_i.  \left(\Delta_i\right)_{i=1}^n  is called a  martingale difference sequence (MDS)  with respect to  \left(\mathcal{F}_i\right)_{i=0}^n  if for  1 \le i \le n

\mathbb{E}(\Delta_i| \mathcal{F}_{i-1}) = 0\,\,\,\, a.s.

Often  \mathcal{F}_i  is generated by  \Delta_1,\, ...,\, \Delta_i,  and then  \left(\Delta_i\right)_{i=1}^n  is simply called an MDS, without specifying the filtration. Sometimes the filtration is clear from the context, and then it is omitted as well.

I could also give the definition of a  martingale  here, but I will not, since I won't use it in this course. Note, however, that if  (Y_i)_{i=0}^n  is a martingale, then  \Delta_i = Y_i - Y_{i-1}  is an MDS, and vice versa.
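To make the definition concrete, here is a small Python sketch (the step-size rule is a toy choice of mine) of a dependent sequence that is nevertheless an MDS: each increment is a fresh random sign times a function of the past. The orthogonality  \mathbb{E}[\Delta_i \, g(\Delta_{<i})] = 0,  which the MDS property implies, is checked empirically:

```python
import numpy as np

rng = np.random.default_rng(1)
trials, n = 100_000, 4

# Fresh random signs, independent of everything before them
eps = rng.choice([-1.0, 1.0], size=(trials, n))

# Delta_i = eps_i * c(Delta_{i-1}): the magnitude depends on the past,
# but the sign is fresh, so E[Delta_i | Delta_1, ..., Delta_{i-1}] = 0
Delta = np.empty_like(eps)
Delta[:, 0] = eps[:, 0]
for i in range(1, n):
    scale = 1.0 + 0.5 * np.maximum(Delta[:, i - 1], 0.0)
    Delta[:, i] = eps[:, i] * scale

# The MDS property implies E[Delta_i * g(past)] = 0 for any bounded g;
# check this empirically with g = sign of the previous increment
corr = np.mean(Delta[:, 2] * np.sign(Delta[:, 1]))
print(abs(corr))  # should be close to 0
```

The increments are genuinely dependent (the magnitude of  \Delta_i  depends on  \Delta_{i-1}),  yet all the conditional means vanish.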

Azuma’s lemma

Let  \Delta_i = X_i - \mu_i,  1 \le i \le n,  be an MDS such that each  \Delta_i  is conditionally (absolutely) subgaussian, i.e.

\displaystyle\log\mathbb{E} \left(e^{\lambda \Delta_i} \,\middle|\,\Delta_1, ..., \Delta_{i-1}\right) \le \frac{\lambda^2 \sigma_i^2}{2} \,\,\,\, a.s.,

with non-random constants  \sigma_i.  Then  \sum_{i=1}^n \Delta_i = \sum_{i=1}^n (X_i - \mu_i)  is  \sum_{i=1}^n\sigma_i^2-subgaussian, and

\displaystyle P\left(\sum_{i=1}^n (X_i - \mu_i) \ge t\right) \le \exp\left(-\frac{t^2}{2\sum_{i=1}^n\sigma_i^2}\right).

Applying Hoeffding's lemma to the differences  \Delta_i  conditioned on  (\Delta_1, ..., \Delta_{i-1}),  1 \le i \le n,  we get the analogue of Hoeffding's inequality for the case when the (dependent) bounded differences form an MDS:

Theorem  (Azuma-Hoeffding inequality)

Let  \Delta_i = X_i - \mu_i,  1 \le i \le n,  be an MDS such that each  \Delta_i  is bounded:

\displaystyle a_i \le \Delta_i \le b_i \,\,\,\, a.s.

Then  \sum_{i=1}^n \Delta_i = \sum_{i=1}^n (X_i - \mu_i)  is  \sum_{i=1}^n (b_i - a_i)^2/4-subgaussian, and

\displaystyle P\left(\sum_{i=1}^n (X_i - \mu_i) \ge t\right) \le \exp\left(-\frac{2 t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).
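Here is a minimal Monte Carlo sketch of this bound on a dependent sequence (the "step size shrinks after an up-move" rule is my own toy example, not from the post); the increments are a bounded MDS, so the Azuma-Hoeffding tail should dominate the empirical one:

```python
import numpy as np

rng = np.random.default_rng(2)
trials, n, t = 200_000, 30, 8.0

eps = rng.choice([-1.0, 1.0], size=(trials, n))
Delta = np.empty_like(eps)
Delta[:, 0] = eps[:, 0]
for i in range(1, n):
    # the step size shrinks after an up-move: the increments are
    # dependent, but each Delta_i stays in [-1, 1] with conditional mean 0
    step = np.where(Delta[:, i - 1] > 0, 0.5, 1.0)
    Delta[:, i] = eps[:, i] * step

S = Delta.sum(axis=1)
empirical = np.mean(S >= t)
azuma = np.exp(-2 * t**2 / (4 * n))  # (b_i - a_i)^2 = 4 for every i

print(empirical, "<=", azuma)
```

Note that independence is never used: only boundedness and the vanishing conditional means.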

Proof of Azuma's lemma.  Although we will not directly use that  \left(\Delta_i\right)  is an MDS, this property actually follows from the conditional subgaussianity by letting  \lambda \to 0,  so we are in any case dealing only with MDSs. Now,

\displaystyle \mathbb{E} \left[ e^{\lambda\sum_{i = 1}^n \Delta_i}\right] = \mathbb{E} \left[ \mathbb{E}\left[e^{\lambda\sum_{i = 1}^n\Delta_i} \,\middle|\,\Delta_1, ..., \Delta_{n-1}\right] \right] = \mathbb{E} \left[ e^{\lambda \sum_{i = 1}^{n-1} \Delta_i} \cdot \mathbb{E} \left[ e^{\lambda \Delta_n} \,\middle|\, \Delta_1, ..., \Delta_{n-1} \right] \right]\\ \le\mathbb{E} \left[ e^{\lambda\sum_{i = 1}^{n-1} \Delta_i}\right] \cdot e^{\lambda^2 \sigma_n^2/2},

where we first used the “tower property” of conditional expectation and then the conditional subgaussianity. Iterating this, we bound the cumulant as needed and obtain the claim. ♦
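For completeness, the omitted last step (a standard Chernoff computation, spelled out here rather than in the original) reads:

```latex
% Iterating the one-step bound n times bounds the cumulant:
\log \mathbb{E}\, e^{\lambda \sum_{i=1}^n \Delta_i}
  \le \frac{\lambda^2}{2} \sum_{i=1}^n \sigma_i^2 .
% Chernoff bound, optimized at \lambda = t / \sum_{i=1}^n \sigma_i^2:
P\left(\sum_{i=1}^n \Delta_i \ge t\right)
  \le \inf_{\lambda > 0} \exp\left(-\lambda t + \frac{\lambda^2}{2} \sum_{i=1}^n \sigma_i^2\right)
  = \exp\left(-\frac{t^2}{2 \sum_{i=1}^n \sigma_i^2}\right).
```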

In the post to follow, we conclude the 'first tier' of the course with McDiarmid's inequality, which controls the deviations of a function with bounded differences.


1.  Usually one deals with bounded MDSs rather than general conditionally subgaussian ones, at least in combinatorial problems (on graphs) and in statistical learning theory. However, one may encounter other, simpler processes with dependent and conditionally subgaussian increments (in general, with respect to some filtration, i.e. a sequence of nested sigma-algebras), not necessarily bounded. In particular, consider two simple covariance-like processes:

\displaystyle \Delta_i = B_i \,\sum_{1 \le k \le \tau} B_{i-k}\,\,\,\,\,\,\,\,\,\,\,\,(2a),

where  B  is a sequence of i.i.d. symmetric Bernoulli (i.e.  \pm 1)  random variables, and its “decoupled” Gaussian counterpart

\displaystyle \Delta_i = X_i \,\sum_{|k| \le \tau} X'_{i-k}\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,(2b),

where X, X' are two mutually independent i.i.d. Gaussian sequences. Processes like these are sometimes called  order 2 (Gaussian) chaos.  Note that in (2a),  \Delta_i  is an MDS with respect to  \sigma(B_{<i}),  with  \sigma_i \le \tau.  Conditioning sequentially, we can upper-bound the exponential moments of the increments (with non-random  \sigma_i!)  and obtain deviation bounds. If in (2a) we had  |k| \le \tau  instead of  1 \le k \le \tau,  we could not condition properly, and would instead have to use the bounds for a quadratic form of a subgaussian (Bernoulli) random vector. We would be in the same trouble if in (2a) there were (unbounded) Gaussians instead of Bernoullis, essentially because the exponential moments of the increments are not absolutely bounded, although for this case there is an alternative, as described here and in Proposition 34 of [Tao]. It seems that this ultimately amounts to paying an additional logarithmic factor, but I may be mistaken.
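Under the random-sign reading of (2a), a quick simulation (with illustrative sizes for  n  and  \tau,  chosen by me) can check the resulting Azuma bound with the crude estimate  \sigma_i \le \tau:

```python
import numpy as np

rng = np.random.default_rng(3)
trials, n, tau, t = 20_000, 100, 5, 100.0

# i.i.d. random signs; the first tau columns serve as the "pre-history"
B = rng.choice([-1.0, 1.0], size=(trials, n + tau))

# S_i = sum_{1 <= k <= tau} B_{i-k} depends only on the past
S_past = np.zeros((trials, n))
for k in range(1, tau + 1):
    S_past += B[:, tau - k : tau - k + n]

Delta = B[:, tau : tau + n] * S_past  # Delta_i = B_i * S_i, an MDS
total = Delta.sum(axis=1)

empirical = np.mean(total >= t)
bound = np.exp(-t**2 / (2 * n * tau**2))  # Azuma with sigma_i <= tau

print(empirical, "<=", bound)
```

Conditionally on the past,  \Delta_i  is a random sign times the known quantity  S_i  with  |S_i| \le \tau,  which is exactly the conditional subgaussianity the lemma needs.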

Meanwhile, (2b) is easily controlled by conditioning on the whole of  X'  at once, without any of these problems. By the way, intuition suggests that (2b) shouldn't concentrate substantially better than its non-decoupled counterpart, in which  X'  is replaced with  X.  Whereas for this example this can be verified directly, for more complex ones there is a machinery of  comparison inequalities.

2.  Just as we obtained the analogue of Hoeffding's inequality for conditionally (absolutely) subgaussian increments, we may obtain an analogue of Bernstein's inequality for conditionally (absolutely) subexponential increments.
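A hedged sketch of what such a statement might look like (the exact constants vary across references and are not from the original post): assume each increment is conditionally subexponential with parameters  (\sigma_i^2, b),  i.e. the conditional cumulant bound holds only for small  \lambda.  The same sequential conditioning and a two-regime Chernoff optimization then give a Bernstein-type tail:

```latex
% Conditional subexponential assumption, for |\lambda| < 1/b:
\log \mathbb{E}\left(e^{\lambda \Delta_i} \,\middle|\, \mathcal{F}_{i-1}\right)
  \le \frac{\lambda^2 \sigma_i^2}{2} \quad \text{a.s.}
% Resulting Bernstein-type bound (constants may differ between sources):
P\left(\sum_{i=1}^n \Delta_i \ge t\right)
  \le \exp\left(-\frac{1}{2}\min\left\{\frac{t^2}{\sum_{i=1}^n \sigma_i^2},\; \frac{t}{b}\right\}\right).
```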

About Dmitry Ostrovsky

PhD student with interests in statistics, optimization, and machine learning.
This entry was posted in Course on Concentration Inequalities.

