Subexponential distributions I — Bernstein inequality and quadratic forms

In this post we move on from  subgaussian  distributions to another important class of distributions called  subexponential.  The simplest and most common example of such distributions is chi-square. As usual, we are interested in tail probability bounds for such distributions, and the way to obtain them will be Chernoff’s technique.

We start by noting that  \chi_1^2 = \xi^2,  where  \xi \sim \mathcal{N}(0,1).  Since we already know from one of the previous posts  that  P\left(|\xi| > t \right) \le \exp\left(-t^2/2\right),  the union bound gives for  X \sim \chi_1^2:

\displaystyle P \left(X \ge t \right) \le 2 e^{-t}.

Now, the observation that  X_n \sim \chi^2_n  is distributed as the sum of  n  i.i.d.  \chi_1^2  appeals to using a naive union bound:

\displaystyle P \left(X_n \ge t \right) \le 2n \exp\left(-\frac{t}{n}\right).

The typical value of  X_n  according to it is  O(n \log n)  which is completely wrong. Indeed, the independency of different terms in  X_n  was not used. In fact, for its intuitively ‘worst concentrated’ counterpart  Y_n \sim n \cdot \xi^2,  the naive bound still holds, but in fact,  Y_n  is  O(n).  The naive bound performs even worse for  X_n,  since X_n / n, being the empirical average of i.i.d. random variables with the unit expectation each, should concentrate as  1 + o(1)  w.h.p. (see post “Intuition behind concentration inequalities”), and therefore w.h.p.  X_n = n + o(n),  which is generally smaller than  O(n).

The looseness of the union bound is due to the fact that it does not exploit the independence of different terms inside  X_n.  The natural way to capture this independence is the so-called  Chernoff’s technique.  Instead of  X_n \sim \chi_n^2  we will consider the “generalized  \chi_n^2  ”:

\displaystyle Y \sim a_1 \xi_1^2 + ... + a_n \xi_n^2

with  \mu := \mathbb{E} Y = \mathbf{1}^T\mathbf{a},  not even requiring that \mathbf{a} \succeq 0.  We have

\displaystyle P \left(Y - \mu \ge t \right) \le \frac{\mathbb{E}\, e^{\lambda \left(Y - \mu\right)}}{e^{\lambda t}},

but, unlike the subgaussian case, this formula holds not for all  \lambda.  Indeed, for  \xi \sim \mathcal{N}(0,1),

\displaystyle \log \mathbb{E} \exp \left(\lambda a \xi^2\right) = \log \left( \frac{1}{\sqrt{1 - 2 a \lambda}} \right) = -\frac{1}{2} \log \left(1 - 2 a \lambda \right), \,\,\,\,\,\,\, 1-2a\lambda \ge 0.

Now we use the well-known bound

\displaystyle -\log (1 - u) \le u + \frac{u^2}{2(1-u)},        \displaystyle 0 \le u \le 1;

noticing (with a bit of simple algebra) that when  1-2|a|\lambda \ge 0,

\displaystyle \log \mathbb{E}\exp \left[ \lambda |a| (1- \xi^2) \right] \le \log \mathbb{E}\exp \left[ \lambda |a| (\xi^2 - 1) \right]

(in words: “the right tail of  \chi_1^2  is heavier than the right tail”), and hence

                        \displaystyle \label{1} \log \mathbb{E} \exp \left[ \lambda a (\xi^2 - 1) \right] \le \frac{a^2\lambda^2}{1 - 2 |a| \lambda},        \displaystyle 0 \le \lambda \le \frac{1}{2|a|}.            (1)

Note that

    \displaystyle\label{1'} \log \mathbb{E} \exp \left[ \lambda a (\xi^2 - 1) \right] \le 2 a^2\lambda^2,        \displaystyle 0 \le \lambda \le \frac{1}{4|a|}.            (1′)

Summing over  i,  we obtain

\displaystyle \label{2} \log \mathbb{E}\, \exp\left[\lambda \left(Y - \mu\right)\right] \le\frac{|\mathbf{a}|_2^2\, \lambda^2}{1 - 2 |\mathbf{a}|_\infty \lambda},        \displaystyle 0 \le \lambda \le \frac{1}{2|\mathbf{a}|_\infty}.            (2)

We make life a bit easier, noting, as in (1), that

                      \displaystyle \label{2'} \log \mathbb{E}\, \exp\left[\lambda \left(Y - \mu\right)\right] \le 2|\mathbf{a}|_2^2\, \lambda^2,        \displaystyle 0 \le \lambda \le \frac{1}{4|\mathbf{a}|_\infty};            (2′)

an import observation is that for the last bound (1′) is already sufficient.

Finally, minimizing in feasible  \lambda,  we obtain the tail bound:

\displaystyle P \left(Y - \mu \ge t \right) \le\exp\left( - \frac{t^2}{8 |\mathbf{a}|_2^2}\right) \, \bigvee \, \exp\left(-\frac{t}{8 |\mathbf{a}|_\infty}\right).            (3)

A convenient reparametrization of this comes with  b = 4 |\mathbf{a}|_\infty,  \sigma^2 = 4 |\mathbf{a}|_2^2.  In terms of these new parameters, we now introduce a new class of distributions which is closed under addition of independent random variables:

Definition  (Subexponential distribution)

A random variable X with \mathbb{E}X = \mu, is (\sigma^2, b)subexponential,  if

                      \displaystyle \log \mathbb{E} \exp \left[ \lambda (X - \mu) \right] \le \frac{\lambda^2\sigma^2}{2},        \displaystyle 0 \le \lambda \le \frac{1}{b}.

For example, \chi_1^2 is subexponential (4, 4). Note also that \sigma^2-subgaussian is (\sigma^2, 0)-subexponential (and vice versa). For subexponential distributions we have, restating (3),

Theorem 1

For  X  (\sigma^2, b)-subexponential,

\displaystyle P\left(X - \mu > t\right) \le\exp\left( - \frac{t^2}{2 \sigma^2}\right) \, \bigvee \, \exp\left(-\frac{t}{2 b}\right) \le \exp \left(-\frac{t^2}{2\sigma^2 + 2bt}\right).

The bounds of this form — quadratic-over-linear in the exponent — are sometimes called Bernstein-type inequalities. We will examine this basic inequality more thoroughly in the next post. Now we return to the behaviour of the sum of independent subexponentials. Recall (3): a sum of independent  (\sigma_i^2, b_i)-subexponentials is subexponential with  \sigma^2 =\sum_i \sigma_i^2,  b = \max_i b_i,  and we get

Theorem 2  (Bernstein’s inequality)

For  X_i  be independent  (\sigma^2_i, b_i)-subexponentials,

\displaystyle P\left(\sum_{i=1}^n X_i - \mu_i > t\right) \le\exp \left(-\frac{t^2}{2\sum_{i=1}^n \sigma^2_i + 2 |b|_\infty t}\right).

The ‘crude’ message of this bound is that when we add up  n  i.i.d.  (\sigma^2, b)-subexponentials, the mean  \mu = \sum_i \mu_i  is  O(n),  whereas  X-\mu  is w.h.p.  O(\sqrt{n}) — at a first glance we have the same concentration as for the subgaussian case. However, the devil is in the detail: if we added up  \sigma^2_i-subgaussians, there wouldn’t be the additional linear term  2 |b|_\infty\, t  in the denominator. For the moderate deviations  0 \le t \le \sigma^2/b,  the so-called near zone, the linear term is small, and we have subgaussian tails, as if  b = 0.  But in the far zone, t \ge \sigma^2/b,  the tails are “a bit heavier”, corresponding to the exponential distribution with rate parameter  \lambda = 1 / 2b.  We will discuss this subtlety in detail in the next post.

We finish our ride for today by stating the deviation bound for a quadratic form of a standard Gaussian vector. Let  \xi  be a standard Gaussian vector of some dimension, possibly infinite, and  \mathbf{Q}(\xi) := \xi^T Q \xi  a quadratic form of  \xi  with symmetric matrix  Q  (we don’t require that Q \succeq 0).  Then, due to the rotational invariance of the Gaussian distribution,  Q(\xi)  is distributed the same as the “generalized chi-square” with  a_i=\lambda_i,  the eigenvalues of  Q.  Introducing Schatten p-norms  \|Q\|_p := \|\lambda\|_p  and coming back to (3), we have

Theorem 3  (Deviations of a quadratic form)

\displaystyle P\left(\mathbf{Q}(\xi) - \text{Tr}(Q) \ge t\right) \le\exp\left(-\frac{t^2}{8\|Q\|_2^2 + 8 \| Q\|_\infty t}\right).


\displaystyle P\left(\mathbf{Q}(\xi) - \text{Tr}(Q) \ge \sqrt{8\|Q\|_2^2 \cdot \textup{u}} +8 \| Q\|_\infty \cdot \textup{u}\right) \le\exp\left(-\textup{u}\right).

In particular,  for  X=\chi_n^2,

P\left(X - n \ge \sqrt{8n \, \textup{u}} + 8\textup{u}\right) \le\exp\left(-\textup{u}\right).

The message of the last inequality is that with high probability  X = n + O(\sqrt{n}).

About Dmitry Ostrovsky

PhD student with interests in statistics, optimization, and machine learning.
This entry was posted in Course on Concentration Inequalities. Bookmark the permalink.

4 Responses to Subexponential distributions I — Bernstein inequality and quadratic forms

  1. Pingback: Subexponential distributions, II — Regimes and Bernstein vs Hoeffding | Look at the corners!

  2. Pingback: Azuma-Hoeffding inequality | Look at the corners!

  3. csabaszepesvari says:

    At the very beginning why do you need the union bound? You could simply calculate: $P(X \ge t) = P( \xi^2 \ge t ) = P( |\xi| \ge \sqrt{t}) \le \exp( -(\sqrt{t})^2/2 ) = \exp(-t)$.

  4. Csaba, you’re right of course, and thank you for the feedback.
    The union bound here is for the “methodological purpose”, as seen by the next display (passing from \Chi_1^2 to \Chi_n^2). Basically, the whole first paragraph is just to introduce the idea that one can use the union bound to control the sum of i.i.d. r.v.’s. Of course, union bounding alone is not enough, one needs Chernoff on top, as shown in the remaining part of the post.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s