## Subexponential distributions I — Bernstein inequality and quadratic forms

In this post we move on from *subgaussian* distributions to another important class of distributions called *subexponential*. The simplest and most common example of such a distribution is the chi-square. As usual, we are interested in tail probability bounds for such distributions, and the way to obtain them will be Chernoff’s technique.

We start by noting that $\chi_1^2 = \xi^2,$ where $\xi \sim \mathcal{N}(0,1).$ Since we already know from one of the previous posts that $P\left(|\xi| > t \right) \le \exp\left(-t^2/2\right),$ and since $\{X \ge t\} = \{|\xi| \ge \sqrt{t}\},$ we immediately get for $X \sim \chi_1^2$:

$\displaystyle P \left(X \ge t \right) \le e^{-t/2}.$
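As a quick numerical sanity check (an illustrative sketch, not part of the argument), note that this tail is available in closed form via the complementary error function: $P(X \ge t) = P(|\xi| \ge \sqrt{t}) = \operatorname{erfc}(\sqrt{t/2}).$

```python
import math

# Exact chi-square(1) tail: P(xi^2 >= t) = P(|xi| >= sqrt(t)) = erfc(sqrt(t/2)).
def chi2_1_tail(t: float) -> float:
    return math.erfc(math.sqrt(t / 2))

# Check the subgaussian-derived bound exp(-t/2) on a grid of t values.
for t in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 25.0]:
    assert chi2_1_tail(t) <= math.exp(-t / 2)
```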

Now, the observation that $X_n \sim \chi^2_n$ is distributed as the sum of $n$ i.i.d. $\chi_1^2$ variables suggests a naive union bound: since $\{X_n \ge t\} \subseteq \bigcup_i \{\xi_i^2 \ge t/n\},$

$\displaystyle P \left(X_n \ge t \right) \le n \exp\left(-\frac{t}{2n}\right).$

According to this bound, the typical value of $X_n$ is $O(n \log n),$ which is completely wrong: the independence of the different terms in $X_n$ was never used. Indeed, for the intuitively ‘worst concentrated’ counterpart $Y_n \sim n \cdot \xi^2,$ in which all terms are fully dependent, the same naive bound holds, yet $Y_n = O(n)$ w.h.p. For $X_n$ the naive bound is even looser: $X_n / n$, being the empirical average of i.i.d. random variables with unit expectation, concentrates as $1 + o(1)$ w.h.p. (see the post “Intuition behind concentration inequalities”), so w.h.p. $X_n = n + o(n),$ with fluctuations of order $o(n)$ rather than $O(n).$
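A small Monte Carlo experiment (a sketch using only the standard library; the constants are illustrative) shows the actual scale of the fluctuations:

```python
import math
import random

random.seed(0)

def sample_chi2(n: int) -> float:
    # X_n ~ chi^2_n as a sum of n i.i.d. squared standard Gaussians.
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))

n = 2000
x = sample_chi2(n)
# X_n / n concentrates near 1: typical fluctuations are of order sqrt(2/n),
# nowhere near the log(n) scale suggested by the naive union bound.
assert abs(x / n - 1.0) < 5 * math.sqrt(2 / n)
```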

The looseness of the union bound comes from the fact that it does not exploit the independence of the terms inside $X_n.$ The natural way to capture this independence is Chernoff’s technique. Instead of $X_n \sim \chi_n^2$ we will consider the “generalized $\chi_n^2$”:

$\displaystyle Y \sim a_1 \xi_1^2 + ... + a_n \xi_n^2$

with  $\mu := \mathbb{E} Y = \mathbf{1}^T\mathbf{a},$  not even requiring that $\mathbf{a} \succeq 0.$  We have

$\displaystyle P \left(Y - \mu \ge t \right) \le \frac{\mathbb{E}\, e^{\lambda \left(Y - \mu\right)}}{e^{\lambda t}},$

but, unlike in the subgaussian case, this formula is useful not for all $\lambda \ge 0$: the moment generating function may be infinite. Indeed, for $\xi \sim \mathcal{N}(0,1),$

$\displaystyle \log \mathbb{E} \exp \left(\lambda a \xi^2\right) = \log \left( \frac{1}{\sqrt{1 - 2 a \lambda}} \right) = -\frac{1}{2} \log \left(1 - 2 a \lambda \right), \,\,\,\,\,\,\, 1-2a\lambda > 0.$
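For completeness, this closed form comes from a standard Gaussian integral: writing $s = a\lambda,$

$\displaystyle \mathbb{E}\, e^{s \xi^2} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{s x^2 - x^2/2}\, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(1 - 2s)\, x^2/2}\, dx = \frac{1}{\sqrt{1 - 2s}}, \qquad s < \frac{1}{2},$

while for $s \ge 1/2$ the integral diverges.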

Now we use the well-known bound

$\displaystyle -\log (1 - u) \le u + \frac{u^2}{2(1-u)},$        $\displaystyle 0 \le u < 1;$
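This elementary inequality is easy to verify numerically (a quick grid check, for illustration only):

```python
import math

# Grid check of -log(1 - u) <= u + u^2 / (2 * (1 - u)) on [0, 1).
for k in range(0, 99):
    u = k / 100
    lhs = -math.log(1.0 - u)
    rhs = u + u * u / (2.0 * (1.0 - u))
    assert lhs <= rhs + 1e-12
```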

noticing (with a bit of simple algebra) that when  $1-2|a|\lambda > 0,$

$\displaystyle \log \mathbb{E}\exp \left[ \lambda |a| (1- \xi^2) \right] \le \log \mathbb{E}\exp \left[ \lambda |a| (\xi^2 - 1) \right]$

(in words: “the left tail of  $\chi_1^2$  is lighter than the right one”), and hence

$\displaystyle \log \mathbb{E} \exp \left[ \lambda a (\xi^2 - 1) \right] \le \frac{a^2\lambda^2}{1 - 2 |a| \lambda},$        $\displaystyle 0 \le \lambda < \frac{1}{2|a|}.$            (1)

Note that, since $1 - 2|a|\lambda \ge \frac{1}{2}$ whenever $\lambda \le \frac{1}{4|a|},$ (1) implies the cruder but simpler bound

$\displaystyle \log \mathbb{E} \exp \left[ \lambda a (\xi^2 - 1) \right] \le 2 a^2\lambda^2,$        $\displaystyle 0 \le \lambda \le \frac{1}{4|a|}.$            (1′)
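Both (1) and (1′) can be checked numerically. The sketch below (illustrative constants, stdlib only) evaluates the exact log-moment generating function $-\lambda a - \frac{1}{2}\log(1 - 2a\lambda)$ against the two bounds on a grid:

```python
import math

def log_mgf(a: float, lam: float) -> float:
    # log E exp[lam * a * (xi^2 - 1)] = -lam*a - 0.5 * log(1 - 2*a*lam).
    return -lam * a - 0.5 * math.log(1.0 - 2.0 * a * lam)

for a in (-2.0, -0.5, 0.3, 1.0):
    lam_max = 1.0 / (2.0 * abs(a))
    for k in range(1, 100):
        lam = 0.99 * lam_max * k / 100  # stay strictly inside [0, 1/(2|a|))
        bound1 = a * a * lam * lam / (1.0 - 2.0 * abs(a) * lam)
        assert log_mgf(a, lam) <= bound1 + 1e-12                  # inequality (1)
        if lam <= 1.0 / (4.0 * abs(a)):
            assert log_mgf(a, lam) <= 2.0 * a * a * lam * lam + 1e-12  # (1')
```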

Summing the bounds (1) over $i$ and using $|a_i| \le |\mathbf{a}|_\infty,$ we obtain

$\displaystyle \log \mathbb{E}\, \exp\left[\lambda \left(Y - \mu\right)\right] \le\frac{|\mathbf{a}|_2^2\, \lambda^2}{1 - 2 |\mathbf{a}|_\infty \lambda},$        $\displaystyle 0 \le \lambda < \frac{1}{2|\mathbf{a}|_\infty}.$            (2)

We make life a bit easier by passing, exactly as from (1) to (1′), to

$\displaystyle \log \mathbb{E}\, \exp\left[\lambda \left(Y - \mu\right)\right] \le 2|\mathbf{a}|_2^2\, \lambda^2,$        $\displaystyle 0 \le \lambda \le \frac{1}{4|\mathbf{a}|_\infty};$            (2′)

an important observation is that for this last bound, (1′) alone is already sufficient.

Finally, minimizing over feasible  $\lambda,$  we obtain the tail bound:

$\displaystyle P \left(Y - \mu \ge t \right) \le\exp\left( - \frac{t^2}{8 |\mathbf{a}|_2^2}\right) \, \bigvee \, \exp\left(-\frac{t}{8 |\mathbf{a}|_\infty}\right).$            (3)
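(The minimization behind (3) is the usual two-zone computation: by (2′) and Chernoff, for $0 \le \lambda \le \frac{1}{4|\mathbf{a}|_\infty},$

$\displaystyle P\left(Y - \mu \ge t\right) \le \exp\left(-\lambda t + 2 |\mathbf{a}|_2^2\, \lambda^2\right).$

The unconstrained minimizer $\lambda^* = \frac{t}{4|\mathbf{a}|_2^2}$ is feasible when $t \le \frac{|\mathbf{a}|_2^2}{|\mathbf{a}|_\infty},$ giving the exponent $-\frac{t^2}{8|\mathbf{a}|_2^2}$; otherwise we take $\lambda = \frac{1}{4|\mathbf{a}|_\infty},$ whose exponent $-\frac{t}{4|\mathbf{a}|_\infty} + \frac{|\mathbf{a}|_2^2}{8|\mathbf{a}|_\infty^2}$ is at most $-\frac{t}{8|\mathbf{a}|_\infty}$ in that zone.)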

A convenient reparametrization of this comes with $b = 4 |\mathbf{a}|_\infty$ and $\sigma^2 = 4 |\mathbf{a}|_2^2.$ In terms of these new parameters, we now introduce a class of distributions which is closed under addition of independent random variables:

Definition  (Subexponential distribution)

A random variable $X$ with $\mathbb{E}X = \mu$ is $(\sigma^2, b)$-subexponential if

$\displaystyle \log \mathbb{E} \exp \left[ \lambda (X - \mu) \right] \le \frac{\lambda^2\sigma^2}{2},$        $\displaystyle 0 \le \lambda \le \frac{1}{b}.$

For example, $\chi_1^2$ is $(4, 4)$-subexponential. Note also that a $\sigma^2$-subgaussian random variable is $(\sigma^2, 0)$-subexponential (and vice versa), with the convention $1/0 = +\infty.$ For subexponential distributions we have, restating (3),

Theorem 1

For a $(\sigma^2, b)$-subexponential $X,$

$\displaystyle P\left(X - \mu > t\right) \le\exp\left( - \frac{t^2}{2 \sigma^2}\right) \, \bigvee \, \exp\left(-\frac{t}{2 b}\right) \le \exp \left(-\frac{t^2}{2\sigma^2 + 2bt}\right).$
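The second inequality in Theorem 1 is a simple algebraic fact; here is a quick grid check (illustrative parameters, for sanity only):

```python
import math

# Grid check: max(exp(-t^2/(2*s2)), exp(-t/(2*b))) <= exp(-t^2/(2*s2 + 2*b*t)).
s2, b = 4.0, 4.0  # e.g. the chi-square(1) parameters
for k in range(1, 200):
    t = 0.25 * k
    lhs = max(math.exp(-t * t / (2 * s2)), math.exp(-t / (2 * b)))
    rhs = math.exp(-t * t / (2 * s2 + 2 * b * t))
    assert lhs <= rhs + 1e-15
```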

Bounds of this form, quadratic-over-linear in the exponent, are sometimes called Bernstein-type inequalities. We will examine this basic inequality more thoroughly in the next post. Now we return to the behaviour of a sum of independent subexponentials. As in the derivation of (2), the bounds on the log-moment generating functions simply add up: a sum of independent $(\sigma_i^2, b_i)$-subexponentials is subexponential with $\sigma^2 =\sum_i \sigma_i^2,$ $b = \max_i b_i,$ and we get

Theorem 2  (Bernstein’s inequality)

Let $X_i,$ $i = 1, \dots, n,$ be independent $(\sigma^2_i, b_i)$-subexponentials with means $\mu_i.$ Then, with $|b|_\infty := \max_i b_i,$

$\displaystyle P\left(\sum_{i=1}^n \left(X_i - \mu_i\right) > t\right) \le\exp \left(-\frac{t^2}{2\sum_{i=1}^n \sigma^2_i + 2 |b|_\infty t}\right).$
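In code, the bound is a one-liner. The sketch below (function name and parameters are mine, not from the post) evaluates it for a sum of i.i.d. $(4, 4)$-subexponentials, e.g. $\chi_1^2$ terms:

```python
import math

def bernstein_bound(t: float, sigma2s: list, bs: list) -> float:
    # Bernstein tail bound exp(-t^2 / (2*sum(sigma_i^2) + 2*max(b_i)*t))
    # for a sum of independent (sigma_i^2, b_i)-subexponentials.
    return math.exp(-t * t / (2.0 * sum(sigma2s) + 2.0 * max(bs) * t))

# Sum of n = 100 i.i.d. chi-square(1) variables, each (4, 4)-subexponential:
n = 100
bound = bernstein_bound(60.0, [4.0] * n, [4.0] * n)
```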

The ‘crude’ message of this bound is that when we add up $n$ i.i.d. $(\sigma^2, b)$-subexponentials, the mean $\mu = \sum_i \mu_i$ is $O(n),$ whereas the deviation $\sum_i X_i - \mu$ is w.h.p. $O(\sqrt{n}),$ so at first glance we have the same concentration as in the subgaussian case. However, the devil is in the details: if we added up $\sigma^2_i$-subgaussians, there would be no additional linear term $2 |b|_\infty\, t$ in the denominator. For moderate deviations $0 \le t \le \sigma^2/b,$ the so-called near zone, the linear term is small, and we have subgaussian tails, as if $b = 0.$ But in the far zone $t \ge \sigma^2/b,$ the tails are “a bit heavier”, matching an exponential distribution with rate parameter $1/(2b).$ We will discuss this subtlety in detail in the next post.

We finish our ride for today by stating the deviation bound for a quadratic form of a standard Gaussian vector. Let $\xi$ be a standard Gaussian vector of some dimension, possibly infinite, and let $\mathbf{Q}(\xi) := \xi^T Q \xi$ be a quadratic form of $\xi$ with a symmetric matrix $Q$ (we do not require that $Q \succeq 0$). Then, due to the rotational invariance of the Gaussian distribution, $\mathbf{Q}(\xi)$ has the same distribution as the “generalized chi-square” with $a_i = \lambda_i,$ the eigenvalues of $Q.$ Introducing the Schatten $p$-norms $\|Q\|_p := \|\lambda\|_p$ and coming back to (3), we have

Theorem 3  (Deviations of a quadratic form)

$\displaystyle P\left(\mathbf{Q}(\xi) - \text{Tr}(Q) \ge t\right) \le\exp\left(-\frac{t^2}{8\|Q\|_2^2 + 8 \| Q\|_\infty t}\right).$

Alternatively,

$\displaystyle P\left(\mathbf{Q}(\xi) - \text{Tr}(Q) \ge \sqrt{8\|Q\|_2^2 \, u} + 8 \| Q\|_\infty \, u\right) \le\exp\left(-u\right).$

In particular, for $X \sim \chi_n^2$ (take $Q = I_n,$ so that $\|Q\|_2^2 = n$ and $\|Q\|_\infty = 1$),

$P\left(X - n \ge \sqrt{8n \, u} + 8u\right) \le\exp\left(-u\right).$
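A Monte Carlo sanity check of the last display (a sketch with illustrative parameters; the empirical frequency should land well below $e^{-u}$):

```python
import math
import random

random.seed(1)

# Monte Carlo check of P(X - n >= sqrt(8*n*u) + 8*u) <= exp(-u) for X ~ chi2(n).
n, u, trials = 10, 1.0, 20000
threshold = n + math.sqrt(8 * n * u) + 8 * u
exceed = sum(
    sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)) >= threshold
    for _ in range(trials)
)
assert exceed / trials <= math.exp(-u)
```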

The message of the last inequality is that with high probability  $X = n + O(\sqrt{n}).$
