## Thresholding for sparse signals. Square-Root Lasso

In this post we continue with soft thresholding. Our estimator $\widehat \theta$  is an optimal solution (unique due to strong convexity) to $\displaystyle \min_{\theta} \|\theta -y\|_2^2 + \lambda \|\theta\|_1,$

and the observation model is $y = \theta^* + z,$ $z \sim\sigma_n\,\mathcal{N}(0,I_n),$  note that we included the noise variance $\sigma_n := \frac{\sigma}{\sqrt{n}}$ in $z$.  We are interested in how the loss of $\widehat \theta$ compares with ‘the oracle’ $\theta^*$  the loss for which is zero. We will assume that the signal is sparse; specifically, its support is $S \subset [n]$  with $s:=|S|.$

Theorem 1

Let $\lambda = \sigma_n [\sqrt{2 \log n} + \sqrt{2 \log \alpha^{-1}}].$  Then, with probability at least $1-\alpha,$ $\displaystyle \|\theta^* - \widehat\theta\|_2^2 \le C s\, \sigma_n^2 \log \frac{n}{\alpha}$

for some absolute constant $C.$  In other words, we only pay logarithmic price  for not knowing the support.

I will give two proofs of this theorem: the first more straightforward, and the second one is less intuitive but is easier to generalize.

First proof.

Instead of expanding the error, let’s just look at the error $\|\widehat\theta-\theta^*\|_2^2$  componentwise. Since $\widehat\theta_i = y^{[\lambda]}_{i}:= \text{sign}(y_i) \left(|y_i|-\lambda\right)_+,$ $\displaystyle (\widehat\theta_i - \theta^*_i)^2 = \left[\text{sign}(y_i)(|y_i|-\lambda)_+ - \theta^*_i\right]^2.$

Pick $\lambda$  as in the premise of the theorem, so that with high probability $\|z\|_\infty \le \lambda.$  If that event takes place, for the components $i \notin S$  the error is zero, whereas for $i \in S$  it holds $|\widehat\theta_i - \theta^*_i| = |y^{[\lambda]}_i - \theta^*_i| \le |y^{[\lambda]}_i -y_i| + |y_i-\theta^*_i| \le \lambda + |z_i| \le 2\lambda.$

adding squared inequalities above and bounding $\sqrt{a} + \sqrt{b} \le \sqrt{2(a + b)},$  we arrive at the claim with $C = 8.$  ♦

Second proof.

We start with oracle feasibility, which, after rearranging the terms, results in $\displaystyle \frac{1}{2}\|\widehat\theta-\theta^*\|_2^2 \le \langle z, \widehat\theta-\theta^*\rangle-\lambda \|\widehat\theta\|_1+ \lambda \|\theta^*\|_1.$

Now introduce the error vector $\delta := \widehat\theta-\theta^*.$  Note that $\|\theta^*\|_1 - \|\widehat\theta\|_1 \le \|\delta_S\|_1 - \|\delta_{\overline{S}}\|_1,$  and we get $\displaystyle \frac{1}{2}\|\delta\|_2^2 \le \langle z, \delta_S\rangle + \langle z, \delta_{\overline{S}}\rangle + \lambda \|\delta_S\|_1 - \lambda \|\delta_{\overline{S}}\|_1\\ \quad\,\,\,\,\,\,\,\,\,\,\,\,\le \|z\|_\infty (\|\delta_S\|_1 +\|\delta_{\overline{S}}\|_1)+ \lambda \|\delta_S\|_1 - \lambda \|\delta_{\overline{S}}\|_1.$

We have $\|\delta_S\|_1 \le \sqrt{s}\|\delta_S\|_2 \le \sqrt{s}\|\delta\|_2.$  Picking $\lambda$  as recommended, we get rid of $\|\delta_{\overline{S}}\|_1,$ and arrive at $\displaystyle\displaystyle \|\delta\|_2^2 \le 4\sqrt{s}\lambda\|\delta\|_2,$

and we are done. ♦

Suppose that instead of the quadratic empirical loss $\|\theta-y\|_2^2-\|z\|_2^2$  we minimize some convex empirical loss $L(\theta) \ge 0,$  again penalized by the $\ell_1$-norm (the applications include Lasso with non-orthogonal matrix and Square-Root Lasso which we will discuss below). Then oracle feasibility gives $\displaystyle L(\widehat \theta) - L(\theta^*) \le \lambda(\|\delta_S\|_1-\|\delta_{\overline{S}}\|_1).$

By convexity the left-hand side is lower-bounded by $\langle\nabla L(\theta^*),\delta\rangle,$  and we arrive at $\displaystyle -\|\nabla L(\theta^*)\|_\infty (\|\delta_S\|_1+\|\delta_{\overline{S}}\|_1)\le \lambda(\|\delta_S\|_1-\|\delta_{\overline{S}}\|_1).$

The quantity $\nabla L(\theta^*)$  is called the score.  Note that if the loss is strongly convex at $\theta^*$ (for Lasso this holds e.g. if the design matrix is RIP), we can get the quadratic inequality taking just $\lambda > \|\nabla L(\theta^*)\|_\infty$  w.h.p. to kill $\|\delta_{\overline{S}}\|_1.$  But bare convexity is enough for the following trick. Take $\lambda$  such that $\lambda > c\|\nabla L(\theta^*)\|_\infty$  w.h.p. for some absolute constant $c > 1.$  Then, $\displaystyle \|\delta_{\overline{S}}\|_1 \le \overline{c} \|\delta_S\|_1, \quad \overline{c} = \frac{c+1}{c-1}.$

This lets us write the following bound: $\displaystyle L(\widehat \theta) - L(\theta^*) \lesssim c(1+\overline{c})\|\nabla L(\theta^*)\|_\infty\|\delta_S\|_1 \le 8\sqrt{s}\|\nabla L(\theta^*)\|_\infty\|\delta\|_2,$

since $\min\limits_{c > 1} c(1+\overline{c}) = 8$  is attained at $c = 2.$ This might be useful if one can link the left-hand side with $\|\delta\|_2$ in alternative way (not necessarily through strong convexity).

Consider, for example, $L(\theta) = \|y-\theta\|_2$  (this empirical loss is biased but it doesn’t matter since we compare it in two different points). That is, we run the Square-Root Lasso: $\min_\theta \|\theta-y\|_2 + \lambda \|\theta\|_1.$

Note that $\displaystyle L^2(\widehat\theta) - L^2(\theta^*) = \|\delta\|^2_2 - 2\langle z, \delta \rangle.$

Combining that with the bound above, we get $\displaystyle \|\delta\|_2^2 \left(1-C_1^2s\|\nabla L(\theta^*)\|_{\infty}^2\right) \le C_2 \sqrt{s} \|\delta\|_2 \left[\|z\|_\infty + \left\|\nabla L(\theta^*)\right\|_\infty L(\theta^*)\right].$

Note that one can bound $\|\nabla L(\theta^*)\|_{\infty} = \frac {\|z\|_\infty}{\| z \|_2} \lesssim \sqrt{\frac{\log n}{n}}$  and $\left\|\nabla L(\theta^*)\right\|_\infty L(\theta^*) = \|z\|_\infty,$  so we have the following theorem.

Theorem 2

Let $\lambda = O\left(\frac{\sqrt{2 \log n} + u_\alpha}{\sqrt{n} -u_\alpha}\right),$  where $u_\alpha = \sqrt{2 \log \alpha^{-1}},$  for the Square-Root Lasso. Then, with probability at least $1-\alpha,$ $\displaystyle \|\theta^* - \widehat\theta\|_2^2 \lesssim C s\, \sigma_n^2 \log \frac{n}{\alpha}$

as long as $s \log n \lesssim n.$

Note that the “optimal” $\lambda$  doesn’t depend on $\sigma_n,$  so we might even not know it at all when running the algorithm (and obtain the same convergence rate as for the usual Lasso)!  The price for this is a very mild condition on $s.$ 1. Royi Avital says: