Thresholding for sparse signals. Square-Root Lasso

In this post we continue with soft thresholding. Our estimator \widehat \theta  is the optimal solution (unique due to strong convexity) of

\displaystyle \min_{\theta} \frac{1}{2}\|\theta -y\|_2^2 + \lambda \|\theta\|_1,

and the observation model is  y = \theta^* + z,  z \sim\sigma_n\,\mathcal{N}(0,I_n);  note that we absorbed the noise level  \sigma_n := \frac{\sigma}{\sqrt{n}}  into z.  We are interested in how the loss of  \widehat \theta  compares with that of ‘the oracle’  \theta^*,  whose loss is zero. We will assume that the signal is sparse; specifically, its support is  S \subset [n]  with  s:=|S|.
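Before stating the result, here is a minimal simulation sketch of this setup in plain Python (the function name and all parameter values are my own illustration). The estimator reduces to componentwise soft thresholding, the closed form used in the first proof below.

```python
import math
import random

def soft_threshold(vec, lam):
    """Componentwise soft thresholding: sign(v) * max(|v| - lam, 0)."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in vec]

# Observation model y = theta* + z with i.i.d. z_i ~ N(0, sigma_n^2).
random.seed(0)
n, s, sigma_n = 1000, 10, 0.05            # illustrative values, not from the post
theta_star = [1.0] * s + [0.0] * (n - s)  # sparse signal supported on first s coords
z = [random.gauss(0.0, sigma_n) for _ in range(n)]
y = [a + e for a, e in zip(theta_star, z)]
```

The support is placed on the first s coordinates purely for convenience; nothing below depends on its location.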

Theorem 1

Let \lambda = \sigma_n [\sqrt{2 \log n} + \sqrt{2 \log \alpha^{-1}}].  Then, with probability at least  1-\alpha,

\displaystyle \|\theta^* - \widehat\theta\|_2^2 \le C s\, \sigma_n^2 \log \frac{n}{\alpha}

for some absolute constant  C.  In other words, we only pay a logarithmic price for not knowing the support.

I will give two proofs of this theorem: the first is more straightforward, while the second is less intuitive but easier to generalize.

First proof.

Instead of expanding the error, let’s just look at the error  \|\widehat\theta-\theta^*\|_2^2  componentwise. Since  \widehat\theta_i = y^{[\lambda]}_{i}:= \text{sign}(y_i) \left(|y_i|-\lambda\right)_+,

\displaystyle (\widehat\theta_i - \theta^*_i)^2 = \left[\text{sign}(y_i)(|y_i|-\lambda)_+ - \theta^*_i\right]^2.

Pick  \lambda  as in the premise of the theorem, so that  \|z\|_\infty \le \lambda  with high probability: by the Gaussian tail bound and the union bound,  \mathbb{P}(\|z\|_\infty > \lambda) \le n e^{-\lambda^2/2\sigma_n^2} \le \alpha.  On that event, the error is zero for the components  i \notin S,  whereas for  i \in S  it holds

|\widehat\theta_i - \theta^*_i| = |y^{[\lambda]}_i - \theta^*_i| \le |y^{[\lambda]}_i -y_i| + |y_i-\theta^*_i| \le \lambda + |z_i| \le 2\lambda.

Adding the squared inequalities above over  i \in S  gives  \|\widehat\theta - \theta^*\|_2^2 \le 4s\lambda^2;  bounding  (\sqrt{a} + \sqrt{b})^2 \le 2(a + b)  with  a = 2\log n,  b = 2\log \alpha^{-1}  gives  \lambda^2 \le 4\sigma_n^2 \log \frac{n}{\alpha},  and we arrive at the claim with  C = 16.  ♦
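Note that once the event  \|z\|_\infty \le \lambda  occurs, the argument above is fully deterministic: zero error off the support and at most  (2\lambda)^2  per on-support coordinate. This can be checked directly in simulation (a sketch; parameter values are my own illustration):

```python
import math
import random

def soft_threshold(vec, lam):
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in vec]

random.seed(1)
n, s, sigma_n, alpha = 1000, 10, 0.05, 0.1
# The threshold from Theorem 1.
lam = sigma_n * (math.sqrt(2 * math.log(n)) + math.sqrt(2 * math.log(1 / alpha)))

theta_star = [1.0] * s + [0.0] * (n - s)
z = [random.gauss(0.0, sigma_n) for _ in range(n)]
y = [a + e for a, e in zip(theta_star, z)]

theta_hat = soft_threshold(y, lam)
err_sq = sum((a - b) ** 2 for a, b in zip(theta_hat, theta_star))

# On the event ||z||_inf <= lam, the bound ||theta_hat - theta*||^2 <= 4*s*lam^2
# holds deterministically (this is exactly the first proof).
if max(abs(e) for e in z) <= lam:
    assert all(theta_hat[i] == 0.0 for i in range(s, n))
    assert err_sq <= 4 * s * lam ** 2
```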

Second proof.

We start with oracle feasibility (the objective value at the minimizer  \widehat\theta  is at most its value at  \theta^*), which, after rearranging the terms, results in

\displaystyle \frac{1}{2}\|\widehat\theta-\theta^*\|_2^2 \le \langle z, \widehat\theta-\theta^*\rangle-\lambda \|\widehat\theta\|_1+ \lambda \|\theta^*\|_1.

Now introduce the error vector  \delta := \widehat\theta-\theta^*.  Note that  \|\theta^*\|_1 - \|\widehat\theta\|_1 \le \|\delta_S\|_1 - \|\delta_{\overline{S}}\|_1,  and we get

\displaystyle \frac{1}{2}\|\delta\|_2^2 \le \langle z, \delta_S\rangle + \langle z, \delta_{\overline{S}}\rangle + \lambda \|\delta_S\|_1 - \lambda \|\delta_{\overline{S}}\|_1 \le \|z\|_\infty (\|\delta_S\|_1 +\|\delta_{\overline{S}}\|_1)+ \lambda \|\delta_S\|_1 - \lambda \|\delta_{\overline{S}}\|_1.

We have  \|\delta_S\|_1 \le \sqrt{s}\|\delta_S\|_2 \le \sqrt{s}\|\delta\|_2.  Picking  \lambda  as recommended, so that  \|z\|_\infty \le \lambda  with high probability, we get rid of  \|\delta_{\overline{S}}\|_1  and arrive at

\displaystyle \|\delta\|_2^2 \le 4\sqrt{s}\lambda\|\delta\|_2,

that is,  \|\delta\|_2 \le 4\sqrt{s}\lambda,  and we are done. ♦

Suppose that instead of the quadratic empirical loss  \frac{1}{2}\left(\|\theta-y\|_2^2-\|z\|_2^2\right)  we minimize some convex empirical loss  L(\theta) \ge 0,  again penalized by the  \ell_1-norm (applications include the Lasso with a non-orthogonal design matrix and the Square-Root Lasso, which we discuss below). Then oracle feasibility gives

\displaystyle L(\widehat \theta) - L(\theta^*) \le \lambda(\|\delta_S\|_1-\|\delta_{\overline{S}}\|_1).

By convexity the left-hand side is lower-bounded by  \langle\nabla L(\theta^*),\delta\rangle,  and we arrive at

\displaystyle -\|\nabla L(\theta^*)\|_\infty (\|\delta_S\|_1+\|\delta_{\overline{S}}\|_1)\le \lambda(\|\delta_S\|_1-\|\delta_{\overline{S}}\|_1).

The quantity  \nabla L(\theta^*)  is called the score.  Note that if the loss is strongly convex at  \theta^*  (for the Lasso this holds, e.g., when the design matrix satisfies the RIP), we can get a quadratic inequality by taking just  \lambda > \|\nabla L(\theta^*)\|_\infty  w.h.p. to kill  \|\delta_{\overline{S}}\|_1.  But bare convexity is enough for the following trick. Take  \lambda  such that  \lambda > c\|\nabla L(\theta^*)\|_\infty  w.h.p. for some absolute constant  c > 1.  Then,

\displaystyle \|\delta_{\overline{S}}\|_1 \le \overline{c} \|\delta_S\|_1, \quad \overline{c} = \frac{c+1}{c-1}.

This lets us write the following bound:

\displaystyle L(\widehat \theta) - L(\theta^*) \lesssim c(1+\overline{c})\|\nabla L(\theta^*)\|_\infty\|\delta_S\|_1 \le 8\sqrt{s}\|\nabla L(\theta^*)\|_\infty\|\delta\|_2,

since  \min\limits_{c > 1} c(1+\overline{c}) = \min\limits_{c > 1} \frac{2c^2}{c-1} = 8  is attained at  c = 2.  This might be useful if one can link the left-hand side with  \|\delta\|_2  in an alternative way (not necessarily through strong convexity).
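As a sanity check on the constant:  c(1+\overline{c})  simplifies to  \frac{2c^2}{c-1},  and a crude grid search recovers the minimizer (a throwaway snippet; the grid resolution is arbitrary):

```python
# The prefactor c*(1 + (c+1)/(c-1)) equals 2*c**2/(c-1) for c > 1;
# its minimum over c > 1 is 8, attained at c = 2.
def prefactor(c):
    return c * (1 + (c + 1) / (c - 1))

grid = [1.0 + 0.001 * k for k in range(1, 9001)]  # c ranging over (1, 10]
c_best = min(grid, key=prefactor)
```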

Consider, for example, L(\theta) = \|y-\theta\|_2  (this empirical loss is biased, but that doesn’t matter since we only compare it at two different points). That is, we run the Square-Root Lasso:

\min_\theta \|\theta-y\|_2 + \lambda \|\theta\|_1.

Note that

\displaystyle L^2(\widehat\theta) - L^2(\theta^*) = \|\delta\|^2_2 - 2\langle z, \delta \rangle.

Combining that with the bound above, we get

\displaystyle \|\delta\|_2^2 \left(1-C_1^2s\|\nabla L(\theta^*)\|_{\infty}^2\right) \le C_2 \sqrt{s} \|\delta\|_2 \left[\|z\|_\infty + \left\|\nabla L(\theta^*)\right\|_\infty L(\theta^*)\right].

Note that  \nabla L(\theta^*) = -z/\|z\|_2,  so one can bound  \|\nabla L(\theta^*)\|_{\infty} = \frac {\|z\|_\infty}{\| z \|_2} \lesssim \sqrt{\frac{\log n}{n}}  and  \left\|\nabla L(\theta^*)\right\|_\infty L(\theta^*) = \|z\|_\infty,  which gives the following theorem.

Theorem 2

Let \lambda = O\left(\frac{\sqrt{2 \log n} + u_\alpha}{\sqrt{n} -u_\alpha}\right),  where  u_\alpha = \sqrt{2 \log \alpha^{-1}},  for the Square-Root Lasso. Then, with probability at least  1-\alpha,

\displaystyle \|\theta^* - \widehat\theta\|_2^2 \lesssim s\, \sigma_n^2 \log \frac{n}{\alpha}

as long as  s \log n \lesssim n.

Note that the “optimal”  \lambda  doesn’t depend on  \sigma_n,  so we need not know it at all when running the algorithm (and we obtain the same convergence rate as for the usual Lasso)!  The price for this is a very mild condition on  s.
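To illustrate the pivotal choice of  \lambda,  here is a sketch of the Square-Root Lasso in this identity-design setting. The stationarity condition gives  \widehat\theta = y^{[\lambda\|\widehat\theta-y\|_2]},  so one can iterate on the effective threshold  t = \lambda\|\widehat\theta - y\|_2  (this fixed-point scheme and all parameter values are my own illustration, not from the post):

```python
import math
import random

def soft_threshold(vec, thr):
    return [math.copysign(max(abs(v) - thr, 0.0), v) for v in vec]

def sqrt_lasso_identity(y, lam, iters=200):
    """Fixed-point sketch for min_theta ||theta - y||_2 + lam * ||theta||_1.
    Stationarity implies theta = soft_threshold(y, lam * ||theta - y||_2);
    since ||soft_threshold(y, t) - y||_2^2 = sum_i min(y_i^2, t^2), we can
    iterate directly on the effective threshold t."""
    t = lam * math.sqrt(sum(v * v for v in y))  # start from theta = 0
    for _ in range(iters):
        t = lam * math.sqrt(sum(min(v * v, t * t) for v in y))
    return soft_threshold(y, t), t

random.seed(2)
n, s, sigma_n = 1000, 10, 0.05
lam = math.sqrt(2 * math.log(n) / n)  # pivotal: does not involve sigma_n
theta_star = [1.0] * s + [0.0] * (n - s)
y = [a + random.gauss(0.0, sigma_n) for a in theta_star]
theta_hat, t_fix = sqrt_lasso_identity(y, lam)
```

At convergence,  t_fix = \lambda\|\widehat\theta - y\|_2  holds up to machine precision, which can be verified directly on the output.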


About Dmitry Ostrovsky

PhD student with interests in learning theory, statistics, and optimization, working with Anatoli Juditsky and Zaid Harchaoui.
This entry was posted in Sparsity, Statistics, Uncategorized.
