Thresholding for sparse signals. Square-Root Lasso

In this post we continue with soft thresholding. Our estimator \widehat \theta  is the optimal solution (unique due to strong convexity) of

\displaystyle \min_{\theta} \frac{1}{2}\|\theta -y\|_2^2 + \lambda \|\theta\|_1,

and the observation model is  y = \theta^* + z,  z \sim\sigma_n\,\mathcal{N}(0,I_n);  note that we have absorbed the noise level  \sigma_n := \frac{\sigma}{\sqrt{n}}  into z.  We are interested in how the loss of  \widehat \theta  compares with that of ‘the oracle’  \theta^*,  whose loss is zero. We will assume that the signal is sparse; specifically, its support is  S \subset [n]  with  s:=|S|.
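Since the design here is the identity, the problem separates across coordinates, and the minimizer is the coordinatewise soft thresholding of y (the closed form appears in the first proof below). A minimal pure-Python sketch of the setup, with illustrative parameter values of my choosing:

```python
import math
import random

def soft_threshold(y, lam):
    """Coordinatewise soft thresholding: sign(y_i) * (|y_i| - lam)_+."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in y]

# Illustrative instance: s-sparse theta*, noise level sigma_n = sigma / sqrt(n).
random.seed(0)
n, s, sigma = 1000, 10, 1.0
sigma_n = sigma / math.sqrt(n)
theta_star = [1.0] * s + [0.0] * (n - s)      # support S = first s coordinates
y = [t + sigma_n * random.gauss(0.0, 1.0) for t in theta_star]

lam = sigma_n * math.sqrt(2 * math.log(n))    # threshold at the noise level
theta_hat = soft_threshold(y, lam)
```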

Theorem 1

Let \lambda = \sigma_n [\sqrt{2 \log n} + \sqrt{2 \log \alpha^{-1}}].  Then, with probability at least  1-\alpha,

\displaystyle \|\theta^* - \widehat\theta\|_2^2 \le C s\, \sigma_n^2 \log \frac{n}{\alpha}

for some absolute constant  C.  In other words, we only pay a logarithmic price for not knowing the support.

I will give two proofs of this theorem: the first is more straightforward, while the second is less intuitive but easier to generalize.

First proof.

Instead of expanding the error, let’s just look at the error  \|\widehat\theta-\theta^*\|_2^2  componentwise. Since  \widehat\theta_i = y^{[\lambda]}_{i}:= \text{sign}(y_i) \left(|y_i|-\lambda\right)_+,

\displaystyle (\widehat\theta_i - \theta^*_i)^2 = \left[\text{sign}(y_i)(|y_i|-\lambda)_+ - \theta^*_i\right]^2.

Pick  \lambda  as in the premise of the theorem, so that with high probability  \|z\|_\infty \le \lambda.  If that event takes place, for the components  i \notin S  the error is zero, whereas for  i \in S  it holds

|\widehat\theta_i - \theta^*_i| = |y^{[\lambda]}_i - \theta^*_i| \le |y^{[\lambda]}_i -y_i| + |y_i-\theta^*_i| \le \lambda + |z_i| \le 2\lambda.

Adding up the squared inequalities over  i \in S  and using  \sqrt{a} + \sqrt{b} \le \sqrt{2(a + b)}  to bound  \lambda \le 2\sigma_n\sqrt{\log(n/\alpha)},  we arrive at the claim with  C = 16.  ♦
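The proof is constructive enough to check numerically: whenever the event \|z\|_\infty \le \lambda occurs, the componentwise argument forces \|\widehat\theta - \theta^*\|_2^2 \le 4s\lambda^2 deterministically, with no exceptions. A quick simulation of this (parameter values are illustrative, not from the post):

```python
import math
import random

def soft_threshold(y, lam):
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in y]

random.seed(1)
n, s, sigma, alpha = 500, 5, 1.0, 0.1
sigma_n = sigma / math.sqrt(n)
# lambda as in the premise of Theorem 1
lam = sigma_n * (math.sqrt(2 * math.log(n)) + math.sqrt(2 * math.log(1 / alpha)))
theta_star = [3.0] * s + [0.0] * (n - s)

good_events, violations = 0, 0
for _ in range(200):
    z = [sigma_n * random.gauss(0.0, 1.0) for _ in range(n)]
    y = [t + v for t, v in zip(theta_star, z)]
    theta_hat = soft_threshold(y, lam)
    if max(abs(v) for v in z) <= lam:        # the high-probability event
        good_events += 1
        err2 = sum((h - t) ** 2 for h, t in zip(theta_hat, theta_star))
        # deterministic consequence of the proof: err^2 <= 4 s lambda^2
        violations += err2 > 4 * s * lam ** 2 + 1e-12
```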

Second proof.

We start with oracle feasibility, which, after rearranging the terms, results in

\displaystyle \frac{1}{2}\|\widehat\theta-\theta^*\|_2^2 \le \langle z, \widehat\theta-\theta^*\rangle-\lambda \|\widehat\theta\|_1+ \lambda \|\theta^*\|_1.

Now introduce the error vector  \delta := \widehat\theta-\theta^*.  Since  \theta^*  vanishes off  S,  we have  \|\theta^*\|_1 - \|\widehat\theta\|_1 = \sum_{i \in S}\left(|\theta^*_i| - |\widehat\theta_i|\right) - \|\delta_{\overline{S}}\|_1 \le \|\delta_S\|_1 - \|\delta_{\overline{S}}\|_1  by the triangle inequality, and we get

\displaystyle \frac{1}{2}\|\delta\|_2^2 \le \langle z, \delta_S\rangle + \langle z, \delta_{\overline{S}}\rangle + \lambda \|\delta_S\|_1 - \lambda \|\delta_{\overline{S}}\|_1 \le \|z\|_\infty (\|\delta_S\|_1 +\|\delta_{\overline{S}}\|_1)+ \lambda \|\delta_S\|_1 - \lambda \|\delta_{\overline{S}}\|_1.

We have \|\delta_S\|_1 \le \sqrt{s}\|\delta_S\|_2 \le \sqrt{s}\|\delta\|_2.  Picking  \lambda  as recommended, we get rid of  \|\delta_{\overline{S}}\|_1, and arrive at

\displaystyle \|\delta\|_2^2 \le 4\sqrt{s}\,\lambda\|\delta\|_2,

and we are done. ♦

Suppose that instead of the quadratic empirical loss  \frac{1}{2}\left(\|\theta-y\|_2^2-\|z\|_2^2\right)  we minimize some convex empirical loss  L(\theta) \ge 0,  again penalized by the  \ell_1-norm (applications include the Lasso with a non-orthogonal design matrix and the Square-Root Lasso, which we discuss below). Then oracle feasibility gives

\displaystyle L(\widehat \theta) - L(\theta^*) \le \lambda(\|\delta_S\|_1-\|\delta_{\overline{S}}\|_1).

By convexity the left-hand side is lower-bounded by  \langle\nabla L(\theta^*),\delta\rangle,  and we arrive at

\displaystyle -\|\nabla L(\theta^*)\|_\infty (\|\delta_S\|_1+\|\delta_{\overline{S}}\|_1)\le \lambda(\|\delta_S\|_1-\|\delta_{\overline{S}}\|_1).

The quantity  \nabla L(\theta^*)  is called the score.  Note that if the loss is strongly convex at  \theta^*  (for the Lasso this holds, e.g., when the design matrix satisfies the RIP), we can obtain a quadratic inequality by taking just  \lambda > \|\nabla L(\theta^*)\|_\infty  w.h.p. to kill  \|\delta_{\overline{S}}\|_1.  But bare convexity is enough for the following trick. Take  \lambda  such that  \lambda > c\|\nabla L(\theta^*)\|_\infty  w.h.p. for some absolute constant  c > 1.  Then, plugging  \|\nabla L(\theta^*)\|_\infty < \lambda/c  into the display above and rearranging,

\displaystyle \|\delta_{\overline{S}}\|_1 \le \overline{c} \|\delta_S\|_1, \quad \overline{c} = \frac{c+1}{c-1}.

This lets us write the following bound:

\displaystyle L(\widehat \theta) - L(\theta^*) \lesssim c(1+\overline{c})\|\nabla L(\theta^*)\|_\infty\|\delta_S\|_1 \le 8\sqrt{s}\|\nabla L(\theta^*)\|_\infty\|\delta\|_2,

since  \min\limits_{c > 1} c(1+\overline{c}) = 8  is attained at  c = 2.  This might be useful if one can link the left-hand side with  \|\delta\|_2  in an alternative way (not necessarily through strong convexity).
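To double-check the constant: c(1+\overline{c}) = c\left(1 + \frac{c+1}{c-1}\right) = \frac{2c^2}{c-1}, and since \frac{2c^2}{c-1} - 8 = \frac{2(c-2)^2}{c-1} \ge 0 for c > 1, the minimum is indeed 8, at c = 2. A grid search confirms this:

```python
# f(c) = c * (1 + (c + 1) / (c - 1)) = 2 c^2 / (c - 1), minimized over c > 1.
def f(c):
    return c * (1 + (c + 1) / (c - 1))

grid = [1.01 + 0.001 * k for k in range(5000)]  # c ranging over (1, 6]
c_best = min(grid, key=f)                        # should land near c = 2
```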

Consider, for example,  L(\theta) = \|y-\theta\|_2  (this empirical loss is biased, but that doesn’t matter since we only compare it at two different points). That is, we run the Square-Root Lasso:

\min_\theta \|\theta-y\|_2 + \lambda \|\theta\|_1.

Note that

\displaystyle L^2(\widehat\theta) - L^2(\theta^*) = \|\delta\|^2_2 - 2\langle z, \delta \rangle.

Combining that with the bound above, we get

\displaystyle \|\delta\|_2^2 \left(1-C_1^2s\|\nabla L(\theta^*)\|_{\infty}^2\right) \le C_2 \sqrt{s} \|\delta\|_2 \left[\|z\|_\infty + \left\|\nabla L(\theta^*)\right\|_\infty L(\theta^*)\right].

Note that one can bound  \|\nabla L(\theta^*)\|_{\infty} = \frac {\|z\|_\infty}{\| z \|_2} \lesssim \sqrt{\frac{\log n}{n}}  and  \left\|\nabla L(\theta^*)\right\|_\infty L(\theta^*) = \|z\|_\infty,  so we have the following theorem.
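The score computation is easy to verify numerically: for L(\theta) = \|y-\theta\|_2 one has \nabla L(\theta^*) = -z/\|z\|_2, hence \|\nabla L(\theta^*)\|_\infty = \|z\|_\infty / \|z\|_2. A finite-difference sanity check (a sketch; parameter values are mine):

```python
import math
import random

def L(theta, y):
    """Square-root loss ||y - theta||_2."""
    return math.sqrt(sum((t - v) ** 2 for t, v in zip(theta, y)))

random.seed(2)
n = 50
theta_star = [0.0] * n
z = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [t + v for t, v in zip(theta_star, z)]

z_norm = math.sqrt(sum(v * v for v in z))
score = [-v / z_norm for v in z]              # closed form: -z / ||z||_2

# Central finite differences of L at theta*.
h = 1e-6
score_fd = []
for i in range(n):
    tp, tm = list(theta_star), list(theta_star)
    tp[i] += h
    tm[i] -= h
    score_fd.append((L(tp, y) - L(tm, y)) / (2 * h))
```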

Theorem 2

Let \lambda = O\left(\frac{\sqrt{2 \log n} + u_\alpha}{\sqrt{n} -u_\alpha}\right),  where  u_\alpha = \sqrt{2 \log \alpha^{-1}},  for the Square-Root Lasso. Then, with probability at least  1-\alpha,

\displaystyle \|\theta^* - \widehat\theta\|_2^2 \lesssim s\, \sigma_n^2 \log \frac{n}{\alpha}

as long as  s \log n \lesssim n.

Note that the “optimal”  \lambda  doesn’t depend on  \sigma_n,  so we may not even know it when running the algorithm (and still obtain the same convergence rate as for the usual Lasso)!  The price for this is a very mild condition on  s.
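This pivotal property can be illustrated numerically in the orthogonal model. There the Square-Root Lasso solution is still a soft thresholding of y, at the data-driven level \mu = \lambda\|\widehat\theta - y\|_2, which can be found by a simple fixed-point iteration. The following is a heuristic sketch of my own, not from the post; the same noise-free \lambda is reused across three noise levels:

```python
import math
import random

def soft_threshold(y, mu):
    return [math.copysign(max(abs(v) - mu, 0.0), v) for v in y]

def sqrt_lasso_orthogonal(y, lam, iters=100):
    """Fixed-point iteration for mu = lam * ||soft(y, mu) - y||_2.
    The soft-thresholding residual satisfies |r_i| = min(|y_i|, mu),
    so the map is monotone and the iteration decreases from mu = lam*||y||_2."""
    mu = lam * math.sqrt(sum(v * v for v in y))
    for _ in range(iters):
        mu = lam * math.sqrt(sum(min(abs(v), mu) ** 2 for v in y))
    return soft_threshold(y, mu), mu

random.seed(3)
n, s = 1000, 10
lam = math.sqrt(2 * math.log(n)) / math.sqrt(n)   # tuning ignorant of sigma
theta_star = [5.0] * s + [0.0] * (n - s)

results = []
for sigma in (0.1, 1.0, 10.0):                    # three very different noise levels
    sigma_n = sigma / math.sqrt(n)
    y = [t + sigma_n * random.gauss(0.0, 1.0) for t in theta_star]
    theta_hat, mu = sqrt_lasso_orthogonal(y, lam)
    resid = math.sqrt(sum((h - v) ** 2 for h, v in zip(theta_hat, y)))
    results.append((theta_hat, mu, resid))
```

In each run the large signal coordinates survive the data-driven threshold, even though \lambda was chosen with no knowledge of \sigma.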

About Dmitry Ostrovsky

PhD student with interests in statistics, optimization, and machine learning.
This entry was posted in Sparsity, Statistics, Uncategorized.
