In this post we continue with soft thresholding. Our estimator is an optimal solution (unique due to strong convexity) to

and the observation model is note that we included the noise variance in . We are interested in how the loss of compares with ‘the oracle’ the loss for which is zero. We will assume that the signal is sparse; specifically, its support is with

Theorem 1Let Then, with probability at least

for some absolute constant In other words, we only pay logarithmic price for not knowing the support.

I will give two proofs of this theorem: the first more straightforward, and the second one is less intuitive but is easier to generalize.

**First proof.**

Instead of expanding the error, let’s just look at the error componentwise. Since

Pick as in the premise of the theorem, so that with high probability If that event takes place, for the components the error is zero, whereas for it holds

adding squared inequalities above and bounding we arrive at the claim with ♦

**Second proof.**

We start with oracle feasibility, which, after rearranging the terms, results in

Now introduce the error vector Note that and we get

We have Picking as recommended, we get rid of and arrive at

and we are done. ♦

Suppose that instead of the quadratic empirical loss we minimize some convex empirical loss again penalized by the -norm (the applications include Lasso with non-orthogonal matrix and Square-Root Lasso which we will discuss below). Then oracle feasibility gives

By convexity the left-hand side is lower-bounded by and we arrive at

The quantity is called *the** score. *Note that if the loss is strongly convex at (for Lasso this holds e.g. if the design matrix is RIP), we can get the quadratic inequality taking just w.h.p. to kill But bare convexity is enough for the following trick. Take such that w.h.p. for some absolute constant Then,

This lets us write the following bound:

since is attained at This might be useful if one can link the left-hand side with in alternative way (not necessarily through strong convexity).

Consider, for example, (this empirical loss is biased but it doesn’t matter since we compare it in two different points). That is, we run the * Square-Root Lasso*:

Note that

Combining that with the bound above, we get

Note that one can bound and so we have the following theorem.

Theorem 2Let where for the Square-Root Lasso. Then, with probability at least

as long as

Note that the “optimal” doesn’t depend on so we might even not know it at all when running the algorithm (and obtain the same convergence rate as for the usual Lasso)! The price for this is a very mild condition on