In this post we continue with soft thresholding. Our estimator is an optimal solution (unique due to strong convexity) to

$$\hat\theta \in \operatorname*{argmin}_{\theta \in \mathbb{R}^n} \left\{ \tfrac{1}{2}\|y-\theta\|_2^2 + \lambda \|\theta\|_1 \right\},$$

and the observation model is

$$y = \theta^* + \xi, \qquad \xi \sim \mathcal{N}(0, \sigma^2 I_n);$$

note that we included the noise variance in $\xi$. We are interested in how the loss $\|\hat\theta - \theta^*\|_2^2$ of $\hat\theta$ compares with that of ‘the oracle’ $\theta^*$ itself, the loss for which is zero. We will assume that the signal is sparse; specifically, its support is $S = \operatorname{supp}(\theta^*)$ with $|S| = s \ll n.$
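Since the objective is separable across coordinates, the estimator has a simple closed form: coordinatewise soft thresholding. A minimal sketch in Python, checked against brute-force one-dimensional minimization (function names and test values are my own):

```python
import numpy as np

def soft_threshold(y, lam):
    """Coordinatewise soft thresholding: the unique minimizer of
    0.5 * ||y - theta||_2^2 + lam * ||theta||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# sanity check against brute-force minimization on a grid
y = np.array([3.0, -0.5, 1.2])
lam = 1.0
theta_hat = soft_threshold(y, lam)

grid = np.linspace(-5, 5, 100_001)
brute = np.array(
    [grid[np.argmin(0.5 * (yi - grid) ** 2 + lam * np.abs(grid))] for yi in y]
)
assert np.allclose(theta_hat, brute, atol=1e-3)
```

Coordinates with $|y_i| \le \lambda$ are set exactly to zero, which is the mechanism behind the sparsity of the estimate.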
Theorem. Let $\lambda = \sigma\sqrt{2\log(2n/\delta)}$. Then, with probability at least $1-\delta$,

$$\|\hat\theta - \theta^*\|_2^2 \le C\,\sigma^2 s \log(2n/\delta)$$

for some absolute constant $C$. In other words, we only pay a logarithmic price for not knowing the support.
I will give two proofs of this theorem: the first one is more straightforward, the second is less intuitive but easier to generalize.
Proof 1. Instead of expanding the error, let's just look at it componentwise. Since the objective separates across coordinates, the solution is given by coordinatewise soft thresholding:

$$\hat\theta_i = \operatorname{sign}(y_i)\,(|y_i| - \lambda)_+,$$

so that $|\hat\theta_i - y_i| \le \lambda$, and $\hat\theta_i = 0$ whenever $|y_i| \le \lambda$.

Pick $\lambda$ as in the premise of the theorem, so that with high probability $\|\xi\|_\infty \le \lambda$ (by the standard Gaussian tail bound combined with the union bound). If that event takes place, for the components $i \notin S$ the error is zero (indeed, $|y_i| = |\xi_i| \le \lambda$), whereas for $i \in S$ it holds

$$|\hat\theta_i - \theta^*_i| \le |\hat\theta_i - y_i| + |y_i - \theta^*_i| \le \lambda + |\xi_i| \le 2\lambda;$$

adding the squared inequalities above and bounding $\lambda^2 = 2\sigma^2\log(2n/\delta)$, we arrive at the claim with $C = 8$. ♦
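This argument is easy to check in simulation. The sketch below (dimensions, signal amplitude, and seed are my choices) verifies that on the event $\|\xi\|_\infty \le \lambda$ the off-support estimates vanish and the squared error obeys $4s\lambda^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, sigma, delta = 10_000, 20, 1.0, 0.01

theta_star = np.zeros(n)
theta_star[:s] = 10.0                       # well-separated sparse signal (my choice)
xi = sigma * rng.standard_normal(n)
y = theta_star + xi

lam = sigma * np.sqrt(2 * np.log(2 * n / delta))   # the theorem's choice of lambda
theta_hat = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

err2 = np.sum((theta_hat - theta_star) ** 2)

assert np.max(np.abs(xi)) <= lam            # the high-probability event holds
assert np.all(theta_hat[s:] == 0.0)         # off-support error is exactly zero
assert err2 <= 4 * s * lam**2               # the componentwise bound, summed
```

With this seed the realized error is well below the worst-case bound, as expected: the $2\lambda$ per-coordinate bound is only tight when noise and shrinkage bias align.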
Proof 2. We start with oracle feasibility,

$$\tfrac{1}{2}\|y - \hat\theta\|_2^2 + \lambda\|\hat\theta\|_1 \le \tfrac{1}{2}\|y - \theta^*\|_2^2 + \lambda\|\theta^*\|_1,$$

which, after rearranging the terms, results in

$$\tfrac{1}{2}\|\hat\theta - \theta^*\|_2^2 \le \langle \xi, \hat\theta - \theta^*\rangle + \lambda\left(\|\theta^*\|_1 - \|\hat\theta\|_1\right).$$

Now introduce the error vector $h = \hat\theta - \theta^*$. Note that $\|\theta^*\|_1 - \|\hat\theta\|_1 \le \|h_S\|_1 - \|h_{S^c}\|_1$, and we get

$$\tfrac{1}{2}\|h\|_2^2 \le \langle \xi, h\rangle + \lambda\|h_S\|_1 - \lambda\|h_{S^c}\|_1.$$

We have $\langle \xi, h\rangle \le \|\xi\|_\infty\left(\|h_S\|_1 + \|h_{S^c}\|_1\right)$. Picking $\lambda$ as recommended, so that $\|\xi\|_\infty \le \lambda$ w.h.p., we get rid of $\|h_{S^c}\|_1$ and arrive at

$$\tfrac{1}{2}\|h\|_2^2 \le 2\lambda\|h_S\|_1 \le 2\lambda\sqrt{s}\,\|h\|_2,$$

whence $\|h\|_2 \le 4\lambda\sqrt{s}$,
and we are done. ♦
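The chain of inequalities in this proof can be sanity-checked numerically: oracle feasibility holds deterministically (the estimator minimizes the penalized objective), and the final bound $\|h\|_2 \le 4\lambda\sqrt{s}$ holds whenever $\|\xi\|_\infty \le \lambda$. A sketch with parameter choices of my own:

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, sigma, delta = 5000, 10, 1.0, 0.01

theta_star = np.zeros(n)
theta_star[:s] = 8.0
xi = sigma * rng.standard_normal(n)
y = theta_star + xi

lam = sigma * np.sqrt(2 * np.log(2 * n / delta))
theta_hat = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
h = theta_hat - theta_star

# oracle feasibility, rearranged into the "basic inequality"
lhs = 0.5 * np.sum(h**2)
rhs = xi @ h + lam * (np.abs(theta_star).sum() - np.abs(theta_hat).sum())
assert lhs <= rhs + 1e-9

# the final bound, valid on the event ||xi||_inf <= lam
assert np.max(np.abs(xi)) <= lam
assert np.sum(h**2) <= 16 * s * lam**2
```

The basic inequality holds for every noise realization; only the last step needs the high-probability event.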
Suppose that instead of the quadratic empirical loss we minimize some convex empirical loss $L(\theta)$, again penalized by the $\ell_1$-norm (the applications include the Lasso with a non-orthogonal design matrix and the Square-Root Lasso, which we will discuss below). Then oracle feasibility gives

$$L(\hat\theta) - L(\theta^*) \le \lambda\left(\|\theta^*\|_1 - \|\hat\theta\|_1\right).$$

By convexity the left-hand side is lower-bounded by $\langle \nabla L(\theta^*), h\rangle \ge -\|\nabla L(\theta^*)\|_\infty \|h\|_1$, and we arrive at

$$\lambda\left(\|\hat\theta\|_1 - \|\theta^*\|_1\right) \le \|\nabla L(\theta^*)\|_\infty\,\|h\|_1.$$
The quantity $\nabla L(\theta^*)$ is called the score. Note that if the loss is strongly convex at $\theta^*$ (for the Lasso this holds e.g. if the design matrix satisfies RIP), we can get the quadratic inequality taking just $\lambda \ge \|\nabla L(\theta^*)\|_\infty$ w.h.p. to kill $\langle \nabla L(\theta^*), h\rangle$. But bare convexity is enough for the following trick. Take $\lambda$ such that w.h.p. $\lambda \ge c\,\|\nabla L(\theta^*)\|_\infty$ for some absolute constant $c > 1$. Then,

$$\|h_{S^c}\|_1 \le \frac{c+1}{c-1}\,\|h_S\|_1.$$
This lets us write the following bound:

$$\left|L(\hat\theta) - L(\theta^*)\right| \le \lambda\|h\|_1 \le \frac{2c}{c-1}\,\lambda\sqrt{s}\,\|h\|_2,$$

since the minimum of the penalized loss $L(\theta) + \lambda\|\theta\|_1$ is attained at $\hat\theta$. This might be useful if one can link the left-hand side with $\|h\|_2$ in an alternative way (not necessarily through strong convexity).
Consider, for example, $L(\theta) = \|y - \theta\|_2$ (this empirical loss is biased, but it doesn't matter since we compare it at two different points). That is, we run the Square-Root Lasso:

$$\hat\theta \in \operatorname*{argmin}_{\theta \in \mathbb{R}^n} \left\{ \|y - \theta\|_2 + \lambda\|\theta\|_1 \right\}.$$

Since $y - \hat\theta = \xi - h$, we have

$$\|h\|_2^2 = \left(L(\hat\theta) + L(\theta^*)\right)\left(L(\hat\theta) - L(\theta^*)\right) + 2\langle \xi, h\rangle.$$

Combining that with the bound above, we get

$$\|h\|_2^2 \lesssim \lambda\sqrt{s}\,\|h\|_2\left(\|\xi\|_2 + \lambda\sqrt{s}\,\|h\|_2\right)$$

with a constant depending only on $c$; here we used that the score is $\nabla L(\theta^*) = -\xi/\|\xi\|_2$, so $\|\xi\|_\infty \le \lambda\|\xi\|_2/c$ on the event defining $\lambda$.
Note that one can bound $\|\xi\|_2 \le 2\sigma\sqrt{n}$ w.h.p., and so we have the following theorem.
Theorem. Let $\lambda = c\sqrt{2\log(2n/\delta)/n}$, where $c > 1$ is an absolute constant, for the Square-Root Lasso. Then, with probability at least $1-\delta$,

$$\|\hat\theta - \theta^*\|_2^2 \lesssim \sigma^2 s \log(2n/\delta)$$

as long as $s\log(2n/\delta) \lesssim n$.
Note that the “optimal” $\lambda$ doesn't depend on $\sigma$, so we might not even know it at all when running the algorithm (and still obtain the same convergence rate as for the usual Lasso)! The price for this is a very mild condition on $s$.
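To illustrate, here is a toy Square-Root Lasso for the identity design. At an optimum with nonzero residual $r = \|y - \hat\theta\|_2$, the first-order condition says the solution is soft thresholding of $y$ at level $\lambda r$, so we can iterate on $r$. The solver and all parameters below are my own sketch, not from the post; the key point is that $\lambda$ is set without knowing $\sigma$:

```python
import numpy as np

def sqrt_lasso_identity(y, lam, iters=200):
    """Minimize ||y - theta||_2 + lam * ||theta||_1 for the identity design.
    Iterates the fixed point r -> ||y - soft_threshold(y, lam * r)||_2,
    starting from r = ||y||_2 (monotone decreasing to a fixed point)."""
    r = np.linalg.norm(y)
    theta = np.zeros_like(y)
    for _ in range(iters):
        theta = np.sign(y) * np.maximum(np.abs(y) - lam * r, 0.0)
        r = np.linalg.norm(y - theta)
    return theta

rng = np.random.default_rng(0)
n, s, delta = 2000, 10, 0.01
lam = 1.1 * np.sqrt(2 * np.log(2 * n / delta) / n)   # no sigma anywhere!

theta_star = np.zeros(n)
theta_star[:s] = 50.0                                # strong signal (my choice)
for sigma in (0.5, 2.0):                             # same lam for both noise levels
    xi = sigma * rng.standard_normal(n)
    theta_hat = sqrt_lasso_identity(theta_star + xi, lam)
    err2 = np.sum((theta_hat - theta_star) ** 2)
    # the rate sigma^2 * s * log(2n/delta), with a generous constant
    assert err2 <= 16 * sigma**2 * s * np.log(2 * n / delta)
```

The effective threshold $\lambda\|y - \hat\theta\|_2$ self-scales with the noise level, which is exactly why a noise-free $\lambda$ suffices.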