Thresholding over the l1-ball

Since long ago I wanted to write a post with an elementary introduction to soft / hard thresholding. The time has come to pay the dues, so here we go.

We are concerned with the standard problem of estimating a function on the unit grid on  [0,1]:

\displaystyle Y_t = f\left(\frac{t}{n}\right) + \sigma \xi_t, \quad 0 \le t \le n-1.

Given an estimator  \widehat f  of   f,  let us evaluate the error in terms of the mean square error restricted to the grid,

\displaystyle \text{MSE} = \mathbb{E}\left[\frac{1}{n}\left\|\widehat f-f\right\|_{2}^2\right] =\mathbb{E}\|\widehat f-f\|_{n,2}^2,

where  \|\cdot\|_{n,2}^2 := \frac{1}{n} \|\cdot\|_{2}^2  is the empirical norm of the signal sampled signal which approximates the functional  L_2-norm on a finite grid. Since the loss  \|\widehat f-f\|_{n,2}^2  is rotation-invariant, we may transfer the problem in the frequency domain by means of the Discrete Fourier transform, dealing now with the “small noise”-observations

\displaystyle y = \theta^* + \sigma_n z, \quad\sigma_n = \frac{\sigma}{\sqrt{n}}

for which  \text{MSE} = \|\widehat \theta^* - \theta \|_2^2. Now, given a level  \lambda > 0,  consider the soft thresholding estimator

\displaystyle \widehat{\theta}^{\lambda}_{j} := \text{sign}(y_j) (|y_j|-\lambda)_+,

which can also be defined as the unique (due to strong convexity) solution of

\displaystyle \min_{\theta}\frac{1}{2}\|y-\theta\|_2^2 + \lambda\|\theta\|_1,

and its counterpart, the hard thresholding estimator  \widehat{\theta}^{[\lambda]},  with  \widehat{\theta}^{[\lambda]}_j = y_j  if  |y_j| > \lambda  and zero otherwise.


Given the norm of the true signal  \|\theta^*\|_1,  both estimators with thresholding level  \displaystyle\lambda_n = \sigma_n \left(\sqrt{2\log n} + \sqrt{2 \log \alpha^{-1}}\right)  provide that with probability at least  1-\alpha,

\displaystyle \|\widehat\theta-\theta^*\|_2^2 \le 4\sqrt{2}\|\theta^*\|_1 \sigma_n \left(\sqrt{\log n}+\sqrt{\log\alpha^{-1}}\right).

Moreover, this is the minimax risk for an  \ell_1-ball.

Remark.  As an illustration of this result, consider first the Bayesian formulation with the restriction \mathbb{E} \|\theta^*\|_2^2 \le \sigma^2.  As such, there is no hope of consistent estimation, since  \mathbb{E} \|\theta^*\|_2^2  and \mathbb{E} \|z\|_2^2  are of the same order (if that ‘argument’ did not convince you, apply the Van Trees inequality). And indeed, one may apply the shrinkage estimator (also called the Wiener filter)

\displaystyle \widehat \theta_j = c_j y_j, \quad

with the shrinkage parameter c_j \equiv \frac{1}{2}  (to get the optimal shrinkage, one may guess that the least favorable distribution is Gaussian and then explicitly minimize the corresponding Bayesian risk).  The corresponding MSE is just 1. But if we restrict the priors to be such that  \mathbb{E} \|\theta^*\|_1 \le \sigma,  Theorem 1 will give

\displaystyle \text{MSE} = O\left(\sigma^2 \sqrt{\frac{\log n}{n}}\right).

Proof.  We will prove only the upper bound. For that, first notice that the loss of hard thresholding can never be greater than that of soft thresholding with the same threshold (it is the same if we are below the threshold and obviously not greater when we are above), hence we can only analyze soft thresholding.

The proof is based on a very simple idea of the oracle feasibility. It is easy to check that the soft thresholding estimator  \widehat \theta  is the global minimizer of  \frac{1}{2}\|y - \theta\|_2^2 + \lambda \|\theta\|_1,  hence

\displaystyle \frac{1}{2}\|y - \widehat\theta\|_2^2 + \lambda \|\widehat \theta\|_1 \le\frac{1}{2}\|y - \theta^*\|_2^2 + \lambda \|\theta^*\|_1 = \frac{1}{2}\sigma_n^2\|z\|_2^2 + \lambda \|\theta^*\|_1.

Expanding the squared norm and subtracting  \frac{1}{2}\sigma_n^2 \|z\|_2^2  from both sides, we get

\displaystyle\frac{1}{2}\|\theta^* - \widehat\theta\|_2^2 \le \lambda\|\theta^*\|_1 - \sigma_n \langle z, \theta^* \rangle- \lambda\|\widehat\theta\|_1 + \sigma_n \langle z, \widehat\theta \rangle

Note that now in the left-hand side we have exactly what we needed, so we need to get things done with the right-hand side. The idea is to choose the minimal  \lambda  such that the sum of the last two terms (where the unknown  \widehat \theta  messes things up)  is negative. Since  \langle z, \widehat\theta \rangle \le \|z \|_\infty \|\widehat \theta\|_1,  and using the concentration of  \|z\|_\infty  (it is Lipschitz, and the expectation may be worked out easily), we can get that

\displaystyle \mathbb{P}\left(\|z\|_\infty \ge \sqrt{2 \log n} + t\right) \le e^{-\frac{t^2}{2}},

whence with probability at least  1 - \alpha,

\displaystyle \|z\|_\infty \le \sqrt{2\log n} + \sqrt{2\log\alpha^{-1}}

Picking such  \lambda,  and using Young’s inequality once more to upper bound  \langle z, \theta^* \rangle,  we finish the proof. ♦


About Dmitry Ostrovsky

PhD student with interests in learning theory, statistics, and optimization, working with Anatoli Juditsky and Zaid Harchaoui.
This entry was posted in Memos, Sparsity, Statistics. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s