Projected Gradient Descent I — Lipschitz functions

In this post we derive a dimension-free oracle complexity bound for the following problem class  $\mathcal{P}(L,R):$

• function  $f$  is  $L$-Lipschitz on  $\mathcal{X}$  (with respect to the Euclidean norm), i.e.

$\displaystyle |f(x) - f(y)| \le L \|x-y\|_2$  for any $x, y \in \mathcal{X};$

• $\mathcal{X}$  is contained in the Euclidean ball of radius  $R$.

We are given the following  first-order oracle.  Given a query point  $x,$  it outputs the function value $f(x)$  together with a subgradient  $g(x)$  of  $f$  at  $x,$  i.e. a linear functional  $g$  such that for any  $y \in \mathcal{X},$

$\displaystyle f(y) - f(x) \ge g^T(y-x).$

Note that a convex function indeed has a subgradient at every interior point of its domain (this is, in fact, ‘almost’ a criterion of convexity, since every function which has a subgradient at every point of its convex domain is convex).
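As a quick sanity check, the snippet below verifies the subgradient inequality numerically for the non-smooth convex function $f(x) = \|x\|_1$; the function and the particular choice of subgradient are illustrative, not part of the problem class:

```python
import numpy as np

# f(x) = ||x||_1 is convex and non-smooth; one valid subgradient at x is
# sign(x) (any value in [-1, 1] works in coordinates where x_i = 0).
def f(x):
    return np.abs(x).sum()

def subgrad(x):
    return np.sign(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
g = subgrad(x)
for _ in range(1000):
    y = rng.standard_normal(5)
    # f(y) - f(x) >= g^T (y - x) must hold for every y
    assert f(y) - f(x) >= g @ (y - x) - 1e-12
```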

To solve problems in  $\mathcal{P}(L,R),$  we use Projected Gradient Descent (PGD).  Starting from a feasible point  $x_1 \in \mathcal{X},$  it iterates

$\displaystyle y_{k+1} = x_{k} - \eta\, g(x_k)$

$\displaystyle x_{k+1} = \Pi_\mathcal{X}(y_{k+1}),$

where  $\Pi_\mathcal{X}$  is the Euclidean projection operator onto  $\mathcal{X},$  i.e.

$\displaystyle \Pi_\mathcal{X}(y) = \arg\min_{x \in \mathcal{X}} \|y-x\|_2,$

and  $\eta$  is the step size parameter.
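The scheme above can be sketched in a few lines of Python; the objective $f(x) = \|x - c\|_1$, the target $c$, and the parameter values are illustrative assumptions:

```python
import numpy as np

# Minimal PGD sketch: minimize f(x) = ||x - c||_1 (Lipschitz, non-smooth)
# over the Euclidean ball X = {x : ||x||_2 <= R}. The target c, step size,
# and iteration count are illustrative choices.

def project_ball(y, R):
    """Euclidean projection onto the ball of radius R: rescale y if outside."""
    nrm = np.linalg.norm(y)
    return y if nrm <= R else (R / nrm) * y

def pgd(subgrad, x1, R, eta, t):
    """Run t projected subgradient steps from a feasible x1; return all iterates."""
    xs = [x1]
    for _ in range(t):
        y = xs[-1] - eta * subgrad(xs[-1])   # subgradient step
        xs.append(project_ball(y, R))        # projection step
    return xs

c = np.array([2.0, -1.0])                    # unconstrained optimum lies outside X
xs = pgd(lambda x: np.sign(x - c), np.zeros(2), R=1.0, eta=0.1, t=200)
x_bar = np.mean(xs, axis=0)                  # the averaged iterate used below
```

Note that the projection onto the Euclidean ball is particularly simple: a point outside the ball is just rescaled to the boundary.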

Theorem

For any problem of  $\mathcal{P}(L,R),$  PGD with  $\eta = \frac{R}{L\sqrt{t}}$  satisfies

$\displaystyle f \left( \frac{1}{t} \sum_{s=1}^t x_s \right) - f^* \le \frac{LR}{\sqrt{t}}.$

Proof.  Let  $x^*$  be any optimal point, and denote $g_s = g(x_s).$  Consider the error at the  $s\text{-th}$ step:

$\displaystyle f(x_s) - f(x^*) \le g_s^T(x_s - x^*)$    [definition of subgradient]

$\displaystyle \le\frac{1}{\eta}(x_s - y_{s+1})^T (x_s - x^*)$ [iteration step]

$\displaystyle =\frac{1}{2\eta} \big( \|x_s - x^*\|_2^2 - \|y_{s+1} - x^*\|_2^2 + \|x_s - y_{s+1}\|_2^2\big)$

$\displaystyle \le\frac{1}{2\eta} \big(\|x_s - x^*\|_2^2 - \|x_{s+1} - x^*\|_2^2 \big) +\frac{\eta}{2} \|g_s\|^2_2$   [projection lemma; iteration step]

In the last transition we used the “Pythagorean inequality”: the projection onto a convex set is closer to every point of that set than the original point, so  $\|x_{s+1} - x^*\|_2 \le \|y_{s+1} - x^*\|_2.$  Since $f$ is $L$-Lipschitz, we have $\|g_s\|_2 \le L,$  and the following holds:

$\displaystyle f(x_s) - f(x^*) \le \frac{\eta L^2}{2} + \frac{1}{2\eta} \left(\|x_s - x^*\|_2^2 - \|x_{s+1} - x^*\|_2^2\right).$

Exploiting the convexity of $f,$ summing over $s$ (the distance terms telescope), and using $\|x_1 - x^*\|_2 \le R,$ we get

$\displaystyle f \left( \frac{1}{t} \sum_{s=1}^t x_s \right) - f^* \le \frac{1}{t} \sum_{s=1}^t \left( f(x_s) - f^* \right) \le \frac{\eta L^2}{2} + \frac{\|x_1 - x^*\|_2^2}{2\eta t} \le \frac{\eta L^2}{2} + \frac{R^2}{2\eta t},$

and it remains only to pick the optimal  $\eta = \frac{R}{L\sqrt{t}},$  which balances the two terms.  Note that this theorem gives the following upper bound on the black-box complexity of the class  $\mathcal{P}(L,R):$

$\displaystyle \mathcal{C}_\mathcal{P}(\varepsilon) = O\left(\frac{L^2R^2}{\varepsilon^2}\right),$

since reaching accuracy  $\varepsilon$  requires  $t \ge \left(\frac{LR}{\varepsilon}\right)^2$  oracle calls.
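The rate of the theorem is easy to check numerically. The sketch below runs PGD with $\eta = \frac{R}{L\sqrt{t}}$ on an assumed toy instance ($f(x) = \|x - x^*\|_1$ on the unit ball, so $R = 1,$ $L = \sqrt{n},$ $f^* = 0$) and compares the averaged iterate against the $\frac{LR}{\sqrt{t}}$ bound:

```python
import numpy as np

# Sanity check of the LR/sqrt(t) rate on an illustrative toy instance:
# f(x) = ||x - x_star||_1 over the unit Euclidean ball, with feasible x_star,
# so R = 1, L = sqrt(n) (Lipschitz constant of the l1-norm w.r.t. l2), f* = 0.

def project_ball(y, R=1.0):
    nrm = np.linalg.norm(y)
    return y if nrm <= R else (R / nrm) * y

n, t = 4, 400
x_star = np.full(n, 0.3)             # interior optimum, ||x1 - x*||_2 <= R
L = np.sqrt(n)
eta = 1.0 / (L * np.sqrt(t))         # eta = R / (L sqrt(t)) with R = 1

x = np.zeros(n)
iterates = []
for _ in range(t):
    iterates.append(x)
    x = project_ball(x - eta * np.sign(x - x_star))
x_bar = np.mean(iterates, axis=0)

gap = np.abs(x_bar - x_star).sum()   # f(x_bar) - f*
bound = L * 1.0 / np.sqrt(t)         # LR / sqrt(t)
```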

Remarks

1.  In practice it would be bizarre to choose the step size depending on the total number of steps in advance (for example, because we may decide to continue iterating beyond  $t$  steps, or to stop earlier). It can easily be shown that with the ‘online’ step size  $\eta_s = \frac{R}{L\sqrt{s}},$  we get the same rate up to a  $\log(t)$  factor.
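A sketch of this variant on the same kind of assumed toy instance; the only change is that $\eta_s$ is computed inside the loop, so no horizon needs to be fixed in advance:

```python
import numpy as np

# 'Online' step size eta_s = R / (L sqrt(s)): the step depends only on the
# current iteration index s, not on a fixed horizon t. Toy instance (assumed):
# f(x) = ||x - x_star||_1 on the unit ball, R = 1, L = sqrt(n), f* = 0.

def project_ball(y, R=1.0):
    nrm = np.linalg.norm(y)
    return y if nrm <= R else (R / nrm) * y

n = 4
x_star = np.full(n, 0.3)
L = np.sqrt(n)

x = np.zeros(n)
iterates = []
for s in range(1, 1001):              # we could stop (or continue) at any point
    iterates.append(x)
    eta_s = 1.0 / (L * np.sqrt(s))    # chosen online, no horizon required
    x = project_ball(x - eta_s * np.sign(x - x_star))
x_bar = np.mean(iterates, axis=0)
gap = np.abs(x_bar - x_star).sum()    # within O(LR log(t)/sqrt(t)) of f*
```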

2.  The step size recommended by the theorem is in any case very small. A heuristic explanation is that for a non-smooth objective we have no guarantee that subgradients shrink to zero as we approach an optimal point (the only guarantee is that there exists a zero subgradient at the exact optimum), and hence we must make sure from the very beginning that the steps are not too large. On the other hand, it turns out that if $f$ is smooth, a constant step size suffices. Heuristically, this is because for a smooth function the gradient itself diminishes as we approach the optimum. Also, note that in the unconstrained case, if we are provided with an initial point at distance  $R$  from  $x^*,$  we obtain the same complexity bound (and need no projections).

3.  Convexity, formulated as the subgradient condition, provides a global upper bound on the suboptimality gap  $f(x_s) - f^*$  through a lower bound on  $f^*.$  Combining such bounds over all steps, and upper-bounding the error  $\|x_s - y_{s+1}\|_2$  of going ‘the wrong way’ via the Lipschitz property, we obtain our bound. It turns out that if the objective is smooth, we also have a lower bound on the decrement  $f(x_s) - f(x_{s+1}).$  Combining this bound with the upper bound on the suboptimality gap, we can work out better rates. Pushing this reasoning further (with a sharper upper bound on the gap and a sharper lower bound on the decrement) for tighter problem classes (e.g. strongly convex functions) yields progressively better oracle complexity bounds.

4.  We have upper-bounded the oracle complexity of the problem class in question, but remained silent about the arithmetic complexity of PGD (that is, the total amount of computation, including the cost of oracle calls). In fact, the projection step may be burdensome, since it amounts to solving a quadratic programming (QP) problem. In the general case this problem cannot be solved exactly; however, there are two ways around this issue.

(1) In several cases the projection can be computed exactly. The simplest examples are when  $\mathcal{X}$  is an  $\ell_2$-ball or an  $\ell_\infty$-ball. More intricate are the probability simplex and the  $\ell_1$-ball: a simple algorithm projects onto these sets in  $O(n \log n)$  time.  [Duchi] provides a slightly more involved algorithm which runs in  $O(n),$  but it is rarely used in practice.
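For illustration, here is a sketch of the classic sort-and-threshold $O(n \log n)$ projection onto the probability simplex (the function name is ours; [Duchi] describes the linear-time refinement):

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection onto the probability simplex {x : x >= 0, sum(x) = 1}
    via the classic O(n log n) sort-and-threshold algorithm."""
    u = np.sort(y)[::-1]                       # entries in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, y.size + 1)
    # rho: the largest k such that u_k + (1 - cumsum_k)/k > 0
    rho = ks[u + (1.0 - css) / ks > 0].max()
    theta = (css[rho - 1] - 1.0) / rho         # shift that makes the sum equal 1
    return np.maximum(y - theta, 0.0)
```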

(2) We can use a projection-free method such as Conditional Gradient Descent (Frank–Wolfe), in which all iterates stay inside  $\mathcal{X}.$  In this method, the projection problem is replaced by a linear optimization problem (optimization of a linear functional over  $\mathcal{X}$),  which is generally easier.
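A sketch of why the linear subproblem is easy: on the $\ell_1$-ball the optimum of a linear functional is attained at a vertex, available in closed form. The function names are ours, and the $\gamma_k = \frac{2}{k+2}$ schedule is a standard choice, shown here purely for illustration:

```python
import numpy as np

def lmo_l1_ball(g, r=1.0):
    """Linear minimization oracle: argmin of g^T v over {v : ||v||_1 <= r}.
    The minimum is attained at a signed, scaled coordinate vector."""
    i = np.argmax(np.abs(g))
    v = np.zeros_like(g)
    v[i] = -r * np.sign(g[i])
    return v

def cgd_step(x, g, k, r=1.0):
    """One Conditional Gradient step: move toward the vertex with
    gamma_k = 2/(k+2); the convex combination keeps the iterate feasible."""
    gamma = 2.0 / (k + 2.0)
    return (1.0 - gamma) * x + gamma * lmo_l1_ball(g, r)
```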
