In this post we find out dimension-free oracle complexity for the following problem class
- function is -Lipschitz on (with respect to Euclidean norm), i.e.
- is contained in the euclidean ball of radius .
We are given the following first-order oracle. Given a query point it outputs the function value together with the subgradient of at i.e. a linear functional such that for any
Note that a convex function indeed has a subradient at every internal point of its domain (this is, in fact, ‘almost’ a criterion of convexity, since every function which has a subgradient at every point of its convex domain is convex).
To solve problems in we use Projected Gradient Descent (PGD). Starting from a feasible point it iterates
where is the Euclidean projection operator on i.e.
and is the step size parameter.
For any problem of PGD with satisfies
Proof. Let be any optimal point. Denote Consider the error on the step:
[definition of subgradient]
[projection lemma; iteration step]
In the last transition we used “Pythagorean inequality” (the projection to the convex set is closer to all points of this set than the initial point). Due to being -Lipschitz we have and the following holds:
Exploiting the convexity of and summing over we get
and it remains only to pick the optimal Note that this theorem gives the following upper bound on the black-box complexity of the class
1. In practice it would be bizarre to choose step sizes depending on the number of steps in advance (for example, because we may decide to continue iterations after , or stop before that). It may be easily shown that choosing ‘online’ step size we get the same rate up to the factor.
2. Step size recommended in the theorem is in any case very small. Heuristic explanation is that for the non-smooth objective, we have no guarantee that subgradients shrink to zero as we go to an optimal point (the only guarantee is that there exist a zero subgradient at the exact optimal point), and hence we must make sure [from the very beginning] that steps won’t be too large. On the other hand, it turns out that if is smooth, we can take just constant step size. Heuristically it’s due to the following reason: when the function is smooth, the gradient itself diminishes as we get to the optimal point. Also, note that if we are in the unconstrained case, then, if we are provided with the initial point at a distance from we have the same complexity bound (and we need no projections).
3. Convexity, formulated as the subgradient condition, provides a global upper bound to the suboptimality gap through the lower bound on Combining such bounds for every step, and upper-bounding the error of going ‘the wrong way’ using the Lipschitz property, we obtain our bound. It occurs that if the objective is smooth, we have also the lower bound on the decrement Mating this bound with the upper bound on the suboptimality gap, we may work out better bounds. Going on further with this reasoning (with sharper upper bound on the gap and lower bound on the decrement) for tighter problem classes (e.g. when the function is strongly convex) yields progressively better oracle complexity bounds.
4. We upper-bounded here the oracle complexity for the problem class in question, but remained silent about the arithmetical complexity (that is, how much computation we should make, taking in respect also the oracle calls) of PGD. In fact, the projection step may be burdensome since it reduces to solving the quadratic programming (QP) problem. In general case, this problem cannot be solved exactly; however, there are two ways to overcome this issue.
(1) In several cases the projection problem can be solved exactly. The simplest examples are when is -ball and -ball. More intricate are the probability simplex and -ball. A simple algorithm projects on these sets in [Duchi] provides a slightly more complicated algorithm which runs in but it is rarely used in practice.
(2) We can use projection-free method such as Conditional Gradient Descent, in which all iteration points stay inside In this method, the projection problem is substituted with a linear optimization problem (maximization of a linear functional over ) which is generally easier.