## Formulations of Mirror Descent

A very nice reference for this post is [this blog post].

Mirror Descent is a first-order algorithm to solve a convex optimization problem  $\min_{x \in X} f(x).$  In this post we will learn several equivalent formulations of this algorithm and discuss how they are related to each other.

Let  $\omega(x)$  be a distance-generating function (d.-g.f.), 1-strongly convex on the domain $X$ and differentiable on its interior, and consider the Bregman divergence on  $X \times \text{int}(X):$

$\displaystyle D(x,y) = \omega(x) - \omega(y) - \big\langle x-y,\nabla\omega(y)\big\rangle$
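As a concrete numerical sketch (assuming the entropy d.-g.f., which is not fixed anywhere in this post), for $\omega(x) = \sum_i x_i \log x_i$ on the probability simplex the Bregman divergence coincides with the Kullback–Leibler divergence:

```python
import numpy as np

def bregman(omega, grad_omega, x, y):
    """D(x, y) = omega(x) - omega(y) - <x - y, grad_omega(y)>."""
    return omega(x) - omega(y) - np.dot(x - y, grad_omega(y))

# Entropy d.-g.f. on the probability simplex.
omega = lambda x: np.sum(x * np.log(x))
grad_omega = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.5, 0.25, 0.25])

# On the simplex (coordinates summing to one) this Bregman divergence
# coincides with the KL divergence sum_i x_i log(x_i / y_i).
print(bregman(omega, grad_omega, x, y), np.sum(x * np.log(x / y)))
```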

**Projection formulation**

$\displaystyle \nabla\omega(y) = \nabla\omega(x) - t \nabla f(x)$

$\displaystyle x^+ = \pi_{X}(y) := \arg\min_{\xi \in X} D(\xi,y)$
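To make the update concrete, here is a minimal numerical sketch in the Euclidean case $\omega(x) = \frac{1}{2}\|x\|_2^2,$ where $\nabla\omega$ is the identity and the Bregman projection reduces to the Euclidean one (the objective, box constraint, and stepsize are illustrative):

```python
import numpy as np

def projected_subgradient_step(x, grad, t, lo, hi):
    """Mirror descent with omega(x) = 0.5 * ||x||_2^2: grad_omega is the
    identity, so y = x - t * grad, and the Bregman projection onto the box
    [lo, hi]^n is the Euclidean projection, i.e. coordinate-wise clipping."""
    y = x - t * grad                 # grad_omega(y) = grad_omega(x) - t * grad_f(x)
    return np.clip(y, lo, hi)       # x+ = argmin_{xi in X} 0.5 * ||xi - y||^2

# Illustrative problem: minimize ||x - c||^2 over the box [0, 1]^2.
c = np.array([2.0, -1.0])
x = np.array([0.5, 0.5])
for _ in range(100):
    x = projected_subgradient_step(x, 2.0 * (x - c), t=0.1, lo=0.0, hi=1.0)
print(x)  # converges to the projection of c onto the box, i.e. (1, 0)
```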

Here  $\nabla f(\cdot)$  is any subgradient of the objective function, and  $t$  is the stepsize parameter. This formulation, used extensively in [Bubeck 2015], allows us to draw an analogy with Projected Subgradient Descent, for which $\omega(x) = \frac{1}{2}\|x\|_2^2,$  $\nabla\omega(x) = x,$  and  $D(x,y) = \frac{1}{2}\|x-y\|_2^2.$  Nevertheless, this formulation is somewhat deceptive, since it is not clear how to compute the point  $y$  itself, nor under what conditions such a  $y$  exists at all. In fact, the existence of  $y$  is guaranteed as soon as  $\omega(x)$  is strongly convex and proper on  $X.$  Assuming this is indeed the case, let us obtain another formulation by some elementary convex analysis. Consider the Legendre conjugate of  $\omega(\cdot)$  with respect to $X:$

$\displaystyle \omega_X^*(g) := \max_{x \in X}\, \langle g,x \rangle - \omega(x).$

Since  $\omega_X^*(g)$  is the pointwise maximum of functions affine in $g,$  it is convex; moreover, by Danskin's theorem it is differentiable, with

$\displaystyle \widehat x = \nabla \omega_X^*(g),$

where  $\widehat x$  is the corresponding maximizer, unique due to the strong convexity of  $\omega.$ Combining this observation with the previous formulation,

$\displaystyle x^+ = \arg\min_{\xi \in X} D(\xi,y) = \arg\min_{\xi \in X}\, \omega(\xi) - \omega(y) - \langle \nabla \omega(y),\xi-y \rangle$

$\displaystyle = \arg\min_{\xi \in X}\, \omega(\xi) - \langle \nabla \omega(y),\xi \rangle$

$\displaystyle = \arg\min_{\xi \in X}\, \omega(\xi) - \langle \nabla \omega(x)-t\nabla f(x),\xi \rangle,$

we obtain the

**Mirror formulation**

$\displaystyle x^+ = \nabla\omega_X^*[\nabla\omega(x) - t \nabla f(x)]$
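For the entropy d.-g.f. $\omega(x) = \sum_i x_i \log x_i$ on the probability simplex, the mirror map has the closed form $\nabla\omega_X^*(g) = \mathrm{softmax}(g),$ and the update becomes the exponentiated gradient method. A minimal sketch (the linear objective and stepsize are illustrative):

```python
import numpy as np

def softmax(g):
    """grad_omega_X^* for the entropy d.-g.f. on the simplex."""
    e = np.exp(g - g.max())          # shift for numerical stability
    return e / e.sum()

def mirror_step(x, grad, t):
    g = np.log(x) + 1.0              # go beyond the mirror: grad_omega(x)
    g = g - t * grad                 # gradient step in the mirror (dual) space
    return softmax(g)                # come back: grad_omega_X^*(.)

# Illustrative problem: minimize the linear function <c, x> over the simplex.
c = np.array([0.3, 0.1, 0.8])
x = np.ones(3) / 3
for _ in range(500):
    x = mirror_step(x, c, t=0.5)
print(x)  # the mass concentrates on argmin_i c_i, here index 1
```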

Interestingly, this formulation is more general: in contrast with the first one, it requires no assumptions beyond strong convexity. Moreover, it is easy to see from the first-order optimality conditions that the map  $\nabla\omega_X^*(\cdot)$  is given by the following relation:

$\displaystyle \nabla\omega_X^*(g) = \partial^{-1}(\omega + I_X)(g) = (\nabla \omega + N_X)^{-1}(g)$

where  $I_X$  is the indicator function of $X,$  $N_X$  is the normal cone, and the multi-map $\nabla\omega + N_X: X \to \mathbb{R}^n$  is invertible (injective and surjective). The mirror formulation then reads: first go ‘beyond the mirror’ via the map $\nabla(\omega + I_X)(\cdot) = \nabla \omega(\cdot)$  on  $X,$  then take a gradient step beyond the mirror, and finally come back through the inverse map.

Yet another formulation is obtained from the last line of the chaining argument above: dividing the objective by $t$ and adding the $\xi$-independent constant $\frac{1}{t}\big(\langle \nabla\omega(x), x\rangle - \omega(x)\big)$ completes it to a Bregman divergence.

**Proximal point formulation**

$\displaystyle x^+ = \arg\min_{\xi \in X}\, \langle \nabla f(x), \xi \rangle + \frac{D(\xi,x)}{t}$
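As a sanity check, one can solve this subproblem with a generic solver and compare against the closed-form update; a sketch assuming the entropy d.-g.f. on the simplex, with scipy's SLSQP standing in as the generic solver:

```python
import numpy as np
from scipy.optimize import minimize

def prox_step(x, grad, t):
    """Solve  min_xi  <grad, xi> + D(xi, x) / t  over the simplex numerically,
    with D the Bregman divergence of the entropy d.-g.f. (the KL divergence)."""
    obj = lambda xi: np.dot(grad, xi) + np.sum(xi * np.log(xi / x)) / t
    res = minimize(obj, x, method="SLSQP",
                   bounds=[(1e-9, 1.0)] * len(x),
                   constraints=[{"type": "eq", "fun": lambda xi: xi.sum() - 1.0}])
    return res.x

def mirror_step(x, grad, t):
    """Closed form of the same step (exponentiated gradient)."""
    w = x * np.exp(-t * grad)
    return w / w.sum()

x = np.array([0.2, 0.3, 0.5])
g = np.array([1.0, -0.5, 0.3])
print(prox_step(x, g, 0.1))
print(mirror_step(x, g, 0.1))   # the two should agree up to solver tolerance
```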

In other words, the algorithm is trying to minimize the local linear model of the function while staying close to the previous search point. Finally, elementary algebra gives the following

**Regularization formulation**

$\displaystyle x^+ = \arg\min_{\xi \in X}\, -\langle \nabla\omega(x) - t \nabla f(x), \xi \rangle + \omega(\xi)$

This formulation is most often used to actually implement the algorithm.
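A complete implementation sketch based on this formulation, again assuming the entropy d.-g.f. on the simplex, where the argmin is available in closed form (the test problem and stepsize are illustrative):

```python
import numpy as np

def md_step(x, grad, t):
    """Regularization formulation on the simplex with omega(x) = sum_i x_i log x_i:
    argmin_xi  omega(xi) - <grad_omega(x) - t * grad, xi>  has the closed form
    xi_i proportional to x_i * exp(-t * grad_i)."""
    w = x * np.exp(-t * grad)
    return w / w.sum()

# Illustrative problem: minimize f(x) = 0.5 * ||x - p||^2 over the simplex,
# i.e. compute the Euclidean projection of p onto the simplex.
p = np.array([0.9, 0.4, -0.1])
x = np.ones(3) / 3
for _ in range(2000):
    x = md_step(x, x - p, t=0.5)    # grad f(x) = x - p
print(x)  # approaches the projection (0.75, 0.25, 0)
```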