Lagrange duality via the Fenchel conjugate. The dual of ERM

A recurring trick in optimization is the characterization of the Lagrange dual problem in terms of the much simpler Fenchel conjugate. This trick underlies dual methods in machine learning and signal processing: ADMM, the augmented Lagrangian method, and so on.

Consider a convex program on a polyhedron

\displaystyle \min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t.} \quad Ax \le b, \quad \text{where}\,\, A \in \mathbb{R}^{m \times n}.

Suppose also that the polyhedron is non-empty and Slater’s condition is satisfied, so that the duality gap is zero (in practice we can often verify this directly). We can then concentrate on the Lagrange dual problem

\displaystyle \max_{\lambda \in \mathbb{R}^m_+} g(\lambda) = -\min_{\lambda \in \mathbb{R}^m_+} -g(\lambda), \quad \text{where}

\displaystyle g(\lambda) = \min_{x \in \mathbb{R}^n}\lambda^T (Ax - b) + f(x).

Denote by  \displaystyle f_X^*(y) = \max_{x \in X}\, \langle y, x \rangle - f(x)  the Fenchel conjugate of f with respect to some set  X  (we will omit the subscript when  X  is the whole space). The key observation is as follows:

\displaystyle -g(\lambda) = \max_{x \in \mathbb{R}^n} \lambda^T (-Ax + b) -f(x) = f^*(-A^T \lambda) + b^T\lambda.

If we can compute  f^*  fast (it may even be given in closed form), then we can solve the dual problem instead; it is smooth whenever the initial problem is strongly convex (by the main property of the Fenchel conjugate). Thus it sometimes makes sense to “perturb” f(x) by a \sigma-strongly convex regulariser \sigma \omega(x), so that the new dual  g_{\sigma}(\lambda)  is smooth, and Nesterov’s Accelerated Gradient Descent can be applied. Of course, we also need to control how well  g_\sigma (\lambda)  approximates  g(\lambda),  so evidently there is a trade-off. This idea underlies Nesterov’s smoothing technique, which applies to well-structured non-smooth problems.
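To make the recipe concrete, here is a minimal sketch (all problem data is synthetic, and the choice f(x) = \frac{1}{2}\|x - c\|^2 is just for illustration) that solves a small inequality-constrained QP through its dual: f is 1-strongly convex, so g is smooth and projected gradient ascent on the nonnegative orthant converges; the primal point is recovered from the inner minimizer x^*(\lambda).

```python
import numpy as np

# Sketch on synthetic data: solve  min_x 1/2 ||x - c||^2  s.t.  Ax <= b
# through its dual.  Here f*(y) = 1/2 ||y||^2 + c^T y, so
#   -g(lam) = f*(-A^T lam) + b^T lam.
# Since f is 1-strongly convex, g is smooth, and projected gradient
# ascent on {lam >= 0} converges.

rng = np.random.default_rng(0)
m, n = 3, 5
A = rng.standard_normal((m, n))
c = rng.standard_normal(n)
b = A @ c - 1.0                      # chosen so the constraints bind at c

def grad_g(lam):
    # x*(lam) = argmin_x lam^T (Ax - b) + f(x) = c - A^T lam,
    # and grad g(lam) = A x*(lam) - b.
    return A @ (c - A.T @ lam) - b

L = np.linalg.norm(A @ A.T, 2)       # smoothness constant of g
lam = np.zeros(m)
for _ in range(20000):
    lam = np.maximum(0.0, lam + grad_g(lam) / L)

x = c - A.T @ lam                    # primal recovery
fval = 0.5 * np.sum((x - c) ** 2)
gval = lam @ (A @ x - b) + fval      # g(lam), evaluated at x = x*(lam)
print(np.max(A @ x - b), fval - gval)
```

At convergence the constraint residual is non-positive and the duality gap f(x) - g(\lambda) = -\lambda^T(Ax - b) vanishes by complementary slackness.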

Note that the smoothness of the dual depends on  A.  It is easy to check that, given a pair of norms  \|\cdot\|_1, \|\cdot\|_2  on  \mathbb{R}^n  and  \mathbb{R}^m  respectively, if the primal objective is  \sigma-strongly convex, then the dual is  (\|A\|^2_{12}/\sigma)-smooth, where  \|A\|_{12}  is the induced operator norm.
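As a quick numerical sanity check of this constant (Euclidean norms on both spaces, so \|A\|_{12} is the spectral norm; the data is synthetic), take f(x) = \frac{\sigma}{2}\|x\|^2, for which the dual gradient is linear and its Lipschitz constant is exactly \|A\|^2/\sigma:

```python
import numpy as np

# Sanity check on synthetic data: for f(x) = (sigma/2)||x||^2 we have
# x*(lam) = -A^T lam / sigma, so grad g(lam) = A x*(lam) - b is linear
# in lam and Lipschitz with constant ||A||^2 / sigma (spectral norm).

rng = np.random.default_rng(1)
m, n, sigma = 4, 6, 2.0
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def grad_g(lam):
    x_star = -A.T @ lam / sigma       # argmin_x lam^T (Ax - b) + f(x)
    return A @ x_star - b

bound = np.linalg.norm(A, 2) ** 2 / sigma
ratios = []
for _ in range(100):
    u, v = rng.standard_normal(m), rng.standard_normal(m)
    ratios.append(np.linalg.norm(grad_g(u) - grad_g(v)) / np.linalg.norm(u - v))
print(max(ratios), "<=", bound)
```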

We might also consider the case of equality constraints:

\displaystyle \min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t.}\,\, Ax = 0,

in which case the dual is unconstrained

\displaystyle \max_{\lambda \in \mathbb{R}^m} -f^*(-A^T \lambda).
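For instance, with f(x) = \frac{1}{2}\|x - c\|^2 this unconstrained dual can be solved in closed form, and the recovered primal point is exactly the projection of c onto the nullspace of A (a small sketch on synthetic data):

```python
import numpy as np

# Sketch on synthetic data: min 1/2 ||x - c||^2  s.t.  Ax = 0.
# With f*(y) = 1/2 ||y||^2 + c^T y, the unconstrained dual
#   max_lam -f*(-A^T lam) = max_lam -1/2 ||A^T lam||^2 + c^T A^T lam
# has first-order condition  A A^T lam = A c.  The primal point
# x = c - A^T lam is then the projection of c onto null(A).

rng = np.random.default_rng(2)
m, n = 3, 6
A = rng.standard_normal((m, n))
c = rng.standard_normal(n)

lam = np.linalg.solve(A @ A.T, A @ c)
x = c - A.T @ lam

print(np.linalg.norm(A @ x))   # ~0: x lies in the nullspace of A
```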

Another trick for changing the problem formulation is to first introduce dummy variables and constraints, and then proceed as above. As an example, consider Empirical Risk Minimization (ERM). For a dataset  x_1,\, ...,\, x_n \in \mathbb{R}^d, \quad d \gg n, the primal problem is

\displaystyle \min_{w \in \mathbb{R}^d}\,\, \frac{1}{n}\sum_{i=1}^n \phi_i (w^T x_i) + \frac{\lambda}{2}\|w\|_2^2.

The difficulty here is that the first term (the empirical risk) is non-separable. Introducing dummy constraints  z_i = w^T x_i  and Lagrange multipliers  \alpha_i,\, i \in \overline{1,n},  we can work out the dual problem:

\displaystyle \max_{\alpha \in \mathbb{R}^n}\,\, -\frac{1}{n}\sum_{i=1}^n \phi^*_i (-\alpha_i) - \frac{\lambda}{2} \left\| \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i \right\|_2^2.

This is used, for example, in SDCA (Stochastic Dual Coordinate Ascent).
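A minimal SDCA-style sketch for the special case of squared loss \phi_i(z) = \frac{1}{2}(z - y_i)^2 (synthetic data; the closed-form coordinate update is specific to this loss), maximizing the dual above coordinate-wise while maintaining the primal point w = \frac{1}{\lambda n}\sum_i \alpha_i x_i:

```python
import numpy as np

# Minimal SDCA-style sketch for ridge regression on synthetic data:
# phi_i(z) = 1/2 (z - y_i)^2, with conjugate phi_i*(u) = 1/2 u^2 + y_i u.
# We do exact coordinate ascent on
#   D(alpha) = (1/n) sum_i (-1/2 alpha_i^2 + y_i alpha_i)
#              - (lam/2) || (1/(lam n)) sum_i alpha_i x_i ||^2,
# maintaining w = (1/(lam n)) sum_i alpha_i x_i throughout.

rng = np.random.default_rng(3)
n, d, lam = 20, 5, 0.1
X = rng.standard_normal((n, d))          # rows are the x_i
y = rng.standard_normal(n)

alpha = np.zeros(n)
w = np.zeros(d)
for _ in range(200):                      # epochs
    for i in rng.permutation(n):
        # exact maximization of D over alpha_i (closed form for squared loss)
        delta = (y[i] - alpha[i] - X[i] @ w) / (1.0 + X[i] @ X[i] / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)

# compare against the direct ridge solution of the primal
w_direct = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
print(np.linalg.norm(w - w_direct))
```

The dual iterate recovers the primal solution because strong duality holds and w^*(\alpha) is the inner minimizer of the Lagrangian in w.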


About Dmitry Ostrovsky

PhD student with interests in learning theory, statistics, and optimization, working with Anatoli Juditsky and Zaid Harchaoui.
This entry was posted in Optimization tricks.
