## Lagrange duality via the Fenchel conjugate. The dual of ERM

A recurring trick in optimization is the characterization of the Lagrange dual problem in terms of the much simpler Fenchel conjugate. This trick is used in dual methods for machine learning and signal processing, in ADMM, in augmented Lagrangian methods, etc.

Consider a convex program on a polyhedron

$\displaystyle \min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t.} \quad Ax \le b, \quad \text{where}\,\, A \in \mathbb{R}^{m \times n}.$

Suppose also that the polyhedron is non-empty and Slater’s condition is satisfied, so that the duality gap is zero (in practice this can often be guaranteed). We can then concentrate on the Lagrange dual problem

$\displaystyle \max_{\lambda \in \mathbb{R}^m_+} g(\lambda) = -\min_{\lambda \in \mathbb{R}^m_+} -g(\lambda), \quad \text{where}$

$\displaystyle g(\lambda) = \min_{x \in \mathbb{R}^n}\lambda^T (Ax - b) + f(x).$

Denote by  $\displaystyle f_X^*(y) = \max_{x \in X}\, \langle y, x \rangle - f(x)$  the Fenchel conjugate of  $f$  w.r.t. some set  $X$  (we omit the subscript when  $X$  is the whole space). The key observation is as follows:

$\displaystyle -g(\lambda) = \max_{x \in \mathbb{R}^n} \lambda^T (-Ax + b) -f(x) = f^*(-A^T \lambda) + b^T\lambda.$
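As a sanity check, this identity can be verified numerically for the quadratic  $f(x) = \frac{1}{2}\|x\|_2^2$  (whose conjugate is  $f^*(y) = \frac{1}{2}\|y\|_2^2$  and whose inner minimizer has the closed form  $x^* = -A^T\lambda$). A minimal sketch with random placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam = rng.random(m)  # any lambda >= 0

# f(x) = 0.5*||x||^2, whose Fenchel conjugate is f*(y) = 0.5*||y||^2.
f_conj = lambda y: 0.5 * y @ y

# Inner minimization: min_x lam^T (A x - b) + f(x) has closed form x* = -A^T lam.
x_star = -A.T @ lam
g = lam @ (A @ x_star - b) + 0.5 * x_star @ x_star

# Key identity: -g(lam) = f*(-A^T lam) + b^T lam.
lhs = -g
rhs = f_conj(-A.T @ lam) + b @ lam
assert np.isclose(lhs, rhs)
```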

If we can compute  $f^*$  quickly (it may even be given in closed form), then we can instead solve the dual problem, which is smooth if the initial problem is strongly convex (by the main property of the Fenchel conjugate). Thus it sometimes makes sense to “perturb”  $f(x)$  by a $\sigma$-strongly convex regulariser  $\sigma \omega(x),$  so that the new dual  $g_{\sigma}(\lambda)$  is smooth, and Nesterov’s Accelerated Gradient Descent can be applied. Of course, we also need to control how well  $g_\sigma (\lambda)$  approximates  $g(\lambda),$  so evidently there is a trade-off. This idea underlies Nesterov’s smoothing technique, which applies to well-structured non-smooth problems.

Note that the smoothness of the dual depends on  $A.$  One can show that, given a pair of norms  $\|\cdot\|_1, \|\cdot\|_2$  on  $\mathbb{R}^n$  and  $\mathbb{R}^m$  respectively, if the primal is  $\sigma$-strongly convex w.r.t.  $\|\cdot\|_1,$  then the dual is  $(\|A\|^2_{12}/\sigma)$-smooth, where  $\|A\|_{12} = \max_{\|x\|_1 \le 1} \|Ax\|_2$  is the operator norm.
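For instance, with Euclidean norms on both sides and the $\sigma$-strongly convex  $f(x) = \frac{\sigma}{2}\|x\|_2^2,$  the negated dual is an explicit quadratic whose Hessian can be checked against the bound directly (a minimal sketch with a random matrix; the linear term  $b^T\lambda$  does not affect smoothness and is omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, sigma = 4, 6, 2.0
A = rng.standard_normal((m, n))

# For f(x) = (sigma/2)*||x||^2 the negated dual -g(lam) = f*(-A^T lam) + b^T lam
# is the quadratic ||A^T lam||^2 / (2*sigma) + b^T lam, whose Hessian
# A A^T / sigma has spectral norm ||A||_2^2 / sigma (with Euclidean norms,
# ||A||_{12} is exactly the spectral norm).
hessian = A @ A.T / sigma
smoothness = np.linalg.norm(hessian, 2)
bound = np.linalg.norm(A, 2) ** 2 / sigma
assert np.isclose(smoothness, bound)
```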

We might also consider the case of equality constraints:

$\displaystyle \min_{x \in \mathbb{R}^n} f(x) \quad \text{s.t.}\,\, Ax = 0,$

in which case the dual is unconstrained

$\displaystyle \max_{\lambda \in \mathbb{R}^m} -f^*(-A^T \lambda).$
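For example, for  $f(x) = \frac{1}{2}\|x - c\|_2^2$  the primal is the projection of  $c$  onto the null space of  $A,$  and the unconstrained dual can be maximized in closed form; a small numerical check (with a random full-row-rank  $A$  and placeholder  $c$) confirms the two optimal values coincide:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 7
A = rng.standard_normal((m, n))  # full row rank with probability 1
c = rng.standard_normal(n)

# f(x) = 0.5*||x - c||^2, with conjugate f*(y) = 0.5*||y||^2 + c^T y.
f_conj = lambda y: 0.5 * y @ y + c @ y

# Primal: min f(x) s.t. Ax = 0, i.e. project c onto the null space of A.
P = np.eye(n) - A.T @ np.linalg.solve(A @ A.T, A)  # null-space projector
x_star = P @ c
primal_val = 0.5 * (x_star - c) @ (x_star - c)

# Dual: max_lam -f*(-A^T lam); stationarity gives lam = (A A^T)^{-1} A c.
lam_star = np.linalg.solve(A @ A.T, A @ c)
dual_val = -f_conj(-A.T @ lam_star)
assert np.isclose(primal_val, dual_val)
```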

Another trick to change the problem formulation is to first introduce dummy variables and constraints, and then proceed as above. As an example, consider Empirical Risk Minimization (ERM). For a dataset  $x_1,\, ...,\, x_n \in \mathbb{R}^d, \quad d \gg n,$  the primal problem is

$\displaystyle \min_{w \in \mathbb{R}^d}\,\, \frac{1}{n}\sum_{i=1}^n \phi_i (w^T x_i) + \frac{\lambda}{2}\|w\|_2^2.$

Here the difficulty is that the first term (the empirical risk) is non-separable. Introducing dummy constraints  $z_i = w^T x_i$  and Lagrange multipliers  $\alpha_i,\, i \in \overline{1,n},$  we can work out the dual problem:

$\displaystyle \max_{\alpha \in \mathbb{R}^n}\,\, -\frac{1}{n}\sum_{i=1}^n \phi^*_i (-\alpha_i) - \frac{\lambda}{2} \left\| \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i x_i \right\|_2^2.$

This dual is used, for example, in Stochastic Dual Coordinate Ascent (SDCA).
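The ERM dual can be checked numerically for ridge regression, where  $\phi_i(z) = \frac{1}{2}(z - y_i)^2$  and hence  $\phi_i^*(u) = u y_i + \frac{1}{2}u^2$  in closed form, so both the primal and dual optima are available explicitly. A sketch with random placeholder data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lmbda = 8, 20, 0.5         # d >> n as in the text
X = rng.standard_normal((n, d))  # rows are the data points x_i
y = rng.standard_normal(n)

# Ridge losses phi_i(z) = 0.5*(z - y_i)^2, with conjugates
# phi_i^*(u) = u*y_i + 0.5*u^2.
def primal(w):
    return np.mean(0.5 * (X @ w - y) ** 2) + 0.5 * lmbda * w @ w

def dual(a):
    v = X.T @ a / (lmbda * n)
    return -np.mean(-a * y + 0.5 * a ** 2) - 0.5 * lmbda * v @ v

# Closed-form optima of both problems (first-order conditions).
w_star = np.linalg.solve(X.T @ X / n + lmbda * np.eye(d), X.T @ y / n)
a_star = np.linalg.solve(np.eye(n) + X @ X.T / (lmbda * n), y)

# Strong duality: the optimal values coincide, and the primal solution is
# recovered from the dual one via w = (1/(lmbda*n)) * sum_i alpha_i x_i.
assert np.isclose(primal(w_star), dual(a_star))
assert np.allclose(w_star, X.T @ a_star / (lmbda * n))
```

Note that the dual lives in  $\mathbb{R}^n$  rather than  $\mathbb{R}^d,$  which is exactly why this reformulation pays off when  $d \gg n.$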