A recurring trick in optimization is the characterization of Lagrange dual problem in terms of the much simpler Fenchel duality. This trick is used in dual methods for Machine Learning and Signal Processing, ADMM, Augmented Lagrangian, etc.
Consider a convex program on a polyhedron
Suppose also that the polyhedron is non-empty (and Slater’s condition is satisfied), so the dual gap is zero (in practice it is often the case that we can be sure that it is true). We can concentrate on the Lagrange dual problem
Denote the Fenchel conjugate of w.r.t. some set (we will omit the subscript when is the whole space). The key observation is as follows:
If we can calculate fast (it may be even given in a closed form) then we can solve instead the dual problem, which smooth if the initial problem is strongly convex (by the main property of Fenchel conjugate). Thus it sometimes makes sense to “perturb” f(x) by some -strongly convex regulariser so that the new dual will be smooth, and Nesterov’s Accelerated Gradient Descent can be applied. Of course, we also need to control the approximation of by so evidently there is a trade-off. That idea justifies Nesterov’s smoothing technique which applies to well-structured non-smooth problems.
Note that the smoothness of the dual depends on It is easy to obtain that, given the pair of norms on and correspondingly, if the primal is -strongly convex), then the dual is -smooth.
We might also consider the case of equality constraints:
in which case the dual is unconstrained
Another trick to change the problem formulation is when one first introduces dummy variables and constraints, and then proceeds as above. As an example, consider Empirical Risk Minimization (ERM). For a dataset the primal problem is
Here the problem is that the first term (the empirical risk) is non-separable. Introducing dummy constraints and Lagrange multipliers we can work out the dual problem:
This is used, for example, in SDCA.