## Black box approach to optimization

To  learn convex optimization methods and stack this structure in my head, I decided to write the material down by topics. The focus will be on (constrained) large-scale, or dimension-free, optimization corresponding to first-order methods, with applications to machine learning in mind. I don’t pretend to original contribution here; the material will be taken from the [Bubeck], [Nesterov], and [Nemirovski].

In this post we introduce the main framework we will subsequently be working with, the so-called black box model.

Suppose that we are given an optimization problem of the form

$\displaystyle \min\, f(x)$

$\displaystyle \text{s.t. }\, x \in \mathcal{X} \subset \mathbb{R}^n,$

where  $f$  will always be a convex function, and  $\mathcal{X}$  a convex and compact (i.e. closed and bounded) set with non-empty interior. Let  $f^*$  denote the unknown minimum value, or the problem value.

We will study  problem classes, each being a subset of problems characterized by some restrictions on  $f$ and characteristics of  $\mathcal{X}$  (for example,  $f$  is bounded by some constant  $B,$  and  $\mathcal{X}$  is contained in a euclidean ball with radius  $R$).  The  black box model  is then constructed as follows. Assume that an  oracle   $\mathcal{O}$   is given, that is, a “machine’,’ which for any query point $x \in \mathcal{X}$  provides an answer with some information on  $f$ in this point (e.g. the value of  $f$).  To solve the problem up to the desired accuracy level  $\varepsilon,$  we employ some algorithm which will always take the following form:

1. Query the oracle at the point  $x_s.$
2. Using the information to date, do the following. Check the termination condition:
• “Yes” -> output the guess on the  minimum value  $f^*$  and terminate;
• “No”  -> chose a new point  $x_{s+1}$  and go to 1.

We hope that the output of the method will at some point find a nearly optimal point (such that the value of the objective function in it differs from the problem value by no greater than $\varepsilon).$ ‘Information to date’ implies (1) the information from all of the previous oracle responses, and (2) the full information on  $\mathcal{X}.$

Definition  (Oracle complexity of the problem class)

Oracle complexity of a problem class  $\mathcal{P}$  (with respect to the oracle  $\mathcal{O}$)  is

$\displaystyle \mathcal{C}_\mathcal{P} (\epsilon) = \min_{M \in \mathcal{M}_\mathcal{O}} \max_{P \in \mathcal{P}}\, C(M, P, \varepsilon)$

where  $C(M, P, \varepsilon)$  is the number of queries to  $\mathcal{O}$  which a method  $M$ commits to solve the problem instance  $P$  up to the accuracy level  $\varepsilon.$

For a given class  $\mathcal{P},$  often the “reasonable” oracle  $\mathcal{O}$  is determined by it (and therefore we omit $\mathcal{O}$ in the complexity notation, slightly confusing the notation). With fixed problem class and oracle,  $\mathcal{C}_\mathcal{P}(\varepsilon)$  is, of course, a decreasing function of  $\varepsilon.$

The study of a class  $\mathcal{P}$  usually consists in providing the upper and lower bounds on its oracle complexity  $\mathcal{C}_\mathcal{P}(\varepsilon)$  as a function of  $\varepsilon.$  If they coincide up to a multiplicative constant (which may depend on every quantity here except for $\epsilon$), then  $\mathcal{P}$  is considered to be fully studied.

Upper and lower bounds are provided differently. An upper bound may be given by some fixed method for which it may be proved that the method performs not more than a given number of oracle calls,  for a fixed  $\varepsilon,$  for any problem of the class. A lower bound provides a problem instance from  $\mathcal{P}$  which takes not less than some specified number of oracle calls, for any method to be used. Besides that, bounds are proved with the adversary idea, namely, that a hard problem instance may be built gradually as the method calls the oracle (it is easier to play for the adversary since he makes the second step in this two-step game, that is why such bounds are lower bounds).

As previously noted, our interest will be in huge-dimensional problems meaning that the dimension of the problem class  $n$  may be very large (from ten thousands to billions and more). Here the situation is as follows. One may often get upper and lower bounds on complexities with the same dependence on $\epsilon$ but with multiplicative constants growing with  $n.$  Whereas for many problem classes these constants grow dramatically fast (exponentially and even faster) with $n,$  it occurs that for some of them (essentially when the first-order information about the function is disposable), one may come up with lower bounds are independent of the problem dimension. Given such lower bounds, the challenge is in proving dimension-free upper bounds by discovering the corresponding ([almost] dimension-free) efficient methods.