Study notes from Convex Optimization by Stephen Boyd, Lieven Vandenberghe.

1. Mathematical optimization

\qquadA mathematical optimization problem has the following form

minimizef0(x)subjecttofi(x)bi,i=1,,m(1.1)\begin{array}{ll} \mathrm{minimize} & f_0(x) \\ \mathrm{subject to} & f_i(x) \leq b_i, i = 1, \cdots, m \tag{1.1} \end{array}

Here the vector x=(x1,,xn)x = (x_1, \cdots, x_n) is the optimization variable of the problem, the function f0:RnRf_0 : \boldsymbol{R}^n \rightarrow \boldsymbol{R} is the objective function, the function fi:RnR,i=1,,mf_i : \boldsymbol{R}^n \rightarrow \boldsymbol{R}, i = 1, \cdots, m are the constraint function, and the constrants b1,,bmb_1, \cdots, b_m are the limits (or bounds) for the constraints. A vector xx^* is called optimal (or solution) of the problem (1.1)(1.1), if it has the smallest objective value among all vectors that satisfy the constraints: for any zz with f1(z)b1,,fm(z)bmf_1(z) \leq b_1, \cdots, f_m(z) \leq b_m, we have f0(z)f0(x)f_0(z) \geq f_0(x^*).

2. Convex optimization

\qquadThe optimization problem (1.1)(1.1) is called a convex optimization if the objective and constraint functions f0,,fmf_0, \cdots, f_m are convex, i.e.i.e., satisfy

fi(αx+βy)αfi(x)+βfi(y)(2.1)\begin{array}{c} f_i(\alpha x + \beta y) \leq \alpha f_i(x) + \beta f_i(y) \tag{2.1} \end{array}

for all x,yRnx, y \in \boldsymbol{R}^n and all α,βR\alpha, \beta \in \boldsymbol{R} with α+β=1,α0,β0\alpha + \beta = 1, \alpha \geq 0, \beta \geq 0.

\qquadThere is in general no analytical formula for the solution of convex optimization problems, but there are very effective methods for solving them. Interior-point methods work very well in practice, and in some cases can be proved to solve the problem to a specified accuracy with a number of operations that does not exceed a polynomial of the problem dimensions. Interior-point methods can solve convex problem in a number of steps of iterations that is almost always in the range between 10 and 100. Ignoring any structure in the problem (such as sparsity), each step requires on the order of max{n3,n2m,F}\max{\{n^3, n^2 m, F\}} operations, where FF is the cost of evaluating the first and second derivatives of the objective and constraint functions f0,,fmf_0, \cdots, f_m.

\qquadWe cannot yet claim that solving general convex optimization problems is a mature technology, but it is reasonable to expect that solving general convex optimization problems will become a technology within a few years. And for some subclasses of convex of convex optimization problems, like second-order cone programming or geometric programming, it is fair to say that interior-point methods are approaching a technology.

\qquadThere are two very widely known and used special subclasses of convex optimization: least-squares and linear programming, which can be solved by efficient method.

2.1 Least squares

\qquadA least-squares problem is an optimization problem with no constraints (i.e.i.e., m=0m = 0) and an objective which is a sum of squares of terms of form aiTxbia_i^T x - b_i:

minimizef0(x)=Axb22=i=1k(aiTxbi)2(2.2)\begin{array}{ll} \mathrm{minimize} & f_0(x) = \parallel{Ax - b}\parallel^2_2 = \sum_{i=1}^k (a_i^T x - b_i)^2 \tag{2.2} \end{array}

Here ARk×nA \in \boldsymbol{R}^{k \times n} (with knk \geq n), aiTa_i^T are the rows of AA, and the vector xRnx \in \boldsymbol{R}^n is the optimization variable.

Solving least-squares problems

\qquadThe solution of a least-squares problem (2.1)(2.1) can be reduced to solving a set of linear equations:

(ATA)x=ATb\begin{array}{c} (A^T A)x = A^T b \end{array}

So we have the analytical solution x=(ATA)1ATbx = (A^T A)^{-1} A^T b. For least-squares problems we have good algorithms for solving the problem to high accuracy, with very high reliabbility. The least-swqures problem can be solved in O(n2k)O(n^2 k) time complexity. If the matrix AA is sparse, which means that it has far fewer than knkn nonzero entries, we can solve the least-squares problem much faster than order n2kn^2 k.

\qquadIn weighted least-squares, the weighted least-squares cost is minimized:

i=1kwi(aiTxbi)2\begin{array}{c} \sum_{i=1}^k w_i(a_i^T x - b_i)^2 \end{array}

where w1,,wkw_1,\cdots,w_k are positive.

\qquadAnother technique in least-squares is regularization, in which extra terms are added to the cost function. In the simplest case, a positive multiple of the sum of squares of the variables is added to the cost function:

i=1k(aiTxbi)2+ρi=1nxi2\begin{array}{c} \sum_{i=1}^k (a_i^T x - b_i)^2 + \rho \sum_{i=1}^n x_i^2 \end{array}

where ρ>0\rho > 0. The extra terms penalize large values of x, and result in a sensible solution in cases when minimizing the first sum only does not. The parameter ρ\rho is chosen by the user to give the right trade-off between making the original objective function i=1k(aiTxbi)2\sum_{i=1}^k (a_i^T x - b_i)^2 small, while keeping i=1nxi2\sum_{i=1}^n x_i^2 not too big. Regularization comes up in statistical estimation when the vector xx to be estimated is given a prior distribution.

2.2 Linear programming

\qquadThe optimization problem (1.1)(1.1) is called a linear programming if the objective and constraint functions f0,,fmf_0, \cdots, f_m are linear, i.e.i.e., satisfy

fi(αx+βy)=αfi(x)+βfi(y)(2.3)\begin{array}{c} f_i(\alpha x + \beta y) = \alpha f_i(x) + \beta f_i(y) \tag{2.3} \end{array}

for all x,yRnx, y \in \boldsymbol{R}^n and all α,βR\alpha, \beta \in \boldsymbol{R}. \qquadIf the optimization problem is bot linear, it is called a nonlinear programming.

\qquadSpecifically, the standard definition of linear programming is given as follow, in which the objective and all constraint functions are linear:

minimizecTxsubject toaiTxbi,i=1,,m(2.4)\begin{array}{ll} \mathrm{minimize} & c^T x \\ \mathrm{subject\ to} & a_i^T x \leq b_i, i = 1, \cdots, m \tag{2.4} \end{array}

Here the vectors c,a1,,amRnc, a_1, \cdots, a_m \in \boldsymbol{R}^n and scalars b1,,bmRb_1, \cdots, b_m \in \boldsymbol{R} are problem parameters that specify the objective and constraint functions.

Solving linear programming

\qquadThere is no simple analytical formula for the solution of a linear programming, but there are a variety of very effective methods for solving them, including Dantzig’s simplex method, and the more recent interior-point methods.
\qquadWhile we cannot give the exact number of arithmetic operations required to solve a linear programming, we can establish rigorous bounds on the number of operations required to solve a linear programming, to a given accuracy, using an interior-point method. The complexity in practice is order n2mn^2 m (assuming mnm \geq n) but with a constant that is less well characterized than for least-squares. If the problem is sparse, or has some other exploitable structure, we can often solve problems with tens or hundreds of thousands of variables and constraints.
\qquadAs with least-squares problems, it is still a challenge to solve extremely large linear programs, or to solve linear programs with exacting real-time computing requirements. But, like least-squares, we can say that solving (most) linear programs is a mature technology. Linear programming solvers can be embeded in many tools and applications.

Solving linear programming

\qquadAs a simple example, consider the Chebyshev approximation problem:

minimizemaxi=1,,kaiTxbi\begin{array}{ll} \mathrm{minimize} & \max_{i=1,\cdots,k} |a_i^T x - b_i| \end{array}

Here xRnx \in \boldsymbol{R}^n is the variable, and a1,,akRna_1, \cdots, a_k \in \boldsymbol{R}^n, b1,,bkRb_1, \cdots, b_k \in \boldsymbol{R} are parameters that specify the problem instance. Note the resemblance to least-squares problem (2.1)(2.1). For both problems, the objective is a measure of the size of the terms aiTbia_i^T -b_i. In least-squares, we use the sum of squares of the terms as objective, whereas in Chebyshev approximation, we use the maximum of the absolute values. One other important distinction is that the objective function in the Chebyshev approximation problem is not differentiable, while the objective in the least-squares problem (2.1)(2.1) is quadratic, and therefore differentiable.

\qquadThe Chebyshev approximation problem can be solved by solving the linear programming:

minimizetsubject toaiTxtbi,i=1,,kaiTxtbi,i=1,,k\begin{array}{ll} \mathrm{minimize} & t \\ \mathrm{subject\ to} & a_i^T x - t \leq b_i, i = 1, \cdots, k \\ & -a_i^T x - t \leq - b_i, i = 1, \cdots, k \end{array}

with variables xRnx \in \boldsymbol{R}^n and tRt \in \boldsymbol{R}. Since linear programming is readily solved, the Chebyshev approximation problem is therefore readily solved.

3. Nonlinear optimization

\qquadNonlinear optimization problem means that the objective or constraint functions are not linear, but not known to be convex. Sadly, there are no effective methods for solving the general nonlinear prgramming problem. Even simple looking problems with as few as ten variables can be extremely challenging, while problems with a few hundreds of variables can be intractable. Methods for the general nonlinear programming problem therefore take several different aproaches, each of which involves some compromise.

3.1 Local optimization

\qquadIn local optimization, the compromise is to give up seeking the optimal xx, which minimizes the objective over all feasible points. Instead we seek a point that is only locally optimal, which means that it minimizes the objective function among feasiable points that are near it, but is not guaranteed to have a lower objective value than all other feasible points. A large fraction of the research on general nonlinear programming has focused on methods for local optimization, which as a consequence are well developed.

\qquadLocal optimization methods can be fast, can handle large-scale problems, and are widely applicable, since they only require differentiability of the objective and constraint functions. As a result, local optimization methods are widely used in applications where there is value in finding a good poing, if not the very best.

\qquadThere are several disadvantages of local optimization methods, beyond not finding the true, globally optimal solution. The methods require an initial guess for the optimization variable. This initial guess or starting point is critical, and can greatly affect the objective value of the local solution obtained. Little information is provided about how far from globally optimal the local solution is. Local optimization methods are often sensitive to algorithm parameter values, which may need to be adjusted for a particular problem, or family of problems.

\qquadUsing a local optimization method is trickier than solving a least-squares problem, linear programming, or convex optimization problem. It involves experimenting with the choice of algorithm, adjusting algorithm parameters, and finding a good enough initial guess (when one instance is to be solved) or a method for producing enough initial guess (when a family of problems is to be solved).

3.2 Global optimization

\qquadIn global optimization, the true global solution of the optimization problem is found, the compromise is efficiency. The worst-case complexity of global optimization methods grows exponentially with the problem sizes nn and mm. The hope is that in practice, for the particular problem instances encountered, the method is far faster. While this favorable situation does occur, it is not typical. Even small problems, with a few tens of variables, can take a very long time to solve.

\qquadGlobal optimization is used for problems with a small number of variables, where computing time is not critical, and the value of finding the true global solution is very high. A local optimization method can rapidly find a set of parameter values that is bad, but not guaranteed to be the absolute worst possible. If a local optimization method finds parameter values that yield unacceptable performance, it has succeeded in determining that the system is not reliable. But a local optimization method cannot certify the system as reliable, it can only fail to find bad parameter values. A global optimization method, in contrast, will find the absolute worst values of the parameters, and if the associated performance is acceptable, can certify the system as safe. The cost is computation time, which can be very large, even for a relatively small number of parameters. But it may be worth it in cases where the value of certifying the performance is high, or the cost of being wrong about the reliability or safety is high.

4. Role of convex optimization in nonconvex problems

\qquadConvex optimization also plays an important role in problems that are not convex.

4.1 Initializatoin for local optimization

One obvious use is to combine convex optimization with a local optimization method. Starting with a nonconvex problem, we first find an approximate, but convex, formulation of the problem. By solving this approximate problem, which can be done easily and without an initial guess, we obtain the the exact solution to the approximate convex problem. This point is then used as the starting poing for a local optimization method, applied to the original nonconvex problem.

4.2 Convex heuristics for nonconvex optimization

\qquadConvex optimization is the basis for several heuristics for solving nonconvex problems.

\qquadOne interesting example is the problem of finding a sparse vector xx that satisfies some constraints. While this is a difficult combinatorial problem, there are some simple heuristics, based on convex optimization, that often find fairly sparse solutions.

\qquadAnother broad example is given by randomized algorithms, in which an approximate solution to a nonconvex problem is found by drawing some number of candidates from a probability distribution, and taking the best one found as the approximate solution.

4.3 Bounds for global optimization

\qquadMany methods for global optimization require a cheaply computable lower bound on the optimal value of the nonconvex problem… Two standard methods for doing this are based on convex optimization. In relaxation, each nonconvex constraint is replaced with a looser, but convex, constraint. In Lagrangian relaxation, the Lagrangian dual problem is solved. This problem is convex, and provides a lower bound on the optimal value of the nonconvex problem.