Study notes from Convex Optimization by Stephen Boyd, Lieven Vandenberghe.

## 1. Mathematical optimization

$\qquad$A mathematical optimization problem has the following form

$\begin{array}{ll} \mathrm{minimize} & f_0(x) \\ \mathrm{subject to} & f_i(x) \leq b_i, i = 1, \cdots, m \tag{1.1} \end{array}$

Here the vector $x = (x_1, \cdots, x_n)$ is the optimization variable of the problem, the function $f_0 : \boldsymbol{R}^n \rightarrow \boldsymbol{R}$ is the objective function, the function $f_i : \boldsymbol{R}^n \rightarrow \boldsymbol{R}, i = 1, \cdots, m$ are the constraint function, and the constrants $b_1, \cdots, b_m$ are the limits (or bounds) for the constraints. A vector $x^*$ is called optimal (or solution) of the problem $(1.1)$, if it has the smallest objective value among all vectors that satisfy the constraints: for any $z$ with $f_1(z) \leq b_1, \cdots, f_m(z) \leq b_m$, we have $f_0(z) \geq f_0(x^*)$.

## 2. Convex optimization

$\qquad$The optimization problem $(1.1)$ is called a convex optimization if the objective and constraint functions $f_0, \cdots, f_m$ are convex, $i.e.$, satisfy

$\begin{array}{c} f_i(\alpha x + \beta y) \leq \alpha f_i(x) + \beta f_i(y) \tag{2.1} \end{array}$

for all $x, y \in \boldsymbol{R}^n$ and all $\alpha, \beta \in \boldsymbol{R}$ with $\alpha + \beta = 1, \alpha \geq 0, \beta \geq 0$.

$\qquad$There is in general no analytical formula for the solution of convex optimization problems, but there are very effective methods for solving them. Interior-point methods work very well in practice, and in some cases can be proved to solve the problem to a specified accuracy with a number of operations that does not exceed a polynomial of the problem dimensions. Interior-point methods can solve convex problem in a number of steps of iterations that is almost always in the range between 10 and 100. Ignoring any structure in the problem (such as sparsity), each step requires on the order of $\max{\{n^3, n^2 m, F\}}$ operations, where $F$ is the cost of evaluating the first and second derivatives of the objective and constraint functions $f_0, \cdots, f_m$.

$\qquad$We cannot yet claim that solving general convex optimization problems is a mature technology, but it is reasonable to expect that solving general convex optimization problems will become a technology within a few years. And for some subclasses of convex of convex optimization problems, like second-order cone programming or geometric programming, it is fair to say that interior-point methods are approaching a technology.

$\qquad$There are two very widely known and used special subclasses of convex optimization: least-squares and linear programming, which can be solved by efficient method.

### 2.1 Least squares

$\qquad$A least-squares problem is an optimization problem with no constraints ($i.e.$, $m = 0$) and an objective which is a sum of squares of terms of form $a_i^T x - b_i$:

$\begin{array}{ll} \mathrm{minimize} & f_0(x) = \parallel{Ax - b}\parallel^2_2 = \sum_{i=1}^k (a_i^T x - b_i)^2 \tag{2.2} \end{array}$

Here $A \in \boldsymbol{R}^{k \times n}$ (with $k \geq n$), $a_i^T$ are the rows of $A$, and the vector $x \in \boldsymbol{R}^n$ is the optimization variable.

Solving least-squares problems

$\qquad$The solution of a least-squares problem $(2.1)$ can be reduced to solving a set of linear equations:

$\begin{array}{c} (A^T A)x = A^T b \end{array}$

So we have the analytical solution $x = (A^T A)^{-1} A^T b$. For least-squares problems we have good algorithms for solving the problem to high accuracy, with very high reliabbility. The least-swqures problem can be solved in $O(n^2 k)$ time complexity. If the matrix $A$ is sparse, which means that it has far fewer than $kn$ nonzero entries, we can solve the least-squares problem much faster than order $n^2 k$.

$\qquad$In weighted least-squares, the weighted least-squares cost is minimized:

$\begin{array}{c} \sum_{i=1}^k w_i(a_i^T x - b_i)^2 \end{array}$

where $w_1,\cdots,w_k$ are positive.

$\qquad$Another technique in least-squares is regularization, in which extra terms are added to the cost function. In the simplest case, a positive multiple of the sum of squares of the variables is added to the cost function:

$\begin{array}{c} \sum_{i=1}^k (a_i^T x - b_i)^2 + \rho \sum_{i=1}^n x_i^2 \end{array}$

where $\rho > 0$. The extra terms penalize large values of x, and result in a sensible solution in cases when minimizing the first sum only does not. The parameter $\rho$ is chosen by the user to give the right trade-off between making the original objective function $\sum_{i=1}^k (a_i^T x - b_i)^2$ small, while keeping $\sum_{i=1}^n x_i^2$ not too big. Regularization comes up in statistical estimation when the vector $x$ to be estimated is given a prior distribution.

### 2.2 Linear programming

$\qquad$The optimization problem $(1.1)$ is called a linear programming if the objective and constraint functions $f_0, \cdots, f_m$ are linear, $i.e.$, satisfy

$\begin{array}{c} f_i(\alpha x + \beta y) = \alpha f_i(x) + \beta f_i(y) \tag{2.3} \end{array}$

for all $x, y \in \boldsymbol{R}^n$ and all $\alpha, \beta \in \boldsymbol{R}$. $\qquad$If the optimization problem is bot linear, it is called a nonlinear programming.

$\qquad$Specifically, the standard definition of linear programming is given as follow, in which the objective and all constraint functions are linear:

$\begin{array}{ll} \mathrm{minimize} & c^T x \\ \mathrm{subject\ to} & a_i^T x \leq b_i, i = 1, \cdots, m \tag{2.4} \end{array}$

Here the vectors $c, a_1, \cdots, a_m \in \boldsymbol{R}^n$ and scalars $b_1, \cdots, b_m \in \boldsymbol{R}$ are problem parameters that specify the objective and constraint functions.

Solving linear programming

$\qquad$There is no simple analytical formula for the solution of a linear programming, but there are a variety of very effective methods for solving them, including Dantzig’s simplex method, and the more recent interior-point methods.
$\qquad$While we cannot give the exact number of arithmetic operations required to solve a linear programming, we can establish rigorous bounds on the number of operations required to solve a linear programming, to a given accuracy, using an interior-point method. The complexity in practice is order $n^2 m$ (assuming $m \geq n$) but with a constant that is less well characterized than for least-squares. If the problem is sparse, or has some other exploitable structure, we can often solve problems with tens or hundreds of thousands of variables and constraints.
$\qquad$As with least-squares problems, it is still a challenge to solve extremely large linear programs, or to solve linear programs with exacting real-time computing requirements. But, like least-squares, we can say that solving (most) linear programs is a mature technology. Linear programming solvers can be embeded in many tools and applications.

Solving linear programming

$\qquad$As a simple example, consider the Chebyshev approximation problem:

$\begin{array}{ll} \mathrm{minimize} & \max_{i=1,\cdots,k} |a_i^T x - b_i| \end{array}$

Here $x \in \boldsymbol{R}^n$ is the variable, and $a_1, \cdots, a_k \in \boldsymbol{R}^n$, $b_1, \cdots, b_k \in \boldsymbol{R}$ are parameters that specify the problem instance. Note the resemblance to least-squares problem $(2.1)$. For both problems, the objective is a measure of the size of the terms $a_i^T -b_i$. In least-squares, we use the sum of squares of the terms as objective, whereas in Chebyshev approximation, we use the maximum of the absolute values. One other important distinction is that the objective function in the Chebyshev approximation problem is not differentiable, while the objective in the least-squares problem $(2.1)$ is quadratic, and therefore differentiable.

$\qquad$The Chebyshev approximation problem can be solved by solving the linear programming:

$\begin{array}{ll} \mathrm{minimize} & t \\ \mathrm{subject\ to} & a_i^T x - t \leq b_i, i = 1, \cdots, k \\ & -a_i^T x - t \leq - b_i, i = 1, \cdots, k \end{array}$

with variables $x \in \boldsymbol{R}^n$ and $t \in \boldsymbol{R}$. Since linear programming is readily solved, the Chebyshev approximation problem is therefore readily solved.

## 3. Nonlinear optimization

$\qquad$Nonlinear optimization problem means that the objective or constraint functions are not linear, but not known to be convex. Sadly, there are no effective methods for solving the general nonlinear prgramming problem. Even simple looking problems with as few as ten variables can be extremely challenging, while problems with a few hundreds of variables can be intractable. Methods for the general nonlinear programming problem therefore take several different aproaches, each of which involves some compromise.

### 3.1 Local optimization

$\qquad$In local optimization, the compromise is to give up seeking the optimal $x$, which minimizes the objective over all feasible points. Instead we seek a point that is only locally optimal, which means that it minimizes the objective function among feasiable points that are near it, but is not guaranteed to have a lower objective value than all other feasible points. A large fraction of the research on general nonlinear programming has focused on methods for local optimization, which as a consequence are well developed.

$\qquad$Local optimization methods can be fast, can handle large-scale problems, and are widely applicable, since they only require differentiability of the objective and constraint functions. As a result, local optimization methods are widely used in applications where there is value in finding a good poing, if not the very best.

$\qquad$There are several disadvantages of local optimization methods, beyond not finding the true, globally optimal solution. The methods require an initial guess for the optimization variable. This initial guess or starting point is critical, and can greatly affect the objective value of the local solution obtained. Little information is provided about how far from globally optimal the local solution is. Local optimization methods are often sensitive to algorithm parameter values, which may need to be adjusted for a particular problem, or family of problems.

$\qquad$Using a local optimization method is trickier than solving a least-squares problem, linear programming, or convex optimization problem. It involves experimenting with the choice of algorithm, adjusting algorithm parameters, and finding a good enough initial guess (when one instance is to be solved) or a method for producing enough initial guess (when a family of problems is to be solved).

### 3.2 Global optimization

$\qquad$In global optimization, the true global solution of the optimization problem is found, the compromise is efficiency. The worst-case complexity of global optimization methods grows exponentially with the problem sizes $n$ and $m$. The hope is that in practice, for the particular problem instances encountered, the method is far faster. While this favorable situation does occur, it is not typical. Even small problems, with a few tens of variables, can take a very long time to solve.

$\qquad$Global optimization is used for problems with a small number of variables, where computing time is not critical, and the value of finding the true global solution is very high. A local optimization method can rapidly find a set of parameter values that is bad, but not guaranteed to be the absolute worst possible. If a local optimization method finds parameter values that yield unacceptable performance, it has succeeded in determining that the system is not reliable. But a local optimization method cannot certify the system as reliable, it can only fail to find bad parameter values. A global optimization method, in contrast, will find the absolute worst values of the parameters, and if the associated performance is acceptable, can certify the system as safe. The cost is computation time, which can be very large, even for a relatively small number of parameters. But it may be worth it in cases where the value of certifying the performance is high, or the cost of being wrong about the reliability or safety is high.

## 4. Role of convex optimization in nonconvex problems

$\qquad$Convex optimization also plays an important role in problems that are not convex.

### 4.1 Initializatoin for local optimization

One obvious use is to combine convex optimization with a local optimization method. Starting with a nonconvex problem, we first find an approximate, but convex, formulation of the problem. By solving this approximate problem, which can be done easily and without an initial guess, we obtain the the exact solution to the approximate convex problem. This point is then used as the starting poing for a local optimization method, applied to the original nonconvex problem.

### 4.2 Convex heuristics for nonconvex optimization

$\qquad$Convex optimization is the basis for several heuristics for solving nonconvex problems.

$\qquad$One interesting example is the problem of finding a sparse vector $x$ that satisfies some constraints. While this is a difficult combinatorial problem, there are some simple heuristics, based on convex optimization, that often find fairly sparse solutions.

$\qquad$Another broad example is given by randomized algorithms, in which an approximate solution to a nonconvex problem is found by drawing some number of candidates from a probability distribution, and taking the best one found as the approximate solution.

### 4.3 Bounds for global optimization

$\qquad$Many methods for global optimization require a cheaply computable lower bound on the optimal value of the nonconvex problem… Two standard methods for doing this are based on convex optimization. In relaxation, each nonconvex constraint is replaced with a looser, but convex, constraint. In Lagrangian relaxation, the Lagrangian dual problem is solved. This problem is convex, and provides a lower bound on the optimal value of the nonconvex problem.