
4 Unconstrained Gradient-Based Optimization

In this chapter we focus on unconstrained optimization problems with continuous design variables, which we can write as

\underset{x}{\text{minimize}} \; f(x) \, ,

where x = [x_1, \ldots, x_n] is composed of the design variables that the optimization algorithm can change.

We solve these problems using gradient information to determine a series of steps from a starting guess (or initial design) to the optimum, as shown in Figure 4.1.

Gradient-based optimization starts with a guess, x_{0}, and takes a sequence of steps in n-dimensional space that converge to an optimum, x^*.

Figure 4.1: Gradient-based optimization starts with a guess, x_{0}, and takes a sequence of steps in n-dimensional space that converge to an optimum, x^*.

We assume the objective function to be nonlinear, C^2 continuous, and deterministic. We make no assumption about unimodality or multimodality, and there is no guarantee that the algorithm finds the global optimum. Referring to the attributes that classify an optimization problem (Figure 1.22), the optimization algorithms discussed in this chapter range from first to second order, perform a local search, and evaluate the function directly.

The algorithms are based on mathematical principles rather than heuristics.

Although most engineering design problems are constrained, the constrained optimization algorithms in Chapter 5 build on the methods explained in the current chapter.

4.1 Fundamentals

To determine the step directions shown in Figure 4.1, gradient-based methods need the gradient (first-order information). Some methods also use curvature (second-order information). Gradients and curvature are required to build a second-order Taylor series, a fundamental building block in establishing optimality and developing gradient-based optimization algorithms.

4.1.1 Derivatives and Gradients

Recall that we are considering a scalar objective function f(x), where x is the vector of design variables, x = [x_1, x_2, \ldots, x_n].

The gradient of this function, \nabla f(x), is a column vector of first-order partial derivatives of the function with respect to each design variable:

\nabla f(x) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix} \, ,

where each partial derivative is defined as the following limit:

\frac{\partial f}{\partial x_i} = \lim_{\varepsilon \rightarrow 0} \frac{f(x_1, \ldots, x_i + \varepsilon, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{\varepsilon} \, .

Each component in the gradient vector quantifies the function’s local rate of change with respect to the corresponding design variable, as shown in Figure 4.2 for the two-dimensional case. In other words, these components represent the slope of the function along each coordinate direction. The gradient is a vector pointing in the direction of the greatest function increase from the current point.

Components of the gradient vector in the two-dimensional case.

Figure 4.2:Components of the gradient vector in the two-dimensional case.

The gradient vectors are normal to the surfaces of constant f in n-dimensional space (isosurfaces). In the two-dimensional case, gradient vectors are perpendicular to the function contour lines, as shown in Figure 4.2.[1]

If a function is simple, we can use symbolic differentiation as we did in Example 4.1. However, symbolic differentiation has limited utility for general engineering models because most models are far more complicated; they may include loops, conditionals, nested functions, and implicit equations. Fortunately, there are several methods for computing derivatives numerically; we cover these methods in Chapter 6.
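For a sense of what the simplest numerical approach looks like, the sketch below approximates the gradient one component at a time with forward finite differences, one of the methods covered in Chapter 6. The example function and the step size h are arbitrary choices for illustration, not prescribed by the text.

```python
import numpy as np

def fd_gradient(f, x, h=1e-7):
    """Approximate the gradient of f at x with forward finite differences."""
    x = np.asarray(x, dtype=float)
    f0 = f(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy()
        xp[i] += h  # perturb one design variable at a time
        grad[i] = (f(xp) - f0) / h
    return grad

# Arbitrary two-variable example function
def f(x):
    return x[0]**2 + 3.0 * x[0] * x[1] + 2.0 * x[1]**2

print(fd_gradient(f, [1.0, 2.0]))  # close to the exact gradient [8, 11]
```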

Each gradient component has units that correspond to the units of the function divided by the units of the corresponding variable. Because the variables might represent different physical quantities, each gradient component might have different units.

From an engineering design perspective, it might be helpful to think about relative changes, where the derivative is given as the percentage change in the function for a 1 percent increase in the variable. This relative derivative can be computed by nondimensionalizing both the function and the variable, that is,

\frac{\partial f}{\partial x}\frac{x}{f} \, ,

where f and x are the values of the function and variable, respectively, at the point where the derivative is computed.

The gradient components quantify the function’s rate of change in each coordinate direction (x_i), but sometimes we are interested in the rate of change in a direction that is not a coordinate direction. The rate of change in a direction p is quantified by a directional derivative, defined as

\nabla_p f(x) = \lim_{\varepsilon \rightarrow 0} \frac{f(x + \varepsilon p) - f(x)}{\varepsilon} \, .

We can find this derivative by projecting the gradient onto the desired direction p using the dot product

\nabla_p f(x) = \nabla f^\intercal p \, .

When p is a unit vector aligned with one of the Cartesian coordinates i, this dot product yields the corresponding partial derivative \partial f / \partial x_i. A two-dimensional example of this projection is shown in Figure 4.5.

Projection of the gradient in an arbitrary unit direction p.

Figure 4.5: Projection of the gradient in an arbitrary unit direction p.
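To make the projection concrete, the following sketch evaluates \nabla f^\intercal \hat p for an arbitrary example function and compares it against a finite-difference estimate of the slope along the same direction; the function and the point are made up for illustration.

```python
import numpy as np

def f(x):
    # arbitrary smooth example function
    return x[0]**2 + 3.0 * x[0] * x[1] + 2.0 * x[1]**2

def grad_f(x):
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + 4.0 * x[1]])

x = np.array([1.0, 2.0])
p = np.array([1.0, 1.0])
p_hat = p / np.linalg.norm(p)  # normalize to get the slope in the original units

directional = grad_f(x) @ p_hat            # gradient projected onto p
fd = (f(x + 1e-6 * p_hat) - f(x)) / 1e-6   # finite-difference check of the same slope
print(directional, fd)
```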

From the gradient projection, we can see why the gradient is the direction of the steepest increase. If we use this definition of the dot product,

\nabla_p f(x) = \nabla f^\intercal p = \|\nabla f\| \|p\| \cos\theta \, ,

where \theta is the angle between the two vectors, we can see that this is maximized when \theta = 0^\circ. That is, the directional derivative is largest when p points in the same direction as \nabla f.

If \theta is in the interval (-90, 90)^\circ, the directional derivative is positive and is thus in a direction of increase, as shown in Figure 4.6. If \theta is in the interval (90, 180]^\circ, the directional derivative is negative, and p points in a descent direction. Finally, if \theta = \pm 90^\circ, the directional derivative is 0, and thus the function value does not change for small steps; it is locally flat in that direction. This condition occurs when \nabla f and p are orthogonal; therefore, the gradient is orthogonal to the function isosurfaces.

The gradient \nabla f is always orthogonal to contour lines (surfaces), and the directional derivative in the direction p is given by \nabla f^\intercal p.

Figure 4.6: The gradient \nabla f is always orthogonal to contour lines (surfaces), and the directional derivative in the direction p is given by \nabla f^\intercal p.

To get the correct slope in the original units of x, the direction should be normalized as \hat p = p / \|p\|. However, in some of the gradient-based algorithms of this chapter, p is not normalized because the length contains useful information. If p is not normalized, the slopes and the variable axis are scaled by a constant.

4.1.2 Curvature and Hessians

The rate of change of the gradient—the curvature—is also useful information because it tells us if a function’s slope is increasing (positive curvature), decreasing (negative curvature), or stationary (zero curvature).

In one dimension, the gradient reduces to a scalar (the slope), and the curvature is also a scalar that can be calculated by taking the second derivative of the function. To quantify curvature in n dimensions, we need to take the partial derivative of each gradient component j with respect to each coordinate direction i, yielding

\frac{\partial^2 f}{\partial x_i \partial x_j} \, .

If the function f has continuous second partial derivatives, the order of differentiation does not matter, and the mixed partial derivatives are equal; thus

\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i} \, .

This property is known as the symmetry of second derivatives or equality of mixed partials.[2]

Considering all gradient components and their derivatives with respect to all coordinate directions results in a second-order tensor. This tensor can be represented as a square (n \times n) matrix of second-order partial derivatives called the Hessian:

H_f(x) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \, .

The Hessian is expressed in index notation as:

{H_f}_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} \, .

Because of the symmetry of second derivatives, the Hessian is a symmetric matrix with n(n+1)/2 independent elements.

Each row i of the Hessian is a vector that quantifies the rate of change of all components j of the gradient vector with respect to the direction i. On the other hand, each column j of the matrix quantifies the rate of change of component j of the gradient vector with respect to all coordinate directions i. Because the Hessian is symmetric, the rows and columns are transposes of each other, and these two interpretations are equivalent.

We can find the rate of change of the gradient in an arbitrary normalized direction p by taking the product Hp. This yields an n-vector that quantifies the rate of change of the gradient in the direction p, where each component of the vector is the rate of change of the corresponding partial derivative with respect to a movement along p. Therefore, this product is defined as follows:

Hp = \nabla_p \left( \nabla f(x) \right) = \lim_{\varepsilon \rightarrow 0} \frac{\nabla f(x + \varepsilon p) - \nabla f(x)}{\varepsilon} \, .

Because of the symmetry of second derivatives, we can also interpret this as the rate of change in the directional derivative of the function along p with respect to each of the components of p.

To find the curvature of the one-dimensional function along a direction p, we need to project Hp onto direction p as

\nabla_p \left( \nabla_p f(x) \right) = p^\intercal H p \, ,

which yields a scalar quantity. Again, if we want to get the curvature in the original units of x, p should be normalized.

For an n-dimensional Hessian, it is possible to find directions v_i (where i = 1, \ldots, n) along which the projected curvature aligns with that direction, that is,

H v = \kappa v \, .

This is an eigenvalue problem whose eigenvectors represent the principal curvature directions, and the eigenvalues \kappa quantify the corresponding curvatures. If each eigenvector is normalized as \hat v = v / \|v\|, then the corresponding \kappa is the curvature in the original units.
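A minimal numerical illustration of these two operations, the projected curvature p^\intercal H p and the eigenvalue problem for the principal curvatures, is sketched below; the Hessian matrix here is an arbitrary symmetric example.

```python
import numpy as np

# Arbitrary symmetric Hessian used for illustration
H = np.array([[4.0, 1.0],
              [1.0, 3.0]])

p = np.array([1.0, 2.0])
p_hat = p / np.linalg.norm(p)

curvature_along_p = p_hat @ H @ p_hat  # scalar curvature in the direction p
kappa, V = np.linalg.eigh(H)           # principal curvatures and directions

print(curvature_along_p)
print(kappa)  # eigenvalues: principal curvatures
print(V)      # columns: normalized principal curvature directions
```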

4.1.3 Taylor Series

The Taylor series provides a local approximation to a function and is the foundation for gradient-based optimization algorithms.

For an n-dimensional function, the Taylor series can predict the function along any direction p. This is done by projecting the gradient and Hessian onto the desired direction p to get an approximation of the function at any nearby point x + p:[3]

f(x + p) = f(x) + \nabla f(x)^\intercal p + \frac{1}{2} p^\intercal H(x) p + {\mathcal O} \left( \|p\|^3 \right) \, .

We use a second-order Taylor series (ignoring the cubic term) because it results in a quadratic, the lowest-order Taylor series that can have a minimum. For a function that is C^2 continuous, this approximation can be made arbitrarily accurate by making \|p\| small enough.
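As a quick numerical check of this approximation, the sketch below compares the second-order Taylor prediction with the true function value for a small step p; the example function, point, and step are arbitrary.

```python
import numpy as np

def f(x):
    # arbitrary smooth example function
    return np.exp(x[0]) + x[0] * x[1]**2

def grad_f(x):
    return np.array([np.exp(x[0]) + x[1]**2, 2.0 * x[0] * x[1]])

def hess_f(x):
    return np.array([[np.exp(x[0]), 2.0 * x[1]],
                     [2.0 * x[1],   2.0 * x[0]]])

x = np.array([0.5, 1.0])
p = np.array([0.1, -0.05])  # a small step

taylor2 = f(x) + grad_f(x) @ p + 0.5 * p @ hess_f(x) @ p
print(taylor2, f(x + p))  # the two values agree to O(||p||^3)
```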

4.1.4 Optimality Conditions

To find the minimum of a function, we must determine the mathematical conditions that identify a given point x as a minimum. There is only a limited set of problems for which we can prove global optimality, so in general, we are only interested in local optimality.

A point x^* is a local minimum if f(x^*) \le f(x) for all x in the neighborhood of x^*. In other words, there must be no descent direction starting from the local minimum.

A second-order Taylor series expansion about x^* for small steps of size p yields

f(x^* + p) = f(x^*) + \nabla f(x^*)^\intercal p + \frac{1}{2} p^\intercal H(x^*) p + \ldots \, .

For x^* to be an optimal point, we must have f(x^* + p) \ge f(x^*) for all p. This implies that the first- and second-order terms in the Taylor series have to be nonnegative, that is,

\nabla f(x^*)^\intercal p + \frac{1}{2} p^\intercal H(x^*) p \ge 0 \quad \text{for all} \quad p \, .

Because the magnitude of p is small, we can always find a p such that the first term dominates. Therefore, we require that

\nabla f(x^*)^\intercal p \ge 0 \quad \text{for all} \quad p \, .

Because p can be in any arbitrary direction, the only way this inequality can be satisfied is if all the elements of the gradient are zero (refer to Figure 4.6),

\nabla f(x^*) = 0 \, .

This is the first-order necessary optimality condition for an unconstrained problem. The condition is necessary because if any element of \nabla f(x^*) were nonzero, there would be descent directions (e.g., p = -\nabla f) for which the inequality would not be satisfied.

Because the gradient term has to be zero, we must now satisfy the remaining term in the inequality (Equation 4.17), that is,

p^\intercal H(x^*) p \ge 0 \quad \text{for all} \quad p \, .

From Equation 4.13, we know that this term represents the curvature in direction p, so this means that the function curvature must be positive or zero when projected in any direction. You may recognize this inequality as the definition of a positive-semidefinite matrix. In other words, the Hessian H(x^*) must be positive semidefinite.

For a matrix to be positive semidefinite, its eigenvalues must all be greater than or equal to zero. Recall that the eigenvalues of the Hessian quantify the principal curvatures, so as long as all the principal curvatures are greater than or equal to zero, the curvature along an arbitrary direction is also greater than or equal to zero.

These conditions on the gradient and curvature are necessary conditions for a local minimum but are not sufficient. They are not sufficient because if the curvature is zero in some direction p (i.e., p^\intercal H(x^*) p = 0), we have no way of knowing if it is a minimum unless we check the third-order term. In that case, even if it is a minimum, it is a weak minimum.

The sufficient conditions for optimality require the curvature to be positive in any direction, in which case we have a strong minimum.

Mathematically, this means that p^\intercal H(x^*) p > 0 for all nonzero p, which is the definition of a positive-definite matrix.

If H is a positive-definite matrix, every eigenvalue of H is positive.[4] Figure 4.11 shows some examples of quadratic functions that are positive definite (all positive eigenvalues), positive semidefinite (nonnegative eigenvalues), indefinite (mixed eigenvalues), and negative definite (all negative eigenvalues).

Quadratic functions with different types of Hessians.

Figure 4.11:Quadratic functions with different types of Hessians.

In summary, the necessary optimality conditions for an unconstrained optimization problem are

\begin{aligned} &\nabla f(x^*) = 0 \\ &H(x^*) \text{ is positive semidefinite} \, . \end{aligned}

The sufficient optimality conditions are

\begin{aligned} &\nabla f(x^*) = 0 \\ &H(x^*) \text{ is positive definite} \, . \end{aligned}

We may be able to solve the optimality conditions analytically for simple problems, as we did in Example 4.7. However, this is not possible in general because the resulting equations might not be solvable in closed form. Therefore, we need numerical methods that solve for these conditions.

When using a numerical approach, we seek points where \nabla f(x^*) = 0, but the entries in \nabla f do not converge to exactly zero because of finite-precision arithmetic. Instead, we define convergence for the first criterion based on the maximum component of the gradient, such that

\|\nabla f\|_\infty \le \tau \, ,

where \tau is some tolerance. A typical absolute tolerance is \tau = 10^{-6}, or a six-order-of-magnitude reduction in the gradient when using a relative tolerance. Absolute and relative criteria are often combined in a metric such as the following:

\|\nabla f\|_\infty \le \tau \left(1 + \|\nabla f_0\|_\infty \right) \, ,

where \nabla f_0 is the gradient at the starting point.
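A one-line implementation of this combined criterion is sketched below; the tolerance and the example gradients are placeholders.

```python
import numpy as np

def gradient_converged(grad, grad0, tau=1e-6):
    """Combined absolute/relative check on the infinity norm of the gradient."""
    return np.linalg.norm(grad, np.inf) <= tau * (1.0 + np.linalg.norm(grad0, np.inf))

# Hypothetical gradients at the starting point and at the current iterate
grad0 = np.array([120.0, -45.0])
grad_k = np.array([3e-5, -8e-5])
print(gradient_converged(grad_k, grad0))
```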

The second optimality condition (that H must be positive semidefinite) is not usually checked explicitly. If we satisfy the first condition, then all we know is that we have reached a stationary point, which could be a maximum, a minimum, or a saddle point. However, as shown in Section 4.4, the search directions for the algorithms of this chapter are always descent directions, and therefore in practice, they should converge to a local minimum.

For a practical algorithm, other exit conditions are often used besides the reduction in the norm of the gradient. A function might be poorly scaled, be noisy, or have other numerical issues that prevent it from ever satisfying this optimality condition (Equation 4.24). To prevent the algorithm from running indefinitely, it is common to set a limit on the computational budget, such as the number of function calls, the number of major iterations, or the clock time. Additionally, to detect a case where the optimizer is not making significant progress and not likely to improve much further, we might set criteria on the minimum step size and the minimum change in the objective. Similar to the conditions on the gradient, the minimum change in step size could be limited as follows:

\|x_k - x_{k-1}\|_\infty < \tau_x \left( 1 + \|x_{k-1}\|_\infty \right) \, .

The absolute and relative conditions on the objective are of the same form, although they only use an absolute value rather than a norm because the objective is scalar.

4.2 Two Overall Approaches to Finding an Optimum

Although the optimality conditions derived in the previous section can be solved analytically to find the function minima, this analytic approach is not possible for functions that result from numerical models. Instead, we need iterative numerical methods to find minima based only on the function values and gradients.

In Chapter 3, we reviewed methods for solving simultaneous systems of nonlinear equations, which we wrote as r(u) = 0. Because the first-order optimality condition (\nabla f = 0) can be written in this residual form (where r = \nabla f and u = x), we could try to use the solvers from Chapter 3 directly to solve unconstrained optimization problems. Although several components of general solvers for r(u) = 0 are used in optimization algorithms, these solvers are not the most effective approaches in their original form. Furthermore, solving \nabla f = 0 is not necessarily sufficient—it finds a stationary point but not necessarily a minimum. Optimization algorithms require additional considerations to ensure convergence to a minimum.

Like the iterative solvers from Chapter 3, gradient-based algorithms start with a guess, x_{0}, and generate a series of points, x_{1}, x_{2}, \ldots, x_{k}, \ldots, that converge to a local optimum, x^*, as previously illustrated in Figure 4.1. At each iteration, some form of the Taylor series about the current point is used to find the next point.

Taylor series quadratic models are only guaranteed to be accurate near the point about which the series is expanded (x_k).

Figure 4.13: Taylor series quadratic models are only guaranteed to be accurate near the point about which the series is expanded (x_k).

A truncated Taylor series is, in general, only a good model within a small neighborhood, as shown in Figure 4.13, which shows three quadratic models of the same function based on three different points.

All quadratic approximations match the local gradient and curvature at the respective points. However, the Taylor series quadratic about the first point (left plot) yields a quadratic without a minimum (the only critical point is a saddle point). The second point (middle plot) yields a quadratic whose minimum is closer to the true minimum. Finally, the Taylor series about the actual minimum point (right plot) yields a quadratic with the same minimum, as expected, but we can see how the quadratic model worsens the farther we are from the point.

Because the Taylor series is only guaranteed to be a good model locally, we need a globalization strategy to ensure convergence to an optimum. Globalization here means making the algorithm robust enough that it can converge to a local minimum when starting from any point in the domain. This should not be confused with finding the global minimum, which is a separate issue (see Tip 4.8). There are two main globalization strategies: line search and trust region.

The line search approach consists of three main steps for every iteration (Figure 4.14):

Line search approach.

Figure 4.14:Line search approach.

  1. Choose a suitable search direction from the current point. The choice of search direction is based on a Taylor series approximation.

  2. Determine how far to move in that direction by performing a line search.

  3. Move to the new point and update all values.

The first two steps can be seen as two separate subproblems. We address the line search subproblem in Section 4.3 and the search direction subproblem in Section 4.4.

Trust-region methods also consist of three steps (Figure 4.15):

Trust-region approach.

Figure 4.15:Trust-region approach.

  1. Create a model about the current point. This model can be based on a Taylor series approximation or another type of surrogate model.

  2. Minimize the model within a trust region around the current point to find the step.

  3. Move to the new point, update values, and adapt the size of the trust region.

We introduce the trust-region approach in Section 4.5. However, we devote more attention to algorithms that use the line search approach because they are more common in general nonlinear optimization.

Both line search and trust-region approaches use iterative processes that must be repeated until some convergence criterion is satisfied. The first step in both approaches is usually referred to as a major iteration, whereas the second step might require more function evaluations corresponding to minor iterations.

Gradient-based unconstrained optimization algorithms that use a line search follow Algorithm 4.1. We start with a guess x_{0} and provide a convergence tolerance \tau for the optimality condition.[5] The final output is an optimal point x^* and the corresponding function value f(x^*). As mentioned in the previous section, there are two main subproblems in line search gradient-based optimization algorithms: choosing the search direction and determining how far to step in that direction. In the next section, we introduce several methods for choosing the search direction. The line search method determines how far to step in the chosen direction and is usually independent of the method for choosing the search direction. Therefore, line search methods can be combined with any method for finding the search direction. However, the search direction method determines the name of the overall optimization algorithm, as we will see in the next section.

The line search starts from a given point x_{k} and searches solely along direction p_{k}.

Figure 4.16: The line search starts from a given point x_{k} and searches solely along direction p_{k}.

For the line search subproblem, we assume that we are given a starting location at x_{k} and a suitable search direction p_{k} along which to search (Figure 4.16). The line search then operates solely on points along direction p_{k} starting from x_{k}, which can be written as

x_{k+1} = x_{k} + \alpha p_{k} \, ,

where the scalar \alpha is always positive and represents how far we go in the direction p_{k}. This equation produces a one-dimensional slice of n-dimensional space, as illustrated in Figure 4.17.

The line search projects the n-dimensional problem onto one dimension, where the independent variable is \alpha.

Figure 4.17: The line search projects the n-dimensional problem onto one dimension, where the independent variable is \alpha.

The line search determines the magnitude of the scalar \alpha_{k}, which in turn determines the next point in the iteration sequence. Even though x_{k} and p_{k} are n-dimensional, the line search is a one-dimensional problem with the goal of selecting \alpha_{k}.

Line search methods require that the search direction p_{k} be a descent direction so that \nabla {f_{k}}^\intercal p_{k} < 0 (see Figure 4.18). This guarantees that f can be reduced by stepping some distance along this direction with a positive \alpha.

The line search direction must be a descent direction.

Figure 4.18:The line search direction must be a descent direction.

The goal of the line search is not to find the value of \alpha that minimizes f \left(x_{k} + \alpha p_{k}\right) but to find a point that is “good enough” using as few function evaluations as possible. This is because finding the exact minimum along the line would require too many evaluations of the objective function and possibly its gradient. Because the overall optimization needs to find a point in n-dimensional space, the search direction might change drastically between line searches, so spending too many iterations on each line search is generally not worthwhile.

Consider the bean function whose contours are shown in Figure 4.19.

The descent direction does not necessarily point toward the minimum, in which case it would be wasteful to do an exact line search.

Figure 4.19:The descent direction does not necessarily point toward the minimum, in which case it would be wasteful to do an exact line search.

At point x_{k}, the direction p_{k} is a descent direction. However, it would be wasteful to spend a lot of effort determining the exact minimum in the p_{k} direction because it would not take us any closer to the minimum of the overall function (the dot on the right side of the plot). Instead, we should find a point that is good enough and then update the search direction.

To simplify the notation for the line search, we define the single-variable function

\phi(\alpha) = f \left(x_{k}+\alpha p_{k} \right) \, ,

where \alpha = 0 corresponds to the start of the line search (x_{k} in Figure 4.17), and thus \phi(0) = f(x_{k}). Then, using x = x_{k} + \alpha p_{k}, the slope of the single-variable function is

\phi^{\prime}(\alpha) = \frac{\partial f(x)}{\partial \alpha} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{\partial x_i}{\partial \alpha} \, .

Because \partial x_i / \partial \alpha is simply the ith component of p_{k}, substituting these derivatives results in

\phi^{\prime}(\alpha) = \nabla f \left(x_{k}+\alpha p_{k} \right)^\intercal p_{k} \, ,

which is the directional derivative along the search direction. The slope at the start of a given line search is

\phi^{\prime}(0) = \nabla {f_{k}}^\intercal p_{k} \, .

Because p_{k} must be a descent direction, \phi^{\prime}(0) is always negative. Figure 4.20 is a version of the one-dimensional slice from Figure 4.17 in this notation. The \alpha axis and the slopes scale with the magnitude of p_{k}.

4.3.1 Sufficient Decrease and Backtracking

The simplest line search algorithm to find a “good enough” point relies on the sufficient decrease condition combined with a backtracking algorithm. The sufficient decrease condition, also known as the Armijo condition, is given by the inequality

\phi(\alpha) \le \phi(0) + \mu_1 \alpha \phi^{\prime}(0) \, ,

where \mu_1 is a constant such that 0 < \mu_1 \le 1.[6] The quantity \alpha \phi^\prime(0) represents the expected decrease of the function, assuming the function continued at the same slope. The multiplier \mu_1 states that Equation 4.31 will be satisfied as long as we achieve even a small fraction of the expected decrease, as shown in Figure 4.21.

The sufficient decrease line has a slope that is a small fraction of the slope at the start of the line search.

Figure 4.21:The sufficient decrease line has a slope that is a small fraction of the slope at the start of the line search.

In practice, this constant is several orders of magnitude smaller than 1, typically \mu_1 = 10^{-4}.

Because p_{k} is a descent direction, and thus \phi^\prime(0) = \nabla {f_{k}}^\intercal p_{k} < 0, there is always a positive \alpha that satisfies this condition for a smooth function.

The concept is illustrated in Figure 4.22, which shows a function with a negative slope at \alpha = 0 and a sufficient decrease line whose slope is a fraction of that initial slope. When starting a line search, we only know the function value and slope at \alpha = 0, so we do not know how the function varies until we evaluate it. Because we do not want to evaluate the function too many times, the first point whose value is below the sufficient decrease line is deemed acceptable. The slope of the sufficient decrease line in Figure 4.22 is exaggerated for illustration purposes; for typical values of \mu_1, the line is indistinguishable from horizontal when plotted.

Sufficient decrease condition.

Figure 4.22:Sufficient decrease condition.

Line search algorithms require a first guess for \alpha. As we will see later, some methods for finding the search direction also provide good guesses for the step length. However, in many cases, we have no idea of the scale of the function, so our initial guess may not be suitable. Even if we do have an educated guess for \alpha, it is only a guess, and the first step might not satisfy the sufficient decrease condition.

A straightforward algorithm that is guaranteed to find a step that satisfies the sufficient decrease condition is backtracking (Algorithm 4.2). This algorithm starts with a maximum step and successively reduces the step by a constant ratio \rho until it satisfies the sufficient decrease condition (a typical value is \rho = 0.5). Because the search direction is a descent direction, we know that we will achieve an acceptable decrease in function value if we backtrack enough.
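The following sketch implements backtracking as just described (it is not a reproduction of Algorithm 4.2); the initial step, the ratio \rho, the constant \mu_1, and the iteration cap are illustrative defaults.

```python
def backtracking(phi, phi0, dphi0, alpha_init=1.0, rho=0.5, mu1=1e-4, max_iter=60):
    """Reduce the step by the ratio rho until sufficient decrease is satisfied.

    phi(alpha) evaluates f(x_k + alpha * p_k); phi0 = phi(0); dphi0 = phi'(0) < 0.
    """
    alpha = alpha_init
    for _ in range(max_iter):  # cap the iterations as a safeguard
        if phi(alpha) <= phi0 + mu1 * alpha * dphi0:
            return alpha
        alpha *= rho
    return alpha
```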

Although backtracking is guaranteed to find a point that satisfies sufficient decrease, there are two undesirable scenarios where this algorithm performs poorly. The first scenario is that the guess for the initial step is far too large, and the step sizes that satisfy sufficient decrease are smaller than the starting step by several orders of magnitude. Depending on the value of ρ\rho, this scenario requires a large number of backtracking evaluations.

The other undesirable scenario is where our initial guess immediately satisfies sufficient decrease. However, the function’s slope is still highly negative, and we could have decreased the function value by much more if we had taken a larger step. In this case, our guess for the initial step is far too small.

Even if our original step size is not too far from an acceptable step size, the basic backtracking algorithm ignores any information we have about the function values and gradients. It blindly takes a reduced step based on a preselected ratio \rho. We can make more intelligent estimates of where an acceptable step is based on the evaluated function values (and gradients, if available). The next section introduces a more sophisticated line search algorithm that deals with these scenarios much more efficiently.

4.3.2 Strong Wolfe Conditions

One major weakness of the sufficient decrease condition is that it accepts small steps that marginally decrease the objective function because \mu_1 in Equation 4.31 is typically small. We could increase \mu_1 (i.e., tilt the red line downward in Figure 4.22) to prevent these small steps; however, that would prevent us from taking large steps that still result in a reasonable decrease. A large step that provides a reasonable decrease is desirable because large steps generally lead to faster convergence. Therefore, we want to prevent overly small steps while not making it more difficult to accept reasonable large steps. We can accomplish this by adding a second condition to construct a more efficient line search algorithm.

Just like guessing the step size, it is difficult to know in advance how much of a function value decrease to expect. However, if we compare the slope of the function at the candidate point with the slope at the start of the line search, we can get an idea if the function is “bottoming out”, or flattening, using the sufficient curvature condition:

The sufficient curvature condition requires the function slope magnitude to be a fraction of the initial slope.

Figure 4.25:The sufficient curvature condition requires the function slope magnitude to be a fraction of the initial slope.

|\phi^{\prime}(\alpha)| \le \mu_2 |\phi^{\prime}(0)| \, .

This condition requires that the magnitude of the slope at the new point be lower than the magnitude of the slope at the start of the line search by a factor of \mu_2, as shown in Figure 4.25. This requirement is called the sufficient curvature condition because by comparing the two slopes, we quantify the function’s rate of change in the slope—the curvature. Typical values of \mu_2 range from 0.1 to 0.9, and the best value depends on the method for determining the search direction and is also problem dependent.

As \mu_2 tends to zero, enforcing the sufficient curvature condition tends toward a point where \phi^{\prime}(\alpha) = 0, which would yield an exact line search.

The sign of the slope at a point satisfying this condition is not relevant; all that matters is that the function slope be shallow enough. The idea is that if the slope \phi^\prime(\alpha) is still negative with a magnitude similar to the slope at the start of the line search, then the step is too small, and we expect the function to decrease even further by taking a larger step. If the slope \phi^\prime(\alpha) is positive with a magnitude similar to that at the start of the line search, then the step is too large, and we expect to decrease the function further by taking a smaller step. On the other hand, when the slope is shallow enough (either positive or negative), we assume that the candidate point is near a local minimum, and additional effort yields only incremental benefits that are wasteful in the context of the larger problem.

The sufficient decrease and sufficient curvature conditions are collectively known as the strong Wolfe conditions. Figure 4.26 shows acceptable intervals that satisfy the strong Wolfe conditions, which are more restrictive than the sufficient decrease condition (Figure 4.22).

Steps that satisfy the strong Wolfe conditions.

Figure 4.26:Steps that satisfy the strong Wolfe conditions.

The sufficient decrease slope must be shallower than the sufficient curvature slope, that is, 0 < \mu_1 < \mu_2 < 1. This is to guarantee that there are steps that satisfy both the sufficient decrease and sufficient curvature conditions. Otherwise, the situation illustrated in Figure 4.27 could take place.
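A small helper that checks both strong Wolfe conditions for a candidate step might look like the sketch below; the default values \mu_1 = 10^{-4} and \mu_2 = 0.9 are typical choices, not values mandated by the text.

```python
def strong_wolfe(phi_alpha, dphi_alpha, phi0, dphi0, alpha, mu1=1e-4, mu2=0.9):
    """Check the strong Wolfe conditions for a candidate step length alpha.

    phi_alpha, dphi_alpha: function value and slope at alpha along the search direction.
    phi0, dphi0: function value and slope at the start of the line search (dphi0 < 0).
    """
    sufficient_decrease = phi_alpha <= phi0 + mu1 * alpha * dphi0
    sufficient_curvature = abs(dphi_alpha) <= mu2 * abs(dphi0)
    return sufficient_decrease and sufficient_curvature
```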

If \mu_2 < \mu_1, there might be no point that satisfies the strong Wolfe conditions.

Figure 4.27: If \mu_2 < \mu_1, there might be no point that satisfies the strong Wolfe conditions.

We now present a line search algorithm that finds a step satisfying the strong Wolfe conditions. Enforcing the sufficient curvature condition means we require derivative information (\phi^\prime), at least using the derivative at the beginning of the line search that we already computed from the gradient. There are various line search algorithms in the literature, including some that are derivative-free.

Here, we detail a line search algorithm based on the one developed by Moré and Thuente.[7]

The algorithm has two phases:

  1. The bracketing phase finds an interval within which we are certain to find a point that satisfies the strong Wolfe conditions.

  2. The pinpointing phase finds a point that satisfies the strong Wolfe conditions within the interval provided by the bracketing phase.

The bracketing phase is given by Algorithm 4.3 and illustrated in Figure 4.28. For brevity, we use a notation in the following algorithms where, for example, \phi_0 \equiv \phi(0) and \phi_\text{low} \equiv \phi(\alpha_\text{low}). Overall, the bracketing algorithm increases the step size until it either finds an interval that must contain a point satisfying the strong Wolfe conditions or a point that already meets those conditions.

We start the line search with a guess for the step size, which defines the first interval. For a smooth continuous function, we are guaranteed to have a minimum within an interval if either of the following hold:

  1. The function value at the candidate step is higher than the value at the start of the line search.

  2. The step satisfies sufficient decrease, and the slope is positive.

These two scenarios are illustrated in the top two rows of Figure 4.28. In either case, we have an interval within which we can find a point that satisfies the strong Wolfe conditions using the pinpointing algorithm. The order of the arguments to the pinpoint function in Algorithm 4.3 is significant because this function assumes that the function value corresponding to the first \alpha is the lower one.

The third row in Figure 4.28 illustrates the scenario where the point satisfies the strong Wolfe conditions, in which case the line search is finished.

If the point satisfies sufficient decrease and the slope at that point is negative, we assume that there are better points farther along the line, and the algorithm increases the step size. This larger step and the previous one define a new interval that has moved away from the line search starting point. We repeat the procedure and check the scenarios for this new interval. To save function calls, bracketing should return not just \alpha^* but also the corresponding function value and gradient to the outer function.

Visual representation of the bracketing algorithm. The sufficient decrease line is drawn as if \alpha_1 were the starting point for the line search, which is the case for the first line search iteration but not necessarily the case for later iterations.

Figure 4.28: Visual representation of the bracketing algorithm. The sufficient decrease line is drawn as if \alpha_1 were the starting point for the line search, which is the case for the first line search iteration but not necessarily the case for later iterations.

If the bracketing phase does not find a point that satisfies the strong Wolfe conditions, it finds an interval where we are guaranteed to find such a point in the pinpointing phase described in Algorithm 4.4 and illustrated in Figure 4.29. The intervals generated by this algorithm, bounded by \alpha_\text{low} and \alpha_\text{high}, always have the following properties:

  1. The interval has one or more points that satisfy the strong Wolfe conditions.

  2. Among all the points generated so far that satisfy the sufficient decrease condition, \alpha_\text{low} has the lowest function value.

  3. The slope at \alpha_\text{low} is such that the function decreases from \alpha_\text{low} toward \alpha_\text{high}.

The first step of pinpointing is to find a new point within the given interval. Various techniques can be used to find such a point. The simplest one is to select the midpoint of the interval (bisection), but this method is limited to a linear convergence rate. It is more efficient to perform interpolation and select the point that minimizes the interpolation function, which can be done analytically (see Section 4.3.3). Using this approach, we can achieve quadratic convergence.

Once we have a new point within the interval, four scenarios are possible, as shown in Figure 4.29. The first scenario is that \phi\left(\alpha_p\right) is above the sufficient decrease line or greater than or equal to \phi(\alpha_\text{low}). In that scenario, \alpha_p becomes the new \alpha_\text{high}, and we have a new smaller interval.

In the second, third, and fourth scenarios, \phi\left(\alpha_p\right) is below the sufficient decrease line, and \phi\left(\alpha_p\right) < \phi(\alpha_\text{low}). In those scenarios, we check the value of the slope \phi^\prime\left(\alpha_p\right). In the second and third scenarios, we choose the new interval based on the direction in which the slope predicts a local decrease. If the slope is shallow enough (fourth scenario), we have found a point that satisfies the strong Wolfe conditions.

In theory, the line search given in Algorithm 4.3 followed by Algorithm 4.4 is guaranteed to find a step length satisfying the strong Wolfe conditions. In practice, some additional considerations are needed for improved robustness. One of these criteria is to ensure that the new point in the pinpoint algorithm is not so close to an endpoint as to cause the interpolation to be ill-conditioned. A fallback option in case the interpolation fails could be a simpler algorithm, such as bisection. Another criterion is to ensure that the loop does not continue indefinitely in case finite-precision arithmetic leads to indistinguishable function value changes. A limit on the number of iterations might be necessary.

Visual representation of the pinpointing algorithm. The labels in red indicate the new interval endpoints.

Figure 4.29:Visual representation of the pinpointing algorithm. The labels in red indicate the new interval endpoints.

4.3.3 Interpolation for Pinpointing

Interpolation is recommended to find a new point within each interval at the pinpointing phase. Once we have an interpolation function, we find the new point by determining the analytic minimum of that function. This accelerates the convergence compared with bisection. We consider two options: quadratic interpolation and cubic interpolation.

Because we have the function value and derivative at one endpoint of the interval and at least the function value at the other endpoint, one option is to perform quadratic interpolation to estimate the minimum within the interval.

The quadratic can be written as

\tilde\phi(\alpha) = c_0 + c_1 \alpha + c_2 \alpha^2 \, ,

where c_0, c_1, and c_2 are constants to be determined by interpolation. Suppose that we have the function value and the derivative at \alpha_1 and the function value at \alpha_2, as illustrated in Figure 4.31. These values correspond to \alpha_\text{low} and \alpha_\text{high} in the pinpointing algorithm, but we use the more generic indices 1 and 2 because the formulas of this section are not dependent on which one is lower or higher. Then, the boundary conditions at the endpoints are

\begin{aligned} \phi(\alpha_1) &= c_0 + c_1 \alpha_1 + c_2 \alpha_1^2 \\ \phi(\alpha_2) &= c_0 + c_1 \alpha_2 + c_2 \alpha_2^2 \\ \phi^\prime(\alpha_1) &= c_1 + 2 c_2 \alpha_1 \, . \end{aligned}

We can use these three equations to find the three coefficients based on function and derivative values. Once we have the coefficients for the quadratic, we can find the minimum of the quadratic analytically by finding the point \alpha^* such that \tilde\phi^\prime(\alpha^*) = 0, which is \alpha^* = -c_1 / (2 c_2). Substituting the analytic solution for the coefficients as a function of the given values into this expression yields the final expression for the minimizer of the quadratic:

\alpha^* = \frac{ 2 \alpha_{1} \left[ \phi(\alpha_2) - \phi(\alpha_1) \right] + \phi^{\prime}(\alpha_1) \left( \alpha_1^2 -\alpha_2^2 \right)} {2 \left[ \phi(\alpha_2) - \phi(\alpha_1) + \phi^{\prime}(\alpha_1) ( \alpha_{1} - \alpha_{2}) \right]} \, .

Quadratic interpolation with two function values and one derivative.

Figure 4.31:Quadratic interpolation with two function values and one derivative.
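The minimizer formula above translates directly into code; this is a minimal sketch, with argument names chosen here for readability.

```python
def quadratic_interp(a1, phi1, dphi1, a2, phi2):
    """Minimizer of the quadratic through (a1, phi1) with slope dphi1 at a1 and (a2, phi2)."""
    numerator = 2.0 * a1 * (phi2 - phi1) + dphi1 * (a1**2 - a2**2)
    denominator = 2.0 * (phi2 - phi1 + dphi1 * (a1 - a2))
    return numerator / denominator

# Check on phi(alpha) = (alpha - 1)**2: the interpolated minimizer is exactly 1.0
print(quadratic_interp(0.0, 1.0, -2.0, 2.0, 1.0))
```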

Cubic interpolation with function values and derivatives at endpoints.

Figure 4.32:Cubic interpolation with function values and derivatives at endpoints.

Performing this quadratic interpolation for successive intervals is similar to the Newton method and also converges quadratically. The pure Newton method also models a quadratic, but it is based on the information at a single point (function value, derivative, and curvature), as opposed to information at two points.

If computing additional derivatives is inexpensive, or we already evaluated \phi^\prime \left(\alpha_{i} \right) (either as part of Algorithm 4.3 or as part of checking the strong Wolfe conditions in Algorithm 4.4), then we have the function values and derivatives at both points. With these four pieces of information, we can perform a cubic interpolation,

\tilde\phi(\alpha) = c_0 + c_1 \alpha + c_2 \alpha^2 + c_3 \alpha^3 \, ,

as shown in Figure 4.32. To determine the four coefficients, we apply the boundary conditions:

\begin{aligned} \phi(\alpha_1) &= c_0 + c_1 \alpha_1 + c_2 \alpha_1^2 + c_3 \alpha_1^3 \\ \phi(\alpha_2) &= c_0 + c_1 \alpha_2 + c_2 \alpha_2^2 + c_3 \alpha_2^3 \\ \phi^\prime(\alpha_1) &= c_1 + 2 c_2 \alpha_1 + 3 c_3 \alpha_1^2 \\ \phi^\prime(\alpha_2) &= c_1 + 2 c_2 \alpha_2 + 3 c_3 \alpha_2^2 \, . \end{aligned}

Using these four equations, we can find expressions for the four coefficients as a function of the four pieces of information. Similar to the quadratic interpolation function, we can find the solution for \tilde\phi^\prime(\alpha^*) = c_1 + 2 c_2 \alpha^* + 3 c_3 {\alpha^*}^2 = 0 as a function of the coefficients.

There could be two valid solutions, but we are only interested in the minimum, for which the curvature is positive; that is, \tilde\phi^{\prime\prime}(\alpha^*) = 2 c_2 + 6 c_3 \alpha^* > 0.

Substituting the coefficients with the expressions obtained from solving the boundary condition equations and selecting the minimum solution yields

\alpha^* = \alpha_2 - (\alpha_2 - \alpha_1) \frac{\phi^\prime(\alpha_2) + \beta_2 - \beta_1} {\phi^\prime(\alpha_2) - \phi^\prime(\alpha_1) + 2 \beta_2} \, ,

where

\begin{aligned} \beta_1 &= \phi^\prime(\alpha_1) + \phi^\prime(\alpha_2) - 3 \frac{\phi(\alpha_1) - \phi(\alpha_2)}{\alpha_1 - \alpha_2} \\ \beta_2 &= \text{sign}(\alpha_2 - \alpha_1)\sqrt{\beta_1^2 - \phi^\prime(\alpha_1) \phi^\prime(\alpha_2) } \, . \end{aligned}
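These expressions can be implemented directly, as in the sketch below; a production version would guard against a negative value under the square root and a near-zero denominator, as discussed next.

```python
import numpy as np

def cubic_interp(a1, phi1, dphi1, a2, phi2, dphi2):
    """Minimizer of the cubic defined by values and derivatives at a1 and a2."""
    beta1 = dphi1 + dphi2 - 3.0 * (phi1 - phi2) / (a1 - a2)
    beta2 = np.sign(a2 - a1) * np.sqrt(beta1**2 - dphi1 * dphi2)
    return a2 - (a2 - a1) * (dphi2 + beta2 - beta1) / (dphi2 - dphi1 + 2.0 * beta2)

# Check on phi(alpha) = (alpha - 1)**2: the interpolated minimizer is exactly 1.0
print(cubic_interp(0.0, 1.0, -2.0, 3.0, 4.0, 4.0))
```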

These interpolations become ill-conditioned if the interval becomes too small. The interpolation may also lead to points outside the bracket. In such cases, we can switch to bisection for the problematic iterations.

4.4 Search Direction

As stated at the beginning of this chapter, each iteration of an unconstrained gradient-based algorithm consists of two main steps: determining the search direction and performing the line search (Algorithm 4.1). The optimization algorithms are named after the method used to find the search direction, p_{k}, and can use any suitable line search. We start by introducing two first-order methods that only require the gradient and then explain two second-order methods that require the Hessian, or at least an approximation of the Hessian.

4.4.1 Steepest Descent

The steepest-descent method (also called gradient descent) is a simple and intuitive method for determining the search direction. As discussed in Section 4.1.1, the gradient points in the direction of steepest increase, so -\nabla f points in the direction of steepest descent, as shown in Figure 4.33. Thus, our search direction at iteration k is simply

p = - \nabla f \, .

The steepest-descent direction points in the opposite direction of the gradient.

Figure 4.33:The steepest-descent direction points in the opposite direction of the gradient.

One major issue with the steepest descent is that, in general, the entries in the gradient and its overall scale can vary greatly depending on the magnitudes of the objective function and design variables. The gradient itself contains no information about an appropriate step length, and therefore the search direction is often better posed as a normalized direction,

p_{k} = -\frac{\nabla f_{k}}{\|\nabla f_{k}\|} \, .

Algorithm 4.5 provides the complete steepest descent procedure.

Regardless of whether we choose to normalize the search direction or not, the gradient does not provide enough information to inform a good guess of the initial step size for the line search. As we saw in Section 4.3, this initial choice has a large impact on the efficiency of the line search because the first guess could be orders of magnitude too small or too large. The second-order methods described later in this section are better in this respect. In the meantime, we can make a guess of the step size for a given line search based on the result of the previous one. Assuming that we will obtain a decrease in objective function at the current line search that is comparable to the previous one, we can write

\alpha_{k} \nabla {f_{k}}^\intercal p_{k} \approx \alpha_{k-1} \nabla {f_{k-1}}^\intercal p_{k-1} \, .

Solving for the step length, we obtain the guess

\alpha_{k} = \alpha_{k-1} \frac{\nabla {f_{k-1}}^\intercal p_{k-1} }{\nabla {f_{k}}^\intercal p_{k}} \, .

Although this expression could be simplified for the steepest descent, we leave it as is so that it is applicable to other methods.

If the slope of the function increases in magnitude relative to the previous line search, this guess decreases relative to the previous line search step length, and vice versa. This is just the first step length in the new line search, after which we proceed as usual.
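Putting these pieces together, here is a minimal steepest-descent sketch with normalized directions, the first-step guess above, and the backtracking search from Section 4.3.1; the test function, starting point, and tolerances are arbitrary choices for illustration.

```python
import numpy as np

def steepest_descent(f, grad_f, x0, tau=1e-6, alpha_init=1.0, max_iter=1000):
    """Steepest descent with normalized directions and a backtracking line search."""
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    g0_norm = np.linalg.norm(g, np.inf)
    alpha = alpha_init
    dphi_prev = None
    for _ in range(max_iter):
        if np.linalg.norm(g, np.inf) <= tau * (1.0 + g0_norm):
            break
        p = -g / np.linalg.norm(g)      # normalized steepest-descent direction
        dphi0 = g @ p
        if dphi_prev is not None:
            alpha *= dphi_prev / dphi0  # first-step guess based on the previous line search
        phi0 = f(x)
        # backtracking to satisfy sufficient decrease (mu1 = 1e-4, rho = 0.5)
        while f(x + alpha * p) > phi0 + 1e-4 * alpha * dphi0:
            alpha *= 0.5
        x = x + alpha * p
        g = grad_f(x)
        dphi_prev = dphi0
    return x

# Arbitrary convex quadratic for illustration
f = lambda x: 0.5 * (4.0 * x[0]**2 + x[1]**2)
grad_f = lambda x: np.array([4.0 * x[0], x[1]])
print(steepest_descent(f, grad_f, [3.0, 2.0]))  # converges (slowly) toward [0, 0]
```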

Although steepest descent sounds like the best possible search direction for decreasing a function, it generally is not. The reason is that when the function's curvature varies significantly with direction, the gradient alone is a poor representation of the function's behavior beyond a small neighborhood, as illustrated previously in Figure 4.19.

The behavior shown in Example 4.10 is expected, and we can show it mathematically. Assuming we perform an exact line search at each iteration, this means selecting the optimal value for \alpha along the line search:

\begin{aligned} \frac{\partial f \left(x_{k} + \alpha p_{k} \right)}{\partial \alpha} &= 0 \Rightarrow \\ \frac{\partial f \left(x_{k+1} \right)}{\partial \alpha} &= 0 \Rightarrow \\ \frac{\partial f \left(x_{k+1} \right)}{\partial x_{k+1}}\frac{\partial \left(x_{k} + \alpha p_{k} \right)}{\partial \alpha} &= 0 \Rightarrow \\ \nabla {f_{k+1}}^\intercal p_{k} &= 0 \Rightarrow \\ -{p_{k+1}}^\intercal p_{k} &= 0 \, . \end{aligned}

Hence, each search direction is orthogonal to the previous one. When performing an exact line search, the gradient projection in the line search direction vanishes at the minimum, which means that the gradient is orthogonal to the search direction, as shown in Figure 4.35.

The gradient projection in the line search direction vanishes at the line search minimum.

Figure 4.35:The gradient projection in the line search direction vanishes at the line search minimum.

As discussed in the last section, exact line searches are not desirable, so in practice the search directions are not exactly orthogonal. However, the overall zigzagging behavior still exists.

4.4.2 Conjugate Gradient

Steepest descent generally performs poorly, especially if the problem is not well scaled, like the quadratic example in Figure 4.34. The conjugate gradient method updates the search directions such that they do not zigzag as much. This method is based on the linear conjugate gradient method, which was designed to solve linear equations. We first introduce the linear conjugate gradient method and then adapt it to the nonlinear case.

For the moment, let us assume that we have the following quadratic objective function:

f(x) = \frac{1}{2} x^\intercal A x - b^\intercal x \, ,

where A is a positive-definite Hessian, and -b is the gradient at x = 0. The constant term is omitted with no loss of generality because it does not change the location of the minimum. To find the minimum of this quadratic, we require

\nabla f(x^*) = A x^* - b = 0 \, .

Thus, finding the minimum of a quadratic amounts to solving the linear system Ax = b, and the residual vector is the gradient of the quadratic.

For a quadratic function with elliptical contours and the principal axis aligned with the coordinate axis, we can find the minimum in n steps, where n is the number of dimensions, by using a coordinate search.

Figure 4.38: For a quadratic function with elliptical contours and the principal axis aligned with the coordinate axis, we can find the minimum in n steps, where n is the number of dimensions, by using a coordinate search.

If A were a positive-definite diagonal matrix, the contours would be elliptical, as shown in Figure 4.38 (or hyper-ellipsoids in the n-dimensional case), and the axes of the ellipses would align with the coordinate directions. In that case, we could converge to the minimum by successively performing an exact line search in each coordinate direction for a total of n line searches.

In the more general case (but still assuming A to be positive definite), the axes of the ellipses form an orthogonal coordinate system in some other orientation. A coordinate search would no longer work as well in this case, as illustrated in Figure 4.39.

For a quadratic function with the elliptical principal axis not aligned with the coordinate axis, more iterations are needed to find the minimum using a coordinate search.

Figure 4.39:For a quadratic function with the elliptical principal axis not aligned with the coordinate axis, more iterations are needed to find the minimum using a coordinate search.

We can converge to the minimum of a quadratic function by minimizing along each Hessian eigenvector.

Figure 4.40:We can converge to the minimum of a quadratic function by minimizing along each Hessian eigenvector.

Recall from Section 4.1.2 that the eigenvectors of the Hessian represent the directions of principal curvature, which correspond to the axes of the ellipses. Therefore, we could successively perform a line search along the direction defined by each eigenvector and again converge to the minimum with n line searches, as illustrated in Figure 4.40. The problem with this approach is that we would have to compute the eigenvectors of A, a computation whose cost is {\cal O}(n^3).

Fortunately, the eigenvector directions are not the only set of directions that can minimize the quadratic function in n line searches. To find out which directions can achieve this, let us express the path from the origin to the minimum of the quadratic as a sequence of n steps with directions p_{i} and lengths \alpha_{i},

x^* = \sum_{i=0}^{n-1} \alpha_{i} p_{i} \, .

Thus, we have represented the solution as a linear combination of n vectors. Substituting this into the quadratic (Equation 4.48), we get

\begin{aligned} f(x^*) &= f \left(\sum_{i=0}^{n-1} \alpha_{i} p_{i}\right) \\ &= \frac{1}{2} \left(\sum_{i=0}^{n-1} \alpha_{i} p_{i}\right)^\intercal A \left(\sum_{j=0}^{n-1} \alpha_{j} p_{j}\right) - b^\intercal \left(\sum_{i=0}^{n-1} \alpha_{i} p_{i}\right) \\ & = \frac{1}{2} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} \alpha_{i} \alpha_{j} {p_{i}}^\intercal A p_{j} - \sum_{i=0}^{n-1} \alpha_{i} b^\intercal p_{i} \, . \end{aligned}

Suppose that the vectors p_{0}, p_{1}, \ldots, p_{n-1} are conjugate with respect to A; that is, they have the following property:

{p_{i}}^\intercal A p_{j} = 0, \quad \text{for all} \quad i \neq j \, .

Then, the double-sum term in Equation 4.51 can be simplified to a single sum and we can write

f(x)=i=0n1(12αi2piApiαibpi).f(x^*) = \sum_{i=0}^{n-1} \left( \frac{1}{2} \alpha_{i}^2 {p_{i}}^\intercal A p_{i} - \alpha_{i} b^\intercal p_{i} \right) \, .

Because each term in this sum involves only one direction pip_{i}, we have reduced the original problem to a series of one-dimensional quadratic functions that can be minimized one at a time. Two possible conjugate directions are shown for the two-dimensional case in Figure 4.41.

By minimizing along a sequence of conjugate directions in turn, we can find the minimum of a quadratic in n steps, where n is the number of dimensions.

Figure 4.41:By minimizing along a sequence of conjugate directions in turn, we can find the minimum of a quadratic in nn steps, where nn is the number of dimensions.

Each one-dimensional problem corresponds to minimizing the quadratic with respect to the step length αi\alpha_{i}. Differentiating each term and setting it to zero yields

αipiApibpi=0αi=bpipiApi,\alpha_{i} {p_{i}}^\intercal A p_{i} - b^\intercal p_{i} = 0 \, \Rightarrow \, \alpha_{i} = \frac{b^\intercal p_{i}}{{p_{i}}^\intercal A p_{i}} \, ,

which corresponds to the result of an exact line search in direction pip_{i}.

There are many possible sets of vectors that are conjugate with respect to AA, including the eigenvectors. The conjugate gradient method finds these directions starting with the steepest-descent direction,

p0=f(x0),p_{0} = - \nabla f \left(x_{0} \right) \, ,

and then finds each subsequent direction using the update,

pk=fk+βkpk1. p_{k} = - \nabla f_{k} + \beta_{k} p_{k-1} \, .

For a positive β\beta, the result is a new direction somewhere between the current steepest descent and the previous search direction, as shown in Figure 4.42. The factor β\beta is set such that pkp_{k} and pk1p_{k-1} are conjugate with respect to AA.

The conjugate gradient search direction update combines the steepest-descent direction with the previous conjugate gradient direction.

Figure 4.42:The conjugate gradient search direction update combines the steepest-descent direction with the previous conjugate gradient direction.

One option to compute a β\beta that achieves conjugacy is given by the Fletcher–Reeves formula,

βk=fkfkfk1fk1. \beta_{k} = \frac{\nabla {f_{k}}^\intercal \nabla {f_{k}}^{}}{\nabla {f_{k-1}}^\intercal \nabla {f_{k-1}}^{}} \, .

This formula is derived in Iterative Methods as Equation B.40 in the context of linear solvers. Here, we replace the residual of the linear system with the gradient of the quadratic because they are equivalent.

Using the directions given by Equation 4.56 and the step lengths given by Equation 4.54, we can minimize a quadratic in nn steps, where nn is the size of xx. The minimization shown in Figure 4.41 starts with the steepest-descent direction and then computes one update to converge to the minimum in two iterations using exact line searches. The linear conjugate gradient method is detailed in Algorithm 2.
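For concreteness, here is a minimal NumPy sketch of the linear conjugate gradient iteration just described; it is an illustration in the spirit of the referenced algorithm rather than a transcription of it, and the function name and interface are chosen for this example.

```python
import numpy as np

def linear_conjugate_gradient(A, b, x0=None, tol=1e-10):
    """Minimal sketch: minimize f(x) = 0.5 x^T A x - b^T x with A symmetric
    positive definite, which is equivalent to solving A x = b. In exact
    arithmetic the iteration converges in at most n steps."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    g = A @ x - b              # gradient of the quadratic (residual of A x = b)
    p = -g                     # first direction: steepest descent
    for _ in range(n):
        alpha = (g @ g) / (p @ A @ p)      # exact line search step length
        x = x + alpha * p
        g_new = A @ x - b
        if np.linalg.norm(g_new) < tol:
            break
        beta = (g_new @ g_new) / (g @ g)   # Fletcher–Reeves-type update
        p = -g_new + beta * p              # new direction, conjugate to the previous ones
        g = g_new
    return x
```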

However, we are interested in minimizing general nonlinear functions. We can adapt the linear conjugate gradient method to the nonlinear case by doing the following:

  1. Use the gradient of the nonlinear function in the search direction update (Equation 4.56) and the expression for β\beta (Equation 4.57). This gradient can be computed using any of the methods in Chapter 6.

  2. Perform an inexact line search instead of doing the exact line search. This frees us from providing the Hessian vector products required for an exact line search (see Equation 4.54). A line search that satisfies the strong Wolfe conditions is a good choice, but we need a stricter range in the sufficient decrease and sufficient curvature parameters (0<μ1<μ2<1/20 < \mu_1 < \mu_2 < {1}/{2}).[9] This stricter requirement on μ2\mu_2 is necessary with the Fletcher–Reeves formula (Equation 4.57) to ensure descent directions. As a first guess for α\alpha in the line search, we can use the same estimate proposed for steepest descent (Equation 4.43).

  3. Reset the search direction periodically back to the steepest-descent direction. In practice, resetting is often helpful to remove old information that is no longer useful. Some methods reset every nn iterations, motivated by the fact that the linear case only generates nn conjugate vectors. A more mathematical approach resets the direction when

    fkfk1fkfk0.1.\frac{|\nabla f_k^\intercal \nabla f_{k-1}|}{|\nabla f_k^\intercal \nabla f_{k}|} \ge 0.1 \,.

The full procedure is given in Algorithm 4.6. As with steepest descent, we may use normalized search directions.
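The following is a rough Python sketch in the spirit of Algorithm 4.6. It combines the Fletcher–Reeves update (Equation 4.57), a strong Wolfe line search (delegated here to SciPy's `line_search` with a curvature parameter of 0.4, an implementation choice rather than the book's), and the restart test above; the fallback when the line search fails is our own simplification.

```python
import numpy as np
from scipy.optimize import line_search

def nonlinear_cg_fr(f, grad, x0, tol=1e-6, max_iter=500):
    """Rough sketch of nonlinear conjugate gradient with Fletcher-Reeves beta.
    f and grad are user-supplied callables returning the objective and its gradient."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g
    for _ in range(max_iter):
        if np.linalg.norm(g, np.inf) < tol:
            break
        # Strong Wolfe line search; curvature parameter < 1/2 as required with Fletcher-Reeves
        alpha = line_search(f, grad, x, p, gfk=g, c2=0.4)[0]
        if alpha is None:
            alpha, p = 1e-4, -g            # crude fallback: reset and take a small step
        x = x + alpha * p
        g_new = grad(x)
        # Restart test from the text: reset when successive gradients are far from orthogonal
        if abs(g_new @ g) / (g_new @ g_new) >= 0.1:
            beta = 0.0
        else:
            beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves
        p = -g_new + beta * p
        g = g_new
    return x
```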

The nonlinear conjugate gradient method is no longer guaranteed to converge in nn steps like its linear counterpart, but it significantly outperforms the steepest-descent method. The change required relative to steepest descent is minimal: save information on the search direction and gradient from the previous iteration, and add the β\beta term to the search direction update. Therefore, there is rarely a reason to prefer steepest descent. The parameter β\beta can be interpreted as a “damping parameter” that prevents each search direction from varying too much relative to the previous one. When the function steepens, the damping becomes larger, and vice versa.

The formula for β\beta in Equation 4.57 is only one of several options. Another well-known option is the Polak–Ribière formula, which is given by

βk=fk(fkfk1)fk1fk1. \beta_{k} = \frac{\nabla {f_{k}}^\intercal \left( \nabla {f_{k}} - \nabla {f_{k-1}} \right)} {\nabla {f_{k-1}}^\intercal \nabla {f_{k-1}}^{}} \, .

The conjugate gradient method with the Polak–Ribière formula tends to converge more quickly than with the Fletcher–Reeves formula, and this method does not require the more stringent range for μ2\mu_2. However, regardless of the value of μ2\mu_2, the strong Wolfe conditions still do not guarantee that pkp_{k} is a descent direction (β\beta might become negative). This issue can be addressed by forcing β\beta to remain nonnegative:

βmax(0,β).\beta \leftarrow \text{max}(0, \beta) \,.

This equation automatically triggers a reset whenever β=0\beta = 0 (see Equation 4.56), so in this approach, other checks on resetting can be removed from Algorithm 4.6.
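As a small illustration, the Polak–Ribière β with this nonnegativity safeguard can be computed as follows; `g_new` and `g_old` are assumed to be NumPy gradient vectors, and the helper name is ours.

```python
def beta_pr_plus(g_new, g_old):
    """Polak-Ribiere beta with the nonnegativity safeguard; returning 0
    triggers an automatic reset to the steepest-descent direction."""
    beta = g_new @ (g_new - g_old) / (g_old @ g_old)
    return max(0.0, beta)
```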

4.4.3 Newton’s Method

The steepest-descent and conjugate gradient methods use only first-order information (the gradient). Newton’s method uses second-order (curvature) information to get better estimates for search directions. The main advantage of Newton’s method is that, unlike first-order methods, it provides an estimate of the step length because the curvature predicts where the function derivative is zero.

In Section 3.8, we presented Newton’s method for solving nonlinear equations. Newton’s method for minimizing functions is based on the same principle, but instead of solving r(u)=0r(u)=0, we solve f(x)=0\nabla f(x) = 0.

As in Section 3.8, we can derive Newton’s method for one-dimensional function minimization from the Taylor series approximation,

f(xk+s)f(xk)+sf(xk)+12s2f(xk).f \left( x_{k} + s \right) \approx f \left(x_{k} \right) + s f^\prime \left( x_{k} \right) + \frac{1}{2} s^2 f^{\prime\prime} \left( x_{k} \right) \, .

We now include a second-order term to get a quadratic that we can minimize. We minimize this quadratic approximation by differentiating with respect to the step ss and setting the derivative to zero, which yields

f(xk)+sf(xk)=0s=f(xk)f(xk).f^\prime \left( x_{k} \right) + s f^{\prime\prime} \left( x_{k} \right) = 0 \quad \Rightarrow \quad s = -\frac{f^\prime \left( x_{k} \right)}{f^{\prime\prime} \left( x_{k} \right)} \, .

Thus, the Newton update is

xk+1=xkfkfk.x_{k+1} = x_{k} - \frac{f^{\prime}_k}{f^{\prime\prime}_k} \, .

We could also derive this equation by taking Newton’s method for root finding (Equation 3.24) and replacing r(u)r(u) with f(x)f^\prime(x).
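A minimal sketch of this one-dimensional iteration is shown below; `fprime` and `fsecond` are assumed to be user-supplied derivative callables, and the example function is arbitrary. Note that the iteration stops at whatever stationary point it reaches, a limitation discussed later in this section.

```python
def newton_1d(fprime, fsecond, x0, tol=1e-8, max_iter=50):
    """Minimal sketch of Newton's method for 1-D minimization (Equation 4.63):
    iterate x <- x - f'(x)/f''(x) until the derivative is (near) zero."""
    x = x0
    for _ in range(max_iter):
        d1 = fprime(x)
        if abs(d1) < tol:          # stationary point found
            break
        x -= d1 / fsecond(x)       # Newton update
    return x

# Example: minimize f(x) = x^4 - 3x^2 + x starting from x = 1.5
x_star = newton_1d(lambda x: 4*x**3 - 6*x + 1, lambda x: 12*x**2 - 6, x0=1.5)
```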

Newton’s method for finding roots can be adapted for function minimization by formulating it to find a zero of the derivative. We step to the minimum of a quadratic at each iteration (top) or equivalently find the root of the function’s first derivative (bottom).

Figure 4.44:Newton’s method for finding roots can be adapted for function minimization by formulating it to find a zero of the derivative. We step to the minimum of a quadratic at each iteration (top) or equivalently find the root of the function’s first derivative (bottom).

Like the one-dimensional case, we can build an nn-dimensional Taylor series expansion about the current design point:

f(xk+s)fk+fks+12sHks,f \left(x_{k} + s \right) \approx f_{k} + \nabla {f_{k}}^\intercal s + \frac{1}{2} {s}^\intercal H_{k} s \, ,

where ss is a vector centered at xkx_{k}. Similar to the one-dimensional case, we can find the step sks_{k} that minimizes this quadratic model by taking the derivative with respect to ss and setting that equal to zero:

df(xk+s)ds=fk+Hks=0.\begin{aligned} \frac{\mathrm{d} f \left(x_{k} + s \right)}{\mathrm{d} s} &= \nabla f_{k} + H_{k} s = 0 \, . \end{aligned}

Thus, each Newton step is the solution of a linear system where the matrix is the Hessian,

Hksk=fk. H_{k} s_{k} = -\nabla f_{k} \, .

This linear system is analogous to the one used for solving nonlinear systems with Newton’s method (Equation 3.30), except that the Jacobian becomes the Hessian, the residual is the gradient, and the design variables replace the states. We can use any of the linear solvers mentioned in Section 3.6 to solve this system.
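As an illustration, one Newton iteration in nn dimensions can be written as follows; `grad` and `hess` are assumed to be user-supplied callables, and the dense NumPy solver stands in for whichever linear solver from Section 3.6 is appropriate.

```python
import numpy as np

def newton_step(grad, hess, x):
    """Minimal sketch of one Newton iteration in n dimensions (Equation 4.65):
    solve H_k s_k = -grad(f_k) for the step, then update the point."""
    g = grad(x)
    H = hess(x)
    s = np.linalg.solve(H, -g)   # Newton step from the linear system
    return x + s
```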

Iteration history for a quadratic function using an exact line search and Newton’s method. Unsurprisingly, only one iteration is required.

Figure 4.45:Iteration history for a quadratic function using an exact line search and Newton’s method. Unsurprisingly, only one iteration is required.

When minimizing the quadratic function from Example 4.10, Newton’s method converges in one step for any value of β\beta, as shown in Figure 4.45. Thus, Newton’s method is scale invariant.

Because the function is quadratic, the quadratic “approximation” from the Taylor series is exact, so we can find the minimum in one step. It will take more iterations for a general nonlinear function, but using curvature information generally yields a better search direction than first-order methods. In addition, Newton’s method provides a step length embedded in sks_{k} because the quadratic model estimates the stationary point location. Furthermore, Newton’s method exhibits quadratic convergence.

Although Newton’s method is powerful, it suffers from a few issues in practice. One issue is that the Newton step does not necessarily result in a function decrease. This issue can occur if the Hessian is not positive definite or if the quadratic predictions overshoot because the actual function has more curvature than predicted by the quadratic approximation. Both of these possibilities are illustrated in Figure 4.46.

Newton’s method in its pure form is vulnerable to negative curvature (in which case it might step away from the minimum) and overshooting (which might result in a function increase).

Figure 4.46:Newton’s method in its pure form is vulnerable to negative curvature (in which case it might step away from the minimum) and overshooting (which might result in a function increase).

If the Hessian is not positive definite, the step might not even be in a descent direction. Replacing the real Hessian with a positive-definite Hessian can mitigate this issue. The quasi-Newton methods in the next section force a positive-definite Hessian by construction.

To fix the overshooting issue, we can use a line search instead of blindly accepting the Newton step length. We would set pk=skp_{k} = s_{k}, with αinit=1\alpha_\text{init}=1 as the first guess for the step length. In this case, we have a much better guess for α\alpha compared with the steepest-descent or conjugate gradient cases because this guess is based on the local curvature. Even if the first step length given by the Newton step overshoots, the line search would find a point with a lower function value.

The trust-region methods in Section 4.5 address both of these issues by minimizing the function approximation within a specified region around the current iteration.

Another major issue with Newton’s method is that the Hessian can be difficult or costly to compute. Even if available, the solution of the linear system in Equation 4.65 can be expensive. Both of these considerations motivate the quasi-Newton methods, which we explain next.

4.4.4 Quasi-Newton Methods

As mentioned in Section 4.4.3, Newton’s method is efficient because the second-order information results in better search directions, but it has the significant shortcoming of requiring the Hessian. Quasi-Newton methods are designed to address this issue. The basic idea is that we can use first-order information (gradients) along each step in the iteration path to build an approximation of the Hessian.

In one dimension, we can adapt the secant method (see Equation 3.26) for function minimization. Instead of estimating the first derivative, we now estimate the second derivative (curvature) using two successive first derivatives, as follows:

fk+1=fk+1fkxk+1xk.{f^{\prime \prime}}_{k+1} = \frac{{f^\prime}_{k+1} - {f^\prime}_{k}}{x_{k+1} - x_{k}} \, .

Then we can use this approximation in the Newton step (Equation 4.63) to obtain an iterative procedure that requires only first derivatives instead of first and second derivatives.
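A minimal sketch of this secant-based iteration follows; it requires two starting points and only first-derivative evaluations. The function name and interface are ours.

```python
def secant_quasi_newton_1d(fprime, x0, x1, tol=1e-8, max_iter=100):
    """Minimal sketch of the 1-D quasi-Newton (secant) iteration: the second
    derivative in the Newton update (Equation 4.63) is replaced by the secant
    estimate built from the last two first derivatives."""
    g0, g1 = fprime(x0), fprime(x1)
    for _ in range(max_iter):
        curvature = (g1 - g0) / (x1 - x0)      # secant estimate of f''
        x_new = x1 - g1 / curvature            # quasi-Newton update
        if abs(x_new - x1) < tol:
            return x_new
        x0, g0 = x1, g1
        x1, g1 = x_new, fprime(x_new)
    return x1
```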

The quadratic approximation based on this approximation of the second derivative is

f~k+1(xk+1+s)=fk+1+sfk+1+s22(fk+1fkxk+1xk).\tilde f_{k+1} \left(x_{k+1} + s \right) = f_{k+1} + s f^{\prime}_{k+1} + \frac{s^2}{2} \left( \frac{{f^\prime}_{k+1} - {f^\prime}_{k}}{x_{k+1} - x_{k}} \right) \, .

Taking the derivative of this approximation with respect to ss, we get

f~k+1(xk+1+s)=fk+1+s(fk+1fkxk+1xk).\tilde f_{k+1}^\prime\left(x_{k+1} + s \right) = f^{\prime}_{k+1} + s \left( \frac{{f^\prime}_{k+1} - {f^\prime}_{k}}{x_{k+1} - x_{k}} \right) \, .

For s=0s=0, which corresponds to xk+1x_{k+1}, we get f~k+1(xk+1)=fk+1\tilde f_{k+1}^\prime\left(x_{k+1} \right) = f^{\prime}_{k+1}, which tells us that the slope of the approximation matches the slope of the actual function at xk+1x_{k+1}, as expected.

The quadratic approximation based on the secant method matches the slopes at the two last points and the function value at the last point.

Figure 4.48:The quadratic approximation based on the secant method matches the slopes at the two last points and the function value at the last point.

Also, by stepping backward to xkx_{k} by setting s=(xk+1xk)s = -\left( x_{k+1} - x_{k} \right), we find that f~k+1(xk)=fk\tilde f_{k+1}^\prime \left(x_{k} \right) = f^{\prime}_k. Thus, the nature of this approximation is such that it matches the slope of the actual function at the last two points, as shown in Figure 4.48.

In nn dimensions, things are more involved, but the principle is the same: use first-derivative information from the last two points to approximate second-derivative information. Instead of iterating along the xx-axis as we would in one dimension, the optimization in nn dimensions follows a sequence of steps (as shown in Figure 4.1) for the separate line searches. We have gradients at the endpoints of each step, so we can take the difference between the gradients at those points to get the curvature along that direction. The question is: How do we update the Hessian, which is expressed in the coordinate system of xx, based on directional curvatures in directions that are not necessarily aligned with the coordinate system?

Quasi-Newton methods use the quadratic approximation of the objective function,

f~(xk+p)=fk+fkp+12pH~kp,\tilde f \left(x_{k} + p \right) = f_{k} + \nabla {f_{k}}^\intercal p + \frac{1}{2} {p}^\intercal \tilde H_{k} p \, ,

where H~\tilde H is an approximation of the Hessian.

Similar to Newton’s method, we minimize this quadratic with respect to pp, which yields the linear system

H~kpk=fk. \tilde H_{k} p_{k} = - \nabla {f_{k}} \, .

We solve this linear system for pkp_{k}, but instead of accepting it as the final step, we perform a line search in the pkp_{k} direction. Only after finding a step size αk\alpha_{k} that satisfies the strong Wolfe conditions do we update the point using

xk+1=xk+αkpk.x_{k+1} = x_{k} + \alpha_{k} p_{k} \, .

Quasi-Newton methods update the approximate Hessian at every iteration based on the latest information using an update of the form

H~k+1=H~k+ΔH~k,\tilde H_{k+1} = \tilde H_{k} + \Delta \tilde H_{k} \, ,

where the update ΔH~k\Delta \tilde H_{k} is a function of the last two gradients.

The first Hessian approximation is usually set to the identity matrix (or a scaled version of it), which yields a steepest-descent direction for the first line search (set H~=I\tilde H=I in Equation 4.71 to verify this).

We now develop the requirements for the approximate Hessian update. Suppose we just obtained the new point xk+1x_{k+1} after a line search starting from xkx_{k} in the direction pkp_{k}. We can write the new quadratic based on an updated Hessian as follows:

f~(xk+1+p)=fk+1+fk+1p+12pH~k+1p.\tilde f \left(x_{k+1} + p \right) = f_{k+1} + \nabla {f_{k+1}}^\intercal p + \frac{1}{2} {p}^\intercal \tilde H_{k+1} p \, .

We can assume that the new point’s function value and gradient are given, but we do not have the new approximate Hessian yet. Taking the gradient of this quadratic with respect to pp, we obtain

f~(xk+1+p)=fk+1+H~k+1p.\nabla \tilde f \left(x_{k+1} + p \right) = \nabla {f_{k+1}} + \tilde H_{k+1} p\, .

In the single-variable case, we observed that the quadratic approximation based on the secant method matched the slope of the actual function at the last two points. Therefore, it is logical to require the nn-dimensional quadratic based on the approximate Hessian to match the gradient of the actual function at the last two points.

The gradient of the new approximation (Equation 4.75) matches the gradient at the new point xk+1x_{k+1} by construction (just set p=0p=0). To find the gradient predicted by the new approximation (Equation 4.75) at the previous point xkx_{k}, we set p=xkxk+1=αkpkp = x_{k} - x_{k+1} = -\alpha_{k} p_{k} (which is a backward step from the end of the last line search to the start of the line search) to get

f~(xk+1αkpk)=fk+1αkH~k+1pk.\nabla \tilde f \left( x_{k+1} -\alpha_{k} p_{k} \right) = \nabla {f_{k+1}} -\alpha_{k} \tilde H_{k+1} p_{k} \, .

Now, we enforce that this must be equal to the actual gradient at that point,

fk+1αkH~k+1pk=fkαkH~k+1pk=fk+1fk.\begin{aligned} \nabla {f_{k+1}} -\alpha_{k} \tilde H_{k+1} p_{k} &= \nabla {f_{k}} \Rightarrow \\ \alpha_{k} \tilde H_{k+1} p_{k} &= \nabla {f_{k+1}} - \nabla {f_{k}} \, . \end{aligned}

To simplify the notation, we define the step as

sk=xk+1xk=αkpk,s_{k} = x_{k+1} - x_{k} = \alpha_{k} p_{k} \, ,

and the difference in the gradient as

yk=fk+1fk.y_{k} = \nabla {f_{k+1}} - \nabla {f_{k}} \, .

Figure 4.49 shows the step and the corresponding gradients.

Quasi-Newton methods use the gradient at the endpoint of each step to estimate the curvature in the step direction and update an approximation of the Hessian.

Figure 4.49:Quasi-Newton methods use the gradient at the endpoint of each step to estimate the curvature in the step direction and update an approximation of the Hessian.

Rewriting Equation 4.77 using this notation, we get

H~k+1sk=yk. \tilde H_{k+1} s_{k} = y_{k} \, .

This is called the secant equation and is a fundamental requirement for quasi-Newton methods. The result is intuitive when we recall the meaning of the product of the Hessian with a vector (Equation 4.12): it is the rate of change of the gradient in the direction defined by that vector. Thus, it makes sense that the change in the gradient predicted by the approximate Hessian over the step should match the actual difference between the gradients.[10]

We need H~\tilde H to be positive definite. Using the secant equation (Equation 4.80) and the definition of positive definiteness (sHs>0s^\intercal H s > 0), we see that this requirement implies that the predicted curvature is positive along the step; that is,

skyk>0.{s_{k}}^\intercal y_{k} > 0 \, .

This is called the curvature condition, and it is automatically satisfied if the line search finds a step that satisfies the strong Wolfe conditions.

The secant equation (Equation 4.80) is a linear system of nn equations where the step and the gradients are known. However, there are n(n+1)/2n(n+1)/2 unknowns in the approximate Hessian matrix (recall that it is symmetric), so this equation is not sufficient to determine the elements of H~\tilde H. The requirement of positive definiteness adds a further restriction, but that is still not enough to determine all the unknowns, leaving us with an infinite number of possibilities for H~\tilde H.

To find a unique H~k+1\tilde H_{k+1}, we rationalize that among all the matrices that satisfy the secant equation (Equation 4.80), H~k+1\tilde H_{k+1} should be the one closest to the previous approximate Hessian, H~k\tilde H_{k}. This makes sense intuitively because the curvature information gathered in one step is limited (because it is along a single direction) and should not change the Hessian approximation more than necessary to satisfy the requirements.

The original quasi-Newton update, known as DFP, was first proposed by Davidon and then refined by Fletcher and Powell (see historical note in Section 2.3) (Davidon, 1991; Fletcher & Powell, 1963). The DFP update formula has been superseded by the BFGS formula, which was independently developed by Broyden (1970), Fletcher (1970), Goldfarb (1970), and Shanno (1970). BFGS is currently considered the most effective quasi-Newton update, so we focus on this update. However, Section C.2.1 has more details on DFP.

The formal derivation of the BFGS update formula is rather involved, so we do not include it here. Instead, we work through an informal derivation that provides intuition about this update and quasi-Newton methods in general. We also include more details in Section C.2.2.

Recall that quasi-Newton methods add an update to the previous Hessian approximation (Equation 4.73). One way to think about an update that yields a matrix close to the previous one is to consider the rank of the update, ΔH~\Delta \tilde H. The lower the rank of the update, the closer the updated matrix is to the previous one. Also, the curvature information contained in this update is minimal because we are only gathering information in one direction for each update. Therefore, we can reason that the rank of the update matrix should be the lowest possible rank that satisfies the secant equation (Equation 4.80).

The update must be symmetric and positive definite to ensure a symmetric positive-definite Hessian approximation. If we start with a symmetric positive-definite approximation, then all subsequent approximations remain symmetric and positive definite. As it turns out, it is possible to derive a rank 1 update matrix that satisfies the secant equation, but this update is not guaranteed to be positive definite. However, we can get positive definiteness with a rank 2 update.

The self outer product of a vector produces a symmetric (n \times n) matrix of rank 1.

Figure 4.50:The self outer product of a vector produces a symmetric (n×n)(n \times n) matrix of rank 1.

We can obtain a symmetric rank 2 update by adding two symmetric rank 1 matrices. One convenient way to obtain a symmetric rank 1 matrix is to perform a self outer product of a vector, which takes a vector of size nn and multiplies it with its transpose to obtain an (n×n)(n \times n) matrix, as shown in Figure 4.50. Matrices resulting from vector outer products have rank 1 because all the columns are linearly dependent.

With two linearly independent vectors (uu and vv), we can get a rank 2 update using

H~k+1=H~k+αuu+βvv,\tilde H_{k+1} = \tilde H_{k} + \alpha u u^\intercal + \beta v v^\intercal \, ,

where α\alpha and β\beta are scalar coefficients. Substituting this into the secant equation (Equation 4.80), we have

H~ksk+αuusk+βvvsk=yk.\tilde H_{k} s_{k} + \alpha u u^\intercal s_{k} + \beta v v^\intercal s_{k} = y_{k} \, .

Because the new information about the function is encapsulated in the vectors yy and ss, we can reason that uu and vv should be based on these vectors. It turns out that using ss on its own does not yield a useful update (one term cancels out), but H~s\tilde H s does. Setting u=yu = y and v=H~sv = \tilde H s in Equation 4.83 yields

H~ksk+αykyksk+βH~ksk(H~ksk)sk=yk.\tilde H_{k} s_{k} + \alpha y_{k} {y_{k}}^\intercal s_{k} + \beta \tilde H_{k} s_{k} \left(\tilde H_{k} s_{k}\right)^\intercal s_{k} = y_{k} \, .

Rearranging this equation, we have

yk(1αyksk)=H~ksk(1+βskH~ksk).y_{k} \left( 1 - \alpha {y_{k}}^\intercal s_{k} \right) = \tilde H_{k} s_{k} \left( 1 + \beta {s_{k}}^\intercal \tilde H_{k} s_{k}\right) \, .

Because the vectors yky_{k} and H~ksk\tilde H_{k} s_{k} are not parallel in general (because the secant equation applies to H~k+1\tilde H_{k+1}, not to H~k\tilde H_{k}), the only way to guarantee this equality is to set the terms in parentheses to zero. Thus, the scalar coefficients are

α=1yksk,β=1skH~ksk.\alpha = \frac{1}{{y_{k}}^\intercal s_{k}}, \quad \beta = - \frac{1}{{s_{k}}^\intercal \tilde H_{k} s_{k}} \, .

Substituting these coefficients and the chosen vectors back into Equation 4.82, we get the BFGS update,

H~k+1=H~k+ykykykskH~kskskH~kskH~ksk.\tilde H_{k+1} = \tilde H_{k} + \frac{ {y_{k}} {y_{k}}^\intercal }{{y_{k}}^\intercal s_{k}} - \frac{ \tilde H_{k} s_{k} {s_{k}}^\intercal \tilde H_{k} }{{s_{k}}^\intercal \tilde H_{k} s_{k}} \, .

Although we did not explicitly enforce positive definiteness, the rank 2 update is positive definite, and therefore, all the Hessian approximations are positive definite, as long as we start with a positive-definite approximation.

Now recall that we want to solve the linear system that involves this matrix (Equation 4.71), so it would be more efficient to approximate the inverse of the Hessian directly instead. The inverse can be found analytically from the update (Equation 4.87) using the Sherman–Morrison–Woodbury formula.[11] Defining V~\tilde V as the approximation of the inverse of the Hessian, the final result is

V~k+1=(Iσkskyk)V~k(Iσkyksk)+σksksk, \tilde V_{k+1} = \left( I - \sigma_{k} s_{k} {y_{k}}^\intercal \right) \tilde V_{k} \left( I - \sigma_{k} y_{k} {s_{k}}^\intercal \right) + \sigma_{k} s_{k} {s_{k}}^\intercal \, ,

where

σk=1yksk.\sigma_{k} = \frac{1}{{y_{k}}^\intercal s_{k}} \, .

Figure 4.51 shows the sizes of the vectors and matrices involved in this equation.

Sizes of each term of the BFGS update (Equation 4.88).

Figure 4.51:Sizes of each term of the BFGS update (Equation 4.88).

Now we can replace the potentially costly solution of the linear system (Equation 4.71) with the much cheaper matrix-vector product,

pk=V~kfk, p_{k} = - \tilde V_{k} \nabla f_{k} \, ,

where V~\tilde V is the estimate for the inverse of the Hessian.
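The following is a minimal NumPy sketch of the inverse update (Equation 4.88) and the resulting search direction (Equation 4.90); it assumes the curvature condition holds, and the names are illustrative.

```python
import numpy as np

def bfgs_update_inverse(V, s, y):
    """Minimal sketch of the BFGS update of the inverse Hessian approximation
    (Equation 4.88), with sigma = 1 / (y^T s) as in Equation 4.89.
    Assumes the curvature condition s^T y > 0 holds."""
    sigma = 1.0 / (y @ s)
    I = np.eye(len(s))
    A = I - sigma * np.outer(s, y)          # (I - sigma s y^T)
    return A @ V @ A.T + sigma * np.outer(s, s)

# Search direction from the updated approximation (Equation 4.90):
# p = -bfgs_update_inverse(V, s, y) @ grad_new
```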

Algorithm 4.7 details the steps for the BFGS algorithm. Unlike first-order methods, we should not normalize the direction vector pkp_k because the length of the vector is meaningful. Once we have curvature information, the quasi-Newton step should give a reasonable estimate of where the function slope flattens. Thus, as advised for Newton’s method, we set αinit=1\alpha_\text{init} = 1. Alternatively, this would be equivalent to using a normalized direction vector and then setting αinit\alpha_\text{init} to the initial magnitude of pp. However, optimization algorithms in practice use αinit=1\alpha_\text{init} = 1 to signify that a full (quasi-) Newton step was accepted (see Tip 4.5).

As discussed previously, we need to start with a positive-definite estimate to maintain a positive-definite inverse Hessian. Typically, this is the identity matrix or a weighted identity matrix, for example:

V~0=1f0I.\tilde V_{0} = \frac{1}{\|\nabla f_{0}\|} I \, .

This makes the first step a normalized steepest-descent direction:

p0=V~0f0=f0f0.p_{0} = -\tilde V_{0}\nabla f_{0} = -\frac{\nabla f_{0}}{\|\nabla f_{0}\|} \, .

In a practical algorithm, V~\tilde V might require occasional resets to the scaled identity matrix. This is because as we iterate in the design space, curvature information gathered far from the current point might become irrelevant and even counterproductive. The trigger for this reset could occur when the directional derivative fp\nabla f^\intercal p is greater than some threshold. That would mean that the slope along the search direction is shallow; in other words, the search direction is close to orthogonal to the steepest-descent direction.

Another well-known quasi-Newton update is the symmetric rank 1 (SR1) update, which we derive in Section C.2.3. Because the update is rank 1, it does not guarantee positive definiteness. Why would we be interested in a Hessian approximation that is potentially indefinite? In practice, the matrices produced by SR1 have been found to approximate the true Hessian matrix well, often better than BFGS. This alternative is more common in trust-region methods (see Section 4.5), which depend more strongly on an accurate Hessian and do not require positive definiteness. It is also sometimes used for constrained optimization problems where the Hessian of the Lagrangian is often indefinite, even at the minimizer.

4.4.5 Limited-Memory Quasi-Newton Methods

When the number of design variables is large (millions or billions), it might not be possible to store the Hessian inverse approximation matrix in memory. This motivates limited-memory quasi-Newton methods, which make it possible to handle such problems. In addition, these methods also improve the computational efficiency of medium-sized problems (hundreds or thousands of design variables) with minimal sacrifice in accuracy.

Recall that we are only interested in the matrix-vector product V~f\tilde V \nabla f to find each search direction using Equation 4.90. As we will see in this section, we can compute this product without ever actually forming the matrix V~\tilde V. We focus on doing this for the BFGS update because this is the most popular approach (known as L-BFGS), although similar techniques apply to other quasi-Newton update formulas.

The BFGS update (Equation 4.88) is a recursive sequence:

V~k=[(Iσsy)V~(Iσys)+σss]k1,\tilde V_{k} = \left[(I - \sigma s y^\intercal) \tilde V (I - \sigma y s^\intercal) + \sigma s s^\intercal\right]_{k-1} \, ,

where

σ=1sy.\sigma = \frac{1}{{s}^\intercal y} \, .

If we save the sequence of ss and yy vectors and specify a starting value for V~0\tilde V_{0}, we can compute any subsequent V~k\tilde V_{k}. Of course, what we want is V~kfk\tilde V_{k} \nabla f_{k}, which we can also compute using an algorithm with the recurrence relationship. However, such an algorithm would not be advantageous from the memory-usage perspective because we would have to store a long sequence of vectors and a starting matrix.

To reduce the memory usage, we do not store the entire history of vectors. Instead, we limit the storage to the last mm vectors for ss and yy. In practice, mm is usually between 5 and 20.

Next, we make the starting Hessian diagonal such that we only require vector storage (or scalar storage if we make all entries in the diagonal equal). A common choice is to use a scaled identity matrix, which just requires storing one number,

V~0=syyyI,\tilde V_{0} = \frac{s^\intercal y}{y^\intercal y} I \, ,

where the ss and yy correspond to the previous iteration. Algorithm 4.8 details the procedure.
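In code, the product of the approximate inverse Hessian with the gradient is typically evaluated with a so-called two-loop recursion. The following is a rough sketch in that spirit (not a transcription of Algorithm 4.8); `s_hist` and `y_hist` are assumed to be Python lists holding the last mm stored pairs, newest last.

```python
import numpy as np

def lbfgs_direction(g, s_hist, y_hist):
    """Rough sketch of the L-BFGS two-loop recursion: returns the search
    direction -V_k @ g using only the stored (s, y) pairs, without ever
    forming a matrix. The starting matrix is the scaled identity built
    from the most recent pair."""
    if not s_hist:
        return -g                           # no history yet: steepest descent
    q = g.copy()
    m = len(s_hist)
    alphas = np.zeros(m)
    sigmas = [1.0 / (y @ s) for s, y in zip(s_hist, y_hist)]
    # First loop: newest to oldest
    for i in reversed(range(m)):
        alphas[i] = sigmas[i] * (s_hist[i] @ q)
        q -= alphas[i] * y_hist[i]
    # Scaled-identity starting inverse Hessian (uses the most recent pair)
    gamma = (s_hist[-1] @ y_hist[-1]) / (y_hist[-1] @ y_hist[-1])
    d = gamma * q
    # Second loop: oldest to newest
    for i in range(m):
        beta = sigmas[i] * (y_hist[i] @ d)
        d += (alphas[i] - beta) * s_hist[i]
    return -d                               # search direction p_k
```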

Using this technique, we no longer need to bear the memory cost of storing a large matrix or incur the computational cost of a large matrix-vector product. Instead, we store a small number of vectors and require only a relatively small number of vector-vector products (a cost that scales linearly with nn rather than quadratically).

The number of major iterations is not always an effective way to compare performance. For example, Newton’s method takes fewer major iterations, but each iteration in Newton’s method is more expensive than each iteration in the quasi-Newton method. This is because Newton’s method requires a linear solution, which is an O(n3)\mathcal{O}(n^3) operation, as opposed to a matrix-vector multiplication, which is an O(n2)\mathcal{O}(n^2) operation. For a small problem like the two-dimensional Rosenbrock function, this is an insignificant difference, but this is a significant difference in computational effort for large problems. Additionally, each major iteration includes a line search, and depending on the quality of the search direction, the number of function calls contained in each iteration will differ.

Poor scaling causes premature convergence for various reasons. In Example 4.19, it was because convergence was based on a tolerance relative to the starting gradient, and some gradient components were much smaller than others. When using an absolute tolerance, premature convergence can occur when the gradients are small to begin with (because of the scale of the problem, not because they are near an optimum). When the scaling is poor, the optimizer is even more dependent on accurate gradients to navigate the narrow regions of function improvement.

Larger engineering simulations are usually more susceptible to numerical noise due to iteration loops, solver convergence tolerances, and longer computational procedures. Another issue arises when the derivatives are not computed accurately. In these cases, poorly scaled problems struggle because the line search directions are not accurate enough to yield a decrease, except for tiny step sizes.

Most practical optimization algorithms terminate early when this occurs, not because the optimality conditions are satisfied but because the step sizes or function changes are too small, and progress is stalled (see Tip 4.1). A lack of attention to scaling is one of the most frequent causes of poor solutions in engineering optimization problems.

4.5 Trust-Region Methods

In Section 4.2, we mentioned two main approaches for unconstrained gradient-based optimization: line search and trust region. We described the line search in Section 4.3 and the associated methods for computing search directions in Section 4.4. Now we describe trust-region methods, also known as restricted-step methods. The main motivation for trust-region methods is to address the issues with Newton’s method (see Section 4.4.3) and quasi-Newton updates that do not guarantee a positive-definite Hessian approximation (e.g., SR1, which we briefly described in Section 4.4.4).

The trust-region approach is fundamentally different from the line search approach because it finds the direction and distance of the step simultaneously instead of finding the direction first and then the distance. The trust-region approach builds a model of the function to be minimized and then minimizes the model within a trust region, within which we trust the model to be good enough for our purposes.

The most common model is a local quadratic function, but other models may be used. When using a quadratic model based on the function value, gradient, and Hessian at the current iteration, the method is similar to Newton’s method.

Trust-region methods minimize a model within a trust region for each iteration, and then they update the trust-region size and the model before the next iteration.

Figure 4.60:Trust-region methods minimize a model within a trust region for each iteration, and then they update the trust-region size and the model before the next iteration.

The trust region is centered about the current iteration point and can be defined as an nn-dimensional box, sphere, or ellipsoid of a given size. Each trust-region iteration consists of the following main steps:

  1. Create or update the model about the current point.

  2. Minimize the model within the trust region.

  3. Move to the new point, update values, and adapt the size of the trust region.

These steps are illustrated in Figure 4.60, and they are repeated until convergence. Figure 4.61 shows the steps to minimize the bean function, where the circles show the trust regions for each iteration.

Path for the trust-region approach showing the circular trust regions at each step.

Figure 4.61:Path for the trust-region approach showing the circular trust regions at each step.

The trust-region subproblem solved at each iteration is

minimizesf~(s)subject tosΔ,\begin{aligned} \underset{s}{\text{minimize}} & \quad \tilde f(s)\\ \text{subject to} & \quad \|s\| \le \Delta \, , \\ \end{aligned}

where f~(s)\tilde f(s) is the local trust-region model, ss is the step from the current iteration point, and Δ\Delta is the size of the trust region. We use ss instead of pp to indicate that this is a step vector and not simply the direction vector used in methods based on a line search.

The subproblem (Equation 4.97) defines the trust region as a norm. The Euclidean norm, s2\|s\|_2, defines a spherical trust region and is the most common type of trust region. Sometimes \infty-norms are used instead because they are easy to apply, but 1-norms are rarely used because they are just as complex as 2-norms but introduce sharp corners that can be problematic (Conn et al., 2000). The shape of the trust region is dictated by the norm (see Figure A.8) and can significantly affect the convergence rate. The ideal trust-region shape depends on the local function space, and some algorithms allow for the trust-region shape to change throughout the optimization.

4.5.1 Quadratic Model with Spherical Trust Region

Using a quadratic trust-region model and the Euclidean norm, we can define the more specific subproblem:

minimizesf~(s)=fk+fks+12sH~kssubject tos2Δk, \begin{aligned} \underset{s}{\text{minimize}} &\quad \tilde{f}(s) = f_{k} + \nabla {f_{k}}^\intercal s + \frac{1}{2}s^\intercal \tilde H_{k} s\\ \text{subject to} &\quad \|s\|_2 \le \Delta_{k} \, , \end{aligned}

where H~k\tilde H_{k} is the approximate (or true) Hessian at our current iterate. This problem has a quadratic objective and quadratic constraints and is called a quadratically constrained quadratic program (QCQP).

If the problem were unconstrained and H~\tilde H positive definite, we could get to the solution using a single step, s=H~k1fks = -{\tilde H_{k}}^{-1} \nabla f_{k}. However, because of the constraint, there is no analytic solution for the QCQP. Although the problem is still straightforward to solve numerically (because it is a convex problem; see Section 11.4), it requires an iterative solution approach with multiple factorizations.

Similar to the line search, where we only obtain a sufficiently good point instead of finding the exact minimum, in the trust-region subproblem, we seek an approximate solution to the QCQP. Including the trust-region constraint allows us to omit the requirement that H~k\tilde H_{k} be positive definite, which is used in most quasi-Newton methods. We do not detail approximate solution methods to the QCQP, but there are various options (Conn et al., 2000; Steihaug, 1983; Nocedal & Wright, 2006).
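As one concrete illustration (not a method this text covers in detail), the Cauchy point, which minimizes the quadratic model along the steepest-descent direction while staying inside the spherical trust region, provides a cheap approximate solution of Equation 4.98.

```python
import numpy as np

def cauchy_point(g, H, delta):
    """Cauchy point: approximate solution of the trust-region subproblem
    (Equation 4.98) obtained by minimizing the quadratic model along the
    steepest-descent direction within a spherical region of radius delta.
    Offered only as an illustration of one of the 'various options'."""
    gnorm = np.linalg.norm(g)
    gHg = g @ H @ g
    if gHg <= 0.0:
        tau = 1.0                                   # nonpositive curvature: step to the boundary
    else:
        tau = min(1.0, gnorm**3 / (delta * gHg))    # unconstrained minimizer may lie inside
    return -tau * (delta / gnorm) * g
```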

Figure 4.62 compares the bean function with a local quadratic model, which is built using information about the point where the arrow originates. The trust-region step seeks the minimum of the local quadratic model within the circular trust region. Unlike line search methods, as the size of the trust region changes, the direction of the step (the solution to Equation 4.98) might also change, as shown on the right panel of Figure 4.62.

Quadratic model (gray contours) compared to the actual function (blue contours), and two different trust-region sizes (red circles). The trust-region step s_{k} finds the minimum of the quadratic model while remaining within the trust region. The steepest-descent direction p is shown for comparison.

(a)

Quadratic model (gray contours) compared to the actual function (blue contours), and two different trust-region sizes (red circles). The trust-region step s_{k} finds the minimum of the quadratic model while remaining within the trust region. The steepest-descent direction p is shown for comparison.

(b)

Figure 4.62:Quadratic model (gray contours) compared to the actual function (blue contours), and two different trust-region sizes (red circles). The trust-region step sks_{k} finds the minimum of the quadratic model while remaining within the trust region. The steepest-descent direction pp is shown for comparison.

4.5.2 Trust-Region Sizing Strategy

This section presents an algorithm for updating the size of the trust region at each iteration. The trust region can grow, shrink, or remain the same, depending on how well the model predicts the actual function decrease. The metric we use to assess the model is the actual function decrease divided by the expected decrease:

r=f(x)f(x+s)f~(0)f~(s).r = \frac{f(x)-f(x+s)}{\tilde f(0) - \tilde f(s)} \, .

The denominator in this definition is the expected decrease, which is always positive. The numerator is the actual change in the function, which could be a reduction or an increase. An rr value close to unity means that the model agrees well with the actual function. An rr value larger than 1 is fortuitous and means that the actual decrease was even greater than expected. A negative value of rr means that the function actually increased at the expected minimum, and therefore the model is not suitable.

The trust-region sizing strategy in Algorithm 4.9 determines the size of the trust region at each major iteration kk based on the value of rkr_{k}. The parameters in this algorithm are not derived from any theory; instead, they are empirical. The algorithm uses the basic procedure from Nocedal & Wright (2006) with the parameters recommended by Conn et al. (2000).

The initial value of Δ\Delta is usually 1, assuming the problem is already well scaled. One way to rationalize the trust-region method is that the quadratic approximation of a nonlinear function is guaranteed to be reasonable only within a limited region around the current point xkx_{k}. Thus, we minimize the quadratic function within a region around xkx_{k} within which we trust the quadratic model.

When our model performs well, we expand the trust region. When it performs poorly, we shrink the trust region. If we shrink the trust region sufficiently, our local model will eventually be a good approximation of the actual function, as dictated by the Taylor series expansion.

We should also set a maximum trust-region size (Δmax\Delta_\text{max}) to prevent the trust region from expanding too much. Otherwise, it may take too long to reduce the trust region to an acceptable size over other portions of the design space where a smaller trust region is needed.
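The following sketch illustrates a sizing rule of this general form. The thresholds and scaling factors (0.25, 0.75, halving, doubling) are common textbook defaults assumed for this example; they are not necessarily the parameters used in Algorithm 4.9.

```python
def update_trust_region(r, delta, step_at_boundary, delta_max):
    """Sketch of a trust-region sizing rule: shrink when the model predicts
    the function decrease poorly, expand (up to delta_max) when the model
    agrees well and the step reached the trust-region boundary."""
    if r < 0.25:
        delta = 0.5 * delta                    # poor agreement: shrink
    elif r > 0.75 and step_at_boundary:
        delta = min(2.0 * delta, delta_max)    # good agreement at the boundary: expand
    return delta

# The step itself is accepted only if the function actually decreased (r > 0);
# otherwise the current point is kept and the subproblem is re-solved with the smaller region.
```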

The same convergence criteria used in other gradient-based methods are applicable.[14]

4.5.3 Comparison with Line Search Methods

Trust-region methods are typically more strongly dependent on accurate Hessians than are line search methods. For this reason, they are usually only effective when exact gradients (or better yet, an exact Hessian) can be supplied. Many optimization packages require the user to provide the full Hessian, or at least the gradients, to use a trust-region approach. Trust-region methods usually require fewer iterations than quasi-Newton methods with a line search, but each iteration is more computationally expensive because they require at least one matrix factorization.

Scaling can also be more challenging with trust-region approaches. Newton’s method is invariant under scaling, but a Euclidean trust-region constraint implicitly assumes that the function changes in each direction at a similar rate. Some enhancements try to address this issue through elliptical trust regions rather than spherical ones.

4.6 Summary

Gradient-based optimization is powerful because gradients make it possible to efficiently navigate nn-dimensional space in a series of steps converging to an optimum. The gradient also determines when the optimum has been reached, which is when the gradient is zero.

Gradients provide only local information, so an approach that ensures a function decrease when stepping away from the current point is required. There are two approaches to ensure this: line search and trust region. Algorithms based on a line search have two stages: finding an appropriate search direction and determining how far to step in that direction. Trust-region algorithms minimize a surrogate function within a finite region around the current point. The region expands or contracts, depending on how well the optimization within the previous iteration went. Gradient-based optimization algorithms based on a line search are more prevalent than trust-region methods, but trust-region methods can be effective when second derivatives are available.

There are different options for determining the search direction for each line search using gradient information. Although the negative gradient points in the steepest-descent direction, following this direction is not the best approach because it is prone to oscillations. The conjugate gradient method dampens these oscillations and thus converges much faster than steepest descent.

Second-order methods use curvature information, which dramatically improves the rate of convergence. Newton’s method converges quadratically but requires the Hessian of the function, which can be prohibitive. Quasi-Newton methods circumvent this requirement by building an approximation of the inverse of the Hessian based on changes in the gradients along the optimization path. Quasi-Newton methods also avoid matrix factorization, requiring matrix-vector multiplication instead. Because they are much less costly while achieving better than linear convergence, quasi-Newton methods are widely used. Limited-memory quasi-Newton methods can be used when the problem is too large to fit in computer memory.

The line search in a given direction does not seek to find a minimum because this is not usually worthwhile. Instead, it seeks to find a “good enough” point that sufficiently decreases the function and the slope. Once such a point is found, we select a new search direction and repeat the process. Second-order methods provide a guess for the first step length in the line search that further improves overall convergence.

This chapter provides the building blocks for the gradient-based constrained optimization covered in the next chapter.

Problems

Footnotes
  1. In this book, most of the illustrations and examples are based on two-dimensional problems because they are easy to visualize. However, the principles and methods apply to nn dimensions.

  2. The discovery and proof of the symmetry of second derivatives has a long history (Higgins, 1940).

  3. For a more extensive introduction to the Taylor series, see Taylor Series Expansion.

  4. For other approaches to determine if a matrix is positive definite, see Section A.6.

  5. This algorithm, and others in this section, use a basic convergence check for simplicity. See the end of Section 4.1.4 for better alternatives and additional exit criteria.

  6. This condition can be problematic near a local minimum because ϕ(0)\phi(0) and ϕ(α)\phi(\alpha) are so similar that their subtraction is inaccurate. Hager & Zhang (2005) introduced a condition with improved accuracy, along with an efficient line search based on a secant method.

  7. A similar algorithm is detailed in Chapter 3 of Nocedal & Wright (2006).

  8. If ff can be negative, a transformation is required to ensure that the logarithm argument is always positive.

  9. For more details on the line search requirements, see Sec. 5.2 in Nocedal & Wright (2006).

  10. The secant equation is also known as the quasi-Newton condition.

  11. This formula is also known as the Woodbury matrix identity. Given a matrix and an update to that matrix, it yields an explicit expression for the inverse of the updated matrix in terms of the inverses of the matrix and the update (see Section C.3).

  12. Section D.1.8 has details on this problem.

  13. The “bean” function we used in previous examples is a milder version of the Rosenbrock function.

  14. Conn et al. (2000) provide more detail on trust-region problems, including trust-region norms and scaling, approaches to solving the trust-region subproblem, extensions to the model, and other important practical considerations.

  15. Another option to smooth the max of multiple functions is aggregation, which is detailed in Section 5.7.

  16. This problem was mentioned in Section 2.2 as one of the problems that inspired developments in calculus of variations.

References
  1. Higgins, T. J. (1940). A note on the history of mixed partial derivatives. Scripta Mathematica, 7, 59–62.
  2. Hager, W. W., & Zhang, H. (2005). A new conjugate gradient method with guaranteed descent and an efficient line search. SIAM Journal on Optimization, 16(1), 170–192. 10.1137/030601880
  3. Moré, J. J., & Thuente, D. J. (1994). Line search algorithms with guaranteed sufficient decrease. ACM Transactions on Mathematical Software (TOMS), 20(3), 286–307. 10.1145/192115.192132
  4. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. 10.1007/978-0-387-40065-5
  5. Davidon, W. C. (1991). Variable metric method for minimization. SIAM Journal on Optimization, 1(1), 1–17. 10.1137/0801001
  6. Fletcher, R., & Powell, M. J. D. (1963). A rapidly convergent descent method for minimization. The Computer Journal, 6(2), 163–168. 10.1093/comjnl/6.2.163
  7. Broyden, C. G. (1970). The convergence of a class of double-rank minimization algorithms 1. General considerations. IMA Journal of Applied Mathematics, 6(1), 76–90. 10.1093/imamat/6.1.76
  8. Fletcher, R. (1970). A new approach to variable metric algorithms. The Computer Journal, 13(3), 317–322. 10.1093/comjnl/13.3.317
  9. Goldfarb, D. (1970). A family of variable-metric methods derived by variational means. Mathematics of Computation, 24(109), 23–26. 10.1090/s0025-5718-1970-0258249-6
  10. Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111), 647–656. 10.1090/s0025-5718-1970-0274029-x
  11. Conn, A. R., Gould, N. I. M., & Toint, Ph. L. (2000). Trust Region Methods. SIAM.
  12. Steihaug, T. (1983). The conjugate gradient method and trust regions in large scale optimization. SIAM Journal on Numerical Analysis, 20(3), 626–637. 10.1137/0720042