

Appendix: Quasi-Newton Methods

C.1 Broyden’s Method

Broyden’s method is the extension of the secant method (from Section 3.8) to $n$ dimensions [1]. It can also be viewed as the analog of the quasi-Newton methods from Section 4.4.4 for solving equations (as opposed to finding a minimum).

Using the notation from Chapter 3, suppose we have a set of $n$ equations $r(u) = [r_1, \ldots, r_n] = 0$ and $n$ unknowns $u = [u_1, \ldots, u_n]$. Writing a Taylor series expansion of $r(u)$ and keeping only the linear term yields

$$J_{k+1} \left( u_{k+1} - u_k \right) \approx r_{k+1} - r_k \, ,$$

where $J$ is the $(n \times n)$ Jacobian, $\partial r / \partial u$. Defining the step in $u$ as

$$s_k = u_{k+1} - u_k \, ,$$

and the change in the residuals as

$$y_k = r_{k+1} - r_k \, ,$$

we can write Equation C.1 as

$$\tilde J_{k+1} s_k = y_k \, .$$

This is the equivalent of the secant equation (Equation 4.80). The difference is that we now approximate the Jacobian instead of the Hessian. The right-hand side is the difference between two subsequent function values (which quantifies the directional derivative along the last step) instead of the difference between gradients (which quantifies the curvature).

We seek a rank 1 update of the form

$$\tilde J = \tilde J_k + v v^\intercal \, ,$$

where the self outer product $v v^\intercal$ yields a symmetric matrix of rank 1. Substituting this update into the required condition (Equation C.4) yields

$$\left( \tilde J_k + v v^\intercal \right) s_k = y_k \, .$$

Post-multiplying both sides by $s_k^\intercal$, rearranging, and dividing by $s_k^\intercal s_k$ yields

$$v v^\intercal = \frac{\left( y_k - \tilde J_k s_k \right) s_k^\intercal}{s_k^\intercal s_k} \, .$$

Substituting this result into the update (Equation C.5), we get the Jacobian approximation update,

$$\tilde J_{k+1} = \tilde J_k + \frac{\left( y_k - \tilde J_k s_k \right) s_k^\intercal}{s_k^\intercal s_k} \, ,$$

where

$$y_k = r_{k+1} - r_k$$

is the difference in the function values (as opposed to the difference in the gradients used in optimization).

This update can be inverted using the Sherman–Morrison–Woodbury formula (Section C.3) to get the more useful update on the inverse of the Jacobian,

$$\tilde J^{-1}_{k+1} = \tilde J^{-1}_k + \frac{\left( s_k - \tilde J^{-1}_k y_k \right) y_k^\intercal}{y_k^\intercal y_k} \, .$$

We can start with $\tilde J^{-1}_0 = I$. Similar to the Newton step (Equation 3.30), the step in Broyden’s method is given by solving a linear system. Because the inverse is provided explicitly, we can just perform the multiplication,

$$\Delta u_k = - \tilde J^{-1}_k r_k \, .$$

Then we update the variables as

$$u_{k+1} = u_k + \Delta u_k \, .$$
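
To make the procedure concrete, the following is a minimal Python sketch of Broyden’s method based on the inverse-Jacobian update above. The residual function, starting point, and tolerances are illustrative choices for this example, not part of the original text.

```python
import numpy as np

def broyden(r, u0, tol=1e-10, max_iter=100):
    """Solve r(u) = 0 with Broyden's method, maintaining an approximation
    of the inverse Jacobian instead of forming and factorizing the Jacobian."""
    u = np.asarray(u0, dtype=float)
    J_inv = np.eye(u.size)              # start with the identity as the inverse Jacobian
    r_k = r(u)
    for _ in range(max_iter):
        s = -J_inv @ r_k                # step: Delta u_k = -J_inv_k r_k
        u = u + s
        r_next = r(u)
        if np.linalg.norm(r_next) < tol:
            break
        y = r_next - r_k                # change in the residuals
        # Inverse-Jacobian update: J_inv += (s - J_inv y) y^T / (y^T y)
        J_inv += np.outer(s - J_inv @ y, y) / (y @ y)
        r_k = r_next
    return u

# Mildly nonlinear test system (a hypothetical example)
r = lambda u: np.array([u[0] + 0.1 * u[1]**2 - 1.0,
                        u[1] + 0.1 * u[0]**2 - 2.0])
print(broyden(r, [0.0, 0.0]))
```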

C.2 Additional Quasi-Newton Approximations

In Section 4.4.4, we introduced the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton approximation for unconstrained optimization, which was also used in Section 5.5 for constrained optimization. Here we expand on that to introduce other quasi-Newton approximations and generalize them.

To get a unique solution for the approximate Hessian update, quasi-Newton methods quantify the “closeness” of successive Hessian approximations by using some norm of the difference between the two matrices, leading to the following optimization problem:

$$\begin{aligned} \text{minimize} & \quad \|\tilde H - \tilde H_k\| \\ \text{by varying} & \quad \tilde H \\ \text{subject to} & \quad \tilde H = \tilde H^\intercal \\ & \quad \tilde H s_k = y_k \, , \end{aligned}$$

where $y_k = \nabla f_{k+1} - \nabla f_k$ and $s_k = x_{k+1} - x_k$ (the latest step). There are several possibilities for quantifying the “closeness” between the matrices while satisfying the constraints, leading to different quasi-Newton updates. With a convenient choice of matrix norm, we can solve this optimization problem analytically to obtain a formula for $\tilde H_{k+1}$ as a function of $\tilde H_k$, $s_k$, and $y_k$.

The optimization problem (Equation C.13) does not enforce a positive-definiteness constraint. It turns out that the update formula always produces a $\tilde H_{k+1}$ that is positive definite, provided that $\tilde H_k$ is positive definite. The fact that the curvature condition (Equation 4.81) is satisfied for each step helps with this.

C.2.1 Davidon–Fletcher–Powell Update

The Davidon–Fletcher–Powell (DFP) update can be derived using a similar approach to that used to derive the BFGS update in Section 4.4.4. However, instead of starting with the update for the Hessian, we start with the update to the Hessian inverse,

$$\tilde V_{k+1} = \tilde V_k + \alpha u u^\intercal + \beta v v^\intercal \, .$$

We need the inverse version of the secant equation (Equation 4.80), which is

$$\tilde V_{k+1} y_k = s_k \, .$$

Setting $u = s_k$ and $v = \tilde V_k y_k$ in the update (Equation C.14) and substituting it into the inverse version of the secant equation (Equation C.15), we get

$$\tilde V_k y_k + \alpha s_k s_k^\intercal y_k + \beta \tilde V_k y_k y_k^\intercal \tilde V_k y_k = s_k \, .$$

We can obtain the coefficients α\alpha and β\beta by rearranging this equation and using similar arguments to those used in the BFGS update derivation (see Section 4.4.4). The DFP update for the Hessian inverse approximation is

$$\tilde V_{k+1} = \tilde V_k + \frac{1}{y_k^\intercal s_k} s_k s_k^\intercal - \frac{1}{y_k^\intercal \tilde V_k y_k} \tilde V_k y_k y_k^\intercal \tilde V_k \, .$$

However, the DFP update was originally derived by solving the optimization problem (Equation C.13), which minimizes a matrix norm of the update while enforcing symmetry and the secant equation. This problem can be solved analytically through the Karush–Kuhn–Tucker (KKT) conditions and a convenient matrix norm. The weighted Frobenius norm (Equation A.35) was the norm used in this case, where the weights were based on an averaged Hessian inverse.

The derivation is lengthy and is not included here. The final result is the update,

$$\tilde H_{k+1} = \left( I - \sigma_k y_k s_k^\intercal \right) \tilde H_k \left( I - \sigma_k s_k y_k^\intercal \right) + \sigma_k y_k y_k^\intercal \, ,$$

where

$$\sigma_k = \frac{1}{y_k^\intercal s_k} \, .$$

This can be inverted using the Sherman–Morrison–Woodbury formula (Section C.3) to get the update on the inverse (Equation C.17).
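
As a small illustration (added here, not from the original text), the DFP update of the inverse Hessian approximation (Equation C.17) can be written as a single function. It assumes the current approximation is symmetric, so that $(\tilde V_k y_k)(\tilde V_k y_k)^\intercal = \tilde V_k y_k y_k^\intercal \tilde V_k$.

```python
import numpy as np

def dfp_update(V, s, y):
    """DFP update of the inverse Hessian approximation:
    V_{k+1} = V + s s^T / (y^T s) - (V y)(V y)^T / (y^T V y), assuming V is symmetric."""
    Vy = V @ y
    return V + np.outer(s, s) / (y @ s) - np.outer(Vy, Vy) / (y @ Vy)
```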

C.2.2 BFGS

The BFGS update was informally derived in Section 4.4.4.

As discussed previously, obtaining an approximation of the Hessian inverse is a more efficient way to get the quasi-Newton step.

Similar to DFP, BFGS was originally formally derived by analytically solving an optimization problem. However, instead of solving the optimization problem of Equation C.13, we solve a similar problem using the Hessian inverse approximation instead. This problem can be stated as

$$\begin{aligned} \text{minimize} \quad & \|\tilde V - \tilde V_k\| \\ \text{subject to} \quad & \tilde V y_k = s_k \\ & \tilde V = \tilde V^\intercal \, , \end{aligned}$$

where $\tilde V$ is the updated inverse Hessian approximation that we seek, and $\tilde V_k$ is the inverse Hessian approximation from the previous step. The first constraint is the secant equation applied to the inverse, and the second constraint enforces a symmetric update. We do not explicitly specify positive definiteness. The matrix norm is again a weighted Frobenius norm (Equation A.35), but now the weights are based on an averaged Hessian (instead of the averaged Hessian inverse used for DFP). Solving this optimization problem (Equation C.20), the final result is

$$\tilde V_{k+1} = \left( I - \sigma_k s_k y_k^\intercal \right) \tilde V_k \left( I - \sigma_k y_k s_k^\intercal \right) + \sigma_k s_k s_k^\intercal \, ,$$

where

$$\sigma_k = \frac{1}{y_k^\intercal s_k} \, .$$

This is identical to Equation 4.88.
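
For comparison with the DFP sketch above, here is the corresponding illustrative function for the BFGS inverse update; the function name and structure are assumptions for this example.

```python
import numpy as np

def bfgs_update(V, s, y):
    """BFGS update of the inverse Hessian approximation (Equation 4.88):
    V_{k+1} = (I - sigma s y^T) V (I - sigma y s^T) + sigma s s^T."""
    sigma = 1.0 / (y @ s)
    A = np.eye(len(s)) - sigma * np.outer(s, y)
    return A @ V @ A.T + sigma * np.outer(s, s)
```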

C.2.3 Symmetric Rank 1 Update

The symmetric rank 1 (SR1) update is a quasi-Newton update of rank 1, as opposed to the rank 2 updates of DFP and BFGS (Equation C.14). The SR1 update can be derived formally without solving the optimization problem of Equation C.13 because there is only one symmetric rank 1 update that satisfies the secant equation.

Similar to the rank 2 update of the approximate inverse Hessian (Equation 4.82), we construct the update,

$$\tilde V = \tilde V_k + \alpha v v^\intercal \, ,$$

where we only need one self outer product to produce a rank 1 update (as opposed to two).

Substituting the rank 1 update (Equation C.23) into the secant equation, we obtain

$$\tilde V_k y_k + \alpha v v^\intercal y_k = s_k \, .$$

Rearranging yields

$$\left( \alpha v^\intercal y_k \right) v = s_k - \tilde V_k y_k \, .$$

Thus, we have to make sure that $v$ is in the direction of $s_k - \tilde V_k y_k$. The scalar $\alpha$ must be such that the scaling of the vectors on both sides of the equation matches. We define a normalized $v$ in the desired direction,

$$v = \frac{s_k - \tilde V_k y_k}{\left\| s_k - \tilde V_k y_k \right\|_2} \, .$$

To find the correct value for $\alpha$, we substitute Equation C.26 into Equation C.25 to get

$$s_k - \tilde V_k y_k = \alpha \, \frac{s_k^\intercal y_k - y_k^\intercal \tilde V_k y_k}{\left\| s_k - \tilde V_k y_k \right\|_2^2} \left( s_k - \tilde V_k y_k \right) \, .$$

Solving for α\alpha yields

$$\alpha = \frac{\left\| s_k - \tilde V_k y_k \right\|_2^2}{s_k^\intercal y_k - y_k^\intercal \tilde V_k y_k} \, .$$

Substituting Equation C.26 and Equation C.28 into Equation C.23, we get the SR1 update

$$\tilde V_{k+1} = \tilde V_k + \frac{1}{s_k^\intercal y_k - y_k^\intercal \tilde V_k y_k} \left( s_k - \tilde V_k y_k \right) \left( s_k - \tilde V_k y_k \right)^\intercal \, .$$

Because the denominator in this update can be zero (or nearly so), the update requires safeguarding. In addition, the updated matrix is not guaranteed to be positive definite because the denominator can be negative.

As in the BFGS method, the search direction at each major iteration is given by $p_k = -\tilde V_k \nabla f_k$, and a line search with $\alpha_\text{init} = 1$ determines the final step length.
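
The sketch below illustrates the SR1 inverse update with a simple safeguard that skips the update when the denominator is too small; the skip test and tolerance are common choices rather than something prescribed in the text.

```python
import numpy as np

def sr1_update(V, s, y, r=1e-8):
    """SR1 update of the inverse Hessian approximation, skipping the update
    when the denominator s^T y - y^T V y is close to zero."""
    d = s - V @ y                  # s_k - V_k y_k
    denom = y @ d                  # equals s^T y - y^T V y
    if abs(denom) < r * np.linalg.norm(y) * np.linalg.norm(d):
        return V                   # safeguard: skip the update
    return V + np.outer(d, d) / denom
```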

C.2.4 Unification of SR1, DFP, and BFGS

The SR1, DFP, and BFGS updates for the inverse Hessian approximation can be expressed using the following more general formula:

$$\tilde V_{k+1} = \tilde V_k + \begin{bmatrix} \tilde V_k y_k & s_k \end{bmatrix} \begin{bmatrix} \alpha & \beta \\ \beta & \gamma \end{bmatrix} \begin{bmatrix} y_k^\intercal \tilde V_k \\ s_k^\intercal \end{bmatrix} \, .$$

For the SR1 method, we have

$$\begin{aligned} \alpha_{\text{SR1}} &= \frac{1}{y_k^\intercal s_k - y_k^\intercal \tilde V_k y_k} \\ \beta_{\text{SR1}} &= -\frac{1}{y_k^\intercal s_k - y_k^\intercal \tilde V_k y_k} \\ \gamma_{\text{SR1}} &= \frac{1}{y_k^\intercal s_k - y_k^\intercal \tilde V_k y_k} \, . \end{aligned}$$

For the DFP method, we have

$$\alpha_{\text{DFP}} = -\frac{1}{y_k^\intercal \tilde V_k y_k} \, , \quad \beta_{\text{DFP}} = 0 \, , \quad \gamma_{\text{DFP}} = \frac{1}{y_k^\intercal s_k} \, .$$

For the BFGS method, we have

$$\alpha_{\text{BFGS}} = 0 \, , \quad \beta_{\text{BFGS}} = -\frac{1}{y_k^\intercal s_k} \, , \quad \gamma_{\text{BFGS}} = \frac{1}{y_k^\intercal s_k} + \frac{y_k^\intercal \tilde V_k y_k}{\left( y_k^\intercal s_k \right)^2} \, .$$
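
To show the correspondence concretely, the general form can be coded once with the three coefficient sets passed in. This is an illustrative sketch with placeholder names; it assumes $\tilde V_k$ is symmetric so that $(\tilde V_k y_k)^\intercal = y_k^\intercal \tilde V_k$.

```python
import numpy as np

def general_update(V, s, y, coeffs):
    """General quasi-Newton inverse update:
    V_{k+1} = V + [V y, s] [[a, b], [b, g]] [y^T V; s^T]."""
    a, b, g = coeffs(V, s, y)
    U = np.column_stack((V @ y, s))     # n x 2 block [V y, s]
    M = np.array([[a, b], [b, g]])
    return V + U @ M @ U.T              # U.T equals [y^T V; s^T] when V is symmetric

def sr1_coeffs(V, s, y):
    c = 1.0 / (y @ s - y @ V @ y)
    return c, -c, c

def dfp_coeffs(V, s, y):
    return -1.0 / (y @ V @ y), 0.0, 1.0 / (y @ s)

def bfgs_coeffs(V, s, y):
    ys = y @ s
    return 0.0, -1.0 / ys, 1.0 / ys + (y @ V @ y) / ys**2
```

For instance, `general_update(V, s, y, bfgs_coeffs)` reproduces the BFGS update of Section C.2.2, and swapping in `dfp_coeffs` or `sr1_coeffs` reproduces the DFP and SR1 updates.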

C.3 Sherman–Morrison–Woodbury Formula

The formal derivations of the DFP and BFGS methods use the Sherman–Morrison–Woodbury formula (also known as the Woodbury matrix identity). Suppose that the inverse of a matrix is known, and then the matrix is perturbed. The Sherman–Morrison–Woodbury formula gives the inverse of the perturbed matrix without having to re-invert the perturbed matrix. We used this formula in Section 4.4.4 to derive the quasi-Newton update.

One possible perturbation is a rank 1 update of the form

$$\hat A = A + u v^\intercal \, ,$$

where $u$ and $v$ are $n$-vectors. This is a rank 1 update to $A$ because the outer product $u v^\intercal$ produces a matrix whose rank is equal to 1 (see Figure 4.50).

If $\hat A$ is nonsingular and $A^{-1}$ is known, the Sherman–Morrison–Woodbury formula gives

$$\hat A^{-1} = A^{-1} - \frac{A^{-1} u v^\intercal A^{-1}}{1 + v^\intercal A^{-1} u} \, .$$

This formula can be verified by multiplying Equation C.34 and Equation C.35, which yields the identity matrix.
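
To make the verification explicit (a step added here for clarity), carrying out the multiplication and using the fact that $v^\intercal A^{-1} u$ is a scalar gives

$$\left( A + u v^\intercal \right) \left( A^{-1} - \frac{A^{-1} u v^\intercal A^{-1}}{1 + v^\intercal A^{-1} u} \right) = I + u v^\intercal A^{-1} - \frac{u \left( 1 + v^\intercal A^{-1} u \right) v^\intercal A^{-1}}{1 + v^\intercal A^{-1} u} = I \, .$$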

This formula can be generalized for higher-rank updates as follows:

$$\hat A = A + U V^\intercal \, ,$$

where $U$ and $V$ are $(n \times p)$ matrices for some $p$ between 1 and $n$. Then,

$$\hat A^{-1} = A^{-1} - A^{-1} U \left( I + V^\intercal A^{-1} U \right)^{-1} V^\intercal A^{-1} \, .$$

Although we need to invert a new matrix, $\left( I + V^\intercal A^{-1} U \right)$, this $(p \times p)$ matrix is typically small and can be inverted analytically for small $p$ (e.g., $p = 2$ for a rank 2 update).
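
The identity is also easy to check numerically. The following sketch verifies the generalized formula for a random matrix with a rank 2 perturbation; the matrix sizes and random data are arbitrary choices made for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # a well-conditioned test matrix
U = rng.standard_normal((n, p))
V = rng.standard_normal((n, p))

A_inv = np.linalg.inv(A)
# (A + U V^T)^{-1} = A^{-1} - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}
woodbury = A_inv - A_inv @ U @ np.linalg.inv(np.eye(p) + V.T @ A_inv @ U) @ V.T @ A_inv
direct = np.linalg.inv(A + U @ V.T)
print(np.allclose(woodbury, direct))              # expect True
```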

References
  1. Broyden, C. G. (1965). A class of methods for solving nonlinear simultaneous equations. Mathematics of Computation, 19(92), 577–593. https://doi.org/10.1090/S0025-5718-1965-0198670-6