

Appendix: Quasi-Newton Methods

C.1 Broyden’s Method

Broyden’s method is the extension of the secant method (from Section 3.8) to $n$ dimensions [1]. It can also be viewed as the analog of the quasi-Newton methods from Section 4.4.4 for solving equations (as opposed to finding a minimum).

Using the notation from Chapter 3, suppose we have a set of $n$ equations $r(u) = [r_1, \ldots, r_n] = 0$ and $n$ unknowns $u = [u_1, \ldots, u_n]$. Writing a Taylor series expansion of $r(u)$ and keeping only the linear term yields

$$J_{k+1} \left( u_{k+1} - u_k \right) \approx r_{k+1} - r_k \, ,$$

where $J$ is the $(n \times n)$ Jacobian, $\partial r / \partial u$. Defining the step in $u$ as

$$s_k = u_{k+1} - u_k \, ,$$

and the change in the residuals as

$$y_k = r_{k+1} - r_k \, ,$$

we can write Equation C.1 as

$$\tilde J_{k+1} s_k = y_k \, .$$

This is the equivalent of the secant equation (Equation 4.80). The difference is that we now approximate the Jacobian instead of the Hessian. The right-hand side is the difference between two subsequent function values (which quantifies the directional derivative along the last step) instead of the difference between gradients (which quantifies the curvature).

We seek a rank 1 update of the form

$$\tilde J = \tilde J_k + v v^\intercal \, ,$$

where the self outer product $v v^\intercal$ yields a symmetric matrix of rank 1. Substituting this update into the required condition (Equation C.4) yields

$$\left( \tilde J_k + v v^\intercal \right) s_k = y_k \, .$$

Post-multiplying both sides by $s_k^\intercal$, rearranging, and dividing by $s_k^\intercal s_k$ yields

$$v v^\intercal = \frac{\left( y_k - \tilde J_k s_k \right) s_k^\intercal}{s_k^\intercal s_k} \, .$$

Substituting this result into the update (Equation C.5), we get the Jacobian approximation update,

$$\tilde J_{k+1} = \tilde J_k + \frac{\left( y_k - \tilde J_k s_k \right) s_k^\intercal}{s_k^\intercal s_k} \, ,$$

where

$$y_k = r_{k+1} - r_k$$

is the difference in the function values (as opposed to the difference in the gradients used in optimization).

This update can be inverted using the Sherman–Morrison–Woodbury formula (Section C.3) to get the more useful update on the inverse of the Jacobian,

$$\tilde J^{-1}_{k+1} = \tilde J^{-1}_k + \frac{\left( s_k - \tilde J^{-1}_k y_k \right) y_k^\intercal}{y_k^\intercal y_k} \, .$$

We can start with $\tilde J^{-1}_0 = I$. Similar to the Newton step (Equation 3.30), the step in Broyden’s method is given by solving a linear system. Because the inverse is provided explicitly, we can just perform the multiplication,

$$\Delta u_k = - \tilde J^{-1}_k r_k \, .$$

Then we update the variables as

$$u_{k+1} = u_k + \Delta u_k \, .$$
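
To make the procedure concrete, the following is a minimal Python sketch of Broyden’s method based on the inverse-Jacobian update above. The residual function, starting point, and tolerances are illustrative choices for this example, not part of the original text.

```python
import numpy as np

def broyden(r, u0, tol=1e-10, max_iter=100):
    """Solve r(u) = 0 with Broyden's method, maintaining an approximation
    of the inverse Jacobian instead of forming and factorizing the Jacobian."""
    u = np.asarray(u0, dtype=float)
    J_inv = np.eye(u.size)              # start with the identity as the inverse Jacobian
    r_k = r(u)
    for _ in range(max_iter):
        s = -J_inv @ r_k                # step: Delta u_k = -J_inv_k r_k
        u = u + s
        r_next = r(u)
        if np.linalg.norm(r_next) < tol:
            break
        y = r_next - r_k                # change in the residuals
        # Inverse-Jacobian update: J_inv += (s - J_inv y) y^T / (y^T y)
        J_inv += np.outer(s - J_inv @ y, y) / (y @ y)
        r_k = r_next
    return u

# Mildly nonlinear test system (a hypothetical example)
r = lambda u: np.array([u[0] + 0.1 * u[1]**2 - 1.0,
                        u[1] + 0.1 * u[0]**2 - 2.0])
print(broyden(r, [0.0, 0.0]))
```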

C.2 Additional Quasi-Newton Approximations

In Section 4.4.4, we introduced the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton approximation for unconstrained optimization, which was also used in Section 5.5 for constrained optimization. Here we expand on that to introduce other quasi-Newton approximations and generalize them.

To get a unique solution for the approximate Hessian update, quasi-Newton methods quantify the “closeness” of successive Hessian approximations by using some norm of the difference between the two matrices, leading to the following optimization problem:

$$\begin{aligned} \text{minimize} & \quad \|\tilde H - \tilde H_k\| \\ \text{by varying} & \quad \tilde H \\ \text{subject to} & \quad \tilde H = \tilde H^\intercal \\ & \quad \tilde H s_k = y_k \, , \end{aligned}$$

where $y_k = \nabla f_{k+1} - \nabla f_k$ and $s_k = x_{k+1} - x_k$ (the latest step). There are several possibilities for quantifying the “closeness” between the matrices while satisfying the constraints, leading to different quasi-Newton updates. With a convenient choice of matrix norm, we can solve this optimization problem analytically to obtain a formula for $\tilde H_{k+1}$ as a function of $\tilde H_k$, $s_k$, and $y_k$.

The optimization problem (Equation C.13) does not enforce a positive-definiteness constraint. It turns out that the update formula always produces a $\tilde H_{k+1}$ that is positive definite, provided that $\tilde H_k$ is positive definite. The fact that the curvature condition (Equation 4.81) is satisfied for each step helps with this.

C.2.1 Davidon–Fletcher–Powell Update

The Davidon–Fletcher–Powell (DFP) update can be derived using a similar approach to that used to derive the BFGS update in Section 4.4.4. However, instead of starting with the update for the Hessian, we start with the update to the Hessian inverse,

$$\tilde V_{k+1} = \tilde V_k + \alpha u u^\intercal + \beta v v^\intercal \, .$$

We need the inverse version of the secant equation (Equation 4.80), which is

$$\tilde V_{k+1} y_k = s_k \, .$$

Setting $u = s_k$ and $v = \tilde V_k y_k$ in the update (Equation C.14) and substituting it into the inverse version of the secant equation (Equation C.15), we get

$$\tilde V_k y_k + \alpha s_k s_k^\intercal y_k + \beta \tilde V_k y_k y_k^\intercal \tilde V_k y_k = s_k \, .$$

We can obtain the coefficients α\alpha and β\beta by rearranging this equation and using similar arguments to those used in the BFGS update derivation (see Section 4.4.4). The DFP update for the Hessian inverse approximation is

$$\tilde V_{k+1} = \tilde V_k + \frac{1}{y_k^\intercal s_k} s_k s_k^\intercal - \frac{1}{y_k^\intercal \tilde V_k y_k} \tilde V_k y_k y_k^\intercal \tilde V_k \, .$$

However, the DFP update was originally derived by solving the optimization problem (Equation C.13), which minimizes a matrix norm of the update while enforcing symmetry and the secant equation. This problem can be solved analytically through the Karush–Kuhn–Tucker (KKT) conditions and a convenient matrix norm. The weighted Frobenius norm (Equation A.35) was the norm used in this case, where the weights were based on an averaged Hessian inverse.

The derivation is lengthy and is not included here. The final result is the update,

$$\tilde H_{k+1} = \left( I - \sigma_k y_k s_k^\intercal \right) \tilde H_k \left( I - \sigma_k s_k y_k^\intercal \right) + \sigma_k y_k y_k^\intercal \, ,$$

where

$$\sigma_k = \frac{1}{y_k^\intercal s_k} \, .$$

This can be inverted using the Sherman–Morrison–Woodbury formula (Section C.3) to get the update on the inverse (Equation C.17).
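
As a small illustration (added here, not from the original text), the DFP update of the inverse Hessian approximation (Equation C.17) can be written as a single function. It assumes the current approximation is symmetric, so that $(\tilde V_k y_k)(\tilde V_k y_k)^\intercal = \tilde V_k y_k y_k^\intercal \tilde V_k$.

```python
import numpy as np

def dfp_update(V, s, y):
    """DFP update of the inverse Hessian approximation:
    V_{k+1} = V + s s^T / (y^T s) - (V y)(V y)^T / (y^T V y), assuming V is symmetric."""
    Vy = V @ y
    return V + np.outer(s, s) / (y @ s) - np.outer(Vy, Vy) / (y @ Vy)
```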

C.2.2 BFGS

The BFGS update was informally derived in Section 4.4.4.

As discussed previously, obtaining an approximation of the Hessian inverse is a more efficient way to get the quasi-Newton step.

Similar to DFP, BFGS was originally formally derived by analytically solving an optimization problem. However, instead of solving the optimization problem of Equation C.13, we solve a similar problem using the Hessian inverse approximation instead. This problem can be stated as

$$\begin{aligned} \text{minimize} \quad & \|\tilde V - \tilde V_k\| \\ \text{subject to} \quad & \tilde V y_k = s_k \\ & \tilde V = \tilde V^\intercal \, , \end{aligned}$$

where $\tilde V$ is the updated inverse Hessian approximation that we seek, and $\tilde V_k$ is the inverse Hessian approximation from the previous step. The first constraint is the secant equation applied to the inverse, and the second constraint enforces a symmetric update. We do not explicitly specify positive definiteness. The matrix norm is again a weighted Frobenius norm (Equation A.35), but now the weights are based on an averaged Hessian (instead of the averaged Hessian inverse used for DFP). Solving this optimization problem (Equation C.20), the final result is

$$\tilde V_{k+1} = \left( I - \sigma_k s_k y_k^\intercal \right) \tilde V_k \left( I - \sigma_k y_k s_k^\intercal \right) + \sigma_k s_k s_k^\intercal \, ,$$

where

$$\sigma_k = \frac{1}{y_k^\intercal s_k} \, .$$

This is identical to Equation 4.88.
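
For comparison with the DFP sketch above, here is the corresponding illustrative function for the BFGS inverse update; the function name and structure are assumptions for this example.

```python
import numpy as np

def bfgs_update(V, s, y):
    """BFGS update of the inverse Hessian approximation (Equation 4.88):
    V_{k+1} = (I - sigma s y^T) V (I - sigma y s^T) + sigma s s^T."""
    sigma = 1.0 / (y @ s)
    A = np.eye(len(s)) - sigma * np.outer(s, y)
    return A @ V @ A.T + sigma * np.outer(s, s)
```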

C.2.3 Symmetric Rank 1 Update

The symmetric rank 1 (SR1) update is a quasi-Newton update of rank 1, as opposed to the rank 2 updates of DFP and BFGS (Equation C.14). The SR1 update can be derived formally without solving the optimization problem of Equation C.13 because there is only one symmetric rank 1 update that satisfies the secant equation.

Similar to the rank 2 update of the approximate inverse Hessian (Equation 4.82), we construct the update,

$$\tilde V = \tilde V_k + \alpha v v^\intercal \, ,$$

where we only need one self outer product to produce a rank 1 update (as opposed to two).

Substituting the rank 1 update (Equation C.23) into the secant equation, we obtain

$$\tilde V_k y_k + \alpha v v^\intercal y_k = s_k \, .$$

Rearranging yields

$$\left( \alpha v^\intercal y_k \right) v = s_k - \tilde V_k y_k \, .$$

Thus, we have to make sure that $v$ is in the direction of $s_k - \tilde V_k y_k$. The scalar $\alpha$ must be such that the scaling of the vectors on both sides of the equation matches. We define a normalized $v$ in the desired direction,

$$v = \frac{s_k - \tilde V_k y_k}{\left\| s_k - \tilde V_k y_k \right\|_2} \, .$$

To find the correct value for $\alpha$, we substitute Equation C.26 into Equation C.25 to get

$$s_k - \tilde V_k y_k = \alpha \, \frac{s_k^\intercal y_k - y_k^\intercal \tilde V_k y_k}{\left\| s_k - \tilde V_k y_k \right\|_2^2} \left( s_k - \tilde V_k y_k \right) \, .$$

Solving for α\alpha yields

$$\alpha = \frac{\left\| s_k - \tilde V_k y_k \right\|_2^2}{s_k^\intercal y_k - y_k^\intercal \tilde V_k y_k} \, .$$

Substituting Equation C.26 and Equation C.28 into Equation C.23, we get the SR1 update

$$\tilde V_{k+1} = \tilde V_k + \frac{1}{s_k^\intercal y_k - y_k^\intercal \tilde V_k y_k} \left( s_k - \tilde V_k y_k \right) \left( s_k - \tilde V_k y_k \right)^\intercal \, .$$

Because the denominator in this update can be zero (or nearly so), the update requires safeguarding. In addition, the updated matrix is not guaranteed to be positive definite because the denominator can be negative.

As in the BFGS method, the search direction at each major iteration is given by $p_k = -\tilde V_k \nabla f_k$, and a line search with $\alpha_\text{init} = 1$ determines the final step length.
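
The sketch below illustrates the SR1 inverse update with a simple safeguard that skips the update when the denominator is too small; the skip test and tolerance are common choices rather than something prescribed in the text.

```python
import numpy as np

def sr1_update(V, s, y, r=1e-8):
    """SR1 update of the inverse Hessian approximation, skipping the update
    when the denominator s^T y - y^T V y is close to zero."""
    d = s - V @ y                  # s_k - V_k y_k
    denom = y @ d                  # equals s^T y - y^T V y
    if abs(denom) < r * np.linalg.norm(y) * np.linalg.norm(d):
        return V                   # safeguard: skip the update
    return V + np.outer(d, d) / denom
```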

C.2.4 Unification of SR1, DFP, and BFGS

The SR1, DFP, and BFGS updates for the inverse Hessian approximation can be expressed using the following more general formula:

$$\tilde V_{k+1} = \tilde V_k + \begin{bmatrix} \tilde V_k y_k & s_k \end{bmatrix} \begin{bmatrix} \alpha & \beta \\ \beta & \gamma \end{bmatrix} \begin{bmatrix} y_k^\intercal \tilde V_k \\ s_k^\intercal \end{bmatrix} \, .$$

For the SR1 method, we have

$$\begin{aligned} \alpha_{\text{SR1}} &= \frac{1}{y_k^\intercal s_k - y_k^\intercal \tilde V_k y_k} \\ \beta_{\text{SR1}} &= -\frac{1}{y_k^\intercal s_k - y_k^\intercal \tilde V_k y_k} \\ \gamma_{\text{SR1}} &= \frac{1}{y_k^\intercal s_k - y_k^\intercal \tilde V_k y_k} \, . \end{aligned}$$

For the DFP method, we have

$$\alpha_{\text{DFP}} = -\frac{1}{y_k^\intercal \tilde V_k y_k} \, , \quad \beta_{\text{DFP}} = 0 \, , \quad \gamma_{\text{DFP}} = \frac{1}{y_k^\intercal s_k} \, .$$

For the BFGS method, we have

$$\alpha_{\text{BFGS}} = 0 \, , \quad \beta_{\text{BFGS}} = -\frac{1}{y_k^\intercal s_k} \, , \quad \gamma_{\text{BFGS}} = \frac{1}{y_k^\intercal s_k} + \frac{y_k^\intercal \tilde V_k y_k}{\left( y_k^\intercal s_k \right)^2} \, .$$
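
To show the correspondence concretely, the general form can be coded once with the three coefficient sets passed in. This is an illustrative sketch with placeholder names; it assumes $\tilde V_k$ is symmetric so that $(\tilde V_k y_k)^\intercal = y_k^\intercal \tilde V_k$.

```python
import numpy as np

def general_update(V, s, y, coeffs):
    """General quasi-Newton inverse update:
    V_{k+1} = V + [V y, s] [[a, b], [b, g]] [y^T V; s^T]."""
    a, b, g = coeffs(V, s, y)
    U = np.column_stack((V @ y, s))     # n x 2 block [V y, s]
    M = np.array([[a, b], [b, g]])
    return V + U @ M @ U.T              # U.T equals [y^T V; s^T] when V is symmetric

def sr1_coeffs(V, s, y):
    c = 1.0 / (y @ s - y @ V @ y)
    return c, -c, c

def dfp_coeffs(V, s, y):
    return -1.0 / (y @ V @ y), 0.0, 1.0 / (y @ s)

def bfgs_coeffs(V, s, y):
    ys = y @ s
    return 0.0, -1.0 / ys, 1.0 / ys + (y @ V @ y) / ys**2
```

For instance, `general_update(V, s, y, bfgs_coeffs)` reproduces the BFGS update of Section C.2.2, and swapping in `dfp_coeffs` or `sr1_coeffs` reproduces the DFP and SR1 updates.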

C.3 Sherman–Morrison–Woodbury Formula

The formal derivations of the DFP and BFGS methods use the Sherman–Morrison–Woodbury formula (also known as the Woodbury matrix identity). Suppose that the inverse of a matrix is known, and then the matrix is perturbed. The Sherman–Morrison–Woodbury formula gives the inverse of the perturbed matrix without having to re-invert the perturbed matrix. We used this formula in Section 4.4.4 to derive the quasi-Newton update.

One possible perturbation is a rank 1 update of the form

$$\hat A = A + u v^\intercal \, ,$$

where $u$ and $v$ are $n$-vectors. This is a rank 1 update to $A$ because the outer product $u v^\intercal$ produces a matrix whose rank is equal to 1 (see Figure 4.50).

If $\hat A$ is nonsingular and $A^{-1}$ is known, the Sherman–Morrison–Woodbury formula gives

$$\hat A^{-1} = A^{-1} - \frac{A^{-1} u v^\intercal A^{-1}}{1 + v^\intercal A^{-1} u} \, .$$

This formula can be verified by multiplying Equation C.34 and Equation C.35, which yields the identity matrix.
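
To make the verification explicit (a step added here for clarity), carrying out the multiplication and using the fact that $v^\intercal A^{-1} u$ is a scalar gives

$$\left( A + u v^\intercal \right) \left( A^{-1} - \frac{A^{-1} u v^\intercal A^{-1}}{1 + v^\intercal A^{-1} u} \right) = I + u v^\intercal A^{-1} - \frac{u \left( 1 + v^\intercal A^{-1} u \right) v^\intercal A^{-1}}{1 + v^\intercal A^{-1} u} = I \, .$$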

This formula can be generalized for higher-rank updates as follows:

$$\hat A = A + U V^\intercal \, ,$$

where $U$ and $V$ are $(n \times p)$ matrices for some $p$ between 1 and $n$. Then,

$$\hat A^{-1} = A^{-1} - A^{-1} U \left( I + V^\intercal A^{-1} U \right)^{-1} V^\intercal A^{-1} \, .$$

Although we need to invert a new matrix, $\left( I + V^\intercal A^{-1} U \right)$, this $(p \times p)$ matrix is typically small and can be inverted analytically for small $p$ (e.g., $p = 2$ for a rank 2 update).
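
The identity is also easy to check numerically. The following sketch verifies the generalized formula for a random matrix with a rank 2 perturbation; the matrix sizes and random data are arbitrary choices made for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # a well-conditioned test matrix
U = rng.standard_normal((n, p))
V = rng.standard_normal((n, p))

A_inv = np.linalg.inv(A)
# (A + U V^T)^{-1} = A^{-1} - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}
woodbury = A_inv - A_inv @ U @ np.linalg.inv(np.eye(p) + V.T @ A_inv @ U) @ V.T @ A_inv
direct = np.linalg.inv(A + U @ V.T)
print(np.allclose(woodbury, direct))              # expect True
```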

References
  1. Broyden, C. G. (1965). A class of methods for solving nonlinear simultaneous equations. Mathematics of Computation, 19(92), 577–593. https://doi.org/10.1090/S0025-5718-1965-0198670-6