Appendix C: Quasi-Newton Methods
C.1 Broyden’s Method
Broyden’s method is the extension of the secant method (from Section 3.8) to $n$ dimensions [1]. It can also be viewed as the analog of the quasi-Newton methods from Section 4.4.4 for solving equations (as opposed to finding a minimum).
Using the notation from Chapter 3, suppose we have a set of $n$ equations, $r(u) = 0$, and $n$ unknowns $u$. Writing a Taylor series expansion of $r$ about $u_k$ and selecting the linear term yields

$$r(u_{k+1}) \approx r(u_k) + J(u_k)\left(u_{k+1} - u_k\right), \tag{C.1}$$

where $J$ is the Jacobian, $J_{ij} = \partial r_i / \partial u_j$. Defining the step in $u$ as

$$s_k = u_{k+1} - u_k \tag{C.2}$$

and the change in the residuals as

$$y_k = r(u_{k+1}) - r(u_k), \tag{C.3}$$

we can write Equation C.1 as

$$\tilde{J}_{k+1} s_k = y_k. \tag{C.4}$$
This is the equivalent of the secant equation (Equation 4.80). The difference is that we now approximate the Jacobian instead of the Hessian. The right-hand side is the difference between two subsequent function values (which quantifies the directional derivative along the last step) instead of the difference between gradients (which quantifies the curvature).
We seek a rank 1 update of the form

$$\tilde{J}_{k+1} = \tilde{J}_k + v w^\intercal, \tag{C.5}$$

where the outer product $v w^\intercal$ yields a matrix of rank 1. (Unlike the quasi-Newton updates used in optimization, this update is not symmetric in general because the Jacobian itself is not symmetric.) Substituting this update into the required condition (Equation C.4) yields

$$\tilde{J}_k s_k + v w^\intercal s_k = y_k. \tag{C.6}$$

Post-multiplying both sides by $s_k^\intercal$, rearranging, and dividing by $s_k^\intercal s_k$ yields

$$v w^\intercal = \frac{\left(y_k - \tilde{J}_k s_k\right) s_k^\intercal}{s_k^\intercal s_k}. \tag{C.7}$$
Substituting this result into the update (Equation C.5), we get the Jacobian approximation update,

$$\tilde{J}_{k+1} = \tilde{J}_k + \frac{\left(y_k - \tilde{J}_k s_k\right) s_k^\intercal}{s_k^\intercal s_k}, \tag{C.8}$$

where

$$y_k = r(u_{k+1}) - r(u_k) \tag{C.9}$$

is the difference in the function values (as opposed to the difference in the gradients used in optimization).
This update can be inverted using the Sherman–Morrison–Woodbury formula (Section C.3) to get the more useful update on the inverse of the Jacobian,

$$\tilde{J}_{k+1}^{-1} = \tilde{J}_k^{-1} + \frac{\left(s_k - \tilde{J}_k^{-1} y_k\right) s_k^\intercal \tilde{J}_k^{-1}}{s_k^\intercal \tilde{J}_k^{-1} y_k}. \tag{C.10}$$
We can start with $\tilde{J}_0^{-1} = I$. Similar to the Newton step (Equation 3.30), the step in Broyden’s method is given by solving the linear system $\tilde{J}_k s_k = -r_k$. Because the inverse is provided explicitly, we can just perform the multiplication,

$$s_k = -\tilde{J}_k^{-1} r_k. \tag{C.11}$$

Then we update the variables as

$$u_{k+1} = u_k + s_k. \tag{C.12}$$
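The complete iteration can be sketched in a few lines of Python. This is a minimal implementation using the inverse update (Equation C.10); the circle–line test system, starting point, and tolerances are illustrative assumptions.

```python
import numpy as np

def broyden(r, u0, tol=1e-10, max_iter=100):
    """Solve r(u) = 0 with Broyden's method and the inverse update (Equation C.10)."""
    u = np.array(u0, dtype=float)
    V = np.eye(u.size)              # approximate inverse Jacobian; start with the identity
    ru = r(u)
    for _ in range(max_iter):
        if np.linalg.norm(ru) < tol:
            break
        s = -V @ ru                 # step (Equation C.11)
        u = u + s                   # variable update (Equation C.12)
        r_new = r(u)
        y = r_new - ru              # change in the residuals (Equation C.3)
        Vy = V @ y
        V = V + np.outer(s - Vy, s @ V) / (s @ Vy)  # inverse update (Equation C.10)
        ru = r_new
    return u

# Example: intersection of the circle u1^2 + u2^2 = 4 with the line u2 = u1
def res(u):
    return np.array([u[0]**2 + u[1]**2 - 4.0, u[1] - u[0]])

sol = broyden(res, [1.0, 2.0])
```

Note that no Jacobian is ever formed or factorized; each iteration costs only a matrix–vector product and a rank 1 correction.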
C.2 Additional Quasi-Newton Approximations
In Section 4.4.4, we introduced the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton approximation for unconstrained optimization, which was also used in Section 5.5 for constrained optimization. Here we expand on that to introduce other quasi-Newton approximations and generalize them.
To get a unique solution for the approximate Hessian update, quasi-Newton methods quantify the “closeness” of successive Hessian approximations by using some norm of the difference between the two matrices, leading to the following optimization problem:

$$\begin{aligned}
\underset{\tilde{H}_{k+1}}{\text{minimize}} \quad & \left\| \tilde{H}_{k+1} - \tilde{H}_k \right\| \\
\text{subject to} \quad & \tilde{H}_{k+1} = \tilde{H}_{k+1}^\intercal \\
& \tilde{H}_{k+1} s_k = y_k,
\end{aligned} \tag{C.13}$$

where $\tilde{H}_k$ is the previous Hessian approximation, $y_k = \nabla f_{k+1} - \nabla f_k$, and $s_k = x_{k+1} - x_k$ (the latest step). There are several possibilities for quantifying the “closeness” between $\tilde{H}_{k+1}$ and $\tilde{H}_k$ and satisfying the constraints, leading to different quasi-Newton updates. With a convenient choice of matrix norm, we can solve this optimization problem analytically to obtain a formula for $\tilde{H}_{k+1}$ as a function of $\tilde{H}_k$, $s_k$, and $y_k$.
The optimization problem (Equation C.13) does not enforce a positive-definiteness constraint. It turns out that the update formula always produces a positive-definite $\tilde{H}_{k+1}$, provided that $\tilde{H}_k$ is positive definite. The fact that the curvature condition (Equation 4.81), $s_k^\intercal y_k > 0$, is satisfied at each step helps with this.
C.2.1 Davidon–Fletcher–Powell Update
The Davidon–Fletcher–Powell (DFP) update can be derived using a similar approach to that used to derive the BFGS update in Section 4.4.4. However, instead of starting with the update for the Hessian, we start with a rank 2 update to the Hessian inverse approximation, $\tilde{V} \equiv \tilde{H}^{-1}$,

$$\tilde{V}_{k+1} = \tilde{V}_k + \alpha u u^\intercal + \beta w w^\intercal. \tag{C.14}$$
We need the inverse version of the secant equation (Equation 4.80), which is

$$\tilde{V}_{k+1} y_k = s_k. \tag{C.15}$$
Setting $u = s_k$ and $w = \tilde{V}_k y_k$ in the update (Equation C.14) and substituting it into the inverse version of the secant equation (Equation C.15), we get

$$\tilde{V}_k y_k + \alpha s_k \left(s_k^\intercal y_k\right) + \beta \tilde{V}_k y_k \left(y_k^\intercal \tilde{V}_k y_k\right) = s_k. \tag{C.16}$$
We can obtain the coefficients $\alpha = 1 / \left(s_k^\intercal y_k\right)$ and $\beta = -1 / \left(y_k^\intercal \tilde{V}_k y_k\right)$ by rearranging this equation and using similar arguments to those used in the BFGS update derivation (see Section 4.4.4). The DFP update for the Hessian inverse approximation is

$$\tilde{V}_{k+1} = \tilde{V}_k - \frac{\tilde{V}_k y_k y_k^\intercal \tilde{V}_k}{y_k^\intercal \tilde{V}_k y_k} + \frac{s_k s_k^\intercal}{s_k^\intercal y_k}. \tag{C.17}$$
However, the DFP update was originally derived by solving the optimization problem (Equation C.13), which minimizes a matrix norm of the update while enforcing symmetry and the secant equation. This problem can be solved analytically through the Karush–Kuhn–Tucker (KKT) conditions and a convenient matrix norm. The weighted Frobenius norm (Equation A.35) was the norm used in this case, where the weights were based on an averaged Hessian inverse.
The derivation is lengthy and is not included here. The final result is the update,

$$\tilde{H}_{k+1} = \left(I - \sigma_k y_k s_k^\intercal\right) \tilde{H}_k \left(I - \sigma_k s_k y_k^\intercal\right) + \sigma_k y_k y_k^\intercal, \tag{C.18}$$

where

$$\sigma_k = \frac{1}{y_k^\intercal s_k}. \tag{C.19}$$
This can be inverted using the Sherman–Morrison–Woodbury formula (Section C.3) to get the update on the inverse (Equation C.17).
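As a sketch, the DFP inverse update (Equation C.17) is a one-line function, and the inverse secant equation (Equation C.15) can be checked directly. The numerical values below are arbitrary illustrative choices satisfying $s_k^\intercal y_k > 0$.

```python
import numpy as np

def dfp_update(V, s, y):
    """DFP update of the inverse Hessian approximation (Equation C.17)."""
    Vy = V @ y
    return V - np.outer(Vy, Vy) / (y @ Vy) + np.outer(s, s) / (s @ y)

# Illustrative data with s^T y = 6 > 0 (curvature condition)
V0 = np.eye(3)
s = np.array([1.0, 1.0, 1.0])
y = np.array([1.0, 2.0, 3.0])

V1 = dfp_update(V0, s, y)   # satisfies V1 @ y == s by construction
```

The update is a sum of symmetric matrices, so symmetry is preserved automatically.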
C.2.2 BFGS
The BFGS update was informally derived in Section 4.4.4.
As discussed previously, obtaining an approximation of the Hessian inverse is a more efficient way to get the quasi-Newton step.
Similar to DFP, BFGS was originally formally derived by analytically solving an optimization problem. However, instead of solving the optimization problem of Equation C.13, we solve a similar problem using the Hessian inverse approximation instead. This problem can be stated as

$$\begin{aligned}
\underset{\tilde{V}_{k+1}}{\text{minimize}} \quad & \left\| \tilde{V}_{k+1} - \tilde{V}_k \right\| \\
\text{subject to} \quad & \tilde{V}_{k+1} y_k = s_k \\
& \tilde{V}_{k+1} = \tilde{V}_{k+1}^\intercal,
\end{aligned} \tag{C.20}$$

where $\tilde{V}_{k+1}$ is the updated inverse Hessian approximation that we seek, and $\tilde{V}_k$ is the inverse Hessian approximation from the previous step. The first constraint is known as the secant equation applied to the inverse. The second constraint enforces symmetric updates. We do not explicitly specify positive definiteness. The matrix norm is again a weighted Frobenius norm (Equation A.35), but now the weights are based on an averaged Hessian (instead of the inverse for DFP). Solving this optimization problem (Equation C.20), the final result is

$$\tilde{V}_{k+1} = \left(I - \sigma_k s_k y_k^\intercal\right) \tilde{V}_k \left(I - \sigma_k y_k s_k^\intercal\right) + \sigma_k s_k s_k^\intercal, \tag{C.21}$$

where

$$\sigma_k = \frac{1}{s_k^\intercal y_k}. \tag{C.22}$$
This is identical to Equation 4.88.
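The BFGS inverse update (Equation C.21) can be sketched and checked the same way. Besides the inverse secant equation, the check below also confirms the positive-definiteness claim made earlier: with $s_k^\intercal y_k > 0$ and a positive-definite starting matrix, the update stays positive definite. The numerical values are illustrative assumptions.

```python
import numpy as np

def bfgs_update(V, s, y):
    """BFGS update of the inverse Hessian approximation (Equation C.21)."""
    sigma = 1.0 / (s @ y)
    E = np.eye(len(s)) - sigma * np.outer(s, y)   # (I - sigma s y^T)
    return E @ V @ E.T + sigma * np.outer(s, s)

# Illustrative data with s^T y = 6 > 0 (curvature condition)
V0 = np.eye(3)
s = np.array([1.0, 1.0, 1.0])
y = np.array([1.0, 2.0, 3.0])

V1 = bfgs_update(V0, s, y)
```

The factored form $E \tilde{V} E^\intercal + \sigma s s^\intercal$ makes both symmetry and positive definiteness evident, since each term is a congruence transform or outer product of a positive-(semi)definite matrix.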
C.2.3 Symmetric Rank 1 Update
The symmetric rank 1 (SR1) update is a quasi-Newton update that is rank 1, as opposed to the rank 2 update of DFP and BFGS (Equation C.14). The SR1 update can be derived formally without solving the optimization problem of Equation C.13 because there is only one symmetric rank 1 update that satisfies the secant equation.
Similar to the rank 2 update of the approximate inverse Hessian (Equation 4.82), we construct the update,

$$\tilde{V}_{k+1} = \tilde{V}_k + \sigma v v^\intercal, \tag{C.23}$$

where we only need one self outer product to produce a rank 1 update (as opposed to two).

Substituting the rank 1 update (Equation C.23) into the secant equation, we obtain

$$\tilde{V}_k y_k + \sigma v \left(v^\intercal y_k\right) = s_k. \tag{C.24}$$

Rearranging yields

$$\sigma \left(v^\intercal y_k\right) v = s_k - \tilde{V}_k y_k. \tag{C.25}$$

Because $\sigma \left(v^\intercal y_k\right)$ is a scalar, we have to make sure that $v$ is in the direction of $s_k - \tilde{V}_k y_k$. The scalar $\sigma$ must be such that the scaling of the vectors on both sides of the equation match each other. We define a normalized $v$ in the desired direction,

$$v = \frac{s_k - \tilde{V}_k y_k}{\left\| s_k - \tilde{V}_k y_k \right\|}. \tag{C.26}$$

To find the correct value for $\sigma$, we substitute Equation C.26 into Equation C.25 to get

$$\sigma \frac{\left(s_k - \tilde{V}_k y_k\right)^\intercal y_k}{\left\| s_k - \tilde{V}_k y_k \right\|^2} \left(s_k - \tilde{V}_k y_k\right) = s_k - \tilde{V}_k y_k. \tag{C.27}$$

Solving for $\sigma$ yields

$$\sigma = \frac{\left\| s_k - \tilde{V}_k y_k \right\|^2}{\left(s_k - \tilde{V}_k y_k\right)^\intercal y_k}. \tag{C.28}$$

Substituting Equation C.26 and Equation C.28 into Equation C.23, we get the SR1 update

$$\tilde{V}_{k+1} = \tilde{V}_k + \frac{\left(s_k - \tilde{V}_k y_k\right)\left(s_k - \tilde{V}_k y_k\right)^\intercal}{\left(s_k - \tilde{V}_k y_k\right)^\intercal y_k}. \tag{C.29}$$
Because it is possible for the denominator in this update to be zero, the update requires safeguarding. The updated matrix is not positive definite in general because the denominator can be negative.
As in the BFGS method, the search direction at each major iteration is given by $p_k = -\tilde{V}_k \nabla f_k$, and a line search determines the final step length $\alpha_k$.
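A minimal sketch of the SR1 update (Equation C.29) with a safeguard is shown below. The relative threshold `eps` and the choice to simply skip the update when the denominator is small are common heuristics, not prescriptions from the text; the numerical values are illustrative.

```python
import numpy as np

def sr1_update(V, s, y, eps=1e-8):
    """SR1 update (Equation C.29), skipped when the denominator is near zero."""
    v = s - V @ y
    denom = v @ y
    # Safeguard: skip the update if the denominator is tiny relative to |v||y|
    if abs(denom) < eps * np.linalg.norm(v) * np.linalg.norm(y):
        return V
    return V + np.outer(v, v) / denom

# Illustrative data
V0 = np.eye(3)
s = np.array([1.0, 1.0, 1.0])
y = np.array([1.0, 2.0, 3.0])

V1 = sr1_update(V0, s, y)   # here the denominator is -8, so the update is applied
```

Note that the denominator in this example is negative, illustrating why SR1 does not guarantee positive definiteness even though the secant equation still holds.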
C.2.4 Unification of SR1, DFP, and BFGS
The SR1, DFP, and BFGS updates for the inverse Hessian approximation can be expressed using the following more general formula:

$$\tilde{V}_{k+1} = \tilde{V}_k + a\, s_k s_k^\intercal + b \left( s_k y_k^\intercal \tilde{V}_k + \tilde{V}_k y_k s_k^\intercal \right) + c\, \tilde{V}_k y_k y_k^\intercal \tilde{V}_k. \tag{C.30}$$

For the SR1 method, we have

$$a = \frac{1}{d}, \quad b = -\frac{1}{d}, \quad c = \frac{1}{d}, \quad \text{where } d = \left(s_k - \tilde{V}_k y_k\right)^\intercal y_k. \tag{C.31}$$

For the DFP method, we have

$$a = \frac{1}{s_k^\intercal y_k}, \quad b = 0, \quad c = -\frac{1}{y_k^\intercal \tilde{V}_k y_k}. \tag{C.32}$$

For the BFGS method, we have

$$a = \frac{1}{s_k^\intercal y_k} \left( 1 + \frac{y_k^\intercal \tilde{V}_k y_k}{s_k^\intercal y_k} \right), \quad b = -\frac{1}{s_k^\intercal y_k}, \quad c = 0. \tag{C.33}$$
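To make the unification concrete, the sketch below implements the general rank 2 form (Equation C.30) and checks numerically that the three coefficient choices reproduce the dedicated SR1, DFP, and BFGS formulas. The test data are arbitrary illustrative values with $s_k^\intercal y_k > 0$.

```python
import numpy as np

def general_update(V, s, y, a, b, c):
    """General form (Equation C.30) covering the SR1, DFP, and BFGS updates."""
    Vy = V @ y
    return (V + a * np.outer(s, s)
              + b * (np.outer(s, Vy) + np.outer(Vy, s))
              + c * np.outer(Vy, Vy))

# Illustrative data with s^T y = 6 > 0
V = np.eye(3)
s = np.array([1.0, 1.0, 1.0])
y = np.array([1.0, 2.0, 3.0])

sy = s @ y                 # s^T y
yVy = y @ V @ y            # y^T V y
d = sy - yVy               # (s - V y)^T y

V_sr1  = general_update(V, s, y, 1/d, -1/d, 1/d)                # Equation C.31
V_dfp  = general_update(V, s, y, 1/sy, 0.0, -1/yVy)             # Equation C.32
V_bfgs = general_update(V, s, y, (1 + yVy/sy)/sy, -1/sy, 0.0)   # Equation C.33
```

All three matrices satisfy the inverse secant equation $\tilde{V}_{k+1} y_k = s_k$; they differ only in how the remaining degrees of freedom are used.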
C.3 Sherman–Morrison–Woodbury Formula
The formal derivations of the DFP and BFGS methods use the Sherman–Morrison–Woodbury formula (also known as the Woodbury matrix identity). Suppose that the inverse of a matrix $A$ is known and that the matrix is then perturbed. The Sherman–Morrison–Woodbury formula gives the inverse of the perturbed matrix without having to re-invert the perturbed matrix. We used this formula in Section 4.4.4 to derive the quasi-Newton update.
One possible perturbation is a rank 1 update of the form

$$\hat{A} = A + u v^\intercal, \tag{C.34}$$

where $u$ and $v$ are $n$-vectors. This is a rank 1 update to $A$ because $u v^\intercal$ is an outer product that produces a matrix whose rank is equal to 1 (see Figure 4.50).

If $\hat{A}$ is nonsingular, and $A^{-1}$ is known, the Sherman–Morrison–Woodbury formula gives

$$\hat{A}^{-1} = A^{-1} - \frac{A^{-1} u v^\intercal A^{-1}}{1 + v^\intercal A^{-1} u}. \tag{C.35}$$
This formula can be verified by multiplying Equation C.34 and Equation C.35, which yields the identity matrix.
This formula can be generalized for higher-rank updates as follows:

$$\hat{A} = A + U V^\intercal, \tag{C.36}$$

where $U$ and $V$ are $n \times p$ matrices for some $p$ between 1 and $n$. Then,

$$\hat{A}^{-1} = A^{-1} - A^{-1} U \left(I + V^\intercal A^{-1} U\right)^{-1} V^\intercal A^{-1}. \tag{C.37}$$
Although we need to invert a new matrix, $I + V^\intercal A^{-1} U$, this matrix is only $p \times p$, which is typically small, and it can be inverted analytically for the rank 2 update ($p = 2$), for example.
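Both the rank 1 formula (Equation C.35) and its higher-rank generalization (Equation C.37) are easy to verify numerically. The matrices below are arbitrary illustrative choices, with `W` standing in for the formula's $V$ to avoid a clash with the inverse Hessian notation above.

```python
import numpy as np

A = np.diag([2.0, 3.0, 4.0, 5.0])
A_inv = np.linalg.inv(A)

# Rank 1 perturbation (Equation C.34) and its inverse via Equation C.35
u = np.ones(4)
v = np.array([1.0, 0.0, 1.0, 0.0])
A_hat_inv = A_inv - np.outer(A_inv @ u, v @ A_inv) / (1.0 + v @ A_inv @ u)

# Rank 2 perturbation (Equation C.36) and its inverse via Equation C.37
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
W = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
M = np.eye(2) + W.T @ A_inv @ U          # the small p-by-p matrix to invert
A_hat2_inv = A_inv - A_inv @ U @ np.linalg.inv(M) @ W.T @ A_inv
```

In both cases, multiplying the computed inverse by the perturbed matrix recovers the identity, confirming the formulas without ever re-inverting the full $n \times n$ matrix.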
- Broyden, C. G. (1965). A class of methods for solving nonlinear simultaneous equations. Mathematics of Computation, 19(92), 577–593. doi:10.1090/S0025-5718-1965-0198670-6