
6 Computing Derivatives

The gradient-based optimization methods introduced in Chapter 4 and Chapter 5 require the derivatives of the objective and constraints with respect to the design variables, as illustrated in Figure 6.1. Derivatives also play a central role in other numerical algorithms. For example, the Newton-based methods introduced in Section 3.8 require the derivatives of the residuals.

Figure 6.1: Efficient derivative computation is crucial for the overall efficiency of gradient-based optimization.

The accuracy and computational cost of the derivatives are critical for the success of these methods. Gradient-based methods are only efficient when the derivative computation is also efficient. The computation of derivatives can be the bottleneck in the overall optimization procedure, especially when the model solver needs to be called repeatedly. This chapter introduces the various methods for computing derivatives and discusses the relative advantages of each method.

6.1 Derivatives, Gradients, and Jacobians

The derivatives we focus on are first-order derivatives of one or more functions of interest (f) with respect to a vector of variables (x). In the engineering optimization literature, the term sensitivity analysis is often used to refer to the computation of derivatives, and derivatives are sometimes referred to as sensitivity derivatives or design sensitivities. Although these terms are not incorrect, we prefer to use the more specific and concise term derivative.

For the sake of generality, we do not specify which functions we want to differentiate in this chapter (which could be an objective, constraints, residuals, or any other function). Instead, we refer to the functions being differentiated as the functions of interest and represent them as a vector-valued function, f = [f_1, f_2, \ldots, f_{n_f}]. Neither do we specify the variables with respect to which we differentiate (which could be design variables, state variables, or any other independent variable).

The derivatives of all the functions of interest with respect to all the variables form a Jacobian matrix,

J_f = \frac{\partial {f}}{\partial {x}}= \begin{bmatrix} \nabla {f_1}^\intercal \\ \vdots \\ \nabla {f_{n_f}}^\intercal \end{bmatrix} = \underbrace{ \begin{bmatrix} \frac{\partial f_1}{\partial {x_1}}& \cdots & \frac{\partial f_1}{\partial {x_{n_x}}}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_{n_f}}{\partial {x_1}}& \cdots & \frac{\partial f_{n_f}}{\partial {x_{n_x}}}\\ \end{bmatrix}}_{(n_f \times n_x)} \, ,

which is an (n_f \times n_x) rectangular matrix where each row corresponds to the gradient of each function with respect to all the variables.

Row i of the Jacobian is the gradient of function f_i. Each column in the Jacobian is called the tangent with respect to a given variable x_j. The Jacobian can be related to the \nabla operator as follows:

J_f = f \nabla^\intercal = \begin{bmatrix} f_1 \\ \vdots \\ f_{n_f} \end{bmatrix} \begin{bmatrix} \frac{\partial }{\partial {x_1}}\ldots \frac{\partial }{\partial {x_{n_x}}}\end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial {x_1}}& \cdots & \frac{\partial f_1}{\partial {x_{n_x}}}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_{n_f}}{\partial {x_1}}& \cdots & \frac{\partial f_{n_f}}{\partial {x_{n_x}}}\\ \end{bmatrix} \, .

6.2 Overview of Methods for Computing Derivatives

We can classify the methods for computing derivatives according to the representation used for the numerical model. There are three possible representations, as shown in Figure 6.2. At one extreme, we know nothing about the model and consider it a black box where we can only control the inputs and observe the outputs (Figure 6.2, left). In this chapter, we often refer to x as the input variables and f as the output variables. When this is the case, we can only compute derivatives using finite differences (Section 6.4).

At the other extreme, we have access to the source code used to compute the functions of interest and can perform the differentiation line by line (Figure 6.2, right). This is the essence of the algorithmic differentiation approach (Section 6.6). The complex-step method (Section 6.5) is related to algorithmic differentiation, as explained in Section 6.6.5.

In the intermediate case, we consider the model residuals and states (Figure 6.2, middle), which are the quantities required to derive and implement implicit analytic methods (Section 6.7). When the model can be represented with multiple components, we can use a coupled derivative approach (Section 13.3) where any of these derivative computation methods can be used for each component.

Figure 6.2: Derivative computation methods can consider three different levels of information: function values (a), model states (b), and lines of code (c).

6.3 Symbolic Differentiation

Symbolic differentiation is well known and widely used in calculus, but it is of limited use in the numerical optimization of most engineering models. Except for the most straightforward cases (e.g., Example 6.1), many engineering models involve a large number of operations, utilize loops and various conditional logic, are implicitly defined, or involve iterative solvers (see Chapter 3). Although the mathematical expressions within these iterative procedures are explicit, it is challenging, or even impossible, to use symbolic differentiation to obtain closed-form mathematical expressions for the derivatives of the procedure. Even when it is possible, these expressions are almost always computationally inefficient.

Symbolic differentiation is still valuable for obtaining derivatives of simple explicit components within a larger model. Furthermore, algorithmic differentiation (discussed in a later section) relies on symbolic differentiation to differentiate each line of code in the model.

6.4 Finite Differences

Because of their simplicity, finite-difference methods are a popular approach to computing derivatives. They are versatile, requiring nothing more than function values. Finite differences are the only viable option when we are dealing with black-box functions because they do not require any knowledge about how the function is evaluated. Most gradient-based optimization algorithms perform finite differences by default when the user does not provide the required gradients. However, finite differences are neither accurate nor efficient.

6.4.1 Finite-Difference Formulas

Finite-difference approximations are derived by combining Taylor series expansions. It is possible to obtain finite-difference formulas that estimate an arbitrary-order derivative with any order of truncation error by using the right combinations of these expansions. The simplest finite-difference formula can be derived directly from a Taylor series expansion in the j\text{th} direction,

f(x + h \hat e_j) = f(x) + h \frac{\partial f}{\partial {x_j}}+ \frac{h^2}{2!} \frac{\partial^{2} f}{\partial {x_j}^{2}} + \frac{h^3}{3!} \frac{\partial^{3} f}{\partial {x_j}^{3}} + \ldots \, ,

where \hat e_j is the unit vector in the j\text{th} direction. Solving this for the first derivative, we obtain the finite-difference formula,

\frac{\partial f}{\partial {x_j}}= \frac{f(x + h \hat e_j)-f(x)}{h} + \mathcal{O}(h) \, ,

where h is a small scalar called the finite-difference step size.

Figure 6.3: Exact derivative compared with a forward finite-difference approximation (Equation 6.4).

This approximation is called the forward difference and is directly related to the definition of a derivative because

\frac{\partial f}{\partial {x_j}}= \lim_{h \rightarrow 0} \frac{f(x + h \hat e_j) - f(x)}{h} \approx \frac{f(x + h \hat e_j) - f(x)}{h} \, .

The truncation error is {\cal O}(h), and therefore this is a first-order approximation. The difference between this approximation and the exact derivative is illustrated in Figure 6.3.

The backward-difference approximation can be obtained by replacing h with -h to yield

\frac{\partial f}{\partial {x_j}}= \frac{f(x)-f(x - h \hat e_j)}{h} + {\cal O}(h)\, ,

which is also a first-order approximation.

Assuming each function evaluation yields the full vector f, the previous formulas compute the j\text{th} column of the Jacobian in Equation 6.1. To compute the full Jacobian, we need to loop through each direction \hat e_j, add a step, recompute f, and compute a finite difference. Hence, the cost of computing the complete Jacobian is proportional to the number of input variables of interest, n_x.

For a second-order estimate of the first derivative, we can use the expansion of f(x - h \hat e_j) to obtain

Figure 6.4: Exact derivative compared with a central finite-difference approximation (Equation 6.8).

f(x - h \hat e_j) = f(x) - h \frac{\partial f}{\partial {x_j}}+ \frac{h^2}{2!} \frac{\partial^{2} f}{\partial {x_j}^{2}} - \frac{h^3}{3!} \frac{\partial^{3} f}{\partial {x_j}^{3}} + \ldots .

Then, if we subtract this from the expansion in Equation 6.3 and solve the resulting equation for the derivative of f, we get the central-difference formula,

\frac{\partial f}{\partial {x_j}}= \frac{f(x + h \hat e_j)-f(x - h \hat e_j)}{2h} + {\cal O}(h^2).

The stencil of points for this formula is shown in Figure 6.4, where we can see that this estimate is closer to the actual derivative than the forward difference.

Even more accurate estimates can be derived by combining different Taylor series expansions to obtain higher-order truncation error terms. This technique is widely used in finite-difference methods for solving differential equations, where higher-order estimates are desirable.

However, finite-precision arithmetic eventually limits the achievable accuracy for our purposes (as discussed in the next section). With double-precision arithmetic, there are not enough significant digits to realize a significant advantage beyond central difference.

We can also estimate second derivatives (or higher) by combining Taylor series expansions. For example, adding the expansions for f(x+h) and f(x-h) cancels out the first-derivative and third-derivative terms, yielding the second-order approximation to the second derivative,

\frac{\partial^{2} f}{\partial {x_j}^{2}} = \frac{f(x+2 h \hat e_j)-2f(x)+f(x- 2 h \hat e_j)}{4h^2} + {\cal O}\left( h^2 \right) \, .

The finite-difference method can also be used to compute directional derivatives, which are the scalar projection of the gradient onto a given direction. To do this, instead of stepping in orthogonal directions to get the gradient, we need to step in the direction of interest, p, as shown in Figure 6.5. Using the forward difference, for example,

\nabla_p {f} = \frac{f(x + h p)-f(x)}{h} + {\cal O}(h) \, .

One application of directional derivatives is to compute the slope in line searches (Section 4.3).
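As a brief illustration, the following sketch (an assumed example, not from the book) applies the forward-difference formula above along a direction p, for instance to estimate the slope at the start of a line search:

```python
import numpy as np

def directional_derivative(f, x, p, h=1e-6):
    """Forward-difference estimate of the directional derivative of f along p."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    return (f(x + h * p) - f(x)) / h

# Example: f(x) = x1^2 + 3 x2, slope along p = [1, 0] at x = [1, 2] is about 2
slope = directional_derivative(lambda x: x[0]**2 + 3.0 * x[1], [1.0, 2.0], [1.0, 0.0])
```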

Figure 6.5: Computing a directional derivative using a forward finite difference.

6.4.2 The Step-Size Dilemma

When estimating derivatives using finite-difference formulas, we are faced with the step-size dilemma. Because each estimate has a truncation error of \mathcal{O}(h) (or \mathcal{O}(h^2) when second order), we would like to choose as small a step size as possible to reduce this error. However, as the step size decreases, subtractive cancellation (a roundoff error introduced in Section 3.5.1) becomes dominant. Given the opposing trends of these errors, there is an optimal step size for which the sum of the two errors is at a minimum.

Theoretically, the optimal step size for the forward finite difference is approximately \sqrt{\varepsilon_f}, where \varepsilon_f is the precision of f. The error bound is also about \sqrt{\varepsilon_f}. For the central difference, the optimal step size scales approximately with \varepsilon_f^{1/3}, with an error bound of \varepsilon_f^{2/3}. These step-size and error-bound estimates are only approximate and assume well-scaled problems.
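The following short sketch (an assumed example, not from the book) illustrates the step-size dilemma on f(x) = e^x at x = 1: the error first decreases with h because of the truncation error and then increases again as subtractive cancellation takes over.

```python
import numpy as np

f, x, exact = np.exp, 1.0, np.exp(1.0)
for h in np.logspace(-1, -15, 15):
    forward = (f(x + h) - f(x)) / h            # first-order forward difference
    central = (f(x + h) - f(x - h)) / (2 * h)  # second-order central difference
    print(f"h = {h:.0e}   forward error = {abs(forward - exact):.1e}"
          f"   central error = {abs(central - exact):.1e}")
```

For this well-scaled function, the forward-difference error bottoms out near h of about \sqrt{\varepsilon_f}, and the central-difference error near \varepsilon_f^{1/3}, consistent with the estimates above.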

Figure 6.8: Finite-differencing noisy functions can either smooth the derivative estimates or result in estimates with the wrong trends.

Finite-difference approximations are sometimes used with larger steps than would be desirable from an accuracy standpoint to help smooth out numerical noise or discontinuities in the model. This approach sometimes works, but it is better to address these problems within the model whenever possible. Figure 6.8 shows an example of this effect. For this noisy function, the larger step ignores the noise and gives the correct trend, whereas the smaller step results in an estimate with the wrong sign.

6.4.3 Practical Implementation

Algorithm 6.1 details a procedure for computing a Jacobian using forward finite differences. It is usually helpful to scale the step size based on the value of x_j, unless x_j is too small. Therefore, we combine the relative and absolute quantities to obtain the following step size:

\Delta x_j = h \left( 1 + | x_j | \right) \, .

This is similar to the expression for the convergence criterion in Equation 4.24. Although the absolute step size usually differs for each x_j, the relative step size h is often the same and is user-specified.
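A minimal sketch of this procedure is shown below (an assumed implementation in the spirit of Algorithm 6.1, not the book's code), using the scaled step size from the preceding equation:

```python
import numpy as np

def forward_difference_jacobian(f, x, h=1e-6):
    """Approximate J[i, j] = df_i/dx_j with forward differences."""
    x = np.asarray(x, dtype=float)
    f0 = np.atleast_1d(f(x))                    # reference function values
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        dx = h * (1.0 + abs(x[j]))              # relative step scaled by |x_j|
        xp = x.copy()
        xp[j] += dx
        J[:, j] = (np.atleast_1d(f(xp)) - f0) / dx   # one column per input
    return J

# Example usage: f(x) = [x1 x2, sin(x1)]
J = forward_difference_jacobian(lambda x: np.array([x[0] * x[1], np.sin(x[0])]), [1.0, 2.0])
```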

6.5 Complex Step

The complex-step derivative approximation, strangely enough, computes derivatives of real functions using complex variables. Unlike finite differences, the complex-step method requires access to the source code and cannot be applied to black-box components. The complex-step method is accurate but no more efficient than finite differences because the computational cost still scales linearly with the number of variables.

6.5.1 Theory

The complex-step method can also be derived using a Taylor series expansion. Rather than using a real step h, as we did to derive the finite-difference formulas, we use a pure imaginary step, ih.[2] If f is a real function of real variables and is also analytic (differentiable in the complex domain), we can expand it in a Taylor series about a real point x as follows:

f(x + ih \hat e_j) = f(x) + i h \frac{\partial f}{\partial {x_j}}- \frac{h^2}{2} \frac{\partial^{2} f}{\partial {x_j}^{2}} - i \frac{h^3}{6} \frac{\partial^{3} f}{\partial {x_j}^{3}} + \ldots \, .

Taking the imaginary parts of both sides of this equation, we have

\Im \left(f(x + ih \hat e_j)\right) = h \frac{\partial f}{\partial {x_j}}- \frac{h^3}{6} \frac{\partial^{3} f}{\partial {x_j}^{3}} + \ldots \, .

Dividing this by h and solving for \partial f / \partial x_j yields the complex-step derivative approximation,[3]

\frac{\partial f}{\partial {x_j}}= \frac{ \Im \left( f(x + ih \hat e_j) \right) }{h} + \mathcal{O}(h^2) \, ,

which is a second-order approximation. To use this approximation, we must provide a complex number with a perturbation in the imaginary part, compute the original function using complex arithmetic, and take the imaginary part of the output to obtain the derivative.

In practical terms, this means that we must convert the function evaluation to take complex numbers as inputs and compute complex outputs. Because we have assumed that f(x) is a real function of a real variable in the derivation of Equation 6.14, the procedure described here does not work for models that already involve complex arithmetic.

In Section 6.5.2, we explain how to convert programs to handle the required complex arithmetic for the complex-step method to work in general. The complex-step method has been extended to compute exact second derivatives as well.45

Unlike finite differences, this formula has no subtraction operation and thus no subtractive cancellation error. The only source of numerical error is the truncation error. However, the truncation error can be eliminated if h is decreased to a small enough value (say, 10^{-200}). Then, the precision of the complex-step derivative approximation (Equation 6.14) matches the precision of f. This is a tremendous advantage over the finite-difference approximations (Equation 6.4 and Equation 6.8).

Like the finite-difference approach, each evaluation yields a column of the Jacobian (\partial f / \partial x_j), and the cost of computing all the derivatives is proportional to the number of design variables. The cost of the complex-step method is comparable to that of a central difference because we compute a real and an imaginary part for every number in our code.

If we take the real part of the Taylor series expansion (Equation 6.12), we obtain the value of the function on the real axis,

f(x) = \text{Re} \left( f(x + ih \hat e_j) \right) + \mathcal{O}(h^2) \, .

Similar to the derivative approximation, we can make the truncation error disappear by using a small enough h. This means that no separate evaluation of f(x) is required to get the original real value of the function; we can simply take the real part of the complex evaluation.

What is a “small enough h”? When working with finite-precision arithmetic, the error can be eliminated entirely by choosing an h so small that all h^2 terms become zero because of underflow (i.e., h^2 is smaller than the smallest representable number, which is approximately 10^{-324} when using double precision; see Section 3.5.1). Eliminating these squared terms does not affect the accuracy of the derivative carried in the imaginary part because the squared terms only appear in the error terms of the complex-step approximation.

At the same time, h must be large enough that the imaginary part (h \cdot \partial f / \partial x) does not underflow.

Suppose that \mu is the smallest representable number. Then, the two requirements result in the following bounds:

\mu \left| \frac{\partial f}{\partial x} \right|^{-1} < h < \sqrt{\mu} \, .

A step size of 10^{-200} works well for double-precision functions.

6.5.2 Complex-Step Implementation

We can use the complex-step method even when the evaluation of f involves the solution of numerical models through computer programs.3 The outer loop for computing the derivatives of multiple functions with respect to all variables (Algorithm 6.2) is similar to the one for finite differences. A reference function evaluation is not required, but now the function must handle complex numbers correctly.
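A minimal sketch of such an outer loop is shown below (an assumed example in the spirit of Algorithm 6.2, not the book's code); it assumes that f is written with complex-safe operations:

```python
import numpy as np

def complex_step_jacobian(f, x, h=1e-200):
    """Approximate J[i, j] = df_i/dx_j with the complex-step method."""
    x = np.asarray(x, dtype=complex)
    columns = []
    for j in range(x.size):
        xp = x.copy()
        xp[j] += 1j * h                                      # pure imaginary perturbation
        columns.append(np.imag(np.atleast_1d(f(xp))) / h)    # no subtraction, no cancellation
    return np.column_stack(columns)

# Example usage: f(x) = [exp(x1)/sqrt(x1 + x2), x1*x2]
J = complex_step_jacobian(
    lambda x: np.array([np.exp(x[0]) / np.sqrt(x[0] + x[1]), x[0] * x[1]]),
    [1.0, 2.0],
)
```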

The complex-step method can be applied to any model, but modifications might be required. We need the source code for the model to make sure that the program can handle complex numbers and the associated arithmetic, that it handles logical operators consistently, and that certain functions yield the correct derivatives.

First, the program may need to be modified to use complex numbers. In programming languages like Fortran or C, this involves changing real-valued type declarations (e.g., double) to complex type declarations (e.g., double complex). In some languages, such as MATLAB, Python, and Julia, this is unnecessary because functions are overloaded to automatically accept either type.

Second, some changes may be required to preserve the correct logical flow through the program. Relational logic operators (e.g., “greater than” and “less than”) are usually not defined for complex numbers. These operators are often used in programs, together with conditional statements (if, else), to redirect the execution thread. The original algorithm and its “complexified” version should follow the same execution thread. Therefore, defining these operators to compare only the real parts of the arguments is the correct approach. Functions that choose one argument, such as the maximum or the minimum values, are based on relational operators. Following the previous argument, we should determine the maximum and minimum values based on the real parts alone.

Figure 6.10: The usual definition of a complex absolute value returns a real number (the length of the vector), which is not compatible with the complex-step method.

Third, some functions need to be redefined for complex arguments. The most common function that needs to be redefined is the absolute value function. For a complex number, z = x + iy, the absolute value is defined as

| z | = \sqrt{x^2 + y^2} \, ,

as shown in Figure 6.10. This definition is not complex analytic, which is required in the derivation of the complex-step derivative approximation.

Figure 6.11: The absolute value function needs to be redefined such that the imaginary part yields the correct derivatives.

As shown in Figure 6.11, the correct derivatives for the real absolute value function are +1 and -1, depending on whether x is greater than or less than zero. The following complex definition of the absolute value yields the correct derivatives:

\text{abs}(x+iy) = \begin{cases} - x - iy, & \text{if} \quad x < 0 \\ + x + iy, & \text{if} \quad x \ge 0 \, . \end{cases}

Setting the imaginary part to y = h and dividing by h yields the slope of the absolute value function. There is an exception at x = 0, where the function is not analytic, but the derivative does not exist there in any case. We use “greater than or equal” in the logic so that the approximation yields the correct right-sided derivative at that point.
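For example, in a language with operator overloading, this redefinition can be a small function that branches on the real part (a minimal sketch, assuming Python with a plain complex step):

```python
def cs_abs(z):
    """Complex-step-safe absolute value: the imaginary part carries the slope."""
    return -z if z.real < 0 else z

h = 1e-200
print(cs_abs(-3.0 + 1j * h).imag / h)   # -1.0, the slope for x < 0
print(cs_abs( 3.0 + 1j * h).imag / h)   # +1.0, the slope for x >= 0
```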

Depending on the programming language, we may need to redefine some trigonometric functions. This is because some default implementations, although correct, do not maintain accurate derivatives for small complex-step sizes.

We must replace these with mathematically equivalent implementations that avoid numerical issues.

Fortunately, we can automate most of these changes by using scripts to process the codes, and in most programming languages, we can easily redefine functions using operator overloading.[4]

6.6 Algorithmic Differentiation

Algorithmic differentiation (AD)—also known as computational differentiation or automatic differentiation—is a well-known approach based on the systematic application of the chain rule to computer programs.67 The derivatives computed with AD can match the precision of the function evaluation.

The cost of computing derivatives with AD can be proportional to either the number of variables or the number of functions, depending on the type of AD, making it flexible.

Another attractive feature of AD is that its implementation is largely automatic, thanks to various AD tools. To explain AD, we start by outlining the basic theory with simple examples. Then we explore how the method is implemented in practice with further examples.

6.6.1 Variables and Functions as Lines of Code

The basic concept of AD is as follows. Even long, complicated codes consist of a sequence of basic operations (e.g., addition, multiplication, cosine, exponentiation). Each operation can be differentiated symbolically with respect to the variables in the expression.

AD performs this symbolic differentiation and adds the code that computes the derivatives for each variable in the code. The derivatives of each variable accumulate in what amounts to a numerical version of the chain rule.

Figure 6.13: AD considers all the variables in the code, where the inputs x are among the first variables, and the outputs f are among the last.

The fundamental building blocks of a code are unary and binary operations. These operations can be combined to obtain more elaborate explicit functions, which are typically expressed in one line of computer code. We represent the variables in the computer code as the sequence v = [v_1, \ldots, v_i, \ldots, v_n], where n is the total number of variables assigned in the code. One or more of these variables at the start of this sequence are given and correspond to x, and one or more of the variables toward the end of the sequence are the outputs of interest, f, as illustrated in Figure 6.13.

In general, a variable assignment corresponding to a line of code can involve any other variable, including itself, through an explicit function,

v_i = v_i(v_1, v_2,\ldots,v_i,\ldots,v_n) \, .

Except for the most straightforward codes, many of the variables in the code are overwritten as a result of iterative loops.

Figure 6.14: Unrolling of loops is a useful mental model to understand the derivative propagation in the AD of general code.

To understand AD, it is helpful to imagine a version of the code where all the loops are unrolled. Instead of overwriting variables, we create new versions of those variables, as illustrated in Figure 6.14. Then, we can represent the computations in the code in a sequence with no loops such that each variable in this larger set only depends on previous variables, and then

v_i = v_i(v_1, v_2,\ldots,v_{i-1}) \, .

Given such a sequence of operations and the derivatives for each operation, we can apply the chain rule to obtain the derivatives for the entire sequence. Unrolling the loops is just a mental model for understanding how the chain rule operates, and it is not explicitly done in practice.

The chain rule can be applied in two ways. In the forward mode, we choose one input variable and work forward toward the outputs until we get the desired total derivative. In the reverse mode, we choose one output variable and work backward toward the inputs until we get the desired total derivative.

6.6.2 Forward-Mode AD

The chain rule for the forward mode can be written as

\frac{dv_i}{d{v_j}}= \sum_{k=j}^{i-1} \frac{\partial v_i}{\partial {v_k}}\frac{dv_k}{d{v_j}}\, ,

where each partial derivative is obtained by symbolically differentiating the explicit expression for v_i. The total derivatives are the derivatives with respect to the chosen input v_j, which can be computed using this chain rule.

Using the forward mode, we evaluate a sequence of these expressions by fixing j in Equation 6.21 (effectively choosing one input v_j) and incrementing i to get the derivative of each variable v_i. We only need to sum up to i-1 because of the form of Equation 6.20, where each v_i only depends on variables that precede it.

For a more convenient notation, we define a new variable that represents the total derivative of variable i with respect to a fixed input j as \dot v_i \equiv \mathrm{d} v_i / \mathrm{d} v_j and rewrite the chain rule as

\dot v_i = \sum_{k=j}^{i-1} \frac{\partial v_i}{\partial {v_k}}\dot v_k \, .

The chosen input j corresponds to the seed, which we set to \dot v_j = 1 (from the definition of \dot v_j, we see that this means setting \mathrm{d} v_j/\mathrm{d} v_j = 1).

Figure 6.15: The forward mode propagates derivatives to all the variables that depend on the seeded input variable.

This chain rule then propagates the total derivatives forward, as shown in Figure 6.15, affecting all the variables that depend on the seeded variable.

Once we are done applying the chain rule (Equation 6.22) for the chosen input variable v_j, we end up with the total derivatives \mathrm{d} v_i / \mathrm{d} v_j for all i > j. The sum in the chain rule (Equation 6.22) only needs to consider the nonzero partial derivative terms. If a variable k does not explicitly appear in the expression for v_i, then \partial v_i / \partial v_k = 0, and there is no need to consider the corresponding term in the sum. In practice, this means that only a small number of terms is considered for each sum.

Suppose we have four variables v_1, v_2, v_3, and v_4, with x \equiv v_1 and f \equiv v_4, and we want \mathrm{d} f / \mathrm{d} x. We assume that each variable depends explicitly on all the previous ones. Using the chain rule (Equation 6.22), we set j = 1 (because we want the derivative with respect to x \equiv v_1) and increment i to get the sequence of derivatives:

\begin{aligned} {\dot v_1} &= 1 \\ {\dot v_2} &= \frac{\partial v_2}{\partial v_1} {\dot v_1}\\ {\dot v_3} &= \frac{\partial v_3}{\partial v_1} {\dot v_1} + \frac{\partial v_3}{\partial v_2} {\dot v_2} \\ \dot v_4 &= \frac{\partial v_4}{\partial v_1} {\dot v_1} + \frac{\partial v_4}{\partial v_2} {\dot v_2} + \frac{\partial v_4}{\partial v_3} {\dot v_3} \equiv \frac{df}{dx}\,. \end{aligned}

In each step, we just need to compute the partial derivatives of the current operation v_i and then multiply them by the total derivatives \dot v that have already been computed. We move forward by evaluating the partial derivatives of v in the same sequence used to evaluate the original function.

This is convenient because all of the unknowns are partial derivatives, meaning that we only need to compute derivatives based on the operation at hand (or line of code).

In this abstract example with four variables that depend on each other sequentially, the Jacobian of the variables with respect to themselves is as follows:

J_v = \begin{bmatrix} \textcolor{#4E79A7}{1} & 0 & 0 & 0 \\ \textcolor{#4E79A7}{\frac{dv_2}{d{v_1}}} & 1 & 0 & 0 \\ \textcolor{#4E79A7}{\frac{dv_3}{d{v_1}}} & \frac{dv_3}{d{v_2}}& 1 & 0 \\ \textcolor{#4E79A7}{\frac{dv_4}{d{v_1}}} & \frac{dv_4}{d{v_2}}& \frac{dv_4}{d{v_3}}& 1 \end{bmatrix} \, .

By setting the seed \dot v_1 = 1 and using the forward chain rule (Equation 6.22), we have computed the first column of J_v from top to bottom. This column corresponds to the tangent with respect to v_1. Using forward-mode AD, obtaining derivatives for other outputs is free (e.g., \mathrm{d} v_3 / \mathrm{d} v_1 \equiv \dot v_3 in Equation 6.23).

However, if we want the derivatives with respect to additional inputs, we would need to set a different seed and evaluate an entire set of similar calculations. For example, if we wanted \mathrm{d} v_4 / \mathrm{d} v_2, we would set the seed as \dot v_2 = 1 and evaluate the equations for \dot v_3 and \dot v_4, where we would now have \mathrm{d} v_4 / \mathrm{d} v_2 = \dot v_4. This would correspond to computing the second column in J_v (Equation 6.24).

Thus, the cost of the forward mode scales linearly with the number of inputs we are interested in and is independent of the number of outputs.

So far, we have assumed that we are computing derivatives with respect to each component of x. However, just like for finite differences, we can also compute directional derivatives using forward-mode AD. We do so by setting the appropriate seed in the \dot v’s that correspond to the inputs in a vectorized manner. Suppose we have x \equiv [v_1, \ldots, v_{n_x}]. To compute the derivative with respect to x_j, we would set the seed as the unit vector \dot v = \hat e_j and follow a similar process for the other elements. To compute a directional derivative in direction p, we would set the seed as \dot v = p / \|p\|.

6.6.3 Reverse-Mode AD

The reverse mode is also based on the chain rule but uses the alternative form:

\frac{dv_i}{d{v_j}}= \sum_{k=j+1}^{i} \frac{\partial v_k}{\partial v_j} \frac{dv_i}{d{v_k}}\, ,

where the summation happens in reverse (starts at i and decrements to j+1). This is less intuitive than the forward chain rule, but it is equally valid. Here, we fix the index i corresponding to the output of interest and decrement j until we get the desired derivative.

Similar to the forward-mode total derivative notation (Equation 6.22), we define a more convenient notation for the variables that carry the total derivatives with a fixed i as \bar v_j \equiv \mathrm{d} v_i / \mathrm{d} v_j,

which are sometimes called adjoint variables. Then we can rewrite the chain rule as

\bar v_j = \sum_{k=j+1}^{i} \frac{\partial v_k}{\partial v_j} \bar v_k \, .

This chain rule propagates the total derivatives backward after setting the reverse seed \bar v_i = 1, as shown in Figure 6.18.

Figure 6.18: The reverse mode propagates derivatives to all the variables on which the seeded output variable depends.

This affects all the variables on which the seeded variable depends.

The reverse-mode variables \bar v represent the derivatives of one output, i, with respect to all the input variables (instead of the derivatives of all the outputs with respect to one input, j, in the forward mode). Once we are done applying the reverse chain rule (Equation 6.27) for the chosen output variable v_i, we end up with the total derivatives \mathrm{d} v_i / \mathrm{d} v_j for all j < i.

Applying the reverse mode to the same four-variable example as before, we get the following sequence of derivative computations (we set i = 4 and decrement j):

\begin{aligned} {\bar v_4} &= 1\\ {\bar v_3} &= \frac{\partial v_4}{\partial v_3} {\bar v_4}\\ {\bar v_2} &= \frac{\partial v_3}{\partial v_2} {\bar v_3} + \frac{\partial v_4}{\partial v_2} {\bar v_4}\\ \bar v_1 &= \frac{\partial v_2}{\partial v_1} {\bar v_2} + \frac{\partial v_3}{\partial v_1} {\bar v_3} + \frac{\partial v_4}{\partial v_1} {\bar v_4} \equiv \frac{df}{dx}\, . \end{aligned}

The partial derivatives of v must be computed for v_4 first, then v_3, and so on. Therefore, we have to traverse the code in reverse. In practice, not every variable depends on every other variable, so a computational graph is created during code evaluation. Then, when computing the adjoint variables, we traverse the computational graph in reverse. As before, the derivatives we need to compute in each line are only partial derivatives.

Recall the Jacobian of the variables,

J_v = \begin{bmatrix} {1} & 0 & 0 & 0 \\ {\frac{dv_2}{d{v_1}}} & 1 & 0 & 0 \\ {\frac{dv_3}{d{v_1}}} & \frac{dv_3}{d{v_2}}& 1 & 0 \\ \textcolor{#4E79A7}{\frac{dv_4}{d{v_1}}} & \textcolor{#4E79A7}{\frac{dv_4}{d{v_2}}} & \textcolor{#4E79A7}{\frac{dv_4}{d{v_3}}} & \textcolor{#4E79A7}{1} \end{bmatrix} \, .

By setting \bar v_4 = 1 and using the reverse chain rule (Equation 6.27), we have computed the last row of J_v from right to left. This row corresponds to the gradient of f \equiv v_4. Using the reverse mode of AD, obtaining derivatives with respect to additional inputs is free (e.g., \mathrm{d} v_4 / \mathrm{d} v_2 \equiv \bar v_2 in Equation 6.28).

However, if we wanted the derivatives of additional outputs, we would need to evaluate a different sequence of derivatives. For example, if we wanted \mathrm{d} v_3 / \mathrm{d} v_1, we would set \bar v_3 = 1 and evaluate the expressions for \bar v_2 and \bar v_1 in Equation 6.28, where \mathrm{d} v_3 / \mathrm{d} v_1 \equiv \bar v_1. Thus, the cost of the reverse mode scales linearly with the number of outputs and is independent of the number of inputs.

One complication with the reverse mode is that the resulting sequence of derivatives requires the values of the variables, starting with the last ones and progressing in reverse. For example, the partial derivative in the second operation of Equation 6.28 might involve v_3. Therefore, the code needs to run in a forward pass first, and all the variables must be stored for use in the reverse pass, which increases memory usage.

Unlike with forward-mode AD and finite differences, it is impossible to compute a directional derivative by setting the appropriate seeds.

Although the seeds in the forward mode are associated with the inputs, the seeds for the reverse mode are associated with the outputs. Suppose we have multiple functions of interest, f \equiv [v_{n-n_f}, \ldots, v_n]. To find the derivatives of f_1 in a vectorized operation, we would set \bar v = [1, 0, \ldots, 0].

A seed with multiple nonzero elements computes the derivatives of a weighted function with respect to all the variables, where the weight for each function is determined by the corresponding vˉ\bar v value.

6.6.4 Forward Mode or Reverse Mode?

Our goal is to compute J_f, the (n_f \times n_x) matrix of derivatives of all the functions of interest f with respect to all the input variables x. However, AD computes many other derivatives corresponding to intermediate variables. The complete Jacobian for all the intermediate variables, v_i = v_i(v_1, v_2,\ldots,v_i,\ldots,v_n), assuming that the loops have been unrolled, has the structure shown in Figure 6.22 and Figure 6.23.

The input variables x are among the first entries in v, whereas the functions of interest f are among the last entries of v.

Figure 6.22: When n_x < n_f, the forward mode is advantageous.

For simplicity, let us assume that the entries corresponding to x and f are contiguous, as previously shown in Figure 6.13. Then, the derivatives we want (J_f) are a block located on the lower left in the much larger matrix (J_v), as shown in Figure 6.22 and Figure 6.23. Although we are only interested in this block, AD requires the computation of additional intermediate derivatives.

Figure 6.23: When n_x > n_f, the reverse mode is advantageous.

The main difference between the forward and the reverse approaches is that the forward mode computes the Jacobian column by column, whereas the reverse mode does it row by row. Thus, the cost of the forward mode is proportional to n_x, whereas the cost of the reverse mode is proportional to n_f. If we have more outputs (e.g., objective and constraints) than inputs (design variables), the forward mode is more efficient, as illustrated in Figure 6.22. On the other hand, if we have many more inputs than outputs, then the reverse mode is more efficient, as illustrated in Figure 6.23. If the number of inputs is similar to the number of outputs, neither mode has a significant advantage.

In both modes, each forward or reverse pass costs less than 2–3 times the cost of running the original code in practice. However, because the reverse mode requires storing a large amount of data, memory costs also need to be considered. In principle, the required memory is proportional to the number of variables, but there are techniques that can reduce the memory usage significantly.[5]

6.6.5 AD Implementation

There are two main ways to implement AD: by source code transformation or by operator overloading. The function we used to demonstrate the issues with symbolic differentiation (Example 6.2) can be differentiated much more easily with AD. In the examples that follow, we use this function to demonstrate how the forward and reverse modes work using both source code transformation and operator overloading.

Source Code Transformation

AD tools that use source code transformation process the whole source code automatically with a parser and add lines of code that compute the derivatives. The added code is highlighted in Example 6.7 and Example 6.8.

Operator Overloading

The operator overloading approach creates a new augmented data type that stores both the variable value and its derivative. Every floating-point number v is replaced by a new type with two parts, (v, \dot v), commonly referred to as a dual number. All standard operations (e.g., addition, multiplication, sine) are overloaded such that they compute v according to the original function value and \dot v according to the derivative of that function. For example, the multiplication operation, x_1 \cdot x_2, would be defined for the dual-number data type as

(x_1, \dot x_1) \cdot (x_2, \dot x_2) = (x_1 x_2, x_1 \dot x_2 + \dot x_1 x_2),

where we compute the original function value in the first term, and the second term carries the derivative of the multiplication.

Although we wrote the two parts explicitly in Equation 6.31, the source code would only show a normal multiplication, such as v_3 = v_1 \cdot v_2. However, each of these variables would be of the new type and carry the corresponding \dot v quantities. By overloading all the required operations, the computations happen “behind the scenes”, and the source code does not have to be changed, except to declare all the variables to be of the new type and to set the seed. Example 6.9 lists the original code from Example 6.2 with notes on the actual computations that are performed as a result of overloading.
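To make the idea concrete, the following toy dual-number class (an assumed sketch, not any particular AD tool) overloads addition and multiplication and provides a sin function so that the derivative is propagated alongside the value:

```python
import math

class Dual:
    """Dual number (v, v_dot) for forward-mode AD by operator overloading."""
    def __init__(self, v, dot=0.0):
        self.v, self.dot = v, dot

    def __add__(self, other):
        return Dual(self.v + other.v, self.dot + other.dot)

    def __mul__(self, other):
        # (v1, v1_dot) * (v2, v2_dot) = (v1 v2, v1 v2_dot + v1_dot v2)
        return Dual(self.v * other.v, self.v * other.dot + self.dot * other.v)

def sin(a):
    return Dual(math.sin(a.v), math.cos(a.v) * a.dot)

# f(x1, x2) = x1 x2 + sin(x1); seed x1_dot = 1 to get df/dx1 = x2 + cos(x1)
x1, x2 = Dual(1.5, 1.0), Dual(0.5, 0.0)
f = x1 * x2 + sin(x1)
print(f.v, f.dot)
```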

The implementation of the reverse mode using operator overloading is less straightforward and is not detailed here. It requires a new data type that stores the information from the computational graph and the variable values when running the forward pass. This information can be stored using the taping technique. After the forward evaluation using the new type, the “tape” holds the sequence of operations, which is then evaluated in reverse to propagate the reverse-mode seed.[8]
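For illustration, a toy version of this taping idea (an assumed sketch, far simpler than real AD tools) records each operation and its partial derivatives during the forward pass and then sweeps the tape in reverse to accumulate the adjoints:

```python
import math

TAPE = []  # operations recorded in evaluation order (the forward pass)

class Var:
    def __init__(self, value, parents=()):
        self.value = value      # v_i from the forward pass
        self.parents = parents  # tuples (parent, partial dv_i/dv_parent)
        self.bar = 0.0          # adjoint, accumulated in the reverse pass
        TAPE.append(self)

def mul(a, b):
    return Var(a.value * b.value, ((a, b.value), (b, a.value)))

def sin(a):
    return Var(math.sin(a.value), ((a, math.cos(a.value)),))

def reverse_sweep(output):
    output.bar = 1.0                             # reverse seed
    for node in reversed(TAPE):                  # traverse the tape in reverse
        for parent, partial in node.parents:
            parent.bar += partial * node.bar     # bar_j += (dv_k/dv_j) * bar_k

# f(x1, x2) = sin(x1 x2); exact gradient: [x2 cos(x1 x2), x1 cos(x1 x2)]
x1, x2 = Var(1.5), Var(0.5)
f = sin(mul(x1, x2))
reverse_sweep(f)
print(x1.bar, x2.bar)
```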

Connection of AD with the Complex-Step Method

The complex-step method from Section 6.5 can be interpreted as forward-mode AD using operator overloading, where the data type is the complex number (x, y) \equiv x + iy, and the imaginary part y carries the derivative. To see this connection more clearly, let us write the complex multiplication operation as

f = (x_1+iy_1) (x_2+iy_2) = \left(x_1 x_2 - y_1 y_2 \right) + i\left( y_1 x_2 + x_1 y_2 \right).

This equation is similar to the overloaded multiplication (Equation 6.31). The only difference is that the real part includes the term -y_1 y_2, which corresponds to the second-order error term in Equation 6.15. In this case, the complex part gives the exact derivative, but a second-order error might appear for other operations. As argued before, these errors vanish in finite-precision arithmetic if the complex step is small enough.

Source Code Transformation versus Operator Overloading

The source code transformation and the operator overloading approaches each have their relative advantages and disadvantages. The overloading approach is much more elegant because the original code stays practically the same and can be maintained directly. On the other hand, the source transformation approach enlarges the original code and results in less readable code, making it hard to work with. Still, it is easier to see what operations take place when debugging. Instead of maintaining source code transformed by AD, it is advisable to work with the original source and devise a workflow where the parser is rerun before compiling a new version.

One advantage of the source code transformation approach is that it tends to yield faster code and allows more straightforward compile-time optimizations. The overloading approach requires a language that supports user-defined data types and operator overloading, whereas source transformation does not. Developing a source transformation AD tool is usually more challenging than developing the overloading approach because it requires an elaborate parser that understands the source syntax.

6.6.6 AD Shortcuts for Matrix Operations

The efficiency of AD can be dramatically increased with manually implemented shortcuts. When the code involves matrix operations, manual implementation of a higher-level differentiation of those operations is more efficient than the line-by-line AD implementation. Giles (2008) documents the forward and reverse differentiation of many matrix elementary operations.

For example, suppose that we have a matrix multiplication C = AB. Then, the forward mode yields

\dot C = \dot A B + A \dot B \, .

The idea is to use \dot A and \dot B from the AD code preceding the operation and then manually implement this formula (bypassing any AD of the code that performs that operation) to obtain \dot C, as shown in Figure 6.24. Then we can use \dot C to seed the remainder of the AD code.

The reverse mode of the multiplication yields

\bar A = \bar C B^\intercal, \quad \bar B = A^\intercal \bar C \, .

Similarly, we take \bar C from the reverse AD code and implement the formula manually to compute \bar A and \bar B, which we can use in the remaining AD code in the reverse procedure.

Figure 6.24: Matrix operations, including the solution of linear systems, can be differentiated manually to bypass more costly AD code.

One particularly useful (and astounding!) result is the differentiation of the matrix inverse product. If we have a linear solver such that C = A^{-1} B, we can bypass the solver in the AD process by using the following results:

\dot C = A^{-1} \left( \dot B - \dot A C \right)

for the forward mode and

\bar B = A^{-\intercal} \bar C, \quad \bar A = -\bar B C^\intercal

for the reverse mode.
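The following sketch (an assumed example, not from the book) applies the forward-mode shortcut for the linear solve and checks it against a forward finite difference; the quantities \dot A and \dot B stand in for seeds coming from the upstream AD code:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)   # well-conditioned system matrix
B = rng.standard_normal((4, 2))
A_dot = rng.standard_normal((4, 4))                 # forward seeds from upstream code
B_dot = rng.standard_normal((4, 2))

C = np.linalg.solve(A, B)                           # C = A^{-1} B
C_dot = np.linalg.solve(A, B_dot - A_dot @ C)       # forward-mode shortcut

# Check against a forward finite difference of the perturbed solve
h = 1e-7
C_fd = (np.linalg.solve(A + h * A_dot, B + h * B_dot) - C) / h
print(np.max(np.abs(C_dot - C_fd)))                 # small, on the order of h
```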

In addition to deriving the formulas just shown, Giles (2008) derives formulas for the matrix derivatives of the inverse, determinant, norms, quadratic, polynomial, exponential, eigenvalues and eigenvectors, and singular value decomposition. Taking shortcuts as described here applies more broadly to any case where a part of the process can be differentiated manually to produce a more efficient derivative computation.

6.7 Implicit Analytic Methods—Direct and Adjoint

Direct and adjoint methods—which we refer to jointly as implicit analytic methods—linearize the model governing equations to obtain a system of linear equations whose solution yields the desired derivatives. Like the complex-step method and AD, implicit analytic methods compute derivatives with a precision matching that of the function evaluation.

The direct method is analogous to forward-mode AD, whereas the adjoint method is analogous to reverse-mode AD.

Analytic methods can be thought of as lying in between the finite-difference method and AD in terms of the number of variables involved. With finite differences, we only need to be aware of inputs and outputs, whereas AD involves every single variable assignment in the code. Analytic methods work at the model level and thus require knowledge of the governing equations and the corresponding state variables.

There are two main approaches to deriving implicit analytic methods: continuous and discrete. The continuous approach linearizes the original continuous governing equations, such as a partial differential equation (PDE), and then discretizes this linearization. The discrete approach linearizes the governing equations only after they have been discretized as a set of residual equations, r(u) = 0.

Each approach has its advantages and disadvantages. The discrete approach is preferred and is easier to generalize, so we explain this approach exclusively. One of the primary reasons the discrete approach is preferred is that the resulting derivatives are consistent with the function values because they use the same discretization. The continuous approach is only consistent in the limit of a fine discretization. The resulting inconsistencies can mislead the optimization.[9]

6.7.1 Residuals and Functions

As mentioned in Chapter 3, a discretized numerical model can be written as a system of residuals,

r(u; x) = 0 \, ,

where the semicolon denotes that the design variables x are fixed when these equations are solved for the state variables u. Through these equations, u is an implicit function of x. This relationship is represented by the box containing the solver and residual equations in Figure 6.25.

Figure 6.25: Relationship between functions and design variables for a system involving a solver. The implicit equations r(u; x) = 0 define the states u for a given x, so the functions of interest f depend explicitly and implicitly on the design variables x.

The functions of interest, f(x, u), are typically explicit functions of the state variables and the design variables. However, because u is an implicit function of x, f is ultimately an implicit function of x as well. To compute f for a given x, we must first find u such that r(u; x) = 0. This is usually the most computationally costly step and requires a solver (see Section 3.6). The residual equations could be nonlinear and involve many state variables. In PDE-based models it is common to have millions of states. Once we have solved for the state variables u, we can compute the functions of interest f. The computation of f for a given u and x is usually much cheaper because it does not require a solver. For example, in PDE-based models, computing such functions typically involves an integration of the states over a surface, or some other transformation of the states.

To compute \mathrm{d} f / \mathrm{d} x using finite differences, we would have to use the solver to find u for each perturbation of x. That means that we would have to run the solver n_x times, which would not scale well when the solution is costly. AD also requires the propagation of derivatives through the solution process. As we will see, implicit analytic methods avoid involving the potentially expensive nonlinear solution in the derivative computation.

6.7.2 Direct and Adjoint Derivative Equations

The derivatives we ultimately want to compute are the ones in the Jacobian \mathrm{d} f / \mathrm{d} x. Given the explicit and implicit dependence of f on x, we can use the chain rule to write the total derivative Jacobian of f as

\frac{df}{dx}= \frac{\partial f}{\partial x} + \frac{\partial f}{\partial u} \frac{du}{dx}\, ,

where the result is an (n_f \times n_x) matrix.[10] In this context, the total derivatives, \mathrm{d} f/\mathrm{d} x, take into account the change in u that is required to keep the residuals of the governing equations (Equation 6.37) equal to zero. The partial derivatives in Equation 6.39 represent the variation of f(x, u) with respect to changes in x or u without regard to satisfying the governing equations.

To better understand the difference between total and partial derivatives in this context, imagine computing these derivatives using finite differences with small perturbations. For the total derivatives, we would perturb x, re-solve the governing equations to obtain u, and then compute f, which would account for both dependency paths in Figure 6.25. To compute the partial derivatives \partial f/\partial x and \partial f/\partial u, however, we would perturb x or u and recompute f without re-solving the governing equations. In general, these partial derivative terms are cheap to compute numerically or can be obtained symbolically.

To find the total derivative \mathrm{d} u / \mathrm{d} x, we need to consider the governing equations. Assuming that we are at a point where r(x, u) = 0, any perturbation in x must be accompanied by a perturbation in u such that the governing equations remain satisfied. Therefore, the differential of the residuals can be written as

\mathrm{d} r= \frac{\partial r}{\partial x} \mathrm{d} x + \frac{\partial r}{\partial u} \mathrm{d} u= 0 \, .

This constraint is illustrated in Figure 6.26 in two dimensions, but keep in mind that x, u, and r are vectors in the general case.

The governing equations (Equation 6.37) map an n_x-vector x to an n_u-vector u. This mapping defines a hypersurface (also known as a manifold) in the x–u space.

The total derivative \mathrm{d} f /\mathrm{d} x that we ultimately want to compute represents the effect that a perturbation on x has on f subject to the constraint of remaining on this hypersurface, which can be achieved with the appropriate variation in u.

Figure 6.26: The governing equations determine the values of u for a given x. Given a point that satisfies the equations, the appropriate differential in u must accompany a differential of x about that point for the equations to remain satisfied.

To obtain a more useful equation, we rearrange Equation 6.40 to get the linear system

\frac{\partial r}{\partial u} \frac{du}{dx}= - \frac{\partial r}{\partial x} \, ,

where \partial r / \partial x and \mathrm{d} u / \mathrm{d} x are both (n_u \times n_x) matrices, and \partial r/\partial u is a square matrix of size (n_u \times n_u). This linear system is useful because if we provide the partial derivatives in this equation (which are cheap to compute), we can solve for the total derivatives \mathrm{d} u / \mathrm{d} x (whose computation would otherwise require re-solving r(u) = 0). Because \mathrm{d} u / \mathrm{d} x is a matrix with n_x columns, this linear system needs to be solved for each x_i with the corresponding column of the right-hand-side matrix \partial r/\partial x_i.

Now let us assume that we can invert the matrix in the linear system (Equation 6.41) and substitute the solution for \mathrm{d} u / \mathrm{d} x into the total derivative equation (Equation 6.39). Then we get

\frac{df}{dx}= \frac{\partial f}{\partial x} - \frac{\partial f}{\partial u} \left. \frac{\partial r}{\partial u} \right. ^{-1} \frac{\partial r}{\partial x} \, ,

where all the derivative terms on the right-hand side are partial derivatives. The partial derivatives in this equation can be computed using any of the methods that we have described earlier: symbolic differentiation, finite differences, complex step, or AD. Equation 6.42 shows two ways to compute the total derivatives, which we call the direct method and the adjoint method.

The direct method (already outlined earlier) consists of solving the linear system (Equation 6.41) and substituting du/dx\mathrm{d} u / \mathrm{d} x into Equation 6.39. Defining ϕdu/dx\phi \equiv - \mathrm{d} u / \mathrm{d} x, we can rewrite Equation 6.41 as

ruϕ=rx. \frac{\partial r}{\partial u} \phi = \frac{\partial r}{\partial x} \, .

After solving for ϕ\phi (one column at a time), we can use it in the total derivative equation (Equation 6.39) to obtain,

dfdx=fxfuϕ. \frac{df}{dx}= \frac{\partial f}{\partial x} - \frac{\partial f}{\partial u} \phi \, .

This is sometimes called the forward mode because it is analogous to forward-mode AD.

Solving the linear system (Equation 6.43) is typically the most computationally expensive operation in this procedure. The cost of this approach scales with the number of inputs nxn_x but is essentially independent of the number of outputs nfn_f. This is the same scaling behavior as finite differences and forward-mode AD. However, the constant of proportionality is typically much smaller in the direct method because we only need to solve the nonlinear equations r(u;x)=0r(u; x)=0 once to obtain the states.
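To make the steps concrete, the following minimal Python sketch applies Equation 6.43 and Equation 6.44 directly, assuming the four partial-derivative blocks are already available as dense NumPy arrays (the names, sizes, and random values below are hypothetical placeholders, not a real model):

```python
import numpy as np

def total_derivs_direct(dr_du, dr_dx, df_dx, df_du):
    """Direct (forward) method: df/dx from dense partial derivatives.

    Shapes: dr_du (nu, nu), dr_dx (nu, nx), df_dx (nf, nx), df_du (nf, nu).
    """
    # Solve (dr/du) phi = dr/dx; NumPy solves all nx right-hand-side columns at once.
    phi = np.linalg.solve(dr_du, dr_dx)   # phi = -du/dx, shape (nu, nx)
    return df_dx - df_du @ phi            # total derivatives, shape (nf, nx)

# Toy partial derivatives with made-up sizes, just to show the shapes involved.
rng = np.random.default_rng(0)
nu, nx, nf = 4, 3, 2
dr_du = rng.standard_normal((nu, nu)) + 4.0 * np.eye(nu)  # keep it nonsingular
dr_dx = rng.standard_normal((nu, nx))
df_dx = rng.standard_normal((nf, nx))
df_du = rng.standard_normal((nf, nu))
print(total_derivs_direct(dr_du, dr_dx, df_dx, df_du))
```

In a real PDE-based model, the residual Jacobian would typically be large and sparse, so the dense solve above would be replaced by the factorization or iterative solver already used for the governing equations.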


Figure 6.27:The total derivatives (Equation 6.42) can be computed either by solving for ϕ\phi (direct method) or by solving for ψ\psi (adjoint method).

The adjoint method changes the linear system that is solved to compute the total derivatives. Looking at Figure 6.27, we see that instead of solving the linear system with r/x{\partial r}/{\partial x} on the right-hand side, we can solve it with f/u{\partial f}/{\partial u} on the right-hand side. This corresponds to replacing the two Jacobians in the middle with a new matrix of unknowns,

ψfuru1,\psi^\intercal \equiv \frac{\partial f}{\partial u} \left. \frac{\partial r}{\partial u}\right.^{-1} \, ,

where the columns of ψ\psi are called the adjoint vectors. Multiplying both sides of Equation 6.45 by r/u\partial r / \partial u on the right and taking the transpose of the whole equation, we obtain the adjoint equation,

ruψ=fu. \left.\frac{\partial r}{\partial u}\right.^\intercal \psi = \left.\frac{\partial f}{\partial u} \right.^\intercal \, .

This linear system has no dependence on xx. Each adjoint vector is associated with a function of interest fjf_j and is found by solving the adjoint equation (Equation 6.46) with the corresponding row fj/u{\partial f_j}/{\partial u}. The solution (ψ\psi) is then used to compute the total derivative

dfdx=fxψrx. \frac{df}{dx}= \frac{\partial f}{\partial x} - \psi^\intercal \frac{\partial r}{\partial x} \, .

This is sometimes called the reverse mode because it is analogous to reverse-mode AD.
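A corresponding sketch of the adjoint version (with the same hedged assumptions: hypothetical dense arrays standing in for the partial derivatives) solves the transposed system once per output and can be checked against Equation 6.42 with an explicit inverse:

```python
import numpy as np

def total_derivs_adjoint(dr_du, dr_dx, df_dx, df_du):
    """Adjoint (reverse) method: df/dx from dense partial derivatives."""
    # Solve (dr/du)^T psi = (df/du)^T; each column of psi is the adjoint
    # vector associated with one function of interest f_j.
    psi = np.linalg.solve(dr_du.T, df_du.T)  # shape (nu, nf)
    return df_dx - psi.T @ dr_dx             # total derivatives, shape (nf, nx)

# Check against the explicit-inverse form on made-up data.
rng = np.random.default_rng(1)
nu, nx, nf = 4, 3, 2
dr_du = rng.standard_normal((nu, nu)) + 4.0 * np.eye(nu)
dr_dx = rng.standard_normal((nu, nx))
df_dx = rng.standard_normal((nf, nx))
df_du = rng.standard_normal((nf, nu))
reference = df_dx - df_du @ np.linalg.inv(dr_du) @ dr_dx
print(np.allclose(total_derivs_adjoint(dr_du, dr_dx, df_dx, df_du), reference))
```

The arithmetic is identical to the direct sketch; only the grouping of the products, and therefore the number of linear solves, changes from the number of inputs to the number of outputs.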

As we will see in Section 6.9, the adjoint vectors are equivalent to the total derivatives df/dr\mathrm{d} f / \mathrm{d} r, which quantify the change in the function of interest given a perturbation in the residual that gets zeroed out by an appropriate change in uu.[11]

6.7.3 Direct or Adjoint?


Figure 6.28:Two possibilities for the size of df/dx\mathrm{d} f/\mathrm{d} x in Figure 6.27. When nx<nfn_x<n_f, it is advantageous to solve the linear system with the vector to the right of the square matrix because it has fewer columns. When nx>nfn_x>n_f, it is advantageous to solve the transposed linear system with the vector to the left because it has fewer rows.

Similar to the direct method, the solution of the adjoint linear system (Equation 6.46) tends to be the most expensive operation. Although the linear system is of the same size as that of the direct method, the cost of the adjoint method scales with the number of outputs nfn_f and is essentially independent of the number of inputs nxn_x. The comparison between the computational cost of the direct and adjoint methods is summarized in Table 3 and illustrated in Figure 6.28.

Similar to the trade-offs between forward- and reverse-mode AD, if the number of outputs is greater than the number of inputs, the direct (forward) method is more efficient (Figure 6.28, top). On the other hand, if the number of inputs is greater than the number of outputs, it is more efficient to use the adjoint (reverse) method (Figure 6.28, bottom). When the number of inputs and outputs is large and similar, neither method has an advantage, and the cost of computing the full total derivative Jacobian might be prohibitive. In this case, aggregating the outputs and using the adjoint method might be effective, as explained in Tip 6.7.

In practice, the adjoint method is implemented much more often than the direct method. Although both methods require a similar implementation effort, the direct method competes with methods that are much more easily implemented, such as finite differencing, complex step, and forward-mode AD. On the other hand, the adjoint method only competes with reverse-mode AD, which is plagued by the memory issue.

Table 3:Cost comparison of computing derivatives with direct and adjoint methods.

Step                             Direct      Adjoint
Partial derivative computation   Same        Same
Linear solution                  n_x times   n_f times
Matrix multiplications           Same        Same

Another reason why the adjoint method is more widely used is that many optimization problems have a few functions of interest (one objective and a few constraints) and many design variables. The adjoint method has made it possible to solve optimization problems involving computationally intensive PDE models.[12]

Although implementing implicit analytic methods is labor intensive, it is worthwhile if the differentiated code is used frequently in applications that demand repeated evaluations. For such applications, analytic differentiation with the partial derivatives computed using AD is the recommended approach because it combines the accuracy and scalability of the implicit analytic methods with the lower implementation effort of AD.

6.7.4 Adjoint Method with AD Partial Derivatives

Implementing the implicit analytic methods for models involving long, complicated code requires significant development effort. In this section, we focus on implementing the adjoint method because it is more widely used, as explained in Section 6.7.3. We assume that nf<nxn_f < n_x, so that the adjoint method is advantageous.

To ease the implementation of adjoint methods, we recommend a hybrid adjoint approach where the reverse mode of AD computes the partial derivatives in the adjoint equations (Equation 6.46) and total derivative equation (Equation 6.47).[18]

The partial derivative terms f/x\partial f / \partial x form an (nf×nx)(n_f \times n_x) matrix, and f/u\partial f / \partial u is an (nf×nu)(n_f \times n_u) matrix. These partial derivatives can be computed by identifying the section of the code that computes ff for a given xx and uu and running the AD tool for that section. This produces code that takes fˉ\bar f as an input and outputs xˉ\bar x and uˉ\bar u, as shown in Figure 6.31.


Figure 6.31:Applying reverse AD to the code that computes ff produces code that computes the partial derivatives of ff with respect to xx and uu.

Recall that we must first run the entire original code that computes uu and ff. Then we can run the AD code with the desired seed. Suppose we want the derivative of the jjth component of ff. We would set fˉj=1\bar f_j = 1 and the other elements to zero. After running the AD code, we obtain xˉ\bar x and uˉ\bar u, which correspond to the rows of the respective matrix of partial terms, that is,

xˉ=fjx,uˉ=fju.\bar x = \frac{\partial {f_j}}{\partial x}, \quad \bar u = \frac{\partial {f_j}}{\partial u}\, .

Thus, with each run of the AD code, we obtain the derivatives of one function with respect to all design variables and all state variables. One run is required for each element of ff. The reverse mode is advantageous if nf<nxn_f < n_x, which is what we assumed at the start of this section.
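The following JAX sketch illustrates this seeding on an invented output function (the function, dimensions, and values are placeholders); each call to the pullback with a unit seed returns one row of the two partial-derivative matrices:

```python
import jax
import jax.numpy as jnp

# Hypothetical explicit output section f(x, u) with nf = 2 outputs.
def f(x, u):
    return jnp.stack([jnp.sum(x * u[:3]),   # f_1
                      u @ u])               # f_2

x = jnp.array([1.0, 2.0, 3.0])              # design variables (nx = 3)
u = jnp.array([0.5, -1.0, 2.0, 4.0])        # converged states (nu = 4)

fval, f_vjp = jax.vjp(f, x, u)              # reverse-mode AD of the f-section only
fbar = jnp.array([1.0, 0.0])                # seed selecting f_1
xbar, ubar = f_vjp(fbar)                    # xbar = df_1/dx, ubar = df_1/du (one row each)
```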

The Jacobian r/u\partial r / \partial u can also be computed using AD. Because r/u\partial r / \partial u is typically sparse, the techniques covered in Section 6.8 significantly increase the efficiency of computing this matrix. This is a square matrix, so neither AD mode has an advantage over the other if we explicitly compute and store the whole matrix.

However, reverse-mode AD is advantageous when using an iterative method to solve the adjoint linear system (Equation 6.46). When using an iterative method, we do not form r/u\partial r / \partial u. Instead, we require products of the transpose of this matrix with some vector vv,[19]

ruv.{\frac{\partial r}{\partial u}}^\intercal v \, .

The elements of vv act as weights on the residuals and can be interpreted as a projection onto the direction of vv. Suppose we have the reverse AD code for the residual computation, as shown in Figure 6.32.


Figure 6.32:Applying reverse AD to the code that computes rr produces code that computes the partial derivatives of rr with respect to xx and uu.

This code requires a reverse seed rˉ\bar r, which determines the weights we want on each residual. Typically, a seed would have only one nonzero entry to find partial derivatives (e.g., setting rˉ=[1,0,,0]\bar r=[1,0,\ldots,0] would yield the first row of the Jacobian, uˉr1/u\bar u \equiv \partial r_1 / \partial u). However, to get the product in Equation 6.50, we require the seed to be weighted as rˉ=v\bar r = v. Then, we can compute the product by running the reverse AD code once to obtain uˉ[r/u]v\bar u \equiv [\partial r / \partial u]^\intercal v.

The final term needed to compute total derivatives with the adjoint method is the last term in Equation 6.47, which can be written as

ψrx=(rxψ).\psi^\intercal \frac{\partial r}{\partial x}= \left( {\frac{\partial r}{\partial x}}^\intercal \psi \right)^\intercal .

This is yet another transpose vector product that can be obtained using the same reverse AD code for the residuals, except that now the residual seed is rˉ=ψ\bar r = \psi, and the product we want is given by xˉ\bar x.

In sum, it is advantageous to use reverse-mode AD to compute the partial derivative terms for the adjoint equations, especially if the adjoint equations are solved using an iterative approach that requires only matrix-vector products. Similar techniques and arguments apply for the direct method, except that in that case, forward-mode AD is advantageous for computing the partial derivatives.
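The sketch below shows both seeding patterns on an invented two-state residual (the residual function, state, and adjoint values are placeholders): seeding with a vector v gives the transpose Jacobian-vector product needed by an iterative adjoint solver, and seeding with the adjoint vector gives the final term of the total derivative.

```python
import jax
import jax.numpy as jnp

# Invented residual function r(x, u) = 0; only the seeding pattern matters here.
A = jnp.array([[4.0, 1.0],
               [1.0, 3.0]])

def r(x, u):
    return A @ u + 0.1 * u**3 - x

x = jnp.array([1.0, 2.0])
u = jnp.array([0.2, 0.5])            # pretend this is the converged state

_, r_vjp = jax.vjp(r, x, u)          # reverse-mode AD of the residual code

def dru_transpose_times(v):
    """Matrix-free product (dr/du)^T v, obtained by setting rbar = v."""
    _, ubar = r_vjp(v)
    return ubar

# A Krylov solver for the adjoint system only ever needs dru_transpose_times.
# After the adjoint solve, the same AD code provides the last term of Equation 6.47:
psi = jnp.array([0.3, -0.7])         # placeholder adjoint vector
xbar, _ = r_vjp(psi)                 # xbar = (dr/dx)^T psi
```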

6.8 Sparse Jacobians and Graph Coloring

In this chapter, we have discussed various ways to compute a Jacobian of a model. If the Jacobian has many zero elements, it is said to be sparse. In many cases, we can take advantage of that sparsity to significantly reduce the computational time required to construct the Jacobian.

When applying a forward approach (forward-mode AD, finite differencing, or complex step), the cost of computing the Jacobian scales with nxn_x. Each forward pass re-evaluates the model to compute one column of the Jacobian. For example, when using finite differencing, nxn_x evaluations would be required. To compute the jjth column of the Jacobian, the input vector would be

[x1,  x2,  ,  xj+h,  ,  xnx].[x_1,\; x_2,\; \ldots,\; x_j + h,\; \ldots,\; x_{n_x}] \, .

We can significantly reduce the cost of computing the Jacobian depending on its sparsity pattern. As a simple example, consider a square diagonal Jacobian:

dfdx[J1100000J2200000J3300000J4400000J55].\frac{df}{dx}\equiv \begin{bmatrix} J_{11} & 0 & 0 & 0 & 0 \\ 0 & J_{22} & 0 & 0 & 0 \\ 0 & 0 & J_{33} & 0 & 0 \\ 0 & 0 & 0 & J_{44} & 0 \\ 0 & 0 & 0 & 0 & J_{55} \end{bmatrix} \, .

For this scenario, the Jacobian can be constructed with one evaluation rather than nxn_x evaluations. This is because a given output fif_i depends on only one input xix_i. We could think of the outputs as nxn_x independent functions. Thus, for finite differencing, rather than requiring nxn_x input vectors with nxn_x function evaluations, we can use one input vector, as follows:

[x1+h,  x2+h,  ,  x5+h],[x_1 + h,\; x_2 + h,\; \ldots,\; x_5 + h] \, ,

allowing us to compute all the nonzero entries in one pass.[20]
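The sketch below shows this one-pass evaluation; the elementwise model is made up solely so that each output depends on one input:

```python
import numpy as np

def f(x):
    # Elementwise model: output i depends only on input i, so df/dx is diagonal.
    return x**2 + np.sin(x)

x = np.linspace(1.0, 2.0, 5)
h = 1e-6
slope = (f(x + h) - f(x)) / h   # all inputs perturbed at once: one extra evaluation
J = np.diag(slope)              # scatter into the known sparsity pattern
```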

Although the diagonal case is easy to understand, it is a special situation. To generalize this concept, let us consider the following (5×6)(5 \times 6) matrix as an example:

[J1100J140J1600J23J2400J31J3200000000J45000J530J55J56].\begin{bmatrix} J_{11} & 0 & 0 & J_{14} & 0 & J_{16} \\ 0 & 0 & J_{23} & J_{24} & 0 & 0 \\ J_{31} & J_{32} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & J_{45} & 0 \\ 0 & 0 & J_{53} & 0 & J_{55} & J_{56} \end{bmatrix} \, .

A subset of columns is said to be structurally orthogonal if no row has a nonzero entry in more than one of those columns. In this example, the following sets of columns are structurally orthogonal: (1, 3), (1, 5), (2, 3), (2, 4, 5), (2, 6), and (4, 5). Structurally orthogonal columns can be combined, forming a smaller Jacobian that reduces the number of forward passes required. This reduced Jacobian is referred to as compressed. There is more than one way to compress this Jacobian, but in this case, the minimum number of compressed columns—referred to as colors—is three. In the following compressed Jacobian, we combine columns 1 and 3 (blue); columns 2, 4, and 5 (red); and leave column 6 on its own (black):

[J1100J140J1600J23J2400J31J3200000000J45000J530J55J56][J11J14J16J23J240J31J3200J450J53J55J56].\begin{bmatrix} \color{primary} J_{11} & \color{secondary} 0 & \color{primary} 0 & \color{secondary} J_{14} & \color{secondary} 0 & J_{16} \\ \color{primary} 0 & \color{secondary} 0 & \color{primary} J_{23} & \color{secondary} J_{24} & \color{secondary} 0 & 0 \\ \color{primary} J_{31} &\color{secondary} J_{32} & \color{primary} 0 & \color{secondary} 0 & \color{secondary} 0 & 0 \\ \color{primary} 0 & \color{secondary} 0 & \color{primary} 0 & \color{secondary} 0 & \color{secondary} J_{45} & 0 \\ \color{primary} 0 & \color{secondary} 0 & \color{primary} J_{53} & \color{secondary} 0 & \color{secondary} J_{55} & J_{56} \end{bmatrix} \Rightarrow \begin{bmatrix} \color{primary} J_{11} & \color{secondary} J_{14} & J_{16} \\ \color{primary} J_{23} & \color{secondary} J_{24} & 0 \\ \color{primary} J_{31} & \color{secondary} J_{32} & 0 \\ \color{primary} 0 & \color{secondary} J_{45} & 0 \\ \color{primary} J_{53} & \color{secondary} J_{55} & J_{56} \end{bmatrix} \, .

For finite differencing, complex step, and forward-mode AD, only compression among columns is possible. Reverse-mode AD allows compression among the rows. The concept is the same, except that we look for structurally orthogonal rows. One such compression is as follows:

[J1100J140J1600J23J2400J31J3200000000J45000J530J55J56][J1100J14J45J1600J23J2400J31J32J530J55J56].\begin{bmatrix} \color{primary} J_{11} & \color{primary} 0 & \color{primary} 0 & \color{primary} J_{14} & \color{primary} 0 & \color{primary} J_{16} \\ 0 & 0 & J_{23} & J_{24} & 0 & 0 \\ \color{secondary} J_{31} & \color{secondary} J_{32} & \color{secondary} 0 & \color{secondary} 0 & \color{secondary} 0 & \color{secondary} 0 \\ \color{primary} 0 & \color{primary} 0 & \color{primary} 0 & \color{primary} 0 & \color{primary} J_{45} & \color{primary} 0 \\ \color{secondary} 0 & \color{secondary} 0 & \color{secondary} J_{53} & \color{secondary} 0 & \color{secondary} J_{55} & \color{secondary} J_{56} \end{bmatrix} \Rightarrow \begin{bmatrix} \color{primary} J_{11} & \color{primary} 0 & \color{primary} 0 & \color{primary} J_{14} & \color{primary} J_{45} & \color{primary} J_{16} \\ 0 & 0 & J_{23} & J_{24} & 0 & 0 \\ \color{secondary} J_{31} & \color{secondary} J_{32} & \color{secondary} J_{53}& \color{secondary} 0 & \color{secondary} J_{55} & \color{secondary} J_{56} \\ \end{bmatrix} \, .

AD can also be used even more flexibly when both modes are used: forward passes to evaluate groups of structurally orthogonal columns and reverse passes to evaluate groups of structurally orthogonal rows. Rather than taking incremental steps in each direction as is done in finite differencing, we set the AD seed vector with 1s in the directions we wish to evaluate, similar to how the seed is set for directional derivatives, as discussed in Section 6.6.
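The sketch below applies this seeding to the (5 × 6) example with forward-mode AD in JAX. The model is invented so that its Jacobian has exactly the sparsity pattern shown above; each seed combines the structurally orthogonal columns of one color, so three forward passes recover all the nonzeros:

```python
import jax
import jax.numpy as jnp

# Invented model whose Jacobian has the 5 x 6 sparsity pattern from the example.
def f(x):
    return jnp.stack([
        x[0] * x[3] + x[5],      # row 1: nonzeros in columns 1, 4, 6
        jnp.sin(x[2]) * x[3],    # row 2: columns 3, 4
        x[0] + x[1]**2,          # row 3: columns 1, 2
        jnp.exp(x[4]),           # row 4: column 5
        x[2] * x[4] + x[5],      # row 5: columns 3, 5, 6
    ])

x = jnp.arange(1.0, 7.0)

# One forward pass per color; each seed has 1s in one group of structurally
# orthogonal columns (columns 1 and 3; columns 2, 4, and 5; column 6).
seeds = [
    jnp.array([1.0, 0.0, 1.0, 0.0, 0.0, 0.0]),
    jnp.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0]),
    jnp.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0]),
]
compressed = jnp.stack([jax.jvp(f, (x,), (s,))[1] for s in seeds], axis=1)
print(compressed)   # (5 x 3) compressed Jacobian; scatter back using the known pattern
```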

For these small Jacobians, it is straightforward to determine how to compress the matrix in the best possible way. For a large matrix, this is not so easy. One approach is to use graph coloring. This approach starts by building a graph where the vertices represent the row and column indices, and the edges represent nonzero entries in the Jacobian. Then, heuristic algorithms are applied to this graph to estimate the minimum number of “colors” (groups of structurally orthogonal columns). Graph coloring is a large field of research, where derivative computation is one of many applications.[21]
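As a rough illustration only (not one of the specialized algorithms from the graph-coloring literature), a greedy coloring of the column-intersection graph, applied to the sparsity pattern of the example above, happens to recover the same three-color grouping:

```python
import numpy as np

def greedy_column_coloring(sparsity):
    """Greedy coloring of the column-intersection graph of a sparsity pattern.

    Two columns conflict if any row has a nonzero in both; columns with the
    same color are structurally orthogonal and can share one forward pass.
    """
    nf, nx = sparsity.shape
    conflicts = (sparsity.T.astype(int) @ sparsity.astype(int)) > 0   # (nx, nx)
    colors = -np.ones(nx, dtype=int)
    for j in range(nx):
        used = {colors[k] for k in range(nx) if conflicts[j, k] and colors[k] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[j] = c
    return colors

pattern = np.array([
    [1, 0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 1],
], dtype=bool)
print(greedy_column_coloring(pattern))   # [0 1 0 1 1 2]: columns {1, 3}, {2, 4, 5}, {6}
```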

6.9 Unified Derivatives Equation

Now that we have introduced all the methods for computing derivatives, we will see how they are connected. For example, we have mentioned that the direct and adjoint methods are analogous to the forward and reverse mode of AD, respectively, but we did not show this mathematically. The unified derivatives equation (UDE) expresses both methods.25 Also, the implicit analytic methods from Section 6.7 assumed one set of implicit equations (r=0r=0) and one set of explicit functions (ff). The UDE formulates the derivative computation for systems with mixed sets of implicit and explicit equations.

We first derive the UDE from basic principles and give an intuitive explanation of the derivative terms. Then, we show how we can use the UDE to handle implicit and explicit equations. We also show how the UDE can retrieve the direct and adjoint equations. Finally, we show how the UDE is connected to AD.

6.9.1 UDE Derivation


Figure 6.35:Solution of a system of two equations expressed by residuals.

Suppose we have a set of nn residual equations with the same number of unknowns,

ri(u1,u2,,un)=0,i=1,,n,r_i(u_1, u_2, \ldots, u_n) = 0, \quad i = 1, \ldots, n \, ,

and that there is at least one solution uu^* such that r(u)=0r(u^*)=0. Such a solution can be visualized for n=2n=2, as shown in Figure 6.35.

These residuals are general: each one can depend on any subset of the variables uu and can be truly implicit functions or explicit functions converted to the implicit form (see Section 3.3 and Example 3.3). The total differentials for these residuals are

dri=riu1du1++riundun,i=1,,n.\mathrm{d} r_i = \frac{\partial r_i}{\partial {u_1}}\mathrm{d} u_1 + \ldots + \frac{\partial r_i}{\partial {u_n}}\mathrm{d} u_n , \quad i = 1, \ldots, n \, .

These represent first-order changes in rr due to perturbations in uu. The differentials of uu can be visualized as perturbations in the space of the variables. The differentials of rr can be visualized as linear changes to the surface defined by r=0r=0, as illustrated in Figure 6.36.


Figure 6.36:The differential dr\mathrm{d} r can be visualized as a linearized (first-order) change of the contour value.

We can write the differentials (Equation 6.61) in matrix form as

[r1u1r1unrnu1rnun][du1dun]=[dr1drn].\begin{bmatrix} \frac{\partial r_1}{\partial {u_1}}& \cdots & \frac{\partial r_1}{\partial {u_n}}\\ \vdots & \ddots & \vdots \\ \frac{\partial r_n}{\partial {u_1}}& \cdots & \frac{\partial r_n}{\partial {u_n}}\end{bmatrix} \begin{bmatrix} \mathrm{d} u_1 \\ \vdots \\ \mathrm{d} u_n \end{bmatrix} = \begin{bmatrix} \mathrm{d} r_1 \\ \vdots \\ \mathrm{d} r_n \end{bmatrix} \, .

The partial derivatives in the matrix are derivatives of the expressions for rr with respect to uu that can be obtained symbolically, and they are in general functions of uu. The vector of differentials du\mathrm{d} u represents perturbations in uu that can be solved for a given vector of changes dr\mathrm{d} r.

Now suppose that we are at a solution uu^*, such that r(u)=0r(u^*)=0. All the partial derivatives (r/u\partial r/\partial u) can be evaluated at uu^*. When all entries in dr\mathrm{d} r are zero, then the solution of this linear system yields du=0\mathrm{d} u = 0. This is because if there is no disruption in the residuals that are already zero, the variables do not need to change either.


Figure 6.37:The total derivatives du1/dr1\mathrm{d} u_1 / \mathrm{d} r_1 and du2/dr1\mathrm{d} u_2 / \mathrm{d} r_1 represent the first-order changes needed to satisfy a perturbation r1=dr1r_1 = \mathrm{d} r_1 while keeping r2=0r_2=0.

How is this linear system useful? With these differentials, we can choose different combinations of dr\mathrm{d} r to obtain any total derivatives that we want. For example, we can get the total derivatives of uu with respect to a single residual rir_i by keeping dri\mathrm{d} r_i while setting all the other differentials to zero (drji=0\mathrm{d} r_{j \ne i} = 0). The visual interpretation of this total derivative is shown in Figure 6.37 for n=2n=2 and i=1i=1. Setting dr=[0,,0,dri,0,,0]\mathrm{d} r = [0,\ldots,0, \mathrm{d} r_i, 0, \ldots, 0] in Equation 6.62 and moving dri\mathrm{d} r_i to the denominator, we obtain the following linear system:[22]

[r1u1r1uir1unriu1riuiriunrnu1rnuirnun][du1driduidridundri]=[010].\begin{bmatrix} \frac{\partial r_1}{\partial {u_1}}& \cdots & \frac{\partial r_1}{\partial {u_i}}& \cdots & \frac{\partial r_1}{\partial {u_n}}\\ \vdots & \ddots & \vdots & & \vdots \\ \frac{\partial r_i}{\partial {u_1}}& \cdots & \frac{\partial r_i}{\partial {u_i}}& \cdots & \frac{\partial r_i}{\partial {u_n}}\\ \vdots & & \vdots & \ddots & \vdots \\ \frac{\partial r_n}{\partial {u_1}}& \cdots & \frac{\partial r_n}{\partial {u_i}}& \cdots & \frac{\partial r_n}{\partial {u_n}}\end{bmatrix} \begin{bmatrix} \frac{du_1}{d{r_i}}\\ \vdots \\ \frac{du_i}{d{r_i}}\\ \vdots \\ \frac{du_n}{d{r_i}}\end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} \, .

Doing the same for all i=1,,ni = 1, \ldots, n, we get the following nn linear systems:

[r1u1r1unrnu1rnun][du1dr1du1drndundr1dundrn]=[1001].\begin{bmatrix} \frac{\partial r_1}{\partial {u_1}}& \cdots & \frac{\partial r_1}{\partial {u_n}}\\ \vdots & \ddots & \vdots \\ \frac{\partial r_n}{\partial {u_1}}& \cdots & \frac{\partial r_n}{\partial {u_n}}\end{bmatrix} \begin{bmatrix} \frac{du_1}{d{r_1}}& \cdots & \frac{du_1}{d{r_n}}\\ \vdots & \ddots & \vdots \\ \frac{du_n}{d{r_1}}& \cdots & \frac{du_n}{d{r_n}}\end{bmatrix} = \begin{bmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{bmatrix} \, .

Solving these linear systems yields the total derivatives of all the elements of uu with respect to all the elements of rr. We can write this more compactly in matrix form as

rududr=I. \frac{\partial r}{\partial {u}}\frac{du}{d{r}}= I \, .

This is the forward form of the UDE.

The total derivatives du/dr\mathrm{d} u / \mathrm{d} r might not seem like the derivatives in which we are interested. Based on the implicit analytic methods derived in Section 6.7.2, these look like derivatives of states with respect to residuals, not the derivatives that we ultimately want to compute (df/dx\mathrm{d} f / \mathrm{d} x). However, we will soon see that with the appropriate choice of rr and uu, we can obtain a linear system that solves for the total derivatives we want.

With Equation 6.65, we can solve one column at a time. Similar to AD, we can also solve for the rows instead by transposing the systems as [23]

rududr=I, \left. \frac{\partial r}{\partial {u}}\right.^\intercal \left.\frac{du}{d{r}}\right.^\intercal = I \, ,

which is the reverse form of the UDE.


Figure 6.38:The total derivatives du1/dr1\mathrm{d} u_1 / \mathrm{d} r_1 and du1/dr2\mathrm{d} u_1 / \mathrm{d} r_2 represent the first-order change in u1u_1 resulting from perturbations dr1\mathrm{d} r_1 and dr2\mathrm{d} r_2.

Now, each column jj yields duj/dr\mathrm{d} u_j / \mathrm{d} r—the total derivative of one variable with respect to all the residuals. This total derivative is interpreted visually in Figure 6.38.

The usefulness of the total derivative Jacobian du/dr\mathrm{d} u / \mathrm{d} r might still not be apparent. In the next section, we explain how to set up the UDE to include df/dx\mathrm{d} f / \mathrm{d} x in the UDE unknowns (du/dr\mathrm{d} u / \mathrm{d} r).
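The following numerical check on an invented two-equation system (all values here are arbitrary) illustrates what du/dr means: solving the forward UDE gives the same result as perturbing one residual and re-solving the system, as in Figure 6.37.

```python
import numpy as np

# Invented residual system with n = 2 unknowns.
def r(u):
    return np.array([u[0]**2 + u[1] - 3.0,
                     u[0] + u[1]**3 - 2.0])

def dr_du(u):
    return np.array([[2.0 * u[0], 1.0],
                     [1.0, 3.0 * u[1]**2]])

def solve(target, u0):
    """Newton iteration for r(u) = target."""
    u = u0.copy()
    for _ in range(20):
        u -= np.linalg.solve(dr_du(u), r(u) - target)
    return u

u_star = solve(np.zeros(2), np.array([1.0, 1.0]))

# Forward UDE: (dr/du)(du/dr) = I, so du/dr is the inverse of the partial Jacobian.
du_dr = np.linalg.solve(dr_du(u_star), np.eye(2))

# Compare with re-solving after a small perturbation dr = [eps, 0].
eps = 1e-6
u_pert = solve(np.array([eps, 0.0]), u_star)
print((u_pert - u_star) / eps)   # approximately the first column of du/dr
print(du_dr[:, 0])
```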

6.9.2 UDE for Mixed Implicit and Explicit Components

In the previous section, the UDE was derived based on residual equations. The equations were written in implicit form, but there was no assumption about whether the underlying equations were implicit or explicit. Because we can write an explicit equation in implicit form (see Section 3.3 and Example 3.3), the UDE allows a mix of implicit and explicit sets of equations, which we now call components.

To derive the implicit analytic equations in Section 6.7, we considered two components: an implicit component that determines uu by solving r(u;x)=0r(u;x)=0 and an explicit component that computes the functions of interest, f(x,u)f(x,u).

We can recover the implicit analytic differentiation equations (direct and adjoint) from the UDE by defining a set of variables that concatenates the state variables with inputs and outputs as follows:

u^[xuf].\hat u \equiv \begin{bmatrix} x \\ u \\ f \end{bmatrix} \, .

This is a vector with (nx+nu+nf)\left( n_x + n_u + n_f \right) variables. For the residuals, we need a vector with the same size. We can obtain this by realizing that the residuals associated with the inputs and outputs are just explicit functions that can be written in implicit form. Then, we have

r^[xxˇrrˇ(x,u)ffˇ(x,u)]=[000].\hat r \equiv \begin{bmatrix} x-\check{x} \\ r-\check{r}(x,u) \\ f-\check{f}(x,u) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \, .

Here, we distinguish xx (the actual variable in the UDE system) from xˇ\check{x} (a given input) and ff (the variable) from fˇ\check{f} (an explicit function of xx and uu). Similarly, rr is the vector of variables associated with the residual and rˇ\check{r} is the residual function itself.

Taking the differential of the residuals, and considering only one of them to be nonzero at a time, we obtain,

dr^[dxdrdf].\mathrm{d} \hat r \equiv \begin{bmatrix} \mathrm{d} x \\ \mathrm{d} r \\ \mathrm{d} f \end{bmatrix} \, .

Using these variable and residual definitions in Equation 6.65 and Equation 6.66 yields the full UDE shown in Figure 6.40, where the block we ultimately want to compute is df/dx\mathrm{d} f / \mathrm{d} x.


Figure 6.40:The direct and adjoint methods can be recovered from the UDE.

To compute df/dx\mathrm{d} f / \mathrm{d} x using the forward UDE (left-hand side of the equation in Figure 6.40), we can ignore all but three blocks in the total derivatives matrix: II, du/dx\mathrm{d} u / \mathrm{d} x, and df/dx\mathrm{d} f / \mathrm{d} x. By multiplying these blocks and using the definition ϕdu/dx\phi \equiv - \mathrm{d} u / \mathrm{d} x, we recover the direct linear system (Equation 6.43) and the total derivative equation (Equation 6.44).

To compute df/dx\mathrm{d} f / \mathrm{d} x using the reverse UDE (right-hand side of the equation in Figure 6.40), we can ignore all but three blocks in the total derivatives matrix: II, df/dr\mathrm{d} f/ \mathrm{d} r, and df/dx\mathrm{d} f / \mathrm{d} x. By multiplying these blocks and defining ψdf/dr\psi \equiv - \mathrm{d} f / \mathrm{d} r, we recover the adjoint linear system (Equation 6.46) and the corresponding total derivative equation (Equation 6.47).

The definition of ψ\psi here is significant because, as mentioned in Section 6.7.2, the adjoint vector is the total derivative of the objective function with respect to the governing equation residuals.

By defining one implicit component (associated with uu) and two explicit components (associated with xx and ff), we have retrieved the direct and adjoint methods from the UDE. In general, we can define an arbitrary number of components, so the UDE provides a mathematical framework for computing the derivatives of coupled systems. Furthermore, each component can be implicit or explicit, so the UDE can handle an arbitrary mix of components. All we need to do is to include the desired states in the UDE augmented variables vector (Equation 6.69) and the corresponding residuals in the UDE residuals vector (Equation 6.70). We address coupled systems in Section 13.3.3 and use the UDE in Section 13.2.6, where we extend it to coupled systems with a hierarchy of components.

6.9.3 Recovering AD

Now we will see how we can recover AD from the UDE. First, we define the UDE variables associated with each operation or line of code (assuming all loops have been unrolled), such that uvu \equiv v and

vi=vˇi(v1,,vi1),i=1,,n.v_i = \check v_i(v_1,\dots,v_{i-1}), \quad i = 1, \ldots, n \, .

Recall from Section 6.6.1 that each variable is an explicit function of the previous ones.

To define the appropriate residuals, we use the same technique from before to convert an explicit function into implicit form by moving all the terms to the left-hand side to obtain

ri=vivˇi(v1,,vi1).r_i = v_i - \check v_i(v_1,\dots,v_{i-1}) \, .

The distinction between vv and vˇ\check v is that vv represents variables that are considered independent in the UDE, whereas vˇ\check v represents the explicit expressions. Of course, the values for these become equal once the system is solved. Similar to the differentials in Equation 6.71, we have drdv\mathrm{d} r \equiv \mathrm{d} v.

Taking the partial derivatives of the residuals (Equation 6.73) with respect to vv (Equation 6.72), and replacing the total derivatives in the forward form of the UDE (Equation 6.65) with the new symbols yields

[100vˇ2v110vˇnv1vˇnvn11][dv1dv100dv2dv1dv2dv20dvndv1dvndvn1dvndvn]=I.\begin{bmatrix} 1 & 0 & \cdots & 0 \\ -\frac{\partial \check v_2}{\partial {v_1}} & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ -\frac{\partial \check v_n}{\partial {v_1}} & \cdots & -\frac{\partial \check v_n}{\partial {v_{n-1}}} & 1 \end{bmatrix} \begin{bmatrix} \frac{dv_1}{d{v_1}} & 0 & \cdots & 0 \\ \frac{dv_2}{d{v_1}} & \frac{dv_2}{d{v_2}} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ \frac{dv_n}{d{v_1}} & \cdots & \frac{dv_n}{d{v_{n-1}}} & \frac{d{v_n}}{d{v_n}} \end{bmatrix} = I \, .

This equation is the matrix form of the AD forward chain rule (Equation 6.21), where each column of the total derivative matrix corresponds to the tangent vector (v˙\dot v) for the chosen input variable. As observed in Figure 6.16, the partial derivatives form a lower triangular matrix. The Jacobian we ultimately want to compute (df/dx\mathrm{d} f / \mathrm{d} x) is composed of a subset of derivatives in the bottom-left corner near the dvn/dv1\mathrm{d} {v_n} / \mathrm{d} {v_1} term. To compute these derivatives, we need to perform forward substitution and compute one column of the total derivative matrix at a time, where each column is associated with the inputs of interest.

Similarly, the reverse form of the UDE (Equation 6.66) yields the transpose of Equation 6.74,

[1vˇ2v1vˇnv101vˇnvn1001][dv1dv1dv2dv1dvndv10dv2dv2dvn1dvn1dvndvn100dvndvn]=I.\begin{bmatrix} 1 & -\frac{\partial \check v_2}{\partial {v_1}} & \cdots & -\frac{\partial \check v_n}{\partial {v_1}} \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & -\frac{\partial \check v_n}{\partial {v_{n-1}}} \\ 0 & \cdots & 0 & 1 \end{bmatrix} \begin{bmatrix} \frac{dv_1}{d{v_1}} & \frac{dv_2}{d{v_1}} & \cdots & \frac{dv_n}{d{v_1}} \\ 0 & \frac{dv_2}{d{v_2}} & \ddots & \vdots \\ \vdots & \ddots & \frac{dv_{n-1}}{d{v_{n-1}}} & \frac{dv_n}{d{v_{n-1}}} \\ 0 & \cdots & 0 & \frac{dv_{n}}{d{v_{n}}} \end{bmatrix} = I \, .

This is equivalent to the AD reverse chain rule (Equation 6.26), where each column of the total derivative matrix corresponds to the gradient vector (vˉ\bar v) for the chosen output variable. The partial derivatives now form an upper triangular matrix, as previously shown in Figure 6.21. The derivatives of interest are now near the top-right corner of the total derivative matrix near the dvn/dv1\mathrm{d} {v_n} / \mathrm{d} {v _1} term. To compute these derivatives, we need to perform back substitutions, computing one column of the matrix at a time.
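For a three-line unrolled "code" (v1 = x, v2 = sin(v1), v3 = v1 v2, with f = v3), the sketch below builds the unit lower-triangular partials of Equation 6.74 explicitly and recovers df/dx = dv3/dv1 from both the forward and the reverse forms; np.linalg.solve stands in for the forward and back substitutions.

```python
import numpy as np

# Unrolled code: v1 = x, v2 = sin(v1), v3 = v1 * v2, with f = v3.
x = 1.3
v1 = x
v2 = np.sin(v1)
v3 = v1 * v2

# Partials of r_i = v_i - vcheck_i(v_1, ..., v_{i-1}) with respect to v.
L = np.array([[1.0,          0.0, 0.0],
              [-np.cos(v1),  1.0, 0.0],
              [-v2,          -v1, 1.0]])

dv_dv = np.linalg.solve(L, np.eye(3))        # forward form: one column per input
dv_dv_rev = np.linalg.solve(L.T, np.eye(3))  # reverse form: one column per output

print(dv_dv[2, 0], dv_dv_rev[0, 2])          # both equal dv3/dv1 = df/dx
print(np.sin(v1) + v1 * np.cos(v1))          # analytic check
```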

6.10 Summary

Derivatives are useful in many applications beyond optimization. This chapter introduced the methods available to compute the first derivatives of the outputs of a model with respect to its inputs. In optimization, the outputs are usually the objective function and the constraint functions, and the inputs are the design variables. The typical characteristics of the available methods are compared in Table 4.

Table 4:Characteristics of the various derivative computation methods. Some of these characteristics are problem or implementation dependent, so these are not universal.

Method              Accuracy   Scalability   Ease of implementation   Implicit functions
Symbolic            •                        Hard
Finite difference                            Easy                     •
Complex step        •                        Intermediate             •
AD                  •          •             Intermediate             •
Implicit analytic   •          •             Hard                     •

Symbolic differentiation is accurate but only scalable for simple, explicit functions of low dimensionality. Therefore, it is necessary to compute derivatives numerically. Although it is generally intractable or inefficient for many engineering models, symbolic differentiation is used by AD at each line of code and in implicit analytic methods to derive expressions for computing the required partial derivatives.

Finite-difference approximations are popular because they are easy to implement and can be applied to any model, including black-box models. The downsides are that these approximations are not accurate, and the cost scales linearly with the number of variables. Many of the issues practitioners experience with gradient-based optimization can be traced to errors in the gradients when algorithms automatically compute these gradients using finite differences.

The complex-step method is accurate and relatively easy to implement. It usually requires some changes to the analysis source code, but this process can be scripted. The main advantage of the complex-step method is that it produces analytically accurate derivatives. However, like the finite-difference method, the cost scales linearly with the number of inputs, and each simulation requires more effort because of the complex arithmetic.
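As a reminder of why the method is accurate, the whole computation is one line once the function accepts complex arguments; the test function and step size below are arbitrary.

```python
import cmath

def f(x):
    return cmath.exp(x) * cmath.sin(x)   # any function written with complex-safe operations

x, h = 1.5, 1e-30
dfdx = f(complex(x, h)).imag / h         # no subtractive cancellation, so h can be tiny
```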

AD produces analytically accurate derivatives, and many implementations can be fully automated. The implementation requires access to the source code but is still relatively straightforward to apply. The computational cost of forward-mode AD scales with the number of inputs, and the reverse mode scales with the number of outputs. The scaling factor for the forward mode is generally lower than that for finite differences. The cost of reverse-mode AD is independent of the number of design variables.


Figure 6.43:Efficient gradient computation with an analytic method improves the scalability of gradient-based algorithms compared to finite differencing. In this case, we show the results for the nn-dimensional Rosenbrock, where the cost of computing the derivatives analytically is independent of nn.

Implicit analytic methods (direct and adjoint) are accurate and scalable but require significant implementation effort. These methods are exact (depending on how the partial derivatives are obtained), and like AD, they provide both forward and reverse modes with the same scaling advantages. Gradient-based optimization using the adjoint method is a powerful combination that scales well with the number of variables, as shown in Figure 6.43. The disadvantage is that because implicit methods are intrusive, they require considerable development effort.

A hybrid approach where the partial derivatives for the implicit analytic equations are computed with AD is generally recommended. This hybrid approach is computationally more efficient than AD while reducing the implementation effort of implicit analytic methods and ensuring accuracy.

The UDE encapsulates all the derivative computation methods in a single linear system. Using the UDE, we can formulate the derivative computation for an arbitrary system of mixed explicit and implicit components. This will be used in Section 13.2.6 to develop a mathematical framework for solving coupled systems and computing the corresponding derivatives.

Problems

Footnotes
  1. Here, ff is the difference between the eccentric and mean anomalies, xx is the mean anomaly, and the eccentricity is set to 1. For more details, see Exercise 6 and .

  2. This method originated with the work of 1 who developed formulas that use complex arithmetic for computing the derivatives of real functions of arbitrary order with arbitrary order truncation error, much like the Taylor series combination approach in finite differences. Later, 2 observed that the simplest of these formulas was convenient for computing first derivatives.

  3. This approximation can also be derived from one of the Cauchy–Riemann equations, which are fundamental in complex analysis and express complex differentiability.3

  4. For more details on the problematic functions and how to implement the complex-step method in various programming languages, see 3. A summary, implementation guide, and scripts are available at: http://bit.ly/complexstep

  5. One of the main techniques for reducing the memory usage of reverse AD is checkpointing; see Chapter 12 in 6

  6. A stack, also known as last in, first out (LIFO), is a data structure that stores a one-dimensional array. We can only add an element to the top of the stack (push) and take the element from the top of the stack (pop).

  7. The overloading of “+” computes (v, \dot{v}) = (x + f, \dot{x} + \dot{f}), and then the overloading of “\sin” computes (f, \dot{f}) = (\sin(v), \cos(v)\dot{v}).

  8. See Sec. 5.4 in 6 for more details on reverse mode using operating overloading.

  9. 17 compare the continuous and discrete adjoint approaches in more detail.

  10. This chain rule can be derived by writing the total differential of ff as

    df=fxdx+fudu\mathrm{d} f = \frac{\partial f}{\partial x} \mathrm{d} x + \frac{\partial f}{\partial u} \mathrm{d} u

    and then “dividing” it by dx\mathrm{d} x. See Section A.2 for more background on differentials.

  11. The adjoint vector can also be interpreted as a Lagrange multiplier vector associated with equality constraints r=0r=0. Defining the Lagrangian L(x,u)=f+ψr\mathcal{L}(x,u) = f + \psi^\intercal r and differentiating it with respect to xx, we get

    Lx=fx+ψrx.\frac{\partial {\mathcal{L}}}{\partial x}= \frac{\partial f}{\partial x}+ \psi^\intercal \frac{\partial r}{\partial x}.

    Thus, the total derivatives df/dx\mathrm{d} f / \mathrm{d} x are the derivatives of this Lagrangian.

  12. One widespread application of the adjoint method has been in aerodynamic and hydrodynamic shape optimization.18

  13. The displacements do change with the areas but only through the solution of the governing equations, which are not considered when taking partial derivatives.

  14. This is not true for large displacements, but we assume small displacements.

  15. Although ultimately, the areas do change the stresses, they do so only through changes in the displacements.

  16. Usually, the stiffness matrix is symmetric, and K=KK^\intercal = K. This means that the solver for displacements can be repurposed for adjoint computation by setting the right-hand side shown here instead of the loads. For that reason, this right-hand side is sometimes called a pseudo-load.

  17. 19 provide recommendations on constraint aggregation for structural optimization.

  18. 20 provide more details on this approach and its applications.

  19. See Iterative Methods for more details on iterative solvers.

  20. 21 were the first to show that the number of function evaluations could be reduced for sparse Jacobians.

  21. 22 provide a review of graph coloring in the context of computing derivatives. 23 show how to use graph coloring to compute total coupled derivatives.

  22. As explained in Section A.2, we take the liberty of treating differentials algebraically and skip a more rigorous and lengthy proof.

  23. Normally, for two matrices AA and BB, (AB)=BA(AB)^\intercal=B^\intercal A^\intercal, but in this case,
    AB=I \Rightarrow B=A^{-1} \Rightarrow B^\intercal=A^{-\intercal} \Rightarrow A^\intercal B^\intercal = I \, .

References
  1. Lyness, J. N., & Moler, C. B. (1967). Numerical differentiation of analytic functions. SIAM Journal on Numerical Analysis, 4(2), 202–210. 10.1137/0704019
  2. Squire, W., & Trapp, G. (1998). Using complex variables to estimate derivatives of real functions. SIAM Review, 40(1), 110–112. 10.1137/S003614459631241X
  3. Martins, J. R. R. A., Sturdza, P., & Alonso, J. J. (2003). The complex-step derivative approximation. ACM Transactions on Mathematical Software, 29(3), 245–262. 10.1145/838250.838251
  4. Lantoine, G., Russell, R. P., & Dargent, T. (2012). Using multicomplex variables for automatic computation of high-order derivatives. ACM Transactions on Mathematical Software, 38(3), 1–21. 10.1145/2168773.2168774
  5. Fike, J. A., & Alonso, J. J. (2012). Automatic differentiation through the use of hyper-dual numbers for second derivatives. In S. Forth, P. Hovland, E. Phipps, J. Utke, & A. Walther (Eds.), Recent Advances in Algorithmic Differentiation (pp. 163–173). Springer. 10.1007/978-3-642-30023-3_15
  6. Griewank, A. (2000). Evaluating Derivatives. SIAM. 10.1137/1.9780898717761
  7. Naumann, U. (2011). The Art of Differentiating Computer Programs—An Introduction to Algorithmic Differentiation. SIAM. https://books.google.ca/books/about/The_Art_of_Differentiating_Computer_Prog.html?id=OgQuUR4nLu0C
  8. Utke, J., Naumann, U., Fagan, M., Tallent, N., Strout, M., Heimbach, P., Hill, C., & Wunsch, C. (2008). OpenAD/F: A modular open-source tool for automatic differentiation of Fortran codes. ACM Transactions on Mathematical Software, 34(4). 10.1145/1377596.1377598
  9. Hascoet, L., & Pascual, V. (2013). The Tapenade automatic differentiation tool: Principles, model, and specification. ACM Transactions on Mathematical Software, 39(3), 20:1-20:43. 10.1145/2450153.2450158
  10. Griewank, A., Juedes, D., & Utke, J. (1996). Algorithm 755: ADOL-C: A package for the automatic differentiation of algorithms written in C/C++. ACM Transactions on Mathematical Software, 22(2), 131–167. 10.1145/229473.229474
  11. Wiltschko, A. B., van Merriënboer, B., & Moldovan, D. (2017). Tangent: automatic differentiation using source code transformation in Python. https://arxiv.org/abs/1711.02712
  12. Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., & Zhang, Q. (2018). JAX: Composable Transformations of Python+NumPy Programs (0.2.5). http://github.com/google/jax
  13. Revels, J., Lubin, M., & Papamarkou, T. (2016). Forward-mode automatic differentiation in Julia. In ArXiv e-prints. https://arxiv.org/abs/1607.07892
  14. Neidinger, R. D. (2010). Introduction to automatic differentiation and MATLAB object-oriented programming. SIAM Review, 52(3), 545–563. 10.1137/080743627
  15. Betancourt, M. (2018). A geometric theory of higher-order automatic differentiation. https://arxiv.org/abs/1812.11592