
6 Computing Derivatives

The gradient-based optimization methods introduced in Chapter 4 and Chapter 5 require the derivatives of the objective and constraints with respect to the design variables, as illustrated in Figure 6.1. Derivatives also play a central role in other numerical algorithms. For example, the Newton-based methods introduced in Section 3.8 require the derivatives of the residuals.

Figure 6.1: Efficient derivative computation is crucial for the overall efficiency of gradient-based optimization.

The accuracy and computational cost of the derivatives are critical for the success of these methods. Gradient-based methods are only efficient when the derivative computation is also efficient. The computation of derivatives can be the bottleneck in the overall optimization procedure, especially when the model solver needs to be called repeatedly. This chapter introduces the various methods for computing derivatives and discusses the relative advantages of each method.

6.1 Derivatives, Gradients, and Jacobians

The derivatives we focus on are first-order derivatives of one or more functions of interest (f) with respect to a vector of variables (x). In the engineering optimization literature, the term sensitivity analysis is often used to refer to the computation of derivatives, and derivatives are sometimes referred to as sensitivity derivatives or design sensitivities. Although these terms are not incorrect, we prefer to use the more specific and concise term derivative.

For the sake of generality, we do not specify which functions we want to differentiate in this chapter (which could be an objective, constraints, residuals, or any other function). Instead, we refer to the functions being differentiated as the functions of interest and represent them as a vector-valued function, f = [f_1, f_2, \ldots, f_{n_f}]. Neither do we specify the variables with respect to which we differentiate (which could be design variables, state variables, or any other independent variable).

The derivatives of all the functions of interest with respect to all the variables form a Jacobian matrix,

J_f = \frac{\partial {f}}{\partial {x}}= \begin{bmatrix} \nabla {f_1}^\intercal \\ \vdots \\ \nabla {f_{n_f}}^\intercal \end{bmatrix} = \underbrace{ \begin{bmatrix} \frac{\partial f_1}{\partial {x_1}}& \cdots & \frac{\partial f_1}{\partial {x_{n_x}}}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_{n_f}}{\partial {x_1}}& \cdots & \frac{\partial f_{n_f}}{\partial {x_{n_x}}}\\ \end{bmatrix}}_{(n_f \times n_x)} \, ,

which is an (n_f \times n_x) rectangular matrix where each row corresponds to the gradient of each function with respect to all the variables.

Row i of the Jacobian is the gradient of function f_i. Each column in the Jacobian is called the tangent with respect to a given variable x_j. The Jacobian can be related to the \nabla operator as follows:

J_f = f \nabla^\intercal = \begin{bmatrix} f_1 \\ \vdots \\ f_{n_f} \end{bmatrix} \begin{bmatrix} \frac{\partial }{\partial {x_1}}\ldots \frac{\partial }{\partial {x_{n_x}}}\end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial {x_1}}& \cdots & \frac{\partial f_1}{\partial {x_{n_x}}}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_{n_f}}{\partial {x_1}}& \cdots & \frac{\partial f_{n_f}}{\partial {x_{n_x}}}\\ \end{bmatrix} \, .

6.2 Overview of Methods for Computing Derivatives

We can classify the methods for computing derivatives according to the representation used for the numerical model. There are three possible representations, as shown in Figure 6.2. At one extreme, we know nothing about the model and consider it a black box where we can only control the inputs and observe the outputs (Figure 6.2, left). In this chapter, we often refer to x as the input variables and f as the output variables. When this is the case, we can only compute derivatives using finite differences (Section 6.4).

At the other extreme, we have access to the source code used to compute the functions of interest and can perform the differentiation line by line (Figure 6.2, right). This is the essence of the algorithmic differentiation approach (Section 6.6). The complex-step method (Section 6.5) is related to algorithmic differentiation, as explained in Section 6.6.5.

In the intermediate case, we consider the model residuals and states (Figure 6.2, middle), which are the quantities required to derive and implement implicit analytic methods (Section 6.7). When the model can be represented with multiple components, we can use a coupled derivative approach (Section 13.3) where any of these derivative computation methods can be used for each component.

Figure 6.2: Derivative computation methods can consider three different levels of information: function values (a), model states (b), and lines of code (c).

6.3 Symbolic Differentiation

Symbolic differentiation is well known and widely used in calculus, but it is of limited use in the numerical optimization of most engineering models. Except for the most straightforward cases (e.g., Example 6.1), many engineering models involve a large number of operations, utilize loops and various conditional logic, are implicitly defined, or involve iterative solvers (see Chapter 3). Although the mathematical expressions within these iterative procedures are explicit, it is challenging, or even impossible, to use symbolic differentiation to obtain closed-form mathematical expressions for the derivatives of the procedure. Even when it is possible, these expressions are almost always computationally inefficient.

Symbolic differentiation is still valuable for obtaining derivatives of simple explicit components within a larger model. Furthermore, algorithmic differentiation (discussed in a later section) relies on symbolic differentiation to differentiate each line of code in the model.

6.4 Finite Differences

Because of their simplicity, finite-difference methods are a popular approach to computing derivatives. They are versatile, requiring nothing more than function values. Finite differences are the only viable option when we are dealing with black-box functions because they do not require any knowledge about how the function is evaluated. Most gradient-based optimization algorithms perform finite differences by default when the user does not provide the required gradients. However, finite differences are neither accurate nor efficient.

6.4.1 Finite-Difference Formulas

Finite-difference approximations are derived by combining Taylor series expansions. It is possible to obtain finite-difference formulas that estimate an arbitrary-order derivative with any order of truncation error by using the right combinations of these expansions. The simplest finite-difference formula can be derived directly from a Taylor series expansion in the j\text{th} direction,

f(x + h \hat e_j) = f(x) + h \frac{\partial f}{\partial {x_j}}+ \frac{h^2}{2!} \frac{\partial^{2} f}{\partial {x_j}^{2}} + \frac{h^3}{3!} \frac{\partial^{3} f}{\partial {x_j}^{3}} + \ldots \, ,

where \hat e_j is the unit vector in the j\text{th} direction. Solving this for the first derivative, we obtain the finite-difference formula,

\frac{\partial f}{\partial {x_j}}= \frac{f(x + h \hat e_j)-f(x)}{h} + \mathcal{O}(h) \, ,

where h is a small scalar called the finite-difference step size.

Figure 6.3: Exact derivative compared with a forward finite-difference approximation (Equation 6.4).

This approximation is called the forward difference and is directly related to the definition of a derivative because

\frac{\partial f}{\partial {x_j}}= \lim_{h \rightarrow 0} \frac{f(x + h \hat e_j) - f(x)}{h} \approx \frac{f(x + h \hat e_j) - f(x)}{h} \, .

The truncation error is {\cal O}(h), and therefore this is a first-order approximation. The difference between this approximation and the exact derivative is illustrated in Figure 6.3.

The backward-difference approximation can be obtained by replacing h with -h to yield

\frac{\partial f}{\partial {x_j}}= \frac{f(x)-f(x - h \hat e_j)}{h} + {\cal O}(h)\, ,

which is also a first-order approximation.

Assuming each function evaluation yields the full vector f, the previous formulas compute the j\text{th} column of the Jacobian in Equation 6.1. To compute the full Jacobian, we need to loop through each direction \hat e_j, add a step, recompute f, and compute a finite difference. Hence, the cost of computing the complete Jacobian is proportional to the number of input variables of interest, n_x.

For a second-order estimate of the first derivative, we can use the expansion of f(x - h \hat e_j) to obtain

Figure 6.4: Exact derivative compared with a central finite-difference approximation (Equation 6.8).

f(x - h \hat e_j) = f(x) - h \frac{\partial f}{\partial {x_j}}+ \frac{h^2}{2!} \frac{\partial^{2} f}{\partial {x_j}^{2}} - \frac{h^3}{3!} \frac{\partial^{3} f}{\partial {x_j}^{3}} + \ldots .

Then, if we subtract this from the expansion in Equation 6.3 and solve the resulting equation for the derivative of f, we get the central-difference formula,

\frac{\partial f}{\partial {x_j}}= \frac{f(x + h \hat e_j)-f(x - h \hat e_j)}{2h} + {\cal O}(h^2).

The stencil of points for this formula is shown in Figure 6.4, where we can see that this estimate is closer to the actual derivative than the forward difference.

Even more accurate estimates can be derived by combining different Taylor series expansions to obtain higher-order truncation error terms. This technique is widely used in finite-difference methods for solving differential equations, where higher-order estimates are desirable.

However, finite-precision arithmetic eventually limits the achievable accuracy for our purposes (as discussed in the next section). With double-precision arithmetic, there are not enough significant digits to realize a significant advantage beyond central difference.

We can also estimate second derivatives (or higher) by combining Taylor series expansions. For example, adding the expansions for f(x+h) and f(x-h) cancels out the first-derivative and third-derivative terms, yielding the second-order approximation to the second derivative,

\frac{\partial^{2} f}{\partial {x_j}^{2}} = \frac{f(x+2 h \hat e_j)-2f(x)+f(x- 2 h \hat e_j)}{4h^2} + {\cal O}\left( h^2 \right) \, .

The finite-difference method can also be used to compute directional derivatives, which are the scalar projection of the gradient onto a given direction. To do this, instead of stepping in orthogonal directions to get the gradient, we need to step in the direction of interest, p, as shown in Figure 6.5. Using the forward difference, for example,

\nabla_p {f} = \frac{f(x + h p)-f(x)}{h} + {\cal O}(h) \, .

One application of directional derivatives is to compute the slope in line searches (Section 4.3).
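As a brief illustration, the following sketch (an assumed example, not from the book) applies the forward-difference formula above along a direction p, for instance to estimate the slope at the start of a line search:

```python
import numpy as np

def directional_derivative(f, x, p, h=1e-6):
    """Forward-difference estimate of the directional derivative of f along p."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    return (f(x + h * p) - f(x)) / h

# Example: f(x) = x1^2 + 3 x2, slope along p = [1, 0] at x = [1, 2] is about 2
slope = directional_derivative(lambda x: x[0]**2 + 3.0 * x[1], [1.0, 2.0], [1.0, 0.0])
```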

Figure 6.5: Computing a directional derivative using a forward finite difference.

6.4.2 The Step-Size Dilemma

When estimating derivatives using finite-difference formulas, we are faced with the step-size dilemma. Because each estimate has a truncation error of \mathcal{O}(h) (or \mathcal{O}(h^2) when second order), we would like to choose as small a step size as possible to reduce this error. However, as the step size decreases, subtractive cancellation (a roundoff error introduced in Section 3.5.1) becomes dominant. Given the opposing trends of these errors, there is an optimal step size for which the sum of the two errors is at a minimum.

Theoretically, the optimal step size for the forward finite difference is approximately \sqrt{\varepsilon_f}, where \varepsilon_f is the precision of f. The error bound is also about \sqrt{\varepsilon_f}. For the central difference, the optimal step size scales approximately with \varepsilon_f^{1/3}, with an error bound of \varepsilon_f^{2/3}. These step-size and error-bound estimates are only approximate and assume well-scaled problems.
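The following short sketch (an assumed example, not from the book) illustrates the step-size dilemma on f(x) = e^x at x = 1: the error first decreases with h because of the truncation error and then increases again as subtractive cancellation takes over.

```python
import numpy as np

f, x, exact = np.exp, 1.0, np.exp(1.0)
for h in np.logspace(-1, -15, 15):
    forward = (f(x + h) - f(x)) / h            # first-order forward difference
    central = (f(x + h) - f(x - h)) / (2 * h)  # second-order central difference
    print(f"h = {h:.0e}   forward error = {abs(forward - exact):.1e}"
          f"   central error = {abs(central - exact):.1e}")
```

For this well-scaled function, the forward-difference error bottoms out near h of about \sqrt{\varepsilon_f}, and the central-difference error near \varepsilon_f^{1/3}, consistent with the estimates above.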

Figure 6.8: Finite-differencing noisy functions can either smooth the derivative estimates or result in estimates with the wrong trends.

Finite-difference approximations are sometimes used with larger steps than would be desirable from an accuracy standpoint to help smooth out numerical noise or discontinuities in the model. This approach sometimes works, but it is better to address these problems within the model whenever possible. Figure 6.8 shows an example of this effect. For this noisy function, the larger step ignores the noise and gives the correct trend, whereas the smaller step results in an estimate with the wrong sign.

6.4.3 Practical Implementation

Algorithm 6.1 details a procedure for computing a Jacobian using forward finite differences. It is usually helpful to scale the step size based on the value of x_j, unless x_j is too small. Therefore, we combine the relative and absolute quantities to obtain the following step size:

\Delta x_j = h \left( 1 + | x_j | \right) \, .

This is similar to the expression for the convergence criterion in Equation 4.24. Although the absolute step size usually differs for each x_j, the relative step size h is often the same and is user-specified.
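A minimal sketch of this procedure is shown below (an assumed implementation in the spirit of Algorithm 6.1, not the book's code), using the scaled step size from the preceding equation:

```python
import numpy as np

def forward_difference_jacobian(f, x, h=1e-6):
    """Approximate J[i, j] = df_i/dx_j with forward differences."""
    x = np.asarray(x, dtype=float)
    f0 = np.atleast_1d(f(x))                    # reference function values
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        dx = h * (1.0 + abs(x[j]))              # relative step scaled by |x_j|
        xp = x.copy()
        xp[j] += dx
        J[:, j] = (np.atleast_1d(f(xp)) - f0) / dx   # one column per input
    return J

# Example usage: f(x) = [x1 x2, sin(x1)]
J = forward_difference_jacobian(lambda x: np.array([x[0] * x[1], np.sin(x[0])]), [1.0, 2.0])
```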

6.5 Complex Step

The complex-step derivative approximation, strangely enough, computes derivatives of real functions using complex variables. Unlike finite differences, the complex-step method requires access to the source code and cannot be applied to black-box components. The complex-step method is accurate but no more efficient than finite differences because the computational cost still scales linearly with the number of variables.

6.5.1 Theory

The complex-step method can also be derived using a Taylor series expansion. Rather than using a real step h, as we did to derive the finite-difference formulas, we use a pure imaginary step, ih.[2] If f is a real function of real variables and is also analytic (differentiable in the complex domain), we can expand it in a Taylor series about a real point x as follows:

f(x + ih \hat e_j) = f(x) + i h \frac{\partial f}{\partial {x_j}}- \frac{h^2}{2} \frac{\partial^{2} f}{\partial {x_j}^{2}} - i \frac{h^3}{6} \frac{\partial^{3} f}{\partial {x_j}^{3}} + \ldots \, .

Taking the imaginary parts of both sides of this equation, we have

\Im \left(f(x + ih \hat e_j)\right) = h \frac{\partial f}{\partial {x_j}}- \frac{h^3}{6} \frac{\partial^{3} f}{\partial {x_j}^{3}} + \ldots \, .

Dividing this by h and solving for \partial f / \partial x_j yields the complex-step derivative approximation,[3]

\frac{\partial f}{\partial {x_j}}= \frac{ \Im \left( f(x + ih \hat e_j) \right) }{h} + \mathcal{O}(h^2) \, ,

which is a second-order approximation. To use this approximation, we must provide a complex number with a perturbation in the imaginary part, compute the original function using complex arithmetic, and take the imaginary part of the output to obtain the derivative.

In practical terms, this means that we must convert the function evaluation to take complex numbers as inputs and compute complex outputs. Because we have assumed that f(x) is a real function of a real variable in the derivation of Equation 6.14, the procedure described here does not work for models that already involve complex arithmetic.

In Section 6.5.2, we explain how to convert programs to handle the required complex arithmetic for the complex-step method to work in general. The complex-step method has been extended to compute exact second derivatives as well.45

Unlike finite differences, this formula has no subtraction operation and thus no subtractive cancellation error. The only source of numerical error is the truncation error. However, the truncation error can be eliminated if h is decreased to a small enough value (say, 10^{-200}). Then, the precision of the complex-step derivative approximation (Equation 6.14) matches the precision of f. This is a tremendous advantage over the finite-difference approximations (Equation 6.4 and Equation 6.8).

Like the finite-difference approach, each evaluation yields a column of the Jacobian (\partial f / \partial x_j), and the cost of computing all the derivatives is proportional to the number of design variables. The cost of the complex-step method is comparable to that of a central difference because we compute a real and an imaginary part for every number in our code.

If we take the real part of the Taylor series expansion (Equation 6.12), we obtain the value of the function on the real axis,

f(x) = \text{Re} \left( f(x + ih \hat e_j) \right) + \mathcal{O}(h^2) \, .

Similar to the derivative approximation, we can make the truncation error disappear by using a small enough h. This means that no separate evaluation of f(x) is required to get the original real value of the function; we can simply take the real part of the complex evaluation.

What is a “small enough h”? When working with finite-precision arithmetic, the error can be eliminated entirely by choosing an h so small that all h^2 terms become zero because of underflow (i.e., h^2 is smaller than the smallest representable number, which is approximately 10^{-324} when using double precision; see Section 3.5.1). Eliminating these squared terms does not affect the accuracy of the derivative carried in the imaginary part because the squared terms only appear in the error terms of the complex-step approximation.

At the same time, h must be large enough that the imaginary part (h \cdot \partial f / \partial x) does not underflow.

Suppose that \mu is the smallest representable number. Then, the two requirements result in the following bounds:

\mu \left| \frac{\partial f}{\partial x} \right|^{-1} < h < \sqrt{\mu} \, .

A step size of 10^{-200} works well for double-precision functions.

6.5.2 Complex-Step Implementation

We can use the complex-step method even when the evaluation of f involves the solution of numerical models through computer programs.3 The outer loop for computing the derivatives of multiple functions with respect to all variables (Algorithm 6.2) is similar to the one for finite differences. A reference function evaluation is not required, but now the function must handle complex numbers correctly.
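A minimal sketch of such an outer loop is shown below (an assumed example in the spirit of Algorithm 6.2, not the book's code); it assumes that f is written with complex-safe operations:

```python
import numpy as np

def complex_step_jacobian(f, x, h=1e-200):
    """Approximate J[i, j] = df_i/dx_j with the complex-step method."""
    x = np.asarray(x, dtype=complex)
    columns = []
    for j in range(x.size):
        xp = x.copy()
        xp[j] += 1j * h                                      # pure imaginary perturbation
        columns.append(np.imag(np.atleast_1d(f(xp))) / h)    # no subtraction, no cancellation
    return np.column_stack(columns)

# Example usage: f(x) = [exp(x1)/sqrt(x1 + x2), x1*x2]
J = complex_step_jacobian(
    lambda x: np.array([np.exp(x[0]) / np.sqrt(x[0] + x[1]), x[0] * x[1]]),
    [1.0, 2.0],
)
```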

The complex-step method can be applied to any model, but modifications might be required. We need the source code for the model to make sure that the program can handle complex numbers and the associated arithmetic, that it handles logical operators consistently, and that certain functions yield the correct derivatives.

First, the program may need to be modified to use complex numbers. In programming languages like Fortran or C, this involves changing real-valued type declarations (e.g., double) to complex type declarations (e.g., double complex). In some languages, such as MATLAB, Python, and Julia, this is unnecessary because functions are overloaded to automatically accept either type.

Second, some changes may be required to preserve the correct logical flow through the program. Relational logic operators (e.g., “greater than” and “less than”) are usually not defined for complex numbers. These operators are often used in programs, together with conditional statements (if, else), to redirect the execution thread. The original algorithm and its “complexified” version should follow the same execution thread. Therefore, defining these operators to compare only the real parts of the arguments is the correct approach. Functions that choose one argument, such as the maximum or the minimum values, are based on relational operators. Following the previous argument, we should determine the maximum and minimum values based on the real parts alone.

Figure 6.10: The usual definition of a complex absolute value returns a real number (the length of the vector), which is not compatible with the complex-step method.

Third, some functions need to be redefined for complex arguments. The most common function that needs to be redefined is the absolute value function. For a complex number, z = x + iy, the absolute value is defined as

| z | = \sqrt{x^2 + y^2} \, ,

as shown in Figure 6.10. This definition is not complex analytic, which is required in the derivation of the complex-step derivative approximation.

Figure 6.11: The absolute value function needs to be redefined such that the imaginary part yields the correct derivatives.

As shown in Figure 6.11, the correct derivatives for the real absolute value function are +1 and -1, depending on whether x is greater than or less than zero. The following complex definition of the absolute value yields the correct derivatives:

\text{abs}(x+iy) = \begin{cases} - x - iy, & \text{if} \quad x < 0 \\ + x + iy, & \text{if} \quad x \ge 0 \, . \end{cases}

Setting the imaginary part to y = h and dividing by h yields the slope of the absolute value function. There is an exception at x = 0, where the function is not analytic, but the derivative does not exist there in any case. We use “greater than or equal” in the logic so that the approximation yields the correct right-sided derivative at that point.
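For example, in a language with operator overloading, this redefinition can be a small function that branches on the real part (a minimal sketch, assuming Python with a plain complex step):

```python
def cs_abs(z):
    """Complex-step-safe absolute value: the imaginary part carries the slope."""
    return -z if z.real < 0 else z

h = 1e-200
print(cs_abs(-3.0 + 1j * h).imag / h)   # -1.0, the slope for x < 0
print(cs_abs( 3.0 + 1j * h).imag / h)   # +1.0, the slope for x >= 0
```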

Depending on the programming language, we may need to redefine some trigonometric functions. This is because some default implementations, although correct, do not maintain accurate derivatives for small complex-step sizes.

We must replace these with mathematically equivalent implementations that avoid numerical issues.

Fortunately, we can automate most of these changes by using scripts to process the codes, and in most programming languages, we can easily redefine functions using operator overloading.[4]

6.6 Algorithmic Differentiation

Algorithmic differentiation (AD)—also known as computational differentiation or automatic differentiation—is a well-known approach based on the systematic application of the chain rule to computer programs.67 The derivatives computed with AD can match the precision of the function evaluation.

The cost of computing derivatives with AD can be proportional to either the number of variables or the number of functions, depending on the type of AD, making it flexible.

Another attractive feature of AD is that its implementation is largely automatic, thanks to various AD tools. To explain AD, we start by outlining the basic theory with simple examples. Then we explore how the method is implemented in practice with further examples.

6.6.1 Variables and Functions as Lines of Code

The basic concept of AD is as follows. Even long, complicated codes consist of a sequence of basic operations (e.g., addition, multiplication, cosine, exponentiation). Each operation can be differentiated symbolically with respect to the variables in the expression.

AD performs this symbolic differentiation and adds the code that computes the derivatives for each variable in the code. The derivatives of each variable accumulate in what amounts to a numerical version of the chain rule.

Figure 6.13: AD considers all the variables in the code, where the inputs x are among the first variables, and the outputs f are among the last.

The fundamental building blocks of a code are unary and binary operations. These operations can be combined to obtain more elaborate explicit functions, which are typically expressed in one line of computer code. We represent the variables in the computer code as the sequence v = [v_1, \ldots, v_i, \ldots, v_n], where n is the total number of variables assigned in the code. One or more of these variables at the start of this sequence are given and correspond to x, and one or more of the variables toward the end of the sequence are the outputs of interest, f, as illustrated in Figure 6.13.

In general, a variable assignment corresponding to a line of code can involve any other variable, including itself, through an explicit function,

v_i = v_i(v_1, v_2,\ldots,v_i,\ldots,v_n) \, .

Except for the most straightforward codes, many of the variables in the code are overwritten as a result of iterative loops.

Figure 6.14: Unrolling of loops is a useful mental model to understand the derivative propagation in the AD of general code.

To understand AD, it is helpful to imagine a version of the code where all the loops are unrolled. Instead of overwriting variables, we create new versions of those variables, as illustrated in Figure 6.14. Then, we can represent the computations in the code in a sequence with no loops such that each variable in this larger set only depends on previous variables, and then

v_i = v_i(v_1, v_2,\ldots,v_{i-1}) \, .

Given such a sequence of operations and the derivatives for each operation, we can apply the chain rule to obtain the derivatives for the entire sequence. Unrolling the loops is just a mental model for understanding how the chain rule operates, and it is not explicitly done in practice.

The chain rule can be applied in two ways. In the forward mode, we choose one input variable and work forward toward the outputs until we get the desired total derivative. In the reverse mode, we choose one output variable and work backward toward the inputs until we get the desired total derivative.

6.6.2 Forward-Mode AD

The chain rule for the forward mode can be written as

\frac{dv_i}{d{v_j}}= \sum_{k=j}^{i-1} \frac{\partial v_i}{\partial {v_k}}\frac{dv_k}{d{v_j}}\, ,

where each partial derivative is obtained by symbolically differentiating the explicit expression for v_i. The total derivatives are the derivatives with respect to the chosen input v_j, which can be computed using this chain rule.

Using the forward mode, we evaluate a sequence of these expressions by fixing j in Equation 6.21 (effectively choosing one input v_j) and incrementing i to get the derivative of each variable v_i. We only need to sum up to i-1 because of the form of Equation 6.20, where each v_i only depends on variables that precede it.

For a more convenient notation, we define a new variable that represents the total derivative of variable i with respect to a fixed input j as \dot v_i \equiv \mathrm{d} v_i / \mathrm{d} v_j and rewrite the chain rule as

\dot v_i = \sum_{k=j}^{i-1} \frac{\partial v_i}{\partial {v_k}}\dot v_k \, .

The chosen input j corresponds to the seed, which we set to \dot v_j = 1 (from the definition of \dot v_j, we see that this means setting \mathrm{d} v_j/\mathrm{d} v_j = 1).

Figure 6.15: The forward mode propagates derivatives to all the variables that depend on the seeded input variable.

This chain rule then propagates the total derivatives forward, as shown in Figure 6.15, affecting all the variables that depend on the seeded variable.

Once we are done applying the chain rule (Equation 6.22) for the chosen input variable v_j, we end up with the total derivatives \mathrm{d} v_i / \mathrm{d} v_j for all i > j. The sum in the chain rule (Equation 6.22) only needs to consider the nonzero partial derivative terms. If a variable k does not explicitly appear in the expression for v_i, then \partial v_i / \partial v_k = 0, and there is no need to consider the corresponding term in the sum. In practice, this means that only a small number of terms is considered for each sum.

Suppose we have four variables v_1, v_2, v_3, and v_4, with x \equiv v_1 and f \equiv v_4, and we want \mathrm{d} f / \mathrm{d} x. We assume that each variable depends explicitly on all the previous ones. Using the chain rule (Equation 6.22), we set j = 1 (because we want the derivative with respect to x \equiv v_1) and increment i to get the sequence of derivatives:

\begin{aligned} {\dot v_1} &= 1 \\ {\dot v_2} &= \frac{\partial v_2}{\partial v_1} {\dot v_1}\\ {\dot v_3} &= \frac{\partial v_3}{\partial v_1} {\dot v_1} + \frac{\partial v_3}{\partial v_2} {\dot v_2} \\ \dot v_4 &= \frac{\partial v_4}{\partial v_1} {\dot v_1} + \frac{\partial v_4}{\partial v_2} {\dot v_2} + \frac{\partial v_4}{\partial v_3} {\dot v_3} \equiv \frac{df}{dx}\,. \end{aligned}

In each step, we just need to compute the partial derivatives of the current operation v_i and then multiply them by the total derivatives \dot v that have already been computed. We move forward by evaluating the partial derivatives of v in the same sequence used to evaluate the original function.

This is convenient because all of the unknowns are partial derivatives, meaning that we only need to compute derivatives based on the operation at hand (or line of code).

In this abstract example with four variables that depend on each other sequentially, the Jacobian of the variables with respect to themselves is as follows:

J_v = \begin{bmatrix} \textcolor{#4E79A7}{1} & 0 & 0 & 0 \\ \textcolor{#4E79A7}{\frac{dv_2}{d{v_1}}} & 1 & 0 & 0 \\ \textcolor{#4E79A7}{\frac{dv_3}{d{v_1}}} & \frac{dv_3}{d{v_2}}& 1 & 0 \\ \textcolor{#4E79A7}{\frac{dv_4}{d{v_1}}} & \frac{dv_4}{d{v_2}}& \frac{dv_4}{d{v_3}}& 1 \end{bmatrix} \, .

By setting the seed \dot v_1 = 1 and using the forward chain rule (Equation 6.22), we have computed the first column of J_v from top to bottom. This column corresponds to the tangent with respect to v_1. Using forward-mode AD, obtaining derivatives for other outputs is free (e.g., \mathrm{d} v_3 / \mathrm{d} v_1 \equiv \dot v_3 in Equation 6.23).

However, if we want the derivatives with respect to additional inputs, we would need to set a different seed and evaluate an entire set of similar calculations. For example, if we wanted \mathrm{d} v_4 / \mathrm{d} v_2, we would set the seed as \dot v_2 = 1 and evaluate the equations for \dot v_3 and \dot v_4, where we would now have \mathrm{d} v_4 / \mathrm{d} v_2 = \dot v_4. This would correspond to computing the second column in J_v (Equation 6.24).

Thus, the cost of the forward mode scales linearly with the number of inputs we are interested in and is independent of the number of outputs.

So far, we have assumed that we are computing derivatives with respect to each component of x. However, just like for finite differences, we can also compute directional derivatives using forward-mode AD. We do so by setting the appropriate seed in the \dot v’s that correspond to the inputs in a vectorized manner. Suppose we have x \equiv [v_1, \ldots, v_{n_x}]. To compute the derivative with respect to x_j, we would set the seed as the unit vector \dot v = \hat e_j and follow a similar process for the other elements. To compute a directional derivative in direction p, we would set the seed as \dot v = p / \|p\|.

6.6.3 Reverse-Mode AD

The reverse mode is also based on the chain rule but uses the alternative form:

\frac{dv_i}{d{v_j}}= \sum_{k=j+1}^{i} \frac{\partial v_k}{\partial v_j} \frac{dv_i}{d{v_k}}\, ,

where the summation happens in reverse (starts at i and decrements to j+1). This is less intuitive than the forward chain rule, but it is equally valid. Here, we fix the index i corresponding to the output of interest and decrement j until we get the desired derivative.

Similar to the forward-mode total derivative notation (Equation 6.22), we define a more convenient notation for the variables that carry the total derivatives with a fixed i as \bar v_j \equiv \mathrm{d} v_i / \mathrm{d} v_j,

which are sometimes called adjoint variables. Then we can rewrite the chain rule as

\bar v_j = \sum_{k=j+1}^{i} \frac{\partial v_k}{\partial v_j} \bar v_k \, .

This chain rule propagates the total derivatives backward after setting the reverse seed \bar v_i = 1, as shown in Figure 6.18.

Figure 6.18: The reverse mode propagates derivatives to all the variables on which the seeded output variable depends.

This affects all the variables on which the seeded variable depends.

The reverse-mode variables \bar v represent the derivatives of one output, i, with respect to all the input variables (instead of the derivatives of all the outputs with respect to one input, j, in the forward mode). Once we are done applying the reverse chain rule (Equation 6.27) for the chosen output variable v_i, we end up with the total derivatives \mathrm{d} v_i / \mathrm{d} v_j for all j < i.

Applying the reverse mode to the same four-variable example as before, we get the following sequence of derivative computations (we set i = 4 and decrement j):

\begin{aligned} {\bar v_4} &= 1\\ {\bar v_3} &= \frac{\partial v_4}{\partial v_3} {\bar v_4}\\ {\bar v_2} &= \frac{\partial v_3}{\partial v_2} {\bar v_3} + \frac{\partial v_4}{\partial v_2} {\bar v_4}\\ \bar v_1 &= \frac{\partial v_2}{\partial v_1} {\bar v_2} + \frac{\partial v_3}{\partial v_1} {\bar v_3} + \frac{\partial v_4}{\partial v_1} {\bar v_4} \equiv \frac{df}{dx}\, . \end{aligned}

The partial derivatives of v must be computed for v_4 first, then v_3, and so on. Therefore, we have to traverse the code in reverse. In practice, not every variable depends on every other variable, so a computational graph is created during code evaluation. Then, when computing the adjoint variables, we traverse the computational graph in reverse. As before, the derivatives we need to compute in each line are only partial derivatives.

Recall the Jacobian of the variables,

J_v = \begin{bmatrix} {1} & 0 & 0 & 0 \\ {\frac{dv_2}{d{v_1}}} & 1 & 0 & 0 \\ {\frac{dv_3}{d{v_1}}} & \frac{dv_3}{d{v_2}}& 1 & 0 \\ \textcolor{#4E79A7}{\frac{dv_4}{d{v_1}}} & \textcolor{#4E79A7}{\frac{dv_4}{d{v_2}}} & \textcolor{#4E79A7}{\frac{dv_4}{d{v_3}}} & \textcolor{#4E79A7}{1} \end{bmatrix} \, .

By setting \bar v_4 = 1 and using the reverse chain rule (Equation 6.27), we have computed the last row of J_v from right to left. This row corresponds to the gradient of f \equiv v_4. Using the reverse mode of AD, obtaining derivatives with respect to additional inputs is free (e.g., \mathrm{d} v_4 / \mathrm{d} v_2 \equiv \bar v_2 in Equation 6.28).

However, if we wanted the derivatives of additional outputs, we would need to evaluate a different sequence of derivatives. For example, if we wanted \mathrm{d} v_3 / \mathrm{d} v_1, we would set \bar v_3 = 1 and evaluate the expressions for \bar v_2 and \bar v_1 in Equation 6.28, where \mathrm{d} v_3 / \mathrm{d} v_1 \equiv \bar v_1. Thus, the cost of the reverse mode scales linearly with the number of outputs and is independent of the number of inputs.

One complication with the reverse mode is that the resulting sequence of derivatives requires the values of the variables, starting with the last ones and progressing in reverse. For example, the partial derivative in the second operation of Equation 6.28 might involve v_3. Therefore, the code needs to run in a forward pass first, and all the variables must be stored for use in the reverse pass, which increases memory usage.

Unlike with forward-mode AD and finite differences, it is impossible to compute a directional derivative by setting the appropriate seeds.

Although the seeds in the forward mode are associated with the inputs, the seeds for the reverse mode are associated with the outputs. Suppose we have multiple functions of interest, f \equiv [v_{n-n_f}, \ldots, v_n]. To find the derivatives of f_1 in a vectorized operation, we would set \bar v = [1, 0, \ldots, 0].

A seed with multiple nonzero elements computes the derivatives of a weighted function with respect to all the variables, where the weight for each function is determined by the corresponding vˉ\bar v value.

6.6.4 Forward Mode or Reverse Mode?

Our goal is to compute J_f, the (n_f \times n_x) matrix of derivatives of all the functions of interest f with respect to all the input variables x. However, AD computes many other derivatives corresponding to intermediate variables. The complete Jacobian for all the intermediate variables, v_i = v_i(v_1, v_2,\ldots,v_i,\ldots,v_n), assuming that the loops have been unrolled, has the structure shown in Figure 6.22 and Figure 6.23.

The input variables x are among the first entries in v, whereas the functions of interest f are among the last entries of v.

Figure 6.22: When n_x < n_f, the forward mode is advantageous.

For simplicity, let us assume that the entries corresponding to x and f are contiguous, as previously shown in Figure 6.13. Then, the derivatives we want (J_f) are a block located on the lower left in the much larger matrix (J_v), as shown in Figure 6.22 and Figure 6.23. Although we are only interested in this block, AD requires the computation of additional intermediate derivatives.

Figure 6.23: When n_x > n_f, the reverse mode is advantageous.

The main difference between the forward and the reverse approaches is that the forward mode computes the Jacobian column by column, whereas the reverse mode does it row by row. Thus, the cost of the forward mode is proportional to n_x, whereas the cost of the reverse mode is proportional to n_f. If we have more outputs (e.g., objective and constraints) than inputs (design variables), the forward mode is more efficient, as illustrated in Figure 6.22. On the other hand, if we have many more inputs than outputs, then the reverse mode is more efficient, as illustrated in Figure 6.23. If the number of inputs is similar to the number of outputs, neither mode has a significant advantage.

In both modes, each forward or reverse pass costs less than 2–3 times the cost of running the original code in practice. However, because the reverse mode requires storing a large amount of data, memory costs also need to be considered. In principle, the required memory is proportional to the number of variables, but there are techniques that can reduce the memory usage significantly.[5]

6.6.5 AD Implementation

There are two main ways to implement AD: by source code transformation or by operator overloading. The function we used to demonstrate the issues with symbolic differentiation (Example 6.2) can be differentiated much more easily with AD. In the examples that follow, we use this function to demonstrate how the forward and reverse modes work using both source code transformation and operator overloading.

Source Code Transformation

AD tools that use source code transformation process the whole source code automatically with a parser and add lines of code that compute the derivatives. The added code is highlighted in Example 6.7 and Example 6.8.

Operator Overloading

The operator overloading approach creates a new augmented data type that stores both the variable value and its derivative. Every floating-point number v is replaced by a new type with two parts, (v, \dot v), commonly referred to as a dual number. All standard operations (e.g., addition, multiplication, sine) are overloaded such that they compute v according to the original function value and \dot v according to the derivative of that function. For example, the multiplication operation, x_1 \cdot x_2, would be defined for the dual-number data type as

(x_1, \dot x_1) \cdot (x_2, \dot x_2) = (x_1 x_2, x_1 \dot x_2 + \dot x_1 x_2),

where we compute the original function value in the first term, and the second term carries the derivative of the multiplication.

Although we wrote the two parts explicitly in Equation 6.31, the source code would only show a normal multiplication, such as v_3 = v_1 \cdot v_2. However, each of these variables would be of the new type and carry the corresponding \dot v quantities. By overloading all the required operations, the computations happen “behind the scenes”, and the source code does not have to be changed, except to declare all the variables to be of the new type and to set the seed. Example 6.9 lists the original code from Example 6.2 with notes on the actual computations that are performed as a result of overloading.
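To make the idea concrete, the following toy dual-number class (an assumed sketch, not any particular AD tool) overloads addition and multiplication and provides a sin function so that the derivative is propagated alongside the value:

```python
import math

class Dual:
    """Dual number (v, v_dot) for forward-mode AD by operator overloading."""
    def __init__(self, v, dot=0.0):
        self.v, self.dot = v, dot

    def __add__(self, other):
        return Dual(self.v + other.v, self.dot + other.dot)

    def __mul__(self, other):
        # (v1, v1_dot) * (v2, v2_dot) = (v1 v2, v1 v2_dot + v1_dot v2)
        return Dual(self.v * other.v, self.v * other.dot + self.dot * other.v)

def sin(a):
    return Dual(math.sin(a.v), math.cos(a.v) * a.dot)

# f(x1, x2) = x1 x2 + sin(x1); seed x1_dot = 1 to get df/dx1 = x2 + cos(x1)
x1, x2 = Dual(1.5, 1.0), Dual(0.5, 0.0)
f = x1 * x2 + sin(x1)
print(f.v, f.dot)
```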

The implementation of the reverse mode using operator overloading is less straightforward and is not detailed here. It requires a new data type that stores the information from the computational graph and the variable values when running the forward pass. This information can be stored using the taping technique. After the forward evaluation using the new type, the “tape” holds the sequence of operations, which is then evaluated in reverse to propagate the reverse-mode seed.[8]
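For illustration, a toy version of this taping idea (an assumed sketch, far simpler than real AD tools) records each operation and its partial derivatives during the forward pass and then sweeps the tape in reverse to accumulate the adjoints:

```python
import math

TAPE = []  # operations recorded in evaluation order (the forward pass)

class Var:
    def __init__(self, value, parents=()):
        self.value = value      # v_i from the forward pass
        self.parents = parents  # tuples (parent, partial dv_i/dv_parent)
        self.bar = 0.0          # adjoint, accumulated in the reverse pass
        TAPE.append(self)

def mul(a, b):
    return Var(a.value * b.value, ((a, b.value), (b, a.value)))

def sin(a):
    return Var(math.sin(a.value), ((a, math.cos(a.value)),))

def reverse_sweep(output):
    output.bar = 1.0                             # reverse seed
    for node in reversed(TAPE):                  # traverse the tape in reverse
        for parent, partial in node.parents:
            parent.bar += partial * node.bar     # bar_j += (dv_k/dv_j) * bar_k

# f(x1, x2) = sin(x1 x2); exact gradient: [x2 cos(x1 x2), x1 cos(x1 x2)]
x1, x2 = Var(1.5), Var(0.5)
f = sin(mul(x1, x2))
reverse_sweep(f)
print(x1.bar, x2.bar)
```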

Connection of AD with the Complex-Step Method

The complex-step method from Section 6.5 can be interpreted as forward-mode AD using operator overloading, where the data type is the complex number (x, y) \equiv x + iy, and the imaginary part y carries the derivative. To see this connection more clearly, let us write the complex multiplication operation as

f = (x_1+iy_1) (x_2+iy_2) = \left(x_1 x_2 - y_1 y_2 \right) + i\left( y_1 x_2 + x_1 y_2 \right).

This equation is similar to the overloaded multiplication (Equation 6.31). The only difference is that the real part includes the term -y_1 y_2, which corresponds to the second-order error term in Equation 6.15. In this case, the complex part gives the exact derivative, but a second-order error might appear for other operations. As argued before, these errors vanish in finite-precision arithmetic if the complex step is small enough.

Source Code Transformation versus Operator Overloading

The source code transformation and the operator overloading approaches each have their relative advantages and disadvantages. The overloading approach is much more elegant because the original code stays practically the same and can be maintained directly. On the other hand, the source transformation approach enlarges the original code and results in less readable code, making it hard to work with. Still, it is easier to see what operations take place when debugging. Instead of maintaining source code transformed by AD, it is advisable to work with the original source and devise a workflow where the parser is rerun before compiling a new version.

One advantage of the source code transformation approach is that it tends to yield faster code and allows more straightforward compile-time optimizations. The overloading approach requires a language that supports user-defined data types and operator overloading, whereas source transformation does not. Developing a source transformation AD tool is usually more challenging than developing the overloading approach because it requires an elaborate parser that understands the source syntax.

6.6.6 AD Shortcuts for Matrix Operations

The efficiency of AD can be dramatically increased with manually implemented shortcuts. When the code involves matrix operations, manual implementation of a higher-level differentiation of those operations is more efficient than the line-by-line AD implementation. Giles (2008) documents the forward and reverse differentiation of many matrix elementary operations.

For example, suppose that we have a matrix multiplication C = AB. Then, the forward mode yields

\dot C = \dot A B + A \dot B \, .

The idea is to use \dot A and \dot B from the AD code preceding the operation and then manually implement this formula (bypassing any AD of the code that performs that operation) to obtain \dot C, as shown in Figure 6.24. Then we can use \dot C to seed the remainder of the AD code.

The reverse mode of the multiplication yields

\bar A = \bar C B^\intercal, \quad \bar B = A^\intercal \bar C \, .

Similarly, we take \bar C from the reverse AD code and implement the formula manually to compute \bar A and \bar B, which we can use in the remaining AD code in the reverse procedure.

Figure 6.24: Matrix operations, including the solution of linear systems, can be differentiated manually to bypass more costly AD code.

One particularly useful (and astounding!) result is the differentiation of the matrix inverse product. If we have a linear solver such that C = A^{-1} B, we can bypass the solver in the AD process by using the following results:

\dot C = A^{-1} \left( \dot B - \dot A C \right)

for the forward mode and

\bar B = A^{-\intercal} \bar C, \quad \bar A = -\bar B C^\intercal

for the reverse mode.
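The following sketch (an assumed example, not from the book) applies the forward-mode shortcut for the linear solve and checks it against a forward finite difference; the quantities \dot A and \dot B stand in for seeds coming from the upstream AD code:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 4.0 * np.eye(4)   # well-conditioned system matrix
B = rng.standard_normal((4, 2))
A_dot = rng.standard_normal((4, 4))                 # forward seeds from upstream code
B_dot = rng.standard_normal((4, 2))

C = np.linalg.solve(A, B)                           # C = A^{-1} B
C_dot = np.linalg.solve(A, B_dot - A_dot @ C)       # forward-mode shortcut

# Check against a forward finite difference of the perturbed solve
h = 1e-7
C_fd = (np.linalg.solve(A + h * A_dot, B + h * B_dot) - C) / h
print(np.max(np.abs(C_dot - C_fd)))                 # small, on the order of h
```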

In addition to deriving the formulas just shown, Giles (2008) derives formulas for the matrix derivatives of the inverse, determinant, norms, quadratic, polynomial, exponential, eigenvalues and eigenvectors, and singular value decomposition. Taking shortcuts as described here applies more broadly to any case where a part of the process can be differentiated manually to produce a more efficient derivative computation.

6.7 Implicit Analytic Methods—Direct and Adjoint

Direct and adjoint methods—which we refer to jointly as implicit analytic methods—linearize the model governing equations to obtain a system of linear equations whose solution yields the desired derivatives. Like the complex-step method and AD, implicit analytic methods compute derivatives with a precision matching that of the function evaluation.

The direct method is analogous to forward-mode AD, whereas the adjoint method is analogous to reverse-mode AD.

Analytic methods can be thought of as lying in between the finite-difference method and AD in terms of the number of variables involved. With finite differences, we only need to be aware of inputs and outputs, whereas AD involves every single variable assignment in the code. Analytic methods work at the model level and thus require knowledge of the governing equations and the corresponding state variables.

There are two main approaches to deriving implicit analytic methods: continuous and discrete. The continuous approach linearizes the original continuous governing equations, such as a partial differential equation (PDE), and then discretizes this linearization. The discrete approach linearizes the governing equations only after they have been discretized as a set of residual equations, r(u) = 0.

Each approach has its advantages and disadvantages. The discrete approach is preferred and is easier to generalize, so we explain this approach exclusively. One of the primary reasons the discrete approach is preferred is that the resulting derivatives are consistent with the function values because they use the same discretization. The continuous approach is only consistent in the limit of a fine discretization. The resulting inconsistencies can mislead the optimization.[9]

6.7.1 Residuals and Functions

As mentioned in Chapter 3, a discretized numerical model can be written as a system of residuals,

r(u; x) = 0 \, ,

where the semicolon denotes that the design variables x are fixed when these equations are solved for the state variables u. Through these equations, u is an implicit function of x. This relationship is represented by the box containing the solver and residual equations in Figure 6.25.

Figure 6.25: Relationship between functions and design variables for a system involving a solver. The implicit equations r(u; x) = 0 define the states u for a given x, so the functions of interest f depend explicitly and implicitly on the design variables x.

The functions of interest, f(x, u), are typically explicit functions of the state variables and the design variables. However, because u is an implicit function of x, f is ultimately an implicit function of x as well. To compute f for a given x, we must first find u such that r(u; x) = 0. This is usually the most computationally costly step and requires a solver (see Section 3.6). The residual equations could be nonlinear and involve many state variables. In PDE-based models it is common to have millions of states. Once we have solved for the state variables u, we can compute the functions of interest f. The computation of f for a given u and x is usually much cheaper because it does not require a solver. For example, in PDE-based models, computing such functions typically involves an integration of the states over a surface, or some other transformation of the states.

To compute \mathrm{d} f / \mathrm{d} x using finite differences, we would have to use the solver to find u for each perturbation of x. That means that we would have to run the solver n_x times, which would not scale well when the solution is costly. AD also requires the propagation of derivatives through the solution process. As we will see, implicit analytic methods avoid involving the potentially expensive nonlinear solution in the derivative computation.

6.7.2 Direct and Adjoint Derivative Equations

The derivatives we ultimately want to compute are the ones in the Jacobian \mathrm{d} f / \mathrm{d} x. Given the explicit and implicit dependence of f on x, we can use the chain rule to write the total derivative Jacobian of f as

\frac{df}{dx}= \frac{\partial f}{\partial x} + \frac{\partial f}{\partial u} \frac{du}{dx}\, ,

where the result is an (n_f \times n_x) matrix.[10] In this context, the total derivatives, \mathrm{d} f/\mathrm{d} x, take into account the change in u that is required to keep the residuals of the governing equations (Equation 6.37) equal to zero. The partial derivatives in Equation 6.39 represent the variation of f(x, u) with respect to changes in x or u without regard to satisfying the governing equations.

To better understand the difference between total and partial derivatives in this context, imagine computing these derivatives using finite differences with small perturbations. For the total derivatives, we would perturb x, re-solve the governing equations to obtain u, and then compute f, which would account for both dependency paths in Figure 6.25. To compute the partial derivatives \partial f/\partial x and \partial f/\partial u, however, we would perturb x or u and recompute f without re-solving the governing equations. In general, these partial derivative terms are cheap to compute numerically or can be obtained symbolically.

To find the total derivative \mathrm{d} u / \mathrm{d} x, we need to consider the governing equations. Assuming that we are at a point where r(x, u) = 0, any perturbation in x must be accompanied by a perturbation in u such that the governing equations remain satisfied. Therefore, the differential of the residuals can be written as

\mathrm{d} r= \frac{\partial r}{\partial x} \mathrm{d} x + \frac{\partial r}{\partial u} \mathrm{d} u= 0 \, .

This constraint is illustrated in Figure 6.26 in two dimensions, but keep in mind that x, u, and r are vectors in the general case.

The governing equations (Equation 6.37) map an n_x-vector x to an n_u-vector u. This mapping defines a hypersurface (also known as a manifold) in the x–u space.

The total derivative \mathrm{d} f /\mathrm{d} x that we ultimately want to compute represents the effect that a perturbation on x has on f subject to the constraint of remaining on this hypersurface, which can be achieved with the appropriate variation in u.

Figure 6.26: The governing equations determine the values of u for a given x. Given a point that satisfies the equations, the appropriate differential in u must accompany a differential of x about that point for the equations to remain satisfied.

To obtain a more useful equation, we rearrange Equation 6.40 to get the linear system

\frac{\partial r}{\partial u} \frac{du}{dx}= - \frac{\partial r}{\partial x} \, ,

where \partial r / \partial x and \mathrm{d} u / \mathrm{d} x are both (n_u \times n_x) matrices, and \partial r/\partial u is a square matrix of size (n_u \times n_u). This linear system is useful because if we provide the partial derivatives in this equation (which are cheap to compute), we can solve for the total derivatives \mathrm{d} u / \mathrm{d} x (whose computation would otherwise require re-solving r(u) = 0). Because \mathrm{d} u / \mathrm{d} x is a matrix with n_x columns, this linear system needs to be solved for each x_i with the corresponding column of the right-hand-side matrix \partial r/\partial x_i.

Now let us assume that we can invert the matrix in the linear system (Equation 6.41) and substitute the solution for \mathrm{d} u / \mathrm{d} x into the total derivative equation (Equation 6.39). Then we get

\frac{df}{dx}= \frac{\partial f}{\partial x} - \frac{\partial f}{\partial u} \left. \frac{\partial r}{\partial u} \right. ^{-1} \frac{\partial r}{\partial x} \, ,

where all the derivative terms on the right-hand side are partial derivatives. The partial derivatives in this equation can be computed using any of the methods that we have described earlier: symbolic differentiation, finite differences, complex step, or AD. Equation 6.42 shows two ways to compute the total derivatives, which we call the direct method and the adjoint method.

The direct method (already outlined earlier) consists of solving the linear system (Equation 6.41) and substituting du/dx\mathrm{d} u / \mathrm{d} x into Equation 6.39. Defining ϕdu/dx\phi \equiv - \mathrm{d} u / \mathrm{d} x, we can rewrite Equation 6.41 as

ruϕ=rx. \frac{\partial r}{\partial u} \phi = \frac{\partial r}{\partial x} \, .

After solving for ϕ\phi (one column at a time), we can use it in the total derivative equation (Equation 6.39) to obtain,

dfdx=fxfuϕ. \frac{df}{dx}= \frac{\partial f}{\partial x} - \frac{\partial f}{\partial u} \phi \, .

This is sometimes called the forward mode because it is analogous to forward-mode AD.

Solving the linear system (Equation 6.43) is typically the most computationally expensive operation in this procedure. The cost of this approach scales with the number of inputs nxn_x but is essentially independent of the number of outputs nfn_f. This is the same scaling behavior as finite differences and forward-mode AD. However, the constant of proportionality is typically much smaller in the direct method because we only need to solve the nonlinear equations r(u;x)=0r(u; x)=0 once to obtain the states.
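To make the steps concrete, the following minimal Python sketch applies Equation 6.43 and Equation 6.44 directly, assuming the four partial-derivative blocks are already available as dense NumPy arrays (the names, sizes, and random values below are hypothetical placeholders, not a real model):

```python
import numpy as np

def total_derivs_direct(dr_du, dr_dx, df_dx, df_du):
    """Direct (forward) method: df/dx from dense partial derivatives.

    Shapes: dr_du (nu, nu), dr_dx (nu, nx), df_dx (nf, nx), df_du (nf, nu).
    """
    # Solve (dr/du) phi = dr/dx; NumPy solves all nx right-hand-side columns at once.
    phi = np.linalg.solve(dr_du, dr_dx)   # phi = -du/dx, shape (nu, nx)
    return df_dx - df_du @ phi            # total derivatives, shape (nf, nx)

# Toy partial derivatives with made-up sizes, just to show the shapes involved.
rng = np.random.default_rng(0)
nu, nx, nf = 4, 3, 2
dr_du = rng.standard_normal((nu, nu)) + 4.0 * np.eye(nu)  # keep it nonsingular
dr_dx = rng.standard_normal((nu, nx))
df_dx = rng.standard_normal((nf, nx))
df_du = rng.standard_normal((nf, nu))
print(total_derivs_direct(dr_du, dr_dx, df_dx, df_du))
```

In a real PDE-based model, the residual Jacobian would typically be large and sparse, so the dense solve above would be replaced by the factorization or iterative solver already used for the governing equations.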


Figure 6.27:The total derivatives (Equation 6.42) can be computed either by solving for ϕ\phi (direct method) or by solving for ψ\psi (adjoint method).

The adjoint method changes the linear system that is solved to compute the total derivatives. Looking at Figure 6.27, we see that instead of solving the linear system with r/x{\partial r}/{\partial x} on the right-hand side, we can solve it with f/u{\partial f}/{\partial u} on the right-hand side. This corresponds to replacing the two Jacobians in the middle with a new matrix of unknowns,

ψfuru1,\psi^\intercal \equiv \frac{\partial f}{\partial u} \left. \frac{\partial r}{\partial u}\right.^{-1} \, ,

where the columns of ψ\psi are called the adjoint vectors. Multiplying both sides of Equation 6.45 by r/u\partial r / \partial u on the right and taking the transpose of the whole equation, we obtain the adjoint equation,

ruψ=fu. \left.\frac{\partial r}{\partial u}\right.^\intercal \psi = \left.\frac{\partial f}{\partial u} \right.^\intercal \, .

This linear system has no dependence on xx. Each adjoint vector is associated with a function of interest fjf_j and is found by solving the adjoint equation (Equation 6.46) with the corresponding row fj/u{\partial f_j}/{\partial u}. The solution (ψ\psi) is then used to compute the total derivative

dfdx=fxψrx. \frac{df}{dx}= \frac{\partial f}{\partial x} - \psi^\intercal \frac{\partial r}{\partial x} \, .

This is sometimes called the reverse mode because it is analogous to reverse-mode AD.
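A corresponding sketch of the adjoint version (with the same hedged assumptions: hypothetical dense arrays standing in for the partial derivatives) solves the transposed system once per output and can be checked against Equation 6.42 with an explicit inverse:

```python
import numpy as np

def total_derivs_adjoint(dr_du, dr_dx, df_dx, df_du):
    """Adjoint (reverse) method: df/dx from dense partial derivatives."""
    # Solve (dr/du)^T psi = (df/du)^T; each column of psi is the adjoint
    # vector associated with one function of interest f_j.
    psi = np.linalg.solve(dr_du.T, df_du.T)  # shape (nu, nf)
    return df_dx - psi.T @ dr_dx             # total derivatives, shape (nf, nx)

# Check against the explicit-inverse form on made-up data.
rng = np.random.default_rng(1)
nu, nx, nf = 4, 3, 2
dr_du = rng.standard_normal((nu, nu)) + 4.0 * np.eye(nu)
dr_dx = rng.standard_normal((nu, nx))
df_dx = rng.standard_normal((nf, nx))
df_du = rng.standard_normal((nf, nu))
reference = df_dx - df_du @ np.linalg.inv(dr_du) @ dr_dx
print(np.allclose(total_derivs_adjoint(dr_du, dr_dx, df_dx, df_du), reference))
```

The arithmetic is identical to the direct sketch; only the grouping of the products, and therefore the number of linear solves, changes from the number of inputs to the number of outputs.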

As we will see in Section 6.9, the adjoint vectors are equivalent to the total derivatives df/dr\mathrm{d} f / \mathrm{d} r, which quantify the change in the function of interest given a perturbation in the residual that gets zeroed out by an appropriate change in uu.[11]

6.7.3 Direct or Adjoint?


Figure 6.28:Two possibilities for the size of df/dx\mathrm{d} f/\mathrm{d} x in Figure 6.27. When nx<nfn_x<n_f, it is advantageous to solve the linear system with the vector to the right of the square matrix because it has fewer columns. When nx>nfn_x>n_f, it is advantageous to solve the transposed linear system with the vector to the left because it has fewer rows.

Similar to the direct method, the solution of the adjoint linear system (Equation 6.46) tends to be the most expensive operation. Although the linear system is of the same size as that of the direct method, the cost of the adjoint method scales with the number of outputs nfn_f and is essentially independent of the number of inputs nxn_x. The comparison between the computational cost of the direct and adjoint methods is summarized in Table 3 and illustrated in Figure 6.28.

Similar to the trade-offs between forward- and reverse-mode AD, if the number of outputs is greater than the number of inputs, the direct (forward) method is more efficient (Figure 6.28, top). On the other hand, if the number of inputs is greater than the number of outputs, it is more efficient to use the adjoint (reverse) method (Figure 6.28, bottom). When the number of inputs and outputs is large and similar, neither method has an advantage, and the cost of computing the full total derivative Jacobian might be prohibitive. In this case, aggregating the outputs and using the adjoint method might be effective, as explained in Tip 6.7.

In practice, the adjoint method is implemented much more often than the direct method. Although both methods require a similar implementation effort, the direct method competes with methods that are much more easily implemented, such as finite differencing, complex step, and forward-mode AD. On the other hand, the adjoint method only competes with reverse-mode AD, which is plagued by the memory issue.

Table 3:Cost comparison of computing derivatives with direct and adjoint methods.

Step                             Direct      Adjoint
Partial derivative computation   Same        Same
Linear solution                  n_x times   n_f times
Matrix multiplications           Same        Same

Another reason why the adjoint method is more widely used is that many optimization problems have a few functions of interest (one objective and a few constraints) and many design variables. The adjoint method has made it possible to solve optimization problems involving computationally intensive PDE models.[12]

Although implementing implicit analytic methods is labor intensive, it is worthwhile if the differentiated code is used frequently in applications that demand repeated evaluations. For such applications, analytic differentiation with the partial derivatives computed using AD is the recommended approach because it combines the accuracy and scalability of the implicit analytic methods with the lower implementation effort of AD.

6.7.4 Adjoint Method with AD Partial Derivatives

Implementing the implicit analytic methods for models involving long, complicated code requires significant development effort. In this section, we focus on implementing the adjoint method because it is more widely used, as explained in Section 6.7.3. We assume that nf<nxn_f < n_x, so that the adjoint method is advantageous.

To ease the implementation of adjoint methods, we recommend a hybrid adjoint approach where the reverse mode of AD computes the partial derivatives in the adjoint equations (Equation 6.46) and total derivative equation (Equation 6.47).[18]

The partial derivative terms f/x\partial f / \partial x form an (nf×nx)(n_f \times n_x) matrix, and f/u\partial f / \partial u is an (nf×nu)(n_f \times n_u) matrix. These partial derivatives can be computed by identifying the section of the code that computes ff for a given xx and uu and running the AD tool for that section. This produces code that takes fˉ\bar f as an input and outputs xˉ\bar x and uˉ\bar u, as shown in Figure 6.31.


Figure 6.31:Applying reverse AD to the code that computes ff produces code that computes the partial derivatives of ff with respect to xx and uu.

Recall that we must first run the entire original code that computes uu and ff. Then we can run the AD code with the desired seed. Suppose we want the derivative of the jjth component of ff. We would set fˉj=1\bar f_j = 1 and the other elements to zero. After running the AD code, we obtain xˉ\bar x and uˉ\bar u, which correspond to the rows of the respective matrix of partial terms, that is,

xˉ=fjx,uˉ=fju.\bar x = \frac{\partial {f_j}}{\partial x}, \quad \bar u = \frac{\partial {f_j}}{\partial u}\, .

Thus, with each run of the AD code, we obtain the derivatives of one function with respect to all design variables and all state variables. One run is required for each element of ff. The reverse mode is advantageous if nf<nxn_f < n_x, which is what we assumed at the start of this section.
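The following JAX sketch illustrates this seeding on an invented output function (the function, dimensions, and values are placeholders); each call to the pullback with a unit seed returns one row of the two partial-derivative matrices:

```python
import jax
import jax.numpy as jnp

# Hypothetical explicit output section f(x, u) with nf = 2 outputs.
def f(x, u):
    return jnp.stack([jnp.sum(x * u[:3]),   # f_1
                      u @ u])               # f_2

x = jnp.array([1.0, 2.0, 3.0])              # design variables (nx = 3)
u = jnp.array([0.5, -1.0, 2.0, 4.0])        # converged states (nu = 4)

fval, f_vjp = jax.vjp(f, x, u)              # reverse-mode AD of the f-section only
fbar = jnp.array([1.0, 0.0])                # seed selecting f_1
xbar, ubar = f_vjp(fbar)                    # xbar = df_1/dx, ubar = df_1/du (one row each)
```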

The Jacobian r/u\partial r / \partial u can also be computed using AD. Because r/u\partial r / \partial u is typically sparse, the techniques covered in Section 6.8 significantly increase the efficiency of computing this matrix. This is a square matrix, so neither AD mode has an advantage over the other if we explicitly compute and store the whole matrix.

However, reverse-mode AD is advantageous when using an iterative method to solve the adjoint linear system (Equation 6.46). When using an iterative method, we do not form r/u\partial r / \partial u. Instead, we require products of the transpose of this matrix with some vector vv,[19]

ruv.{\frac{\partial r}{\partial u}}^\intercal v \, .

The elements of vv act as weights on the residuals and can be interpreted as a projection onto the direction of vv. Suppose we have the reverse AD code for the residual computation, as shown in Figure 6.32.


Figure 6.32:Applying reverse AD to the code that computes rr produces code that computes the partial derivatives of rr with respect to xx and uu.

This code requires a reverse seed rˉ\bar r, which determines the weights we want on each residual. Typically, a seed would have only one nonzero entry to find partial derivatives (e.g., setting rˉ=[1,0,,0]\bar r=[1,0,\ldots,0] would yield the first row of the Jacobian, uˉr1/u\bar u \equiv \partial r_1 / \partial u). However, to get the product in Equation 6.50, we require the seed to be weighted as rˉ=v\bar r = v. Then, we can compute the product by running the reverse AD code once to obtain uˉ[r/u]v\bar u \equiv [\partial r / \partial u]^\intercal v.

The final term needed to compute total derivatives with the adjoint method is the last term in Equation 6.47, which can be written as

ψrx=(rxψ).\psi^\intercal \frac{\partial r}{\partial x}= \left( {\frac{\partial r}{\partial x}}^\intercal \psi \right)^\intercal .

This is yet another transpose vector product that can be obtained using the same reverse AD code for the residuals, except that now the residual seed is rˉ=ψ\bar r = \psi, and the product we want is given by xˉ\bar x.

In sum, it is advantageous to use reverse-mode AD to compute the partial derivative terms for the adjoint equations, especially if the adjoint equations are solved using an iterative approach that requires only matrix-vector products. Similar techniques and arguments apply for the direct method, except that in that case, forward-mode AD is advantageous for computing the partial derivatives.
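The sketch below shows both seeding patterns on an invented two-state residual (the residual function, state, and adjoint values are placeholders): seeding with a vector v gives the transpose Jacobian-vector product needed by an iterative adjoint solver, and seeding with the adjoint vector gives the final term of the total derivative.

```python
import jax
import jax.numpy as jnp

# Invented residual function r(x, u) = 0; only the seeding pattern matters here.
A = jnp.array([[4.0, 1.0],
               [1.0, 3.0]])

def r(x, u):
    return A @ u + 0.1 * u**3 - x

x = jnp.array([1.0, 2.0])
u = jnp.array([0.2, 0.5])            # pretend this is the converged state

_, r_vjp = jax.vjp(r, x, u)          # reverse-mode AD of the residual code

def dru_transpose_times(v):
    """Matrix-free product (dr/du)^T v, obtained by setting rbar = v."""
    _, ubar = r_vjp(v)
    return ubar

# A Krylov solver for the adjoint system only ever needs dru_transpose_times.
# After the adjoint solve, the same AD code provides the last term of Equation 6.47:
psi = jnp.array([0.3, -0.7])         # placeholder adjoint vector
xbar, _ = r_vjp(psi)                 # xbar = (dr/dx)^T psi
```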

6.8 Sparse Jacobians and Graph Coloring

In this chapter, we have discussed various ways to compute a Jacobian of a model. If the Jacobian has many zero elements, it is said to be sparse. In many cases, we can take advantage of that sparsity to significantly reduce the computational time required to construct the Jacobian.

When applying a forward approach (forward-mode AD, finite differencing, or complex step), the cost of computing the Jacobian scales with nxn_x. Each forward pass re-evaluates the model to compute one column of the Jacobian. For example, when using finite differencing, nxn_x evaluations would be required. To compute the jjth column of the Jacobian, the input vector would be

[x1,  x2,  ,  xj+h,  ,  xnx].[x_1,\; x_2,\; \ldots,\; x_j + h,\; \ldots,\; x_{n_x}] \, .

We can significantly reduce the cost of computing the Jacobian depending on its sparsity pattern. As a simple example, consider a square diagonal Jacobian:

dfdx[J1100000J2200000J3300000J4400000J55].\frac{df}{dx}\equiv \begin{bmatrix} J_{11} & 0 & 0 & 0 & 0 \\ 0 & J_{22} & 0 & 0 & 0 \\ 0 & 0 & J_{33} & 0 & 0 \\ 0 & 0 & 0 & J_{44} & 0 \\ 0 & 0 & 0 & 0 & J_{55} \end{bmatrix} \, .

For this scenario, the Jacobian can be constructed with one evaluation rather than nxn_x evaluations. This is because a given output fif_i depends on only one input xix_i. We could think of the outputs as nxn_x independent functions. Thus, for finite differencing, rather than requiring nxn_x input vectors with nxn_x function evaluations, we can use one input vector, as follows:

[x1+h,  x2+h,  ,  x5+h],[x_1 + h,\; x_2 + h,\; \ldots,\; x_5 + h] \, ,

allowing us to compute all the nonzero entries in one pass.[20]
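The sketch below shows this one-pass evaluation; the elementwise model is made up solely so that each output depends on one input:

```python
import numpy as np

def f(x):
    # Elementwise model: output i depends only on input i, so df/dx is diagonal.
    return x**2 + np.sin(x)

x = np.linspace(1.0, 2.0, 5)
h = 1e-6
slope = (f(x + h) - f(x)) / h   # all inputs perturbed at once: one extra evaluation
J = np.diag(slope)              # scatter into the known sparsity pattern
```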

Although the diagonal case is easy to understand, it is a special situation. To generalize this concept, let us consider the following (5×6)(5 \times 6) matrix as an example:

[J1100J140J1600J23J2400J31J3200000000J45000J530J55J56].\begin{bmatrix} J_{11} & 0 & 0 & J_{14} & 0 & J_{16} \\ 0 & 0 & J_{23} & J_{24} & 0 & 0 \\ J_{31} & J_{32} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & J_{45} & 0 \\ 0 & 0 & J_{53} & 0 & J_{55} & J_{56} \end{bmatrix} \, .

A subset of columns is said to be structurally orthogonal if no row has a nonzero entry in more than one of those columns. In this example, the following sets of columns are structurally orthogonal: (1, 3), (1, 5), (2, 3), (2, 4, 5), (2, 6), and (4, 5). Structurally orthogonal columns can be combined, forming a smaller Jacobian that reduces the number of forward passes required. This reduced Jacobian is referred to as compressed. There is more than one way to compress this Jacobian, but in this case, the minimum number of compressed columns—referred to as colors—is three. In the following compressed Jacobian, we combine columns 1 and 3 (blue); columns 2, 4, and 5 (red); and leave column 6 on its own (black):

[J1100J140J1600J23J2400J31J3200000000J45000J530J55J56][J11J14J16J23J240J31J3200J450J53J55J56].\begin{bmatrix} \color{primary} J_{11} & \color{secondary} 0 & \color{primary} 0 & \color{secondary} J_{14} & \color{secondary} 0 & J_{16} \\ \color{primary} 0 & \color{secondary} 0 & \color{primary} J_{23} & \color{secondary} J_{24} & \color{secondary} 0 & 0 \\ \color{primary} J_{31} &\color{secondary} J_{32} & \color{primary} 0 & \color{secondary} 0 & \color{secondary} 0 & 0 \\ \color{primary} 0 & \color{secondary} 0 & \color{primary} 0 & \color{secondary} 0 & \color{secondary} J_{45} & 0 \\ \color{primary} 0 & \color{secondary} 0 & \color{primary} J_{53} & \color{secondary} 0 & \color{secondary} J_{55} & J_{56} \end{bmatrix} \Rightarrow \begin{bmatrix} \color{primary} J_{11} & \color{secondary} J_{14} & J_{16} \\ \color{primary} J_{23} & \color{secondary} J_{24} & 0 \\ \color{primary} J_{31} & \color{secondary} J_{32} & 0 \\ \color{primary} 0 & \color{secondary} J_{45} & 0 \\ \color{primary} J_{53} & \color{secondary} J_{55} & J_{56} \end{bmatrix} \, .

For finite differencing, complex step, and forward-mode AD, only compression among columns is possible. Reverse-mode AD allows compression among the rows. The concept is the same, except that we look for structurally orthogonal rows. One such compression is as follows:

[J1100J140J1600J23J2400J31J3200000000J45000J530J55J56][J1100J14J45J1600J23J2400J31J32J530J55J56].\begin{bmatrix} \color{primary} J_{11} & \color{primary} 0 & \color{primary} 0 & \color{primary} J_{14} & \color{primary} 0 & \color{primary} J_{16} \\ 0 & 0 & J_{23} & J_{24} & 0 & 0 \\ \color{secondary} J_{31} & \color{secondary} J_{32} & \color{secondary} 0 & \color{secondary} 0 & \color{secondary} 0 & \color{secondary} 0 \\ \color{primary} 0 & \color{primary} 0 & \color{primary} 0 & \color{primary} 0 & \color{primary} J_{45} & \color{primary} 0 \\ \color{secondary} 0 & \color{secondary} 0 & \color{secondary} J_{53} & \color{secondary} 0 & \color{secondary} J_{55} & \color{secondary} J_{56} \end{bmatrix} \Rightarrow \begin{bmatrix} \color{primary} J_{11} & \color{primary} 0 & \color{primary} 0 & \color{primary} J_{14} & \color{primary} J_{45} & \color{primary} J_{16} \\ 0 & 0 & J_{23} & J_{24} & 0 & 0 \\ \color{secondary} J_{31} & \color{secondary} J_{32} & \color{secondary} J_{53}& \color{secondary} 0 & \color{secondary} J_{55} & \color{secondary} J_{56} \\ \end{bmatrix} \, .

AD can also be used even more flexibly when both modes are used: forward passes to evaluate groups of structurally orthogonal columns and reverse passes to evaluate groups of structurally orthogonal rows. Rather than taking incremental steps in each direction as is done in finite differencing, we set the AD seed vector with 1s in the directions we wish to evaluate, similar to how the seed is set for directional derivatives, as discussed in Section 6.6.
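The sketch below applies this seeding to the (5 × 6) example with forward-mode AD in JAX. The model is invented so that its Jacobian has exactly the sparsity pattern shown above; each seed combines the structurally orthogonal columns of one color, so three forward passes recover all the nonzeros:

```python
import jax
import jax.numpy as jnp

# Invented model whose Jacobian has the 5 x 6 sparsity pattern from the example.
def f(x):
    return jnp.stack([
        x[0] * x[3] + x[5],      # row 1: nonzeros in columns 1, 4, 6
        jnp.sin(x[2]) * x[3],    # row 2: columns 3, 4
        x[0] + x[1]**2,          # row 3: columns 1, 2
        jnp.exp(x[4]),           # row 4: column 5
        x[2] * x[4] + x[5],      # row 5: columns 3, 5, 6
    ])

x = jnp.arange(1.0, 7.0)

# One forward pass per color; each seed has 1s in one group of structurally
# orthogonal columns (columns 1 and 3; columns 2, 4, and 5; column 6).
seeds = [
    jnp.array([1.0, 0.0, 1.0, 0.0, 0.0, 0.0]),
    jnp.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0]),
    jnp.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0]),
]
compressed = jnp.stack([jax.jvp(f, (x,), (s,))[1] for s in seeds], axis=1)
print(compressed)   # (5 x 3) compressed Jacobian; scatter back using the known pattern
```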

For these small Jacobians, it is straightforward to determine how to compress the matrix in the best possible way. For a large matrix, this is not so easy. One approach is to use graph coloring. This approach starts by building a graph where the vertices represent the row and column indices, and the edges represent nonzero entries in the Jacobian. Then, heuristic algorithms are applied to this graph to estimate the minimum number of “colors” (groups of structurally orthogonal columns). Graph coloring is a large field of research, where derivative computation is one of many applications.[21]
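As a rough illustration only (not one of the specialized algorithms from the graph-coloring literature), a greedy coloring of the column-intersection graph, applied to the sparsity pattern of the example above, happens to recover the same three-color grouping:

```python
import numpy as np

def greedy_column_coloring(sparsity):
    """Greedy coloring of the column-intersection graph of a sparsity pattern.

    Two columns conflict if any row has a nonzero in both; columns with the
    same color are structurally orthogonal and can share one forward pass.
    """
    nf, nx = sparsity.shape
    conflicts = (sparsity.T.astype(int) @ sparsity.astype(int)) > 0   # (nx, nx)
    colors = -np.ones(nx, dtype=int)
    for j in range(nx):
        used = {colors[k] for k in range(nx) if conflicts[j, k] and colors[k] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[j] = c
    return colors

pattern = np.array([
    [1, 0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 1],
], dtype=bool)
print(greedy_column_coloring(pattern))   # [0 1 0 1 1 2]: columns {1, 3}, {2, 4, 5}, {6}
```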

6.9 Unified Derivatives Equation

Now that we have introduced all the methods for computing derivatives, we will see how they are connected. For example, we have mentioned that the direct and adjoint methods are analogous to the forward and reverse mode of AD, respectively, but we did not show this mathematically. The unified derivatives equation (UDE) expresses both methods.25 Also, the implicit analytic methods from Section 6.7 assumed one set of implicit equations (r=0r=0) and one set of explicit functions (ff). The UDE formulates the derivative computation for systems with mixed sets of implicit and explicit equations.

We first derive the UDE from basic principles and give an intuitive explanation of the derivative terms. Then, we show how we can use the UDE to handle implicit and explicit equations. We also show how the UDE can retrieve the direct and adjoint equations. Finally, we show how the UDE is connected to AD.

6.9.1 UDE Derivation


Figure 6.35:Solution of a system of two equations expressed by residuals.

Suppose we have a set of nn residual equations with the same number of unknowns,

ri(u1,u2,,un)=0,i=1,,n,r_i(u_1, u_2, \ldots, u_n) = 0, \quad i = 1, \ldots, n \, ,

and that there is at least one solution uu^* such that r(u)=0r(u^*)=0. Such a solution can be visualized for n=2n=2, as shown in Figure 6.35.

These residuals are general: each one can depend on any subset of the variables uu and can be truly implicit functions or explicit functions converted to the implicit form (see Section 3.3 and Example 3.3). The total differentials for these residuals are

dri=riu1du1++riundun,i=1,,n.\mathrm{d} r_i = \frac{\partial r_i}{\partial {u_1}}\mathrm{d} u_1 + \ldots + \frac{\partial r_i}{\partial {u_n}}\mathrm{d} u_n , \quad i = 1, \ldots, n \, .

These represent first-order changes in rr due to perturbations in uu. The differentials of uu can be visualized as perturbations in the space of the variables. The differentials of rr can be visualized as linear changes to the surface defined by r=0r=0, as illustrated in Figure 6.36.


Figure 6.36:The differential dr\mathrm{d} r can be visualized as a linearized (first-order) change of the contour value.

We can write the differentials (Equation 6.61) in matrix form as

[r1u1r1unrnu1rnun][du1dun]=[dr1drn].\begin{bmatrix} \frac{\partial r_1}{\partial {u_1}}& \cdots & \frac{\partial r_1}{\partial {u_n}}\\ \vdots & \ddots & \vdots \\ \frac{\partial r_n}{\partial {u_1}}& \cdots & \frac{\partial r_n}{\partial {u_n}}\end{bmatrix} \begin{bmatrix} \mathrm{d} u_1 \\ \vdots \\ \mathrm{d} u_n \end{bmatrix} = \begin{bmatrix} \mathrm{d} r_1 \\ \vdots \\ \mathrm{d} r_n \end{bmatrix} \, .

The partial derivatives in the matrix are derivatives of the expressions for rr with respect to uu that can be obtained symbolically, and they are in general functions of uu. The vector of differentials du\mathrm{d} u represents perturbations in uu that can be solved for a given vector of changes dr\mathrm{d} r.

Now suppose that we are at a solution uu^*, such that r(u)=0r(u^*)=0. All the partial derivatives (r/u\partial r/\partial u) can be evaluated at uu^*. When all entries in dr\mathrm{d} r are zero, then the solution of this linear system yields du=0\mathrm{d} u = 0. This is because if there is no disruption in the residuals that are already zero, the variables do not need to change either.


Figure 6.37:The total derivatives du1/dr1\mathrm{d} u_1 / \mathrm{d} r_1 and du2/dr1\mathrm{d} u_2 / \mathrm{d} r_1 represent the first-order changes needed to satisfy a perturbation r1=dr1r_1 = \mathrm{d} r_1 while keeping r2=0r_2=0.

How is this linear system useful? With these differentials, we can choose different combinations of dr\mathrm{d} r to obtain any total derivatives that we want. For example, we can get the total derivatives of uu with respect to a single residual rir_i by keeping dri\mathrm{d} r_i while setting all the other differentials to zero (drji=0\mathrm{d} r_{j \ne i} = 0). The visual interpretation of this total derivative is shown in Figure 6.37 for n=2n=2 and i=1i=1. Setting dr=[0,,0,dri,0,,0]\mathrm{d} r = [0,\ldots,0, \mathrm{d} r_i, 0, \ldots, 0] in Equation 6.62 and moving dri\mathrm{d} r_i to the denominator, we obtain the following linear system:[22]

[r1u1r1uir1unriu1riuiriunrnu1rnuirnun][du1driduidridundri]=[010].\begin{bmatrix} \frac{\partial r_1}{\partial {u_1}}& \cdots & \frac{\partial r_1}{\partial {u_i}}& \cdots & \frac{\partial r_1}{\partial {u_n}}\\ \vdots & \ddots & \vdots & & \vdots \\ \frac{\partial r_i}{\partial {u_1}}& \cdots & \frac{\partial r_i}{\partial {u_i}}& \cdots & \frac{\partial r_i}{\partial {u_n}}\\ \vdots & & \vdots & \ddots & \vdots \\ \frac{\partial r_n}{\partial {u_1}}& \cdots & \frac{\partial r_n}{\partial {u_i}}& \cdots & \frac{\partial r_n}{\partial {u_n}}\end{bmatrix} \begin{bmatrix} \frac{du_1}{d{r_i}}\\ \vdots \\ \frac{du_i}{d{r_i}}\\ \vdots \\ \frac{du_n}{d{r_i}}\end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} \, .

Doing the same for all i=1,,ni = 1, \ldots, n, we get the following nn linear systems:

[r1u1r1unrnu1rnun][du1dr1du1drndundr1dundrn]=[1001].\begin{bmatrix} \frac{\partial r_1}{\partial {u_1}}& \cdots & \frac{\partial r_1}{\partial {u_n}}\\ \vdots & \ddots & \vdots \\ \frac{\partial r_n}{\partial {u_1}}& \cdots & \frac{\partial r_n}{\partial {u_n}}\end{bmatrix} \begin{bmatrix} \frac{du_1}{d{r_1}}& \cdots & \frac{du_1}{d{r_n}}\\ \vdots & \ddots & \vdots \\ \frac{du_n}{d{r_1}}& \cdots & \frac{du_n}{d{r_n}}\end{bmatrix} = \begin{bmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{bmatrix} \, .

Solving these linear systems yields the total derivatives of all the elements of uu with respect to all the elements of rr. We can write this more compactly in matrix form as

rududr=I. \frac{\partial r}{\partial {u}}\frac{du}{d{r}}= I \, .

This is the forward form of the UDE.

The total derivatives du/dr\mathrm{d} u / \mathrm{d} r might not seem like the derivatives in which we are interested. Based on the implicit analytic methods derived in Section 6.7.2, these look like derivatives of states with respect to residuals, not the derivatives that we ultimately want to compute (df/dx\mathrm{d} f / \mathrm{d} x). However, we will soon see that with the appropriate choice of rr and uu, we can obtain a linear system that solves for the total derivatives we want.

With Equation 6.65, we can solve one column at a time. Similar to AD, we can also solve for the rows instead by transposing the systems as [23]

rududr=I, \left. \frac{\partial r}{\partial {u}}\right.^\intercal \left.\frac{du}{d{r}}\right.^\intercal = I \, ,

which is the reverse form of the UDE.


Figure 6.38:The total derivatives du1/dr1\mathrm{d} u_1 / \mathrm{d} r_1 and du1/dr2\mathrm{d} u_1 / \mathrm{d} r_2 represent the first-order change in u1u_1 resulting from perturbations dr1\mathrm{d} r_1 and dr2\mathrm{d} r_2.

Now, each column jj yields duj/dr\mathrm{d} u_j / \mathrm{d} r—the total derivative of one variable with respect to all the residuals. This total derivative is interpreted visually in Figure 6.38.

The usefulness of the total derivative Jacobian du/dr\mathrm{d} u / \mathrm{d} r might still not be apparent. In the next section, we explain how to set up the UDE to include df/dx\mathrm{d} f / \mathrm{d} x in the UDE unknowns (du/dr\mathrm{d} u / \mathrm{d} r).
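The following numerical check on an invented two-equation system (all values here are arbitrary) illustrates what du/dr means: solving the forward UDE gives the same result as perturbing one residual and re-solving the system, as in Figure 6.37.

```python
import numpy as np

# Invented residual system with n = 2 unknowns.
def r(u):
    return np.array([u[0]**2 + u[1] - 3.0,
                     u[0] + u[1]**3 - 2.0])

def dr_du(u):
    return np.array([[2.0 * u[0], 1.0],
                     [1.0, 3.0 * u[1]**2]])

def solve(target, u0):
    """Newton iteration for r(u) = target."""
    u = u0.copy()
    for _ in range(20):
        u -= np.linalg.solve(dr_du(u), r(u) - target)
    return u

u_star = solve(np.zeros(2), np.array([1.0, 1.0]))

# Forward UDE: (dr/du)(du/dr) = I, so du/dr is the inverse of the partial Jacobian.
du_dr = np.linalg.solve(dr_du(u_star), np.eye(2))

# Compare with re-solving after a small perturbation dr = [eps, 0].
eps = 1e-6
u_pert = solve(np.array([eps, 0.0]), u_star)
print((u_pert - u_star) / eps)   # approximately the first column of du/dr
print(du_dr[:, 0])
```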

6.9.2 UDE for Mixed Implicit and Explicit Components

In the previous section, the UDE was derived based on residual equations. The equations were written in implicit form, but there was no assumption about whether the underlying equations were implicit or explicit. Because we can write an explicit equation in implicit form (see Section 3.3 and Example 3.3), the UDE allows a mix of implicit and explicit sets of equations, which we now call components.

To derive the implicit analytic equations in Section 6.7, we considered two components: an implicit component that determines uu by solving r(u;x)=0r(u;x)=0 and an explicit component that computes the functions of interest, f(x,u)f(x,u).

We can recover the implicit analytic differentiation equations (direct and adjoint) from the UDE by defining a set of variables that concatenates the state variables with inputs and outputs as follows:

u^[xuf].\hat u \equiv \begin{bmatrix} x \\ u \\ f \end{bmatrix} \, .

This is a vector with (nx+nu+nf)\left( n_x + n_u + n_f \right) variables. For the residuals, we need a vector with the same size. We can obtain this by realizing that the residuals associated with the inputs and outputs are just explicit functions that can be written in implicit form. Then, we have

r^[xxˇrrˇ(x,u)ffˇ(x,u)]=[000].\hat r \equiv \begin{bmatrix} x-\check{x} \\ r-\check{r}(x,u) \\ f-\check{f}(x,u) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \, .

Here, we distinguish xx (the actual variable in the UDE system) from xˇ\check{x} (a given input) and ff (the variable) from fˇ\check{f} (an explicit function of xx and uu). Similarly, rr is the vector of variables associated with the residual and rˇ\check{r} is the residual function itself.

Taking the differential of the residuals, and considering only one of them to be nonzero at a time, we obtain,

dr^[dxdrdf].\mathrm{d} \hat r \equiv \begin{bmatrix} \mathrm{d} x \\ \mathrm{d} r \\ \mathrm{d} f \end{bmatrix} \, .

Using these variable and residual definitions in Equation 6.65 and Equation 6.66 yields the full UDE shown in Figure 6.40, where the block we ultimately want to compute is df/dx\mathrm{d} f / \mathrm{d} x.


Figure 6.40:The direct and adjoint methods can be recovered from the UDE.

To compute df/dx\mathrm{d} f / \mathrm{d} x using the forward UDE (left-hand side of the equation in Figure 6.40), we can ignore all but three blocks in the total derivatives matrix: II, du/dx\mathrm{d} u / \mathrm{d} x, and df/dx\mathrm{d} f / \mathrm{d} x. By multiplying these blocks and using the definition ϕdu/dx\phi \equiv - \mathrm{d} u / \mathrm{d} x, we recover the direct linear system (Equation 6.43) and the total derivative equation (Equation 6.44).

To compute df/dx\mathrm{d} f / \mathrm{d} x using the reverse UDE (right-hand side of the equation in Figure 6.40), we can ignore all but three blocks in the total derivatives matrix: II, df/dr\mathrm{d} f/ \mathrm{d} r, and df/dx\mathrm{d} f / \mathrm{d} x. By multiplying these blocks and defining ψdf/dr\psi \equiv - \mathrm{d} f / \mathrm{d} r, we recover the adjoint linear system (Equation 6.46) and the corresponding total derivative equation (Equation 6.47).

The definition of ψ\psi here is significant because, as mentioned in Section 6.7.2, the adjoint vector is the total derivative of the objective function with respect to the governing equation residuals.

By defining one implicit component (associated with uu) and two explicit components (associated with xx and ff), we have retrieved the direct and adjoint methods from the UDE. In general, we can define an arbitrary number of components, so the UDE provides a mathematical framework for computing the derivatives of coupled systems. Furthermore, each component can be implicit or explicit, so the UDE can handle an arbitrary mix of components. All we need to do is to include the desired states in the UDE augmented variables vector (Equation 6.69) and the corresponding residuals in the UDE residuals vector (Equation 6.70). We address coupled systems in Section 13.3.3 and use the UDE in Section 13.2.6, where we extend it to coupled systems with a hierarchy of components.

6.9.3 Recovering AD

Now we will see how we can recover AD from the UDE. First, we define the UDE variables associated with each operation or line of code (assuming all loops have been unrolled), such that uvu \equiv v and

vi=vˇi(v1,,vi1),i=1,,n.v_i = \check v_i(v_1,\dots,v_{i-1}), \quad i = 1, \ldots, n \, .

Recall from Section 6.6.1 that each variable is an explicit function of the previous ones.

To define the appropriate residuals, we use the same technique from before to convert an explicit function into implicit form by moving all the terms to the left-hand side to obtain

ri=vivˇi(v1,,vi1).r_i = v_i - \check v_i(v_1,\dots,v_{i-1}) \, .

The distinction between vv and vˇ\check v is that vv represents variables that are considered independent in the UDE, whereas vˇ\check v represents the explicit expressions. Of course, the values for these become equal once the system is solved. Similar to the differentials in Equation 6.71, we have drdv\mathrm{d} r \equiv \mathrm{d} v.

Taking the partial derivatives of the residuals (Equation 6.73) with respect to vv (Equation 6.72), and replacing the total derivatives in the forward form of the UDE (Equation 6.65) with the new symbols yields

[100vˇ2v110vˇnv1vˇnvn11][dv1dv100dv2dv1dv2dv20dvndv1dvndvn1dvndvn]=I.\begin{bmatrix} 1 & 0 & \cdots & 0 \\ -\frac{\partial \check v_2}{\partial {v_1}} & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ -\frac{\partial \check v_n}{\partial {v_1}} & \cdots & -\frac{\partial \check v_n}{\partial {v_{n-1}}} & 1 \end{bmatrix} \begin{bmatrix} \frac{dv_1}{d{v_1}} & 0 & \cdots & 0 \\ \frac{dv_2}{d{v_1}} & \frac{dv_2}{d{v_2}} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ \frac{dv_n}{d{v_1}} & \cdots & \frac{dv_n}{d{v_{n-1}}} & \frac{d{v_n}}{d{v_n}} \end{bmatrix} = I \, .

This equation is the matrix form of the AD forward chain rule (Equation 6.21), where each column of the total derivative matrix corresponds to the tangent vector (v˙\dot v) for the chosen input variable. As observed in Figure 6.16, the partial derivatives form a lower triangular matrix. The Jacobian we ultimately want to compute (df/dx\mathrm{d} f / \mathrm{d} x) is composed of a subset of derivatives in the bottom-left corner near the dvn/dv1\mathrm{d} {v_n} / \mathrm{d} {v_1} term. To compute these derivatives, we need to perform forward substitution and compute one column of the total derivative matrix at a time, where each column is associated with the inputs of interest.

Similarly, the reverse form of the UDE (Equation 6.66) yields the transpose of Equation 6.74,

[1vˇ2v1vˇnv101vˇnvn1001][dv1dv1dv2dv1dvndv10dv2dv2dvn1dvn1dvndvn100dvndvn]=I.\begin{bmatrix} 1 & -\frac{\partial \check v_2}{\partial {v_1}} & \cdots & -\frac{\partial \check v_n}{\partial {v_1}} \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & -\frac{\partial \check v_n}{\partial {v_{n-1}}} \\ 0 & \cdots & 0 & 1 \end{bmatrix} \begin{bmatrix} \frac{dv_1}{d{v_1}} & \frac{dv_2}{d{v_1}} & \cdots & \frac{dv_n}{d{v_1}} \\ 0 & \frac{dv_2}{d{v_2}} & \ddots & \vdots \\ \vdots & \ddots & \frac{dv_{n-1}}{d{v_{n-1}}} & \frac{dv_n}{d{v_{n-1}}} \\ 0 & \cdots & 0 & \frac{dv_{n}}{d{v_{n}}} \end{bmatrix} = I \, .

This is equivalent to the AD reverse chain rule (Equation 6.26), where each column of the total derivative matrix corresponds to the gradient vector (vˉ\bar v) for the chosen output variable. The partial derivatives now form an upper triangular matrix, as previously shown in Figure 6.21. The derivatives of interest are now near the top-right corner of the total derivative matrix near the dvn/dv1\mathrm{d} {v_n} / \mathrm{d} {v _1} term. To compute these derivatives, we need to perform back substitutions, computing one column of the matrix at a time.
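For a three-line unrolled "code" (v1 = x, v2 = sin(v1), v3 = v1 v2, with f = v3), the sketch below builds the unit lower-triangular partials of Equation 6.74 explicitly and recovers df/dx = dv3/dv1 from both the forward and the reverse forms; np.linalg.solve stands in for the forward and back substitutions.

```python
import numpy as np

# Unrolled code: v1 = x, v2 = sin(v1), v3 = v1 * v2, with f = v3.
x = 1.3
v1 = x
v2 = np.sin(v1)
v3 = v1 * v2

# Partials of r_i = v_i - vcheck_i(v_1, ..., v_{i-1}) with respect to v.
L = np.array([[1.0,          0.0, 0.0],
              [-np.cos(v1),  1.0, 0.0],
              [-v2,          -v1, 1.0]])

dv_dv = np.linalg.solve(L, np.eye(3))        # forward form: one column per input
dv_dv_rev = np.linalg.solve(L.T, np.eye(3))  # reverse form: one column per output

print(dv_dv[2, 0], dv_dv_rev[0, 2])          # both equal dv3/dv1 = df/dx
print(np.sin(v1) + v1 * np.cos(v1))          # analytic check
```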

6.10 Summary

Derivatives are useful in many applications beyond optimization. This chapter introduced the methods available to compute the first derivatives of the outputs of a model with respect to its inputs. In optimization, the outputs are usually the objective function and the constraint functions, and the inputs are the design variables. The typical characteristics of the available methods are compared in Table 4.

Table 4:Characteristics of the various derivative computation methods. Some of these characteristics are problem or implementation dependent, so these are not universal.

Method              Accuracy   Scalability   Ease of implementation   Implicit functions
Symbolic            •                        Hard
Finite difference                            Easy                     •
Complex step        •                        Intermediate             •
AD                  •          •             Intermediate             •
Implicit analytic   •          •             Hard                     •

Symbolic differentiation is accurate but only scalable for simple, explicit functions of low dimensionality. Therefore, it is necessary to compute derivatives numerically. Although it is generally intractable or inefficient for many engineering models, symbolic differentiation is used by AD at each line of code and in implicit analytic methods to derive expressions for computing the required partial derivatives.

Finite-difference approximations are popular because they are easy to implement and can be applied to any model, including black-box models. The downsides are that these approximations are not accurate, and the cost scales linearly with the number of variables. Many of the issues practitioners experience with gradient-based optimization can be traced to errors in the gradients when algorithms automatically compute these gradients using finite differences.

The complex-step method is accurate and relatively easy to implement. It usually requires some changes to the analysis source code, but this process can be scripted. The main advantage of the complex-step method is that it produces analytically accurate derivatives. However, like the finite-difference method, the cost scales linearly with the number of inputs, and each simulation requires more effort because of the complex arithmetic.
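As a reminder of why the method is accurate, the whole computation is one line once the function accepts complex arguments; the test function and step size below are arbitrary.

```python
import cmath

def f(x):
    return cmath.exp(x) * cmath.sin(x)   # any function written with complex-safe operations

x, h = 1.5, 1e-30
dfdx = f(complex(x, h)).imag / h         # no subtractive cancellation, so h can be tiny
```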

AD produces analytically accurate derivatives, and many implementations can be fully automated. The implementation requires access to the source code but is still relatively straightforward to apply. The computational cost of forward-mode AD scales with the number of inputs, and the reverse mode scales with the number of outputs. The scaling factor for the forward mode is generally lower than that for finite differences. The cost of reverse-mode AD is independent of the number of design variables.


Figure 6.43:Efficient gradient computation with an analytic method improves the scalability of gradient-based algorithms compared to finite differencing. In this case, we show the results for the nn-dimensional Rosenbrock, where the cost of computing the derivatives analytically is independent of nn.

Implicit analytic methods (direct and adjoint) are accurate and scalable but require significant implementation effort. These methods are exact (depending on how the partial derivatives are obtained), and like AD, they provide both forward and reverse modes with the same scaling advantages. Gradient-based optimization using the adjoint method is a powerful combination that scales well with the number of variables, as shown in Figure 6.43. The disadvantage is that because implicit methods are intrusive, they require considerable development effort.

A hybrid approach where the partial derivatives for the implicit analytic equations are computed with AD is generally recommended. This hybrid approach is computationally more efficient than AD while reducing the implementation effort of implicit analytic methods and ensuring accuracy.

The UDE encapsulates all the derivative computation methods in a single linear system. Using the UDE, we can formulate the derivative computation for an arbitrary system of mixed explicit and implicit components. This will be used in Section 13.2.6 to develop a mathematical framework for solving coupled systems and computing the corresponding derivatives.

Problems

Footnotes
  1. Here, ff is the difference between the eccentric and mean anomalies, xx is the mean anomaly, and the eccentricity is set to 1. For more details, see Exercise 6 and .

  2. This method originated with the work of 1 who developed formulas that use complex arithmetic for computing the derivatives of real functions of arbitrary order with arbitrary order truncation error, much like the Taylor series combination approach in finite differences. Later, 2 observed that the simplest of these formulas was convenient for computing first derivatives.

  3. This approximation can also be derived from one of the Cauchy–Riemann equations, which are fundamental in complex analysis and express complex differentiability.3

  4. For more details on the problematic functions and how to implement the complex-step method in various programming languages, see 3. A summary, implementation guide, and scripts are available at: http://bit.ly/complexstep

  5. One of the main techniques for reducing the memory usage of reverse AD is checkpointing; see Chapter 12 in 6

  6. A stack, also known as last in, first out (LIFO), is a data structure that stores a one-dimensional array. We can only add an element to the top of the stack (push) and take the element from the top of the stack (pop).

  7. The overloading of “+” computes (v, \dot{v}) = (x + f, \dot{x} + \dot{f}), and then the overloading of “\sin” computes (f, \dot{f}) = (\sin(v), \cos(v)\dot{v}).

  8. See Sec. 5.4 in 6 for more details on reverse mode using operating overloading.

  9. 17 compare the continuous and discrete adjoint approaches in more detail.

  10. This chain rule can be derived by writing the total differential of ff as

    df=fxdx+fudu\mathrm{d} f = \frac{\partial f}{\partial x} \mathrm{d} x + \frac{\partial f}{\partial u} \mathrm{d} u

    and then “dividing” it by dx\mathrm{d} x. See Section A.2 for more background on differentials.

  11. The adjoint vector can also be interpreted as a Lagrange multiplier vector associated with equality constraints r=0r=0. Defining the Lagrangian L(x,u)=f+ψr\mathcal{L}(x,u) = f + \psi^\intercal r and differentiating it with respect to xx, we get

    Lx=fx+ψrx.\frac{\partial {\mathcal{L}}}{\partial x}= \frac{\partial f}{\partial x}+ \psi^\intercal \frac{\partial r}{\partial x}.

    Thus, the total derivatives df/dx\mathrm{d} f / \mathrm{d} x are the derivatives of this Lagrangian.

  12. One widespread application of the adjoint method has been in aerodynamic and hydrodynamic shape optimization.18

  13. The displacements do change with the areas but only through the solution of the governing equations, which are not considered when taking partial derivatives.

  14. This is not true for large displacements, but we assume small displacements.

  15. Although ultimately, the areas do change the stresses, they do so only through changes in the displacements.

  16. Usually, the stiffness matrix is symmetric, and K=KK^\intercal = K. This means that the solver for displacements can be repurposed for adjoint computation by setting the right-hand side shown here instead of the loads. For that reason, this right-hand side is sometimes called a pseudo-load.

  17. 19 provide recommendations on constraint aggregation for structural optimization.

  18. 20 provide more details on this approach and its applications.

  19. See Iterative Methods for more details on iterative solvers.

  20. 21 were the first to show that the number of function evaluations could be reduced for sparse Jacobians.

  21. 22 provide a review of graph coloring in the context of computing derivatives. 23 show how to use graph coloring to compute total coupled derivatives.

  22. As explained in Section A.2, we take the liberty of treating differentials algebraically and skip a more rigorous and lengthy proof.

  23. Normally, for two matrices AA and BB, (AB)=BA(AB)^\intercal=B^\intercal A^\intercal, but in this case,
    AB=I \Rightarrow B=A^{-1} \Rightarrow B^\intercal=A^{-\intercal} \Rightarrow A^\intercal B^\intercal = I \, .

References
  1. Lyness, J. N., & Moler, C. B. (1967). Numerical differentiation of analytic functions. SIAM Journal on Numerical Analysis, 4(2), 202–210. 10.1137/0704019
  2. Squire, W., & Trapp, G. (1998). Using complex variables to estimate derivatives of real functions. SIAM Review, 40(1), 110–112. 10.1137/S003614459631241X
  3. Martins, J. R. R. A., Sturdza, P., & Alonso, J. J. (2003). The complex-step derivative approximation. ACM Transactions on Mathematical Software, 29(3), 245–262. 10.1145/838250.838251
  4. Lantoine, G., Russell, R. P., & Dargent, T. (2012). Using multicomplex variables for automatic computation of high-order derivatives. ACM Transactions on Mathematical Software, 38(3), 1–21. 10.1145/2168773.2168774
  5. Fike, J. A., & Alonso, J. J. (2012). Automatic differentiation through the use of hyper-dual numbers for second derivatives. In S. Forth, P. Hovland, E. Phipps, J. Utke, & A. Walther (Eds.), Recent Advances in Algorithmic Differentiation (pp. 163–173). Springer. 10.1007/978-3-642-30023-3_15
  6. Griewank, A. (2000). Evaluating Derivatives. SIAM. 10.1137/1.9780898717761
  7. Naumann, U. (2011). The Art of Differentiating Computer Programs—An Introduction to Algorithmic Differentiation. SIAM. https://books.google.ca/books/about/The_Art_of_Differentiating_Computer_Prog.html?id=OgQuUR4nLu0C
  8. Utke, J., Naumann, U., Fagan, M., Tallent, N., Strout, M., Heimbach, P., Hill, C., & Wunsch, C. (2008). OpenAD/F: A modular open-source tool for automatic differentiation of Fortran codes. ACM Transactions on Mathematical Software, 34(4). 10.1145/1377596.1377598
  9. Hascoet, L., & Pascual, V. (2013). The Tapenade automatic differentiation tool: Principles, model, and specification. ACM Transactions on Mathematical Software, 39(3), 20:1-20:43. 10.1145/2450153.2450158
  10. Griewank, A., Juedes, D., & Utke, J. (1996). Algorithm 755: ADOL-C: A package for the automatic differentiation of algorithms written in C/C++. ACM Transactions on Mathematical Software, 22(2), 131–167. 10.1145/229473.229474
  11. Wiltschko, A. B., van Merriënboer, B., & Moldovan, D. (2017). Tangent: automatic differentiation using source code transformation in Python. https://arxiv.org/abs/1711.02712
  12. Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., & Zhang, Q. (2018). JAX: Composable Transformations of Python+NumPy Programs (0.2.5). http://github.com/google/jax
  13. Revels, J., Lubin, M., & Papamarkou, T. (2016). Forward-mode automatic differentiation in Julia. In ArXiv e-prints. https://arxiv.org/abs/1607.07892
  14. Neidinger, R. D. (2010). Introduction to automatic differentiation and MATLAB object-oriented programming. SIAM Review, 52(3), 545–563. 10.1137/080743627
  15. Betancourt, M. (2018). A geometric theory of higher-order automatic differentiation. https://arxiv.org/abs/1812.11592