Appendix: Mathematics Background
This appendix briefly reviews various mathematical concepts used throughout the book.
A.1 Taylor Series Expansion
Series expansions are representations of a given function in terms of a series of other (usually simpler) functions. One common series expansion is the Taylor series, which is expressed as a polynomial whose coefficients are based on the derivatives of the original function at a fixed point.
The Taylor series is a general tool that can be applied whenever the function has derivatives. We can use this series to estimate the value of the function near the given point, which is useful when the function is difficult to evaluate directly. The Taylor series is used to derive algorithms for finding the zeros of functions and algorithms for minimizing functions in Chapter 4 and Chapter 5.
To derive the Taylor series, we start with an infinite polynomial series about an arbitrary point, $x$, to approximate the value of a function at $x + \Delta x$ using

$$ f(x + \Delta x) = a_0 + a_1 \Delta x + a_2 \Delta x^2 + a_3 \Delta x^3 + \dots \tag{A.1} $$

We can make this approximation exact at $\Delta x = 0$ by setting the first coefficient to $a_0 = f(x)$. To find the appropriate value for $a_1$, we take the first derivative to get

$$ f'(x + \Delta x) = a_1 + 2 a_2 \Delta x + 3 a_3 \Delta x^2 + \dots , $$

which means that we need $a_1 = f'(x)$ for the derivative to be exact at $\Delta x = 0$. To derive the other coefficients, we systematically take the derivative of both sides and evaluate at $\Delta x = 0$, which isolates the first term of the differentiated series (which is always a constant). Identifying the pattern yields the general formula for the $n$th-order coefficient:

$$ a_n = \frac{f^{(n)}(x)}{n!} . $$
Substituting this into the polynomial in Equation A.1 yields the Taylor series

$$ f(x + \Delta x) = f(x) + f'(x)\,\Delta x + \frac{f''(x)}{2!}\,\Delta x^2 + \dots = \sum_{n=0}^{\infty} \frac{f^{(n)}(x)}{n!}\,\Delta x^n . $$

The series is typically truncated to use terms up to order $m$,

$$ f(x + \Delta x) \approx \sum_{n=0}^{m} \frac{f^{(n)}(x)}{n!}\,\Delta x^n , $$

which yields an approximation with a truncation error of order $\mathcal{O}\left( \Delta x^{m+1} \right)$. In optimization, it is common to use the first three terms (up to the $\Delta x^2$ term) to get a quadratic approximation.
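For illustration (an added sketch, not part of the original text), the following Python snippet truncates the Taylor series of $\sin$ at a few orders; the choice of function, expansion point, and step is an arbitrary assumption.

```python
import math

def taylor_sin(x, dx, m):
    """Truncated Taylor series of sin about x, using terms up to order m."""
    # Derivatives of sin cycle through sin, cos, -sin, -cos.
    derivs = [math.sin(x), math.cos(x), -math.sin(x), -math.cos(x)]
    return sum(derivs[n % 4] / math.factorial(n) * dx**n for n in range(m + 1))

x, dx = 1.0, 0.5
exact = math.sin(x + dx)
for m in (1, 2, 4):
    approx = taylor_sin(x, dx, m)
    print(f"order {m}: approx = {approx:.6f}, error = {abs(approx - exact):.2e}")
```

The error shrinks as the truncation order increases, consistent with the $\mathcal{O}(\Delta x^{m+1})$ truncation error.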
The Taylor series in multiple dimensions is similar to the single-variable case but more complicated. The first derivative of the function becomes a gradient vector, and the second derivatives become a Hessian matrix. Also, we need to define a direction along which we want to approximate the function because that information is not inherent like it is in a one-dimensional function. The Taylor series expansion in $n$ dimensions along a direction $p$ can be written as

$$ f(x + \alpha p) = f(x) + \alpha \sum_{j=1}^{n} \frac{\partial f}{\partial x_j} p_j + \frac{\alpha^2}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} \frac{\partial^2 f}{\partial x_j \partial x_k} p_j p_k + \mathcal{O}\left( \alpha^3 \right) , $$

where $\alpha$ is a scalar that determines how far to go in the direction $p$.

In matrix form, we can write

$$ f(x + \alpha p) = f(x) + \alpha\, \nabla f(x)^{\mathsf T} p + \frac{\alpha^2}{2}\, p^{\mathsf T} H(x)\, p + \mathcal{O}\left( \alpha^3 \right) , $$

where $H(x)$ is the Hessian matrix.
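A small sketch of the matrix-form quadratic approximation (added for illustration; the two-variable test function, point, direction, and step are all assumed choices):

```python
import numpy as np

def f(x):
    # Assumed quadratic test function: f(x) = x0^2 + 2*x1^2 + x0*x1
    return x[0]**2 + 2 * x[1]**2 + x[0] * x[1]

def grad(x):
    return np.array([2 * x[0] + x[1], 4 * x[1] + x[0]])

H = np.array([[2.0, 1.0],
              [1.0, 4.0]])              # constant Hessian of this quadratic

x = np.array([1.0, -1.0])
p = np.array([1.0, 1.0]) / np.sqrt(2)   # unit direction
alpha = 0.3

quadratic = f(x) + alpha * grad(x) @ p + 0.5 * alpha**2 * p @ H @ p
print(quadratic, f(x + alpha * p))      # agree: the expansion is exact for a quadratic
```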
A.2 Chain Rule, Total Derivatives, and Differentials
The single-variable chain rule is needed for differentiating composite functions. Given a composite function, $f(g(x))$, the derivative with respect to the variable $x$ is

$$ \frac{\mathrm{d}}{\mathrm{d}x} f\bigl(g(x)\bigr) = \frac{\mathrm{d}f}{\mathrm{d}g} \frac{\mathrm{d}g}{\mathrm{d}x} . $$

If a function depends on more than one variable, then we need to distinguish between partial and total derivatives. For example, if $f(x, y(x))$, then $f$ is a function of two variables: $x$ and $y$. The application of the chain rule for this function is

$$ \frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial y} \frac{\mathrm{d}y}{\mathrm{d}x} , $$

where $\partial$ indicates a partial derivative, and $\mathrm{d}$ indicates a total derivative. When taking a partial derivative, we take the derivative with respect to only that variable, treating all other variables as constants. More generally, for a function $f(x_1, \dots, x_n)$ whose variables all depend on $t$,

$$ \frac{\mathrm{d}f}{\mathrm{d}t} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{\mathrm{d}x_i}{\mathrm{d}t} . $$
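As a quick symbolic check of the chain rule (an added sketch, not from the original text), assuming the example $f = x^2 y$ with $y = \sin x$:

```python
import sympy as sp

x = sp.symbols('x')
y = sp.sin(x)            # y depends on x (assumed example)
f = x**2 * y             # f(x, y(x))

# Total derivative by substituting y(x) and differentiating directly
total = sp.diff(f, x)

# Same result from the chain rule: df/dx = df/dx|_y + df/dy * dy/dx
X, Y = sp.symbols('X Y')
fXY = X**2 * Y
chain = (sp.diff(fXY, X) + sp.diff(fXY, Y) * sp.diff(y, x)).subs({X: x, Y: y})

print(sp.simplify(total - chain))   # 0: both approaches agree
```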
The differential of a function represents the linear change in that function with respect to changes in the independent variable. We introduce differentials here because they are helpful for finding total derivatives of multivariable equations that are implicit.

If a function $f(x)$ is differentiable, the differential is

$$ \mathrm{d}f = f'(x)\,\mathrm{d}x , $$

where $\mathrm{d}x$ is a nonzero real number (considered small) and $\mathrm{d}f$ is an approximation of the change in $f$ (due to the linear term in the Taylor series). We can solve for the derivative to get $f'(x) = \mathrm{d}f / \mathrm{d}x$. This states that the derivative of $f$ with respect to $x$ is the differential of $f$ divided by the differential of $x$. Strictly speaking, $\mathrm{d}f / \mathrm{d}x$ here is not the derivative, although it is written in the same way; the derivative $\mathrm{d}f / \mathrm{d}x$ is a symbol, not a fraction. However, for our purposes, we will use these representations interchangeably and treat differentials algebraically. We also write the differential of a multivariable function $f(x_1, \dots, x_n)$ as

$$ \mathrm{d}f = \frac{\partial f}{\partial x_1}\,\mathrm{d}x_1 + \frac{\partial f}{\partial x_2}\,\mathrm{d}x_2 + \dots + \frac{\partial f}{\partial x_n}\,\mathrm{d}x_n . $$

In Example A.5, there is no clear advantage in using differentials. However, differentials are more straightforward for finding total derivatives of multivariable implicit equations because there is no need to identify the independent variables. Given an equation, we just need to (1) find the differential of the equation and (2) solve for the derivative of interest. When we want quantities to remain constant, we can set the corresponding differential to zero. Differentials can be applied to vectors (say, a vector $u$ of size $n$), yielding a vector of differentials with the same size ($\mathrm{d}u$ of size $n$). We use this technique to derive the unified derivatives equation (UDE) in Section 6.9.
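To make the two-step procedure concrete, here is an added sketch with an assumed implicit equation, $x^2 + y^2 - 1 = 0$: taking the differential gives $2x\,\mathrm{d}x + 2y\,\mathrm{d}y = 0$, and solving for the derivative of interest yields $\mathrm{d}y/\mathrm{d}x = -x/y$. The SymPy check below mirrors that calculation.

```python
import sympy as sp

x, y = sp.symbols('x y')
F = x**2 + y**2 - 1          # implicit equation F(x, y) = 0 (assumed example)

# Differential: dF = F_x dx + F_y dy = 0  =>  dy/dx = -F_x / F_y
dydx = -sp.diff(F, x) / sp.diff(F, y)
print(sp.simplify(dydx))     # -x/y
```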
A.3 Matrix Multiplication

Figure A.3: Matrix product and resulting size.

Consider a matrix $A$ of size $(m \times n)$[1] and a matrix $B$ of size $(n \times p)$. The two matrices can be multiplied together as follows:

$$ C = A B , \qquad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} , $$

where $C$ is an $(m \times p)$ matrix. This multiplication is illustrated in Figure A.3. Two matrices can be multiplied only if their inner dimensions are equal ($n$ in this case). The remaining products discussed in this section are just special cases of matrix multiplication, but they are common enough that we discuss them separately.
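A minimal NumPy sketch of the shape rule (the sizes are chosen arbitrarily):

```python
import numpy as np

A = np.random.rand(3, 4)   # (m x n) = (3 x 4)
B = np.random.rand(4, 2)   # (n x p) = (4 x 2)
C = A @ B                  # inner dimensions (4) match
print(C.shape)             # (3, 2), i.e., (m x p)
```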
A.3.1 Vector-Vector Products
In this book, a vector $u$ is a column vector; thus, the row vector is represented as $u^{\mathsf T}$. The product of two vectors can be performed in two ways. The more common is called an inner product (also known as a dot product or scalar product). The inner product is a functional, meaning that it is an operator that acts on vectors and produces a scalar. This product is illustrated in Figure A.4. In the real vector space of $n$ dimensions, the inner product of two vectors, $u$ and $v$, whose dimensions are equal, is defined algebraically as

$$ u^{\mathsf T} v = \sum_{i=1}^{n} u_i v_i . $$

Figure A.4: Dot (or inner) product of two vectors.

The order of multiplication is irrelevant, and therefore,

$$ u^{\mathsf T} v = v^{\mathsf T} u . $$

In Euclidean space, where vectors have magnitude and direction, the inner product is defined as

$$ u^{\mathsf T} v = \|u\|_2 \, \|v\|_2 \cos\theta , $$

where $\|\cdot\|_2$ represents the 2-norm (Equation A.25), and $\theta$ is the angle between the two vectors.

Figure A.5: Outer product of two vectors.
The outer product multiplies each element of one vector by each element of the other to produce a matrix, as illustrated in Figure A.5. Unlike the inner product, the outer product does not require the vectors to be of the same length. The matrix form is as follows:

$$ u v^{\mathsf T} = \begin{bmatrix} u_1 v_1 & u_1 v_2 & \cdots & u_1 v_m \\ u_2 v_1 & u_2 v_2 & \cdots & u_2 v_m \\ \vdots & \vdots & \ddots & \vdots \\ u_n v_1 & u_n v_2 & \cdots & u_n v_m \end{bmatrix} . $$

The index form is as follows:

$$ \left( u v^{\mathsf T} \right)_{ij} = u_i v_j . $$

Outer products generate rank 1 matrices. They are used in quasi-Newton methods (see Section 4.4.4).
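The following added sketch (with arbitrary example vectors) contrasts the two products and confirms that the outer product has rank 1:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

inner = u @ v                        # scalar: 1*4 + 2*5 + 3*6 = 32
angle = np.arccos(inner / (np.linalg.norm(u) * np.linalg.norm(v)))

outer = np.outer(u, v)               # 3x3 matrix with entries u_i * v_j
print(inner, np.degrees(angle))
print(np.linalg.matrix_rank(outer))  # 1: outer products are rank-1 matrices
```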
A.3.2 Matrix-Vector Products
Consider multiplying a matrix $A$ of size $(m \times n)$ by a vector $v$ of size $n$. The result is a vector $u$ of size $m$:

$$ u = A v . $$
This multiplication is illustrated in Figure A.6.

Figure A.6: Matrix-vector product.

The entries in $u$ are dot products between the rows of $A$ and $v$:

$$ u_i = a_i^{\mathsf T} v = \sum_{j=1}^{n} A_{ij} v_j , $$

where $a_i^{\mathsf T}$ is the $i$th row of the matrix $A$. Thus, a matrix-vector product transforms a vector in $n$-dimensional space ($v \in \mathbb{R}^n$) to a vector in $m$-dimensional space ($u \in \mathbb{R}^m$).
A matrix-vector product can also be thought of as a linear combination of the columns of $A$, where the entries of $v$ are the weights:

$$ A v = v_1 c_1 + v_2 c_2 + \dots + v_n c_n , $$

where $c_1, c_2, \dots, c_n$ are the columns of $A$.

We can also multiply by a vector on the left, instead of on the right:

$$ w^{\mathsf T} = u^{\mathsf T} A . $$

In this case, a row vector (of size $m$) is multiplied with a matrix, producing a row vector (of size $n$).
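A short added sketch (arbitrary matrix and vector) showing that the row-wise dot-product view and the column-combination view give the same result:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])       # (2 x 3)
v = np.array([1.0, 0.0, -1.0])        # size 3

u = A @ v                              # size 2

# Row view: each entry of u is the dot product of a row of A with v
rows = np.array([A[i] @ v for i in range(A.shape[0])])

# Column view: linear combination of the columns of A weighted by v
cols = sum(v[j] * A[:, j] for j in range(A.shape[1]))

print(u, rows, cols)                   # all three are equal
```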
A.3.3 Quadratic Form (Vector-Matrix-Vector Product)
Another common product is a quadratic form. A quadratic form consists of a row vector, $x^{\mathsf T}$, times a matrix, $A$, times a column vector, $x$, producing a scalar:

$$ x^{\mathsf T} A x . $$

The index form is as follows:

$$ x^{\mathsf T} A x = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j . $$

In general, a vector-matrix-vector product can have a nonsquare matrix, and the vectors would be two different sizes, but for a quadratic form, the two vectors are identical, and thus $A$ is square. Also, in a quadratic form, we assume that $A$ is symmetric (even if it is not, only the symmetric part of $A$ contributes, so effectively, it acts like a symmetric matrix).
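A brief added check (with an assumed nonsymmetric matrix) that only the symmetric part of $A$ contributes to the quadratic form:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [3.0, 4.0]])            # not symmetric
x = np.array([1.0, -2.0])

A_sym = 0.5 * (A + A.T)               # symmetric part of A

print(x @ A @ x, x @ A_sym @ x)       # identical: the skew part cancels out
```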
A.4 Four Fundamental Subspaces in Linear Algebra
This section reviews how the dimensions of a matrix in a linear system relate to the associated $n$- and $m$-dimensional spaces.[2] These concepts are especially helpful for understanding constrained optimization (Chapter 5) and build on the review of $n$-dimensional space in Section 5.2.
A vector space is the set of all points that can be obtained by linear combinations of a given set of vectors. The vectors are said to span the vector space. A basis is a set of linearly independent vectors that generates all points in a vector space. A subspace is a space of lower dimension than the space that contains it (e.g., a line is a subspace of a plane).
Two vectors are orthogonal if the angle between them is 90 degrees or, equivalently, if their dot product is zero. A subspace $\mathcal{U}$ is orthogonal to another subspace $\mathcal{V}$ if every vector in $\mathcal{U}$ is orthogonal to every vector in $\mathcal{V}$.
Consider an $(m \times n)$ matrix $A$. The rank ($r$) of a matrix is the maximum number of linearly independent row vectors of $A$ or, equivalently, the maximum number of linearly independent column vectors. The rank can also be defined as the dimensionality of the vector space spanned by the rows or columns of $A$. For an $(m \times n)$ matrix, $r \leq \min(m, n)$.
Through a matrix-vector multiplication, $A x = b$, this matrix maps an $n$-vector $x$ into an $m$-vector $b$.
Figure A.7 shows this mapping and illustrates the four fundamental subspaces that we now explain.
The column space of a matrix is the vector space spanned by the vectors in the columns of $A$. The dimensionality of this space is given by the rank $r$, where $r \leq m$, so the column space is a subspace of $m$-dimensional space. The row space of a matrix is the vector space spanned by the vectors in the rows of $A$ (or, equivalently, it is the column space of $A^{\mathsf T}$). The dimensionality of this space is also given by $r$, where $r \leq n$, so the row space is a subspace of $n$-dimensional space.

Figure A.7: The four fundamental subspaces of linear algebra. An $(m \times n)$ matrix $A$ maps vectors from $n$-space to $m$-space. When the vector is in the row space of the matrix, it maps to the column space of $A$ ($A x_r = b$). When the vector is in the nullspace of $A$, it maps to zero ($A x_n = 0$). Combining the row space and nullspace of $A$, we can obtain any vector in $n$-dimensional space ($x = x_r + x_n$), which maps to the column space of $A$ ($A x = b$).

The nullspace of a matrix $A$ is the vector space consisting of all the vectors that are orthogonal to the rows of $A$. Equivalently, the nullspace of $A$ is the vector space of all vectors $x$ such that $A x = 0$. Therefore, the nullspace is orthogonal to the row space of $A$. The dimension of the nullspace of $A$ is $n - r$.

Combining the nullspace and row space of $A$ adds up to the whole $n$-dimensional space; that is, any vector in $n$-dimensional space can be written as $x = x_r + x_n$, where $x_r$ is in the row space of $A$ and $x_n$ is in the nullspace of $A$.

The left nullspace of a matrix $A$ is the vector space of all vectors $y$ such that $A^{\mathsf T} y = 0$. Therefore, the left nullspace is orthogonal to the column space of $A$. The dimension of the left nullspace of $A$ is $m - r$. Combining the left nullspace and column space of $A$ adds up to the whole $m$-dimensional space.
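As an added illustration (the rank-deficient matrix below is an assumed example), the following sketch computes the rank and a nullspace basis from the SVD and confirms the dimension counts:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])            # (m x n) = (2 x 3), rank 1

m, n = A.shape
r = np.linalg.matrix_rank(A)

# Nullspace basis from the SVD: right singular vectors beyond the rank
_, s, Vt = np.linalg.svd(A)
null_basis = Vt[r:].T                       # (n x (n - r))

print(r, n - r, m - r)                      # rank, dim(nullspace), dim(left nullspace)
print(np.allclose(A @ null_basis, 0.0))     # True: A maps the nullspace to zero
```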
A.5 Vector and Matrix Norms
Norms give an idea of the magnitude of the entries in vectors and matrices. They are a generalization of the absolute value for real numbers. A norm is a real-valued function with the following properties:
- $\|x\| \geq 0$ for all $x$.
- $\|x\| = 0$ if and only if $x = 0$.
- $\|\alpha x\| = |\alpha| \, \|x\|$ for all real numbers $\alpha$.
- $\|x + y\| \leq \|x\| + \|y\|$ for all $x$ and $y$ (the triangle inequality).

Most common matrix norms also have the property that $\|A B\| \leq \|A\| \, \|B\|$, although this is not required in general.
We start by defining vector norms, where the vector is $x = \left[ x_1, x_2, \dots, x_n \right]^{\mathsf T}$. The most familiar norm for vectors is the 2-norm, also known as the Euclidean norm, which corresponds to the Euclidean length of the vector:

$$ \|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2} . \tag{A.25} $$

Because this norm is used so often, we often omit the subscript and just write $\|x\|$. In this book, we sometimes use the square of the 2-norm, which can be written as the dot product

$$ \|x\|_2^2 = x^{\mathsf T} x . $$

More generally, we can refer to a class of norms called $p$-norms:

$$ \|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p} , $$

where $p \geq 1$. Of all the $p$-norms, three are most commonly used: the 2-norm (Equation A.25), the 1-norm, and the $\infty$-norm. From the previous definition, we see that the 1-norm is the sum of the absolute values of all the entries in $x$:

$$ \|x\|_1 = \sum_{i=1}^{n} |x_i| . $$

The application of $p \to \infty$ in the $p$-norm definition is perhaps less obvious, but as $p$ increases, the largest term in that sum dominates all of the others. Raising that quantity to the power of $1/p$ causes the exponents to cancel, leaving only the largest-magnitude component of $x$. Thus, the infinity norm is

$$ \|x\|_\infty = \max_i |x_i| . $$
The infinity norm is commonly used in optimization convergence criteria.
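A quick added sketch of the three vector norms with NumPy (arbitrary example vector):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

print(np.linalg.norm(x, 1))        # 1-norm: 8.0
print(np.linalg.norm(x))           # 2-norm: sqrt(26)
print(np.linalg.norm(x, np.inf))   # infinity norm: 4.0
```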

Figure A.8: Norms for the two-dimensional case.

The vector norms are visualized in Figure A.8 for the two-dimensional case. If $x = [x_1, x_2]^{\mathsf T}$, then

$$ \|x\|_1 = |x_1| + |x_2|, \qquad \|x\|_2 = \sqrt{x_1^2 + x_2^2}, \qquad \|x\|_\infty = \max\left( |x_1|, |x_2| \right) . $$

It is also possible to assign different weights to each vector component to form a weighted norm:

$$ \|x\|_{w} = \left( \sum_{i=1}^{n} w_i |x_i|^p \right)^{1/p} . $$
Several norms for matrices exist. There are matrix norms corresponding to the vector norms that we defined previously. Namely,

$$ \|A\|_1 = \max_{j} \sum_{i=1}^{m} |A_{ij}|, \qquad \|A\|_\infty = \max_{i} \sum_{j=1}^{n} |A_{ij}|, \qquad \|A\|_2 = \sqrt{\lambda_{\max}\left( A^{\mathsf T} A \right)} , $$

where $\lambda_{\max}\left( A^{\mathsf T} A \right)$ is the largest eigenvalue of $A^{\mathsf T} A$. When $A$ is a square symmetric matrix, then

$$ \|A\|_2 = \max_i |\lambda_i(A)| . $$
Another matrix norm that is useful but not related to any vector norm is the Frobenius norm, which is defined as the square root of the sum of the absolute squares of its elements; that is,

$$ \|A\|_F = \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{n} |A_{ij}|^2 } . $$

The Frobenius norm can be weighted by a matrix $W$ as follows:

$$ \|A\|_W = \left\| W^{1/2} A\, W^{1/2} \right\|_F . $$

This norm is used in the formal derivation of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update formula.
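A short added check (arbitrary matrix) that the matrix 2-norm equals $\sqrt{\lambda_{\max}(A^{\mathsf T} A)}$ and that the Frobenius norm is the square root of the sum of squared entries:

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

two_norm = np.linalg.norm(A, 2)                 # largest singular value
eig_check = np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A)))
fro = np.linalg.norm(A, 'fro')

print(two_norm, eig_check)                      # equal
print(fro, np.sqrt(np.sum(A**2)))               # equal
```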
A.6 Matrix Types
There are several common types of matrices that appear regularly throughout this book. We review some terminology here.
A diagonal matrix is a matrix where all off-diagonal terms are zero. In other words, $A$ is diagonal if

$$ A_{ij} = 0 \quad \text{for all} \quad i \neq j . $$
The identity matrix is a special diagonal matrix where all diagonal components are 1.
The transpose of a matrix is defined as follows:

$$ \left( A^{\mathsf T} \right)_{ij} = A_{ji} . $$

Note that

$$ (A + B)^{\mathsf T} = A^{\mathsf T} + B^{\mathsf T}, \qquad (A B)^{\mathsf T} = B^{\mathsf T} A^{\mathsf T} . $$

A symmetric matrix is one where the matrix is equal to its transpose:

$$ A^{\mathsf T} = A . $$

The inverse of a matrix, $A^{-1}$, satisfies

$$ A A^{-1} = A^{-1} A = I . $$

Not all matrices are invertible. Some common properties for inverses are as follows:

$$ \left( A^{-1} \right)^{-1} = A, \qquad (A B)^{-1} = B^{-1} A^{-1}, \qquad \left( A^{\mathsf T} \right)^{-1} = \left( A^{-1} \right)^{\mathsf T} . $$
A symmetric matrix $A$ is positive definite if and only if

$$ x^{\mathsf T} A x > 0 \tag{A.42} $$

for all nonzero vectors $x$. One property of positive-definite matrices is that their inverse is also positive definite.

The positive-definite condition (Equation A.42) can be challenging to verify directly. However, we can use equivalent or related conditions that are more practical.

For example, by choosing appropriate vectors $x$ (such as the unit vectors $e_i$), we can derive the necessary conditions for positive definiteness:

$$ A_{ii} > 0 \quad \text{for all } i . $$
These are necessary but not sufficient conditions. Thus, if any diagonal element is less than or equal to zero, we know that the matrix is not positive definite.
An equivalent condition to Equation A.42 is that all the eigenvalues of $A$ are positive. Unlike the diagonal check, this condition is sufficient as well as necessary.

Figure A.9: For $A$ to be positive definite, the determinants of the leading principal submatrices must be greater than zero.

Another practical condition equivalent to Equation A.42 is that all the leading principal minors of $A$ are positive. A leading principal minor is the determinant of a leading principal submatrix. A leading principal submatrix of order $k$ of an $(n \times n)$ matrix $A$ is obtained by removing the last $n - k$ rows and columns of $A$, as shown in Figure A.9. Thus, to verify whether $A$ is positive definite, we start with $k = 1$ and check that $A_{11} > 0$ (only one element), then check that the determinant of the $(2 \times 2)$ leading principal submatrix is positive, and so on, until we reach the determinant of $A$ itself. If any of the determinants in this sequence is not positive, we can stop the process and conclude that $A$ is not positive definite.
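The following added sketch (assumed symmetric test matrix) applies both the eigenvalue test and the leading-principal-minor test:

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])      # symmetric test matrix (assumed example)

# Check 1: all eigenvalues positive
eig_pd = np.all(np.linalg.eigvalsh(A) > 0)

# Check 2: all leading principal minors positive
minors_pd = all(np.linalg.det(A[:k, :k]) > 0 for k in range(1, A.shape[0] + 1))

print(eig_pd, minors_pd)             # both True: A is positive definite
```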
A positive-semidefinite matrix satisfies

$$ x^{\mathsf T} A x \geq 0 $$

for all nonzero vectors $x$. In this case, the eigenvalues are nonnegative, and there is at least one that is zero. A negative-definite matrix satisfies

$$ x^{\mathsf T} A x < 0 $$

for all nonzero vectors $x$. In this case, all the eigenvalues are negative. An indefinite matrix is one that is neither positive definite nor negative definite. Then, there are at least two nonzero vectors $u$ and $v$ such that

$$ u^{\mathsf T} A u > 0 \quad \text{and} \quad v^{\mathsf T} A v < 0 . $$
A.7 Matrix Derivatives
Let us consider the derivatives of a few common cases: linear and quadratic functions. Combining the concept of partial derivatives and matrix forms of equations allows us to find the gradients of matrix functions. First, let us consider a linear function, $f(x)$, defined as

$$ f(x) = a^{\mathsf T} x = \sum_{i=1}^{n} a_i x_i , $$

where $a$ and $x$ are vectors of length $n$, and $a_i$ and $x_i$ are the $i$th elements of $a$ and $x$, respectively. If we take the partial derivative of this function with respect to an arbitrary element of $x$, namely, $x_j$, we get

$$ \frac{\partial f}{\partial x_j} = a_j . $$

Thus,

$$ \nabla_x \left( a^{\mathsf T} x \right) = a . $$
Recall the quadratic form presented in Section A.3.3; we can combine that with a linear term to form a general quadratic function:

$$ f(x) = x^{\mathsf T} A x + b^{\mathsf T} x , $$

where $x$ and $b$ are still vectors of length $n$, and $A$ is an $n$-by-$n$ symmetric matrix. In index notation, $f$ is as follows:

$$ f(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j + \sum_{i=1}^{n} b_i x_i . $$

For convenience, we separate the diagonal terms from the off-diagonal terms, leaving us with

$$ f(x) = \sum_{i=1}^{n} A_{ii} x_i^2 + \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} A_{ij} x_i x_j + \sum_{i=1}^{n} b_i x_i . $$

Now we take the partial derivative with respect to $x_k$ as before, yielding

$$ \frac{\partial f}{\partial x_k} = 2 A_{kk} x_k + \sum_{\substack{j=1 \\ j \neq k}}^{n} A_{kj} x_j + \sum_{\substack{i=1 \\ i \neq k}}^{n} A_{ik} x_i + b_k . $$

We now move the diagonal terms back into the sums to get

$$ \frac{\partial f}{\partial x_k} = \sum_{j=1}^{n} A_{kj} x_j + \sum_{i=1}^{n} A_{ik} x_i + b_k , $$

which we can put back into matrix form as follows:

$$ \nabla f(x) = A x + A^{\mathsf T} x + b . $$

If $A$ is symmetric, then $A^{\mathsf T} = A$, and thus

$$ \nabla f(x) = 2 A x + b . $$
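As an added numerical check (assumed symmetric $A$ and arbitrary $b$ and $x$), the analytic gradient $2Ax + b$ can be compared against central finite differences:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # symmetric (assumed example)
b = np.array([1.0, -1.0])
x = np.array([0.5, 2.0])

f = lambda x: x @ A @ x + b @ x
grad_analytic = 2 * A @ x + b

# Central finite-difference approximation of the gradient
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(2)])

print(grad_analytic, grad_fd)        # agree to roughly 1e-9
```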
A.8 Eigenvalues and Eigenvectors
Given an $(n \times n)$ matrix $A$, if there is a scalar $\lambda$ and a nonzero vector $v$ that satisfy

$$ A v = \lambda v , \tag{A.57} $$

then $\lambda$ is an eigenvalue of the matrix $A$, and $v$ is an eigenvector.

The left-hand side of Equation A.57 is a matrix-vector product that represents a linear transformation applied to $v$. The right-hand side of Equation A.57 is a scalar-vector product that represents a vector aligned with $v$. Therefore, the eigenvalue problem (Equation A.57) answers the question: Which vectors, when transformed by $A$, remain in the same direction, and how much do their corresponding lengths change in that transformation?

The solutions of the eigenvalue problem (Equation A.57) are given by the solutions of the scalar equation

$$ \det(A - \lambda I) = 0 . $$

This equation yields a polynomial of degree $n$ called the characteristic equation, whose roots are the eigenvalues of $A$.

If $A$ is symmetric, it has $n$ real eigenvalues and $n$ linearly independent eigenvectors corresponding to those eigenvalues. It is possible to choose the eigenvectors to be orthogonal to each other (i.e., $v_i^{\mathsf T} v_j = 0$ for $i \neq j$) and to normalize them (so that $\|v_i\| = 1$).
We use the eigenvalue problem in Section 4.1.2, where the eigenvectors are the directions of principal curvature, and the eigenvalues quantify the curvature. Eigenvalues are also helpful in determining if a matrix is positive definite.
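A small added sketch (assumed symmetric matrix) that verifies $A v = \lambda v$ and the orthonormality of the eigenvectors using NumPy's symmetric eigensolver:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                          # symmetric (assumed example)

eigvals, eigvecs = np.linalg.eigh(A)                 # eigh is intended for symmetric matrices

# Verify A v = lambda v for each eigenpair
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))               # True

# Eigenvectors of a symmetric matrix can be chosen orthonormal
print(np.allclose(eigvecs.T @ eigvecs, np.eye(2)))   # True
```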
A.9 Random Variables
Imagine measuring the axial strength of a rod by performing a tensile test with many rods, each designed to be identical. Even with “identical” rods, every time you perform the test, you get a different result (hopefully with relatively small differences). This variation has many potential sources, including variation in the manufactured size and shape, in the composition of the material, and in the contact between the rod and testing fixture. In this example, we would call the axial strength a random variable, and the result from one test would be a random sample. The random variable, axial strength, is a function of several other random variables, such as bar length, bar diameter, and material Young’s modulus.
One measurement does not tell us anything about how variable the axial strength is, but if we perform the test many times, we can learn a lot about its distribution. From this information, we can infer various statistical quantities, such as the mean value of the axial strength. The mean of some variable $x$ that is measured $n$ times is estimated as follows:

$$ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i . $$
This is actually a sample mean, which would differ from the population mean (the true mean if you could measure every bar). With enough samples, the sample mean approaches the population mean. In this brief review, we do not distinguish between sample and population statistics.
Another important quantity is the variance (whose square root is the standard deviation), which is a measure of spread, or how far away our samples are from the mean. The unbiased[3] estimate of the variance is

$$ \sigma^2 = \frac{1}{n - 1} \sum_{i=1}^{n} \left( x_i - \mu \right)^2 , $$

and the standard deviation is just the square root of the variance. A small variance implies that measurements are clustered tightly around the mean, whereas a large variance means that measurements are spread out far from the mean. The variance can also be written in the following mathematically equivalent but more computationally friendly format:

$$ \sigma^2 = \frac{1}{n - 1} \left( \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^{2} \right) . $$
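A brief added sketch using synthetic data (the normal samples below are an assumption, standing in for repeated strength tests) to compute the sample mean and the unbiased variance in both forms:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=100.0, scale=5.0, size=1000)   # assumed synthetic "strength" data

mean = samples.mean()
var_unbiased = samples.var(ddof=1)       # ddof=1 gives the unbiased (n - 1) estimate

# Computationally friendly form of the variance
n = samples.size
var_check = (np.sum(samples**2) - np.sum(samples)**2 / n) / (n - 1)

print(mean, var_unbiased, var_check)
```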
More generally, we might want to know the probability of getting a bar with a specific axial strength. In our testing, we could tabulate the frequency of each measurement in a histogram. With enough tests, the histogram would define a smooth curve, as shown in Figure A.10. This curve is called the probability density function (PDF), $p(x)$, and it tells us the relative probability of a certain value occurring.

More specifically, a PDF gives the probability of getting a value within a certain range:

$$ P(a \leq x \leq b) = \int_a^b p(x)\,\mathrm{d}x . $$

The total integral of the PDF must be 1 because it contains all possible outcomes (100 percent):

$$ \int_{-\infty}^{\infty} p(x)\,\mathrm{d}x = 1 . $$

From the PDF, we can also compute various statistics, such as the mean value:

$$ \mu = \int_{-\infty}^{\infty} x\, p(x)\,\mathrm{d}x . $$

This quantity is also referred to as the expected value of $x$, $E[x]$. The expected value of a function of a random variable, $f(x)$, is given by[4]

$$ E[f(x)] = \int_{-\infty}^{\infty} f(x)\, p(x)\,\mathrm{d}x . $$
We can also compute the variance, which is the expected value of the squared difference from the mean:

$$ \sigma^2 = E\left[ (x - \mu)^2 \right] = \int_{-\infty}^{\infty} (x - \mu)^2\, p(x)\,\mathrm{d}x , $$

or in a mathematically equivalent format:

$$ \sigma^2 = E\left[ x^2 \right] - \mu^2 . $$
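For a distribution given by its PDF, the mean and variance follow from these integrals. The added sketch below (assuming an exponential PDF with rate 2) evaluates them by numerical quadrature:

```python
import numpy as np
from scipy import integrate

# Assumed example PDF: exponential with rate 2, p(x) = 2*exp(-2x) for x >= 0
p = lambda x: 2.0 * np.exp(-2.0 * x)

total, _ = integrate.quad(p, 0, np.inf)                    # integrates to 1
mean, _ = integrate.quad(lambda x: x * p(x), 0, np.inf)    # E[x] = 0.5
ex2, _ = integrate.quad(lambda x: x**2 * p(x), 0, np.inf)  # E[x^2] = 0.5
var = ex2 - mean**2                                        # sigma^2 = 0.25

print(total, mean, var)
```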
The mean and variance are the first and second moments of the distribution. In general, a distribution may require an infinite number of moments to describe it fully. Higher-order moments are generally mean centered and are normalized by the standard deviation so that the $k$th normalized moment is computed as follows:

$$ \frac{E\left[ (x - \mu)^k \right]}{\sigma^k} . $$
The third moment is called skewness, and the fourth is called kurtosis, although these higher-order moments are less commonly used.
The cumulative distribution function (CDF) is related to the PDF: it is the cumulative integral of the PDF, defined as follows:

$$ P(x) = \int_{-\infty}^{x} p(t)\,\mathrm{d}t . $$

The capital $P$ denotes the CDF, and the lowercase $p$ denotes the PDF. As an example, the PDF and corresponding CDF for the axial strength are shown in Figure A.10. The CDF always approaches 1 as $x \to \infty$.


Figure A.10: Comparison between PDF and CDF for a simple example.

Figure A.11: Two normal distributions. Changing the mean causes a shift along the $x$-axis. Increasing the standard deviation causes the PDF to spread out.
We often fit a named distribution to the PDF of empirical data. One of the most popular distributions is the normal distribution, also known as the Gaussian distribution. Its PDF is as follows:

$$ p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) . $$

For a normal distribution, the mean $\mu$ and variance $\sigma^2$ appear explicitly in the function, but these quantities are defined for any distribution. Figure A.11 shows two normal distributions with different means and standard deviations to illustrate the effect of those parameters.
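A short added sketch using SciPy's normal distribution (the mean and standard deviation are arbitrary assumptions) to evaluate the PDF and CDF:

```python
import numpy as np
from scipy import stats

mu, sigma = 100.0, 5.0                   # assumed parameters
dist = stats.norm(loc=mu, scale=sigma)

x = 105.0
print(dist.pdf(x))                                    # probability density at x
print(dist.cdf(x))                                    # P(X <= x); approaches 1 as x grows
print(dist.cdf(mu + sigma) - dist.cdf(mu - sigma))    # ~0.683 within one standard deviation
```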
Several other popular distributions are shown in Figure A.12: uniform distribution, Weibull distribution, lognormal distribution, and exponential distribution. These are only a few of many other possible probability distributions.




Figure A.12: Popular probability distributions besides the normal distribution.
An extension of variance is the covariance, which measures the variability between two random variables:

$$ \operatorname{cov}(x, y) = E\left[ (x - \mu_x)(y - \mu_y) \right] . $$

From this definition, we see that the variance is related to covariance by the following:

$$ \operatorname{cov}(x, x) = \sigma_x^2 . $$

Covariance is often expressed as a matrix, in which case the variance of each variable appears on the diagonal. The correlation is the covariance divided by the standard deviations:

$$ \rho(x, y) = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} . $$
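A small added sketch with synthetic correlated data (an assumption) showing the covariance matrix, with variances on the diagonal, and the corresponding correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.8 * x + 0.2 * rng.normal(size=500)   # assumed correlated data

C = np.cov(x, y)            # 2x2 covariance matrix; variances on the diagonal
R = np.corrcoef(x, y)       # correlation: covariance scaled by the standard deviations

print(C)
print(R[0, 1], C[0, 1] / np.sqrt(C[0, 0] * C[1, 1]))   # equal
```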
[1] In this notation, $m$ is the number of rows and $n$ is the number of columns.
[2] Strang (2006) provides comprehensive coverage of linear algebra and is credited with popularizing the concept of the “four fundamental subspaces”.
[3] Unbiased means that the expected value of the sample variance is the same as the true population variance. If $n$ were used in the denominator instead of $n - 1$, the two quantities would differ by a constant factor.
[4] This is not a definition; rather, it follows from the definition of the expected value through a somewhat lengthy derivation.
- Strang, G. (2006). Linear Algebra and Its Applications (4th ed.). Cengage Learning.