
Engineering Design Optimization. Cambridge University Press, Jan 2022

Appendix: Mathematics Background

This appendix briefly reviews various mathematical concepts used throughout the book.

A.1 Taylor Series Expansion

Series expansions are representations of a given function in terms of a series of other (usually simpler) functions. One common series expansion is the Taylor series, which is expressed as a polynomial whose coefficients are based on the derivatives of the original function at a fixed point.

The Taylor series is a general tool that can be applied whenever the function has derivatives. We can use this series to estimate the value of the function near the given point, which is useful when the function is difficult to evaluate directly. The Taylor series is used to derive algorithms for finding the zeros of functions and algorithms for minimizing functions in Chapter 4 and Chapter 5.

To derive the Taylor series, we start with an infinite polynomial series about an arbitrary point, $x$, to approximate the value of a function at $x + \Delta x$ using

$$f(x + \Delta x) = a_0 + a_1 \Delta x + a_2 \Delta x^2 + \ldots + a_k \Delta x^k + \ldots \, .$$

We can make this approximation exact at $\Delta x = 0$ by setting the first coefficient to $f(x)$. To find the appropriate value for $a_1$, we take the first derivative to get

$$f^\prime(x + \Delta x) = a_1 + 2 a_2 \Delta x + \ldots + k a_k \Delta x^{k-1} + \ldots \, ,$$

which means that we need $a_1 = f^\prime(x)$ to obtain an exact derivative at $x$. To derive the other coefficients, we systematically take the derivative of both sides and evaluate the result at $\Delta x = 0$, which isolates the first nonzero term (which is always constant). Identifying the pattern yields the general formula for the $k$th-order coefficient:

$$a_k = \frac{f^{(k)}(x)}{k!} \, .$$

Substituting this into the polynomial in Equation A.1 yields the Taylor series

$$f(x + \Delta x) = \sum_{k=0}^\infty \frac{\Delta x^k}{k!} f^{(k)}(x) \, .$$

The series is typically truncated to use terms up to order $m$,

$$f(x + \Delta x) = \sum_{k=0}^m \frac{\Delta x^k}{k!} f^{(k)}(x) + \mathcal{O}\left(\Delta x^{m+1}\right) \, ,$$

which yields an approximation with a truncation error of order $\mathcal{O}(\Delta x^{m+1})$. In optimization, it is common to use the first three terms (up to $m = 2$) to get a quadratic approximation.
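
As a concrete illustration (added here, not from the original text), the following Python sketch compares the quadratic Taylor approximation of $\sin(x)$ about an arbitrary point with the exact value; the function and step sizes are chosen only for demonstration.

```python
import numpy as np

# Quadratic Taylor approximation (m = 2) of f(x) = sin(x) about x = 1.
# The derivatives are f'(x) = cos(x) and f''(x) = -sin(x).
x = 1.0
f, df, d2f = np.sin(x), np.cos(x), -np.sin(x)

for dx in [0.5, 0.1, 0.01]:
    approx = f + df * dx + 0.5 * d2f * dx**2  # first three Taylor terms
    exact = np.sin(x + dx)
    print(f"dx={dx:5.2f}  approx={approx:.8f}  exact={exact:.8f}  "
          f"error={abs(approx - exact):.2e}")
```

The error shrinks roughly as $\Delta x^3$, consistent with the $\mathcal{O}(\Delta x^{m+1})$ truncation error for $m = 2$.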

The Taylor series in multiple dimensions is similar to the single-variable case but more complicated. The first derivatives of the function become a gradient vector, and the second derivatives become a Hessian matrix. Also, we need to define a direction along which we want to approximate the function because that information is not inherent as it is in a one-dimensional function. The Taylor series expansion in $n$ dimensions along a direction $p$ can be written as

$$f(x + \alpha p) = f(x) + \alpha \sum_{k=1}^n p_k \frac{\partial f}{\partial x_k} + \frac{1}{2} \alpha^2 \sum_{k=1}^n \sum_{l=1}^n p_k p_l \frac{\partial^2 f}{\partial x_k \partial x_l} + \mathcal{O}\left(\alpha^3\right) \, ,$$

where $\alpha$ is a scalar that determines how far to go in the direction $p$.

In matrix form, we can write

$$f(x + \alpha p) = f(x) + \alpha \nabla f(x)^\intercal p + \frac{1}{2} \alpha^2 p^\intercal H(x) p + \mathcal{O}\left(\alpha^3\right) \, ,$$

where $H$ is the Hessian matrix.
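
The sketch below (again, an added illustration) evaluates this second-order directional approximation for a simple two-variable function whose gradient and Hessian are written out by hand; the function, point, and direction are arbitrary choices.

```python
import numpy as np

# Second-order Taylor approximation of f(x) = x1^2 * x2 + exp(x2)
# along a direction p: f(x + a p) ~ f(x) + a g(x).p + 0.5 a^2 p.H(x).p
def f(x):
    return x[0]**2 * x[1] + np.exp(x[1])

def grad(x):  # hand-derived gradient
    return np.array([2 * x[0] * x[1], x[0]**2 + np.exp(x[1])])

def hess(x):  # hand-derived Hessian
    return np.array([[2 * x[1], 2 * x[0]],
                     [2 * x[0], np.exp(x[1])]])

x = np.array([1.0, 0.5])
p = np.array([0.6, -0.8])   # arbitrary direction

for alpha in [0.5, 0.1, 0.01]:
    taylor = f(x) + alpha * grad(x) @ p + 0.5 * alpha**2 * p @ hess(x) @ p
    exact = f(x + alpha * p)
    print(f"alpha={alpha:5.2f}  error={abs(taylor - exact):.2e}")
```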

A.2 Chain Rule, Total Derivatives, and Differentials

The single-variable chain rule is needed for differentiating composite functions. Given a composite function, $f(g(x))$, the derivative with respect to the variable $x$ is

$$\frac{\mathrm{d}}{\mathrm{d} x}\left(f(g(x))\right) = \frac{\mathrm{d} f}{\mathrm{d} g}\frac{\mathrm{d} g}{\mathrm{d} x} \, .$$

If a function depends on more than one variable, then we need to distinguish between partial and total derivatives. For example, if $f(g(x), h(x))$, then $f$ is a function of two variables: $g$ and $h$. The application of the chain rule for this function is

$$\frac{\mathrm{d}}{\mathrm{d} x}\left(f(g(x), h(x))\right) = \frac{\partial f}{\partial g}\frac{\mathrm{d} g}{\mathrm{d} x} + \frac{\partial f}{\partial h}\frac{\mathrm{d} h}{\mathrm{d} x} \, ,$$

where $\partial / \partial x$ indicates a partial derivative, and $\mathrm{d} / \mathrm{d} x$ is a total derivative. When taking a partial derivative, we take the derivative with respect to only that variable, treating all other variables as constants. More generally,

$$\frac{\mathrm{d}}{\mathrm{d} x}\left(f(g_1(x), \ldots, g_n(x))\right) = \sum_{i=1}^n \left(\frac{\partial f}{\partial g_i}\frac{\mathrm{d} g_i}{\mathrm{d} x}\right) \, .$$
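
As an illustration (not part of the original text), the following Python sketch applies this multivariable chain rule to arbitrary example functions and checks the analytic total derivative against a central finite-difference estimate.

```python
import numpy as np

# Total derivative of f(g(x), h(x)) by the chain rule, checked against a
# central finite-difference estimate. g, h, and f are illustrative choices.
g = lambda x: np.sin(x)
h = lambda x: x**2
f = lambda g_, h_: g_ * h_ + np.exp(h_)

def total_derivative(x):
    dfdg = h(x)                  # partial f / partial g
    dfdh = g(x) + np.exp(h(x))   # partial f / partial h
    dgdx = np.cos(x)             # dg/dx
    dhdx = 2 * x                 # dh/dx
    return dfdg * dgdx + dfdh * dhdx

x, dx = 0.7, 1e-6
fd = (f(g(x + dx), h(x + dx)) - f(g(x - dx), h(x - dx))) / (2 * dx)
print(total_derivative(x), fd)   # the two values should agree closely
```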

The differential of a function represents the linear change in that function with respect to changes in the independent variable. We introduce them here because they are helpful for finding total derivatives of multivariable equations that are implicit.

If a function $y = f(x)$ is differentiable, the differential $\mathrm{d} y$ is

$$\mathrm{d} y = f^\prime(x) \, \mathrm{d} x \, ,$$

where $\mathrm{d} x$ is a nonzero real number (considered small) and $\mathrm{d} y$ is an approximation of the change (due to the linear term in the Taylor series). We can solve for $f^\prime(x)$ to get $f^\prime(x) = \mathrm{d} y / \mathrm{d} x$. This states that the derivative of $f$ with respect to $x$ is the differential of $y$ divided by the differential of $x$. Strictly speaking, $\mathrm{d} y / \mathrm{d} x$ here is not the derivative, although it is written in the same way. The derivative is a symbol, not a fraction. However, for our purposes, we will use these representations interchangeably and treat differentials algebraically. We also write the differentials of functions as

$$\mathrm{d} f = f^\prime(x) \, \mathrm{d} x \, .$$

In Example 5, there is no clear advantage in using differentials. However, differentials are more straightforward for finding total derivatives of multivariable implicit equations because there is no need to identify the independent variables. Given an equation, we just need to (1) find the differential of the equation and (2) solve for the derivative of interest. When we want quantities to remain constant, we can set the corresponding differential to zero. Differentials can be applied to vectors (say a vector $x$ of size $n$), yielding a vector of differentials with the same size ($\mathrm{d} x$ of size $n$). We use this technique to derive the unified derivatives equation (UDE) in Section 6.9.

A.3 Matrix Multiplication

Figure A.3: Matrix product and resulting size.

Consider a matrix $A$ of size $(m \times n)$[1] and a matrix $B$ of size $(n \times p)$. The two matrices can be multiplied together ($C = AB$) as follows:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} \, ,$$

where $C$ is an $(m \times p)$ matrix. This multiplication is illustrated in Figure A.3. Two matrices can be multiplied only if their inner dimensions are equal ($n$ in this case). The remaining products discussed in this section are just special cases of matrix multiplication, but they are common enough that we discuss them separately.
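
The short Python sketch below (an illustration added here, not from the text) implements the index formula directly and checks it against NumPy's built-in matrix product; the matrix sizes are arbitrary.

```python
import numpy as np

# Matrix product C = A B from the index formula C_ij = sum_k A_ik B_kj,
# checked against NumPy's built-in product.
m, n, p = 3, 4, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

C = np.zeros((m, p))
for i in range(m):
    for j in range(p):
        for k in range(n):
            C[i, j] += A[i, k] * B[k, j]

print(C.shape)                 # (m, p)
print(np.allclose(C, A @ B))   # True
```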

A.3.1 Vector-Vector Products

In this book, a vector $u$ is a column vector; thus, the row vector is represented as $u^\intercal$. The product of two vectors can be performed in two ways. The more common is called an inner product (also known as a dot product or scalar product). The inner product is a functional, meaning that it is an operator that acts on vectors and produces a scalar. This product is illustrated in Figure A.4. In the real vector space of $n$ dimensions, the inner product of two vectors, $u$ and $v$, whose dimensions are equal, is defined algebraically as

Figure A.4: Dot (or inner) product of two vectors.

$$\alpha = u^\intercal v = \begin{bmatrix} u_1 & u_2 & \ldots & u_n \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} = \sum_{i=1}^n u_i v_i \, .$$

The order of multiplication is irrelevant, and therefore,

$$u^\intercal v = v^\intercal u \, .$$

In Euclidean space, where vectors have magnitude and direction, the inner product is defined as

$$u^\intercal v = \|u\| \|v\| \cos(\theta) \, ,$$

where $\|\cdot\|$ represents the 2-norm (Equation A.25), and $\theta$ is the angle between the two vectors.

Figure A.5: Outer product of two vectors.

The outer product multiplies every element of one vector by every element of the other to produce a matrix, as illustrated in Figure A.5. Unlike the inner product, the outer product does not require the vectors to be of the same length. The matrix form is as follows:

$$C = u v^\intercal = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix} \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} = \begin{bmatrix} u_1 v_1 & u_1 v_2 & \cdots & u_1 v_n \\ u_2 v_1 & u_2 v_2 & \cdots & u_2 v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_m v_1 & u_m v_2 & \cdots & u_m v_n \end{bmatrix} \, .$$

The index form is as follows:

$$(u v^\intercal)_{ij} = u_i v_j \, .$$

Outer products generate rank 1 matrices. They are used in quasi-Newton methods (Section 4.4.4).
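
A brief NumPy illustration (added here, with arbitrary example vectors) of the two vector products:

```python
import numpy as np

# Inner product (a scalar) and outer product (a rank-1 matrix).
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

inner = u @ v              # sum_i u_i v_i
outer = np.outer(u, v)     # (u v^T)_ij = u_i v_j

print(inner)                             # 32.0
print(outer.shape)                       # (3, 3)
print(np.linalg.matrix_rank(outer))      # 1: outer products are rank-1
```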

A.3.2 Matrix-Vector Products

Consider multiplying a matrix $A$ of size $(m \times n)$ by a vector $u$ of size $n$. The result is a vector of size $m$:

$$v = Au \quad \Rightarrow \quad v_i = \sum_{j=1}^n A_{ij} u_j \, .$$

This multiplication is illustrated in Figure A.6.

Figure A.6: Matrix-vector product.

The entries in $v$ are dot products between the rows of $A$ and $u$:

$$v = \begin{bmatrix} \text{---} \; A_{1 \ast} \; \text{---} \\ \text{---} \; A_{2 \ast} \; \text{---} \\ \vdots \\ \text{---} \; A_{m \ast} \; \text{---} \end{bmatrix} u \, ,$$

where $A_{i \ast}$ is the $i$th row of the matrix $A$. Thus, a matrix-vector product transforms a vector in $n$-dimensional space ($\mathbb{R}^n$) to a vector in $m$-dimensional space ($\mathbb{R}^m$).

A matrix-vector product can be thought of as a linear combination of the columns of $A$, where the $u_j$ values are the weights:

$$v = \begin{bmatrix} | \\ A_{\ast 1} \\ | \end{bmatrix} u_1 + \begin{bmatrix} | \\ A_{\ast 2} \\ | \end{bmatrix} u_2 + \ldots + \begin{bmatrix} | \\ A_{\ast n} \\ | \end{bmatrix} u_n \, ,$$

where $A_{\ast j}$ are the columns of $A$.

We can also multiply by a vector on the left, instead of on the right:

$$v^\intercal = u^\intercal A \, .$$

In this case, a row vector is multiplied with a matrix, producing a row vector.

A.3.3 Quadratic Form (Vector-Matrix-Vector Product)

Another common product is a quadratic form. A quadratic form consists of a row vector, times a matrix, times a column vector, producing a scalar:

$$\alpha = u^\intercal A u = \begin{bmatrix} u_1 & u_2 & \ldots & u_n \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & \cdots & A_{nn} \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix} \, .$$

The index form is as follows:

$$\alpha = \sum_{i=1}^n \sum_{j=1}^n u_i A_{ij} u_j \, .$$

In general, a vector-matrix-vector product can have a nonsquare $A$ matrix, and the vectors would be two different sizes, but for a quadratic form, the two vectors $u$ are identical, and thus $A$ is square. Also, in a quadratic form, we assume that $A$ is symmetric (even if it is not, only the symmetric part of $A$ contributes, so effectively, it acts like a symmetric matrix).
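
As a quick check (an added illustration with an arbitrary symmetric matrix), the matrix form and the index form of the quadratic form give the same scalar:

```python
import numpy as np

# Quadratic form u^T A u evaluated in matrix form and index form.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # symmetric
u = np.array([1.0, -2.0])

matrix_form = u @ A @ u
index_form = sum(u[i] * A[i, j] * u[j]
                 for i in range(2) for j in range(2))
print(matrix_form, index_form)   # both 10.0
```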

A.4 Four Fundamental Subspaces in Linear Algebra

This section reviews how the dimensions of a matrix in a linear system relate to the dimensions of the associated vector spaces.[2] These concepts are especially helpful for understanding constrained optimization (Chapter 5) and build on the review of $n$-dimensional space in Section 5.2.

A vector space is the set of all points that can be obtained by linear combinations of a given set of vectors. The vectors are said to span the vector space. A basis is a set of linearly independent vectors that generates all points in a vector space. A subspace is a space of lower dimension than the space that contains it (e.g., a line is a subspace of a plane).

Two vectors are orthogonal if the angle between them is 90 degrees. Then, their dot product is zero. A subspace S1S_1 is orthogonal to another subspace S2S_2 if every vector in S1S_1 is orthogonal to every vector in S2S_2.

Consider an $(m \times n)$ matrix $A$. The rank ($r$) of a matrix $A$ is the maximum number of linearly independent row vectors of $A$ or, equivalently, the maximum number of linearly independent column vectors. The rank can also be defined as the dimensionality of the vector space spanned by the rows or columns of $A$. For an $(m \times n)$ matrix, $r \le \min(m, n)$.

Through a matrix-vector multiplication $Ax = b$, this matrix maps an $n$-vector $x$ into an $m$-vector $b$.

Figure A.7 shows this mapping and illustrates the four fundamental subspaces that we now explain.

The column space of a matrix $A$ is the vector space spanned by the vectors in the columns of $A$. The dimensionality of this space is given by $r$, where $r \le n$, and because the columns are $m$-vectors, the column space is a subspace of $m$-dimensional space. The row space of a matrix $A$ is the vector space spanned by the vectors in the rows of $A$ (or equivalently, it is the column space of $A^\intercal$). The dimensionality of this space is also given by $r$, where $r \le m$, and the row space is a subspace of $n$-dimensional space.

Figure A.7: The four fundamental subspaces of linear algebra. An $(m \times n)$ matrix $A$ maps vectors from $n$-space to $m$-space. When the vector is in the row space of the matrix, it maps to the column space of $A$ ($x_r \rightarrow b$). When the vector is in the nullspace of $A$, it maps to zero ($x_n \rightarrow 0$). Combining the row space and nullspace of $A$, we can obtain any vector in $n$-dimensional space ($x = x_r + x_n$), which maps to the column space of $A$ ($x \rightarrow b$).

The nullspace of a matrix $A$ is the vector space consisting of all the vectors that are orthogonal to the rows of $A$. Equivalently, the nullspace of $A$ is the vector space of all vectors $x_n$ such that $A x_n = 0$. Therefore, the nullspace is orthogonal to the row space of $A$. The dimension of the nullspace of $A$ is $n - r$.

Combining the nullspace and row space of $A$ adds up to the whole $n$-dimensional space; that is, $x = x_r + x_n$, where $x_r$ is in the row space of $A$ and $x_n$ is in the nullspace of $A$.

The left nullspace of a matrix $A$ is the vector space of all $x$ such that $A^\intercal x = 0$. Therefore, the left nullspace is orthogonal to the column space of $A$. The dimension of the left nullspace of $A$ is $m - r$. Combining the left nullspace and column space of $A$ adds up to the whole $m$-dimensional space.
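
The following Python sketch (an added illustration using SciPy and an arbitrary rank-deficient matrix) verifies the subspace dimensions numerically:

```python
import numpy as np
from scipy.linalg import null_space

# Four fundamental subspaces of a 3x4 matrix whose third row is the sum
# of the first two, so its rank is 2.
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 2.0, 1.0, 2.0]])
m, n = A.shape
r = np.linalg.matrix_rank(A)

N = null_space(A)       # basis for the nullspace,      dimension n - r
LN = null_space(A.T)    # basis for the left nullspace, dimension m - r

print(r, N.shape[1], LN.shape[1])                         # 2 2 1
print(np.allclose(A @ N, 0), np.allclose(A.T @ LN, 0))    # True True
```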

A.5 Vector and Matrix Norms

Norms give an idea of the magnitude of the entries in vectors and matrices. They are a generalization of the absolute value for real numbers. A norm $\|\cdot\|$ is a real-valued function with the following properties:

  • $\|x\| \ge 0$ for all $x$.

  • $\|x\| = 0$ if and only if $x = 0$.

  • $\|\alpha x\| = |\alpha| \|x\|$ for all real numbers $\alpha$.

  • $\|x + y\| \le \|x\| + \|y\|$ for all $x$ and $y$.

Most common matrix norms also have the property that $\|xy\| \le \|x\| \|y\|$, although this is not required in general.

We start by defining vector norms, where the vector is $x = [x_1, \ldots, x_n]$. The most familiar norm for vectors is the 2-norm, also known as the Euclidean norm, which corresponds to the Euclidean length of the vector:

$$\|x\|_2 = \left( \sum_{i=1}^n x_i^2 \right)^{\frac{1}{2}} = \left( x_1^2 + x_2^2 + \ldots + x_n^2 \right)^{\frac{1}{2}} \, .$$

Because this norm is used so often, we often omit the subscript and just write $\|x\|$. In this book, we sometimes use the square of the 2-norm, which can be written as the dot product,

$$\|x\|_2^2 = x^\intercal x \, .$$

More generally, we can refer to a class of norms called $p$-norms:

$$\|x\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{\frac{1}{p}} = \left( |x_1|^p + |x_2|^p + \ldots + |x_n|^p \right)^{\frac{1}{p}} \, ,$$

where $1 \le p < \infty$. Of all the $p$-norms, three are most commonly used: the 2-norm (Equation A.25), the 1-norm, and the $\infty$-norm. From the previous definition, we see that the 1-norm is the sum of the absolute values of all the entries in $x$:

$$\|x\|_1 = \sum_{i=1}^n |x_i| = |x_1| + |x_2| + \ldots + |x_n| \, .$$

The application of $\infty$ in the $p$-norm definition is perhaps less obvious, but as $p \rightarrow \infty$, the largest term in that sum dominates all of the others. Raising that quantity to the power of $1/p$ causes the exponents to cancel, leaving only the largest-magnitude component of $x$. Thus, the infinity norm is

$$\|x\|_\infty = \max_i |x_i| \, .$$

The infinity norm is commonly used in optimization convergence criteria.

Figure A.8: Norms for the two-dimensional case.

The vector norms are visualized in Figure A.8 for $n = 2$. If $x = [1, \ldots, 1]$, then

$$\|x\|_1 = n, \quad \|x\|_2 = n^{\frac{1}{2}}, \quad \|x\|_\infty = 1 \, .$$
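
As a numerical illustration (added here, with an arbitrary vector), NumPy computes these norms directly:

```python
import numpy as np

# 1-, 2-, and infinity-norms of a vector, plus the all-ones case above.
x = np.array([3.0, -4.0, 1.0])
print(np.linalg.norm(x, 1))        # 8.0
print(np.linalg.norm(x, 2))        # sqrt(26) ~ 5.10
print(np.linalg.norm(x, np.inf))   # 4.0

ones = np.ones(5)                  # ||x||_1 = n, ||x||_2 = sqrt(n), ||x||_inf = 1
print(np.linalg.norm(ones, 1), np.linalg.norm(ones, 2), np.linalg.norm(ones, np.inf))
```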

It is also possible to assign different weights to each vector component to form a weighted norm:

$$\|x\|_p = \left( w_1 |x_1|^p + w_2 |x_2|^p + \ldots + w_n |x_n|^p \right)^{\frac{1}{p}} \, .$$

Several norms for matrices exist. There are matrix norms similar to the vector norms that we defined previously. Namely,

$$\begin{aligned} \|A\|_1 &= \max_{1 \le j \le n} \sum_{i=1}^n \left| A_{ij} \right| \\ \|A\|_2 &= \left( \lambda_{\max} \left( A^\intercal A \right) \right)^{\frac{1}{2}} \\ \|A\|_\infty &= \max_{1 \le i \le n} \sum_{j=1}^n \left| A_{ij} \right| \, , \end{aligned}$$

where $\lambda_{\max}\left(A^\intercal A\right)$ is the largest eigenvalue of $A^\intercal A$. When $A$ is a square symmetric matrix, then


$$\|A\|_2 = \max_i \left| \lambda_i(A) \right| \, ,$$

where $\lambda_i(A)$ are the eigenvalues of $A$.

Another matrix norm that is useful but not related to any vector norm is the Frobenius norm, which is defined as the square root of the sum of the squares of all its elements, that is,

$$\|A\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n A_{ij}^2} \, .$$

The Frobenius norm can be weighted by a matrix $W$ as follows:

$$\|A\|_W = \left\| W^{\frac{1}{2}} A W^{\frac{1}{2}} \right\|_F \, .$$

This norm is used in the formal derivation of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update formula.
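
The following NumPy sketch (an added illustration with an arbitrary matrix) evaluates these matrix norms and confirms that the 2-norm matches its eigenvalue definition:

```python
import numpy as np

# Matrix 1-, 2-, infinity-, and Frobenius norms for a small example.
A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

print(np.linalg.norm(A, 1))        # max column abs-sum = 6.0
print(np.linalg.norm(A, np.inf))   # max row abs-sum    = 7.0
print(np.linalg.norm(A, 2))        # sqrt(lambda_max(A^T A))
print(np.linalg.norm(A, 'fro'))    # sqrt(1 + 4 + 9 + 16) = sqrt(30)

lam_max = np.max(np.linalg.eigvalsh(A.T @ A))
print(np.isclose(np.linalg.norm(A, 2), np.sqrt(lam_max)))   # True
```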

A.6 Matrix Types

There are several common types of matrices that appear regularly throughout this book. We review some terminology here.

A diagonal matrix is a matrix where all off-diagonal terms are zero. In other words, $A$ is diagonal if

$$A_{ij} = 0 \text{ for all } i \ne j \, .$$

The identity matrix $I$ is a special diagonal matrix where all diagonal components are 1.

The transpose of a matrix is defined as follows:

$$\left[ A^\intercal \right]_{ij} = A_{ji} \, .$$

Note that

$$\begin{aligned} \left(A^\intercal\right)^\intercal &= A \\ (A + B)^\intercal &= A^\intercal + B^\intercal \\ (AB)^\intercal &= B^\intercal A^\intercal \, . \end{aligned}$$

A symmetric matrix is one where the matrix is equal to its transpose:

$$A^\intercal = A \; \Rightarrow \; A_{ij} = A_{ji} \, .$$

The inverse of a matrix, $A^{-1}$, satisfies

$$A A^{-1} = I = A^{-1} A \, .$$

Not all matrices are invertible. Some common properties for inverses are as follows:

$$\begin{aligned} \left(A^{-1}\right)^{-1} &= A \\ (AB)^{-1} &= B^{-1} A^{-1} \\ \left(A^{-1}\right)^\intercal &= \left(A^\intercal\right)^{-1} \, . \end{aligned}$$

A symmetric matrix $A$ is positive definite if and only if

$$x^\intercal A x > 0$$

for all nonzero vectors $x$. One property of positive-definite matrices is that their inverse is also positive definite.

The positive-definite condition (Equation A.42) can be challenging to verify. Still, we can use equivalent definitions that are more practical.

For example, by choosing appropriate vectors $x$, we can derive the following necessary conditions for positive definiteness:

$$\begin{aligned} A_{ii} &> 0 \quad \text{for all } i \\ A_{ij} &< \sqrt{A_{ii} A_{jj}} \quad \text{for all } i \neq j \, . \end{aligned}$$

These are necessary but not sufficient conditions. Thus, if any diagonal element is less than or equal to zero, we know that the matrix is not positive definite.

An equivalent condition to Equation A.42 is that all the eigenvalues of $A$ are positive. Unlike the conditions above, this condition is both necessary and sufficient.

Figure A.9: For $A$ to be positive definite, the determinants of the submatrices $A_1, A_2, \ldots, A_n$ must be greater than zero.

Another practical condition equivalent to Equation A.42 is that all the leading principal minors of $A$ are positive. A leading principal minor is the determinant of a leading principal submatrix. A leading principal submatrix of order $k$, $A_k$, of an $(n \times n)$ matrix $A$ is obtained by removing the last $n - k$ rows and columns of $A$, as shown in Figure A.9. Thus, to verify whether $A$ is positive definite, we start with $k = 1$, check that $A_1 > 0$ (only one element), then check that $\det(A_2) > 0$, and so on, until $\det(A_n) > 0$. If any of the determinants in this sequence is not positive, we can stop the process and conclude that $A$ is not positive definite.
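
As an added illustration (with an arbitrary symmetric matrix), the sketch below applies three equivalent positive-definiteness checks: positive eigenvalues, positive leading principal minors, and a successful Cholesky factorization.

```python
import numpy as np

# Three equivalent positive-definiteness checks for a symmetric matrix.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

print(np.all(np.linalg.eigvalsh(A) > 0))                     # True: eigenvalues

minors = [np.linalg.det(A[:k, :k]) for k in range(1, 4)]
print(all(d > 0 for d in minors))                            # True: leading minors

try:
    np.linalg.cholesky(A)     # succeeds only for positive-definite matrices
    print("Cholesky factorization succeeded")
except np.linalg.LinAlgError:
    print("A is not positive definite")
```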

A positive-semidefinite matrix satisfies

$$x^\intercal A x \geq 0$$

for all nonzero vectors $x$. In this case, the eigenvalues are nonnegative, and there is at least one that is zero. A negative-definite matrix satisfies

$$x^\intercal A x < 0$$

for all nonzero vectors $x$. In this case, all the eigenvalues are negative. An indefinite matrix is one that is neither positive definite nor negative definite. Then, there are at least two nonzero vectors $x$ and $y$ such that

$$x^\intercal A x > 0 > y^\intercal A y \, .$$

A.7 Matrix Derivatives

Let us consider the derivatives of a few common cases: linear and quadratic functions. Combining the concept of partial derivatives and matrix forms of equations allows us to find the gradients of matrix functions. First, let us consider a linear function, $f$, defined as

$$f(x) = a^\intercal x + b = \sum_{i=1}^n a_i x_i + b \, ,$$

where $a$ and $x$ are vectors of length $n$, $b$ is a scalar, and $a_i$ and $x_i$ are the $i$th elements of $a$ and $x$, respectively. If we take the partial derivative with respect to an arbitrary element of $x$, namely, $x_k$, we get

$$\frac{\partial}{\partial x_k} \left[ \sum_{i=1}^n a_i x_i + b \right] = a_k \, .$$

Thus,

$$\nabla_x \left( a^\intercal x + b \right) = a \, .$$

Recall the quadratic form presented in Section A.3.3; we can combine that with a linear term to form a general quadratic function:

$$f(x) = x^\intercal A x + b^\intercal x + c \, ,$$

where $x$ and $b$ are still vectors of length $n$, $c$ is a scalar, and $A$ is an $n \times n$ symmetric matrix. In index notation, $f$ is as follows:

$$f(x) = \sum_{i=1}^n \sum_{j=1}^n x_i A_{ij} x_j + \sum_{i=1}^n b_i x_i + c \, .$$

For convenience, we separate the diagonal terms from the off-diagonal terms, leaving us with

$$f(x) = \sum_{i=1}^n \left[ A_{ii} x_i^2 + b_i x_i \right] + \sum_{i=1}^n \sum_{j \neq i} x_i A_{ij} x_j + c \, .$$

Now we take the partial derivative with respect to $x_k$ as before, yielding

$$\frac{\partial f}{\partial x_k} = 2 A_{kk} x_k + b_k + \sum_{j \neq k} x_j A_{jk} + \sum_{j \neq k} A_{kj} x_j \, .$$

We now move the diagonal terms back into the sums to get

$$\frac{\partial f}{\partial x_k} = b_k + \sum_{j=1}^n \left( x_j A_{jk} + A_{kj} x_j \right) \, ,$$

which we can put back into matrix form as follows:

$$\nabla_x f(x) = A^\intercal x + A x + b \, .$$

If $A$ is symmetric, then $A^\intercal = A$, and thus

$$\nabla_x \left( x^\intercal A x + b^\intercal x + c \right) = 2 A x + b \, .$$
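
As a sanity check (an added sketch with a random symmetric matrix), the analytic gradient $2Ax + b$ agrees with a finite-difference estimate of $\nabla f$:

```python
import numpy as np

# Gradient of f(x) = x^T A x + b^T x + c with symmetric A is 2 A x + b;
# verified here against a central finite-difference estimate.
rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
A = 0.5 * (M + M.T)              # symmetrize
b = rng.standard_normal(n)
c = 1.3
x = rng.standard_normal(n)

f = lambda x: x @ A @ x + b @ x + c
analytic = 2 * A @ x + b

h = 1e-6
fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(n)])
print(np.allclose(analytic, fd, atol=1e-5))   # True
```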

A.8 Eigenvalues and Eigenvectors

Given an $(n \times n)$ matrix $A$, if there is a scalar $\lambda$ and a nonzero vector $v$ that satisfy

$$A v = \lambda v \, ,$$

then $\lambda$ is an eigenvalue of the matrix $A$, and $v$ is an eigenvector.

The left-hand side of Equation A.57 is a matrix-vector product that represents a linear transformation applied to $v$. The right-hand side of Equation A.57 is a scalar-vector product that represents a vector aligned with $v$. Therefore, the eigenvalue problem (Equation A.57) answers the question: Which vectors, when transformed by $A$, remain in the same direction, and how much do their corresponding lengths change in that transformation?

The solutions of the eigenvalue problem (Equation A.57) are given by the solutions of the scalar equation,

$$\det\left( A - \lambda I \right) = 0 \, .$$

This equation yields a polynomial of degree $n$ called the characteristic equation, whose roots are the eigenvalues of $A$.

If $A$ is symmetric, it has $n$ real eigenvalues $(\lambda_1, \ldots, \lambda_n)$ and $n$ linearly independent eigenvectors $(v_1, \ldots, v_n)$ corresponding to those eigenvalues. It is possible to choose the eigenvectors to be orthogonal to each other (i.e., $v_i^\intercal v_j = 0$ for $i \neq j$) and to normalize them (so that $v_i^\intercal v_i = 1$).

We use the eigenvalue problem in Section 4.1.2, where the eigenvectors are the directions of principal curvature, and the eigenvalues quantify the curvature. Eigenvalues are also helpful in determining if a matrix is positive definite.
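
The following sketch (an added example with an arbitrary symmetric matrix) computes eigenpairs with NumPy and verifies the defining relation and the orthonormality of the eigenvectors:

```python
import numpy as np

# Eigenpairs of a symmetric matrix: A v = lambda v, with orthonormal
# eigenvectors as returned by numpy.linalg.eigh.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam, V = np.linalg.eigh(A)               # columns of V are eigenvectors
print(lam)                               # [1. 3.]
print(np.allclose(A @ V, V * lam))       # A v_i = lambda_i v_i
print(np.allclose(V.T @ V, np.eye(2)))   # eigenvectors are orthonormal
```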

A.9 Random Variables

Imagine measuring the axial strength of a rod by performing a tensile test with many rods, each designed to be identical. Even with “identical” rods, every time you perform the test, you get a different result (hopefully with relatively small differences). This variation has many potential sources, including variation in the manufactured size and shape, in the composition of the material, and in the contact between the rod and testing fixture. In this example, we would call the axial strength a random variable, and the result from one test would be a random sample. The random variable, axial strength, is a function of several other random variables, such as bar length, bar diameter, and material Young’s modulus.

One measurement does not tell us anything about how variable the axial strength is, but if we perform the test many times, we can learn a lot about its distribution. From this information, we can infer various statistical quantities, such as the mean value of the axial strength. The mean of some variable $x$ that is measured $n$ times is estimated as follows:

$$\mu_x = \frac{1}{n}\sum_{i=1}^n x_i \, .$$

This is actually a sample mean, which would differ from the population mean (the true mean if you could measure every bar). With enough samples, the sample mean approaches the population mean. In this brief review, we do not distinguish between sample and population statistics.

Another important quantity is the variance or standard deviation. This is a measure of spread, or how far away our samples are from the mean. The unbiased[3] estimate of the variance is

$$\sigma_x^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \mu_x)^2 \, ,$$

and the standard deviation is just the square root of the variance. A small variance implies that measurements are clustered tightly around the mean, whereas a large variance means that measurements are spread out far from the mean. The variance can also be written in the following mathematically equivalent but more computationally friendly format:

$$\sigma_x^2 = \frac{1}{n-1}\left(\sum_{i=1}^n x_i^2 - n \mu_x^2\right) \, .$$
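
The sketch below (an added illustration with synthetic measurements) computes the sample mean and both forms of the unbiased variance and checks them against NumPy:

```python
import numpy as np

# Sample mean and unbiased variance from both formulas above, using
# synthetic "axial strength" measurements (hypothetical data).
rng = np.random.default_rng(42)
x = rng.normal(loc=100.0, scale=5.0, size=1000)
n = x.size

mu = np.sum(x) / n
var_direct = np.sum((x - mu)**2) / (n - 1)
var_alt = (np.sum(x**2) - n * mu**2) / (n - 1)

print(mu)                                         # close to 100
print(var_direct, var_alt)                        # both close to 25
print(np.isclose(var_direct, np.var(x, ddof=1)))  # matches NumPy's unbiased estimate
```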

More generally, we might want to know what the probability is of getting a bar with a specific axial strength. In our testing, we could tabulate the frequency of each measurement in a histogram. If done enough times, it would define a smooth curve, as shown in Figure A.10. This curve is called the probability density function (PDF), $p(x)$, and it tells us the relative probability of a certain value occurring.

More specifically, a PDF gives the probability of getting a value within a certain range:

$$\text{Prob}(a \le x \le b) = \int_{a}^{b} p(x) \, \mathrm{d} x \, .$$

The total integral of the PDF must be 1 because it contains all possible outcomes (100 percent):

$$\int_{-\infty}^\infty p(x) \, \mathrm{d} x = 1 \, .$$

From the PDF, we can also measure various statistics, such as the mean value:

$$\mu_x = \mathbb{E}(x) = \int_{-\infty}^\infty x \, p(x) \, \mathrm{d} x \, .$$

This quantity is also referred to as the expected value of $x$ ($\mathbb{E}[x]$). The expected value of a function of a random variable, $f(x)$, is given by:[4]

$$\mu_f = \mathbb{E}\left(f(x)\right) = \int_{-\infty}^\infty f(x) \, p(x) \, \mathrm{d} x \, .$$

We can also compute the variance, which is the expected value of the squared difference from the mean:

$$\sigma_x^2 = \mathbb{E}\left(\left(x - \mathbb{E}(x)\right)^2\right) = \int_{-\infty}^\infty (x - \mu_x)^2 \, p(x) \, \mathrm{d} x \, ,$$

or in a mathematically equivalent format:

$$\sigma_x^2 = \int_{-\infty}^\infty x^2 \, p(x) \, \mathrm{d} x - \mu_x^2 \, .$$

The mean and variance are the first and second moments of the distribution. In general, a distribution may require an infinite number of moments to describe it fully. Higher-order moments are generally mean centered and are normalized by the standard deviation so that the $n$th normalized moment is computed as follows:

$$\mathbb{E}\left( \left(\frac{x - \mu_x}{\sigma_x}\right)^n \right) \, .$$

The third moment is called skewness, and the fourth is called kurtosis, although these higher-order moments are less commonly used.

The cumulative distribution function (CDF) is the cumulative integral of the PDF and is defined as follows:

$$P(x) = \int_{-\infty}^x p(t) \, \mathrm{d} t \, .$$

The capital $P$ denotes the CDF, and the lowercase $p$ denotes the PDF. As an example, the PDF and corresponding CDF for the axial strength are shown in Figure A.10. The CDF always approaches 1 as $x \rightarrow \infty$.

Figure A.10: Comparison between PDF and CDF for a simple example.

Figure A.11: Two normal distributions. Changing the mean causes a shift along the $x$-axis. Increasing the standard deviation causes the PDF to spread out.

We often fit a named distribution to the PDF of empirical data. One of the most popular distributions is the normal distribution, also known as the Gaussian distribution. Its PDF is as follows:

$$p(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \, .$$

For a normal distribution, the mean and variance appear explicitly as parameters of the PDF, but these quantities are defined for any distribution. Figure A.11 shows two normal distributions with different means and standard deviations to illustrate the effect of those parameters.
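
For reference (an added sketch with arbitrary parameters), SciPy evaluates the normal PDF and CDF directly, and the PDF matches the formula above:

```python
import numpy as np
from scipy.stats import norm

# Normal-distribution PDF and CDF, checked against the explicit formula.
mu, sigma = 2.0, 0.5
x = 2.3

pdf_formula = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(np.isclose(pdf_formula, norm.pdf(x, loc=mu, scale=sigma)))   # True

# The CDF approaches 1 for large x, and Prob(a <= x <= b) = P(b) - P(a):
print(norm.cdf(10.0, loc=mu, scale=sigma))                              # ~1.0
print(norm.cdf(2.5, loc=mu, scale=sigma) - norm.cdf(1.5, loc=mu, scale=sigma))
```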

Several other popular distributions are shown in Figure A.12: uniform distribution, Weibull distribution, lognormal distribution, and exponential distribution. These are only a few of many other possible probability distributions.

Figure A.12: Popular probability distributions besides the normal distribution.


An extension of variance is the covariance, which measures the variability between two random variables:

$$\begin{aligned} \text{cov}(x, y) &= \mathbb{E}\left( \left(x - \mathbb{E}(x) \right)\left(y - \mathbb{E}(y)\right)\right) \\ &= \mathbb{E}(xy) - \mu_x \mu_y \, . \end{aligned}$$

From this definition, we see that the variance is related to covariance by the following:

$$\sigma_x^2 = \text{var}(x) = \text{cov}(x, x) \, .$$

Covariance is often expressed as a matrix, in which case the variance of each variable appears on the diagonal. The correlation is the covariance divided by the standard deviations:

$$\text{corr}(x, y) = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y} \, .$$
Footnotes
  1. In this notation, $m$ is the number of rows and $n$ is the number of columns.

  2. Strang [1] provides comprehensive coverage of linear algebra and is credited with popularizing the concept of the “four fundamental subspaces”.

  3. Unbiased means that the expected value of the sample variance is the same as the true population variance. If $n$ were used in the denominator instead of $n-1$, the two quantities would differ by a constant factor.

  4. This is not a definition; rather, it follows from the definition of the expected value through a somewhat lengthy derivation.

References
  1. Strang, G. (2006). Linear Algebra and its Applications (4th ed.). Cengage Learning.