

10 Surrogate-Based Optimization

A surrogate model, also known as a response surface model or metamodel, is an approximate model of a functional output that represents a “curve fit” to some underlying data. The goal of a surrogate model is to build a model that is much faster to compute than the original function, but that still retains sufficient accuracy away from known data points.

Figure 10.1: Surrogate-based optimization replaces the original model with a surrogate model in the optimization process.

Surrogate-based optimization (SBO) performs optimization using the surrogate model, as shown in Figure 10.1. When used in optimization, the surrogate might define the full optimization model (i.e., the inputs are design variables, and the outputs are objective and constraint functions), or the surrogate could be just a component of the overall model. SBO is more targeted than the broader field of surrogate modeling. Instead of aiming for a globally accurate surrogate, SBO just needs the surrogate model to be accurate enough to lead the optimizer to the true optimum.

In SBO, the surrogate model is usually improved during optimization as needed but can sometimes be constructed beforehand and remain fixed during optimization. Some optimization algorithms interrogate both the surrogate model and the original model, an approach that is sometimes called surrogate-assisted optimization.

10.1 When to Use a Surrogate Model

There are various scenarios for which surrogate models are helpful. One scenario is when the original model is computationally expensive. Surrogate models can be queried with minimal computational cost, but constructing them requires multiple evaluations of the original model. Suppose the number of evaluations needed to build a sufficiently accurate surrogate model is less than that needed to optimize the original model directly. In that case, SBO may be a worthwhile option. Constructing a surrogate model becomes even more compelling when it is reused in multiple optimizations.

Surrogate models can be effective in handling noisy models because they create a smooth representation of noisy data. This can be particularly advantageous when using gradient-based optimization.

One scenario that leads to both expensive evaluation and noisy output is experimental data. When the model data are experimental and the optimizer cannot query the experiment in an automated way, we can construct a surrogate model based on the experimental data. Then, the optimizer can query the surrogate model in the optimization.

Surrogate models are also helpful when we want to understand the design space, that is, how the objective and constraints (outputs) vary with respect to the design variables (inputs). By constructing a continuous model over discrete data, we obtain functional relationships that can be visualized more effectively.

When multiple sources of data are available, surrogate models can fuse the data to build a single model. The data could come from numerical models with different levels of fidelity or from experimental data. For example, surrogate models can calibrate numerical model data using experimental data. This is helpful because experimental data are usually much scarcer than numerical data. The same reasoning applies to low- versus high-fidelity numerical data.

One potential issue with surrogate models is the curse of dimensionality, which refers to poor scalability with the number of inputs. The larger the number of inputs, the more model evaluations are needed to construct a surrogate model that is accurate enough. Therefore, the reasons for using surrogate models cited earlier might not be enough if the optimization problem has a large number of design variables.

The SBO process is shown in Figure 10.2. First, we use sampling methods to choose the initial points to evaluate the function or conduct experiments. These points are sometimes referred to as training data. Next, we build a surrogate model from the sampled points. We can then perform optimization by querying the surrogate model. Based on the results of the optimization, we include additional points in the sample and reconstruct the surrogate (infill).

Figure 10.2: Overview of the surrogate-based optimization procedure.

We repeat this process until some convergence criterion or a maximum number of iterations is reached. In some procedures, infill is omitted; the surrogate is entirely constructed upfront and not subsequently updated.

The optimization step can be performed using any of the methods we covered previously. Because surrogate models are smooth and their gradients are easily computed, gradient-based optimization is preferred (see Chapter 4). However, some surrogate models can be highly multimodal, in which case a global search is preferred, either using a gradient-based method with multistart (see Tip 4.8) or a global gradient-free method (see Chapter 7).

This chapter discusses sampling, constructing a surrogate, and infill with some associated optimization strategies. We devote separate sections to two surrogate modeling methods that are more involved and widely used: kriging and deep neural nets. Many of the concepts discussed in this chapter have a wide range of applications beyond optimization.

10.2 Sampling

Sampling methods, also known as sampling plans, select the evaluation points to construct the initial surrogate. These evaluation points must be chosen carefully. A straightforward approach is full factorial sampling, where we discretize each dimension and evaluate at all combinations of the resulting grid. This is not efficient because it scales exponentially with the number of input variables.

Example 10.1 highlights one of the significant challenges of sampling methods: the curse of dimensionality.

For SBO, even with better sampling plans, using a large number of variables is costly. We need to identify the most important or most influential variables. Knowledge of the particular domain is helpful, as is exploring the magnitude of the entries in a gradient vector (Chapter 6) across multiple points in the domain. We can use various strategies to help us decide which variables matter most, but for our purposes, we assume that the most influential variables have already been determined so that the dimensionality is reasonable. Having selected a set of variables, we are now interested in sampling methods that characterize the design space of interest more efficiently than full factorial sampling.

In addition to their use in SBO, the sampling strategies discussed in this section are useful in many other applications, including various applications discussed in this book: initializing a genetic algorithm (Section 7.6), particle swarm optimization (Section 7.7) or a multistart gradient-based algorithm (Tip 4.8), or choosing the points to run in a Monte Carlo simulation (Section 12.3.3). Because the function behavior at each sample is independent, we can efficiently parallelize the evaluation of the functions.

10.2.1 Latin Hypercube Sampling

Latin hypercube sampling (LHS) is a popular sampling method that is built on a random process but is more effective and efficient than pure random sampling. Random sampling scales better than full factorial searches, but it tends to exhibit clustering and requires many points to reach the desired distribution (i.e., the law of large numbers). For example, Figure 10.3 compares 50 randomly generated points across uniform distributions in two dimensions versus Latin hypercube sampling. In random sampling, each sample is independent of past samples, but in LHS, we choose all samples beforehand to ensure a well-spread distribution.

Figure 10.3: Contrast between random and Latin hypercube sampling with 50 points using uniform distributions.

To describe the methodology, consider two random variables with bounds, whose design space we can represent as a square.

Figure 10.4: A two-dimensional design space divided into eight intervals in each dimension.

Say we wanted only eight samples; we could divide the design space into eight intervals in each dimension, generating the grid of cells shown in Figure 10.4.

A full factorial search would identify a point in each cell, but that does not scale well. To be as efficient as possible and still cover the variation, we would want each row and each column to have one sample in it. In other words, the projection of points onto each dimension should be uniform. For example, the left side of Figure 10.5 shows the projection of a uniform LHS onto each dimension. We see that the points create a uniformly spread histogram.

Figure 10.5: Example LHS with projections onto the axes.

The concept where one and only one point exists in any given row or column is called a Latin square, and the generalization to higher dimensions is called a Latin hypercube. There are many ways to achieve this, and some choices are better than others. Consider the sampling plan shown on the left of Figure 10.6. This plan meets our criteria but clearly does not fill the space and likely will not capture the relationships between design parameters well. Alternatively, the right side of Figure 10.6 has a sample in each row and column while also spanning the space much more effectively.

Figure 10.6: Contrasting sampling strategies that both fulfill the uniform projection requirement.

LHS can be posed as an optimization problem where we seek to maximize the distance between the samples.

The constraint is that the projection on each axis must follow a chosen probability distribution. The specified distribution is often uniform, as in the previous examples, but it could also be any distribution, such as a normal distribution, as shown on the right side of Figure 10.5. This optimization problem does not have a unique solution, so random processes are used to determine the combination of points. Additionally, points are not usually placed in cell centers but at a random location within a given cell to allow for the possibility of reaching any point in the domain. The advantage of the LHS approach is that rather than relying on the law of large numbers to fill out our chosen probability distributions, we enforce it as a constraint. This method may still require many samples to characterize the design space accurately, but it usually requires far fewer than pure random sampling.

Instead of defining LHS as an optimization problem, a much simpler approach is typically used in which we ensure one sample per interval, but we rely on randomness to choose point combinations. Although this does not necessarily yield a maximum spread, it works well in practice and is simple to implement. Before discussing the algorithm, we discuss how to generate other distributions besides just uniform distributions.

We can convert from uniformly sampled points to an arbitrary distribution using a technique called inversion sampling.

Assume that we want to generate samples $x$ from an arbitrary probability density function (PDF) $p(x)$ or, equivalently, from the corresponding cumulative distribution function (CDF) $P(x)$.[1] The probability integral transform states that for any continuous CDF, $y = P(x)$, the variable $y$ is uniformly distributed (a simple proof, but it is not shown here to avoid introducing additional notation). The procedure is to randomly sample from a uniform distribution (e.g., generate $y$), then compute the corresponding $x$ such that $P(x) = y$, which we denote as $x = P^{-1}(y)$. This latter step is known as an inverse CDF, a percent-point function, or a quantile function. This process is depicted in Figure 10.7 for a normal distribution. This same procedure allows us to use LHS with any distribution, simply by generating the samples on a uniform distribution.
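For instance, a minimal sketch of inversion sampling for a normal distribution (the sample size, seed, and distribution parameters are arbitrary illustrative choices; SciPy's `norm.ppf` serves as the inverse CDF):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Sample uniformly on [0, 1), then map through the inverse CDF
# (percent-point function) of the target distribution.
y = rng.uniform(size=1000)
x = norm.ppf(y, loc=0.0, scale=1.0)  # these samples follow a standard normal distribution
```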

Figure 10.7: An example of inversion sampling with a normal distribution. A few uniform samples are shown on the $y$-axis. The points are evaluated by the inverse CDF, represented by the arrows passing through the CDF for a normal distribution. If enough samples are drawn, the resulting distribution will be the PDF of a normal distribution.

A typical algorithm is described in Algorithm 10.1. For each axis, we partition the CDF into $n_s$ evenly spaced regions (evenly spaced along the CDF, which means that each region is equiprobable). We generate a random number within each evenly spaced interval, where 0 corresponds to the bottom of the interval and 1 to the top. We then evaluate the inverse CDF as described previously so that the points match our specified distribution (the CDF for a uniform distribution is just a line, $P(x) = x$, so the output is not changed). Next, the column of points for that axis is randomly permuted. This process is repeated for each axis according to its specified probability distribution.
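The following sketch illustrates this procedure (it is not Algorithm 10.1 verbatim; the function name `lhs` and the choice of marginals are ours, mirroring the example in Figure 10.8):

```python
import numpy as np
from scipy.stats import uniform, norm

def lhs(dists, ns, rng=np.random.default_rng(0)):
    """Latin hypercube sample: one point per equiprobable bin in each dimension."""
    x = np.empty((ns, len(dists)))
    edges = np.arange(ns) / ns                 # lower edges of the n_s equiprobable bins on the CDF
    for j, dist in enumerate(dists):
        u = edges + rng.uniform(size=ns) / ns  # one uniform draw inside each bin
        col = dist.ppf(u)                      # inverse CDF maps to the specified distribution
        x[:, j] = rng.permutation(col)         # random permutation decouples the axes
    return x

samples = lhs([uniform(0, 1), norm(0, 1)], ns=8)  # uniform in x1, normal in x2
```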

Figure 10.8: An example from the LHS algorithm showing a uniform distribution in $x_1$ and a Gaussian distribution in $x_2$ with eight sample points. The equiprobable bins are shown as grid lines.

An example using Algorithm 10.1 for eight points is shown in Figure 10.8. In this example, we use a uniform distribution for $x_1$ and a normal distribution for $x_2$. There is one point in each equiprobable interval. As stated before, randomness does not necessarily ensure a good spread, but optimizing the spread is difficult because the function is highly multimodal. Instead, to encourage high spread, we could generate multiple Latin hypercube samples with this algorithm and select the one with the largest sum of distances between points.

10.2.2 Low-Discrepancy Sequences

Low-discrepancy sequences generate deterministic sequences of points that are well spread. Each new point added in the sequence maintains low discrepancy—discrepancy refers to the variation in point density throughout the domain. Hence, a low-discrepancy set is close to even density (i.e., well spread). These sequences are called quasi-random because they often serve as suitable replacements for applications that use random sequences, but they are not random or even pseudo-random.

An advantage of low-discrepancy sequences over LHS is that most of the approaches do not require selecting all the samples beforehand. These methods generate deterministic sequences; in other words, we generate the same sequence of points whether we choose them beforehand or add more later. This property is particularly advantageous in iterative procedures. We may choose an initial sampling plan and add more well-spread points to the sample later. This is not necessarily an advantage for the methods of this chapter because the optimization drives the selection of new points rather than continuing to seek spread-out samples. However, this feature is useful for other applications, such as quadrature, Monte Carlo simulations, and other problems where an iterative sampling process is used to determine statistical convergence (see Section 12.3). Low-discrepancy sequences add more points that are well spread without having to throw out the existing samples. Even in non-iterative procedures, these sampling strategies can be a useful alternative.

Several of these sequences are built on generalizations of the one-dimensional van der Corput sequence to more than one dimension. Such sequences are defined by representing an integer $i$ in a given integer base $b$ (the van der Corput sequence is always base 2):

$$ i = a_0 + a_1 b + a_2 b^2 + \ldots + a_r b^r \text{ where } a \in [0, b-1] \, . $$

If the base is 2, this is just a standard binary sequence (Section 7.6.1). After determining the relevant coefficients ($a_j$), the $i$th element of the sequence is

$$ \phi_b(i) = \frac{a_0}{b} + \frac{a_1}{b^2} + \frac{a_2}{b^3} + \ldots + \frac{a_r}{b^{r+1}} \, . $$

An algorithm to generate an element in this sequence, also known as a radical inverse function for base $b$, is given in Algorithm 10.2.
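A compact sketch of a radical inverse function follows (one possible implementation of the idea behind Algorithm 10.2; the names are ours):

```python
def radical_inverse(i, b):
    """Reflect the base-b digits of the integer i about the radix point."""
    phi, denom = 0.0, 1.0
    while i > 0:
        i, a = divmod(i, b)  # strip the next base-b digit a
        denom *= b
        phi += a / denom
    return phi

# First elements of the van der Corput sequence (base 2): 1/2, 1/4, 3/4, 1/8, ...
print([radical_inverse(i, 2) for i in range(1, 5)])
```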

For base 2, the sequence is as follows:

$$ \frac{1}{2}, \frac{1}{4}, \frac{3}{4}, \frac{1}{8}, \frac{5}{8}, \frac{3}{8}, \frac{7}{8}, \frac{1}{16}, \frac{9}{16}, \ldots \, . $$

The interval is divided in half, and then each subinterval is also halved, with new points spreading out across the domain (see Figure 10.9).

Figure 10.9: Van der Corput sequence.

Similarly, for base 3, the interval is split into thirds, then each subinterval is split into thirds, and so on:

$$ \frac{1}{3}, \frac{2}{3}, \frac{1}{9}, \frac{4}{9}, \frac{7}{9}, \frac{2}{9}, \frac{5}{9}, \frac{8}{9}, \frac{1}{27}, \ldots \, . $$

Halton Sequence

A Halton sequence uses pairwise prime numbers (larger than 1) for the base of each dimension of the problem.[2] The $i$th point in the Halton sequence is

$$ \phi(i, b_1), \phi(i, b_2), \ldots, \phi(i, b_{n_x}) \, , $$

where the $b_j$ set is pairwise prime. As an example in two dimensions, Figure 10.10 shows 30 generated points of the Halton sequence where $x_1$ uses base 2 and $x_2$ uses base 3, and then a subsequent 20 generated points are added (in another color), showing the reuse of existing points.
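Reusing the `radical_inverse` sketch above, a Halton point is simply the radical inverse of the index evaluated in a different pairwise prime base for each dimension (bases 2 and 3 match the example in Figure 10.10):

```python
def halton(i, bases=(2, 3)):
    """i-th point of a Halton sequence, one pairwise prime base per dimension."""
    return [radical_inverse(i, b) for b in bases]

points = [halton(i) for i in range(1, 31)]  # first 30 points
more = [halton(i) for i in range(31, 51)]   # 20 additional points; earlier points are unchanged
```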

Figure 10.10: Halton sequence with base 2 for $x_1$ and base 3 for $x_2$. First, 30 points are selected (in blue), and then 20 points are added (in red). These points would be identical to 50 points chosen at once.

If the dimensionality of the problem is high, then some of the base combinations lead to points that are highly correlated and thus undesirable for a sampling plan. For example, the left side of Figure 10.11 shows 50 generated points where $x_1$ uses base 17 and $x_2$ uses base 19. To avoid this issue, we can use a scrambled Halton sequence.

Figure 10.11: Halton sequence with base 17 for $x_1$ and base 19 for $x_2$.

Scrambling can be accomplished by generating a permutation array containing a random permutation of the integers $p = [0, 1, \ldots, b-1]$. Then, rather than using the integers $a$ directly in Equation 10.2, we use the entries of $a$ as the indices of the permutation array. If $p$ is the permutation array, we have

$$ \phi_b(i) = \frac{p_{a_0}}{b} + \frac{p_{a_1}}{b^2} + \frac{p_{a_2}}{b^3} + \ldots + \frac{p_{a_r}}{b^{r+1}} \, . $$

The permutation array is fixed for all digits $a$ and for all $n_p$ points in the domain. The right side of Figure 10.11 shows the same example (with base 17 and base 19) but with scrambling to weaken the strong correlations.

Hammersley Sequence

The Hammersley sequence is closely related to the Halton sequence. However, it provides better spacing if we know beforehand the number of points ($n_p$) that we are going to use. This approach only needs $n_x - 1$ bases (still pairwise prime) because the first dimension uses regular spacing:

$$ \frac{i}{n_p}, \phi(i, b_1), \phi(i, b_2), \ldots, \phi(i, b_{n_x - 1}) \, . $$

Because this sequence needs to know the number of points ($n_p$) beforehand, it is less useful for iterative procedures.

Figure 10.12: Hammersley sequence with base 2 for the $x_2$-axis.

However, the implementation is straightforward, so it may still be a useful alternative to LHS. Figure 10.12 shows 50 points generated from a Hammersley sequence where the $x_2$-axis uses base 2.
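A corresponding sketch of the Hammersley construction, which requires the total number of points up front (again reusing `radical_inverse`; base 2 for the second axis matches Figure 10.12):

```python
def hammersley(n_p, bases=(2,)):
    """Hammersley set: regular spacing i/n_p in the first dimension, radical inverses after."""
    return [[i / n_p] + [radical_inverse(i, b) for b in bases]
            for i in range(1, n_p + 1)]

points = hammersley(50)  # 50 two-dimensional points, base 2 for the second axis
```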

Other Sequences

A wide variety of other low-discrepancy sequences exist. The Faure sequence is similar to the Halton sequence, but it uses the same base for all dimensions and uses permutation scrambling for each dimension instead.¹,² Sobol sequences use base-2 sequences but with a reordering based on “direction numbers”.³ Niederreiter sequences are effectively a generalization of Sobol sequences to other bases.⁴

10.3 Constructing a Surrogate

Once sampling is completed, we have a list of data points, often called training data:

$$ \left( x^{(i)}, f^{(i)} \right) \, , $$

where $x^{(i)}$ is an input vector from the sampling plan, and $f^{(i)}$ contains the corresponding outputs from evaluating the model: $f^{(i)} = f\left(x^{(i)}\right)$. We seek to construct a surrogate model from this data set. Surrogate models can be based on physics, mathematics, or a combination of the two.

Incorporating known physics into a model is often desirable to improve model accuracy. However, functional relationships are unknown for many complex problems, and a data-driven mathematical model can be more effective.

Figure 10.13: Interpolation models match the training data at the provided points, whereas regression models minimize the error between the training data and a function with an assumed trend.

Surrogate-based models can be based on interpolation or regression, as illustrated in Figure 10.13. Interpolation builds a function that exactly matches the provided training data. Regression models do not try to match the training data points exactly; instead, they minimize the error between a smooth trend function and the training data. The nature of the training data can help decide between these two types of surrogate models. Regression is particularly useful when the data are noisy. Interpolatory models may produce undesirable oscillations when fitting the noise. In contrast, regression models can find a smooth function that is less sensitive to the noise. Interpolation is useful when the data are highly multimodal (and not noisy). This is because a regression model may smooth over variations that are actually physical, whereas an interpolatory model can accurately capture those variations.

There are two main steps involved in either type of surrogate model. First, we select a set of basis functions, which represent the form for the model.

Second, we determine the model parameters that provide the best fit to the provided data. Determining the model parameters is an optimization problem, which we discuss first. We discuss linear regression and nonlinear regression, which are techniques for choosing model parameters for a given set of basis functions. Next, we discuss cross validation, which is a critical technique for selecting an appropriate model form. Finally, we discuss common basis functions.

10.3.1 Linear Least Squares Regression

A linear regression model does not mean that the surrogate is linear in the input variables but rather that the model is linear in its coefficients (i.e., linear in the parameters we are estimating). For example, the following equation is a two-dimensional linear regression model, where we use $\hat{f}$ to represent our estimated model of the function $f$:

$$ \hat{f}(x) = w_1 x_1^2 + w_2 x_1 x_2 + w_3 \exp(x_2) + w_4 x_1 + w_5 \, . $$

This function is highly nonlinear, but it is classified as a linear regression model because the regression seeks to choose the appropriate values for the coefficients $w_i$ (and the function is linear in $w$).

A general linear regression model can be expressed as

$$ \hat{f} = w^\intercal \psi(x) = \sum_i w_i \psi_i(x) \, , $$

where $w$ is a vector of weights, and $\psi$ is a vector of basis functions. In this section, we assume that the basis functions are provided. In general, the basis functions can be any set of functions that we choose (and typically they are nonlinear). It is usually desirable for these functions to be orthogonal.

The coefficients are chosen to minimize the error between our predicted function values $\hat{f}$ and the actual function values $f^{(i)}$. Because we want to minimize both positive and negative errors, we minimize the sum of the squares of the errors (or a weighted sum of squared errors):[3]

$$ \underset{w}{\text{minimize}} \quad \sum_i \left(\hat{f}\left(w; x^{(i)} \right) - f^{(i)} \right)^2 \, . $$

The solution to this optimization problem is called a least squares solution. If the regression model is linear, we can simplify this objective and solve the problem analytically. Recall that $\hat{f} = \psi^\intercal w$, so the objective can be written as

$$ \underset{w}{\text{minimize}} \quad \sum_i \left( \psi\left(x^{(i)}\right)^\intercal w - f^{(i)} \right)^2 \, . $$

We can express this in matrix form by defining the following:

$$ \Psi = \begin{bmatrix} \text{---} & \psi\left(x^{(1)}\right)^\intercal & \text{---} \\ \text{---} & \psi\left(x^{(2)}\right)^\intercal & \text{---} \\ & \vdots & \\ \text{---} & \psi\left(x^{(n_s)}\right)^\intercal & \text{---} \end{bmatrix} \, . $$

The matrix $\Psi$ is of size $(n_s \times n_w)$, where $n_s$ is the number of samples, $n_w$ the number of parameters in $w$, and $n_s \ge n_w$. This means that there should be more equations than unknowns or that we have sampled more points than the number of coefficients we need to estimate. This should make sense because our surrogate function is only an assumed form and generally not an exact fit to the actual underlying function. Thus, we need more data to create a good fit.

Then the optimization problem can be written in matrix form as:

$$ \underset{w}{\text{minimize}} \quad \| \Psi w - f \|_2^2 \, . $$

Expanding the squared norm (i.e., $\|x\|_2^2 = x^\intercal x$) gives

$$ \underset{w}{\text{minimize}} \quad w^\intercal \Psi^\intercal \Psi w - 2 f^\intercal \Psi w + f^\intercal f \, . $$

We can omit the last term from the objective because our optimization variables are $w$, and the last term has no $w$ dependence:

$$ \underset{w}{\text{minimize}} \quad w^\intercal \Psi^\intercal \Psi w - 2 f^\intercal \Psi w \, . $$

This fits the general form for an unconstrained quadratic programming (QP) problem, as shown in Section 5.5.1:

$$ \underset{x}{\text{minimize}} \quad \frac{1}{2} x^\intercal Q x + q^\intercal x \, , $$

where

$$ \begin{aligned} Q &= 2 \Psi^\intercal \Psi \\ q &= -2 \Psi^\intercal f \, . \end{aligned} $$

Recall that an equality-constrained QP (of which the unconstrained case is a subset) has an analytic solution as long as the QP is positive definite. In our case, we can show that $Q$ is positive definite as long as $\Psi$ is full rank:

$$ x^\intercal Q x = 2 x^\intercal \Psi^\intercal \Psi x = 2 \| \Psi x \|_2^2 > 0 \, . $$

This is not surprising because the objective is a sum of squared values. Referring back to the solution in Section 5.5.1, and removing the portions associated with the constraints, the solution is

$$ Q x = -q \, . $$

In our case, this becomes

$$ 2 \Psi^\intercal \Psi w = 2 \Psi^\intercal f \, . $$

After simplifying, we have an analytic solution for the weights:

$$ w = \left( \Psi^\intercal \Psi \right)^{-1} \Psi^\intercal f \, . $$

We sometimes express the linear relationship in Equation 10.12 as $\Psi w = f$, although the case where there are more equations than unknowns does not typically have a solution (the problem is overdetermined). Instead, we seek the solution that minimizes the error $\|\Psi w - f\|^2$, that is, Equation 10.23. The quantity $\Psi^\dagger = \left(\Psi^\intercal \Psi \right)^{-1} \Psi^\intercal$ is called the pseudoinverse of $\Psi$ (or more specifically, the Moore–Penrose pseudoinverse), and thus we can write Equation 10.23 in the more compact form

$$ w = \Psi^\dagger f \, . $$

This allows for a similar form to solving a linear system of equations where an inverse would be used instead. In solving both a linear system and the linear least-squares equation (Equation 10.23), we do not explicitly invert a matrix. For linear least squares, a QR factorization is commonly used for improved numerical conditioning as compared to solving Equation 10.23 directly.
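As an illustration (with an arbitrary polynomial basis and synthetic data), the following sketch assembles $\Psi$ and solves the least-squares problem with `numpy.linalg.lstsq`, which relies on an orthogonal factorization rather than forming $(\Psi^\intercal \Psi)^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (for illustration only).
x = rng.uniform(-2.0, 2.0, size=30)
f = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * rng.normal(size=x.size)

# Basis functions psi(x) = [1, x, x^2]; each row of Psi is psi(x^(i))^T.
Psi = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares weights; lstsq avoids explicitly inverting Psi^T Psi.
w, *_ = np.linalg.lstsq(Psi, f, rcond=None)

fhat = Psi @ w  # surrogate predictions at the training points
```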

A common variation of this approach is to use regularized least squares, which adds a term in the objective. The new objective is

$$ \underset{w}{\text{minimize}} \quad \| \Psi w - f \|_2^2 + \mu \| w \|_2^2 \, , $$

where $\mu$ is a weight assigned to the second term. This second term attempts to reduce the magnitudes of the entries in $w$ while balancing the fit in the first term. This approach is particularly beneficial if the data contain strong outliers or are particularly noisy. The rationale for this approach is that we may want to accept a higher error (quantified by the first term) in exchange for smaller values for the coefficients. This generally leads to simpler, more generalizable models (e.g., by reducing the influence of some terms). A related extension uses a second term of the form $\| w - w_0 \|_2^2$. The idea is that we want a good fit while maintaining parameters that are close to some known nominal values $w_0$.

A regularized least squares problem can be solved with the same linear least squares approach. We can write the previous problem using concatenated matrices and vectors:

$$ \underset{w}{\text{minimize}} \quad \left\| \begin{bmatrix}\Psi \\ \sqrt{\mu}\, I \end{bmatrix} w - \begin{bmatrix} f \\ 0\end{bmatrix} \right\|_2^2 \, . $$

This is of the same linear form as before ($\| Aw - b \|^2$), so we can reuse the solution (Equation 10.23):

$$ \begin{aligned} w^* &= \left( A^\intercal A \right)^{-1} A^\intercal b \\ &= \left( \Psi^\intercal \Psi + \mu I \right)^{-1} \Psi^\intercal f \, . \end{aligned} $$
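A sketch of the regularized variant, solved with the same machinery by stacking $\sqrt{\mu}\,I$ below $\Psi$ and zeros below $f$ (the data, basis, and value of $\mu$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=30)
f = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * rng.normal(size=x.size)
Psi = np.column_stack([np.ones_like(x), x, x**2])

mu = 1e-2  # regularization weight (illustrative value)

# Augmented least-squares system: || [Psi; sqrt(mu) I] w - [f; 0] ||_2^2
A = np.vstack([Psi, np.sqrt(mu) * np.eye(Psi.shape[1])])
b = np.concatenate([f, np.zeros(Psi.shape[1])])
w_reg, *_ = np.linalg.lstsq(A, b, rcond=None)
```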

For linear least squares (with or without regularization), we have seen that the optimization problem of determining the appropriate coefficients can be found analytically. We can also add linear constraints to the problem (equality or inequality), and the optimization remains a QP. In that case, the problem is still convex. Although it does not generally have an analytic solution, we can still quickly find the global optimum. This topic is discussed in Section 11.3.

10.3.2 Maximum Likelihood Interpretation

This section presents an alternative motivation for the sum of squared error approach used in the previous section. It is somewhat of a diversion from the present discussion, but it will be helpful in several results later in this chapter. In the previous section, we assumed linear models of the form

$$ \hat{f}^{(i)} = w^\intercal x^{(i)} \, , $$

where we use $x$ for simplicity in writing instead of $\psi(x^{(i)})$. The derivation remains the same for any arbitrary function of $x$. The function $\hat{f}$ is just a model, so we could say that it is equal to our actual observations $f^{(i)}$ plus an error term:

$$ f^{(i)} = w^\intercal x^{(i)} + \varepsilon^{(i)} \, , $$

where $\varepsilon^{(i)}$ captures the error associated with the $i$th data point. We assume that the error is normally distributed with mean zero and a standard deviation of $\sigma$:

$$ p\left( \varepsilon^{(i)} \right) = \frac{1}{\sigma \sqrt{2 \pi}}\exp\left(-\frac{{\varepsilon^{(i)}}^2}{2 \sigma^2} \right) \, . $$

The use of Gaussian uncertainty can be motivated by the central limit theorem, which states that for large sample sizes, the sum of random variables tends toward a Gaussian distribution regardless of the original distribution associated with each variable. In other words, the sampling distribution of the sample mean is approximately Gaussian. Because we assume the error terms to be the sum of various random, independent perturbations, the central limit theorem leads us to expect the errors to be normally distributed.

We now substitute Equation 10.29 into Equation 10.30 to show the probability of $f$ conditioned on $x$ and parameterized by $w$:

$$ p\left(f^{(i)} \mid x^{(i)}; w \right) = \frac{1}{\sigma \sqrt{2 \pi}}\exp\left(-\frac{\left(f^{(i)} - w^\intercal x^{(i)}\right)^2}{2 \sigma^2} \right) \, . $$

Once we include all the data points $x^{(i)}$, we would like to compute the probability of observing $f$ conditioned on the inputs $x$ for a given set of parameters in $w$. We call this the likelihood function $L(w)$. In this case, assuming all errors are independent, the total probability for observing the outputs is the product of the probability of observing each output:

$$ L(w) = \prod_{i = 1}^{n_s} \frac{1}{\sigma \sqrt{2 \pi}}\exp\left(-\frac{\left(f^{(i)} - w^\intercal x^{(i)}\right)^2}{2 \sigma^2} \right) \, . $$

Now we can pose this as an optimization problem where we wish to find the parameters $w$ that maximize the likelihood function; in other words, we maximize the probability that our model is consistent with the observed data. Because the objective is a product of multiple terms, it is helpful to take the logarithm of the objective. Maximizing $L$ or maximizing $\ell = \ln(L)$ does not change the solution to the problem but makes it easier to solve. We call this the log likelihood function:

$$ \begin{aligned} \ell(w) &= \ln\left(\prod_{i = 1}^{n_s} \frac{1}{\sigma \sqrt{2 \pi}}\exp\left(-\frac{\left(f^{(i)} - w^\intercal x^{(i)}\right)^2}{2 \sigma^2} \right) \right)\\ &= \sum_{i = 1}^{n_s} \ln\left( \frac{1}{\sigma \sqrt{2 \pi}}\exp\left(-\frac{\left(f^{(i)} - w^\intercal x^{(i)}\right)^2}{2 \sigma^2} \right) \right) \\ &= n_s \ln\left(\frac{1}{\sigma \sqrt{2 \pi}}\right) - \sum_{i = 1}^{n_s} \frac{\left(f^{(i)} - w^\intercal x^{(i)}\right)^2}{2 \sigma^2} \, . \end{aligned} $$

The first term has no dependence on $w$, so when optimizing $\ell(w)$ it is just a constant that can be removed as follows:

$$ \begin{aligned} &\underset{w}{\text{maximize}} \; \ell(w) \quad \Rightarrow \\ &\underset{w}{\text{maximize}} \left(- \sum_{i = 1}^{n_s} \frac{\left(f^{(i)} - w^\intercal x^{(i)}\right)^2}{2 \sigma^2} \right) \quad \Rightarrow \\ &\underset{w}{\text{minimize}} \sum_{i = 1}^{n_s} \frac{\left(f^{(i)} - w^\intercal x^{(i)}\right)^2}{2 \sigma^2} \, . \end{aligned} $$

Similarly, the denominator of the second term has no dependence on $w$ and is just a scalar that can also be removed:

$$ \underset{w}{\text{maximize}} \; \ell(w) \quad \Rightarrow \quad \underset{w}{\text{minimize}} \sum_{i = 1}^{n_s} \left( f^{(i)} - w^\intercal x^{(i)} \right)^2 \, . $$

Thus, maximizing the log likelihood function (maximizing the probability of observing the data) is equivalent to minimizing the sum of squared errors (the least-squares formulation). This derivation provides another motivation for using the sum of squared errors in regression.

10.3.3 Nonlinear Least Squares Regression

A surrogate model can be nonlinear in the coefficients. For example, building on the simple function shown earlier in Equation 10.9, we can add the coefficients $w_6$ and $w_7$ as follows:

$$ \hat{f}(x) = w_1 x_1^{w_6} + w_2 x_1 x_2 + w_3 \exp(w_7 x_2) + w_4 x_1 + w_5 \, . $$

The addition of $w_6$ and $w_7$ makes this function nonlinear in the coefficients. We can still estimate these parameters, but not analytically.

Equation 10.11 is still relevant; we still seek to minimize the sum of the squared errors:

$$ \underset{w}{\text{minimize}} \quad \sum_i \left(\hat{f}\left(w ; x^{(i)}\right) - f^{(i)} \right)^2 \, . $$

For general nonlinear regression models, we cannot write a more specific form for $\hat{f}$ as we could for the linear case, so we leave the objective as it is.

This is a nonlinear least-squares problem. The optimization problem is unconstrained, so any of the methods from Chapter 4 apply. We could also easily add constraints, for example, bounds on parameters, known relationships between parameters, and so forth, and use the methods from Chapter 5.

In contrast to the linear case, we need to provide a starting point, our best guess for the parameters, and we may need to deal with scaling, noise, multimodality, or any of the other potential challenges of general nonlinear optimization. Still, this is a relatively straightforward problem within the broader realm of engineering optimization problems.

Although the methods of Chapter 4 can be used if the problem remains unconstrained, there are more specialized methods available that take advantage of the specific structure of the problem. One popular approach to solving the nonlinear least-squares problem is the Levenberg–Marquardt algorithm, which we discuss in this section.

As a stepping stone toward the Levenberg–Marquardt algorithm, we first derive the Gauss–Newton algorithm, which is a modification of Newton's method (Section 3.8) for solving nonlinear least-squares problems. One way to think of this algorithm is as an iterative linearization of the residual. Once it is linearized, we can apply the same methods we derived for linear least squares. We linearize the residual $r = \hat{f}(w) - f$ at iteration $k$ as

$$ r(w) \approx r(w_k) + J_r \Delta w \, , $$

where $\Delta w$ is the step and the Jacobian is

$$ {J_r}_{ij} = \frac{\partial r_i}{\partial w_j} \, . $$

After the linearization, the objective becomes

$$ \text{minimize} \quad \| J_r \Delta w + r \|_2^2 \, . $$

This is now the same form as linear least squares (Equation 10.14), so we can reuse its solution (Equation 10.23) to solve for the step

$$ \Delta w = -\left( J_r^\intercal J_r \right)^{-1} J_r^\intercal r \, . $$

We now have an update formula for the coefficients at each iteration:

$$ w_{k+1} = w_k - \left( J_r^\intercal J_r \right)^{-1} J_r^\intercal r \, . $$

An alternative derivation for this formula is to consider a Newton step for an unconstrained optimizer. The objective is $e = \sum_i r_i^2$, and the formula for a Newton step (Section 4.4.3) is

$$ w_{k+1} = w_k - H_e^{-1} \nabla e \, . $$

The gradient is

$$ \nabla e_j = 2 r_i \frac{\partial r_i}{\partial w_j} \, , $$

or in matrix form:

$$ \nabla e = 2 J_r^\intercal r \, . $$

The Hessian in index notation is

$$ {H_e}_{jk} = 2 \frac{\partial r_i}{\partial w_j} \frac{\partial r_i}{\partial w_k} + 2 r_i \frac{\partial^2 r_i}{\partial w_j \partial w_k} \, . $$

We can write it in matrix form as follows:

$$ H_e = 2 J_r^\intercal J_r + 2 r H_r \, . $$

If we neglect the second term in the Hessian, then the Newton update is:

$$ \begin{aligned} w_{k+1} &= w_k - \frac{1}{2} \left(J_r^\intercal J_r\right)^{-1} 2 J_r^\intercal r \\ &= w_k - \left(J_r^\intercal J_r\right)^{-1} J_r^\intercal r \, , \end{aligned} $$

which is the same update as before.

Thus, another interpretation of this method is that a Gauss–Newton step is a modified Newton step where the second derivatives of the residual are neglected (and thus, a quasi-Newton approach to estimate second derivatives is not needed). This method is particularly effective near convergence because as $r \rightarrow 0$ (i.e., as we approach the solution to our residual minimization), the neglected term also approaches zero. The appeal of this approach is that we can often obtain an accurate prediction for the Hessian using only the first derivatives because of the known structure of the objective.

When the second term is not small, then the Gauss–Newton step may be too inaccurate. We could use a line search, but the Levenberg–Marquardt algorithm utilizes a different strategy. The idea is to regularize the problem as discussed in the previous section or, in other words, provide the ability to dampen the steps as needed. Each linearized subproblem becomes

$$ \underset{\Delta w}{\text{minimize}} \quad \| J_r \Delta w + r \|_2^2 + \mu \| \Delta w \|_2^2 \, . $$

Recall that the solution to this problem (see Equation 10.27) is

$$ \Delta w = -\left( J_r^\intercal J_r + \mu I \right)^{-1} J_r^\intercal r \, . $$

If $\mu = 0$, then we retain the Gauss–Newton step. Conversely, as $\mu$ becomes large, so that $J_r^\intercal J_r$ is negligible, the step becomes

$$ \Delta w = -\frac{1}{\mu} J_r^\intercal r \, . $$

This is precisely the steepest-descent direction for our objective (see Equation 10.49), although with a small magnitude because $\mu$ is large. The parameter $\mu$ provides some control for directions ranging between Gauss–Newton and steepest descent.

The Levenberg–Marquardt algorithm has been revised to improve the scaling for components of the gradient that are small. The second minimization term weights all parameters equally. The scaling can be improved by multiplying by a diagonal matrix in the regularization as follows:

$$ \underset{\Delta w}{\text{minimize}} \quad \| J_r \Delta w + r \|_2^2 + \mu \| D \Delta w \|_2^2 \, , $$

where $D$ is defined as

$$ D^2 = \text{diag}\left(J_r^\intercal J_r \right) \, . $$

This matrix scales the objective by the diagonal elements of the Hessian. Thus, when $\mu$ is large and the direction tends toward steepest descent, the components of the gradient are scaled by the curvature, which reduces the amount of zigzagging.

The solution to the minimization problem of Equation 10.56 is

$$ \Delta w = -\left( J_r^\intercal J_r + \mu\, \text{diag}\left(J_r^\intercal J_r \right) \right)^{-1} J_r^\intercal r \, . $$

Finally, we describe one of the possible heuristics for selecting and updating the damping parameter $\mu$. After a successful step (a sufficient reduction in the objective), $\mu$ is decreased by a factor of $\rho$ ($\mu = \mu / \rho$), shifting the next step toward Gauss–Newton. Conversely, an unsuccessful step is rejected, and $\mu$ is increased by the factor $\rho$ to add damping.

Rather than returning a scalar objective ($\sum_i r_i^2$), the user function should return the vector of residuals because that vector is needed in the update steps (Equation 10.58). A potential convergence metric is a tolerance on objective value changes between subsequent iterations. The full procedure is described in Algorithm 10.3.
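A minimal sketch of a Levenberg–Marquardt iteration along these lines (not Algorithm 10.3 verbatim; the finite-difference Jacobian, update factor, tolerance, and test problem are illustrative choices):

```python
import numpy as np

def jacobian_fd(res, w, h=1e-6):
    """Forward-difference Jacobian of the residual vector."""
    r0 = res(w)
    J = np.empty((r0.size, w.size))
    for j in range(w.size):
        wp = w.copy()
        wp[j] += h
        J[:, j] = (res(wp) - r0) / h
    return J

def levenberg_marquardt(res, w0, mu=1e-2, rho=2.0, tol=1e-10, max_iter=200):
    """Minimize sum(res(w)**2) using scaled (Marquardt) damping."""
    w = np.asarray(w0, dtype=float)
    r = res(w)
    obj = r @ r
    for _ in range(max_iter):
        J = jacobian_fd(res, w)
        A = J.T @ J
        dw = np.linalg.solve(A + mu * np.diag(np.diag(A)), -J.T @ r)
        r_new = res(w + dw)
        obj_new = r_new @ r_new
        if obj_new < obj:  # successful step: accept it and reduce the damping
            converged = abs(obj - obj_new) < tol
            w, r, obj = w + dw, r_new, obj_new
            mu /= rho
            if converged:
                break
        else:              # unsuccessful step: reject it and increase the damping
            mu *= rho
    return w

# Example: fit f_hat(x) = w1 * exp(w2 * x) to synthetic data.
rng = np.random.default_rng(0)
xd = np.linspace(0.0, 1.0, 20)
fd = 2.0 * np.exp(1.5 * xd) + 0.05 * rng.normal(size=xd.size)
w_fit = levenberg_marquardt(lambda w: w[0] * np.exp(w[1] * xd) - fd, np.array([1.0, 1.0]))
```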

10.3.4 Cross Validation

The other important consideration for developing a surrogate model is the choice of the basis functions in $\psi$. In some instances, we may know something about the model behavior and thus what type of basis functions should be used, but generally, the best way to determine the basis functions is through cross validation. Cross validation is also helpful in characterizing error, even if we already have a chosen set of basis functions. One of the reasons we use cross validation is to prevent overfitting. Overfitting occurs when we have too many degrees of freedom and closely fit a given set of data, but the resulting model has a poor predictive ability. In other words, we are fitting the noise. The following example illustrates this idea with a one-dimensional function.

The solution to the overfitting problem highlighted in Example 10.5 is cross validation. Cross validation means that we use one set of data for training (creating the model) and a different set of data for assessing its predictive error. There are many different ways to perform cross validation; we describe two. Simple cross validation is illustrated in Figure 10.17 and consists of the following steps:

  1. Randomly split your data into a training set and a validation set (e.g., a 70–30 split).

  2. Train each candidate model (the different options for $\psi$) using only the training set, but evaluate the error with the validation set. The error on previously unseen data is called the generalization error ($e_g$ in Figure 10.17).

  3. Choose the model with the lowest generalization error, and optionally retrain that model using all of the data.

Figure 10.17: Simple cross-validation process.

An alternative option that is more involved but uses the data more effectively is called $k$-fold cross validation. It is particularly advantageous when we have a small data set where we cannot afford to leave much out. This procedure is illustrated in Figure 10.18 and consists of the following steps:

  1. Randomly split your data into $n$ sets (e.g., $n = 10$).

  2. Train each candidate model using the data from all sets except one (e.g., 9 of the 10 sets) and use the remaining set for validation. Repeat for all $n$ possible validation sets and average the performance.

  3. Choose the model with the lowest average generalization error. Optionally, retrain with all the data.

The extreme version of this process, when training data are very limited, is leave-one-out cross validation (i.e., each testing subset consists of one data point).
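A sketch of $k$-fold cross validation for comparing candidate polynomial bases (synthetic data; the fit uses the linear least-squares approach of Section 10.3.1):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=40)
f = np.sin(x) + 0.1 * rng.normal(size=x.size)

def poly_basis(x, degree):
    return np.column_stack([x**p for p in range(degree + 1)])

def kfold_error(x, f, degree, n_folds=10, seed=1):
    """Average generalization error of a polynomial model over n_folds splits."""
    folds = np.array_split(np.random.default_rng(seed).permutation(x.size), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        w, *_ = np.linalg.lstsq(poly_basis(x[train], degree), f[train], rcond=None)
        pred = poly_basis(x[test], degree) @ w
        errors.append(np.mean((pred - f[test]) ** 2))
    return np.mean(errors)

# Choose the polynomial degree with the lowest average generalization error.
best_degree = min(range(1, 8), key=lambda d: kfold_error(x, f, d))
```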

Figure 10.18: Diagram of the $k$-fold cross-validation process.

10.3.5 Common Basis Functions

Although cross validation can help us find the lowest generalization error among a provided set of basis functions, we still need to determine what sets of options to consider. This selection is crucial because our model is only as good as the available options, but increasing the number of options increases computational time. The possibilities for basis functions are as numerous as the types of function. As stated before, it is generally desirable that they form an orthogonal set. We focus on a few commonly used functions.

Polynomials

Polynomials, of which we have already seen a few examples, are useful in many applications. However, we typically use low-order polynomials for regression because high-order polynomials rarely generalize well. Polynomials can be particularly effective when knowledge of the physics suggests they are an appropriate choice (e.g., drag varies quadratically with speed).

Because a lot of structure is already built into the model form, fewer data points are needed to create a reasonable model (e.g., a quadratic function in $n$ dimensions needs at least $n(n+1)/2 + n + 1$ points, so this amounts to 6 points in two dimensions, 10 points in three dimensions, and so on).

Radial Basis Functions

Another common type of basis function is a radial basis function. Radial basis functions are functions that depend on the distance from some center point and can be written as follows:

$$ \psi^{(i)} = \psi\left( \|x - c^{(i)}\| \right) = \psi\left( r^{(i)} \right) \, , $$

where $c$ is the center point, and $r$ is the radius about the center point. Although the center points can be placed anywhere, we usually choose the sampling data as centering points:

$$ \psi^{(i)} = \psi\left( \|x - x^{(i)}\| \right) \, . $$

This is often a useful choice because it captures the idea that our ability to predict function behavior is related to how close we are to known function values (in other words, nearby points are more highly correlated). This form naturally lends itself to interpolation, although regularization can be added to allow for regression. Polynomials are often combined with radial basis functions because the polynomial can capture global function behavior, while the radial basis functions can introduce modifications to capture local behavior.

One popular radial basis function is the Gaussian basis:

$$ \psi^{(i)}(x) = \exp\left(-\sum_j \theta_j \left| x_j - x_j^{(i)} \right|^2\right) \, , $$

where the $\theta_j$ are the model parameters. One of the forms of kriging discussed in the following section can be viewed as a radial basis function model with a Gaussian basis.
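As a sketch, a Gaussian radial basis interpolant on one-dimensional synthetic data looks like the following (centers at the samples and a hand-picked $\theta$; in practice $\theta$ would be tuned, e.g., by cross validation):

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 10)   # training inputs, also used as centers
fs = np.sin(4.0 * xs)            # training outputs (synthetic)
theta = 20.0                     # illustrative shape parameter

def gauss_basis(x, centers, theta):
    """Matrix of Gaussian basis functions evaluated at the points x."""
    return np.exp(-theta * (np.atleast_1d(x)[:, None] - centers[None, :]) ** 2)

# Interpolation: Psi is square when the centers are the samples, so solve Psi w = f exactly.
w = np.linalg.solve(gauss_basis(xs, xs, theta), fs)

def surrogate(x):
    return gauss_basis(x, xs, theta) @ w

print(surrogate(np.array([0.25, 0.5])))
```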

10.4 Kriging

Kriging is a popular surrogate modeling technique that can build approximations of highly nonlinear engineering simulations. We may not have a simple parametric form for such simulations that we can use with regression and expect a good fit.

Instead of tuning the parameters of a functional form that describes what the function is, kriging tunes the parameters of a statistical model that describes how the function behaves.

The kriging statistical model that approximates $f$ consists of two terms: a function $\mu(x)$ that is meant to capture some of the function behavior and a random variable $Z(x)$. Thus, we can write the kriging model as

$$ F(x) = \mu(x) + Z(x) \quad \text{where} \quad Z(x) \sim \mathcal{N}(0, \sigma^2) \, . $$

When we evaluate the function we want to approximate at a point $x$, we get a scalar value $f(x)$. In contrast, when we evaluate the stochastic process (Equation 10.62) at $x$, we get a random variable $F(x)$ that has a normal distribution with mean $\mu$ and variance $\sigma^2$. Although we wrote $\mu$ as a function of $x$, most kriging models consider this to be constant because the random variable term alone is effective in capturing the function behavior.

For the rest of this section, we discuss the case with constant $\mu$, which is called ordinary kriging. Kriging is also referred to as Gaussian process interpolation or, more generally in the regression case discussed later in this section, as Gaussian process regression.

The power of the statistical model lies in how it treats the correlation between the random variables. Although we do not know the exact form of the error term $Z(x)$, we can still make some reasonable assumptions about it. Consider two points in a sampling plan, $x^{(i)}$ and $x^{(j)}$, and the corresponding terms, $Z\left(x^{(i)}\right)$ and $Z\left(x^{(j)}\right)$. Intuitively, we expect $Z\left(x^{(i)}\right)$ to be close to $Z\left(x^{(j)}\right)$ whenever $x^{(i)}$ is close to $x^{(j)}$. Therefore, it seems reasonable to assume that the correlation between $Z\left(x^{(i)}\right)$ and $Z\left(x^{(j)}\right)$ is a function of the distance between the two points.

In kriging, we assume that this correlation is given by a kernel function $K\left( x^{(i)}, x^{(j)}\right)$:

$$ K\left(x^{(i)}, x^{(j)} \right) = \text{corr}\left( Z\left(x^{(i)}\right), Z\left(x^{(j)} \right) \right) \, . $$

As a matrix, the kernel is represented as $K_{ij} = K\left(x^{(i)}, x^{(j)}\right)$. Various kernel functions are used with kriging.[5] The most commonly used kernel function is

$$ K\left(x^{(i)}, x^{(j)} \right) = \exp\left(-\sum_{l=1}^{n_d} \theta_l \left| x^{(i)}_l - x^{(j)}_l \right|^{p_l}\right) \, , $$

where $\theta_l \ge 0$, $0 \le p_l \le 2$, and $n_d$ is the number of dimensions (i.e., the length of the vector $x$). If every $p_l = 2$, this becomes a Gaussian kernel.

Let us examine how the statistical model $F$ defined in Equation 10.62 captures the typical behavior of the function $f$.

The parameter $\mu$ captures the typical value, and $\sigma^2$ captures the expected variance. The kernel (or correlation) function (Equation 10.64) implicitly models continuous functions. If $f$ is continuous, we know that, as $|x^{(i)} - x^{(j)}| \rightarrow 0$, then $|f(x^{(i)}) - f(x^{(j)})| \rightarrow 0$. This is captured in the kernel function because as $|x^{(i)} - x^{(j)}| \rightarrow 0$, the correlation approaches 1. The parameter $\theta_l$ captures how active the function $f$ is in the $l$th coordinate direction. A unit difference in variable $l$ ($|x^{(i)}_l - x^{(j)}_l| = 1$) has a more significant impact on the correlation when $\theta_l$ is large. The exponent $p_l$ describes the smoothness of the function in the $l$th coordinate direction. Values of $p_l$ close to 2 produce smooth functions, whereas values closer to zero produce functions with more variation.

Kriging surrogate modeling involves two main steps.

The first step consists of using the data to estimate the statistical model parameters $\mu$, $\sigma^2$, $\theta_1,\ldots,\theta_{n_d}$, and $p_1,\ldots,p_{n_d}$. The second step consists of making predictions using the statistical model and these estimated parameter values.

The parameter estimation uses the same maximum likelihood approach from Section 10.3.2, but now it is more complicated. Let us denote the random variable as $F^{(i)} \equiv F\left(x^{(i)}\right)$ and the vector of random variables as $F = \left[ F^{(1)}, \ldots, F^{(n_s)}\right]$, where $n_s$ is the number of samples. Similarly, $f^{(i)} \equiv f\left(x^{(i)}\right)$, and the vector of observed function values is $f \equiv \left[f^{(1)},\ldots,f^{(n_s)}\right]$. Using this notation, we can say that the vector $F$ is jointly normally distributed; this is also known as a multivariate Gaussian distribution.[6] The probability density function (PDF) (the likelihood that $F = f$) is

$$ p(f) = \frac{1}{(2\pi)^{n_s/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(f - e \mu)^\intercal \Sigma^{-1} (f - e \mu) \right] \, , $$

where $e$ is a vector of 1s with size $n_s$, and $|\Sigma|$ is the determinant of the covariance,

$$ \Sigma_{ij} = \sigma^2 K\left( x^{(i)}, x^{(j)} \right) \, . $$

The covariance between two elements $F^{(i)}$ and $F^{(j)}$ of $F$ is related to correlation by the following definition:[7]

$$ \Sigma_{ij} = \text{cov}\left( F^{(i)}, F^{(j)} \right) = \sigma^2 \text{corr}\left( F^{(i)}, F^{(j)} \right) = \sigma^2 K\left(x^{(i)}, x^{(j)} \right) \, . $$

We assume stationarity of the second moment; that is, the variance $\sigma^2$ is constant in the domain.

The statistical model parameters $\theta_1,\ldots,\theta_{n_d}$ and $p_1,\ldots,p_{n_d}$ enter the likelihood (Equation 10.65) via their effect on the kernel $K$ (Equation 10.64) and hence on the covariance matrix $\Sigma$ (Equation 10.66).

We estimate the parameters using the maximum log likelihood approach from Section 10.3.2; that is, we maximize the probability of observing our data $f$ conditioned on the parameters $\mu$ and $\Sigma$. Using the PDF (Equation 10.65) where $\mu$ is constant and the covariance is $\Sigma = \sigma^2 K$ yields the following likelihood function:

$$ L(\mu, \sigma, \theta, p) = \frac{1}{(2 \pi)^{n_s/2} \sigma^{n_s} |K|^{1/2}} \exp\left[ -\frac{(f - e\mu)^\intercal K^{-1} (f - e\mu)}{2 \sigma^2} \right] \, . $$

We now need to find the parameters $\mu$, $\sigma$, $\theta_i$, and $p_i$ that maximize this likelihood function, that is, maximize the probability of our observations $f$.

As before, we take the logarithm to form the log likelihood function:

$$ \ell(\mu, \sigma, \theta, p) = -\frac{n_s}{2}\ln(2 \pi) - \frac{n_s}{2} \ln\left(\sigma^2\right) - \frac{1}{2}\ln{|K|} - \frac{(f - e\mu)^\intercal K^{-1} (f - e\mu)}{2 \sigma^2} \, . $$

We can maximize part of this term analytically by taking derivatives with respect to μ\mu and σ\sigma, setting them equal to zero, and solving for their optimal values to obtain:

\begin{align*} \mu^* &= \frac{e^\intercal K^{-1} {f}}{e^\intercal K^{-1} e}\\ {\sigma^*}^2 &= \frac{({f} - e\mu^*)^\intercal K^{-1} ({f} - e\mu^*)}{n_s} \, . \end{align*}

We now substitute these values back into the log likelihood function (Equation 10.68), which yields

(θ,p)=ns2ln(σ2)12lnK. \ell(\theta, p) = -\frac{n_s}{2}\ln \left({\sigma^*}^2 \right) - \frac{1}{2}\ln{|K|} \, .

This function, also called the concentrated likelihood function, only depends on the kernel KK, which depends on θ\theta and pp.

We cannot solve for the optimal values of θ\theta and pp analytically. Instead, we rely on numerical optimization to maximize Equation 10.71. Because θ\theta can vary across a broad range, it is often better to search using logarithmic scaling. Once we solve that optimization problem, we compute the mean and variance using Equation 10.69 and Equation 10.70.
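
To make these two steps concrete, the following Python sketch fits the hyperparameters by numerically maximizing the concentrated log likelihood (Equation 10.71). It assumes the exponents are fixed at p = 2, searches θ on a logarithmic scale, and adds a tiny diagonal term to keep the correlation matrix invertible; the function names and the choice of a Nelder–Mead search are illustrative rather than prescriptive.

```python
import numpy as np
from scipy.optimize import minimize

def concentrated_log_likelihood(log_theta, X, f):
    """Concentrated log likelihood (Equation 10.71) for a kriging model with p = 2.

    X is (n_s, n_d), f is (n_s,). Returns the value to be maximized."""
    theta = 10.0 ** log_theta                       # search theta on a log scale
    n_s = X.shape[0]
    d2 = (((X[:, None, :] - X[None, :, :]) ** 2) * theta).sum(axis=-1)
    K = np.exp(-d2) + 1e-10 * np.eye(n_s)           # small nugget for invertibility
    e = np.ones(n_s)
    Kinv_f = np.linalg.solve(K, f)
    Kinv_e = np.linalg.solve(K, e)
    mu = (e @ Kinv_f) / (e @ Kinv_e)                # Equation 10.69
    r = f - mu * e
    sigma2 = (r @ np.linalg.solve(K, r)) / n_s      # closed-form variance estimate
    sign, logdet = np.linalg.slogdet(K)
    return -0.5 * n_s * np.log(sigma2) - 0.5 * logdet

def fit_kriging(X, f):
    """Fit theta by maximizing the concentrated likelihood (minimizing its negative)."""
    n_d = X.shape[1]
    res = minimize(lambda lt: -concentrated_log_likelihood(lt, X, f),
                   x0=np.zeros(n_d), method="Nelder-Mead")
    return 10.0 ** res.x
```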

Now that we have a fitted model, we can make predictions at new points where we have not sampled. We do this by substituting xpx_p into a formula called the kriging predictor. The formula is unique, but there are many ways to derive it.

One way to derive it is to find the function value at xpx_p that is the most consistent with the behavior of the function captured by the fitted kriging model.

Let fpf_p be our guess for the value of the function at xpx_p. One way to assess the consistency of our guess is to add (xp,fp)(x_p, f_p) as an artificial point to our training data (so that we now have ns+1n_s+1 points) and estimate the likelihood using the parameters from our fitted kriging model. The likelihood of this augmented data can now be thought of as a function of fpf_p: high values correspond to guessed values of fpf_p that are consistent with function behavior captured by the fitted kriging model. Therefore, the value of fpf_p that maximizes the likelihood of this augmented data set is a natural way to predict the value of the function. This is an optimization problem with a closed-form solution, and the corresponding formula is the kriging predictor.

Now we outline the derivation of the kriging predictor.[8]

With the augmented point, our function values are fˉ=[f,fp]\bar{f} = [{f}, f_p], where ff is the nsn_s-vector of function values from the original training data. Then, the correlation matrix with the additional data point is

Kˉ=[Kkk1],\bar K = \begin{bmatrix} K & k \\ k^\intercal & 1 \\ \end{bmatrix} \, ,

where kk is the correlation of the new point with the training data given by

k=[corr(F(x(1)),F(xp))=K(x(1),xp)corr(F(x(ns)),F(xp))=K(x(ns),xp)].k = \begin{bmatrix} \text{corr}\left(F \left(x^{(1)} \right), F (x_p) \right) = K \left(x^{(1)}, x_p \right) \\ \vdots\\ \text{corr}\left(F \left(x^{(n_s)}\right), F(x_p)\right) = K\left(x^{(n_s)}, x_p\right)\\ \end{bmatrix} \, .

The 1 in the bottom right of the augmented correlation matrix (Equation 10.72) is because the correlation of the new variable F(xp)F(x_p) with itself is 1. The log likelihood function with these new augmented vectors and the previously determined parameters is as follows (see Equation 10.68):

(fp)=ns2ln(2π)ns2ln(σ2)12lnKˉ(fˉeμ)Kˉ1(fˉeμ)2σ2.\ell(f_p) = -\frac{n_s}{2}\ln(2 \pi) - \frac{n_s}{2} \ln({\sigma^*}^2) - \frac{1}{2}\ln{|\bar{K}|} -\frac{(\bar{f} - e\mu^*)^\intercal \bar{K}^{-1} (\bar{f} - e\mu^*)}{2 {\sigma^*}^2} \, .

We want to maximize this function with respect to fpf_p. Because only the last term depends on fpf_p (it is a part of fˉ\bar{f}) we can omit the other terms and formulate the following:

\underset{f_p}{\text{maximize}} \quad \ell(f_p) = -\frac{\left(\bar{f} - e\mu^* \right)^\intercal \bar{K}^{-1} \left(\bar{f} - e\mu^* \right)}{2 {\sigma^*}^2} \, .

This problem can be solved analytically, yielding the mean value of the kriging prediction,

fp=μ+kK1(feμ). f_p = \mu^* + k^\intercal K^{-1}({f} - e \mu^*) \, .

The mean square error of the kriging prediction (that is, the expected squared value of the error) is given by[9]

σp2=σ2[1kK1k+(1kK1e)2eK1e]. \sigma_{p}^2 = {\sigma^*}^2\left[1 - k^\intercal K^{-1} k + \frac{(1 - k^\intercal K^{-1} e)^2}{e^\intercal K^{-1} e} \right] \, .

One attractive feature of kriging models is that they are interpolatory and thus match the training data exactly. To see why, suppose xpx_p is the same as one of the training data points, x(i)x^{(i)}; then kk is just the iith column of KK. Hence, K1kK^{-1}k is the vector eie_i, with all zeros except for a 1 in the iith element. In the prediction (Equation 10.75), kK1=eik^\intercal K^{-1}=e_i^\intercal, and so the last term is f(i)μf^{(i)}-\mu^*, which means that fp=f(i)f_p = f^{(i)}.

In the mean square error (Equation 10.76), kK1kk^\intercal K^{-1} k is the same as keik^\intercal e_i. This is the iith element of kk, which is 1. Therefore, the first two terms in the brackets in Equation 10.76 cancel, and the last term is zero, yielding σp2=0\sigma^2_p=0. This is expected; if we already sampled the point, the uncertainty about its function value should be zero.

When describing a fitted kriging model, we often refer to the standard error as the square root of this quantity (i.e., σp2\sqrt{\sigma_{p}^2}). The standard error is directly related to the confidence interval (e.g., ±1\pm 1 standard error corresponds to a 68 percent confidence interval).
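
As an illustration, the following sketch (using the same kernel assumptions and NumPy conventions as the fitting sketch above) evaluates the kriging predictor (Equation 10.75) and the standard error derived from the mean square error (Equation 10.76) at a new point; the function and argument names are hypothetical.

```python
import numpy as np

def kriging_predict(xp, X, f, theta):
    """Mean and standard error of the kriging prediction at a point xp,
    assuming the Gaussian kernel with p = 2."""
    n_s = X.shape[0]
    e = np.ones(n_s)
    K = np.exp(-(((X[:, None, :] - X[None, :, :]) ** 2) * theta).sum(axis=-1))
    K += 1e-10 * np.eye(n_s)                        # small nugget for invertibility
    k = np.exp(-(((X - xp) ** 2) * theta).sum(axis=-1))   # correlation of xp with the samples
    Kinv_f = np.linalg.solve(K, f)
    Kinv_e = np.linalg.solve(K, e)
    Kinv_k = np.linalg.solve(K, k)
    mu = (e @ Kinv_f) / (e @ Kinv_e)                # Equation 10.69
    r = f - mu * e
    sigma2 = (r @ np.linalg.solve(K, r)) / n_s      # closed-form variance estimate
    fp = mu + k @ np.linalg.solve(K, r)             # Equation 10.75
    mse = sigma2 * (1.0 - k @ Kinv_k + (1.0 - k @ Kinv_e) ** 2 / (e @ Kinv_e))  # Equation 10.76
    return fp, np.sqrt(max(mse, 0.0))               # prediction and standard error
```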

If we can provide the gradients of the function at the training data points (in addition to the function values), we can use that information to build a more accurate kriging model. This approach is called gradient-enhanced kriging (GEK). The methodology is the same as before, except we add more observed outputs (i.e., in addition to the function values at the sampled points, we add their gradients). In addition to considering the correlation between the function values at different sampled points, the kernel matrix KK needs to be expanded to consider correlations between function values and gradients, gradients and function values, and among gradient components.

We can still use the predictor (Equation 10.75) and the mean square error (Equation 10.76) for GEK if we plug in “expanded versions” of the outputs f{f}, the vector kk, the matrix KK, and the vector of 1s, ee.

We expand the output vector to include not just the function values at the sampled points but also their gradients:

f_\text{GEK} \equiv \begin{bmatrix} f_1 \\ \vdots \\ f_{n_s} \\ \nabla f_1 \\ \vdots \\ \nabla f_{n_s} \end{bmatrix} \, .

This vector is of length ns+nsndn_s + n_s n_d, where ndn_d is the dimension of xx. The gradients are usually provided at the same xx locations as the function samples, but that is not required.

Recall that the term eμe\mu^* in Equation 10.75 for the kriging predictor represents the expected value of the random variables F(1),,F(ns)F^{(1)},\ldots,F^{(n_s)}. Now that we have expanded the outputs to include the gradients at the sampled points, the mean vector needs to be expanded to include the expected values of F(i)\nabla F^{(i)}, which are all zero.

We can still use eμe\mu^* in the formula for the predictor if we use the following definition:

eGEK[1,,1,0,,0],e_\text{GEK} \equiv [1, \ldots, 1, 0, \ldots, 0] \, ,

where 1 occurs for the first nsn_s entries, and 0 for the remaining nsndn_s n_d entries.

The additional correlations (between function values and derivatives and between the derivatives themselves) are as follows:

\begin{aligned} \text{corr}\left(F\left(x^{(i)}\right), F\left(x^{(j)}\right)\right) &= K_{ij} \\ \text{corr}\left(F\left(x^{(i)}\right), \frac{\partial F\left(x^{(j)}\right)}{\partial x_l}\right) &= \frac{\partial K_{ij}}{\partial x^{(j)}_l}\\ \text{corr}\left(\frac{\partial F\left(x^{(i)}\right)}{\partial x_l}, F\left(x^{(j)}\right)\right) &= \frac{\partial K_{ij}}{\partial x^{(i)}_l}\\ \text{corr}\left(\frac{\partial F\left(x^{(i)}\right)}{\partial x_l}, \frac{\partial F\left(x^{(j)}\right)}{\partial x_k}\right) &= \frac{\partial^2 K_{ij}}{\partial x^{(i)}_l \partial x^{(j)}_k} \, . \end{aligned}

Here, we use ll and kk to represent a component of a vector, and we use KijK(x(i),x(j))K_{ij} \equiv K\left(x^{(i)}, x^{(j)}\right) as shorthand. For our particular kernel choice (Equation 10.64), these correlations become the following:

\begin{aligned} K_{ij} &= \exp\left(-\sum_{l=1}^{n_d} \theta_l \left(x^{(i)}_l - x^{(j)}_l \right)^2\right) \\ \frac{\partial K_{ij}}{\partial x^{(j)}_l} &= 2 \theta_l \left(x^{(i)}_l - x^{(j)}_l \right) K_{ij} \\ \frac{\partial K_{ij}}{\partial x^{(i)}_l} &= - \frac{\partial K_{ij}}{\partial x^{(j)}_l} \\ \frac{\partial^2 K_{ij}}{\partial x^{(i)}_l \partial x^{(j)}_k} &= \begin{cases} -4 \theta_l\theta_k \left(x^{(i)}_k - x^{(j)}_k\right)\left(x^{(i)}_l - x^{(j)}_l\right) K_{ij} & l \ne k \\ -4 \theta_l^2 \left(x^{(i)}_l - x^{(j)}_l \right)^2 K_{ij} + 2 \theta_l K_{ij} & l = k \, , \end{cases} \end{aligned}

where we used p=2p = 2. Putting this all together yields the expanded correlation matrix:

KGEK[KJKJKHK],K_\text{GEK} \equiv \begin{bmatrix} K & J_K \\ J_K^\intercal & H_K \\ \end{bmatrix} \, ,

where the (ns×nsnd)(n_s \times n_s n_d) block representing the first derivatives is

JK=[K11x(1)K1nsx(ns)Kns1x(1)Knsnsx(ns)]J_K = \begin{bmatrix} \dfrac{\partial K_{11}}{\partial x^{(1)}}^\intercal & \ldots & \dfrac{\partial K_{1 n_s}}{\partial x^{(n_s)}}^\intercal \\ \vdots & \ddots & \vdots \\ \dfrac{\partial K_{n_s 1}}{\partial x^{(1)}}^\intercal & \ldots & \dfrac{\partial K_{n_s n_s}}{\partial x^{(n_s)}}^\intercal \\ \end{bmatrix}

and the (nsnd×nsnd)(n_s n_d \times n_s n_d) matrix of second derivatives is

HK=[2K11x(1)x(1)2K1nsx(1)x(ns)2Kns1x(ns)x(1)2Knsnsx(ns)x(ns)].H_K = \begin{bmatrix} \dfrac{\partial^2 K_{11}}{\partial x^{(1)} \partial x^{(1)}} & \ldots & \dfrac{\partial^2 K_{1 n_s}}{\partial x^{(1)} \partial x^{(n_s)}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 K_{n_s 1}}{\partial x^{(n_s)} \partial x^{(1)}} & \ldots & \dfrac{\partial^2 K_{n_s n_s}}{\partial x^{(n_s)} \partial x^{(n_s)}} \\ \end{bmatrix} \, .

We can still get the estimates μ\mu^* and σ2{\sigma^*}^2 with Equation 10.69 and Equation 10.70 using the expanded versions of KK, ee, and ff and replacing nsn_s in Equation 10.70 with ns(nd+1)n_s (n_d+1), which is the new length of the outputs.

The predictor equations (Equation 10.75 and Equation 10.76) also apply with the expanded matrices and vectors. However, we also need to expand kk in these computations to include the correlations between the gradients at the sampled points with the gradient at the point xx where we make a prediction. Thus, the expanded kk is:

kGEK[kcorr(F(x(1))x(1),F(xp))=K(x(1),xp)x(1)corr(F(x(ns))x(ns),F(xp))=K(x(ns),xp)x(ns)].k_\text{GEK} \equiv \begin{bmatrix} k \\ \text{corr}\left(\dfrac{\partial F\left(x^{(1)}\right)}{\partial x^{(1)}}, F(x_p)\right) = \dfrac{\partial K\left(x^{(1)}, x_p\right)}{\partial x^{(1)}}\\ \vdots\\ \text{corr}\left(\dfrac{\partial F\left(x^{(n_s)}\right)}{\partial x^{(n_s)}}, F(x_p)\right) = \dfrac{\partial K\left(x^{(n_s)}, x_p\right)}{\partial x^{(n_s)}}\\ \end{bmatrix} \, .
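
As a concrete sketch of how these pieces fit together, the following function assembles the expanded GEK correlation matrix from the kernel and its derivatives given above, assuming the Gaussian kernel with p = 2. The ordering of the gradient entries (grouped by sample) is one reasonable convention, not the only one, and the function name is hypothetical.

```python
import numpy as np

def gek_correlation_matrix(X, theta):
    """Expanded GEK correlation matrix: function values first, then gradient
    components grouped by sample, for the Gaussian kernel with p = 2."""
    n_s, n_d = X.shape
    diff = X[:, None, :] - X[None, :, :]                     # x^(i) - x^(j)
    K = np.exp(-((diff ** 2) * theta).sum(axis=-1))          # K_ij

    # dK_ij / dx_l^(j) = 2 theta_l (x_l^(i) - x_l^(j)) K_ij
    dK_dxj = 2.0 * theta * diff * K[:, :, None]

    # d^2 K_ij / (dx_l^(i) dx_k^(j)), with the extra 2 theta_l K_ij term when l = k
    td = theta * diff
    d2K = -4.0 * td[:, :, :, None] * td[:, :, None, :] * K[:, :, None, None]
    idx = np.arange(n_d)
    d2K[:, :, idx, idx] += 2.0 * theta * K[:, :, None]

    J_K = dK_dxj.reshape(n_s, n_s * n_d)                             # value-gradient block
    H_K = d2K.transpose(0, 2, 1, 3).reshape(n_s * n_d, n_s * n_d)    # gradient-gradient block
    return np.block([[K, J_K], [J_K.T, H_K]])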

One difficulty with GEK is that the kernel matrix quickly grows in size as the dimension of the problem increases, the number of samples increases, or both. Various approaches have been proposed to improve the scaling with higher dimensions, such as a weighted sum of smaller correlation matrices9 or a partial least squares approach.6

The version of kriging in this section is interpolatory. For noisy data, a regression approach can be used by modifying the correlation matrix as follows:

KregK+τI,K_\text{reg} \equiv K + \tau I \, ,

with τ>0\tau > 0. This adds a positive constant along the diagonal, so the model no longer correlates perfectly with the provided points. The parameter τ\tau is then an additional parameter to estimate in the maximum likelihood optimization. Even for interpolatory models, this term is often still added to the covariance matrix with a small constant value of τ\tau (near machine precision) to ensure that the correlation matrix is invertible. This section focused on the most common choices when using kriging, but many other versions exist.10

10.5 Deep Neural Networks

Like kriging, deep neural nets can be used to approximate highly nonlinear simulations where we do not need to provide a parametric form. Neural networks follow the same basic steps described for other surrogate models but with a unique model leading to specialized approaches for derivative computation and optimization strategy. Neural networks loosely mimic the brain, which consists of a vast network of neurons. In neural networks, each neuron is a node that represents a simple function. A network defines chains of these simple functions to obtain composite functions that are much more complex. For example, three simple functions, f(1),f(2)f^{(1)}, f^{(2)}, and f(3)f^{(3)}, may be chained into the composite function (or network):

f(x)=f(3)(f(2)(f(1)(x))).f(x) = f^{(3)}\left(f^{(2)}\left(f^{(1)}(x)\right)\right).

Even though each function may be simple, the composite function can express complex behavior. Most neural networks are feedforward networks, meaning that information flows from inputs xx to outputs ff. Recurrent neural networks include feedback connections.

Figure 10.24 shows a diagram of a neural network. Each node represents a neuron. The neurons are connected between consecutive layers, forming a dense network. The first layer is the input layer, the last one is the output layer, and the middle ones are the hidden layers. The total number of layers is called the network’s depth. Deep neural networks have many layers, enabling the modeling of complex behavior.


Figure 10.24:Deep neural network with two hidden layers.

The first and last layers can be viewed as the inputs and outputs of a surrogate model. Each neuron in the hidden layer represents a function. This means that the output from a neuron is a number, and thus the output from a whole layer can be represented as a vector xx. We represent the vector of values for layer kk by x(k)x^{(k)}, and the value for the iith neuron in layer kk by xi(k)x^{(k)}_i.

Consider a neuron in layer kk. This neuron is connected to many neurons from the previous layer k1k-1 (see the first part of Figure 10.25). We need to choose a functional form for each neuron in the layer that takes in the values from the previous layer as inputs. Chaining together linear functions would yield another linear function. Therefore, some layers must use nonlinear functions.

The most common choice for hidden layers is a layer of linear functions followed by a layer of functions that create nonlinearity. A neuron in the linear layer produces the following intermediate variable:

z=j=1nwjxj(k1)+b.z = \sum_{j = 1}^n w_j x_j^{(k-1)} + b \, .

In vector form:

z=wx(k1)+b.z = w^\intercal x^{(k-1)} + b \, .

Figure 10.25:Typical functional form for a neuron in the neural net.

The first term is a weighted sum of the values from the neurons in the previous layer. The ww vector contains the weights. The term bb is the bias, which is an offset that scales the significance of the overall output. These two terms are analogous to the weights used in the previous section but with the constant term separated for convenience. The second column of Figure 10.25 illustrates the linear (summation and bias) layer.


Figure 10.26:Activation functions.

Next, we pass zz through an activation function, which we call a(z)a(z).

Historically, one of the most common activation functions has been the sigmoid function:

a(z)=11+ez.a(z) = \frac{1}{1 + e^{-z}} \, .

This function is shown in the top plot of Figure 10.26. The sigmoid function produces values between 0 and 1, so large negative inputs result in insignificant outputs (close to 0), and large positive inputs produce outputs close to 1.

Most modern neural nets use a rectified linear unit (ReLU) as the activation function:

a(z)=max(0,z).a(z) = \max(0, z) \, .

This function is shown in the bottom plot of Figure 10.26. The ReLU has been found to be far more effective than the sigmoid function in producing accurate neural nets. This activation function eliminates negative inputs. Thus, the bias term can be thought of as a threshold establishing what constitutes a significant value. The final two columns of Figure 10.25 illustrate the activation step.
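
Both activation functions are simple to express in code; the sketch below assumes NumPy arrays and is purely illustrative.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: passes positive inputs and zeros out negative ones."""
    return np.maximum(0.0, z)
```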

Combining the linear function with the activation function produces the output for the ithi\text{th} neuron:

xi(k)=a(wx(k1)+bi).x_i^{(k)} = a \left(w^\intercal x^{(k-1)} + b_i \right) \, .

To compute the outputs for all the neurons in this layer, the weights ww for one neuron form one row in a matrix of weights WW and we can write:

\begin{bmatrix} x_1^{(k)}\\ \vdots \\ {\color{primary} x_i^{(k)}} \\ \vdots \\ x_{n_k}^{(k)} \end{bmatrix} = a\left( \begin{bmatrix} {W_{1,1}} & {\cdots} & {W_{1,j}} & {\cdots} & {W_{1,{n_{k-1}}}} \\ \vdots & & \vdots & & \vdots \\ {\color{primary} W_{i,1}} & {\color{primary}\cdots} & {\color{primary}W_{i,j}} & {\color{primary}\cdots} & {\color{primary}W_{i,{n_{k-1}}}} \\ \vdots & & \vdots & & \vdots \\ {W_{n_k,1}} & {\cdots} & {W_{n_k,j}} & {\cdots} & {W_{n_k,{n_{k-1}}}} \end{bmatrix} \begin{bmatrix} x_1^{(k-1)}\\ \vdots \\ x_j^{(k-1)} \\ \vdots \\ x_{n_{k-1}}^{(k-1)} \end{bmatrix} + \begin{bmatrix} b_1\\ \vdots \\ {\color{primary}b_i} \\ \vdots \\ b_{n_{k}} \end{bmatrix} \right)

or

x(k)=a(Wx(k1)+b). x^{(k)} = a \left(W x^{(k-1)} + b \right).

The activation function is applied separately for each row. The following equation is more explicit (where wiw_i is the ithi\text{th} row of WW):

x_i^{(k)} = a \left(w_i^\intercal x^{(k-1)} + b_i \right) \, .

This neural net is now parameterized by a number of weights. Like other surrogate models, we need to determine the optimal value for these parameters (i.e., train the network) using training data. In the example of Figure 10.24, there is a layer of 5 neurons, 7 neurons, 7 neurons, and then 4 neurons, and so there would be 5×7+7×7+7×45 \times 7 + 7 \times 7 + 7 \times 4 weights and 7+7+47 + 7 + 4 bias terms, giving a total of 130 variables. This represents a small neural net because there are few inputs and few outputs. Large neural nets can have millions of variables. We need to optimize those variables to minimize a cost function.
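
The following sketch illustrates a forward pass through such a network and reproduces the parameter count for the layer sizes of Figure 10.24. The random initialization, the ReLU hidden layers, and the linear output layer are assumptions made for illustration.

```python
import numpy as np

def init_params(layer_sizes, rng=np.random.default_rng(0)):
    """Random weights and zero biases for a dense feedforward network."""
    return [(0.1 * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    """Forward pass: x^(k) = a(W x^(k-1) + b), with ReLU on the hidden layers."""
    for W, b in params[:-1]:
        x = np.maximum(0.0, W @ x + b)
    W, b = params[-1]
    return W @ x + b                      # linear output layer (an assumption)

layer_sizes = [5, 7, 7, 4]                # the network of Figure 10.24
params = init_params(layer_sizes)
n_vars = sum(W.size + b.size for W, b in params)
print(n_vars)                             # prints 130, matching the count in the text
```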

As before, we use a maximum likelihood estimate where we optimize the parameters θ\theta (weights and biases in this case) to maximize the probability of observing the output data yy conditioned on our inputs xx. As shown in Section 10.3.2, this results in a sum of squared errors function:

minimizeθi=1n(f^(θ;x(i))f(i))2.\underset{\theta}{\text{minimize}} \quad \sum_{i=1}^n \left(\hat{f}\left(\theta; x^{(i)}\right) - f^{(i)} \right)^2 \, .
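
As a small illustration, this objective can be written directly as a function of the network parameters; the sketch below reuses the hypothetical forward function from the previous sketch and is one possible implementation, not a library routine.

```python
def sse_loss(params, data):
    """Sum of squared errors (Equation 10.95) over (x, f) training pairs."""
    return sum(float(np.sum((forward(params, x) - f) ** 2)) for x, f in data)
```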

We now have the objective and variables in place to train the neural net. As with the other models discussed in this chapter, it is critical to set aside some data for cross validation.

Because the optimization problem (Equation 10.95) often has a large number of parameters θ\theta, we generally use a gradient-based optimization algorithm (although the algorithms of Chapter 4 are modified, as we discuss shortly). To solve Equation 10.95 using gradient-based optimization, we require the derivatives of the objective function with respect to the weights θ\theta. Because the objective is a scalar and the number of weights is large, reverse-mode algorithmic differentiation (AD) (see Section 6.6) is ideal for computing the required derivatives.

Reverse-mode AD is known in the machine learning community as backpropagation.[10] Whereas general-purpose reverse-mode AD operates at the code level, backpropagation usually operates on larger sets of operations and data structures defined in machine learning libraries. Although less general, this approach can increase efficiency and stability.

The ReLU activation function (Figure 10.26, bottom) is not differentiable at z=0z = 0, but in practice, this is generally not problematic—primarily because these methods typically rely on inexact gradients anyway, as discussed next.

The objective function in Equation 10.95 consists of a sum of subfunctions, each of which depends on a single data point (x(i),f(i))(x^{(i)}, f^{(i)}). Objective functions vary across machine learning applications, but most have this same form:

minimizeθf(θ),\underset{\theta}{\text{minimize}} f(\theta) \, ,

where

f(θ)=i=1n(θ;x(i),f(i))=i=1ni(θ).f(\theta) = \sum_{i=1}^n \ell \left(\theta; x^{(i)}, f^{(i)} \right) = \sum_{i=1}^n \ell_i(\theta) \, .

As previously mentioned, the challenge with these problems is that we often have large training sets where nn may be in the billions. That means that computing the objective can be costly, and computing the gradient can be even more costly.

If we divide the objective by nn (which does not change the solution), the objective function becomes an approximation of the expected value (see Section A.9):

f(θ)=1ni=1ni(θ)=E((θ))f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell_i(\theta) = \mathbb{E}(\ell(\theta))

From probability theory, we know that we can estimate an expected value from a smaller set of random samples. For the application of estimating a gradient, we call this subset of random samples a minibatch S={x(1)x(m)}S = \left\{x^{(1)} \ldots x^{(m)}\right\}, where mm is usually between 1 and a few hundred. The entries x(1),,x(m)x^{(1)}, \ldots, x^{(m)} do not correspond to the first mm entries of the training set but are drawn randomly from a uniform probability distribution (Figure 10.27). Using the minibatch, we can estimate the gradient as the average of the subfunction gradients at the sampled training points:

θf(θ)1miSθ(θ;x(i),f(i)). \nabla_\theta f(\theta) \approx \frac{1}{m}\sum_{i \in S} \nabla_{\theta} \ell \left(\theta; x^{(i)}, f^{(i)} \right) \, .

Thus, we divide the training data into these minibatches and use a new minibatch to estimate the gradients at each iteration in the optimization.


Figure 10.27:Minibatches are randomly drawn from the training data.

This approach works well for these specific problems because of the unique form of the objective (Equation 10.98). As an example, with one million training samples, a single gradient evaluation of the full objective requires evaluating all one million samples. Alternatively, for a similar cost, a minibatch approach can update the optimization variables a million times using the gradient estimated from one training sample at a time. This latter process usually converges much faster, mainly because we are only fitting parameters against limited data in these problems, so we generally do not need to find the exact minimum.

This gradient is typically used with the steepest descent method (Section 4.4.1), usually referred to as gradient descent in the machine learning community. As discussed in Chapter 4, steepest descent is not the most effective optimization algorithm. However, steepest descent with the minibatch updates, called stochastic gradient descent, has been found to work well in machine learning applications. This suitability is primarily because (1) many machine learning optimizations are performed repeatedly, (2) the true objective is difficult to formalize, and (3) finding the absolute minimum is not as important as finding a good enough solution quickly. One key difference in stochastic gradient descent relative to the steepest descent method is that we do not perform a line search. Instead, the step size (called the learning rate in machine learning applications) is a preselected value that is usually decreased between major optimization iterations.
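
The following sketch shows one minimal form of minibatch stochastic gradient descent with a preselected, decaying learning rate. The grad_loss function, batch size, and schedule are placeholders for whatever model and hyperparameters are actually used.

```python
import numpy as np

def stochastic_gradient_descent(grad_loss, theta, data, learning_rate=1e-2,
                                batch_size=32, n_epochs=10, decay=0.9,
                                rng=np.random.default_rng(0)):
    """Minibatch stochastic gradient descent with a fixed, decaying learning rate.

    grad_loss(theta, batch) must return the averaged gradient over the minibatch;
    no line search is performed."""
    n = len(data)
    for epoch in range(n_epochs):
        order = rng.permutation(n)                      # shuffle the training set
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            theta = theta - learning_rate * grad_loss(theta, batch)
        learning_rate *= decay                          # decrease the learning rate
    return theta
```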

Stochastic minibatching is easily applied to first-order methods and has thus driven the development of improvements on stochastic gradient descent, such as momentum, Adagrad, and Adam.12 Although some of these methods may seem somewhat ad hoc, there is mathematical rigor to many of them.13 Batching makes the gradients noisy, so second-order methods are generally not pursued. However, ongoing research is exploring stochastic batch approaches that might effectively leverage the benefits of second-order methods.

10.6 Optimization and Infill

Once a surrogate model has been built, optimization may be performed using the surrogate function values. That is, instead of minimizing the expensive function f(x)f(x), we minimize the model f^(x)\hat{f}(x), as previously illustrated in Figure 10.1.

The surrogate model may be static, but more commonly, it is updated between optimization iterations by adding new training data and rebuilding the model.

The process by which we select new data points is called infill. There are two main approaches to infill: prediction-based exploitation and error-based exploration. Typically, only one infill point is chosen at a time. The assumption is that evaluating the model is computationally expensive, but rebuilding and evaluating the surrogate is cheap.

10.6.1 Exploitation

For models that do not provide uncertainty estimates, the only real option is exploitation. A prediction-based exploitation infill strategy adds an infill point wherever the surrogate predicts the optimum. The reasoning behind this approach is that in SBO, we do not necessarily care about having a globally accurate surrogate; instead, we only care about having an accurate surrogate near the optimum.

The most logical point to sample is thus the optimum predicted by the surrogate. The predicted location will likely not be the true optimum, but evaluating this point adds valuable information in the region of interest.

We rebuild the surrogate and re-optimize, repeating the process until convergence. This approach usually results in the quickest convergence to an optimum, which is desirable when the actual function is expensive to evaluate. The downside is that we may converge prematurely to an inferior local optimum for problems with multiple local optima.

Even though the approach is called exploitation, the optimizer used on the surrogate can be a global search method (gradient-based or gradient-free), although it is usually a local search method. If uncertainty is present, using the mean value of the surrogate as the infill criterion results in essentially an exploitation strategy.

The algorithm is outlined in Algorithm 10.4. Convergence could be based on a maximum number of iterations or a tolerance for the objective function’s fractional change.
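
The sketch below captures the structure of such an exploitation loop. The fit_surrogate and minimize_surrogate functions are placeholders for the chosen surrogate and optimizer, and convergence is checked with the fractional change of the best objective value, as suggested above.

```python
def exploitation_infill(f, x_samples, fit_surrogate, minimize_surrogate,
                        max_iters=20, tol=1e-6):
    """Prediction-based (exploitation) infill loop in the spirit of Algorithm 10.4.

    f is the expensive function; fit_surrogate(X, F) returns a surrogate model, and
    minimize_surrogate(model) returns the point where the surrogate is minimized."""
    X = list(x_samples)
    F = [f(x) for x in X]                     # expensive evaluations at the initial samples
    f_best = min(F)
    for _ in range(max_iters):
        model = fit_surrogate(X, F)           # rebuild the surrogate with all data so far
        x_new = minimize_surrogate(model)     # optimum predicted by the surrogate
        X.append(x_new)
        F.append(f(x_new))                    # evaluate the true model at the infill point
        f_best_new = min(F)
        if abs(f_best - f_best_new) <= tol * max(1.0, abs(f_best)):
            break                             # fractional change of the best objective
        f_best = f_best_new
    i = F.index(min(F))
    return X[i], F[i]
```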

10.6.2 Efficient Global Optimization

An alternative approach to infill uses error-based exploration. This approach requires using kriging (Section 10.4) or another surrogate approach that predicts not just function values but also error estimates. Although many infill metrics exist within this category, we focus on a popular one called expected improvement, and the associated algorithm, efficient global optimization (EGO).14

As stated previously, sampling where the mean is low is an exploitation strategy, but we do not necessarily want to sample where the uncertainty is high. That may lead to wasteful function calls in regions of the design space where the surrogate model is inaccurate but which are far from any optimum. In effect, this strategy would be like a larger sampling plan aiming to reduce error everywhere in the surrogate. Instead, we want to sample where we have the maximum probability of finding a better point.

Let the best solution we have found so far be f=f(x)f^* = f(x^*). The improvement for any new test point xx is then given by

I(x)=max(ff(x),0).I(x) = \max\left(f^* - f(x), 0\right) \, .

If f(x)ff(x) \ge f^*, there is no improvement, but if f(x)<ff(x) < f^*, the improvement is positive. However, f(x)f(x) is not a deterministic value in this model but rather a probability distribution. Thus, the expected improvement is the expected value (or mean) of the improvement:

EI(x)=E(max(ff(x),0)). EI(x) = \mathbb{E}\left(\max(f^* - f(x), 0)\right) \, .

The expected value for a kriging model can be found analytically as:

EI(x)=(fμf(x))Φ(fμf(x)σf(x))+σf(x)ϕ(fμf(x)σf(x)), EI(x) = (f^* - \mu_f(x))\Phi\left(\frac{f^* - \mu_f(x)}{\sigma_f(x)}\right) + \sigma_f(x)\phi\left(\frac{f^* - \mu_f(x)}{\sigma_f(x)}\right) \, ,

where Φ\Phi and ϕ\phi are the CDF and PDF, respectively, for the standard normal distribution, and μf\mu_f and σf\sigma_f are the mean and standard error functions produced from kriging ( Equation 10.75 and Equation 10.76).
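
For example, the expected improvement at a candidate point can be evaluated directly from the kriging mean and standard error; the sketch below uses scipy.stats.norm for the standard normal CDF and PDF and assumes scalar inputs.

```python
from scipy.stats import norm

def expected_improvement(mu_f, sigma_f, f_star):
    """Expected improvement at a candidate point, from the kriging mean mu_f,
    standard error sigma_f, and the best observed value f_star."""
    if sigma_f <= 0.0:
        return 0.0                            # no uncertainty at an already-sampled point
    u = (f_star - mu_f) / sigma_f
    return (f_star - mu_f) * norm.cdf(u) + sigma_f * norm.pdf(u)
```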

The algorithm is similar to that of the previous section (Algorithm 10.4), but instead of choosing the minimum of the surrogate, the selected infill point is the point with the greatest expected improvement. The corresponding algorithm is detailed in Algorithm 10.5.

10.7 Summary

Surrogate-based optimization can be an effective approach to optimization problems where models are expensive to evaluate or noisy.

The first step in building a surrogate model is sampling, in which we select the points that are evaluated to obtain the training data. Full factorial searches are too expensive for even a modest number of variables, and random sampling does not provide good coverage, so we need techniques that provide good coverage with a small number of samples. Popular techniques for this kind of sampling include LHS and low-discrepancy sequences.

The next step is surrogate selection and construction. For a given choice of basis functions, regression is used to select optimal model parameters. Cross validation is a critical component of this process. We want good predictive capability, which means that the models work well on data that the model has not been trained against. Model selection often involves trade-offs of more rigid models that do not need as much training data versus more flexible models that require more training data. Polynomials are often used for regression problems because a relatively small number of samples can be used to capture model behavior. Radial basis functions are more often used for interpolation because they can handle multimodal behavior but may require more training data.

Kriging and deep neural nets are two options that model more complex and multimodal design spaces. When using these models, special considerations are needed for efficiency, such as using symmetric matrix factorizations and gradients for kriging and using backpropagation and stochastic gradients for deep neural nets.

The last step of the process is infill, where points are sampled during optimization to update the surrogate model. Some approaches are exploitation-based, where we perform optimization using the surrogate and then use the optimal solution to update the model. Other approaches are exploration-based, where we sample not just at the deterministic optimum but also at points where the expected improvement is high. Exploration-based approaches require surrogate models that provide uncertainty estimates, such as kriging models.

Problems

  • Nonlinear least squares. Implement a Levenberg–Marquardt algorithm and demonstrate its performance on the Rosenbrock function from three different starting points.

  • Kriging. Implement kriging (without gradients) and demonstrate its fit on the following one-dimensional function:

    y=exp(x)cos(5x),y = \exp(-x)\cos(5x) ,

    where x[0,2.5]x \in [0, 2.5], using the following five sample points: x=[0,0.2,1.0,1.2,2.2]x = [0, 0.2, 1.0, 1.2, 2.2].

  • Efficient global optimization. Use EGO with the function from the previous problem, showing the iteration history until the expected improvement reduces below 0.001.

    Footnotes
    1. PDFs and CDFs are reviewed in Section A.9.

    2. A set of numbers is pairwise prime if there is no positive integer that can evenly divide any pair of them, except 1. Typically, though, we just use the first nxn_x prime numbers.

    3. The choice of minimizing the sum of the squares rather than the sum of the absolute values or some other metric is not arbitrary. The motivation for using the sum of the squares is discussed further in the following section.

    4. Kernel functions must be symmetric and positive definite because a covariance matrix is always symmetric and positive definite.

    5. There are other ways to derive kriging that do not require making assumptions on the random variable distribution type.

    6. Covariance and correlation are briefly reviewed in Section A.9.

7. Reference 7 provides the complete derivation; here we show only a few key steps.

    8. The formula for mean squared error does not come from the augmented likelihood approach, but is a byproduct of showing that the kriging predictor is the “best linear unbiased predictor” for the assumed statistical model.8

    9. The machine learning community independently developed backpropagation before becoming aware of the connection to reverse-mode AD.11

    References
    1. Faure, H. (1982). Discrépance des suites associées à un systéme de numération (en dimension s). Acta Arithmetica, 41, 337–351. 10.4064/aa-41-4-337-351
    2. Faure, H., & Lemieux, C. (2009). Generalized Halton sequences in 2008: A comparative study. ACM Transactions on Modeling and Computer Simulation, 19(4), 1–31. 10.1145/1596519.1596520
    3. Sobol, I. M. (1967). On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4), 86–112. 10.1016/0041-5553(67)90144-9
    4. Niederreiter, H. (1988). Low-discrepancy and low-dispersion sequences. Journal of Number Theory, 30(1), 51–70. 10.1016/0022-314X(88)90025-X
    5. Bouhlel, M. A., Hwang, J. T., Bartoli, N., Lafage, R., Morlier, J., & Martins, J. R. R. A. (2019). A Python surrogate modeling framework with derivatives. Advances in Engineering Software, 102662. https://doi.org/10.1016/j.advengsoft.2019.03.005
    6. Bouhlel, M. A., & Martins, J. R. R. A. (2019). Gradient-enhanced kriging for high-dimensional problems. Engineering with Computers, 1(35), 157–173. 10.1007/s00366-018-0590-x
    7. Jones, D. R. (2001). A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21, 345–383. 10.1023/A:1012771025575
    8. Sacks, J., Welch, W. J., Mitchell, T. J., & Wynn, H. P. (1989). Design and analysis of computer experiments. Statistical Science, 4(4), 409–423. 10.2307/2245858
    9. Han, Z.-H., Zhang, Y., Song, C.-X., & Zhang, K.-S. (2017). Weighted gradient-enhanced kriging for high-dimensional surrogate modeling and design optimization. AIAA Journal, 55(12), 4330–4346. 10.2514/1.J055842
    10. Forrester, A., Sobester, A., & Keane, A. (2008). Engineering Design via Surrogate Modelling: A Practical Guide. John Wiley & Sons.
    11. Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2018). Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, 18(1), 5595–5637. 10.5555/3122009.3242010
    12. Ruder, S. (2016). An overview of gradient descent optimization algorithms. In CoRR: Vol. abs/1609.04747. http://arxiv.org/abs/1609.04747
    13. Goh, G. (2017). Why momentum really works. Distill. 10.23915/distill.00006
    14. Jones, D. R., Schonlau, M., & Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13, 455–492. 10.1023/A:1008306431147