
5 Constrained Gradient-Based Optimization

Engineering design optimization problems are rarely unconstrained. In this chapter, we explain how to solve constrained problems. The methods in this chapter build on the gradient-based unconstrained methods from Chapter 4 and also assume smooth functions. We first introduce the optimality conditions for a constrained optimization problem and then focus on three main methods for handling constraints: penalty methods, sequential quadratic programming (SQP), and interior-point methods.

Penalty methods are no longer used in constrained gradient-based optimization because they have been replaced by more effective methods. Still, the concept of a penalty is useful when thinking about constraints, partially motivates more sophisticated approaches like interior-point methods, and is often used with gradient-free optimizers.

SQP and interior-point methods represent the state of the art in nonlinear constrained optimization. We introduce the basics for these two optimization methods, but a complete and robust implementation of these methods requires detailed knowledge of a growing body of literature that is not covered here.

5.1 Constrained Problem Formulation

We can express a general constrained optimization problem as

\begin{aligned} \text{minimize} &\quad f(x) \\ \text{by varying} &\quad x_i & i &= 1, \ldots, n_x \\ \text{subject to} &\quad g_j(x) \le 0 & j &= 1, \ldots, n_g \\ &\quad h_l(x) = 0 & l &= 1, \ldots, n_h \\ &\quad \underline{x}_i \le x_i \le \overline{x}_i & i &= 1, \ldots, n_x \, , \end{aligned}

where g(x) is the vector of inequality constraints, h(x) is the vector of equality constraints, and \underline{x} and \overline{x} are lower and upper design variable bounds (also known as bound constraints). Both objective and constraint functions can be nonlinear, but they should be C^2 continuous to be solved using gradient-based optimization algorithms. The inequality constraints are expressed as “less than” without loss of generality because they can always be converted to “greater than” by putting a negative sign on g. We could also eliminate the equality constraints h = 0 without loss of generality by replacing them with two inequality constraints, h \le \varepsilon and -h \le \varepsilon, where \varepsilon is some small number. In practice, it is desirable to distinguish between equality and inequality constraints because of numerical precision and algorithm implementation.

The constrained problem formulation just described does not distinguish between nonlinear and linear constraints. It is advantageous to make this distinction because some algorithms can take advantage of these differences. However, the methods introduced in this chapter assume general nonlinear functions.

For unconstrained gradient-based optimization (Chapter 4), we only require the gradient of the objective, \nabla f. To solve a constrained problem, we also require the gradients of all the constraints. Because the constraints are vectors, their derivatives yield a Jacobian matrix. For the equality constraints, the Jacobian is defined as

J_h = \frac{\partial h}{\partial x} = \underbrace{\begin{bmatrix} \frac{\partial h_1}{\partial x_1} & \cdots & \frac{\partial h_1}{\partial x_{n_x}} \\ \vdots & \ddots & \vdots \\ \frac{\partial h_{n_h}}{\partial x_1} & \cdots & \frac{\partial h_{n_h}}{\partial x_{n_x}} \end{bmatrix}}_{(n_h \times n_x)} = \begin{bmatrix} \nabla h_1^\intercal \\ \vdots \\ \nabla h_{n_h}^\intercal \end{bmatrix} \, ,

which is an (n_h \times n_x) matrix whose rows are the gradients of each constraint. Similarly, the Jacobian of the inequality constraints is an (n_g \times n_x) matrix.
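For illustration, each row of this Jacobian can be approximated with finite differences, one design variable at a time. The sketch below is not from the text; the constraint functions and step size are hypothetical.

```python
# Illustrative sketch: approximate the (n_h x n_x) constraint Jacobian by
# forward finite differences. The constraints below are hypothetical.
import numpy as np

def constraint_jacobian(h, x, step=1e-6):
    """Rows of the returned matrix approximate the gradients of each h_j."""
    x = np.asarray(x, dtype=float)
    h0 = np.atleast_1d(h(x))
    J = np.zeros((h0.size, x.size))
    for i in range(x.size):
        xp = x.copy()
        xp[i] += step
        J[:, i] = (np.atleast_1d(h(xp)) - h0) / step
    return J

# Two equality constraints in three variables -> a (2 x 3) Jacobian
h = lambda x: np.array([x[0]**2 + x[1] - 1.0, x[1] * x[2]])
print(constraint_jacobian(h, [1.0, 0.5, 2.0]))
```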

5.2 Understanding n-Dimensional Space

Understanding the optimality conditions and optimization algorithms for constrained problems requires basic n-dimensional geometry and linear algebra concepts. Here, we review the concepts in an informal way.[1] We sketch the concepts for two and three dimensions to provide some geometric intuition, but keep in mind that the only way to tackle n dimensions is through mathematics.

There are several essential linear algebra concepts for constrained optimization. The span of a set of vectors is the space formed by all the points that can be obtained by a linear combination of those vectors. With one vector, this space is a line; with two linearly independent vectors, this space is a two-dimensional plane (see Figure 5.2); and so on. With n linearly independent vectors, we can obtain any point in n-dimensional space.

Figure 5.2: Span of one, two, and three vectors in three-dimensional space.

Because matrices are composed of vectors, we can apply the concept of span to matrices. Suppose we have a rectangular (m \times n) matrix A. For our purposes, we are interested in considering the m row vectors in the matrix.

The rank of A is the number of linearly independent rows of A, and it corresponds to the dimension of the space spanned by the row vectors of A.

The nullspace of a matrix A is the set of all n-dimensional vectors p such that Ap = 0.

Figure 5.3: Nullspace of a (2 \times 3) matrix A of rank 2, where a_1 and a_2 are the row vectors of A.

This is a subspace of n - r dimensions, where r is the rank of A. One fundamental theorem of linear algebra is that the nullspace of a matrix contains all the vectors that are perpendicular to the row space of that matrix and vice versa. This concept is illustrated in Figure 5.3 for n = 3, where r = 2, leaving only one dimension for the nullspace. Any vector v that is perpendicular to p must be a linear combination of the rows of A, so it can be expressed as v = \alpha a_1 + \beta a_2.[2]

A hyperplane is a generalization of a plane in n-dimensional space and is an essential concept in constrained optimization. In a space of n dimensions, a hyperplane is a subspace with at most n - 1 dimensions. In Figure 5.4, we illustrate hyperplanes in two dimensions (a line) and three dimensions (a two-dimensional plane); higher dimensions cannot be visualized, but the mathematical description that follows holds for any n.

Figure 5.4: Hyperplanes and half-spaces in two and three dimensions.

To define a hyperplane of n - 1 dimensions, we just need a point contained in the hyperplane (x_0) and a vector (v). Then, the hyperplane is defined as the set of all points x = x_0 + p such that p^\intercal v = 0. That is, the hyperplane is defined by all vectors that are perpendicular to v. To define a hyperplane with n - 2 dimensions, we would need two vectors, and so on. In n dimensions, a hyperplane of n - 1 dimensions divides the space into two half-spaces: in one of these, p^\intercal v > 0, and in the other, p^\intercal v < 0. Each half-space is closed if it includes the hyperplane (p^\intercal v = 0) and open otherwise.

When we have the isosurface of a function f, the function gradient at a point on the isosurface is locally perpendicular to the isosurface. The gradient vector defines the tangent hyperplane at that point, which is the set of points such that p^\intercal \nabla f = 0. In two dimensions, the isosurface reduces to a contour and the tangent reduces to a line, as shown in Figure 5.5 (left). In three dimensions, we have a two-dimensional hyperplane tangent to an isosurface, as shown in Figure 5.5 (right).

Figure 5.5: The gradient of a function defines the hyperplane tangent to the function isosurface.

The intersection of multiple half-spaces yields a polyhedral cone.

A polyhedral cone is the set of all the points that can be obtained by the linear combination of a given set of vectors using nonnegative coefficients. This concept is illustrated in Figure 5.6 (left) for the two-dimensional case.

In this case, only two vectors are required to define a cone uniquely. In three dimensions and higher, there can be any number of vectors, corresponding to all the possible polyhedral “cross sections”, as illustrated in Figure 5.6 (middle and right).

Figure 5.6: Polyhedral cones in two and three dimensions.

5.3 Optimality Conditions

The optimality conditions for constrained optimization problems are not as straightforward as those for unconstrained optimization (Section 4.1.4). We begin with equality constraints because the mathematics and intuition are simpler, then add inequality constraints. As in the case of unconstrained optimization, the optimality conditions for constrained problems are used not only for the termination criteria but also as the basis for optimization algorithms.

5.3.1 Equality Constraints

First, we review the optimality conditions for an unconstrained problem, which we derived in Section 4.1.4. For the unconstrained case, we can take a first-order Taylor series expansion of the objective function with some step p that is small enough that the second-order term is negligible and write

f(x + p) \approx f(x) + \nabla f(x)^\intercal p \, .

If x^* is a minimum point, then every point in a small neighborhood must have a greater value,

f(x^* + p) \ge f(x^*) \, .

Given the Taylor series expansion (Equation 5.3), the only way that this inequality can be satisfied is if

\nabla f(x^*)^\intercal p \ge 0 \, .

The condition \nabla f^\intercal p = 0 defines a hyperplane that contains the directions along which the first-order variation of the function is zero. This hyperplane divides the space into an open half-space of directions where the function decreases (\nabla f^\intercal p < 0) and an open half-space where the function increases (\nabla f^\intercal p > 0), as shown in Figure 5.7. Again, we are considering first-order variations.

Figure 5.7: The gradient \nabla f(x), which is the direction of steepest function increase, splits the design space into two halves. Here we highlight the open half-space of directions that result in function decrease.

If the problem were unconstrained, the only way to satisfy the inequality in Equation 5.5 would be if \nabla f(x^*) = 0. That is because for any nonzero \nabla f, there is an open half-space of directions that result in a function decrease (see Figure 5.7). This is consistent with the first-order unconstrained optimality conditions derived in Section 4.1.4.

However, we now have a constrained problem. The function increase condition (Equation 5.5) still applies, but p must also be a feasible direction. To find the feasible directions, we can write a first-order Taylor series expansion for each equality constraint function as

h_j(x + p) \approx h_j(x) + \nabla h_j(x)^\intercal p, \quad j = 1, \ldots, n_h \, .

Again, the step size is assumed to be small enough so that the higher-order terms are negligible.

Assuming that x is a feasible point, then h_j(x) = 0 for all constraints j, and we are left with the second term in the linearized constraint (Equation 5.6). To remain feasible a small step away from x, we require that h_j(x + p) = 0 for all j. Therefore, first-order feasibility requires that

\nabla h_j(x)^\intercal p = 0 \quad \text{for all} \quad j = 1, \ldots, n_h \, ,

which means that a direction is feasible when it is orthogonal to all equality constraint gradients. We can write this in matrix form as

J_h(x)\, p = 0 \, .

This equation states that any feasible direction has to lie in the nullspace of the Jacobian of the constraints, J_h.

Figure 5.8: If we have two equality constraints (n_h = 2) in two-dimensional space (n_x = 2), we are left with no freedom for optimization.

Assuming that J_h has full row rank (i.e., the constraint gradients are linearly independent), the feasible space is a subspace of dimension n_x - n_h. For optimization to be possible, we require n_x > n_h. Figure 5.8 illustrates a case where n_x = n_h = 2, where the feasible space reduces to a single point, and there is no freedom for performing optimization.
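To make this concrete, the short sketch below (with a hypothetical full-row-rank Jacobian) computes a basis for the feasible directions numerically:

```python
# Illustrative sketch: first-order feasible directions are the nullspace of
# J_h. With n_x = 3 and n_h = 2, one feasible direction remains.
import numpy as np
from scipy.linalg import null_space

J_h = np.array([[1.0, 2.0, 0.0],    # rows are the constraint gradients
                [0.0, 1.0, 1.0]])   # (hypothetical values)

Z = null_space(J_h)   # columns form a basis for the feasible directions
print(Z.shape)        # (3, 1): the feasible subspace has n_x - n_h = 1 dim
print(J_h @ Z)        # ~0: any p = Z @ alpha satisfies J_h p = 0
```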

For one constraint, Equation 5.8 reduces to a dot product, and the feasible space corresponds to a tangent hyperplane, as illustrated on the left side of Figure 5.9 for the three-dimensional case. For two or more constraints, the feasible space corresponds to the intersection of all the tangent hyperplanes. On the right side of Figure 5.9, we show the intersection of two tangent hyperplanes in three-dimensional space (a line).

Figure 5.9: Feasible spaces in three dimensions for one and two constraints.

For constrained optimality, we need to satisfy both \nabla f(x^*)^\intercal p \ge 0 (Equation 5.5) and J_h(x)\, p = 0 (Equation 5.8). For equality constraints, if a direction p is feasible, then -p must also be feasible. Therefore, the only way to satisfy \nabla f(x^*)^\intercal p \ge 0 is if \nabla f(x^*)^\intercal p = 0.

In sum, for x^* to be a constrained optimum, we require

\nabla f(x^*)^\intercal p = 0 \quad \text{for all $p$ such that} \quad J_h(x^*)\, p = 0 \, .

In other words, the projection of the objective function gradient onto the feasible space must vanish. Figure 5.10 illustrates this requirement for a case with two constraints in three dimensions.

Figure 5.10: If the projection of \nabla f onto the feasible space is nonzero, there is a feasible descent direction (left); if the projection is zero, the point is a constrained optimum (right).

The constrained optimum conditions (Equation 5.9) require that \nabla f be orthogonal to the nullspace of J_h (since p, as defined, lies in the nullspace of J_h). The row space of a matrix contains all the vectors that are orthogonal to its nullspace.[3] Because the rows of J_h are the gradients of the constraints, the objective function gradient must be a linear combination of the gradients of the constraints. Thus, we can write the requirements defined in Equation 5.9 as a single vector equation,

\nabla f(x^*) = - \sum_{j=1}^{n_h} \lambda_j \nabla h_j(x^*) \, ,

where \lambda_j are called the Lagrange multipliers.[4] There is a multiplier associated with each constraint. The sign of the Lagrange multipliers is arbitrary for equality constraints but will be significant later when dealing with inequality constraints.

Therefore, the first-order optimality conditions for the equality constrained case are

\begin{aligned} \nabla f(x^*) &= - J_h(x^*)^\intercal \lambda \\ h(x^*) &= 0 \, , \end{aligned}

where we have reexpressed Equation 5.10 in matrix form and added the constraint satisfaction condition.

In constrained optimization, it is sometimes convenient to use the Lagrangian function, which is a scalar function defined as

\mathcal{L}(x, \lambda) = f(x) + h(x)^\intercal \lambda \, .

In this function, the Lagrange multipliers are considered to be independent variables. Taking the gradient of \mathcal{L} with respect to both x and \lambda and setting them to zero yields

\begin{aligned} \nabla_x \mathcal{L} &= \nabla f(x) + J_h(x)^\intercal \lambda = 0 \\ \nabla_\lambda \mathcal{L} &= h(x) = 0 \, , \end{aligned}

which are the first-order conditions derived in Equation 5.11.

With the Lagrangian function, we have transformed a constrained problem into an unconstrained problem by adding new variables, \lambda. A constrained problem of n_x design variables and n_h equality constraints was transformed into an unconstrained problem with n_x + n_h variables. Although you might be tempted to simply use the algorithms of Chapter 4 to minimize the Lagrangian function (Equation 5.12), some modifications are needed in the algorithms to solve these problems effectively (particularly once inequality constraints are introduced).
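For a small problem, these first-order conditions can also be solved directly as a root-finding problem in (x, \lambda). The following sketch uses a hypothetical example, minimizing x_1^2 + x_2^2 subject to x_1 + x_2 - 2 = 0:

```python
# Illustrative sketch: treat the first-order conditions
# grad f + J_h^T lambda = 0 and h = 0 as residuals and find their root.
import numpy as np
from scipy.optimize import fsolve

def kkt_residual(u):
    x, lam = u[:2], u[2]
    grad_f = np.array([2.0 * x[0], 2.0 * x[1]])
    grad_h = np.array([1.0, 1.0])     # gradient of h = x1 + x2 - 2
    return np.append(grad_f + lam * grad_h, x[0] + x[1] - 2.0)

u = fsolve(kkt_residual, [0.0, 0.0, 0.0])
print(u)  # x* = (1, 1), lambda* = -2
```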

Figure 5.11: The constraint qualification condition does not hold in this case because the gradients of the two constraints are not linearly independent.

The derivation of the first-order optimality conditions (Equation 5.11) assumes that the gradients of the constraints are linearly independent; that is, J_h has full row rank. A point satisfying this condition is called a regular point and is said to satisfy linear independence constraint qualification. Figure 5.11 illustrates a case where x^* is not a regular point. A special case that does not satisfy constraint qualification is when one (or more) constraint gradient is zero. In that case, that constraint is not linearly independent, and the point is not regular. Fortunately, these situations are uncommon.

The optimality conditions just described are first-order conditions that are necessary but not sufficient. To make sure that a point is a constrained minimum, we also need to satisfy second-order conditions. For the unconstrained case, the Hessian of the objective function has to be positive definite. In the constrained case, we need to check the Hessian of the Lagrangian with respect to the design variables in the space of feasible directions. The Lagrangian Hessian is

H_{\cal L} = H_f + \sum_{j=1}^{n_h} \lambda_j H_{h_j} \, ,

where H_f is the Hessian of the objective, and H_{h_j} is the Hessian of equality constraint j. The second-order sufficient conditions are as follows:

p^\intercal H_{\cal L} p > 0 \quad \text{for all $p$ such that} \quad J_h p = 0 \, .

This ensures that the curvature of the Lagrangian is positive when projected onto any feasible direction.
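Numerically, one way to check this condition is to form a basis Z for the nullspace of J_h and test whether the reduced Hessian Z^\intercal H_{\cal L} Z is positive definite. The values below are hypothetical:

```python
# Illustrative sketch: second-order check via the reduced Hessian.
import numpy as np
from scipy.linalg import null_space

H_L = np.array([[2.0, 0.0],     # hypothetical Lagrangian Hessian at x*
                [0.0, -1.0]])   # (indefinite in the full space)
J_h = np.array([[0.0, 1.0]])    # hypothetical constraint gradient

Z = null_space(J_h)             # basis for the feasible directions
H_red = Z.T @ H_L @ Z           # reduced (projected) Hessian
print(np.all(np.linalg.eigvalsh(H_red) > 0))  # True: positive curvature
                                              # in every feasible direction
```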

5.3.2 Inequality Constraints

We can reuse some of the concepts from the equality constrained optimality conditions for inequality constrained problems.

Recall that an inequality constraint j is feasible when g_j(x^*) \le 0; it is said to be active if g_j(x^*) = 0 and inactive if g_j(x^*) < 0.

Figure 5.15: The descent directions are in the open half-space defined by the objective function gradient.

As before, if x^* is an optimum, any small enough feasible step p from the optimum must result in a function increase. Based on the Taylor series expansion (Equation 5.3), we get the condition

\nabla f(x^*)^\intercal p \ge 0 \, ,

which is the same as for the equality constrained case. We use the arc in Figure 5.15 to show the descent directions, which are in the open half-space bounded by the hyperplane normal to the gradient of the objective.

To consider inequality constraints, we use the same linearization as the equality constraints (Equation 5.6), but now we enforce an inequality to get

g_j(x + p) \approx g_j(x) + \nabla g_j(x)^\intercal p \le 0, \qquad j = 1, \ldots, n_g \, .

For a given candidate point that satisfies all constraints, there are two possibilities to consider for each inequality constraint: whether the constraint is inactive (g_j(x) < 0) or active (g_j(x) = 0). If a given constraint is inactive, we do not need to add any condition for it because we can take a step p in any direction and remain feasible as long as the step is small enough. Thus, we only need to consider the active constraints for the optimality conditions.

Figure 5.16: The feasible directions for each constraint are in the closed half-space defined by the inequality constraint gradient.

For the equality constraint, we found that all first-order feasible directions are in the nullspace of the Jacobian matrix. Inequality constraints are not as restrictive. From Equation 5.17, if constraint j is active (g_j(x) = 0), then a nearby point x + p is feasible only if \nabla g_j(x)^\intercal p \le 0 for all constraints j that are active. In matrix form, we can write J_g(x)\, p \le 0, where the Jacobian matrix includes only the gradients of the active constraints. Thus, the feasible directions for inequality constraint j can be any direction in the closed half-space corresponding to all directions p such that p^\intercal \nabla g_j \le 0, as shown in Figure 5.16. In this figure, the arc shows the infeasible directions.

The set of feasible directions that satisfies all active constraints is the intersection of all the closed half-spaces defined by the inequality constraints, that is, all p such that J_g(x)\, p \le 0. This intersection of the feasible directions forms a polyhedral cone, as illustrated in Figure 5.17 for a two-dimensional case with two constraints.

Figure 5.17: Excluding the infeasible directions with respect to each constraint (red arcs) leaves the cone of feasible directions (blue), which is the polar cone of the active constraint gradients cone (gray).

To find the cone of feasible directions, let us first consider the cone formed by the active inequality constraint gradients (shown in gray in Figure 5.17). This cone is defined by all vectors d such that

d = J_g^\intercal \sigma = \sum_{j=1}^{n_g} \sigma_j \nabla g_j, \quad \text{where} \quad \sigma_j \ge 0 \, .

A direction p is feasible if p^\intercal d \le 0 for all d in the cone. The set of all feasible directions forms the polar cone of the cone defined by Equation 5.18 and is shown in blue in Figure 5.17.

Now that we have established some intuition about the feasible directions, we need to establish under which conditions there is no feasible descent direction (i.e., we have reached an optimum). In other words, when is there no intersection between the cone of feasible directions and the open half-space of descent directions? To answer this question, we can use Farkas’ lemma. This lemma states that given a rectangular matrix (J_g in our case) and a vector with the same size as the rows of the matrix (\nabla f in our case), one (and only one) of two possibilities occurs:[7]

  1. There exists a p such that J_g p \le 0 and \nabla f^\intercal p < 0. This means that there is a descent direction that is feasible (Figure 5.18, left).

  2. There exists a \sigma such that J_g^\intercal \sigma = -\nabla f with \sigma \ge 0 (Figure 5.18, right). This corresponds to optimality because it excludes the first possibility.

Figure 5.18: Two possibilities involving active inequality constraints.

The second possibility yields the following optimality criterion for inequality constraints:

\nabla f + J_g(x)^\intercal \sigma = 0 \, , \quad \text{with} \quad \sigma \ge 0 \, .

Comparing with the corresponding criteria for equality constraints (Equation 5.13), we see a similar form. However, \sigma corresponds to the Lagrange multipliers for the inequality constraints and carries the additional restriction that \sigma \ge 0.

If equality constraints are present, the conditions for the inequality constraints apply only in the subspace of the directions feasible with respect to the equality constraints.

Similar to the equality constrained case, we can construct a Lagrangian function whose stationary points are candidates for optimal points. We need to include all inequality constraints in the optimality conditions because we do not know in advance which constraints are active. To represent inequality constraints in the Lagrangian, we replace them with the equality constraints defined by

g_j + s_j^2 = 0, \quad j = 1, \ldots, n_g \, ,

where s_j is a new unknown associated with each inequality constraint called a slack variable. The slack variable is squared to ensure that it is nonnegative. In that way, Equation 5.20 can only be satisfied when g_j is feasible (g_j \le 0).

The significance of the slack variable is that when s_j = 0, the corresponding inequality constraint is active (g_j = 0), and when s_j \neq 0, the corresponding constraint is inactive.

The Lagrangian including both equality and inequality constraints is then

{\cal L}(x, \lambda, \sigma, s) = f(x) + \lambda^\intercal h(x) + \sigma^\intercal \left( g(x) + s \odot s \right) \, ,

where \sigma represents the Lagrange multipliers associated with the inequality constraints. Here, we use \odot to represent element-wise multiplication.[8] Similar to the equality constrained case, we seek a stationary point for the Lagrangian, but now we have additional unknowns: the inequality Lagrange multipliers and the slack variables. Taking partial derivatives of the Lagrangian with respect to each set of unknowns and setting those derivatives to zero yields the first-order optimality conditions:

\nabla_x \mathcal{L} = 0 \quad \Rightarrow \quad \frac{\partial {\cal L}}{\partial x_i} = \frac{\partial f}{\partial x_i} + \sum_{l=1}^{n_h} \lambda_l \frac{\partial h_l}{\partial x_i} + \sum_{j=1}^{n_g} \sigma_j \frac{\partial g_j}{\partial x_i} = 0, \quad i = 1, \ldots, n_x \, .

This criterion is the same as before but with additional Lagrange multipliers and constraints. Taking the derivatives with respect to the equality Lagrange multipliers, we have

\nabla_\lambda \mathcal{L} = 0 \quad \Rightarrow \quad \frac{\partial {\cal L}}{\partial \lambda_l} = h_l = 0, \quad l = 1, \ldots, n_h \, ,

which enforces the equality constraints as before. Taking derivatives with respect to the inequality Lagrange multipliers, we get

\nabla_\sigma \mathcal{L} = 0 \quad \Rightarrow \quad \frac{\partial \mathcal{L}}{\partial \sigma_j} = g_j + s_j^2 = 0, \quad j = 1, \ldots, n_g \, ,

which enforces the inequality constraints. Finally, differentiating the Lagrangian with respect to the slack variables, we obtain

\nabla_s \mathcal{L} = 0 \quad \Rightarrow \quad \frac{\partial {\cal L}}{\partial s_j} = 2 \sigma_j s_j = 0, \quad j = 1, \ldots, n_g \, ,

which is called the complementary slackness condition. This condition helps us to distinguish the active constraints from the inactive ones. For each inequality constraint, either the Lagrange multiplier is zero (which means that the constraint is inactive), or the slack variable is zero (which means that the constraint is active). Unfortunately, the complementary slackness condition introduces a combinatorial problem. The complexity of this problem grows exponentially with the number of inequality constraints because the number of possible combinations of active versus inactive constraints is 2^{n_g}.

In addition to the conditions for a stationary point of the Lagrangian (Equation 5.22, Equation 5.23, Equation 5.24, and Equation 5.25), recall that we require the Lagrange multipliers for the active constraints to be nonnegative. Putting all these conditions together in matrix form, the first-order constrained optimality conditions are as follows:

\begin{aligned} \nabla f + J_h^\intercal \lambda + J_g^\intercal \sigma &= 0 \\ h &= 0 \\ g + s \odot s &= 0 \\ \sigma \odot s &= 0 \\ \sigma &\ge 0 \, . \end{aligned}

These are called the Karush–Kuhn–Tucker (KKT) conditions. The equality and inequality constraints are sometimes lumped together using a single Jacobian matrix (and single Lagrange multiplier vector). This can be convenient because the expression for the Lagrangian follows the same form for both cases.
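For a candidate point, these conditions can be checked numerically. The sketch below uses a hypothetical problem, minimize x_1 + x_2 subject to x_1^2 + x_2^2 - 2 \le 0, whose solution is x^* = (-1, -1) with \sigma = 1/2:

```python
# Illustrative sketch: verify the KKT conditions at a candidate point.
import numpy as np

x = np.array([-1.0, -1.0])   # candidate optimum
sigma = 0.5                  # candidate multiplier (single constraint)

grad_f = np.array([1.0, 1.0])
g = x[0]**2 + x[1]**2 - 2.0
grad_g = 2.0 * x

print(np.allclose(grad_f + sigma * grad_g, 0.0))  # stationarity
print(np.isclose(g, 0.0), sigma >= 0.0)           # feasibility and sign
print(np.isclose(sigma * g, 0.0))                 # complementary slackness
```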

As in the equality constrained case, these first-order conditions are necessary but not sufficient. The second-order sufficient conditions require that the Hessian of the Lagrangian must be positive definite in all feasible directions, that is,

\begin{aligned} p^\intercal H_{\cal L} p &> 0 \quad \text{for all } p \text{ such that:} \\ J_h p &= 0 \\ J_g p &\le 0 \quad \text{for the active constraints}. \end{aligned}

In other words, we only require positive definiteness in the intersection of the nullspace of the equality constraint Jacobian with the feasibility cone of the active inequality constraints.

Similar to the equality constrained case, the KKT conditions (Equation 5.26) only apply when a point is regular, that is, when it satisfies linear independence constraint qualification. However, the linear independence applies only to the gradients of the inequality constraints that are active and the equality constraint gradients.

Suppose we have the two constraints shown in the left pane of Figure 5.19. For the given objective function contours, point x^* is a minimum. At x^*, the gradients of the two constraints are linearly independent, and x^* is thus a regular point. Therefore, we can apply the KKT conditions at this point.

Figure 5.19: The KKT conditions apply only to regular points. A point x^* is regular when the gradients of the constraints are linearly independent. The middle and right panes illustrate cases where x^* is a constrained minimum but not a regular point.

The middle and right panes of Figure 5.19 illustrate cases where x^* is also a constrained minimum. However, x^* is not a regular point in either case because the gradients of the two constraints are not linearly independent. This means that the gradient of the objective cannot be expressed as a unique linear combination of the constraint gradients. Therefore, we cannot use the KKT conditions, even though x^* is a minimum. The problem would be ill-conditioned, and the numerical methods described in this chapter would run into numerical difficulties. Similar to the equality constrained case, this situation is uncommon in practice.

Although these examples can be solved analytically, they are the exception rather than the rule. The KKT conditions quickly become challenging to solve analytically (try solving Example 5.1), and as the number of constraints increases, trying all combinations of active and inactive constraints becomes intractable. Furthermore, engineering problems usually involve functions defined by models with implicit equations, which are impossible to solve analytically. The reason we include these analytic examples is to gain a better understanding of the KKT conditions. For the rest of the chapter, we focus on numerical methods, which are necessary for the vast majority of practical problems.

5.3.3 Meaning of the Lagrange Multipliers

The Lagrange multipliers quantify how much the corresponding constraints drive the design. More specifically, a Lagrange multiplier quantifies the sensitivity of the optimal objective function value f(x^*) to a variation in the value of the corresponding constraint. Here we explain why that is the case. We discuss only inequality constraints, but the same analysis applies to equality constraints.

When a constraint is inactive, the corresponding Lagrange multiplier is zero. This indicates that changing the value of an inactive constraint does not affect the optimum, as expected. This is only valid to the first order because the KKT conditions are based on the linearization of the objective and constraint functions. Because small changes are assumed in the linearization, we do not consider the case where an inactive constraint becomes active after perturbation.

Now let us examine the active constraints. Suppose that we want to quantify the effect of a change in an active (or equality) constraint g_i on the optimal objective function value.[9] The differential of g_i is given by the following dot product:

\mathrm{d} g_i = \frac{\partial g_i}{\partial x} \mathrm{d} x \, .

All the other constraints j remain unperturbed, which means that

\frac{\partial g_j}{\partial x} \mathrm{d} x = 0 \quad \text{for all} \quad j \neq i \, .

This equation states that any movement \mathrm{d} x must be in the nullspace of the remaining constraints to remain feasible with respect to those constraints.[10] An example with two constraints is illustrated in Figure 5.24, where g_1 is perturbed and g_2 remains fixed. The objective and constraint functions are linearized because we are considering first-order changes represented by the differentials.

Figure 5.24: Lagrange multipliers can be interpreted as the change in the optimal objective due to a perturbation in the corresponding constraint. In this case, we show the effect of perturbing g_1.

From the KKT conditions (Equation 5.22), we know that at the optimum,

\frac{\partial f}{\partial x} = -\sigma^\intercal \frac{\partial g}{\partial x} \, .

Using this condition, we can write the differential of the objective, \mathrm{d} f = (\partial f / \partial x)\, \mathrm{d} x, as

\mathrm{d} f = -\sigma^\intercal \frac{\partial g}{\partial x} \mathrm{d} x \, .

According to Equation 5.28 and Equation 5.29, the product with \mathrm{d} x is only nonzero for the perturbed constraint i, and therefore,

\mathrm{d} f = -\sigma_i \frac{\partial g_i}{\partial x} \mathrm{d} x = -\sigma_i \, \mathrm{d} g_i \, .

This leads to the derivative of the optimal f with respect to a change in the value of constraint i:

\sigma_i = -\frac{\mathrm{d} f}{\mathrm{d} g_i} \, .

Thus, the Lagrange multipliers can predict how much improvement can be expected if a given constraint is relaxed. For inequality constraints, because the Lagrange multipliers are positive at an optimum, this equation correctly predicts a decrease in the objective function value when the constraint value is increased.

The derivative defined in Equation 5.33 has practical value because it tells us how much a given constraint drives the design. In this interpretation of the Lagrange multipliers, we need to consider the scaling of the problem and the units. Still, for similar quantities, they quantify the relative importance of the constraints.
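As a numerical illustration (using the same hypothetical disk problem as in the KKT check above), relaxing the constraint to g \le \varepsilon and recomputing the optimum reproduces \mathrm{d} f / \mathrm{d} g_i = -\sigma_i:

```python
# Illustrative sketch: for minimize x1 + x2 s.t. x1^2 + x2^2 - 2 <= eps,
# the optimum is known in closed form, so df/dg can be finite-differenced.
import numpy as np

def f_opt(eps):
    # optimal objective on the relaxed disk x1^2 + x2^2 <= 2 + eps
    return -np.sqrt(2.0 * (2.0 + eps))

eps = 1e-6
print((f_opt(eps) - f_opt(0.0)) / eps)  # ~ -0.5, matching -sigma = -1/2
```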

5.3.4 Post-Optimality Sensitivities

It is sometimes helpful to find sensitivities of the optimal objective function value with respect to a parameter held fixed during optimization. Suppose that we have found the optimum for a constrained problem. Say we have a scalar parameter \rho held fixed in the optimization, but we now want to quantify the effect of a perturbation in that parameter on the optimal objective value. Perturbing \rho changes the objective and the constraint functions, so the optimum point moves, as illustrated in Figure 5.25. For our current purposes, we use g to represent either active inequality or equality constraints.

We assume that the set of active constraints does not change with a perturbation in \rho, as we did when perturbing the constraint in Section 5.3.3.

Figure 5.25: Post-optimality sensitivities quantify the change in the optimal objective due to a perturbation of a parameter that was originally fixed in the optimization.

The optimal objective value changes due to the movement of the optimum point (which moves to x^*_\rho) and the change in the objective function (which becomes f_\rho).

The objective function is affected by \rho through a change in f itself and a change induced by the movement of the constraints. This dependence can be written in the total differential form as

\mathrm{d} f = \frac{\partial f}{\partial \rho} \mathrm{d} \rho + \frac{\partial f}{\partial g} \frac{\partial g}{\partial \rho} \mathrm{d} \rho \, .

The derivative \partial f / \partial g corresponds to the sensitivity of the optimal objective to a perturbation of the constraint. Note that a perturbation \mathrm{d} \rho shifts the constraint function by (\partial g / \partial \rho)\, \mathrm{d} \rho, which is equivalent to perturbing the constraint value at the relocated optimum by \mathrm{d} g = -(\partial g / \partial \rho)\, \mathrm{d} \rho (raising g tightens the feasible region g \le 0). Using Equation 5.33, the corresponding objective change is -\sigma^\intercal \mathrm{d} g = \sigma^\intercal (\partial g / \partial \rho)\, \mathrm{d} \rho. This means that the post-optimality derivative is

\frac{\mathrm{d} f}{\mathrm{d} \rho} = \frac{\partial f}{\partial \rho} + \sigma^\intercal \frac{\partial g}{\partial \rho} \, ,

where the partial derivatives with respect to \rho can be computed without re-optimizing.
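Continuing the hypothetical disk example with the parameter written explicitly as g = x_1^2 + x_2^2 - \rho \le 0, this formula can be checked against the closed-form optimum:

```python
# Illustrative sketch: post-optimality sensitivity check.
import numpy as np

rho, sigma = 2.0, 0.5      # parameter value and multiplier at the optimum
# f does not depend on rho directly, and dg/drho = -1:
df_drho = 0.0 + sigma * (-1.0)

f_opt = lambda r: -np.sqrt(2.0 * r)   # closed-form optimal objective
eps = 1e-6
print(df_drho, (f_opt(rho + eps) - f_opt(rho)) / eps)  # both ~ -0.5
```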

5.4 Penalty Methods

The concept behind penalty methods is intuitive: to transform a constrained problem into an unconstrained one by adding a penalty to the objective function when constraints are violated or close to being violated. As mentioned in the introduction to this chapter, penalty methods are no longer used directly in gradient-based optimization algorithms because they have difficulty converging to the true solution. However, these methods are still valuable because (1) they are simple and thus ease the transition into understanding constrained optimization; (2) they are useful in some constrained gradient-free methods (Chapter 7); (3) they can be used as merit functions in line search algorithms, as discussed in Section 5.5.3; (4) penalty concepts are used in interior-point methods, as discussed in Section 5.6.

The penalized function can be written as

\hat{f}(x) = f(x) + \mu \, \pi(x) \, ,

where \pi(x) is a penalty function, and the scalar \mu is a penalty parameter. This is similar in form to the Lagrangian, but one difference is that \mu is fixed instead of being a variable.

We can use the unconstrained optimization techniques to minimize \hat{f}(x). However, instead of just solving a single optimization problem, penalty methods usually solve a sequence of problems with different values of \mu to get closer to the actual constrained minimum. We will see shortly why we need to solve a sequence of problems rather than just one problem.

Various forms for \pi(x) can be used, leading to different penalty methods. There are two main types of penalty functions: exterior penalties, which impose a penalty only when constraints are violated, and interior penalty functions, which impose a penalty that increases as a constraint is approached.

Figure 5.26 shows both interior and exterior penalties for a two-dimensional function. The exterior penalty leads to slightly infeasible solutions, whereas an interior penalty leads to a feasible solution but underpredicts the objective.

Figure 5.26: Interior penalties tend to infinity as the constraint is approached from the feasible side of the constraint (left), whereas exterior penalty functions activate when the points are not feasible (right). The minimum for both approaches is different from the true constrained minimum.

5.4.1 Exterior Penalty Methods

Of the many possible exterior penalty methods, we focus on two of the most popular ones: quadratic penalties and the augmented Lagrangian method. Quadratic penalties are continuously differentiable and straightforward to implement, but they suffer from numerical ill-conditioning. The augmented Lagrangian method is more sophisticated; it is based on the quadratic penalty but adds terms that improve the numerical properties. Many other penalties are possible, such as 1-norms, which are often used when continuous differentiability is unnecessary.

Quadratic Penalty Method

For equality constrained problems, the quadratic penalty method takes the form

\hat{f}(x; \mu) = f(x) + \frac{\mu}{2} \sum_i h_i(x)^2 \, ,

where the semicolon denotes that \mu is a fixed parameter. The motivation for a quadratic penalty is that it is simple and results in a function that is continuously differentiable. The factor of one half is unnecessary but is included by convention because it eliminates the extra factor of two when taking derivatives. The penalty is nonzero unless the constraints are satisfied (h_i = 0), as desired.

Figure 5.27: Quadratic penalty for an equality constrained problem. The minimum of the penalized function (black dots) approaches the true constrained minimum (blue circle) as the penalty parameter \mu increases.

The value of the penalty parameter \mu must be chosen carefully. Mathematically, we recover the exact solution to the constrained problem only as \mu tends to infinity (see Figure 5.27). However, starting with a large value for \mu is not practical. This is because the larger the value of \mu, the larger the Hessian condition number, which corresponds to the curvature varying greatly with direction (see Example 4.10). This behavior makes the problem difficult to solve numerically.

To solve the problem more effectively, we begin with a small value of \mu and solve the unconstrained problem. We then increase \mu and solve the new unconstrained problem, using the previous solution as the starting point. We repeat this process until the optimality conditions (or some other approximate convergence criteria) are satisfied, as outlined in Algorithm 5.1. By gradually increasing \mu and reusing the solution from the previous problem, we avoid some of the ill-conditioning issues. Thus, the original constrained problem is transformed into a sequence of unconstrained optimization problems.
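A minimal sketch of this sequence of subproblems follows (in the spirit of Algorithm 5.1, which is not reproduced here); the example problem, starting point, and penalty schedule are hypothetical:

```python
# Illustrative sketch: quadratic penalty with an increasing parameter mu,
# warm-starting each unconstrained solve from the previous solution.
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 + 2.0 * x[1]**2   # objective
h = lambda x: x[0] + x[1] - 1.0         # equality constraint h = 0

x, mu = np.array([0.0, 0.0]), 1.0
for _ in range(6):
    fhat = lambda x, mu=mu: f(x) + 0.5 * mu * h(x)**2
    x = minimize(fhat, x).x             # unconstrained subproblem
    mu *= 10.0                          # increase the penalty parameter
print(x, h(x))  # x approaches (2/3, 1/3); h -> 0 only as mu grows
```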

There are three potential issues with the approach outlined in Algorithm 5.1. Suppose the starting value for \mu is too low. In that case, the penalty might not be enough to overcome a function that is unbounded from below, and the penalized function has no minimum.

The second issue is that we cannot practically approach \mu \rightarrow \infty. Hence, the solution to the problem is always slightly infeasible. By comparing the optimality condition of the constrained problem,

\nabla_x {\cal L} = \nabla f + J_h^\intercal \lambda = 0 \, ,

and the optimality condition of the penalized function,

\nabla_x \hat{f} = \nabla f + \mu J_h^\intercal h = 0 \, ,

we see that for each constraint j,

h_j \approx \frac{\lambda_j^*}{\mu} \, .

Because h_j = 0 at the optimum, \mu must be large to satisfy the constraints.

The third issue has to do with the curvature of the penalized function, which is directly proportional to \mu. The extra curvature is added in a direction perpendicular to the constraints, making the Hessian of the penalized function increasingly ill-conditioned as \mu increases. Thus, the need to increase \mu to improve accuracy directly leads to a function space that is increasingly challenging to solve.

The approach discussed so far handles only equality constraints, but we can extend it to handle inequality constraints. Instead of adding a penalty to both sides of the constraints, we add the penalty only when an inequality constraint is violated (i.e., when g_j(x) > 0). This behavior can be achieved by defining a new penalty function as

\hat{f}(x; \mu) = f(x) + \frac{\mu}{2} \sum_{j=1}^{n_g} \max\left(0, g_j(x)\right)^2 \, .

The only difference relative to the equality constraint penalty shown in Figure 5.27 is that the penalty is removed on the feasible side of the inequality constraint, as shown in Figure 5.30.

Figure 5.30: Quadratic penalty for an inequality constrained problem. The minimum of the penalized function approaches the constrained minimum from the infeasible side.

The inequality quadratic penalty can be used together with the quadratic penalty for equality constraints if we need to handle both types of constraints:

\hat{f}(x; \mu) = f(x) + \frac{\mu_h}{2} \sum_{l=1}^{n_h} h_l(x)^2 + \frac{\mu_g}{2} \sum_{j=1}^{n_g} \max\left(0, g_j(x)\right)^2 \, .

The two penalty parameters can be incremented in lockstep or independently.

Augmented Lagrangian

As explained previously, the quadratic penalty method requires a large value of \mu for constraint satisfaction, but the large \mu degrades the numerical conditioning. The augmented Lagrangian method helps alleviate this dilemma by adding the quadratic penalty to the Lagrangian instead of just adding it to the function. The augmented Lagrangian function for equality constraints is

\hat{f}(x; \lambda, \mu) = f(x) + \sum_{j=1}^{n_h} \lambda_j h_j(x) + \frac{\mu}{2} \sum_{j=1}^{n_h} h_j(x)^2 \, .

To estimate the Lagrange multipliers, we can compare the optimality conditions for the augmented Lagrangian,

\nabla_x \hat{f}(x; \lambda, \mu) = \nabla f(x) + \sum_{j=1}^{n_h} \left( \lambda_j + \mu h_j(x) \right) \nabla h_j = 0 \, ,

to those of the actual Lagrangian,

\nabla_x {\cal L}(x^*, \lambda^*) = \nabla f(x^*) + \sum_{j=1}^{n_h} \lambda_j^* \nabla h_j(x^*) = 0 \, .

Comparing these two conditions suggests the approximation

\lambda_j^* \approx \lambda_j + \mu h_j \, .

Therefore, we update the vector of Lagrange multipliers based on the current estimate of the Lagrange multipliers and constraint values using

\lambda_{k+1} = \lambda_k + \mu_k h(x_k) \, .

The complete algorithm is shown in Algorithm 5.2.
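A minimal sketch of the update loop follows (in the spirit of Algorithm 5.2, which is not reproduced here); the example problem, starting values, and schedule are hypothetical:

```python
# Illustrative sketch: augmented Lagrangian iteration for one equality
# constraint, with a slowly increasing mu and the multiplier update above.
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 + 2.0 * x[1]**2
h = lambda x: x[0] + x[1] - 1.0

x, lam, mu = np.array([0.0, 0.0]), 0.0, 1.0
for _ in range(10):
    fhat = lambda x, lam=lam, mu=mu: f(x) + lam * h(x) + 0.5 * mu * h(x)**2
    x = minimize(fhat, x).x
    lam += mu * h(x)   # multiplier update from the equation above
    mu *= 2.0          # mu can grow far more slowly than in a pure penalty
print(x, lam)          # x* = (2/3, 1/3), lam -> true multiplier (-4/3)
```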

This approach is an improvement on the plain quadratic penalty because updating the Lagrange multiplier estimates at each iteration allows for more accurate solutions without increasing \mu as much. The augmented Lagrangian approximation for each constraint obtained from Equation 5.49 is

h_j \approx \frac{1}{\mu} (\lambda_j^* - \lambda_j) \, .

The corresponding approximation in the quadratic penalty method is

h_j \approx \frac{\lambda_j^*}{\mu} \, .

The quadratic penalty relies solely on increasing \mu in the denominator to drive the constraints to zero.

However, the augmented Lagrangian also controls the numerator through the Lagrange multiplier estimate. If the estimate is reasonably close to the true Lagrange multiplier, then the numerator becomes small for modest values of \mu. Thus, the augmented Lagrangian can provide a good solution for x^* while avoiding the ill-conditioning issues of the quadratic penalty.

So far, we have only discussed equality constraints, for which the definition of the augmented Lagrangian is standard. Example 5.8 included an inequality constraint by assuming it was active and treating it like an equality, but this is not an approach that can be used in general. Several formulations exist for handling inequality constraints using the augmented Lagrangian approach.[4][5][6] One well-known approach is given by[7]

\hat{f}(x; \mu) = f(x) + \lambda^\intercal \bar{g}(x) + \frac{1}{2} \mu \|\bar{g}(x)\|_2^2 \, ,

where

\bar{g}_j(x) \equiv \begin{cases} h_j(x) & \text{for equality constraints} \\ g_j(x) & \text{if } g_j \ge -\lambda_j / \mu \\ -\lambda_j / \mu & \text{otherwise} \, . \end{cases}

5.4.2 Interior Penalty Methods

Interior penalty methods work the same way as exterior penalty methods—they transform the constrained problem into a series of unconstrained problems. The main difference with interior penalty methods is that they always seek to maintain feasibility. Instead of adding a penalty only when constraints are violated, they add a penalty as the constraint is approached from the feasible region. This type of penalty is particularly desirable if the objective function is ill-defined outside the feasible region. These methods are called interior because the iteration points remain on the interior of the feasible region. They are also referred to as barrier methods because the penalty function acts as a barrier preventing iterates from leaving the feasible region.

Figure 5.34: Two different interior penalty functions: inverse barrier and logarithmic barrier.

One possible interior penalty function to enforce g(x) \le 0 is the inverse barrier,

\pi(x) = -\sum_{j=1}^{n_g} \frac{1}{g_j(x)} \, ,

where \pi(x) \rightarrow \infty as g_j(x) \rightarrow 0^- (the superscript “-” indicates that zero is approached from the negative, feasible side). A more popular interior penalty function is the logarithmic barrier,

\pi(x) = -\sum_{j=1}^{n_g} \ln\left(-g_j(x)\right) \, ,

which also approaches infinity as the constraint tends to zero from the feasible side. The penalty function is then

\hat{f}(x; \mu) = f(x) - \mu \sum_{j=1}^{n_g} \ln(-g_j(x)) \, .

These two penalty functions are illustrated in Figure 5.34.

Neither of these penalty functions applies when g > 0 because they are designed to be evaluated only within the feasible space. Algorithms based on these penalties must be prevented from evaluating infeasible points.

Like exterior penalty methods, interior penalty methods must also solve a sequence of unconstrained problems, but with \mu \rightarrow 0 (see Algorithm 5.3). As the penalty parameter decreases, the region across which the penalty acts decreases, as shown in Figure 5.35.

Figure 5.35: Logarithmic barrier penalty for an inequality constrained problem. The minimum of the penalized function (black circles) approaches the true constrained minimum (blue circle) as the penalty parameter \mu decreases.

The methodology is the same as is described in Algorithm 5.1 but with a decreasing penalty parameter. One major weakness of the method is that the penalty function is not defined for infeasible points, so a feasible starting point must be provided. For some problems, providing a feasible starting point may be difficult or practically impossible.

The optimization must be safeguarded to prevent the algorithm from becoming infeasible when starting from a feasible point. This can be achieved by checking the constraint values during the line search and backtracking if any of them is greater than or equal to zero. Multiple backtracking iterations might be required.
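A minimal sketch of the barrier sequence follows (in the spirit of Algorithm 5.3, which is not reproduced here). The problem is hypothetical, and the safeguard is crudely emulated by returning an infinite penalized value at infeasible points, which the derivative-free subproblem solver used here tolerates:

```python
# Illustrative sketch: logarithmic barrier with a decreasing parameter mu,
# starting from a feasible point and staying in the interior.
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2.0)**2 + x[1]**2
g = lambda x: x[0] - 1.0                  # constraint g(x) <= 0

x, mu = np.array([0.0, 0.0]), 1.0         # feasible starting point
for _ in range(10):
    fhat = lambda x, mu=mu: (f(x) - mu * np.log(-g(x))
                             if g(x) < 0.0 else np.inf)
    x = minimize(fhat, x, method='Nelder-Mead').x
    mu *= 0.1                             # drive the barrier toward zero
print(x)  # approaches the constrained minimum at (1, 0) from the interior
```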

As with exterior penalty methods, the Hessian of the penalized function for interior penalty methods becomes increasingly ill-conditioned as the penalty parameter tends to zero.[8] There are augmented and modified barrier approaches that can avoid the ill-conditioning issue (and other methods that remain ill-conditioned but can still be solved reliably, albeit inefficiently).[9] However, these methods have been superseded by the modern interior-point methods discussed in Section 5.6, so we do not elaborate on further improvements to classical penalty methods.

5.5 Sequential Quadratic Programming

SQP is the first of the modern constrained optimization methods we discuss. SQP is not a single algorithm; instead, it is a conceptual method from which various specific algorithms are derived. We present the basic method but mention only a few of the many details needed for robust practical implementations. We begin with equality constrained SQP and then add inequality constraints.

5.5.1 Equality Constrained SQP

To derive the SQP method, we start with the KKT conditions for this problem and treat them as equation residuals that need to be solved. Recall that the Lagrangian (Equation 5.12) is

\mathcal{L}(x, \lambda) = f(x) + h(x)^\intercal \lambda \, .

Differentiating this function with respect to the design variables and Lagrange multipliers and setting the derivatives to zero, we get the KKT conditions,

r = \begin{bmatrix} \nabla_x \mathcal{L}(x, \lambda) \\ \nabla_\lambda \mathcal{L}(x, \lambda) \end{bmatrix} = \begin{bmatrix} \nabla f(x) + J_h^\intercal \lambda \\ h(x) \end{bmatrix} = 0 \, .

Recall that to solve a system of equations r(u) = 0 using Newton’s method, we solve a sequence of linear systems,

J_r(u_k)\, p_u = -r(u_k) \, ,

where J_r is the Jacobian of derivatives \partial r / \partial u. The step in the variables is p_u = u_{k+1} - u_k, where the variables are

u \equiv \begin{bmatrix} x \\ \lambda \end{bmatrix} \, .

Differentiating the vector of residuals (Equation 5.59) with respect to the two concatenated vectors in u yields the following block linear system:

\begin{bmatrix} H_\mathcal{L} & J_h^\intercal \\ J_h & 0 \end{bmatrix} \begin{bmatrix} p_x \\ p_\lambda \end{bmatrix} = \begin{bmatrix} -\nabla_x \mathcal{L} \\ -h \end{bmatrix} \, .

This is a linear system of n_x + n_h equations where the Jacobian matrix is square.

Figure 5.37: Structure and block shapes for the matrix in the SQP system (Equation 5.62).

The shape of the matrix and its blocks are as shown in Figure 5.37. We solve a sequence of these problems to converge to the optimal design variables and the corresponding optimal Lagrange multipliers. At each iteration, we update the design variables and Lagrange multipliers as follows:

\begin{aligned} x_{k+1} &= x_k + \alpha_k p_x \\ \lambda_{k+1} &= \lambda_k + p_\lambda \, . \end{aligned}

The inclusion of \alpha_k suggests that we do not automatically accept the Newton step (which corresponds to \alpha = 1) but instead perform a line search as previously described in Section 4.3. The function used in the line search needs some modification, as discussed later in this section.
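The sketch below applies this iteration with a full Newton step (\alpha_k = 1) and the exact Lagrangian Hessian to a hypothetical problem, minimize x_1^2 + 2 x_2^2 subject to x_1 + x_2 - 1 = 0; because that problem is itself a QP, the iteration converges immediately:

```python
# Illustrative sketch: equality constrained SQP via Newton iterations on
# the KKT residuals, solving the block system at each step.
import numpy as np

x, lam = np.array([0.0, 0.0]), np.array([0.0])
for _ in range(5):
    grad_f = np.array([2.0 * x[0], 4.0 * x[1]])
    J_h = np.array([[1.0, 1.0]])
    H_L = np.array([[2.0, 0.0], [0.0, 4.0]])  # h is linear, so H_h = 0
    h = np.array([x[0] + x[1] - 1.0])

    K = np.block([[H_L, J_h.T], [J_h, np.zeros((1, 1))]])
    rhs = -np.concatenate([grad_f + J_h.T @ lam, h])
    p = np.linalg.solve(K, rhs)               # [p_x, p_lambda]
    x, lam = x + p[:2], lam + p[2:]
print(x, lam)  # x* = (2/3, 1/3), lambda* = -4/3
```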

SQP can be derived in an alternative way that leads to different insights. This alternate approach requires an understanding of quadratic programming (QP), which is discussed in more detail in Section 11.3 but briefly described here. A QP problem is an optimization problem with a quadratic objective and linear constraints. In a general form, we can express any equality constrained QP as

\begin{aligned} \underset{x}{\text{minimize}} &\quad \frac{1}{2} x^\intercal Q x + q^\intercal x \\ \text{subject to} &\quad A x + b = 0 \, . \end{aligned}

A two-dimensional example with one constraint is illustrated in Figure 5.38.

Figure 5.38: Quadratic problem in two dimensions.

The constraint is a matrix equation that represents multiple linear equality constraints, one for every row in $A$. We can solve this optimization problem analytically from the optimality conditions. First, we form the Lagrangian:

\mathcal{L}(x, \lambda) = \frac{1}{2}x^\intercal Qx + q^\intercal x + \lambda^\intercal (A x + b) \, .

We now take the partial derivatives and set them equal to zero:

\begin{aligned} \nabla_x \mathcal{L} &= Qx + q + A^\intercal \lambda = 0\\ \nabla_\lambda \mathcal{L} &= Ax + b = 0 \, . \end{aligned}

We can express those same equations in a block matrix form:

\begin{bmatrix} Q & A^\intercal\\ A & 0\\ \end{bmatrix} \begin{bmatrix} x \\ \lambda\\ \end{bmatrix} = \begin{bmatrix} -q \\ -b\\ \end{bmatrix} \, .

This is like the procedure we used in solving the KKT conditions, except that these are linear equations, so we can solve them directly without any iteration. As in the unconstrained case, finding the minimum of a quadratic objective results in a system of linear equations.

As long as $Q$ is positive definite, the linear system always has a solution, and it is the global minimum of the QP.[11] The ease with which a QP can be solved provides a strong motivation for SQP. For a general constrained problem, we can make a local QP approximation of the nonlinear model, solve the QP, and repeat this process until convergence. This method involves iteratively solving a sequence of quadratic programming problems, hence the name sequential quadratic programming.
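Because the optimality conditions are linear, the entire QP solution fits in a few lines. The sketch below (function and variable names are hypothetical) solves the block system above with a dense linear solver, assuming $Q$ is positive definite and $A$ has full row rank:

```python
import numpy as np

def solve_eq_qp(Q, q, A, b):
    """Solve min 0.5 x'Qx + q'x subject to Ax + b = 0 via its KKT system.
    Assumes Q is positive definite and A has full row rank."""
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T],
                  [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, -np.concatenate([q, b]))
    return sol[:n], sol[n:]   # optimal x and Lagrange multipliers

# Example: min 0.5 (x1^2 + x2^2)  s.t.  x1 + x2 - 2 = 0  ->  x* = (1, 1)
x_opt, lam_opt = solve_eq_qp(np.eye(2), np.zeros(2),
                             np.array([[1.0, 1.0]]), np.array([-2.0]))
```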

To form the QP, we use a local quadratic approximation of the Lagrangian (removing the constant term because it does not change the solution) and a linear approximation of the constraints for some step $p$ near our current point. In other words, we locally approximate the problem as the following QP:

\begin{aligned} \underset{p}{\text{minimize}} &\quad \frac{1}{2} p^\intercal H_\mathcal{L} p + \nabla_{x}\mathcal{L}^\intercal p \\ \text{subject to} &\quad J_h p + h = 0 \, . \end{aligned}

We substitute the gradient of the Lagrangian into the objective:

\frac{1}{2} p^\intercal H_\mathcal{L} p + \nabla f^\intercal p + \lambda^\intercal J_h p \, .

Then, we substitute the constraint $J_h p = -h$ into the objective:

\frac{1}{2} p^\intercal H_\mathcal{L} p + \nabla f^\intercal p - \lambda^\intercal h \, .

Now, we can remove the last term in the objective because it does not depend on the variable ($p$), resulting in the following equivalent problem:

\begin{aligned} \underset{p}{\text{minimize}} &\quad \frac{1}{2} p^\intercal H_\mathcal{L} p + \nabla f^\intercal p \\ \text{subject to} &\quad J_h p + h = 0 \, . \end{aligned}

Using the QP solution method outlined previously results in the following system of linear equations:

\begin{bmatrix} H_\mathcal{L} & J_h^\intercal \\ J_h & 0 \end{bmatrix} \begin{bmatrix} p_x \\ \lambda_{k+1} \end{bmatrix} = \begin{bmatrix} -\nabla f \\ -h \end{bmatrix} \, .

Substituting $\lambda_{k+1} = \lambda_k + p_\lambda$ and multiplying through yields

\begin{bmatrix} H_\mathcal{L} & J_h^\intercal\\ J_h & 0 \end{bmatrix} \begin{bmatrix} p_x \\ p_\lambda \end{bmatrix} + \begin{bmatrix} J_h^\intercal \lambda_k\\ 0 \end{bmatrix} = \begin{bmatrix} -\nabla f\\ -h \end{bmatrix} \, .

Subtracting the second term from both sides yields

\begin{bmatrix} H_\mathcal{L} & J_h^\intercal\\ J_h & 0 \end{bmatrix} \begin{bmatrix} p_x \\ p_\lambda \end{bmatrix} = \begin{bmatrix} -\nabla_x \mathcal{L}\\ -h \end{bmatrix} \, ,

which is the same linear system we found from applying Newton’s method to the KKT conditions (Equation 5.62).

This derivation relies on the somewhat arbitrary choices of using a QP as the subproblem and approximating the Lagrangian subject to the constraints (rather than approximating the objective with constraints or the Lagrangian with no constraints).[12] Nevertheless, it is helpful to conceptualize the method as solving a sequence of QPs. This concept will motivate the solution process once we add inequality constraints.

5.5.2 Inequality Constraints

Introducing inequality constraints adds complications. For inequality constraints, we cannot solve the KKT conditions directly as we could for equality constraints. This is because the KKT conditions include the complementary slackness conditions $\sigma_j g_j = 0$, which we cannot solve directly. Even though the number of equations in the KKT conditions is equal to the number of unknowns, the complementary conditions do not provide complete information (they just state that each constraint is either active or inactive). Suppose we knew which of the inequality constraints were active ($g_j = 0$) and which were inactive ($\sigma_j = 0$) at the optimum. Then, we could use the same approach outlined in the previous section, treating the active constraints as equality constraints and ignoring the inactive constraints. Unfortunately, we do not know which constraints are active at the optimum beforehand in general. Finding which constraints are active in an iterative way is challenging because we would have to try all possible combinations of active constraints. This is intractable if there are many constraints.

A common approach to handling inequality constraints is to use an active-set method. The active set is the set of constraints that are active at the optimum (the only ones we ultimately need to enforce). Although the actual active set is unknown until the solution is found, we can estimate this set at each iteration. This subset of potentially active constraints is called the working set.

The working set is then updated at each iteration.

Similar to the SQP developed in the previous section for equality constraints, we can create an algorithm based on solving a sequence of QPs that linearize the constraints.[13] We extend the equality constrained QP (Equation 5.69) to include the inequality constraints as follows:

\begin{aligned} \underset{s}{\text{minimize}} &\quad \frac{1}{2} s^\intercal H_\mathcal{L} s + \nabla_x \mathcal{L}^\intercal s\\ \text{subject to} &\quad J_h s + h = 0\\ &\quad J_g s + g \le 0 \, . \end{aligned}

The determination of the working set could happen in the inner loop, that is, as part of the inequality constrained QP subproblem (Equation 5.76). Alternatively, we could choose a working set in the outer loop and then solve the QP subproblem with only equality constraints (Equation 5.69), where the working-set constraints would be posed as equalities. The former approach is more common and is discussed here. In that case, we need to consider the active-set problem only in the context of a QP. Many variations on active-set methods exist; we outline just one such approach based on a binding-direction method.

The general QP problem we need to solve is as follows:

\begin{aligned} \underset{x}{\text{minimize}} &\quad \frac{1}{2}x^\intercal Q x + q^\intercal x\\ \text{subject to} &\quad A x + b = 0\\ &\quad C x + d \le 0 \, . \end{aligned}

We assume that $Q$ is positive definite so that this problem is convex. Here, $Q$ corresponds to the Lagrangian Hessian. Using an appropriate quasi-Newton approximation (which we will discuss in Section 5.5.4) ensures a positive definite Lagrangian Hessian approximation.

Consider iteration $k$ in an SQP algorithm that handles inequality constraints. At the end of the previous iteration, we have a design point $x_k$ and a working set $W_k$. The working set in this approach is a set of row indices corresponding to the subset of inequality constraints that are active at $x_k$.[14] Then, we consider the corresponding inequality constraints to be equalities, and we write:

C_w x_k + d_w = 0 \, ,

where $C_w$ and $d_w$ correspond to the rows of the inequality constraints specified in the working set.

The constraints in the working set, combined with the equality constraints, must be linearly independent. Thus, we cannot include more working-set constraints (plus equality constraints) than design variables.

Although the active set is unique, there can be multiple valid choices for the working set.

Assume, for the moment, that the working set does not change at nearby points (i.e., we ignore the constraints outside the working set). We seek a step $p$ to update the design variables as $x_{k+1} = x_k + p$. We find $p$ by solving the following simplified QP that considers only the working set:

\begin{aligned} \underset{p}{\text{minimize}} &\quad \frac{1}{2}(x_k + p)^\intercal Q (x_k + p) + q^\intercal (x_k + p)\\ \text{subject to} &\quad A (x_k + p) + b = 0\\ &\quad C_w (x_k + p) + d_w = 0 \, . \end{aligned}

We solve this QP by varying $p$, so after multiplying out the terms in the objective, we can ignore the terms that do not depend on $p$. We can also simplify the constraints because we know the constraints were satisfied at the previous iteration (i.e., $A x_k + b = 0$ and $C_w x_k + d_w = 0$). The simplified problem is as follows:

\begin{aligned} \underset{p}{\text{minimize}} &\quad \frac{1}{2}p^\intercal Q p + (q + Q^\intercal x_k)^\intercal p \\ \text{subject to} &\quad A p = 0\\ &\quad C_w p = 0 \, . \end{aligned}

We now have an equality constrained QP that we can solve using the methods from the previous section.

Figure 5.39: Structure of the QP subproblem within the inequality constrained QP solution process.

Using Equation 5.68, the KKT solution to this problem is as follows:

\begin{bmatrix} Q & A^\intercal & C_w^\intercal\\ A & 0 & 0\\ C_w & 0 & 0\\ \end{bmatrix} \begin{bmatrix} p \\ \lambda \\ \sigma \\ \end{bmatrix} = \begin{bmatrix} -q - Q^\intercal x_k \\ 0 \\ 0\\ \end{bmatrix} \, .

Figure 5.39 shows the structure of the matrix in this linear system.

Let us consider the case where the solution of this linear system is nonzero. Solving the KKT conditions in Equation 5.80 ensures that all the constraints in the working set are still satisfied at $x_k + p$. Still, there is no guarantee that the step does not violate some of the constraints outside of our working set. Suppose that $C_n$ and $d_n$ define the constraints outside of the working set. If

C_n (x_k + p) + d_n \le 0

for all rows, all the constraints are still satisfied. In that case, we accept the step $p$ and update the design variables as follows:

x_{k+1} = x_k + p \, .

The working set remains unchanged as we proceed to the next iteration.

Otherwise, if some of the constraints are violated, we cannot take the full step $p$; instead, we scale the step by a factor $\alpha$ as follows:

x_{k+1} = x_k + \alpha p \, .

We cannot take the full step ($\alpha = 1$), but we would like to take as large a step as possible while still keeping all the constraints feasible.

Let us consider how to determine the appropriate step size, $\alpha$. Substituting the step update (Equation 5.84) into the equality constraints, we obtain the following:

A (x_k + \alpha p) + b = 0 \, .

We know that $A x_k + b = 0$ from solving the problem at the previous iteration. Also, we just solved for $p$ under the condition that $A p = 0$. Therefore, the equality constraints (Equation 5.85) remain satisfied for any choice of $\alpha$. By the same logic, the constraints in our working set remain satisfied for any choice of $\alpha$ as well.

Now let us consider the constraints that are not in the working set.

We denote $c_i$ as row $i$ of the matrix $C_n$ (associated with the inequality constraints outside of the working set). If these constraints are to remain satisfied, we require

c_i^\intercal (x_k + \alpha p) + d_i \le 0 \, .

After rearranging, this condition becomes

\alpha c_i^\intercal p \le -(c_i^\intercal x_k + d_i) \, .

We do not divide through by $c_i^\intercal p$ yet because the direction of the inequality would change depending on its sign, so we consider the two possibilities separately. Because the QP constraints were satisfied at the previous iteration, we know that $c_i^\intercal x_k + d_i \le 0$ for all $i$. Thus, the right-hand side is always nonnegative. If $c_i^\intercal p$ is negative, then the inequality will be satisfied for any choice of $\alpha$. Alternatively, if $c_i^\intercal p$ is positive, we can rearrange Equation 5.87 to obtain the following:

\alpha_i \le - \frac{c_i^\intercal x_k + d_i}{c_i^\intercal p} \, .

This equation determines how large $\alpha$ can be without causing one of the constraints outside of the working set to become active. Because multiple constraints may become active, we have to evaluate $\alpha_i$ for each one and choose the smallest $\alpha$ among all constraints.
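This ratio test is simple to implement. A minimal sketch (hypothetical names; assumes the current $x$ satisfies all constraints) is:

```python
import numpy as np

def max_feasible_step(x, p, C_n, d_n):
    """Largest alpha in (0, 1] that keeps C_n x + d_n <= 0 satisfied,
    plus the index of the blocking row (None if the full step is fine)."""
    alpha, blocking = 1.0, None
    for i in range(C_n.shape[0]):
        cp = C_n[i] @ p
        if cp > 0:   # only these rows can be driven toward violation
            alpha_i = -(C_n[i] @ x + d_n[i]) / cp
            if alpha_i < alpha:
                alpha, blocking = alpha_i, i
    return alpha, blocking
```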

A constraint for which $\alpha < 1$ is said to be blocking. In other words, if we had included that constraint in our working set before solving the QP, it would have changed the solution. We add one of the blocking constraints to the working set and proceed to the next iteration.[15]

Now consider the case where the solution to Equation 5.81 is $p = 0$. If all inequality constraint Lagrange multipliers are positive ($\sigma_i > 0$), the KKT conditions are satisfied, and we have solved the original inequality constrained QP. If one or more $\sigma_i$ values are negative, additional iterations are needed. We find the most negative $\sigma_i$, remove constraint $i$ from the working set, and proceed to the next iteration.

As noted previously, all the constraints in the reduced QP (the equality constraints plus all working-set constraints) must be linearly independent; thus, the stacked matrix $[A^\intercal \; C_w^\intercal]^\intercal$ must have full row rank. Otherwise, there would be no solution to Equation 5.81. Therefore, the starting working set might not include all active constraints at $x_0$ and must instead contain only a subset, such that linear independence is maintained.

Similarly, when adding a blocking constraint to the working set, we must again check for linear independence. At a minimum, we need to ensure that the number of constraints in the working set does not exceed $n_x$. The complete algorithm for solving an inequality constrained QP is shown in Algorithm 5.4.
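To connect these pieces, here is a compact sketch of the active-set logic just described (inequality constraints only, with hypothetical names; it assumes a feasible starting point and a linearly independent working set, and omits the safeguards a robust implementation such as Algorithm 5.4 requires):

```python
import numpy as np

def solve_ineq_qp(Q, q, C, d, x, W, tol=1e-10, max_iter=100):
    """Active-set sketch for: min 0.5 x'Qx + q'x  s.t.  C x + d <= 0.
    x must be feasible; W is a list of active constraint row indices."""
    n = len(x)
    for _ in range(max_iter):
        Cw = C[W] if W else np.zeros((0, n))
        nw = Cw.shape[0]
        # Equality constrained subproblem over the working set
        K = np.block([[Q, Cw.T], [Cw, np.zeros((nw, nw))]])
        sol = np.linalg.solve(K, -np.concatenate([q + Q @ x, np.zeros(nw)]))
        p, sigma = sol[:n], sol[n:]
        if np.linalg.norm(p) < tol:
            if nw == 0 or sigma.min() >= 0:
                return x, W                    # KKT point of the original QP
            W.pop(int(np.argmin(sigma)))       # drop most negative multiplier
        else:
            alpha, blocking = 1.0, None        # ratio test over the rest
            for i in range(C.shape[0]):
                if i in W:
                    continue
                cp = C[i] @ p
                if cp > 0:
                    alpha_i = -(C[i] @ x + d[i]) / cp
                    if alpha_i < alpha:
                        alpha, blocking = alpha_i, i
            x = x + alpha * p
            if blocking is not None:
                W.append(blocking)
    return x, W
```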

Because SQP solves a sequence of QPs, an effective approach is to use the optimal $x$ and active set from the previous QP as the starting point and working set for the next QP. The algorithm outlined in this section requires both a feasible starting point and a working set of linearly independent constraints. Although the previous starting point and working set usually satisfy these conditions, this is not guaranteed, and adjustments may be necessary.

Algorithms to determine a feasible point are widely used (often by solving a linear programming problem). There are also algorithms to remove or add to the constraint matrix as needed to ensure full rank.12

5.5.3 Merit Functions and Filters

Similar to what we did in unconstrained optimization, we do not directly accept the step $p$ returned from solving the subproblem (Equation 5.62 or Equation 5.76). Instead, we use $p$ as the initial step in a line search.

In the line search for unconstrained problems (Section 4.3), determining if a point was good enough to terminate the search was based solely on comparing the objective function value (and the slope when enforcing the strong Wolfe conditions). For constrained optimization, we need to make some modifications to these methods and criteria.

In constrained optimization, objective function decrease and feasibility often compete with each other. During a line search, a new point may decrease the objective but increase the infeasibility, or it may decrease the infeasibility but increase the objective. We need to take these two metrics into account to determine the line search termination criterion.

The Lagrangian is a function that accounts for the two metrics. However, at a given iteration, we only have an estimate of the Lagrange multipliers, which can be inaccurate.

One way to combine the objective value with the constraints in a line search is to use merit functions, which are similar to the penalty functions introduced in Section 5.4. Common merit functions include functions that use the norm of constraint violations:

\hat{f}(x; \mu) = f(x) + \mu \|\bar{g}(x)\|_p \, ,

where $p$ is 1 or 2, and $\bar{g}$ denotes the constraint violations, defined as

\bar{g}_j(x) = \begin{cases} h_j(x) & \text{for equality constraints}\\ \max(0, g_j(x)) & \text{for inequality constraints.} \end{cases}
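As a concrete illustration, a merit function of this form takes only a few lines (a sketch with hypothetical names; `f`, `h`, and `g` are callables returning the objective, equality constraints, and inequality constraints):

```python
import numpy as np

def merit(f, h, g, x, mu, p=1):
    """Penalty-style merit function: f plus mu times the p-norm
    (p = 1 or 2) of the constraint violations."""
    violations = np.concatenate([h(x), np.maximum(0.0, g(x))])
    return f(x) + mu * np.linalg.norm(violations, p)
```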

The augmented Lagrangian from Section 5.4.1 can also be repurposed for a constrained line search (see Equation 5.53 and Equation 5.54).

Like penalty functions, one downside of merit functions is that it is challenging to choose a suitable value for the penalty parameter $\mu$. This parameter needs to be large to ensure feasibility. However, if it is too large, a full Newton step might not be permitted. This might slow the convergence unnecessarily. Using the augmented Lagrangian can help, as discussed in Section 5.4.1. However, there are specific techniques used in SQP line searches and various safeguarding techniques needed for robustness.

Filter methods are an alternative to using penalty-based methods in a line search.13 Filter methods interfere less with the full Newton step and are effective for both SQP and interior-point methods (which are introduced in Section 5.6).14,15 The approach is based on concepts from multiobjective optimization, which is the subject of Chapter 9. In the filter method, there are two objectives: decrease the objective function and decrease infeasibility. A point is said to dominate another if its objective is lower and the sum of its constraint violations is lower. The filter consists of all the points that have been found to be non-dominated in the line searches so far. The line search terminates when it finds a point that is not dominated by any point in the current filter. That new point is then added to the filter, and any points that it dominates are removed from the filter.[16]

This is only the basic concept. Robust implementation of a filter method requires imposing sufficient decrease conditions, not unlike those in the unconstrained case, and several other modifications. Fletcher et al. (2006) provide more details on filter methods.

5.5.4 Quasi-Newton SQP

In the discussion of the SQP method so far, we have assumed that we have the Hessian of the Lagrangian, $H_\mathcal{L}$. Similar to the unconstrained optimization case, the Hessian might not be available or might be too expensive to compute. Therefore, it is desirable to use a quasi-Newton approach that approximates the Hessian, as we did in Section 4.4.4.

The difference now is that we need an approximation of the Lagrangian Hessian instead of the objective function Hessian. We denote this approximation at iteration $k$ as $\tilde H_{\mathcal{L}_{k}}$.

Similar to the unconstrained case, we can approximate $\tilde H_{\mathcal{L}_{k}}$ using the gradients of the Lagrangian and a quasi-Newton update, such as the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update.

Unlike in unconstrained optimization, we need the Hessian itself rather than its inverse. Therefore, we use the version of the BFGS formula that updates the Hessian (Equation 4.87):

\tilde H_{\mathcal{L}_{k+1}} = \tilde H_{\mathcal{L}_{k}} - \frac{\tilde H_{\mathcal{L}_{k}} s_k s_k^\intercal \tilde H_{\mathcal{L}_{k}}}{s_k^\intercal \tilde H_{\mathcal{L}_{k}} s_k} + \frac{y_k y_k^\intercal}{y_k^\intercal s_k} \, ,

where:

\begin{aligned} s_k &= x_{k+1} - x_k\\ y_k &= \nabla_x\mathcal{L}(x_{k+1}, \lambda_{k+1}) - \nabla_x\mathcal{L}(x_{k}, \lambda_{k+1}) \, . \end{aligned}

The step in the design variable space, $s_k$, is the step that resulted from the latest line search. The Lagrange multiplier is fixed to the latest value when approximating the curvature of the Lagrangian because we only need the curvature in the space of the design variables.

Recall that for the QP problem (Equation 5.76) to have a solution, $\tilde H_{\mathcal{L}_{k}}$ must be positive definite. To ensure a positive definite approximation, we can use a damped BFGS update.16 [17]

This method replaces $y$ with a new vector $r$, defined as

r_k = \theta_k y_k + (1-\theta_k) \tilde H_{\mathcal{L}_{k}} s_k \, ,

where the scalar $\theta_k$ is defined as

\theta_k = \begin{cases} 1 & \text{if} \quad s_k^\intercal y_k \ge 0.2\, s_k^\intercal \tilde H_{\mathcal{L}_{k}} s_k \\ \dfrac{0.8\, s_k^\intercal \tilde H_{\mathcal{L}_{k}} s_k}{s_k^\intercal \tilde H_{\mathcal{L}_{k}} s_k - s_k^\intercal y_k} & \text{if} \quad s_k^\intercal y_k < 0.2\, s_k^\intercal \tilde H_{\mathcal{L}_{k}} s_k \, , \end{cases}

which can range from 0 to 1. We then use the same BFGS update formula (Equation 5.91), except that we replace each $y_k$ with $r_k$.
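The damped update is straightforward to implement. A sketch (hypothetical name; assumes the current approximation `H` is symmetric positive definite) follows:

```python
import numpy as np

def damped_bfgs_update(H, s, y):
    """Damped BFGS update of the Lagrangian Hessian approximation
    (Equations 5.91 to 5.94); keeps the result positive definite."""
    Hs = H @ s
    sHs = s @ Hs                 # curvature predicted by the current H
    sy = s @ y                   # curvature observed along the step
    theta = 1.0 if sy >= 0.2 * sHs else 0.8 * sHs / (sHs - sy)
    r = theta * y + (1.0 - theta) * Hs     # damped replacement for y
    return H - np.outer(Hs, Hs) / sHs + np.outer(r, r) / (s @ r)
```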

To better understand this update, let us consider the two extremes for $\theta_k$. If $\theta_k = 0$, then Equation 5.93 in combination with Equation 5.91 yields $\tilde H_{\mathcal{L}_{k+1}} = \tilde H_{\mathcal{L}_{k}}$; that is, the Hessian approximation is unmodified. At the other extreme, $\theta_k = 1$ yields the full BFGS update formula ($r_k$ is set to $y_k$). Thus, the parameter $\theta_k$ provides a linear weighting between keeping the current Hessian approximation and using the full BFGS update.

The definition of $\theta_k$ (Equation 5.94) ensures that $\tilde H_{\mathcal{L}_{k+1}}$ stays close enough to $\tilde H_{\mathcal{L}_{k}}$ and remains positive definite. The damping is activated when the curvature predicted along the latest step falls below one-fifth of the curvature predicted by the latest approximate Hessian. This can happen when the function is flattening or when the curvature becomes negative.

5.5.5 Algorithm Overview

We now put together the various pieces in a high-level description of SQP with quasi-Newton approximations in Algorithm 5.5.[18] For the convergence criterion, we can use an infinity norm of the KKT system residual vector. For better control over the convergence, we can consider two separate tolerances: one for the norm of the optimality and another for the norm of the feasibility. For problems that only have equality constraints, we can solve the corresponding QP (Equation 5.62) instead.

5.6 Interior-Point Methods

Interior-point methods use concepts from both SQP and interior penalty methods.[19] These methods form an objective similar to the interior penalty but with the key difference that instead of penalizing the constraints directly, they add slack variables to the set of optimization variables and penalize the slack variables. The resulting formulation is as follows:

\begin{aligned} \underset{x, s}{\text{minimize}} &\quad f(x) - \mu_{b} \sum_{j=1}^{n_g} \ln s_j \\ \text{subject to} &\quad h(x) = 0 \\ &\quad g(x) + s = 0 \, . \end{aligned}

This formulation turns the inequality constraints into equality constraints and thus avoids the combinatorial problem.

Similar to SQP, we apply Newton’s method to solve for the KKT conditions. However, instead of solving the KKT conditions of the original problem (Equation 5.59), we solve the KKT conditions of the interior-point formulation (Equation 5.95).

The slack variables in Equation 5.95 do not need to be squared, as was done in deriving the KKT conditions, because the logarithm is only defined for positive $s$ values and acts as a barrier preventing negative values of $s$ (although we need to prevent the line search from producing negative $s$ values, as discussed later). Because $s$ is always positive, it follows that $g(x^*) < 0$ at the solution, which satisfies the inequality constraints.

Like penalty method formulations, the interior-point formulation (Equation 5.95) is only equivalent to the original constrained problem in the limit as $\mu_b \rightarrow 0$. Thus, as in the penalty methods, we need to solve a sequence of these problems in which $\mu_b$ approaches zero.

First, we form the Lagrangian for this problem as

\mathcal{L}(x, \lambda, \sigma, s) = f(x) - \mu_{b} e^\intercal \ln s + h(x)^\intercal \lambda + (g(x) + s)^\intercal \sigma \, ,

where $\ln s$ is an $n_g$-vector whose components are the logarithms of each component of $s$, and $e = [1, \ldots, 1]$ is an $n_g$-vector of 1s introduced to express the sum in vector form. By taking derivatives with respect to $x$, $\lambda$, $\sigma$, and $s$, we derive the KKT conditions for this problem as

\begin{aligned} \nabla f(x) + J_h(x)^\intercal \lambda + J_g(x)^\intercal \sigma &= 0 \\ h &= 0 \\ g + s &= 0 \\ -\mu_{b} S^{-1} e + \sigma &= 0 \, , \end{aligned}

where $S$ is a diagonal matrix whose diagonal entries are given by the slack variable vector, and therefore $S_{kk}^{-1} = 1/s_k$. The result is a set of $n_x + n_h + 2 n_g$ equations and the same number of variables.

To get a system of equations that is more favorable for Newton’s method, we multiply the last equation by $S$ to obtain

\begin{aligned} \nabla f(x) + J_h(x)^\intercal \lambda + J_g(x)^\intercal \sigma &= 0 \\ h &= 0 \\ g + s &= 0 \\ -\mu_{b} e + S \sigma &= 0 \, . \end{aligned}

We now have a set of residual equations to which we can apply Newton’s method, just like we did for SQP. Taking the Jacobian of the residuals in Equation 5.98, we obtain the linear system

\begin{bmatrix} H_\mathcal{L}(x) & J_h(x)^\intercal & J_g(x)^\intercal & 0 \\ J_h(x) & 0 & 0 & 0 \\ J_g(x) & 0 & 0 & I \\ 0 & 0 & S & \Sigma \\ \end{bmatrix} \begin{bmatrix} s_x\\s_\lambda\\s_{\sigma}\\s_s \end{bmatrix} = - \begin{bmatrix} \nabla_x \mathcal{L}(x, \lambda, \sigma)\\ h(x)\\ g(x) + s\\ S\sigma - \mu_{b} e\\ \end{bmatrix} \, ,

where $\Sigma$ is a diagonal matrix whose entries are given by $\sigma$, and $I$ is the identity matrix. For numerical efficiency, we make the matrix symmetric by multiplying the last equation by $S^{-1}$ to get the symmetric linear system, as follows:

\begin{bmatrix} H_\mathcal{L}(x) & J_h(x)^\intercal & J_g(x)^\intercal & 0 \\ J_h(x) & 0 & 0 & 0 \\ J_g(x) & 0 & 0 & I \\ 0 & 0 & I & S^{-1}\Sigma \\ \end{bmatrix} \begin{bmatrix} s_x\\s_\lambda\\s_{\sigma}\\s_s \end{bmatrix} = - \begin{bmatrix} \nabla_x \mathcal{L}(x, \lambda, \sigma)\\ h(x)\\ g(x) + s\\ \sigma - \mu_{b} S^{-1} e\\ \end{bmatrix} \, .

The advantage of this equivalent system is that we can use a linear solver specialized for symmetric matrices, which is more efficient than a solver for general linear systems.
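For illustration, the following sketch (hypothetical names; dense NumPy arrays, with `s` and `sigma` strictly positive) assembles the symmetric system of Equation 5.100:

```python
import numpy as np

def ip_kkt_system(HL, Jh, Jg, s, sigma, grad_L, h, g, mu):
    """Assemble the symmetric interior-point system of Equation 5.100.
    Solving K @ step = rhs yields (s_x, s_lambda, s_sigma, s_s)."""
    nx, nh, ng = HL.shape[0], Jh.shape[0], Jg.shape[0]
    K = np.block([
        [HL,                 Jh.T,               Jg.T,               np.zeros((nx, ng))],
        [Jh,                 np.zeros((nh, nh)), np.zeros((nh, ng)), np.zeros((nh, ng))],
        [Jg,                 np.zeros((ng, nh)), np.zeros((ng, ng)), np.eye(ng)],
        [np.zeros((ng, nx)), np.zeros((ng, nh)), np.eye(ng),         np.diag(sigma / s)],
    ])
    rhs = -np.concatenate([grad_L, h, g + s, sigma - mu / s])
    return K, rhs
```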

Figure 5.46: Structure and shape of the interior-point system matrix from Equation 5.100.

If we had applied Newton’s method to the original KKT system (Equation 5.97) and then made it symmetric, we would have obtained a term with $S^{-2}$, which would make the system more challenging than with the $S^{-1}$ term in Equation 5.100. Figure 5.46 shows the structure and block sizes of the matrix.

5.6.1 Modifications to the Basic Algorithm

We can reuse many of the concepts covered under SQP, including quasi-Newton estimates of the Lagrangian Hessian and line searches with merit functions or filters. The merit function would usually be modified to a form more consistent with the formulation used in Equation 5.95. For example, we could write a merit function as follows:

\hat{f}(x) = f(x) - \mu_{b} \sum_{i=1}^{n_g} \ln s_i + \frac{1}{2}\mu_p \left(\| h(x) \|^2 + \|g(x) + s \|^2\right) \, ,

where $\mu_b$ is the barrier parameter from Equation 5.95, and $\mu_p$ is the penalty parameter. Additionally, we must enforce a maximum step size $\alpha_\text{max}$ in the line search so that the implicit constraint $s > 0$ remains enforced. The maximum allowed step size can be computed prior to the line search because we know the values of $s$ and $p_s$ and require that

s + \alpha p_s \ge 0 \, .

In practice, we enforce a fractional tolerance so that we do not get too close to zero. For example, we could enforce the following:

s + \alpha_{\text{max}} p_s = \tau s \, ,

where $\tau$ is a small value (e.g., $\tau = 0.005$). The maximum step size is the smallest positive value that satisfies this equation for all entries in $s$. A possible algorithm for determining the maximum step size for feasibility is shown in Algorithm 5.6.
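A vectorized sketch of this computation (hypothetical name; `s` and `p_s` are NumPy arrays with `s > 0`) is:

```python
import numpy as np

def max_step_to_boundary(s, p_s, tau=0.005):
    """Largest alpha <= 1 such that s + alpha * p_s >= tau * s.
    Only entries with p_s < 0 can push a slack variable toward zero."""
    shrinking = p_s < 0
    if not np.any(shrinking):
        return 1.0
    return min(1.0, np.min((tau - 1.0) * s[shrinking] / p_s[shrinking]))
```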

The line search typically uses a simple backtracking approach because we must enforce a maximum step length. After the line search, we can update $x$ and $s$ as follows:

\begin{align} x_{k+1} &= x_k + \alpha_k p_x, \quad \text{where} \quad \alpha_k \in (0, \alpha_{\text{max}}] \\ s_{k+1} &= s_k + \alpha_k p_s \, . \end{align}

The Lagrange multipliers $\sigma$ must also remain positive, so the procedure in Algorithm 5.6 is repeated for $\sigma$ to find the maximum step length $\alpha_\sigma$ for the Lagrange multipliers. Enforcing a maximum step size for Lagrange multiplier updates was not necessary for the SQP method because the QP subproblem handled the enforcement of nonnegative Lagrange multipliers. We then update both sets of Lagrange multipliers using this step size:

\begin{align} \lambda_{k+1} &= \lambda_k + \alpha_\sigma p_\lambda \\ \sigma_{k+1} &= \sigma_k + \alpha_\sigma p_\sigma \, . \end{align}

Finally, we need to update the barrier parameter $\mu_b$. The simplest approach is to decrease it by a multiplicative factor:

{\mu_{b}}_{k+1} = \rho\, {\mu_{b}}_{k} \, ,

where $\rho$ is typically around 0.2. Better methods are adaptive, based on how well the optimizer is progressing. There are other implementation details for improving robustness that can be found in the literature.22,23

The steps for a basic interior-point method are detailed in Algorithm 5.7.[20]

This version focuses on a line search approach, but there are variations of interior-point methods that use the trust-region approach.

5.6.2 SQP Comparisons and Examples

Both interior-point methods and SQP are considered state-of-the-art approaches for solving nonlinear constrained optimization problems. Each of these two methods has its strengths and weaknesses. The KKT system structure is identical at each iteration for interior-point methods, so we can exploit this structure for improved computational efficiency. SQP is not as amenable to this because changes in the working set cause the system’s structure to change between iterations. The downside of the interior-point structure is that turning all constraints into equalities means that all constraints must be included at every iteration, even if they are inactive. In contrast, active-set SQP only needs to consider a subset of the constraints, reducing the subproblem size.

Active-set SQP methods are generally more effective for medium-scale problems, whereas interior-point methods are more effective for large-scale problems.

Interior-point methods are usually more sensitive to the initial starting point and the scaling of the problem. Therefore, SQP methods are usually more suitable for solving sequences of warm-started problems.25,11 These are just general guidelines; both approaches should be considered and tested for a given problem of interest.

5.7 Constraint Aggregation

As will be discussed in Chapter 6, some derivative computation methods are efficient for problems with many inputs and few outputs, and others are advantageous for problems with few inputs and many outputs. Thus, if we have many design variables and many constraints, there is no efficient way to compute the required constraint Jacobian.

One workaround is to aggregate the constraints and solve the optimization problem with a new set of constraints. Each aggregation would have the form

\bar g(x) \equiv \bar g(g(x)) \le 0 \, ,

where $\bar g$ is a scalar, and $g$ is the vector of constraints we want to aggregate. One of the properties we want for the aggregation function is that if any of the original constraints are violated, then $\bar g > 0$.

One way to aggregate constraints would be to define the aggregated constraint function as the maximum of all constraints,

\bar g(x) = \max(g(x)) \, .

If $\max(g(x)) \le 0$, then we know that all components of $g(x)$ are less than or equal to zero. However, the maximum function is not differentiable, so it is not desirable for gradient-based optimization.

In the rest of this section, we introduce several viable functions for constraint aggregation that are differentiable.

The Kreisselmeier–Steinhauser (KS) aggregation was one of the first aggregation functions proposed for optimization and is defined as follows:26

\bar g_\text{KS}(\rho, g) = \frac{1}{\rho} \ln \left( \sum_{j=1}^{n_g} \exp(\rho g_j) \right) \, ,

where $\rho$ is an aggregation factor that determines how close this function is to the maximum function (Equation 5.110). As $\rho \rightarrow \infty$, $\bar g_\text{KS}(\rho, g) \rightarrow \max(g)$. However, as $\rho$ increases, the curvature of $\bar g$ increases, which can cause ill-conditioning in the optimization.

The exponential function disproportionately weighs the higher positive values in the constraint vector, but it does so in a smooth way. Because the exponential function can easily result in overflow, it is preferable to use the alternate (but equivalent) form of the KS function,

\bar g_\text{KS}(\rho, g) = \max(g) + \frac{1}{\rho} \ln \left( \sum_{j=1}^{n_g} \exp\left(\rho \left(g_j - \max(g)\right)\right) \right) \, .

The value of $\rho$ should be tuned for each problem, but $\rho = 100$ works well for many problems.
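In code, the shifted form is a one-liner (a sketch with a hypothetical name):

```python
import numpy as np

def ks_aggregate(g, rho=100.0):
    """Overflow-safe KS aggregation: shifting by max(g) keeps the
    arguments of the exponentials nonpositive."""
    g_max = np.max(g)
    return g_max + np.log(np.sum(np.exp(rho * (g - g_max)))) / rho
```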

The $p$-norm aggregation function is another option for aggregation and is defined as follows:27

\bar g_\text{PN}(\rho) = \max_j |g_j| \left( \sum_{j=1}^{n_g} \left|\frac{g_j}{\max_j |g_j|}\right|^{\rho} \right)^{\frac{1}{\rho}} \, .

The absolute value in this equation can be an issue if $g$ can take both positive and negative values because the function is not differentiable in regions where $g$ transitions from positive to negative.

A class of aggregation functions known as induced functions was designed to provide more accurate estimates of $\max(g)$ for a given value of $\rho$ than the KS and $p$-norm functions.28 There are two main types of induced functions: one uses exponentials, and the other uses powers. The induced exponential function is given by

g_\text{IE}(\rho) = \frac{\sum_{j=1}^{n_g} g_j \exp(\rho g_j)}{\sum_{j=1}^{n_g} \exp(\rho g_j)} \, .

The induced power function is given by

g_\text{IP}(\rho) = \frac{\sum_{j=1}^{n_g} g_j^{\rho + 1}}{\sum_{j=1}^{n_g} g_j^{\rho}} \, .

The induced power function is only applicable if $g_j \ge 0$ for $j = 1, \ldots, n_g$.

5.8 Summary

Most engineering design problems are constrained. When formulating a problem, practitioners should be critical of their choice of objective function and constraints. Metrics that should be constraints are often wrongly formulated as objectives. A constraint should not limit the design unnecessarily and should reflect the underlying physical reason for that constraint as much as possible.

The first-order optimality conditions for constrained problems—the KKT conditions—require the gradient of the objective to be a linear combination of the gradients of the constraints. This ensures that there is no feasible descent direction. Each constraint is associated with a Lagrange multiplier that quantifies how significant that constraint is at the optimum. For inequality constraints, a Lagrange multiplier that is zero means that the corresponding constraint is inactive. For inequality constraints, slack variables quantify how close a constraint is to becoming active; a slack variable that is zero means that the corresponding constraint is active. Lagrange multipliers and slack variables are unknowns that need to be solved together with the design variables. The complementary slackness condition introduces a combinatorial problem that is challenging to solve.

Penalty methods solve constrained problems by adding a metric to the objective function quantifying how much the constraints are violated. These methods are helpful as a conceptual model and are used in gradient-free optimization algorithms (Chapter 7). However, penalty methods only find approximate solutions and are subject to numerical issues when used with gradient-based optimization.

Methods based on the KKT conditions are preferable. The most widely used among such methods are SQP and interior-point methods. These methods apply Newton’s method to the KKT conditions. One primary difference between these two methods is in the treatment of inequality constraints. SQP methods distinguish between active and inactive constraints, treating potentially active constraints as equality constraints and ignoring the potentially inactive ones. Interior-point methods add slack variables to force all constraints to behave like equality constraints.

Problems

Footnotes
  1. For a more formal introduction to these concepts, see Chapter 2 of 1. 2 provides a comprehensive treatment of linear algebra.

  2. The subspaces spanned by $A$, $A^\intercal$, and their respective nullspaces constitute four fundamental subspaces, which we elaborate on in Four Fundamental Subspaces in Linear Algebra.

  3. Recall the fundamental theorem of linear algebra illustrated in Figure 5.3 and the four subspaces reviewed in Four Fundamental Subspaces in Linear Algebra.

  4. Despite our convention of reserving Greek symbols for scalars, we use $\lambda$ to represent the $n_h$-vector of Lagrange multipliers because it is common usage.

  5. This happens to be the same condition for a positive-definite $H_\mathcal{L}$ in this case, but this does not happen in general.

  6. Although this point does not satisfy the second-order necessary conditions, it is still a constrained minimum.

  7. Farkas’ lemma has other applications beyond optimization and can be written in various equivalent forms. Using the statement by 3, we set $A = J_g$, $x = -p$, $c = -\nabla f$, and $y = \sigma$.

  8. This is a special case of the Hadamard product of two matrices.

  9. As an example, we could change the value of the allowable stress constraint in the structural optimization problem of Example 3.9.

  10. This condition is similar to Equation 5.7, but here we apply it to all equality and active constraints except for constraint $i$.

  11. In other words, this is a convex problem. Convex optimization is discussed in Chapter 11.

  12. The Lagrangian objective can also be considered to be an approximation of the objective along the feasible surface $h(x) = 0$.10

  13. Linearizing the constraints can sometimes lead to an infeasible QP subproblem; additional techniques are needed to handle such cases.11,12

  14. This is not a universal definition. For example, the constraints in the working set need not be active at $x_k$ in some approaches.

  15. In practice, adding only one constraint to the working set at a time (or removing only one constraint in other steps described later) typically leads to faster convergence.

  16. See Section 9.2 for more details on the concept of dominance.

  17. The damped BFGS update is not always the best approach. There are approaches built around other approximation methods, such as symmetric rank 1 (SR1).17 Limited-memory updates similar to L-BFGS (see Section 4.4.5) can be used when storing a dense Hessian for large problems is prohibitive.18

  18. A few popular SQP implementations include SNOPT,12 Knitro,19 MATLAB’s fmincon, and SLSQP.20 The first three are commercial options, whereas SLSQP is open source. There are interfaces in different programming languages for these optimizers, including pyOptSparse (for SNOPT and SLSQP).21

  19. The name interior point stems from early methods based on interior penalty methods that assumed that the initial point was feasible. However, modern interior-point methods can start with infeasible points.

  20. IPOPT is an open-source nonlinear interior-point method.24 The commercial packages Knitro19 and fmincon mentioned earlier also include interior-point methods.

  21. Reference 29 provides this approximation on page 6-17.

  22. This is a well-known optimization problem formulated by 30 when he first proposed integrating numerical optimization with finite-element structural analysis.

References
  1. Boyd, S. P., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
  2. Strang, G. (2006). Linear Algebra and its Applications (4th ed.). Cengage Learning.
  3. Dax, A. (1997). Classroom note: An elementary proof of Farkas’ lemma. SIAM Review, 39(3), 503–507. 10.1137/S0036144594295502
  4. Gill, P. E., Murray, W., Saunders, M. A., & Wright, M. H. (1986). Some theoretical properties of an augmented Lagrangian merit function [SOL 86-6R]. Systems Optimization Laboratory. https://apps.dtic.mil/sti/citations/ADA168503
  5. Di Pillo, G., & Grippo, L. (1982). A new augmented Lagrangian function for inequality constraints in nonlinear programming problems. Journal of Optimization Theory and Applications, 36(4), 495–519. 10.1007/BF00940544
  6. Birgin, E. G., Castillo, R. A., & Martínez, J. M. (2005). Numerical comparison of augmented Lagrangian algorithms for nonconvex problems. Computational Optimization and Applications, 31(1), 31–55. 10.1007/s10589-005-1066-7
  7. Rockafellar, R. T. (1973). The multiplier method of Hestenes and Powell applied to convex programming. Journal of Optimization Theory and Applications, 12(6), 555–562. 10.1007/BF00934777
  8. Murray, W. (1971). Analytical expressions for the eigenvalues and eigenvectors of the Hessian matrices of barrier and penalty functions. Journal of Optimization Theory and Applications, 7(3), 189–196. 10.1007/bf00932477
  9. Forsgren, A., Gill, P. E., & Wright, M. H. (2002). Interior methods for nonlinear optimization. SIAM Review, 44(4), 525–597. 10.1137/s0036144502414942
  10. Gill, P. E., & Wong, E. (2012). Sequential quadratic programming methods. In J. Lee & S. Leyffer (Eds.), Mixed Integer Nonlinear Programming (Vol. 154). Springer. 10.1007/978-1-4614-1927-3_6
  11. Nocedal, J., & Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer. 10.1007/978-0-387-40065-5
  12. Gill, P. E., Murray, W., & Saunders, M. A. (2005). SNOPT: An SQP algorithm for large-scale constrained optimization. SIAM Review, 47(1), 99–131. 10.1137/S0036144504446096
  13. Fletcher, R., & Leyffer, S. (2002). Nonlinear programming without a penalty function. Mathematical Programming, 91(2), 239–269. 10.1007/s101070100244
  14. Benson, H. Y., Vanderbei, R. J., & Shanno, D. F. (2002). Interior-point methods for nonconvex nonlinear programming: Filter methods and merit functions. Computational Optimization and Applications, 23(2), 257–272. 10.1023/a:1020533003783
  15. Fletcher, R., Leyffer, S., & Toint, P. (2006). A brief history of filter methods [ANL/MCS-P1372-0906]. Argonne National Laboratory. http://www.optimization-online.org/DB_FILE/2006/10/1489.pdf