Created by Yunhao Cao (Github@ToiletCommander) in Fall 2022 for UC Berkeley EECS127.
Reference notice: material highly and mostly derived from Prof. Ranade's lectures. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Why Optimization
- Control + Robotics
- Resource allocation problems
- Communications
- Machine Learning

🔥 Gauss used the least squares technique to predict where planets would appear in the sky.
Optimization Forms
General:
$$\min f_0(\vec{x}) \quad \text{subject to } f_i(\vec{x})\le b_i \text{ for } i=1,2,\dots,m$$
Notation:
- $\vec{x}$ — optimization variable, $\vec{x} \in \mathbb{R}^n$
- $\vec{x}^*$ — optimal point (optimizer)
- $\vec{x} \in$ feasible set
Least Squares
Find $\vec{x}$ such that $A\vec{x} \approx \vec{b}$ by minimizing $||A\vec{x}-\vec{b}||^2$.
It is easier to minimize the squared norm than the norm itself, because the squared norm is differentiable and convex (any local minimum is a global minimum).

🔥 $\text{Quadratic functions} \subset \text{Convex functions}$

Solution:
We can derive the solution either by setting the derivative of $||A\vec{x}-\vec{b}||^2$ to zero or by projecting $\vec{b}$ onto $Col(A)$, the column space of $A$.
Intuition for method 2 (projection): the projection of $\vec{b}$ onto $Col(A)$ forms a right triangle with $\vec{b}$, and perpendicular distances are the shortest.
We derive:
$$\begin{split}
& \text{we want } (A\vec{x}- \vec{b})=\vec{e} \\
& \text{where } \vec{e} \text{ must be orthogonal to all of the columns of } A \\
& A^{\top}\vec{e}=\vec{0} \\
& A^{\top}(A\vec{x}-\vec{b})=\vec{0} \\
& A^{\top}A\vec{x}=A^{\top}\vec{b} \\
& \vec{x}^{*}=(A^\top A)^{-1} A^\top \vec{b}
\end{split}$$
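A quick numerical sanity check of the normal-equations solution (a minimal sketch; the matrix `A` and vector `b` below are made-up illustration data, not from the notes):

```python
import numpy as np

# Made-up overdetermined system: 5 equations, 2 unknowns.
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
b = np.array([0.1, 0.9, 2.1, 2.9, 4.2])

# Normal-equations solution x* = (A^T A)^{-1} A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# NumPy's built-in least-squares solver should agree (it uses SVD internally).
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal, x_lstsq)
# The residual A x* - b is orthogonal to the columns of A:
print(A.T @ (A @ x_normal - b))  # ~ [0, 0]
```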
Linear Algebra Review
Vectors, Norms, Gram-Schmidt, Fundamental Theorem of Linear Algebra

Vector
$\vec{x} \in \mathbb{R}^n$

Norms
If we have a vector space $X$, then a function from $X \rightarrow \mathbb{R}$ is a norm provided that it satisfies:
- $\forall \vec{x} \in X, ||\vec{x}|| \ge 0$, and $||\vec{x}|| = 0 \iff \vec{x} = \vec{0}$
- Triangle inequality: $||\vec{x} + \vec{y}|| \le ||\vec{x}|| + ||\vec{y}||$
- Absolute homogeneity: $||\alpha \vec{x}|| = |\alpha| \cdot ||\vec{x}||$
Lp Norm
$$||\vec{x}||_p = \Big(\sum_i |x_i|^p\Big)^{1/p}, \quad 1 \le p < \infty$$
Extreme case of $p \rightarrow \infty$:
$$||\vec{x}||_\infty = \max_{i} |x_i|$$
TODO: Search proof for this
Intuition:
$$||\vec{x}||_\infty = \lim_{p \rightarrow \infty} \Big(\underbrace{(\max_i |x_i| )^{p}}_{\text{dominates}}+\sum_{i \ne \arg\max |x_i|} |x_i|^p\Big)^{1/p}$$

L0-Norm (Cardinality)
$$||\vec{x}||_0 = \sum_i \mathbb{I}\{x_i \ne 0\}$$
(Not a true norm: it is not absolutely homogeneous.)
L2-Norm (Euclidean Norm)
$$||\vec{x}||_2 = \sqrt{\sum_i x_i^2}$$
Cauchy–Schwarz Inequality
$$\langle\vec{x},\vec{y}\rangle=\vec{x}^{\top}\vec{y} = ||\vec{x}||_2 \, ||\vec{y}||_2 \cos \theta \le ||\vec{x}||_2 \, ||\vec{y}||_2$$
where $\theta$ is the angle between $\vec{x}$ and $\vec{y}$.
Hölder's Inequality
Generalization of Cauchy–Schwarz: for $p,q \ge 1$ such that $\frac{1}{p}+\frac{1}{q} = 1$,
$$|\vec{x}^\top \vec{y}| \le \sum_{i} |x_i y_i| \le ||\vec{x}||_p\,||\vec{y}||_q$$
Proof not in scope for this class.
First Optimization Problem
$$\max \vec{x}^\top \vec{y} \quad \text{s.t. } ||\vec{x}||_p \le 1, \quad \vec{y} \in \mathbb{R}^n \text{ is constant}$$
p = 1:
$$x_i = \begin{cases} \operatorname{sign} (y_i) &\text{if } i = \arg\max_i |y_i| \\ 0 &\text{otherwise} \end{cases}$$
Produces a sparse solution.
$$\max_{||\vec{x}||_1 \le 1} \vec{x}^\top \vec{y} = \max_i |y_i| = ||\vec{y}||_\infty$$
p = 2:
We choose the $\vec{x}$ on the unit sphere in the direction of $\vec{y}$, because we want $\cos \theta = 1$, so $\max_{||\vec{x}||_2 \le 1} \vec{x}^\top \vec{y} = ||\vec{y}||_2$.
$p = \infty$:
$$x_i = \begin{cases} 1 & y_i \ge 0 \\ -1 & y_i < 0 \end{cases}, \qquad \vec{x} = \operatorname{sign}(\vec{y})$$
$$\max_{||\vec{x}||_\infty \le 1} \vec{x}^\top \vec{y} = \sum_i |y_i| = ||\vec{y}||_1$$
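A small numerical check of the three cases above (a sketch with a made-up vector $\vec{y}$):

```python
import numpy as np

y = np.array([3.0, -1.0, 2.0])  # made-up example vector

# p = 1: put all weight on the largest-magnitude coordinate -> value ||y||_inf
i = np.argmax(np.abs(y))
x1 = np.zeros_like(y); x1[i] = np.sign(y[i])

# p = 2: unit vector in the direction of y -> value ||y||_2
x2 = y / np.linalg.norm(y)

# p = inf: sign vector -> value ||y||_1
xinf = np.sign(y)

print(x1 @ y, np.max(np.abs(y)))       # both 3.0
print(x2 @ y, np.linalg.norm(y))       # both ||y||_2
print(xinf @ y, np.sum(np.abs(y)))     # both 6.0
```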
Gram-Schmidt / Orthonormalization + QR Decomposition
We have a vector space $X$ and a given basis $\vec{a}_1, \dots, \vec{a}_n$. We can generate an orthonormal basis for the vector space:
$$\vec{v}_1 = \frac{\vec{a}_1}{||\vec{a}_1||_2}, \qquad \vec{v}_2=\frac{\vec{a}_2-proj_{\vec{v}_1}{\vec{a}_2}}{||\vec{a}_2-proj_{\vec{v}_1}{\vec{a}_2}||_2}, \qquad \vec{v}_k=\frac{\vec{a}_k - \sum_{i<k} proj_{\vec{v}_i}\vec{a}_k}{||\vec{a}_k - \sum_{i<k} proj_{\vec{v}_i}\vec{a}_k||_2}$$
where
$$proj_{\vec{v}_{to}} \vec{v}_{from} = (\vec{v}_{from} \cdot \hat{v}_{to}) \hat{v}_{to} =\frac{\vec{v}_{to} \cdot \vec{v}_{from}}{||\vec{v}_{to}||_2^2} \vec{v}_{to}$$

QR Decomposition
$A = QR$, where
$$A = \begin{bmatrix} \vec{a}_1 &\vec{a}_2 &\cdots &\vec{a}_n \end{bmatrix}, \qquad Q = \begin{bmatrix} \vec{q}_1 &\vec{q}_2 &\cdots &\vec{q}_n \end{bmatrix}, \qquad R = \begin{bmatrix}
r_{11} &r_{12} &\cdots &r_{1n} \\
0 &r_{22} &\cdots &r_{2n} \\
\vdots &\vdots &\ddots &\vdots \\
0 &0 &\cdots &r_{nn}
\end{bmatrix} \leftarrow \text{upper triangular matrix}$$
The $r_{ij}$ are taken from the Gram-Schmidt equations!
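A minimal Gram-Schmidt sketch, compared against `numpy.linalg.qr` (the signs of the columns may differ between the two, since QR is only unique up to sign):

```python
import numpy as np

def gram_schmidt(A):
    """Columns of A -> orthonormal columns of Q (classical Gram-Schmidt)."""
    Q = np.zeros_like(A, dtype=float)
    for k in range(A.shape[1]):
        v = A[:, k].copy()
        for i in range(k):
            v -= (Q[:, i] @ A[:, k]) * Q[:, i]   # subtract projections onto earlier q_i
        Q[:, k] = v / np.linalg.norm(v)
    return Q

A = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # made-up basis as columns
Q = gram_schmidt(A)
R = Q.T @ A                      # r_ij = <q_i, a_j>, upper triangular
print(np.allclose(Q @ R, A))     # True: A = QR
print(np.linalg.qr(A)[0])        # NumPy's QR for comparison (up to column signs)
```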
Fundamental Theorem of Linear Algebra
$$A \in \mathbb{R}^{m \times n}, \qquad N(A) \underbrace{\oplus}_{\text{direct sum}} R(A^{\top}) = \mathbb{R}^n$$
Any vector in $\mathbb{R}^n$ can be written as the sum of a component in the null space of $A$ and a component in the range space of $A^\top$. We can also say:
$$R(A)\oplus N(A^{\top}) = \mathbb{R}^m$$
To prove this, we use the Orthogonal Decomposition Theorem.

Orthogonal Decomposition Theorem (Thm 2.1)
Let $X$ be any general vector space and $S$ be a subspace. Then
$$\forall \vec{x} \in X, \quad \vec{x}=\vec{s}+\vec{r}, \quad \text{where } \vec{s} \in S,\ \vec{r} \in S^{\bot}$$
Note: $S^{\bot} = \{\vec{r} \mid \forall \vec{s} \in S, \langle\vec{r},\vec{s}\rangle = 0\}$. This can be summarized by
$$X = S \oplus S^{\bot}$$
Proof of Orthogonal Decomposition Theorem

Thanks to the ODT, we only need to show that
$$N(A) = R(A^{\top})^{\bot}$$
This means we need to show:
$$\begin{align} N(A) &\subseteq R(A^{\top})^{\bot} \\ R(A^{\top})^{\bot} &\subseteq N(A) \end{align}$$
To show (1):
Let $\vec{x} \in N(A)$; show $\vec{x} \in R(A^{\top})^{\bot}$.
$$\because \vec{x} \in N(A) \quad \therefore A\vec{x} = \vec{0}$$
We want to prove that $\forall \vec{w} \in R(A^{\top}), \langle\vec{x},\vec{w}\rangle=0$. Since $\vec{w} \in R(A^{\top})$, $\vec{w} = A^{\top} \vec{y}$ for some $\vec{y}$, so
$$\langle\vec{x},\vec{w}\rangle=\langle\vec{x},A^{\top}\vec{y}\rangle=\underbrace{\vec{x}^{\top}A^{\top}\vec{y}}_{\text{a scalar, so we can simply transpose it}}=\vec{y}^{\top}A\vec{x}=0$$
To show (2):
Let $\vec{x} \in R(A^{\top})^{\bot}$; show $\vec{x} \in N(A)$.
$$\because \vec{x} \in R(A^{\top})^{\bot} \quad \therefore \forall \vec{w} = A^{\top}\vec{y},\ \vec{y} \in \mathbb{R}^m: \ \langle\vec{x},\vec{w}\rangle=0$$
$$\langle\vec{x},A^{\top}\vec{y}\rangle =\vec{x}^{\top}A^{\top}\vec{y} = \vec{y}^{\top}A\vec{x} = 0, \quad \forall \vec{y} \in \mathbb{R}^m$$
A specific $A\vec{x}$ might be orthogonal to one particular $\vec{y}$, but $\vec{y}^{\top}A\vec{x}=0$ cannot hold for every $\vec{y}$ unless $A\vec{x}=\vec{0}$ (take $\vec{y}=A\vec{x}$). Therefore $\vec{x} \in N(A)$.
Diagonalization of Matrices
$$A = U\Lambda U^{-1}$$
Not all matrices are diagonalizable. A matrix is diagonalizable when, for each eigenvalue,
$$\text{Algebraic Multiplicity} = \text{Geometric Multiplicity}$$
Algebraic Multiplicity:
When finding eigenvalues of a matrix we find roots of the polynomial $\det(\lambda I - A)$; the "algebraic multiplicity" of an eigenvalue $\lambda_i$ is how many times $\lambda_i$ appears as a root of that polynomial.
Geometric Multiplicity (for $\lambda_i$):
$$\dim(N(\lambda_i I -A))$$
📌 Important property: $N(\lambda_i I - A) = \Phi_i$ is exactly the eigenspace of $A$ corresponding to eigenvalue $\lambda_i$.

Symmetric Matrices
$$A = A^{\top} \text{ or } A_{ij} = A_{ji} \leftrightarrow A \in S^n$$
e.g.
- Covariance matrices
- Graph Laplacians (matrices representing connectivity in a graph)

Properties:
1. Eigenvalues: $\forall \lambda_i, \lambda_i \in \mathbb{R}$
2. Eigenspaces for $\lambda_i \ne \lambda_j$ are orthogonal ($\Phi_i \bot \Phi_j$), where $\Phi_i = N(\lambda_i I - A)$
3. If $\mu_i$ is the algebraic multiplicity of $\lambda_i$, then $\dim(\Phi_i)=\mu_i$: geometric and algebraic multiplicities are always equal
4. (1)-(3) show $A$ is always diagonalizable: $A \in S^{n} \rightarrow A = U\Lambda U^{\top}$, where $U$ is orthonormal and $\Lambda$ is diagonal
Proof that symmetric matrices have eigenvalues with equal algebraic and geometric multiplicity

Spectral Theorem
$$A = U\Lambda U^{\top}$$
where $U$ is an orthonormal (unitary) matrix, $\Lambda$ is diagonal, and
$$U = \begin{bmatrix}
\underbrace{\begin{matrix} \vec{u}_1 &\vec{u}_2 &\cdots &\vec{u}_r \end{matrix}}_{\text{Range space } R(A)}
&\underbrace{\begin{matrix} \vec{u}_{r+1} &\cdots &\vec{u}_n \end{matrix}}_{\text{Null space } N(A)}
\end{bmatrix}$$
$$\Lambda = \begin{bmatrix}
\lambda_1 &0 &\cdots &0 &0 &\cdots &0 \\
0 &\lambda_2 &\cdots &0 &0 &\cdots &0 \\
\vdots &\vdots &\ddots &\vdots &\vdots &\ddots &\vdots \\
0 &0 &\cdots &\lambda_r &0 &\cdots &0 \\
0 &0 &\cdots &0 &0 &\cdots &0\\
\vdots &\vdots &\ddots &\vdots &\vdots &\ddots &\vdots \\
0 &0 &\cdots &0 &0 &\cdots &0
\end{bmatrix}$$
Note: the eigenvalues are ordered so that $(i<j) \implies (\lambda_i \ge \lambda_j)$, and for $0 \le i \le r$, $(\lambda_i, \vec{u}_i)$ are eigenvalue-eigenvector pairs of $A$.
We can also write
$$A = \sum_{i=1}^r \lambda_i \vec{u}_i\vec{u}_i^{\top}$$

Variational Characterization of Eigenvalues of a Symmetric Matrix (& Rayleigh Quotient)
Given $A \in S^n$,
$$r(\vec{x})\ (\text{Rayleigh quotient}) = \frac{\vec{x}^{\top}A\vec{x}}{\vec{x}^{\top}\vec{x}} = \frac{\vec{x}^{\top}A\vec{x}}{||\vec{x}||_2^2}$$
Important property:
$$\lambda_{\min}(A) \le r \le \lambda_{\max}(A)$$
and
$$\lambda_{\max}(A) = \max_{||\vec{x}||_2=1} \vec{x}^{\top}A\vec{x}, \qquad \lambda_{\min}(A) = \min_{||\vec{x}||_2=1} \vec{x}^{\top}A\vec{x}$$
Note:
$$\begin{split}
\vec{x}^{\top}A\vec{x}&=\vec{x}^{\top}U\Lambda U^{\top}\vec{x} \\
&=\vec{y}^{\top}\Lambda\vec{y}\quad (\vec{y}=U^{\top}\vec{x}) \\
&=\sum_{i=1}^r \lambda_i y_i^2 \\
&\quad \le \sum_{i=1}^r \lambda_{\max} y_i^2 = \lambda_{\max}||\vec{y}||_2^2 = \lambda_{\max}||\vec{x}||_2^2 \\
&\quad \ge \sum_{i=1}^r \lambda_{\min}y_i^2=\lambda_{\min}||\vec{y}||_2^2=\lambda_{\min} ||\vec{x}||_2^2
\end{split}$$
Note that both $U$ and $U^{\top}$ are orthonormal matrices, so they preserve norm. And it is clear which values of $\vec{x}$ to choose to achieve the max or min (the corresponding eigenvectors).
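A quick numerical illustration of the Rayleigh-quotient bounds (a sketch using a random symmetric matrix, not data from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                      # make a symmetric matrix

eigvals = np.linalg.eigvalsh(A)        # sorted ascending for symmetric matrices

x = rng.standard_normal(4)
rayleigh = x @ A @ x / (x @ x)

# lambda_min <= Rayleigh quotient <= lambda_max for any nonzero x
print(eigvals[0] <= rayleigh <= eigvals[-1])   # True
```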
PCA + SVD
Principal Component Analysis + Singular Value Decomposition

🔥 Idea: I have n-dimensional data, but the data seems to have some structure in a lower dimension. How do I extract it?

Goal of PCA:
Given data vectors $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n$, find $k$ orthonormal vectors $\vec{w}_i$ that minimize the projection error
$$Err = \frac{1}{n}\sum_{i=1}^n e_i^2$$
where
$$\begin{split}
e_i^2 &= ||\vec{x}_i - \sum_{j=1}^k \langle\vec{w}_j,\vec{x}_i\rangle\vec{w}_j ||^2 \\
&= ||\vec{x}_i||_2^2 - \sum_{j=1}^k \langle\vec{w}_j, \vec{x}_i\rangle^2
\end{split}$$
So our problem becomes
$$\begin{split}
\max_{\{\vec{w}_1, \cdots, \vec{w}_k\}} \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^k \langle\vec{w}_j,\vec{x}_i\rangle^2
&= \frac{1}{n}\sum_{j=1}^k||X\vec{w}_j||^2 \\
&= \frac{1}{n}\sum_{j=1}^k\Big(\vec{w}_j^{\top}\underbrace{X^{\top}X}_{\text{symmetric, use spectral theorem!}}\vec{w}_j\Big)
\end{split}$$
It's easier to prove things if we define the data as rows, so we proceed this way (everything can be flipped later).
$$\text{Data matrix } X = \begin{bmatrix} \vec{x}_1^{\top} \\ \vec{x}_2^{\top} \\ \vdots \\ \vec{x}_n^{\top} \end{bmatrix}, \qquad
C = \frac{1}{n}XX^{\top}=\frac{1}{n}
\begin{bmatrix}
||\vec{x}_1||^2 &\langle\vec{x}_1,\vec{x}_2\rangle &\cdots &\langle\vec{x}_1,\vec{x}_n\rangle \\
\langle\vec{x}_1,\vec{x}_2\rangle &||\vec{x}_2||^2 &\cdots &\langle\vec{x}_2,\vec{x}_n\rangle \\
\vdots &\vdots &\ddots &\vdots \\
\langle\vec{x}_1,\vec{x}_n\rangle &\langle\vec{x}_2,\vec{x}_n\rangle &\cdots &||\vec{x}_n||^2
\end{bmatrix}$$
Note: $XX^{\top} \in S^n$ and $C \in S^n$. We will also define
$$D = \frac{1}{n}X^{\top}X$$
which is symmetric as well.
The SVD decomposition of $A$ can be written as
$$A = \underbrace{U}_{m \times m} \underbrace{\Sigma}_{m \times n} \underbrace{V^{\top}}_{n \times n}$$
where the diagonal values of $\Sigma$ are called the singular values of $A$ — they are the square roots of the (common, nonzero) eigenvalues of both $A^{\top}A$ and $AA^{\top}$.
The orthonormal eigenvectors of $A^{\top}A$ are called the right singular vectors of $A$; they sit in the columns of $V$, ordered so that they correspond to the square roots of their eigenvalues in $\Sigma$.
The orthonormal eigenvectors of $AA^{\top}$ are called the left singular vectors of $A$; they sit in the columns of $U$, in the same order.
Proof of SVD

Graphical interpretation: $U, V \rightarrow$ rotation / reflection, $\Sigma \rightarrow$ scaling.
To understand this graphical interpretation for a general vector, decompose $\vec{x}$ in the basis given by the columns of $V$:
$$\vec{x}=\alpha_1 \vec{v}_1 + \alpha_2 \vec{v}_2 + \cdots + \alpha_r \vec{v}_r$$
Then $V^{\top}$ first projects $\vec{x}$ onto each right singular vector (eigenvector of $A^{\top}A$), $\Sigma$ scales those coordinates, and $U$ maps the result into the coordinates of the left singular vectors (eigenvectors of $AA^{\top}$).
We can write this in the compact form
$$A = \sum_{i=1}^r \sigma_i \vec{u}_i\vec{v}_i^{\top}$$

🔥 The singular value decomposition is usually not unique: whenever we have a repeated eigenvalue ($\lambda_i = \lambda_j, i \ne j$) we can order/choose the eigenbasis in different ways.
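A compact PCA-via-SVD sketch matching the row-data convention above (random made-up data; `k` is the number of principal directions kept):

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up data: 200 points in R^3 that mostly live near a 2-D subspace.
X = (rng.standard_normal((200, 2)) @ rng.standard_normal((2, 3))
     + 0.05 * rng.standard_normal((200, 3)))
X = X - X.mean(axis=0)                 # center the data first

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
W = Vt[:k].T                           # top-k right singular vectors = principal directions
Z = X @ W                              # k-dimensional representation of each point
X_hat = Z @ W.T                        # reconstruction from k components

print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))   # small average squared projection error
```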
the Eckart-Young-Mirsky Theorem

Vector Calculus
$f(\vec{x}): \mathbb{R}^n \rightarrow \mathbb{R}$. (Varaiya ⇒ the main theorem.)

Scalar Calculus Review
Say we have a function
$$f(x): \mathbb{R} \rightarrow \mathbb{R}$$
Then the derivative tells us the instantaneous rate of change of $f$ with respect to $x$.

Taylor's theorem
Let $x_0 \in \mathbb{R}$ be a fixed point. Then
$$f(x_0+\Delta x) = f(x_0) + \frac{df}{dx}\Big|_{x=x_0} \Delta x + \frac{1}{2} \frac{d^2f}{dx^2} \Big|_{x=x_0} (\Delta x)^2 + \cdots$$
We usually use Taylor's theorem as an approximation tool, and if we extend our understanding of calculus to vectors, we can make linear approximations using linear algebra!

Dimensions of Vector Gradients
$f(\vec{x}): \mathbb{R}^{n \times 1} \rightarrow \mathbb{R}$ and $\Delta \vec{x} \in \mathbb{R}^{n \times 1}$. Therefore the derivative (the transpose of the gradient) is
$$\frac{df}{d\vec{x}} = \nabla f |_{\vec{x}}^{\top} \in \mathbb{R}^{1 \times n}$$

Notion of Gradient
$\nabla f(\vec{x})$ captures the change with respect to every component of $\vec{x}$:
$$\nabla f(\vec{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} &\frac{\partial f}{\partial x_2} &\cdots &\frac{\partial f}{\partial x_n} \end{bmatrix}^{\top}$$

Notion of Hessian
$$\nabla^2 f(\vec{x})_{i,j} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
For sufficiently smooth functions the order of the partial derivatives can be interchanged (this is not true in general), so the Hessian is often symmetric.

Jacobian Matrix
The Jacobian matrix describes the derivative of a vector-valued function with respect to a vector: for $f(\vec{x}): \mathbb{R}^n \rightarrow \mathbb{R}^m$,
$$J ={\begin{bmatrix}{\dfrac {\partial \mathbf {f} }{\partial x_{1}}}&\cdots &{\dfrac {\partial \mathbf {f} }{\partial x_{n}}}\end{bmatrix}}={\begin{bmatrix}\nabla ^{\mathrm {T} }f_{1}\\\vdots \\\nabla ^{\mathrm {T} }f_{m}\end{bmatrix}}={\begin{bmatrix}{\dfrac {\partial f_{1}}{\partial x_{1}}}&\cdots &{\dfrac {\partial f_{1}}{\partial x_{n}}}\\\vdots &\ddots &\vdots \\{\dfrac {\partial f_{m}}{\partial x_{1}}}&\cdots &{\dfrac {\partial f_{m}}{\partial x_{n}}}\end{bmatrix}}$$

Taylor's Theorem for Vectors
$$f(\vec{x}_0+\Delta \vec{x}) = f(\vec{x}_0)+\nabla f |_{\vec{x}=\vec{x}_0}^{\top} (\Delta \vec{x}) + \frac{1}{2}(\Delta \vec{x})^{\top} \underbrace{\nabla^2 f|_{\vec{x} = \vec{x}_0}}_{\text{Hessian}} (\Delta \vec{x}) + \cdots$$
In practice you rarely use the higher-order terms because they are very hard to compute.
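A small numerical check of the second-order Taylor expansion (a sketch for a made-up function; the gradient and Hessian are written by hand):

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1] + np.exp(x[1])

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0] + np.exp(x[1])])

def hess_f(x):
    return np.array([[2.0, 3.0], [3.0, np.exp(x[1])]])

x0 = np.array([1.0, 0.5])
dx = np.array([1e-2, -2e-2])

taylor2 = f(x0) + grad_f(x0) @ dx + 0.5 * dx @ hess_f(x0) @ dx
print(f(x0 + dx), taylor2)   # agree up to roughly O(||dx||^3)
```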
The Main Theorem
Let $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$ be differentiable everywhere, and consider the optimization
$$\min f(\vec{x}) \quad \text{s.t. } \vec{x} \in \Omega$$
where $\Omega \subseteq \mathbb{R}^n$ is an open set (the boundaries are not included).
Then if $\vec{x}^*$ is an optimal solution of the optimization problem,
$$\nabla f(\vec{x}^*)=\vec{0}$$
Proof of the Main Theorem
Noise / Perturbation / Sensitivity Analysis
$$A\vec{x}=\vec{y}, \qquad \vec{y} \leftarrow \vec{y} +\vec{\delta_y} \ \text{ and, because of this, } \ \vec{x} \leftarrow \vec{x} + \vec{\delta_x}$$
We want to measure how sensitive our solution is to a perturbation in our measurement, i.e., how big $\vec{\delta_x}$ is — in particular
$$\frac{||\vec{\delta_x}||_2}{||\vec{x}||_2}$$
The problem becomes:
$$\begin{split}
A(\vec{x}+\vec{\delta_x}) &= \vec{y} + \vec{\delta_y} \\
A\vec{\delta_x} &= \vec{\delta_y} \\
\vec{\delta_x} &= A^{-1} \vec{\delta_y} \\
||\vec{\delta_x}||_2 &= ||A^{-1} \vec{\delta_y}||_2
\end{split}$$
Recall the L2 matrix norm
$$||A||_2 = \max_{||\vec{y}||_2 = 1} ||A\vec{y}||_2 = \max_{\vec{y} \ne \vec{0}} \frac{||A\vec{y}||_2}{||\vec{y}||_2}$$
Also:
$$||A\vec{x}||_2 = ||\vec{y}||_2, \qquad ||A||_2\, ||\vec{x}||_2 \ge ||\vec{y}||_2, \qquad ||\vec{x}||_2 \ge \frac{||\vec{y}||_2}{||A||_2}$$
Therefore
$$||\vec{\delta_x}||_2 = ||A^{-1} \vec{\delta_y}||_2 \le ||A^{-1}||_2\, ||\vec{\delta_y}||_2$$
Combining those two:
$$\frac{||\vec{\delta_x}||_2}{||\vec{x}||_2} \le ||A^{-1}||_2\, ||\vec{\delta_y}||_2\, \frac{||A||_2}{||\vec{y}||_2}
= (||A||_2)(||A^{-1}||_2)\, \frac{||\vec{\delta_y}||_2}{||\vec{y}||_2}$$
We know that $||A||_2 = \sigma_{max}(A)$ and $||A^{-1}||_2 = 1 / \sigma_{min}(A)$, so
$$\frac{||\vec{\delta_x}||_2}{||\vec{x}||_2} \le \underbrace{\frac{\sigma_{max}(A)}{\sigma_{min}(A)}}_{\text{condition number of } A} \frac{||\vec{\delta_y}||_2}{||\vec{y}||_2}$$
The condition number of a matrix is $\sigma_{max} / \sigma_{min}$.
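A small demonstration of how the condition number amplifies relative error (a sketch; the ill-conditioned matrix below is made up):

```python
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 1.0001]])   # nearly singular -> huge condition number
x = np.array([1.0, 1.0])
y = A @ x

dy = 1e-4 * np.array([1.0, -1.0])            # tiny perturbation of the measurement
dx = np.linalg.solve(A, y + dy) - x

rel_in = np.linalg.norm(dy) / np.linalg.norm(y)
rel_out = np.linalg.norm(dx) / np.linalg.norm(x)
kappa = np.linalg.cond(A, 2)                 # sigma_max / sigma_min

print(rel_in, rel_out, kappa)
print(rel_out <= kappa * rel_in)             # True: the bound from the notes holds
```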
Sensitivity analysis on least squares
$$\vec{x} = (A^{\top}A)^{-1}A^\top \vec{b}$$
Rewrite:
$$(A^\top A)\vec{x} = A^\top \vec{b}$$
Look at the condition number of $A^\top A$ (note that $\kappa(A^\top A) = \kappa(A)^2$, so least squares can be much more sensitive than the original system).
Minimum Norm Problem
System of equations:
$$A\vec{x} = \vec{b}$$
where $A \in \mathbb{R}^{m \times n}, \vec{x} \in \mathbb{R}^n, \vec{b} \in \mathbb{R}^m$.
- If $m \gg n$, we have an overdetermined system (generally no solution).
- If $m \ll n$, we have an underdetermined system (infinitely many solutions).
One common choice for the underdetermined case is to pick the minimum-energy solution:
$$\min ||\vec{x}||_2^2 \quad \text{s.t. } A\vec{x}=\vec{b}$$
So how can we optimize this? Key idea: components of $\vec{x}$ that are in $N(A)$ contribute nothing to $A\vec{x}$.
$$\vec{x} = \vec{y} + \vec{z}, \quad \vec{y} \in N(A),\ \vec{z} \in R(A^{\top})$$
$$A\vec{x} = A(\vec{y}+\vec{z})=\vec{0}+A\vec{z}=\vec{b}$$
$$||\vec{x}||_2^2 = ||\vec{y}||_2^2 + ||\vec{z}||_2^2 \quad \text{(Pythagorean theorem)}$$
so the optimum takes $\vec{y}=\vec{0}$. Since $\vec{z} \in R(A^\top)$, $\exists \vec{w}: \vec{z} = A^{\top} \vec{w}$, and
$$A\vec{z}=\vec{b} \rightarrow \underbrace{A A^{\top}}_{\text{invertible if $A$ has full row rank}} \vec{w} = \vec{b}$$
$$\vec{w} = (AA^{\top})^{-1} \vec{b} \rightarrow \vec{z} = A^{\top}(AA^{\top})^{-1}\vec{b}$$
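A quick check of the minimum-norm formula against NumPy's pseudoinverse (a sketch with a made-up wide matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 5))             # wide: 2 equations, 5 unknowns
b = rng.standard_normal(2)

x_min = A.T @ np.linalg.solve(A @ A.T, b)   # x = A^T (A A^T)^{-1} b
x_pinv = np.linalg.pinv(A) @ b              # pseudoinverse gives the same min-norm solution

print(np.allclose(A @ x_min, b))            # feasible
print(np.allclose(x_min, x_pinv))           # True
```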
Tikhonov Regularization
$$\min_{\vec{x}} ||W_1A\vec{x}-\vec{b}||_2^2+ ||W_2\vec{x}-\vec{x}_0||_2^2$$
where $W_1$ is a weight matrix and $W_2$, $\vec{x}_0$ are supplemental information.

🔥 Prof. Ranade also included MAP and MLE examples of this regularization technique (and the special forms of ridge regression and MSE). See my CS189 notes for those concepts.
Ridge Regression
Can we shift the eigenvalues so that least squares is less turbulent in response to changes in observations?
$$(A+\lambda I) \underbrace{\vec{v}_1}_{\text{eigenvector of } A} = A\vec{v}_1 + \lambda \vec{v}_1 = \lambda_1 \vec{v}_1 + \lambda \vec{v}_1$$
🧙🏽‍♂️ CS 189 also talks about this — basically least squares + an L2-norm penalty. The notation is a bit different though: CS189 uses $\lambda$ in the objective $f(x)$ while here we use $\lambda^2$.
Consider now the objective
$$\min ||A\vec{x} - \vec{b}||^2 + \lambda^2 ||\vec{x}||_2^2$$
(Figures copied from Prof. Shewchuk's lecture notes and slides.)
Another interpretation of the L2 penalty: we are basically augmenting the data matrix with $\lambda I$.
$$\nabla f(\vec{x}) = 2A^{\top}A\vec{x}- 2 A^\top\vec{b} + 2\lambda^2\vec{x}$$
Set this gradient to 0:
$$(A^{\top}A+\lambda^2I) \vec{x} = A^\top \vec{b}$$
So now:
$$\vec{x}^* = (A^{\top}A+\lambda^2 I)^{-1}A^\top \vec{b}$$
🔥 The matrix $A^{\top}A + \lambda^2I$ is always invertible, because every eigenvalue is now lower-bounded by $\lambda^2$.
If we let $\vec{x} = V\vec{z}$ and $A = U\Sigma V^\top$, remember we have:
$$\begin{split}
& \min ||A\vec{x} - \vec{b}||^2 + \lambda^2 ||\vec{x}||_2^2 \\
=& \min ||AV\vec{z}-\vec{b}||^2 + \lambda^2||\vec{z}||^2_2 \\
=& \min ||U\Sigma \vec{z} - \vec{b}||^2 + \lambda^2 ||\vec{z}||_2^2
\end{split}$$
(using $||\vec{x}||_2 = ||V\vec{z}||_2 = ||\vec{z}||_2$). Therefore
$$\begin{split}
\vec{x}^* &= V(\Sigma^\top \Sigma+\lambda^2I)^{-1} \Sigma^\top U^\top \vec{b} \\
&=V \begin{bmatrix} diag\big(\frac{\sigma_i}{\sigma_i^2 + \lambda^2}\big) & \vec{0} \end{bmatrix} U^\top \vec{b}
\end{split}$$
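A small check that the closed-form ridge solution matches the SVD form (a sketch; the data are made up and `lam` plays the role of $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 4))
b = rng.standard_normal(20)
lam = 0.7                                    # penalty weight is lam**2, matching the notes

# Closed form: (A^T A + lam^2 I)^{-1} A^T b
x_ridge = np.linalg.solve(A.T @ A + lam**2 * np.eye(4), A.T @ b)

# SVD form: V diag(sigma_i / (sigma_i^2 + lam^2)) U^T b
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((s / (s**2 + lam**2)) * (U.T @ b))

print(np.allclose(x_ridge, x_svd))           # True
```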
Lasso Regression
Formulation (first consider the pure $\ell_1$ objective):
$$\min_{\vec{x}} ||A\vec{x} - \vec{b}||_1$$
How do we solve this? Let $A\vec{x}-\vec{b} = \vec{e}$; then
$$\min_{A\vec{x} - \vec{e} = \vec{b}} ||\vec{e}||_1$$
Let's consider a problem with a similar formulation:
$$\min ||\vec{x}||_1 \quad \text{s.t. } A\vec{x} = \vec{b}$$
This is not differentiable everywhere. Suppose $A$ is wide with full row rank (so that there are infinitely many solutions).
Let $x_i = x_i^+ - x_i^-$ with $x_i^+, x_i^- \ge 0$; then $|x_i| = x_i^+ + x_i^-$. Our program becomes
$$\min \sum_{i=1}^n x_i^+ + \sum_{i=1}^n x_i^- \quad \text{s.t. } A(\vec{x}^+ - \vec{x}^-) = \vec{b}, \quad x_i^+ \ge 0,\ x_i^- \ge 0$$
Claim: at an optimal solution of this new program, at most one of $x_i^+, x_i^-$ is nonzero.
Suppose $x_i^+ > 0, x_i^- > 0$ and consider $x_i^+ - \epsilon, x_i^- - \epsilon$: it is still feasible (the difference $x_i^+ - x_i^-$ is unchanged) and we strictly decrease the objective — so at the optimum only one of $x_i^+, x_i^-$ is nonzero, and the objective really equals $||\vec{x}||_1$.
Another problem:
$$\min_{\vec{x}} \sum_{i=1}^n ||\vec{x} - \vec{b}_i||_1$$
This is also solvable with an LP.
Lasso Regression
$$\min_{\vec{x}} ||A\vec{x}-\vec{b}||_2^2 + \lambda ||\vec{x}||_1$$
Nice property: encourages sparsity.
Lasso can be reformulated as a quadratic program; subgradient descent, least-angle regression (LARS), and forward stagewise algorithms can be used to solve Lasso.
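The notes don't give an algorithm here, but one standard solver (not among the methods named above, so treat this as a supplementary sketch) is proximal gradient descent / ISTA, which alternates a gradient step on the squared loss with soft-thresholding for the $\ell_1$ term:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, b, lam, n_iters=500):
    """Minimize ||Ax - b||_2^2 + lam * ||x||_1 by proximal gradient (ISTA)."""
    x = np.zeros(A.shape[1])
    eta = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)   # step size from the Lipschitz constant
    for _ in range(n_iters):
        grad = 2 * A.T @ (A @ x - b)              # gradient of the smooth part
        x = soft_threshold(x - eta * grad, eta * lam)
    return x

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10); x_true[[1, 6]] = [2.0, -3.0]    # sparse ground truth (made up)
b = A @ x_true + 0.01 * rng.standard_normal(50)
print(np.round(lasso_ista(A, b, lam=1.0), 2))          # mostly zeros, nonzero at 1 and 6
```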
Total Least Squares
What if, in least squares, both $X$ and $\vec{y}$ have perturbations? $(X+\tilde{X})\vec{w} = \vec{y} + \vec{\tilde{y}}$. We can rewrite the problem as
$$\underbrace{\begin{bmatrix} X+\tilde{X} &\vec{y}+\vec{\tilde{y}} \end{bmatrix}}_{\tilde{Z}}
\begin{bmatrix} \vec{w} \\ -1 \end{bmatrix} = \vec{0}$$
We know $\tilde{Z}$ has rank at most $n$ (it has a nontrivial null space). If we let $Z = \begin{bmatrix} X &\vec{y} \end{bmatrix}$, then the optimization problem becomes (an Eckart-Young problem)
$$\min_{rk(\tilde{Z}) \le n} ||Z - \tilde{Z}||_F$$
Low-rank Approximation (SVD)
Is there a way to store only the most important parts of a matrix? ⇒ SVD. But there are multiple perspectives on what the "parts" are:
- Matrix as an operator
- Matrix as a chunk of data
So we can define matrix norms accordingly.
"Frobenius norm" ⇒ as a chunk of data:
$$||A||_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{trace(A^{\top}A)}$$
🔥 Frobenius norms are invariant to orthonormal transforms:
$$||UA||_F = ||AU||_F=||A||_F$$
👆 TODO: Prove this as an exercise!
Frobenius Norm Invariant to Orthonormal Transform Proof
"L2 norm / spectral norm" ⇒ as an operator ← the max scaling of a vector:
$$||A||_2 = \max_{||\vec{x}||_2=1} ||A\vec{x}||_2 = \sqrt{\lambda_{max} (A^{\top}A)} = \sigma_{max} (A)$$
the Eckart-Young-Mirsky Theorem
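A short illustration of the best rank-$k$ approximation via truncated SVD, which is what the Eckart-Young-Mirsky theorem guarantees (a sketch; the matrix is random):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # keep only the top-k singular triplets

# Eckart-Young: the approximation error is governed by the discarded singular values.
print(np.linalg.norm(A - A_k, 2), s[k])                            # spectral error = sigma_{k+1}
print(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:]**2)))   # Frobenius error
```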
Gradient Descent
Primarily for unconstrained optimization problems:
$$p^* = \min_{\vec{x} \in \mathbb{R}^n} f_0(\vec{x})$$
We want to find $p^*$ and $\vec{x}^*$.
"Guess and get better": the best local direction of improvement is the negative gradient direction.
$$\vec{x}_{k+1} = \vec{x}_k - \underbrace{\eta}_{\text{step size}} \nabla f(\vec{x}_k)$$
Usually SGD with a constant step size is not going to converge, because each data point pulls the parameters in different directions ⇒ but with a time-dependent decreasing step size it is more likely to converge.
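A bare-bones gradient descent loop on a simple quadratic (a sketch; the objective and step size below are made up):

```python
import numpy as np

# f(x) = 1/2 x^T Q x - c^T x, with gradient Q x - c and minimizer Q^{-1} c.
Q = np.array([[3.0, 0.5], [0.5, 1.0]])
c = np.array([1.0, -2.0])

def grad(x):
    return Q @ x - c

x = np.zeros(2)
eta = 0.1                        # constant step size, small enough for this Q
for _ in range(200):
    x = x - eta * grad(x)

print(x, np.linalg.solve(Q, c))  # gradient descent iterate vs. exact minimizer
```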
Projected GD
$$\min_{\vec{x} \in X} f(\vec{x})$$
where $f(x)$ is $\beta$-smooth. So we do the following:
$$\vec{x}_{k+1} = \Pi_X (\vec{x}_k - \eta \nabla f(\vec{x}_k))$$
where
$$\Pi_X(\vec{y}) = \operatorname*{argmin}_{\vec{v} \in X} ||\vec{y} - \vec{v}||_2^2$$
Conditional GD / Frank-Wolfe
Let $\gamma_k$ be a predetermined sequence.
$$\vec{y}_k = \operatorname*{argmin}_{\vec{y} \in X} \nabla f(\vec{x}_k)^{\top} \vec{y}$$
Notice that we pick the feasible point whose inner product with the gradient is as negative as possible — instead of stepping along the negative gradient and projecting, we move toward the point of $X$ that best aligns with the descent direction.
$$\begin{split}
\vec{x}_{k+1} &= (1-\gamma_k) \vec{x}_k + \gamma_k \vec{y}_k \\
&=\vec{x}_k + \gamma_k(\vec{y}_k - \vec{x}_k)
\end{split}$$
This has a nice sparsity property. We are basically replacing the projection with a point of the set that is most aligned with the negative gradient direction.
Newton's Method
Approximate the function as a quadratic and descend to the lowest point of the quadratic estimate:
$$f(\vec{x} + \vec{v}) = \underbrace{f(\vec{x}) + \nabla f(\vec{x})^\top \vec{v} + \frac{1}{2} \vec{v}^\top \nabla^2 f(\vec{x}) \vec{v}}_{\text{quadratic approximation}} + \cdots$$
Suppose $\vec{x}_0$ is the initial state. For a general quadratic
$$\frac{1}{2} x^\top H x + c^\top x + d, \qquad x^* = -H^{-1} c$$
so the best step $\vec{v}$ to minimize the quadratic approximation is
$$\vec{v} = - (\nabla^2 f(\vec{x}))^{-1} \nabla f(\vec{x})$$
So the Newton step is:
$$\vec{x}_{k+1} = \vec{x}_k - (\nabla^2 f(\vec{x}_k))^{-1} \nabla f(\vec{x}_k)$$
What if $H$ is singular?
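A minimal Newton's-method sketch on a made-up smooth, strictly convex function in two variables (gradient and Hessian coded by hand):

```python
import numpy as np

# f(x) = exp(x0) + exp(-x1) + x0^2 + x1^2  (smooth and strictly convex)
def grad(x):
    return np.array([np.exp(x[0]) + 2 * x[0], -np.exp(-x[1]) + 2 * x[1]])

def hess(x):
    return np.diag([np.exp(x[0]) + 2.0, np.exp(-x[1]) + 2.0])

x = np.array([2.0, -2.0])
for _ in range(10):
    x = x - np.linalg.solve(hess(x), grad(x))   # Newton step: solve H v = grad, don't invert

print(x, grad(x))   # the gradient is ~0 at the returned point
```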
(Figure: black = gradient descent vs. blue = Newton's method.)
Comparison to other methods:
Coordinate Descent
Given an optimization objective
$$\min_{\vec{x}} f_0(\vec{x})$$
coordinate descent updates one coordinate per time step. Let $\vec{x} \in \mathbb{R}^n$ and cycle through the coordinates:
$$(x_{t+1})_i = \begin{cases}
\operatorname*{argmin}_{x_i} f_0(\vec{x}) &\text{if coordinate } i \text{ is the one selected at step } t+1 \\
(x_{t})_i &\text{otherwise}
\end{cases}$$
Partitioning Problem
$$\min \vec{x}^\top W \vec{x}, \quad W \in \mathbb{S}^n \qquad \text{s.t. } \forall i \in [1,n],\ x_i^2 = 1$$
Not a convex problem ⇒ the feasible set $\{\vec{x} : x_i^2 = 1\ \forall i\} = \{-1,+1\}^n$ is not convex.
Write out the Lagrangian:
$$\begin{split}
L(\vec{x}, \vec{\nu}) &= \vec{x}^\top W \vec{x} +\sum_{i=1}^n \nu_i (x_i^2 - 1) \\
&=\vec{x}^\top W \vec{x} + \vec{x}^\top diag(\vec{\nu}) \vec{x} - \sum_{i=1}^n \nu_i \\
&=\vec{x}^\top (W+diag(\vec{\nu})) \vec{x} - \sum_{i=1}^n \nu_i
\end{split}$$
$$\begin{split}
g(\vec{\nu}) &= \min_{\vec{x}} L(\vec{x}, \vec{\nu}) \\
&= \begin{cases}
-\sum_{i=1}^n \nu_i &\text{if } W + diag(\vec{\nu}) \succeq 0 \\
-\infty &\text{otherwise (choose an infinitely large $\vec{x}$ in a negative-eigenvalue direction)}
\end{cases}
\end{split}$$
So the dual problem is a semidefinite program (SDP):
$$\max -\sum_{i=1}^n \nu_i \quad \text{s.t. } W + diag(\vec{\nu}) \succeq 0$$
Taking $\nu_i = -\lambda_{\min}(W)$ for every $i$ is dual feasible, so
$$p^* \ge d^* \ge n \cdot \lambda_{\min}(W)$$
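A brute-force check of the lower bound $p^* \ge n\,\lambda_{\min}(W)$ on a tiny instance (a sketch; $W$ is random and small enough to enumerate every $\pm 1$ vector):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(6)
B = rng.standard_normal((4, 4))
W = (B + B.T) / 2                          # random symmetric W

n = W.shape[0]
p_star = min(np.array(x) @ W @ np.array(x) for x in product([-1, 1], repeat=n))
bound = n * np.linalg.eigvalsh(W)[0]       # n * lambda_min(W)

print(p_star, bound, p_star >= bound)      # True
```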
Convexity
Convex Combination
$\sum_{i=1}^n \lambda_i \vec{x}_i$ is a convex combination of the $\vec{x}_i$ if $\lambda_i \ge 0$ and $\sum_{i=1}^n \lambda_i = 1$.

Convex Set
A set $C$ is convex if the line segment joining any two points in the set is contained in the set, OR: a set $C$ is convex if any convex combination of points in the set is contained in the set.
Exercise: prove that the set of PSD symmetric matrices is convex.

Separating Hyperplane Theorem
If there exist two disjoint convex sets $C, D$, then there exists a hyperplane $\vec{a}^\top \vec{x} = b$ that separates the two sets.
Proof of Separating Hyperplane Theorem

Convex Function ("Bowl")
$f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if the domain of $f$ is a convex set and $f$ satisfies Jensen's inequality:
$$f(\theta \vec{x} + (1-\theta)\vec{y})\le \theta f(\vec{x})+(1-\theta)f(\vec{y})$$
Epigraph:
$$\text{Epi }f = \{(x,t) : x \in \text{Domain}(f),\ f(x) \le t\}$$
Property: $f$ is a convex function if and only if $\text{Epi }f$ is a convex set.

First-order conditions
Let the function $f$ be differentiable. Then $f$ is convex if and only if
$$\forall \vec{x}, \vec{y} \in \text{Domain}(f),\quad f(\vec{y}) \ge f(\vec{x}) + \nabla f(\vec{x})^\top (\vec{y}-\vec{x})$$
"The tangent line is always below the function."
If $\nabla f(\vec{x}^*) = 0$, then $f(\vec{y}) \ge f(\vec{x}^*)+0^\top(\vec{y}-\vec{x}^*)$ ⇒ $\vec{x}^*$ is a global minimum.
Proof of First-order condition for convex functions
Second-order conditions
Let the convex function $f$ have a convex domain $Dom(f)$ and be twice differentiable. Then
$$\nabla^2 f(\vec{x}) \succeq 0$$

Strict Convexity
Strictly convex → convex. $Dom(f)$ should be convex.
Zero-order condition ($\theta \in (0,1)$):
$$\forall x,y \in Dom(f),\ f(\theta\vec{x} + (1-\theta)\vec{y}) < \theta f(\vec{x}) + (1-\theta) f(\vec{y})$$
First-order condition:
$$f(\vec{y}) > f(\vec{x})+\nabla f(\vec{x})^\top (\vec{y}-\vec{x})$$
Second-order condition:
$$\nabla^2 f(\vec{x}) \succ 0$$
Strong Convexity
Strongly convex → strictly convex → convex.
$f$ is differentiable and $Dom(f)$ is convex. Then $f$ is $\mu$-strongly convex ($\mu > 0$) if
$$\forall \vec{x} \ne \vec{y} \in Dom(f): \quad f(\vec{y}) \ge f(\vec{x})+\nabla f(\vec{x})^\top(\vec{y}-\vec{x})+\frac{\mu}{2}||\vec{y} - \vec{x}||^2$$
Note the second-order Taylor approximation:
$$f(\vec{y}) \approx f(\vec{x}) + \nabla f(\vec{x})^\top (\vec{y}-\vec{x}) + \frac{1}{2}(\vec{y} - \vec{x})^\top \nabla^2 f(\vec{x})(\vec{y} - \vec{x})$$
If we look at the term $\frac{\mu}{2} ||\vec{y} - \vec{x}||^2$, we can rewrite it as
$$\frac{1}{2}(\vec{y}-\vec{x})^\top diag(\mu, \mu, \dots, \mu) (\vec{y}-\vec{x})$$
So we are effectively asking the Hessian to satisfy
$$\nabla^2 f(x) \succeq \mu I$$
i.e., we are lower-bounding the Hessian at every point.

L-Smooth
L-smoothness is like the upper-bound counterpart of strong convexity: strong convexity lower-bounds the function by a quadratic bowl at every point, while L-smoothness upper-bounds it by a quadratic bowl.
Convex Optimization
$$p^* = \min f_0(\vec{x}) \quad \text{s.t. } \forall i\in [1,m],\ f_i(\vec{x}) \le 0$$
assuming $f_0(\vec{x})$ is convex and $\forall i, f_i(\vec{x})$ is convex.

Problem Transformations
Addition of "slack variables" ⇒ "epigraph" reformulation:
$$\min_{\vec{x} \in X} f_0(\vec{x}) \rightarrow \min_{t:\ f_0(\vec{x}) \le t,\ \vec{x} \in X} t$$

Linear Program
$$\min \vec{c}^\top \vec{x} \quad \text{s.t. } A\vec{x} = \vec{b},\ P\vec{x} \le \vec{q}$$
Theorem:
If we solve $\min_{\vec{x} \in X} \vec{c}^\top \vec{x}$ over a closed convex set $X$ (with $\vec{c} \ne \vec{0}$), then
$$\vec{x}^* \in Boundary(X)$$
Proof:
Assume $\vec{x}^*$ is in the interior of the set $X$. Then there exists some ball of radius $r > 0$ around it such that $Ball \subseteq X$:
$$\forall \vec{z}:\ ||\vec{z}||_2 \le r \rightarrow \vec{x}^* + \vec{z} \in X$$
Consider $\vec{z} = -\alpha \vec{c}$ with $\alpha = \frac{r}{||\vec{c}||_2}$ (so $||\vec{z}||_2 = r$), and the point $\vec{x}^* + \vec{z}$:
$$f_0(\vec{x}^*+\vec{z}) = \vec{c}^\top \vec{x}^* - \alpha \vec{c}^\top \vec{c} < f_0(\vec{x}^*)$$
Contradiction!
Minimax Inequality
With sets $X, Y$ and $F$ being any function,
$$\min_{x \in X} \max_{y \in Y} F(x,y) \ge \max_{y \in Y} \min_{x \in X} F(x,y)$$
Proof of Minimax Inequality

Lagrangian as a Dual
We have optimization problems of the form
$$p^* = \min f_0(\vec{x}) \quad \text{s.t. } \forall i \in [1,m],\ f_i(\vec{x}) \le 0; \quad \forall i \in [1,p],\ h_i(\vec{x}) = 0$$
Define the Lagrangian function:
$$L(\vec{x},\vec{\lambda}, \vec{\nu}) = f_0(\vec{x}) + \sum_{i=1}^m \lambda_i f_i(\vec{x}) + \sum_{i=1}^p \nu_i h_i(\vec{x})$$
So, the Lagrangian (dual function) problem:
$$g(\vec{\lambda},\vec{\nu}) = \min_{\vec{x}} L(\vec{x},\vec{\lambda},\vec{\nu}), \quad \text{with } \vec{\lambda} \ge 0$$
Properties of $g$:
- $g$ is a function of only $\vec{\lambda}, \vec{\nu}$
- $L(\vec{x}, \vec{\lambda}, \vec{\nu})$ is an affine function of $\vec{\lambda}, \vec{\nu}$; affine functions are both convex and concave
- What can we say about $g$? Fact: a pointwise max of convex functions is convex; a pointwise min of concave functions is concave
- $g$ is a pointwise min of concave functions: with $K_{\vec{x}_i}(\vec{\lambda},\vec{\nu}) = L(\vec{x}_i, \vec{\lambda}, \vec{\nu})$, we have $g(\vec{\lambda}, \vec{\nu}) = \min \{\forall i,\ K_{\vec{x}_i}(\vec{\lambda},\vec{\nu}) \}$
- Therefore $g$ is a concave function of $\vec{\lambda}, \vec{\nu}$
- $g(\vec{\lambda}, \vec{\nu})$ is a lower bound on the primal optimum $p^*$
Proof of $g$ in the Lagrangian problem as a lower bound of the true objective

Dual problem:
$$\max_{\vec{\lambda} \ge 0, \vec{\nu}} g(\vec{\lambda}, \vec{\nu}) = d^* \rightarrow \text{Dual Problem}$$
Maximization of a concave function with linear constraints ⇒ a convex program.
Properties:
- Number of dual variables = number of constraints of the primal
- Always a convex problem, even if the primal is not!
- $d^*$ may not always give you the optimal value of your primal, but it is always a lower bound: $d^* \le p^*$
Weak duality: $d^* \le p^*$ always holds; we call $p^* - d^*$ the duality gap.
Strong duality: $p^* = d^*$ — sometimes holds, sometimes doesn't.
Minmax in Lagrangian
Consider
$$\begin{split}
\max_{\vec{\lambda} \ge 0, \vec{\nu}} L(\vec{x}, \vec{\lambda}, \vec{\nu}) &= \max_{\vec{\lambda} \ge 0, \vec{\nu}} f_0(\vec{x}) + \sum_{i=1}^m \lambda_i f_i(\vec{x}) + \sum_{i=1}^p \nu_i h_i(\vec{x}) \\
&= \begin{cases}
f_0(\vec{x}) &\text{if $\vec{x}$ is feasible} \\
\infty &\text{otherwise}
\end{cases}
\end{split}$$
Note
$$p^* = \min_{\vec{x}} \max_{\vec{\lambda} \ge 0, \vec{\nu}} L(\vec{x}, \vec{\lambda}, \vec{\nu})$$
Because of the minimax inequality,
$$d^* = \max_{\vec{\lambda} \ge 0, \vec{\nu}} \min_{\vec{x}} L(\vec{x}, \vec{\lambda}, \vec{\nu}) \le \min_{\vec{x}} \max_{\vec{\lambda} \ge 0, \vec{\nu}} L(\vec{x}, \vec{\lambda}, \vec{\nu}) = p^*$$
Strong Duality / Slater's Condition
Strong duality holds if $\exists x_0$ such that $f_i(x_0) < 0\ \forall i$ ("strictly feasible") and the primal problem is convex.

Refined Slater's Condition
Assume the problem is convex, and assume the constraints $f_1, f_2, \dots, f_k$ are affine functions (the other constraints $f_{k+1}, \dots, f_m$ are not affine). Then strong duality holds if
$$\exists \vec{x}_0: \quad \forall i \in [1,k]\ f_i(\vec{x}_0) \le 0 \ \wedge\ \forall i \in [k+1,m]\ f_i(\vec{x}_0) < 0$$
Dual of LP
Suppose the problem:
$$\min_{A\vec{x} \le \vec{b}} \vec{c}^\top \vec{x}$$
Then
$$L(\vec{x}, \vec{\lambda}) = \vec{c}^\top \vec{x} + \vec{\lambda}^\top(A\vec{x}-\vec{b}) = (A^\top \vec{\lambda} + \vec{c})^\top \vec{x} - \vec{b}^\top\vec{\lambda}$$
Therefore
$$\begin{split}
g(\vec{\lambda}) &= \min_{\vec{x}} L(\vec{x}, \vec{\lambda}) \\
&= \begin{cases}
-\infty &\text{if } A^\top \vec{\lambda}+\vec{c} \ne \vec{0} \\
-\vec{b}^\top \vec{\lambda} &\text{if } A^\top \vec{\lambda}+\vec{c} = \vec{0}
\end{cases}
\end{split}$$
So the dual is (and strong duality holds for any LP):
$$\max_{\vec{\lambda} \ge 0,\ A^\top \vec{\lambda} + \vec{c} =0} -\vec{b}^\top \vec{\lambda} = d^*$$
We know that for any feasible point in the dual problem, the optimum of the primal function is always lower-bounded by the dual objective
p ∗ ≥ g ( λ ⃗ 1 , ν ⃗ 1 ) p^* \ge g(\vec{\lambda}_1, \vec{\nu}_1) p ∗ ≥ g ( λ 1 , ν 1 ) Then if we found a point x ⃗ 1 \vec{x}_1 x 1 in a primal feasible set,
f_0(\vec{x}_1)-p^* \le f_0(\vec{x}_1) - g(\vec{\lambda}_1, \vec{\nu}_1)
So when we've found a primal feasible point whose objective is within ϵ of the dual objective g(λ_1, ν_1), we know it is at most ϵ away from the primal optimal value ⇒ this is the idea behind, e.g., dual ascent.
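To make the certificate idea concrete, here is a minimal numerical sketch (assuming numpy and scipy are available; the matrices A, b, c are made-up illustration data, not from the lecture). It solves an LP min_{Ax ≤ b} c^⊤x together with the dual max_{λ ≥ 0, A^⊤λ + c = 0} −b^⊤λ derived above and checks that the duality gap is zero, so the dual optimum certifies the primal optimum.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative data (not from the lecture): minimize x1 + x2
# subject to x1 >= 0, x2 >= 0, x1 + x2 >= 1, written as A x <= b.
A = np.array([[-1.0, 0.0],
              [0.0, -1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.0, -1.0])
c = np.array([1.0, 1.0])

# Primal: min c^T x  s.t.  A x <= b  (x is free, so override scipy's default x >= 0 bounds).
primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)

# Dual: max -b^T lam  s.t.  A^T lam + c = 0, lam >= 0,
# i.e. minimize b^T lam with equality constraints A^T lam = -c.
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 3)

p_star = primal.fun          # primal optimal value
d_star = -dual.fun           # dual optimal value (flip sign back to a max)
print(p_star, d_star)        # both should be 1.0 -> zero duality gap for this LP
```

Note that any dual feasible point, not just the dual optimum, already gives a valid lower bound on p*.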
Complementary Slackness Original Problem:
p ∗ = min f 0 ( x ⃗ ) s.t. f i ( x ⃗ ) ≤ 0 ( ∀ i , 1 ≤ i ≤ m ) h i ( x ⃗ ) = 0 ( ∀ i , 1 ≤ i ≤ p ) p^* = \min f_0(\vec{x}) \\
\text{s.t.} \\
f_i(\vec{x}) \le 0 (\forall i, 1 \le i \le m) \\
h_i(\vec{x}) = 0 (\forall i, 1 \le i \le p) p ∗ = min f 0 ( x ) s.t. f i ( x ) ≤ 0 ( ∀ i , 1 ≤ i ≤ m ) h i ( x ) = 0 ( ∀ i , 1 ≤ i ≤ p ) Consider primal optimal x ⃗ ∗ \vec{x}^* x ∗ and dual optimal ( λ ⃗ ∗ , ν ⃗ ∗ ) (\vec{\lambda}^*, \vec{\nu}^*) ( λ ∗ , ν ∗ )
Assume strong duality holds d ∗ = g ( λ ⃗ ∗ , ν ⃗ ∗ ) = p ∗ = f 0 ( x ⃗ ∗ ) d^* = g(\vec{\lambda}^*, \vec{\nu}^*) = p^* = f_0(\vec{x}^*) d ∗ = g ( λ ∗ , ν ∗ ) = p ∗ = f 0 ( x ∗ )
Note:
g ( λ ⃗ ∗ , ν ⃗ ∗ ) = min x ⃗ ( f 0 ( x ⃗ ) + ∑ i = 1 m λ i ∗ f i ( x ⃗ ) + ∑ i = 1 p ν i ∗ h i ( x ⃗ ) ) ≤ f 0 ( x ⃗ ∗ ) + ∑ i = 1 m λ i ∗ f i ( x ⃗ ∗ ) undefined ≤ 0 + ∑ i = 1 p ν i ∗ h i ( x ⃗ ∗ ) undefined = 0 ≤ f 0 ( x ⃗ ∗ ) + 0 undefined because ( λ ⃗ ∗ , ν ⃗ ∗ ) maximizes g ( ⋯ ) + 0 \begin{split}
g(\vec{\lambda}^*, \vec{\nu}^*)
&= \min_{\vec{x}} (f_0(\vec{x}) + \sum_{i=1}^m \lambda_i^* f_i(\vec{x}) + \sum_{i=1}^p \nu_i^* h_i(\vec{x})) \\
&\le f_0(\vec{x}^*) + \underbrace{\sum_{i=1}^m \lambda_i^* f_i(\vec{x}^*)}_{\le 0} + \underbrace{\sum_{i=1}^p \nu_i^* h_i(\vec{x}^*)}_{= 0} \\
&\le f_0(\vec{x}^*) + \underbrace{0}_{\mathclap{\text{because $(\vec{\lambda}^*, \vec{\nu}^*)$ maximizes $g(\cdots)$}}} + 0
\end{split}
But we assumed strong duality: p* = d*, i.e. g(λ*, ν*) = f_0(x*).
This forces the inequalities above to become equalities!
g ( λ ⃗ ∗ , ν ⃗ ∗ ) = f 0 ( x ⃗ ∗ ) + ∑ i = 1 m λ i ∗ f i ( x ⃗ ∗ ) + ∑ i = 1 p ν i ∗ h i ( x ⃗ ∗ ) = f 0 ( x ⃗ ∗ ) + 0 + 0 \begin{align}
g(\vec{\lambda}^*, \vec{\nu}^*)
&= f_0(\vec{x}^*) + \sum_{i=1}^m \lambda_i^* f_i(\vec{x}^*) + \sum_{i=1}^p \nu_i^* h_i(\vec{x}^*) \\
&= f_0(\vec{x}^*) + 0 + 0
\end{align}
We've shown that:
min x ⃗ L ( x ⃗ , λ ⃗ ∗ , ν ⃗ ∗ ) = L ( x ⃗ ∗ , λ ⃗ ∗ , ν ⃗ ∗ ) \min_{\vec{x}} L(\vec{x},\vec{\lambda}^*,\vec{\nu}^*) = L(\vec{x}^*, \vec{\lambda}^*, \vec{\nu}^*) x min L ( x , λ ∗ , ν ∗ ) = L ( x ∗ , λ ∗ , ν ∗ ) x ⃗ ∗ \vec{x}^* x ∗ is an minimizer of the lagrangian (not necessarily unique)Also:
∑ i = 1 m λ i ∗ f i ( x ⃗ ∗ ) = 0 ∀ i ∈ [ 1 , m ] , λ i ∗ ⋅ f i ( x ⃗ ∗ ) = 0 \sum_{i=1}^m \lambda_i^* f_i(\vec{x}^*) = 0 \\
\forall i \in [1,m], \lambda_i^* \cdot f_i(\vec{x}^*) = 0 i = 1 ∑ m λ i ∗ f i ( x ∗ ) = 0 ∀ i ∈ [ 1 , m ] , λ i ∗ ⋅ f i ( x ∗ ) = 0 Therefore
If λ i ∗ > 0 \lambda_i^* > 0 λ i ∗ > 0 , then f i ( x ⃗ ∗ ) = 0 f_i(\vec{x}^*) = 0 f i ( x ∗ ) = 0 There’s a value to each extra resource If f i ( x ⃗ ∗ ) < 0 f_i(\vec{x}^*) < 0 f i ( x ∗ ) < 0 ⇒ λ i ∗ = 0 \lambda_i^* = 0 λ i ∗ = 0 Condition has slack, not gonna pay money for more 👆
Those two conditions are called “complementary slackness”
Note that we did not use convexity in deriving this, only assumed strong duality.
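As a tiny self-contained illustration of complementary slackness (a sketch using two made-up one-dimensional problems, not from the lecture): when the constraint is active at the optimum the multiplier can be positive, and when the constraint has slack the multiplier must be zero.

```python
import numpy as np

# Problem A: min x^2  s.t.  f1(x) = 1 - x <= 0.
# The unconstrained minimizer x = 0 is infeasible, so the constraint is active:
# x* = 1, and stationarity 2x* - lam* = 0 gives lam* = 2 > 0, with f1(x*) = 0.
x_star_A, lam_star_A = 1.0, 2.0
assert np.isclose(lam_star_A * (1 - x_star_A), 0.0)   # lam* . f1(x*) = 0

# Problem B: min x^2  s.t.  f1(x) = -1 - x <= 0  (i.e. x >= -1).
# The unconstrained minimizer x* = 0 is feasible with slack f1(x*) = -1 < 0,
# so complementary slackness forces lam* = 0.
x_star_B, lam_star_B = 0.0, 0.0
assert np.isclose(lam_star_B * (-1 - x_star_B), 0.0)  # lam* . f1(x*) = 0
print("complementary slackness holds in both cases")
```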
KKT Conditions (Karush-Kuhn-Tucker Conditions)
KKT Condition Part I (Necessary Conditions)
Consider primal optimal x* and dual optimal (λ*, ν*).
Assume strong duality holds: d* = g(λ*, ν*) = p* = f_0(x*).
No assumption on convexity; f_0, f_i, h_i are assumed differentiable.
x ⃗ ∗ , λ ⃗ ∗ , ν ⃗ ∗ \vec{x}^*, \vec{\lambda}^*, \vec{\nu}^* x ∗ , λ ∗ , ν ∗ optimality implies
∀i ∈ [1, m], f_i(x*) ≤ 0 (primal feasible set)
∀i ∈ [1, p], h_i(x*) = 0 (primal feasible set)
∀i ∈ [1, m], λ_i* ≥ 0 (constraint on λ, dual feasible set)
∀i ∈ [1, m], λ_i* · f_i(x*) = 0 (complementary slackness)
∇f_0(x*) + ∑ λ_i* ∇f_i(x*) + ∑ ν_i* ∇h_i(x*) = 0 (gradient of the Lagrangian equals 0; we proved earlier that x* ∈ argmin_x L(x, λ*, ν*), implied by "the main theorem")
👆
Satisfying these conditions does not by itself guarantee optimality; rather, a way to find the optimal x, λ, ν is to look for points where these properties are met, which gives us a restricted set of candidates to search for the optimum.
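For instance, here is a minimal numpy sketch (with made-up H, c, a, b) of using the KKT conditions as a search procedure for an equality-constrained QP: stationarity plus primal feasibility form a linear system in (x, ν), and solving it yields the candidate optimum.

```python
import numpy as np

# Made-up problem data: min 1/2 x^T H x + c^T x  s.t.  a^T x = b, with H positive definite.
H = np.array([[2.0, 0.0],
              [0.0, 4.0]])
c = np.array([-1.0, -2.0])
a = np.array([1.0, 1.0])
b = 1.0

# KKT system (no inequality constraints, so only stationarity + primal feasibility):
#   H x + c + nu * a = 0
#   a^T x            = b
KKT = np.block([[H, a.reshape(-1, 1)],
                [a.reshape(1, -1), np.zeros((1, 1))]])
rhs = np.concatenate([-c, [b]])
sol = np.linalg.solve(KKT, rhs)
x_star, nu_star = sol[:2], sol[2]

# Check the two KKT conditions numerically.
print(np.allclose(H @ x_star + c + nu_star * a, 0))   # stationarity
print(np.isclose(a @ x_star, b))                      # primal feasibility
```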
KKT Condition Part II (Sufficient Conditions): the problem is convex, i.e. f_0, f_i are convex (and the h_i are affine).
x ~ , λ ~ , ν ~ \tilde{x}, \tilde{\lambda}, \tilde{\nu} x ~ , λ ~ , ν ~ are points that satisfy
∀ i ∈ [ 1 , m ] , f i ( x ~ ) ≤ 0 \forall i \in [1,m], f_i(\tilde{x}) \le 0 ∀ i ∈ [ 1 , m ] , f i ( x ~ ) ≤ 0 ∀ i ∈ [ 1 , p ] , h i ( x ~ ) = 0 \forall i \in [1,p], h_i(\tilde{x}) = 0 ∀ i ∈ [ 1 , p ] , h i ( x ~ ) = 0 ∀ i ∈ [ 1 , m ] , λ ~ i ≥ 0 \forall i \in [1,m], \tilde{\lambda}_i \ge 0 ∀ i ∈ [ 1 , m ] , λ ~ i ≥ 0 ∀ i ∈ [ 1 , m ] , λ ~ i ⋅ f i ( x ~ ) = 0 \forall i \in [1,m], \tilde{\lambda}_i \cdot f_i(\tilde{x}) = 0 ∀ i ∈ [ 1 , m ] , λ ~ i ⋅ f i ( x ~ ) = 0 ∇ f 0 ( x ~ ) + ∑ λ ~ i ∇ f i ( x ~ ) + ∑ ν ~ i ∇ h i ( x ~ ) = 0 \nabla f_0(\tilde{x}) + \sum \tilde{\lambda}_i \nabla f_i(\tilde{x}) + \sum \tilde{\nu}_i \nabla h_i(\tilde{x}) = 0 ∇ f 0 ( x ~ ) + ∑ λ ~ i ∇ f i ( x ~ ) + ∑ ν ~ i ∇ h i ( x ~ ) = 0
Then:
x ~ , λ ~ , ν ~ \tilde{x}, \tilde{\lambda}, \tilde{\nu} x ~ , λ ~ , ν ~ are primal and dual optimum
Proof of KKT Conditions (sufficient)
Linear Program Special Case of a CONVEX optimization problem Always have strong duality Standard Form of LP min c ⃗ ⊤ x ⃗ s.t. A x ⃗ = b ⃗ x ⃗ ≥ 0 \min \vec{c}^\top \vec{x} \\
\text{s.t.} \\
A\vec{x} = \vec{b} \\
\vec{x} \ge 0 min c ⊤ x s.t. A x = b x ≥ 0 All LPs can be translated to the standard form
Eliminate inequalities: add slack variables, ∑_{j=1}^n a_ij x_j ≤ b_i ⇒ ∑_{j=1}^n a_ij x_j + s_i = b_i, s_i ≥ 0.
Unconstrained variables: x_i ∈ ℝ ⇒ x_i = x_i^+ − x_i^−, where x_i^+ ≥ 0, x_i^− ≥ 0.
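A small sketch of this conversion on made-up data (the helper name to_standard_form is just for illustration): starting from min c^⊤x s.t. Ax ≤ b with free x, we add slacks s and split x = x⁺ − x⁻ to obtain a standard-form problem in z = (x⁺, x⁻, s).

```python
import numpy as np

def to_standard_form(A, b, c):
    """Convert  min c^T x  s.t.  A x <= b  (x free)
    into       min c_std^T z  s.t.  A_std z = b, z >= 0
    with z = (x_plus, x_minus, s), x = x_plus - x_minus, s the slack variables."""
    m, n = A.shape
    A_std = np.hstack([A, -A, np.eye(m)])          # A x+ - A x- + s = b
    c_std = np.concatenate([c, -c, np.zeros(m)])   # c^T x = c^T x+ - c^T x-
    return A_std, b, c_std

# Made-up example: 3 inequality constraints on 2 free variables.
A = np.array([[-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0]])
b = np.array([0.0, 0.0, -1.0])
c = np.array([1.0, 1.0])
A_std, b_std, c_std = to_standard_form(A, b, c)
print(A_std.shape, c_std.shape)   # (3, 7) and (7,): 2 + 2 + 3 standard-form variables
```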
Dual of Standard Form max − b ⃗ ⊤ λ ⃗ s.t. A ⊤ λ ⃗ + c ⃗ = 0 λ ⃗ ≥ 0 \max -\vec{b}^\top \vec{\lambda} \\
\text{s.t.} \\
A^\top \vec{\lambda} + \vec{c} = 0 \\
\vec{\lambda} \ge 0 max − b ⊤ λ s.t. A ⊤ λ + c = 0 λ ≥ 0
Simplex Algorithm Polyhedron
Set { x ⃗ ∈ R n ∣ A x ⃗ ≥ b ⃗ : A ∈ R m × n , b ⃗ ∈ R m } \{ \vec{x} \in \mathbb{R}^n | A\vec{x} \ge \vec{b} : A \in \mathbb{R}^{m \times n}, \vec{b} \in \mathbb{R}^m \} { x ∈ R n ∣ A x ≥ b : A ∈ R m × n , b ∈ R m } “Standard Form” Set { x ⃗ ∈ R l ∣ C x ⃗ = d ⃗ , x ⃗ ≥ 0 } \{ \vec{x} \in \mathbb{R}^l | C\vec{x} = \vec{d}, \vec{x} \ge 0 \} { x ∈ R l ∣ C x = d , x ≥ 0 }
Extreme Point of Polyhedron P P P :
\{ \vec{x} \in P | \neg(\exists \vec{y} \ne \vec{x},\vec{z} \ne \vec{x} \in P, \lambda \in [0,1] : \vec{x} = \lambda\vec{y} + (1-\lambda) \vec{z} ) \}
That is, x cannot be written as a convex combination of two other points y, z ∈ P.
Fact:
P P P has an extreme point iff P does not contain a line
If we assume
P has an extreme point, and an optimal solution exists and is finite. Then:
There exists an optimal solution that is an extreme point of P P P
Proof:
P = \{x|Ax \le b\}, p^* = \min_{x \in P} c^\top x
Let Q be the set of all optimal solutions:
Q = \{x | Ax\le b, c^\top x = p^* \}
Q is also a polyhedron (it can be expressed with inequality constraints). Since Q ⊆ P, Q cannot contain a line, so Q has an extreme point.
If u u u is a vertex of Q Q Q , how do we prove u u u is a vertex of P P P
Suppose for contradiction that u u u is not a vertex of P P P
then ∃ y , z ∈ P , y , z ≠ u \exists y, z \in P, y, z \ne u ∃ y , z ∈ P , y , z = u and λ y + ( 1 − λ ) z = u , λ ∈ ( 0 , 1 ) \lambda y + (1-\lambda) z = u, \lambda \in (0,1) λ y + ( 1 − λ ) z = u , λ ∈ ( 0 , 1 )
We know c ⊤ u = p ∗ c^\top u = p^* c ⊤ u = p ∗ ⇒ λ c ⊤ y + ( 1 − λ ) c ⊤ z = c ⊤ u = p ∗ \lambda c^\top y + (1-\lambda) c^\top z = c^\top u = p^* λ c ⊤ y + ( 1 − λ ) c ⊤ z = c ⊤ u = p ∗
We know since p ∗ p^* p ∗ is the optimal,
c^\top y \ge p^*, c^\top z \ge p^*
Combined with the equality above (and λ ∈ (0,1)), this forces
c ⊤ y = c ⊤ z = p ∗ c^\top y = c^\top z = p^* c ⊤ y = c ⊤ z = p ∗ So now y , z ∈ Q y, z \in Q y , z ∈ Q
But u u u is a vertex in Q Q Q , we have a contradiction!
Therefore u u u must be a vertex in P P P .
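To illustrate that an optimum can be found among the extreme points (a brute-force sketch on a made-up 2-D polyhedron; this enumerates vertices directly rather than pivoting between them like the simplex algorithm does), we can intersect pairs of constraint boundaries of {x | Ax ≤ b}, keep the feasible intersections as candidate vertices, and pick the one minimizing c^⊤x.

```python
import numpy as np
from itertools import combinations

# Made-up 2-D polyhedron P = {x | A x <= b}: the triangle x1 >= 0, x2 >= 0, x1 + x2 <= 1.
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])
c = np.array([-1.0, -2.0])          # minimize c^T x over P

vertices = []
for i, j in combinations(range(len(b)), 2):
    Ai = A[[i, j]]
    if abs(np.linalg.det(Ai)) < 1e-12:
        continue                     # the two boundary lines are parallel
    v = np.linalg.solve(Ai, b[[i, j]])   # intersection of the two boundaries
    if np.all(A @ v <= b + 1e-9):        # keep it only if it lies in P
        vertices.append(v)

best = min(vertices, key=lambda v: c @ v)
print(best, c @ best)                # expected: vertex (0, 1) with value -2
```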
Quadratic Programs min 1 2 x ⃗ ⊤ H x ⃗ + c ⃗ ⊤ x ⃗ s.t. A x ⃗ ≤ b ⃗ C x ⃗ = d ⃗ \min \frac{1}{2} \vec{x}^\top H \vec{x} + \vec{c}^\top \vec{x} \\
\text{s.t.} \\
A\vec{x} \le \vec{b} \\
C\vec{x} = \vec{d} min 2 1 x ⊤ H x + c ⊤ x s.t. A x ≤ b C x = d If H ∈ S n H \in \mathbb{S}^n H ∈ S n and H ≥ 0 H \ge 0 H ≥ 0 then this problem is convex
Eigenvalues of H ∈ S n H \in \mathbb{S}^n H ∈ S n (assuming unconstrained problem)
Case 1: H has at least one negative eigenvalue ⇒ p* = −∞.
Case 2: H ≥ 0 (PSD) and c ∈ R(H). Rewrite f(x) = ½(x − x_0)^⊤H(x − x_0) + α, where α is a constant we choose and −Hx_0 = c. If H is invertible, then x* = −H^{−1}c. If H has a null space, use the pseudo-inverse (because any x_0 with −Hx_0 = c would suffice): x* = −H^†c, where H^† = UΣ^†U^⊤ for the eigendecomposition H = UΣU^⊤ and Σ^† inverts only the nonzero entries of Σ.
Case 3: H ≥ 0 and c ∉ R(H). Write c = −Hx_0 + r with r ∈ N(H), r ≠ 0; then f(−t r) = ½t²r^⊤Hr − t c^⊤r = −t‖r‖_2², which goes to −∞ as t → ∞, so p* = −∞.
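A numpy sketch of the PSD cases above with made-up matrices: when H is invertible the minimizer is −H⁻¹c, and when H is PSD but singular with c ∈ R(H), the pseudo-inverse (np.linalg.pinv, which inverts only the nonzero singular values) gives one valid minimizer.

```python
import numpy as np

# Case: H positive definite (made-up data).
H = np.array([[2.0, 0.0], [0.0, 4.0]])
c = np.array([2.0, -4.0])
x_star = -np.linalg.solve(H, c)
print(x_star)                           # unique minimizer of 1/2 x^T H x + c^T x

# Case: H PSD but singular, with c in the range of H.
H = np.array([[1.0, 0.0], [0.0, 0.0]])  # null space spanned by e2
c = np.array([3.0, 0.0])                # c is in R(H)
x_star = -np.linalg.pinv(H) @ c         # one minimizer; adding anything in N(H) also works
grad = H @ x_star + c                   # gradient of the objective at x_star
print(x_star, grad)                     # gradient should be ~0

# Case c not in R(H), e.g. c = (3, 1): the objective is unbounded below (p* = -inf),
# since moving along -t * e2 drives it to -infinity.
```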
Cones Cone: Set of points C ⊆ R n C \sube \mathbb{R}^n C ⊆ R n is a cone iff ∀ x ⃗ ∈ C , ∀ α ≥ 0 , α x ⃗ ∈ C \forall \vec{x} \in C, \forall \alpha \ge 0, \alpha \vec{x} \in C ∀ x ∈ C , ∀ α ≥ 0 , α x ∈ C .
Convex Cone: a cone C is convex iff ∀ x_1, x_2 ∈ C and ∀ θ_1, θ_2 ≥ 0, θ_1x_1 + θ_2x_2 ∈ C.
Polyhedral Cone: C = {(x, t) | Ax ≤ bt, t ∈ ℝ, t ≥ 0}. (Figure: polyhedral cone.)
Ellipsoidal Cone. Equation of an ellipsoid: x^⊤Px + q^⊤x + r ≤ 0 with P > 0, or ‖Ax + b‖_2² ≤ c², i.e. x^⊤A^⊤Ax + 2b^⊤Ax + b^⊤b − c² ≤ 0.
Ellipsoidal cone equation: C = {(x, t) : ‖Ax + bt‖_2 ≤ ct}; indeed (αx, αt) ∈ C for all α ≥ 0 whenever (x, t) ∈ C. (Figure: ellipsoidal cone.)
Special case: the Second-Order Cone, C ⊂ ℝ^{n+1}, C = {(x, t) : ‖x‖_2 ≤ t}. (Figure: second-order cone in ℝ³.)
Second Order Cone Program min q ⃗ ⊤ x ⃗ s.t. ∀ i ∈ { 1 , … , m } , ∣ ∣ A i x ⃗ + b ⃗ i ∣ ∣ 2 ≤ c ⃗ i ⊤ x ⃗ + d i \min \vec{q}^\top \vec{x} \\
\text{s.t.} \\
\forall i \in \{1,\dots,m\}, ||A_i \vec{x} + \vec{b}_i||_2 \le \vec{c}_i^\top \vec{x} + d_i min q ⊤ x s.t. ∀ i ∈ { 1 , … , m } , ∣∣ A i x + b i ∣ ∣ 2 ≤ c i ⊤ x + d i 🤔
Note that each constraint must lie in a second-order cone!
( A i x ⃗ + b ⃗ i , c ⃗ i ⊤ x ⃗ + d i ) ∈ C (A_i \vec{x} + \vec{b}_i, \vec{c}_i^\top \vec{x} + d_i) \in C ( A i x + b i , c i ⊤ x + d i ) ∈ C Example(Facility Location Problem) of SOCPs
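If a modeling tool such as cvxpy is available, an SOCP in exactly this form can be written down directly. Below is a sketch with made-up data (A_1, b_1, c_1, d_1, q are arbitrary; A_1 is chosen so the feasible set is bounded).

```python
import numpy as np
import cvxpy as cp

# Made-up SOCP data: one second-order cone constraint ||A1 x + b1||_2 <= c1^T x + d1.
n = 3
A1 = np.vstack([2 * np.eye(n), np.ones((1, n))])   # well-conditioned so the feasible set is bounded
b1 = np.array([1.0, 0.0, -1.0, 0.5])
c1 = np.array([1.0, 0.0, 0.0])
d1 = 10.0
q = np.array([1.0, 2.0, 3.0])

x = cp.Variable(n)
constraints = [cp.norm(A1 @ x + b1, 2) <= c1 @ x + d1]   # the SOC constraint
prob = cp.Problem(cp.Minimize(q @ x), constraints)
prob.solve()
print(prob.status, prob.value, x.value)
```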
Applications of Optimization
Linear Quadratic Regulator (LQR)
\vec{x}_{t+1} = A\vec{x}_t + B\vec{u}_t
Cost:
C=\sum_{t=0}^{N-1} \frac{1}{2} (\vec{x}_t^\top Q \vec{x}_t + \vec{u}_t^\top R \vec{u}_t ) + \frac{1}{2} \vec{x}_N^\top Q_f \vec{x}_N
Note: assume Q and Q_f are PSD and R is positive definite (so the inverse of R used below exists).
The goal of LQR is to minimize the cost C over the inputs u_0, …, u_{N−1},
Subject to dynamic transition model x ⃗ t + 1 = A x ⃗ t + B u ⃗ t \vec{x}_{t+1} = A\vec{x}_t + B\vec{u}_t x t + 1 = A x t + B u t
🧙🏽♂️
This turns out to be a quadratic program, and instead of solving it in the generic QP formulation (whose size and runtime grow with the number of timesteps), we will use a Riccati equation to solve it.
The Riccati equation here is derived from the KKT conditions (whereas the standard derivation is based on dynamic programming / the Bellman equation).
The dual: Slater’s Conditions hold ⇒ Strong duality holds
L ( x ⃗ 0 , … , u ⃗ 0 , … , λ ⃗ 1 , … , λ ⃗ N ) = ∑ t = 0 N − 1 1 2 ( x ⃗ t Q x ⃗ t + u ⃗ t ⊤ R u ⃗ t ) + 1 2 x ⃗ N ⊤ Q f x ⃗ N + ∑ t = 0 N − 1 λ ⃗ t + 1 ⊤ ( A x ⃗ t + B u ⃗ t − x ⃗ t + 1 ) \begin{split}
L(\vec{x}_0, \dots, \vec{u}_0, \dots,\vec{\lambda}_1, \dots, \vec{\lambda}_N)
&= \sum_{t=0}^{N-1} \frac{1}{2} (\vec{x}_t^\top Q \vec{x}_t + \vec{u}_t^\top R \vec{u}_t) + \frac{1}{2} \vec{x}_N^\top Q_f \vec{x}_N + \sum_{t=0}^{N-1} \vec{\lambda}_{t+1}^\top (A\vec{x}_t + B\vec{u}_t - \vec{x}_{t+1}) \\
\end{split} L ( x 0 , … , u 0 , … , λ 1 , … , λ N ) = t = 0 ∑ N − 1 2 1 ( x t Q x t + u t ⊤ R u t ) + 2 1 x N ⊤ Q f x N + t = 0 ∑ N − 1 λ t + 1 ⊤ ( A x t + B u t − x t + 1 )
Consider KKT Conditions here:
∇ u ⃗ t L = R ⊤ u ⃗ t + B ⊤ λ ⃗ t + 1 = 0 , ∀ t = 0 , 1 , … , N − 1 \nabla_{\vec{u}_t}L = R^\top\vec{u}_t + B^{\top} \vec{\lambda}_{t+1} = 0, \forall t = 0, 1, \dots, N-1 ∇ u t L = R ⊤ u t + B ⊤ λ t + 1 = 0 , ∀ t = 0 , 1 , … , N − 1 ∇ x ⃗ t L = Q ⊤ x ⃗ t + A ⊤ λ ⃗ t + 1 − λ ⃗ t = 0 , ∀ 1 , … , N − 1 \nabla_{\vec{x}_t} L = Q^\top \vec{x}_t +A^\top \vec{\lambda}_{t+1}- \vec{\lambda}_t = 0, \forall 1, \dots, N-1 ∇ x t L = Q ⊤ x t + A ⊤ λ t + 1 − λ t = 0 , ∀1 , … , N − 1 ∇ x ⃗ N L = Q f x ⃗ N − λ ⃗ N = 0 \nabla_{\vec{x}_N} L = Q_f \vec{x}_N - \vec{\lambda}_N = 0 ∇ x N L = Q f x N − λ N = 0 If we rearrange a bit,
We have (1) dynamics of the “adjoint system”, where λ \lambda λ is the co-state
λ ⃗ N = Q f x ⃗ N \vec{\lambda}_N = Q_f \vec{x}_N λ N = Q f x N λ ⃗ t = Q ⊤ x ⃗ t + A ⊤ λ ⃗ t + 1 \vec{\lambda}_t = Q^\top \vec{x}_t + A^\top \vec{\lambda}_{t+1} λ t = Q ⊤ x t + A ⊤ λ t + 1 We can also solve for u ⃗ t \vec{u}_t u t
u ⃗ t = − ( R ⊤ ) − 1 B ⊤ λ ⃗ t + 1 \vec{u}_t = -(R^\top)^{-1}B^\top \vec{\lambda}_{t+1} u t = − ( R ⊤ ) − 1 B ⊤ λ t + 1 Can we solve this using backward induction?
Goal: Find optimal u ⃗ t \vec{u}_t u t
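Backward induction does work here. Below is a minimal numpy sketch in the standard finite-horizon Riccati-recursion form (the dynamic-programming version rather than the adjoint/KKT form above); the system matrices A, B, Q, R, Q_f are made-up.

```python
import numpy as np

def lqr_backward(A, B, Q, R, Qf, N):
    """Finite-horizon LQR via the backward Riccati recursion.
    Returns feedback gains K_t such that u_t = -K_t x_t is optimal."""
    P = Qf
    gains = []
    for _ in range(N):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # gain K_t
        P = Q + A.T @ P @ (A - B @ K)                       # Riccati update for P_t
        gains.append(K)
    return list(reversed(gains))                            # K_0, ..., K_{N-1}

# Made-up double-integrator-like system.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])
Qf = 10 * np.eye(2)
K = lqr_backward(A, B, Q, R, Qf, N=50)

# Roll out the closed-loop system from an initial state.
x = np.array([1.0, 0.0])
for t in range(50):
    u = -K[t] @ x
    x = A @ x + B @ u
print(x)   # the state should be driven close to the origin
```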
Support Vector Machines (SVM) Consider binary classification problem
{ ( x ⃗ i , y i ) } \{ (\vec{x}_i, y_i) \} {( x i , y i )} where ∀ i , y i ∈ { − 1 , 1 } \forall i, y_i \in \{-1, 1\} ∀ i , y i ∈ { − 1 , 1 }
Separating Hyperplane: a separating hyperplane for the data is
f ( x ⃗ ) = w ⃗ ⊤ x ⃗ + b f(\vec{x}) = \vec{w}^\top \vec{x} + b f ( x ) = w ⊤ x + b such that
If w ⃗ ⊤ x ⃗ i + b > 0 \vec{w}^\top \vec{x}_i + b > 0 w ⊤ x i + b > 0 then y i = + 1 y_i = +1 y i = + 1 If w ⃗ ⊤ x ⃗ i + b < 0 \vec{w}^\top \vec{x}_i + b < 0 w ⊤ x i + b < 0 then y i = − 1 y_i = -1 y i = − 1
🧙🏽♂️
We have many separating hyperplanes (given that the datapoints are linearly separable), so which one do we use? Idea: use the hyperplane that maximizes the distance between the classes.
Define distance to hyperplane from a class to be the minimum of all distances of all points in the class to the hyperplane.
So formulation of SVM:
max w ⃗ , b , m m s.t. y i ( w ⃗ ⊤ x ⃗ i + b ) > 0 ∣ w ⃗ ⊤ x ⃗ i + b ∣ ∣ ∣ w ⃗ ∣ ∣ 2 ≥ m \max_{\vec{w}, b, m} m \\
\text{s.t.} \\
y_i(\vec{w}^\top \vec{x}_i + b) > 0 \\
\frac{|\vec{w}^\top \vec{x}_i + b|}{||\vec{w}||_2} \ge m
If (m, w, b) is a solution, then (m, αw, αb) is also a solution for any α > 0 ⇒ the representation is not unique, so let's fix the norm of w:
∣ ∣ w ⃗ ∣ ∣ 2 = 1 m ||\vec{w}||_2 = \frac{1}{m} ∣∣ w ∣ ∣ 2 = m 1 With this, we can write
min ∣ ∣ w ⃗ ∣ ∣ 2 2 s.t. y i ( w ⃗ ⊤ x ⃗ i + b ) ≥ 1 \min ||\vec{w}||_2^2 \\
\text{s.t.} \\
y_i (\vec{w}^\top \vec{x}_i + b) \ge 1 min ∣∣ w ∣ ∣ 2 2 s.t. y i ( w ⊤ x i + b ) ≥ 1 This is called a hard-margin SVM
We also have “Soft-margin SVM ”
min w ⃗ , b , ξ ⃗ 1 2 ∣ ∣ w ⃗ ∣ ∣ 2 2 + C ( ∑ i = 1 n ξ i ) s.t. y i ( w ⃗ ⊤ x ⃗ i + b ) ≥ 1 − ξ i ξ i ≥ 0 \min_{\vec{w}, b, \vec{\xi}} \frac{1}{2}||\vec{w}||_2^2 + C(\sum_{i=1}^n \xi_i)\\
\text{s.t.} \\
y_i(\vec{w}^\top \vec{x}_i + b) \ge 1 - \xi_i \\
\xi_i \ge 0
w , b , ξ min 2 1 ∣∣ w ∣ ∣ 2 2 + C ( i = 1 ∑ n ξ i ) s.t. y i ( w ⊤ x i + b ) ≥ 1 − ξ i ξ i ≥ 0 Here C > 0 C > 0 C > 0 is a hyperparameter ⇒ large C means we want to minimize the slack(more like hard-margin SVM)
Hinge Loss formulation of SVM Suppose we consider the 0-1 loss for classification
L 01 ( y i , f ( x i ) ) = { 0 if y i f ( x ⃗ i ) > 0 1 if y i f ( x ⃗ i ) < 0 L_{01}(y_i,f(x_i)) = \begin{cases}
0 &\text{if } y_i f(\vec{x}_i) > 0 \\
1 &\text{if } y_i f(\vec{x}_i) < 0
\end{cases}
For the case of SVM, f(x_i) = w^⊤x_i + b, so the loss is determined by the sign of y_i(w^⊤x_i + b).
So in order to do a correct classification, we want to
min 1 n ∑ i = 1 n L 01 ( y i , w ⃗ ⊤ x ⃗ i + b ) \min \frac{1}{n} \sum_{i=1}^n L_{01}(y_i, \vec{w}^\top \vec{x}_i + b) min n 1 i = 1 ∑ n L 01 ( y i , w ⊤ x i + b ) We cannot optimize this because it is not convex
So we use hinge loss
L h i n g e ( y i , f ( x ⃗ i ) ) = max ( 1 − y i f ( x ⃗ i ) , 0 ) L_{hinge}(y_i, f(\vec{x}_i)) = \max(1-y_i f(\vec{x}_i),0) L hin g e ( y i , f ( x i )) = max ( 1 − y i f ( x i ) , 0 ) Hinge loss is the blue line here So now we have
min w ⃗ , b 1 n ∑ L h i n g e ( y i , w ⃗ ⊤ x ⃗ i + b ) + λ ∣ ∣ w ⃗ ∣ ∣ 2 2 \min_{\vec{w}, b} \frac{1}{n} \sum L_{hinge}(y_i, \vec{w}^\top \vec{x}_i + b) + \lambda ||\vec{w}||_2^2 w , b min n 1 ∑ L hin g e ( y i , w ⊤ x i + b ) + λ ∣∣ w ∣ ∣ 2 2 📌
We will see that this form of hinge loss formulation is exactly the same optimization problem as soft-margin SVM
Remember Soft-margin SVM formulation:
min w ⃗ , b , ξ ⃗ 1 2 ∣ ∣ w ⃗ ∣ ∣ 2 2 + C ( ∑ i = 1 n ξ i ) s.t. y i ( w ⃗ ⊤ x ⃗ i + b ) ≥ 1 − ξ i ξ i ≥ 0 \min_{\vec{w}, b, \vec{\xi}} \frac{1}{2}||\vec{w}||_2^2 + C(\sum_{i=1}^n \xi_i)\\
\text{s.t.} \\
y_i(\vec{w}^\top \vec{x}_i + b) \ge 1 - \xi_i \\
\xi_i \ge 0 w , b , ξ min 2 1 ∣∣ w ∣ ∣ 2 2 + C ( i = 1 ∑ n ξ i ) s.t. y i ( w ⊤ x i + b ) ≥ 1 − ξ i ξ i ≥ 0 Transform the formulation
min w ⃗ , b , ξ ⃗ 1 2 ∣ ∣ w ⃗ ∣ ∣ 2 2 + C ( ∑ i = 1 n ξ i ) s.t. ξ i ≥ max ( 1 − y i ( w ⃗ ⊤ x ⃗ i + b ) , 0 ) \min_{\vec{w}, b, \vec{\xi}} \frac{1}{2}||\vec{w}||_2^2 + C(\sum_{i=1}^n \xi_i)\\
\text{s.t.} \\
\xi_i \ge \max (1-y_i(\vec{w}^\top \vec{x}_i + b), 0) w , b , ξ min 2 1 ∣∣ w ∣ ∣ 2 2 + C ( i = 1 ∑ n ξ i ) s.t. ξ i ≥ max ( 1 − y i ( w ⊤ x i + b ) , 0 ) We can claim that at optimum, the constraint is tight (otherwise the objective function can be lowered)
min w ⃗ , b 1 2 ∣ ∣ w ⃗ ∣ ∣ 2 2 + C ( ∑ i = 1 n L 0 , 1 ( y i , w ⃗ ⊤ x ⃗ i + b ) ) \min_{\vec{w}, b} \frac{1}{2}||\vec{w}||_2^2 + C\bigg(\sum_{i=1}^n L_{0,1}(y_i, \vec{w}^\top \vec{x}_i + b)\bigg) w , b min 2 1 ∣∣ w ∣ ∣ 2 2 + C ( i = 1 ∑ n L 0 , 1 ( y i , w ⊤ x i + b ) ) Therefore if we let C = 1 2 n λ C = \frac{1}{2n \lambda} C = 2 nλ 1 then this is equivalent to the hinge-loss formulation
Dual perspective of Soft-margin SVM L ( w ⃗ , b , ξ ⃗ , α ⃗ , β ⃗ ) = 1 2 ∣ ∣ w ⃗ ∣ ∣ 2 2 + C ∑ i = 1 n ξ i + ∑ i = 1 n α i ( ( 1 − ξ i ) − y i ( w ⃗ ⊤ x ⃗ i + b ) ) + ∑ i = 1 n β i ( − ξ i ) = 1 2 ∣ ∣ w ⃗ ∣ ∣ 2 2 − ∑ i = 1 n α i y i ( w ⃗ ⊤ x ⃗ i + b ) + ∑ i = 1 n α i + ∑ i = 1 n ( C − α i − β i ) ξ i \begin{split}
L(\vec{w},b,\vec{\xi},\vec{\alpha},\vec{\beta}) &= \frac{1}{2}||\vec{w}||_2^2 + C \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i ((1-\xi_i) - y_i(\vec{w}^\top\vec{x}_i + b)) + \sum_{i=1}^n \beta_i (-\xi_i) \\
&=\frac{1}{2} ||\vec{w}||_2^2 - \sum_{i=1}^n \alpha_i y_i(\vec{w}^\top\vec{x}_i + b) + \sum_{i=1}^n \alpha_i + \sum_{i=1}^n (C-\alpha_i-\beta_i) \xi_i
\end{split} L ( w , b , ξ , α , β ) = 2 1 ∣∣ w ∣ ∣ 2 2 + C i = 1 ∑ n ξ i + i = 1 ∑ n α i (( 1 − ξ i ) − y i ( w ⊤ x i + b )) + i = 1 ∑ n β i ( − ξ i ) = 2 1 ∣∣ w ∣ ∣ 2 2 − i = 1 ∑ n α i y i ( w ⊤ x i + b ) + i = 1 ∑ n α i + i = 1 ∑ n ( C − α i − β i ) ξ i We have the primal and dual formulations:
p ∗ = min w ⃗ , b ⃗ , ξ ⃗ max α ⃗ , β ⃗ L ( ⋯ ) d ∗ = max α ⃗ , β ⃗ min w ⃗ , b ⃗ , ξ ⃗ L ( ⋯ ) p^* = \min_{\vec{w},\vec{b}, \vec{\xi}} \max_{\vec{\alpha}, \vec{\beta}} L(\cdots)
\\
d^* = \max_{\vec{\alpha}, \vec{\beta}} \min_{\vec{w},\vec{b}, \vec{\xi}} L(\cdots)
p ∗ = w , b , ξ min α , β max L ( ⋯ ) d ∗ = α , β max w , b , ξ min L ( ⋯ ) Consider first-order KKT conditions
∇ w ⃗ L = w ⃗ − ∑ α i y i x ⃗ i = 0 ⟹ w ⃗ = ∑ i = 1 n α i y i x ⃗ i \nabla_{\vec{w}} L = \vec{w} - \sum \alpha_i y_i \vec{x}_i = 0 \Longrightarrow \vec{w} = \sum_{i=1}^n \alpha_iy_i\vec{x}_i ∇ w L = w − ∑ α i y i x i = 0 ⟹ w = i = 1 ∑ n α i y i x i ∂ L ∂ b = − ∑ α i y i = 0 \frac{\partial L}{\partial b} = -\sum \alpha_i y_i = 0 ∂ b ∂ L = − ∑ α i y i = 0 ∂ L ∂ ξ i = C − α i − β i = 0 \frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 ∂ ξ i ∂ L = C − α i − β i = 0 Consider complementary slackness
α i ( ( 1 − ξ i ) − y i ( w ⃗ ⊤ x ⃗ i + b ) ) = 0 , ∀ i \alpha_i ((1-\xi_i) - y_i(\vec{w}^\top\vec{x}_i + b)) = 0, \forall i α i (( 1 − ξ i ) − y i ( w ⊤ x i + b )) = 0 , ∀ i β i ξ i = 0 \beta_i \xi_i = 0 β i ξ i = 0 Combining those equations,
Case 0 < α_i < C (α_i ≠ 0 and α_i ≠ C):
(1 − ξ_i) − y_i(w^⊤x_i + b) = 0, and β_i = C − α_i ⇒ β_i ≠ 0 ⇒ ξ_i = 0,
so y_i(w^⊤x_i + b) = 1: the point lies exactly on the margin ("support vectors").
Case α_i = C (α_i ≠ 0):
C − α_i − β_i = −β_i = 0, so β_i = 0 and ξ_i is allowed to be nonzero (though it may still be 0).
Also (1 − ξ_i) − y_i(w^⊤x_i + b) = 0, so y_i(w^⊤x_i + b) = 1 − ξ_i ≤ 1.
Case α_i = 0:
β_i = C ≠ 0 ⇒ ξ_i = 0: no margin violation, and the decision boundary does not depend on this point.
The dual problem
L ( ⋯ ) = − 1 2 ( ∑ α i y i x ⃗ i ) ⊤ ( ∑ α i y i x ⃗ i ) + ∑ α i L(\cdots) = -\frac{1}{2} (\sum \alpha_i y_i\vec{x}_i)^\top(\sum \alpha_i y_i \vec{x}_i) + \sum \alpha_i L ( ⋯ ) = − 2 1 ( ∑ α i y i x i ) ⊤ ( ∑ α i y i x i ) + ∑ α i By substituting in KKT conditions
We have expressed the whole lagrangian in dual variables!
L ( ⋯ ) = − 1 2 [ α 1 y 1 ⋯ α n y n ] [ x ⃗ 1 ⊤ x ⃗ 2 ⊤ ⋮ x ⃗ n ⊤ ] [ x ⃗ 1 x ⃗ 2 ⋯ x ⃗ n ] [ α 1 y 1 ⋮ α n y n ] + ∑ α i = − 1 2 α ⃗ ⊤ d i a g ( y ⃗ ) X X ⊤ d i a g ( y ⃗ ) undefined Q α ⃗ + ∑ α i \begin{split}
L(\cdots) &= -\frac{1}{2}\begin{bmatrix}
\alpha_1 y_1 &\cdots &\alpha_ny_n
\end{bmatrix}
\begin{bmatrix}
\vec{x}_1^\top \\
\vec{x}_2^\top \\
\vdots \\
\vec{x}_n^\top
\end{bmatrix}
\begin{bmatrix}
\vec{x}_1 &
\vec{x}_2 &
\cdots &
\vec{x}_n
\end{bmatrix}
\begin{bmatrix}
\alpha_1 y_1 \\
\vdots \\
\alpha_n y_n
\end{bmatrix}
+ \sum \alpha_i \\
&=-\frac{1}{2} \vec{\alpha}^\top \underbrace{diag(\vec{y}) XX^\top diag(\vec{y})}_{Q} \vec{\alpha} + \sum \alpha_i
\end{split} L ( ⋯ ) = − 2 1 [ α 1 y 1 ⋯ α n y n ] ⎣ ⎡ x 1 ⊤ x 2 ⊤ ⋮ x n ⊤ ⎦ ⎤ [ x 1 x 2 ⋯ x n ] ⎣ ⎡ α 1 y 1 ⋮ α n y n ⎦ ⎤ + ∑ α i = − 2 1 α ⊤ Q d ia g ( y ) X X ⊤ d ia g ( y ) α + ∑ α i Kernels(Lifting) Instead of using
f ( x ⃗ ) = w ⃗ ⊤ x ⃗ + b f(\vec{x}) = \vec{w}^\top \vec{x}+b f ( x ) = w ⊤ x + b Use
f ( x ⃗ ) = w ⃗ ⊤ Φ ( x ⃗ ) + b f(\vec{x})=\vec{w}^\top\Phi(\vec{x})+b f ( x ) = w ⊤ Φ ( x ) + b
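Since the dual objective only involves inner products x_i^⊤x_j (through X X^⊤), lifting with Φ amounts to replacing X X^⊤ by a Gram matrix K with K_ij = Φ(x_i)^⊤Φ(x_j). A small sketch with an RBF kernel on made-up data (the bandwidth γ is an assumed hyperparameter):

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """Gram matrix K with K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

# Made-up data and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([1, 1, 1, -1, -1, -1])

K = rbf_kernel(X)                      # replaces X @ X.T in the linear dual
Q = np.diag(y) @ K @ np.diag(y)        # Q_ij = y_i y_j K(x_i, x_j)

# Q is symmetric PSD, so the dual objective -1/2 a^T Q a + sum(a) stays concave
# in the dual variables a, exactly as in the linear case.
print(np.allclose(Q, Q.T), np.all(np.linalg.eigvalsh(Q) > -1e-9))
```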
Interior Point Methods min f 0 ( x ⃗ ) s.t. f i ( x ⃗ ) ≤ 0 , ∀ i A x ⃗ = b ⃗ \min f_0(\vec{x}) \\
\text{s.t.} \\
f_i(\vec{x}) \le 0, \forall i \\
A\vec{x} = \vec{b} min f 0 ( x ) s.t. f i ( x ) ≤ 0 , ∀ i A x = b Conditions:
f_i(x) are convex and twice differentiable.
A ∈ ℝ^{p×n} with rank(A) = p (A has full row rank).
The problem is solvable: x* exists.
Slater's condition holds ⇒ strong duality.
With those conditions, for optimal points, the KKT conditions are necessary and sufficient!
∃ x ⃗ ∗ , λ ⃗ ∗ , ν ⃗ ∗ s.t. A x ⃗ ∗ = b ⃗ , f i ( x ⃗ ∗ ) ≤ 0 , λ ⃗ ∗ ≥ 0 , ∇ f 0 ( x ⃗ ∗ ) + ∑ i = 1 m λ i ∗ ∇ f i ( x ⃗ ∗ ) + A ⊤ ν ⃗ ∗ = 0 , λ i ∗ ⋅ f i ( x ⃗ ∗ ) = 0 , ∀ i \exists \vec{x}^*, \vec{\lambda}^*, \vec{\nu}^* \text{ s.t.} \\
A\vec{x}^* = \vec{b}, f_i(\vec{x}^*) \le 0, \vec{\lambda}^* \ge 0, \\
\nabla f_0(\vec{x}^*) + \sum_{i=1}^m \lambda_i^* \nabla f_i(\vec{x}^*) + A^\top \vec{\nu}^* = 0, \\
\lambda_i^* \cdot f_i(\vec{x}^*) = 0, \forall i
These conditions work for:
LPs, QPs, QCQPs (quadratically constrained quadratic programs), SOCPs, and SDPs (semidefinite programs)
Motivation:
Newton's method ⇒ a series of QPs; equality-constrained Newton's method ⇒ a series of equality-constrained QPs
🔥
Convert our problem to a series of equality constrained problems and remove inequality constraints ⇒ IPM
Barrier Function
\min_{A\vec{x} = \vec{b}} f_0(\vec{x})+\sum_{i=1}^m I_-(f_i(\vec{x}))
Where
I − : R → R s.t. I − = { 0 if u ≤ 0 ∞ if u > 0 I_-: \mathbb{R} \rightarrow \mathbb{R} \text{ s.t. } I_- = \begin{cases}
0 &\text{if } u\le 0\\
\infin &\text{if } u > 0
\end{cases} I − : R → R s.t. I − = { 0 ∞ if u ≤ 0 if u > 0 Can we make a differentiable convex approximation to this function?
I a p p r o x ( u ) = − 1 t log ( − u ) I_{approx}(u) = -\frac{1}{t} \log(-u) I a pp ro x ( u ) = − t 1 log ( − u ) Where t > 0 t > 0 t > 0 , as t t t gets larger and larger the approximation is sharper and sharper
I a p p r o x I_{approx} I a pp ro x is called the approximate barrier function
Therefore we transformed the original problem into an approximate form, using logarithmic barrier function
ϕ ( x ⃗ ) = − ∑ i = 1 m log ( − f i ( x ⃗ ) ) \phi(\vec{x}) = -\sum_{i=1}^m \log(-f_i(\vec{x})) ϕ ( x ) = − i = 1 ∑ m log ( − f i ( x )) min A x ⃗ = b ⃗ f 0 ( x ⃗ ) + 1 t ϕ ( x ⃗ ) \min_{A\vec{x} = \vec{b}} f_0(\vec{x}) + \frac{1}{t} \phi(\vec{x}) A x = b min f 0 ( x ) + t 1 ϕ ( x ) Compute the gradient and hessian
∇ ϕ ( x ⃗ ) = ∑ i = 1 m − 1 f i ( x ⃗ ) ∇ f i ( x ⃗ ) \nabla \phi(\vec{x}) = \sum_{i=1}^m -\frac{1}{f_i(\vec{x})}\nabla f_i(\vec{x}) ∇ ϕ ( x ) = i = 1 ∑ m − f i ( x ) 1 ∇ f i ( x ) ∇ 2 ϕ ( x ⃗ ) = ∑ i = 1 m 1 f i ( x ⃗ ) 2 ∇ f i ( x ⃗ ) ∇ f i ( x ⃗ ) ⊤ + ∑ i = 1 m − 1 f i ( x ⃗ ) ∇ 2 f i ( x ⃗ ) \nabla^2 \phi(\vec{x}) = \sum_{i=1}^m \frac{1}{f_i(\vec{x})^2} \nabla f_i(\vec{x})\nabla f_i(\vec{x})^\top + \sum_{i=1}^m -\frac{1}{f_i(\vec{x})} \nabla^2 f_i(\vec{x}) ∇ 2 ϕ ( x ) = i = 1 ∑ m f i ( x ) 2 1 ∇ f i ( x ) ∇ f i ( x ) ⊤ + i = 1 ∑ m − f i ( x ) 1 ∇ 2 f i ( x ) Assume that this is solvable by Newton + unique solution
Solution: x ⃗ ∗ ( t ) \vec{x}^*(t) x ∗ ( t ) ⇒ belongs to the “central path”
This solution should satisfy the KKT condition
A x ⃗ ∗ ( t ) = b , f i ( x ⃗ ∗ ( t ) ) < 0 , ∃ ν ⃗ ^ : t ⋅ ∇ f 0 ( x ⃗ ∗ ( t ) ) + ∇ ϕ ( x ⃗ ∗ ( t ) ) + A ⊤ ν ⃗ ^ = 0 A \vec{x}^*(t) = b, f_i(\vec{x}^*(t)) < 0, \\
\exists \hat{\vec{\nu}} : t \cdot \nabla f_0(\vec{x}^*(t)) + \nabla \phi(\vec{x}^*(t)) + A^\top \hat{\vec{\nu}} = 0 A x ∗ ( t ) = b , f i ( x ∗ ( t )) < 0 , ∃ ν ^ : t ⋅ ∇ f 0 ( x ∗ ( t )) + ∇ ϕ ( x ∗ ( t )) + A ⊤ ν ^ = 0 From this define
\lambda_i^*(t) = -\frac{1}{t f_i(\vec{x}^*(t))}, \quad \vec{\nu}^*(t) = \hat{\vec{\nu}}/t
We claim that (λ*(t), ν*(t)) are feasible dual variables for the original problem.
Let’s write the FOC condition of the transformed problem
t ⋅ ∇ f 0 ( x ⃗ ∗ ( t ) ) + ∑ − 1 f i ( x ⃗ ∗ ( t ) ) ⋅ ∇ f i ( x ⃗ ∗ ( t ) ) + A ⊤ ν ⃗ ^ = 0 ∇ f 0 ( x ⃗ ∗ ( t ) ) + ∑ λ i ∗ ( t ) ⋅ ∇ f i ( x ⃗ ∗ ( t ) ) + A ⊤ ν ∗ ( t ) = 0 t \cdot \nabla f_0(\vec{x}^*(t)) + \sum -\frac{1}{f_i(\vec{x}^*(t))} \cdot \nabla f_i(\vec{x}^*(t)) + A^\top \hat{\vec{\nu}} =0 \\
\nabla f_0(\vec{x}^*(t)) + \sum \lambda_i^*(t) \cdot \nabla f_i(\vec{x}^*(t)) + A^\top \nu^*(t) = 0 t ⋅ ∇ f 0 ( x ∗ ( t )) + ∑ − f i ( x ∗ ( t )) 1 ⋅ ∇ f i ( x ∗ ( t )) + A ⊤ ν ^ = 0 ∇ f 0 ( x ∗ ( t )) + ∑ λ i ∗ ( t ) ⋅ ∇ f i ( x ∗ ( t )) + A ⊤ ν ∗ ( t ) = 0 🔥
Notice that now this is the FOC of the lagrangian of the original problem!
Means ⇒ x ⃗ ∗ ( t ) \vec{x}^*(t) x ∗ ( t ) is the minimizer of L ( x ⃗ , λ ⃗ ∗ ( t ) , ν ⃗ ∗ ( t ) ) L(\vec{x}, \vec{\lambda}^*(t), \vec{\nu}^*(t)) L ( x , λ ∗ ( t ) , ν ∗ ( t ))
This does not necessarily mean that x*(t) is the minimizer of the original problem; it just means x*(t) minimizes the Lagrangian at these specific dual variables. But we can evaluate the dual function of the original problem at these points:
g ( λ ⃗ ∗ ( t ) , ν ⃗ ∗ ( t ) ) = min x ⃗ L ( x ⃗ , λ ⃗ ∗ ( t ) , ν ⃗ ∗ ( t ) ) = f 0 ( x ⃗ ∗ ( t ) ) + ∑ i = 1 m λ i ( t ) ⋅ f i ( x ⃗ ∗ ( t ) ) + ν ⃗ ∗ ( t ) ⊤ ( A x ⃗ ∗ ( t ) − b ⃗ ) = f 0 ( x ⃗ ∗ ( t ) ) − m t \begin{split}
g(\vec{\lambda}^*(t), \vec{\nu}^*(t)) &= \min_{\vec{x}} L(\vec{x}, \vec{\lambda}^*(t), \vec{\nu}^*(t)) \\
&= f_0(\vec{x}^*(t)) + \sum_{i=1}^m \lambda_i(t) \cdot f_i(\vec{x}^*(t))+\vec{\nu}^*(t)^\top (A\vec{x}^*(t)-\vec{b}) \\
&= f_0(\vec{x}^*(t)) -\frac{m}{t}
\end{split} g ( λ ∗ ( t ) , ν ∗ ( t )) = x min L ( x , λ ∗ ( t ) , ν ∗ ( t )) = f 0 ( x ∗ ( t )) + i = 1 ∑ m λ i ( t ) ⋅ f i ( x ∗ ( t )) + ν ∗ ( t ) ⊤ ( A x ∗ ( t ) − b ) = f 0 ( x ∗ ( t )) − t m Another way of stating λ i ( t ) = − 1 t f i ( x ⃗ ∗ ( t ) ) \lambda_i(t) = -\frac{1}{t f_i(\vec{x}^*(t))} λ i ( t ) = − t f i ( x ∗ ( t )) 1 :
-\lambda_i(t) f_i(\vec{x}^*(t)) = \frac{1}{t}
This is "approximate complementary slackness" ⇒ as t → ∞, complementary slackness is satisfied.
Say I want accuracy ϵ \epsilon ϵ
Set ϵ = m t \epsilon = \frac{m}{t} ϵ = t m ⇒ t = m ϵ t = \frac{m}{\epsilon} t = ϵ m ⇒ solve the approximate problem using equality constrained Newton’s method
Barrier Methods Given strictly feasible x , t > 0 , μ > 1 , ϵ > 0 x, t > 0, \mu > 1, \epsilon >0 x , t > 0 , μ > 1 , ϵ > 0
Repeat:
Centering: compute x*(t) by minimizing t·f_0(x) + φ(x) s.t. Ax = b, starting at x_1 and using Newton's method.
Update: x := x*(t).
Stopping criterion: quit if m/t < ϵ.
Increase t := μ·t.
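A compact sketch of this loop for a problem with only inequality constraints (made-up LP data; equality constraints are omitted for brevity, and centering uses damped Newton steps built from the gradient and Hessian of φ given earlier):

```python
import numpy as np

def barrier_method(c, F, g, x0, t=1.0, mu=10.0, eps=1e-6):
    """Barrier-method sketch for  min c^T x  s.t.  F x <= g  (no equality constraints).
    x0 must be strictly feasible; centering uses damped Newton steps."""
    x, m = x0.astype(float), len(g)
    while True:
        # Centering: minimize t * c^T x - sum(log(g - F x)) with Newton's method.
        for _ in range(50):
            slack = g - F @ x                            # must stay > 0
            grad = t * c + F.T @ (1.0 / slack)           # gradient of t*f0 + phi
            hess = F.T @ np.diag(1.0 / slack**2) @ F     # Hessian of phi
            dx = np.linalg.solve(hess, -grad)
            step = 1.0
            while np.any(g - F @ (x + step * dx) <= 0):  # backtrack to stay strictly feasible
                step *= 0.5
            x = x + step * dx
            if np.linalg.norm(grad) < 1e-8:
                break
        if m / t < eps:                                  # stopping criterion
            return x
        t *= mu                                          # increase t

# Made-up LP: minimize x1 + x2 over the box 0 <= x <= 1 (optimum at the origin).
F = np.vstack([np.eye(2), -np.eye(2)])
g = np.array([1.0, 1.0, 0.0, 0.0])
c = np.array([1.0, 1.0])
print(barrier_method(c, F, g, x0=np.array([0.5, 0.5])))   # approximately (0, 0)
```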