Created by Yunhao Cao (Github@ToiletCommander) in Spring 2022 for UC Berkeley CS189.
Reference Notice: Material highly and mostly derived from Prof Shewchuk's lecture notes, some ideas were borrowed from wikipedia & Stanford CS231N & Andrew Ng’s Intro to ML Course@Stanford.
Last Update: 2022-05-13 21:46 PST ⇒ After final release
Machine Learning Abstraction
Application / Data
Is the data labeled?
1. Yes - categorical (classification) or quantitative (regression)
2. No - similarity (clustering) or positioning (dimensionality reduction)
Model
-Decision Fns: linear, polynomial, logistic, neural net...
-Nearest Neighbours, Decision Trees
-Features
-Low vs. high capacity (affects overfitting, underfitting, inference)
For the RBF kernel, the feature map Φ(x) is an infinite-dimensional vector, and Φ(x)⋅Φ(z) is a series that converges to k(x,z)
RBF Kernel Function:
k(x,z) = exp(−||x−z||² / (2σ²))
🔥
Very popular in practice
1. gives very smooth h (infinitely differentiable)
2. Behaves like k-nearest neighbours, but smoother
3. Oscillates less than polynomials (depending on σ)
4. k(x,z) is more like a similarity measure
5. Sample points “vote” for value at z, closer points get weightier vote.
Choose σ by cross-validation.
Larger σ means wider Gaussians and a smoother h, but more bias and less variance
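A minimal numpy sketch of the RBF kernel defined above (the bandwidth value is a hypothetical example):

import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# Closer points get values near 1; distant points get values near 0.
print(rbf_kernel([0.0, 0.0], [0.1, 0.0]))   # ~0.995
print(rbf_kernel([0.0, 0.0], [3.0, 0.0]))   # ~0.011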
Transformation of Space in finding weights
In many ML algorithms we want to find a decision function that takes the form of a geometric shape, such as a hyperplane / isocontour / isosurface, in the original data space (x-space). However, in the optimization we usually search for a vector (point) of weights w in w-space that represents that geometric shape in x-space. We have therefore implicitly performed a transformation from x-space to w-space.
Transformation to w-space in perceptron algorithm
For the perceptron algorithm, the transformation is performed below:
x-space vs. w-space:
- decision boundary: in x-space it is the hyperplane {z : w⋅z = 0} (all points whose inner product with w is 0); in w-space it is the point w
- a data point: in x-space it is the point x; in w-space it corresponds to the hyperplane {z : x⋅z = 0} (all points whose inner product with x is 0)
Since we want to enforce the inequality x⋅w ≥ 0 (for yi = 1), then
in x-space, x points for yi=1 should be on the same side of {z:w⋅z=0} as w
in w-space, w should be on the same side of {z:w⋅z=0} as x that have yi=1
(Batch) Gradient Descent (GD / BGD)
🔥
Walk downhill by using the “steepest” direction(the gradient)
Complexity: O(nd)
With a risk function R(w), calculate:
∇R(w) = [∂R/∂w₁, ∂R/∂w₂, …, ∂R/∂w_d]ᵀ
Then for each propagating step, run the following:
w←w−ϵ∇R(w)
Where ϵ is the learning rate.
Stochastic Gradient Descent (SGD)
🔥
Instead of GD on the entire batch, do GD on single training data
Complexity: O(d)
With a cost function L(w,Xi), calculate:
∇L(w) = [∂L/∂w₁, ∂L/∂w₂, …, ∂L/∂w_d]ᵀ
Then for each propagating step, run the following:
w←w−ϵ∇L(w)
Where ϵ is the learning rate.
🤔
Note: this is not necessarily the "steepest" descent direction when viewed over the whole batch (it is not guaranteed to work for every problem that GD works for); techniques like RMSProp / momentum can help with this issue.
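A minimal sketch of batch GD vs. SGD on a differentiable risk, illustrated here with mean squared error (the data, learning rates, and step counts are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # n = 100 sample points, d = 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

def grad_R(w, X, y):
    # gradient of the mean-squared-error risk R(w) = (1/n)||Xw - y||^2
    return 2.0 / len(y) * X.T @ (X @ w - y)

# Batch gradient descent: each step costs O(nd).
w = np.zeros(3)
eps = 0.1
for _ in range(500):
    w -= eps * grad_R(w, X, y)

# Stochastic gradient descent: each step costs O(d), using one random point.
w_sgd = np.zeros(3)
for _ in range(5000):
    i = rng.integers(len(y))
    w_sgd -= 0.01 * grad_R(w_sgd, X[i:i+1], y[i:i+1])

print(w, w_sgd)   # both should approach w_true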
Quadratic Form
A way to visualize symmetric matrices. Shows how applying the matrix affects the length of a vector.
Matrix A stretching along the direction of eigenvector (1,1) with eigenvalue 2 and shrinking along the direction of eigenvector (1,-1) with eigenvalue -1/2.
If a symmetric matrix is diagonal:
eigenvectors are coordinate axes
ellipsoids are axis-aligned
A symmetric matrix M is:
Positive Definite (PD)
if ∀w ≠ 0, wᵀMw > 0
or all eigenvalues positive
Positive Semi-Definite (PSD)
if ∀w,w⊤Mw≥0
or all eigenvalues non-negative
Indefinite
if both positive eigenvalue and negative eigenvalue exists
Invertible(non-singular)
if has no eigenvalue of 0
⚠️
Note that a positive definite matrix's quadratic form has only one point with value 0 (the origin), whereas a positive semidefinite (but not definite) matrix's quadratic form has infinitely many points with value 0.
By eigendecomposition, we see that any real symmetric matrix has:
A=VΛV⊤
Where Λ is a diagonal matrix containing the eigenvalues of A and V is an orthogonal matrix whose columns are the corresponding unit eigenvectors of A.
Note that the set {x : ||A⁻¹x|| = 1}, i.e. xᵀA⁻²x = 1, is an ellipsoid whose axes and radii are given by the eigenvectors and eigenvalues of A.
Newton's Method
Iterative optimization method for smooth function J(w)
🔥
You are at point v. Approximate J(w) near v by a quadratic function and jump to that quadratic's critical point:
J(w) ≈ J(v) + ∇J(v)⋅(w − v) + ½(w − v)ᵀ(∇²J(w)|w=v)(w − v)
Where ∇²J(w)|w=v is the Hessian of J evaluated at point v.
The Taylor series has further terms involving ∇³J(w)|w=v and beyond, but those are dropped (they are O(||w − v||³)) since we only approximate with a quadratic function.
By setting the gradient of this quadratic approximation to zero, we find
w = v − (∇²J(w)|w=v)⁻¹ ∇J(w)|w=v
Note that we don’t want to compute a matrix inverse directly. It is faster to solve a linear system of equations, by Cholesky factorization or the conjugate gradient method.
We set e=−(∇2J(v))−1∇J(v), then we just need to solve
(∇2J(v))e=−∇J(v)
Therefore, in conclusion, the whole process of Newton’s method:
Pick starting point w
Repeat until Converge
e← solution to linear system (∇2J(w))e=−∇J(w)
w←w+e
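A minimal Newton's-method sketch that solves the linear system for the step e rather than inverting the Hessian (the test function and starting point are hypothetical):

import numpy as np

def newton(grad, hess, w, tol=1e-8, max_iter=50):
    # repeatedly solve (hess(w)) e = -grad(w) and step w <- w + e
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:
            break
        e = np.linalg.solve(hess(w), -g)   # avoid forming the inverse Hessian
        w = w + e
    return w

# Example: J(w) = (w1 - 1)^4 + (w2 + 2)^2, minimized at (1, -2).
grad = lambda w: np.array([4 * (w[0] - 1) ** 3, 2 * (w[1] + 2)])
hess = lambda w: np.array([[12 * (w[0] - 1) ** 2, 0.0], [0.0, 2.0]])
print(newton(grad, hess, np.array([3.0, 3.0])))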
Bias-Variance Decomposition
Bias
Error due to the inability of the hypothesis h to fit the ground truth, g, perfectly.
e.g. fitting quadratic g with linear h
Variance
Error due to fitting random noise in data
For each model, we have
Xi∼D,ϵi∼D′,yi=g(Xi)+ϵi
Note that D’ has mean 0.
We want to fit a hypothesis h to X, y, where h is a random variable (its weights depend on the random training data)
If we take any arbitrary point (not necessarily a sample point, just any point from the ground-truth distribution) z ∈ Rᵈ and γ = g(z) + ϵ, ϵ ∼ D′,
Then the risk function when the loss is squared error:
R(h) = E[L(h(z), γ)]   (the expectation is taken over the random training data, hence over h, and over γ)
     = E[(h(z) − γ)²]
     = E[h(z)²] + E[γ²] − 2 E[γ h(z)]
     = Var(h(z)) + E[h(z)]² + Var(γ) + E[γ]² − 2 E[γ] E[h(z)]
     = (E[h(z)] − E[γ])² + Var(h(z)) + Var(γ)
     = (E[h(z)] − g(z))² + Var(h(z)) + Var(ϵ)
The first term is the bias², the second term is the variance, and the third term is the irreducible error of the method.
γ and h(z) are independent since h(z) only depends on training data and γ only on label of z.
🔥
We cannot precisely measure bias or variance of real-world data
For many distributions, variance→0 as n→∞
If h can fit g exactly, then bias→0 as n→∞
Adding a good feature reduces bias; adding a bad feature rarely increases it.
Adding a feature usually increases variance (added dimension).
Noise in test set only affects Var(ϵ)
Noise in training set only affects bias&Var(h)
We can test learning algorithms by choosing g and making synthetic data
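A rough sketch of that idea: pick a ground truth g, generate many synthetic training sets, and estimate bias² and variance of the predictions at a fixed test point empirically (the choice of g, noise level, and model are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(2 * x)          # chosen ground truth
sigma_noise = 0.3
z = np.array([0.5])                  # a fixed test point
n, trials = 20, 2000

preds = []
for _ in range(trials):
    X = rng.uniform(-1, 1, size=n)
    y = g(X) + sigma_noise * rng.normal(size=n)
    w = np.polyfit(X, y, deg=1)      # a deliberately biased (linear) hypothesis class
    preds.append(np.polyval(w, z)[0])

preds = np.array(preds)
bias_sq = (preds.mean() - g(z)[0]) ** 2
variance = preds.var()
print(bias_sq, variance, sigma_noise ** 2)   # bias^2, variance, irreducible error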
Feature Selection
We want to identify poorly predictive features and ignore them so that we can:
enjoy better inference speed
less variance
Best Subset Selection
🔥
Try all 2ᵈ − 1 nonempty subsets of features; choose the best classifier by (cross-)validation. Very slow.
Forward Stepwise Selection
Start with null model (0 features)
repeatedly add best feature until validation errors start increasing
At each outer iteration, inner loop tries every feature and chooses the best by validation
Requires training O(d²) models instead of O(2ᵈ)
But
won’t find the best 2-feature model if neither one of those features yields the best 1-feature model.
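A minimal sketch of forward stepwise selection as described above; validation_error is a hypothetical helper standing in for whatever validation scheme is used:

import numpy as np

def validation_error(features, X_train, y_train, X_val, y_val):
    # hypothetical helper: fit least squares on the chosen columns, return validation MSE
    A = X_train[:, features]
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.mean((X_val[:, features] @ w - y_val) ** 2)

def forward_stepwise(X_train, y_train, X_val, y_val):
    chosen, remaining = [], list(range(X_train.shape[1]))
    best_err = np.inf
    while remaining:
        # inner loop: try adding each remaining feature, keep the best one
        errs = [(validation_error(chosen + [j], X_train, y_train, X_val, y_val), j)
                for j in remaining]
        err, j = min(errs)
        if err >= best_err:          # stop when validation error stops improving
            break
        best_err = err
        chosen.append(j)
        remaining.remove(j)
    return chosen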
Backward stepwise selection
Start with all d features
repeatedly remove feature whose removal gives best reduction in validation error
Also trains O(d²) models
Additional heuristic: only try to remove features with small weights
Small relative to the variance of least-square regression
The variance of least-squares regression is proportional to σ²(XᵀX)⁻¹
The z-score of weight wᵢ is zᵢ = wᵢ / (σ√vᵢ), where vᵢ is the ith diagonal entry of (XᵀX)⁻¹
Small z-score hint that the “true” wi could be zero.
Ensemble Learning
🔥
Methods that have low bias, but high variance like decision trees can use this technique to reduce variance
Idea: take an average answer of a bunch of decision trees
Bagging
Same learning algorithm on many random subsamples of one training set.
Works well on most learning algorithms, maybe not k-nearest neighbours?
🔥
Given n-point training sample, generate random subsample of size n’ by sampling with replacement.
If n=n′ then around 63.2% are chosen.
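A minimal bootstrap-subsampling sketch; base_learner is a hypothetical training routine, and averaging the learners' outputs gives the ensemble's prediction:

import numpy as np

def bagging_fit(X, y, base_learner, n_learners=50, rng=None):
    # train n_learners copies of base_learner, each on n points sampled with replacement
    rng = rng or np.random.default_rng()
    n = len(y)
    models = []
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)        # sample n' = n points with replacement
        models.append(base_learner(X[idx], y[idx]))
    return models

# With n' = n, each bootstrap sample contains roughly 63.2% of the distinct points:
rng = np.random.default_rng(0)
n = 10000
print(len(np.unique(rng.integers(0, n, size=n))) / n)   # ~0.632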
Different Learning Algorithms
Different hypothesis can help reduce variance
Random Forests
Discussed in the “Algorithms” section
Adaboost
Similar to Random Forests, but different and more powerful.
Discussed in the “Algorithms” section
Kernel Trick
Kernel Trick is developed from the fact that in many algorithms,
weights can be written as a linear combination of sample points
Therefore we can write w = Σᵢ₌₁ⁿ aᵢXᵢ = Xᵀa, a ∈ Rⁿ
we can use inner products of Φ(x) only instead of directly computing Φ(x).
We will define the kernel function k(x,z)=Φ(x)⋅Φ(z)
Note that different kernel functions can be substituted to obtain different kernels.
The real magic happens when we can compute k(x,z) really quickly.
Kernel Ridge Regression
Remember in Ridge Regression,
(XᵀX + λI′)w = Xᵀy, so w = (XᵀX + λI′)⁻¹Xᵀy
To dualize ridge regression, we need the weights to be a linear combination of the sample points, but that would not happen unless we penalize the bias term wd+1=α.
Therefore we want to center X and y so that the “expected” value of bias is zero
Xi←Xi−μX,yi←yi−μy,Xi,d+1=1
🔥
The actual bias won’t usually be exactly zero, but it will often be close enough that we won’t do too much harm by penalizing it.
Now we have
(X⊤X+λI)w=X⊤y
Suppose a is the solution to the following:
(XX⊤+λI)a=y
Then we see that
X⊤y=X⊤XX⊤a+X⊤λIa=(X⊤X+λI)X⊤a
Since the LHS of the above equation matches the RHS of the original ridge regression normal equations, and the term Xᵀa on the RHS above matches the w term in the original ridge regression solution,
🔥
we've proved that if we set w to Xᵀa, then we have a solution w that is a linear combination of sample points, where a is the solution to our assumed system.
Therefore we want to find a, the dual solution that minimizes the dual form of ridge regression (obtained by substituting in w=X⊤a)
||XXᵀa − y||² + λ||Xᵀa||²
Therefore the entire algorithm:
We will define K=XX⊤ where Kij=k(Xi,Xj)
K is singular if n > d + 1 (and sometimes even if it's not) ⇒ we have to choose a positive λ.
Training:
Solve (XX⊤+λI)a=y for a
Optimized:
∀i,j: Kᵢⱼ ← k(Xᵢ, Xⱼ)   ⟸ O(n²d)
Solve (K + λI)a = y for a   ⟸ O(n³)
Therefore we prefer dual when d>n (when we add a lot of polynomial / rbf ... features)
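A minimal kernel ridge regression sketch following the dual algorithm above; the RBF kernel and λ value are hypothetical choices, and the data are assumed to be centered as described:

import numpy as np

def rbf(a, b, sigma=1.0):
    # pairwise RBF kernel matrix between the rows of a and the rows of b
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam=1e-2, sigma=1.0):
    K = rbf(X, X, sigma)                               # O(n^2 d)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)   # O(n^3)

def kernel_ridge_predict(a, X_train, X_test, sigma=1.0):
    # h(z) = sum_i a_i k(X_i, z)
    return rbf(X_test, X_train, sigma) @ a

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
a = kernel_ridge_fit(X, y)
print(kernel_ridge_predict(a, X, np.array([[0.0], [1.5]])))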
Kernel Perceptrons
Featurized(Primal) Perceptron Algorithm:
Training
w ← y₁Φ(X₁)
while some yᵢ Φ(Xᵢ)⋅w < 0:
    w ← w + ϵ yᵢ Φ(Xᵢ)
Testing
h(z)=w⋅Φ(z)
We will Dualize the problem with w=Φ(X)⊤a
Same as before, Let K be Φ(X)Φ(X)⊤, so Ki,j=k(Xi,Xj)
🔥
Now Φ(Xi)⋅w=(Φ(X)w)i=(Φ(X)Φ(X)⊤a)i=(Ka)i
Training:
a ← [y₁ 0 … 0]ᵀ   (the start point is arbitrary, but cannot be zero)
∀i,j: Kᵢⱼ ← k(Xᵢ, Xⱼ)   ⟸ O(n²d)
while some yᵢ(Ka)ᵢ < 0:
    aᵢ ← aᵢ + ϵ yᵢ   ⟸ O(1) time; update Ka in O(n) time
For SGD, update Ka in O(n) time each iteration instead of computing Ka entirely each time
Testing:
h(z) ← s(Σᵢ₌₁ⁿ aᵢ k(Xᵢ, z))
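A minimal kernel perceptron sketch of the dual training loop above (labels yᵢ ∈ {−1, +1}; the learning rate and iteration cap are hypothetical):

import numpy as np

def kernel_perceptron(K, y, eps=1.0, max_iter=1000):
    # dual perceptron: K[i, j] = k(X_i, X_j), y in {-1, +1}
    n = len(y)
    a = np.zeros(n)
    a[0] = y[0]                      # arbitrary nonzero starting point
    Ka = K @ a                       # maintained incrementally below
    for _ in range(max_iter):
        mis = np.where(y * Ka < 0)[0]
        if len(mis) == 0:
            break                    # all training points classified correctly
        i = mis[0]
        a[i] += eps * y[i]           # O(1) update to a ...
        Ka += eps * y[i] * K[:, i]   # ... and an O(n) update to Ka
    return a

def predict(a, k_row):
    # k_row[i] = k(X_i, z); returns the sign of sum_i a_i k(X_i, z)
    return np.sign(k_row @ a)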
High-Dimensional Data
🔥
Intuition about high-dimensional data is very important
Distribution of random points
Suppose we have a random point p∼N(0,I)∈Rd
The vast majority of the random points are at approximately the same distance from the mean, lying in a thin shell. (Think of the chi-square distribution.)
Angles between random points
cos θ = p⋅q / (||p|| ||q||) = p₁ / ||p||   (WLOG taking q along the first coordinate axis)
E[cos θ] = 0;  Var(cos θ) ≈ 1/d
Therefore as d becomes larger, θ becomes more and more likely to be very close to 90 deg!
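A quick simulation of both facts, that Gaussian points concentrate in a thin shell and that random directions are nearly orthogonal in high dimension (the dimension and sample count are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 5000
P = rng.normal(size=(n, d))

norms = np.linalg.norm(P, axis=1)
print(norms.mean(), norms.std())        # mean ~ sqrt(d) ~ 31.6, small spread (thin shell)

q = rng.normal(size=d)
cos = P @ q / (norms * np.linalg.norm(q))
print(cos.mean(), cos.var(), 1.0 / d)   # mean ~ 0, variance ~ 1/d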
Learning Theory
🔥
If we want to generalize (to things we haven’t learned about), we must constrain what hypothesis we allow our learner to consider.
Range space (set system): a pair (P, H) where
P is the set of all possible test/training points
H is the set of hypotheses(ranges/classifiers), each hypothesis states which subset of P (h⊆P) are in class C
Examples of classifiers:
Power set classifier: P is a set of k numbers, H is the power set of P, containing all 2ᵏ subsets of P
This classifier cannot generalize at all: given any subset of P as training data, it can fit many classifiers, and we wouldn't know which classifier would generalize well to the data we haven't seen unless we see that data.
Linear classifier: P = Rᵈ, H is the set of all halfspaces: {x : w⋅x ≥ −α} for all choices of w, α
A great question to ask about a certain class of hypotheses would be:
🔥
How well does the training error predict the test error?
Let all training points and test points be drawn independently from a probability distribution D defined on the domain P, then
Suppose our hypothesis is h ∈ H,
the risk / generalization error R(h) of h is the probability that h misclassifies a random point x drawn from D ⇒ the average test error
the empirical risk / training error R̂(h) is the percentage of training points misclassified by h
Hoeffding’s inequality tells us probability of bad estimate:
“How likely that a number drawn from a binomial distribution will be far from its mean”
P(|R̂(h) − R(h)| > ϵ) ≤ 2e^(−2ϵ²n)
So... we want to choose ĥ ∈ H that minimizes R̂(ĥ), a technique called empirical risk minimization.
However, it's often computationally infeasible to pick the truly best hypothesis ⇒ SVMs find a linear classifier with zero training error when the training data are linearly separable, but when they aren't, SVMs find a linear classifier with low training error, not necessarily the minimum.
🔥
It's okay to have infinitely many hypotheses, but it's not okay to have too many dichotomies
Dichotomies
A dichotomy of X is X∩h, where h∈H
It picks out the training points that h predicts are in class C
There are up to 2ⁿ dichotomies. The more dichotomies, the more likely it is that one of them will get lucky and have misleadingly low empirical risk
Given Π dichotomies,
P(at least one dichotomy has |R̂ − R| > ϵ) ≤ δ, where δ = 2Π e^(−2ϵ²n)
Fixing a value of δ and solving for ϵ: ∀h ∈ H, |R̂ − R| ≤ ϵ = sqrt((1/(2n)) ln(2Π/δ))
Therefore, the smaller we make Π and the bigger we make n, the more accurate we can predict the risk based on the empirical risk
Suppose we chose h^ as our hypothesis, then with probability ≥1−δ,
R(h^)≤R^(h^)+ϵ≤R^(h∗)+ϵ≤R(h∗)+2ϵ
Where
ϵ = sqrt((1/(2n)) ln(2Π/δ))
and h∗ being the optimal hypothesis in the class H
Therefore, with enough training data and a limit on the number of dichotomies, empirical risk minimization usually chooses a classifier close to the best one in the hypothesis class.
After choosing δ and ϵ,
the sample complexity is the number of training points needed to achieve this ϵ with probability ≥1−δ:
n ≥ (1/(2ϵ²)) ln(2Π/δ)
Shatter Function
number of dichotomies: Π_H(X) = |{X ∩ h : h ∈ H}| ∈ [1, 2ⁿ], where n = |X|
shatter function: Π_H(n) = max over |X| = n, X ⊆ P of Π_H(X) ⇒ the max number of dichotomies of any training set of size n
It's called a shatter function because H shatters a set X of n points if Π_H(X) = 2ⁿ.
e.g. Linear classifiers in plane, H = set of all halfplanes, ΠH(3)=8
Fact: for all range spaces, Π_H(n) is either O(nᵖ) for some constant p, or 2ⁿ
Cover’s theorem
linear classifiers in Rᵈ allow up to Π_H(n) = 2 Σᵢ₌₀ᵈ (n−1 choose i) dichotomies of n points
For n ≤ d + 1, Π_H(n) = 2ⁿ
For n ≥ d + 1, Π_H(n) ≤ 2(e(n−1)/d)ᵈ ⟸ polynomial in n with exponent d
🔥
Linear Classifiers need only n∈O(d) training points for the training error to predict the test error
VC Dimension (Vapnik-Chervonenkis Dimension)
VC(H) = max{n : Π_H(n) = 2ⁿ} ⟸ can be ∞
VC dimension is the bound for polynomial shatter function
Theorem:
Π_H(n) ≤ Σᵢ₌₀^VC(H) (n choose i)
Therefore,
Π_H(n) ≤ (e·n / VC(H))^VC(H)   (for n ≥ VC(H))
Also: O(VC(H)) training points suffice for accuracy (hidden constant is big though)
Algorithms
Perceptron Algorithm
🤔
Augment the data with an extra feature fixed to 1 so that one weight can offset the separating hyperplane from the origin.
If w classifies all X1,…,Xn correctly, then R(w)=0.
Optimizer:
Gradient Descent, SGD.
Both are guaranteed to work if the data are linearly separable.
Guaranteed to classify all the data correctly after at most O(r²/γ²) iterations, where r = max ||Xᵢ|| is the radius of the data and γ is the maximum margin (defined in Max-Margin Classifier).
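A minimal primal perceptron sketch using the fictitious-dimension trick from the note above (labels in {−1, +1}; the learning rate and iteration cap are hypothetical):

import numpy as np

def perceptron(X, y, eps=1.0, max_iter=10000):
    # X: n x d data, y in {-1, +1}; returns an augmented weight vector of length d+1
    Xa = np.hstack([X, np.ones((len(X), 1))])   # append the fixed 1 feature
    w = y[0] * Xa[0].copy()                     # start at y1 * X1, as in the notes
    for _ in range(max_iter):
        mis = np.where(y * (Xa @ w) <= 0)[0]    # misclassified points
        if len(mis) == 0:
            break                               # linearly separated: R(w) = 0
        i = mis[0]
        w += eps * y[i] * Xa[i]                 # SGD step on one misclassified point
    return w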
Hard-Margin Support Vector Machine (Hard-Margin SVM)
The problem with the perceptron algorithm is that it can fit infinitely many hyperplanes to the data, since it does not constrain the weight vector beyond classifying the training data correctly (it can fit the training data 100% but may not generalize well to the validation or test set). This may lead to increased variance.
🔥
One idea that generalizes well to the validation / test set is to find the unique boundary that maximizes the margin between the two classes in the training set.
We will enforce the constraints:
yi(w⋅Xi+α)≥1 for i∈[1,n]
Note that the RHS is 1, which makes it impossible for the weight vector w to be set to zero.
🔥
Note: the signed distance from the hyperplane to Xᵢ is (w⋅Xᵢ + α) / ||w||
Therefore the margin is minᵢ (1/||w||)|w⋅Xᵢ + α| ≥ 1/||w||
In order to maximize the margin, we can instead minimize ||w||. (At the optimal solution, the margin is exactly 1/||w|| because at least one constraint holds with equality.)
Therefore the optimization problem:
Find w and α that minimize ||w||²  s.t. ∀i ∈ [1, n], yᵢ(Xᵢ⋅w + α) ≥ 1
It's a quadratic program in d + 1 dimensions with n constraints, and it has a unique solution (if the points are linearly separable).
The reason we don't minimize ||w|| directly is that it is not smooth at w = 0, whereas ||w||² is smooth everywhere, making it easier to optimize.
Left: SVM in 3D w-space(w1,w2,alpha), Right: 2D Cross-section at w1=1/17.
The SVM is trying to find a solution that lies in the pocket that is as close to the origin as possible.
Soft-Margin SVM
🔥
Hard-margin SVMs fail if the data are not linearly separable & are sensitive to outliers.
Example where one outlier moves the entire boundary of the hard-margin SVM a lot
Idea: Allow some points to violate the margin with slack variables ξi≥0.
yi(Xi⋅w+α)≥1−ξi
Now we re-define the margin to be 1/∣∣w∣∣, since the margin is no longer the distance from the decision boundary to the nearest sample point.
🔥
We also want to add a term to the loss function that penalizes the abuse of slack variable.
Therefore the optimization problem:
Find w, α, ξᵢ that minimize ||w||² + C Σᵢ₌₁ⁿ ξᵢ
s.t. ∀i, yᵢ(Xᵢ⋅w + α) ≥ 1 − ξᵢ
     ∀i, ξᵢ ≥ 0
C>0 is a scalar regularization hyperparameter that trades off
small C: desire = maximize the margin 1/||w||; danger = underfitting; outliers = less sensitive; boundary = more "flat"
big C: desire = keep most slack variables zero or small; danger = overfitting; outliers = very sensitive; boundary = more sinuous
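The problem above is a quadratic program; as a rough illustration only, the equivalent unconstrained hinge-loss form ||w||² + C Σᵢ max(0, 1 − yᵢ(Xᵢ⋅w + α)) (obtained by plugging in the optimal slack ξᵢ) can be minimized by subgradient descent. This substitutes a simpler solver for a real QP solver; the step size and iteration count are hypothetical:

import numpy as np

def soft_margin_svm(X, y, C=1.0, step=1e-3, iters=5000):
    # subgradient descent on ||w||^2 + C * sum_i max(0, 1 - y_i (X_i . w + alpha))
    n, d = X.shape
    w, alpha = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + alpha)
        viol = margins < 1                       # points whose slack would be positive
        grad_w = 2 * w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_a = -C * y[viol].sum()
        w -= step * grad_w
        alpha -= step * grad_a
    return w, alpha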
Gaussian Discriminant Analysis(GDA)
⚠️
We assume that each class comes from normal distribution (Gaussian)
For each class, estimate mean μc, and prior πc=P(Y=C)
Notation of σ might be different between QDA and LDA.
Isotropic Gaussians
X ∼ N(μ, σ²):  f(x) = (1 / (√(2π) σ)ᵈ) exp(−||x − μ||² / (2σ²))
Anisotropic Gaussians(Different variances along different directions)
X ∼ N(μ, Σ):  f(x) = (1 / sqrt((2π)ᵈ |Σ|)) exp(−½ (x − μ)ᵀΣ⁻¹(x − μ))
Where Σ is the d×d positive definite (hence invertible) covariance matrix.
Let q(x) = (x − μ)ᵀΣ⁻¹(x − μ); then f(x) = n(q(x)), where n(⋅) is a simple, monotonic, convex function that is an exponential of the negative of its argument. n(⋅) does not affect the isosurfaces.
q(x) is the squared distance from Σ−1/2x to Σ−1/2μ.
d(x, μ) = ||Σ^(−1/2)x − Σ^(−1/2)μ|| = sqrt((x − μ)ᵀΣ⁻¹(x − μ)) = sqrt(q(x))
Important fact: if you understand the isosurfaces of a quadratic form, then you understand the isosurfaces of a Gaussian; the only difference is that the maximum isovalue is at the mean of the Gaussian.
Suppose only two classes C, D
Bayes decision rule r*(x) predicts the class C that maximizes the posterior P(Y=C | X=x) ∝ f_C(x) π_C
Since ln(ω) is monotonically increasing for ω > 0,
argmax_C P(Y=C | X=x) = argmax_C Q_C(x), where Q_C(x) = ln((√(2π))ᵈ f_C(x) π_C)
Q_C(x) = −||x − μ_C||² / (2σ_C²) − d ln σ_C + ln π_C   (isotropic case)
Q_C(x) = −½(x − μ_C)ᵀΣ_C⁻¹(x − μ_C) − ½ ln|Σ_C| + ln π_C   (anisotropic case)
We don't know μ exactly, so we substitute the estimate μ̂ for μ when computing σ̂².
Quadratic Discriminant Analysis (QDA)
⚠️
Assume each class has different variance Σc or σc2
compute the conditional mean μ̂_C and conditional variance σ̂_C² or Σ̂_C of each class C separately.
And estimate the priors π̂_C = n_C / n (the number of points in class C divided by the total number of sample points)
σ̂_C² = (1/(d n_C)) Σ_{i: yᵢ=C} ||Xᵢ − μ̂_C||²
Σ̂_C = (1/n_C) Σ_{i: yᵢ=C} (Xᵢ − μ̂_C)(Xᵢ − μ̂_C)ᵀ
Note: if the covariance matrix has zero eigenvalues, standard QDA doesn't work. We can fix this by adding a regularization term λI to the matrix Λ, where Σ̂_C = VΛVᵀ, and use Σ̃_C = V(Λ + λI)Vᵀ as the new covariance matrix. We want to keep λ relatively small so it doesn't distort the covariance matrix too much.
Note: for two classes, the posterior is P(Y=C | X=x) = s(Q_C(x) − Q_D(x)), where s(⋅) is the sigmoid (logistic) function.
Linear Discriminant Analysis (LDA)
⚠️
Assume each class has same variance Σ or σ2
compute the conditional mean μ̂_C of each class C separately.
Assuming the same variance across classes eliminates the quadratic term, making the decision function linear.
σ̂² = (1/(dn)) Σ_C Σ_{i: yᵢ=C} ||Xᵢ − μ̂_C||²
Σ̂ = (1/n) Σ_C Σ_{i: yᵢ=C} (Xᵢ − μ̂_C)(Xᵢ − μ̂_C)ᵀ
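A minimal sketch of the QDA parameter estimates above and the resulting discriminant (integer class labels assumed; no regularization of Σ̂_C is included):

import numpy as np

def fit_qda(X, y):
    # estimate prior, mean, and per-class covariance for each class
    params = {}
    n = len(y)
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma = (Xc - mu).T @ (Xc - mu) / len(Xc)
        params[c] = (len(Xc) / n, mu, Sigma)
    return params

def qda_predict(params, x):
    # pick the class maximizing Q_C(x) = -(x-mu)^T Sigma^-1 (x-mu)/2 - ln|Sigma|/2 + ln(pi_C)
    best_c, best_q = None, -np.inf
    for c, (prior, mu, Sigma) in params.items():
        diff = x - mu
        q = (-0.5 * diff @ np.linalg.solve(Sigma, diff)
             - 0.5 * np.linalg.slogdet(Sigma)[1] + np.log(prior))
        if q > best_q:
            best_c, best_q = c, q
    return best_c

For LDA, the per-class Σ̂_C would be replaced by the single pooled Σ̂ defined above.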
Regression
Need:
Regression function h(x∣p), p is parameter
Cost function to optimize
Regression Functions.
(1) Linear: h(x|w, α) = w⋅x + α
(2) Polynomial [equivalent to linear with added polynomial features]
(3) Logistic: h(x|w, α) = s(w⋅x + α) ⇒ a neural net with no hidden layer, one output node, and a sigmoid activation. Only separates linearly separable classes.
Loss Functions:
(A) Squared Error L(ŷ, y) = (ŷ − y)²
(B) Absolute Error L(y^,y)=∣y^−y∣
(C) Logistic Loss (Cross-entropy loss) L(y^,y)=−yln(y^)−(1−y)ln(1−y^)
Cost Functions to Optimize:
(a) Mean Loss J(h) = (1/n) Σᵢ₌₁ⁿ L(h(Xᵢ), yᵢ)
(b) Maximum Loss J(h) = maxᵢ L(h(Xᵢ), yᵢ)
(c) Weighted Sum J(h) = Σᵢ₌₁ⁿ ωᵢ L(h(Xᵢ), yᵢ)
(d) L2 Penalized J(h) = J₀(h) + λ||w||², where J₀(h) is one of (a), (b), or (c)
(e) L1 Penalized J(h) = J₀(h) + λ||w||₁
Least Squares Linear Regression
(1) Linear Function + (A) Squared Error + (a) Mean Loss
🔥
We will augment the data matrix with a column of 1s so that the last weight w_{d+1} serves as the offset α of the hyperplane from the origin.
By calculus,
X⊤Xw=X⊤y
However if X⊤X is singular, then the problem is underconstrained (infinitely many solutions for w).
w=(X⊤X)−1X⊤y
XᵀX is always PSD; it is singular (and the problem underconstrained) when the sample points lie on a common hyperplane. If XᵀX is singular, we can use its pseudoinverse to compute w.
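A minimal least-squares sketch using the augmented design matrix; np.linalg.lstsq solves via the pseudoinverse (SVD), so it also handles the singular case mentioned above:

import numpy as np

def least_squares(X, y):
    # solve the normal equations X^T X w = X^T y with a fictitious 1s column for the bias
    Xa = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xa, y, rcond=None)   # pseudoinverse-based
    return w                                      # last entry is the bias alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -1.0]) + 0.5 + 0.1 * rng.normal(size=50)
print(least_squares(X, y))   # approximately [3, -1, 0.5]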
Logistic Regression
(3) Logistic Function + (C) Logistic Loss + (a) Mean Loss
Solved by Gradient Descent
🔥
Usually used for classification. The input yi can be probabilities, but usually they’re all 0 or 1.
In LDA, the posterior probabilities are often modeled well by a logistic function. So why not just try to fit a logistic function directly to the data, skipping Gaussian modeling?
Logistic regression ALWAYS separates linearly separable points (as we scale w to infinite length, s(Xᵢ⋅w) → 1 for points in class C, s(Xᵢ⋅w) → 0 for points not in class C, and J(w) → 0)
A 2018 paper by Soudry, Hoffer, Nacson, Gunasekar, and Srebro shows that gradient descent applied to logistic regression eventually converges to maximum margin classifier
🔥
Compared to LDA,
1. LDA is stable for well-separated classes; logistic regression is surprisingly unstable.
2. LDA handles more than 2 classes elegantly; logistic regression needs to be modified (softmax regression).
3. LDA is slightly more accurate when the classes are nearly normal.
4. Logistic regression puts more emphasis on the decision boundary / always separates linearly separable points.
5. Misclassified points far from the decision boundary have the biggest effect in logistic regression, whereas every point has the same weight in LDA.
6. Logistic regression more robust on some non-Gaussian distributions (distributions with large skew)
7. Logistic Regression naturally fits labels between 0 and 1.
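A minimal logistic regression sketch trained by batch gradient descent on the mean cross-entropy loss (labels yᵢ ∈ [0, 1]; the learning rate and iteration count are hypothetical):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression(X, y, eps=0.5, iters=2000):
    # gradient descent on J(w) = -(1/n) sum_i [y_i ln s_i + (1 - y_i) ln(1 - s_i)]
    Xa = np.hstack([X, np.ones((len(X), 1))])   # fictitious dimension for the bias
    w = np.zeros(Xa.shape[1])
    for _ in range(iters):
        s = sigmoid(Xa @ w)
        grad = Xa.T @ (s - y) / len(y)          # gradient of the mean cross-entropy
        w -= eps * grad
    return w

def predict_proba(w, X):
    Xa = np.hstack([X, np.ones((len(X), 1))])
    return sigmoid(Xa @ w)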
Least-Squares Polynomial Regression
Replace each Xi with all terms of degree 0...p
Φ(Xᵢ) = [X²ᵢ₁  Xᵢ₁Xᵢ₂  X²ᵢ₂  Xᵢ₁  Xᵢ₂  1]ᵀ   (example for d = 2, degree p = 2)
Not numerically stable... higher-degree polynomials tend to oscillate.
Developed from MLE for multivariate normal distributions, see lecture 12 for detail.
Weighted Least Squares Regression
(1) Linear Function + (A) Squared Loss + (c) Weighted Sum Cost
🔥
Some points might be more trusted than others, or there might be certain points that we want to fit particularly well.
Collect the weights ωᵢ in an n×n diagonal matrix Ω.
Find w that minimizes (Xw − y)ᵀΩ(Xw − y) = Σᵢ₌₁ⁿ ωᵢ(Xᵢ⋅w − yᵢ)²
Solve for w in normal equations:
X⊤ΩXw=X⊤Ωy
Ridge Regression
(1) Linear Function + (A) Squared Loss + (d - a) L2 Penalized Mean Loss
🔥
Adds a regularization (penalty) term for shrinkage: to encourage a small ||w′||
Guarantees positive definite normal equations
Always unique solution
Left: a cost function with many minima; the problem is ill-posed.
After setting ∇J=0,
(X⊤X+λI′)w=X⊤y
Where I′ is the identity matrix with its last diagonal entry set to zero (so we don't penalize the bias term α)
🔥
Ideally, features should be “normalized” to have the same variance.
Or use an asymmetric penalty by replacing I′ with another diagonal matrix.
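A minimal ridge regression sketch that builds I′ explicitly so the bias term isn't penalized (λ is a hypothetical value chosen by validation):

import numpy as np

def ridge_regression(X, y, lam=1.0):
    # solve (X^T X + lam * I') w = X^T y, where I' does not penalize the bias
    Xa = np.hstack([X, np.ones((len(X), 1))])
    d1 = Xa.shape[1]
    I_prime = np.eye(d1)
    I_prime[-1, -1] = 0.0                  # last diagonal entry zero: alpha unpenalized
    return np.linalg.solve(Xa.T @ Xa + lam * I_prime, Xa.T @ y)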
Lasso Regression
(1) Linear Function + (A) Squared Loss + (e - a) L1 Penalized Mean Loss
Similar to ridge regression, but has the advantage that it often naturally sets some of the weights to zero (not always; because of the geometry of the L1 penalty the weights are more likely to land exactly on 0).
The isosurfaces of ||w′||₁ are cross-polytopes (we write w′ because we don't want to penalize α).
The unit cross-polytope is the convex hull (the smallest convex set containing all the points) of all the positive and negative unit coordinate vectors.
A typical isosurface of ||w′||₁
Compared to ridge regression, the solution given by lasso is more likely to be sparse, because the isosurfaces of the regularization term and the isosurfaces of the squared error are more likely to meet at one of the unit coordinate directions.
Lasso can be reformulated as a quadratic program; subgradient descent, least-angle regression (LARS), and forward stagewise algorithms can be used to solve it.
Decision Trees
Still widely used; fast, interpretable, invariant to scaling/translation, robust to irrelevant features, and can achieve 100% training accuracy, but prone to overfitting. Random forests and AdaBoost are introduced to reduce some of the overfitting.
Cuts x-space into rectangular cells
Works well with both categorical(split by subset) and quantitative(split by boundary) features
Interpretable result
Decision boundary can be arbitrarily complicated
Learning:
def GrowTree(S, depth=0):
    # S: the set of training points that reach this node
    if all points in S have the same class or depth >= MAX_DEPTH:
        return the most common class in S    # or the class posteriors (see note below)
    else:
        # try every split; choose the feature j and splitting value (or subset) beta
        # that maximize information gain (defined below)
        j, beta = choose_best_split(S)
        S_left, S_right = partition(S, j, beta)   # filter points into the two children
        return Node(j, beta, GrowTree(S_left, depth + 1), GrowTree(S_right, depth + 1))
⚠️
We may want to stop early (max depth, stop when 95% points same class)...
1. Most of the node’s points have same class (deal with outliers)
2. Complete tree may overfit (cells edge too tiny)
📌
Instead of returning labels, we can also return the posterior probability of each class, P(Y=C | node v). This is quite reasonable if there are many points in each leaf.
How to select best split:
Try all splits
Choose split that maximizes information gain H(S)−Hafter
Where H(S) is the entropy of an index set S ⇒ the average surprise
The surprise of Y being in class C is −log₂(p_C)
an event with probability 1 gives us zero surprise
an event with probability 0 gives us infinite surprise
The entropy of an index set S is the average surprise
H(S) = −Σ_C p_C log₂(p_C),  p_C = |{i ∈ S : yᵢ = C}| / |S|
If all points in S belong to the same class ⇒ H(S) = −1·log₂(1) = 0
Half class C, half class D ⇒ H(S) = −0.5 log₂(0.5) − 0.5 log₂(0.5) = 1
n points, all in different classes ⇒ H(S) = −log₂(1/n) = log₂(n)
Where Hafter is the weighted average entropy after the split
H_after = (|S_l| H(S_l) + |S_r| H(S_r)) / (|S_l| + |S_r|)
🔥
The entropy function we defined is strictly concave. This is important because we want the interior (linear combination of left+right splits) to be strictly below the curve.
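A minimal sketch of the entropy and information-gain computation used to pick a split:

import numpy as np

def entropy(labels):
    # H(S) = -sum_C p_C log2 p_C, the average surprise of the labels in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(labels, left_mask):
    # H(S) - H_after for a split sending left_mask points to the left child
    left, right = labels[left_mask], labels[~left_mask]
    n_l, n_r = len(left), len(right)
    h_after = (n_l * entropy(left) + n_r * entropy(right)) / (n_l + n_r)
    return entropy(labels) - h_after

y = np.array([0, 0, 0, 1, 1, 1])
print(info_gain(y, np.array([True, True, True, False, False, False])))   # 1.0 (perfect split)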
Multivariate splits
🔥
Better classifier at cost of worse interpretability or speed
Standard decision trees are fast because they check only one feature at each node. With multivariate splits, every node may look at many (possibly hundreds of) features at once, which slows down classification a lot.
Pruning
Grow tree
greedily remove each split whose removal improves validation performance
More reliable than stopping early
📌
The reason why pruning often works better is that a split that doesn't make much progress is often followed by a split that makes a lot of progress.
Decision Tree Regression
Uses the concept of decision tree
Creates a piecewise constant regression function
Cost:
J(S) = (1/|S|) Σ_{i∈S} (yᵢ − μ_S)²
Where μ_S is the mean label yᵢ over the sample points i ∈ S
🔥
We choose split that minimizes the weighted average of the costs of the children after the split, instead of minimizing entropy.
Random Forests
🔥
Random Sampling (Bagging) isn’t enough since with bagging, the decision trees look very similar
because the first level always chooses the best feature that best “splits” the data.
Idea:
At each treenode, only allowed to split on m features (out of d)
not allowed to split on the other d-m features
m ≈ √d works well for classification, and m ≈ d/3 for regression
m is a hyperparameter
Pros:
Sometimes gives a test error reduction of up to 100× or 1000× compared with plain decision trees
Cons:
Slow
Loses interpretability
Variant: Multivariate split random forest
Left: Axis-aligned splits, middle: splits with lines & arbitrary rotations, right: splits with conic sections
🔥
Special note: For axis-aligned splits Random Forests, the decision boundaries are still segments of linear boundaries since Random Forests are just linear combinations of Decision Trees.
Neural Networks
If the weights are initialized to zero, there's no difference between one hidden unit and another - they all compute the same thing
Break symmetry by weight initialization
Diminishing/Exploding Gradient ⇒ Gradient fail to pass through a network too deep / Gradient being too big
Gradient Clipping(for exploding gradient)
BCE loss for the last layer
Skip connections
Centering the hidden units
replace sigmoids with tanh
Whitening Training Data (to gain a great bowl shape in w-space)
BatchNorm layer
If dimension of data is big, then second-order optimization is not applicable
Newton’s method required too much computation power for the Hessian
Alternative - Stochastic Levenberg Marquardt: approximates a diagonal Hessian
To avoid overfitting:
Random init weights
Bagging
L2 Regularization
Dropout
For CNN, see my notes for Andrew Ng.
Principal Component Analysis (PCA)
3D points projected to 2D by PCA
A technique for dimensionality reduction (a subset of unsupervised learning)
🔥
Find k directions that capture most of the variation
Before doing PCA, make sure the data is centered.
Normalize the data? Sometimes yes, sometimes no
If some features are much larger than others, they will tend to dominate the Euclidean distance. So if we have features in different units of measurement we should probably normalize. But if in the same unit of measurement, it depends on the context (usually shouldn’t)
Reason for PCA:
Speed - Reducing number of dimensions makes some computations cheaper
Variance Reduction - Remove irrelevant dimensions to reduce overfitting
Like subset selection, but features are not axis-aligned
instead linear combinations of input features
Finding a small basis for representing maximum variations in complex things
Minimizes the mean squared projection distance
Projection: Orthogonal projection of point x onto vector w is
proj_w(x) = (x⋅w)w                 if w is a unit vector (||w|| = 1)
          = ((x⋅w)/||w||²) w       otherwise
We want to pick the best k directions, span{v1,v2,…,vk} to project our inputs onto.
The Singular Value Decomposition of X,
X=UΛV⊤
gives us the best basis to project onto.
Note: Here, U,V are orthonormal matrices with column vectors and Λ is a diagonal matrix with singular value σis.
Columns of U are unit eigenvectors of XXᵀ; columns of V (rows of Vᵀ) are unit eigenvectors of XᵀX. The singular values σᵢ are the square roots (σᵢ = √λᵢ) of the eigenvalues of both XXᵀ and XᵀX.
We can verify that by...
XᵀX = VΛᵀUᵀUΛVᵀ = VΛ²Vᵀ
XXᵀ = UΛVᵀVΛᵀUᵀ = UΛ²Uᵀ
Oh yes, now we’ve expressed those two symmetric matrices as their eigen-decomposition form.
🔥
If the data in X are stored row by row, then the principal components of X are the rows of Vᵀ.
If the data in X are stored column by column, then the principal components of X are the columns of U.
We pick the first k principal components by taking the components with the k largest singular values.
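A minimal PCA sketch via SVD, with X stored row by row (so the principal components are rows of Vᵀ); the value of k is a hypothetical choice:

import numpy as np

def pca(X, k):
    # project centered data onto the top-k principal components via SVD
    Xc = X - X.mean(axis=0)                    # center the data first
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                        # rows of V^T with the k largest singular values
    return Xc @ components.T, components       # projected coordinates, principal directions

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])
Z, V = pca(X, k=2)
print(Z.shape, V.shape)    # (200, 2) (2, 3)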
Clustering
🔥
Partition data into clusters so points in a cluster are more similar than across clusters
K-Means Clustering
Normalize the data? Sometimes yes, sometimes no
If some features are much larger than others, they will tend to dominate the Euclidean distance. So if we have features in different units of measurement we should probably normalize. But if in the same unit of measurement, it depends on the context (usually shouldn’t)
Partition n points into k disjoint clusters
Assign each input point Xi a cluster label yi∈[1,k]
Cluster i has mean μᵢ = (1/nᵢ) Σ_{yⱼ=i} Xⱼ, given nᵢ points in cluster i.
Find y that minimizes Σᵢ₌₁ᵏ Σ_{yⱼ=i} ||Xⱼ − μᵢ||²
NP-hard; trying every partition ⇒ O(nkⁿ)
Lloyd’s Algorithm (Finds a Local Minimum on the Cost Function)
🔥
Each alternating step must not increase the cost function; if a step changes nothing, the objective cannot improve further and the algorithm should terminate.
First, start with either
Forgy method: choose k random sample points to be initial μi ⇒ goto 2
Random partition: randomly assign each sample point to a cluster ⇒ goto 1
k-means ++: Like Forgy, but biased distribution so that each center is chosen with a preference for points far from previous centers.
Alternates between
fixing yj, updating μi
Optimal μi is the mean of points in cluster i by calculus
fixing μi, updating yj
The optimal y assigns each point Xⱼ to the closest center μᵢ
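A minimal Lloyd's-algorithm sketch with Forgy initialization (k and the iteration cap are hypothetical; k-means++ initialization would only change the first step):

import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    rng = rng or np.random.default_rng()
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)   # Forgy init
    y = np.zeros(len(X), dtype=int)
    for it in range(max_iter):
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        new_y = dists.argmin(axis=1)                 # fix mu, update assignments
        if it > 0 and np.array_equal(new_y, y):
            break                                    # nothing changed: local minimum reached
        y = new_y
        for i in range(k):                           # fix assignments, update means
            if np.any(y == i):
                mu[i] = X[y == i].mean(axis=0)
    return y, mu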
Hierarchical Clustering
agglomerative clustering
From bottom up
Start with each point as its own cluster; repeatedly fuse pairs of clusters
divisive clustering
From top down
Start with all points in one cluster; repeatedly split it
We need a distance function between clusters A,B
complete linkage d(A,B)=max{d(w,x):w∈A,x∈B}
single linkage d(A,B)=min{d(w,x):w∈A,x∈B}
average linkage d(A,B) = (1/(|A||B|)) Σ_{w∈A} Σ_{x∈B} d(w,x)
centroid linkage d(A,B)=d(μA,μB) where μs is mean of S
Visualization of a hierarchical tree using a dendrogram: the vertical axis encodes the linkage distances.
Comparison of the three linkages: complete linkage gives the best-balanced dendrogram, whereas single linkage gives a very unbalanced dendrogram that is sensitive to outliers.
Random Projection
An alternative to PCA as preprocess for clustering, classification, regression
Sometimes preserves distance better than PCA
best when projecting a very high-dim space to a medium-dim space
We pick a small ϵ, a small δ, and a random subspace S ⊂ Rᵈ of dimension k = ⌈2 ln(1/δ) / (ϵ²/2 − ϵ³/3)⌉
Subspace S can be obtained by choosing k arbitrary directions and use Gram-Schmidt to orthonormalize them.
We project any point q to q̂ = √(d/k) ⋅ proj_S(q)
The constant √(d/k) rescales the projection to approximately preserve distances.
Johnson-Lindenstrauss Lemma(modified)
For any two points, q,w∈Rd
(1 − ϵ)||q − w||² ≤ ||q̂ − ŵ||² ≤ (1 + ϵ)||q − w||²
with probability ≥ 1 − 2δ
🔥
Typical values:
ϵ∈[0.02,0.5], δ∈[1/n3,0.05]
⚠️
The squared distance between two points after projection might change by 2% - 50%.
For all point distances to be accurate, we need δ < 1/n², which requires a subspace of dimension Θ(log n).
Reducing δ doesn’t cost much, but reducing ϵ costs more.
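A minimal random-projection sketch following the recipe above: pick k from ϵ and δ, orthonormalize k random directions (np.linalg.qr performs the Gram-Schmidt step), and rescale by √(d/k). The ϵ, δ, and data shapes are hypothetical:

import numpy as np

def random_projection(X, eps=0.1, delta=1e-3, rng=None):
    # project the rows of X onto a random k-dim subspace, scaled to preserve distances
    rng = rng or np.random.default_rng()
    n, d = X.shape
    k = int(np.ceil(2 * np.log(1 / delta) / (eps ** 2 / 2 - eps ** 3 / 3)))
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # k orthonormal random directions
    return np.sqrt(d / k) * X @ Q

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10000))
Xp = random_projection(X, eps=0.2, rng=rng)
orig = np.sum((X[0] - X[1]) ** 2)
proj = np.sum((Xp[0] - Xp[1]) ** 2)
print(Xp.shape[1], proj / orig)     # ratio should lie roughly within [0.8, 1.2]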
k-Nearest Neighbour
🔥
Idea: When querying a point q, find k nearest points of q and return the “average” label returned by those nearest points.
As k becomes larger, the decision boundary is smoother.
Depending on dimension,
2-5 dimensions: Voronoi diagrams
Medium dimension (up to ~30): k-d trees
Large dim
exhaustive k-NN with PCA or random projection
locality sensitive hashing
Exhaustive k-NN Algorithm:
Given query point q:
Scan through all n sample points, computing the (squared) distance to q
Maintain a max-heap with the k shortest distances seen so far
When we encounter a sample point closer to q than the point at the top of the heap, remove the top of the heap and insert the new point.
Time to train: O(0)
Time to query: O(nd+nlog(k)), expected Θ(nd+klog(n)log(k)) if random point order
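A minimal exhaustive k-NN sketch using Python's heapq as the max-heap (distances are negated so the farthest kept point sits at the top), matching the query procedure above:

import heapq
import numpy as np

def knn_query(X, y, q, k=3):
    # return the k training labels nearest to query q, via a max-heap of size k
    heap = []                                    # stores (-squared_distance, label)
    for xi, yi in zip(X, y):
        d2 = np.sum((xi - q) ** 2)
        if len(heap) < k:
            heapq.heappush(heap, (-d2, yi))
        elif -d2 > heap[0][0]:                   # closer than the farthest point kept so far
            heapq.heapreplace(heap, (-d2, yi))
    return [label for _, label in heap]

# classify q by, e.g., a majority vote over the returned labels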
Voronoi Diagrams
🔥
The original VD only supports 1-NN; there are modified order-k Voronoi diagrams, but nobody uses those because the size gets really bad (for 2D it is O(k²n)).
Let X be a point set. The Voronoi cell (always a convex polyhedron or polytope) of w ∈ X is
Vor_w = {p ∈ Rᵈ : ||p − w|| ≤ ||p − v|| ∀v ∈ X}
The Voronoi diagram of X is the set of X's Voronoi cells.
Size (# of vertices) ∈ O(n^⌈d/2⌉)
2D ⇒ O(n log n) time to compute the diagram and a trapezoidal map for point location; O(log n) query time
dD ⇒ Use binary space partition tree (BSP tree) for point location
k-d Trees
We want each cut box to be as close to cubical as possible
Or we can instead rotate through the features, which builds the tree faster by a factor of O(d)
Choose splitting value ⇒ median point for feature i
Guarantees ⌊log₂ n⌋ tree depth
Each internal node stores a sample point ⇒ usually its the splitting point
Build time: O(nd log n), or O(n log n) if we rotate through the features
Query:
ϵ ⇒ parameter for approximate NN ⇒ in high dimensional space its likely to have several points equidistant from the query point, approximate NN saves tons of time
Q ← heap containing the root node with key zero
r ← distance to the nearest point seen so far (a variable for 1-NN, a max-heap for k-NN)
while Q not empty and (1 + ϵ)⋅minkey(Q) < r:
    B ← removemin(Q)
    w ← B's sample point
    r ← min{r, dist(q, w)}
    B′, B″ ← child boxes of B
    if (1 + ϵ)⋅dist(q, B′) < r: insert B′ into Q with key dist(q, B′)
    if (1 + ϵ)⋅dist(q, B″) < r: insert B″ into Q with key dist(q, B″)
return the point that determined r
🔥
In the worst case we must visit every node in the tree to find the exact nearest neighbour; then the k-d tree is slower than simple exhaustive search.