Created by Yunhao Cao (Github@ToiletCommander) in Spring 2022 for UC Berkeley CS189.
Reference Notice: Material highly and mostly derived from Prof Shewchuk's lecture notes, some ideas were borrowed from wikipedia & Stanford CS231N & Andrew Ng’s Intro to ML Course@Stanford.
Last Update: 2022-05-13 21:46 PST ⇒ After final release
Machine Learning Abstraction
Application / Data
Is the data labeled?
1. Yes - categorical (classification) or quantitative (regression)
2. No - similarity (clustering) or positioning (dimensionality reduction)
Model
-Decision Fns: linear, polynomial, logistic, neural net...
-Nearest Neighbours, Decision Trees
-Features
-Low vs. high capacity (affects overfitting, underfitting, inference)
For the RBF kernel, the feature map Φ(x) is an infinite-dimensional vector, and Φ(x)⋅Φ(z) is a series that converges to k(x,z)
RBF Kernel Function:
k(x,z) = exp(−||x−z||² / (2σ²))
🔥
Very popular in practice
1. gives very smooth h (infinitely differentiable)
2. Behaves like k-nearest neighbours, but smoother
3. Oscillates less than polynomials (depending on σ)
4. k(x,z) is more like a similarity measure
5. Sample points “vote” for value at z, closer points get weightier vote.
Choose σ by cross-validation.
Larger σ means wider Gaussians and a smoother h, but more bias and less variance
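A minimal numpy sketch of the RBF kernel defined above (the bandwidth value is a hypothetical example):

import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

# Closer points get values near 1; distant points get values near 0.
print(rbf_kernel([0.0, 0.0], [0.1, 0.0]))   # ~0.995
print(rbf_kernel([0.0, 0.0], [3.0, 0.0]))   # ~0.011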
Transformation of Space in finding weights
In many ML algorithms we want to find a decision function that takes the form of a geometric shape, such as a hyperplane / isocontour / isosurface, in the original data space (x-space). However, in the optimization we usually search for a vector (point) of weights w in w-space that represents that geometric shape in x-space. We have therefore implicitly performed a transformation from x-space to w-space.
Transformation to w-space in perceptron algorithm
For the perceptron algorithm, the transformation is performed below:
x-space vs. w-space:
- decision boundary: in x-space it is the hyperplane {z : w⋅z = 0} (all points whose inner product with w is 0); in w-space it is the point w
- a data point: in x-space it is the point x; in w-space it corresponds to the hyperplane {z : x⋅z = 0} (all points whose inner product with x is 0)
Since we want to enforce the inequality x⋅w ≥ 0 (for yi = 1), then
in x-space, x points for yi=1 should be on the same side of {z:w⋅z=0} as w
in w-space, w should be on the same side of {z:w⋅z=0} as x that have yi=1
(Batch) Gradient Descent (GD / BGD)
🔥
Walk downhill by using the “steepest” direction(the gradient)
Complexity: O(nd)
With a risk function R(w), calculate:
∇R(w) = [∂R/∂w₁, ∂R/∂w₂, …, ∂R/∂w_d]ᵀ
Then for each propagating step, run the following:
w←w−ϵ∇R(w)
Where ϵ is the learning rate.
Stochastic Gradient Descent (SGD)
🔥
Instead of GD on the entire batch, do GD on single training data
Complexity: O(d)
With a cost function L(w,Xi), calculate:
∇L(w) = [∂L/∂w₁, ∂L/∂w₂, …, ∂L/∂w_d]ᵀ
Then for each propagating step, run the following:
w←w−ϵ∇L(w)
Where ϵ is the learning rate.
🤔
Note: this is not necessarily the "steepest" descent direction when viewed over the whole batch (it is not guaranteed to work for every problem that GD works for); techniques like RMSProp / momentum can help with this issue.
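A minimal sketch of batch GD vs. SGD on a differentiable risk, illustrated here with mean squared error (the data, learning rates, and step counts are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # n = 100 sample points, d = 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

def grad_R(w, X, y):
    # gradient of the mean-squared-error risk R(w) = (1/n)||Xw - y||^2
    return 2.0 / len(y) * X.T @ (X @ w - y)

# Batch gradient descent: each step costs O(nd).
w = np.zeros(3)
eps = 0.1
for _ in range(500):
    w -= eps * grad_R(w, X, y)

# Stochastic gradient descent: each step costs O(d), using one random point.
w_sgd = np.zeros(3)
for _ in range(5000):
    i = rng.integers(len(y))
    w_sgd -= 0.01 * grad_R(w_sgd, X[i:i+1], y[i:i+1])

print(w, w_sgd)   # both should approach w_true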
Quadratic Form
A way to visualize symmetric matrices. Shows how applying the matrix affects the length of a vector.
Matrix A stretching along the direction of eigenvector (1,1) with eigenvalue 2 and shrinking along the direction of eigenvector (1,-1) with eigenvalue -1/2.
If a symmetric matrix is diagonal:
eigenvectors are coordinate axes
ellipsoids are axis-aligned
A symmetric matrix M is:
Positive Definite (PD)
if ∀w ≠ 0, wᵀMw > 0
or all eigenvalues positive
Positive Semi-Definite (PSD)
if ∀w,w⊤Mw≥0
or all eigenvalues non-negative
Indefinite
if both positive eigenvalue and negative eigenvalue exists
Invertible(non-singular)
if has no eigenvalue of 0
⚠️
Note that a positive definite matrix's quadratic form has only one point with value 0 (the origin), whereas a positive semidefinite (but not definite) matrix's quadratic form has infinitely many points with value 0.
By eigendecomposition, we see that any real symmetric matrix has:
A=VΛV⊤
Where Λ is a diagonal matrix containing the eigenvalues of A and V is an orthogonal matrix whose columns are the corresponding unit eigenvectors of A.
Note that the set {x : ||A⁻¹x|| = 1}, i.e. xᵀA⁻²x = 1, is an ellipsoid whose axes and radii are given by the eigenvectors and eigenvalues of A.
Newton's Method
Iterative optimization method for smooth function J(w)
🔥
You are at point v. Approximate J(w) near v by a quadratic function and jump to that quadratic's critical point:
J(w) ≈ J(v) + ∇J(v)⋅(w − v) + ½(w − v)ᵀ(∇²J(w)|w=v)(w − v)
Where ∇²J(w)|w=v is the Hessian of J evaluated at point v.
The Taylor series has further terms involving ∇³J(w)|w=v and beyond, but those are dropped (they are O(||w − v||³)) since we only approximate with a quadratic function.
By setting the gradient of this quadratic approximation to zero, we find
w = v − (∇²J(w)|w=v)⁻¹ ∇J(w)|w=v
Note that we don’t want to compute a matrix inverse directly. It is faster to solve a linear system of equations, by Cholesky factorization or the conjugate gradient method.
We set e=−(∇2J(v))−1∇J(v), then we just need to solve
(∇2J(v))e=−∇J(v)
Therefore, in conclusion, the whole process of Newton’s method:
Pick starting point w
Repeat until Converge
e← solution to linear system (∇2J(w))e=−∇J(w)
w←w+e
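A minimal Newton's-method sketch that solves the linear system for the step e rather than inverting the Hessian (the test function and starting point are hypothetical):

import numpy as np

def newton(grad, hess, w, tol=1e-8, max_iter=50):
    # repeatedly solve (hess(w)) e = -grad(w) and step w <- w + e
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:
            break
        e = np.linalg.solve(hess(w), -g)   # avoid forming the inverse Hessian
        w = w + e
    return w

# Example: J(w) = (w1 - 1)^4 + (w2 + 2)^2, minimized at (1, -2).
grad = lambda w: np.array([4 * (w[0] - 1) ** 3, 2 * (w[1] + 2)])
hess = lambda w: np.array([[12 * (w[0] - 1) ** 2, 0.0], [0.0, 2.0]])
print(newton(grad, hess, np.array([3.0, 3.0])))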
Bias-Variance Decomposition
Bias
Error due to the inability of the hypothesis h to fit the ground truth, g, perfectly.
e.g. fitting quadratic g with linear h
Variance
Error due to fitting random noise in data
For each model, we have
Xi∼D,ϵi∼D′,yi=g(Xi)+ϵi
Note that D’ has mean 0.
We want to fit a hypothesis h to X, y, where h is a random variable (its weights depend on the random training data)
If we take any arbitrary point (not necessarily a sample point, just any point from the ground-truth distribution) z ∈ Rᵈ and γ = g(z) + ϵ, ϵ ∼ D′,
Then the risk function when the loss is squared error:
R(h) = E[L(h(z), γ)]   (the expectation is taken over the random training data, hence over h, and over γ)
     = E[(h(z) − γ)²]
     = E[h(z)²] + E[γ²] − 2 E[γ h(z)]
     = Var(h(z)) + E[h(z)]² + Var(γ) + E[γ]² − 2 E[γ] E[h(z)]
     = (E[h(z)] − E[γ])² + Var(h(z)) + Var(γ)
     = (E[h(z)] − g(z))² + Var(h(z)) + Var(ϵ)
The first term is the bias², the second term is the variance, and the third term is the irreducible error of the method.
γ and h(z) are independent since h(z) only depends on training data and γ only on label of z.
🔥
We cannot precisely measure bias or variance of real-world data
For many distributions, variance→0 as n→∞
If h can fit g exactly, then bias→0 as n→∞
Adding a good feature reduces bias; adding a bad feature rarely increases it.
Adding a feature usually increases variance (added dimension).
Noise in test set only affects Var(ϵ)
Noise in training set only affects bias&Var(h)
We can test learning algorithms by choosing g and making synthetic data
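A rough sketch of that idea: pick a ground truth g, generate many synthetic training sets, and estimate bias² and variance of the predictions at a fixed test point empirically (the choice of g, noise level, and model are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(2 * x)          # chosen ground truth
sigma_noise = 0.3
z = np.array([0.5])                  # a fixed test point
n, trials = 20, 2000

preds = []
for _ in range(trials):
    X = rng.uniform(-1, 1, size=n)
    y = g(X) + sigma_noise * rng.normal(size=n)
    w = np.polyfit(X, y, deg=1)      # a deliberately biased (linear) hypothesis class
    preds.append(np.polyval(w, z)[0])

preds = np.array(preds)
bias_sq = (preds.mean() - g(z)[0]) ** 2
variance = preds.var()
print(bias_sq, variance, sigma_noise ** 2)   # bias^2, variance, irreducible error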
Feature Selection
We want to identify poorly predictive features and ignore them so that we can:
enjoy better inference speed
less variance
Best Subset Selection
🔥
Try all 2ᵈ − 1 nonempty subsets of features; choose the best classifier by (cross-)validation. Very slow.
Forward Stepwise Selection
Start with null model (0 features)
repeatedly add best feature until validation errors start increasing
At each outer iteration, inner loop tries every feature and chooses the best by validation
Requires training O(d²) models instead of O(2ᵈ)
But
won’t find the best 2-feature model if neither one of those features yields the best 1-feature model.
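A minimal sketch of forward stepwise selection as described above; validation_error is a hypothetical helper standing in for whatever validation scheme is used:

import numpy as np

def validation_error(features, X_train, y_train, X_val, y_val):
    # hypothetical helper: fit least squares on the chosen columns, return validation MSE
    A = X_train[:, features]
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.mean((X_val[:, features] @ w - y_val) ** 2)

def forward_stepwise(X_train, y_train, X_val, y_val):
    chosen, remaining = [], list(range(X_train.shape[1]))
    best_err = np.inf
    while remaining:
        # inner loop: try adding each remaining feature, keep the best one
        errs = [(validation_error(chosen + [j], X_train, y_train, X_val, y_val), j)
                for j in remaining]
        err, j = min(errs)
        if err >= best_err:          # stop when validation error stops improving
            break
        best_err = err
        chosen.append(j)
        remaining.remove(j)
    return chosen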
Backward stepwise selection
Start with all d features
repeatedly remove feature whose removal gives best reduction in validation error
Also trains O(d²) models
Additional heuristic: only try to remove features with small weights
Small relative to the variance of least-square regression
The variance of least-squares regression is proportional to σ²(XᵀX)⁻¹
The z-score of weight wᵢ is zᵢ = wᵢ / (σ√vᵢ), where vᵢ is the ith diagonal entry of (XᵀX)⁻¹
Small z-score hint that the “true” wi could be zero.
Ensemble Learning
🔥
Methods that have low bias, but high variance like decision trees can use this technique to reduce variance
Idea: take an average answer of a bunch of decision trees
Bagging
Same learning algorithm on many random subsamples of one training set.
Works well on most learning algorithms, maybe not k-nearest neighbours?
🔥
Given n-point training sample, generate random subsample of size n’ by sampling with replacement.
If n=n′ then around 63.2% are chosen.
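A minimal bootstrap-subsampling sketch; base_learner is a hypothetical training routine, and averaging the learners' outputs gives the ensemble's prediction:

import numpy as np

def bagging_fit(X, y, base_learner, n_learners=50, rng=None):
    # train n_learners copies of base_learner, each on n points sampled with replacement
    rng = rng or np.random.default_rng()
    n = len(y)
    models = []
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)        # sample n' = n points with replacement
        models.append(base_learner(X[idx], y[idx]))
    return models

# With n' = n, each bootstrap sample contains roughly 63.2% of the distinct points:
rng = np.random.default_rng(0)
n = 10000
print(len(np.unique(rng.integers(0, n, size=n))) / n)   # ~0.632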
Different Learning Algorithms
Different hypothesis can help reduce variance
Random Forests
Discussed in the “Algorithms” section
Adaboost
Similar to Random Forests, but different and more powerful.
Discussed in the “Algorithms” section
Kernel Trick
Kernel Trick is developed from the fact that in many algorithms,
weights can be written as a linear combination of sample points
Therefore we can write w = Σᵢ₌₁ⁿ aᵢXᵢ = Xᵀa, a ∈ Rⁿ
we can use inner products of Φ(x) only instead of directly computing Φ(x).
We will define the kernel function k(x,z)=Φ(x)⋅Φ(z)
Note that different kernel functions can be substituted to obtain different kernels.
The real magic happens when we can compute k(x,z) really quickly.
Kernel Ridge Regression
Remember in Ridge Regression,
(XᵀX + λI′)w = Xᵀy, so w = (XᵀX + λI′)⁻¹Xᵀy
To dualize ridge regression, we need the weights to be a linear combination of the sample points, but that would not happen unless we penalize the bias term wd+1=α.
Therefore we want to center X and y so that the “expected” value of bias is zero
Xi←Xi−μX,yi←yi−μy,Xi,d+1=1
🔥
The actual bias won’t usually be exactly zero, but it will often be close enough that we won’t do too much harm by penalizing it.
Now we have
(X⊤X+λI)w=X⊤y
Suppose a is the solution to the following:
(XX⊤+λI)a=y
Then we see that
X⊤y=X⊤XX⊤a+X⊤λIa=(X⊤X+λI)X⊤a
Since the LHS of the above equation matches the RHS of the original ridge regression normal equations, and the term Xᵀa on the RHS above matches the w term in the original ridge regression solution,
🔥
we've proved that if we set w to Xᵀa, then we have a solution w that is a linear combination of sample points, where a is the solution to our assumed system.
Therefore we want to find a, the dual solution that minimizes the dual form of ridge regression (obtained by substituting in w=X⊤a)
||XXᵀa − y||² + λ||Xᵀa||²
Therefore the entire algorithm:
We will define K=XX⊤ where Kij=k(Xi,Xj)
K is singular if n > d + 1 (and sometimes even if it's not) ⇒ we have to choose a positive λ.
Training:
Solve (XX⊤+λI)a=y for a
Optimized:
∀i,j: Kᵢⱼ ← k(Xᵢ, Xⱼ)   ⟸ O(n²d)
Solve (K + λI)a = y for a   ⟸ O(n³)
Therefore we prefer dual when d>n (when we add a lot of polynomial / rbf ... features)
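A minimal kernel ridge regression sketch following the dual algorithm above; the RBF kernel and λ value are hypothetical choices, and the data are assumed to be centered as described:

import numpy as np

def rbf(a, b, sigma=1.0):
    # pairwise RBF kernel matrix between the rows of a and the rows of b
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam=1e-2, sigma=1.0):
    K = rbf(X, X, sigma)                               # O(n^2 d)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)   # O(n^3)

def kernel_ridge_predict(a, X_train, X_test, sigma=1.0):
    # h(z) = sum_i a_i k(X_i, z)
    return rbf(X_test, X_train, sigma) @ a

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
a = kernel_ridge_fit(X, y)
print(kernel_ridge_predict(a, X, np.array([[0.0], [1.5]])))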
Kernel Perceptrons
Featurized(Primal) Perceptron Algorithm:
Training
w ← y₁Φ(X₁)
while some yᵢ Φ(Xᵢ)⋅w < 0:
    w ← w + ϵ yᵢ Φ(Xᵢ)
Testing
h(z)=w⋅Φ(z)
We will Dualize the problem with w=Φ(X)⊤a
Same as before, Let K be Φ(X)Φ(X)⊤, so Ki,j=k(Xi,Xj)
🔥
Now Φ(Xi)⋅w=(Φ(X)w)i=(Φ(X)Φ(X)⊤a)i=(Ka)i
Training:
a ← [y₁ 0 … 0]ᵀ   (the start point is arbitrary, but cannot be zero)
∀i,j: Kᵢⱼ ← k(Xᵢ, Xⱼ)   ⟸ O(n²d)
while some yᵢ(Ka)ᵢ < 0:
    aᵢ ← aᵢ + ϵ yᵢ   ⟸ O(1) time; update Ka in O(n) time
For SGD, update Ka in O(n) time each iteration instead of computing Ka entirely each time
Testing:
h(z) ← s(Σᵢ₌₁ⁿ aᵢ k(Xᵢ, z))
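A minimal kernel perceptron sketch of the dual training loop above (labels yᵢ ∈ {−1, +1}; the learning rate and iteration cap are hypothetical):

import numpy as np

def kernel_perceptron(K, y, eps=1.0, max_iter=1000):
    # dual perceptron: K[i, j] = k(X_i, X_j), y in {-1, +1}
    n = len(y)
    a = np.zeros(n)
    a[0] = y[0]                      # arbitrary nonzero starting point
    Ka = K @ a                       # maintained incrementally below
    for _ in range(max_iter):
        mis = np.where(y * Ka < 0)[0]
        if len(mis) == 0:
            break                    # all training points classified correctly
        i = mis[0]
        a[i] += eps * y[i]           # O(1) update to a ...
        Ka += eps * y[i] * K[:, i]   # ... and an O(n) update to Ka
    return a

def predict(a, k_row):
    # k_row[i] = k(X_i, z); returns the sign of sum_i a_i k(X_i, z)
    return np.sign(k_row @ a)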
High-Dimensional Data
🔥
Intuition about high-dimensional data is very important
Distribution of random points
Suppose we have a random point p∼N(0,I)∈Rd
The vast majority of the random points are at approximately the same distance from the mean, lying in a thin shell. (Think of the chi-square distribution.)
Angles between random points
cos θ = p⋅q / (||p|| ||q||) = p₁ / ||p||   (WLOG taking q along the first coordinate axis)
E[cos θ] = 0;  Var(cos θ) ≈ 1/d
Therefore as d becomes larger, θ becomes more and more likely to be very close to 90 deg!
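A quick simulation of both facts, that Gaussian points concentrate in a thin shell and that random directions are nearly orthogonal in high dimension (the dimension and sample count are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 5000
P = rng.normal(size=(n, d))

norms = np.linalg.norm(P, axis=1)
print(norms.mean(), norms.std())        # mean ~ sqrt(d) ~ 31.6, small spread (thin shell)

q = rng.normal(size=d)
cos = P @ q / (norms * np.linalg.norm(q))
print(cos.mean(), cos.var(), 1.0 / d)   # mean ~ 0, variance ~ 1/d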
Learning Theory
🔥
If we want to generalize (to things we haven’t learned about), we must constrain what hypothesis we allow our learner to consider.
Range space (set system): a pair (P, H) where
P is the set of all possible test/training points
H is the set of hypotheses(ranges/classifiers), each hypothesis states which subset of P (h⊆P) are in class C
Examples of classifiers:
Power set classifier: P is a set of k numbers, H is the power set of P, containing all 2ᵏ subsets of P
This classifier cannot generalize at all: given any subset of P as training data, it can fit many classifiers, and we wouldn't know which classifier would generalize well to the data we haven't seen unless we see that data.
Linear classifier: P = Rᵈ, H is the set of all halfspaces: {x : w⋅x ≥ −α} for all choices of w, α
A great question to ask about a certain class of hypotheses would be:
🔥
How well does the training error predict the test error?
Let all training points and test points be drawn independently from a probability distribution D defined on the domain P, then
Suppose our hypothesis is h ∈ H,
the risk / generalization error R(h) of h is the probability that h misclassifies a random point x drawn from D ⇒ the average test error
the empirical risk / training error R̂(h) is the percentage of training points misclassified by h
Hoeffding’s inequality tells us probability of bad estimate:
“How likely that a number drawn from a binomial distribution will be far from its mean”
P(|R̂(h) − R(h)| > ϵ) ≤ 2e^(−2ϵ²n)
So... we want to choose ĥ ∈ H that minimizes R̂(ĥ), a technique called empirical risk minimization.
However, it's often computationally infeasible to pick the truly best hypothesis ⇒ SVMs find a linear classifier with zero training error when the training data are linearly separable, but when they aren't, SVMs find a linear classifier with low training error, not necessarily the minimum.
🔥
It's okay to have infinitely many hypotheses, but it's not okay to have too many dichotomies
Dichotomies
A dichotomy of X is X∩h, where h∈H
It picks out the training points that h predicts are in class C
There are up to 2ⁿ dichotomies. The more dichotomies, the more likely it is that one of them will get lucky and have misleadingly low empirical risk
Given Π dichotomies,
P(at least one dichotomy has |R̂ − R| > ϵ) ≤ δ, where δ = 2Π e^(−2ϵ²n)
Fixing a value of δ and solving for ϵ: ∀h ∈ H, |R̂ − R| ≤ ϵ = sqrt((1/(2n)) ln(2Π/δ))
Therefore, the smaller we make Π and the bigger we make n, the more accurate we can predict the risk based on the empirical risk
Suppose we chose h^ as our hypothesis, then with probability ≥1−δ,
R(h^)≤R^(h^)+ϵ≤R^(h∗)+ϵ≤R(h∗)+2ϵ
Where
ϵ = sqrt((1/(2n)) ln(2Π/δ))
and h∗ being the optimal hypothesis in the class H
Therefore, with enough training data and a limit on the number of dichotomies, empirical risk minimization usually chooses a classifier close to the best one in the hypothesis class.
After choosing δ and ϵ,
the sample complexity is the number of training points needed to achieve this ϵ with probability ≥1−δ:
n ≥ (1/(2ϵ²)) ln(2Π/δ)
Shatter Function
number of dichotomies: Π_H(X) = |{X ∩ h : h ∈ H}| ∈ [1, 2ⁿ], where n = |X|
shatter function: Π_H(n) = max over |X| = n, X ⊆ P of Π_H(X) ⇒ the max number of dichotomies of any training set of size n
It's called a shatter function because H shatters a set X of n points if Π_H(X) = 2ⁿ.
e.g. Linear classifiers in plane, H = set of all halfplanes, ΠH(3)=8
Fact: for all range spaces, Π_H(n) is either O(nᵖ) for some constant p, or 2ⁿ
Cover’s theorem
linear classifiers in Rᵈ allow up to Π_H(n) = 2 Σᵢ₌₀ᵈ (n−1 choose i) dichotomies of n points
For n ≤ d + 1, Π_H(n) = 2ⁿ
For n ≥ d + 1, Π_H(n) ≤ 2(e(n−1)/d)ᵈ ⟸ polynomial in n with exponent d
🔥
Linear Classifiers need only n∈O(d) training points for the training error to predict the test error
VC Dimension (Vapnik-Chervonenkis Dimension)
VC(H) = max{n : Π_H(n) = 2ⁿ} ⟸ can be ∞
VC dimension is the bound for polynomial shatter function
Theorem:
Π_H(n) ≤ Σᵢ₌₀^VC(H) (n choose i)
Therefore,
Π_H(n) ≤ (e·n / VC(H))^VC(H)   (for n ≥ VC(H))
Also: O(VC(H)) training points suffice for accuracy (hidden constant is big though)
Algorithms
Perceptron Algorithm
🤔
Augment the data with an extra feature fixed to 1 so that one weight can offset the separating hyperplane from the origin.
If w classifies all X1,…,Xn correctly, then R(w)=0.
Optimizer:
Gradient Descent, SGD.
Both are guaranteed to work if the data are linearly separable.
Guaranteed to classify all the data correctly after at most O(r²/γ²) iterations, where r = max ||Xᵢ|| is the radius of the data and γ is the maximum margin (defined in Max-Margin Classifier).
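A minimal primal perceptron sketch using the fictitious-dimension trick from the note above (labels in {−1, +1}; the learning rate and iteration cap are hypothetical):

import numpy as np

def perceptron(X, y, eps=1.0, max_iter=10000):
    # X: n x d data, y in {-1, +1}; returns an augmented weight vector of length d+1
    Xa = np.hstack([X, np.ones((len(X), 1))])   # append the fixed 1 feature
    w = y[0] * Xa[0].copy()                     # start at y1 * X1, as in the notes
    for _ in range(max_iter):
        mis = np.where(y * (Xa @ w) <= 0)[0]    # misclassified points
        if len(mis) == 0:
            break                               # linearly separated: R(w) = 0
        i = mis[0]
        w += eps * y[i] * Xa[i]                 # SGD step on one misclassified point
    return w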
Hard-Margin Support Vector Machine (Hard-Margin SVM)
The problem with the perceptron algorithm is that it can fit infinitely many hyperplanes to the data, since it does not constrain the weight vector beyond classifying the training data correctly (it can fit the training data 100% but may not generalize well to the validation or test set). This may lead to increased variance.
🔥
One idea that generalizes well to the validation / test set is to find the unique boundary that maximizes the margin between the two classes in the training set.
We will enforce the constraints:
yi(w⋅Xi+α)≥1 for i∈[1,n]
Note that the RHS is 1, which makes it impossible for the weight vector w to be set to zero.
🔥
Note: the signed distance from the hyperplane to Xᵢ is (w⋅Xᵢ + α) / ||w||
Therefore the margin is minᵢ (1/||w||)|w⋅Xᵢ + α| ≥ 1/||w||
In order to maximize the margin, we can instead minimize ||w||. (At the optimal solution, the margin is exactly 1/||w|| because at least one constraint holds with equality.)
Therefore the optimization problem:
Find w and α that minimize ||w||²  s.t. ∀i ∈ [1, n], yᵢ(Xᵢ⋅w + α) ≥ 1
It's a quadratic program in d + 1 dimensions with n constraints, and it has a unique solution (if the points are linearly separable).
The reason we don't minimize ||w|| directly is that it is not smooth at w = 0, whereas ||w||² is smooth everywhere, making it easier to optimize.
Left: SVM in 3D w-space(w1,w2,alpha), Right: 2D Cross-section at w1=1/17.
The SVM is trying to find a solution that lies in the pocket that is as close to the origin as possible.
Soft-Margin SVM
🔥
Hard-margin SVMs fail if the data are not linearly separable & are sensitive to outliers.
Example where one outlier moves the entire boundary of the hard-margin SVM a lot
Idea: Allow some points to violate the margin with slack variables ξi≥0.
yi(Xi⋅w+α)≥1−ξi
Now we re-define the margin to be 1/∣∣w∣∣, since the margin is no longer the distance from the decision boundary to the nearest sample point.
🔥
We also want to add a term to the loss function that penalizes the abuse of slack variable.
Therefore the optimization problem:
Find w, α, ξᵢ that minimize ||w||² + C Σᵢ₌₁ⁿ ξᵢ
s.t. ∀i, yᵢ(Xᵢ⋅w + α) ≥ 1 − ξᵢ
     ∀i, ξᵢ ≥ 0
C>0 is a scalar regularization hyperparameter that trades off
small C: desire = maximize the margin 1/||w||; danger = underfitting; outliers = less sensitive; boundary = more "flat"
big C: desire = keep most slack variables zero or small; danger = overfitting; outliers = very sensitive; boundary = more sinuous
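The problem above is a quadratic program; as a rough illustration only, the equivalent unconstrained hinge-loss form ||w||² + C Σᵢ max(0, 1 − yᵢ(Xᵢ⋅w + α)) (obtained by plugging in the optimal slack ξᵢ) can be minimized by subgradient descent. This substitutes a simpler solver for a real QP solver; the step size and iteration count are hypothetical:

import numpy as np

def soft_margin_svm(X, y, C=1.0, step=1e-3, iters=5000):
    # subgradient descent on ||w||^2 + C * sum_i max(0, 1 - y_i (X_i . w + alpha))
    n, d = X.shape
    w, alpha = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + alpha)
        viol = margins < 1                       # points whose slack would be positive
        grad_w = 2 * w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_a = -C * y[viol].sum()
        w -= step * grad_w
        alpha -= step * grad_a
    return w, alpha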
Gaussian Discriminant Analysis(GDA)
⚠️
We assume that each class comes from normal distribution (Gaussian)
For each class, estimate mean μc, and prior πc=P(Y=C)
Notation of σ might be different between QDA and LDA.
Isotropic Gaussians
X ∼ N(μ, σ²):  f(x) = (1 / (√(2π) σ)ᵈ) exp(−||x − μ||² / (2σ²))
Anisotropic Gaussians(Different variances along different directions)
X ∼ N(μ, Σ):  f(x) = (1 / sqrt((2π)ᵈ |Σ|)) exp(−½ (x − μ)ᵀΣ⁻¹(x − μ))
Where Σ is the d×d positive definite (hence invertible) covariance matrix.
Let q(x) = (x − μ)ᵀΣ⁻¹(x − μ); then f(x) = n(q(x)), where n(⋅) is a simple, monotonic, convex function that is an exponential of the negative of its argument. n(⋅) does not affect the isosurfaces.
q(x) is the squared distance from Σ−1/2x to Σ−1/2μ.
d(x, μ) = ||Σ^(−1/2)x − Σ^(−1/2)μ|| = sqrt((x − μ)ᵀΣ⁻¹(x − μ)) = sqrt(q(x))
Important fact: if you understand the isosurfaces of a quadratic form, then you understand the isosurfaces of a Gaussian; the only difference is that the maximum isovalue is at the mean of the Gaussian.
Suppose only two classes C, D
Bayes decision rule r*(x) predicts the class C that maximizes the posterior P(Y=C | X=x) ∝ f_C(x) π_C
Since ln(ω) is monotonically increasing for ω > 0,
argmax_C P(Y=C | X=x) = argmax_C Q_C(x), where Q_C(x) = ln((√(2π))ᵈ f_C(x) π_C)
Q_C(x) = −||x − μ_C||² / (2σ_C²) − d ln σ_C + ln π_C   (isotropic case)
Q_C(x) = −½(x − μ_C)ᵀΣ_C⁻¹(x − μ_C) − ½ ln|Σ_C| + ln π_C   (anisotropic case)
We don't know μ exactly, so we substitute the estimate μ̂ for μ when computing σ̂².
Quadratic Discriminant Analysis (QDA)
⚠️
Assume each class has different variance Σc or σc2
compute the conditional mean μ̂_C and conditional variance σ̂_C² or Σ̂_C of each class C separately.
And estimate the priors π̂_C = n_C / n (the number of points in class C divided by the total number of sample points)
σ̂_C² = (1/(d n_C)) Σ_{i: yᵢ=C} ||Xᵢ − μ̂_C||²
Σ̂_C = (1/n_C) Σ_{i: yᵢ=C} (Xᵢ − μ̂_C)(Xᵢ − μ̂_C)ᵀ
Note: if the covariance matrix has zero eigenvalues, standard QDA doesn't work. We can fix this by adding a regularization term λI to the matrix Λ, where Σ̂_C = VΛVᵀ, and use Σ̃_C = V(Λ + λI)Vᵀ as the new covariance matrix. We want to keep λ relatively small so it doesn't distort the covariance matrix too much.
Note: for two classes, the posterior is P(Y=C | X=x) = s(Q_C(x) − Q_D(x)), where s(⋅) is the sigmoid (logistic) function.
Linear Discriminant Analysis (LDA)
⚠️
Assume each class has same variance Σ or σ2
compute the conditional mean μ̂_C of each class C separately.
Assuming the same variance across classes eliminates the quadratic term, making the decision function linear.
σ̂² = (1/(dn)) Σ_C Σ_{i: yᵢ=C} ||Xᵢ − μ̂_C||²
Σ̂ = (1/n) Σ_C Σ_{i: yᵢ=C} (Xᵢ − μ̂_C)(Xᵢ − μ̂_C)ᵀ
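A minimal sketch of the QDA parameter estimates above and the resulting discriminant (integer class labels assumed; no regularization of Σ̂_C is included):

import numpy as np

def fit_qda(X, y):
    # estimate prior, mean, and per-class covariance for each class
    params = {}
    n = len(y)
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma = (Xc - mu).T @ (Xc - mu) / len(Xc)
        params[c] = (len(Xc) / n, mu, Sigma)
    return params

def qda_predict(params, x):
    # pick the class maximizing Q_C(x) = -(x-mu)^T Sigma^-1 (x-mu)/2 - ln|Sigma|/2 + ln(pi_C)
    best_c, best_q = None, -np.inf
    for c, (prior, mu, Sigma) in params.items():
        diff = x - mu
        q = (-0.5 * diff @ np.linalg.solve(Sigma, diff)
             - 0.5 * np.linalg.slogdet(Sigma)[1] + np.log(prior))
        if q > best_q:
            best_c, best_q = c, q
    return best_c

For LDA, the per-class Σ̂_C would be replaced by the single pooled Σ̂ defined above.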
Regression
Need:
Regression function h(x∣p), p is parameter
Cost function to optimize
Regression Functions.
(1) Linear: h(x|w, α) = w⋅x + α
(2) Polynomial [equivalent to linear with added polynomial features]
(3) Logistic: h(x|w, α) = s(w⋅x + α) ⇒ a neural net with no hidden layer, one output node, and a sigmoid activation. Only separates linearly separable classes.
Loss Functions:
(A) Squared Error L(ŷ, y) = (ŷ − y)²
(B) Absolute Error L(y^,y)=∣y^−y∣
(C) Logistic Loss (Cross-entropy loss) L(y^,y)=−yln(y^)−(1−y)ln(1−y^)
Cost Functions to Optimize:
(a) Mean Loss J(h) = (1/n) Σᵢ₌₁ⁿ L(h(Xᵢ), yᵢ)
(b) Maximum Loss J(h) = maxᵢ L(h(Xᵢ), yᵢ)
(c) Weighted Sum J(h) = Σᵢ₌₁ⁿ ωᵢ L(h(Xᵢ), yᵢ)
(d) L2 Penalized J(h) = J₀(h) + λ||w||², where J₀(h) is one of (a), (b), or (c)
(e) L1 Penalized J(h) = J₀(h) + λ||w||₁
Least Squares Linear Regression
(1) Linear Function + (A) Squared Error + (a) Mean Loss
🔥
We will augment the data matrix with a column of 1s so that the last weight w_{d+1} serves as the offset α of the hyperplane from the origin.
By calculus,
X⊤Xw=X⊤y
However if X⊤X is singular, then the problem is underconstrained (infinitely many solutions for w).
w=(X⊤X)−1X⊤y
XᵀX is always PSD; it is singular (and the problem underconstrained) when the sample points lie on a common hyperplane. If XᵀX is singular, we can use its pseudoinverse to compute w.
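A minimal least-squares sketch using the augmented design matrix; np.linalg.lstsq solves via the pseudoinverse (SVD), so it also handles the singular case mentioned above:

import numpy as np

def least_squares(X, y):
    # solve the normal equations X^T X w = X^T y with a fictitious 1s column for the bias
    Xa = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xa, y, rcond=None)   # pseudoinverse-based
    return w                                      # last entry is the bias alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -1.0]) + 0.5 + 0.1 * rng.normal(size=50)
print(least_squares(X, y))   # approximately [3, -1, 0.5]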
Logistic Regression
(3) Logistic Function + (C) Logistic Loss + (a) Mean Loss
Solved by Gradient Descent
🔥
Usually used for classification. The input yi can be probabilities, but usually they’re all 0 or 1.
In LDA, the posterior probabilities are often modeled well by a logistic function. So why not just try to fit a logistic function directly to the data, skipping Gaussian modeling?
Logistic regression ALWAYS separates linearly separable points (as we scale w to infinite length, s(Xᵢ⋅w) → 1 for points in class C, s(Xᵢ⋅w) → 0 for points not in class C, and J(w) → 0)
A 2018 paper by Soudry, Hoffer, Nacson, Gunasekar, and Srebro shows that gradient descent applied to logistic regression eventually converges to maximum margin classifier
🔥
Compared to LDA,
1. LDA is stable for well-separated classes; logistic regression is surprisingly unstable.
2. LDA handles more than 2 classes elegantly; logistic regression needs to be modified (softmax regression).
3. LDA is slightly more accurate when the classes are nearly normal.
4. Logistic regression puts more emphasis on the decision boundary / always separates linearly separable points.
5. Misclassified points far from the decision boundary have the biggest effect in logistic regression, whereas every point has the same weight in LDA.
6. Logistic regression more robust on some non-Gaussian distributions (distributions with large skew)
7. Logistic Regression naturally fits labels between 0 and 1.
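A minimal logistic regression sketch trained by batch gradient descent on the mean cross-entropy loss (labels yᵢ ∈ [0, 1]; the learning rate and iteration count are hypothetical):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_regression(X, y, eps=0.5, iters=2000):
    # gradient descent on J(w) = -(1/n) sum_i [y_i ln s_i + (1 - y_i) ln(1 - s_i)]
    Xa = np.hstack([X, np.ones((len(X), 1))])   # fictitious dimension for the bias
    w = np.zeros(Xa.shape[1])
    for _ in range(iters):
        s = sigmoid(Xa @ w)
        grad = Xa.T @ (s - y) / len(y)          # gradient of the mean cross-entropy
        w -= eps * grad
    return w

def predict_proba(w, X):
    Xa = np.hstack([X, np.ones((len(X), 1))])
    return sigmoid(Xa @ w)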
Least-Squares Polynomial Regression
Replace each Xi with all terms of degree 0...p
Φ(Xᵢ) = [X²ᵢ₁  Xᵢ₁Xᵢ₂  X²ᵢ₂  Xᵢ₁  Xᵢ₂  1]ᵀ   (example for d = 2, degree p = 2)
Not numerically stable... higher-degree polynomials tend to oscillate.
Developed from MLE for multivariate normal distributions, see lecture 12 for detail.
Weighted Least Squares Regression
(1) Linear Function + (A) Squared Loss + (c) Weighted Sum Cost
🔥
Some points might be more trusted than others, or there might be certain points that we want to fit particularly well.
Collect the weights ωᵢ in an n×n diagonal matrix Ω.
Find w that minimizes (Xw − y)ᵀΩ(Xw − y) = Σᵢ₌₁ⁿ ωᵢ(Xᵢ⋅w − yᵢ)²
Solve for w in normal equations:
X⊤ΩXw=X⊤Ωy
Ridge Regression
(1) Linear Function + (A) Squared Loss + (d - a) L2 Penalized Mean Loss
🔥
Adds a regularization (penalty) term for shrinkage: to encourage a small ||w′||
Guarantees positive definite normal equations
Always unique solution
Left: a cost function with many minima; the problem is ill-posed.
After setting ∇J=0,
(X⊤X+λI′)w=X⊤y
Where I′ is the identity matrix with its last diagonal entry set to zero (so we don't penalize the bias term α)
🔥
Ideally, features should be “normalized” to have the same variance.
Or use an asymmetric penalty by replacing I′ with another diagonal matrix.
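A minimal ridge regression sketch that builds I′ explicitly so the bias term isn't penalized (λ is a hypothetical value chosen by validation):

import numpy as np

def ridge_regression(X, y, lam=1.0):
    # solve (X^T X + lam * I') w = X^T y, where I' does not penalize the bias
    Xa = np.hstack([X, np.ones((len(X), 1))])
    d1 = Xa.shape[1]
    I_prime = np.eye(d1)
    I_prime[-1, -1] = 0.0                  # last diagonal entry zero: alpha unpenalized
    return np.linalg.solve(Xa.T @ Xa + lam * I_prime, Xa.T @ y)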
Lasso Regression
(1) Linear Function + (A) Squared Loss + (e - a) L1 Penalized Mean Loss
Similar to ridge regression, but has the advantage that it often naturally sets some of the weights to zero (not always; because of the geometry of the L1 penalty the weights are more likely to land exactly on 0).
The isosurfaces of ||w′||₁ are cross-polytopes (we write w′ because we don't want to penalize α).
The unit cross-polytope is the convex hull (the smallest convex set containing all the points) of all the positive and negative unit coordinate vectors.
A typical isosurface of ||w′||₁
Compared to ridge regression, the solution given by lasso is more likely to be sparse, because the isosurfaces of the regularization term and the isosurfaces of the squared error are more likely to meet at one of the unit coordinate directions.
Lasso can be reformulated as a quadratic program; subgradient descent, least-angle regression (LARS), and forward stagewise algorithms can be used to solve it.
Decision Trees
Still widely used; fast, interpretable, invariant to scaling/translation, robust to irrelevant features, and can achieve 100% training accuracy, but prone to overfitting. Random forests and AdaBoost are introduced to reduce some of the overfitting.
Cuts x-space into rectangular cells
Works well with both categorical(split by subset) and quantitative(split by boundary) features
Interpretable result
Decision boundary can be arbitrarily complicated
Learning:
def GrowTree(S, depth=0):
    # S: the set of training points that reach this node
    if all points in S have the same class or depth >= MAX_DEPTH:
        return the most common class in S    # or the class posteriors (see note below)
    else:
        # try every split; choose the feature j and splitting value (or subset) beta
        # that maximize information gain (defined below)
        j, beta = choose_best_split(S)
        S_left, S_right = partition(S, j, beta)   # filter points into the two children
        return Node(j, beta, GrowTree(S_left, depth + 1), GrowTree(S_right, depth + 1))
⚠️
We may want to stop early (max depth, stop when 95% points same class)...
1. Most of the node’s points have same class (deal with outliers)
2. Complete tree may overfit (cells edge too tiny)
📌
Instead of returning labels, we can also return the posterior probability of each class, P(Y=C | node v). This is quite reasonable if there are many points in each leaf.
How to select best split:
Try all splits
Choose split that maximizes information gain H(S)−Hafter
Where H(S) is the entropy of an index set S ⇒ the average surprise
The surprise of Y being in class C is −log₂(p_C)
an event with probability 1 gives us zero surprise
an event with probability 0 gives us infinite surprise
The entropy of an index set S is the average surprise
H(S) = −Σ_C p_C log₂(p_C),  p_C = |{i ∈ S : yᵢ = C}| / |S|
If all points in S belong to the same class ⇒ H(S) = −1·log₂(1) = 0
Half class C, half class D ⇒ H(S) = −0.5 log₂(0.5) − 0.5 log₂(0.5) = 1
n points, all in different classes ⇒ H(S) = −log₂(1/n) = log₂(n)
Where Hafter is the weighted average entropy after the split
H_after = (|S_l| H(S_l) + |S_r| H(S_r)) / (|S_l| + |S_r|)
🔥
The entropy function we defined is strictly concave. This is important because we want the interior (linear combination of left+right splits) to be strictly below the curve.
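A minimal sketch of the entropy and information-gain computation used to pick a split:

import numpy as np

def entropy(labels):
    # H(S) = -sum_C p_C log2 p_C, the average surprise of the labels in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(labels, left_mask):
    # H(S) - H_after for a split sending left_mask points to the left child
    left, right = labels[left_mask], labels[~left_mask]
    n_l, n_r = len(left), len(right)
    h_after = (n_l * entropy(left) + n_r * entropy(right)) / (n_l + n_r)
    return entropy(labels) - h_after

y = np.array([0, 0, 0, 1, 1, 1])
print(info_gain(y, np.array([True, True, True, False, False, False])))   # 1.0 (perfect split)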
Multivariate splits
🔥
Better classifier at cost of worse interpretability or speed
Standard decision trees are fast because they check only one feature at each node. With multivariate splits, every node may look at many (possibly hundreds of) features at once, which slows down classification a lot.
Pruning
Grow tree
greedily remove each split whose removal improves validation performance
More reliable than stopping early
📌
The reason why pruning often works better is that a split that doesn't make much progress is often followed by a split that makes a lot of progress.
Decision Tree Regression
Uses the concept of decision tree
Creates a piecewise constant regression function
Cost:
J(S) = (1/|S|) Σ_{i∈S} (yᵢ − μ_S)²
Where μ_S is the mean label yᵢ over the sample points i ∈ S
🔥
We choose split that minimizes the weighted average of the costs of the children after the split, instead of minimizing entropy.
Random Forests
🔥
Random Sampling (Bagging) isn’t enough since with bagging, the decision trees look very similar
because the first level always chooses the best feature that best “splits” the data.
Idea:
At each treenode, only allowed to split on m features (out of d)
not allowed to split on the other d-m features
m ≈ √d works well for classification, and m ≈ d/3 for regression
m is a hyperparameter
Pros:
Sometimes gives a test error reduction of up to 100× or 1000× compared with plain decision trees
Cons:
Slow
Loses interpretability
Variant: Multivariate split random forest
Left: Axis-aligned splits, middle: splits with lines & arbitrary rotations, right: splits with conic sections
🔥
Special note: For axis-aligned splits Random Forests, the decision boundaries are still segments of linear boundaries since Random Forests are just linear combinations of Decision Trees.
Neural Networks
If the weights are initialized to zero, there's no difference between one hidden unit and another - they all compute the same thing
Break symmetry by weight initialization
Diminishing/Exploding Gradient ⇒ Gradient fail to pass through a network too deep / Gradient being too big
Gradient Clipping(for exploding gradient)
BCE loss for the last layer
Skip connections
Centering the hidden units
replace sigmoids with tanh
Whitening Training Data (to gain a great bowl shape in w-space)
BatchNorm layer
If dimension of data is big, then second-order optimization is not applicable
Newton’s method required too much computation power for the Hessian
Alternative - Stochastic Levenberg Marquardt: approximates a diagonal Hessian
To avoid overfitting:
Random init weights
Bagging
L2 Regularization
Dropout
For CNN, see my notes for Andrew Ng.
Principal Component Analysis (PCA)
3D points projected to 2D by PCA
A technique for dimensionality reduction (a subset of unsupervised learning)
🔥
Find k directions that capture most of the variation
Before doing PCA, make sure the data is centered.
Normalize the data? Sometimes yes, sometimes no
If some features are much larger than others, they will tend to dominate the Euclidean distance. So if we have features in different units of measurement we should probably normalize. But if in the same unit of measurement, it depends on the context (usually shouldn’t)
Reason for PCA:
Speed - Reducing number of dimensions makes some computations cheaper
Variance Reduction - Remove irrelevant dimensions to reduce overfitting
Like subset selection, but features are not axis-aligned
instead linear combinations of input features
Finding a small basis for representing maximum variations in complex things
Minimizes the mean squared projection distance
Projection: Orthogonal projection of point x onto vector w is
proj_w(x) = (x⋅w)w                 if w is a unit vector (||w|| = 1)
          = ((x⋅w)/||w||²) w       otherwise
We want to pick the best k directions, span{v1,v2,…,vk} to project our inputs onto.
The Singular Value Decomposition of X,
X=UΛV⊤
gives us the best basis to project onto.
Note: Here, U,V are orthonormal matrices with column vectors and Λ is a diagonal matrix with singular value σis.
Columns of U are unit eigenvectors of XXᵀ; columns of V (rows of Vᵀ) are unit eigenvectors of XᵀX. The singular values σᵢ are the square roots (σᵢ = √λᵢ) of the eigenvalues of both XXᵀ and XᵀX.
We can verify that by...
XᵀX = VΛᵀUᵀUΛVᵀ = VΛ²Vᵀ
XXᵀ = UΛVᵀVΛᵀUᵀ = UΛ²Uᵀ
Oh yes, now we’ve expressed those two symmetric matrices as their eigen-decomposition form.
🔥
If the data in X are stored row by row, then the principal components of X are the rows of Vᵀ.
If the data in X are stored column by column, then the principal components of X are the columns of U.
We pick the first k principal components by taking the components with the k largest singular values.
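A minimal PCA sketch via SVD, with X stored row by row (so the principal components are rows of Vᵀ); the value of k is a hypothetical choice:

import numpy as np

def pca(X, k):
    # project centered data onto the top-k principal components via SVD
    Xc = X - X.mean(axis=0)                    # center the data first
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                        # rows of V^T with the k largest singular values
    return Xc @ components.T, components       # projected coordinates, principal directions

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])
Z, V = pca(X, k=2)
print(Z.shape, V.shape)    # (200, 2) (2, 3)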
Clustering
🔥
Partition data into clusters so points in a cluster are more similar than across clusters
K-Means Clustering
Normalize the data? Sometimes yes, sometimes no
If some features are much larger than others, they will tend to dominate the Euclidean distance. So if we have features in different units of measurement we should probably normalize. But if in the same unit of measurement, it depends on the context (usually shouldn’t)
Partition n points into k disjoint clusters
Assign each input point Xi a cluster label yi∈[1,k]
Cluster i has mean μᵢ = (1/nᵢ) Σ_{yⱼ=i} Xⱼ, given nᵢ points in cluster i.
Find y that minimizes Σᵢ₌₁ᵏ Σ_{yⱼ=i} ||Xⱼ − μᵢ||²
NP-hard; trying every partition ⇒ O(nkⁿ)
Lloyd’s Algorithm (Finds a Local Minimum on the Cost Function)
🔥
Each alternating step must not increase the cost function; if a step changes nothing, the objective cannot improve further and the algorithm should terminate.
First, start with either
Forgy method: choose k random sample points to be initial μi ⇒ goto 2
Random partition: randomly assign each sample point to a cluster ⇒ goto 1
k-means ++: Like Forgy, but biased distribution so that each center is chosen with a preference for points far from previous centers.
Alternates between
fixing yj, updating μi
Optimal μi is the mean of points in cluster i by calculus
fixing μi, updating yj
The optimal y assigns each point Xⱼ to the closest center μᵢ
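A minimal Lloyd's-algorithm sketch with Forgy initialization (k and the iteration cap are hypothetical; k-means++ initialization would only change the first step):

import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    rng = rng or np.random.default_rng()
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)   # Forgy init
    y = np.zeros(len(X), dtype=int)
    for it in range(max_iter):
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        new_y = dists.argmin(axis=1)                 # fix mu, update assignments
        if it > 0 and np.array_equal(new_y, y):
            break                                    # nothing changed: local minimum reached
        y = new_y
        for i in range(k):                           # fix assignments, update means
            if np.any(y == i):
                mu[i] = X[y == i].mean(axis=0)
    return y, mu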
Hierarchical Clustering
agglomerative clustering
From bottom up
Start with each point as its own cluster; repeatedly fuse pairs of clusters
divisive clustering
From top down
Start with all points in one cluster; repeatedly split it
We need a distance function between clusters A,B
complete linkage d(A,B)=max{d(w,x):w∈A,x∈B}
single linkage d(A,B)=min{d(w,x):w∈A,x∈B}
average linkage d(A,B) = (1/(|A||B|)) Σ_{w∈A} Σ_{x∈B} d(w,x)
centroid linkage d(A,B)=d(μA,μB) where μs is mean of S
Visualization of a hierarchical tree using a dendrogram: the vertical axis encodes the linkage distances.
Comparison of the three linkages: complete linkage gives the best-balanced dendrogram, whereas single linkage gives a very unbalanced dendrogram that is sensitive to outliers.
Random Projection
An alternative to PCA as preprocess for clustering, classification, regression
Sometimes preserves distance better than PCA
best when projecting a very high-dim space to a medium-dim space
We pick a small ϵ, a small δ, and a random subspace S ⊂ Rᵈ of dimension k = ⌈2 ln(1/δ) / (ϵ²/2 − ϵ³/3)⌉
Subspace S can be obtained by choosing k arbitrary directions and use Gram-Schmidt to orthonormalize them.
We project any point q to q̂ = √(d/k) ⋅ proj_S(q)
The constant √(d/k) rescales the projection to approximately preserve distances.
Johnson-Lindenstrauss Lemma(modified)
For any two points, q,w∈Rd
(1 − ϵ)||q − w||² ≤ ||q̂ − ŵ||² ≤ (1 + ϵ)||q − w||²
with probability ≥ 1 − 2δ
🔥
Typical values:
ϵ∈[0.02,0.5], δ∈[1/n3,0.05]
⚠️
The squared distance between two points after projection might change by 2% - 50%.
For all point distances to be accurate, we need δ < 1/n², which requires a subspace of dimension Θ(log n).
Reducing δ doesn’t cost much, but reducing ϵ costs more.
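A minimal random-projection sketch following the recipe above: pick k from ϵ and δ, orthonormalize k random directions (np.linalg.qr performs the Gram-Schmidt step), and rescale by √(d/k). The ϵ, δ, and data shapes are hypothetical:

import numpy as np

def random_projection(X, eps=0.1, delta=1e-3, rng=None):
    # project the rows of X onto a random k-dim subspace, scaled to preserve distances
    rng = rng or np.random.default_rng()
    n, d = X.shape
    k = int(np.ceil(2 * np.log(1 / delta) / (eps ** 2 / 2 - eps ** 3 / 3)))
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # k orthonormal random directions
    return np.sqrt(d / k) * X @ Q

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10000))
Xp = random_projection(X, eps=0.2, rng=rng)
orig = np.sum((X[0] - X[1]) ** 2)
proj = np.sum((Xp[0] - Xp[1]) ** 2)
print(Xp.shape[1], proj / orig)     # ratio should lie roughly within [0.8, 1.2]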
k-Nearest Neighbour
🔥
Idea: When querying a point q, find k nearest points of q and return the “average” label returned by those nearest points.
As k becomes larger, the decision boundary is smoother.
Depending on dimension,
2-5 dimensions: Voronoi diagrams
Medium dimension (up to ~30): k-d trees
Large dim
exhaustive k-NN with PCA or random projection
locality sensitive hashing
Exhaustive k-NN Algorithm:
Given query point q:
Scan through all n sample points, computing the (squared) distance to q
Maintain a max-heap with the k shortest distances seen so far
When we encounter a sample point closer to q than the point at the top of the heap, remove the top of the heap and insert the new point.
Time to train: O(0)
Time to query: O(nd+nlog(k)), expected Θ(nd+klog(n)log(k)) if random point order
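A minimal exhaustive k-NN sketch using Python's heapq as the max-heap (distances are negated so the farthest kept point sits at the top), matching the query procedure above:

import heapq
import numpy as np

def knn_query(X, y, q, k=3):
    # return the k training labels nearest to query q, via a max-heap of size k
    heap = []                                    # stores (-squared_distance, label)
    for xi, yi in zip(X, y):
        d2 = np.sum((xi - q) ** 2)
        if len(heap) < k:
            heapq.heappush(heap, (-d2, yi))
        elif -d2 > heap[0][0]:                   # closer than the farthest point kept so far
            heapq.heapreplace(heap, (-d2, yi))
    return [label for _, label in heap]

# classify q by, e.g., a majority vote over the returned labels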
Voronoi Diagrams
🔥
The original VD only supports 1-NN; there are modified order-k Voronoi diagrams, but nobody uses those because the size gets really bad (for 2D it is O(k²n)).
Let X be a point set. The Voronoi cell (always a convex polyhedron or polytope) of w ∈ X is
Vor_w = {p ∈ Rᵈ : ||p − w|| ≤ ||p − v|| ∀v ∈ X}
The Voronoi diagram of X is the set of X's Voronoi cells.
Size (# of vertices) ∈ O(n^⌈d/2⌉)
2D ⇒ O(n log n) time to compute the diagram and a trapezoidal map for point location; O(log n) query time
dD ⇒ Use binary space partition tree (BSP tree) for point location
k-d Trees
We want each cut box to be as close to cubical as possible
Or we can instead rotate through the features, which builds the tree faster by a factor of O(d)
Choose splitting value ⇒ median point for feature i
Guarantees ⌊log₂ n⌋ tree depth
Each internal node stores a sample point ⇒ usually its the splitting point
Build time: O(nd log n), or O(n log n) if we rotate through the features
Query:
ϵ ⇒ parameter for approximate NN ⇒ in high dimensional space its likely to have several points equidistant from the query point, approximate NN saves tons of time
Q ← heap containing the root node with key zero
r ← distance to the nearest point seen so far (a variable for 1-NN, a max-heap for k-NN)
while Q not empty and (1 + ϵ)⋅minkey(Q) < r:
    B ← removemin(Q)
    w ← B's sample point
    r ← min{r, dist(q, w)}
    B′, B″ ← child boxes of B
    if (1 + ϵ)⋅dist(q, B′) < r: insert B′ into Q with key dist(q, B′)
    if (1 + ϵ)⋅dist(q, B″) < r: insert B″ into Q with key dist(q, B″)
return the point that determined r
🔥
In the worst case we must visit every node in the tree to find the exact nearest neighbour; then the k-d tree is slower than simple exhaustive search.