Created by Yunhao Cao (Github@ToiletCommander) in Spring 2022 for UC Berkeley CS189.
Reference Notice: Material largely derived from Prof. Shewchuk's lecture notes; some ideas borrowed from Wikipedia, Stanford CS231N, and Andrew Ng's intro ML course at Stanford.
Last Update: 2022-05-13 21:46 PST ⇒ After final release
Machine Learning Abstraction
Application / Data
Is the data labeled?
1. Yes - categorical (classification) or quantitative (regression)
2. No - similarity (clustering) or positioning (dimensionality reduction)
Model
-Decision Fns: linear, polynomial, logistic, neural net...
-Nearest Neighbours, Decision Trees
-Features
-Low vs. high capacity (affects overfitting, underfitting, inference)
Suppose we have the Bayes optimal classifier r(x), i.e., the classifier r that minimizes the risk R(r); for each x it outputs the prediction y with minimum expected loss.
3 ways to build classifiers
Generative Models(e.g. LDA)
Assume sample points come from probability distributions, different for each class.
Guess form of distributions
For each class C, fit distribution parameters to class C points, giving f(X∣Y=C)
For each class C, estimate P(Y=C)
Bayes' Theorem gives P(Y∣X)
If 0-1 loss, pick the class C that maximizes the posterior
P(Y=C∣X=x) ∝ f(X=x∣Y=C) P(Y=C)
Discriminative Models (e.g. logistic regression)
Model P(Y∣X) directly.
Decision Boundary
Model r(x) directly
ROC Curve (Receiver Operating Characteristics)
X axis - False Positive Rate (FP) - Predictor predicts + but actually is -
Y axis - Sensitivity / True Positive Rate (TP) - Predictor predicts + and right
1 - x ⇒ Specificity / True Negative Rate (TN) - Predictor predicts - and right
1 - y ⇒ False Negative Rate (FN) - Predictor predicts - but actually is +
We want to minimize x and (1-y)
Data Cleaning
Centering
Decorrelating
Let R be uniform distribution on sample points,
We want to perform a linear transformation that maps the sample points to an axis-aligned distribution.
We apply the rotation Z = ẊV, where Var(R) = VΛV⊤ (eigendecomposition of the sample covariance; Ẋ is the centered design matrix).
Then Var(Z) = (1/n) Z⊤Z = (1/n) V⊤Ẋ⊤ẊV = V⊤ Var(R) V = V⊤VΛV⊤V = Λ
Sphering
Apply the transform W = Ẋ Var(R)^{−1/2}
Recall that Σ−1/2 maps ellipsoids to spheres.
Whitening
Centering + Sphering
X→W
Ideas
Feature Lifting (Kernel)
Suppose we have data point Xi, the lifted data point is Φ(Xi).
Parabolic Lifting Map
p-degree polynomial
Polynomial Kernel: k(x, z) = (x⊤z + 1)^p
where Φ(x) contains every monomial in x of degree 0, …, p
Demonstration of d=2, p=2
Saves tons of computing power, since we only need an inner product of the original data points instead of explicitly constructing the polynomial features (see the check below).
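A minimal numpy check of that claim for d = 2, p = 2; the particular scaled-monomial lifting Φ below is one standard choice, written out just for this demonstration:

import numpy as np

def phi(x):
    # Explicit degree-2 lift for d = 2 (monomials of degree 0, 1, 2, with sqrt(2) scalings)
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z, p=2):
    # k(x, z) = (x . z + 1)^p, computed without ever forming the lifted vectors
    return (np.dot(x, z) + 1.0) ** p

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # 4.0
print(poly_kernel(x, z))        # 4.0 -- same value, O(d) work instead of O(d^p)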
Gaussian Kernel (RBF - Radial Basis Function Kernel)
An infinite vector, and Φ(x)⋅Φ(z) is a series that converges to k(x,z)
RBF Kernel Function: k(x, z) = exp(−||x − z||² / (2σ²))
Choose σ by cross-validation.
Larger σ means wider Gaussians and a smoother h, but more bias and less variance
Transformation of Space in finding weights
In many ML algorithms we want to find a decision function that usually takes the form of a geometric object such as a hyperplane / isocontour / isosurface in the original data space (x-space). In doing the optimization, however, we usually look for a vector (point) of weights w in w-space that represents that geometric object in x-space. Therefore we have implicitly performed a transformation of space from x-space to w-space.
Transformation to w-space in perceptron algorithm
For the perceptron algorithm, the transformation is performed below:
x-space vs. w-space:
the decision boundary is the hyperplane {z : w⋅z = 0} in x-space (all points with inner product 0 with w); in w-space it is the single point w
a data point x is a point in x-space; in w-space it corresponds to the hyperplane {z : x⋅z = 0} (all points with inner product 0 with x)
Since we want to force the inequality x⋅w ≥ 0 (for points with y_i = 1), then
in x-space, points x with y_i = 1 should be on the same side of {z : w⋅z = 0} as w
in w-space, w should be on the same side of each hyperplane {z : x⋅z = 0} as its normal vector x, for the points x with y_i = 1
(Batch) Gradient Descent (GD / BGD)
With a risk function R(w), calculate the gradient ∇R(w).
Then for each step, run w ← w − ε ∇R(w)
Where ε is the learning rate.
Stochastic Gradient Descent (SGD)
With a per-sample loss L(w, X_i), calculate the gradient ∇_w L(w, X_i) for one sample point at a time.
Then for each step, pick a sample point i (randomly, or by sweeping through the data) and run w ← w − ε ∇_w L(w, X_i)
Where ε is the learning rate.
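A rough numpy sketch of both update rules, using mean squared error as a stand-in risk (the function names, step sizes, and data here are illustrative assumptions, not from the notes):

import numpy as np

def batch_gd(grad_R, w, eps=0.01, steps=1000):
    # (Batch) gradient descent: w <- w - eps * grad R(w), using the full training set each step
    for _ in range(steps):
        w = w - eps * grad_R(w)
    return w

def sgd(grad_L, w, n, eps=0.01, epochs=100, seed=0):
    # Stochastic gradient descent: one randomly ordered sample point per update
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):
            w = w - eps * grad_L(w, i)
    return w

# Example risk: R(w) = (1/n) ||Xw - y||^2
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3)); y = X @ np.array([1.0, -2.0, 0.5])
grad_R = lambda w: (2.0 / len(y)) * X.T @ (X @ w - y)     # full-batch gradient
grad_L = lambda w, i: 2.0 * (X[i] @ w - y[i]) * X[i]      # per-sample gradient
print(batch_gd(grad_R, np.zeros(3)), sgd(grad_L, np.zeros(3), n=len(y)))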
Quadratic Form
A way to visualize symmetric matrices. Shows how applying the matrix affects the length of a vector.
If a symmetric matrix is diagonal:
eigenvectors are coordinate axes
ellipsoids are axis-aligned
A symmetric matrix M is:
Positive Definite (PD)
if w⊤Mw > 0 for all w ≠ 0
or all eigenvalues positive
Positive Semi-Definite (PSD)
if ∀w,w⊤Mw≥0
or all eigenvalues non-negative
Indefinite
if both positive eigenvalue and negative eigenvalue exists
Invertible(non-singular)
if has no eigenvalue of 0
By eigendecomposition, we see that any real symmetric matrix A can be written as A = VΛV⊤
Where Λ is a diagonal matrix containing the eigenvalues of A and V is an orthogonal matrix whose columns are the (unit) eigenvectors of A.
Note that {x : ||A^{−1}x|| = 1}, i.e. x⊤A^{−2}x = 1, is an ellipsoid whose axes are the eigenvectors of A and whose radii are the corresponding eigenvalues.
Newton's Method
Iterative optimization method for smooth function J(w)
Taylor series of the cost function about a point v: J(w) ≈ J(v) + ∇J(v)⊤(w − v) + ½ (w − v)⊤ ∇²J(v) (w − v)
Where ∇²J(v) is the Hessian of J evaluated at the point v
There are higher-order terms (involving ∇³J and beyond), but they are absorbed into an O(||w − v||³) error since we only approximate with a quadratic function.
By setting the gradient of this approximation to zero, ∇J(w) = 0, we find w = v − (∇²J(v))^{−1} ∇J(v)
Note that we don’t want to compute a matrix inverse directly. It is faster to solve a linear system of equations, by Cholesky factorization or the conjugate gradient method.
We set e = −(∇²J(v))^{−1} ∇J(v); then we just need to solve the linear system (∇²J(v)) e = −∇J(v).
Therefore, in conclusion, the whole process of Newton’s method:
Pick starting point w
Repeat until convergence:
e ← solution to the linear system (∇²J(w)) e = −∇J(w)
w←w+e
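A minimal numpy sketch of that loop; the quadratic example at the end is an illustrative assumption (for a quadratic J, one Newton step already lands on the optimum):

import numpy as np

def newtons_method(grad, hess, w, tol=1e-8, max_iter=50):
    # Repeat: solve (grad^2 J(w)) e = -grad J(w) as a linear system (no explicit inverse), then w <- w + e
    for _ in range(max_iter):
        e = np.linalg.solve(hess(w), -grad(w))
        w = w + e
        if np.linalg.norm(e) < tol:
            break
    return w

# Example: J(w) = 1/2 w^T A w - b^T w, so grad J = A w - b and the Hessian is A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
print(newtons_method(lambda w: A @ w - b, lambda w: A, np.zeros(2)))   # equals solve(A, b)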
Bias-Variance Decomposition
Bias
Error due to the inability of the hypothesis h to fit the ground truth g perfectly.
e.g. fitting quadratic g with linear h
Variance
Error due to fitting random noise in data
For each model, we assume the labels are generated as γ = g(X) + ϵ, where the noise ϵ is drawn from some distribution D′.
Note that D′ has mean 0.
We fit the hypothesis h to X, y; h is a random variable (its weights depend on the random training data).
If we take any arbitrary point z ∈ R^d (not necessarily a sample point, just any point from the real ground-truth distribution) and let γ = g(z) + ϵ, ϵ ∼ D′,
then the risk function when the loss is squared error decomposes as
R(h) = E[(h(z) − γ)²] = (E[h(z)] − g(z))² + Var(h(z)) + Var(ϵ)
The first term is the bias² (of the method), the second term is the variance, and the third term is the irreducible error.
γ and h(z) are independent since h(z) only depends on training data and γ only on label of z.
For many distributions, variance→0 as n→∞
If h can fit g exactly, then bias→0 as n→∞
Adding a good feature reduces bias; adding a bad feature rarely increases it.
Adding a feature usually increases variance (added dimension).
Noise in test set only affects Var(ϵ)
Noise in training set only affects bias&Var(h)
We can test learning algorithms by choosing g and making synthetic data
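For example, a rough numpy sketch of such a synthetic experiment: fit many hypotheses on fresh training sets drawn from a known g, then estimate bias² and variance of the prediction at one arbitrary point z (g, σ, the degree, and the sample sizes are all arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(x)        # known ground truth
sigma = 0.3                    # noise std, so the irreducible error is sigma^2
z = 1.0                        # arbitrary test point

def fit_and_predict(deg, n=30):
    # Draw a fresh training set from the synthetic model y = g(x) + eps, fit a degree-`deg`
    # polynomial hypothesis h, and return its prediction h(z).
    X = rng.uniform(-3, 3, n)
    y = g(X) + rng.normal(0, sigma, n)
    return np.polyval(np.polyfit(X, y, deg), z)

preds = np.array([fit_and_predict(deg=1) for _ in range(2000)])
bias2 = (preds.mean() - g(z)) ** 2   # (E[h(z)] - g(z))^2
var = preds.var()                    # Var(h(z))
print(bias2, var, sigma ** 2)        # a linear h can't fit sin, so bias^2 dominates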
Feature Selection
We want to identify poorly predictive features and ignore them so that we can:
enjoy better inference speed
less variance
Best Subset Selection
Forward Stepwise Selection
Start with null model (0 features)
repeatedly add best feature until validation errors start increasing
At each outer iteration, inner loop tries every feature and chooses the best by validation
Requires training O(d²) models instead of O(2^d)
But it won't find the best 2-feature model if neither of those features yields the best 1-feature model (a sketch of the loop follows below).
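A hedged sketch of that loop in numpy; val_error is a placeholder scorer (least-squares fit on the chosen columns, validation MSE), not anything the notes prescribe:

import numpy as np

def val_error(features, Xtr, ytr, Xval, yval):
    # Placeholder scorer: least-squares fit on the chosen columns, then validation MSE.
    w, *_ = np.linalg.lstsq(Xtr[:, features], ytr, rcond=None)
    return np.mean((Xval[:, features] @ w - yval) ** 2)

def forward_stepwise(Xtr, ytr, Xval, yval):
    chosen, remaining, best_err = [], list(range(Xtr.shape[1])), np.inf
    while remaining:
        # Inner loop: try every unused feature, keep the one with the lowest validation error.
        err, j = min((val_error(chosen + [j], Xtr, ytr, Xval, yval), j) for j in remaining)
        if err >= best_err:        # stop once validation error no longer improves
            break
        best_err = err
        chosen.append(j)
        remaining.remove(j)
    return chosen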
Backward stepwise selection
Start with all d features
repeatedly remove feature whose removal gives best reduction in validation error
Also trains O(d²) models
Additional heuristic: only try to remove features with small weights
Small relative to the variance of least-square regression
The variance of least-squares regression is proportional to σ²(X⊤X)^{−1}
The z-score of weight w_i is z_i = w_i / (σ√v_i), where v_i is the ith diagonal entry of (X⊤X)^{−1}
A small z-score hints that the "true" w_i could be zero.
Ensemble Learning
Idea: take the average answer of a bunch of learners (e.g., decision trees)
Bagging
Same learning algorithm on many random subsamples of one training set (each subsample drawn with replacement).
Works well on most learning algorithms, maybe not k-nearest neighbours?
If n′ = n, then around 63.2% of the sample points are chosen for each subsample (the remaining slots are duplicates).
Different Learning Algorithms
Different hypotheses can help reduce variance
Random Forests
Discussed in the “Algorithms” section
Adaboost
Similar to Random Forests, but different and more powerful.
Discussed in the “Algorithms” section
Kernel Trick
Kernel Trick is developed from the fact that in many algorithms,
weights can be written as a linear combination of the sample points
Therefore we can write w = ∑_{i=1}^n a_i X_i = X⊤a, with a ∈ R^n
we can use inner products of Φ(x) only instead of directly computing Φ(x).
We will define the kernel function k(x,z)=Φ(x)⋅Φ(z)
Note that different kernel functions can be substituted to use different liftings Φ.
The real magic happens when we can compute k(x,z) really quickly.
Kernel Ridge Regression
Remember that in ridge regression we solve the normal equations (X⊤X + λI′)w = X⊤y.
To dualize ridge regression, we need the weights to be a linear combination of the sample points, but that doesn't happen unless we also penalize the bias term w_{d+1} = α.
Therefore we want to center X and y so that the “expected” value of bias is zero
Now we have (penalizing all weights equally) the normal equations (X⊤X + λI)w = X⊤y
Suppose a is the solution to (XX⊤ + λI) a = y
Then we see that X⊤(XX⊤ + λI) a = X⊤y, i.e. (X⊤X + λI)(X⊤a) = X⊤y
The LHS of this equation matches the LHS of the ridge regression normal equations, with X⊤a playing the role of w, so w = X⊤a solves the primal problem.
Therefore we want to find a, the dual solution, that minimizes the dual form of ridge regression (obtained by substituting w = X⊤a): ||XX⊤a − y||² + λ||X⊤a||²
Therefore the entire algorithm:
We will define K=XX⊤ where Kij=k(Xi,Xj)
K is singular if n > d+1 (and sometimes even when it isn't) ⇒ we have to choose a positive λ.
Training: solve the n×n linear system (K + λI)a = y for the dual weights a
Optimized:
Testing: h(z) = ∑_{i=1}^n a_i k(X_i, z)
Comparison between dual and primal
Dual
Solve an n×n linear system, O(n³ + n²d)
Primal
Solve a d×d linear system, O(d³ + d²n)
Therefore we prefer dual when d>n (when we add a lot of polynomial / rbf ... features)
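A small numpy sketch of the dual training and testing steps above; it assumes X and y were already centered as discussed, and uses an RBF kernel purely as an example choice:

import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def kernel_ridge_train(X, y, lam, kernel=rbf_kernel):
    # Build the n x n kernel matrix K_ij = k(X_i, X_j), then solve (K + lam I) a = y.
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.solve(K + lam * np.eye(n), y)    # dual weights a, O(n^3)

def kernel_ridge_predict(a, X, z, kernel=rbf_kernel):
    # h(z) = sum_i a_i k(X_i, z)
    return sum(a_i * kernel(x_i, z) for a_i, x_i in zip(a, X))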
Kernel Perceptrons
Featurized(Primal) Perceptron Algorithm:
Training: while some training point is misclassified (y_i Φ(X_i)⋅w ≤ 0), update w ← w + ε y_i Φ(X_i)
Testing: h(z) = sign(Φ(z)⋅w)
We will Dualize the problem with w=Φ(X)⊤a
Same as before, Let K be Φ(X)Φ(X)⊤, so Ki,j=k(Xi,Xj)
Training: start with a = 0; while some training point is misclassified (y_i (Ka)_i ≤ 0), update a_i ← a_i + ε y_i
Testing: h(z) = sign(∑_{j=1}^n a_j k(X_j, z))
The dual vs. primal running-time comparison is the same as for kernelized ridge regression.
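One possible numpy rendering of the dual perceptron above (the ≤ 0 test lets it start from a = 0; labels are assumed to be ±1):

import numpy as np

def kernel_perceptron_train(X, y, kernel, eps=1.0, max_epochs=100):
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    a = np.zeros(n)                      # dual weights; implicitly w = Phi(X)^T a
    for _ in range(max_epochs):
        updated = False
        for i in range(n):
            if y[i] * (K[i] @ a) <= 0:   # sample point i is misclassified (or on the boundary)
                a[i] += eps * y[i]
                updated = True
        if not updated:                  # every training point classified correctly
            break
    return a

def kernel_perceptron_predict(a, X, z, kernel):
    return np.sign(sum(a_j * kernel(x_j, z) for a_j, x_j in zip(a, X)))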
Kernel Logistic Regression
Training:
Our starting point is a = 0.
SGD: pick a random sample point i, then update a_i ← a_i + ε (y_i − s((Ka)_i))
GD: a ← a + ε (y − s(Ka)), applying the sigmoid s componentwise
Testing: h(z) = s(∑_{j=1}^n a_j k(X_j, z))
High-Dimensional Data
Distribution of random points
Suppose we have a random point p∼N(0,I)∈Rd
The vast majority of the random points are at approximately the same distance from the mean, lying in a thin shell. (Think of the chi-square distribution.)
Angles between random points
For two random points, cos θ has mean 0 and standard deviation of roughly 1/√d; therefore as d becomes larger, the angle θ between them becomes more and more likely to be very close to 90°!
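A quick numpy experiment illustrating both facts (the sample sizes and dimensions are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 10000):
    P = rng.standard_normal((1000, d))         # 1000 points drawn from N(0, I) in R^d
    dists = np.linalg.norm(P, axis=1)          # distances from the mean
    u, v = P[0] / dists[0], P[1] / dists[1]    # two random directions
    angle = np.degrees(np.arccos(np.clip(u @ v, -1, 1)))
    print(d, dists.mean(), dists.std(), angle)
# distances concentrate near sqrt(d) while their spread stays roughly constant (a thin shell),
# and the angle between two random points gets closer and closer to 90 degrees as d grows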
Learning Theory
Range Space (Set System): a pair (P, H), where
P is the set of all possible test/training points
H is the set of hypotheses (ranges/classifiers); each hypothesis h ⊆ P states which subset of P is in class C
Examples of classifiers:
Power set classifier: P is a set of k numbers; H is the power set of P, containing all 2^k subsets of P
This classifier cannot generalize at all: given any subset of P as training data, many hypotheses fit it perfectly, and we cannot tell which of them would generalize well to the points we haven't seen until we see them.
Linear classifier: P = R^d; H is the set of all halfspaces {x : w⋅x ≥ −α}, over all choices of w and α
A great question to ask about a certain class of hypotheses would be:
Let all training points and test points be drawn independently from a probability distribution D defined on the domain P. Then
Suppose our hypothesis is h∈H,
the risk / generalization error R(h) of h is the probability that h misclassifies a random point x drawn from D ⇒ the average test error
the empirical risk / training error R̂(h) is the percentage of training points misclassified by h
Hoeffding's inequality tells us the probability of a bad estimate: P(|R̂(h) − R(h)| > ε) ≤ 2e^{−2ε²n}
"How likely is it that a number drawn from a binomial distribution will be far from its mean"
So... we want to choose the ĥ ∈ H that minimizes R̂(ĥ), a technique called empirical risk minimization
However, it's often computationally infeasible to pick the best hypothesis ⇒ e.g., SVMs find a linear classifier with zero training error when the training data are linearly separable, but when they aren't, soft-margin SVMs find a linear classifier with low training error, not the minimum.
Dichotomies
A dichotomy of X is X∩h, where h∈H
It picks out the training points that h predicts are in class C
Up to 2^n dichotomies. The more dichotomies, the more likely it is that one of them will get lucky and have misleadingly low empirical risk
Given Π dichotomies, the union bound gives P(some dichotomy has |R̂(h) − R(h)| > ε) ≤ 2Π e^{−2ε²n}; set this equal to δ
Therefore, the smaller we make Π and the bigger we make n, the more accurately we can predict the risk from the empirical risk
Suppose we chose ĥ as our hypothesis; then with probability ≥ 1 − δ, R(ĥ) ≤ R(h*) + 2ε
where ε = sqrt((1/(2n)) ln(2Π/δ))
and h∗ being the optimal hypothesis in the class H
Therefore, with enough training data and a limit on the number of dichotomies, empirical risk minimization usually chooses a classifier close to the best one in the hypothesis class.
After choosing δ and ϵ,
the sample complexity is the number of training points needed to achieve this ε with probability ≥ 1 − δ: n ≥ (1/(2ε²)) ln(2Π/δ)
Shatter Function
number of dichotomies: Π_H(X) = |{X ∩ h : h ∈ H}| ∈ [1, 2^n], where n = |X|
shatter function: Π_H(n) = max_{|X|=n, X⊆P} Π_H(X) ⇒ the max number of dichotomies of any training set of size n
It's called a shatter function because H shatters a set X of n points if Π_H(X) = 2^n.
e.g. Linear classifiers in the plane, H = set of all halfplanes, Π_H(3) = 8
Fact: for all range spaces, Π_H(n) is either polynomial, O(n^p), or exponential, 2^n
Cover’s theorem
linear classifiers in R^d allow up to Π_H(n) = 2 ∑_{i=0}^{d} C(n−1, i) dichotomies of n points
For n ≤ d+1, Π_H(n) = 2^n
For n ≥ d+1, Π_H(n) ≤ 2 (e(n−1)/d)^d ⟸ polynomial in n with exponent d
VC Dimension (Vapnik-Chervonenkis Dimension)
The VC dimension VC(H) is the largest n such that Π_H(n) = 2^n; it bounds the degree of a polynomial shatter function
Theorem (Sauer-Shelah): Π_H(n) ≤ ∑_{i=0}^{VC(H)} C(n, i), which is O(n^{VC(H)}) when the VC dimension is finite.
Therefore, the generalization bounds above apply with ln Π_H(n) = O(VC(H) ln n), so a finite VC dimension means the empirical risk converges to the true risk as n grows.
Also: O(VC(H)) training points suffice for accuracy (hidden constant is big though)
Algorithms
Perceptron Algorithm
Labels: y_i = +1 if X_i is in class C, y_i = −1 otherwise
Goal:
Find weights w such that X_i⋅w ≥ 0 when y_i = +1 and X_i⋅w ≤ 0 when y_i = −1 (equivalently, y_i X_i⋅w ≥ 0 for all i)
Loss Function: L(z, y_i) = 0 if y_i z ≥ 0, and −y_i z otherwise
Cost/Risk Function: R(w) = ∑_{i=1}^n L(X_i⋅w, y_i) = ∑_{i∈V} −y_i X_i⋅w, where V is the set of misclassified points
If w classifies all X1,…,Xn correctly, then R(w)=0.
Optimizer:
Gradient Descent, SGD.
Both are guaranteed to work if the data are linearly separable
Guaranteed to classify all the data correctly after at most O(r²/γ²) iterations, where r = max_i ||X_i|| is the radius of the data and γ is the maximum margin (defined under the max-margin classifier); a sketch follows below
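A minimal numpy sketch of the (primal) perceptron trained with SGD; it assumes a 1s column has already been appended to X if a bias term is wanted, and labels are ±1:

import numpy as np

def perceptron_train(X, y, eps=1.0, max_epochs=1000):
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for Xi, yi in zip(X, y):
            if yi * (Xi @ w) <= 0:       # misclassified (or on the decision boundary)
                w = w + eps * yi * Xi    # SGD step on the perceptron risk
                mistakes += 1
        if mistakes == 0:                # linearly separable data: the loop terminates
            break
    return w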
Hard-Margin Support Vector Machine (Hard-Margin SVM)
The problem with the perceptron algorithm is that it can fit infinitely many hyperplanes to the data, since it only asks the weight vector to classify the training data correctly (it can fit the training data 100% but may not generalize well to the validation or test set). This can lead to increased variance.
We will enforce the constraints:
Note that the RHS is a 1, which makes it impossible for the weight vector w to be set to zero.
Therefore the optimization problem: find w and α that minimize ||w||² subject to y_i (X_i⋅w + α) ≥ 1 for all i
It's a quadratic program in d+1 dimensions with n constraints, and it has a unique solution (if the points are linearly separable).
The reason that we don’t want to use ∣∣w∣∣ is because it is not smooth at w=0, whereas ∣∣w∣∣2 is smooth everywhere, making it easier to optimize.
Soft-Margin SVM
Idea: Allow some points to violate the margin with slack variables ξi≥0.
Now we re-define the margin to be 1/∣∣w∣∣, since the margin is no longer the distance from the decision boundary to the nearest sample point.
Therefore the optimization problem: find w, α, and ξ_i that minimize ||w||² + C ∑_{i=1}^n ξ_i subject to y_i (X_i⋅w + α) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i
C>0 is a scalar regularization hyperparameter that trades off
small C: desire = maximize the margin 1/||w||; danger = underfitting; outliers = less sensitive; boundary = more "flat"
big C: desire = keep most slack variables zero or small; danger = overfitting; outliers = very sensitive; boundary = more sinuous
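Soft-margin SVMs are normally solved as a quadratic program; as a simpler stand-in, the sketch below minimizes the equivalent hinge-loss objective ||w||² + C ∑_i max(0, 1 − y_i(X_i⋅w + α)) by subgradient descent (step size and iteration count are arbitrary):

import numpy as np

def softmargin_svm_subgradient(X, y, C=1.0, eps=1e-3, steps=5000):
    n, d = X.shape
    w, alpha = np.zeros(d), 0.0
    for _ in range(steps):
        margins = y * (X @ w + alpha)
        violators = margins < 1                             # inside the margin or misclassified
        g_w = 2 * w - C * (y[violators] @ X[violators])     # subgradient w.r.t. w
        g_alpha = -C * np.sum(y[violators])                 # subgradient w.r.t. alpha
        w -= eps * g_w
        alpha -= eps * g_alpha
    return w, alpha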
Gaussian Discriminant Analysis(GDA)
Isotropic Gaussians: f(x) = (1 / (σ√(2π))^d) exp(−||x − μ||² / (2σ²))
Anisotropic Gaussians (different variances along different directions): f(x) = (1 / ((√(2π))^d √|Σ|)) exp(−½ (x − μ)⊤Σ^{−1}(x − μ))
Where Σ is the d×d positive semidefinite covariance matrix.
Positive semidefinite: w⊤Σw ≥ 0 for all w
And Σ^{−1} is the d×d symmetric positive definite precision matrix (when Σ is invertible).
Let q(x) = (x − μ)⊤Σ^{−1}(x − μ); then f(x) = n(q(x)), where n(⋅) is a simple, monotonic, convex function (essentially an exponential of the negative of its argument). n(⋅) does not affect the isosurfaces.
q(x) is the squared distance from Σ−1/2x to Σ−1/2μ.
d(x, μ) = ||Σ^{−1/2}x − Σ^{−1/2}μ|| = √((x − μ)⊤Σ^{−1}(x − μ)) = √(q(x))
Important fact: if you understand the isosurfaces of a quadratic form, then you understand the isosurfaces of a Gaussian; the only difference is that the maximum isovalue is at the mean of the Gaussian.
Suppose only two classes C, D
Bayes decision rule r*(x) predicts the class C that maximizes the posterior P(Y=C∣X=x) ∝ f(X=x∣Y=C) π_C
Since ln ω is monotonically increasing for ω > 0, maximizing f(X=x∣Y=C) π_C is equivalent to maximizing its logarithm, Q_C(x) = ln f(X=x∣Y=C) + ln π_C
MLE for Isotropic Gaussian
For an isotropic Gaussian, maximize the likelihood L(μ, σ; X_1, …, X_n) = f(X_1) f(X_2) ⋯ f(X_n)
or, equivalently, the "log likelihood" ℓ(μ, σ) = ∑_{i=1}^n ln f(X_i)
After computing the derivatives and setting them to zero, we see μ̂ = (1/n) ∑_{i=1}^n X_i and σ̂² = (1/(dn)) ∑_{i=1}^n ||X_i − μ||²
We don't know μ exactly, so substitute μ̂ for μ to compute σ̂²
Quadratic Discriminant Analysis (QDA)
compute the conditional mean μ̂_C and conditional variance σ̂_C² (or covariance matrix Σ̂_C) of each class C separately
And estimate the priors π̂_C = n_C / n (the number of points in class C divided by the total number of sample points)
Note: if the covariance matrix has zero eigenvalues, standard QDA doesn't work. We can fix it by adding a regularization term λI to the matrix Λ, where Σ̂_C = VΛV⊤, and use Σ̃_C = V(Λ + λI)V⊤ as the new covariance matrix. We want to keep λ relatively small so it doesn't affect the covariance matrix too much.
Note: for two classes C and D, the posterior is P(Y=C∣X=x) = s(Q_C(x) − Q_D(x)), where s(⋅) is the sigmoid function.
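A compact numpy sketch of QDA prediction via the standard discriminant Q_C(x) = −½(x−μ_C)⊤Σ_C^{−1}(x−μ_C) − ½ ln|Σ_C| + ln π_C (np.cov uses the n−1 normalization rather than the MLE, which is close enough for a sketch):

import numpy as np

def fit_qda(X, y):
    # Per-class mean, covariance, and prior, each estimated from that class's points only.
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def qda_discriminant(x, mu, Sigma, prior):
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * np.linalg.slogdet(Sigma)[1]
            + np.log(prior))

def qda_predict(x, params):
    # Pick the class whose discriminant (hence posterior) is largest.
    return max(params, key=lambda c: qda_discriminant(x, *params[c]))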
Linear Discriminant Analysis (LDA)
compute the conditional mean μ̂_C of each class C separately
Assuming the same variance across all classes eliminates the quadratic term in the optimization problem, so the decision boundary is linear.
Regression
Need:
Regression function h(x∣p), p is parameter
Cost function to optimize
Regression Functions.
(1) Linear: h(x∣w, α) = w⋅x + α
(2) Polynomial [equivalent to linear with added polynomial features]
(3) Logistic: h(x∣w, α) = s(w⋅x + α) ⇒ a neural net with no hidden layer, a single output node, and a sigmoid activation. Only separates linearly separable classes.
Loss Functions:
(A) Squared Error L(ŷ, y) = (ŷ − y)²
(B) Absolute Error L(y^,y)=∣y^−y∣
(C) Logistic Loss (Cross-entropy loss) L(y^,y)=−yln(y^)−(1−y)ln(1−y^)
Cost Functions to Optimize:
(a) Mean Loss J(h) = (1/n) ∑_{i=1}^n L(h(X_i), y_i)
(b) Maximum Loss J(h) = max_{i=1,…,n} L(h(X_i), y_i)
(c) Weighted Sum J(h) = ∑_{i=1}^n ω_i L(h(X_i), y_i)
(d) ℓ2 Penalized J(h) = J_0(h) + λ||w||², where J_0(h) is one of (a), (b), or (c)
(e) ℓ1 Penalized J(h) = J_0(h) + λ||w||_1
Least Squares Linear Regression
(1) Linear Function + (A) Squared Error + (a) Mean Loss
By calculus, the minimizer satisfies the normal equations X⊤Xw = X⊤y, so w = (X⊤X)^{−1}X⊤y when X⊤X is invertible.
However if X⊤X is singular, then the problem is underconstrained (infinitely many solutions for w).
X⊤X is always PSD; it is singular when the sample points all lie on a common hyperplane through the origin in the lifted (d+1)-dimensional feature space. If X⊤X is singular, we can use the pseudoinverse of this matrix to compute w.
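A one-function sketch; np.linalg.lstsq works through the SVD/pseudoinverse, so it also handles the singular case mentioned above (the bias term is assumed to be an appended 1s column):

import numpy as np

def least_squares(X, y):
    # Minimizes ||Xw - y||^2; equivalent to solving the normal equations X^T X w = X^T y,
    # but done via the pseudoinverse, which tolerates a singular X^T X.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w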
Logistic Regression
(3) Logistic Function + (C) Logistic Loss + (a) Mean Loss
Solved by Gradient Descent
In LDA, the posterior probabilities are often modeled well by a logistic function. So why not just try to fit a logistic function directly to the data, skipping Gaussian modeling?
The gradient (of the summed logistic cost) is ∇J(w) = −∑_{i=1}^n (y_i − s_i) X_i, where s_i = s(X_i⋅w)
Can also be solved with Newton's method (iteratively reweighted least squares).
Least-Squares Polynomial Regression
Replace each Xi with all terms of degree 0...p
Not numerically stable... higher-degree polynomials tend to oscillate and do not extrapolate well beyond the data.
Developed from MLE for multivariate normal distributions, see lecture 12 for detail.
Weighted Least Squares Regression
(1) Linear Function + (A) Squared Loss + (c) Weighted Sum Cost
Collect the weights ω_i in an n×n diagonal matrix Ω.
Find w that minimizes (Xw − y)⊤Ω(Xw − y) = ∑_{i=1}^n ω_i (X_i⋅w − y_i)²
Solve for w in the normal equations: X⊤ΩXw = X⊤Ωy
Ridge Regression
(1) Linear Function + (A) Squared Loss + (d - a) L2 Penalized Mean Loss
Guarantees positive definite normal equations
Always unique solution
After setting ∇J = 0, we get the normal equations (X⊤X + λI′)w = X⊤y, i.e. w = (X⊤X + λI′)^{−1}X⊤y
Where I′ is the identity matrix with its last diagonal entry set to zero (so we don't penalize the bias term α)
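A small sketch of the closed-form solve; it assumes (as an illustrative convention) that the bias feature is the last column of X, so I′ leaves that weight unpenalized:

import numpy as np

def ridge(X, y, lam):
    # Solve (X^T X + lambda I') w = X^T y, with I' the identity except for a 0 at the bias entry.
    d = X.shape[1]
    I_prime = np.eye(d)
    I_prime[-1, -1] = 0.0
    return np.linalg.solve(X.T @ X + lam * I_prime, X.T @ y)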
Bayesian Justification for Ridge Regression:
Suppose we have a prior on w′: w′ ∼ N(0, σ²)
then the posterior is proportional to the likelihood times the prior, f(w∣X, y) ∝ f(y∣X, w) f(w′)
Maximize the log posterior ⇒ maximize ln f(y∣X, w) + ln f(w′), whose two terms are (up to constants) proportional to −||Xw − y||² and −||w′||²
So therefore minimize ||Xw − y||² + λ||w′||²
Lasso Regression
(1) Linear Function + (A) Squared Loss + (e - a) L1 Penalized Mean Loss
Similar to ridge regression, but with the advantage that it often naturally sets some of the weights exactly to zero (not always; because of the geometry of the ℓ1 penalty, the weights are more likely to come to rest at points where some coordinates are 0)
The isosurfaces of ||w′||_1 are cross-polytopes (we write w′ because we don't want to penalize α)
The unit cross-polytope is the convex hull of all the positive and negative unit coordinate vectors (the tightest wrapping that just encloses all those points).
Compared to ridge regression, the solution given by lasso is more likely to be sparse, because the isosurfaces of the regularization term and the isosurfaces of the squared error are most likely to meet at the corners of the cross-polytope, which lie along the unit coordinate vectors.
Lasso can be reformulated as a quadratic program; subgradient descent, least-angle regression (LARS), and forward stagewise algorithms can be used to solve it.
Decision Trees
Still widely used; fast, interpretable, invariant to scaling/translation, robust to irrelevant features, and can achieve 100% training accuracy, but prone to overfitting. Random forests and AdaBoost are used to counteract some of the overfitting.
Cuts x-space into rectangular cells
Works well with both categorical(split by subset) and quantitative(split by boundary) features
Interpretable result
Decision boundary can be arbitrarily complicated
Learning:
def GrowTree(S):
    if (training points in S have same class) or (node exceeds maximum depth):
        return most common class in S
    else:
        choose best splitting feature j and splitting value (or subset) beta
        # filter training points into the left split (X_j <= beta) and the right split (X_j > beta)
        return Node(
            j,
            beta,
            GrowTree(left training points),
            GrowTree(right training points)
        )
How to select best split:
Try all splits
Choose split that maximizes information gain H(S)−Hafter
Where H(S) is the entropy of an index set S ⇒ the average surprise
The surprise of Y being in class C is −log₂(p_C)
an event with probability 1 gives us zero surprise
an event with probability 0 gives us infinite surprise
The entropy of an index set S is the average surprise: H(S) = −∑_C p_C log₂(p_C), where p_C is the fraction of points in S that belong to class C
If all points in S belong to the same class ⇒ H(S) = −1 · log₂(1) = 0
Half class C, half class D ⇒ H(S) = −0.5 log₂(0.5) − 0.5 log₂(0.5) = 1
n points, all different classes ⇒ H(S) = −log₂(1/n) = log₂(n)
Where Hafter is the weighted average entropy after the split
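A short numpy sketch of the entropy and information-gain computations used to score a candidate split (left_mask marks which points go to the left child; names are illustrative):

import numpy as np

def entropy(labels):
    # H(S) = -sum_C p_C log2 p_C, the average surprise of the labels in S
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(labels, left_mask):
    # H(S) minus the weighted average entropy of the two children
    left, right = labels[left_mask], labels[~left_mask]
    H_after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - H_after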
Multivariate splits
Standard decision trees are fast because each node tests only one feature. A multivariate split tests a linear combination of many features at once, which can give a better decision boundary but slows down training and classification a lot when there are hundreds of features.
Pruning
Grow tree
greedily remove each split whose removal improves validation performance
More reliable than stopping early
Decision Tree Regression
Uses the concept of decision tree
Creates a piecewise constant regression function
Cost: J(S) = ∑_{i∈S} (y_i − μ_S)², the sum of squared deviations of the labels from their mean
Where μ_S is the mean label y_i over the sample points i ∈ S
Random Forests
Idea:
At each treenode, only allowed to split on m features (out of d)
not allowed to split on the other d-m features
m ≈ √d works well for classification, and m ≈ d/3 for regression
m is a hyperparameter
Pros:
Sometimes test error reduction up to 100x or 1000x of decision trees
Neural Networks
If the weights are initialized to zero, there's no difference between one hidden unit and another: they all compute the same thing
Break symmetry by weight initialization
Vanishing/Exploding Gradients ⇒ the gradient fails to propagate through a network that is too deep / the gradient becomes too big
Gradient Clipping(for exploding gradient)
BCE loss for the last layer
Skip connections
Centering the hidden units
replace sigmoids with tanh
Whitening Training Data (to gain a great bowl shape in w-space)
BatchNorm layer
If dimension of data is big, then second-order optimization is not applicable
Newton's method requires too much computation for the Hessian
Alternative - Stochastic Levenberg Marquardt: approximates a diagonal Hessian
To avoid overfitting:
Random init weights
Bagging
L2 Regularization
Dropout
For CNN, see my notes for Andrew Ng.
Principal Component Analysis (PCA)
A technique for dimensionality reduction (a form of unsupervised learning)
Normalize the data? Sometimes yes, sometimes no
If some features are much larger than others, they will tend to dominate the Euclidean distance. So if we have features in different units of measurement we should probably normalize. But if in the same unit of measurement, it depends on the context (usually shouldn’t)
Reason for PCA:
Speed - Reducing number of dimensions makes some computations cheaper
Variance Reduction - Remove irrelevant dimensions to reduce overfitting
Like subset selection, but features are not axis-aligned
instead linear combinations of input features
Finding a small basis for representing maximum variations in complex things
Minimizes the mean squared projection distance
Projection: the orthogonal projection of a point x onto a vector w is x̃ = ((x⋅w)/||w||²) w, which is simply (x⋅w)w when w is a unit vector
We want to pick the best k directions, span{v1,v2,…,vk} to project our inputs onto.
The singular value decomposition of X, X = UΛV⊤,
gives us the best basis to project onto.
Note: here U and V have orthonormal columns and Λ is a diagonal matrix of the singular values σ_i.
Columns of U are unit eigenvectors of XX⊤; columns of V (rows of V⊤) are unit eigenvectors of X⊤X. The singular values σ_i are the square roots (σ_i = √λ_i) of the eigenvalues of both XX⊤ and X⊤X.
We can verify that: XX⊤ = UΛV⊤VΛU⊤ = UΛ²U⊤ and X⊤X = VΛU⊤UΛV⊤ = VΛ²V⊤.
Oh yes, now we’ve expressed those two symmetric matrices as their eigen-decomposition form.
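A minimal numpy PCA sketch built directly on the SVD (returning the projected coordinates is just one of several reasonable conventions):

import numpy as np

def pca(X, k):
    # Center the data, take the SVD, and keep the k right singular vectors with the
    # largest singular values as the principal directions.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # singular values come sorted descending
    directions = Vt[:k]                                 # unit eigenvectors of X^T X
    return Xc @ directions.T                            # coordinates of each point in the k-dim subspace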
Clustering
K-Means Clustering
Normalize the data? Sometimes yes, sometimes no
If some features are much larger than others, they will tend to dominate the Euclidean distance. So if we have features in different units of measurement we should probably normalize. But if in the same unit of measurement, it depends on the context (usually shouldn’t)
Partition n points into k disjoint clusters
Assign each input point Xi a cluster label yi∈[1,k]
Cluster i will have mean μ_i = (1/n_i) ∑_{y_j = i} X_j, given n_i points in cluster i.
Find the assignment y that minimizes ∑_{i=1}^k ∑_{y_j = i} ||X_j − μ_i||²
NP-hard; trying every partition takes O(n k^n)
Lloyd’s Algorithm (Finds a Local Minimum on the Cost Function)
First, start with either
Forgy method: choose k random sample points to be initial μi ⇒ goto 2
Random partition: randomly assign each sample point to a cluster ⇒ goto 1
k-means ++: Like Forgy, but biased distribution so that each center is chosen with a preference for points far from previous centers.
Alternates between
fixing yj, updating μi
Optimal μi is the mean of points in cluster i by calculus
fixing μi, updating yj
Optimal y assigns each point X_j to the closest center μ_i
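A compact numpy sketch of Lloyd's algorithm with Forgy initialization (an empty cluster simply keeps its old center here, which is one of several possible conventions):

import numpy as np

def lloyd(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)].copy()   # Forgy: k random sample points
    for _ in range(max_iter):
        # Fix mu, update y: assign each point to its nearest center.
        y = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Fix y, update mu: each center becomes the mean of its cluster.
        new_mu = np.array([X[y == i].mean(axis=0) if np.any(y == i) else mu[i]
                           for i in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return y, mu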
k-d Trees (for nearest-neighbor search)
We want the cut boxes to be as close to cubical as possible
Or we can instead rotate through the features, which builds the tree faster by a factor of O(d)
Choose splitting value ⇒ median point for feature i
Guarantees ⌊log₂ n⌋ tree depth
Each internal node stores a sample point ⇒ usually it's the splitting point
Build time: O(nd log n), or O(n log n) if we rotate through the features
Query:
ϵ ⇒ parameter for approximate NN ⇒ in high-dimensional space there are likely to be several points nearly equidistant from the query point, so approximate NN saves tons of time
AdaBoost
Idea: we will train T classifiers (weak learners) G_1, …, G_T and combine them into a big metalearner M(z)
Each weak learner gets a different "amount of say" β_t according to how much error it made during training.
Risk Function: R = (1/n) ∑_{i=1}^n L(M(X_i), y_i), where M(z) = ∑_{t=1}^T β_t G_t(z)
(Exponential) Loss Function: L(ρ, y) = e^{−ρy} (labels y ∈ {+1, −1})
With this loss, we will pick the G_T that minimizes the sum of the sample weights over all misclassified points!
Sample Weights: initialize each weight to 1/n; after round T, update w_i ← w_i e^{−β_T y_i G_T(X_i)} (so misclassified points get heavier)
Amount of say (obtained by setting the derivative of the risk to 0): β_T = ½ ln((1 − err_T)/err_T), where err_T is the weighted error rate of G_T
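A hedged numpy sketch of the resulting training loop; fit_weak_learner is an assumed helper that returns a predictor (mapping an array of points to ±1) chosen to minimize the weighted training error, and labels are taken to be ±1:

import numpy as np

def adaboost(X, y, fit_weak_learner, T):
    n = len(X)
    w = np.full(n, 1.0 / n)                     # sample weights, initialized to 1/n
    learners, betas = [], []
    for _ in range(T):
        G = fit_weak_learner(X, y, w)           # weak learner trained on the weighted data
        pred = G(X)
        err = w[pred != y].sum() / w.sum()      # weighted error rate (assumed 0 < err < 1)
        beta = 0.5 * np.log((1 - err) / err)    # "amount of say"
        w = w * np.exp(-beta * y * pred)        # upweight misclassified points, downweight the rest
        w = w / w.sum()
        learners.append(G)
        betas.append(beta)
    return lambda Z: np.sign(sum(b * G(Z) for b, G in zip(betas, learners)))   # metalearner M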