P implies Q is equivalent to (not P or Q), and is equivalent to its contrapositive (not Q implies not P)
Probability Notations
Ω ⇒ Outcome Space
ω ⇒ Single Outcome (a sample point)
Basic Principle of Probability
“Complement Rule”: P(Aᶜ) = 1 − P(A)
“Principle of Inclusion-Exclusion”: ∣A ∪ B∣ = ∣A∣ + ∣B∣ − ∣A ∩ B∣
Similarly, the “Principle of Inclusion-Exclusion” of Probability: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
“Two Events are DISJOINT” ⇒ A ∩ B = ∅, and then P(A ∪ B) = P(A) + P(B)
“Two Events are INDEPENDENT” ⇒ P(A ∩ B) = P(A)·P(B)
Conditional Probability
P(A∣B) denotes the probability of event A occurring given that event B has occurred.
Bayes’ Rule
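For quick reference, the two standard formulas (nothing beyond the usual definitions is assumed here):

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$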
Counting & Sampling
Counting
Counting Sequence(Order Matters) Without Replacement
First Rule of Counting: If an object can be made by a succession of k choices, where there are n1 ways of making the first choice, and for every way of making the first choice there are n2 ways of making the second choice, and for every way of making the first and second choice there are n3 ways of making the third choice, and so on up to nk ways of making the k-th choice, then the total number of distinct objects that can be made in this way is the product n1×n2×n3×⋯×nk.
Counting Sequence With Replacement
Counting Sets without replacement
“n choose k”: C(n,k) = n!/(k!·(n−k)!)
Second Rule of Counting: Assume an object is made by a succession of choices, and the order in which the choices are made does not matter. Let A be the set of ordered objects and let B be the set of unordered objects. If there exists an m-to-1 function f:A→B, we can count the number of ordered objects (pretending that the order matters) and divide by m (the number of ordered objects per unordered objects) to obtain ∣B∣, the number of unordered objects.
Counting Sets with replacement
Say you have unlimited quantities of apples, bananas and oranges. You want to select 5 pieces of fruit to make a fruit salad. How many ways are there to do this? In this example, S = {1, 2, 3}, where 1 represents apples, 2 represents bananas, and 3 represents oranges. k = 5 since we wish to select 5 pieces of fruit. Ordering does not matter; selecting an apple followed by a banana will lead to the same salad as a banana followed by an apple.
It may seem natural to apply the Second Rule of Counting because order does not matter. Let’s consider this method. We first pretend that order matters and observe that the number of ordered objects is 3⁵ = 243, as discussed above. How many ordered options are there for every unordered option? The problem is that this number differs depending on which unordered object we are considering. Let’s say the unordered object is an outcome with 5 bananas. There is only one such ordered outcome. But if we are considering 4 bananas and 1 apple, there are 5 such ordered outcomes (represented as 12222, 21222, 22122, 22212, 22221).
Assume we have one bin for each element of S, so n bins in total. For example, if we selected 2 apples and 1 banana, bin 1 would have 2 elements and bin 2 would have 1 element. In order to count the number of multisets, we need to count how many different ways there are to fill these bins with k elements. We don’t care about the order of the bins themselves, just how many of the k elements each bin contains. Let’s represent each of the k elements by a 0 in the binary string, and separations between bins by a 1.
Counting the number of multisets is now equivalent to counting the number of placements of the k 0’s
The length of our binary string is k + n − 1, and we are choosing which k locations should contain 0’s. The remaining n − 1 locations will contain 1’s.
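A minimal sketch (standard library only) checking the resulting count C(k+n−1, k) against brute-force enumeration for the fruit-salad example above; the variable names are illustrative:

```python
# Stars and bars: number of multisets of size k from n types = C(k + n - 1, k).
from itertools import combinations_with_replacement
from math import comb

n, k = 3, 5  # 3 fruit types, 5 picks (the example above)
multisets = list(combinations_with_replacement(range(n), k))
print(len(multisets))      # 21, by brute-force enumeration
print(comb(k + n - 1, k))  # C(7, 5) = 21, by the formula
```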
Zeroth Rule of Counting: If a set A can be placed into a one-to-one correspondence with a set B (i.e. you can find a bijection between the two — an invertible pair of maps that map elements of A to elements of B and vice-versa), then |A| = |B|.
Sampling
A population of N total, with G good and B bad. We take a sample of size n = g + b (g good and b bad in the sample), with 0 ≤ g ≤ n
Sampling Sets with replacement
Sampling Sets without replacement (Hypergeometric Dist.)
Probability Concepts
Consecutive Odds Ratios
Mainly used for the binomial distribution
“Analyze the chance of k successes relative to k−1 successes”: for X ∼ B(n,p), P(X=k)/P(X=k−1) = ((n−k+1)/k)·(p/q)
Law of Large Numbers
As n→∞, the sample average is, with probability approaching 1, arbitrarily close to the true distribution mean: P(∣(1/n)·Sn − μ∣ < ϵ) → 1, no matter how small ϵ > 0 is.
Random Variable
A random variable X on a sample space Ω is a function X:Ω→R that assigns to each sample point ω∈Ω a real number X(ω).
Probability of a value of a discrete random variable
Has to satisfy: P(X = a) ≥ 0 for every a, and Σa P(X = a) = 1
Note:
For discrete r.v., the CDF(cumulative distribution function) would be a step function
Probability of a value of a continuous random variable
Probability of a range of value of a continuous random variable
Has to satisfy: f(x) ≥ 0 for all x, and ∫ f(x) dx = 1 over the whole range
Continuous Random Variable
For continuous r.v.
There’s a Probability Density Function(PDF) and a Cumulative Distribution Function(CDF)
Inverse Distribution Function
For what value of x is there probability 1/2 that X ≤ x?
Either calculate the inverse function F⁻¹(p), or solve the equation F(x) = p for x with p treated as a known constant.
Inverse CDF applied to the standard normal
For any cumulative distribution function F with inverse function F⁻¹, if U has the uniform (0,1) distribution, then F⁻¹(U) has CDF F(x).
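A minimal sketch of this fact in code (inverse transform sampling), using the Exponential(λ) CDF F(x) = 1 − e^(−λx) as the example; λ and the sample size are illustrative choices, not from the notes:

```python
# Inverse transform sampling: if U ~ Uniform(0,1), then F_inv(U) has CDF F.
import math
import random

def exp_inverse_cdf(p, lam):
    """F^-1(p) for the Exponential(lam) CDF F(x) = 1 - exp(-lam * x)."""
    return -math.log(1.0 - p) / lam

lam, n = 2.0, 100_000
samples = [exp_inverse_cdf(random.random(), lam) for _ in range(n)]
print(sum(samples) / n)  # sanity check: should be close to E(T) = 1/lam = 0.5
```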
Probability Distribution
The distribution of a discrete random variable X is the collection of values {(a,P[X=a]):a∈A}, where A is the set of all possible values taken by X.
The distribution of a continuous random variable X is defined by its Probability Density Function f(x).
Example: for a die, where X denotes the number showing when we roll the die once.
Joint Distribution
The joint distribution for two discrete random variables X and Y is the collection of values {(∀(a,b),P[X=a,Y=b]):a∈A,b∈B}, where A is the set of all possible values taken by X and B is the set of all possible values taken by Y
And the joint distribution for two continuous random variables X and Y is defined by the joint probability density function f(x,y). It gives the density of probability per unit area for values of (X,Y) near the point (x,y). P(X∈dx,Y∈dy)=f(x,y)dxdy.
In other words, list the probabilities of all possible pairs of X and Y values.
Has to satisfy the following constraint: Σa Σb P[X=a, Y=b] = 1 (discrete), or ∬ f(x,y) dx dy = 1 (continuous)
One useful equation:
Symmetry of Joint Distribution
Let X1,X2,…,Xn be random variables with joint distribution defined by P(x1,…,xn) = P(X1=x1, …, Xn=xn)
Then the joint distribution is symmetric if P(x1,…,xn) is a symmetric function of (x1,…,xn). In other words, the value of P(x1,…,xn) is unchanged if we permute its arguments in any way.
If the joint distribution is symmetric, then all n! possible orderings of the random variables X1,…,Xn have the same joint distribution and this means that X1,…,Xn are exchangeable. Exchangeable random variables have the same distribution.
Independence of random variables
Random variables X and Y on the same probability space are said to be independent if the events X=a and Y=b are independent for all values a, b. Equivalently, the joint distribution of independent r.v.’s decomposes as P[X=a, Y=b] = P[X=a]·P[Y=b] for all a, b.
Conditional Distribution
Usually used when two variables are dependent.
“Conditional Distribution of Y given X=x”
Continuous Variables: f_Y(y ∣ X=x) = f(x,y)/f_X(x)
Rule of Average Conditional Probabilities: P(A) = Σx P(A ∣ X=x)·P(X=x)
Conditional Expectation
E(Y∣X) is defined as the function of X whose value at X = x is E(Y∣X=x)
Properties of Conditional Expectation: E[E(Y∣X)] = E(Y); E[g(X)·Y ∣ X] = g(X)·E[Y∣X]; and if X and Y are independent, E(Y∣X) = E(Y)
Identical Distribution
X and Y have the same range, and for every possible value a in the range, P(X = a) = P(Y = a)
If X and Y have the same distribution, then...
any statement (event) about X has the same probability as the same statement about Y
g(X) has the same distribution as g(Y)
Combination of Random Variables
If Z=Y/X, Then Z∈dz is drawn in the following diagram
Note the area of the heavily shaded region here. It looks similar to a parallelogram. So area would be dx⋅[(z+dz)x−zx]=dxdz∣x∣.
Therefore, if we integrate out X,
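Writing that integral out explicitly (the standard ratio-density formula, with f the joint density of X and Y):

$$f_Z(z) = \int_{-\infty}^{\infty} |x|\, f(x, zx)\, dx$$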
Equality of random variables
“Equality implies identical distribution”
Probability of Events of two Random Variables
Symmetry of R.V.
The random variable X is said to be symmetric around 0 if X and −X have the same distribution, i.e. P(X ≥ x) = P(X ≤ −x) for all x
Linear Function Mapping of Continuous Random Variable
suppose Y=aX+b and X has PDF fX(x)
then Y has PDF fY(y) = (1/∣a∣)·fX((y−b)/a)
One-to-one Differentiable Function of Continuous Random Variable
Let X be a r.v. with density fX(x) on the range (a,b). Let Y=g(X) where g is strictly increasing or decreasing on the interval (a,b).
For an infinitesimal interval dy near y, the event Y∈dy is identical to the event X∈dx.
Thus fY(y)∣dy∣ = fX(x)∣dx∣. The absolute values are added because for a decreasing function g(x), dx and dy have opposite signs and only the ratio of their magnitudes matters; thus fY(y) = fX(x)·∣dx/dy∣ = fX(x)/∣g′(x)∣, where x = g⁻¹(y)
Change of Variable Principle
If X has the same distribution as Y, then g(X) has the same distribution as g(Y), for any function g(⋅)
Max and Min of Independent R.V.s
CDF makes it easy to find dist of max and mins.
For any number x:
Xmax≤x≡(∀i,Xi≤x)
Xmin≥x≡(∀i,Xi≥x)
So P(Xmax ≤ x) = ∏i P(Xi ≤ x) and P(Xmin ≥ x) = ∏i P(Xi ≥ x); in the i.i.d. case with common CDF F, P(Xmax ≤ x) = F(x)^n and P(Xmin ≤ x) = 1 − (1 − F(x))^n (a quick check by simulation is sketched below).
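A minimal simulation sketch of the i.i.d. case, checking P(Xmax ≤ x) = F(x)^n for Uniform(0,1) variables (so F(x) = x); n, x, and the trial count are illustrative:

```python
# Check P(max of n iid Uniform(0,1) variables <= x) against the formula x^n.
import random

n, x, trials = 5, 0.7, 200_000
hits = sum(max(random.random() for _ in range(n)) <= x for _ in range(trials))
print(hits / trials)  # empirical estimate of P(Xmax <= x)
print(x ** n)         # theoretical value, 0.7^5 = 0.16807
```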
Expectation
For Discrete R.V.: E(X) = Σx x·P(X=x)
For continuous R.V.: E(X) = ∫ x·f(x) dx
If we have a non-negative r.v., E(X) = Σ_{k≥1} P(X ≥ k) (integer-valued case), or E(X) = ∫₀^∞ P(X > t) dt (continuous case)
Linearity of Expectation: E(aX + bY) = a·E(X) + b·E(Y), whether or not X and Y are independent
Independent R.V. Expectation
For Independent R.V. X,Y, we have: E(XY) = E(X)·E(Y)
Variance: Var(X) = E[(X − E(X))²] = E(X²) − (E(X))²
Variance of the sum of n variables: Var(X1 + ⋯ + Xn) = Σi Var(Xi) + 2·Σ_{i<j} Cov(Xi, Xj)
Coefficient Property: Var(aX + b) = a²·Var(X)
Independent R.V. Variance
For Independent R.V. X,Y, we have: Var(X + Y) = Var(X) + Var(Y)
Standard Deviation σ = √Var(X)
Standardization of a Random Variable
“X in standard units”: X* = (X − E(X))/SD(X), with E(X*) = 0 and Var(X*) = SD(X*) = 1
Skewness of R.V.
Let X be a random variable with E(X) = μ and Var(X) = σX². Let X* = (X − μ)/σX be X in standard units; the skewness of X is defined as E[(X*)³].
Markov’s Inequality: P(X ≥ a) ≤ E(X)/a for every a > 0
Note: For nonnegative R.V.s only
Chebyshev’s Inequality: P(∣X − μ∣ ≥ kσ) ≤ 1/k² for every k > 0
Order Statistics
Let X1,X2,X3,…,Xn be i.i.d. random variables with pdf f(x) and cdf F(x)
Denote by X(1),X(2),…,X(n) the smallest, the second smallest, etc. among X1,…,Xn. Then what is the distribution of X(i)?
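The standard answer, stated here for reference since the formula itself is not reproduced above: for the i-th smallest of n i.i.d. draws,

$$f_{X_{(i)}}(x) = \frac{n!}{(i-1)!\,(n-i)!}\; F(x)^{\,i-1}\,\bigl(1-F(x)\bigr)^{\,n-i}\, f(x)$$

For Uniform(0,1) variables this is the Beta(i, n−i+1) density, matching the Beta section at the end of these notes.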
Covariance Cov(X,Y) = E[(X − E(X))·(Y − E(Y))] = E(XY) − E(X)·E(Y)
| Condition | Description |
| --- | --- |
| Cov(X,Y) > 0 | X and Y are positively dependent: P(X∣Y) > P(X), P(Y∣X) > P(Y) |
| Cov(X,Y) = 0 | X and Y are uncorrelated (covariance equal to zero doesn’t always imply independence) |
| Cov(X,Y) < 0 | X and Y are negatively dependent: P(X∣Y) < P(X), P(Y∣X) < P(Y) |
Covariance of the same variable: Cov(X,X) = Var(X)
Bilinearity of Covariance
Standard Form:
Simpler Form:
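The two forms themselves are not reproduced above; as a hedged restatement of what they likely referred to, the general bilinearity identity and its most-used special case are:

$$\operatorname{Cov}\Bigl(\sum_i a_i X_i,\ \sum_j b_j Y_j\Bigr) = \sum_i \sum_j a_i b_j \operatorname{Cov}(X_i, Y_j), \qquad \operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y)$$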
Correlation
Because it is hard to interpret the magnitude of Cov(X,Y), we standardize it to the correlation: Corr(X,Y) = Cov(X,Y)/(SD(X)·SD(Y)), which always lies in [−1, 1]
Moment Generating Function
Let X be a random variable; then the MGF (Moment Generating Function) of X is defined as M_X(t) = E(e^(tX))
Important Properties
Equality of MGF means Equality of CDF
Jensen’s Inequality: for a convex function g, E[g(X)] ≥ g(E[X]) (reversed for concave g)
Upper tail of random variable using Markov’s Inequality: for any t > 0, P(X ≥ a) = P(e^(tX) ≥ e^(ta)) ≤ M_X(t)·e^(−ta)
Linear Transformation of Random Variable: M_(aX+b)(t) = e^(bt)·M_X(at)
Linear Combination of Independent Random Variables: if X and Y are independent, M_(X+Y)(t) = M_X(t)·M_Y(t)
Additional Topic - MLE and MAP
Say we have observations x and we want to estimate a parameter θ; in the MAP setting, θ is itself treated as a random variable with a prior distribution.
MLE does the following: θ̂_MLE = argmax_θ P(x ∣ θ)
MAP does the following: θ̂_MAP = argmax_θ P(θ ∣ x)
MLE - search over parameter values and pick the θ under which producing the observed x is most likely, i.e. maximize the likelihood P(x∣θ).
MAP - search over parameter values and pick the θ that is most likely to be right given the observed data, i.e. maximize the posterior P(θ∣x).
By Bayes’ Theorem, P(θ∣x) = P(x∣θ)·P(θ)/P(x)
Note that
Since P(x) is independent of individual θ, we can view P(x) as a constant, and therefore,
Therefore, MAP can also be written as θ̂_MAP = argmax_θ P(x∣θ)·P(θ)
Properties of MLE and MAP
Note that both MLE and MAP are point estimators: if the parameter is continuous (and usually this is the case), they pick the single maximizing point (the mode) of the likelihood or posterior, not a value chosen by area such as the median.
MLE is more prone to overfitting than MAP, since the prior in MAP constrains the estimated parameter to a plausible region.
The asymptotic behavior of MAP and MLE is the same: as we collect more data, the results of MAP and MLE tend to converge.
Special Property of MLE (not applicable to MAP): T=g(θ)⟹TMLE=g(θMLE)
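A minimal sketch contrasting the two estimators on coin flips; the Beta(a, b) prior and the specific numbers are illustrative assumptions, not from the notes (both estimators are closed-form for this model):

```python
# MLE vs. MAP for the bias p of a coin, from `heads` successes in n flips.
def coin_mle(heads: int, n: int) -> float:
    # argmax_p of the likelihood p^heads * (1-p)^(n-heads)
    return heads / n

def coin_map(heads: int, n: int, a: float = 2.0, b: float = 2.0) -> float:
    # With a Beta(a, b) prior the posterior is Beta(heads + a, n - heads + b);
    # MAP returns the posterior mode.
    return (heads + a - 1) / (n + a + b - 2)

heads, n = 9, 10
print(coin_mle(heads, n))  # 0.9   -- follows the data exactly (overfits easily)
print(coin_map(heads, n))  # 0.833 -- pulled toward the prior mean of 0.5
```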
Distribution
Bernoulli (Indicator) Distribution I ∼ Bernoulli(p): P(I = 1) = p, P(I = 0) = 1 − p; E(I) = p, Var(I) = p(1−p)
Uniform Distribution on a finite set
Suppose we have a finite set of equally likely outcomes, Ω = {ω1, ω2, …, ωn}; then P(ωi) = 1/n for each i
Uniform(a,b) distribution
Distribution of a point picked uniformly at random from the interval (a,b)
For a < x < y < b the probability that the point falls in the interval (x,y) is (y − x)/(b − a)
for b − a = 1, the long-run frequency of landing in (x,y) is almost exactly equal to y − x.
Empirical Distribution
As opposed to a theoretical distribution, the empirical distribution is the distribution of your observed data.
Suppose we have X={x1,x2,…,xn}
In other words, Pn(a,b) = #{i : a < xi < b}/n gives the proportion of the numbers in the list that lie in the interval (a,b)
Estimating Empirical Distribution With Continuous PDF
The distribution of a data list can be displayed in a histogram, and such histogram smoothes out the data to display the general shape of the empirical distribution.
Such histograms often follow a smooth curve f(x), and it is safe to assume f(x) ≥ 0 for all x
The idea is that if (a,b) is a bin interval, then the area under the bar between a and b should be roughly equal to the area under the curve between a and b ⇒ the proportion of data from a to b is roughly equal to the area under f(x).
f(x) functions like a continuous PDF estimation for distribution of data.
Now we can also write this proportion as the average of an indicator over the data list: the proportion of the list in (a,b) is (1/n)·Σi I(a,b)(xi) ≈ area under f(x) from a to b
where I(a,b)(xi) is an indicator stating whether xi is in the range (a,b)
Integration Approximation of Averages
If the empirical distribution of the list (x1,x2,…,xn) is well approximated by the theoretical distribution with PDF f(x), then the average value of a function g(x) over the n values can be approximated by (1/n)·Σi g(xi) ≈ ∫ g(x)·f(x) dx
Binomial Distribution X∼B(n,p)
“probability of k successes in n trials with success rate of p”: P(X = k) = C(n,k)·p^k·(1−p)^(n−k) for k = 0, 1, …, n
Note: “Binomial Expansion”: (a + b)^n = Σ_{k=0}^{n} C(n,k)·a^k·b^(n−k)
Square Root Law
For large n, in n independent trials with probability p of success on each trial:
the number of successes will, with high probability, lie in a relatively small region centered on np, with width a moderate multiple of √n on the numerical scale
the proportion of successes will, with high probability, lie in a small interval centered on p, with width a moderate multiple of 1/√n
Normal Distribution as an approximation for Binomial Distribution
provided n is large enough in X∼B(n,p), X can be approximated with normal distribution
If approximating the probability of a to b successes in a binomial distribution, use b + 1/2 and a − 1/2 as the boundaries instead of b and a; this is called the continuity correction. Very important for small values of npq.
Note: for f(x), use μ=E(X) and σ2=Var(X) of the Binomial distribution
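A minimal sketch of the approximation with the continuity correction, compared against the exact binomial probability; n, p, a, b are illustrative values:

```python
# Normal approximation to P(a <= X <= b) for X ~ Binomial(n, p),
# using the +/- 1/2 continuity correction.
from math import comb, erf, sqrt

def binom_prob(n, p, a, b):
    """Exact P(a <= X <= b)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, b + 1))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p, a, b = 100, 0.3, 25, 35
mu, sigma = n * p, sqrt(n * p * (1 - p))
exact = binom_prob(n, p, a, b)
approx = normal_cdf(b + 0.5, mu, sigma) - normal_cdf(a - 0.5, mu, sigma)
print(exact, approx)  # the two numbers should agree closely
```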
W(n,p) denotes the WORST ERROR in the normal approximation to the binomial distribution.
Poisson Distribution as an approximation for binomial distribution
When n is large and p is close to 0, the normal distribution cannot properly approximate the binomial distribution, so use the Poisson instead! (If p is close to 1, swap the roles of success and failure, i.e. use p′ = q = 1 − p, and mirror the approximation.)
Probability of the Most Likely Number of Successes
Normal Distribution X ∼ N(μ, σ²), with PDF f(x) = (1/(σ√(2π)))·e^(−(x−μ)²/(2σ²))
Standard Normal Distribution (μ=0,σ=1)
Where z = (x − μ)/σ, which for the standard normal (μ = 0, σ = 1) is just z = x
Skew-Normal Approximation
“Third derivative of ϕ(z)”
“Skew-normal PDF”
“Skew-normal CDF”
“0 to b success rate” for binomial distribution
where z = (b + 1/2 − μ)/σ, with μ = np and σ = √(npq)
Sum of Independent Normal Variables
If X∼N(λ,σ2) and Y∼N(μ,r2), then X+Y∼N(λ+μ,σ2+r2)
Joint Distribution for Independent Standard Normal Distributions
let X and Y be standard normal dist, that is X,Y∼N(μ=0,σ2=1)
Rayleigh Distribution R∼Rayleigh
It’s the distribution of the radius R = √(X² + Y²) under the above joint distribution of two independent standard normals.
Derivation of Rayleigh Distribution
And therefore f_R(r) = 2π·c²·r·e^(−r²/2) for r > 0, where c is the (as yet unknown) normalizing constant of the one-dimensional normal density
Notice that f_R(r) must integrate to 1 over (0,∞); calculating this, we see that ∫₀^∞ f_R(r) dr = 2πc², and therefore c = 1/√(2π).
Chi-Square Distribution
Joint density of n independent standard normal variables at every point on the sphere of radius r in n-dimensional space is (2π)^(−n/2)·e^(−r²/2)
For independent standard normal Zi, let Rn = √(Z1² + Z2² + ⋯ + Zn²) denote the distance from the origin in n-dimensional space
So the n-dimensional volume of a thin spherical shell of thickness dr at radius r is cn·r^(n−1)·dr
Where cn is the (n−1)-dimensional volume of the “surface” of a sphere of radius 1 in n dimensions
Through a change of variable and evaluating cn we see that Rn² ∼ Gamma(r = n/2, λ = 1/2).
We say Rn² follows the chi-square distribution with n degrees of freedom.
Standard Bivariate Normal Distribution
X and Y have standard bivariate normal distribution with correlation ρ iff Y = ρX + √(1−ρ²)·Z,
where X and Z are independent standard normal variables
Joint Density: f(x,y) = (1/(2π√(1−ρ²)))·exp(−(x² − 2ρxy + y²)/(2(1−ρ²)))
Properties:
Marginals
Both X and Y have standard normal distribution
Conditionals Given X
Given X=x, Y ∼ N(ρx, 1−ρ²).
Conditionals Given Y
Given Y=y, X ∼ N(ρy, 1−ρ²)
Independence
X and Y are independent iff ρ=0
Bivariate Normal Distribution as a description for Linear Combinations of Independent Normal Variables
Let V = Σi ai·Zi and W = Σi bi·Zi
Where Zi∼N(μi,σi2) are independent normal variables.
Then the joint distribution of V,W is bivariate normal.
Where E(V) = Σi ai·μi, Var(V) = Σi ai²·σi², E(W) = Σi bi·μi, Var(W) = Σi bi²·σi², and Cov(V,W) = Σi ai·bi·σi²
Independence
Two linear combinations V = Σi ai·Zi and W = Σi bi·Zi of independent normal (μi, σi²) variables Zi are independent iff they are uncorrelated, that is, if and only if Σi ai·bi·σi² = 0.
Bivariate Normal Distribution
Random Variables U and V have bivariate normal distribution with parameters μU, μV, σU², σV², ρ iff the standardized variables U* = (U − μU)/σU and V* = (V − μV)/σV
have standard bivariate normal distribution with correlation ρ. Then,
and U,V are independent iff ρ=0
Derivation
We start with a pair of independent standard normal variables, X and Z.
Let Y be the projection of (X,Z) onto an axis at an angle θ to the X-axis,
We see on the diagram that Y=Xcos(θ)+Zsin(θ)
By rotational symmetry of the joint distribution of X,Z, the distribution of Y is standard normal.
Since E(X²) = Var(X) = 1 and E(XZ) = E(X)E(Z) = 0, we get ρ = Corr(X,Y) = E(XY) = cos(θ).
Some special cases:

| Condition | Result |
| --- | --- |
| θ = 0 | ρ = 1, Y = X |
| θ = π/2 | ρ = 0, Y = Z |
| θ = π | ρ = −1, Y = −X |
Since we have ρ = cos(θ), θ = arccos(ρ).
Therefore, sin(θ) = √(1−ρ²)
And Y = X·cos(θ) + Z·sin(θ) = ρX + √(1−ρ²)·Z
Poisson Distribution X ∼ Poisson(λ): P(X = k) = e^(−λ)·λ^k/k! for k = 0, 1, 2, …; E(X) = Var(X) = λ
For small λ the distribution is piled up near 0, and as λ gets bigger and bigger, the Poisson distribution becomes closer to the normal distribution (there’s a proof that as n→∞ with p = λ/n → 0, the binomial-to-Poisson approximation becomes better and better)
Sum of Independent Poisson Variables
N1,...,Nj are independent Poisson random variables with parameters μ1,...,μj, then S=N1+N2+⋯+Nj, S∼Poisson(∑i=1jμi)
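A minimal simulation sketch of this fact; the parameters, trial count, and the small-μ Poisson sampler are illustrative choices:

```python
# Simulate S = N1 + N2 + N3 for independent Poissons and sanity-check that
# the mean and variance of S both come out near sum(mus), as a Poisson's would.
import math
import random

def poisson(mu):
    # Knuth's method; fine for small mu.
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

mus, trials = [1.5, 2.0, 0.5], 200_000
sums = [sum(poisson(m) for m in mus) for _ in range(trials)]
mean = sum(sums) / trials
var = sum((s - mean) ** 2 for s in sums) / trials
print(mean, var)  # both should be close to sum(mus) = 4.0
```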
Skew-normal approximation for Poisson Distribution
If Nμ∼Poisson(μ), then for b = 0, 1, ...
where ϕ(z) is the standard normal curve and Φ(z) is the standard normal CDF.
Multinomial Distribution
“a generalization of the binomial distribution” ⇒ binomial distribution with multiple types of outcomes (instead of 2)
Let Ni denote the number of results in category i in a sequence of independent trials with probability pi for a result in the ith category on each trial, 1≤i≤m, where Σi pi = 1. Then for every m-tuple of non-negative integers (n1,n2,…,nm) with sum n: P(N1 = n1, …, Nm = nm) = (n!/(n1!·n2!⋯nm!))·p1^n1·p2^n2⋯pm^nm
Sum of Independent R.V.s
Let Sn be the sum and X̄n = Sn/n the average of n independent random variables X1,X2,…,Xn, each with the same distribution as X
Square Root Law: E(Sn) = nμ, SD(Sn) = σ√n; E(X̄n) = μ, SD(X̄n) = σ/√n
Skewness: Skewness(Sn) = Skewness(X)/√n
With Chebyshev’s Inequality: P(∣X̄n − μ∣ ≥ ϵ) ≤ σ²/(n·ϵ²) → 0 as n → ∞
Central Limit Theorem(Normal Approx.)
For large n, the distribution of Sn is approximately normal, which means Sn ≈ N(nμ, nσ²):
Note: To approximate the probability of Sn taking a specific range of values, we need to use a continuity correction (add and subtract 1/2 on the upper and lower bound)
For all a ≤ b, P(a ≤ (Sn − nμ)/(σ√n) ≤ b) → Φ(b) − Φ(a) as n → ∞
Skewed-Normal Approximation
Hypergeometric Distribution: P(g good in the sample) = C(G,g)·C(B,b)/C(N,n)
This is also the section of “sampling without replacement”
where b=n−g
where p = G/N and q = B/N; then E(number of good in the sample) = np and Var = npq·(N−n)/(N−1)
(N−n)/(N−1) is the “finite population correction factor”
Exponential Distribution T∼Exponential(λ)
A random time T has exponential distribution with rate λ ⇒ λ is the probability of death per unit time; P(T > t) = e^(−λt) and f(t) = λ·e^(−λt), with E(T) = SD(T) = 1/λ
Memoryless Property
A positive random variable T has exponential(λ) distribution for some λ > 0 if and only if T has the memoryless property
“Given survival to time t, the chance of surviving a further time s is the same as the chance of surviving to time s in the first place”: P(T > t + s ∣ T > t) = P(T > s)
Relation to Geometric Distribution
The exponential distribution on (0,∞) is the continuous analog of the geometric distribution on {1,2,3,…}
Relation to Poisson Arrival Process
A sequence of independent Bernoulli(Indicator) trials, with probability p of success on each trial, can be characterized in two ways:
Counts of successes - number of successes in n trials is Binomial(n,p)
Times between successes - the distribution of the waiting time until the first success is Geometric(p), and the waiting times between each success and the next are independent with the same geometric distribution.
These characterizations of Bernoulli trials lead to the two descriptions in the next box of a Poisson Arrival Process With Rate λ.
Gamma Distribution Tr∼Gamma(r,λ)
If Tr is the time of the r-th arrival after time 0 in a Poisson process with rate λ or if Tr=W1+W2+⋯+Wr where the Wi are independent with Wi∼Exponential(λ) distribution, then Tr∼Gamma(r,λ)
Note: Nt is the number of arrivals by time t in the Poisson process with rate λ (Nt∼Poisson(μ=λt))
“The probability per unit time that the r-th arrival comes around time t is the probability of exactly r-1 arrivals by time t multiplied by the arrival rate”
Tr>t iff there are at most r-1 arrivals in the interval (0, t]
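Translating the two statements above into formulas (the standard Gamma-Poisson identities):

$$f_{T_r}(t) = e^{-\lambda t}\,\frac{(\lambda t)^{r-1}}{(r-1)!}\cdot \lambda, \qquad P(T_r > t) = P(N_t \le r-1) = \sum_{k=0}^{r-1} e^{-\lambda t}\,\frac{(\lambda t)^k}{k!}$$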
Expectation and Variance: E(Tr) = r/λ, Var(Tr) = r/λ²
General Gamma Distribution with r∈R
In the previous PDF definition, we’ve only defined r∈Z+
PDF for real r > 0: f(t) = λ^r·t^(r−1)·e^(−λt)/Γ(r), t > 0
Where Γ(r) = ∫₀^∞ x^(r−1)·e^(−x) dx
If we apply integration by parts, Γ(r) = (r−1)·Γ(r−1); since Γ(1) = 1, this gives Γ(r) = (r−1)! for positive integers r
Geometric Distribution X∼Geo(p)
“number X of Bernoulli trials needed to get one success”: P(X = k) = (1−p)^(k−1)·p for k = 1, 2, …; E(X) = 1/p, Var(X) = (1−p)/p²
Beta Distribution X∼Beta(r,s)
For r,s > 0, the beta distribution on (0,1) is defined by the density: f(x) = x^(r−1)·(1−x)^(s−1)/B(r,s), 0 < x < 1
where B(r,s) = ∫₀¹ x^(r−1)·(1−x)^(s−1) dx
So we see that B(r,s) serves the purpose of normalizing the PDF to integrate to 1.
For all positive r,s: B(r,s) = Γ(r)·Γ(s)/Γ(r+s)
We see that
Beta Distribution as a distribution to calculate Order Statistics
The kth order statistic of n independent uniform(0,1) random variables has Beta(r=k,s=n−k+1) distribution.
Since the density of X(k) is a pdf it must integrate to 1 over [0,1], and therefore ∫₀¹ x^(k−1)·(1−x)^(n−k) dx = (k−1)!·(n−k)!/n!, i.e. B(k, n−k+1) = (k−1)!·(n−k)!/n!