P implies Q is equivalent to (not P or Q), and is equivalent to its contrapositive (not Q implies not P)
Probability Notations
Ω ⇒ Outcome Space
ω ⇒ Single Outcome (a sample point)
Basic Principle of Probability
“Complement Rule”: P(Aᶜ) = 1 − P(A)
“Principle of Inclusion-Exclusion”: ∣A ∪ B∣ = ∣A∣ + ∣B∣ − ∣A ∩ B∣
Similarly, the “Principle of Inclusion-Exclusion” of Probability: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
“Two Events are DISJOINT” ⇒ A ∩ B = ∅, and then P(A ∪ B) = P(A) + P(B)
“Two Events are INDEPENDENT” ⇒ P(A ∩ B) = P(A)·P(B)
Conditional Probability
P(A∣B) denotes the probability of event A occurring given that event B has occurred.
Bayes’ Rule
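For quick reference, the two standard formulas (nothing beyond the usual definitions is assumed here):

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$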
Counting & Sampling
Counting
Counting Sequence(Order Matters) Without Replacement
First Rule of Counting: If an object can be made by a succession of k choices, where there are n1 ways of making the first choice, and for every way of making the first choice there are n2 ways of making the second choice, and for every way of making the first and second choice there are n3 ways of making the third choice, and so on up to nk ways of making the k-th choice, then the total number of distinct objects that can be made in this way is the product n1×n2×n3×⋯×nk.
Counting Sequence With Replacement
Counting Sets without replacement
“n choose k”: C(n,k) = n!/(k!·(n−k)!)
Second Rule of Counting: Assume an object is made by a succession of choices, and the order in which the choices are made does not matter. Let A be the set of ordered objects and let B be the set of unordered objects. If there exists an m-to-1 function f:A→B, we can count the number of ordered objects (pretending that the order matters) and divide by m (the number of ordered objects per unordered objects) to obtain ∣B∣, the number of unordered objects.
Counting Sets with replacement
Say you have unlimited quantities of apples, bananas and oranges. You want to select 5 pieces of fruit to make a fruit salad. How many ways are there to do this? In this example, S = {1, 2, 3}, where 1 represents apples, 2 represents bananas, and 3 represents oranges. k = 5 since we wish to select 5 pieces of fruit. Ordering does not matter; selecting an apple followed by a banana will lead to the same salad as a banana followed by an apple.
It may seem natural to apply the Second Rule of Counting because order does not matter. Let’s consider this method. We first pretend that order matters and observe that the number of ordered objects is 3⁵ = 243, as discussed above. How many ordered options are there for every unordered option? The problem is that this number differs depending on which unordered object we are considering. Let’s say the unordered object is an outcome with 5 bananas. There is only one such ordered outcome. But if we are considering 4 bananas and 1 apple, there are 5 such ordered outcomes (represented as 12222, 21222, 22122, 22212, 22221).
Assume we have one bin for each element of S, so n bins in total. For example, if we selected 2 apples and 1 banana, bin 1 would have 2 elements and bin 2 would have 1 element. In order to count the number of multisets, we need to count how many different ways there are to fill these bins with k elements. We don’t care about the order of the bins themselves, just how many of the k elements each bin contains. Let’s represent each of the k elements by a 0 in the binary string, and separations between bins by a 1.
Counting the number of multisets is now equivalent to counting the number of placements of the k 0’s
The length of our binary string is k + n − 1, and we are choosing which k locations should contain 0’s. The remaining n − 1 locations will contain 1’s.
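A minimal sketch (standard library only) checking the resulting count C(k+n−1, k) against brute-force enumeration for the fruit-salad example above; the variable names are illustrative:

```python
# Stars and bars: number of multisets of size k from n types = C(k + n - 1, k).
from itertools import combinations_with_replacement
from math import comb

n, k = 3, 5  # 3 fruit types, 5 picks (the example above)
multisets = list(combinations_with_replacement(range(n), k))
print(len(multisets))      # 21, by brute-force enumeration
print(comb(k + n - 1, k))  # C(7, 5) = 21, by the formula
```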
Zeroth Rule of Counting: If a set A can be placed into a one-to-one correspondence with a set B (i.e. you can find a bijection between the two — an invertible pair of maps that map elements of A to elements of B and vice-versa), then |A| = |B|.
Sampling
A population of N total, with G good and B bad. We take a sample of size n = g + b (g good and b bad in the sample), with 0 ≤ g ≤ n
Sampling Sets with replacement
Sampling Sets without replacement (Hypergeometric Dist.)
Probability Concepts
Consecutive Odds Ratios
Mainly used for the binomial distribution
“Analyze the chance of k successes relative to k−1 successes”: for X ∼ B(n,p), P(X=k)/P(X=k−1) = ((n−k+1)/k)·(p/q)
Law of Large Numbers
As n→∞, the sample average is, with probability approaching 1, arbitrarily close to the true distribution mean: P(∣(1/n)·Sn − μ∣ < ϵ) → 1, no matter how small ϵ > 0 is.
Random Variable
A random variable X on a sample space Ω is a function X:Ω→R that assigns to each sample point ω∈Ω a real number X(ω).
Probability of a value of a discrete random variable
Has to satisfy: P(X = a) ≥ 0 for every a, and Σa P(X = a) = 1
Note:
For discrete r.v., the CDF(cumulative distribution function) would be a step function
Probability of a value of a continuous random variable
Probability of a range of value of a continuous random variable
Has to satisfy: f(x) ≥ 0 for all x, and ∫ f(x) dx = 1 over the whole range
Continuous Random Variable
For continuous r.v.
There’s a Probability Density Function(PDF) and a Cumulative Distribution Function(CDF)
Inverse Distribution Function
For what value of x is there probability 1/2 that X ≤ x?
Either calculate the inverse function F⁻¹(p), or solve the equation F(x) = p for x with p treated as a known constant.
Inverse CDF applied to the standard normal
For any cumulative distribution function F with inverse function F⁻¹, if U has the uniform (0,1) distribution, then F⁻¹(U) has CDF F(x).
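A minimal sketch of this fact in code (inverse transform sampling), using the Exponential(λ) CDF F(x) = 1 − e^(−λx) as the example; λ and the sample size are illustrative choices, not from the notes:

```python
# Inverse transform sampling: if U ~ Uniform(0,1), then F_inv(U) has CDF F.
import math
import random

def exp_inverse_cdf(p, lam):
    """F^-1(p) for the Exponential(lam) CDF F(x) = 1 - exp(-lam * x)."""
    return -math.log(1.0 - p) / lam

lam, n = 2.0, 100_000
samples = [exp_inverse_cdf(random.random(), lam) for _ in range(n)]
print(sum(samples) / n)  # sanity check: should be close to E(T) = 1/lam = 0.5
```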
Probability Distribution
The distribution of a discrete random variable X is the collection of values {(a,P[X=a]):a∈A}, where A is the set of all possible values taken by X.
The distribution of a continuous random variable X is defined by its Probability Density Function f(x).
Example: for a die, where X denotes the number showing when we roll the die once.
Joint Distribution
The joint distribution for two discrete random variables X and Y is the collection of values {(∀(a,b),P[X=a,Y=b]):a∈A,b∈B}, where A is the set of all possible values taken by X and B is the set of all possible values taken by Y
And the joint distribution for two continuous random variables X and Y is defined by the joint probability density function f(x,y). It gives the density of probability per unit area for values of (X,Y) near the point (x,y). P(X∈dx,Y∈dy)=f(x,y)dxdy.
In other words, list the probabilities of all possible pairs of X and Y values.
Has to satisfy the following constraint: Σa Σb P[X=a, Y=b] = 1 (discrete), or ∬ f(x,y) dx dy = 1 (continuous)
One useful equation:
Symmetry of Joint Distribution
Let X1,X2,…,Xn be random variables with joint distribution defined by P(x1,…,xn) = P(X1=x1, …, Xn=xn)
Then the joint distribution is symmetric if P(x1,…,xn) is a symmetric function of (x1,…,xn). In other words, the value of P(x1,…,xn) is unchanged if we permute its arguments in any way.
If the joint distribution is symmetric, then all n! possible orderings of the random variables X1,…,Xn have the same joint distribution and this means that X1,…,Xn are exchangeable. Exchangeable random variables have the same distribution.
Independence of random variables
Random variables X and Y on the same probability space are said to be independent if the events X=a and Y=b are independent for all values a, b. Equivalently, the joint distribution of independent r.v.’s decomposes as P[X=a, Y=b] = P[X=a]·P[Y=b] for all a, b.
Conditional Distribution
Usually used when two variables are dependent.
“Conditional Distribution of Y given X=x”
Continuous Variables: f_Y(y ∣ X=x) = f(x,y)/f_X(x)
Rule of Average Conditional Probabilities: P(A) = Σx P(A ∣ X=x)·P(X=x)
Conditional Expectation
E(Y∣X) is defined as the function of X whose value at X = x is E(Y∣X=x)
Properties of Conditional Expectation: E[E(Y∣X)] = E(Y); E[g(X)·Y ∣ X] = g(X)·E[Y∣X]; and if X and Y are independent, E(Y∣X) = E(Y)
Identical Distribution
X and Y have the same range, and for every possible value a in the range, P(X = a) = P(Y = a)
If X and Y have the same distribution, then...
any statement (event) about X has the same probability as the same statement about Y
g(X) has the same distribution as g(Y)
Combination of Random Variables
If Z=Y/X, Then Z∈dz is drawn in the following diagram
Note the area of the heavily shaded region here. It looks similar to a parallelogram. So area would be dx⋅[(z+dz)x−zx]=dxdz∣x∣.
Therefore, if we integrate out X,
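Writing that integral out explicitly (the standard ratio-density formula, with f the joint density of X and Y):

$$f_Z(z) = \int_{-\infty}^{\infty} |x|\, f(x, zx)\, dx$$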
Equality of random variables
“Equality implies identical distribution”
Probability of Events of two Random Variables
Symmetry of R.V.
The random variable X is said to be symmetric around 0 if X and −X have the same distribution, i.e. P(X ≥ x) = P(X ≤ −x) for all x
Linear Function Mapping of Continuous Random Variable
suppose Y=aX+b and X has PDF fX(x)
then Y has PDF fY(y) = (1/∣a∣)·fX((y−b)/a)
One-to-one Differentiable Function of Continuous Random Variable
Let X be a r.v. with density fX(x) on the range (a,b). Let Y=g(X) where g is strictly increasing or decreasing on the interval (a,b).
For an infinitesimal interval dy near y, the event Y∈dy is identical to the event X∈dx.
Thus fY(y)∣dy∣ = fX(x)∣dx∣. The absolute values are added because for a decreasing function g(x), dx and dy have opposite signs and only the ratio of their magnitudes matters; thus fY(y) = fX(x)·∣dx/dy∣ = fX(x)/∣g′(x)∣, where x = g⁻¹(y)
Change of Variable Principle
If X has the same distribution as Y, then g(X) has the same distribution as g(Y), for any function g(⋅)
Max and Min of Independent R.V.s
CDF makes it easy to find dist of max and mins.
For any number x:
Xmax≤x≡(∀i,Xi≤x)
Xmin≥x≡(∀i,Xi≥x)
So P(Xmax ≤ x) = ∏i P(Xi ≤ x) and P(Xmin ≥ x) = ∏i P(Xi ≥ x); in the i.i.d. case with common CDF F, P(Xmax ≤ x) = F(x)^n and P(Xmin ≤ x) = 1 − (1 − F(x))^n (a quick check by simulation is sketched below).
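A minimal simulation sketch of the i.i.d. case, checking P(Xmax ≤ x) = F(x)^n for Uniform(0,1) variables (so F(x) = x); n, x, and the trial count are illustrative:

```python
# Check P(max of n iid Uniform(0,1) variables <= x) against the formula x^n.
import random

n, x, trials = 5, 0.7, 200_000
hits = sum(max(random.random() for _ in range(n)) <= x for _ in range(trials))
print(hits / trials)  # empirical estimate of P(Xmax <= x)
print(x ** n)         # theoretical value, 0.7^5 = 0.16807
```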
Expectation
For Discrete R.V.: E(X) = Σx x·P(X=x)
For continuous R.V.: E(X) = ∫ x·f(x) dx
If we have a non-negative r.v., E(X) = Σ_{k≥1} P(X ≥ k) (integer-valued case), or E(X) = ∫₀^∞ P(X > t) dt (continuous case)
Linearity of Expectation: E(aX + bY) = a·E(X) + b·E(Y), whether or not X and Y are independent
Independent R.V. Expectation
For Independent R.V. X,Y, we have: E(XY) = E(X)·E(Y)
Variance: Var(X) = E[(X − E(X))²] = E(X²) − (E(X))²
Variance of the sum of n variables: Var(X1 + ⋯ + Xn) = Σi Var(Xi) + 2·Σ_{i<j} Cov(Xi, Xj)
Coefficient Property: Var(aX + b) = a²·Var(X)
Independent R.V. Variance
For Independent R.V. X,Y, we have: Var(X + Y) = Var(X) + Var(Y)
Standard Deviation σ = √Var(X)
Standardization of a Random Variable
“X in standard units”: X* = (X − E(X))/SD(X), with E(X*) = 0 and Var(X*) = SD(X*) = 1
Skewness of R.V.
Let X be a random variable with E(X) = μ and Var(X) = σX². Let X* = (X − μ)/σX be X in standard units; the skewness of X is defined as E[(X*)³].
Markov’s Inequality: P(X ≥ a) ≤ E(X)/a for every a > 0
Note: For nonnegative R.V.s only
Chebyshev’s Inequality: P(∣X − μ∣ ≥ kσ) ≤ 1/k² for every k > 0
Order Statistics
Let X1,X2,X3,…,Xn be i.i.d. random variables with pdf f(x) and cdf F(x)
Denote by X(1),X(2),…,X(n) the smallest, the second smallest, etc. among X1,…,Xn. Then what is the distribution of X(i)?
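The standard answer, stated here for reference since the formula itself is not reproduced above: for the i-th smallest of n i.i.d. draws,

$$f_{X_{(i)}}(x) = \frac{n!}{(i-1)!\,(n-i)!}\; F(x)^{\,i-1}\,\bigl(1-F(x)\bigr)^{\,n-i}\, f(x)$$

For Uniform(0,1) variables this is the Beta(i, n−i+1) density, matching the Beta section at the end of these notes.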
Covariance Cov(X,Y) = E[(X − E(X))·(Y − E(Y))] = E(XY) − E(X)·E(Y)
| Condition | Description |
| --- | --- |
| Cov(X,Y) > 0 | X and Y are positively dependent: P(X∣Y) > P(X), P(Y∣X) > P(Y) |
| Cov(X,Y) = 0 | X and Y are uncorrelated (covariance equal to zero doesn’t always imply independence) |
| Cov(X,Y) < 0 | X and Y are negatively dependent: P(X∣Y) < P(X), P(Y∣X) < P(Y) |
Covariance of the same variable: Cov(X,X) = Var(X)
Bilinearity of Covariance
Standard Form:
Simpler Form:
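The two forms themselves are not reproduced above; as a hedged restatement of what they likely referred to, the general bilinearity identity and its most-used special case are:

$$\operatorname{Cov}\Bigl(\sum_i a_i X_i,\ \sum_j b_j Y_j\Bigr) = \sum_i \sum_j a_i b_j \operatorname{Cov}(X_i, Y_j), \qquad \operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X,Y)$$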
Correlation
Because it is hard to interpret the magnitude of Cov(X,Y), we standardize it to the correlation: Corr(X,Y) = Cov(X,Y)/(SD(X)·SD(Y)), which always lies in [−1, 1]
Moment Generating Function
Let X be a random variable; then the MGF (Moment Generating Function) of X is defined as M_X(t) = E(e^(tX))
Important Properties
Equality of MGF means Equality of CDF
Jensen’s Inequality: for a convex function g, E[g(X)] ≥ g(E[X]) (reversed for concave g)
Upper tail of random variable using Markov’s Inequality: for any t > 0, P(X ≥ a) = P(e^(tX) ≥ e^(ta)) ≤ M_X(t)·e^(−ta)
Linear Transformation of Random Variable: M_(aX+b)(t) = e^(bt)·M_X(at)
Linear Combination of Independent Random Variables: if X and Y are independent, M_(X+Y)(t) = M_X(t)·M_Y(t)
Additional Topic - MLE and MAP
Say we have observations x and we want to estimate a parameter θ; in the MAP setting, θ is itself treated as a random variable with a prior distribution.
MLE does the following: θ̂_MLE = argmax_θ P(x ∣ θ)
MAP does the following: θ̂_MAP = argmax_θ P(θ ∣ x)
MLE - search over parameter values and pick the θ under which producing the observed x is most likely, i.e. maximize the likelihood P(x∣θ).
MAP - search over parameter values and pick the θ that is most likely to be right given the observed data, i.e. maximize the posterior P(θ∣x).
By Bayes’ Theorem, P(θ∣x) = P(x∣θ)·P(θ)/P(x)
Note that
Since P(x) is independent of individual θ, we can view P(x) as a constant, and therefore,
Therefore, MAP can also be written as θ̂_MAP = argmax_θ P(x∣θ)·P(θ)
Properties of MLE and MAP
Note that both MLE and MAP are point estimators: if the parameter is continuous (and usually this is the case), they pick the single maximizing point (the mode) of the likelihood or posterior, not a value chosen by area such as the median.
MLE is more prone to overfitting than MAP, since the prior in MAP constrains the estimated parameter to a plausible region.
The asymptotic behavior of MAP and MLE is the same: as we collect more data, the results of MAP and MLE tend to converge.
Special Property of MLE (not applicable to MAP): T=g(θ)⟹TMLE=g(θMLE)
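A minimal sketch contrasting the two estimators on coin flips; the Beta(a, b) prior and the specific numbers are illustrative assumptions, not from the notes (both estimators are closed-form for this model):

```python
# MLE vs. MAP for the bias p of a coin, from `heads` successes in n flips.
def coin_mle(heads: int, n: int) -> float:
    # argmax_p of the likelihood p^heads * (1-p)^(n-heads)
    return heads / n

def coin_map(heads: int, n: int, a: float = 2.0, b: float = 2.0) -> float:
    # With a Beta(a, b) prior the posterior is Beta(heads + a, n - heads + b);
    # MAP returns the posterior mode.
    return (heads + a - 1) / (n + a + b - 2)

heads, n = 9, 10
print(coin_mle(heads, n))  # 0.9   -- follows the data exactly (overfits easily)
print(coin_map(heads, n))  # 0.833 -- pulled toward the prior mean of 0.5
```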
Distribution
Bernoulli (Indicator) Distribution I ∼ Bernoulli(p): P(I = 1) = p, P(I = 0) = 1 − p; E(I) = p, Var(I) = p(1−p)
Uniform Distribution on a finite set
Suppose we have a finite set of equally likely outcomes, Ω = {ω1, ω2, …, ωn}; then P(ωi) = 1/n for each i
Uniform(a,b) distribution
Distribution of a point picked uniformly at random from the interval (a,b)
For a < x < y < b the probability that the point falls in the interval (x,y) is (y − x)/(b − a)
for b − a = 1, the long-run frequency of landing in (x,y) is almost exactly equal to y − x.
Empirical Distribution
As opposed to a theoretical distribution, the empirical distribution is the distribution of your observed data.
Suppose we have X={x1,x2,…,xn}
In other words, Pn(a,b) = #{i : a < xi < b}/n gives the proportion of the numbers in the list that lie in the interval (a,b)
Estimating Empirical Distribution With Continuous PDF
The distribution of a data list can be displayed in a histogram, and such histogram smoothes out the data to display the general shape of the empirical distribution.
Such histograms often follow a smooth curve f(x), and it is safe to assume f(x) ≥ 0 for all x
The idea is that if (a,b) is a bin interval, then the area under the bar between a and b should be roughly equal to the area under the curve between a and b ⇒ the proportion of data from a to b is roughly equal to the area under f(x).
f(x) functions like a continuous PDF estimation for distribution of data.
Now we can also write this proportion as the average of an indicator over the data list: the proportion of the list in (a,b) is (1/n)·Σi I(a,b)(xi) ≈ area under f(x) from a to b
where I(a,b)(xi) is an indicator stating whether xi is in the range (a,b)
Integration Approximation of Averages
If the empirical distribution of the list (x1,x2,…,xn) is well approximated by the theoretical distribution with PDF f(x), then the average value of a function g(x) over the n values can be approximated by (1/n)·Σi g(xi) ≈ ∫ g(x)·f(x) dx
Binomial Distribution X∼B(n,p)
“probability of k successes in n trials with success rate of p”: P(X = k) = C(n,k)·p^k·(1−p)^(n−k) for k = 0, 1, …, n
Note: “Binomial Expansion”: (a + b)^n = Σ_{k=0}^{n} C(n,k)·a^k·b^(n−k)
Square Root Law
For large n, in n independent trials with probability p of success on each trial:
the number of successes will, with high probability, lie in a relatively small region centered on np, with width a moderate multiple of √n on the numerical scale
the proportion of successes will, with high probability, lie in a small interval centered on p, with width a moderate multiple of 1/√n
Normal Distribution as an approximation for Binomial Distribution
provided n is large enough in X∼B(n,p), X can be approximated with normal distribution
If approximating the probability of a to b successes in a binomial distribution, use b + 1/2 and a − 1/2 as the boundaries instead of b and a; this is called the continuity correction. Very important for small values of npq.
Note: for f(x), use μ=E(X) and σ2=Var(X) of the Binomial distribution
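A minimal sketch of the approximation with the continuity correction, compared against the exact binomial probability; n, p, a, b are illustrative values:

```python
# Normal approximation to P(a <= X <= b) for X ~ Binomial(n, p),
# using the +/- 1/2 continuity correction.
from math import comb, erf, sqrt

def binom_prob(n, p, a, b):
    """Exact P(a <= X <= b)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, b + 1))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p, a, b = 100, 0.3, 25, 35
mu, sigma = n * p, sqrt(n * p * (1 - p))
exact = binom_prob(n, p, a, b)
approx = normal_cdf(b + 0.5, mu, sigma) - normal_cdf(a - 0.5, mu, sigma)
print(exact, approx)  # the two numbers should agree closely
```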
W(n,p) denotes the WORST ERROR in the normal approximation to the binomial distribution.
Poisson Distribution as an approximation for binomial distribution
When n is large and p is close to 0, the normal distribution cannot properly approximate the binomial distribution, so use the Poisson instead! (If p is close to 1, swap the roles of success and failure, i.e. use p′ = q = 1 − p, and mirror the approximation.)
Probability of the Most Likely Number of Successes
Normal Distribution X ∼ N(μ, σ²), with PDF f(x) = (1/(σ√(2π)))·e^(−(x−μ)²/(2σ²))
Standard Normal Distribution (μ=0,σ=1)
Where z = (x − μ)/σ, which for the standard normal (μ = 0, σ = 1) is just z = x
Skew-Normal Approximation
“Third derivative of ϕ(z)”
“Skew-normal PDF”
“Skew-normal CDF”
“0 to b success rate” for binomial distribution
where z = (b + 1/2 − μ)/σ, with μ = np and σ = √(npq)
Sum of Independent Normal Variables
If X∼N(λ,σ2) and Y∼N(μ,r2), then X+Y∼N(λ+μ,σ2+r2)
Joint Distribution for Independent Standard Normal Distributions
let X and Y be standard normal dist, that is X,Y∼N(μ=0,σ2=1)
Rayleigh Distribution R∼Rayleigh
It’s the distribution of the radius R = √(X² + Y²) under the above joint distribution of two independent standard normals.
Derivation of Rayleigh Distribution
And therefore f_R(r) = 2π·c²·r·e^(−r²/2) for r > 0, where c is the (as yet unknown) normalizing constant of the one-dimensional normal density
Notice that f_R(r) must integrate to 1 over (0,∞); calculating this, we see that ∫₀^∞ f_R(r) dr = 2πc², and therefore c = 1/√(2π).
Chi-Square Distribution
Joint density of n independent standard normal variables at every point on the sphere of radius r in n-dimensional space is (2π)^(−n/2)·e^(−r²/2)
For independent standard normal Zi, let Rn = √(Z1² + Z2² + ⋯ + Zn²) denote the distance from the origin in n-dimensional space
So the n-dimensional volume of a thin spherical shell of thickness dr at radius r is cn·r^(n−1)·dr
Where cn is the (n−1)-dimensional volume of the “surface” of a sphere of radius 1 in n dimensions
Through a change of variable and evaluating cn we see that Rn² ∼ Gamma(r = n/2, λ = 1/2).
We say Rn² follows the chi-square distribution with n degrees of freedom.
Standard Bivariate Normal Distribution
X and Y have standard bivariate normal distribution with correlation ρ iff Y = ρX + √(1−ρ²)·Z,
where X and Z are independent standard normal variables
Joint Density: f(x,y) = (1/(2π√(1−ρ²)))·exp(−(x² − 2ρxy + y²)/(2(1−ρ²)))
Properties:
Marginals
Both X and Y have standard normal distribution
Conditionals Given X
Given X=x, Y ∼ N(ρx, 1−ρ²).
Conditionals Given Y
Given Y=y, X ∼ N(ρy, 1−ρ²)
Independence
X and Y are independent iff ρ=0
Bivariate Normal Distribution as a description for Linear Combinations of Independent Normal Variables
Let V = Σi ai·Zi and W = Σi bi·Zi
Where Zi∼N(μi,σi2) are independent normal variables.
Then the joint distribution of V,W is bivariate normal.
Where E(V) = Σi ai·μi, Var(V) = Σi ai²·σi², E(W) = Σi bi·μi, Var(W) = Σi bi²·σi², and Cov(V,W) = Σi ai·bi·σi²
Independence
Two linear combinations V = Σi ai·Zi and W = Σi bi·Zi of independent normal (μi, σi²) variables Zi are independent iff they are uncorrelated, that is, if and only if Σi ai·bi·σi² = 0.
Bivariate Normal Distribution
Random Variables U and V have bivariate normal distribution with parameters μU, μV, σU², σV², ρ iff the standardized variables U* = (U − μU)/σU and V* = (V − μV)/σV
have standard bivariate normal distribution with correlation ρ. Then,
and U,V are independent iff ρ=0
Derivation
We start with a pair of independent standard normal variables, X and Z.
Let Y be the projection of (X,Z) onto an axis at an angle θ to the X-axis,
We see on the diagram that Y=Xcos(θ)+Zsin(θ)
By rotational symmetry of the joint distribution of X,Z, the distribution of Y is standard normal.
Since E(X²) = Var(X) = 1 and E(XZ) = E(X)E(Z) = 0, we get ρ = Corr(X,Y) = E(XY) = cos(θ).
Some special cases:

| Condition | Result |
| --- | --- |
| θ = 0 | ρ = 1, Y = X |
| θ = π/2 | ρ = 0, Y = Z |
| θ = π | ρ = −1, Y = −X |
Since we have ρ = cos(θ), θ = arccos(ρ).
Therefore, sin(θ) = √(1−ρ²)
And Y = X·cos(θ) + Z·sin(θ) = ρX + √(1−ρ²)·Z
Poisson Distribution X ∼ Poisson(λ): P(X = k) = e^(−λ)·λ^k/k! for k = 0, 1, 2, …; E(X) = Var(X) = λ
For small λ the distribution is piled up near 0, and as λ gets bigger and bigger, the Poisson distribution becomes closer to the normal distribution (there’s a proof that as n→∞ with p = λ/n → 0, the binomial-to-Poisson approximation becomes better and better)
Sum of Independent Poisson Variables
N1,...,Nj are independent Poisson random variables with parameters μ1,...,μj, then S=N1+N2+⋯+Nj, S∼Poisson(∑i=1jμi)
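A minimal simulation sketch of this fact; the parameters, trial count, and the small-μ Poisson sampler are illustrative choices:

```python
# Simulate S = N1 + N2 + N3 for independent Poissons and sanity-check that
# the mean and variance of S both come out near sum(mus), as a Poisson's would.
import math
import random

def poisson(mu):
    # Knuth's method; fine for small mu.
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

mus, trials = [1.5, 2.0, 0.5], 200_000
sums = [sum(poisson(m) for m in mus) for _ in range(trials)]
mean = sum(sums) / trials
var = sum((s - mean) ** 2 for s in sums) / trials
print(mean, var)  # both should be close to sum(mus) = 4.0
```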
Skew-normal approximation for Poisson Distribution
If Nμ∼Poisson(μ), then for b = 0, 1, ...
where ϕ(z) is the standard normal curve and Φ(z) is the standard normal CDF.
Multinomial Distribution
“a generalization of the binomial distribution” ⇒ binomial distribution with multiple types of outcomes (instead of 2)
Let Ni denote the number of results in category i in a sequence of independent trials with probability pi for a result in the ith category on each trial, 1≤i≤m, where Σi pi = 1. Then for every m-tuple of non-negative integers (n1,n2,…,nm) with sum n: P(N1 = n1, …, Nm = nm) = (n!/(n1!·n2!⋯nm!))·p1^n1·p2^n2⋯pm^nm
Sum of Independent R.V.s
Let Sn be the sum and X̄n = Sn/n the average of n independent random variables X1,X2,…,Xn, each with the same distribution as X
Square Root Law: E(Sn) = nμ, SD(Sn) = σ√n; E(X̄n) = μ, SD(X̄n) = σ/√n
Skewness: Skewness(Sn) = Skewness(X)/√n
With Chebyshev’s Inequality: P(∣X̄n − μ∣ ≥ ϵ) ≤ σ²/(n·ϵ²) → 0 as n → ∞
Central Limit Theorem(Normal Approx.)
For large n, the distribution of Sn is approximately normal, which means Sn ≈ N(nμ, nσ²):
Note: To approximate the probability of Sn taking a specific range of values, we need to use a continuity correction (add and subtract 1/2 on the upper and lower bound)
For all a ≤ b, P(a ≤ (Sn − nμ)/(σ√n) ≤ b) → Φ(b) − Φ(a) as n → ∞
Skewed-Normal Approximation
Hypergeometric Distribution: P(g good in the sample) = C(G,g)·C(B,b)/C(N,n)
This is also the section of “sampling without replacement”
where b=n−g
where p = G/N and q = B/N; then E(number of good in the sample) = np and Var = npq·(N−n)/(N−1)
(N−n)/(N−1) is the “finite population correction factor”
Exponential Distribution T∼Exponential(λ)
A random time T has exponential distribution with rate λ ⇒ λ is the probability of death per unit time; P(T > t) = e^(−λt) and f(t) = λ·e^(−λt), with E(T) = SD(T) = 1/λ
Memoryless Property
A positive random variable T has exponential(λ) distribution for some λ > 0 if and only if T has the memoryless property
“Given survival to time t, the chance of surviving a further time s is the same as the chance of surviving to time s in the first place”: P(T > t + s ∣ T > t) = P(T > s)
Relation to Geometric Distribution
The exponential distribution on (0,∞) is the continuous analog of the geometric distribution on {1,2,3,…}
Relation to Poisson Arrival Process
A sequence of independent Bernoulli(Indicator) trials, with probability p of success on each trial, can be characterized in two ways:
Counts of successes - number of successes in n trials is Binomial(n,p)
Times between successes - the distribution of the waiting time until the first success is Geometric(p), and the waiting times between each success and the next are independent with the same geometric distribution.
These characterizations of Bernoulli trials lead to the two descriptions in the next box of a Poisson Arrival Process With Rate λ.
Gamma Distribution Tr∼Gamma(r,λ)
If Tr is the time of the r-th arrival after time 0 in a Poisson process with rate λ or if Tr=W1+W2+⋯+Wr where the Wi are independent with Wi∼Exponential(λ) distribution, then Tr∼Gamma(r,λ)
Note: Nt is the number of arrivals by time t in the Poisson process with rate λ (Nt∼Poisson(μ=λt))
“The probability per unit time that the r-th arrival comes around time t is the probability of exactly r-1 arrivals by time t multiplied by the arrival rate”
Tr>t iff there are at most r-1 arrivals in the interval (0, t]
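Translating the two statements above into formulas (the standard Gamma-Poisson identities):

$$f_{T_r}(t) = e^{-\lambda t}\,\frac{(\lambda t)^{r-1}}{(r-1)!}\cdot \lambda, \qquad P(T_r > t) = P(N_t \le r-1) = \sum_{k=0}^{r-1} e^{-\lambda t}\,\frac{(\lambda t)^k}{k!}$$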
Expectation and Variance: E(Tr) = r/λ, Var(Tr) = r/λ²
General Gamma Distribution with r∈R
In the previous PDF definition, we’ve only defined r∈Z+
PDF for real r > 0: f(t) = λ^r·t^(r−1)·e^(−λt)/Γ(r), t > 0
Where Γ(r) = ∫₀^∞ x^(r−1)·e^(−x) dx
If we apply integration by parts, Γ(r) = (r−1)·Γ(r−1); since Γ(1) = 1, this gives Γ(r) = (r−1)! for positive integers r
Geometric Distribution X∼Geo(p)
“number X of Bernoulli trials needed to get one success”: P(X = k) = (1−p)^(k−1)·p for k = 1, 2, …; E(X) = 1/p, Var(X) = (1−p)/p²
Beta Distribution X∼Beta(r,s)
For r,s > 0, the beta distribution on (0,1) is defined by the density: f(x) = x^(r−1)·(1−x)^(s−1)/B(r,s), 0 < x < 1
where B(r,s) = ∫₀¹ x^(r−1)·(1−x)^(s−1) dx
So we see that B(r,s) serves the purpose of normalizing the PDF to integrate to 1.
For all positive r,s: B(r,s) = Γ(r)·Γ(s)/Γ(r+s)
We see that
Beta Distribution as a distribution to calculate Order Statistics
The kth order statistic of n independent uniform(0,1) random variables has Beta(r=k,s=n−k+1) distribution.
Since the density of X(k) is a pdf it must integrate to 1 over [0,1], and therefore ∫₀¹ x^(k−1)·(1−x)^(n−k) dx = (k−1)!·(n−k)!/n!, i.e. B(k, n−k+1) = (k−1)!·(n−k)!/n!