Math Cheatsheet for ML

Calculus

Gradient

Define $z = f(\bold{x})$ with $\bold{x} = \begin{bmatrix} x_1\\x_2\\...\\x_n \end{bmatrix}$

The gradient $\nabla f = \frac{\partial f}{\partial \bold{x}} = [\frac{\partial f}{\partial x_1}, ..., \frac{\partial f}{\partial x_n}]$ collects the partial derivatives of $f$ with respect to each component of the input vector; the output is a row vector.
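
For example (a tiny worked case added here for concreteness): with $f(\bold{x}) = x_1^2 + 2x_2^2$,

$$\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right] = [2x_1,\ 4x_2]$$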


Lagrange Multiplier

Q1: given the constraint $g(x,y) = 0$, find the extreme values of $f(x,y)$

$\to$ (at the extreme point $(x_0,y_0)$, $\nabla f(x_0,y_0)$ and $\nabla g(x_0,y_0)$ should have the same or opposite direction)

$\to \nabla f(x_0,y_0) = \lambda \nabla g(x_0,y_0), \ \lambda \in R$

$\to$ def $F(x,y) = f(x,y) - \lambda g(x,y)$, then $\nabla F(x_0,y_0) = \vec{0}$

$\to$ solve the equations:
$$\nabla F(x_0,y_0) = \vec{0}, \quad g(x_0,y_0) = 0$$

Q2: given the constraints $g_1(x,y) = 0$ and $g_2(x,y) = 0$, find the extreme values of $f(x,y)$

$\to F(x,y) = f(x,y) - \lambda_1 g_1(x,y) - \lambda_2 g_2(x,y)$

$\to$ ...
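
A small worked example (chosen here for illustration, not part of the original notes): maximize $f(x,y) = xy$ subject to $g(x,y) = x + y - 1 = 0$.

$$F(x,y) = xy - \lambda(x + y - 1), \quad \nabla F = (y - \lambda,\ x - \lambda) = \vec{0} \ \Rightarrow\ x = y = \lambda$$

Substituting into $g(x,y) = 0$ gives $x = y = \lambda = \frac{1}{2}$, so the extreme value is $f = \frac{1}{4}$.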

Taylor Expansion

For a given $y = f(x)$:

$$f(x_1) - f(x_0) = f'(x_0)(x_1 - x_0) + o(x_1 - x_0)$$


continuing the expansion to higher-order terms:

$$f(x_1) - f(x_0) = f'(x_0)(x_1 - x_0) + \frac{f''(x_0)}{2!}(x_1 - x_0)^2 + \frac{f^{(3)}(x_0)}{3!}(x_1 - x_0)^3 + ... + \frac{f^{(n)}(x_0)}{n!}(x_1 - x_0)^n + ...$$

moving $f(x_0)$ to the right-hand side:

$$f(x_1) = f(x_0) + f'(x_0)(x_1 - x_0) + \frac{f''(x_0)}{2!}(x_1 - x_0)^2 + \frac{f^{(3)}(x_0)}{3!}(x_1 - x_0)^3 + ... + \frac{f^{(n)}(x_0)}{n!}(x_1 - x_0)^n + ...$$

as $n \to \infty$.

or, expanding around $0$ (the Maclaurin series):

$$f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f^{(3)}(0)}{3!}x^3 + ... + \frac{f^{(n)}(0)}{n!}x^n + ...$$

as $n \to \infty$.

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + ...$$
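
A quick numerical check of the truncated series (a minimal Python sketch; the helper name `exp_taylor` is mine):

```python
import math

def exp_taylor(x, n_terms=10):
    """Approximate e^x with the first n_terms of its Maclaurin series."""
    return sum(x**k / math.factorial(k) for k in range(n_terms))

print(exp_taylor(1.0, 10), math.exp(1.0))  # both close to 2.71828...
```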

Newton's Method

given a function $f: R^n \to R$, its second-order Taylor expansion is:
$$f(x_1) \approx f(x_0) + \nabla f(x_0)(x_1 - x_0) + \frac{1}{2}(x_1 - x_0)^T \nabla^2 f(x_0)(x_1 - x_0)$$

we want to find where $\nabla f = 0$, so differentiate both sides with respect to $x_1$ and set the result to zero:
$$\nabla f(x_1) = \nabla f(x_0) + (\nabla^2 f(x_0)(x_1 - x_0))^T = 0$$
solving the equation for $x_1$:
$$x_1 = x_0 - (\nabla^2 f(x_0))^{-1}(\nabla f(x_0))^T$$
we get the iteration formula (writing the Hessian matrix $H_f$ in place of $\nabla^2$):
$$x_{k+1} = x_k - (H_f(x_k))^{-1}(\nabla f(x_k))^T$$

Now, to find the minimum of $f$: start from an initial guess $x_0$, apply this update repeatedly, and stop once $\nabla f(x_k)$ is (numerically) zero; a sketch follows below.
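
A minimal Python sketch of the iteration (function names and the quadratic test function are assumptions of this example, not from the notes; it finds a stationary point, which for a convex $f$ is the minimum):

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=50):
    """Newton iteration x_{k+1} = x_k - H^{-1} grad, stopped when the gradient is tiny."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H d = g instead of forming the inverse explicitly.
        x = x - np.linalg.solve(hess(x), g)
    return x

# Example: f(x) = x1^2 + 2*x2^2 has its minimum at the origin.
grad = lambda x: np.array([2 * x[0], 4 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newton_minimize(grad, hess, [3.0, -1.5]))  # -> [0. 0.]
```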

Gauss-Newton Method

Optimization of a least-squares problem, e.g. the residual $f(x) = y - Ax$: minimize $||f(x)||^2$, Taylor expanding $f(x)$ only (to first order):
$$\frac{\partial \, ||f(x_0) + \nabla f(x_0)(x_1 - x_0)||^2}{\partial x_1} = 0$$

$$x_1 = x_0 - (\nabla f^T(x_0)\nabla f(x_0))^{-1}\nabla f^T(x_0) f(x_0)$$

So only $\nabla f(x_0)$ and $f(x_0)$ are needed (no second derivatives, unlike Newton's Method).
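
A small numpy sketch of one Gauss-Newton step (names and the random test problem are mine); for the linear residual $f(x) = y - Ax$ a single step already lands on the least-squares solution:

```python
import numpy as np

def gauss_newton_step(x0, residual, jacobian):
    """One Gauss-Newton update: x1 = x0 - (J^T J)^{-1} J^T r(x0)."""
    r = residual(x0)
    J = jacobian(x0)
    return x0 - np.linalg.solve(J.T @ J, J.T @ r)

# Linear residual r(x) = y - A x, whose Jacobian is -A.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))
y = rng.normal(size=5)
x1 = gauss_newton_step(np.zeros(2), lambda x: y - A @ x, lambda x: -A)
print(np.allclose(x1, np.linalg.lstsq(A, y, rcond=None)[0]))  # True
```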

Linear Algebra

Differentiation

Numerator Layout

Gradients of Vectors

Given a function $\mathbf{f}: R^n \to R^m$, $\mathbf{x} \in R^n$, $\mathbf{f}(\mathbf{x}) \in R^m$:

$$\frac{d \mathbf{f}(\mathbf{x})}{d \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & ... & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\ ... & ... & ... \\ \frac{\partial f_m(\mathbf{x})}{\partial x_1} & ... & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix}_{m \times n}$$

Sometimes this is written as $\frac{d \mathbf{f}(\mathbf{x})}{d \mathbf{x}^T}$, but in Mathematics for Machine Learning (Deisenroth et al.) it is written as $\frac{d \mathbf{f}(\mathbf{x})}{d \mathbf{x}}$.

Jacobian

The Jacobian matrix of a function $\mathbf{f}$ (possibly a nonlinear transformation) is defined as follows; it is the local linear approximation of $\mathbf{f}$.

$$J_{\mathbf{f}(\mathbf{x})} = \frac{d \mathbf{f}(\mathbf{x})}{d \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & ... & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\ ... & ... & ... \\ \frac{\partial f_m(\mathbf{x})}{\partial x_1} & ... & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix}_{m \times n}$$

Consider the scalar case, i.e. the first-order approximation used in Newton's Method:
$$f(x_1) \approx f(x_2) + f'(x_2)(x_1 - x_2)$$

Now for $\mathbf{f}: R^n \to R^m$:
$$\underset{m}{\mathbf{f}(\mathbf{x}_1)} \approx \underset{m}{\mathbf{f}(\mathbf{x}_2)} + \underset{m \times n}{J_{\mathbf{f}(\mathbf{x}_2)}}\, \underset{n}{(\mathbf{x}_1 - \mathbf{x}_2)}$$
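
A finite-difference sketch of the Jacobian in numerator layout (a rough numerical check, not an exact method; the helper name is mine):

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f: R^n -> R^m, in numerator layout (m x n)."""
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f(x))
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.asarray(f(x + dx)) - f0) / eps
    return J

# f: R^2 -> R^2, f(x) = [x1*x2, x1^2]; analytic Jacobian = [[x2, x1], [2*x1, 0]].
f = lambda x: np.array([x[0] * x[1], x[0] ** 2])
print(numerical_jacobian(f, [2.0, 3.0]))  # approx [[3, 2], [4, 0]]
```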

Hessian

Given another function $f: R^n \to R$, $\mathbf{x} \in R^n$, the Hessian matrix is:
$$H_{f(\bold{x})} = \nabla^2 f(\bold{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & ... & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ ... & ... & ... \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & ... & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}_{n \times n}$$
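
For example (a small case added here): for $f(x_1, x_2) = x_1^2 + x_1 x_2$,

$$H_f = \begin{bmatrix} 2 & 1 \\ 1 & 0 \end{bmatrix}$$

When the second partial derivatives are continuous, the Hessian is symmetric.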

Chain Rule

Given a function $\mathbf{h}: R^n \to R^m$, $\mathbf{x} \in R^n$, $\mathbf{h}(\mathbf{x}) = \mathbf{f}(\mathbf{g}(\mathbf{x}))$, where $\mathbf{g}: R^n \to R^k$, $\mathbf{f}: R^k \to R^m$:
$$\underset{m \times n}{\frac{d \mathbf{h}(\mathbf{x})}{d \mathbf{x}}} = \underset{m \times k}{\frac{\partial \mathbf{f}(\mathbf{g}(\mathbf{x}))}{\partial \mathbf{g}(\mathbf{x})}} \ \underset{k \times n}{\frac{\partial \mathbf{g}(\mathbf{x})}{\partial \mathbf{x}}}$$

Gradients of Matrices

Given a function $\mathbf{f}: R^{m \times n} \to R^{p \times q}$, $X \in R^{m \times n}$, the derivative is a tensor:

$$\frac{d \mathbf{f}(X)}{d X} \in R^{(p \times q) \times (m \times n)}$$

Denominator Layout

Given a function $\mathbf{f}: R^n \to R^m$, $\mathbf{x} \in R^n$, $\mathbf{f}(\mathbf{x}) \in R^m$:

$$\frac{d \mathbf{f}(\mathbf{x})}{d \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & ... & \frac{\partial f_m(\mathbf{x})}{\partial x_1} \\ ... & ... & ... \\ \frac{\partial f_1(\mathbf{x})}{\partial x_n} & ... & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix}_{n \times m}$$

Chain Rule

Given a function $\mathbf{h}: R^n \to R^m$, $\mathbf{x} \in R^n$, $\mathbf{h}(\mathbf{x}) = \mathbf{f}(\mathbf{g}(\mathbf{x}))$, where $\mathbf{g}: R^n \to R^k$, $\mathbf{f}: R^k \to R^m$:
$$\underset{n \times m}{\frac{d \mathbf{h}(\mathbf{x})}{d \mathbf{x}}} = \underset{n \times k}{\frac{\partial \mathbf{g}(\mathbf{x})}{\partial \mathbf{x}}} \ \underset{k \times m}{\frac{\partial \mathbf{f}(\mathbf{g}(\mathbf{x}))}{\partial \mathbf{g}(\mathbf{x})}}$$

Gradients of Matrices

Given a function $\mathbf{f}: R^{m \times n} \to R^{p \times q}$, $X \in R^{m \times n}$, the derivative is a tensor:

$$\frac{d \mathbf{f}(X)}{d X} \in R^{(m \times n) \times (p \times q)}$$

Least Squares

There are $m$ data points, each with $n$ features.

Data: $X_{m \times (n+1)} = \begin{bmatrix} 1 & x_{11} & x_{12} & ... & x_{1n}\\ 1 & x_{21} & x_{22} & ... & x_{2n}\\ ... & ... & ... & ... & ...\\ 1 & x_{m1} & x_{m2} & ... & x_{mn} \end{bmatrix}$

Label: $Y_{m \times 1} = \begin{bmatrix} y_1\\ y_2\\ ...\\ y_m \end{bmatrix}$

To predict $\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n$, construct: $\theta_{(n+1) \times 1} = \begin{bmatrix} \theta_0\\ \theta_1\\ ...\\ \theta_n \end{bmatrix}$

Therefore $\hat{Y} = X\theta$.

We want the least-squares error, i.e. the loss function $L = \frac{1}{2}(X\theta - Y)^T(X\theta - Y)$ is to be minimized.

$\to \frac{\partial L}{\partial \theta} = 0$

$\to \theta = (X^T X)^{-1} X^T Y$
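
A numpy sketch of this normal-equation solution (the synthetic data and names are mine; `np.linalg.lstsq` is the numerically safer routine in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # prepend the bias column of 1s
true_theta = np.array([1.0, 2.0, -3.0, 0.5])
Y = X @ true_theta + 0.01 * rng.normal(size=m)

# Normal-equation solution theta = (X^T X)^{-1} X^T Y.
theta = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta)  # close to [1, 2, -3, 0.5]
```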

Determinant

A square matrix $A$ has a determinant $\det A$, which is a number. E.g. $A = \begin{bmatrix} a & b\\ c & d \end{bmatrix}$, $\det A = \begin{vmatrix} a & b\\ c & d \end{vmatrix}$

3 basic properties

derived properties

how to calculate

$$A = \begin{bmatrix} a & b\\ c & d \end{bmatrix}, \quad \det A = ad - bc$$

If: $A = \begin{bmatrix} a_{11} & a_{12} & ... & a_{1n}\\ a_{21} & a_{22} & ... & a_{2n}\\ ... & ... & ... & ...\\ a_{n1} & a_{n2} & ... & a_{nn} \end{bmatrix}$

let $M_{ij}$ be $A$ with row $i$ and column $j$ removed: $M_{ij} = \begin{bmatrix} a_{11} & ... & a_{1(j-1)} & a_{1(j+1)} & ... & a_{1n}\\ ... & ... & ... & ... & ... & ...\\ a_{(i-1)1} & ... & a_{(i-1)(j-1)} & a_{(i-1)(j+1)} & ... & a_{(i-1)n}\\ a_{(i+1)1} & ... & a_{(i+1)(j-1)} & a_{(i+1)(j+1)} & ... & a_{(i+1)n}\\ ... & ... & ... & ... & ... & ...\\ a_{n1} & ... & a_{n(j-1)} & a_{n(j+1)} & ... & a_{nn} \end{bmatrix}$

let: $c_{ij} = (-1)^{i+j} \det M_{ij}$

then: $\det A = \sum_{r=1}^{n} a_{ir} c_{ir}$ (expansion along any row $i$)

In other words, expanding along the first row: $\det A = a_{11}\det M_{11} - a_{12}\det M_{12} + a_{13}\det M_{13} - ... + (-1)^{n+1} a_{1n}\det M_{1n}$
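
A recursive Python sketch of this cofactor expansion (for illustration only, since it is $O(n!)$; the helper name is mine):

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)  # remove row 0 and column j
        total += (-1) ** j * A[0, j] * det_cofactor(minor)
    return total

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(det_cofactor(A), np.linalg.det(A))  # both -2.0
```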

Eigenvalues & Eigenvectors

For a square matrix $A$, its eigenvalue $\lambda$ and its eigenvector $v$ satisfy $Av = \lambda v$.

(AλI)v=0\to (A - \lambda I) v = 0

for $v$ other than $0$ to exist $\to (A - \lambda I)$ must be singular

det(AλI)=0\to \det(A - \lambda I) = 0

Singular Value Decomposition (SVD)

Probability

Maximum Likelihood Estimation (MLE)

Given a probability distribution $P(x|\mu)$ and $N$ experimental results $X$, the most likely value of the parameter $\mu$ is estimated as $\mu_{ML}$.

Do it step by step:

  1. Probability of these $N$ experiments (assuming they are independent): $P(X|\mu) = \prod_{n=1}^{N} P(x_n|\mu)$
  2. Take the log: $\ln P(X|\mu) = \sum_{n=1}^{N} \ln P(x_n|\mu)$
  3. Maximize $P(X|\mu)$: set $\frac{\partial \ln P(X|\mu)}{\partial \mu} = \frac{\partial \sum_{n=1}^{N} \ln P(x_n|\mu)}{\partial \mu} = 0$
  4. Solve the equation: $\mu = \mu_{ML}$ (see the worked example below)
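
As a concrete illustration (a standard coin-flip example added here, not from the original notes): for a Bernoulli distribution $P(x|\mu) = \mu^x (1-\mu)^{1-x}$ with $k$ ones among the $N$ results,

$$\ln P(X|\mu) = k\ln\mu + (N - k)\ln(1-\mu), \qquad \frac{\partial \ln P(X|\mu)}{\partial \mu} = \frac{k}{\mu} - \frac{N-k}{1-\mu} = 0 \ \Rightarrow\ \mu_{ML} = \frac{k}{N}$$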

Covariance

$D$-dimensional vector $\bold{x} = [x_1, x_2, ..., x_D]^T$

Covariance:
$$cov(x_i, x_j) = E((x_i - E(x_i))(x_j - E(x_j))), \quad i, j \in \{1, 2, ..., D\}$$
Properties:
$$cov(x_i, x_i) = E((x_i - E(x_i))^2) = var(x_i)$$
$$cov(x_i, x_j) = cov(x_j, x_i)$$

Covariance matrix:
$$\Sigma = \begin{bmatrix} cov(x_1, x_1) & ... & cov(x_1, x_D)\\ ... & ... & ...\\ cov(x_D, x_1) & ... & cov(x_D, x_D) \end{bmatrix}$$
Properties:
$$\Sigma = \Sigma^T$$
$$\Sigma^{-1} = \Lambda \quad \text{(the precision matrix)}$$

Gaussian Distribution

A single variable $x$ with mean $\mu$ and variance $\sigma^2$:
$$N(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-\frac{1}{2\sigma^2}(x - \mu)^2\}$$

A $D$-dimensional vector $\bold{x}$ with mean $\mu$ (a $D$-dimensional vector) and covariance matrix $\Sigma$ (a $D \times D$ matrix):
$$N(\bold{x}|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2} \det(\Sigma)^{1/2}} \exp\{-\frac{1}{2}(\bold{x} - \mu)^T \Sigma^{-1} (\bold{x} - \mu)\}$$

Mahalanobis Distance

$\Delta$ is the Mahalanobis distance from $\mu$ to $\bold{x}$: $\Delta^2 = (\bold{x} - \mu)^T \Sigma^{-1} (\bold{x} - \mu)$
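
A small numpy sketch evaluating both the Mahalanobis distance and the Gaussian density (helper names and example numbers are mine):

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return d @ np.linalg.solve(Sigma, d)

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    D = len(mu)
    norm = (2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * mahalanobis_sq(x, mu, Sigma)) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([1.0, -1.0])
print(mahalanobis_sq(x, mu, Sigma), gaussian_pdf(x, mu, Sigma))
```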

Exponent

$C$ is a constant (it absorbs all terms that do not depend on $\bold{x}$).
$$\ln N(\bold{x}|\mu, \Sigma) = -\frac{1}{2}(\bold{x} - \mu)^T \Sigma^{-1} (\bold{x} - \mu) + C = -\frac{1}{2}\bold{x}^T \Sigma^{-1} \bold{x} + \bold{x}^T \Sigma^{-1} \mu + C$$
which eases the calculation.

Conditional

Partition $\bold{x}$ into two disjoint subsets $\bold{x}_a, \bold{x}_b$: $\bold{x} = \begin{pmatrix} \bold{x}_a\\ \bold{x}_b \end{pmatrix}$, with $\mu$ and $\Sigma$ partitioned accordingly into $\mu_a, \mu_b$ and blocks $\Sigma_{aa}, \Sigma_{ab}, \Sigma_{ba}, \Sigma_{bb}$.

The conditional distribution is also a Gaussian: $p(\bold{x}_a|\bold{x}_b) = N(\bold{x}_a|\mu_{a|b}, \Sigma_{a|b})$
$$\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\bold{x}_b - \mu_b)$$
$$\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$$
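
A numpy sketch of these block formulas (the helper name, index arguments, and example numbers are assumptions of this sketch):

```python
import numpy as np

def gaussian_conditional(mu, Sigma, idx_a, idx_b, x_b):
    """Parameters of p(x_a | x_b) for a jointly Gaussian x, using the block formulas above."""
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    K = S_ab @ np.linalg.inv(S_bb)
    mu_cond = mu_a + K @ (x_b - mu_b)
    Sigma_cond = S_aa - K @ S_ab.T      # S_ab.T equals S_ba since Sigma is symmetric
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
print(gaussian_conditional(mu, Sigma, [0], [1], np.array([2.0])))  # mean 0.8, variance 1.36
```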

Marginal

The marginal distribution is also a Gaussian: $p(\bold{x}_a) = \int p(\bold{x}_a, \bold{x}_b)\, d\bold{x}_b = N(\bold{x}_a|\mu_a, \Sigma_{aa})$

Maximum Likelihood

Given $N$ observations $X = [\bold{x}_1, \bold{x}_2, ..., \bold{x}_N]^T$, the maximum likelihood estimates of $\mu$ and $\Sigma$ are:
$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} \bold{x}_n$$
$$\Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N} (\bold{x}_n - \mu_{ML})(\bold{x}_n - \mu_{ML})^T$$
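
A numpy sketch of these two estimators on synthetic data (sampling parameters chosen here; note the $\frac{1}{N}$ factor rather than $\frac{1}{N-1}$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=10_000)

mu_ml = X.mean(axis=0)                      # (1/N) sum x_n
centered = X - mu_ml
Sigma_ml = centered.T @ centered / len(X)   # (1/N) sum (x_n - mu)(x_n - mu)^T  (biased, no N-1)
print(mu_ml, Sigma_ml)                      # close to the true mean and covariance
```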

Information Theory

Entropy

$x \sim P(x)$:
$$H(X) = -\sum_{x} P(x)\log P(x) = E(-\log P(x))$$
If $\log$ is $\log_2$, the unit is the bit.

Cross Entropy

Random variables $x \sim P(x)$ and $y \sim Q(y)$ take values in the same set.
$$H(P, Q) = -\sum_{x} P(x)\log Q(x) = E_{x \sim P}(-\log Q(x))$$
This is the expected code length when $x$ is encoded with the code that is optimal for $y$'s distribution $Q$.

KL Divergence

$x \sim P(x)$, $y \sim Q(y)$:
$$D_{KL}(P||Q) = -\sum_{x} P(x)\log \frac{Q(x)}{P(x)} = E_{x \sim P}\left(\log \frac{P(x)}{Q(x)}\right) = H(P, Q) - H(P)$$
It measures how the distribution $P$ of $x$ differs from the distribution $Q$ of $y$.
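
A tiny numpy sketch of these three quantities in bits (it assumes all probabilities are strictly positive; the function names are mine):

```python
import numpy as np

def entropy(p):
    """H(P) = -sum p log2 p, in bits."""
    p = np.asarray(p)
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H(P, Q) = -sum p log2 q, in bits."""
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P)."""
    return cross_entropy(p, q) - entropy(p)

P = np.array([0.5, 0.25, 0.25])
Q = np.array([1 / 3, 1 / 3, 1 / 3])
print(entropy(P), cross_entropy(P, Q), kl_divergence(P, Q))  # 1.5, ~1.585, ~0.085
```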