Math Cheatsheet for ML
Calculus
Gradient
def $z = f(\bold{x})$ with $\bold{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$
$\nabla f = \frac{\partial f}{\partial \bold{x}} = [\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}]$ is the gradient of the function, i.e. the partial derivative with respect to the input vector. The output:
is a row vector in the input space.
points in the direction in which the function increases fastest.
has magnitude (length) equal to the rate of increase in that direction.
in addition:
$\nabla f(\bold{x}) + \nabla g(\bold{x}) = (\frac{\partial f(\bold{x})}{\partial x_1} + \frac{\partial g(\bold{x})}{\partial x_1}, \dots, \frac{\partial f(\bold{x})}{\partial x_n} + \frac{\partial g(\bold{x})}{\partial x_n}) = (\frac{\partial (f(\bold{x}) + g(\bold{x}))}{\partial x_1}, \dots, \frac{\partial (f(\bold{x}) + g(\bold{x}))}{\partial x_n}) = \nabla (f(\bold{x}) + g(\bold{x}))$
$\nabla(fg) = f\nabla g + g\nabla f$
$\nabla(\frac{f}{g}) = \frac{g\nabla f - f\nabla g}{g^2}$
$\partial (x^T) = (\partial x)^T$
$\frac{\partial x^T A x}{\partial x} = x^T(A + A^T)$
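The last identity can be checked numerically; a minimal numpy sketch (the matrix, point, and finite-difference helper are illustrative):

```python
import numpy as np

def quad(x, A):
    """Quadratic form x^T A x."""
    return x @ A @ x

def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient, returned as a row vector."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
x = rng.normal(size=3)

analytic = x @ (A + A.T)                          # x^T (A + A^T)
numeric = numerical_grad(lambda v: quad(v, A), x)
print(np.allclose(analytic, numeric, atol=1e-4))  # expect True
```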
Lagrange Multiplier
Q1: given the constraint $g(x,y) = 0$, find the extreme values of $f(x,y)$
$\to$ (at an extreme point $(x_0,y_0)$, $\nabla f(x_0,y_0)$ and $\nabla g(x_0,y_0)$ should have the same or opposite direction)
$\to$ $\nabla f(x_0,y_0) = \lambda \nabla g(x_0,y_0),\ \lambda \in R$
$\to$ def $F(x,y) = f(x,y) - \lambda g(x,y)$, then $\nabla F(x_0,y_0) = \vec{0}$
$\to$ solve the equations:
$\nabla F(x_0,y_0) = \vec{0}, \quad g(x,y) = 0$
Q2: given the constraints $g_1(x,y) = 0$, $g_2(x,y) = 0$, find the extreme values of $f(x,y)$
$\to$ $F(x,y) = f(x,y) - \lambda_1 g_1(x,y) - \lambda_2 g_2(x,y)$
$\to$ ... (then proceed as in Q1, solving $\nabla F = \vec{0}$ together with both constraints)
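As a concrete example of the single-constraint recipe (Q1), a sketch using sympy (assuming it is installed); the objective $f = x + y$ and constraint $g = x^2 + y^2 - 1$ are illustrative:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x + y                       # objective
g = x**2 + y**2 - 1             # constraint g(x, y) = 0

F = f - lam * g                 # F(x, y) = f - lambda * g
eqs = [sp.diff(F, x), sp.diff(F, y), g]   # grad F = 0 together with g = 0
print(sp.solve(eqs, [x, y, lam]))         # two critical points: x = y = ±sqrt(2)/2
```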
Taylor Expansion
for a given $y = f(x)$:
$f(x_1)-f(x_0) = f'(x_0)(x_1-x_0) + o(x_1-x_0)$
extending to higher orders, we get:
$f(x_1)-f(x_0) = f'(x_0)(x_1-x_0) + \frac{f''(x_0)}{2!}(x_1-x_0)^2 + \frac{f^{(3)}(x_0)}{3!}(x_1-x_0)^3 + \dots + \frac{f^{(n)}(x_0)}{n!}(x_1-x_0)^n + \dots$
then we get:
$f(x_1) = f(x_0) + f'(x_0)(x_1-x_0) + \frac{f''(x_0)}{2!}(x_1-x_0)^2 + \frac{f^{(3)}(x_0)}{3!}(x_1-x_0)^3 + \dots + \frac{f^{(n)}(x_0)}{n!}(x_1-x_0)^n + \dots$
$n \to \infty$
or, expanding around $x_0 = 0$ (the Maclaurin series):
$f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \frac{f^{(3)}(0)}{3!}x^3 + \dots + \frac{f^{(n)}(0)}{n!}x^n + \dots$
$n \to \infty$
$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots$
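A quick numerical check of the truncated Maclaurin series for $e^x$ (a minimal sketch; the evaluation point and number of terms are arbitrary):

```python
import math

def exp_taylor(x, n_terms=10):
    """Truncated Maclaurin series: sum_{k=0}^{n_terms-1} x^k / k!."""
    return sum(x**k / math.factorial(k) for k in range(n_terms))

x = 1.5
print(exp_taylor(x, 10), math.exp(x))        # close to each other
print(abs(exp_taylor(x, 20) - math.exp(x)))  # error shrinks as terms are added
```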
Newton's Method
given a function $f: R^n \to R$, its second-order Taylor expansion is:
$f(x_1) \approx f(x_0) + \nabla f(x_0)(x_1 - x_0) + \frac{1}{2}(x_1 - x_0)^T \nabla^2 f(x_0) (x_1 - x_0)$
we want to find where $\frac{\partial f}{\partial x} = 0$, so we differentiate both sides with respect to $x_1$:
$\nabla f(x_1) = \nabla f(x_0) + (\nabla^2 f(x_0) (x_1 - x_0))^T = 0$
solving the equation for $x_1$:
$x_1 = x_0 - (\nabla^2 f(x_0))^{-1}(\nabla f(x_0))^T$
we get the iteration formula (writing the Hessian matrix $H_f$ in place of $\nabla^2 f$):
$x_{k + 1} = x_{k} - (H_f(x_{k}))^{-1}(\nabla f(x_{k}))^T$
now, in order to find where the minimum of $f$ is, we do the following:
initialize $x_0$ as a first approximation of the minimum point.
(for illustration) compute $x_1 = x_0 - (H_f(x_0))^{-1}(\nabla f(x_0))^T$
iteratively compute $x_{k+1} = x_{k} - (H_f(x_{k}))^{-1}(\nabla f(x_{k}))^T$ until the value is satisfactory, as sketched below.
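A minimal sketch of this iteration with an analytic gradient and Hessian; the objective $f(x) = x_1^4 + x_2^2 + x_1 x_2$, the starting point, and the stopping tolerance are illustrative choices:

```python
import numpy as np

def grad(x):
    # gradient of f(x) = x1^4 + x2^2 + x1*x2 (example objective)
    return np.array([4 * x[0]**3 + x[1], 2 * x[1] + x[0]])

def hessian(x):
    return np.array([[12 * x[0]**2, 1.0],
                     [1.0,          2.0]])

x = np.array([1.0, 1.0])          # x_0: first approximation
for _ in range(20):               # x_{k+1} = x_k - H^{-1} grad
    step = np.linalg.solve(hessian(x), grad(x))
    x = x - step
    if np.linalg.norm(step) < 1e-10:
        break
print(x)   # converges to the local minimum near [0.354, -0.177]
```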
Gauss-Newton Method
optimization of the least-squares problem $f(x) = y - Ax$: minimize $||f(x)||^2$, Taylor-expanding $f(x)$ only:
$\frac{\partial || f(x_0) + \nabla f(x_0)(x_1 - x_0)||^2}{\partial x_1} = 0$
$x_1 = x_0 - (\nabla f^T(x_0)\nabla f(x_0))^{-1}\nabla f^T(x_0)f(x_0)$
Then only $\nabla f(x_0)$ and $f(x_0)$ are needed (no second derivatives), as in the sketch below.
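A sketch of the Gauss-Newton update for an illustrative residual $f_i(p) = y_i - a e^{b t_i}$ (fitting parameters $a, b$ to simulated data); here $\nabla f$ is the Jacobian of the residual vector:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
y = 2.0 * np.exp(1.5 * t) + 0.01 * rng.normal(size=t.size)  # noisy data

def residual(p):               # f(x) = y - model(x)
    a, b = p
    return y - a * np.exp(b * t)

def jacobian(p):               # d residual / d p, shape (50, 2)
    a, b = p
    return np.column_stack([-np.exp(b * t), -a * t * np.exp(b * t)])

p = np.array([1.0, 1.0])       # initial guess
for _ in range(20):
    J, r = jacobian(p), residual(p)
    step = np.linalg.solve(J.T @ J, J.T @ r)   # (J^T J)^{-1} J^T f
    p = p - step
    if np.linalg.norm(step) < 1e-10:
        break
print(p)   # approximately [2.0, 1.5]
```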
Linear Algebra
Differentiation
Numerator Layout
Gradients of Vectors
Given function $\mathbf{f}: R^n \to R^m$, $\mathbf{x} \in R^n$, $\mathbf{f}(\mathbf{x}) \in R^m$:
$\frac{d \mathbf{f}(\mathbf{x})}{d \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & \dots & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m(\mathbf{x})}{\partial x_1} & \dots & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix}_{m \times n}$
Sometimes it is written as $\frac{d \mathbf{f}(\mathbf{x})}{d \mathbf{x}^T}$, but in Mathematics for Machine Learning (Deisenroth et al.) it is written as $\frac{d \mathbf{f}(\mathbf{x})}{d \mathbf{x}}$.
Jacobian
The Jacobian matrix of a function $\mathbf{f}$ is defined as follows; it is the local linear approximation of $\mathbf{f}$ (which may itself be a nonlinear transformation).
$J_{\mathbf{f}(\mathbf{x})} = \frac{d \mathbf{f}(\mathbf{x})}{d \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & \dots & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m(\mathbf{x})}{\partial x_1} & \dots & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix}_{m \times n}$
Consider the scalar case (the first-order approximation used in Newton's method):
$f(x_1) \approx f(x_2) + f'(x_2)(x_1 - x_2)$
Now for $\mathbf{f}: R^n \to R^m$:
$\underset{m}{\mathbf{f}(\mathbf{x}_1)} \approx \underset{m}{\mathbf{f}(\mathbf{x}_2)} + \underset{m \times n}{J_{\mathbf{f}(\mathbf{x}_2)}} \underset{n}{(\mathbf{x}_1 - \mathbf{x}_2)}$
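A numpy sketch that builds the $m \times n$ Jacobian of an example $\mathbf{f}: R^2 \to R^2$ by finite differences and checks the linearization above (the function and points are illustrative):

```python
import numpy as np

def f(x):                      # example f: R^2 -> R^2
    return np.array([x[0]**2 * x[1], np.sin(x[1])])

def jacobian_fd(f, x, eps=1e-6):
    """Numerator-layout (m x n) Jacobian via central differences."""
    cols = []
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.column_stack(cols)

x2 = np.array([1.0, 2.0])
x1 = x2 + np.array([0.01, -0.02])
J = jacobian_fd(f, x2)
print(J.shape)                     # (2, 2): m x n
print(f(x1))                       # exact value
print(f(x2) + J @ (x1 - x2))       # local linear approximation, close to exact
```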
Hessian
Given another function $f: R^n \to R$, $\mathbf{x} \in R^n$, the Hessian matrix is:
$H_{f(\bold{x})} = \nabla^2 f(\bold{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \dots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}_{n \times n}$
Chain Rule
Given function $\mathbf{h}: R^n \to R^m$, $\mathbf{x} \in R^n$, $\mathbf{h}(\mathbf{x}) = \mathbf{f}(\mathbf{g}(\mathbf{x}))$, where $\mathbf{g}: R^n \to R^k$, $\mathbf{f}: R^k \to R^m$.
$\underset{m \times n}{\frac{d \mathbf{h}(\mathbf{x})}{d \mathbf{x}}} = \underset{m \times k}{\frac{\partial \mathbf{f}(\mathbf{g}(\mathbf{x}))}{\partial \mathbf{g}(\mathbf{x})}} \underset{k \times n}{\frac{\partial \mathbf{g}(\mathbf{x})}{\partial \mathbf{x}}}$
Gradients of Matrices
Given function $\mathbf{f}: R^{m \times n} \to R^{p \times q}$, $X \in R^{m \times n}$, the derivative is a tensor:
$\frac{d \mathbf{f}(X)}{d X} \in R^{(p \times q) \times (m \times n)}$
Denominator Layout
Given function $\mathbf{f}: R^n \to R^m$, $\mathbf{x} \in R^n$, $\mathbf{f}(\mathbf{x}) \in R^m$:
$\frac{d \mathbf{f}(\mathbf{x})}{d \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & \dots & \frac{\partial f_m(\mathbf{x})}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_1(\mathbf{x})}{\partial x_n} & \dots & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix}_{n \times m}$
Chain Rule
Given function $\mathbf{h}: R^n \to R^m$, $\mathbf{x} \in R^n$, $\mathbf{h}(\mathbf{x}) = \mathbf{f}(\mathbf{g}(\mathbf{x}))$, where $\mathbf{g}: R^n \to R^k$, $\mathbf{f}: R^k \to R^m$.
$\underset{n \times m}{\frac{d \mathbf{h}(\mathbf{x})}{d \mathbf{x}}} = \underset{n \times k}{\frac{\partial \mathbf{g}(\mathbf{x})}{\partial \mathbf{x}}} \underset{k \times m}{\frac{\partial \mathbf{f}(\mathbf{g}(\mathbf{x}))}{\partial \mathbf{g}(\mathbf{x})}}$
Gradients of Matrices
Given function $\mathbf{f}: R^{m \times n} \to R^{p \times q}$, $X \in R^{m \times n}$, the derivative is a tensor:
$\frac{d \mathbf{f}(X)}{d X} \in R^{(m \times n) \times (p \times q)}$
Least Squares
There are $m$ data points, each with $n$ features.
Data: $X_{m\times (n+1)} = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1n} \\ 1 & x_{21} & x_{22} & \dots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & x_{m2} & \dots & x_{mn} \end{bmatrix}$
Label: $Y_{m \times 1} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}$
to predict $\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$, construct: $\theta_{(n+1) \times 1} = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$
therefore $\hat{Y} = X\theta$
we want the least-squares error, i.e. the loss function $L = \frac{1}{2}(X\theta - Y)^T(X\theta - Y)$, to be minimized.
$\to \frac{\partial L}{\partial \theta} = 0$
$\to \theta = (X^TX)^{-1}X^TY$
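A minimal numpy sketch of the closed form on synthetic data (in practice np.linalg.lstsq or a pseudo-inverse is usually preferred for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
features = rng.normal(size=(m, n))
X = np.column_stack([np.ones(m), features])      # prepend the bias column of 1s
true_theta = np.array([0.5, 1.0, -2.0, 3.0])     # theta_0 .. theta_n
Y = X @ true_theta + 0.01 * rng.normal(size=m)   # noisy labels

theta = np.linalg.solve(X.T @ X, X.T @ Y)        # (X^T X)^{-1} X^T Y
print(theta)                                     # close to true_theta
```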
Determinant
a square matrix $A$ has a determinant $\det A$, which is a number; e.g. for $A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$, $\det A = \begin{vmatrix} a & b \\ c & d \end{vmatrix}$
3 basic properties
$\det I = 1$
row exchanges change the sign: e.g. $\begin{vmatrix} c & d \\ a & b \end{vmatrix} = -\begin{vmatrix} a & b \\ c & d \end{vmatrix}$
the determinant is additive in each row (adding vectors in a row adds determinants): e.g. $\begin{vmatrix} a + a' & b + b' \\ c & d \end{vmatrix} = \begin{vmatrix} a & b \\ c & d \end{vmatrix} + \begin{vmatrix} a' & b' \\ c & d \end{vmatrix}$
derived properties
if $A$ is singular, then $\det A = 0$.
...
how to calculate
$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \quad \det A = ad - bc$
If: $A = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{bmatrix}$
let $M_{ij}$ be the minor obtained by deleting row $i$ and column $j$: $M_{ij} = \begin{bmatrix} a_{11} & \dots & a_{1(j-1)} & a_{1(j+1)} & \dots & a_{1n} \\ \vdots & & \vdots & \vdots & & \vdots \\ a_{(i-1)1} & \dots & a_{(i-1)(j-1)} & a_{(i-1)(j+1)} & \dots & a_{(i-1)n} \\ a_{(i+1)1} & \dots & a_{(i+1)(j-1)} & a_{(i+1)(j+1)} & \dots & a_{(i+1)n} \\ \vdots & & \vdots & \vdots & & \vdots \\ a_{n1} & \dots & a_{n(j-1)} & a_{n(j+1)} & \dots & a_{nn} \end{bmatrix}$
let: $c_{ij} = (-1)^{i + j} \det M_{ij}$ (the cofactor)
then: $\det A = \sum_{r = 1}^{n} a_{ir}c_{ir}$ (expansion along row $i$)
in other words, expanding along the first row: $\det A = a_{11}\det M_{11} - a_{12}\det M_{12} + a_{13}\det M_{13} - \dots + (-1)^{n+1} a_{1n}\det M_{1n}$
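The first-row expansion translates directly into a recursive function; this is exponential time and shown only for illustration (np.linalg.det is the practical choice):

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for j in range(n):
        minor = np.delete(np.delete(A, 0, axis=0), j, axis=1)  # M_{1j}
        total += (-1) ** j * A[0, j] * det_cofactor(minor)     # (-1)^{1+j} a_{1j} det M_{1j}
    return total

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 4.0],
              [0.0, 2.0, 5.0]])
print(det_cofactor(A), np.linalg.det(A))   # both ~ 9.0
```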
Eigenvalues & Eigenvectors
a square matrix $A$, its eigenvalue $\lambda$ and its eigenvector $v$ satisfy $Av = \lambda v$.
$\to (A - \lambda I) v = 0$
$v$ other than $0$ $\to$ $(A - \lambda I)$ is singular
$\to \det(A - \lambda I) = 0$
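A quick numpy check that the eigenvalues returned by np.linalg.eig indeed satisfy $\det(A - \lambda I) = 0$ and $Av = \lambda v$ (the example matrix is arbitrary):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)        # eigenvalues 5 and 2

for k, lam in enumerate(eigvals):
    v = eigvecs[:, k]
    print(np.linalg.det(A - lam * np.eye(2)))   # ~ 0
    print(np.allclose(A @ v, lam * v))          # True
```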
Singular Value Decomposition (SVD)
Probability
Maximum Likelihood Estimation (MLE)
Given a probability distribution $P(x|\mu)$ and $N$ experimental results $X$, the parameter $\mu$ is estimated by its most likely value $\mu_{ML}$.
Do it step by step:
Probability of these $N$ (independent) experiments: $P(X|\mu) = \prod_{n = 1}^{N}P(x_n|\mu)$
Take the log: $\ln P(X|\mu) = \sum_{n = 1}^{N}\ln P(x_n|\mu)$
To maximize $P(X|\mu)$: $\frac{\partial \ln P(X|\mu)}{\partial \mu} = \frac{\partial \sum_{n = 1}^{N}\ln P(x_n|\mu)}{\partial \mu} = 0$
Solve the equation: $\mu = \mu_{ML}$
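A concrete example: for a Bernoulli distribution $P(x|\mu) = \mu^x(1-\mu)^{1-x}$, this recipe gives $\mu_{ML} = \frac{1}{N}\sum_n x_n$ (the sample mean). The sketch below checks that against a brute-force maximization of the log-likelihood on simulated coin flips:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=200)        # N coin flips with true mu = 0.7

def log_likelihood(mu, X):
    return np.sum(X * np.log(mu) + (1 - X) * np.log(1 - mu))

grid = np.linspace(0.01, 0.99, 981)       # candidate values of mu
mu_grid = grid[np.argmax([log_likelihood(m, X) for m in grid])]
print(X.mean(), mu_grid)                  # both ~ mu_ML
```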
Covariance
$D$-dimensional vector $\bold{x} = [x_1, x_2, \dots, x_D]^T$
Covariance:
$cov(x_i,x_j) = E((x_i - E(x_i))(x_j - E(x_j))),\ i,j\in\{1,2,\dots,D\}$
Properties:
$cov(x_i,x_i) = E((x_i - E(x_i))^2) = var(x_i)$
$cov(x_i,x_j) = cov(x_j,x_i)$
Covariance matrix:
$\Sigma = \begin{bmatrix} cov(x_1,x_1) & \dots & cov(x_1,x_D) \\ \vdots & \ddots & \vdots \\ cov(x_D,x_1) & \dots & cov(x_D,x_D) \end{bmatrix}$
Properties:
$\Sigma = \Sigma^T$
$\Sigma^{-1} = \Lambda$ (the precision matrix)
Gaussian Distribution
Single variable $x$ with mean $\mu$ and variance $\sigma^2$:
$N(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp\{-\frac{1}{2\sigma^2}(x - \mu)^2\}$
$D$-dimensional vector $\bold{x}$ with mean $\mu$ (a $D$-dimensional vector) and covariance matrix $\Sigma$ (a $D\times D$ matrix):
$N(\bold{x}|\mu,\Sigma) = \frac{1}{(2\pi)^{D/2} \det(\Sigma)^{1/2}}\exp\{-\frac{1}{2}(\bold{x} - \mu)^T\Sigma^{-1}(\bold{x} - \mu)\}$
$\Delta$: Mahalanobis distance
$\Delta$ is the Mahalanobis distance from $\mu$ to $\bold{x}$: $\Delta^2 = (\bold{x} - \mu)^T\Sigma^{-1}(\bold{x} - \mu)$
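A numpy sketch of the density above written in terms of the Mahalanobis distance; for $D = 1$ it reduces to the single-variable formula (all test values are arbitrary):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    maha_sq = diff @ np.linalg.solve(Sigma, diff)   # Delta^2 = (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha_sq) / norm

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.0])
print(gaussian_pdf(x, mu, Sigma))

# D = 1 sanity check against the single-variable formula
print(gaussian_pdf(np.array([0.2]), np.array([0.0]), np.array([[1.0]])),
      np.exp(-0.5 * 0.2**2) / np.sqrt(2 * np.pi))
```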
Exponent
$C$ is a constant.
$\ln N(\bold{x}|\mu,\Sigma) = -\frac{1}{2}(\bold{x} - \mu)^T\Sigma^{-1}(\bold{x} - \mu) + C = -\frac{1}{2}\bold{x}^T\Sigma^{-1}\bold{x} + \bold{x}^T\Sigma^{-1}\mu + C$
which eases the calculation.
Conditional
Partition $\bold{x}$ into two disjoint subsets $\bold{x}_a, \bold{x}_b$: $\bold{x} = \begin{pmatrix} \bold{x}_a \\ \bold{x}_b \end{pmatrix}$
The conditional distribution is also a Gaussian: $p(\bold{x}_a|\bold{x}_b) = N(\bold{x}_a|\mu_{a|b}, \Sigma_{a|b})$
$\mu_{a|b} = \mu_{a} + \Sigma_{ab}\Sigma_{bb}^{-1}(\bold{x}_b - \mu_b)$
$\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$
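A numpy sketch of these conditional formulas for an illustrative 2-D Gaussian partitioned into $\bold{x}_a$ (first component) and $\bold{x}_b$ (second component); all numbers are made up:

```python
import numpy as np

mu = np.array([0.0, 1.0])                 # [mu_a, mu_b]
Sigma = np.array([[2.0, 0.8],             # [[Sigma_aa, Sigma_ab],
                  [0.8, 1.0]])            #  [Sigma_ba, Sigma_bb]]
a, b = slice(0, 1), slice(1, 2)           # index ranges for x_a and x_b

x_b = np.array([2.0])                     # observed value of x_b
Sbb_inv = np.linalg.inv(Sigma[b, b])
mu_cond = mu[a] + Sigma[a, b] @ Sbb_inv @ (x_b - mu[b])
Sigma_cond = Sigma[a, a] - Sigma[a, b] @ Sbb_inv @ Sigma[b, a]
print(mu_cond, Sigma_cond)                # mean and covariance of p(x_a | x_b)
```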
Marginal
The marginal distribution is also a Gaussian: $p(\bold{x}_a) = \int p(\bold{x}_a,\bold{x}_b)d\bold{x}_b = N(\bold{x}_a|\mu_a,\Sigma_{aa})$
Maximum Likelihood
Given $N$ observations $X = [\bold{x}_1,\bold{x}_2,\dots,\bold{x}_N]^T$, the maximum likelihood estimates of $\mu$ and $\Sigma$ are:
$\mu_{ML} = \frac{1}{N}\sum_{n = 1}^{N}\bold{x}_n$
$\Sigma_{ML} = \frac{1}{N}\sum_{n = 1}^{N}(\bold{x}_n - \mu_{ML})(\bold{x}_n - \mu_{ML})^T$
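A numpy check of these estimators on simulated data (note that np.cov normalizes by $N-1$ by default, while $\Sigma_{ML}$ uses $1/N$):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=N)

mu_ml = X.mean(axis=0)                    # (1/N) sum x_n
centered = X - mu_ml
Sigma_ml = centered.T @ centered / N      # (1/N) sum (x_n - mu)(x_n - mu)^T
print(mu_ml)
print(Sigma_ml)
print(np.allclose(Sigma_ml, np.cov(X.T, bias=True)))   # bias=True gives the 1/N version
```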
Entropy
$x \sim P(x)$
$H(X) = -\sum_{x} P(x)\log P(x) = E(- \log P(x))$
if $\log$ is $\log_2$, the unit is bits.
Cross Entropy
Random variables $x$ and $y$ take the same set of possible values.
$x \sim P(x),\ y \sim Q(y)$
$H(x, y) = -\sum P(x)\log Q(y) = E_{x \sim P}(- \log Q(y))$
Interpretation: encoding $x$ using the distribution of $y$.
KL Divergence
$x \sim P(x),\ y \sim Q(y)$
$D_{KL}(x||y) = -\sum_{x} P(x)\log \frac{Q(y)}{P(x)} = E_{x \sim P}\left(\log \frac{P(x)}{Q(y)}\right) = H(x, y) - H(x)$
Measures how the distribution of $x$ differs from that of $y$; see the sketch below.
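A small numpy sketch tying the three quantities together for two discrete distributions over the same values, using $\log_2$ so the units are bits (the distributions are arbitrary):

```python
import numpy as np

P = np.array([0.5, 0.25, 0.25])      # distribution of x
Q = np.array([0.7, 0.2, 0.1])        # distribution of y (same support)

H_P = -np.sum(P * np.log2(P))        # entropy H(x)
H_PQ = -np.sum(P * np.log2(Q))       # cross entropy H(x, y)
D_KL = np.sum(P * np.log2(P / Q))    # KL divergence D_KL(x || y)

print(H_P, H_PQ, D_KL)
print(np.isclose(D_KL, H_PQ - H_P))  # True: D_KL = H(x, y) - H(x)
```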