Matrix methods in data analysis (part2)
November 28, 2023
Matrix methods #
This is my note taking for https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018/pages/readings/
lectures 12-21.
These are practical applications, but hard to remember without actually applying them somewhere.
Randomized Matrix multiplication #
When matrices are huge, multiplying them directly is prohibitive, so we approximate the product by sampling (a sum of rank-1 products) with good statistical properties.
Example #
Simulating a matrix by random sampling.
Suppose \(A = \begin{bmatrix} a & b \end{bmatrix} \)
We construct a matrix \(\hat{A}\) as the sum of two random samples, each selecting one column with probability \( \frac{1}{2} \):
\[ \begin{aligned} X_1 = \begin{cases} [a, 0] & \text{with probability } \frac{1}{2} \\ [0, b] & \text{with probability } \frac{1}{2} \\ \end{cases} \end{aligned} \]
\( E(X_1) = \frac{1}{2} [a,0] + \frac{1}{2} [0, b] \)
\( X = X_1 + X_2\)
\( E(X) = [a, b]\)
Likewise, \( Var(X_1) = \frac{1}{2}([a,0] - \frac{1}{2}[a,b])^2 + \frac{1}{2}([0,b] - \frac{1}{2}[a,b])^2 = [\frac{a^2}{4}, \frac{b^2}{4}] \)
\(Var(X) = \frac{1}{2}[a^2, b^2] \)
Sampling with good properties means: #
- \(E(\hat{A}) = A \)
- keep \( Var(\hat{A}) \) low
Strategy #
- assign a bigger probability to bigger columns
- mix vectors to make them more similar to each other
Solution for strategy #1 #
Problem
We are given \( A, B\). We sample a column \( a_j \) of \(A\) and the matching row \( b_j^T \) of \(B\) and multiply them to get a rank-1 matrix. We take multiple such samples.
How should we pick \( a_j, b_j \) so that the variance is minimized? What probability should we assign to picking \( a_j, b_j \)?
conceptual strategy
(I couldn’t follow the exact argument)
Find a strategy which satisfies \( E(\hat{A}) = A \).
Find the expression for the variance, and find the probabilities \( p_j \) for selecting \( a_j, b_j \) that minimize it, subject to the constraint \( \sum p_j = 1\).
So this is a minimization problem with a constraint, solved with a Lagrange multiplier.
Solution
It turns out \( p_j = \frac{\Vert a_j \Vert \Vert b_j^T \Vert}{C} \), where \(C\) is the normalizing constant, minimizes the variance while keeping the expectation intact.
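A minimal numpy sketch of this sampling scheme (the rescaling by \( 1/(s\,p_j) \) is the standard choice that keeps the estimator unbiased; everything else here is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 50, 200, 40
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))

# probabilities proportional to ||a_j|| * ||b_j^T||
col_norms = np.linalg.norm(A, axis=0)
row_norms = np.linalg.norm(B, axis=1)
probs = col_norms * row_norms
probs /= probs.sum()

s = 2000                               # number of rank-1 samples
approx = np.zeros((m, p))
for j in rng.choice(n, size=s, p=probs):
    # scale each sample by 1/(s * p_j) so that E[approx] = A @ B
    approx += np.outer(A[:, j], B[j, :]) / (s * probs[j])

err = np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B)
print(f"relative error with {s} samples: {err:.3f}")
```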
Low rank changes in A and its inverse #
When we perturb a matrix by a small (low-rank) change, what is the inverse of the perturbed matrix?
\[ (I - uv^T)^{-1} = I + \frac{uv^T}{1-v^Tu} \]
If I change a matrix by a rank-1 update, the inverse also changes by a rank-1 update.
Check:
\( (I - uv^T)(I + \frac{uv^T}{1-v^Tu}) = I - uv^T + \frac{(I-uv^T)uv^T}{1-v^Tu} \)
\( (I -uv^T)uv^T = uv^T - u(v^Tu)v^T = (1- v^Tu)\,uv^T \), so the product collapses to \( I - uv^T + uv^T = I \)
And this applies when we change the vectors \( u, v\) to matrices \( U, V\), or change \(I\) to any matrix \( A \),
because we can view \( UV^T \) as a sum of rank-1 matrices.
\[(I_n - UV^T)^{-1} = I_n + U(I_k - V^TU)^{-1}V^T \]
Sherman-Morrison-Woodbury (S-M-W) #
When we have \(A\) instead of \(I\):
\[(A - UV^T)^{-1} = A^{-1} + A^{-1}U (I_k - V^TA^{-1}U)^{-1} V^TA^{-1} \]
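A quick numerical sanity check of this identity (a sketch; the shift by \(3I\) just keeps the random \(A\) comfortably invertible):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
A = rng.standard_normal((n, n)) + 3 * np.eye(n)   # keep A well conditioned
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A - U @ V.T)
rhs = Ainv + Ainv @ U @ np.linalg.inv(np.eye(k) - V.T @ Ainv @ U) @ V.T @ Ainv
print(np.allclose(lhs, rhs))   # True
```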
Usage #
Recursive Least squares #
We have \( A\mathbf{x} = \mathbf{b} \), and solve the normal equations \( A^TA\mathbf{\hat{x}} = A^T\mathbf{b} \).
We get a new measurement: \( \begin{bmatrix} A \\ v^T \end{bmatrix} \mathbf{x} = \begin{bmatrix} \mathbf{b} \\ b_n \end{bmatrix} \), with new normal equations \( \begin{bmatrix} A^T & v \end{bmatrix} \begin{bmatrix} A \\ v^T \end{bmatrix} \hat{x}_{new} = \begin{bmatrix} A^T & v \end{bmatrix} \begin{bmatrix} \mathbf{b} \\ b_n \end{bmatrix} \), i.e.
\[ (A^TA + vv^T) \hat{x}_{new} = A^T\mathbf{b} + vb_n \]
So we are perturbing \( A^TA \) with a rank-1 matrix, and we don't want to recompute the inverse of \( A^TA \) from scratch with each new data point; Sherman-Morrison gives the update cheaply.
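A minimal sketch of the update (my own variable names): keep \(P = (A^TA)^{-1}\), apply Sherman-Morrison when the new row \(v^T\) arrives, and compare with re-solving the stacked problem from scratch:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

P = np.linalg.inv(A.T @ A)                 # (A^T A)^{-1} from the old data
x_hat = P @ A.T @ b                        # old least-squares solution

# new measurement: row v^T with value b_n
v = rng.standard_normal(n)
b_n = rng.standard_normal()

# Sherman-Morrison update of (A^T A + v v^T)^{-1}, no re-inversion of A^T A
P_new = P - np.outer(P @ v, v @ P) / (1 + v @ P @ v)
x_new = P_new @ (A.T @ b + v * b_n)

# check against re-solving the stacked problem from scratch
A_big = np.vstack([A, v])
b_big = np.append(b, b_n)
x_direct = np.linalg.lstsq(A_big, b_big, rcond=None)[0]
print(np.allclose(x_new, x_direct))        # True
```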
Kalman filter
The Kalman filter uses two additional concepts on top of recursive least squares:
- weighted least square (covariance)
- state equation
Solve \( (A - uv^T)x = b \) quickly #
Suppose \( Aw = b \) is solved for \( w\).
We don't want to recompute the inverse of the new matrix.
We solve \( Az = u \) (\(z\) can be computed quickly because we already factored \(A\) when solving \( Aw = b\)).
Now we use Sherman-Morrison formula to compute \(x\)
\( x = w + \frac{(v^Tw)\,z}{1-v^Tz}\)
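A sketch using scipy's LU factorization so both solves reuse the same factorization of \(A\) (the data here is made up):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)) + 3 * np.eye(n)
u, v, b = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)

lu = lu_factor(A)              # factor A once
w = lu_solve(lu, b)            # Aw = b
z = lu_solve(lu, u)            # Az = u, cheap: reuses the factorization

x = w + (v @ w) * z / (1 - v @ z)                  # Sherman-Morrison
print(np.allclose((A - np.outer(u, v)) @ x, b))    # True
```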
\( \frac{dA}{dt}\) when matrix \(A(t) \) depends on t #
given \( \frac{dA}{dt} \), find \( \frac{dA^{-1}}{dt} \) #
\[B^{-1} - A^{-1} = B^{-1}(A-B)A^{-1} \]
Setting \( B = A + \Delta A \) (so \( \Delta A^{-1} = (A + \Delta A)^{-1} - A^{-1} \)): \(\Delta A^{-1} = (A + \Delta A)^{-1} (-\Delta A) A^{-1} \)
\( \frac{\Delta A^{-1}}{\Delta t} = (A + \Delta A)^{-1} \frac{-\Delta A}{\Delta t} A^{-1} \)
Letting \( \Delta t \to 0 \): \( \frac{dA^{-1}}{dt} = -A^{-1} \frac{dA}{dt} A^{-1} \)
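A finite-difference check of this formula with a made-up \(A(0)\) and \(dA/dt\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A0 = rng.standard_normal((n, n)) + 3 * np.eye(n)   # well-conditioned A(0)
A1 = rng.standard_normal((n, n))                   # dA/dt

formula = -np.linalg.inv(A0) @ A1 @ np.linalg.inv(A0)

h = 1e-6
finite_diff = (np.linalg.inv(A0 + h * A1) - np.linalg.inv(A0)) / h
print(np.allclose(formula, finite_diff, atol=1e-4))   # True
```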
\( \frac{d\lambda}{dt} \) #
We are given: \( AX = X\Lambda \), \( Y^TA = \Lambda Y^T \)
We also make an assumption on the scaling of the eigenvectors to simplify the calculation: normalize so that \( y^Tx = 1\). This choice doesn't affect the generality of the result, because any scalar multiple of an eigenvector is still an eigenvector.
\[y^T(t) A(t) x(t) = \lambda(t) y^T(t) x(t) = \lambda(t) \]
Left and right eigenvectors have the same eigenvalues #
\[f(Av) = (A^Tf)(v) \]
If \(A\) maps a vector \(v\) to \(Av\), then \(A^T\) maps a functional \(f \in V^* \) so that \( f(Av) = (A^Tf)(v) \).
Let \( f(x) = u^T x \)
\( (A^T f)(v) = f(Av) = (u^TA)v = u^T(\lambda v) = \lambda\, u^Tv\), using \( Av = \lambda v \)
From \( (u^TA)v = \lambda u^Tv \), \( u^TA = \lambda u^T\), or equivalently \( A^Tu = \lambda u\)
I.e. the left eigenvector has the same eigenvalue.
Compute \( \frac{d\lambda}{dt} \) #
From \( y^T(t) A(t) x(t) = \lambda(t) y^T(t) x(t) = \lambda(t) \)
Take the derivative of both sides: \[ \frac{d \lambda}{dt} = \frac{dy^T}{dt} A x(t) + y^T(t) \frac{dA}{dt} x(t) + y^T(t)A\frac{dx}{dt} = y^T(t) \frac{dA}{dt} x(t) \]
Because we assumed \( y^Tx = 1\), it follows \(\frac{dy^T}{dt} A x(t) + y^T(t)A\frac{dx}{dt} = \lambda(t)\frac{dy^T}{dt} x(t) + \lambda y^T(t)\frac{dx}{dt} = \lambda(t) \frac{d y^Tx}{dt} = 0 \)
\( \frac{d \lambda}{dt} = y^T(t) \frac{dA}{dt} x(t) \): the derivative doesn't involve the derivatives of \(y\) or \(x\).
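A numerical check (my own construction): build \(A_0\) with distinct real eigenvalues, take the rows of \(X^{-1}\) as the left eigenvectors (which gives \(y_i^Tx_i = 1\) automatically), and compare the formula against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
X0 = rng.standard_normal((n, n))
A0 = X0 @ np.diag([1.0, 2.0, 3.0, 4.0]) @ np.linalg.inv(X0)   # distinct real eigenvalues
A1 = rng.standard_normal((n, n))                               # dA/dt

lam, X = np.linalg.eig(A0)         # right eigenvectors in the columns of X
Y_T = np.linalg.inv(X)             # rows are left eigenvectors, normalized so y_i^T x_i = 1

# formula: d(lambda_i)/dt = y_i^T (dA/dt) x_i
dlam = np.array([Y_T[i] @ A1 @ X[:, i] for i in range(n)])

# finite-difference check, matching each perturbed eigenvalue to the nearest original one
h = 1e-6
lam_h = np.linalg.eig(A0 + h * A1)[0]
dlam_fd = np.array([(lam_h[np.argmin(abs(lam_h - lam[i]))] - lam[i]) / h for i in range(n)])
print(np.allclose(dlam, dlam_fd, atol=1e-3))   # True (up to finite-difference error)
```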
\( \lambda_j (S + uu^T) \) #
What happens to the eigenvalues of \(S\) when you add a positive rank-1 matrix?
The positiveness of \( uu^T \) #
\( uu^T \) is a rank-1 matrix, and its only eigenvector with a nonzero eigenvalue is \( u\):
\( (uu^T) u = \lambda u \), and \( \lambda = u^Tu \gt 0 \)
Adding positive #
Adding positive \( uu^T \) to \( S \) makes it more positive.
I.e. the eigenvalues go up.
So the interpretation of a matrix as a sum of rank-1 matrices is a fundamental view. I could see this, but can't remember it now.
\( \frac{d\sigma}{dt} \) #
The singular value (not the eigenvalue).
\( \sigma(t) = u^TAv\) where \( u^Tu = 1\) and \( v^Tv = 1\)
\( \frac{d\sigma}{dt} = \frac{du^T}{dt}Av + u^T\frac{dA}{dt}v + u^TA\frac{dv}{dt} \)
\( \frac{du^T}{dt}Av = \frac{du^T}{dt} \sigma u = 0 \), because \( u^Tu = 1\). Similarly \( u^TA\frac{dv}{dt} = \sigma v^T \frac{dv}{dt} = 0 \), because \( v^Tv = 1\). So \( \frac{d\sigma}{dt} = u^T\frac{dA}{dt}v \).
Interlacing #
When we perturb a symmetric matrix \(S\) with a positive rank-1 matrix, each eigenvalue goes up but doesn't go past the next eigenvalue above it. With \(\mu_i\) the eigenvalues of the perturbed matrix and \(\lambda_i\) those of \(S\):
\[\mu_1 \ge \lambda_1 \ge \mu_2 \ge \lambda_2 \ge \dots \]
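A numerical check of the interlacing with a random symmetric \(S\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
B = rng.standard_normal((n, n))
S = (B + B.T) / 2                     # random symmetric matrix
u = rng.standard_normal(n)

lam = np.linalg.eigvalsh(S)[::-1]                     # eigenvalues of S, descending
mu = np.linalg.eigvalsh(S + np.outer(u, u))[::-1]     # eigenvalues of S + uu^T

# mu_1 >= lambda_1 >= mu_2 >= lambda_2 >= ...
for i in range(n):
    assert mu[i] >= lam[i] - 1e-12
    if i + 1 < n:
        assert lam[i] >= mu[i + 1] - 1e-12
print("interlacing holds")
```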
Weyl’s inequality for symmetric S, T #
\( \lambda_{i+j-1}(S + T) \le \lambda_i(S) + \lambda_j(T) \) (I haven't proved this)
Why are so many matrices of low rank in nature? #
Low rank #
We decompose an \(n \times n\) matrix of rank \(k\) into \(k\) rank-1 matrices.
\[A = u_1v_1^T + \dots + u_kv_k^T \]
Each rank-1 matrix has \( 2n \) numbers. We can send \( 2 n k \) numbers instead of \( n^2 \) in order to send \(A\)
If \( 2nk < n^2 \) (i.e. \( k < n/2 \)), we say it's low rank.
Diagonal matrices are high rank #
Of course: the eigenvalues are the diagonal entries, so a diagonal matrix with nonzero entries has full rank.
Numerical rank #
We define \( rank_\epsilon(X) \) for \( 0 < \epsilon < 1 \).
While \( rank_0(X) = rank(X) \) allows no tolerance, \( rank_\epsilon(X) \) ignores singular values that are small relative to the largest singular value \( \sigma_1 \):
\( rank_\epsilon(X) \) counts the singular values bigger than \( \epsilon \sigma_1 \).
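A minimal sketch of \( rank_\epsilon \) using the SVD, applied to a Hilbert matrix (mathematically full rank, numerically low rank):

```python
import numpy as np

def numerical_rank(X, eps):
    """rank_eps(X): count singular values larger than eps * sigma_1."""
    s = np.linalg.svd(X, compute_uv=False)
    return int(np.sum(s > eps * s[0]))

n = 20
i = np.arange(1, n + 1)
H = 1.0 / (i[:, None] + i[None, :] - 1)     # Hilbert matrix H_{jk} = 1/(j + k - 1)
print(numerical_rank(H, 1e-3), numerical_rank(H, 1e-10))   # far below n = 20
```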
Matrices of low numerical rank #
They are hard to invert (because the small singular values blow up in the inverse matrix).
- Hilbert matrix
- Vandermonde matrix
Ex. \( X_{jk} = p(j,k) = 1 + j + jk \)
\( X = \begin{bmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \end{bmatrix} + \begin{bmatrix} 1 & 1 & \cdots & 1 \\ \vdots & & & \vdots \\ n & n & \cdots & n \end{bmatrix} + \begin{bmatrix} 1 & 2 & \cdots & n \\ \vdots & & & \vdots \\ n & 2n & \cdots & n^2 \end{bmatrix} \), a sum of rank-1 matrices.
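A quick check that this polynomial-entry matrix has very low rank (at most 3 from the decomposition above; in fact 2, since \( 1 + j + jk = 1\cdot 1 + j\,(1+k) \)):

```python
import numpy as np

n = 10
j = np.arange(1, n + 1)
X = 1 + j[:, None] + np.outer(j, j)      # X_{jk} = 1 + j + j*k
print(np.linalg.matrix_rank(X))          # 2
```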
The world is Sylvester
A hypothesis for why matrices are low rank:
maybe because many matrices satisfy a Sylvester equation \( AX - XB = C \) for some \( A, B, C \).
With suitable \( A, B, C \), we can find an upper bound for the Hilbert matrix much closer to the real bound of the matrix. (I can't even state the main result clearly.)
Saddle point #
\( S = \begin{bmatrix}5 \\ & 3 \\ & & 1 \end{bmatrix}, x = \begin{bmatrix}u \\ v \\ w \end{bmatrix} \)
\( R(x) = \frac{x^TSx}{x^Tx} = \frac{ 5u^2 + 3v^2 + w^2 }{u^2 + v^2 + w^2} \)
Then, we can see that
\( \max R = 5 \) at \( (1, 0, 0) \), and \( \min R = 1 \) at \( (0, 0, 1) \) (the eigenvalues and the corresponding eigenvectors).
How about saddle of R? #
The saddle is at a maximum of minima:
\( \lambda_2 = \max_{\text{2d subspaces } V}\ \min_{x \in V} \frac{x^TSx}{x^Tx} \)
Min: for each possible 2d subspace, find the minimum of \(R\) over that subspace.
Max: take the maximum of all those minimums.
So a saddle point is a minimum in one direction and a maximum in another direction.
QUESTION
Can I take this as the definition of a saddle point?
QUESTION
I've seen this in game theory; is the min-max strategy in game theory related to saddle points?
Probability Inequalities #
Markov inequality #
When \( x \ge 0 \), \( \Pr[x \ge a] \le \frac{E[x]}{a} \)
For example, \( \Pr(x \ge 3) \le \frac{2}{3} \) when \(x\) takes values in \( \{1, 2, 3, 4, 5\} \) with mean 2:
\( 1p_1 + 2p_2 + 3p_3 + 4p_4 + 5p_5 = 1p_1 + 2p_2 + 3(p_3 + p_4 + p_5) + p_4 + 2p_5 = 2\)
Therefore \( p_3 + p_4 + p_5 \le \frac{2}{3} \)
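A tiny Monte Carlo check with one made-up distribution on \(\{1,\ldots,5\}\) with mean 2:

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([1, 2, 3, 4, 5])
probs = np.array([0.45, 0.3, 0.1, 0.1, 0.05])    # mean = sum(values * probs) = 2.0
samples = rng.choice(values, size=100_000, p=probs)

print(samples.mean())                    # ~2
print((samples >= 3).mean(), 2 / 3)      # empirical Pr(x >= 3) stays below E[x]/3
```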
Chebyshev inequality #
(no assumption that \( x \ge 0 \))
\[Pr(| x-m| \ge a) \le \frac{\sigma^2}{a^2} \]
Let \( y = |x-m|^2 \) and apply Markov to \(y\).
\( \bar{y} = \sum p_i y_i = \sum p_i |x_i-m|^2 = \sigma^2 \)
\( \Pr( |x-m| \ge a) = \Pr(|x-m|^2 \ge a^2) = \Pr(y \ge a^2) \le \frac{E(y)}{a^2} = \frac{\sigma^2}{a^2} \)
Covariance matrix #
\[ V = E\left[\begin{array}{ccc} \left[\begin{array}{c} X_1 - \mu_1 \\ \vdots \\ X_n - \mu_n \\ \end{array}\right] \left[\begin{array}{ccc} X_1 - \mu_1 & \cdots & X_n - \mu_n \end{array}\right] \end{array}\right] \]
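A sketch estimating \(V\) from samples as an average of outer products (the distribution here is made up), compared with numpy's built-in covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 1, 2],
                            cov=[[2, 1, 0], [1, 2, 0], [0, 0, 1]], size=1000)

mu = X.mean(axis=0)
# V = E[(X - mu)(X - mu)^T], estimated as the average of outer products
V = sum(np.outer(x - mu, x - mu) for x in X) / len(X)
print(np.allclose(V, np.cov(X.T, bias=True)))   # True
```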
When \(A\) has independent columns, \( A^T A\) is invertible #
null space #
When \( A \) has independent columns, \( A \) has a trivial null space (only the zero vector).
So for any nonzero \(x\), \( Ax\) is nonzero, and \( A^T(Ax) \) is also nonzero because \( x^TA^TAx = \Vert Ax \Vert^2 > 0 \).
Therefore \( A^TAx \neq 0 \) for every nonzero \(x\): \( A^TA \) has a trivial null space, hence it is invertible.
Another way (hackish?) #
Think of isomorphism
Calculus #
Gradient #
When \(f\) is a function of the form \( f(x_1, x_2, \ldots, x_n) \):
\[\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right) \]
\[ df = \nabla f \cdot d\mathbf{x} \]
So the gradient says: the change in \(f\) due to a change in \(\mathbf{x}\) is a linear combination of the changes in the \(x_i\) (in the limit).
Jacobian #
Scalar valued function f #
When \(f\) is of the form \( f(u_1, u_2, \ldots, u_n) \), where each \(u_i \) is a function of \( (x_1, x_2, \ldots, x_n) \),
the Jacobian is the collection of the gradients of the component functions \( u_i \).
When \(f\) is \( f(u_1) \), where \( u_1 \) is a function of \( (x_1, x_2, \ldots, x_n) \),
\( \frac{df(u_1(\mathbf{x}))}{d \mathbf{x}} = \frac{d f}{d u_1} \frac{du_1}{d \mathbf{x}} = \frac{d f}{d u_1} \nabla u_1 \)
Now, when \(f\) is of the form \( f(u_1, u_2, \ldots, u_n) \), where each \(u_i \) is a function of \( (x_1, x_2, \ldots, x_n) \):
\[ \begin{bmatrix} d u_1 \\ d u_2 \\ \vdots \\ d u_n \\ \end{bmatrix} = J(f) \begin{bmatrix} d x_1 \\ d x_2 \\ \vdots \\ d x_n \\ \end{bmatrix} \]
where the first row is \( J(f)_1 = \nabla u_1 \):
\[ J(\mathbf{f}) = \begin{bmatrix} \frac{\partial u_1}{\partial x_1} & \frac{\partial u_1}{\partial x_2} & \cdots & \frac{\partial u_1}{\partial x_n} \\ \frac{\partial u_2}{\partial x_1} & \frac{\partial u_2}{\partial x_2} & \cdots & \frac{\partial u_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial u_n}{\partial x_1} & \frac{\partial u_n}{\partial x_2} & \cdots & \frac{\partial u_n}{\partial x_n} \end{bmatrix} \]
\( df = \nabla f \cdot d\mathbf{x} \) would become:
\( df = \nabla_{\mathbf{u}} f \cdot d\mathbf{u} = \nabla_{\mathbf{u}} f \cdot J(f) \cdot d \mathbf{x} \), where \( \nabla_{\mathbf{u}} f \) is the gradient with respect to \( \mathbf{u} \).
Vector valued function f #
When \(f\) is of the form \( f = (f_1, f_2, \ldots, f_n) \), where each \(f_i\) is a function of \( (x_1, x_2, \ldots, x_n) \),
the Jacobian collects the gradient of each \( f_i \):
\[\begin{bmatrix} d f_1 \\ \vdots \\ d f_n \\ \end{bmatrix} = \begin{bmatrix} \nabla f_1 \cdot d \mathbf{x} \\ \vdots \\ \nabla f_n \cdot d \mathbf{x} \\ \end{bmatrix} = \begin{bmatrix} \nabla f_1 \\ \vdots \\ \nabla f_n \\ \end{bmatrix} \begin{bmatrix} d x_1 \\ \vdots \\ d x_n \\ \end{bmatrix} = J(f) \cdot \begin{bmatrix} d x_1 \\ \vdots \\ d x_n \\ \end{bmatrix} \]
Or,
\( f(x + \Delta x) \approx f(x) + J(f)\, \Delta x \)
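A small check of this linearization with a made-up \(f : \mathbb{R}^2 \to \mathbb{R}^2\) and its hand-computed Jacobian:

```python
import numpy as np

def f(x):
    # example vector-valued function R^2 -> R^2 (hypothetical)
    return np.array([x[0] ** 2 + x[1], np.sin(x[0]) * x[1]])

def jacobian(x):
    # J_ij = d f_i / d x_j
    return np.array([[2 * x[0], 1.0],
                     [np.cos(x[0]) * x[1], np.sin(x[0])]])

x = np.array([1.0, 2.0])
dx = np.array([1e-3, -2e-3])
lin = f(x) + jacobian(x) @ dx             # f(x + dx) ~ f(x) + J dx
print(np.max(np.abs(f(x + dx) - lin)))    # tiny (second order in dx)
```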
In integrals #
\[ \iint_{R_{uv}} f(u, v) \, dA_{uv} = \iint_{R_{xy}} f(g(x, y), h(x, y)) \left| J \right| \, dx \, dy \]
This comes from the previous relation:
\[ \begin{bmatrix} d u_1 \\ d u_2 \\ \vdots \\ d u_n \\ \end{bmatrix} = J(f) \begin{bmatrix} d x_1 \\ d x_2 \\ \vdots \\ d x_n \\ \end{bmatrix} \]
When \( u, v \) and \( x, y \) are linearly related, areas scale by the determinant of the matrix. (proof?)
Hessian #
The first derivative of a multivariable \(f\) is the gradient.
The second derivative is the derivative of the first derivative:
\[ \begin{aligned} H(f) = J(\nabla f) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \end{aligned} \]
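A sketch computing \(H\) as the Jacobian of the gradient by central differences, for a made-up \(f(x, y) = x^2 y + y^3\):

```python
import numpy as np

def grad_f(x):
    # gradient of f(x, y) = x**2 * y + y**3 (a hypothetical example)
    return np.array([2 * x[0] * x[1], x[0] ** 2 + 3 * x[1] ** 2])

def hessian_fd(x, h=1e-5):
    # Hessian as the Jacobian of the gradient, by central differences
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        H[:, j] = (grad_f(x + e) - grad_f(x - e)) / (2 * h)
    return H

x = np.array([1.0, 2.0])
print(hessian_fd(x))    # close to [[2y, 2x], [2x, 6y]] = [[4, 2], [2, 12]]
```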
Taylor series #
Taylor approximation of a function \(F\):
\( F(\mathbf{x} + \Delta \mathbf{x}) \approx F(\mathbf{x}) + (\Delta \mathbf{x})^T \nabla F(\mathbf{x}) + \frac{1}{2} (\Delta \mathbf{x})^T H (\Delta \mathbf{x})\)
Question
Why do we have the quadratic form that way?
We start from \( \Delta w \approx \Delta w_{tan} = w_x \Delta x + w_y \Delta y \)
\[w_s = \frac{\Delta w_{tan}}{\Delta s} = w_x \cos\phi + w_y \sin\phi \]
Now we want to look at the rate of change of \( w_s \):
\[\begin{aligned} \Delta w_s &= \frac{ \partial w_s}{\partial x} \Delta x + \frac{\partial w_s}{\partial y} \Delta y \\ &= \left(\frac{\partial^2w}{\partial x^2} \cos\phi + \frac{\partial^2w}{\partial y \partial x} \sin\phi \right) \Delta x + \left(\frac{\partial^2w}{\partial x \partial y} \cos\phi + \frac{\partial^2w}{\partial y^2} \sin\phi \right) \Delta y \\ &= \begin{bmatrix} \Delta x & \Delta y \end{bmatrix} H \begin{bmatrix} \cos\phi \\ \sin\phi \end{bmatrix} \end{aligned} \]
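A numerical check of the second-order Taylor approximation with a made-up smooth \(F\):

```python
import numpy as np

def F(x):
    return np.exp(x[0]) * np.sin(x[1])          # hypothetical smooth F

def grad_F(x):
    return np.array([np.exp(x[0]) * np.sin(x[1]), np.exp(x[0]) * np.cos(x[1])])

def hess_F(x):
    return np.array([[np.exp(x[0]) * np.sin(x[1]),  np.exp(x[0]) * np.cos(x[1])],
                     [np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])]])

x = np.array([0.3, 0.7])
dx = np.array([1e-2, -2e-2])
taylor2 = F(x) + dx @ grad_F(x) + 0.5 * dx @ hess_F(x) @ dx
print(abs(F(x + dx) - taylor2))    # O(|dx|^3), much smaller than |dx|
```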
Optimization #
Optimization #
Minimize a scalar \( F\): find \( \mathbf{x} \) where \(\nabla F(\mathbf{x}) = 0 \)
Find solution #
Find \( \mathbf{x} \) where a vector-valued function \( f(\mathbf{x}) = 0 \).
It is analogous to optimization when \( f = \nabla F \).
Newton’s method #
We try to find \( x^* \) where \( f(x^*) = 0 \)
Starting from \( x_1 \), we hope to find \( x_2 \) closer to \( x^* \) by solving:
\( 0 = f(x_1) + J(f)_{x_1} (x_2-x_1) \)
\( x_{k+1} = x_{k} - J(f)_{x_k}^{-1} f(x_k) \)
It might converge or it might not.
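A minimal sketch of Newton's method on a made-up two-variable system (a circle intersected with a parabola):

```python
import numpy as np

def f(x):
    # hypothetical system: x^2 + y^2 = 4 and y = x^2
    return np.array([x[0] ** 2 + x[1] ** 2 - 4, x[1] - x[0] ** 2])

def J(x):
    return np.array([[2 * x[0], 2 * x[1]],
                     [-2 * x[0], 1.0]])

x = np.array([1.0, 1.0])                     # starting guess
for _ in range(10):
    x = x - np.linalg.solve(J(x), f(x))      # x_{k+1} = x_k - J^{-1} f(x_k)
print(x, f(x))                               # f(x) ~ 0 if it converged
```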
Optimization #
Gradient descent #
\( x_{k+1} = x_k - s_k \nabla F(x_k) \), where \( s_k \) is the learning rate (step size).
Newton’s method #
\[x_{k+1} = x_k - H^{-1}(x_k) \nabla F(x_k) \]
Why does \( H \) appear there?
We are trying to find \( \mathbf{x} \) where \( \nabla F(\mathbf{x}) = 0 \)
We find \( \nabla F(\mathbf{x}) = 0\) by Newton's method, and since \(H(F) = J(\nabla F)\), we get the equation above.
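A sketch comparing the two updates on a simple quadratic (my own example), where the Hessian is constant and a single Newton step lands on the minimizer:

```python
import numpy as np

# minimize F(x) = 0.5 x^T S x - b^T x, so grad F = S x - b and H = S
S = np.array([[5.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
grad = lambda x: S @ x - b

x_gd = np.zeros(2)
for _ in range(100):
    x_gd = x_gd - 0.1 * grad(x_gd)              # gradient descent, fixed step s_k = 0.1

x_nt = np.zeros(2)
x_nt = x_nt - np.linalg.solve(S, grad(x_nt))    # Newton: one step is exact for a quadratic

print(x_nt, x_gd, np.linalg.solve(S, b))        # both approach the minimizer S^{-1} b
```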
Convex #
Convex set K #
Any line segment between points \(x_1\) and \( x_2\) in \(K\) stays in \(K\).
Intersection of convex sets is convex #
Convex function F #
The points on and above the graph form a convex set.
How to test:
\(F(x)\) case: \( \frac{d^2F}{dx^2} \ge 0 \)
\( F(\mathbf{x}) \) case: the Hessian \( H(F) \) is positive semidefinite.
QUESTION
why?