Matrix methods in data analysis (part2)

Matrix methods in data analysis (part2)

November 28, 2023

Matrix methods #

This is my note taking for https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018/pages/readings/

lecture 12-21

These are practical applications, but hard to remember.. without actually applying it somewhere..

Randomized Matrix multiplication #

When matrix is huge, multiplying them is prohibitive, so we take samples (sum rank-1 multiplication) with good property.

Example #

Simulating a matrix by random sampling.

Suppose \(A = \begin{bmatrix} a & b \end{bmatrix} \)

We construct a matrix \(\hat{A}\) by randomly selecting two columns with probability \( \frac{1}{2} \)

\[ \begin{aligned} X_1 = \begin{cases} [a, 0] & \text{with probability } \frac{1}{2} \\ [0, b] & \text{with probability } \frac{1}{2} \\ \end{cases} \end{aligned} \]

\( E(X_1) = \frac{1}{2} [a,0] + \frac{1}{2} [0, b] \)

\( X = X_1 + X_2\)

\( E(X) = [a, b]\)

Likewise, \( Var(X_1) = \frac{1}{2}([a,0] - \frac{1}{2}[a,b])^2 + \frac{1}{2}([0,b] - \frac{1}{2}[a,b])^2 = [\frac{a^2}{4}, \frac{b^2}{4}] \)

\(Var(X) = \frac{1}{2}[a^2, b^2] \)

Sample with good property means: #

  • \(E(\hat{A}) = E(A) \)
  • keep \( Var(\hat{A}) \) low

Strategy #

  1. assign bigger probability for bigger column
  2. mix vectors to make them more similar to each other

Solution for strategy #1 #

  • Problem

    We are given \( A, B\), and take sample \( a_j, b_j\) and multiply to get a rank-1 matrix. And we take multiple samples.

    How should we pick the \( a_j, b_j \) so that our variance is the minimum? What probability should we assign to picking \( a_j, b_j \) ?

  • conceptual strategy

    (I couldn’t follow the exact argument)

    Find a strategy which satisfies \( E(\hat{A}) = A \).

    Find the expression for variance and find \( p_j \) for selecting \( a_j, b_j \) which minimizes the variance. With a constraint \( \sum p_j = 1\)

    So this is miminization problem with a constraint, a lagrange multiplier.

  • Solution

    It turns out \( P_j = \frac{\Vert a_j \Vert \Vert b_j^T \Vert}{C} \) where C is the normalizing constant minimizes the variance while keeping the expectation intact.

Low rank changes in A and its inverse #

When we perturb a matrix by small amount, what’s the inverse of the perturbed matrix?

\[ (I - uv^T)^{-1} = I + \frac{uv^T}{1-v^Tu} \]

If i change a matrix by rank-1, I change the inverse matrix by rank-1.

Check:

\( (I - uv^T)(I + \frac{uv^T}{1-v^Tu}) = I - uv^T + \frac{(I-uv^T)uv^T}{1-v^Tu} \)

\( (I -uv^T)uv^T = uv^T - u(v^Tu)v^T = uv^T(1- v^Tu) \)

And this applies when we change the vector \( u, v\) to matrix \( U, V\), or change \(I\) to any matrix \( A \)

Because we can think view \( UV^T \) as the sum of rank-1.

\[(I_n - UV^T)^{-1} = I_n + U(I_k - V^TU)^{-1}V^T \]

S-M-W #

When we have \(A\) instead of \(I\):

\[(A + UV^T)^{-1} = A^{-1} - A^{-1}U (I - V^TA^{-1}U)^{-1} V^TA^{-1} \]

Usage #

Recursive Least squares #

We have a \( A\mathbf{x} = \mathbf{b} \), and try to find \( A^TA\mathbf{\hat{x}} = A^T\mathbf{b} \)

We have a new measurement, \( \begin{bmatrix} A \\ v^T \end{bmatrix} \begin{bmatrix} \mathbf{x} \end{bmatrix} = \begin{bmatrix} \mathbf{b} \\ b_n \end{bmatrix} \), and new normal \( \begin{bmatrix} A^T & v \end{bmatrix} \begin{bmatrix} A \\ v^T \end{bmatrix} \)

\[ \hat{x}_{new} = \begin{bmatrix} A^T & v \end{bmatrix} \begin{bmatrix} \mathbf{b} \\ b_n \\ \end{bmatrix} \]

\( (A^TA + vv^T) \hat{x}_{new} = A^T\mathbf{b} + vb_n\)

So we are perturbing the \( A^TA \) with rank-1 matrix. Here we don’t wanna recompute the inverse of the \( A^TA\) with the new data.

  • Kalman filter

    Kalman filter uses two additional concept on top of the recursive least square.

    • weighted least square (covariance)
    • state equation

Solve \( (A - uv^T)x = b \) quickly #

Suppose \( Aw = b \) is solved for \( w\).

We don’t wanna recompute the inverse of the new matrix.

We use \( Az = u \) (z can be computed fast because we have already solved the \( Aw = b\))

Now we use Sherman-Morrison formula to compute \(x\)

\( x = w + \frac{v^Twz}{1-v^Tz}\)

\( \frac{dA}{dt}\) when matrix \(A(t) \) depends on t #

given \( \frac{dA}{dt} \), find \( \frac{dA^{-1}}{dt} \) #

\[B^{-1} - A^{-1} = B^{-1}(A-B)A^{-1} \]

\(\Delta A^{-1} = (A + \Delta A)^{-1} (-\Delta A) A^{-1} \)

\( \frac{\Delta A^{-1}}{\Delta t} = (A + \Delta A)^{-1} \frac{-\Delta A}{\Delta t} A^{-1} \)

\( \frac{dA^{-1}}{dt} = -A^{-1} \frac{dA}{dt} A^{-1} \)

\( \frac{d\lambda}{dt} \) #

We are given: \( AX = X\Lambda \), \( Y^TA = \Lambda Y^T \)

And we have an assumption on the length of eigen vectors to make calculation easier. We normalize the vectors by \( y^Tx = 1\). This choice doesn’t affect the generality of the result because any scalar multiple of an eigenvector is also an eigenvector.

\[y^T(t) A(t) x(t) = \lambda(t) y^T(t) x(t) = \lambda(t) \]

Left and right eigen vector has the same eigen values #

\[f(Av) = (A^Tf)(v) \]

If \(A\) maps a vector v to \(Av\), then \(A^T\) maps a functional \(f \in V^* \) in a way that \( f(Av) = (A^Tf)(v) \)

Let \( f(x) = u^T x \)

\( (A^T f)(v) = (A^Tu^T)v = u^T(Av) = u^T(\lambda v) = \lambda u^Tv\)

From \( (A^Tu^T)v = \lambda u^Tv \), \( A^Tu^T = \lambda u^T\), or equivalently \( uA^T = \lambda u\)

I.e. Left eigenvector has the same eigen value.

Compute \( \frac{d\lambda}{dt} \) #

From \( y^T(t) A(t) x(t) = \lambda(t) y^T(t) x(t) = \lambda(t) \)

Take derivative on both side: \[ \frac{d \lambda}{dt} = \frac{dy^T}{dt} A x(t) + y^T(t) \frac{dA}{dt} x(t) + y^T(t)A\frac{dx}{dt} = y^T(t) \frac{dA}{dt} x(t) \]

Because we assumed \( y^Tx = 1\), it follows \(\frac{dy^T}{dt} A x(t) + y^T(t)A\frac{dx}{dt} = \lambda(t)\frac{dy^T}{dt} x(t) + \lambda y^T(t)\frac{dx}{dt} = \lambda(t) \frac{d y^Tx}{dt} = 0 \)

\( \frac{d \lambda}{dt} = y^T(t) \frac{dA}{dt} x(t) \): The derivative doesn’t involve derivative of y or x.

\( \lambda_j (S + uu^T) \) #

What will happen to the eigenvalues of \(S\) when you add a positive rank-1 matrix.

The positiveness of \( uu^T \) #

\( uu^T \) is a rank1 matrix, and the eigen vector of the matrix is \( u\)

\( (uu^T) u = \lambda u \), and \( \lambda = u^Tu \gt 0 \)

Adding positive #

Adding positive \( uu^T \) to \( S \) makes it more positive.

I.e. the eigen value goes up.

So the interpretation of matrix as sum of rank-1 is a fundamental view. I could see this, can’t remember now.

\( \frac{d\sigma}{dt} \) #

Singular value (not the eigen value)

\( \sigma(t) = u^TAv\) where \( u^Tu = 1\)

\( \frac{d\sigma}{dt} = \frac{du^T}{dt}Av + u^T\frac{dA}{dt}v + u^TA\frac{dv}{dt} \)

\( \frac{du^T}{dt}Av = \frac{du^T}{dt} \sigma u = 0 \), because \( u^Tu = 1\)

Interlacing #

When we pertub a matrix with rank-1 matrix, the eigen value goes up by doesn’t go over the previous eigen value.

\[\mu_1 \ge \lambda_1 \ge \mu_2 \ge \lambda_2 … \]

Weyl’s inequality for symmetric S, T #

\( \lambda_{i+j-1} (S + \) ≤ λ_i (s) + \lambd_j(T) \) (I haven’t proved)

Why are so many matrices of low rank in nature? #

Low rank #

We decompose n dimensional matrix of rank k, into k rank-1 matrix.

\[A = u_1v_1^T + … + u_kv_k^T \]

Each rank-1 matrix has \( 2n \) numbers. We can send \( 2 n k \) numbers instead of \( n^2 \) in order to send \(A\)

if \( 2nk < n^2 \), we say it’s row rank

Diagonal matrix are high ranks #

Of course, the eigen values are the diagonal entries.

Numerical rank #

We define \( rank_\epsilon(X) = k\) for \( 0 < \epsilon < 1 \).

Where \( rank_0(X) = rank(X) \), which would allow no tolerance, \(rank_\epsilon(X) \) ignores small eigenvalues relative to the first eigen value \( \sigma_1 \)

\( rank_\epsilon \) count the eigenvalues who are bigger than \( \epsilon \sigma_1 \)

Matrices of numerical low rank #

They are hard to invert. (Because small eigenvalues blow up the inverse matrix)

  • Hilbert matrix
  • Vandermonde matrix
  • Ex. \( X_jk = p(j,k) = 1 + j + jk \)

    \( X = \begin{bmatrix} 1\\ \vdots \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & … & 1 \\ \vdots \\ n & n & … & n \end{bmatrix} \begin{bmatrix} 1 & 2 & … & n \\ \vdots \\ n & 2n & … & n^2 \end{bmatrix} \)

  • The world is Sylvester

    Hypothesis why matrices are low rank.

    Maybe because matrices satisfies \( AX - XB = C \) for some \( A, B, C \)

    With suitable \( A, B, C \) we can find upper bound for Hilbert matrix much closer to the (real?) bound of the matix. (I can’t even tell the main result clearly)

Saddle point #

\( S = \begin{bmatrix}5 \\ & 3 \\ & & 1 \end{bmatrix}, x = \begin{bmatrix}u \\ v \\ w \end{bmatrix} \)

\( R(x) = \frac{x^TSx}{x^Tx} = \frac{ 5u^2 + 3v^2 + w^2 }{u^2 + v^2 + w^2} \)

Then, we can see that

\( max R = 5, \) at \( (1, 0, 0) \) (the eigen value and eigen vector) \( min R = 1, \) at \( (0, 0, 1) \)

How about saddle of R? #

Saddle is at Maximum of minimum

\( \lambda_2 = max min \frac{x^TSx}{x^Tx} \)

Min: Look for all possible 2d subspaces and find the minimum eigen value of the 2d subspace
Max: Look for the maximum of all those minimums

So saddle point is minimum in one direction, and maximum in another direction.

QUESTION Can I take this as the definition of saddle point? #

Probability Inequalities #

Markov inequalities #

When \( x_i \ge 0 \), \( Pr[x \ge a] \le \frac{E[x]}{a} \)

For example, \( Pr(x \ge 3) \le \frac{2}{3} \), when \(x\) is one of \( 1, 2, 3, 4, 5\) and their mean is 2

\( 1p_1 + 2p_2 + 3p_3 + 4p_4 + 5p_5 = 1p_1 + 2p_2 + 3(p_3 + p_4 + p_5) + p_4 + 2p_5 = 2\)

Therefore \( p_3 + p_4 + p_5 \le \frac{2}{3} \)

Chebyshev inequality #

(no asumption \( x_i \ge 0 \) )

\[Pr(| x-m| \ge a) \le \frac{\sigma^2}{a^2} \]

Let \( y = |x-m|^2 \), apply markov to y.

\( \bar{y} = \sum p_i y_i = \sum pi |x-m|^2 = \sigma^2 \)

\( Pr( |x-m| \ge a) = Pr(|x-m|^2 \ge a^2) = pr(Y \ge a^2) \le \frac{E(y)}{a^2} \)

Covariance matrix #

\[ V = E\left[\begin{array}{ccc} \left[\begin{array}{c} X_1 - \mu_1 \\ \vdots \\ X_n - \mu_n \\ \end{array}\right] \left[\begin{array}{ccc} X_1 - \mu_1 & \cdots & X_n - \mu_n \end{array}\right] \end{array}\right] \]

When \(A\) has independent columns, \( A^T A\) is invertible #

null space #

When \( A \) has independent columns, \( A \) has empty null space.

\( Ax\) is non zero. For any non zero \(w\), \( A^T w\) is non zero, because for each basis \(w_i\), \( A^T_{. i} \) is a matrix where \( A^T_{ji} \) is the coefficient of \(w_i \) of \( Tv_j \).

Hence the coefficients can’t be all zero because that would have produced zero \( w\). Therefore \( A^T A \) has empty null space, therefore invertible.

Another way (hackish?) #

Think of isomorphism

Calculus #

Gradient #

When f is a function of form: \( f(x_1, x_2, \ldots, x_n) \)

\[\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right) \]

\[ df = \nabla f \cdot d\mathbf{x} \]

So, gradient says, change in f due to change in x is linear combination of the change in x (in the limit)

Jacobian #

Scalar valued function f #

When f is of form \( f(u_1, u_2, \ldots, u_n) \), where each \(u_i \) is a function of \( (x_1, x2, …, x_n) \),

Jacobian is the collection of gradient for each component function \( u_i \)

When f is \( f(u_1) \), where \( u_1 \) is a function of \( (x_1, x2, …, x_n) \),

\( \frac{df(u_1(\mathbf{x}))}{d \mathbf{x}} = \frac{d f}{d u_1} \frac{du_1}{d \mathbf{x}} = \frac{d f}{d u_1} \nabla u_1 \)

Now, when f is of form \( f(u_1, u_2, \ldots, u_n) \), where each \(u_i \) is a function of \( (x_1, x2, …, x_n) \),

\[ \begin{bmatrix} d u_1 \\ d u_2 \\ \vdots \\ d u_n \\ \end{bmatrix} = J(f) \begin{bmatrix} d x_1 \\ d x_2 \\ \vdots \\ d x_n \\ \end{bmatrix} \]

Where, \( J(f)_1 = \nabla u_1 \)

\[ J(\mathbf{f}) = \begin{bmatrix} \frac{\partial u_1}{\partial x_1} & \frac{\partial u_1}{\partial x_2} & \cdots & \frac{\partial u_1}{\partial x_n} \\ \frac{\partial u_2}{\partial x_1} & \frac{\partial u_2}{\partial x_2} & \cdots & \frac{\partial u_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial u_m}{\partial x_1} & \frac{\partial u_m}{\partial x_2} & \cdots & \frac{\partial u_m}{\partial x_n} \end{bmatrix} \]

\( df = \nabla f \cdot d\mathbf{x} \) would become:

\( df = \nabla f \cdot d\mathbf{u} = \nabla f \cdot J(f) \cdot d \mathbf{x} \)

Vector valued function f #

When f is of form \( f = (f_1, f_2, …, f_n) \), where \(f_i\) is a function of \( (x_1, x_2, …, x_n) \)

Jacobian describes the gradient of each \( f_i \)

\[\begin{bmatrix} d f_1 \\ \vdots \\ d f_n \\ \end{bmatrix} = \begin{bmatrix} \nabla f_1 \cdot d \mathbf{x} \\ \vdots \\ \nabla f_n \cdot d \mathbf{x} \\ \end{bmatrix} = \begin{bmatrix} \nabla f_1 \\ \vdots \\ \nabla f_n \\ \end{bmatrix} \begin{bmatrix} d x_1 \\ \vdots \\ d x_n \\ \end{bmatrix} = J(f) \cdot \begin{bmatrix} d x_1 \\ \vdots \\ d x_n \\ \end{bmatrix} \]

Or,

\( f(x + \Delta x) = f(x) + J(f) \Delta x \)

In integral #

\[ \iint_{R_{uv}} f(u, v) \, dA_{uv} = \iint_{R_{xy}} f(g(x, y), h(x, y)) \left| J \right| \, dx \, dy \]

This comes from the previous

\[ \begin{bmatrix} d u_1 \\ d u_2 \\ \vdots \\ d u_n \\ \end{bmatrix} = J(f) \begin{bmatrix} d x_1 \\ d x_2 \\ \vdots \\ d x_n \\ \end{bmatrix} \]

Where \( u, v \) and \( x, y \) are linear relation, their size scales by the determinant of the matrix. (proof?)

Hessian #

The first derivative of multivariable f is gradient.

The second derivative is the derivative of the first derivative,

\[ \begin{aligned} H(f) = J(\nabla f) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \end{aligned} \]

Taylor series #

Taylor approximation of function F

\( F(\mathbf{x} + \Delta \mathbf{x}) = F(\mathbf{x}) + (\Delta \mathbf{x})^T \nabla F(\mathbf{x}) + \frac{1}{2} (\Delta \mathbf{x})^T H (\Delta \mathbf{x})\)

Question Why do we have the quadratic form that way? #

We start from \( \Delta W \approx \Delta W_{tan} = f_x \Delta x + f_y \Delta y \)

\[W_s = \frac{\Delta W_{tan}}{\Delta s} = f_x cos\phi + f_y sin\phi \]

Now we want to look at the rate of change of \( W_s \)

\[\begin{aligned} \Delta W_s &= \frac{ \partial W_s}{\partial x} \Delta x + \frac{\partial W_s}{\partial y} \Delta y \\ &= (\frac{\partial^2w}{\partial x^2} cos\phi + \frac{\partial^2w}{\partial y \partial x} sin\phi ) \Delta x + (\frac{\partial^2w}{\partial x \partial y} cos\phi + \frac{\partial^2w}{\partial y^2} sin\phi ) \Delta y \\ &= \begin{bmatrix} \Delta x & \Delta y \end{bmatrix} H \begin{bmatrix} cos\phi \\ sin\phi \end{bmatrix} \end{aligned} \]

Optimization #

Optimization #

Minimize scalar \( F\) : find \( \mathbf{x} \) where \(\nabla f(\mathbf{x}) = 0 \)

Find solution #

Find \( \mathbf{x} \) where vector valued function \( f = 0 \)

It is analogous of optimization when \( f = \nabla F \)

Newton’s method #

We try to find \( x^* \) where \( f(x^*) = 0 \)

Starting from \( x_1 \), we hope to \( x_2 \) which is closer to \( x^* \) by:

\( 0 = f(x_1) + J(f)_{x_1} (x_2-x_1) \)

\( x_{k+1} = x_{k} - J(f)_{x_1}^{-1} f(x_k) \)

It might converge or it might not.

Optimization #

Gradient descent #

\( x_{k+1} = x_k - s_k \nabla F \), where \( s_k \) is a learning rate.

Newton’s method #

\[x_{k+1} = x_k - H^{-1}(x_k) \nabla F \]

Why \( H \) appears there?

We are trying to find \( \mathbf{x} \) where \( \nabla F(\mathbf{x}) = 0 \)

And, we find \( \nabla F(\mathbf{x}) = 0\) by newton’s method, and since \(H(F) = J(\nabla F)\) we get the above equation.

Convex #

Convex set K #

Any line from \(x_1\) to \( x_2\) stays in K

Intersection of convex is convex #

Convex function F #

Points on and above the graph are convex set.

  • How to test

    \(F(x)\) case: \( \frac{d^2f}{dx^2} \ge 0 \)

    \( F(\mathbf{x}) \) is positive semidefinite. QUESTION why?