Matrix methods in data analysis

November 24, 2023

matrix methods #

This is my note taking for https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018/pages/readings/

lecture 1-9 Probably I should do the problem sets.

I diagressed from statiscal learning Statistical learning (part 2) because it required some math background

Contrary to linear algebra done right (Linear algebra), the treatment of matrix is more concrete with assumption that your matrix \( M(T) \) is real and the inner product is assumed to be the inner product

Also whether you treat a matrix \( A\) as a set of vectors for \( col(A) \), or as an \(A = M(T, V^s, W^s ) \) a representation of linear map, where \( V^s \) denotes the set of basis in V leads to a different interpretation.

CR factorization #

Given matrices:

\[ A = \begin{bmatrix} 2 & 1 & 3 \\ 3 & 1 & 4 \\ 5 & 7 & 12 \\ \end{bmatrix} \]

\[ \begin{aligned} A = CR =\begin{bmatrix} 2 & 1 \\ 3 & 1 \\ 5 & 7 \\ \end{bmatrix}\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ \end{bmatrix} \end{aligned} \]

\( A \) is a linear combination of \(C\) column vectors
\( A \) is a linear combination of \(R\) row vectors

Col rank = Row Rank #

You can see it in the following ways

Get independent column vectors and construct \( C \) with \( n_c \) vectors

\( A \) is a linear combination of those vectors The linear combinations are the \( R \)

Hence \(R\)’s \( n_c * n_c \) matrix are identity matrix and they are independent
\( A^T = R^T C^T \), Since \(R\) ’s row space is \(R^T \) ’s column space.
\( C \) shows how we can combine their basis to represent our vector \( R \) shows ..
combining rows or columns to have more 0 is essentially changing the basis to represent our vector to their vector, when we need \( n \) vectors to represent theirs, they also need \( n \) vectors to represent ours

LU factorization #

\[ \begin{aligned} A = \begin{bmatrix} 2 & 3 \\ 4 & 7 \\ \end{bmatrix} \text{ With elimination, we get: -> } \begin{bmatrix} 2 & 3 \\ 0 & 1 \\ \end{bmatrix} \end{aligned} \]

What have we done, is the LU factorization

as LU factorization #

\[ \begin{aligned} A = \begin{bmatrix} 1 & 0 \\ 2 & 1 \\ \end{bmatrix} \begin{bmatrix} 2 & 3 \\ 0 & 1 \\ \end{bmatrix} \end{aligned} \]

LU factorization can be viewed as sum of rank 1 matrixs #

\[ \begin{aligned} A &= \begin{bmatrix} 1 & 0 \\ 2 & 1 \\ \end{bmatrix} \begin{bmatrix} 2 & 3 \\ 0 & 1 \\ \end{bmatrix} \\ &= \begin{bmatrix} 1 \\ 2 \\ \end{bmatrix} \begin{bmatrix} 2 & 3 \\ \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \\ \end{bmatrix} \begin{bmatrix} 0 & 1 \\ \end{bmatrix} \\ &= \begin{bmatrix} 2 & 3 \\ 4 & 6 \\ \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ \end{bmatrix} = \begin{bmatrix} col_1 row_1 \\ \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & A_2 \\ \end{bmatrix} \end{aligned} \]

Where \(A_2\) is the remaining entries of \(A\)

\( A = QR \) Gram schmidt #

Abstractically, Gram schmidt says we can construct orthornomal basis from any basis.

The deconstruction is actually decomposing the column vectors as linear combination of orthorgonal vectors.

\(R \) upper trianglular (because the gram shmidt uses \( span(v_1 .. v_k) = span(e_1 … e_k) \))

Spectral theorem #

For any operator, there’s an upper triangular matrix respect to an orthornomal basis. For normal, self-adjoint (symmetric, in simple case) is special.

For normal \( C\) or self-adjoint \( R \) operator, there is eigen basis \( e^S \), such that \( M(T, e^S) = \lambda \) where \( \lambda \) is a eigenvalue diagonal matrix

\[ Sv = \lambda v \]

If we collect eigenvectors and place them in \( Q \),

\[ SQ = \lambda Q \]

Since \( Q^TQ = QQ^T = I \), we get \( S = Q \lambda Q^T \)

Singular value decomposition #

If we relax the condition on the eigenbasis, allowing two separate eigen basis, we can decompose any operator into diagonal matrix.

I.e. there’s eigen basis \( V^s, U^s\) and singular values such that: \[ \begin{aligned} Av_1 &= \sigma_1 u_1 \\ Av_2 &= \sigma_2 u_2 \\ \\ A V &= \sigma U \\ \\ \\ A \begin{bmatrix} v_1 … v_r \\ \end{bmatrix} &= \begin{bmatrix} u_1 … u_r \\ \end{bmatrix} \begin{bmatrix} \sigma_1 & … \\ 0 & \sigma_2 & … \\ … \\ 0 & … & \sigma_r \\ \end{bmatrix} \end{aligned} \]

\( AV = U\Sigma \), \( A = U\Sigma V^T \)

What would \( V, U \) should be? we can look at the \(A^TA\):

\[ A^TA = V\Sigma^T U^T U \Sigma V^T = V \Sigma^2 V^T \] \[ AA^T = U\Sigma V^T V \Sigma^T U^T = U \Sigma^2 U^T \]

I.e. \(U, V\) are eigen basis for the symetric matrix \(A^TA\), which spectral theorem provides us. (guarantees the existence)

Now, how do we find the \( v, u, \sigma \)? Can we use the \(v\): orthorgonal basis of \(A^TA\) and \( \sigma \): eigen values for \( Av_1 = \sigma_1 u_1 \)?

Indeed, \( (\frac{Av_1}{\sigma_1})^T \frac{Av_2}{\sigma_2} = 0 \), i.e. they are orthornormal.

\[ A = U\Sigma V^T\]

Singular values are the sqrt of eigenvalues of \( A^TA\).

Polar decomposition #

In number world, in polar coordinate, we have \( r e^{i\theta}\).

\( r \) is “real”
\( e^{i\theta}\) denotes “direction”

In linear map world, (matrix)

self-adjoint is “real” self adjoint \( T = T^* \) (symmetric in R is self adjoint)
orthornomal basis is the direction (of the linear map)

\[ A = SQ \]

We can get the decomposition from Singular value decomposition \(A = U\Sigma V^T\)

\[ A = U\Sigma V^T = U\Sigma (U^T U) V^T = (U\Sigma U^T) (U V^T) = SQ \]

Transformation which keeps eigenvalue #

Similar matrices have the same eigenvalues.

\( A, B \) matrices are similar when \(B = M A M^{T}\)

Transformation which keeps singular value #

We can multiply any orthorgonal matrix \(Q\), i.e. \( A = Q U\Sigma V^T \)

\( QU \) product of orthorgonal matrix is orthorgonal #

\( (QU)^{-1} = U^{-1}Q^{-1} = U^TQ^T = (QU)^T \)

If \(A^T A = I \), then \( A \) is orthorgonal

Vector Norm #

Non-negativity: As with vector norms, matrix norms are non-negative and are zero only for the zero matrix.
Scalar Multiplication: The norm of a scalar multiple of a matrix is the absolute value of the scalar times the norm of the matrix.
Triangle Inequality (Subadditivity): \( \|A + B\| \leq \|A\| + \|B\| \) for any matrices \( A \) and \( B \).

L2 norm #

our usual norm of sqrt after squaring \[ |v|_2 = \sqrt{\sum_i v_i^2} \]

L1 norm #

\[ |v|_1 = \sum_i |v_i|\]

Matrix Norm #

Definition of Matrix Norm #

A matrix norm is a function \( \Vert \cdot \Vert \) that maps a matrix to a non-negative real number, satisfying the following properties for all matrices \( A, B \) and all scalars \( c \):

Non-negativity: \( \Vert A \Vert \ge 0 \), and \( \Vert A \Vert = 0 \) if and only if \( A \) is the zero matrix.
Scalar multiplication: \( \Vert cA \Vert = |c| \Vert A \Vert \).
Triangle inequality: \( \Vert A + B \Vert \le \Vert A \Vert + \Vert B \Vert \).
Submultiplicativity: \( \Vert AB \Vert \le \Vert A \Vert \Vert B \Vert \).

Property 3 is about matrix addition, how it should behave when our linear map is added. Property 4 is about matrix multiplication, how it should behave when our linear map is composited.

The matrix norm is about how much linear map stretches vectors.

QUESTION So here we have a property (of matrix) which gives us different answer when we think a unified transformation or separate sequential transformation. #

It’s weird property. Which of the norms (L1, L2 and so on) have the \( = \) equality property?

Types of Matrix Norms #

There are several different ways to define a matrix norm, but some of the most common include:

Induced (or Operator) Norm: #

\[ \Vert A \Vert = \max_{{x} \neq 0} \frac{\Vert A x \Vert_v}{\Vert x \Vert_v} \]

This represents the maximum “stretch” that the matrix \( A \) applies to any vector \( x \).

It happens to be \( x \) is the \( v_1 \), the first eigen vector of \( A^TA\), and \( \Vert A \Vert = \sigma_1 \)

proof
let’s prove it

Frobenius Norm: #

\[ \Vert A \Vert_F = \sqrt{\sum_{i,j} |a_{ij}|^2} \] where \( a_{ij} \) are the elements of \( A \). This is analogous to the Euclidean norm for vectors.

Nuclear Norm: #

\[ \Vert A \Vert = \sum_{i} |\sigma_i| \]

Netflix competition winner used this norm. There’s hypothetis that Deep learning finds the nuclear norm.

QUESTION Doesn’t deep learning’s loss function specify the which norm to use?
what does he mean when he hypothesize?
The important distiction is that, we have many more possible solutions because # of parameters > # of data

Eckart-Young #

Given a matrix A, there’s a unique matrix \(B\) with rank \(r\) such that \( \Vert A-C \Vert \geq \Vert A-B \Vert \) for any mattrix C with rank \( r \)

\( B \) is the first \(r\) rank-1 sum of SVD.

proof #

https://math.mit.edu/~gs/learningfromdata/DanielDrucker01.pdf

QUESTION why is he looking at \( A-CR\)?
He must be assuming, any matrix can be represented by \(CR\)?

PCA #

Pick the first \( r\) number of \( v \)s to best represent the data, after Singular value decomposition. Eckart-Young says they are the best. The first \(r\) 1-by-1 rank of SVD.

QUESTION What is the different from MSE method? #

(ANSWER) PCA measures the orthogonal distance, and MSE measures the distance using the space’s basis. But I can’t see why.

Least squares #

We have more constraints than variables, so there are not exact solutions.

So we try to find the best alternative. There are multiple ways to do this

Pseudo inverse #

We handle only what we can handle.

For given \( T \) which has non zero null space, we define \( T^+ \) which maps back the range of \(T\) back to the image, but 0 otherwise.

We can get the \( A^+ \) from \( A = UDV^T\)

\( A^+ = VD^+U^T\) where \(D^+\) is the reciprocal of \( D\)

MSE #

by minimizing the errors #

MSE derivation

\[ \mathbf{A}^{\mathsf{T}}\mathbf{A}\hat{\mathbf{x}} = \mathbf{A}^{\mathsf{T}}\mathbf{b}, \hat{x} = (\mathbf{A}^{\mathsf{T}}\mathbf{A})^{-1} \mathbf{A}^{\mathsf{T}} \mathbf{b} \]

By projection #

The same solution can obtained by a projection.

We are looking for \( \hat{x} \) where \( A\hat{x} \) is best (cloesest) to \( b \). Problem is that our column space of \(A\) doesn’t span \(b \)

We find the closest vector in the column space to the \( b \): by projecting b on to the column space

The vector \(\mathbf{A}\hat{\mathbf{x}}\) is the projection of \(\mathbf{b}\) onto Col(\(\mathbf{A}\)). The difference \(\mathbf{b} - \mathbf{A}\hat{\mathbf{x}}\) is orthogonal to every column of \(\mathbf{A}\), which leads to the equation \( \mathbf{A}^{\mathsf{T}}(\mathbf{b} - \mathbf{A}\hat{\mathbf{x}}) = \mathbf{0} \)

QUESTION MSE and PCA ? #

MSE (when compared to PCA), seemed to be using a direction which has no ‘perpendicular’ sense to it, but apparently it is, so can you visualize PCA which direction it’s trying to minimize?

When does it work? #

\(\mathbf{A}^{\mathsf{T}}\mathbf{A}\hat{\mathbf{x}} = \mathbf{A}^{\mathsf{T}}\mathbf{b}\) can be solved when \( A^T A \) is invertible.

When \( A \) is invertible \( A^T A \) is invertible.

Why? can be viewed in multiple ways

Use dot product
\( (Ax)^T (Ax) = 0 \) only when \(Ax =0\), but \(A\) ’s null space is empty. so \( (Ax)^T (Ax) \neq 0 \) for any \( x\). Therefore \( A^T A\) has empty null space.

Use \( A = UDV^T\)

Treat \(A\) as a \( M(T) \) and \(A^T\) as \( M(S) \) where \( S \) is a dual map
- after \(A^T A\), we get linear combinations (of our basis) such that they will produce a linear functional, e.g. we will know rental price coefficients for our vector
  And this rental price coefficients are isomorphism from our vector
- \(Av\) shows linear combination of \( W^s\), \(A^T_1\) shows all the components of \( V^s\) which contribute to \( w_1 \)

QUESTION Think of \( A^* \)

can we not use pseudo inverse formula even when \( A \) is invertible? #

I guess yes we can use it.

\( A^+ = VD^+U^T = (\mathbf{A}^{\mathsf{T}}\mathbf{A})^{-1} \mathbf{A}^{\mathsf{T}} \), and \( (\mathbf{A}^{\mathsf{T}}\mathbf{A})^{-1} \mathbf{A}^{\mathsf{T}} \) is the left side true inverse (not the right side inverse)

But, does that also mean, that pseudo inverse generates L_2 minimum solution when \( A \) is not invertible?

QUESTION I.e. Does pseudo inverse gives \( \min_x \|b - Ax\|_2^2\) ? #

I think it does..

\( \min_x \|b - UDV^Tx\|_2^2 = \min_x \|U^Tb - DV^Tx\|_2^2 \), and expanding out gives \( \hat{x} = VD^+U^T b\)

But then, not sure if you can see it without actually computing it

Isn’t it weird that the simple reasoning which lead to the construction of \( A^+ \) actually performs the L2 minimization?

Difficulties with \( Ax = b\) #

Under determined #

Deep learning, many more parameters than equations. There are many solutions.

What solution deep learning is picking? #

gradient descent or whatever algorithm are picking L2 norm or L1 norm? The same question as posed above. I am not sure if I understand the question correctly

How to pick the solution which will generalize to unseen data? #

How to make an algorithm so that it is biased towards the solution that will work on the unseen data?

So he is posing it as a math question rather than an empirical question where we use the cross validation and etc.

Near singular #

When a matrix is near singular, (i.e. invertible but barely, singular values are small) the small change in input can produce big change in output.

This happens in inverse problems, where we are trying to guess the parameter from output. The linear map (from parameter to output) maps big numbers into small numbers near zero.

\((\mathbf{A}^{\mathsf{T}}\mathbf{A})^{-1} \) is huge.

So we try to solve it by adding a term:

\[\Vert Ax- b\Vert^2 + \delta\Vert x\Vert^2 \]

This is ridge regression, and the professor was saying something like embarassed to say this is a solution. (I guess mathematically) I guess the reasoning is that it feels like a patch up work. Good to know there’s some hidden (yet obvious) math I don’t see underneath.

But then, he also mentions lasso is a big thing these days.

The equation can be interpreted as, finding the \( \min_{x} \)

\[ \begin{aligned} \begin{bmatrix} A \\ \delta I \\ \end{bmatrix}\begin{bmatrix} x \\ \end{bmatrix} = \begin{bmatrix} b \\ 0 \\ \end{bmatrix} \end{aligned}\]

How to find \( \delta \) ? #

It’s emprical question. It depends on our accessment of the data.

How does \( \delta \) behave? #

Think 1d case where \( A = [ \sigma ] \)

\( (A^{T} A + \delta^2I) x = A^T b \), when 1d case, \( A^T A = \sigma^2\)

\( \hat{x} = \frac{\sigma}{\sigma^2 + \delta^2} b\)

When \( \sigma > 0 \), limit is \( \frac{1}{\sigma}\)
When \( \sigma = 0 \), limit is 0

\( (A^{T} A + \delta^2I)^{-1} A^T \to A^+ \)

QUESTION So what’s the point of the statement ?
Does it have any practical value? Do we decrease \( \delta\) and see if we can pluck out some variables?

When A is big #

\( A^TA \) is not computable

use iteration method #

When matrix is 1000 ish big..

Krylov subspaces
\( \{b, Ab, A^2b, \ldots, A^{m-1}b\} \)
- QUESTION why is it a good basis?
  Maybe, because we are using \(A\) to form basis, but can’t tell

Arnoldi iteration
- apply Gram-Schmidt at each step
- Hessenberg (almost diagonal)
- shifting (not sure why this helps.. feels like randomize to get different eigenvalue)

if too big.. #

When matrix is big .. like million, we need whole new perpsective.

Use randomized linear algebra (pick random column vectors from \( A \))