Statistical learning (part 3)

December 8, 2023

Statistical learning (part 3) #

This is my note taking course (lecture 6.10 - 7.4) on https://learning.edx.org/course/course-v1:StanfordOnline+STATSX0001+2T2023/home

PCA #

Find a direction where data varies the most. What does it mean?

Suppose you have a such direction, how do I verify the direction is the wanted direction? I.e. what is our goal?

Suppose we have a direction, rotate the axis so that the direction is horizontal.

Now our data point is plotted in this new coordinate.

We project our data onto this the line (the line of direction we found), and compute the variance of the projected data.

We are looking for a line with the high variance.

X is mean-subtracted, then project data \( x \) onto a vector \( v\), \( \langle x, v \rangle \) represents the magnitude (with sign) of the projected vector, and it’s mean is 0.

Therefore we are looking for \(\argmin_{v} Var( \langle x, v \rangle \). With matrix notation, \( (Xv)^T (Xv) = v^T(X^TX)v = v^TCv \) where \( C \) is a covariance matrix.

If we add constraint that the direction to be unit length \( v^Tv = 1 \)

\[ L(v, \lambda) = v^TCv - \lambda(v^TV -1 ) = 0 \]

By differentiating it, we get

\[ Cv = \lambda v \]

Hence we are looking for the eigenvectors of covariance matrix.

Also, since \( v^TCv = v^T\lambda v = \lambda \) the variance of the projected data along the eigenvector is the corresponding eigenvalue.

Compare to Least squares #

Maximizing the projected component squared is equivalent to minimizing the risidual squared (the orthogonal component) by Pytagorian theorem.

So PCA is finding a line where error measured in the perpendicular direction to the line is minimized.

Wheares in least squares, error is measured in the orthogonal direction to the data axis (the horizontal axis)

So are they good direction? #

Be aware the methodology looks at only data \( X \) not \( Y\), it looks at the direction where data varies the most.

It might not be the direction where correlation with \( Y \) is high.

Variance #

Variance due to different data samples #

Measures how my model fluctuate across different samples.

This is about model’s stability. Often accessed through cross validation or bootstrapping, typically expensive test.

Variance due to the uncertainty in a single sample #

Measures how precise my prediction is for a single data point in a sample, assuming my model is correct.

This is about how much my prediction would fluctuate for the data point.

How do we measure this?

How to measure \( \text{Var}(\hat{f}(x_0)) \) ? #

Strategy outline
For a linear regression model:
\( \mathbf{Y} = \mathbf{X}\beta + \epsilon \)
We are going to get \(\text{Var}(\hat{\beta})\), then get \( \text{Var}(x_0^T\hat{\beta} ) \)

\(\text{Var}(\hat{\beta})\)
From SSR, \( S(\beta) = (\mathbf{Y} - \mathbf{X}\beta)^\top(\mathbf{Y} - \mathbf{X}\beta) \)
Differentiating \(S(\beta)\) with respect to \(\beta\) and setting it to zero gives:
\[ \frac{\partial S}{\partial \beta} = -2\mathbf{X}^\top(\mathbf{Y} - \mathbf{X}\beta) = 0 \]
From this, we can solve for \(\hat{\beta}\):
\[ \mathbf{X}^\top \mathbf{Y} = \mathbf{X}^\top \mathbf{X}\hat{\beta} \] \[ \hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
To solve for \( \hat{\beta} \), we treat \(\mathbf{Y} \) as the observed data.
However, to get variance of \( \hat{\beta} \), we treat \(\mathbf{Y} \) as a random variable. So \(\mathbf{Y} \) and \(\hat{\beta}\) has random error term \(\epsilon\)
\[ \text{Var}(\hat{\beta}) = \text{Var}((\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}) \]
Assuming that the error terms \(\epsilon\) are independent and identically distributed with a variance of \(\sigma^2\), and considering that \(\mathbf{X}\) is non-random, we have:
\[ \text{Var}(\hat{\beta}) = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \text{Var}(\mathbf{Y}) \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \] \[ \text{Var}(\mathbf{Y}) = \text{Var}(\mathbf{X}\beta + \epsilon) = \text{Var}(\epsilon) = \sigma^2 \mathbf{I} \]
We used the following two properties to arrive:
\[ \text{Cov}(\hat{\beta}) = (\mathbf{X}^\top \mathbf{X})^{-1} \sigma^2 \]
- \(\text{Var}(AY) = A \text{Var}(Y) A^\top\)
  For a matrix \(A\) and a random vector \(Y\), the linear transformation is \(Z = AY\). The variance-covariance matrix of \(Z\) is defined as:
  \[ \text{Var}(Z) = E[(Z - E[Z])(Z - E[Z])^\top] \]
  Substitute \(Z\) with \(AY\):
  \[ \text{Var}(AY) = E[(AY - E[AY])(AY - E[AY])^\top] \]
  Since \(E[AY] = AE[Y]\) (expectation is a linear operator), this becomes:
  \[ \text{Var}(AY) = E[(AY - AE[Y])(AY - AE[Y])^\top] \]
  Expanding this and using the property that \(E[AB] = AE[B]\) for any non-random matrix \(A\) and random matrix \(B\), we get:
  \[ \text{Var}(AY) = E[A(Y - E[Y])(Y - E[Y])^\top A^\top] \]
  Since \(A\) is non-random, it can be taken out of the expectation:
  \[ \text{Var}(AY) = A E[(Y - E[Y])(Y - E[Y])^\top] A^\top \]
  The term \(E[(Y - E[Y])(Y - E[Y])^\top]\) is simply the variance-covariance matrix of \(Y\), denoted as \(\text{Var}(Y)\):
  \[ \text{Var}(AY) = A \text{Var}(Y) A^\top \]
- \(\mathbf{X}^\top c\mathbf{I} \mathbf{X} = c\mathbf{X}^\top \mathbf{X}\)

Now, \( \text{Var}(\hat{f}(x_0)) \)
We wanted to get \( \text{Var}(\hat{f}(x_0)) = \text{Var}(x_0^T\hat{\beta} )\)
- \(\text{Var}(a^\top X) = a^\top \text{Cov}(X) a\)
  \[\begin{aligned} \text{Var}(a^\top X) &= \text{Var}(a_1X_1 + a_2X_2 + … + a_nX_n) \\ &= \text{Cov}(a_1X_1 + a_2X_2 + … + a_nX_n, a_1X_1 + a_2X_2 + … + a_nX_n) \end{aligned}\]
  I.e. we are looking for the variance of linear combination.
  Finally, we get \( \text{Var}(\hat{f}(x_0)) = \text{Var}(x_0^\top \hat{\beta}) = x_0^\top \text{Cov}(\hat{\beta}) x_0\)

Non linear regression #

Polynomial regression #

You use \( y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + … + \epsilon_i \). Note that even though predictors are non linear, they are just predictors. i.e. we could treat it as multi variable linear regression, and use least squares.

But they are very bad at predicting near or beyond the boundary (leftmost and rightmost) regions.

Step functions #

We can cut our domain into multiple ranges.

\( C_1(X) = I (c_1 \le X \lt c_2\) \( C_2(X) = I (c_2 \le X \lt c_3\)

Then try to find a constant for the region.

\( y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + … + \epsilon_i \)

Essentially, we are creating new predictors \( C_1(x_i), C_2(x_i), … \), and we could again use the ordinary least square method to fit the model.

Regression splines #

Combine the polynomial and step function ideas. We use a piecewise polynomial function for cutted regions.

In order to make them continuous at cut points (knots), we add constraint so that the functions are continuous up to 2nd derivatives. (It’s hard to detect non continuity of more than 2nd degree)

A natural cubic spline is a spline with the constraint and extra constraint that the function should be linear beyond the outer boundary.

Smoothing splines #

\[ \text{argmin}_g \sum (y_i - g(x_i))^2 + \lambda \int g’’(t)^2 \, dt \]

Much like ridge or lasso, we add a penalty term.

Surprisingly, the solution to this optimization problem (in a restricted solution space of cubic splines), is equal to the natural spline with knots at every data point. (but shrunked)

Local regression #

Regression over moving window. I.e. regression using only neighbors, and weight the neighbors according to distance from the center.

Generalized additive models #

Upto now, we dealt with a single variable \( x \). Now extend the idea to the multiple variables with \(f_i\) where each \( f_i \) could be splines.

Contrast to multiple linear regression,

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2 … }\]

We have: \[y_i = \beta_0 + f_1( x_{i1} )+ \beta_2 f_2(x_{i2})… \]

Euler-Lagrange equation #

chatgpt https://chat.openai.com/share/bc64d1c6-066a-47b8-bb26-5a1fce10540a

I’m interested in a model which can express infromation delay of network.

Definitions #

State space #

configuration the object can take

Path #

maps time t to state space \( q(t) \)

Lagrangian L #

application specific function of interest. e.g. L = T - V (difference between kinetic and potential energy)

Measure of the dynamic state of the system.

If L measures asset valuation, S would be total asset change over time t for possible path q(t)

Functional #

A function which takes a function (whose domain is a state space) and maps to a scalar.

Action S #

Functional given by the integral of the lagrangian over time \(S = \int L \, dt\) It is the “total accumulation” of the system’s dynamic over the path.

Principle of Least Action #

This is the heart of the Euler-Lagrange equation.

The path that the system actually takes is the one that makes S stationary (minimum, maximum, or saddle point).

\[ \delta S = \delta \int_{t_1}^{t_2} L(q(t), \dot{q}(t), t) \, dt = 0 \]

The variation \( \delta S \) is not the same as the time derivative \( \frac{dS}{dt} \). It represents a functional derivative, which considers how \( S \) changes in response to infinitesimal changes in the function \( q(t) \) itself, not in time.

Functional Variation: The Delta Notation (δ) #

Delta (δ) Notation:
- In the calculus of variations, the “delta” (\( \delta \)) notation is used to denote a variation in a functional. This variation represents an infinitesimal change in the functional due to an infinitesimal change in its input function.
- It’s important to note that this is different from the “d” notation used for derivatives in regular calculus, which typically represents an infinitesimal change in a function due to an infinitesimal change in its argument (like time or space).
Functional Variation Example:
- Consider a functional \( S[q(t)] \) and let \( q(t) + \delta q(t) \) be a slight variation of the function \( q(t) \). The variation \( \delta S \) is then defined as: \[ \delta S = S[q(t) + \delta q(t)] - S[q(t)] \]
- In the limit as \( \delta q(t) \) becomes infinitesimally small, this leads to the calculus of variations and the derivation of the Euler-Lagrange equations.

Understanding the Concept #

Infinitesimal Changes: The idea behind \( \delta S \) and \( \delta q(t) \) is to consider what happens to the functional \( S \) when the function \( q(t) \) is slightly “nudged” or varied. This “nudging” is not a change in time or space, but a change in the function’s shape or form over its domain.
Physical Interpretation: In physical terms, this can be thought of as slightly altering the path a system takes and seeing how this affects the overall action (or whatever the functional represents).

Mathematical Representation #

Variational Principle: In the Principle of Least Action, the variation of the action \( \delta S \) due to a variation in the path \( \delta q(t) \) is used to find the path that makes \( S \) stationary. This principle is expressed as \( \delta S = 0 \).

Deriving the Euler-Lagrange Equation #

Start with the Variation of the Action: \[ \delta S = \delta \int_{t_1}^{t_2} L(q(t), \dot{q}(t), t) \, dt \]
Introduce a Small Variation in the Path: Consider a small variation in the path \( q(t) \), denoted as \( \delta q(t) \). The variation in the derivative of the path \( \dot{q}(t) \) is then \( \delta \dot{q}(t) \).
Expand the Variation of the Lagrangian: The variation in the Lagrangian \( L \) due to these small changes can be expanded using a Taylor series (keeping only first-order terms): \[ \delta L = \frac{\partial L}{\partial q} \delta q + \frac{\partial L}{\partial \dot{q}} \delta \dot{q} \]
Substitute into the Variation of the Action: \[ \delta S = \int_{t_1}^{t_2} \left( \frac{\partial L}{\partial q} \delta q + \frac{\partial L}{\partial \dot{q}} \delta \dot{q} \right) dt \]
Integration by Parts: For the term involving \( \delta \dot{q} \), use integration by parts. Let’s denote \( u = \frac{\partial L}{\partial \dot{q}} \) and \( dv = \delta \dot{q} dt \), then \( du = \frac{d}{dt}\left(\frac{\partial L}{\partial \dot{q}}\right) dt \) and \( v = \delta q \)

\[ \int \frac{\partial L}{\partial \dot{q}} \delta \dot{q} dt = \left. \frac{\partial L}{\partial \dot{q}} \delta q \right|_{t_1}^{t_2} - \int \frac{d}{dt}\left(\frac{\partial L}{\partial \dot{q}}\right) \delta q \, dt \]

The boundary term vanishes if we assume that the variation \( \delta q(t) \) is zero at the endpoints \( t_1 \) and \( t_2 \).

Combine and Set the Variation to Zero: After combining the terms, we have: \[ \delta S = \int_{t_1}^{t_2} \left( \frac{\partial L}{\partial q} - \frac{d}{dt}\left(\frac{\partial L}{\partial \dot{q}}\right) \right) \delta q \, dt = 0 \]
Fundamental Lemma of Calculus of Variations: For the integral above to be zero for all variations \( \delta q(t) \), the integrand itself must be zero. Therefore: \[ \frac{\partial L}{\partial q} - \frac{d}{dt}\left(\frac{\partial L}{\partial \dot{q}}\right) = 0 \]

Fixed Endpoints #

In the context of the Principle of Least Action, the path \( q(t) \) is typically assumed to be fixed at the endpoints of the time interval. This means the initial and final states of the system are predetermined and not subject to variation.

Fundamental Lemma of Calculus of Variations #

Statement: The Fundamental Lemma of Calculus of Variations essentially states that if an integral of the form \[ \int_{a}^{b} f(x) \, \delta q(x) \, dx = 0 \] is zero for all admissible functions \(\delta q(x)\) that vanish at the endpoints \(a\) and \(b\), then the function \(f(x)\) must be zero for all \(x\) in \([a, b]\).
Interpretation: In simpler terms, if the integral of a function \(f(x)\) multiplied by any arbitrary small variation \(\delta q(x)\) is always zero, then \(f(x)\) itself must be zero everywhere on the interval. This is because the only way for this integral to be zero for all possible choices of \(\delta q(x)\) (which can be any function that is zero at the endpoints) is if \(f(x)\) is zero.