Statistical Learning (part 2)

November 18, 2023

Statistical learning (part 2) #

This is my note taking course on https://learning.edx.org/course/course-v1:StanfordOnline+STATSX0001+2T2023/home (lecture 4.8-6.9)

Linear regression (part 2) #

The model #

The Linear Model:
- The linear regression model is typically written as: \[ Y = X\beta + \epsilon \]
where \( Y \) is the response vector, \( X \) is the matrix of predictors (including a column of ones for the intercept), \( \beta \) is the vector of regression coefficients, and \( \epsilon \) represents the error terms.
Why Not Just Use \( X^{-1}Y \)?:
- The idea of directly using \( X^{-1}Y \) to solve for \( \beta \) (like in a basic linear equation) is tempting, but it faces several issues:
  - Matrix Inversion: \( X \) is usually not a square matrix (especially in practical datasets where the number of observations \( n \) is different from the number of predictors \( p \)), so you can’t compute \( X^{-1} \) in the usual sense.
  - Overdetermined System: You have more equations than unknowns (more observations than coefficients), which leads to an overdetermined system that cannot be solved with a simple inversion.
  - Underdetermined System: You have more variables than equations
Deriving the OLS Estimator:
- The goal of OLS is to find the best-fitting line through the data points, i.e., to minimize the sum of the squared differences (residuals) between the observed values and the values predicted by the linear model.
- Mathematically, this is framed as minimizing the sum of squared residuals (SSR): \[ \text{Minimize SSR} = (Y - X\beta)^\top (Y - X\beta) \]
- To find the minimum, we take the derivative of SSR with respect to \( \beta \) and set it to zero: \[ \frac{\partial}{\partial \beta} \left[ (Y - X\beta)^\top (Y - X\beta) \right] = 0 \]
- Simplifying this, we get: \[ -2X^\top Y + 2X^\top X\beta = 0 \]
- Solving for \( \beta \), we obtain: \[ \beta = (X^\top X)^{-1}X^\top Y \]
- This is the OLS estimator for \( \beta \). It effectively projects the vector \( Y \) onto the column space of \( X \) and finds the coefficients \( \beta \) that minimize the distance between \( Y \) and its projection.

QUESTION we had alternative form for \( \beta \) They should be the same expression #

check it.. find \( \hat\beta_0 \) and \( \hat\beta_1\)

QUESTION What happens when we have more data than predictors #

That’s what MSE is doing. To find the best estimate which can explain the data.

Cross validation #

purpose #

get an idea of test error
find out how complex model we need to use

LOOCV #

LOOCV has nice formular where we can predict the left out X’s \( \hat{Y} \) without fitting the model.

\[ \hat{y}_{i(-i)} = \frac{\hat{y}i - y_i}{1 - H{ii}} \]

Where \( \hat{Y} = X\hat{\beta} = X(X^\top X)^{-1}X^\top Y = HY \)

Interpretation #

This formula arises because \( H_{ii} \) (the leverage of the ith observation) affects the influence of \( y_i \) on its own prediction. When we remove \( y_i \) from the dataset, we must adjust the prediction \( \hat{y}i \) by the amount of leverage \( H{ii} \).

High variance #

what does it mean that LOOCV has high variance?

What variance are we talking about?

cross validation #

Suppose you split your data into 5 chunks, and use 4 chunks as training set and 1 as validation set. Then repeat it 5 times by using a different chunk as the validation set.

Common error #

People often select predictors using the whole dataset. Then we the selected predictors, setup a hypothesis such as a predictor has a power to predict. (I.e their F score is not 0, or p-value is small)

In the pre screening step, we use all the data. In the second step, we erase our model and rebuild the model from scratch. Is this still causing the problem? Yes.

Because in the first step, you are finding the predictors that happen to have predictive power over the sampled data.

So using the data for filtering step is biasing your test towards the data. Separate out validation set before filtering.

Separation of test and validation set. #

Each iteration has clean separation between validation set and training set.

QUESTION However, since we are rotating the test set and validation set, it might affect the validity of validation esitimate somewhat? #

QUESTION No information is gained on the process of K-fold iteration?

QUESTION No information is gained on the process of iterating the cross validation?
If we repeat the cross validation and choose (hyper) parameters with a less average validation error, can we expect that we did indeed achieved the lower validation error and hope to see lower error on real data?

MyANSWER if we have a shared parameter (such as hyper parameter) between the K-fold iterations
Then we can look at the whole cross-validation as the training process, and the validation measurement is biased towards the data. So if your hyper parameters are hugely influntial on model prediction power, I think this could cause a problem. I.e. the validation error you get might not reflect the true validation error you will see with the unseen data.

Standard Error #

What is standard error in statistical test?

When you have a single sample #

A standard error is the deviation of our estimator (such as mean) which is derived from the standard deviation of our variable and the sample size.

Since we know how to get variance of sum (divided by constant) of n random variables \( Var(\frac{\sum X}{n}) = \frac{Var(\sum X)}{n^2} \)

We can get variance of the mean \( SE^2 = \frac{\sum_{i}{(x_i - \bar{x})^2}}{n-1} \), we use n-1 because there’s only n-1 random variables.

When we have multiple samples, a sampling distribution #

We get a sampling distribution for the estimator (such as “mean” which we are trying to estimate (guess) from our data)

We have a standard deviation of our estimator.

Suppose we try to guess the population mean:

we take a sample and find the mean of the sample
we repeat it multiple times (we take multiple samples)

Then we have the standard deviation of the sample means and we call it sample error.

\[ SE^2 = \frac{\sum_{i}{(x_i - \bar{x})^2}}{m-1} \]

Here \( x_i \) is a sample mean, \( \bar{x} \) is the mean (of the sample mean) over multiple samples and \( m \) is the number of samples.

Why we use them? #

Because we wanna express our uncertainty of our predction (e.g. our sample mean)

E.g we use them to build confidence interval.

Second type of sample error is better than the first type of sample error? #

When we have a single sample, we can’t do the 2nd type.

While both tries to answer the how uncertain we are

The first one is using the spread of the data in the sample.
The second one is using the empirical data (i.e. actually measuring the uncertainty (variance)) which we try to estimate.

QUESTION When we have multiple samples, is it better than the first type? #

QUESTION When we can use the 2nd type of standard error, not using it makes sense if we are trying to be more accurate on our uncertainty?

QUESTION If we have two samples, using the 2nd type SE is better?

QUESTION If we have many more (?) samples, using the 2nd type SE is better?

Bootstrap #

Can we have the 2nd type of sample error when we have only single sample?

Yes, we can.

We treat our single sample as population.
We take multiple samples from this population with sample with replacement.
if our population size (the original sample size) is n, we sample with size n
We call it bootstrap sample.

QUESTION can we use bootstrap sample as training set and the original sample as validation set? #

No, Bootstrap sample has about \( \frac{2}{3} \) of the original sample (why?)

Using bootstrap for cross validation is doable but hard, using a simple cross validation is better most of times.

Feature selection #

Best subset selection #

look at all possible combination of features

If you look at all the possible combinations, you end up with the best model right?

Yes, for your data, but No for the unseen data. You have so many models and you will end up with model that fits to your data so well. (you are overfitting)

This is really counter intuitive. You don’t wanna look at all possibles to do better.

Forward and Backward selection #

We use MSE or \( R^2 \) as backward selection criteria. So we need \( n > p \).

But we don’t for forward selection. Why?, how do we choose the best predictor in forward selection?

It’s because backward selection starts with the full p features, but forward selection adds a feature one at a time.

Cross validation #

Here we use cross validation to pick the optimal number of predictors.

QUESTION is there a way other than best forward selection to choose a model with k predictors? #

One standard error rule #

When we use cross validation to choose the optimal number of predictors, the rule says, choose simpler model within 1 standard error from the optimal one.

I.e. when cross validation says 5 is optimal number, we choose smaller feature because, the cross validation also are random test, and we might be better off choosing simpler model.

Shirinkage, Ridge, Lasso #

Ridge #

L2 penalty

QUESTION why does ridge improve over least squares?
In ridge, we are trying different bias-variance tradeoff by changing the \( \lambda \). It mainly controls the variance, it controls varaince not to be too big. How?
let’s try to simmulate the data and graph it.. or mathematically derive it.

Lasso #

L1 penalty more recent method

It does subset selection, how?

The above pictorial intrepretation is possible because we can represent the optimization problem with two equations.

\[ \min_{\beta} \left\{ \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p}\beta_j^2 \right\} \]

\[ \min_{\beta} \left\{ \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij})^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \leq s \]

QUESTION I can see they are related, but are they really same?
How do we know the first equation’s minimum occurs at the boundary of the circle (of the 2nd) or square (for lasso)

How to choose which? #

When actual model is dense (has high number of non zero parameters), Ridge can do better. When actual model is sparse, Lasso can do better.

Better here means, they can achieve lower training MSE (and maybe lower test MSE).

But we don’t know if true model is dense or sparse, so we try both and do cross validation to choose one.
So one is not mathematically superior, one just work better for certain scenarios (the real model), so we have to test it.

the # of free vairable (d)
AIC, BIC, and etc uses degree of freedom, but the meaning of degree of freedom changes because we are controlling the effect of the variables.

Dimension reduction #

We use a linear map from the original variables \( X \) to smaller dimension. \( U \)

QUESTION This can perform better because of we might get lower bias and lower variance. Why? #

maximum likelihood #

Deviance #

QUESTION deviance is the generalization of least squares #

Estimating test error #

adjust training error to be closer to test error #

\( C_p, AIC, BIC \) and adjusted \( R^2 \)

Some formulas use \( \hat{\sigma}^2 \), and since we normally use the MSE using all predictors to estimate \( \hat{\sigma}^2 \) , we have problem when \( n < p \). (We can’t use those formula)

QUESTION what is difference betwween \( R^2 \) (adjusted) and F statistic #

\[ R^2 \text{ (adjusted)} = 1 - \frac{\frac{RSS}{(n-d-1)}}{\frac{TSS}{n-1}} \] \[ F = \frac{(TSS - RSS) / p}{RSS / (n - p - 1)} \]

They try to answer closely related question “how good our model is” in a slightly different format. Where \( R^2 \) tries to directly answer the question directly by a number, \( F \) is a value we are gonna use with the assumed distribution (F statistic)