Statistical Learning
November 14, 2023
Statistical Learning #
This is my note taking course on https://learning.edx.org/course/course-v1:StanfordOnline+STATSX0001+2T2023/home (lecture 1-4.7)
gpt #
linear regression https://chat.openai.com/share/30270f42-7ef7-45f7-adec-5858d3dc8360
logistic regression https://chat.openai.com/share/ab52dce1-f66a-4cc6-9807-03f1b7888e31
case-control sampling https://chat.openai.com/share/d1986c13-a01b-4bc4-afa8-e1384ddc303f
bias variance tradeoff #
\[\begin{aligned} \text{MSE}(x) &= E[(Y - \hat{f}(x))^2] \\ \text{MSE}(x) &= E[(f(x) + \epsilon - \hat{f}(x))^2] \\ \text{MSE}(x) &= \text{Var}(\hat{f}(x)) + [\text{Bias}(\hat{f}(x))]^2 + \text{Var}(\epsilon) \\ \text{MSE}(x) &= \text{Var}(\hat{f}(x)) + [\text{Bias}(\hat{f}(x))]^2 + \text{Var}(\epsilon) \\ \text{Bias}(\hat{f}(x)) &= E[\hat{f}(x)] - f(x) \end{aligned} \]
linear regression #
setup #
\( \mathbb{E}[Y|X=4] \) means expected value of Y given \( X = 4\)
Then, we want our fuction to be \( f(x) = \mathbb{E}[Y|X=x] \). We call it the regression function. i.e. we are trying to find prediction function that can predict the mean. (regression comes from ‘regression to mean’)
And we constrain our \( f(x) \) to be \( Y = \beta_0 + \beta_1 X + \epsilon \)
loss function #
A typical loss function is MSE.
\[ \text{Minimize SSR} = (Y - X\beta)^\top (Y - X\beta) \]
Since we want to find The global minima or maxima of the loss function must occur at a point where the gradient (slope) is equal to zero, so the partial derivatives here must be equal to zero.
\[ \frac{\partial}{\partial \beta} \left[ (Y - X\beta)^\top (Y - X\beta) \right] = 0 \]
find \( \hat\beta_0 \) and \( \hat\beta_1\) #
\[ \nabla (L)= \left[\frac{\partial L}{\partial \beta_0},\frac{\partial L}{\partial \beta_1} \right] = 0 \]
\[ \hat\beta_1 = \frac{\sum_{i}^{}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum^{}_{i}({x_i}-\bar{x})^{2}} \] \[ \hat\beta_0 = \bar{y}-\hat\beta_1{\bar{x}} \]
Var(B0) #
Our model is:
\[ Y = B_0 + B_1X + \epsilon \]
We have the following new random variable, and we want to find their spread.
\[ B_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, B_0 = \bar{y}-B_1{\bar{x}} \]
What is random here?, given x, y has a random factor e.
Since \( \sum_{i=1}^{n}(x_i - \bar{x})\bar{y} = 0 \)
\[ \text{Var}(B_1) = \frac{\text{Var}\left(\sum_{i=1}^{n}(x_i - \bar{x})y_i\right)}{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2\sigma^2}{\left(\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^2} \]
\( SE(B_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} \), where \( \hat{\sigma} = s_e = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} \)
From \( B_0 = \bar{y} - B_1\bar{x} \) \[ \text{Var}(B_0) = \text{Var}(\bar{y}) + \bar{x}^2\text{Var}(B_1) - 2\bar{x}\text{Cov}(\bar{y}, B_1) \]
Given \(\bar{y}\) is independent of \( B_1\), (i.e. our linear coefficient is not dependent on average of y, \( \bar{y} \) is a constant)
\[ \text{Var}(B_0) = \frac{\sigma^2}{n} + \frac{\bar{x}^2\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \]
t-statistics for \(B_0, B_1\) #
Why we care about \( \text{Var}(B_1) \) ? Because our estimate \( \hat\beta_1\) depends on our sample, and we need to intepret this value.
E.g. does our predictor (X) affect Y? We answer it by using t-statistics. (confidence interval or p-value)
colinearity #
When you have multiple features, you get the corresponding coefficients for them.
When they are linearly correlated, \( SE(B_1) \) is bumped up, therefore you lose precision on the feature and t-test.
More importantly, when variables are correlated, the magnitude of coefficient is distributed over correlated variables, so it’s harder to interpret the coefficient.
Where you have contour plot of two coefficients \( B_{1a} , B_{1b} \), when they are linearly correlated, the plot will have narrow elipses.
\( R^2 = r^2 \) #
We need a way to tell how good our model is, MSE is depend on units, so it is not a good measure.
\[ R^2 = 1 - \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{\text{Unexplained Variance}}{\text{Total Variance}} \] \[ = \frac{\text{Explained Variance}}{\text{Total Variance}} \]
The coefficient of determination \( R^2\) is defined as the proportion of the variance in the dependent variable that is predictable from the independent variable
Since \( R^2 = r^2 \):
Corelation r measures how much the best linear line explains total variance. Note that this interpretation is much more meaningful than interpretation we can derive from the definition of \( r \).
proof #
\[ \sum_{i=1}^{n} (\hat{y_i} - \bar{y})^2 = \beta_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
And because \( \beta_1 = r \cdot \frac{s_y}{s_x} \), if we substitute \( \beta_1 \) into the equation, we get:
\[ \sum_{i=1}^{n} (\hat{y_i} - \bar{y})^2 = r^2 \cdot \frac{s_y^2}{s_x^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Since the total variance in \( Y \) is \( s_y^2 \), the \( R^2 \) formula simplifies to:
\[ R^2 = \frac{r^2 \cdot \frac{s_y^2}{s_x^2} \sum_{i=1}^{n} (x_i - \bar{x})^2}{s_y^2} \]
The \( s_y^2 \) cancels out and the \( \sum_{i=1}^{n} (x_i - \bar{x})^2 \) is the total variance in \( X \), which leaves us with:
\[ R^2 = r^2 \]
\[ \beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \] \[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
regularization #
ridge, lasso regression #
The optimization problem for Ridge regression is given by:
\[ \min_{\beta} \left\{ \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij})^2 + \lambda \sum_{j=1}^{p}\beta_j^2 \right\} \]
where \(\lambda\) is the regularization parameter. The term \(\lambda \sum_{j=1}^{p}\beta_j^2\) is the L2 penalty, which is the sum of the squares of the coefficients multiplied by the regularization parameter.
This can be shown to be equivalent to solving the following constrained optimization problem:
\[ \min_{\beta} \left\{ \sum_{i=1}^{n}(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij})^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \leq s \]
for some \( s \geq 0 \).
lagrange multiplier #
https://youtu.be/io8kTsSnOdE?si=DbL44RYDVIGgzoiM&t=1395
https://youtu.be/5A39Ht9Wcu0?si=xNS89EKijayT7dWY
You are considering multiple contour plots for the function you are trying to find maximum. You are also considering the contour plot given by the constraint.
If you place them, you can see when the constraint level curve intersect the function level curves, you can go higher by moving to the higher level curve.
The maximum is only possible when your level curve doesn’t cross the other level curves. (it only touches at a point). So at this point, the level curve is aligned. All you need to test is the alignability by the gradient.
gradient tells the direction of the level curve
If it is possible to define a particular surface in the three-dimensional space as $f(x, y, z)=c$, then we can define a contour on the surface by parametrization as $x(t), y(t), z(t)$. Then, we have for all points in the contour,
$$ f(x(t), y(t), z(t))=c $$
Let us calculate the total differential for $f(x(t), y(t), z(t))$ with respect to $t$. We obtain
$$ \begin{aligned} & d f=\frac{\partial f}{\partial x} \mathrm{~d} x+\frac{\partial f}{\partial y} \mathrm{~d} y+\frac{\partial f}{\partial z} \mathrm{~d} z \\ & =\left(\frac{\partial f}{\partial x} \mathbf{i}+\frac{\partial f}{\partial y} \mathbf{j}+\frac{\partial f}{\partial z} \mathbf{k}\right) \cdot(\mathrm{d} x \mathbf{i}+\mathrm{d} y \mathbf{j}+\mathrm{d} z \mathbf{k}) \end{aligned} $$
But since $\mathrm{d} f=0$, we conclude
$$ \left(\frac{\partial f}{\partial x} \mathbf{i}+\frac{\partial f}{\partial y} \mathbf{j}+\frac{\partial f}{\partial z} \mathbf{k}\right) .(\mathrm{d} x \mathbf{i}+\mathrm{d} y \mathbf{j}+\mathrm{d} z \mathbf{k})=0 $$
The first of the factors defines $\nabla f$ while the second defines a vector tangential to the contour. That their dot product is zero implies that they are perpendicular.
Multiple linear regression #
interpreting the coefficient #
If predictors are uncorrelated, it can be said: “a unit change in \( X_i \) is associated with a \( B_i \) change in \( Y \) while all other variables stay fixed”
But most of times, when \( X_i \) changes, other \( X_j \) changes.
observational (correlated) data doesn’t imply causality #
Claims of causality should be avoided (after observing data)
MSE #
\[ \text{MSE}(\mathbf{b}) = \frac{1}{n} \lVert \mathbf{Y} - \mathbf{X}\mathbf{b} \rVert^2 \]
\[ \lVert \mathbf{Y} - \mathbf{X}\mathbf{b} \rVert^2 = \mathbf{Y}^\mathsf{T}\mathbf{Y} - 2\mathbf{Y}^\mathsf{T}\mathbf{X}\mathbf{b} + \mathbf{b}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\mathbf{b} \]
We find \(\beta\) to minimize the MSE where gradient is 0:
\[ \frac{\partial}{\partial \mathbf{b}} \left( \frac{1}{n} \lVert \mathbf{Y} - \mathbf{X}\mathbf{b} \rVert^2 \right) = \frac{1}{n} \left( -2\mathbf{X}^\mathsf{T}\mathbf{Y} + 2\mathbf{X}^\mathsf{T}\mathbf{X}\mathbf{b} \right) \]
So, our best \( b \) is: \[ \mathbf{X}^\mathsf{T}\mathbf{X}\mathbf{\hat{b}} = \mathbf{X}^\mathsf{T}\mathbf{Y} \]
Simple vector differentiation #
Derivative of \( \mathbf{A}\mathbf{b} \) with respect to \( \mathbf{b} \):
\( \mathbf{A} \) is a constant matrix and \( \mathbf{b} \) is a vector of variables. When we differentiate each element of the product \( \mathbf{A}\mathbf{b} \) with respect to each element of \( \mathbf{b} \), we essentially get back the matrix \( \mathbf{A} \), because this is a linear transformation. Here’s a simple example to illustrate:
Let’s take \( \mathbf{A} \) as a \( 2 \times 2 \) matrix and \( \mathbf{b} \) as a \( 2 \times 1 \) vector: \[ \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix} \]
Then \( \mathbf{A}\mathbf{b} \) is: \[ \mathbf{A}\mathbf{b} = \begin{bmatrix} a_{11}b_{1} + a_{12}b_{2} \\ a_{21}b_{1} + a_{22}b_{2} \end{bmatrix} \]
The derivative of this with respect to \( \mathbf{b} \) is: \[ \frac{\partial}{\partial \mathbf{b}} (\mathbf{A}\mathbf{b}) = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} = \mathbf{A} \]
Derivative of \( \mathbf{b}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{X}\mathbf{b} \) with respect to \( \mathbf{b} \):
This is a bit more complex because it involves a quadratic form. Let’s denote \( \mathbf{C} = \mathbf{X}^\mathsf{T}\mathbf{X} \), which is a constant matrix since \( \mathbf{X} \) is constant. The expression can be rewritten as \( \mathbf{b}^\mathsf{T}\mathbf{C}\mathbf{b} \).
Now, consider the function \( f(\mathbf{b}) = \mathbf{b}^\mathsf{T}\mathbf{C}\mathbf{b} \), which is a scalar. If we increment \( \mathbf{b} \) by a small vector \( \mathbf{h} \), the change in \( f \) is:
\[ f(\mathbf{b} + \mathbf{h}) - f(\mathbf{b}) = (\mathbf{b} + \mathbf{h})^\mathsf{T}\mathbf{C}(\mathbf{b} + \mathbf{h}) - \mathbf{b}^\mathsf{T}\mathbf{C}\mathbf{b} \]
Expanding the right side gives us:
\[ \mathbf{b}^\mathsf{T}\mathbf{C}\mathbf{b} + \mathbf{h}^\mathsf{T}\mathbf{C}\mathbf{b} + \mathbf{b}^\mathsf{T}\mathbf{C}\mathbf{h} + \mathbf{h}^\mathsf{T}\mathbf{C}\mathbf{h} - \mathbf{b}^\mathsf{T}\mathbf{C}\mathbf{b} \]
\[ = \mathbf{h}^\mathsf{T}\mathbf{C}\mathbf{b} + \mathbf{b}^\mathsf{T}\mathbf{C}\mathbf{h} + \mathbf{h}^\mathsf{T}\mathbf{C}\mathbf{h} \]
As \( \mathbf{h} \) goes to zero, the last term becomes negligible, and since \( \mathbf{C} \) is symmetric, \( \mathbf{h}^\mathsf{T}\mathbf{C}\mathbf{b} \) and \( \mathbf{b}^\mathsf{T}\mathbf{C}\mathbf{h} \) are actually the same.
logistic regression #
use linear regression for classification? #
When we have a binary Y (either 0 or 1), Y is an indicator variable:
\[ \mathbb{E}[Y=1|X=x] = P(Y=1|X=x) \]
So using linear regression as a classifier is justified, but practically linear regression can have negative intercept and so on. I.e. the linear model doesn’t correctly model the underlying probability. Linear model assumes a uniform density on x having probability p. (Logistic regression assume a different distribution on distribution of p)
Confusing?, why the correct property yields wrong probability?
\( f(x) = \mathbb{E}[Y|X=x] \) is the good property we tried to use. Linear regression has the property. If our data follows the linear model, then we have good reason to use it to predict probability. But our indicators might not have the linear relation to the probability we are trying to predict, so we need alternative model.
Setup #
\[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1X_1 + \ldots + \beta_kX_k \]
Let \( p = a(x) \), i.e. for a given x, our model predicts a probability p, then the model assumes the following:
\[ E\left[\log\left(\frac{p}{1-p}\right)\right] = \int \log\left(\frac{a(X)}{1-a(X)}\right) f_X(X) \, dX = \log(\frac{\pi}{1-\pi}) \]
Where \( \pi \) represents the population probability.
So here \( E\left[\log\left(\frac{p}{1-p}\right)\right] \) is playing the role of \( \mathbb{E}[Y|X=x] \) where our \( E(f(x)) \) is trying to achieve.
The function (inside E) is called the link function. It captures our assumption (or model) on how our predictor relates to the mean of the response variable.
Assumption #
You are assuming the above expression holds. Which means, you are assuming the distribution of X \( f_X \) and your model on predictor and the probability (log odds) are somehow the correct representation of the underlying problem.
The assumption includes assumption of monotonicity of probability distribtution (which is modeled by logistic function)
so why use this logistic function? #
It has many nice characteristics, and can model quite many problems.
- By taking the log of the odds, we’re applying a transformation that compresses the scale of the odds when they are high and stretches them when they are low. This has a variance-stabilizing effect because it makes the scale of changes in the log-odds more uniform over the range of p. Variance in the log-odds does not depend as strongly on the value of p as it does in the original odds scale.
- The log-odds transformation is symmetric around \( p = 0.5 \).
- Perhaps most importantly, by transforming to log-odds, we can often establish a linear relationship between the predictors and the response variable. This is the foundation of logistic regression.
Loss function #
MLE #
The idea of MLE is to find the parameter values that maximize the likelihood function, which represents the probability of observing the given sample of data
Convex #
The log function is “convex”, so if we find an optimum for the log-likelihood, it is equivalently an optimum for the total likelihood. (What does it mean?)
Case control sampling #
For example, click through rate might be less than 0.0001, extremely low.
Do you have to sample the data accordingly to match the population proportion?
No, you can get by using 1:5+ case/control ratio. This is used in epidemiology extensively.
F statistics #
How good our predictor (model) is?
\[ F = \frac{(TSS - RSS) / p}{RSS / (n - p - 1)} \]
- The numerator, \( (TSS - RSS) / p \), is the mean square due to regression (MSR), which tells you how much of the variance in the dependent variable is explained by the independent variables in the model, on average per predictor.
- The denominator, \( RSS / (n - p - 1) \), is the mean square error (MSE), which is an estimate of the variance of the residuals or errors.
If the F-statistic is significantly larger than 1, it suggests that the regression model explains a significant amount of variance in the dependent variable, justifying the inclusion of the predictors in the model.
Total Sum of Squares (TSS): \( TSS \) measures the total variation in the response variable \( Y \). It has \( n-1 \) degrees of freedom because it’s essentially the variation of \( Y \) around its mean \( \bar{Y} \), and the calculation of the mean consumes one degree of freedom.
Residual Sum of Squares (RSS): \( RSS \) measures the variation that remains unexplained by the model. The model includes \( p \) predictors plus an intercept term, consuming \( p+1 \) degrees of freedom. That leaves \( n-(p+1) \) degrees of freedom for the residuals, also known as the error dof.
Degrees of Freedom for ESS: The degrees of freedom for \( ESS \) are the difference between the dof for \( TSS \) and the dof for \( RSS \). Since \( TSS \) has \( n-1 \) dof and \( RSS \) has \( n-(p+1) \) dof, the dof for \( ESS \) is:
\[ (n-1) - [n-(p+1)] = p \]
why? #
Since \( \sum_{i=1}^{n} (Y_i - \hat{Y}_i) = 0 \)
\[ \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i) = 0 \]
This cross-product term being zero is what’s referred to as the “orthogonality” of the explained and unexplained parts of the model. This means that the ESS and RSS are independent of each other, and their degrees of freedom can be added together to give the degrees of freedom for the TSS.
.. I can buy that they are orthogonal, but I still can’t follow why it lets us to add dof.
bayesian classifier is an optimal classifier #
what does it try to verify? #
assumption #
Assumption of Perfect Knowledge:
- Assume we have perfect knowledge of the probability distributions: both the priors \(\Pr(Y=k)\) and the likelihoods \(f_{X|Y=k}(x)\) for all classes \(k\).
Bayesian Decision Rule:
- The Bayesian decision rule \(d_{Bayes}(x)\) selects the class \(k\) for each observation \(x\) that maximizes the posterior probability \(\Pr(Y=k|X=x)\).
- This is achieved by selecting \(k\) for which the product of the likelihood and the prior, \(f_{X|Y=k}(x) \cdot \Pr(Y=k)\), is the greatest.
Bayesian classifier is optimal to minimize risk where risk is defined: #
\[ R(d) = E[L(Y, d(X))] = \sum_{k} \int L(k, d(x)) \cdot f_{X|Y=k}(x) \cdot \Pr(Y=k) \, dx \]
Do we need to verify this? #
This is the most interesting question to me. I kinda seem to know that it is obviously true. Then I also sense the feeling is somewhat unjustified and not clearly stated.
I’m selecting, the most probable class based on data, and my error rate is lowest because of that. That’s obvious, and it is also not obvious.
Because it’s not obvious, we are proving it.
proof #
Risk is the expected loss, and we are trying to minimize it. And we show Bayes decision criterior is the one which minimizes it under perfect knowledge.
\[ \begin{aligned} L(Y, d(X)) &= \begin{cases} 1 & \text{if } Y \neq d(X) \\ 0 & \text{if } Y = d(X) \end{cases} \\ L(Y, d(X)) &= I_{\{d(X) \neq Y\}} \\ R(d) &= E[L(Y, d(X))] = \sum_k \int I_{\{d(x) \neq k\}} \cdot f_{X|Y=k}(x) \cdot \Pr(Y=k) \, dx \end{aligned} \]
and if we expand the sum
\[ \mathbb{I}{\{d(x) = 1\}} \cdot Pr(x, Y=1) + \mathbb{I}{\{d(x) = 2\}} \cdot Pr(x, Y=2) … \, dx \]
But bayes chooses \( d(x) = k \) for Pr(x, Y=k) is highest among all k.
So the sum of the remaining terms \( \mathbb{I}_{\{d(x) \neq k\}} Pr(x, Y=k) \) are lowest for the bayes decision. (because it chooses the highest term it could possibly pick)
alternative proof #
https://www.ee.columbia.edu/~vittorio/BayesProof.pdf
interesting to note that
given \( X = x \)
\[ Pr(g(x) = k, Y=k) = Pr( g(x)=k ) Pr(Y=k) \]
I.e. event of true value is k and event of predicting k is independent given X.
Your probability of getting class k right when true class is k , is same as the probability of getting class k right regardless of true class, no mather what your probability distribution for predicting the class is.
(because the decision is singular and all the other classes gets ignored)
But we might not want optimal classifier #
When the cost for different type of error is different, the bayes optimality is not what we want.
ROC curve shows two types of error according to the different decision boundary (threshold)

Covariance #
\[ \begin{aligned} \Sigma & =\left[\begin{array}{ccc} \operatorname{Cov}\left[X_1, X_1\right] & \cdots & \operatorname{Cov}\left[X_1, X_n\right] \\ \vdots & \ddots & \vdots \\ \operatorname{Cov}\left[X_n, X_1\right] & \cdots & \operatorname{Cov}\left[X_n, X_n\right] \end{array}\right] \\ & =\left[\begin{array}{ccc} E\left[\left(X_1-\mu_1\right)^2\right] & \cdots & E\left[\left(X_1-\mu_1\right)\left(X_n-\mu_n\right)\right] \\ \vdots & \ddots & \vdots \\ E\left[\left(X_n-\mu_n\right)\left(X_1-\mu_1\right)\right] & \cdots & E\left[\left(X_n-\mu_n\right)^2\right] \end{array}\right] \\ & =E\left[\begin{array}{ccc} \left(X_1-\mu_1\right)^2 & \cdots & \left(X_1-\mu_1\right)\left(X_n-\mu_n\right) \\ \vdots & \ddots & \vdots \\ \left(X_n-\mu_n\right)\left(X_1-\mu_1\right) & \cdots & \left(X_n-\mu_n\right)^2 \end{array}\right] \\ & =E\left[\begin{array}{ccc} \left[\begin{array}{c} X_1 - \mu_1 \\ \vdots \\ X_n - \mu_n \\ \end{array}\right] \left[\begin{array}{ccc} X_1 - \mu_1 & \cdots & X_n - \mu_n \end{array}\right] \end{array}\right] \\ & =E\left[(X-\mu)(X-\mu)^T\right] \\ \end{aligned} \]
Positive semidefinite of covariance matrix #
let \(w = \left[\begin{array}{ccc} X_1 - \mu_1 & \cdots & X_n - \mu_n \end{array}\right] \)
\( vww^Tv^T = (vw)(vw)^T >= 0\)
linear discriminant analysis #
Linear Discriminant Analysis (LDA) assumes that the conditional probability density functions \( f_k(x) \) for each class \( k \) are Gaussian, and under the assumptions that the covariances of these Gaussians are equal and the classes have prior probabilities \( \pi_k \), we can derive the expression for the posterior probabilities \( p_k(x) \) and show that there’s no quadratic term in \( x \).
In short, when we can assume our data follows normal distribution (because it’s pdf is nice to work with to compute the likelihood), we can compute the likelihood and apply bayes decision rule.
setup #
Our predictor follows gaussian with equal variances.
\[f_k(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu_k)^2}{2\sigma^2}\right)\]
We use bayes theorem to select the most probable class.
\[p_k(x) = \frac{f_k(x)\pi_k}{\sum_j f_j(x)\pi_j}\]
If we take log:
\[\ln(p_k(x)) = \ln(\pi_k) + \frac{x\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} - \text{[terms not involving k]}\]
\[\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)\]
As can be seen by the discriminant function \(\delta_k(x) \), the decision boundary is linear.
(Effectively we are separating space by decision boundary which is determined by a distance to the mean of each class.)
single variable #
As we saw, bayes classifier is the theoretically superior if one has the perfect knoweledge. So if we know the distribution of our data, we can use bayesian classifier.
multi variable #
In the multi-feature scenario, each observation is a vector \( \mathbf{x} \) in a multi-dimensional space. The Gaussian density function for a class \( k \) with mean vector \( \boldsymbol{\mu}_k \) and a common covariance matrix \( \Sigma \) for all classes is given by:
\[ f_k(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right) \]
When there are multiple predictors, there could be correlation among the predictors and we need a joint probability distribution to model the data. For gaussian models, multivariate gaussian distribution provides a nice theoretical background on the pdf. (We can at least understand what our pdf means, i.e. what we are assuming, what are our predictor’s distribution is)
When there are K classes, linear discriminant analysis can be viewed exactly in a K-1 dimensional plot. (Why?)
https://www.youtube.com/watch?v=eho8xH3E6mE
why multivariate gaussian is defined this way? #
https://cs229.stanford.edu/section/gaussians.pdf
Theorem 1. Let \( X ∼ N (µ, Σ) \) for some \(µ ∈ Rn \) and \(Σ ∈ S^n_++\). Then, there exists a matrix \( B ∈ R^{n×n} \) such that if we define \(Z = B^{-1} (X − µ)\), then \(Z ∼ N (0, I)\).
Any random variable X with a multivariate Gaussian distribution can be interpreted as the result of applying a linear transformation \((X = BZ + µ)\) to some collection of n independent standard normal random variables (Z).
the crux of the argument is
One can interpret covariance matrix by:
\[ \Sigma=U \Lambda U^T=U \Lambda^{1 / 2}\left(\Lambda^{1 / 2}\right)^T U^T=U \Lambda^{1 / 2}\left(U \Lambda^{1 / 2}\right)^T=B B^T \]
where $Z=B^{-1}(X-\mu)$.
Then the gaussian distribution we obtain after the change of variable can be interpreted as collection of n independent normal variables.
Theorem1: If A is (conjugate) symmetric positive definite, all its eigenvalues are positive
\(A = A^*\) which means \( \langle Tv, w\rangle = \langle v, Tw\rangle \) i.e. their inner product is the same whether you apply it to any of the two vector.
Then eigenvalues are real because inner product has the property of conjugate symmetric. \( \langle v, w\rangle = \overline{\langle w, v\rangle} \)
A positive operator T \( \langle Tv, v\rangle >= 0 \) has a self-adjoint square root. i.e. \( T = R^* R \)
Hence \( vAv^T \) can be converted to \( (vR)(vR)^T >= 0 \)
Then \( v\lambda v^T >= 0 \), therefore \( \lambda\) (eigenvalues) are positive.
fischer’s iris data #
read section 4.5 logistic, LDA and KNN #
Generalized linear model #
In GLMs, the choice of distribution is for the response variable, not the predictors. This distribution is selected based on the nature of the response data (e.g., binomial for binary data, Poisson for count data). It’s an important distinction because it determines how the variance of the response is related to its mean.
Also GLM can utilize a link function to relate the predictor (a linear combination) to the mean of the response variable.
I guess discriminant analysis can be regarded as a special type where there’s no link function, and we use a distribution which we can compute likelihood such as normal distribution.
In linear regression, the randomness was explicitly added via the error term, but in GLM (or any framework that views as response variable from a distribution), the randomness comes from the distribution.