MCMC

MCMC

February 22, 2024

Reference #

https://angeloyeo.github.io/2020/09/16/rejection_sampling.html

https://www.youtube.com/watch?v=yCv2N7wGDCw

gpt conversation: https://chat.openai.com/share/c65f4fe6-3274-4267-a67e-41ba780ef198

Likelihood #

Suppose you have data, and you are speculating what parameters have generated the data.

For instance, when a someone is test positive (data) in a doping test, you are wondering if he actually have doped (parameter: probability of doping, or whether he doped or not).

The likelihood is \( P(\text{test positive} | \text{dope}\) or \( P(\text{test positive} | \text{not dope})\)

Here the difference between probability and likelihood is seen as,

  • probability is looking at \( P(\text{test positive} | \text{dope}\) and \( P(\text{test negative} | \text{dope}\)
  • likelihood is looking at \( P(\text{test positive} | \text{dope}\) or \( P(\text{test positive} | \text{not dope})\)

The probability is summed to 1, but likelihood may not. we are meerly comparing the relative probability of data happening given various parameters.

So probability is, assuming we know the true state (parameter, e.g. person doped), how likely different outcomes arrise. Whereas likelihood compares the plausiblility of different states of world given observed data. It is a function of parameter, and it doesn’t sum to 1 over the parameter space.

Discrete vs Continuous #

The unit (?) of likelihood is probability in discrete case. Whereas in continuous case, the unit is pdf.

Question Is using pdf directly compatible with using probability? #

This seems like a calculus question.

Intuitively, it seems like pdf is a measure of density of x-axis: it tells how thick x-axis for that point, the thickness (density) tells how likely we are to pick a point on the axis.

So if the measurer is assumed to be same, we can assume the measurement range would be same regardless of pdf, i.e. when we sample a datapoint, we can instead think of a small range of a data point is sampled, and this doesn’t distort the test. I.e. the measurement is done independently of pdfs, we are the measurer and we like to think the range we pick doesn’t favor one pdf over another.

So although a single pdf value is not a probability, the relative measure of two pdf value can be indeed interpreted as the measure of probability (around the small region)

Multiple data #

When there are multiple data points, we can get the joint probability or joint pdf to construct likelihood.

\( x_k = \text{data points} \), \(\theta = \text{parameter} \)

\[ P(x|\theta) = \prod_{k=1}^{n}P(x_k|\theta) \]

With probability computed, we can define the likelihood

\( L(\theta|x) = P(x|\theta) \), as noted, likelihood is a function of \( \theta \) and is not a probability although we write it using the conditional probability notation.

Joint pdf of independent random variable #

A joint pdf can be constructed by multiplication of two pdfs. As a check, it sums to 1.

$$ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dx \, dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X}(x) f_{Y}(y) \, dx \, dy = \int_{-\infty}^{\infty} f_{Y}(y) \int_{-\infty}^{\infty} f_{X}(x) \, dx \, dy = 1 $$

Log likelihood #

Since we are only interested in relative strength, we can compare their log value.

MLE #

The idea is simple, given data, we want to find the parameter that has the maximum likelihood.

To get maximum, we use calculus,

\[ \frac{\partial}{\partial \theta}L(\theta|x) = \frac{\partial}{\partial \theta}\log P(x|\theta) = \sum_{i=1}^{n}\frac{\partial}{\partial\theta}\log P(x_i|\theta) = 0 \]

Example of MLE, for population mean and variance #

\[\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i\]

\[ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i-\mu\right)^2 \]

I.e. we assume population mean is around sample mean, and so on.

And this can be derived by the partial derivatives on the likelihood. i.e. the assumption is correct from MLE perspective

\[ P(x|\theta) = \prod_{i=1}^{n}f_{\mu, \sigma^2}(x_i) = \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}} \exp \left( -\frac{(x_i-\mu)^2}{2\sigma^2} \right) \]

Sampling #

We want to sample data from a distribution. Why do we want to do this? because we can use the data to test our model.

For instance, in a baseball game, if we can draw a sample throw from a pitcher, we can construct a better batting strategy.

Inverse transfrom sampling #

Say, we have a distribution, and we know its CDF. If we can invert CDF, then we can sample according to the distribution. (assuming you can sample from a uniform distribution)

Let \( T \) be a transfrom of \( u\) such that \( T(u) = X \)

\[ F(x) = P(X \le x) = P(T(u) <=x) = P(U \le T^{-1}(x)) = T^{-1}(x)\]

So, \( F^{-1}(x) = T(x) \), if we can inverse CDF, we can use it as a transform function.

Rejection sampling #

What if we can’t invert the CDF? How can we sample from a distribution? We might not even have a complete pdf. We only have a function which is proportional to pdf.

I.e. we know \( p(s) = \frac{f(s)}{NC} \), but not know the exact normalizing constant. (we can’t integrate)

How does sampling works #

Come up with \( g(s) \) #

  • which we can sample from
  • which covers the entire domain of \( f(s) \)
  • (optional) which are close to \( f(s) \), it makes the sampling efficient

Algorithm #

  1. sample s from \(g(s)\)
  2. accept with probability \( \frac{f(s)}{Mg(s)} \), where \( M \) is chosen so that the quantity is less than 1.

Why it works? #

We look at density of accepted sample:

\[ D(s|A) = \frac{P(A|s)D(s)}{P(A)} = \frac{\frac{f(s)}{Mg(s)} g(s)}{P(A)}\]

Where \(P(A) = \int_{-\infty}^{\infty} g(s) \frac{f(s)}{Mg(s)} \, ds = \frac{1}{M} \int_{-\infty}^{\infty} f(s) \, ds = \frac{NC}{M} \)

Hence, \( D(s|A) = \frac{\frac{f(s)}{M}}{\frac{NC}{M}} = \frac{f(s)}{NC} = p(s)\)

Issues #

When \( f \) spikes, M can be huge, and it makes the acceptance rate to be low, and the algorithm can be inefficient.

MCMC #

In, Rejection sampling, a sample is independent of another sample.

In MCMC, a sample is dependent on a previous sample. Markov here means, it depends on a single immediate previous sample.

Suppose a chain is stationary, meaning the probability of each state at a specific time is preserved over time.

Also, suppose the states are the possible values of a random variable.

How can you sample from the distribution? Just run the chain, it will output states according to the distribution.

Stationary #

A probability of state k at time t is equal to the probability of state k at time t+1 for all k states.

\( \pi P = \pi \), where \( \pi \) is a probability vector, and \(P \) is a transition matrix.

Detailed balance condition #

If the chance of transit from A to B is the same as the chance of transit from B to A, it is stationary.

\( p(x) T(y|x) = p(y) T(x|y), \text{for } \forall_{x,y} \), where \( T(y|x) \) is a probability of transiting from x to y, \(P_{x,y}\) in a transition matrix.

  • Proof

    \[ \sum_{x}p(x)T(y|x) = p(y)\sum_{x}T(x|y) = p(y) \]

    I.e. \(\pi P = \pi \)

Ergodicity #

We don’t know the stationary state from the start, we need to derive the chain to the stationary state.

Ergodic chain is the chain we know we can. For an ergodic chain, there exists a unique stationary distribution, and the chain will converge to the distribution regardless of its initial state.

  • irreducible: possible from every state to every other state
  • aperiodic: the same state is not reached on a fixed frequency. (GCD of all paths returning to a state is 1)

Metropolis - Hastings #

It’s one method of MCMC samplings.

We have established that if we can design a chain which is stationary and its stationary distribution is the same as the distribution we wanna sample from, then, we can use the chain for sampling.

How to design such a chain? Metropolis is one way of designing a such chain.

Given #

We know \( p(s) = \frac{f(s)}{NC} \), we don’t know the exact value of \( NC\), normalizing constant.

We can come up with a proposal distribution which we can use to sample \( g(X_{t+1} | X_{t} ) \)

E.g. \( g(X_{t+1} | X_{t} = N(X_t, \sigma^2 ) \)

Algorithm #

Accept with probability given by \( A(X_t \text{->} X_{t+1} ) \), but how to construct \(A\)?

Recall, the detailed balance condition

\( p(a) T(a\text{->}b) = p(b) T(b\text{->}a) \)

The probabilty of going from a to b, \( T(a\text{->}b) \) can be broken into two pieces, the first is the probability which proposal distribution is proposing \(b\) given at state \( a\): \( g(b|a) \), and the second is that the probability of we accepting the proposal: \( A(a\text{->}b) \)

\[ \frac{f(a)}{NC} g(b|a) A(a\text{->}b) = \frac{f(b)}{NC} g(a|b) A(b\text{->}a) \]

If we reorder the terms,

\[ \frac{A(a\text{->}b)}{A(b\text{->}a)} = \frac{f(b)}{f(a)} * \frac{g(a|b)}{g(b|a)} = r_f * r_g\]

So, if we come up with \( A \) such that it satisfies the above equation, we can construct a stationary markov chain.

Choosing \( A \) #

  1. \( r_{f}r_{g} < 1\)
    • \( A(a \text{->} b) = r_{f}r_{g} \)
    • \( A(b \text{->} a) = 1 \)
  2. \( r_{f}r_{g} \ge 1 \)
    • \( A(a \text{->} b) = 1 \)
    • \( A(b \text{->} a) = \frac{1}{r_{f}r_{g}} \)

In simpler form, \( A(a \text{->} b) = Max(1, r_{f}r_{g}) \) does satisfy the detailed balance condition.

Symmetric case #

\( r_{g} = 1\), if we center around the \( x_t \), and symmetric distribution has the same pdf value for left and right.

Test implementation #

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)

n_iter = 5000
target = lambda x: 0.3*np.exp(-0.2*x**2) + 0.7*np.exp(-0.2*(x-10)**2)
xx = np.linspace(-10, 20, 1000)

# Metropolis Hastings

# initialize x_0
x = []
x.append((np.random.rand() - 0.5) * 30 + 5)  # Random uniform between -10 and 20

proposal = lambda x, mu, s: 1/(s*np.sqrt(2*np.pi))*np.exp(-(x-mu)**2/(2*s**2))  # Normal distribution

my_std = 10

for i in range(n_iter):
    u = np.random.rand()  # sample u
    x_old = x[-1]

    # Sample from the proposal distribution q(x_new | x_old) = N(x_old, sigma)
    x_new = np.random.randn() * my_std + x_old

    # Calculate A(x_old, x_new)
    A = min(1,
            (target(x_new) * proposal(x_old, x_new, my_std)) /
            (target(x_old) * proposal(x_new, x_old, my_std))
           )

    if u < A:
        x.append(x_new)
    else:
        x.append(x_old)

# Plotting the results
plt.figure()
plt.hist(x, bins=int(30 / 0.5), density=True, alpha=0.6, label='Sampled Distribution')
plt.plot(xx, target(xx)/max(target(xx))*max(np.histogram(x, bins=int(30 / 0.5), density=True)[0]), linewidth=2, label='Target Distribution')
plt.grid(True)
plt.xlabel('x')
plt.legend()
plt.show()

Gibbs sampling #

We want to sample when the sampling space is multidimensional.

Given #

The joint probability \( p(x, y) \) is hard to get. We can sample from the conditional probabilities, \( p(x|y), p(y|x), or p(x_1|x_2, x_3) \). We are not using a proposed distribution, we use the conditional distribution directly.

Algorithm #

  1. pick initial sample \( x_0, y_0\)
  2. sample \( x_1 \) using \( X_1 ~ p(X|y_0) \)
  3. sample \( y_1 \) using \( Y_1 ~ p(Y|x_1) \)
  4. \( x_1, y_1 \) becomes the next sample and we repeat.

Proof #

Again using the detailed balance condition? \( p(a) T(a\text{->}b) = p(b) T(b\text{->}a) \)

We need to show, our chain has has the following property

\[ P(x_0, y_0) \cdot P(x_1, y_1 | x_0, y_0) = P(x_1, y_1) \cdot P(x_0, y_0 | x_1, y_1) \]

Gibbs constructs the chain so that \( P(x_1, y_1 | x_0, y_0) = P(x_1 | y_0) \cdot P(y_1 | x_1) \), i.e. its transition is a two step process

So we need to show the following:

\[ P(x_0, y_0) \cdot P(x_1 | y_0) \cdot P(y_1 | x_1) = P(x_1, y_1) \cdot P(x_0 | y_1) \cdot P(y_0 | x_0) \]

If we use, \( P(x_1 | y_0) = \frac{P(x_1, y_0)}{P(y_0)} \) and so on,

\[ P(x_0, y_0) \cdot \frac{P(x_1, y_0)}{P(y_0)} \cdot \frac{P(x_1, y_1)}{P(x_1)} = P(x_1, y_1) \cdot \frac{P(x_0, y_1)}{P(y_1)} \cdot \frac{P(x_0, y_0)}{P(x_0)} \]

The equation simplifies to :\( \frac{P(x_1, y_0)}{P(y_0)} \cdot \frac{1}{P(x_1)} = \frac{P(x_0, y_1)}{P(y_1)} \cdot \frac{1}{P(x_0)}\)

It doesn’t check out.. what…

What if we look at the intermediate steps?

\[ P(x_0, y_0) P(x_1 | y_0) = P(x_1, y_0) P(x_0 | y_0) \]

Again, using the definition of conditional probability,

\[ P(x_0, y_0) \left( \frac{P(x_1, y_0)}{P(y_0)} \right) = P(x_1, y_0) \left( \frac{P(x_0, y_0)}{P(y_0)} \right) \]

We see that they are equal. So we can say, since the gibbs transition maintains the balance condition at each intermediary step, we maintain the balance condition as a whole.

Question So why did the proof of combined step fail? #

\( P(x_1, y_1 | x_0, y_0) = P(x_1 | y_0) \cdot P(y_1 | x_1) \) is the wrong interpretation?

But it seems correct.. so what went wrong?

may read up on https://gregorygundersen.com/blog/2020/02/23/gibbs-sampling/ later