Information theory

November 29, 2023

Information theory #

These are my notes from reading the book Elements of Information Theory.

Entropy #

Entropy is a measure of the uncertainty of a probability distribution.

H(X) = E\left[\log \frac{1}{p(X)}\right] = -E[\log p(X)]

If you have a fair 2-sided coin, your uncertainty is -\frac{1}{2} \log_2(\frac{1}{2}) \cdot 2 = 1 bit.

If you have a fair 4-sided die, your uncertainty is -\frac{1}{4} \log_2(\frac{1}{4}) \cdot 4 = 2 bits.

Your uncertainty is measured by how many bits (the base of the log can be changed) you need, on average, to describe an outcome of the distribution.
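
A quick numeric check of these examples (a sketch in Python; the `entropy` helper is just the formula above, not a library function):

```python
import math

def entropy(probs, base=2):
    """H = -sum p * log(p); in bits when base=2."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([1/2, 1/2]))   # fair coin        -> 1.0 bit
print(entropy([1/4] * 4))    # fair 4-sided die -> 2.0 bits
print(entropy([0.9, 0.1]))   # biased coin      -> ~0.47 bits, less uncertainty
```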

Conditional expectation #

Ex. Y: weight, X: height

For a given X = x, find the expected value of Y: E(Y|X=x) = \sum_y p(y|X=x)\, y = g(x)

For example, when height is 150, what’s the expected weight? The answer depends on x. So it is a function of x (possibly a constant function).

If we let X be a random variable: E(Y|X) = \sum_y p(y|X)\, y = g(X)

The answer depends on X, and it is itself a random variable which depends on X: g(X)

You purposely leave X as a random variable; if you then account for every possibility of X by taking E(E(Y|X)), you get E(Y).
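
A toy sketch of E(Y|X=x), g(X), and E(E(Y|X)) = E(Y). The (height, weight) joint table below is made up purely for illustration:

```python
# Hypothetical joint pmf over (height, weight) pairs -- numbers are made up.
joint = {
    (150, 50): 0.2, (150, 60): 0.1,
    (170, 60): 0.2, (170, 70): 0.3, (170, 80): 0.2,
}

def p_x(x):
    return sum(p for (xi, _), p in joint.items() if xi == x)

def cond_exp_y(x):
    """g(x) = E(Y | X=x) = sum_y p(y|x) * y"""
    return sum(p / p_x(x) * y for (xi, y), p in joint.items() if xi == x)

print(cond_exp_y(150))   # expected weight given height 150
print(cond_exp_y(170))   # a different number: g is a function of x

# E(E(Y|X)) = E(Y)
e_e = sum(p_x(x) * cond_exp_y(x) for x in {150, 170})
e_y = sum(p * y for (_, y), p in joint.items())
print(e_e, e_y)          # both 65.0
```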

Conditional Entropy #

Conditional entropy measures the remaining description length you need after knowing something. For all cases, you look at the length of P(Y|X).

I.e., if I know X, how uncertain am I about Y?

H(Y|X) = -E[\log P(Y|X)] = -\sum_{x,y} P(x, y) \log P(y|x)

This is different from conditional expectation in that

  • we use the probability p(X, Y), not p(Y|X), as the distribution we average over.
  • both X and Y are summed over their possible values

Compare with conditional expectation #

H(Y|X=x) = - \sum_{y\in Y} P(y|X=x) \log P(y|X=x)

H(Y|X) = - \sum_{x \in X} P(x) \sum_{y\in Y} P(y|X=x) \log P(y|X=x)
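
A sketch computing H(Y|X) both as the P(x)-weighted average of H(Y|X=x) and directly from the definition; the joint table is made up:

```python
import math

# Hypothetical joint pmf P(x, y); rows for one x sum to P(x).
joint = {
    ('rain', 'umbrella'): 0.25, ('rain', 'no_umbrella'): 0.05,
    ('sun',  'umbrella'): 0.10, ('sun',  'no_umbrella'): 0.60,
}

def p_x(x):
    return sum(p for (xi, _), p in joint.items() if xi == x)

def h_y_given_x_eq(x):
    """H(Y | X=x) = -sum_y P(y|x) log P(y|x)"""
    return -sum((p / p_x(x)) * math.log2(p / p_x(x))
                for (xi, _), p in joint.items() if xi == x)

# H(Y|X) = sum_x P(x) H(Y|X=x)  -- same as  -sum_{x,y} P(x,y) log P(y|x)
h_cond = sum(p_x(x) * h_y_given_x_eq(x) for x in {'rain', 'sun'})
h_cond_direct = -sum(p * math.log2(p / p_x(x)) for (x, _), p in joint.items())
print(h_cond, h_cond_direct)   # the two forms agree
```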

Joint Entropy #

Look at the two events as a single event, and measure the description length of the joint probability.

H(X, Y) = -E[\log P(X, Y)]

Chain rule of entropy #

The length I need to describe both is the length I need for one plus the remaining length for the other.

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
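
A quick check of the chain rule H(X, Y) = H(X) + H(Y|X) on a made-up joint table:

```python
import math

joint = {
    ('a', 0): 0.3, ('a', 1): 0.2,
    ('b', 0): 0.1, ('b', 1): 0.4,
}

def h(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_xy = h(joint.values())                 # H(X, Y)
px = {'a': 0.5, 'b': 0.5}                # marginal of X
h_x = h(px.values())                     # H(X)
h_y_given_x = -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items())
print(h_xy, h_x + h_y_given_x)           # equal: chain rule
```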

Relation to probability #

H(X) has properties similar to those of P(X).

Multiplication becomes addition under log #

\frac{P(X, Y)}{P(X)} = P(Y|X)

If we take the log: \log P(X, Y) = \log P(X) + \log P(Y|X)

If we take -\mathbf{E} of both sides: H(X, Y) = H(X) + H(Y|X)

We can condition everything on another variable (the same identity holds) #

Since \frac{P(X, Y|Z)}{P(X|Z)} = P(Y|X, Z),

it becomes:

H(X, Y|Z) = H(X|Z) + H(Y|X, Z)
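
The same kind of check for the conditioned version, using a made-up three-variable table:

```python
import math

# Hypothetical joint pmf P(x, y, z) -- numbers are made up but sum to 1.
joint = {
    (0, 0, 0): 0.10, (0, 1, 0): 0.15, (1, 0, 0): 0.20, (1, 1, 0): 0.05,
    (0, 0, 1): 0.05, (0, 1, 1): 0.10, (1, 0, 1): 0.05, (1, 1, 1): 0.30,
}

def marg(keep):
    """Marginal pmf over the coordinates listed in `keep`."""
    out = {}
    for xyz, p in joint.items():
        key = tuple(xyz[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

p_z  = marg([2])
p_xz = marg([0, 2])

# H(X,Y|Z), H(X|Z) and H(Y|X,Z), all averaged with the full joint P(x,y,z)
h_xy_z = -sum(p * math.log2(p / p_z[(z,)]) for (x, y, z), p in joint.items())
h_x_z  = -sum(p * math.log2(p_xz[(x, z)] / p_z[(z,)]) for (x, y, z), p in joint.items())
h_y_xz = -sum(p * math.log2(p / p_xz[(x, z)]) for (x, y, z), p in joint.items())
print(h_xy_z, h_x_z + h_y_xz)   # equal: the conditioned chain rule
```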

Relative Entropy #

Suppose there’s a distribution p(x). H(X) is the minimum average length we can achieve to send X.

If we instead use lengths (a coding scheme) based on q(x), it will be less efficient: the average length will be longer, and relative entropy measures that extra length.

D(p||q) = \text{inefficient length} - \text{optimal length} = -\sum p(x) (\log q(x) - \log p(x)) = \sum p(x) \log \frac{p(x)}{q(x)}
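
A sketch of the “extra length” reading: coding with lengths based on q costs the cross-entropy, and the gap to the optimal H(p) is exactly D(p||q). The distributions p and q below are made up:

```python
import math

p = [0.5, 0.25, 0.25]   # true distribution
q = [1/3, 1/3, 1/3]     # distribution the code was (wrongly) designed for

optimal = -sum(pi * math.log2(pi) for pi in p)               # H(p)
actual  = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))   # cross-entropy
kl      = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
print(actual - optimal, kl)   # the extra average length equals D(p||q)
```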

Chain rule for relative entropy #
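
From the book, the statement is D(p(x,y)||q(x,y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x)), where the second term is the conditional relative entropy, averaged over p(x).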

Mutual Information #

It is the information (length) that the other random variable provides me.

If I know X, how much do I know about Y because of that knowledge of X?

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = \sum p(x,y) \log \frac{p(x,y)}{p(x)p(y)}
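
A numeric check that the entropy form and the sum form above give the same value (joint table and marginals are made up):

```python
import math

joint = {
    ('a', 0): 0.3, ('a', 1): 0.2,
    ('b', 0): 0.1, ('b', 1): 0.4,
}
px = {'a': 0.5, 'b': 0.5}
py = {0: 0.4, 1: 0.6}

def h(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_x = h(px.values())
h_x_given_y = -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items())
mi_via_entropy = h_x - h_x_given_y
mi_via_sum = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(mi_via_entropy, mi_via_sum)   # the same value
```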

QUESTION Another interpretation? #

The sum expression seems to suggest another interpretation is possible.

By deconstructing \frac{p(x,y)}{p(x)p(y)}

Chain rule for information #
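
From the book, the statement is I(X_1, \dots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_1, \dots, X_{i-1}).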

Self Information #

I(X; X) = H(X). What?

  • I(X; X): how much info you can get by knowing about yourself
  • H(X): how much you are uncertain about yourself

The amount you gain by knowing yourself is the same as the amount you need to know about yourself.
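
A quick check of I(X; X) = H(X): plugging Y = X into the mutual-information sum, only the diagonal terms p(x, x) = p(x) survive. The pmf is made up:

```python
import math

p = {'a': 0.5, 'b': 0.25, 'c': 0.25}   # a made-up pmf for X

h_x = -sum(px * math.log2(px) for px in p.values())

# I(X;X) = sum_x p(x) log( p(x) / (p(x) * p(x)) ); off-diagonal p(x, y) are zero.
i_xx = sum(px * math.log2(px / (px * px)) for px in p.values())
print(h_x, i_xx)   # equal
```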

Jensen’s inequality #

Convex function #

The line segment over two points of the graph lies above the function.

f(cx_1 + (1-c)x_2) \le c f(x_1) + (1-c) f(x_2), \quad 0 \le c \le 1
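
A small check of this definition with a known convex function, f(x) = x², at some arbitrary test points:

```python
def f(x):          # a known convex function
    return x * x

x1, x2 = -1.0, 3.0
for c in [0.0, 0.25, 0.5, 0.75, 1.0]:
    lhs = f(c * x1 + (1 - c) * x2)        # function at the mixed point
    rhs = c * f(x1) + (1 - c) * f(x2)     # chord: the line over the two points
    print(c, lhs <= rhs)                  # True for every c in [0, 1]
```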

Convexity #

If the 2nd derivative is non-negative over an interval, the function is convex over that interval.

The proof uses the Taylor series expansion.

QUESTION Taylor series? #

The Taylor series is an approximation of the function. Why are we using an approximation to prove that the function has a property?

Are we assuming the approximation is good enough over the region?

Jensen’s inequality #

If f is a convex function,

Ef(X) \ge f(EX)
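
A numeric check of Ef(X) \ge f(EX), using the convex function f(x) = -\log(x) and a made-up distribution for X:

```python
import math

xs = [1.0, 2.0, 8.0]           # made-up support of X
ps = [0.5, 0.3, 0.2]           # made-up probabilities

def f(x):                      # convex on (0, inf)
    return -math.log(x)

e_x   = sum(p * x for p, x in zip(ps, xs))
e_f_x = sum(p * f(x) for p, x in zip(ps, xs))
print(e_f_x, f(e_x))           # E f(X) >= f(E X)
```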

QUESTION What does it mean? #

Could it mean something like the following? I haven’t proved it.

If we find the mean of X and draw the tangent line at f(EX):

  • E(f(X)): the area below the curved line
  • f(E(X)): the area below the tangent line