Information theory

November 29, 2023

Information theory #

These are my notes from reading the book Elements of Information Theory.

Entropy #

Entropy is a measure of the uncertainty of a probability distribution.

H(X) = E\left[\log \frac{1}{p(X)}\right] = -E[\log p(X)]

If you have a fair 2-sided coin, your uncertainty is -\frac{1}{2} \log_2(\frac{1}{2}) \cdot 2 = 1 bit.

If you have a fair 4-sided die, your uncertainty is -\frac{1}{4} \log_2(\frac{1}{4}) \cdot 4 = 2 bits.

Your uncertainty is measured by how many bits (the base of the log can be changed) you need, on average, to describe an outcome of the distribution.
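
A quick numeric check of these examples (a sketch in Python; the `entropy` helper is just the formula above, not a library function):

```python
import math

def entropy(probs, base=2):
    """H = -sum p * log(p); in bits when base=2."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([1/2, 1/2]))   # fair coin        -> 1.0 bit
print(entropy([1/4] * 4))    # fair 4-sided die -> 2.0 bits
print(entropy([0.9, 0.1]))   # biased coin      -> ~0.47 bits, less uncertainty
```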

Conditional expectation #

Ex. Y: weight, X: height

For a given X = x, find the expected value of Y: E(Y|X=x) = \sum_y p(y|X=x)\, y = g(x)

For example, when height is 150, what’s the expected weight? The answer depends on x. So it is a function of x (possibly a constant function).

If we let X be a random variable: E(Y|X) = \sum_y p(y|X)\, y = g(X)

The answer depends on X, and it is itself a random variable which depends on X: g(X)

You purposely leave X as a random variable; if you then account for every possibility of X by taking E(E(Y|X)), you get E(Y).
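
A toy sketch of E(Y|X=x), g(X), and E(E(Y|X)) = E(Y). The (height, weight) joint table below is made up purely for illustration:

```python
# Hypothetical joint pmf over (height, weight) pairs -- numbers are made up.
joint = {
    (150, 50): 0.2, (150, 60): 0.1,
    (170, 60): 0.2, (170, 70): 0.3, (170, 80): 0.2,
}

def p_x(x):
    return sum(p for (xi, _), p in joint.items() if xi == x)

def cond_exp_y(x):
    """g(x) = E(Y | X=x) = sum_y p(y|x) * y"""
    return sum(p / p_x(x) * y for (xi, y), p in joint.items() if xi == x)

print(cond_exp_y(150))   # expected weight given height 150
print(cond_exp_y(170))   # a different number: g is a function of x

# E(E(Y|X)) = E(Y)
e_e = sum(p_x(x) * cond_exp_y(x) for x in {150, 170})
e_y = sum(p * y for (_, y), p in joint.items())
print(e_e, e_y)          # both 65.0
```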

Conditional Entropy #

Conditional entropy measures the remaining description length you need after knowing something. For all cases, you look at the length of P(Y|X).

I.e., if I know X, how uncertain am I about Y?

H(Y|X) = -E[\log P(Y|X)] = -\sum_{x,y} P(x, y) \log P(y|x)

This is different from conditional expectation in that

  • we use the probability p(X, Y), not p(Y|X), as the distribution we average over.
  • both X and Y are summed over their possible values

Compare with conditional expectation #

H(Y|X=x) = - \sum_{y\in Y} P(y|X=x) \log P(y|X=x)

H(Y|X) = - \sum_{x \in X} P(x) \sum_{y\in Y} P(y|X=x) \log P(y|X=x)
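
A sketch computing H(Y|X) both as the P(x)-weighted average of H(Y|X=x) and directly from the definition; the joint table is made up:

```python
import math

# Hypothetical joint pmf P(x, y); rows for one x sum to P(x).
joint = {
    ('rain', 'umbrella'): 0.25, ('rain', 'no_umbrella'): 0.05,
    ('sun',  'umbrella'): 0.10, ('sun',  'no_umbrella'): 0.60,
}

def p_x(x):
    return sum(p for (xi, _), p in joint.items() if xi == x)

def h_y_given_x_eq(x):
    """H(Y | X=x) = -sum_y P(y|x) log P(y|x)"""
    return -sum((p / p_x(x)) * math.log2(p / p_x(x))
                for (xi, _), p in joint.items() if xi == x)

# H(Y|X) = sum_x P(x) H(Y|X=x)  -- same as  -sum_{x,y} P(x,y) log P(y|x)
h_cond = sum(p_x(x) * h_y_given_x_eq(x) for x in {'rain', 'sun'})
h_cond_direct = -sum(p * math.log2(p / p_x(x)) for (x, _), p in joint.items())
print(h_cond, h_cond_direct)   # the two forms agree
```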

Joint Entropy #

Look at the two events as a single event, and measure the description length of the joint probability.

H(X, Y) = -E[\log P(X, Y)]

Chain rule of entropy #

The length I need to describe both is the length I need for one plus the remaining length for the other.

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
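
A quick check of the chain rule H(X, Y) = H(X) + H(Y|X) on a made-up joint table:

```python
import math

joint = {
    ('a', 0): 0.3, ('a', 1): 0.2,
    ('b', 0): 0.1, ('b', 1): 0.4,
}

def h(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_xy = h(joint.values())                 # H(X, Y)
px = {'a': 0.5, 'b': 0.5}                # marginal of X
h_x = h(px.values())                     # H(X)
h_y_given_x = -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items())
print(h_xy, h_x + h_y_given_x)           # equal: chain rule
```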

Relation to probability #

H(X) has properties similar to those of P(X).

Multiplication becomes addition under log #

\frac{P(X, Y)}{P(X)} = P(Y|X)

If we take the log: \log P(X, Y) = \log P(X) + \log P(Y|X)

If we take -\mathbf{E} of both sides: H(X, Y) = H(X) + H(Y|X)

We can condition everything on another variable (the same identity holds) #

Since \frac{P(X, Y|Z)}{P(X|Z)} = P(Y|X, Z),

it becomes:

H(X, Y|Z) = H(X|Z) + H(Y|X, Z)
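
The same kind of check for the conditioned version, using a made-up three-variable table:

```python
import math

# Hypothetical joint pmf P(x, y, z) -- numbers are made up but sum to 1.
joint = {
    (0, 0, 0): 0.10, (0, 1, 0): 0.15, (1, 0, 0): 0.20, (1, 1, 0): 0.05,
    (0, 0, 1): 0.05, (0, 1, 1): 0.10, (1, 0, 1): 0.05, (1, 1, 1): 0.30,
}

def marg(keep):
    """Marginal pmf over the coordinates listed in `keep`."""
    out = {}
    for xyz, p in joint.items():
        key = tuple(xyz[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

p_z  = marg([2])
p_xz = marg([0, 2])

# H(X,Y|Z), H(X|Z) and H(Y|X,Z), all averaged with the full joint P(x,y,z)
h_xy_z = -sum(p * math.log2(p / p_z[(z,)]) for (x, y, z), p in joint.items())
h_x_z  = -sum(p * math.log2(p_xz[(x, z)] / p_z[(z,)]) for (x, y, z), p in joint.items())
h_y_xz = -sum(p * math.log2(p / p_xz[(x, z)]) for (x, y, z), p in joint.items())
print(h_xy_z, h_x_z + h_y_xz)   # equal: the conditioned chain rule
```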

Relative Entropy #

Suppose there’s a distribution p(x). H(X) is the minimum average length we can achieve to send X.

If we instead use lengths (a coding scheme) based on q(x), it will be less efficient: the average length will be longer, and relative entropy measures that extra length.

D(p||q) = \text{inefficient length} - \text{optimal length} = -\sum p(x) (\log q(x) - \log p(x)) = \sum p(x) \log \frac{p(x)}{q(x)}
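
A sketch of the “extra length” reading: coding with lengths based on q costs the cross-entropy, and the gap to the optimal H(p) is exactly D(p||q). The distributions p and q below are made up:

```python
import math

p = [0.5, 0.25, 0.25]   # true distribution
q = [1/3, 1/3, 1/3]     # distribution the code was (wrongly) designed for

optimal = -sum(pi * math.log2(pi) for pi in p)               # H(p)
actual  = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))   # cross-entropy
kl      = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
print(actual - optimal, kl)   # the extra average length equals D(p||q)
```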

Chain rule for relative entropy #
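
From the book, the statement is D(p(x,y)||q(x,y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x)), where the second term is the conditional relative entropy, averaged over p(x).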

Mutual Information #

It is the information (length) that the other random variable provides me.

If I know X, how much do I know about Y because of that knowledge of X?

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = \sum p(x,y) \log \frac{p(x,y)}{p(x)p(y)}
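
A numeric check that the entropy form and the sum form above give the same value (joint table and marginals are made up):

```python
import math

joint = {
    ('a', 0): 0.3, ('a', 1): 0.2,
    ('b', 0): 0.1, ('b', 1): 0.4,
}
px = {'a': 0.5, 'b': 0.5}
py = {0: 0.4, 1: 0.6}

def h(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_x = h(px.values())
h_x_given_y = -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items())
mi_via_entropy = h_x - h_x_given_y
mi_via_sum = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(mi_via_entropy, mi_via_sum)   # the same value
```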

QUESTION Another interpretation? #

The sum expression seems to suggest another interpretation is possible.

By deconstructing \frac{p(x,y)}{p(x)p(y)}

Chain rule for information #
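
From the book, the statement is I(X_1, \dots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_1, \dots, X_{i-1}).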

Self Information #

I(X; X) = H(X). What?

  • I(X; X): how much info you can get by knowing about yourself
  • H(X): how much you are uncertain about yourself

The amount you gain by knowing yourself is the same as the amount you need to know about yourself.
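
A quick check of I(X; X) = H(X): plugging Y = X into the mutual-information sum, only the diagonal terms p(x, x) = p(x) survive. The pmf is made up:

```python
import math

p = {'a': 0.5, 'b': 0.25, 'c': 0.25}   # a made-up pmf for X

h_x = -sum(px * math.log2(px) for px in p.values())

# I(X;X) = sum_x p(x) log( p(x) / (p(x) * p(x)) ); off-diagonal p(x, y) are zero.
i_xx = sum(px * math.log2(px / (px * px)) for px in p.values())
print(h_x, i_xx)   # equal
```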

Jensen’s inequality #

Convex function #

The line segment over two points of the graph lies above the function.

f(cx_1 + (1-c)x_2) \le c f(x_1) + (1-c) f(x_2), \quad 0 \le c \le 1
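
A small check of this definition with a known convex function, f(x) = x², at some arbitrary test points:

```python
def f(x):          # a known convex function
    return x * x

x1, x2 = -1.0, 3.0
for c in [0.0, 0.25, 0.5, 0.75, 1.0]:
    lhs = f(c * x1 + (1 - c) * x2)        # function at the mixed point
    rhs = c * f(x1) + (1 - c) * f(x2)     # chord: the line over the two points
    print(c, lhs <= rhs)                  # True for every c in [0, 1]
```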

Convexity #

If the 2nd derivative is non-negative over an interval, the function is convex over that interval.

The proof uses the Taylor series expansion.

QUESTION Taylor series? #

The Taylor series is an approximation of the function. Why are we using an approximation to prove that the function has a property?

Are we assuming the approximation is good enough over the region?

Jensen’s inequality #

If f is a convex function,

Ef(X) \ge f(EX)
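
A numeric check of Ef(X) \ge f(EX), using the convex function f(x) = -\log(x) and a made-up distribution for X:

```python
import math

xs = [1.0, 2.0, 8.0]           # made-up support of X
ps = [0.5, 0.3, 0.2]           # made-up probabilities

def f(x):                      # convex on (0, inf)
    return -math.log(x)

e_x   = sum(p * x for p, x in zip(ps, xs))
e_f_x = sum(p * f(x) for p, x in zip(ps, xs))
print(e_f_x, f(e_x))           # E f(X) >= f(E X)
```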

QUESTION What does it mean? #

Could it mean something like the following? I haven’t proved it.

If we find the mean of X and draw the tangent line at f(EX):

  • E(f(X)): the area below the curved line
  • f(E(X)): the area below the tangent line