Information and Statistical Independence

Andrew’s reply to my previous post on information suggests an interesting question.

We play a game where every day I flip a coin and get either heads or tails. Then I email you the result, sending you one bit of information about the coin toss.

But in reading that email, you gain the information that I survived the night. You get the information that our computers are still working, and that the laws of physics are mostly the same. Those little gnomes inside your vacuum tubes are still pushing hundreds of electrons and gravitinos all over the place inside there every single hour, just so you can watch YouTube. Those selfless little heroes. Bless their 2 and a half hearts apiece.

Try to catalog everything you learn by reading that email. No chance. You learn infinitely many things. You learn that I didn’t choke to death on my own vomit at 2am. You learn that I didn’t choke to death on my own vomit at 1am. At 12:30, 12:15, 12:07:30, …

Can the definition of information accommodate all these revelations? Is it robust if it tries to do so?

I might assign a probability to every event that could lead up to you not reading a morning ‘H’ or ‘T’ email from me (and consciously registering the result, if you want to get super-deep), from things as trivial as my making a typo to far more extravagant catastrophes.

The probability of most of those events is very small, so each one contributes only a tiny amount, p\log(1/p), to the expected information. Maybe I’ll even be able to get infinitely many possibilities to converge to a finite amount of information.
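
Here’s a quick numerical sketch of that hope, in Python. The failure probabilities below (proportional to 2^{-n}) are entirely made up for illustration; the point is only that infinitely many rare events can still add up to a finite expected amount of information.

import math

# Made-up toy model: give the "email never arrives" scenarios probabilities
# p_n = c * 2**-n for n = 1, 2, 3, ...  Each one contributes p * log2(1/p)
# to the expected information, and the infinite sum converges.
c = 0.01                    # total probability of all failure scenarios combined
total = 0.0
for n in range(1, 200):     # the tail past 200 terms is utterly negligible
    p = c * 2.0 ** -n
    total += p * math.log2(1.0 / p)

print(f"expected information from the failure scenarios: {total:.4f} bits")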

But suppose I break the process down differently. A simple way to incorporate this whole new world of possibilities would be to lump them all as one, and then say there are three possible results – that you read ‘H’, that you read ‘T’, that you read neither.

A complicated way would be to make a long list of the various ways you might end up reading neither. These are two different procedures that attempt to find the information in the same situation, and they come up with different results.

When you break the third category down into a bunch of little ones, you’ll always increase the information content. Start with some possible outcome with probability p. Its contribution to the information is -p\log p. Now break it down into two sub-outcomes with probabilities x and y. We have

x + y = p

x, y > 0

p\log p = (x+y)\log(x+y) = x\log(x+y) + y\log(x+y) > x\log(x) + y\log(y)

where the inequality follows because the logarithm is monotonically increasing (for a base greater than one) and x + y is larger than either x or y alone. This means

-p\log p < - \left(x\log x + y\log y\right),

so that whenever we break a broad-category event down into (nontrivial) particulars, we increase the total information. So it appears that the information you get by reading your morning email depends on how you decide to count it.
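
Here’s a quick Python check of that inequality, with arbitrary example numbers (the particular p and x values below are just for illustration):

import math

def contribution(p):
    """Contribution of an outcome with probability p to the total information: -p*log2(p)."""
    return -p * math.log2(p)

# Break an outcome of probability p into two sub-outcomes x and y with x + y = p.
# The broken-down version always carries more information.
p = 0.25
for x in (0.01, 0.05, 0.125, 0.2):
    y = p - x
    whole = contribution(p)
    split = contribution(x) + contribution(y)
    print(f"x = {x:5.3f}: whole = {whole:.4f} bits, split = {split:.4f} bits, split > whole: {split > whole}")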

To be more concrete, let’s consider a specific, tractable example. I decide that in addition to telling you whether I flipped the coin heads up or tails up, I’m also going to tell you about the bottom side of the coin. So now, every day I send you an email saying ‘HT’ if I flipped heads, then checked the bottom and got tails, or ‘TH’ if it came up the other way.

You wouldn’t get all that excited about this change in the rules. The information content in this game is just as it was before, because if you know what’s on the top of the coin, you know what’s on the bottom (trick coins notwithstanding).

The new message wants to be two bits, but somehow it fails. It’s lacking something. That thing is statistical independence.
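
Here’s a small Python sketch of the point, using the usual base-2 Shannon entropy: reporting both faces of one fair coin carries the same single bit as reporting the top alone, while two genuinely independent coins would carry two.

import math

def entropy(probs):
    """Shannon entropy, in bits, of a distribution given as a list of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Top face only: 'H' or 'T', equally likely.
print(entropy([0.5, 0.5]))                  # 1.0 bit

# Top and bottom of the same coin: the only messages are 'HT' and 'TH',
# because the bottom is fixed by the top.  Still 1.0 bit.
print(entropy([0.5, 0.5]))                  # 1.0 bit

# Two independent fair coins: four equally likely messages.
print(entropy([0.25, 0.25, 0.25, 0.25]))    # 2.0 bits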

A half-way scenario would be if I have three coins: two trick coins (one with two heads, the other with two tails) and one normal coin. I choose a coin at random each morning and flip it. If I just told you the top face, as usual, it would be one bit of information, because heads and tails are equally likely. Alternatively, I could tell you just what’s on the bottom, and that would also be one bit.

But if I told you both top and bottom, it would be somewhere between one bit (as for the normal coin) and two bits (as for two different normal coins).

If you know the top of the coin is heads, the bottom could also be heads (if it’s the two-headed trick coin) or tails (if it’s the normal coin), but they aren’t equally likely. Each of the six faces is equally likely to end up on top, and of the three head faces, two belong to the two-headed coin, so if the top is heads there’s a \frac{2}{3} chance the bottom is also heads. The report on the top of the coin gives one bit, while the report on the bottom gives

\frac{2}{3}\log\left(\frac{3}{2}\right) + \frac{1}{3}\log(3)

additional bits. The total information comes to (I’ll do the arithmetic)

\log\left(3\cdot 2^{1/3}\right).

Another way to think of it is that there are four outcomes, with associated probabilities

HH – 1/3
HT – 1/6
TH – 1/6
TT – 1/3

which, on arithmetic, gives the same result.
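
If you’d like to check that arithmetic, here’s a short Python sketch computing the information both ways:

import math

def entropy(probs):
    """Shannon entropy, in bits, of a distribution given as a list of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Way 1: one bit for the top face, plus the conditional information in the bottom.
top = 1.0
bottom_given_top = (2/3) * math.log2(3/2) + (1/3) * math.log2(3)
print(top + bottom_given_top)             # about 1.918 bits

# Way 2: entropy of the four joint outcomes HH, HT, TH, TT.
print(entropy([1/3, 1/6, 1/6, 1/3]))      # the same, about 1.918 bits

# Both equal log2(3 * 2**(1/3)).
print(math.log2(3 * 2 ** (1/3)))

All three numbers agree, to floating-point precision.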

With these new considerations, return to the question of the potentially infinite information you get from reading my morning email. In order to quantify the total information, we’d need to work out which events are independent of each other, and get all their conditional probabilities.

Two reasons the email might not come are that your computer is broken and that I decided I hate you now (and therefore no longer want to send you emails). By successfully getting the morning email, you rule out both those possibilities. But you rule them out for the same reason. There is no way that simply reading ‘H’ or ‘T’ can rule one out without ruling out the other.

Sure, in practice you can tell the difference. For example, if no email comes it may be very clear that your computer is working fine, but that’s outside information. That’s information based on things other than the email, and I want to consider only the information in the email itself. We can safely lump all those horrible email-blocking possibilities into one category, slap a probability on it, and get a new, consistent information quantity.


3 Responses to “Information and Statistical Independence”

  1. amzuko Says:

    “We can safely lump all those horrible email-blocking possibilities into one category, slap a probability on it, and get a new, consistent information quantity.”

    probably. But what’s the most efficient system to store that information? You’ve used base 2 in your calculations, implying a base-two storage mechanism. That means you need a lot of bits to store anything large. Babbage used base 10 for his various and sundry machinations… not as many digits need to be used, but each 10-sided wheel is pretty unwieldy.

    I realize this drifts from math to… engineering…
    but all the same, no google-hunting the answer.

  2. Pablo Says:

    Just wandering around, found your post. Note the decomposition you propose is not correct. Where it reads

    p = x + y

    intending to note the relations between the probabilities of composite events, it should read instead

    P(C) = P(A \cup B) = P(A) + P(B) - P(A \cap B)

    This invalidates the main point of ever increasing information. Note also that statistical independence (defined as P(A \cap B) = P(A)P(B)) has nothing to do with it.

    Regards

  3. Mark Eichenlaub Says:

    Hi Pablo,

    No, it is not wrong. The post implicitly assumes the two outcomes are mutually exclusive.
