A friend of mine has been doing some research on information and causality recently. He told me that the mathematical definition of the information generated by a random process is

$$I = -\sum_i p_i \log_2 p_i,$$

where $p_i$ is the probability of one particular outcome of the process and the sum is taken over all possible outcomes. I was curious to see why this particular definition should be picked out of infinitely-many possibilities. This post explains what I came up with.
Suppose that every morning I flip a coin and send you an email telling you what I got. It’d be totally exciting – completely unpredictable. (Let me know if you actually want to do this. Especially if you live in, like, Malawi or Liechtenstein. I haven’t had a pen pal since second grade and we can tell each other all about the ways of our respective people and what kinds of diseases our grandparents get etc.) My email would be “heads” half the time and “tails” the other half, but only over the course of many tosses. You’d have no idea what it would be on any particular day until you got the email. (Surprisingly, if I always flip the coin starting heads-up, it would be heads about 51% of the time. Imagine I’m a perfect 50-50 coinflipper.)
If we did this little email game, I’d be telling you just a little bit about my life. I’d be giving you some information. It’s that idea of information that I want to make mathematically precise here.
Let’s start with a simple definition. Every day I send you an email that says either “H” or “T”. Just one character out of two. The simplest possible message, but totally random. Let’s call that one bit of information.
I could do other strange things, though. What if I rolled a die and sent you a number 1-6? What if I rolled the die and told you whether or not it showed a number that divides ten evenly? What if the die had any number of sides, and what if it were loaded? I could do even weirder things than that, such as telling you the morning temperature, but I won’t discuss that here. (The key difference is that knowing yesterday’s coin flip result doesn’t tell you about today’s, but if you know today’s temperature, that gives you a basis for a reasonable guess about tomorrow’s.)
Since I send you one bit on the first day, and I’m doing the same thing the next day, let’s call that a total of two bits. Technically, it wouldn’t have to be two bits. It could be four bits, for example, because there are four possible two-day scenarios (HH, HT, TH, TT). But it makes a lot of sense to think of it as one new bit each day.
Suppose I wanted to do things faster. I want to send you not one bit per day, but two. I could do this by flipping in the morning and sending you an email, then flipping again in the afternoon and sending another email. Or I could flip, send you an email, and then immediately flip again and send a follow-up. Finally, I might decide to do it all in one step. I’d flip the coin twice in a row, and then send you one email with one of the four possible sequences in it. And that’s two bits.
Want three bits? I’d have to send you one of the eight possible three-coin-flip sequences. But suppose I get lazy. I don’t want to flip the coin three times. Instead, I get an eight-sided die and roll that once. It gives me a number 1-8, but I don’t want to send you a number. I want to send you a sequence of heads and tails, since I’m trying to trick you into thinking I’m not a lazy bum. So I make a little chart like this:
1 -> HHH
2 -> HHT
3 -> HTH
4 -> HTT
5 -> THH
6 -> THT
7 -> TTH
8 -> TTT
Then whatever number I roll, I just look up the corresponding sequence on my chart, and send off the email. Each sequence comes up $1/8$ of the time, just as it would if I were actually flipping the coin. You’d never know the difference. The information you receive must be exactly the same – three bits.
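If you want to play along at home, here's a quick Python sketch (my own, not part of the original argument) of the lazy-die trick. The chart above is really just the 3-bit binary expansion of the roll, with 0 standing for H and 1 for T:

```python
# One eight-sided die roll stands in for three coin flips.
# The lookup chart is the 3-bit binary expansion of (roll - 1),
# with 0 -> H and 1 -> T.
import random

CHART = {roll: format(roll - 1, "03b").replace("0", "H").replace("1", "T")
         for roll in range(1, 9)}

def lazy_three_flips(rng=random):
    """Roll a d8 once and look up the corresponding H/T sequence."""
    return CHART[rng.randint(1, 8)]

def honest_three_flips(rng=random):
    """Actually flip a fair coin three times."""
    return "".join(rng.choice("HT") for _ in range(3))

# Both methods produce each of the 8 sequences with probability 1/8,
# so the emails are statistically indistinguishable.
```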
All this discussion leads to two conclusions: first, the information generated by a random process (flipping a coin, rolling a die, etc.) depends only on the probability distribution, not the details of the physical process. If any two processes have the same probability distribution, the way three coin flips and one octahedral die roll do, they have the same information.
Second, $n$ coin flips have $n$ bits of information, and this is equivalent to rolling a die with $2^n$ sides. So if we have a flat probability distribution (all outcomes equally likely) with $N$ possible outcomes, that has $\log_2 N$ bits of information. Let’s say this relation holds true for all integer $N$, whether or not $N$ is a power of two.
Suppose we roll two dice, one with $m$ sides and the other with $n$. The total information generated should, we expect, be the sum of the information generated by each of the dice separately, since they’re totally independent. When we roll the dice, the result is an ordered pair with any of $m$ values in the first slot and any of $n$ values in the second. The total number of ordered pairs is $mn$, all equally likely. Then the information is $\log_2(mn) = \log_2 m + \log_2 n$. So the information in two dice rolls is just the sum of the information in the individual rolls.
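The additivity claim is easy to check numerically; here's a little sketch (the specific die sizes are just arbitrary examples I picked):

```python
# Check that an m-sided die plus an n-sided die carry the same
# information as one flat distribution over the m*n ordered pairs.
import math

for m, n in [(2, 8), (6, 6), (4, 20)]:
    separate = math.log2(m) + math.log2(n)   # info from each die alone
    combined = math.log2(m * n)              # info from the ordered pair
    assert math.isclose(separate, combined)
```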
We still need to figure out how to quantify the information contained in distributions that aren’t flat. For example, suppose my coin had two heads on it, so that every morning I sent you an email that just said, “H”. That wouldn’t give you any information at all, because you could easily predict what I was going to send you. There’s no magic there. You could just write a little program on your computer to send yourself an email with “H” each morning and cut me totally out of the picture with the same exact result. Clearly, I’m generating zero information in this case.
There could be intermediate cases, though. What if my coin gave heads 75% of the time and tails 25%? It would be semi-predictable, semi-random. Then the information should be somewhere in between one bit (as for the flat, totally unpredictable distribution) and zero bits (as for the totally predictable one).
To handle this, we’ll re-examine the flat distribution. There are $N$ equally-likely outcomes. Let’s say that each outcome carries its own little packet of information, and that adding them all up, we get the total information. So in flipping a fair coin, the heads side alone has half a bit of information. The tails side has the other half. Adding them, we get the full one bit.
Then each outcome in the flat distribution has

$$\frac{1}{N}\log_2 N = -\frac{1}{N}\log_2\frac{1}{N}$$

bits associated with it, so that when we add them all up we get the correct $\log_2 N$ total bits.
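The bookkeeping here is worth a quick sanity check. Writing $p = 1/N$ for the probability of one outcome, each packet is $-p \log_2 p$, and $N$ identical packets sum to $\log_2 N$ (a sketch of mine, not from the original post):

```python
# Each outcome of a flat N-outcome distribution carries
# (1/N) * log2(N) = -p * log2(p) bits, with p = 1/N. Summing the
# N identical packets recovers the full log2(N) bits.
import math

def packet(p):
    """Information packet carried by one outcome of probability p."""
    return -p * math.log2(p)

for N in [2, 6, 8, 100]:
    p = 1 / N
    assert math.isclose(N * packet(p), math.log2(N))

# For a fair coin (N = 2): heads carries 0.5 bits, tails the other 0.5.
assert math.isclose(packet(0.5), 0.5)
```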
Now we can do distributions like

$$\left\{\tfrac{1}{2},\ \tfrac{1}{4},\ \tfrac{1}{4}\right\}.$$
We just find the individual information of each outcome based on its probability, then add.
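As a worked example (my own numbers, chosen so every probability is a piece of some flat distribution), take outcomes with probabilities 1/2, 1/4, and 1/4:

```python
# Add the per-outcome packets for a distribution glued together
# from pieces of flat ones: probabilities 1/2, 1/4, 1/4.
import math

# -p * log2(p) for each outcome:
packets = [-p * math.log2(p) for p in (0.5, 0.25, 0.25)]

# The 1/2 outcome contributes 0.5 bits; each 1/4 outcome contributes
# 0.25 * 2 = 0.5 bits, for 1.5 bits in total.
assert math.isclose(sum(packets), 1.5)
```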
In general, writing $p_i$ for the probability of the $i$th outcome, we can write the total information as

$$I = -\sum_i p_i \log_2 p_i,$$

which holds true for all distributions made from adding up bits and pieces of flat guys. But why not extend the definition to all (discrete) distributions whatsoever? This is just the principle that, as far as our mathematical formalism goes, fractions should not be treated as special just because their numerator is $1$.
With this final conclusion, we reach the desired definition:

$$I = -\sum_i p_i \log_2 p_i.$$
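Applied to the 75/25 coin from earlier, the definition lands exactly where we hoped: somewhere between the two-headed coin's zero bits and the fair coin's one bit. A quick sketch (the convention of skipping zero-probability outcomes is my own addition, since $0 \log 0$ is taken to be $0$):

```python
# The definition applied to the coins from earlier in the post.
import math

def information(dist):
    """Total information: sum of -p*log2(p), skipping p = 0 outcomes."""
    return sum(-p * math.log2(p) for p in dist if p > 0)

fair = information([0.5, 0.5])        # 1 bit
biased = information([0.75, 0.25])    # about 0.811 bits
certain = information([1.0])          # 0 bits: the two-headed coin

assert math.isclose(fair, 1.0)
assert math.isclose(certain, 0.0)
assert 0.0 < biased < 1.0             # semi-predictable, semi-random
```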
The only difference is that you don’t have to use $2$ as the base of the logarithm. You could use any other number you wanted. For example, if you were using English characters rather than heads and tails as your fundamental unit, you might use base $26$. This just multiplies the information by a constant factor, though, and so isn’t very important.
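The constant factor comes straight from the change-of-base rule, $\log_b x = \log_2 x / \log_2 b$. One last sketch to see it in action on the 75/25 coin:

```python
# Changing the base of the logarithm only rescales the information
# by the constant 1 / log2(b).
import math

dist = [0.75, 0.25]
bits = -sum(p * math.log2(p) for p in dist)       # base 2
base26 = -sum(p * math.log(p, 26) for p in dist)  # base 26

assert math.isclose(base26, bits / math.log2(26))
```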