## Where Does the Formula for Information Come From?

A friend of mine has been doing some research on information and causality recently. He told me that the mathematical definition of the information generated by a random process is

$I = - \sum p\log p$

where $p$ is the probability of one particular outcome of the process and the sum is taken over all possible outcomes. I was curious to see why this particular definition should be picked out of infinitely-many possibilities. This post explains what I came up with.

Suppose that every morning I flip a coin and send you an email telling you what I got. It’d be totally exciting – completely unpredictable. (Let me know if you actually want to do this. Especially if you live in, like, Malawi or Liechtenstein. I haven’t had a pen pal since second grade and we can tell each other all about the ways of our respective people and what kinds of diseases our grandparents get etc.) My email would be “heads” half the time and “tails” the other half, but only over the course of many tosses. You’d have no idea what it would be on any particular day until you got the email. (Surprisingly, a real coin flipped starting heads-up lands heads about 51% of the time, so imagine I’m a perfect 50-50 coin flipper.)

If we did this little email game, I’d be telling you just a little bit about my life. I’d be giving you some information. It’s that idea of information that I want to make mathematically precise here.

Let’s start with a simple definition. Every day I send you an email that says either “H” or “T”. Just one character out of two. The simplest possible message, but totally random. Let’s call that one bit of information.

I could do other strange things, though. What if I rolled a die and sent you a number 1-6? What if I rolled the die and told you whether or not it showed a number that divides ten evenly? What if the die had any number of sides, and what if it were loaded? I could do even weirder things than that, such as telling you the morning temperature, but I won’t discuss that here. (The key difference is that knowing yesterday’s coin flip result doesn’t tell you about today’s, but if you know today’s temperature, that gives you a basis for a reasonable guess about tomorrow’s.)

Since I send you one bit on the first day and do the same thing the next day, let’s call that a total of two bits. Technically, it wouldn’t have to be two bits – we could instead have counted one unit for each of the four possible two-day scenarios (HH, HT, TH, TT). But it makes a lot of sense to think of it as one new bit each day.

Suppose I wanted to do things faster. I want to send you not one bit per day, but two. I could do this by flipping in the morning and sending you an email, then flipping again in the afternoon and sending another. Or I could flip, send you an email, and then immediately flip again and send a follow-up. Finally, I might decide to do it all in one step. I’d flip the coin twice in a row, and then send you one email with one of the four possible sequences in it. And that’s two bits.

Want three bits? I’d have to send you one of the eight possible three-coin-flip sequences. But suppose I get lazy. I don’t want to flip the coin three times. Instead, I get an eight-sided die and roll that once. It gives me a number 1-8, but I don’t want to send you a number. I want to send you a sequence of heads and tails, since I’m trying to trick you into thinking I’m not a lazy bum. So I make a little chart like this:

1 -> HHH
2 -> HHT
3 -> HTH
4 -> HTT
5 -> THH
6 -> THT
7 -> TTH
8 -> TTT

Then whatever number I roll, I just look up the corresponding sequence on my chart and send off the email. Each sequence comes up $\frac{1}{8}$ of the time, just as it would if I were actually flipping the coin. You’d never know the difference. The information you receive must be exactly the same – three bits.
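The chart trick can be sketched in a few lines of Python (the `chart` dict and its names are just for illustration). Enumerating the die faces shows every face maps to a distinct three-flip sequence, so a fair eight-sided roll produces each sequence with probability $\frac{1}{8}$:

```python
from itertools import product

# Hypothetical chart: map each face of an eight-sided die
# to one of the eight three-flip sequences, in the order above.
chart = {face: "".join(seq)
         for face, seq in enumerate(product("HT", repeat=3), start=1)}

# Every face maps to a distinct sequence and every sequence is covered,
# so a fair d8 roll yields each sequence with probability 1/8 --
# indistinguishable from three honest coin flips.
print(chart[1], chart[8])          # HHH TTT
print(len(set(chart.values())))    # 8 distinct sequences
```

Since the eight faces are equally likely and the map is one-to-one, the distribution over sequences is exactly the three-flip distribution.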

All this discussion leads to two conclusions. First, the information generated by a random process (flipping a coin, rolling a die, etc.) depends only on the probability distribution, not on the details of the physical process. If any two processes have the same probability distribution, the way three coin flips and one octahedral die roll do, they generate the same information.

Second, $n$ coin flips carry $n$ bits of information, and this is equivalent to rolling a die with $2^n$ sides. So if we have a flat probability distribution (all outcomes equally likely) with $x = 2^n$ possible outcomes, it has $n = \log_2x$ bits of information. Let’s say this $\log_2x$ relation holds for all integer $x$, whether or not $x$ is a power of two.

Suppose we roll two dice, one with $x$ sides and the other with $y$. The total information generated should, we expect, be the sum of the information generated by each of the dice separately, since they’re totally independent. When we roll the dice, the result is an ordered pair with any of $x$ values in the first slot and any of $y$ values in the second. The total number of ordered pairs is $xy$, all equally likely. Then the information is $\log_2(xy) = \log_2(x) + \log_2(y)$. So the information in two dice rolls is just the sum of the information in the individual rolls.
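A quick numerical check of that additivity, using a hypothetical six-sided and twenty-sided die as the two dice:

```python
import math

x, y = 6, 20  # example sizes: a d6 and a d20

combined = math.log2(x * y)             # info in the ordered pair (120 outcomes)
separate = math.log2(x) + math.log2(y)  # info in each roll, summed

# The two quantities agree (up to floating-point rounding),
# because log2 turns products into sums.
print(math.isclose(combined, separate))  # True
```

This is exactly why a logarithm is the natural choice: independent processes should have their information add, and products of outcome counts become sums of logs.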

We still need to figure out how to quantify the information contained in distributions that aren’t flat. For example, suppose my coin had two heads on it, so that every morning I sent you an email that just said, “H”. That wouldn’t give you any information at all, because you could easily predict what I was going to send you. There’s no magic there. You could just write a little program on your computer to send yourself an email with “H” each morning and cut me totally out of the picture, with exactly the same result. Clearly, I’m generating zero information in this case.

There could be intermediate cases, though. What if my coin gave heads 75% of the time and tails 25%? It would be semi-predictable, semi-random. Then the information should be somewhere in between one bit (as for the flat, totally unpredictable distribution) and zero bits (as for the totally predictable one).

To handle this, we’ll re-examine the flat distribution. There are $x$ equally-likely outcomes. Let’s say that each outcome carries its own little packet of information, and that adding them all up, we get the total information. So in flipping a fair coin, the heads side alone has half a bit of information. The tails side has the other half. Adding them, we get the full one bit.

Then each outcome in the flat distribution has

$\frac{\log_2x}{x}$

bits associated with it, so that when we add them all up we get the correct $\log_2x$ total bits.
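As a sanity check of this bookkeeping in Python, say for $x = 8$ (the eight-sided die again):

```python
import math

x = 8                      # flat distribution with 8 equally likely outcomes
packet = math.log2(x) / x  # information carried by each single outcome
total = x * packet         # adding up all x packets

print(packet)  # 0.375 bits per outcome
print(total)   # 3.0 -- recovers the full log2(8) = 3 bits
```

Each outcome’s packet is small, but the $x$ packets together reconstruct the whole $\log_2 x$.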

Now we can do distributions like

$\{\frac{1}{7}, \frac{1}{2}, \frac{1}{3},\frac{1}{42}\}$.

We just find the individual information of each outcome based on its probability, then add.

$\frac{1}{7}\log_27 + \frac{1}{2}\log_22 + \frac{1}{3}\log_23 + \frac{1}{42}\log_242$.
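Evaluating that sum numerically (the value isn’t stated in the text above; this is just a quick check):

```python
import math

# The example distribution {1/7, 1/2, 1/3, 1/42}; probabilities sum to 1.
ps = [1/7, 1/2, 1/3, 1/42]

# Each outcome contributes p * log2(1/p) bits.
info = sum(p * math.log2(1 / p) for p in ps)
print(round(info, 3))  # 1.558
```

So this semi-lopsided four-outcome distribution carries about 1.56 bits – less than the 2 bits a flat four-outcome distribution would give, as the predictability argument suggests.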

In general, if we write $p = 1/x$ for the probability of a given outcome, we can write the total information as

$\sum p\log_2\frac{1}{p} = - \sum p\log_2p$.

This holds for all distributions built by adding up bits and pieces of flat guys. But why not extend the definition to all (discrete) distributions whatsoever? This is just the principle that, as far as our mathematical formalism goes, probabilities shouldn’t be treated as special just because they happen to be reciprocals of integers.

With this final conclusion, we reach the desired definition:

$I = - \sum p \log_2p$.

The only difference is that you don’t have to use $2$ as the base of the logarithm. You could use any other number you wanted. For example, if you were using English characters rather than heads and tails as your fundamental unit, you might use base 26. This just multiplies the information by a constant factor, though, and so isn’t very important.
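The whole derivation can be collected into one short Python sketch (the function name `information` is mine, not standard; the `base` argument illustrates the constant-factor point, and the 75/25 coin from earlier indeed lands between 0 and 1 bit):

```python
import math

def information(ps, base=2):
    """Information of a discrete distribution with probabilities ps,
    in units set by the logarithm base (base 2 gives bits)."""
    # Each outcome contributes p * log(1/p); zero-probability outcomes
    # contribute nothing, so skip them.
    return sum(p * math.log(1 / p, base) for p in ps if p > 0)

print(information([0.5, 0.5]))               # fair coin: 1.0 bit
print(information([1.0]))                    # two-headed coin: 0.0 bits
print(round(information([0.75, 0.25]), 4))   # 75/25 coin: 0.8113 bits
print(information([0.5, 0.5], base=4))       # same fair coin in base 4: 0.5
```

Changing the base just rescales every value by the same constant ($\log_2 4 = 2$ here), which is why the choice of base carries no real content.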

### 4 Responses to “Where Does the Formula for Information Come From?”

1. Andrew Says:

There are other interesting measures of information. For instance, Kolmogorov Complexity: http://en.wikipedia.org/wiki/Kolmogorov_complexity

The Kolmogorov complexity is a measure of the computational resources required to recreate a given string. Wikipedia uses the example of the Mandelbrot set: any representation of the set will be large and highly entropic to Claude Shannon’s eyes. However, the definition of the Mandelbrot set, from which the desired representation could be generated, is a fairly short statement. This definition is pretty worthless for random processes, however.

Oh, and reading an email containing ‘Heads’ arriving in my inbox actually does contain information, on a different level of abstraction than the results of a coin flip: I know that I still have eyes, a computer, electric power, an internet connection, and can be reasonably sure that you have the same.

2. meichenl Says:

Hey Zuke,

Good to hear from you. Kolmogorov complexity looks really interesting. How would you know you’ve found the best way to recreate the string? If you sent me a few thousand digits starting from somewhere deep in pi, it’d have way low Kolmogorov complexity if I realized where they were coming from, but high complexity otherwise.

I agree about the email. I was just thinking about the information contained in the binary fact that there is an “H” as opposed to a “T”, with this being abstracted from things like light and dark pixels. Getting the email tells you, for example, that I didn’t choke to death on my own vomit overnight, but we’re just taking all that sort of thing for granted. (To be honest, I really DO take it for granted every night I fail to choke on my own vomit. Kind of puts life in perspective, huh?)

3. Information and Statistical Independence « Arcsecond Says:

[…] and Statistical Independence By meichenl Andrew’s reply to my previous post on information suggests an interesting […]
