Posts Tagged ‘information’

November 23, 2010

There’s a room with 64 labeled light bulbs, each of which can be either on or off, and each of which has its own switch. You walk into the room and find the light bulbs in some arbitrary pattern of on and off. You get to flip one switch of your choosing (and you must flip one).

Next, you leave the room and I enter. By looking at the pattern now on the lights, I might be able to get some information – you might be able to send me a message.

If we convene beforehand to create a strategy and write a list of pre-agreed messages, so that you can send me one by picking it off the list, how long can the list be?

The answer is that the list should be 64 messages long.

By the pigeonhole principle, 64 messages is an upper bound: when you enter the room you have only 64 switches to choose from, so from any starting pattern there are only 64 patterns you can leave behind.

We’ll prove 64 is attainable constructively.

Since we’re flipping a switch, it’s natural to think about the parity of the bulbs. We’re not interested in the parity of the entire sequence, though; that is forced to change because you flip exactly one switch.

If we wanted to settle for a two-item list, we could simply consider the parities of the first and second halves of the bulbs independently. We can easily control the parity of the first half by flipping a bulb in the first half if we want to change its parity, or flipping a bulb in the second half if we don’t. This lets us encode a one-bit message, but it’s a wasteful strategy: all the bulbs in the first half are treated the same. It would be better to cut the sequence in half in more ways, and use them all simultaneously. Let’s call these different halvings “slices”.
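As a sketch of this one-bit strategy in Python (representing the room as a list of 64 zeros and ones; the function names are mine):

```python
def one_bit_flip(bulbs, bit):
    # The parity of the first 32 bulbs carries the message bit.
    parity = sum(bulbs[:32]) % 2
    # Flip a bulb in the first half to change that parity, or one in the
    # second half to leave it alone. Return the index of the bulb to flip.
    return 0 if parity != bit else 32

def one_bit_read(bulbs):
    # The receiver just reads off the parity of the first half.
    return sum(bulbs[:32]) % 2
```

The particular indices 0 and 32 are arbitrary; any bulb in the appropriate half works, which is exactly the wastefulness mentioned above.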

Another natural slice is to take evens and odds. But in between the first-half/second-half slice and the even/odd slice there is a whole sequence of slices.

• (1-32), (33-64)
• (1-16, 33-48), (17-32, 49-64)
• (1-8, 17-24, 33-40, 49-56), (9-16, 25-32, 41-48, 57-64)
• etc.

Or, to put it another way, start the light bulbs’ numbering at zero and make the slices

1. (0-31 mod 64),(32-63 mod 64)
2. (0-15 mod 32),(16-31 mod 32)
3. (0-7 mod 16),(8-15 mod 16)
4. (0-3 mod 8),(4-7 mod 8)
5. (0-1 mod 4),(2-3 mod 4)
6. (0 mod 2),(1 mod 2)

The important thing about this breakdown is that it lets you zero in on a single light bulb by saying, for example, that it’s in the left half of slices 1, 4, 5, and 6 and the right half of slices 2 and 3. That uniquely picks out bulb 24. It also goes the other way: every bulb has a unique combination of left halves and right halves.

This is perfect! Now we can send six one-bit messages at the same time. To send our first one-bit message, as before we control the parity of the first half of the bulbs. To send the second bit, we control the parity of the bulbs in the left half of slice two. To send the third bit, we control the parity of the bulbs in the left half of slice three, etc. There will always be exactly one bulb that correctly changes / fails to change all six parities in the way we want.

So we can send six bits, or 2^6=64 messages.
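Since a bulb’s six left/right memberships are just the six binary digits of its index, the whole strategy collapses into one XOR. Here’s a Python sketch of the scheme described above (the function names are mine):

```python
from functools import reduce

def read_message(bulbs):
    # The decoded message is the XOR of the indices of all bulbs that are on.
    # Each bit of this XOR is one of the six slice parities described above.
    return reduce(lambda acc, i: acc ^ i, (i for i, on in enumerate(bulbs) if on), 0)

def bulb_to_flip(bulbs, message):
    # Flipping bulb j changes the XOR of on-indices by exactly j, so flip the
    # bulb whose index is (current XOR) ^ (desired message). If the room
    # already reads the right message, this flips bulb 0, leaving it unchanged.
    return read_message(bulbs) ^ message
```

From any starting pattern, one call to `bulb_to_flip` names the single bulb whose flip makes `read_message` return the desired value.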

The Entropy of the SAT

November 2, 2010

If you get a question wrong on the SAT, you get a 1/4 point deduction. If you skip that question, there’s no penalty. This way, the test is “guessing-neutral”. Each question has five choices, so if you guess randomly your expected score increase is 1/5*1 + 4/5 * (-1/4) = 0. On average, you neither win nor lose points by guessing.

However, guessing still increases the variance of your score, by 0.25 raw points squared per guess. Maybe there is a better way to account for test-takers getting some answers correct at random?
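A quick check of the mean and variance of a single random guess, scored +1 for a right answer and −1/4 for a wrong one:

```python
p_right, p_wrong = 1/5, 4/5
gain, loss = 1.0, -1/4

# Expected score change: zero, so the test is guessing-neutral.
mean = p_right * gain + p_wrong * loss

# Variance of the score change from one guess, in raw points squared.
variance = p_right * gain**2 + p_wrong * loss**2 - mean**2
```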

I remember the words of my 9th grade geometry teacher. We had to answer some True/False questions, and he was wise to the strategy of writing an ambiguous sort of pencil scratch which, if it were marked wrong, you could later come back and complain had been misread.

Mr. Allen told us, “Be very clear about your T’s and F’s. If I can’t tell whether what you wrote is a T or an F, I’m going to guess. And I always guess whichever one is the wrong answer. I am the worst guesser ever.” He paused a moment after that and added, “Well, I guess that means I’m the best guesser ever, too.”

What he meant was that if you guess randomly, you’ll guess that the student got it right 50% of the time. If your guesses have the student getting it wrong 100% of the time, that means you, the guesser, definitely know the answers. From the student’s point of view, getting all the answers wrong on a T/F test is just as hard as getting them all right.

This makes a T/F test very easy to pass. Even if you only know the answers to 20% of the questions, you can still guess on the rest and get an expected score of 60%.

We can adjust the grading of the test to get an estimate of the student’s knowledge by analyzing the amount of information they put into the test. By “information” what I mean is the reduction in the entropy.

Say we have a T/F test with 100 questions. There are 100891344545564193334812497256 ways to get exactly 50/100 on the test, but only 1 way to get 100/100 or 0/100. There are 100 ways to get 99/100 or 1/100.

If we take the binary log of these numbers, we find the entropy of a score 50/100 is 96.3 bits. The entropy of a perfect score is 0 bits. To convert to a reasonable test score, we multiply this entropy by -1 and add 96.3. Then to get on a scale of 0 – 100 multiply by 100/96.3. (This is equivalent to using not a binary log, but a log with base 2^0.963 = 1.95). Since this score is based on information, let’s call it the iScore. The random guessing iScore is zero, which reflects the test-taker’s complete lack of knowledge. The perfect iScore is 100. A raw score of 99/100 is worth 93.1. One answer wrong means a lot! Here’s a plot of the iScore as a function of the raw score.
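Here’s a sketch of that computation in Python (the names `ways` and `iscore` are mine):

```python
import math

def ways(n, r, c=2):
    # Number of answer sheets scoring exactly r out of n with c choices each:
    # choose which r questions are right, then pick one of the (c - 1) wrong
    # answers on each of the remaining questions.
    return math.comb(n, r) * (c - 1) ** (n - r)

def iscore(n, r, c=2):
    # Entropy of the expected random-guessing score, n/c, sets the baseline.
    baseline = math.log2(ways(n, n // c, c))
    return (baseline - math.log2(ways(n, r, c))) * 100 / baseline
```

With `n = 100` and `c = 2`, `iscore(100, 50)` is 0, `iscore(100, 100)` is 100, and `iscore(100, 99)` comes out to about 93.1, matching the numbers above.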

In the general case of $n$ questions with $c$ choices each and a raw score $r$, the formula for the iScore is

$iS(r) = \left(-\log\frac{n!\,(c-1)^{n-r}}{r!\,(n-r)!} + \log\frac{n!\,(c-1)^{n-n/c}}{(n/c)!\,(n-n/c)!}\right)\frac{100}{\log\frac{n!\,(c-1)^{n-n/c}}{(n/c)!\,(n-n/c)!}}$

For example, here’s the iScore plot for a test with 100 questions, each with 4 choices.

The more choices there are, the less difference between the iScore and the raw score. In the limit where $c \to \infty$, the iScore and the raw score are the same thing. That means that if you want to give a test so the raw score accurately determines how much students know, make it free response.

The SAT does not report your raw score to colleges, even after its guessing adjustment. Instead it reports a scaled score, indicating how well you did relative to the other test takers. I have a suggestion for an improved algorithm for relative scoring, too. The basic idea is that you should be rewarded for getting very difficult questions correct. Meanwhile, if you make a boo boo on one of the easy questions, it shouldn’t mess you up too much because it was probably not indicative of a gap in your knowledge, and might be due to something dumb like accidentally marking the wrong bubble on the answer sheet.

When you get your SAT score report, you get to see every question on the test along with the answers you gave and the percent of test takers who got that question right. If 50% of test takers got a certain question right, that question is only as hard as a normal true/false question. We could construct an effective number of choices for each question by taking 1/(proportion of correct answers). Let’s call that the question’s “quality”. The question that 50% of people got right has quality 2, while a question that only 10% of people got right has quality 10.

We take the binary log of the quality of each question and call that the question’s “weight”. A question’s weight is the number of true/false questions it’s worth. Now we multiply your raw score on each question (1 or 0) by the question’s weight and sum to create a weighted raw score. This is a different metric than the iScore, and I’m not sure which I like better. This score allows us to account for the SAT’s free response math questions, though.
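With some made-up numbers (these proportions are hypothetical, not real SAT data), the weighted raw score would be computed like this:

```python
import math

# Hypothetical data: the fraction of test takers who got each question right,
# and one student's results (1 = right, 0 = wrong).
p_correct = [0.9, 0.5, 0.5, 0.25, 0.1]
student   = [1,   1,   0,   1,    0]

quality = [1 / p for p in p_correct]        # effective number of choices
weights = [math.log2(q) for q in quality]   # worth in true/false questions
weighted_score = sum(w * s for w, s in zip(weights, student))
max_score = sum(weights)
```

Note how the question 90% of people got right has a weight near zero, so missing it (or acing it) barely moves the score, while the 10% question is worth over three true/false questions.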

Of course, none of this actually removes the advantage of guessing. To answer that problem, I propose that the test’s answer choices be randomized before the test is administered, then before scoring someone’s test, all the answers they left blank are automatically filled in with choice A, or filled in randomly. This simply takes the guesswork out of the test-taker’s hands and makes guessing compulsory, but it seems fair enough to me.

Information and Statistical Independence

March 27, 2009

Andrew’s reply to my previous post on information suggests an interesting question.

We play a game where every day I flip a coin and get either heads or tails. Then I email you the result, sending you one bit of information about the coin toss.

But in reading that email, you gain the information that I survived the night. You get the information that our computers are still working, and that the laws of physics are mostly the same. Those little gnomes inside your vacuum tubes are still pushing hundreds of electrons and gravitinos all over the place inside there every single hour, just so you can watch YouTube. Those selfless little heroes. Bless their 2 and a half hearts apiece.

Try to catalog everything you learn by reading that email. No chance. You learn infinitely many things. You learn that I didn’t choke to death on my own vomit at 2am. You learn that I didn’t choke to death on my own vomit at 1am. At 12:30, 12:15, 12:07:30, …

Can the definition of information accommodate all these revelations? Is it robust if it tries to do so?

I might assign a probability to every event that could lead up to you not reading a morning ‘H’ or ‘T’ email from me (and consciously registering the result, if you want to get super-deep), from things as trivial as my making a typo to catastrophes as extravagant as you care to imagine.

The probability of most of those events is very small, which means they contribute very little information. Maybe I’ll even be able to get infinitely many possibilities to converge to a finite amount of information.

But suppose I break the process down differently. A simple way to incorporate this whole new world of possibilities would be to lump them all into one, and then say there are three possible results – that you read ‘H’, that you read ‘T’, and that you read neither.

A complicated way would be to make a long list of the various ways you might end up reading neither. These are two different procedures that attempt to find the information in the same situation, and they come up with different results.

When you break the third category down into a bunch of little ones, you’ll always increase the information content. Start with some possible outcome with probability $p$. Its contribution to the information is $-p\log p$. Now break it down into two sub-outcomes with probabilities $x$ and $y$. We have

$x + y = p$

$x, y > 0$

$p\log p = (x+y)\log(x+y) = x\log(x+y) + y\log(x+y) > x\log(x) + y\log(y)$

where the inequality follows because the logarithm is monotonically increasing (for a base greater than one) and $x+y$ is greater than both $x$ and $y$. This means

$-p\log p < - \left(x\log x + y\log y\right)$,

so that whenever we break a broad-category event down into (nontrivial) particulars, we increase the total information. So it appears that the information you get by reading your morning email depends on how you decide to count it.
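A numerical check of this, with one arbitrary split:

```python
from math import log2

p = 0.5
x, y = 0.3, 0.2   # any nontrivial split of p into two sub-outcomes

# Information contribution of the broad category vs. the two particulars.
before = -p * log2(p)
after = -(x * log2(x) + y * log2(y))
```

Here `after` exceeds `before`, and by the argument above that holds for every choice of positive `x` and `y` summing to `p`.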

To be more concrete, let’s consider a specific, tractable example. I decide that in addition to telling you whether I flipped the coin heads up or tails up, I’m also going to tell you about the bottom side of the coin. So now, every day I send you an email saying ‘HT’ if I flipped heads, then checked the bottom and got tails.

You wouldn’t get all that excited about this change in the rules. The information content in this game is just as it was before, because if you know what’s on the top of the coin, you know what’s on the bottom (trick coins notwithstanding).

The new message wants to be two bits, but somehow it fails. It’s lacking something. That thing is statistical independence.

A half-way scenario would be if I have three coins – two trick coins (one with two heads, the other with two tails) and one normal coin. I choose a coin at random each morning and flip it. If I just told you the top face, that would be one bit of information, because heads and tails are equally likely. Alternatively, I could tell you just what’s on the bottom, and that would also be one bit.

But if I told you both top and bottom, it would be somewhere between one bit (as for the normal coin) and two bits (as for two different normal coins).

If you know the top of the coin is heads, the bottom could also be heads (if it’s the trick coin) or tails (if it’s the normal coin), but they aren’t equally likely. If the top is heads, there’s a $\frac{2}{3}$ chance the bottom is also heads. The report on the top of the coin gives one bit, while the report on the bottom gives

$\frac{2}{3}\log\left(\frac{3}{2}\right) + \frac{1}{3}\log(3)$

additional bits. The total information comes to (I’ll do the arithmetic)

$\log\left(3*2^{1/3}\right)$.

Another way to think of it is that there are four outcomes, with associated probabilities

HH – 1/3
HT – 1/6
TH – 1/6
TT – 1/3

which, on arithmetic, gives the same result.
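Checking that arithmetic both ways, as a Python sketch:

```python
from math import log2

# First way: one bit for the top face, plus the conditional
# entropy of the bottom face given the top.
sequential = 1 + (2/3) * log2(3/2) + (1/3) * log2(3)

# Second way: entropy of the four joint outcomes directly.
probs = {'HH': 1/3, 'HT': 1/6, 'TH': 1/6, 'TT': 1/3}
joint = -sum(p * log2(p) for p in probs.values())

# The closed form quoted above.
closed_form = log2(3 * 2 ** (1/3))
```

All three agree, at about 1.92 bits – between the one bit of a single fair coin and the two bits of two independent fair coins, as claimed.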

With these new considerations, return to the question of the potentially infinite information you get from reading my morning email. In order to quantify the total information, we’d need to work out which events are independent of each other, and get all their conditional probabilities.

Two reasons the email might not come are that your computer is broken and that I decided I hate you now (and therefore no longer want to send you emails). By successfully getting the morning email, you rule out both those possibilities. But you rule them out for the same reason. There is no way that simply reading ‘H’ or ‘T’ can rule one out without ruling out the other.

Sure, in practice you can tell the difference. For example, if no email comes it may be very clear that your computer is working fine, but that’s outside information. That’s information based on things other than the email, and I want to consider only the information in the email itself. We can safely lump all those horrible email-blocking possibilities into one category, slap a probability on it, and get a new, consistent information quantity.

Where Does the Formula for Information Come From?

March 25, 2009

A friend of mine has been doing some research on information and causality recently. He told me that the mathematical definition of the information generated by a random process is

$I = - \sum p\log p$

where $p$ is the probability of one particular outcome of the process and the sum is taken over all possible outcomes. I was curious to see why this particular definition should be picked out of infinitely-many possibilities. This post explains what I came up with.

Suppose that every morning I flip a coin and send you an email telling you what I got. It’d be totally exciting – completely unpredictable. (Let me know if you actually want to do this. Especially if you live in, like, Malawi or Liechtenstein. I haven’t had a pen pal since second grade and we can tell each other all about the ways of our respective people and what kinds of diseases our grandparents get etc.) My email would be “heads” half the time and “tails” the other half, but only over the course of many tosses. You’d have no idea what it would be on any particular day until you got the email. (Surprisingly, if I always flip the coin starting heads-up, it would be heads about 51% of the time. Imagine I’m a perfect 50-50 coinflipper.)

If we did this little email game, I’d be telling you just a little bit about my life. I’d be giving you some information. It’s that idea of information that I want to make mathematically precise here.

Let’s start with a simple definition. Every day I send you an email that says either “H” or “T”. Just one character out of two. The simplest possible message, but totally random. Let’s call that one bit of information.

I could do other strange things, though. What if I rolled a die and sent you a number 1-6? What if I rolled the die and told you whether or not it showed a number that divides ten evenly? What if the die had any number of sides, and what if it were loaded? I could do even weirder things than that, such as telling you the morning temperature, but I won’t discuss that here. (The key difference is that knowing yesterday’s coin flip result doesn’t tell you about today’s, but if you know today’s temperature, that gives you a basis for a reasonable guess about tomorrow’s.)

Since I send you one bit on the first day, and I’m doing the same thing the next day, let’s call that a total of two bits. Technically, we didn’t have to count it that way. We could instead have counted the four possible two-day scenarios (HH, HT, TH, TT) as the fundamental outcomes. But it makes a lot of sense to think of it as one new bit each day.

Suppose I wanted to do things faster. I want to send you not one bit per day, but two. I could do this by flipping in the morning and sending you an email, and flipping again in the afternoon, sending another email. Or I could flip, send you an email, and then immediately flip again and send a follow-up. Finally, I might decide to do it all in one step. I’d flip the coin twice in a row, and then send you one email with one of the four possible sequences in it. And that’s two bits.

Want three bits? I’d have to send you one of the eight possible three-coin-flip sequences. But suppose I get lazy. I don’t want to flip the coin three times. Instead, I get an eight-sided die and roll that once. It gives me a number 1-8, but I don’t want to send you a number. I want to send you a sequence of heads and tails, since I’m trying to trick you into thinking I’m not a lazy bum. So I make a little chart like this:

1 -> HHH
2 -> HHT
3 -> HTH
4 -> HTT
5 -> THH
6 -> THT
7 -> TTH
8 -> TTT

Then whatever number I roll, I just look up the corresponding sequence on my chart and send off the email. Each sequence comes up $\frac{1}{8}$ of the time, just as it would if I were actually flipping the coin. You’d never know the difference. The information you receive must be exactly the same – three bits.
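The chart is just the binary expansion of the roll minus one, with 0 written as H and 1 as T. A one-line Python sketch:

```python
# Map each die roll 1-8 to a 3-character H/T string: the three binary
# digits of (roll - 1), writing 0 as 'H' and 1 as 'T'.
chart = {n: format(n - 1, '03b').replace('0', 'H').replace('1', 'T')
         for n in range(1, 9)}
```

This reproduces the table above (`chart[1]` is `'HHH'`, `chart[8]` is `'TTT'`), and the eight sequences are all distinct, so a fair roll makes each one equally likely.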

All this discussion leads to two conclusions. First, the information generated by a random process (flipping a coin, rolling a die, etc.) depends only on the probability distribution, not on the details of the physical process. If any two processes have the same probability distribution, the way three coin flips and one octahedral die roll do, they have the same information.

Second, $n$ coin flips have $n$ bits of information, and this is equivalent to a rolling a die with $2^n$ sides. So if we have a flat probability distribution (all outcomes equally likely) with $2^n \equiv x$ possible outcomes, that has $n = \log_2x$ bits of information. Let’s say this $\log_2x$ relation holds true for all integer $x$, whether or not $x$ is a power of two.

Suppose we roll two dice, one with $x$ sides and the other with $y$. The total information generated should, we expect, be the sum of the information generated by each of the dice separately, since they’re totally independent. When we roll the dice, the result is an ordered pair with any of $x$ values in the first slot and any of $y$ values in the second. The total number of ordered pairs is $xy$, all equally likely. Then the information is $\log_2(xy) = \log_2(x) + \log_2(y)$. So the information in two dice rolls is just the sum of the information in the individual rolls.
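This additivity is just the product rule for logarithms; as a quick check:

```python
from math import log2

x, y = 6, 20                    # e.g. a 6-sided die and a 20-sided die
info_pair = log2(x * y)         # information in the ordered pair
info_separate = log2(x) + log2(y)  # sum of the separate informations
```

The two quantities are equal for any `x` and `y`, which is what makes the logarithm the natural choice here.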

We still need to figure out how to quantify the information contained in distributions that aren’t flat. For example, suppose my coin had two heads on it, so that every morning I sent you an email that just said, “H”. That wouldn’t give you any information at all, because you could easily predict what I was going to send you. There’s no magic there. You could just write a little program on your computer to send yourself an email with “H” each morning and cut me totally out of the picture with the same exact result. Clearly, I’m generating zero information in this case.

There could be intermediate cases, though. What if my coin gave heads 75% of the time and tails 25%? It would be semi-predictable, semi-random. Then the information should be somewhere in between one bit (as for the flat, totally unpredictable distribution) and zero bits (as for the totally predictable one).

To handle this, we’ll re-examine the flat distribution. There are $x$ equally-likely outcomes. Let’s say that each outcome carries its own little packet of information, and that adding them all up, we get the total information. So in flipping a fair coin, the heads side alone has half a bit of information. The tails side has the other half. Adding them, we get the full one bit.

Then each outcome in the flat distribution has

$\frac{\log_2x}{x}$

bits associated with it, so that when we add them all up we get the correct $\log_2x$ total bits.

Now we can do distributions like

$\{\frac{1}{7}, \frac{1}{2}, \frac{1}{3},\frac{1}{42}\}$.

We just find the individual information of each outcome based on its probability, then add.

$\frac{1}{7}\log_27 + \frac{1}{2}\log_22 + \frac{1}{3}\log_23 + \frac{1}{42}\log_242$.
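The same sum in Python, for any list of probabilities:

```python
from math import log2

probs = [1/7, 1/2, 1/3, 1/42]

# Each outcome contributes p * log2(1/p) bits; sum the contributions.
total_info = sum(p * log2(1 / p) for p in probs)
```

For this distribution the total comes to about 1.56 bits, a little less than the 2 bits a flat four-outcome distribution would give.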

In general, if we write $p = 1/x$ for the probability of a given outcome, we can write the total information as

$\sum p\log_2\frac{1}{p} = - \sum p\log_2p$.

This holds true for all distributions made by adding up bits and pieces of flat guys. But why not extend the definition to all (discrete) distributions whatsoever? This is just the principle that, as far as our mathematical formalism goes, fractions should not be treated as special just because their numerator is $1$.

With this final conclusion, we reach the desired definition:

$I = - \sum p \log_2p$.

The only difference is that you don’t have to use $2$ as the base of the logarithm. You could use any other number you wanted. For example, if you were using English characters rather than heads and tails as your fundamental unit, you might use base 26. This just multiplies the information by a constant factor, though, and so isn’t very important.