My Peers’ Birthdays

May 18, 2011

follow-up to My Friends’ Birthdays

The main conclusion I drew from examining my Facebook friends’ birthdays is that I didn’t have enough data to see the birth month effect – the idea that your month of birth influences your success in a field because it determines your age relative to your peers early on in sports or school.

The birth month effect is real in some circumstances. Just now, I searched for “US junior baseball team” and found this roster.

In Outliers, Malcolm Gladwell explained that the cutoff date for youth baseball leagues in the US is July 31. (It’s now changed to May 1, so in ten years we can do this experiment over and see the effect.) Thirteen players on the roster were born in the half of the year directly following July 31 (August through January), and only five were born in the next half (February through July). With data like that, even a sample of eighteen people is enough to see the strong effects that birth month has on athletic success. If each player were equally likely to be born in either half, the odds of thirteen or more out of eighteen landing in the first half by random chance are about 5%.
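The 5% figure can be checked with a short binomial calculation. This is a minimal sketch, assuming each of the 18 players independently has an even chance of falling in either half of the year; the function name `tail_probability` is just an illustrative helper, not something from the original analysis.

```python
from math import comb

def tail_probability(n, k, p=0.5):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 13 of the 18 players were born in the six months after the cutoff.
one_sided = tail_probability(18, 13)
print(f"P(13 or more of 18 in the first half) = {one_sided:.3f}")  # prints 0.048
```

This is the one-sided probability. Asking instead how often a split at least this lopsided in either direction turns up would double it.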

If 18 baseball players is enough to see a significant birth month effect in sports, then shouldn’t more than 100 Facebook friends have been plenty to see it in education?

In American education, there is no firm, uniform cut-off date like there is with baseball. Different states have different dates. Also, parents may have a choice about when to send their child to kindergarten if the child is born in a certain window. I was born in December in Maryland, where entering kindergarteners must be five years old by December 31. I could have been one of the youngest students in my grade, but my parents held me back, making me one of the oldest. Their stated reason was that they thought I’d appreciate being one of the first kids with a driver’s license come high school.

Mixed-up birth months, along with other obfuscating factors the reader may imagine, could easily make a real signal difficult to pick up, so I asked the Caltech registrar’s office for data on all the domestic Caltech students. They kindly obliged, with birth months tallied for the 5083 students enrolled since 1985. I was asked not to release the data directly, but I can report on its statistics.

Since September to December babies can be either old or young when entering kindergarten, let’s leave them out. The hypothesis is that entering Caltech students are more likely to be born in the January to April time frame than May to August. (If you want to be a stickler for experimental design, we could say that the null hypothesis is that students are equally likely to be in those categories.)

There were 3399 students whose birth months fell into one of these two ranges. If each student were a simple binomial variable with even probability we’d expect a standard deviation of 29 in the number of students in each range. We should also take into account that these periods aren’t perfectly equal in numbers of births. According to a Google result, a baby born anywhere from January to August has a 51.85% chance of being born in the May-August window, due partially to the three extra days and partially to higher birth rates. Thus, we expect that if domestic Caltech students have birth month patterns that mirror the American population at large, there should be 1762 +/- 30 students born in the May-August window. If there are fewer than 1700, we have evidence that Caltech students are less likely to be born in the summer.

The statistic is 1713 born in those months, compared to 1686 in January – April. The discarded period, from September to December, has 1684. There is no significant evidence to suggest that Caltech students are more likely to be born in any particular month.
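For anyone who wants to redo the arithmetic, here is a rough sketch of the comparison, using only the numbers quoted above (3399 students, a 51.85% chance of a May-August birth, and 1713 observed). Treating each student as an independent binomial draw is the same simplification used in the text, and the variable names are my own.

```python
from math import sqrt

n_students = 3399          # students born January through August
p_summer = 0.5185          # chance of a May-August birth, as quoted above
observed_summer = 1713     # Caltech students actually born May-August

# Binomial mean and standard deviation for the May-August count.
expected = n_students * p_summer
sd = sqrt(n_students * p_summer * (1 - p_summer))

# How many standard deviations the observation sits from expectation.
z = (observed_summer - expected) / sd

print(f"expected {expected:.0f} +/- {sd:.0f}, observed {observed_summer}, z = {z:.2f}")
```

The observed count comes out a little under two standard deviations below the expectation, which is why it stays above the 1700 threshold set earlier.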

This certainly doesn’t disprove the idea that your month of birth impacts your success in school, but the effect, if present, is not as powerful in education as it is in organized sports.

My Friends’ Birthdays

May 2, 2011

Malcolm Gladwell’s Outliers describes how elite hockey players in Canada are far more likely to be born in the first half of the year than the second. There’s a simple explanation – Canadian youth hockey leagues bin age group teams according to the calendar year of birth. Two young players, one born January 1, 2003 and the other December 31, 2003, are considered the same age and play in the same league.

Being born at the beginning of the year makes you a few months older than most of your peers. When you’re eight years old, those few months equate to a big advantage in physical maturity. Being more mature, you perform better, get selected for elite teams, and receive better training. You get better and better while your peers born near the end of the year are left behind.

The data shown in the book are convincing. The phenomenon is seen not just in Canadian hockey, but in a host of other sports where a similar age cutoff exists, and when the cutoff date changes from January 1, the distribution of birth months changes, too. (Basketball in the US is one exception, presumably because kids learn on the streets regardless of their birth month, and don’t need to be selected for elite training until later on.)

Then Gladwell goes on to suggest that the same effect dominates academic achievement in the US.

Parents with a child born at the end of the calendar year often think about holding their child back before the start of kindergarten: it’s hard for a five-year-old to keep up with a child born many months earlier. But most parents, one suspects, think that whatever disadvantage a younger child faces in kindergarten eventually goes away. But it doesn’t. It’s just like hockey. The small initial advantage that the child born in the early part of the year has over the child born at the end of the year persists. It locks children into patterns of achievement and underachievement, encouragement and discouragement, that stretch on and on for years.

Recently, two economists – Kelly Bedard and Elizabeth Dhuey – looked at the relationship between scores on what is called the Trends in International Mathematics and Science Study, or TIMSS (math and science tests given every four years to children in many countries around the world), and month of birth. They found that among fourth graders, the oldest children scored somewhere between four and twelve percentile points better than the youngest children. That, as Dhuey explains, is a “huge effect.” It means that if you take two intellectually equivalent fourth graders with birthdays at opposite ends of the cutoff date, the older student could score in the eightieth percentile, while the younger one could score in the sixty-eighth percentile. That’s the difference between qualifying for a gifted program and not.
pp 28

The first paragraph seems like a rather wild extrapolation, based solely on the second.

I wanted to know if I could see this birthday effect in some data I had readily available – that generated by my Facebook friends.

I have about 700 Facebook friends, many of whom were Caltech students. These people represent an academic elite, so if the birthday effect is extraordinarily strong, I ought to have friends whose birthdays come in a clump, assuming they are educated in the US.

I tallied the birth months of all my Facebook friends who are or were students at Caltech and who listed themselves as being from somewhere in the US. (I wound up throwing out a lot of people from the US because they didn’t list a home town, but I thought it was better to have a uniform data collection policy than to guess.) 110 people made the cut.

I made a plot of their birth months, and it looked like maybe there was some sort of signal in there. So then I made seven fake plots by randomly generating birth months from a flat distribution. Here are the eight plots. Can you tell which one is the real data?

One of these plots is real data from the birth months of students at one of the world’s top universities. The other seven plots are as random as Python can make them. I challenge Malcolm Gladwell to tell me which one is which.
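The script that made the fake plots isn't shown here, but something along these lines would produce comparable tallies. It's a sketch only: it assumes each of the 110 fake people gets a month drawn uniformly from 1 to 12, and the function name and the seeds are hypothetical choices made so the example is reproducible.

```python
import random
from collections import Counter

def fake_birth_months(n_people=110, seed=None):
    """Tally n_people birth months drawn uniformly from January..December."""
    rng = random.Random(seed)
    months = [rng.randint(1, 12) for _ in range(n_people)]  # randint is inclusive
    counts = Counter(months)
    # Return counts in calendar order, with 0 for any month that never came up.
    return [counts.get(month, 0) for month in range(1, 13)]

# Seven fake data sets to sit alongside the one real one.
fake_panels = [fake_birth_months(seed=s) for s in range(7)]
for tally in fake_panels:
    print(tally)
```

Each printed row is one fake panel: twelve monthly counts that add up to 110.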

This challenge is a bit unfair. What I really ought to plot is not birth month, but age when starting kindergarten. These aren’t the same, largely because people born near the end of the year (like me) can wind up either old for their grade or young for it. Still, January babies are almost uniformly old for their grade in the US, and August babies are almost uniformly young. If the effect is as powerful as Gladwell suggests, we ought to see it at play here.

If birth months were evenly distributed and I took 110 data points, the expectation value for each month is 9.2 and the standard deviation is 2.9. Since the standard deviation is pretty big compared to the expectation value, we would need a large signal in order to see an obvious effect in the data. So to make a strong case, I really ought to have more data.
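Those two numbers come from treating the count in any single month as a binomial variable with 110 trials and a one-in-twelve chance per trial. A quick sketch, with variable names of my own choosing:

```python
from math import sqrt

n_friends = 110
p_month = 1 / 12   # flat distribution: every month equally likely

mean_per_month = n_friends * p_month
sd_per_month = sqrt(n_friends * p_month * (1 - p_month))

print(f"{mean_per_month:.1f} +/- {sd_per_month:.1f}")  # prints 9.2 +/- 2.9
```

With a spread of about a third of the mean, a month would need to be off by six or so friends before it stood out at the two-sigma level.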

Still, we actually expect whatever effect there is to be magnified when looking at this data. The reason is that, with Caltech students, we’re looking far out on the tail of the distribution of academic ability.

Here are two normal distributions that are the same except that one is shifted to the right.

The original Gaussian, centered on zero, represents students born late in the year, and the shifted one represents students born near the beginning. (This is only supposed to be a heuristic, of course.)

I’ve added two vertical lines. The first vertical line shows a cutoff for students who are “good at school”. There are about three times as many students from the shifted distribution that make it beyond this cutoff.

The second line shows students who are “very good at school”. There are about ten times as many students from the shifted distribution that make it beyond this tougher cutoff.

Even though I don’t expect the age-selection effect to work in such a simple way, the main idea is simply that if you give one population a small advantage over the other, the effect becomes magnified when you look at the frequency of outliers. So, in the birthdays of my Caltech friends, I ought to see a pretty strong signal, if the basic effect exists.
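The magnification is easy to reproduce numerically. The sketch below is an illustration under assumed numbers, not a reconstruction of the figure: the shift of 0.5 standard deviations and the cutoffs at 2.0 and 4.5 are hypothetical values picked only to show the trend, and `upper_tail` is a helper defined here from the standard complementary error function.

```python
from math import erfc, sqrt

def upper_tail(x, mean=0.0, sd=1.0):
    """Fraction of a normal(mean, sd) population lying above x."""
    return 0.5 * erfc((x - mean) / (sd * sqrt(2)))

SHIFT = 0.5               # hypothetical head start, in standard deviations
CUTOFFS = (2.0, 4.5)      # hypothetical "good" and "very good" thresholds

ratios = []
for cutoff in CUTOFFS:
    # Students above the cutoff: shifted population relative to unshifted.
    ratio = upper_tail(cutoff, mean=SHIFT) / upper_tail(cutoff)
    ratios.append(ratio)
    print(f"cutoff {cutoff}: shifted/unshifted = {ratio:.1f}")
```

The same half-sigma shift produces a much larger ratio at the tougher cutoff than at the easier one, which is the whole point: the further out on the tail you look, the more a small advantage is amplified.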

So, for now I’d say that, lacking further data, either the effect is not very large, or it is not very simple, so that Caltech students somehow form an exceptional bunch.