Archive for the ‘math’ Category

Why Least Squares?

February 7, 2012

Rod and Pegboard

Suppose you have a bunch of pegs scattered around on a wall, like this:

You see a general trend, and you want to take a rod and use it to go through the points in the best possible way, like this:

How do you decide which way is best? Here is a physical solution.

To each peg you attach a spring of zero rest length. You then attach the other side of the spring to the rod. Make sure the springs are all constrained to be vertical.

Now let the rod go. If most of the points are below it, the springs on the bottom will be longer, exert more force, and pull the rod down. Similarly, if the rod’s slope is shallower than that of the trend in the points, the rod will be torqued up to a steeper slope. The final resting place of the rod is one sort of estimate of the best straight-line approximation of the pegs.

To see mathematically what this system does, remember that the energy stored in a spring of zero rest length is proportional to the square of its length. The system finds a stable static equilibrium, so it is at a minimum of potential energy. Thus, assuming the springs are identical, this best-fit line is the line that minimizes the sum of the squares of the lengths of the springs, or the sum of the squares of the residuals, as they’re called.

This picture lets us find a formula for the least-squares line. To be in equilibrium, the rod must have no net force on it. The force exerted by a spring is proportional to its length, so the lengths of all the springs must add to zero. (We count length as positive if the peg is above the rod and negative otherwise.)

Mathematically, we’ll write the points as (x_i, y_i) and the line as y = mx+b. Then the no-net-force condition is written

\sum_i \left(y_i - (mx_i+b)\right) = 0

There must also be no net torque on the rod. The torque exerted by a spring (relative to the origin) is its length multiplied by its x_i. That means

\sum_i x_i \left(y_i - (mx_i + b)\right) = 0

These two equations determine the unknowns m and b. The reader will undoubtedly be unable to stop themselves from completing the algebra, finding that if there are N data points

m = \frac{\frac{1}{N}\sum_i x_iy_i - \frac{1}{N}\sum_i y_i \frac{1}{N}\sum_i x_i}{\frac{1}{N} \sum_i x_i^2 - (\frac{1}{N}\sum_i x_i)^2}

b = \frac{1}{N}\sum_i y_i - m \frac{1}{N} \sum_i x_i

These formulas clearly contain some averages. Let’s denote \frac{1}{N}\sum_i x_i = \langle x \rangle and similarly for y and combinations of the two. Then we can rewrite the formulae as

m = \frac{\langle xy\rangle - \langle x \rangle \langle y\rangle }{\langle x^2\rangle - \langle x\rangle ^2}

\langle y \rangle = m \langle x \rangle + b

This is called a least-squares linear regression.
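
Here is a minimal sketch of those formulas in Python, mostly as a sanity check that the resulting line really leaves the rod with no net force and no net torque. The function name and the peg coordinates are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_xx = sum(x * x for x in xs) / n
    # Slope and intercept, written in terms of the averages.
    m = (mean_xy - mean_x * mean_y) / (mean_xx - mean_x ** 2)
    b = mean_y - m * mean_x
    return m, b

# Hypothetical peg positions, chosen only for illustration.
pegs_x = [0.0, 1.0, 2.0, 3.0, 4.0]
pegs_y = [1.1, 2.9, 5.2, 6.8, 9.1]

m, b = least_squares_line(pegs_x, pegs_y)
residuals = [y - (m * x + b) for x, y in zip(pegs_x, pegs_y)]

# Equilibrium conditions from the spring picture.
net_force = sum(residuals)
net_torque = sum(x * r for x, r in zip(pegs_x, residuals))
print(m, b, net_force, net_torque)
```

Both `net_force` and `net_torque` should come out as zero, up to floating-point rounding.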


The story about the rod and minimizing potential energy is not really the reason we use least-squares regression; it was only a convenient illustration. Students are often curious why we do not, for example, minimize the sum of the absolute values of the residuals.

Take a look at the value \langle x^2\rangle - \langle x\rangle ^2 from the expression for the least-squares regression. This is called the variance of x. It’s a very natural measure of the spread of x – more so than the one you’d get by adding up the absolute values of the errors.

Suppose you have two variables, x and u. Then

\mathrm{var}(x+u) = \mathrm{var}(x) + \mathrm{var}(u) + \langle 2xu\rangle - 2\langle x \rangle \langle u \rangle

The reader is no doubt currently wearing a pencil down to the nub showing this.

If x and u are independent, the last two terms cancel (down to the nub!), and we have

\mathrm{var}(x+u) = \mathrm{var}(x) + \mathrm{var}(u)

In practical terms: flip a coin once and the number of heads has a variance of .25. Flip it a hundred times and the variance is 25, etc. This linearity property does not hold for absolute values.
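
The coin-flip numbers can be checked exactly with a short script. This is a sketch under the assumption of a fair coin; the helper names are mine. It also computes the mean absolute deviation, as one possible "absolute value" measure of spread, to show that it does not add up the same way.

```python
from fractions import Fraction
from math import comb

def heads_distribution(n_flips):
    """Exact probabilities of 0..n_flips heads for a fair coin."""
    total = 2 ** n_flips
    return [Fraction(comb(n_flips, k), total) for k in range(n_flips + 1)]

def heads_variance(n_flips):
    """Variance of the number of heads: <k^2> - <k>^2."""
    probs = heads_distribution(n_flips)
    mean = sum(k * p for k, p in enumerate(probs))
    mean_sq = sum(k * k * p for k, p in enumerate(probs))
    return mean_sq - mean ** 2

def heads_mean_abs_dev(n_flips):
    """Average absolute distance of the number of heads from its mean."""
    probs = heads_distribution(n_flips)
    mean = sum(k * p for k, p in enumerate(probs))
    return sum(abs(k - mean) * p for k, p in enumerate(probs))

print(heads_variance(1), heads_variance(100))
print(heads_mean_abs_dev(1), heads_mean_abs_dev(2))
```

The variance scales with the number of flips (1/4 for one flip, 25 for a hundred), while the mean absolute deviation is 1/2 for one flip and still 1/2 for two.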

So variance is a very natural measure of variation. Simple linear regression is nice, then, because it

  1. makes the mean residual zero
  2. minimizes the variance of the residuals
Defining the covariance as a generalization of the variance \mathrm{cov}(x,y) \equiv \langle xy\rangle - \langle x\rangle \langle y\rangle (so that \mathrm{var}(x) = \mathrm{cov}(x,x)), we can rewrite the slope m in the least-squares formula as
m = \frac{\mathrm{cov}(x,y)}{\mathrm{var}(x)}

The Distance Formula

The distance d of a point (x,y) from the origin is

d^2 = x^2 + y^2

In three dimensions, this becomes

d^2 = x^2 + y^2 + z^2

The generalization to n dimensions is clear.

If we imagine the residuals as the coordinates of a point in n-dimensional space, the simple linear regression is the line that brings that point in as close to the origin as possible, another cute visualization.

Further Reading

The physical analogy to springs and minimum energy comes from Mark Levi’s book The Mathematical Mechanic. Amazon Google Books

The Wikipedia articles on linear regression and simple linear regression are good.

There’s much mathematical insight to be had at Math.StackExchange, Stats.StackExchange and MathOverflow.


What is a determinant?

January 24, 2012

A simple introduction to a determinant is that it’s the area of a box.

Working in two dimensions, I’ll outline

  • the geometric picture of a linear transformation
  • the geometric picture of a determinant as an area
  • how the geometric picture leads to a few important properties of determinants
  • how the geometric picture of linear transformations can be expressed with matrices
  • how the geometric picture leads us to a formula for the determinant of a matrix

This post is long already. To keep it from becoming even longer, in some places I have had to leave out certain steps in the logic.

We’ll start with the coordinate plane. It’s a grid of points.


Then we scissor it or blow it up or shrink it down. Here are some examples:

To make them, I took the original image and applied the “shear”, “rotate”, and “scale” tools in GIMP (an open-source Photoshop equivalent). You can try it yourself on any image just by using those tools.

These are called “linear transformations”. To simplify the way we picture them, we can just look at what they do to a box at the origin. I’ll make the box 3×3 so it’s visible, but imagine that each line represents a distance 1/3, so the sides of the box are length 1.

If we wanted, we could use something more complicated:

But since the apple is made from little boxes and all little boxes get treated the same way, we might as well focus on what happens to just one box.

Under any linear transformation, the box turns into a parallelogram.

The area of that parallelogram is called the determinant of the linear transformation.

There’s one extra rule. If the red and blue sides switch (as they would if I used the “flip” tool in GIMP), the determinant is negative. Here’s an example:

Since any area is made from little boxes and each little box’s area gets multiplied by the determinant, the area of any shape at all gets multiplied by the determinant. So for the apple, the determinant is the area of the apple on the right divided by the area of the apple on the left.

So that’s what a determinant is. What remains is to show what it’s about and what it has to do with the matrices you were wondering about.

Let’s look at some properties first. Imagine doing two transformations in a row. We’ll call this “multiplying the transformations”. The result is just another linear transformation.

When we do these sequential transformations, the area of our box gets multiplied by the determinant each time. If the determinant of the first transformation is 3 and the determinant of the second transformation is 5, the area gets multiplied by 15 overall, so the determinant of the combined transformation is 15. Multiplying transformations means multiplying determinants.

Next we’ll think about inverses. An inverse is a transformation that takes you back to where you started. The inverse of a transformation that rotates 45 degrees clockwise and multiplies everything by 2 is a transformation that rotates 45 degrees counterclockwise and cuts everything in half.

The determinant of the first transformation is 4 because each side of the box is doubled. The determinant of the second transformation is 1/4.

This is a general rule. Suppose two transformations are inverses. Then their determinants must multiply to 1, because the area of the box doesn’t change overall.

Next suppose a transformation’s determinant is zero. Then it doesn’t have an inverse because any number times zero is still zero, so there’s no transformation that takes the determinant back to one.

Geometrically, a transformation with zero determinant collapses everything to a line.

The line doesn’t have to be flat like this. It could be at any angle. Also, I didn’t collapse this completely to a line, since then you couldn’t see it. Transformations with zero determinant are bad news.

To review

  • Linear transformations are some combination of the “scale”, “rotate”, “shear”, and “flip” tools in Photoshop.
  • The determinant of a linear transformation is the factor by which the transformation changes the area.
  • The determinants of inverse transformations multiply to 1.
  • If the determinant is zero, the transformation doesn’t have an inverse. (The converse of this also holds, although we didn’t discuss it.)

Let’s move on to matrices. Take a linear transformation like this:

If we superimpose the original onto the final, we can see the coordinates of the new parallelogram in terms of the original grid.

We can describe the transformation completely using four numbers, two for the coordinates of the blue side and two for the coordinates of the red side. We’ll call those numbers a, b, c, d.

We’ll represent points with column matrices. So the point (a,b) will be represented by the matrix \left[ \begin{array}{c} a \\ b \end{array} \right]. (A matrix doesn’t have to be square. This is a 2×1 matrix.)

With this notation, we can represent our linear transformation by

\left[ \begin{array}{c} 1 \\ 0 \end{array} \right] \to   \left[ \begin{array}{c} a \\ b \end{array} \right]

\left[ \begin{array}{c} 0 \\ 1 \end{array} \right] \to  \left[ \begin{array}{c} c \\ d \end{array} \right]

This actually represents the entire transformation, even though it looks like we’ve only looked at two points. The reason is that any other point is made up out of the two we’ve already examined. For example

\left[ \begin{array}{c} 4 \\ 7 \end{array} \right] =    \left[ \begin{array}{c} 4 \\ 0 \end{array} \right] +    \left[ \begin{array}{c} 0 \\ 7 \end{array} \right] \to    \left[ \begin{array}{c} 4a \\ 4b \end{array} \right] +    \left[ \begin{array}{c} 7c \\ 7d \end{array} \right] =    \left[ \begin{array}{c} 4a + 7c \\ 4b + 7d \end{array} \right]

There’s a much more convenient way to write all this, which is in the form of a 2×2 matrix. \left[ \begin{array}{c} a \\ b \end{array} \right], which is the blue part of our parallelogram, becomes the first column of the matrix. \left[ \begin{array}{c} c \\ d \end{array} \right] is the second column.

We can view matrix multiplication as

\left[ \begin{array}{cc} a & c \\ b & d \end{array} \right]   \left[ \begin{array}{c} e \\ f \end{array} \right] = e   \left[ \begin{array}{c} a \\ b \end{array} \right] +    f \left[ \begin{array}{c} c \\ d \end{array} \right] =    \left[ \begin{array}{c} ea + fc \\ eb + fd \end{array} \right]

Check that this works for the example of \left[ \begin{array}{c} 4 \\ 7 \end{array} \right].

You may have learned to do this multiplication one row at a time rather than one column at a time. The result is the same.
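
As a quick check that the column-at-a-time and row-at-a-time recipes agree, here is a small sketch. The function names and the sample matrix are my own; the matrix is stored as a pair of rows, matching how it is written on the page.

```python
def apply_by_columns(matrix, point):
    """Multiply using columns: e * (first column) + f * (second column)."""
    (a, c), (b, d) = matrix        # rows of [[a, c], [b, d]]
    e, f = point
    first_column = (a, b)          # where (1, 0) lands
    second_column = (c, d)         # where (0, 1) lands
    return tuple(e * u + f * v for u, v in zip(first_column, second_column))

def apply_by_rows(matrix, point):
    """Multiply using rows: each output entry is a row dotted with the point."""
    return tuple(sum(entry * coord for entry, coord in zip(row, point))
                 for row in matrix)

# Hypothetical transformation: (1, 0) -> (2, 0) and (0, 1) -> (1, 3).
example = ((2, 1), (0, 3))
print(apply_by_columns(example, (4, 7)))   # 4*(2, 0) + 7*(1, 3)
print(apply_by_rows(example, (4, 7)))
```

Both calls give (15, 21), which is 4a + 7c and 4b + 7d for this choice of a, b, c, d.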

This shows how matrices describe linear transformations. All that remains is to tie in the concept of a determinant.

Remembering that a determinant is the area of a box, we can find a formula for the determinant by looking at some properties of area.

The area of the original 1×1 box is 1. That means

\left| \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array}\right| = 1

because that’s the identity matrix. It’s the linear transformation that does nothing. (The vertical lines around the matrix indicate that we’re taking a determinant.)

When we switch the blue and red sides of the box, the determinant is -1. The matrix that does this is

\left| \begin{array}{cc} 0 & 1 \\ 1 & 0 \end{array}\right| = -1

When we multiply the blue side by two, the determinant gets multiplied by that same factor. Since this is represented in the matrix by multiplying the first column by two, we have

\left| \begin{array}{cc} 2 & 0 \\ 0 & 1 \end{array}\right| = 2

and similarly

\left| \begin{array}{cc} 2 & 0 \\ 0 & 3 \end{array}\right| = 6

How about

\left| \begin{array}{cc} 1 & 1 \\ 0 & 0 \end{array}\right| = ?

This matrix is not invertible. It collapses everything onto the x-axis, making a “box” of zero area, so its determinant is zero. Similarly,

\left| \begin{array}{cc} 0 & 0 \\ 1 & 1 \end{array}\right| = 0

The final property we need of determinants/areas is linearity. Check out this picture:

It requires a little explanation. There are three linear transformations here, all sharing the same red side. The first two have the blue and purple sides. These are smaller. When we add them up, we get the third one with the gray side, so this picture represents adding linear transformations (which is different than multiplying them.) The green area is the area of the big transformation with the gray side.

The two smaller ones, with the blue and purple sides, have a total area equal to the green area. We can see this because there is a triangle of stuff that’s outside the green area, and therefore not counted. However, there’s also a triangle of extra stuff in the green area that’s not part of the two smaller parallelograms. These two triangles have the same area and cancel each other out, so that the small parallelograms have the same total area as the single big one.

Translating this into matrices means we can add determinants when one column is shared. This is called linearity in a column. For example

\left| \begin{array}{cc} a & 0 \\ b & 1 \end{array}\right| +   \left| \begin{array}{cc} c & 0 \\ d & 1 \end{array}\right| =   \left| \begin{array}{cc} a+c & 0 \\ b + d & 1 \end{array}\right|

So the properties we found are

  • The determinant of the identity is one.
  • The determinant of the matrix that switches horizontal and vertical is -1.
  • Multiplying a column by a number multiplies the determinant by that number.
  • The determinant is linear in a column.

These properties combined let us find the determinant of any matrix. Start with

\left| \begin{array}{cc} a & c \\ b & d \end{array}\right|

use linearity in the first column to write this as

\left| \begin{array}{cc} a & c \\ 0 & d \end{array}\right| +   \left| \begin{array}{cc} 0 & c \\ b & d \end{array}\right|

now use linearity in the second column to make it

\left| \begin{array}{cc} a & c \\ 0 & 0 \end{array}\right| +   \left| \begin{array}{cc} a & 0 \\ 0 & d \end{array}\right|    +   \left| \begin{array}{cc} 0 & c \\ b & 0 \end{array}\right| +   \left| \begin{array}{cc} 0 & 0 \\ b & d \end{array}\right|

We have already set up the tools to evaluate each of these individually. The determinant is

\left| \begin{array}{cc} a & c \\ b & d \end{array}\right| = 0 + ad - cb + 0 = ad - cb

That’s the area of the parallelogram. You could find it by other geometrical means, too, but knowing the formula for the determinant makes it easy.
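
With the formula in hand, the geometric claims from earlier (multiplying transformations multiplies determinants, inverses have reciprocal determinants, a flip has determinant -1) can be spot-checked numerically. This is only a sketch; the helper names and the particular matrices are made up for the purpose.

```python
import math

def det2(m):
    """Determinant of [[a, c], [b, d]], i.e. ad - cb."""
    (a, c), (b, d) = m
    return a * d - c * b

def matmul2(p, q):
    """Matrix product p*q: the transformation q followed by p."""
    return tuple(
        tuple(sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def rotate_and_scale(degrees, factor):
    """Rotate counterclockwise by `degrees`, then scale by `factor`."""
    t = math.radians(degrees)
    return ((factor * math.cos(t), -factor * math.sin(t)),
            (factor * math.sin(t), factor * math.cos(t)))

shear = ((1, 2), (0, 1))       # slides the top of the box sideways
stretch = ((3, 0), (0, 5))     # scales x by 3 and y by 5
flip = ((0, 1), (1, 0))        # swaps the blue and red sides

double_cw = rotate_and_scale(-45, 2)     # 45 degrees clockwise, doubled
halve_ccw = rotate_and_scale(45, 0.5)    # its inverse

print(det2(shear), det2(stretch), det2(matmul2(shear, stretch)))
print(det2(flip))
print(det2(double_cw), det2(halve_ccw))
```

The shear leaves area alone, the stretch multiplies it by 15, and so does their product. The rotate-and-double transformation has determinant 4 and its inverse has determinant 1/4, up to rounding.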

Why is the integral of 1/x equal to the natural logarithm of x?

December 17, 2011

The title of this post asks a question that many calculus students find befuddling. Here I’ll give some geometric intuition behind it. I leave small logical gaps to avoid cheating the reader of the pleasure of their discovery.

One essential feature of logarithms is that they make a multiplication problem equivalent to an addition problem, by which I mean

\ln(ab) = \ln(a) + \ln(b)

Meanwhile, \int\frac{1}{x}\mathrm{d}x is usually thought of geometrically as the area underneath a curve. The problem, then, is to try to see visually what an area under a curve has to do with turning multiplication into addition.

Here’s a graph of 1/x, and we’re finding, as an example, the area under it from 1 to 2.


Let’s say now that we multiply the limits of integration by two, so we’re now finding the area from 2 to 4. Here’s what that looks like.

second integral

The two portions are actually very similar to each other in their overall shape. The orange one is twice as wide as the green one, but also half as tall. Here they are overlaid.

overlaid integrals

If you take the green shape and first squash it down vertically by a factor of two then stretch it out horizontally by a factor of two, you get the orange shape exactly. (If you don’t believe this, convince yourself it works!) This means the areas of these shapes are exactly the same, even though we don’t know what that area is.

Show for yourself that this result is general. The area under 1/x from a to b is the same as that from ac to bc.

What, then, is the area from 1 to 6? We can break it into two parts – the area from 1 to 2 and the area from 2 to 6. But the area from 2 to 6 is the same as the area from 1 to 3, by the above reasoning.

Thus, the area from 1 to 6 is the same as the sum of the areas from 1 to 2 and from 1 to 3. Note that 6 = 3*2. Again, this is general. The area under 1/x from 1 to ab is the same as the sum of the areas from 1 to a and from 1 to b.
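
If you would rather see numbers than squash shapes in your head, the area claims can be checked with a crude numerical integral. This sketch uses the midpoint rule; the function name and step count are my own choices, and the logarithm is deliberately never called.

```python
def area_under_reciprocal(lo, hi, steps=100_000):
    """Midpoint-rule estimate of the area under 1/x from lo to hi."""
    width = (hi - lo) / steps
    return sum(width / (lo + (i + 0.5) * width) for i in range(steps))

# Stretching both limits by the same factor leaves the area unchanged.
print(area_under_reciprocal(1, 2), area_under_reciprocal(2, 4))

# The area from 1 to 6 splits into the areas from 1 to 2 and from 1 to 3.
print(area_under_reciprocal(1, 6),
      area_under_reciprocal(1, 2) + area_under_reciprocal(1, 3))
```

Each printed pair should agree to many decimal places.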

That’s pretty good motivation for the definition

\ln(x) = \int_1^x\frac{1}{t}\mathrm{d}t

Note that this is being taken as a definition of the natural logarithm, not a proof of the relationship. Our argument about the integral of 1/x now translates to the statement

\ln(ab) = \ln(a) + \ln(b)

Now, step by step, we will show that all the other properties you expect of the natural logarithm follow from this definition.

It is evident that

\ln(1) = 0

Our definition implies that the logarithm grows without bound because if we continually multiply the argument of the logarithm by two, we continually add \ln(2) to the value. (i.e. \ln(2x) = \ln(x) + \ln(2)). Since we can multiply any number by two over and over, we can add \ln(2) to the logarithm as many times as we want. That means we can make the logarithm arbitrarily big.

This also means that starting the integral from 1 rather than from zero was a good idea. If we start from zero, the integral is infinite. We can see this because 1/x is symmetric about the line y = x.


This implies that the area to the left of the curve is the same as the area under the curve, like this.


We just showed that the area under the curve diverges as we move the right hand side of the integral out to infinity, so the area to the left of the curve diverges, too. If we started the integral at zero, it would be infinite.

What about taking the logarithm of numbers less than one? A good check of whether everything makes sense so far is to work out that \ln(1/x) = - \ln(x).

Since the area under 1/x starts at zero when x=1 and goes up infinitely, it is clear that there must be some number x such that \ln(x) = 1. Let’s choose to call that number e. We don’t know what it is yet, but it certainly exists. Thus

\ln(e) = 1

Again, this is definition, not proof.

It is immediately apparent that, for example, \ln(e^5) = \ln(e*e*e*e*e) = 5\ln(e) = 5. That makes e a pretty handy number. It shows us that the logarithm of a number x is how many times you need to multiply e by itself in order to get x.

How about \ln(e^{3/2})? That is \ln\left([e^{1/2}]^3\right) = 3\ln(e^{1/2}). So in order to understand logarithms of rational numbers, we need to understand roots of e.

That’s not so hard, though.

\ln(e^{1/2}*e^{1/2}) = \ln(e) = 1.

On the other hand,

\ln(e^{1/2}*e^{1/2}) = \ln(e^{1/2}) + \ln(e^{1/2}) = 2 \ln(e^{1/2})

From this we deduce \ln(e^{1/2}) = 1/2. Returning to the unfinished example, \ln(e^{3/2}) = 3*(1/2) = 3/2. It is not a great leap to say that for any rational number x, we have

\ln(e^x) = x

This is an important result; it is probably the definition of \ln(x) that you’re used to. The pieces are falling into place. The main remaining hurdle is to find the value of e and show it comes to what we expect.

Before that, we should mention how the above relation works for irrational numbers. Irrational numbers are squeezed in between the rational ones, and since the definition of the logarithm as the area under a curve is evidently smooth, the logarithm of an irrational number is squeezed in tightly as well. Ultimately, the above relation holds for all positive numbers. However, the fine details of real numbers are more involved than I would like to address here. (The logarithm of a negative number or of zero isn’t defined, at least not in the real numbers. What is a difficulty with doing so?)

Finally, we would like some way of determining what e is. Here is one way to do it. For small values of x, we can see that

\ln(1+x) \approx x

This follows from the extremely simple approximation below.


The red box is an approximation to the area of the green integral. The red box clearly has area x while the green integral is \ln(1+x). Thus

\ln(1+x) \approx x

It’s crude, but it works better and better as x becomes tiny. Multiplying both sides of the approximation by 1/x, we get

\frac{1}{x}\ln(1+x) \approx 1

We know how to rewrite the left hand side. It gives

\ln\left([1+x]^{1/x}\right) \approx 1

Since we have defined e by \ln(e) = 1, we finally see

e = \lim_{x\to 0} (1+x)^{1/x}

This is the common definition of e. At last we see that the reason that the integral of 1/x is \ln(x) is that all the properties of the two functions are exactly the same, and so they must be the same function.
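
To watch the limit settle down, here is a tiny sketch that evaluates (1+x)^{1/x} for shrinking x and compares it with the value of e that Python ships with. The function name is mine.

```python
import math

def e_estimate(x):
    """Evaluate (1 + x)**(1/x), which approaches e as x shrinks to zero."""
    return (1 + x) ** (1 / x)

for x in (0.1, 0.01, 0.001, 1e-6):
    print(x, e_estimate(x), math.e - e_estimate(x))
```

The gap shrinks roughly in proportion to x, which fits the crude box approximation getting better as x becomes tiny.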

My Peers’ Birthdays

May 18, 2011

follow-up to My Friends’ Birthdays

The main conclusion I drew from examining my Facebook friends’ birthdays is that I didn’t have enough data to see the birth month effect – when your month of birth influences your success in a field because it decides your relative age to your peers early on in sports or school.

The birth month effect is real in some circumstances. Just now, I searched for “US junior baseball team” and found this roster.

In Outliers, Malcolm Gladwell explained that the cutoff date for youth baseball leagues in the US is July 31. (It’s now changed to May 1, so in ten years we can do this experiment over and see the effect.) Thirteen players on the roster were born in the half of the year directly following July 31 (August through January), and only five were born in the next half (February through July). With data like that, even a sample of eighteen people is enough to see the strong effects that birth month has on athletic success. The odds of such lopsidedness occurring by random chance are about 5%.
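
For the curious, the 5% figure can be reproduced with a one-sided binomial tail. This sketch assumes each player is equally likely to be born in either half of the year, which ignores small differences in birth rates; the function name is my own.

```python
from math import comb

def tail_probability(successes, trials):
    """P(at least `successes` heads in `trials` fair coin flips)."""
    favorable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favorable / 2 ** trials

# 13 or more of the 18 players born in August through January.
print(tail_probability(13, 18))
```

This comes to 12616/262144, a little under 5%.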

If 18 baseball players is enough to see a significant birth month effect in sports, then shouldn’t more than 100 Facebook friends have been plenty to see it in education?

In American education, there is no firm, uniform cut-off date like there is with baseball. Different states have different dates. Also, parents may have a choice about when to send their child to kindergarten if the child is born in a certain window. I was born in December in Maryland, where entering kindergarteners must be five years old by December 31. I could have been one of the youngest students in my grade, but my parents held me back, making me one of the oldest. Their stated reason was that they thought I’d appreciate being one of the first kids with a driver’s license come high school.

Mixed-up birth months, along with other obfuscating factors the reader may imagine, could easily make a real signal difficult to pick up, so I asked the Caltech registrar’s office for data on all the domestic Caltech students. They kindly obliged, with birth months tallied for the 5083 students enrolled since 1985. I was asked not to release the data directly, but I can report on its statistics.

Since September to December babies can be either old or young when entering kindergarten, let’s leave them out. The hypothesis is that entering Caltech students are more likely to be born in the January to April time frame than May to August. (If you want to be a stickler for experimental design, we could say that the null hypothesis is that students are equally likely to be in those categories.)

There were 3399 students whose birth months fell into one of these two ranges. If each student were a simple binomial variable with even probability we’d expect a standard deviation of 29 in the number of students in each range. We should also take into account that these periods aren’t perfectly equal in numbers of births. According to a Google result, a baby born anywhere from January to August has a 51.85% chance of being born in the May-August window, due partially to the three extra days and partially to higher birth rates. Thus, we expect that if domestic Caltech students have birth month patterns that mirror the American population at large, there should be 1762 +/- 30 students born in the May-August window. If there are fewer than 1700, we have evidence that Caltech students are less likely to be born in the summer.

The statistic is 1713 born in those months, compared to 1686 in January – April. The discarded period, from September to December, has 1684. There is no significant evidence to suggest that Caltech students are more likely to be born in any particular month.
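
The same arithmetic in script form, as a sketch: it treats each of the 3399 students as an independent draw with a 51.85% chance of a May-August birthday, exactly as described above. The variable names are mine.

```python
from math import sqrt

n_students = 3399          # January-August birthdays in the Caltech data
p_summer = 0.5185          # chance a January-August baby is born May-August
observed_summer = 1713     # Caltech students born May-August

expected_summer = n_students * p_summer
std_dev = sqrt(n_students * p_summer * (1 - p_summer))
z_score = (observed_summer - expected_summer) / std_dev

print(expected_summer, std_dev, z_score)
```

The observed count sits about 1.7 standard deviations below the expected one, short of the two-standard-deviation threshold of roughly 1700 set above.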

This certainly doesn’t disprove the idea that your month of birth impacts your success in school, but the effect, if present, is not as powerful in education as it is in organized sports.

Visualizing Elementary Calculus: Statics

April 29, 2011

This post assumes a little physics, specifically the relationship between work and energy.

This series:

I – Introduction
II – Trigonometry
III – Differentiation Rules 1
IV – Graphs, Tangents, Derivatives
V – Optimization
VI – Statics


In physics, we sometimes like to look at stuff that isn’t doing anything. This is called “statics”. It’s kind of boring after a while, which is why you would only take an entire course on it if you’re an engineer.

If something is stationary, there must be no net force on it. That means that if you move it around a little bit, the work done is zero and its energy doesn’t change. This is called the principle of virtual work. The result is that when things aren’t moving, they are generally at a minimum of potential energy. In theory they could be at a maximum or other stationary point, but these equilibrium states are unstable – the difference between a ball nestled securely at the bottom of a valley and perched precariously at the top of a mountain.


Here’s a picture of a disappointing jug of milk.

The water rises to the same height in the handle and in the main body. It’ll do this even if you make a hundred little tubes, all with different shapes, even if they’re curved around in strange ways. How does the water in one tube know how high the water is in all the others?

The water must be at a height such that a small movement of water from the handle to the body would cause no change in potential energy. Since the potential energy is a function of height, the water must be the same height everywhere, else we would be able to release energy by moving some water from high to low.

Hill and Chain

There’s a lumpy hill with a chain lying on it. When will the chain be stationary, assuming the hill is frictionless?

The condition is that the potential energy of the chain shouldn’t change if you move it a little left or right. Assuming the chain is uniform, moving it a little bit to the right is identical to chopping a little bit off the left hand side and moving it all the way over to the right.

This chopping operation isn’t supposed to change the potential energy, so the left and right hand sides of the chain must be at the same height. That’s the condition for equilibrium.

Unlike the water in the jug, this is an unstable equilibrium because if the chain is just a little bit off, it will fall towards the side that’s already further down, making that side drop down even further, and the chain will get further out of equilibrium.

Push Ups

I have often been asked why it is much easier to do a push up than to bench press your body weight. The motion of your arms is essentially the same, so shouldn’t the tasks be about equally difficult?

A push up, unlike a bench press, is a sort of lever. We can model it like this:

The horizontal brown stick is a board – your body as you do the push up. The triangle is the fulcrum – your feet. The gray ball represents your body weight. Your true weight is actually distributed from your feet to your head, but the ball represents an average, called your center of mass. The green arrow represents the force from your arms on your body.

A force is just a force, regardless of where it comes from, so instead of your arms, we’ll imagine the same force is generated by a mass on a pulley.

The blue circle is a pulley. There’s a rope tied to the end of the board that goes around the pulley and attaches to a platform with a weight on it.

In order for the system to be in equilibrium, it must be at a stationary point of its potential energy. We would like \textrm{d}U/\textrm{d}\theta = 0, with U the potential energy of the system and \theta the angle of the board with the horizontal.

U changes when the weights change height. If we slant the board up by a small angle, the weight on the board will go up and the weight on the platform will go down. We need the resulting changes in potential energy to cancel each other.

Let the distance from the fulcrum to the weight on the board (the body weight) be L_B, and the body weight itself w_B. U_B = w_B h_B, with h_B the height of the weight. Then \textrm{d}U_B = L_B w_B \textrm{d}\theta. (Draw a picture to see this. Also, try finding some of the assumptions being used). Similarly \textrm{d}U_w = L_w w_w \textrm{d}\theta. These need to be equal, so

L_B w_B = L_w w_w

w_w = w_B \frac{L_B}{L_w}

Your center of mass is maybe 70% of the way to your shoulders, so doing a push up requires a force about 70% of your body weight.
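
Plugging in numbers makes the lever balance concrete. The figures below are hypothetical (a 700 N person with the center of mass 1.05 m from the feet and hands 1.5 m from the feet), and the function name is mine; the sketch also checks that a small tilt of the board leaves the total potential energy unchanged.

```python
def push_up_force(body_weight, dist_to_center_of_mass, dist_to_hands):
    """Balancing force at the hands: w_w = w_B * L_B / L_w."""
    return body_weight * dist_to_center_of_mass / dist_to_hands

# Hypothetical values, for illustration only.
w_B, L_B, L_w = 700.0, 1.05, 1.5
w_w = push_up_force(w_B, L_B, L_w)

# Virtual work: tilt the board up by a small angle. The body weight rises,
# the weight on the platform falls, and the two energy changes should cancel.
d_theta = 1e-6
energy_change = L_B * w_B * d_theta - L_w * w_w * d_theta

print(w_w, w_w / w_B, energy_change)
```

With these numbers the hands supply 490 N, which is 70% of the body weight.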


Sometimes it’s easy to work with energy, but other times it’s easier to work with forces. If we hang a chain between two points, we could find its shape by minimizing its energy, subject to the constraint that the length is constant. This requires the more-advanced calculus of variations.

On the other hand, we can still analyze the hanging chain in terms of force with elementary calculus.

That’s the chain, hanging between two posts. We’ll zoom in on just the red part.

\textrm{d}x and \textrm{d}y show the length and width of the entire segment. There are two tension forces, T_1 and T_2, acting on the segment, and their components are shown in the picture.

The segment doesn’t go left or right, so T_{1x} = T_{2x}. The segment doesn’t go up or down, either, so the vertical components of the tension must balance gravity. That means \textrm{d}T_y = \sqrt{\textrm{d}x^2 + \textrm{d}y^2}\lambda g, with g the acceleration due to gravity and \lambda the mass per unit length of the segment.

Tension must also point along the direction of the rope, so T_y/T_x = \textrm{d}y/\textrm{d}x.

Combining this algebraically gives

\frac{\textrm{d}T_y}{\textrm{d}x} = \sqrt{1 + (T_y/T_x)^2} g \lambda

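The trick is then to figure out what function satisfies this equation. A trig function looks promising because of the relation \sin(\theta) = \sqrt{1 - \cos^2\theta}, but we need a plus sign under the square root. The hyperbolic functions supply it: \cosh(u) = \sqrt{1 + \sinh^2 u}, and the derivative of \sinh is \cosh. In fact, the solution is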

\frac{T_y}{T_x} = \frac{\textrm{d}y}{\textrm{d}x}  = \sinh \left( \frac{g\lambda}{T_x}x + C_1 \right)

y = \frac{T_x}{g\lambda} \cosh\left( \frac{g\lambda}{T_x}x + C_1 \right) + C_2

This is called a catenary curve.
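
A numerical check that the hyperbolic sine really solves the differential equation is reassuring. This sketch puts the lowest point of the chain at x = 0, so the slope there is zero and the integration constant in the slope drops out; k stands for g\lambda/T_x, and its value here is arbitrary.

```python
import math

def slope(x, k):
    """Slope dy/dx = T_y/T_x of the chain, with its lowest point at x = 0."""
    return math.sinh(k * x)

def ode_residual(x, k, h=1e-5):
    """Left side minus right side of d(slope)/dx = k * sqrt(1 + slope**2)."""
    lhs = (slope(x + h, k) - slope(x - h, k)) / (2 * h)   # central difference
    rhs = k * math.sqrt(1 + slope(x, k) ** 2)
    return lhs - rhs

k = 0.7   # hypothetical value of g*lambda/T_x
for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    print(x, ode_residual(x, k))
```

The residuals should all be zero to within the accuracy of the finite difference.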


  1. Rework the hanging chain problem, where the chain is a suspension bridge. Imagine the chain itself with no weight, and the bridge having constant density. This is equivalent to taking the weight of the segment to be simply \lambda g \textrm{d}x rather than \lambda g \sqrt{\textrm{d}x^2 + \textrm{d}y^2}. (Answer: a parabola)
  2. Find the shape of a hanging spring of zero rest length. In this case, the total tension is inversely proportional to the mass density. (Answer: also a parabola)

Visualizing Elementary Calculus: Optimization

April 26, 2011

Geometric thinking sometimes lets us skip a bunch of algebraic steps in basic min/max problems. Here are some common problems solved geometrically. I learned to think about optimization this way from The Feynman Lectures.

This series:

I – Introduction
II – Trigonometry
III – Differentiation Rules 1
IV – Graphs, Tangents, Derivatives
V – Optimization


Where is the vertex of the parabola

y = ax^2 + bx + c ?

A parabola looks like this, with the vertex at the lowest part (or highest if it opens down)

If you’re at the bottom like that, the tangent line must be flat. Otherwise, you could take a small step in whichever direction the tangent line went down, and you’d get to something smaller, and hence you weren’t at the bottom to begin with.

So to find the vertex, we simply need to look for where the tangent is horizontal. In the previous post, we saw that the slope of the tangent is the derivative, so we need to set the derivative to zero.

y = ax^2 + bx + c

\frac{\textrm{d}y}{\textrm{d}x} = 2ax + b = 0

x = \frac{-b}{2a} ,

a fact you may remember from algebra.


Suppose we want to put up some fence to make a rectangular pen. We only have 90m of fence to use, and we want the biggest possible pen.

It’s easy to guess by symmetry that the optimal shape is a square, but what if we twist the problem slightly? Say the fence is going up against the side of a cliff, and so we get one side of it for free. Now what is the best rectangular pen?

The fence has a length and a width like this:

To maximize or minimize A with respect to B, \textrm{d}A/\textrm{d}B must be zero. So we want the derivative of area with respect to side length to be zero. We draw a picture to show the product rule

\textrm{d}A = l\textrm{d}w + w \textrm{d}l  = 0

If we take away 1 meter from the vertical length of the fence, it has to be split in two to go on the horizontal widths, so they only add half a meter. \textrm{d}w = -\frac{1}{2} \textrm{d}l, so

\textrm{d}A = \frac{-1}{2} l \textrm{d} l + w \textrm{d} l = 0

l = 2w

so the vertical length of the fence should be twice the horizontal width.
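If you’d rather check this by brute force, here is a rough Python sketch. It assumes, as in the argument above, that the fence covers two widths w and one length l, with the cliff supplying the fourth side. The function name and the step count are my own choices.

```python
def best_pen(total_fence, steps=9000):
    """Scan widths w and return (area, l, w) for the largest pen against a cliff.

    The fence covers two widths and one length, so l = total_fence - 2*w.
    """
    best = (0.0, 0.0, 0.0)
    for i in range(1, steps):
        w = (total_fence / 2) * i / steps
        l = total_fence - 2 * w
        best = max(best, (l * w, l, w))  # tuples compare by area first
    return best

area, l, w = best_pen(90.0)
print(area, l, w)
```

With 90m of fence the scan lands on l = 45m and w = 22.5m, so l = 2w as the geometric argument says.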

Distance to a line

Here is an easy problem: Given a line and a point, what is the shortest path from the point to the line?

There are many ways to go from the point to the line. Here are a few:

If the point of contact with the line is called x and the distance from our original point to the line is called l, we can form the derivative \textrm{d}l/\textrm{d}x. This derivative tells us how the distance to the line changes as we move the point around.

If we find the x that minimizes l, then \textrm{d}l/\textrm{d}x is zero there. We can make a little picture to illustrate \textrm{d}l and \textrm{d}x. \textrm{d}x is a little distance along the line.

\textrm{d}l is the change in the length of the segment. To find it, draw a circle with the center at the point off the line going through one of the candidate points on the line. The circle shows everywhere that’s equidistant, so the length of the other segment outside the circle is how much longer it is.

In order for the extra bit to shrink to zero, indicating the derivative is zero, we must have the circle be tangent to the line. Tangents to a circle are perpendicular to radii, so the shortest possible path from a point to a line is perpendicular to the line. This is a result you could probably get without calculus, but it’s a good warm up for the next bit.

Fermat’s Principle

Fermat’s principle for optics says that light takes whatever path from A to B is fastest. We can find such paths by calculus, keeping in mind “the fastest path” means the derivative of the time of travel is zero.

Take two arbitrary points on the same side of a flat mirror.

What is the fastest route from A to B? The answer is a straight line, ignoring the mirror. But what is the fastest route that also touches the mirror somewhere? There are many potential places to touch the mirror, and therefore many potential paths.

The fastest one has the derivative of path length with respect to contact point equal to zero, so take two nearby points and compare.

In this picture, segment AC is clearly shorter than segment AD. How much shorter? Draw a circle with AC as a radius.

The purple segment shows the discrepancy. We would like to find its length. Zoom in on the interesting area.

Since this is calculus, we are letting the points C and D be separated by a very small distance \textrm{d}x. When we zoom in, the circle appears indistinguishable from its tangent line, which is a line perpendicular to AC. Also, as C and D get closer together, AC becomes parallel to AD, so the circle is also perpendicular to AD.

The purple segment’s length is just \sin\theta \textrm{d}x.

Next we want to do basically the same thing to figure out how much longer CB is than DB.

Again, we zoom in on the interesting area, making the same linear approximations as the separation \textrm{d}x becomes very small.

This time, we get that the extra length is \textrm{d}x\sin\phi.

These two extra lengths must cancel each other out if the paths are going to be the same length, so

\textrm{d}x\sin\theta = \textrm{d}x\sin\phi

so \theta = \phi. \theta is the angle that the incoming rays make with the vertical, and \phi is the angle that the outgoing rays make with the vertical (exercise). So Fermat’s principle says that light bounces off a mirror at the same angle it came in.
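Here is a numerical sketch of the same result in Python. The coordinates are my own assumption: the mirror lies along y = 0, A sits at (0, a), and B at (d, b). The code searches for the contact point with the shortest total path and then compares the sines of the two angles.

```python
import math

def path_length(x, a, b, d):
    """Length of the path A -> (x, 0) -> B, with A = (0, a), B = (d, b), mirror on y = 0."""
    return math.hypot(x, a) + math.hypot(d - x, b)

def ternary_min(f, lo, hi, iters=200):
    """Locate the minimum of a function with a single dip on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

a, b, d = 1.0, 2.0, 3.0  # assumed heights of A and B above the mirror, and their separation
x = ternary_min(lambda s: path_length(s, a, b, d), 0.0, d)
sin_theta = x / math.hypot(x, a)          # incoming angle, measured from the vertical
sin_phi = (d - x) / math.hypot(d - x, b)  # outgoing angle, measured from the vertical
print(x, sin_theta, sin_phi)
```

The two sines agree to within the precision of the search, so the shortest bouncing path is the one with equal angles.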

A similar problem is the “lifeguard problem”. You’re a lifeguard. You see a drowning person, and you want to go save them, but you have to decide what path is fastest. You run part way on sand and swim part way in the water. What path should you choose?

You go faster on the beach, so you probably shouldn’t take a straight line. Instead, run further on the beach and turn a bit when you enter the water. We want to know how much you should turn. Again, take two nearby points and find the condition so that the difference in path lengths sums to zero. Let’s bring those green circles and purple segments back again.

They’re different lengths, which is actually what we want. We want those two purple segments to take the same amount of time to traverse, not to be the same length. That way, the two nearby paths take equal total time and the derivative of the time with respect to the entry point is zero.

From here, you can follow the details through to find v_{water} \sin\theta = v_{land} \sin \phi, which is called Snell’s Law.
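The relation can also be checked numerically. In this Python sketch the setup is my own assumption: the shoreline is y = 0, the lifeguard starts at (0, a) on the sand, and the swimmer is at (d, -b) in the water. I take \theta to be the angle of the sand leg and \phi the angle of the water leg, both measured from the perpendicular to the shoreline. The speeds and distances are made-up numbers.

```python
import math

def travel_time(x, a, b, d, v_land, v_water):
    """Time to run from (0, a) to the shoreline point (x, 0), then swim to (d, -b)."""
    return math.hypot(x, a) / v_land + math.hypot(d - x, b) / v_water

def fastest_entry(a, b, d, v_land, v_water, iters=200):
    """Ternary search for the entry point x in [0, d] with the least travel time."""
    lo, hi = 0.0, d
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if travel_time(m1, a, b, d, v_land, v_water) < travel_time(m2, a, b, d, v_land, v_water):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

a, b, d = 10.0, 10.0, 20.0   # assumed distances in meters
v_land, v_water = 5.0, 1.0   # assumed speeds in meters per second
x = fastest_entry(a, b, d, v_land, v_water)
sin_theta = x / math.hypot(x, a)          # sand leg
sin_phi = (d - x) / math.hypot(d - x, b)  # water leg
print(x, v_water * sin_theta, v_land * sin_phi)
```

The entry point comes out well past the halfway mark, matching the intuition that you should run further on the sand, and the last two printed numbers agree.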

Fermat’s principle does not really state that light takes a path of least time – in fact having the derivative be zero is enough. In most cases the time is least, but in some applications, images actually form where the time is at a maximum compared to nearby paths, or even where it is a “stationary point” – the derivative is zero but not a minimum or maximum, which happens, for example, at the origin of the graph of y = x^3.

Witches with unusually-shaped heads

This example is somewhat artificial, but what is the largest cylinder (an unusual head shape) that fits inside a given right circular cone (witch’s hat)?

The correct cylinder is clearly something along these lines:

We want to optimize V by changing h, so we had better set \textrm{d}V/\textrm{d}h = 0.

As the cylinder gets a little taller, it sweeps out some volume with its top, and sucks in some volume with its sides, so

\textrm{d}V = A_{top}\textrm{d}h - A_{sides} \textrm{d}r

\textrm{d}h and \textrm{d}r are related by the slope of the cone, which is R/H, so we have

\textrm{d}V = \pi r^2\textrm{d}h - 2\pi r h \textrm{d}h * \frac{R}{H} = 0

which is equivalent to

\frac{r}{h} = 2\frac{R}{H}

This happens when

h = \frac{H}{3}

r = \frac{2R}{3}
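A quick brute-force check in Python, as a sketch only. The cone dimensions are arbitrary, and the code uses the similar-triangles relation r = R(1 - h/H) for a cylinder whose top rim touches the cone, which the text uses implicitly through the slope R/H.

```python
import math

def cylinder_volume(h, R, H):
    """Volume of the cylinder of height h inscribed in a cone of base radius R and height H."""
    r = R * (1 - h / H)  # the cylinder's top rim touches the cone
    return math.pi * r * r * h

R, H = 2.0, 3.0  # assumed cone dimensions
steps = 3000
best_h = max((H * i / steps for i in range(1, steps)),
             key=lambda h: cylinder_volume(h, R, H))
best_r = R * (1 - best_h / H)
print(best_h, best_r)
```

The scan settles on a height of one third of H and a radius of two thirds of R.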

These toy optimization problems are given to calculus students for practice. This is a useful skill, but many real optimization problems are more difficult because they involve many variables (even infinitely many). These problems are extremely important to physics, though. In the next posts, we’ll see some physics examples.


  1. Write down the quadratic formula and stare at it until you understand how it shows you what the vertex of a parabola is.
  2. What are the smallest and largest values of the function f(x) = \sin(x)/x? (you can check your answer like this.)
  3. Where are the “humps” in the graph of the cubic equation y = ax^3 + bx^2 + cx + d? Under what conditions does it have humps? How can you use this to tell whether a cubic has one real zero or three?
  4. Use the Pythagorean theorem and some algebra to solve the problem of finding the shortest segment from a point to a line.
  5. Prove the result about bouncing off the flat mirror using the concept of an image point. Create a new point, called B' on the other side of the mirror opposite B. For every path from A to the mirror to B, there is an equally-long path from A to the same point on the mirror to B'. Now use the fact that the fastest route from A to B' is a straight line to find the fastest route from A to B touching the mirror.
  6. Imagine that instead of a single pen, we want to make a whole grid of pens (or cubicles) enclosed on all sides. Our grid is m pens wide and n pens tall. If we have a fixed amount of fencing, what should the aspect ratio of the pens be to maximize their area? (Answer: m+1 : n+1)
  7. Find the optimal height for a cylinder with fixed surface area and maximal volume. Compare this to a cylinder with only one end cap, and then one with no end caps. (Answer: h = 2r, h = r, and unbounded)
  8. Here’s a modified version of the lifeguard problem. The pool has become an ellipse. What path should the lifeguard take? Try to find a condition such that there are three equal, optimal paths.

Visualizing Elementary Calculus: Graphs, Tangents, Derivatives

April 17, 2011

The derivative as the slope of a graph is standard fare, and it’s important for visualizing calculus.

This series
I – Introduction
II – Trigonometry
III – Differentiation Rules 1
IV – Graphs, Tangents, Derivatives

The Derivative as Slope

Let’s look at the graph of y=x^2.

If we take a point on this graph, for example (2,4), the y-value is the square of the x-value.

If we look at a nearby point, those values have changed by \textrm{d}y and \textrm{d}x respectively. We can visualize those changes like this:

\textrm{d}y and \textrm{d}x are supposed to represent tiny changes, so we better bring the points in close to each other and zoom in. Any reasonable curve looks like a straight line when you zoom in on it enough, including this one. As far as these nearby points are concerned, y = x^2 is a line, and they are on it. That line is called the tangent line. Here it is:

The value of \textrm{d}y/\textrm{d}x is the derivative of y with respect to x, but in this context it is also called the slope of the tangent line. So, the derivative of a function at a certain point is the slope of the tangent line at that point.

If we zoom back out again, eventually the graph of y = x^2 no longer looks like a line; we can see its curvature. The tangent line tracks the graph for a while, but eventually diverges. The red line shown below is the tangent line to the parabola. The derivative of x^2 with respect to x is 2x, so the slope of this tangent line through (2,4) is 2x = 2*2 = 4. To find the equation for the tangent line itself, we choose the line with the specified slope that goes through the point. That would be y = 4x-4.


Elementary geometry tells us that the tangent to a circle is perpendicular to the radius. Let’s combine this fact with some calculus.

If we have a circle at the origin, the slope to a point (x,y) on the circle is y/x.

The circle is given by x^2 + y^2 = R^2. Applying \textrm{d} to both sides gives 2x\textrm{d}x + 2y\textrm{d}y = 0 (because \textrm{d}R = 0). This simplifies to

\frac{\textrm{d}y}{\textrm{d}x} = -\frac{x}{y}

which is the slope of the tangent line.

Since this is perpendicular to a line of slope y/x, we see that perpendicular lines have negative-reciprocal slopes, a fact familiar from algebra.

Square Roots

If you want to estimate the square root of a number n, a good way is to take a guess g, then average g with n/g. For example, to find the square root of 37, guess that it’s 6, then take the average of 6 and 37/6.

\frac{6 + 37/6}{2} = 6.0833

The actual answer is about 6.0828. It’s close. To get closer, iterate.

\frac{6.0833 + 37/6.0833}{2} = 6.08276256

The actual answer, with more accuracy, is 6.08276253. So we’ve got 7 decimal places of accuracy after two iterations of guessing.

Calculus shows us where this comes from. We are estimating \sqrt{n}. That is a zero of x^2 - n. So we plot y = x^2 - n (here, n = 37).

We don’t know where the zero is, but we know that x = 6 is near the zero. So we draw the tangent line to the graph at x = 6. This tangent is y = 12x - 73.

The tangent line tracks the parabola quite closely for the very short \textrm{d}x from the point x = 6 to wherever the zero is. So closely that we can’t even see the difference there. Zoom in near the point (6,-1).

Now we see that the tangent line is a very good approximation to the parabola near the zero, so we can approximate the zero using the zero of the tangent line instead of the zero of the parabola. The zero of the tangent line is given by

0 = 12x - 73

x = 6.0833

This is our first new guess for the zero of the parabola. It’s off, but only by a tiny bit, as this even-more-zoomed picture shows. We’ve zoomed in so closely that the original point (6,-1) is no longer visible.

From here, we can iterate the process by drawing a new tangent line like this:

We’ve zoomed in even closer. The red line is the tangent that gave us our first improved guess of 6.0833. Next, we drew a new tangent (purple) to the graph (blue) at the location of the improved guess to get a second improved guess, which is again so close we can’t even see the difference on this picture, despite zooming in three times already.

This general idea of estimating the zeroes of a function by guessing, drawing tangents, and finding a zero of the tangent, is called Newton’s method.
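Here is a short Python sketch of both procedures side by side, so you can watch them produce the same numbers. The function names are mine. The first function is the averaging rule from the start of this section, and the second follows the tangent to y = x^2 - n down to its zero.

```python
def average_step(g, n):
    """One round of the averaging rule: replace the guess g by the mean of g and n/g."""
    return (g + n / g) / 2

def tangent_step(g, n):
    """One round of Newton's method on y = x^2 - n: follow the tangent at x = g to its zero."""
    y = g * g - n   # height of the parabola at the guess
    slope = 2 * g   # derivative of x^2 - n at the guess
    return g - y / slope

g = 6.0
for i in range(2):
    print(i + 1, average_step(g, 37), tangent_step(g, 37))
    g = average_step(g, 37)
```

Starting from the guess 6, the first round gives 73/12 either way, and the second round is already within about one part in ten million of \sqrt{37}.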


  1. Graph y = \sin x and find the places where the tangent line slices through the graph, rather than lying completely above or below it near the point of tangency. What is unique about the derivative at these points? (Answer: the derivative is at a local minimum or maximum (i.e. the graph is steepest) when the tangent line slices through)
  2. Find the slope of the tangent line to a point (x,y) on the ellipse (x/a)^2 + (y/b)^2 = 1 via calculus. Find it again by starting with the unit circle x'^2 + y'^2 = 1, for which you already know the slope of the tangent, and making appropriate substitutions for x' and y'. (Answer: \textrm{d}y/\textrm{d}x = -x/y * (b/a)^2)
  3. In this post, we found that y = 4x - 4 is tangent to y = x^2 at (2,4). Confirm this without calculus by noting that there are many lines through (2,4), all with different slopes. The thing that singles out the tangent line is that it only intersects the parabola once. Any line through (2,4) with a shallower slope than the tangent will intersect the parabola at (2,4), but intersect again somewhere off to the left. Any line with a steeper slope will have a second intersection to the right. Use algebra to write down the equation for a line passing through (2,4) with unknown slope, and set its y-value equal to x^2 to find the intersections with the parabola. What slope does the line need to have so that there is only one such intersection?
  4. Do the previous exercise over for a circle (i.e. use algebra to find the tangent line to a circle)
  5. For any point outside a circle, there are two tangents to the circle that pass through the point. When are these tangents perpendicular? (Answer: When the point is on a circle with the same center and a radius \sqrt{2} times as large)
  6. Newton’s method of estimating zeroes gave the same numerical answer for the zero of x^2 - 37 as the algorithm for estimating square roots gave for \sqrt{37}. Show that this is always the case (i.e. perform Newton’s method on y = x^2 -n with a tangent at some point g, and show that the new guess generated is the same as that given in the algorithm).
  7. Use Newton’s method to estimate 28^{1/3} to four decimal places (Answer: 3.0366).

Visualizing Elementary Calculus: Differentiation Rules 1

March 27, 2011

The basic rules of differentiation are linearity, the product rule, and the chain rule. Once we start graphing functions, we’ll revisit these rules.

This Series
I – Introduction
II – Trigonometry
III – Differentiation Rules


The linearity of differentials means

\textrm{d}(\alpha u + \beta v) = \alpha \textrm{d}u + \beta \textrm{d}v

\alpha and \beta are constants, while u and v might change.

This looks obvious, but here’s a quick sketch.

First we’ll look at \textrm{d}(\alpha u). Construct a right triangle with base 1 and hypotenuse \alpha. Then extend the base by length u. This creates a larger, similar triangle. The hypotenuse must be \alpha times the base, so the hypotenuse is extended by \alpha u.

Then increase u by \textrm{d}u. This induces an increase \textrm{d}(\alpha u) in the hypotenuse.

We draw an original blue triangle with base 1 and hypotenuse alpha. Then it's extended to the dark green triangle, adding u to the base and alpha*u to the hypotenuse. Finally, we increment u by du and observe the effect.

The little right triangle made by \textrm{d}u and \textrm{d}(\alpha u) is similar to the original, so

\frac{\textrm{d}(\alpha u)}{\textrm{d}u} = \frac{\alpha}{1}


\textrm{d}(\alpha u) = \alpha \textrm{d}u

Next look at \textrm{d}(u + v). u+v is just two line segments laid one after the other. We increase the lengths by \textrm{d}u and \textrm{d}v and see what the change in the total length \textrm{d}(u+v) is.

The total change is equal to the sum of the changes.

\textrm{d}(u + v) = \textrm{d}u + \textrm{d}v

These rules combine to give the rule for linearity

\textrm{d}(\alpha u + \beta v) = \alpha \textrm{d}u + \beta \textrm{d}v

The Product Rule

The product rule is

\textrm{d}(uv) = u\textrm{d}v + v\textrm{d}u

To show this, we need a line segment with length uv.

Start by drawing u, then drawing a segment of length 1 starting at the same place as u and going an arbitrary direction.

Close the triangle. Extend the segment of length 1 by v, and close the new triangle. We’ve now extended the base by uv.

Construction of length u*v, by similar triangles.

Increase u by \textrm{d}u and v by \textrm{d}v. This results in several changes to uv.

The segment uv has a little bit chopped off on the left, since \textrm{d}u cuts into the place where it used to be.

uv is also extended twice on the right. The first extension is the projection of \textrm{d}v down onto the base. All such projections multiply the length by u, so the piece added is u\textrm{d}v.

Finally there is a piece added from the very skinny tall triangle. It is similar to the skinny, short triangle created by adding \textrm{d}u to u. The tall triangle is (1+v) times as far from the bottom left corner as the short one, so it is (1+v) times as big. Since the base of the short one is \textrm{d}u, the base of the tall one is (1+v)\textrm{d}u.

Combining all three changes to uv, one subtracting from the left and two adding to the right, we get

\textrm{d}(uv) = -\textrm{d}u + u\textrm{d}v + (1+v)\textrm{d}u = u\textrm{d}v + v\textrm{d}u

This is the product rule. We’ll give another visual proof in the exercises.

The Chain Rule

Suppose we want \textrm{d}\sin x^2. (There’s no particular reason I can think of to want that, but we have a limited milieu of functions at hand right now.)

We know \textrm{d}(\sin\theta) = \cos\theta\textrm{d}{\theta}. Let \theta = x^2.

\textrm{d}(\sin x^2) = \cos(x^2)\textrm{d}(x^2)

But we already know that \textrm{d}(x^2) = 2x\textrm{d}x, so substitute that in to get

\textrm{d}(\sin x^2) = \cos(x^2)2x\textrm{d}x

This is called the chain rule. A symbolic way to write it is

\frac{\textrm{d}f}{\textrm{d}t} = \frac{\textrm{d}f}{\textrm{d}x}\frac{\textrm{d}x}{\textrm{d}t}

Suppose you are hiking up a mountain trail. f is your height above sea level. x is the distance you’ve gone along the trail. t is the time you’ve been hiking.

\textrm{d}f/\textrm{d}t is the rate you are gaining height. According to the chain rule, you can calculate this rate by multiplying the slope of the trail \textrm{d}f/\textrm{d}x by your speed \textrm{d}x/\textrm{d}t.
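To see the chain rule at work numerically, here is a small Python sketch (the function names are mine). It compares the formula \cos(x^2)2x for the derivative of \sin x^2 against the ratio of two small changes, computed directly.

```python
import math

def numeric_derivative(f, x, dx=1e-6):
    """Ratio of a small change in f to the small change in x that caused it."""
    return (f(x + dx) - f(x - dx)) / (2 * dx)

def chain_rule_derivative(x):
    """The chain-rule result for sin(x^2): cos(x^2) times 2x."""
    return math.cos(x * x) * 2 * x

for x in [0.0, 0.5, 1.0, 2.0]:
    print(x, numeric_derivative(lambda t: math.sin(t * t), x), chain_rule_derivative(x))
```

The two columns agree to many decimal places, and the agreement improves as dx shrinks (until rounding error takes over).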


  • Show that the linearity rule \textrm{d}(\alpha u) = \alpha \textrm{d}u is a special case of the product rule.
  • What is the derivative of A\sin\theta + C\cos\theta with respect to \theta? Take the derivative with respect to \theta of that. (This is called a “second derivative”.) What do you get? (Answer: -1 times the original function)
  • Use the product rule to prove by induction that the derivative of x^n is n x^{n-1} for all positive integers n.
  • Apply the product rule to x^nx^{-n} = 1 to prove that the “power rule” from the previous question holds for all integers n.
  • Look back at the arguments from the introduction. Draw a rectangle with one side length u and one side length v. Its area is uv. Use this to prove the product rule.
  • Apply the chain rule to (x^{1/n})^{n} = x to find the derivative of x^{1/n} with respect to x for all nonzero integers n (Answer: \frac{1}{n} x^{1/n -1})
  • Argue that the derivative of x^{p/q} is \frac{p}{q}x^{p/q - 1} for all rational numbers p/q.
  • Show that the derivative of a polynomial is always another polynomial. Is there any polynomial that is its own derivative? (Answer: no, except zero)
  • Combine the product rule with the chain rule to prove the quotient rule \textrm{d}\frac{u}{v} = \frac{v\textrm{d}u - u\textrm{d}v}{v^2}

Visualizing Elementary Calculus: Trigonometry

March 26, 2011

Here we’ll find the derivatives of trigonometric functions. The goal is to reinforce the idea of \textrm{d} as a thing that means “a little bit of” and grant some new insight into why these derivatives are what they are. The first argument is based on the preface of Tristan Needham’s Visual Complex Analysis. I haven’t read the bulk of it, but the preface is good.

This series
I – Introduction
II – Trigonometry

The Sine Function

Let’s find \textrm{d}(\sin\theta) / \textrm{d}\theta. The sine function is the height of a right triangle in the unit circle. We’ll draw it, and add a little change in \theta. This induces a change in \sin\theta. The change in \theta is called \textrm{d}\theta and the change in \sin\theta is called \textrm{d}(\sin\theta).

We show the sine of an angle as the dark blue line. The change in the sine when we change the angle slightly is the light blue line.

The interesting part is \textrm{d}\sin\theta, so we’ll zoom in there in the next picture. Before we do, remember that the arc length along a piece of the unit circle is equal to the angle it subtends. This will tell us the length of the little piece of the circumference near \textrm{d}\sin\theta. Also remember that we’re imagining \textrm{d}\theta to get smaller and smaller, until the two radii in the picture are parallel. We get this:

The interesting region is blown up to large size. The black line d(theta) is part of the edge of the circle. The angles marked are congruent to theta.

The section of the circle is \textrm{d}\theta long. It looks like a straight line because we are zoomed in close, like the horizon at the beach. You can use some geometry to show that the angles marked are congruent to \theta.

Looking at the right triangle formed, we can use the definition of the cosine function to read off

\frac{\textrm{d}(\sin\theta)}{\textrm{d}\theta} = \cos\theta

which is the derivative of the sine function.

Motion on the Unit Circle

Another way to view these derivatives is to imagine a point moving around the outside edge of the unit circle with speed one. Its location as a function of time is (\cos t, \sin t).

Its velocity is tangent to the circle and has length one. Let’s draw the velocity vector right at the point, and then also translate it to the origin.

The position of the point is the red vector r. Its velocity is the green tangent v, which has also been copied to the origin.

We want to know the coordinates of \vec{v}. That’s not too hard; \vec{v} is a quarter-circle rotation of \vec{r}. Draw in the components of \vec{r}, and rotate those components to get \vec{v}. The x-component of the position becomes the y-component of the velocity, and the y-component of the position becomes minus one times the x-component of the velocity.

The components of the position get rotated a quarter turn to make the components of the velocity.

The derivative of position is velocity, and so comparing components between the position and velocity vectors, we get

\frac{\textrm{d}(\cos\theta)}{\textrm{d}\theta} = -\sin\theta

\frac{\textrm{d}(\sin\theta)}{\textrm{d}\theta} = \cos\theta
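The quarter-turn picture is easy to test numerically. This Python sketch (names are my own) estimates the velocity of the point (\cos t, \sin t) from two nearby positions and compares it to the position vector rotated a quarter turn counterclockwise.

```python
import math

def position(t):
    """Location of the point on the unit circle at time t."""
    return (math.cos(t), math.sin(t))

def quarter_turn(v):
    """Rotate a vector counterclockwise by a quarter turn: (x, y) -> (-y, x)."""
    x, y = v
    return (-y, x)

def numeric_velocity(t, dt=1e-6):
    """Change in position divided by change in time, using two nearby times."""
    (x0, y0), (x1, y1) = position(t - dt), position(t + dt)
    return ((x1 - x0) / (2 * dt), (y1 - y0) / (2 * dt))

for t in [0.0, 1.0, 2.5, 4.0]:
    print(numeric_velocity(t), quarter_turn(position(t)))
```

Each pair of printed vectors matches, which is the statement that the derivative of (\cos t, \sin t) is (-\sin t, \cos t).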


  • Look back at the first derivation we gave that \textrm{d}(\sin\theta)/\textrm{d}\theta = \cos\theta. Rework it to find derivatives of the other five trig functions. You might want to note that one way to interpret \tan\theta and \sec\theta is

The tangent and secant of an angle are side lengths of a right triangle with "adjacent" side length one.


  • Look back at the argument about a dot moving around a circle. Consider a larger circle to find the derivative of 5\sin\theta with respect to \theta. (Answer: 5\cos\theta)
  • Suppose the dot moving around the edge of the circle is going three times as fast. What does this mean for the derivatives of \sin(3 t) and \cos(3 t) with respect to t? Remember that the velocity must still be perpendicular to the position, but is not necessarily unit length any more. (Answer: the derivative of \sin(3 t) with respect to t is 3\cos(3 t).)
  • Suppose the dot is moving at a variable speed v(t) = t, so that it keeps getting faster. Then the y-coordinate of the position is \sin(\frac{1}{2}t^2). Again, the velocity is perpendicular to position, but its length is changing. What is the derivative of \sin(\frac{1}{2}t^2) with respect to t? (Answer: t\cos(\frac{1}{2}t^2))

Visualizing Elementary Calculus: Introduction

March 25, 2011

Recently I’ve been trying to be more geometrical when discussing elementary calculus with high school students. I don’t want to write an entire introduction to calculus, but the next few posts will outline some ways I think the geometric view can be helpful.

This series
I – Introduction
II – Trigonometry

You know about \Delta, which means “the change in”. For example, if w represents my weight, then -\Delta w represents the weight of the poop I just took.

Let’s say h is your height above sea level. \Delta h is the change in that height, but what change? The change when you climb the stairs? When you jump out of a plane? When you step on a banana peel?

When we think about change, we usually think about two things changing together. You get higher when you climb another stair on the staircase. h is changing, and so is s, the number of stairs climbed.

These two changes are related to each other. Say the stairs are 10 cm high. Then you gain 10 cm of height for each stair. We can write that as \Delta h = 10 {\rm cm} \hspace{.5em} \Delta s. We can also write it \Delta h / \Delta s = 10 \hspace{.5em}{\rm cm}. This says, “the height per stair is ten centimeters.”

This is the goal of calculus – to study the relationships between changing quantities. Let’s do a real example.

The Area of a Square

Let’s say we have a square whose side lengths are x. Its area is x^2. What is the relationship between changes in its area and changes in the length of a side? Draw the square, then expand the sides some. The amount the sides have expanded is \Delta x. The new area that’s been added is \Delta (x^2).

We begin with the red square on the left, whose area is x^2. We add an extra amount Delta(x) to the sides, creating all the new green area.

From the picture we see

\Delta(x^2) = 2x\Delta x + (\Delta x)^2

This formula relates \Delta (x^2), the change in the area, to \Delta x, the change in the length of a side.

The Derivative of x^2

In the picture of the square, there is a little piece in the upper-right corner whose area is (\Delta x)^2. It is the smallest bit of area in the whole picture.

Look what happens when we make \Delta x even smaller.

We shrink Delta(x) and observe what happens to the different areas being added on.

In the first picture, \Delta x (no longer marked) is a quarter of x. (\Delta x)^2 is the dark green area, and it is one quarter as large as x \Delta x, the light green area. We see this because the dark patch fits inside the light one four times.

In the second picture, we shrink \Delta x to one eighth of x. All the green areas shrink, but the dark patch shrinks on two sides while the light patches shrink on only one. As a result, the dark (\Delta x)^2 is now only one eighth the size of the light x \Delta x.

If we continued to shrink \Delta x, this ratio would continue to decrease. Eventually we could tile the dark patch a million times into the light one. So, as long as \Delta x is very small, we can get a good estimate of the entire green area by ignoring the dark part (\Delta x)^2. Thus

\Delta(x^2) \approx 2x\Delta x

This approximation becomes better and better as \Delta x shrinks, becoming perfect as \Delta x becomes infinitesimally small.

When we want to indicate these infinitely small changes, we trade in the \Delta for a {\rm d} and write

\textrm{d}(x^2) = 2x \textrm{d}x

The terms \textrm{d}(x^2) and \textrm{d}x are called “differentials”. The equation expresses the relationship between two infinitely-small changes, one in x and one in x^2.

Frequently, we divide by \textrm{d}x on both sides to get

\frac{\textrm{d}(x^2)}{\textrm{d}x} = 2x

This is called “the derivative of x^2 with respect to x”.

Example 1: Estimating Squares

20^2 = 400. What is 21^2?

Here x = 20, and we’re looking at x^2. When x goes from 20 to 21, it changes by 1, so \textrm{d}x = 1. Our formula tells us

\textrm{d}(x^2) = 2x \textrm{d}x = 2*20*(1) = 40

Hence, x^2 increases by about 40, from 400 to 440.

The real value is 441. We got the change in x^2 wrong by about 2%. That’s because \textrm{d}x wasn’t infinitely small.

Let’s try again, this time estimating the square of 20.00458. Now \textrm{d}x = .00458, so

\textrm{d}(x^2) = 2 x \textrm{d}x = 2*20*.00458 = .1832

The estimate is 400.1832. The real value is 400.183221. We did much better, under-estimating the change by only 0.01% this time. Also, it was not much harder to do this problem than the last, but squaring out 20.00458 by hand would be a pain. We saved some work.
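Here is the same estimate as a few lines of Python, in case you want to try other numbers. This is only a sketch of the idea and the function name is mine.

```python
def estimate_square(x, dx):
    """Estimate (x + dx)^2 as x^2 + 2*x*dx, dropping the (dx)^2 term."""
    return x * x + 2 * x * dx

for x, dx in [(20, 1), (20, 0.00458)]:
    estimate = estimate_square(x, dx)
    exact = (x + dx) ** 2
    print(estimate, exact, exact - estimate)
```

The last number on each line is the error, and it is just (\textrm{d}x)^2, the little corner square we agreed to ignore.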

Example 2: How Far Is the Horizon?

The beach is a good place to think about calculus. If you look out at the ocean, the horizon appears perfectly flat. Nonetheless, we know the Earth is really curved. In fact, we can deduce the curvature of the Earth by standing on the beach and enlisting the help of a friend in a boat.

It works like this: You stand on the beach with your head two meters above the water. Your friend sails away until the boat begins to disappear from sight. The reason the bottom of the boat is disappearing is that it is hidden behind the curvature of Earth.

When the bottom of the boat disappears, measure the distance to some part of the boat you can still see. What’s the relationship between your height, the distance to the boat, and the radius of Earth?

A picture will help. We’ll call your height h and the distance to the horizon z.

You are the vertical stick on top, height h. The boat is the brown circle. It's at the horizon, a distance z away. The dotted line shows your line of sight. When the bottom of the boat begins disappearing, a right triangle forms.

Your height, the radius of Earth, and the distance to the horizon are related by the Pythagorean theorem to give

R^2 + z^2 = (R+h)^2

This is equivalent to

z^2 = 2Rh + h^2

As we have seen, if your height h is small compared to the size of the Earth (and it is), the term h^2 drops away and the distance to the horizon is

z = \sqrt{2Rh}

With your head two meters above the water, you can see about 5 {\rm km} at the beach, making the radius of Earth about 6,000 {\rm km}. (It’s actually 6378.1 {\rm km}.)
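Here is a short Python sketch of that arithmetic, assuming a height of 2 m and a measured horizon distance of 5 km. The function names are invented for this illustration, and all lengths are in meters.

```python
import math


def earth_radius_from_horizon(z, h):
    """Solve z^2 = 2Rh + h^2 for R."""
    return (z**2 - h**2) / (2 * h)


def horizon_distance(R, h):
    """Approximate distance to the horizon, z = sqrt(2Rh), valid for h << R."""
    return math.sqrt(2 * R * h)


R_estimate = earth_radius_from_horizon(z=5000.0, h=2.0)
print(f"Estimated radius: {R_estimate / 1000:.0f} km")

# Going the other way: the horizon distance implied by the accepted radius
z_true = horizon_distance(R=6378.1e3, h=2.0)
print(f"Horizon distance for R = 6378.1 km: {z_true / 1000:.2f} km")
```

Keeping or dropping the h^2 term changes the estimated radius by only about a meter, which is the sense in which it "drops away".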

Next we want to know how much further you can see if you stand on your tiptoes. That would be a small change \textrm{d}h to your height. It would let you see a small amount \textrm{d}z further. How is \textrm{d}h related to \textrm{d}z?

We already know

\textrm{d}(x^2) = 2x\textrm{d}x

So let x^2 = h, or x = \sqrt{h}, and we have

\textrm{d}h = 2\sqrt{h}\hspace{.3em}\textrm{d}(\sqrt{h})

But we also know

\sqrt{h} = \frac{z}{\sqrt{2R}}

so we can substitute that in to \textrm{d}(\sqrt{h}) and get

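\textrm{d}h = 2\sqrt{h}\hspace{.3em}\textrm{d}\left(\frac{z}{\sqrt{2R}}\right)

The factor 1/\sqrt{2R} is a constant, so it can come outside the differential, giving \textrm{d}h = \sqrt{2h/R}\hspace{.3em}\textrm{d}z. Rearranging,

\frac{\textrm{d}z}{\textrm{d}h} = \sqrt{\frac{R}{2h}}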

This tells us how much further you can see if you get a little higher up. The interesting thing is it depends on h. The higher you go, the smaller \textrm{d}z. When you’re only two meters up, you get to see more than twelve meters further out for every centimeter higher you go. However, if you’re 100 m up on top of a carousel, you get less than two meters for each centimeter you rise.
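To put numbers on this, the sketch below evaluates \sqrt{R/2h} at both heights and compares it with what you actually gain by rising one centimeter, using the un-approximated z^2 = 2Rh + h^2. It assumes R = 6378.1 km; the function names are hypothetical.

```python
import math

R = 6378.1e3  # radius of Earth in meters (assumed value)


def horizon(h):
    """Distance to the horizon from z^2 = 2Rh + h^2, without dropping h^2."""
    return math.sqrt(2 * R * h + h**2)


def dz_dh(h):
    """The derivative found above: dz/dh = sqrt(R / (2h))."""
    return math.sqrt(R / (2 * h))


dh = 0.01  # one centimeter
for h in (2.0, 100.0):
    gained = horizon(h + dh) - horizon(h)  # actual extra distance seen
    predicted = dz_dh(h) * dh              # estimate from the differential
    print(f"h = {h:5.1f} m: actual gain = {gained:.3f} m, "
          f"predicted = {predicted:.3f} m")
```

The two columns agree closely because one centimeter is tiny compared to either height.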

It makes sense that the extra distance you see gets smaller and smaller the higher you go, and eventually shrinks down to zero. No matter how high you go, you can never see more than a quarter way around the globe.

(In reality, light bends due to refraction in the atmosphere, so you can sometimes see a bit further.)


Example 3: The Area of a Circle

Suppose we have a circle with radius R. It has a certain area (you undoubtedly know the formula already, but play along). Suppose we increase R by a small amount \textrm{d}R. What is the change \textrm{d}A in the area?

The original circle is dark blue with area A and radius R. The radius increases an amount dR, increasing the area by the light blue ring with area dA.

\textrm{d}A is the thin, light-blue ring. Imagine taking that ring and peeling it off the edge of the circle and laying it flat. We’d have a rectangle with width \textrm{d}R. Its length comes from the outside edge of the entire circle – the circumference. The circumference is 2 \pi R, so

\textrm{d}A = 2\pi R \textrm{d}R

We saw earlier that \textrm{d}(x^2) = 2x\textrm{d}x, so let x = R and we have

\textrm{d}A = \pi \textrm{d}(R^2)

Thus the quantities A and \pi R^2 change in exactly the same way. Since they also start out the same (both zero when R is zero), we have

A = \pi R^2
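The ring picture can also be checked by brute force. This sketch builds the circle out of many thin rings, each contributing 2\pi r\,\textrm{d}r, and adds them up. The function name and the number of rings are arbitrary choices for the illustration.

```python
import math


def area_from_rings(R, n_rings=100_000):
    """Add up thin rings of area 2*pi*r*dR, growing from r = 0 out to R."""
    dR = R / n_rings
    area = 0.0
    for i in range(n_rings):
        r = i * dR  # inner radius of the i-th ring
        area += 2 * math.pi * r * dR
    return area


print(area_from_rings(1.0), math.pi * 1.0**2)
```

With a finite number of rings the sum falls slightly short of \pi R^2, and the shortfall shrinks as the rings get thinner.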

Next Post

We’ll look at trigonometry. Geometric arguments about the derivatives of trig functions are very simple ways of visualizing what’s going on, and are usually not introduced in a basic calculus course.


Exercises

  • Draw a cube with sides x and show that \textrm{d}(x^3) = 3x^2\textrm{d}x. Thus the derivative of x^3 with respect to x is 3x^2.
  • Draw a line with length x and show that \textrm{d}(x) = \textrm{d}x, which is of course algebraically obvious. Thus the derivative of x with respect to itself is 1.
  • Draw a rectangle with width w and length c*w and show that \textrm{d}(c*w^2) = 2cw\textrm{d}w = c\textrm{d}(w^2). Thus, whenever you have the differential of a variable multiplied by a constant, the constant can pop outside. Where was this property used implicitly in this post?
  • Now that you know \textrm{d}(x^3) = 3x^2\textrm{d}x, let x^3 = u and find the derivative of u^{1/3} with respect to u. (Answer: \frac{1}{3} u^{-2/3})
  • What is \textrm{d}(x^3)/\textrm{d}(x^2)? Let u = x^2 and find the derivative of u^{3/2} with respect to u. (Answer: \frac{3}{2}u^{1/2}).
  • Examine \textrm{d}(x^4) by letting u = x^2, so we’re looking at \textrm{d}(u^2). Find the derivative of x^4 with respect to x. (Answer: 4x^3)
  • Draw an equilateral triangle with sides of length s. Increase the sides a small amount \textrm{d}s and relate this to the change in area \textrm{d}A. Does this agree with our previous findings?
  • Draw an ellipse with semi-major axis a and semi-minor axis b. Starting with a unit circle, argue by thinking about stretching that the area of the ellipse is \pi ab. Increase a by a small amount \textrm{d}a and increase b proportionately. This adds a small area \textrm{d}A to the ellipse. Show that this area is \pi(a^2+b^2)/b\hspace{.3em}\textrm{d}a. Does this let us find the circumference of the ellipse by the same thought process as we used for the circle? (Answer: no). Why not?
  • Draw a sphere with radius R. Use the relationship between \textrm{d}R and \textrm{d}A to find the volume of a sphere, given its surface area is 4\pi R^2. Check your answer against this post.
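If you want to sanity-check the answers quoted in the exercises, one rough way is to take a small but finite \textrm{d}x and compute the ratio directly. The sketch below does this at x = 2; the helper `numeric_derivative` and the choice of test point are assumptions made for the illustration, and it is no substitute for drawing the pictures.

```python
def numeric_derivative(f, x, dx=1e-6):
    """Approximate d(f)/dx using a small, finite dx on either side of x."""
    return (f(x + dx) - f(x - dx)) / (2 * dx)


# (label, function, answer quoted in the exercises)
checks = [
    ("x^3", lambda x: x**3, lambda x: 3 * x**2),
    ("u^(1/3)", lambda u: u ** (1 / 3), lambda u: u ** (-2 / 3) / 3),
    ("u^(3/2)", lambda u: u**1.5, lambda u: 1.5 * u**0.5),
    ("x^4", lambda x: x**4, lambda x: 4 * x**3),
]

for label, f, answer in checks:
    print(f"{label:8s} numeric = {numeric_derivative(f, 2.0):.6f}, "
          f"formula = {answer(2.0):.6f}")
```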