I once wrote about height and speed in tennis, arguing that a negative correlation between the two appears at the highest level simply because they are substitutes and the athletes are selected to be the very best.  A post at the blog MickeyMouseModels.blogspot.com shows the effect very nicely using simulated data.  Quoting:

Suppose that, in the general population, the distribution of height and speed looks roughly like this:

Where did I get this data? It’s entirely hypothetical. I made it up! That said, I did try to keep it semi-realistic: the heights are generated as H = 4 + U1 + U2 + U3 feet, where the U are independently uniform on (0, 1); the result is a bell curve on (4, 7) feet, which I prefer to the (-Inf, +Inf) of an actual normal distribution.  (I’ve created something similar to the N=3 frame in this animation.)

The next step is to give individuals a maximum footspeed S = 10 + U4 + U5 + U6 mph, with the U independently uniform on (0, 5). By construction, speed is independent from height, and falls more or less in a bell curve from 10 to 25 mph. Fun anecdote: my population is too slow to include Usain Bolt, whose top footspeed is close to 28 mph.
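
(An aside, not part of the quoted post: the recipe above is easy to reproduce. What follows is a minimal sketch in Python using only the standard library. The function names `simulate_population` and `pearson`, the population size of 100,000, and the random seed are my own assumptions, not details from the original post.)

```python
import random


def simulate_population(n, seed=0):
    """Draw n (height, speed) pairs following the recipe in the quoted post."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    people = []
    for _ in range(n):
        # H = 4 + U1 + U2 + U3 feet, each U uniform on (0, 1)
        height = 4 + sum(rng.uniform(0, 1) for _ in range(3))
        # S = 10 + U4 + U5 + U6 mph, each U uniform on (0, 5)
        speed = 10 + sum(rng.uniform(0, 5) for _ in range(3))
        people.append((height, speed))
    return people


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


population = simulate_population(100_000)  # population size is an assumption
heights = [h for h, _ in population]
speeds = [s for _, s in population]
r_population = pearson(heights, speeds)

print(f"height range: {min(heights):.2f} to {max(heights):.2f} ft")
print(f"speed range:  {min(speeds):.2f} to {max(speeds):.2f} mph")
print(f"correlation in the general population: {r_population:.3f}")
```

Because height and speed are drawn independently, the printed correlation should sit close to zero, differing from it only by sampling noise.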

Back to tennis. Let’s imagine that tennis ability increases with both height and speed — and, moreover, that those two attributes are substitutable: if you’re short (and have a weak serve), you can make up for it by being fast. With that in mind, let’s revisit the scatterplot:

There it is: height and speed are independent in the general population, but very much dependent — and negatively correlated — among tennis players.  The plot really drives the point home:  top athletes will be either very tall, very fast, or nearly both; and excluding everyone else creates a downward slope.
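
The selection effect can be checked numerically as well. The quoted post does not say how tennis ability is computed, so the sketch below assumes one simple form of substitutability: ability is the sum of height and speed, each rescaled to the interval 0 to 1. That scoring rule, the 1% cutoff for "top players", the helper names, and the seed are all hypothetical choices of mine.

```python
import random


def simulate_population(n, seed=0):
    """Independent heights (4 to 7 ft) and speeds (10 to 25 mph), as in the post."""
    rng = random.Random(seed)
    return [
        (
            4 + sum(rng.uniform(0, 1) for _ in range(3)),
            10 + sum(rng.uniform(0, 5) for _ in range(3)),
        )
        for _ in range(n)
    ]


def pearson(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / (var_x * var_y) ** 0.5


def ability(person):
    """Hypothetical ability score: height and speed are perfect substitutes
    once each is rescaled to the interval 0 to 1 (my assumption)."""
    height, speed = person
    return (height - 4) / 3 + (speed - 10) / 15


population = simulate_population(100_000)

# Keep only the top 1% by ability: the "tennis players" (cutoff is an assumption).
ranked = sorted(population, key=ability, reverse=True)
top_players = ranked[: len(ranked) // 100]

r_all = pearson(population)
r_top = pearson(top_players)

print(f"correlation, everyone:    {r_all:.3f}")
print(f"correlation, top players: {r_top:.3f}")
```

Under these assumptions the correlation among everyone stays near zero, while among the selected players it comes out clearly negative: anyone who clears the ability cutoff despite being short must be fast, and vice versa. A different scoring rule or cutoff would change the size of the effect, but any rule that rewards both attributes and lets one offset the other should produce the same downward slope.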
