My 9-year-old daughter’s soccer games are often high-scoring affairs. Double-digit goal totals are not uncommon.  So when her team went ahead 2-0 on Saturday, someone on the sideline remarked that 2-0 is not the comfortable lead that you usually think it is in soccer.

But that got me thinking.  It’s more subtle than that.  Suppose that the game is 2 minutes old and the score is 2-0.  If these were professional teams you would say that 2-0 is a good lead, but there are still 88 minutes to play and there is a decent chance that a 2-0 lead can be overcome.

But if these are 9-year-old girls and you know only that the score is 2-0 after 2 minutes, your most compelling inference is that there must be a huge difference in the quality of these two teams, and the team that is leading 2-0 is very likely to be ahead 20-0 by the time the game is over.

The point is that competition at higher levels is different in two ways. First there is less scoring overall which tends to make a 2-0 lead more secure.  But second there is also lower variance in team quality.  So a 2-0 lead tells you less about the matchup than it does at lower levels.

Ok, so a 2-0 lead is a more secure lead for 9-year-olds when 95% of the game remains to be played (they play for 40 minutes). But when 5% of the game remains to be played, a 2-0 lead is almost insurmountable at the professional level but can easily be overturned in a game among 9-year-olds.

So where is the flipping point?  How much of the game must elapse so that a 2-0 lead leads to exactly the same conditional probability that the 9 year olds hold on to the lead and win as the professionals?

Next question.  Let F be the fraction of the game remaining where the 2-0 lead flipping point occurs.  Now suppose we have a 3-0 lead with F remaining.  Who has the advantage now?

And of course we want to define F(k) to be the flipping point of a k-nil lead and we want to take the infinity-nil limit to find the flipping point F(infinity).  Does it converge to zero or one, or does it stay in the interior?
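Here is a rough way to put numbers on the first question. It is only a sketch, and every modeling choice in it is my assumption rather than anything established above: goals arrive as independent Poisson processes, the two professional teams are evenly matched at 1.3 goals per game each, and a kids' game is equally likely to be lopsided one way, evenly matched, or lopsided the other way. Seeing a 2-0 score updates the weights on those kids' matchups by Bayes' rule, and the lead "holds" if the leader is strictly ahead at the final whistle.

```python
import math

def poisson_pmf(k, mu):
    """Probability of exactly k goals when the expected number is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def hold_prob(lead, rate_leader, rate_trailer, remaining, max_goals=60):
    """P(leader is still strictly ahead at the end), for known scoring rates.

    Rates are expected goals per full game; `remaining` is the fraction
    of the game still to be played. Sums are truncated at max_goals.
    """
    total = 0.0
    for x in range(max_goals):
        px = poisson_pmf(x, rate_leader * remaining)
        for y in range(min(lead + x, max_goals)):  # trailer scores fewer than lead + x
            total += px * poisson_pmf(y, rate_trailer * remaining)
    return total

def lead_holds(matchups, lead, remaining):
    """P(lead holds) after updating beliefs about the matchup on the score.

    matchups: list of (prior, rate_leader, rate_trailer) tuples.
    """
    elapsed = 1.0 - remaining
    # Likelihood of a lead-to-nil score so far under each possible matchup.
    weights = [
        prior * poisson_pmf(lead, ra * elapsed) * poisson_pmf(0, rb * elapsed)
        for prior, ra, rb in matchups
    ]
    norm = sum(weights)
    return sum(
        (w / norm) * hold_prob(lead, ra, rb, remaining)
        for w, (_, ra, rb) in zip(weights, matchups)
    )

# Hypothetical scoring rates, chosen only for illustration.
PROS = [(1.0, 1.3, 1.3)]
KIDS = [(1 / 3, 12.0, 1.0), (1 / 3, 4.0, 4.0), (1 / 3, 1.0, 12.0)]

for remaining in (0.95, 0.50, 0.05):
    print(remaining, lead_holds(PROS, 2, remaining), lead_holds(KIDS, 2, remaining))
```

With these made-up rates the kids' lead should come out safer with 95% of the game remaining and the pros' lead safer with 5% remaining, so a flipping point F lies somewhere in between. Where it lies depends entirely on the assumed rates and priors.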

Act as if you have log utility and with probability 1 your wealth will converge to infinity.

Sergiu Hart presented this paper at Northwestern last week.  Suppose you are going to be presented an infinite sequence of gambles.  Each has positive expected return but also a positive probability of a loss.  You have to decide which gambles to accept and which gambles to reject. You can also purchase fractions of gambles, exposing yourself to some share \alpha of their returns. Your wealth accumulates (or depreciates) along the way as you accept gambles and absorb their realized returns.

Here is a simple investment strategy that guarantees infinite wealth.  First, for every gamble g that appears you calculate the wealth level such that an investor with that as his current wealth and who has logarithmic utility for final wealth would be just indifferent between accepting and rejecting the gamble.  Let’s call that critical wealth level R(g).  In particular, such an investor strictly prefers to accept g if his wealth is higher than R(g) and strictly prefers to reject it if his wealth is below that level.

Next, when your wealth level is actually W and you are presented gamble g, you find the maximum share of the gamble that an investor with logarithmic utility would be willing to take.  In particular, you determine the share of g such that the critical wealth level R(\alpha g) of the resulting gamble \alpha g is exactly W. Now the sure-thing strategy for your hedge fund is the following:  purchase the share \alpha of the gamble g, realize its returns, wait for next gamble, repeat.

If you follow this rule then no matter what sequence of gambles appears you will never go bankrupt and your wealth will converge to infinity. What’s more, this is in some sense the most aggressive investment strategy you can take without running the risk of going bankrupt.  Foster and Hart show that for any investor who is willing to accept some gambles g at wealth levels W below the critical wealth level R(g), there is a sequence of gambles that will drive that investor to bankruptcy.  (This last result assumes that the investor is using a “scale free” investment strategy, one whose acceptance decisions scale proportionally with wealth.  That’s an unappealing assumption but there is a convincing version of the result without this assumption.)
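To make the recipe concrete, here is a small sketch for a gamble with finitely many outcomes. It reads the definition above literally: a log-utility investor with wealth R is indifferent when E[log(R + g)] = log(R), that is, when E[log(1 + g/R)] = 0, and the code solves that equation by bisection. The function names and the example gamble (win 120 or lose 100 with equal odds) are mine, chosen for illustration.

```python
import math

def expected_log_growth(outcomes, probs, wealth):
    """E[log(1 + g / wealth)]; minus infinity if some outcome wipes you out."""
    total = 0.0
    for x, p in zip(outcomes, probs):
        ratio = 1.0 + x / wealth
        if ratio <= 0.0:
            return float("-inf")
        total += p * math.log(ratio)
    return total

def critical_wealth(outcomes, probs, iterations=200):
    """Wealth R(g) at which a log-utility investor is indifferent to g."""
    max_loss = -min(outcomes)
    mean = sum(p * x for x, p in zip(outcomes, probs))
    if max_loss <= 0 or mean <= 0:
        raise ValueError("need positive expected return and a possible loss")
    lo = max_loss            # expected log growth is -inf here
    hi = 2.0 * max_loss
    while expected_log_growth(outcomes, probs, hi) < 0:
        hi *= 2.0            # expand until the gamble becomes acceptable
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if expected_log_growth(outcomes, probs, mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def max_share(wealth, outcomes, probs):
    """Share alpha with R(alpha * g) = wealth.

    Scaling the gamble by alpha scales its critical wealth by alpha,
    so alpha = wealth / R(g).
    """
    return wealth / critical_wealth(outcomes, probs)

# Hypothetical gamble: win 120 or lose 100 with equal probability.
print(critical_wealth([120, -100], [0.5, 0.5]))
print(max_share(300, [120, -100], [0.5, 0.5]))
```

For this gamble the critical wealth is 600, since log(1 + 120/600) + log(1 - 100/600) = log(1.2 × 5/6) = 0. An investor with wealth 300 would therefore buy half of it.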

In basketball the team benches are near the baskets on opposite sides of the half court line. The coaches roam their respective halves of the court shouting directions to their team.

As in other sports the teams switch sides at halftime but the benches stay where they were. That means that for half of the game the coaches are directing their defenses and for the other half they are directing their offenses.

If coaching helps then we should see more scoring in the half where the offenses are receiving direction.

This could easily be tested.

Here is an excellent rundown of some soul searching in the neuroscience community regarding statistical significance.  The standard method of analyzing brain scan data apparently involves something akin to data mining but the significance tests use standard single-hypothesis p-values.

One historical fudge was to keep to uncorrected thresholds, but instead of a threshold of p=0.05 (or 1 in 20) for each voxel, you use p=0.001 (or 1 in a 1000).  This is still in relatively common use today, but it has been shown, many times, to be an invalid attempt at solving the problem of just how many tests are run on each brain-scan. Poldrack himself recently highlighted this issue by showing a beautiful relationship between a brain region and some variable using this threshold, even though the variable was entirely made up. In a hilarious earlier version of the same point, Craig Bennett and colleagues fMRI scanned a dead salmon, with a task involving the detection of the emotional state of a series of photos of people. Using the same standard uncorrected threshold, they found two clusters of activation in the deceased fish’s nervous system, though, like the Poldrack simulation, proper corrected thresholds showed no such activations.

Biretta blast:  Marginal Revolution.

So there was this famous experiment and just recently a new team of researchers tried to replicate it and they could not. Quoting Alex Tabarrok:

You will probably not be surprised to learn that the new paper fails to replicate the priming effect. As we know from Why Most Published Research Findings are False (also here), failure to replicate is common, especially when sample sizes are small.

There’s a lot more at the MR link; you should check it out. But here’s the thing. If most published research findings are false then which one is the false one, the original or the failed replication? Have you noticed that whenever a failed replication is reported, it is reported with all of the faith and fanfare that the original, now apparently disproven study was afforded? All we know is that one of them is wrong. Can we really be sure which?

If I have to decide which to believe in, my money’s on the original. Think publication bias and ask yourself which is likely to be larger:  the number of unpublished experiments that confirmed the original result or the number of unpublished results that didn’t.

Here’s a model. Experimenters are conducting a hidden search for results and they publish as soon as they have a good one. For the original experimenter a good result means a positive result. They try experiment A and it fails so they conclude that A is a dead end, shelve it and turn to something new, experiment B. They continue until they hit on a positive result, experiment X and publish it.

Given the infinity of possible original experiments they could try, it is very likely that when they come to experiment X they were the first team to ever try it. By contrast, Team-Non-Replicate searches among experiments that have already been published, especially the most famous ones.  And for them a good result is a failure to replicate. That’s what’s going to get headlines.

Since X is a famous experiment it’s not going to take long before they try that. They will do a pilot experiment and see if they can fail to replicate it. If they fail to fail to replicate it, they are going to shelve it and go on to the next famous experiment. But then some other Team-Non-Replicate, who has no way of knowing this is a dead-end, is going to try experiment X, etc. This is going to continue until someone succeeds in failing to replicate.

When that’s all over let’s count the number of times X failed:  1.  The number of times X was confirmed equals 1 plus the number of non-non-replications before the final successful failure.
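The counting argument can be sketched in a few lines. The one ingredient I am adding is a number: suppose each attempt by a Team-Non-Replicate independently fails to replicate with probability `fail_prob` (a hypothetical parameter, not something estimated from any data). Then the number of shelved confirmations before the first published failure is geometric, and the expected number of confirmations, counting the original, is 1 + (1 - fail_prob)/fail_prob.

```python
import random

def hidden_confirmations(fail_prob, rng):
    """Shelved successful replications before the first failure to replicate."""
    count = 0
    while rng.random() >= fail_prob:  # this attempt confirmed X, so it is shelved
        count += 1
    return count

def expected_confirmations(fail_prob):
    """Original result plus the mean number of shelved confirmations."""
    return 1 + (1 - fail_prob) / fail_prob

def simulated_confirmations(fail_prob, trials=100_000, seed=0):
    """Monte Carlo check of expected_confirmations."""
    rng = random.Random(seed)
    total = sum(hidden_confirmations(fail_prob, rng) for _ in range(trials))
    return 1 + total / trials

print(expected_confirmations(0.2), simulated_confirmations(0.2))
```

The published record shows one success and one failure either way. The shelved confirmations are what tilt the tally toward the original.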

Email is the superior form of communication as I have argued a few times before, but it can sure aggravate your self-control problems. I am here to help you with that.

As you sit in your office working, reading, etc., the random email arrival process is ticking along inside your computer. As time passes it becomes more and more likely that there is email waiting for you and if you can’t resist the temptation you are going to waste a lot of time checking to see what’s in your inbox.  And it’s not just the time spent checking because once you set down your book and start checking you won’t be able to stop yourself from browsing the web a little, checking twitter, auto-googling, maybe even sending out an email which will eventually be replied to thereby sealing your fate for the next round of checking.

One thing you can do is activate your audible email notification so that whenever an email arrives you will be immediately alerted. Now I hear you saying “the problem is my constantly checking email, how in the world am I going to solve that by setting up a system that tells me when email arrives? Without the notification system at least I have some chance of resisting the temptation because I never know for sure that an email is waiting.”

Yes, but it cuts two ways.  When the notification system is activated you are immediately informed when an email arrives and you are correct that such information is going to overwhelm your resistance and you will wind up checking. But, what you get in return is knowing for certain when there is no email waiting for you.

It’s a very interesting tradeoff and one we can precisely characterize with a little mathematics. But before we go into it, I want you to ask yourself a question and note the answer before reading on.  On a typical day if you are deciding whether to check your inbox, suppose that the probability is p that you have new mail. What is going to get you to get up and check?  We know that you’re going to check if p=1 (indeed that’s what your mailbeep does, it puts you at p=1.) And we know that you are not going to check when p=0.  What I want to know is what is the threshold above which it’s sufficiently likely that you will check and below which it’s sufficiently unlikely so you’ll keep on reading?  Important:  I am not asking you what policy you would ideally stick to if you could control your temptation, I am asking you to be honest about your willpower.

Ok, now that you’ve got your answer let’s figure out whether you should use your mailbeep or not.  The first thing to note is that the mail arrival process is a Poisson process:  the probability that an email arrives in a given time interval is a function only of the length of time, and it is determined by the arrival rate parameter r.  If you receive a lot of email you have a large r, if the average time spent between arrivals is longer you have a small r.  In a Poisson process, the elapsed time before the next email arrives is a random variable and it is governed by the exponential distribution.

Let’s think about what will happen if you turn on your mail notifier.  Then whenever there is silence you know for sure there is no email, p=0 and you can comfortably go on working temptation free. This state of affairs is going to continue until the first beep at which point you know for sure you have mail (p=1) and you will check it.  This is a random amount of time, but one way to measure how much time you waste with the notifier on is to ask how much time on average will you be able to remain working before the next time you check.  And the answer to that is the expected duration of the exponential waiting time of the Poisson process.  It has a simple expression:

Expected time between checks with notifier on = \frac{1}{r}

Now let’s analyze your behavior when the notifier is turned off.  Things are very different now.  You are never going to know for sure whether you have mail but as more and more time passes you are going to become increasingly confident that some mail is waiting, and therefore increasingly tempted to check. So, instead of p lingering at 0 for a spell before jumping up to 1 now it’s going to begin at 0 starting from the very last moment you previously checked but then steadily and continuously rise over time converging to, but never actually equaling 1.  The exponential distribution gives the following formula for the probability at time T that a new email has arrived.

Probability that email arrives at or before a given time T = 1 - e^{-rT}

Now I asked you what is the p* above which you cannot resist the temptation to check email.  When you have your notifier turned off and you are sitting there reading, p will be gradually rising up to the point where it exceeds p* and right at that instant you will check.  Unlike with the notification system this is a deterministic length of time, and we can use the above formula to solve for the deterministic time at which you succumb to temptation.  It’s given by

Time between checks when the notifier is off = \frac{-\log(1 - p^*)}{r}

And when we compare the two waiting times we see that, perhaps surprisingly, the comparison does not depend on your arrival rate r (it appears in the denominator of both expressions so it will cancel out when we compare them.) That’s why I didn’t ask you that, it won’t affect my prescription (although if you receive as much email as I do, you have to factor in that the mail beep turns into a Geiger counter and that may or may not be desirable for other reasons.)  All that matters is your p* and by equating the two waiting times we can solve for the crucial cutoff value that determines whether you should use the beeper or not.

The beep increases your productivity iff your p* is smaller than \frac{e-1}{e}

This is about .63, so if your p* is less than .63, meaning that your temptation is so strong that you cannot resist checking any time you think that there is at least a 63% chance there is new mail waiting for you, then you should turn on your new mail alert.  If you are less prone to temptation then yes, you should silence it. This is life-changing advice and you are welcome.
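If you would rather check the arithmetic than trust me, the comparison fits in a few lines. This just transcribes the two waiting-time formulas above; the function names are mine.

```python
import math

def wait_with_beep(r):
    """Expected working time between checks with the notifier on: 1/r."""
    return 1.0 / r

def wait_without_beep(r, p_star):
    """Time until 1 - exp(-r*T) reaches p_star, i.e. T = -log(1 - p_star)/r."""
    return -math.log(1.0 - p_star) / r

# Cutoff where the two waiting times coincide: -log(1 - p*) = 1.
THRESHOLD = (math.e - 1.0) / math.e

def beep_helps(p_star):
    """True when the notifier gives you more uninterrupted time on average."""
    return p_star < THRESHOLD

for p_star in (0.3, 0.5, THRESHOLD, 0.8, 0.95):
    # The ratio of the two waits is the same for every arrival rate r.
    ratio = wait_without_beep(1.0, p_star) / wait_with_beep(1.0)
    print(round(p_star, 3), round(ratio, 3), beep_helps(p_star))
```

A ratio below 1 means the silent inbox has you checking sooner than the beep would, so the beep wins.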

Now, for the vapor mill and feeling free to profit, we do not content ourselves with these two extreme mechanisms.  We can theorize what the optimal notification system would be.  It’s very counterintuitive to think that you could somehow “trick” yourself into waiting longer for email but in fact even though you are the perfectly-rational-despite-being-highly-prone-to-temptation person that you are, you can.  I give one simple mechanism, and some open questions below the fold.

It’s the canonical example of reference-dependent happiness. Someone from the Midwest imagines how much happier he would be in California but when he finally has the chance to move there he finds that he is just as miserable as he was before.

But can it be explained by a simple selection effect? Suppose that everyone who lives in the Midwest gets a noisy but unbiased signal of how happy they would be in California. Some overestimate how happy they would be and some underestimate it. Then they get random opportunities to move. Who is going to take that opportunity? Those who overestimate how happy they will be.  And so when they arrive they are disappointed.
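A quick simulation shows the selection effect at work. The setup is a sketch under assumptions of my own: the true happiness gain from moving is standard normal, the signal adds independent normal noise, and anyone with a positive signal moves when given the chance. The "surprise" is the true gain minus the expected gain.

```python
import random

def simulate_moves(n=200_000, noise_sd=1.0, seed=0):
    """Average surprise (true gain minus signal) for movers and for stayers."""
    rng = random.Random(seed)
    movers, stayers = [], []
    for _ in range(n):
        gain = rng.gauss(0.0, 1.0)                 # true happiness gain from moving
        signal = gain + rng.gauss(0.0, noise_sd)   # noisy but unbiased estimate
        surprise = gain - signal
        if signal > 0:
            movers.append(surprise)    # expects to be happier, so moves
        else:
            stayers.append(surprise)   # expects to be worse off, so stays
    return sum(movers) / len(movers), sum(stayers) / len(stayers)

mover_surprise, stayer_surprise = simulate_moves()
print(mover_surprise, stayer_surprise)
```

The signal is unbiased across the whole population, yet the movers are on average disappointed and the stayers would on average have been pleasantly surprised.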

It also explains why people who are forced to leave California, say for job-related reasons, are pleasantly surprised at how happy they can be in the Midwest. Since they hadn’t moved voluntarily already, it’s likely that they underestimated how happy they would be.

These must be special cases of this paper by Eric van den Steen, and it’s similar to the logic behind Lazear’s theory of the Peter Principle.  (For the latter link I thank Adriana Lleras-Muney.)

Jonah Lehrer didn’t:

In many situations, such reinforcement learning is an essential strategy, allowing people to optimize behavior to fit a constantly changing situation. However, the Israeli scientists discovered that it was a terrible approach in basketball, as learning and performance are “anticorrelated.” In other words, players who have just made a three-point shot are much more likely to take another one, but much less likely to make it:

What is the effect of the change in behaviour on players’ performance? Intuitively, increasing the frequency of attempting a 3pt after made 3pts and decreasing it after missed 3pts makes sense if a made/missed 3pts predicted a higher/lower 3pt percentage on the next 3pt attempt. Surprizingly [sic], our data show that the opposite is true. The 3pt percentage immediately after a made 3pt was 6% lower than after a missed 3pt. Moreover, the difference between 3pt percentages following a streak of made 3pts and a streak of missed 3pts increased with the length of the streak. These results indicate that the outcomes of consecutive 3pts are anticorrelated.

This anticorrelation works in both directions, as players who missed a previous three-pointer were more likely to score on their next attempt. A brick was a blessing in disguise.

The underlying study, showing a “failure of reinforcement learning” is here.

Suppose you just hit a 3-pointer and now you are holding the ball on the next possession. You are an experienced player (they used NBA data), so you know if you are truly on a hot streak or if that last make was just a fluke. The defense doesn’t. What the defense does know is that you just made that last 3-pointer and therefore you are more likely to be on a hot streak and hence more likely than average to make the next 3-pointer if you take it. Likewise, if you had just missed the last one, you are less likely to be on a hot streak, but again only you would know for sure. Even when you are feeling it you might still miss a few.

That means that the defense guards against the three-pointer more when you just made one than when you didn’t. Now, back to you. You are only going to shoot the three pointer again if you are really feeling it. That’s correlated with the success of your last shot, but not perfectly. Thus, the data will show the autocorrelation in your 3-point shooting.

Furthermore, when the defense is defending the three-pointer you are less likely to make it, other things equal. Since the defense is correlated with your last shot, your likelihood of making the 3-pointer is also correlated with your last shot. But inversely this time:  if you made the last shot the defense is more aggressive so conditional on truly being on a hot streak and therefore taking the next shot, you are less likely to make it.

(Let me make the comparison perfectly clear:  you take the next shot if you know you are hot, but the defense defends it only if you made the last shot.  So conditional on taking the next shot you are more likely to make it when the defense is not guarding against it, i.e. when you missed the last one.)

You shoot more often and miss more often conditional on a previous make. Your private information about your make probability coupled with the strategic behavior of the defense removes the paradox. It’s not possible to “arbitrage” away this wedge because whether or not you are “feeling it” is exogenous.

I write all the time about strategic behavior in athletic competitions.  A racer who is behind can be expected to ease off and conserve on effort since effort is less likely to pay off at the margin.  Hence so will the racer who is ahead, etc.  There is evidence that professional golfers exhibit such strategic behavior; this is the Tiger Woods effect.

We may wonder whether other animals are as strategically sophisticated as we are.  There have been experiments in which monkeys play simple games of strategy against one another, but since we are not even sure humans can figure those out, that doesn’t seem to be the best place to start looking.

I would like to compare how humans and other animals behave in a pure physical contest like a race.  Suppose the animals are conditioned to believe that they will get a reward if and only if they win a race.  Will they run at maximum speed throughout regardless of their position along the way?  Of course “maximum speed” is hard to define, but a simple test is whether the animal’s speed at a given point in the race is independent of whether they are ahead or behind and by how much.

And if the animals learn that one of them is especially fast, do they ease off when racing against her?  Do the animals exhibit a Tiger Woods effect?

There are of course horse-racing data.  That’s not ideal because the jockey is human.  Still there’s something we can learn from horse racing.  The jockey does not internalize 100% of the cost of the horse’s effort.  Thus there should be less strategic behavior in horse racing than in races between humans or between jockey-less animals.  Dog racing?  Does that actually exist?

And what if a dog races against a human, what happens then?

In the past few weeks Romney has dropped from 70% to under 50% and Gingrich has rocketed to 40% on the prediction markets.  And in this time Obama for President has barely budged from its 50% perch.  As someone pointed out on Twitter (I forget who, sorry) this is hard to understand.

For example, if you think that in this time there has been no change in the conditional probabilities that either Gingrich or Romney beats Obama in the general election, then these numbers imply that the market thinks that those conditional probabilities are the same.  Conversely, if you think that Gingrich has risen because his perceived odds of beating Obama have risen over the same period, then it must be that Romney’s have dropped in precisely the proportion to keep the total probability of a GOP president constant.

It’s hard to think of any public information that could have these perfectly offsetting effects.  Here’s the only theory I could come up with that is consistent with the data.  No matter who the Republican candidate is, he has a 50% chance of beating Obama.  This is just a Downsian prediction.  The GOP machine will move whoever it is to a median point in the policy space.  But, and here’s the model, this doesn’t imply that the GOP is indifferent between Gingrich and Romney.

While any candidate, no matter what his baggage, can be repositioned to the Downsian sweet spot, the cost of that repositioning depends on the candidate, the opposition, and the political climate.  The swing from Romney to Gingrich reflects new information about these that alters the relative cost of marketing the two candidates.  Gingrich has for some reason gotten relatively cheaper.

I didn’t say it was a good theory.

Update:  Rajiv Sethi reminded me that the tweet was from Richard Thaler. (And see Rajiv’s comment below.)

Stefan Lauermann points me to a new paper, this is from the abstract:

Our analysis shows that both stake size and communication have a significant impact on the player’s likelihood to cooperate. In particular, we observe a negative correlation between stake size and cooperation. Also certain gestures, as handshakes, decrease the likelihood to cooperate. But, if players mutually promise each other to cooperate and in addition shake hands on it, the cooperation rate increases.

Measuring social influence is notoriously difficult in observational data.  If I like Tin Hat Trio and so do my friends, is it because I influenced them or do we just have similar tastes, as friends often do?  A controlled experiment is called for.  It’s hard to figure out how to do that.  How can an experimenter cause a subject to like something new and then study the effect on his friends?

Online social networks open up new possibilities.  And here is the first experiment I came across that uses Facebook to study social influence, by Johan Egebark and Mathias Ekstrom.  If one of your friends “likes” an item on Facebook, will it make you like it too?

Making use of five Swedish users’ actual accounts, we create 44 updates in total during a seven month period. For every new update, we randomly assign our user’s friends into either a treatment or a control group; hence, while both groups are exposed to identical status updates, treated individuals see the update after someone (controlled by us) has Liked it whereas individuals in the control group see it without anyone doing so. We separate between three different treatment conditions: (i) one unknown user Likes the update, (ii) three unknown users Like the update and (iii) one peer Likes the update. Our motivation for altering treatments is that it enables us to study whether the number of previous opinions as well as social proximity matters. The result from this exercise is striking: whereas the first treatment condition left subjects unaffected, both the second and the third more than doubled the probability of Liking an update, and these effects are statistically significant.

I was working on a paper, writing the introduction to a new section that deals with an extension of the basic model. It’s a relevant extension because it fits many real-world applications. So naturally I started to list the many real-world applications.

“This applies to X, Y, and….” hmmm… what’s the Z? Nothing coming to mind.

But I can’t just stop with X and Y. Two examples are not enough. If I only list two examples then the reader will know that I could only think of two examples and my pretense that this extension applies to many real-world applications will be dead on arrival.

I really only need one more. Because if I write “This applies to X, Y, Z, etc.” then the Z plus the “etc.” proves that there is in fact a whole blimpload of examples that I could have listed and I just gave the first three that came to mind, then threw in the etc. to save space.

If you have ever written anything at all you know this feeling. Three equals infinity but two is just barely two.

This is largely an equilibrium phenomenon. A convention emerged according to which those who have an abundance of examples are required to prove it simply by listing three. Therefore those who have listed only two examples truly must have only two.

Three isn’t the only threshold that would work as an equilibrium.  There are many possibilities such as two, four, five etc.  (ha!) Whatever threshold N we settle on, authors will spend the effort to find N examples (if they can) and anything short of that will show that they cannot.

But despite the multiplicity I bet that the threshold of three did not emerge arbitrarily. Here is an experiment that illustrates what I am thinking.

Subjects are given a category and 1 minute, say. You ask them to come up with as many examples from that category as they can think of in 1 minute. After the minute is up, you count how many examples they came up with and then give them another 15 minutes to come up with as many more as they can.

With these data we would do the following. Plot on the horizontal axis the number x of items they listed in the first minute and on the vertical axis the number E(y|x) equal to the empirical average number y of items they came up with in total conditional on having come up with x items in the first minute.

I predict that you will see an anomalous jump upwards between E(y|2) and E(y|3).

This experiment does not take into account the incentive effects that come from the threshold.  The incentives are simply to come up with as many examples as possible.  That is intentional.  The point is that this raw statistical relation (if it holds up) is the seed for the equilibrium selection.  That is, when authors are not being strategic, then three-or-more equals many more than two.  Given that, the strategic response is to shoot for exactly three.  The equilibrium result is that three equals infinity.

via Arthur Robson:

While appeals often unmask shaky evidence, this was different. This time, a mathematical formula was thrown out of court. The footwear expert made what the judge believed were poor calculations about the likelihood of the match, compounded by a bad explanation of how he reached his opinion. The conviction was quashed.

And the judge ruled that Bayes’ law for conditional probabilities could not be used in court.  Statisticians, mathematicians, and prosecutors are worried that justice will suffer as a result.  The statistical evidence centered around the likelihood of a coincidental match of shoeprint with shoes owned by the defendant.

In the shoeprint murder case, for example, it meant figuring out the chance that the print at the crime scene came from the same pair of Nike trainers as those found at the suspect’s house, given how common those kinds of shoes are, the size of the shoe, how the sole had been worn down and any damage to it. Between 1996 and 2006, for example, Nike distributed 786,000 pairs of trainers. This might suggest a match doesn’t mean very much. But if you take into account that there are 1,200 different sole patterns of Nike trainers and around 42 million pairs of sports shoes sold every year, a matching pair becomes more significant.

Now if I can prove to jurors that there was one shoe in the basement and another shoe upstairs, then probably I can legitimately claim to have proven that the total number of shoes is two because the laws of arithmetic should be binding on the jurors’ deductions.  And if there is a chance that a juror comes to some different conclusion then it would make sense for an expert witness, or the judge even, to tell the juror that he is making a mistake.  Indeed a courtroom demonstration could prove the juror wrong.

But do the “laws” of probability have the same status?  If I can prove to the juror that his prior should attach probability p to A and probability q to [A and B], and if the evidence proves that A is true,  should he then be required to attach probability q/p to B?  Suppose for example that a juror disagreed with this conclusion. Could he be proven wrong?  A courtroom demonstration could show something about relative frequencies, but the juror could dispute that these have anything to do with probabilities.

It appears though that the judge’s ruling in this case was not on the basis of Bayesian/frequentist philosophy, but rather about the validity of a Bayesian prescription when the prior itself is subjective.

The judge complained that he couldn’t say exactly how many of one particular type of Nike trainer there are in the country. National sales figures for sports shoes are just rough estimates.

And so he decided that Bayes’ theorem shouldn’t again be used unless the underlying statistics are “firm”. The decision could affect drug traces and fibre-matching from clothes, as well as footwear evidence, although not DNA.

This is a reasonable judgment even if the court upholds Bayesian logic per se, because the prior probability of a second pair of matching shoes can be deduced from the sales figures only under some assumptions about the distribution of shoes with various tread patterns.  The expert witnesses probably assumed that the accused and a hypothetical third-party murderer were randomly assigned tread patterns on their Nikes and that these assignments were independent.  But if the two live in the same town and shop at the same shoe store and if that store sold shoes with the same tread pattern, then that assumption would significantly understate the probability of a match.

Let’s say I want to know how many students in my class are cheating on exams. Maybe I’d like to know who the individual cheaters are, maybe I don’t, but let’s say that the only way I can find out the number of cheaters is to ask the students themselves to report whether or not they cheated.  I have a problem because no matter how hard I try to convince them otherwise, they will assume that a confession will get them in trouble.

Since I cannot persuade them of my incentives, instead I need to convince them that it would be impossible for me to use their confession as evidence against them even if I wanted to.  But these two requirements are contradictory:

  1. The students tell the truth.
  2. A confession is not proof of their guilt.

So I have to abandon one of them.  That’s when you notice that I don’t really need every student to tell the truth.  Since I just want the aggregate cheating rate, I can live with false responses as long as I can use the response data to infer the underlying cheating rate.  If the students randomize whether they tell me the truth or lie, then a confession is not proof that they cheated.  And if I know the probabilities with which they tell the truth or lie, then with a large sample I can infer the aggregate cheating rate.

That’s a trick I learned about from this article.  (Glengarry glide: John Chilton.)  The article describes a survey designed to find out how many South African farmers illegally poached leopards.  The farmers were given a six-sided die and told to privately roll the die before responding to the question.  They were instructed that if the die came up a 1 they should say yes that they killed leopards.  If it came up a 6 they should say that they did not.  And if a 2-5 appears they should tell the truth.

A farmer who rolls a 2-5 can safely tell the researcher that he killed leopards because his confession is indistinguishable from a case in which he rolled a 1 and was just following instructions.  It is statistical evidence against him at worst, probably not admissible in court.  And assuming the farmers followed instructions, those who killed leopards will answer yes with probability 5/6 and those who did not will answer yes with probability 1/6.  In a large sample, the fraction of yes answers will be a weighted average of those two numbers, with the weights telling you the desired aggregate statistic.
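Here is a minimal sketch of that inference in Python, assuming everyone follows the die instructions exactly. The function names are mine, not the article's. Since the observed yes rate equals 1/6 + (4/6) × (true rate), the true rate can be backed out by inverting that line.

```python
import random


def estimate_true_rate(yes_fraction, p_forced_yes=1 / 6, p_forced_no=1 / 6):
    """Back out the true rate from the observed fraction of 'yes' answers."""
    # Probability the die says "tell the truth" (a 2-5 on a six-sided die).
    p_truth = 1 - p_forced_yes - p_forced_no
    # observed = p_forced_yes + p_truth * true_rate, solved for true_rate.
    return (yes_fraction - p_forced_yes) / p_truth


def simulate_yes_fraction(true_rate, n, seed=0):
    """Simulate n respondents who follow the die protocol honestly."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        guilty = rng.random() < true_rate
        roll = rng.randint(1, 6)
        if roll == 1:
            answer = True       # forced "yes"
        elif roll == 6:
            answer = False      # forced "no"
        else:
            answer = guilty     # truthful answer
        yes += answer
    return yes / n


if __name__ == "__main__":
    observed = simulate_yes_fraction(true_rate=0.2, n=200_000)
    print("observed yes fraction:", observed)
    print("estimated true rate:  ", estimate_true_rate(observed))
```

The estimate is noisier than a direct survey of the same size, since only two thirds of the answers carry any information. That is the price of deniability.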

Stan Reiter had a standard gripe about statistics/econometrics.  Imagine there is a cave in front of you and you want to map out its dimensions.  There are many ways you could do it.  One thing you could do is go inside and look. Another thing you could do is stand outside and throw into the cave a bunch of super bouncy balls and when they bounce out, take careful note of their speed and trajectory in order to infer what walls they must have bounced off of and where. Stan equated econometrics with the latter.

That’s not what I am going to say but it is a funny story and it’s the first thought that came to my mind as I began to write this post.

But I do have something, probably even more heretical, to say about econometrics. Suppose I have a hypothesis or a model and I collect some data that is relevant.  If I am an applied econometrician what I do is run some tests on the data and report the results of the tests.  I tell you with my tests how you should interpret the data.

My tests don’t contain any information in them that isn’t in the raw data.  My tests are just a super sophisticated way to summarize the data.  If I just showed you the tables it would be too much information.  So really, my tests do nothing more than save you the work of doing the tests yourself.

But I pick the tests.  You might have picked different tests.  And even if you like my tests you might disagree with the conclusion I draw from them.  I say “because of these tests you should conclude that H is very likely false.”  But that’s a conclusion that follows not just from the data, but also from my prior which you may not share.

What if, instead of giving you the raw data and instead of giving you my test results, I did something like the following?  I give you a piece of software which allows you to enter your prior and then it tells you what, based on the data and your prior, your posterior should be.  Note that such a function completely summarizes what is in the data.  And it avoids the most common knee-jerk criticism of Bayesian statistics, namely that it depends on an arbitrary choice of prior.  You tell me what your prior is, I will tell you (what the data says is) your posterior.
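In the simplest possible case, with a single hypothesis H that is either true or false, that piece of software is just Bayes' rule with the likelihoods baked in. The sketch below is only an illustration: the function name and the two likelihood numbers are made up, and a real application would need a prior over a whole parameter space rather than a single probability.

```python
def make_posterior_map(likelihood_if_true, likelihood_if_false):
    """Return a function prior -> posterior for a fixed data set.

    likelihood_if_true:  probability of the observed data if H is true
    likelihood_if_false: probability of the observed data if H is false
    The data enter only through these two numbers.
    """
    def posterior(prior):
        numerator = likelihood_if_true * prior
        denominator = numerator + likelihood_if_false * (1 - prior)
        return numerator / denominator

    return posterior


if __name__ == "__main__":
    # Hypothetical data that are four times as likely when H is false.
    posterior = make_posterior_map(likelihood_if_true=0.2, likelihood_if_false=0.8)
    for prior in (0.1, 0.5, 0.9):
        print(f"prior {prior:.1f} -> posterior {posterior(prior):.3f}")
```

Two readers with different priors run the same function and get different posteriors, and neither has to accept the other's starting point.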

Pause and notice that this function is exactly what applied statistics aims to be, and think about why, in practice, it doesn’t seem to be moving in this direction.

First of all, as simple as it sounds, it would be impossible to compute this function in all practical situations.  But still, an approach to statistics based on such an objective, and subject to the technical constraints, would look very different from what is done in practice.

A big part of the explanation is that statistics is a rhetorical practice.  The goal is not just to convey information but rather to change minds.  In an imaginary perfect world there is no distinction between these goals.   If I have data that proves H is false I can just distribute that data, everyone will analyze it in their own favorite way, everyone will come to the same conclusion, and that will be enough.

But in the real world that is not enough.  I want to state in clear, plain language terms “H is false, read all about it” and have that statement be the one that everyone focuses on.  I want to shape the debate around that statement.  I don’t want nuances to distract attention away from my conclusion.  In the real world, with limited attention spans, imperfect reasoning, imperfect common-knowledge, and just plain old laziness, I can’t get that kind of focus unless I push the data into the background and my preferred interpretation into the foreground.

I am not being cynical.  All of that is true even if my interpretation is the right one and the most important one.  As a practical matter if I want to maximize the impact of the truth I have to filter it.

Still it’s useful to keep this perspective in mind.

  1. There is an inverse relationship between how carefully you stack the dishes inside the dishwasher and how tidy you keep it outside in your kitchen.
  2. In addition to funny-haha and funny-strange there is a third category of joke where the impetus for laughter is that the comedian has made some embarrassing fact that is privately true for all of us into common knowledge.
  3. It would be too much of an accident for 50-50 genetic mixing to be evolutionarily optimal.  So to compensate we must have a programmed taste either for mates who are similar to us or who are different.
  4. It is well known that in a moderately sized group of total strangers the probability is about 50% that two of them will have the same birthday.  But when that group happens to be at a restaurant the probability is virtually 1.

A buyer and a seller are negotiating a sale price.  The buyer has some privately known value and the seller has some privately known cost and with positive probability there are gains from trade but with positive probability the seller’s cost exceeds the buyer’s value.  (So this is the Myerson-Satterthwaite setup.)

Do three treatments.

  1. The experimenter fixes a price in advance and the buyer and seller can only accept or reject that price.  Trade occurs if and only if they both accept.
  2. The seller makes a take it or leave it offer.
  3. The parties can freely negotiate and they trade if and only if they agree on a price.

Theoretically there is no clear ranking of these three mechanisms in terms of their efficiency (the total gains from trade realized).  In practice the first mechanism clearly sacrifices some efficiency in return for simplicity and transparency.  If the price is set right the first mechanism would outperform the second in terms of efficiency due to a basic market power effect.  In principle the third treatment could allow the parties to find the most efficient mechanism, but it would also allow them to negotiate their way to something highly inefficient.

A conjecture would be that with a well-chosen price the first mechanism would be the most efficient in practice.   That would be an interesting finding.

A variation would be to do something similar but in a public goods setting.  We would again compare simple but rigid mechanisms with mechanisms that allow for more strategic behavior.  For example, a version of mechanism #1 would be one in which each individual was asked to contribute an equal share of the cost and the project succeeds if and only if all agree to their contributions.  Mechanism #3 would allow arbitrary negotiation with the only requirement being that the total contribution exceeds the cost of the project.

In the public goods setting I would conjecture that the opposite force is at work.  The scope for additional strategizing (seeding, cajoling, guilt-tripping, etc) would improve efficiency.

Anybody know if experiments like these have been done?

Nonsense?

For Shmanske, it’s all about defining what counts as 100% effort. Let’s say “100%” is the maximum amount of effort that can be consistently sustained. With this benchmark, it’s obviously possible to give less than 100%. But it’s also possible to give more. All you have to do is put forth an effort that can only be sustained inconsistently, for short periods of time. In other words, you’re overclocking.

And in fact, based on the numbers, NBA players pull greater-than-100-percent off relatively frequently, putting forth more effort in short bursts than they can keep up over a longer period. And giving greater than 100% can reduce your ability to subsequently and consistently give 100%. You overdraw your account, and don’t have anything left.

Here is the underlying paper.  <Painfully repressing the theorist’s impulse to redefine the domain to paths of effort rather than flow efforts, thus restoring the spiritually correct meaning of 100%>

Cap curl:  Tim Carmody guest blogging at kottke.org.

In tennis, a server should win a larger percentage of second-serve points compared to first-serve points; that much we know.  Partly that’s because a server optimally serves more faults (serves that land out) on first serve than second serve.  But what if we condition on the event that the first serve goes in? Here’s a flawed logic that takes a bit of thinking to see through:

Even conditional on a first serve going in, the probability that the server wins the point must be no larger than the total win probability for second serves. Because suppose it were larger.  Then the server wins with a higher probability when his first serve goes in.  So he should ease off just a bit on his first serve so that a larger percentage lands in, raising the total probability that he wins the point.  Even though the slightly slower first serve wins with a slightly reduced probability (conditional on going in) he still has a net gain as long as he eases off just slightly so that it is still larger than the second serve percentage. Indeed the lower probability of a fault could even raise the total probability that he wins on the first serve.

Consider the following syllogism:

  1. If a person is an American, he is probably not a member of Congress.
  2. This person is a member of Congress.
  3. Therefore he is probably not American.

As John D. Cook writes:

We can’t reject a null hypothesis just because we’ve seen data that are rare under this hypothesis. Maybe our data are even more rare under the alternative. It is rare for an American to be in Congress, but it is even more rare for someone who is not American to be in the US Congress!

Jonah Lehrer writes about how bad NFL teams are at drafting talented players, particularly at the quarterback position.

Despite this advantage, however, sports teams are impressively amateurish when it comes to the science of human capital. Time and time again, they place huge bets on the wrong players. What makes these mistakes even more surprising is that teams have a big incentive to pick the right players, since a good QB (or pitcher or point guard) is often the difference between a middling team and a contender. (Not to mention, the player contracts are worth tens of millions of dollars.) In the ESPN article, I focus on quarterbacks, since the position is a perfect example of how teams make player selection errors when they focus on the wrong metrics of performance. And the reason teams do that is because they misunderstand the human mind.

He talks about a test that is given to college quarterbacks eligible for the NFL draft to test their ability to make good decisions on the field.  Evidently this test is considered important by NFL scouts and indeed scores on this test are good predictors of whether and when a QB will be selected in the draft.

However,

Consider a recent study by economists David Berri and Rob Simmons. While they found that Wonderlic scores play a large role in determining when QBs are selected in the draft — the only equally important variables are height and the 40-yard dash — the metric proved all but useless in predicting performance. The only correlation the researchers could find suggested that higher Wonderlic scores actually led to slightly worse QB performance, at least during rookie years. In other words, intelligence (or, rather, measured intelligence), which has long been viewed as a prerequisite for playing QB, would seem to be a disadvantage for some guys. Although it’s true that signal-callers must grapple with staggering amounts of complexity, they don’t make sense of questions on an intelligence test the same way they make sense of the football field. The Wonderlic measures a specific kind of thought process, but the best QBs can’t think like that in the pocket. There isn’t time.

I have not read the Berri-Simmons paper but inferences like this raise alarm bells.  For comparison, consider the following observation. Among NBA basketball players, height is a poor predictor of whether a player will be an All-Star.  Therefore, height does not matter for success in basketball.

The problem is that, both in the case of IQ tests for QBs and height for NBA players, we are measuring performance conditional on being good enough to compete with the very best. We don’t have the data to compare the QBs who are drafted to the QBs who are not and how their IQ factors into the difference in performance.

The observable characteristic (IQ scores, height) is just one of many important characteristics, some of which are not quantifiable in data. Given that the player is selected into the elite, if his observable score is low we can infer that his unobservable scores must be very high to compensate. But if we omit those intangibles in the analysis, it will look like people with low scores are about as good as people with high scores and we would mistakenly conclude that they don’t matter.
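A toy simulation makes the point concrete. This is a sketch under strong assumptions of my own, not anything from the Berri-Simmons paper: performance is simply the sum of one observed score and one unobserved "intangible," both standard normal and independent, and the elite are the top 1% by performance.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def selection_demo(n=100_000, elite_share=0.01, seed=0):
    """Correlation of observed score with performance: everyone vs. the elite."""
    rng = random.Random(seed)
    players = []
    for _ in range(n):
        observed = rng.gauss(0, 1)      # e.g. a test score or height
        intangible = rng.gauss(0, 1)    # everything the data cannot see
        players.append((observed, observed + intangible))
    # Keep only the players with the highest performance.
    players.sort(key=lambda p: p[1], reverse=True)
    elite = players[: int(n * elite_share)]
    corr_all = pearson([p[0] for p in players], [p[1] for p in players])
    corr_elite = pearson([p[0] for p in elite], [p[1] for p in elite])
    return corr_all, corr_elite


if __name__ == "__main__":
    corr_all, corr_elite = selection_demo()
    print("correlation in the whole population:", round(corr_all, 3))
    print("correlation among the elite:        ", round(corr_elite, 3))
```

By construction the observed score matters exactly as much as the intangible, yet within the selected group it predicts performance far less well than it does in the population at large.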

I am always writing about athletics from the strategic point of view:  focusing on the tradeoffs.  One tradeoff in sports that lends itself to strategic analysis is effort vs performance.  When do you spend the effort to raise your level of play and rise to the occasion?

My posts on those subjects attract a lot of skeptics.  They doubt that professional athletes do anything less than giving 100% effort.  And if they are always giving 100% effort, then the outcome of a contest is just determined by gourd-given talent and random factors. Game theory would have nothing to say.

We can settle this debate.  I can think of a number of smoking guns to be found in data that would prove that, even at the highest levels, athletes vary their level of performance to conserve effort; sometimes trying hard and sometimes trying less hard.

Here is a simple model that would generate empirical predictions.  It’s a model of a race. The contestants continuously adjust how much effort to spend to run, swim, bike, etc. to the finish line. They want to maximize their chance of winning the race, but they also want to spend as little effort as necessary.  So far, straightforward.  But here is the key ingredient in the model: the contestants are looking forward when they race.

What that means is at any moment in the race, the strategic situation is different for the guy who is currently leading compared to the trailers.  The trailer can see how much ground he needs to make up but the leader can’t see the size of his lead.

If my skeptics are right and the racers are always exerting maximal effort, then there will be no systematic difference in a given racer’s time when he is in the lead versus when he is trailing.  Any differences would be due only to random factors like the racing conditions, what he had for breakfast that day, etc.

But if racers are trading off effort and performance, then we would have some simple implications that, if they were borne out in data, would reject the skeptics’ hypothesis.  The most basic prediction follows from the fact that the trailer will adjust his effort according to the information he has that the leader does not have.  The trailer will speed up when he is close and he will slack off when he has no chance.

In terms of data the simplest implication is that the variance of times for a racer when he is trailing will be greater than when he is in the lead.  And more sophisticated predictions would follow.  For example the speed of a trailer would vary systematically with the size of the gap while the speed of a leader would not.

The results from time trials (isolated performance where the only thing that matters is time) would be different from results in head-to-head competitions. The results in sequenced competitions, like downhill skiing, would vary depending on whether the racer went first (in ignorance of the times to beat) or last.

And here’s my favorite:  swimming races are unique because there is a brief moment when the leader gets to see the competition:  at the turn.  This would mean that there would be a systematic difference in effort spent on the return lap compared to the first lap, and this would vary depending on whether the swimmer is leading or trailing and with the size of the lead.

And all of that would be different for freestyle races compared to backstroke (where the leader can see behind him).

Finally, it might even be possible to formulate a structural model of an effort/performance race and estimate it with data.  (I am still on a quest to find an empirically oriented co-author who will take my ideas seriously enough to partner with me on a project like this.)

Drawing:  Because Its There from www.f1me.net

Boston being a center for academia as well as professional sports, Harvard and MIT faculty and students are leading the way in the business of sports consulting.

And some of those involved aren’t that far away from being kids. Harvard sophomore John Ezekowitz, who is 20, works for the NBA’s Phoenix Suns from his Cambridge dorm room, looking beyond traditional basketball statistics like points, rebounds, assists, and field goal percentage to better quantify player performance. He is enjoying the kind of early exposure to professional sports once reserved for athletic phenoms and once rare at institutions like Harvard and MIT. “If I do a good job, I can have some new insight into how this team plays, what works and what doesn’t,” says Ezekowitz. “To think that I might have some measure of influence, however small, over how a team plays is a thrill.” It’s not a bad job, either. While he doesn’t want to reveal how much he earns as a consultant, he says that not only does he eat better than most college students, the extra cash also allows him to feed his golf-club-buying habit.

From a fun little article by Andrew Gelman and Deborah Nolan:

The law of conservation of angular momentum tells us that once the coin is in the air, it spins at a nearly constant rate (slowing down very slightly due to air resistance). At any rate of spin, it spends half the time with heads facing up and half the time with heads facing down, so when it lands, the two sides are equally likely (with minor corrections due to the nonzero thickness of the edge of the coin); see Figure 3. Jaynes (1996) explained why weighting the coin has no effect here (unless, of course, the coin is so light that it floats like a feather): a lopsided coin spins around an axis that passes through its center of gravity, and although the axis does not go through the geometrical center of the coin, there is no difference in the way the biased and symmetric coins spin about their axes.

On the other hand, a weighted coin spun on a table will show a bias for the weighted side.  The article describes some experiments and statistical tests to use in the classroom.  There are some entertaining stories too.  Like how the King of Norway avoided losing the entire Island of Hising to the King of Sweden by rolling a 13 with a pair of dice (“One die landed six, and the other split in half landing with both a six and a one showing.”)

Visor volley:  Toomas Hinnosaar.

By asking a hand-picked team of 3 or 4 experts in the field (the “peers”), journals hope to accept the good stuff, filter out the rubbish, and improve the not-quite-good-enough papers.

…Overall, they found a reliability coefficient (r^2) of 0.23, or 0.34 under a different statistical model. This is pretty low, given that 0 is random chance, while a perfect correlation would be 1.0. Using another measure of IRR, Cohen’s kappa, they found a reliability of 0.17. That means that peer reviewers only agreed on 17% more manuscripts than they would by chance alone.

That’s from neuroskeptic writing about an article that studies the peer-review process.  I couldn’t tell you what Cohen’s kappa means but let’s just take the results at face value:  referees disagree a lot.  Is that bad news for peer-review?
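For reference, the textbook definition of Cohen's kappa is (observed agreement − chance agreement) / (1 − chance agreement), where chance agreement is what two raters with the same marginal accept/reject rates would produce if they rated independently. On that definition, 0.17 means the referees got about 17% of the way from chance agreement to perfect agreement. A short sketch, with made-up ratings rather than data from the study:

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labeled the same items."""
    n = len(ratings_a)
    # Share of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected if each rater kept their own label frequencies
    # but rated independently of the other.
    labels = set(ratings_a) | set(ratings_b)
    expected = sum(
        (ratings_a.count(label) / n) * (ratings_b.count(label) / n)
        for label in labels
    )
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical verdicts from two referees on eight manuscripts.
    referee_1 = ["accept", "reject", "reject", "accept", "reject", "reject", "accept", "reject"]
    referee_2 = ["accept", "reject", "accept", "reject", "reject", "reject", "reject", "reject"]
    print("kappa:", cohens_kappa(referee_1, referee_2))
```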

Suppose that you are thinking about whether to go to a movie and you have three friends who have already seen it.  You must choose in advance one or two of them to ask for a recommendation.  Then after hearing their recommendation you will decide whether to see the movie.

You might decide to ask just one friend.  If you do it will certainly be the case that sometimes she says thumbs-up and sometimes she says thumbs-down. But let’s be clear why.  I am not assuming that your friends are unpredictable in their opinions.  Indeed you may know their tastes very well.  What I am saying is rather that, if you decide to ask this friend for her opinion, it must be because you don’t know it already. That is, prior to asking you cannot predict whether or not she will recommend this particular movie.  Otherwise, what is the point of asking?

Now you might ask two friends for their opinions.  If you do, then it must be the case that the second friend will often disagree with the first friend.  Again, I am not assuming that your friends are inherently opposed in their views of movies. They may very well have similar tastes. After all they are both your friends. But, you would not bother soliciting the second opinion if you knew in advance that it was very likely to agree or disagree with the first on this particular movie. Because if you knew that then all you would have to do is ask the first friend and use her answer to infer what the second opinion would have been.

If the two referees you consult are likely to agree one way or the other, you get more information by instead dropping one of them and bringing in your third friend, assuming he is less likely to agree.

This is all to say that disagreement is not evidence that peer-review is broken. Exactly the opposite:  it is a sign that editors are doing a good job picking referees and thereby making the best use of the peer-review process.
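One crude way to put a number on this is binary entropy. It is my own proxy, not a full model of the editor's problem. Suppose the second referee agrees with the first with some known probability. The uncertainty left in the second report once you have seen the first is the entropy of that agreement probability, which peaks when agreement is a coin flip and vanishes when agreement (or disagreement) is certain.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no event that happens with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


if __name__ == "__main__":
    # How many bits of news the second opinion can carry, as a function
    # of how likely it is to agree with the first.
    for p_agree in (0.5, 0.7, 0.9, 0.99, 1.0):
        print(f"P(agree) = {p_agree:.2f}: {binary_entropy(p_agree):.3f} bits")
```

This measures how surprising the second report is, not how informative it is about the paper itself, so it is only a first step toward the formal model.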

It would be very interesting to formalize this model, derive some testable implications, and bring it to data. Good data are surely easily accessible.

(Picture:  Right Sizing from www.f1me.net)

(Regular readers of this blog will know I consider that a good thing.)

The fiscal multiplier is an important and hotly debated measure for macroeconomic policy. If the government spends an additional dollar, a dollar’s worth of output is produced, but in addition the dollar is added to disposable income of the recipients who then spend some fraction of it. More output is produced, etc.
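The "etc." is a geometric series. As a sketch, assume every recipient spends the same constant fraction of each extra dollar of income. That is the textbook simplification, and the name `mpc` in the code is mine. Then the rounds of spending sum to 1 / (1 − mpc).

```python
def multiplier_by_rounds(mpc, rounds):
    """Add up the output created in each successive round of spending."""
    total = 0.0
    injection = 1.0              # the government's original dollar
    for _ in range(rounds):
        total += injection       # output produced in this round
        injection *= mpc         # the part that gets spent again
    return total


def multiplier_closed_form(mpc):
    """Limit of the geometric series 1 + mpc + mpc**2 + ..."""
    return 1.0 / (1.0 - mpc)


if __name__ == "__main__":
    for mpc in (0.25, 0.5, 0.75):
        print(mpc, multiplier_by_rounds(mpc, 200), multiplier_closed_form(mpc))
```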

It’s hard to measure the multiplier because observed increases in government spending are endogenous and correlated with changes in output for reasons that have nothing to do with fiscal stimulus.

Daniel Shoag develops an instrument which isolates a random component to state-level government spending changes.

Many US states manage pensions which are defined-benefit plans. Defined benefits means that retirees are guaranteed a certain benefit level. This means that the state government bears all of the risk from the investments of these pension funds. Excess returns from these funds are unexpected exogenous windfalls to state spending budgets.

With this instrument, Daniel estimates that an additional dollar of state government spending increases income in the state by $2.12. That is a large multiplier.

The result must be interpreted with some caveats in mind. First, state spending increases act differently than increases at the national level where general equilibrium effects on prices and interest rates would be larger. Second, these spending increases are funded by windfall returns. The effects are likely to be different than spending increases funded by borrowing which crowds out private investment.

Here’s a broad class of games that captures a typical form of competition.  You and a rival simultaneously choose how much effort to spend and depending on your choices, you earn a score, a continuous variable.  The score is increasing in your effort and decreasing in your rival’s effort.  Your payoff is increasing in your score and decreasing in your effort.  Your rival’s payoff is decreasing in your score and his effort.

In football, this could model an individual play where the score is the number of yards gained.  A model like this gives qualitatively different predictions when the payoff is a smooth function of the score versus when there are jumps in the payoff function.  For example, suppose that it is 3rd down and 5 yards to go. Then the payoff increases gradually in the number of yards you gain but then jumps up discretely if you can gain at least 5 yards giving you a first down. Your rival’s payoff exhibits a jump down at that point.

If it is 3rd down and 20 then that payoff jump requires a much higher score. This is the easy case to analyze because the jump is too remote to play a significant role in strategy.  The solution will be characterized by a local optimality condition.  Your effort is chosen to equate the marginal cost of effort to the marginal benefit from the resulting increase in score, given your rival’s effort.  Your rival solves an analogous problem.  This yields an equilibrium score strictly less than 20.  (A richer, and more realistic model would have randomness in the score.)  In this equilibrium it is possible for you to increase your score, even possibly to 20, but the cost of doing so in terms of increased effort is too large to be profitable.

Suppose that in the above equilibrium you gain 4 yards. Then when it is 3rd down and 5 this equilibrium will unravel.  The reason is that although the local optimality condition still holds, you now have a profitable global deviation, namely putting in enough effort to gain 5 yards.  That deviation was possible before but unprofitable because 5 yards wasn’t worth much more than 4.  Now it is.

Of course it will not be an equilibrium for you to gain 5 yards because then your opponent can increase effort and reduce the score below 5 again.  If so, then you are wasting the extra effort and you will reduce it back to the old value. But then so will he, etc.  Now equilibrium requires mixing.

Finally, suppose it is 3rd down and inches.  Then we are back to a case where we don’t need mixing.  Because no matter how much effort your opponent uses you cannot be deterred from putting in enough effort to gain those inches.

The pattern of predictions is thus:  randomness in your strategy is non-monotonic in the number of yards needed for a first down.  With only inches to go strategy is predictable, with a moderate number of yards to go there is maximal randomness, and then with many yards to go, strategy is predictable again. Variance in the number of yards gained in these cases will exhibit a similar non-monotonicity.

This could be tested using football data, with run vs. pass mix being a proxy for randomness in strategy.

While we are on the subject, here is my Super Bowl tweet.

I am talking about world records of course.  Tyler Cowen linked to this Boston Globe piece about the declining rate at which world records are broken in athletic events, especially Track and Field.  (Usain Bolt is the exception.)

How quickly should we expect the rate of new world records to decline?  Suppose that long jumps are independent draws from a Normal distribution.  Very quickly the world record will be in the tail.  At that point breaking the record becomes very improbable.  But should the rate decline quickly from there?  Two forces are at work.

First, every new record pushes us further into the tail and reduces the probability, and hence frequency, of new records.  But, because of the thin tail property of the Normal distribution, new records will with very high probability be tiny advances.  So the new record will be harder to beat but not by very much.

So the rate will decline and asymptotically it will be zero, but how fast will it converge to zero?  Will there be a constant K such that we will have to wait no more than nK years for the nth record to be broken or will it be faster than that?

I am sure there is an easy answer to this question for the Normal distribution and probably a more general result, but my intuition isn’t taking me very far.  Probably this is a standard homework problem in probability or statistics.

The Boston Globe piece is about humans ceasing to progress physically.  The theory could shed light on this conclusion.  If the answer above is that the waiting time between records increases exponentially, I wonder at what rate the mean of the distribution can grow and still give rise to the slowdown.  If the mean grows logarithmically?
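For independent draws from any fixed continuous distribution there is a classical answer: the nth attempt is a record with probability 1/n, because each of the first n attempts is equally likely to be the best. So the expected number of records in n attempts is the harmonic number 1 + 1/2 + ... + 1/n, which grows like ln n, and seeing k records takes on the order of e^k attempts. The sketch below checks this by simulation. It uses Normal draws, but the distribution does not matter, and the function names are mine.

```python
import random


def expected_records(n):
    """Harmonic number H_n: expected number of records in n i.i.d. attempts."""
    return sum(1.0 / k for k in range(1, n + 1))


def count_records(n, rng):
    """Count how many of n independent Normal 'jumps' set a new record."""
    best = float("-inf")
    records = 0
    for _ in range(n):
        jump = rng.gauss(0, 1)
        if jump > best:
            best = jump
            records += 1
    return records


def average_records(n, trials, seed=0):
    """Average record count over many simulated histories."""
    rng = random.Random(seed)
    return sum(count_records(n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(expected_records(n), 3), round(average_records(n, 2000), 3))
```

Multiplying the number of attempts by ten adds only about 2.3 records on average, so no constant K of the kind asked about above can exist in this stationary model.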

Tennis commentators will typically say about a tall player like John Isner or Marin Cilic that their height is a disadvantage because it makes them slow around the court.  Tall players don’t move as well and they are not as speedy.

On the other hand, every year in my daughter’s soccer league the fastest and most skilled player is also among the tallest.  And most NBA players of Isner’s height have no trouble keeping up with the rest of the league. Indeed many are faster and more agile than Isner.  LeBron James is 6’8″.

It is not true that being tall makes you slow. Agility scales just fine with height and it’s a reasonable assumption that agility and height are independently distributed in the population. Nevertheless it is true in practice that all of the tallest tennis players on the tour are slower around the court.

But all of these facts are easily reconcilable.  In the tennis production function, speed and height are substitutes.  If you are tall you have an advantage in serving and this can compensate for lower than average speed if you are unlucky enough to have gotten a bad draw on that dimension.  So if we rank players in terms of some overall measure of effectiveness and plot the (height, speed) combinations that produce a fixed level of effectiveness, those indifference curves slope downward.

When you are selecting the best players from a large population, the top players will be clustered around the indifference curve corresponding to “ridiculously good.” And so when you plot the (height, speed) bundles they represent, you will have something resembling a downward sloping curve.  The taller ones will be slower than the average ridiculously good tennis player.

On the other hand, when you are drawing from the pool of Greater Winnetka Second Graders with the only screening being “do their parents cherish the hour per week of peace and quiet at home while some other parent chases them around?” you will plot an amorphous cloud.  The best player will be the one farthest to the northeast, i.e. tallest and fastest.
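The contrast between the two pools is easy to see in a toy simulation. This is a sketch under strong assumptions of my own: height and speed are independent standard Normal draws, and "effectiveness" is simply height plus speed, so the two are perfect substitutes. The real tennis production function is surely not that simple. The whole population plays the role of the second graders, and the top few by effectiveness play the role of the tour.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(pop_size=100_000, top=200, seed=0):
    """Return (correlation in the whole population, correlation among the elite)."""
    rng = random.Random(seed)
    # Assumption: height and speed are independent in the population.
    people = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(pop_size)]
    # Hypothetical effectiveness measure: height + speed.
    elite = sorted(people, key=lambda p: p[0] + p[1], reverse=True)[:top]
    corr_all = pearson(*zip(*people))
    corr_elite = pearson(*zip(*elite))
    return corr_all, corr_elite


if __name__ == "__main__":
    corr_all, corr_elite = simulate()
    print("whole population:", round(corr_all, 3))
    print("top players only:", round(corr_elite, 3))
```

In the full population the correlation should come out near zero (the amorphous cloud), while among the selected players it should be strongly negative, even though nothing in the model makes tall people slow.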

Finally, when the sport in question is one in which you are utterly ineffective unless you are within 6 inches of the statistical upper bound in height, then a) within that range height differences matter much less in terms of effectiveness, so that height is less of a substitute for speed at the margin, and b) the height distribution is so compressed that tradeoffs (which surely are there) are less stark.  Muggsy Bogues notwithstanding.
