How do you assess whether a probabilistic forecast was successful?  Put aside the question of sequential forecasts updated over time.  That’s a puzzle in itself, but on Monday night each forecaster will have its final probability estimate, and there remains the question of deciding, on Wednesday morning, which one was “right.”

Give no credence to pronouncements by, say, 538 that they correctly forecast X out of 50 states.  According to 538’s own model, these are not independent events.  Indeed, the distinctive feature of 538’s election model is that the statewide errors are highly correlated.  That’s why they are putting Trump’s chances at 35% as of today, when a forecast that assumed independence would put that probability closer to 1%, given the large number of states where Clinton has a significant (marginal) probability of winning.
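To see how much work the correlation does, here is a minimal simulation sketch. Everything in it is a made-up assumption for illustration, not 538’s model: 51 equally weighted states, the same favorite lead and the same error size in every state, and a parameter `rho` for the share of the error variance that is common to all states. The underdog’s marginal chance in each state is the same for every `rho`; only the correlation changes.

```python
import random

def underdog_win_probability(rho, n_states=51, lead=1.0, sd=2.0,
                             n_sims=4000, seed=0):
    """Estimate the underdog's chance of carrying a majority of states.

    Hypothetical setup: every state has the same favorite lead and the same
    total error sd, so the underdog's marginal chance in each state does not
    depend on rho. rho is the share of error variance common to all states.
    """
    rng = random.Random(seed)
    shared_sd = sd * rho ** 0.5          # error component common to all states
    local_sd = sd * (1.0 - rho) ** 0.5   # state-specific error component
    wins = 0
    for _ in range(n_sims):
        shared = rng.gauss(0.0, shared_sd)
        states_won = sum(
            1 for _ in range(n_states)
            if shared + rng.gauss(0.0, local_sd) > lead
        )
        if states_won > n_states // 2:
            wins += 1
    return wins / n_sims

independent = underdog_win_probability(rho=0.0)
correlated = underdog_win_probability(rho=0.8)
print(f"independent errors: {independent:.3f}")
print(f"correlated errors:  {correlated:.3f}")
```

With independent errors the underdog has to get lucky in many states separately, which almost never happens; with a large shared error, one bad polling miss flips many states at once.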

So for 538 especially (but really for all the forecasters that assume even moderate correlation), Tuesday’s election is one data point.  If I tell you the chance of a coin coming up Armageddon Tails is 35%, and you toss it once and it comes up Tails, you certainly have not proven me right.

The best we can do is set up a horserace among the many forecasters.  The question is how to decide which forecaster was “more right” based on Tuesday’s outcome.  Of course, if Trump wins then 538 was more right than every other forecaster, but we do have more to go on than just the binary outcome.

Each forecaster’s model defines a probability distribution over electoral maps.  Indeed, they produce their estimates by simulating their models to generate that distribution and then counting the fraction of maps that come out with an Electoral College win for Trump.  The outcome on Tuesday will be a map.  And we can ask, based on that map, who was more right.
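The counting step can be sketched in a few lines. The three states, their electoral votes, and the four simulated maps below are invented purely for illustration; a real forecaster would feed in many thousands of maps drawn from its own model.

```python
from typing import Dict, List

# Hypothetical toy electorate: three made-up states and their electoral votes.
ELECTORAL_VOTES: Dict[str, int] = {"A": 10, "B": 6, "C": 3}

def electoral_win_fraction(maps: List[Dict[str, str]], candidate: str) -> float:
    """Fraction of simulated maps in which `candidate` wins a majority of electoral votes."""
    total = sum(ELECTORAL_VOTES.values())
    wins = 0
    for m in maps:
        votes = sum(ev for state, ev in ELECTORAL_VOTES.items() if m[state] == candidate)
        if 2 * votes > total:  # strict majority of all electoral votes
            wins += 1
    return wins / len(maps)

# Each simulated map records the winner of every state (made-up draws).
simulated = [
    {"A": "Clinton", "B": "Clinton", "C": "Trump"},
    {"A": "Clinton", "B": "Trump", "C": "Trump"},
    {"A": "Trump", "B": "Trump", "C": "Trump"},
    {"A": "Trump", "B": "Clinton", "C": "Clinton"},
]
trump_fraction = electoral_win_fraction(simulated, "Trump")
print(trump_fraction)
```

The headline probability throws away most of the information in the simulated maps, which is why the full distribution is the more useful object for judging a forecast.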

What yardstick should be used?  I propose maximum likelihood.  Each forecaster on Monday night should publish their final forecasted distribution of maps.  Then on Wednesday morning we ask which forecaster assigned the highest probability to the realized map.
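Here is a rough sketch of how that comparison might be run, assuming each forecaster publishes its simulated maps. The forecaster names "X" and "Y", the three-state maps, and the function names are all hypothetical. One caveat: with 51 contests there are 2^51 possible maps, so the empirical frequency of any single map in a finite batch of simulations will often be zero; in practice a forecaster would probably need to publish the model itself so the probability of the realized map can be computed directly.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Map = Tuple[str, ...]  # winner in each state, listed in a fixed state order

def log_likelihood(simulated_maps: List[Map], realized: Map) -> float:
    """Log of the empirical probability that the simulations assign to `realized`."""
    counts = Counter(simulated_maps)
    p = counts[realized] / len(simulated_maps)
    return math.log(p) if p > 0 else float("-inf")

def most_right(forecasts: Dict[str, List[Map]], realized: Map) -> str:
    """Name of the forecaster whose simulations gave the realized map the highest probability."""
    return max(forecasts, key=lambda name: log_likelihood(forecasts[name], realized))

# Made-up published simulations from two hypothetical forecasters.
forecasts = {
    "X": [("D", "D", "D"), ("D", "R", "D"), ("R", "R", "D"), ("D", "D", "R")],
    "Y": [("D", "R", "D"), ("D", "R", "D"), ("R", "R", "R"), ("D", "D", "D")],
}
realized_map = ("D", "R", "D")
winner = most_right(forecasts, realized_map)
print(winner)
```

Working in logs does not change the ranking; it just keeps the numbers manageable when the probabilities involved are very small.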

That’s not the only way to do it, of course, but (if you are listening, 538, etc.) whatever criterion forecasters are going to use to decide whether their model was a success, they should announce it in advance.
