How shocking was the 2022 World Cup? Was Japan beating Germany all that surprising? Joseph Buchdahl crunches the data to determine whether major upsets in a massive tournament are as surprising as we may think.
According to the sports metadata company Gracenote, the 2022 World Cup in Qatar was one for the underdogs, with 15 matches ending in an upset (as they had defined them) – the highest percentage for 64 years.
Qatar was the most shocking World Cup in 64 years
However, just how surprising is such a number, and how do we actually go about deciding what constitutes an upset in the first place?
Yes, we probably all believe that Japan beating Spain and Germany, and Saudi Arabia beating Argentina, were what we’d regard as upsets, but this premise hinges on how valid our beliefs are that Spain, Germany, and Argentina should have won those games.
Intuitively we might think it’s obvious, but when it’s impossible to know true outcome probabilities perfectly, we should always be alert to the potential for error.
When a football result seems surprising, is it because the underdog – rightly regarded as such by an accurate prediction model – got lucky, or because the underdog was actually not an underdog and the prediction model was wrong?
Philosophically, this is an interesting puzzle and one that’s quite hard to unravel. Here, we are dealing with two types of uncertainty.
Uncertainty or error in the validity of the prediction model is called epistemic uncertainty, and in theory it can be reduced with better modelling.
The other kind is intrinsic and is known as aleatory uncertainty, more commonly called chance, luck or randomness.
This uncertainty is irreducible. Separating epistemic and aleatory uncertainty can be tricky. In this pair of articles for Pinnacle, I hope to make a little contribution to that. In this first article, I will try to investigate the surprise factor of the World Cup as a whole.
In the second article, Using the World Cup to test for efficiency, I’ll have more to say on what the findings can tell us about the accuracy (or efficiency) of a bookmaker’s betting odds, and the validity of their prediction model that builds them.
64-match multiple probabilities
If we estimate the probabilities of each of the three possible 90-minute results for each World Cup match, we can construct a 64-match multiple probability for every possible combination of outcomes. But which outcome probabilities should we use?
Most aspiring bettors no doubt have their own methods of calculating them, but in the interests of saving time, and knowing them to be some of the best probabilities available, I will use those implied by Pinnacle’s closing match betting odds.
I have discussed at length on several occasions why Pinnacle’s closing odds are some of the best available for estimating true probability outcomes.
Pinnacle obviously has a margin added to those odds, so I do have to remove that first. I have my own calculator for doing that.
We can then use these multiple probabilities to attempt to answer the question: how collectively surprising were the results of the 2022 World Cup?
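To make the mechanics concrete, here is a minimal Python sketch of the two steps involved: stripping the margin from a set of 1X2 odds and multiplying the resulting probabilities into a multiple. The odds and probabilities below are invented placeholders rather than actual Pinnacle prices, and the margin is removed by simple proportional scaling, a simplification of the more refined approach a dedicated calculator might use.

```python
import numpy as np

# Hypothetical 1X2 closing odds for one match (home/draw/away).
# These are placeholder numbers, not actual Pinnacle prices.
odds = np.array([1.60, 4.10, 5.75])

# Raw implied probabilities sum to more than 1 because of the margin.
raw = 1 / odds
print(f"Overround: {raw.sum():.3f}")        # e.g. ~1.043

# Simplest margin removal: scale the probabilities so they sum to 1.
fair = raw / raw.sum()

# The 64-match multiple probability is the product of the fair probability
# of the outcome that actually occurred in each of the 64 matches.
# `observed` would hold one such probability per match (placeholders here).
observed = np.array([0.55, 0.31, 0.48, 0.62])
multiple_prob = observed.prod()
print(f"Multiple probability: {multiple_prob:.4f}")
```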
Narrative fallacy
It’s worth just taking a moment to realise that the probability of there being absolutely no upsets at all in the 64 World Cup matches is vanishingly tiny.
There's an 11% chance all the favourites win
Using Pinnacle’s closing odds, I have calculated this to be 6.5 x 10^-17, or roughly one in 15 million billion, for the results after 90 minutes.
Had it happened, it would have been one of the most astonishing events in the history of humanity.
And yet, I wonder whether many people other than statisticians would have even paid much attention to it, beyond perhaps remarking that it was evidence of a very boring World Cup.
Every other 64-match multiple – and there are a lot of them: three to the power of 64, or 3,433,683,820,292,512,484,657,849,089,281, to be precise – has a lower probability.
Every one of them will include upsets (if we define an upset as the favourite result not happening) and the smaller the multiple probability, the larger the number of upsets.
However, there’s only one way to have no upsets – all the favourite outcomes must happen. In contrast, there are many ways for upsets to occur. Individually, their 64-match multiple probability may be smaller, but collectively they are more likely to occur than none at all.
Consider a simple binomial example of 10 matches with two possible outcomes, where each favourite has an 80%-win probability and each underdog a 20% probability.
There’s about an 11% chance that all of the favourites win, but a 20% chance that three underdogs will win, and even a 9% chance of four underdogs winning.
Why so high? Individually, the probabilities are just 0.17% and 0.04% (for any particular set of three or four underdogs respectively), but there are so many ways for them to happen, 120 for three underdogs and 210 for four of them.
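For anyone who wants to verify those binomial figures, here is a short Python sketch; the 80% favourite probability and 10 matches are the same assumptions as in the example above.

```python
from math import comb

p_fav, n = 0.8, 10

def p_upsets(k: int) -> float:
    """Probability that exactly k underdogs win out of the n matches."""
    return comb(n, k) * (1 - p_fav) ** k * p_fav ** (n - k)

print(f"No upsets:    {p_upsets(0):.1%}")   # ~10.7%
print(f"Three upsets: {p_upsets(3):.1%}")   # ~20.1%
print(f"Four upsets:  {p_upsets(4):.1%}")   # ~8.8%
```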
This is all another way of saying upsets should be expected. Yet, too often, our brains are primed to create simple and sometimes flawed stories out of data to make sense of the world, stories that treat upsets as more unexpected than they really are.
When Japan doesn’t beat Spain and Germany, we don’t write the stories, but statistics tells us that surprising outcomes of one kind or another are effectively a certainty. This is an example of the narrative fallacy.
A Monte Carlo probability distribution
There’s only one way to see a 64-match multiple with a probability of 6.5 x 10^-17. The least probable multiple, with all the underdogs winning, comes in at 1.5 x 10^-51, and there’s only one way for that to happen as well. But how many ways could we see multiple probabilities of, say, 10^-25 or 10^-30?
Enumerating every possible combination exactly is computationally intractable. To make things a lot easier, it’s a good idea to build a Monte Carlo simulation instead.
By randomising the match outcomes according to the defined Pinnacle-implied probabilities, we can create a randomly generated 64-match multiple probability.
Repeat this a large number of times, count how often each multiple probability value occurs, and we can build the likelihood distribution. That is to say, we can define the range and likelihood of possible outcome histories of the 64 World Cup matches.
Handling tiny probability values, however, is intuitively rather difficult. We can apply a little transformation to make them much more cognitively manageable: calculate their logarithm.
The logarithm (base 10) of 0.001, for example, is -3; for 0.000001 it’s -6 and for 0.000000000001 it’s -12. In fact, I will use the natural logarithm (ln) for my purposes (base e), and I’ll drop the negative sign.
My Monte Carlo simulation contained 100,000 runs, for 100,000 values of the natural logarithm of every randomised 64-match multiple probability (negative sign removed).
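The following Python sketch shows the structure of such a simulation. The match probabilities are uniform placeholders here; in the real exercise they would be the margin-free probabilities implied by Pinnacle’s closing odds for the 64 matches.

```python
import numpy as np

rng = np.random.default_rng(7)

# Placeholder fair probabilities (home/draw/away) for each of the 64 matches.
# In practice, these come from Pinnacle's closing odds with the margin removed.
probs = np.full((64, 3), 1 / 3)

n_runs = 100_000
neg_log_multiples = np.empty(n_runs)

for i in range(n_runs):
    # Randomise one outcome per match according to its probabilities...
    picks = np.array([rng.choice(3, p=p) for p in probs])
    # ...then take the negative natural log of the 64-match multiple
    # probability (summing logs rather than multiplying tiny numbers).
    neg_log_multiples[i] = -np.log(probs[np.arange(64), picks]).sum()

# Bin the values to build the frequency (likelihood) distribution.
counts, bin_edges = np.histogram(neg_log_multiples, bins=50)
```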
Binning these values, I then plotted them on the following frequency (or likelihood) distribution chart.
The full range on the x-axis of this chart is 37.3 (for all 64 favourites winning) to 117.1 (for all underdogs winning), but as we know their likelihoods are impossibly small.
In fact, it’s only necessary to show the most likely outcomes to gain an appreciation of the range of possibilities. Looking at the chart, we can see that it’s very probable that a 64-match multiple will have a value on the x-axis of somewhere between 45 and 75.
These correspond to multiple probabilities of roughly 3 x 10^-20 and 3 x 10^-33 respectively.
Multiple probability decreases as we move right along the x-axis. The average, or most likely observed, multiple outcome has an x-axis figure of about 60, corresponding to a multiple probability of 7.5 x 10^-27.
Also illustrated on the chart, by means of the vertical black line, is the location of the actual World Cup multiple outcome that was observed. It has an x-axis value of 63.5 (and a multiple probability of 2.7 x 10^-28).
This is about 28 times smaller than the most likely multiple outcome.
That sounds like a lot, but the chart tells a different story. You can see that it’s not so far from the centre (the average) of the likelihood distribution. In fact, about 20% of the possible World Cup multiple probabilities were smaller than the one that actually happened.
Statistically, we would not call that surprising. For that, we’d want the vertical line to move to at least 70 on the x-axis, meaning fewer than 1% of possible multiples were less likely. That would correspond to a multiple probability of about 4 x 10^-31, or nearly 700 times smaller than the one that happened.
For that, we’d have needed to have seen something like Qatar beating the Netherlands, Poland beating France, and South Korea beating Brazil.
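Continuing the simulation sketch above, the two figures quoted here, the roughly 20% of simulated tournaments that were even less likely than the real one and the x-axis value of about 70 marking the 1% significance threshold, can be read directly off the simulated values (63.5 is the observed figure reported above).

```python
# -ln of the actual 2022 World Cup 64-match multiple probability.
observed = 63.5

# Fraction of simulated tournaments even less likely than the real one.
tail = np.mean(neg_log_multiples > observed)
print(f"{tail:.0%} of simulated outcomes were less likely")

# The x-axis value the observed figure would need to exceed before fewer
# than 1% of simulated multiples were less likely (around 70 in the article).
threshold = np.quantile(neg_log_multiples, 0.99)
print(f"1% threshold at approximately {threshold:.1f}")
```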
Was this a surprising World Cup?
From the data I’ve presented in this article, we are now in a position to answer my initial question.
No, it wasn’t that surprising. Yes, there were surprising upsets in individual games, but we now know that these are to be expected in tournaments with many matches. Indeed, it would be far more surprising if there hadn’t been any.
Philosophically, however, what does surprising actually mean? This very much hinges on what our initial expectations of match outcomes are.
Suppose as an extreme example, my prediction model made Wales the hot favourites to beat England, Ghana the hot favourites to beat Portugal, Australia the hot favourites to beat France, Costa Rica the hot favourites to beat Germany, and so on through the 64 games.
Arguably, I would then be very surprised by what unfolded. Is that because the underdogs, as my model made them, got lucky, or is it because my prediction model was wrong?
In this case, it’s rather obvious, but more usually the distinction between the two is much more subtle.
Pinnacle’s estimates for the match probabilities did not capture perfectly what unfolded. Is that because of bad luck or because of model error?
Now it’s much harder to tell. However, perhaps we can say that because there is not a statistically significant difference between Pinnacle’s expectation and real-world events, we have good grounds to argue that Pinnacle’s prediction model is not such a bad one.
To put it another way, the World Cup (according to Pinnacle’s view) was, statistically speaking, not a particularly surprising one. It was less likely than the most likely World Cup outcome (which would have seen perhaps two or three fewer match upsets than actually occurred), but not hugely so.
Had there been a statistically significant difference, it then becomes easier to argue against Pinnacle’s view.
Hence, we could formulate a rule: the bigger the difference between expectation and reality, the greater the statistical likelihood that our expectation model is wrong. How does Pinnacle’s World Cup match prediction model compare to other bookmakers? This is the subject of part two of this series.
Sign up to Pinnacle for great soccer odds across a wide range of markets. Be sure to check out other insightful articles from Joseph Buchdahl at Betting Resources.