If you're interested in applying a model to goal scoring in soccer, read our updated article on how to calculate expected goals.
The underdeveloped state of soccer analysis is down partly to the lack of accessible data and partly to the difficulty of describing a fluid sport in which set plays, so common in sports such as baseball, are largely absent.
The gradual release of data has begun to address the former problem, while much of the new analysis has concentrated on the few set plays that do exist in soccer.
Penalty kicks are the obvious example of a rigorously enforced soccer set play, and many will be aware that a top-flight player facing a keeper of similar standard has a long-term conversion rate of around 78% from a penalty kick.
The logical extension to knowing that a penalty has a goal expectation of 0.78 of a goal is to attempt to apply the same analysis to all shots and headers.
A player may be applauded for a speculative shot from 40 yards. However, it may help manage the expectations of the fans and temper the optimism of the player if both knew that such attempts result in a goal just once in around 100 attempts.
By attaching probabilistic outcomes to key events in a sport, we can begin to develop useful models to better describe previous games and possibly predict the likely outcome of future events in order to build effective betting models.
We’ll use the recent Arsenal vs. Crystal Palace game to run through the process.
Models require data, and although shot locations for various leagues can be found on a number of sites, the data isn’t available in a spreadsheet-friendly format. It is down to the bettor to source their own shot location data, for example by recording the perpendicular distance from the goal line combined with the horizontal distance from the centre of the goal.
A viable alternative to using individual shooting positions is to divide the attacking third into shooting zones and bundle together the goal attempts within each zone.
The closer to the goal a shot is taken the more likely it is to produce a goal. Similarly, shots are generally more potent than headers from the same pitch location.
Dividing the pitch into shooting zones and collecting sufficient data to calculate representative conversion rates for both shots and headers within each zone is one route to calculating goal expectations.
For example, shots taken from inside the central portion of the six-yard box are converted at rates of nearly 50%, compared to less than 10% for shots from just outside the penalty area.
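The zone approach amounts to simple bookkeeping, and a minimal sketch of it might look like the following. The zone names and the handful of attempt records are invented purely for illustration, as is the helper name conversion_rates; a real table would need far more attempts per zone before the rates became representative.

```python
from collections import defaultdict

# Hypothetical attempt records: (zone, attempt_type, scored).
# These rows are made up to show the bookkeeping, not real match data.
attempts = [
    ("six_yard_central", "shot", True),
    ("six_yard_central", "shot", False),
    ("six_yard_central", "header", False),
    ("edge_of_area", "shot", False),
    ("edge_of_area", "shot", False),
    ("edge_of_area", "shot", True),
    ("edge_of_area", "shot", False),
]


def conversion_rates(records):
    """Return {(zone, attempt_type): (goals, attempts, rate)}."""
    goals = defaultdict(int)
    totals = defaultdict(int)
    for zone, kind, scored in records:
        totals[(zone, kind)] += 1
        goals[(zone, kind)] += int(scored)
    return {key: (goals[key], totals[key], goals[key] / totals[key])
            for key in totals}


rates = conversion_rates(attempts)
for (zone, kind), (g, n, rate) in sorted(rates.items()):
    print(f"{zone:18s} {kind:7s} {g}/{n} = {rate:.2f}")
```

Keeping shots and headers as separate keys reflects the point that the two convert at different rates from the same location.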
A more formal approach uses logistic regression, which is particularly useful when estimating the likelihood of events either occurring or not, such as whether or not shots result in a goal.
The coordinates of each shot location serve as the independent variables and the actual outcome of each historical attempt on goal as the dependent variable. The regression produces an equation that estimates the probability of a goal being scored from an attempt at any pitch location.
This method can be further extended to calculate the probability of attempts being either blocked, off target or on target, simply by changing the dependent variable from goals to, for example, blocked shots.
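As a rough illustration of the regression idea, the sketch below fits a logistic model to shot coordinates using plain gradient ascent. Everything in it is an assumption made for the example: the training data is synthetic, generated from invented coefficients (TRUE_W), and the function names are hypothetical. In practice a statistics package fitted to real historical attempts would do this job; the point here is only the shape of the calculation.

```python
import math
import random

SCALE = 10.0  # distances are divided by 10 yards to keep the gradient steps stable


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def goal_probability(w, depth, width):
    """Estimated chance of a goal from `depth` yards out and `width` yards off centre."""
    return sigmoid(w[0] + w[1] * depth / SCALE + w[2] * width / SCALE)


def fit_logistic(locations, scored, lr=0.3, epochs=500):
    """Fit [intercept, depth weight, width weight] by gradient ascent on the log-likelihood."""
    w = [0.0, 0.0, 0.0]
    n = len(locations)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (depth, width), y in zip(locations, scored):
            x = (1.0, depth / SCALE, width / SCALE)
            err = y - sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i in range(3):
                grad[i] += err * x[i]
        w = [wi + lr * gi / n for wi, gi in zip(w, grad)]
    return w


# Synthetic attempts: coefficients and locations are invented, not measured.
rng = random.Random(0)
TRUE_W = [0.5, -1.5, -0.8]
locations, scored = [], []
for _ in range(800):
    depth = rng.uniform(0, 40)   # perpendicular distance from the goal line
    width = rng.uniform(0, 30)   # horizontal distance from the centre of the goal
    locations.append((depth, width))
    scored.append(1 if rng.random() < goal_probability(TRUE_W, depth, width) else 0)

w = fit_logistic(locations, scored)
for depth in (5, 12, 30):
    print(f"{depth:2d} yards, central: {goal_probability(w, depth, 0):.3f}")
```

Swapping the dependent variable, for instance recording 1 for a blocked attempt instead of a goal, would give a blocked-shot model from the same code.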
The limiting factor in either approach is the need for relatively large amounts of initial detailed data to create a robust model.
Knowledge of the goal expectation for individual attempts enables matches to be replayed by simulating the possible outcomes of each attempt taken or faced during an actual game.
For example, the generic success rate for penalties is 78%. By generating a random number between 0 and 1, the spot kick can be considered successful if the number falls between 0 and 0.78 and unsuccessful if it falls between 0.78 and 1.
The larger the value of the goal expectation, the greater the chance that the random number will fall within the successful range and a virtual goal will be scored. This can be extended to include every actual attempt in a match, with the individual successes summed together to simulate a goal total for each team, resulting in a virtual score line.
Repeating this process, typically for 10,000 “games”, and counting the score lines that lead to a home win, an away win or a draw adds context to the actual result. It helps answer the question: did the actual winners create and prevent sufficient shots, once location is accounted for, to fully deserve their win?
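The replay procedure can be sketched in a few lines. The goal expectations below are hypothetical apart from the 0.78 penalty figure, and the function names are made up for the example. Each attempt is treated as independent of the others, which is a simplifying assumption.

```python
import random


def simulate_goals(goal_expectations, rng):
    """Count the attempts whose random draw falls below their goal expectation."""
    return sum(rng.random() < p for p in goal_expectations)


def simulate_matches(home_xg, away_xg, trials=10_000, seed=1):
    """Replay a match `trials` times and return the share of each outcome."""
    rng = random.Random(seed)
    tally = {"home": 0, "draw": 0, "away": 0}
    for _ in range(trials):
        home_goals = simulate_goals(home_xg, rng)
        away_goals = simulate_goals(away_xg, rng)
        if home_goals > away_goals:
            tally["home"] += 1
        elif home_goals < away_goals:
            tally["away"] += 1
        else:
            tally["draw"] += 1
    return {outcome: count / trials for outcome, count in tally.items()}


# Hypothetical match: the home side has a penalty (0.78) plus two invented
# chances, and the away side has two invented chances.
home = [0.78, 0.10, 0.05]
away = [0.30, 0.08]
shares = simulate_matches(home, away)
print(shares)
```

The shares printed at the end are the simulated frequencies of a home win, a draw and an away win for that set of chances.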
Example: Arsenal 2 Crystal Palace 1
Arsenal opened their season at home to Crystal Palace, outshooting Palace 14-4 and winning the game 2-1 with a last-minute Aaron Ramsey goal. Palace’s four efforts are shown below on the screen grab from Statszone.
Their best opportunity came from a 34th-minute corner kick which found the head of Brede Hangeland. His header (number 2 on the screen grab) originated inside the six-yard box, about 4 yards out from the goal line and close to the near post, about 4 yards to the side of the centre of the goal.
A logistic regression analysis of historical headers suggests that this type of chance is typically converted nearly 14% of the time, giving it a goal expectation of 0.14.
Palace’s further three efforts, each a shot, were less dangerous. Jason Puncheon had two shots with a goal expectation of 0.10 (shot 4) and 0.01 (shot 3), respectively. Marouane Chamakh’s long-range effort (shot 1) was the least likely to score, leading to a goal on 0.3% of occasions for a goal expectation of 0.003.
Added together, Palace’s four attempts had a cumulative goal expectation of about 0.25 of a goal. As a general guide, chances of this quality would produce, on average, a goal every four matches.
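The arithmetic behind those figures is short enough to check directly. The sketch below uses the four goal expectations quoted above; the chance of Palace failing to score at all is an extra figure that assumes the four attempts are independent of one another.

```python
# Goal expectations for Palace's four attempts, as quoted in the text.
palace_xg = {
    "Chamakh shot (1)": 0.003,
    "Hangeland header (2)": 0.14,
    "Puncheon shot (3)": 0.01,
    "Puncheon shot (4)": 0.10,
}

total_xg = sum(palace_xg.values())

# Assumption: attempts are independent, so the chance of no goal at all
# is the product of each attempt's chance of missing.
p_scoreless = 1.0
for p in palace_xg.values():
    p_scoreless *= 1.0 - p

print(f"cumulative goal expectation: {total_xg:.3f}")
print(f"matches per goal, on average: {1 / total_xg:.1f}")
print(f"chance of no goal from these four attempts: {p_scoreless:.3f}")
```

The sum comes to 0.253, in line with the rounded 0.25 above, and under the independence assumption Palace fail to score from these chances roughly three times in four.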
These probabilities can be used to simulate the most likely outcomes from the four Crystal Palace goal attempts.
The regression-derived conversion rate for each of Palace’s four chances is shown in column C. A random number between 0 and 1 is generated in column D using the =RAND() spreadsheet function, and if that value falls below the value in column C, a goal is considered scored.
In this single iteration, Hangeland’s header was the only attempt to produce a goal.
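For readers who prefer code to a spreadsheet, the column C and column D step might be mirrored as below. This is a sketch rather than a copy of the original spreadsheet: the seed is arbitrary, so a given run will not necessarily reproduce the iteration described above, and the helper name one_iteration is made up for the example.

```python
import random
from collections import Counter

# Column C: conversion rates for Palace's shots 1 to 4, as quoted in the text.
conversion = [0.003, 0.14, 0.01, 0.10]

rng = random.Random(2014)  # arbitrary seed so the run is repeatable


def one_iteration(rates, rng):
    """Draw one random number per attempt (column D) and mark the goals."""
    draws = [rng.random() for _ in rates]
    goals = [draw < rate for draw, rate in zip(draws, rates)]
    return draws, goals


draws, goals = one_iteration(conversion, rng)
for number, (rate, draw, goal) in enumerate(zip(conversion, draws, goals), start=1):
    print(f"shot {number}: C={rate:.3f}  D={draw:.3f}  goal={goal}")

# Repeating the iteration gives the spread of Palace goal totals.
totals = Counter(sum(one_iteration(conversion, rng)[1]) for _ in range(10_000))
for goals_scored in sorted(totals):
    print(f"{goals_scored} goals: {totals[goals_scored]} of 10,000 trials")
```

Running the same loop over a second list of conversion rates for the opposing side and comparing the two totals in each trial would give a simulated score line.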
This is repeated for both Palace’s four attempts and Arsenal’s 14 to generate 10,000 match results based around the quality and quantity of the chances created on the day by each side.
|Arsenal Score||Palace Score||Frequency in 10,000 Trials|
The table above shows the seven most common score lines from these shot-based simulations. Palace was most likely to remain scoreless from their four attempts, while Arsenal, despite superior shot numbers, also shot frequently from distance or from wide positions.
In total, 73% of the 10,000 simulations resulted in a score that gave Arsenal a win, 21% were draws and just 6% wins for Crystal Palace. So an Arsenal win was consistent with the frequency and quality of attempts made by each team.
This approach extends the use of shot data to try to identify sides that may have been either fortunate or unfortunate in their match-day results.
A side can only create or prevent chances consistent with their ability, but they are less in control of when those chances turn into goals. They cannot guarantee that goals will arrive when they are most needed.
Therefore a team may create excellent shooting chances, but that may not be reflected in similarly excellent results, especially over the short term. Whilst an instant change of luck from bad to good shouldn’t be expected, extreme outcomes are often followed by less extreme ones in the future.
In short, this is a tool for identifying teams that are overrated but lucky, or underrated and unlucky, and that may see a more usual connection between shots and results in the future. That is valuable for any bettor looking to find an edge over the market.