How machine learning can be applied to professional soccer

CONTRIBUTED BY JANNES GLAS VIA UNSPLASH

DATA AND soccer: one is analytical, the other passionate. There is no obvious connection between statistical data science and the most popular sport in the world. However, the metric “expected goals,” or “xG,” ties the two together by using a common statistical technique to model the likelihood of a shot resulting in a goal. When soccer analyst Sam Green wrote about the “expected goals” of several Premier League (EPL)[1] goal scorers in his 2012 performance analysis, he introduced the xG metric as we know it today[2].

PHOTOGRAPHED BY SAMANTHA BENING

 

What is xG?

   xG is a metric that quantifies the probability of a goal-scoring chance resulting in a goal and can take on values from 0 to 1: 0 meaning the shot has absolutely no chance of scoring, and 1 meaning the shot is guaranteed to score. For example, we can expect to see an xG value near 1 for a shot taken close to an empty goal, but an xG value near 0 for a long-distance shot. xG can be applied on two levels: an individual player or an entire team. 

   Importantly, xG is a purely descriptive statistic. It can summarize a single game or a player’s season, but it does not by itself reveal anything new about how soccer works. While we may be able to quantify the phenomenon that a tight-angled shot rarely results in a goal, this is not a new insight to any soccer professional. 

   What xG does offer is a better picture of a player’s or team’s performance. Additionally, xG values can help evaluate a game’s run of play[3]. Historically, the number of shots on goal[4] has been used as a proxy for which team was able to create more attacking opportunities. Now, xG provides a more accurate statistic by considering not only the number of shots, but also their quality. 

   Understanding exactly what an xG model is and how it is applied helps avoid common misconceptions that many fans, and even pundits, have when interpreting xG values. First, “expected” goals do not literally refer to the goals we expect to be scored in a game; this would be nonsensical, as the xG value for a given game is normally a decimal, and we will never actually see 2.5 goals scored. The “expected” refers to the mathematical concept of the “expected value”: the probability-weighted average of an outcome, in this case the average number of goals that chances of this quality would produce if they were repeated many times. For example, we do not expect a single coin toss to land half heads and half tails, but over many tosses we would expect about half to land on each face. Secondly, xG can quantify the common “they deserved to win” or “they were robbed” feelings about games; even if a team loses a match, they may have created the higher xG. However, a team having a higher xG in a match does not always imply that they deserved to win. For example, if a team scores early in the match, they do not necessarily “need” to create more chances to win[5].
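   The “expected value” reading of xG can be made concrete with a small sketch. The shot probabilities below are invented, and the shots are assumed to be independent of one another, which real chances in a match are not guaranteed to be. A match total is simply the sum of the individual shot probabilities, while the chance of any particular whole-number scoreline follows from combining those probabilities.

```python
# Illustrative only: the shot probabilities are made up, and shots are
# assumed to be independent of each other.

def match_xg(shot_probabilities):
    """Expected goals for a set of shots: the sum of each shot's scoring probability."""
    return sum(shot_probabilities)


def goal_distribution(shot_probabilities):
    """Probability of scoring exactly k goals, for k = 0 .. number of shots."""
    dist = [1.0]  # before any shot is taken: zero goals with certainty
    for p in shot_probabilities:
        new_dist = [0.0] * (len(dist) + 1)
        for goals, prob in enumerate(dist):
            new_dist[goals] += prob * (1 - p)  # this shot is missed
            new_dist[goals + 1] += prob * p    # this shot is scored
        dist = new_dist
    return dist


shots = [0.5, 0.5, 0.5, 0.5, 0.5]  # five hypothetical coin-flip chances
print("match xG:", match_xg(shots))  # 2.5
for goals, prob in enumerate(goal_distribution(shots)):
    print(f"P({goals} goals) = {prob:.3f}")
```

   The total of 2.5 never appears as an actual scoreline; it is the average of the whole-number outcomes, weighted by how likely each one is.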

 

An xG model: Calculating xG

   An xG model is a mathematical function that estimates the probability of a shot scoring based on patterns found in historical data. This historical data is called “training data,” which is necessary when developing any machine learning model. Data companies like Stats Perform, Opta Pro, or Wyscout can provide large amounts of data from previous seasons of various professional leagues to be used in training an xG model. The typical xG model uses logistic regression that is trained, or optimized, on any number of selected input variables to predict the likelihood, from 0 to 1, that a shot scores. 

   Logistic regression is a statistical analysis method that uses a mathematical function to quantify the relationship between independent (input) variables and a dependent (output) variable. An xG model has many input variables: commonly the distance from the goal, the angle of the shot, the goalkeeper’s stance and location, the space available to the shooter, the body part the shot was taken with, and more. The logistic regression function models the relationship between these variables and the binary output variable: whether a goal was scored or not. 
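   As a rough illustration of the idea, the sketch below fits a logistic regression by plain gradient descent using only the Python standard library. Everything specific in it is an assumption made for the example: the ten training shots are invented, only two input variables are used (distance in tens of metres and the angle the goal subtends in radians), and the learning rate and number of passes are arbitrary. A production xG model would be trained on many thousands of shots with far more variables, most likely through a statistics library rather than hand-written code.

```python
import math

# Invented training shots: (distance in tens of metres, goal angle in radians).
SHOTS = [
    (0.5, 1.2), (0.8, 0.9), (1.0, 0.7), (1.1, 0.6), (1.2, 0.8),
    (1.6, 0.4), (2.0, 0.3), (2.9, 0.25), (0.6, 0.3), (3.5, 0.2),
]
# Invented outcomes for the shots above: 1 = goal, 0 = no goal.
GOALS = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]


def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_xg(weights, bias, features):
    """xG of one shot: the logistic function of a weighted sum of its features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(shots, goals, learning_rate=0.1, epochs=5000):
    """Fit weights and bias by gradient descent on the average log loss."""
    weights = [0.0] * len(shots[0])
    bias = 0.0
    n = len(shots)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, goal in zip(shots, goals):
            error = predict_xg(weights, bias, features) - goal
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


weights, bias = train(SHOTS, GOALS)
close_xg = predict_xg(weights, bias, (0.5, 1.2))  # about 5 m out, wide angle
far_xg = predict_xg(weights, bias, (2.5, 0.2))    # about 25 m out, narrow angle
print(f"close-range shot xG: {close_xg:.2f}")
print(f"long-range shot xG:  {far_xg:.2f}")
```

   With this toy data, the close-range shot receives a higher xG than the long-range one, which is the pattern the model is meant to pick up from historical shots.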

   Another essential step in machine learning workflows is the testing phase, where the trained xG model is evaluated on data that was not in the training set to measure its performance and ensure the quality of its predictions. For example, if we input a shot taken 5 m in front of an empty goal and receive a predicted xG value of 0.05, we can assume that something is wrong with our model, as that value is far too low for that shot’s circumstances. 
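   One possible way to run such a test is sketched below. The `toy_xg_model` function is a stand-in for a trained model with invented coefficients, and the held-out shots are made up as well. The Brier score and log loss used here are two standard ways of scoring probability predictions (lower is better); the article does not say which measure any particular xG provider uses. The model is compared against a naive baseline that predicts the same average scoring rate for every shot.

```python
import math


def toy_xg_model(distance_m, angle_rad):
    """Stand-in for a trained model; the coefficients are invented for illustration."""
    z = 0.5 - 0.15 * distance_m + 1.5 * angle_rad
    return 1.0 / (1.0 + math.exp(-z))


def brier_score(predictions, outcomes):
    """Mean squared difference between predicted probability and actual outcome."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(outcomes)


def log_loss(predictions, outcomes):
    """Average negative log-likelihood; predictions must lie strictly between 0 and 1."""
    total = 0.0
    for p, y in zip(predictions, outcomes):
        total += math.log(p) if y == 1 else math.log(1 - p)
    return -total / len(outcomes)


# Invented held-out shots the model has never seen:
# ((distance in metres, goal angle in radians), 1 if scored else 0)
held_out = [
    ((6.0, 1.1), 1), ((11.0, 0.6), 0), ((18.0, 0.4), 0),
    ((8.0, 0.9), 1), ((25.0, 0.3), 0), ((14.0, 0.5), 1),
]

predictions = [toy_xg_model(d, a) for (d, a), _ in held_out]
outcomes = [goal for _, goal in held_out]

# Naive baseline: predict the overall scoring rate for every shot.
base_rate = sum(outcomes) / len(outcomes)
baseline = [base_rate] * len(outcomes)

print("Brier score, model:   ", round(brier_score(predictions, outcomes), 3))
print("Brier score, baseline:", round(brier_score(baseline, outcomes), 3))
print("Log loss, model:      ", round(log_loss(predictions, outcomes), 3))

# Sanity check in the spirit of the example above: a shot from 5 m with a
# wide view of goal should not receive a tiny xG. (This toy model has no
# goalkeeper input, so it cannot distinguish an empty goal.)
assert toy_xg_model(5.0, 1.26) > 0.5
```

   A model that scores worse than the baseline, or that fails an obvious sanity check, would need to be revisited before its xG values are trusted.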

   Once an xG model is trained and tested, it can be applied in three main ways: evaluating a single player’s performance over time, a single team’s performance over time, or a single game. For the first two, the xG of every chance a player or team creates is summed over the period in question, often a single season. This total can then be compared with the goals actually scored in the same period. A player or team scoring more goals than their xG may indicate clinical finishing. Conversely, an xG total higher than the goals actually scored usually points to poor finishing in front of goal. When analyzing a single game, post-match analysis commonly displays both teams’ total xG side by side. 
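   The comparison between summed xG and actual goals can be sketched in a few lines. The players and shot values below are invented, and the helper name `summarize` is only a label chosen for this example.

```python
from collections import defaultdict

# Invented shot log: (player, xG of the shot, 1 if it was scored else 0).
shots = [
    ("Player A", 0.40, 1),
    ("Player A", 0.10, 1),
    ("Player A", 0.25, 0),
    ("Player B", 0.60, 0),
    ("Player B", 0.30, 0),
    ("Player B", 0.35, 1),
]


def summarize(shot_log):
    """Sum xG and goals per player over the whole shot log."""
    totals = defaultdict(lambda: {"xg": 0.0, "goals": 0})
    for player, xg, goal in shot_log:
        totals[player]["xg"] += xg
        totals[player]["goals"] += goal
    return dict(totals)


for player, total in summarize(shots).items():
    difference = total["goals"] - total["xg"]
    print(f"{player}: goals={total['goals']}, xG={total['xg']:.2f}, "
          f"goals minus xG={difference:+.2f}")
```

   In this made-up log, Player A scores more than their xG and Player B scores less. Over so few shots, such a gap says little; the comparison becomes more meaningful over a full season of chances.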

 

xG in professional soccer

   Ever since the modern xG metric emerged in 2012, it has been gaining popularity with sports media outlets. In 2017, the BBC used xG in their EPL Match of the Day analysis without significant elaboration. In 2019, xG was added to the available statistics in the EPL fantasy league. Since 2020, xG has been added to popular soccer apps like “FotMob.” Not only have analysts and fans largely adopted the xG statistic, but club managers and owners are also increasingly turning to xG and similar data-based methods to improve their teams.

   Brentford Football Club arrived in the Championship, the second-tier professional league in England, in 2014[6]. As a smaller club, Brentford could not compete financially with top-flight clubs. Brentford’s owner, Matthew Benham, also owns a gambling consultancy, “SmartOdds,” which was a pioneer in the field of sports data analysis and xG. By applying SmartOdds’ xG data to Brentford’s transfer strategy, the club was able to consistently recruit promising players cheaply and later sell them for large profits. This profit has been continuously reinvested in the team’s development by buying talented but underdeveloped players[7]. After being promoted to the EPL at the end of the 2020-21 season, Brentford is 9th in the EPL as of April 7, 2023, a pleasant surprise to many spectators and fans.

 

*                 *                 *

 

   Although soccer is not the most obvious field for data science techniques, emerging metrics like xG demonstrate promising developments in data-based soccer analytics. As xG grows in popularity, soccer fans who learn how it is calculated can better understand what the metric does and does not tell them; xG is simply a statistic used to quantify the quality of an attacking opportunity. Who knows, perhaps the next generation’s Messi vs Ronaldo debate will consider some empirical factors as well. 

 

[1] Premier League: Highest professional league in England men’s soccer

[2] Stats Perform

[3] Run of Play: A term used in soccer to describe which team currently is creating more pressure in the attack; if Team A scores “against the run of play,” Team A scored a goal during a phase where Team B was creating more goal-scoring chances

[4] Shot on Goal: A shot that enters the goal or would have entered the goal if not blocked by another player

[5] The Analyst

[6] Brentford Football Club: A professional men’s soccer club in Brentford, West London, England currently competing in the EPL

[7] The xG Philosophy

Copyright © The Yonsei Annals. Unauthorized reproduction and redistribution prohibited.