The question I am asked most often is what skills you need to become a data scientist at a football club. For many, analyzing football is a dream job. If you enjoy both the game itself and statistics, nothing could be better than combining the two in a career. So which skills do you need to develop in order to find a position at a club?
To answer that question it is best to start by looking at the data that is available.
Ten years ago, the data used by clubs was limited to stats on goals, shots, corners, possession and so on. This data has limited value to coaching staff. While it might be worrying if your team is conceding too many shots or failing to gain possession, knowing this doesn’t provide coaching insights. The typical stats we see on TV do not, in themselves, help teams win games.
The second wave of football data came in the form of on-the-ball event data. The biggest supplier of this data, Opta, provides (x, y) coordinates of every pass of the ball, every defensive action and every shot. Opta is now one of several data suppliers, including StatsBomb, Wyscout and several betting companies, that collect this form of data.
Event data has proved useful to many clubs, particularly in scouting players. The best-known statistic in this context is expected goals, which measures the quality of chances players create. Other more advanced metrics include expected assists, passing models that assign a value to every pass based on how much it progresses the ball, and possession chains, which measure involvement in attacking sequences. These stats, along with more traditional measures, such as tallies of heading duels, interceptions and pass completion, are often presented in the form of a player radar. The radar shows how each player compares to others playing in the same league.
I know from first-hand experience that many club scouts love these diagrams. They give them, for better or worse, a way of confirming their beliefs about a player or finding new talent to look at in more detail.
To deal with and analyze event data you need to be able to program, preferably in Python or R, and you also need to learn about basic statistical modelling. Expected goals is a logistic regression model. Passing models use either logistic regression or basic neural networks. These are topics that come up in all good undergraduate statistics degrees and Master’s courses in data science, and are covered in online courses.
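As a rough illustration of the kind of model involved, the sketch below fits a logistic regression for expected goals in plain Python. Everything specific in it is an assumption made for the example: the 105 x 68 m pitch, the two features (distance and goal-mouth angle), the hand-written gradient descent and, above all, the shots, which are randomly generated rather than real event data. In practice you would more likely fit the model with a statistics library on a supplier's data.

```python
import math
import random

# Assumed pitch convention (not from the article): 105 x 68 m, attacking the
# goal centred at (105, 34); the goal mouth is 7.32 m wide.
GOAL_X, GOAL_Y, GOAL_WIDTH = 105.0, 34.0, 7.32


def shot_features(x, y):
    """Return (distance to goal centre in tens of metres, goal-mouth angle in radians)."""
    dx, dy = GOAL_X - x, y - GOAL_Y
    distance = math.hypot(dx, dy)
    # Angle subtended by the two posts, seen from the shot location.
    angle = math.atan2(GOAL_WIDTH * dx, dx * dx + dy * dy - (GOAL_WIDTH / 2) ** 2)
    return distance / 10.0, angle


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_xg(weights, x, y):
    """Probability of a goal for a shot taken at (x, y) under the fitted model."""
    d, a = shot_features(x, y)
    return sigmoid(weights[0] + weights[1] * d + weights[2] * a)


def fit_logistic(features, goals, lr=0.5, epochs=300):
    """Plain batch gradient descent on the logistic log-loss."""
    w = [0.0, 0.0, 0.0]  # intercept, distance weight, angle weight
    n = len(features)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (d, a), goal in zip(features, goals):
            err = sigmoid(w[0] + w[1] * d + w[2] * a) - goal
            grad[0] += err
            grad[1] += err * d
            grad[2] += err * a
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def synthetic_shots(n, seed=0):
    """Made-up shots for illustration only; real work would load event data."""
    rng = random.Random(seed)
    true_w = (-0.5, -1.2, 1.5)  # invented coefficients, not estimates
    feats, goals = [], []
    for _ in range(n):
        x, y = rng.uniform(75, 104), rng.uniform(14, 54)
        d, a = shot_features(x, y)
        p = sigmoid(true_w[0] + true_w[1] * d + true_w[2] * a)
        feats.append((d, a))
        goals.append(1 if rng.random() < p else 0)
    return feats, goals


features, goals = synthetic_shots(1500)
weights = fit_logistic(features, goals)
xg_penalty_spot = predict_xg(weights, 94.0, 34.0)  # 11 m out, central
xg_long_range = predict_xg(weights, 75.0, 34.0)    # 30 m out, central
print(f"xG near penalty spot: {xg_penalty_spot:.2f}, from 30 m: {xg_long_range:.2f}")
```

The fitted weights only reflect the invented data, so the printed numbers mean nothing in themselves. The point is the shape of the workflow: turn each shot's coordinates into features, fit a logistic model to goal/no-goal outcomes, then read off a probability for any new shot.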
While it is important to know about ‘on-the-ball’ data, the future of football analytics may well lie elsewhere. I caught up with Raúl Peláez Blanco, Head of Sports Technology Innovation Analysis at FC Barcelona, and asked him about the data the team currently uses.
He got straight to the point: “We do not rely on event data in player evaluation. We believe we need to understand how players act in different contexts. For example, if we are looking at a winger who dribbles very well in counterattacks, we ask how he also dribbles when the opposing defence is organized. Event data doesn’t tell us this.”
“Before we sign a player, we must examine how he solves problems in the contexts he will face at Barcelona,” Raúl told me. “It has become popular to categorize players using data without taking these contexts into account, but this distorts reality.”
It would be wrong, however, to conclude that Raúl is opposed to the use of data. On the contrary. For him, the question is about using the right data.
“The problem with event data is that they are decontextualized: we don’t know how the rest of the players are positioned when a pass is made, for example,” he told me. “Instead we use positional data of the 22 players and the ball. This helps us find tactical insights for the coach.”
The 22-player data, the third wave of data in football, is much richer than event data. As the name implies, it contains the co-ordinates on the pitch of all the players, as well as the position of the ball. This is essential for understanding context. During a typical match, Luis Suarez has the ball for less than 90 seconds of the 90 plus minutes of match time. What Suarez, or any other player, contributes to the play—pressing, runs to open up space and tactical positioning—can’t simply be measured in shot statistics.
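To make the idea of context concrete, here is a minimal sketch of two quantities that can be read off a single tracking frame but not off an event record: how close the nearest opponent is to the ball carrier, and how many opponents stand between him and the goal. The `Player` class, the coordinate convention (metres, attacking towards x = 105) and the two hand-made frames are all assumptions for the example, not any supplier's format.

```python
import math
from dataclasses import dataclass


@dataclass
class Player:
    team: str  # "home" or "away"
    x: float   # metres along the pitch (assumed 0 to 105)
    y: float   # metres across the pitch (assumed 0 to 68)


def nearest_opponent_distance(carrier, frame):
    """Distance in metres from the ball carrier to the closest opponent."""
    return min(
        math.hypot(p.x - carrier.x, p.y - carrier.y)
        for p in frame
        if p.team != carrier.team
    )


def opponents_goal_side(carrier, frame):
    """Count opponents between the carrier and the goal line at x = 105."""
    return sum(1 for p in frame if p.team != carrier.team and p.x > carrier.x)


# Two hand-made frames with only a few players each, to keep the example short.
carrier = Player("home", 60.0, 34.0)
counterattack = [
    carrier,
    Player("away", 50.0, 30.0),  # defenders caught upfield, behind the ball
    Player("away", 55.0, 40.0),
    Player("away", 85.0, 34.0),
]
organized = [
    carrier,
    Player("away", 63.0, 34.0),  # defenders set up goal-side of the ball
    Player("away", 70.0, 25.0),
    Player("away", 72.0, 42.0),
    Player("away", 85.0, 34.0),
]

for name, frame in [("counterattack", counterattack), ("organized", organized)]:
    pressure = nearest_opponent_distance(carrier, frame)
    blockers = opponents_goal_side(carrier, frame)
    print(f"{name}: nearest opponent {pressure:.1f} m, opponents goal-side {blockers}")
```

In an event feed, a dribble started from the same spot would look identical in both frames. With positions for every player, the same dribble can be tagged as happening against one defender or against four.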
For Raúl and his team, the first step towards using this data has been automating the work of video analysts. “A few years ago video analysts spent most of their time recording games and labelling matches and workouts,” he told me. “Now the computers can do the labelling and the video analysts can concentrate on generating insight.”
Performing these tasks requires skills in machine learning and computer vision. Algorithms are needed to correctly identify the players’ positions and body orientations in real time, and to decide whether a situation is a counter-attack or an established possession. This problem still isn’t fully solved, and the algorithms make mistakes. Even in the top leagues, where multiple cameras film matches from multiple angles, tracking data still isn’t 100% reliable. A job for an ambitious young computer scientist, maybe?
Despite the limitations, the 22-player tracking data is already reliable enough to start generating insights. For example, physicist William Spearman, now working at Liverpool FC, has developed a passing model that shows which passes are possible and which will be blocked. Last year, one of my Master’s students in computational science, Fran Peralta Alguacil, implemented a model similar to Spearman’s in order to look at player decision-making (see figure 1). He was able to show how ‘disruptive runs’ by Barcelona players opened up space for their teammates. The project involved heavy use of his skills in physics to simulate both player movement and ball dynamics. Without proper scientific training, Fran wouldn’t have been able to simulate ball motion.
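To give a flavour of what such a model has to compute, here is a deliberately crude sketch of a pass-blocking check. It is not Spearman's or Fran's model: it assumes the ball travels in a straight line at constant speed, that every defender reacts after a fixed delay and then runs at a fixed top speed, and that a defender can reach any ball within about a metre of his body. The function name and all parameter values are illustrative guesses.

```python
import math


def pass_is_open(passer, receiver, defenders,
                 ball_speed=15.0,     # m/s, assumed constant (no drag or bounce)
                 player_speed=5.0,    # m/s, assumed top speed reached instantly
                 reaction_time=0.7,   # s before a defender starts moving
                 reach=1.0,           # m a defender can cover without moving
                 steps=50):
    """Return True if no defender can get to the ball anywhere along its path."""
    px, py = passer
    rx, ry = receiver
    length = math.hypot(rx - px, ry - py)
    for i in range(1, steps + 1):
        frac = i / steps
        # Ball position and arrival time at this point on the pass line.
        bx = px + frac * (rx - px)
        by = py + frac * (ry - py)
        t_ball = frac * length / ball_speed
        for dx, dy in defenders:
            # Ground the defender can cover by the time the ball gets here.
            covered = reach + player_speed * max(0.0, t_ball - reaction_time)
            if math.hypot(bx - dx, by - dy) <= covered:
                return False
    return True


passer, receiver = (0.0, 0.0), (20.0, 0.0)
print(pass_is_open(passer, receiver, []))             # nobody nearby
print(pass_is_open(passer, receiver, [(10.0, 0.5)]))  # defender on the pass line
print(pass_is_open(passer, receiver, [(10.0, 8.0)]))  # defender 8 m off the line
```

A more realistic version would replace each assumption with physics: a ball that slows down through drag and friction, players who accelerate from their current velocity, and a probability of interception rather than a yes or no answer. That is where the scientific training mentioned above comes in.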