Preparing data for representative comparisons

6 min read · Jun 29, 2023

Data can be a powerful tool to help assess and judge players and teams. I think we are at a point where we all recognise its value and apply some kind of data in our work. And I’m not speaking as a scout or analyst at a professional club, but as someone who enjoys writing about it in public.

Football clubs have the analysts and tools to look at a variety of markets and compare players, even if there is little data out there. They use tools that are only available to clubs or cost a lot of money. So they can do specific things with a small dataset or a dataset with varying values.

In public analysis and recruitment articles, this is different. Some people have access to those tools and others don’t. That’s why, when we do all have access to the same tools, it’s important to align the data so that a comparison is representative. So how do we make a player comparison as representative as possible? That’s what this article is about.

Data

Two data platforms are widely used for public analysis comparisons: Opta via FBREF, and Wyscout. I’m using both because of their importance and the data each of them gives us.

I would say that Wyscout covers the most leagues of any provider, but it can miss when tagging specific metrics and can be generous when calculating some goal-contribution metrics. Opta’s data models are much better, but few leagues are available for free, and its filtering is more rigid, which we will come back to later.

So why is it important to look at a specific time period when we use data? When comparing players or teams, it is important to use data from the same season or period. Football is a dynamic sport and many factors affect the outcome of the game, including player form, injuries, management changes and tactical changes. Using data from other seasons or time periods ignores these contextual factors, leading to biased and unreliable analyses.

Consistency is important in statistical analysis. By focusing on a single season or period, analysts can effectively evaluate the performance of players in similar situations. This approach allows fair comparisons because it eliminates external factors that could distort the data. It provides a level playing field to assess a player’s strengths and weaknesses, allowing clubs and scouts to make informed decisions when identifying potential signings.

Visualisation

The way I want to portray these comparisons is by calculating percentile ranks. This means I need a dataset and a selection of metrics to work with.

From those metrics, I will calculate the percentile of the given player in question against all his peers (with filters) to get the percentile ranks of this player. If you want to know more about percentile ranks, you can read my piece on it:

Ranking players: Percentile ranks, Z-Scores and Similarities

There has been a huge shift in the use of data in football in the last few years. It would be foolish for me to claim…

marclamberts.medium.com

I will use Python both to calculate those percentile ranks and to plot them in a so-called pizza plot. That chart gives me a good visual idea of how a player ranks in certain metrics against his peers.
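To make the idea of a percentile rank concrete, here is a minimal sketch in plain Python. It is not the exact code behind the plots in this article: the function name and the numbers are made up for illustration, and it uses the "share of peers at or below the player's value" definition, which I believe corresponds to the "weak" kind in scipy's percentileofscore.

```python
def percentile_rank(peer_values, player_value):
    """Percentage of peers whose value is at or below the player's value."""
    if not peer_values:
        raise ValueError("peer_values must not be empty")
    at_or_below = sum(1 for value in peer_values if value <= player_value)
    return 100 * at_or_below / len(peer_values)


# Hypothetical goals-per-90 figures for a peer group that includes the player (0.45).
peers_goals_p90 = [0.10, 0.20, 0.30, 0.45, 0.60]
example_rank = percentile_rank(peers_goals_p90, 0.45)
print(example_rank)
```

Four of the five values are at or below 0.45, so the player lands in the 80th percentile of this toy peer group. Change who is in the peer list and the same 0.45 produces a different rank, which is the whole point of the filters discussed in this article.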

Data visualisations — Opta via FBREF

The idea is to rank Kai Havertz, freshly signed for Arsenal from Chelsea, against his peers in the Top 5 European leagues. The top leagues consist of the Premier League, La Liga, Bundesliga, Ligue 1 and Serie A. The data comes from Opta via FBREF and covers the entire 2022/2023 domestic season (no cups or European competitions).

I will share my Python code snippet to show how I apply filters to make the comparison more representative:

import pandas as pd
import numpy as np

from scipy import stats
import math

from mplsoccer import PyPizza, add_image, FontManager
import matplotlib.pyplot as plt

# Import the csv of the Top 5 European leagues from FBref.
df = pd.read_csv('Downloads/Final FBRef 2022-2023.csv')

# When you first read in the csv from FBref, the player names look a bit odd.
# This splits them on the backslash and keeps only the first part.
df['Player'] = df['Player'].str.split('\\', expand=True)[0]

# To calculate against a single league, uncomment the next line.
# Leave it commented out to keep the full Top 5.
#df = df[df['Comp'] == 'NWSL']

# Position selection. Use the first option to compare only against players
# listed purely as forwards. The second option also keeps players who have
# played forward alongside other positions.
#df = df[df['Pos'] == 'FW']
df = df[df['Pos'].str.contains('FW', na=False)]  # na=False skips rows with no position

# Minutes played. I require at least 900 minutes over a full season.
df = df[df['Min'] >= 900]

df

Some thoughts about this code. First of all, these are the filters:

League
Position
Minutes
Age (not in the code above, but a very important one)
Per 90 stats (not in the code above, but a very important one)

The first of them all is the League. This can be interesting if you only want to select a certain league to run the percentile ranks against, but in this case we stated that we are looking at the Top 5 leagues, so it is less relevant here.

The position is a very important one. You want to calculate against Havertz’s peers, meaning players who are similar to him. In Opta’s data he is listed as a FW (forward), so it makes sense to select the forwards. Now there is a vital decision to be made: are you selecting all players who have played as a forward? Those who have forward as their primary position? Or players listed only as forwards? I’ve decided to include all players with “FW” in their position. A fair caveat is that Opta’s positions here are very coarse. If you were to use Wyscout data, you would get more specific positions, so your filter would be more specific as well and the dataset would be smaller.
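The three position choices described above can be sketched side by side. The rows below are made up, and the third option rests on an assumption of mine: that the first position in a comma-separated string like "FW,MF" is the player's primary one.

```python
import pandas as pd

# Toy rows; the player labels and positions are invented for illustration.
players = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E"],
    "Pos": ["FW", "FW,MF", "MF,FW", "MF", None],
})

# 1) Only players listed purely as forwards.
only_fw = players[players["Pos"] == "FW"]

# 2) Anyone with FW somewhere in the position string (the choice made in this article).
any_fw = players[players["Pos"].str.contains("FW", na=False)]

# 3) Players whose first listed position is FW (assumed here to be the primary one).
primary_fw = players[players["Pos"].str.split(",").str[0] == "FW"]

print(len(only_fw), len(any_fw), len(primary_fw))
```

The same five rows give peer groups of one, three and two players, so it is worth stating which option you used next to any plot you publish.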

Minutes is another vital one. By setting a minimum of 900 minutes, we look at players who featured for the equivalent of 10 full 90s. The reason is not only to see who can consistently post good statistics, but also to filter out the outliers who played 2 games and did very well in them, and the players who play only a few minutes per game but end up with good per 90 stats. The filter helps in judging how good and consistent players can be. Per 90 stats are of vital importance in judging ability per game rather than in total. 10 goals in 5 games is a better stat than 10 goals in 24, but the totals are identical, hence per 90 stats.

The last one I would potentially use in the initial comparison pizza plot is age. I would mainly use it when looking at U21/U23 players, with the stated age as the upper limit. That restricts the dataset to those ages, which will be reflected in the pizza plot.
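Since per 90 stats are not in my snippet above, here is a rough sketch of how a total can be turned into a per 90 rate and combined with the minutes filter. The numbers are invented, and the column names "Min" and "Gls" are an assumption modelled on FBref-style exports, so adjust them to your own file.

```python
import pandas as pd

# Made-up totals; "Min" and "Gls" mirror FBref-style column names (an assumption).
players = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Min": [1000, 2160, 90],
    "Gls": [10, 10, 2],
})

# Scale each total to a rate per 90 minutes played.
players["Gls_p90"] = players["Gls"] * 90 / players["Min"]

# Apply the minutes floor so tiny samples such as player C drop out.
regulars = players[players["Min"] >= 900]

print(regulars[["Player", "Gls_p90"]])
```

Players A and B both have 10 goals, but A scores at 0.9 per 90 and B at roughly 0.42. Player C’s 2.0 per 90 comes from a single match, which is exactly the kind of outlier the 900-minute floor removes.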
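The age filter is not in my snippet either, so here is a hedged sketch. It assumes the "Age" column may be stored either as whole years or as a "years-days" string such as "21-200", and it treats U23 as "23 or younger". Both are assumptions you should check against your own export and your own definition.

```python
import pandas as pd

# Invented rows; the "years-days" age format is an assumption about the export.
players = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Age": ["21-200", "24-010", "23-045", None],
})

# Keep only the years part and turn it into a number; unparseable ages become NaN.
age_years = pd.to_numeric(
    players["Age"].astype(str).str.split("-").str[0],
    errors="coerce",
)

# Whether U23 includes 23-year-olds is a choice; here it does.
max_age = 23
u23 = players[age_years <= max_age]

print(u23["Player"].tolist())
```

Rows with a missing age fall out of the selection automatically, because a NaN never satisfies the comparison.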

All attackers with 1+ minutes played. Percentile ranks Kai Havertz 22–23 at Chelsea

All attackers with 900+ minutes played. Percentile ranks Kai Havertz 22–23 at Chelsea

All U23 attackers with 900+ minutes played. Percentile ranks Kai Havertz 22–23 at Chelsea

These are three pizza plots of Kai Havertz against his peers, each with the filters altered a little. See how the numbers change? That’s why it’s important to think carefully about your filters.
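The same effect can be reproduced on a toy dataset. The sketch below is not the data behind the plots: every name and number is invented, and the helper rank_in_pool is hypothetical. It ranks one player on one metric against three differently filtered pools using pandas’ rank(pct=True).

```python
import pandas as pd

# Invented pool; "Target" stands in for the player being assessed.
pool = pd.DataFrame({
    "Player": ["Target", "P1", "P2", "P3", "P4", "P5"],
    "Min": [2500, 120, 1800, 950, 60, 2200],
    "Age": [23, 20, 29, 21, 19, 31],
    "xG_p90": [0.40, 0.90, 0.55, 0.30, 1.20, 0.35],
})


def rank_in_pool(df, player, metric):
    """Percentile rank (0-100) of one player on one metric within df."""
    pct = df[metric].rank(pct=True) * 100
    return pct[df["Player"] == player].iloc[0]


everyone = pool[pool["Min"] >= 1]
regulars = pool[pool["Min"] >= 900]
u23_regulars = pool[(pool["Min"] >= 900) & (pool["Age"] <= 23)]

for label, peers in [("1+ min", everyone), ("900+ min", regulars), ("U23, 900+ min", u23_regulars)]:
    print(label, rank_in_pool(peers, "Target", "xG_p90"))
```

With these made-up numbers the same 0.40 xG per 90 sits at the 50th percentile among everyone, the 75th among players with 900+ minutes, and the 100th among U23 players with 900+ minutes. Nothing about the player changed, only the peer group.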

Final thoughts

Obviously, there are many more filters you can put on the dataset to level it further for a specific data profile, but I wanted to illustrate that levelling your data makes the outcome more representative, and that’s what you should be striving for in public statistical analysis.