Weighted Z-scores and profiles in football
Z-scores, also known as standard scores, are a statistical measure used to assess the relative position of a data point within a dataset. They indicate how many standard deviations a particular data point is away from the mean (average) of the dataset. This standardization allows for a meaningful comparison of data points from different distributions, as it places them on a common scale.
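To make the definition concrete, here is a minimal sketch of the calculation in pandas. The player labels and goals-per-90 values are made up for illustration; only the formula (value minus mean, divided by standard deviation) comes from the definition above.

```python
import pandas as pd

# Hypothetical goals-per-90 values for five strikers (illustrative numbers only)
goals_per_90 = pd.Series(
    [0.20, 0.35, 0.50, 0.65, 0.80],
    index=["A", "B", "C", "D", "E"],
)

# z = (value - mean) / standard deviation
# Note: pandas uses the sample standard deviation (ddof=1) by default
z_scores = (goals_per_90 - goals_per_90.mean()) / goals_per_90.std()

print(z_scores.round(2))
```

After the conversion the values have a mean of 0 and a standard deviation of 1, which is what puts different metrics on a common scale.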
Z-scores have several advantages over percentile ranks. First, they preserve the distances between data points. Percentile ranks only record the order of the values, so a large gap and a small gap between two neighbouring players can look the same, which obscures fine-grained differences between data points.
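A small sketch of that difference, using made-up xG values in which one player is far ahead of the rest. The percentile ranks here come from pandas' rank(pct=True), which is one common way to compute them and not necessarily the one every data provider uses.

```python
import pandas as pd

# Hypothetical xG per 90 values; player "E" is far ahead of the others
xg = pd.Series([0.20, 0.22, 0.24, 0.26, 0.90], index=list("ABCDE"))

# Percentile rank: position in the ordering, expressed as 0-100
percentile_rank = xg.rank(pct=True) * 100

# Z-score: distance from the mean in standard deviations
z_scores = (xg - xg.mean()) / xg.std()

comparison = pd.DataFrame({"percentile": percentile_rank, "z": z_scores.round(2)})
print(comparison)
```

In percentile terms, the step from D to E is the same size as the step from C to D. In z-score terms, the step from D to E is many times larger, because the z-score keeps the actual distance between the values.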
Second, Z-scores provide information about the direction of a data point relative to the mean. A positive Z-score indicates that a data point is above the mean, while a negative Z-score suggests it is below the mean. This directional information is valuable in many applications.
Third, Z-scores are particularly useful for standardizing data. Once values are expressed as Z-scores, outliers and extreme values show up as large positive or negative scores, which makes them easier to identify and manage in data analysis and quality control.
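As a rough illustration of how extreme values can be flagged, the sketch below marks every value whose absolute z-score exceeds a cut-off. The shot numbers and the cut-off of 2.5 are assumptions chosen for this example, not a recommended standard.

```python
import pandas as pd

# Hypothetical shots per 90; the last value is deliberately extreme
shots = pd.Series([1.8, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 9.0])

z = (shots - shots.mean()) / shots.std()

# Assumed cut-off: flag anything more than 2.5 standard deviations from the mean
threshold = 2.5
outliers = shots[z.abs() > threshold]

print(outliers)
```

In small samples a single extreme value also inflates the standard deviation, so the cut-off is a judgment call and is worth checking against the size of your dataset.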
I’ve spoken about it here:
But I want to take it a step further and explain why weighted z-scores can help in creating data profiles. A plain z-score composite treats every metric as equally important. For a specific role, however, some metrics matter more than others. Take a striker who needs to offer goal contributions above anything else: scoring and assisting metrics are then more important than progressive runs and short passes, so we have to weight the metrics accordingly.
Data
The data for this piece comes from Wyscout: a full season of the 2. Bundesliga, 2022–2023. Only players with more than 900 minutes are included, so that the numbers are representative and give a fair picture of each player.
A role is not the same as a position, but I still choose to look only at strikers (CF) in my database, so that I end up with strikers who have goalscoring tendencies.
Code
The code is written in Python, which I use a lot for data analysis and visualisation. The first step is to import the libraries we need:
import pandas as pd
import numpy as np
from scipy.stats import norm
You have probably seen pandas and numpy before if you have worked with Python. From scipy.stats we import norm, which is used later to map the z-scores onto a 0–100 scale through the normal cumulative distribution function. The z-scores themselves are calculated directly with pandas.
Then we move on to the filters. With Wyscout data, you have a huge database, so there are a few filters that have to be applied to make sure we have the right data and it is representative so we can do the analysis:
#WYSCOUT DATA
#importing your excel file
df = pd.read_excel('Wyscout combined.xlsx')
#Optional: select your tier
#df = df[df['Tier'] =='Tier 2']
#Select your league
df = df[df['League'] == '2. Bundesliga']
#Select your position in the Wyscout data - if they only played one position
df = df[df['Position'] =='CF']
#Select this if you want to find all players that have played CF
#df = df[df['Position'].str.contains('CF')]
#Select the minutes played
df = df[df['Minutes played'] > 900]
#Select the age
df = df[df['Age'] <= 25]
So I apply a few filters after importing the Excel file: league, position, minutes played and age. Because I work with a combined file, I can filter by tier (the level of the league) as well as by league. I only keep players whose listed position is exactly CF. I could also include everyone who has played as CF, which would bring in, for example, wingers who have had to play as CF for a while, but for this exercise I leave them out.
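The difference between the two position filters is easiest to see on a toy table. The sketch below wraps the filters in a hypothetical helper, filter_strikers, which is not part of the original script. The column names follow the code above, while the players and the comma-separated position strings are assumptions for illustration.

```python
import pandas as pd

# Toy data with the same column names as the filters above (values are made up)
toy = pd.DataFrame({
    "Player": ["P1", "P2", "P3", "P4"],
    "League": ["2. Bundesliga"] * 4,
    "Position": ["CF", "CF, RW", "LW, CF", "CF"],
    "Minutes played": [1500, 2000, 1200, 600],
    "Age": [22, 24, 23, 21],
})


def filter_strikers(data, only_cf=True, min_minutes=900, max_age=25):
    """Hypothetical helper: apply the position, minutes and age filters in one step."""
    if only_cf:
        # Exact match: players listed at CF and nowhere else
        position_mask = data["Position"] == "CF"
    else:
        # Substring match: anyone with CF somewhere in their position string
        position_mask = data["Position"].str.contains("CF", na=False)
    mask = (
        position_mask
        & (data["Minutes played"] > min_minutes)
        & (data["Age"] <= max_age)
    )
    return data[mask].copy()


strict = filter_strikers(toy)
broad = filter_strikers(toy, only_cf=False)

print(strict["Player"].tolist())
print(broad["Player"].tolist())
```

The strict version keeps only the pure CF with enough minutes, while the broad version also keeps the two players who have CF as one of several positions.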
So we then go on to the next step, which is looking at the data and the weights we put on them:
# Goalscoring strikers
original_metrics = ["xG per 90", "Goal conversion, %", "Received passes per 90",
"Key passes per 90", "xA per 90", "Head goals per 90",
"Aerial duels won, %", "Touches in box per 90", "Non-penalty goals per 90"]
weights = [5, 5, 3,
1, 1, 0.5,
0.5, 3, 1]
Here we have listed all the metrics we consider logical and crucial for players in this role. Without weights, every metric would count equally towards the score. Because we assign weights, some metrics count for more than others, and in doing so we have effectively defined a role: the goalscoring striker.
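Because the metrics and the weights live in two separate lists, it is easy for them to drift out of sync. One possible sanity check, sketched here, is to pair them in a pandas Series and look at each weight as a share of the total. The names weight_table and weight_share are introduced only for this example.

```python
import pandas as pd

original_metrics = ["xG per 90", "Goal conversion, %", "Received passes per 90",
                    "Key passes per 90", "xA per 90", "Head goals per 90",
                    "Aerial duels won, %", "Touches in box per 90", "Non-penalty goals per 90"]
weights = [5, 5, 3,
           1, 1, 0.5,
           0.5, 3, 1]

# Every metric needs exactly one weight
assert len(original_metrics) == len(weights)

# Pair each metric with its weight so the mapping is explicit
weight_table = pd.Series(weights, index=original_metrics, name="weight")

# Express each weight as a percentage of the total weight
weight_share = (weight_table / weight_table.sum() * 100).round(1)

print(weight_share.sort_values(ascending=False))
```

With these weights, xG per 90 and goal conversion each carry a quarter of the total weight, which matches the idea of a striker judged mainly on goalscoring.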
Now on to the actual calculation of z-scores with weighted values:
# Calculate the composite score for the original metrics
df["Goalscoring strikers"] = np.dot(df[original_metrics], weights)
# Calculate the mean and standard deviation for the composite score of the original metrics
original_mean = df["Goalscoring strikers"].mean()
original_std = df["Goalscoring strikers"].std()
# Calculate the z-scores for the composite score of the original metrics
df["Goalscoring strikers"] = (df["Goalscoring strikers"] - original_mean) / original_std
# Map the z-scores of the original metrics to a range of 0 to 100 with two decimal places
df["Goalscoring strikers(0-100)"] = (norm.cdf(df["Goalscoring strikers"]) * 100).round(2)
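What the code does is build a composite score, the weighted sum of the selected metrics, and then standardize that composite with its own mean and standard deviation to get a z-score. The z-score is then passed through the normal cumulative distribution function, which maps it onto a 0–100 scale where the mean corresponds to 50. Read this score as a percentile-style rating of how strongly a player fits the data profile relative to the other strikers in the sample, not as a literal percentage match. Because the weights are applied to the raw values, metrics measured on larger scales (percentages, for example) contribute more to the composite than their weight alone suggests.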
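One possible variation, which is not what the code above does, is to standardize each metric on its own before applying the weights, so that the weights rather than the measurement scales decide how much each metric counts. The sketch below compares both approaches on made-up data with two metrics and assumed weights of 5 and 1.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Toy data: two metrics on very different scales (values are made up)
toy = pd.DataFrame({
    "Player": ["P1", "P2", "P3", "P4"],
    "xG per 90": [0.60, 0.45, 0.30, 0.15],
    "Goal conversion, %": [10.0, 15.0, 20.0, 25.0],
})
metrics = ["xG per 90", "Goal conversion, %"]
weights = np.array([5, 1])  # assumed weights for this example

# Approach used in this article: weight the raw values, then standardize the sum
raw_composite = toy[metrics].to_numpy() @ weights
toy["raw_z"] = (raw_composite - raw_composite.mean()) / raw_composite.std(ddof=1)

# Variation: standardize each metric first, then take the weighted average
metric_z = (toy[metrics] - toy[metrics].mean()) / toy[metrics].std()
weighted = metric_z.to_numpy() @ weights / weights.sum()
toy["weighted_z"] = (weighted - weighted.mean()) / weighted.std(ddof=1)

# Same 0-100 mapping as in the article, applied to the variation
toy["weighted_0_100"] = (norm.cdf(toy["weighted_z"]) * 100).round(2)

print(toy[["Player", "raw_z", "weighted_z", "weighted_0_100"]])
```

In this toy example the raw approach ranks P4 first, because goal conversion is expressed in much larger numbers than xG, while the per-metric variation ranks P1 first, in line with the heavier weight on xG. Which behaviour you want depends on how you intend the weights to be read.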
# Sort the DataFrame by the z-scores of the original metrics in ascending order
df_original = df.sort_values("Goalscoring strikers")
# Save the DataFrame with the z-scores of the original metrics to an Excel file
original_output_filename = "Goalscoring strikers.xlsx"
df_original.to_excel(original_output_filename, index=False)
I have now saved it to an Excel file. It contains all the original data that we imported, plus two new columns: the z-score and the 0–100 score, with 50 corresponding to the mean. This gives us an idea of how well a player fits the profile.
# Print the player names, ages, teams, and the 0-100 scores
print(df[["Player", "Age", "Team within selected timeframe", "Goalscoring strikers(0-100)"]])
You could also choose to print the scores and get an instant list without having to open your Excel files.
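If you only want the best fits instead of the full list, one option is pandas' nlargest. The sketch below uses a toy table with made-up scores. The column name matches the one created earlier, and top_n is an assumed shortlist size.

```python
import pandas as pd

# Toy scores standing in for the real output (values are made up)
toy = pd.DataFrame({
    "Player": ["P1", "P2", "P3", "P4"],
    "Goalscoring strikers(0-100)": [35.2, 88.1, 50.0, 71.4],
})

# Assumed shortlist size for this example
top_n = 3

# Keep the rows with the highest profile scores, best fit first
shortlist = toy.nlargest(top_n, "Goalscoring strikers(0-100)")

print(shortlist.to_string(index=False))
```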