P-values in football: assessing the validity of correlation in scatterplots
Scatterplots. Football analysts in the public space absolutely love them, and I’m one of them. I have been using them for years to put two metrics in one graph and look for any form of correlation. To me they have always been insightful and meaningful; however, the feedback has often been: why put these particular metrics against each other? I wanted to see if I could use calculations to check that.
In this article, I will use a statistical concept called the p-value to assess the validity of a correlation and whether it is a meaningful measurement for the data.
Scatterplots
A scatter plot (or scatter diagram) is a type of graph used to represent the relationship between two continuous variables. It uses Cartesian coordinates, where one variable is plotted along the horizontal axis (x-axis) and the other variable along the vertical axis (y-axis). Each observation in the dataset is displayed as a single point (dot) on the graph.
By revealing relationships, trends, and outliers, scatter plots are indispensable tools in data analysis, making them one of the most widely used visualizations in statistics and research.
In the scatterplot above you can see two different metrics combined in one graph, which is a familiar sight. However, we also want to understand what the correlation does and means: how do these metrics relate to each other, and does that relationship make this a good data visualisation?
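To make this concrete, here is a minimal sketch of how such a scatterplot could be built in Python with matplotlib. The metric names (shots per 90 and xG per 90) and the numbers are invented purely for illustration, not taken from real player data.

```python
# A minimal sketch: plotting two hypothetical per-90 metrics as a scatterplot.
# Metric names and numbers are made up for illustration.
import matplotlib.pyplot as plt

shots_per_90 = [1.2, 2.4, 3.1, 0.8, 2.0, 2.9, 1.6, 3.5]
xg_per_90    = [0.10, 0.28, 0.35, 0.07, 0.22, 0.31, 0.15, 0.40]

plt.scatter(shots_per_90, xg_per_90)
plt.xlabel("Shots per 90")        # one metric on the x-axis
plt.ylabel("xG per 90")           # the other metric on the y-axis
plt.title("Each dot is one player (dummy data)")
plt.show()
```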
Correlation
Correlation is a statistical measure expressing the degree to which two variables are related. It quantifies the strength and direction of the relationship between variables, typically represented using a single numerical value called the correlation coefficient. Correlation does not imply causation — it merely indicates that a relationship exists between variables.
So we look for a positive, negative or no correlation, and we can show that in a scatterplot by adding a regression line. A correlation, whether positive or negative, is strong when the dots sit close to the regression line. You can see this in the plot below, which shows an example of a positive correlation.
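A small sketch of how that could be computed: Pearson’s r gives the single number and a least-squares fit gives the regression line. It reuses the same invented per-90 numbers as the previous snippet, so nothing here reflects real data.

```python
# A sketch: quantify the relationship with Pearson's r and draw a regression line.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.2, 2.4, 3.1, 0.8, 2.0, 2.9, 1.6, 3.5])          # shots per 90 (dummy)
y = np.array([0.10, 0.28, 0.35, 0.07, 0.22, 0.31, 0.15, 0.40])  # xG per 90 (dummy)

r = np.corrcoef(x, y)[0, 1]             # correlation coefficient, between -1 and 1
slope, intercept = np.polyfit(x, y, 1)  # least-squares regression line

plt.scatter(x, y)
plt.plot(x, slope * x + intercept)      # the closer the dots sit to this line, the stronger the correlation
plt.title(f"Pearson r = {r:.2f}")
plt.show()
```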
P-value
We have explored correlation and regression to understand the relationship between two metrics. These methods provide valuable insights into the strength, direction, and predictive nature of the relationship. However, they do not address whether the metrics are conceptually or practically appropriate to compare in the first place. Determining if the comparison is valid requires examining the theoretical relevance of the metrics, their validity in measuring related phenomena, and whether initial patterns suggest a meaningful connection.
The next step is to assess the significance of the observed relationship using the p-value. The p-value tells us whether the relationship is statistically significant, helping us determine if it is likely due to chance. However, it does not evaluate whether the metrics were appropriate to compare — it only validates the significance of the comparison. Before relying on the p-value, it is crucial to ensure the comparison is grounded in logical, domain-specific reasoning and that the metrics are suitable for analysis.
A p-value measures the probability of obtaining the observed results, assuming that the null hypothesis is true. The lower the p-value, the greater the statistical significance of the observed difference. A p-value of 0.05 or lower is generally considered statistically significant.
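As a rough illustration of where that probability comes from, the sketch below computes the p-value for a correlation with scipy’s pearsonr and, for intuition, reproduces it by hand from the t-distribution. The data are the same dummy numbers as before, not a real dataset.

```python
# A sketch of where the p-value for a correlation comes from: under the null
# hypothesis of no correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2) follows a
# t-distribution with n - 2 degrees of freedom. scipy.stats.pearsonr does this for you.
import numpy as np
from scipy import stats

x = np.array([1.2, 2.4, 3.1, 0.8, 2.0, 2.9, 1.6, 3.5])
y = np.array([0.10, 0.28, 0.35, 0.07, 0.22, 0.31, 0.15, 0.40])

r, p = stats.pearsonr(x, y)             # correlation and two-sided p-value in one call

# the same p-value by hand, for intuition
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)

print(f"r = {r:.2f}, p = {p:.4f} (manual: {p_manual:.4f})")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```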
So the idea is to measure whether the relationship between the two metrics you compare is statistically significant. In the graphs below you see two different examples.
In the left graph you see a p-value of 0.0, which is lower than 0.05 and therefore considered statistically significant: these metrics have a relationship we are better equipped to work with. When we look at the right graph, the p-value is 0.2. That is larger than 0.05, so the relationship between these metrics is not statistically significant.
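The snippet below is a hypothetical recreation of that contrast: one pair of metrics built to be related and one with no built-in relationship, with the resulting p-values printed side by side. The data are randomly generated, so the exact values will differ from the graphs described above.

```python
# A sketch mirroring the two panels: one correlated pair (low p-value) and one
# unrelated pair (high p-value). All numbers are invented purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 50
base = rng.normal(size=n)

related_metric   = base * 0.8 + rng.normal(scale=0.3, size=n)  # depends on base -> correlated
unrelated_metric = rng.normal(size=n)                           # independent of base -> uncorrelated

for name, metric in [("related", related_metric), ("unrelated", unrelated_metric)]:
    r, p = stats.pearsonr(base, metric)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: r = {r:.2f}, p = {p:.3f} -> {verdict} at the 0.05 level")
```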
Final thoughts
Scatterplots and p-values are valuable tools for understanding relationships between variables, but they play different roles. A scatterplot gives you a clear visual of the data, showing trends, patterns, or outliers that might influence the relationship. It lets you see if the relationship looks linear or if something more complex is going on. The correlation coefficient adds to this by giving a single number that tells you how strong and in what direction the relationship is moving, but it’s the scatterplot that helps you interpret this number in context.
The p-value, on the other hand, helps answer a different question: is the relationship you’re seeing in the scatterplot real, or could it just be random chance? A small p-value (like p<0.05) means it’s unlikely that the relationship happened by accident, giving you confidence that the connection is statistically significant. However, the p-value doesn’t tell you how strong or meaningful the relationship is — it only indicates that it is unlikely to be random. By combining scatterplots, correlation, and p-values, you get a fuller picture of what’s happening in your data, balancing visual insights with statistical validation.