Introducing Expected Shots from Cross (xCross): measuring the probability that a shot occurs from a cross

Feb 2, 2025

How many more expected models do we need? That’s a question I have asked myself numerous times while researching this article. I think it all depends on the angle you take and how you approach the research: what’s the aim and what do you want to get out of it?

For me, it’s important that a new model adds something to the conversation about expected value models. That can be done by creating a completely new model or metric, or by recreating an existing model with enhancements. A combination of the two is also possible, of course.

In this article, I want to talk a little bit more about crosses. It is not so much about cross delivery or cross completion percentages, but about what a successful follow-up action entails: shots from crosses and the expected models around them.

First I will talk about the data and how I collected it, then about the methodology, followed by the analysis of the data, and finally I will give my conclusions.

Data

For this research, I have used raw event data from Opta/Statsperform. The event data was collected on Thursday, 30 January 2025, focusing on the 2024–2025 season of the Belgian Pro League, the first tier of Belgian football.

The data will be processed to derive the metrics we need to make this calculation work. The players featured will have played a minimum of 500 minutes throughout the season.

For this research, we won’t focus on any expected goal metrics, as we are not looking for the probability of a goal being scored.

Why this metric?

This is a question I ask myself every time I set out to make a new model or metric. And sometimes I really don’t know what to answer. The fundamental question remains: do we need it? I guess that’s a question of semantics, but no, I don’t think we need it. However, I believe it can give us some interesting insights into how shots come to be.

My starting point is what expected assists or expected goals assisted tell us. These metrics describe the probability that a pass leads to a goal, with the key difference being whether they cover all passes or only passes leading to shots. I love the idea, but it’s very much focused on the outcome of the shots and on expected goals.

I want to do something different. Yes, the outcome will be a probability, but it focuses on the probability of a shot being taken rather than a shot ending up as a goal. Furthermore, I want to look at the qualitative nature of the crosses and whether we can assess something about the delivery taker. In other words, does the quality of the cross lead to more or fewer shots under otherwise similar conditions?

Cross: a definition

What is a cross? If we look at Hudl’s definition, a cross is: A ball played from the offensive flanks aimed towards a teammate in the area in front of the opponent’s goal.

In this instance a flank is the outermost 23 meters on either side of a 68 meter wide pitch. This means that any pass from the left or right flank into the central area can be considered a cross.
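
To make the flank definition concrete, here is a minimal Python sketch. It assumes the lateral position is measured in meters from one touchline (Opta coordinates would need converting first), and it only checks the lateral part of the definition, not whether the ball is aimed at a teammate in front of goal. The function names are my own and purely illustrative.

```python
PITCH_WIDTH_M = 68.0  # pitch width used in the text
FLANK_WIDTH_M = 23.0  # outermost strip on each side, per the definition above


def is_on_flank(y_m: float) -> bool:
    """Return True if a lateral position (meters from one touchline) lies in a flank."""
    if not 0.0 <= y_m <= PITCH_WIDTH_M:
        raise ValueError("y_m must be between 0 and the pitch width")
    return y_m <= FLANK_WIDTH_M or y_m >= PITCH_WIDTH_M - FLANK_WIDTH_M


def is_flank_to_centre(start_y_m: float, end_y_m: float) -> bool:
    """True if a pass starts on a flank and ends in the central channel."""
    return is_on_flank(start_y_m) and not is_on_flank(end_y_m)
```

With these numbers the central channel is the 22 meters between the two flanks, so a pass from 5 meters off the touchline to the middle of the pitch (34 meters) would qualify.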

Since we are using Opta/Statsperform data in our research, let’s see what their definition is: A ball played from a wide position targeting a teammate(s) in a central area within proximity to the Goal. The delivery must have an element of lateral movement from a wider position to more central area in front of Goal.

If we take some random data for crosses, we can see where crosses come from and what metrics we can pull from them. This is essential for our research into a model and we need to understand what we are working with.

In the image below you can see a pitch map with crosses visualised.

We visualise the crosses coming from open play, so we filter out the set pieces, and we make a distinction between successful and unsuccessful passes. On the right side we see some calculated metrics for the crosses, which we can also use later on. EPV is expected possession value and xT is expected threat.

Methodology

I want to create a model from the crosses we visualised. The aim is a model that calculates, for every cross, the probability of it turning into a shot. There are a few things we need to take from the event data:

Cross origin (location on the field)
Receiving player’s position (inside/outside the box, near/far post, penalty spot based on the end location of the cross)
Game context (match minute, scoreline, opposition quality)

The first step involves organising the dataset by sorting events chronologically using timestamps. Then, the model identifies cross attempts (Cross == 1) and assigns a binary target variable (leads_to_shot) by checking if the next three recorded events include a shot attempt (typeId in [13, 14, 15, 16]). These type IDs cover a missed shot, a shot on the post, a saved shot and a goal. Limiting the window to three events means the model captures sequences where a cross directly results in a shot and is not influenced by unrelated play sequences.
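
The labelling step can be sketched in a few lines of plain Python. This is a simplified illustration, not the exact pipeline: it assumes events are dictionaries from a single match, and the `timestamp` key is an assumed name (the real column may be called something else). `Cross`, `typeId` and `leads_to_shot` follow the names used above.

```python
SHOT_TYPE_IDS = {13, 14, 15, 16}  # miss, post, saved, goal (as listed in the text)
LOOKAHEAD = 3                     # number of following events to inspect


def label_crosses(events):
    """Return the cross events, each with a binary `leads_to_shot` label.

    `events` is a list of dicts with at least `timestamp`, `typeId` and `Cross`.
    """
    ordered = sorted(events, key=lambda e: e["timestamp"])
    labelled = []
    for i, event in enumerate(ordered):
        if event.get("Cross") != 1:
            continue
        following = ordered[i + 1 : i + 1 + LOOKAHEAD]
        leads = any(e["typeId"] in SHOT_TYPE_IDS for e in following)
        labelled.append({**event, "leads_to_shot": int(leads)})
    return labelled


# Made-up sequence; typeId 1 stands in for any non-shot event.
events = [
    {"timestamp": 10, "typeId": 1, "Cross": 1},   # cross, shot two events later
    {"timestamp": 11, "typeId": 1, "Cross": 0},
    {"timestamp": 12, "typeId": 15, "Cross": 0},  # saved shot
    {"timestamp": 40, "typeId": 1, "Cross": 1},   # cross, no shot in next three events
    {"timestamp": 41, "typeId": 1, "Cross": 0},
    {"timestamp": 42, "typeId": 1, "Cross": 0},
    {"timestamp": 43, "typeId": 1, "Cross": 0},
    {"timestamp": 44, "typeId": 16, "Cross": 0},  # goal, but four events after the cross
]
labels = [c["leads_to_shot"] for c in label_crosses(events)]
```

In a real dataset you would also want to make sure the three following events belong to the same match and period, which this sketch does not check.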

After defining the target variable, feature engineering is applied to improve model performance. Several factors influence the probability of a cross leading to a shot, such as the location of the cross (x, y), its target area (endX, endY), and the total time elapsed in the match (totalTime).

The dataset is then split into training (80%) and testing (20%) sets, ensuring that the distribution of positive and negative samples is preserved using stratification.
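
A stratified 80/20 split could look like the sketch below. I am assuming scikit-learn here, and the data is random placeholder data standing in for the real features (x, y, endX, endY, totalTime) and the leads_to_shot labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(0, 100, size=(n, 5))         # placeholder for x, y, endX, endY, totalTime
y = (rng.uniform(size=n) < 0.3).astype(int)  # placeholder leads_to_shot labels

# stratify=y keeps the share of positive labels (nearly) equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"train share of shots: {y_train.mean():.3f}")
print(f"test share of shots:  {y_test.mean():.3f}")
```

Stratification matters most when crosses that lead to shots are the minority class, because a purely random split could otherwise leave the test set with too few positives.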

To estimate the probability that a cross leads to a shot, machine learning models are applied. A Logistic Regression model is trained to predict a probability score for each cross, making it an interpretable baseline model.

In the context of xCross, the goal of the model is to predict whether a cross will lead to a shot attempt (leads_to_shot = 1) or not (leads_to_shot = 0).
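
As a rough illustration of the baseline, the sketch below fits a logistic regression with scikit-learn (an assumption on my part) on synthetic data. The coordinate ranges, the seconds unit for totalTime and the rule that generates the labels are all made up for the example; only the feature names come from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["x", "y", "endX", "endY", "totalTime"]

rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([
    rng.uniform(60, 100, n),  # x: cross origin, assumed 0-100 scale
    rng.uniform(0, 100, n),   # y
    rng.uniform(80, 100, n),  # endX: target area
    rng.uniform(20, 80, n),   # endY
    rng.uniform(0, 5400, n),  # totalTime, assumed to be in seconds
])

# Synthetic rule for the example only: crosses ending closer to goal
# are more likely to be followed by a shot.
logit = 0.25 * (X[:, 2] - 90)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Scaling first keeps the coefficients comparable across features
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Column 1 of predict_proba is the probability of leads_to_shot = 1
xcross = model.predict_proba(X)[:, 1]
```

The per-cross value in `xcross` is what I refer to as xCross: a probability between 0 and 1 for each delivery.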

Additionally, a Random Forest Classifier is trained to capture non-linear relationships between crossing characteristics and shot generation likelihood. Both models are evaluated using accuracy, ROC AUC (Receiver Operating Characteristic — Area Under Curve), and classification reports, ensuring their ability to distinguish between successful and unsuccessful crosses in terms of shot creation.
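
The evaluation step can be sketched as follows, again assuming scikit-learn and using synthetic data. The label rule (shots are likelier when the cross ends in a central band) is invented to give the forest a non-linear pattern to find, and the hyperparameters are arbitrary defaults, not tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 2000
X = rng.uniform(0, 100, size=(n, 5))  # placeholder for x, y, endX, endY, totalTime

# Synthetic non-linear rule: higher shot probability for central end locations
central = np.abs(X[:, 3] - 50) < 20
y = (rng.uniform(size=n) < np.where(central, 0.6, 0.1)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

pred = forest.predict(X_test)               # hard 0/1 predictions
proba = forest.predict_proba(X_test)[:, 1]  # probability of leads_to_shot = 1

accuracy = accuracy_score(y_test, pred)
auc = roc_auc_score(y_test, proba)          # ROC AUC uses probabilities, not labels
report = classification_report(y_test, pred)

print(f"accuracy: {accuracy:.3f}  ROC AUC: {auc:.3f}")
print(report)
```

Accuracy alone can look flattering when most crosses do not lead to a shot, which is why ROC AUC and the per-class precision and recall in the classification report are worth reading alongside it.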

Analysis

We now have an Excel file with the results for every cross in our dataset, containing the probability of it leading to a shot within the next three actions. With that, we can start analysing the data.

First we can look at the players who have the highest xCross number in the 2024–2025 season so far.

As you can see in the bar graph above, these are the top 15 players who are most likely to deliver a cross that leads to a shot. Take Stassin: according to the model, 82% of his crosses are expected to lead to a shot within the next 3 actions.
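
One way to get from per-cross probabilities to a per-player number is to average them, which is what the sketch below does. I am assuming that the player-level xCross is a simple mean; the player names, the `player` and `xcross` keys and the values are made up.

```python
from collections import defaultdict


def mean_xcross_by_player(crosses):
    """Average per-cross probabilities per player, highest first."""
    totals = defaultdict(lambda: [0.0, 0])  # player -> [sum of xcross, count]
    for cross in crosses:
        entry = totals[cross["player"]]
        entry[0] += cross["xcross"]
        entry[1] += 1
    averages = {player: s / k for player, (s, k) in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up per-cross model outputs
crosses = [
    {"player": "Player A", "xcross": 0.80},
    {"player": "Player A", "xcross": 0.60},
    {"player": "Player B", "xcross": 0.30},
    {"player": "Player B", "xcross": 0.50},
    {"player": "Player B", "xcross": 0.10},
]
ranking = mean_xcross_by_player(crosses)
```

An average like this is sensitive to small samples, so it is worth reading it next to the number of crosses each player attempted.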

In the scatterplot below you can see the total number of crosses plotted against the crosses leading to shots in the next 3 actions.

What I want to try to do is to find the correlation between shots from crosses and the probability of shots coming from crosses. That’s what we can see in the correlation matrix.

As you can see, the correlation between shots from crosses and xCross is very high at 0.99. It is a strong positive relationship, and that is something we need to think about.
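
For reference, a Pearson correlation between two per-player columns can be computed as in the sketch below. The numbers are invented, and which two columns are compared (here: actual shots from crosses versus summed xCross per player) is my assumption for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up per-player totals
shots_from_crosses = [2, 5, 7, 11, 14]
summed_xcross = [2.4, 4.6, 7.5, 10.2, 14.9]
r = pearson(shots_from_crosses, summed_xcross)
```

If both columns are season totals, part of a high correlation may simply reflect crossing volume, since players who cross more accumulate more of both.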

Final thoughts

Looking ahead, further improvements could include incorporating player movement data, defensive positioning, and match context to refine shot prediction accuracy. Testing more advanced models, such as XGBoost or deep learning, could help capture complex interactions between crossing characteristics and shot outcomes. Additionally, fine-tuning the Random Forest hyperparameters could further optimise performance. Ultimately, these refinements can provide deeper tactical insights.