Extended Kalman filter for Premier League football
In this example, we'll use the extended Kalman filter implementation from posteriors
to infer the skills of Premier League football teams. We'll use a simple Elo-style
Bayesian model from Duffield et al. to model the
outcome of matches.
Data
We'll use data from football-data.co.uk and
pandas
to load the Premier League results for the
2021/22 and 2022/23 seasons.
Code to download Premier League data
import torch
import matplotlib.pyplot as plt
import pandas as pd
import posteriors
def download_data(start=21, end=23):
    # football-data.co.uk hosts one CSV per season, E0 = Premier League
    urls = [
        f"https://www.football-data.co.uk/mmz4281/{y}{y+1}/E0.csv"
        for y in range(start, end)
    ]
    origin_date = pd.to_datetime(f"20{start}-08-01")

    data = pd.concat(pd.read_csv(url) for url in urls)
    data = data.dropna()

    # Convert match dates into days elapsed since the start of the first season
    data["Timestamp"] = pd.to_datetime(data["Date"], dayfirst=True)
    data["TimestampDays"] = (data["Timestamp"] - origin_date).dt.days.astype(int)

    # Assign each team an integer ID (alphabetical order)
    players_arr = pd.unique(pd.concat([data["HomeTeam"], data["AwayTeam"]]))
    players_arr.sort()
    players_name_to_id_dict = {a: i for i, a in enumerate(players_arr)}
    players_id_to_name_dict = {i: a for i, a in enumerate(players_arr)}
    data["HomeTeamID"] = data["HomeTeam"].apply(lambda s: players_name_to_id_dict[s])
    data["AwayTeamID"] = data["AwayTeam"].apply(lambda s: players_name_to_id_dict[s])

    match_times = torch.tensor(data["TimestampDays"].to_numpy(), dtype=torch.float64)
    match_player_indices = torch.tensor(data[["HomeTeamID", "AwayTeamID"]].to_numpy())

    # Encode full-time results as 0 = draw, 1 = home win, 2 = away win
    home_goals = torch.tensor(data["FTHG"].to_numpy())
    away_goals = torch.tensor(data["FTAG"].to_numpy())
    match_results = torch.where(
        home_goals > away_goals, 1, torch.where(home_goals < away_goals, 2, 0)
    )

    dataset = torch.utils.data.StackDataset(
        match_times=match_times,
        match_player_indices=match_player_indices,
        match_results=match_results,
    )

    return (
        dataset,
        players_id_to_name_dict,
        players_name_to_id_dict,
    )
We can load the dataset into a torch DataLoader
as follows:
(
    dataset,
    players_id_to_name_dict,
    players_name_to_id_dict,
) = download_data()

dataloader = torch.utils.data.DataLoader(dataset, batch_size=1)
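Each element the DataLoader yields is a dictionary of tensors. As a quick sanity check (not part of the original example), we can inspect the structure of a single batch:

batch = next(iter(dataloader))
for key, value in batch.items():
    print(key, value.shape, value.dtype)
# With batch_size=1 we'd expect match_times with shape (1,),
# match_player_indices with shape (1, 2) and match_results with shape (1,)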
Define the model
We'll define a likelihood function that maps the skills of the teams to the probabilities of a draw, home win or away win. Following Duffield et al., we'll use a simple sigmoidal model:

\[
p(y \mid x^h, x^a) =
\begin{cases}
\textbf{sigmoid}(x^h - x^a - \epsilon) & \text{home win}, \; y = 1, \\
1 - \textbf{sigmoid}(x^h - x^a + \epsilon) & \text{away win}, \; y = 2, \\
\textbf{sigmoid}(x^h - x^a + \epsilon) - \textbf{sigmoid}(x^h - x^a - \epsilon) & \text{draw}, \; y = 0,
\end{cases}
\]

where \(y\) is the match result, \(x^h\) and \(x^a\) are the (latent) skills of the home and away teams, \(\epsilon\) is a static parameter controlling the probability of a draw and \(\textbf{sigmoid}\) is a sigmoid function that maps real values to the interval \([0, 1]\). The three probabilities sum to one by construction, and the draw probability is non-negative for \(\epsilon \geq 0\) since the sigmoid is increasing.
In code:
epsilon = 1.0

def log_likelihood(params, batch):
    player_indices = batch["match_player_indices"]
    match_results = batch["match_results"]
    player_skills = params[player_indices]
    # epsilon widens the band of skill differences that produce a draw
    home_win_prob = torch.sigmoid(player_skills[:, 0] - player_skills[:, 1] - epsilon)
    away_win_prob = 1 - torch.sigmoid(
        player_skills[:, 0] - player_skills[:, 1] + epsilon
    )
    draw_prob = 1 - home_win_prob - away_win_prob
    # Columns ordered to match the result encoding: 0 = draw, 1 = home, 2 = away
    result_probs = torch.vstack([draw_prob, home_win_prob, away_win_prob]).T
    log_liks = torch.log(result_probs[torch.arange(len(match_results)), match_results])
    return log_liks, result_probs
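As a quick check, we can verify on a toy batch (hypothetical, not part of the original example) that the three probabilities sum to one:

toy_skills = torch.tensor([1.0, -0.5])  # two hypothetical team skills
toy_batch = {
    "match_player_indices": torch.tensor([[0, 1]]),  # team 0 at home vs team 1
    "match_results": torch.tensor([1]),  # home win
}
log_liks, result_probs = log_likelihood(toy_skills, toy_batch)
print(result_probs)  # one row of [draw, home win, away win] probabilities
print(result_probs.sum(dim=1))  # tensor([1.], dtype=torch.float64) up to rounding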
Extended Kalman time!
Now we'll run an extended Kalman filter to infer the skills of the teams sequentially over the matches.

Because the matches are not equally spaced in time, we'll use a transition_sd parameter that varies as we move through the matches. This means we can't use posteriors.ekf.diag_fisher.build, since that would fix a single global transition_sd; luckily, we can use posteriors.ekf.diag_fisher.init and posteriors.ekf.diag_fisher.update directly instead.
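Concretely, passing transition_sd = sqrt(transition_sd_scale² · Δt), as in the loop below, corresponds to a Brownian-motion-style random walk on the latent skills, where the variance injected between matches grows linearly with the elapsed time (the notation here is ours, reconstructed from the code rather than quoted from Duffield et al.):

\[
x_{t_k} \mid x_{t_{k-1}} \sim \mathcal{N}\!\left(x_{t_{k-1}}, \; \sigma^2 \, (t_k - t_{k-1}) \, I\right),
\]

where \(t_k\) is the time of the \(k\)-th match and \(\sigma\) is transition_sd_scale.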
transition_sd_scale = 0.1

num_teams = len(players_id_to_name_dict)
init_means = torch.zeros((num_teams,))
init_sds = torch.ones((num_teams,))
state = posteriors.ekf.diag_fisher.init(init_means, init_sds)

# Store the full trajectory of means and standard deviations for plotting later
all_means = init_means.unsqueeze(0)
all_sds = init_sds.unsqueeze(0)

previous_time = 0.0
for match in dataloader:
    match_time = match["match_times"]
    state = posteriors.ekf.diag_fisher.update(
        state,
        match,
        log_likelihood,
        lr=1.0,
        per_sample=True,
        # Scale the transition noise with the square root of the time gap
        transition_sd=torch.sqrt(
            transition_sd_scale**2 * (match_time - previous_time)
        ),
    )
    all_means = torch.vstack([all_means, state.params.unsqueeze(0)])
    all_sds = torch.vstack([all_sds, state.sd_diag.unsqueeze(0)])
    previous_time = match_time
We could also learn the transition_sd_scale parameter from data, but for
this example we'll simply set it to 0.1 and see how it looks.
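One simple way to choose transition_sd_scale from data (a hypothetical sketch, not part of the original example) would be a grid search scored by the one-step-ahead predictive log-likelihood, evaluating each match at the current posterior mean before updating on it:

def sequential_log_score(scale):
    # Re-run the filter from scratch with a candidate transition_sd_scale
    state = posteriors.ekf.diag_fisher.init(
        torch.zeros(num_teams), torch.ones(num_teams)
    )
    previous_time = 0.0
    total = 0.0
    for match in dataloader:
        # Score the match at the current posterior mean *before* updating on it
        log_liks, _ = log_likelihood(state.params, match)
        total += log_liks.sum().item()
        state = posteriors.ekf.diag_fisher.update(
            state,
            match,
            log_likelihood,
            lr=1.0,
            per_sample=True,
            transition_sd=torch.sqrt(
                scale**2 * (match["match_times"] - previous_time)
            ),
        )
        previous_time = match["match_times"]
    return total

best_scale = max([0.01, 0.03, 0.1, 0.3], key=sequential_log_score)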
Here we've stored the means and standard deviations of the skills of the teams at each time step, so that we can plot them. This is feasible since there are only 20 teams in the Premier League.
Plot the skills
We initialized the skills of the teams to zero, which is not realistic, so we'll use the 2021/22 season as a warm-up period and plot the skills of the teams in the 2022/23 season.
Code to plot the skills
# The first season acts as a warm-up; plot only the second season
last_season_start = (len(all_means) - 1) // 2
times = dataset.datasets["match_times"][last_season_start:]
means = all_means[last_season_start + 1 :]  # +1 skips the initial pre-match state
sds = all_sds[last_season_start + 1 :]

fig, ax = plt.subplots(figsize=(12, 6))
for i in range(num_teams):
    team_name = players_id_to_name_dict[i]
    if team_name in ("Arsenal", "Man City"):
        c = "skyblue" if team_name == "Man City" else "red"
        ax.plot(times, means[:, i], c=c, zorder=2, label=team_name)
        # Shade one standard deviation either side of the mean
        ax.fill_between(
            times,
            means[:, i] - sds[:, i],
            means[:, i] + sds[:, i],
            color=c,
            alpha=0.3,
            zorder=1,
        )
    else:
        ax.plot(times, means[:, i], c="grey", alpha=0.2, zorder=0, linewidth=0.75)
ax.set_xticklabels([])
ax.set_xlabel("2022/23 season")
ax.set_ylabel("Skill")
ax.legend()
fig.tight_layout()
Here we can see the inferred skills (with one-standard-deviation bands) for Arsenal and Manchester City, with the skills of the other teams in grey. The extended Kalman filter has inferred that Arsenal were the strongest team through most of the season, before fading towards the end of the season as Manchester City came through as champions. Note the vacant period mid-season when the 2022 World Cup took place.
Observe how the standard deviations increase between matches: this is natural, reflecting our growing uncertainty the further we move from a team's last match. As more matches are observed, the uncertainty shrinks again.
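This pattern follows from the random-walk transition above: assuming the diagonal EKF follows the standard Kalman predict/update recursion, the predict step inflates each team's marginal variance by the elapsed-time-scaled transition noise,

\[
\sigma_{k \mid k-1}^2 = \sigma_{k-1 \mid k-1}^2 + \sigma^2 \, (t_k - t_{k-1}),
\]

while each update conditions on a match result and so reduces the variance of the two teams involved.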
Note: The raw code for this example can be found in the repo at examples/ekf_premier_league.py.