Tuning an In-Game Win Probability Model Using xgboost

A few weeks ago, we started tweeting out the post-game win probability plots for every FBS college football game. They were a nice addition to our Saturdays, but there were some issues. Namely, we’d get a lot of plots like this:

Now, to be fair, this was a back-and-forth game that ended 70-56 in favor of Wake Forest, but the idea that one play can take you from an 80% win probability to a 10% win probability, and then back up again on the next play, makes little sense. So here’s a look at how I fixed it.

AUC vs. Log Loss

Thankfully, a lot of the heavy lifting had already been done for me by my brother, Drew. He had already put the model together once and selected the initial features and parameters to use. It was up to me to re-evaluate them and tune them to our needs.

The first thing I did was look at how we were evaluating our model. In every model, you’re trying to optimize some number to be as low or high as possible. In a linear model, maybe it’s RMSE, or Root Mean Square Error: the square root of the average squared difference between your predicted values and the actual values. However, an in-game win probability model is a classification model, which means we’re predicting a binary outcome (1 or 0, win or loss).

There are several ways to measure the success of a model. One way is something called AUC: Area Under the Receiver Operating Characteristic Curve. This is a good way to measure how well you’re sorting your predictions into the right or wrong bucket: it is the chance that a randomly chosen win gets a higher predicted probability than a randomly chosen loss. An AUC closer to 0 means you’re getting most of those comparisons wrong (saying a team will win when they actually lose), and an AUC closer to 1 means you’re getting most of them right. An AUC of 0.5 means you may as well be randomly picking winners and losers. The problem with this is that AUC is scale-invariant, meaning it only cares about the ordering of your predictions, rather than the value of the probability you assign to each one. As a result, you can get a model that’s overconfident in one direction or the other, immediately trying to classify each point as a 1 or a 0, when it should be giving a value closer to .5.
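To make the scale-invariance point concrete, here’s a minimal sketch in plain Python. The labels and predictions are made up for illustration, and the `auc` helper is my own pairwise implementation rather than anything from our pipeline. Both sets of predictions put the games in the same order, so they get the same AUC, even though the second set is wildly more confident.

```python
from itertools import product

def auc(labels, scores):
    """Share of (win, loss) pairs where the win got the higher score; ties count half."""
    wins = [s for y, s in zip(labels, scores) if y == 1]
    losses = [s for y, s in zip(labels, scores) if y == 0]
    total = 0.0
    for w, l in product(wins, losses):
        if w > l:
            total += 1.0
        elif w == l:
            total += 0.5
    return total / (len(wins) * len(losses))

# Hypothetical data: 1 = win, 0 = loss
labels = [1, 1, 0, 1, 0, 0]
modest = [0.70, 0.60, 0.55, 0.52, 0.45, 0.30]
# Same ordering as `modest`, but pushed toward 0 and 1
extreme = [0.99, 0.98, 0.97, 0.96, 0.03, 0.01]

print(auc(labels, modest), auc(labels, extreme))  # identical values
```

Because only the ordering matters, nothing in this number rewards the modest model for being closer to reality.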

Our original win probability model, the one that made the plot above, used AUC as the value it was trying to optimize. Here is what our first model looked like with those settings. On the x-axis are our predicted win probabilities, and on the y-axis are the actual average results of games at those points. We get this data by using a “leave one season out” method, where we train the model with data from 2014-2019, and then test it on 2020 and save the results of our predictions. Then, we train it again on data from 2015-2020, and test it on 2014. And so on, until every season has been held out once.
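The “leave one season out” loop can be sketched like this. It only shows how the seasons get split; the function name is hypothetical and the actual training and prediction steps are left out.

```python
def leave_one_season_out(seasons):
    """Yield (training seasons, held-out season) once for every season."""
    seasons = list(seasons)
    for held_out in seasons:
        train = [s for s in seasons if s != held_out]
        yield train, held_out

for train, test in leave_one_season_out(range(2014, 2021)):
    print(f"train on {train[0]}-{train[-1]} minus {test}, test on {test}")
```

Every play ends up with a prediction from a model that never saw its season, which is what makes the calibration plot an honest test.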

If our model were well-calibrated, then our plot should fall right along that perfect dotted line from 0 to 1, meaning that, for example, if we predicted teams to win 75% of the time, they actually won 75% of the time. In our case above, when we predicted a 75% win probability, teams’ actual win probabilities were closer to 50%! That’s not great! If we take the average error at each of these 1% intervals, then we’re off by an average of 7.1% with this method.
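Here’s a rough sketch of that average error calculation. I’m assuming predictions are grouped into 1% bins and that every populated bin counts equally; weighting bins by how many plays they contain would be another reasonable choice. The function name and sample data are just for illustration.

```python
from collections import defaultdict

def calibration_error(predictions, outcomes, n_bins=100):
    """Average, over populated bins, of |mean predicted probability - actual win rate|."""
    bins = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    errors = []
    for rows in bins.values():
        mean_pred = sum(p for p, _ in rows) / len(rows)
        win_rate = sum(y for _, y in rows) / len(rows)
        errors.append(abs(mean_pred - win_rate))
    return sum(errors) / len(errors)

# Four plays predicted at 75%, but the team only won two of them
print(calibration_error([0.75] * 4, [1, 1, 0, 0]))  # 0.25
```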

So let’s see if we can do any better simply by switching our evaluation metric. This time, we’ll use something called log loss. Now, the benefit of using log loss is that it actually matters how close your predictions are to the actual values, so the model is incentivized to make realistic predictions to minimize the difference between the actual outcome and the predicted one.
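As a quick illustration of why log loss rewards realistic predictions, here’s a small sketch using the standard binary log loss formula. The two “games” are made up: identical situations where one team won and one lost, so the honest answer is a coin flip.

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean binary log loss; probabilities are clipped so log(0) never happens."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [1, 0]                 # one win, one loss
honest = [0.5, 0.5]             # admits it is a coin flip
overconfident = [0.99, 0.99]    # certain of a win both times

print(log_loss(labels, honest))         # about 0.69
print(log_loss(labels, overconfident))  # about 2.31
```

The overconfident model gets one game almost exactly right, but the penalty for being 99% sure and wrong on the other swamps that gain.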

Unfortunately, changing the evaluation metric to log loss alone was not enough to improve the model significantly. Instead, I had to start tuning the parameters.

Tuning Parameters

Yes, the dreaded parameters. These are the little settings that tell your model how to run, and give it some guidelines to work within. They can be scary, though, because sometimes you don’t know what they even mean! Take, for example, some of the parameters in xgboost, the machine-learning model that we’re using to make our predictions:

eta
gamma
subsample
colsample_bytree
max_depth
min_child_weight

I’m already intimidated! Luckily, the documentation for xgboost (which stands for Extreme Gradient Boosting) is helpful in that it basically tells you “making eta higher makes your model more aggressive and less conservative”. So you can really just play with those values until you find the right balance. And in our case, we want a more conservative model, because right now it’s overfitting, resulting in wild swings between overpredicting and underpredicting. I also took a little bit of guidance from our professional counterpart, the nflfastR model, to see what they did with their parameters, and then optimized for our data and our limitations to find a happy middle between our original model and our new one.

One of the first things I am going to try is reducing the max_depth parameter down to 5. xgboost creates decision trees, and at each node, there is a split based on some feature of the data to decide how to classify each value. In our original model, our max_depth was a whopping 20! That is way too high for a model that only includes 12 features. So, I’ll lower it and see how it does.

That’s looking a whole lot better already! By limiting how many decision points each tree can stack up, we’re making our model more conservative and relying on the most important features only. This time, our average error is down to just 1.6%! That’s about a 77% improvement from the first model (7.1% down to 1.6%), just by reducing the tree depth.
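For reference, the two setups compared so far could be written as parameter dictionaries like the ones below. This is only a sketch: `objective`, `eval_metric`, and `max_depth` are real xgboost parameter names, but the helper function is hypothetical, and I’m leaving out eta, gamma, subsample, and the rest, so they would fall back to the library defaults rather than the values we actually used.

```python
def build_params(max_depth, eval_metric):
    """Parameter dict in the shape xgboost expects for a win/loss model."""
    return {
        "objective": "binary:logistic",  # predict a probability between 0 and 1
        "eval_metric": eval_metric,      # "auc" or "logloss"
        "max_depth": max_depth,          # maximum depth of each tree
    }

original = build_params(max_depth=20, eval_metric="auc")
tuned = build_params(max_depth=5, eval_metric="logloss")

print(original)
print(tuned)
```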

However, there are some other things we need to look at with any win probability model. One key thing our model has struggled with is figuring out when the game is over, and setting the win probability all the way to 100% or 0%. If that doesn’t happen, then you get some weird looking graphs because a team has won the game, but the win probability might be at 75% on the last play of the graph. Not only is this wrong, but it just looks bad as well. So let’s see how this one has done with that task.

So when we apply the model to a testing set of all games from the 2021 college football season, 86% of them finish fairly close to 0 or 100%. That’s not terrible, but I think we should try to do better. Our original model, which was more aggressive and used AUC as the evaluation statistic, was up around 94% of all games, so we have some room for improvement. Our conservative model has helped overall, but has made the end-of-game predictions too conservative.
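Here’s a sketch of how that end-of-game check could be computed. I’m assuming “fairly close” means the final home win probability is within .1 of the actual result (1 for a home win, 0 for a home loss); the function name and sample games are made up.

```python
def share_finishing_decided(final_plays, tolerance=0.1):
    """final_plays: (final home win probability, home_won) pairs, one per game."""
    decided = [
        1 for wp, home_won in final_plays
        if abs(wp - home_won) <= tolerance
    ]
    return len(decided) / len(final_plays)

games = [
    (0.98, 1),  # home team won, model agrees
    (0.03, 0),  # home team lost, model agrees
    (0.75, 1),  # right side, but not close enough to 100%
    (0.92, 0),  # confidently wrong
]
print(share_finishing_decided(games))  # 0.5
```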

A lot of this is trial and error. Make some changes, retrain and retest the new model, and see how your results look. So I’ll spare you the trial and error part and jump to the results that worked out the best.

Feature Engineering

A lot of the effort around improving the end-game results came from improving our features that the model uses. One of those key features is our end-of-game row in each game of play-by-play data.

Set Custom End of Game Values

So one thing we do to try to nudge the model in the right direction is manually force all the variables to certain values when the game is over. For instance, instead of the game ending on 3rd and 9, we add a “fake” play at the end of the game and set both the down and the yards to go to -10. The idea is that this would be distinctly different from every other play, and so the model would realize that it’s the end of the game, and one team should be named a winner now.

However, we had originally set all of these values to -10: Yards to Goal, Down, Distance, Quarter, Home Timeouts, Away Timeouts, etc. So I wondered if this may be throwing the model off, since downs are usually between 1 and 4, and on this one play of the game, the down is -10, and the model has no idea what to do with that. So instead, I changed these end-game values to be more logical. I set the quarter to 20, guaranteeing that it’s the highest quarter in the data (each overtime in our data counts as a quarter). Here are the rest of the values I settled on.

Home Timeouts = 0
Away Timeouts = 0
Clock Remaining in Seconds = -10,000 (In overtime, the clock becomes negative, so this ensures that it’s lower than any overtime game)
Down = 5
Distance = 100
Yards to Goal = 100
Home Possession Flag = 1

I made this update and reran our model, and sure enough, we saw good results. Our end-game win probabilities improved, with 94% of games now within .1 of 1 or 0, and so did our overall model performance, where we now had an average calibration error of .011 (1.1%).

One other addition I’d like to make to this part is with the possession flag. I’m guessing that having possession improves your chances of winning slightly, so why not give possession to the team that won the game, to give the model an extra hint at who won (outside of the literal score, of course)?

After making this change, it didn’t really help the calibration, but it did improve the end-game results, so that now 96.7% of games were classified correctly in the end. That’s about as good as it gets!
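Putting the end-of-game values together, the “fake” final play could be built something like this. The column names are made up for the example (our real data uses different ones), and the possession flag goes to whichever team won.

```python
def end_of_game_row(home_score, away_score):
    """Build the synthetic final play that tells the model the game is over."""
    return {
        "quarter": 20,                       # higher than any real quarter or overtime
        "home_timeouts": 0,
        "away_timeouts": 0,
        "clock_remaining_seconds": -10_000,  # lower than any overtime clock value
        "down": 5,
        "distance": 100,
        "yards_to_goal": 100,
        # Hint at the winner: possession goes to the team that won
        "home_possession_flag": 1 if home_score > away_score else 0,
        "home_score": home_score,
        "away_score": away_score,
    }

print(end_of_game_row(home_score=56, away_score=70))
```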

Adding In Spread

One last thing I want to include is the pregame spread. This could be helpful because the spread itself is derived from a lot of knowledge, both from the sportsbook and from the betting public. We already use Elo win probability as a factor in our model, but this could help even more.

However, after including the spread, I found that our model actually got worse! How could this be? Well, for one, the spread is being joined to every single play of the game, whether it’s the first play or the last. And once the game gets going, the actual score and how the teams are playing on that day matter a heck of a lot more than the pregame spread. So how do we deal with this reality, while still including this relevant data?

Luckily, I read up on how nflfastR does it, and they gave me a great idea. They use a time-decaying function to decrease the value of the spread as the game goes on. Now, I don’t know exactly how they did that, but I gave it a shot in my own way and got some good results. What I landed on was this function to transform the spread throughout the game:

new_spread = spread * (clock_in_seconds/3600)^3

So, for example, say the spread is -7. At the start of the game there are 3600 seconds left on the clock, so on the first play of the game the spread value is -7 * (3600/3600)^3 = -7. At halftime, the spread value will be -7 * (1800/3600)^3 = -.875.

So as you can see, the spread rapidly becomes less and less important as the game goes on (strictly speaking, it shrinks with the cube of the time remaining rather than exponentially). The reason I chose the third power instead of squaring is because I tested both and found that the third power was the sweet spot where all evaluation metrics were the best.
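Here’s the same transformation as a small Python function, with the worked numbers from above. One thing I’ve added that isn’t in the formula: the clock is clamped at 0, since the clock goes negative in overtime and cubing a negative number would flip the sign of the spread. Treat that clamp as an assumption, not a description of the real pipeline.

```python
def decayed_spread(spread, seconds_remaining, power=3, game_seconds=3600):
    """Scale the pregame spread down as regulation time runs out."""
    # Assumption: treat overtime (negative clock) as zero time remaining
    fraction_left = max(seconds_remaining, 0) / game_seconds
    return spread * fraction_left ** power

print(decayed_spread(-7, 3600))  # -7.0 on the first play
print(decayed_spread(-7, 1800))  # -0.875 at halftime
print(decayed_spread(-7, 900))   # about -0.109 entering the 4th quarter
```

Swapping `power=2` in for the default is all it takes to compare squaring against cubing.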

So after this last round of changes, our average calibration error was down to .0096, and our accurately classified end-games were up to 97.5%! I’d say that’s a big win compared to where we started!

Using Play Number Instead of Clock

A major source of headaches with many of our plots is the game clock. Unfortunately, a lot of the clock data from ESPN’s play-by-play is incorrect. Each week, about half of games have bad clock data: several plays having the same time on the clock, an entire quarter of plays being marked as 0:00, and so on. This makes it very difficult for our model when this happens, because it gets confused why the clock isn’t moving.

Initially, we resolved this by removing any extra plays that happened after the first one at any given timestamp. This worked decently, but it makes the graphs less attractive because it may jump from a play early in the 1st quarter to late in the 1st quarter, and you miss all the information of what happened in between.

So, to try to tackle this issue, we instead use play number, or, more appropriately, the percentage of the game done. We calculate the total plays in the game, and then for each play calculate the percentage of the game done. This is then used in the model in place of the clock. The limitation of this is that if we ever wanted to do a live in-game win probability, it would be difficult because the model doesn’t know how many plays are left in a game that hasn’t finished yet, unlike with the clock, where there is a set number of minutes in a game.
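The percentage-of-game-done feature is simple enough to sketch in a couple of lines. I’m assuming plays are numbered from 1 so that the final play comes out at exactly 100%; starting from 0 would work just as well as long as it’s consistent.

```python
def percent_of_game_done(total_plays):
    """Return the share of the game completed after each play, in play order."""
    return [play_number / total_plays for play_number in range(1, total_plays + 1)]

# A tiny four-play "game" for illustration
print(percent_of_game_done(4))  # [0.25, 0.5, 0.75, 1.0]
```

Note that `total_plays` is only known once the game is over, which is exactly the live-game limitation described above.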

The results from this change were remarkable. Not only did we get smoother graphs that included every play of the game, but we improved the performance of our model by .002, from .0096 down to .0076 average calibration error, a 21% improvement! On top of that, we classified 98.3% of end-games correctly, a new best for us! Below, we’ll look at some of the before and after plots side-by-side.

Improved Plots

So now that our new model is ready, let’s test it out and see how it handles some of those games it struggled on before.

Compared to the wild swings from the initial graph, this one is much more reserved and doesn’t overreact to individual scores by as much. The win probability also approaches 0% for Army by the end of the game, which is another good sign.

This Washington State-Arizona State game had missing clock data, and an ugly spike right at the end of the game, when Arizona State scored a garbage-time touchdown. The switch to play number resolved our concerns over the clock data, and the inclusion of a simple indicator saying whether a play was a kickoff or not helped the model realize that after ASU scored that garbage-time touchdown, they’d have to give the ball back to Washington State again who would run the clock out.

Here’s another game with bad clock data, and you can see that the new model helps smooth out the movement of the win probability and remove some of the jaggedness that we see on the left-hand side.

Here’s the 9-overtime game between Illinois and Penn State. And while our model still hasn’t been tuned to understand the new CFB overtime rules and the back-and-forth go-for-2 attempts, which are surely confusing it, it’s at least a lot better than what happened in our original clock-based model.

So that’s it! We have our new model and it’s ready to go. We can use it to post win probability charts like you see above, or calculate win probability added (WPA), which is just the difference between one play and the next in terms of win probability. Then, we could look at teams on, let’s say, 4th downs only, and see what their average win probability added is on those plays. That could give you insight into which teams are making good decisions on 4th downs, and which teams are not.
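As a sketch of that 4th-down idea, here’s how WPA and its average on a given down could be computed. The play records and field names are invented for the example, and I’m assuming each play carries the win probability before and after it.

```python
def win_probability_added(play):
    """WPA for one play: win probability after the play minus before it."""
    return play["wp_after"] - play["wp_before"]

def average_wpa_on_down(plays, down=4):
    """Average WPA across plays on the given down, or None if there are none."""
    deltas = [win_probability_added(p) for p in plays if p["down"] == down]
    return sum(deltas) / len(deltas) if deltas else None

plays = [
    {"down": 1, "wp_before": 0.50, "wp_after": 0.52},
    {"down": 4, "wp_before": 0.52, "wp_after": 0.60},  # converted a 4th down
    {"down": 4, "wp_before": 0.40, "wp_after": 0.34},  # failed a 4th down
]
print(average_wpa_on_down(plays))  # about 0.01
```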
