Kyle Bennison

  • Why Don’t NFL Teams Pay NBA Players to Block Kicks?

    Why Don’t NFL Teams Pay NBA Players to Block Kicks?

    Recently I listened to an episode of Freakonomics Radio where they talked about specialization in the NFL. Specifically, they were talking about how the long snapper position has become something that NFL teams specifically draft for and pay upwards of $1 million a year to have on their 53-man roster. This got me wondering: what other positions are ripe for specialization in the NFL?

    This train of thought brought me to the field-goal unit. Not the kicker, holder, or snapper, but the defense. What if teams hired the tallest guy money could buy to do one thing: block kicks? It’s so obvious that I assumed there must be a reason teams aren’t doing it, or at least trying it. This sent me down a rabbit-hole of heights and wingspans, standing reaches, verticals, salary caps, and something I’m not very good at: geometry.

    So why don’t NFL teams pay a really tall guy to block field goals for them?

    Is Blocking A Kick Just By Being Tall Even Possible?

    My assumption for hiring a tall guy to block kicks is that he wouldn’t need to work hard to do it. In my experiment, he doesn’t even cross the line of scrimmage. He just stands there, arms raised, and jumps to try to swat the ball down. So is this even a feasible strategy? How high does the ball travel over the line of scrimmage? Obviously kicks get blocked, but usually the defender has some penetration. What if he was at the line of scrimmage?

    Thankfully I didn’t have to break out the geometry textbooks to figure this one out. I used this nifty website to calculate exactly how high the ball would be above the line of scrimmage at different kick angles. The assumption here is that the ball is on a straight upward trajectory as it crosses the line of scrimmage, and that the kick is taken from 7 yards out. Here are some heights at various kick angles.

    Angle (Degrees)    Height at Line of Scrimmage (Feet)
    30                 12.1
    35                 14.7
    40                 17.6
    45                 21.0
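
    For the curious, the trigonometry is a one-liner: the kick is taken 7 yards (21 feet) behind the line of scrimmage, so the height at the line is 21 feet times the tangent of the launch angle. A quick sketch in R reproduces the table:

    # Height of the ball at the line of scrimmage, assuming a straight-line
    # trajectory from a kick taken 7 yards (21 feet) behind the line
    angles_deg <- c(30, 35, 40, 45)
    height_ft <- 21 * tan(angles_deg * pi / 180)
    round(height_ft, 1)
    # [1] 12.1 14.7 17.6 21.0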

    According to one study, the optimal launch angle for a kick is between 38 and 45 degrees, as this will maximize distance. Thus, we can assume that kickers are going to be aiming for that angle. Of course, human error and misjudgment will mean that some kicks go below or above that ideal range.

    Graph from the University of Nebraska study which found that even at different kick speeds, 45 degrees was still the optimal angle.

    So according to our geometry and assumptions about the ideal angle, most balls will be flying over the line of scrimmage somewhere between 16 and 21 feet in the air, far out of the reach of even the tallest NBA players. But at launch angles of 35 degrees or lower, we have a glimmer of hope, maybe.

    Let’s take a famous big guy in the NBA: Boban Marjanovic. The 7’4″ center has a 7’10″ wingspan and a 10’2.5″ standing reach. He’s one of the few NBA players able to dunk the ball without even jumping. With just a 23″ vertical leap, his overall potential reach is 12’1.5″. A truly enormous human being; it’s hard to comprehend. And yet he’s barely tall enough to block a 30° kick, and that’s with a perfectly-timed jump.

    With that being said, it’s highly probable that Marjanovic gets his hand on at least one kick during the season, and if he can make it even half a yard toward the kicker, he takes the vertical height down a cool foot, making it that much easier.

    There were just 14 field goals blocked in 2021, out of 1066 field-goal attempts, meaning just 1.3% of all field-goal attempts at any distance were blocked. Honestly, that’s higher than I was expecting. I think it’s reasonable to assume that our tall guy could attain at least that rate, and maybe half a percentage point to a full point higher, over the course of a season.

    There would also likely be psychological effects on the kicker from having such a large human towering over the line of scrimmage, and the kicker might subconsciously put unnecessary height on the ball, decreasing its horizontal travel. This would be especially apparent early on, before kickers have gotten used to seeing this type of player in front of them.

    One other logistical challenge in blocking a kick is that it’s not a given, even if the ball is low enough, that your center gets a hand on the ball. What kind of reaction time would be necessary to accurately block a kick from the line of scrimmage?

    According to a highly cited paper, a field goal travels at around 19-22 m/s, or 21-24 yards/s, meaning the ball will reach the line of scrimmage in approximately .33 seconds (7 yards at the low end of that speed range). That’s about 80ms longer than the average human reaction time of roughly 250ms, so our NBA center could reasonably have some reaction to the ball in that time. How much is another story, but with good positioning I don’t hate our chances.
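
    Here’s that back-of-the-envelope timing check in R (the 250ms figure is my assumption for a rough average reaction time):

    # Time for the ball to travel 7 yards to the line of scrimmage
    ball_speed_yds_per_s <- 21                 # low end of the cited 21-24 yards/s
    time_to_line <- 7 / ball_speed_yds_per_s   # ~0.33 seconds
    reaction_time <- 0.25                      # rough average human reaction time
    time_to_line - reaction_time               # ~0.08 s left to actually react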

    So all in all, I think we can safely say it’d be difficult, but not impossible, for a big guy to block kicks from the line of scrimmage, and that over the course of a season, he’d probably block one or two and influence many more.

    The Existing Arguments Against Having A Designated Tall Guy

    I did a quick Google search and didn’t see any in-depth statistical analysis or mathematical justifications on the subject. The most common arguments I found on some Reddit threads against the idea were “roster spots are limited” and “roster spots are valuable and blocked kicks are not”. However, we already know that teams are willing to dedicate a whole roster spot to a guy who snaps the ball 8-12 times a game. And he’s not even scoring… they pay him to prevent a slip-up that costs them points; my guy would be paid to actively prevent the other team from scoring.

    My own kin, Drew, offered another argument:

    "I also wonder if having someone like that would cause more fakes because it kind of takes him out of play as a defender since those guys usually can't run"

    Aside from some hurtful accusations he’s making about the speed and agility of big guys, he makes a point. This could be an unintended side-effect of implementing this kind of player. We see this with all elements of the game: someone innovates, and then a few years later the league adapts and the competitive advantage largely disappears. But for the sake of my argument, I’m going to assume that teams wouldn’t immediately be able to exploit this tactic just because there’s one tall guy on the field.

    So that left a few questions: how valuable is a blocked kick, and how much should teams be willing to pay for one?

    How Valuable Is A Blocked Kick?

    In 2021, NFL teams had an average total cap of around $187 million and scored 391 points on average, so they paid about $478,000 per point last season. This is one way to value blocked kicks. By blocking a kick, you’re preventing 3 points (or maybe 2.61 points, if you discount by the median 2021 FG% of 87%). You could also argue that a blocked kick sets the offense up with a better chance to score, and that that change in expected points should be attributed to the blocker, but let’s keep this simple for now. So each blocked kick could be valued at approximately $1.434 million (3 points * $478k per point). I’m assuming that the going rate for a point prevented is the same as the rate for a point scored, which you could argue isn’t the case, since defensive players aren’t paid the same as offensive players.

    Another way to look at it is by cost per win. The median cap spend per win in 2021 was $21 million, and the median points required to win a game was 29. That gives you a cost of $724k per point, if we’re after wins (which most teams are), and an even greater value for a blocked kick: about $2.2 million!
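
    For transparency, here’s the arithmetic behind both valuations as an R sketch:

    # Valuation 1: cap dollars per point scored
    cap_per_team <- 187e6                              # average 2021 cap spend
    points_per_team <- 391                             # average 2021 points scored
    cost_per_point <- cap_per_team / points_per_team   # ~$478k per point
    cost_per_point * 3                                 # ~$1.43M per 3-point block

    # Valuation 2: cap dollars per point, by way of wins
    cap_per_win <- 21e6                                # median 2021 cap spend per win
    points_per_win <- 29                               # median points needed to win
    win_cost_per_point <- cap_per_win / points_per_win # ~$724k per point
    win_cost_per_point * 3                             # ~$2.2M per block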

    What NBA Players Could NFL Teams Afford?

    Much to my satisfaction, we could actually afford our Serbian center Boban at his modest $3.5 million per year. Using our second valuation, he would only have to block two kicks all year to be worth the investment. If we use our points-based valuation, it would be 3 kicks blocked over the course of a season.

    A younger, more athletic Mo Bamba would cost NFL teams a premium at $7.5 million per year, so he’d need a higher output to justify the cost. At that rate, we may even ask him to rush the kicker a bit or maybe even get in for some corner fade routes in the red zone.

    At 7’1″ and only $2.1 million a year, Bol Bol may be a great investment for some NFL team looking to make a splash on special teams. But wait, could these guys actually block that many field goals in a season?

    There are usually a couple of field-goal attempts per game (the top 32 kickers averaged 1.93 attempts per game), and the average team attempted 32.1 field goals in 2021. Using that 1.3% block rate, we would expect the average team (not individual) to block .42 field goals per year. If you’re paying $724k per point, that means you’d only want to pay a designated kick-blocker $304k per year, since that’s likely to be his output. Given the NFL salary cap minimum for 2022 is $705k, that’s not feasible.

    Let’s assume our tall guy makes your team twice as good at blocking kicks, from a 1.3% to 2.6% block rate. Now, he’s still blocking under 1 kick per season at .83. Maybe you can afford the league minimum at this point, but that’s it.

    At this point, I’m feeling a little deflated. I think there’s a reason nobody is trying this. Blocked kicks are rare, and kickers kick the ball high above the line of scrimmage most of the time. So unless you can find a 7-footer who’s also fast, can rush the kicker a bit, and doesn’t cost a fortune, it’s probably not worth it over guys who can contribute to the team in multiple ways. However, if you find a 7-footer who can also stand in the end zone and cherry-pick some passes, then this argument changes a lot.

  • StatRdays: The Easiest Model You’ll Ever Make

    StatRdays: The Easiest Model You’ll Ever Make

    This year, we’ve been participating in the CollegeFootballData.com prediction contest, where each week you predict the actual spread of every game. You’re judged on a variety of factors, like your outright picks, picks against the spread, and absolute and mean-squared error. Two areas where we are performing very well (1st and 2nd, respectively) are MSE and absolute error. Today, I’ll show you how we do that.

    Getting the Data

    We’ll be using R today, and we’ll be getting all of our data from the collegefootballdata.com API. This part requires you to have your own API key from CFB Data; if you don’t have one yet, you can get one here. You’ll also need to store your key somewhere R can find it. I recommend your .Renviron file, which you can probably find in your Documents folder on Windows.

    Then, you’ll want to edit it in Notepad and add your key like this:

    cfbd_staturdays_key = "yoursecretkey"
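
    After editing .Renviron, restart R so the new variable is picked up. You can then read the key into a variable yourself; the betting-lines loop below assumes a my_key variable set this way:

    # Read the CFBD API key from .Renviron (restart R after editing the file)
    my_key <- Sys.getenv("cfbd_staturdays_key")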
    

    Now, you should be all set to do the rest of the exercise. Let’s first grab a few functions that will help us get the data from the API.

    # Load in required functions
    source("https://raw.githubusercontent.com/kylebennison/staturdays/master/Production/source_everything.R")
    

    Next, let’s get the initial data. We’ll need the games, elo ratings, and betting data.

    # Get historic Elo data from Staturdays
    elo <- get_elo(2013, 2021)
    
    # Get games data from CFBdata
    games <- get_games(2013, 2021)
    
    # Get historic betting data from CFBdata, one season at a time
    betting.master <- data.frame()
    for (j in 2013:2021) {
      message("Doing year ", j)
      betting_url <- paste0("https://api.collegefootballdata.com/lines?year=", j)
      full_url_betting_encoded <- URLencode(betting_url)
      betting <- cfbd_api(full_url_betting_encoded, my_key)
      betting <- as_tibble(betting)
      # Each game has one row per sportsbook, nested in the lines column
      betting <- unnest(betting, cols = c(lines))
      betting.master <- rbind(betting.master, betting)
    }
    

    Cleaning the Data

    We’ll have to do some cleanup to get the data ready to use in a model. First, we’ll average the lines in the betting file: each game has lines from multiple providers, so we’ll take the mean across all of them.

    Next, we’ll create a new field called “join_date” in our Elo file. Each Elo rating is calculated after that week’s game finishes, so we want to join each rating to the team’s following game.

    Then, we’ll join all three tables (games, elo, and betting) together.

    # Need to summarise lines for teams with multiple lines
    betting_consensus <- betting.master %>% 
      mutate(spread = as.double(spread),
             overUnder = as.double(overUnder)) %>%
      group_by(id, season, week, homeTeam, awayTeam,
               homeConference, awayConference, homeScore, awayScore) %>% 
      summarise(consensus_spread = mean(spread, na.rm = TRUE),
                consensus_over_under = mean(overUnder, na.rm = TRUE),
                consensus_home_ml = mean(homeMoneyline, na.rm = TRUE),
                consensus_away_ml = mean(awayMoneyline, na.rm = TRUE))
    
    e2 <- elo %>% 
      group_by(team) %>% 
      mutate(join_date = lead(date, n = 1L, order_by = date))
    
    games_elo <- games %>% 
      mutate(start_date = lubridate::as_datetime(start_date)) %>% 
      left_join(e2, by = c("start_date" = "join_date",
                           "home_team" = "team")) %>% 
      left_join(e2, by = c("start_date" = "join_date",
                           "away_team" = "team"),
                suffix = c("_home", "_away"))
    
    games_elo_lines <- games_elo %>% 
      inner_join(betting_consensus, by = "id")
    

    Doing Some Calculations

    Ok, we’ve cleaned everything up and joined it together. Now, we need to do some calculations. Mainly, we want to know the difference in Elo between the home and away teams, since we’ll use this as a feature in our model later. We’ll also want to calculate the final actual spread of the game, and this will be our response variable: the variable we’re trying to predict.

    ge2 <- games_elo_lines %>%
      mutate(home_elo_adv = elo_rating_home + 55 - elo_rating_away,
             final_home_spread = away_points - home_points)
    

    We’re including a 55-point home-field advantage in the Elo advantage calculation, which previous testing identified as the best value.

    Let’s look at the relationship between Elo and the final spread.

    ge2 %>% 
      ggplot(aes(x = home_elo_adv, y = final_home_spread)) +
      geom_point(alpha = .1, color = staturdays_colors("light_blue")) +
      geom_smooth(color = staturdays_colors("orange")) +
      staturdays_theme +
      theme(panel.grid.major = element_line(color = "lightgrey")) +
      labs(title = "Elo and Spread 2013-2021",
           subtitle = "Elo advantage includes built-in home-field advantage worth around 3 points",
           x = "Home Elo Point Advantage/Disadvantage",
           y = "Home Win/Loss Point Margin")
    

    Remember that a negative spread means the home team won by that amount. So as the home team’s Elo advantage increases, the spread moves further in the home team’s favor (more negative). There is a lot of deviation, but the relationship is clearly linear, so we should be able to model this, and we’ll use a linear regression model.

    Now, CFB Data recently released their own Elo model, and while it’s fairly similar to Staturdays’, it differs in a few decisions and assumptions. Rather than be picky, I’m going to include them both, on a wisdom-of-the-crowd theory that two opinions are better than one. That said, more variables don’t always mean a better model: if two variables are highly correlated, including both can actually throw off your model and make it worse, and unhelpful variables add noise. If I included the time of kickoff as a variable, it might confuse the model more than help it, because it might latch onto some spurious correlation that has nothing to do with the teams or their skill and everything to do with random chance.

    # Include CFB Data's elo as well.
    ge3 <- ge2 %>% mutate(alt_elo_adv = home_pregame_elo - away_pregame_elo)
    

    Ok, we’re ready to build our model.

    Building the Linear Regression Model

    model_spread <- lm(final_home_spread ~ home_elo_adv + alt_elo_adv + consensus_spread, ge3)
    
    summary(model_spread)
    

    So the lm() function builds a linear model, and the formula syntax says “predict final_home_spread using home_elo_adv, alt_elo_adv, and consensus_spread, from the ge3 dataset”. Here are the results.

    So we have an R-squared of .47, which means 47% of the variation in spread can be explained by our model. That’s not great, but it’s certainly a start. The consensus spread is the only variable that is significant at the 5% significance level, but that doesn’t mean we need to exclude our other variables; we’d want to compare the results against a model that excludes Elo and see which performs better. For now, we’ll leave it as is.

    If you really wanted to stress-test your model’s validity in the real world, you could train and test it, using a holdout set of data. We’ve skipped this because we’re just trying to build a model here, and not necessarily test and optimize it right now.
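
    Here’s a minimal sketch of what that holdout test could look like, holding out the 2021 season (note the season column may come through with a suffix like season.x after the joins, depending on your data):

    # Train on 2013-2020, hold out 2021 as unseen test data
    train <- ge3 %>% filter(season < 2021)
    test <- ge3 %>% filter(season == 2021)
    
    holdout_model <- lm(final_home_spread ~ home_elo_adv + alt_elo_adv + consensus_spread, train)
    
    # Compare predictions to the actual spreads of the held-out games
    preds <- predict(holdout_model, newdata = test)
    mean(abs(preds - test$final_home_spread), na.rm = TRUE)  # mean absolute error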

    Saving and Using The Model

    Now that we have the model, we can apply it to new data to get predictions.

    If you want to save a model for use another time, you can save it to a .rds file.

    saveRDS(model_spread, file = "Production Models/elo_combo_spread_model_v2.rds")
    

    To apply this model, we’d need to rerun all the code above, but only pull data from 2021 and look at the games coming up this week. Then, we’d run this code to make our spread predictions:

    # Read in the model we saved earlier
    model_spread <- readRDS(file = "Production Models/elo_combo_spread_model_v2.rds")
    
    # Predict the spread using our model
    ge3$predicted_spread <- predict(model_spread, newdata = ge3)
    

    And there you have it. R will use your model and the input variables in your ge3 data to predict the final spread of the game!

    From here, we could try to include more relevant variables that might help improve our model, or we could try a different model type altogether, like a decision tree, to see if that helps predict spreads more accurately.

  • Tuning an In-Game Win Probability Model Using xgboost

    Tuning an In-Game Win Probability Model Using xgboost

    A few weeks ago, we started tweeting out the post-game win probability plots for every FBS college football game. They were a nice addition to our Saturdays, but there were some issues. Namely, we’d get a lot of plots like this:

    Now, to be fair, this was a back-and-forth game that ended 70-56 in favor of Wake Forest, but the idea that one play can bring you from an 80% win probability down to a 10% win probability, and then back up again on the next play, makes little sense. So here’s a look at how I fixed it.

    AUC vs. Log Loss

    Thankfully, a lot of the heavy lifting had already been done for me by my brother, Drew. He had already put the model together once and selected the initial features and parameters to use. It was up to me to re-evaluate them and tune them to our needs.

    The first thing I did was look at how we were evaluating our model. In every model, you’re trying to optimize some number to be as low or as high as possible. In a linear model, maybe it’s RMSE, or Root Mean Square Error: the average difference between your predicted values and the actual values. An in-game win probability model, however, is a logistic model, which means we’re predicting a binary outcome (1 or 0, win or loss).

    There are several ways to measure the success of a model. One is AUC: Area Under the Receiver Operating Characteristic curve. This is a good way to measure how well you’re classifying your predictions into the right or wrong bucket. An AUC closer to 0 means you’re classifying most of your points incorrectly (saying a team will win when they actually lose), and an AUC closer to 1 means you’re classifying most of your points correctly. An AUC of .5 means you may as well be randomly picking winners and losers. The problem is that AUC is scale-invariant: it only cares about the classification, not the value of the probability you assign to it. As a result, you can get a model that’s overconfident in one direction or the other, immediately trying to classify each point as a 1 or a 0 when it should be predicting a value closer to .5.

    Our original win probability model, the one that made the plot above, used AUC as the value it was trying to optimize. Here’s what our first model looked like with those settings. On the x-axis are our predicted win probabilities, and on the y-axis are the actual average results of games at those points. We get this data by using a “leave one season out” method: we train the model on data from 2014-2019, then test it on 2020 and save the results of our predictions. Then we train it again on data from 2015-2020 and test it on 2014. And so on.

    If our model were well-calibrated, the plot would fall right along that perfect dotted line from 0 to 1, meaning that, for example, if we predicted teams to win 75% of the time, they actually won 75% of the time. In our case above, when we predicted a 75% win probability, teams actually won closer to 50% of the time! That’s not great! If we take the average error at each of these 1% intervals, we’re off by an average of 7.1% with this method.

    So let’s see if we can do any better simply by switching our evaluation metric. This time, we’ll use something called log loss. Now, the benefit of using log loss is that it actually matters how close your predictions are to the actual values, so the model is incentivized to make realistic predictions to minimize the difference between the actual outcome and the predicted one.
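
    For reference, the log loss of a single prediction p of an outcome y (1 or 0) is -(y * log(p) + (1 - y) * log(1 - p)), averaged over every play. A small R version makes the incentive obvious:

    # Log loss heavily penalizes confident predictions that turn out wrong
    log_loss <- function(actual, predicted, eps = 1e-15) {
      p <- pmin(pmax(predicted, eps), 1 - eps)  # keep p strictly between 0 and 1
      -mean(actual * log(p) + (1 - actual) * log(1 - p))
    }
    
    log_loss(actual = 1, predicted = 0.9)  # ~0.105, small penalty
    log_loss(actual = 1, predicted = 0.1)  # ~2.30, large penalty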

    Unfortunately, changing the evaluation metric to log loss alone was not enough to improve the model significantly. Instead, I had to start tuning the parameters.

    Tuning Parameters

    Yes, the dreaded parameters. These are the little settings that tell your model how to run, and give it some guidelines to work within. They can be scary, though, because sometimes you don’t know what they even mean! Take, for example, some of the parameters in xgboost, the machine-learning model that we’re using to make our predictions:

    • eta
    • gamma
    • subsample
    • colsample_bytree
    • max_depth
    • min_child_weight

    I’m already intimidated! Luckily, the documentation for xgboost (which stands for Extreme Gradient Boosting) is helpful in that it basically tells you things like “making eta higher makes your model more aggressive and less conservative”. So you can really just play with those values until you find the right balance. In our case, we want a more conservative model, because right now it’s overfitting, resulting in wild swings between overpredicting and underpredicting. I also took a little guidance from our professional counterparts’ model, nflfastR, to see what they did with their parameters, and then optimized for our data and our limitations to find a happy medium between our original model and our new one.

    One of the first things I tried was reducing the max_depth parameter down to 5. xgboost builds decision trees, and at each node there’s a split on some feature of the data that decides how to classify each value. In our original model, our max_depth was a whopping 20! That is way too high for a model that only includes 12 features. So I lowered it and checked how it did.
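
    To make that concrete, here’s roughly what a more conservative configuration looks like in R. The exact values below are illustrative stand-ins, not our production settings:

    library(xgboost)
    
    # A more conservative setup than the original (max_depth 20, AUC):
    # shallower trees, log loss as the evaluation metric
    params <- list(
      objective        = "binary:logistic",  # predict win (1) vs. loss (0)
      eval_metric      = "logloss",          # switched from "auc"
      max_depth        = 5,                  # down from 20 to curb overfitting
      eta              = 0.05,               # smaller steps = more conservative
      gamma            = 0,
      subsample        = 0.8,
      colsample_bytree = 0.8,
      min_child_weight = 1
    )
    
    # dtrain would be an xgb.DMatrix of play-by-play features and win labels
    # model <- xgb.train(params = params, data = dtrain, nrounds = 500)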

    That’s looking a whole lot better already! By lowering the number of decision points in our trees, we’re making our model more conservative and relying on the most important features only. This time, our average error is down to just 1.6%! That’s about a 75% improvement from the first model, just by reducing the tree depth.

    However, there are some other things we need to look at with any win probability model. One key thing our model has struggled with is figuring out when the game is over, and setting the win probability all the way to 100% or 0%. If that doesn’t happen, then you get some weird looking graphs because a team has won the game, but the win probability might be at 75% on the last play of the graph. Not only is this wrong, but it just looks bad as well. So let’s see how this one has done with that task.

    So when we apply the model to a testing set of all games from the 2021 college football season, 86% of them finish fairly close to 0 or 100%. That’s not terrible, but I think we should try to do better. Our original model, which was more aggressive and used AUC as the evaluation statistic, was up around 94% of all games, so we have some room for improvement. Our conservative model has helped overall, but has made the end-of-game predictions too conservative.

    A lot of this is trial and error. Make some changes, retrain and retest the new model, and see how your results look. So I’ll spare you the trial and error part and jump to the results that worked out the best.

    Feature Engineering

    A lot of the effort around improving the end-game results came from improving the features the model uses. One of those key features is the end-of-game row in each game’s play-by-play data.

    Set Custom End of Game Values

    So one thing we do to try to nudge the model in the right direction is manually force all the variables to certain values when the game is over. For instance, instead of the game ending on 3rd and 9, we add a “fake” play at the end of the game and set the down and distance to -10 and -10 yards to go. The idea is that this would be distinctly different than every other play, and so the model would realize that it’s the end of the game, and one team should be named a winner now.

    However, we had originally set all these values to -10: Yards to Goal, Down, Distance, Quarter, Home Timeouts, Away Timeouts, etc. So I wondered if this might be throwing the model off, since downs are usually between 1 and 4, and on this one play of the game the down is -10, and the model has no idea what to do with that. Instead, I changed these end-game values to be more logical. I set the quarter to 20, guaranteeing that it’s the highest quarter in the data (each overtime in our data counts as a quarter). Here are the rest of the values I settled on, with a code sketch of the end-of-game row after the list.

    • Home Timeouts = 0
    • Away Timeouts = 0
    • Clock Remaining in Seconds = -10,000 (In overtime, the clock becomes negative, so this ensures that it’s lower than any overtime game)
    • Down = 5
    • Distance = 100
    • Yards to Goal = 100
    • Home Possession Flag = 1
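
    In dplyr terms, appending that sentinel row per game might look something like this sketch (pbp and the column names are assumptions about our play-by-play data, not exact production code):

    library(dplyr)
    
    # Append one "fake" final play per game: copy the last real play, then
    # override it with sentinel values no real play can take
    end_rows <- pbp %>%
      group_by(game_id) %>%
      slice_tail(n = 1) %>%
      ungroup() %>%
      mutate(
        period          = 20,      # higher than any real quarter or overtime
        home_timeouts   = 0,
        away_timeouts   = 0,
        clock_seconds   = -10000,  # lower than any overtime clock value
        down            = 5,
        distance        = 100,
        yards_to_goal   = 100,
        home_possession = 1
      )
    
    pbp_with_end <- bind_rows(pbp, end_rows)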

    I made this update and reran our model, and sure enough, we saw good results, and not just with our end-game win probabilities, where 94% of games now finished within .1 of 1 or 0, but also in our overall model performance, where we now had an average calibration error of .011.

    One other addition I’d like to make here involves the possession flag. Having possession presumably improves your chances of winning slightly, so why not give end-of-game possession to the team that won, giving the model an extra hint at who won (outside of the literal score, of course)?

    After making this change, it didn’t really help the calibration, but it did improve the end-game results, so that now 96.7% of games were classified correctly in the end. That’s about as good as it gets!

    Adding In Spread

    One last thing I want to include is the pregame spread. This could be helpful because the spread is derived from a lot of knowledge, both from the sportsbook itself and from the betting public. We already use Elo win probability as a factor in our model, but this could help even more.

    However, after including the spread, I found that our model actually got worse! How could this be? Well, for one, the spread is being joined to every single play of the game, whether it’s the first play or the last. And once the game gets going, the actual score and how the teams are playing on that day matter a heck of a lot more than the pregame spread. So how do we deal with this reality, while still including this relevant data?

    Luckily, I read up on how nflfastR does it, and they gave me a great idea. They use a time-decaying function to decrease the value of the spread as the game goes on. Now, I don’t know exactly how they did that, but I gave it a shot in my own way and got some good results. What I landed on was this function to transform the spread throughout the game:

    new_spread = spread * (clock_in_seconds/3600)^3

    So, for example, if the spread is -7: at the start of the game there are 3600 seconds left on the clock, so on the first play the spread value is -7 * (3600/3600)^3 = -7. At halftime, the spread value will be -7 * (1800/3600)^3 = -.875.

    So as you can see, the spread rapidly becomes less and less important as the game goes on. The reason I chose the third power instead of squaring is that I tested both and found the third power was the sweet spot where all evaluation metrics were best.
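
    As a quick R check of those numbers:

    # Time-decayed spread: full weight at kickoff, shrinking with the cube
    # of the fraction of regulation clock remaining
    decay_spread <- function(spread, clock_in_seconds) {
      spread * (clock_in_seconds / 3600)^3
    }
    
    decay_spread(-7, 3600)  # kickoff: -7
    decay_spread(-7, 1800)  # halftime: -0.875
    decay_spread(-7, 0)     # end of regulation: 0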

    So after this last round of changes, our average calibration error was down to .0096, and our accurately classified end-games were up to 97.5%! I’d say that’s a big win compared to where we started.

    Using Play Number Instead of Clock

    A major source of headaches with many of our plots is the game clock. Unfortunately, a lot of the clock data from ESPN’s play-by-play is incorrect. Each week, about half of games have bad clock data: several plays with the same time on the clock, an entire quarter of plays marked as 0:00, and so on. This makes things very difficult for our model, because it gets confused about why the clock isn’t moving.

    Initially, we resolved this by removing any extra plays that happened after the first one at any given timestamp. This worked decently, but it made the graphs less attractive: the plot might jump from a play early in the 1st quarter to one late in the 1st quarter, and you miss all the information about what happened in between.

    So, to tackle this issue, we instead use play number, or, more precisely, the percentage of the game completed. We calculate the total plays in the game, then for each play calculate the percentage of the game completed so far, and use that in the model in place of the clock. The limitation is that if we ever wanted to do a live in-game win probability, it would be difficult, because the model doesn’t know how many plays are left in a game that hasn’t finished yet, unlike the clock, where there’s a set number of minutes in a game.
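
    The calculation itself is simple; here’s a sketch (game_id and the play ordering are assumptions about our play-by-play data):

    library(dplyr)
    
    # Replace the unreliable clock with the share of the game completed,
    # assuming rows within each game are already in play order
    pbp <- pbp %>%
      group_by(game_id) %>%
      mutate(game_completion = row_number() / n()) %>%
      ungroup()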

    The results from this change were remarkable. Not only did we get smoother graphs that included every play of the game, but we improved the performance of our model by .002, down to .0076 average calibration error—a 21% improvement! On top of that, we classified 98.3% of end-games correctly, a new best for us! Below, we’ll look at some of the before and after plots side-by-side.

    Improved Plots

    So now that our new model is ready, let’s test it out and see how it handles some of those games it struggled on before.

    Compared to the wild swings from the initial graph, this one is much more reserved and doesn’t overreact to individual scores by as much. And, the win probability approaches 0% for Army by the end of the game which is another good sign.

    This Washington State-Arizona State game had missing clock data, and an ugly spike right at the end of the game, when Arizona State scored a garbage-time touchdown. The switch to play number resolved our concerns over the clock data, and the inclusion of a simple indicator saying whether a play was a kickoff or not helped the model realize that after ASU scored that garbage-time touchdown, they’d have to give the ball back to Washington State again who would run the clock out.

    Here’s another game with bad clock data, and you can see that the change helps smooth out the movement of the win probability and removes some of the jaggedness we see on the left-hand side.

    Here’s the 9-overtime game between Illinois and Penn State. And while our model still hasn’t been tuned to understand the new CFB overtime rules and the back-and-forth go-for-2 attempts, which are surely confusing it, it’s at least a lot better than what happened in our original clock-based model.

    So that’s it! We have our new model and it’s ready to go. We can use it to post win probability charts like you see above, or calculate win probability added (WPA), which is just the difference between one play and the next in terms of win probability. Then, we could look at teams on, let’s say, 4th downs only, and see what their average win probability added is on those plays. That could give you insight into which teams are making good decisions on 4th downs, and which teams are not.