General

  • Should Pato O’Ward Have 2-Stopped at Mid-Ohio?

    After Pato O’Ward’s poor qualifying effort which had him starting 25th at Mid-Ohio, the broadcasters in the NBC booth stressed that he had to “do something different”.

    And sure enough, the McLaren driver’s pit wall opted for a 3-stopper instead of the favored and eventual race-winning 2-stop strategy.

    Despite these setbacks, O’Ward still posted the 3rd fastest total time in the field, covering the 80 laps just 14 seconds slower than race winner Alex Palou even though he finished 28.5s behind on the track.

    Of course, qualifying better could have eaten into that chunk of time a bit, but could one fewer pit stop and the roughly 30 seconds saved have made up for his poor qualifying and seen O’Ward actually come from 25th to win?

    To analyze this, we need to consider two things:

    1. How much time would O’Ward have saved by making one fewer pit stop? (This is the easy question to answer because it’s just the pit delta.)
    2. How much time did O’Ward gain from having fresh tires on his extra stint, or how much time would he have lost being on older tires for longer?

    The net of these two things ends up being the delta between a 2-stopper and a 3-stopper.

    The second question is harder to answer because we don’t know how his tires would have held up, what traffic he would have hit, and all the other unknowns that may have occurred had he 2-stopped.

    The best we can do is see how the drivers around him were affected and extrapolate out to O’Ward’s race.

    Pit Delta

    The pit delta for O’Ward specifically can be found by looking at his pit-in and pit-out laps, and then comparing that to his normal non-pit green flag laps.

    O'Ward Pit Stop Delta: time lost over average green flag lap (seconds)
    Pit Stop 1: 33.307
    Pit Stop 2: 28.792
    Pit Stop 3: 26.775

    The pit delta also changes as the race progresses. O’Ward’s 3rd stop in particular was quicker because he needed less fuel: it was 2s quicker than his 2nd stop and 6.5s quicker than his first. So we will use that delta instead of the overall average.

    So his 3rd stop added 26.775 seconds to his race time vs. staying out and doing average-paced laps.
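
    One way to read the pit-delta method described above is sketched below in R. The lap times are hypothetical placeholders, not O’Ward’s actual timing data, and `pit_delta()` is a made-up helper name; treat it as an illustration of the idea rather than the exact calculation behind the numbers in this post.

```r
# Hypothetical lap times in seconds (placeholders, not real timing data)
green_laps  <- c(69.7, 69.9, 69.8, 70.0, 69.6)  # normal green-flag laps
pit_in_lap  <- 82.1                              # lap that ends with pit entry
pit_out_lap <- 84.3                              # lap that starts from pit exit

# Time lost to a stop: the two pit-affected laps minus two average green laps
pit_delta <- function(pit_in, pit_out, green) {
  (pit_in + pit_out) - 2 * mean(green)
}

stop_cost <- pit_delta(pit_in_lap, pit_out_lap, green_laps)
print(stop_cost)
```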

    Green Flag Pace

    Green Flag Lap Times by Strategy (seconds)
    Strategy   Average Lap Time   Std. Dev. Lap Time
    2-Stops    70.603             1.206
    3-Stops    70.354             0.961

    O’Ward’s green flag non-pit lap times on average were 69.8 seconds. The drivers on the 2-stop strategy were averaging 70.6 seconds per lap, but we care more about the fastest of the 2-stoppers, Alex Palou and Scott Dixon.

    They were both averaging 69.95. So on a typical green-flag non-pit lap, of which there were 69 for O’Ward, he was only 0.15s faster than Palou. That equates to about 10.4s gained over those green flag laps on pace alone.

    So he gained 10.4s by choosing a 3-stop over a 2-stop, but lost 26.8s in pit-road time, for a net loss of 16.4s with a 3-stop strategy. This tells us that O’Ward likely should have stuck with a 2-stop strategy.

    Compared to the field, the 3-stopper made sense as he was almost a full second quicker per lap than the average 2-stopper, meaning he made up 55s on the field to counteract his 27s stop. But the fastest 2-stoppers were able to keep a similar pace to him even on longer stints.
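
    The two comparisons above boil down to the same arithmetic: laps times pace advantage, minus the cost of the extra stop. Here is a small R sketch using the rounded figures quoted in this post (the variable names are mine). A negative net means the 3-stop was slower overall.

```r
green_laps <- 69       # O'Ward's green-flag, non-pit laps
oward_pace <- 69.8     # his average green-flag lap (s)
palou_pace <- 69.95    # Palou and Dixon's average (s)
field_pace <- 70.6     # average 2-stopper (s)
third_stop <- 26.775   # time lost to the extra stop (s)

# Time gained on pace over the green-flag laps
gain_vs_palou <- green_laps * (palou_pace - oward_pace)
gain_vs_field <- green_laps * (field_pace - oward_pace)

# Net effect of the extra stop: negative means the 3-stop cost time
net_vs_palou <- gain_vs_palou - third_stop
net_vs_field <- gain_vs_field - third_stop

print(c(vs_palou = net_vs_palou, vs_field = net_vs_field))
```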

    Navigating Traffic

    What likely hurt O’Ward was all the extra passes he had to make thanks to that extra stop. O’Ward started in 25th, so he had a bigger gap to make up and 24 cars to get by. He was already 6.6s behind Palou at the end of the first full green flag lap.

    O’Ward had to overtake 43 cars for position in the race, 14 more than the next-closest driver. Palou and Dixon were among the lowest in the field, with just 10 and 13 overtakes, respectively. This number doesn’t cover the backmarkers, although all drivers would have to deal with them eventually.

    You can see this take effect in the data too. While O’Ward had the fastest green-flag pace of anyone, he also had one of the highest standard deviations in lap time, meaning he was not able to be consistent. The difference may seem slight, but it is enough to make an impact on your race, and this accounts for the significant amount of extra time he spent behind and navigating around his competitors.

    You can see from the green flag lap charts that O’Ward was much more inconsistent than Palou over the course of the race, with spikes and dips in pace.

    Tire Difference

    The last thing we’ll look at is the tire dropoff, which undoubtedly helped O’Ward to pump out laps almost a whole second faster than the field average throughout the race.

    The reds and primary tires were similar until lap 15 of a stint, when the reds fell off sharply by about 0.5 seconds. At 25 laps, both tires were 1 second slower than at their best.

    As you can see, there’s about a 1-second dropoff from lap 10 to lap 20 on the reds, and about a half-second on the primaries. The reds seemed to come back a bit near the end of the stint, but that could be drivers just pushing them to the max before pitting.

    Luckily for O’Ward, he was on the primary tire for most of the race, which appears to be preferable if you want to go more than 15 laps. However, as the race went on and the traffic built up, his pace got slower and slower and his lap times more unpredictable.

    When To Do An Extra Stop

    This is a question that I want to do more research on, but my initial thoughts after doing this analysis are:

    1. You have to take track position and traffic into account. If you are going to have to re-pass more than 2 or 3 cars by pitting, it’s going to cost you a significant chunk of time. 3-stopping probably makes more sense for leaders with a big gap who want to protect against tire dropoff than it does for mid-pack trying to make a big move.
    2. Track length and cars on track matter. A short 70-second lap like Mid-Ohio with 27 cars on course appears to be the wrong place to do an extra stop. There’s too high a chance of coming out in traffic and dealing with backmarkers late in the race.
    3. Tire dropoff matters. If tire dropoff isn’t significant, then you don’t have as much to gain on fresh tires vs. worn tires. Similarly, if the difference between reds and blacks isn’t significant, then doing shorter stints on softer tires won’t be as big of an advantage.
    4. Cautions change everything. If more cautions had fallen later in the race, the analysis changes completely. Time lost on pit lane is mostly erased. Restarts provide easy overtaking opportunities.
  • Why Don’t NFL Teams Pay NBA Players to Block Kicks?

    Recently I listened to an episode of Freakonomics Radio where they talked about specialization in the NFL. Specifically, they were talking about how the long snapper position has become something that NFL teams specifically draft for and pay upwards of $1 million a year to have on their 53-man roster. This got me wondering: what other positions are ripe for specialization in the NFL?

    This train of thought brought me to the field-goal unit. Not the kicker, holder, or snapper, but the defense. What if teams hired the tallest guy money could buy to do one thing: block kicks? It’s so obvious that I assumed there must be a reason teams aren’t doing it, or at least trying it. This sent me down a rabbit-hole of heights and wingspans, standing reaches, verticals, salary caps, and something I’m not very good at: geometry.

    So why don’t NFL teams pay a really tall guy to block field goals for them?

    Is Blocking A Kick Just By Being Tall Even Possible?

    My assumption for hiring a tall guy to block kicks is that he wouldn’t need to work hard to do it. In my experiment, he doesn’t even cross the line of scrimmage. He just stands there, arms raised, and jumps to try to swat the ball down. So is this even a feasible strategy? How high does the ball travel over the line of scrimmage? Obviously kicks get blocked, but usually the defender has some penetration. What if he was at the line of scrimmage?

    Thankfully I didn’t have to break out the geometry textbooks to figure this one out. I used this nifty website to calculate exactly how high the ball would be above the line of scrimmage at different kick angles. The assumption here is that the ball is on a straight upward trajectory as it crosses the line of scrimmage, and that the kick is taken from 7 yards out. Here are some heights at various kick angles.

    Angle (Degrees)   Height at Line of Scrimmage (Feet)
    30                12.1
    35                14.7
    40                17.6
    45                21.0
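
    If you would rather not rely on a website, the same heights fall out of one line of trigonometry. This is a sketch under the same assumptions stated above (a straight-line trajectory and a kick from 7 yards behind the line); the variable names are mine.

```r
kick_distance_ft <- 7 * 3            # 7 yards behind the line of scrimmage
angles_deg <- c(30, 35, 40, 45)

# Straight-line trajectory: height = distance * tan(launch angle)
height_ft <- kick_distance_ft * tan(angles_deg * pi / 180)

heights <- data.frame(angle_deg = angles_deg, height_ft = round(height_ft, 1))
print(heights)
```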

    According to one study, the optimal launch angle for a kick is between 38 and 45 degrees, as this will maximize distance. Thus, we can assume that kickers are going to be aiming for that angle. Of course, human error and misjudgment will mean that some kicks go below or above that ideal range.

    Graph from the University of Nebraska study which found that even at different kick speeds, 45 degrees was still the optimal angle.

    So according to our geometry and assumptions about the ideal angle, most balls will be flying over the line of scrimmage roughly 16 to 21 feet in the air, far out of the reach of even the tallest NBA players. But, at launch angles of 35 degrees or lower, we have a glimmer of hope, maybe.

    Let’s take a famous big guy in the NBA: Boban Marjanovic. The 7'4" center has a 7'10" wingspan and a 10'2.5" standing reach. He’s one of the few NBA players able to dunk the ball without even jumping. He has just a 23" vertical leap, putting his overall potential range at 12'1.5". A truly enormous human being. It’s hard to comprehend. However, he’s barely tall enough to block a 30° kick, and that’s with a perfectly-timed jump.

    With that being said, it’s highly probable that Marjanovic gets his hand on at least one kick during the season, and if he can make it even half a yard toward the kicker, he takes the vertical height down a cool foot, making it that much easier.
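
    To make the margins concrete, here is a rough R sketch comparing Marjanovic’s maximum reach to the ball height for a 30° kick, both at the line of scrimmage and half a yard closer to the kicker. It reuses the straight-line assumption from earlier; `ball_height_ft()` is a hypothetical helper name.

```r
standing_reach_in <- 10 * 12 + 2.5   # 10'2.5" standing reach, in inches
vertical_in       <- 23              # vertical leap, in inches
max_reach_ft      <- (standing_reach_in + vertical_in) / 12

# Ball height under the straight-line trajectory assumption
ball_height_ft <- function(distance_ft, angle_deg) {
  distance_ft * tan(angle_deg * pi / 180)
}

at_line   <- ball_height_ft(21,   30)   # at the line of scrimmage
half_yard <- ball_height_ft(19.5, 30)   # half a yard toward the kicker

print(c(max_reach = max_reach_ft, at_line = at_line, half_yard_in = half_yard))
```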

    There were just 14 field-goals blocked in 2021, out of 1066 field-goal attempts, meaning just 1.3% of all field goal attempts at any distance were blocked. Honestly, that’s higher than I was expecting. I think it’s reasonable to assume that our tall guy could attain at least that and maybe a half-point to a point higher over the course of a season.

    There would also likely be psychological effects on the kicker of having such a large human towering over the line of scrimmage, and the kicker may subconsciously put unnecessary height on the ball, decreasing its horizontal travel. This would especially be apparent early on, when it’s a new phenomenon for the kicker and they have not seen this type of player in front of them before.

    One other logistical challenge in blocking a kick is that it’s not a given, even if the ball is low enough, that your center gets a hand on the ball. What kind of reaction time would be necessary to accurately block a kick from the line of scrimmage?

    According to a highly cited paper, a field goal travels around 19-22 m/s, or 21-24 yards/s, meaning the ball will reach the line of scrimmage in approximately .33 seconds. That’s about 80ms longer than the average human reaction time, meaning our NBA center could reasonably have some reaction to the ball in that time. How much is another story, but with good positioning I don’t hate our chances.
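
    Here is that timing estimate as a quick R sketch. Two simplifications are mine: it treats the quoted ball speed as if it were all horizontal, and it assumes a 250ms reaction time, which is what the 80ms margin above implies.

```r
yards_per_metre <- 1 / 0.9144
ball_speed_ms   <- c(slow = 19, fast = 22)        # m/s, the quoted range
ball_speed_yds  <- ball_speed_ms * yards_per_metre

time_to_line_s  <- 7 / ball_speed_yds             # 7 yards to the line
reaction_time_s <- 0.25                           # assumption: ~250ms reaction time

# Time left to move after reacting, in milliseconds
margin_ms <- (time_to_line_s - reaction_time_s) * 1000

print(round(time_to_line_s, 3))
print(round(margin_ms))
```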

    So all-in-all, I think we can safely say it’d be difficult, but not impossible for a big guy to block kicks from the line of scrimmage, and that over the course of the season, he’d probably block one or two and influence many more.

    The Existing Arguments Against Having A Designated Tall Guy

    I did a quick Google search and didn’t see any in-depth statistical analysis or mathematical justifications on the subject. The most common argument I found on some reddit threads against the idea was “roster spots are limited” and “roster spots are valuable and blocked kicks are not”. However, we already know that teams are willing to dedicate a whole roster spot to a guy who snaps the ball 8-12 times a game. And he’s not even scoring… they pay him to prevent a slip-up that costs them points; my guy would be paid to actively prevent the other team from scoring.

    My own kin, Drew, offered another argument:

    "I also wonder if having someone like that would cause more fakes because it kind of takes him out of play as a defender since those guys usually can't run"

    Aside from some hurtful accusations he’s making about the speed and agility of big guys, he makes a point. This could be an unintended side-effect of implementing this kind of player. We see this with all elements of the game: someone innovates, and then a few years later the league adapts and the competitive advantage largely disappears. But for the sake of my argument, I’m going to assume that teams wouldn’t immediately be able to exploit this tactic just because there’s one tall guy on the field.

    So that left a few questions: how valuable is a blocked kick, and how much should teams be willing to pay for one?

    How Valuable Is A Blocked Kick?

    In 2021, NFL teams had an average total cap of around $187 million, and scored 391 points on average. So they paid about $478,000 per point last season. This is one way to value blocked kicks. By blocking a kick, you’re preventing 3 points (or maybe 2.61 points, if you consider that the median FG% in 2021 was 87%). You could also argue that a blocked kick sets the offense up with a better chance to score, and that change in expected points should be attributed to the blocker, but let’s keep this simple for now. So each blocked kick could be valued at approximately $1.434 million (3 points * $478k per point). I’m assuming that the going rate for a point prevented is the same as the rate for a point scored (which you could argue isn’t the case since defensive players aren’t paid the same as offensive players).

    Another way to look at it is by cost per win. The median cap spend per win in 2021 was $21 million. The median points required to win a game was 29. So that gives you a cost of $724k per point, if we’re after wins (which most teams are). This would give even greater value to a blocked kick ($2.2 million)!
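
    Both valuations are simple ratios, so they are easy to reproduce. A minimal R sketch using the figures quoted above (variable names are mine, and results differ slightly from the text where the text rounds intermediate values):

```r
# Valuation 1: cap dollars per point scored
cap_per_team    <- 187e6
points_per_team <- 391
cost_per_point  <- cap_per_team / points_per_team

# Valuation 2: cap dollars per point needed to win
cap_per_win        <- 21e6
points_per_win     <- 29
cost_per_point_win <- cap_per_win / points_per_win

# A blocked field goal prevents 3 points
block_value_points <- 3 * cost_per_point
block_value_wins   <- 3 * cost_per_point_win

print(round(c(per_point = cost_per_point, per_point_win = cost_per_point_win)))
print(round(c(block_points = block_value_points, block_wins = block_value_wins)))
```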

    What NBA Players Could NFL Teams Afford?

    Much to my satisfaction, we could actually afford our Serbian center Boban at his modest $3.5 million per year. Using our second valuation, he would only have to block two kicks all year to be worth the investment. If we use our points-based valuation, it would be 3 kicks blocked over the course of a season.

    A younger, more athletic Mo Bamba would cost NFL teams a premium at $7.5 million per year, so he’d need a higher output to justify the cost. At that rate, we may even ask him to rush the kicker a bit or maybe even get in for some corner fade routes in the red zone.

    At 7’1″ and only $2.1 million a year, Bol Bol may be a great investment for some NFL team looking to make a splash on special teams. But wait, could these guys actually block that many field goals in a season?

    There are usually several field goal attempts per game (the top 32 kickers averaged 1.93 field-goal attempts per game). The average team attempted 32.1 field goals in 2021. Using that 1.3% block rate, we would expect the average team (not individual) to block .42 field goals per year. At $724k per point and 3 points per block, that output is worth about $910k per year, and that’s the whole team’s production, not one player’s. Given the NFL minimum salary for 2022 is $705k, that leaves almost no room to pay a designated kick-blocker.

    Let’s assume our tall guy makes your team twice as good at blocking kicks, from a 1.3% to 2.6% block rate. Now, the team is still blocking under 1 kick per season at .83, worth roughly $1.8 million. That covers the league minimum, but it’s still short of even Bol Bol’s $2.1 million.
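
    The expected-blocks arithmetic can be reproduced in a few lines of R. This is a rough sketch using the rounded figures from the text, with variable names of my own. It shows the dollar value both per point and at three points per block, since the two differ by a factor of three.

```r
block_rate     <- 0.013          # 2021 league-wide block rate (14 of 1066)
attempts       <- 32.1           # field-goal attempts per team, from the text
cost_per_point <- 21e6 / 29      # win-based valuation, about $724k

# Expected blocks per season at the base rate and at double the rate
expected_blocks <- attempts * c(base = block_rate, doubled = 2 * block_rate)

value_per_point <- expected_blocks * cost_per_point       # block counted as 1 point
value_per_block <- expected_blocks * 3 * cost_per_point   # block counted as 3 points

print(round(expected_blocks, 2))
print(round(value_per_point))
print(round(value_per_block))
```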

    At this point, I’m feeling a little deflated. I think there’s a reason nobody is trying this. Blocked kicks are rare, and kickers kick the ball high above the line of scrimmage most of the time. So unless you can find a 7-footer that’s also fast and can rush the kicker a bit, and doesn’t cost a fortune, it’s probably not worth it over your two-way guys that can contribute in multiple ways for the team. However, if you find a 7-footer that can also stand in the endzone and cherry pick some passes, then this argument changes a lot.

  • StatRdays: The Easiest Model You’ll Ever Make

    This year, we’ve been participating in the CollegeFootballData.com prediction contest, where each week you predict the actual spread of the game. You’re judged on a variety of factors like your outright picks, picks against the spread, and absolute and mean-squared error. Two areas where we are performing very well are MSE and absolute error (1st and 2nd, respectively). Today, I’ll show you how we do that.
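
    For reference, here is how those two error metrics are usually computed. The spreads below are hypothetical placeholders, not contest data, and the contest’s exact scoring may differ.

```r
# Hypothetical predicted and actual spreads (placeholders, not contest data)
predicted <- c(-7.5, 3.0, -14.0, 1.5)
actual    <- c(-10,  7,   -3,    -4)

errors <- predicted - actual

mae <- mean(abs(errors))   # mean absolute error
mse <- mean(errors^2)      # mean squared error: punishes big misses more

print(c(mae = mae, mse = mse))
```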

    Getting the Data

    We’ll be using R today, and we’ll be getting all of our data from the collegefootballdata.com API. This part requires you to have your own API key from CFB Data. If you don’t have one yet, you can get one here. This also requires you to store your key. I recommend storing it in your .Renviron file, which you can probably find in your Documents folder on Windows.

    Then, you’ll want to edit it in Notepad and add your key like this:

    cfbd_staturdays_key = "yoursecretkey"
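
    After saving the file and restarting R, you can check that the key is visible with `Sys.getenv()`. This is only a quick sanity check; how the helper scripts below actually read the key isn’t shown here.

```r
key <- Sys.getenv("cfbd_staturdays_key")

# Sys.getenv() returns an empty string when the variable is not set
if (identical(key, "")) {
  message("Key not found: check your .Renviron file and restart R.")
} else {
  message("Key loaded (", nchar(key), " characters).")
}
```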
    

    Now, you should be all set to do the rest of the exercise. Let’s first grab a few functions that will help us get the data from the API.

    # Load in required functions
    source("https://raw.githubusercontent.com/kylebennison/staturdays/master/Production/source_everything.R")
    

    Next, let’s get the initial data. We’ll need the games, elo ratings, and betting data.

    # Get historic Elo data from Staturdays
    elo <- get_elo(2013, 2021)
    
    # Get games data from CFBdata
    games <- get_games(2013, 2021)
    
    # Get historic betting data from CFBdata
    betting.master <- data.frame()
    for (j in 2013:2021) {
      message("Doing year ", j)
      # Build and encode the request URL for this season
      betting_url <- URLencode(paste0("https://api.collegefootballdata.com/lines?year=", j))
      # cfbd_api() and my_key are not defined here; presumably they come from the sourced helpers
      betting <- cfbd_api(betting_url, my_key)
      betting <- as_tibble(betting)
      # Each game holds a nested list of lines, one per provider
      betting <- unnest(betting, cols = c(lines))
      betting.master <- rbind(betting.master, betting)
    }
    

    Cleaning the Data

    We’ll have to do some clean up to get the data ready to use in a model. First, we’ll average out the data in the betting file, because we have multiple lines from different providers for the same game, so we’ll just take the average of all the lines.

    Next, we’ll create a new field called “join_date” in our Elo file, since the Elo is from after the game is finished in that week, so we’ll want to join each Elo rating to the following week’s game.

    Then, we’ll join all three tables (games, elo, and betting) together.
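
    To see what the join_date shift does, here is a small sketch on a made-up table (the team and ratings are placeholders, and `elo_toy` is a hypothetical name). Each rating gets stamped with the date of that team’s next row, so a rating earned after one game lines up with the following game.

```r
library(dplyr)

# Toy ratings for a hypothetical team (placeholder values)
elo_toy <- tibble(
  team = "Team A",
  date = as.Date(c("2021-09-04", "2021-09-11", "2021-09-18")),
  elo_rating = c(1500, 1520, 1510)
)

# lead() pulls the next date within each team; the last row has no next game
shifted <- elo_toy %>%
  group_by(team) %>%
  mutate(join_date = lead(date, n = 1L, order_by = date)) %>%
  ungroup()

print(shifted)
```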

    # Need to summarise lines for teams with multiple lines
    betting_consensus <- betting.master %>% 
      mutate(spread = as.double(spread),
             overUnder = as.double(overUnder)) %>%
      group_by(id, season, week, homeTeam, awayTeam,
               homeConference, awayConference, homeScore, awayScore) %>% 
      summarise(consensus_spread = mean(spread, na.rm = TRUE),
                consensus_over_under = mean(overUnder, na.rm = TRUE),
                consensus_home_ml = mean(homeMoneyline, na.rm = TRUE),
                consensus_away_ml = mean(awayMoneyline, na.rm = TRUE))
    
    e2 <- elo %>% 
      group_by(team) %>% 
      mutate(join_date = lead(date, n = 1L, order_by = date))
    
    games_elo <- games %>% 
      mutate(start_date = lubridate::as_datetime(start_date)) %>% 
      left_join(e2, by = c("start_date" = "join_date",
                           "home_team" = "team")) %>% 
      left_join(e2, by = c("start_date" = "join_date",
                           "away_team" = "team"),
                suffix = c("_home", "_away"))
    
    games_elo_lines <- games_elo %>% 
      inner_join(betting_consensus, by = "id")
    

    Doing Some Calculations

    Ok, we’ve cleaned everything up and joined it together. Now, we need to do some calculations. Mainly, we want to know the difference in Elo between the home and away teams, since we’ll use this as a feature in our model later. We’ll also want to calculate the final actual spread of the game, and this will be our response variable: the variable we’re trying to predict.

    ge2 <- games_elo_lines %>%
      mutate(home_elo_adv = elo_rating_home + 55 - elo_rating_away,
             final_home_spread = away_points - home_points)
    

    We’re including a 55 point home-field advantage in the Elo advantage calculation, which we’ve identified as the best home-field advantage value in previous testing.

    Let’s look at the relationship between Elo and the final spread.

    ge2 %>% 
      ggplot(aes(x = home_elo_adv, y = final_home_spread)) +
      geom_point(alpha = .1, color = staturdays_colors("light_blue")) +
      geom_smooth(color = staturdays_colors("orange")) +
      staturdays_theme +
      theme(panel.grid.major = element_line(color = "lightgrey")) +
      labs(title = "Elo and Spread 2013-2021",
           subtitle = "Elo advantage includes built-in home-field advantage worth around 3 points",
           x = "Home Elo Point Advantage/Disadvantage",
           y = "Home Win/Loss Point Margin")
    

    Remember that a negative spread means the home team won by that amount. So, as the Elo advantage increases for home teams, the final spread becomes more negative, meaning the home team tends to win by more. There is a lot of deviation, but the relationship is clearly linear. So we should be able to model this, and we’ll use a linear regression model.

    Now, CFB Data recently provided their own Elo model, and while it’s fairly similar to Staturdays, it is different in a few decisions and assumptions it makes. Rather than be picky, I’m just going to include them both. You can think of it like the wisdom of the crowd, where a second opinion should help. That isn’t strictly true, though: if two variables are highly correlated, including both can actually throw off your model and make it worse. More variables also doesn’t always mean a better model if those variables aren’t helpful. If I include the time of the kickoff as a variable, it might end up confusing the model more than helping it, because it might find some strange correlation that has nothing to do with the teams or their skill and more to do with random chance.

    # Include CFB Data's elo as well.
    ge3 <- ge2 %>% mutate(alt_elo_adv = home_pregame_elo - away_pregame_elo)
    

    Ok, we’re ready to build our model.

    Building the Linear Regression Model

    model_spread <- lm(final_home_spread ~ home_elo_adv + alt_elo_adv + consensus_spread, ge3)
    
    summary(model_spread)
    

    So the lm() function is to build a linear model, and the syntax you’re seeing is saying “predict final_home_spread using home_elo_adv, alt_elo_adv, and consensus_spread, from the ge3 dataset”. Here are the results.

    So we have an R-squared of .47, which means 47% of the variation in spread can be explained by our model. That’s not great, but it’s certainly a start. The consensus spread is the only variable that is statistically significant at the 5% level, but that doesn’t mean we need to exclude our other variables. We would want to compare the results to a model that excluded elo and see which performed better. For now, we’ll leave it as is.

    If you really wanted to stress-test your model’s validity in the real world, you could train and test it, using a holdout set of data. We’ve skipped this because we’re just trying to build a model here, and not necessarily test and optimize it right now.
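
    If you do want to try that, a holdout test could look something like the sketch below. The data is simulated as a stand-in for ge3 (none of it is real game data), and `holdout_mae()` is a hypothetical helper. It also compares a model with and without an Elo term, scored on games neither model saw during fitting.

```r
set.seed(42)

# Simulated stand-in for ge3 (placeholder data, not real games)
n <- 500
sim <- data.frame(consensus_spread = rnorm(n, 0, 14))
sim$home_elo_adv <- -25 * sim$consensus_spread + rnorm(n, 0, 150)
sim$final_home_spread <- sim$consensus_spread + rnorm(n, 0, 15)

# Hold out 20% of games for testing
test_rows <- sample(n, size = 0.2 * n)
train <- sim[-test_rows, ]
test  <- sim[test_rows, ]

full    <- lm(final_home_spread ~ home_elo_adv + consensus_spread, data = train)
reduced <- lm(final_home_spread ~ consensus_spread, data = train)

# Mean absolute error on the held-out games
holdout_mae <- function(model) {
  mean(abs(predict(model, newdata = test) - test$final_home_spread))
}

results <- c(full = holdout_mae(full), reduced = holdout_mae(reduced))
print(results)
```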

    Saving and Using The Model

    Now that we have the model, we can apply it to new data to get predictions.

    If you want to save a model for use another time, you can save it to a .rds file.

    saveRDS(model_spread, file = "Production Models/elo_combo_spread_model_v2.rds")
    

    To apply this model, we’d need to rerun all the code above, but only pull data from 2021 and look at the games coming up this week. Then, we’d run this code to make our spread predictions:

    # Read in the model we saved earlier
    model_spread <- readRDS(file = "Production Models/elo_combo_spread_model_v2.rds")
    
    # Predict the spread using our model
    ge3$predicted_spread <- predict(model_spread, newdata = ge3)
    

    And there you have it. R will use your model and the input variables in your ge3 data to predict the final spread of the game!

    From here, we could try to include more relevant variables that might help improve our model, or we could try a different model type altogether, like a decision tree, to see if that helps predict spreads more accurately.