General

  • Separating the Fans from the Stadium: Stadium Size, Attendance, and Home-Record in College Football

    Today we’re looking at the interwoven relationships between stadium-size and attendance and a team’s performance at home and away. We’ll look at how this relationship changed in 2020 when fans were not able to attend most games.

    We’re going to try to separate the attendance from the venue size to see if having a packed house makes a difference, regardless of a team’s overall strength.

    First, a few primers to understand the data we’re dealing with.

    Let’s look at the distribution of venue sizes in college football.

    Distribution of venue sizes in College Football. The most common stadium-size is 30,000 seats.

    The most popular stadium size appears to be right around 30,000 seats. I’m not sure if this is a rounding thing, or if it is just a nice number that a lot of teams settled on. Either way, there are a handful of college football cathedrals with 90,000+ capacity, but most are under 60,000.

    And, believe it or not, the size of your stadium matters. There is indeed a relationship between stadium size and the performance of your team at home, although this can almost certainly be chalked up to the best, most-historic teams in college football growing their stadium size over time to accommodate the demands of their fans. And whether that drives good recruits and good results, or vice-versa, the best teams typically play in the largest stadiums (in the past 20 years).

    Boxplot of stadium capacity and home winning percentage. There is a slight positive relationship, and as stadium capacity increases, the uncertainty of results is narrower and teams tend to have a winning season at home.

    As you can see, there is a slight positive relationship, and as stadium capacity increases, the uncertainty of results is narrower and teams tend to have a winning season at home. However, actual attendance at these games appears to be a more important factor.

    Boxplot of average home attendance and home winning percentage. There is a large uptick in winning percentage when you gather at least a handful of fans.

    Now we have to take this with a grain of salt, because attendance data can be murky. Some games have no attendance data at all. Others may be way off. It’s impossible to be certain, but there’s at least some positive trend between fans showing up to your games and playing well at home. This shouldn’t come as much of a surprise, although it is surprising that the difference between 50,000 fans and 100,000 seems negligible, indicating that the home field advantage of some of the largest brands in college football might not be all it’s cracked up to be. Of course, the biggest teams are also probably playing some pretty tough opponents in those mega stadiums.

    Boxplot with average percent of stadium capacity filled for each team rounded to the nearest tenth, and home winning percentage on the y axis. There is, again, a positive relationship. Average attendances below 50% or above 100% were filtered out for low sample size.

    When we put all of these relationships—stadium capacity, raw attendance, and percent of the stadium filled—side by side, we see that they all have a pretty similar relationship to one another and a pretty similar relationship to wins.

    Side-by-side scatterplots of capacity, attendance, and percent of total capacity on the x-axis vs. home winning percentage on the y-axis. All have similar positive linear relationships.

    Empty Stadiums

    Let’s quickly take a look at how this changed in 2020. We saw previously that there is some relationship between stadium size and home record, but most of that is likely attributable to the people inside that big stadium, not just the stadium itself. In 2020, look at how that advantage disappeared as those intimidating and loud fans became smiling cardboard cutouts.

    Multiple scatterplots of stadium capacity and home winning percentage, grouped by season from 2000 – 2020. In 2020, the relationship between these two variables went from slightly positive to almost 0.

    In almost every season since 2000, the correlation between stadium size and home record was between 0.2 and 0.4, and once as high as 0.41. In 2020, it dropped to a historical low of just 0.08, meaning there was almost no relationship between how big a team’s stadium was and how well they played at home. The factor that changed here, of course, was the fans, not the stadiums. And this data includes even the teams that allowed fans for some or all of 2020, so the correlation could have been even lower than this.
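
    As a rough illustration of the per-season comparison above, the sketch below computes a Pearson correlation between stadium capacity and home winning percentage for each season. The rows are made-up toy values, not the data behind the plot, and `pearson` and `correlation_by_season` are hypothetical helper names. It also assumes Pearson is the correlation being reported, which the text does not state.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical rows: (season, stadium capacity, home winning percentage).
# The toy values exaggerate the contrast between a normal year and 2020.
rows = [
    (2019, 30000, 0.40), (2019, 50000, 0.55),
    (2019, 80000, 0.70), (2019, 100000, 0.85),
    (2020, 30000, 0.60), (2020, 50000, 0.50),
    (2020, 80000, 0.50), (2020, 100000, 0.60),
]


def correlation_by_season(rows):
    """Group rows by season and return {season: correlation}."""
    by_season = defaultdict(list)
    for season, capacity, home_pct in rows:
        by_season[season].append((capacity, home_pct))
    return {
        season: pearson([c for c, _ in pairs], [h for _, h in pairs])
        for season, pairs in sorted(by_season.items())
    }


season_corr = correlation_by_season(rows)
for season, r in season_corr.items():
    print(season, round(r, 2))
```

    With real data, each season would contribute one row per team, and the 2020 value is the one to compare against the earlier seasons.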

    So from this plot alone we can see that fans matter a lot more than the stadium size.

    So next, let’s try to tease out how much each of these factors matters in the grand scheme of winning football games at home.

    Controlling For Team Strength

    In order to accurately assess the importance of stadium capacity and fan attendance, I need to control for something important: overall team strength. And by control, I really just mean including it as a factor in the regression model. Without it, the model might treat stadium size as more important than it is, simply because it doesn’t know that better teams tend to have bigger stadiums to begin with. By adding in a variable that indicates how good a team is, we can look at similarly-ranked teams and see if their stadium size has an impact on their performance, given that their overall team strengths are similar.

    To do this, I’m choosing to use a team’s away record as a proxy for overall team performance. I could use Elo ratings, or even AP Poll rankings, but I feel like away performances are fair because you have no home-field advantage and it’s just up to your team to perform. This assumes that the quality of away opponents is fairly evenly distributed among teams of different stadium sizes. This may not always be the case, but the majority of the season (3/4 of it) is reserved for in-conference matchups, which are the better games, and non-conference games, if they are against easy opponents, are usually reserved for early-season home games (or that week where Alabama takes a week off in November to beat Grambling State by 80).
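
    To make the idea of controlling for team strength concrete, here is a minimal sketch using invented numbers. Home record is constructed to depend only on away record, while stadium capacity merely tracks away record. Regressing home record on capacity alone gives capacity a positive slope; adding away record to the model drives the capacity coefficient to roughly zero. The helpers `solve` and `ols` are hypothetical, hand-rolled stand-ins for a statistics package, and none of the values come from the real data.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Invented toy data. Capacity is in tens of thousands of seats.
away_record = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
capacity = [3.0, 2.5, 4.0, 5.5, 5.0, 7.0, 8.5, 9.0]
# By construction, home record depends on team strength (away record) only.
home_record = [0.15 + 0.9 * a for a in away_record]

capacity_only = ols([capacity], home_record)
with_control = ols([capacity, away_record], home_record)

print("capacity slope, no control:  ", round(capacity_only[1], 4))
print("capacity slope, with control:", round(with_control[1], 4))
print("away-record slope:           ", round(with_control[2], 4))
```

    The toy data has no noise, so the effect is cleaner than anything real data would show, but the mechanism is the same: once the model can see team strength, capacity has nothing left to explain.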

    Just to confirm, how teams play at home and away are related, and this relationship is fairly strong.

    Correlation between home winning percentage and away winning percentage, represented as boxplots for each 10% winning percentage. There is a fairly strong positive relationship between the two, with teams that win 90% or more of home games rarely losing more than 25% of their away matchups.

    What Drives Home Game Attendance?

    An interesting finding of my data exploration was that home attendance actually correlates more to away performances than home performances. The reasoning behind this might be that fans watch their teams play well on the road, and then get a desire to go see them in person when they’re home. If they watch their team on TV and they stink, they’ll be less motivated to go buy a ticket to see that trainwreck in person.

    Home and Away winning percentages plotted against average percent of stadium capacity at home games. Performing well away from home actually leads to higher likelihood of sellout crowds than home performances.

    So teams that play completely average and win 50% of games are better off doing so away from home, as it leads to nearly 15% higher home capacity than doing the same at home.


    Building A Regression Model

    Okay, let’s finally get into it and put some numbers to these relationships. We’re going to build a linear regression model to see if the relationship between stadium size and home record is significant, along with how significant the other factors like attendance and away record are in determining home record.

    Since we’re trying to get an idea of how important each feature is in the regression, we need to ensure that no two variables are too highly correlated. I used the Variance Inflation Factor (VIF) to check this, and found that attendance and stadium size are too similar to use both (their correlation is 0.90). Stadium size and average percent of the stadium filled are not, so I dropped raw attendance and instead went with the following three variables:

    • Stadium Size
    • Average percent of capacity filled
    • Away Record
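
    For a sense of scale on the VIF check, here is a small worked sketch. VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the others. When there is only one other predictor, R² is just the squared correlation, which is the simplified case shown here. The function name is hypothetical, and the cutoff used in this analysis is not stated; values around 5 to 10 are often quoted as rules of thumb.

```python
def vif_two_predictors(r):
    """VIF when a predictor is regressed on exactly one other predictor.

    In that case R^2 equals the squared Pearson correlation r^2,
    so VIF = 1 / (1 - r^2).
    """
    return 1.0 / (1.0 - r ** 2)


# r = 0.90 is the attendance vs. stadium-size correlation quoted above.
vif_attendance_capacity = vif_two_predictors(0.90)
print(round(vif_attendance_capacity, 2))  # 5.26
```

    With three or more predictors, the R² would instead come from a regression on all of the remaining predictors at once.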

    I’m using this to predict a team’s average home record over the course of their games at that stadium. For this reason, I’m using a linear model, since the response variable is continuous from 0 to 1. Note: in hindsight, the best model for this situation would be a beta regression, which is designed for continuous response variables between 0 and 1 like mine. Not knowing much about it, though, I’m going to skip it for the sake of simplicity. I’m not doing rocket science here, after all.

    Results of linear model. Only the team’s away record was significant.

    After running the model with those three variables, we find that neither stadium size nor percent of capacity is a significant factor when team strength (via away record) is included in the model. This indicates that regardless of the size of stadium a team plays in or how packed the house is on average, they will perform to their best abilities over time. This is fair, but what if we look at an individual game? Can we gain any predictive power by factoring in the crowd size or venue when trying to determine the outcome of one game?

    This time, I chose to use logistic regression to see if a model that included the attendance, stadium capacity, and percent of the stadium filled could outperform one that solely relied on the Elo rating of the two teams. We’re using logistic regression because we’re trying to predict a simple True/False of whether the home team won the game or not. This data excluded 2020 because the attendance data is far from complete, so we’ll just assume it’s a normal year. And in a normal year, it turns out that none of those variables are useful relative to Elo ratings alone.

    Results of logistic regression. Only the Elo rating was significant.

    The size of the stadium was the closest variable to being significant, but I would guess that the model was recognizing some of the larger stadiums and starting to correlate that to a better outcome for the home team, when in fact the physical size and capacity of the venue doesn’t mean much. Similarly, the number of people in the stadium or percent of the stadium filled didn’t seem to matter either.
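
    One reason Elo alone is such a strong baseline is that an Elo rating difference already maps to a win probability through a logistic curve. The sketch below uses the conventional Elo expected-score formula with a 400-point scale. The scale, the function name, and the optional `home_field_bonus` parameter are assumptions for illustration, since the exact Elo setup behind these ratings is not described here.

```python
def elo_win_probability(home_elo, away_elo, home_field_bonus=0.0):
    """Conventional Elo expected score for the home team (400-point scale).

    home_field_bonus is a hypothetical rating adjustment for playing at
    home; it defaults to 0, meaning no adjustment.
    """
    diff = (home_elo + home_field_bonus) - away_elo
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


print(elo_win_probability(1500, 1500))            # 0.5
print(round(elo_win_probability(1700, 1500), 3))  # 0.76
```

    A logistic regression on the Elo difference is essentially re-estimating the steepness of this curve from game results, so extra variables like attendance only help if they carry information the ratings do not already contain.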

  • Home-Field Advantage in 2020? It’s Complicated

    You’ve heard of home-field advantage, but it’s always in the context of the advantage that a home-crowd gives a team. But what if that stadium were empty? Well sure enough, we saw just that last year.


    Home-field advantage changed in 2020. That’s for sure. But by how much and why is less certain. Take, for instance, the distribution of home records over the past 6 seasons. As you’ll see, 2020 saw more teams with weaker home records, some getting shut out completely, a rare occurrence in past years.

    Density plot of home winning percentages in college football over the past six seasons. 2020 saw more teams with home records below 50%.

    However, this doesn’t tell the full story, because, as we know, in 2020 teams played abbreviated schedules and dealt with last-minute cancellations, leading to a smaller slate of home games for some teams. Here’s the distribution of the number of home games played in 2020 vs. 2019.

    Distribution of number of home games played and count of teams in 2020 vs. 2019. In 2019, every team played 5 or more home games, while last year, 69 teams played 4 or fewer.

    So more than half of D-I teams played 4 or fewer home games. This led to a lot of variability in their results. Almost every conference also played a conference-only schedule last year, upping the quality of their competition in those home games. Naturally, we’d expect their home record to drop as the average quality of their opponent went up.

    When we filter for only those teams that played at least six home games in 2020, we get a much different story.

    Density plot for home winning percentage for the past six seasons, filtered for teams that played at least six games at home each season. 2020 has a higher density at the right side of the graph, and a lower density in the middle of the graph for 50% win rates.

    Well now what? This looks like teams actually played better at home when they got their 6+ games in. And in fact, they did play better on average at home in 2020 than the overall average in the previous five seasons. Teams in 2020 won 71% of their home games when they played six or more of them. From 2015-2019, that number was 64%. The difference is statistically significant with 95% confidence.
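
    One common way to check a difference like 71% vs. 64% is a two-proportion z-test. The sketch below is an assumption about method, since the test actually used is not stated, and the win and game counts are hypothetical, chosen only to reproduce the two quoted rates.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(wins_a, games_a, wins_b, games_b):
    """Two-sided z-test for a difference in win rates (pooled standard error)."""
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    pooled = (wins_a + wins_b) / (games_a + games_b)
    se = sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 142 of 200 home games won (71%) for the 2020 group,
# 1920 of 3000 (64%) for the 2015-2019 group. Not the real totals.
z, p_value = two_proportion_z_test(142, 200, 1920, 3000)
print(round(z, 2), round(p_value, 3))
```

    The verdict depends heavily on the number of games in the smaller group, so the real counts would need to be plugged in before reading anything into the result.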

    That being said, when you include all teams, regardless of how many home games they played, home records in 2020 were statistically significantly worse than in the preceding five seasons. So when teams were able to get all their games in, they saw improved home-field advantage, and when they didn’t get their normal games in, they struggled at home.

    So how can we make sense of this trend? I don’t know that we can entirely understand the difference. Only 28 teams out of 127 played 6 or more home games in 2020. Ten were from the ACC, and the rest were a mix of Sun Belt, Independent, and Conference USA teams, plus a few from the Big 12 and the American Athletic Conference. The overwhelming majority of these teams were from the South, where eased restrictions meant more fans at home games, which could have given them improved home-field advantage.

    Elo Ratings between the two groups were almost identical going into 2020, but were 50 points higher when the season ended for the teams that played all their home games.

    We also need to remember that conferences like the Big Ten only played 9 games, all in-conference. So we would expect their win percentage to decrease significantly in a season where they effectively lost one or two near-guaranteed home-wins against non-conference cupcakes. Who knows what would have happened with an extra three games. We saw teams start off slow and finish the season on a run, adjusting to the new normal of the 2020 season. We also saw teams fall off, falling victim to opt-outs, infections, and lack of motivation.

    So while, in part, the full-season teams played better than usual, it is likely that had more teams gotten in a full-season’s worth of games, they would have dragged the home-field advantage down to below-average levels. There is no doubt that the overall landscape in college football favored the away team more than in any other season in at least the past 20 years.

    This year, we’ll see how much that home-field winning percentage rebounds as fans return in full force in most stadiums. And we can’t wait to see it.

    Have a theory about why those 28 teams played better at home in 2020? Email me at kyle@staturdays.com or tweet us @Staturdays on Twitter.

  • My Issue with Sports Twitter

    I have been contemplating for a while why it seems that sports fans are so quick to want someone fired on their favorite teams when times are rough. On the surface it seems obvious: they want accountability for poor performance and for times to get better. But I believe it goes much deeper than that.

    If you are on United States sports twitter like I am (and you are reading this, so you probably are) you have noticed a shift in the replies on sports tweets in the last few years. Replies have been diluted to trolling ‘copy and paste’ punchlines like “not a real sport” or “Mickey Mouse LeBum James is washed”. These meaningless comments, written by faceless fan/parody accounts with sub-50 followers, have become so prevalent that they are either 9 of the 10 top replies or 90% of the replies total to a sports post. It is becoming insufferable and shutting out any real conversation on the sports twitter landscape.

    At the same time, tweets and other social media posts calling for coaches (head or assistant) to get fired have also been on the rise. If you don’t believe me, search twitter for your favorite coach’s name preceded by the word ‘fire’ and you will see literally hundreds of individuals calling for their replacement. I did this exercise during the 2019 college football season and found tweets calling for the firing of Ryan Day, Nick Saban, Dabo Swinney, Ed Orgeron, and Lincoln Riley. My point then was to show the ridiculousness of calling for the firing of a typically successful head coach after one loss; but my point now is to highlight how often this is happening. Coaches often do get fired and changes have to be made, but these are typically decisions that come after lots of deliberation by the front offices of teams. If coaches were fired every time fans asked for it to happen, the best coaches in the world would be jobless.

    So, what do these two things have in common?

    I believe that the rampant trolling on twitter has changed the landscape of sports fandom in the United States in an extremely negative way. For younger audiences, following teams has become more and more a part of their personalities in the last few years. Look at the bio on a typical twitter account that interacts with @ESPN or @BleacherReport. Usually, it’s the state they live in followed by their favorite four sports teams and their records – often the team’s record is in their name. Being a fan of these teams is not just something they enjoy – it is something that defines them. Because of this they are quick to defend these teams, and sometimes players, just as quickly as they would defend themselves. Their least favorite player, say LeBron James or Aaron Judge, becomes a legitimate enemy to them. The failure of a team feels more personal than it ever has.

    One group that fans don’t typically become as attached to is a team’s coaching staff, due to the short tenure that is the nature of those jobs. Pinning the issues on the coaches is usually a pretty sophomoric argument, but that doesn’t take away from how convenient it can be. This way you don’t have to get angry at players as often, and failures can feel more distant. After any loss for a sports fan, emotions tend to be high. You can hedge this bet by calling for the firing of a coach early and never lose an argument again. If the team you like wins you get to be excited; if they lose you get to say, “I told you so”.

    In professional sports only one of thirty teams can win it all, and in college the odds are even lower. It can feel good to say “your team will never win” despite the odds being massively on your side. If that team does win, it can always be explained away by “bad refs” or “easy schedule”. You can say, “my team will never win with this guy as our coach”, and odds are you’re right.

    My convoluted point is this: we need to stop allowing sports teams to become so integrated into our personalities. We need to stop allowing trolls on twitter to force us to hedge our favorite teams with why they won’t win and why it’s at least better on my favorite team than it is on yours. I guarantee that even the athletes competing don’t get as emotional as some fans. You are allowed to get excited for a regular season win and you’re allowed to think that your team is the best there is. You don’t have to answer to @NetsFan697(22-10) saying that Joel Embiid sucks because you know he doesn’t.

    I beg fans, enjoy sports as they are intended to be enjoyed. Trust that the right moves will be made eventually to bring your favorite team to the playoffs and please have fun during wins and shrug off losses. Engage only in real debate with individuals arguing in good faith. Anything less than this is pointless.

    Hopefully soon the discussion on twitter can again resemble something worth participating in.