Kyle Bennison

  • How One setuptools Release Broke Everything, and What We Can Learn From It

    If you work in Python package development or maintenance, then your world may have stopped for a few hours on Monday afternoon.

    To most, it probably felt like one of those “random” errors that pops up one day and is gone the next. But behind the scenes, hundreds of developers were arguing back and forth on the public GitHub repository of setuptools over the best way to rectify a breaking change that broke more than anticipated.

    What Happened?

    It all started on Monday morning with a major release of setuptools that began failing builds over what had previously been only a deprecation warning: the use of dashes or uppercase characters in your setup.cfg keys.

    Previously, setuptools would quietly fix your file for you, but for some reason in v78 they decided to start enforcing it.
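
    If you want to check your own setup.cfg for the offending style, a minimal standard-library sketch along these lines will do. It naively scans every section, so keys belonging to third-party tools that legitimately use dashes will also show up:

    ```python
    import configparser

    # Parse setup.cfg as-is; disable interpolation so stray '%' signs don't error,
    # and keep keys exactly as written instead of lowercasing them.
    config = configparser.ConfigParser(interpolation=None)
    config.optionxform = str
    config.read("setup.cfg")

    for section in config.sections():
        for key in config[section]:
            if "-" in key or key != key.lower():
                # Suggest the underscore/lowercase spelling setuptools expects.
                print(f"[{section}] {key} -> {key.lower().replace('-', '_')}")
    ```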

    Funnily enough, the 78.0.0 release initially failed because their own tests were failing due to a dependency, requests, not complying with their new rule. Rather than treating this as a sign to pause and consider the potential impact, they commented out the failing tests.

    I say “they”, but in reality this was a rash decision by one individual developer, who opened and merged their own PR with no reviewers, no requests for review, and no comments.

    Once this release made it onto PyPI, the trouble for everyone else started.

    It seems innocent enough: starting with v78, you need to correct your naming. It’s a simple fix. The problem was twofold, though:

    1. If you didn’t pin a top version of setuptools in your build specification, the newest version started being used by default, potentially breaking your package.
    2. Even if you fixed your own package, if any of your dependency packages were out of compliance (and not pinning their version of setuptools), then they would also cause your build to fail, particularly if their package needed to be built from source (as is often the case on Linux). This is why the pain was felt most immediately in a lot of people’s CI workflows on GitHub.

    So now there was this cascading effect of having to fix your own package (no big deal, usually), but then having to ask others to fix their packages (many of which–as many in the comments of the faulty PR pointed out–were no longer being actively maintained).

    Again, all of this was already apparent when the setuptools tests themselves failed because requests (a Python Software Foundation package) was out of compliance.

    How Was It Resolved?

    Fairly quickly, issues started popping up in the repos of packages that were out of compliance, asking them to fix their setup.cfg files. It’s hard to say how many packages were affected, but rest assured it would’ve taken weeks to months to get everything resolved across all the packages out there. People quickly realized this was too big an issue to handle that way. In the case of one package, pyspark, the fix was opened in a PR, but the pyspark tests take several hours to run (and they failed), so it was not exactly a quick fix, and it needed to be repeated on several still-supported minor versions of the package.

    So, the consensus in the setuptools PR discussion was to revert the changes and yank the bad versions in the meantime. Thankfully this only took a matter of hours, the new release succeeded, and things quickly went back to normal. However, it revealed a lot of the flaws in open-source software development and the inherent risks of relying on tools that we take for granted on a daily basis.

    What Did We Learn?

    1. Pin a Version Range

    If every package had capped setuptools at a version they knew they were compatible with, then this wouldn’t have happened. Of course, we can only control what we do in our own package. In this case, pinning setuptools ourselves wouldn’t have helped if our dependencies didn’t do the same, but that kind of cascade is the exception; generally, pinning our own requirements saves us from running into breaking changes unexpectedly. For GitHub Actions, pin a specific, immutable commit hash.
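
    In a pyproject.toml, that cap is a single line under [build-system], something like requires = ["setuptools>=61,<78"], where the exact bounds are whatever you have actually tested against. If you want to sanity-check how a range like that behaves, here’s a minimal sketch using the packaging library; the versions below are purely illustrative:

    ```python
    from packaging.specifiers import SpecifierSet
    from packaging.version import Version

    # An illustrative range: new enough for modern metadata, capped below the
    # major release that introduced the breaking change.
    build_requirement = SpecifierSet(">=61,<78")

    for candidate in ("61.0.0", "75.8.2", "78.0.1"):
        allowed = Version(candidate) in build_requirement
        print(f"setuptools {candidate}: {'allowed' if allowed else 'excluded'}")
    ```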

    2. Don’t Ignore Deprecation Warnings

    While I still think this breaking change was handled poorly, the fact is that this deprecation warning had been out there since 2021. When you ignore deprecation warnings, you’re taking a risk, unless your dependencies are completely pinned down.
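
    One cheap way to stop ignoring them is to promote deprecation warnings to errors in your test runs, so they fail loudly years before anything is actually removed. A minimal sketch; legacy_call is a stand-in, not a real API:

    ```python
    import warnings

    def legacy_call():
        # Stand-in for any dependency code path that is on its way out.
        warnings.warn("dash-separated keys are deprecated", DeprecationWarning, stacklevel=2)

    # Promote DeprecationWarning to an exception so CI fails now, not at removal time.
    warnings.filterwarnings("error", category=DeprecationWarning)

    try:
        legacy_call()
    except DeprecationWarning as exc:
        print(f"caught during tests: {exc}")
    ```

    pytest exposes the same switch via its filterwarnings ini option or -W error::DeprecationWarning on the command line.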

    3. Choose Your Dependencies Wisely

    When designing your system, build around well-supported and actively maintained projects. Leveraging fringe packages with little support or below their first major version release is risky, especially at the enterprise level where reliability matters.

    4. Respect SemVer

    For all the criticism, one thing you can’t dispute is that setuptools followed SemVer. Most people cannot say the same. If you are introducing a breaking change, you must increment the major version number. This indicates to your users that it is not necessarily safe to just bump the version without checking things out first. You can tell that most people do not respect SemVer because most packages, even long-standing ones like numpy and pandas, are still only on version 1 or 2. The odds that they have not incorporated at least some, if not many, breaking changes across all those patch and minor versions are low.

    5. Understand Why Your Tests Broke

    Writing tests can be hard. Understanding tests is even harder. Debugging odd failures, especially ones that pop up during CI/CD, can feel insurmountable. However, it’s an essential skill and, as we saw on Monday, can be the difference between catching a crucial issue and unleashing hurt on others. Had the setuptools developers actually resolved the cause of their test failure rather than commenting it out, they would’ve seen that the fix was going to cascade to hundreds of other packages, and maybe they would have held off on the release until they could effectively communicate the change to others.

    6. Don’t Merge Your Own PRs

    This might be the golden rule in my opinion; it really is that important. You cannot just shoot from the hip, especially on a public repository that is such critical infrastructure for the Python packaging community. Request reviewers. Get a second and third pair of eyes on your code. Had they done so, someone certainly would’ve questioned whether commenting out a test was the best course of action.

    7. Deprecation Warnings Need to Come With a Timeline

    Crazily enough, a user in the discussion tab of the setuptools repo predicted this exact situation almost 3 years ago. They asked how they were supposed to determine which deprecation warnings were most urgent, noting that the one in question about naming syntax of config keys seemed “minor”.

    If you are planning to deprecate something, saying “will be deprecated in a future version” isn’t specific enough. You need to set a specific version or a specific date by which action needs to be taken.
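
    In Python, that can be as simple as putting the removal version (or date) in the warning message itself. A minimal sketch; the function, the flag, and the v3.0 timeline are made up for illustration:

    ```python
    import warnings

    def load_config(path, use_dashes=False):
        """Hypothetical library function with a clearly scheduled deprecation."""
        if use_dashes:
            warnings.warn(
                "use_dashes=True is deprecated and will raise an error in v3.0 "
                "(planned for June 2026); rename your keys to use underscores.",
                DeprecationWarning,
                stacklevel=2,
            )
        # ... load and return the configuration ...
        return {}
    ```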

    Additionally, you need to weigh the benefits and drawbacks of implementing these deprecations. When I told my coworker about the issue and the scope of people affected, his response was “all this over a stylistic choice?” Some things are not worth the trouble of enforcing if there’s no security, performance, or functional benefit to implementing it.

    Packages can throw out so many warnings that you become numb to them, so you really need to pick and choose your battles as a package developer.

    And the next time someone brings up the potential for something to maybe break a year or two from now, don’t roll your eyes! We just saw it happen, and it is destined to happen again soon.

    Is Open Source Actually The Holy Grail of Security?

    I’ll finish with a final thought: I tried to explain this situation to my fiancée Raychel: how one random guy in the UK brought down our packages and so many others around the world. She couldn’t believe that my company relies on the packages of so many others outside of our control, and on the contributions of random people around the world whom we’ve never met. “That doesn’t sound very secure,” she said, parroting back a phrase I often say to her in relation to her job.

    “No. Open source is the most secure because everyone can see it.” I told her this because it’s what I’ve been told. But between the setuptools issue that morning and the tj-actions security incident the week before, I was questioning this premise.

    Sure, the issues can get rectified fairly quickly once they’re noticed, but usually the damage is already done. The data was stolen, secrets exposed, or time lost. If one random guy in the UK could shut down packaging work for an afternoon, what else was possible? I think open source is for sure still better than the alternative, but it has its flaws, and some tightened guidelines around these critical packages–such as requiring PR reviewers–would be a really simple step in the right direction.

  • Mahomes Isn’t The Only One Flag-Baiting

    There was a lot of noise this weekend—and rightly so—about Patrick Mahomes baiting defenders into late hits in hopes of drawing flags. He was able to get one early on in the game, but didn’t fool the refs later on when he loitered on the edge of the boundary and then flopped out of bounds once defenders arrived.

    For a lot of fans, I think the frustration came fast because it was reminiscent of another moment earlier in the season when Mahomes drew a foul on another (questionably) late hit.

    This was in the 3rd quarter:

    And then the no-call in the 4th quarter:

    This was a similar play that went viral earlier in the season where Mahomes danced near the sidelines and turned it into a big gain:

    To be fair to Mahomes, there were many more examples (just search “Patrick Mahomes late hit” on your favorite social media platform) throughout the season where flags were not thrown for hits near the sideline.

    Interestingly, if you search the same terms with “Josh Allen”—probably the second most popular quarterback in the NFL—it returns far fewer results. So it does seem like Mahomes is a bit of an anomaly, due to his outsized celebrity (maybe from all the State Farm commercials).

    But I did want to try to find out if there was truly some bias towards Mahomes and the Kansas City Chiefs in general when it comes to these late hit fouls. So I took a look at the play-by-play data from nflfastR and FTN Data via nflverse, which charts certain movement data like whether a QB went outside the pocket.

    Before we begin, there are two terms to define here. A QB scramble is defined by nflverse and is always a run play: basically, any time a pass play turns into a QB run. The out-of-pocket flag comes from the FTN charting data and marks any play where the QB leaves the pocket.

    A QB scramble almost always means the QB is out of the pocket, but the QB being out of the pocket does not imply a QB scramble. In fact, an out-of-pocket play is a pass 68% of the time and a run 24% of the time, and is a QB scramble 25% of the time.

    The base dataset is all plays where a penalty was called on the defense that were not interceptions or fumbles lost.

    First off, I looked at unnecessary roughness and roughing the passer penalties specifically, since these are the two penalties commonly called for late or illegal hits. According to the NFL rulebook, roughing the passer can be called inside or outside the pocket. If a QB is outside the pocket and on the move, they are allowed to be hit low or high (unlike when they are in the pocket); however, if they stop moving and return to a “passing posture”, they can no longer be hit low or high.
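
    For anyone who wants to poke at the same slice of data, here is roughly how the base dataset and the penalty filter can be assembled in Python. This is a sketch, not necessarily how the analysis above was run: it assumes the nfl_data_py loader and the standard nflfastR column names (penalty, penalty_team, defteam, penalty_type, penalty_yards), so double-check the nflverse data dictionary before trusting it:

    ```python
    import nfl_data_py as nfl

    # nflfastR play-by-play for the season in question.
    pbp = nfl.import_pbp_data([2024])

    # Base dataset: defensive penalties on plays that were not turnovers.
    base = pbp[
        (pbp["penalty"] == 1)
        & (pbp["penalty_team"] == pbp["defteam"])
        & (pbp["interception"] != 1)
        & (pbp["fumble_lost"] != 1)
    ]

    # The two calls associated with late/illegal hits (check the exact labels
    # in your copy of the data).
    late_hits = base[base["penalty_type"].isin(["Roughing the Passer", "Unnecessary Roughness"])]

    # Counts and yards drawn by each offense; joining FTN's charting data (which
    # flags whether the QB left the pocket) narrows this to rollouts, and
    # filtering on qb_scramble == 1 isolates the scramble plays.
    summary = late_hits.groupby("posteam")["penalty_yards"].agg(["count", "sum"])
    print(summary.sort_values("count", ascending=False))
    ```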

    The findings were interesting. Looking at plays where the QB was out of the pocket, Buffalo actually led the way with five unnecessary roughness or roughing the passer calls on the defense. KC only had one recorded.

    Unsurprisingly, some of the teams leading the way in this area have very dynamic, mobile quarterbacks: Josh Allen, Kyler Murray, Jayden Daniels, Justin Fields/Russell Wilson. However, the numbers are pretty low overall so far.

    Of course, the data is not perfect. There may be calls missed and charting data on in vs. out of the pocket is not always reliable either, plus it’s not guaranteed that the unnecessary roughness calls were called for hits on the quarterback. But we can still learn a lot from the general trends.

    Let’s cut the data a few more ways. First, let’s normalize for how many out-of-pocket plays a team has to see who’s getting the most calls proportional to the number of rollouts they do.

    All the below plots will be looking at the roughing the passer and unnecessary roughness calls only.

    Buffalo still leads the way on penalty calls when the QB is out of the pocket on a per-play basis, averaging about 0.3 penalty yards per out-of-pocket play, or about a call every 50 rollouts. As you can see, KC is near the bottom in this category.

    When we look at totals, the result is mostly the same. However, we’re only looking at one or two calls per team. Note that it is possible for the penalty to be less than 15 yards in both cases because they can be called within 15 yards of the goal line.

    Next, let’s look at QB scrambles where the QB becomes a runner. At this point, we shouldn’t see any roughing the passer calls but will still see the unnecessary roughness calls for late hits. There’s a lot less data here; there were only 10 unnecessary roughness calls out of 1181 QB scramble plays.

    Minnesota and Arizona both drew two of these penalties while the other teams had one. KC is notably not on this list (although please fact check me here because, as I said, the data may not be perfect. I was able to find two instances but they occurred on turnovers.)

    Last, let’s take off all the reins and look at these two penalties across all plays, including those where the QB stays in the pocket.

    Here, Miami creeps up as a big beneficiary of these calls (is Tua Tagovailoa’s injury history playing a factor here, perhaps?). Again, the Chiefs are very middle of the road, showing no evidence of favoritism toward Mahomes.

    On a total yardage view, there isn’t much change.

    Just for fun, let’s see who’s getting the most flags called against their opposing defense overall, without any filters on turnovers or penalty types.

    Here, we see that Minnesota, Washington, Dallas, and Buffalo drew the most penalties while Jacksonville, Indianapolis, and Detroit drew the least.

    While this is interesting, nothing really stands out and the distribution of penalties across the teams is fairly normal.

    So, to summarize, I think fans have a short memory, and also a memory that highlights the moments and players that are most infuriating and most impactful to the result of the game. Patrick Mahomes is a world-class player and he’s on TV all the time. He’s also had some moments that have either A) aired on national television or B) gone viral, where he has seemingly flopped or earned an undeserved call from the refs.

    That being said, I found no evidence in any of the data that the Chiefs get any more calls than other teams. In fact, they rank near the middle or bottom of most of the charts above.

    I agree that the NFL does need to figure out how to regulate the “QB as a runner” situation a bit better, because it is hugely disadvantageous to the defense when quarterbacks get these special protections near the boundaries that other players seem not to get. Whether it’s actually the case that these calls are made more often on quarterback runners than other runners is a whole other investigation (a quick glance gives me the impression that this is not actually the case).

  • Never Write “dev” Anywhere, Ever

    I was going through our deployment process this week and hit a snag: I was using the dev host in our production environment. And then another one, and another, and another… It turned out that we’d been breaking our own rules, not all at once, but slowly and surely. Sometimes it was a matter of convenience. Other times, a matter of bad assumptions. Each time it was innocent enough, but compounded, it created a nightmare when it came time to deploy. We’d been hardcoding dev values.

    dev crossed-out and replaced with the function get_env().

    While it can feel innocent enough to hardcode “dev” here and there to speed things up, I’d argue that it’s almost always a bad idea if you have any hope of using any of the code in prod someday. It may take a little more upfront thought to get started, but having a plan for switching environments early on will save you a lot of tricky debugging down the road.

    How to Handle Environments

    There are three keystone decisions you need to make in order to properly handle different environments. They are:

    1. Separate environments completely
    2. Decide the source of truth for the environment
    3. Use switching logic and a secret store to manage values

    Separate Environments Completely

    dev, qa, and prod should not just be folders; they are entirely separate environments. In GitHub, only your main branch should deploy anything to prod, and it should do so via CI/CD (e.g. GitHub Actions).

    If you are using Databricks, this would mean having a dev workspace, a qa workspace, and a prod workspace. In GCP, this would be different projects.

    In addition, each environment should have its own service principal that is allowed to run your code, and whose permissions are used when running said code.

    Decide the Source of Truth for the Environment

    You need a way for your code to programmatically know which environment it is in. You can then use this info to handle your switching logic and grab the right information that you need, such as service principals, passwords, hosts, etc.

    How do you know you are in dev? And similarly, how do you know you are in qa or prod? It may be a host URL, or some environment variables that are only set in one environment or that have one value in dev and another in prod.

    We do this with a combination of compute policies and environment variables. The policy is set at the workspace (environment) level and automatically applied to any compute in that environment. The policy sets environment variables, namely one we call ENV_CODE, which takes the value dev, qa, or prod depending on the environment.
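
    The switching function itself can be tiny. A minimal sketch, assuming ENV_CODE is the variable your compute policy sets, as in our setup above:

    ```python
    import os

    VALID_ENVS = {"dev", "qa", "prod"}

    def get_env() -> str:
        """Return the current environment code set by the compute policy."""
        env = os.environ.get("ENV_CODE", "").lower()
        if env not in VALID_ENVS:
            # Fail loudly: guessing an environment is how "dev" ends up hardcoded in prod.
            raise RuntimeError(f"ENV_CODE must be one of {sorted(VALID_ENVS)}, got {env!r}")
        return env
    ```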

    Use Switching Logic and a Secret Store to Manage Values

    Now that you know definitively where you are, you can very simply determine where to deploy code to, who should deploy it, and who should run it. The easiest way to manage this is to have simple switches based on the environment, and corresponding folders in a secret store (e.g. Azure Key Vault, Vault, etc.) that contain all of the relevant info for each environment (host, SP name and password, etc.).
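
    Putting the pieces together, the switching logic is a single lookup keyed on get_env(), with one secret scope per environment. All of the scope and key names below are placeholders, and fetch_secret stands in for whatever vault client you actually use:

    ```python
    import os

    # Per-environment lookup: only the *names* of the secret scopes live in code;
    # the values themselves stay in the secret store.
    ENV_CONFIG = {
        "dev":  {"secret_scope": "myapp-dev"},
        "qa":   {"secret_scope": "myapp-qa"},
        "prod": {"secret_scope": "myapp-prod"},
    }

    def fetch_secret(scope: str, key: str) -> str:
        """Stand-in for your secret-store client (Azure Key Vault, Vault, Databricks secrets...)."""
        # In real code this calls the vault SDK; here it reads an env var so the
        # sketch stays runnable, e.g. MYAPP_DEV_WORKSPACE_HOST.
        return os.environ[f"{scope}_{key}".upper().replace("-", "_")]

    def get_settings(env: str) -> dict:
        """Resolve the host and service-principal credentials for one environment."""
        scope = ENV_CONFIG[env]["secret_scope"]
        return {
            "host": fetch_secret(scope, "workspace-host"),
            "sp_name": fetch_secret(scope, "sp-client-id"),
            "sp_secret": fetch_secret(scope, "sp-client-secret"),
        }

    # settings = get_settings(get_env())  # get_env() from the previous sketch
    ```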

    If you are deploying from GitHub, you can use your branch names as inputs to your GitHub workflows if using GitHub Actions (e.g. if the workflow is kicked off from main, deploy to prod). If you use trunk-based development, you can use event triggers to determine where to deploy to; a pull request merge can deploy to dev, a workflow dispatch can deploy to qa, and a release being published can deploy to prod.
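
    As a sketch of that trigger-based routing, a deploy script can read the default variables GitHub Actions injects (GITHUB_EVENT_NAME and GITHUB_REF_NAME) and refuse anything it does not recognize. The event-to-environment mapping below mirrors the trunk-based flow described above and is a convention, not a requirement:

    ```python
    import os

    # Route a deployment based on what triggered the workflow:
    # merge to main -> dev, manual dispatch -> qa, published release -> prod.
    EVENT_TO_ENV = {
        "push": "dev",               # a merged PR lands as a push on main
        "workflow_dispatch": "qa",   # someone pressed "Run workflow"
        "release": "prod",           # a release was published
    }

    def target_environment() -> str:
        event = os.environ.get("GITHUB_EVENT_NAME", "")
        ref = os.environ.get("GITHUB_REF_NAME", "")
        if event == "push" and ref != "main":
            raise RuntimeError(f"refusing to deploy from branch {ref!r}")
        if event not in EVENT_TO_ENV:
            raise RuntimeError(f"no deployment target configured for event {event!r}")
        return EVENT_TO_ENV[event]

    if __name__ == "__main__":
        print(target_environment())
    ```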

    Why Environments are Important

    The main purpose of environments in my opinion is to prevent humans from touching and accidentally breaking important things.

    dev is a sandbox. A bit of the wild west. It’s okay to break things here, have duplicates, be a bit messy. And importantly, it’s okay for humans to be in here running things manually.

    qa is the tidied-up version of dev, and a barebones replica of prod in some ways. Maybe not everything is running all the time, but most things you expect to find in the prod environment should also be present in the qa environment. This is a clean environment where your deployment and run processes are as close to prod as possible, so you can get a true sense of whether your end-to-end application will work (or still works). Humans should not be deploying to or running the code in qa. Deployments should happen automatically via CI/CD (triggered by a release PR, a workflow dispatch, or some other method), and service principals should run the code.

    prod is the most important: the big leagues. Everything is live and real. Everything that applies to qa applies to prod as well. The only difference is that what happens in prod matters, and most of the time it is permanent in some way. You have upstream dependencies, and someday you will have downstream dependencies.

    If you follow these steps and build them into your process early on, you’ll have far fewer headaches come deployment time, and less risk of something breaking once it makes it to production.