We recently passed the one-month anniversary of the great CrowdStrike incident of ’24. Someday, you will relish the opportunity to regale your junior administrators with tales of the amazing feats performed in those glorious days. Today probably isn’t that day, but with the dust mostly settled, I thought maybe we should take a hot second to reflect on what happened and see what, if anything, we can learn from all these shenanigans.
First and Foremost: Thank You All
I want to start with a genuine, heartfelt thank you to all system administrators out there who had to clean up this mess. You were forced to miss important life events, plans you had made, and, most devastatingly, time with your family, friends, pets, and board games. I truly hope that your organization made some effort to recognize your heroics. Alas, early in my career, I was taught an important lesson: hope is not a strategy. My intuition, therefore, says that instead of being thanked, many of you were blamed for this happening at all and then crapped on for not fixing it fast enough. I’m certain the phrase “Any update?” will be triggering for many of us for the rest of our lives.
To those people, let me say that on behalf of Patch My PC, we see you. We thank you. Not all heroes wear capes.
Unless … you wear a cape at work. Which would be absolutely awesome.
We should totally get some PMPC capes. With rubber ducks on them. Or maybe rubber ducks wearing capes.
Marketing, GET ON IT.
What the Heck Happened Anyway?
You mean other than 8.5 million or so devices being put into a boot loop of death?
CrowdStrike released their Post-Incident Review (PIR), which outlines the specific timeline of events and gives us a clear picture of what went wrong and when.
In February ’24, CrowdStrike released a new version (7.11) of their sensor, including new capabilities to protect against vulnerabilities in Windows’s named pipe and inter-process communication mechanisms. In short, they released a new version of their agent, which includes a custom device driver that runs in kernel mode. Like many of the sensor’s other features, this one is driven by a channel file that gets periodic updates to address new types of threats.
On March 5th, CrowdStrike released the first channel file for this new feature and subsequently released three additional updates between April 8th and April 24th. These were rolled out without incident.
On July 19th, they rolled out another channel file, which… did not go well. This new channel file included the first use of a specific property that had not been previously used. Using this property caused the code that interpreted the channel file to barf because it only expected 20 values. The new channel file included 21 values, which led to an out-of-bounds read error that caused the code to throw an unhandled exception. Since this code was running in the kernel, it took the OS down with it, leading to the blue screen of death we have all come to know, love, and revere as a close friend.
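For the code-minded, here is a minimal sketch of the kind of check that was missing. This is Python, not CrowdStrike's code, and the names (`EXPECTED_INPUTS`, `out_of_range_fields`, `match`) are hypothetical. Python would raise a tidy error where kernel code reads memory it doesn't own, but the idea is the same: make sure every field the content refers to actually exists before you read it.

```python
# Illustrative only: all names here are hypothetical, not CrowdStrike's.
EXPECTED_INPUTS = 20  # how many values the interpreting code was built to receive


def out_of_range_fields(field_indexes, supplied_count=EXPECTED_INPUTS):
    """Return the field indexes that point past the supplied inputs."""
    return [i for i in field_indexes if i < 0 or i >= supplied_count]


def match(inputs, field_indexes):
    """Read the requested fields, refusing content that reads out of bounds."""
    bad = out_of_range_fields(field_indexes, len(inputs))
    if bad:
        # Reject the content up front instead of reading past the end.
        raise ValueError(
            f"content reads fields {bad}, but only {len(inputs)} were supplied"
        )
    return [inputs[i] for i in field_indexes]
```

Content that only touches the first 20 fields passes; content that reaches for field 21 (index 20) gets rejected before anything is read.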
CrowdStrike pulled this channel file 78 minutes after it was released, at 1:27 A.M. Eastern Daylight Time. Any Windows device that checked for channel updates during that window would get the faulty channel file and was likely impacted. Machines that were not powered on or otherwise didn’t check for channel file updates were none the wiser. I shudder to think about the outcome in the US if this had all happened 12 hours later, with the file pulled at 1:27 in the afternoon.
As a security product, the CrowdStrike agent is loaded early in the overall boot process. It can’t protect the device if it’s not running. However, if their agent crashes, it takes your machine down during the boot process. The result was roughly 8.5 million devices in a seemingly permanent loop of trying to boot and then throwing a BSoD before trying to boot again.
How Did We Fix It?
The solutions weren’t pretty and revolved around deleting the channel file from the disk. For some devices, rebooting them a bazillion times was enough to get lucky: the CrowdStrike agent would run just long enough to get new channel files that didn’t have the problem. If that didn’t work, you would need to get access to the drive long enough to delete that file manually. A significant complication for many was the use of drive encryption solutions such as BitLocker, which require you to unlock the drive. For most companies, the process for retrieving recovery keys was not meant to scale.
In the days that followed, I saw some amazing energy and creativity across our community to help find solutions that would scale. Even Microsoft got into the game and released a recovery tool.
The Blame Game Started Early
Within hours of release, it became apparent that a major worldwide issue was happening. In a race to grab clicks, several media outlets reported this as an issue with Windows itself. By noon, Microsoft’s CEO, Satya Nadella, did something nearly unprecedented: he threw CrowdStrike under the bus on Twitter … or X … whatever.
Several security researchers jumped into the fray, investigated impacted devices, and reported that the CrowdStrike channel file was full of null values. This raised the question: How the heck did that get past any kind of validation or QA process? In their haste, these investigations overlooked the intricacies of how Windows saves a file to disk (spoiler: it’s not real-time) and how very wrong that process can go when the kernel crashes before the ones and zeros are actually written to the platter … or chip, I guess. No, CrowdStrike didn’t ship a channel file full of null values.
Days later, despite other airlines returning to normal operations, Delta Air Lines still suffered significant fallout, with thousands of travelers stuck in airports around the US. By the end of the month, Delta’s CEO was vowing to sue both CrowdStrike and Microsoft. Characteristically, Microsoft responded by saying <checks notes> ‘It looks like you just suck at IT.’ If that suit ever materializes, the silver lining will be that the stories of our sysadmin brothers and sisters at Delta will finally be told, and I suspect those stories are worth telling.
Wait, Isn’t This Microsoft’s Fault for Allowing It?
Another early cry was that while CrowdStrike might have pulled the trigger, Microsoft built the foot gun in the form of support for kernel-mode drivers. Microsoft allows trusted third parties to submit kernel-mode drivers for review and actively approves them to run alongside their own OS kernel. Several pointed to Apple’s move years ago to block third parties from running code at the kernel level. Instead, they have carved out specific APIs that can be utilized by code running outside of the kernel, thus protecting the OS from the kinds of crashes CrowdStrike triggered.
The Wall Street Journal reported a Microsoft spokesperson blaming a 2009 agreement with the European Union for why Microsoft could not do what Apple did. I find this explanation inadequate based on Microsoft’s own documentation on said agreement, which states:
“Microsoft shall ensure that third-party software products can interoperate with Microsoft’s Relevant Software Products using the same Interoperability Information on an equal footing as other Microsoft Software Products.”
This suggests that Microsoft could do what Apple did; they just can’t give their products (e.g., Defender) levels of access that third parties like CrowdStrike don’t also have.
That said, if Microsoft were to follow Apple in this regard, it would almost certainly mean limitations on what security software, including Defender and CrowdStrike, is capable of. Functionality we “enjoy” today would cease to work, and security software would become a walled garden, only doing what the Windows APIs permit. While that reduction of functionality wouldn’t run afoul of the 2009 EU agreement and would arguably better protect the kernel, it’s reasonable to think that it would lead to a renewed set of lawsuits about making breaking changes to a multi-billion-dollar security industry. For Microsoft, there’s likely no winning here.
What Lessons Can We Learn Here?
I have heard several people call for ditching CrowdStrike, and that’s an understandable reaction to such an event. However, what alternatives would you suggest? McAfee, which in 2010 did the exact same thing? Many have gleefully pointed out that CrowdStrike’s CEO, George Kurtz, was the CTO of McAfee during that outage as well. What about Defender? Do you mean the product that in January 2023 did the same thing but only managed to delete Office shortcuts from the desktop, leading to millions of users crying out, “I can’t open my email” at once?
The lesson is that software vendors should not be trusted to perform production rollouts on your behalf. When it comes to security products, however, this acts in tension with the desire to quickly and reliably push definition updates to avoid potential attacks. These two competing desires must be held in balance, as the CrowdStrike incident makes clear that the cure can sometimes be worse than the disease.
This is a lesson CrowdStrike has now learned the hard way, and they are building systems that give customers control of the channel file rollout. It’s also a lesson Microsoft learned, with Defender supporting update channels for both their engine and definition updates. If you are using another solution, ask yourself: how am I testing and rolling out that solution’s daily or hourly updates? If the answer is “I’m not,” then you’re just one QA failure away from the same kind of bad day.
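If you have never built a staged rollout, the core idea fits in a few lines. The sketch below is one common way to do it, not how CrowdStrike or Defender implement their channels: hash each device ID into a stable bucket, map buckets to rings, and make each ring wait a while after release. The ring names, percentages, and delays are made-up values for illustration.

```python
import hashlib

# Hypothetical rings: (name, share of devices in percent). Shares sum to 100.
RING_SHARES = [("canary", 1), ("pilot", 9), ("broad", 90)]
# Hypothetical soak time, in hours after release, before each ring installs.
RING_DELAY_HOURS = {"canary": 0, "pilot": 4, "broad": 24}


def ring_for(device_id: str) -> str:
    """Map a device to a ring using a stable hash of its ID."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable value in 0..99
    threshold = 0
    for name, share in RING_SHARES:
        threshold += share
        if bucket < threshold:
            return name
    return RING_SHARES[-1][0]  # only reached if the shares sum to under 100


def should_install(device_id: str, hours_since_release: float) -> bool:
    """True once the device's ring has waited out its soak time."""
    return hours_since_release >= RING_DELAY_HOURS[ring_for(device_id)]
```

Because the ring comes from a hash rather than a random draw, a device lands in the same ring for every update, which makes it easy to stock your canary ring with machines you can actually walk over to.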
Another more nebulous lesson is how to prepare for the next time. Because it’s almost assuredly going to happen again. My feed was full of pictures of brave IT staff climbing ladders to access PCs installed in particularly inaccessible places. If you are running business-critical kiosks, maybe putting a PC behind a TV hanging 20 ft in the air is not a great idea. Where the device is challenging to access, maybe some pointed use of thin clients might be an option. I’ve seen fellow sysadmins suggest using Intel vPro to provide BIOS-level remote access while others call out a history of vulnerabilities with that solution. I’m not sure what the solutions are, but even if you weren’t impacted this time, it’s worth having a meeting or two to wargame what the impacts would be. Are you an international conglomerate with 60k devices and a fully remote workforce? What’s your plan for touching every device physically? Maybe the answer is to do nothing and accept the risk because such occurrences are rare. If so, it’s worth having that email printed out and ideally signed by your leadership to trot out in times like these. I offer you this template in the hope that you will find it useful:
“We, the undersigned, realize that life happens and you cannot fully prepare for every scenario. Security vendors are going to security vendor and this can lead to bad times here at Initech. Acknowledging that these events are rare and that the cure can be worse than the disease, we agree to not completely lose our frickin minds when such inevitabilities occur. Instead of berating our staff, we will cancel our tee-times and ‘touch USB.’”
What Lessons Did Patch My PC Learn?
We plan to write something more in-depth on this topic in the future, so keep your eyes peeled. However, since our VP of Engineering sometimes works in Valve Time™, I will give you some abbreviated thoughts here.
From the start, one of our product goals was to give you, the system administrator, the needed controls for releasing updates for third-party applications to your environments. Yes, you might be tempted to throw caution to the wind and let Chrome self-update, but many have found that counterproductive. With ConfigMgr, this is relatively straightforward as we publish the updates, and you can use Automatic Deployment Rules to deploy them in any way you see fit. For Intune, we support deploying the app for both initial installation and updating existing installations using multiple assignments to achieve your desired rollout plan. As I type, our engineers are finishing the work to bring this capability to Patch My PC Cloud, our new SaaS solution.
Speaking of Patch My PC Cloud, are there any lessons to learn there? From the start, we have built a feature you will not hear us talk about externally, which we call Feature Exposure Control (FEC). If you are familiar with ConfigMgr’s pre-release features, then you already know the drill. When we build significant new features, we gate them behind a control that hides them from being globally accessible. This allows us to slow our roll by first testing internally, then in private preview with a small number of customers, and then in public preview where customers opt-in before we fully GA and light it up globally.
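To make the idea concrete, here is a rough sketch of what a gate like this can look like. To be clear, this is not our FEC implementation; the stage names mirror the rollout steps described above, and everything else (`Stage`, `is_exposed`, the tenant sets) is hypothetical.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Rollout stages, in the order a feature moves through them."""
    INTERNAL = 0
    PRIVATE_PREVIEW = 1
    PUBLIC_PREVIEW = 2
    GA = 3


def is_exposed(stage, tenant_id, internal_tenants, private_allowlist, opted_in):
    """Decide whether a tenant should see a gated feature (illustrative rules)."""
    if stage == Stage.GA:
        return True  # lit up for everyone
    if tenant_id in internal_tenants:
        return True  # internal testers see every gated feature
    if stage >= Stage.PRIVATE_PREVIEW and tenant_id in private_allowlist:
        return True  # invited private-preview customers
    if stage >= Stage.PUBLIC_PREVIEW and tenant_id in opted_in:
        return True  # customers who opted in to the public preview
    return False
```

The useful property is that widening exposure is a data change (bump the stage or add a tenant to a list) rather than a code deployment, so pulling a feature back is just as quick.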
For those using our Publisher software, we have a preview channel that you can opt into to test out the latest builds and features before they are released into the wild.
As for the catalog itself, we can’t prevent software vendors from releasing problematic software. What we can do, and do diligently, is pull such releases from our catalog quickly.