You have probably realised that this article is about the recent Crowdstrike security update. On 19th July 2024, Microsoft Windows machines at everything from airlines to banks crashed and failed to boot. Users across the globe were treated to the BSOD (blue screen of death) and could proceed no further. Crowdstrike is a security vendor providing endpoint protection to millions of devices on behalf of business clients, so this was a purely B2B issue. So, how did the world fail to spot the potential risk? Why has it taken so long to resolve? Where were the safeguards to prevent this? Today, we discuss risk management and a security update that ‘borked’ some 8.5 million computers.
Attitude to risk management
The good news is that the businesses impacted had some appreciation of risk. The very fact that they embedded Crowdstrike into their computing infrastructure demonstrates this. Whether they were fearful of malware or hacking, they had a leading platform in place to try to prevent it. Endpoint security aims to prevent the penetration and infection of the computers on which a business runs its software. It is necessary for any business, using any operating system, to protect corporate assets and customer data. By contrast, many personal users simply run anti-virus tools, spyware removers and basic firewalls. Who provides such security to businesses, and how, is of course up for debate and tender.
As with any service provider, you probably make a choice based on price, features and quality. Evidence is a factor in such a decision, such as how many infections were prevented. Crowdstrike will also have convinced many operators that their approach was superior to others. In terms of protection and prevention, that may well be the case. Furthermore, they may have successfully protected some clients for many years. Hacks, malware and ransomware would be catastrophic on many endpoint devices, particularly those in public spaces or processing payments.
The risk management failure
By now, you have probably heard varying explanations as well as official press releases from Crowdstrike themselves. However, make no mistake in thinking that this was the fault of anybody else. In fact, even the notion that Microsoft Windows was to blame arises simply because it was the operating system in use on the business machines which received the wretched update. It could just as easily have occurred on macOS or Linux machines, sharing as they do a similar architecture. This is because most modern operating systems have a kernel layer and an application layer. When the kernel layer hits an unrecoverable error, it simply halts the system. The BSOD is just the system saying that a critical error has occurred at the kernel level and that it cannot safely continue.
From a risk management point of view, the very protection meant to avoid any business interruption interrupted your business. After all, why would you expect endpoint protection to stop machines in their tracks? Furthermore, most people believe that a quick restart, or ‘switch it off and back on again’, will resolve the issue. The old adage is that the vast majority of computing issues are user error. Such a failure therefore exposed a blind spot for senior management. They probably assumed that a failure could not arrive from this vector, without realising that it was in fact possible. So, let us explore the reasons in more detail.
The technical explanation of the risk failure: Part 1
Here is part 1 of the technical explanation. The kernel layer is where the operating system manages the hardware on which the computer runs. The application layer is where your software tools, applications, games and web browsers run. It sits on top of the kernel: think of the kernel as the ground floor and the application layer as the first floor where people work. The kernel layer is the focus of many attack vectors because it gives access to the hardware and, therefore, to any software running on the machine.
As a result, it is ‘locked down’ to prevent any tampering. In the case of Crowdstrike, their software driver resides in the kernel layer, giving it direct access to hardware and allowing it to ‘get ahead’ of any attack. However, there is of course no physical device for Crowdstrike to control – it is, in effect, a ‘fake’ driver whose purpose is to give the software privileged access to the machine.
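To make the ground floor/first floor analogy concrete, here is a deliberately simplified Python sketch. It is a toy model, not Windows internals: it simply shows why a fault in privileged, kernel-level code halts the whole machine, while the same fault in an application only kills that application. All names are illustrative.

```python
# Toy model only: why a kernel-mode fault halts the whole machine while an
# application-mode fault is contained. Names are illustrative, not real APIs.

class KernelPanic(Exception):
    """Raised when privileged (kernel-level) code hits an unhandled fault."""

class Machine:
    def run_application(self, app_func):
        # Application-layer code runs 'on the first floor': the OS contains
        # the crash, the process dies, and the machine carries on.
        try:
            app_func()
        except Exception as exc:
            print(f"Application crashed and was terminated: {exc}")

    def run_kernel_driver(self, driver_func):
        # Kernel-layer code runs 'on the ground floor' with full privileges.
        # There is no safety net above it, so an unhandled fault stops
        # everything: the blue screen of death.
        try:
            driver_func()
        except Exception as exc:
            raise KernelPanic(f"BSOD: unrecoverable kernel fault ({exc})")

def faulty_code():
    data = [0, 1, 2]
    return data[99]  # invalid access, akin to touching a bad memory address

machine = Machine()
machine.run_application(faulty_code)    # prints a message; machine survives
machine.run_kernel_driver(faulty_code)  # raises KernelPanic; everything stops
```

The point of the sketch is simply that the same bug has a very different blast radius depending on which floor it lives on.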
The technical explanation of the risk failure: Part 2
Part 2 is a little simpler. When software updates appear, especially in the operating system, it is normal for IT teams to have network profiles and policies that control how they are rolled out. For example, a security patch or the addition of new features might not play nicely on every device configuration. Therefore, the IT team may trial the update first on test machines or in virtual environments to ensure stability. Once they know it works, they roll it out in stages and monitor the results. This provides a level of control so that there is no cascading disruption across a business.
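To illustrate the sort of staged rollout described above, here is a minimal Python sketch of a ‘ring’ deployment policy. The ring names, fleet percentages and health checks are hypothetical examples, not any particular vendor's or client's real configuration.

```python
# Simplified sketch of a staged ('ring') rollout policy of the kind many IT
# teams apply to third-party updates. Ring names, sizes and health checks are
# hypothetical examples, not any vendor's real configuration.

from dataclasses import dataclass

@dataclass
class Ring:
    name: str
    fraction_of_fleet: float  # share of machines in this ring
    soak_hours: int           # how long to monitor before promoting further

ROLLOUT_PLAN = [
    Ring("test-lab",    0.001, soak_hours=24),  # lab devices and virtual machines
    Ring("pilot-users", 0.02,  soak_hours=48),  # friendly early adopters
    Ring("department",  0.20,  soak_hours=72),  # a single business unit
    Ring("fleet-wide",  1.00,  soak_hours=0),   # everyone else
]

def ring_is_healthy(ring: Ring) -> bool:
    """Placeholder health check: in practice this would look at crash rates,
    boot failures and helpdesk tickets for machines in the ring."""
    return True  # assume healthy for the purposes of the sketch

def roll_out(update_id: str) -> None:
    for ring in ROLLOUT_PLAN:
        print(f"Deploying {update_id} to ring '{ring.name}' "
              f"({ring.fraction_of_fleet:.1%} of fleet), "
              f"monitoring for {ring.soak_hours}h")
        if not ring_is_healthy(ring):
            print(f"Halting rollout of {update_id}: ring '{ring.name}' unhealthy")
            return
    print(f"{update_id} rolled out to the whole fleet")

roll_out("vendor-update-2024-07")
```

Under a policy like this, a faulty update surfaces on a handful of lab machines rather than on every device at once.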
Unfortunately, Crowdstrike’s model appeared to successfully override such policies. This meant that the update rolled out to all clients simultaneously. The result? A dodgy piece of code quickly cascaded across millions of machines that it was meant to protect. Ultimately, IT teams were sidelined by the update, a content file that caused the driver to read an invalid memory address.
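Crowdstrike’s own post-incident review described the trigger as a faulty content (‘channel’) file that led the sensor’s kernel driver to read memory out of bounds. The following is a loose, user-space Python analogy rather than the real parser: it shows how code that trusts a content file’s declared record count can read past the end of the data when the file is malformed. The file layout here is invented purely for illustration.

```python
# Loose user-space analogy (not Crowdstrike's real parser): code that trusts a
# content file's declared record count can read past the end of the data when
# the file is malformed. In an application this raises an exception; in kernel
# mode, an equivalent out-of-bounds read halts the entire machine.

import struct

def parse_content_file(blob: bytes) -> list[int]:
    # Hypothetical layout: a 4-byte record count, then 4 bytes per record.
    (declared_count,) = struct.unpack_from("<I", blob, 0)
    records = []
    for i in range(declared_count):
        offset = 4 + i * 4
        (value,) = struct.unpack_from("<I", blob, offset)  # no bounds check
        records.append(value)
    return records

# A well-formed file: declares 2 records and actually contains 2.
good_blob = struct.pack("<III", 2, 10, 20)
print(parse_content_file(good_blob))  # [10, 20]

# A malformed file: declares 5 records but only contains 2.
bad_blob = struct.pack("<III", 5, 10, 20)
print(parse_content_file(bad_blob))   # raises struct.error: read past the end
```

In user space the bad read is caught as an error; in the kernel there is nothing above the driver to catch it, so the whole system halts.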
What about Crowdstrike’s own risks?
There are two major problems with the way that Crowdstrike appears to approach endpoint security. Firstly, their protection model, based on a ‘driver’ sitting in the kernel, makes them a great attack vector and point of entry to millions of computers. This is especially worrying should Crowdstrike themselves ever be compromised or should bad actors play their part. It is probable that senior leaders outside of IT had no idea how the protection functioned. Additionally, they probably had no idea that their safety net could become a risk in itself. Sure, Crowdstrike can respond quickly to threat vectors due to the privileged access that they have to devices, but at what cost?
Secondly, all software development cycles include a testing phase. From waterfall to agile, there is a testing phase. Even in the shortest sprint loops in agile development, testing remains a factor. In the case of Crowdstrike, they appeared to have two automated tests of the code ready to deploy. One passed; the other failed. The result? The code was deployed regardless. “Real developers test in production” is now a popular meme. Crowdstrike has since admitted that another error in the testing code led them to release the faulty update anyway. In normal circumstances, their rapid response to threats perhaps gives them a competitive advantage. However, that only holds until they get it wrong.
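As a hedged illustration of the control that appears to have been missing, here is a minimal Python sketch of a release gate that refuses to ship an update unless every automated check passes. The check names and results are hypothetical, not Crowdstrike’s actual pipeline.

```python
# Minimal sketch of a release gate: nothing ships unless every automated
# validation check passes. Check names and results are hypothetical.

from typing import Callable

def content_validator(artifact: str) -> bool:
    """Hypothetical static validation of the update's content format."""
    return True

def smoke_test_on_real_machine(artifact: str) -> bool:
    """Hypothetical test that actually loads the update on a test device."""
    return False  # imagine this check fails

CHECKS: list[Callable[[str], bool]] = [content_validator, smoke_test_on_real_machine]

def release(artifact: str) -> None:
    failed = [check.__name__ for check in CHECKS if not check(artifact)]
    if failed:
        print(f"Release of {artifact} blocked; failed checks: {', '.join(failed)}")
        return
    print(f"Releasing {artifact} to customers")

release("rapid-response-update")  # blocked, because one check failed
```

The principle is blunt: if any gate fails, nothing goes out, however urgent the threat intelligence feels.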
Risk management failures in the recovery?
The last part of this article looks at how everyone recovered from the failure. For the uninitiated, the BSOD presents users with two options: shut down the machine or enter the recovery options. For users accustomed to modern GUIs (graphical user interfaces) and touchscreens, it may have come as a surprise. An ugly blue screen with white text is not a welcoming sight and the solution is not apparent. In many cases, users simply had no idea what to do to resolve the issue. And here lies a major risk management failure: there had to be a human being at the machine to fix it, and most of the nearby human beings had no idea how.
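For the record, the widely reported manual workaround was to boot each affected machine into Safe Mode or the recovery environment and delete the faulty channel file so that Windows could start normally again. A rough Python sketch of that clean-up follows, using the publicly reported path and filename pattern; in practice an engineer would more likely type a one-line command, and BitLocker-encrypted drives first required a recovery key.

```python
# Rough sketch of the widely reported manual workaround: from Safe Mode or the
# recovery environment, remove the faulty Crowdstrike channel file so that the
# machine can boot normally again. Path and filename pattern are as publicly
# reported; an engineer would more likely use a one-line shell command.

from pathlib import Path

DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
FAULTY_PATTERN = "C-00000291*.sys"  # the channel file implicated in the outage

def remove_faulty_channel_files() -> None:
    for channel_file in DRIVER_DIR.glob(FAULTY_PATTERN):
        print(f"Removing {channel_file}")
        channel_file.unlink()

if __name__ == "__main__":
    remove_faulty_channel_files()
```

Simple as the fix looks, it still had to be performed at the machine, which is exactly why recovery took days in some organisations.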
Unfortunately for management, many IT departments are outsourced or offshored. Typical helpdesks solve routine queries by online chat and pass more nuanced issues to second- and third-line support teams. Many users struggle to speak to a human being to report an issue, since the majority of trouble tickets are routine. It is quite unusual for large enterprises to have local IT support staff who can get hands-on with issues; the majority reside in shared services. Many organisations have had to wait for staff to travel around the world to reach machines. Others have subcontracted IT firms with engineers to visit the devices.
Conclusion
Finally, many of the affected devices were for customers to use, for example checking in for a flight, paying for your shopping or some simple banking. Such terminals are locked in cases to prevent tampering, denying access to on-site staff. The result was on-site staff apologising to customers for the outage while being powerless to resolve it. Others had to wait days for a human being to attend. Some users do have the knowledge to try to resolve such an issue, but they may fear the consequences if they make things worse.
Crowdstrike has already announced changes to the way it operates to better detect erroneous code. That does not prevent a ‘worldwide borking’ from happening again. In fact, this was one of the most disruptive IT errors in history. Some insurers estimate the total cost of the outage at $5.4bn/£4.2bn in the US alone. Since boards mainly exist to ensure good stewardship over a business, investors are looking for strong governance and risk management to safeguard their investment. How many boards just added Crowdstrike to their risk register?
Risk management needs a review
Here at Think Beyond, we offer assurance services as well as risk management reviews as part of strategic planning, business continuity planning and good governance. We ask the questions that many dare not ask. What if? Could this affect that? How would you resolve this issue? What are plans A, B and C to get back on track? In summary, we are not agile software developers but we do know a thing or two about risk and resilience.
If you would like to seek support with your risk management, simply drop us a short email. Alternatively, please send us a general or assurance query online.
Finally, why not check out two related articles: one on product development and another on business problem-solving?