SamuelMcNeill.com

The Perfect Storm: Lessons from critical outages at Microsoft and CrowdStrike in July 2024

A short note before this post starts: I’ve not been blogging recently due to a major knee injury sustained falling off my mountain bike over Easter 2024. This required two reconstructive surgeries and has restricted my time for blogging. I’m mostly over this now and plan to write blogs more regularly again. With that, on with the post!

In July 2024, CrowdStrike experienced a significant global outage that left many businesses unable to access critical security services. For several hours, its endpoint detection and response (EDR) tools were unavailable, raising concerns about the impact of cybersecurity tool downtime on business continuity. The incident was compounded by outages in Microsoft’s Azure data centres, which caused further disruption and added to the confusion over the root causes of the problems businesses were experiencing globally. Neither outage was the result of a cyber attack; both were the consequence of software updates gone wrong. A February 2023 report from IDC ranked CrowdStrike first in endpoint security with a 17.7% market share, and Microsoft’s own endpoint security solutions a close second with 16.4%, so these outages caused significant business disruption.

Understanding The Root Causes

As is often the case with critical outages, there was no single cause. Instead, a combination of factors cascaded: a software update error propagated across CrowdStrike’s global infrastructure and was compounded by issues with its cloud provider. Endpoint detection and response (EDR) platforms like CrowdStrike or Microsoft Defender present security teams with a dilemma. If they start too late in the Windows boot sequence, they risk missing malware running at the lowest level of the Windows operating system, or being disabled by it. But being given boot priority is a privilege and not a right, and developers of Windows kernel drivers are required to uphold extremely high quality-assurance standards (this video from a former Microsoft Windows developer explains it well).

In this instance, the dependency on a single cloud provider, which was itself struggling to fail over, magnified the problem. This underscores the importance of both rigorous testing of updates in isolated environments and having diverse, redundant infrastructure to mitigate cloud-based risks. Similar causes lay behind Microsoft’s Azure outage: a misconfigured software update led to network routing issues and widespread unavailability of cloud services for customers. The extent of these networking issues prevented or delayed the automated failover to redundant data centres, leaving services offline for an extended period.
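For readers who want to see what "testing updates in isolated environments" can look like in practice, here is a minimal sketch of a staged rollout, written in Python. It is purely illustrative and is not how CrowdStrike or Microsoft deploy updates: the `Ring` class, the `staged_rollout` function, the host names and the `max_failure_rate` threshold are all hypothetical names I've made up for this example. The idea is simply that an update goes to a small isolated group first, and only moves on to larger groups if that group stays healthy.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    """A group of machines that receives an update together (hypothetical)."""
    name: str
    hosts: List[str]


def staged_rollout(
    rings: List[Ring],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Apply an update ring by ring, halting at the first unhealthy ring.

    apply_update(host) is assumed to return True if the host is healthy
    after the update and False otherwise.
    Returns the names of the rings that passed the health check.
    """
    completed = []
    for ring in rings:
        if not ring.hosts:
            continue  # nothing to update in an empty ring
        failures = sum(1 for host in ring.hosts if not apply_update(host))
        failure_rate = failures / len(ring.hosts)
        if failure_rate > max_failure_rate:
            # Stop here so later (larger) rings never receive the bad update.
            break
        completed.append(ring.name)
    return completed


if __name__ == "__main__":
    rings = [
        Ring("isolated-lab", ["lab-1", "lab-2"]),
        Ring("pilot", ["pilot-1"]),
        Ring("production", ["prod-1", "prod-2"]),
    ]
    # Simulated update that breaks the pilot machine only.
    print(staged_rollout(rings, lambda host: not host.startswith("pilot")))
```

In the simulated run the update fails on the pilot machine, so the rollout stops before it ever reaches the production ring. A real deployment pipeline would need far more than this (rollback, monitoring over time, approval steps), but the gating principle is the same.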

Business Impacts For Affected Customers

The CrowdStrike outage, affecting only Windows devices, led to numerous challenging situations:

Companies depending on Azure for cloud computing, storage, and application hosting also faced significant challenges.

Lessons For Business Leaders

No business is immune to these disruptions: in late August, Cyclone (the company where I work) lost all internet connectivity in our Christchurch office for 12 hours after a misconfigured routing update between our ISP and fibre provider. Critical services that were hosted on-premises were failed over to Azure cloud instances as part of our redundancy and business continuity plan, ensuring staff around New Zealand could continue to work and support our customers. Whilst redundancy systems are time-intensive and costly to put in place, having confidence in them to ensure business continuity is critical in today’s technology-reliant world.
