The Perfect Storm: Lessons from critical outages at Microsoft and CrowdStrike in July 2024

Sam McNeill

1 year ago

A short note before this post starts: I’ve not been blogging recently due to a major knee injury sustained falling off my mountain bike over Easter 2024. This required two reconstructive surgeries and has restricted my time for blogging. I’m mostly over this now and plan to write blogs more regularly again. With that, on with the post!

In July 2024, CrowdStrike experienced a significant global outage that left many businesses unable to access critical security services. For several hours, their endpoint detection and response (EDR) tools were unavailable, sparking concerns about the impact of cybersecurity tool downtime on business continuity. The incident was compounded by outages in Microsoft’s Azure data centres, which caused further disruption and confusion over the root causes of the problems businesses were experiencing globally. In both instances, the outages were not the result of a cyber attack but the consequence of software updates gone wrong. A February 2023 report from IDC ranked CrowdStrike first in endpoint security with a 17.7% market share, with Microsoft’s own endpoint security solutions a close second at 16.4%, so these outages caused significant business disruption.

Understanding The Root Causes

As is often the case with critical outages, there was no single cause. Instead, a combination of factors cascaded: a software update error propagated across CrowdStrike’s global infrastructure and was compounded by issues with their cloud provider. Endpoint Detection and Response (EDR) platforms like CrowdStrike or Microsoft Defender present security teams with a dilemma. Starting too late in the Windows boot sequence leaves them liable to miss malware running at the lowest level of the Windows operating system, or to be disabled by it. Starting early means running as a kernel driver, where a single fault can crash the whole machine. Being given boot priority is therefore a privilege and not a right, and developers of Windows kernel drivers are required to uphold extremely high quality-assurance standards (this video from a former Microsoft Windows developer explains it well).

In this instance, the dependency on a single cloud provider that was itself experiencing failover problems magnified the issue. This underscores the importance of both rigorous testing of updates in isolated environments and having diverse, redundant infrastructure to mitigate cloud-based risks. Similar causal issues affected Microsoft’s Azure outage: a misconfigured software update led to network routing issues and widespread unavailability of cloud services to customers. The extent of these networking issues prevented or delayed the automated failover to redundant data centres, leaving services offline for an extended period.
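
To make the idea of testing updates in isolated environments before a wider release more concrete, here is a minimal Python sketch of a staged rollout. It is an illustration only and not how CrowdStrike or Microsoft deploy updates: the function name `staged_rollout`, the ring layout and the `apply_update` callback are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> dict:
    """Apply an update ring by ring, halting once a ring exceeds the failure budget.

    `rings` is ordered from the most isolated group (e.g. test machines) to the
    broadest (e.g. all production devices). `apply_update` is a hypothetical
    callback that returns True when a host is healthy after the update.
    """
    updated: list[str] = []
    failed: list[str] = []
    for index, ring in enumerate(rings):
        # Update every host in the current ring and record which ones failed.
        ring_failures = [host for host in ring if not apply_update(host)]
        updated.extend(host for host in ring if host not in ring_failures)
        failed.extend(ring_failures)
        # Stop before touching later rings if this ring was too unhealthy.
        if ring and len(ring_failures) / len(ring) > max_failure_rate:
            return {"halted_at_ring": index, "updated": updated, "failed": failed}
    return {"halted_at_ring": None, "updated": updated, "failed": failed}


if __name__ == "__main__":
    rings = [["test-1"], ["pilot-1", "pilot-2"], ["prod-1", "prod-2", "prod-3"]]
    # Simulate an update that breaks one pilot machine.
    print(staged_rollout(rings, lambda host: host != "pilot-2"))
```

The point of the design is that a faulty update is caught in a small, early ring, so the larger production ring is never touched.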

Business Impacts For Affected Customers

The CrowdStrike outage, affecting only Windows devices, led to numerous challenging situations:

Device Outages: Windows devices crashed with Blue Screens of Death (BSOD) and required a manually intensive process to restore them to operational use.
Disruption to Security Monitoring: Companies were left unable to detect potential threats on their networks in real time. For some, this created a dangerous blind spot, leaving them vulnerable to attacks during the outage period. Security operations centers (SOCs) relying on CrowdStrike tools had to operate without their primary line of defense.
Operational Downtime: Beyond the security implications, the outage had an operational impact. Without access to critical cybersecurity infrastructure, some organizations had to suspend or reduce operations, leading to potential financial losses. The dependence on third-party services without adequate contingency measures proved costly.

Companies depending on Azure for cloud computing, storage, and application hosting also faced significant challenges.

Interruption of Security Services and Related Tools: Some businesses using Azure for hosting security monitoring tools or third-party cybersecurity services experienced an operational security gap. The inability to monitor network activity or respond to threats due to cloud downtime posed an additional layer of risk.
Disruption to Business-Critical Applications: Many organizations rely on Azure to host critical applications, from enterprise resource planning (ERP) systems to customer relationship management (CRM) platforms. The outage caused widespread disruption, with companies unable to access or run key business operations.
Data Access and Storage Issues: Azure’s storage services, such as Azure Blob Storage, were also impacted. For organizations storing large volumes of data in the cloud, the inability to access or update this data during the outage caused operational delays and concerns over data integrity. In some cases, businesses were unable to retrieve time-sensitive information, disrupting decision-making and service delivery.

Lessons For Business Leaders

Build Redundancy into Critical Systems: Organizations should ensure that their key services have backup systems in place—whether by using multiple vendors, redundant cloud platforms, or hybrid cloud environments.
Develop (and test!) Strong Incident Response Plans: Businesses need a plan for when their critical cybersecurity tools go down. This includes backup solutions, alternative threat detection methods, and robust communication protocols to ensure that the security team can respond effectively even when primary systems are offline.
Vendor Accountability and Transparency: While businesses rely on vendors like CrowdStrike and Microsoft for critical functions, it’s important to demand transparency in service level agreements (SLAs) and contingency plans.
Multi-Cloud and Hybrid Cloud Strategies: One of the biggest lessons from the Azure outage is the need for organizations to diversify their cloud infrastructure. Relying solely on one provider can lead to significant downtime when outages occur. Adopting a multi-cloud or hybrid cloud approach—where critical workloads are distributed across multiple cloud providers or a mix of cloud and on-premise environments—can mitigate the impact of a single provider’s outage.
Prepare for Worst-Case Scenarios: Outages like the CrowdStrike and Azure incidents highlight the importance of contingency planning. Businesses should conduct regular risk assessments of their cloud dependencies, implement offline backups for critical data, and ensure that their internal teams are trained to handle extended cloud outages. This preparation can limit operational downtime and keep security and business operations running during disruptions.
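
As a rough illustration of the redundancy and failover lessons above, the following Python sketch picks the first healthy option from an ordered list of providers. It is a simplified sketch under stated assumptions: the function `first_healthy`, the provider names and the health-check callables are hypothetical, and a real failover plan would also cover data replication, DNS and regular testing.

```python
from typing import Callable, Optional, Sequence, Tuple


def first_healthy(
    providers: Sequence[Tuple[str, Callable[[], bool]]],
) -> Optional[str]:
    """Return the name of the first provider whose health check passes.

    `providers` is ordered by preference (primary first). Each entry pairs a
    label with a hypothetical health-check callable that returns True when
    the provider is reachable.
    """
    for name, is_healthy in providers:
        try:
            if is_healthy():
                return name
        except Exception:
            # A health check that raises is treated the same as an unhealthy result.
            continue
    return None


if __name__ == "__main__":
    def on_premise_check() -> bool:
        # Simulate the primary site being unreachable.
        raise ConnectionError("office link is down")

    providers = [
        ("on-premise", on_premise_check),
        ("cloud-a", lambda: False),
        ("cloud-b", lambda: True),
    ]
    print(first_healthy(providers))
```

Keeping the preference order explicit makes it easy to rehearse the plan: swap in a failing check for the primary and confirm that the next option is selected.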

No business is immune to these disruptions: in late August, Cyclone (the company where I work) experienced a complete outage of internet connectivity in our Christchurch office after a misconfigured routing update between our ISP and fibre provider. The outage lasted 12 hours. Critical services that were hosted on-premise were failed over to Azure cloud instances as part of our redundancy and business continuity plan, ensuring staff around New Zealand could continue to work and support our customers. Whilst building and testing redundancy is time-intensive and costly, confidence in those systems is critical to business continuity in today’s technology-reliant world.
