19 hours, 40 engineers and one sleepless night: When CrowdStrike struck

July 25 2024, by Naveen Gera | Category: Cloud Services

How we supported our customers through the CrowdStrike outage.

The world is still recovering from what is being labelled the largest IT outage in history. On July 19 2024, a routine software update from CrowdStrike disabled more than eight million computers worldwide. The record-breaking outage impacted major airlines, media outlets, government bodies and essential services, and wreaked havoc across the private sector.

Our Head of Service Assurance, Naveen Gera, led the Macquarie Cloud Services incident response team for the CrowdStrike incident. We spoke with him to get an “on the ground” perspective of the critical 19-hour period, from incident identification late on Friday afternoon to resolution on Saturday morning.

Q: Take us through the early stages of the event. When did you know something was amiss and how soon did you identify the cause?

Naveen: We started seeing a lot of activity, including multiple alarms, on our monitoring system around 2:30 PM on Friday – around an hour before the media started reporting on the outage. That immediately triggered our critical incident management process, and the key personnel involved in the escalation process jumped onto the bridge – that’s what we call the control room in our Hosting Management Centre (HMC).

The next step was to raise incidents for each customer who appeared to be impacted and let them know what we were seeing. At this stage we didn’t know what the issue was, so the first troubleshooting step was to look at the cloud infrastructure. We were able to rule out an infrastructure issue pretty quickly.

The first indication that it might have been a CrowdStrike issue came when one of our engineers was troubleshooting a customer environment and saw the blue screen of death. He suspected it might be related to CrowdStrike, but we weren’t one hundred percent sure, so we asked some of our impacted customers if they were also using CrowdStrike. Sure enough, each one answered “yes”.

Within 45 minutes of the start of the incident, we’d connected all the dots and were confident that it was an issue with CrowdStrike. By now we’d also assessed the number of servers that we would need to address.

With the full scope emerging, it was time to call in reinforcements. We set up mini ‘SWAT’ teams to support every customer, at a ratio of four engineers to every account.

Q: What was the general mood by this point?

Naveen: It’s fair to say that some customers were panicking, because they had been completely taken offline. Our focus was on keeping everyone calm and making sure they had all the up-to-date information.

Q: What was happening prior to the official fix being released?

Naveen: By this stage most of the available information was coming from system engineering communities on Reddit rather than from CrowdStrike, which hadn’t yet released an official fix, so in the meantime we were testing workarounds.

Our engineers correctly identified that restarting systems in safe mode would load only essential services and not third-party applications (including CrowdStrike), so we used that knowledge to come up with a potential workaround within 40 minutes of the start of the incident.

The team got to work filtering this information through to customers as we needed their permission to make some of the changes. When the official fix came through from CrowdStrike a bit later, it was the same as what we had identified.
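For readers who want the technical detail: the remediation CrowdStrike later published involved booting each affected Windows host into Safe Mode (or the Windows Recovery Environment) and deleting the faulty channel files before rebooting normally. The snippet below is a minimal, illustrative sketch of that deletion step in Python, assuming the host is already in Safe Mode and the script runs with administrator rights; it is not the exact tooling our engineers used, and the directory and file pattern follow CrowdStrike’s public guidance.

```python
# Illustrative sketch only: remove the faulty CrowdStrike channel files
# (C-00000291*.sys) on a Windows host that has been booted into Safe Mode.
# Assumes administrator rights; paths follow CrowdStrike's published guidance.
import glob
import os

CROWDSTRIKE_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
FAULTY_PATTERN = "C-00000291*.sys"  # channel files from the 19 July update


def remove_faulty_channel_files(directory: str = CROWDSTRIKE_DIR) -> list[str]:
    """Delete matching channel files and return the paths that were removed."""
    removed = []
    for path in glob.glob(os.path.join(directory, FAULTY_PATTERN)):
        os.remove(path)
        removed.append(path)
    return removed


if __name__ == "__main__":
    deleted = remove_faulty_channel_files()
    if deleted:
        print("Removed:")
        for path in deleted:
            print("  " + path)
        print("Reboot the host normally to complete the fix.")
    else:
        print("No matching channel files found; the host may already be remediated.")
```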

Q: How long did it take to resolve the incident for customers?

Naveen: Once we had the solution, it was a case of working through every customer environment to implement it. For the private cloud it was a fairly simple process; however, things were a little more time-consuming for customers in the public cloud, which was running very slowly.

Our SWAT teams worked throughout the night, and by 4 AM on Saturday practically everything had been fixed. We marked the incident as officially resolved a few hours after that.

Q: What are your key takeaways from the CrowdStrike incident?

Naveen: For me it really highlights the importance of having an experienced, well-resourced managed service provider. It was amazing to see the Macquarie Cloud Services team rally to ensure our customers experienced minimal outage and disruption.

By contrast, I’ve heard some nightmare stories of providers who only had a couple of engineers managing hundreds of customers. It took them days – rather than hours – to return to normal operations. We were fortunate to have 40 engineers working through the night, who were able to resolve the incident in around 19 hours. They were well supported by ten delivery managers rolling out communications to our customers.

Special mention goes to our HMC Team Leader, Lisa Bebawy, and Service Delivery Team Leader, Andrew Pieniazek, who played pivotal roles in managing the engineers and the communications flow.

I’m really proud of how the team pulled together to help our customers – especially where end users were not particularly tech savvy and needed greater support to work through the fix. Everyone was asking how they could help, and even our most junior engineers were keen to stay back and help resolve the issue overnight.

We even managed to extend our support to people who weren’t customers of ours but reached out to us needing immediate help. It was a fantastic testament to one of our core values – “Personal Accountable Service”.

How it unfolded: Timeline of the CrowdStrike outage.

Friday

2:30 PM: Initial call; monitoring systems detected multiple alarms indicating a major issue.

2:45 PM: Escalation managers activated the critical incident management process.

3:00 PM: Engineers identified the problem was not with Macquarie Cloud Services private cloud infrastructure.

3:15 PM: Engineers suspected CrowdStrike as the cause due to consistent blue screen of death errors.

3:20 PM: Reddit posts confirmed CrowdStrike-related issue.

3:25 PM: Engineers began testing fixes.

4:00 PM: Our team implemented a workaround by deleting the faulty Channel File 291 files (C-00000291*.sys), later confirmed as a valid fix by CrowdStrike.

4:30 PM: Five SWAT teams of engineers assisted different customers with the fix, with each team reassigned once a customer’s environment was restored.

Overnight: The team worked through the night to restore 1,500 virtual machines.

Saturday

4:00 AM: Major services under Macquarie Cloud Services management were restored.

8:30 AM: Team reconvened to address remaining customer issues.

9:30 AM: Major incident officially resolved; continuous updates and communications maintained throughout the process.

10:00 AM: Final customer services restored.

6:00 PM: Additional customer services managed by third-party providers were restored with the guidance of the Macquarie Cloud Services team.

Be prepared for anything with Macquarie Cloud Services.

In today’s landscape, the question of operational disruption is not so much “if” but “when”. It always pays to be prepared. And that means choosing the right partner to manage your cloud infrastructure.

When you partner with Macquarie Cloud Services, you’re gaining the assurance that things will tick over like clockwork, even during a major unavoidable disruption. Whether it’s late on a Friday evening or on a public holiday, our team never sleeps. We’ll always rally to protect you and ensure you’re back up and running as soon as possible.

Reach out to us today at 1800 004 943 or drop us an email at enquiries@macquariecloudservices.com to explore how we can help you.


Naveen Gera

About the author.

As Macquarie Cloud Services’ HMC Assurance Manager, Naveen’s mission is to deliver the best customer experience and pioneer the NPS program for the group. Over the last 8.5 years, Naveen has worked with his teams to exceed not just our NPS targets but our customers’ expectations. The smile makes it all worth it for him. Naveen is also a leading figure in delivering the Cloud Services Graduate program.


