7/29/2024
In the early hours of July 19, the United States seemed to come to a standstill. The steady stream of flights that typically take off every day slowed to a trickle. Some of the country’s biggest airlines, and even some international airports, were forced to halt operations. Almost 4,000 flights were canceled in a single day, leaving countless passengers stranded and confused.
The IT outage also heavily impacted several emergency and medical facilities. In both Alaska and New Hampshire, 911 dispatch services were down. Hospitals all across the United States and Canada were forced to cancel all non-urgent surgeries. Blood banks that relied on flights to distribute their donations were scrambling to get them to the correct medical centers. In fact, New York Blood Center, a provider to over 200 hospitals in the Northeast, started an impromptu driving service to ensure delivery. Meanwhile, another facility had to call on local donors after a delivery was canceled.
The culprit of this cross-country disruption was not a malicious cyberattack, but poorly written lines of code. All of the systems affected in the outage were Windows computers running CrowdStrike’s Falcon software that were online between 04:09 and 05:27 UTC. At 04:09 UTC, CrowdStrike, an independent cybersecurity company, released a faulty update. Although the update was reverted at 05:27 UTC, computers that were online before the fix received it automatically, causing the Windows operating system to crash.
According to CrowdStrike’s preliminary incident report, the update had passed all stress testing. It consisted of two IPC Template Instances, one of which contained errors. These errors went undetected because of a bug in the Content Validator, and the update was released. Engineers were surprised to discover this bug because, prior to the incident, four IPC Template Instances had been released and implemented with no problems.
While CrowdStrike was able to resolve most of the technological damage by Friday afternoon, some consequences proved to be more long-term. Many companies were able to fix their computers by following instructions from Microsoft, but numerous others have had to repair machines manually, one at a time, by inserting a USB drive. Insurers estimate the outage cost Fortune 500 companies over $5 billion in direct losses. Notably, Delta Air Lines has still not fully recovered from the glitch and has canceled around 9,000 flights.
Most costly were the personal consequences. The restless travelers who missed work, weddings, and funerals, stranded in airports with nowhere to go. The panicked citizens of Alaska and New Hampshire who could not contact paramedics, police, or firefighters in their worst moments. The crestfallen patients who had been eagerly anticipating a life-changing surgery, treatment, or blood transfusion, only to have their care delayed. These are the things, the moments, that CrowdStrike has taken away from people and will never be able to give back.
Although the company has already outlined the steps it is taking to prevent something like this from happening again, the reality is that it should never have happened. The outage reflects the dangers of technological consolidation on both an individual and a group scale. Individually, CrowdStrike’s debugging and deployment system lacked sufficient layers of review and outside perspectives. As a whole, big technology companies lack the diversity they need to prevent outages like this.
In biology, it’s a widely known fact that diversity is essential to any ecosystem’s survival. That’s because variability provides protection and produces resilience in times of crisis. This incident boils down to a lack of technological diversity. CrowdStrike alone controls 25% of the cybersecurity market. In this incident, 8.5 million Windows devices were impacted. Had companies been required to contract protection from more than one cybersecurity firm, or had there been legislation preventing a cybersecurity conglomerate from forming, this global crash would have been prevented.
More than ever, engineers and legislators need to join forces to establish regulations to prevent such a thing from happening again. This time, it was bad code, but what happens when it’s vicious malware?