Cloudflare Outage: A Detailed Postmortem of the Incident that Took Down Half the Internet

Introduction

The recent Cloudflare outage that took down thousands of websites, including high-profile victims like ChatGPT, Canva, Dropbox, Spotify, Uber, Coinbase, Zoom, X, and Reddit, serves as a stark reminder of the internet's dependence on Cloudflare's content delivery network (CDN). The six-hour outage, which occurred on Tuesday, was caused by a faulty configuration file propagated to Cloudflare's Bot Management module. The module crashed and took Cloudflare's proxy functionality offline with it.

What Happened?

The incident began with a database permissions change in ClickHouse. The Cloudflare engineering team had been working to improve system security and reliability by moving from a shared system account to individual user accounts. However, this change had an unintended side effect: the query that collects the features passed to Bot Management started to fetch from the r0 database as well, returning many more features than expected.
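The effect of a metadata query that suddenly sees an extra database can be sketched in a few lines of Rust. This is a toy model, not Cloudflare's code: the ColumnRow type, the function names, the feature names, and the database name "default" are assumptions made for illustration. Only r0 comes from the incident description.

```rust
use std::collections::BTreeSet;

/// One row of column metadata, as a metadata query might return it.
/// Hypothetical type; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct ColumnRow {
    database: String,
    name: String,
}

/// Collects feature names with no database filter: every database that is
/// visible to the querying account contributes rows.
fn feature_names_unfiltered(rows: &[ColumnRow]) -> Vec<String> {
    rows.iter().map(|r| r.name.clone()).collect()
}

/// Collects feature names from a single database and removes duplicates.
fn feature_names_filtered(rows: &[ColumnRow], database: &str) -> Vec<String> {
    let unique: BTreeSet<String> = rows
        .iter()
        .filter(|r| r.database == database)
        .map(|r| r.name.clone())
        .collect();
    unique.into_iter().collect()
}

/// Sample rows for the toy model: the same two columns are visible in both
/// an assumed "default" database and in r0.
fn sample_rows() -> Vec<ColumnRow> {
    let mut rows = Vec::new();
    for db in ["default", "r0"] {
        for name in ["feature_a", "feature_b"] {
            rows.push(ColumnRow {
                database: db.to_string(),
                name: name.to_string(),
            });
        }
    }
    rows
}
```

In this sketch the unfiltered function returns four names for two real features, while the filtered one returns two. The point is only that a query which never names its database can change its output when permissions change, even though the query text itself did not.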

The Bot Management module has a limit of 200 features, which was put in place for performance reasons. When the module received a feature file that exceeded this limit, it crashed, causing a system panic on machines served with the incorrect feature file. The code that panicked was a call to unwrap() on a result that was not expected to be an error.
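The difference between calling unwrap() on a failed limit check and handling the error can be illustrated with a short Rust sketch. The 200-feature limit comes from the incident description. Everything else (the TooManyFeatures error type, the function names, and the fallback to a previous feature set) is a hypothetical illustration, not Cloudflare's implementation or its chosen fix.

```rust
use std::fmt;

/// Limit taken from the incident description.
const MAX_FEATURES: usize = 200;

/// Hypothetical error returned when a feature file exceeds the limit.
#[derive(Debug, PartialEq)]
struct TooManyFeatures {
    found: usize,
    limit: usize,
}

impl fmt::Display for TooManyFeatures {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "feature file has {} features, limit is {}", self.found, self.limit)
    }
}

impl std::error::Error for TooManyFeatures {}

/// Validates a candidate feature list against the limit.
fn load_features(names: Vec<String>) -> Result<Vec<String>, TooManyFeatures> {
    if names.len() > MAX_FEATURES {
        return Err(TooManyFeatures { found: names.len(), limit: MAX_FEATURES });
    }
    Ok(names)
}

/// Panics if the candidate is oversized: unwrap() turns the Err into a panic.
fn apply_or_panic(candidate: Vec<String>) -> Vec<String> {
    load_features(candidate).unwrap()
}

/// One possible alternative: log the rejection and keep the feature set
/// that is already in use.
fn apply_or_keep_previous(current: Vec<String>, candidate: Vec<String>) -> Vec<String> {
    match load_features(candidate) {
        Ok(new_features) => new_features,
        Err(err) => {
            eprintln!("rejecting feature file: {}", err);
            current
        }
    }
}
```

With a 250-entry candidate, apply_or_panic aborts the calling thread, while apply_or_keep_previous returns the current set unchanged. Whether falling back to older data is acceptable depends on the system, so the second function is a sketch of the idea rather than a recommendation.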

The Incident Unfolds

Edge nodes started to crash, one by one, seemingly at random. The feature file was being generated every 5 minutes and gradually rolled out to Edge nodes. Initially, only a few nodes crashed, but over time, more became non-responsive. At one point, both good and bad configuration files were being distributed, so failed nodes that received a good configuration file started working again, but only until the next bad file arrived.

The Root Cause of the Delay

It took Cloudflare engineers an unusually long time, 2.5 hours, to figure out the root cause of the outage. An unrelated failure made the team suspect a coordinated botnet attack: when the first few Edge nodes went offline, the company's status page went down, too. However, there was no attack, and the team lost time looking in the wrong place.

Postmortem and Learnings

Cloudflare shared a detailed review of the incident and its learnings, which can be read online. The postmortem was unusually detailed, in part because Cloudflare's CEO, Matthew Prince, was on the outage call and wrote a first version of the incident review after the outage was resolved. The team circulated a Google Doc with the initial writeup and questions that needed to be reviewed, and all questions were answered within a few hours.

Some key learnings from this incident include:

  • Logging errors explicitly at the point where you raise them can help identify the root cause of an error much faster.
  • Having alerts that fire when certain errors spike on nodes can also help detect and mitigate the issue sooner.
  • Logging errors before throwing them is extra work, but when combined with monitoring or log analysis, it can be incredibly useful.
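The learnings above can be sketched in Rust as two small pieces: a function that logs an error with context before returning it, and a sliding-window counter that reports when errors spike. All names here (parse_feature_count, ErrorSpikeDetector, the log line format) are hypothetical, and a real system would more likely use a logging and metrics stack than eprintln! and an in-memory queue.

```rust
use std::collections::VecDeque;
use std::num::ParseIntError;
use std::time::{Duration, Instant};

/// Logs the error with its input before handing it back to the caller,
/// so the failure is visible even if a caller later unwraps or drops it.
fn parse_feature_count(raw: &str) -> Result<usize, ParseIntError> {
    raw.trim().parse::<usize>().map_err(|err| {
        eprintln!("ERROR feature_count_parse_failed input={:?} cause={}", raw, err);
        err
    })
}

/// Counts errors inside a sliding time window and flags a spike once the
/// count reaches a threshold.
struct ErrorSpikeDetector {
    window: Duration,
    threshold: usize,
    events: VecDeque<Instant>,
}

impl ErrorSpikeDetector {
    fn new(window: Duration, threshold: usize) -> Self {
        ErrorSpikeDetector { window, threshold, events: VecDeque::new() }
    }

    /// Records one error at `now` and returns true if the number of errors
    /// within the window (including this one) has reached the threshold.
    /// Timestamps are expected to be passed in non-decreasing order.
    fn record(&mut self, now: Instant) -> bool {
        self.events.push_back(now);
        // Drop events that have fallen out of the window.
        while let Some(&oldest) = self.events.front() {
            if now.duration_since(oldest) > self.window {
                self.events.pop_front();
            } else {
                break;
            }
        }
        self.events.len() >= self.threshold
    }
}
```

A caller would invoke record(Instant::now()) each time an error is logged and raise an alert when it returns true. Passing the timestamp in explicitly keeps the detector easy to test without waiting for real time to pass.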

Conclusion

The Cloudflare outage serves as a reminder of the importance of careful planning, testing, and monitoring in preventing and mitigating outages. Cloudflare's detailed postmortem and learnings provide valuable insights for the tech community, and their commitment to transparency and accountability is commendable. As the internet continues to evolve and become increasingly dependent on cloud services, it is essential to learn from incidents like this and work towards building more resilient and reliable systems.