
On November 2, 2023, Cloudflare’s customer-facing interfaces, including their website and APIs, along with logging and analytics, ceased functioning properly. That was bad.
Over 7.5 million websites use Cloudflare, and 3,280 of the world’s 10,000 most popular websites depend on its content delivery network (CDN) services. The good news is that the CDN didn’t go down. The bad news is that the Cloudflare Dashboard and its related application programming interfaces (APIs) were down for almost two days.
That kind of thing just doesn’t happen, or at least it shouldn’t, to major internet service companies. So, the multi-million-dollar question is: ‘What happened?’ The answer, according to Cloudflare CEO Matthew Prince, was a power-related incident at one of the company’s three primary data centers in Oregon, a facility managed by Flexential, that cascaded into one problem after another. Thirty-six hours later, Cloudflare was finally back to normal.
Prince didn’t pussyfoot around the problem:
To start, this never should have happened. We believed that we had high availability systems in place that should have stopped an outage like this, even when one of our core data center providers failed catastrophically. And, while many systems did remain online as designed, some critical systems had non-obvious dependencies that made them unavailable. I am sorry and embarrassed for this incident and the pain that it caused our customers and our team.
He’s right: this incident never should have happened. Cloudflare’s control plane and analytics systems run on servers in three data centers around Hillsboro, Oregon. The three are independent of one another, and each has multiple utility power feeds and multiple redundant, independent internet connections.
The three data centers are far enough apart that a single natural disaster is unlikely to take them all down at once. At the same time, they’re close enough that they can all run active-active redundant data clusters. So, by design, if any one of the facilities goes offline, the remaining ones should pick up the load and keep operating.
Sounds great, doesn’t it? However, that’s not what happened.
What happened first was that a power failure at Flexential’s facility caused unexpected service disruption. Portland…