Cloudflare Outage Caused by Internal Database Change

Cloudflare faced a significant global outage on November 18th, attributed to an internal database update. This disruption began around 11:20 UTC, resulting in widespread 5xx errors across its content delivery network (CDN) and security services. The incident created access issues for customer websites and locked Cloudflare’s own engineers out of their internal dashboard.

Details of the Database Change

According to Cloudflare CEO Matthew Prince, the outage was triggered by a subtle regression introduced during a routine update to the company's ClickHouse database cluster. The change was meant to improve security by making users' access to underlying tables explicit. However, it inadvertently affected the Bot Management system.
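To illustrate the class of problem, here is a minimal Rust sketch of a column-metadata lookup that never pins the query to a single database. All identifiers (the table name, the database names, the helper list_feature_columns) are placeholders for illustration, not Cloudflare's actual schema or proxy code; the point is only that once a second database becomes visible to the querying user, every column comes back twice.

```rust
// Illustrative sketch only; table, database, and function names are placeholders.

#[derive(Clone)]
struct ColumnRow {
    database: String, // which database the metadata row came from
    table: String,
    column: String,
}

// A column lookup that filters by table name only. Because it never pins the
// query to one database, every database visible to the user contributes rows.
fn list_feature_columns(metadata: &[ColumnRow], table: &str) -> Vec<String> {
    metadata
        .iter()
        .filter(|row| row.table == table)
        .map(|row| row.column.clone())
        .collect()
}

fn main() {
    let columns = ["bot_score", "ja3_fingerprint", "ua_entropy"];
    let make_rows = |db: &str| -> Vec<ColumnRow> {
        columns
            .iter()
            .map(|c| ColumnRow {
                database: db.to_string(),
                table: "feature_table".to_string(),
                column: (*c).to_string(),
            })
            .collect()
    };

    // Before the permissions change: only the primary database is visible.
    let before = make_rows("default");

    // After the change: the underlying shard database becomes visible too,
    // so the same three columns are returned a second time.
    let mut after = before.clone();
    after.extend(make_rows("shard_db"));

    let dbs: std::collections::BTreeSet<&str> = after.iter().map(|r| r.database.as_str()).collect();
    println!("visible databases: {:?}", dbs);
    println!("before: {} columns", list_feature_columns(&before, "feature_table").len()); // 3
    println!("after:  {} columns", list_feature_columns(&after, "feature_table").len()); // 6
}
```

Filtering on a single database, or deduplicating the result by column name, would keep the list stable regardless of which tables a user is allowed to see.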

Technical Challenges Encountered

  • The metadata query that ordinarily fetched a clean column list from the default database began returning duplicate rows from the underlying database shards.
  • This unexpected data expansion caused the configuration file tracking bot threats to double in size.
  • Cloudflare’s core proxy pre-allocates memory for this file and enforces a hard safety limit of 200 features.
  • When the bloated file exceeded that limit, the Bot Management module crashed (a minimal sketch of this failure mode follows this list).
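The sketch below is an assumption about how such a loader might behave, not Cloudflare's actual proxy code: it sizes its feature table up front and treats an oversized file as a fatal error. The names (MAX_FEATURES, load_feature_config) and the feature counts are hypothetical; the only figure taken from the article is the 200-feature limit.

```rust
// Hypothetical sketch of the failure mode, not Cloudflare's actual proxy code.
// The loader pre-allocates space for at most MAX_FEATURES entries and refuses
// any configuration file larger than that.

const MAX_FEATURES: usize = 200; // hard safety limit cited in the article

#[derive(Debug)]
enum ConfigError {
    TooManyFeatures { got: usize, limit: usize },
}

fn load_feature_config(feature_names: &[String]) -> Result<Vec<String>, ConfigError> {
    if feature_names.len() > MAX_FEATURES {
        // In the outage, hitting this condition effectively crashed the Bot Management module.
        return Err(ConfigError::TooManyFeatures {
            got: feature_names.len(),
            limit: MAX_FEATURES,
        });
    }
    let mut table = Vec::with_capacity(MAX_FEATURES); // memory sized up front
    table.extend_from_slice(feature_names);
    Ok(table)
}

fn main() {
    // Normal case: one metadata row per column (the count of 120 is arbitrary).
    let clean: Vec<String> = (0..120).map(|i| format!("feature_{i}")).collect();

    // Faulty case: duplicate rows from the underlying shards double the list to 240.
    let doubled: Vec<String> = clean.iter().chain(clean.iter()).cloned().collect();

    println!("clean file ({}):   {:?}", clean.len(), load_feature_config(&clean).map(|t| t.len()));
    println!("doubled file ({}): {:?}", doubled.len(), load_feature_config(&doubled).map(|t| t.len()));
}
```

With the doubled list at 240 entries, the load fails; in production, that failure surfaced as the widespread 5xx errors users saw.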

The diagnosis was complicated by the rolling nature of the update: the configuration file was regenerated every few minutes, and whether a given refresh produced a good or a bad file depended on which database nodes had already received the change, so the system oscillated between functional and non-functional states. Initially, the engineering team mistook the erratic behavior for a hyper-scale distributed denial-of-service (DDoS) attack. Confusion escalated when Cloudflare’s external status page, which is hosted independently of Cloudflare’s own network, also went down, misleading some observers into thinking the company was facing a targeted attack on its support infrastructure as well.

Reactions from Industry Leaders

During the outage, Dicky Wong, CEO of Syber Couture, stressed the importance of multi-vendor strategies. He argued that while Cloudflare provides impressive tools, dependence on a single provider carries critical risk. Wong put it bluntly, “love is not the same as marriage without a prenup,” advocating diversified risk management to limit exposure to future outages.

Community Perspectives

Comments on platforms like Reddit reflected user frustration during the outage. A user highlighted the reliance of many websites on Cloudflare, noting the difficulty of obtaining information when Cloudflare itself was down. Others echoed the concern about the fragility of the current internet landscape, dominated by a few major providers.

Restoration and Future Improvements

Service was eventually restored by manually pushing a known-good configuration file into the distribution queue. Traffic began to normalize by 14:30 UTC, with full resolution achieved later that afternoon. Following the incident, Cloudflare announced plans to review failure modes across its proxy modules, including how memory pre-allocation limits cope with unexpected data inputs such as the oversized bot-configuration file at the center of this outage. One hypothetical direction for that kind of hardening is sketched below.
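The following is an assumption about what such hardening could look like, not Cloudflare's announced design: the module validates a freshly generated file and, if it is empty or oversized, keeps serving traffic with the last configuration known to be good instead of crashing. All names (BotModule, apply, and the 200-feature limit reused from the sketch above) are illustrative.

```rust
// Hypothetical hardening sketch, not Cloudflare's announced design: reject a bad
// configuration file and keep the last-known-good one instead of crashing.

const MAX_FEATURES: usize = 200;

struct BotConfig {
    features: Vec<String>,
}

struct BotModule {
    active: BotConfig, // last-known-good configuration currently serving traffic
}

impl BotModule {
    // Try to apply a freshly generated file; on validation failure, log the
    // problem and keep the previous configuration.
    fn apply(&mut self, candidate: Vec<String>) {
        if candidate.is_empty() || candidate.len() > MAX_FEATURES {
            eprintln!(
                "rejecting config with {} features (limit {}); keeping previous {} features",
                candidate.len(),
                MAX_FEATURES,
                self.active.features.len()
            );
            return;
        }
        self.active = BotConfig { features: candidate };
    }
}

fn main() {
    let mut module = BotModule {
        active: BotConfig {
            features: (0..120).map(|i| format!("feature_{i}")).collect(),
        },
    };

    // An oversized file (e.g. doubled by duplicate metadata rows) is rejected,
    // and the module keeps running on the previous configuration.
    let oversized: Vec<String> = (0..240).map(|i| format!("feature_{i}")).collect();
    module.apply(oversized);
    assert_eq!(module.active.features.len(), 120);
    println!("still serving with {} features", module.active.features.len());
}
```

Falling back to a stale but valid configuration trades some accuracy for availability, which is generally the safer choice for a module sitting in the critical request path.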

Overall, this outage stands as a reminder of the vulnerabilities within large-scale internet infrastructure and the critical need for robust risk management strategies.