Cloudflare experienced a significant outage on November 18, 2025, causing widespread failures in delivering core network traffic that manifested as HTTP 5xx errors for users. The incident was not a cyber attack but stemmed from a permissions change to a ClickHouse database system. The change inadvertently caused the Bot Management system's feature file to double in size due to duplicate entries. The oversized file was then propagated across the network, where it exceeded the limit of 200 features for which the traffic-routing software (the FL2 Rust code) preallocates memory, causing the software to panic and fail. The network intermittently recovered and failed because the feature file was regenerated every five minutes, and a bad file was produced only when the query ran on a part of the ClickHouse cluster that had already received the permissions update. Core traffic was largely restored by 14:30 UTC, and all systems were fully functional by 17:06 UTC.
The incident resulted in significant failures to deliver core network traffic, with Internet users seeing HTTP 5xx error pages. Core CDN and security services returned HTTP 5xx status codes. Turnstile failed to load, and Workers KV returned significantly elevated HTTP 5xx errors. The Dashboard was largely inaccessible for logins because Turnstile was unavailable. Email Security saw a temporary degradation in spam detection accuracy and some Auto Move action failures, though without critical customer impact. Cloudflare Access experienced widespread authentication failures, though existing sessions were unaffected. There were also significant increases in CDN response latency as debugging systems consumed high amounts of CPU. Customers on the new FL2 proxy engine observed HTTP 5xx errors, while those on the old FL proxy engine saw incorrect bot scores (all traffic scored zero), producing false positives for customers' bot-blocking rules. The Cloudflare status page also went down coincidentally. This was Cloudflare's worst outage since 2019; at its peak, the majority of core traffic stopped flowing.
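To make the false-positive effect on the old FL engine concrete, here is a minimal sketch of a "block when the bot score is below a threshold" rule, written in plain Rust for illustration. The threshold value and the rule shape are assumptions; real Cloudflare firewall rules are expressed in the dashboard's rules language, not application code.

```rust
/// Returns true when the hypothetical rule would block the request.
fn rule_blocks(bot_score: u8, threshold: u8) -> bool {
    bot_score < threshold
}

fn main() {
    let threshold: u8 = 30; // hypothetical customer-chosen threshold

    // Normal operation: a likely-human request with a high score passes.
    assert!(!rule_blocks(85, threshold));

    // During the incident, the old FL engine scored every request as zero,
    // so the same rule blocked legitimate human traffic as well.
    assert!(rule_blocks(0, threshold));
    println!("score 0 < {threshold}: the rule blocks every request");
}
```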
The incident was triggered by a change to the permissions of one of Cloudflare's ClickHouse database systems, deployed at 11:05 UTC. This change, intended to improve security by making access to the underlying `r0` tables explicit, inadvertently affected a specific SQL query (`SELECT name, type FROM system.columns WHERE table = 'http_requests_features' order by name;`) used by the Bot Management feature file generation logic. The query does not filter on database name and had previously been implicitly limited to the 'default' database; after the change it also returned column metadata for the same table in the `r0` database, so the Bot Management feature file contained a large number of duplicate 'feature' rows, effectively doubling its size.
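As an illustration of how the missing database filter doubles the output, the following Rust sketch simulates the metadata lookup: the same table becomes visible under both the `default` and `r0` databases, so a query without an explicit database filter returns each column twice. The struct, the placeholder feature names, and the explicit filter on the database name are illustrative assumptions, not Cloudflare's actual code or remediation.

```rust
/// One row of column metadata, as returned by ClickHouse's `system.columns`.
#[derive(Debug, Clone)]
struct ColumnMeta {
    database: String,
    name: String,
    column_type: String,
}

/// Simulates the effect of the permissions change: the same table's columns
/// are now visible under both `default` and `r0`, so a lookup that does not
/// filter on database name returns every column twice.
fn visible_columns(feature_names: &[&str]) -> Vec<ColumnMeta> {
    let mut rows = Vec::new();
    for db in ["default", "r0"] {
        for name in feature_names {
            rows.push(ColumnMeta {
                database: db.to_string(),
                name: name.to_string(),
                column_type: "Float64".to_string(),
            });
        }
    }
    rows
}

fn main() {
    // Placeholder names; the real feature columns are not listed in the write-up.
    let features = ["feature_a", "feature_b", "feature_c"];

    let all = visible_columns(&features);
    println!("first row: {:?}", all[0]);
    println!("rows without a database filter: {}", all.len()); // 6 (doubled)

    // An explicit filter on the database name removes the duplicates.
    let deduplicated: Vec<&ColumnMeta> =
        all.iter().filter(|c| c.database == "default").collect();
    println!("rows limited to 'default': {}", deduplicated.len()); // 3
}
```

In the incident, the doubled query output fed directly into the generated feature file, which is roughly how the file came to contain twice as many rows as expected.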
Cloudflare became aware of the problem as Internet users trying to access customer sites encountered error pages indicating network failures. The Cloudflare network began experiencing significant failures to deliver core network traffic, evidenced by a spike in HTTP 5xx status codes. An automated test first detected the issue at 11:31 UTC, leading to manual investigation at 11:32 UTC and the creation of an incident call at 11:35 UTC. Initially, the team wrongly suspected a hyper-scale DDoS attack because of the fluctuating nature of the errors (the system recovered and failed roughly every five minutes as alternately good and bad configuration files were distributed) and the coincidental outage of Cloudflare's status page.
The resolution involved correctly identifying the core issue after the initial suspicion of a DDoS attack. The generation and propagation of the oversized Bot Management feature file were stopped. A known-good, earlier version of the feature file was manually inserted into the distribution queue, and the core proxy was forced to restart. Earlier in the incident, at 13:05 UTC, internal system bypasses were implemented for Workers KV and Cloudflare Access, allowing them to fall back to a prior proxy version and reducing their impact. For the Dashboard, a backlog of login attempts overwhelmed the system after the feature configuration data was restored; this was resolved by scaling up control plane concurrency at approximately 15:30 UTC. All remaining services that had entered a bad state were restarted. Core traffic was largely restored by 14:30 UTC, and all systems were fully functional by 17:06 UTC.
The root cause was a change to ClickHouse database permissions, deployed at 11:05 UTC, aimed at improving security by making access to the underlying `r0` database tables explicit. The change inadvertently caused a specific SQL query (`SELECT name, type FROM system.columns WHERE table = 'http_requests_features' order by name;`) used by the Bot Management feature file generation logic to return duplicate column metadata: the query does not filter on database name and was no longer implicitly limited to the 'default' database, so it also matched the table in the `r0` database. This resulted in the Bot Management feature configuration file containing a large number of duplicate 'feature' rows, effectively doubling its size. The software that routes traffic on Cloudflare's network machines, specifically the FL2 Rust code for Bot Management, preallocates memory for a limit of 200 features. When the oversized file, containing more than 200 features, was propagated, this limit was exceeded, causing the system to panic and return HTTP 5xx errors.
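The sketch below gives a rough picture of that failure mode: a loader that accepts at most 200 features and panics when the propagated file carries more. The constant name, the function, the base feature count, and the use of an assertion are assumptions for illustration; this is not the actual FL2 implementation.

```rust
/// Stand-in for the proxy's preallocated capacity; the post-mortem states
/// memory was preallocated for up to 200 features.
const MAX_FEATURES: usize = 200;

/// Illustrative loader: it rejects an over-long feature list by panicking,
/// mirroring the described behaviour of a panic surfacing as HTTP 5xx.
fn load_feature_config(feature_rows: &[String]) -> Vec<String> {
    assert!(
        feature_rows.len() <= MAX_FEATURES,
        "feature file has {} entries, exceeding the {} preallocated slots",
        feature_rows.len(),
        MAX_FEATURES
    );
    feature_rows.to_vec()
}

fn main() {
    // A healthy file (arbitrary illustrative count) fits within the limit.
    let normal: Vec<String> = (0..120).map(|i| format!("feature_{i}")).collect();
    let _ok = load_feature_config(&normal);

    // The duplicated file: every entry appears twice, pushing the count past
    // 200 and triggering the panic that the network saw as 5xx errors.
    let doubled: Vec<String> = normal
        .iter()
        .flat_map(|f| [f.clone(), f.clone()])
        .collect();
    let _bad = load_feature_config(&doubled); // panics: 240 > 200
}
```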