A deep dive into Cloudflare's September 12, 2025 dashboard and API outage

Severity: Major | Category: Bug | Service: Cloudflare
Overview
A broad outage of Cloudflare's Tenant Service API, many other APIs, and the Cloudflare Dashboard occurred on September 12, 2025. The incident was triggered by a bug in a newly released dashboard version (2025-09-12 16:32 UTC) that caused excessive, repeated calls to the Tenant Service API. This bug was due to a problematic object in the dependency array of a React useEffect hook, causing it to re-run excessively. This behavior coincided with a service update to the Tenant Service API (2025-09-12 17:50 UTC), compounding instability and ultimately overwhelming the service. As the Tenant Service is critical for API request authorization, its failure led to widespread 5xx errors. The outage was limited to the control plane and did not affect Cloudflare's network services (data plane).
Impact
The Cloudflare Dashboard was severely impacted for the full duration of the incident, and the Cloudflare API was severely impacted during two distinct periods when the Tenant API Service was down. The failure was confined to the control plane, which is strictly separated from the data plane, so the outage did not affect services running on Cloudflare's network. Most users were unaffected unless they were using the dashboard or making configuration changes through the API.
Trigger
The immediate trigger was a bug in a newly released version of the Cloudflare Dashboard (deployed 2025-09-12 16:32 UTC). The bug caused repeated, unnecessary calls to the Tenant Service API because a problematic object was mistakenly included in the dependency array of a React useEffect hook. Because this object was recreated on every state or prop change, React treated it as 'always new' and re-ran the useEffect each time, executing the API call many times instead of just once. This excessive call volume coincided with a service update to the Tenant Service API (2025-09-12 17:50 UTC), which compounded the instability and ultimately overwhelmed the service.
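To make the failure mode concrete, here is a minimal React sketch. The TenantPanel component and the fetchTenantDetails helper are hypothetical stand-ins, not Cloudflare's actual dashboard code; the sketch only illustrates how an object literal in a useEffect dependency array defeats React's reference comparison and re-triggers the effect on every render.

```tsx
import { useEffect, useState } from "react";

// Hypothetical helper standing in for the real Tenant Service API call.
async function fetchTenantDetails(query: { accountId: string }): Promise<unknown> {
  const res = await fetch(`/api/v4/accounts/${query.accountId}/tenant`); // illustrative endpoint
  return res.json();
}

function TenantPanel({ accountId }: { accountId: string }) {
  const [tenant, setTenant] = useState<unknown>(null);

  // BUG: `query` is a brand-new object on every render. useEffect compares
  // dependencies by reference, so it sees "a new value" each time, re-runs,
  // and fires another Tenant Service request on every render.
  //
  //   const query = { accountId };
  //   useEffect(() => {
  //     fetchTenantDetails(query).then(setTenant);
  //   }, [query]);

  // FIX: depend on the stable primitive value (or memoize the object with
  // useMemo) so the effect re-runs only when accountId actually changes.
  useEffect(() => {
    fetchTenantDetails({ accountId }).then(setTenant);
  }, [accountId]);

  return <pre>{JSON.stringify(tenant, null, 2)}</pre>;
}

export default TenantPanel;
```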
Detection
Cloudflare became aware of the incident when Dashboard Availability began to drop at 2025-09-12 17:57 UTC, coinciding with the Tenant API Service becoming overwhelmed as new versions were deploying. Automatic alerting quickly identified the right people, who joined the incident call and began working on remediation.
Resolution
Initial efforts focused on restoring service by reducing load and increasing resources, including putting a global rate limit on the Tenant Service and increasing the number of Kubernetes pods. While this improved API availability to 98% by 18:17 UTC, the dashboard did not recover. An attempt to patch the Tenant Service at 18:58 UTC by removing some erroring codepaths proved detrimental, degrading service further and causing a second period of API impact. A temporary rate-limiting rule was published at 19:01 UTC. The outage was ultimately resolved at 19:12 UTC by reverting the problematic changes to the Tenant API Service, which restored Dashboard Availability to 100%. Shortly after the impact ended, a hotfix was released to address the 'Thundering Herd' effect, in which dashboards re-authenticated simultaneously upon service recovery. Future changes will include introducing random delays (jitter) to spread out retries.
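The retry change can be sketched as a jittered retry loop. The retryWithJitter helper, its parameters, and the commented endpoint below are illustrative assumptions rather than Cloudflare's actual client code; the point is that a random delay spreads retries out so that recovering dashboards do not all re-authenticate in the same instant.

```ts
// Illustrative sketch of jittered retries; not Cloudflare's actual code.
async function retryWithJitter<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // Exponential backoff capped at 30s, plus full random jitter so that
      // thousands of clients recovering at the same moment do not all retry
      // in the same instant (the "Thundering Herd" described above).
      const backoff = Math.min(baseDelayMs * 2 ** (attempt - 1), 30_000);
      const delay = Math.random() * backoff;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Example usage against a hypothetical re-authentication endpoint:
// retryWithJitter(() =>
//   fetch("/api/v4/user/tokens/verify").then((r) => {
//     if (!r.ok) throw new Error(`HTTP ${r.status}`);
//     return r.json();
//   }),
// );
```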
Root Cause
The primary root cause was a latent bug in the Cloudflare Dashboard's React useEffect hook, which caused excessive and repeated calls to the Tenant Service API. The bug was exacerbated by a coincident service update to the Tenant Service API, which overwhelmed the service. A contributing factor was the Tenant Service's insufficient capacity to handle such spikes in load. Furthermore, because the Tenant Service plays a critical role in API request authorization, its failure cascaded to other APIs and the dashboard, resulting in 5xx errors. The absence of Argo Rollouts for the Tenant Service allowed a detrimental patch (removing erroring codepaths) to be deployed at 18:58 UTC, causing a second outage instead of being automatically rolled back. Finally, a 'Thundering Herd' effect, in which dashboards simultaneously re-authenticated upon partial service recovery, amplified the instability.
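As an illustration of the kind of safeguard a global rate limit provides to a service on the authorization path, here is a minimal token-bucket sketch; the TokenBucket class, its capacity numbers, and the handleTenantLookup handler are assumptions for illustration, not the Tenant Service's actual implementation. The idea is to shed excess load with an explicit 429 rather than letting the whole authorization path degrade into 5xx errors.

```ts
// Minimal token-bucket rate limiter sketch; parameters are illustrative only.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private readonly capacity: number,        // max burst size
    private readonly refillPerSecond: number, // sustained request rate
  ) {
    this.tokens = capacity;
  }

  /** Returns true if the request may proceed, false if it should be shed. */
  tryAcquire(): boolean {
    const now = Date.now();
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Hypothetical use in front of a tenant-lookup handler: shed excess load
// with a 429 rather than letting the whole authorization path return 5xx.
const globalLimit = new TokenBucket(1000, 500);

function handleTenantLookup(respond: (status: number) => void): void {
  if (!globalLimit.tryAcquire()) {
    respond(429); // Too Many Requests: clients should back off and retry with jitter
    return;
  }
  respond(200); // ...the actual tenant lookup would happen here
}
```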