This summary is created by Generative AI and may differ from the actual content.
Overview
On August 21, 2025, starting at 16:27 UTC, a surge of traffic from a single customer, directed towards clients hosted in AWS us-east-1, caused severe network congestion on Cloudflare's links with AWS us-east-1. This led to high latency, packet loss, and connection failures for users interacting with Cloudflare via AWS us-east-1. AWS began withdrawing BGP prefixes at 16:37 UTC, and the Cloudflare network team was alerted at 16:44 UTC. The congestion was substantially reduced by 19:27 UTC through manual interventions, with intermittent latency issues fully resolving by 20:18 UTC. This was a regional issue and did not affect global Cloudflare services.
Impact
The incident caused significant performance degradation for customers with origins in AWS us-east-1, manifesting as high latency, packet loss, and failed connections to origins. The issue was regional, affecting only traffic between Cloudflare and AWS us-east-1, and did not impact global Cloudflare services. Network queues on Cloudflare's edge routers grew significantly and consistently dropped high-priority packets, pushing customer-facing latency Service Level Objectives (SLOs) below acceptable thresholds.
Trigger
The incident was triggered by a single customer initiating a large volume of requests from AWS us-east-1 to Cloudflare for cached objects. This generated an unprecedented surge of response traffic that saturated all direct peering connections between Cloudflare and AWS us-east-1.
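For intuition about why a flood of requests for cached objects translates into saturated egress links: the inbound requests are small, but each one triggers a much larger cached response, so the load lands on the outbound (Cloudflare-to-AWS) side of the peering links. The figures in this back-of-envelope sketch are purely hypothetical and are not taken from the incident.

```python
# Illustrative only: all numbers are hypothetical, not measurements from the incident.
request_rate = 1_000_000        # requests per second from the single customer (assumed)
request_size_bytes = 1_500      # typical HTTP request size (assumed)
response_size_bytes = 100_000   # typical cached object size (assumed)

inbound_gbps = request_rate * request_size_bytes * 8 / 1e9
outbound_gbps = request_rate * response_size_bytes * 8 / 1e9

print(f"inbound requests:   ~{inbound_gbps:.0f} Gbps")   # ~12 Gbps toward Cloudflare
print(f"outbound responses: ~{outbound_gbps:.0f} Gbps")  # ~800 Gbps toward AWS us-east-1
```

Even with modest assumed request sizes, the response traffic is roughly two orders of magnitude larger than the request traffic that produces it.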
Detection
The Cloudflare network team was alerted to internal congestion in Ashburn (IAD) at 16:44 UTC, approximately 17 minutes after the traffic surge began. The alert was most likely raised by monitoring systems detecting the saturation of the peering links and the resulting performance degradation.
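The summary does not describe the specific alerting logic. The following is a minimal sketch of the kind of check that could detect a saturated link, assuming periodic transmit-byte counters are available (read here from the Linux interface counter for simplicity; a production network would use router telemetry such as SNMP or gNMI). The capacity, threshold, and polling interval are assumptions.

```python
import time

LINK_CAPACITY_BPS = 400e9       # assumed per-link capacity (400 Gbps)
SATURATION_THRESHOLD = 0.95     # alert when sustained utilization exceeds 95%
POLL_INTERVAL_S = 30

def read_tx_bytes(interface: str) -> int:
    """Read the transmit byte counter for an interface (Linux counter used
    here as a stand-in for router telemetry)."""
    with open(f"/sys/class/net/{interface}/statistics/tx_bytes") as f:
        return int(f.read())

def watch(interface: str) -> None:
    """Poll the counter and flag sustained saturation of the link."""
    prev = read_tx_bytes(interface)
    while True:
        time.sleep(POLL_INTERVAL_S)
        cur = read_tx_bytes(interface)
        utilization = (cur - prev) * 8 / POLL_INTERVAL_S / LINK_CAPACITY_BPS
        if utilization >= SATURATION_THRESHOLD:
            print(f"ALERT: {interface} at {utilization:.0%} of capacity")
        prev = cur
```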
Resolution
Resolution involved immediate engagement and close collaboration between Cloudflare's incident team and AWS. Cloudflare manually rate-limited the single customer responsible for the traffic surge, which began to reduce congestion by 19:05 UTC. Cloudflare's network team also performed additional traffic engineering actions, fully resolving the congestion by 19:27 UTC. AWS, which had initially withdrawn BGP advertisements in a way that rerouted traffic and worsened congestion, began reverting those withdrawals at Cloudflare's request and normalized its prefix announcements by 20:07 UTC.
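The summary does not state how the rate limit was implemented. A common approach is a token bucket keyed by customer, sketched below with hypothetical limits; the `admit` helper and its defaults are illustrative, not Cloudflare's API.

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float                                      # requests refilled per second
    burst: float                                     # maximum bucket size
    tokens: float = 0.0
    last: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        """Refill tokens based on elapsed time, then admit one request if possible."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Hypothetical per-customer limits applied during mitigation.
buckets: dict[str, TokenBucket] = {}

def admit(customer_id: str, rate: float = 10_000, burst: float = 20_000) -> bool:
    bucket = buckets.setdefault(
        customer_id, TokenBucket(rate=rate, burst=burst, tokens=burst)
    )
    return bucket.allow()
```

Requests for which `admit` returns False could be rejected (for example with HTTP 429) or served at reduced priority, capping the response bandwidth a single customer can push toward the congested links.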
Root Cause
The primary root cause was insufficient network capacity and a lack of robust customer isolation mechanisms to handle an exceptional traffic surge from a single customer. Specifically, Cloudflare's direct peering connections with AWS us-east-1 were saturated by the high volume of response traffic. This was exacerbated by a pre-existing failure causing one direct peering link to operate at half capacity, and an undersized Data Center Interconnect (DCI) connecting Cloudflare's edge routers to an offsite network interconnection switch. Furthermore, AWS's attempts to alleviate congestion by withdrawing BGP advertisements inadvertently rerouted traffic to other already saturated peering links, worsening the overall impact. The absence of a system to selectively deprioritize or budget network resources on a per-customer basis allowed this single customer's traffic to monopolize shared resources and degrade service for others.
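The summary names the missing per-customer control without prescribing a design. One possible shape, sketched here with hypothetical traffic-class names and budget values, is to meter each customer's egress over a short window and remark over-budget traffic into a lower-priority queue, so that under congestion drops fall on the offending traffic first rather than on every customer sharing the links.

```python
import time
from collections import defaultdict

# Hypothetical traffic classes; in practice these would map to DSCP markings
# honoured by the edge-router queueing configuration.
HIGH_PRIORITY = "AF31"
LOW_PRIORITY = "CS1"

WINDOW_S = 60
PER_CUSTOMER_BUDGET_BYTES = 50e9   # assumed fair-share egress budget per window

class EgressBudget:
    """Track per-customer egress within a window and demote traffic from
    customers that exceed their budget, instead of letting one customer
    crowd out everyone sharing the same links."""

    def __init__(self) -> None:
        self.window_start = time.monotonic()
        self.bytes_sent: dict[str, float] = defaultdict(float)

    def classify(self, customer_id: str, response_bytes: int) -> str:
        now = time.monotonic()
        if now - self.window_start >= WINDOW_S:
            # Start a new accounting window.
            self.window_start = now
            self.bytes_sent.clear()
        self.bytes_sent[customer_id] += response_bytes
        if self.bytes_sent[customer_id] > PER_CUSTOMER_BUDGET_BYTES:
            return LOW_PRIORITY    # dropped first under congestion
        return HIGH_PRIORITY
```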