The 28-Hour Meltdown: What Happened When AWS US-EAST-1 Overheated

Severity: Critical

Category: Hardware

Service: AWS

This summary is created by Generative AI and may differ from the actual content.

Overview

On May 7 2026, a data center in AWS us‑east‑1 (availability zone use1‑az4) experienced temperatures that exceeded safe operating limits. The automatic thermal protection caused servers to shut down, leading to loss of power across affected racks. Consequently, EC2 instances and EBS volumes went down, and downstream services that depend on them (ELB, EKS, ElastiCache, Redshift, OpenSearch, Managed Streaming for Apache Kafka) experienced errors or full outages. The event lasted roughly 28 hours, with recovery involving traffic shifting, partial power restoration, and gradual re‑commissioning of cooling capacity. Redshift’s outage was later identified as a separate upstream dependency issue resolved independently.

Impact

The outage impacted thousands of AWS customers across multiple services. EC2 and EBS were unavailable for the majority of the AZ, causing compute‑and‑storage‑dependent workloads to fail. Managed services such as ELB, EKS, ElastiCache, OpenSearch, and Managed Streaming for Apache Kafka reported elevated error rates or complete failures. The total duration was about 28 hours, representing one of the longest single‑AZ outages in recent AWS history. No direct financial loss is reported, but customers experienced service disruption and potential data loss for workloads without multi‑AZ redundancy or recent snapshots.

Trigger

At 4:20 PM PDT on May 7, internal temperature sensors detected that the data‑center environment exceeded the safe operating threshold. The built‑in thermal protection mechanism automatically powered down the servers to prevent hardware damage, which in turn caused a loss of power to the racks.

Detection

The thermal event was first identified by AWS’s internal monitoring systems that flagged the temperature breach and initiated the automatic shutdown. Subsequent service‑level monitoring (e.g., increased HTTP 5xx errors, degraded health checks) alerted the on‑call SRE team, who began incident response and public communication at 5:06 PM and 5:25 PM respectively.

Resolution

AWS engineers first shifted traffic away from the affected AZ (5:06 PM). Power was restored to a subset of infrastructure by 9:12 PM, and additional cooling capacity was brought online at 10:11 PM, allowing racks to recover gradually. By May 8 9:30 AM Redshift’s separate issue was resolved, and by 1:50 PM cooling stabilized to pre‑event levels, restoring most EC2 and EBS resources. The incident was declared resolved at 8:04 PM on May 8, with a small number of instances and volumes still being assisted directly with customers.

Root Cause

The root cause was a failure of the data‑center cooling infrastructure that allowed ambient temperature to rise above safe limits, triggering the automatic server shutdown. The postmortem does not confirm whether a shared cooling plant caused the AZ‑wide impact; it only notes that a single data‑center event propagated across the AZ, possibly due to shared physical infrastructure. Redshift’s outage was unrelated to the thermal event.

See the original post