High error rates for ChatGPT, APIs, and Sora

Severity: MajorCategory: HardwareService: OpenAI
This summary is created by Generative AI and may differ from the actual content.
Overview
Starting at 10:40 AM on December 26th, 2024, multiple OpenAI products, including ChatGPT, Sora video creation, and various APIs, experienced degraded availability with error rates exceeding 90%. The text completions API remained unaffected. Full recovery was achieved for most systems by 3:11 PM, except for ChatGPT, which fully recovered by 8:16 PM. The incident was caused by a power failure in a cloud provider's data center, impacting critical services such as databases. Manual intervention was required for region-wide failover, which elongated the mitigation time. OpenAI plans to add a layer of indirection to ensure faster failover in the future.
Impact
Degraded availability for multiple OpenAI products, with error rates exceeding 90% for ChatGPT, Sora, and many APIs. The text completions API was unaffected. Full recovery for most systems by 3:11 PM, except for ChatGPT, which fully recovered by 8:16 PM.
Trigger
A power failure in a cloud provider's data center.
Detection
High error rates were detected across multiple OpenAI products, leading to the investigation and identification of the power failure as the cause.
Resolution
Worked with the cloud provider to fail over some databases to other regions. Full recovery was achieved when the cloud provider fully recovered the affected region. OpenAI plans to add a layer of indirection to ensure faster failover in the future.
Root Cause
Power failure in a cloud provider's data center, impacting critical services such as databases. Manual intervention was required for region-wide failover, which elongated the mitigation time.
;