High error rates for ChatGPT, APIs, and Sora

Severity: Major
Category: Hardware
Service: OpenAI

This summary is created by Generative AI and may differ from the actual content.

Overview

Starting at 10:40 AM on December 26th, 2024, multiple OpenAI products, including ChatGPT, Sora video creation, and various APIs, experienced degraded availability with error rates exceeding 90%. The text completions API was unaffected. Most systems fully recovered by 3:11 PM; ChatGPT fully recovered by 8:16 PM. The incident was caused by a power failure in a cloud provider's data center, which affected critical services such as databases. Region-wide failover required manual intervention, which prolonged mitigation. OpenAI plans to add a layer of indirection so that future failovers are faster.

Impact

Multiple OpenAI products had degraded availability, with error rates exceeding 90% for ChatGPT, Sora, and many APIs. The text completions API was unaffected. Most systems fully recovered by 3:11 PM; ChatGPT fully recovered by 8:16 PM.
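The recovery times translate into outage durations measured from the 10:40 AM start. The following is a minimal sketch of that arithmetic, assuming both recovery times fall on December 26th, 2024 and share the start time's (unstated) time zone. The helper name `outage_duration` is illustrative only.

```python
from datetime import datetime, timedelta

# Timestamp layout used below; the time zone is not stated in the summary,
# so all values are assumed to share one zone.
FMT = "%Y-%m-%d %I:%M %p"


def outage_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two timestamps in the same time zone."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


start = "2024-12-26 10:40 AM"
most_systems = outage_duration(start, "2024-12-26 03:11 PM")
chatgpt = outage_duration(start, "2024-12-26 08:16 PM")

print(most_systems)  # 4:31:00
print(chatgpt)       # 9:36:00
```

Under those assumptions, most systems were degraded for about 4.5 hours and ChatGPT for about 9.6 hours.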

Trigger

A power failure in a cloud provider's data center.

Detection

High error rates were detected across multiple OpenAI products, which prompted an investigation that identified the power failure as the cause.
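The summary does not describe the monitoring that raised the alarm. As a generic illustration of error-rate detection, the sketch below tracks the failure fraction over a sliding window of recent requests. The class name `ErrorRateMonitor`, the window size, and the threshold are all assumptions; the 0.9 threshold simply mirrors the error rate reported for this incident, and a production alert would normally fire at a much lower rate.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the error rate over the most recent `window` requests (hypothetical helper)."""

    def __init__(self, window: int = 100, threshold: float = 0.9):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def error_rate(self) -> float:
        if not self.results:
            return 0.0
        failures = sum(1 for ok in self.results if not ok)
        return failures / len(self.results)

    def is_alerting(self) -> bool:
        # Require a full window so a few early failures do not trigger an alert.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.error_rate() > self.threshold


monitor = ErrorRateMonitor(window=10, threshold=0.9)
for _ in range(10):
    monitor.record(ok=False)
print(monitor.is_alerting())  # True: 10 of the last 10 requests failed
```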

Resolution

OpenAI worked with the cloud provider to fail over some databases to other regions. Full recovery was achieved once the cloud provider had fully restored the affected region. OpenAI plans to add a layer of indirection so that future failovers are faster.
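The summary does not say what form the planned layer of indirection will take. One common pattern, sketched here purely as an illustration, has clients look up a logical database name in a directory instead of hard-coding a regional address, so a region can be taken out of rotation with a single change. Every name below (`EndpointDirectory`, `region-a`, the `.example.internal` addresses) is hypothetical and not taken from OpenAI's systems.

```python
class EndpointDirectory:
    """Resolve logical database names to regional addresses (hypothetical sketch)."""

    def __init__(self, endpoints: dict, preferred_region: str):
        # endpoints: logical name -> {region: address}
        self._endpoints = endpoints
        self._preferred = preferred_region
        self._failed_regions = set()

    def fail_region(self, region: str) -> None:
        """Take a whole region out of rotation in one step."""
        self._failed_regions.add(region)

    def restore_region(self, region: str) -> None:
        self._failed_regions.discard(region)

    def resolve(self, name: str) -> str:
        by_region = self._endpoints[name]
        # Try the preferred region first, then the rest in declaration order.
        candidates = [self._preferred] + [r for r in by_region if r != self._preferred]
        for region in candidates:
            if region in by_region and region not in self._failed_regions:
                return by_region[region]
        raise RuntimeError(f"no healthy region available for {name!r}")


directory = EndpointDirectory(
    endpoints={
        "conversations": {
            "region-a": "db.region-a.example.internal",
            "region-b": "db.region-b.example.internal",
        }
    },
    preferred_region="region-a",
)

print(directory.resolve("conversations"))  # db.region-a.example.internal
directory.fail_region("region-a")
print(directory.resolve("conversations"))  # db.region-b.example.internal
```

The point of the pattern is that callers never change: failover becomes an update to the directory rather than a manual, per-service reconfiguration. It does not remove the need for replicated data in the fallback region.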

Root Cause

A power failure in a cloud provider's data center affected critical services such as databases. Region-wide failover required manual intervention, which prolonged mitigation.