API, ChatGPT & Sora Facing Issues

Severity: Major

Category: Change Process

Service: OpenAI

This summary is created by Generative AI and may differ from the actual content.

Overview

On December 11, 2024, OpenAI services, including API, ChatGPT, and Sora, experienced significant downtime due to a new telemetry service deployment that overwhelmed the Kubernetes control plane, causing cascading failures. The incident lasted from 3:16 PM PST to 7:38 PM PST, with recovery times varying for different services.

Impact

All OpenAI services were significantly degraded or unavailable, affecting users globally. The incident was not caused by a security breach or a recent launch.

Trigger

The deployment of a new telemetry service unintentionally caused every node in each cluster to execute resource-intensive Kubernetes API operations, overwhelming the control plane.

Detection

Alerts fired at 3:13 PM PST, notifying engineers shortly after the deployment began and about three minutes before the outage window opened at 3:16 PM PST.

Resolution

Engineers scaled down cluster sizes, blocked network access to Kubernetes admin APIs, and scaled up Kubernetes API servers to restore control and remove the offending service. Traffic was shifted to healthy clusters where possible.
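
The three mitigations act on different terms of one ratio: the load offered to the control plane versus its capacity. The sketch below is a toy model of that ratio, not OpenAI's tooling. Every name and number in it (Cluster, utilization, the node counts and request rates) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cluster:
    """Toy model of one cluster; all fields are illustrative assumptions."""
    nodes: int                     # worker nodes running the telemetry agent
    requests_per_node: float       # expensive API requests per second per node
    api_servers: int               # Kubernetes API server replicas
    capacity_per_server: float     # requests per second one replica can absorb
    blocked_fraction: float = 0.0  # share of requests dropped at the network layer


def utilization(c: Cluster) -> float:
    """Offered load divided by capacity; a value above 1.0 means overload."""
    offered = c.nodes * c.requests_per_node * (1.0 - c.blocked_fraction)
    capacity = c.api_servers * c.capacity_per_server
    return offered / capacity


# Hypothetical starting point: load is four times what the API servers can absorb.
overloaded = Cluster(nodes=5000, requests_per_node=2.0,
                     api_servers=5, capacity_per_server=500.0)

# Each mitigation named in the summary changes one term of the ratio.
scaled_down = replace(overloaded, nodes=2500)        # shrink the cluster
blocked = replace(overloaded, blocked_fraction=0.5)  # block admin API access
scaled_up = replace(overloaded, api_servers=10)      # add API server replicas
combined = replace(overloaded, nodes=2500, blocked_fraction=0.5, api_servers=10)

for name, c in [("overloaded", overloaded), ("scaled_down", scaled_down),
                ("blocked", blocked), ("scaled_up", scaled_up),
                ("combined", combined)]:
    print(f"{name:12s} utilization = {utilization(c):.2f}")
```

With these made-up numbers, no single lever brings utilization below 1.0, but the three together do. The real thresholds depend on figures the summary does not give.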

Root Cause

A new telemetry service configuration generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery. DNS caching exacerbated the issue: cached records kept resolving for a time, which delayed visibility of the problem.
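
To illustrate how caching can hide a discovery failure, here is a minimal sketch of a TTL cache in front of an upstream lookup. It is a simplified model, not the resolver OpenAI runs. The class names, the 30-second TTL, the hostname svc.internal, and the address 10.0.0.7 are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple


class CachingResolver:
    """Minimal TTL cache in front of an upstream DNS lookup (illustrative only)."""

    def __init__(self, upstream: Callable[[str], Optional[str]], ttl: float):
        self.upstream = upstream
        self.ttl = ttl
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (address, expiry)

    def resolve(self, name: str, now: float) -> Optional[str]:
        entry = self._cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]              # fresh cache hit: upstream is not consulted
        address = self.upstream(name)
        if address is None:
            self._cache.pop(name, None)  # entry expired and upstream failed
            return None
        self._cache[name] = (address, now + self.ttl)
        return address


class FlakyUpstream:
    """Hypothetical upstream that stops answering once marked unhealthy."""

    def __init__(self) -> None:
        self.healthy = True

    def __call__(self, name: str) -> Optional[str]:
        return "10.0.0.7" if self.healthy else None


upstream = FlakyUpstream()
resolver = CachingResolver(upstream, ttl=30.0)

timeline: List[Tuple[float, Optional[str]]] = []
for t in [0.0, 10.0, 20.0, 29.0, 31.0]:
    if t == 10.0:
        upstream.healthy = False         # simulated outage begins at t=10
    timeline.append((t, resolver.resolve("svc.internal", now=t)))

for t, address in timeline:
    print(f"t={t:>4}s -> {address}")
```

In this model the upstream fails at t=10, yet lookups keep succeeding until the cached record expires at t=30. The first visible failure is the lookup at t=31, which is the kind of delay the summary attributes to DNS caching.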
