API & ChatGPT Performance Degradation

Severity: Major
Category: Misconfiguration
Service: OpenAI
This summary is created by Generative AI and may differ from the actual content.
Overview
On December 4, 2024, OpenAI experienced two waves of incidents affecting API and ChatGPT performance. The first wave, from 15:48 PT to 15:52 PT, was caused by a misconfiguration in the global load balancer, which made 100% of API requests fail with HTTP 530 errors; it was resolved by applying the correct load balancer configuration. The second wave, from 16:07 PT to 17:37 PT, was triggered by an upgrade of the DNS cache system, which caused connectivity issues for the OpenAI identity system and resulted in 45% of API requests returning 499 errors; it was mitigated by temporarily switching to an alternative system. To prevent future incidents, OpenAI collaborated with Cloudflare and improved its chaos testing.
Impact
During the first wave, 100% of API requests failed with HTTP 530 errors. During the second wave, 45% of API requests returned 499 errors. Both waves degraded API and ChatGPT performance, causing elevated latencies and a 30-second wait time for users.
Trigger
The first wave was triggered by a misconfiguration in the global load balancer, and the second wave was triggered by an upgrade of the DNS cache system.
Detection
The issues were detected through monitoring of API performance and user reports of elevated latencies and request failures.
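As an illustration of the kind of monitoring described here, the following is a minimal sketch of a sliding-window error-rate check. It is not OpenAI's actual tooling: the class name `ErrorRateMonitor`, the window size, and the 5% alert threshold are all assumptions chosen for the example. Only the status codes 530 and 499 come from the incident summary.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical monitor: tracks the share of failed requests
    among the most recent `window_size` requests."""

    def __init__(self, window_size=1000, threshold=0.05,
                 failure_statuses=(499, 530)):
        # deque with maxlen drops the oldest entry automatically
        self.window = deque(maxlen=window_size)
        self.threshold = threshold  # assumed alert threshold
        self.failure_statuses = set(failure_statuses)

    def record(self, status_code):
        # Treat the listed codes and any 5xx response as a failure
        failed = (status_code in self.failure_statuses
                  or 500 <= status_code <= 599)
        self.window.append(failed)

    def error_rate(self):
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)

    def should_alert(self):
        return self.error_rate() >= self.threshold


# Simulated traffic resembling the second wave: 45 of 100 requests fail
monitor = ErrorRateMonitor(window_size=100, threshold=0.05)
for _ in range(55):
    monitor.record(200)
for _ in range(45):
    monitor.record(499)

print(monitor.error_rate())    # 0.45
print(monitor.should_alert())  # True
```

A count-based window is used here for simplicity; a production monitor would more likely use a time-based window and separate thresholds per status code.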
Resolution
The first issue was resolved by applying the correct configuration to the global load balancer. The second issue was mitigated by temporarily switching to an alternative system. As follow-up, OpenAI collaborated with Cloudflare and improved its chaos testing to prevent future incidents.
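The two ideas in this section, switching to an alternative system and chaos testing, can be sketched together. The example below is hypothetical and does not reflect OpenAI's or Cloudflare's systems: `FlakyResolver`, `FallbackResolver`, the hostname, and the address are invented for illustration. It shows a lookup path that falls back to a second resolver when the primary fails, plus a chaos-style check that injects the failure on purpose.

```python
class ResolverUnavailable(Exception):
    """Raised when a resolver cannot answer a lookup."""


class FlakyResolver:
    """Hypothetical stand-in for a DNS cache, with a switch to inject failure."""

    def __init__(self, records):
        self.records = dict(records)
        self.fail = False  # set to True to simulate an outage
        self.calls = 0

    def resolve(self, hostname):
        self.calls += 1
        if self.fail:
            raise ResolverUnavailable(hostname)
        return self.records[hostname]


class FallbackResolver:
    """Tries the primary resolver first, then the alternative one."""

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    def resolve(self, hostname):
        try:
            return self.primary.resolve(hostname)
        except ResolverUnavailable:
            # Primary is down: answer from the alternative system instead
            return self.fallback.resolve(hostname)


# Hypothetical record used only for this example
records = {"identity.internal.example": "10.0.0.5"}
primary = FlakyResolver(records)
fallback = FlakyResolver(records)
resolver = FallbackResolver(primary, fallback)

# Chaos-style check: break the primary and confirm lookups still succeed
primary.fail = True
address = resolver.resolve("identity.internal.example")
print(address)         # 10.0.0.5
print(fallback.calls)  # 1
```

Running a check like this before an upgrade is one way to find out whether a dependent service survives the loss of a component, rather than discovering the dependency during an incident.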
Root Cause
The root cause of the first wave was a misconfiguration in the global load balancer. The root cause of the second wave was the identity system's integration with the DNS cache; this integration was found to be unnecessary and was removed.