A postmortem of three recent issues

Severity: MajorCategory: Change ProcessService: Anthropic

This summary is created by Generative AI and may differ from the actual content.

Overview

Between August and early September, three distinct infrastructure bugs intermittently degraded Claude's response quality across various platforms. The first, a context window routing error introduced on August 5, misrouted Sonnet 4 requests to 1M token context servers, affecting up to 16% of Sonnet 4 requests at its peak and approximately 30% of Claude Code users. This was exacerbated by a load balancing change on August 29. The second, an output corruption issue, arose from a misconfiguration deployed on August 25 to Claude API TPU servers, leading to unexpected characters or syntax errors in responses for Opus 4.1, Opus 4, and Sonnet 4. The third, an approximate top-k XLA:TPU miscompilation, was triggered on August 25 by new code intended to improve token selection, inadvertently exposing a latent compiler bug that caused 'completely wrong results' for Haiku 3.5 and potentially Sonnet 4 and Opus 3. Detection was challenging due to overlapping issues, inconsistent symptoms, noisy evaluations, and privacy controls limiting debugging access.

Impact

The incident intermittently degraded Claude's response quality. The context window routing error initially affected 0.8% of Sonnet 4 requests, peaking at 16% at its worst hour on August 31, and impacted approximately 30% of Claude Code users with at least one misrouted message. On Amazon Bedrock, misrouted traffic peaked at 0.18%, while Google Cloud's Vertex AI saw less than 0.0004% impact. 'Sticky' routing caused some users to be more severely affected. Output corruption led to unexpected characters (e.g., Thai/Chinese in English prompts) or syntax errors in responses for Opus 4.1, Opus 4, and Sonnet 4. The XLA:TPU miscompilation caused 'completely wrong results' for Haiku 3.5 and potentially Sonnet 4 and Opus 3. Overall, users experienced inconsistent quality, falling below Anthropic's expected high standards.

Trigger

The context window routing error was introduced on August 5 when some Sonnet 4 requests were misrouted to servers configured for the 1M token context window, and was exacerbated by a routine load balancing change on August 29. The output corruption was triggered by a misconfiguration deployed to Claude API TPU servers on August 25, caused by a runtime performance optimization. The approximate top-k XLA:TPU miscompilation was triggered on August 25 by new code deployed to improve how Claude selects tokens during text generation, which inadvertently exposed a latent bug in the XLA:TPU compiler that had been previously masked by a workaround.

Detection

Awareness of the degradation began in early August with user reports, which were initially difficult to distinguish from normal feedback variation. By late August, the increasing frequency and persistence of these reports prompted an internal investigation, leading to the discovery of the three separate infrastructure bugs. A spike in negative reports on August 29, following a load balancing change, was not immediately connected to the underlying issues.

Resolution

For the context window routing error, the routing logic was fixed to direct requests to correct server pools, with the fix deployed on September 4 and rollouts completed by September 16 for most platforms. For output corruption, the misconfiguration was identified and rolled back on September 2, and detection tests for unexpected character outputs were added. For the XLA:TPU miscompilation, affected changes were rolled back on September 4 (Haiku 3.5) and September 12 (Opus 3), and out of caution for Sonnet 4. Simultaneously, the system was switched from approximate to exact top-k, operations were standardized on fp32 precision, and collaboration with the XLA:TPU team on a compiler fix was initiated.

Root Cause

The root causes include a misconfiguration in routing logic for Sonnet 4 requests, a deployed misconfiguration causing errors during token generation on Claude API TPU servers, and a latent XLA:TPU compiler bug related to mixed precision arithmetic and an approximate top-k operation that was exposed by a sampling code rewrite. Underlying these technical issues were critical gaps in validation processes, over-reliance on noisy evaluations, and limitations in debugging tooling due to privacy practices, which collectively delayed detection and resolution. The overlapping nature of the bugs and inconsistent symptoms also made diagnosis particularly challenging.

See the original post