package tarball read outage today

Severity: Major
Category: Dependencies
Service: npm

Overview

The npm registry experienced a read outage affecting 0.5% of all package tarballs in all network regions. The affected tarballs were unavailable for about 16 hours, from mid-afternoon PDT on July 5 to early morning on July 6. The root cause was an interaction between file modification times, the way nginx generates etags, and cache headers.

Impact

During the outage, 0.5% of all package tarballs were completely unavailable from every region of our CDN.

Trigger

We began running a script that updated all existing tarballs on all servers to the new etags scheme. The script ran over the course of several hours. Unfortunately, the script that applied this change to our production environment failed to clamp the integer it computed, which produced negative numbers for file timestamps.
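The missing clamp can be illustrated with a minimal sketch. It assumes the migration computed an integer timestamp per file and wrote it as the file's mtime; how the real script derived that integer is not described here, so `clamp_mtime` and `set_mtime` are hypothetical names, and the bounds of 0 and 2**31 - 1 are illustrative choices rather than the values npm used.

```python
import os


def clamp_mtime(value: int, lower: int = 0, upper: int = 2**31 - 1) -> int:
    """Force a computed timestamp into [lower, upper] so it can never be negative."""
    return max(lower, min(value, upper))


def set_mtime(path: str, computed: int) -> int:
    """Write a clamped mtime to `path`, leaving atime untouched.

    Hypothetical helper: `computed` stands in for whatever integer the
    migration script produced for this file.
    """
    safe = clamp_mtime(computed)
    st = os.stat(path)
    os.utime(path, (st.st_atime, safe))
    return safe
```

With a guard like this, a bad input would yield an mtime at the lower bound instead of a negative value, which is wrong but harmless to the etag logic.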

Detection

We received a report that some servers were returning 502 errors.

Resolution

We immediately ran a new script to fix the mtimes of all tarball files that appeared in the logs as producing 502 errors.
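A repair of this kind might look roughly like the sketch below. It is not the script that was actually run. It assumes access logs in nginx's default "combined" format, and it assumes that resetting a negative mtime to the current time is an acceptable fix; `paths_with_502` and `repair_mtime` are hypothetical names. Mapping a request path to a file on disk is left to the caller.

```python
import os
import re
import time

# Matches the request and status fields of a "combined"-format log line,
# e.g. ... "GET /pkg/-/pkg-1.0.0.tgz HTTP/1.1" 502 ...
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3}) ')


def paths_with_502(log_lines):
    """Return the sorted, de-duplicated request paths that were answered with a 502."""
    seen = set()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2) == "502":
            seen.add(match.group(1))
    return sorted(seen)


def repair_mtime(path, now=None):
    """Reset a negative mtime to `now`; return True if the file was changed."""
    st = os.stat(path)
    if st.st_mtime >= 0:
        return False  # nothing to repair
    now = time.time() if now is None else now
    os.utime(path, (now, now))
    return True
```

Driving the repair from the logs only fixes files that have already failed at least once, so a full scan of the tarball directories for negative mtimes would be a natural follow-up.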

Root Cause

Negative mtimes triggered an nginx bug. Nginx serves the first request for a file in this state and delivers an etag derived from the negative mtime. However, if a later request carries that negative etag in its If-None-Match header, nginx attempts to respond with a 304 but never completes the request.
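To make the sequence concrete, here is a rough model of the two steps involved. It assumes nginx builds its etag from the file's mtime and size, both in hexadecimal and joined by a hyphen, and it approximates the handling of a negative mtime with Python's signed hex formatting. `is_not_modified` shows what a correct If-None-Match comparison would conclude; it does not reproduce the nginx bug. Both function names are hypothetical.

```python
def nginx_style_etag(mtime: int, size: int) -> str:
    """Approximate an nginx-style etag: quoted "<hex mtime>-<hex size>"."""
    return f'"{mtime:x}-{size:x}"'


def is_not_modified(if_none_match: str, current_etag: str) -> bool:
    """Return True if a 304 would be the right answer for this If-None-Match value.

    Uses weak comparison: a leading W/ on either side is ignored.
    """
    def strip_weak(tag: str) -> str:
        return tag[2:] if tag.startswith("W/") else tag

    candidates = [tag.strip() for tag in if_none_match.split(",")]
    if "*" in candidates:
        return True
    return strip_weak(current_etag) in {strip_weak(tag) for tag in candidates}


# A file with mtime -1 and size 16 bytes gets an etag with a leading minus sign.
etag = nginx_style_etag(-1, 16)
print(etag)                         # "-1-10"
print(is_not_modified(etag, etag))  # True: a 304 is the expected response
```

In this model the second request should simply receive a 304. The outage came from nginx failing to finish that response when the etag was negative.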