Four hours of partial outage, 2014-02-24

Severity: Minor
Category: Dependencies
Service: npm

This summary is created by Generative AI and may differ from the actual content.

Overview

From 2014-02-24 12:00 UTC (4:00 AM US/Pacific) until about 16:00 UTC (8:00 AM US/Pacific), package tarball downloads returned 503 response codes. The EU and the eastern US were affected most, but only because of the time of day at which the outage occurred; nothing about it was location-specific. The cause was a bug in HAProxy that leaks file descriptors, which left Joyent's Manta system unable to answer Fastly's requests in a timely fashion. Manta is lovely, and we will continue to use it for many things, but any single point of failure needs to be spread out.

Update: We have Nagios alerts set up on Fastly's data feed, so any increase in errors will alert whoever is on pager duty. Joyent has confirmed that they restarted the system that was leaking file descriptors, and they are building a more permanent fix.

February 24th, 2014 9:19am, @izs, post mortem

Impact

Package tarball downloads returned 503 response codes for about four hours. The EU and the eastern US were affected most, but only because of the time of day at which the outage occurred; nothing about it was location-specific.

Trigger

A bug in HAProxy that causes it to leak file descriptors.

Detection

We were monitoring the hosts directly, but not monitoring for errors from our Fastly logs. We are setting up monitoring on those logs today, so that a flurry of 5xx response codes will wake us up, even if the hosts appear to be functioning. Update: We have Nagios alerts set up on Fastly's data feed, so any increase in errors will indeed alert whoever is on pager duty.
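The kind of check described above can be sketched as follows. This is a minimal illustration, not the actual Nagios plugin: it assumes the status codes for a recent window have already been extracted from the Fastly logs, and the function names and the warning/critical thresholds are made up for the example. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL) follow the standard Nagios plugin convention.

```python
# Nagios plugin exit codes (standard convention).
OK, WARNING, CRITICAL = 0, 1, 2


def error_rate(status_codes):
    """Return the fraction of responses in the window with a 5xx status."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    errors = sum(1 for code in codes if 500 <= code <= 599)
    return errors / len(codes)


def check_5xx(status_codes, warn=0.01, crit=0.05):
    """Map a window of status codes to a Nagios-style exit code.

    The warn/crit thresholds are illustrative assumptions,
    not values taken from the incident.
    """
    rate = error_rate(status_codes)
    if rate >= crit:
        return CRITICAL
    if rate >= warn:
        return WARNING
    return OK
```

The point of checking the CDN's view rather than the hosts is that a window full of 503s yields CRITICAL even when every backend host still looks healthy from the inside.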

Resolution

We first increased the error timeouts in our Fastly configuration. This avoided the immediate problem, and stopped the outage. Second, we were already in the process of mirroring our data in Manta over to a separate system in a different datacenter to remove that single point of failure. Manta is lovely, and we will continue to use it for many things, but any single point of failure needs to be spread out. Joyent has confirmed that they restarted the system that was leaking file descriptors, and they're in the process of building a more permanent fix.
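To illustrate why a mirror in a second datacenter removes the single point of failure, here is a rough sketch of falling back from one origin to another. How the mirror is actually wired into Fastly is not described in this report, so everything here is an assumption: `fetch` is a hypothetical callable that returns the tarball body or raises on an error or timeout, and the origin names are placeholders.

```python
def fetch_with_fallback(path, origins, fetch, timeout=10.0):
    """Try each origin in order and return (origin, body) from the first success.

    `fetch(origin, path, timeout)` is a hypothetical callable supplied by
    the caller; it must raise an exception on an error or a timeout.
    """
    last_error = None
    for origin in origins:
        try:
            return origin, fetch(origin, path, timeout)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    # Every origin failed: surface the last underlying error as the cause.
    raise RuntimeError(f"all origins failed for {path}") from last_error
```

With two independent origins, a slow or unresponsive primary costs one timeout per request instead of an outage. The request fails only when every origin is down at once.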

Root Cause

A bug in HAProxy causes it to leak file descriptors, resulting in slow or non-responsive connections. As a result, Joyent's Manta system failed to respond to Fastly's requests in a timely fashion, and Fastly returned a 503 or 500 error to report that the backend was unavailable.
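A file descriptor leak of this kind tends to show up as a steadily shrinking gap between the number of open descriptors and the process limit. The sketch below shows one way a process could watch its own headroom. It is not Joyent's fix or monitoring: it assumes Linux (it reads `/proc/self/fd`), and the function names and the 80% threshold are invented for the example.

```python
import os
import resource


def open_fd_count():
    """Count this process's open file descriptors (Linux only)."""
    return len(os.listdir("/proc/self/fd"))


def fd_headroom_low(open_fds, soft_limit, threshold=0.8):
    """Return True when open descriptors reach `threshold` of the soft limit.

    The 0.8 default is an illustrative assumption.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return False  # no limit set, so the process cannot run out this way
    return open_fds >= threshold * soft_limit


def fd_limit():
    """Return the soft limit on open file descriptors for this process."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft_limit
```

A periodic call such as `fd_headroom_low(open_fd_count(), fd_limit())` could feed an alert well before connections start to stall.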