Incident with Pull Requests and Issues

Severity: Major
Category: Hardware
Service: GitHub

This summary is created by Generative AI and may differ from the actual content.

Overview

Between 14:22 UTC and 14:48 UTC on January 30, 2025 (a 26-minute window), web requests to GitHub.com failed at an elevated rate, peaking at a 44% error rate, while the average successful request took over 3 seconds to complete. The incident stemmed from a hardware failure in the caching layer that supports rate limiting, compounded by the absence of automated failover for that layer. Status updates during the incident covered identifying the caching infrastructure issue, monitoring recovery, initiating a failover, and confirming service restoration. Planned follow-up work involves moving to a high-availability cache configuration and improving resilience to cache failures.

Impact

Web requests to GitHub.com failed, with the error rate peaking at 44%. The average successful request took over 3 seconds to complete.

Trigger

Hardware failure in the caching layer that supports rate limiting.

Detection

Reports of degraded availability for Issues and Pull Requests.

Resolution

Manual failover of the primary to trusted hardware.

Root Cause

Two factors combined: a hardware failure in the caching layer that supports rate limiting, and the lack of automated failover for that caching layer, which meant recovery depended on a manual failover.
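
The summary does not describe how the rate limiter is implemented or what the planned resilience work will look like. As a purely illustrative sketch of "resilience to cache failures", the Python below shows one common pattern: a fixed-window rate limiter that falls back to in-process counters when the shared cache cannot be reached, rather than failing the request. Every name here (`ResilientRateLimiter`, `CacheUnavailable`, the `incr` callable) is hypothetical and not taken from the incident report.

```python
import time
from typing import Callable, Dict, Optional


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the backend is down."""


class ResilientRateLimiter:
    """Fixed-window limiter that degrades to local counters on cache failure."""

    def __init__(
        self,
        incr: Callable[[str], int],  # assumed: atomically increments a key, returns new count
        limit: int,
        window_s: int = 60,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._incr = incr
        self._limit = limit
        self._window_s = window_s
        self._clock = clock
        self._local: Dict[str, int] = {}
        self._local_window: Optional[int] = None
        self.degraded = False  # True while the shared cache is unreachable

    def allow(self, client_id: str) -> bool:
        window = int(self._clock() // self._window_s)
        key = f"{client_id}:{window}"
        try:
            count = self._incr(key)
            self.degraded = False
        except CacheUnavailable:
            # Fall back to per-process counting instead of failing the request.
            self.degraded = True
            if window != self._local_window:
                # Drop counters from earlier windows so the dict stays bounded.
                self._local = {}
                self._local_window = window
            count = self._local.get(key, 0) + 1
            self._local[key] = count
        return count <= self._limit


def broken_incr(key: str) -> int:
    """Simulates a cache whose primary is unreachable."""
    raise CacheUnavailable("primary unreachable")


# With the cache down, requests are still served and limited locally.
limiter = ResilientRateLimiter(broken_incr, limit=2, clock=lambda: 0.0)
results = [limiter.allow("client-a") for _ in range(3)]
```

In this sketch the fallback counters are per process, so the effective limit across a fleet is looser while the cache is down. That trades strict enforcement for availability, and the `degraded` flag gives monitoring something to alert on. It complements, and does not replace, automated failover of the cache itself.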