Incident with Pull Requests and Issues

Severity: Major
Category: Hardware
Service: GitHub
This summary is created by Generative AI and may differ from the actual content.
Overview
From 14:22 UTC to 14:48 UTC on January 30th, 2025 (26 minutes), web requests to GitHub.com failed at elevated rates. The error rate peaked at 44%, and the average successful request took over 3 seconds. The incident stemmed from a hardware failure in the caching layer that supports rate limiting, compounded by the absence of automated failover for that layer. During the incident, responders identified the caching infrastructure issue, initiated a failover, monitored recovery, and confirmed service restoration. Future plans involve transitioning to a high-availability cache configuration and improving resilience to cache failures.
Impact
Web requests to GitHub.com failed, with a peak error rate of 44%; the average successful request took over 3 seconds to complete.
Trigger
Hardware failure in the caching layer that supports rate limiting.
Detection
Reports of degraded availability for Issues and Pull Requests.
Resolution
Manual failover of the primary to trusted hardware.
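For illustration, an automated counterpart to a manual failover can be sketched as a health-check loop that promotes a standby after several consecutive failed checks. This is a minimal sketch under stated assumptions, not a description of GitHub's tooling: the class `FailoverMonitor`, the node names, the `check` callable, and the `threshold` parameter are all hypothetical.

```python
class FailoverMonitor:
    """Promotes the standby after `threshold` consecutive failed health checks.

    Hypothetical sketch: `check` is any callable that takes a node name and
    returns True when that node is healthy.
    """

    def __init__(self, primary, standby, check, threshold=3):
        self.primary = primary
        self.standby = standby
        self.check = check
        self.threshold = threshold
        self.failures = 0  # consecutive failed checks of the current primary

    def tick(self):
        """Run one health check and return the node that should serve traffic."""
        if self.check(self.primary):
            self.failures = 0
            return self.primary
        self.failures += 1
        # Promote only after repeated failures, and only to a healthy standby,
        # so a single dropped health check does not trigger a failover.
        if self.failures >= self.threshold and self.check(self.standby):
            self.primary, self.standby = self.standby, self.primary
            self.failures = 0
        return self.primary
```

Requiring several consecutive failures trades a slightly slower reaction for fewer spurious failovers; a production system would also need fencing of the old primary, which this sketch omits.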
Root Cause
A hardware failure in the caching layer that supports rate limiting, combined with the lack of automated failover for that layer.
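One common way to make a request path resilient to a failing rate-limit cache is to "fail open": if the cache cannot be reached, the request is allowed rather than rejected. The sketch below shows the idea with a fixed-window limiter. It is an assumption for illustration only; the source does not say how GitHub's rate limiting is implemented or which mitigation was chosen, and `InMemoryCache`, `CacheUnavailable`, and `FailOpenRateLimiter` are hypothetical names.

```python
import time


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the backend is unreachable."""


class InMemoryCache:
    """Stand-in for a remote cache; set `healthy = False` to simulate a failure."""

    def __init__(self):
        self.healthy = True
        self._counts = {}

    def incr(self, key):
        if not self.healthy:
            raise CacheUnavailable(key)
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]


class FailOpenRateLimiter:
    """Fixed-window limiter that allows requests when the cache is down."""

    def __init__(self, cache, limit, window_seconds=60, clock=time.time):
        self.cache = cache
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.cache_errors = 0  # exposed so monitoring can alert on cache failures

    def allow(self, client_id):
        # One counter per client per time window.
        window = int(self.clock() // self.window_seconds)
        key = f"{client_id}:{window}"
        try:
            count = self.cache.incr(key)
        except CacheUnavailable:
            # Fail open: a cache outage must not turn into a request outage.
            self.cache_errors += 1
            return True
        return count <= self.limit
```

Failing open temporarily gives up rate-limit enforcement in exchange for availability, so the error counter should feed an alert and the cache call should carry a short timeout in a real client.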