This summary is created by Generative AI and may differ from the actual content.
Overview
From 14:22 UTC to 14:48 UTC on January 30th, 2025, GitHub.com experienced web request failures, peaking at a 44% error rate with average successful requests taking over 3 seconds. The incident stemmed from a hardware failure in the caching layer supporting rate limiting, compounded by the absence of automated failover. Updates during the incident included identifying the caching infrastructure issue, monitoring recovery, initiating a failover, and confirming service restoration. Future plans involve transitioning to a high availability cache configuration and enhancing resilience to cache failures.Impact
web requests to GitHub.com experienced failures (at peak the error rate was 44%), with the average successful request taking over 3 seconds to completeTrigger
hardware failure in the caching layer that supports rate limitingDetection
reports of degraded availability for Issues and Pull RequestsResolution
manual failover of the primary to trusted hardwareRoot Cause
hardware failure in the caching layer that supports rate limiting and a lack of automated failover for the caching layer