Live updates on pages not loading reliably

Severity: Major
Category: Change Process
Service: GitHub
This summary is created by Generative AI and may differ from the actual content.
Overview
On December 17th, 2024, between 14:33 UTC and 14:50 UTC, many users experienced intermittent errors and timeouts when accessing github.com. The error rate was 8.5% on average and peaked at 44.3% of requests. The errors were caused by our web servers being overloaded as a result of planned maintenance that unintentionally caused our live updates service to fail to start. With the live updates service down, clients reconnected aggressively and overloaded our servers.
We only marked Issues as affected during this incident despite the broad impact. This oversight was due to a gap in our alerting while our web servers were overloaded. Because the engineering team was focused on restoring functionality, we did not identify the broad scope of the impact to customers until the incident had already been mitigated.
We mitigated the incident by rolling back the changes from the planned maintenance to the live updates service and scaling up the service to handle the influx of traffic from WebSocket clients. We are working to reduce the impact of the live updates service's availability on github.com to prevent issues like this one in the future. We are also working to improve our alerting to better detect the scope of impact from incidents like this.
Impact
The error rate was 8.5% on average and peaked at 44.3% of requests, causing a broad impact across our services, such as the inability to log in, view a repository, open a pull request, and comment on issues.
Trigger
The errors were caused by our web servers being overloaded as a result of planned maintenance that unintentionally caused our live updates service to fail to start.
Detection
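During the incident we marked only Issues as affected, despite the broader impact. This oversight was due to a gap in our alerting while the web servers were overloaded, and the full scope of the impact was not identified until after the incident had been mitigated.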
Resolution
We mitigated the incident by rolling back the changes from the planned maintenance to the live updates service and scaling up the service to handle the influx of traffic from WebSocket clients.
Root Cause
The root cause was the planned maintenance that unintentionally caused our live updates service to fail to start, leading to aggressive reconnections by clients and overloading our servers.
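The report does not describe how the WebSocket clients schedule their reconnects, so the following is only an illustrative sketch of one common way to soften this kind of reconnect storm: capped exponential backoff with full jitter. The function name `reconnect_delay` and the `base`, `cap`, and `rng` parameters are hypothetical and are not taken from GitHub's clients or services.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Return how many seconds to wait before reconnect attempt `attempt` (0-based).

    Hypothetical helper: capped exponential backoff with full jitter.
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(attempt, 32)
    # The ceiling doubles with each failed attempt, up to `cap` seconds.
    ceiling = min(cap, base * (2 ** exponent))
    # Full jitter: choose a delay uniformly below the ceiling so that clients
    # which were disconnected at the same moment do not all retry together.
    return rng() * ceiling


if __name__ == "__main__":
    # Print one sample delay for each of the first six attempts.
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt), 2))
```

Spreading retries over a window that grows with each failure limits how many clients hit the web servers at the same instant while a backing service is unavailable. The cap keeps the worst-case wait bounded once the service recovers.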