Incident with Actions

Severity: Major
Category: Bug
Service: GitHub
This summary is created by Generative AI and may differ from the actual content.
Overview
Between 9:49 and 17:00 UTC on January 23, 2025, the available capacity of large hosted runners was degraded, impacting job assignment times.
Impact
On average, 26% of jobs requiring large runners waited more than 5 minutes to have a runner assigned.
Trigger
The rollback of a configuration change left events in a mixed data shape, which triggered a latent bug in event processing.
Detection
Not explicitly stated; likely internal monitoring of runner assignment times.
Resolution
Mitigated by using a feature flag to bypass the problematic event processing logic.
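As a rough illustration of this kind of mitigation, the sketch below gates a processing step behind a feature flag so it can be bypassed without a deploy. The flag name `bypass_event_reprocessing`, the in-memory `FLAGS` dictionary, and `handle_events` are hypothetical; the report does not describe the actual flag or code path.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical in-memory flag store; a real system would read flags
# from a configuration or feature-flag service.
FLAGS: Dict[str, bool] = {"bypass_event_reprocessing": False}


def handle_events(
    events: Iterable[Any],
    process: Callable[[Any], Any],
    flags: Dict[str, bool] = FLAGS,
) -> List[Any]:
    """Apply `process` to each event unless the bypass flag is enabled."""
    if flags.get("bypass_event_reprocessing", False):
        # Flag on: skip the problematic logic entirely.
        return []
    return [process(event) for event in events]
```

With the flag off, every event goes through `process`; turning it on makes the handler a no-op, which trades the skipped work for stability while a proper fix is prepared.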
Root Cause
The rollback of a configuration change left events in a mixed data shape, which triggered a latent bug in event processing. The processing logic then reprocessed the same events unnecessarily, causing the background job that manages large runner creation and deletion to run out of resources.
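One common defense against this class of failure is to normalize every event to a single shape and skip events that have already been handled. The sketch below is a minimal illustration under assumed event formats: the flat and nested shapes, the `id` and `runner` fields, and the function names are all hypothetical and are not taken from the incident report.

```python
from typing import Any, Dict, Iterable, List, Set


def normalize(event: Dict[str, Any]) -> Dict[str, Any]:
    """Map either assumed event shape onto one canonical form."""
    if "payload" in event:
        # Hypothetical nested shape: {"id": ..., "payload": {"runner": ...}}
        return {"id": event["id"], "runner": event["payload"]["runner"]}
    # Hypothetical flat shape: {"id": ..., "runner": ...}
    return {"id": event["id"], "runner": event["runner"]}


def process_once(events: Iterable[Dict[str, Any]], seen: Set[Any]) -> List[Dict[str, Any]]:
    """Return normalized events whose IDs are not already in `seen`."""
    processed: List[Dict[str, Any]] = []
    for raw in events:
        event = normalize(raw)
        if event["id"] in seen:
            continue  # Already handled; avoid redundant work.
        seen.add(event["id"])
        processed.append(event)
    return processed
```

Because `seen` persists across calls, feeding the same batch twice does no additional work the second time, regardless of which shape each event arrives in.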