This summary is created by Generative AI and may differ from the actual content.
Overview
Between Dec 3 03:35 UTC and 04:35 UTC, availability of large hosted runners for Actions was degraded due to failures in background VM provisioning jobs. This was a shorter recurrence of the issue that occurred the previous day. Users saw workflows queued while waiting for a large runner. On average, 13.5% of all workflows requiring large runners during the incident were affected, peaking at 46% of requests. Standard and Mac runners were not affected.
Following the Dec 1 incident, we had disabled non-critical paths in the provisioning job and believed that would eliminate any impact while we investigated and addressed the timeouts. Unfortunately, the timeouts were a symptom of broader job health issues, so those changes did not prevent this second occurrence the following day. We now understand that other jobs on these agents had issues that caused them to hang and consume available job agent capacity. The reduced capacity led to saturation of the remaining agents and significant performance degradation in the running jobs.
In addition to the immediate improvements shared in the previous incident summary, we have initiated regular recycles of all agents in this area while we continue to address the issues in both the jobs themselves and the resiliency of the agents. We also continue to improve our detection to ensure we are automatically detecting these delays.
Impact
Availability of large hosted runners for Actions was degraded, affecting on average 13.5% of all workflows requiring large runners, peaking at 46% of requests. Standard and Mac runners were not affected.
Trigger
Failures in background VM provisioning jobs.
Detection
Users reported degraded performance for hosted runners, and the issue was investigated.
Resolution
Regular recycles of all agents in the affected area were initiated. Work continues on the resiliency of both the jobs and the agents, and on detection improvements so that these delays are detected automatically.
Root Cause
Timeouts were a symptom of broader job health issues: other jobs on the agents hung and consumed available job agent capacity, which saturated the remaining agents and degraded the performance of the running jobs.
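The capacity effect described above can be illustrated with a small sketch. This is not the actual job agent implementation. The `Job` type, the job names, the timeouts, and the slot counts are all hypothetical, chosen only to show how hung jobs shrink the healthy pool until the same demand saturates it, and how a simple deadline check could flag such jobs.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str          # hypothetical job name
    started_at: float  # start time in seconds on an illustrative clock
    timeout: float     # hypothetical per-job time budget in seconds


def find_hung_jobs(jobs, now):
    """Return the jobs that have been running longer than their time budget."""
    return [job for job in jobs if now - job.started_at > job.timeout]


def utilization(total_slots, hung_slots, demand):
    """Fraction of healthy slots needed to serve `demand` concurrent jobs.

    A value of 1.0 or more means the healthy slots are saturated.
    """
    healthy = total_slots - hung_slots
    if healthy <= 0:
        return float("inf")
    return demand / healthy


# Illustrative numbers only; none of these come from the incident.
running = [
    Job("provision-vm", started_at=0.0, timeout=300.0),
    Job("cleanup", started_at=0.0, timeout=60.0),
    Job("metrics-sync", started_at=500.0, timeout=120.0),
]
hung = find_hung_jobs(running, now=600.0)
print("hung jobs:", [job.name for job in hung])

# The same demand against a pool with no hung slots, then with half hung.
print("utilization, no hung slots:", utilization(10, 0, 6))
print("utilization, 5 hung slots:", utilization(10, 5, 6))
```

In this toy model, a job that never finishes holds its slot indefinitely, so regularly recycling the agents (or enforcing a per-job deadline) is one way to return that slot to the pool while the underlying job issues are fixed.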