Incident with Actions

Severity: Major

Category: Bug

Service: GitHub

This summary is created by Generative AI and may differ from the actual content.

Overview

On May 5 2026, from 13:22 UTC to 17:05 UTC, GitHub Actions hosted runners in the East US region experienced a degradation. Approximately 13.5 % of jobs that requested a standard runner failed, and about 16 % of larger runners with private networking either failed or were delayed by more than five minutes. Copilot Code Review requests were also affected, with roughly 8,500 review requests timing out. The incident was triggered by a routine scale‑up operation for runner VMs that hit an internal rate‑limit during image pulls, and the usual back‑off logic was not activated because of the response code returned. Mitigation involved throttling the load, allowing the system to recover, and pausing further scale‑up actions until the throttling behavior was improved.

Impact

The failure rate impacted a minority of jobs but caused visible errors for affected users: pull‑request comments displayed an error and required a manual retry. Approximately 8,500 Copilot review requests timed out. While most runner requests were automatically rerouted to other regions, a subset of East US‑bound requests remained degraded, leading to reduced productivity for developers relying on those runners.

Trigger

A scheduled scale‑up of hosted‑runner virtual machines in the East US region generated a burst of VM‑creation requests that exceeded an internal rate‑limit when pulling images from storage. Because the service returned a response code that the back‑off mechanism did not recognize, the expected throttling back‑off was not triggered, causing the cascade of failures.

Detection

Internal monitoring observed a sudden rise in queue times and failure rates for Actions jobs in the East US region. Alerts were generated around 13:37 UTC, prompting the on‑call team to begin investigation and post status updates.

Resolution

The team reduced the scale‑up load to stay below the rate‑limit, allowing queued work to be processed. By 15:34 UTC the impact dropped to less than 0.5 % of runner assignments, and full recovery was achieved by 17:05 UTC. In parallel, the team paused all further scale‑up operations, improved the throttling logic to react to rate‑limit responses, and began a review of end‑to‑end limits for similar workflows.

Root Cause

An internal rate‑limit on VM creation was exceeded during a routine scale‑up, and the back‑off logic failed to engage because the service returned an unexpected response code. This combination caused the VM‑creation failures that led to the degraded runner availability.

See the original post