Degraded Performance for GitHub Actions macOS Runners

Severity: Major

Category: Misconfiguration

Service: GitHub

This summary is created by Generative AI and may differ from the actual content.

Overview

GitHub Actions Mac hosted runner capacity was degraded between 07:00 UTC and 17:20 UTC on October 1, 2025, leading to timed-out jobs and long queue times. The error rate averaged 46% and peaked at 96% of requests to the service. XL and Intel runners recovered by 10:10 UTC; other runner types took longer to recover fully. The degradation was triggered by a scheduled event that caused a permission failure on Mac runner hosts, blocking reimage operations. Although the permission issue was resolved by 09:41 UTC, full recovery was delayed by backoff logic and the need for state resets on some hosts.

Impact

GitHub Actions Mac hosted runners experienced degraded performance for 10 hours and 20 minutes, from 07:00 UTC to 17:20 UTC. Customers saw timed-out jobs and long queue times. The error rate averaged 46% and peaked at 96% of requests. XL and Intel runners recovered by 10:10 UTC, while other runner types remained affected for longer.
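The durations in this report can be checked with a few lines of date arithmetic. The sketch below uses only timestamps stated in this report; the helper names (`utc`, `minutes_since_start`) and the milestone labels are illustrative choices, not part of any GitHub tooling.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM time on the incident date (October 1, 2025) as UTC."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2025, 10, 1, hour, minute, tzinfo=timezone.utc)


# Timestamps as stated in the report.
START = utc("07:00")
MILESTONES = {
    "investigation began (approx.)": utc("07:59"),
    "permission issue resolved": utc("09:41"),
    "XL and Intel runners recovered": utc("10:10"),
    "full recovery": utc("17:20"),
}


def minutes_since_start(ts: datetime) -> int:
    """Whole minutes elapsed between the incident start and ts."""
    return int((ts - START).total_seconds() // 60)


for label, ts in MILESTONES.items():
    minutes = minutes_since_start(ts)
    print(f"{label}: {minutes // 60}h {minutes % 60:02d}m after start")
```

The last line printed is `full recovery: 10h 20m after start`, which matches the stated duration. It also shows that roughly 7 hours 39 minutes of the incident fell after the permission issue had been resolved.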

Trigger

The incident was triggered by a scheduled event at 07:00 UTC. This event inadvertently led to a permission failure on Mac runner hosts, which subsequently blocked critical reimage operations necessary for maintaining runner capacity.

Detection

The issue was detected as customers began experiencing job start delays and failures when using GitHub Actions macOS runners. This prompted an investigation by the GitHub team, which began around 07:59 UTC.

Resolution

The permission issue that blocked reimage operations was resolved by 09:41 UTC. Full recovery of available runners nevertheless took longer than expected, because backoff logic slowed backend operations and some hosts required state resets. Immediately following the incident, changes were deployed to address the scheduled event that caused the permission failure, with the aim of preventing similar issues in the future. Work is also underway to reduce the end-to-end time for self-healing of offline hosts.

Root Cause

The root cause was a permission failure on Mac runner hosts, triggered by a scheduled event at 07:00 UTC. The failure prevented reimage operations, which degraded Mac hosted runner capacity. Recovery was prolonged by backoff logic that slowed backend operations and by the need for state resets on some affected hosts.
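To illustrate how backoff logic can delay recovery after the underlying fault is fixed, here is a minimal sketch of capped exponential backoff. The report does not describe GitHub's actual backoff policy, so the function name and every parameter (1-minute base delay, doubling factor, 120-minute cap) are hypothetical. The only value taken from the report is the 161 minutes between 07:00 UTC and 09:41 UTC.

```python
def first_successful_retry(fix_minute: float,
                           base: float = 1.0,
                           factor: float = 2.0,
                           cap: float = 120.0) -> float:
    """Return the minute at which the first retry at or after the fix fires.

    Hypothetical model: an operation starts failing at minute 0 and is
    retried with capped exponential backoff. Retries before fix_minute
    fail; the first retry at or after fix_minute succeeds.
    """
    elapsed, delay = 0.0, base
    while True:
        elapsed += delay               # wait out the current delay, then retry
        if elapsed >= fix_minute:      # fault is gone, so this retry succeeds
            return elapsed
        delay = min(delay * factor, cap)  # grow the delay, up to the cap


# 161 minutes separate the trigger (07:00 UTC) from the fix (09:41 UTC).
fix = 161
recovered = first_successful_retry(fix)
print(f"first successful retry at minute {recovered:.0f}, "
      f"{recovered - fix:.0f} minutes after the fix")
```

Under these assumed parameters, retries fire at minutes 1, 3, 7, 15, 31, 63, and 127, and the next one does not fire until minute 247, which is 86 minutes after the fix. A host that has backed off far enough stays idle well after the fault is cleared unless something resets its retry state.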
