Intermittent authentication failures on GitHub

Severity: Minor
Category: Scalability
Service: GitHub

This summary is created by Generative AI and may differ from the actual content.

Overview

Intermittent authentication failures on GitHub between 17:07 UTC and 19:06 UTC, affecting GitHub Actions, parts of Git operations, and other authentication-dependent requests, with an average Actions error rate of approximately 0.6% of affected API requests and a Git operations ssh read error rate of approximately 0.29%, while ssh write and http operations were not impacted.

Impact

A subset of requests failed due to token verification lookups intermittently failing, leading to 401 errors and degraded reliability for impacted workflows, with an estimated average error rate of 0.6% for Actions and 0.29% for Git operations ssh read, and no impact on ssh write and http operations.

Trigger

Elevated replication lag in the token verification database cluster due to increased write volume exceeding the cluster's available capacity.

Detection

The issue was identified through reports of degraded performance for Actions and Git Operations, and investigation revealed a low rate of authentication failures affecting GitHub App server to server tokens, GitHub Actions authentication tokens, and git operations.

Resolution

The incident was mitigated by adjusting the database replica topology to route reads away from lagging replicas and by adding/bringing additional replica capacity online, with service health improving progressively after the change, and GitHub Actions recovering by ~19:00 UTC, and the incident resolved at 19:06 UTC.

Root Cause

The root cause was the elevated replication lag in the token verification database cluster due to increased write volume exceeding the cluster's available capacity, resulting in intermittent authentication failures.