This summary is created by Generative AI and may differ from the actual content.
Overview
The incident affected GitHub search, causing degraded performance from 13:30 UTC to 17:14 UTC on August 12, 2025. Users experienced inaccurate or incomplete search results, failures to load pages such as Issues, Pull Requests, Projects, and Deployments, and broken components such as Actions workflow and label filters. The most severe impact occurred between 14:00 UTC and 15:30 UTC, with up to 75% of search queries failing and search result updates delayed by up to 100 minutes. Affected services included API Requests, Issues, Pull Requests, Actions, and Packages. The incident was triggered by intermittent connectivity issues between load balancers and search hosts, which led to retry queues overwhelming the load balancers.
Impact
GitHub search was in a degraded state for approximately 3 hours and 44 minutes. During the peak impact period (14:00 UTC to 15:30 UTC), up to 75% of search queries failed, and updates to search results were delayed by up to 100 minutes. Users experienced inaccurate or incomplete results, failures to load pages (Issues, Pull Requests, Projects, Deployments), and broken components (Actions workflow, label filters). Affected services included API Requests, Issues, Pull Requests, Actions, and Packages. Even after partial recovery, some users continued to experience increased request latency and stale search results.
Trigger
The incident was triggered by intermittent connectivity issues between GitHub's load balancers and search hosts. Retry logic initially masked these problems, but the retry queues eventually overwhelmed the load balancers, leading to widespread failures.
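The failure mode described above is a classic case of retry amplification: retries mask intermittent errors until the queued retries themselves become the dominant load. As a minimal, hypothetical sketch (not GitHub's actual code; querySearch is an assumed placeholder for a call through the load balancer to a search host), the standard defense is to cap the number of attempts and back off exponentially with jitter so retried requests spread out instead of piling up:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// querySearch is a hypothetical stand-in for a call that crosses the
// load balancer to a search host; it is not a real GitHub API.
func querySearch(ctx context.Context, q string) (string, error) {
	// Simulate intermittent connectivity failures.
	if rand.Intn(4) == 0 {
		return "results for " + q, nil
	}
	return "", errors.New("connection reset by load balancer")
}

// queryWithRetry retries a bounded number of times with exponential
// backoff and jitter, so transient failures are smoothed over without
// letting a retry queue grow without limit.
func queryWithRetry(ctx context.Context, q string, maxAttempts int) (string, error) {
	backoff := 100 * time.Millisecond
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		res, err := querySearch(ctx, q)
		if err == nil {
			return res, nil
		}
		lastErr = err
		// Full jitter: sleep a random duration up to the current backoff,
		// which prevents synchronized retry storms against the backend.
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return "", ctx.Err()
		}
		backoff *= 2
	}
	return "", fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	res, err := queryWithRetry(ctx, "repo:example is:issue", 4)
	fmt.Println(res, err)
}
```

With an unbounded retry loop and no backoff, every intermittent failure multiplies traffic toward an already struggling backend, which is consistent with the cascading failure described in this report.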
Detection
GitHub became aware of the incident through reports of degraded performance for API Requests, Actions, Issues, and Pull Requests, which prompted an investigation into why requests were failing to reach the search clusters.
Resolution
Query failures were mitigated at 15:30 UTC by throttling the search indexing pipeline to reduce load and stabilize retries. The underlying connectivity failures were fully resolved at 17:14 UTC, after an automated reboot of a search host led to the recovery of the rest of the system. Since the incident, internal monitors and playbooks have been improved and the search cluster load balancer has been tuned, with ongoing investment in resolving the underlying connectivity issues.
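Throttling an indexing pipeline typically means putting a rate limiter in front of the writer so the search hosts see a predictable, reduced write load. The sketch below is purely illustrative, assuming a hypothetical indexDocument placeholder and using the golang.org/x/time/rate token bucket; it is not the mechanism GitHub used:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// indexDocument is a hypothetical placeholder for pushing one update
// into the search cluster; it stands in for real indexing work.
func indexDocument(id int) {
	fmt.Printf("indexed document %d at %s\n", id, time.Now().Format(time.RFC3339))
}

func main() {
	// Allow at most 50 index operations per second with a small burst.
	// Lowering this limit is the throttling lever: it sheds indexing
	// load from the search hosts at the cost of delayed (stale) results.
	limiter := rate.NewLimiter(rate.Limit(50), 10)

	ctx := context.Background()
	for id := 0; id < 200; id++ {
		// Wait blocks until the token bucket permits another operation.
		if err := limiter.Wait(ctx); err != nil {
			fmt.Println("throttle interrupted:", err)
			return
		}
		indexDocument(id)
	}
}
```

The trade-off matches what this report describes: lowering the limit relieves pressure on the search hosts, but index updates queue up and search results go stale until the limit is raised again.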
Root Cause
The root cause was intermittent connectivity issues between the load balancers and search hosts. These issues were exacerbated by retry logic that initially masked the problem but eventually led to retry queues overwhelming the load balancers, causing cascading failures.
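A common complement to bounded retries for this kind of cascading failure is a circuit breaker between callers and a flaky backend: once consecutive failures cross a threshold, new requests fail fast instead of joining a retry queue. The following is a generic sketch of the pattern, not a description of GitHub's load balancer configuration:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// breaker is a minimal circuit breaker: after maxFailures consecutive
// errors it "opens" and rejects calls immediately until cooldown passes,
// which keeps failed requests from piling up in retry queues.
type breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	openedAt    time.Time
	cooldown    time.Duration
}

var errOpen = errors.New("circuit open: failing fast")

func (b *breaker) call(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return errOpen
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // (re)open the breaker
		}
		return err
	}
	b.failures = 0 // a success closes the breaker again
	return nil
}

func main() {
	b := &breaker{maxFailures: 3, cooldown: 30 * time.Second}
	// searchBackend is a hypothetical call to a search host that is
	// currently failing; it is not a real GitHub endpoint.
	searchBackend := func() error { return errors.New("connection timed out") }

	for i := 0; i < 5; i++ {
		fmt.Println(b.call(searchBackend))
	}
}
```

Failing fast keeps the retry queue from growing during an outage and gives the backend room to recover before traffic is reintroduced.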