Incident with Pull Requests

Severity: Major
Category: Change Process
Service: GitHub

This summary is created by Generative AI and may differ from the actual content.

Overview

A production database migration was initiated to drop a column that was no longer in direct use. The ORM, however, still referenced the dropped column in a subset of pull request queries, which caused elevated error rates across pushes, webhooks, notifications, and pull requests. Impact peaked at approximately 4% of all web and REST API traffic. A fix instructing the ORM to ignore the removed column was deployed by 16:13 UTC and resolved the primary incident in the largest production environment. Some custom and canary environments did not pick up this fix, which triggered a secondary incident affecting approximately 0.1% of pull request traffic; it was fully resolved by 19:45 UTC. The incident highlighted an application monitoring gap and the need to streamline changes across environments.

Impact

The primary incident caused elevated error rates across pushes, webhooks, notifications, and pull requests, peaking at approximately 4% of all web and REST API traffic. A secondary incident affected approximately 0.1% of pull request traffic.

Trigger

The incident was triggered by a production database migration initiated at 15:33 UTC on August 5, 2025, to drop a column from a table backing pull request functionality.

Detection

Elevated error rates were observed across pushes, webhooks, notifications, and pull requests, which prompted an investigation into degraded performance for Pull Requests. The incident also exposed an application monitoring gap: had the missing monitoring been in place, it would have stopped the rollout from continuing once impact was observed.
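As an illustration of the kind of check that could close such a monitoring gap, the following is a minimal sketch of an error-rate gate for a staged rollout. It is not GitHub's tooling: the `StageResult` type, the `run_rollout` function, the stage names, the traffic figures, and the 1% threshold are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    """Observed traffic for one rollout stage (hypothetical structure)."""
    name: str
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat a stage with no traffic as having a zero error rate.
        return self.errors / self.requests if self.requests else 0.0


def run_rollout(stages, max_error_rate):
    """Walk the stages in order and halt at the first one over the threshold.

    Returns (completed_stage_names, halted_at), where halted_at is None
    if every stage stayed at or below max_error_rate.
    """
    completed = []
    for stage in stages:
        if stage.error_rate > max_error_rate:
            return completed, stage.name
        completed.append(stage.name)
    return completed, None


if __name__ == "__main__":
    # Illustrative numbers only: a first stage erroring on 4% of requests.
    stages = [
        StageResult("canary", requests=10_000, errors=400),
        StageResult("production", requests=1_000_000, errors=0),
    ]
    completed, halted_at = run_rollout(stages, max_error_rate=0.01)
    print("completed:", completed, "halted at:", halted_at)
```

With a gate of this shape, a change that raises the error rate in an early stage never reaches the later ones, because the loop returns before they are processed.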

Resolution

For the primary incident, a change was deployed that instructed the ORM to ignore the removed column, resolving most affected services by 16:13 UTC. For the secondary incident, the fix was eventually applied to the custom and canary environments, fully resolving the issue by 19:45 UTC.
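The failure mode and the fix can be reproduced in miniature. The sketch below is an assumption-laden illustration, not the actual ORM or schema involved: it uses a toy `Model` class with a cached column list, an in-memory SQLite table named `pull_requests`, and a made-up column `legacy_flag`. It shows a query failing after the column is dropped, then succeeding once the model is told to ignore that column.

```python
import sqlite3


class Model:
    """Toy ORM model that caches its column list at boot (hypothetical)."""

    table = "pull_requests"
    # Column list as cached before the migration ran.
    cached_columns = ["id", "title", "legacy_flag"]
    # Columns the model should treat as if they did not exist.
    ignored_columns = set()

    @classmethod
    def select_sql(cls):
        cols = [c for c in cls.cached_columns if c not in cls.ignored_columns]
        return f"SELECT {', '.join(cols)} FROM {cls.table}"

    @classmethod
    def all(cls, conn):
        return conn.execute(cls.select_sql()).fetchall()


def drop_column_migration(conn):
    """Rebuild the table without legacy_flag (works on any SQLite version)."""
    conn.executescript(
        """
        CREATE TABLE pull_requests_new (id INTEGER PRIMARY KEY, title TEXT);
        INSERT INTO pull_requests_new (id, title)
            SELECT id, title FROM pull_requests;
        DROP TABLE pull_requests;
        ALTER TABLE pull_requests_new RENAME TO pull_requests;
        """
    )


def demo():
    Model.ignored_columns = set()  # start from the pre-fix state
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pull_requests "
        "(id INTEGER PRIMARY KEY, title TEXT, legacy_flag INTEGER)"
    )
    conn.execute(
        "INSERT INTO pull_requests (title, legacy_flag) VALUES ('Fix typo', 0)"
    )
    drop_column_migration(conn)

    # Before the fix: the cached column list still names the dropped column.
    try:
        Model.all(conn)
        failed_before_fix = False
    except sqlite3.OperationalError:
        failed_before_fix = True

    # The fix: tell the model to ignore the removed column.
    Model.ignored_columns = {"legacy_flag"}
    rows_after_fix = Model.all(conn)
    conn.close()
    return failed_before_fix, rows_after_fix


if __name__ == "__main__":
    print(demo())
```

A common way to avoid the failure altogether is to reverse the order: ship the ignore-column change to every environment first, and drop the column only once no running code can reference it.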

Root Cause

The primary root cause was a latent issue in which the ORM continued to reference a dropped column in a subset of pull request queries after the database migration. The impact was prolonged because the initial fix reached only the largest production environment: a later update to some custom and canary environments did not pick up the fix, which triggered the secondary incident. In addition, an application monitoring gap was identified; had the missing monitoring been in place, it would have stopped the rollout from continuing once impact was observed.
