Incident with pull requests

Severity: Major
Category: Bug
Service: GitHub
This summary is created by Generative AI and may differ from the actual content.
Overview
On August 5, 2025, a production database migration to drop a column led to an incident when the ORM continued to reference the removed column in a subset of pull request queries. This caused elevated error rates across pushes, webhooks, notifications, and pull requests, peaking at approximately 4% of all web and REST API traffic. The primary issue was mitigated by deploying a change instructing the ORM to ignore the dropped column, restoring most services by 16:13 UTC. A secondary incident occurred because the fix was not picked up by some custom and canary environments, affecting ~0.1% of pull request traffic, which was fully resolved by 19:45 UTC. The incident highlighted an application monitoring gap during migrations and the need to streamline changes across environments.
Impact
The incident caused elevated error rates across pushes, webhooks, notifications, and pull requests, with impact peaking at approximately 4% of all web and REST API traffic. Affected services included Git Operations, Webhooks, Issues, Pull Requests, and Actions. A secondary incident later affected ~0.1% of pull request traffic.
Trigger
The incident was triggered by a production database migration initiated at 15:33 UTC on August 5, 2025, which involved dropping a column from a table backing pull request functionality.
Detection
The incident was first noticed through reports of degraded performance for Issues and Webhooks, followed by observations of degraded performance and availability across the other affected services: Git Operations, Pull Requests, and Actions.
Resolution
The primary issue was mitigated by deploying a change that instructed the ORM to ignore the removed column, leading to recovery of most affected services by 16:13 UTC. The secondary incident, caused by the fix not propagating to some custom and canary environments, was fully resolved by 19:45 UTC after the necessary updates were applied.
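The mitigation described above can be illustrated with a minimal sketch. The summary does not say which ORM was involved or how the change was written, so everything below is an assumption made for illustration: the toy PullRequestModel class, the legacy_flag column name, and the ignored_columns attribute are all hypothetical. Some real ORMs expose a comparable setting (Rails Active Record has ignored_columns, for example), but this sketch only mimics the idea using Python's standard sqlite3 module.

```python
import sqlite3


class PullRequestModel:
    """Toy model that builds an explicit SELECT column list (hypothetical)."""

    table = "pull_requests"
    # Columns the application code still believes exist.
    columns = ["id", "title", "legacy_flag"]  # legacy_flag is a made-up name
    # Columns the model should stop referencing in generated SQL.
    ignored_columns = set()

    @classmethod
    def select_sql(cls):
        cols = [c for c in cls.columns if c not in cls.ignored_columns]
        return f"SELECT {', '.join(cols)} FROM {cls.table}"


def demo():
    conn = sqlite3.connect(":memory:")
    # Post-migration schema: legacy_flag has already been dropped.
    conn.execute("CREATE TABLE pull_requests (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO pull_requests (title) VALUES ('Fix typo')")

    # Before the fix: the generated query still names the dropped column.
    try:
        conn.execute(PullRequestModel.select_sql()).fetchall()
        failed_before_fix = False
    except sqlite3.OperationalError:
        failed_before_fix = True

    # The fix: tell the model to ignore the dropped column.
    PullRequestModel.ignored_columns = {"legacy_flag"}
    rows = conn.execute(PullRequestModel.select_sql()).fetchall()
    conn.close()
    return failed_before_fix, rows


if __name__ == "__main__":
    print(demo())
```

In this sketch the first query fails because it names a column that no longer exists, and the same query succeeds once the column is listed as ignored. A common ordering for this kind of change is to ship the ignore setting to every environment first and drop the column only afterwards, although the summary does not state what sequence was planned here.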
Root Cause
The root cause was a latent issue in which the ORM continued to reference a dropped database column in a subset of pull request queries after a production database migration. The incident also exposed an application monitoring gap: no check was in place to halt the rollout once impact was observed, so the migration continued. A contributing factor to the secondary incident was that an update to some custom and canary environments did not pick up the fix, highlighting the need to streamline some types of changes across environments.
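The monitoring gap mentioned above can be sketched as a staged rollout that checks an error-rate signal after each stage and stops when it crosses a threshold. This is an illustration under stated assumptions, not a description of GitHub's tooling: the stage names, the 1% threshold, and the run_gated_rollout, apply_stage, and measure_error_rate names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class StageResult:
    stage: str
    error_rate: float
    proceeded: bool  # False means the rollout was halted at this stage


def run_gated_rollout(
    stages: Sequence[str],
    apply_stage: Callable[[str], None],
    measure_error_rate: Callable[[str], float],
    max_error_rate: float = 0.01,  # hypothetical threshold (1%)
) -> List[StageResult]:
    """Apply each stage in order, halting once the error rate is too high."""
    results: List[StageResult] = []
    for stage in stages:
        apply_stage(stage)
        rate = measure_error_rate(stage)
        healthy = rate <= max_error_rate
        results.append(StageResult(stage, rate, healthy))
        if not healthy:
            break  # stop here; later stages are never applied
    return results


if __name__ == "__main__":
    # Simulated error rates per stage; the values are invented for the demo.
    simulated = {"canary": 0.001, "10%": 0.04, "100%": 0.04}
    applied: List[str] = []
    for result in run_gated_rollout(
        ["canary", "10%", "100%"], applied.append, simulated.__getitem__
    ):
        print(result)
    print("stages applied:", applied)
```

With the simulated rates, the rollout applies the canary and 10% stages and then halts, so the 100% stage is never reached. In practice the signal would come from application-level metrics rather than a lookup table, and a halt would typically trigger a rollback or page an operator.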