Event API Delays

Severity: Major
Category: Change Process
Service: PagerDuty
This summary is created by Generative AI and may differ from the actual content.
Overview
On August 2nd, 2022, PagerDuty experienced an incident in the US service region that delayed event processing and prevented certain actions on Incidents/Alerts. The issue began at 20:18 UTC, was mitigated by 20:31 UTC, and was fully resolved by 20:49 UTC. The incident was caused by a major version upgrade of the MySQL database, which led to high lock contention on a critical table. That contention caused HTTP request timeouts and halted event processing.
Impact
The incident caused delays in event processing and prevented users from performing actions on Incidents/Alerts, leading to delayed notifications for customers.
Trigger
The trigger was a major version upgrade of the MySQL database used for the Incident/Alert lifecycle, completed at approximately 19:30 UTC on August 2nd.
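The timestamps in this report imply a gap between the upgrade finishing and the first customer impact. As a minimal sketch, the intervals can be worked out as follows. The variable names are illustrative, and the 19:30 UTC upgrade time is approximate, as stated above.

```python
from datetime import datetime, timezone


def utc(hour, minute):
    # All timestamps fall on August 2nd, 2022, per the report.
    return datetime(2022, 8, 2, hour, minute, tzinfo=timezone.utc)


upgrade_done = utc(19, 30)  # approximate completion of the MySQL upgrade
impact_start = utc(20, 18)  # issue began
mitigated = utc(20, 31)     # issue mitigated
resolved = utc(20, 49)      # delayed events fully processed


def minutes_between(start, end):
    # Whole minutes elapsed between two timestamps.
    return int((end - start).total_seconds() // 60)


timeline = {
    "upgrade_to_impact": minutes_between(upgrade_done, impact_start),
    "impact_to_mitigation": minutes_between(impact_start, mitigated),
    "impact_to_resolution": minutes_between(impact_start, resolved),
}
print(timeline)
```

By these figures, impact began roughly 48 minutes after the upgrade completed, mitigation took 13 minutes, and full resolution took 31 minutes from the start of impact.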
Detection
The issue was detected through an increase in HTTP request timeouts, which paged internal teams to investigate.
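Detection of this kind is often built on a sliding-window timeout rate. The sketch below is a hypothetical illustration of that idea, not PagerDuty's actual monitoring. The class name, window size, and threshold are all assumptions made for the example.

```python
from collections import deque


class TimeoutRateMonitor:
    """Hypothetical monitor: page when too many recent requests time out."""

    def __init__(self, window_size=100, threshold=0.05):
        # Keep only the most recent `window_size` request outcomes.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, timed_out):
        # Store True for a timed-out request, False for a successful one.
        self.outcomes.append(bool(timed_out))

    def timeout_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_page(self):
        # Require a full window so a few early samples cannot trigger a page.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.timeout_rate() > self.threshold


monitor = TimeoutRateMonitor(window_size=10, threshold=0.2)
for _ in range(10):
    monitor.record(False)   # healthy traffic fills the window
healthy = monitor.should_page()
for _ in range(3):
    monitor.record(True)    # three timeouts push the rate to 0.3
degraded = monitor.should_page()
print(healthy, degraded)
```

Requiring a full window before paging trades a slightly slower first alert for fewer false pages right after a restart.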
Resolution
The incident was mitigated when the database hit a limit that aborted the hung requests. After that, database requests completed successfully and metrics returned to normal levels. By 20:49 UTC, all delayed events had been processed and systems were operating normally.
Root Cause
The root cause was high lock contention on a database table essential to the Incident/Alert lifecycle. This behavior had not been seen in prior versions or during testing.
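To illustrate why lock contention can turn into request timeouts, here is a toy model in which every request serializes on a single lock. It is a simplified sketch under stated assumptions, not a description of MySQL's locking or of PagerDuty's schema. The function name and all parameters are hypothetical.

```python
def simulate(num_requests, arrival_interval_ms, lock_hold_ms, timeout_ms):
    """Toy model: return (completed, timed_out) for requests sharing one lock."""
    lock_free_at = 0  # time (ms) at which the lock next becomes available
    completed = 0
    timed_out = 0
    for i in range(num_requests):
        arrival = i * arrival_interval_ms
        wait = max(0, lock_free_at - arrival)
        if wait > timeout_ms:
            # The request gives up before it ever acquires the lock.
            timed_out += 1
            continue
        # The request acquires the lock and holds it for lock_hold_ms.
        lock_free_at = arrival + wait + lock_hold_ms
        completed += 1
    return completed, timed_out


# Lock held for less time than the gap between arrivals: no queue forms.
fast = simulate(100, arrival_interval_ms=10, lock_hold_ms=5, timeout_ms=50)
# Lock held for longer than the gap between arrivals: the queue grows
# until waiting requests start exceeding the timeout.
slow = simulate(100, arrival_interval_ms=10, lock_hold_ms=20, timeout_ms=50)
print(fast, slow)
```

In this model, once the lock is held longer than the gap between arriving requests, waits accumulate and a growing share of requests exceeds the timeout, even though the request rate itself never changed.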