Delayed Notifications

Severity: Major
Category: Change Process
Service: PagerDuty
This summary is created by Generative AI and may differ from the actual content.
Overview
On January 6th, 2023, between 21:20 UTC and 23:44 UTC, PagerDuty experienced a global operational incident affecting the notification system. Notifications via SMS, phone, push, or email were delayed or not delivered in the US and EU Service Regions. A backlog of events was processed, leading to unexpected escalations and repeated notifications. The issue was resolved by taking corrective action against the affected service, and the team is working on mitigations to improve resilience.
Impact
In the US Service Region, notifications that exceeded the 2-hour maximum delay window were not delivered. In the EU Service Region, notifications were delayed by up to 10 minutes, with some delayed by up to 1 hour and 40 minutes. There was no impact on viewing or updating incidents in the Web UI, Mobile UI, or REST API.
Trigger
While changes were being made to increase operational resilience, the data streaming platform entered a failure mode that stopped events from flowing to downstream microservices, which halted notifications.
Detection
The issue was detected through monitoring of notification delays and escalations in the system.
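As a rough illustration of this kind of monitoring (not PagerDuty's actual tooling), the Python sketch below flags a stalled event stream and measures how late a notification was delivered. The function names and the five-minute silence threshold are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold, chosen only for illustration.
MAX_SILENCE = timedelta(minutes=5)


def stream_is_stalled(last_event_at, now, max_silence=MAX_SILENCE):
    """Return True when no event has arrived for longer than max_silence."""
    return now - last_event_at > max_silence


def delivery_delay(triggered_at, delivered_at):
    """Return how long a notification took to reach the responder."""
    return delivered_at - triggered_at


if __name__ == "__main__":
    # Illustrative timestamps, not measured data.
    last_event = datetime(2023, 1, 6, 21, 20, tzinfo=timezone.utc)
    now = last_event + timedelta(minutes=10)
    if stream_is_stalled(last_event, now):
        print("ALERT: no events received for", now - last_event)
```

A check like this alerts on the absence of events, which matters here because a stopped stream produces no errors of its own.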
Resolution
Corrective action was taken against the affected service, allowing the queue to process events correctly. A fix was deployed, and monitoring confirmed recovery in the EU region.
Root Cause
The root cause was a failure in the data streaming platform used to publish content from MySQL clusters to Kafka, which stopped events from reaching downstream services responsible for notifications.