Delayed Notifications

Severity: Major
Category: Dependencies
Service: PagerDuty
This summary is created by Generative AI and may differ from the actual content.
Overview
On February 13th and 14th, 2023, PagerDuty experienced incidents that caused notification delays and errors in the US and EU regions. Delays reached up to 6 minutes on the 13th and up to 9 minutes on the 14th, and the EU web application was unavailable for 31 minutes on the 13th. The cause was a breaking change in an external library used by a shared component, which surfaced when downstream services were redeployed. The issue was detected through error spikes and resolved by reverting the change. No events were lost, but notifications were delayed for several accounts.
Impact
Notification delivery and subscriber updates in the US and EU regions were delayed by up to 6 minutes on February 13th and up to 9 minutes on February 14th. Approximately 1% of requests returned 500 errors, and the PagerDuty web application was unavailable in the EU region for 31 minutes on February 13th. Several accounts were affected by delayed notifications.
Trigger
The trigger was the redeployment of downstream services, which caused the request router service to crash due to a breaking change in an external library used by a shared component.
Detection
The issue was detected through spikes in errors in the EU and US service regions, which led to multiple major incident calls and investigations by the teams.
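As an illustration of spike-based detection, the sketch below tracks the share of 5xx responses over a sliding window of recent requests and flags when it crosses a threshold. It is a minimal sketch under stated assumptions, not PagerDuty's actual monitoring: the class name `ErrorRateMonitor`, the window size, and the 0.5% threshold are all hypothetical.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical sliding-window monitor for server-error spikes."""

    def __init__(self, window_size=1000, threshold=0.005):
        # Keep only the most recent `window_size` observations.
        self.window = deque(maxlen=window_size)
        # Fraction of 5xx responses above which we call it a spike (assumed value).
        self.threshold = threshold

    def record(self, status_code):
        # Store True for server errors (5xx), False otherwise.
        self.window.append(status_code >= 500)

    def error_rate(self):
        # Fraction of recorded requests that were server errors.
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)

    def is_spiking(self):
        return self.error_rate() > self.threshold
```

With a 100-request window, a single 500 response among 99 successes yields an error rate of 1%, which is comparable to the rate reported under Impact and would exceed the assumed 0.5% threshold.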
Resolution
The change to the base request router service image was reverted and the affected services were redeployed. This restored the ability of the request router services to reload after downstream service redeployments.
Root Cause
The root cause was a breaking change in the runtime of an external software library, combined with the service's version pinning. Validation did not catch the problem because the bug was latent and manifested only under production conditions.
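The summary does not say how the version pinning interacted with the breaking change. Purely as an illustration of one common safeguard, the sketch below scans a dependency manifest and reports entries that are not pinned to an exact version, since a floating constraint can pull in a new release when an image is rebuilt. The pip-style `name==version` format, the function name `unpinned_requirements`, and the package names in the usage note are assumptions, not details from the incident.

```python
import re

# An exact pin looks like "name==1.2.3"; wildcards such as "==1.*" do not count.
EXACT_PIN = re.compile(r"^[A-Za-z0-9_.\-]+==[^=\s,*]+$")


def unpinned_requirements(lines):
    """Return manifest entries that are not pinned to one exact version."""
    loose = []
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank lines and comment-only lines
        if not EXACT_PIN.match(line):
            loose.append(line)
    return loose
```

For a hypothetical manifest containing `libfoo==1.2.3`, `libbar>=2.0`, and `libbaz`, the function returns the last two entries. A check like this would not by itself expose a latent runtime bug, but it narrows the set of dependencies that can change without an explicit update.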