Delayed Notifications

Severity: Major
Category: Dependencies
Service: PagerDuty

This summary is created by Generative AI and may differ from the actual content.

Overview

On February 13th and 14th, 2023, PagerDuty experienced incidents that delayed notifications and caused errors in the US and EU regions. Notifications were delayed by up to 6 minutes on the 13th and up to 9 minutes on the 14th, and the EU web application was unavailable for 31 minutes on the 13th. The cause was a breaking change in an external library used by a shared component, which surfaced when downstream services were redeployed. The issue was detected through error spikes and resolved by reverting the change. No events were lost, but notifications were delayed for several accounts.

Impact

Notification delivery and subscriber updates were delayed by up to 6 minutes on February 13th and up to 9 minutes on February 14th in the US and EU regions. Approximately 1% of requests returned 500 errors, and the PagerDuty web application was unavailable in the EU region for 31 minutes on February 13th. Several accounts were affected by delayed notifications.

Trigger

The incidents were triggered by the redeployment of downstream services. A breaking change in an external library used by a shared component caused the request router service to crash when those services were redeployed.

Detection

The issue was detected through error spikes in the EU and US service regions, which prompted multiple major incident calls and investigations by the teams.
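
As an illustration of what detection through error spikes can look like, here is a minimal sketch of a sliding-window error-rate check. It is not PagerDuty's monitoring code: the class name `ErrorRateMonitor`, the window size, and the threshold are all hypothetical values chosen for the example.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the share of HTTP 5xx responses over the most recent requests.

    Hypothetical sketch: names and default values are assumptions,
    not taken from the incident report.
    """

    def __init__(self, window_size=1000, threshold=0.005):
        # Fixed-size window: the oldest sample drops out as a new one arrives.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status_code):
        """Record one response; a 5xx status counts as an error."""
        self.window.append(500 <= status_code <= 599)

    def error_rate(self):
        """Return the fraction of recorded responses that were errors."""
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)

    def is_spiking(self):
        # Require a full window so a few early errors do not raise an alert.
        window_full = len(self.window) == self.window.maxlen
        return window_full and self.error_rate() > self.threshold
```

With these assumed settings, an error rate near the roughly 1% described under Impact would exceed the 0.5% threshold once the window is full.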

Resolution

The teams reverted the change in the base request router service image and redeployed the affected services. This restored the request router services' ability to reload after downstream service redeployments.

Root Cause

The root cause was a breaking change in an external software library's runtime, combined with the service's version pinning. Validation did not catch the problem because the bug was latent and manifested only under production conditions.