Brief bursts of API Instability

Severity: Major
Category: Dependencies
Service: PagerDuty

This summary was created by generative AI and may differ from the actual content.

Overview

On February 13 and 14, 2023, PagerDuty experienced incidents that delayed notifications and caused errors in the Web UI, Mobile UI, and REST API in the US and EU regions. The cause was a latent bug triggered by an upgrade of a shared component; fewer than 1% of requests were affected. The web application was unavailable in the EU region for 31 minutes on February 14. No events were lost or dropped during the incidents.

Impact

Notifications and subscriber updates were delayed by up to 9 minutes, and fewer than 1% of requests returned HTTP 500 errors. The web application was unavailable in the EU region for 31 minutes. The numbers of delayed notifications and affected accounts were recorded.

Trigger

The upgrade of a shared component triggered a latent bug. Specifically, the upgrade introduced a breaking change in an external software library used by the component.

Detection

Teams were paged for spikes in errors in the EU and US regions. They opened major incident calls and began investigating.
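Paging on an error spike can be illustrated with a sliding-window error-rate check. The sketch below is hypothetical and is not PagerDuty's alerting logic: the class name `ErrorRateMonitor`, the window size, and the 0.5% threshold are all assumptions chosen only to show how a failure rate below 1% can still cross an alert threshold.

```python
from collections import deque


class ErrorRateMonitor:
    """Track recent request outcomes and flag error-rate spikes.

    Hypothetical sketch: the window size and threshold are illustrative
    assumptions, not values from the incident report.
    """

    def __init__(self, window_size=1000, threshold=0.005):
        # Each entry is True when the request failed with a 5xx status.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status_code):
        # Count any 5xx response as a server-side error.
        self.outcomes.append(status_code >= 500)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_page(self):
        # Page only once the window is full, so a few early errors
        # do not trigger an alert on too little data.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.error_rate() > self.threshold
```

With these assumed settings, 8 failures in a window of 1000 requests (0.8%) would exceed the 0.5% threshold and trigger a page, even though more than 99% of requests succeed.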

Resolution

The change that caused the issue was reverted, and systems were monitored to confirm stability. The base request router service image was rolled back, and affected services were redeployed.

Root Cause

The root cause was a breaking change in an external software library used by a shared component. Validation did not detect it because the bug was latent and materialized only under production conditions.