Brief bursts of API Instability

Severity: Major
Category: Dependencies
Service: PagerDuty
This summary is created by Generative AI and may differ from the actual content.
Overview
On February 13th and 14th, 2023, PagerDuty experienced incidents causing delays in notifications and errors in the Web UI, Mobile UI, and REST API in the US and EU regions. The issues were due to a latent bug triggered by a shared component upgrade, affecting less than 1% of requests. The web application was unavailable in the EU region for 31 minutes on February 14th. No events were lost or dropped during these times.
Impact
Notifications and subscriber updates were delayed by up to 9 minutes, and less than 1% of requests returned 500 errors. The web application was unavailable in the EU region for 31 minutes on February 14th. The specific numbers of delayed notifications and affected accounts were recorded.
Trigger
An upgrade of a shared component triggered a latent bug. The upgrade included a breaking change in an external software library used by the component.
Detection
Teams were paged for spikes in errors in the EU and US regions, leading to major incident calls and investigations.
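As a rough illustration of how paging on an error spike can work, the sketch below tracks the share of 5xx responses over a sliding window of recent requests and signals when it crosses a threshold. This is not PagerDuty's actual alerting logic: the class name `ErrorRateMonitor`, the window size, and the threshold are all hypothetical values chosen for the example.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical sliding-window monitor for the share of 5xx responses."""

    def __init__(self, window_size=1000, threshold=0.01):
        # window_size and threshold are illustrative assumptions,
        # not values taken from the incident report.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold
        self.errors = 0

    def record(self, status_code):
        """Record one response; the oldest entry is evicted when the window is full."""
        is_error = status_code >= 500
        if len(self.window) == self.window.maxlen:
            # Subtract the entry that the append below will push out.
            self.errors -= self.window[0]
        self.window.append(is_error)
        self.errors += is_error

    def error_rate(self):
        """Fraction of responses in the current window that were 5xx."""
        return self.errors / len(self.window) if self.window else 0.0

    def should_page(self):
        """Page only once the window is full, to avoid noise on a cold start."""
        is_full = len(self.window) == self.window.maxlen
        return is_full and self.error_rate() > self.threshold
```

A caller would invoke `record(status_code)` for each response and check `should_page()` periodically. Because the failures in this incident stayed under 1% of requests, any threshold in a real system would need to sit well below that level to catch a spike of this size.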
Resolution
The change that caused the issue was reverted, the base request router service image was rolled back, and affected services were redeployed. Systems were then monitored to confirm stability.
Root Cause
The root cause was a breaking change in an external software library used by a shared component, which was not detected during validation due to its latent nature and only materialized under production conditions.