Delayed incident creation

Severity: Major
Category: Change Process
Service: PagerDuty

This summary is created by Generative AI and may differ from the actual content.

Overview

On July 19th, between 4:36 PM UTC and 5:50 PM UTC, PagerDuty experienced delays processing API events in both the US & EU regions, with events from the Microsoft Azure Alerts Integration delayed for the entire duration of the incident. The incident was caused by an Azure configuration change that triggered failsafes on our side. Those failsafes, in turn, slowed event processing for inbound, API-borne events. In response, our on-call responders reverted the change made to the Azure integration, which resulted in a full recovery.

Impact

Processing of API events was delayed in both the US & EU regions, with events from the Microsoft Azure Alerts Integration affected for the entire duration of the incident.

Trigger

An Azure configuration change activated a latent issue in the service, resulting in failures when processing Azure events.

Detection

Errors were detected by monitoring tools, which alerted the on-call responders.

Resolution

On-call responders reverted the changes to the service and redeployed the Azure configurations, resulting in full service restoration. The earlier failed Azure events were also successfully reprocessed.

Root Cause

Configuration changes made to the Azure integration activated a latent issue in the service, causing failures and triggering failsafes that slowed event processing for all inbound, API-borne events.
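
The report does not say how these failsafes are implemented. The sketch below (Python) is a minimal, hypothetical illustration, not PagerDuty's actual mechanism, of how a failsafe such as a circuit breaker with a retry pause, sitting in a shared ingestion path, can delay every inbound event once one integration's events begin to fail. All names (CircuitBreaker, process_azure_event, process_generic_event) are invented for this example.

# Illustrative sketch only -- not PagerDuty's actual implementation.
# A failsafe in a shared event pipeline pauses after repeated failures;
# because the pause happens in the shared path, events from healthy
# integrations queue up behind the failing ones.
import time

class CircuitBreaker:
    """Trips after `threshold` consecutive failures; while tripped,
    the next call waits `cooldown` seconds before retrying."""
    def __init__(self, threshold=1, cooldown=0.5):
        self.threshold = threshold  # low threshold for a short demo
        self.cooldown = cooldown
        self.failures = 0

    def call(self, fn, event):
        if self.failures >= self.threshold:
            # Failsafe engaged: pause before retrying. Every event behind
            # this one in the shared pipeline is delayed by the pause.
            time.sleep(self.cooldown)
            self.failures = 0  # allow another attempt after the cooldown
        try:
            return fn(event)
        except Exception:
            self.failures += 1
            raise

def process_azure_event(event):
    # Hypothetical handler that fails under the bad configuration change.
    raise ValueError(f"cannot process {event!r} with new Azure config")

def process_generic_event(event):
    return f"processed {event}"

breaker = CircuitBreaker()
queue = ["azure-1", "generic-1", "azure-2", "generic-2"]

start = time.monotonic()
for event in queue:
    handler = process_azure_event if event.startswith("azure") else process_generic_event
    try:
        breaker.call(handler, event)
    except Exception:
        pass  # failed events would be retried or reprocessed later
print(f"drained queue in {time.monotonic() - start:.2f}s "
      "(generic events waited behind the failing Azure ones)")

Under these assumptions, reverting the Azure configuration stops the failures, the failsafe stays closed, and processing latency for all inbound events returns to normal, which matches the recovery described above.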