Delayed incident creation
Overview
On July 19th, between 4:36 PM UTC and 5:50 PM UTC, PagerDuty experienced delays processing API events in both the US and EU regions, with events from the Microsoft Azure Alerts Integration delayed for the entire duration of the incident. The incident was caused by an Azure configuration change that triggered failsafes on our side. Those failsafes, in turn, slowed event processing for all inbound, API-borne events. In response, our on-call responders reverted the change made to the Azure integration, which resulted in a full recovery.
Impact
API event processing was delayed in both the US and EU regions, with events from the Microsoft Azure Alerts Integration delayed for the entire duration of the incident.
Trigger
An Azure configuration change that activated a latent issue in the service, resulting in failures when processing Azure events.
Detection
Errors were detected by monitoring tools, which alerted the on-call responders.
Resolution
On-call responders reverted the changes to the service and redeployed the Azure configurations, resulting in full service restoration. Azure events that had failed earlier were also successfully reprocessed.
Root Cause
Configuration changes made to the Azure integration activated a latent issue in the service, causing failures and triggering failsafes that slowed event processing for all inbound, API-borne events.
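The postmortem does not describe the failsafe mechanism in detail. As a rough illustration only, the sketch below assumes the failsafe behaves like a circuit breaker with retry backoff on a shared ingestion worker; the names (CircuitBreaker, process_event, handle) and all parameters are hypothetical and not part of PagerDuty's implementation. It shows how retries against one failing integration can hold up a shared pipeline and delay all inbound API-borne events.

# Hypothetical sketch only; names and thresholds are illustrative, not PagerDuty's.
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; probes again after `cooldown` seconds."""
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a probe once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

def handle(event):
    # Stand-in for downstream processing; in this scenario the Azure
    # configuration change makes these calls fail.
    return event.get("source") != "azure-misconfigured"

def process_event(event, breaker, retries=3, backoff=0.1):
    # Shared worker path: time spent retrying a failing integration delays
    # every event queued behind it, not just the Azure events.
    for attempt in range(retries):
        if not breaker.allow():
            return "deferred"  # failsafe open: requeue the event for later
        if handle(event):
            breaker.record(success=True)
            return "processed"
        breaker.record(success=False)
        time.sleep(backoff * (attempt + 1))  # backoff blocks this worker
    return "failed"

breaker = CircuitBreaker(threshold=3, cooldown=5.0)
print(process_event({"source": "azure-misconfigured"}, breaker))  # retries + backoff -> "failed"
print(process_event({"source": "healthy-integration"}, breaker))  # "deferred": the open breaker now delays healthy events too

In this toy model, the failing Azure events open the breaker, after which even healthy events are deferred until the cooldown expires, which mirrors how a failsafe intended to contain one integration's failures can slow the whole inbound pipeline.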
