This summary is created by Generative AI and may differ from the actual content.
Overview
On September 1, 2025, between 14:40 and 15:25 UTC, PagerDuty customers in the EU service region experienced degraded performance with Incident Workflows. The cause was an extremely large volume of workflow actions (approximately 30 times normal), which overwhelmed the service and exhausted memory on our workflow processing systems. For affected customers, Incident Workflows failed to load in the UI and some workflows failed to execute. Other workflows completed successfully but were delayed. Automated alerting tools detected the issue at 14:40 UTC. At 15:15 UTC, responders applied targeted mitigations, leading to rapid recovery and full service resumption by 15:25 UTC. A post-incident review identified a performance optimization that had not been fully applied in the EU region as a contributing factor. Improvements are being implemented, including fully applying the optimization, instituting deployment safeguards, and reviewing resource allocation.
Impact
A small number of PagerDuty customers in the EU service region experienced degraded performance with Incident Workflows. This included workflows failing to load in the UI and some workflows failing to execute entirely. Other workflow invocations completed successfully but were delayed. No other service regions were affected.
Trigger
Incident Workflow actions were invoked in the EU region at an extremely large volume, approximately 30 times normal.
Detection
Automated alerting tools identified the incident at 14:40 UTC.
Resolution
Responders identified the large volume of incident updates as the cause of resource exhaustion and applied targeted mitigations at 15:15 UTC, leading to rapid system recovery and full service resumption by 15:25 UTC.
Root Cause
The primary cause was memory exhaustion on our workflow processing systems, triggered by an extremely large volume of Incident Workflow actions (approximately 30 times normal) in the EU region. A contributing factor was a recent performance optimization that had not been fully applied in the EU region, leading to increased resource usage under high load.