Overview
PagerDuty experienced two major service disruptions in its US Service Regions on August 28, 2025. The first incident, starting at 3:53 UTC and resolving by 10:10 UTC, caused widespread delays and failures in incident creation, notifications, webhooks, incident workflows, API access, and status page updates. A second, more limited incident occurred at 16:38 UTC, affecting incident creation and delaying webhooks and incident workflows, and was resolved by 20:24 UTC. Both incidents stemmed from memory constraints in the Kafka message queuing system, exacerbated by increased producer connections from a new API key tracking feature, leading to cascading failures across the platform.
Impact
The first incident caused significant degradation across critical platform functions, including incident creation, notifications, webhooks, API access, and web application functionality. Some incoming events were rejected with 500-class errors, and customers experienced duplicate notifications during the recovery phase. Communication was also interrupted, causing delays in status page updates. The second incident had a more limited impact, primarily affecting incident creation and delaying webhooks and incident workflows; its scope and duration were reduced by prompt reapplication of the earlier mitigation steps.
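For event senders, 500-class rejections of this kind are typically transient. The sketch below is one way a client could retry such failures; it assumes the public Events API v2 endpoint and an illustrative retry policy, and is not part of PagerDuty's remediation. Reusing the same dedup_key across retries lets the Events API collapse an event that was actually accepted on an earlier attempt.

    # Illustrative client-side retry sketch for transient 500-class rejections.
    # Retry counts and backoff values are assumptions, not recommendations.
    import time
    import requests

    EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

    def send_event(routing_key: str, summary: str, dedup_key: str, retries: int = 5) -> dict:
        payload = {
            "routing_key": routing_key,
            "event_action": "trigger",
            "dedup_key": dedup_key,          # stable key so retries deduplicate
            "payload": {
                "summary": summary,
                "source": "example-service",  # hypothetical source name
                "severity": "error",
            },
        }
        delay = 1.0
        for _ in range(retries):
            resp = requests.post(EVENTS_URL, json=payload, timeout=10)
            if resp.status_code < 500:
                resp.raise_for_status()       # surface 4xx errors immediately
                return resp.json()
            time.sleep(delay)                 # transient 5xx: back off and retry
            delay *= 2
        raise RuntimeError("event not accepted after retries")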
Trigger
The incidents were triggered by memory constraints within the Kafka message queuing infrastructure, exacerbated by a significant increase in producer connections originating from a newly deployed API key tracking feature. This combination led to cascading failures.
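To make that failure mode concrete, the following is a minimal sketch of the general anti-pattern: creating a new Kafka producer per tracked request, so that broker-side connections, buffers, and metadata caches grow with traffic. The library (confluent-kafka), topic, and function names are illustrative assumptions, not PagerDuty's actual implementation.

    # Illustrative anti-pattern: per-request producer creation.
    import json
    from confluent_kafka import Producer

    def record_api_key_usage(api_key_id: str, endpoint: str) -> None:
        # A brand-new producer (with its own broker connections and buffers)
        # is created for every tracked request, so broker memory pressure
        # scales with request volume.
        producer = Producer({"bootstrap.servers": "kafka:9092"})
        producer.produce(
            "api-key-usage",                       # hypothetical topic name
            key=api_key_id,
            value=json.dumps({"endpoint": endpoint}),
        )
        producer.flush()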
Detection
The incident was detected internally, as indicated by the initial 'detected' status update at 4:03 UTC (1:03 PM GMT+9) on August 28, stating 'We are investigating a potential issue within PagerDuty.' This suggests internal monitoring systems identified the problem.
Resolution
For the first incident, a fix was deployed and systems were gradually restored by 10:10 UTC while event backlogs were processed. Because of the communication issues, manual backup procedures were used to update the status page. The second incident was quickly stabilized by reapplying the previous mitigation steps, leading to full restoration by 20:24 UTC. Long-term remediation involves increasing Kafka broker memory allocation and refactoring producer connection handling to reduce the number of connections and prevent recurrence. Improvements to internal communication processes for status page updates are also being implemented.
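As a rough illustration of the producer-side part of that remediation, the sketch below shares one long-lived producer per process instead of creating one per request, keeping the broker's connection count bounded. The library, configuration values, and names are assumptions; the broker memory increase itself would be a separate infrastructure change (for example, a larger heap on each broker).

    # Illustrative refactor: a single shared producer reused across requests.
    import json
    from confluent_kafka import Producer

    # One long-lived producer per process, created once at startup.
    _producer = Producer({
        "bootstrap.servers": "kafka:9092",
        "linger.ms": 50,   # batch small records rather than sending one by one
    })

    def record_api_key_usage(api_key_id: str, endpoint: str) -> None:
        _producer.produce(
            "api-key-usage",                       # hypothetical topic name
            key=api_key_id,
            value=json.dumps({"endpoint": endpoint}),
        )
        _producer.poll(0)  # serve delivery callbacks without blocking the request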
Root Cause
The root cause was identified as memory constraints in the Kafka message queuing infrastructure, primarily caused by an increase in producer connections from a newly deployed API key tracking feature. This led to cascading failures that impacted incident processing, notifications, webhooks, and API functionality.