This summary is created by Generative AI and may differ from the actual content.
Overview
PagerDuty experienced two major service disruptions in its US Service Regions on August 28, 2025. The first incident, from 3:53 UTC to 10:10 UTC, caused widespread issues including delays and failures in incident creation, notifications, webhooks, incident workflows, API access, and status pages. The second, more limited incident, from 16:38 UTC to 20:24 UTC, caused issues with incident creation and delays in webhooks and incident workflows. Both incidents stemmed from memory constraints in the Kafka message queuing infrastructure, exacerbated by increased producer connections from a newly deployed API key tracking feature, leading to cascading failures. During the first incident, communication processes were also interrupted, delaying public status page updates even though updates had been drafted internally. Mitigation steps applied after the first incident helped stabilize the system quickly during the recurrence, reducing the impact of the second incident.
Impact
The first incident caused delays and failures in incident creation, notifications, webhooks, incident workflows, API access, and status pages. Some incoming events were rejected with 500-class errors from the API. Other platform functions, including outbound notifications, webhooks, chat integrations (such as Slack and Microsoft Teams), and the REST API, were also degraded. During recovery, affected customers may have received duplicate notifications and alerts for several hours while the system processed a backlog of stale messages. An interruption in the communication process delayed status page updates: internally drafted updates did not appear on the public status page. The second incident was more limited, causing issues with incident creation and delays in webhooks and incident workflows, with significantly reduced scope and duration compared to the initial event.
Trigger
Both incidents were triggered primarily by memory constraints in the Kafka message queuing infrastructure, exacerbated by significantly increased producer connections from a newly deployed API key tracking feature that had been highly requested by customers. This combination led to cascading failures.
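The report does not include PagerDuty's code, but the failure mode it describes is a common one. As a hedged illustration only, the hypothetical snippet below shows how creating a Kafka producer per request (for example, per API-key-usage event) multiplies broker connections, since each producer opens its own connection set that the brokers must track; the topic name, function, and client library (confluent-kafka) are assumptions, not details from the report.

```python
# Hypothetical sketch: a per-request producer pattern that inflates broker
# connection counts. Not PagerDuty's actual code; names are illustrative.
from confluent_kafka import Producer

def record_api_key_usage(api_key_id: str, bootstrap_servers: str) -> None:
    # Each call builds a brand-new producer, which opens fresh TCP connections
    # to the brokers and forces them to track yet another short-lived client.
    producer = Producer({"bootstrap.servers": bootstrap_servers})
    producer.produce("api-key-usage", key=api_key_id, value=b"used")
    producer.flush()  # wait for delivery; the connections are then discarded
```

Under high request volume, a pattern like this leaves brokers tracking many short-lived producer connections, which is consistent with the memory pressure described above.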
Detection
PagerDuty became aware of a potential issue through internal monitoring; the incident timeline begins with the issue being detected and investigated internally before customer impact was confirmed. The underlying failure was in PagerDuty's Kafka message queuing system, which triggered a cascading issue.
Resolution
For the first incident, a fix was deployed and systems stabilized, leading to restoration of services. For the second incident, the system was quickly stabilized by reapplying the mitigation steps taken after the first incident. To prevent recurrence, Kafka broker memory allocation was increased, and producer connection handling was refactored to optimize connections. During the communication interruption, additional engineering teams were engaged, and backup procedures were initiated to manually update the status page.
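The refactoring detail is not spelled out in the report; one plausible shape for "producer connection handling was refactored to optimize connections" is to share a single long-lived producer per process, as in the hypothetical sketch below (function names, topic, and library are assumptions, not PagerDuty's implementation).

```python
# Hypothetical sketch: reuse one producer per process so brokers see a small,
# stable set of connections. Not PagerDuty's actual implementation.
from typing import Optional
from confluent_kafka import Producer

_shared_producer: Optional[Producer] = None

def get_producer(bootstrap_servers: str) -> Producer:
    # Lazily create a single producer and reuse it for every event.
    global _shared_producer
    if _shared_producer is None:
        _shared_producer = Producer({"bootstrap.servers": bootstrap_servers})
    return _shared_producer

def record_api_key_usage(api_key_id: str, bootstrap_servers: str) -> None:
    producer = get_producer(bootstrap_servers)
    producer.produce("api-key-usage", key=api_key_id, value=b"used")
    producer.poll(0)  # serve delivery callbacks without blocking
```

A shared producer batches messages over a handful of persistent broker connections, which directly reduces the per-connection state the Kafka brokers must hold.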
Root Cause
The root cause of both incidents was memory constraints in the Kafka message queuing infrastructure. This was directly caused by increased producer connections from a newly deployed API key tracking feature, which overwhelmed the Kafka cluster and led to cascading failures across various services.