Errors when accessing PagerDuty web app

Severity: Major
Category: Scalability
Service: PagerDuty
Overview
On September 2, 2025, from 15:38 UTC to 17:24 UTC, PagerDuty experienced an incident that disrupted web application access for some customers in the US service region. The incident was caused by an unexpected increase in load on the API gateway, which is built in Scala and runs on the JVM, leading to capacity exhaustion. As a result, up to 14% of requests returned 5xx server errors when loading web application features. Event ingestion and incident notifications were not affected. During the incident, the gateway also showed a substantial increase in JVM CPU and heap usage, along with significant garbage collection activity. Recent changes that increased the keepalive duration of websockets are being investigated as a potential contributing factor and have been temporarily rolled back.
Impact
The incident disrupted web application access for some PagerDuty customers in the US service region. At its peak, up to 14% of requests resulted in 5xx server errors (specifically 500 and 504) when loading web application features. Customers experienced errors loading the PagerDuty web app, errors viewing certain data within the web app, and an inability to query, create, or update incidents via the web app. Event ingestion and incident notifications continued to operate normally and were not affected.
Trigger
The incident was triggered by an unexpected increase in load on the API gateway, which periodically exhausted the gateway's capacity and prevented it from accepting additional requests, disrupting web app functionality. The source of the additional load is still under investigation; several recent changes, including one that increased the keepalive duration of websockets, are being considered as potential contributing factors.
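The report names the longer websocket keepalive only as a possible contributing factor. The back-of-envelope sketch below, using assumed traffic numbers rather than PagerDuty's, illustrates why a longer keepalive window raises steady-state load on a gateway: by Little's Law, the number of connections held open at once scales with how long each idle connection is kept alive.

```scala
// Hypothetical sketch (not PagerDuty's code): estimates how many websocket
// connections a gateway holds open concurrently for a given keepalive window,
// using Little's Law (concurrency ≈ arrival rate × hold time).
object KeepaliveLoadEstimate {
  /** @param connectionsPerSecond new websocket connections arriving per second (assumed value)
    * @param keepaliveSeconds     how long an idle connection is kept open before being closed
    * @return expected number of connections held open at steady state
    */
  def steadyStateConnections(connectionsPerSecond: Double, keepaliveSeconds: Double): Double =
    connectionsPerSecond * keepaliveSeconds

  def main(args: Array[String]): Unit = {
    val arrivalRate = 200.0 // hypothetical: 200 new websocket connections per second
    // Doubling the keepalive window roughly doubles the sockets, file descriptors,
    // and heap-resident connection state the gateway must hold at once.
    Seq(30.0, 60.0, 120.0).foreach { keepalive =>
      println(f"keepalive=${keepalive}%.0fs -> ~${steadyStateConnections(arrivalRate, keepalive)}%.0f open connections")
    }
  }
}
```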
Detection
The incident was detected by a sudden spike in 5xx server errors (specifically 500 and 504) returned by PagerDuty's frontend load balancers. This immediately alerted the team, who triggered the Major Incident process to investigate.
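As an illustration of the detection signal, here is a minimal sketch, with an assumed alert threshold and made-up request counts, of the kind of 5xx error-rate check a load balancer alert might perform; it is not PagerDuty's actual monitoring configuration.

```scala
// Illustrative sketch only (assumed threshold and counts): computes the share of
// load-balancer responses that were 5xx over a window and flags when it crosses
// an alert threshold, mirroring the signal that detected this incident
// (a sudden spike in 500/504 responses).
object ErrorRateAlert {
  final case class Window(total: Long, serverErrors: Long) {
    def errorRate: Double = if (total == 0) 0.0 else serverErrors.toDouble / total
  }

  /** Returns true when the 5xx rate in the window exceeds the threshold. */
  def shouldAlert(window: Window, threshold: Double = 0.05): Boolean =
    window.errorRate > threshold

  def main(args: Array[String]): Unit = {
    // Hypothetical one-minute window: 100,000 requests, 14,000 of them 5xx
    // (~14%, matching the peak error rate reported for this incident).
    val peak = Window(total = 100000, serverErrors = 14000)
    println(s"error rate=${peak.errorRate}, alert=${shouldAlert(peak)}")
  }
}
```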
Resolution
The incident was resolved by scaling up the API gateway and restarting instances at 17:19 UTC to increase system capacity and mitigate customer impact. The gateway service was stabilized by 17:24 UTC, and the team continued monitoring for an additional hour to confirm there was no further impact. Additionally, several changes that could have contributed, including the recent increase to the keepalive duration of websockets, were temporarily rolled back.
Root Cause
The root cause was an unexpected increase in load on the API gateway, which led to periodic capacity exhaustion. This was exacerbated by a substantial increase in JVM CPU and heap usage, along with significant garbage collection activity, indicating resource contention. Contributing factors included a lack of predefined thresholds or metrics to help operators make clear, timely scaling decisions for the API gateway. There were also observability gaps in network usage monitoring, specifically insufficient networking metrics from Kubernetes pods, hosts, or kernel namespaces, which prevented effective alerts for issues such as port exhaustion. The specific trigger is still under investigation, with the recent increase to the keepalive duration of websockets being a potential factor.
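To make the JVM symptoms concrete, the following minimal sketch samples heap usage and cumulative garbage-collection counters via the standard java.lang.management MXBeans. The metric choices are assumptions, not PagerDuty's instrumentation, but signals like these could feed the predefined scaling thresholds the report says were missing.

```scala
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

// Minimal sketch (assumed metrics, not PagerDuty's telemetry pipeline): samples the
// JVM heap and garbage-collection counters that the report calls out as having
// spiked, using the standard MXBeans available in any JVM-based service.
object JvmPressureSnapshot {
  def main(args: Array[String]): Unit = {
    val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
    // getMax can be -1 when no maximum is configured; assumed configured here.
    val usedPct = heap.getUsed.toDouble / heap.getMax * 100
    println(f"heap used: ${heap.getUsed / (1024 * 1024)}%d MiB of ${heap.getMax / (1024 * 1024)}%d MiB (${usedPct}%.1f%%)")

    // Cumulative GC activity; rapidly growing collection time is the kind of
    // signal that could back a predefined scaling threshold for the gateway.
    ManagementFactory.getGarbageCollectorMXBeans.asScala.foreach { gc =>
      println(s"${gc.getName}: ${gc.getCollectionCount} collections, ${gc.getCollectionTime} ms total")
    }
  }
}
```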