Intermittent REST API Errors and Response Delays
Overview
On February 17, 2026, between 17:45 UTC and 22:32 UTC, a small subset of PagerDuty customers in the US service region experienced intermittent errors and increased latency when using the PagerDuty web application. The PagerDuty public API, event ingestion pipeline, and incident notification delivery remained unaffected throughout the incident.
Impact
On average, fewer than 1% of web application requests returned errors during the incident window, with brief spikes up to approximately 3% during traffic surges.
Trigger
A sudden surge in request volume to the permissions service, roughly a 12x increase over normal traffic levels, overwhelmed the service's connection pool to its caching layer.
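The report does not name the caching technology or the pool settings involved, so the numbers and names in the sketch below are hypothetical. It models the failure mode described here: a fixed-size connection pool comfortably absorbs normal load, but a roughly 12x surge in concurrent permissions lookups exhausts it, and checkouts that time out surface as HTTP 500 errors.

```python
import threading
import time

# Hypothetical values for illustration; the report gives only the 12x figure.
POOL_SIZE = 20           # fixed-size connection pool to the caching layer
CHECKOUT_TIMEOUT = 0.05  # seconds a request waits for a free connection
CACHE_CALL_LATENCY = 0.02  # simulated round trip to the cache

pool = threading.Semaphore(POOL_SIZE)
results = {"ok": 0, "http_500": 0}
lock = threading.Lock()

def handle_request():
    # Each permissions lookup must check out a cache connection.
    if not pool.acquire(timeout=CHECKOUT_TIMEOUT):
        # Pool exhausted: the checkout times out and surfaces as a 500.
        with lock:
            results["http_500"] += 1
        return
    try:
        time.sleep(CACHE_CALL_LATENCY)  # hold the connection for one cache call
        with lock:
            results["ok"] += 1
    finally:
        pool.release()

def drive(concurrent_requests):
    threads = [threading.Thread(target=handle_request)
               for _ in range(concurrent_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(results)

print(drive(25))   # near-normal load: the pool drains fast enough, no errors
results.update(ok=0, http_500=0)
print(drive(300))  # 12x surge: most checkouts time out and fail with 500s
```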
Detection
Our monitoring detected an elevated rate of HTTP 500 errors from an internal permissions service that handles user-to-service access resolution.
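As an illustration of this kind of detection, the sketch below implements a sliding-window error-rate monitor. The window length and the 1% alert threshold are assumptions for the example, not PagerDuty's actual alerting configuration.

```python
import time
from collections import deque

# Illustrative thresholds; the real alerting rules are not in the report.
WINDOW_SECONDS = 60
ERROR_RATE_THRESHOLD = 0.01  # alert when >1% of requests return HTTP 5xx

class ErrorRateMonitor:
    """Tracks request outcomes in a sliding window and flags elevated 500 rates."""

    def __init__(self):
        self.events = deque()  # (timestamp, is_error) pairs

    def record(self, status_code, now=None):
        now = time.time() if now is None else now
        self.events.append((now, status_code >= 500))
        self._evict(now)

    def _evict(self, now):
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0][0] > WINDOW_SECONDS:
            self.events.popleft()

    def should_alert(self, now=None):
        now = time.time() if now is None else now
        self._evict(now)
        if not self.events:
            return False
        errors = sum(1 for _, is_err in self.events if is_err)
        return errors / len(self.events) > ERROR_RATE_THRESHOLD

monitor = ErrorRateMonitor()
for code in [200] * 95 + [500] * 5:
    monitor.record(code)
print(monitor.should_alert())  # True: a 5% error rate exceeds the 1% threshold
```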
Resolution
We corrected the cache connection pool configuration on the permissions service, significantly increasing its capacity to handle concurrent requests. We also deployed a caching improvement to the alert-processing service, reducing the volume of redundant requests it generated against the permissions service.
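The report does not describe the caching improvement in detail. One common pattern consistent with it is a short-TTL local cache in front of the remote lookup, so repeated permissions checks for the same user/service pair are served from memory instead of generating another request. The sketch below shows that pattern; the names (`resolve_permissions`, `call_permissions_service`) and the TTL are hypothetical.

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds=30):
    """Memoize results briefly so repeated lookups within the TTL are served
    locally instead of generating another call to the permissions service.
    Unsynchronized single-process sketch; a production cache would need
    locking and an eviction bound."""
    def decorator(fn):
        store = {}  # args -> (expiry_time, value)

        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # cache hit: no remote call
            value = fn(*args)
            store[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator

def call_permissions_service(user_id, service_id):
    # Stand-in for the real network call; names and payload are hypothetical.
    return {"user": user_id, "service": service_id, "allowed": True}

@ttl_cache(ttl_seconds=30)
def resolve_permissions(user_id, service_id):
    return call_permissions_service(user_id, service_id)

print(resolve_permissions("U123", "S456"))  # first call reaches the service
print(resolve_permissions("U123", "S456"))  # repeat within TTL served locally
```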
Root Cause
The root cause was that the permissions service's connection pool to its caching layer was configured too small to absorb the sudden surge in request volume. Once the pool was exhausted, cache checkouts timed out intermittently, producing the HTTP 500 errors and elevated latency seen in the web application.
