Increased Error Rates on the REST API

Severity: Minor
Category: Dependencies
Service: PagerDuty
This summary is created by Generative AI and may differ from the actual content.
Overview
On April 1st, between 20:46 and 21:28 UTC, the PagerDuty REST API in the US service region experienced increased 5xx error rates and response times due to a "noisy neighbor" condition affecting compute resources. 91% of requests were successful, and other services were unaffected. The issue was resolved by rolling back to the legacy container runtime.
Impact
Increased 5xx error rates and response times for the REST API in the US service region, with 91% of requests remaining successful. Other services and regions were unaffected.
Trigger
A "noisy neighbor" condition that prevented REST API instances from obtaining necessary compute resources, leading to service degradation.
Detection
The team was paged at 20:48 UTC, two minutes after the incident began, when 12% of requests were returning 5xx errors.
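As an illustration of this kind of detection, the following minimal sketch pages when the share of 5xx responses in a sliding window reaches a threshold. The class name, the window size, and the 10% threshold are hypothetical; the summary does not state how PagerDuty's alerting is actually configured.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the share of 5xx responses over the most recent `window` requests.

    The window size and threshold are illustrative assumptions, not values
    taken from the incident report.
    """

    def __init__(self, window: int = 100, threshold: float = 0.10):
        self.threshold = threshold
        # A bounded deque drops the oldest status code once the window is full.
        self.statuses = deque(maxlen=window)

    def record(self, status_code: int) -> None:
        self.statuses.append(status_code)

    def error_rate(self) -> float:
        if not self.statuses:
            return 0.0
        errors = sum(1 for s in self.statuses if 500 <= s <= 599)
        return errors / len(self.statuses)

    def should_page(self) -> bool:
        return self.error_rate() >= self.threshold


monitor = ErrorRateMonitor(window=100, threshold=0.10)
for _ in range(88):
    monitor.record(200)
for _ in range(12):
    monitor.record(503)

if monitor.should_page():
    print(f"page on-call: 5xx rate is {monitor.error_rate():.0%}")
```

A sliding window keeps the check responsive to sudden degradation while ignoring isolated failures.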
Resolution
The team scaled up the REST API by 50% and rolled back to the legacy container runtime. This removed the unhealthy instances by 21:21 UTC, and error rates returned to normal by 21:22 UTC.
Root Cause
The "noisy neighbor" condition starved REST API instances of compute resources, and inadequate health checks failed to detect that those instances were unhealthy.
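The summary does not say what the existing health checks measured. As a hedged sketch of one way a check can catch an instance that is running but starved of resources, the example below combines a liveness signal with recent error rate and latency. `InstanceStats`, `is_healthy`, and both thresholds are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class InstanceStats:
    """Recent measurements for one instance (hypothetical structure)."""
    process_alive: bool
    recent_requests: int
    recent_5xx: int
    p95_latency_ms: float


def is_healthy(
    stats: InstanceStats,
    max_error_rate: float = 0.05,        # assumed threshold
    max_p95_latency_ms: float = 1000.0,  # assumed threshold
) -> bool:
    """Return False for dead instances and for live but degraded ones."""
    if not stats.process_alive:
        return False
    if stats.recent_requests == 0:
        # No recent traffic to judge by, so fall back to liveness alone.
        return True
    error_rate = stats.recent_5xx / stats.recent_requests
    return (
        error_rate <= max_error_rate
        and stats.p95_latency_ms <= max_p95_latency_ms
    )


# A starved instance: the process is up, but requests are slow and failing.
starved = InstanceStats(
    process_alive=True, recent_requests=200, recent_5xx=40, p95_latency_ms=4500.0
)
print(is_healthy(starved))
```

A liveness-only check would pass the starved instance above, whereas a check that also considers served traffic would mark it for removal.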