Increased Error Rates on the REST API

Severity: Minor
Category: Dependencies
Service: PagerDuty

This summary is created by Generative AI and may differ from the actual content.

Overview

On April 1st, between 20:46 and 21:28 UTC, the PagerDuty REST API in the US service region experienced increased 5xx error rates and response times due to a "noisy neighbor" condition affecting compute resources. 91% of requests were successful, and other services were unaffected. The issue was resolved by rolling back to the legacy container runtime.

Impact

Increased 5xx error rates and response times for the REST API in the US service region, with 91% of requests remaining successful. Other services and regions were unaffected.

Trigger

A "noisy neighbor" condition that prevented REST API instances from obtaining necessary compute resources, leading to service degradation.

Detection

The team was paged at 20:48 UTC when 12% of requests resulted in 5xx errors, indicating a potential issue.

Resolution

Scaled up REST API by 50% and rolled back to the legacy container runtime, removing unhealthy instances by 21:21 UTC and normalizing error rates by 21:22 UTC.

Root Cause

The "noisy neighbor" condition and inadequate health checks that failed to detect the unhealthy state of REST API instances.