Incident affecting Web UI, REST API, and Events API

Severity: Major
Category: Bug
Service: PagerDuty

This summary is created by Generative AI and may differ from the actual content.

Overview

On November 1, from 18:10 to 19:44 UTC, PagerDuty experienced a major incident affecting event ingestion, event processing, and Web UI and REST API requests in the US service region. A deployment in the event-processing service uncovered a bug in the event-storage service, leading to HTTP 500 responses and service disruptions. The issue was resolved by reverting the deployment, and normal functionality was restored. A backlog of events was reprocessed by 19:44 UTC.

Impact

Approximately 4% of Events API requests returned HTTP 5XX error responses, and 4.8% of notifications were delivered outside of SLA.

Trigger

A change to traffic mirroring in the event-processing service uncovered a bug in the event-storage service.

Detection

System monitors proactively notified engineers of the problem, leading to an investigation.

Resolution

On-call responders reverted the problematic change in the event-processing service, restoring full capacity and resolving the incident.

Root Cause

The incident was caused by invalid requests from traffic mirroring due to a missing validation check, leading to HTTP 500 responses and service disruptions.