Incident affecting Web UI, REST API, and Events API

Severity: MajorCategory: BugService: PagerDuty
This summary is created by Generative AI and may differ from the actual content.
Overview
On November 1, from 18:10 to 19:44 UTC, PagerDuty experienced a major incident affecting event ingestion, event processing, and Web UI and REST API requests in the US service region. A deployment in the event-processing service uncovered a bug in the event-storage service, leading to HTTP 500 responses and service disruptions. The issue was resolved by reverting the deployment, and normal functionality was restored. A backlog of events was reprocessed by 19:44 UTC.
Impact
Approximately 4% of Events API requests returned HTTP 5XX error responses, and 4.8% of notifications were delivered outside of SLA.
Trigger
A change to traffic mirroring in the event-processing service uncovered a bug in the event-storage service.
Detection
System monitors proactively notified engineers of the problem, leading to an investigation.
Resolution
On-call responders reverted the problematic change in the event-processing service, restoring full capacity and resolving the incident.
Root Cause
The incident was caused by invalid requests from traffic mirroring due to a missing validation check, leading to HTTP 500 responses and service disruptions.
;