Event Ingestion Issue

Severity: Minor
Category: Change Process
Service: PagerDuty
Overview
On October 31, 2022, from approximately 22:15 UTC until 22:40 UTC, a few customers in the US service region received 500 errors for events sent to the Events API. Events that received errors were retried successfully within minutes. Webhooks, notifications, inbound email events, and the REST API were not impacted by this incident.
Impact
A few customers in the US service region experienced 500 errors for events sent to the Events API. However, these events were retried successfully within minutes, and other services like webhooks, notifications, inbound email events, and the REST API were not impacted.
Trigger
The errors were precipitated by the phased rollout of a configuration change, which began at 21:02 UTC and continued until 22:40 UTC when the rollout was paused.
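The incident report does not describe PagerDuty's rollout tooling, but the pattern above (a configuration change applied in phases, then paused when errors appear) can be sketched as follows. All names, the batch size, and the pause mechanism are illustrative assumptions:

```python
# Hypothetical sketch of a phased rollout that can be paused when error
# signals appear; names are illustrative, not PagerDuty's actual tooling.

class PhasedRollout:
    """Advance a config change across host batches, pausing on errors."""

    def __init__(self, hosts, batch_size):
        self.hosts = hosts
        self.batch_size = batch_size
        self.position = 0
        self.paused = False

    def advance(self):
        """Apply the change to the next batch unless the rollout is paused."""
        if self.paused or self.position >= len(self.hosts):
            return []
        batch = self.hosts[self.position:self.position + self.batch_size]
        self.position += len(batch)
        return batch

    def pause(self):
        """Stop further phases; hosts already updated keep the new config."""
        self.paused = True


rollout = PhasedRollout(hosts=[f"host-{i}" for i in range(6)], batch_size=2)
first = rollout.advance()   # ["host-0", "host-1"] receive the new config
rollout.pause()             # error signal observed; halt the rollout
second = rollout.advance()  # [] -- no further hosts receive the change
```

Note that pausing stops propagation but does not undo the change on hosts already updated, which is why a separate roll-forward or rollback step is still needed after a pause.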
Detection
Indicators of degradation in the system did not appear until 22:25 UTC, when a monitor for excessive failed allocations fired. Alerts notified responders, and a major incident was declared after the first few 500 errors.
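A monitor for "excessive failed allocations" is typically a threshold over a sliding time window. The sketch below shows that shape; the threshold, window, and class name are assumptions, not the actual alerting configuration:

```python
# Illustrative sliding-window monitor for excessive failed allocations.
# Threshold and window values are assumptions for the example.
from collections import deque


class FailedAllocationMonitor:
    """Fire an alert when failures in a sliding window exceed a threshold."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self.failures = deque()  # timestamps of recent failed allocations

    def record_failure(self, now):
        """Record a failed allocation; return True if the alert should fire."""
        self.failures.append(now)
        # Drop failures that have aged out of the window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        return len(self.failures) > self.threshold


monitor = FailedAllocationMonitor(threshold=3, window_seconds=60)
alerts = [monitor.record_failure(t) for t in (0, 10, 20, 30)]
# Only the fourth failure within the window exceeds the threshold,
# so alerts == [False, False, False, True]
```

A window-based threshold like this explains the detection gap in the timeline: the rollout began at 21:02 UTC, but the monitor only fires once failures accumulate past the threshold.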
Resolution
The deployment was paused, and failed events were immediately re-queued for processing. Recovery was observed at 22:40 UTC, and the configuration change was rolled forward to a known good configuration. Additional hosts were provisioned, and clean-up actions continued until 02:42 UTC on November 1st.
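Re-queuing failed events for another processing attempt, as described above, can be sketched as a simple retry loop with a bounded attempt count. This assumes event handling is idempotent so retries are safe; the function and handler names are hypothetical:

```python
# Minimal sketch of re-queuing failed events for retry. Assumes an
# idempotent handler; this is not PagerDuty's actual event pipeline.
from collections import deque


def process_with_requeue(events, handler, max_attempts=3):
    """Process events, re-queuing failures until success or attempt limit."""
    queue = deque((event, 1) for event in events)
    processed, dead_letter = [], []
    while queue:
        event, attempt = queue.popleft()
        try:
            handler(event)
            processed.append(event)
        except Exception:
            if attempt < max_attempts:
                queue.append((event, attempt + 1))  # retry later
            else:
                dead_letter.append(event)  # give up after max_attempts
    return processed, dead_letter


# Handler that fails once per event, then succeeds -- simulating the
# transient 500 errors that cleared once capacity recovered.
seen = set()
def flaky(event):
    if event not in seen:
        seen.add(event)
        raise RuntimeError("transient 500")


processed, dead = process_with_requeue(["e1", "e2"], flaky)
# processed == ["e1", "e2"], dead == [] -- both events succeed on retry
```

This matches the observed outcome: events that initially received errors succeeded on retry within minutes, with nothing lost.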
Root Cause
The configuration change caused hosts to restart repeatedly, leaving too few healthy hosts to process all Events API requests. Because restarting hosts were still marked as 'healthy', allocations were placed on them, failed, and were transferred to the remaining functioning hosts, concentrating load on a shrinking pool.