Issue Identified with displaying incident Data

Severity: Minor
Category: Dependencies
Service: PagerDuty

This summary is created by Generative AI and may differ from the actual content.

Overview

On February 11, from 04:51 UTC to 06:40 UTC, PagerDuty experienced an incident affecting customers in the US service region, resulting in out-of-date or missing incident details due to replication delays in the database cluster.

Impact

Customers in the US service region experienced out-of-date or missing incident details for nearly two hours.

Trigger

A database migration started on February 10, combined with the decommissioning of a database server, led to replication lag detection failures.

Detection

Replication lag monitors began alerting after 20 minutes of sporadic delays.

Resolution

The migration was safely stopped, and within ten minutes, the backlog of writes was processed, resolving the issue.

Root Cause

The replication lag was caused by the migration process interacting with infrastructure changes, leading to silent failures in lag detection.