Past Incidents Temporarily Unavailable

Severity: Minor
Category: Misconfiguration
Service: PagerDuty
This summary is created by Generative AI and may differ from the actual content.
Overview
On March 22, 2023, between 17:45 UTC and 18:50 UTC, the Past Incidents feature was unavailable in the PagerDuty Web UI and Mobile UI across the US and EU regions. The Past Incident API returned 500 'Internal Server Error' responses because an incorrect secret key update prevented the Web application from connecting to its storage systems. At 18:25 UTC, the team decided to return empty 200 responses instead of 500 errors until the issue was resolved. After identifying and correcting the secret key, the team gradually restored traffic. PagerDuty is implementing preventive measures, including better validation and testing mechanisms, and has apologized and reaffirmed its commitment to reliability.
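The 18:25 UTC mitigation, returning an empty 200 response instead of a 500 while storage is unreachable, can be illustrated with a minimal sketch. This is not PagerDuty's implementation; the handler, the `StorageUnavailable` exception, the `degrade_to_empty` switch, and the `past_incidents` response field are all hypothetical names chosen for illustration.

```python
from typing import Callable, List, Tuple


class StorageUnavailable(Exception):
    """Hypothetical error raised when the backing store cannot be reached."""


def past_incidents_handler(
    fetch: Callable[[], List[dict]],
    degrade_to_empty: bool,
) -> Tuple[int, dict]:
    """Return (status_code, body) for a Past Incidents request."""
    try:
        return 200, {"past_incidents": fetch()}
    except StorageUnavailable:
        if degrade_to_empty:
            # Degraded mode: an empty but well-formed body lets clients
            # keep rendering the page instead of showing an error.
            return 200, {"past_incidents": []}
        return 500, {"error": "Internal Server Error"}


def broken_fetch() -> List[dict]:
    """Simulates the storage connection failing after a bad secret update."""
    raise StorageUnavailable("authentication failed")
```

The trade-off of this approach is that clients cannot tell "no past incidents" apart from "feature degraded", so a switch like this is best treated as a temporary measure.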
Impact
The incident affected all PagerDuty customers in the US and EU regions, preventing access to past incidents in both the mobile and web applications. However, no events were lost or dropped during the incident.
Trigger
The incident was triggered by an incorrect secret key update during a scheduled rotation of secrets for the systems powering the Past Incidents feature.
Detection
The issue was detected when customers experienced issues accessing and viewing past incidents within the mobile and web applications, leading to an investigation by the team.
Resolution
The resolution involved identifying the incorrect secret key update, restoring the correct keys, and gradually restoring traffic. A solution was pushed to a subset of internal subdomains, followed by a rollout to both service regions.
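A staged rollout of this kind, internal subdomains first and then a growing share of all traffic, could be sketched as follows. The bucketing scheme, function names, and subdomain values are assumptions for illustration only; the source does not describe how the rollout was actually gated.

```python
import hashlib
from typing import Set


def bucket(subdomain: str) -> int:
    """Map a subdomain to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(subdomain.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def serves_fix(subdomain: str, internal: Set[str], percent: int) -> bool:
    """Decide whether a subdomain receives the fix at this rollout stage.

    Internal subdomains always receive it; everyone else is admitted
    once their bucket falls below the current rollout percentage.
    """
    if subdomain in internal:
        return True
    return bucket(subdomain) < percent
```

Because the bucket is derived from a hash of the subdomain, a subdomain admitted at one percentage stays admitted as the percentage rises, which keeps the rollout monotonic.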
Root Cause
The root cause was an incorrect secret key update during a scheduled rotation, compounded by ineffective validation processes and similar secret key names used in the systems.
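Both contributing factors suggest simple guards: probe a candidate secret against the target system before promoting it, and flag secret names that are similar enough to be confused. The sketch below is one possible shape for such checks, not the validation PagerDuty adopted; the in-memory `store`, the `probe` callback, the example key names, and the 0.9 similarity threshold are all assumptions.

```python
import difflib
from typing import Callable, Dict, List, Tuple


def confusable_names(names: List[str], threshold: float = 0.9) -> List[Tuple[str, str]]:
    """Return pairs of secret names similar enough to be mixed up."""
    ordered = sorted(names)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            ratio = difflib.SequenceMatcher(None, first, second).ratio()
            if ratio >= threshold:
                pairs.append((first, second))
    return pairs


def rotate_secret(
    store: Dict[str, str],
    name: str,
    candidate: str,
    probe: Callable[[str], bool],
) -> bool:
    """Promote the candidate secret only if a live probe accepts it.

    `probe` is a hypothetical callback that tries to authenticate
    against the target system using the candidate value.
    """
    if not probe(candidate):
        # Leave the current secret untouched so the service keeps working.
        return False
    store[name] = candidate
    return True
```

With this shape, a failed probe leaves the existing secret in place, so a mistaken value is rejected at rotation time rather than discovered through failing requests.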