Accessing Schedules

Severity: Major
Category: Change Process
Service: PagerDuty
Overview
On August 4, 2025, between 15:02 and 15:15 UTC, PagerDuty experienced a service degradation affecting all US and EU customers. The incident prevented users from viewing or modifying schedules via the UI and API, resulting in 500 errors. Overrides modified during this period were delayed, some taking effect as late as 18:31 UTC, and were not correctly applied to the schedule's final layer; overrides modified after 15:15 UTC were not impacted. The degradation was triggered by a faulty update to an internal API service that depended on an environment variable which was not set, causing the application to crash on specific requests. The issue was detected through a spike in errors and resolved by rolling back the update, followed by manual repair of the affected schedules. Future improvements include updating monitoring to more accurately identify application failures, enhancing the canary deployment process to automatically cancel rollouts when response metrics indicate a problem, and developing new internal recovery tools.
Impact
The incident impacted all PagerDuty customers in the US and EU service regions. During the degradation, attempts to view or modify schedules through the UI failed, and API requests to retrieve or modify schedule data returned HTTP 500 errors. Overrides modified between 15:02 and 15:15 UTC were delayed, some taking effect as late as 18:31 UTC, and were not correctly applied to the schedule's final layer, meaning they were not considered when determining who was on call. Incident assignment, responder requests, and other schedule-based notifications were not impacted, and no inbound events or notifications were lost or dropped.
Trigger
The incident was triggered on August 4, 2025, at 15:02 UTC, by the rollout of an update to the internal API service responsible for schedules-related requests. The update modified the service to depend on an environment variable that was not correctly set during deployment.
Detection
The degradation was detected at 15:03 UTC when a spike in errors from the internal API service was observed, prompting an investigation into the possible source. By 15:09 UTC, the configured canary deployment had finished, and all incoming requests to schedules-related endpoints, except those to modify overrides, began to fail.
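The remediation items mentioned in the overview include cancelling a canary automatically when response metrics look unhealthy. As a minimal sketch of that kind of gate (not PagerDuty's actual tooling; the function name, traffic numbers, and threshold below are all hypothetical), a canary can be held back from promotion whenever its observed server-error rate exceeds a limit during the bake period:

```go
package main

import "fmt"

// shouldPromoteCanary is a hypothetical promotion gate: the canary is only
// promoted if its observed 5xx rate stays at or below maxErrorRate during
// the bake period. A zero-traffic canary is never promoted blindly.
func shouldPromoteCanary(totalRequests, serverErrors int, maxErrorRate float64) bool {
	if totalRequests == 0 {
		return false
	}
	return float64(serverErrors)/float64(totalRequests) <= maxErrorRate
}

func main() {
	// With health checks passing but nearly all schedules requests failing,
	// a response-metric gate like this would cancel the rollout instead of
	// completing it.
	fmt.Println(shouldPromoteCanary(500, 480, 0.01)) // false: cancel and roll back
	fmt.Println(shouldPromoteCanary(500, 2, 0.01))   // true: safe to promote
}
```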
Resolution
At 15:11 UTC, the recent deployment was identified as the source of the errors. A rollback to the previous version was initiated and completed by 15:15 UTC, which stopped all errors. Subsequently, by 18:31 UTC, all impacted schedules where overrides were not correctly applied to the final layer were identified and repaired.
Root Cause
The root cause of the incident was a faulty update to the internal API service. The update introduced a dependency on an environment variable that was not correctly configured during deployment. This allowed the application to start and pass health checks, but it would crash and return 500 response codes when handling requests that required the expected, but missing, variable. A contributing factor was that the canary deployment process did not adequately detect this issue before the full rollout, as the application's health checks did not account for the missing environment variable's impact on request processing.
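To illustrate the failure mode described above, the sketch below shows how a process can pass a shallow health check while every request that needs a missing environment variable fails with a 500. This is a simplified illustration, not the actual service: the endpoint paths and the SCHEDULES_BACKEND_URL variable name are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// The health check only confirms the process can answer HTTP requests;
	// it never exercises the configuration the request handlers depend on,
	// so a deployment missing that configuration still looks healthy.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// The schedules handler reads the variable lazily, on each request.
	// If the rollout leaves it unset, every schedules request returns a 500.
	http.HandleFunc("/schedules", func(w http.ResponseWriter, r *http.Request) {
		backend := os.Getenv("SCHEDULES_BACKEND_URL") // hypothetical variable name
		if backend == "" {
			http.Error(w, "missing SCHEDULES_BACKEND_URL", http.StatusInternalServerError)
			return
		}
		fmt.Fprintf(w, "would fetch schedule data via %s\n", backend)
	})

	http.ListenAndServe(":8080", nil)
}
```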