This summary is created by Generative AI and may differ from the actual content.
Overview
On January 18, 2023, between 15:45 UTC and 20:11 UTC, PagerDuty experienced degradation of the Service Directory and Visibility Console in the US Service Region. During this time, customers in the US Service Region would have noticed a slower user experience as well as occasional errors when attempting to load the Service Directory. Customers would have also been unable to load the Visibility Console, or faced a longer wait time when trying to view the content/dashboard.
Impact
Customers in the US Service Region experienced a slower user experience and occasional errors when attempting to load the Service Directory. They were also unable to load the Visibility Console or faced longer wait times when trying to view the content/dashboard.
Trigger
The incident was triggered by a partial node loss in the datastore of the underlying service containing Technical Service information, which both the Visibility Console and the Service Directory depend on.
Detection
PagerDuty began an incident response process at 15:47 UTC after detecting an increase in failures.
Resolution
Attempts to replace the impacted node began at 16:28 UTC. A parallel effort to spin up a new cluster and transition the service over to it was initiated at 17:49 UTC. The initial effort to replace the impacted node was successfully completed at 19:52 UTC, and by 20:11 UTC, users were able to load the Service Directory and Visibility Console without issue. The new cluster was synced with the latest data at 20:14 UTC and was ready to be cut over if needed. The incident was closed at 20:28 UTC after verifying that the services were operating normally.
Root Cause
The root cause was a partial node loss in the datastore, which resulted in more load being placed on the remaining nodes in the cluster, leading to general performance degradation of the service.
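As a rough illustration of why losing part of a cluster degrades the whole service, the sketch below computes how much extra load each surviving node absorbs. It assumes load is spread evenly across nodes, which the summary does not state, and the function names, node counts, and request rates are hypothetical values chosen only for the example; they do not describe PagerDuty's actual datastore.

```python
def per_node_load(total_load: float, total_nodes: int, lost_nodes: int) -> float:
    """Load carried by each surviving node, assuming an even spread (an assumption)."""
    remaining = total_nodes - lost_nodes
    if remaining <= 0:
        raise ValueError("no nodes remain to serve traffic")
    return total_load / remaining


def load_increase_factor(total_nodes: int, lost_nodes: int) -> float:
    """Ratio of per-node load after the loss to per-node load before it."""
    remaining = total_nodes - lost_nodes
    if remaining <= 0:
        raise ValueError("no nodes remain to serve traffic")
    return total_nodes / remaining


# Hypothetical cluster: 3 nodes serving 900 requests/s, then 1 node is lost.
before = per_node_load(900, total_nodes=3, lost_nodes=0)
after = per_node_load(900, total_nodes=3, lost_nodes=1)
factor = load_increase_factor(total_nodes=3, lost_nodes=1)
print(before, after, factor)
```

Under these assumed numbers, each remaining node goes from 300 to 450 requests per second, a 1.5x increase. If nodes were already running near capacity, an increase of that size would be enough to cause the slow responses and occasional errors described above.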