This summary is created by Generative AI and may differ from the actual content.
Overview
On April 13th, 2023, between 14:52 and 15:10 UTC, the PagerDuty Web Application was degraded in the US service region, causing sluggish performance and intermittent error pages for customers. The incident was triggered by a code change deployed to the servers that maintain websocket connections; those servers were underprovisioned for the load pattern the change introduced. An Emergency Rollback further exacerbated the issue. By 15:10 UTC, the situation normalized as websockets reconnected. Post-incident, server capacity was increased, and the websocket code was updated to reconnect more gracefully during deployments.
Impact
The incident affected customers using the PagerDuty Web Application in the US service region, causing sluggish experiences and intermittent error pages. Other components and service regions were not impacted.
Trigger
The incident was triggered by a code change deployment to the servers maintaining websocket connections, which introduced a new load pattern that the system was underprovisioned to handle.
Detection
The issue was detected when customers began experiencing sluggish performance and intermittent error pages on the PagerDuty Web Application.
Resolution
An engineer executed an Emergency Rollback, which initially worsened the issue, but by 15:10 UTC the situation normalized as all websockets reconnected. Post-incident, server capacity was increased, and the websocket code was updated to reconnect more gracefully during deployments.
Root Cause
The root cause was an underprovisioned system for the new load pattern introduced by the websocket feature, exacerbated by an Emergency Rollback that repeated the deployment process, increasing the websocket reconnection load.
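
The summary does not say how the websocket code was changed to reconnect more gracefully. One common way to keep many clients from reconnecting at the same moment after a deployment is exponential backoff with random jitter. The sketch below illustrates that general technique only; the function names, parameters, and default values are hypothetical and are not taken from PagerDuty's implementation.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a randomized wait (in seconds) before reconnect attempt `attempt`.

    `attempt` is 0-based. The upper bound doubles with each attempt until it
    reaches `cap`, and the actual delay is drawn uniformly between 0 and that
    bound so that clients spread their reconnects over time.
    All names and defaults here are illustrative assumptions.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def reconnect_schedule(max_attempts=5, base=1.0, cap=60.0, seed=None):
    """Return the list of delays one client would wait across its attempts."""
    rng = random.Random(seed)  # a seed makes the schedule reproducible
    return [reconnect_delay(n, base, cap, rng) for n in range(max_attempts)]


if __name__ == "__main__":
    for n, delay in enumerate(reconnect_schedule(max_attempts=5, cap=30.0)):
        print(f"attempt {n}: wait {delay:.2f}s before reconnecting")
```

Because each client draws its own random delay, a deployment or rollback that drops every connection at once produces a spread of reconnect attempts rather than a single spike. Suitable `base` and `cap` values depend on how quickly clients need to recover and how much reconnect load the servers can absorb.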