Slack notification issue

Severity: Major
Category: Scalability
Service: PagerDuty


Overview

On January 9, 2026, between 06:00 UTC and 10:26 UTC, PagerDuty experienced an incident where less than 1% of US service region accounts saw delays in Slack incident card and update postings. This was caused by a resource-intensive ETL job running on shared nodes, which consumed excessive CPU and network capacity, degrading the performance of the Slack integration service. While updates were delayed, none were lost, and other PagerDuty features remained operational. The issue resolved once the ETL job completed, and delayed updates were subsequently backfilled.

Impact

Less than 1% of PagerDuty accounts in the US service region experienced delays in incident cards and updates posting to Slack, with 95th-percentile delays exceeding 5 minutes. No incident card creations or updates were dropped, and all other PagerDuty features, including notifications and incident creation, remained fully operational.

Trigger

The incident was triggered by a scheduled, resource-intensive Extract-Transform-Load (ETL) job that began running on a number of shared nodes at 06:00 UTC.

Detection

PagerDuty's monitoring system detected the issue at 08:03 UTC, identifying 95th-percentile Slack incident card creation and update delays of more than 5 minutes. The major incident process was initiated at 08:33 UTC.
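As an illustration only (PagerDuty has not published its monitoring code), the alert condition described above amounts to computing the 95th-percentile posting delay over a window of samples and comparing it against a 5-minute threshold. The function names and sample data below are hypothetical:

```python
# Hypothetical sketch of a p95 latency check: alert when the 95th-percentile
# Slack posting delay exceeds 5 minutes. Not PagerDuty's actual monitoring code.
import statistics

THRESHOLD_SECONDS = 5 * 60  # 5-minute p95 threshold from the incident report

def p95(delays):
    # statistics.quantiles with n=20 returns 19 cut points;
    # index 18 is the 95th percentile.
    return statistics.quantiles(delays, n=20)[18]

def should_alert(delays_seconds):
    return p95(delays_seconds) > THRESHOLD_SECONDS

# A mostly-fast distribution with a slow tail still trips the p95 check,
# which is why percentile alerts catch partial degradations like this one.
delays = [10] * 90 + [400] * 10  # seconds; 10% of postings are slow
print(should_alert(delays))  # True
```

A percentile threshold like this fires even when most postings are fast, which matches the incident's profile: under 1% of accounts affected, yet tail latency clearly degraded.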

Resolution

The incident was resolved when the resource-intensive ETL job finished at 10:07 UTC, causing service latencies to return to baseline. Subsequently, a backfill process was initiated to deliver all delayed incident cards and updates to Slack, which was completed by 10:26 UTC.

Root Cause

The primary root cause was a resource-intensive scheduled ETL job running on shared nodes, which consumed excessive CPU and network capacity, degrading performance for the Slack integration service and its dependencies. A contributing factor was an older version of an HTTP library used by many services, which performed unnecessary DNS lookups even for connections reused from the connection pool, exacerbating the performance degradation under load.