This summary is created by Generative AI and may differ from the actual content.
Overview
On October 20, 2022, from approximately 10:40 UTC to 10:42 UTC, some customers experienced HTTP 500-series errors on the web and mobile sites, as well as on the REST and Events APIs. Webhooks and notifications were delayed, and event ingestion was throttled for some customers. The outage was caused by network connectivity issues between our service region and a disaster recovery region, which started at 10:28 UTC and lasted until 10:44 UTC. Critical path services with hard dependencies on the service region restarted and could not start up correctly because the network was inaccessible. Recovery began at 10:42 UTC, with full recovery and clean-up actions continuing until 11:30 UTC.
Impact
Some customers could not connect to web and mobile sites, and experienced delays in notifications and event ingestion. The incident was resolved with no ongoing impact to customers.
Trigger
Network connectivity issues between the service region and a disaster recovery region.
Detection
Alerts notified responders, and a major incident was automatically triggered at 10:44 UTC.
Resolution
Network connectivity was restored, allowing services to self-heal. Clean-up actions and throttle removals continued until 11:30 UTC.
Root Cause
Network connectivity issues between the service region and a disaster recovery region, causing critical path services to restart and fail to start up correctly.