Incident Report: May 19, 2026 - GCP Account Suspension

Severity: Critical

Category: Dependencies

Service: Railway

This summary is created by Generative AI and may differ from the actual content.

Overview

Railway suffered a platform‑wide outage on May 19‑20, 2026 when Google Cloud mistakenly suspended Railway’s production GCP account. The suspension disabled the API, control plane, databases and compute resources hosted on GCP, causing 503 errors for the dashboard and API and preventing logins. As edge proxies lost their routing‑table cache, the outage cascaded to workloads running on Railway’s Metal and AWS environments, which began returning 404 errors. The incident lasted roughly eight hours, with services gradually restored after Google reinstated account access and Railway recovered disks, networking, and compute instances.

Impact

Duration: ~8 hours (22:20 UTC May 19 to 06:14 UTC May 20).

Scope: All GCP‑hosted services (API, dashboard, control plane, databases, compute) were offline. Users saw 503 “no healthy upstream” and “unconditional drop overload” errors and could not log in. Once edge caches expired, the outage spread to all regions, and Metal and AWS workloads began returning 404 errors. Build and deployment pipelines were blocked, creating a backlog of queued deploys. GitHub rate‑limited Railway’s OAuth and webhook integrations, further impacting logins and builds. The report does not mention direct revenue loss, but the full platform unavailability affected all customers.

Trigger

An automated action within Google Cloud incorrectly placed Railway’s production account into a suspended status. The suspension was applied across Google Cloud to many accounts without prior notice, and it triggered the outage.

Detection

Railway’s automated monitoring system observed API health‑check failures at 22:10 UTC and paged the on‑call team, who confirmed 503 errors on the dashboard and identified the GCP account suspension as the root cause by 22:19 UTC.

Resolution

A P0 ticket was opened with Google Cloud, and account access was restored at 22:29 UTC. Recovery then proceeded in stages: persistent disks were brought back online (first disk at 23:09 UTC, all disks ready by 23:54 UTC); networking and edge routing were restored (network up by ~01:30 UTC); compute instances were restarted (starting 01:30 UTC); and edge traffic was gradually re‑enabled (01:38 UTC). Deployments were paused and then resumed to avoid overload, and the rate limits on the OAuth/GitHub integration were resolved. Full service functionality (API, dashboard, OAuth) was confirmed by 04:00 UTC, and the incident was officially resolved at 07:58 UTC.
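
To make the recovery timeline easier to read, the sketch below computes how long after account restoration each milestone occurred. The timestamps are taken from the paragraph above; the dates for times after midnight are an assumption (May 20), and the helper names `utc` and `elapsed_since_restore` are illustrative only.

```python
from datetime import datetime, timezone


def utc(day: int, hhmm: str) -> datetime:
    """Build a UTC timestamp in May 2026 from a day number and an HH:MM string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2026, 5, day, hour, minute, tzinfo=timezone.utc)


# Milestones as stated in the Resolution section.
# Assumption: times after midnight fall on May 20.
MILESTONES = [
    ("account access restored", utc(19, "22:29")),
    ("first disk online", utc(19, "23:09")),
    ("all disks ready", utc(19, "23:54")),
    ("compute restarts begin", utc(20, "01:30")),
    ("edge traffic re-enabled", utc(20, "01:38")),
    ("full functionality confirmed", utc(20, "04:00")),
    ("incident resolved", utc(20, "07:58")),
]


def elapsed_since_restore(milestones):
    """Return (label, whole minutes since the first milestone) pairs."""
    start = milestones[0][1]
    return [
        (label, int((ts - start).total_seconds() // 60))
        for label, ts in milestones
    ]


for label, minutes in elapsed_since_restore(MILESTONES):
    print(f"+{minutes // 60:02d}h{minutes % 60:02d}m  {label}")
```

Under these assumptions, all disks were ready about 1 hour 25 minutes after access was restored, and the incident was declared resolved about 9 hours 29 minutes after it.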

Root Cause

The immediate cause was Google Cloud’s erroneous automated suspension of Railway’s production account, which cut off all GCP‑hosted resources. A secondary cause was Railway’s architectural reliance on the GCP‑hosted network control‑plane API for route discovery; when the control plane vanished, edge caches expired and the outage propagated to Metal and AWS workloads, turning a single‑provider failure into a platform‑wide incident.
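
The cascade described above can be illustrated with a toy model of an edge route cache. This is a minimal sketch, not Railway's implementation: the class `RouteCache`, the exception `ControlPlaneDown`, the 60-second TTL, the hostname, and the upstream address are all hypothetical. The `serve_stale` flag is included only as a contrast to show how the failure mode depends on what the proxy does once an entry expires; the source does not say whether Railway uses such a fallback.

```python
import time


class ControlPlaneDown(Exception):
    """Raised when the (hypothetical) control-plane API cannot be reached."""


class RouteCache:
    """Toy edge-proxy route cache with a TTL; all names are illustrative."""

    def __init__(self, fetch_route, ttl_seconds, serve_stale=False,
                 clock=time.monotonic):
        self._fetch_route = fetch_route
        self._ttl = ttl_seconds
        self._serve_stale = serve_stale
        self._clock = clock
        self._entries = {}  # hostname -> (upstream, fetched_at)

    def resolve(self, hostname):
        """Return (status_code, upstream) for a request to `hostname`."""
        now = self._clock()
        entry = self._entries.get(hostname)
        if entry is not None and now - entry[1] < self._ttl:
            return 200, entry[0]  # fresh cache hit
        try:
            upstream = self._fetch_route(hostname)
        except ControlPlaneDown:
            if self._serve_stale and entry is not None:
                return 200, entry[0]  # keep using the last known route
            return 404, None  # entry expired and cannot be refreshed
        self._entries[hostname] = (upstream, now)
        return 200, upstream


def simulate(serve_stale):
    """Warm the cache, take the control plane down, then let the TTL lapse."""
    now = [0.0]
    control_plane_up = [True]

    def fetch_route(hostname):
        if not control_plane_up[0]:
            raise ControlPlaneDown(hostname)
        return "metal-host-1:8080"  # made-up upstream address

    cache = RouteCache(fetch_route, ttl_seconds=60, serve_stale=serve_stale,
                       clock=lambda: now[0])
    statuses = [cache.resolve("app.example.com")[0]]      # warm the cache
    control_plane_up[0] = False                            # control plane vanishes
    now[0] = 30.0
    statuses.append(cache.resolve("app.example.com")[0])  # still within TTL
    now[0] = 90.0
    statuses.append(cache.resolve("app.example.com")[0])  # TTL has expired
    return statuses


print(simulate(serve_stale=False))
print(simulate(serve_stale=True))
```

In this model the workload itself never goes down. With `serve_stale=False` the statuses are `[200, 200, 404]`: requests keep succeeding until the cached route expires, then fail because the route can no longer be refreshed. With `serve_stale=True` they stay at `[200, 200, 200]`.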
