Querying and Ingest issues in EU

Severity: Major
Category: Dependencies
Service: Honeycomb

This summary is created by Generative AI and may differ from the actual content.

Overview

Honeycomb's EU cluster experienced a critical loss of redundancy in its Kafka cluster, which caused ingestion issues, querying outages, and delays in SLO processing. In total, there were 6 hours and 23 minutes of full ingest outage and 10 days of Activity Log data unavailability.

Impact

0.23% of datasets were fully affected, and a larger percentage saw intermittent ingestion and query failures. Notifications for SLOs and Triggers were delayed, and SLO data was not correct until systems caught up and the cache was rebuilt.

Trigger

The incident was triggered by a critical loss of redundancy in the Kafka cluster, caused by exceptionally high load and exacerbated by issues with tiered storage.

Detection

The issue was detected through alerts and monitoring of the Kafka cluster. Responders identified impacted teams and worked on a traffic reassignment script.

Resolution

The incident was resolved through a combination of stabilization work, emergency configuration changes, and an emergency migration project to a new Kafka cluster. Responders also worked to restore full functionality to the Activity Log and to repair damaged partitions.

Root Cause

The root cause was a combination of exceptionally high load and issues with tiered storage. This led to a critical loss of redundancy in the Kafka cluster and damaged multiple internal partitions used to manage the cluster itself.