Investigating Querying Issues

Severity: Major
Category: Dependencies
Service: Honeycomb

Overview

Honeycomb experienced systemic failures in querying and SLO evaluation, along with an inability to access the AWS console, during a nearly sixteen-hour AWS us-east-1 outage on October 22, 2025. Event ingest remained functional, but querying was intermittently affected throughout the day, delaying Trigger and SLO evaluations and degrading ancillary systems such as Service Maps. The outage was triggered by a DNS resolution issue for internal DynamoDB endpoints within AWS. A full incident review is planned to re-evaluate disaster recovery, vendor reliance, and regional dependence.

Impact

Systemic failures affected querying, SLO evaluation, and access to the AWS console. Querying was intermittently impacted for several hours across multiple time windows, delaying Trigger and SLO evaluations, and ancillary systems such as Service Maps experienced degraded performance. Event ingest remained functional throughout, with no dropped events. The incident lasted nearly sixteen hours, and services continued to be impacted by throttling even after AWS declared the outage resolved.

Trigger

The incident was triggered by an AWS outage in the us-east-1 region, specifically a DNS resolution issue for internal DynamoDB endpoints. This caused cascading issues within AWS, including errors launching EC2 instances and failed Lambda invocations, which directly impacted Honeycomb's services.

Detection

Honeycomb became aware of the incident at 07:05 UTC on October 22, 2025, when its end-to-end tests failed. The responding engineer observed multiple systemic failures, including querying issues, SLO evaluation problems, and an inability to log in to the AWS console. AWS declared an outage affecting us-east-1 six minutes later, at 07:11 UTC.
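
As an illustration of this kind of detection, the sketch below shows a minimal synthetic end-to-end query probe that alerts when a request fails or runs slowly. It is not Honeycomb's actual test harness: the probe URL, timeout, threshold, and alert() hook are hypothetical placeholders.

```python
# Minimal sketch of a synthetic end-to-end query probe; the URL, timeout,
# and alerting hook are hypothetical placeholders.
import time
import requests

QUERY_PROBE_URL = "https://example.invalid/probe/query"  # hypothetical endpoint
TIMEOUT_SECONDS = 30
SLOW_THRESHOLD_SECONDS = 10.0


def alert(message: str) -> None:
    """Stand-in for a paging or alerting integration."""
    print(f"ALERT: {message}")


def run_probe() -> bool:
    """Issue one small query and report failures or slow responses."""
    started = time.monotonic()
    try:
        response = requests.get(QUERY_PROBE_URL, timeout=TIMEOUT_SECONDS)
        response.raise_for_status()
    except requests.RequestException as exc:
        alert(f"end-to-end query probe failed: {exc}")
        return False
    elapsed = time.monotonic() - started
    if elapsed > SLOW_THRESHOLD_SECONDS:
        alert(f"query probe succeeded but took {elapsed:.1f}s")
    return True


if __name__ == "__main__":
    run_probe()
```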

Resolution

Resolution depended primarily on AWS fixing the underlying outage, which AWS declared resolved nearly sixteen hours later, at 22:53 UTC. Honeycomb maintained existing EC2 capacity to keep event ingest running. However, the breadth of the AWS outage limited Honeycomb's ability to make on-the-fly changes or deactivate problematic features, because those changes rely on affected external providers. Services continued to be impacted by throttling even after AWS's declared resolution. A full incident review is planned to re-evaluate disaster recovery, vendor reliance, and regional dependence.
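
For context on riding out post-recovery throttling, here is a minimal sketch of client-side retry configuration, assuming boto3/botocore clients; the services, region, and retry limits shown are illustrative rather than Honeycomb's actual configuration.

```python
# Minimal sketch of client-side retry/backoff configuration for AWS calls,
# assuming boto3 is in use; retry limits and clients are illustrative.
import boto3
from botocore.config import Config

# "adaptive" mode layers client-side rate limiting on top of the standard
# exponential backoff, which helps absorb throttling after a regional recovery.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})

dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=retry_config)
lambda_client = boto3.client("lambda", region_name="us-east-1", config=retry_config)
```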

Root Cause

The root cause was a widespread AWS outage in the us-east-1 region, specifically a DNS resolution issue for internal DynamoDB endpoints. That failure cascaded into dependent AWS systems, producing errors launching EC2 instances and failed Lambda invocations; Honeycomb relies on these services for query functionality and SLO evaluation. The incident highlighted Honeycomb's significant reliance on external vendors and its dependence on a single region.
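
To make the failure mode concrete, the sketch below shows how a DNS or endpoint-connection failure surfaces differently from throttling when calling DynamoDB through boto3. The table name, return values, and probe structure are illustrative assumptions, not part of Honeycomb's systems.

```python
# Minimal sketch distinguishing endpoint/DNS failures from throttling when
# calling DynamoDB via boto3; the table name and return values are illustrative.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

THROTTLING_CODES = {"ThrottlingException", "ProvisionedThroughputExceededException"}


def classify_dynamodb_health(table_name: str) -> str:
    """Probe one table and classify the failure mode, if any."""
    try:
        dynamodb.describe_table(TableName=table_name)
        return "ok"
    except EndpointConnectionError:
        # The request never reached DynamoDB (e.g. DNS resolution failed),
        # matching the failure mode described above.
        return "endpoint-unreachable"
    except ClientError as exc:
        if exc.response["Error"]["Code"] in THROTTLING_CODES:
            return "throttled"
        raise


if __name__ == "__main__":
    print(classify_dynamodb_health("example-table"))  # hypothetical table name
```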