Andrew Goldis

AWS us-east-1 Outage Report

On October 20, 2025, an outage affecting Amazon Web Services (AWS) in the us-east-1 region caused disruption to several of our services. This document summarizes the incident, the actions taken to restore service and the preventative improvements we are implementing.

Incident Timeline

Here is the timeline of the incident; all times are in Pacific Time (PT).

2025-10-20 01:00 AM - Customer reports indicate the dashboard is inaccessible. Authentication requests begin failing, showing HTTP 500 errors from AWS Cognito. AWS’s status page reflects a large regional service disruption.

2025-10-20 02:30 AM - The dashboard becomes reachable again. Customers can log in and view workflow runs.

2025-10-20 03:00 AM - Internal testing reveals that run notifications are not being delivered. MongoDB has automatically suspended change-stream triggers after outage-induced errors exceeded its internal thresholds. This suspension prevents certain run status updates from propagating. Our team restores the change-stream triggers to resume run status updates.

2025-10-20 05:00 AM - 2025-10-20 11:00 AM - Queue delays increase as incoming workloads accumulate faster than they can be processed. The root cause is identified as an AWS regional capacity issue preventing new compute instances from being provisioned. Engineering begins preparing fallback workloads in us-west-2. Secondary blockers arise because VPC endpoint provisioning is also restricted by the AWS service failures.

2025-10-20 11:10 AM - 2025-10-20 2:00 PM - AWS capacity normalizes. Additional queue-processing machines provision successfully. The team manually scales infrastructure to resume queue processing and work through the backlog of jobs.

2025-10-20 3:00 PM - Run processing returns to normal latency. The public status page is updated with a message to customers that the incident is resolved.

Impact

The incident had a significant impact on our customers, who experienced three distinct issues:

  • Inability to access the dashboard due to AWS Cognito authentication issues
  • Inability to receive run notifications due to MongoDB change-stream suspension
  • Stuck CI runs due to a series of cascading issues affecting the data processing pipeline

Root Cause

AWS Cognito Authentication Issues

The dashboard was inaccessible because of AWS Cognito authentication failures, which in turn were caused by the regional disruption to AWS Cognito services.

Currents uses AWS Cognito to authenticate users and manage user sessions, which makes Cognito a single point of failure for the dashboard.
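
For context, here is a simplified sketch of how a dashboard backend might call Cognito (using the AWS SDK v3 client; the client ID, auth flow, and error handling are illustrative, not our exact implementation). When Cognito returns 5xx responses, the auth layer can surface a distinct "auth provider unavailable" state instead of a generic 500:

```typescript
import {
  CognitoIdentityProviderClient,
  InitiateAuthCommand,
} from "@aws-sdk/client-cognito-identity-provider";

// Illustrative client; the region matches the affected us-east-1 deployment.
const cognito = new CognitoIdentityProviderClient({ region: "us-east-1" });

export async function signIn(username: string, password: string) {
  try {
    // Exchange credentials for tokens; USER_PASSWORD_AUTH is one of several flows.
    return await cognito.send(
      new InitiateAuthCommand({
        AuthFlow: "USER_PASSWORD_AUTH",
        ClientId: process.env.COGNITO_CLIENT_ID!, // hypothetical env var
        AuthParameters: { USERNAME: username, PASSWORD: password },
      })
    );
  } catch (err: any) {
    // During the outage Cognito returned 5xx responses; surfacing them as a
    // distinct "auth provider unavailable" state makes the failure mode
    // easier to communicate to customers.
    if ((err?.$metadata?.httpStatusCode ?? 0) >= 500) {
      throw new Error("Authentication provider unavailable");
    }
    throw err;
  }
}
```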

MongoDB Change-Stream Suspension

MongoDB automatically suspended change-stream triggers after an increased rate of AWS API errors and elevated Lambda function latencies exceeded its internal error thresholds. The suspension prevented certain run status updates from propagating.
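
For illustration, here is a minimal self-managed change-stream watcher (a sketch using the MongoDB Node.js driver; database and collection names are hypothetical). The pattern is to consume change events, keep the resume token, and reconnect with an alert instead of staying silently suspended:

```typescript
import { MongoClient } from "mongodb";

const client = new MongoClient(process.env.MONGODB_URI!); // hypothetical connection string
let resumeToken: unknown; // last seen change-stream token, kept so we can resume after errors

async function watchRunUpdates(): Promise<void> {
  const runs = client.db("currents").collection("runs"); // illustrative names
  const stream = runs.watch([], resumeToken ? { resumeAfter: resumeToken } : {});

  stream.on("change", (event) => {
    resumeToken = event._id; // the change event _id is the resume token
    // ...propagate the run status update / notification here
  });

  stream.on("error", async (err) => {
    // The managed trigger suspended silently during the outage; at minimum,
    // an error here should page on-call before retrying.
    console.error("change stream error, reconnecting", err);
    await stream.close();
    setTimeout(watchRunUpdates, 5_000);
  });
}
```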

Data Processing Pipeline Issues

Stuck CI runs were caused by a series of cascading issues affecting the data processing pipeline. These issues stemmed from AWS's inability to allocate new EC2 resources in the us-east-1 region.

Currents uses AWS ECS with autoscaling policies to adjust the number of tasks based on the incoming workload. However, no new tasks could be provisioned because AWS was unable to allocate new EC2 resources in the us-east-1 region.
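
As an illustration of the manual scaling step mentioned in the timeline, here is a sketch using the AWS SDK v3 ECS client (the cluster name, service name, and task count are hypothetical) to raise the desired task count once capacity in a fallback region becomes available:

```typescript
import { ECSClient, UpdateServiceCommand } from "@aws-sdk/client-ecs";

async function scaleQueueWorkers(): Promise<void> {
  // Fallback region used while us-east-1 could not allocate EC2 capacity.
  const ecs = new ECSClient({ region: "us-west-2" });

  await ecs.send(
    new UpdateServiceCommand({
      cluster: "queue-processing",  // hypothetical cluster name
      service: "run-ingest-worker", // hypothetical service name
      desiredCount: 20,             // manually raised while autoscaling could not add tasks
    })
  );
}
```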

Learnings

Overall, this incident showed that the team is well prepared to handle operational issues, including the escalation process, mitigation steps, and incident response procedures.

However, there are several areas where we can improve:

  • We need to improve our external communication with customers by establishing alternative communication channels. Our status page provider was unavailable for a portion of the incident window.

  • Our services depend on AWS infrastructure with multiple single points of failure. Removing these single points of failure requires significant architectural changes.

  • We could provide better resilience and reduce the RTO by provisioning alternative compute capacity in a different AWS region.

  • Due to the nature of our services, we need to prioritize certain workloads that are critical for customers. For example, updating analytics data can be throttled to a lower priority to ensure that critical CI run updates are not affected (see the sketch after this list).
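
As a sketch of what that prioritization could look like (queue URLs and job shapes are hypothetical, and SQS-style queues are used purely for illustration), critical run status updates would be routed to a dedicated queue so an analytics backlog cannot delay them:

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: "us-east-1" });

type Job = { kind: "run-status" | "analytics"; payload: unknown };

// Hypothetical queue URLs: one for critical run updates, one for deferrable analytics.
const CRITICAL_QUEUE_URL = process.env.CRITICAL_QUEUE_URL!;
const ANALYTICS_QUEUE_URL = process.env.ANALYTICS_QUEUE_URL!;

export async function enqueue(job: Job): Promise<void> {
  // Routing by job kind keeps critical CI run updates isolated from
  // lower-priority analytics processing during an incident.
  const QueueUrl = job.kind === "run-status" ? CRITICAL_QUEUE_URL : ANALYTICS_QUEUE_URL;
  await sqs.send(
    new SendMessageCommand({ QueueUrl, MessageBody: JSON.stringify(job.payload) })
  );
}
```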

What Went Well

  • Early detection based on customer reports and internal monitoring
  • Escalation process and incident response procedures were effective
  • We were able to isolate the affected components without causing further impact to other systems.
  • We were able to provision alternative compute capacity in a different AWS region to resume queue processing, although connectivity to our OLAP database from that region failed.

What Didn’t Go Well

  • Limited access to our status page delayed proactive communication
  • MongoDB change-stream suspensions did not generate alerts
  • Certain infrastructure changes required manual steps
  • Certain system components were not reachable from different regions
  • Lack of backup network connectivity to our OLAP database prevented us from restoring data processing in a timely manner.
  • Lack of prioritization of critical workloads resulted in delayed data processing for non-critical workloads.

Preventative Improvements

To reduce the likelihood and impact of similar events, we are implementing:

  1. Multi-Region Readiness
  • Improve IaC coverage for more efficient deployment of core components in alternate regions
  • Pre-configure failover connections to data stores
  2. Improved Notification Pathways
  • Additional communication channels beyond the status page and Intercom
  • Dashboard banner deployments hosted independently
  3. Trigger & Queue Visibility
  • Alerts when MongoDB suspends change-stream triggers
  • More granular queue health monitoring (a monitoring sketch follows this list)
  4. Data Resilience Enhancements
  • Improved backlog handling for stale or delayed queue items
  • Backup strategies for in-memory queue data
  • Separate critical and non-critical data processing to reduce the impact of issues on queue processing
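
Below is a sketch of the more granular queue health monitoring mentioned above (SQS is used for illustration; the thresholds and the alerting call are hypothetical). The idea is to alert when either the backlog size or the age of the oldest message crosses a limit:

```typescript
import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: "us-east-1" });

// Hypothetical thresholds; tune per queue.
const MAX_BACKLOG = 10_000;
const MAX_OLDEST_AGE_SECONDS = 15 * 60;

export async function checkQueueHealth(queueUrl: string): Promise<void> {
  const { Attributes } = await sqs.send(
    new GetQueueAttributesCommand({
      QueueUrl: queueUrl,
      AttributeNames: ["ApproximateNumberOfMessages", "ApproximateAgeOfOldestMessage"],
    })
  );

  const backlog = Number(Attributes?.ApproximateNumberOfMessages ?? 0);
  const oldestAgeSec = Number(Attributes?.ApproximateAgeOfOldestMessage ?? 0);

  if (backlog > MAX_BACKLOG || oldestAgeSec > MAX_OLDEST_AGE_SECONDS) {
    // Replace with a real pager/alerting integration.
    console.warn(`Queue unhealthy: ${backlog} messages, oldest ${oldestAgeSec}s old`, queueUrl);
  }
}
```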


Trademarks and logos mentioned in this text belong to their respective owners.
