Amazon Outage Disrupts Millions Due to Single Point of Failure

Amazon Web Services (AWS) recently faced a significant outage that disrupted millions globally. This failure stemmed from a single point of failure within the AWS network, as outlined in a post-mortem report from Amazon engineers.

Root Cause of the Outage

The outage was traced to a software bug in DynamoDB’s DNS management system, which manages DNS configurations and monitors load balancer stability. A race condition in the DNS Enactor, one component of that system, caused unexpected behavior that led to widespread service disruptions.

Cascading Failures Across Services

The trouble began when the DNS Enactor experienced high delays while attempting to update DNS endpoints. At the same time, another DynamoDB component, the DNS Planner, kept generating new plans. The timing mismatch between the two components triggered a cascading failure that eventually affected the entire DynamoDB system.
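The kind of timing problem described above can be illustrated with a small, deterministic sketch. This is not Amazon's code, and the report summarized here does not spell out the exact sequence of events. The example assumes a simple numbering scheme in which newer plans carry a higher generation number, and it assumes that a delayed worker can write its plan after a newer one is already live. The names Plan, DnsRecord, apply_unchecked and simulate_race are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    generation: int      # assumed scheme: a higher number means a newer plan
    endpoints: list[str]


@dataclass
class DnsRecord:
    generation: int = 0
    endpoints: list[str] = field(default_factory=list)


def apply_unchecked(record: DnsRecord, plan: Plan) -> None:
    """Overwrite the live record without checking whether it is already newer."""
    record.generation = plan.generation
    record.endpoints = list(plan.endpoints)


def simulate_race() -> DnsRecord:
    """Replay one unlucky interleaving step by step (no real threads needed)."""
    record = DnsRecord()
    old_plan = Plan(generation=1, endpoints=["10.0.0.1"])
    new_plan = Plan(generation=2, endpoints=["10.0.0.2", "10.0.0.3"])

    # Step 1: a slow worker picks up old_plan but stalls before writing it.
    delayed_plan = old_plan
    # Step 2: meanwhile a newer plan has been produced, and it goes live.
    apply_unchecked(record, new_plan)
    # Step 3: the slow worker finally writes its stale plan over the newer one.
    apply_unchecked(record, delayed_plan)
    return record


if __name__ == "__main__":
    print(simulate_race())
```

In this toy interleaving the live record ends up holding the older plan even though a newer one had already been applied, which is the general shape of a stale-write race.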

Impact on AWS Services

  • Systems relying on DynamoDB faced connection errors.
  • Amazon’s EC2 services in the US-EAST-1 region were severely strained.
  • New EC2 instances could be launched but lacked necessary network connectivity due to processing delays.

These issues rippled outward to other AWS services and operations that depend on network connectivity, including:

  • Creation and modification of Redshift clusters.
  • Lambda invocations.
  • Fargate task launches, including Managed Workflows for Apache Airflow.
  • Operations related to Outposts.
  • Services offered by the AWS Support Center.

Measures Taken by Amazon

In response, Amazon has temporarily disabled the DNS Planner and DNS Enactor automation worldwide. The company is working to fix the race condition and to add safeguards that prevent incorrect DNS plans from taking effect. Engineers are also updating the EC2 Network Load Balancer to improve its stability.
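The article does not say what form these safeguards will take. One common pattern for this class of bug, shown here purely as an illustration, is to check a plan's version before applying it and to reject plans that would leave a record empty. The generation field, the dictionary layout, and the names StalePlanError and apply_if_newer are assumptions made for this sketch, not details from Amazon's report.

```python
class StalePlanError(Exception):
    """Raised when a plan is not newer than what is already live."""


def apply_if_newer(live: dict, plan: dict) -> dict:
    """Return the new live state, or raise if the plan should not be applied.

    Both arguments are dicts with an integer "generation" and a list of
    "endpoints" (a hypothetical layout chosen for this example).
    """
    # Guard 1: never replace a working record with an empty one.
    if not plan["endpoints"]:
        raise ValueError("refusing to apply a plan with no endpoints")
    # Guard 2: a delayed writer must not overwrite a newer plan.
    if plan["generation"] <= live["generation"]:
        raise StalePlanError(
            f"plan {plan['generation']} is not newer than live {live['generation']}"
        )
    return {"generation": plan["generation"], "endpoints": list(plan["endpoints"])}


if __name__ == "__main__":
    live = {"generation": 0, "endpoints": []}
    live = apply_if_newer(live, {"generation": 2, "endpoints": ["10.0.0.2"]})
    try:
        apply_if_newer(live, {"generation": 1, "endpoints": ["10.0.0.1"]})
    except StalePlanError as err:
        print("rejected:", err)
```

In a real distributed system the check and the write would need to happen as one atomic step, for example through a conditional write in the backing store, since a separate check followed by a write can itself race.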

This incident highlights the vulnerabilities within complex cloud systems. It also presents a challenge for Amazon as it strives to maintain its leadership position in the cloud services industry amid increasing competition.