Single DNS Flaw Cripples AWS Infrastructure
Amazon Web Services (AWS) suffered a major disruption caused by a flaw in the automated DNS management system of its DynamoDB service. The incident began overnight and grew into a roughly day-long outage that took many prominent websites and services offline. By some estimates, the financial fallout could reach hundreds of billions of dollars.
Incident Timeline
The disruption began at 11:48 PM PDT on October 19, 2025 (6:48 AM UTC on October 20). Customers quickly noticed elevated error rates from the DynamoDB API, particularly in the US-EAST-1 Region in Northern Virginia.
Root Cause of the Outage
A detailed postmortem traced the issue to a race condition in DynamoDB's automated DNS management system, which left the service's regional endpoint with an empty DNS record.
- The DNS management system consists of two components:
  - DNS Planner: monitors load balancer health and generates DNS plans.
  - DNS Enactor: applies those plans through Amazon Route 53.
- The race condition was triggered when one DNS Enactor experienced unusually high delays while applying an older plan.
- Meanwhile, the DNS Planner kept generating new plans, which a second DNS Enactor applied. When the delayed Enactor finally wrote its stale plan over the newer one, the second Enactor's cleanup process, which deletes plans many generations old, removed the stale plan and with it every IP address for the endpoint, leaving the record empty. A minimal sketch of this check-then-act race appears after this list.
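AWS has not published its implementation, so the following Python sketch is a compressed, hypothetical model of the race: two enactor threads apply numbered plans to a shared record, and a cleanup step deletes any plan older than the newest one applied. The ordering is simplified relative to AWS's account, but the essential hazard, a non-atomic check-then-act against state another worker is cleaning up, is the same; all names here are invented.

```python
import threading
import time

dns_record = {}             # endpoint -> list of IPs; {} models the "empty record"
plans = {1: ["10.0.0.1"],   # older plan, picked up by the slow enactor
         2: ["10.0.0.2"]}   # newer plan, picked up by the fast enactor
applied_plan_id = 0
lock = threading.Lock()

def enactor(name, plan_id, delay):
    """Apply one plan after `delay` seconds, then clean up older plans."""
    global applied_plan_id
    time.sleep(delay)
    # BUG (the race): this read-and-write is not atomic with the other
    # enactor's cleanup, so a deleted (stale) plan can still be "applied".
    dns_record["dynamodb.us-east-1"] = plans.get(plan_id, [])
    with lock:
        applied_plan_id = max(applied_plan_id, plan_id)
        for pid in [p for p in plans if p < applied_plan_id]:
            del plans[pid]  # cleanup: drop every plan older than the newest applied

fast = threading.Thread(target=enactor, args=("enactor-B", 2, 0.0))
slow = threading.Thread(target=enactor, args=("enactor-A", 1, 0.2))
fast.start(); slow.start()
fast.join(); slow.join()
print("final record:", dns_record)  # plan 1 was deleted before the slow apply -> []
```

Run it and the fast enactor applies plan 2 and deletes plan 1; 200 ms later the slow enactor "applies" the now-deleted plan 1 and writes an empty record, the same end state the postmortem describes.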
Consequences of the DNS Flaw
The empty record made the regional DynamoDB endpoint unresolvable, producing widespread DNS failures that in turn disrupted:
- EC2 instance launches
- Network configuration processes
The Droplet Workflow Manager (DWFM), which maintains leases on the physical servers ("droplets") that host EC2 instances, was hit particularly hard. Because DWFM's state checks depend on DynamoDB, the DNS failures caused those checks to fail, and droplets could not establish new leases.
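The failure mode generalizes to any lease-based design: if lease renewal requires a round trip to a state store, losing the store means leases silently age out and healthy hosts stop accepting work. A toy sketch, with invented names (LEASE_TTL, StateStoreDown) standing in for DWFM's internals:

```python
import time

LEASE_TTL = 0.5  # seconds; an illustrative value, not AWS's

class StateStoreDown(Exception):
    """Raised when the backing state store (DynamoDB's role here) is unreachable."""

def renew_lease(host, leases, store_available):
    """Renew a droplet's lease only if its state can be verified in the store."""
    if not store_available:
        raise StateStoreDown(f"cannot verify state for {host}")
    leases[host] = time.time() + LEASE_TTL

leases = {"droplet-1": time.time() + LEASE_TTL}
try:
    renew_lease("droplet-1", leases, store_available=False)  # DNS outage: store unreachable
except StateStoreDown:
    pass  # renewal fails; the existing lease keeps aging toward expiry

time.sleep(LEASE_TTL + 0.1)
print("lease expired:", time.time() > leases["droplet-1"])  # True -> no new launches here
```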
Recovery Efforts
DynamoDB began recovering at 2:25 AM PDT (9:25 AM UTC). But the sheer volume of leases DWFM then had to re-establish caused work to time out and be retried, producing a "congestive collapse" that required manual intervention and was not resolved until 5:28 AM PDT (12:28 PM UTC).
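AWS has not detailed the manual remediation, but the standard defense against this retry-storm pattern is exponential backoff with jitter, which spreads re-establishment attempts out instead of letting every client hammer the recovering service at once. A generic sketch of the "full jitter" variant described on the AWS Architecture Blog:

```python
import random

def backoff_delays(base=0.1, cap=30.0, attempts=6):
    """Yield full-jitter retry delays: uniform over [0, min(cap, base * 2**attempt)]."""
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

for i, delay in enumerate(backoff_delays()):
    print(f"attempt {i}: wait {delay:.2f}s before retrying lease establishment")
```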
Impact on AWS Services
Delays in propagating network state to newly launched instances further complicated recovery. The Network Load Balancer (NLB) marked new EC2 instances unhealthy when their health checks failed before network configuration had propagated. As a result, several AWS services, including:
- AWS Lambda
- Elastic Container Service (ECS)
- Elastic Kubernetes Service (EKS)
- Fargate
were adversely affected, producing a ripple effect across dependent platforms.
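The health-check dynamic behind that list is easy to model: a load balancer removes a target after a run of consecutive failed probes, so new instances whose network state has not yet propagated never enter rotation, shrinking capacity exactly when services are trying to relaunch. A hypothetical sketch (the threshold and probe histories are invented, not NLB's actual configuration):

```python
UNHEALTHY_THRESHOLD = 3  # consecutive failures before removal; illustrative only

def in_rotation(targets, probe_results):
    """Return the targets still receiving traffic after applying probe history."""
    healthy = []
    for target in targets:
        failures = 0
        for ok in probe_results[target]:
            failures = 0 if ok else failures + 1  # track trailing consecutive failures
        if failures < UNHEALTHY_THRESHOLD:
            healthy.append(target)
    return healthy

probes = {
    "i-old": [True, True, True],     # established instance, passing checks
    "i-new": [False, False, False],  # new instance, failing while config propagates
}
print(in_rotation(["i-old", "i-new"], probes))  # ['i-old'] -> effective capacity shrinks
```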
Future Safeguards
In response to the outage, AWS has disabled the automated DNS Planner and DNS Enactor worldwide while it develops safeguards against a recurrence. In its apology, Amazon committed to finding ways to reduce the impact of such events and to expedite recovery.
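One safeguard consistent with the root cause, offered here as an assumption rather than AWS's published design, is to make every plan application a freshness-checked compare-and-set, so an enactor can never overwrite a newer plan with an older one:

```python
def apply_if_newer(record, endpoint, current_plan_id, plan_id, ips):
    """Hypothetical guard: apply a DNS plan only if it is strictly newer."""
    if plan_id <= current_plan_id:
        return current_plan_id, False  # stale plan rejected; record untouched
    record[endpoint] = ips
    return plan_id, True

record = {"dynamodb.us-east-1": ["10.0.0.2"]}  # plan 2 already applied
current, applied = apply_if_newer(record, "dynamodb.us-east-1", 2, 1, ["10.0.0.1"])
print(applied, record)  # False -> the delayed enactor's stale write is a no-op
```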
This incident is a stark reminder of the complexity of managing large-scale cloud infrastructure and of the latent vulnerabilities within even the most mature systems.