Cloud Cooling Catastrophe
A thermal event in a single Amazon Web Services (AWS) data center in Northern Virginia triggered service disruptions across multiple platforms. AWS attributed the incident to a cooling system failure that drove up temperatures, requiring urgent intervention to prevent further escalation. The localized environmental problem had outsized consequences: cryptocurrency exchange Coinbase, among others, suffered substantial operational difficulties as a direct result of the instability in the affected zone. The episode starkly exposed how much can hinge on the operational integrity of a single data center.
Coinbase's Account
Coinbase provided a detailed explanation of the service disruptions it encountered, tracing the errors to failures within Amazon's US-EAST-1 Region, specifically Availability Zone use1-az4. Although Coinbase's systems are engineered to withstand single-zone outages and recover quickly, this incident impacted multiple AWS zones, leading to an extended outage of core trading services: users experienced prolonged downtime while AWS worked to restore temperature controls and bring essential Amazon managed services back online. The exchange confirmed the primary issue was fully resolved and said it would conduct a comprehensive analysis once AWS publishes its official retrospective.
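One detail worth noting: use1-az4 is an Availability Zone ID, not an AZ name. AWS shuffles the name-to-ID mapping independently for each account, so "us-east-1a" in one account may be a different physical zone than in another. The sketch below, assuming boto3 and us-east-1 credentials (the helper name is ours), shows how an operator could check which AZ name the affected zone ID carries in their own account:

```python
import boto3

def zone_name_for_id(zone_id: str, region: str = "us-east-1") -> str | None:
    """Map an AZ ID (stable across accounts) to this account's AZ name."""
    ec2 = boto3.client("ec2", region_name=region)
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        if az["ZoneId"] == zone_id:
            return az["ZoneName"]
    return None

if __name__ == "__main__":
    # The affected zone from the incident; the printed name varies by account.
    print(zone_name_for_id("use1-az4"))
```

A mapping like this is how a team would determine whether its own workloads sat in the impacted zone, since dashboards and incident notices cite zone IDs while most deployment tooling is configured with zone names.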
AWS Health Dashboard Timeline
The AWS Health Dashboard chronicled the mitigation effort:

- 5:53 PM PDT, May 7: AWS reported instance impairments in Availability Zone use1-az4 caused by increased temperatures within a data center; affected EC2 instances and EBS volumes lost power.
- 6:47 PM PDT: Traffic had been rerouted for most services, and customers were advised to utilize other Availability Zones.
- Subsequent updates described slow progress on restoring temperatures and bringing impacted racks back online.
- 10:11 PM PDT: Early signs of recovery, with additional cooling capacity brought online to restore affected infrastructure. Despite longer-than-anticipated recovery times, AWS continued to provide updates and emphasized that the issue was being prioritized.
- 1:32 AM PDT, May 8: The final update confirmed mitigation efforts were underway for impaired EC2 instances and degraded EBS volumes, acknowledged that full recovery would take time, and promised further updates.
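Updates like these are also exposed programmatically through the AWS Health API, so affected customers need not watch the dashboard by hand. A minimal sketch, assuming boto3 and an account with a Business or Enterprise support plan (which the Health API requires; its endpoint also lives in us-east-1), polls for open EC2 events in the affected region:

```python
import boto3

# The Health API endpoint is served from us-east-1 regardless of
# which region the events themselves concern.
health = boto3.client("health", region_name="us-east-1")

# Filter for currently open or upcoming EC2 events in us-east-1,
# which is where the use1-az4 cooling incident surfaced.
response = health.describe_events(
    filter={
        "services": ["EC2"],
        "regions": ["us-east-1"],
        "eventStatusCodes": ["open", "upcoming"],
    }
)

for event in response["events"]:
    print(event["arn"], event["eventTypeCode"], event["statusCode"])
```

Wiring such a poll into alerting is one way a service like an exchange could detect zone-level impairments early and begin shifting traffic before customer impact compounds.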