Via Datacenter Knowledge
 
----- 
 
When a busy cloud computing platform crashes, the impact is felt widely. That’s the case with today’s extended outage at Amazon Web Services, which is battling latency issues at one of its northern Virginia data centers. The problems are rippling through to customers, causing downtime for many sites that run on Amazon’s cloud.
 
 
The sites knocked offline by Amazon’s problems include social media hub Reddit, the HootSuite link-sharing tool, the popular question-and-answer service Quora, and even a Facebook app for Microsoft.
 
The issues began at about 1 a.m. Pacific time and are continuing as 
of 2:30 p.m. Pacific, with Amazon saying it still cannot predict when 
services will be fully recovered. By mid-afternoon, Amazon said it had 
limited the problems to a single availability zone in the Eastern U.S., 
and was attempting to route around the affected infrastructure. The AWS 
status dashboard shows that the services experiencing problems include 
Elastic Compute Cloud (EC2), Amazon Relational Database Service and 
Amazon Elastic MapReduce, and that the problems are concentrated in the US-East-1 region.
 
Networking Event Triggers Problems
 
The problems center on Amazon’s Elastic Block Store (EBS), which 
provides block-level storage volumes for use with Amazon EC2 instances. 
Latency problems at EBS were cited by Reddit when the site experienced major downtime in March.
 
“A networking event early this morning triggered a large amount of 
re-mirroring of EBS volumes in US-EAST-1,” Amazon said in a status 
update just before 9 a.m. Pacific time. “This re-mirroring created a 
shortage of capacity in one of the US-EAST-1 Availability Zones, which 
impacted new EBS volume creation as well as the pace with which we could
 re-mirror and recover affected EBS volumes. Additionally, one of our 
internal control planes for EBS has become inundated such that it’s 
difficult to create new EBS volumes and EBS backed instances.
 
“We are working as quickly as possible to add capacity to that one 
Availability Zone to speed up the re-mirroring, and working to restore 
the control plane issue,” Amazon continued. “We’re starting to see 
progress on these efforts, but are not there yet. We will continue to 
provide updates when we have them.”
 
UPDATE: At 10:30 a.m. Pacific, Amazon said it was making 
“significant progress in stabilizing the affected EBS control plane 
service,” which was now seeing lower failure rates. “We have also 
brought additional capacity online in the affected Availability Zone and
 stuck EBS volumes (those that were being remirrored) are beginning to 
recover. We cannot yet estimate when these volumes will be completely 
recovered, but we will provide an estimate as soon as we have sufficient
 data to estimate the recovery.”
 
UPDATE 2: At 1:48 p.m., Amazon said a single 
Availability Zone in the US-EAST-1 region continues to experience 
problems launching EBS-backed instances or creating volumes. “All other 
Availability Zones are operating normally,” Amazon said. “Customers with
 snapshots of their affected volumes can re-launch their volumes and 
instances in another zone. We recommend customers do not target a 
specific Availability Zone when launching instances. We have updated our
 service to avoid placing any instances in the impaired zone for 
untargeted requests.”
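
Amazon’s advice to re-launch from snapshots in another zone can be sketched in a few lines of code. This is only an illustration under stated assumptions, not Amazon’s tooling: the `Volume` record, the `plan_relaunch` helper and the zone names are all hypothetical, and a real recovery would go on to call the EC2 API to create volumes from the snapshots.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Volume:
    # Hypothetical record describing one EBS volume.
    volume_id: str
    zone: str
    snapshot_id: Optional[str]  # None when no snapshot was taken


def plan_relaunch(volumes, impaired_zone, healthy_zones):
    """Plan where to restore volumes that sit in the impaired zone.

    Returns (plan, stranded): plan maps volume_id to a
    (snapshot_id, target_zone) pair; stranded lists the volumes in the
    impaired zone that have no snapshot to restore from.
    """
    if not healthy_zones:
        raise ValueError("no healthy zone available")
    plan = {}
    stranded = []
    assigned = 0
    for vol in volumes:
        if vol.zone != impaired_zone:
            continue  # volumes in healthy zones need no action
        if vol.snapshot_id is None:
            stranded.append(vol.volume_id)
            continue
        # Spread restored volumes round-robin across the healthy zones.
        target = healthy_zones[assigned % len(healthy_zones)]
        assigned += 1
        plan[vol.volume_id] = (vol.snapshot_id, target)
    return plan, stranded
```

The sketch also shows the limit of the workaround: a volume in the impaired zone with no snapshot lands in the stranded list and has to wait for Amazon to recover it.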
 
The outage has even affected a Microsoft initiative, according to a 
Facebook post by the company. “For those of you trying to enter our ‘Big
 Box of Awesome’ sweepstakes…the entry site is currently down, related 
to a broader problem impacting a number of sites across the internet 
today,” Microsoft told its Facebook followers. “We’ll let you know when 
it’s back up.” Microsoft has its own data center infrastructure, but 
some business units use third-party services. The Big Box of Awesome Facebook app is hosted on EC2.
 
Multi-Region Failover Option
 
The outage appears to affect many, but not all, customers using the 
US-East-1 region. Amazon operates multiple regions, allowing users to 
add redundancy to their applications by hosting them in several regions.
 In a multi-region setup, when one region experiences performance 
problems, customers can shift workloads to an unaffected region.
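
The routing decision behind a multi-region setup can be sketched as follows. This is a simplified illustration rather than any particular customer’s design: the `choose_region` helper and the health map are hypothetical, the region names are used only as examples, and a real deployment would also need its data replicated to the standby region and a way to redirect traffic, such as a DNS change.

```python
def choose_region(preferred, healthy):
    """Return the first region, in preference order, that is healthy.

    preferred: region names ordered from most to least preferred.
    healthy:   dict mapping region name to True/False from health checks;
               a region missing from the dict is treated as unhealthy.
    Returns None when no region is healthy.
    """
    for region in preferred:
        if healthy.get(region, False):
            return region
    return None


# Example: the primary region is failing its health checks,
# so the workload shifts to the standby region.
preference = ["us-east-1", "us-west-1"]
status = {"us-east-1": False, "us-west-1": True}
active = choose_region(preference, status)
```

Because the preference list stays fixed, the same call sends the workload back to the primary region once its health checks pass again.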
 
Whenever Amazon Web Services experiences outages and performance 
problems, it typically highlights the multi-region option, which allows 
customers to avoid having their cloud assets constitute a “single point of 
failure.” Today’s outage is likely to prompt some customers that rely 
on Amazon to consider adding regions to their deployments, along with 
other strategies to work around EC2 outages.
 
The outage is also likely to prompt discussion of the reliability of 
cloud computing. Is it a fair question to raise? Today’s outage has 
affected many customers, highlighting the vulnerability of a single 
service hosting many popular sites.
 
The same has been true of earlier outages at dedicated hosting providers like The Planet or data center hubs like Fisher Plaza. 
Companies relying upon those facilities could avoid outages by adding 
backup installations at other data centers, which is essentially the 
same principle as adding zones at Amazon.
 
Stuff happens. We write about outages all the time. But real-world 
downtime is particularly problematic in the context of claims that the 
cloud “never goes down.” Cloud infrastructure can also fail. The 
difference is that cloud deployments offer new options for managing 
redundancy and routing around failures when they happen.