Via Datacenter Knowledge
-----
When a busy cloud computing platform crashes, the impact is felt widely. That’s the case with today’s extended outage for Amazon Web Services,
which is battling latency issues at one of its northern Virginia data
centers. The problems are rippling through to customers, causing
downtime for many sites that rely on Amazon’s cloud to run their web
services.
The sites knocked offline by Amazon’s problems include social media hub Reddit, the HootSuite link-sharing tool, the popular question-and-answer service Quora, and even a Facebook app for Microsoft.
The issues began at about 1 a.m. Pacific time and are continuing as
of 2:30 p.m. Pacific, with Amazon saying it still cannot predict when
services will be fully recovered. By mid-afternoon, Amazon said it had
limited the problems to a single availability zone in the Eastern U.S.,
and was attempting to route around the affected infrastructure. The AWS
status dashboard shows that the services experiencing problems include
Elastic Compute Cloud (EC2), Amazon Relational Database Service and
Amazon Elastic MapReduce, with the problems focused in the US-East-1 region.
Networking Event Triggers Problems
The problems are focused on Elastic Block Store (EBS), which
provides block-level storage volumes for use with Amazon EC2 instances.
Latency problems at EBS were cited by Reddit when the site experienced major downtime in March.
“A networking event early this morning triggered a large amount of
re-mirroring of EBS volumes in US-EAST-1,” Amazon said in a status
update just before 9 a.m. Pacific time. “This re-mirroring created a
shortage of capacity in one of the US-EAST-1 Availability Zones, which
impacted new EBS volume creation as well as the pace with which we could
re-mirror and recover affected EBS volumes. Additionally, one of our
internal control planes for EBS has become inundated such that it’s
difficult to create new EBS volumes and EBS backed instances.
“We are working as quickly as possible to add capacity to that one
Availability Zone to speed up the re-mirroring, and working to restore
the control plane issue,” Amazon continued. “We’re starting to see
progress on these efforts, but are not there yet. We will continue to
provide updates when we have them.”
UPDATE: At 10:30 a.m. Pacific, Amazon said it was making
“significant progress in stabilizing the affected EBS control plane
service,” which was now seeing lower failure rates. “We have also
brought additional capacity online in the affected Availability Zone and
stuck EBS volumes (those that were being remirrored) are beginning to
recover. We cannot yet estimate when these volumes will be completely
recovered, but we will provide an estimate as soon as we have sufficient
data to estimate the recovery.”
UPDATE 2: At 1:48 p.m., Amazon said a single
Availability Zone in the US-EAST-1 region continues to experience
problems launching EBS-backed instances or creating volumes. “All other
Availability Zones are operating normally,” Amazon said. “Customers with
snapshots of their affected volumes can re-launch their volumes and
instances in another zone. We recommend customers do not target a
specific Availability Zone when launching instances. We have updated our
service to avoid placing any instances in the impaired zone for
untargeted requests.”
The outage has even affected a Microsoft initiative, according to a
Facebook post by the company. “For those of you trying to enter our ‘Big
Box of Awesome’ sweepstakes…the entry site is currently down, related
to a broader problem impacting a number of sites across the internet
today,” Microsoft told its Facebook followers. “We’ll let you know when
it’s back up.” Microsoft has its own data center infrastructure, but
some business units use third-party services. The Big Box of Awesome Facebook app is hosted on EC2.
Multi-Region Failover Option
The outage appears to affect many, but not all, customers using the
US-East-1 region. Amazon operates multiple regions, allowing users to
add redundancy to their applications by hosting them in several regions.
In a multi-region setup, when one region experiences performance
problems, customers can shift workloads to an unaffected region.
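The failover idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Amazon's mechanism or any customer's actual setup: the `pick_region` helper, the preference order, and the `impaired` set are all hypothetical, and a real deployment would replace the health check with its own monitoring and would also need data replicated to the standby regions.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in preference order, that passes the health check.

    Returns None when no region is healthy.
    """
    for region in regions:
        if is_healthy(region):
            return region
    return None


# Hypothetical preference order: primary region first, then fallbacks.
PREFERRED_REGIONS = ["us-east-1", "us-west-1", "eu-west-1"]

# Hypothetical stand-in for real monitoring: regions currently marked impaired.
impaired = {"us-east-1"}

active = pick_region(PREFERRED_REGIONS, lambda r: r not in impaired)
print(active)
```

The ordered list keeps the primary region in use whenever it is healthy and falls back only when the check fails, which is the behavior the multi-region approach relies on.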
Whenever Amazon Web Services experiences outages and performance
problems, it typically highlights the multi-region option, which allows
customers to avoid having their cloud assets constitute a “single point of
failure.” Today’s outage is likely to prompt some customers that rely
on Amazon to consider adding regions to their deployment, along with
other strategies to work around EC2 outages.
The outage is also likely to prompt discussion of the reliability of
cloud computing. Is it a fair question to raise? Today’s outage has
affected many customers, highlighting the vulnerability of a single
service hosting many popular sites.
This has also been true of earlier outages at dedicated hosting providers like The Planet or data center hubs like Fisher Plaza.
Companies relying upon those facilities could avoid outages by adding
backup installations at other data centers – which is essentially the
same principle as adding zones at Amazon.
Stuff happens. We write about outages all the time. But real-world
downtime is particularly problematic in the context of claims that the
cloud “never goes down.” Cloud infrastructure can also fail. The
difference is that cloud deployments offer new options for managing
redundancy and routing around failures when they happen.