On the AWS S3 Outage

  • Posted by Brendan Caulfield, Turing Group COO
  • AWS, Perspective

As most frequent users of the Internet are aware, Amazon Web Services (AWS) experienced an outage of some of its most ubiquitous and highly-used services yesterday. This outage put a spotlight on AWS and showed a great many just how pervasive the platform has become in running such a vast portion of the world’s publicly accessible Internet sites. In fact, some estimates suggest that as much as 20% of the Internet was affected by the multi-hour outage.

One of the affected services, Amazon Simple Storage Service (S3), has become such a steadfast, reliable and highly used offering that many of those who build and deploy AWS solutions may have taken it for granted and, as a result, forgotten some of the foundations of good infrastructure design, like planning for redundancy across regions or availability zones.

S3, as most AWS users know it, is an extremely robust and affordable data storage service that is designed to provide exceptional levels of durability and availability. The service was launched by AWS very early in its campaign (2006) to dominate the underlying infrastructure running much of the Internet, and it has fundamentally changed the way many developers contemplate storing data to make it accessible via the Web or via enterprise applications running on the platform. Even for those entities still managing a good portion of their Web hosting infrastructure in traditional data centers, S3 has provided a welcome new way of making files available across the Internet.

The idea of purchasing hard drives to store application data that is to be publicly accessible is often simply no longer a consideration. Many web developers and development firms rely on S3, as the service has an impeccable track record of availability, eliminates much of the administrative overhead associated with storing files on disks in a local or co-located datacenter, and offers new and innovative ways to access data. Coupled with the ability to store virtually unlimited data on a platform that grows and shrinks without needing to provision disks that may or may not be well-utilized, S3 is a compelling choice for storing Internet-accessible data.

While yesterday’s outage may have thrust AWS into a spotlight they would likely rather not have been in, it also shed light on the fact that firms deploying solutions on top of AWS gain strength from the roots of good design principles. The risk associated with S3 offering such a compelling list of features and having a reputation of being a rock-solid platform for data storage is complacency; over time, many come to simply rely on the fact that Amazon has built such a solid offering. Throughout the day yesterday, we heard grumblings from all corners of the Internet that this outage was so shocking and disappointing because AWS has often touted the durability of this platform.

AWS designed the S3 product to provide 99.999999999% durability – that is a bunch of nines. But durability should not be confused with availability and/or uptime. For sites that simply cannot tolerate downtime, or having certain components of their site unavailable for even a moment, it is critical and very feasible to design solutions that can deal with outages like we saw yesterday. Designing a fault-tolerant and highly available infrastructure takes a certain level of expertise and experience. Having the right partner and/or employees in place can put firms on a path of building a resilient platform for hosting their most valuable web properties and enterprise applications.
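As a rough illustration of what "dealing with an outage" can look like at the application level, here is a minimal Python sketch of reading an object from a list of replicas in order, falling back to the next one when a read fails. The function names and the two stand-in "region" readers are hypothetical and exist only for illustration; a real deployment would plug in actual storage clients and would need the data replicated to the secondary location ahead of time.

```python
from typing import Callable, Sequence


def read_with_failover(key: str, readers: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each reader in order and return the first successful result."""
    errors = []
    for reader in readers:
        try:
            return reader(key)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} replicas failed for {key!r}")


# Hypothetical stand-ins for per-region storage clients.
def primary_region(key: str) -> bytes:
    raise ConnectionError("primary region unavailable")


def secondary_region(key: str) -> bytes:
    return b"contents of " + key.encode()


# The primary fails, so the read is served by the secondary.
data = read_with_failover("logo.png", [primary_region, secondary_region])
```

The ordering of the list encodes the preference: the primary is always tried first, and the fallback only costs anything when the primary is actually down.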

In addition, firms need to consider the financial ramifications of building such an infrastructure. Striking a balance between ongoing infrastructure spend and the cost of downtime requires a deep dive into the real costs of being offline versus staying online. This is not always simple, but it is an exercise that should absolutely take place before jumping headfirst into any solution – AWS or otherwise – as the cost of refactoring an application after an outage like the one we experienced yesterday can far exceed the cost of making these decisions upfront and in an informed manner.
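The trade-off described above can be framed as simple arithmetic. The Python sketch below compares the expected annual cost of downtime against the extra annual spend on redundancy. Every figure in it is a made-up placeholder, not data from this outage or from any client, and a real analysis would also account for softer costs such as reputation and engineering time.

```python
def expected_downtime_cost(hours_down_per_year: float, cost_per_hour: float) -> float:
    """Expected annual cost of being offline."""
    return hours_down_per_year * cost_per_hour


def redundancy_pays_off(hours_avoided_per_year: float,
                        cost_per_hour: float,
                        extra_annual_spend: float) -> bool:
    """True when the downtime avoided is worth more than the added spend."""
    avoided_cost = expected_downtime_cost(hours_avoided_per_year, cost_per_hour)
    return avoided_cost > extra_annual_spend


# Placeholder numbers for illustration only.
high_stakes_site = redundancy_pays_off(
    hours_avoided_per_year=4, cost_per_hour=10_000, extra_annual_spend=25_000)
low_stakes_site = redundancy_pays_off(
    hours_avoided_per_year=4, cost_per_hour=500, extra_annual_spend=25_000)
```

With these placeholder inputs the first site avoids 40,000 in downtime cost for 25,000 of spend, while the second avoids only 2,000, so the same redundancy investment makes sense for one and not the other.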

Turing Group is a certified and audited AWS Managed Services Partner (MSP) and provider of NextGen MSP services; needless to say, we are pretty well bought into the AWS platform and paradigm. As an AWS partner, we take great care to understand the product offering and, where applicable, the limitations of the platform. We are not, however, impervious to these outages and we are regrouping this morning to take our own advice. We are assessing how we performed yesterday, across our client landscape, throughout this interruption, and are doing everything we can to make sure our infrastructure design practices match the needs of all our clients.

We plan to take a look at the data surrounding this outage, in order to truly understand the impact and ramifications. Though yesterday was a rough day, AWS’s long-term track record remains strong evidence of the platform’s reliability. S3 is designed for 99.99% availability, and AWS hasn’t experienced meaningful downtime of this kind in recent memory. Some sporadic hiccups are inevitable, but yesterday’s outage made news because it was so unexpected and so rare for AWS.
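To put an availability figure such as 99.99% in perspective, it helps to convert it into the downtime it permits over a year. The short Python sketch below does that conversion; it assumes a 365-day year and treats availability as a simple fraction of total time, which is a simplification of how any provider actually measures it.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


four_nines = allowed_downtime_minutes(99.99)   # roughly 52.6 minutes per year
three_nines = allowed_downtime_minutes(99.9)   # roughly 525.6 minutes per year
```

By this measure, a single multi-hour outage uses up several years' worth of the downtime implied by a 99.99% figure, which is part of why yesterday drew so much attention.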

AWS has provided a platform for technical acceleration not seen since the days when Bill Gates transitioned Microsoft from Windows 3.1 to Windows 95; they are leading the charge in ushering in a new renaissance in technology and computing and providing a platform that short circuits the time to market for some of the most compelling and innovative new products and services seen in quite some time – possibly ever. Did yesterday suck? Yep. Will there be other outages? Inevitably. Is it a small price to pay for what AWS has provided in such a short amount of time? We think so.