Downtime in the new Web

July 20

Web service developers today have a relatively easy job to do if they so choose. Don’t want to spend a lot of time doing statistics for your site? Use Google Analytics. Need event listings or reviews? Pull them from an API. Storage space or bandwidth concerns? There’s Amazon Web Services for that.

One of the key tenets of today’s connected, shared Web is to let best-of-breed service providers handle the complexities of your Web environment. How well does that philosophy hold up when one of those specialized providers suffers a service interruption?

Services around the Web felt the pinch of outsourcing this afternoon because of an internal communication problem in Amazon’s S3 and SQS services. For some, the effects of the outage were purely cosmetic; Twitter failed to load user avatars for a period starting around noon Eastern time. For others, such as Basecamp and other 37signals products, the downtime meant that features were temporarily disabled.

The hardest hit this afternoon, though, were services whose entire system revolved around Amazon’s service. Users of SmugMug were greeted with a picture of the service logo watering a garden of servers and a brief message explaining the current situation. The company’s reaction? “We’re not happy about it, of course…”

Amazon does provide a service level agreement (SLA) for its AWS suite; monthly uptime below 99.9% results in a credit of either 10% or 25% of that month’s service cost. (A single 6-hour outage in a 31-day month would result in a monthly uptime of 99.2%.) With so many services depending on external providers for the entirety of their business, is a simple SLA enough?
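The arithmetic behind that 99.2% figure can be checked with a short sketch. The function names here are illustrative, not part of any Amazon tooling, and the calculation assumes uptime is measured simply as hours available divided by total hours in the month, which may not match how the SLA itself defines it.

```python
def monthly_uptime_percent(days_in_month: int, outage_hours: float) -> float:
    """Percentage of the month the service was up (simple hours-based measure)."""
    total_hours = days_in_month * 24
    return 100.0 * (total_hours - outage_hours) / total_hours


def allowed_downtime_minutes(days_in_month: int, sla_percent: float) -> float:
    """Downtime budget, in minutes, before uptime drops below sla_percent."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


uptime = monthly_uptime_percent(31, 6)
print(f"Uptime with a 6-hour outage: {uptime:.1f}%")   # 99.2%
print(f"Below the 99.9% threshold: {uptime < 99.9}")   # True
print(f"Downtime budget at 99.9%: {allowed_downtime_minutes(31, 99.9):.1f} minutes")
```

By this measure, a 99.9% commitment leaves a budget of roughly 45 minutes in a 31-day month, so a 6-hour outage exceeds it several times over.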

As SmugMug CEO Don MacAskill wrote this afternoon on the SmugMug Status Updates blog:

Since problems in this industry are inevitable, and Amazon’s performance over the last two years has been so exceptional, we’ve been afraid [of] an outage like this. I’m sure there will be more over the next few years, too.

MacAskill’s point may sound pessimistic, but it’s historically valid. Using best-of-breed services and APIs gives a service a host of tangible benefits, but every external provider it depends on is another point of failure for a connected Web business.
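A rough way to see how points of failure add up is to multiply the availability of each provider a service cannot run without. The sketch below is a back-of-the-envelope model, not a description of any real deployment: it assumes three hypothetical providers that each deliver exactly 99.9% uptime, that their outages are independent, and that the service is down whenever any one of them is down.

```python
import math


def combined_availability(availabilities):
    """Availability of a service that needs every dependency up at once.

    Assumes outages are independent, so the fractions simply multiply.
    """
    return math.prod(availabilities)


# Hypothetical figures: three external providers at 99.9% each.
dependencies = [0.999, 0.999, 0.999]
combined = combined_availability(dependencies)

hours_in_month = 31 * 24
expected_downtime_hours = (1 - combined) * hours_in_month

print(f"Combined availability: {combined:.4%}")
print(f"Expected downtime in a 31-day month: {expected_downtime_hours:.1f} hours")
```

Under these assumptions, three providers that each meet 99.9% still leave the combined service at about 99.7%, or a little over two hours of expected downtime in a 31-day month.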

(In 2006, MacAskill wrote a post detailing why S3 is a good choice for his business.)