Two Different Opinions on Why Amazon Went Down

You all know amazon.com suffered a two-hour outage on Friday, along with a 4.59% stock drop, lost sales, and damage to its reputation. I found two different opinions that give you some idea of what happened.

GigaOm has a factual analysis that looks at cause and effect.

So what happened? Let’s look at the facts.

  • Traffic to https://www.amazon.com was getting through, so DNS was configured properly to send traffic to Amazon’s data centers. Global server load balancing (GSLB) is the first line of defense when a data center goes off the air. Either GSLB didn’t detect that the main data center was down, or there was no spare to which it could send visitors.
  • When traffic hit the data center, the load balancer wasn’t redirecting it. This is the second line of defense, designed to catch visitors who weren’t sent elsewhere by GSLB.
  • If some of the servers died, the load balancer should have taken them out of rotation. Either it didn’t detect the error, or all the servers were out. This is the third line of defense.
  • Most companies have an “apology page” that the load balancer serves when all servers are down. This is the fourth line of defense, and it didn’t work either.
  • The HTTP 1.1 message users saw shows something that “speaks” HTTP was on the other end. So this probably wasn’t a router or firewall.

This sort of thing is usually caused by a misconfigured HTTP service on the load balancer. But that kind of change typically happens late at night, gets detected, and gets rolled back. It could also result from a content delivery network (CDN) failing to retrieve the home page properly.
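
To make the second, third, and fourth lines of defense concrete, here is a minimal Python sketch of what a front-end load balancer does: probe each backend, pull dead servers out of rotation, and fall back to a static apology page when nothing is left. This is an illustration only, not how Amazon’s infrastructure actually works; the backend addresses and the /healthz path are made-up placeholders.

```python
import http.client
import http.server
import urllib.request

# Hypothetical backend pool; a real deployment would use dedicated load
# balancing gear or software (an AFE, HAProxy, nginx, etc.), not Python.
BACKENDS = ["10.0.0.11:8080", "10.0.0.12:8080"]

APOLOGY_PAGE = (b"<html><body><h1>We're sorry</h1>"
                b"<p>The store is temporarily unavailable.</p></body></html>")


def healthy(backend, timeout=2.0):
    """Third line of defense: probe a server so dead ones leave rotation."""
    host, port = backend.split(":")
    try:
        conn = http.client.HTTPConnection(host, int(port), timeout=timeout)
        conn.request("GET", "/healthz")   # made-up health-check path
        return conn.getresponse().status == 200
    except OSError:
        return False


class FrontEnd(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        live = [b for b in BACKENDS if healthy(b)]
        if not live:
            # Fourth line of defense: a static apology page once every
            # backend has been taken out of rotation.
            body = APOLOGY_PAGE
            self.send_response(503)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
            return
        # Naive proxying to the first healthy backend; round-robin,
        # session persistence, and error handling are omitted.
        upstream = urllib.request.urlopen("http://%s%s" % (live[0], self.path))
        body = upstream.read()
        self.send_response(upstream.status)
        self.send_header("Content-Type",
                         upstream.headers.get("Content-Type", "text/html"))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Listening on 8080 here; a real front end would sit on 80/443.
    http.server.HTTPServer(("", 8080), FrontEnd).serve_forever()
```

The failure modes map directly onto the list above: a misconfigured health check pulls healthy servers (or leaves dead ones in), and if the apology page isn’t wired up, visitors see a raw HTTP error message like the one people saw on Friday.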

So my money’s on an AFE (application front end) or CDN problem. But as Berman notes, Amazon’s store is a complex application and much of their infrastructure doesn’t follow “normal” data center design. So only time (and hopefully Amazon) will tell.

Site operators can learn from this: Look into GSLB, and make sure you have geographically distributed data centers (possibly through AWS Availability Zones). It’s another sign we can’t take operations for granted, even in the cloud.
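
To picture that first line of defense, here is a tiny Python sketch of the decision a GSLB-style DNS service makes: probe each data center and only hand out addresses for the sites that answer. The data-center names and IPs below are placeholders, and real GSLB products also weigh geography, load, and persistence, not just raw reachability.

```python
import socket

# Placeholder data-center VIPs; in a real GSLB setup these would be the
# public addresses of geographically separate sites.
DATA_CENTERS = {
    "us-east": "192.0.2.10",
    "us-west": "192.0.2.20",
}


def reachable(ip, port=80, timeout=2.0):
    """Crude health probe: can we open a TCP connection to the site?"""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def gslb_answer():
    """Return the addresses a GSLB-style DNS server should hand out now.

    If one data center goes dark, its address simply drops out of the DNS
    answer and visitors land on a site that is still up.
    """
    live = [ip for ip in DATA_CENTERS.values() if reachable(ip)]
    # Fail open: if every probe fails, it is more likely the prober is
    # broken than every site, so keep answering rather than go silent.
    return live or list(DATA_CENTERS.values())


if __name__ == "__main__":
    print("A records to serve:", gslb_answer())
```

None of this helps, of course, if all of your traffic terminates in a single data center, which is why the geographic distribution is the real takeaway.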

And a WSJ blogger has a more entertaining version.

“The Amazon retail site was down for approximately 2 hours earlier today (beginning around 10:25) - and we’re bringing the site back up.

Amazon’s systems are very complex and on rare occasions, despite our best efforts, they may experience problems. We work to minimize any disruption and to get the site back as quickly as possible.

Amazon’s web services were not affected nor were our international sites.”

The statement doesn’t explain what went wrong, however. Here are some possibilities the Business Technology Blog has come up with and — in honor of tomorrow’s Belmont Stakes — our carefully-calculated odds that it’s what caused the problem.

* An explosion, fire or some other mishap at one of Amazon’s data centers: 5 to 1
* A faulty software upgrade: 7 to 1
* A so-called denial-of-service attack that tries to overwhelm the site with traffic: 10 to 1
* The site was broken into by the same guys who “RoXed Comcast”: 100 to 1
* A rush on Amazon’s Kindle e-book reader: 1,000 to 1
* Sharks with laser beams on their heads: 1,000,000 to 1

Most people will focus on what needs to happen to prevent this from happening again. Or you can accept that no matter what you do, outages will happen; they are part of running an online business. The challenge, then, is how you handle the customer relations side of an outage. The smart companies address both issues with a good analysis of the long-term effects. This coincidentally aligns with the companies that think about green/sustainability action over the long term. The last thing you want to do in going green is be reactive.