Thinking about green in the data center usually has people focusing on their PUE and the energy efficiency of their servers, but few are implementing transactions-per-watt dashboards. It is rare to find someone who discusses the energy efficiency of their data center's transactions: the MPG of their data center.
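To make the MPG idea concrete, here is a minimal sketch of how a transactions-per-watt figure could be computed. The function name, its parameters, and the sample numbers are all assumptions made up for illustration; they are not measurements from PayPal or any real data center.

```python
def transactions_per_watt(transactions: int,
                          interval_seconds: float,
                          avg_power_watts: float) -> float:
    """Throughput (transactions per second) divided by average power draw (watts).

    Hypothetical helper for illustration; a real dashboard would pull both
    inputs from application metrics and power metering.
    """
    if interval_seconds <= 0 or avg_power_watts <= 0:
        raise ValueError("interval and power must be positive")
    transactions_per_second = transactions / interval_seconds
    return transactions_per_second / avg_power_watts


# Made-up sample: 1.8 million transactions in one hour at an average 250 kW.
sample = transactions_per_watt(1_800_000, 3600, 250_000)

# Same hour, same power draw, but nothing completes (an outage).
during_outage = transactions_per_watt(0, 3600, 250_000)

print(f"normal: {sample:.4f} transactions/sec per watt")
print(f"outage: {during_outage:.4f} transactions/sec per watt")
```

The second call is the point of the metric: during an outage the servers still draw power, so the data center's MPG drops to zero even though PUE looks unchanged.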
PayPal's CTO discusses their Oct 29 outage.
At around 08:07 am PT today, a network hardware failure in one of our data centers resulted in a service interruption for all PayPal users worldwide. Everyone in our organization was immediately engaged to identify the issue and get PayPal back up and running. We were not able to switch over to our back up systems as quickly as planned. We partially restored service by approximately 8:45 am PT and the issue was fully resolved by 9:24 am PT. A second service interruption started at around 11:30 am PT and was partially resolved at 11:55am with full recovery at 12:21pm.
When I read this description, it reminds me of the Star Trek scene where Sulu can’t get the Enterprise into warp.
Hikaru Sulu: The fleet has cleared spacedock, Captain. All ships ready for warp.
Christopher Pike: Set a course for Vulcan.
Hikaru Sulu: Aye-Aye, Captain. Course laid in.
Christopher Pike: Maximum warp. Punch it.
Hikaru Sulu: [One by one, the rest of the star fleet jumps into warp drive, leaving the Enterprise behind. Sulu frowns at the console, puzzled]
Christopher Pike: Lieutenant, where is Helmsman McKenna?
Hikaru Sulu: He has lungworms, sir. He couldn't report to his post. I'm Hikaru Sulu.
Christopher Pike: And you are a pilot, right?
Hikaru Sulu: Very much so, sir.
Hikaru Sulu: [he trails off, hitting buttons]
Hikaru Sulu: Uh, I'm not sure what's wrong here.
Christopher Pike: Is the parking brake on?
Hikaru Sulu: Uh, no. I'll figure it out. I'm just...
Spock: Have you disengaged the external inertial dampener?
Hikaru Sulu: [Embarrassed. Without looking at anyone, he punches in the correct sequence] Ready for warp, sir.
Christopher Pike: Let's punch it.
From 8:07a to 8:45a there was no warp drive for PayPal. Dozens of people looking at displays. Why are transactions not completing? We have power. Services are live. Is the parking brake on?
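The times in the CTO's statement can be turned into durations with a little arithmetic. This is only a sketch: the helper name is hypothetical, and the inputs are the approximate times PayPal gave ("around" 08:07, "approximately" 8:45), so the results are approximate too.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day 24-hour HH:MM times (hypothetical helper)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# First interruption: started 08:07, partial service 08:45, resolved 09:24 (PT).
first_partial = minutes_between("08:07", "08:45")
first_full = minutes_between("08:07", "09:24")

# Second interruption: started 11:30, partial service 11:55, recovered 12:21 (PT).
second_partial = minutes_between("11:30", "11:55")
second_full = minutes_between("11:30", "12:21")

print(f"first:  {first_partial} min to partial service, {first_full} min to full")
print(f"second: {second_partial} min to partial service, {second_full} min to full")
```

By these figures, the no-warp-drive window before any service came back was 38 minutes the first time and 25 minutes the second.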
StorefrontBacktalk provides more details.
Two major technology glitches in a row knocked PayPal offline on Friday (Oct. 29), preventing the alternative payment giant from processing any E-tailer transactions for 80 minutes. First a network hardware failure shut down all PayPal payments. Then the backup plan failed when a handoff to a secondary datacenter didn’t go smoothly.
StorefrontBacktalk lays out a timeline: the outage, the switch to the backup data center, the switch back to the primary, a repeat of the outage, then recovery. It then offers this assessment.
Like American Eagle, PayPal had a fallback plan. But it didn’t work the way it was supposed to. And though it had a technical plan (that didn’t work) for dealing with the outage, like Wal-Mart, PayPal didn’t have any plan at all for quickly notifying the people most affected (Wal-Mart’s store personnel, PayPal’s biggest E-Commerce partners).
The lesson about failed backup plans just keeps getting bigger. Yes, improbable failures can happen. When they do, failover plans can fail. And when that happens, you need a plan already in place to warn those affected in real time.
I predict that over the next 5 years we will see an outage at scale that cripples a company permanently. We saw this last year with the T-Mobile Sidekick outage; now imagine it on a bigger scale.
By Laura Northrup on October 11, 2009 4:00 PM
This time last week, we thought of the T-Mobile Sidekick data outage as a mere inconvenient outage, but a temporary one. We grossly misunderstimated how badly T-Mobile and Danger/Microsoft could screw things up.
It turns out that their promise that service would be restored "soon" actually meant "never."
Want to avoid the risk? Invest in better people and processes. Technology is what you use, not the answer to the problem.