Google Mail/Apps ups SLA, removes scheduled downtime allowance

In data centers, uptime is assumed.  Service Level Agreements (SLAs) are made between groups.  But many SLAs treat planned maintenance/downtime differently from unplanned downtime, excluding the planned portion when availability is calculated.

InformationWeek reports on Google Apps/Gmail's change to this common practice.

Google Promises No Planned Downtime

A new service level agreement (SLA) for Google Apps customers strives to make Google's cloud as reliable as dial tone.

By Thomas Claburn, InformationWeek
January 14, 2011 02:42 PM

Google has changed its service level agreement for paid versions of Google Apps, its suite of online applications. The goal, says Google Enterprise product management director Matt Glotzbach, is to deliver service that's as reliable as telephone dial tone.

For today's mobile generation, who may lack experience with landlines, let it suffice to say that dial tone under Ma Bell was very, very reliable. Not sunrise reliable but chances were if you didn't hear a dial tone when you picked up a handset, the phone was disconnected from the wall.

Google is taking a leadership position.

But with millions of enterprise customers, Google aims to become more reliable. As a sign of its commitment, the company has disavowed planned downtime. "Unlike most providers, we don't plan for our users to be down, even when we're upgrading our services or maintaining our systems," wrote Glotzbach in a blog post. "For that reason, we're removing the SLA clause that allows for scheduled downtime."

Glotzbach says Google is the first major cloud service provider to make that pledge.
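
To see why the scheduled-downtime clause matters, here is a minimal sketch of how the same month can yield two different availability numbers. The downtime minutes, the 30-day month, and the function name are illustrative assumptions, not figures or formulas from Google's SLA.

```python
# Assumption: a 30-day month; real SLAs define their own measurement period.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes


def availability(unplanned_min, planned_min, exempt_planned):
    """Return monthly availability as a percentage.

    If exempt_planned is True, planned maintenance is not counted
    as downtime (the common SLA exception).
    """
    counted = unplanned_min if exempt_planned else unplanned_min + planned_min
    return 100.0 * (1 - counted / MINUTES_PER_MONTH)


# Hypothetical month: 10 minutes of unplanned outage, 120 minutes of maintenance.
with_exemption = availability(10, 120, exempt_planned=True)
without_exemption = availability(10, 120, exempt_planned=False)
print(f"planned downtime exempt:  {with_exemption:.3f}%")
print(f"planned downtime counted: {without_exemption:.3f}%")
```

With the exemption the provider reports roughly 99.98 percent; without it, the same month comes in under 99.7 percent.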

In Google's blog post they call out the competition.

Gmail: 99.984%
In 2010, Gmail was available 99.984 percent of the time, for both business and consumer users. 99.984 percent translates to seven minutes of downtime per month over the last year. That seven-minute average represents the accumulation of small delays of a few seconds, and most people experienced no issues at all. For those few who were disrupted for a longer period of time, we're very sorry, and Google Apps for Business customers received compensation where appropriate. We're particularly pleased with this level of reliability since it was accomplished without any planned downtime while launching 30 new features and adding tens of millions of active users.
Seven minutes of downtime compares very favorably with on-premises email, which is subject to much higher rates of interruption that hurt employee productivity. The latest research from the Radicati Group found that on-premises email averaged 3.8 hours of downtime per month. In comparison to Radicati's metrics for on-premises email, our calculations suggest that Gmail is 32 times more reliable than the average email system, and 46 times more available than Microsoft Exchange®.

Fortunately Microsoft Exchange® customers can still benefit from the reliability of Gmail with Google Message Continuity. Comparable data for Microsoft BPOS® is unavailable, though their service notifications show 113 incidents in 2010: 74 unplanned outages, and 33 days with planned downtime.
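
The headline numbers in the quote are easy to sanity-check. The sketch below converts 99.984 percent availability into minutes of downtime per month and compares it with the 3.8 hours quoted from Radicati. The average-month length (one twelfth of a 365-day year) is my assumption. The "46 times" Exchange figure cannot be reproduced here because the quote does not give the underlying Exchange downtime.

```python
# Assumption: an "average" month is 1/12 of a 365-day year.
MINUTES_PER_MONTH = 365 * 24 * 60 / 12  # 43,800 minutes


def downtime_minutes(availability_pct):
    """Convert an availability percentage into minutes of downtime per month."""
    return (1 - availability_pct / 100) * MINUTES_PER_MONTH


gmail = downtime_minutes(99.984)   # Gmail's reported 2010 availability
on_prem = 3.8 * 60                 # Radicati figure: 3.8 hours per month
ratio = on_prem / gmail

print(f"Gmail downtime:   {gmail:.1f} min/month")
print(f"On-premises:      {on_prem:.0f} min/month")
print(f"Ratio:            {ratio:.1f}x")
```

The result lands at about seven minutes per month and a ratio between 32 and 33, consistent with the "32 times more reliable" claim.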

You may be thinking, "I can't do this in my data center."  And you are right, you can't.  This solution requires geo-redundancy between data centers.  For a bit on Google's approach, check out this Google presentation at Stanford University.

Google – A study in Scalability and A little systems horse sense

By ksankar

Google’s Jeff Dean did an excellent talk at Stanford as part of EE380 – it is worth one’s time to listen. Very informative, instructive and innovative. As I listened, I jotted a few quick notes.

  • Interesting comparison of the scale in search from 1999 to 2010
    • Docs and queries are up 1000X, while the query latency has decreased 5X
    • Interesting to hear that in 1999 they used to update a web page store in a month or two, but now it is reduced 50000X to seconds!
  • They have had 7 significant revisions in 11 years
  • Trivia : They encounter very expensive queries for example “circle of death” requires ~30GB of I/O
  • Trivia : In 2004, they did a rethink and refreshed the systems infrastructure from scratch
  • He discussed a little about encodings – informative discussion on Byte aligned variable length & group encoding schemes << I have to try it out …
  • Trivia : They have had long distance links failure by wild dogs, sharks, dead horses and (in Oregon) drunken hunters !
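
For readers who also want to try the byte-aligned variable-length encoding mentioned in the notes, here is a minimal sketch of the common 7-bits-per-byte scheme (the same idea as Protocol Buffers varints). It is an illustration only: the function names are mine, the talk may describe different variants, and group encoding is not shown.

```python
def encode_varint(n):
    """Encode a non-negative integer using 7 payload bits per byte.

    The high bit of each byte is a continuation flag: 1 means more bytes follow.
    """
    if n < 0:
        raise ValueError("varint encoding here supports non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F          # take the low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte, continuation bit clear
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


encoded = encode_varint(300)
value, next_pos = decode_varint(encoded)
print(encoded.hex(), value, next_pos)
```

Small values take a single byte and larger ones grow a byte at a time, which is why schemes like this suit lists of small integers such as document-ID deltas.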

The presentation referenced is by Jeff Dean.