Google, Amazon, and Netflix comment on 99.999% uptime

NYTimes has a post on uptime.

99.999% Reliable? Don’t Hold Your Breath

By RANDALL STROSS

 

AT&T’s dial tone set the all-time standard for reliability. It was engineered so that 99.999 percent of the time, you could successfully make a phone call. Five 9s. That works out to being available all but 5.26 minutes a year.
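The 5.26-minute figure is straightforward to check. Here is a minimal sketch of the nines arithmetic (my own back-of-the-envelope code, not from the article):

```python
# Convert "N nines" of availability into allowed downtime per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

for nines in range(2, 6):
    availability = 1 - 10 ** -nines            # e.g. five 9s -> 0.99999
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines} nines ({availability:.5%}): "
          f"{downtime_min:,.2f} minutes of downtime per year")
```

Five 9s works out to about 5.26 minutes a year, matching the article; four 9s, Google's stated goal, allows roughly 52.6 minutes.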

The author was able to get comment from Google.

As for moving to 99.999, well, that may never come. “We don’t believe Five 9s is attainable in a commercial service, if measured correctly,” says Urs Hölzle, senior vice president for operations at Google. The company’s goal for its major services is Four 9s.

Google’s search service almost reaches Five 9s every year, Mr. Hölzle says. By its very nature, it is relatively easy to provide uninterrupted availability for search. There are many redundant copies of Google’s indexes of the Web, and they are spread across many data centers. A Web search does not require constant updating of a user’s personal information in one place and then instantly creating identical copies at other data centers.

Amazon

One of those services, the Simple Storage Service, or S3, allows companies to store data on Amazon’s servers. “We talk of ‘durability’ of data — it’s designed for Eleven-9s durability,” says James Hamilton, a vice president for Amazon Web Services. That works out to a 0.000000001 percent chance of data being lost, at least theoretically.
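To put eleven 9s in perspective, here is a hedged back-of-the-envelope sketch; the object count is a made-up assumption, not an Amazon figure:

```python
# What eleven-9s durability implies for expected data loss.
durability = 0.99999999999            # eleven 9s
p_loss = 1 - durability               # ~1e-11, i.e. 0.000000001 percent

objects_stored = 10_000_000_000       # hypothetical: 10 billion stored objects
expected_losses_per_year = objects_stored * p_loss

print(f"P(losing a given object in a year): {p_loss:.0e}")
print(f"Expected objects lost per year:     {expected_losses_per_year:.1f}")
```

In other words, even storing ten billion objects, you would expect to lose about one object every ten years, at least theoretically.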

The author also threw in a Netflix blog post.

One thing that Google and other companies offering Web services have learned to do is to keep software problems at their end out of the user’s view. John Ciancutti, vice president for personalization technology at Netflix, wrote on the company’s blog in December about lessons learned in moving its systems from its own infrastructure to that of Amazon Web Services. He said Netflix had adopted a “Rambo architecture”: each part of its system is designed to fight its way through on its own, tolerating failure from other systems upon which it normally depends.

“If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond,” Mr. Ciancutti said. “We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.”
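Mr. Ciancutti's fallback pattern is easy to picture in code. This is a minimal sketch of the idea, not Netflix's actual implementation; the function names and timeout are hypothetical:

```python
# Graceful degradation: if a dependency fails, respond anyway with a
# lower-quality answer instead of failing the whole request.

def personalized_picks(user_id: str, timeout_s: float) -> list[str]:
    raise TimeoutError("recommendations service unavailable")  # simulate an outage

def popular_titles() -> list[str]:
    return ["Popular Title A", "Popular Title B", "Popular Title C"]

def get_homepage_rows(user_id: str) -> list[str]:
    try:
        return personalized_picks(user_id, timeout_s=0.25)
    except (TimeoutError, ConnectionError):
        # Recommendations are down or slow: degrade to popular titles.
        return popular_titles()

print(get_homepage_rows("user-123"))  # still responds, just less personalized
```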

Watch for availability to be marketed more.

Read more

Three Data Center Rating Systems: Uptime, LEED, CEEDA

ZDNet has a post summarizing the three data center rating systems out there: Uptime, LEED, and CEEDA. The author summarizes the current rating-system hype.

How does your datacenter rate?

By David Chernicoff | January 20, 2011, 11:39am PST

Many businesses looking at building new datacenters announce that they are planning on achieving certification for their new datacenter by an external authority that will evaluate their datacenter and grant a specific status or award to the facility. When the new datacenter gets such a status or award, the company will send out press releases, tell stockholders, and use it in their promotional material, if applicable. But the standards of the current crop of rating entities are consistent only across their own ratings, and there are more groups doing this than you might realize. Here’s the current crop of high-end standards and awards applied to datacenters.

One of the most popular with the public is LEED, and the author pops that illusion.

Leadership in Energy and Environmental Design (LEED)

This standard, run by the US Green Building Council, you might be surprised to learn, is not a datacenter standard per se, despite all the press over the last year on datacenters achieving high LEED awards. The USGBC defines the standard as “a nationally accepted benchmark for the design, construction, and operation of high-performance green buildings.”  And while a datacenter needs to work hard to achieve LEED awards, the basic metric is not designed to rate a fully optimized datacenter.

Does your marketing group tell you to get a LEED rating?

Read more

The problem with choosing a path of cost reduction for greening a data center: the end leads to bankruptcy

I am reading a book on FedEx.

Changing How the World Does Business: FedEx's Incredible Journey to Success - The Inside Story

The author is a distribution logistics expert.

About the Author

Roger Frock has conducted numerous projects and workshops dealing with the subjects of transportation networks, logistics operating systems, and responsible and ethical management during his years with A.T. Kearney and his decade with Federal Express. He has been a guest speaker at the National Council of Physical Distribution Management, among others, on a variety of subjects.

One of my first passions after college was distribution logistics, when I worked at HP. At that time UPS was the dominant shipper and FedEx was a minor player, used only for those few shipments that had to get there the next day. Focusing on distribution logistics is what got me hired at Apple, and it has helped me think about the abstraction from delivering products to delivering services.

Part of the book tells the story of how the board of directors thought it was time for a change in management and was ready to reduce Fred Smith’s authority. One of my favorite paragraphs is the following.

[Image: excerpt from the book]

Watch out for those in the data center/IT space who make cost reduction the #1 goal. Cost reduction is a temporary move; at some point you need to grow, and growing through cost reduction is not sustainable.

There was actually an executive-sponsored study during FedEx’s growth days to reduce the number of cities served, because those cities were too expensive compared to others.

[Image: excerpt from the book]

For those who have been around for a while, we have all heard of or seen various projects that feel like this. The problem is that cost reduction is not sustainable for a growing business. Cost efficiency is.

Could you imagine a data center design done by someone who focuses on cost reduction? In fact, that may make a good horror story to swap with some designers. The top two qualities typically designed for are resiliency and efficiency. Who designs for cost reduction, willing to compromise resiliency and efficiency? Not many, but I wouldn’t be surprised if some of those designs are out there, because there are people who don’t know data centers/IT and who love to hear the words “cost reduction” coming from the data center group. Until their data center services go out of business with downtime.

Read more

After Google Reorg, Data Centers are important to all three top execs

Google announced Larry Page is CEO, replacing Eric Schmidt.

But as Google has grown, managing the business has become more complicated. So Larry, Sergey and I have been talking for a long time about how best to simplify our management structure and speed up decision making—and over the holidays we decided now was the right moment to make some changes to the way we are structured.


For the last 10 years, we have all been equally involved in making decisions. This triumvirate approach has real benefits in terms of shared wisdom, and we will continue to discuss the big decisions among the three of us. But we have also agreed to clarify our individual roles so there’s clear responsibility and accountability at the top of the company.

When you read that Larry's role is leading product development and technology strategy, it makes sense that Google's data center group would report to him.

Larry will now lead product development and technology strategy, his greatest strengths, and starting from April 4 he will take charge of our day-to-day operations as Google’s Chief Executive Officer. In this new role I know he will merge Google’s technology and business vision brilliantly. I am enormously proud of my last decade as CEO, and I am certain that the next 10 years under Larry will be even better! Larry, in my clear opinion, is ready to lead.

Sergey is working on strategic projects. But how can Google develop new products without data center resources?

Sergey has decided to devote his time and energy to strategic projects, in particular working on new products. His title will be Co-Founder. He’s an innovator and entrepreneur to the core, and this role suits him perfectly.

And Eric is working on external projects - deals, partnerships, ... technology thought leadership - that are increasingly important. You need Google's data centers for these deals.


As Executive Chairman, I will focus wherever I can add the greatest value: externally, on the deals, partnerships, customers and broader business relationships, government outreach and technology thought leadership that are increasingly important given Google’s global reach; and internally as an advisor to Larry and Sergey.

[Photo caption: From left to right - Eric, Larry and Sergey in a self-driving car, in a photo taken earlier today]

So even though Eric, Larry, and Sergey all have new roles, they all need Google's data centers.

How many companies do you know where the top three executives, billionaires all, need data centers to do their jobs?

Read more

How Facebook Ships Code, with hints of how their data centers work

My wife and I just watched The Social Network on DVD. We have a 9-year-old and a 6-year-old, so going out to movies together is rare.

The Social Network Trailer

My wife worked in sales for companies like IDG and Ziff Davis, with clients like Intel, Palm, and Microsoft. She has seen the SW developers, but she was never really exposed to their world. Watching The Social Network was entertaining, and it is a Hollywood spin on the SW culture. She was amazed at how the SW developers were portrayed and at their focus on writing code. The bit of irony is that, having worked on the Apple Mac OS team and the Microsoft Windows team, I could recognize behaviors that reminded me of the days when I was much younger and would work in the same mode Facebook was portrayed in - the reality, not the Hollywood version.

So what is it like in Facebook development? Here is a post on How Facebook Ships Code.

How Facebook Ships Code

January 17, 2011 — yeeguy

I’m fascinated by the way Facebook operates.  It’s a very unique environment, not easily replicated (nor would their system work for all companies, even if they tried).  These are notes gathered from talking with many friends at Facebook about how the company develops and releases software.

Seems like others are also interested in Facebook…   The company’s developer-driven culture is coming under greater public scrutiny and other companies are grappling with if/how to implement developer-driven culture.   The company is pretty secretive about its internal processes, though.  Facebook’s Engineering team releases public Notes on new features and some internal systems, but these are mostly “what” kinds of articles, not “how”…  So it’s not easy for outsiders to see how Facebook is able to innovate and optimize their service so much more effectively than other companies.  In my own attempt as an outsider to understand more about how Facebook operates, I assembled these observations over a period of months.  Out of respect for the privacy of my sources, I’ve removed all names and mention of specific features/products.  And I’ve also waited for over six months to publish these notes, so they’re surely a bit out-of-date.   I hope that releasing these notes will help shed some light on how Facebook has managed to push decision-making “down” in its organization without descending into chaos…  It’s hard to argue with Facebook’s results or the coherence of Facebook’s product offerings.  I think and hope that many consumer internet companies can learn from Facebook’s example.

Over the last two years I have had friends interview at Facebook. Most turned the jobs down: they were being recruited for senior engineering manager positions, and they couldn’t see how they could do their jobs and be successful.

The post has lots of information, and here are the parts that give you an idea of how Facebook thinks about its SW, which in turn influences its hardware and data centers.

engineers generally want to work on infrastructure, scalability and “hard problems” — that’s where all the prestige is.  can be hard to get engineers excited about working on front-end projects and user interfaces.  this is the opposite of what you find in some consumer businesses where everyone wants to work on stuff that customers touch so you can point to a particular user experience and say “I built that.”  At facebook, the back-end stuff like news feed algorithms, ad-targeting algorithms, memcache optimizations, etc. are the juicy projects that engineers want.

Note the above reference to a consumer business could be taken as a nod to Apple.

Additional information backs up the focus on infrastructure.

  • as of June 2010, the company has nearly 2000 employees, up from roughly 1100 employees 10 months ago.  Nearly doubling staff in under a year!
  • the two largest teams are Engineering and Ops, with roughly 400-500 team members each.  Between the two they make up about 50% of the company.

More details on the release process are explained here; a sketch of the staged rollout follows the list.

  • by default all code commits get packaged into weekly releases (tuesdays)
  • with extra effort, changes can go out same day
  • tuesday code releases require all engineers who committed code in that week’s release candidate to be on-site
  • engineers must be present in a specific IRC channel for “roll call” before the release begins or else suffer a public “shaming”
  • ops team runs code releases by gradually rolling code out
    • facebook has around 60,000 servers
    • there are 9 concentric levels for rolling out new code
    • [CORRECTION thx epriest] “The nine push phases are not concentric. There are three concentric phases (p1 = internal release, p2 = small external release, p3 = full external release). The other six phases are auxiliary tiers like our internal tools, video upload hosts, etc.”
    • the smallest level is only 6 servers
    • e.g., new tuesday release is rolled out to 6 servers (level 1), ops team then observes those 6 servers and make sure that they are behaving correctly before rolling forward to the next level.
    • if a release is causing any issues (e.g., throwing errors, etc.) then push is halted.  the engineer who committed the offending changeset is paged to fix the problem.  and then the release starts over again at level 1.
    • so a release may go thru levels repeatedly:  1-2-3-fix. back to 1. 1-2-3-4-5-fix.  back to 1.  1-2-3-4-5-6-7-8-9.
  • ops team is really well-trained, well-respected, and very business-aware.  their server metrics go beyond the usual error logs, load & memory utilization stats — also include user behavior.  E.g., if a new release changes the percentage of users who engage with Facebook features, the ops team will see that in their metrics and may stop a release for that reason so they can investigate.
  • during the release process, ops team uses an IRC-based paging system that can ping individual engineers via Facebook, email, IRC, IM, and SMS if needed to get their attention.  not responding to ops team results in public shaming.
  • once code has rolled out to level 9 and is stable, then done with weekly push.
  • if a feature doesn’t get coded in time for a particular weekly push, it’s not that big a deal (unless there are hard external dependencies) — features will just generally get shipped whenever they’re completed.
  • getting svn-blamed, publicly shamed, or slipping projects too often will result in an engineer getting fired.  “it’s a very high performance culture”.  people that aren’t productive or aren’t super talented really stick out.  Managers will literally take poor performers aside within 6 months of hiring and say “this just isn’t working out, you’re not a good culture fit”.  this actually applies at every level of the company, even C-level and VP-level hires have been quickly dismissed if they aren’t super productive.
  • [CORRECTION, thx epriest] “People do not get called out for introducing bugs. They only get called out if they ask for changes to go out with the release but aren’t around to support them in case something goes wrong (and haven’t found someone to cover for you).”
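The staged rollout above maps to a simple control loop. This is a toy model, not Facebook's tooling: the tier sizes are illustrative (only the 6-server first tier comes from the notes), and it treats all nine levels as successive tiers even though the correction notes only three are truly concentric:

```python
# Toy model of a staged weekly push: deploy tier by tier, watch metrics,
# and restart from level 1 whenever a tier misbehaves.
import random

TIERS = [6, 60, 600, 3_000, 6_000, 12_000, 18_000, 24_000, 60_000]  # hypothetical sizes

def deploy_and_watch(tier_servers: int) -> bool:
    """Roll code to one tier and observe its metrics; True means healthy."""
    print(f"  deploying to {tier_servers} servers...")
    return random.random() > 0.1  # simulate a 10% chance of a bad release

def weekly_push() -> None:
    level = 0
    while level < len(TIERS):
        if deploy_and_watch(TIERS[level]):
            level += 1        # healthy: continue to the next, larger tier
        else:
            print(f"Problem at level {level + 1}: halt, page the engineer, fix")
            level = 0         # after the fix, the release starts over at level 1
    print("All nine levels stable - weekly push done.")

weekly_push()
```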

Note: based on my research, the 60,000 server count is not accurate; it is at least twice that, with another 1/3 of growth coming in the short term before the Prineville data center comes online.

Would you want to be a senior executive hired into this environment? Now you can see why a lot of my friends turned down jobs.

Read more