7x24 Fall 2014 Phoenix is Open for Registration, Google Execs Present Twice

7x24 Fall 2014 Phoenix is open for registration.


Going through the program, what caught my eye are two presentations by Google.

Keynote: Google - Beyond the PUE Plateau

In a white paper released earlier this year, Google presented a revised operations model for maximizing data center performance while minimizing energy use across the data center fleet. "Machine Learning Applications for Data Center Optimization" describes Google's progress with moving beyond the PUE plateau. During this session Joe Kava will briefly introduce the adoption of Predictive PUE across Google's Data Centers and provide an update on how the sites are reaching this goal.
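For context, PUE is total facility energy divided by the energy delivered to IT equipment, and "predicting" it means estimating that ratio ahead of time from operating conditions. The sketch below is purely illustrative and is not the model from Google's white paper, which describes a machine-learning model trained on many operational inputs; this toy example just fits a linear model to a few hypothetical readings.

```python
import numpy as np

# Toy illustration of PUE and a simple predictive model. This is NOT the
# model from Google's white paper; the readings and features are made up.

def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT equipment power."""
    return total_facility_kw / it_load_kw

# Hypothetical hourly readings: (IT load kW, outside temp C, measured PUE).
history = [
    (900, 18, 1.14),
    (950, 22, 1.16),
    (870, 15, 1.12),
    (1020, 27, 1.19),
]

# Fit PUE ~ a + b*load + c*temp with ordinary least squares.
X = np.array([[1.0, load, temp] for load, temp, _ in history])
y = np.array([p for _, _, p in history])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict PUE for an expected load and outside temperature.
predicted = coef @ np.array([1.0, 980.0, 24.0])
print(f"current PUE: {pue(1100, 950):.2f}  predicted PUE: {predicted:.2f}")
```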

 

Joe Kava

Vice President

Google Data Centers

Google - Renewable Energy: Keeping Pace with Data Center Growth

Increasingly, customers are expecting their data center service providers to share their sustainability goals, including ensuring that the energy supply serving those data centers consists of as much renewable energy as possible.  This will likely become increasingly important as major corporations move more of their IT needs onto third party infrastructure. As a carbon neutral company, Google has been a pioneer in this area. This session will focus on innovative ways that Google has used to source renewable energy for operations; trends in sourcing renewables; why this is important to Google; principal challenges moving forward; and Google’s plans for the future.

 

Gary Demasi

Director of Operations, Global Infrastructure

Google Inc.

 

15 years ago Google placed its largest server order and did something big: starting site reliability engineering

A Google+ post recounts Google placing the largest server order in its history 15 years ago.

 

15 years ago we placed the largest server order in our history: 1680 servers, packed into the now infamous "corkboard" racks that packed four small motherboards onto a single tray. (You can see some preserved racks at Google in Building 43, at the Computer History Museum in Mountain View, and at the National Museum of American History in DC, http://americanhistory.si.edu/press/fact-sheets/google-corkboard-server-1999.)

At the time of the order, we had a grand total of 112 servers so 1680 was a huge step.  But by the summer, these racks were running search for millions of users.  In retrospect the design of the racks wasn't optimized for reliability and serviceability, but given that we only had two weeks to design them, and not much money to spend, things worked out fine.

I read this thinking about how impactful that large server order was, but I couldn't figure out what to post on why the order is significant.

Then I ran into this post on Site Reliability Engineering dated Apr 28, 2014, and realized the huge impact came from Google starting the idea of a site reliability engineering team.


Here is one of the insights shared.


The solution that we have in SRE -- and it's worked extremely well -- is an error budget.  An error budget stems from this basic observation: 100% is the wrong reliability target for basically everything.  Perhaps a pacemaker is a good exception!  But, in general, for any software service or system you can think of, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and, let's say, 99.999% available.  Because typically there are so many other things that sit in between the user and the software service that you're running that the marginal difference is lost in the noise of everything else that can go wrong.
If 100% is the wrong reliability target for a system, what, then, is the right reliability target for the system?  I propose that's a product question. It's not a technical question at all.  It's a question of what will the users be happy with, given how much they're paying, whether it's direct or indirect, and what their alternatives are.
The business or the product must establish what the availability target is for the system. Once you've done that, one minus the availability target is what we call the error budget; if it's 99.99% available, that means that it's 0.01% unavailable.  Now we are allowed to have .01% unavailability and this is a budget.  We can spend it on anything we want, as long as we don't overspend it.  
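To make that arithmetic concrete, here is a minimal sketch (my own illustration, not anything from Google) that turns an availability target into an error budget of allowable downtime per 30-day month:

```python
# Turn an availability target into an error budget (illustrative only).

def error_budget(availability_target: float, period_minutes: float = 30 * 24 * 60):
    """Return (unavailability fraction, allowed downtime in minutes) for a period.

    Example: a 99.99% target leaves a 0.01% budget, about 4.3 minutes
    of downtime in a 30-day month.
    """
    budget_fraction = 1.0 - availability_target
    return budget_fraction, budget_fraction * period_minutes

for target in (0.999, 0.9999, 0.99999):
    fraction, minutes = error_budget(target)
    print(f"{target:.3%} target -> {fraction:.3%} budget, "
          f"{minutes:.1f} min/month of allowable downtime")
```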

Here is another rule that is good to think about when running operations.

One of the things we measure in the quarterly service reviews (discussed earlier), is what the environment of the SREs is like. Regardless of what they say, how happy they are, whether they like their development counterparts and so on, the key thing is to actually measure where their time is going. This is important for two reasons. One, because you want to detect as soon as possible when teams have gotten to the point where they're spending most of their time on operations work. You have to stop it at that point and correct it, because every Google service is growing, and, typically, they are all growing faster than the head count is growing. So anything that scales headcount linearly with the size of the service will fail. If you're spending most of your time on operations, that situation does not self-correct! You eventually get the crisis where you're now spending all of your time on operations and it's still not enough, and then the service either goes down or has another major problem.
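A hedged sketch of how that rule might be applied in practice: track how each team's hours split between operations work and project work, and flag any team spending most of its time on operations. The team names, numbers, and the 50% threshold here are assumptions for illustration, not Google's internal review tooling.

```python
# Illustrative check of the "most of their time on operations" rule.
# Team names, hours, and the 50% threshold are assumptions for the example.

teams = {
    "search-frontend": {"ops_hours": 310, "project_hours": 290},
    "storage":         {"ops_hours": 120, "project_hours": 480},
}

OPS_FRACTION_LIMIT = 0.50  # "most of their time" == more than half

for name, hours in teams.items():
    total = hours["ops_hours"] + hours["project_hours"]
    ops_fraction = hours["ops_hours"] / total
    status = "NEEDS CORRECTION" if ops_fraction > OPS_FRACTION_LIMIT else "ok"
    print(f"{name}: {ops_fraction:.0%} of time on operations -> {status}")
```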

Time can support credibility: Google vs. Microsoft outage reports

One of my friends has made the switch from Google to Microsoft. Actually, I have many friends who have made that switch, and many others who have left Microsoft to go to Google. One friend, who knows how both Microsoft and Google work, made the point that the outage reporting posts from the two companies differ.

Microsoft had an Outlook outage and posted this about the event.

On Monday and Tuesday of this week, some of our Office 365 customers hosted in our North America datacenters experienced unrelated service issues with our Lync Online and Exchange Online services. First, I want to apologize on behalf of the Office 365 team for the impact and inconvenience this has caused. Email and real-time communications are critical to your business, and my team and I fully recognize our accountability and responsibility as your partner and service provider.

Google reported on one of its outages with this.

Earlier today, most Google users who use logged-in services like Gmail, Google+, Calendar and Documents found they were unable to access those services for approximately 25 minutes. For about 10 percent of users, the problem persisted for as much as 30 minutes longer. Whether the effect was brief or lasted the better part of an hour, please accept our apologies—we strive to make all of Google’s services available and fast for you, all the time, and we missed the mark today.

One way to look at the contrast is that Google is specific about the time: 25 minutes, and up to 30 minutes longer for some users.

Microsoft says it has a full understanding of the issues, but doesn't provide specifics on time.

We have a full understanding of the issues, and the root causes of both the Exchange Online and Lync Online services have already been fixed.

Google had another outage where specifics are reported down to the minute.

Issue Summary

From 6:26 PM to 7:58 PM PT, requests to most Google APIs resulted in 500 error response messages. Google applications that rely on these APIs also returned errors or had reduced functionality. At its peak, the issue affected 100% of traffic to this API infrastructure. Users could continue to access certain APIs that run on separate infrastructures. The root cause of this outage was an invalid configuration change that exposed a bug in a widely used internal library.

Timeline (all times Pacific Time)

  • 6:19 PM: Configuration push begins
  • 6:26 PM: Outage begins
  • 6:26 PM: Pagers alerted teams
  • 6:54 PM: Failed configuration change rollback
  • 7:15 PM: Successful configuration change rollback
  • 7:19 PM: Server restarts begin
  • 7:58 PM: 100% of traffic back online

Outages are painful for all companies.  

A suggestion for when you report your own outage: if you include the time of events, your communication can be viewed as more credible. Using terms like "some" or "brief" doesn't work when you are the one affected by the outage; to an affected user, "brief" sounds like a minute of downtime.
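As a small illustration of why the timestamps matter, a reader can reconstruct the outage duration, and even its error-budget cost, directly from the published timeline; a word like "brief" allows no such check. The 99.9% monthly target below is assumed purely for the example.

```python
from datetime import datetime

# Reconstruct outage duration from the published timeline (Pacific Time).
start = datetime.strptime("18:26", "%H:%M")   # 6:26 PM, outage begins
end   = datetime.strptime("19:58", "%H:%M")   # 7:58 PM, traffic fully restored
outage_minutes = (end - start).total_seconds() / 60

# Compare against a monthly error budget (99.9% target assumed for illustration).
budget_minutes = (1 - 0.999) * 30 * 24 * 60
print(f"outage: {outage_minutes:.0f} min, "
      f"monthly budget at 99.9%: {budget_minutes:.0f} min "
      f"({outage_minutes / budget_minutes:.0%} consumed)")
```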

 

Google, Microsoft and others form Consortium for 25/50 Gbps Ethernet Switches in Data Centers

Light Reading reports on Google, Microsoft and others forming a consortium for 25/50 Gbit/s switches to increase speed and reduce cost of data center networking.

 

SANTA CLARA, Calif. – A consortium of companies including Arista Networks, Broadcom Corporation, Google Inc., Mellanox Technologies, Ltd., and Microsoft Corp. today announced the availability of a specification optimized to allow data center networks to run over a 25 or 50 Gigabit per second (Gbit/s) Ethernet link protocol. This new specification will enable the cost-efficient scaling of network bandwidth delivered to server and storage endpoints in next-generation cloud infrastructure, where workloads are expected to surpass the capacity of 10 or 40 Gbps Ethernet links deployed today.

The 25 Gigabit Ethernet Consortium was formed by the above leading cloud networking technology providers for the purpose of supporting an industry-standard, interoperable Ethernet specification that boosts the performance and slashes the interconnect cost per Gbps between the server Network Interface Controller (NIC) and Top-of-Rack (ToR) switch.

ZDNet says the consortium is a response to stalls in the IEEE process.

The consortium was formed after plans to create official Institute of Electrical and Electronics Engineers (IEEE) specifications stalled at a meeting last March, due to a perceived lack of support.

...

The tech giants say that in essence, specifications published by the consortium "maximizes the radix and bandwidth flexibility of the data center network while leveraging many of the same fundamental technologies and behaviors already defined by the IEEE 802.3 standard."