If you want to decrease outages shouldn't you be thinking of errors?

No one wants outages.  Yet how many people think about the errors made?  You know those times when you leave a burner on, your ATM card in the machine, or the original in the copier.  (OK, who goes to the copier anymore like they used to?)  That doesn't mean you are making errors all the time.  But running data center infrastructure is so full of potential procedural errors that the disciplined have figured out they need to invest in detecting and reducing errors.

Here is a paper written in 1995 on working memory, the short-term memory you use to process new and old information.  http://chil.rice.edu/byrne/Pubs/git-cogsci-95-06.pdf

In their everyday interaction with the world, people often make mistakes, slips, lapses,
miscalculations and the like. Although making errors is common and the effects of errors can
range from the merely annoying to the catastrophic, procedural errors have received relatively
little attention from cognitive psychologists. Senders and Moray (1991, p. 2) suggest that “[o]ne
reason for this is that error is frequently considered only as a result or measure of some other
variable, and not as a phenomenon in its own right.” Typically, procedural errors are viewed as
the result of some stochastic function. Decrements in performance are manifested as an increase
in global error rate, with little or no attention paid to what causes any particular error.
This stochastic view of error does not seem to correspond to people’s intuitions about
their own behavior. People seem prone to making some kinds of errors more often than others,
and errors seem to occur more often at particular steps in the execution of a given procedure. In
some cases, errors are simply the result of systematic deficiencies or “bugs” in knowledge (e.g.
Brown & VanLehn, 1980). That the lack of correct knowledge of how to perform a step would
lead to errors on that step seems a plausible explanation of some kinds of error. However, this
explanation does not cover cases where people do have the correct knowledge. Many people
report making errors such as leaving the original document behind in a photocopier, failing to
retrieve one’s bank card from an automated teller machine (ATM), and forgetting to replace the
gas cap after filling the tank. In all these cases, people almost certainly have the knowledge
required to carry out the task, because they perform the task correctly most of the time. Yet
errors in these and other similar tasks are often reported.

A difficult challenge for data center automation, availability of the control system

Part of the cloud is automation.  An example is PuppetLabs, and here is a blog post on the topic.

Automation extends to the software layer, where complex systems can be configured once and then rolled out on the fly as needed, using cloud automation tools. Intelligent systems architecture can balance the load among compute, network or storage resources, bringing systems online or offline as demand dictates.

This infrastructure-as-code approach to the modern, increasingly complex data center requires advanced cloud management tools, and cloud automation answers that need. The same software-defined approach to managing private cloud architecture works equally well for managing public clouds. Bonus: By abstracting away the differences between clouds, sophisticated cloud automation software makes it easy to provision the resources the business needs at any given moment, without getting bogged down with where the servers actually sit.
— http://puppetlabs.com/blog/what-cloud-automation-driving-force-data-center-automation

There are tons and tons of companies that have cloud automation tools.  But how many people spend time addressing the availability of the automation control system?  This may seem obvious, but a control system needs to have a higher availability than the services it is managing.  Otherwise the service will go down when the automation control system goes down.

And, this may mean you need a backup to the automation system when it goes down during an outage.

As cloud environments get bigger and bigger, automation is a part of the solution, but have you thought about what happens when the automation system goes down?
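
The availability argument above can be put into rough numbers. This is a minimal sketch, not a model of any particular product: it assumes failures are independent, that the service cannot run at all without its control system, and the availability figures are made up for illustration.

```python
def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up.

    Assumes the components fail independently of each other.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(single: float, copies: int) -> float:
    """Availability of N independent copies where any one copy is enough."""
    return 1.0 - (1.0 - single) ** copies


# Hypothetical numbers, chosen only for illustration.
service = 0.9999   # the managed service on its own
control = 0.999    # a single automation control system

# If the service depends on the control system, the combined availability
# is lower than either part alone.
dependent = serial_availability(service, control)

# One independent backup lifts the control layer above the service it manages.
control_with_backup = redundant_availability(control, copies=2)
protected = serial_availability(service, control_with_backup)

HOURS_PER_YEAR = 24 * 365
for label, a in [("single control system", dependent),
                 ("control system + backup", protected)]:
    downtime = (1 - a) * HOURS_PER_YEAR
    print(f"{label}: {a:.6f} available, {downtime:.2f} h/year expected downtime")
```

With these assumed figures, the less available control system dominates the downtime of the whole chain, which is why the control system has to be the more reliable part.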

Learning from when things break, Seattle Tunnel-Boring Machine Repairs

Ever notice how some of the so-called experts rarely talk about the things that went wrong on their projects?  They make it seem like they are perfect in their execution and anything that goes wrong is an easy fix given how smart they are.  I don’t know about you, but I know of some really bad data centers out there that have been the vision of some experts.  :-)  In general, their way of getting out of accountability is to say the operations crew is to blame.

The real experts know mistakes are made and that they need to learn from them.  The largest tunnel-boring machine in the world is in Seattle, and it broke.  The media went wild pointing the fingers of blame at politicians, as if politicians know how to design and operate a tunnel-boring project.  Politicians are going to say whatever they think will benefit their goals.  This is the mistake the so-called experts make as well: thinking they can say whatever benefits their goals.

Well, when you dig a big tunnel, things go wrong.  In the case of the Seattle tunnel-boring project, things went really badly, requiring repair work over the cost of the boring machine.  Popular Mechanics has a post on the repair project, and the author jumps on the media.

What do you do when the world's largest tunneling machine is, essentially, stuck in the mud? Bertha is 60 feet under the earth, and you're on the surface watching a squirmy public swap rumors of cost and delay on the $1.35 billion tunnel component of an even larger transportation project, and the naysayers are howling: Just you watch, Bertha will be abandoned like an overheated mole, boondoggle to end all boondoggles. Because, don't forget, when you're boring the world's largest tunnel, everything is bigger—not just the machine and the hole and the outsize hopes but the worries too. The cynicism. 

What do you do? 

Here's what you do: You try to tune out the media. You shrug off the peanut gallery's spitballs. You put off the finger-pointing and the lawsuits for now; that's what the lawyers are paid for afterward. You do the only thing you can do. You put your head down and you think big, one more time. You figure out how to reach Bertha and get her moving again.


The post tells the engineering story of trying to repair the tunnel-boring machine.  

The YouTube video embedded in the article is available here.

Airbus Lessons for Debugging A350 could apply to Data Centers

Businessweek has an article on how Airbus is debugging the development of its latest aircraft, the A350.

The article gives some good tips and lessons that can apply to data centers.

The article uses the term debugging, which also equates to reducing risk.

The company has put unprecedented resources into debugging the A350—“de-risking,” as it’s called.

The big risk is not the safety risk, but the cost of the plane.

The engineering risk with the A350 isn’t that it will have chronic, life-threatening safety problems; it’s cost.

When you get into the details, the discussion can sound like a data center issue.

The challenge, Cousin says, is that “in a complex system there are many, many more failure modes.” A warning light in the cockpit could alert a pilot to trouble in the engine, for instance, but the warning system could also suffer a malfunction itself and give a false alarm that could prompt an expensive diversion or delay. Any downtime for unscheduled maintenance cuts into whatever savings a plane might offer in terms of fuel efficiency or extra seating capacity. For the A350 to be economically viable, says Brégier, “the airlines need an operational reliability above 99 percent.” That means that no more than one flight out of every 100 is delayed by more than 15 minutes because of technical reasons.
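
The reliability definition in that quote is simple enough to compute, and the same shape of metric works for data center services. A minimal sketch, assuming a hypothetical `Flight` record and made-up sample data; only the 15-minute threshold and the 99 percent target come from the article.

```python
from dataclasses import dataclass

DELAY_THRESHOLD_MIN = 15  # from the article: technical delays over 15 minutes count


@dataclass
class Flight:
    flight_id: str            # hypothetical identifier
    technical_delay_min: int  # minutes of delay attributed to technical reasons


def operational_reliability(flights: list[Flight]) -> float:
    """Share of flights NOT delayed beyond the threshold for technical reasons."""
    if not flights:
        raise ValueError("need at least one flight")
    late = sum(1 for f in flights if f.technical_delay_min > DELAY_THRESHOLD_MIN)
    return 1 - late / len(flights)


# Made-up sample: 200 flights, 3 of them with long technical delays.
sample = [Flight(f"F{i}", 0) for i in range(197)]
sample += [Flight("F197", 20), Flight("F198", 45), Flight("F199", 16)]

reliability = operational_reliability(sample)
print(f"operational reliability: {reliability:.1%}")
print("meets 99% target:", reliability > 0.99)
```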

Airbus realized that the past method of slowly working the issues out was costly.

Instead of a cautious, incremental upgrade, Airbus went for an entire family of superefficient aircraft ranging from 276 to 369 seats, with a projected development cost of more than $10 billion. The goal was what Airbus internally calls “early maturity”—getting the program as quickly as possible to the kind of bugs-worked-out status that passenger jets typically achieve after years of service.

Many companies make it seem like the data center comes from their company, but in reality almost everyone is an integrator like Boeing and Airbus.

Much of the early work was done not by Airbus but by its suppliers. While the company might look to the outside world like an aircraft manufacturer, it’s more of an integrator: It creates the overall plan of the plane, then outsources the design and manufacture of the parts, which are then fitted together. “We have 7,000 engineers working on the A350,” says Brégier, “and at least half of them are not Airbus employees.”

And a smart move is to change the way you work with suppliers so that they become partners.

Throughout the development process, teams of engineers were brought in from suppliers to collaborate with Airbus counterparts in Toulouse in joint working groups called “plateaux.” “You need to have as much transparency with your suppliers as possible,” says Brégier. “With such a program you have plenty of problems every day, so it’s bloody difficult.”

And just as operations is critical to a data center, airplane operations is the reality that needs to be addressed.

The idea is not just to put the systems through every combination of settings, but to see how the whole aircraft responds when individual parts are broken, overexerted, or misused. That, after all, is how the real world works. “Every plane in the air has something wrong with it,” Cousin says.

Name the companies that think about their data centers in the above way.  The list is pretty short.

15 years ago Google placed its largest server order and did something big starting site reliability engineering

A post from Google looks back on the company placing the largest server order in its history 15 years ago.

15 years ago we placed the largest server order in our history: 1680 servers, packed into the now infamous "corkboard" racks that packed four small motherboards onto a single tray. (You can see some preserved racks at Google in Building 43, at the Computer History Museum in Mountain View, and at the American Museum of Natural History in DC, http://americanhistory.si.edu/press/fact-sheets/google-corkboard-server-1999.)

At the time of the order, we had a grand total of 112 servers so 1680 was a huge step.  But by the summer, these racks were running search for millions of users.  In retrospect the design of the racks wasn't optimized for reliability and serviceability, but given that we only had two weeks to design them, and not much money to spend, things worked out fine.

I read this wondering how impactful this large server order was.  I couldn’t figure out what I would post about why the order is significant.

Then I ran into this post on Site Reliability Engineering dated Apr 28, 2014, and realized there was a huge impact by Google starting the idea of a site reliability engineering team.


Here is one of the insights shared.


The solution that we have in SRE -- and it's worked extremely well -- is an error budget.  An error budget stems from this basic observation: 100% is the wrong reliability target for basically everything.  Perhaps a pacemaker is a good exception!  But, in general, for any software service or system you can think of, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and, let's say, 99.999% available.  Because typically there are so many other things that sit in between the user and the software service that you're running that the marginal difference is lost in the noise of everything else that can go wrong.
If 100% is the wrong reliability target for a system, what, then, is the right reliability target for the system?  I propose that's a product question. It's not a technical question at all.  It's a question of what will the users be happy with, given how much they're paying, whether it's direct or indirect, and what their alternatives are.
The business or the product must establish what the availability target is for the system. Once you've done that, one minus the availability target is what we call the error budget; if it's 99.99% available, that means that it's 0.01% unavailable.  Now we are allowed to have .01% unavailability and this is a budget.  We can spend it on anything we want, as long as we don't overspend it.  
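
The arithmetic in that passage can be sketched in a few lines. This is only an illustration of "one minus the availability target": the 30-day window, the function names, and the outage durations are assumptions, not part of the SRE post.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget(availability_target: float) -> float:
    """Fraction of time the service is allowed to be unavailable."""
    return 1.0 - availability_target


def budget_minutes(availability_target: float, period_days: int = 30) -> float:
    """Error budget expressed as minutes of downtime over the period."""
    return error_budget(availability_target) * period_days * MINUTES_PER_DAY


def remaining_minutes(availability_target: float,
                      outage_minutes: list[float],
                      period_days: int = 30) -> float:
    """Budget left after subtracting the downtime already spent."""
    return budget_minutes(availability_target, period_days) - sum(outage_minutes)


target = 0.9999                      # 99.99% available, as in the quote
outages = [1.5, 2.0]                 # hypothetical outages this period, in minutes

total = budget_minutes(target)
left = remaining_minutes(target, outages)
print(f"budget: {total:.2f} min per 30 days, remaining: {left:.2f} min")
print("overspent" if left < 0 else "still within budget")
```

Under these assumptions a 99.99% target leaves only a few minutes of downtime per month, which makes the budget concrete enough to argue about.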

Here is another rule that is good to think about when running operations.

One of the things we measure in the quarterly service reviews (discussed earlier), is what the environment of the SREs is like. Regardless of what they say, how happy they are, whether they like their development counterparts and so on, the key thing is to actually measure where their time is going. This is important for two reasons. One, because you want to detect as soon as possible when teams have gotten to the point where they're spending most of their time on operations work. You have to stop it at that point and correct it, because every Google service is growing, and, typically, they are all growing faster than the head count is growing. So anything that scales headcount linearly with the size of the service will fail. If you're spending most of your time on operations, that situation does not self-correct! You eventually get the crisis where you're now spending all of your time on operations and it's still not enough, and then the service either goes down or has another major problem.
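
Measuring where the time goes can be as simple as the sketch below. The team names, categories, and hours are invented, and reading "most of their time" as more than half is an assumption made here, not a threshold stated in the post.

```python
from collections import defaultdict

OPS_LIMIT = 0.5  # assumption: "most of their time" means more than half

# Hypothetical time log for one quarter: (team, category, hours).
time_log = [
    ("search-sre", "operations", 310),
    ("search-sre", "engineering", 170),
    ("ads-sre", "operations", 120),
    ("ads-sre", "engineering", 360),
]


def ops_fraction_by_team(entries):
    """Return {team: share of logged hours spent on operations work}."""
    totals = defaultdict(float)
    ops = defaultdict(float)
    for team, category, hours in entries:
        totals[team] += hours
        if category == "operations":
            ops[team] += hours
    # Skip teams with no logged hours to avoid dividing by zero.
    return {team: ops[team] / total for team, total in totals.items() if total > 0}


fractions = ops_fraction_by_team(time_log)
over_limit = sorted(team for team, share in fractions.items() if share > OPS_LIMIT)

for team, share in sorted(fractions.items()):
    print(f"{team}: {share:.0%} of time on operations")
print("teams needing correction:", over_limit)
```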