AWS first outage of 2014, Jan 3 12:50a

I was disconnected from the internet this weekend. One of my developer friends said AWS was out on the East Coast, but I couldn’t find much on the outage.

Here is one post.

Amazon Cloud Services Down, Netflix, Other Sites Unreachable

January 3, 2014 

By Paul Thomson :: 1:16 AM

Update: As of 2:45 AM, it appears that Amazon’s cloud services are coming online again.

With the northeast in the grips of a deep freeze and blizzard, many people are stuck indoors tonight, hunkered down in front of the glowing screens of laptops and televisions. What they’re likely not doing, is watching Netflix and surfing on some parts of the social web, however.

Around 12:30 AM, an outage occurred with Amazon’s cloud storage service, throwing many databases and applications offline. Some of the affected sites include Netflix, Amazon streaming media, Amazon Mechanical Turk, Steam, Tumblr, and various blogs and websites that depend on Amazon’s services to host and deliver their data.

The outage appears to be centered in one of Amazon’s East Coast data centers, according to Tweets from various sources trying to pinpoint the problem, but no official status update has been given from Amazon yet.

Given the number of outages in AWS East, we have chosen to use AWS West. Our clients are also mostly on the West Coast.

We’ll see what AWS outages look like in 2014 vs. 2013. Here is a SlideShare deck analyzing past AWS outages. Its conclusion is that most outages are caused by process issues.

What's wrong with so many operations, not seeing the flaw of human intervention

I just spent some time in an operations discussion, and I quickly realized the path the team was taking was wrong. It was the classic enterprise IT approach: collect all the requirements, get all the people on board, hold lots of meetings, and create an enterprise IT solution that meets the requirements. Then spend millions of dollars on the system and pray that it will deliver.

What is wrong?

Lots of process. More people add more errors. Things move at the pace of meetings, limited by how fast people can type and review.

Hospital care is another example of how things don’t work. Here is a NYTimes op-ed piece.

More Treatment, More Mistakes

DOCTORS make mistakes. They may be mistakes of technique, judgment, ignorance or even, sometimes, recklessness. Regardless of the cause, each time a mistake happens, a patient may suffer. We fail to uphold our profession’s basic oath: “First, do no harm.”

The piece closes with a possible solution to the problem.

Hospitals are supposed to take care of the sickest members of our society and uphold the highest standards of patient care. But hospitals are also charged with teaching doctors, and every doctor has a first mistake. The only thing we can do is learn each time one happens, and reduce future errors in the process. Having a consistent gathering to talk about the mistakes goes a long way toward that goal, and just about any institution, public or private, could benefit from a tradition like M and M. It is not enough to stop the practice of defensive medicine, but when doctors are asked by their colleagues to justify the tests they ordered and the procedures they performed, perhaps they will be reminded that more is not always better.

It is amazing how many systems are not focused on catching errors and addressing them. The #1 mistake I see is people failing to see that the system itself is full of human errors. How can you run operations with an IT system that introduces more errors on top of the problems you are trying to fix?

The Bureaucracy of the Vietnam War comes to mind as something that introduced more problems than it solved.

Being closer to the problem and understanding the impact is something that I think works better.

Facebook keeps score of Serviceability and Operational Efficiency of Data Center Hardware

There is a short post on OCP by Charlie Manese of the Facebook Hardware Design team on serviceability and operational efficiency, so I will just put the whole thing up.

I know the guys at Google have this data. I wonder who else does?

Facebook's perspective on serviceability and operational efficiency

Wednesday, October 09, 2013 · Posted at 8:09 AM

UPDATED - Webinar on October 24, 2013

By Charlie Manese, Facebook Hardware Design team

At Facebook, because of our scale, we require that solutions deployed in our data center be engineered for maximum operational efficiency and serviceability.

The data center team works closely with the hardware design team to ensure this. Our designs incorporate features such as front-of-rack serviceability, toolless repair operations, and simplicity.

We’ve completed time-in-motion studies, streamlined processes for inventory and repair, and have developed scorecards that help us evaluate and compare different hardware solutions.

Below is a table of the time-to-repair comparison of different kinds of web servers that have been deployed in our environment:

[Image: table comparing time-to-repair for different kinds of web servers]

If you're interested in learning more about how Facebook thinks about serviceability and operational efficiency, and you missed the original event, I'll be joining a Hyve webinar on October 24, 2013.

For more information on the event, please see https://synnex.ilinc.com/perl/ilinc/lms/register.pl?activity_id=zvkkfkw&user_id=

Hope to see you there!

Haste makes waste, Fukushima's water tanks flawed according to construction worker

Anyone who runs projects knows it is really hard to balance cost, schedule, and quality in a project.

It is easy to get two of the three while the third suffers.

The Huffington Post reports on problems in Fukushima’s hasty water tank construction.

"I must say our tank assembly was slipshod work. I'm sure that's why tanks are leaking already," Uechi, 48, told The Associated Press from his hometown on Japan's southern island of Okinawa. "I feel nervous every time an earthquake shakes the area."

Officials and experts and two other workers interviewed by the AP say the quality of the tanks and their foundations suffered because of haste — haste that was unavoidable because there is so much contaminated water leaking from the wrecked reactors and mixed with ground water inflow.

"We were in an emergency and just had to build as many tanks as quickly as possible, and their quality is at bare minimum," said Teruaki Kobayashi, an official in charge of facility control for the plant operator, Tokyo Electric Power Co.

It is easy for executives to claim they got the work done fast and cheap. Then they either change jobs or, when quality problems appear, are ready to point the finger of blame at operations, vendors, maintenance procedures, anything that makes it look like the fault of others and not them.

Quality control exists for a reason. Industries that take a long-term view need someone focused on quality who can reject shipping a service until it meets the quality bar. Short-term thinkers will shave cost and schedule to look like heroes.

Aiming at the wrong results, increase rack density vs. NOT stranding power

I was listening to a data center analyst who named increasing rack density as one of the top things to do to increase efficiency in the data center.

This is a clear target to hit, but it is the wrong one. The situation reminds me of the 2004 Olympic target shooter who had one shot left to win the gold. He hit the bull's-eye in lane three. He was in lane two.

Emmons fired at the target in lane three while he was shooting in lane two. When no score appeared on the electronic scoring device for his lane, he turned to officials and gestured there was some sort of error.

"I shot," he appeared to say with a quizzical look as three officials in red blazers approached.

The officials went back and huddled briefly before announcing that Emmons had cross-fired — an extremely rare mistake in elite competition — and awarded him a score of zero.

It is easy to claim you increased rack density and hit the bull's-eye. But in the same way that Emmons lost because he shot at the wrong target, there is a different target that would matter more if you had judges scoring you.

If you deployed 2 kW in a rack but also stranded power in the process, why should you claim the increased rack density as a win? The bigger question is whether you used the available power without stranding any. Many other factors also influence where a piece of equipment should be placed.
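
To make the point concrete, here is a rough sketch of the arithmetic. Everything in it is an assumption for illustration: the function name `stranded_power_kw`, the 40 kW row power budget, and the 10 rack positions are made up, and a real placement decision would also weigh cooling, network, and the other factors mentioned above. It only shows that stranded power depends on how rack draw fits the power budget, not on density by itself.

```python
def stranded_power_kw(row_budget_kw, rack_draw_kw, max_racks):
    """Return (racks_deployed, used_kw, stranded_kw) for one row.

    Hypothetical model: every rack draws the same power, and the row
    is limited both by its power budget and by its rack positions.
    """
    if rack_draw_kw <= 0:
        raise ValueError("rack_draw_kw must be positive")
    # Deploy as many racks as both the power budget and floor space allow.
    racks = min(max_racks, int(row_budget_kw // rack_draw_kw))
    used = racks * rack_draw_kw
    return racks, used, row_budget_kw - used


if __name__ == "__main__":
    # Assumed row: 40 kW power budget, 10 rack positions.
    for draw in (2, 8, 12):
        racks, used, stranded = stranded_power_kw(40, draw, 10)
        print(f"{draw} kW/rack: {racks} racks, {used} kW used, {stranded} kW stranded")
```

In this made-up row, 2 kW racks run out of floor space and strand half the power, 8 kW racks use the budget exactly, and 12 kW racks are the densest yet still strand power. The number worth tracking is the stranded kW, not the kW per rack.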

Increasing rack density is a great way for blade vendors to sell more blades. But the smartest data center operators don't make rack density a target to hit.

Do you?

That Olympian got instant feedback that he had missed the target. Here is the story of what happened afterward and how he met his wife after the mistake.

Matt and Katerina Emmons

At Athens 2004, Matt Emmons missed his target but found love. He recalls:

"That was the last shooting event of the Games so a bunch of athletes and coaches went up to this beer garden between the ranges.

"We were taking it easy and relaxing, I'm there with some friends, and Czech shooter Katerina Kurkova [who had been commentating on the final as Emmons missed the target] came up to say how sorry she was about what happened, and how she admired how I'd handled the situation.

"At that time I just knew who she was, we'd never really spoken. But we hit it off really well, we started dating a year later and we were married in 2007. She's now Katerina Emmons."