On Mar 1, I met with an out-of-town friend in Bellevue, and one of the other people who joined us was a visiting Microsoft MVP. In the conversation, he brought up the Feb 29 outage of Windows Azure and shared his views on what had gone on and how Microsoft could make a leap year mistake. How? Human error is an easy explanation.
Here are a few of the media posts.
- GigaOm: In his post, Bill Laing, corporate VP of Microsoft's server and cloud division, said the outage affected Windows Azure Compute and dependent services ...
- Microsoft Offers Credit for Azure Cloud Outage (Data Center Knowledge)
- Microsoft details leap day bug that took down Azure, refunds customers (Ars Technica)
- Microsoft Azure Outage Blamed on Leap Year (CloudTweaks News)
GigaOm's Barb Darrow provides a high-level description in "Microsoft tries to make good on Azure outage."
But I want to point out some interesting details in Bill Laing's blog post.
There are three areas where human error contributed to the problem and where better practices could have prevented it.
- Testing. The root cause of the initial outage was a software bug due to the incorrect manipulation of date/time values. We are taking steps that improve our testing to detect time-related bugs. We are also enhancing our code analysis tools to detect this and similar classes of coding issues, and we have already reviewed our code base.
- Fault Isolation. The Fabric Controller moved nodes to a Human Investigate (HI) state when their operations failed due to the Guest Agent (GA) bug. It incorrectly assumed the hardware, not the GA, was faulty. We are taking steps to distinguish these faults and isolate them before they can propagate further into the system.
- Graceful Degradation. We took the step of turning off service management to protect customers’ already running services during this incident, but this also prevented any ongoing management of their services. We are taking steps to have finer granularity controls to allow disabling different aspects of the service while keeping others up and visible.
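The "incorrect manipulation of date/time values" in the Testing item is easy to reproduce in miniature. As I read Laing's post, the Guest Agent built a certificate's valid-to date by adding one to the current year, which on Feb 29, 2012 produces Feb 29, 2013, a date that does not exist. The Python below is only an analogy under that assumption, not Microsoft's code; both function names are hypothetical.

```python
from datetime import date


def one_year_later_naive(d: date) -> date:
    """Add one to the year and keep month and day unchanged.

    Works 365 days out of 366, then raises ValueError on Feb 29
    because the following year has no Feb 29.
    """
    return d.replace(year=d.year + 1)


def one_year_later_safe(d: date) -> date:
    """Add one year, falling back to Feb 28 when the target day is invalid.

    Clamping to Feb 28 is one possible policy (an assumption here);
    rolling forward to Mar 1 would be equally defensible.
    """
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only Feb 29 can fail this way, so clamping the day to 28 is enough.
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(one_year_later_safe(leap_day))
    try:
        one_year_later_naive(leap_day)
    except ValueError as exc:
        print("naive version failed:", exc)
```

A test suite that only runs against today's date will never hit this path, which is why time-related tests usually need to inject specific dates such as Feb 29 and Dec 31.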
Another human error: the system was designed in a way that took 75 minutes to surface that there was a problem.
- Fail Fast. GA failures were not surfaced until 75 minutes after a long timeout. We are taking steps to better classify errors so that we fail-fast in these cases, alert these failures and start recovery.
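The fail-fast idea in that item, classifying errors so that deterministic failures are reported at once instead of waiting out a long timeout, can be sketched roughly as follows. This is a generic illustration, not how the Fabric Controller works; the exception classes and the `run_with_fail_fast` helper are hypothetical names.

```python
class TransientError(Exception):
    """A failure that may succeed on retry (for example, a brief network blip)."""


class PermanentError(Exception):
    """A deterministic failure that retrying or waiting cannot fix."""


def run_with_fail_fast(operation, max_attempts=3):
    """Run operation, retrying only failures classified as transient.

    A PermanentError is re-raised on the first occurrence so it can be
    alerted on immediately; a TransientError is retried up to max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except PermanentError:
            # Fail fast: no retry, no waiting for a timeout to expire.
            raise
        except TransientError:
            if attempt == max_attempts:
                raise


if __name__ == "__main__":
    def broken_date_logic():
        # Stands in for a bug that fails the same way on every attempt.
        raise PermanentError("invalid date computed")

    try:
        run_with_fail_fast(broken_date_logic)
    except PermanentError as exc:
        print("surfaced immediately:", exc)
```

The design choice is in the classification: a bug that fails identically on every node is not going to heal itself, so treating it like a slow or flaky operation only delays the alert.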
Lack of communication made the problems worse.
- Service Dashboard. The Windows Azure Dashboard is the primary mechanism to communicate individual service health to customers. However the service dashboard experienced intermittent availability issues, didn’t provide a summary of the situation in its entirety, and didn’t provide the granularity of detail and transparency our customers need and expect.
- Other Communication Channels. A significant number of customers are asking us to better use our blog, Facebook page, and Twitter handle to communicate with them in the event of an incident. They are also asking that we provide official communication through email more quickly in the days following the incident. We are taking steps to improve our communication overall and to provide more proactive information through these vehicles. We are also taking steps to provide more granular tools to customers and support to diagnose problems with their specific services.
One of the nice things about cloud services is the need for transparency about the cause of outages. A post-mortem like this is partly a marketing exercise, but it has to make sense to a critically thinking technical person.
We will continue to spend time to fully understand all of the issues outlined above and over the coming days and weeks we will take steps to address and mitigate the issues to improve our service. We know that our customers depend on Windows Azure for their services and we take our SLA with customers very seriously. We will strive to continue to be transparent with customers when incidents occur and will use the learning to advance our engineering, operations, communications and customer support and improve our service to you.
The Feb 29 outage was like a Y2K bug that caught Microsoft flat-footed. There was little reason to blame a hardware failure. What caused the problems were human decisions made in error.