Netflix sets Chaos Monkey free for all to use; next come more monkeys - Latency, Conformity, Doctor, Janitor, Security, 10-18, and Chaos Gorilla

Netflix has been getting more and more attention, and I think part of the reason is that they talk about things that go wrong and what they have learned from them.  Netflix has learned that people listen much more when you talk about your mistakes than when you promote your error-free ways.

Netflix's latest move is to release Chaos Monkey to the open source community.  Here is their blog post.

Chaos Monkey released into the wild

We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach.
We have written about our Simian Army in the past and we are now proud to announce that the source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.
Do you think your applications can handle a troop of mischievous monkeys loose in your infrastructure? Now you can find out.

What is Chaos Monkey?

Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group. The software design is flexible enough to work with other cloud providers or instance groupings and can be enhanced to add that support. The service has a configurable schedule that, by default, runs on non-holiday weekdays between 9am and 3pm. In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don't, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.
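To make the scheduling and termination rules above concrete, here is a minimal sketch in Python. It is not taken from the Chaos Monkey source: the holiday list, the function names, and the shape of the group data are all assumptions made for illustration, and nothing here calls AWS.

```python
import random
from datetime import date, datetime

# Hypothetical holiday calendar; a real deployment would supply its own.
HOLIDAYS = {date(2012, 7, 4), date(2012, 12, 25)}


def is_monkey_time(now, holidays=HOLIDAYS, start_hour=9, end_hour=15):
    """Return True on non-holiday weekdays between 9am and 3pm."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    if now.date() in holidays:
        return False
    return start_hour <= now.hour < end_hour


def pick_victims(groups, rng):
    """Pick one instance to terminate from each non-empty group.

    `groups` maps a group name to a list of instance ids (assumed shape).
    """
    return {
        name: rng.choice(instances)
        for name, instances in groups.items()
        if instances
    }


if __name__ == "__main__":
    groups = {"api": ["i-1", "i-2", "i-3"], "web": ["i-4"], "idle": []}
    if is_monkey_time(datetime.now()):
        print("Would terminate:", pick_victims(groups, random.Random()))
    else:
        print("Outside the configured window; doing nothing.")
```

The check runs before any selection, so nothing is terminated outside hours when engineers are expected to be around.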
There are more Monkeys coming from the Simian Army.

Inspired by the success of the Chaos Monkey, we’ve started creating new simians that induce various kinds of failures, or detect abnormal conditions, and test our ability to survive them; a virtual Simian Army to keep our cloud safe, secure, and highly available.

Latency Monkey induces artificial delays in our RESTful client-server communication layer to simulate service degradation and measures if upstream services respond appropriately. In addition, by making very large delays, we can simulate a node or even an entire service downtime (and test our ability to survive it) without physically bringing these instances down. This can be particularly useful when testing the fault-tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system.
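The idea of injecting delay and then checking how the caller copes can be sketched roughly as follows. This is an illustration only, not Netflix's implementation: `with_latency` and `call_with_fallback` are hypothetical helpers, and a plain function stands in for a RESTful dependency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_latency(func, delay_seconds):
    """Wrap `func` so that every call is delayed by `delay_seconds`."""
    def delayed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return delayed


def call_with_fallback(func, timeout_seconds, fallback):
    """Call `func` in a worker thread; return `fallback` if it is too slow."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block the caller on the slow dependency.
        pool.shutdown(wait=False)


def recommendations():
    """Stand-in for a dependency that normally answers quickly."""
    return ["title-a", "title-b"]


if __name__ == "__main__":
    slow = with_latency(recommendations, 0.3)
    print(call_with_fallback(recommendations, 1.0, ["default"]))
    print(call_with_fallback(slow, 0.05, ["default"]))
```

A very large delay makes the dependency look down to the caller while it stays available to everything else, which is the property the paragraph above describes.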

Conformity Monkey finds instances that don’t adhere to best-practices and shuts them down. For example, we know that if we find instances that don’t belong to an auto-scaling group, that’s trouble waiting to happen. We shut them down to give the service owner the opportunity to re-launch them properly.
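The auto-scaling-group rule in that example amounts to a set difference. A minimal sketch, assuming instance ids and group membership are already available as plain Python collections (the function name and data shapes are hypothetical):

```python
def find_nonconforming(instance_ids, groups):
    """Return the instances that belong to no auto-scaling group.

    `instance_ids` is an iterable of all running instance ids and
    `groups` maps a group name to its member ids (assumed shapes).
    """
    grouped = set()
    for members in groups.values():
        grouped.update(members)
    return sorted(i for i in instance_ids if i not in grouped)


if __name__ == "__main__":
    running = ["i-1", "i-2", "i-3", "i-9"]
    groups = {"api": ["i-1", "i-2"], "web": ["i-3"]}
    print(find_nonconforming(running, groups))
```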

Doctor Monkey taps into health checks that run on each instance as well as monitors other external signs of health (e.g. CPU load) to detect unhealthy instances. Once unhealthy instances are detected, they are removed from service and after giving the service owners time to root-cause the problem, are eventually terminated.

Janitor Monkey ensures that our cloud environment is running free of clutter and waste. It searches for unused resources and disposes of them.

Security Monkey is an extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all our SSL and DRM certificates are valid and are not coming up for renewal.

10-18 Monkey (short for Localization-Internationalization, or l10n-i18n) detects configuration and run time problems in instances serving customers in multiple geographic regions, using different languages and character sets.

Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.
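One way to reason about a zone outage before running a real test is to model it offline. The sketch below is an assumption-laden illustration, not Chaos Gorilla itself: it treats a service as a mapping from instance id to availability zone and asks whether enough instances would remain after losing any single zone.

```python
def simulate_zone_outage(instances, failed_zone):
    """Return the instances that survive the loss of `failed_zone`.

    `instances` maps an instance id to its availability zone (assumed shape).
    """
    return {i: z for i, z in instances.items() if z != failed_zone}


def survives_any_zone_outage(instances, required):
    """True if at least `required` instances remain after any one zone fails."""
    if len(instances) < required:
        return False
    zones = set(instances.values())
    return all(
        len(simulate_zone_outage(instances, zone)) >= required
        for zone in zones
    )


if __name__ == "__main__":
    service = {"i-1": "us-east-1a", "i-2": "us-east-1b", "i-3": "us-east-1c"}
    print(survives_any_zone_outage(service, required=2))
    print(survives_any_zone_outage(service, required=3))
```

This only checks raw capacity; it says nothing about whether traffic actually re-balances, which is what a live test is meant to verify.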

17 3MW diesel generators are a new air pollution source, Vantage applies for 51 MW in Quincy WA

One of the things you need to do in the USA to build a data center is get an air permit from the state environmental agency.  An example is Vantage Data Centers' new construction in Quincy, WA.  As part of the process there is a public hearing phase that describes the project.

NOTICE TO CONSTRUCT A NEW AIR POLLUTION SOURCE,

ANNOUNCEMENT OF PUBLIC HEARING, & SECOND TIER PETITION APPROVAL RECOMMENDATION

Comments accepted July 30 through September 10, 2012

The State of Washington Department of Ecology (Ecology) has received application to construct a new air pollution source. Vantage Data Centers Management Company, LLC, 2625 Walsh Avenue, Santa Clara, CA 95051, has proposed to build Vantage Data Centers located at the northwest corner of the intersection of Road 11 NW and Road O NW, Quincy in Grant County. The mailing address for the Vantage Data Centers in Quincy is 2101 M Street, Quincy, WA 98848.

Vantage Data Centers will contain four main data center buildings once it is fully constructed, and will install and operate up to 17 diesel engines that will power 3.0 megawatt electrical generators for a total of 51 megawatts of emergency backup electrical power. Diesel engines generate criteria and toxic air contaminants which have been evaluated. Diesel engine exhaust particulate (DEEP) emissions were reviewed under a Second Tier Health Impact Assessment to evaluate health risks posed by the project. After review of the completed Notice of Construction application and other information on file with the agency, Ecology has decided that this project proposal will conform to all requirements as specified in Chapter 173-400 WAC. After review of the Second Tier Health Impact Assessment, Ecology concluded that impacts to the community due to the Vantage Data Centers will meet the protective requirements contained in Chapter 173-460 WAC.

Copies of the Notice of Construction Preliminary Determination, the Second Tier Petition Recommendation, and supporting application documents are available for public review at Department of Ecology, Eastern Regional Office, 4601 N. Monroe, Spokane, WA 99205-1295, and at the City of Quincy, 104 B Street SW, Quincy, WA 98848.

The public is invited to attend a public hearing that has been scheduled to start at 5:15 PM on September 6, 2012 in the upstairs meeting room at the Quincy City Hall located at 104 B Street SW in Quincy. The public hearing will include presentations followed by a question and answer session starting at 5:30 PM. Public comment will be taken starting promptly at 6:30 PM. In addition to public comments taken at the public hearing, the public is invited to comment on this project proposal prior to the public hearing. Comments accepted July 30 through September 10, 2012. Submit comments to Beth Mort at Ecology's Spokane Office, 4601 N. Monroe, Spokane, WA 99205-1295, or email beth.mort@ecy.wa.gov, or 509 329-3502.

Oracle acquires Virtual Network Company Xsigo

It's tough being a network company.  Why?  Because VMware buys Nicira, and now Oracle buys Xsigo.

Oracle Buys Xsigo

Extends Oracle's Virtualization Capabilities with Leading Software-Defined Networking Technology for Cloud Environments

Redwood Shores, Calif. – July 30, 2012

News Facts

Oracle today announced that it has entered into an agreement to acquire Xsigo Systems, a leading provider of network virtualization technology.
Xsigo’s software-defined networking technology simplifies cloud infrastructure and operations by allowing customers to dynamically and flexibly connect any server to any network and storage, resulting in increased asset utilization and application performance while reducing cost.
The company’s products have been deployed at hundreds of enterprise customers including British Telecom, eBay, Softbank and Verizon.
The combination of Xsigo for network virtualization and Oracle VM for server virtualization is expected to deliver a complete set of virtualization capabilities for cloud environments.
Terms of the agreement were not disclosed. More information on this announcement can be found at oracle.com/xsigo.

Supporting Quotes

"The proliferation of virtualized servers in the last few years has made the virtualization of the supporting network connections essential," said John Fowler, Oracle Executive Vice President of Systems. "With Xsigo, customers can reduce the complexity and simplify management of their clouds by delivering compute, storage and network resources that can be dynamically reallocated on-demand."
"Customers are focused on reducing costs and improving utilization of their network," said Lloyd Carney, Xsigo CEO. "Virtualization of these resources allows customers to scale compute and storage for their public and private clouds while matching network capacity as demand dictates."

From the way the press release reads, you would think VMware would be the one buying Xsigo, not Oracle.

Twitter outage couldn't have happened at a worse time, day before Olympics

People have gotten used to Twitter.  Depending on a free service leaves you little recourse when it fails, other than switching.

Of all the times Twitter could go down, the day before the 2012 Summer Olympics is probably one of the worst.

Twitter outage spreads around the globe

Social media site's second outage in 5 weeks

People across much of the planet were having problems accessing Twitter on Thursday, a day before the 2012 Olympic Games are expected to cause a spike in use of the micro-blogging site.

The San Francisco-based company acknowledged the problem, saying in a statement that its engineers are "currently working to resolve the issue," although it didn't go into any further detail.

Visitors to the site were greeted with a half-formed message partially in code saying that "Twitter is currently down."

The fields where a reason for the outage and a deadline for restoring service were apparently meant to go were filled with computer code.

Sluggishness or outages were reported from countries in North America, Europe, Asia, Latin America, the Middle East, and Africa.

Isn't it kind of funny to think that you could retweet a Twitter outage?  :-)

Twitter Site Issue 2 hours ago

Users may be experiencing issues accessing Twitter. Our engineers are currently working to resolve the issue.