Amazon Web Services network issues, transparency of data spreads from Cloudkick

Amazon has a phenomenal amount of data on AWS, and much of that data is shared with the partner community. One partner is Cloudkick.

graphs

visualize important metrics like http latency and ping - from multiple data centers.

monitoring

set up monitoring in just a few clicks
get alerts when services go critical
monitor ping, http, https & ssh

Cloudkick wrote this post on Amazon’s network performance issues.

Visual evidence of Amazon EC2 network issues

Update: After seeing this story picked up, we ran our numbers again to demonstrate a broader picture of the issue at hand. We ran a sample of ping latency across several hundred EC2 instances managed by Cloudkick located in the US-East availability zone. Below you will see that the issues started around Christmas, and have been on-going since.

Sample ping latency across several hundred EC2 instances

An average ping latency of 50ms (as seen in the period between 11-30 and 12-14) is relatively low and normal. The spikes in latencies up to 1000ms are definitely abnormal, and should never be encountered on a healthy private network.
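The comparison above (a baseline near 50ms against spikes near 1000ms) can be sketched in a few lines of Python. This is only an illustration, not Cloudkick's method: the function name `find_latency_spikes`, the factor of 10 and the sample values are all assumptions made for the example.

```python
from statistics import median


def find_latency_spikes(samples_ms, factor=10.0):
    """Return (index, value) pairs for samples above factor x the median.

    The median serves as the baseline so that a few extreme values
    do not inflate it the way they would inflate a mean.
    """
    baseline = median(samples_ms)
    threshold = baseline * factor
    return [(i, v) for i, v in enumerate(samples_ms) if v > threshold]


# Hypothetical ping samples in milliseconds (illustrative, not Cloudkick's data).
samples = [48, 52, 50, 47, 1000, 51, 49, 950, 53, 50]
spikes = find_latency_spikes(samples)
print(spikes)
```

With these made-up samples the median is about 50ms, so only the two readings near 1000ms are flagged.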

Amazon has a great track record in performance and reliability, so this is why we are so surprised by this data. As Amazon spokesperson Kay Kinton said, “When customers report a problem they are having, we take it very seriously. Sometimes this means working with customers to tweak their configurations or it could mean making modifications in our services to assure maximum performance.”

Original post:

A couple of weeks ago we noticed that our ping latency graphs on Cloudkick looked very odd.

EC2 to EC2 ping average

This post was picked up by DataCenterKnowledge

Amazon: We Don’t Have Cloud Capacity Issues

January 14th, 2010 : Rich Miller

A chart from CloudKick looking at latency for resources running on Amazon EC2.

One of the key selling points for cloud computing is scalability: the ability to handle traffic spikes smoothly without the expense and hassle of adding more dedicated servers. But this week some users of Amazon EC2 are reporting that their apps on the cloud computing service are having problems scaling efficiently, and suggesting that this uneven performance could be due to capacity problems in Amazon’s data center.

The Register also covered the story as news, referring to DataCenterKnowledge and Cloudkick.

"Amazon has a great track record in performance and reliability, so this is why we are so surprised by this data," reads Cloudkick's blog post on the matter.

Cloudkick's numbers are limited to Amazon's "US-East" availability zone. EC2 serves up processing power from two separate geographic locations - the US and Europe - and each geographic region is split into multiple zones designed never to vanish at the same time.

enStratus, an outfit similar to Cloudkick, confirms the latency increase, but it says the spike is significantly smaller. Response time from the company's network into "all regions" of the Amazon cloud increased by 10 per cent on January 9, enStratus CTO George Reese tells The Reg, and it has remained roughly that high ever since. Reese's sample size is around 300 server instances.

Cloudkick and enStratus released their data in the wake of a blog post from Alan Williamson, co-head of the UK-based cloud consultancy AW2.0, who asked whether Amazon was experiencing capacity issues after one of his customers experienced a serious slowdown beginning at the end of last year. "We began noticing [the problem] around the end of November," Williamson tells The Reg. "We had been running with Amazon for approximately 20 months with absolutely no problems whatsoever. We could throw almost anything at them and it wouldn't even hiccup."

Echoing what Cloudkick and enStratus have seen, Williamson says he eventually traced the problem back to network latency. On the application in question, the average time needed to turn around a web request jumped from about 2 to 3 milliseconds to between 50 and 100 milliseconds.

Responding to an inquiry about the post from Data Center Knowledge, Amazon said that their infrastructure does not have capacity issues. And this afternoon, the company sent a similar statement to The Reg.

"We do not have over-capacity issues. When customers report a problem they are having, we take it very seriously," a company spokeswoman said. "Sometimes this means working with customers to tweak their configurations or it could mean making modifications in our services to assure maximum performance."