Loggly is a cloud service that offers system monitoring and alerting as one of its services.
Even so, Loggly suffered an extended outage. AWS rebooting 100% of Loggly's servers accounted for only half of the downtime. The other half was due to Loggly not knowing the service was down.
Loggly's Outage for December 19th
The other half of the outage was caused by Loggly not testing for a reboot of 100% of its machines.
The Human Element
One lesson Loggly learned, which some of my software buddies and I are applying in a software design of our own, is to use more than one monitoring solution.
The second step is to ensure more robust external monitoring. With multiple deployments, this issue becomes less of an issue, but clearly we need more reliable checks than what we rely on with Zerigo or other services. Sorry, but simple HTTP checks, pings and established connections to a box do not guarantee it's up!
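To make the point about HTTP checks and pings concrete, here is a minimal sketch of a functional check in Python. It is not Loggly's method, just one way to go beyond "the box answers": push a unique marker through the service and confirm it comes out the other side. The names `end_to_end_check`, `send` and `search` are hypothetical. You would supply your own callables that talk to your real ingestion and query endpoints.

```python
import time
import uuid
from typing import Callable


def end_to_end_check(
    send: Callable[[str], None],    # hypothetical: submits one event to the service
    search: Callable[[str], bool],  # hypothetical: True if the event is retrievable
    timeout_s: float = 30.0,
    poll_s: float = 1.0,
) -> bool:
    """Return True only if a freshly sent marker can be read back in time."""
    # A unique token ensures we never match an event from an earlier run.
    token = f"healthcheck-{uuid.uuid4()}"

    try:
        send(token)
    except Exception:
        # The service would not even accept the event: treat it as down.
        return False

    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if search(token):
                return True
        except Exception:
            # A failing query counts as "not found yet"; keep polling.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


if __name__ == "__main__":
    # Stand-in for a real service: an in-memory list plays the event store.
    store = []
    healthy = end_to_end_check(store.append, lambda t: t in store, timeout_s=1.0)
    print("service healthy:", healthy)
```

A box that accepts connections but never processes anything fails this check, which is exactly the case a ping misses. Run it from somewhere outside the deployment being monitored, and ideally from more than one place, so the monitor does not go down with the thing it watches.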