Geoloqi's Amber Case Keynote at SXSW on Location Services

One of the best presentations at SXSW Interactive was given by Geoloqi's Amber Case.

Cybernetic anthropologist Amber Case spoke to a full house at SXSW this week, in one of the more thought-provoking sessions that I attended. She started off by declaring, "Every time you use that mobile phone of yours, you have a symbiotic relationship with it: you are a cyborg." Perhaps one of the most interesting points that she made was that current mobility interfaces take way too much of our time.

News.com also reports on Amber's presentation.

How cutting edge geolocation can change everything

Q&A At South by Southwest, Geoloqi CEO Amber Case spoke to CNET about the state of the art in geolocation, augmented reality, and heads-up displays.

Geoloqi CEO Amber Case speaking during her keynote address today at SXSW.

(Credit: CNET, James Martin)

AUSTIN, Texas--These days, smartphones seem like they're everywhere. And with their wide array of built-in sensors, those devices--iPhone, Androids, Windows Phones, and others--can provide us with more and more data about where we are and what's around us than ever before.

Amber's presentation was on Sunday afternoon, which is when we were hosting a BBQ for a data center crowd, so I missed it. Luckily, I had started interacting with Amber last year, and visiting Geoloqi is on my list of five companies to touch base with when I am next in Portland.

Geoloqi Extends Platform with Appcelerator, Factual and Locaid Partnerships

Geoloqi Extends its Reach to 350 Million Mobile Devices, 1.6 Million Mobile Developers, and a Database of 60 Million Places Globally, Giving Carriers and OEMs a Location-Based Platform Like Never Before

Austin, TX (SXSW Interactive) – March 11, 2012 – Geoloqi, a powerful platform for next-generation location-based services, today announced strategic new partnerships with Appcelerator, a leading cross-platform mobile development platform; Factual, a large-scale data aggregation platform with a Global Places API; and Locaid, the world’s largest carrier location platform. Through these partnerships, Geoloqi is significantly enhancing its location data and analytics offering while expanding its reach to millions of new developers and end users through Locaid's and Appcelerator's customer bases.

Geoloqi has a press release corresponding to Amber's keynote.

Why spend so much time thinking about location services?  Because there is a huge opportunity to revolutionize industries when you think the way Amber presents.  One of Amber's concepts is geofencing, which could be viewed as a different way to do what an RFID solution would do.

Let's use that example of a supermarket. With the accuracy of an iPhone's GPS, how far outside the boundaries of the store would you have to set the geofence?
Case: You could set it and encompass the parking lot and you'd be able to trigger it quite well. We've taken the native, significant location updates and how iPhone and Android handled that, and amped it up and said, well, if you had your GPS running and sending up data to the server every five seconds, your phone would run out of battery. But if you figure out how to intelligently handle it, like if I get to a new area and there are geofences here, then turn on the GPS, or just slowly monitor in the background. Then it's able to conserve battery plus get the resolution when it's necessary. We saw that this was a big pain in the industry. When we released a sample app, carriers and enterprises and governments and developers started showing up and saying, "This has been a big pain for us, what a relief that somebody else is trying to solve this problem."

Can you explain a little more about geotriggers?
Case: They're called geotriggers or geonotes. Geonotes are just text you leave inside a geofence, but a geotrigger can trigger anything in life, so lights can turn on in your house, or you can do a lot of machine-to-machine communication.
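The battery-versus-resolution trade-off and the geotrigger idea are easy to sketch. Below is a minimal, hypothetical illustration in Python (the coordinates, thresholds, and names are mine, not Geoloqi's API): coarse, low-power fixes keep the GPS off until the device gets near a registered geofence, and a precise fix inside a fence dispatches whatever geotriggers are registered for it.

```python
from collections import defaultdict
from math import radians, sin, cos, asin, sqrt

# Hypothetical geofence model: name -> (center_lat, center_lon, radius_m).
GEOFENCES = {
    "supermarket": (45.5231, -122.6765, 150),  # radius wide enough to cover the parking lot
    "home":        (45.5120, -122.6587, 75),
}

GPS_WAKE_MARGIN_M = 500  # assumed: wake the GPS when a coarse fix gets this close

# Geotrigger registry: fence name -> actions to run on entry.
triggers = defaultdict(list)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6_371_000 * 2 * asin(sqrt(a))


def should_enable_gps(coarse_lat, coarse_lon):
    """Keep the GPS off until a low-power fix lands near a registered geofence."""
    return any(
        haversine_m(coarse_lat, coarse_lon, lat, lon) <= radius + GPS_WAKE_MARGIN_M
        for lat, lon, radius in GEOFENCES.values()
    )


def on_enter(fence_name):
    """Register a geotrigger: any action to fire when the device enters a fence."""
    def register(action):
        triggers[fence_name].append(action)
        return action
    return register


@on_enter("home")
def lights_on():
    print("M2M: sending 'lights on' to the house controller")


def handle_precise_fix(lat, lon):
    """With the GPS on, fire the geotriggers for every fence containing the fix."""
    for name, (flat, flon, radius) in GEOFENCES.items():
        if haversine_m(lat, lon, flat, flon) <= radius:
            for action in triggers[name]:
                action()


if should_enable_gps(45.5118, -122.6590):   # coarse fix near "home"
    handle_precise_fix(45.5120, -122.6587)  # precise fix inside it -> lights_on() runs
```

A production implementation would lean on the platform's native region monitoring, as Case describes for iPhone and Android, but the control flow is the same: stay coarse until a fence is nearby, then pay for precision only when it matters.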

Part of what Amber and I had discussed is the opportunity to apply some of her concepts in enterprise scenarios.  I missed out on being in the standing-room-only keynote of 3,200 people, but sitting down in a conference room in Portland is much more useful.

BTW, I did find Amber Case last year when I was looking up who was presenting at SXSW.  When I saw her company is in Portland, I reached out to one of my friends who pretty much knows all the start-ups there and asked for an introduction.  Why wait to talk to a thought leader when you can connect other ways?  Conferences are useful, but it can be hard to connect with the popular people.

Human Errors Caused the Windows Azure Feb 29th Outage

On Mar 1, I met with an out-of-town friend in Bellevue, and one of the other people who joined us was a visiting Microsoft MVP.  In the conversation, he brought up the Feb 29 outage of Windows Azure, shared his views on what had gone on, and wondered how Microsoft could make a leap year mistake.  How? Human error is an easy explanation.

Here are a few of the media posts.

Microsoft tries to make good on Azure outage (GigaOm)
In his post, Bill Laing, corporate VP of Microsoft's server and cloud division, said the outage affected Windows Azure Compute and dependent services ...

Microsoft Offers Credit for Azure Cloud Outage (Data Center Knowledge)
Microsoft details leap day bug that took down Azure, refunds customers (Ars Technica)
Microsoft Azure Outage Blamed on Leap Year (CloudTweaks News)

A high-level description is provided by GigaOm's Barb Darrow.

Microsoft tries to make good on Azure outage

Microsoft is issuing credits for the recent Leap Day Azure outage. The glitch, which cropped up on Feb. 29 and persisted well into the next day, was a setback to Microsoft, which is trying to convince businesses and consumers that its Azure platform-as-a-service is a safe and secure place to put their data and host their applications.

But, I want to point out some interesting details in Bill Laing's blog post.

There are three areas where different human decisions could have prevented the problem.

Prevention

  • Testing. The root cause of the initial outage was a software bug due to the incorrect manipulation of date/time values.  We are taking steps that improve our testing to detect time-related bugs.  We are also enhancing our code analysis tools to detect this and similar classes of coding issues, and we have already reviewed our code base.
  • Fault Isolation. The Fabric Controller moved nodes to a Human Investigate (HI) state when their operations failed due to the Guest Agent (GA) bug.  It incorrectly assumed the hardware, not the GA, was faulty.  We are taking steps to distinguish these faults and isolate them before they can propagate further into the system.
  • Graceful Degradation. We took the step of turning off service management to protect customers’ already running services during this incident, but this also prevented any ongoing management of their services.  We are taking steps to have finer granularity controls to allow disabling different aspects of the service while keeping others up and visible.
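The first bullet, the date/time bug, is worth making concrete. The sketch below is my own illustration of the class of bug, not Microsoft's actual Guest Agent code: naive year arithmetic that works on 365 days of the year and blows up on February 29, plus the kind of targeted test the Prevention item argues for.

```python
from datetime import date


def naive_one_year_later(d):
    """Buggy: assumes every (month, day) combination exists in every year."""
    return d.replace(year=d.year + 1)  # raises ValueError when d is Feb 29


def safe_one_year_later(d):
    """Clamp the invalid date instead of failing (one possible fix)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=d.day - 1)  # Feb 29 -> Feb 28


def test_time_related_edge_cases():
    """The kind of test that catches time-related bugs before production does."""
    leap_day = date(2012, 2, 29)
    assert safe_one_year_later(leap_day) == date(2013, 2, 28)
    try:
        naive_one_year_later(leap_day)
    except ValueError:
        pass  # the naive version fails exactly on leap day
    else:
        raise AssertionError("expected the naive version to fail on Feb 29")


test_time_related_edge_cases()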

Another human error is that the system took 75 minutes to notify people that there was a problem.

Detection

  • Fail Fast. GA failures were not surfaced until 75 minutes after a long timeout.  We are taking steps to better classify errors so that we fail-fast in these cases, alert these failures and start recovery.
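Here is a hedged sketch of what fail-fast could look like (the error names and timeouts are assumptions, not Azure's actual implementation): classify known-fatal errors and surface them immediately instead of waiting out a long generic timeout.

```python
import time

SLOW_PATH_TIMEOUT_S = 75 * 60   # what a single long timeout looks like
FATAL_ERRORS = {"cert_generation_failed", "invalid_date"}  # assumed error classification


def wait_for_guest_agent(poll, timeout_s=SLOW_PATH_TIMEOUT_S, interval_s=5):
    """Poll a (hypothetical) status callback, but bail out early on fatal errors."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = poll()
        if status == "ready":
            return "ready"
        if status in FATAL_ERRORS:
            # Fail fast: no point waiting out the full timeout on a known-fatal error.
            raise RuntimeError(f"guest agent failed fast: {status}")
        time.sleep(interval_s)
    raise TimeoutError("guest agent never became ready")


# Demo: a guest agent that reports a fatal error on its second poll.
responses = iter(["starting", "cert_generation_failed"])
try:
    wait_for_guest_agent(lambda: next(responses), interval_s=0)
except RuntimeError as e:
    print(e)  # surfaced in seconds instead of after 75 minutes
```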

Lack of communication made the problems worse.

Service Dashboard.  The Windows Azure Dashboard is the primary mechanism to communicate individual service health to customers.  However the service dashboard experienced intermittent availability issues, didn’t provide a summary of the situation in its entirety, and didn’t provide the granularity of detail and transparency our customers need and expect.

...

Other Communication Channels.  A significant number of customers are asking us to better use our blog, Facebook page, and Twitter handle to communicate with them in the event of an incident.  They are also asking that we provide official communication through email more quickly in the days following the incident.  We are taking steps to improve our communication overall and to provide more proactive information through these vehicles.  We are also taking steps to provide more granular tools to customers and support to diagnose problems with their specific services.

One of the nice things about cloud services is the need for transparency on the cause of outages.  This is a marketing exercise that needs to make sense to a critical-thinking technical person.

Conclusion

We will continue to spend time to fully understand all of the issues outlined above and over the coming days and weeks we will take steps to address and mitigate the issues to improve our service.  We know that our customers depend on Windows Azure for their services and we take our SLA with customers very seriously.  We will strive to continue to be transparent with customers when incidents occur and will use the learning to advance our engineering, operations, communications and customer support and improve our service to you.

The Feb 29th outage was like a Y2K bug that caught Microsoft flat-footed.  There was little blame to place on a hardware failure.  What caused the problems were human decisions made in error.

Death of Tivo, Horrible customer service that screws you over to maximize their revenue

We all use DVRs, and for a while I had a Tivo DVR.  A year ago I cancelled the service when my family wanted to go back to Comcast.  But just this week I got a renewal notice for my cancelled service.

Looking at my charge statement, I found that Tivo had charged me for service in 2011 even though I had cancelled.  I called Tivo; they knew they had me, and they refused to refund the charge.  They said the customer service person had kept my annual service alive so I could sell an active box.  How is that a cancellation of my service?

Going back and forth, Tivo tried to upsell me on another service for a monthly fee.

My questions to them were: "Why should I continue service with you when you take my money and provide no service for a year?"  "How does canceling my service equate to charging me the annual fee so I can sell the box?"


My next step is to fire this off to the Better Business Bureau, as my credit card company can't do anything about the charge given its age.

Watching this happen to me reminds me of other data center vendors who focus on maximizing their revenue, putting the client in a bind with little choice but to eat the costs of fixing the mistake of picking the wrong vendor.

In the same way that Tivo is a short-timer with horrible customer service, there are other vendors out there that will fall off the map because they have no problem with horrible customer service.

It can be impossible to find this type of information disclosed anywhere, but if you know who to talk to you can find out what really works and what doesn't.  The data center vendors can benefit from people not wanting to air their dirty laundry.

My mistake:  I trusted Tivo customer support to cancel my subscription when I told them to.  A charge showed up on my credit card three months later, and I neglected to audit my credit card statement for a charge I didn't expect.  Shame on me for trusting Tivo.

A Lesson from Minority Report, sometimes you want everybody agreeing to be right

Two of my friends and I have been discussing a variety of technical and business decisions that need to be made.  One of the things we have done is to make it a rule that all three of us need to be in agreement on decisions.   Having three decision makers is a good pattern to ensure that a diversity of perspectives is included in the analysis, and decisions can be made if one decision maker is not available.

Triple redundancy, though, is typically used where, as long as two systems are in agreement, you can make a decision.

In computing, triple modular redundancy, sometimes called triple-mode redundancy,[1] (TMR) is a fault-tolerant form of N-modular redundancy, in which three systems perform a process and that result is processed by a voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the fault.
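A minimal voter for that scheme might look like the sketch below (my own illustration, not any particular system's implementation). Note that the voter can hand back the dissenting value as well, which matters for the "minority report" problem discussed next.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Majority-vote three redundant results; also report any dissenter."""
    counts = Counter([a, b, c])
    value, votes = counts.most_common(1)[0]
    if votes == 3:
        return value, None                  # unanimous
    if votes == 2:
        minority = next(v for v in (a, b, c) if v != value)
        return value, minority              # fault masked, but keep the minority report
    raise RuntimeError("no majority: all three systems disagree")


result, minority = tmr_vote(42, 42, 41)
print(result)    # 42 -- the majority output
print(minority)  # 41 -- worth logging instead of silently discarding
```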

But an example of the flaw in this approach can be taken from Minority Report and the use of precogs, where a zealousness to come to a conclusion allows a "minority report" to be discarded.

Majority and minority reports

Each of the three precogs generates its own report or prediction. The reports of all the precogs are analyzed by a computer and, if these reports differ from one another, the computer identifies the two reports with the greatest overlap and produces a majority report, taking this as the accurate prediction of the future. But the existence of majority reports implies the existence of a minority report.

James Hamilton has a blog post on error detection.  Errors could be considered the crimes in the data center.  And you can falsely assume there are no errors (crimes) because there is error correction in various parts of the system.

Every couple of weeks I get questions along the lines of “should I checksum application files, given that the disk already has error correction?” or “given that TCP/IP has error correction on every communications packet, why do I need to have application level network error detection?” Another frequent question is “non-ECC mother boards are much cheaper -- do we really need ECC on memory?” The answer is always yes. At scale, error detection and correction at lower levels fails to correct or even detect some problems. Software stacks above introduce errors. Hardware introduces more errors. Firmware introduces errors. Errors creep in everywhere and absolutely nobody and nothing can be trusted.
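Hamilton's "the answer is always yes" is cheap to act on. Here is a minimal sketch (the names are hypothetical) of attaching and verifying an application-level checksum on every block, regardless of what the disk or TCP already checked:

```python
import hashlib


def store_block(data: bytes) -> dict:
    """Attach an application-level checksum to every block we persist or send."""
    return {"data": data, "sha256": hashlib.sha256(data).hexdigest()}


def load_block(block: dict) -> bytes:
    """Verify the checksum on read, no matter what the lower layers already checked."""
    if hashlib.sha256(block["data"]).hexdigest() != block["sha256"]:
        raise IOError("block checksum mismatch: corruption below us went undetected")
    return block["data"]


block = store_block(b"customer record 1138")
assert load_block(block) == b"customer record 1138"
```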

This is how you need to think:

This incident reminds us of the importance of never trusting anything from any component in a multi-component system. Checksum every data block and have well-designed, and well-tested failure modes for even unlikely events. Rather than have complex recovery logic for the near infinite number of faults possible, have simple, brute-force recovery paths that you can use broadly and test frequently. Remember that all hardware, all firmware, and all software have faults and introduce errors. Don’t trust anyone or anything. Have test systems that bit flips and corrupts and ensure the production system can operate through these faults – at scale, rare events are amazingly common.
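And the "test systems that bit flip and corrupt" advice can be sketched just as simply: inject a single-bit fault and confirm the end-to-end checksum catches it (again, an illustrative sketch, not any production test harness).

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Corrupt one bit, the way a faulty DIMM or controller might."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


original = b"customer record 1138"
stored_checksum = hashlib.sha256(original).hexdigest()

corrupted = flip_bit(original, 13)  # a single silent bit flip
assert corrupted != original
assert hashlib.sha256(corrupted).hexdigest() != stored_checksum
print("the injected bit flip is caught by the end-to-end checksum")
```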

Maybe you shouldn't let the majority rule, and should listen to the minority.  All it takes is a small system, a system in the minority, to bring down a service.