Two friends and I have been discussing a variety of technical and business decisions that need to be made. One rule we have adopted is that all three of us need to agree on a decision. Having three decision makers is a good pattern to ensure that a diversity of perspectives is included in the analysis, and that decisions can still be made if one decision maker is not available.
Triple redundancy, though, is typically used so that a decision can be made as long as two of the three systems agree.
In computing, triple modular redundancy, sometimes called triple-mode redundancy, (TMR) is a fault-tolerant form of N-modular redundancy, in which three systems perform a process and that result is processed by a voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the fault.
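The voting step described above can be sketched in a few lines. This is only a minimal illustration, assuming the three results are simple hashable values that can be compared for equality; the function name tmr_vote is made up for this example and does not come from any library.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value that at least two of the three results agree on.

    Hypothetical helper for illustration: a single faulty result is
    masked by the other two. If all three disagree, there is no
    majority and the voter cannot decide.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three results differ")
    return value


# One system returns a bad value; the other two outvote it.
result = tmr_vote(42, 42, 17)
```

Note what the voter throws away: the dissenting value never reaches the caller, so nothing downstream knows that one of the three systems disagreed.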
But an example of the flaw in this approach can be taken from Minority Report and its use of precogs, where the zeal to reach a conclusion allows a "minority report" to be discarded.
Majority and minority reports
Each of the three precogs generates its own report or prediction. The reports of all the precogs are analyzed by a computer and, if these reports differ from one another, the computer identifies the two reports with the greatest overlap and produces a majority report, taking this as the accurate prediction of the future. But the existence of majority reports implies the existence of a minority report.
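As a rough sketch of that selection step, suppose each report is a set of predicted events and "overlap" means the number of events two reports share. Both of those are assumptions made for illustration, as are the function name split_reports and the report labels. The point of the sketch is that the minority report is returned alongside the majority pair instead of being dropped.

```python
from itertools import combinations


def split_reports(reports):
    """Split three reports into the majority pair and the minority one.

    reports: dict mapping a report name to a set of predicted events.
    Overlap is measured as the size of the set intersection, which is
    an assumption of this sketch. Returns (majority_pair, minority_name).
    """
    # Compare every pair of reports and keep the pair sharing the most events.
    majority_pair = max(
        combinations(sorted(reports), 2),
        key=lambda pair: len(reports[pair[0]] & reports[pair[1]]),
    )
    # Whichever report is not in the winning pair is the minority report.
    minority_name = next(name for name in reports if name not in majority_pair)
    return majority_pair, minority_name


reports = {
    "a": {"event1", "event2", "event3"},
    "b": {"event2", "event3", "event4"},
    "c": {"event7", "event8"},
}
majority_pair, minority_name = split_reports(reports)
```

Here reports "a" and "b" form the majority and "c" is the minority. Keeping minority_name around costs almost nothing and leaves a trail to follow when the majority turns out to be wrong.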
James Hamilton has a blog post on error detection. Errors could be considered the crimes of the data center, and you can falsely assume there are no errors (crimes) because there is error correction in various parts of the system.
Every couple of weeks I get questions along the lines of “should I checksum application files, given that the disk already has error correction?” or “given that TCP/IP has error correction on every communications packet, why do I need to have application level network error detection?” Another frequent question is “non-ECC mother boards are much cheaper -- do we really need ECC on memory?” The answer is always yes. At scale, error detection and correction at lower levels fails to correct or even detect some problems. Software stacks above introduce errors. Hardware introduces more errors. Firmware introduces errors. Errors creep in everywhere and absolutely nobody and nothing can be trusted.
Suppose you think like this:
This incident reminds us of the importance of never trusting anything from any component in a multi-component system. Checksum every data block and have well-designed, and well-tested failure modes for even unlikely events. Rather than have complex recovery logic for the near infinite number of faults possible, have simple, brute-force recovery paths that you can use broadly and test frequently. Remember that all hardware, all firmware, and all software have faults and introduce errors. Don’t trust anyone or anything. Have test systems that bit flips and corrupts and ensure the production system can operate through these faults – at scale, rare events are amazingly common.
Maybe then you will not simply let the majority rule, and will listen to the minority too. All it takes is one small system, a system in the minority, to bring down a service.
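Hamilton's advice to checksum every data block and to test by injecting bit flips can be sketched as follows. This is a minimal illustration, not his implementation: the 4096-byte block size, the choice of SHA-256, and the function names are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def checksum_blocks(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 hex digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def find_corrupt_blocks(data, expected, block_size=BLOCK_SIZE):
    """Return indexes of blocks whose checksum no longer matches.

    A missing or extra block also counts as a mismatch.
    """
    actual = checksum_blocks(data, block_size)
    total = max(len(actual), len(expected))
    return [
        i for i in range(total)
        if i >= len(actual) or i >= len(expected) or actual[i] != expected[i]
    ]


def flip_bit(data, bit_index):
    """Return a copy of data with a single bit inverted (fault injection)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


# Record checksums, then simulate a single-bit fault in byte 5000.
data = bytes(range(256)) * 64            # 16384 bytes, four blocks
expected = checksum_blocks(data)
corrupted = flip_bit(data, 5000 * 8 + 3)
bad_blocks = find_corrupt_blocks(corrupted, expected)
```

A single flipped bit in byte 5000 shows up as a mismatch in block 1 only, so the fault is both detected and located. Running this kind of injection regularly is one way to check that the detection path itself still works.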