Raise Server temperatures until errors, Intel demos MCA

At the Intel Developer Forum, Intel presented its Machine Check Architecture (MCA) extensions and simulated a hardware failure to demonstrate them.


This capability is supported in Windows Server 2008 R2.

MCA Devices

Microsoft Windows generic hardware abstraction layers (HAL) for Intel architectures (Halx86.dll, Halapic.dll, Halmps.dll, Halia64.dll) support the Machine Check Architectures (MCA) for the Intel Pentium Pro and Itanium processors. The HAL enables Machine Check Exception (MCE) reporting for all implementation-defined errors.

For more information about the MCA-specific interface for drivers for Intel Pentium Pro and Itanium processors, see MCA Interface for Drivers.

Intel has also contributed the code to Linux.

Intel Contributes MCA Recovery Code to the Linux* Kernel
This code will allow graceful advanced Machine Check Architecture (MCA) recovery from memory errors on systems based on the processor code-named "Nehalem-EX" configured with large amounts of memory.

And Sun has done the work to add support to Solaris on both AMD and Intel.

Generic Machine Check Architecture (MCA) In Solaris

The work described below was integrated into Solaris Nevada way back in August 2007 - build 76; it has since been backported to Solaris 10. It's never too late to blog about things! Actually, I just want to separate this description from the entry that will follow - Solaris x86 xVM Fault Management.

Why Generic MCA?

In past blogs I have described x86 cpu and memory fault management feature-additions for specific processor types: AMD Opteron family 0xf revisions B-E, and AMD Opteron family 0xf revisions F and G. At the time of the first AMD work Sun was not shipping any Intel x64 systems; since then, of course, Sun has famously begun a partnership with Intel and so we needed to look at offering fault management support for our new Intel-based platforms.

With MCA, you can monitor the processor's error-correction reports and see how rising temperatures relate to processor errors.
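
One way to look at that relationship, as a rough sketch only: assume you have already collected paired samples of inlet temperature and the number of corrected errors logged in each interval. How you collect them depends on your operating system and is not shown here. The function names and the sample numbers below are made up for illustration; they are not measurements and not part of any MCA interface.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # One series never changes, so no relationship can be measured.
        return 0.0
    return cov / sqrt(var_x * var_y)


def first_error_temperature(temps_c, corrected_errors):
    """Lowest temperature at which any corrected error was logged, or None."""
    hot = [t for t, e in zip(temps_c, corrected_errors) if e > 0]
    return min(hot) if hot else None


# Hypothetical samples for illustration only (degrees C, errors per interval).
temps_c = [25, 30, 35, 40, 45, 50]
corrected_errors = [0, 0, 0, 1, 3, 7]

if __name__ == "__main__":
    print("correlation:", pearson(temps_c, corrected_errors))
    print("first errors seen at:", first_error_temperature(temps_c, corrected_errors), "C")
```

A correlation near 1 together with a clear temperature at which corrected errors first appear would suggest the errors are heat-related; a flat error count as the temperature climbs would suggest the server is tolerating the heat.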


If thermal alarms are going off on a server but there are no errors, should you panic?