ESXi PSOD – Host fails with intermittent NMI PSOD on HP ProLiant Gen8 servers

An ESXi PSOD is always a scary sight for VMware administrators. It is the ESXi equivalent of the Blue Screen of Death in Windows. A Purple Screen of Death (PSOD) is a diagnostic screen with a purple background that is displayed when the VMkernel of an ESX/ESXi host experiences a critical error and becomes inoperative. It brings down all the virtual machines running on that host, and VMware HA then has to restart the failed VMs on other ESXi hosts in the cluster to bring them back online. It definitely causes downtime for your production virtual machines, and you also need to reboot the ESXi host to recover from the PSOD.

The PSOD shows details of the memory state at the time of the host crash, along with other information such as the ESXi build and version and the exception type. It also shows what was running on each CPU at the time of the crash, the backtrace, error messages, and information about the core dump. The core dump (or memory dump) is a file that contains further diagnostic information from a PSOD and can be given to VMware support for root cause analysis of the failure.

A purple diagnostic screen can also come in the form of an exception. An exception handler is a computer hardware mechanism designed to handle a condition that changes the normal flow of execution (division by zero, page fault, etc.). Handlers leave no trace, so you need logging (or single-step debugging) to determine whether a handler faulted. Below is a list of some of the exceptions.

  • Exception Type 0 #DE: Divide Error
  • Exception Type 1 #DB: Debug Exception
  • Exception Type 2 NMI: Non-Maskable Interrupt
  • Exception Type 3 #BP: Breakpoint Exception
  • Exception Type 4 #OF: Overflow (INTO instruction)
  • Exception Type 5 #BR: Bounds check (BOUND instruction)
  • Exception Type 6 #UD: Invalid Opcode
  • Exception Type 7 #NM: Coprocessor not available
  • Exception Type 8 #DF: Double Fault
  • Exception Type 10 #TS: Invalid TSS
  • Exception Type 11 #NP: Segment Not Present
  • Exception Type 12 #SS: Stack Segment Fault
  • Exception Type 13 #GP: General Protection Fault
  • Exception Type 14 #PF: Page Fault
  • Exception Type 16 #MF: Coprocessor error
  • Exception Type 17 #AC: Alignment Check
  • Exception Type 18 #MC: Machine Check Exception
  • Exception Type 19 #XF: SIMD Floating-Point Exception
  • Exception Type 20-31: Reserved
  • Exception Type 32-255: User-defined (clock scheduler)
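When reading the exception type off a purple screen, it can be handy to translate the number into its name. The list above can be sketched as a small shell lookup. The function name `exception_name` is hypothetical, and the table covers only the types listed in this article, so anything else is reported as unknown.

```sh
#!/bin/sh
# Hypothetical helper: map an exception type number to the name
# given in the list above. Types not in the list return "Unknown".
exception_name() {
    case "$1" in
        0)  echo "#DE: Divide Error" ;;
        1)  echo "#DB: Debug Exception" ;;
        2)  echo "NMI: Non-Maskable Interrupt" ;;
        3)  echo "#BP: Breakpoint Exception" ;;
        4)  echo "#OF: Overflow" ;;
        5)  echo "#BR: Bounds Check" ;;
        6)  echo "#UD: Invalid Opcode" ;;
        7)  echo "#NM: Coprocessor Not Available" ;;
        8)  echo "#DF: Double Fault" ;;
        10) echo "#TS: Invalid TSS" ;;
        11) echo "#NP: Segment Not Present" ;;
        12) echo "#SS: Stack Segment Fault" ;;
        13) echo "#GP: General Protection Fault" ;;
        14) echo "#PF: Page Fault" ;;
        16) echo "#MF: Coprocessor Error" ;;
        17) echo "#AC: Alignment Check" ;;
        18) echo "#MC: Machine Check Exception" ;;
        19) echo "#XF: SIMD Floating-Point Exception" ;;
        2[0-9]|3[01]) echo "Reserved" ;;          # types 20-31
        *)
            # Types 32-255 are user-defined; everything else is unknown.
            if [ "$1" -ge 32 ] 2>/dev/null && [ "$1" -le 255 ]; then
                echo "User-defined"
            else
                echo "Unknown"
                return 1
            fi
            ;;
    esac
}
```

For the issue discussed in this article, `exception_name 2` returns the NMI entry.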

In this article, we are going to talk specifically about one ESXi PSOD: a host failing with an intermittent NMI PSOD on HP ProLiant Gen8 servers. ESXi hosts running 5.5 p10, 6.0 p04, 6.0 U3, or 6.5 GA may fail with a purple diagnostic screen caused by non-maskable interrupts (NMI) on HPE ProLiant Gen8 servers.

ESXi PSOD – non-maskable-interrupts (NMI) on HPE ProLiant Gen8 Servers.

As per VMware KB article 2149043, the root cause is not yet determined and is still under investigation by VMware and HPE. You can also take a look at HPE advisory c05392947 for the latest update. I would always recommend opening a case with GSS to get your ESXi host analyzed before applying any fix for the PSOD issue.

The issue was triggered by a change in ESXi 5.5 p10, 6.0 p04, 6.0 U3, and 6.5 GA in which ESXi disables the Intel IOMMU’s (aka VT-d) interrupt remapper functionality. On HPE ProLiant Gen8 servers, this change causes PCI errors, which result in the platform generating an NMI and the ESXi host failing with a purple diagnostic screen.

VMware KB article 2149043 provides a workaround: re-enable the Intel IOMMU interrupt remapper on the ESXi host. The step-by-step procedure is as follows.
1. Connect to your ESXi host using SSH.
2. Validate the current iovDisableIR setting on the ESXi host using the command below:
esxcli system settings kernel list -o iovDisableIR
In this example, iovDisableIR is currently set to TRUE. We need to set it to FALSE.
3. Run the command below to re-enable the Intel IOMMU interrupt remapper on the ESXi host:
esxcli system settings kernel set --setting=iovDisableIR -v FALSE
4. Reboot the ESXi host.
5. Revalidate that the iovDisableIR setting is now FALSE by running this command:
esxcli system settings kernel list -o iovDisableIR
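If you have several hosts to work through, steps 2 and 3 can be wrapped in a small script. The following is only a sketch, not something taken from the KB article. The function names `configured_value` and `apply_workaround` are hypothetical. It assumes the `esxcli ... list` output has a header row containing a `Configured` column, with the values aligned underneath it, so check that against the output on your own host first. The reboot is deliberately left as a manual step.

```sh
#!/bin/sh
# Hypothetical helper: read `esxcli system settings kernel list` output
# on stdin and print the value under the "Configured" column for the
# setting named in $1. Assumes an aligned header row with "Configured".
configured_value() {
    awk -v name="$1" '
        col == 0   { col = index($0, "Configured"); next }  # find column in header
        $1 == name { split(substr($0, col), f, " "); print f[1]; exit }
    '
}

# Hypothetical wrapper for steps 2 and 3: change iovDisableIR to FALSE
# only when it is currently TRUE. It does not reboot the host.
apply_workaround() {
    current=$(esxcli system settings kernel list -o iovDisableIR |
        configured_value iovDisableIR)
    case "$current" in
        TRUE|true)
            esxcli system settings kernel set --setting=iovDisableIR -v FALSE &&
                echo "iovDisableIR set to FALSE; reboot the host to apply"
            ;;
        FALSE|false)
            echo "iovDisableIR is already FALSE; nothing to change"
            ;;
        *)
            echo "could not read iovDisableIR (got: '$current')" >&2
            return 1
            ;;
    esac
}
```

Run `apply_workaround` in an SSH session on the host, reboot, and then repeat the validation command from step 5. Locating the column by its header position rather than by counting fields keeps the parsing independent of where the free-text description column sits.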
That’s it. We are done with the workaround for the ESXi PSOD where a host fails with an intermittent NMI PSOD on HP ProLiant Gen8 servers. I hope this is informative for you. Be social and share it on social media if you feel it is worth sharing.