Red Hat Linux Virtual Machine File System Becomes Read-Only

Red Hat Linux virtual machines running on the VMware platform can repeatedly have their file systems switch to read-only mode, which blocks write operations at the file system level and leads to application downtime. Each time, the server has to be rebooted to fix the issue. The cause is a temporary loss of connection to the storage disks. According to the Red Hat article, when a storage failure occurs at the VMware hypervisor level, recovery can sometimes take longer than the SCSI timeout in the guest. If the SCSI timeout expires before the storage becomes available to the ESX host again, the Linux file system may be switched to read-only. If the failover completes before the timeout, everything continues as normal with no impact on the virtual machines. This behavior is seen mostly on Linux virtual machines. Windows virtual machines do not go into read-only mode, although you may see file system retries (SCSI bus resets) in the Windows event logs.
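To see which SCSI timeout currently applies inside a Linux guest, the per-disk value can be read from sysfs. The following is a minimal sketch, not taken from the Red Hat article: the function name `show_scsi_timeouts` and its optional directory argument are hypothetical, and it assumes the disks appear as `sd*` under `/sys/block`.

```shell
# Print "<disk> <timeout in seconds>" for every sd* disk.
# The optional first argument overrides the sysfs directory; it is a
# hypothetical convenience so the function can be tried on a test directory.
show_scsi_timeouts() {
    sysfs_dir="${1:-/sys/block}"
    for f in "$sysfs_dir"/sd*/device/timeout; do
        [ -r "$f" ] || continue      # skip when the glob matched nothing
        disk=${f#"$sysfs_dir"/}      # strip the leading directory
        disk=${disk%%/*}             # keep only the disk name, e.g. sda
        printf '%s %s\n' "$disk" "$(cat "$f")"
    done
}

# Usage inside the guest:
#   show_scsi_timeouts
```

A low value here means the guest gives up on outstanding I/O sooner, so a storage failover has less time to complete before the file system is affected.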

The VMkernel log may also show many SCSI reservation conflict errors if the storage is slow or has some other underlying problem:

    Jan 19 21:08:33 esx-server-xxx vmkernel: 401:11:11:15.287 cpu0:1043)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
    Jan 19 21:08:33 esx-server-xxx vmkernel: 401:11:11:15.287 cpu0:1043)WARNING: FS3: 4784: Reservation error: SCSI reservation conflict
    Jan 19 21:08:34 esx-server-xxx vmkernel: 401:11:11:16.492 cpu0:1043)SCSI: vm 1043: 109: Sync CR at 64
    Jan 19 21:08:35 esx-server-xxx vmkernel: 401:11:11:17.468 cpu0:1043)SCSI: vm 1043: 109: Sync CR at 48
    Jan 19 21:08:36 esx-server-xxx vmkernel: 401:11:11:18.423 cpu2:1043)SCSI: vm 1043: 109: Sync CR at 32
    Jan 19 21:08:37 esx-server-xxx vmkernel: 401:11:11:19.366 cpu0:1043)SCSI: vm 1043: 109: Sync CR at 16
    Jan 19 21:08:38 esx-server-xxx vmkernel: 401:11:11:20.419 cpu0:1043)SCSI: vm 1043: 109: Sync CR at 0
    Jan 19 21:08:38 esx-server-xxx vmkernel: 401:11:11:20.419 cpu0:1043)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts
    Jan 19 21:08:38 esx-server-xxx vmkernel: 401:11:11:20.419 cpu0:1043)WARNING: FS3: 4784: Reservation error: SCSI reservation conflict
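To get a rough idea of how often messages like these occur, the log can be searched for the two warning strings shown in the excerpt. This is only a sketch: the function name `summarize_reservation_conflicts` is hypothetical, and the log path in the usage comment is an assumption that varies between ESX versions.

```shell
# Count the two warning types from the excerpt in a VMkernel log file.
summarize_reservation_conflicts() {
    log_file="$1"
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function usable under "set -e".
    conflicts=$(grep -c 'SCSI reservation conflict' "$log_file" || true)
    failed_io=$(grep -c 'Failing I/O due to too many reservation conflicts' "$log_file" || true)
    printf 'reservation_conflicts=%s failed_io=%s\n' "$conflicts" "$failed_io"
}

# Usage on the ESX host (the path is an assumption, adjust for your version):
#   summarize_reservation_conflicts /var/log/vmkernel
```

Comparing the counts before and after a change, such as a datastore migration, gives a simple indication of whether the conflicts are decreasing.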

ESX uses SCSI reservations as a locking mechanism to share LUNs between ESX hosts. These reservations are needed to prevent data corruption in environments where storage LUNs are shared between multiple ESX hosts. Whenever an ESX host updates VMFS metadata, it places a SCSI reservation on the LUN, and the reservation is released once that activity completes. The VMkernel also regularly checks for aged reservations and tries to release them. If another ESX host is actively using the LUN, it can try to reclaim the LUN or place another reservation. When multiple hosts try to reserve the same LUN at the same time, a reservation conflict occurs. If the number of reservation conflicts is too high, ESX fails the I/O. SCSI reservation errors can therefore be a sign of SAN latency or failures.

Resolution

The steps below may or may not resolve your issue, but they are the steps to perform in order to understand the root cause better. If you keep seeing this behavior on Linux virtual machines, try the following:

1. Verify that VMware Tools is up to date.
2. Migrate the affected virtual machine to another datastore and monitor it. If the issue does not reappear, the problem is likely with the storage. Engage the storage vendor.
3. If the issue recurs even after the storage migration, update the Linux kernel to the latest version.
4. Increase the SCSI timeout of each disk presented from VMware, as described in the Red Hat article.
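Raising the SCSI timeout at runtime comes down to writing a new value into each disk's sysfs `timeout` file. The following is a minimal sketch under stated assumptions: the function name `set_scsi_timeouts` and its optional directory argument are hypothetical, the 180 seconds in the usage comment is only an illustrative value (take the actual value from the Red Hat article), and the command must be run as root.

```shell
# Write a new SCSI command timeout (in seconds) to every sd* disk.
# The optional second argument overrides the sysfs directory; it is a
# hypothetical convenience so the function can be tried on a test directory.
set_scsi_timeouts() {
    timeout_secs="$1"
    sysfs_dir="${2:-/sys/block}"
    case "$timeout_secs" in
        ''|*[!0-9]*)
            echo "usage: set_scsi_timeouts SECONDS [SYSFS_DIR]" >&2
            return 1
            ;;
    esac
    for f in "$sysfs_dir"/sd*/device/timeout; do
        [ -w "$f" ] || continue      # skip missing or non-writable entries
        echo "$timeout_secs" > "$f"
    done
}

# Usage inside the guest, as root (180 is an illustrative value only):
#   set_scsi_timeouts 180
```

A value written this way does not survive a reboot, so it has to be reapplied at boot, for example through a udev rule or a startup script, to stay in effect.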

Apply the above steps to a few of the affected Linux virtual machines and keep monitoring them for recurrence of the issue. If the issue recurs, follow the steps below:

1. Reboot the ESX server.
2. Perform a LUN reset using the command below:

    vmkfstools -L lunreset /vmfs/devices/disks/device_ID

3. Reboot the storage processor.

4. Delete the affected datastore from ESX and destroy the LUN on the storage side. Recreate the array and present it to the ESX servers, then create a new datastore, place the virtual machines on it, and monitor them.
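For the monitoring that these steps call for, one simple check inside a Linux guest is to look for disk-backed file systems that are currently mounted read-only. This is a sketch only: the function name `list_readonly_mounts` is hypothetical, and it assumes the standard six-field layout of `/proc/mounts`.

```shell
# Print the mount point of every /dev-backed file system mounted read-only.
# The optional first argument overrides the mounts file; it is a
# hypothetical convenience so the function can be tried on a test file.
list_readonly_mounts() {
    mounts_file="${1:-/proc/mounts}"
    # Fields: device, mount point, type, options, dump, pass.
    # Wrapping the options in commas matches "ro" only as a whole option,
    # so "errors=remount-ro" on a read-write mount is not a false hit.
    awk '$1 ~ /^\/dev\// && index("," $4 ",", ",ro,") > 0 { print $2 }' "$mounts_file"
}

# Usage inside the guest:
#   list_readonly_mounts
```

Media that are read-only by design, such as a mounted CD-ROM, also show up in the output, so the result still needs to be compared against what is expected on that server.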

The steps above are only the initial steps for troubleshooting and may or may not fix your issue. The articles below discuss the same behavior:

http://communities.vmware.com/thread/58081

http://kb.vmware.com/kb/51306

https://access.redhat.com/site/solutions/21374

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1005009

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1018675

http://www.ntpro.nl/blog/archives/1638-New-in-esxtop-SCSI-Reservation-Conflicts-per-second-Revisited.html

Thanks for reading!