• A very interesting Microsoft cluster failure

    It’s been a long time since my last post. I was offline due to a health issue; I’ve just recovered and returned to normal work. I’ll publish my articles in English first and translate them to Chinese later, since I have much less free time now that my baby is born (but more fun!). Hopefully it won’t impact Google search. 🙂

    There was an interesting problem on a Microsoft cluster when I came back from the hospital. Our DBA team complained that the Microsoft Cluster Service failed intermittently on virtual machines. This kept happening for a week.

    At the beginning of the troubleshooting, the team noticed that the quorum disks failed with the following Windows event:

    Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible.

    So we focused on disk performance. vobd.log also showed some performance degradation, but the timestamps did not match. Microsoft was engaged after that; they said the cluster failure was actually caused by a network connectivity issue, according to the following Windows event:

    Cluster node ‘xxx’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

    It became interesting: since the virtual machines share physical network links, a network connectivity issue could not affect only a single virtual machine. Then we noticed the following abnormal Windows event around some of the failures:

    Reset to device, \Device\RaidPort0, was issued

    It is related to an LSI driver bug. It may not be related to the cluster failure, but it is worth updating the driver to improve cluster stability. See the VMware KB article “Windows virtual machine event log reports the error: Reset to device, \Device\RaidPort0, was issued” for details on this bug.

    After involving multiple vendors across the OS, virtualization, network and storage teams, everybody said it was not their problem. You see this kind of problem in large datacenters: as more and more systems are installed, it becomes hard to find out which piece of the system is causing an issue. You have to be familiar with every field of the datacenter.

    Eventually we figured out that the issue was related to storage workload. But why couldn’t the vendors figure it out? First, the Windows OS disk runs on shared storage, and Windows stops responding when the latency of the VMFS5 datastore backing the OS disk is high. From the Windows perspective, it does not know what happened on the backend storage; it just sees the OS freeze for a few seconds, as if the system were paused. This causes network packets to drop, and there is no Windows event for it, because the OS resumes so quickly that Windows treats it as normal behavior. The cluster actually fails at this moment. Second, the particular LUNs hosting the virtual machines were not busy, but they shared a storage pool with other LUNs. Any LUN with a high workload impacts the rest of the LUNs in the same storage pool.
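    The timing argument above can be sketched in a few lines. This is a rough model, not the real cluster code; the 1-second heartbeat interval and 5-miss eviction threshold are assumed defaults (the SameSubnetDelay and SameSubnetThreshold cluster parameters), and your cluster may be tuned differently:

```python
# Rough model of why a short storage-induced VM "pause" can evict a cluster node.
# Assumed defaults (check your cluster): heartbeats every 1 s (SameSubnetDelay),
# node removed after 5 consecutive missed heartbeats (SameSubnetThreshold).
HEARTBEAT_INTERVAL_S = 1.0
MISS_THRESHOLD = 5

def node_evicted(pause_seconds: float) -> bool:
    """A stall longer than threshold * interval drops the node from membership."""
    missed = int(pause_seconds // HEARTBEAT_INTERVAL_S)
    return missed >= MISS_THRESHOLD

print(node_evicted(3.0))  # False - short stall, the cluster survives
print(node_evicted(6.5))  # True  - node is removed from active membership
```

    The point is that a backend latency spike of only a few seconds, invisible in the Windows event log, is already longer than the heartbeat budget of the cluster.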

    With these points understood, we found that many virtual machines saw high latency or high IO around 7 PM every day, and most of the cluster failures happened at that time. Since it impacted a large number of virtual machines, it had to be caused by some component common to all of them. After capturing network packets, we eventually figured out it was the McAfee DAT update: all the virtual machines downloaded the update at the same time, driving a high workload on the shared storage and leaving the clustered virtual machines unresponsive for a few seconds. The issue was fixed after changing the McAfee DAT update schedule to a random interval.
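    The fix can be illustrated with a small sketch: instead of every VM firing the update at exactly 19:00, each VM derives its own start time inside a random window. The 60-minute window and the hostname-based seeding are illustrative choices of mine, not McAfee settings:

```python
import random

# Sketch: spread a job that used to fire at 19:00 on every VM across a
# random window, so the VMs do not hammer the shared storage at once.
def randomized_start(base_hour: int = 19, window_minutes: int = 60, seed=None) -> str:
    """Return an HH:MM start time offset randomly within the window."""
    rng = random.Random(seed)          # seeding by hostname keeps it stable per VM
    offset = rng.randint(0, window_minutes - 1)
    return f"{base_hour + offset // 60:02d}:{offset % 60:02d}"

# Each VM computes its own offset, e.g. keyed by its hostname:
for vm in ["vm01", "vm02", "vm03"]:
    print(vm, randomized_start(seed=vm))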

    There are always common activities going on in a datacenter. Each one may be a small resource consumer, but together they can become a significant monster in a virtualized datacenter: backup, monitoring, anti-virus, or system management agents. All of them can impact shared storage or network links.

     


  • How to find corresponded physical disk for Hyper-V CSV volumes

    CSV (Cluster Shared Volume) is fundamental to Microsoft Hyper-V. You must have it to leverage the Live Migration and High Availability features. But it is very confusing when you want to reclaim a CSV, since CSVs use different names than the physical disks. For example, a CSV name is usually “Cluster Disk x” and its path is usually “C:\ClusterStorage\VolumeX”, but the actual disk name in Disk Manager is “Disk x”. You have to be very careful when deleting the disk.

    (more…)


  • How to configure vCAC 6.2 LAB on VMware Workstation 11 – Part 3

    VMware vRealize Automation 6.2 Configuration

    vCAC configuration is a little complicated. I’ll separate it into 4 sections: vCAC server, IaaS, vCAC itself, and VCO configuration.

    (more…)


  • How to configure vCAC 6.2 LAB on VMware Workstation 11 – Part 2

    vCenter Server Configuration

    We will configure the identity source and permission settings on vCenter Server.

    (more…)


  • How to configure vCAC 6.2 LAB on VMware Workstation 11 – Part 1

    In previous articles I shared how to build the vCAC 6.2 lab: we created the domain controller and DNS services on DC01.contoso.com, vCenter Server on VC01.contoso.com, 3 ESXi hosts on ESX01/02/03.contoso.com, the vCAC server on vCAC.contoso.com, the IaaS server of vCAC on IaaS.contoso.com, and the FreeNAS server on FreeNAS.contoso.com.

    (more…)


  • How to Build vCAC 6.2 LAB on VMware Workstation 11 – Part 3

    VMware vRealize Automation 6.2 (vCAC) installation

    vCAC consists of 3 components: vCAC itself, vRealize Orchestrator (VCO), and the IaaS server. I used the native VCO to save resources. I’m going to cover the vCAC installation in two sections: vCAC installation and IaaS installation.

    (more…)


  • How to Build vCAC 6.2 LAB on VMware Workstation 11 – Part 2

    FreeNAS Installation

    I need FreeNAS to provide shared NFS storage for the ESXi hosts, to enable advanced features such as HA and vMotion. I gave the FreeNAS virtual machine 1GB RAM, 4 vCPUs and a 2GB local disk. Its DNS name is FreeNAS.contoso.com.

    (more…)


  • How to Build vCAC 6.2 LAB on VMware Workstation 11 – Part 1

    Recently VMware released VMware vRealize Automation Center 6.2 (vCAC). I guess there will be a newer version along with vSphere 6.0. As an IT pro you have to keep learning new stuff! I built a lab environment on my laptop for learning, and I’m going to share my implementation experience below; it took me a dozen hours plus a lot of document reading. Initially I felt it was too complicated to deploy (that seems to be a tradition of VMware products), but eventually I realized it is not easy to provide a unified self-service end-user interface on a multi-vendor infrastructure. Even OpenStack is not easy!

    (more…)


  • ESXi 5.5 and Emulex OneConnect 10Gb NIC

    *** English Version ***

    You are using an HP ProLiant BL460c G7 or Gen8, ESXi version 5.5, with an Emulex-chipset NIC and driver version 10.x.x.x. You may experience the host randomly losing connectivity to vCenter Server, with the host status showing “Not responding”. You cannot ping any virtual machine hosted on the blade. A high pause-frame count is observed on the HP Virtual Connect module downlinks after the problem occurs. And you see errors similar to the following in the vmkernel logs:

    (more…)


  • How to Change SCSI Controller Type on Virtual Machine

    Some of my virtual machines used the LSI Logic SCSI controller. It is not recommended for Red Hat 6 virtual machines, so we need to change it to the VMware Paravirtual SCSI controller.

    Basically the steps are: power off the virtual machine, change the SCSI controller type, and power on. Then you lose the operating system. 🙂

    (more…)