It’s been a month; I was busy making our environment more stable — lots of troubleshooting, WebEx sessions, and discussions. A few days ago I noticed that random VMs kept vMotioning constantly. Some VMs ended up in a strange state — showing as orphaned, invalid, or unknown — but were still online.
I couldn’t find any evidence of why the VMs went into these states. One more thing I noticed: the CPU and memory utilization of the ESXi 5.1 hosts showed 0 in vCenter Server 5.1.
The following is not a mature conclusion; it’s my inference based on DRS, HA, and that particular 0 CPU/memory value. I also discussed it with VMware BCS support.
The VMs changed to an abnormal status because vMotion was interrupted by something — most likely HA kicked in due to intermittent network/storage failures. The chance of that was high, since DRS kept trying to move heavily loaded VMs to hosts reporting 0 CPU/memory.
You have to upgrade to the latest ESXi 5.1 build and vCenter Server 5.1 Update 1c to permanently fix this problem.
Either of the following options is a temporary workaround; the issue will come back:
1. Restart the ESXi management agents.
2. Disconnect and reconnect the ESXi host in the vSphere Client.
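Option 1 can be done from an SSH session on the host. A minimal sketch, assuming SSH is enabled on the ESXi 5.x host (these are the standard agent restart commands; run them on the host itself):

```shell
# Restart all ESXi management agents, including hostd and vpxa:
/sbin/services.sh restart

# Or restart just the two agents vCenter talks to:
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
```

Restarting the agents briefly disconnects the host from vCenter, but running VMs are not affected.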
Update: you have to upgrade both the ESXi hosts and vCenter Server to permanently fix the problem.
Virtualization is more popular than ever this year; I see many companies transforming their internal infrastructure onto virtual platforms.
HA is a key feature of vSphere ESXi 5.1, and you have to consider it in every design, especially for DMZ virtual machines.
Most DMZ ESXi clusters have restricted networking policies; even ICMP may not be allowed. As you may know, HA detects whether an ESXi host is alive in two ways: storage and network.
If a host can see the shared storage, it is considered alive.
If a host can ping its default gateway, it is considered alive.
What if ping is disabled on the default gateway? You’ll get “vSphere HA agent on this host could not reach isolation address: xxx.xxx.xxx.xxx” on each host.
This can sometimes cause VMs to lose HA protection. You can use the following procedure to fix it:
1. Log in to each host via SSH.
2. Run the command “vmkping xxx.xxx.xxx.xxx” to ping ICMP-enabled IP addresses from the VMkernel ports.
3. Record the IP addresses for which the ping works.
4. Right-click the ESXi 5.1 cluster.
5. Go to Edit Settings – vSphere HA – Advanced Options.
6. Add das.isolationAddressX, where the value is an IP address from step 3 and X runs from 0 to 9.
7. Repeat step 6 to add all the preferred IP addresses.
8. Add das.useDefaultIsolationAddress with the value false.
9. Right-click each host and select Reconfigure for vSphere HA.
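As a sanity check for the advanced-option steps above, here is a small POSIX shell sketch that generates the option lines from the addresses recorded during the vmkping test (the IPs are made-up placeholders); the printed values are what you would enter under Edit Settings – vSphere HA – Advanced Options:

```shell
# Hypothetical ICMP-reachable addresses recorded from the vmkping test.
REACHABLE="10.10.1.5 10.10.1.6"

# Emit one das.isolationAddressX option per address; X runs 0 through 9,
# so HA supports at most ten custom isolation addresses.
i=0
for ip in $REACHABLE; do
  echo "das.isolationAddress$i = $ip"
  i=$((i + 1))
done

# Stop HA from also pinging the (ICMP-blocked) default gateway.
echo "das.useDefaultIsolationAddress = false"
```

Remember that the options take effect only after you reconfigure each host for vSphere HA.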
I have to say you won’t get what you anticipate if you just follow the VMware documentation. After reading a few blogs and watching some videos, I finally deployed it in production in both HA and DR modes. It consumed a lot of time, since I had to clone the VM from the US to India over the WAN. It was painful, and I’d like to share the experience to make sure you never end up in the same situation.
If you’re not familiar with vCHB, please read vCenter Server Heartbeat 5.6 – Architecture.
Before installing vCHB, you should know that:
- vCenter Server and its components are installed on the Primary Server; the Secondary Server will be cloned from it.
- vCenter Update Manager, vCenter Converter, ESXi Dump Collector, and Syslog Collector are configured using Fully Qualified Domain Names (FQDNs) rather than IP addresses.
- The time zone and time settings are correct.
- Ports 52267 and 57348 are open in the firewall on both servers.
- 2 GB of free memory is available for vCenter Server Heartbeat.
- Administrator rights are required to install vCenter Server Heartbeat.
- All vCenter Server components should be functional before installing vCenter Server Heartbeat.
- No * in the SSO master password. (I guess that’s a bug in 5.6U1; please refer to KB2034608 to reset the master password.)
- The vCenter Server FQDN is the Primary Server’s computer name. (It will be changed later.)
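A couple of these prerequisites can be sanity-checked from a command prompt on each Windows server before running the installer. A rough sketch, assuming made-up example hostnames:

```shell
REM Confirm the vCenter FQDN resolves (it should match the Primary Server name).
nslookup vcenter01.example.com

REM Confirm the vCHB ports are not already in use on this server.
netstat -an | findstr "52267 57348"
```

If `netstat` shows either port already bound, resolve that conflict before installing vCHB.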
Pre-configuration and installation of vCHB:
- Select Install VMware vCenter Server Heartbeat to start the installation.
- Select Primary to install vCHB on the Primary Server.
- Accept the agreement.
- Apply the license key.
- Select LAN or WAN according to your architecture.
- Select the Secondary Server is Virtual option. (I only tested that option.)
- Confirm the installation path.
- Select the vNIC for the VMware Channel network.
- Enter the VMware Channel IP addresses of the Primary and Secondary Servers.
For HA mode, you can use non-routable or routable IP addresses.
For DR mode, you must use routable IP addresses so that the VMware Channel networks can communicate with each other over the WAN.
- Select the vNIC for the Public network.
- Enter the Principal network IP addresses for both servers.
For HA mode, the IP address should be the same on both servers.
For DR mode, the IP addresses should be different; you have to enter them manually.
Select the options accordingly.
- If you selected Different IP addresses in the step above, you will need to enter a Windows DNS update account. (Refer to KB1008605 if you use BIND9 DNS instead of the Windows DNS service.)
- Then configure the Management network. This network is used for RDP.
- Rename the computer name on both servers. It looks as if only the Primary Server is renamed and nothing changes on the Secondary Server, but you don’t have to worry about that, since we already renamed the Secondary Server in an earlier step.
- Set the client port; I used the default.
- Select the components you want to protect and enter the vCenter login; this login must have Administrator rights on the vCenter Server.
Also enter the SSO master password. Please note that the SSO master password may differ from the SSO administrator password, so make sure you enter the correct one.
- Enter the share path you created earlier; this folder stores the cluster configuration information for the Secondary Server installation.
- vCHB starts checking the system.
- You will lose RDP connectivity for about 10 seconds during the installation due to the Packet Filter installation.
- Once the installation completes, you can start on the Secondary Server; just make sure you select Secondary.
All the other steps are similar to the Primary Server.
- Start the vCHB services on the Secondary Server.
- Open the vCenter Server Heartbeat Management Console.
- Add each node by its Management network address.
- Wait a while and you will see a screen similar to the following screenshot.
Today I saw this error message on one ESXi 5.0 host:
The number of heartbeat datastores for host is 0, which is less than required: 2
No VMs are placed on the host by DRS or HA. The VMware KB gives a solution, but it’s too complicated.
Reconfiguring HA fixes the problem:
Right-click the host -> click Reconfigure for vSphere HA -> wait for the HA configuration to complete.