ESXi 5.1 shows 0 value for CPU/memory in vCenter

It’s been a month; I was busy making our environment more stable, with a lot of troubleshooting, WebEx sessions, and discussions. A few days ago I noticed that random VMs kept vMotioning constantly. Some VMs got into a strange situation, showing orphaned, invalid, or unknown status, but they were still online.

I couldn’t find any evidence of why the VMs went into these states. One more thing I noticed was that the CPU and memory utilization of the ESXi 5.1 hosts showed 0 on vCenter Server 5.1.

The following statement is not a mature conclusion; it’s my inference based on DRS, HA, and that particular 0 CPU/memory value. I also discussed it with VMware BCS support.

The VMs changed to abnormal status because vMotion was interrupted by something, most likely HA kicking in after intermittent network/storage failures. The chance of that is high, since DRS kept trying to move heavy-workload VMs to the host reporting 0 CPU/memory.

You have to upgrade to the latest version of ESXi 5.1 or vCenter Server 5.1 Update 1c to permanently fix this problem.

Workaround:

Choose one of the following options. These are temporary solutions; the issue will come back.

1. Restart the ESXi management agents (see the commands below).

2. Disconnect and reconnect the ESXi host in the vSphere Client.
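
For option 1, here is a minimal sketch of how to restart the management agents from an SSH or console session on the host. Restarting the agents briefly disconnects the host from vCenter but does not affect running VMs:

/etc/init.d/hostd restart
/etc/init.d/vpxa restart

You can also restart all management services at once with services.sh restart, which takes a bit longer.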

Update: you have to upgrade both the ESXi hosts and the vCenter Server to permanently fix the problem.

HA for DMZ ESXi 5.1 cluster

Virtualization is more popular than ever this year; I see many companies transforming their internal infrastructure into virtual platforms.

HA is a key feature of vSphere ESXi 5.1, and you have to consider it in every design, especially for DMZ virtual machines.

 

Most DMZ ESXi clusters have restricted networking policies; even ICMP may not be allowed. As you may know, HA determines whether an ESXi host is alive in two ways: storage and network.

If the host can see the shared storage, it is considered alive.

If the host can ping its default gateway (the default isolation address), it is considered alive.

What if ping is disabled on the default gateway? You’ll get "vSphere HA agent on this host could not reach isolation address: xxx.xxx.xxx.xxx" on each host.

This can sometimes lead to VMs losing HA protection. You can use the following procedure to fix the problem.

 

  1. Log in to each host by SSH.
  2. Run "vmkping xxx.xxx.xxx.xxx" to ping an ICMP-enabled IP address from the VMkernel ports.
  3. Record the IP addresses that responded to the ping.
  4. Right-click the ESXi 5.1 cluster.
  5. Select Edit Settings > vSphere HA > Advanced Options.
  6. Add das.isolationAddressX, where the value is an IP address from step 3 and X runs from 0 to 9 (see the example after this list).
  7. Repeat step 6 to add all of the favored IP addresses.
  8. Add das.useDefaultIsolationAddress with a value of false.
  9. Right-click each host and select Reconfigure for vSphere HA.
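
As a sketch, assume 10.10.10.1 and 10.10.10.2 are ICMP-enabled addresses reachable from your DMZ hosts (these are placeholders, use addresses from your own environment). The check in step 2 and the resulting advanced options would look like this:

vmkping 10.10.10.1
vmkping 10.10.10.2

das.isolationAddress0 = 10.10.10.1
das.isolationAddress1 = 10.10.10.2
das.useDefaultIsolationAddress = false

With das.useDefaultIsolationAddress set to false, HA stops testing the unreachable default gateway and only uses the addresses you added.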

All paths lost on HBA port

HP is a great company; I like the hardware design of HP ProLiant servers, which makes datacenter maintenance and operation pretty easy. Do you like it? Today I’ll introduce a storage issue on HP ProLiant BL460 and BL480 blades. The issue happened on Qlogic HBAs with the VC-FC module. I have two dual-port Qlogic HBAs on each ESXi 5.x host, and one port of each HBA was zoned together on the SAN switch.

For example, vmhba1 and vmhba3 are zoned for LUN allocation, so each LUN has two paths, one through each HBA port.
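
If you want to confirm the path layout, you can list all paths from the ESXi shell and count how many land on each adapter; the grep/sort pipeline below is just one way to summarize the output:

esxcli storage core path list | grep "Adapter:" | sort | uniq -c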

I observed that sometimes all LUNs disappeared on a random HBA port. It doesn’t happen very frequently, but it can leave ALL VMs DEAD if you hit a storage outage while the LUNs are gone! This problem happens more frequently as your virtual infrastructure grows.

These are the symptoms when the issue is happening:

And if you log in to the SSH console and check the HBA card status with:

less /proc/scsi/qla2xxx/[Device ID]
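
If you are not sure what to use for [Device ID], listing the directory shows one numeric entry per Qlogic adapter instance (the exact numbers vary from host to host):

ls /proc/scsi/qla2xxx/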

You will find the following difference between the two HBA ports:

scsi-qla3-target-0=500a09859d812da0:030098:1000:<Offline>

See? All targets show Offline status on the problem HBA.

You have two options to fix it:

  1. Reseat the blade. Downtime and on-site resources are required.
  2. Reset the HBA with the following steps:

Record the Device ID and force the HBA to rescan:

echo "scsi-qlascan" > /proc/scsi/qla2xxx/adapter_id

Wait a few seconds, then force a LIP login:

echo "scsi-qlalip" > /proc/scsi/qla2xxx/adapter_id

Wait a few minutes and the LUNs come back online. You can refer to VMware KB 1031199 for more detail.
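
To confirm the paths really came back, I check from the same SSH session before doing anything in vCenter; vmhba1 here is just the example port from earlier, and the manual rescan is optional:

esxcli storage core path list | grep vmhba1
esxcfg-rescan vmhba1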

This is a temporary remediation; the problem will come back. I’ll show you a permanent solution in the next blog post.