Tag Archives: ESXi

ESXi 6.0 and 6.5 hosts no responding issue!

The smartd service  can leads to ESXi 6.0 or 6.5 hosts no responding due to out of memory. I see that issue on ESXi 6.0 U2 and U3 both. I would recommend stay on  ESXi 5.5 U3 at this moment.

We do see the issue persist even after stop smartd service.

Check out VMware official KB.

Advertisements

“No host data available” Reported in Hardware Status Tab

Just noticed a issue that nothing reported in ‘Hardware Status‘ tab of ESXi hosts in vSphere Web Client. KB 2112847 gives a solution but not works for me. The feature can be used to monitor hardware failures. I figured out a way to workaround it. You just need to login by Administrator account and click ‘Update‘ button under ‘Monitor‘ – ‘Hardware Status‘ for each ESXi host. You will get the status after few minutes.

Host Cannot Download Files From VMware vSphere Update Manager Patch Store

You may see following error when you scanning ESXi hosts by vCenter Update Manager.

Host cannot download files from VMware vSphere Update Manager patch store. Check the network connectivity and 
firewall setup, and check esxupdate logs for details.

You also see similar logs in /var/log/esxupdate.log.

[Errno -2] Name or service not known

The root cause could be following:

  1. ESXi host cannot resolve DNS name of vCenter Update Manager Server.
  2. One of the DNS servers incorrect if you set multiple DNS servers on ESXi host.

Transparent Page Sharing (TPS) is disabled by default in latest ESXi 5.5 patch

I just heared Transparent Page Sharing (TPS) is disabled by default in latest ESXi 5.5 patch. You may concern about that if your IT budget is tight since it means you need more memory for heavy virtual machines.

听说ESXi 5.5最新的patch里把TPS禁用了,专门研究了一下。觉得这对IT预算紧张的企业可能是个坏消息,因为这意味着你需要更多的内存应付大型虚拟机。

Continue reading

Incompatible device backing specified for device ‘x’

It’s easy to find a solution for this particular problem. VMware has a KB for this error. Somehow it’s not my case. I don’t know what’s the exactly root cause but you can try vMotion the virtual machine to other host and give a try.

Chinese Version
这个问题的解决方案很容易找到,VMware有一个知识库。但不知道为什么,我遇到的问题没法用此知识库解决。我通过vMotion虚拟机到其他ESXi 主机解决此问题,也不知道具体原因是什么。

The CPU has been disabled by the guest operating system

Few weeks ago, our database virtual machines got randomly failed. There was error message “The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.” on VM events. I didn’t find any abnormal on vmkernel, hostd and vm logs. Finally our Linux team identified it’s a Linux kernel bug. Please refer to BUG at block/blk-core.c:NNNN! in blk_requeue_request or blk finish_request.

The bug only present on RedHat 6 on ESXi 5.5 with VMware Paravirtual SCSI drivers.

 

Continue reading

Nodes in the ESXi cluster may report corruption after reboot host or attach device

VCE just released a new KB vce2563 to description the issue.

If your ESXi 5.x hosts is connected on VMAX running Enginuity 5876.159.102 and later, you may see this particular issue after reboot ESXi host or attach storage if you enabled block delete feature of VAAI.

To check the option status you can run following command on PowerCLI:

 Get-VMHost -Location cluster name | Get-VMHostAdvancedConfiguration -Name VMFS3.EnableBlockDelete

How to setup NTP services by PowerCLI

NTP service is very important for troubleshooting, vmkernel log timestamp is incorrect if your NTP service is not running and ESXi system time is wrong. It can also impact to VM system time even you disable time synchronization on VMware Tools since VM still need to sync time with ESXi after awake from suspended status, finish vMotion, or revert from snapshot.

I know it’s simple to configure NTP services on single how, what if you want to configure NTP service on massed hosts?

Basically we have 3 steps to make sure NTP service working properly:

  • Configure NTP server IP address.
  • Bring up NTP service.
  • Set services startup along with ESXi system.

Let’s try PowerCLI:

Get-VMHOST -Location Cluster Name | Add-VMHostNtpServer -NtpServer “NTP server address

Get-VMHOST -Location Cluster Name | Get-VMHostService| Where-Object {$_.key -eq “ntpd”} | Start-VMHostService

Get-VMHOST -Location Cluster Name | Get-VMHostService| Where-Object {$_.key -eq “ntpd”} | Set-VMHostService –Policy On

How to decode ESXi 5.x SCSI error code

Storage is critical component for virtualization, lot of VM performance issue is related to storage latency. You may see similar error message on vmkernel log for some case:

2014-02-11T07:18:20.541Z cpu8:425351)ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5 from world 602789 to dev “naa.514f0c5c11a00025” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0

It much like language of another planet when I first time saw itJ. Let’s see how to “translate” it to human language.

First, I split it to several sections:

a) 2014-02-11T07:18:20.541Z cpu8:425351)

b) ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5

c) from world 602789

d) to dev “naa.514f0c5c11a00025”

e) failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0

Section A shows the UTC time when the error occurred.

Section B shows what command is sent. (Actually I don’t even know what the command means is, please let me know if you know it.)

Section C shows which world the command related to.

You can found which world it is by following command

ps | grep 602789

Section D shows which storage device it show error message.

You could identify which datastore it is by following command if your datastore contains single LUN:

esxcfg-scsidevs –m naa.514f0c5c11a00025

You could also check out LUN setting and information by following command:

esxcli storage core device list –d naa.514f0c5c11a00025

esxcli storage nmp device list –d naa.514f0c5c11a00025

Section E shows SCSI sense code. That’s the part I want to give more detail.

It’s breakdown to two sections:

SCSI status codeH:0x0 D:0x2 P:0x0

H means host status

D means device status

P means plugin status

Sense data0x4 0x44 0x0

0x4 means Sense Key

0x44 means Additional Sense Code

0x0 means ASC Qualifier

Before decode, you should translate each code to NNNh notation, 0xNNN = NNNh. For example 0x7a = 7Ah, 0x77 = 77h.

SCSI status code is easy to decode. You just need to change the format and check out the code from http://www.t10.org/lists/2status.htm.

In our example H:0x0 D:0x2 P:0x0, host code 0x0 (00h) means ESX host side is good, device code 0x2 (02h) means device is not ready, plugin status code 0x0 (00h) means LUN plugin is good. (Clarify: device code 0x2 is actually means “check condition”, it’s not really means “device is not ready”, it’s just for easy understand, but looks like it confuse since “Check Condition” has different means with “Device is not Ready”. Thanks Tony point out that. )

Sense data is a little bit complicate. You have to refer two links http://www.t10.org/lists/2sensekey.htm and http://www.t10.org/lists/asc-num.txt.

In our example: 0x4 0x44 0x0, Sense Key 0x4 (4h) means HARDWARE ERROR, Additional Sense Code is 0x44 (44h) and ASC Qualifier is 0x0 (00h), combine the both code to 44h/00h, it means INTERNAL TARGET FAILURE.

Okay, then we put all decode language together:

ESX host side is good, device is not ready, LUN plugin is good because HARDWARE ERROR INTERNAL TARGET FAILURE

Actually I dumped this code from an fnic firmware/driver incompatible case. Is it make your troubleshooting more easy?J

You could also refer to following links to get more detail:

Understanding SCSI device/target NMP errors/conditions in ESX/ESXi 4.x and ESXi 5.x

Understanding SCSI host-side NMP errors/conditions in ESX 4.x and ESXi 5.x

Interpreting SCSI sense codes in VMware ESXi and ESX

Interpreting SCSI sense codes in VMware ESXi and ESX

vHBAs and other PCI devices may stop responding in ESXi 5.x when using Interrupt Remapping

Your vHBAs or other PCI devices may stop running in ESXi 5.x when using Interrupt Remapping feature.

This issue only impact to UCS blade BIOS version 1.4(3c), it has been fixed on 1.4(3j).

Please refer to http://kb.vmware.com/kb/1030265 to see how to disable Interrupt Remapping feature in ESXi 5.x

Also refer to https://tools.cisco.com/bugsearch/bug/CSCty96722.

Error: No NIC found with MAC address…

Your HP server may runs fine on ESXi 4.x or 5.0, but you may get error message No NIC found with MAC address xx:xx:xx:xx:xx:xx after upgrade to ESXi 5.1 or later.

That’s caused by network adapter firmware, you have to upgrade server network adapter firmware by HP SPP 2013.02 or later. I would recommend you upgrade firmware of each component to this version, it’s pretty stable to run ESXi 5.1.

How to find which ESXi 5.1 host lock the VM

Sometimes VM may show unknown, invalid or orphan on vCenter Server, but it still running somewhere. Some technical support engineer may request reboot VM/ESXi host, or search on each host one by one.

Declare: This article only apply to ESXi 5.1, I haven’t tested on other version.

This is easiest way to find out which host lock the VM:

  1. SSH to any host on the cluster.
  2. Go to VM folder. ( Usually it’s under /vmfs/volumes/… )
  3. Run command:  vmkfstools -D “vmx file name” | grep owner
  4. Return line similar like this:
    gen 483, mode 1, owner 529495c4-0b6a7d90-a0f3-0025b541a0dc mtime 211436
  5. The red highlight section is MAC address of owner host.
  6. Run command: esxcfg-nics -l on each ESXi host to see which host match this MAC address.

Then you need to remove the invalid VM from inventory, and login to the owner host by vSphere Client and import the VMX file again.

This procedure can save lot of time to find the real owner host, but it still consumes time if it’s a large cluster. You want to more fast? It’s possible!

After you find the MAC address, change it to regular format, like: xx:xx:xx:xx:xx:xx.

Logon vMA console and connect to vCenter Server by command: vifptarget -s vCenter Server Name

Run command: esxcfg-nics -h ESXi host name -l | grep xx:xx:xx:xx:xx:xx

More fast?

Try use Excel to list commands with all ESXi host name then past on console….

Unable to connect to web services to execute query

It’s been a long time since last post, I was pretty busy on a storage issue, I did a lot of work with hardware vendor and VMware for this weird issue.

During our troubleshooting, I noticed a minor problem when I try search VM in vSphere Client, everytime it gave me error message “Unable to connect to web services to execute query“, it requested me “Verify that the VMware VirtualCenter Management Webservices service is running

I tried to reboot vCenter Server, restart Management webservices and even re-installed vSphere Client, no lucky….Finally I fixed the problem by following step:

  • Stop VMware VirtualCenter Management Webservices service on vCenter Server.
  • Backup Data folder in C:\Program Files\VMware\Infrastructure\tomcat\webapps\sms\WEB-INF\classes\com\vmware\vim\sms.
  • Remove all sms-*.db files in Data folder.
  • Restart VMware VirtualCenter Management Webservices service.

It’s simple steps to fix the problem, but this issue confused me and VMware support for a long time. This problem appeared after we upgraded vCenter Server from 5.0 to 5.1, first thing we suspected was inventory services, error message below was logged in ds.log when we searched VM.

[2013-05-25 12:04:31,995 http-nio-/0.0.0.0-10443-exec-634  INFO  com.vmware.vim.vcauthenticate.servlets.AuthenticationServlet] Sending security error because of exception : com.vmware.vim.vcauthenticate.exception.SsoUnreachableException: com.vmware.vim.dataservices.ssoauthentication.exception.ServiceCommunicationException: com.vmware.vim.sso.admin.exception.InternalError: General failure.

It looks like a authentication issue, right? So we checked SSO, service account…etc. The unclearly logs lead to a wrong way. :-)

Since nobody complained to me, I suspected that’s a client side issue, then we tried search on another purge client but same issue. We also suspected the cache of vCenter inventory, but logs didn’t evidence it is, we cannot just reset inventory cache database since that’s production environment!

Okay, I talk too much about troubleshooting process, let’s talk about the search function of vSphere, my understood is vCenter search objects by two different way: Web Client or vSphere Client. It looks like Web Client retrieve data from database or Web Client server.

vSphere Client get data from cache database. The cache database is located in vCenter Server install folder, default path is C:\Program Files\VMware\Infrastructure\tomcat\webapps\sms\WEB-INF\classes\com\vmware\vim\sms. the cache file is actually H2 databases, it work together with Tomcat web services, sms folder contains application files of Storage Monitoring Services, it use H2 database engine v1.2.147. Please comments if you think I’m wrong.

If the H2 database incorrupt, storage monitoring services also stop working, you can find the service in Service initializing… status with warning status in vCenter Service Status node of vSphere Client.

One solution fix two issue, I like it!

 

The number of heartbeat datastores for host is 0, which is less than required: 2

Today I see this error message on one ESXi5.0 host:

The number of heartbeat datastores for host is 0, which is less than required: 2

No any VM is running on the host by DRS or HA, VMware KB gives a solution but too complicate.

Re-configure HA can fixes the problem.

Right click the host -> Click Reconfigure for vSphere HA -> Waiting HA configuration complete.