Tag Archives: ESXi

Troubleshooting Network Performance of Virtual Machine

There are several layers of networking on virtualization infrastructure. Guest operating system, Virtual Machine, ESXi driver, physical network adapters, RJ45/SFP and network switches…etc. Sometimes it’s hard to say where exactly caused problem. Especially hardware layer problems. Today I worked on a very interesting case, it may give some ideas to troubleshooting network performance issue which is caused by hardware layers.

User told me he was bothered by network performance of a virtual machine. It’s slow to copy data to NFS share. But responding of “ping” command looked good. I didn’t see any issue on virtual machine layer. VMware Tools was up to date, Windows OS was patched, virtual network adapter type was VMXNET3 and VM version was also up to date.

When I tried to copy a image file to share folder of the virtual machine, I did see sometimes speed was fast, but sometimes not. Since I have two physical uplinks, it led me guess it could be one of the uplink.

After lot of swapping and cable changing, we eventually figured out there was a bad SFP on network switch end. I was able to observe the issue by use “psping.exe” of Microsoft Sysinternals. I used following command to send different size of ping package to the virtual machine. Network drops was increasing when I increased package size.

psping.exe -l <size of package> <Destination>
Example: psping.exe -l 4k xxxx.contoso.com

The size could be 1k, 2m or even more large. I think this is a good way to identify problem outside of ESXi. Especially SFP problem as such kind of problem didn’t give any CRC or error count on network switch level.

You can also use Windows native command “ping.exe” as following. The size unit is “bytes”. So for example you need to input 4096 if you want to send 4kb.

ping.exe -l <size> <Destination>
Example: ping.exe -l 4096 xxx.contoso.com




Cannot Complete File Creation Operation When Storage vMotion

Just quick notes. I saw following error  when do storage vMotion.

Cannot Complete File Creation Operation.

When check /var/log/hostd.log. I saw following errors:

2017-11-28T02:51:04.476Z info hostd[76A80B70] [Originator@6876 sub=Vimsvc.TaskManager opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] Task Created : haTask--vim.host.OperationCleanup
2017-11-28T02:51:04.476Z info hostd[772C2B70] [Originator@6876 sub=Libs opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] CopyFromEntry: Hostlog_Dump: Hostlog /vmfs/volumes/598700ee-ec
2017-11-28T02:51:04.476Z info hostd[772C2B70] [Originator@6876 sub=Libs opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] UUID: 28dbb1b5-a9d8-e311-1061-03300000002d
2017-11-28T02:51:04.476Z info hostd[772C2B70] [Originator@6876 sub=Libs opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] MigID: 1511837464286041
2017-11-28T02:51:04.476Z info hostd[772C2B70] [Originator@6876 sub=Libs opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] HLState: none
2017-11-28T02:51:04.476Z info hostd[772C2B70] [Originator@6876 sub=Libs opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] ToFrom: none
2017-11-28T02:51:04.476Z info hostd[772C2B70] [Originator@6876 sub=Libs opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] MigType: invalid
2017-11-28T02:51:04.476Z info hostd[772C2B70] [Originator@6876 sub=Libs opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] OpType: nfc
2017-11-28T02:51:04.476Z info hostd[772C2B70] [Originator@6876 sub=Libs opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] WorldID: 0
2017-11-28T02:51:04.478Z warning hostd[772C2B70] [Originator@6876 sub=Libs opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] Hostlog_Flush: Failed to open hostlog /vmfs/volumes/598700e
2017-11-28T02:51:04.478Z warning hostd[772C2B70] [Originator@6876 sub=Vcsvc.OCM opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] PersistToDisk: failed to persist entry /vmfs/volumes/5
2017-11-28T02:51:04.478Z info hostd[772C2B70] [Originator@6876 sub=Default opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] AdapterServer caught exception: vim.fault.CannotCreateFile
2017-11-28T02:51:04.478Z info hostd[772C2B70] [Originator@6876 sub=Vimsvc.TaskManager opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] Task Completed : haTask--vim.host.OperationClean
2017-11-28T02:51:04.478Z info hostd[772C2B70] [Originator@6876 sub=Solo.Vmomi opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] Activation [N5Vmomi10ActivationE:0x75395c80] : Invoke do
2017-11-28T02:51:04.478Z verbose hostd[772C2B70] [Originator@6876 sub=Solo.Vmomi opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] Arg entry:
--> (vim.host.OperationCleanupManager.OperationEntry) {
--> hlogFile = "/vmfs/volumes/598700ee-ec0f9918-5b56-000000000000/XXX-VM-01/XXX-VM-01-375f29ae.hlog",
--> opId = 1511837464286041,
--> opState = "running",
--> opActivity = "nfc",
--> curHostUuid = "28dbb1b5-a9d8-e311-1061-03300000002d",
--> }
2017-11-28T02:51:04.478Z info hostd[772C2B70] [Originator@6876 sub=Solo.Vmomi opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] Throw vim.fault.CannotCreateFile
2017-11-28T02:51:04.478Z info hostd[772C2B70] [Originator@6876 sub=Solo.Vmomi opID=459515D4-000040D6-2d-cf-d4-7817 user=vpxuser:contoso\testuser] Result:
--> (vim.fault.CannotCreateFile) {
--> faultCause = (vmodl.MethodFault) null,
--> file = "/vmfs/volumes/598700ee-ec0f9918-5b56-000000000000/XXX-VM-01/XXX-VM-01-375f29ae.hlog",
--> msg = ""
--> }

It indicates there is a file cannot be created during migration. Further check on VM configuration file (.vmx) I noticed following parameter existing but the file doesn’t existing.

migrate.hostlog = "XXX-VM-01-375f29ae.hlog"

You cannot create the file directly. Workaround is create a .hlog file with other name then rename it to the same name.

BTW, there is a bug on ESXi 6.0 U1 for similar issue, but I saw this problem  on  U2. Just for your reference below.

Storage migration of a virtual machine with a name beginning with core fails with the error: Relocate virtual machine coreXX Cannot complete the operation because the file or folder coreXX-XXXXX.hlog already exists

Network Problems of Auto Deployed ESXi Host in LAB

I built a simple Auto Deploy environment by vSphere 6.5 on nested environment. I created virtual ESXi hosts on a physical ESXi host to do the testing. The whole configuration was smoothly, I’m impressed Auto Deploy can be implemented in few hours. One thing bothered me was networking.

New ESXi hosts cannot get IP addresses properly somehow. It’s not a single problem. The symptoms are ESXi hosts cannot get IP address, or the Configure Management Network was grayed out on console, or ESXi hosts can get IP address but no responding to ping. Just quick post my solutions here.

To fix all these problems you need to do following:

  1. Enable Promiscuous Mode on the vSwtich which is attached to nested ESXi hosts on physical ESXi hosts.
  2. (I did that on Web Client of vCenter 6.5 U1. You may see different procedure on earlier versions.) Edit the host profile of Auto DeployNetworking configurationHost port group — Highlight Management Network — The option Determine how MAC address for vmknic should be decided — Choose Use the MAC Address from which the system was PXE booted.

If you don’t do step 1, your nested ESXi hosts may not able to get DHCP IP addresses properly, or it can get IP addresses but maps to a new MAC address lead to network packages cannot be transmitted.

Nested ESXi hosts get a DHCP IP addresses when do PXE booting. The hosts get another new IP addresses when apply host profile as soon as management network is created. It could be two different IP addresses and the MAC address of management network could be a new one that not same to any of vmnics. It will be hard to trace back on network switch in real environment, so I think it’s better also to do step 2.

Update 10/25/2017 — You should choose “User must explicitly choose the policy option” in step 2 above if you have multiple NICs. The reason is DHCP IP address during PXE may be captured by random NICs. If you choose what I mentioned in step 2, you will see DHCP server may learns MAC address of a none management network NICs associated with management IP address. Please refer this article for more detail.

“No host data available” Reported in Hardware Status Tab

Just noticed a issue that nothing reported in ‘Hardware Status‘ tab of ESXi hosts in vSphere Web Client. KB 2112847 gives a solution but not works for me. The feature can be used to monitor hardware failures. I figured out a way to workaround it. You just need to login by Administrator account and click ‘Update‘ button under ‘Monitor‘ – ‘Hardware Status‘ for each ESXi host. You will get the status after few minutes.

Host Cannot Download Files From VMware vSphere Update Manager Patch Store

You may see following error when you scanning ESXi hosts by vCenter Update Manager.

Host cannot download files from VMware vSphere Update Manager patch store. Check the network connectivity and 
firewall setup, and check esxupdate logs for details.

You also see similar logs in /var/log/esxupdate.log.

[Errno -2] Name or service not known

The root cause could be following:

  1. ESXi host cannot resolve DNS name of vCenter Update Manager Server.
  2. One of the DNS servers incorrect if you set multiple DNS servers on ESXi host.

Transparent Page Sharing (TPS) is disabled by default in latest ESXi 5.5 patch

I just heared Transparent Page Sharing (TPS) is disabled by default in latest ESXi 5.5 patch. You may concern about that if your IT budget is tight since it means you need more memory for heavy virtual machines.

听说ESXi 5.5最新的patch里把TPS禁用了,专门研究了一下。觉得这对IT预算紧张的企业可能是个坏消息,因为这意味着你需要更多的内存应付大型虚拟机。

Continue reading

Incompatible device backing specified for device ‘x’

It’s easy to find a solution for this particular problem. VMware has a KB for this error. Somehow it’s not my case. I don’t know what’s the exactly root cause but you can try vMotion the virtual machine to other host and give a try.

Chinese Version
这个问题的解决方案很容易找到,VMware有一个知识库。但不知道为什么,我遇到的问题没法用此知识库解决。我通过vMotion虚拟机到其他ESXi 主机解决此问题,也不知道具体原因是什么。

The CPU has been disabled by the guest operating system

Few weeks ago, our database virtual machines got randomly failed. There was error message “The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.” on VM events. I didn’t find any abnormal on vmkernel, hostd and vm logs. Finally our Linux team identified it’s a Linux kernel bug. Please refer to BUG at block/blk-core.c:NNNN! in blk_requeue_request or blk finish_request.

The bug only present on RedHat 6 on ESXi 5.5 with VMware Paravirtual SCSI drivers.


Continue reading

Nodes in the ESXi cluster may report corruption after reboot host or attach device

VCE just released a new KB vce2563 to description the issue.

If your ESXi 5.x hosts is connected on VMAX running Enginuity 5876.159.102 and later, you may see this particular issue after reboot ESXi host or attach storage if you enabled block delete feature of VAAI.

To check the option status you can run following command on PowerCLI:

 Get-VMHost -Location cluster name | Get-VMHostAdvancedConfiguration -Name VMFS3.EnableBlockDelete

How to setup NTP services by PowerCLI

NTP service is very important for troubleshooting, vmkernel log timestamp is incorrect if your NTP service is not running and ESXi system time is wrong. It can also impact to VM system time even you disable time synchronization on VMware Tools since VM still need to sync time with ESXi after awake from suspended status, finish vMotion, or revert from snapshot.

I know it’s simple to configure NTP services on single how, what if you want to configure NTP service on massed hosts?

Basically we have 3 steps to make sure NTP service working properly:

  • Configure NTP server IP address.
  • Bring up NTP service.
  • Set services startup along with ESXi system.

Let’s try PowerCLI:

Get-VMHOST -Location Cluster Name | Add-VMHostNtpServer -NtpServer “NTP server address

Get-VMHOST -Location Cluster Name | Get-VMHostService| Where-Object {$_.key -eq “ntpd”} | Start-VMHostService

Get-VMHOST -Location Cluster Name | Get-VMHostService| Where-Object {$_.key -eq “ntpd”} | Set-VMHostService –Policy On

How to decode ESXi 5.x SCSI error code

Storage is critical component for virtualization, lot of VM performance issue is related to storage latency. You may see similar error message on vmkernel log for some case:

2014-02-11T07:18:20.541Z cpu8:425351)ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5 from world 602789 to dev “naa.514f0c5c11a00025” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0

It much like language of another planet when I first time saw itJ. Let’s see how to “translate” it to human language.

First, I split it to several sections:

a) 2014-02-11T07:18:20.541Z cpu8:425351)

b) ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5

c) from world 602789

d) to dev “naa.514f0c5c11a00025”

e) failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0

Section A shows the UTC time when the error occurred.

Section B shows what command is sent. (Actually I don’t even know what the command means is, please let me know if you know it.)

Section C shows which world the command related to.

You can found which world it is by following command

ps | grep 602789

Section D shows which storage device it show error message.

You could identify which datastore it is by following command if your datastore contains single LUN:

esxcfg-scsidevs –m naa.514f0c5c11a00025

You could also check out LUN setting and information by following command:

esxcli storage core device list –d naa.514f0c5c11a00025

esxcli storage nmp device list –d naa.514f0c5c11a00025

Section E shows SCSI sense code. That’s the part I want to give more detail.

It’s breakdown to two sections:

SCSI status codeH:0x0 D:0x2 P:0x0

H means host status

D means device status

P means plugin status

Sense data0x4 0x44 0x0

0x4 means Sense Key

0x44 means Additional Sense Code

0x0 means ASC Qualifier

Before decode, you should translate each code to NNNh notation, 0xNNN = NNNh. For example 0x7a = 7Ah, 0x77 = 77h.

SCSI status code is easy to decode. You just need to change the format and check out the code from http://www.t10.org/lists/2status.htm.

In our example H:0x0 D:0x2 P:0x0, host code 0x0 (00h) means ESX host side is good, device code 0x2 (02h) means device is not ready, plugin status code 0x0 (00h) means LUN plugin is good. (Clarify: device code 0x2 is actually means “check condition”, it’s not really means “device is not ready”, it’s just for easy understand, but looks like it confuse since “Check Condition” has different means with “Device is not Ready”. Thanks Tony point out that. )

Sense data is a little bit complicate. You have to refer two links http://www.t10.org/lists/2sensekey.htm and http://www.t10.org/lists/asc-num.txt.

In our example: 0x4 0x44 0x0, Sense Key 0x4 (4h) means HARDWARE ERROR, Additional Sense Code is 0x44 (44h) and ASC Qualifier is 0x0 (00h), combine the both code to 44h/00h, it means INTERNAL TARGET FAILURE.

Okay, then we put all decode language together:

ESX host side is good, device is not ready, LUN plugin is good because HARDWARE ERROR INTERNAL TARGET FAILURE

Actually I dumped this code from an fnic firmware/driver incompatible case. Is it make your troubleshooting more easy?J

You could also refer to following links to get more detail:

Understanding SCSI device/target NMP errors/conditions in ESX/ESXi 4.x and ESXi 5.x

Understanding SCSI host-side NMP errors/conditions in ESX 4.x and ESXi 5.x

Interpreting SCSI sense codes in VMware ESXi and ESX

Interpreting SCSI sense codes in VMware ESXi and ESX

vHBAs and other PCI devices may stop responding in ESXi 5.x when using Interrupt Remapping

Your vHBAs or other PCI devices may stop running in ESXi 5.x when using Interrupt Remapping feature.

This issue only impact to UCS blade BIOS version 1.4(3c), it has been fixed on 1.4(3j).

Please refer to http://kb.vmware.com/kb/1030265 to see how to disable Interrupt Remapping feature in ESXi 5.x

Also refer to https://tools.cisco.com/bugsearch/bug/CSCty96722.

Error: No NIC found with MAC address…

Your HP server may runs fine on ESXi 4.x or 5.0, but you may get error message No NIC found with MAC address xx:xx:xx:xx:xx:xx after upgrade to ESXi 5.1 or later.

That’s caused by network adapter firmware, you have to upgrade server network adapter firmware by HP SPP 2013.02 or later. I would recommend you upgrade firmware of each component to this version, it’s pretty stable to run ESXi 5.1.

How to find which ESXi 5.1 host lock the VM

Sometimes VM may show unknown, invalid or orphan on vCenter Server, but it still running somewhere. Some technical support engineer may request reboot VM/ESXi host, or search on each host one by one.

Declare: This article only apply to ESXi 5.1, I haven’t tested on other version.

This is easiest way to find out which host lock the VM:

  1. SSH to any host on the cluster.
  2. Go to VM folder. ( Usually it’s under /vmfs/volumes/… )
  3. Run command:  vmkfstools -D “vmx file name” | grep owner
  4. Return line similar like this:
    gen 483, mode 1, owner 529495c4-0b6a7d90-a0f3-0025b541a0dc mtime 211436
  5. The red highlight section is MAC address of owner host.
  6. Run command: esxcfg-nics -l on each ESXi host to see which host match this MAC address.

Then you need to remove the invalid VM from inventory, and login to the owner host by vSphere Client and import the VMX file again.

This procedure can save lot of time to find the real owner host, but it still consumes time if it’s a large cluster. You want to more fast? It’s possible!

After you find the MAC address, change it to regular format, like: xx:xx:xx:xx:xx:xx.

Logon vMA console and connect to vCenter Server by command: vifptarget -s vCenter Server Name

Run command: esxcfg-nics -h ESXi host name -l | grep xx:xx:xx:xx:xx:xx

More fast?

Try use Excel to list commands with all ESXi host name then past on console….

Unable to connect to web services to execute query

It’s been a long time since last post, I was pretty busy on a storage issue, I did a lot of work with hardware vendor and VMware for this weird issue.

During our troubleshooting, I noticed a minor problem when I try search VM in vSphere Client, everytime it gave me error message “Unable to connect to web services to execute query“, it requested me “Verify that the VMware VirtualCenter Management Webservices service is running

I tried to reboot vCenter Server, restart Management webservices and even re-installed vSphere Client, no lucky….Finally I fixed the problem by following step:

  • Stop VMware VirtualCenter Management Webservices service on vCenter Server.
  • Backup Data folder in C:\Program Files\VMware\Infrastructure\tomcat\webapps\sms\WEB-INF\classes\com\vmware\vim\sms.
  • Remove all sms-*.db files in Data folder.
  • Restart VMware VirtualCenter Management Webservices service.

It’s simple steps to fix the problem, but this issue confused me and VMware support for a long time. This problem appeared after we upgraded vCenter Server from 5.0 to 5.1, first thing we suspected was inventory services, error message below was logged in ds.log when we searched VM.

[2013-05-25 12:04:31,995 http-nio-/  INFO  com.vmware.vim.vcauthenticate.servlets.AuthenticationServlet] Sending security error because of exception : com.vmware.vim.vcauthenticate.exception.SsoUnreachableException: com.vmware.vim.dataservices.ssoauthentication.exception.ServiceCommunicationException: com.vmware.vim.sso.admin.exception.InternalError: General failure.

It looks like a authentication issue, right? So we checked SSO, service account…etc. The unclearly logs lead to a wrong way. :-)

Since nobody complained to me, I suspected that’s a client side issue, then we tried search on another purge client but same issue. We also suspected the cache of vCenter inventory, but logs didn’t evidence it is, we cannot just reset inventory cache database since that’s production environment!

Okay, I talk too much about troubleshooting process, let’s talk about the search function of vSphere, my understood is vCenter search objects by two different way: Web Client or vSphere Client. It looks like Web Client retrieve data from database or Web Client server.

vSphere Client get data from cache database. The cache database is located in vCenter Server install folder, default path is C:\Program Files\VMware\Infrastructure\tomcat\webapps\sms\WEB-INF\classes\com\vmware\vim\sms. the cache file is actually H2 databases, it work together with Tomcat web services, sms folder contains application files of Storage Monitoring Services, it use H2 database engine v1.2.147. Please comments if you think I’m wrong.

If the H2 database incorrupt, storage monitoring services also stop working, you can find the service in Service initializing… status with warning status in vCenter Service Status node of vSphere Client.

One solution fix two issue, I like it!


The number of heartbeat datastores for host is 0, which is less than required: 2

Today I see this error message on one ESXi5.0 host:

The number of heartbeat datastores for host is 0, which is less than required: 2

No any VM is running on the host by DRS or HA, VMware KB gives a solution but too complicate.

Re-configure HA can fixes the problem.

Right click the host -> Click Reconfigure for vSphere HA -> Waiting HA configuration complete.