How to decode ESXi 5.x SCSI error code

Storage is critical component for virtualization, lot of VM performance issue is related to storage latency. You may see similar error message on vmkernel log for some case:

2014-02-11T07:18:20.541Z cpu8:425351)ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5 from world 602789 to dev “naa.514f0c5c11a00025” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0

It much like language of another planet when I first time saw itJ. Let’s see how to “translate” it to human language.

First, I split it to several sections:

a) 2014-02-11T07:18:20.541Z cpu8:425351)

b) ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5

c) from world 602789

d) to dev “naa.514f0c5c11a00025”

e) failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0

Section A shows the UTC time when the error occurred.

Section B shows what command is sent. (Actually I don’t even know what the command means is, please let me know if you know it.)

Section C shows which world the command related to.

You can found which world it is by following command

ps | grep 602789

Section D shows which storage device it show error message.

You could identify which datastore it is by following command if your datastore contains single LUN:

esxcfg-scsidevs –m naa.514f0c5c11a00025

You could also check out LUN setting and information by following command:

esxcli storage core device list –d naa.514f0c5c11a00025

esxcli storage nmp device list –d naa.514f0c5c11a00025

Section E shows SCSI sense code. That’s the part I want to give more detail.

It’s breakdown to two sections:

SCSI status codeH:0x0 D:0x2 P:0x0

H means host status

D means device status

P means plugin status

Sense data0x4 0x44 0x0

0x4 means Sense Key

0x44 means Additional Sense Code

0x0 means ASC Qualifier

Before decode, you should translate each code to NNNh notation, 0xNNN = NNNh. For example 0x7a = 7Ah, 0x77 = 77h.

SCSI status code is easy to decode. You just need to change the format and check out the code from http://www.t10.org/lists/2status.htm.

In our example H:0x0 D:0x2 P:0x0, host code 0x0 (00h) means ESX host side is good, device code 0x2 (02h) means device is not ready, plugin status code 0x0 (00h) means LUN plugin is good. (Clarify: device code 0x2 is actually means “check condition”, it’s not really means “device is not ready”, it’s just for easy understand, but looks like it confuse since “Check Condition” has different means with “Device is not Ready”. Thanks Tony point out that. )

Sense data is a little bit complicate. You have to refer two links http://www.t10.org/lists/2sensekey.htm and http://www.t10.org/lists/asc-num.txt.

In our example: 0x4 0x44 0x0, Sense Key 0x4 (4h) means HARDWARE ERROR, Additional Sense Code is 0x44 (44h) and ASC Qualifier is 0x0 (00h), combine the both code to 44h/00h, it means INTERNAL TARGET FAILURE.

Okay, then we put all decode language together:

ESX host side is good, device is not ready, LUN plugin is good because HARDWARE ERROR INTERNAL TARGET FAILURE

Actually I dumped this code from an fnic firmware/driver incompatible case. Is it make your troubleshooting more easy?J

You could also refer to following links to get more detail:

Understanding SCSI device/target NMP errors/conditions in ESX/ESXi 4.x and ESXi 5.x

Understanding SCSI host-side NMP errors/conditions in ESX 4.x and ESXi 5.x

Interpreting SCSI sense codes in VMware ESXi and ESX

Interpreting SCSI sense codes in VMware ESXi and ESX

Error: No NIC found with MAC address…

Your HP server may runs fine on ESXi 4.x or 5.0, but you may get error message No NIC found with MAC address xx:xx:xx:xx:xx:xx after upgrade to ESXi 5.1 or later.

That’s caused by network adapter firmware, you have to upgrade server network adapter firmware by HP SPP 2013.02 or later. I would recommend you upgrade firmware of each component to this version, it’s pretty stable to run ESXi 5.1.

How to find which ESXi 5.1 host lock the VM

Sometimes VM may show unknown, invalid or orphan on vCenter Server, but it still running somewhere. Some technical support engineer may request reboot VM/ESXi host, or search on each host one by one.

Declare: This article only apply to ESXi 5.1, I haven’t tested on other version.

This is easiest way to find out which host lock the VM:

  1. SSH to any host on the cluster.
  2. Go to VM folder. ( Usually it’s under /vmfs/volumes/… )
  3. Run command:  vmkfstools -D “vmx file name” | grep owner
  4. Return line similar like this:
    gen 483, mode 1, owner 529495c4-0b6a7d90-a0f3-0025b541a0dc mtime 211436
  5. The red highlight section is MAC address of owner host.
  6. Run command: esxcfg-nics -l on each ESXi host to see which host match this MAC address.

Then you need to remove the invalid VM from inventory, and login to the owner host by vSphere Client and import the VMX file again.

This procedure can save lot of time to find the real owner host, but it still consumes time if it’s a large cluster. You want to more fast? It’s possible!

After you find the MAC address, change it to regular format, like: xx:xx:xx:xx:xx:xx.

Logon vMA console and connect to vCenter Server by command: vifptarget -s vCenter Server Name

Run command: esxcfg-nics -h ESXi host name -l | grep xx:xx:xx:xx:xx:xx

More fast?

Try use Excel to list commands with all ESXi host name then past on console….

Unable to connect to web services to execute query

It’s been a long time since last post, I was pretty busy on a storage issue, I did a lot of work with hardware vendor and VMware for this weird issue.

During our troubleshooting, I noticed a minor problem when I try search VM in vSphere Client, everytime it gave me error message “Unable to connect to web services to execute query“, it requested me “Verify that the VMware VirtualCenter Management Webservices service is running

I tried to reboot vCenter Server, restart Management webservices and even re-installed vSphere Client, no lucky….Finally I fixed the problem by following step:

  • Stop VMware VirtualCenter Management Webservices service on vCenter Server.
  • Backup Data folder in C:\Program Files\VMware\Infrastructure\tomcat\webapps\sms\WEB-INF\classes\com\vmware\vim\sms.
  • Remove all sms-*.db files in Data folder.
  • Restart VMware VirtualCenter Management Webservices service.

It’s simple steps to fix the problem, but this issue confused me and VMware support for a long time. This problem appeared after we upgraded vCenter Server from 5.0 to 5.1, first thing we suspected was inventory services, error message below was logged in ds.log when we searched VM.

[2013-05-25 12:04:31,995 http-nio-/0.0.0.0-10443-exec-634  INFO  com.vmware.vim.vcauthenticate.servlets.AuthenticationServlet] Sending security error because of exception : com.vmware.vim.vcauthenticate.exception.SsoUnreachableException: com.vmware.vim.dataservices.ssoauthentication.exception.ServiceCommunicationException: com.vmware.vim.sso.admin.exception.InternalError: General failure.

It looks like a authentication issue, right? So we checked SSO, service account…etc. The unclearly logs lead to a wrong way. 🙂

Since nobody complained to me, I suspected that’s a client side issue, then we tried search on another purge client but same issue. We also suspected the cache of vCenter inventory, but logs didn’t evidence it is, we cannot just reset inventory cache database since that’s production environment!

Okay, I talk too much about troubleshooting process, let’s talk about the search function of vSphere, my understood is vCenter search objects by two different way: Web Client or vSphere Client. It looks like Web Client retrieve data from database or Web Client server.

vSphere Client get data from cache database. The cache database is located in vCenter Server install folder, default path is C:\Program Files\VMware\Infrastructure\tomcat\webapps\sms\WEB-INF\classes\com\vmware\vim\sms. the cache file is actually H2 databases, it work together with Tomcat web services, sms folder contains application files of Storage Monitoring Services, it use H2 database engine v1.2.147. Please comments if you think I’m wrong.

If the H2 database incorrupt, storage monitoring services also stop working, you can find the service in Service initializing… status with warning status in vCenter Service Status node of vSphere Client.

One solution fix two issue, I like it!