How to decode ESXi 5.x SCSI error code

Storage is a critical component of virtualization, and many VM performance issues are related to storage latency. In some cases you may see an error message like this one in the vmkernel log:

2014-02-11T07:18:20.541Z cpu8:425351)ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5 from world 602789 to dev "naa.514f0c5c11a00025" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0

It looked like a language from another planet the first time I saw it :) Let’s see how to “translate” it into human language.

First, I split it into several sections:

a) 2014-02-11T07:18:20.541Z cpu8:425351)

b) ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5

c) from world 602789

d) to dev "naa.514f0c5c11a00025"

e) failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0
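The same split can be sketched in Python. This is only a minimal sketch: the regular expression is inferred from the single sample line above, so it is an assumption about the layout rather than a specification of the vmkernel log format, and the names `LINE_RE` and `parse_line` are made up for illustration.

```python
import re

# Pattern inferred from the one sample line in this post (an assumption,
# not an official log-format definition). Accepts straight or curly quotes.
LINE_RE = re.compile(
    r'(?P<time>\S+)\s+cpu(?P<cpu>\d+):(?P<log_world>\d+)\)'
    r'ScsiDeviceIO:\s*\d+:\s*Cmd\((?P<cmd_addr>0x[0-9a-fA-F]+)\)\s+'
    r'(?P<opcode>0x[0-9a-fA-F]+),\s*CmdSN\s+(?P<cmdsn>0x[0-9a-fA-F]+)\s+'
    r'from world (?P<world>\d+)\s+'
    r'to dev ["\u201c](?P<device>[^"\u201d]+)["\u201d]\s+'
    r'failed H:(?P<host>0x[0-9a-fA-F]+) D:(?P<dev_status>0x[0-9a-fA-F]+) '
    r'P:(?P<plugin>0x[0-9a-fA-F]+)\s+'
    r'(?P<sense_kind>Valid|Possible) sense data:\s*'
    r'(?P<key>0x[0-9a-fA-F]+) (?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)'
)

# Fields that hold hexadecimal codes and should be converted to integers.
HEX_FIELDS = ("opcode", "cmdsn", "host", "dev_status", "plugin", "key", "asc", "ascq")


def parse_line(line: str) -> dict:
    """Split one ScsiDeviceIO failure line into sections A to E."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("line does not look like the sample ScsiDeviceIO failure")
    fields = match.groupdict()
    for name in HEX_FIELDS:
        fields[name] = int(fields[name], 16)
    fields["world"] = int(fields["world"])
    return fields


SAMPLE = ('2014-02-11T07:18:20.541Z cpu8:425351)ScsiDeviceIO: 2331: '
          'Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5 from world 602789 to dev '
          '"naa.514f0c5c11a00025" failed H:0x0 D:0x2 P:0x0 '
          'Valid sense data: 0x4 0x44 0x0')
```

Calling `parse_line(SAMPLE)` returns a dictionary with one entry per field, for example `world` (602789) and `device` (naa.514f0c5c11a00025).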

Section A shows the UTC time when the error occurred.

Section B shows which command was sent. (Actually, I don’t know what the command means; please let me know if you do.)
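As the first comment below points out, 0x2a is the SCSI WRITE(10) operation code. A small lookup table can do that translation. The sketch below covers only a handful of common opcodes from the T10 operation code list; the names `SCSI_OPCODES` and `opcode_name` are hypothetical, and anything not in the table should be looked up at http://www.t10.org/lists/op-num.txt.

```python
# A few common SCSI operation codes (deliberately not exhaustive).
SCSI_OPCODES = {
    0x00: "TEST UNIT READY",
    0x03: "REQUEST SENSE",
    0x12: "INQUIRY",
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
    0x88: "READ(16)",
    0x8A: "WRITE(16)",
}


def opcode_name(code: str) -> str:
    """Translate an opcode string such as '0x2a' into its command name."""
    value = int(code, 16)  # int() accepts the '0x' prefix when base is 16
    return SCSI_OPCODES.get(value, f"unknown opcode {value:02X}h")
```

For the sample line, `opcode_name("0x2a")` gives `WRITE(10)`: the host tried to write to the device and the write failed.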

Section C shows which world the command is related to.

You can find out which world it is with the following command:

ps | grep 602789

Section D shows which storage device reported the error.

If your datastore contains a single LUN, you can identify the datastore with the following command:

esxcfg-scsidevs -m naa.514f0c5c11a00025

You can also check the LUN settings and information with the following commands:

esxcli storage core device list -d naa.514f0c5c11a00025

esxcli storage nmp device list -d naa.514f0c5c11a00025

Section E shows the SCSI status and sense codes. That’s the part I want to cover in more detail.

It breaks down into two parts:

SCSI status code: H:0x0 D:0x2 P:0x0

H means host status

D means device status

P means plugin status

Sense data: 0x4 0x44 0x0

0x4 means Sense Key

0x44 means Additional Sense Code

0x0 means ASC Qualifier

Before decoding, translate each code into the NNh notation used by the T10 lists: 0xNN = NNh. For example, 0x7a = 7Ah and 0x77 = 77h.
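The notation change is mechanical, so it can be scripted. Here is a minimal sketch; the function name `to_h_notation` and its `width` parameter are made up for illustration.

```python
def to_h_notation(code: str, width: int = 2) -> str:
    """Convert '0x7a' style hex into the 'NNh' notation used by the T10 lists.

    width is the minimum number of hex digits: 2 suits status codes and
    ASC/ASCQ values, 1 suits the single-digit sense key.
    """
    return f"{int(code, 16):0{width}X}h"
```

With this, `to_h_notation("0x7a")` gives `7Ah`, and `to_h_notation("0x4", width=1)` gives `4h`.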

The SCSI status code is easy to decode. You just need to change the format and look up the code at http://www.t10.org/lists/2status.htm.

In our example H:0x0 D:0x2 P:0x0, host code 0x0 (00h) means the ESX host side is good, device code 0x2 (02h) means Check Condition (the device has more information to report, which is what the sense data carries), and plugin status code 0x0 (00h) means the LUN plugin is good. (Clarification: I originally described device code 0x2 as “device is not ready” to keep things simple, but that was confusing, because “Check Condition” and “Device is Not Ready” mean different things. Thanks to Tony for pointing that out.)
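The status lookup can be sketched as below. The device status table follows the T10 status code list linked above. For the host and plugin fields the sketch only distinguishes 0x0 (good) from everything else, because their meanings come from the VMware KB articles rather than from T10. The names `DEVICE_STATUS` and `decode_status` are hypothetical.

```python
import re

# SCSI device status codes as listed in the T10 status code table.
DEVICE_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x04: "CONDITION MET",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
    0x30: "ACA ACTIVE",
    0x40: "TASK ABORTED",
}

STATUS_RE = re.compile(r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)")


def decode_status(text: str) -> dict:
    """Decode an 'H:0x0 D:0x2 P:0x0' triple into readable strings."""
    match = STATUS_RE.search(text)
    if match is None:
        raise ValueError("no H:/D:/P: status triple found")
    host, device, plugin = (int(group, 16) for group in match.groups())
    return {
        # Non-zero host/plugin codes are VMware-specific; this sketch does not decode them.
        "host": "OK" if host == 0 else f"host-side condition {host:#x}",
        "device": DEVICE_STATUS.get(device, f"unlisted status {device:02X}h"),
        "plugin": "OK" if plugin == 0 else f"plugin condition {plugin:#x}",
    }
```

For the example, `decode_status("H:0x0 D:0x2 P:0x0")` reports the host and plugin as OK and the device status as CHECK CONDITION.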

Sense data is a little more complicated. You have to refer to two lists: http://www.t10.org/lists/2sensekey.htm and http://www.t10.org/lists/asc-num.txt.

In our example, 0x4 0x44 0x0, Sense Key 0x4 (4h) means HARDWARE ERROR. The Additional Sense Code is 0x44 (44h) and the ASC Qualifier is 0x0 (00h); combine the two into 44h/00h, which means INTERNAL TARGET FAILURE.
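The two lookups can be sketched as follows. The sense key table follows the T10 sense key list. The ASC/ASCQ table is only a two-entry excerpt (the example from this post and the "not ready" example from the comments below), so any other pair has to be looked up in asc-num.txt. The names `SENSE_KEYS`, `ASC_ASCQ` and `decode_sense` are hypothetical.

```python
# Sense keys as listed in the T10 sense key table.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC",
    0xA: "COPY ABORTED",
    0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW",
    0xE: "MISCOMPARE",
}

# Tiny excerpt of the ASC/ASCQ list; the real list has hundreds of entries.
ASC_ASCQ = {
    (0x44, 0x00): "INTERNAL TARGET FAILURE",
    (0x04, 0x03): "LOGICAL UNIT NOT READY, MANUAL INTERVENTION REQUIRED",
}


def decode_sense(text: str) -> str:
    """Decode 'key asc ascq' sense data such as '0x4 0x44 0x0'."""
    key, asc, ascq = (int(token, 16) for token in text.split())
    key_name = SENSE_KEYS.get(key, f"unlisted sense key {key:X}h")
    detail = ASC_ASCQ.get((asc, ascq), "not in this excerpt, see asc-num.txt")
    return f"{key_name}: {detail} ({asc:02X}h/{ascq:02X}h)"
```

For the example, `decode_sense("0x4 0x44 0x0")` yields `HARDWARE ERROR: INTERNAL TARGET FAILURE (44h/00h)`.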

Okay, now let’s put all the decoded parts together:

The ESX host side is good, the device returned Check Condition, and the LUN plugin is good; the sense data reports HARDWARE ERROR, INTERNAL TARGET FAILURE.

Actually, I took this code from a case of fnic firmware/driver incompatibility. Does this make your troubleshooting easier? :)

You can also refer to the following links for more detail:

Understanding SCSI device/target NMP errors/conditions in ESX/ESXi 4.x and ESXi 5.x

Understanding SCSI host-side NMP errors/conditions in ESX 4.x and ESXi 5.x

Interpreting SCSI sense codes in VMware ESXi and ESX


Author: Wu

VCP, MCSE, CCNA

2 thoughts on “How to decode ESXi 5.x SCSI error code”

  1. Just wanted to comment to make some corrections on your post, as I stumbled across your blog while searching for something else.

    1. “Section B shows what command is sent. (Actually I don’t even know what the command means is, please let me know if you know it.) ”

    The command 0x2A stands for WRITE(10). Long story short, your host attempted to write some data to disk and it failed. You can get a full list of SCSI commands from http://www.t10.org/lists/op-num.txt, or you can use Wikipedia as a shortcut: http://en.wikipedia.org/wiki/SCSI_command. Please note that even though these commands are well known, most disks/arrays do not support every command in the SCSI spec.

    2. “In our example H:0x0 D:0x2 P:0x0, host code 0x0 (00h) means ESX host side is good, device code 0x2 (02h) means device is not ready, plugin status code 0x0 (00h) means LUN plugin is good.”

    Emphasis on the “device is not ready”. That is not what D:0x2 means here. What it means instead is Check Condition. All that tells you is that the device (or array, if you’re using external storage) has more information to tell you. Your host will automatically issue command 0x03, which is Request Sense, to get more information. That information will tell you what the actual problem is, should there be one.

    And just for your reference, if your device was actually not ready, you would see a check condition (D:0x2) and sense bytes “Valid sense data: 0x2 0x4 0x3” (or some variant of 0x2 0xXX 0xXX, as there are multiple “not ready” conditions in the spec).

    3. “In our example: 0x4 0x44 0x0, Sense Key 0x4 (4h) means HARDWARE ERROR, Additional Sense Code is 0x44 (44h) and ASC Qualifier is 0x0 (00h), combine the both code to 44h/00h, it means INTERNAL TARGET FAILURE.”

    You are correct here with this decode; I just want to clarify something. Internal target failure can mean many, many things. In plain English, it simply means “the disk (or array) aborted the command”. Depending on your hardware this can have many causes, and you should always consult your vendor if you are seeing this issue. This is due to a limitation of the SCSI spec, as there is not always an appropriate code to tell the host why a particular command failed. Off the top of my head, this error could be due to bad cabling (so the target aborts the command since what it received is garbage), it could be to preserve data integrity due to the timing of your command, it could be due to a legitimate hardware problem, etc. Again, if you see a check condition specifying internal target failure, consult your disk vendor or array vendor so they can determine exactly why the internal target failure is occurring.

    4. One final note. Note how it says “Valid sense data”. That means you have a confirmed response from the disk/array. “Possible sense data”, should it appear, means you do not have actual sense data from the array and the host is interpreting on its own. This normally appears when the host reports an issue (for example, H:0x5, which is an I/O aborted mid-flight). Possible sense data should not be trusted as an actual message from the array.

    1. I appreciate your technical replies! They really help me understand storage more deeply. I updated my blog regarding item 2.
      Regarding item 3, “this error could be due to bad cabling”: I think the SAN switch port would show discards or errors if the cabling were bad, wouldn’t it?
      Item 4 extends my knowledge; I never noticed that part. Thanks again!
