Storage is a critical component of virtualization, and many VM performance issues are related to storage latency. In some cases you may see an error message like this in the vmkernel log:
2014-02-11T07:18:20.541Z cpu8:425351)ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5 from world 602789 to dev "naa.514f0c5c11a00025" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0
It looked like a language from another planet the first time I saw it :). Let’s see how to “translate” it into human language.
First, I split it into several sections:
a) 2014-02-11T07:18:20.541Z cpu8:425351)
b) ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5
c) from world 602789
d) to dev "naa.514f0c5c11a00025"
e) failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0
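If you have many of these lines to go through, the same split can be done by a script. Here is a minimal Python sketch. The pattern and the name split_sections are my own, written only against the sample line in this post, so other vmkernel message formats will probably not match it.

```python
import re

# Hypothetical pattern built from the single sample line in this post.
# It accepts straight or curly quotes around the device name.
LINE_RE = re.compile(
    r'(?P<timestamp>\S+) (?P<cpu>cpu\d+:\d+)\)'          # section A
    r'(?P<command>ScsiDeviceIO: .*?) '                   # section B
    r'from world (?P<world>\d+) '                        # section C
    r'to dev ["\u201c](?P<device>[^"\u201d]+)["\u201d] ' # section D
    r'failed H:(?P<host>0x[0-9a-fA-F]+) '                # section E
    r'D:(?P<device_status>0x[0-9a-fA-F]+) '
    r'P:(?P<plugin>0x[0-9a-fA-F]+)'
    r'(?: Valid sense data: (?P<sense_key>0x[0-9a-fA-F]+) '
    r'(?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+))?'
)


def split_sections(line):
    """Return the sections of a ScsiDeviceIO line as a dict, or None."""
    match = LINE_RE.search(line)
    return match.groupdict() if match else None


SAMPLE = (
    '2014-02-11T07:18:20.541Z cpu8:425351)ScsiDeviceIO: 2331: '
    'Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5 from world 602789 '
    'to dev "naa.514f0c5c11a00025" failed H:0x0 D:0x2 P:0x0 '
    'Valid sense data: 0x4 0x44 0x0'
)

for name, value in split_sections(SAMPLE).items():
    print(name, "=", value)
```

The sense data group is optional in the pattern, so a line that stops after the P: value still splits; its three sense fields simply come back as None.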
Section A shows the UTC time when the error occurred.
Section B shows which command was sent. (I don’t actually know what every field of the command means; please let me know if you do.)
Section C shows which world the command relates to.
You can find out which world it is with the following command:
ps | grep 602789
Section D shows which storage device reported the error.
If your datastore consists of a single LUN, you can identify which datastore it is with the following command:
esxcfg-scsidevs -m naa.514f0c5c11a00025
You can also check the LUN settings and information with the following commands:
esxcli storage core device list -d naa.514f0c5c11a00025
esxcli storage nmp device list -d naa.514f0c5c11a00025
Section E shows the SCSI sense code. That’s the part I want to cover in more detail.
It breaks down into two parts:
SCSI status code – H:0x0 D:0x2 P:0x0
H means host status
D means device status
P means plugin status
Sense data – 0x4 0x44 0x0
0x4 is the Sense Key
0x44 is the Additional Sense Code (ASC)
0x0 is the ASC Qualifier (ASCQ)
Before decoding, convert each code to NNh notation: 0xNN = NNh. For example, 0x7a = 7Ah and 0x77 = 77h.
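The conversion is just a change of spelling, and it is easy to script. A small Python sketch follows; the function name to_h_notation and its width parameter are mine, not part of any VMware tool.

```python
def to_h_notation(code, width=2):
    """Convert a hex string such as '0x7a' to 'h' notation such as '7Ah'.

    width is the minimum number of hex digits; shorter values are
    zero-padded, so '0x0' becomes '00h' with the default width.
    """
    value = int(code, 16)  # int() accepts the '0x' prefix when base is 16
    return format(value, "0{}X".format(width)) + "h"


print(to_h_notation("0x7a"))
print(to_h_notation("0x0"))
print(to_h_notation("0x4", width=1))
```

Use width=1 for the Sense Key, which is a single hex digit, and the default width for everything else.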
The SCSI status code is easy to decode. You just need to change the format and look the code up at http://www.t10.org/lists/2status.htm.
In our example, H:0x0 D:0x2 P:0x0: host status 0x0 (00h) means the ESX host side is good, device status 0x2 (02h) means CHECK CONDITION (the device has something to report, and the sense data says what), and plugin status 0x0 (00h) means the LUN plugin is good. (Clarification: I first described device status 0x2 as “device is not ready” to keep things simple, but that turned out to be confusing, because “Check Condition” and “Device is not Ready” mean different things. Thanks to Tony for pointing that out.)
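For the D: value, the lookup can be kept in a small table. The sketch below is in Python and lists the standard SCSI status codes; the names SCSI_STATUS and device_status_name are my own. It only applies to the D: value and makes no attempt to decode H: or P:.

```python
# Standard SCSI status codes, keyed by their numeric value.
SCSI_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x04: "CONDITION MET",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
    0x30: "ACA ACTIVE",
    0x40: "TASK ABORTED",
}


def device_status_name(code):
    """Map a D: value such as '0x2' to its SCSI status name."""
    value = int(code, 16)
    return SCSI_STATUS.get(value, "unknown status {:02X}h".format(value))


print(device_status_name("0x2"))
```

Codes that are not in the table come back as "unknown status NNh" so you can still look them up by hand.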
In our example, 0x4 0x44 0x0: the Sense Key 0x4 (4h) means HARDWARE ERROR, the Additional Sense Code is 0x44 (44h), and the ASC Qualifier is 0x0 (00h). Combining the last two gives 44h/00h, which means INTERNAL TARGET FAILURE.
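The sense data can be decoded the same way. Here is a Python sketch with the standard sense keys and, as an assumption to keep it short, only the one ASC/ASCQ pair from this example; the full ASC/ASCQ list is far too long to paste here. The names SENSE_KEYS, ASC_ASCQ and decode_sense are mine.

```python
# Standard SCSI sense keys (reserved and rarely seen values left out).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC",
    0xA: "COPY ABORTED",
    0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW",
    0xE: "MISCOMPARE",
}

# Deliberately tiny: only the pair from this post's example.
ASC_ASCQ = {
    (0x44, 0x00): "INTERNAL TARGET FAILURE",
}


def decode_sense(sense_key, asc, ascq):
    """Return (sense key name, ASC/ASCQ description) for hex strings."""
    key = int(sense_key, 16)
    pair = (int(asc, 16), int(ascq, 16))
    key_name = SENSE_KEYS.get(key, "unknown sense key")
    # Fall back to a pointer at the manual lookup for pairs not listed.
    detail = ASC_ASCQ.get(
        pair, "look up {:02X}h/{:02X}h in the T10 ASC/ASCQ list".format(*pair)
    )
    return key_name, detail


print(decode_sense("0x4", "0x44", "0x0"))
```

For any pair missing from the table, the function returns the NNh/NNh form so you can search for it in the T10 list yourself.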
Okay, now let’s put all the decoded pieces together:
The ESX host side is good, the device returned CHECK CONDITION, and the LUN plugin is good; the sense data says HARDWARE ERROR, INTERNAL TARGET FAILURE.
I actually took this code from an fnic firmware/driver incompatibility case. Does it make your troubleshooting easier? :)
You can also refer to the following links for more detail: