Tag: ESXi

  • Nodes in the ESXi cluster may report corruption after rebooting a host or attaching a device

    VCE just released a new KB, vce2563, that describes the issue.

    If your ESXi 5.x hosts are connected to a VMAX running Enginuity 5876.159.102 or later, you may see this issue after rebooting an ESXi host or attaching storage if the VAAI block delete feature is enabled.

    To check the status of this option, run the following command in PowerCLI:

     Get-VMHost -Location "cluster name" | Get-VMHostAdvancedConfiguration -Name VMFS3.EnableBlockDelete

  • How to set up NTP services with PowerCLI

    NTP service is very important for troubleshooting: vmkernel log timestamps are incorrect if the NTP service is not running and the ESXi system time is wrong. It can also affect VM system time even if you disable time synchronization in VMware Tools, since the VM still syncs its time with ESXi after resuming from suspend, finishing a vMotion, or reverting to a snapshot.

    I know it’s simple to configure NTP on a single host, but what if you want to configure it on many hosts?

    Basically there are 3 steps to make sure the NTP service works properly:

    • Configure the NTP server IP address.
    • Start the NTP service.
    • Set the service to start along with the ESXi system.

    Let’s try PowerCLI:

    Get-VMHost -Location "Cluster Name" | Add-VMHostNtpServer -NtpServer "NTP server address"

    Get-VMHost -Location "Cluster Name" | Get-VMHostService | Where-Object {$_.Key -eq "ntpd"} | Start-VMHostService

    Get-VMHost -Location "Cluster Name" | Get-VMHostService | Where-Object {$_.Key -eq "ntpd"} | Set-VMHostService -Policy On

  • How to decode ESXi 5.x SCSI error codes

    Storage is a critical component of virtualization, and a lot of VM performance issues are related to storage latency. In some cases you may see an error message like this in the vmkernel log:

    2014-02-11T07:18:20.541Z cpu8:425351)ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5 from world 602789 to dev "naa.514f0c5c11a00025" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0

    It looked like a language from another planet the first time I saw it 🙂. Let’s see how to “translate” it into human language.

    First, I split it into several sections:

    a) 2014-02-11T07:18:20.541Z cpu8:425351)

    b) ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5

    c) from world 602789

    d) to dev "naa.514f0c5c11a00025"

    e) failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0

    Section A shows the UTC time when the error occurred.

    Section B shows which command was sent. (Actually, I don’t know what the command means; please let me know if you do.)

    Section C shows which world the command relates to.

    You can find out which world it is with the following command:

    ps | grep 602789

    Section D shows which storage device reported the error.

    If your datastore consists of a single LUN, you can identify the datastore with the following command:

    esxcfg-scsidevs -m naa.514f0c5c11a00025

    You can also check the LUN settings and information with the following commands:

    esxcli storage core device list -d naa.514f0c5c11a00025

    esxcli storage nmp device list -d naa.514f0c5c11a00025

    Section E shows the SCSI status and sense codes. That’s the part I want to cover in more detail.

    It breaks down into two parts:

    SCSI status code: H:0x0 D:0x2 P:0x0

    H is the host status

    D is the device status

    P is the plugin status

    Sense data: 0x4 0x44 0x0

    0x4 is the Sense Key

    0x44 is the Additional Sense Code (ASC)

    0x0 is the ASC Qualifier (ASCQ)

    Before decoding, translate each code to the NNh notation used by T10: 0xNN = NNh. For example, 0x7a = 7Ah and 0x77 = 77h.

    The SCSI status code is easy to decode. You just need to change the format and look up the code at http://www.t10.org/lists/2status.htm.

    In our example, H:0x0 D:0x2 P:0x0: host code 0x0 (00h) means the ESX host side is good, device code 0x2 (02h) means the device returned CHECK CONDITION, and plugin status code 0x0 (00h) means the LUN plugin is good. (Clarification: I originally described device code 0x2 as “device is not ready” to keep things simple, but that is confusing because “Check Condition” and “Device is Not Ready” mean different things. CHECK CONDITION only tells you that sense data is available and should be examined. Thanks to Tony for pointing that out.)

    The sense data is a little more complicated. You have to refer to two links: http://www.t10.org/lists/2sensekey.htm and http://www.t10.org/lists/asc-num.txt.

    In our example, 0x4 0x44 0x0: Sense Key 0x4 (4h) means HARDWARE ERROR, the Additional Sense Code is 0x44 (44h), and the ASC Qualifier is 0x0 (00h). Combine the last two codes into 44h/00h, which means INTERNAL TARGET FAILURE.

    Okay, now let’s put the decoded pieces together:

    The ESX host side is good, the device returned CHECK CONDITION, and the LUN plugin is good; the sense data reports HARDWARE ERROR, INTERNAL TARGET FAILURE.
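
    The manual decoding above can also be sketched in a few lines of Python. This is only an illustration: the function names decode and t10 are my own, and the lookup tables hold just a handful of values copied from the T10 lists linked above, so anything not in the tables is printed in NNh notation for you to look up by hand.

```python
import re

# Partial lookup tables (a small excerpt of the T10 lists, not the full set).
DEVICE_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
}
SENSE_KEY = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}
ASC_ASCQ = {
    (0x44, 0x00): "INTERNAL TARGET FAILURE",
}

HEX = r"0x[0-9a-fA-F]+"
LINE_RE = re.compile(
    r'from world (?P<world>\d+) to dev ["“](?P<dev>[^"”]+)["”] failed '
    rf"H:(?P<h>{HEX}) D:(?P<d>{HEX}) P:(?P<p>{HEX})"
    rf"(?: (?:Valid|Possible) sense data: (?P<key>{HEX}) (?P<asc>{HEX}) (?P<ascq>{HEX}))?"
)


def t10(value):
    """Format an integer in the NNh notation used by the T10 lists."""
    return f"{value:02X}h"


def decode(line):
    """Split a ScsiDeviceIO failure line into its decoded fields."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not a ScsiDeviceIO failure line")
    host, dev_status, plugin = (int(m[k], 16) for k in ("h", "d", "p"))
    result = {
        "world": int(m["world"]),
        "device": m["dev"],
        "host": t10(host),
        "device_status": DEVICE_STATUS.get(dev_status, t10(dev_status)),
        "plugin": t10(plugin),
    }
    if m["key"] is not None:
        key, asc, ascq = (int(m[k], 16) for k in ("key", "asc", "ascq"))
        result["sense_key"] = SENSE_KEY.get(key, t10(key))
        result["additional_sense"] = ASC_ASCQ.get(
            (asc, ascq), f"{t10(asc)}/{t10(ascq)}"
        )
    return result


# The sample line from this post.
SAMPLE = (
    "2014-02-11T07:18:20.541Z cpu8:425351)ScsiDeviceIO: 2331: "
    "Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5 from world 602789 to dev "
    '"naa.514f0c5c11a00025" failed H:0x0 D:0x2 P:0x0 '
    "Valid sense data: 0x4 0x44 0x0"
)

if __name__ == "__main__":
    for field, value in decode(SAMPLE).items():
        print(f"{field}: {value}")
```

    Host and plugin codes are left in NNh form on purpose, because I did not include tables for them; check those against the VMware KB articles listed at the end of this post.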

    I actually took this code from a case of fnic firmware/driver incompatibility. Does it make your troubleshooting easier? 🙂

    You can also refer to the following links for more detail:

    Understanding SCSI device/target NMP errors/conditions in ESX/ESXi 4.x and ESXi 5.x

    Understanding SCSI host-side NMP errors/conditions in ESX 4.x and ESXi 5.x

    Interpreting SCSI sense codes in VMware ESXi and ESX

  • vHBAs and other PCI devices may stop responding in ESXi 5.x when using Interrupt Remapping

    Your vHBAs or other PCI devices may stop responding in ESXi 5.x when the Interrupt Remapping feature is in use.

    This issue only affects UCS blade BIOS version 1.4(3c); it has been fixed in 1.4(3j).

    Please refer to http://kb.vmware.com/kb/1030265 to see how to disable the Interrupt Remapping feature in ESXi 5.x.

    Also refer to https://tools.cisco.com/bugsearch/bug/CSCty96722.

  • Error: No NIC found with MAC address…

    Your HP server may run fine on ESXi 4.x or 5.0, but you may get the error message No NIC found with MAC address xx:xx:xx:xx:xx:xx after upgrading to ESXi 5.1 or later.

    This is caused by the network adapter firmware; you have to upgrade the server’s network adapter firmware with HP SPP 2013.02 or later. I would recommend upgrading the firmware of every component to this version, as it is pretty stable for running ESXi 5.1.

  • How to find which ESXi 5.1 host locks the VM

    Sometimes a VM shows as unknown, invalid, or orphaned in vCenter Server while it is still running somewhere. Some support engineers may ask you to reboot the VM or the ESXi host, or to search each host one by one.

    Disclaimer: this article applies only to ESXi 5.1; I haven’t tested it on other versions.

    This is the easiest way to find out which host locks the VM:

    1. SSH to any host in the cluster.
    2. Go to the VM folder. (Usually it’s under /vmfs/volumes/… )
    3. Run the command:  vmkfstools -D "vmx file name" | grep owner
    4. It returns a line similar to this:
      gen 483, mode 1, owner 529495c4-0b6a7d90-a0f3-0025b541a0dc mtime 211436
    5. The last segment of the owner field (0025b541a0dc here, highlighted in red in the original post) is the MAC address of the owner host.
    6. Run the command esxcfg-nics -l on each ESXi host to see which host matches this MAC address.

    Then remove the invalid VM from the inventory, log in to the owner host with the vSphere Client, and import the VMX file again.

    This procedure can save a lot of time when looking for the real owner host, but it is still time-consuming in a large cluster. Want to go faster? It’s possible!

    After you find the MAC address, change it to the regular format: xx:xx:xx:xx:xx:xx.

    Log on to the vMA console and connect to vCenter Server with the command: vifptarget -s vCenter Server Name

    Run the command: esxcfg-nics -h ESXi host name -l | grep xx:xx:xx:xx:xx:xx
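
    The reformatting step (owner field to xx:xx:xx:xx:xx:xx) is easy to get wrong by hand, so here is a minimal Python sketch of it. The function name owner_to_mac is my own, and it assumes the owner field looks like the sample line above.

```python
import re

# Matches the owner UUID and captures its last 12 hex digits.
OWNER_RE = re.compile(
    r"owner\s+[0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-([0-9a-f]{12})",
    re.IGNORECASE,
)


def owner_to_mac(vmkfstools_line):
    """Return the MAC address held in the last segment of the owner field."""
    m = OWNER_RE.search(vmkfstools_line)
    if m is None:
        raise ValueError("no owner field found")
    raw = m.group(1).lower()
    # Insert a colon between every pair of hex digits.
    return ":".join(raw[i:i + 2] for i in range(0, 12, 2))


if __name__ == "__main__":
    line = "gen 483, mode 1, owner 529495c4-0b6a7d90-a0f3-0025b541a0dc mtime 211436"
    print(owner_to_mac(line))
```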

    Even faster?

    Try using Excel to build the command list for all ESXi host names, then paste it into the console….
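
    If you would rather not use Excel, the same command list can be generated with a short script. This is just a sketch: nic_lookup_commands is a name I made up, and the host names below are placeholders, not real hosts.

```python
def nic_lookup_commands(hosts, mac):
    """Build one esxcfg-nics lookup command per ESXi host name."""
    return [f"esxcfg-nics -h {host} -l | grep {mac}" for host in hosts]


if __name__ == "__main__":
    # Placeholder host names; replace with the hosts in your cluster.
    hosts = ["esx01.example.com", "esx02.example.com", "esx03.example.com"]
    for command in nic_lookup_commands(hosts, "00:25:b5:41:a0:dc"):
        print(command)
```

    Paste the printed lines into the vMA console after running vifptarget; only the owner host returns a matching line.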

  • ESXi 5.1 shows 0 value for CPU/memory in vCenter

    It’s been a month. I was busy making our environment more stable, with a lot of troubleshooting, WebEx sessions, and discussions. A few days ago I noticed that random VMs kept vMotioning constantly. Some VMs ended up in a strange state: they showed as orphaned, invalid, or unknown, but were still online.

    I couldn’t find any evidence of why the VMs went into these states. One more thing I noticed was that the CPU and memory utilization of the ESXi 5.1 hosts showed 0 in vCenter Server 5.1.

    The following is not a firm conclusion; it is my inference based on the DRS and HA behavior and that 0 value for CPU/memory. I also discussed it with VMware BCS support.

    The VMs changed to an abnormal status because a vMotion was interrupted by something, most likely HA kicking in after an intermittent network/storage failure. The chance of that became high because DRS kept trying to move heavy-workload VMs to the host that reported 0 CPU/memory.

    You have to upgrade to the latest version of ESXi 5.1 or to vCenter Server 5.1 Update 1c to fix this problem permanently.

    Workaround:

    Choose one of the following options. They are temporary solutions; the issue will come back.

    1. Restart the ESXi management agents.

    2. Disconnect and reconnect the ESXi host in the vSphere Client.

    Update: you have to upgrade both the ESXi hosts and vCenter Server to fix the problem permanently.

  • HA for DMZ ESXi 5.1 cluster

    Virtualization is more popular than ever this year; I see many companies moving their internal infrastructure onto virtual platforms.

    HA is a key feature of vSphere ESXi 5.1, and you have to consider it in every design, especially for DMZ virtual machines.

    Most DMZ ESXi clusters have a restricted networking policy; even ICMP may not be allowed. As you may know, HA detects whether an ESXi host is alive in two ways: storage and network.

    If the host can see the shared storage, the host is considered alive.

    If the host can ping the default gateway, the host is considered alive.

    What if ping is disabled on the default gateway? You’ll get “vSphere HA agent on this host could not reach isolation address: xxx.xxx.xxx.xxx” on each host.

    This can sometimes cause VMs to lose HA protection. You can fix the problem as follows:

    1. Log in to each host by SSH.
    2. Run the command “vmkping xxx.xxx.xxx.xxx” to ping any ICMP-enabled IP address from the VMkernel ports.
    3. Record the IP addresses that respond to ping.
    4. Right click the ESXi 5.1 cluster.
    5. Edit Settings -> vSphere HA -> Advanced Options
    6. Add das.isolationAddressX with an IP address from step 3 as the value; X ranges from 0 to 9.
    7. Repeat step 6 to add all the IP addresses you want to use.
    8. Add das.useDefaultIsolationAddress with the value false.
    9. Right click each host and select Reconfigure for vSphere HA.
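
    To avoid typos when entering steps 6 to 8, you can generate the option names and values first. The sketch below only builds the key/value pairs to type into Advanced Options; it does not talk to vCenter. The function name ha_isolation_options is my own and the IP addresses are placeholders.

```python
def ha_isolation_options(addresses):
    """Map pingable IPs to das.isolationAddress0..9 and disable the default."""
    if not 1 <= len(addresses) <= 10:
        raise ValueError("provide between 1 and 10 isolation addresses")
    options = {
        f"das.isolationAddress{index}": ip
        for index, ip in enumerate(addresses)
    }
    options["das.useDefaultIsolationAddress"] = "false"
    return options


if __name__ == "__main__":
    # Placeholder addresses; use the ones recorded in step 3.
    for name, value in ha_isolation_options(["192.0.2.1", "192.0.2.2"]).items():
        print(f"{name} = {value}")
```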

  • Unable to connect to web services to execute query

    It’s been a long time since my last post. I was pretty busy with a storage issue and did a lot of work with the hardware vendor and VMware on that weird problem.

    During our troubleshooting, I noticed a minor problem when searching for a VM in the vSphere Client: every time, it gave me the error message “Unable to connect to web services to execute query” and asked me to “Verify that the VMware VirtualCenter Management Webservices service is running”.

    I tried rebooting vCenter Server, restarting Management Webservices, and even re-installing the vSphere Client, with no luck…. Finally I fixed the problem with the following steps:

    • Stop the VMware VirtualCenter Management Webservices service on vCenter Server.
    • Back up the Data folder in C:\Program Files\VMware\Infrastructure\tomcat\webapps\sms\WEB-INF\classes\com\vmware\vim\sms.
    • Remove all sms-*.db files in the Data folder.
    • Restart the VMware VirtualCenter Management Webservices service.

    These are simple steps, but the issue confused me and VMware support for a long time. The problem appeared after we upgraded vCenter Server from 5.0 to 5.1. The first thing we suspected was the Inventory Service, because the error message below was logged in ds.log whenever we searched for a VM.

    [2013-05-25 12:04:31,995 http-nio-/0.0.0.0-10443-exec-634  INFO  com.vmware.vim.vcauthenticate.servlets.AuthenticationServlet] Sending security error because of exception : com.vmware.vim.vcauthenticate.exception.SsoUnreachableException: com.vmware.vim.dataservices.ssoauthentication.exception.ServiceCommunicationException: com.vmware.vim.sso.admin.exception.InternalError: General failure.

    It looks like an authentication issue, right? So we checked SSO, the service account, and so on. The unclear logs led us the wrong way. 🙂

    Since nobody had complained to me, I suspected a client-side issue, so we tried the search from another clean client, but got the same issue. We also suspected the vCenter inventory cache, but the logs gave no evidence of that, and we could not simply reset the inventory cache database since this was a production environment!

    Okay, I’ve talked too much about the troubleshooting process. Let’s talk about the search function of vSphere. My understanding is that vCenter searches for objects in two different ways, depending on whether you use the Web Client or the vSphere Client. It looks like the Web Client retrieves data from the database or the Web Client server.

    The vSphere Client gets its data from a cache database. The cache database is located in the vCenter Server install folder; the default path is C:\Program Files\VMware\Infrastructure\tomcat\webapps\sms\WEB-INF\classes\com\vmware\vim\sms. The cache files are actually H2 databases, which work together with the Tomcat web services. The sms folder contains the application files of Storage Monitoring Service, which uses H2 database engine v1.2.147. Please comment if you think I’m wrong.

    If the H2 database is corrupted, Storage Monitoring Service also stops working; you can find the service stuck in “Service initializing…” with a warning status in the vCenter Service Status node of the vSphere Client.

    One solution fixes two issues, I like it!

     

  • The number of heartbeat datastores for host is 0, which is less than required: 2

    Today I saw this error message on one ESXi 5.0 host:

    The number of heartbeat datastores for host is 0, which is less than required: 2

    DRS and HA had not placed any VMs on the host. The VMware KB gives a solution, but it is too complicated.

    Reconfiguring HA fixes the problem.

    Right click the host -> Click Reconfigure for vSphere HA -> Wait for the HA configuration to complete.