Unable to connect to web services to execute query

It’s been a long time since my last post. I was pretty busy with a weird storage issue and did a lot of work with the hardware vendor and VMware on it.

During our troubleshooting, I noticed a minor problem: every time I searched for a VM in vSphere Client, it gave me the error message “Unable to connect to web services to execute query” and asked me to “Verify that the VMware VirtualCenter Management Webservices service is running”.

I tried rebooting vCenter Server, restarting Management Webservices and even reinstalling vSphere Client, but no luck. Finally I fixed the problem with the following steps:

  • Stop VMware VirtualCenter Management Webservices service on vCenter Server.
  • Back up the Data folder in C:\Program Files\VMware\Infrastructure\tomcat\webapps\sms\WEB-INF\classes\com\vmware\vim\sms.
  • Remove all sms-*.db files in the Data folder.
  • Restart VMware VirtualCenter Management Webservices service.

These are simple steps to fix the problem, but the issue confused me and VMware support for a long time. The problem appeared after we upgraded vCenter Server from 5.0 to 5.1. The first thing we suspected was the inventory service, because the error message below was logged in ds.log when we searched for a VM.

[2013-05-25 12:04:31,995 http-nio-/  INFO  com.vmware.vim.vcauthenticate.servlets.AuthenticationServlet] Sending security error because of exception : com.vmware.vim.vcauthenticate.exception.SsoUnreachableException: com.vmware.vim.dataservices.ssoauthentication.exception.ServiceCommunicationException: com.vmware.vim.sso.admin.exception.InternalError: General failure.

It looks like an authentication issue, right? So we checked SSO, the service account, etc. The unclear logs led us the wrong way. :-)

Since nobody complained to me, I suspected a client-side issue, so we tried searching from another clean client, but got the same error. We also suspected the vCenter inventory cache, but the logs gave no evidence of that, and we couldn’t just reset the inventory cache database since this is a production environment!

Okay, I’ve talked too much about the troubleshooting process. Let’s talk about the search function of vSphere. My understanding is that vCenter searches objects in two different ways, depending on whether you use Web Client or vSphere Client. It looks like Web Client retrieves data from the database or the Web Client server.

vSphere Client gets its data from a cache database. The cache database is located in the vCenter Server install folder; the default path is C:\Program Files\VMware\Infrastructure\tomcat\webapps\sms\WEB-INF\classes\com\vmware\vim\sms. The cache files are actually H2 databases, which work together with the Tomcat web services. The sms folder contains the application files of Storage Monitoring Services, which uses H2 database engine v1.2.147. Please comment if you think I’m wrong.

If the H2 database is corrupted, Storage Monitoring Services also stops working. You will see the service stuck in “Service initializing…” with a warning status under the vCenter Service Status node of vSphere Client.

One solution fixes two issues, I like it!



Get specific advanced configuration of ESXi host

The storage team said the best practice for QFullSampleSize is 32, and they wanted to check how it’s set in our environment. It’s easy to check an individual host, but pretty time-consuming if you want to check 300+ hosts.
Here is a one-line PowerShell script to export QFullSampleSize and QFullThreshold to a CSV file.

Get-VMHost | %{ $HostName=$_.Name; $HostCluster=$_.Parent; Get-VMHostAdvancedConfiguration -VMHost $_ | % { $_.getEnumerator()| ? {$_.Key -like "*QFull*"} | select Name,Value,@{N='host';E={$HostName}},@{N='Cluster';E={$HostCluster}} } } | export-csv c:\qSetting.csv




The number of heartbeat datastores for host is 0, which is less than required: 2

Today I saw this error message on one ESXi 5.0 host:

The number of heartbeat datastores for host is 0, which is less than required: 2

No VM was being placed on the host by DRS or HA. The VMware KB gives a solution, but it’s too complicated.

Reconfiguring HA fixes the problem.

Right-click the host -> click Reconfigure for vSphere HA -> wait for the HA configuration to complete.

No permission to login to vCenter Server 5.1

Today we P2V’d one vCenter Server, and I re-added the identity source for some reason. I didn’t modify any existing domain groups or ACLs.
After a while I got an interesting case. Users reported they got “No permission to login to vCenter Server 5.1” in vSphere Client.
I looked into the vpxd.log of vCenter Server, and it showed:

2013-05-01T11:08:01.399-05:00 [09108 error '[SSO]' opID=6e704a51] [UserDirectorySso] AcquireToken InvalidCredentialsException: Authentication failed: Authentication failed

2013-05-01T11:08:01.399-05:00 [08644 error 'authvpxdUser' opID=5469f71e] Failed to authenticate user <xxxx>

I was not 100% sure that the log was related to the real problem, but it indicated that something was wrong with the authentication components.
After comparing the working SSO with the faulty SSO, I noticed that Domain Alias was blank on the faulty SSO:

Identity source

Then I added a domain group on the faulty vCenter Server and compared it with the group on the working vCenter Server. The format was different.

Okay, now I know why the user logins failed. The identity source had a Domain Alias configured before I removed it on the faulty SSO. I then re-added the identity source without a Domain Alias, so vCenter Server used the domain name as the default prefix for domain groups. As a result, the original domain groups in the alias format ( CONTOSO\xxxx ) could not be identified by SSO.

So I deleted the identity source and added the same source with a Domain Alias. Problem fixed.

How to retrieve or set Path Selection Policy by vCLI

First of all, this article has nothing to do with PowerCLI. :-)

You probably know how to set the Path Selection Policy (PSP) in vSphere Client, but how would you set it on 100 LUNs manually? A few commands can make your life easier.

How to retrieve the LUN Path Selection Policy:

esxcli storage nmp device list | egrep "Device Display Name|Path Selection Policy:"

You will get output like this:

Device Display Name: DGC Fibre Channel Disk (naa.600601602a102e0002cdf2a2596be211)
Path Selection Policy: VMW_PSP_RR

This command helps you identify which policy each LUN uses. Here is an explanation of what Path Selection Policy is.
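If you have many LUNs, it can help to put the policy and the device name on one line so the result is easy to sort or grep. The following is only a sketch: `psp_table` is a hypothetical helper name, and it assumes the output format shown above, where each "Device Display Name:" line is followed later by a "Path Selection Policy:" line for the same device.

```shell
# Hypothetical helper: reads `esxcli storage nmp device list` output on stdin
# and prints "<policy><TAB><device display name>" for each device.
psp_table() {
    awk '
        /Device Display Name:/ {
            # Remember the display name, stripping the label in front of it
            name = $0
            sub(/^.*Device Display Name: */, "", name)
        }
        /Path Selection Policy:/ {
            # The policy is the last field on this line
            print $NF "\t" name
        }'
}

# On the host: esxcli storage nmp device list | psp_table | sort
```

Sorting the result groups the LUNs by policy, so devices that are not on the policy you expect stand out.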

Next, let’s see how to modify the PSP of these LUNs by script.
First, run the following command to print out a set command for each LUN. Don’t forget to change VMW_PSP_RR to the PSP you prefer.

esxcli storage nmp device list | awk '/^naa/{print "esxcli storage nmp device set -d "$0" -P VMW_PSP_RR" };'

Then copy the output to Notepad and remove the local disks. For example, naa.600508b1001c1e987243838af4c67891 below is a local HP disk.

esxcli storage nmp device set -d naa.600601602a102e008896dda81b88e211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e008861b28a596be211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e00560d8488b456e211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e00c4cd2600b456e211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600508b1001c1e987243838af4c67891 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e008c96dda81b88e211 -P VMW_PSP_RR

Last, copy the modified text back into the PuTTY session, and it will run the commands one by one.
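The manual Notepad step can also be sketched as a filter that skips the local disk while generating the commands. This is only an illustration: `gen_psp_cmds` is a hypothetical helper name, and the prefix naa.600508b1 is simply taken from the local HP disk in the example above, so check which prefix your own local disks use before relying on it.

```shell
# Hypothetical helper: reads `esxcli storage nmp device list` output on stdin
# and prints one "device set" command per naa device, skipping devices whose
# ID starts with the given prefix (assumed here to mark local disks).
gen_psp_cmds() {
    policy="$1"        # PSP to apply, e.g. VMW_PSP_RR
    skip_prefix="$2"   # device ID prefix to leave out; may be empty
    awk -v psp="$policy" -v skip="$skip_prefix" '
        /^naa\./ {
            # index() == 1 means the device ID begins with the skip prefix
            if (skip != "" && index($1, skip) == 1) next
            print "esxcli storage nmp device set -d " $1 " -P " psp
        }'
}

# On the host: esxcli storage nmp device list | gen_psp_cmds VMW_PSP_RR naa.600508b1
```

The function only prints the commands, so you can still review them before pasting them into the session.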

How to retrieve RDM information by PowerCLI

I worked on moving the RDM LUNs of a Microsoft Cluster virtual machine from one iGroup to another. To make sure the move was safe, we needed to record the RDM LUN information before the migration.

We had two VMs with almost 20 RDM LUNs, so getting the information manually would be pretty time-consuming. I used the following script to retrieve it:

$RDMinfo = Get-HardDisk -VM "virtual machine name" -DiskType RawPhysical

$RDMinfo | select Parent,Filename,CapacityGB,ScsiCanonicalName,Name


Port Groups not Work with VLAN Tag on Cisco Switch

A few weeks ago, I tried to standardize the networking of a cluster. There were 4 VLANs for production virtual machines, and I bound the VLANs to one virtual switch, which had 4 physical vmnics.

Then I created 4 port groups with different VLAN IDs, but for some reason virtual machines were unreachable via some vmnics. The network team verified that the port channel was good.

I tried on several ESXi 5.0 hosts in the cluster, and all had the same problem. Finally we found that it’s a Cisco switch bug. You can find detailed information and the workaround here.