How to Upgrade Virtual Hardware on MSCS VM

You get more cool new features if you keep virtual hardware up to date, but you may face boot problems when you upgrade an old virtual hardware version to the latest.

I always keep my Microsoft Cluster Service VMs (MSCS VMs) up to date, since RDM disks are usually used on that kind of VM.

I searched for how to upgrade virtual hardware on an MSCS VM with RDM LUNs, but had no luck. Here is my experience:

  1. Update Manager doesn’t work for MSCS VMs.
  2. No snapshot can be taken if the SCSI controller of the RDM is in physical mode, so you should have a good backup before upgrading.
  3. It’s possible to force the hardware version upgrade by right-clicking the VM and selecting Upgrade Virtual Hardware.
  4. Make sure all services are running on another node.
  5. You will get an error message in Events for the RDM disks in vSphere Client; the upgrade procedure won’t finish until the error has popped up for all RDM disks.
  6. I tested upgrading from version 7 to 8.

How to export diagnostic log from SmartStart CD

Have you faced a similar problem? HP asks you to provide the hardware diagnostic log from the SmartStart CD, you maintain the server remotely, and nobody is available locally. How can you export the log from SmartStart?

Previously I mapped a local USB device in iLO and exported to it, but what if the network performance between you and the server location is poor? In that case most people access iLO from a server at the remote site, so how can you map your local USB device into the iLO of a remote server?

You can use UltraISO to make a floppy image file and mount it in iLO as a virtual floppy.

Sorry, I don’t have an English version.

File -> New -> Floppy Image

Click OK to accept the default settings.

File -> Save As

Enter a file name; an .ima file will be generated.

You can also rename it to an .img file directly.

After exporting the logs to the virtual floppy, just open the file again in UltraISO and extract the logs.

vCenter Server Heartbeat 5.6 – Installation

I have to say you will not get what you expect if you only follow the VMware documentation. After referring to a few blogs and videos, I finally deployed it in production in both HA and DR mode. It consumed a lot of time, since I had to clone the VM from the US to India over the WAN. It was painful, so I’d like to share it to make sure you never end up in the same situation.

If you aren’t familiar with vCHB, please read vCenter Server Heartbeat 5.6 – Architecture.

Before installing vCHB, you should know that:

  • Install vCenter Server and its components on the Primary Server; the Secondary Server will be cloned from it.
  • vCenter Update Manager, vCenter Converter, ESXi Dump Collector and Syslog Collector must be configured using Fully Qualified Domain Names (FQDN) rather than IP addresses.
  • The time zone and time settings are correct.
  • Ports 52267 and 57348 are open in the firewall on both servers.
  • 2GB of free memory is available for vCenter Server Heartbeat.
  • Administrator rights are required to install vCenter Server Heartbeat.
  • All vCenter Server components should be functioning before you install vCenter Server Heartbeat.
  • No * in the SSO master password. ( I guess that’s a bug of 5.6U1; please refer to KB2034608 to reset the master password )
  • The vCenter Server FQDN is the Primary Server computer name. ( It will be changed later )
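For the firewall prerequisite, a quick reachability test from the other server can save a failed install. The snippet below is only a minimal sketch: it assumes both ports are TCP, and the function names are my own. A port only answers once something is listening on it, so before vCHB is installed a refused connection does not prove that the firewall is closed.

```python
import socket

# Port numbers are taken from the prerequisite list above.
# Treating them as TCP ports is an assumption of this sketch.
VCHB_PORTS = (52267, 57348)


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_ports(host, ports=VCHB_PORTS, timeout=3.0):
    """Map each port to True/False for the given host."""
    return {port: port_open(host, port, timeout) for port in ports}
```

Run check_ports("name-of-the-other-server") on the Primary Server against the Secondary Server and the other way around; the host name here is a placeholder.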

Pre-configuration before installing vCHB:

  • Make sure the Primary Server computer name is the vCenter Server FQDN.
  • Change the vCenter Server services to manual startup on the Primary Server:
    VMware VirtualCenter Server
    VMware vSphere Profile-Driven Storage
    vCenter Inventory Service
    VMware VirtualCenter Management Webservices
  • Recover the system fingerprint encrypted file.
    Go to C:\Program Files\VMware\Infrastructure\SSOServer\utils
    Recover the fingerprint with the following command:
    rsautil manage-secrets -a recover -m <SSO master password>
  • Power off the Primary Server.
  • Clone the Primary Server to the secondary site.
  • Disconnect the vNICs on the Secondary Server.
  • Power on both servers and set the IP addresses.
    I use two vNICs on each server, one for the Public Network and another for the VMware Channel Network.
    The Public Network contains two IP addresses, one for the Management Network and another for the Principal Network.
    The Principal Network address should be the same on both servers if you deploy HA mode; for DR mode they are different.
  • Disable NetBIOS and DNS registration on each vNIC.
  • Remove the Secondary Server from the domain and rename it.
  • Reboot the Secondary Server and connect the vNICs.
  • Join the Secondary Server back to the domain and add the proper AD groups to the Administrators group.
    Note: You may need to re-join the domain twice to make sure AD synchronization is correct; I got a vCenter Server startup issue in my initial deployment due to an AD synchronization issue.
  • Create a shared folder on a reliable server that both the Primary and Secondary Servers can access.
  • Make sure the configured IP addresses are pingable from each server.
  • Bring up the vCenter Server services on the Primary Server.

Installation:

  • Select Install VMware vCenter Server Heartbeat to start the installation.
  • Select Primary to install vCHB on the Primary Server.
  • Accept the agreement.
  • Apply the license key.
  • Select LAN or WAN according to your architecture.
  • Select the Secondary Server is Virtual option. ( I only tested that option )
  • Confirm the installation path.
  • Select the vNIC for the VMware Channel network.
  • Enter the VMware Channel IP addresses of the Primary and Secondary Servers.
    For HA mode, you can use non-routable or routable IP addresses.
    For DR mode, you must use routable IP addresses so that the two ends of the VMware Channel network can communicate with each other over the WAN.
  • Select the vNIC for the Public Network.
  • Enter the Principal Network IP addresses for both servers.
    For HA mode, the IP address should be the same on both servers.
    For DR mode, the IP addresses should be different, and you have to enter them manually.
    Select the options accordingly.
  • If you selected Different IP addresses in the step above, you will need to enter a Windows account for DNS updates. ( Refer to KB1008605 if you use BIND9 DNS instead of the Windows DNS service )
  • Then configure the Management Network. This network is used for RDP.
  • Rename the computer name of both servers. It looks like this only renames the Primary Server with no change to the Secondary Server, but you don’t have to worry about that since we already renamed the Secondary Server in an earlier step.
  • Set the client port; I used the default.
  • Select the components you want to protect and enter the vCenter login; this login must have Administrator rights on vCenter Server.
    Also enter the SSO master password. Please note that the SSO master password may be different from the SSO administrator password, so make sure you enter the correct one.
  • Enter the share path you created earlier; this folder will store the cluster configuration information for the Secondary Server installation.
  • vCHB starts checking the system.
  • You will lose RDP connectivity for about 10 seconds during installation due to the Packet Filter installation.

  • Once the installation is complete, you can start on the Secondary Server; just make sure you select Secondary.
    All other steps are similar to the Primary Server.

After Installation:

  • Start the vCHB services on the Secondary Server.
  • Open the vCenter Server Heartbeat Management Console.
  • Add each node by its Management Network address.
  • Wait a while and you will see a screen similar to the following screenshot.

All paths lost on HBA port

HP is a great company. I like the hardware design of HP ProLiant servers; it makes datacenter maintenance and operation pretty easy. Do you like it too? Today I’ll introduce a storage issue on HP ProLiant BL460 and BL480 blades. The issue happened on QLogic HBAs with a VC-FC module. I have two dual-port QLogic HBAs in each ESXi 5.x host, and one port of each HBA is zoned together on the SAN switch.

For example, vmhba1 and vmhba3 are zoned for LUN allocation, and each LUN has two paths on each HBA port.

I observed that all LUNs sometimes disappeared on a random HBA port. It doesn’t happen very frequently, but it can lead to ALL VMs DEAD if you get a storage outage while the LUNs are gone!!! The problem becomes more frequent as your virtual infrastructure grows.

These are the symptoms when the issue happens:

And if you log in to the SSH console and check the HBA status with:

less /proc/scsi/qla2xxx/[Device ID]

You will find the following differences between the two HBA ports:

See? All targets show Offline status on the problem HBA.

scsi-qla3-target-0=500a09859d812da0:030098:1000:<Offline>
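With many hosts it is tedious to read this output by eye. Below is a minimal sketch that scans text copied from the /proc file for target lines in the format shown above and lists the ones marked Offline. The field names (WWPN, port ID, handle) are my reading of that one sample line and should be treated as assumptions, as should the function name.

```python
import re

# Matches target lines in the format shown above, for example:
#   scsi-qla3-target-0=500a09859d812da0:030098:1000:<Offline>
# The meaning assigned to each colon-separated field is an assumption.
TARGET_RE = re.compile(
    r"^scsi-qla(?P<hba>\d+)-target-(?P<target>\d+)="
    r"(?P<wwpn>[0-9a-fA-F]+):(?P<port_id>[0-9a-fA-F]+):(?P<handle>[0-9a-fA-F]+):"
    r"<(?P<state>\w+)>$"
)


def offline_targets(proc_text):
    """Return (hba, target, wwpn) tuples for every target line marked Offline."""
    found = []
    for line in proc_text.splitlines():
        match = TARGET_RE.match(line.strip())
        if match and match.group("state").lower() == "offline":
            found.append(
                (int(match.group("hba")), int(match.group("target")), match.group("wwpn"))
            )
    return found
```

An empty list for one HBA port and a full list for the other is the pattern described above.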

You have two options to fix it:

  1. Reseat the blade. Downtime and someone on site are required.
  2. Reset the HBA with the following steps:

Record the Device ID, and force the HBA to rescan:

echo "scsi-qlascan" > /proc/scsi/qla2xxx/adapter_id

Wait a few seconds, then force a LIP login:

echo "scsi-qlalip" > /proc/scsi/qla2xxx/adapter_id

Wait a few minutes and the LUNs come back online… 🙂 You can refer to KB 1031199 for more detail.

This is a temporary remediation; the problem will recur. I’ll show you some permanent solutions in the next blog.

vCenter Server Heartbeat 5.6 – Architecture

I started using VMware Workstation in 2002 or earlier; my bad memory can’t recall exactly. That was the first generation of virtualization. If you look at today’s virtual world, we are on the way to the “Matrix”! 🙂 Enterprises are virtualizing more and more servers, which makes vCenter Server a critical role. We have to prepare for any contingency. vCenter Server Heartbeat (vCHB) is a nice candidate for protecting vCenter Server. It gives your infrastructure the ability to prevent downtime or outage of vCenter Server. To gear up for implementation in a production environment, I did some testing in my lab. The product is nice, but the documentation is not ideal. I’d like to share my experience. This blog also draws on my project document, so please let me know if you have any ideas that can help me improve it. Thanks in advance.

vCHB is a cluster service like Microsoft Cluster Service or any other third-party cluster software. The benefit of this product is that you don’t have to build the cluster on RDMs, so your ESXi maintenance operations become much easier. You can deploy vCHB in HA or DR mode; I’ll focus on HA mode for now since I haven’t tested DR mode yet.

Server

My original lab infrastructure contains one vCenter Server with a remote SQL database server, with data transmitted over the LAN. So my vCHB topology is one SQL database (I already have MSCS protecting the SQL database server) and two vCenter servers (Primary Server and Secondary Server).

vCHB uses Active-Passive for HA mode: the Active role runs the protected applications, and the Passive role receives the changed data.

Primary Server – The original vCenter Server that I want to protect. It runs all vCenter components except when an outage happens.

Secondary Server – The other server of the pair, in the Passive role. Generally it receives changes from the Primary Server and takes over the Active role when an outage happens.

In my lab the Primary Server holds the Active role and the Secondary Server the Passive role most of the time.

Networking

vCHB has two networks: the Public Network and the VMware Channel Network. You can use a single NIC to run all networks or multiple NICs to separate them.

VMware Channel Network – vCHB monitors whether each server is alive and syncs changed data over the VMware Channel Network, so it’s a very important network.

Public Network – It contains two sub-networks: the Principal Network and the Management Network. The Principal Network is for the vCHB cluster, the Management Network for day-to-day operation.

Confused? To simplify, I understand the networks like this:

VMware Channel Network – Can be a private IP address or any IP address outside the subnet of the Public Network. It is used for heartbeat and data transmission.

Public Network – The Principal Network is the IP address of the cluster DNS name, and the Management Network is the IP address for RDP. They are in the same routable subnet, but preferably with a different IP address prefix; please refer to KB 2004926.

Storage

There is no special storage requirement, but 2GB of free space should be available where you want to install vCHB. We also need a reliable shared folder to store cluster data. I prefer to create the shared folder on a server other than the vCHB servers, since the vCHB networks are usually interrupted for a few seconds during vCenter failover.

Okay, I’ll share how to install vCHB in the next blog. Here is the architecture for your reference:

A disk read error occurred after upgrade HW version from 3 to 9

This was a lesson learned for me, after I got the data back. My data was lost and I had no backup…

I had a virtual machine that was moved from ESX 3.0 to an ESXi 5.1 host a long time ago. The virtual disk size showed 0, and I could not do storage migration or take a snapshot of the VM because the hardware version was 3, which is too low.

Generally I take a snapshot before upgrading the VM HW version, but that’s impossible on a HW version 3 VM running under vCenter Server 5.1. So I upgraded VMware Tools and then the VM hardware version with Update Manager. VMware Tools upgraded successfully, but the VM hardware version upgrade failed with an error.

Then I right-clicked the VM and used the “Upgrade Hardware Version” option directly. It succeeded without any prompt… and then I got “A disk read error occurred” at boot. 🙁

You may think it’s caused by the SCSI controller, since VM hardware version 3 supports IDE virtual disks and version 9 supports only SCSI virtual disks for best performance. That was not my case. I tried several ways to recover the disk: converting the VM with Converter, mounting the disk on another virtual machine, changing SCSI parameters, etc.

I didn’t think the hardware version upgrade changed much of the actual virtual disk; it had to be something changed in the header of the virtual disk or in the descriptor file. After consulting Microsoft, we finally got it fixed.

When I mounted the corrupted disk on another virtual machine, the partition and size were recognized correctly, and Disk Management also recognized the NTFS file system. I could see a new drive appear in My Computer as well, but it showed me “File or directory corrupted…” when I tried to open the drive. It looked more like a file system issue… That’s easy, just run the following command to check for logical errors:

Chkdsk [drive letter]

Wow… a lot of errors and files were listed. Then I tried the command:

Chkdsk /f [drive letter]

That actually fixes the logical issues on the disk. I could open the drive after running this command.

I mounted the drive back on the broken VM and powered on. A new issue came up… Windows showed me “Windows NT could not start because the below file is missing or corrupt: C:\Windows\System32\Ntoskrnl.exe”. I replaced the file but it didn’t help. The file existed in that location and its size was the same as on other VMs, so perhaps it was not a file issue?

Then I opened the VMDK file, aha… ddb.adapterType = "LegacyESX". I changed it to ddb.adapterType = "lsilogic" to match my SCSI controller setting, and my lovely Windows Server startup screen came back again. 🙂

Okay, I talked too much. To summarize the fix steps:

  • Mount the broken disk on a good virtual machine with the same operating system. ( I’m not sure whether it is OK to mount it on a higher version of the OS )
  • Run chkdsk [drive letter] to check whether logical errors exist.
  • Run chkdsk /f [drive letter] to fix the logical errors.
  • Unmount the disk from the good VM.
  • Edit the VMDK file in the ESXi console.
  • Change the value of ddb.adapterType to the proper SCSI controller type according to your SCSI controller setting.
  • Mount the disk back on the broken VM.
  • Power on.
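If you would rather script the descriptor edit than do it by hand in vi, here is a minimal sketch. It assumes the disk has a separate small text descriptor file (not the large -flat file) and that you have already copied it somewhere safe; the function name is made up for this example, and it works on the descriptor text only.

```python
import re

# Matches the adapter type line of a text VMDK descriptor, for example:
#   ddb.adapterType = "LegacyESX"
ADAPTER_RE = re.compile(r'^(\s*ddb\.adapterType\s*=\s*)"[^"]*"', re.MULTILINE)


def set_adapter_type(descriptor_text, adapter_type):
    """Return the descriptor text with ddb.adapterType set to adapter_type.

    Raises ValueError unless exactly one ddb.adapterType line is found,
    so an unexpected file is never silently rewritten.
    """
    new_text, count = ADAPTER_RE.subn(
        lambda m: m.group(1) + '"' + adapter_type + '"', descriptor_text
    )
    if count != 1:
        raise ValueError("expected exactly one ddb.adapterType line, found %d" % count)
    return new_text
```

Read the descriptor into a string, call set_adapter_type(text, "lsilogic") (or whatever matches your controller), and write the result back only after checking it.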

Here is what I learned from this incident:

  1. vCenter Server does not verify the compatibility of the VM hardware version during the upgrade. Actually, upgrading a VM from version 3 to 9 directly is not allowed.
  2. vCenter Server does not allow you to choose which VM hardware version to upgrade to; it is always the latest.
  3. If you upgrade a VM from version 3 to 9 directly, a SCSI controller will be added to the VM and the value of ddb.adapterType will be changed to LegacyESX. You will not be able to boot the VM because Windows Server 2003 does not contain the proper SCSI driver.
  4. The VM version upgrade seems to change parameters of the VMDK file but not much of the actual virtual disk, such as the NTFS mapping, the MBR table, etc.

Last, you may still face a BSOD after using the solution above because of item 3; you have to inject the SCSI driver. Please refer to KB 1005208 and KB 1006858.

Last of last… 🙂 please take a backup of your virtual disk before you make any change!!!!!

Unable to connect to web services to execute query

It’s been a long time since my last post. I was pretty busy with a storage issue and did a lot of work with the hardware vendor and VMware on this weird issue.

During our troubleshooting, I noticed a minor problem when I tried to search for a VM in vSphere Client. Every time it gave me the error message “Unable to connect to web services to execute query” and asked me to “Verify that the VMware VirtualCenter Management Webservices service is running”.

I tried rebooting vCenter Server, restarting the Management Webservices service and even reinstalling vSphere Client, but no luck… Finally I fixed the problem with the following steps:

  • Stop the VMware VirtualCenter Management Webservices service on vCenter Server.
  • Back up the Data folder in C:\Program Files\VMware\Infrastructure\tomcat\webapps\sms\WEB-INF\classes\com\vmware\vim\sms.
  • Remove all sms-*.db files in the Data folder.
  • Restart the VMware VirtualCenter Management Webservices service.
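The backup and cleanup steps in the middle can be scripted so the backup is never skipped. This is a minimal sketch under the assumption that the service is already stopped; the function name and the backup location are my own choices, and it only touches files matching sms-*.db directly inside the Data folder.

```python
import shutil
from pathlib import Path


def reset_sms_cache(data_dir, backup_dir):
    """Copy data_dir to backup_dir, then delete the sms-*.db files in data_dir.

    copytree refuses to overwrite an existing backup_dir, so an earlier
    backup is never clobbered. Returns the names of the removed files.
    """
    data_dir = Path(data_dir)
    backup_dir = Path(backup_dir)
    shutil.copytree(data_dir, backup_dir)

    removed = []
    for db_file in sorted(data_dir.glob("sms-*.db")):
        db_file.unlink()
        removed.append(db_file.name)
    return removed
```

Point data_dir at the Data folder from the steps above and backup_dir at a new folder that does not exist yet, then start the service again.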

These are simple steps to fix the problem, but the issue confused me and VMware support for a long time. The problem appeared after we upgraded vCenter Server from 5.0 to 5.1. The first thing we suspected was Inventory Service; the error message below was logged in ds.log when we searched for a VM.

[2013-05-25 12:04:31,995 http-nio-/0.0.0.0-10443-exec-634  INFO  com.vmware.vim.vcauthenticate.servlets.AuthenticationServlet] Sending security error because of exception : com.vmware.vim.vcauthenticate.exception.SsoUnreachableException: com.vmware.vim.dataservices.ssoauthentication.exception.ServiceCommunicationException: com.vmware.vim.sso.admin.exception.InternalError: General failure.

It looks like an authentication issue, right? So we checked SSO, the service account, etc. The unclear logs led us the wrong way. 🙂

Since nobody else complained to me, I suspected a client-side issue, so we tried searching from another clean client, but got the same issue. We also suspected the vCenter inventory cache, but the logs gave no evidence of that, and we could not just reset the inventory cache database since it was a production environment!

Okay, I’ve talked too much about the troubleshooting process; let’s talk about the search function of vSphere. My understanding is that vCenter searches objects in two different ways, depending on whether you use Web Client or vSphere Client. It looks like Web Client retrieves data from the database or the Web Client server.

vSphere Client gets data from a cache database. The cache database is located in the vCenter Server install folder; the default path is C:\Program Files\VMware\Infrastructure\tomcat\webapps\sms\WEB-INF\classes\com\vmware\vim\sms. The cache files are actually H2 databases, which work together with the Tomcat web services. The sms folder contains the application files of Storage Monitoring Service, which uses H2 database engine v1.2.147. Please comment if you think I’m wrong.

If the H2 database is corrupt, Storage Monitoring Service also stops working; you can find the service stuck in “Service initializing…” status with a warning in the vCenter Service Status node of vSphere Client.

One solution fixes two issues, I like it!

Get specific advanced configuration of ESXi host

The storage team said the best practice for QFullSampleSize is 32, and they wanted to check how it is set in our environment. It’s easy to check an individual host, but pretty time-consuming if you want to check 300+ hosts.
Here is a one-line PowerShell script to export QFullSampleSize and QFullThreshold to a CSV file.

Get-VMHost | %{ $HostName=$_.Name; $HostCluster=$_.Parent; Get-VMHostAdvancedConfiguration -VMHost $_ | % { $_.getEnumerator()| ? {$_.Key -like "*QFull*"} | select Name,Value,@{N='host';E={$HostName}},@{N='Cluster';E={$HostCluster}} } } | export-csv c:\qSetting.csv
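Once the CSV exists, somebody still has to read 300+ rows. Here is a minimal Python sketch that lists the hosts whose QFullSampleSize differs from the recommended 32. It assumes the column names produced by the select in the one-liner (Name, Value, host, Cluster) and skips the #TYPE line that Export-Csv writes when -NoTypeInformation is not used; the function name is hypothetical.

```python
import csv
import io


def hosts_off_target(csv_text, key_suffix="QFullSampleSize", expected="32"):
    """Return sorted (cluster, host, value) tuples where the setting != expected."""
    # Drop the "#TYPE ..." header line that Export-Csv may add.
    lines = [ln for ln in csv_text.splitlines() if not ln.startswith("#TYPE")]
    rows = csv.DictReader(io.StringIO("\n".join(lines)))
    return sorted(
        (row["Cluster"], row["host"], row["Value"])
        for row in rows
        if row["Name"].endswith(key_suffix) and row["Value"] != expected
    )
```

Read the exported file into a string and pass it in; an empty list means every host already matches the storage team's value.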

No permission to login to vCenter Server 5.1

Today we P2V’d a vCenter Server. I re-added the identity source for some reason; I didn’t modify any existing domain groups or ACLs.
After a while I got an interesting case. Users reported that they had no permission to log in to vCenter Server 5.1 with vSphere Client.
I looked into the vpxd.log of vCenter Server, which showed:

2013-05-01T11:08:01.399-05:00 [09108 error '[SSO]' opID=6e704a51] [UserDirectorySso] AcquireToken InvalidCredentialsException: Authentication failed: Authentication failed

2013-05-01T11:08:01.399-05:00 [08644 error 'authvpxdUser' opID=5469f71e] Failed to authenticate user <xxxx>

I was not 100% sure those log entries were related to the real problem, but they indicated it should be something related to the authentication components.
After comparing a working SSO with the faulty SSO, I noticed that Domain Alias was blank on the faulty SSO:

Identity source

Then I added a domain group on the faulty vCenter Server and compared it with the same group on the working vCenter Server. The format was different, like this:
Working SSO – CONTOSO\TEST-GROUP
Faulty SSO – CONTOSO.COM\TEST-GROUP

Okay… now I know why user logins failed. The identity source had a Domain Alias configured before I removed it on the faulty SSO. I then added the identity source without a Domain Alias, so vCenter Server used the domain name as the default prefix for domain groups, which meant the original domain groups in the old format ( CONTOSO\xxxx ) could not be identified by SSO.
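To see how many existing permission entries are affected by a prefix change like this, you can compare the prefix in front of the backslash with the one SSO now produces. The sketch below only models the mismatch as I understand it from this case; it is not how SSO itself matches names, and the function names and sample group names are hypothetical.

```python
def principal_prefix(principal):
    """Return the part before the backslash in a DOMAIN\\name principal."""
    prefix, sep, _name = principal.partition("\\")
    if not sep:
        raise ValueError("not in DOMAIN\\name form: %r" % principal)
    return prefix


def unmatched_principals(permission_principals, sso_prefix):
    """List the principals whose prefix differs from the one SSO now uses."""
    wanted = sso_prefix.upper()
    return [p for p in permission_principals if principal_prefix(p).upper() != wanted]
```

For example, with sso_prefix set to "CONTOSO.COM", a permission still stored under the CONTOSO alias prefix is returned as unmatched.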

So I deleted the identity source and added the same source with the Domain Alias, and the problem was fixed…