Tag Archives: Hardware

Emulex OneConnect OCe10102 on ESXi 6.0

Please refer to following post for basic troubleshooting of Emulex OneConnect.

How to Install Proper Drivers for 3rd Party Network Adapter on ESXi 5.x

I have a box uses Emulex OneConnect OCe10102 network adapters. The adapter is quite old and Emulex brand card doesn’t support ESXi 6.0. I upgraded the server to ESXi 6.0 and the Emulex adapters lost.

In the initial troubleshooting, I noticed that the adapters are still visible in BIOS. So it should be some driver level issues. I checked VMware Compatibility Guide. The model OCe10102 doesn’t support by ESXi 6.0.

If you run the following command you will still be able to see the adapters in PCI list on ESXi.

esxcli hardware pci list

So it indicates the adapters are not visible in ESXi since the newer Emulex driver doesn’t contain the model of the adapter in ESXi 6.0 native driver.

Then I uninstalled the native Emulex driver for ESXi 6.0 by the following command and rebooted the ESXi host.

esxcli software vib remove -n elxnet

The adapters still not visible after rebooting since no any drivers for Emulex adapters. Then I downloaded the Emulex drivers for ESXi 5.5 on VMware website and uploaded the “offline” package in the zip file to /tmp directory of the host. Then installed the driver by the following command:

esxcli software vib install -d "/tmp/xxxxx.zip"

The adapters appeared after rebooting the host.


“The Update is Not Applicable to Your Computer” When Install KB3046101

  HPE 3PAR upgrading team usually sends a per-requisites before upgrading. One thing in the guide incorrect is the Windows 2012 required patch KB3046101.     You may see error below when you install the path on Windows 2012 server.    
The update is not applicable to your computer
      The reason is the version of mpio.sys and msdsm.sys on your server is higher than 6.3.9600.17809. Usually, because the server installed KB3121261 already. You can ignore KB3046101.

Vlan ‘xxx’ resolved to unsupported VLAN ID in Cisco UCS Manager

You may need only 1 IP address for blade console in Cisco UCS Manager. You can follow Understanding “Management IP” of Cisco UCS Manager to configure it. You may see warning “Vlan ‘xxx’ resolved to unsupported VLAN ID” when you delete existing inbound and outbound IP pools if you are trying to clean up existing management IP pools.

That’s because inbound IP address for blade is not cleaned. You have to go to “Equipment” -> “Chassis” -> Target chassis -> “Servers” -> Target server -> Go to “Inventory” tab -> “CIMC” tab -> Click “Change Inbound Management IP” -> Remove existing VLAN and IP pool.

You will see inband IP tab is blank once it’s saved. Please note, the IP address reassign back after 1 minute if you clicked “Delete Inband Configuration” instead of “Change Inbound Managemnt IP“.

Understanding “Management IP” of Cisco UCS Manager

IP address for KVM in Cisco UCS Manager is different with HPE servers. It may assign multiple IP addresses to same blade if you don’t configure it properly. In my case each blade gets 3 IP addresses!

There are actually 3 types of IP address for KVM. (Cisco manual says 2):

  • Outbound Management IPs.
  • Inbound Management IPs for Blades.
  • Inbound Management IPs for Service Profiles.

Outbound Management IP” is default for KVM. Every new deployed blade will try to get a DHCP IP over management port in same VLAN of Cisco UCS Manager.

The more confused is the 2nd and 3rd IPs.  “Inbound Management IPs for Blades” is from “hardware” perspective. “Inbound Management IPs for Service Profiles” is from “logical” perspective.

If you go to “Equipment” -> Chassis -> blade -> Click the KVM to go console. You get console over either “Outbound Management IP” or “Inbound Management IPs for Blades“.

If you go to “Servers” -> “Service Profiles” -> Click the KVM of a service profile. You get console over either “Outbound Management IP” or “Inbound Management IPs for Service Profiles”.

If you want to configure just 1 IP for a blade whatever it’s for hardware or service profile. You need to do following:

  1. Delete the range of the default “ext-mgmt” in “IP Pools” of “LAN” node in Cisco UCS Manager.
  2. Create a new inbound IP pool and a VLAN group without uplink.
  3. Assign the VLAN and inbound IP pool to templates or service profile.

Refer to Setting the Management IP Address of Cisco UCS Manager manual for detail.

BTW, you may see Vlan ‘xxx’ resolved to unsupported VLAN ID in Cisco UCS Manager when you clean up existing IP pool and create new inbound pool.

“x/xx on FI-A is connected by a unknown server device” on Cisco UCS

You may see following errors in ‘info’ category of error messages in the Cisco UCS Manager after upgrading infrastructure firmware to 3.2.x.

“x/xx on FI-A is connected by a unknown server device”

This is bug documented in CSCvk76095. You have to reset the port on FI to fix it.

  1. Go to “Equipment” in Cisco UCS Manager.
  2. Go to “Fabric Interconnects” -> Go to the corresponding FI.
  3. Right-click the port x/xx -> Choose “Disable“.
  4. You will see multiple major faults. Wait for 5 seconds.
  5. Right click the port x/xx -> Choose “Enable“.
  6. All warnings disappeared after 5 mins. You may still see the warning in GUI due to cache. Relogin and check.

This change impacts to one link between IOM and the FI port. You need downtime if the IOM only has a single path. I don’t see any impact to ESXi blades in the pod.

Connect to New Provisioned Raspberry Pi Less than $3

The IP configuration of new provisoined Raspberry Pi struggled me a long time. I need to connect to a monitor so I login to system and configure IP address. The problem was I don’t have monitor. I only have a laptop.

Last year, my old laptop dead. I connected the laptop monitor to a HDMI board to my Raspberry Pi. It’s not a low cost solution, it costed me more than $10. And the monitor, cables and board looks uglily.


Actually there is another solution to leveraging laptop keyboard and monitor. It’s serials port to console. Something similar like when you configure Cisco network switches. Following is how to do it. I achieve that on Raspberry Pi 2.

  1. You need to buy a USB to TTL device with chipset CP2102.
  2. Connect the pins to Raspberry Pi 2. Refer here for GPIO layout.
    TXD > Pi RXD Pin #10 (GPIO 16)
    RXD > Pi TXD Pin #08 (GPIO 15)
    GND > Pi GND Pin #6
  3. Connect the USB to laptop. You will see a device in ‘Device Manager’ needs drivers.
  4. Download driver and install.
  5. Download Putty and install.
  6. Open Putty and “Serial”.
  7. “Serial line” is COM3 or COM4.
  8. “Speed” is 115200.

The USB to TTL I bought on Taobao (Chinese version of Aliexpress). It’s around $1.2 including shipping.

Show CDP Neighbor of Cisco UCS Uplinks

There are two ways to know which network switch ports the network uplinks of Cisco UCS Fabric Interconnects are connected to.


  • SSH to the Cisco UCS Manager.
  • Connect to FI-A.
# connect nxos a
  • Show neighbor of network uplinks.
# show cdp neighbor interface ethernet <port num>

By PowerShell

  • Make sure Cisco PowerTool (For UCS Manager) is installed.
  • Enabling the Information Policy via UCSM GUI.
    • Go to “Equipment” -> “Policies” tab -> “Global Policies” tab -> “Info Policy” area.
    • Change to “Enabled“. (No impact to running blades)
  • Open a PowerShell window.
  • Connect to the UCS Manager.
# Connect-Ucs <UCS FQDN>
  • Show CDP neighbor details.
# Get-UcsNetworkLanNeighborEntry

Side notes

Following command can shows network switch name, network switch ports and FI ports

# Get-UcsNetworkLanNeighborEntry | Select deviceid,remoteinterface,localinterface

If you prefer to enable the “Info Policy” by PowerShell, run following command

# Get-UcsTopInfoPolicy | Set-UcsTopInfoPolicy -State enabled -Force

“default Keyring’s certificate is invalid” in Cisco UCS Manager

You may see following error in Cisco UCS Manager:

default Keyring’s certificate is invalid

The reason is Admin -> Key Management -> KeyRing default is expired. It’s not possible to delete or change the KeyRing in GUI. You have to log in to SSH of Cisco UCS Manager and run following commands (The strings after “#”):

lab-B# scope security
lab-B /security # scope keyring default
lab-B /security/keyring # set regenerate yes
lab-B /security/keyring* # commit-buffer
lab-B /security/keyring #

This will result in a disconnect of the Cisco UCS Manager GUI on your client computer. Just refreshing the page after 5 seconds. It’s no impact to blades.

A Huge Amount of Warnings of “Image is Deleted” in Cisco UCS Manager

A few days ago, I deleted some older firmware packages in Cisco UCS Manager. Suddenly more than 100 warnings were generated. The error messages are similar below:

blade-controller image with vendor Cisco System Inc……is deleted

Cause: image-deleted

Clearly, it’s triggered due to packages deletion. But all of my service profiles and service profile templates were using existing firmware packages. The deleted packages were not been used anywhere.

I also deleted download tasks and cleaned up everything I can. The warnings still persisted. I figured out it’s caused by the default firmware policy when I read a blog article.

In case you are facing same issue. Please go to Servers -> Policies -> Host Firmware Packages -> default ->  Click Modify Package Versions -> Change it to available version.


Cisco UCS Blade Cannot Get IP Address for KVM

You may see “The IP address to reach the server is not set” when clicking the KVM console in Cisco UCS Manager. The issue persists even Cisco UCS Manager has enough IP addresses for management. Re-acknowledge or reset CIMC cannot fix the problem.

The fix procedure is go to “Equipment” -> Select the server -> “General” tab -> “Server Maintenance” -> “Decommission” the server.

Wait for the decommission completed, then re-acknowledge the server. IP address will be assigned to the server after the acknowledge process is completed.

UCS Manager UI Fonts Size on 4K Screen

Older UCS Manager uses Java application. The UI fonts could be extremely small on high DPI screen. The fix is:

  1. Go to “C:\Program Files (x86)\Java\jre1.8.0_171\bin“.
  2. Go to “Properties” of “jp2launcher.exe“.
  3. Compatibility” tab -> “Change high DPI settings“.
  4. Check “Override high DPI scaling behavior….“.
  5. Select “System (Enhanced)” or “System“.


Troubleshooting Network Performance of Virtual Machine

There are several layers of networking on the virtualization infrastructure. Guest operating system, Virtual Machine, ESXi driver, physical network adapters, RJ45/SFP and network switches…etc. Sometimes it’s hard to say where exactly caused a problem. Especially hardware layer problems. Today I worked on a very interesting case, it may give some ideas to troubleshooting network performance issue which is caused by hardware layers.

A user told me he was bothered by network performance of a virtual machine. It’s slow to copy data to NFS share. But responding to “ping” command looked good. I didn’t see any issue on virtual machine layer. VMware Tools was up to date, Windows OS was patched, virtual network adapter type was VMXNET3 and VM version was also up to date.

When I tried to copy an image file to share folder of the virtual machine, I did see sometimes speed was fast, but sometimes not. Since I have two physical uplinks, it led me to guess it could be one of the uplinks.

After a lot of swapping and cable changing, we eventually figured out there was a bad SFP on network switch end. I was able to observe the issue by using “psping.exe” of Microsoft Sysinternals. I used the following command to send the different size of ping package to the virtual machine. Network drops were increasing when I increased package size.

psping.exe -l <size of package> <Destination>
Example: psping.exe -l 4k xxxx.contoso.com

The size could be 1k, 2m or even larger. I think this is a good way to identify problem outside of ESXi. Especially SFP problem as such kind of problem didn’t give any CRC or error count on network switch level.

You can also use Windows native command “ping.exe” as following. The size unit is “bytes”. For example, you need to input 4096 if you want to send 4kb.

ping.exe -l <size> <Destination>
Example: ping.exe -l 4096 xxx.contoso.com



CVE-2017-5754, CVE-2017-5753 and CVE-2017-5715 (Spectre and Meltdown)

You may know there are 3 vulnerabilities recently noticed by industry. Long story to short, kernel address space exposed to hackers when processors running user space code. It’s not only impact to Intel processors but also AMD and ARM. CVE-2017-5715 is a hardware issues that only apply certain firmware can fix the vulnerabilities. CVE-2017-5754 and CVE-2017-5753 need to apply OS patches to change how codes access kernel address space. Following are some useful links just for your reference.




VMware: https://www.vmware.com/security/advisories/VMSA-2018-0002.html (For CVE-2017-5753 and CVE-2017-5715. VMware has not published anything for CVE-2017-5754 yet.)

Microsoft: https://support.microsoft.com/en-gb/help/4072698/windows-server-guidance-to-protect-against-the-speculative-execution

HPE: http://h22208.www2.hpe.com/eginfolib/securityalerts/SCAM/Side_Channel_Analysis_Method.html

Cisco: https://tools.cisco.com/security/center/content/CiscoSecurityAdvisory/cisco-sa-20180104-cpusidechannel

Maximum Supported Boot Devices in Virtual Machine BIOS

Noticed a interesting limitation on VMware virtual machines. If you configure multiple SCSI controllers and distribute more than  8 virtual  disks. You may experience randomly OS boot up failure when power cycle VMs. Only last 8 disks with higher SCSI ID present in boot order settings of BIOS. You cannot choose the disks with lower SCSI ID.

You need to following up VMware KB “Changing the boot order of a virtual machine using vmx options (2011654)” to force virtual machines boot up on proper SCSI node.

Memory Errors on Modern Servers

I used to see memory degrading on  Cisco  UCS blades. But less see on HPE blades. I thought it maybe quality control problem of Cisco manufacture. Today I read two articles in Cisco website, it explains why we see memory degrading and how it works. I attached the articles below.

Managing Correctable Memory Errors on Cisco UCS Servers

UCS Enhanced Memory Error Management

The conduction in the whitepaper is not only specific for Cisco UCS, but also for any modern servers. Following is summary of why memory errors rates is going high nowadays.

  • Larger memory systems contain more bits
  • Higher capacity DRAM chips require smaller bit cells which result in fewer stored charges per bit
  • Lower operating voltages can lead to reduced noise margin
  • Higher operating speeds can lead to reduced timing margin

Oracle Utilizes 50% of Physical Processors on HPE Server

DBA team told me Oracle was running slow on a HPE server. I observed the CPU utilization was about 50% of overall capacity. Whenever Oracle database bumps up the system experienced slowness.

Further  digged into the issue, I see Oracle workload only ran on single physical processor, but the server has two processors. And the  Windows 2012 R2 resource manager show the system used Processor Group, the two physical processors were grouped out. This technology is described in Microsoft MSDN article.

To fix the issue you have to change value of “NUMA Group Size Optimization” to “Flat” in BIOS. Please refer to HPE article for detail  steps.

Detail of HPE server behavior  is documented here. Please note, the article says it impacts to ProLiant Gen9 and Intel E5-26xx v3 processors. But it actually also impacts to Intel E5-26xx v4 and Synergy blades.

“No host data available” Reported in Hardware Status Tab

Just noticed a issue that nothing reported in ‘Hardware Status‘ tab of ESXi hosts in vSphere Web Client. KB 2112847 gives a solution but not works for me. The feature can be used to monitor hardware failures. I figured out a way to workaround it. You just need to login by Administrator account and click ‘Update‘ button under ‘Monitor‘ – ‘Hardware Status‘ for each ESXi host. You will get the status after few minutes.

PCPU locked up on Cisco UCS

PCPU 20 locked up. Failed to ack TLB invalidate

Error message of the PSOD

ESXi 5.5 Update 2 is stable version, but I got PSOD on one UCS blade few days ago. It scared me since there was a big bug when I upgraded ESXi from 5.1 to 5.5 Update 1 last year(See detail ESXi 5.5 and Emulex OneConnect 10Gb NIC), it lead to dozen virtual  machines crashed over and over again.I bet I’m gonna to die if it happens again. :-)

ESXi 5.5 Update 2 算得上比较稳定的版本了,但前几天遇到一台紫屏,差点儿吓尿了。半年前从ESXi 5.1升级到ESXi5.5 Update 1时候遇到个大BUG(详情见我的文章ESXi 5.5 and Emulex OneConnect 10Gb NIC),搞得几十台几十台机器挂,这次升级再来一次估计职业生涯就此结束了。

Continue reading

ESXi 5.5 and Emulex OneConnect 10Gb NIC

*** English Version ***

You are using HP ProLiant BL460c G7 or Gen8, ESXi version is 5.5, NIC is Emulex chipset. You are using driver version 10.x.x.x. You may experience the host randomly lost connectivity on vCenter Server, host status show “No responding”. You cannot ping any virtual machine hosted on the blade. High pause frame is observed on HP virtual connect model down links after problem occurred. And you see similar error in vmkernel logs:

Continue reading

Blue Screen with Bug Check 50 on ESXi 5.x

Some critical VMs got blue screen in last few weeks. After working with OS and hardware vendor, we figured out the root cause eventually. It’s a CPU problem related to Intel v2 CPU of E3, E5 and E7 families. The detail information is documented in VMware KB Windows 2008 R2 and Solaris 10 64-bit virtual machines blue screen or kernel panic when running on ESXi 5.x with an Intel E5 v2 series processor.

Continue reading

How to get HP ProLiant blade server and enclosure information

An enterprise infrastructure administrator needs to run plenty of reports for firmware, software version, or any kind of infrastructure data in their day-to-day operation. Some vendors provide powerful tools to pull out data from their solution, but what if you don’t have such tools? It is pain to get data manually especially for large number of servers. I’m going to share my trick to you. I’ll use HP ProLiant blade system for example, as it’s very common case in enterprise datacenter.

Continue reading

HP Blade Firmware Upgrading Best Practices for ESXi Host

I discussed this topic with a group, some people think firmware upgrade is not required if ESXi host working fine, that’s adapted to small business, but I think enterprise can do more better.

My ESXi running on HP blades, I’ll use that platform for example to share my thought and experience.

Why you need a plan for HP blade firmware upgrading of ESXi host?

First voice around my head is “We suggest you upgrade firmware to latest version”. You may experience similar like me when you call HP for helping, that’s look like HP official statement whenever we suspect a problem related to hardware. ;-) You know how hard to upgrade bulk of ESXi hosts to troubleshooting a network/storage problem, especially your hosts are running on older version, it may be extremely time consuming. So keep firmware up to date will save troubleshooting time, also make your life easy. :-)

Even no issue on hardware, you may still need to upgrade software, it’s rarely but some maybe conflict with old firmware, and in this scenarios please consider significantly downtime when you have to upgrade firmware if your server is running on older version.

Reboot is required for most firmware upgrading,

HP blade firmware upgrading tools for ESXi host

HP is right statement, their firmware has lifecycle, and the official HP policy is only to support updating to a new version that is two versions newer than the currently installed version.

Recently HP is replacing old firmware tools by HP Service Pack for ProLiant (SPP). SPP is an all in one image file includes firmware, drivers and management tools for ProLiant servers. Thanks HP, it’s pretty confuse when I upgrade by old way, now it’s easy to know which firmware level your servers exactly on.

You can upgrade ESXi host by two ways below. Online upgrading is recommended. Refer to
HP ProLiant Gen8 and later Servers – Understanding the Differences between Online and Offline Modes in HP SUM

Online upgrading – ESXi 5.x first time supports online firmware upgrading, that’s really benefit for production ESXi host. But on other side SPP doesn’t support online upgrading for all components on ESXi host, such as power management, and you have to install HP customized ESXi to use online upgrading.

Offline upgrading – offline upgrading is convention for all OS, ~30 minutes downtime is required for each blade.

You can click here for more detail of SPP.

Best practices for HP blade firmware upgrading

I’m using it now, it may give you some idea of how to plan firmware upgrading for ESXi host.

Before implement firmware

  1. Ensure HBA firmware is supported by storage vendor.
  2. Ensure NIC firmware is supported by OS and switch.
    Please check VMware compatibility guide.
  3. Create SPP server.
    You may have multiple Datacenter on different location. You have to prepare servers on each location to store SPP image, it reduces SPP image load time from local server.
  4. Create firmware baselines.
    You may want to keep ESXi host firmware up to date, I suggest creating a baseline, all ESXi host must be upgraded to exactly same firmware base on baseline. Enterprise datacenter may has thousands ESXi host, unified firmware will make it more stable. Your troubleshooting also more efficiency since it’s possible to identify hardware issue quickly.
  5. Create rollback plans.
    HP firmware can be force rollback, but not 100% successful, you can prepare alternative, such as vendor support after upgrading failed, data recovery from tape…etc.
    Create update plan.
  6. Which SPP will you use?
    Which ESXi version should be along with the baseline?
    How you upgrade ESXi host?
  7. Create testing environment.
    I would recommend perform testing if you want upgrade all smoothly. As least run the upgrading on one ESXi host and keeps it running 72 hours, monitor vmkernel log in case any issue.
  8. Generate firmware report.
    A firmware report is required to understanding the whole picture.
    You can generate the reports by native HP SUM (Smart Update Manager) in SPP image, or you can download SUM from HP website and run on a server, native version has problem to generate reports for some blade model, so latest version is preferred.
  9. Identify hotfixes and critical advisories.
    Read SPP release notes and HP CA to understand known issue and work around will make your IT life beautiful. :-)

Pre-check before upgrade OA/VC

HP blade is installed on enclosure, it managed by enclosure Onboard Administrator (OA) and interactive with network/storage via virtual connect module (VCM). Blade firmware should compatible with OA and VCM firmware version as well.

Before the upgrading you should spend some time to verify enclosure health and version by following steps.

  1. Perform a health check on the VC modules by Virtual Connect Support Utility.
  2. If OA firmware is 1.x, it must be updated to 2.32 before updating to newer versions.
  3. If VC firmware is greater than 3.00, then OA must be 3.00 first.
  4. Run HP Virtual Connect Pre 3.30 Analyzer if VC version is 3.x and upgrade to 3.3.
  5. Make sure that the VC modules are set up in a redundant configuration. Stack link should be configured.

You also need to make sure blade drivers is updated by same SPP image before upgrading.

Firmware upgrading

As I mentioned above, blade firmware should compatible with OA/VCM firmware, upgrade sequence is very important, blade may lost communication with OA/VCM if you upgrade by wrong sequence.

  1. If VC earlier than 1.34:
    Sequence is VC -> OA -> Blade.
  2. If VC 1.34 or later:
    Online mode sequence is OA -> Blade -> VC. (This is for firmware upgrading by SPP image.)
    Offline mode sequence is OA -> VC -> Blade. (This is for upgrading under CLI or offline mode.)
  3. Insert the SPP image via iLO. ( You can also extract the image to local disk of target server if it’s Windows )
  4. Boot from CD-ROM if you run via iLO.
  5. I recommend you select Interactive Mode if that’s first time you do it for a particular hardware specification.
  6. Go to review stage by following the wizard.
  7. Make sure all hardware is listed on updating list.
  8. Reboot after upgrading completed.

Note: If your blade firmware/driver is earlier than SPP2013.02 (include this version) you must upgrade VC to 4.01 or later, and then upgrade  blades.

That’s the best practices what I’m using, please let me know if you have better idea.

Windows cannot be installed on drive 0 partition 1

I think Windows Server 2012 will be next popular server OS just like Windows Server 2008, it’s also a nice hypervisor OS on virtual world. How do you think?

Installation is first step to experience the wonderful OS, you may see some strange problem during that step just like me. Today’s topic occurred long time ago, just want to share with people who may face similar issue like me.

That’s HP blade system with local disk attached, you may see similar problem on other vendor. When you select disk to install OS, installer may says Windows can’t be installed on drive 0 partition 1, or Windows cannot be installed on this disk. This computer’s hardware may not support booting to this disk. Ensure that the disk’s controllers is enabled in the computer’s BIOS menu.

That’s because boot volume is not set on array controller. For example by HP servers, you have to reboot and press F8 after BIOS checks array controller to enter array controller management interface. Then go to Select Boot Volume in main menu, select Direct Attached Storage, and then select the disk you want to install OS. Follow up the wizard to continue boot up.

If the problem persists, go to array controller management interface, rebuild array and select boot volume again, it should fix your problem.

How to Upgrade Virtual Hardware on MSCS VM

We get more new cool feature if keep virtual hardware up to date. And you may face boot problem when upgrade lower virtual hardware version to latest.

I always keep my Microsoft Cluster Services VM (MSCS VM) up to date since RDM disk usually uses on that kind of VMs.

I tried to search how to upgrade virtual hardware on MSCS VM with RDM LUN, but no lucky. That’s my experience:

  1. Update manager doesn’t work for MSCS VM.
  2. No snapshot would be taken if your SCSI controller of RDM is physical mode, you should have a good backup before upgrading.
  3. It’s possible to force upgrade hardware version by right click VM and select Upgrade Virtual Hardware.
  4. Make sure all services are running on another node.
  5. You will get following error message on Event for RDM disks in vSphere Client, upgrading procedure won’t be finished until error pop out for all RDM disks.
  6. I tried upgrade version 7 to 8.

A disk read error occurred after upgrade HW version from 3 to 9

This was a lesson and learns for me after I recovered the data back. My data was lost and no backup…

I had a virtual machine was moved from ESX 3.0 to ESXi 5.1 host long time ago. The virtual disk size show 0 and I cannot do storage migration and snapshot on the VM due to the hardware version was 3, it’s too low.

Generally I take snapshot before upgrade VM HW version, but that’s impossible on a VM of HW version 3 that running on vCenter Server 5.1. So I upgraded the VMware Tools and then VM hardware version by Update Manager. VMware Tools was successfully upgraded, but VM hardware version upgrading got error.

Then I right clicked the VM and used “Upgrade Hardware Version” option directly, it’s successfully without any prompt…finally I got “A disk read error occurred” when boot up. L

You may think it’s caused by SCSI controller since VM hardware version 3 supports IDE virtual disk and version 9 supports only SCSI virtual disk for best performance. That’s not my case. I tried several way to recover the disk, like convert the VM by convertor, mount the disk to other virtual machine, change SCSI parameter…etc.

I don’t think hardware version upgrading changes real virtual disk too much, it must be something changed on the head section of virtual disk, or description file. After consulted with Microsoft we got it fixed finally.

When I mounted the corrupted disk on other virtual machine, partition and size was recognized correctly. And disk manager also can recognize the NTFS file system. I can saw new drive appear in My Computer as well, but it show me “File or directory corrupted…” when I tried to open the drive. It more like a file system issue… it’s easy, just run following command to check any logical errors:

Chkdsk [drive letter]

Wow….a lot of error and files was listed, then I tried command:

Chkdsk /f [drive letter]

That’s real fix logical issue of disk. I could open the drive after used this command.

I mounted the drive back to the broken VM and powered on. New issue came up…Windows show me “Windows NT could not start because the below file is missing or corrupt: C:\Windows\System32\Ntoskrnl.exe”. I replaced the file but no help. The file was existed in the location, and file size was same like other VM, it’s perhaps not file issue?

Then I open VMDK file, aha….ddb.adapterType = “LegacyESX”, changed it to ddb.adapterType = “lsilogic” according to my SCSI controller set, my lovely Windows Server startup screen came back again. J

Okay, I talked too much. To summarize the fixing steps:

  • Mount the broken disk to a good virtual machine with same operating system. ( I’m not sure is it ok to mount on higher version of OS )
  • Run chkdsk [drive letter] to check if logical error existing.
  • Run chkdsk /f [drive] letter] to fix the logical error.
  • Unmounts the disk from good VM.
  • Edit the VMDK file in ESXi console.
  • Change the value of ddb.adapterType to proper SCSI controller type according to your SCSI controller setting.
  • Mount the disk back to broken VM.
  • Power on.

Here is my learning from that contingence:

  1. vCenter Server does not verify compatibility of VM hardware version during upgrading. Actually it’s not allowed to upgrade VM version from 3 to 9 directly.
  2. vCenter Server does not allowed you choose which VM hardware version you want to upgrade to, always latest.
  3. If you upgrade VM version from 3 to 9 directly, a SCSI controller will be added to the VM, value of ddb.adapterType will be changed to LegacyESX. You will not able to boot up the VM due to Windows Server 2003 does not contain proper SCSI driver.
  4. VM version upgrading looks like changes parameters of VMDK file but don’t change too much of real virtual disk, such as NTFS mapping and MBR table…etc.

Last, you may still face BSOD after use above solution since item 3 above, you have to inject the SCSI driver, please refer to KB 1005208 and KB 1006858.

Last of last…. :-) please take a backup of your virtual disk before you do any change!!!!!