Tag Archives: Hardware

Maximum Supported Boot Devices in Virtual Machine BIOS

Noticed a interesting limitation on VMware virtual machines. If you configure multiple SCSI controllers and distribute more than  8 virtual  disks. You may experience randomly OS boot up failure when power cycle VMs. Only last 8 disks with higher SCSI ID present in boot order settings of BIOS. You cannot choose the disks with lower SCSI ID.

You need to following up VMware KB “Changing the boot order of a virtual machine using vmx options (2011654)” to force virtual machines boot up on proper SCSI node.

Advertisements

Memory Errors on Modern Servers

I used to see memory degrading on  Cisco  UCS blades. But less see on HPE blades. I thought it maybe quality control problem of Cisco manufacture. Today I read two articles in Cisco website, it explains why we see memory degrading and how it works. I attached the articles below.

Managing Correctable Memory Errors on Cisco UCS Servers

UCS Enhanced Memory Error Management

The conduction in the whitepaper is not only specific for Cisco UCS, but also for any modern servers. Following is summary of why memory errors rates is going high nowadays.

  • Larger memory systems contain more bits
  • Higher capacity DRAM chips require smaller bit cells which result in fewer stored charges per bit
  • Lower operating voltages can lead to reduced noise margin
  • Higher operating speeds can lead to reduced timing margin

Oracle Utilizes 50% of Physical Processors on HPE Server

DBA team told me Oracle was running slow on a HPE server. I observed the CPU utilization was about 50% of overall capacity. Whenever Oracle database bumps up the system experienced slowness.

Further  digged into the issue, I see Oracle workload only ran on single physical processor, but the server has two processors. And the  Windows 2012 R2 resource manager show the system used Processor Group, the two physical processors were grouped out. This technology is described in Microsoft MSDN article.

To fix the issue you have to change value of “NUMA Group Size Optimization” to “Flat” in BIOS. Please refer to HPE article for detail  steps.

Detail of HPE server behavior  is documented here. Please note, the article says it impacts to ProLiant Gen9 and Intel E5-26xx v3 processors. But it actually also impacts to Intel E5-26xx v4 and Synergy blades.

“No host data available” Reported in Hardware Status Tab

Just noticed a issue that nothing reported in ‘Hardware Status‘ tab of ESXi hosts in vSphere Web Client. KB 2112847 gives a solution but not works for me. The feature can be used to monitor hardware failures. I figured out a way to workaround it. You just need to login by Administrator account and click ‘Update‘ button under ‘Monitor‘ – ‘Hardware Status‘ for each ESXi host. You will get the status after few minutes.

PCPU locked up on Cisco UCS

PCPU 20 locked up. Failed to ack TLB invalidate

Error message of the PSOD

ESXi 5.5 Update 2 is stable version, but I got PSOD on one UCS blade few days ago. It scared me since there was a big bug when I upgraded ESXi from 5.1 to 5.5 Update 1 last year(See detail ESXi 5.5 and Emulex OneConnect 10Gb NIC), it lead to dozen virtual  machines crashed over and over again.I bet I’m gonna to die if it happens again. :-)

ESXi 5.5 Update 2 算得上比较稳定的版本了,但前几天遇到一台紫屏,差点儿吓尿了。半年前从ESXi 5.1升级到ESXi5.5 Update 1时候遇到个大BUG(详情见我的文章ESXi 5.5 and Emulex OneConnect 10Gb NIC),搞得几十台几十台机器挂,这次升级再来一次估计职业生涯就此结束了。

Continue reading

ESXi 5.5 and Emulex OneConnect 10Gb NIC

*** English Version ***

You are using HP ProLiant BL460c G7 or Gen8, ESXi version is 5.5, NIC is Emulex chipset. You are using driver version 10.x.x.x. You may experience the host randomly lost connectivity on vCenter Server, host status show “No responding”. You cannot ping any virtual machine hosted on the blade. High pause frame is observed on HP virtual connect model down links after problem occurred. And you see similar error in vmkernel logs:

Continue reading

Blue Screen with Bug Check 50 on ESXi 5.x

Some critical VMs got blue screen in last few weeks. After working with OS and hardware vendor, we figured out the root cause eventually. It’s a CPU problem related to Intel v2 CPU of E3, E5 and E7 families. The detail information is documented in VMware KB Windows 2008 R2 and Solaris 10 64-bit virtual machines blue screen or kernel panic when running on ESXi 5.x with an Intel E5 v2 series processor.

Continue reading

How to get HP ProLiant blade server and enclosure information

An enterprise infrastructure administrator needs to run plenty of reports for firmware, software version, or any kind of infrastructure data in their day-to-day operation. Some vendors provide powerful tools to pull out data from their solution, but what if you don’t have such tools? It is pain to get data manually especially for large number of servers. I’m going to share my trick to you. I’ll use HP ProLiant blade system for example, as it’s very common case in enterprise datacenter.

Continue reading

HP Blade Firmware Upgrading Best Practices for ESXi Host

I discussed this topic with a group, some people think firmware upgrade is not required if ESXi host working fine, that’s adapted to small business, but I think enterprise can do more better.

My ESXi running on HP blades, I’ll use that platform for example to share my thought and experience.

Why you need a plan for HP blade firmware upgrading of ESXi host?

First voice around my head is “We suggest you upgrade firmware to latest version”. You may experience similar like me when you call HP for helping, that’s look like HP official statement whenever we suspect a problem related to hardware. ;-) You know how hard to upgrade bulk of ESXi hosts to troubleshooting a network/storage problem, especially your hosts are running on older version, it may be extremely time consuming. So keep firmware up to date will save troubleshooting time, also make your life easy. :-)

Even no issue on hardware, you may still need to upgrade software, it’s rarely but some maybe conflict with old firmware, and in this scenarios please consider significantly downtime when you have to upgrade firmware if your server is running on older version.

Reboot is required for most firmware upgrading,

HP blade firmware upgrading tools for ESXi host

HP is right statement, their firmware has lifecycle, and the official HP policy is only to support updating to a new version that is two versions newer than the currently installed version.

Recently HP is replacing old firmware tools by HP Service Pack for ProLiant (SPP). SPP is an all in one image file includes firmware, drivers and management tools for ProLiant servers. Thanks HP, it’s pretty confuse when I upgrade by old way, now it’s easy to know which firmware level your servers exactly on.

You can upgrade ESXi host by two ways below. Online upgrading is recommended. Refer to
HP ProLiant Gen8 and later Servers – Understanding the Differences between Online and Offline Modes in HP SUM

Online upgrading – ESXi 5.x first time supports online firmware upgrading, that’s really benefit for production ESXi host. But on other side SPP doesn’t support online upgrading for all components on ESXi host, such as power management, and you have to install HP customized ESXi to use online upgrading.

Offline upgrading – offline upgrading is convention for all OS, ~30 minutes downtime is required for each blade.

You can click here for more detail of SPP.

Best practices for HP blade firmware upgrading

I’m using it now, it may give you some idea of how to plan firmware upgrading for ESXi host.

Before implement firmware

  1. Ensure HBA firmware is supported by storage vendor.
  2. Ensure NIC firmware is supported by OS and switch.
    Please check VMware compatibility guide.
  3. Create SPP server.
    You may have multiple Datacenter on different location. You have to prepare servers on each location to store SPP image, it reduces SPP image load time from local server.
  4. Create firmware baselines.
    You may want to keep ESXi host firmware up to date, I suggest creating a baseline, all ESXi host must be upgraded to exactly same firmware base on baseline. Enterprise datacenter may has thousands ESXi host, unified firmware will make it more stable. Your troubleshooting also more efficiency since it’s possible to identify hardware issue quickly.
  5. Create rollback plans.
    HP firmware can be force rollback, but not 100% successful, you can prepare alternative, such as vendor support after upgrading failed, data recovery from tape…etc.
    Create update plan.
  6. Which SPP will you use?
    Which ESXi version should be along with the baseline?
    How you upgrade ESXi host?
  7. Create testing environment.
    I would recommend perform testing if you want upgrade all smoothly. As least run the upgrading on one ESXi host and keeps it running 72 hours, monitor vmkernel log in case any issue.
  8. Generate firmware report.
    A firmware report is required to understanding the whole picture.
    You can generate the reports by native HP SUM (Smart Update Manager) in SPP image, or you can download SUM from HP website and run on a server, native version has problem to generate reports for some blade model, so latest version is preferred.
  9. Identify hotfixes and critical advisories.
    Read SPP release notes and HP CA to understand known issue and work around will make your IT life beautiful. :-)

Pre-check before upgrade OA/VC

HP blade is installed on enclosure, it managed by enclosure Onboard Administrator (OA) and interactive with network/storage via virtual connect module (VCM). Blade firmware should compatible with OA and VCM firmware version as well.

Before the upgrading you should spend some time to verify enclosure health and version by following steps.

  1. Perform a health check on the VC modules by Virtual Connect Support Utility.
  2. If OA firmware is 1.x, it must be updated to 2.32 before updating to newer versions.
  3. If VC firmware is greater than 3.00, then OA must be 3.00 first.
  4. Run HP Virtual Connect Pre 3.30 Analyzer if VC version is 3.x and upgrade to 3.3.
  5. Make sure that the VC modules are set up in a redundant configuration. Stack link should be configured.

You also need to make sure blade drivers is updated by same SPP image before upgrading.

Firmware upgrading

As I mentioned above, blade firmware should compatible with OA/VCM firmware, upgrade sequence is very important, blade may lost communication with OA/VCM if you upgrade by wrong sequence.

  1. If VC earlier than 1.34:
    Sequence is VC -> OA -> Blade.
  2. If VC 1.34 or later:
    Online mode sequence is OA -> Blade -> VC. (This is for firmware upgrading by SPP image.)
    Offline mode sequence is OA -> VC -> Blade. (This is for upgrading under CLI or offline mode.)
  3. Insert the SPP image via iLO. ( You can also extract the image to local disk of target server if it’s Windows )
  4. Boot from CD-ROM if you run via iLO.
  5. I recommend you select Interactive Mode if that’s first time you do it for a particular hardware specification.
  6. Go to review stage by following the wizard.
  7. Make sure all hardware is listed on updating list.
  8. Reboot after upgrading completed.

Note: If your blade firmware/driver is earlier than SPP2013.02 (include this version) you must upgrade VC to 4.01 or later, and then upgrade  blades.

That’s the best practices what I’m using, please let me know if you have better idea.

Windows cannot be installed on drive 0 partition 1

I think Windows Server 2012 will be next popular server OS just like Windows Server 2008, it’s also a nice hypervisor OS on virtual world. How do you think?

Installation is first step to experience the wonderful OS, you may see some strange problem during that step just like me. Today’s topic occurred long time ago, just want to share with people who may face similar issue like me.

That’s HP blade system with local disk attached, you may see similar problem on other vendor. When you select disk to install OS, installer may says Windows can’t be installed on drive 0 partition 1, or Windows cannot be installed on this disk. This computer’s hardware may not support booting to this disk. Ensure that the disk’s controllers is enabled in the computer’s BIOS menu.

That’s because boot volume is not set on array controller. For example by HP servers, you have to reboot and press F8 after BIOS checks array controller to enter array controller management interface. Then go to Select Boot Volume in main menu, select Direct Attached Storage, and then select the disk you want to install OS. Follow up the wizard to continue boot up.

If the problem persists, go to array controller management interface, rebuild array and select boot volume again, it should fix your problem.

How to Upgrade Virtual Hardware on MSCS VM

We get more new cool feature if keep virtual hardware up to date. And you may face boot problem when upgrade lower virtual hardware version to latest.

I always keep my Microsoft Cluster Services VM (MSCS VM) up to date since RDM disk usually uses on that kind of VMs.

I tried to search how to upgrade virtual hardware on MSCS VM with RDM LUN, but no lucky. That’s my experience:

  1. Update manager doesn’t work for MSCS VM.
  2. No snapshot would be taken if your SCSI controller of RDM is physical mode, you should have a good backup before upgrading.
  3. It’s possible to force upgrade hardware version by right click VM and select Upgrade Virtual Hardware.
  4. Make sure all services are running on another node.
  5. You will get following error message on Event for RDM disks in vSphere Client, upgrading procedure won’t be finished until error pop out for all RDM disks.
  6. I tried upgrade version 7 to 8.

A disk read error occurred after upgrade HW version from 3 to 9

This was a lesson and learns for me after I recovered the data back. My data was lost and no backup…

I had a virtual machine was moved from ESX 3.0 to ESXi 5.1 host long time ago. The virtual disk size show 0 and I cannot do storage migration and snapshot on the VM due to the hardware version was 3, it’s too low.

Generally I take snapshot before upgrade VM HW version, but that’s impossible on a VM of HW version 3 that running on vCenter Server 5.1. So I upgraded the VMware Tools and then VM hardware version by Update Manager. VMware Tools was successfully upgraded, but VM hardware version upgrading got error.

Then I right clicked the VM and used “Upgrade Hardware Version” option directly, it’s successfully without any prompt…finally I got “A disk read error occurred” when boot up. L

You may think it’s caused by SCSI controller since VM hardware version 3 supports IDE virtual disk and version 9 supports only SCSI virtual disk for best performance. That’s not my case. I tried several way to recover the disk, like convert the VM by convertor, mount the disk to other virtual machine, change SCSI parameter…etc.

I don’t think hardware version upgrading changes real virtual disk too much, it must be something changed on the head section of virtual disk, or description file. After consulted with Microsoft we got it fixed finally.

When I mounted the corrupted disk on other virtual machine, partition and size was recognized correctly. And disk manager also can recognize the NTFS file system. I can saw new drive appear in My Computer as well, but it show me “File or directory corrupted…” when I tried to open the drive. It more like a file system issue… it’s easy, just run following command to check any logical errors:

Chkdsk [drive letter]

Wow….a lot of error and files was listed, then I tried command:

Chkdsk /f [drive letter]

That’s real fix logical issue of disk. I could open the drive after used this command.

I mounted the drive back to the broken VM and powered on. New issue came up…Windows show me “Windows NT could not start because the below file is missing or corrupt: C:\Windows\System32\Ntoskrnl.exe”. I replaced the file but no help. The file was existed in the location, and file size was same like other VM, it’s perhaps not file issue?

Then I open VMDK file, aha….ddb.adapterType = “LegacyESX”, changed it to ddb.adapterType = “lsilogic” according to my SCSI controller set, my lovely Windows Server startup screen came back again. J

Okay, I talked too much. To summarize the fixing steps:

  • Mount the broken disk to a good virtual machine with same operating system. ( I’m not sure is it ok to mount on higher version of OS )
  • Run chkdsk [drive letter] to check if logical error existing.
  • Run chkdsk /f [drive] letter] to fix the logical error.
  • Unmounts the disk from good VM.
  • Edit the VMDK file in ESXi console.
  • Change the value of ddb.adapterType to proper SCSI controller type according to your SCSI controller setting.
  • Mount the disk back to broken VM.
  • Power on.

Here is my learning from that contingence:

  1. vCenter Server does not verify compatibility of VM hardware version during upgrading. Actually it’s not allowed to upgrade VM version from 3 to 9 directly.
  2. vCenter Server does not allowed you choose which VM hardware version you want to upgrade to, always latest.
  3. If you upgrade VM version from 3 to 9 directly, a SCSI controller will be added to the VM, value of ddb.adapterType will be changed to LegacyESX. You will not able to boot up the VM due to Windows Server 2003 does not contain proper SCSI driver.
  4. VM version upgrading looks like changes parameters of VMDK file but don’t change too much of real virtual disk, such as NTFS mapping and MBR table…etc.

Last, you may still face BSOD after use above solution since item 3 above, you have to inject the SCSI driver, please refer to KB 1005208 and KB 1006858.

Last of last…. :-) please take a backup of your virtual disk before you do any change!!!!!