I used to see memory degrade on Cisco UCS blades, but rarely on HPE blades. I thought it might be a quality control problem in Cisco's manufacturing. Today I read two articles on the Cisco website that explain why we see memory degradation and how the error handling works. The articles are listed below.
Managing Correctable Memory Errors on Cisco UCS Servers
UCS Enhanced Memory Error Management
The conclusion in the whitepaper is not specific to Cisco UCS; it applies to any modern server. The following is a summary of why memory error rates are rising nowadays.
- Larger memory systems contain more bits
- Higher capacity DRAM chips require smaller bit cells which result in fewer stored charges per bit
- Lower operating voltages can lead to reduced noise margin
- Higher operating speeds can lead to reduced timing margin
The DBA team told me Oracle was running slowly on an HPE server. I observed that CPU utilization was about 50% of overall capacity. Whenever the Oracle database load spiked, the system slowed down.
Digging further into the issue, I saw that the Oracle workload ran on only one physical processor, although the server has two. The Windows 2012 R2 resource manager showed that the system was using Processor Groups, and the two physical processors had been placed in separate groups. This technology is described in a Microsoft MSDN article.
To fix the issue, change the value of “NUMA Group Size Optimization” to “Flat” in the BIOS. Please refer to the HPE article for detailed steps.
The details of the HPE server behavior are documented here. Note that the article says it affects ProLiant Gen9 servers with Intel E5-26xx v3 processors, but it actually also affects Intel E5-26xx v4 processors and Synergy blades.
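The 50% ceiling follows from simple arithmetic: a process confined to one processor group can only use that group's logical processors. The sketch below is a rough illustration, not a diagnostic tool. The function names and the 16-processors-per-group figure are hypothetical. The optional Windows query uses the kernel32 calls `GetActiveProcessorGroupCount` and `GetActiveProcessorCount`, and it returns `None` on other platforms.

```python
import sys


def overall_cpu_ceiling(group_sizes, groups_used):
    """Return the highest overall CPU utilization (0.0 to 1.0) a workload can
    reach when it is confined to the processor groups listed in groups_used."""
    total = sum(group_sizes)
    usable = sum(group_sizes[g] for g in set(groups_used))
    return usable / total


def active_processor_groups():
    """Return the logical processor count of each group on Windows, else None."""
    if sys.platform != "win32":
        return None
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.windll.kernel32
    kernel32.GetActiveProcessorGroupCount.restype = wintypes.WORD
    kernel32.GetActiveProcessorCount.argtypes = [wintypes.WORD]
    kernel32.GetActiveProcessorCount.restype = wintypes.DWORD
    group_count = kernel32.GetActiveProcessorGroupCount()
    return [kernel32.GetActiveProcessorCount(g) for g in range(group_count)]


if __name__ == "__main__":
    # Hypothetical layout: two groups of 16 logical processors each.
    sizes = active_processor_groups() or [16, 16]
    print("groups:", sizes)
    print("ceiling when confined to group 0:", overall_cpu_ceiling(sizes, [0]))
```

With two equal groups and a workload that stays in group 0, the ceiling is 0.5, which matches the roughly 50% utilization seen on the server. After switching to “Flat”, one would expect the query to report a single group, assuming the total logical processor count fits in one group.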
I just noticed an issue where nothing is reported in the ‘Hardware Status‘ tab of ESXi hosts in vSphere Web Client. This feature is useful for monitoring hardware failures. KB 2112847 gives a solution, but it did not work for me. I figured out a workaround: log in with the Administrator account and click the ‘Update‘ button under ‘Monitor‘ – ‘Hardware Status‘ for each ESXi host. The status appears after a few minutes.
New B200 M4 blades can run Intel v4 processors. You may see a discovery issue if your UCSM firmware version is lower than 2.2.7c. I hit that problem a few days ago when I installed a new M4 blade: the FSM hung at 58% for a very long time and eventually failed.
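Before inserting a new blade, it can help to check the running UCSM version against that threshold. The following is a minimal sketch, assuming version strings look like either `2.2.7c` (as written in this post) or `2.2(7c)`. The helper names are hypothetical, and the check is a plain ordering comparison that says nothing about behavior on other release trains.

```python
import re


def parse_ucsm_version(text):
    """Parse '2.2.7c' or '2.2(7c)' into a comparable tuple such as (2, 2, 7, 'c')."""
    match = re.fullmatch(r"(\d+)\.(\d+)[.(](\d+)([a-z]?)\)?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch, letter = match.groups()
    return (int(major), int(minor), int(patch), letter)


# Threshold taken from the post: versions lower than 2.2.7c may fail discovery.
MINIMUM_VERSION = parse_ucsm_version("2.2.7c")


def meets_minimum(version):
    """Return True if the given version is not lower than the 2.2.7c threshold."""
    return parse_ucsm_version(version) >= MINIMUM_VERSION


if __name__ == "__main__":
    for candidate in ("2.2.6g", "2.2(7c)", "2.2.8a"):
        print(candidate, meets_minimum(candidate))
```

Tuples compare element by element, so the numeric parts are ordered numerically and the trailing letter only breaks ties within the same patch level.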
Today my co-worker asked me a question about the memory spec for ESXi hosts. There are two types of memory, LRDIMM and RDIMM, on the hardware vendor’s list. Which one is the best choice?
Error message of the PSOD
ESXi 5.5 Update 2 is a stable version, but I got a PSOD on one UCS blade a few days ago. It scared me, because I hit a big bug when I upgraded ESXi from 5.1 to 5.5 Update 1 last year (see details in ESXi 5.5 and Emulex OneConnect 10Gb NIC), which caused dozens of virtual machines to crash over and over again. I bet I would be done for if it happened again. :-)
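ESXi 5.5 Update 2 counts as a fairly stable version, but a few days ago one host hit a purple screen, which scared the hell out of me. Half a year ago, when upgrading from ESXi 5.1 to ESXi 5.5 Update 1, I ran into a big bug (for details see my article ESXi 5.5 and Emulex OneConnect 10Gb NIC) that took down dozens of machines at a time. If it happened again with this upgrade, my career would probably be over.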