I used to see memory degrading on Cisco UCS blades. But less see on HPE blades. I thought it maybe quality control problem of Cisco manufacture. Today I read two articles in Cisco website, it explains why we see memory degrading and how it works. I attached the articles below.
Managing Correctable Memory Errors on Cisco UCS Servers
UCS Enhanced Memory Error Management
The conduction in the whitepaper is not only specific for Cisco UCS, but also for any modern servers. Following is summary of why memory errors rates is going high nowadays.
- Larger memory systems contain more bits
- Higher capacity DRAM chips require smaller bit cells which result in fewer stored charges per bit
- Lower operating voltages can lead to reduced noise margin
- Higher operating speeds can lead to reduced timing margin
New B200 M4 blades can running on Intel v4 processors. You may see discovery issue if your UCSM firmware version lower than 2.2.7c. I hit that problem few days ago when I install a new M4 blade. The FSM hung on 58% a real long time and failed eventually.
Cisco UCS blade system is the best blade system I used so far. Whatever the hardware, software or support is perfect. I recommend leverage the system for primary system of virtualization. UCS blade system architecture is different with HP. I feel it more likes a network system. Fabric Interconnect (FI) modules exchange data between uplinks and internal components. IOMs on each chassis controls data routing. Architecture is complicate, but it’s powerful to manage large datacenter. Talking about large datacenter, you may have hundred chassis or blades. Data goes through FIs, IOMs and blades, you could see issues on any layer. It’s hard to find out where exactly the problem is. UCS Manager provides statistics for ports just like how Cisco does on network switches. You can show statistics of a particular port. But it doesn’t tell you when and which layer it happened. I tested Cisco UCS adapter for vRealize Operation Manager before I reviewed NetApp adapter for vRealize Operation Manager. It’s developed by same company Blue Medora. I’d like to introduce few of this product, it’s just my personal review.
思科UCS刀片系列是我至今用过最好的刀片系统。无论是硬件、软件还是技术支持都堪称完美。个人推荐在大型虚拟化机房里把思科UCS作为主要设备。思科UCS刀片系统的架构和惠普的完全不同，感觉更像是个网络设备。Fabric Interconnect (FI)模块负责上联口和内部各组件之间的数据交换、IOM负责各刀箱数据路由。架构看起来很复杂，但是在管理大型数据中心时非常强大。说到大型数据中心，比如有 上百个刀箱和刀片服务器，数据要经过FI、IOM、刀片等，问题可能发生在任何层面，大型虚拟化数据中心很难找到问题的根源。UCS Manager有提供类似思科网络交换机一样的计数器功能，可以显示每一个端口的计数情况，但是这个监控工具不会告诉你什么时间、在哪个层面发生了问题 。在测试NetApp存储性能监控组件之前我有幸测试了vRealize Operations Manager 6的Cisco UCS性能监控组件。该组建同样由Blue Medora开发。以下简单介绍一下，只是我的个人观点 。
I noticed UCS Manager got unexpected failover after we upgraded firmware to 2.2(2c). Looks like it hits a bug CSCuo11700. Firmware should be upgraded to 2.2.(3a) to fix the issue.
I just found a Cisco KB descripts a firmware issue may impact to ESXi FC storage performance. Please have a look whether your Cisco UCS system firmware is 2.1(2a), 2.1(2c), or 2.1(2d).
Whatever you configure on MDS, whatever you configure on Cisco UCS FIs, whatever you do for port channel on both side, the Cisco UCS uplink ports always down with error message Initilize failed, or Error disabled.
Congratulation! your device hit MDS firmware bug…https://tools.cisco.com/bugsearch/bug/CSCtr01652/?reffering_site=dumpcr.
Your vHBAs or other PCI devices may stop running in ESXi 5.x when using Interrupt Remapping feature.
This issue only impact to UCS blade BIOS version 1.4(3c), it has been fixed on 1.4(3j).
Please refer to http://kb.vmware.com/kb/1030265 to see how to disable Interrupt Remapping feature in ESXi 5.x
Also refer to https://tools.cisco.com/bugsearch/bug/CSCty96722.