
Wu's Blog

My IT Adventure

Tag: Hardware

Cisco UCS Blade Cannot Get IP Address for KVM

You may see “The IP address to reach the server is not set” when clicking the KVM console in Cisco UCS Manager. The issue persists even when Cisco UCS Manager has enough IP addresses for management. Re-acknowledging the server or resetting the CIMC does not fix the problem.

To fix it, go to “Equipment” -> select the server -> “General” tab -> “Server Maintenance” -> “Decommission” the server.

Wait for the decommission to complete, then re-acknowledge the server. An IP address will be assigned to the server after the acknowledgement process completes.

August 25, 2018
UCS Manager UI Font Size on 4K Screens
Older versions of UCS Manager run as a Java application, and the UI fonts can be extremely small on a high-DPI screen. The fix is:
1. Go to “C:\Program Files (x86)\Java\jre1.8.0_171\bin” (the JRE version in the path may differ on your machine).
2. Open the “Properties” of “jp2launcher.exe”.
3. On the “Compatibility” tab, click “Change high DPI settings”.
4. Check “Override high DPI scaling behavior….”.
5. Select “System (Enhanced)” or “System”.
June 1, 2018
Troubleshooting Network Performance of a Virtual Machine
A virtualization infrastructure has several layers of networking: the guest operating system, the virtual machine, the ESXi driver, the physical network adapters, RJ45/SFP modules, network switches, etc. Sometimes it’s hard to say exactly where a problem originates, especially when it is in the hardware layer. Today I worked on a very interesting case that may give some ideas for troubleshooting network performance issues caused by hardware layers.

A user told me he was bothered by the network performance of a virtual machine: copying data to an NFS share was slow, but responses to the “ping” command looked good. I didn’t see any issue at the virtual machine layer. VMware Tools was up to date, the Windows OS was patched, the virtual network adapter type was VMXNET3, and the VM hardware version was also up to date.

When I copied an image file to a shared folder on the virtual machine, the speed was sometimes fast and sometimes not. Since the host has two physical uplinks, this led me to suspect one of the uplinks.

After a lot of swapping and cable changing, we eventually found a bad SFP on the network switch end. I was able to observe the issue using “psping.exe” from Microsoft Sysinternals. I used the following command to send ping packets of different sizes to the virtual machine. Packet drops increased as I increased the packet size.
```
REM Syntax: psping.exe -l <packet size> <destination>
psping.exe -l 4k xxxx.contoso.com
```
The size can be 1k, 2m, or even larger. I think this is a good way to identify problems outside of ESXi, especially SFP problems, because this kind of problem didn’t show up as CRC errors or error counts at the network switch level.

You can also use the native Windows “ping.exe” command as follows. Its size unit is bytes, so you need to enter 4096 if you want to send 4 KB. Note that ping.exe accepts at most 65500 bytes, so it cannot send the larger sizes that psping.exe can.
```
REM Syntax: ping.exe -l <size in bytes> <destination>
ping.exe -l 4096 xxx.contoso.com
```
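To repeat the size sweep described above and record the loss at each size, a small script can help. The following Python sketch is an illustration under assumptions, not the exact procedure I used: the helper names (`size_to_bytes`, `loss_percent`, `sweep`) and the default list of sizes are hypothetical, `k`/`m` are treated as 1024-based to match the 4k = 4096 example, and the parser expects English-language ping.exe output. It drives ping.exe rather than psping.exe, so it skips anything above the 65500-byte limit.

```python
import re
import subprocess

PING_MAX_BYTES = 65500  # ping.exe rejects -l values above this

# Hypothetical helper: treats k/m as 1024-based, matching "4k = 4096" above.
def size_to_bytes(size: str) -> int:
    """Convert a psping-style size such as '4k' or '2m' into bytes."""
    units = {"k": 1024, "m": 1024 * 1024}
    size = size.strip().lower()
    if size and size[-1] in units:
        return int(size[:-1]) * units[size[-1]]
    return int(size)

# Assumes English output, e.g. "Packets: Sent = 20, Received = 15, Lost = 5 (25% loss),"
LOSS_RE = re.compile(r"Sent = (\d+), Received = (\d+)")

def loss_percent(ping_output: str) -> float:
    """Compute packet loss (%) from the summary line of ping.exe output."""
    match = LOSS_RE.search(ping_output)
    if match is None:
        raise ValueError("no ping summary found in output")
    sent, received = map(int, match.groups())
    return 100.0 * (sent - received) / sent

def sweep(destination: str, sizes=("1k", "4k", "16k", "32k", "64000"), count: int = 20) -> dict:
    """Ping the destination once per size (Windows only); map bytes -> loss %."""
    results = {}
    for size in sizes:
        nbytes = size_to_bytes(size)
        if nbytes > PING_MAX_BYTES:
            continue  # too large for ping.exe; use psping.exe for these sizes
        completed = subprocess.run(
            ["ping.exe", "-n", str(count), "-l", str(nbytes), destination],
            capture_output=True,
            text=True,
        )
        results[nbytes] = loss_percent(completed.stdout)
    return results
```

On a Windows machine, `sweep("xxx.contoso.com")` returns a dictionary mapping each tested size in bytes to its loss percentage. Loss that grows with packet size, as in the case above, points at a problem outside the VM.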
January 31, 2018
CVE-2017-5754, CVE-2017-5753 and CVE-2017-5715 (Spectre and Meltdown)

You may know that the industry recently disclosed three vulnerabilities. Long story short, kernel memory can be exposed to attackers while processors run user-space code. They affect not only Intel processors but also AMD and ARM. CVE-2017-5715 is a hardware issue that requires a firmware (microcode) update, along with OS support, to mitigate. CVE-2017-5754 and CVE-2017-5753 require OS patches that change how code accesses kernel address space. Following are some useful links for your reference.

CVE-2017-5753

CVE-2017-5715

CVE-2017-5754

VMware: https://www.vmware.com/security/advisories/VMSA-2018-0002.html (For CVE-2017-5753 and CVE-2017-5715. VMware has not published anything for CVE-2017-5754 yet.)

Microsoft: https://support.microsoft.com/en-gb/help/4072698/windows-server-guidance-to-protect-against-the-speculative-execution
https://support.microsoft.com/en-gb/help/4073119/protect-against-speculative-execution-side-channel-vulnerabilities-in

HPE: http://h22208.www2.hpe.com/eginfolib/securityalerts/SCAM/Side_Channel_Analysis_Method.html

Cisco: https://tools.cisco.com/security/center/content/CiscoSecurityAdvisory/cisco-sa-20180104-cpusidechannel

January 5, 2018
Maximum Supported Boot Devices in Virtual Machine BIOS

I noticed an interesting limitation on VMware virtual machines. If you configure multiple SCSI controllers and distribute more than 8 virtual disks across them, you may experience random OS boot failures when power-cycling VMs. Only the last 8 disks, those with the higher SCSI IDs, appear in the boot order settings of the BIOS. You cannot choose the disks with lower SCSI IDs.

You need to follow VMware KB “Changing the boot order of a virtual machine using vmx options (2011654)” to force virtual machines to boot from the proper SCSI node.
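The limitation and the fix can be sketched in Python. This is a simplified model of what I observed, assuming disks are ordered by their (controller, unit) SCSI ID; the helper names are hypothetical. The `bios.bootOrder` and `bios.hddOrder` entries are the vmx options I understand KB 2011654 to describe, so check the KB for the exact syntax before editing a .vmx file.

```python
from typing import List, Tuple

BIOS_BOOT_DISK_LIMIT = 8  # limit observed in this post, not an official constant

Disk = Tuple[int, int]  # (SCSI controller number, unit number), e.g. (0, 0) = scsi0:0

def bios_visible_disks(disks: List[Disk]) -> List[Disk]:
    """Simplified model: only the last 8 disks by SCSI ID show up in the BIOS boot list."""
    return sorted(disks)[-BIOS_BOOT_DISK_LIMIT:]

def boot_order_vmx(boot_disk: Disk) -> List[str]:
    """Hypothetical helper returning vmx lines that pin the boot disk (verify against KB 2011654)."""
    controller, unit = boot_disk
    return [
        'bios.bootOrder = "hdd"',
        f'bios.hddOrder = "scsi{controller}:{unit}"',
    ]
```

For example, with five disks on each of two controllers, `scsi0:0` and `scsi0:1` drop out of the BIOS list under this model, and adding the two lines returned by `boot_order_vmx((0, 0))` to the .vmx file should make the VM boot from `scsi0:0`.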

October 17, 2017
Memory Errors on Modern Servers
I used to see memory degradation on Cisco UCS blades, but less often on HPE blades, so I thought it might be a quality control problem in Cisco’s manufacturing. Today I read two articles on the Cisco website that explain why we see memory degradation and how it is handled. The articles are linked below.

Managing Correctable Memory Errors on Cisco UCS Servers

UCS Enhanced Memory Error Management

The conclusion in the whitepaper applies not only to Cisco UCS but to any modern server. Following is a summary of why memory error rates are rising nowadays:
- Larger memory systems contain more bits
- Higher-capacity DRAM chips require smaller bit cells, which store less charge per bit
- Lower operating voltages can lead to reduced noise margin
- Higher operating speeds can lead to reduced timing margin
July 27, 2017
Oracle Utilizes 50% of Physical Processors on HPE Server

The DBA team told me Oracle was running slowly on an HPE server. I observed that CPU utilization was about 50% of overall capacity, and whenever the Oracle database load increased, the system slowed down.

Digging further into the issue, I saw that the Oracle workload ran on only one physical processor, although the server has two. The Windows Server 2012 R2 resource manager showed that the system used processor groups, with the two physical processors placed in separate groups. This technology is described in a Microsoft MSDN article.

To fix the issue, change the value of “NUMA Group Size Optimization” to “Flat” in the BIOS. Please refer to the HPE article for detailed steps.

Details of the HPE server behavior are documented here. Please note that the article says it affects ProLiant Gen9 servers with Intel E5-26xx v3 processors, but it actually also affects Intel E5-26xx v4 processors and Synergy blades.
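Why only half of the CPU capacity was usable can be shown with a little arithmetic. The Python sketch below is a simplified model, not HPE’s or Microsoft’s actual algorithm: it assumes each socket is one NUMA node, that a Windows processor group holds at most 64 logical processors, and that a process which is not group-aware stays within a single group (the default behavior on Windows Server 2012 R2). The function names and the packing rule used for `"flat"` are my own assumptions for illustration.

```python
from typing import List

MAX_LPS_PER_GROUP = 64  # a Windows processor group holds at most 64 logical processors

def processor_groups(sockets: int, cores_per_socket: int, threads_per_core: int,
                     policy: str) -> List[int]:
    """Return the logical-processor count of each group under a simplified model.

    'clustered': every socket (NUMA node) gets its own group.
    'flat': whole NUMA nodes are packed into groups of up to 64 logical processors.
    """
    node_lps = cores_per_socket * threads_per_core
    if node_lps > MAX_LPS_PER_GROUP:
        raise ValueError("this model assumes one NUMA node fits in one group")
    if policy == "clustered":
        return [node_lps] * sockets
    if policy != "flat":
        raise ValueError(f"unknown policy: {policy}")
    groups: List[int] = []
    for _ in range(sockets):
        if groups and groups[-1] + node_lps <= MAX_LPS_PER_GROUP:
            groups[-1] += node_lps  # node fits into the current group
        else:
            groups.append(node_lps)  # start a new group
    return groups

def single_group_share(groups: List[int]) -> float:
    """Fraction of all logical processors usable by a process confined to the largest group."""
    return max(groups) / sum(groups)
```

Under this model, a two-socket server with 14-core processors and Hyper-Threading (56 logical processors) gets two groups of 28 with `"clustered"`, so a single-group process can use at most half the CPU, and one group of 56 with `"flat"`. The model also suggests that Flat only yields a single group when the server has 64 logical processors or fewer in total.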

July 10, 2017
“No host data available” Reported in Hardware Status Tab

I just noticed an issue where nothing is reported in the ‘Hardware Status’ tab of ESXi hosts in the vSphere Web Client. This feature can be used to monitor hardware failures. KB 2112847 gives a solution, but it did not work for me. I figured out a workaround: log in with the Administrator account and click the ‘Update’ button under ‘Monitor’ – ‘Hardware Status’ for each ESXi host. You will get the status after a few minutes.

January 19, 2017
Cisco UCS blade B200 M4 discovery pending on 58%

New B200 M4 blades can run Intel v4 processors. You may see a discovery issue if your UCSM firmware version is lower than 2.2.7c. I hit that problem a few days ago when I installed a new M4 blade: the FSM hung at 58% for a long time and eventually failed.
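Before inserting a new blade, the comparison against the 2.2.7c threshold can be scripted. This Python sketch assumes UCSM versions are written like `2.2(7c)` or `2.2.7c` (as in this post) and that patch letters sort alphabetically; the function names are hypothetical, and the sketch does not query UCSM itself.

```python
import re
from typing import Tuple

# Accepts "2.2.7c" (as written in this post) or the "2.2(7c)" form.
VERSION_RE = re.compile(r"^(\d+)\.(\d+)[.(](\d+)([a-z]?)\)?$")

def parse_ucsm_version(text: str) -> Tuple[int, int, int, str]:
    """Split a version string into (major, minor, maintenance, patch letter)."""
    match = VERSION_RE.match(text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    major, minor, maintenance, patch = match.groups()
    return int(major), int(minor), int(maintenance), patch

def needs_upgrade(current: str, minimum: str = "2.2.7c") -> bool:
    """True if the current version sorts below the minimum (tuple comparison)."""
    return parse_ucsm_version(current) < parse_ucsm_version(minimum)
```

For example, `needs_upgrade("2.2(6g)")` is true, while `needs_upgrade("2.2(7c)")` is false.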


July 25, 2016
LRDIMM or RDIMM on ESXi Hosts?

A colleague asked me a question about ESXi memory today. The hardware vendor offers two types of memory, LRDIMM and RDIMM. Which one should we choose?


July 5, 2016
