MAC Address Conflict with ESXi vmkernel NIC on Cisco UCS Blades

Background

I worked on an interesting case a few months back. An ESXi blade could not be brought up because its management IP address did not respond to ping. We tried reconfiguring the IP address, re-acknowledging the blade, rebuilding the network, and even replacing the motherboard. No luck. Eventually we figured out that another ESXi host’s management network was somehow configured with the same MAC address, which caused a MAC address conflict on the network.

This guide shows some tips for troubleshooting MAC address conflicts at the ESXi and Cisco UCS levels.

Some References

The first article you should read is “vmk0 management network MAC address is not updated when NIC card is replaced or vmkernel has duplicate MAC address”. It helps you understand why the vmkernel MAC address is not updated. The solution in the KB is to change the MAC address manually on ESXi or to re-create the management network.

But in reality we usually don’t know where the conflict comes from. We only know that this Cisco UCS blade has ESXi installed and doesn’t respond to ping. So you may suspect a hardware issue, as I did.
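
If you can collect the management MAC address of every host (for example from the output of `esxcli network ip interface list` on each one), a short script can rule a duplicate in or out before you start swapping hardware. The following is only a minimal sketch: the host names and MAC addresses in `inventory` are hypothetical, and the helper names are mine, not part of any VMware or Cisco tool.

```python
from collections import defaultdict

HEX_DIGITS = "0123456789abcdef"


def normalize_mac(mac):
    """Reduce a MAC address to 12 lowercase hex digits so notations compare equal."""
    digits = "".join(ch for ch in mac.lower() if ch in HEX_DIGITS)
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def find_duplicate_macs(inventory):
    """Return {normalized_mac: [host, ...]} for every MAC used by more than one host."""
    hosts_by_mac = defaultdict(list)
    for host, mac in inventory:
        hosts_by_mac[normalize_mac(mac)].append(host)
    return {mac: hosts for mac, hosts in hosts_by_mac.items() if len(hosts) > 1}


# Hypothetical inventory: (host name, vmk0 MAC) pairs collected by hand.
inventory = [
    ("esx01", "00:25:b5:00:11:11"),
    ("esx02", "00:25:B5:00:11:12"),
    ("esx03", "0025.b500.1111"),
]

duplicates = find_duplicate_macs(inventory)
for mac, hosts in duplicates.items():
    print(f"{mac} is used by: {', '.join(hosts)}")
```

Normalizing first matters because ESXi and the Fabric Interconnect shell print the same address in different notations.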

Check MAC address conflicts on Cisco UCS

There are two ways to check for MAC address conflicts on Cisco UCS:

  • Log in to UCS Manager over SSH and check the MAC address status.
  • Export the UCS Manager logs and check for MAC address conflicts in the fwm_trace_log file.

# Log in to UCS Manager
# Run the following command to show the MAC address status.
show platform fwm info mac <mac address> <vlan id>

# Sample
show platform fwm info mac 0025.0050.1111 141
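
ESXi usually shows MAC addresses in colon-separated form, while the command above takes the Cisco dotted form (three groups of four hex digits). A small helper can do the conversion and build the command line. This is a sketch under that assumption; the function names and the sample address are made up for illustration.

```python
import re


def to_cisco_mac(mac):
    """Convert a MAC in any common notation to Cisco dotted form (xxxx.xxxx.xxxx)."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"expected 12 hex digits, got {mac!r}")
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


def fwm_command(mac, vlan_id):
    """Build the 'show platform fwm info mac' command for one MAC and VLAN."""
    return f"show platform fwm info mac {to_cisco_mac(mac)} {vlan_id}"


# Hypothetical address and VLAN, for illustration only.
print(fwm_command("00:25:B5:00:11:11", 141))
```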

Log in to the UCS Manager GUI to generate a support log.

Admin -> All -> Faults, Events and Audit Log -> TechSupport Files

Generate a UCSM log bundle, then download and extract it. There are two major files in the log bundle: UCSM_A_TechSupport.tar.gz and UCSM_B_TechSupport.tar.gz. Each file corresponds to its respective Fabric Interconnect.

A MAC address conflict usually shows up on only one Fabric Interconnect, so you may need to check both of them. I use the A side as an example. Go to the extracted folder -> UCSM_A_TechSupport -> sw_trace_logs -> fwm_trace_log.current

Search for the keyword “REGMAC seen on border port” in the log, then repeat the search in the log of the other FI. If you find entries with recent timestamps, they indicate that the MAC address conflicts with a device outside the UCS domain.
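
Searching both bundles by hand gets tedious, so the keyword search can be scripted. The sketch below assumes both TechSupport archives are already extracted somewhere under one folder. It matches any file whose name starts with fwm_trace_log and does not try to parse the log lines, since their exact layout is not something I rely on here.

```python
from pathlib import Path

KEYWORD = "REGMAC seen on border port"


def find_regmac_entries(bundle_dir):
    """Yield (file, line number, text) for each line containing KEYWORD."""
    for log_file in sorted(Path(bundle_dir).rglob("fwm_trace_log*")):
        if not log_file.is_file():
            continue
        # errors="replace" keeps the scan going if the log has odd bytes.
        with log_file.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if KEYWORD in line:
                    yield log_file, number, line.rstrip()


def report(bundle_dir):
    """Print every match so the timestamps can be checked by eye."""
    for log_file, number, line in find_regmac_entries(bundle_dir):
        print(f"{log_file}:{number}: {line}")
```

Call `report()` with the path of the extracted bundle folder. Matches from the A side and the B side are listed with their file paths, so you can tell which Fabric Interconnect saw the conflict.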

There may be other reasons for MAC address issues. I wrote about one in Error: No NIC found with MAC address…

Cannot Open Cisco UCS KVM Console By Java

When you launch the KVM console in Cisco UCS Manager, you may get the following error message:

Unable to launch the application

Error: you can not run this program because your system deployment.config file states that an enterprise configuration file is mandatory…

This is caused by Java. There are two things you can try to fix the KVM console:

  • Install Java in a directory without a space in its path. For example, install it in C:\java\jre7.
  • Delete the Sun folder in C:\Windows. But please make a backup of the folder first, since it may contain special configuration for your enterprise.

I have another post about a UCS KVM issue: Cisco UCS Blade Cannot Get IP Address for KVM

“default Keyring’s certificate is invalid” in Cisco UCS Manager

You may see the following error in Cisco UCS Manager:

default Keyring’s certificate is invalid

The reason is that the default KeyRing under Admin -> Key Management -> KeyRing default has expired. It’s not possible to delete or change this KeyRing in the GUI. You have to log in to Cisco UCS Manager over SSH and run the following commands (the strings after “#”):

lab-B# scope security
lab-B /security # scope keyring default
lab-B /security/keyring # set regenerate yes
lab-B /security/keyring* # commit-buffer
lab-B /security/keyring #

This disconnects the Cisco UCS Manager GUI on your client computer. Just refresh the page after 5 seconds. There is no impact to the blades.
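
To confirm that expiry really is the cause, compare the certificate’s end date with the current time. The sketch below assumes you already have the end date as text in the format OpenSSL prints (for example the notAfter line from `openssl x509 -noout -enddate`). The function name and the sample date are mine, for illustration only.

```python
import ssl
import time


def certificate_expired(not_after, now=None):
    """Return True if the certificate end date is at or before 'now'.

    'not_after' may be the bare date ("Jan  5 09:34:43 2018 GMT") or the
    whole "notAfter=..." line. 'now' is seconds since the epoch and
    defaults to the current time.
    """
    date_text = not_after.split("=", 1)[-1].strip()
    expires_at = ssl.cert_time_to_seconds(date_text)
    current = time.time() if now is None else now
    return current >= expires_at


# Hypothetical end date, for illustration only.
print(certificate_expired("notAfter=Jan  5 09:34:43 2018 GMT"))
```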

A Huge Amount of Warnings of “Image is Deleted” in Cisco UCS Manager

A few days ago, I deleted some older firmware packages in Cisco UCS Manager. Suddenly more than 100 warnings were generated. The messages look similar to the one below:

blade-controller image with vendor Cisco System Inc……is deleted

Cause: image-deleted

Clearly, they were triggered by the package deletion. But all of my service profiles and service profile templates were using existing firmware packages. The deleted packages were not used anywhere.

I also deleted the download tasks and cleaned up everything I could. The warnings still persisted. After reading a blog article, I figured out that they were caused by the default firmware policy.

If you are facing the same issue, go to Servers -> Policies -> Host Firmware Packages -> default -> click Modify Package Versions -> change it to an available version.

Cisco UCS Blade Cannot Get IP Address for KVM

You may see “The IP address to reach the server is not set” when clicking the KVM console in Cisco UCS Manager. The issue persists even when Cisco UCS Manager has enough IP addresses for management. Re-acknowledging the server or resetting the CIMC does not fix the problem.

The fix is to go to “Equipment” -> select the server -> “General” tab -> “Server Maintenance” -> “Decommission” the server.

Wait for the decommission to complete, then re-acknowledge the server. An IP address will be assigned to the server after the acknowledgement process completes.

UCS Manager UI Fonts Size on 4K Screen

Older versions of UCS Manager use a Java application. The UI fonts can be extremely small on a high-DPI screen. The fix is:

  1. Go to “C:\Program Files (x86)\Java\jre1.8.0_171\bin”.
  2. Go to “Properties” of “jp2launcher.exe”.
  3. “Compatibility” tab -> “Change high DPI settings”.
  4. Check “Override high DPI scaling behavior….”.
  5. Select “System (Enhanced)” or “System”.

Memory Errors on Modern Servers

I used to see memory degradation on Cisco UCS blades often, but rarely on HPE blades. I thought it might be a quality control problem in Cisco’s manufacturing. Today I read two articles on the Cisco website that explain why we see memory degradation and how it works. I have listed the articles below.

Managing Correctable Memory Errors on Cisco UCS Servers

UCS Enhanced Memory Error Management

The conclusion in the whitepaper is not specific to Cisco UCS; it applies to any modern server. The following is a summary of why memory error rates are rising nowadays.

  • Larger memory systems contain more bits
  • Higher capacity DRAM chips require smaller bit cells which result in fewer stored charges per bit
  • Lower operating voltages can lead to reduced noise margin
  • Higher operating speeds can lead to reduced timing margin

vRealize Operations Management Pack for Cisco UCS Review

The Cisco UCS blade system is the best blade system I have used so far. Everything about it is excellent, whether hardware, software, or support. I recommend it as the primary system for virtualization. The UCS blade system architecture is different from HP’s; it feels more like a network system. Fabric Interconnect (FI) modules exchange data between uplinks and internal components, and the IOMs in each chassis control data routing. The architecture is complicated, but it is powerful for managing a large datacenter.

In a large datacenter you may have hundreds of chassis or blades. Data goes through FIs, IOMs, and blades, so you can see issues at any layer, and it is hard to find out exactly where the problem is. UCS Manager provides port statistics just as Cisco does on its network switches. You can show the statistics of a particular port, but they don’t tell you when or at which layer a problem happened.

I tested the Cisco UCS adapter for vRealize Operations Manager before I reviewed the NetApp adapter for vRealize Operations Manager. Both are developed by the same company, Blue Medora. I’d like to introduce a few features of this product; this is just my personal review.
