MAC Address Conflict with ESXi vmkernel NIC on Cisco UCS Blades

Background

I worked on an interesting case a few months back. An ESXi blade could not be brought up because its management IP address did not respond to ping. We tried reconfiguring the IP address, re-acknowledging the blade, rebuilding the network, and even replacing the motherboard. No luck. Eventually we figured out that another ESXi host’s management network had somehow been configured with the same MAC address, which caused a MAC address conflict on the network.

This guide shares some tips on troubleshooting MAC address conflicts at the ESXi and Cisco UCS levels.

Some References

The first article you should read is “vmk0 management network MAC address is not updated when NIC card is replaced or vmkernel has duplicate MAC address”. It explains why the vmkernel MAC address is not updated. The solution in the KB is to change the MAC address manually on ESXi, or to re-create the management network.

But in reality we usually don’t know where the conflict comes from. We only know that this Cisco UCS blade has ESXi installed and doesn’t respond to ping, so, like me, you may suspect a hardware issue.
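When several hosts are suspects, one way to narrow it down is to collect the vmk0 MAC address of every ESXi host and look for duplicates. The Python sketch below assumes you have already gathered a host-to-MAC mapping yourself (for example from the MAC Address column of `esxcli network ip interface list` on each host); the host names and MACs are hypothetical. It normalizes the colon-separated ESXi format and Cisco’s dotted format to the same form before comparing.

```python
import re
from collections import defaultdict
from typing import Dict, List


def normalize_mac(mac: str) -> str:
    """Return a MAC as 12 lowercase hex digits, accepting colon, dash or dotted forms."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def find_duplicate_macs(inventory: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each MAC that appears on more than one host to the sorted list of those hosts."""
    owners = defaultdict(list)
    for host, mac in inventory.items():
        owners[normalize_mac(mac)].append(host)
    return {mac: sorted(hosts) for mac, hosts in owners.items() if len(hosts) > 1}


if __name__ == "__main__":
    # Hypothetical inventory: host name -> vmk0 MAC address.
    vmk0_macs = {
        "esxi-01": "00:25:b5:00:11:11",
        "esxi-02": "00:25:b5:00:11:12",
        "esxi-03": "0025.b500.1111",
    }
    for mac, hosts in find_duplicate_macs(vmk0_macs).items():
        print(f"{mac} is used by: {', '.join(hosts)}")
```

With the sample data this reports that 0025b5001111 is used by esxi-01 and esxi-03, even though the two hosts display the address in different formats.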

Check MAC address conflicts on Cisco UCS

There are two ways to check for MAC address conflicts on Cisco UCS:

Log in to UCS Manager by SSH and check the MAC address status.
Export the UCS Manager logs and check for MAC address conflicts in the fwm_trace_log file.

# Log in to UCS Manager.
# Run the following command to show the MAC address status.
show platform fwm info mac <mac address> <vlan id>

# Sample
show platform fwm info mac 0025.0050.1111 141
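The sample command uses Cisco’s dotted MAC notation (xxxx.xxxx.xxxx), while ESXi shows MAC addresses colon-separated. The Python sketch below converts between the two and builds the command line; the helper names are hypothetical, and the 1 to 4094 VLAN range check is simply the usable 802.1Q range.

```python
import re


def to_cisco_dotted(mac: str) -> str:
    """Convert a MAC such as 00:25:b5:00:11:11 into dotted form 0025.b500.1111."""
    digits = re.sub(r"[:\-.]", "", mac).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    # Three groups of four hex digits joined by dots.
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


def fwm_command(mac: str, vlan: int) -> str:
    """Build the 'show platform fwm info mac' command line for one MAC and VLAN."""
    if not 1 <= vlan <= 4094:
        raise ValueError(f"VLAN ID out of range: {vlan}")
    return f"show platform fwm info mac {to_cisco_dotted(mac)} {vlan}"


print(fwm_command("00:25:b5:00:11:11", 141))
# show platform fwm info mac 0025.b500.1111 141
```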

In UCS Manager, go to Admin -> All -> Faults, Events and Audit Log -> Tech Support Files.

Generate a UCSM log bundle, then download and extract it. The bundle contains two major files, UCSM_A_TechSupport.tar.gz and UCSM_B_TechSupport.tar.gz, one for each Fabric Interconnect.

A MAC address conflict usually shows up on only one Fabric Interconnect, so you may need to check both. I use the A side as an example: go to the extracted folder -> UCSM_A_TechSupport -> sw_trace_logs -> fwm_trace_log.current

Search for the keyword “REGMAC seen on border port” in the log, then repeat the search in the log of the other FI. If you find entries with recent timestamps, the MAC address conflicts with a device outside the UCS domain.
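Searching both FIs by hand gets tedious, so here is a minimal Python sketch of the same search. It assumes both inner archives have already been extracted under one root folder as UCSM_A_TechSupport and UCSM_B_TechSupport, and that the B side uses the same sw_trace_logs/fwm_trace_log.current path as the A side; the optional MAC filter is a plain substring match, so pass the MAC in the format the log uses.

```python
from pathlib import Path
from typing import Optional

KEYWORD = "REGMAC seen on border port"
FI_FOLDERS = ("UCSM_A_TechSupport", "UCSM_B_TechSupport")


def find_border_port_macs(extract_root: str, mac: Optional[str] = None) -> dict:
    """Return {FI folder: [matching log lines]} for each FI whose fwm_trace_log.current exists.

    If mac is given, only lines that also contain that string (case-insensitive) are kept.
    """
    hits = {}
    for fi in FI_FOLDERS:
        log = Path(extract_root) / fi / "sw_trace_logs" / "fwm_trace_log.current"
        if not log.is_file():
            continue  # bundle for this FI not extracted, or a different folder layout
        with log.open(encoding="utf-8", errors="replace") as fh:
            hits[fi] = [
                line.rstrip("\n")
                for line in fh
                if KEYWORD in line and (mac is None or mac.lower() in line.lower())
            ]
    return hits


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for fi, lines in find_border_port_macs(root).items():
        print(f"{fi}: {len(lines)} matching line(s)")
        for line in lines:
            print("  " + line)
```

Check the timestamps of any matches yourself; the script only collects the lines, it does not decide whether they are recent.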

There may be other causes of MAC address issues; I wrote about some of them in Error: No NIC found with MAC address…

Cisco UCS Error: Applying moref properties. Remote-Invocation-Error.

Cisco UCS Manager shows the following error with code “F1001201”.

Applying moref properties(FSM:sam:dme:MorefImportRootApplyMoRefs). Remote-Invocation-Error: FSM Retries Exhausted.

The cause is that Cisco UCS Manager could not import some configurations when the UCS was set up.

Solution:

Log in to Cisco UCS Manager by SSH.
Run the following commands:
- # scope system
To see which imports are pending, run the following commands (otherwise skip this step):
- # show pending-import
- # show pending-import fsm status expand detail
Delete pending imports:
- # delete pending-import
- # commit-buffer
The alerts should now disappear from the Cisco UCS Manager GUI.

This procedure is an online operation; no downtime is required.

Cannot Open Cisco UCS KVM Console By Java

When you launch the KVM console in Cisco UCS Manager, you may get the following error message:

Unable to launch the application
Error: you can not run this program because your system deployment.config file states that an enterprise configuration file is mandatory…

This is caused by Java. There are two things you can try to fix the KVM console:

Install Java in a directory whose path contains no spaces, for example C:\java\jre7.
Delete the Sun folder in C:\Windows. Back up the folder first, since it may contain special configuration for your enterprise.
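Before deleting the Sun folder, it can help to confirm that a mandatory enterprise configuration is really what Java is complaining about. Java reads a system-level deployment.config (on Windows, normally C:\Windows\Sun\Java\Deployment\deployment.config), and the key deployment.system.config.mandatory=true is what makes a missing enterprise file fatal. The Python sketch below is a minimal reader under those assumptions: it only handles simple key=value lines and does not unescape Java properties syntax such as backslash-escaped colons.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple

# Usual system-level location on Windows (assumes the default Windows directory).
DEFAULT_PATH = Path(r"C:\Windows\Sun\Java\Deployment\deployment.config")


def read_properties(path: Path) -> Dict[str, str]:
    """Minimal key=value reader: skips blank lines and #/! comments, no escape handling."""
    props = {}
    for raw in path.read_text(encoding="utf-8", errors="replace").splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def mandatory_enterprise_config(path: Path = DEFAULT_PATH) -> Tuple[bool, Optional[str]]:
    """Return (is_mandatory, enterprise_config_url); (False, None) if the file is absent."""
    if not path.is_file():
        return False, None
    props = read_properties(path)
    mandatory = props.get("deployment.system.config.mandatory", "false").lower() == "true"
    return mandatory, props.get("deployment.system.config")


if __name__ == "__main__":
    mandatory, url = mandatory_enterprise_config()
    if mandatory:
        print(f"Enterprise config is mandatory; Java expects it at: {url}")
    else:
        print("No mandatory enterprise configuration found.")
```

If the flag is true and the referenced file is unreachable from your workstation, that matches the error above.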

I have another blog post about a UCS KVM issue: Cisco UCS Blade Cannot Get IP Address for KVM

Private IP Address Routes to L3 Subnet on Dual vNIC VM

It’s not easy to describe this issue in a one-line title, so let me give some background. I have 2 sets of VMs: set 1 has VM A and VM B, and set 2 has VM C and VM D. Each VM has a vNIC configured with a private IP address. VM A and VM C also have a second vNIC configured with an L3 (routable) IP address. Both sets use the same private IP addresses, so to avoid confusion I implemented a vRouter VM for each set. The vRouter is set up the same way as VM A or VM C, with two vNICs: one connected to the L3 network and the other connected to the private network. This keeps private network traffic inside its own set, so the two sets don’t disturb each other even though they use the same private IP addresses.

Here are the private IP addresses I set for each VM:

VM A: 192.168.0.11
VM B: 192.168.0.12
VM C: 192.168.0.11
VM D: 192.168.0.12

The problem is that when I turn off VM B, pinging 192.168.0.12 from VM A still gets a response. I expected the traffic to go to set 1’s own vRouter and find that VM B is offline. But the tracert command shows the traffic going from VM A’s L3 network to the vRouter of the second set, and the reply comes from VM D. It looks like the L2 ping packets are being broadcast on the L3 network.

The issue was fixed by enabling a feature on the L3 network called “Enforce Subnet Check for IP Learning”, which Cisco later renamed “Limit IP Learning To Subnet”. It’s a VLAN-level setting. It prevents the private IP traffic from being broadcast on the L3 network and forces it to stay on the L2 network.
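The idea behind the subnet check can be illustrated with a small Python sketch: an endpoint IP is only learned on a network if it falls inside one of that network’s configured subnets, so a private 192.168.0.x source seen on the routable network is ignored. This is a conceptual model, not Cisco’s implementation, and the 10.10.10.0/24 L3 subnet is hypothetical.

```python
import ipaddress
from typing import List


def should_learn(source_ip: str, subnets: List[str]) -> bool:
    """Learn an endpoint IP only if it falls inside one of the configured subnets."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(net) for net in subnets)


# Hypothetical subnet configured on the routable (L3) network.
l3_subnets = ["10.10.10.0/24"]

for src in ["10.10.10.21", "192.168.0.12"]:
    verdict = "learn" if should_learn(src, l3_subnets) else "ignore"
    print(f"{src}: {verdict}")
# 10.10.10.21: learn
# 192.168.0.12: ignore
```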

Vlan ‘xxx’ resolved to unsupported VLAN ID in Cisco UCS Manager

You may want only 1 IP address per blade console in Cisco UCS Manager; you can follow Understanding “Management IP” of Cisco UCS Manager to configure it. If you are cleaning up the existing management IP pools, you may see the warning “Vlan ‘xxx’ resolved to unsupported VLAN ID” when you delete the existing inbound and outbound IP pools.

That’s because the inbound IP address of the blade has not been cleaned up. Go to “Equipment” -> “Chassis” -> Target chassis -> “Servers” -> Target server -> “Inventory” tab -> “CIMC” tab -> Click “Change Inbound Management IP” -> Remove the existing VLAN and IP pool.

Once it’s saved, the inband IP tab is blank. Note that the IP address is reassigned after 1 minute if you click “Delete Inband Configuration” instead of “Change Inbound Management IP”.

Understanding “Management IP” of Cisco UCS Manager

KVM IP addressing in Cisco UCS Manager works differently from HPE servers. UCS Manager may assign multiple IP addresses to the same blade if you don’t configure it properly; in my case, each blade got 3 IP addresses!

There are actually 3 types of IP address for KVM (the Cisco manual says 2):

Outbound Management IPs.
Inbound Management IPs for Blades.
Inbound Management IPs for Service Profiles.

“Outbound Management IP” is the default for KVM. Every newly deployed blade tries to get a DHCP IP address over the Cisco UCS Fabric Interconnect management port in the same VLAN.

The confusion lies in the 2nd and 3rd types. “Inbound Management IPs for Blades” is from the “hardware” perspective. “Inbound Management IPs for Service Profiles” is from the “logical” perspective.

If you go to Equipment -> Chassis -> blade and click KVM to open the console, you get the console over either the Outbound Management IP or the Inbound Management IP for Blades.

If you go to Servers -> Service Profiles and click KVM for a service profile, you get the console over either the Outbound Management IP or the Inbound Management IP for Service Profiles.

If you want to configure just 1 IP per blade, whether the console is opened from the hardware or from the service profile, do the following:

Delete the address range of the default ext-mgmt pool under IP Pools in the LAN node of Cisco UCS Manager.
Create a new inbound IP pool and a VLAN group without an uplink.
Choose the VLAN and inbound IP pool in LAN Cloud -> Global Policies -> Inbound Profile.
Assign the VLAN and inbound IP pool to templates or service profiles.
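Because a misconfigured domain can hand out up to 3 management IPs per blade, it is worth checking that the new inbound pool is large enough. The Python sketch below does the arithmetic for an inclusive From/To address block; the addresses and the blade count (2 chassis of 8 blades) are hypothetical.

```python
import ipaddress


def pool_size(first: str, last: str) -> int:
    """Number of addresses in an inclusive From/To IPv4 block."""
    start, end = ipaddress.IPv4Address(first), ipaddress.IPv4Address(last)
    if end < start:
        raise ValueError("last address is lower than first address")
    return int(end) - int(start) + 1


def addresses_needed(blades: int, ips_per_blade: int) -> int:
    """1 IP per blade with a single inbound pool; up to 3 if all three types stay enabled."""
    return blades * ips_per_blade


size = pool_size("10.0.50.10", "10.0.50.49")  # hypothetical block of 40 addresses
blades = 16                                    # hypothetical: 2 chassis x 8 blades
for per_blade in (1, 3):
    need = addresses_needed(blades, per_blade)
    print(f"{per_blade} IP(s) per blade: need {need}, pool has {size}, enough: {need <= size}")
```

Here 1 IP per blade needs 16 addresses and fits, while 3 per blade needs 48 and would exhaust the 40-address block.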

Refer to the Setting the Management IP Address section of the Cisco UCS Manager manual for details.

BTW, you may run into Vlan ‘xxx’ resolved to unsupported VLAN ID in Cisco UCS Manager when you clean up the existing IP pools and create a new inbound pool.

“x/xx on FI-A is connected by a unknown server device” on Cisco UCS

After upgrading the infrastructure firmware to 3.2.x, you may see the following error in the ‘info’ category of error messages in Cisco UCS Manager.

“x/xx on FI-A is connected by a unknown server device”

This is a bug documented in CSCvk76095. You have to reset the port on the FI to fix it.

Go to “Equipment” in Cisco UCS Manager.
Go to “Fabric Interconnects” -> Go to the corresponding FI.
Right-click the port x/xx -> Choose “Disable”.
You will see multiple major faults. Wait for 5 seconds.
Right-click the port x/xx -> Choose “Enable”.
All warnings disappeared after 5 minutes in my case. You may still see the warning in the GUI due to caching; log in again and check.

This change affects one link between the IOM and the FI port. You need downtime if the IOM has only a single path. I did not see any impact on the ESXi blades in the pod.

Show CDP Neighbor of Cisco UCS Uplinks

There are two ways to find out which network switch ports the network uplinks of the Cisco UCS Fabric Interconnects are connected to.

By CLI

SSH to the Cisco UCS Manager.
Connect to FI-A.

# connect nxos a

Show the CDP neighbors of the network uplinks.

# show cdp neighbor interface ethernet <port num>

By PowerShell

Make sure Cisco PowerTool (for UCS Manager) is installed.
Enable the Information Policy via the UCSM GUI.
- Go to “Equipment” -> “Policies” tab -> “Global Policies” tab -> “Info Policy” area.
- Change it to “Enabled”. (No impact on running blades)
Open a PowerShell window.
Connect to the UCS Manager.

# Connect-Ucs <UCS FQDN>

Show CDP neighbor details.

# Get-UcsNetworkLanNeighborEntry

Side notes

The following command shows the network switch names, network switch ports, and FI ports:

# Get-UcsNetworkLanNeighborEntry | Select deviceid,remoteinterface,localinterface
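For a larger domain it can be easier to post-process the neighbor list outside PowerShell. Assuming you append `| Export-Csv neighbors.csv -NoTypeInformation` to the command above, the Python sketch below groups FI ports by neighbor switch. Column names are matched case-insensitively because the exact header casing depends on how the properties are selected, and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple


def ports_by_switch(csv_text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Group (FI port, switch port) pairs by neighbor device ID.

    Header names are compared case-insensitively (deviceid, localinterface, remoteinterface).
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Skip surplus unnamed fields (key None) and normalize header case.
        cols = {k.strip().lower(): (v or "").strip() for k, v in row.items() if k is not None}
        grouped[cols["deviceid"]].append((cols["localinterface"], cols["remoteinterface"]))
    return {switch: sorted(pairs) for switch, pairs in grouped.items()}


sample = "\n".join([
    '"DeviceId","RemoteInterface","LocalInterface"',
    '"core-sw1","Ethernet1/10","1/31"',
    '"core-sw2","Ethernet1/10","1/32"',
    '"core-sw1","Ethernet1/11","1/33"',
])

for switch, pairs in ports_by_switch(sample).items():
    print(switch, pairs)
```

For a real export, pass the file contents instead, for example `Path("neighbors.csv").read_text(encoding="utf-8-sig")`, which also tolerates a byte-order mark.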

If you prefer to enable the “Info Policy” with PowerShell, run the following command:

# Get-UcsTopInfoPolicy | Set-UcsTopInfoPolicy -State enabled -Force

“default Keyring’s certificate is invalid” in Cisco UCS Manager

You may see the following error in Cisco UCS Manager:

default Keyring’s certificate is invalid

The cause is that the certificate of the default KeyRing (Admin -> Key Management -> KeyRing default) has expired. The KeyRing cannot be deleted or changed in the GUI, so you have to log in to Cisco UCS Manager by SSH and run the following commands (the strings after “#”):

lab-B# scope security
lab-B /security # scope keyring default
lab-B /security/keyring # set regenerate yes
lab-B /security/keyring* # commit-buffer
lab-B /security/keyring #

This disconnects the Cisco UCS Manager GUI on your client computer; just refresh the page after 5 seconds. There is no impact on the blades.
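To confirm that expiry is the cause, or to check the regenerated certificate afterwards, you can compare the certificate’s notAfter time with the current time. One way to get that value, assuming OpenSSL is available on your workstation, is `openssl s_client -connect <UCSM FQDN>:443 </dev/null | openssl x509 -noout -enddate`, which prints a line such as notAfter=Jan  5 09:34:43 2018 GMT. The Python sketch below converts that string with the standard library’s ssl.cert_time_to_seconds; the helper name is hypothetical.

```python
import ssl
import time
from typing import Optional


def days_left(not_after: str, now: Optional[float] = None) -> float:
    """Days until a certificate's notAfter time; a negative result means it has expired.

    Accepts either 'Jan  5 09:34:43 2018 GMT' or the 'notAfter=...' line printed by openssl.
    """
    if not_after.startswith("notAfter="):
        not_after = not_after[len("notAfter="):]
    expires = ssl.cert_time_to_seconds(not_after.strip())
    current = time.time() if now is None else now
    return (expires - current) / 86400


if __name__ == "__main__":
    remaining = days_left("notAfter=Jan  5 09:34:43 2018 GMT")
    print("expired" if remaining < 0 else f"{remaining:.0f} days left")
```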

A Huge Amount of Warnings of “Image is Deleted” in Cisco UCS Manager

A few days ago, I deleted some older firmware packages in Cisco UCS Manager. Suddenly, more than 100 warnings were generated, with error messages similar to the following:

blade-controller image with vendor Cisco System Inc……is deleted

Cause: image-deleted

Clearly, they were triggered by the package deletion. But all of my service profiles and service profile templates were using existing firmware packages; the deleted packages were not used anywhere.

I also deleted the download tasks and cleaned up everything I could, but the warnings persisted. After reading a blog article, I figured out that they were caused by the default firmware policy.

If you are facing the same issue, go to Servers -> Policies -> Host Firmware Packages -> default -> click Modify Package Versions -> change it to an available version.
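The diagnosis comes down to a reference check: a policy (here the default host firmware package) still points at a version that no longer exists. The Python sketch below expresses that check over hypothetical data; the policy names and version strings are made up, and in practice you would read them from the UCS Manager GUI.

```python
from typing import Dict, Set


def dangling_references(policies: Dict[str, str], available: Set[str]) -> Dict[str, str]:
    """Return the policies whose referenced firmware version is no longer available."""
    return {name: version for name, version in policies.items() if version not in available}


# Hypothetical data: host firmware package name -> blade package version it references.
policies = {"default": "3.1(3a)B", "esx-prod": "4.0(4h)B"}
available = {"4.0(4h)B", "4.1(3b)B"}

print(dangling_references(policies, available))  # {'default': '3.1(3a)B'}
```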