It’s not easy to describe this issue in a one-line title, so let me give some background first. I have two sets of VMs. Set 1 has VM A and VM B; Set 2 has VM C and VM D. Each VM has a vNIC configured with a private IP address. VM A and VM C also have a second vNIC configured with an L3 (routable) IP address. The private IP addresses in each set are identical. To avoid confusion, I implemented a vRouter VM for each set. Like VM A and VM C, the vRouter has two vNICs: one connected to the L3 network and the other connected to the private network. This keeps the private network traffic from leaving the set, so the two sets don’t disturb each other even though they use the same private IP addresses.
The following are the IP addresses I set for each VM:
VM A: 192.168.0.11
VM B: 192.168.0.12
VM C: 192.168.0.11
VM D: 192.168.0.12
The problem is that VM A still gets ping replies from 192.168.0.12 even when VM B is powered off. I expected the L2 traffic to go to its own vRouter and find that VM B is offline. But tracert shows the traffic going from VM A’s L3 network to the vRouter of the second set, with the reply coming back from VM D. It looks like the L2 ping packet is being broadcast onto the L3 network.
The issue was fixed by enabling a feature on the L3 network called "Enforce Subnet Check for IP Learning" (Cisco later renamed it to "Limit IP Learning To Subnet"). It’s a VLAN-level setting that prevents the private IP traffic from being broadcast onto the L3 network and forces it to stay on the L2 network.
I have a box that uses Emulex OneConnect OCe10102 network adapters. The adapter is quite old and this Emulex card is not supported on ESXi 6.0. After I upgraded the server to ESXi 6.0, the Emulex adapters were lost.
During the initial troubleshooting I noticed that the adapters were still visible in the BIOS, so it had to be a driver-level issue. I checked the VMware Compatibility Guide: the OCe10102 model is not supported by ESXi 6.0.
If you run the following command, you can still see the adapters in the PCI list on ESXi.
esxcli hardware pci list
This indicates the adapters are not visible to ESXi because the newer Emulex native driver in ESXi 6.0 no longer includes this adapter model.
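To see which Emulex driver packages are actually installed on the host, you can list the VIBs (filtering on "elx" is just my assumption about the package naming):
esxcli software vib list | grep -i elx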
Then I uninstalled the native Emulex driver for ESXi 6.0 with the following command and rebooted the ESXi host.
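The removal step generally looks like this sketch; "elxnet" is my assumption for the native Emulex NIC driver VIB name, so confirm it against the VIB list above before removing anything:
esxcli software vib remove -n elxnet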
The adapters were still not visible after the reboot, since there was no driver for the Emulex adapters at all. I then downloaded the Emulex driver for ESXi 5.5 from the VMware website, uploaded the "offline bundle" package inside the zip file to the /tmp directory of the host, and installed the driver with the following command:
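A sketch of the install step, where the offline bundle file name is a placeholder for the actual zip you uploaded to /tmp (reboot the host again afterwards):
esxcli software vib install -d /tmp/<offline-bundle>.zip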
There are several layers of networking in a virtualization infrastructure: guest operating system, virtual machine, ESXi driver, physical network adapters, RJ45/SFP, network switches, etc. Sometimes it’s hard to say exactly where a problem comes from, especially for hardware-layer problems. Today I worked on a very interesting case, and it may give you some ideas for troubleshooting network performance issues caused by the hardware layers.
A user told me he was bothered by the network performance of a virtual machine. Copying data to an NFS share was slow, but the response to the "ping" command looked good. I didn’t see any issue at the virtual machine layer: VMware Tools was up to date, the Windows OS was patched, the virtual network adapter type was VMXNET3, and the VM hardware version was also up to date.
When I tried to copy an image file to a shared folder on the virtual machine, I did see that the speed was sometimes fast and sometimes not. Since the host has two physical uplinks, this led me to guess it could be one of the uplinks.
After a lot of swapping and cable changing, we eventually figured out there was a bad SFP on the network switch end. I was able to observe the issue by using "psping.exe" from Microsoft Sysinternals. I used a command like the following to send ping packets of different sizes to the virtual machine; the network drops increased as I increased the packet size.
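A sketch of the test, where the target IP address is a placeholder; -l sets the payload size and -n the number of pings:
psping.exe -l 8k -n 100 <VM IP address>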
The size could be 1k, 2m, or even larger. I think this is a good way to identify a problem outside of ESXi, especially an SFP problem, since that kind of problem didn’t show any CRC or error counters at the network switch level.
You can also use the Windows native command "ping.exe" as follows. The size unit is bytes; for example, you need to enter 4096 if you want to send 4 KB.
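A sketch with a placeholder target; -l sets the send buffer size in bytes and -n the number of echo requests:
ping -l 4096 -n 100 <VM IP address>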
Just a quick post. When a virtual machine cannot get a DHCP IP address, the first thing you want to check is the firewall, whether it’s the Windows firewall or a physical firewall. You should make sure UDP ports 67 and 68 are not blocked; otherwise the virtual machine will only get a 169.x.x.x IP address.
These two ports are required for the DHCP client to obtain an IP address. The mechanism is described in the DHCP RFC (RFC 2131):
DHCP uses UDP as its transport protocol. DHCP messages from a client
to a server are sent to the ‘DHCP server’ port (67), and DHCP
messages from a server to a client are sent to the ‘DHCP client’ port
(68). A server with multiple network address (e.g., a multi-homed
host) MAY use any of its network addresses in outgoing DHCP messages.
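If the VM runs Windows and you suspect its local firewall, a rough sketch of rules that allow DHCP traffic looks like the following (rule names are placeholders, and Windows normally ships with built-in DHCP rules, so treat this as an illustration of the ports involved rather than a required step):
netsh advfirewall firewall add rule name="Allow DHCP In" dir=in action=allow protocol=UDP localport=68
netsh advfirewall firewall add rule name="Allow DHCP Out" dir=out action=allow protocol=UDP remoteport=67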
Slight network latency may cause application problems on sensitive virtual machines, even when the network response time is only 3 to 7 ms. There is a way to improve the stability of the response latency: enable RSS on the NIC.
Network traffic is handled by a single CPU core when RSS is disabled. Enabling it distributes the workload across 4 cores by default, and you can increase the number of CPUs used for RSS by changing the registry.
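As a sketch, the same thing can also be done from PowerShell instead of editing the registry directly (the adapter name "Ethernet0" and the processor count are placeholders for your environment):
# Enable RSS on the adapter
Enable-NetAdapterRss -Name "Ethernet0"
# Allow RSS to spread across more processors than the default
Set-NetAdapterRss -Name "Ethernet0" -MaxProcessors 8
# Verify the resulting RSS settings
Get-NetAdapterRss -Name "Ethernet0"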
To summarize the solution: go to Device Manager -> NIC properties -> Advanced -> find the RSS option and enable it. You will see 2 or 3 network drops while the change is applied.
If your company has implemented a firewall that blocks public NTP servers, you may see the installation of vRealize Operations Manager hang at ./install.sh on the console. That’s because the installer tries to negotiate with the NTP server at www.iana.org, and the firewall blocks the traffic.
VMware TAM Manager Shan told me there are two ways a firewall can block traffic: REJECT and DROP. REJECT means the firewall responds to the request and lets the source device know it was rejected. DROP means the firewall silently discards the request and sends nothing back to the source device. It looks like there is a bug in the vROps code: it hangs if the NTP request is dropped and never gets a response.
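The corporate firewall in this case is not iptables, but an iptables-style example makes the difference between the two behaviors easy to see (UDP 123 is the NTP port):
# REJECT: the firewall answers with an ICMP error, so the client fails immediately
iptables -A OUTPUT -p udp --dport 123 -j REJECT
# DROP: the packet is silently discarded, so the client keeps waiting and retrying
iptables -A OUTPUT -p udp --dport 123 -j DROP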
The workaround is to create a port group without physical uplinks and install vRealize Operations Manager there, then move it to the proper network after the installation is completed. You can configure the correct IP addresses when you import the OVF file, so later you simply need to move the network.
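If you prefer the command line, a rough sketch of creating such an isolated port group on a host looks like this (the vSwitch and port group names are placeholders; the key point is simply not to attach any physical uplinks to the vSwitch):
# Create a standard vSwitch with no physical uplinks attached
esxcli network vswitch standard add --vswitch-name=vSwitch-Isolated
# Add the installation port group on that uplink-less vSwitch
esxcli network vswitch standard portgroup add --portgroup-name=vROps-Install --vswitch-name=vSwitch-Isolated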
It’s been a while since the last technical post. I was pretty busy with preparing the holiday maintenance plan, as well as with a few problems in the virtual environment. There is one I’d like to share, as it’s a good example of how to ‘touch’ the hardware layer from the virtual layer. 🙂
When implementing enterprise applications like SAP, Oracle, or SQL on a UCS virtualization environment, the default settings of the UCS blades may not be suitable for the application. We always want the highest performance by optimizing the hardware and ESXi. In my UCS training session, I noticed one "hidden" parameter that may help performance.
Receive Side Scaling, or RSS, is a feature that allows you to utilize multiple CPUs, and multiple cores per CPU, to process the incoming network load. Without RSS, all of the received network traffic is processed by one CPU and by only one core of that CPU. Essentially, RSS distributes the receive network load across all of the CPUs and their cores.
The parameter is an option in the BIOS, but it’s not under the BIOS policy in UCS Manager. You should go to the Servers tab, expand the Policies node, and check an Eth Adapter Policy under the Adapter Policy node; Receive Side Scaling (RSS) is available in the Options section of the right-hand pane. The blade must be rebooted for the option to take effect.
Please keep in mind not to enable RSS if you have more adapters than CPUs, as it can cause unexpected network transmit failures. The RSS option must be enabled in the UCS policy before you enable it at the OS layer (I still need to confirm this with Cisco TAC — is that true?). Regarding the OS layer, please refer to these articles.
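To quickly check from a Windows guest whether RSS actually took effect after the policy change, something like the following can be used (PowerShell on Windows Server 2012 or later; adapter names depend on your system):
Get-NetAdapterRss
netsh int tcp show global
The first command lists the per-adapter RSS state and processor assignments; the second shows whether Receive-Side Scaling is enabled globally in the TCP stack.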