There are several layers of networking on the virtualization infrastructure. Guest operating system, Virtual Machine, ESXi driver, physical network adapters, RJ45/SFP and network switches…etc. Sometimes it’s hard to say where exactly caused a problem. Especially hardware layer problems. Today I worked on a very interesting case, it may give some ideas to troubleshooting network performance issue which is caused by hardware layers.
A user told me he was bothered by network performance of a virtual machine. It’s slow to copy data to NFS share. But responding to “ping” command looked good. I didn’t see any issue on virtual machine layer. VMware Tools was up to date, Windows OS was patched, virtual network adapter type was VMXNET3 and VM version was also up to date.
When I tried to copy an image file to share folder of the virtual machine, I did see sometimes speed was fast, but sometimes not. Since I have two physical uplinks, it led me to guess it could be one of the uplinks.
After a lot of swapping and cable changing, we eventually figured out there was a bad SFP on network switch end. I was able to observe the issue by using “psping.exe” of Microsoft Sysinternals. I used the following command to send the different size of ping package to the virtual machine. Network drops were increasing when I increased package size.
psping.exe -l <size of package> <Destination> Example: psping.exe -l 4k xxxx.contoso.com
The size could be 1k, 2m or even larger. I think this is a good way to identify problem outside of ESXi. Especially SFP problem as such kind of problem didn’t give any CRC or error count on network switch level.
You can also use Windows native command “ping.exe” as following. The size unit is “bytes”. For example, you need to input 4096 if you want to send 4kb.
ping.exe -l <size> <Destination> Example: ping.exe -l 4096 xxx.contoso.com