There are several layers of networking in a virtualization infrastructure: the guest operating system, the virtual machine, the ESXi driver, the physical network adapters, the RJ45/SFP connectors, the network switches, and so on. Sometimes it's hard to say exactly which layer is causing a problem, especially when the problem is in the hardware layers. Today I worked on a very interesting case that may give you some ideas for troubleshooting network performance issues caused by the hardware layers.
A user told me he was bothered by the network performance of a virtual machine. Copying data to an NFS share was slow, but the response to the “ping” command looked good. I didn't see any issue at the virtual machine layer: VMware Tools was up to date, the Windows OS was patched, the virtual network adapter type was VMXNET3, and the VM version was also up to date.
When I tried to copy an image file to a shared folder on the virtual machine, I saw that the speed was sometimes fast and sometimes not. Since the host has two physical uplinks, that led me to guess the problem was on one of the uplinks.
After a lot of swapping and cable changing, we eventually figured out there was a bad SFP on the network switch end. I was able to observe the issue by using “psping.exe” from Microsoft Sysinternals. I used the following command to send ping packets of different sizes to the virtual machine. The number of dropped packets increased as I increased the packet size.
psping.exe -l <packet size> <Destination>
Example: psping.exe -l 4k xxxx.contoso.com
The size can be 1k, 2m, or even larger. I think this is a good way to identify a problem outside of ESXi. It is especially useful for SFP problems, because in this case the bad SFP didn't produce any CRC errors or error counts at the network switch level.
You can also use the native Windows command “ping.exe” as follows. The size unit is bytes, so for example you need to enter 4096 if you want to send 4 KB.
ping.exe -l <size> <Destination>
Example: ping.exe -l 4096 xxx.contoso.com
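
If you want to repeat the test across several packet sizes instead of typing each command by hand, a small script can do the sweep. The following is only a rough Python sketch, not something I used in this case. It assumes an English-language Windows machine (the summary line "Lost = 3 (15% loss)" is parsed as text), and the function names, the list of sizes and the count of 20 pings per size are all my own illustrative choices.

```python
import re
import subprocess

# Matches the summary line of English-locale Windows ping output, e.g.
# "Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),"
# (assumption: other display languages use different wording).
LOSS_PATTERN = re.compile(r"Lost = \d+ \((\d+)% loss\)")


def build_ping_command(host, size_bytes, count=20):
    """Return the ping.exe argument list for one payload size in bytes."""
    return ["ping.exe", "-n", str(count), "-l", str(size_bytes), host]


def parse_loss_percent(ping_output):
    """Extract the loss percentage from ping.exe output, or None if absent."""
    match = LOSS_PATTERN.search(ping_output)
    return int(match.group(1)) if match else None


def sweep(host, sizes=(64, 512, 1024, 4096, 16384, 32768)):
    """Ping the host once per payload size and collect loss percentages."""
    results = {}
    for size in sizes:
        completed = subprocess.run(
            build_ping_command(host, size),
            capture_output=True,
            text=True,
            check=False,  # ping.exe exits non-zero when packets are lost
        )
        results[size] = parse_loss_percent(completed.stdout)
    return results
```

Calling sweep("xxx.contoso.com") returns a dictionary that maps each packet size to the loss percentage reported by ping.exe, or None when no summary line was found. If the loss grows as the size grows, that matches the pattern I saw with the bad SFP.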