Slow network performance on a virtual machine

It’s been a while since my last technical post. I was pretty busy preparing the holiday maintenance plan and dealing with a few problems in the virtual environment. There is one I’d like to share, as it’s an example of how to ‘touch’ the hardware layer from the virtual layer. 🙂

A few weeks ago, users complained that one virtual machine was copying files to a NAS box very slowly. I was involved in the troubleshooting session since they are VIPs. The problem was that file copying from the virtual machine to a NAS box (I call it the bad NAS box below) reached only 200 KB/s. For this kind of case, I usually dig in by dividing and conquering.

Initially, I suspected the virtual layer: CPU or memory constraints. As you may know, system performance degrades if the virtual machine’s memory is being swapped or the vCPU spends a long time waiting. Then I checked storage latency. Disk read/write performance degrades if the LUN behind the virtual disk has high latency. But in my case it all looked good. Everything ran smoothly at the guest OS layer and the ESXi layer.

Then I captured network packets on the problem virtual machine. I saw a lot of packets similar to the lines below. The Dup Ack, ReTransmit and Request Fast-Retransmit entries indicated that TCP segments were not making it between source and destination. They were being dropped somewhere along the path, and the sender had to send them again. That was the reason why the copy speed was slow.

TCP:[Dup Ack #162]Flags=…A…., SrcPort=Microsoft-DS(445), DstPort=58860, PayloadLen=0, Seq=835560569, Ack=3098154067, Win=17520

TCP:[ReTransmit #161]Flags=…A…., SrcPort=58860, DstPort=Microsoft-DS(445), PayloadLen=1460, Seq=3098154067 – 3098155527, Ack=835560569, Win=255

TCP:[Request Fast-Retransmit from Seq3098168667]Flags=…A…., SrcPort=Microsoft-DS(445), DstPort=58860, PayloadLen=0, Seq=835560569, Ack=3098168667, Win=17520
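To get a feel for how common these entries are in a capture, the tagged lines can be counted. The snippet below is only a minimal sketch: it assumes the capture summaries have been exported as plain text, one line per packet, in the same format as the lines above. The function name `count_loss_indicators` and the shortened sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the bracketed tag in front of a TCP summary line,
# as seen in the capture lines quoted in this post.
TAG_RE = re.compile(r"TCP:\[(Dup Ack|ReTransmit|Request Fast-Retransmit)\b")


def count_loss_indicators(lines):
    """Count how many summary lines carry each loss-related tag."""
    counts = Counter()
    for line in lines:
        match = TAG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Shortened sample lines (hypothetical export); the last one is a normal segment.
sample = [
    "TCP:[Dup Ack #162]Flags=...A...., SrcPort=Microsoft-DS(445), DstPort=58860",
    "TCP:[ReTransmit #161]Flags=...A...., SrcPort=58860, DstPort=Microsoft-DS(445)",
    "TCP:[Request Fast-Retransmit from Seq3098168667]Flags=...A...., SrcPort=Microsoft-DS(445)",
    "TCP:Flags=...A...., SrcPort=58860, DstPort=Microsoft-DS(445), PayloadLen=1460",
]

counts = count_loss_indicators(sample)
for tag, number in counts.items():
    print(f"{tag}: {number}")
```

A handful of these entries in a long capture is normal. A steady stream of them during a simple file copy is what points to loss on the path.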

The ESXi host is an HP BL460 G7 blade. The issue occurred at the networking layer, so the next thing to check was the blade system layer. My understanding of the HP Blade System is that it’s virtualization at the hardware layer. Multiple blades share the same uplinks through HP Virtual Connect modules (VC). In our case, 16 blades share 2 x 10Gbps uplinks, with one uplink set on each VC. The physical path looks like this:

Network Switch A ↔ Uplink A ↔ VC A ↔ All Blades
Network Switch B ↔ Uplink B ↔ VC B ↔ All Blades

It’s rare to see a problem between a VC and a blade. The statistics on the VC are a great tool for checking network traffic quality. There are two sets of statistics: Uplink and Server Port. Uplink statistics show the quality of the link between the uplinks and the VC, and Server Port statistics show the same between the VC and the blades. I checked a few counters, such as Discards, Errors and PauseFrame. Still no luck. This indicated that the physical path from the virtual machine to the outside of the blade system was good.
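Counters like these are easiest to read as the difference between two snapshots taken a few minutes apart, since an old non-zero value may come from a past event. Here is a rough sketch of that comparison. The port names, numbers and dictionary layout are invented for illustration; this is not the format Virtual Connect actually exports, only the counter names come from the check described above.

```python
WATCHED = ("Discards", "Errors", "PauseFrame")


def growing_counters(before, after, watched=WATCHED):
    """Return (port, counter, delta) for every watched counter that grew."""
    findings = []
    for port, counters in after.items():
        for name in watched:
            delta = counters.get(name, 0) - before.get(port, {}).get(name, 0)
            if delta > 0:
                findings.append((port, name, delta))
    return findings


# Two hypothetical snapshots, typed in by hand from the statistics pages.
snapshot_1 = {
    "Uplink A": {"Discards": 12, "Errors": 0, "PauseFrame": 0},
    "Server Port 1": {"Discards": 0, "Errors": 0, "PauseFrame": 0},
}
snapshot_2 = {
    "Uplink A": {"Discards": 12, "Errors": 0, "PauseFrame": 0},
    "Server Port 1": {"Discards": 0, "Errors": 0, "PauseFrame": 0},
}

findings = growing_counters(snapshot_1, snapshot_2)
if findings:
    for port, name, delta in findings:
        print(f"{port}: {name} grew by {delta}")
else:
    print("No watched counter grew between the snapshots.")
```

In this made-up example nothing grows between the two snapshots, which is the same picture as in the real case: the counters gave no hint of a problem inside the blade system.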

The team checked the network switches and didn’t find any errors either. Then we tried to copy the same file to another NAS box. It was super fast (50 MB/s)! We were totally confused, because other virtual machines also had fast copy speeds to the trouble NAS box. It looked like the whole troubleshooting did not make sense! (At that moment, I felt that a virtualized environment is more complicated than a physical one. The whole market is moving toward ‘The Matrix’, and eventually we will never find exactly where the problem is.)

My colleague said, ‘Let’s replace the cable and SFP if we can’t find anything.’ Yes, giving it a try is better than waiting. Boom! Suddenly the issue was gone! It looked like a bad SFP and cable. To make sure this was a firm root cause, we did some testing on the virtual machine. The issue came back when the team forced uplink B down on network switch B. I realized it was something related to the path. When we replaced the SFP and cables, we only did it on the A side, since the blades’ traffic was going through the A side only. During the replacement, all network traffic failed over to the B side. That’s how the HP Virtual Connect module works. So that meant path B was working fine. Eventually the network team figured out that the cause was a bad SFP and cable on the link between network switches A and B.

The trouble NAS box is physically connected to network switch B. When virtual machine traffic came in on the A side, it had to go through the link between A and B before it reached the NAS box. So the whole picture looks like this:

Network Switch A ↔ Uplink A ↔ VC A ↔ All Blades ↔ VM
↕
Network Switch B ↔ Uplink B ↔ VC B ↔ All Blades
↕
Trouble NAS box
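The picture above can also be modelled as a tiny graph to see which links a copy to the trouble NAS box crosses, depending on which uplink is carrying the traffic. This is only a sketch of the topology as described in this post. The node names and the `find_path` and `crosses` helpers are made up for illustration, and failover is modelled simply by marking the other uplink as down.

```python
from collections import deque

# Links taken from the diagram above (each link is bidirectional).
LINKS = [
    ("VM", "VC A"),
    ("VM", "VC B"),
    ("VC A", "Switch A"),      # uplink A
    ("VC B", "Switch B"),      # uplink B
    ("Switch A", "Switch B"),  # inter-switch link (the one with the bad SFP and cable)
    ("Switch B", "Trouble NAS"),
]


def find_path(src, dst, down=()):
    """Breadth-first search for a path from src to dst, skipping links listed in down."""
    down_links = {frozenset(link) for link in down}
    neighbours = {}
    for a, b in LINKS:
        if frozenset((a, b)) in down_links:
            continue
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def crosses(path, link):
    """True if the path uses the given link in either direction."""
    hops = {frozenset(pair) for pair in zip(path, path[1:])}
    return frozenset(link) in hops


bad_link = ("Switch A", "Switch B")
via_a = find_path("VM", "Trouble NAS", down=[("VC B", "Switch B")])
via_b = find_path("VM", "Trouble NAS", down=[("VC A", "Switch A")])

print(" -> ".join(via_a), "| crosses bad link:", crosses(via_a, bad_link))
print(" -> ".join(via_b), "| crosses bad link:", crosses(via_b, bad_link))
```

With traffic on uplink A, the path has to cross the inter-switch link to reach the NAS box. With traffic on uplink B, it never touches that link, which matches what we saw during the failover.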

Once we fixed the bad SFP and cables, I captured network packets on the virtual machine again. I never saw Dup Ack, ReTransmit or Request Fast-Retransmit again!!!

The troubleshooting took two days, but the result was great for me. That was the first time I tracked down a physical component issue by capturing network packets. If you have a network performance issue and you see similar TCP packet problems, there’s a good chance it’s a bad cable or SFP. 🙂

Please refer to TCP DupACKs and TCP Fast Retransmits if you want to know more about TCP DupACK and retransmits.

Chinese version

一直忙于准备假日的维护工作和一些排错,很久没有更新技术文章了。最近有一个有趣的故障我希望记录下来,这次故障让我从虚拟层面“触摸”到了物理层面。:-)

用户抱怨他们的虚拟机往某个NAS上复制文件时非常慢,由于这些用户是VIP,我参与了整个排错过程。从虚拟机往这台NAS(我叫它有问题的NAS)复制东西时只有200 KB/s。一般情况下对于这种问题我都是各个击破。

一开始,我怀疑是虚拟层面的问题,比如CPU或者内存吃紧。如你所知,如果内存出现缓存或者CPU出现了长时间的等待就会使整个系统性能下降。我又检查了存储延迟,如果有延迟,磁盘的读写会受到影响。都没有问题。从虚拟机操作系统和ESXi上看一切都很正常。

然后我在有问题的虚拟机上抓了个包,发现有很多包如下所示。这种出现Dup Ack、ReTransmit、Request Fast-Retransmit的包表明TCP数据段在源和目的之间的某个地方被丢弃了,发送方不得不重传。这就是复制慢的原因。

TCP:[Dup Ack #162]Flags=…A…., SrcPort=Microsoft-DS(445), DstPort=58860, PayloadLen=0, Seq=835560569, Ack=3098154067, Win=17520

TCP:[ReTransmit #161]Flags=…A…., SrcPort=58860, DstPort=Microsoft-DS(445), PayloadLen=1460, Seq=3098154067 – 3098155527, Ack=835560569, Win=255

TCP:[Request Fast-Retransmit from Seq3098168667]Flags=…A…., SrcPort=Microsoft-DS(445), DstPort=58860, PayloadLen=0, Seq=835560569, Ack=3098168667, Win=17520

把目光转向物理层面,这台ESXi主机是HP BL460 G7刀片服务器。由于问题出现在网络上,接下来看的就是刀片系统。对于HP的刀片系统我的理解是把硬件虚拟化了。多个刀片服务器通过HP Virtual Connect Module(VC)共享同一个上联口。在我们的环境中,16个刀片服务器共享 2 x 10Gbps上联口,每个上联口链接各自的VC。物理链路如下:

Network Switch A ↔ Uplink A ↔ VC A ↔ All Blades
Network Switch B ↔ Uplink B ↔ VC B ↔ All Blades

刀片系统和VC之间的通讯有问题是非常罕见的。计数器是个查看网络质量的好工具,HP VC的计数器分为两部分:上联口和下联口。上联口计数器显示从外部物理交换机到VC之间的链路状态,下联口计数器显示从VC到刀片服务器的链路状态。我检查了几个计数器,比如Discards、Errors、PauseFrame等。没有任何问题,这说明从上联口到虚拟机这段是正常的。

网络管理员也没在交换机上发现任何问题。我们又尝试从这台机器到另外一台NAS上复制文件,速度非常快(50 MB/s)!另外找了台虚拟机往有问题的NAS上复制文件也特别快。这次真糊涂了。(这一刻,我觉得虚拟环境要比物理环境复杂的多,整个虚拟化市场正在走向《黑客帝国》里描述的那样,有一天我们终将不知问题到底在哪儿。)

同事说“如果没有头绪不如更换一下网线和SFP吧”。聊胜于无,却没想一更换马上好了!原来问题在这儿!为了确保这就是原因我们又做了一些测试。当网络管理员强制关闭上联口B端的交换机端口后问题又回来了。我察觉这是和链路有关。因为当我们更换网线和SFP时,整个网络流量都在A端,所以我们只更换了A端的。更换时网络流量会自动切换到B端,这是正常的HP VC工作机制。到了B端就好了,这意味着B端链路是正常的,但是A有问题。

最终网络管理员发现故障是因为在网络交换机A和B之间的链路上有一个坏了的网线和SFP。有问题的NAS物理的连接在网络交换机B上,当虚拟主机的流量来自链路A时,需要通过坏的SFP和网线从网络交换机A走到B然后到NAS。整个链路如下:

Network Switch A ↔ Uplink A ↔ VC A ↔ All Blades ↔ VM
↕
Network Switch B ↔ Uplink B ↔ VC B ↔ All Blades
↕
Trouble NAS box

更换了网线和SFP之后,我又抓了次包,这次再也没有之前说的那些问题了!排错花了两天时间,结果是让人欣慰的,这也是我第一次通过抓包的方式从虚拟层面窥探到物理层面的故障。如果你抓包遇到了和我类似的问题,可以先尝试更换网线和SFP。:-)

有关TCP问题包的基础知识可以看这篇文章TCP DupACKs and TCP Fast Retransmits.