PCPU locked up on Cisco UCS

PCPU 20 locked up. Failed to ack TLB invalidate
Error message of the PSOD

ESXi 5.5 Update 2 is a stable version, but I got a PSOD on one UCS blade a few days ago. It scared me, because I hit a serious bug when I upgraded ESXi from 5.1 to 5.5 Update 1 last year (see details in ESXi 5.5 and Emulex OneConnect 10Gb NIC), which led to a dozen virtual machines crashing over and over again. I bet I'm gonna die if it happens again. 🙂

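ESXi 5.5 Update 2 counts as a fairly stable release, but a few days ago one host hit a purple screen and it nearly scared me to death. Half a year ago, when upgrading from ESXi 5.1 to ESXi 5.5 Update 1, I ran into a serious bug (see my article ESXi 5.5 and Emulex OneConnect 10Gb NIC) that kept taking down dozens of machines. If this upgrade had done the same thing again, my career would probably have ended right there.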

The error message on the PSOD was “PCPU 20 locked up. Failed to ack TLB invalidate”. I checked the ESXi logs after rebooting. It looked like the server had suddenly crashed without any error or warning messages beforehand, so I suspected it was not a software-layer issue. Eventually I found that this CPU lockup problem is known to occur on Cisco UCS, and the root cause is a bug in the fnic driver. For details, refer to Cisco bug CSCut64613. Basically, you need to update the fnic driver to 1.6.0.17a.

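The purple screen error this time was: PCPU 20 locked up. Failed to ack TLB invalidate. After rebooting I first checked the ESXi logs and found that the server had purple-screened suddenly at a certain point in time, with no errors reported beforehand. From that I guessed it did not look like a software-layer problem. A search online showed that this kind of CPU lockup has indeed happened on Cisco UCS and is caused by a bug in the fnic driver. The details are in Cisco's bug database under CSCut64613. In short, you need to upgrade the fnic driver to 1.6.0.17a.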
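
If you have many blades to check, a small script can flag hosts whose fnic driver is older than 1.6.0.17a. The following is only a rough sketch, not something from Cisco or VMware: it assumes you have already saved the text output of `esxcli software vib list` from each host, that the driver row has a name containing "fnic", and that the version looks like `1.6.0.17a` followed by a build suffix. The function names and the sample rows are made up for illustration.

```python
import re

FIXED_VERSION = "1.6.0.17a"  # fnic version named in this post


def version_key(version):
    """Turn '1.6.0.17a' into a tuple of (number, letter) pairs for comparison."""
    key = []
    for part in version.split("."):
        match = re.fullmatch(r"(\d+)([a-z]*)", part)
        if match is None:
            raise ValueError(f"unexpected version component: {part!r}")
        key.append((int(match.group(1)), match.group(2)))
    return tuple(key)


def fnic_version(vib_list_output):
    """Return the fnic version found in saved `esxcli software vib list` text, or None."""
    for line in vib_list_output.splitlines():
        fields = line.split()
        # Assumption: first column is the VIB name, second column is the version.
        if len(fields) >= 2 and "fnic" in fields[0]:
            # Drop the build suffix after the first '-' (e.g. '1.6.0.12-1OEM...').
            return fields[1].split("-", 1)[0]
    return None


def needs_update(vib_list_output):
    """True if an fnic VIB is listed and it is older than FIXED_VERSION."""
    installed = fnic_version(vib_list_output)
    return installed is not None and version_key(installed) < version_key(FIXED_VERSION)


if __name__ == "__main__":
    # Hypothetical sample row, not copied from a real host.
    sample = (
        "Name       Version                        Vendor  Acceptance Level  Install Date\n"
        "scsi-fnic  1.6.0.12-1OEM.550.0.0.1331820  Cisco   VMwareCertified   2015-01-01\n"
    )
    print("fnic version:", fnic_version(sample))
    print("needs update:", needs_update(sample))
```

A plain string comparison would sort "1.6.0.9" after "1.6.0.17a", which is why each component is split into a number and an optional letter before comparing.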