vRealize Operations Management Pack for Cisco UCS Review

The Cisco UCS blade system is the best blade system I have used so far. The hardware, software and support are all excellent. I recommend using it as the primary platform for virtualization. The UCS architecture is different from HP's; it feels more like a network system. Fabric Interconnect (FI) modules exchange data between the uplinks and the internal components, and the IOMs in each chassis control data routing. The architecture is complicated, but it is powerful for managing a large datacenter. In a large datacenter you may have a hundred or more chassis or blades. Data passes through FIs, IOMs and blades, and issues can appear at any layer, so it is hard to find out exactly where a problem is. UCS Manager provides port statistics just as Cisco network switches do, and you can show the statistics for a particular port, but it doesn't tell you when a problem happened or at which layer. I tested the Cisco UCS adapter for vRealize Operations Manager before I reviewed the NetApp adapter for vRealize Operations Manager. Both are developed by the same company, Blue Medora. I'd like to introduce a few features of this product; this is just my personal review.



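Introduction to the Cisco UCS performance monitoring pack for vRealize Operations Manager 6

The Cisco UCS blade series is the best blade system I have used so far. The hardware, software and technical support are all excellent. I personally recommend Cisco UCS as the primary equipment in large virtualized datacenters. The UCS architecture is completely different from HP's; it feels more like a network device. Fabric Interconnect (FI) modules handle data exchange between the uplinks and the internal components, and the IOMs handle data routing for each chassis. The architecture looks complicated, but it is very powerful when managing a large datacenter. In a large datacenter with, say, over a hundred chassis and blade servers, data passes through FIs, IOMs, blades and so on, and a problem can occur at any layer, so finding the root cause in a large virtualized datacenter is hard. UCS Manager provides counters like those on Cisco network switches and can show the counters for every port, but it doesn't tell you when or at which layer a problem occurred. Before testing the NetApp storage performance monitoring pack, I had the chance to test the Cisco UCS performance monitoring pack for vRealize Operations Manager 6. It is also developed by Blue Medora. Below is a brief introduction; it is only my personal opinion.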


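Introduction to the NetApp storage performance monitoring pack for vRealize Operations Manager 6

vRealize Operations Manager 6 (also known as vROps) is the new version of vCenter Operations Manager. I have used vCenter Operations Manager since version 1.0 and really like its self-learning and dynamic threshold features. But the product could only monitor the virtualization layer; it would be perfect if it could monitor the storage layer too. In a fairly large vSphere environment, virtual machines share ESXi datastores, and if a few virtual machines generate very high IO, they can affect other virtual machines on the same storage. Imagine you have 100 LUNs on one NetApp system and 300 virtual machines using those 100 LUNs. One day users say their virtual machines are slow even though they aren't running any applications. It is hard to tell where the problem is, because virtual machines may share the same datastore, datastores are built on LUNs, LUNs may come from the same aggregate, and multiple LUNs may sit on the same physical disks. vCenter Operations Manager offered a NetApp storage monitoring pack in the 5.x era, but it was very hard to associate vSphere datastores with the objects on the NetApp storage.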


NetApp Management Pack for vRealize Operations Manager 6

vRealize Operations Manager 6 (aka vROps) is the new generation of vCenter Operations Manager. I have used vCenter Operations Manager since version 1.0, and I like the ideas of self-learning and dynamic thresholds. But the product only monitored the virtualization layer; it would be perfect if it could monitor the underlying storage as well. In a large vSphere environment, virtual machines share the IO capacity of datastores. If a few virtual machines generate high disk IO, other virtual machines on the same storage may suffer degraded performance. Suppose you have 100 datastores on a NetApp filer and 300 virtual machines running on them. A user says their virtual machine is slow, but there is no workload on the application side. It is hard to say where the latency comes from, because multiple virtual machines may share the same datastore, and multiple LUNs may share the same aggregate and perhaps the same physical disks. vCenter Operations Manager provided a NetApp adapter for 5.x a few years ago, but the problem was that it was too hard to associate storage objects with vSphere datastore objects.


Inventory Service cannot start

One day, vCenter Server suddenly could no longer search for virtual machines. Searching in vSphere Client showed the message: Unable to connect to web services to execute query. Verify that the ‘VMware VirtualCenter Management Webservices’ service is running on https://vCenter_Server_FQDN:10443. A few hours later users began to complain that vSphere Web Client had a problem as well, constantly showing the error: Client is not authenticated to VMware Inventory Service – https://Inventory_Service_FQDN:10443


Create a VM in a specified OU in vRA

A best practice for managing an enterprise Active Directory is to organize servers by particular properties. For example, servers may be placed in different OUs by role, business group, function and so on. The following is a sample vRO workflow that automates provisioning computers into the proper OUs according to the user's choice in the vRA Service Catalog. I'll only give a brief description of each step in this article, so please make sure you understand both products before reading this post.



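When you create a virtual machine, you may need to place it in different OUs according to its properties, such as role, group or user group. In vRealize Automation (vRA) it is easy to create a drop-down list for choosing such properties, but their values are usually passed to vRO as strings, and the Active Directory workflows in vRO do not provide a way to convert a string to an OU object.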


Convert string to OU object in vRO

When you put a virtual machine into a particular OU, you may refer to virtual machine properties such as ‘server role’, ‘server group’ or ‘user group’. It is easy to add a drop-down list to a vRealize Automation (vRA) blueprint so that users can choose such properties, but it is hard to create a computer account in the corresponding OU location in vRO. That is because vRA passes most values to vRO as strings, and the Active Directory workflows in vRO do not provide a way to convert a string to an OU.
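The first half of that conversion, turning the drop-down string into an OU location, can be sketched outside vRO. The following is only an illustrative Python sketch, not the vRO workflow itself: the role names, the OU layout and the `DC=example,DC=com` base are all assumptions made up for the example, and resolving the resulting distinguished name to an actual OU object would still have to happen inside vRO.

```python
# Hypothetical lookup table: drop-down value (string) -> relative OU path.
# Role names and OU layout are assumptions for illustration only.
ROLE_TO_OU = {
    "web": "OU=Web,OU=Servers",
    "database": "OU=Database,OU=Servers",
}


def ou_dn_for_role(role: str, base_dn: str = "DC=example,DC=com") -> str:
    """Return the distinguished name of the OU configured for a role string."""
    key = role.strip().lower()  # tolerate stray whitespace and case differences
    if key not in ROLE_TO_OU:
        # Fail loudly instead of silently creating the account in a default OU.
        raise ValueError(f"no OU configured for role {role!r}")
    return f"{ROLE_TO_OU[key]},{base_dn}"
```

Rejecting unknown values up front keeps a typo in the catalog item from placing a computer account in the wrong OU.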


Inventory Service Cannot be Brought Up

One day, my vCenter Server suddenly lost its search function. When I searched for an object in vSphere Client, it showed “Unable to connect to web services to execute query. Verify that the ‘VMware VirtualCenter Management Webservices’ service is running on https://vCenter_Server_FQDN:10443”. A few hours later people started complaining that they got an error in vSphere Web Client: “Client is not authenticated to VMware Inventory Service – https://Inventory_Service_FQDN:10443”.


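Core mode of Windows Server 2016 Technical Preview 3 – Remote management tips

Microsoft has just released Technical Preview 3 of Windows Server 2016. The new version has many enhancements, and Microsoft's software-defined datacenter seems to be catching up with VMware. A stable virtualization layer is a prerequisite for a software-defined datacenter, but this is Microsoft's weak spot. You have to keep applying patches and rebooting servers, and some enterprises even have regular reboot schedules. Microsoft introduced core mode in Windows Server 2008 and enhanced it in Windows Server 2012 R2. But Windows Server 2012 R2 targets the small and medium business market, and I don't think those businesses would use core mode, because it adds a lot of complexity.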


How to call customized vRealize Orchestrator workflows in vRealize Automation Center

You can do almost anything once vRealize Automation (aka vRA) and vRealize Orchestrator (aka vRO) are integrated. I think the integration is the hard part if you are a newbie like me. After reading a lot of articles, I learned how it works. The following is my experience; please let me know if you see anything wrong.

You can do almost anything by combining vRA and vRO. If you are a newcomer like me, you will find the integration part fairly hard. I recently read some articles on the subject and now have some understanding. Below is my take; leave me a comment if anything is wrong.


Core mode of Windows Server 2016 TP3 – Remote Management Tip

Microsoft just released Technical Preview 3 of Windows Server 2016, and it is catching up with VMware on SDDC. I can see a lot of enhancements in the new version. A stable hypervisor is a prerequisite for SDDC, but it is a weakness of Microsoft's. You have to patch and reboot frequently, and some organizations even have regular reboot schedules. Microsoft introduced core mode in Windows Server 2008 and enhanced it considerably in Windows Server 2012 R2. But Windows Server 2012 R2 is aimed at SMBs, and I don't think SMB organizations really need core mode when you compare its operational complexity with the GUI.


PCPU locked up on Cisco UCS

PCPU 20 locked up. Failed to ack TLB invalidate

Error message of the PSOD

ESXi 5.5 Update 2 is a stable version, but I got a PSOD on one UCS blade a few days ago. It scared me, since I hit a big bug when I upgraded ESXi from 5.1 to 5.5 Update 1 last year (for details, see ESXi 5.5 and Emulex OneConnect 10Gb NIC) that led to dozens of virtual machines crashing over and over again. I bet I'm going to die if it happens again. :-)

ESXi 5.5 Update 2 counts as a fairly stable version, but a few days ago I hit a purple screen on one host, and it scared me badly. Half a year ago, when upgrading from ESXi 5.1 to ESXi 5.5 Update 1, I ran into a big bug (for details, see my post ESXi 5.5 and Emulex OneConnect 10Gb NIC) that took down machines by the dozen. If this upgrade does it again, my career is probably over.


How to Automate Snapshot on Virtual Machine

I always treat virtual machine snapshots as a big risk. They have caused several outages in our infrastructure. Please read Best practices for virtual machine snapshots in the VMware to understand how they affect production.

Virtual machine snapshots are a serious threat as far as I am concerned; they have already caused several failures in my production environment. If you want to understand how snapshots affect production, take a look at: Best practices for virtual machine snapshots in the VMware


How to integrate PowerCLI with PowerShell and PowerShell ISE

I wrote a post about how to integrate PowerCLI with PowerShell manually. I rebuilt my computer a few days ago and needed to integrate PowerCLI again. I used to script in PowerGUI, but something kept making PowerGUI lose its menu, which frustrated me for a long time, and I could not figure out the root cause. So I wondered: is it possible to use PowerShell ISE instead of PowerGUI?


CustomAction VM_InstallJRE returned actual error code 1624

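vCenter Server 5.5 Update 2e contains a fix for the Storage Monitoring Service. It is also one of the stable versions released since 5.5 Update 1. I ran into a problem when I upgraded my development vCenter Server last weekend. I'd like to share the solution, since VMware doesn't document the problem. (Maybe I just didn't find it. :-)) It's kind of tricky.

vCenter Server 5.5 Update 2e includes a bug fix for the SMS service, and it is also a fairly stable version at the moment. Last week I ran into a problem while upgrading vCenter Server to this version. The problem is not that easy to fix because the VMware KB does not provide a solution, so I am sharing my method here.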


Transparent Page Sharing (TPS) is disabled by default in latest ESXi 5.5 patch

I just heard that Transparent Page Sharing (TPS) is disabled by default in the latest ESXi 5.5 patch. This may concern you if your IT budget is tight, since it means you need more memory for heavy virtual machines.

I heard that the latest ESXi 5.5 patch disables TPS, so I looked into it. I think this may be bad news for companies with tight IT budgets, because it means you need more memory to cope with large virtual machines.


A very interesting Microsoft cluster failure

It's been a long time since my last post. I was off the internet due to a health issue. I have just recovered and am back to normal work. From now on I have to publish my articles in English first and translate them to Chinese later, since I have lost a lot of me time since my baby was born, though life is more fun. Hopefully it won't affect Google search. :-)

An interesting problem with a Microsoft cluster came up when I returned from the hospital. Our DBA team complained that Microsoft Cluster Service failed intermittently on a virtual machine. This kept happening for a week.

At the beginning of the troubleshooting, the team noticed that the quorum disks failed with the following Windows event:

Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible.

So we focused on disk performance. vobd.log also showed some performance degradation, but the timestamps did not match. Microsoft got involved after that, and they said the cluster failure was actually caused by a network connectivity issue, according to the following Windows event:

Cluster node ‘xxx’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

This became interesting, because virtual machines share physical network links; a network connectivity issue could not affect only a single virtual machine. Then we noticed the following abnormal Windows event when some of the failures happened:

Reset to device, \Device\RaidPort0, was issued

It is related to an LSI driver bug. It may not be related to the cluster failures, but the driver is worth updating to improve cluster stability. For details of this bug, see Windows virtual machine event log reports the error: Reset to device, \Device\RaidPort0, was issued.

After we involved multiple vendors across the OS, virtualization, network and storage teams, everybody said it was not their problem. You see this kind of problem in large datacenters: as more and more systems are installed, it gets hard to find out which piece caused the issue. You have to be familiar with every field of the datacenter.

Eventually we figured out that the issue was related to storage workload. But why couldn't the vendors figure it out? First, the Windows OS disk runs on shared storage, and Windows stops responding when latency is high on the VMFS5 datastore that holds the OS disk. Windows doesn't know what happened on the backend storage; it only knows that the OS was very slow for a few seconds, as if the system had paused. This leads to dropped network packets, and no Windows event is logged because the OS resumes very quickly and Windows treats it as normal behavior. The cluster actually fails at this moment. Second, the particular LUNs hosting the virtual machine were not busy, but they shared a storage pool with other LUNs. Any LUN with a high workload affects the rest of the LUNs in the same storage pool.

Once we understood these points, we found that a lot of virtual machines had high latency or high IO around 7 PM every day, and most of the cluster failures happened at that time. Since a large number of virtual machines were affected, the cause had to be some component common to the virtual machines. After capturing network packets, we eventually found that it was the McAfee DAT update. All virtual machines downloading the same update at the same time put a high workload on the shared storage and left the clustered virtual machines unresponsive for a few seconds. The issue was fixed after we changed the McAfee DAT update schedule to a random interval.
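The idea of spreading a common task over a window so that it no longer hits shared storage all at once can be sketched in a few lines. This is a minimal Python illustration, not how McAfee schedules its updates: the function name, the two-hour window and the VM names are assumptions, and it derives the offset from a hash of the VM name rather than a true random number, so that each machine keeps the same slot from day to day.

```python
import hashlib


def update_offset_minutes(vm_name: str, window_minutes: int = 120) -> int:
    """Map a VM name to a stable offset (in minutes) inside the update window."""
    digest = hashlib.sha256(vm_name.encode("utf-8")).digest()
    # Use the first 4 bytes of the hash as an integer and fold it into the window.
    return int.from_bytes(digest[:4], "big") % window_minutes


if __name__ == "__main__":
    # Hypothetical VM names; each one starts its update at 7 PM plus its offset.
    for name in ("sql-node-a", "sql-node-b", "web-01"):
        print(name, update_offset_minutes(name))
```

With 300 virtual machines and a 120-minute window, only a handful of machines start their download in any given minute instead of all 300 at once.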

There are always common tasks running in a datacenter. Each may be a small resource consumer, but together they can become a significant monster in a virtualized datacenter. Examples are backup, monitoring, anti-virus and system management agents. They can affect shared storage or network links.


How to find the corresponding physical disk for Hyper-V CSV volumes

CSV (Cluster Shared Volume) is fundamental to Microsoft Hyper-V. You must have it to use the Live Migration and High Availability features. But it is very confusing when you want to reclaim a CSV, because the CSV uses a different name from the physical disk. For example, the CSV name is usually “Cluster Disk x” and the path is usually “C:\ClusterStorage\VolumeX”, but the real disk name is “Disk x” in Disk Management. You have to be very careful when you delete the disk.


How to Build vCAC 6.2 LAB on VMware Workstation 11 – Part 1

VMware recently released vRealize Automation 6.2 (formerly vCloud Automation Center, vCAC). I guess there will be a newer version along with vSphere 6.0. As an IT pro you have to keep learning new stuff! I built a lab environment on my laptop for learning. I'm going to share my implementation experience below; it cost me a dozen hours plus a lot of document reading. Initially I felt it was too complicated to deploy (that looks like a tradition of VMware products). But eventually I realized that it is not easy to provide a unified self-service end-user interface for a multi-vendor infrastructure. Even OpenStack is not easy!


ESXi 5.5 and Emulex OneConnect 10Gb NIC

*** English Version ***

You are using an HP ProLiant BL460c G7 or Gen8 with ESXi 5.5, the NIC has an Emulex chipset, and the driver version is 10.x.x.x. You may find that the host randomly loses connectivity to vCenter Server and the host status shows “Not responding”. You cannot ping any virtual machine hosted on the blade. A high number of pause frames is observed on the HP Virtual Connect module downlinks after the problem occurs. And you see a similar error in the vmkernel logs:


Incompatible device backing specified for device ‘x’

It is easy to find a solution for this particular problem, since VMware has a KB for this error. Somehow it did not apply to my case. I don't know the exact root cause, but you can try to vMotion the virtual machine to another host and then try again.

Chinese Version
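A solution to this problem is easy to find; VMware has a knowledge base article for it. But for some reason that article could not solve the problem I ran into. I resolved it by moving the virtual machine to another ESXi host with vMotion, and I don't know the exact cause.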