The Cisco UCS blade system is the best blade system I have used so far. The hardware, software, and support are all excellent, and I recommend it as the primary platform for virtualization. The UCS blade architecture is quite different from HP's; it feels more like a network system. The Fabric Interconnect (FI) modules exchange data between the uplinks and the internal components, while the IOMs on each chassis control data routing. The architecture is complicated, but it is powerful for managing a large datacenter. In a large datacenter you may have hundreds of chassis and blades. Data travels through FIs, IOMs, and blades, so issues can appear at any layer, and it is hard to find out exactly where a problem is. UCS Manager provides port statistics just as Cisco does on its network switches: you can show the statistics of a particular port, but it does not tell you when a problem happened or at which layer. I tested the Cisco UCS adapter for vRealize Operations Manager before I reviewed the NetApp adapter; both are developed by the same company, Blue Medora. Below is a brief introduction to the product; it is just my personal review.
NetApp has released Virtual Storage Console (VSC) 6.1 for vCenter 6.0. The product now supports only the vSphere Web Client. I did some testing in my lab and ran into a very unusual case.
Today I hit a strange problem with shared folders. Some virtual machines could not access a network share path. Opening the shared folder in Explorer returned "Unspecified Error 0x80004005", while opening the same folder via Start – Run returned "The network path was not found 0x80070035".
vRealize Operations Manager 6 (aka vROps) is the new generation of vCenter Operations Manager. I have used vCenter Operations Manager since version 1.0, and I like the ideas of self-learning and dynamic thresholds. But the product only monitors the virtualization layer; it would be perfect if it could also monitor the underlying storage. In a large vSphere environment, virtual machines share the I/O capacity of datastores. If a few virtual machines generate high disk I/O, other virtual machines on the same storage may suffer performance degradation. Imagine you have 100 datastores coming from one NetApp filer, with 300 virtual machines running on them. A user says their virtual machine is slow even though there is no workload on the application end. It is hard to say where the latency comes from, because multiple virtual machines may share the same datastore, multiple LUNs may share the same aggregate, and perhaps even the same physical disks. vCenter Operations Manager offered a NetApp adapter for 5.x a few years ago, but the problem was that it was too hard to associate storage objects with vSphere datastore objects.
vRealize Automation 7 (vRA 7) has a lot of enhancements and changes compared with vRA 6, and there are plenty of introductions available on the internet. The initial configuration is different from vRA 6. I am going to share my experience below; you can easily build up a lab or POC by following this post.
You may see the error message "To view this page ensure that Adobe Flash Player version 11.5.0 or greater is installed." when you open vSphere Web Client 6.0 in IE 11 on Windows 8.1. The login fields are still visible, but the page goes blank after you log in.
A best practice for managing an enterprise Active Directory is to organize servers by particular properties; for example, servers may be put into different OUs by role, business group, or function. Below is a sample vRO workflow that automates provisioning computers into the proper OUs according to the user's choice in the vRA Service Catalog. I will only give a brief description of each step in this article, so please make sure you understand both products before reading this post.
When you put a virtual machine into a particular OU, you may refer to virtual machine properties such as server role, server group, or user group. It is easy to add a drop-down list to a vRealize Automation Center (vRA) blueprint so users can choose these properties, but it is hard to create the computer account in the corresponding OU from vRO. That is because vRA passes most values to vRO as strings, and the Active Directory workflows in vRO do not provide a way to convert a string to an OU object.
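One common workaround is to keep your own mapping table from the drop-down string to an OU distinguished name and build the DN yourself. The sketch below shows the idea in PowerShell with the RSAT ActiveDirectory module; the OU names, domain, and computer name are hypothetical, and in vRO you would build the same DN string inside a scripting element instead:

```powershell
Import-Module ActiveDirectory

# Hypothetical mapping from the string vRA passes to an OU distinguished name
$ouByRole = @{
    'web' = 'OU=WebServers,DC=contoso,DC=com'
    'db'  = 'OU=DatabaseServers,DC=contoso,DC=com'
}

$role   = 'web'      # value received from the vRA drop-down, as a plain string
$vmName = 'VM001'    # hypothetical machine name

# Pre-create the computer account in the mapped OU; the account is enabled
# later when the machine joins the domain
New-ADComputer -Name $vmName -Path $ouByRole[$role] -Enabled $false
```

The same lookup-table approach works for any property vRA hands over as a string, not just OUs.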
One day my vCenter Server suddenly lost search. It popped up "Unable to connect to web services to execute query. Verify that the 'VMware VirtualCenter Management Webservices' service is running on https://vCenter_Server_FQDN:10443" when I searched for objects in the vSphere Client. A few hours later people started complaining about an error in the vSphere Web Client: "Client is not authenticated to VMware Inventory Service – https://Inventory_Service_FQDN:10443".
Today I created a few super metrics in vRealize Operations Manager 6.0 to calculate the throughput of the physical links on ESXi hosts. The super metrics showed up on only some of the selected hosts. I guess it is some kind of minor bug; a reboot of the vROps vApp works around it. Just a heads-up.
I don't know why VMware doesn't allow hiding the default VMware-designed dashboards in vRealize Operations Manager (vROps); they also state there is no solution in the current version. I searched the internet, and the only thing I found was a community post from someone who wanted to delete the dashboards, with no proper answer.
First of all, this article is not really about PowerCLI itself. You probably know how to set the Path Selection Policy (PSP) with the vSphere Client, but how would you set it on 100 LUNs manually? We have a script that can make your life easier.
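For example, a PowerCLI sketch along these lines (not necessarily the author's original script) switches every disk LUN on every host to Round Robin in one pass:

```powershell
# Connect to vCenter first, e.g. Connect-VIServer vcenter.contoso.com
# Set Round Robin on every disk LUN that is not already using it
Get-VMHost | Get-ScsiLun -LunType disk |
    Where-Object { $_.MultipathPolicy -ne 'RoundRobin' } |
    Set-ScsiLun -MultipathPolicy RoundRobin
```

Check your storage array vendor's recommended PSP before changing it across the board; Round Robin is common but not universal.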
You can do almost anything as long as vRealize Automation Center (aka vRA) and vRealize Orchestrator (aka vRO) are integrated. I think that is the hard part if you are a newbie like me. After reading a lot of articles, I learned how it works. Below is my experience; please let me know if you see anything wrong.
It is frustrating to check RDM information: you have to check across all ESXi hosts to make sure the configuration is aligned. I just figured out a two-line command to get the path selection policy (aka PSP).
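A two-line PowerCLI approach (a sketch, assuming an existing vCenter connection; not necessarily the author's exact commands) would look like this:

```powershell
# Line 1: collect the canonical names of all RDM disks across the VMs
$rdm = Get-VM | Get-HardDisk -DiskType RawPhysical,RawVirtual | Select-Object Parent,Name,ScsiCanonicalName
# Line 2: look up the PSP of those LUNs on every host
Get-VMHost | Get-ScsiLun -LunType disk | Where-Object { $rdm.ScsiCanonicalName -contains $_.CanonicalName } | Select-Object VMHost,CanonicalName,MultipathPolicy
```

Because the same LUN is reported once per host, any host whose PSP differs from the rest stands out immediately in the output.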
Microsoft just released Technical Preview 3 of Windows Server 2016, and it is catching up with VMware on SDDC. I can see a lot of enhancements in the new version. A stable hypervisor is a prerequisite of SDDC, but that is Microsoft's weakness: you have to patch and reboot frequently, and some organizations even have regular reboot schedules. Microsoft introduced Core mode in Windows Server 2008 and enhanced it considerably in Windows Server 2012 R2. But Windows Server 2012 R2 is aimed at the SMB market, and I don't think SMB organizations really need it once you compare the operational complexity of Core mode with the GUI.
ESXi 5.5 Update 2 is a stable version, but I got a PSOD on one UCS blade a few days ago. It scared me, since there was a big bug when I upgraded ESXi from 5.1 to 5.5 Update 1 last year (see ESXi 5.5 and Emulex OneConnect 10Gb NIC for details) that led to dozens of virtual machines crashing over and over again. I bet I'm done for if it happens again. :-)
The title looks scary, doesn't it? Actually I don't want to talk about any problem with a VMware product, just a feature.
I always treat virtual machine snapshots as a big risk; they have caused several outages in our infrastructure. Please check out "Best practices for virtual machine snapshots in the VMware environment" to understand how snapshots impact production.
Someone set up a non-secure Wi-Fi network around my apartment. I had never connected to it until yesterday, since I worried it might be a honeypot. I had some me time last night, so I set up a virtual machine to connect to the Wi-Fi.
I wrote a post about how to integrate PowerCLI with PowerShell manually. I rebuilt my computer a few days ago and needed to integrate PowerCLI again. I used to script with PowerGUI, but something always caused PowerGUI to lose its menu, which frustrated me for a long time, and I could not figure out the root cause. So I wondered: is it possible to use PowerShell ISE instead of PowerGUI?
I just found an article showing how to check the alignment of Windows virtual machines and datastores.
vCenter Server 5.5 Update 2e contains a fix for the Storage Monitoring Service. It is also the most stable version since 5.5 Update 1. I ran into a problem when I upgraded my development vCenter Server last weekend. I'd like to share the solution, since VMware doesn't document the problem (or maybe I just didn't find it :-)). It's kind of tricky.
I just heard that Transparent Page Sharing (TPS) is disabled by default in the latest ESXi 5.5 patch. You may be concerned about that if your IT budget is tight, since it means you need more memory for heavy virtual machines.
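The behavior is controlled by the Mem.ShareForceSalting advanced setting. A PowerCLI sketch to inspect it, and, if you accept the security trade-off, restore the old inter-VM sharing behavior, might look like this (host name is hypothetical):

```powershell
# Show the current salting setting on every host
# (a non-zero value means inter-VM page sharing is restricted, the new default)
Get-VMHost | Get-AdvancedSetting -Name Mem.ShareForceSalting |
    Select-Object Entity, Name, Value

# Restore inter-VM page sharing on one host; weigh the security trade-off first,
# since TPS was restricted in response to a side-channel disclosure concern
Get-VMHost esx01.contoso.com | Get-AdvancedSetting -Name Mem.ShareForceSalting |
    Set-AdvancedSetting -Value 0 -Confirm:$false
```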
It's been a long time since my last post. I was offline due to a health issue; I have just recovered and returned to normal work. I now have to publish articles in English first and translate them to Chinese later, since I lost a lot of me time after my baby was born (but gained more fun). Hopefully it doesn't affect Google search. :-)
An interesting problem happened on a Microsoft cluster when I came back from the hospital. Our DBA team complained that the Microsoft Cluster Service failed intermittently on a virtual machine. This kept happening for a week.
At the beginning of the troubleshooting, the team noticed the quorum disk failed with the following Windows event:
Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible.
So we focused on disk performance. vobd.log also showed some performance degradation, but the timestamps did not match. Microsoft was involved after that; they said the cluster failure was actually caused by a network connectivity issue, based on the following Windows event:
Cluster node ‘xxx’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
It became interesting: since virtual machines share the physical network links, it could not be just a single virtual machine having problems if there were a real network connectivity issue. Then we noticed the following abnormal Windows event when some of the failures happened:
Reset to device, \Device\RaidPort0, was issued
It is related to an LSI driver bug. It may not be related to the cluster failure, but it is worth updating the driver to improve cluster stability. See the KB article "Windows virtual machine event log reports the error: Reset to device, \Device\RaidPort0, was issued" for details on this bug.
After involving multiple vendors across the OS, virtualization, network, and storage teams, everybody said it was not their problem. You see this kind of issue in a large datacenter: as more and more systems are installed, it becomes hard to find out which piece caused the problem. You have to be familiar with every field of the datacenter.
Eventually we figured out the issue was related to storage workload. But why couldn't the vendors figure it out? First of all, the Windows OS disk runs on shared storage, and Windows stops responding when the latency of the VMFS5 datastore hosting the OS disk is high. From the Windows perspective, it doesn't know what happened on the backend storage; it only knows the OS was very slow for a few seconds, as if the system paused. That leads to dropped network packets, with no Windows event logged, because the OS resumes so quickly that Windows treats it as normal behavior. The cluster actually fails at that moment. Secondly, the particular LUNs hosting the virtual machine were not busy, but they shared a storage pool with other LUNs, and any LUN with a high workload impacts the rest of the LUNs in the same pool.
Once we understood these points, we found that a lot of virtual machines saw high latency or high I/O around 7 PM every day, and most of the cluster failures happened at that time. Since it impacted a large number of virtual machines, it had to be caused by some component common to them. After capturing network packets, we eventually identified the culprit: McAfee DAT updates. All the virtual machines downloaded the same update at the same time, driving a high workload on the shared storage and causing the clustered virtual machines to stop responding for a few seconds. The issue was fixed after changing the McAfee DAT update schedule to a random interval.
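If your AV management console cannot randomize the schedule for you, the same effect can be approximated per machine with a randomized scheduled task. The sketch below (task name and updater path are hypothetical) adds up to two hours of random delay to a daily 7 PM job, so hundreds of VMs stop hitting the shared storage at the same moment:

```powershell
# Daily trigger at 19:00 with a per-machine random delay of up to 2 hours;
# each machine picks its own offset, spreading the I/O spike across the window
$trigger = New-ScheduledTaskTrigger -Daily -At 7pm -RandomDelay (New-TimeSpan -Hours 2)
$action  = New-ScheduledTaskAction -Execute 'C:\Tools\UpdateDat.cmd'   # hypothetical updater wrapper
Register-ScheduledTask -TaskName 'AV-DAT-Update' -Trigger $trigger -Action $action
```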
There are always common services running in a datacenter, such as backup, monitoring, anti-virus, or system management agents. Each may be a small resource consumer on its own, but together they can become a significant monster in a virtualized datacenter, impacting shared storage or network links.
CSV (Cluster Shared Volume) is fundamental to Microsoft Hyper-V; you must have it to leverage the Live Migration and High Availability features. But it is very confusing when you want to reclaim a CSV, because CSVs use different names than the physical disks. For example, the CSV name is usually "Cluster Disk x" and its path is usually "C:\ClusterStorage\VolumeX", but the real disk name in Disk Manager is "Disk x". You have to be very careful when deleting the disk.
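To map each CSV back to its mount path before deleting anything, something like this FailoverClusters module sketch can help (property names are per the clustering PowerShell module; verify the output on your own cluster before acting on it):

```powershell
Import-Module FailoverClusters

# List each CSV with its friendly mount path so "Cluster Disk x" can be
# matched to "C:\ClusterStorage\VolumeX" before any disk is removed
Get-ClusterSharedVolume | ForEach-Object {
    [pscustomobject]@{
        CsvName   = $_.Name
        OwnerNode = $_.OwnerNode
        MountPath = $_.SharedVolumeInfo.FriendlyVolumeName
    }
}
```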
VMware vRealize Automation 6.2 Configuration
vCAC configuration is a little complicated. I will separate it into four sections: vCenter Server, IaaS, vCAC itself, and VCO configuration.
vCenter Server Configuration
We will configure the identity source and permission settings on vCenter Server.
In previous articles I shared how to build a vCAC 6.2 lab: we created the domain controller and DNS services on DC01.contoso.com, vCenter Server on VC01.contoso.com, three ESXi hosts on ESX01/02/03.contoso.com, the vCAC server on vCAC.contoso.com, the IaaS server of vCAC on IaaS.contoso.com, and the FreeNAS server on FreeNAS.contoso.com.
VMware vRealize Automation 6.2 (vCAC) installation
vCAC contains three components: vCAC itself, vRealize Orchestrator (VCO), and the IaaS server. I used the built-in VCO to save resources. I am going to cover the installation in two sections: vCAC installation and IaaS installation.
I need FreeNAS to provide shared NFS storage for the ESXi hosts to enable advanced features such as HA and vMotion. I gave the FreeNAS virtual machine 1 GB RAM, 4 vCPUs, and a 2 GB local disk. Its DNS name is FreeNAS.contoso.com.
VMware recently released vRealize Automation Center 6.2 (vCAC). I guess there will be a newer version along with vSphere 6.0. As an IT pro you have to keep learning new stuff! I built a lab environment on my laptop for learning, and I am going to share my implementation experience below; it took me a dozen hours plus a lot of document reading. Initially I felt it was too complicated to deploy (that seems to be a tradition of VMware products), but eventually I concluded it is simply not easy to provide a unified self-service end-user interface on top of a multi-vendor infrastructure. Even OpenStack is not easy!
If you are using HP ProLiant BL460c G7 or Gen8 blades with ESXi 5.5, Emulex-chipset NICs, and driver version 10.x.x.x, you may experience the host randomly losing connectivity to vCenter Server, with the host status showing "Not responding". You cannot ping any virtual machine hosted on the blade. A high pause-frame count is observed on the HP Virtual Connect module downlinks after the problem occurs, and you see errors similar to the following in the vmkernel logs:
Some of my virtual machines used the LSI Logic SCSI controller, which is not recommended for Red Hat 6 virtual machines. We needed to change it to the VMware Paravirtual SCSI controller.
Basically the steps are: power off the virtual machine, change the SCSI controller type, and power it back on. Then you lose the operating system. :-)
It is easy to find a solution for this particular problem; VMware has a KB for the error. Somehow it was not my case. I don't know the exact root cause, but you can try vMotioning the virtual machine to another host and giving it another try.
I noticed UCS Manager had an unexpected failover after we upgraded the firmware to 2.2(2c). It looks like it hit bug CSCuo11700; the firmware should be upgraded to 2.2(3a) to fix the issue.
Something is wrong with ESXi 5.5 again! Please don't upgrade VMware Tools to 5.5 if you have Debian or Red Hat Linux virtual machines on your ESXi 5.5 hosts. There is an unsolved bug in the vmmemctl (balloon) driver of VMware Tools 5.5 that can cause Linux virtual machines to hang.