Tag Archives: storage

NetApp Virtual Storage Console Icon Missing on vSphere Web Client

NetApp released Virtual Storage Console (VSC) 6.1 for vCenter 6.0. The solution now supports only the vSphere Web Client. I did some testing in my lab and ran into a very unusual case.

NetApp Management Package for vRealize Operation Manager 6

vRealize Operations Manager 6 (aka vROps) is the new generation of vCenter Operations Manager. I have used vCenter Operations Manager since version 1.0, and I like the idea of self-learning and dynamic thresholds. But the product only monitors the virtualization layer. It would be perfect if it could also monitor the underlying storage. In a large vSphere environment, virtual machines share the IO capacity of datastores. If a few virtual machines generate high disk IO, the other virtual machines on the same storage may see degraded performance. Imagine you have 100 datastores coming from one NetApp filer, with 300 virtual machines running on them. A user says their virtual machine is slow, but there is no workload on the application side. It’s hard to say where the latency comes from, because multiple virtual machines may share the same datastore, and multiple LUNs may share the same aggregate and maybe the same physical disks. vCenter Operations Manager provided a NetApp adapter for 5.x a few years ago, but the problem was that it was too hard to associate storage objects with vSphere datastore objects.

How to find corresponded physical disk for Hyper-V CSV volumes

CSV (Cluster Shared Volume) is fundamental to Microsoft Hyper-V. You must have it to take advantage of the Live Migration and High Availability features. But it gets very confusing when you want to reclaim a CSV, because a CSV uses a different name from the physical disk behind it. For example, the CSV name is usually “Cluster Disk x” and the path is usually “C:\ClusterStorage\VolumeX”, but the real disk name is “Disk x” in Disk Management. You have to be very careful when you delete the disk.

PortChannel does not work on Cisco UCS Fabric Interconnect

Whatever you configure on the MDS, whatever you configure on the Cisco UCS FIs, and whatever you do for the port channel on both sides, the Cisco UCS uplink ports are always down with the error message “Initialize failed” or “Error disabled”.

Congratulations! Your device has hit an MDS firmware bug: https://tools.cisco.com/bugsearch/bug/CSCtr01652/?reffering_site=dumpcr.


Device or Resource Busy

You may have read my post How to find which ESXi 5.1 host lock the VM; it’s a way to figure out which host is locking a file.

But sometimes you face a similar problem that needs a different solution.

You are able to browse the file by CLI or GUI, but you cannot delete it either way. It returns “device or resource busy” or a similar error message.

You could try the following command to delete the file or folder:

rm -rf [File or folder name]

How to get HBA WWPN of ESXi hosts

It’s been a busy month, and I haven’t updated my blog since I came back from Phuket with my wife. I’m working on multiple projects and am a little overloaded.

Just a quick share: my storage team asked me to provide the WWPNs of all hosts for a health check. It’s a nightmare to pull that data out of the vSphere Client or Web Client, but I just found a way to get it.

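Get-VMHost -Location "cluster name" | Get-VMHostHba -Type FibreChannel | select VMHost,Device,@{N="WWPN";E={"{0:X}" -f $_.PortWorldWideName}}

The key part is "{0:X}" -f $_.PortWorldWideName

{0:X} is the format string, which renders the value as uppercase hexadecimal; check out here to find more.

-f is the PowerShell format operator; it applies the format string on its left to the value on its right.

$_.PortWorldWideName is the value you want to convert.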
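To make the conversion concrete outside PowerCLI, here is a minimal Python sketch of the same idea: the port WWN is a 64-bit integer, and formatting it as hexadecimal gives the WWPN. The function name, the sample value and the colon-separated layout are my assumptions for illustration; the PowerCLI one-liner itself prints plain hex without colons.

```python
def format_wwpn(port_wwn: int) -> str:
    """Render a 64-bit port WWN as colon-separated uppercase hex bytes."""
    # Same idea as "{0:X}" -f value, but zero-padded to 16 hex digits.
    hex_digits = f"{port_wwn:016X}"
    return ":".join(hex_digits[i:i + 2] for i in range(0, 16, 2))


# Hypothetical sample value, not taken from a real host.
sample = 0x20000025B5A0000F
print(format_wwpn(sample))
```

Storage teams usually want the colon-separated form for zoning, which is why the sketch adds the separators.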


Windows cannot be installed on drive 0 partition 1

I think Windows Server 2012 will be the next popular server OS, just like Windows Server 2008, and it’s also a nice hypervisor OS in the virtual world. What do you think?

Installation is the first step to experiencing this wonderful OS, and you may see some strange problems during that step, just like I did. Today’s topic happened a long time ago; I just want to share it with people who may face a similar issue.

This was an HP blade system with local disks attached, but you may see a similar problem with other vendors. When you select the disk to install the OS on, the installer may say “Windows can’t be installed on drive 0 partition 1”, or “Windows cannot be installed on this disk. This computer’s hardware may not support booting to this disk. Ensure that the disk’s controller is enabled in the computer’s BIOS menu.”

That’s because the boot volume is not set on the array controller. On HP servers, for example, you have to reboot and press F8 after the BIOS checks the array controller to enter the array controller management interface. Then go to Select Boot Volume in the main menu, select Direct Attached Storage, and select the disk you want to install the OS on. Follow the wizard to continue booting.

If the problem persists, go back to the array controller management interface, rebuild the array and select the boot volume again. That should fix the problem.

Nodes in the ESXi cluster may report corruption after reboot host or attach device

VCE just released a new KB, vce2563, describing the issue.

If your ESXi 5.x hosts are connected to a VMAX running Enginuity 5876.159.102 or later, you may see this particular issue after rebooting an ESXi host or attaching storage if you have enabled the block delete feature of VAAI.

To check the status of the option, you can run the following command in PowerCLI:

Get-VMHost -Location "cluster name" | Get-VMHostAdvancedConfiguration -Name VMFS3.EnableBlockDelete

Error 2931 The connection to the VMM agent on the virtualization server was lost

Windows Server 2012 is the biggest competitor of VMware vSphere. There are good reasons to use Hyper-V 2012 instead of vSphere 5.x, but it’s still very hard for a newbie. We spent more than 30 hours trying to figure out how to create a cluster in SCVMM 2012 SP1. The software is easy to install but hard to configure. I saw “failed” everywhere; it’s not a mature product in my view.

We installed Windows Server 2012 Datacenter edition on HP BL460 blades; the storage is a NetApp FAS2240 (I may be wrong, I’m not a storage guy). We got the following error message when we created a Hyper-V cluster in SCVMM 2012 SP1.

Error (2931)
VMM is unable to complete the request. The connection to the VMM agent on the virtualization server (xxx) was lost.
Unknown error (0x80338029)

Recommended Action
Ensure that the Windows Remote Management (WS-Management) service and the VMM agent are installed and running and that a firewall is not blocking HTTPS traffic.

This can also happen due to DNS issues. Try and see if the server (dcahyv04.amat.com) is reachable over the network and can be looked up in DNS. You can ping the virtualization server from VMM management server and make sure that the IP address returned matches the IP address locally obtained from the virtualization server.

If the error still persists, restart the virtualization server, and then try the operation again.

The SCVMM job failed at the step “Mounts storage disk on xxxx”.

Initially I thought something was wrong with the services. I checked the Windows Remote Management service mentioned in the error, but it was up and running. Then I noticed the WINS servers were not set, but fixing that brought no luck either.

Why did the job always fail on mounting storage? Maybe it was related to disk operations? The SCVMM server is a remote server, so it must operate on the disks remotely. I tried connecting to the Hyper-V server remotely with the Computer Management tool, and it showed “RPC is unavailable” when I clicked the Disk Management node. Aha, a firewall problem: the firewall was disabled on the SCVMM server but enabled on the Hyper-V server, so the RPC ports were blocked on the Hyper-V side.

After I disabled the firewall, cluster creation sometimes succeeded, but the Hyper-V servers still seemed to have a hard time mounting storage.

SCVMM mounts and unmounts storage on each Hyper-V host during cluster creation, and it took a very long time to mount storage before the job failed, so we suspected the storage. Finally, we installed NetApp Host Utilities 6.0.1 and NTAP MPIO 4.0.1, which solved the problem.

To summarize: you must enable remote management on the Hyper-V hosts (the Remote Registry service and so on), allow the required ports in Windows Firewall, and install the storage MPIO plugin. BTW, you should disable UAC on Windows Server 2012; the procedure is different from Windows Server 2008, check http://social.technet.microsoft.com/wiki/contents/articles/13953.windows-server-2012-deactivating-uac.aspx.

That’s just the first step to making Hyper-V successful. :)

How to configure vSAN on nested ESXi hosts with SSD hard disk

There are a lot of articles that introduce the vSAN feature and give step-by-step guides. I referred to William Lam’s article and Duncan’s article to configure vSAN in my lab. I’m sure I followed their steps exactly, but I could not see anything in the disk field under Disk Management.

Please note: the following steps do not work for ESXi 6.0 RC on VMware Workstation 10. You have to set scsix:y.virtualssd = 0 in the vmx file to mark the disk as non-SSD. Please refer to William’s article for details.

After looking into it more deeply, I found something interesting:

esxcli storage core device list

I got this output:

Display Name: Local VMware, Disk (mpx.vmhba1:C0:T1:L0)
Has Settable Display Name: false
Size: 5120
Device Type: Direct-Access
Multipath Plugin: NMP
Devfs Path: /vmfs/devices/disks/mpx.vmhba1:C0:T1:L0
Vendor: VMware,
Model: VMware Virtual S
Revision: 1.0
SCSI Level: 2
Is Pseudo: false
Status: on
Is RDM Capable: false
Is Local: true
Is Removable: false
Is SSD: true
Is Offline: false
Is Perennially Reserved: false
Queue Full Sample Size: 0
Queue Full Threshold: 0
Thin Provisioning Status: unknown
Attached Filters:
VAAI Status: unsupported
Other UIDs: vml.0000000000766d686261313a313a30
Is Local SAS Device: false
Is Boot USB Device: false
No of outstanding IOs with competing worlds: 32

Initially, I thought the disk was marked as SSD because I had run the command to enable SSD. Actually it’s not like that: it shows as SSD because my physical hard disk is an SSD! I didn’t have to run the command from the articles to turn SSD on; it’s a native SSD. lol
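If you have more than a couple of disks, checking the Is SSD field by eye gets tedious. Here is a minimal Python sketch that reads saved output of esxcli storage core device list and reports which devices are flagged as SSD. It assumes the usual layout where each device ID sits on its own unindented line and the fields are indented below it; the function names and the second sample device are hypothetical.

```python
def parse_devices(text: str) -> dict:
    """Split device-list style output into {device: {field: value}}."""
    devices = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: assumed to be the device identifier.
            current = line.strip()
            devices[current] = {}
        elif current is not None and ":" in line:
            # Indented "Field: value" line; split on the first colon only.
            key, _, value = line.strip().partition(":")
            devices[current][key.strip()] = value.strip()
    return devices


def ssd_devices(text: str) -> list:
    """Return the devices whose 'Is SSD' field is true."""
    return [dev for dev, fields in parse_devices(text).items()
            if fields.get("Is SSD") == "true"]


# Sample in the assumed layout; the second device is made up for illustration.
SAMPLE = (
    "mpx.vmhba1:C0:T1:L0\n"
    "   Display Name: Local VMware, Disk (mpx.vmhba1:C0:T1:L0)\n"
    "   Is Local: true\n"
    "   Is SSD: true\n"
    "mpx.vmhba1:C0:T2:L0\n"
    "   Display Name: Local VMware, Disk (mpx.vmhba1:C0:T2:L0)\n"
    "   Is Local: true\n"
    "   Is SSD: false\n"
)
print(ssd_devices(SAMPLE))
```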

What I needed to do was actually the total opposite. These are the steps I used to enable vSAN:

1. Create two disks.

2. Log in to the ESXi host by SSH.

3. Run the following command and find the two disks you want to use for vSAN. Record the runtime names.

esxcli storage core device list

4. Run the following command to disable SSD for one disk.

esxcli storage nmp satp rule add --satp VMW_SATP_LOCAL --device vmhba1:C0:T2:L0 --option "disable_ssd"

5. Follow the articles above to enable the vSAN ports, create the cluster, enable vSAN on the cluster and join the ESXi hosts to the cluster.

How to decode ESXi 5.x SCSI error code

Storage is a critical component of virtualization, and a lot of VM performance issues are related to storage latency. In some cases you may see an error message like this in the vmkernel log:

2014-02-11T07:18:20.541Z cpu8:425351)ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5 from world 602789 to dev "naa.514f0c5c11a00025" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0

It looked like a language from another planet the first time I saw it :). Let’s see how to “translate” it into human language.

First, I split it into several sections:

a) 2014-02-11T07:18:20.541Z cpu8:425351)

b) ScsiDeviceIO: 2331: Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5

c) from world 602789

d) to dev "naa.514f0c5c11a00025"

e) failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x44 0x0

Section A shows the UTC time when the error occurred.

Section B shows which command was sent. (Actually I don’t know what the command means; please let me know if you do.)

Section C shows which world the command is related to.

You can find out which world it is with the following command:

ps | grep 602789

Section D shows which storage device reported the error.

If your datastore contains a single LUN, you can identify which datastore it is with the following command:

esxcfg-scsidevs -m naa.514f0c5c11a00025

You can also check the LUN settings and information with the following commands:

esxcli storage core device list -d naa.514f0c5c11a00025

esxcli storage nmp device list -d naa.514f0c5c11a00025

Section E shows the SCSI status and sense codes. That’s the part I want to explain in more detail.

It breaks down into two parts:

SCSI status code: H:0x0 D:0x2 P:0x0

H means host status

D means device status

P means plugin status

Sense data: 0x4 0x44 0x0

0x4 means Sense Key

0x44 means Additional Sense Code

0x0 means ASC Qualifier

Before decoding, you should translate each code to NNNh notation: 0xNNN = NNNh. For example, 0x7a = 7Ah and 0x77 = 77h.

The SCSI status code is easy to decode. You just need to change the format and look up the code at http://www.t10.org/lists/2status.htm.

In our example, H:0x0 D:0x2 P:0x0, host code 0x0 (00h) means the ESX host side is good, device code 0x2 (02h) means the device is not ready, and plugin status code 0x0 (00h) means the LUN plugin is good. (Clarification: device code 0x2 actually means “Check Condition”, not really “device is not ready”. I wrote it that way to make it easy to understand, but it looks confusing because “Check Condition” has a different meaning from “Device is not Ready”. Thanks to Tony for pointing that out.)

Sense data is a little more complicated. You have to refer to two links: http://www.t10.org/lists/2sensekey.htm and http://www.t10.org/lists/asc-num.txt.

In our example, 0x4 0x44 0x0, Sense Key 0x4 (4h) means HARDWARE ERROR, the Additional Sense Code is 0x44 (44h) and the ASC Qualifier is 0x0 (00h). Combine the two codes into 44h/00h, which means INTERNAL TARGET FAILURE.

Okay, now let’s put all the decoded language together:

The ESX host side is good, the device is not ready and the LUN plugin is good, because of HARDWARE ERROR: INTERNAL TARGET FAILURE
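The manual lookup can be sketched in a few lines of Python. This is only an illustration of the decoding steps, not a complete decoder: the tables hold a small excerpt of the T10 lists linked from this post, the function and table names are my own, and any code missing from the excerpt is reported as unknown.

```python
import re

# Small excerpts of the T10 tables; extend them from the t10.org lists as needed.
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY",
                 0x18: "RESERVATION CONFLICT", 0x28: "TASK SET FULL"}
SENSE_KEY = {0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
             0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
             0x6: "UNIT ATTENTION", 0xB: "ABORTED COMMAND"}
ASC_ASCQ = {(0x44, 0x00): "INTERNAL TARGET FAILURE"}

# Matches "H:0x0 D:0x2 P:0x0 ... sense data: 0x4 0x44 0x0" in a vmkernel line.
PATTERN = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)"
    r".*?sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)"
)


def t10(value: int) -> str:
    """Convert a number to the NNh notation used by the T10 lists."""
    return f"{value:02X}h"


def decode(line: str):
    """Decode the status and sense fields of one log line, or return None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    host, device, plugin, key, asc, ascq = (int(g, 16) for g in match.groups())
    return {
        "host": t10(host),
        "device": f"{t10(device)} {DEVICE_STATUS.get(device, 'unknown')}",
        "plugin": t10(plugin),
        "sense_key": SENSE_KEY.get(key, "unknown"),
        "asc_ascq": f"{t10(asc)}/{t10(ascq)}",
        "meaning": ASC_ASCQ.get((asc, ascq), "look up in asc-num.txt"),
    }


LINE = ('2014-02-11T07:18:20.541Z cpu8:425351)ScsiDeviceIO: 2331: '
        'Cmd(0x4124425bc700) 0x2a, CmdSN 0xd5 from world 602789 to dev '
        '"naa.514f0c5c11a00025" failed H:0x0 D:0x2 P:0x0 '
        'Valid sense data: 0x4 0x44 0x0')
print(decode(LINE))
```

The host and plugin codes are only converted to NNh notation here, because their meanings come from the VMware KB articles rather than from the T10 lists.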

Actually I pulled this code from an fnic firmware/driver incompatibility case. Does it make your troubleshooting easier? :)

You can also refer to the following links for more detail:

Understanding SCSI device/target NMP errors/conditions in ESX/ESXi 4.x and ESXi 5.x

Understanding SCSI host-side NMP errors/conditions in ESX 4.x and ESXi 5.x

Interpreting SCSI sense codes in VMware ESXi and ESX


vHBAs and other PCI devices may stop responding in ESXi 5.x when using Interrupt Remapping

Your vHBAs or other PCI devices may stop responding in ESXi 5.x when the Interrupt Remapping feature is in use.

This issue only affects UCS blade BIOS version 1.4(3c); it has been fixed in 1.4(3j).

Please refer to http://kb.vmware.com/kb/1030265 to see how to disable the Interrupt Remapping feature in ESXi 5.x.

Also refer to https://tools.cisco.com/bugsearch/bug/CSCty96722.

IPv6 link in NetApp SMVI backup log

NetApp Virtual Storage Console is my favorite tool to manage and back up data on NetApp-attached ESXi hosts; it offers a lot of benefits for protecting VM data more efficiently.

The installation is pretty simple and requires very few resources; you can even install it on a multi-role virtual machine. But the first headache may be the backup log.

The default report URL in NetApp Virtual Storage Console is an IPv6 address. You have to add a parameter to the wrapper.conf file manually. Here are the detailed steps:

This procedure has to be repeated after NetApp Virtual Storage Console is upgraded.

1) Shut down the SMVI server (via Windows service).

2) Open wrapper.conf in C:\Program Files\NetApp\Virtual Storage Console\smvi\server\etc

3) Locate the section:

Java Additional Parameters

4) Add the following line:


5) Start the SMVI server (via Windows service).

How to find which ESXi 5.1 host lock the VM

Sometimes a VM shows as unknown, invalid or orphaned in vCenter Server, but it is still running somewhere. Some technical support engineers may ask you to reboot the VM or ESXi host, or to search each host one by one.

Disclaimer: this article only applies to ESXi 5.1; I haven’t tested other versions.

This is the easiest way to find out which host is locking the VM:

  1. SSH to any host in the cluster.
  2. Go to the VM folder. (Usually it’s under /vmfs/volumes/… )
  3. Run the command:  vmkfstools -D "vmx file name" | grep owner
  4. It returns a line similar to this:
    gen 483, mode 1, owner 529495c4-0b6a7d90-a0f3-0025b541a0dc mtime 211436
  5. The last segment of the owner value (0025b541a0dc in this example) is the MAC address of the owner host.
  6. Run the command esxcfg-nics -l on each ESXi host to see which host matches this MAC address.

Then you need to remove the invalid VM from the inventory, log in to the owner host with the vSphere Client and import the VMX file again.

This procedure can save a lot of time finding the real owner host, but it still takes time in a large cluster. Want it faster? It’s possible!

After you find the MAC address, change it to the regular format, like xx:xx:xx:xx:xx:xx.
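The reformatting is mechanical, so here is a minimal Python sketch of it, assuming the owner value always has the four-segment shape shown in the vmkfstools output above and that its last segment is the MAC address. The function name is my own.

```python
import re


def owner_mac(owner_line: str) -> str:
    """Extract the last owner segment and format it as xx:xx:xx:xx:xx:xx."""
    match = re.search(
        r"owner\s+[0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-([0-9a-f]{12})",
        owner_line,
        re.IGNORECASE,
    )
    if match is None:
        raise ValueError("no owner field found")
    raw = match.group(1).lower()
    # Insert a colon after every two hex digits.
    return ":".join(raw[i:i + 2] for i in range(0, 12, 2))


line = "gen 483, mode 1, owner 529495c4-0b6a7d90-a0f3-0025b541a0dc mtime 211436"
print(owner_mac(line))
```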

Log on to the vMA console and connect to vCenter Server with the command: vifptarget -s vCenter Server Name

Run the command: esxcfg-nics -h ESXi host name -l | grep xx:xx:xx:xx:xx:xx

Even faster?

Try using Excel to build the command list with all ESXi host names, then paste it into the console.
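If you don’t have Excel at hand, the same command list can be built with a few lines of Python. This is just a sketch of the idea: the host names are hypothetical, and the command text is the one from this post with the host name and MAC address filled in.

```python
def build_commands(hosts, mac):
    """Return one esxcfg-nics lookup command per ESXi host."""
    return [f"esxcfg-nics -h {host} -l | grep {mac}" for host in hosts]


# Hypothetical host names; replace them with the hosts in your cluster.
hosts = ["esx01.example.local", "esx02.example.local"]
for command in build_commands(hosts, "00:25:b5:41:a0:dc"):
    print(command)
```

Paste the printed lines into the vMA console; the host that returns a matching line is the owner.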

How to retrieve or set Path Selection Policy by vCLI

First of all, this article has nothing to do with PowerCLI. :-)

You probably know how to set the Path Selection Policy (PSP) in the vSphere Client, but how would you set it on 100 LUNs manually? A few scripts can make your life easier.

How to retrieve the LUN Path Selection Policy:

esxcli storage nmp device list | egrep "Device Display Name|Path Selection Policy:"

You will get output like this:

Device Display Name: DGC Fibre Channel Disk (naa.600601602a102e0002cdf2a2596be211)
Path Selection Policy: VMW_PSP_RR

This script helps you identify which policy each LUN uses. Here you can read what a Path Selection Policy is.
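With many LUNs, the paired lines are easier to read when grouped by policy. Here is a minimal Python sketch that does that for saved output of the egrep command; the function name and the second sample device are hypothetical, and it assumes each Device Display Name line is followed by its Path Selection Policy line.

```python
import re
from collections import defaultdict


def group_by_policy(output: str) -> dict:
    """Group device IDs by their Path Selection Policy."""
    groups = defaultdict(list)
    device = None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Device Display Name:"):
            # The device ID is the text in the trailing parentheses.
            match = re.search(r"\(([^()]+)\)\s*$", line)
            device = match.group(1) if match else line.split(":", 1)[1].strip()
        elif line.startswith("Path Selection Policy:") and device is not None:
            groups[line.split(":", 1)[1].strip()].append(device)
            device = None
    return dict(groups)


# First pair is from this post; the second device is made up for illustration.
SAMPLE = """\
Device Display Name: DGC Fibre Channel Disk (naa.600601602a102e0002cdf2a2596be211)
Path Selection Policy: VMW_PSP_RR
Device Display Name: Hypothetical Local Disk (naa.0000000000000000000000000000beef)
Path Selection Policy: VMW_PSP_FIXED
"""
for policy, devices in group_by_policy(SAMPLE).items():
    print(policy, devices)
```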

Next, let’s see how to modify the PSP of these LUNs by script:
First, run the following script to print out a command for each LUN. Don’t forget to change VMW_PSP_RR to the PSP you prefer.

esxcli storage nmp device list | awk '/^naa/{print "esxcli storage nmp device set -d "$0" -P VMW_PSP_RR" };'

Then copy the output to Notepad and remove the local disks. For example, naa.600508b1001c1e987243838af4c67891 in the output below is a local HP disk.

esxcli storage nmp device set -d naa.600601602a102e008896dda81b88e211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e008861b28a596be211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e00560d8488b456e211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e00c4cd2600b456e211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600508b1001c1e987243838af4c67891 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e008c96dda81b88e211 -P VMW_PSP_RR

Last, copy the modified text back into the PuTTY session; it will run the commands one by one.
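The Notepad step can also be scripted. Below is a minimal Python sketch that builds the same commands from saved esxcli storage nmp device list output and skips local disks by NAA prefix. The naa.600508b1 prefix is an assumption drawn from the one entry in the list above that does not share the 600601602a102e prefix of the array LUNs; check your own output before relying on it. The function name is hypothetical.

```python
def psp_commands(device_list_output, policy="VMW_PSP_RR",
                 skip_prefixes=("naa.600508b1",)):
    """Build one 'device set' command per LUN, skipping local disks."""
    commands = []
    for line in device_list_output.splitlines():
        # Device header lines start at column 0 with the NAA identifier.
        if not line.startswith("naa."):
            continue
        device = line.strip()
        if device.startswith(tuple(skip_prefixes)):
            continue  # assumed local disk, leave its policy alone
        commands.append(f"esxcli storage nmp device set -d {device} -P {policy}")
    return commands


SAMPLE = """\
naa.600601602a102e008896dda81b88e211
   Path Selection Policy: VMW_PSP_FIXED
naa.600508b1001c1e987243838af4c67891
   Path Selection Policy: VMW_PSP_FIXED
"""
for command in psp_commands(SAMPLE):
    print(command)
```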

How to retrieve RDM information by PowerCLI

I worked on moving the RDM LUNs of Microsoft Cluster virtual machines from one iGroup to another. To make the move safe, we had to record the RDM LUN information before the migration.

We had two VMs with almost 20 RDM LUNs, so collecting the information manually would have been pretty time consuming. I used the following script to retrieve it:

$RDMinfo = Get-HardDisk -VM "virtual machine name" -DiskType rawPhysical

$RDMinfo | select Parent,Filename,CapacityGB,ScsiCanonicalName,Name


Unknown status of Hardware Acceleration

When I read the VMware documents, I found a cool feature called Hardware Acceleration in the storage guide. That reminded me of an outage about one year ago: our NetApp filer crashed due to a motherboard problem, part of the datastores failed, and we had to move virtual machines from that filer to another. We noticed the Storage vMotion performance was pretty high; the data move took about half the time of a regular Storage vMotion. That’s the advantage of Hardware Acceleration.

My first task this year is to standardize the virtualization environment. I found an interesting problem when I checked the Hardware Acceleration part: the same LUNs showed different statuses on different ESXi 5 hosts in a cluster. Some hosts showed Hardware Acceleration as enabled, and some showed Unknown.

The storage is an EMC CLARiiON CX series array with ALUA enabled. I found that the working hosts had the VAAI filter attached, while the non-working hosts had nothing.

Figure 1   Working Host

Figure 2   Non-working Host

ESXi 5 automatically attaches different filters according to the LUN properties, so this issue indicated that the LUN properties were different on different ESXi 5 hosts. That’s a storage layer issue. After troubleshooting with EMC, we found that the Failover Mode of the LUNs was different on each host; the Failover Mode should be 4 instead of the default 1.

Please be aware that storage activity on the affected host will be interrupted when you change the Failover Mode, so put the host in maintenance mode first.

Regarding Failover Mode, I had a discussion with a storage engineer. He told me that different storage vendors have different names for “Failover Mode”, and some vendors may ask you to choose the OS type of the target machine instead. For EMC there are 5 modes; please refer to page 10 of the EMC document.

Unable to find new lun when you try to extend vmfs datastore

You may have seen this rare problem: your storage team allocates a new LUN to an ESXi 5.0 host, and the LUN is visible in the Add Storage screen but invisible in the extend datastore screen.

Add new storage screen:

Add storage screen

Increase datastore capacity:

Increase datastore capacity screen


That’s because the datastore or LUN is connected to multiple ESXi / ESX hosts that run different versions. Please make sure the storage is connected to hosts running the same version of ESXi / ESX.

ALUA Devices on ESXi 5.0

You see the keyword ALUA frequently if you read VMware storage documents, so what exactly is ALUA? How does it show up in ESXi 5.0? What’s the advantage of ALUA? I certainly had these questions. You?

First of all, ALUA is short for “Asymmetric Logical Unit Access”, as you probably already know :). ALUA is a SCSI standard. It’s not supported by all storage arrays, but I think most large companies should have an array that supports ALUA. There are various articles that try to explain what ALUA is. I’m not a storage expert; I just want to give my interpretation. If you don’t agree or have questions, please leave me a comment. I’m happy to talk about it.

Generally, a storage array (Active-Active) has two controllers (SPA, SPB), and each controller has two paths (SPA0, SPA1, SPB0, SPB1). Data travels between ESX and the storage array through these paths. Older ESX versions could only use the FIXED path selection policy, which sends data through a single path. Here is a potential problem: say you have 10 ESX hosts in a cluster that mount a LUN, half of the hosts use SPA0 and the other half use SPB0. That would cause path thrashing, since the first half pulls the LUN to storage controller SPA and the other half pulls the LUN back to storage controller SPB, over and over again. Another scenario is that the LUN is owned by SPA but some ESX hosts send data through SPB for some reason.

Whatever caused the path thrashing, I guess that’s why I saw the following error in vmkernel.log:

2013-01-15T05:36:33.831Z cpu14:4110)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.60a9800064676a2d6b5a6c33474b5138" state in doubt; requested fast path state update...

ALUA makes it possible to avoid the frequent switching between storage controllers. ALUA provides two types of paths: Optimized and Non-Optimized. Optimized means data travels between the ESX host and the storage array through the owning controller. Non-Optimized means data travels through the non-owning controller without switching controllers: the non-owning controller receives the data, passes it to the owning controller internally, and the owning controller then performs the underlying operation. As you can see, this adds latency.

So how do we know whether an ESXi 5.0 host is running properly with ALUA? Let me show you a command:

esxcli storage nmp device list -d NAA ID

The output looks like this:

   Device Display Name: DGC Fibre Channel Disk (naa.600601602c802900146f4f294d8ee011)
   Storage Array Type: VMW_SATP_ALUA_CX
   Storage Array Type Device Config: {navireg=on, ipfilter=on}{implicit_support=on;explicit_support=on; explicit_allow=on;alua_followover=on;{TPG_id=1,TPG_state=AO}{TPG_id=2,TPG_state=ANO}}
   Path Selection Policy: VMW_PSP_FIXED
   Path Selection Policy Device Config: {preferred=vmhba2:C0:T1:L14;current=vmhba2:C0:T1:L14}
  Path Selection Policy Device Custom Config:
  Working Paths: vmhba2:C0:T1:L14

Okay, let’s focus on the Storage Array Type Device Config line. It actually consists of three sections:

{navireg=on, ipfilter=on}

navireg means whether or not to register the device with Navisphere automatically.

ipfilter means whether or not to STOP sending the host name for Navisphere registration.

implicit_support means whether or not the device TPG state is managed by the storage device itself.

explicit_support means whether or not the device TPG state can be managed by the ESXi host.

explicit_allow means whether or not the user allows the SATP to use its explicit ALUA capability.

alua_followover means whether or not the ESX host follows an alternative path instead of the preferred path.

TPG means Target Port Group; different groups of paths have different states, like Optimized, Non-Optimized, Standby, etc.

AO means Active/Optimized path routing

ANO means Active/Non-Optimized path routing
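To compare this setting across many LUNs or hosts, the config string can be picked apart with two regular expressions. This is a minimal Python sketch that uses the Storage Array Type Device Config value from the output above; the function names are my own, and it assumes the flags are always written as name=on or name=off.

```python
import re


def parse_flags(config: str) -> dict:
    """Return the on/off options, e.g. {'navireg': 'on', ...}."""
    return dict(re.findall(r"(\w+)=(on|off)\b", config))


def parse_tpg_states(config: str) -> dict:
    """Return {TPG id: state}, e.g. {1: 'AO', 2: 'ANO'}."""
    pairs = re.findall(r"TPG_id=(\d+),\s*TPG_state=(\w+)", config)
    return {int(tpg_id): state for tpg_id, state in pairs}


CONFIG = ("{navireg=on, ipfilter=on}{implicit_support=on;explicit_support=on; "
          "explicit_allow=on;alua_followover=on;"
          "{TPG_id=1,TPG_state=AO}{TPG_id=2,TPG_state=ANO}}")
print(parse_flags(CONFIG))
print(parse_tpg_states(CONFIG))
```

A LUN that reports different TPG states on different hosts is a good candidate for the path thrashing described earlier.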

Move multiple datastores to a folder

We are moving virtual machines from old storage to new datastores today. There are a lot of old datastores that need to be removed after the migration, so for safety I move all the old datastores to a folder and then run the decommission process.

There are more than 60 datastores, and the vSphere Client does not allow moving them all at one time. Here is a PowerCLI script that can help move multiple datastores to a folder.

Note: Please make sure your folder name is unique.

When you create datastores.txt, please make sure the first line is “Name”, followed by one datastore name on each line.


Extend ATS capability VMFS5 datastore maybe failed

A lot of storage arrays support hardware acceleration, which offloads some storage operations from the ESXi 5.x host to the storage filer. The feature can significantly improve performance during cloning, vMotion, copying, etc.

Different storage devices may support different hardware acceleration features. Block devices have block zero, full copy, hardware assisted locking and thin provisioning; NAS devices have extended stats, file cloning, large scale native SS, native SS to LC and space reserve.

You can find the detailed information in this article.

For block storage, we initially create a VMFS5 datastore on one LUN, and more LUNs (extents) may be added to the datastore when free space is low. Please make sure that all extents of a VMFS5 datastore have the same ATS capability, either all supported or all unsupported. You may see the error message “Operation failed, unable to add extent to filesystem” when you add a non-ATS extent to an ATS-enabled VMFS5 datastore.

How to know if a LUN supports ATS?

You can log in to the ESXi 5.x host via SSH and use this command to see the supported features of a LUN.

esxcli storage core device vaai status get -d=device id
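Before adding an extent, it helps to compare the ATS status of the existing and the new LUN. Here is a minimal Python sketch that does this on saved output of the command above. It assumes the output lists each device ID on an unindented line followed by indented fields such as “ATS Status: supported”; that layout, the function names and the device IDs in the sample are assumptions for illustration, so compare them with what your host actually prints.

```python
def parse_vaai_status(output: str) -> dict:
    """Return {device: {field: value}} from VAAI status style output."""
    statuses = {}
    current = None
    for line in output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: assumed to be the device identifier.
            current = line.strip()
            statuses[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            statuses[current][key.strip()] = value.strip()
    return statuses


def same_ats_capability(statuses: dict, extents) -> bool:
    """True when every listed extent reports the same ATS status."""
    values = {statuses[extent].get("ATS Status") for extent in extents}
    return len(values) == 1


# Hypothetical device IDs and output, in the assumed layout.
SAMPLE = """\
naa.aaaa
   ATS Status: supported
   Clone Status: supported
naa.bbbb
   ATS Status: unsupported
   Clone Status: unsupported
"""
status = parse_vaai_status(SAMPLE)
print(same_ats_capability(status, ["naa.aaaa", "naa.bbbb"]))
```

If the check returns False for the current extent and the candidate LUN, expect the “unable to add extent to filesystem” error described above.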

What is ATS?

Atomic Test and Set (ATS) is a new SCSI locking method. It locks individual disk sectors instead of reserving the entire LUN. More detailed information is in this article.