How to retrieve or set Path Selection Policy with vCLI

First of all, this article has nothing to do with PowerCLI. 🙂

You probably know how to set the Path Selection Policy (PSP) in the vSphere Client, but how would you set up 100 LUNs manually? Here are some scripts that can make your life easier.

How to retrieve LUN Path Selection Policy:

esxcli storage nmp device list | egrep "Device Display Name|Path Selection Policy:"


You will get output like this:

Device Display Name: DGC Fibre Channel Disk (naa.600601602a102e0002cdf2a2596be211)
Path Selection Policy: VMW_PSP_RR


This command helps you identify which policy each LUN is using. Here is an explanation of what a Path Selection Policy is.
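If you are not sure which PSPs are available on the host, or which PSP is the default for each storage array type, you can check both from the same namespace. A minimal sketch, assuming you can run esxcli against the host:

# List the Path Selection Plugins available on this host
esxcli storage nmp psp list

# List the Storage Array Type Plugins and the default PSP each one uses
esxcli storage nmp satp list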

Next, let's see how to modify the PSP of these LUNs by script.
First, run the following command to print out a set command for each LUN; don't forget to change VMW_PSP_RR to the PSP you prefer.

esxcli storage nmp device list | awk '/^naa/ {print "esxcli storage nmp device set -d " $0 " -P VMW_PSP_RR"}'


Then copy the output to Notepad and remove any local disks; for example, the NAA ID starting with 600508b1 below indicates a local HP disk.

esxcli storage nmp device set -d naa.600601602a102e008896dda81b88e211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e008861b28a596be211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e00560d8488b456e211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e00c4cd2600b456e211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600508b1001c1e987243838af4c67891 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e008c96dda81b88e211 -P VMW_PSP_RR


Last, paste the modified text back into your PuTTY session and it will run the commands one by one.
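If every LUN claimed by a given SATP should use the same policy, another option is to change the default PSP for that SATP so newly claimed LUNs pick it up automatically. This is only a sketch: I'm assuming VMW_SATP_ALUA_CX is the SATP claiming your array, and devices that are already claimed keep their current PSP until they are reclaimed or the host is rebooted.

# Make VMW_PSP_RR the default PSP for everything claimed by VMW_SATP_ALUA_CX
esxcli storage nmp satp set --satp=VMW_SATP_ALUA_CX --default-psp=VMW_PSP_RR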

How to retrieve RDM information with PowerCLI

I worked on moving the RDM LUNs of a Microsoft Cluster virtual machine from one igroup to another. To make sure the move was safe, we needed to record the RDM LUN information before the migration.

We had two VMs with almost 20 RDM LUNs, and it's pretty time-consuming to gather the information manually, so I used the following script to retrieve it:

$RDMinfo = Get-HardDisk -VM "virtual machine name" -DiskType RawPhysical

$RDMinfo | select Parent,Filename,CapacityGB,ScsiCanonicalName,Name

 

Port Groups Not Working with VLAN Tags on a Cisco Switch

A few weeks ago I tried to standardize the networking of a cluster. There were 4 VLANs for production virtual machines, and I bound the VLANs to one virtual switch that had 4 physical vmnics.

Then I created 4 port groups with different VLAN IDs, but for some reason the virtual machines were unreachable via some vmnics. The network team verified that the port channel was good.
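To rule out a mistake on the vSphere side first, it's easy to double-check the VLAN ID assigned to each port group from the ESXi shell. A minimal sketch, assuming standard vSwitches as in my setup:

# Show vSwitches, their uplink vmnics, and the VLAN ID of each port group
esxcfg-vswitch -l

# The same port group / VLAN information via esxcli on ESXi 5.x
esxcli network vswitch standard portgroup list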

I tried several ESXi 5.0 hosts in the cluster and all had the same problem. Finally we found it was a Cisco switch bug... you can find detailed information and the workaround here.

HP patching error after upgrading to Update Manager 5.1

If you installed "HP ESXi 5.0 Complete Bundle Update 1.6" via Update Manager 5.0, you will see the storage and power sub-systems showing warnings on HP servers. That's because some parameters show NULL in the updated HP SIM provider.

Example:

HPVC_SAController.Name="vmwControllerHPSA1",CreationClassName="HPVC_SAController"
 CreationClassName = HPVC_SAController
 Name = vmwControllerHPSA1
 PowerManagementCapabilities = (NULL)
 ResetCapability = (NULL)
 OtherDedicatedDescriptions = (NULL)
 Dedicated = (NULL)
 NameFormat = (NULL)
 TransitioningToState = 12
 AvailableRequestedStates = (NULL)
 TimeOfLastStateChange = (NULL)
 EnabledDefault = 2
 RequestedState = 12

I think HP has recalled the bundle. You may see error messages similar to the ones below if you already downloaded the patch and then upgraded to Update Manager 5.1.

VMware vSphere Update Manager had an unknown error. Check the events and log files for details.

After upgrading to Update Manager 5.1:
Cannot download software packages from patch source. Check the events and the Update Manager log for download details.

After removing the "data" folder in Update Manager 5.1:
There is no way to avoid the error message except to filter your baseline to exclude HP patches.

Another blogger also described the same situation here.

Unknown status of Hardware Acceleration

When I was reading the VMware documents, I found a cool feature called Hardware Acceleration in the storage guide. It reminded me of an outage about one year ago: our NetApp filer crashed due to a motherboard problem, part of the datastores failed, and we had to move virtual machines from that filer to others. We noticed the Storage vMotion performance was pretty high; the data moved in about half the time of a regular Storage vMotion. That's the advantage of Hardware Acceleration.

The first task of this year is to standardize the virtualization environment. I found an interesting problem when I checked the Hardware Acceleration part: the same LUNs showed different statuses on different ESXi 5 hosts of a cluster; some of the hosts showed Hardware Acceleration as enabled, and some showed Unknown.

The storage is an EMC CLARiiON CX series array with ALUA enabled. I found that the working hosts had the VAAI filter attached, while the non-working hosts had nothing.

Figure 1: Working host

Figure 2: Non-working host
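You can also compare the hosts from the command line instead of the screenshots above. A quick check, with <NAA ID> standing in for one of your LUN IDs, is to look at the device details (including any attached filters) and the VAAI primitive status the host reports:

# Show device details, including the VAAI status and any attached filters
esxcli storage core device list -d <NAA ID>

# Show which VAAI primitives this host reports for the device
esxcli storage core device vaai status get -d <NAA ID>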

ESXi 5 automatically attaches different filters according to the LUN properties, so this issue indicated that the LUN properties were different on different ESXi 5 hosts, which made it a storage-layer issue. After troubleshooting with EMC, we found the Failover Mode of the LUNs was different for each host; the Failover Mode should be 4 instead of the default 1.
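A quick way to spot the mismatch from the ESXi side, at least on CLARiiON arrays with the default claim rules, is to compare which SATP claimed the LUN on each host: with Failover Mode 4 (ALUA) the device is normally claimed by VMW_SATP_ALUA_CX, while with the default mode 1 you typically see VMW_SATP_CX.

# Check which Storage Array Type Plugin claimed the LUN on this host
esxcli storage nmp device list -d <NAA ID> | grep "Storage Array Type:"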

Please be aware that storage activity on the particular host will be interrupted when you change the Failover Mode, so put the host in maintenance mode first.

Regarding Failover Mode, I had a discussion with a storage engineer. He told me that different storage vendors have different names for "Failover Mode"; some vendors may ask you to choose the OS type of the target machine. For EMC there are 5 modes; please refer to page 10 of the EMC document.

How to remove multiple snapshots with PowerCLI

My SMVI backup job crashed a few days ago, and the stupid application generated a lot of snapshots for the virtual machines!!! Hundreds of them!

I really didn't want to remove them one by one! This is what I used to clean up the snapshots:

Get-VM | Get-Snapshot -Name smvi* | Remove-Snapshot

I used the wildcard smvi*, which matches all snapshots whose names start with smvi.

Unable to find a new LUN when you try to extend a VMFS datastore

You may have seen this rare problem: your storage team allocates a new LUN to an ESXi 5.0 host, and the LUN is visible in the Add Storage screen but invisible in the Increase Datastore Capacity screen.

Figure 1: Add Storage screen

Figure 2: Increase Datastore Capacity screen

That's because the datastore/LUN is connected to multiple ESXi/ESX hosts running different versions; please make sure the storage is connected only to hosts running the same ESXi/ESX version.

ALUA Devices on ESXi 5.0

You may see the keyword ALUA frequently if you read VMware storage documents, so what exactly is ALUA? How does it show up in ESXi 5.0? What's the advantage of ALUA? I certainly had these questions; do you?

First of all, ALUA is short for "Asymmetric Logical Unit Access", as you probably already know 🙂. ALUA is a SCSI standard; it's not supported by all storage arrays, but I think most large companies should have ALUA-capable arrays. There are various articles that try to explain what ALUA is; I'm not a storage expert, I just want to give my interpretation. If you don't agree or have questions about it, please leave me a comment, I'm willing to talk about it.

Generally, an active-active storage array has two controllers (SPA, SPB), and each controller has two ports (SPA0, SPA1, SPB0, SPB1). Data is transmitted between the ESX host and the storage array through these paths. Older ESX versions could only use the FIXED path selection policy to transmit data through a single path. Here is a potential problem: for example, you have 10 ESX hosts in a cluster mounting a LUN, half of the hosts use SPA0, and the other half use SPB0. This would cause path thrashing, since the first half of the hosts pull the LUN to storage controller SPA and the other half pull it back to storage controller SPB, over and over again. Another scenario is that the LUN is owned by SPA but some ESX hosts transmit data through SPB for some reason.

Whatever caused the path thrashing, I guess that's why I could see the following error in vmkernel.log:

2013-01-15T05:36:33.831Z cpu14:4110)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.60a9800064676a2d6b5a6c33474b5138" state in doubt; requested fast path state update...
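If you want to check whether a host is logging these warnings, a trivial check (assuming the default log location on ESXi 5.x) is to grep the current vmkernel log:

# Look for path-state warnings in the current vmkernel log
grep "state in doubt" /var/log/vmkernel.log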

ALUA gives us the ability to avoid this frequent switching between storage controllers. ALUA provides two types of paths: Optimized and Non-Optimized. Optimized means data is transmitted between the ESX host and the storage through the owning controller; Non-Optimized means data is transmitted through the non-owning controller without switching controller ownership. A Non-Optimized path sends data to the non-owning controller, which then forwards it internally to the owning controller before performing the underlying operation; as you can see, that adds latency.

So how do we know whether an ESXi 5.0 host is running properly with ALUA? Let me show you a command:

esxcli storage nmp device list -d <NAA ID>

The output looks like this:

naa.600601602c802900146f4f294d8ee011
   Device Display Name: DGC Fibre Channel Disk (naa.600601602c802900146f4f294d8ee011)
   Storage Array Type: VMW_SATP_ALUA_CX
   Storage Array Type Device Config: {navireg=on, ipfilter=on}{implicit_support=on;explicit_support=on; explicit_allow=on;alua_followover=on;{TPG_id=1,TPG_state=AO}{TPG_id=2,TPG_state=ANO}}
   Path Selection Policy: VMW_PSP_FIXED
   Path Selection Policy Device Config: {preferred=vmhba2:C0:T1:L14;current=vmhba2:C0:T1:L14}
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba2:C0:T1:L14

Okay, let's focus on the Storage Array Type Device Config line; it's actually three sections:

{navireg=on, ipfilter=on}
{implicit_support=on;explicit_support=on;explicit_allow=on;alua_followover=on;
{TPG_id=1,TPG_state=AO}{TPG_id=2,TPG_state=ANO}
}

navireg indicates whether or not the device is registered with Navisphere automatically.

ipfilter indicates whether or not to STOP sending the host name for Navisphere registration.

implicit_support indicates whether or not the device TPG state is managed by the storage device itself.

explicit_support indicates whether or not the device TPG state can be managed by the ESXi host.

explicit_allow indicates whether or not the user allows the SATP to use its explicit ALUA capability.

alua_followover indicates whether or not the ESX host follows the TPG state changes made by the array instead of reverting to its own preferred path, which helps avoid path thrashing.

TPG means Target Port Group; each group of paths has its own state, such as Optimized, Non-Optimized, Standby, etc.

AO means Active/Optimized path routing

ANO means Active/Non-Optimized path routing
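To see which individual paths fall into the AO and ANO groups, you can also list the paths for the device; on the hosts I checked, the Group State field shows "active" for optimized paths and "active unoptimized" for non-optimized ones:

# List every path to the device and its ALUA group state
esxcli storage nmp path list -d <NAA ID>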

vMotion fails with the error: A general system error occurred. Invalid fault

The vSphere Client popped up the following error when I put some ESXi 5.0 hosts into maintenance mode.

A general system error occurred. Invalid fault

That message is really no help for troubleshooting. I found a KB article on the VMware website, but it didn't match my case.

My virtual machines were intact; I could change settings, remove them from inventory, or power the boxes on and off, so what was the issue?

I found the following message in hostd.log:

2013-01-18T01:18:10.177Z [39489B90 info 'Default' opID=DDBEEEE7-0000023A-78] File path provided /vmfs/volumes/4fef9740-0b0c0cee-c1a4-e8393521ff62/VM-01 does not exist or underlying datastore is inaccessible: /vmfs/volumes/4fef9740-0b0c0cee-c1a4-e8393521ff62/VM-01

I also found this message in vmware.log:

2013-01-18T01:19:41.966Z| vmx| Migrate_SetFailure: Timed out waiting for migration start request.

The logs indicate that ESXi cannot identify the location of the VM configuration file, which means ESXi doesn't know the IP address family of the VM and is also unable to allocate memory on the target host.

But my datastore was accessible and I could browse its contents, so I think the only explanation is that the ESXi host was still using stale information about the datastore; a rescan fixed the problem.
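In my case a rescan from the vSphere Client was enough, but the same thing can be done from the ESXi shell. A minimal sketch, assuming you have shell access to the host:

# Rescan all storage adapters for device changes
esxcli storage core adapter rescan --all

# Refresh the list of VMFS volumes
vmkfstools -V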