Category: English

English version of my posts.

The number of heartbeat datastores for host is 0, which is less than required: 2
Today I saw this error message on an ESXi 5.0 host:
```
The number of heartbeat datastores for host is 0, which is less than required: 2
```
No VM is placed on the host by DRS or HA. The VMware KB gives a solution, but it is too complicated.

Reconfiguring HA fixes the problem.

Right-click the host -> click Reconfigure for vSphere HA -> wait for the HA configuration to complete.
May 6, 2013
No permission to login to vCenter Server 5.1
Today we P2V'd a vCenter Server. I re-added the identity source for some reason, but I didn’t modify any existing domain group or ACL.
After a while I got an interesting case: users reported “No permission to login to vCenter Server 5.1” when using the vSphere Client.
I looked into the vpxd.log of the vCenter Server, which showed:
```
2013-05-01T11:08:01.399-05:00 [09108 error '[SSO]' opID=6e704a51] [UserDirectorySso] AcquireToken InvalidCredentialsException: Authentication failed: Authentication failed

2013-05-01T11:08:01.399-05:00 [08644 error 'authvpxdUser' opID=5469f71e] Failed to authenticate user <xxxx>
```
I was not 100% sure that the log was related to the real problem, but it indicated that something was wrong with the authentication components.
After comparing the working SSO with the faulty SSO, I noticed that Domain Alias was blank on the faulty SSO:

Then I added a domain group on the faulty vCenter Server and compared it with the same group on the working vCenter Server. The format was different:
Working SSO: CONTOSO\TEST-GROUP
Faulty SSO: CONTOSO.COM\TEST-GROUP

Okay… now I know why user logins failed. The identity source had a Domain Alias configured before I removed it on the faulty SSO. I then added the identity source back without a Domain Alias, so vCenter Server used the domain name as the default prefix for domain groups, and the original domain group format (CONTOSO\xxxx) could no longer be identified by SSO.

So I deleted the identity source and added the same source with a Domain Alias. Problem fixed…
May 2, 2013
How to retrieve or set Path Selection Policy by vCLI
First of all, this article has nothing to do with PowerCLI. 🙂

You probably know how to set the Path Selection Policy (PSP) in the vSphere Client, but what if you have to set up 100 LUNs manually? A few one-liners can make your life easier.

How to retrieve LUN Path Selection Policy:

esxcli storage nmp device list | egrep "Device Display Name|Path Selection Policy:"

You will get output like this:

Device Display Name: DGC Fibre Channel Disk (naa.600601602a102e0002cdf2a2596be211)
Path Selection Policy: VMW_PSP_RR

This command helps you identify which policy each LUN uses. An explanation of what a Path Selection Policy is can be found here.
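If you prefer one line per device, the two matched lines can be paired up with awk. This is only a minimal sketch: the function name `psp_summary` is made up for illustration, and it assumes the device list output has the same layout as the sample above (device ID at the start of a line, indented detail lines below it).

```shell
#!/bin/sh
# psp_summary (hypothetical helper): read `esxcli storage nmp device list`
# output on stdin and print "<device id> <path selection policy>" per device.
psp_summary() {
    awk '
        /^naa\./ { dev = $1 }                        # device ID line starts at column 1
        /Path Selection Policy:/ { print dev, $NF }  # policy name is the last field
    '
}

# On an ESXi host (not executed here):
#   esxcli storage nmp device list | psp_summary
```

The pattern includes the trailing colon, so the "Path Selection Policy Device Config" lines are not picked up.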

Next, let’s see how to modify the PSP of these LUNs by script:
First, run the following command to print out a set command for each LUN. Don’t forget to change VMW_PSP_RR to the PSP you prefer.
```
esxcli storage nmp device list | awk '/^naa/{print "esxcli storage nmp device set -d "$0" -P VMW_PSP_RR" };'
```
Then copy the output to Notepad and remove the local disks. In the example below, naa.600508b1001c1e987243838af4c67891 is a local HP disk.
```
esxcli storage nmp device set -d naa.600601602a102e008896dda81b88e211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e008861b28a596be211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e00560d8488b456e211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e00c4cd2600b456e211 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600508b1001c1e987243838af4c67891 -P VMW_PSP_RR
esxcli storage nmp device set -d naa.600601602a102e008c96dda81b88e211 -P VMW_PSP_RR
```
Last, paste the modified text back into the PuTTY session; it will run the commands one by one.
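The copy-and-edit step can also be scripted. The sketch below is just one way to do it, under assumptions: the helper name `gen_psp_cmds` is hypothetical, and it assumes local disks can be recognised by an NAA prefix (for the list above that would be `naa.600508b1`, the prefix of the one entry that differs from the others). Check the prefix on your own hosts before relying on it.

```shell
#!/bin/sh
# gen_psp_cmds (hypothetical helper): read `esxcli storage nmp device list`
# output on stdin and print one "set" command per naa device, skipping any
# device whose ID starts with the given prefix (assumed to mark local disks).
#   $1 = PSP to apply, e.g. VMW_PSP_RR
#   $2 = NAA prefix to skip, e.g. naa.600508b1 (leave empty to skip nothing)
gen_psp_cmds() {
    awk -v psp="$1" -v skip="$2" '
        /^naa\./ {
            if (skip != "" && index($1, skip) == 1) next   # local disk: leave it alone
            print "esxcli storage nmp device set -d " $1 " -P " psp
        }
    '
}

# Review the generated commands first, then pipe them to sh (not executed here):
#   esxcli storage nmp device list | gen_psp_cmds VMW_PSP_RR naa.600508b1
#   esxcli storage nmp device list | gen_psp_cmds VMW_PSP_RR naa.600508b1 | sh
```

Printing the commands before running them keeps the same safety check as the Notepad step: you still get to look at every device before its policy changes.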
April 25, 2013
How to retrieve RDM information by PowerCLI
I worked on moving the RDM LUNs of Microsoft Cluster virtual machines from one iGroup to another. To make sure the move was safe, we needed to record the RDM LUN information before the migration.

We had two VMs with almost 20 RDM LUNs, so collecting the information manually would have been pretty time consuming. I used the following script to retrieve it:
```
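# Replace "virtual machine name" with the name of your VM
$RDMinfo = Get-HardDisk -VM "virtual machine name" -DiskType rawPhysical

# Show the fields worth recording before the migration
$RDMinfo | select Parent,Filename,CapacityGB,ScsiCanonicalName,Name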
```
April 17, 2013
Port Groups Do Not Work with VLAN Tag on Cisco Switch

A few weeks ago I tried to standardize the networking of a cluster. There were 4 VLANs for production virtual machines, and I bound them to one virtual switch that had 4 physical vmnics.

Then I created 4 port groups with different VLAN IDs, but for some reason the virtual machines were unreachable via some of the vmnics. The network team verified that the port channel was good.

I tried several ESXi 5.0 hosts in the cluster and all had the same problem. Finally we found that it was a Cisco switch bug… you can find detailed information and a workaround here.

March 25, 2013

HP patching error after upgrade to Update Manager 5.1

If you installed “HP ESXi 5.0 Complete Bundle Update 1.6” via Update Manager 5.0, you may see the storage and power sub-systems showing warnings on HP servers. That’s because some parameters show NULL in the updated HP SIM provider.

Example:

HPVC_SAController.Name="vmwControllerHPSA1",CreationClassName="HPVC_SAController"
 CreationClassName = HPVC_SAController
 Name = vmwControllerHPSA1
 PowerManagementCapabilities = (NULL)
 ResetCapability = (NULL)
 OtherDedicatedDescriptions = (NULL)
 Dedicated = (NULL)
 NameFormat = (NULL)
 TransitioningToState = 12
 AvailableRequestedStates = (NULL)
 TimeOfLastStateChange = (NULL)
 EnabledDefault = 2
 RequestedState = 12

I think HP has recalled the bundle. If you had already downloaded the patch and then upgraded to Update Manager 5.1, you may see error messages similar to the ones below.

VMware vSphere Update Manager had an unknown error. Check the events and log files for details.

After upgrading to Update Manager 5.1
Cannot download software packages from patch source. Check the events and the Update Manager log for download details.

After removing the "data" folder in Update Manager 5.1
There is no way to avoid the error message except filtering your baseline to exclude HP patches.

Another blogger has also described the same situation here.

February 16, 2013

Unknown status of Hardware Acceleration

While reading the VMware documents, I found a cool feature in the storage book: Hardware Acceleration. It reminded me of an outage about one year ago. Our NetApp filer crashed due to a motherboard problem, part of the datastores failed, and we had to move virtual machines from that filer to another. We noticed that Storage vMotion performance was pretty high: moving the data took about half the time of a regular Storage vMotion. That’s the advantage of Hardware Acceleration.

The first task of this year is to standardize the virtualization environment. I found an interesting problem when I checked the Hardware Acceleration part: the same LUNs showed a different status on different ESXi 5 hosts of a cluster. Some hosts showed Hardware Acceleration as enabled, and some showed Unknown.

The storage is an EMC CLARiiON CX series array with ALUA enabled. I found that the working hosts had the VAAI filter attached, while the non-working hosts had nothing.

Figure 1 Working Host

Figure 2 Non-working Host

ESXi 5 automatically attaches different filters according to LUN properties, so this issue indicated that the LUN properties differed between ESXi 5 hosts. That’s a storage layer issue. After troubleshooting with EMC, we found that the Failover Mode of the LUNs was different on each host; the Failover Mode should be 4 instead of the default 1.

Please be aware that storage activity on the particular host will be interrupted when you change the Failover Mode, so put the host in maintenance mode first.

Regarding Failover Mode, I had a discussion with a storage engineer. He told me that different storage vendors have different names for “Failover Mode”, and some vendors may ask you to choose the OS type of the target machine. For EMC there are 5 modes; please refer to page 10 of the EMC document

February 5, 2013
How to remove multiple snapshots by PowerCLI
My SMVI backup job crashed a few days ago, and the stupid application generated a lot of snapshots for the virtual machines!!! Hundreds of them!

I really didn’t want to remove them one by one! This is what I used to clean up the snapshots.
```
Get-VM | Get-Snapshot -Name smvi* | Remove-Snapshot
```
I used the wildcard smvi*, which matches every snapshot whose name starts with smvi.
January 29, 2013
Unable to find new lun when you try to extend vmfs datastore

You may run into this rare problem: your storage team allocates a new LUN to an ESXi 5.0 host, and the LUN is visible in the add new storage screen but invisible in the extend datastore screen.

Add new storage screen:

Increase datastore capacity:

That’s because the datastore and the LUN are connected to multiple ESXi / ESX hosts running different versions. Please make sure the storage is connected to hosts running the same version of ESXi / ESX.

January 24, 2013
ALUA Devices on ESXi 5.0
You may see the keyword ALUA frequently if you read VMware storage documents. So what exactly is ALUA? How is it reflected in ESXi 5.0? What’s the advantage of ALUA? I certainly had these questions. You?

First of all, ALUA is short for “Asymmetric Logical Unit Access”, as you probably already know 🙂. ALUA is a SCSI standard. It’s not supported by all storage arrays, but I think most large companies have an ALUA-capable array. Various articles have tried to explain what ALUA is. I’m not a storage expert, I just want to give my interpretation. If you disagree or have questions, please leave a comment; I’m willing to talk about it.

Generally, a storage array (Active-Active) has two controllers (SPA, SPB), and each controller has two paths (SPA0, SPA1, SPB0, SPB1). Data travels between ESX and the storage array through these paths. Older ESX versions could only use the FIXED path selection policy, which sends data through a single path. Here is a potential problem. Say you have 10 ESX hosts in a cluster that mount the same LUN, half of them using SPA0 and the other half using SPB0. That would cause path thrashing, since the first half pulls the LUN to storage controller SPA and the other half pulls it back to storage controller SPB, over and over again. Another scenario is a LUN owned by SPA while some ESX hosts send data through SPB for some reason.

Whatever caused the path thrashing, I guess that’s why I saw the following warning in vmkernel.log:
```
2013-01-15T05:36:33.831Z cpu14:4110)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.60a9800064676a2d6b5a6c33474b5138" state in doubt; requested fast path state update...
```
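To get a feel for how often this happens and on which LUNs, the warnings can be counted per device. This is a rough sketch, not an official tool: the helper name `count_state_in_doubt` is hypothetical, the log path in the usage comment is an assumption, and it only understands lines shaped like the one above.

```shell
#!/bin/sh
# count_state_in_doubt (hypothetical helper): read vmkernel.log lines on stdin
# and print "<count> <device id>" for each device named in a
# "state in doubt" NMP warning.
count_state_in_doubt() {
    grep 'state in doubt' |
        sed -n 's/.*NMP device "\([^"]*\)".*/\1/p' |   # keep only the quoted device ID
        sort | uniq -c |
        awk '{ print $1, $2 }'                         # strip the padding uniq adds
}

# Usage (log path assumed, not executed here):
#   count_state_in_doubt < /var/log/vmkernel.log
```

A device that shows up far more often than the others is a good first candidate when looking for path thrashing.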
ALUA makes it possible to avoid frequent switching between storage controllers. It defines two types of paths: Optimized and Non-Optimized. Optimized means data travels between the ESX host and the storage array through the owning controller. Non-Optimized means data travels through the non-owning controller without switching ownership: the non-owning controller receives the data, forwards it internally to the owning controller, and the owning controller performs the underlying operation. As you can see, this adds latency.

So how do we know whether an ESXi 5.0 host is running properly with ALUA? Let me show you a command:
```
esxcli storage nmp device list -d <NAA ID>
```
Output like that:
```
naa.600601602c802900146f4f294d8ee011
   Device Display Name: DGC Fibre Channel Disk (naa.600601602c802900146f4f294d8ee011)
   Storage Array Type: VMW_SATP_ALUA_CX
   Storage Array Type Device Config: {navireg=on, ipfilter=on}{implicit_support=on;explicit_support=on; explicit_allow=on;alua_followover=on;{TPG_id=1,TPG_state=AO}{TPG_id=2,TPG_state=ANO}}
   Path Selection Policy: VMW_PSP_FIXED
   Path Selection Policy Device Config: {preferred=vmhba2:C0:T1:L14;current=vmhba2:C0:T1:L14}
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba2:C0:T1:L14
```
Okay, let’s focus on the Storage Array Type Device Config line. It actually has three sections:
```
{navireg=on, ipfilter=on}
{implicit_support=on;explicit_support=on;explicit_allow=on;alua_followover=on;
{TPG_id=1,TPG_state=AO}{TPG_id=2,TPG_state=ANO}
}
```
navireg means whether or not the device is registered with Navisphere automatically.

ipfilter means whether or not to STOP sending the host name for Navisphere registration.

implicit_support means whether or not the device TPG state is managed by the storage device itself.

explicit_support means whether or not the device TPG state can be managed by the ESXi host.

explicit_allow means whether or not the user allows the SATP to use its explicit ALUA capability.

alua_followover means whether or not the ESX host follows the alternative path instead of the preferred path.

TPG means Target Port Group. Each TPG is a path routing group with its own state, such as Optimized, Non-Optimized, Standby, etc.

AO means Active/Optimized path routing.

ANO means Active/Non-Optimized path routing.
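If you only want the TPG states and not the whole config string, they can be pulled out of the device list output with standard tools. This is a small sketch under assumptions: the helper name `tpg_states` is made up, and it expects the config line to look like the sample output above.

```shell
#!/bin/sh
# tpg_states (hypothetical helper): read `esxcli storage nmp device list`
# output on stdin and print "<TPG id> <TPG state>" for every target port
# group found in the "Storage Array Type Device Config" line.
tpg_states() {
    grep 'Storage Array Type Device Config:' |
        tr '{}' '\n\n' |    # put every {...} section on its own line
        sed -n 's/^TPG_id=\([0-9]*\),TPG_state=\([A-Z]*\)$/\1 \2/p'
}

# Usage on an ESXi host (not executed here):
#   esxcli storage nmp device list -d <NAA ID> | tpg_states
```

For the sample device above this would list TPG 1 as AO and TPG 2 as ANO, which is a quick way to confirm that the host sees one optimized and one non-optimized port group.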
January 23, 2013