ALUA Devices on ESXi 5.0

You may see the keyword ALUA frequently if you read VMware storage documents. So what exactly is ALUA? How is it reflected in ESXi 5.0? What are its advantages? I certainly had these questions. Did you?

First of all, ALUA is short for “Asymmetric Logical Unit Access”, as you probably already know :-) ALUA is a SCSI standard. Not all storage arrays support it, but I think most large companies have an ALUA-capable array. Many articles have tried to explain ALUA. I'm not a storage expert; I just want to give my interpretation. If you don't agree or have questions, please leave a comment. I'm happy to talk about it.

Generally, a storage array has two controllers (SPA and SPB), and each controller has two ports (SPA0, SPA1, SPB0, SPB1). Data travels between ESX and the storage array through these paths. Older ESX versions typically used the FIXED path selection policy, which sends data through a single path. Here is a potential problem: suppose 10 ESX hosts in a cluster mount the same LUN, half of them use SPA0 and the other half use SPB0. This can cause path thrashing, because the first half of the hosts pull the LUN to controller SPA and the other half pull it back to controller SPB, over and over again. Another scenario is a LUN owned by SPA while some ESX hosts send data through SPB for some reason.
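To make the thrashing scenario concrete, here is a toy Python model. It is only an illustration under a simplifying assumption, not how the array firmware or ESX actually behaves: every I/O that arrives on the controller that does not own the LUN makes the array move ownership to that controller (EMC calls this a trespass). Controller names are reduced to "SPA"/"SPB".

```python
def count_trespasses(host_controllers, rounds, initial_owner="SPA"):
    """Toy model of a non-ALUA array: an I/O that arrives on the
    controller that does not own the LUN makes the array move
    ownership to that controller (a "trespass")."""
    owner = initial_owner
    trespasses = 0
    for _ in range(rounds):
        # Every host issues one I/O per round through its controller.
        for controller in host_controllers:
            if controller != owner:
                owner = controller  # LUN ownership flips
                trespasses += 1
    return trespasses


# 10 hosts: five send I/O via SPA0, five via SPB0.
split = ["SPA"] * 5 + ["SPB"] * 5
# 10 hosts that all agree on SPA.
agreed = ["SPA"] * 10

print(count_trespasses(split, rounds=3))   # 5
print(count_trespasses(agreed, rounds=3))  # 0
```

The split cluster flips ownership every round, while hosts that agree on one controller never do. Real arrays add delays and heuristics, so treat the numbers as a shape, not a measurement.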

Whatever caused the path thrashing, I suspect that's why I saw the following warning in vmkernel.log:

2013-01-15T05:36:33.831Z cpu14:4110)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device "naa.60a9800064676a2d6b5a6c33474b5138" state in doubt; requested fast path state update...
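If you want to know which devices produce this warning most often, a small Python sketch like the one below can count it per device. The regular expression is built only from the single log line quoted above, so other log formats may need a different pattern; it also assumes you have copied vmkernel.log somewhere Python can read it.

```python
import re
from collections import Counter

# Pattern taken from the vmkernel.log warning quoted above.
STATE_IN_DOUBT = re.compile(r'NMP device "(naa\.[0-9a-f]+)" state in doubt')


def count_state_in_doubt(lines):
    """Count "state in doubt" NMP warnings per device."""
    counts = Counter()
    for line in lines:
        match = STATE_IN_DOUBT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    '2013-01-15T05:36:33.831Z cpu14:4110)WARNING: NMP: '
    'nmp_DeviceRequestFastDeviceProbe:237:NMP device '
    '"naa.60a9800064676a2d6b5a6c33474b5138" state in doubt; '
    'requested fast path state update...',
]
print(count_state_in_doubt(sample).most_common())
# [('naa.60a9800064676a2d6b5a6c33474b5138', 1)]
```

Against a real file you would pass the open file object instead, e.g. `count_state_in_doubt(open("vmkernel.log")).most_common(5)`.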

ALUA avoids frequent switching between storage controllers. It provides two types of paths: Optimized and Non-Optimized. On an Optimized path, data travels between the ESX host and the controller that owns the LUN. On a Non-Optimized path, data goes to the non-owning controller without switching LUN ownership; that controller forwards the data internally to the owning controller, which then performs the underlying operation. As you can see, this extra hop adds latency.
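The preference described above can be sketched in a few lines of Python. This is a simplified illustration, not VMware's NMP code: I/O goes to Active/Optimized (AO) paths while any exist and falls back to Active/Non-Optimized (ANO) paths only when none are left. The second runtime name, vmhba3:C0:T1:L14, is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str       # runtime name, e.g. vmhba2:C0:T1:L14
    tpg_state: str  # "AO" (Active/Optimized) or "ANO" (Active/Non-Optimized)


def usable_paths(paths):
    """Prefer paths to the owning controller (AO); fall back to
    non-owning (ANO) paths only when no AO path is left."""
    optimized = [p for p in paths if p.tpg_state == "AO"]
    return optimized or [p for p in paths if p.tpg_state == "ANO"]


paths = [Path("vmhba2:C0:T1:L14", "AO"), Path("vmhba3:C0:T1:L14", "ANO")]
print([p.name for p in usable_paths(paths)])      # ['vmhba2:C0:T1:L14']
print([p.name for p in usable_paths(paths[1:])])  # ['vmhba3:C0:T1:L14']
```

Because the ANO path is used without moving LUN ownership, hosts that end up on different controllers no longer fight over the LUN; they just pay the internal-forwarding latency.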

So how do we know whether an ESXi 5.0 host is working properly with ALUA? Let me show you a command:

esxcli storage nmp device list -d <NAA ID>

The output looks like this:

   Device Display Name: DGC Fibre Channel Disk (naa.600601602c802900146f4f294d8ee011)
   Storage Array Type: VMW_SATP_ALUA_CX
   Storage Array Type Device Config: {navireg=on, ipfilter=on}{implicit_support=on;explicit_support=on; explicit_allow=on;alua_followover=on;{TPG_id=1,TPG_state=AO}{TPG_id=2,TPG_state=ANO}}
   Path Selection Policy: VMW_PSP_FIXED
   Path Selection Policy Device Config: {preferred=vmhba2:C0:T1:L14;current=vmhba2:C0:T1:L14}
   Path Selection Policy Device Custom Config:
   Working Paths: vmhba2:C0:T1:L14
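If you check many devices, a quick script helps. The Python sketch below turns one device block of that output into a dictionary and checks whether the Storage Array Type is an ALUA SATP. It assumes the "Key: value" layout shown above and treats any SATP name starting with VMW_SATP_ALUA as ALUA-capable; it splits on the first colon only because path names like vmhba2:C0:T1:L14 contain colons.

```python
def parse_nmp_device(output):
    """Turn one device block of `esxcli storage nmp device list` output
    into a dict. Splits on the first colon only, because values such as
    path names (vmhba2:C0:T1:L14) contain colons themselves."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key] = value.strip()
    return fields


def uses_alua_satp(fields):
    """True if the device is claimed by an ALUA SATP (VMW_SATP_ALUA*)."""
    return fields.get("Storage Array Type", "").startswith("VMW_SATP_ALUA")


sample = """\
   Device Display Name: DGC Fibre Channel Disk (naa.600601602c802900146f4f294d8ee011)
   Storage Array Type: VMW_SATP_ALUA_CX
   Path Selection Policy: VMW_PSP_FIXED
   Working Paths: vmhba2:C0:T1:L14
"""
fields = parse_nmp_device(sample)
print(uses_alua_satp(fields), fields["Working Paths"])  # True vmhba2:C0:T1:L14
```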

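Let's focus on the “Storage Array Type Device Config” line. It actually contains three sections:

{navireg=on, ipfilter=on}

implicit_support=on;explicit_support=on; explicit_allow=on;alua_followover=on

{TPG_id=1,TPG_state=AO}{TPG_id=2,TPG_state=ANO}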

navireg means whether the device is registered with Navisphere automatically.

ipfilter means whether to stop sending the host name for Navisphere registration.

implicit_support means whether the device's TPG state is managed by the storage array itself.

explicit_support means whether the device's TPG state can be managed by the ESXi host.

explicit_allow means whether the user allows the SATP to use its explicit ALUA capability.

alua_followover means whether the ESX host follows an alternative path instead of the preferred path.

TPG stands for Target Port Group: a group of target ports (paths) that share the same access state, such as Active/Optimized, Active/Non-Optimized, or Standby.

AO means Active/Optimized path routing

ANO means Active/Non-Optimized path routing
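To read these settings programmatically, the Python sketch below splits the config string into on/off flags and a TPG-to-state mapping. The regular expressions are derived only from the sample line in this post, so other SATPs may format the string differently.

```python
import re

# name=on / name=off pairs, e.g. navireg=on, alua_followover=on
FLAG = re.compile(r"(\w+)=(on|off)\b")
# {TPG_id=1,TPG_state=AO}
TPG = re.compile(r"\{TPG_id=(\d+),TPG_state=(\w+)\}")


def parse_satp_config(config):
    """Split the Storage Array Type Device Config string into
    on/off flags and a TPG id -> state mapping."""
    flags = {name: value == "on" for name, value in FLAG.findall(config)}
    tpgs = {int(tpg_id): state for tpg_id, state in TPG.findall(config)}
    return flags, tpgs


config = ("{navireg=on, ipfilter=on}{implicit_support=on;explicit_support=on;"
          " explicit_allow=on;alua_followover=on;"
          "{TPG_id=1,TPG_state=AO}{TPG_id=2,TPG_state=ANO}}")
flags, tpgs = parse_satp_config(config)
print(flags["alua_followover"], tpgs)  # True {1: 'AO', 2: 'ANO'}
```

A healthy ALUA device normally shows at least one AO group, so `"AO" in tpgs.values()` is a cheap sanity check.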

vMotion fails with the error: A general system error occurred. Invalid fault

The vSphere Client popped up the following error when I put an ESXi 5.0 host into maintenance mode:

A general system error occurred. Invalid fault

That message is really no help for troubleshooting. I found a KB article on the VMware website, but it didn't match my case.

My virtual machines are intact: I can change settings, remove them from inventory, and power them on or off. So what's the issue?

I found the following message in hostd.log:

2013-01-18T01:18:10.177Z [39489B90 info 'Default' opID=DDBEEEE7-0000023A-78] File path provided /vmfs/volumes/4fef9740-0b0c0cee-c1a4-e8393521ff62/VM-01 does not exist or underlying datastore is inaccessible: /vmfs/volumes/4fef9740-0b0c0cee-c1a4-e8393521ff62/VM-01

I also found this message in vmware.log:

2013-01-18T01:19:41.966Z| vmx| Migrate_SetFailure: Timed out waiting for migration start request.

The logs indicate that ESXi cannot locate the VM configuration file, so ESXi doesn't know the VM's IP address family and cannot allocate memory on the target host.

But my datastore is accessible and I can browse its contents. I think the only explanation is that the ESXi host was still using stale datastore information, and a rescan can fix the problem.
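To see which datastore hostd thinks is missing, you can pull the path and datastore UUID out of hostd.log. The Python sketch below is built from the single hostd.log line quoted above and assumes the VM folder name contains no whitespace.

```python
import re

# Built from the hostd.log message quoted above. Assumes the VM folder
# name contains no whitespace.
INACCESSIBLE = re.compile(
    r"File path provided (/vmfs/volumes/([^/\s]+)/\S*) does not exist "
    r"or underlying datastore is inaccessible"
)


def find_inaccessible(lines):
    """Return (path, datastore UUID) pairs that hostd reported as missing."""
    hits = []
    for line in lines:
        match = INACCESSIBLE.search(line)
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits


sample = ("2013-01-18T01:18:10.177Z [39489B90 info 'Default' "
          "opID=DDBEEEE7-0000023A-78] File path provided "
          "/vmfs/volumes/4fef9740-0b0c0cee-c1a4-e8393521ff62/VM-01 does not "
          "exist or underlying datastore is inaccessible: "
          "/vmfs/volumes/4fef9740-0b0c0cee-c1a4-e8393521ff62/VM-01")
print(find_inaccessible([sample]))
```

You can then compare the UUID with the UUID column of `esxcli storage filesystem list` on the host. If the datastore is listed and mounted but hostd still complains, stale datastore information is the likely cause.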

“There is no valid reference host associated with the profile”

When I tried to apply my host profile to an ESXi 5.0 host, it showed me “There is no valid reference host associated with the profile”.

I thought it was probably caused by the answer file, but it was actually because the host profile had lost its reference host!

I made a simple problem too complicated… :-)

Failed to connect to SQL Server when installing vCenter SSO

The installer may report “Failed to established connection” after you enter the SQL database information.

The reason can vary. If your SQL account password is correct, the failure may be caused by the SQL password policy. Three password policy options are selected by default when you create a SQL login: Enforce password policy, Enforce password expiration, and User must change password at next login.

SQL password policy

You may also find a similar error message in %TEMP%\vm-sso-javalib.log:

[2013-01-11 10:54:33,640]ERROR 733[main] - 
com.vmware.vim.installer.core.logging.CoreLoggerImpl.error(?:?) - 
Failed to established connection: 
Login failed for user 'vCenterSSO_DBA'.  
Reason: The password of the account must be changed.

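Clear those three password policy options, or change the SQL password to a more complex one. The “must be changed” reason in the log points specifically to the “User must change password at next login” option.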
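Before picking a new password, you can pre-check it. The Python sketch below is a rough check, assuming the Windows-style complexity rules that apply when “Enforce password policy” is on: the password must not contain the login name, must meet a minimum length, and must use at least three of four character categories. The minimum length of 8 is a placeholder, since the real value comes from the password policy of the Windows server running SQL Server.

```python
import string


def meets_complexity(password, login, min_length=8):
    """Rough pre-check against Windows-style complexity rules.
    min_length=8 is a placeholder; the real minimum comes from the
    password policy of the Windows server running SQL Server."""
    if len(password) < min_length:
        return False
    if login.lower() in password.lower():
        return False  # must not contain the login name
    categories = [
        any(c in string.ascii_uppercase for c in password),
        any(c in string.ascii_lowercase for c in password),
        any(c in string.digits for c in password),
        any(not c.isalnum() for c in password),  # symbols such as ! $ # %
    ]
    return sum(categories) >= 3  # at least three of the four categories


print(meets_complexity("vmware", "vCenterSSO_DBA"))       # False
print(meets_complexity("Vmw@re-2013", "vCenterSSO_DBA"))  # True
```

Passing this check does not clear the “must change password” flag; the server-side setting still has to be changed as described above.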


Move multiple datastores to a folder

We are moving virtual machines from old storage to new datastores today, and many old datastores need to be removed after the migration. For safety, I move all the old datastores into a folder first and then start the decommissioning process.

There are more than 60 datastores, and the vSphere Client does not allow moving them all at once. Here is a PowerCLI script that can help move multiple datastores to a folder.

Note: Please make sure your folder name is unique.

When you create datastores.txt, make sure the first line is “Name”, followed by one datastore name per line.
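The “Name” header matters if the script reads the file as CSV (PowerShell's Import-Csv, for instance, treats the first line as column names); that is my assumption, since the script itself isn't reproduced here. The Python sketch below mirrors that reading with csv.DictReader, so you can sanity-check datastores.txt before running anything: it rejects a missing header and duplicate names and skips blank lines. Datastore names containing commas would be split by the CSV parser.

```python
import csv
import io


def read_datastore_names(text):
    """Read datastores.txt: first line 'Name', then one datastore per line."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != ["Name"]:
        raise ValueError(f"expected header 'Name', got {reader.fieldnames}")
    names, seen = [], set()
    for row in reader:
        name = (row["Name"] or "").strip()
        if not name:
            continue  # skip blank lines
        if name in seen:
            raise ValueError(f"duplicate datastore name: {name}")
        seen.add(name)
        names.append(name)
    return names


print(read_datastore_names("Name\nOLD-DS-01\n\nOLD-DS-02\n"))
# ['OLD-DS-01', 'OLD-DS-02']
```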


Extending an ATS-capable VMFS5 datastore may fail

Many storage arrays support hardware acceleration, which offloads some storage operations from the ESXi 5.x host to the storage array. The feature can significantly improve performance during cloning, Storage vMotion, copying, etc.

Different storage devices may support different hardware acceleration features. Block devices have block zeroing, full copy, hardware-assisted locking, and thin provisioning; NAS devices have extended stats, file cloning, large-scale native SS (snapshot), native SS to LC (linked clone), and space reservation.

You can find detailed information in this article.

For block storage, we initially create a VMFS5 datastore on one LUN, and more LUNs (extents) may be added to the datastore when free space runs low. Make sure all extents of a VMFS5 datastore have the same ATS capability, either all supported or all unsupported. You may see the error message “Operation failed, unable to add extent to filesystem” when you add a non-ATS extent to an ATS-enabled VMFS5 datastore.

How do you know whether a LUN supports ATS?

You can log in to the ESXi 5.x host via SSH and use the following command to see which features a LUN supports:

esxcli storage core device vaai status get -d <device ID>
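Before adding an extent, you can compare the ATS status of the existing extents and the new LUN. The Python sketch below assumes the command output lists a device ID on an unindented line followed by indented “Key: value” lines such as “ATS Status: supported”; verify that against your own host's output. The device IDs in the sample are hypothetical. The extent devices of an existing datastore can be listed with `esxcli storage vmfs extent list`.

```python
def ats_status_by_device(output):
    """Map device ID -> ATS status from `esxcli storage core device vaai
    status get` output. Assumed layout: a device ID line, followed by
    indented "Key: value" lines such as "ATS Status: supported"."""
    status, device = {}, None
    for line in output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            device = line.strip()  # a new device block starts
        elif device and line.strip().startswith("ATS Status:"):
            status[device] = line.split(":", 1)[1].strip()
    return status


def extents_agree_on_ats(status, extent_devices):
    """True when every extent device reports the same ATS status."""
    return len({status.get(d, "unknown") for d in extent_devices}) == 1


# Hypothetical devices: one ATS-capable LUN, one without ATS.
sample = """\
naa.6000000000000001
   ATS Status: supported
   Clone Status: supported
naa.6000000000000002
   ATS Status: unsupported
   Clone Status: unsupported
"""
status = ats_status_by_device(sample)
print(extents_agree_on_ats(status, ["naa.6000000000000001"]))  # True
print(extents_agree_on_ats(status, list(status)))              # False
```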

What is ATS?

Atomic Test and Set (ATS) is a newer SCSI locking method: instead of reserving the entire LUN, it locks only the individual disk sectors it needs. More detailed information is in this article.
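The difference between the two locking styles can be shown with a toy Python model. This is an illustration only, not how VMFS lays out its on-disk locks: on the array, ATS is carried out with the SCSI COMPARE AND WRITE command, modeled here as a per-sector compare-and-set, while the old method reserves the whole LUN for one host. Host names esx01/esx02 are hypothetical.

```python
import threading


class ToyLun:
    """Toy model contrasting ATS (per-sector compare-and-write) with a
    whole-LUN SCSI reservation. Not how VMFS lays out its locks."""

    def __init__(self, sectors):
        self._data = [0] * sectors
        self._guard = threading.Lock()  # makes each operation atomic
        self.reserved_by = None         # holder of a whole-LUN reservation

    def atomic_test_and_set(self, sector, expected, new):
        """Write `new` only if the sector still holds `expected`."""
        with self._guard:
            if self._data[sector] != expected:
                return False  # someone else changed it: lock attempt fails
            self._data[sector] = new
            return True

    def reserve(self, host):
        """Old method: one host reserves the whole LUN, blocking everyone."""
        with self._guard:
            if self.reserved_by not in (None, host):
                return False
            self.reserved_by = host
            return True


lun = ToyLun(sectors=8)
print(lun.atomic_test_and_set(1, 0, "esx01"))  # True: esx01 locks sector 1
print(lun.atomic_test_and_set(2, 0, "esx02"))  # True: different sector, no conflict
print(lun.atomic_test_and_set(1, 0, "esx02"))  # False: sector 1 is taken
print(lun.reserve("esx01"), lun.reserve("esx02"))  # True False
```

With ATS, two hosts updating different metadata sectors never block each other; with a reservation, the second host is locked out of the entire LUN.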

Time synchronization on virtual machines

The guest OS of a virtual machine can synchronize time in several ways, such as NTP, VMware Tools, Windows Time Service, CMOS, etc.

VMware Tools synchronizes the guest time with the ESXi host when you enable periodic time synchronization. This function is disabled by default, but that doesn't mean time synchronization never happens between the guest OS and the host. It still happens after certain operations:

  • When the VMware Tools daemon is started.
  • When resuming a VM from suspended status.
  • After reverting from a snapshot.
  • After shrinking a disk.

This can cause problems when the guest OS time differs from the host time. For example, an SAP application can fail because the SAP database timestamp differs from the guest OS time.

You can completely disable time synchronization with the following steps:

  • Power off the VM.
  • Add the following lines to the .vmx file:
    tools.syncTime = "FALSE"
    time.synchronize.continue = "FALSE"
    time.synchronize.restore = "FALSE"
    time.synchronize.resume.disk = "FALSE"
    time.synchronize.shrink = "FALSE"
    time.synchronize.tools.startup = "FALSE"
  • Save and close the file.
  • Power on the VM.
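If you need to apply these settings to many VMs, the edit can be scripted. The Python sketch below rewrites the text of a .vmx file: it replaces the value of each key that already exists and appends the missing ones. The dictionary mirrors the .vmx lines in the steps, with time.synchronize.tools.startup included for the “VMware Tools daemon is started” case from the list of operations. Only run it against a powered-off VM's .vmx, and keep a backup copy.

```python
SETTINGS = {
    "tools.syncTime": "FALSE",
    "time.synchronize.continue": "FALSE",
    "time.synchronize.restore": "FALSE",
    "time.synchronize.resume.disk": "FALSE",
    "time.synchronize.shrink": "FALSE",
    "time.synchronize.tools.startup": "FALSE",
}


def apply_vmx_settings(vmx_text, settings=SETTINGS):
    """Set each key in .vmx text: replace the first existing line for a
    key, append keys that are missing. Other lines are left untouched."""
    lines = vmx_text.splitlines()
    pending = dict(settings)
    for i, line in enumerate(lines):
        key = line.split("=", 1)[0].strip()
        if key in pending:
            lines[i] = f'{key} = "{pending.pop(key)}"'
    # Append settings that were not present yet.
    lines += [f'{key} = "{value}"' for key, value in pending.items()]
    return "\n".join(lines) + "\n"


before = 'displayName = "VM-01"\ntools.syncTime = "TRUE"\n'
print(apply_vmx_settings(before))
```

Usage with a real file is a read, transform, write: load the .vmx text, pass it through `apply_vmx_settings`, and write the result back.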

You can also refer to the official VMware documentation.

Here is also a useful KB article on timekeeping best practices for Linux VMs.