Challenge
During the snapshot creation or commit phase of a Veeam Backup or Replication job using vSphere, the primary node in a DAG cluster may lose its heartbeat long enough to cause a failover to the secondary node.
Cause
This problem is caused by the brief loss of connectivity that can occur in VMware vSphere during snapshot operations, sometimes referred to as the "stun" period. All Veeam Backup and Replication jobs require snapshot operations in vSphere.
Solution
This behavior originates in the infrastructure (third-party software and hardware) rather than in Veeam. The following are suggestions to help alleviate the problem; they may include configuration changes to VMware as well as Microsoft Exchange. Veeam is not responsible for any problems encountered by making any of the suggested changes in these systems. Please refer to the respective vendors' support organizations for more detail on these settings.
- Place the Exchange virtual machine's disks on the fastest disks (datastores) that are available.
- Disable all background scanning and/or maintenance tasks occurring in Exchange, or any other tools that are being leveraged against the system at the time of backup.
- Run the Exchange backup on its own rather than concurrently with other jobs.
- Adjust the Microsoft cluster settings for failover sensitivity (run from the command line):
  - cluster /prop SameSubnetDelay=2000:DWORD (Default: 1000)
  - cluster /prop CrossSubnetDelay=4000:DWORD (Default: 1000)
  - cluster /prop CrossSubnetThreshold=10:DWORD (Default: 5)
  - cluster /prop SameSubnetThreshold=10:DWORD (Default: 5)
  - To check the current settings, use: cluster /prop (see note)
- Add the line snapshot.maxConsolidateTime = "1" to the .vmx (configuration) file for the primary node. Please note that this is an undocumented vmx alteration, and should be validated by VMware support prior to using.
- If possible, reduce the total number of disks (.vmdk files) attached to the primary node to lessen the impact of snapshot operations.
- If possible, migrate the virtual machine from NFS storage to VMFS-formatted storage.
- Use the Network (NBD) mode setting on the source backup proxy instead of Appliance (HotAdd) mode for your backup and/or replication jobs in Veeam.
- Test snapshot operations directly against the ESX(i) host instead of through vCenter. (In some cases, gaps in communication between vCenter and the ESX(i) host can impact snapshot operations, including VSS operation timing.)
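- Note: cluster.exe may not be installed by default (for example, on Windows Server 2012 and later). To install it, run: Install-WindowsFeature -Name RSAT-Clustering-CmdInterface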
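As a rough guide to what the delay and threshold values above mean: the delay is the interval between heartbeats in milliseconds, and the threshold is the number of missed heartbeats tolerated before the cluster takes action. The tolerated gap is therefore approximately delay multiplied by threshold. The following is a minimal sketch of that arithmetic only; the function name Get-HeartbeatToleranceSeconds is hypothetical and is not part of any Microsoft or Veeam tooling.

```powershell
# Hypothetical helper: approximate time (in seconds) that a node may miss
# heartbeats before the cluster considers it down.
function Get-HeartbeatToleranceSeconds {
    param(
        [int]$DelayMs,    # interval between heartbeats, in milliseconds
        [int]$Threshold   # number of missed heartbeats tolerated
    )
    return ($DelayMs * $Threshold) / 1000
}

# Defaults as listed for cluster.exe above (1000 ms, 5 heartbeats)
Get-HeartbeatToleranceSeconds -DelayMs 1000 -Threshold 5

# Suggested same-subnet values (2000 ms, 10 heartbeats)
Get-HeartbeatToleranceSeconds -DelayMs 2000 -Threshold 10

# Suggested cross-subnet values (4000 ms, 10 heartbeats)
Get-HeartbeatToleranceSeconds -DelayMs 4000 -Threshold 10
```

With the listed defaults the window is about 5 seconds; the suggested values widen it to about 20 seconds (same subnet) and 40 seconds (cross subnet), which gives the node more time to ride out a snapshot stun.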
Alternatively, the cluster settings can be changed with PowerShell:
- Get-Cluster | fl *subnet* (shows the current timeout settings)
- To alter the cluster settings:
  - (Get-Cluster).SameSubnetThreshold = 20 (Default 10 in Windows 2012R2)
  - (Get-Cluster).SameSubnetDelay = 2000 (Default 1000 in Windows 2012R2)
  - (Get-Cluster).CrossSubnetThreshold = 40 (Default 20 in Windows 2012R2)
  - (Get-Cluster).CrossSubnetDelay = 4000 (Default 1000 in Windows 2012R2)
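The PowerShell steps above can be combined into one short script that records the existing values before changing them, so they can be restored later if needed. This is a sketch only. It assumes an elevated PowerShell session on a cluster node with the FailoverClusters module available, and it uses the values suggested above; the variable names are illustrative. Validate any change with Microsoft support for your environment.

```powershell
# Sketch only: run in an elevated PowerShell session on a cluster node.
Import-Module FailoverClusters

$cluster = Get-Cluster

# Record the current values so they can be restored later if needed.
$before = $cluster | Select-Object SameSubnetDelay, SameSubnetThreshold,
                                   CrossSubnetDelay, CrossSubnetThreshold
$before | Format-List

# Apply the values suggested above.
$cluster.SameSubnetThreshold  = 20
$cluster.SameSubnetDelay      = 2000
$cluster.CrossSubnetThreshold = 40
$cluster.CrossSubnetDelay     = 4000

# Confirm the new settings.
Get-Cluster | Format-List *subnet*
```

To revert, assign the values saved in $before back to the same properties.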