Initiating Failover From the UI
Planned and unplanned failover wizard, confirming target site, monitoring progress
This page walks you through initiating both planned and unplanned failover operations from the Trilio Site Recovery Horizon dashboard. You will use the failover wizard to select the target site, review resource mappings, confirm your intent, and then track operation progress to completion. Understanding which failover type to use, planned (graceful, coordinated shutdown) versus unplanned (immediate, no quiesce), is essential before you begin, as the choice affects data consistency and recovery time objectives.
Before initiating a failover from the UI, confirm the following are in place:
- Trilio Site Recovery deployed on both sites. The `protector-api` and `protector-engine` services must be running independently on your primary site and your secondary (DR) site.
- Both sites registered. Your primary and secondary sites must appear as `active` in the Sites list. You can verify this under Site Recovery → Sites in the Horizon dashboard.
- Protection Group in `active` or `failed_over` status. A Protection Group in `error` or `failing_over` state cannot be used as a failover source. Check status under Site Recovery → Protection Groups.
- Metadata in sync. The Protection Group metadata version must match on both sites. If the sync status shows `OUT OF SYNC` or `UNREACHABLE`, resolve that before proceeding; modifications and failover operations are blocked when the peer site is unreachable, to prevent metadata divergence.
- Replication policy configured. A replication policy with valid Pure Storage FlashArray credentials must be attached to the Protection Group.
- Resource mappings prepared. You need to know the network and flavor mappings between your primary and secondary sites. Networks and flavors are not automatically discovered, and mismatched mappings will cause VM recreation to fail. Confirm the target site has networks and flavors that correspond to those used on the source site.
- Sufficient quota on the target site. The target site must have enough Nova compute and Cinder storage quota to accommodate the VMs and volumes being failed over.
- Cinder volume types on both sites have `replication_enabled='<is> True'`. Volumes not backed by a replication-enabled volume type cannot be failed over.
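The prerequisites above can be checked mechanically before you open the wizard. The sketch below is a minimal, hypothetical pre-flight helper; the function, parameter names, and status strings mirror the rules listed but are illustrative, not part of any Trilio API.

```python
# Hypothetical pre-flight helper mirroring the prerequisites above.
# Function and field names are illustrative assumptions, not Trilio API.

READY_PG_STATES = {"active", "failed_over"}  # valid failover sources

def preflight_ok(primary_status, secondary_status, pg_status, sync_status):
    """Return (ok, reasons) for the documented failover prerequisites."""
    reasons = []
    if primary_status != "active" or secondary_status != "active":
        reasons.append("both sites must be registered and active")
    if pg_status not in READY_PG_STATES:
        reasons.append(f"protection group state '{pg_status}' is not a valid failover source")
    if sync_status != "IN SYNC":
        reasons.append(f"metadata sync status is '{sync_status}'")
    return (not reasons, reasons)

ok, why = preflight_ok("active", "active", "active", "IN SYNC")
# ok is True, why is []
```

A real deployment would pull these statuses from the Sites and Protection Groups panels (or their backing API) rather than passing them in by hand.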
The failover wizard is part of the Trilio Site Recovery Horizon plugin and requires no separate installation step beyond the standard Trilio Site Recovery deployment. If the Site Recovery panel does not appear in your Horizon sidebar, the `trilio-horizon-plugin` package is not installed or enabled on this Horizon host.
To verify the plugin is active:
```shell
# On the Horizon host
pip show trilio-horizon-plugin
```
If the package is missing, follow your site's Trilio Site Recovery deployment runbook to install and enable the Horizon plugin, then restart the Horizon web server:
```shell
systemctl restart apache2   # Debian/Ubuntu
# or
systemctl restart httpd     # RHEL/CentOS
```
Once the plugin is active, the failover wizard is accessible from any Protection Group detail page; no further installation is needed.
The failover wizard surfaces several configuration inputs. Understanding each option's effect prevents mistakes that are difficult to reverse after failover completes.
Failover Type
| Option | When to use | Effect |
|---|---|---|
| Planned | Primary site is healthy; you are migrating workloads intentionally | The engine quiesces VMs on the source site, takes a final consistency snapshot, then brings VMs up on the target site. Minimizes data loss. |
| Unplanned | Primary site is unavailable or degraded | The engine proceeds immediately using the most recent replicated snapshot available on the target FlashArray. No attempt is made to contact the source site. Data loss up to your RPO window is expected. |
Target Site
The wizard presents only the site currently registered as the secondary for this Protection Group. Primary/secondary designations are workload-relative and swap after each failover: after a failover completes, the site that was secondary becomes the new primary, and the former primary becomes the new secondary.
Network Mapping
Each source network must be explicitly mapped to a target network. The wizard lists the networks currently attached to member VMs and provides a dropdown for each one populated with networks available on the target site. If a network is left unmapped, VM creation for the affected member will fail during Phase 3 of the operation.
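Because an unmapped network only fails later, during Phase 3, it is worth understanding the completeness rule the wizard enforces. The sketch below is illustrative only; the function name and mapping shape are assumptions, not Trilio code.

```python
# Sketch of the wizard's mapping-completeness rule: every network attached
# to a member VM must have a target network assigned before submission.
# Names are illustrative assumptions, not Trilio identifiers.

def unmapped_networks(source_networks, mapping):
    """Return source networks that have no target network assigned."""
    return [net for net in source_networks if not mapping.get(net)]

mapping = {"net-primary-web": "net-secondary-web", "net-primary-db": None}
missing = unmapped_networks(["net-primary-web", "net-primary-db"], mapping)
# missing == ["net-primary-db"]  -> the wizard would block submission
```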
Flavor Mapping
Flavors are mapped similarly to networks. If a source flavor exists with the same name on the target site, the wizard pre-populates the mapping. You can override any pre-populated mapping. If a target flavor is not specified for a given source flavor, the engine attempts to use the same flavor name on the target site ā this will fail if the flavor does not exist there.
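The same-name fallback described above can be sketched as follows. This is a hedged illustration of the documented behavior, not actual engine code; the function and error type are assumptions.

```python
# Illustrative sketch of the flavor-resolution fallback described above:
# with no explicit mapping, the engine tries the source flavor's name on
# the target site, which fails if that flavor does not exist there.

def resolve_flavor(source_flavor, mapping, target_flavors):
    # Fall back to the same name when no explicit target is given
    target = mapping.get(source_flavor, source_flavor)
    if target not in target_flavors:
        raise LookupError(f"flavor '{target}' does not exist on target site")
    return target

resolve_flavor("m1.large", {}, {"m1.large", "m1.xlarge"})  # -> "m1.large"
```

This is why reviewing the pre-populated mappings matters: a name match on the source site does not guarantee the flavor exists, or has the same dimensions, on the target.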
Force Flag
The Force option bypasses pre-flight checks (replication status, sync status). Use this only as a last resort during a declared disaster, when normal validation cannot complete. Forcing failover when replication lag is high increases data loss beyond your configured RPO.
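The interaction between sync status and the Force flag reduces to a simple gate. The sketch below is an assumption-laden illustration of the documented behavior (Force bypasses pre-flight validation, nothing more), not the actual wizard logic.

```python
# Illustrative gate for wizard submission: degraded sync status blocks
# failover unless Force is set. Not actual Trilio logic; names assumed.

BLOCKING_SYNC_STATES = ("OUT OF SYNC", "UNREACHABLE")

def submission_allowed(sync_status, force):
    if sync_status in BLOCKING_SYNC_STATES and not force:
        return False  # wizard blocks to prevent metadata divergence
    return True

submission_allowed("UNREACHABLE", force=False)  # -> False (blocked)
submission_allowed("UNREACHABLE", force=True)   # -> True  (last resort)
```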
Initiating a Planned Failover
Use planned failover when the primary site is operational and you want a controlled, low-data-loss migration, for example before a maintenance window or a site decommission.
- Log in to Horizon and authenticate to the primary site (the site where workloads are currently running).
- Navigate to Site Recovery → Protection Groups.
- Locate the Protection Group you want to fail over and confirm its status is `active`.
- Click the Protection Group name to open the detail page.
- Verify the Sync Status panel shows `IN SYNC` and both sites show the same metadata version number. If the status is `OUT OF SYNC`, resolve the sync issue before continuing.
- Click Failover in the action bar. The failover wizard opens.
- Select Planned as the failover type.
- Confirm the Target Site shown is correct. The wizard displays the registered secondary site for this Protection Group.
- In the Network Mappings table, assign a target network for each source network. All rows must have a target selected before you can proceed.
- In the Flavor Mappings table, verify or update the flavor assignments. Pre-populated mappings are suggestions based on name matching; review them for correctness.
- Review the Summary panel, which lists the member VMs, their volumes, and the full set of mappings you have configured.
- Click Initiate Failover. The wizard submits the operation and redirects you to the Operations progress view.
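For orientation, everything the wizard collects can be thought of as one structured request. The snippet below is purely hypothetical: the field names and the `pg-1234` identifier are assumptions for illustration, not the actual Trilio Site Recovery API schema.

```python
import json

# Hypothetical request body assembled from the wizard inputs above.
# All field names and values here are illustrative assumptions.
failover_request = {
    "protection_group_id": "pg-1234",   # example ID, not a real resource
    "failover_type": "planned",          # or "unplanned"
    "target_site": "site-b",
    "network_mappings": {"net-primary-web": "net-secondary-web"},
    "flavor_mappings": {"m1.large": "m1.large"},
    "force": False,
}
print(json.dumps(failover_request, indent=2))
```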
Initiating an Unplanned Failover
Use unplanned failover when the primary site is unreachable or has failed and you need to recover workloads on the secondary site immediately.
- Log in to Horizon and authenticate to the secondary site (the DR site where you want workloads to come up).
- Navigate to Site Recovery → Protection Groups.
- Locate the Protection Group. Because the primary site is unavailable, its status may show as `active` with a degraded sync status. This is expected.
- Click the Protection Group name.
- Click Failover.
- Select Unplanned as the failover type. A warning banner appears indicating that the operation will not attempt to quiesce the source site and data loss may occur up to the last replicated snapshot.
- Complete the network and flavor mappings as described in the planned failover steps above.
- If pre-flight checks fail because the primary site is unreachable, you may enable Force to proceed. Read the force warning carefully before enabling it.
- Review the summary and click Initiate Failover.
Monitoring Progress
After the wizard submits the operation, you are redirected to the operation detail view under Site Recovery → Operations. The progress bar and step log update in real time:
| Progress Range | Phase | What is happening |
|---|---|---|
| 0–20% | Preparation | Protection Group status set to `failing_over`; DR operation record created; target site reachability confirmed; latest snapshot identified on FlashArray |
| 20–60% | Storage failover | Replicated snapshot extracted on FlashArray at target site; volumes created from snapshot; volumes managed into Cinder on target site |
| 60–90% | Instance recreation | VMs recreated on target site using stored metadata and the applied network and flavor mappings; boot volume attached; additional volumes attached |
| 90–100% | Finalization | Protection Group status updated to `failed_over`; current primary site pointer updated to target site; failover count incremented; operation marked completed |
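As a reading aid, the phase boundaries in the table can be expressed as a small lookup. The ranges and phase names come from the table above; the function itself is an illustrative sketch, not part of any Trilio tooling.

```python
# Map a progress percentage to the phase names from the table above.
# Illustrative sketch only; not part of Trilio Site Recovery.

PHASES = [
    (20, "Preparation"),
    (60, "Storage failover"),
    (90, "Instance recreation"),
    (100, "Finalization"),
]

def phase_for(progress):
    """Return the phase name for a progress value in 0-100."""
    for upper_bound, name in PHASES:
        if progress <= upper_bound:
            return name
    raise ValueError("progress must be between 0 and 100")

phase_for(45)  # -> "Storage failover"
phase_for(95)  # -> "Finalization"
```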
You do not need to keep the browser tab open. You can return to Site Recovery → Operations at any time to check status. Each operation records `steps_completed` and `steps_failed` fields that are visible in the operation detail view, making it straightforward to identify exactly where a failure occurred.
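When an operation does fail, the first entry in `steps_failed` is usually the place to start. The helper below is a sketch; the operation record shape shown is an assumption for illustration, not the documented API format.

```python
# Sketch: locate the first failed step in an operation record using the
# steps_completed / steps_failed fields mentioned above. The dict shape
# is an assumed illustration, not the documented record format.

def first_failure(operation):
    """Return the first failed step name, or None if all steps passed."""
    failed = operation.get("steps_failed") or []
    return failed[0] if failed else None

op = {
    "steps_completed": ["validate_target_site", "get_latest_snapshot"],
    "steps_failed": ["recreate_instance:web-server-1"],
}
first_failure(op)  # -> "recreate_instance:web-server-1"
```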
Example 1: Planned failover of a three-VM web application
Scenario: You are failing over the prod-web-app Protection Group from Site A (primary, Boston) to Site B (secondary, Seattle) before a scheduled maintenance window. All three member VMs (web-server-1, web-server-2, db-server) are healthy and replication is current.
Wizard inputs:
```
Failover Type:    Planned
Target Site:      site-b (Seattle)
Network Mappings:
  net-primary-web → net-secondary-web
  net-primary-db  → net-secondary-db
Flavor Mappings:
  m1.large  → m1.large
  m1.xlarge → m1.xlarge
Force: No
```
Expected operation output (as seen in the Operations detail view):
```
Operation ID: op-a1b2c3d4-...
Type: failover
Status: completed
Progress: 100%
Steps Completed:
  [✓] validate_target_site
  [✓] get_latest_snapshot
  [✓] quiesce_source_vms (planned only)
  [✓] create_volumes_from_snapshot (3 volumes)
  [✓] manage_volumes_into_cinder
  [✓] recreate_instance:web-server-1
  [✓] recreate_instance:web-server-2
  [✓] recreate_instance:db-server
  [✓] update_protection_group_state
Steps Failed: (none)
Started: 2025-11-03T02:00:05Z
Completed: 2025-11-03T02:08:43Z
Duration: 8m 38s
```
Protection Group state after completion:
```
Name: prod-web-app
Status: failed_over
Current Primary Site: site-b
Failover Count: 1
Last Failover At: 2025-11-03T02:08:43Z
```
Example 2: Unplanned failover with force enabled
Scenario: Site A has gone offline unexpectedly. You are logged in to Horizon on Site B. The Protection Group sync status shows UNREACHABLE for the peer site. You need to recover workloads immediately.
Wizard inputs:
```
Failover Type:    Unplanned
Target Site:      site-b (you are already on site-b)
Network Mappings:
  net-primary-web → net-secondary-web
  net-primary-db  → net-secondary-db
Flavor Mappings:
  m1.large  → m1.large
  m1.xlarge → m1.xlarge
Force: Yes (required because peer site pre-flight check cannot complete)
```
Expected operation output:
```
Operation ID: op-e5f6g7h8-...
Type: failover
Status: completed
Progress: 100%
Steps Completed:
  [✓] validate_target_site
  [✓] get_latest_snapshot (uses most recent replicated snapshot)
  [-] quiesce_source_vms (skipped: unplanned + force)
  [✓] create_volumes_from_snapshot (3 volumes)
  [✓] manage_volumes_into_cinder
  [✓] recreate_instance:web-server-1
  [✓] recreate_instance:web-server-2
  [✓] recreate_instance:db-server
  [✓] update_protection_group_state
Warning: Source site sync status was UNREACHABLE at operation start.
         Data loss up to last replicated snapshot is possible.
Started: 2025-11-03T10:15:02Z
Completed: 2025-11-03T10:22:17Z
```
Example 3: Failover blocked due to out-of-sync metadata
Scenario: You attempt to initiate failover but the wizard shows a pre-flight warning and blocks submission.
Error shown in wizard:
```
✗ Pre-flight check failed

Metadata sync status: OUT OF SYNC
  Local version:  8
  Remote version: 7
  Last sync: 5 minutes ago (FAILED)

Failover is blocked because the remote site metadata is behind by 1 version.
This prevents metadata divergence between sites.

Recommended action:
  1. Check remote site connectivity.
  2. Navigate to Protection Group → Actions → Force Sync.
  3. Re-attempt failover once sync status shows IN SYNC.

If the remote site is unreachable and you must proceed,
enable Force to override this check.
```
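The version comparison behind this check can be sketched in a few lines. This is an illustration of the rule shown in the error message (remote metadata behind local blocks failover unless Force is set), with names assumed, not Trilio internals.

```python
# Illustrative version of the sync check from Example 3: failover is
# blocked when the remote metadata version lags the local one, unless
# Force is enabled. Names are assumptions, not Trilio internals.

def sync_blocks_failover(local_version, remote_version, force=False):
    """Return True when the metadata version gap should block failover."""
    if force:
        return False  # operator explicitly accepted the risk
    return remote_version < local_version  # remote is behind -> block

sync_blocks_failover(8, 7)              # -> True (blocked, as in Example 3)
sync_blocks_failover(8, 7, force=True)  # -> False (override)
```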
Failover wizard shows no target site in the dropdown
Symptom: The Target Site dropdown in the failover wizard is empty or greyed out.
Likely cause: No secondary site is registered for this Protection Group, or the secondary site record has status `unreachable` or `error`.
Fix:
- Navigate to Site Recovery → Sites and verify the secondary site appears and has status `active`.
- If the site is listed but shows `unreachable`, click Validate on the site record to re-test connectivity. Resolve any network or credential issues reported.
- If no secondary site is registered, return to the Protection Group configuration and associate a secondary site before attempting failover.
Failover operation fails at "recreate_instance" step
Symptom: The operation progress bar stalls between 60–90% and the step log shows `[✗] recreate_instance:<vm-name>` with an error such as `No valid host found` or `Network <uuid> not found`.
Likely cause: The network or flavor mapping provided in the wizard does not exist on the target site, or the target site does not have sufficient compute capacity.
Fix:
- Open the operation detail view and expand the failed step to read the full error message.
- For `Network not found`: confirm the target network UUID in your mapping is correct for the target site. Network UUIDs differ between sites; do not copy UUIDs from the primary site.
- For `No valid host found`: verify the target flavor exists on the target site (`openstack flavor list` scoped to the target site) and that the target site has available compute resources.
- Correct the mappings and re-initiate the failover. If volumes were already managed into Cinder during the failed attempt, the engine will detect them and skip re-creation; you will need to clean up orphaned volumes manually before retrying.
Operation stuck at 0% with status pending
Symptom: The operation appears in the Operations list with status `pending` and does not advance after several minutes.
Likely cause: The protector-engine service on the target site is not running or is unable to reach the message queue.
Fix:
- SSH to the target site's controller and check the engine service with `systemctl status protector-engine` and `journalctl -u protector-engine -n 100`.
- If the service is stopped, start it and watch for errors in the log.
- If the engine is running but the operation remains pending, check database connectivity and RabbitMQ status on the target site.
Wizard blocks failover with "Metadata sync status: UNREACHABLE"
Symptom: The pre-flight check in the wizard shows the peer site as unreachable and will not allow you to proceed without enabling Force.
Likely cause: During a planned failover, the primary site cannot reach the secondary site to push a final metadata sync. During an unplanned failover initiated from the secondary site, this is expected if the primary site has failed.
Fix (planned failover):
- Verify that the secondary site's `protector-api` endpoint is reachable from the primary site on port 8788.
- Check firewall rules between sites.
- Once connectivity is restored, use Force Sync on the Protection Group before re-attempting the planned failover.
Fix (unplanned failover): This is the expected state when the primary site is down. Enable Force in the wizard to proceed. Acknowledge that data loss up to the last replicated snapshot is possible.
Volumes not appearing on target site after failover completes
Symptom: The operation shows completed at 100%, but the member VMs on the target site are in ERROR state or are missing their volumes.
Likely cause: Cinder volume_manage permission is not granted to the member role on the target site, causing the volume import step to succeed at the Pure Storage layer but fail when registering volumes in Cinder.
Fix:
- On the target site, verify that `policy.yaml` for Cinder includes `"volume_extension:volume_manage": "rule:admin_or_owner"`, `"volume_extension:volume_unmanage": "rule:admin_or_owner"`, and `"volume_extension:services:index": "rule:admin_or_owner"`.
- Apply the policy change and reconfigure Cinder: `kolla-ansible -i inventory reconfigure -t cinder`
- Clean up any orphaned volumes on the FlashArray (volumes that were promoted but not imported into Cinder), then re-initiate the failover.
Protection Group status remains failing_over after a failed operation
Symptom: A failover operation failed and the Protection Group status is stuck at `failing_over`, preventing new operations from being submitted.
Likely cause: The operation failed mid-execution and the engine was unable to complete the state rollback.
Fix:
- Review the failed operation's `steps_failed` output in Site Recovery → Operations to understand what succeeded and what did not.
- Manually clean up any partial resources on the target site (unmanage imported volumes, delete partially-created VMs).
- Use the CLI to reset the Protection Group status: `openstack protector protection-group reset-state <pg-id> --state active`
- Re-initiate the failover once the Protection Group returns to `active` and the environment is in a known-clean state.