Site Recovery for OpenStack
Guide

Initiating Failover From the UI

Planned and unplanned failover wizard, confirming target site, monitoring progress


Overview

This page walks you through initiating both planned and unplanned failover operations from the Trilio Site Recovery Horizon dashboard. You will use the failover wizard to select the target site, review resource mappings, confirm your intent, and then track operation progress to completion. Understanding which failover type to use — planned (graceful, coordinated shutdown) versus unplanned (immediate, no quiesce) — is essential before you begin, as the choice affects data consistency and recovery time objectives.


Prerequisites

Before initiating a failover from the UI, confirm the following are in place:

  • Trilio Site Recovery deployed on both sites. The protector-api and protector-engine services must be running independently on your primary site and your secondary (DR) site.
  • Both sites registered. Your primary and secondary sites must appear as active in the Sites list. You can verify this under Site Recovery → Sites in the Horizon dashboard.
  • Protection Group in active or failed_over status. A Protection Group in error or failing_over state cannot be used as a failover source. Check status under Site Recovery → Protection Groups.
  • Metadata in sync. The Protection Group metadata version must match on both sites. If the sync status shows OUT OF SYNC or UNREACHABLE, resolve that before proceeding — modifications and failover operations are blocked when the peer site is unreachable, to prevent metadata divergence.
  • Replication policy configured. A replication policy with valid Pure Storage FlashArray credentials must be attached to the Protection Group.
  • Resource mappings prepared. You need to know the network and flavor mappings between your primary and secondary sites. Networks and flavors are not automatically discovered — mismatched mappings will cause VM recreation to fail. Confirm the target site has networks and flavors that correspond to those used on the source site.
  • Sufficient quota on the target site. The target site must have enough Nova compute and Cinder storage quota to accommodate the VMs and volumes being failed over.
  • Cinder volume types on both sites have replication_enabled='<is> True'. Volumes not backed by a replication-enabled volume type cannot be failed over.
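Taken together, these checks amount to a simple validation pass before the wizard will accept a failover. The following sketch mirrors that checklist; the field names and status strings (status, sync_status, replication_policy) are assumptions for illustration, not the actual Trilio data model:

```python
# Illustrative pre-flight validator mirroring the checklist above.
# Field names and status strings are assumptions, not Trilio's data model.

def preflight_errors(pg: dict) -> list:
    """Return the reasons a Protection Group cannot be failed over."""
    errors = []
    if pg.get("status") not in ("active", "failed_over"):
        errors.append("status must be 'active' or 'failed_over', "
                      "got %r" % pg.get("status"))
    if pg.get("sync_status") != "IN SYNC":
        errors.append("metadata sync status is %r" % pg.get("sync_status"))
    if not pg.get("replication_policy"):
        errors.append("no replication policy attached")
    return errors

# A group stuck mid-failover with stale metadata fails all three checks:
print(preflight_errors({"status": "failing_over",
                        "sync_status": "OUT OF SYNC",
                        "replication_policy": None}))
```

In the real product these checks run server-side during the wizard's pre-flight phase; the sketch only shows the gating logic the prerequisites describe.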

Installation

The failover wizard is part of the Trilio Site Recovery Horizon plugin and requires no separate installation step beyond the standard Trilio Site Recovery deployment. If the Site Recovery panel does not appear in your Horizon sidebar, the trilio-horizon-plugin package is most likely not installed or enabled on that Horizon host.

To verify the plugin is active:

# On the Horizon host
pip show trilio-horizon-plugin

If the package is missing, follow your site's Trilio Site Recovery deployment runbook to install and enable the Horizon plugin, then restart the Horizon web server:

systemctl restart apache2   # Debian/Ubuntu
# or
systemctl restart httpd     # RHEL/CentOS

Once the plugin is active, the failover wizard is accessible from any Protection Group detail page — no further installation is needed.


Configuration

The failover wizard surfaces several configuration inputs. Understanding each option's effect prevents mistakes that are difficult to reverse after failover completes.

Failover Type

  • Planned: use when the primary site is healthy and you are migrating workloads intentionally. The engine quiesces VMs on the source site, takes a final consistency snapshot, then brings VMs up on the target site. Minimizes data loss.
  • Unplanned: use when the primary site is unavailable or degraded. The engine proceeds immediately using the most recent replicated snapshot available on the target FlashArray. No attempt is made to contact the source site. Data loss up to your RPO window is expected.

Target Site

The wizard presents only the site currently registered as the secondary for this Protection Group. Primary/secondary designations are workload-relative and swap after each failover — after a failover completes, the site that was secondary becomes the new primary, and the former primary becomes the new secondary.
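The role swap described above can be pictured as a pure function over the site pointers. The dictionary shape here is illustrative only, not the stored Protection Group record:

```python
# Sketch of the workload-relative role swap after a completed failover.
# The dict layout is an assumption for illustration.

def swap_roles(pg: dict) -> dict:
    """The former secondary becomes the new primary, and vice versa."""
    return {**pg,
            "primary_site": pg["secondary_site"],
            "secondary_site": pg["primary_site"]}

pg = {"name": "prod-web-app", "primary_site": "site-a", "secondary_site": "site-b"}
print(swap_roles(pg))
# → {'name': 'prod-web-app', 'primary_site': 'site-b', 'secondary_site': 'site-a'}
```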

Network Mapping

Each source network must be explicitly mapped to a target network. The wizard lists the networks currently attached to member VMs and provides a dropdown for each one populated with networks available on the target site. If a network is left unmapped, VM creation for the affected member will fail during Phase 3 of the operation.
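The wizard's "all rows mapped" rule reduces to a membership check. A minimal sketch, with hypothetical network names:

```python
# Any source network missing from the mapping would fail Phase 3
# (VM recreation) on the target site. Names are hypothetical.

def unmapped_networks(source_networks, mapping):
    """Return the source networks that have no target assigned."""
    return [net for net in source_networks if net not in mapping]

mapping = {"net-primary-web": "net-secondary-web"}
print(unmapped_networks(["net-primary-web", "net-primary-db"], mapping))
# → ['net-primary-db']
```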

Flavor Mapping

Flavors are mapped similarly to networks. If a source flavor exists with the same name on the target site, the wizard pre-populates the mapping. You can override any pre-populated mapping. If a target flavor is not specified for a given source flavor, the engine attempts to use the same flavor name on the target site — this will fail if the flavor does not exist there.
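The resolution rule above (an explicit mapping wins; otherwise fall back to the same name on the target) can be sketched as follows. The function and flavor names are illustrative:

```python
# Illustrative flavor resolution: an explicit mapping takes precedence;
# otherwise the engine tries the same flavor name on the target site,
# which fails if that flavor does not exist there.

def resolve_flavor(source_flavor, mapping, target_flavors):
    target = mapping.get(source_flavor, source_flavor)  # fall back to same name
    if target not in target_flavors:
        raise LookupError("flavor %r does not exist on the target site" % target)
    return target

print(resolve_flavor("m1.large", {}, {"m1.large", "m1.xlarge"}))  # → m1.large
```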

Force Flag

The Force option bypasses pre-flight checks (replication status, sync status). Use this only as a last resort during a declared disaster, when normal validation cannot complete. Forcing failover when replication lag is high increases data loss beyond your configured RPO.
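The extra exposure from forcing is bounded by the age of the last replicated snapshot at the moment failover starts. A back-of-the-envelope calculation (the timestamps are illustrative):

```python
from datetime import datetime

# Potential data loss when forcing failover is roughly the age of the
# last replicated snapshot at the moment the failover starts.
def data_loss_window(last_snapshot: str, failover_start: str) -> float:
    """Seconds between the last replicated snapshot and failover start."""
    delta = (datetime.fromisoformat(failover_start)
             - datetime.fromisoformat(last_snapshot))
    return delta.total_seconds()

print(data_loss_window("2025-11-03T10:00:00", "2025-11-03T10:15:02"))  # → 902.0
```

If that window is much larger than your configured RPO, replication had already fallen behind before the failure, and forcing the failover accepts that additional loss.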


Usage

Initiating a Planned Failover

Use planned failover when the primary site is operational and you want a controlled, low-data-loss migration — for example, before a maintenance window or a site decommission.

  1. Log in to Horizon and authenticate to the primary site (the site where workloads are currently running).
  2. Navigate to Site Recovery → Protection Groups.
  3. Locate the Protection Group you want to fail over and confirm its status is active.
  4. Click the Protection Group name to open the detail page.
  5. Verify the Sync Status panel shows IN SYNC and both sites show the same metadata version number. If the status is OUT OF SYNC, resolve the sync issue before continuing.
  6. Click Failover in the action bar. The failover wizard opens.
  7. Select Planned as the failover type.
  8. Confirm the Target Site shown is correct. The wizard displays the registered secondary site for this Protection Group.
  9. In the Network Mappings table, assign a target network for each source network. All rows must have a target selected before you can proceed.
  10. In the Flavor Mappings table, verify or update the flavor assignments. Pre-populated mappings are suggestions based on name matching — review them for correctness.
  11. Review the Summary panel, which lists the member VMs, their volumes, and the full set of mappings you have configured.
  12. Click Initiate Failover. The wizard submits the operation and redirects you to the Operations progress view.
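Conceptually, the wizard gathers the steps above into a single failover request. The shape below is a hypothetical illustration of that payload; the real protector-api schema may differ:

```python
# Hypothetical request body assembled from the wizard steps above.
# All keys and values are illustrative, including the group ID.
failover_request = {
    "protection_group_id": "pg-example",    # hypothetical ID
    "type": "planned",                      # or "unplanned"
    "target_site": "site-b",
    "network_mappings": {"net-primary-web": "net-secondary-web"},
    "flavor_mappings": {"m1.large": "m1.large"},
    "force": False,
}
print(failover_request["type"])  # → planned
```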

Initiating an Unplanned Failover

Use unplanned failover when the primary site is unreachable or has failed and you need to recover workloads on the secondary site immediately.

  1. Log in to Horizon and authenticate to the secondary site (the DR site where you want workloads to come up).
  2. Navigate to Site Recovery → Protection Groups.
  3. Locate the Protection Group. Because the primary site is unavailable, the group's status may still show active while its sync status is degraded. This is expected.
  4. Click the Protection Group name.
  5. Click Failover.
  6. Select Unplanned as the failover type. A warning banner appears indicating that the operation will not attempt to quiesce the source site and data loss may occur up to the last replicated snapshot.
  7. Complete the network and flavor mappings as described in the planned failover steps above.
  8. If pre-flight checks fail because the primary site is unreachable, you may enable Force to proceed. Read the force warning carefully before enabling it.
  9. Review the summary and click Initiate Failover.

Monitoring Progress

After the wizard submits the operation, you are redirected to the operation detail view under Site Recovery → Operations. The progress bar and step log update in real time:

  • 0–20% (Preparation): Protection Group status set to failing_over; DR operation record created; target site reachability confirmed; latest snapshot identified on FlashArray.
  • 20–60% (Storage failover): replicated snapshot extracted on the FlashArray at the target site; volumes created from the snapshot; volumes managed into Cinder on the target site.
  • 60–90% (Instance recreation): VMs recreated on the target site using stored metadata and the configured network and flavor mappings; boot volume attached; additional volumes attached.
  • 90–100% (Finalization): Protection Group status updated to failed_over; current primary site pointer updated to the target site; failover count incremented; operation marked completed.
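The phase boundaries can be expressed as a simple lookup. How the engine assigns the exact boundary values (whether 20% belongs to Preparation or Storage failover, for instance) is an assumption here:

```python
# Map a progress percentage to the phase described in the table above.
# Boundary ownership is an assumption for illustration.
PHASES = [(20, "Preparation"),
          (60, "Storage failover"),
          (90, "Instance recreation"),
          (100, "Finalization")]

def phase_for(progress: int) -> str:
    for upper, name in PHASES:
        if progress <= upper:
            return name
    raise ValueError("progress must be between 0 and 100")

print(phase_for(45))  # → Storage failover
```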

You do not need to keep the browser tab open. You can return to Site Recovery → Operations at any time to check status. Each operation records steps_completed and steps_failed fields that are visible in the operation detail view, making it straightforward to identify exactly where a failure occurred.
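Given those two fields, pinpointing where an operation stopped is straightforward. The record structure below is an assumption; only the field names come from the operation detail view:

```python
# Locate the first failed step in an operation record. The dict shape is
# illustrative; steps_completed / steps_failed are the fields shown in
# the operation detail view.
def first_failure(op: dict):
    return op["steps_failed"][0] if op["steps_failed"] else None

op = {"steps_completed": ["validate_target_site", "get_latest_snapshot"],
      "steps_failed": ["create_volumes_from_snapshot"]}
print(first_failure(op))  # → create_volumes_from_snapshot
```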


Examples

Example 1: Planned failover of a three-VM web application

Scenario: You are failing over the prod-web-app Protection Group from Site A (primary, Boston) to Site B (secondary, Seattle) before a scheduled maintenance window. All three member VMs (web-server-1, web-server-2, db-server) are healthy and replication is current.

Wizard inputs:

Failover Type:    Planned
Target Site:      site-b (Seattle)

Network Mappings:
  net-primary-web  →  net-secondary-web
  net-primary-db   →  net-secondary-db

Flavor Mappings:
  m1.large   →  m1.large
  m1.xlarge  →  m1.xlarge

Force:  No

Expected operation output (as seen in the Operations detail view):

Operation ID:    op-a1b2c3d4-...
Type:            failover
Status:          completed
Progress:        100%

Steps Completed:
  [✓] validate_target_site
  [✓] get_latest_snapshot
  [✓] quiesce_source_vms          ← planned only
  [✓] create_volumes_from_snapshot (3 volumes)
  [✓] manage_volumes_into_cinder
  [✓] recreate_instance:web-server-1
  [✓] recreate_instance:web-server-2
  [✓] recreate_instance:db-server
  [✓] update_protection_group_state

Steps Failed:    (none)

Started:         2025-11-03T02:00:05Z
Completed:       2025-11-03T02:08:43Z
Duration:        8m 38s

Protection Group state after completion:

Name:                    prod-web-app
Status:                  failed_over
Current Primary Site:    site-b
Failover Count:          1
Last Failover At:        2025-11-03T02:08:43Z

Example 2: Unplanned failover with force enabled

Scenario: Site A has gone offline unexpectedly. You are logged in to Horizon on Site B. The Protection Group sync status shows UNREACHABLE for the peer site. You need to recover workloads immediately.

Wizard inputs:

Failover Type:    Unplanned
Target Site:      site-b  (you are already on site-b)

Network Mappings:
  net-primary-web  →  net-secondary-web
  net-primary-db   →  net-secondary-db

Flavor Mappings:
  m1.large   →  m1.large
  m1.xlarge  →  m1.xlarge

Force:  Yes   ← required because peer site pre-flight check cannot complete

Expected operation output:

Operation ID:    op-e5f6g7h8-...
Type:            failover
Status:          completed
Progress:        100%

Steps Completed:
  [✓] validate_target_site
  [✓] get_latest_snapshot         ← uses most recent replicated snapshot
  [✗] quiesce_source_vms          ← skipped (unplanned + force)
  [✓] create_volumes_from_snapshot (3 volumes)
  [✓] manage_volumes_into_cinder
  [✓] recreate_instance:web-server-1
  [✓] recreate_instance:web-server-2
  [✓] recreate_instance:db-server
  [✓] update_protection_group_state

Warning: Source site sync status was UNREACHABLE at operation start.
         Data loss up to last replicated snapshot is possible.

Started:         2025-11-03T10:15:02Z
Completed:       2025-11-03T10:22:17Z

Example 3: Failover blocked due to out-of-sync metadata

Scenario: You attempt to initiate failover but the wizard shows a pre-flight warning and blocks submission.

Error shown in wizard:

⚠  Pre-flight check failed

Metadata sync status: OUT OF SYNC
  Local version:   8
  Remote version:  7
  Last sync:       5 minutes ago (FAILED)

Failover is blocked because the remote site metadata is behind by 1 version.
This prevents metadata divergence between sites.

Recommended action:
  1. Check remote site connectivity.
  2. Navigate to Protection Group → Actions → Force Sync.
  3. Re-attempt failover once sync status shows IN SYNC.

If the remote site is unreachable and you must proceed,
enable Force to override this check.

Troubleshooting

Failover wizard shows no target site in the dropdown

Symptom: The Target Site dropdown in the failover wizard is empty or greyed out.

Likely cause: No secondary site is registered for this Protection Group, or the secondary site record has status unreachable or error.

Fix:

  1. Navigate to Site Recovery → Sites and verify the secondary site appears and has status active.
  2. If the site is listed but shows unreachable, click Validate on the site record to re-test connectivity. Resolve any network or credential issues reported.
  3. If no secondary site is registered, return to the Protection Group configuration and associate a secondary site before attempting failover.

Failover operation fails at "recreate_instance" step

Symptom: The operation progress bar stalls between 60% and 90%, and the step log shows [✗] recreate_instance:<vm-name> with an error such as No valid host found or Network <uuid> not found.

Likely cause: The network or flavor mapping provided in the wizard does not exist on the target site, or the target site does not have sufficient compute capacity.

Fix:

  1. Open the operation detail view and expand the failed step to read the full error message.
  2. For Network not found: confirm the target network UUID in your mapping is correct for the target site. Network UUIDs differ between sites — do not copy UUIDs from the primary site.
  3. For No valid host found: verify the target flavor exists on the target site (openstack flavor list scoped to the target site) and that the target site has available compute resources.
  4. Correct the mappings and re-initiate the failover. If volumes were already managed into Cinder during the failed attempt, the engine will detect them and skip re-creation — you will need to clean up orphaned volumes manually before retrying.
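The two error signatures above can be triaged mechanically. This sketch hard-codes the remediation hints from the steps above; it is an illustration, not part of the product:

```python
# Illustrative triage for the recreate_instance failures described above.
def triage(error: str) -> str:
    if "Network" in error and "not found" in error:
        return "fix the network mapping: use target-site network UUIDs"
    if "No valid host" in error:
        return "verify the target flavor exists and compute capacity is available"
    return "inspect the full step log in the operation detail view"

print(triage("No valid host was found"))
```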

Operation stuck at 0% with status pending

Symptom: The operation appears in the Operations list with status pending and does not advance after several minutes.

Likely cause: The protector-engine service on the target site is not running or is unable to reach the message queue.

Fix:

  1. SSH to the target site's controller and check the engine service:
    systemctl status protector-engine
    journalctl -u protector-engine -n 100
    
  2. If the service is stopped, start it and watch for errors in the log.
  3. If the engine is running but the operation remains pending, check database connectivity and RabbitMQ status on the target site.

Wizard blocks failover with "Metadata sync status: UNREACHABLE"

Symptom: The pre-flight check in the wizard shows the peer site as unreachable and will not allow you to proceed without enabling Force.

Likely cause: During a planned failover, the primary site cannot reach the secondary site to push a final metadata sync. During an unplanned failover initiated from the secondary site, this is expected if the primary site has failed.

Fix (planned failover):

  1. Verify that the secondary site's protector-api endpoint is reachable from the primary site on port 8788.
  2. Check firewall rules between sites.
  3. Once connectivity is restored, use Force Sync on the Protection Group before re-attempting the planned failover.

Fix (unplanned failover): This is the expected state when the primary site is down. Enable Force in the wizard to proceed. Acknowledge that data loss up to the last replicated snapshot is possible.


Volumes not appearing on target site after failover completes

Symptom: The operation shows completed at 100%, but the member VMs on the target site are in ERROR state or are missing their volumes.

Likely cause: Cinder volume_manage permission is not granted to the member role on the target site, causing the volume import step to succeed at the Pure Storage layer but fail when registering volumes in Cinder.

Fix:

  1. On the target site, verify that policy.yaml for Cinder includes:
    "volume_extension:volume_manage": "rule:admin_or_owner"
    "volume_extension:volume_unmanage": "rule:admin_or_owner"
    "volume_extension:services:index": "rule:admin_or_owner"
    
  2. Apply the policy change and reconfigure Cinder:
    kolla-ansible -i inventory reconfigure -t cinder
    
  3. Clean up any orphaned volumes on the FlashArray (volumes that were promoted but not imported into Cinder), then re-initiate the failover.

Protection Group status remains failing_over after a failed operation

Symptom: A failover operation failed and the Protection Group status is stuck at failing_over, preventing new operations from being submitted.

Likely cause: The operation failed mid-execution and the engine was unable to complete the state rollback.

Fix:

  1. Review the failed operation's steps_failed output in Site Recovery → Operations to understand what succeeded and what did not.
  2. Manually clean up any partial resources on the target site (unmanage imported volumes, delete partially-created VMs).
  3. Use the CLI to reset the Protection Group status:
    openstack protector protection-group reset-state <pg-id> --state active
    
  4. Re-initiate the failover once the protection group returns to active and the environment is in a known-clean state.