Site Recovery for OpenStack
Guide

Failback

Returning workloads to the original primary site after a failover


Overview

Failback returns your workloads to their original primary site after a failover event. Once the primary site has recovered and replication is re-established, you execute a failback to restore the Protection Group to its pre-failover state — VMs running on the original primary site with replication flowing back to the secondary. This guide covers the full failback workflow: verifying primary site recovery, checking metadata synchronization, reversing replication direction, recreating instances on the primary site, and confirming the Protection Group returns to active status.


Prerequisites

Before initiating a failback, verify the following conditions are met:

  • Trilio Site Recovery deployed on both sites — protector-api and protector-engine must be running and reachable on both the original primary site and the current active (DR) site.
  • Primary site has recovered — The original primary site's Nova, Cinder, Neutron, and Keystone endpoints must be accessible. Validate this with openstack protector site validate <primary-site-name>.
  • Replication re-established — The Pure Storage FlashArray on the primary site must be operational and replication from the current active site (secondary) back to the primary must be functioning or ready to be reversed.
  • Metadata in sync — The Protection Group metadata must be synchronized between both sites before failback can proceed. Run openstack protector protection-group sync-status <pg-name> to confirm. If the sites are out of sync, force a sync first (see Troubleshooting).
  • Protection Group in failed_over or failed_over_partial status — Failback is only valid from a Protection Group that has previously been failed over.
  • Resource mappings prepared — You must know the network mapping from the current active site back to the primary site (the reverse of the mappings used during failover). Flavor mappings are required if the primary site uses different flavor IDs.
  • OSC CLI plugin installed — The protectorclient OSC plugin must be installed and configured with credentials for the current active site (where VMs are now running).
  • Cinder policy — Both sites must have volume_extension:volume_manage and volume_extension:volume_unmanage granted to the member role, as volumes will be unmanaged on the current active site and managed into Cinder on the primary site during failback.
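The checks above can be scripted into a single preflight gate. A minimal sketch in Python (the command list and the `preflight` helper are illustrative, not part of the protectorclient plugin; the runner is injectable so the logic can be exercised without a live cloud):

```python
import subprocess

# Preflight commands from the prerequisites list; names (site-a,
# prod-web-app) are placeholders for your own site and group names.
PREFLIGHT = [
    ["openstack", "protector", "site", "validate", "site-a"],
    ["openstack", "protector", "protection-group", "sync-status", "prod-web-app"],
    ["openstack", "protector", "protection-group", "show", "prod-web-app"],
]

def run_cli(cmd):
    """Run a CLI command; return (exit_code, stdout)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout

def preflight(commands, run=run_cli):
    """Return (command, ok) pairs; ok is True when the exit code is 0."""
    return [(" ".join(cmd), run(cmd)[0] == 0) for cmd in commands]
```

Run it before any failback attempt; a single failing check means the environment is not ready and the failback should not be initiated.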

Installation

Failback is an operational workflow — there is no separate software to install. The openstack protector CLI commands used here are part of the protectorclient OSC plugin, which is installed as part of the standard Trilio Site Recovery deployment.

If the plugin is not already installed, install it now on the host from which you will coordinate the failback:

pip install python-protectorclient

Verify the plugin is available:

openstack protector --help

You should see protection-group, operation, and site command groups listed in the output. If you see No such command, the plugin is not installed or not on the active Python path.


Configuration

Failback behavior is controlled by the options you pass to the failback action. The following parameters influence how the operation runs:

  • --reverse-replication — After failback completes, reverses the replication direction so the primary site is once again replicating to the secondary. Default: false. Strongly recommended for production failbacks; without this, the Protection Group is restored but replication to the DR site is not automatically re-enabled.
  • --network-mapping — Maps network UUIDs from the current active site (where VMs are running post-failover) back to the corresponding networks on the primary site. Required; must be provided as <active-site-net-uuid>=<primary-site-net-uuid> pairs.
  • --flavor-mapping — Maps flavor IDs from the current active site to equivalent flavors on the primary site. Optional; required only if flavor IDs differ between sites.
  • --force — Skips validation steps and forces the failback even if some preflight checks fail. Default: false. Use only if you understand the risks: a forced failback does not perform a final quiesce or snapshot before shutting down VMs on the active site.
  • --wait — Blocks the CLI until the failback operation completes, then prints the final result. Default: false. Recommended for scripted or automated workflows.
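As an illustration of how these options combine, the following sketch assembles a failback invocation from mapping dicts. The `failback_command` helper is hypothetical, not part of the plugin; it only builds the argv list shown in the Usage section below:

```python
def failback_command(pg_name, network_mapping, flavor_mapping=None,
                     reverse_replication=True, wait=False):
    """Assemble the argv for a failback invocation from mapping dicts.

    Mapping pairs use the <active-site-id>=<primary-site-id> form
    described in the parameter list above.
    """
    cmd = ["openstack", "protector", "protection-group", "failback", pg_name]
    if reverse_replication:
        cmd.append("--reverse-replication")
    # Network mappings are mandatory for failback.
    cmd.append("--network-mapping")
    cmd += [f"{src}={dst}" for src, dst in network_mapping.items()]
    # Flavor mappings are only needed when flavor IDs differ between sites.
    if flavor_mapping:
        cmd.append("--flavor-mapping")
        cmd += [f"{src}={dst}" for src, dst in flavor_mapping.items()]
    if wait:
        cmd.append("--wait")
    return cmd
```

The returned list can be passed straight to `subprocess.run` in automation, avoiding shell-quoting mistakes with the mapping pairs.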

Protection Group Status Transitions During Failback

Understanding the status progression helps you monitor progress and diagnose issues:

  • failed_over — Starting state: VMs running on the DR site, primary site recovered.
  • failing_back — Failback operation is in progress.
  • active — Failback completed: VMs running on the original primary site, replication active.
  • error — Failback failed. Inspect the operation record for details.
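The progression above can be captured as a small lookup, useful in monitoring scripts that want to flag an unexpected status. This is an illustrative sketch; `TRANSITIONS` and `can_transition` are not part of the product:

```python
# Allowed Protection Group status transitions during failback,
# mirroring the status list above.
TRANSITIONS = {
    "failed_over": {"failing_back"},
    "failed_over_partial": {"failing_back"},
    "failing_back": {"active", "error"},
}

def can_transition(current, target):
    """True if moving from `current` to `target` is an expected step."""
    return target in TRANSITIONS.get(current, set())
```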

Metadata Sync Behavior During Failback

Failback requires both sites to be reachable. The Trilio Site Recovery service blocks modifications to a Protection Group if the remote site is unreachable, to prevent metadata divergence. This includes the failback operation itself. If your primary site is unreachable when you attempt failback, the operation will be rejected — you must wait for connectivity to be restored before proceeding.
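The gating rules can be sketched as a small decision function. This is illustrative only; the real checks are enforced server-side by the protector services:

```python
def may_failback(remote_reachable, local_version, remote_version):
    """Decide whether a failback may proceed, mirroring the gating rules:
    the remote (primary) site must be reachable, and metadata versions
    must match on both sites. Returns (ok, reason)."""
    if not remote_reachable:
        return False, "remote site unreachable - wait for connectivity"
    if local_version != remote_version:
        return False, "metadata out of sync - run sync-force first"
    return True, "ok"
```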


Usage

1. Authenticate to the Current Active Site

After a failover, your VMs are running on the DR site. All failback operations must be initiated from the site where workloads are currently active — the site that is now acting as primary.

# Source credentials for the current active site (your DR / secondary site)
source ~/site-b-openrc
export OS_AUTH_URL=http://site-b:5000/v3

2. Verify Primary Site Recovery

Before starting failback, confirm that the original primary site is reachable and its services are healthy:

openstack protector site validate site-a

Expected output indicates the site is active and all service endpoints (Keystone, Nova, Cinder, Neutron) are reachable. If this command fails, the primary site is not ready for failback — do not proceed.

3. Check Protection Group Status

openstack protector protection-group show prod-web-app

Confirm the Protection Group shows:

  • status: failed_over (or failed_over_partial)
  • current_primary_site: site-b (the DR site, where VMs are running)

4. Check Metadata Synchronization

openstack protector protection-group sync-status prod-web-app

Both sites must be at the same metadata version before failback proceeds. If you see OUT OF SYNC, resolve it before continuing (see Troubleshooting).

5. Execute the Failback

Initiate the failback with reverse network mappings (the opposite direction from the original failover):

openstack protector protection-group failback prod-web-app \
  --reverse-replication \
  --network-mapping \
    net-secondary-web=net-primary-web \
    net-secondary-db=net-primary-db

If flavor IDs differ between your sites, also include:

  --flavor-mapping \
    m2.large=m1.large
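Mapping arguments follow the `<source>=<target>` form throughout. In automation it is worth validating the pairs before passing them to the CLI; a small sketch (the `parse_mappings` helper is hypothetical):

```python
def parse_mappings(pairs):
    """Parse CLI-style '<source>=<target>' mapping pairs into a dict,
    rejecting malformed or empty entries."""
    mappings = {}
    for pair in pairs:
        src, sep, dst = pair.partition("=")
        if not sep or not src or not dst:
            raise ValueError(f"malformed mapping: {pair!r}")
        mappings[src] = dst
    return mappings
```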

6. Monitor Progress

The failback runs asynchronously. Use the returned operation ID to track progress:

# Monitor in real time
watch openstack protector operation show <op-id>

# Or list all recent operations
openstack protector operation list

Alternatively, pass --wait to the failback command to block until the operation completes.
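For automation, the polling shown above can be wrapped in a loop. A sketch with the status fetcher injectable so the loop can be exercised without a deployment; in practice `fetch` would wrap `openstack protector operation show <op-id>`, and the helper itself is not part of the plugin:

```python
import time

def wait_for_operation(op_id, fetch, interval=5.0, timeout=1800.0,
                       sleep=time.sleep):
    """Poll an operation until it reaches a terminal status.

    `fetch(op_id)` must return the current status string
    ("running", "completed", or "error")."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch(op_id)
        if status in ("completed", "error"):
            return status
        sleep(interval)
    raise TimeoutError(f"operation {op_id} still running after {timeout}s")
```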

7. Verify Successful Failback

Once the operation status shows completed:

# Confirm Protection Group is active and pointing to the original primary
openstack protector protection-group show prod-web-app
# Expected: status=active, current_primary_site=site-a

# Confirm VMs are running on the primary site
source ~/site-a-openrc
openstack server list --project production-project

# Confirm metadata is in sync on both sites
openstack protector protection-group sync-status prod-web-app

Examples

Example 1: Standard Planned Failback After a Successful Failover

This is the most common scenario: you failed over to Site B during a maintenance window or disaster, Site A has recovered, and you want to return workloads to Site A.

# Authenticate to the currently active site (Site B)
source ~/site-b-openrc

# Validate the original primary site is healthy
openstack protector site validate site-a

Expected output:

+------------------+--------+
| Field            | Value  |
+------------------+--------+
| name             | site-a |
| status           | active |
| keystone         | ok     |
| nova             | ok     |
| cinder           | ok     |
| neutron          | ok     |
+------------------+--------+
# Confirm Protection Group is in failed_over state
openstack protector protection-group show prod-web-app

Expected output (key fields):

+---------------------------+------------------------------------------+
| Field                     | Value                                    |
+---------------------------+------------------------------------------+
| status                    | failed_over                              |
| current_primary_site      | site-b                                   |
| failover_count            | 1                                        |
+---------------------------+------------------------------------------+
# Execute the failback with reverse replication enabled
openstack protector protection-group failback prod-web-app \
  --reverse-replication \
  --network-mapping \
    net-secondary-web=net-primary-web \
    net-secondary-db=net-primary-db \
  --flavor-mapping \
    m2.large=m1.large \
  --wait

Expected output:

+------------------------+--------------------------------------+
| Field                  | Value                                |
+------------------------+--------------------------------------+
| operation_id           | op-789abc12-...                      |
| operation_type         | failback                             |
| status                 | completed                            |
| progress               | 100                                  |
| source_site            | site-b                               |
| target_site            | site-a                               |
| instances_created      | 3                                    |
| instances_failed       | 0                                    |
+------------------------+--------------------------------------+
# Switch context to Site A and verify workloads
source ~/site-a-openrc
openstack server list --project production-project

Expected output (VMs running on Site A):

+--------------------------------------+--------------+--------+
| ID                                   | Name         | Status |
+--------------------------------------+--------------+--------+
| aabbcc11-...                         | web-server-1 | ACTIVE |
| ddeeff22-...                         | web-server-2 | ACTIVE |
| 99887766-...                         | db-server    | ACTIVE |
+--------------------------------------+--------------+--------+

Example 2: Checking Metadata Sync Before Failback and Force-Syncing If Needed

If the primary site was unreachable while you were operating on the DR site, metadata may have diverged. Run this check before attempting failback.

source ~/site-b-openrc

# Check sync status
openstack protector protection-group sync-status prod-web-app

Out-of-sync output:

Sync Status: āŒ OUT OF SYNC

Local Metadata:
  Version: 4
  Current Site: Site B
  Last Modified: 2025-11-03T14:35:00Z

Remote Sync:
  Status: FAILED
  Remote Version: 3
  Last Sync: 2025-11-03T10:00:00Z (4 hours ago)
  Error: Connection timeout

Action Required:
  1. Check remote site connectivity
  2. Force sync once remote site is available
# Force sync to push current metadata to Site A
openstack protector protection-group sync-force prod-web-app

Expected output:

Force Sync Initiated...

Checking remote site connectivity...
  ✅ Site A is reachable

Syncing metadata (version 4)...
  Gathering current metadata... ✓
  Calculating checksum... ✓
  Pushing to Site A... ✓

Remote Site Response:
  Status: success
  Version: 4
  Duration: 312ms

✅ Sync completed successfully
Both sites now at version 4
# Now proceed with failback
openstack protector protection-group failback prod-web-app \
  --reverse-replication \
  --network-mapping \
    net-secondary-web=net-primary-web \
    net-secondary-db=net-primary-db

Example 3: Monitoring a Failback Operation in Progress

If you did not pass --wait, monitor the operation using its ID:

# Poll the operation until complete
watch -n 5 openstack protector operation show op-789abc12-...

In-progress output:

+------------------+---------------------------------------------+
| Field            | Value                                       |
+------------------+---------------------------------------------+
| operation_id     | op-789abc12-...                             |
| operation_type   | failback                                    |
| status           | running                                     |
| progress         | 55                                          |
| steps_completed  | ["quiesce_vms", "final_snapshot",           |
|                  |  "promote_storage_primary"]                 |
| steps_failed     | []                                          |
| started_at       | 2025-11-03T15:00:00Z                        |
+------------------+---------------------------------------------+

Completed output:

+------------------+---------------------------------------------+
| Field            | Value                                       |
+------------------+---------------------------------------------+
| operation_id     | op-789abc12-...                             |
| operation_type   | failback                                    |
| status           | completed                                   |
| progress         | 100                                         |
| started_at       | 2025-11-03T15:00:00Z                        |
| completed_at     | 2025-11-03T15:08:43Z                        |
+------------------+---------------------------------------------+

Troubleshooting

Issue 1: Failback Blocked — Remote Site Unreachable

Symptom:

Error: Cannot initiate failback - remote site 'site-a' is unreachable.
Modifications to a Protection Group require both sites to be reachable.

Cause: The original primary site (Site A) cannot be reached from the current active site. Because metadata sync requires both sites, the service blocks the operation to prevent divergence.

Fix:

  1. Verify the primary site is actually recovered: check that Nova, Cinder, Neutron, and Keystone endpoints on Site A respond.
  2. Check network connectivity between sites: ensure port 8788 (protector API) and the Keystone port (5000) are reachable from Site B to Site A.
  3. Run openstack protector site validate site-a — if this fails, the site is not ready.
  4. Once connectivity is restored, run openstack protector protection-group sync-force <pg-name> to re-establish metadata sync before retrying the failback.

Issue 2: Metadata Out of Sync — Versions Do Not Match

Symptom:

Sync Status: āŒ OUT OF SYNC
Version mismatch (local: 5, remote: 3)

Cause: Operations were performed on the current active site while the primary site was unreachable, advancing the local metadata version without syncing to the peer.

Fix:

# Check which site has the higher (authoritative) version
openstack protector protection-group sync-status <pg-name>

# Force sync from the current active site (higher version) to primary
openstack protector protection-group sync-force <pg-name>

This pushes the current active site's metadata to the primary site, bringing both into alignment. Verify both sites show the same version before proceeding with failback.


Issue 3: Protection Group Not in a Failed-Over State

Symptom:

Error: Protection Group 'prod-web-app' is not in a failed_over state.
Current status: active

Cause: Failback can only be initiated on a Protection Group with status failed_over or failed_over_partial. If the status is active, the group is already running on its configured primary site and no failback is needed.

Fix: Verify the current state:

openstack protector protection-group show prod-web-app

If current_primary_site matches the site where your VMs are running and status is active, your workloads are already on the primary site. No failback is required.


Issue 4: Failback Fails — Volume Manage Operation Denied

Symptom:

Error: Failed to manage volume on site-a: Policy does not allow
volume_extension:volume_manage to be performed.

Cause: The Cinder policy on the primary site does not allow the member role to call the volume manage API. During failback, Trilio Site Recovery creates volumes from replicated snapshots and imports them into Cinder using the manage API.

Fix: On the primary site, add the following to Cinder's policy file (/etc/cinder/policy.yaml) and reconfigure Cinder:

"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"

For Kolla-Ansible deployments:

kolla-ansible -i inventory reconfigure -t cinder

After reconfiguring, retry the failback.


Issue 5: Instances Not Created on Primary Site — Flavor or Network Not Found

Symptom:

Error: Failed to create instance 'web-server-1' on site-a:
Flavor 'm2.large' not found.

or

Error: Network 'net-secondary-web' not found on site-a.

Cause: The flavor or network IDs from the current active site (Site B) were not mapped to their equivalents on the primary site (Site A). The failback command requires explicit mappings when IDs differ between sites.

Fix: List the available flavors and networks on the primary site and construct the correct mappings:

source ~/site-a-openrc
openstack flavor list
openstack network list --project production-project

Then re-run the failback with the correct --flavor-mapping and --network-mapping values:

openstack protector protection-group failback prod-web-app \
  --reverse-replication \
  --network-mapping \
    net-secondary-web=net-primary-web \
    net-secondary-db=net-primary-db \
  --flavor-mapping \
    m2.large=m1.large

Issue 6: Failback Operation Stuck at running / No Progress

Symptom: The operation has been in running state for longer than expected and the progress field has not changed.

Cause: The protector-engine service on the current active site may have encountered an error interacting with either the Pure Storage array or the primary site's OpenStack APIs.

Fix:

  1. Check the protector-engine logs on the current active site for errors:
journalctl -u openstack-protector-engine -n 200 --no-pager
  2. Inspect the full operation record for error_message and steps_failed:
openstack protector operation show <op-id>
  3. Verify Pure Storage replication status independently — confirm that the replication connection between FlashArray B and FlashArray A is healthy and that snapshots are visible on Site A's array.
  4. If the engine process has crashed, restart it and re-check operation status:
systemctl restart openstack-protector-engine
openstack protector operation show <op-id>