Failback
Returning workloads to the original primary site after a failover
Failback returns your workloads to their original primary site after a failover event. Once the primary site has recovered and replication is re-established, you execute a failback to restore the Protection Group to its pre-failover state: VMs running on the original primary site with replication flowing back to the secondary. This guide covers the full failback workflow: verifying primary site recovery, checking metadata synchronization, reversing replication direction, recreating instances on the primary site, and confirming the Protection Group returns to active status.
Before initiating a failback, verify the following conditions are met:
- Trilio Site Recovery deployed on both sites: `protector-api` and `protector-engine` must be running and reachable on both the original primary site and the current active (DR) site.
- Primary site has recovered: The original primary site's Nova, Cinder, Neutron, and Keystone endpoints must be accessible. Validate this with `openstack protector site validate <primary-site-name>`.
- Replication re-established: The Pure Storage FlashArray on the primary site must be operational, and replication from the current active site (secondary) back to the primary must be functioning or ready to be reversed.
- Metadata in sync: The Protection Group metadata must be synchronized between both sites before failback can proceed. Run `openstack protector protection-group sync-status <pg-name>` to confirm. If the sites are out of sync, force a sync first (see Troubleshooting).
- Protection Group in `failed_over` or `failed_over_partial` status: Failback is only valid for a Protection Group that has previously been failed over.
- Resource mappings prepared: You must know the network mapping from the current active site back to the primary site (the reverse of the mappings used during failover). Flavor mappings are required if the primary site uses different flavor IDs.
- OSC CLI plugin installed: The `protectorclient` OSC plugin must be installed and configured with credentials for the current active site (where the VMs are now running).
- Cinder policy: Both sites must grant `volume_extension:volume_manage` and `volume_extension:volume_unmanage` to the `member` role, because volumes are unmanaged on the current active site and managed into Cinder on the primary site during failback.
Failback is an operational workflow: there is no separate software to install. The `openstack protector` CLI commands used here are part of the `protectorclient` OSC plugin, which is installed as part of the standard Trilio Site Recovery deployment.
If the plugin is not already installed, install it now on the host from which you will coordinate the failback:
pip install python-protectorclient
Verify the plugin is available:
openstack protector --help
You should see `protection-group`, `operation`, and `site` command groups listed in the output. If you see `No such command`, the plugin is not installed or is not on the active Python path.
Failback behavior is controlled by the options you pass to the failback action. The following parameters influence how the operation runs:
| Parameter | Description | Default | Notes |
|---|---|---|---|
| `--reverse-replication` | After failback completes, reverses the replication direction so the primary site is once again replicating to the secondary. | false | Strongly recommended for production failbacks. Without this, the Protection Group is restored but replication to the DR site is not automatically re-enabled. |
| `--network-mapping` | Maps network UUIDs from the current active site (where VMs are running post-failover) back to the corresponding networks on the primary site. | None (required) | Must be provided as `<active-site-net-uuid>=<primary-site-net-uuid>` pairs. |
| `--flavor-mapping` | Maps flavor IDs from the current active site to equivalent flavors on the primary site. | None (optional) | Required only if flavor IDs differ between sites. |
| `--force` | Skips validation steps and forces the failback even if some preflight checks fail. | false | Use only if you understand the risks. A forced failback does not perform a final quiesce or snapshot before shutting down VMs on the active site. |
| `--wait` | Blocks the CLI until the failback operation completes, then prints the final result. | false | Recommended for scripted or automated workflows. |
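In scripted failbacks the mapping parameters are easier to keep correct as shell variables. The sketch below is a hypothetical helper, not part of the `protectorclient` plugin: it assembles the argument list from mapping tables and prints the resulting command as a dry run (remove the `echo` to actually execute it).

```shell
#!/usr/bin/env bash
# Sketch: build the failback argument list from mapping tables.
# NET_MAP, FLAVOR_MAP, and build_failback_args are hypothetical helpers,
# not part of the protectorclient plugin.
declare -A NET_MAP=(
  [net-secondary-web]=net-primary-web
  [net-secondary-db]=net-primary-db
)
declare -A FLAVOR_MAP=(
  [m2.large]=m1.large
)

build_failback_args() {
  local pg="$1"
  local k
  local args=("$pg" --reverse-replication --wait --network-mapping)
  for k in "${!NET_MAP[@]}"; do
    args+=("$k=${NET_MAP[$k]}")
  done
  # Flavor mappings are only needed when flavor IDs differ between sites.
  if ((${#FLAVOR_MAP[@]} > 0)); then
    args+=(--flavor-mapping)
    for k in "${!FLAVOR_MAP[@]}"; do
      args+=("$k=${FLAVOR_MAP[$k]}")
    done
  fi
  printf '%s\n' "${args[@]}"
}

# Dry run: print the command this would produce instead of executing it.
echo openstack protector protection-group failback \
  $(build_failback_args prod-web-app)
```

Keeping the mappings in one place like this avoids the most common failback mistake: passing the failover-direction mappings instead of their reverse.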
Protection Group Status Transitions During Failback
Understanding the status progression helps you monitor progress and diagnose issues:
| Status | Meaning |
|---|---|
| `failed_over` | Starting state: VMs running on the DR site, primary site recovered. |
| `failing_back` | Failback operation is in progress. |
| `active` | Failback completed: VMs running on the original primary site, replication active. |
| `error` | Failback failed. Inspect the operation record for details. |
Metadata Sync Behavior During Failback
Failback requires both sites to be reachable. The Trilio Site Recovery service blocks modifications to a Protection Group when the remote site is unreachable, to prevent metadata divergence, and this includes the failback operation itself. If your primary site is unreachable when you attempt failback, the operation will be rejected; you must wait for connectivity to be restored before proceeding.
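Because the failback request is rejected outright while the peer is unreachable, a practical pattern is to poll site validation until it succeeds before doing anything else. The generic retry helper below is a sketch; the retry count and interval shown are arbitrary choices, not product defaults.

```shell
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or a retry limit is reached.
# Returns 0 on the first success, 1 if all attempts fail.
retry_until() {
  local max="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= max; i++)); do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$max failed; retrying in ${delay}s" >&2
    sleep "$delay"
  done
  return 1
}

# Example: wait up to ~15 minutes for the primary site to become reachable.
# retry_until 30 30 openstack protector site validate site-a
```

Once the validation succeeds, continue with the sync-status check and the failback itself.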
1. Authenticate to the Current Active Site
After a failover, your VMs are running on the DR site. All failback operations must be initiated from the site where workloads are currently active: the site that is now acting as primary.
# Source credentials for the current active site (your DR / secondary site)
source ~/site-b-openrc
export OS_AUTH_URL=http://site-b:5000/v3
2. Verify Primary Site Recovery
Before starting failback, confirm that the original primary site is reachable and its services are healthy:
openstack protector site validate site-a
Expected output indicates the site is active and all service endpoints (Keystone, Nova, Cinder, Neutron) are reachable. If this command fails, the primary site is not ready for failback; do not proceed.
3. Check Protection Group Status
openstack protector protection-group show prod-web-app
Confirm the Protection Group shows:
- `status: failed_over` (or `failed_over_partial`)
- `current_primary_site: site-b` (the DR site, where VMs are running)
4. Check Metadata Synchronization
openstack protector protection-group sync-status prod-web-app
Both sites must be at the same metadata version before failback proceeds. If you see OUT OF SYNC, resolve it before continuing (see Troubleshooting).
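In an automated workflow it is worth failing fast on an out-of-sync group rather than letting the failback request be rejected later. The guard below is a minimal sketch that assumes the sync-status output contains the literal string `OUT OF SYNC` when divergence is detected, as in the sample output shown later in this guide.

```shell
#!/usr/bin/env bash
# Sketch: abort a scripted failback when metadata is out of sync.
# Assumes the sync-status output contains "OUT OF SYNC" on divergence
# (hypothetical helper, not part of the protectorclient plugin).
sync_ok() {
  # $1: captured output of `protection-group sync-status`
  ! grep -q 'OUT OF SYNC' <<<"$1"
}

# Usage in a failback script:
#   status="$(openstack protector protection-group sync-status prod-web-app)"
#   sync_ok "$status" || { echo "out of sync; force a sync first" >&2; exit 1; }
```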
5. Execute the Failback
Initiate the failback with reverse network mappings (the opposite direction from the original failover):
openstack protector protection-group failback prod-web-app \
--reverse-replication \
--network-mapping \
net-secondary-web=net-primary-web \
net-secondary-db=net-primary-db
If flavor IDs differ between your sites, also include:
--flavor-mapping \
m2.large=m1.large
6. Monitor Progress
The failback runs asynchronously. Use the returned operation ID to track progress:
# Monitor in real time
watch openstack protector operation show <op-id>
# Or list all recent operations
openstack protector operation list
Alternatively, pass --wait to the failback command to block until the operation completes.
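If you need more control than `watch` gives (for example, to branch on the terminal state in a script), you can poll the operation yourself. The field extraction below is a sketch that parses the default table output shown in the examples; if the plugin supports the standard OSC formatting options `-f value -c status` (an assumption, not confirmed here), those would be cleaner.

```shell
#!/usr/bin/env bash
# Sketch: extract the status field from an OSC-style table and poll until done.
op_status() {
  # $1: captured output of `openstack protector operation show <op-id>`
  awk -F'|' '$2 ~ /^[[:space:]]*status[[:space:]]*$/ {
    gsub(/^[[:space:]]+|[[:space:]]+$/, "", $3); print $3
  }' <<<"$1"
}

# Polling loop (uncomment to run against a real operation ID):
# while :; do
#   s="$(op_status "$(openstack protector operation show "$OP_ID")")"
#   echo "status: $s"
#   case "$s" in completed|error) break ;; esac
#   sleep 10
# done
```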
7. Verify Successful Failback
Once the operation status shows completed:
# Confirm Protection Group is active and pointing to the original primary
openstack protector protection-group show prod-web-app
# Expected: status=active, current_primary_site=site-a
# Confirm VMs are running on the primary site
source ~/site-a-openrc
openstack server list --project production-project
# Confirm metadata is in sync on both sites
openstack protector protection-group sync-status prod-web-app
Example 1: Standard Planned Failback After a Successful Failover
This is the most common scenario: you failed over to Site B during a maintenance window or disaster, Site A has recovered, and you want to return workloads to Site A.
# Authenticate to the currently active site (Site B)
source ~/site-b-openrc
# Validate the original primary site is healthy
openstack protector site validate site-a
Expected output:
+------------------+--------+
| Field | Value |
+------------------+--------+
| name | site-a |
| status | active |
| keystone | ok |
| nova | ok |
| cinder | ok |
| neutron | ok |
+------------------+--------+
# Confirm Protection Group is in failed_over state
openstack protector protection-group show prod-web-app
Expected output (key fields):
+---------------------------+------------------------------------------+
| Field | Value |
+---------------------------+------------------------------------------+
| status | failed_over |
| current_primary_site | site-b |
| failover_count | 1 |
+---------------------------+------------------------------------------+
# Execute the failback with reverse replication enabled
openstack protector protection-group failback prod-web-app \
--reverse-replication \
--network-mapping \
net-secondary-web=net-primary-web \
net-secondary-db=net-primary-db \
--flavor-mapping \
m2.large=m1.large \
--wait
Expected output:
+------------------------+--------------------------------------+
| Field | Value |
+------------------------+--------------------------------------+
| operation_id | op-789abc12-... |
| operation_type | failback |
| status | completed |
| progress | 100 |
| source_site | site-b |
| target_site | site-a |
| instances_created | 3 |
| instances_failed | 0 |
+------------------------+--------------------------------------+
# Switch context to Site A and verify workloads
source ~/site-a-openrc
openstack server list --project production-project
Expected output (VMs running on Site A):
+--------------------------------------+--------------+--------+
| ID | Name | Status |
+--------------------------------------+--------------+--------+
| aabbcc11-... | web-server-1 | ACTIVE |
| ddeeff22-... | web-server-2 | ACTIVE |
| 99887766-... | db-server | ACTIVE |
+--------------------------------------+--------------+--------+
Example 2: Checking Metadata Sync Before Failback and Force-Syncing If Needed
If the primary site was unreachable while you were operating on the DR site, metadata may have diverged. Run this check before attempting failback.
source ~/site-b-openrc
# Check sync status
openstack protector protection-group sync-status prod-web-app
Out-of-sync output:
Sync Status: ✗ OUT OF SYNC
Local Metadata:
Version: 4
Current Site: Site B
Last Modified: 2025-11-03T14:35:00Z
Remote Sync:
Status: FAILED
Remote Version: 3
Last Sync: 2025-11-03T10:00:00Z (4 hours ago)
Error: Connection timeout
Action Required:
1. Check remote site connectivity
2. Force sync once remote site is available
# Force sync to push current metadata to Site A
openstack protector protection-group sync-force prod-web-app
Expected output:
Force Sync Initiated...
Checking remote site connectivity... ✓
Site A is reachable
Syncing metadata (version 4)...
Gathering current metadata... ✓
Calculating checksum... ✓
Pushing to Site A... ✓
Remote Site Response:
Status: success
Version: 4
Duration: 312ms
✓ Sync completed successfully
Both sites now at version 4
# Now proceed with failback
openstack protector protection-group failback prod-web-app \
--reverse-replication \
--network-mapping \
net-secondary-web=net-primary-web \
net-secondary-db=net-primary-db
Example 3: Monitoring a Failback Operation in Progress
If you did not pass --wait, monitor the operation using its ID:
# Poll the operation until complete
watch -n 5 openstack protector operation show op-789abc12-...
In-progress output:
+------------------+---------------------------------------------+
| Field | Value |
+------------------+---------------------------------------------+
| operation_id | op-789abc12-... |
| operation_type | failback |
| status | running |
| progress | 55 |
| steps_completed | ["quiesce_vms", "final_snapshot", |
| | "promote_storage_primary"] |
| steps_failed | [] |
| started_at | 2025-11-03T15:00:00Z |
+------------------+---------------------------------------------+
Completed output:
+------------------+---------------------------------------------+
| Field | Value |
+------------------+---------------------------------------------+
| operation_id | op-789abc12-... |
| operation_type | failback |
| status | completed |
| progress | 100 |
| started_at | 2025-11-03T15:00:00Z |
| completed_at | 2025-11-03T15:08:43Z |
+------------------+---------------------------------------------+
Issue 1: Failback Blocked - Remote Site Unreachable
Symptom:
Error: Cannot initiate failback - remote site 'site-a' is unreachable.
Modifications to a Protection Group require both sites to be reachable.
Cause: The original primary site (Site A) cannot be reached from the current active site. Because metadata sync requires both sites, the service blocks the operation to prevent divergence.
Fix:
- Verify the primary site is actually recovered: check that the Nova, Cinder, Neutron, and Keystone endpoints on Site A respond.
- Check network connectivity between sites: ensure port 8788 (the protector API) and the Keystone port (5000) are reachable from Site B to Site A.
- Run `openstack protector site validate site-a`; if this fails, the site is not ready.
- Once connectivity is restored, run `openstack protector protection-group sync-force <pg-name>` to re-establish metadata sync before retrying the failback.
Issue 2: Metadata Out of Sync - Versions Do Not Match
Symptom:
Sync Status: ✗ OUT OF SYNC
Version mismatch (local: 5, remote: 3)
Cause: Operations were performed on the current active site while the primary site was unreachable, advancing the local metadata version without syncing to the peer.
Fix:
# Check which site has the higher (authoritative) version
openstack protector protection-group sync-status <pg-name>
# Force sync from the current active site (higher version) to primary
openstack protector protection-group sync-force <pg-name>
This pushes the current active site's metadata to the primary site, bringing both into alignment. Verify both sites show the same version before proceeding with failback.
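The version comparison can be automated when failback is scripted. The helper below is a sketch that parses the `Version:` and `Remote Version:` lines from the sync-status output shown above; the parsing assumes that exact line format.

```shell
#!/usr/bin/env bash
# Sketch: confirm local and remote metadata versions match after a force sync.
# Assumes sync-status output contains "Version: N" (local) and
# "Remote Version: N" lines, as in the sample output above.
versions_match() {
  local out="$1"
  local local_v remote_v
  # First "Version:" line that is not "Remote Version:" is the local version.
  local_v="$(awk -F': ' '/^[[:space:]]*Version:/ {print $2; exit}' <<<"$out")"
  remote_v="$(awk -F': ' '/Remote Version:/ {print $2}' <<<"$out")"
  [ -n "$local_v" ] && [ "$local_v" = "$remote_v" ]
}

# Usage:
#   out="$(openstack protector protection-group sync-status prod-web-app)"
#   versions_match "$out" || { echo "still diverged; do not fail back" >&2; exit 1; }
```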
Issue 3: Protection Group Not in a Failed-Over State
Symptom:
Error: Protection Group 'prod-web-app' is not in a failed_over state.
Current status: active
Cause: Failback can only be initiated on a Protection Group with status failed_over or failed_over_partial. If the status is active, the group is already running on its configured primary site and no failback is needed.
Fix: Verify the current state:
openstack protector protection-group show prod-web-app
If current_primary_site matches the site where your VMs are running and status is active, your workloads are already on the primary site. No failback is required.
Issue 4: Failback Fails - Volume Manage Operation Denied
Symptom:
Error: Failed to manage volume on site-a: Policy does not allow
volume_extension:volume_manage to be performed.
Cause: The Cinder policy on the primary site does not allow the member role to call the volume manage API. During failback, Trilio Site Recovery creates volumes from replicated snapshots and imports them into Cinder using the manage API.
Fix: On the primary site, add the following to Cinder's policy file (/etc/cinder/policy.yaml) and reconfigure Cinder:
"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"
For Kolla-Ansible deployments:
kolla-ansible -i inventory reconfigure -t cinder
After reconfiguring, retry the failback.
Issue 5: Instances Not Created on Primary Site - Flavor or Network Not Found
Symptom:
Error: Failed to create instance 'web-server-1' on site-a:
Flavor 'm2.large' not found.
or
Error: Network 'net-secondary-web' not found on site-a.
Cause: The flavor or network IDs from the current active site (Site B) were not mapped to their equivalents on the primary site (Site A). The failback command requires explicit mappings when IDs differ between sites.
Fix: List the available flavors and networks on the primary site and construct the correct mappings:
source ~/site-a-openrc
openstack flavor list
openstack network list --project production-project
Then re-run the failback with the correct --flavor-mapping and --network-mapping values:
openstack protector protection-group failback prod-web-app \
--reverse-replication \
--network-mapping \
net-secondary-web=net-primary-web \
net-secondary-db=net-primary-db \
--flavor-mapping \
m2.large=m1.large
Issue 6: Failback Operation Stuck at `running` / No Progress
Symptom: The operation has been in the `running` state for longer than expected and the `progress` field has not changed.
Cause: The protector-engine service on the current active site may have encountered an error interacting with either the Pure Storage array or the primary site's OpenStack APIs.
Fix:
- Check the `protector-engine` logs on the current active site for errors:
journalctl -u openstack-protector-engine -n 200 --no-pager
- Inspect the full operation record for `error_message` and `steps_failed`:
openstack protector operation show <op-id>
- Verify Pure Storage replication status independently: confirm that the replication connection between FlashArray B and FlashArray A is healthy and that snapshots are visible on Site A's array.
- If the engine process has crashed, restart it and re-check operation status:
systemctl restart openstack-protector-engine
openstack protector operation show <op-id>