Failover and Failback
Planned and unplanned failover procedures, failback, and resource lifecycle management.
This page explains how to execute failover and failback operations using Trilio Site Recovery for OpenStack. It covers both planned failover (a graceful, coordinated cutover where VMs are shut down cleanly and a final snapshot is triggered before switchover) and unplanned failover, which promotes the latest available replicated snapshot when the primary site is unavailable. You will also learn how to execute failback to restore normal operations on the original primary site, and how to manage the resource lifecycle that each failover/failback cycle creates.
Before executing any failover or failback operation, verify the following:
- Two independent OpenStack sites are registered and reachable from your CLI session: a primary site and a secondary (DR) site, each with its own Nova, Cinder, Neutron, and Keystone endpoints.
- Trilio Site Recovery services (protector-api and protector-engine) are running independently on both sites.
- The protectorclient OSC CLI plugin (or Horizon dashboard) is configured with credentials for both sites; this is the coordination layer that orchestrates metadata sync.
- The Protection Group is in active status with all member VMs added and replication confirmed healthy. Run openstack protector protection-group show <pg-name> and verify status: active.
- Metadata is in sync across both sites. Run openstack protector protection-group sync-status <pg-name> and confirm both sites show the same version number. If the remote site shows FAILED or UNREACHABLE, resolve sync before proceeding; modification and failover operations are blocked when the peer site is unreachable, by design, to prevent metadata divergence.
- Resource mappings are configured. You must have network mappings (primary network UUID → secondary network UUID) and, if your flavors differ between sites, flavor mappings ready. See the resource mappings guide for details.
- For failback: the original primary site must be recovered and reachable, the Protection Group must be in failed_over status, and replication must be re-established in the reverse direction before initiating failback.
- Cinder policy on the target site must permit volume_extension:volume_manage and volume_extension:volume_unmanage for the member role. These are required for the engine to import replicated volumes into Cinder after storage failover. See the deployment prerequisites guide if these are not yet configured.
No additional installation is required to use failover and failback. These operations are built into the protector-engine service and exposed through the protectorclient OSC plugin and the Protector REST API.
Verify that both services are running on each site before proceeding:
# On the primary site controller
systemctl status protector-api
systemctl status protector-engine
# On the secondary site controller
systemctl status protector-api
systemctl status protector-engine
Verify your CLI plugin can reach both sites:
# Confirm site registrations
openstack protector site list
# Validate connectivity to each site
openstack protector site validate site-a
openstack protector site validate site-b
If either site returns an error, resolve connectivity before continuing.
Failover and failback behavior is controlled by the parameters you pass at operation time, plus the Protection Group's replication policy. There are no separate configuration files for individual failover operations.
Replication type (set at Protection Group creation, not modifiable after creation):
| Value | Behavior | Data loss exposure |
|---|---|---|
async | Periodic snapshots replicated to the secondary array. Failover promotes the latest available replicated snapshot. | Up to the replication interval (configured via replication_interval in the policy, default 300 seconds / 5 minutes) |
sync | Real-time replication via Pure Storage Pod. Failover promotes the Pod. | Near-zero |
Replication policy fields (set per Protection Group, relevant to failover timing):
| Field | Description | Example |
|---|---|---|
replication_interval | Seconds between async snapshot cycles | 300 |
rpo_minutes | Recovery Point Objective; informational, used in readiness validation | 15
pure_pg_name | Name of the Pure Storage Protection Group or Pod being managed | pg-prod-web-app |
Failover action parameters:
| Parameter | Required | Description |
|---|---|---|
network_mapping | Yes | Maps primary-site network UUIDs to secondary-site network UUIDs. Every network attached to a protected VM must have a mapping entry. |
flavor_mapping | No | Maps primary-site flavor IDs to secondary-site flavor IDs. Required when flavors differ between sites. If omitted, the engine uses the same flavor ID on the secondary site. |
force | No | Bypasses some preflight checks. Use only when explicitly directed; forcing a failover without valid replication state can result in data loss or split-brain. Default: false. |
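The network_mapping requirement (every network attached to a protected VM must have an entry) can be checked ahead of time. The following Python sketch illustrates the rule only; the function name and data structures are assumptions for illustration, not the protectorclient API.

```python
# Sketch only: a pre-flight check mirroring the network_mapping rule.
# The input shapes are illustrative, not the protectorclient API.

def unmapped_networks(vm_networks, network_mapping):
    """Return primary-site network UUIDs attached to protected VMs
    that have no entry in the failover network mapping."""
    attached = set()
    for networks in vm_networks.values():   # {vm_id: [network_uuid, ...]}
        attached.update(networks)
    return attached - set(network_mapping)

vm_networks = {
    "vm-web-1": ["net-primary-web"],
    "vm-db-1": ["net-primary-web", "net-primary-db"],
}
mapping = {"net-primary-web": "net-secondary-web"}

print(unmapped_networks(vm_networks, mapping))  # {'net-primary-db'}
```

Any network returned here would need a mapping entry added before the failover command is submitted.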
Failback action parameters:
| Parameter | Required | Description |
|---|---|---|
network_mapping | Yes | Maps secondary-site network UUIDs back to primary-site network UUIDs. |
reverse_replication | No | When true, the engine re-establishes replication in the reverse direction (secondary → primary) before executing the cutover. Strongly recommended. Default: false. |
force | No | Same semantics as for failover. Default: false. |
Protection Group status transitions during failover and failback:
active → failing_over → failed_over
failed_over → failing_back → active
Operations are blocked while the Protection Group is in failing_over or failing_back state.
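The transitions above can be modeled as a small state machine. The engine enforces these rules server-side; this Python sketch is a local illustration only, with assumed action names.

```python
# Sketch of the Protection Group status transitions described above.
# Action names ("failover", "failback", "complete") are illustrative.

TRANSITIONS = {
    ("active", "failover"): "failing_over",
    ("failing_over", "complete"): "failed_over",
    ("failed_over", "failback"): "failing_back",
    ("failing_back", "complete"): "active",
}
IN_PROGRESS = {"failing_over", "failing_back"}

def next_status(current, action):
    # New operations are blocked while a failover/failback is in flight.
    if current in IN_PROGRESS and action != "complete":
        raise RuntimeError(f"operation blocked: group is {current}")
    if (current, action) not in TRANSITIONS:
        raise RuntimeError(f"cannot {action} from {current}")
    return TRANSITIONS[(current, action)]

s = next_status("active", "failover")   # failing_over
s = next_status(s, "complete")          # failed_over
s = next_status(s, "failback")          # failing_back
print(next_status(s, "complete"))       # active
```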
Checking sync status before any operation
Always verify metadata is synchronized before initiating a failover or failback. This is especially important after any period of site unreachability.
openstack protector protection-group sync-status prod-web-app
If the output shows a version mismatch or UNREACHABLE, resolve it first:
# Force a metadata sync push to the remote site
openstack protector protection-group sync-force prod-web-app
Note: if the remote site is genuinely unreachable due to a disaster, the sync check will fail. In that scenario, proceed directly with the unplanned failover command on the secondary site.
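The readiness rule the sync check applies can be summarized in a few lines. This Python sketch is illustrative only; the field names are assumptions, while the CLI reports per-site metadata version and reachability.

```python
# Sketch of the sync-readiness rule for planned operations:
# the peer must be reachable and at the same metadata version.

def sync_ready(local_version, remote_state, remote_version):
    """True only when a planned failover/failback may proceed."""
    if remote_state in ("FAILED", "UNREACHABLE"):
        return False
    return local_version == remote_version

print(sync_ready(6, "OK", 6))           # True  -- safe to proceed
print(sync_ready(6, "OK", 5))           # False -- force a sync first
print(sync_ready(6, "UNREACHABLE", 6))  # False -- planned ops blocked
```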
Planned failover
Use planned failover when you need to migrate workloads to the secondary site with controlled downtime, for example before scheduled primary site maintenance. The engine shuts down VMs on the primary site, triggers a final consistency-group snapshot, performs storage failover, then recreates VMs on the secondary site. Data loss is zero or near-zero because the final snapshot captures all writes before cutover.
Step 1. Source credentials for the site where the Protection Group is currently active (the current primary):
export OS_AUTH_URL=http://site-a:5000/v3
source ~/site-a-openrc
Step 2. Execute planned failover:
openstack protector protection-group failover prod-web-app \
--network-mapping \
net-primary-web=net-secondary-web \
net-primary-db=net-secondary-db \
--flavor-mapping \
m1.large=m2.large
Step 3. Monitor the operation until completion:
# Get the operation ID from the failover output, then:
watch openstack protector operation show <operation-id>
Step 4. Verify the Protection Group status and confirm VMs are running on the secondary site:
openstack protector protection-group show prod-web-app
# Expect: status: failed_over, current_primary_site: site-b
# Authenticate to site-b and verify instances
export OS_AUTH_URL=http://site-b:5000/v3
source ~/site-b-openrc
openstack server list
Unplanned failover
Use unplanned failover when the primary site is unavailable, for example after an outage or disaster. Because the primary is unreachable, the engine cannot shut down VMs or trigger a final snapshot. Instead, it promotes the latest available replicated snapshot on the secondary array. Data loss is limited to the replication lag window (up to the configured replication_interval for async replication).
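The exposure window is simple arithmetic: with async replication, a failure just before the next snapshot cycle can lose up to one full interval of writes.

```python
# Worst-case data-loss exposure for async replication. The 300 s value
# is the policy default noted earlier; sync (Pod) replication is near-zero.

replication_interval = 300            # seconds between async cycles
worst_case_loss_min = replication_interval / 60
print(worst_case_loss_min)            # 5.0 -- minutes of writes, at most
```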
Because the primary site is down, you initiate the operation from the secondary site:
Step 1. Source credentials for the secondary site:
export OS_AUTH_URL=http://site-b:5000/v3
source ~/site-b-openrc
Step 2. Execute unplanned failover:
openstack protector protection-group failover prod-web-app \
--type unplanned \
--network-mapping \
net-primary-web=net-secondary-web \
net-primary-db=net-secondary-db
The engine will attempt to update metadata on the primary site. Because the primary is unreachable, it will mark the sync status as UNREACHABLE and continue. The failover proceeds using the local metadata copy that was previously synchronized to the secondary site; this is why strict metadata synchronization is maintained at all times.
Step 3. Monitor and verify (same as planned failover, steps 3ā4 above).
Step 4. When the primary site recovers, force a metadata sync to bring it up to date before performing any further operations:
export OS_AUTH_URL=http://site-b:5000/v3
source ~/site-b-openrc
openstack protector protection-group sync-force prod-web-app
Failback
Failback returns workloads to the original primary site after it has recovered. The process is structurally the same as a failover, but in the reverse direction: replication is re-established from the current active site (secondary) back to the original primary, data is synced, VMs are shut down on the secondary, and then recreated on the primary.
Step 1. Confirm the primary site is recovered and reachable:
openstack protector site validate site-a
Step 2. Sync metadata to bring the recovered primary site up to date:
export OS_AUTH_URL=http://site-b:5000/v3
source ~/site-b-openrc
openstack protector protection-group sync-force prod-web-app
Step 3. Execute failback with reverse replication enabled:
openstack protector protection-group failback prod-web-app \
--reverse-replication \
--network-mapping \
net-secondary-web=net-primary-web \
net-secondary-db=net-primary-db
The --reverse-replication flag directs the engine to re-establish Pure Storage replication in the secondary-to-primary direction before executing the cutover. This ensures data written since the failover is captured.
Step 4. Monitor the operation and verify:
watch openstack protector operation show <operation-id>
# After completion:
openstack protector protection-group show prod-web-app
# Expect: status: active, current_primary_site: site-a
Resource lifecycle and orphan cleanup
Each failover and failback cycle produces source-side resources (volumes and VM instances on the site that was just vacated) that are no longer needed. The engine performs cleanup of these orphaned source-side resources after each successful operation. You should verify cleanup completed as expected after every failover and failback:
# After failover to site-b, verify site-a no longer has active member VMs
export OS_AUTH_URL=http://site-a:5000/v3
source ~/site-a-openrc
openstack server list
# Check for any orphaned volumes on site-a
openstack volume list
If orphaned resources persist across multiple failover/failback cycles, they accumulate and can cause Cinder quota exhaustion or stale state on the storage backend. This is a critical operational concern. If you find orphaned resources that the engine did not clean up (for example, due to a partial operation failure), remove them manually and reconcile the Protection Group state before proceeding with further DR operations.
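The verification above amounts to a set intersection: anything on the vacated site that still belongs to the Protection Group is an orphan. This Python sketch illustrates the reconciliation; the lists stand in for openstack server list / openstack volume list output and are not a real API.

```python
# Sketch: cross-check the Protection Group's expected members against what
# the vacated site still reports, to spot orphans the cleanup phase missed.

def find_orphans(pg_resources, vacated_site_resources):
    """Resources still present on the vacated site that belong to the PG."""
    return sorted(set(pg_resources) & set(vacated_site_resources))

pg_volumes = ["vol-web-1", "vol-db-1", "vol-db-2"]
site_a_volumes = ["vol-db-2", "vol-unrelated"]  # after failover to site-b

print(find_orphans(pg_volumes, site_a_volumes))  # ['vol-db-2']
```

A non-empty result means the cleanup step did not finish and the listed resources should be removed manually, as described above.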
Example 1: Planned failover - web application tier
Fail over a three-VM web application from site-a to site-b with network and flavor mappings.
# Authenticated to site-a
export OS_AUTH_URL=http://site-a:5000/v3
source ~/site-a-openrc
# Verify sync is current
openstack protector protection-group sync-status prod-web-app
# Execute planned failover
openstack protector protection-group failover prod-web-app \
--network-mapping \
a2c3d4e5-web=f6g7h8i9-web \
a2c3d4e5-db=f6g7h8i9-db \
--flavor-mapping \
m1.large=m2.large
Expected output:
+------------------------+--------------------------------------+
| Field | Value |
+------------------------+--------------------------------------+
| operation_id | op-7f3a2b1c-... |
| operation_type | failover |
| status | running |
| progress | 10% |
| source_site | site-a |
| target_site | site-b |
+------------------------+--------------------------------------+
Monitor until complete:
watch openstack protector operation show op-7f3a2b1c-...
Progress output at completion:
+-------------------+----------------------------------------------+
| Field | Value |
+-------------------+----------------------------------------------+
| operation_id | op-7f3a2b1c-... |
| operation_type | failover |
| status | completed |
| progress | 100% |
| started_at | 2025-06-01T09:00:00Z |
| completed_at | 2025-06-01T09:04:37Z |
| steps_completed | ["validate_sites", "verify_replication", |
| | "trigger_final_snapshot", "stop_vms", |
| | "storage_failover", "recreate_vms", |
| | "update_pg_status", "cleanup_source"] |
+-------------------+----------------------------------------------+
Verify Protection Group state:
export OS_AUTH_URL=http://site-b:5000/v3
source ~/site-b-openrc
openstack protector protection-group show prod-web-app
+---------------------------+--------------------------------------+
| Field | Value |
+---------------------------+--------------------------------------+
| name | prod-web-app |
| status | failed_over |
| current_primary_site | site-b |
| failover_count | 1 |
| last_failover_at | 2025-06-01T09:04:37Z |
| replication_type | async |
+---------------------------+--------------------------------------+
Example 2: Unplanned failover - site-a is down
Site A has become unavailable. You initiate failover directly from Site B using the local metadata copy.
# Authenticated to site-b (the DR site)
export OS_AUTH_URL=http://site-b:5000/v3
source ~/site-b-openrc
openstack protector protection-group failover prod-web-app \
--type unplanned \
--network-mapping \
a2c3d4e5-web=f6g7h8i9-web \
a2c3d4e5-db=f6g7h8i9-db
Expected output (note the sync warning, which is expected during unplanned failover):
+------------------------+--------------------------------------+
| Field | Value |
+------------------------+--------------------------------------+
| operation_id | op-9d1e0f2a-... |
| operation_type | failover |
| status | running |
| progress | 10% |
+------------------------+--------------------------------------+
WARNING: Primary site (site-a) is unreachable.
Proceeding with unplanned failover using local metadata (version 5).
Metadata sync to site-a will be retried when site-a recovers.
After site-a recovers, sync metadata before any further operations:
openstack protector protection-group sync-force prod-web-app
Force Sync Initiated...
Checking remote site connectivity...
✓ Site A is now reachable
Syncing metadata (version 6)...
Gathering current metadata... ✓
Pushing to Site A... ✓
✓ Sync completed successfully
Both sites now at version 6
Example 3: Failback after recovery of site-a
Site A has recovered. You fail back from site-b to site-a, re-establishing replication before the cutover.
# Still authenticated to site-b (currently active)
export OS_AUTH_URL=http://site-b:5000/v3
source ~/site-b-openrc
# Confirm site-a is reachable
openstack protector site validate site-a
# Sync metadata to the recovered site-a
openstack protector protection-group sync-force prod-web-app
# Execute failback
openstack protector protection-group failback prod-web-app \
--reverse-replication \
--network-mapping \
f6g7h8i9-web=a2c3d4e5-web \
f6g7h8i9-db=a2c3d4e5-db
Expected output:
+------------------------+--------------------------------------+
| Field | Value |
+------------------------+--------------------------------------+
| operation_id | op-2c4e6a8b-... |
| operation_type | failback |
| status | running |
| progress | 10% |
+------------------------+--------------------------------------+
After completion, verify the Protection Group has returned to active on site-a:
export OS_AUTH_URL=http://site-a:5000/v3
source ~/site-a-openrc
openstack protector protection-group show prod-web-app
+---------------------------+--------------------------------------+
| Field | Value |
+---------------------------+--------------------------------------+
| name | prod-web-app |
| status | active |
| current_primary_site | site-a |
| failover_count | 1 |
| replication_type | async |
+---------------------------+--------------------------------------+
Example 4: Monitoring a running DR operation
You can list all DR operations for a Protection Group or inspect a specific operation:
# List all operations (most recent first)
openstack protector operation list
# Show detailed step-level progress of a running operation
openstack protector operation show op-7f3a2b1c-...
A running operation in progress:
+-------------------+----------------------------------------------+
| Field | Value |
+-------------------+----------------------------------------------+
| operation_id | op-7f3a2b1c-... |
| operation_type | failover |
| status | running |
| progress | 65% |
| steps_completed | ["validate_sites", "verify_replication", |
| | "trigger_final_snapshot", "stop_vms", |
| | "storage_failover"] |
| steps_failed | [] |
| error_message | |
+-------------------+----------------------------------------------+
Failover blocked: "Cannot modify protection group - remote site unreachable"
Symptom: The failover or failback command returns an error stating the remote site is unreachable and the operation is blocked.
Likely cause: Metadata synchronization requires both sites to be reachable for all state-modifying operations. This is by design ā allowing changes when the remote site is unreachable would cause metadata divergence.
Fix for unplanned failover (primary is genuinely down): Use --type unplanned. The engine accepts that the primary site is unreachable for this specific operation type and proceeds using the local metadata copy, marking the sync status as UNREACHABLE to be resolved later.
openstack protector protection-group failover prod-web-app \
--type unplanned \
--network-mapping ...
Fix for all other cases: Diagnose and restore connectivity to the remote site, then force a sync before retrying:
openstack protector site validate site-b
openstack protector protection-group sync-force prod-web-app
Failover fails at storage_failover step
Symptom: The operation reaches the storage_failover step and then fails. The error_message field in the operation record mentions the Pure Storage array or a snapshot.
Likely cause (async replication): No valid replicated snapshot is available on the secondary FlashArray, or the Protection Group name (pure_pg_name in the replication policy) does not match the actual name on the array.
Fix:
- Verify that the pure_pg_name in the replication policy exactly matches the Protection Group name configured on the Pure Storage array.
- Log into the secondary FlashArray management interface and confirm that replicated snapshots exist for the Protection Group.
- If replication was never established or has been broken, restore FlashArray replication connectivity and allow at least one replication cycle to complete before retrying.
Likely cause (sync replication / Pod): The Pod promotion failed, often because the Pod is in a degraded state.
Fix: Check the Pod status on the secondary FlashArray. Resolve any array-level issues before retrying.
Failover completes but VMs fail to start on the secondary site
Symptom: The operation shows completed, but VMs on the secondary site are in ERROR state.
Likely cause: A resource mapping issue ā the network or flavor specified in the mapping does not exist on the secondary site, or a security group referenced in the VM metadata is not present.
Fix:
- Review the Nova error for the failed instance: run openstack server show <instance-id> on the secondary site.
- Verify that all network UUIDs in --network-mapping correspond to existing networks on the secondary site.
- Verify that all security groups referenced by protected VMs exist on the secondary site (they must be pre-created with matching names or IDs).
- If a flavor mapping is missing, add it and retry using --flavor-mapping.
Failback fails with "Protection Group is not in failed_over status"
Symptom: The failback command is rejected with a status error.
Likely cause: The Protection Group is not currently in failed_over status. Failback can only be initiated from the failed_over state.
Fix: Check the current Protection Group status:
openstack protector protection-group show prod-web-app
If the status is failing_over or failing_back, a previous operation is still in progress or stalled. Check the active operation:
openstack protector operation list
openstack protector operation show <operation-id>
If the operation is stuck in running state with no progress, check the protector-engine logs on both sites for errors:
tail -f /var/log/protector/protector-engine.log
Orphaned volumes or VMs remain on the source site after failover
Symptom: After a successful failover or failback, the source site still shows VM instances or Cinder volumes belonging to the Protection Group that should have been cleaned up.
Likely cause: The cleanup phase of the operation failed or was skipped (for example, due to a Cinder policy error preventing volume_unmanage, or a Nova error during VM deletion).
Fix:
- Review the completed operation's steps_completed field to confirm whether the cleanup step ran.
- Check protector-engine logs for errors during the cleanup phase.
- Verify that Cinder policy on the source site permits volume_extension:volume_unmanage for the member role.
- Manually delete any orphaned VMs using openstack server delete on the source site.
- Manually delete or unmanage orphaned Cinder volumes using openstack volume delete on the source site.
- Do not allow orphaned resources to accumulate across multiple failover/failback cycles. Accumulated orphans can exhaust Cinder quotas and cause storage backend state conflicts.
Metadata version mismatch after recovering a failed site
Symptom: After the primary site recovers following an unplanned failover, sync-status shows the sites are at different metadata versions.
Likely cause: Expected behavior. The secondary site advanced its metadata version during and after the unplanned failover while the primary was unreachable.
Fix: This is resolved with a force sync from the currently active site (the secondary) to the recovered primary:
# Authenticated to the currently active site
openstack protector protection-group sync-force prod-web-app
The secondary site's metadata is authoritative (it is the site where VMs are currently running), so the sync pushes the current version to the recovered primary. Both sites will then be at the same version and further operations will be permitted.