Failover and Failback Commands
openstack protector protection-group failover and failback; planned vs unplanned types; monitoring with openstack protector operation show
This page explains how to execute failover and failback operations using Trilio Site Recovery for OpenStack. It covers planned failover (graceful workload migration between sites), unplanned failover (emergency recovery when the primary site is unavailable), and failback (returning workloads to the original site after recovery). You will also learn how to monitor operation progress using the openstack protector operation show command and understand the state transitions that a Protection Group moves through during each operation type.
Before executing any failover or failback operation, confirm the following:
- Two registered and validated sites: Both your primary and secondary sites must be registered in the Protector service and reachable via their Keystone endpoints. Run `openstack protector site validate <site-name>` on each site to confirm.
- Protection Group in `active` or `failed_over` state: Failover requires `active` status. Failback requires `failed_over` status. Check with `openstack protector protection-group show <pg-name>`.
- Replication policy configured: The Protection Group must have a replication policy with valid FlashArray credentials for both sites. Verify with `openstack protector protection-group policy-show <pg-name>`.
- Metadata in sync: Both sites must be running the same metadata version. Run `openstack protector protection-group sync-status <pg-name>` and resolve any `OUT OF SYNC` condition before proceeding. Modifications to a Protection Group, including initiating failover, are blocked if the peer site is unreachable, unless you are performing an unplanned failover.
- Network and flavor mappings prepared: Identify the network UUIDs and flavor IDs on the target site before you begin. Networks and flavors are not shared between sites.
- `clouds.yaml` configured for both sites: The `protectorclient` CLI plugin must be able to authenticate to both sites. See the deployment guide for `clouds.yaml` configuration.
- Cinder policy updated on the target site: The `volume_extension:volume_manage` and `volume_extension:volume_unmanage` Cinder policies must be set to `rule:admin_or_owner` on the site receiving the workload.
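The checks above lend themselves to scripting. Below is a minimal sketch, assuming each `openstack protector` subcommand exits nonzero on failure; the `preflight_ok` helper name is our own:

```shell
# Hypothetical pre-flight helper: runs the prerequisite checks for one
# Protection Group and stops at the first failure. Assumes each
# `openstack protector ...` subcommand exits nonzero when a check fails.
preflight_ok() {
  local peer_site="$1" pg_name="$2"
  openstack protector site validate "$peer_site" || return 1
  openstack protector protection-group show "$pg_name" || return 1
  openstack protector protection-group policy-show "$pg_name" || return 1
  # sync-status reports OUT OF SYNC when the metadata versions differ
  if openstack protector protection-group sync-status "$pg_name" \
      | grep -q "OUT OF SYNC"; then
    echo "metadata out of sync for $pg_name" >&2
    return 1
  fi
  return 0
}
```

Run `preflight_ok site-b prod-web-app` after sourcing the primary site's credentials; a zero exit status means the basic checks passed. This does not replace preparing the network and flavor mappings, which you must still gather manually.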
The failover and failback commands are part of the protectorclient OSC plugin. If you have already installed the plugin during initial deployment, no additional installation is required.
To verify the plugin is installed and the DR commands are available:
openstack protector --help
If the protector command group is not listed, install the client plugin:
pip install protectorclient
Confirm the installed version:
pip show protectorclient
Verify connectivity to both sites by listing protection groups from each:
# Source credentials for your primary site
source ~/site-a-openrc
openstack protector protection-group list
# Source credentials for your secondary site
source ~/site-b-openrc
openstack protector protection-group list
Both commands should return results without authentication errors. If either site is unreachable, resolve connectivity before attempting any DR operation.
Failover and failback behavior is controlled by flags passed at execution time. The table below describes each flag, its effect, and when to use it.
| Flag | Applies to | Description |
|---|---|---|
| `--type planned` | failover, failback | Graceful operation. The engine quiesces workloads on the source site, performs a final storage sync, then activates on the target site. Requires the source site to be reachable. |
| `--type unplanned` | failover only | Emergency operation. The engine promotes the most recent replicated snapshot on the target site without contacting the source. Use this when the primary site is down. Metadata sync to the source site is skipped and marked UNREACHABLE. |
| `--network-mapping <src-net>=<dst-net>` | failover, failback | Maps source-site network UUIDs to target-site network UUIDs. Pass one `--network-mapping` flag per network. Required unless your sites share network UUIDs (uncommon). |
| `--flavor-mapping <src-flavor>=<dst-flavor>` | failover, failback | Maps source-site flavor IDs to target-site flavor IDs. Optional; omit if flavor IDs are identical across sites or if you want the engine to use the same flavor name. |
| `--reverse-replication` | failback only | After failback completes, reverses the Pure Storage replication direction so that Site A becomes the replication source again. Recommended for production failbacks to restore normal RPO immediately. |
| `--force` | failover, failback | Bypasses pre-flight checks that would otherwise block the operation (for example, a stale sync status). Use with caution: this can cause metadata divergence if the peer site is genuinely unreachable. |
Protection Group status transitions
Understanding status transitions helps you interpret operation show output and diagnose failures:
| Operation | Starting status | Intermediate status | Final status (success) | Final status (failure) |
|---|---|---|---|---|
| Planned failover | active | failing_over | failed_over | error |
| Unplanned failover | active | failing_over | failed_over | error |
| Failback | failed_over | failing_back | active | error |
The current_primary_site field on the Protection Group tracks which site is currently running the workload. It flips from Site A to Site B on failover and back to Site A on failback.
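Because OSC commands accept the standard `-f value -c <column>` output filters, this field can be read directly in a script. A sketch (the helper name is our own; the column name matches the `protection-group show` output shown later in this page):

```shell
# Sketch: print which site currently runs the workload, using the
# standard OSC output filters (-f value -c <column>).
current_primary() {
  openstack protector protection-group show "$1" \
    -f value -c current_primary_site
}
```

For example, `current_primary prod-web-app` prints `site-a` before failover and `site-b` afterward.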
Planned failover
Use planned failover when you need to migrate workloads intentionally — for example, for scheduled maintenance, capacity rebalancing, or a controlled DR drill where you want zero data loss. Both sites must be reachable.
# Authenticate to the site where workloads are currently running
source ~/site-a-openrc
openstack protector protection-group failover prod-web-app \
--type planned \
--network-mapping net-primary-web=net-secondary-web \
--network-mapping net-primary-db=net-secondary-db \
--flavor-mapping m1.large=m2.large
The command returns immediately with an operation ID. The engine runs the failover asynchronously. Track progress with openstack protector operation show <operation-id> (see the Monitoring section).
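Because the command is asynchronous, scripts usually capture the operation ID at launch. A sketch using the standard `-f value -c` output filter; it assumes the create output exposes an `operation_id` column, as the example outputs on this page show:

```shell
# Sketch: start a planned failover and capture the operation ID for
# later polling. Extra flags (network/flavor mappings) pass through.
start_planned_failover() {
  local pg="$1"; shift
  openstack protector protection-group failover "$pg" \
    --type planned "$@" \
    -f value -c operation_id
}
```

For example, `OP_ID=$(start_planned_failover prod-web-app --network-mapping net-primary-web=net-secondary-web)` stores the ID for use with the monitoring commands covered later in this page.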
Unplanned failover
Use unplanned failover when the primary site is unavailable and you must recover workloads from the most recent replicated snapshot. Authenticate to the secondary site — this is the site that will receive the workload.
# Authenticate to the secondary (DR) site
source ~/site-b-openrc
openstack protector protection-group failover prod-web-app \
--type unplanned \
--network-mapping net-primary-web=net-secondary-web \
--network-mapping net-primary-db=net-secondary-db
Because the primary site is down, the engine loads the last-synced metadata from the local (Site B) database and promotes the latest available snapshot from FlashArray B. Sync to Site A is attempted but skipped if unreachable — the sync status is marked UNREACHABLE. Once Site A recovers, run openstack protector protection-group sync-force prod-web-app before making any modifications to the Protection Group.
Failback
Use failback after the primary site has recovered and you want to return workloads to it. Authenticate to the site that is currently running the workloads (the site you failed over to).
# Authenticate to the site currently running workloads (after failover, this is Site B)
source ~/site-b-openrc
# Confirm the peer site is reachable and metadata is in sync before failback
openstack protector protection-group sync-status prod-web-app
openstack protector protection-group failback prod-web-app \
--type planned \
--reverse-replication \
--network-mapping net-secondary-web=net-primary-web \
--network-mapping net-secondary-db=net-primary-db
Note that network mappings for failback are the reverse of failover mappings: the source networks are now on Site B, and the target networks are on Site A.
The --reverse-replication flag instructs the engine to flip the Pure Storage replication direction back to Site A → Site B after failback completes. Omitting this flag leaves replication in the Site B → Site A direction, which is appropriate only if you plan to fail over again immediately.
Monitoring operation progress
All DR operations run asynchronously. Use openstack protector operation show to track progress:
openstack protector operation show <operation-id>
To watch progress in real time:
watch -n 5 openstack protector operation show <operation-id>
To list all operations for your Protection Group:
openstack protector operation list
Operations report a progress field (0–100), a status field (pending, running, completed, failed, rolling_back), and a steps_completed array that shows which phases have finished. On failure, the error_message field contains the reason.
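The fields above are enough to build a simple wait loop. A sketch that polls until the operation reaches a terminal status (`completed` or `failed`); the helper name and the configurable interval are our own additions:

```shell
# Sketch: poll an operation until it reaches a terminal status.
# Returns 0 on "completed", 1 on "failed"; keeps waiting otherwise
# (pending, running, rolling_back).
wait_for_operation() {
  local op_id="$1" interval="${2:-5}" status
  while true; do
    status=$(openstack protector operation show "$op_id" \
      -f value -c status)
    case "$status" in
      completed) return 0 ;;
      failed)    return 1 ;;
    esac
    sleep "$interval"
  done
}
```

Calling `wait_for_operation "$OP_ID" && echo "failover done"` blocks until the engine finishes, which is convenient in maintenance-window scripts.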
Example 1: Planned failover with network and flavor mappings
Migrate a production web application from Site A to Site B during a maintenance window.
source ~/site-a-openrc
openstack protector protection-group failover prod-web-app \
--type planned \
--network-mapping a1b2c3d4-net-web=e5f6a7b8-net-web \
--network-mapping c9d0e1f2-net-db=a3b4c5d6-net-db \
--flavor-mapping m1.large=m2.large
Expected output:
+------------------------+--------------------------------------+
| Field | Value |
+------------------------+--------------------------------------+
| operation_id | op-3fa85f64-5717-4562-b3fc-2c963f66 |
| operation_type | failover |
| status | running |
| progress | 10 |
| source_site | site-a |
| target_site | site-b |
| started_at | 2025-03-15T09:00:12Z |
+------------------------+--------------------------------------+
Example 2: Monitoring failover progress
Poll the operation until completion. Use the operation_id returned by the failover command.
openstack protector operation show op-3fa85f64-5717-4562-b3fc-2c963f66
Output during Phase 2 (storage failover, ~20–60% progress):
+------------------------+----------------------------------------------+
| Field | Value |
+------------------------+----------------------------------------------+
| operation_id | op-3fa85f64-5717-4562-b3fc-2c963f66 |
| operation_type | failover |
| status | running |
| progress | 45 |
| steps_completed | ["prepare", "validate_target", |
| | "get_snapshot", "promote_volumes"] |
| steps_failed | [] |
| error_message | |
| started_at | 2025-03-15T09:00:12Z |
| completed_at | |
+------------------------+----------------------------------------------+
Output on successful completion:
+------------------------+----------------------------------------------+
| Field | Value |
+------------------------+----------------------------------------------+
| operation_id | op-3fa85f64-5717-4562-b3fc-2c963f66 |
| operation_type | failover |
| status | completed |
| progress | 100 |
| steps_completed | ["prepare", "validate_target", |
| | "get_snapshot", "promote_volumes", |
| | "recreate_instances", "finalize"] |
| steps_failed | [] |
| error_message | |
| started_at | 2025-03-15T09:00:12Z |
| completed_at | 2025-03-15T09:07:44Z |
+------------------------+----------------------------------------------+
Example 3: Unplanned failover from the secondary site
Site A has gone down unexpectedly. Recover workloads from Site B using the last replicated snapshot.
source ~/site-b-openrc
openstack protector protection-group failover prod-web-app \
--type unplanned \
--network-mapping a1b2c3d4-net-web=e5f6a7b8-net-web \
--network-mapping c9d0e1f2-net-db=a3b4c5d6-net-db
Expected output:
+------------------------+--------------------------------------+
| Field | Value |
+------------------------+--------------------------------------+
| operation_id | op-7c8d9e0f-1234-5678-abcd-ef012345 |
| operation_type | failover |
| status | running |
| progress | 10 |
| source_site | site-a |
| target_site | site-b |
| started_at | 2025-03-15T11:32:01Z |
+------------------------+--------------------------------------+
WARNING: Unplanned failover initiated. Source site (site-a) is unreachable.
Metadata sync to site-a will be deferred until it recovers.
Run 'openstack protector protection-group sync-force prod-web-app' after site-a is restored.
Example 4: Verifying Protection Group state after failover
After the operation completes, confirm that current_primary_site has flipped to Site B and failover_count has incremented.
source ~/site-b-openrc
openstack protector protection-group show prod-web-app
Expected output (relevant fields):
+----------------------------+--------------------------------------+
| Field | Value |
+----------------------------+--------------------------------------+
| status | failed_over |
| primary_site | site-a |
| secondary_site | site-b |
| current_primary_site | site-b |
| failover_count | 1 |
| last_failover_at | 2025-03-15T09:07:44Z |
+----------------------------+--------------------------------------+
Example 5: Failback to Site A after recovery
Site A has been restored. Sync metadata, then fail back.
source ~/site-b-openrc
# Step 1: Confirm sync status
openstack protector protection-group sync-status prod-web-app
# Step 2: Force sync if Site A was previously unreachable
openstack protector protection-group sync-force prod-web-app
# Step 3: Execute failback
openstack protector protection-group failback prod-web-app \
--type planned \
--reverse-replication \
--network-mapping e5f6a7b8-net-web=a1b2c3d4-net-web \
--network-mapping a3b4c5d6-net-db=c9d0e1f2-net-db
Expected output:
+------------------------+--------------------------------------+
| Field | Value |
+------------------------+--------------------------------------+
| operation_id | op-aabb1122-ccdd-3344-eeff-55667788 |
| operation_type | failback |
| status | running |
| progress | 10 |
| source_site | site-b |
| target_site | site-a |
| started_at | 2025-03-16T08:15:00Z |
+------------------------+--------------------------------------+
After completion, the Protection Group status returns to active and current_primary_site returns to site-a.
Operation fails with "remote site unreachable" during planned failover
Symptom: Running openstack protector protection-group failover <pg> --type planned returns an error immediately, before the operation is created.
Cause: Planned failover requires both sites to be reachable for metadata sync. If the peer site cannot be contacted, the operation is blocked to prevent metadata divergence.
Fix:
- Confirm the secondary site's Keystone endpoint is reachable: `curl -s http://site-b:5000/v3`
- Check the site's registered status: `openstack protector site validate site-b`
- If the secondary site is genuinely unavailable and you need to recover immediately, use `--type unplanned` authenticated to the secondary site instead.
- If the site is reachable but the sync status is stale, run `openstack protector protection-group sync-force <pg-name>` before retrying.
Operation stuck at a progress percentage for more than 10 minutes
Symptom: openstack protector operation show <op-id> shows status: running but progress has not changed for an extended period.
Cause: The engine may be waiting on a Pure Storage array operation (snapshot promotion, volume management), a Cinder API call, or a Nova boot. Long waits during the 20–60% range typically indicate storage issues; waits during 60–90% indicate Nova or Cinder issues.
Fix:
- Check `steps_completed` to identify which phase is stalled.
- Review the Protector engine log on the target site: `journalctl -u protector-engine -f`
- If the issue is storage-related, verify FlashArray reachability from the target site and confirm that the replication policy credentials in `openstack protector protection-group policy-show <pg>` are valid.
- If the issue is Nova-related, check `openstack server list` on the target site for VMs in `ERROR` state and inspect their fault details.
Failover completes but VMs are in ERROR state on the target site
Symptom: The operation reaches status: completed and progress: 100, but when you run openstack server list on the target site, one or more VMs show ERROR.
Cause: VMs can reach ERROR state after the operation completes if the boot volume failed to attach, if the specified flavor does not exist on the target site, or if the mapped network is full or otherwise unavailable.
Fix:
- Check the Nova fault for the failed VM: run `openstack server show <vm-id>` and inspect the `fault` field.
- Verify the flavor mapping: run `openstack flavor list` on the target site and confirm the mapped flavor ID exists.
- Verify the network mapping: run `openstack network show <target-net-id>` to confirm the network is active and has available IP addresses.
- Check Cinder volume status: run `openstack volume list` on the target site. Volumes created during failover should be `in-use`. If they are `available`, the attachment failed.
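The per-VM checks can be combined into a small loop that prints the Nova fault for every server in `ERROR` state, using only stock OSC commands:

```shell
# Sketch: print the fault detail for every server in ERROR state on
# the target site. Uses the standard server list/show output filters.
show_error_faults() {
  local id
  for id in $(openstack server list --status ERROR -f value -c ID); do
    echo "=== $id ==="
    openstack server show "$id" -f value -c fault
  done
}
```

Run `show_error_faults` with the target site's credentials sourced; an empty result means no servers are in `ERROR` state.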
Failback blocked with "Protection Group is not in failed_over state"
Symptom: Running openstack protector protection-group failback <pg> returns an error stating the Protection Group is not eligible for failback.
Cause: Failback is only valid when the Protection Group status is failed_over. If a previous failover failed partway through, the status may be error instead.
Fix:
- Check the current status: `openstack protector protection-group show <pg-name>`
- If status is `error`, inspect the most recent operation: run `openstack protector operation list`, then `openstack protector operation show <op-id>` to read the `error_message`.
- Resolve the underlying issue (storage, network, or Nova), then attempt the failover or failback again. If the Protection Group is stuck in `error` and you are certain it is safe to proceed, use `--force`, but only after confirming no partial VM or volume state exists on the target site that could cause conflicts.
Metadata out of sync after unplanned failover; modifications blocked
Symptom: After an unplanned failover, attempting to add or remove members returns: Cannot modify protection group - remote site unreachable.
Cause: During an unplanned failover, the engine marks the primary site's sync status as UNREACHABLE. The Protection Group remains in this state until the primary site recovers and a sync is performed. All modifications are blocked until both sites agree on the current metadata version.
Fix:
- Wait for Site A to recover.
- Check sync status: `openstack protector protection-group sync-status <pg-name>`
- Push the current metadata to Site A: `openstack protector protection-group sync-force <pg-name>`
- Confirm both sites are at the same version, then proceed with modifications.
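These recovery steps can be wrapped in one helper that pushes metadata and then confirms the `OUT OF SYNC` marker is gone. A sketch; the helper name is our own:

```shell
# Sketch: after the primary site recovers, force a metadata sync and
# verify that sync-status no longer reports OUT OF SYNC.
resync_after_recovery() {
  local pg="$1"
  openstack protector protection-group sync-force "$pg" || return 1
  if openstack protector protection-group sync-status "$pg" \
      | grep -q "OUT OF SYNC"; then
    return 1
  fi
  return 0
}
```

A zero exit status from `resync_after_recovery prod-web-app` indicates modifications to the Protection Group should no longer be blocked.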
volume_manage fails during failover with permission denied
Symptom: The operation fails during the storage failover phase (around 20–60% progress) with an error referencing volume_manage or PolicyNotAuthorized.
Cause: The Cinder policy on the target site does not permit the member role to call the volume manage API. This is a deployment configuration issue.
Fix: On the target site, update /etc/cinder/policy.yaml (or the Kolla-Ansible equivalent) to include:
"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"
Then restart the Cinder API service (or run kolla-ansible reconfigure -t cinder) and retry the failover operation.