Site Recovery for OpenStack Guide

Failover and Failback Commands

openstack protector protection-group failover and failback; planned vs unplanned types; monitoring with operation show


Overview

This page explains how to execute failover and failback operations using Trilio Site Recovery for OpenStack. It covers planned failover (graceful workload migration between sites), unplanned failover (emergency recovery when the primary site is unavailable), and failback (returning workloads to the original site after recovery). You will also learn how to monitor operation progress using the openstack protector operation show command and understand the state transitions that a Protection Group moves through during each operation type.


Prerequisites

Before executing any failover or failback operation, confirm the following:

  • Two registered and validated sites: Both your primary and secondary sites must be registered in the Protector service and reachable via their Keystone endpoints. Run openstack protector site validate <site-name> on each site to confirm.
  • Protection Group in active or failed_over state: Failover requires active status. Failback requires failed_over status. Check with openstack protector protection-group show <pg-name>.
  • Replication policy configured: The Protection Group must have a replication policy with valid FlashArray credentials for both sites. Verify with openstack protector protection-group policy-show <pg-name>.
  • Metadata in sync: Both sites must be running the same metadata version. Run openstack protector protection-group sync-status <pg-name> and resolve any OUT OF SYNC condition before proceeding. Modifications to a Protection Group — including initiating failover — are blocked if the peer site is unreachable, unless you are performing an unplanned failover.
  • Network and flavor mappings prepared: Identify the network UUIDs and flavor IDs on the target site before you begin. Networks and flavors are not shared between sites.
  • clouds.yaml configured for both sites: The protectorclient CLI plugin must be able to authenticate to both sites. See the deployment guide for clouds.yaml configuration.
  • Cinder policy updated on the target site: The volume_extension:volume_manage and volume_extension:volume_unmanage Cinder policies must be set to rule:admin_or_owner on the site receiving the workload.
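
A minimal clouds.yaml sketch with one entry per site (all names, URLs, and credentials below are illustrative placeholders; see the deployment guide for the authoritative layout):

```yaml
clouds:
  site-a:
    auth:
      auth_url: http://site-a:5000/v3
      username: admin
      password: CHANGE_ME
      project_name: admin
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
  site-b:
    auth:
      auth_url: http://site-b:5000/v3
      username: admin
      password: CHANGE_ME
      project_name: admin
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
```

With this in place, openstack --os-cloud site-b protector protection-group list targets the secondary site without re-sourcing an openrc file.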

Installation

The failover and failback commands are part of the protectorclient OSC plugin. If you have already installed the plugin during initial deployment, no additional installation is required.

To verify the plugin is installed and the DR commands are available:

openstack protector --help

If the protector command group is not listed, install the client plugin:

pip install protectorclient

Confirm the installed version:

pip show protectorclient

Verify connectivity to both sites by listing protection groups from each:

# Source credentials for your primary site
source ~/site-a-openrc
openstack protector protection-group list

# Source credentials for your secondary site
source ~/site-b-openrc
openstack protector protection-group list

Both commands should return results without authentication errors. If either site is unreachable, resolve connectivity before attempting any DR operation.


Configuration

Failover and failback behavior is controlled by flags passed at execution time. Each flag, the commands it applies to, and its effect:

  • --type planned (applies to failover, failback): Graceful operation. The engine quiesces workloads on the source site, performs a final storage sync, then activates on the target site. Requires the source site to be reachable.
  • --type unplanned (applies to failover only): Emergency operation. The engine promotes the most recent replicated snapshot on the target site without contacting the source. Use this when the primary site is down. Metadata sync to the source site is skipped and marked UNREACHABLE.
  • --network-mapping <src-net>=<dst-net> (applies to failover, failback): Maps source-site network UUIDs to target-site network UUIDs. Pass one --network-mapping flag per network. Required unless your sites share network UUIDs (uncommon).
  • --flavor-mapping <src-flavor>=<dst-flavor> (applies to failover, failback): Maps source-site flavor IDs to target-site flavor IDs. Optional; omit if flavor IDs are identical across sites or if you want the engine to use the same flavor name.
  • --reverse-replication (applies to failback only): After failback completes, reverses the Pure Storage replication direction so that Site A becomes the replication source again. Recommended for production failbacks to restore normal RPO immediately.
  • --force (applies to failover, failback): Bypasses pre-flight checks that would otherwise block the operation (for example, a stale sync status). Use with caution: this can cause metadata divergence if the peer site is genuinely unreachable.
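
Passing one --network-mapping flag per network gets verbose when a Protection Group spans many networks. A small shell helper can assemble the flags from a list of src=dst pairs (a sketch; the pair values are placeholders):

```shell
# build_network_mappings src=dst [src=dst ...]
# Prints a ready-to-splice sequence of --network-mapping flags.
build_network_mappings() {
  local pair out=""
  for pair in "$@"; do
    out+="--network-mapping $pair "
  done
  printf '%s\n' "${out% }"
}

# Usage (placeholder network names):
#   openstack protector protection-group failover prod-web-app \
#     --type planned \
#     $(build_network_mappings net-a-web=net-b-web net-a-db=net-b-db)
```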

Protection Group status transitions

Understanding status transitions helps you interpret operation show output and diagnose failures:

Operation            Starting status   Intermediate status   Final status (success)   Final status (failure)
Planned failover     active            failing_over          failed_over              error
Unplanned failover   active            failing_over          failed_over              error
Failback             failed_over       failing_back          active                   error

The current_primary_site field on the Protection Group tracks which site is currently running the workload. It flips from Site A to Site B on failover and back to Site A on failback.
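
The required starting statuses can be encoded as a tiny pre-flight helper (purely illustrative; the values come from the transition table above, not from any shipped tooling):

```shell
# required_start_status <failover|failback>
# Prints the Protection Group status that must be current before the
# given operation may run, per the transition table.
required_start_status() {
  case $1 in
    failover) echo "active" ;;
    failback) echo "failed_over" ;;
    *) echo "unknown operation: $1" >&2; return 1 ;;
  esac
}
```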


Usage

Planned failover

Use planned failover when you need to migrate workloads intentionally — for example, for scheduled maintenance, capacity rebalancing, or a controlled DR drill where you want zero data loss. Both sites must be reachable.

# Authenticate to the site where workloads are currently running
source ~/site-a-openrc

openstack protector protection-group failover prod-web-app \
  --type planned \
  --network-mapping net-primary-web=net-secondary-web \
  --network-mapping net-primary-db=net-secondary-db \
  --flavor-mapping m1.large=m2.large

The command returns immediately with an operation ID. The engine runs the failover asynchronously. Track progress with openstack protector operation show <operation-id> (see the Monitoring section).
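
Since the command returns before the failover finishes, a polling wrapper is convenient for scripting. This sketch assumes operation show honours the standard OSC output selectors -f value -c <column>:

```shell
# wait_for_operation <operation-id> [poll-interval-seconds]
# Polls the operation until it reaches a terminal status; returns 0 on
# completion, 1 on failure or rollback.
wait_for_operation() {
  local op_id=$1 interval=${2:-10} status
  while :; do
    status=$(openstack protector operation show "$op_id" -f value -c status)
    case $status in
      completed)
        echo "operation $op_id completed"
        return 0 ;;
      failed|rolling_back)
        echo "operation $op_id ended in status: $status" >&2
        return 1 ;;
      *)
        sleep "$interval" ;;
    esac
  done
}
```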

Unplanned failover

Use unplanned failover when the primary site is unavailable and you must recover workloads from the most recent replicated snapshot. Authenticate to the secondary site — this is the site that will receive the workload.

# Authenticate to the secondary (DR) site
source ~/site-b-openrc

openstack protector protection-group failover prod-web-app \
  --type unplanned \
  --network-mapping net-primary-web=net-secondary-web \
  --network-mapping net-primary-db=net-secondary-db

Because the primary site is down, the engine loads the last-synced metadata from the local (Site B) database and promotes the latest available snapshot from FlashArray B. Sync to Site A is attempted but skipped if unreachable — the sync status is marked UNREACHABLE. Once Site A recovers, run openstack protector protection-group sync-force prod-web-app before making any modifications to the Protection Group.

Failback

Use failback after the primary site has recovered and you want to return workloads to it. Authenticate to the site that is currently running the workloads (the site you failed over to).

# Authenticate to the site currently running workloads (after failover, this is Site B)
source ~/site-b-openrc

# Confirm the peer site is reachable and metadata is in sync before failback
openstack protector protection-group sync-status prod-web-app

openstack protector protection-group failback prod-web-app \
  --type planned \
  --reverse-replication \
  --network-mapping net-secondary-web=net-primary-web \
  --network-mapping net-secondary-db=net-primary-db

Note that network mappings for failback are the reverse of failover mappings: the source networks are now on Site B, and the target networks are on Site A.
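
Because the failback mappings are the failover mappings with each side swapped, they can be derived mechanically instead of retyped (a sketch; pair values are placeholders):

```shell
# reverse_mapping src=dst [src=dst ...]
# Prints each pair with the sides swapped, ready to reuse as the
# --network-mapping values for failback.
reverse_mapping() {
  local pair
  for pair in "$@"; do
    printf '%s=%s\n' "${pair#*=}" "${pair%%=*}"
  done
}

# Usage (placeholder names):
#   reverse_mapping net-primary-web=net-secondary-web
#   prints: net-secondary-web=net-primary-web
```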

The --reverse-replication flag instructs the engine to flip the Pure Storage replication direction back to Site A → Site B after failback completes. Omitting this flag leaves replication in the Site B → Site A direction, which is appropriate only if you plan to fail over again immediately.

Monitoring operation progress

All DR operations run asynchronously. Use openstack protector operation show to track progress:

openstack protector operation show <operation-id>

To watch progress in real time:

watch -n 5 openstack protector operation show <operation-id>

To list all operations for your Protection Group:

openstack protector operation list

Operations report a progress field (0–100), a status field (pending, running, completed, failed, rolling_back), and a steps_completed array that shows which phases have finished. On failure, the error_message field contains the reason.
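
When scripting around failed operations, the error_message field can be pulled out directly. This sketch assumes operation show supports the standard OSC -f value -c <column> selectors:

```shell
# operation_error <operation-id>
# Prints the error_message of a failed operation; prints nothing (and
# succeeds) when the operation did not fail.
operation_error() {
  local status
  status=$(openstack protector operation show "$1" -f value -c status)
  if [ "$status" = "failed" ]; then
    openstack protector operation show "$1" -f value -c error_message
  fi
}
```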


Examples

Example 1: Planned failover with network and flavor mappings

Migrate a production web application from Site A to Site B during a maintenance window.

source ~/site-a-openrc

openstack protector protection-group failover prod-web-app \
  --type planned \
  --network-mapping a1b2c3d4-net-web=e5f6a7b8-net-web \
  --network-mapping c9d0e1f2-net-db=a3b4c5d6-net-db \
  --flavor-mapping m1.large=m2.large

Expected output:

+------------------------+--------------------------------------+
| Field                  | Value                                |
+------------------------+--------------------------------------+
| operation_id           | op-3fa85f64-5717-4562-b3fc-2c963f66 |
| operation_type         | failover                             |
| status                 | running                              |
| progress               | 10                                   |
| source_site            | site-a                               |
| target_site            | site-b                               |
| started_at             | 2025-03-15T09:00:12Z                 |
+------------------------+--------------------------------------+

Example 2: Monitoring failover progress

Poll the operation until completion. Use the operation_id returned by the failover command.

openstack protector operation show op-3fa85f64-5717-4562-b3fc-2c963f66

Output during Phase 2 (storage failover, ~20–60% progress):

+------------------------+----------------------------------------------+
| Field                  | Value                                        |
+------------------------+----------------------------------------------+
| operation_id           | op-3fa85f64-5717-4562-b3fc-2c963f66         |
| operation_type         | failover                                     |
| status                 | running                                      |
| progress               | 45                                           |
| steps_completed        | ["prepare", "validate_target",               |
|                        |  "get_snapshot", "promote_volumes"]          |
| steps_failed           | []                                           |
| error_message          |                                              |
| started_at             | 2025-03-15T09:00:12Z                         |
| completed_at           |                                              |
+------------------------+----------------------------------------------+

Output on successful completion:

+------------------------+----------------------------------------------+
| Field                  | Value                                        |
+------------------------+----------------------------------------------+
| operation_id           | op-3fa85f64-5717-4562-b3fc-2c963f66         |
| operation_type         | failover                                     |
| status                 | completed                                    |
| progress               | 100                                          |
| steps_completed        | ["prepare", "validate_target",               |
|                        |  "get_snapshot", "promote_volumes",          |
|                        |  "recreate_instances", "finalize"]           |
| steps_failed           | []                                           |
| error_message          |                                              |
| started_at             | 2025-03-15T09:00:12Z                         |
| completed_at           | 2025-03-15T09:07:44Z                         |
+------------------------+----------------------------------------------+

Example 3: Unplanned failover from the secondary site

Site A has gone down unexpectedly. Recover workloads from Site B using the last replicated snapshot.

source ~/site-b-openrc

openstack protector protection-group failover prod-web-app \
  --type unplanned \
  --network-mapping a1b2c3d4-net-web=e5f6a7b8-net-web \
  --network-mapping c9d0e1f2-net-db=a3b4c5d6-net-db

Expected output:

+------------------------+--------------------------------------+
| Field                  | Value                                |
+------------------------+--------------------------------------+
| operation_id           | op-7c8d9e0f-1234-5678-abcd-ef012345 |
| operation_type         | failover                             |
| status                 | running                              |
| progress               | 10                                   |
| source_site            | site-a                               |
| target_site            | site-b                               |
| started_at             | 2025-03-15T11:32:01Z                 |
+------------------------+--------------------------------------+

WARNING: Unplanned failover initiated. Source site (site-a) is unreachable.
Metadata sync to site-a will be deferred until it recovers.
Run 'openstack protector protection-group sync-force prod-web-app' after site-a is restored.

Example 4: Verifying Protection Group state after failover

After the operation completes, confirm that current_primary_site has flipped to Site B and failover_count has incremented.

source ~/site-b-openrc
openstack protector protection-group show prod-web-app

Expected output (relevant fields):

+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| status                     | failed_over                          |
| primary_site               | site-a                               |
| secondary_site             | site-b                               |
| current_primary_site       | site-b                               |
| failover_count             | 1                                    |
| last_failover_at           | 2025-03-15T09:07:44Z                 |
+----------------------------+--------------------------------------+

Example 5: Failback to Site A after recovery

Site A has been restored. Sync metadata, then fail back.

source ~/site-b-openrc

# Step 1: Confirm sync status
openstack protector protection-group sync-status prod-web-app

# Step 2: Force sync if Site A was previously unreachable
openstack protector protection-group sync-force prod-web-app

# Step 3: Execute failback
openstack protector protection-group failback prod-web-app \
  --type planned \
  --reverse-replication \
  --network-mapping e5f6a7b8-net-web=a1b2c3d4-net-web \
  --network-mapping a3b4c5d6-net-db=c9d0e1f2-net-db

Expected output:

+------------------------+--------------------------------------+
| Field                  | Value                                |
+------------------------+--------------------------------------+
| operation_id           | op-aabb1122-ccdd-3344-eeff-55667788 |
| operation_type         | failback                             |
| status                 | running                              |
| progress               | 10                                   |
| source_site            | site-b                               |
| target_site            | site-a                               |
| started_at             | 2025-03-16T08:15:00Z                 |
+------------------------+--------------------------------------+

After completion, the Protection Group status returns to active and current_primary_site returns to site-a.


Troubleshooting

Operation fails with "remote site unreachable" during planned failover

Symptom: Running openstack protector protection-group failover <pg> --type planned returns an error immediately, before the operation is created.

Cause: Planned failover requires both sites to be reachable for metadata sync. If the peer site cannot be contacted, the operation is blocked to prevent metadata divergence.

Fix:

  1. Confirm the secondary site's Keystone endpoint is reachable: curl -s http://site-b:5000/v3
  2. Check the site's registered status: openstack protector site validate site-b
  3. If the peer site is genuinely unavailable and you need to recover immediately, run an unplanned failover (--type unplanned) while authenticated to the secondary site instead.
  4. If the site is reachable but the sync status is stale, run openstack protector protection-group sync-force <pg-name> before retrying.

Operation stuck at a progress percentage for more than 10 minutes

Symptom: openstack protector operation show <op-id> shows status: running but progress has not changed for an extended period.

Cause: The engine may be waiting on a Pure Storage array operation (snapshot promotion, volume management), a Cinder API call, or a Nova boot. Long waits during the 20–60% range typically indicate storage issues; waits during 60–90% indicate Nova or Cinder issues.

Fix:

  1. Check steps_completed to identify which phase is stalled.
  2. Review the Protector engine log on the target site: journalctl -u protector-engine -f
  3. If the issue is storage-related, verify FlashArray reachability from the target site and confirm that the replication policy credentials in openstack protector protection-group policy-show <pg> are valid.
  4. If the issue is Nova-related, check openstack server list on the target site for VMs in ERROR state and inspect their fault details.

Failover completes but VMs are in ERROR state on the target site

Symptom: The operation reaches status: completed and progress: 100, but when you run openstack server list on the target site, one or more VMs show ERROR.

Cause: VMs can reach ERROR state after the operation completes if the boot volume failed to attach, if the specified flavor does not exist on the target site, or if the mapped network is full or otherwise unavailable.

Fix:

  1. Check the Nova fault for the failed VM: openstack server show <vm-id> — look at the fault field.
  2. Verify the flavor mapping: openstack flavor list on the target site and confirm the mapped flavor ID exists.
  3. Verify the network mapping: openstack network show <target-net-id> to confirm the network is active and has available IP addresses.
  4. Check Cinder volume status: openstack volume list on the target site. Volumes created during failover should be in-use. If they are available, the attachment failed.
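
Step 4 can be scripted: the sketch below lists volumes on the current site that are available rather than in-use, which after a failover usually means the attachment step failed (assumes the standard OSC -f value output; the column choice is illustrative):

```shell
# unattached_failover_volumes
# Prints one line per volume whose status is 'available' instead of
# 'in-use' on the site you are authenticated to.
unattached_failover_volumes() {
  openstack volume list -f value -c ID -c Name -c Status |
    awk '$NF == "available"'
}
```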

Failback blocked with "Protection Group is not in failed_over state"

Symptom: Running openstack protector protection-group failback <pg> returns an error stating the Protection Group is not eligible for failback.

Cause: Failback is only valid when the Protection Group status is failed_over. If a previous failover failed partway through, the status may be error instead.

Fix:

  1. Check the current status: openstack protector protection-group show <pg-name>
  2. If status is error, inspect the most recent operation: openstack protector operation list and then openstack protector operation show <op-id> to read the error_message.
  3. Resolve the underlying issue (storage, network, or Nova), then attempt the failover or failback again. If the Protection Group is stuck in error and you are certain it is safe to proceed, use --force — but only after confirming no partial VM or volume state exists on the target site that could cause conflicts.

Metadata out of sync after unplanned failover; modifications blocked

Symptom: After an unplanned failover, attempting to add or remove members returns: Cannot modify protection group - remote site unreachable.

Cause: During an unplanned failover, the engine marks the primary site's sync status as UNREACHABLE. The Protection Group remains in this state until the primary site recovers and a sync is performed. All modifications are blocked until both sites agree on the current metadata version.

Fix:

  1. Wait for Site A to recover.
  2. Check sync status: openstack protector protection-group sync-status <pg-name>
  3. Push the current metadata to Site A: openstack protector protection-group sync-force <pg-name>
  4. Confirm both sites are at the same version, then proceed with modifications.

volume_manage fails during failover with permission denied

Symptom: The operation fails during the storage failover phase (around 20–60% progress) with an error referencing volume_manage or PolicyNotAuthorized.

Cause: The Cinder policy on the target site does not permit the member role to call the volume manage API. This is a deployment configuration issue.

Fix: On the target site, update /etc/cinder/policy.yaml (or the Kolla-Ansible equivalent) to include:

"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"

Then restart the Cinder API service (or run kolla-ansible reconfigure -t cinder) and retry the failover operation.