Site Recovery for OpenStack
Guide

Test Failover Commands

openstack dr test-failover, test-cleanup; non-disruptive DR validation


Overview

Test failover (also called a DR drill) lets you validate your entire disaster recovery configuration without disrupting production workloads. The openstack dr test-failover command spins up copies of your protected VMs on the secondary site using replicated volume snapshots, while the originals continue running on the primary site. When the drill is complete, openstack dr test-cleanup tears down the test instances and discards the temporary volumes, leaving the Protection Group exactly as it was before the drill. Running regular DR drills is the only way to confirm that your replication policy, resource mappings, and secondary-site capacity are correct before you need them in a real incident.


Prerequisites

Before running a test failover, verify the following:

  • Both sites registered and reachable. The protector-api service must be able to reach both the primary and secondary site Keystone endpoints. Run openstack protector site validate <site-name> for each site.
  • Protection Group in active status. The Protection Group must not be in failed_over, failing_over, failing_back, or error state. Confirm with openstack protector protection-group show <pg-name>.
  • Replication policy configured. A replication policy with valid Pure Storage FlashArray credentials must be attached to the Protection Group. See Configure a replication policy for setup steps.
  • At least one successful replication cycle completed. A replicated snapshot must exist on FlashArray B before test failover can proceed. Check the consistency group sync status with openstack protector consistency-group show <pg-name>.
  • Metadata in sync across both sites. Both sites must be at the same metadata version. Run openstack protector protection-group sync-status <pg-name> and resolve any OUT OF SYNC condition before continuing.
  • Network mapping prepared. You need either the UUIDs of the target networks on the secondary site (for --network-mapping) or confirmation that --auto-network is supported by your deployment's Neutron policy. See Configure resource mappings for details.
  • Secondary-site capacity. Confirm that sufficient compute and storage capacity exists on the secondary site to accommodate the test instances, since they run concurrently with the production instances on the primary site.
  • Cinder policy updated. The protector service account requires volume_extension:volume_manage and volume_extension:volume_unmanage permissions on the secondary site's Cinder. Verify these are set per the Deployment Prerequisites guide.
  • OpenStack CLI and protectorclient plugin installed. The protectorclient OSC plugin must be installed in the same virtual environment as your openstack CLI.
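
The network-mapping preparation in particular lends itself to scripting: resolve the UUIDs on each site first, then assemble the repeated --network-mapping flags. A minimal bash sketch, using placeholder UUIDs (substitute your own, resolved as shown in the comments):

```shell
# Sketch: build repeated --network-mapping flags from a primary→secondary
# UUID map. Resolve the UUIDs ahead of time, e.g.:
#   openstack --os-cloud site-a network show net-primary-web -f value -c id
#   openstack --os-cloud site-b network show net-secondary-web -f value -c id
# The associative array below uses placeholder UUIDs.
declare -A NET_MAP=(
  ["a1b2c3d4-0000-0000-0000-111111111111"]="e5f6a7b8-0000-0000-0000-222222222222"
)
MAPPING_FLAGS=()
for primary in "${!NET_MAP[@]}"; do
  MAPPING_FLAGS+=(--network-mapping "${primary}=${NET_MAP[$primary]}")
done
# Pass the assembled flags to the drill:
#   openstack dr test-failover prod-web-app "${MAPPING_FLAGS[@]}" --wait
printf '%s\n' "${MAPPING_FLAGS[@]}"
```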

Installation

The test failover commands are part of the protectorclient OSC plugin, which ships with the Trilio Site Recovery package. If you have already installed the package for other DR operations, no additional installation is needed.

To confirm the plugin is available:

openstack dr --help

If the dr command group is not found, install or reinstall the client plugin:

pip install openstack-protector-client

Verify the installed version:

pip show openstack-protector-client

Configure your ~/.config/openstack/clouds.yaml to include credentials for both sites. The protectorclient authenticates to both sites independently — it does not relay through either site's protector-api service:

clouds:
  site-a:
    auth:
      auth_url: http://site-a-controller:5000/v3
      project_name: production-project
      username: your-user
      password: your-password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne

  site-b:
    auth:
      auth_url: http://site-b-controller:5000/v3
      project_name: production-project
      username: your-user
      password: your-password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne

With clouds.yaml in place, all DR commands accept --os-cloud <site-name> to select which site's API to authenticate against.


Configuration

Test failover behavior is controlled by the flags you pass at invocation time. There are no persistent configuration file settings specific to test failover — all behavior is per-run.

openstack dr test-failover options

  • <protection-group>. Required. Name or UUID of the Protection Group to test.
  • --network-mapping <primary-net>=<secondary-net>. Required unless --auto-network is used. Maps each primary-site network UUID to its secondary-site equivalent. Repeat the flag once per network.
  • --auto-network. Optional; disabled by default. Instructs the service to create a temporary, isolated network on the secondary site for the test instances. Useful when no production network counterpart exists on Site B. Requires Neutron policy to allow the protector service account to create networks.
  • --flavor-mapping <primary-flavor>=<secondary-flavor>. Optional; defaults to an identity mapping. Maps flavor IDs from the primary site to equivalent flavors on the secondary site. If omitted, the service uses the same flavor ID on both sites; the operation fails if that flavor does not exist on Site B.
  • --retain-primary. Optional; implicitly true for test failover. Primary VMs are always retained during a test failover, so this flag is accepted for explicitness but cannot be set to false. Use failover if you intend to actually cut over.
  • --wait. Optional; disabled by default. Blocks the terminal until the operation reaches a terminal state (completed or failed), then prints the result. Without this flag the command returns immediately with an operation ID.
  • --os-cloud <cloud-name>. Optional; defaults to the OS_CLOUD environment variable. Selects which entry in clouds.yaml to authenticate against. For test failover, authenticate against the primary site (the site where the VMs are currently running).

openstack dr test-cleanup options

  • <protection-group>. Required. Name or UUID of the Protection Group whose test environment to remove.
  • --operation-id <id>. Optional; defaults to the latest test-failover. Targets cleanup at a specific test-failover operation. If omitted, the service cleans up the most recent test environment for the Protection Group.
  • --force. Optional; disabled by default. Skips resources that were already deleted outside the service, logging them as warnings instead of failing the cleanup (see Troubleshooting).
  • --wait. Optional; disabled by default. Blocks until cleanup completes.
  • --os-cloud <cloud-name>. Optional; defaults to the OS_CLOUD environment variable. Authenticate against the primary site for cleanup.

Metadata sync behavior during test failover

Test failover does not change the Protection Group's current_primary_site_id or status — the Protection Group remains active throughout. However, because the operation is recorded in both sites' operation logs, both sites must be reachable when you initiate the drill. If the secondary site is unreachable, the test failover is blocked. If the secondary site becomes unreachable after the drill starts, the in-progress operation continues but cleanup must be retried once connectivity is restored.
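
Because both sites must be reachable at initiation, a pre-drill reachability gate is worth scripting. A minimal sketch; it wraps the site validate command from the prerequisites, and VALIDATE_CMD is a hypothetical override hook (an assumption added here for dry runs, not a documented setting):

```shell
# Pre-drill reachability gate: refuse to start the drill unless every site
# validates. VALIDATE_CMD is an override hook (an assumption for dry runs);
# it defaults to the real CLI check.
check_sites() {
  local validate_cmd="${VALIDATE_CMD:-openstack protector site validate}"
  local site
  for site in "$@"; do
    if ! $validate_cmd "$site" >/dev/null 2>&1; then
      echo "unreachable: $site"
      return 1
    fi
  done
  echo "all sites reachable"
}
# Usage before initiating the drill:
#   check_sites site-a site-b || exit 1
```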


Usage

Typical DR drill workflow

A DR drill follows a three-step cycle: initiate the test failover, validate the test environment, then clean up.

Step 1 — Initiate the test failover

Authenticate against the primary site (Site A, where your VMs are currently running) and submit the test failover command:

export OS_CLOUD=site-a

openstack dr test-failover prod-web-app \
  --network-mapping net-primary-web=net-secondary-web \
  --network-mapping net-primary-db=net-secondary-db \
  --flavor-mapping m1.large=m2.large \
  --wait

The --wait flag blocks your terminal and streams progress until the operation completes. Without it, the command returns an operation ID immediately and you poll separately.

Step 2 — Validate the test environment

Once the operation reaches completed, the test VMs are running on Site B. Validate them as you would any newly launched instance: check network reachability, confirm application services are responding, and verify data integrity against the expected RPO.

# Check that test instances appeared on Site B
export OS_CLOUD=site-b
openstack server list

# Check protection group status (should still be 'active')
export OS_CLOUD=site-a
openstack protector protection-group show prod-web-app

The Protection Group status remains active and your primary VMs continue running on Site A without interruption.

Step 3 — Clean up the test environment

After validation, remove the test VMs and their temporary volumes:

export OS_CLOUD=site-a

openstack dr test-cleanup prod-web-app --wait

Cleanup deletes the test instances from Site B's Nova, removes the temporary Cinder volumes that were created from replicated snapshots, and marks the DR operation as completed. The Protection Group's replication continues unaffected.
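
A quick post-cleanup check can confirm nothing was left behind on Site B. A sketch, assuming the drill instances carry a test- name prefix as in this guide's examples:

```shell
# Post-cleanup verification sketch: count any remaining drill instances on
# site-b (assumes the test- name prefix used for drill instances).
verify_cleanup() {
  local count
  count=$(( $(openstack --os-cloud site-b server list \
                --name 'test-*' -f value -c ID | wc -l) ))
  if [ "$count" -eq 0 ]; then
    echo clean
  else
    echo "leftover instances: $count"
    return 1
  fi
}
```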

Using auto-network instead of explicit mappings

If your secondary site does not have a pre-configured network that matches your primary-site topology, use --auto-network to let the service create a temporary isolated network for the drill:

openstack dr test-failover prod-web-app \
  --auto-network \
  --wait

The auto-network is torn down automatically during test-cleanup. Note that test instances on an auto-network will not have external connectivity unless you manually attach a floating IP during the drill.
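
Attaching that floating IP can be sketched as a small helper using standard OSC floating-IP commands against the secondary site. The external network name 'public-net' and the server name are assumptions for illustration:

```shell
# Drill-time external access sketch: allocate a floating IP from an external
# network ('public-net' is an assumed name) and attach it to a test instance.
attach_drill_fip() {
  local server="$1" ext_net="${2:-public-net}" fip
  fip=$(openstack floating ip create "$ext_net" \
          -f value -c floating_ip_address) || return 1
  openstack server add floating ip "$server" "$fip" || return 1
  echo "$fip"
}
# Run against the secondary site during the drill:
#   OS_CLOUD=site-b attach_drill_fip test-web-server-1
```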

Monitoring a running test failover

If you launched without --wait, track progress using the operation ID returned by the command:

openstack protector operation show <operation-id>

Or list all operations for the Protection Group:

openstack protector operation list --protection-group prod-web-app

The progress field increments from 0 to 100. The steps_completed field lists each completed phase, which is useful for diagnosing stalls.
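
For automation, the polling loop is worth wrapping with a timeout so a stalled drill fails fast instead of hanging. A sketch; STATUS_CMD is a hypothetical override hook (an assumption added for testing), defaulting to the real operation query:

```shell
# Poll an operation until it reaches a terminal state or a timeout elapses.
# STATUS_CMD is an override hook (an assumption, useful for testing); it
# defaults to the real CLI query shown above.
wait_for_operation() {
  local op_id="$1" timeout="${2:-1800}" interval="${3:-15}" elapsed=0 status
  local status_cmd="${STATUS_CMD:-openstack protector operation show}"
  while (( elapsed < timeout )); do
    status=$($status_cmd "$op_id" -f value -c status)
    case "$status" in
      completed) echo completed; return 0 ;;
      failed)    echo failed;    return 1 ;;
    esac
    sleep "$interval"
    (( elapsed += interval ))
  done
  echo timeout
  return 2
}
# Usage: wait_for_operation "$OP_ID" 3600 30 || echo "drill did not complete"
```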


Examples

Example 1 — Basic async test failover with explicit network mapping

This example drills a Protection Group named prod-web-app that uses async replication. Network and flavor mappings are provided explicitly.

export OS_CLOUD=site-a

openstack dr test-failover prod-web-app \
  --network-mapping a1b2c3d4-0000-0000-0000-111111111111=e5f6a7b8-0000-0000-0000-222222222222 \
  --flavor-mapping m1.large=m2.large \
  --wait

Expected output:

+------------------+--------------------------------------+
| Field            | Value                                |
+------------------+--------------------------------------+
| operation_id     | op-9f3c1a22-84b2-4d10-bc7e-...      |
| operation_type   | test_failover                        |
| status           | completed                            |
| progress         | 100                                  |
| source_site      | site-a                               |
| target_site      | site-b                               |
| started_at       | 2025-06-12T09:14:02Z                 |
| completed_at     | 2025-06-12T09:17:44Z                 |
| error_message    | None                                 |
+------------------+--------------------------------------+

Test failover completed successfully.
3 instances created on site-b.
Primary instances on site-a are unaffected.

Example 2 — Test failover without --wait, then poll progress

Submit the operation and poll separately. This is useful in automation pipelines where you want to perform other tasks while the drill runs.

export OS_CLOUD=site-a

# Submit and capture operation ID
OP_ID=$(openstack dr test-failover prod-web-app \
  --network-mapping a1b2c3d4-0000-0000-0000-111111111111=e5f6a7b8-0000-0000-0000-222222222222 \
  -f value -c operation_id)

echo "DR drill operation: $OP_ID"

# Poll until terminal state
while true; do
  STATUS=$(openstack protector operation show "$OP_ID" -f value -c status)
  PROGRESS=$(openstack protector operation show "$OP_ID" -f value -c progress)
  echo "Status: $STATUS  Progress: ${PROGRESS}%"
  [[ "$STATUS" == "completed" || "$STATUS" == "failed" ]] && break
  sleep 15
done

Expected output (progressive):

DR drill operation: op-9f3c1a22-84b2-4d10-bc7e-...
Status: running  Progress: 10%
Status: running  Progress: 35%
Status: running  Progress: 60%
Status: running  Progress: 85%
Status: completed  Progress: 100%

Example 3 — Verify test instances on Site B, then clean up

After a successful test failover, inspect the test VMs before tearing them down.

# Switch to Site B to see test instances
export OS_CLOUD=site-b
openstack server list --name 'test-*'

Expected output:

+--------------------------------------+-------------------+--------+----------------------------+
| ID                                   | Name              | Status | Networks                   |
+--------------------------------------+-------------------+--------+----------------------------+
| 7a3e1f22-0000-0000-0000-aabbccdd0001 | test-web-server-1 | ACTIVE | net-secondary-web=10.1.0.5 |
| 7a3e1f22-0000-0000-0000-aabbccdd0002 | test-web-server-2 | ACTIVE | net-secondary-web=10.1.0.6 |
| 7a3e1f22-0000-0000-0000-aabbccdd0003 | test-db-server    | ACTIVE | net-secondary-db=10.2.0.4  |
+--------------------------------------+-------------------+--------+----------------------------+

After validation, run cleanup from the primary site context:

export OS_CLOUD=site-a

openstack dr test-cleanup prod-web-app --wait

Expected output:

+------------------+--------------------------------------+
| Field            | Value                                |
+------------------+--------------------------------------+
| operation_id     | op-c4d5e6f7-1234-5678-abcd-...      |
| operation_type   | test_cleanup                         |
| status           | completed                            |
| progress         | 100                                  |
| completed_at     | 2025-06-12T09:35:11Z                 |
+------------------+--------------------------------------+

Test environment removed.
3 instances deleted from site-b.
3 temporary volumes removed from site-b.
Protection Group status: active

Example 4 — Test failover using auto-network

Use this when Site B has no pre-provisioned network counterpart for the primary-site networks.

export OS_CLOUD=site-a

openstack dr test-failover prod-web-app \
  --auto-network \
  --wait

Expected output:

Creating isolated test network on site-b... done
+------------------+--------------------------------------+
| Field            | Value                                |
+------------------+--------------------------------------+
| operation_id     | op-aa11bb22-ccdd-eeff-0011-...      |
| operation_type   | test_failover                        |
| status           | completed                            |
| progress         | 100                                  |
| test_network_id  | f9e8d7c6-0000-0000-0000-333333333333 |
+------------------+--------------------------------------+

Test failover completed. Instances are on isolated network f9e8d7c6-...
Run 'openstack dr test-cleanup prod-web-app' to remove the test environment
and delete the temporary network.

Troubleshooting

Issue: metadata sync error — remote site unreachable

Symptom: Test failover is rejected immediately with a message such as:

ERROR: Cannot initiate test failover — remote site unreachable.
Metadata sync to site-b failed: connection timeout.

Likely cause: The protectorclient (or protector-api on the primary site) cannot reach the secondary site's Keystone or protector-api endpoint. Both sites must be reachable when initiating a test failover.

Fix:

  1. Verify the secondary site's API endpoint is up: openstack protector site validate site-b
  2. Check network connectivity from the primary site controller to the secondary site's auth_url and port 8788.
  3. Confirm clouds.yaml has the correct auth_url for Site B.
  4. Once connectivity is restored, check sync status: openstack protector protection-group sync-status prod-web-app. If versions are mismatched, force a sync first: openstack protector protection-group sync-force prod-web-app.

Issue: no replicated snapshot available

Symptom: The test failover operation starts but fails during the storage phase with:

error_message: No replicated snapshot found for consistency group cg-87654321...
  on FlashArray B. Ensure at least one replication cycle has completed.

Likely cause: The replication policy is configured but no snapshot has yet been replicated from FlashArray A to FlashArray B. This is common immediately after adding a VM to a Protection Group.

Fix:

  1. Check replication status: openstack protector consistency-group show prod-web-app
  2. Force a manual sync to trigger an immediate snapshot: openstack protector consistency-group sync prod-web-app
  3. Wait for the sync to complete (the operation log will show sync_volumes as completed).
  4. Retry the test failover.

Issue: volume_manage permission denied on secondary site

Symptom: The test failover fails during the storage phase:

error_message: Cinder volume manage failed on site-b:
  Policy does not allow volume_extension:volume_manage to be performed.

Likely cause: The secondary site's Cinder policy has not been updated to allow the member role to manage volumes.

Fix: On the secondary site, add the following to /etc/cinder/policy.yaml:

"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"

For Kolla-Ansible deployments, update /etc/kolla/config/cinder/policy.yaml and run:

kolla-ansible -i inventory reconfigure -t cinder

Issue: Test instances created but Protection Group stuck in unexpected state

Symptom: The test failover operation shows completed but openstack protector protection-group show prod-web-app reports an unexpected status (e.g., error or failing_over).

Likely cause: A partial failure occurred after the instances were created but before the operation record was finalized. This can happen if the protector-engine process was interrupted mid-operation.

Fix:

  1. Check the operation detail for steps_failed: openstack protector operation show <operation-id>
  2. If the status is rolling_back, wait for rollback to complete before retrying.
  3. If the Protection Group is stuck in error, run openstack dr test-cleanup prod-web-app --wait to remove any partially created resources.
  4. After cleanup completes successfully, the Protection Group should return to active. If it does not, check the protector-engine logs on the primary site: journalctl -u protector-engine -n 200.

Issue: --auto-network fails with Policy does not allow create_network

Symptom:

ERROR: Failed to create test network on site-b:
  HttpException: 403 Forbidden — policy does not allow create_network:shared
  to be performed.

Likely cause: The protector service account does not have permission to create networks on the secondary site, and your Neutron policy restricts network creation to admin only.

Fix (option A): Update the secondary site's Neutron policy to allow the protector service account to create networks, then retry.

Fix (option B): Create the test network manually on Site B before running the drill, then use --network-mapping instead of --auto-network:

# On Site B, create an isolated test network
export OS_CLOUD=site-b
TEST_NET_ID=$(openstack network create dr-test-net -f value -c id)
openstack subnet create dr-test-subnet \
  --network $TEST_NET_ID \
  --subnet-range 192.168.200.0/24

# Run the drill with explicit mapping
export OS_CLOUD=site-a
openstack dr test-failover prod-web-app \
  --network-mapping <primary-net-uuid>=$TEST_NET_ID \
  --wait

Issue: test-cleanup fails — test instances already deleted manually

Symptom:

ERROR: Cleanup failed: instance test-web-server-1 (7a3e1f22-...) not found on site-b.

Likely cause: Someone manually deleted the test instances from Site B's Nova before running test-cleanup. The protector service still has records pointing to those instance IDs.

Fix: Run cleanup with the --force flag to instruct the service to skip missing resources and clean up only what it can find:

openstack dr test-cleanup prod-web-app --force --wait

The --force flag causes the cleanup operation to log missing resources as warnings rather than errors, and still removes any remaining temporary volumes and network objects before marking the operation complete.