Test Failover Commands
`openstack dr test-failover` and `openstack dr test-cleanup`: non-disruptive DR validation
Test failover (also called a DR drill) lets you validate your entire disaster recovery configuration without disrupting production workloads. The openstack dr test-failover command spins up copies of your protected VMs on the secondary site using replicated volume snapshots, while the originals continue running on the primary site. When the drill is complete, openstack dr test-cleanup tears down the test instances and discards the temporary volumes, leaving the Protection Group exactly as it was before the drill. Running regular DR drills is the only way to confirm that your replication policy, resource mappings, and secondary-site capacity are correct before you need them in a real incident.
Before running a test failover, verify the following:
- Both sites registered and reachable. The protector-api service must be able to reach both the primary and secondary site Keystone endpoints. Run `openstack protector site validate <site-name>` for each site.
- Protection Group in `active` status. The Protection Group must not be in `failed_over`, `failing_over`, `failing_back`, or `error` state. Confirm with `openstack protector protection-group show <pg-name>`.
- Replication policy configured. A replication policy with valid Pure Storage FlashArray credentials must be attached to the Protection Group. See Configure a replication policy for setup steps.
- At least one successful replication cycle completed. A replicated snapshot must exist on FlashArray B before test failover can proceed. Check the consistency group sync status with `openstack protector consistency-group show <pg-name>`.
- Metadata in sync across both sites. Both sites must be at the same metadata version. Run `openstack protector protection-group sync-status <pg-name>` and resolve any `OUT OF SYNC` condition before continuing.
- Network mapping prepared. You need either the UUIDs of the target networks on the secondary site (for `--network-mapping`) or confirmation that `--auto-network` is supported by your deployment's Neutron policy. See Configure resource mappings for details.
- Secondary-site capacity confirmed. Sufficient compute and storage capacity must exist on the secondary site to accommodate the test instances, since they run concurrently with the production instances on the primary site.
- Cinder policy updated. The protector service account requires `volume_extension:volume_manage` and `volume_extension:volume_unmanage` permissions on the secondary site's Cinder. Verify these are set per the Deployment Prerequisites guide.
- OpenStack CLI and protectorclient plugin installed. The `protectorclient` OSC plugin must be installed in the same virtual environment as your `openstack` CLI.
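The checklist above is easy to script. The sketch below is a dry run under the assumptions of this guide: it only prints the verification commands for a given Protection Group and site pair, so you can review them first or pipe the output to `bash`.

```shell
#!/usr/bin/env bash
# Print the preflight verification commands for a DR drill (dry run).
# Review the output, or pipe it to `bash` to execute the checks.
preflight_commands() {
  local pg="$1" site_a="$2" site_b="$3"
  cat <<EOF
openstack protector site validate ${site_a}
openstack protector site validate ${site_b}
openstack protector protection-group show ${pg}
openstack protector consistency-group show ${pg}
openstack protector protection-group sync-status ${pg}
EOF
}

preflight_commands prod-web-app site-a site-b
```

Keeping the script as a printer rather than an executor makes it safe to run anywhere, including from hosts that are not yet configured with `clouds.yaml`.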
The test failover commands are part of the protectorclient OSC plugin, which ships with the Trilio Site Recovery package. If you have already installed the package for other DR operations, no additional installation is needed.
To confirm the plugin is available:
openstack dr --help
If the dr command group is not found, install or reinstall the client plugin:
pip install openstack-protector-client
Verify the installed version:
pip show openstack-protector-client
Configure your ~/.config/openstack/clouds.yaml to include credentials for both sites. The protectorclient authenticates to both sites independently — it does not relay through either site's protector-api service:
clouds:
  site-a:
    auth:
      auth_url: http://site-a-controller:5000/v3
      project_name: production-project
      username: your-user
      password: your-password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
  site-b:
    auth:
      auth_url: http://site-b-controller:5000/v3
      project_name: production-project
      username: your-user
      password: your-password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
With clouds.yaml in place, all DR commands accept --os-cloud <site-name> to select which site's API to authenticate against.
Test failover behavior is controlled by the flags you pass at invocation time. There are no persistent configuration file settings specific to test failover — all behavior is per-run.
openstack dr test-failover options
| Option | Required | Default | Description |
|---|---|---|---|
| `<protection-group>` | Yes | — | Name or UUID of the Protection Group to test. |
| `--network-mapping <primary-net>=<secondary-net>` | Conditional | — | Maps each primary-site network UUID to its secondary-site equivalent. Repeat the flag once per network. Required unless `--auto-network` is used. |
| `--auto-network` | Conditional | Disabled | Instructs the service to create a temporary, isolated network on the secondary site for the test instances. Useful when no production network counterpart exists on Site B. Requires Neutron policy to allow the protector service account to create networks. |
| `--flavor-mapping <primary-flavor>=<secondary-flavor>` | No | Identity mapping | Maps flavor IDs from the primary site to equivalent flavors on the secondary site. If omitted, the service uses the same flavor ID on both sites; the operation fails if that flavor does not exist on Site B. |
| `--retain-primary` | No | true (implicit for test-failover) | For test failover, primary VMs are always retained; this flag is accepted for explicitness but cannot be set to false. Use `failover` if you intend to actually cut over. |
| `--wait` | No | Disabled | Block the terminal until the operation reaches a terminal state (completed or failed), then print the result. Without this flag the command returns immediately with an operation ID. |
| `--os-cloud <cloud-name>` | No | `OS_CLOUD` env var | Which entry in `clouds.yaml` to authenticate against. For test failover, authenticate against the primary site (the site where VMs are currently running). |
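When a Protection Group spans several networks, the repeated `--network-mapping` flags are tedious to type by hand. A minimal bash sketch (assuming bash 4.3+ for namerefs; the mapping values are illustrative) that assembles the flags from an associative array and prints the resulting command for review rather than executing it:

```shell
#!/usr/bin/env bash
# Assemble repeated --network-mapping flags from an associative array
# and print the full test-failover command for review.
declare -A NET_MAP=(
  [net-primary-web]=net-secondary-web
  [net-primary-db]=net-secondary-db
)

build_mapping_flags() {
  # $1 = flag name, $2 = name of an associative array (bash 4.3+ nameref)
  local flag="$1"
  local -n map_ref="$2"
  local key out=""
  for key in "${!map_ref[@]}"; do
    out+="${flag} ${key}=${map_ref[$key]} "
  done
  printf '%s' "${out% }"   # trim the trailing space
}

echo "openstack dr test-failover prod-web-app $(build_mapping_flags --network-mapping NET_MAP) --wait"
```

The same helper works for `--flavor-mapping` by passing a different flag name and array. Note that bash does not guarantee the iteration order of associative arrays; the service accepts the mappings in any order.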
openstack dr test-cleanup options
| Option | Required | Default | Description |
|---|---|---|---|
| `<protection-group>` | Yes | — | Name or UUID of the Protection Group whose test environment to remove. |
| `--operation-id <id>` | No | Latest test-failover | Targets cleanup at a specific test-failover operation. If omitted, the service cleans up the most recent test environment for the Protection Group. |
| `--wait` | No | Disabled | Block until cleanup completes. |
| `--os-cloud <cloud-name>` | No | `OS_CLOUD` env var | Authenticate against the primary site for cleanup. |
Metadata sync behavior during test failover
Test failover does not change the Protection Group's current_primary_site_id or status — the Protection Group remains active throughout. However, because the operation is recorded in both sites' operation logs, both sites must be reachable when you initiate the drill. If the secondary site is unreachable, the test failover is blocked. If the secondary site becomes unreachable after the drill starts, the in-progress operation continues but cleanup must be retried once connectivity is restored.
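Because cleanup can be retried safely once connectivity returns, unattended drills benefit from a simple retry wrapper. In this sketch, `run_cleanup` is a hypothetical stub standing in for `openstack dr test-cleanup <pg> --wait`; swap in the real command in your environment:

```shell
#!/usr/bin/env bash
# Retry cleanup with a fixed delay until it succeeds or attempts run out.
# run_cleanup is a placeholder for `openstack dr test-cleanup <pg> --wait`.
run_cleanup() { return 0; }   # hypothetical stub for illustration

retry_cleanup() {
  local attempts="$1" delay="$2" i
  for (( i = 1; i <= attempts; i++ )); do
    if run_cleanup; then
      echo "cleanup succeeded on attempt $i"
      return 0
    fi
    echo "cleanup failed (attempt $i of $attempts); retrying in ${delay}s"
    sleep "$delay"
  done
  return 1
}

retry_cleanup 5 60
```

A fixed delay is enough here because the failure mode is site connectivity, not load; there is little to gain from exponential backoff.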
Typical DR drill workflow
A DR drill follows a three-step cycle: initiate the test failover, validate the test environment, then clean up.
Step 1 — Initiate the test failover
Authenticate against the primary site (Site A, where your VMs are currently running) and submit the test failover command:
export OS_CLOUD=site-a
openstack dr test-failover prod-web-app \
--network-mapping net-primary-web=net-secondary-web \
--network-mapping net-primary-db=net-secondary-db \
--flavor-mapping m1.large=m2.large \
--wait
The --wait flag blocks your terminal and streams progress until the operation completes. Without it, the command returns an operation ID immediately and you poll separately.
Step 2 — Validate the test environment
Once the operation reaches completed, the test VMs are running on Site B. Validate them as you would any newly launched instance: check network reachability, confirm application services are responding, and verify data integrity against the expected RPO.
# Check that test instances appeared on Site B
export OS_CLOUD=site-b
openstack server list
# Check protection group status (should still be 'active')
export OS_CLOUD=site-a
openstack protector protection-group show prod-web-app
The Protection Group status remains active and your primary VMs continue running on Site A without interruption.
Step 3 — Clean up the test environment
After validation, remove the test VMs and their temporary volumes:
export OS_CLOUD=site-a
openstack dr test-cleanup prod-web-app --wait
Cleanup deletes the test instances from Site B's Nova, removes the temporary Cinder volumes that were created from replicated snapshots, and marks the DR operation as completed. The Protection Group's replication continues unaffected.
Using auto-network instead of explicit mappings
If your secondary site does not have a pre-configured network that matches your primary-site topology, use --auto-network to let the service create a temporary isolated network for the drill:
openstack dr test-failover prod-web-app \
--auto-network \
--wait
The auto-network is torn down automatically during test-cleanup. Note that test instances on an auto-network will not have external connectivity unless you manually attach a floating IP during the drill.
Monitoring a running test failover
If you launched without --wait, track progress using the operation ID returned by the command:
openstack protector operation show <operation-id>
Or list all operations for the Protection Group:
openstack protector operation list --protection-group prod-web-app
The progress field increments from 0 to 100. The steps_completed field lists each completed phase, which is useful for diagnosing stalls.
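For automation, the polling described above should be bounded by a timeout so a stalled drill does not hang the pipeline forever. This is a pure-shell sketch; `get_status` is a hypothetical stand-in for `openstack protector operation show "$OP_ID" -f value -c status`, which you would substitute in a real run:

```shell
#!/usr/bin/env bash
# Poll until the operation reaches a terminal state or a timeout expires.
# get_status is a placeholder for the real `openstack protector operation
# show` call shown above.
get_status() { echo "completed"; }   # hypothetical stub for illustration

poll_operation() {
  local timeout="$1" interval="$2" elapsed=0 status
  while (( elapsed < timeout )); do
    status=$(get_status)
    case "$status" in
      completed|failed)
        echo "$status"
        return 0
        ;;
    esac
    sleep "$interval"
    (( elapsed += interval ))
  done
  echo "timeout"
  return 1
}

poll_operation 900 15
```

The non-zero exit on timeout lets a CI job fail the stage cleanly instead of blocking.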
Example 1 — Basic async test failover with explicit network mapping
This example drills a Protection Group named prod-web-app that uses async replication. Network and flavor mappings are provided explicitly.
export OS_CLOUD=site-a
openstack dr test-failover prod-web-app \
--network-mapping a1b2c3d4-0000-0000-0000-111111111111=e5f6a7b8-0000-0000-0000-222222222222 \
--flavor-mapping m1.large=m2.large \
--wait
Expected output:
+------------------+--------------------------------------+
| Field | Value |
+------------------+--------------------------------------+
| operation_id | op-9f3c1a22-84b2-4d10-bc7e-... |
| operation_type | test_failover |
| status | completed |
| progress | 100 |
| source_site | site-a |
| target_site | site-b |
| started_at | 2025-06-12T09:14:02Z |
| completed_at | 2025-06-12T09:17:44Z |
| error_message | None |
+------------------+--------------------------------------+
Test failover completed successfully.
3 instances created on site-b.
Primary instances on site-a are unaffected.
Example 2 — Test failover without --wait, then poll progress
Submit the operation and poll separately. This is useful in automation pipelines where you want to perform other tasks while the drill runs.
export OS_CLOUD=site-a
# Submit and capture operation ID
OP_ID=$(openstack dr test-failover prod-web-app \
--network-mapping a1b2c3d4-0000-0000-0000-111111111111=e5f6a7b8-0000-0000-0000-222222222222 \
-f value -c operation_id)
echo "DR drill operation: $OP_ID"
# Poll until terminal state
while true; do
STATUS=$(openstack protector operation show "$OP_ID" -f value -c status)
PROGRESS=$(openstack protector operation show "$OP_ID" -f value -c progress)
echo "Status: $STATUS Progress: ${PROGRESS}%"
[[ "$STATUS" == "completed" || "$STATUS" == "failed" ]] && break
sleep 15
done
Expected output (progressive):
DR drill operation: op-9f3c1a22-84b2-4d10-bc7e-...
Status: running Progress: 10%
Status: running Progress: 35%
Status: running Progress: 60%
Status: running Progress: 85%
Status: completed Progress: 100%
Example 3 — Verify test instances on Site B, then clean up
After a successful test failover, inspect the test VMs before tearing them down.
# Switch to Site B to see test instances
export OS_CLOUD=site-b
openstack server list --name 'test-*'
Expected output:
+--------------------------------------+-------------------+--------+----------------------------+
| ID | Name | Status | Networks |
+--------------------------------------+-------------------+--------+----------------------------+
| 7a3e1f22-0000-0000-0000-aabbccdd0001 | test-web-server-1 | ACTIVE | net-secondary-web=10.1.0.5 |
| 7a3e1f22-0000-0000-0000-aabbccdd0002 | test-web-server-2 | ACTIVE | net-secondary-web=10.1.0.6 |
| 7a3e1f22-0000-0000-0000-aabbccdd0003 | test-db-server | ACTIVE | net-secondary-db=10.2.0.4 |
+--------------------------------------+-------------------+--------+----------------------------+
After validation, run cleanup from the primary site context:
export OS_CLOUD=site-a
openstack dr test-cleanup prod-web-app --wait
Expected output:
+------------------+--------------------------------------+
| Field | Value |
+------------------+--------------------------------------+
| operation_id | op-c4d5e6f7-1234-5678-abcd-... |
| operation_type | test_cleanup |
| status | completed |
| progress | 100 |
| completed_at | 2025-06-12T09:35:11Z |
+------------------+--------------------------------------+
Test environment removed.
3 instances deleted from site-b.
3 temporary volumes removed from site-b.
Protection Group status: active
Example 4 — Test failover using auto-network
Use this when Site B has no pre-provisioned network counterpart for the primary-site networks.
export OS_CLOUD=site-a
openstack dr test-failover prod-web-app \
--auto-network \
--wait
Expected output:
Creating isolated test network on site-b... done
+------------------+--------------------------------------+
| Field | Value |
+------------------+--------------------------------------+
| operation_id | op-aa11bb22-ccdd-eeff-0011-... |
| operation_type | test_failover |
| status | completed |
| progress | 100 |
| test_network_id | f9e8d7c6-0000-0000-0000-333333333333 |
+------------------+--------------------------------------+
Test failover completed. Instances are on isolated network f9e8d7c6-...
Run 'openstack dr test-cleanup prod-web-app' to remove the test environment
and delete the temporary network.
Issue: metadata sync error — remote site unreachable
Symptom: Test failover is rejected immediately with a message such as:
ERROR: Cannot initiate test failover — remote site unreachable.
Metadata sync to site-b failed: connection timeout.
Likely cause: The protectorclient (or protector-api on the primary site) cannot reach the secondary site's Keystone or protector-api endpoint. Both sites must be reachable when initiating a test failover.
Fix:
- Verify the secondary site's API endpoint is up: `openstack protector site validate site-b`
- Check network connectivity from the primary site controller to the secondary site's `auth_url` and port 8788.
- Confirm `clouds.yaml` has the correct `auth_url` for Site B.
- Once connectivity is restored, check sync status: `openstack protector protection-group sync-status prod-web-app`. If versions are mismatched, force a sync first: `openstack protector protection-group sync-force prod-web-app`.
Issue: no replicated snapshot available
Symptom: The test failover operation starts but fails during the storage phase with:
error_message: No replicated snapshot found for consistency group cg-87654321...
on FlashArray B. Ensure at least one replication cycle has completed.
Likely cause: The replication policy is configured but no snapshot has yet been replicated from FlashArray A to FlashArray B. This is common immediately after adding a VM to a Protection Group.
Fix:
- Check replication status: `openstack protector consistency-group show prod-web-app`
- Force a manual sync to trigger an immediate snapshot: `openstack protector consistency-group sync prod-web-app`
- Wait for the sync to complete (the operation log will show `sync_volumes` as completed).
- Retry the test failover.
Issue: volume_manage permission denied on secondary site
Symptom: The test failover fails during the storage phase:
error_message: Cinder volume manage failed on site-b:
Policy does not allow volume_extension:volume_manage to be performed.
Likely cause: The secondary site's Cinder policy has not been updated to allow the member role to manage volumes.
Fix:
On the secondary site, add the following to /etc/cinder/policy.yaml:
"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"
For Kolla-Ansible deployments, update /etc/kolla/config/cinder/policy.yaml and run:
kolla-ansible -i inventory reconfigure -t cinder
Issue: Test instances created but Protection Group stuck in unexpected state
Symptom: The test failover operation shows completed but openstack protector protection-group show prod-web-app reports an unexpected status (e.g., error or failing_over).
Likely cause: A partial failure occurred after the instances were created but before the operation record was finalized. This can happen if the protector-engine process was interrupted mid-operation.
Fix:
- Check the operation detail for `steps_failed`: `openstack protector operation show <operation-id>`
- If the status is `rolling_back`, wait for rollback to complete before retrying.
- If the Protection Group is stuck in `error`, run `openstack dr test-cleanup prod-web-app --wait` to remove any partially created resources.
- After cleanup completes successfully, the Protection Group should return to `active`. If it does not, check the protector-engine logs on the primary site: `journalctl -u protector-engine -n 200`.
Issue: --auto-network fails with Policy does not allow create_network
Symptom:
ERROR: Failed to create test network on site-b:
HttpException: 403 Forbidden — policy does not allow create_network:shared
to be performed.
Likely cause: The protector service account does not have permission to create networks on the secondary site, and your Neutron policy restricts network creation to admin only.
Fix (option A): Update the secondary site's Neutron policy to allow the protector service account to create networks, then retry.
Fix (option B): Create the test network manually on Site B before running the drill, then use --network-mapping instead of --auto-network:
# On Site B, create an isolated test network
export OS_CLOUD=site-b
TEST_NET_ID=$(openstack network create dr-test-net -f value -c id)
openstack subnet create dr-test-subnet \
--network $TEST_NET_ID \
--subnet-range 192.168.200.0/24
# Run the drill with explicit mapping
export OS_CLOUD=site-a
openstack dr test-failover prod-web-app \
--network-mapping <primary-net-uuid>=$TEST_NET_ID \
--wait
Issue: test-cleanup fails — test instances already deleted manually
Symptom:
ERROR: Cleanup failed: instance test-web-server-1 (7a3e1f22-...) not found on site-b.
Likely cause: Someone manually deleted the test instances from Site B's Nova before running test-cleanup. The protector service still has records pointing to those instance IDs.
Fix: Run cleanup with the --force flag to instruct the service to skip missing resources and clean up only what it can find:
openstack dr test-cleanup prod-web-app --force --wait
The --force flag causes the cleanup operation to log missing resources as warnings rather than errors, and still removes any remaining temporary volumes and network objects before marking the operation complete.