Test Failover
Non-disruptive DR validation using snapshot-backed VMs on the secondary site
Test failover lets you validate your disaster recovery readiness without affecting production workloads. The operation boots your protected VMs on the secondary site using snapshot-backed volumes cloned from the latest replication snapshot, runs configurable health checks against them, and then tears everything down — leaving primary-site workloads untouched throughout. Use test failover as a regular DR drill to confirm that your resource mappings, network topology, health check thresholds, and RTO estimates are accurate before you ever need to execute a real failover.
Before running a test failover, confirm the following:
- Two registered and reachable sites: Both your primary and secondary sites must be registered in Trilio Site Recovery and reporting status: active. Run openstack dr site list to verify.
- Protection Group in active status: The Protection Group you intend to test must have at least one VM member and must not currently be in a failover, failback, or error state.
- Replication policy configured: A replication policy must be attached to the Protection Group, and at least one replicated snapshot must exist on the secondary site. Run openstack dr protection-group show <pg> and confirm last_replication_at is populated.
- Volume types with replication enabled: All volumes in the Protection Group's Consistency Group must use a Cinder volume type with replication_enabled='<is> True'. Volumes on non-replicating types will block the operation.
- Resource mappings prepared: Network and (optionally) flavor mappings for the secondary site must be available. You can supply them at invocation time or pre-configure them on the Protection Group.
- Glance image parity (mock storage mode only): If you are using the mock storage driver, every image used to boot VMs in the Protection Group must also exist on the secondary site with the same UUID or a configured image mapping. The mock driver simulates volume data from Glance images rather than from a real FlashArray snapshot.
- Cinder policy updated on secondary site: The volume_extension:volume_manage policy must allow the member role on the secondary site so that the service can import snapshot-backed volumes into Cinder. See the deployment prerequisites guide if this has not been configured.
- OSC plugin installed: The protectorclient OpenStack CLI plugin must be installed in your environment and openstack dr commands must resolve correctly.
The test failover capability is part of the core Trilio Site Recovery service and requires no additional installation beyond the base Protector deployment. If you have not yet deployed the service, follow the main deployment guide first.
To confirm your environment is ready to execute test failovers:
Step 1 — Verify the CLI plugin is available
openstack dr --help
You should see the test-failover protection-group action and the test cleanup command listed. If the command is not found, reinstall the protectorclient plugin:
pip install openstack-protectorclient
Step 2 — Confirm both sites are reachable
openstack dr site list
Expected output:
+--------+-----------+--------+----------------------------+
| Name   | Type      | Status | Auth URL                   |
+--------+-----------+--------+----------------------------+
| site-a | primary   | active | http://site-a-ctrl:5000/v3 |
| site-b | secondary | active | http://site-b-ctrl:5000/v3 |
+--------+-----------+--------+----------------------------+
Both sites must show active before proceeding.
Step 3 — Confirm at least one replication snapshot exists
openstack dr protection-group show <pg-name-or-id>
Check that last_replication_at is set and that replication_status is not error. If no snapshot exists yet, wait for the next replication cycle or force one:
openstack dr consistency-group sync <pg-name-or-id>
Test failover behaviour is controlled by flags passed at invocation time and by a small set of options in protector.conf. The table below covers the settings most relevant to test failover.
CLI flags (per-invocation)
| Flag | Default | Description |
|---|---|---|
| --network-mapping <src>=<dst> | Required | Maps each primary-site network UUID to a secondary-site network UUID. Repeat the flag for each network. |
| --flavor-mapping <src>=<dst> | None (same flavor reused) | Maps primary flavor IDs to secondary flavor IDs. Omit if secondary flavors match by ID. |
| --auto-network | Off | Instructs the service to create a temporary isolated network on the secondary site for test VMs. Use this for quick drills when you do not want to pre-configure a mapping. Note: requires create_network:shared policy on the secondary Neutron — see the deployment prerequisites guide. |
| --no-cleanup | Off (cleanup runs automatically) | Leaves test VMs, volumes, and networks in place after verification completes. Useful when you want to inspect test instances manually before tearing them down. Cleanup must then be triggered manually. |
| --health-check tcp:<port> | None | Adds a TCP connectivity check against the specified port on each recovered VM. Repeat for multiple ports. |
| --health-check http:<path> | None | Adds an HTTP GET check. The service polls the path on each VM until it returns 2xx or the timeout expires. |
| --health-check exec:<command> | None | Runs an arbitrary command inside each test VM via the Nova console API. The VM must support the guest agent for exec checks. |
| --health-check-timeout <seconds> | 120 | Maximum time in seconds to wait for all health checks to pass before marking the operation failed. |
| --test-project <project-id> | Same project as Protection Group | Boots test VMs into a different project for isolation. The service account must have the member role in the target project. |
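Because the flags compose mechanically, drills are easy to wrap in automation. The sketch below assembles the argv for a test-failover invocation; the flag names come from the table above, while the helper itself is an illustrative wrapper, not part of the protectorclient plugin:

```python
def build_test_failover_cmd(pg, network_map, flavor_map=None,
                            health_checks=(), timeout=None, no_cleanup=False):
    """Assemble argv for 'openstack dr protection-group test-failover'.

    network_map / flavor_map: dicts of primary-site ID -> secondary-site ID.
    health_checks: strings such as 'tcp:443' or 'http:/healthz'."""
    cmd = ["openstack", "dr", "protection-group", "test-failover", pg]
    for src, dst in network_map.items():
        cmd += ["--network-mapping", f"{src}={dst}"]
    for src, dst in (flavor_map or {}).items():
        cmd += ["--flavor-mapping", f"{src}={dst}"]
    for check in health_checks:
        cmd += ["--health-check", check]
    if timeout is not None:
        cmd += ["--health-check-timeout", str(timeout)]
    if no_cleanup:
        cmd.append("--no-cleanup")
    return cmd
```

Passing the resulting list to subprocess.run avoids shell-quoting surprises when mappings contain UUIDs.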
protector.conf options
The following options in the [test_failover] section of protector.conf set service-wide defaults. Per-invocation flags override them.
[test_failover]
# Automatically clean up test resources after verification (true/false).
# Set to false during initial DR programme setup if you want engineers to
# inspect recovered instances before cleanup runs.
auto_cleanup = true
# Default health check timeout in seconds.
health_check_timeout = 120
# Maximum number of concurrent test failover operations across all
# protection groups. Prevents secondary-site resource exhaustion during
# large-scale drills.
max_concurrent_tests = 3
Operation phase transitions
Understanding the phase sequence helps you interpret progress output and diagnose failures:
| Phase | Description |
|---|---|
| pending | Operation accepted; engine picking it up. |
| creating_snapshots | Engine is requesting a point-in-time snapshot of the Consistency Group on the secondary FlashArray (or mock backend). |
| creating_volumes | Cinder volumes are being created from the snapshot on the secondary site using the volume manage API. |
| creating_vms | Nova instances are being booted on the secondary site using the cloned volumes and mapped networks/flavors. |
| verifying | Configured health checks are running against the test VMs. |
| succeeded | All checks passed. Cleanup will run immediately unless --no-cleanup or auto_cleanup = false. |
| failed | One or more checks did not pass within the timeout, or a phase error occurred. Cleanup still runs (unless disabled). |
| cleaning_up | Test VMs, snapshot-backed volumes, and any auto-created networks are being deleted. |
| cleaned | All test resources removed. Terminal state. |
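The sequence in the table can be captured as a small state machine, which is handy when post-processing operation records from drills. A sketch derived only from the transitions described above (treating failed as reachable from any active phase, per the table's note about phase errors):

```python
# Allowed phase transitions, derived from the phase table above.
TRANSITIONS = {
    "pending":            {"creating_snapshots", "failed"},
    "creating_snapshots": {"creating_volumes", "failed"},
    "creating_volumes":   {"creating_vms", "failed"},
    "creating_vms":       {"verifying", "failed"},
    "verifying":          {"succeeded", "failed"},
    "succeeded":          {"cleaning_up"},   # skipped if --no-cleanup
    "failed":             {"cleaning_up"},   # cleanup still runs unless disabled
    "cleaning_up":        {"cleaned"},
    "cleaned":            set(),             # terminal state
}

def is_valid_sequence(phases):
    """Return True if a list of observed phases follows the allowed order."""
    return all(nxt in TRANSITIONS.get(cur, set())
               for cur, nxt in zip(phases, phases[1:]))
```

A sequence that jumps phases (for example pending straight to verifying) indicates missing or reordered records in whatever log you are analysing.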
Initiating a test failover
You authenticate to your primary site to issue the command. The protectorclient plugin authenticates to both sites on your behalf and coordinates the operation.
source ~/site-a-openrc
openstack dr protection-group test-failover prod-web-app \
--network-mapping net-primary-web=net-secondary-web \
--network-mapping net-primary-db=net-secondary-db
The CLI returns an operation ID immediately. The operation runs asynchronously.
Monitoring progress
Poll the operation until it reaches succeeded, failed, or cleaned:
watch -n 5 openstack dr operation show <operation-id>
For a one-shot status check:
openstack dr operation show <operation-id>
The progress field (0–100) and steps_completed list give you granular visibility into which phase is executing.
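Instead of watch, you can poll from a script. The sketch below keeps the polling loop separate from the CLI call so the loop itself is testable; fetch_status is an assumed callable you would wire to the operation show command, as noted after the block:

```python
import time

# Terminal states from the phase table: polling can stop at any of these.
TERMINAL = {"succeeded", "failed", "cleaned"}

def wait_for_operation(fetch_status, interval=5, max_polls=120):
    """Poll fetch_status() until it returns a terminal state.

    fetch_status: zero-argument callable returning the current status string.
    Returns the terminal status, or raises TimeoutError after max_polls."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL:
            return status
        time.sleep(interval)
    raise TimeoutError("operation did not reach a terminal state")
```

In practice fetch_status might wrap a subprocess call to openstack dr operation show <operation-id> and extract the status field from its output; the exact output-parsing is left to you since it depends on the formatter flags your OSC version supports.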
Inspecting test VMs before cleanup
If you want to log in to test instances or run manual checks before they are deleted, pass --no-cleanup:
openstack dr protection-group test-failover prod-web-app \
--network-mapping net-primary-web=net-secondary-web \
--no-cleanup
Once you have finished inspecting the instances, trigger cleanup manually:
openstack dr test cleanup <operation-id>
The operation transitions from succeeded (or failed) → cleaning_up → cleaned.
Running a drill with automatic network isolation
If your secondary site has an appropriate Neutron policy, you can skip manual network mapping and let the service create an isolated test network:
openstack dr protection-group test-failover prod-web-app \
--auto-network
The auto-created network is scoped to the test operation and is deleted during cleanup. It has no external routing, so test VMs are reachable only from within the secondary site.
What the operation does NOT do
- It does not stop, snapshot, or in any way interrupt VMs running on the primary site.
- It does not promote the secondary site's replicated volumes to writable state permanently — cloned volumes are ephemeral and removed during cleanup.
- It does not change the Protection Group's current_primary_site_id or failover count. The Protection Group status remains active throughout.
Example 1 — Basic test failover with network mapping
The simplest complete drill: map two networks, run the default verification (boot success only), and let auto-cleanup remove everything.
source ~/site-a-openrc
openstack dr protection-group test-failover prod-web-app \
--network-mapping a1b2c3d4-net-web=f5e6d7c8-net-web \
--network-mapping a1b2c3d4-net-db=f5e6d7c8-net-db
Expected output:
+------------------+-----------------------------------------+
| Field            | Value                                   |
+------------------+-----------------------------------------+
| operation_id     | op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a4b5c |
| operation_type   | test_failover                           |
| status           | pending                                 |
| progress         | 0                                       |
| protection_group | prod-web-app                            |
| source_site      | site-a                                  |
| target_site      | site-b                                  |
+------------------+-----------------------------------------+
Poll until complete:
openstack dr operation show op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a4b5c
Final state when successful:
+-------------------+----------------------------------------------+
| Field | Value |
+-------------------+----------------------------------------------+
| status | cleaned |
| progress | 100 |
| steps_completed | ["creating_snapshots", "creating_volumes", |
| | "creating_vms", "verifying", |
| | "cleaning_up"] |
| started_at | 2025-06-10T14:02:11Z |
| completed_at | 2025-06-10T14:09:43Z |
| error_message | |
+-------------------+----------------------------------------------+
Example 2 — Test failover with TCP and HTTP health checks
Validate that your web tier is actually serving traffic before cleanup runs.
openstack dr protection-group test-failover prod-web-app \
--network-mapping a1b2c3d4-net-web=f5e6d7c8-net-web \
--network-mapping a1b2c3d4-net-db=f5e6d7c8-net-db \
--health-check tcp:22 \
--health-check tcp:443 \
--health-check http:/healthz \
--health-check-timeout 180
During the verifying phase, the engine checks TCP reachability on ports 22 and 443 and then polls http://<vm-ip>/healthz on each recovered VM. If all checks pass within 180 seconds, the operation moves to succeeded. If any check times out, the operation moves to failed — and cleanup still runs (unless --no-cleanup was also passed).
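The tcp:<port> semantics described above (retry the connect until it succeeds or the timeout expires) can be reproduced locally, which is useful for pre-validating a health-check target before a drill. A minimal sketch using only the standard library; this mirrors the described behaviour and is not the engine's actual implementation:

```python
import socket
import time

def tcp_check(host, port, timeout=120, interval=2):
    """Retry a TCP connect to host:port until it succeeds or timeout elapses.

    Approximates the tcp:<port> health-check semantics: returns True as soon
    as a connection is accepted, False once the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True  # port answered; check passes
        except OSError:
            time.sleep(interval)  # refused or unreachable; retry
    return False
```

Running this from the Protector engine host against a representative VM before the drill catches security-group and routing problems early, rather than during the verifying phase.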
Example 3 — Drill with post-test inspection (no auto-cleanup)
Use this pattern when you want engineers to manually verify recovered instances — for example, to check application logs or database integrity — before resources are deleted.
openstack dr protection-group test-failover prod-web-app \
--network-mapping a1b2c3d4-net-web=f5e6d7c8-net-web \
--no-cleanup
After the operation reaches succeeded, list the test instances on the secondary site:
# Switch to secondary site credentials
source ~/site-b-openrc
openstack server list --name 'test-*'
Inspect as needed, then clean up:
# Switch back to primary site credentials to issue the cleanup
source ~/site-a-openrc
openstack dr test cleanup op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a4b5c
Expected cleanup output:
Cleanup initiated for operation op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a4b5c
Status: cleaning_up
Resources to be deleted:
Instances : 3
Volumes : 5
Networks : 0 (user-supplied, not auto-created)
Monitor with: openstack dr operation show op-9f3a1b2c-...
Example 4 — Mock storage mode (lab / CI environment)
When running against the mock storage driver, volume data is synthesised from Glance images rather than from a real FlashArray snapshot. Ensure the same image exists on both sites before running:
# Verify image is present on site-b
source ~/site-b-openrc
openstack image show <image-uuid>
Then run the test failover exactly as in Example 1. The mock driver intercepts the snapshot and volume-creation phases and substitutes Glance-backed volumes transparently. The operation phases, health checks, and cleanup behaviour are identical to production mode.
Example 5 — Flavor mapping for asymmetric sites
If your secondary site uses different flavor names or IDs, supply a flavor mapping alongside the network mapping:
openstack dr protection-group test-failover prod-web-app \
--network-mapping a1b2c3d4-net-web=f5e6d7c8-net-web \
--flavor-mapping <primary-flavor-id>=<secondary-flavor-id>
Unmapped flavors fall back to using the same flavor ID on the secondary site. If that ID does not exist, the creating_vms phase fails with a FlavorNotFound error.
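The fallback rule is simple enough to encode when pre-checking a drill against asymmetric sites. A sketch of the resolution logic exactly as described above (the function and exception wording are illustrative):

```python
def resolve_flavor(primary_flavor_id, flavor_map, secondary_flavor_ids):
    """Resolve which flavor the secondary site will use for a VM.

    Mapped flavors use their mapped ID; unmapped flavors fall back to the
    same ID, which must exist on the secondary site or creating_vms fails
    with FlavorNotFound."""
    target = flavor_map.get(primary_flavor_id, primary_flavor_id)
    if target not in secondary_flavor_ids:
        raise LookupError(f"FlavorNotFound: {target} missing on secondary site")
    return target
```

Run it over every flavor in the Protection Group (with secondary_flavor_ids taken from openstack flavor list on site-b) to catch mapping gaps before the creating_vms phase does.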
Issue: Operation stuck in pending for more than a few minutes
Symptom: openstack dr operation show returns status: pending and progress: 0 without advancing.
Likely cause: The protector-engine on the secondary site is not running or is not consuming the task queue.
Fix:
- On the secondary site controller, check the engine service: systemctl status protector-engine
- Review engine logs: journalctl -u protector-engine -n 100 --no-pager
- Confirm RabbitMQ is reachable from the engine host.
- Restart the engine if it is in a fault state: systemctl restart protector-engine
Issue: creating_volumes phase fails with a Cinder policy error
Symptom: Operation reaches failed at the creating_volumes step. The error_message field contains something like Policy doesn't allow volume_extension:volume_manage to be performed.
Likely cause: The volume_manage Cinder policy on the secondary site has not been updated to allow the member role.
Fix: Add the following to /etc/cinder/policy.yaml on the secondary site and reconfigure Cinder:
"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"
For Kolla-Ansible deployments:
kolla-ansible -i inventory reconfigure -t cinder
Issue: creating_vms fails with NetworkNotFound or No valid host
Symptom: Operation fails during VM boot. Error references a network UUID that does not exist on the secondary site, or Nova cannot schedule the instance.
Likely cause (network): The network mapping is incorrect or the target network UUID does not exist on the secondary site.
Fix: Verify the target network UUID on the secondary site:
source ~/site-b-openrc
openstack network list
Correct the --network-mapping argument and re-run the test failover.
Likely cause (scheduling): The secondary site does not have sufficient compute capacity, or the flavor mapped from the primary does not exist.
Fix: Confirm the flavor exists on the secondary site:
openstack flavor list # against site-b credentials
Add a --flavor-mapping argument if IDs differ, or provision additional compute capacity on the secondary site.
Issue: Health checks time out — operation moves to failed
Symptom: The operation reaches verifying and then transitions to failed after the timeout. The steps_failed field lists verifying.
Likely cause: The test VMs booted successfully but the application did not start in time, or the test network does not have a path to the health-check target port.
Fix:
- Re-run with --no-cleanup to preserve test instances after the failed verification.
- Log in to the secondary site, locate the test VMs, and check application logs.
- If the issue is network routing (TCP checks unreachable), verify that security groups on the secondary site allow the checked ports, and that the test network has connectivity from the Protector engine host.
- If the application is simply slow to start, increase the timeout: --health-check-timeout 300.
- Once the root cause is resolved, manually clean up the failed operation with openstack dr test cleanup <operation-id> before re-running.
Issue: Mock mode test failover fails with ImageNotFound on secondary site
Symptom: In mock storage mode, the creating_volumes or creating_vms phase fails with an error referencing a missing Glance image.
Likely cause: The base image used to boot VMs in the Protection Group does not exist on the secondary site. Mock volume data is synthesised from Glance images, so both sites must have the same image.
Fix: Upload the image to the secondary site with the same UUID (use --id when creating), or configure an image mapping. Verify presence before retrying:
source ~/site-b-openrc
openstack image show <image-uuid>
Issue: Cleanup does not run automatically after succeeded
Symptom: Operation reaches succeeded but stays there rather than transitioning to cleaning_up.
Likely cause: auto_cleanup = false is set in protector.conf on the secondary site, or --no-cleanup was passed.
Fix: Trigger cleanup manually:
openstack dr test cleanup <operation-id>
If you want auto-cleanup to be the default, set auto_cleanup = true in [test_failover] in protector.conf and restart protector-engine on the secondary site.
Issue: Protection Group modification blocked after a test failover completes
Symptom: After a test failover, attempting to add or remove a VM from the Protection Group returns an error like Cannot modify protection group — remote site unreachable.
Likely cause: The metadata sync between sites did not complete before or after the test operation, leaving the remote sync status in a non-SYNCED state.
Fix:
# Check sync status
openstack dr protection-group sync-status prod-web-app
# If remote site is reachable, force a sync
openstack dr protection-group sync-force prod-web-app
Once both sites report version parity and SYNCED status, Protection Group modifications will be unblocked.