Site Recovery for OpenStack
Guide

Test Failover

Non-disruptive DR validation using snapshot-backed VMs on the secondary site


Overview

Test failover lets you validate your disaster recovery readiness without affecting production workloads. The operation boots your protected VMs on the secondary site using snapshot-backed volumes cloned from the latest replication snapshot, runs configurable health checks against them, and then tears everything down — leaving primary-site workloads untouched throughout. Use test failover as a regular DR drill to confirm that your resource mappings, network topology, health check thresholds, and RTO estimates are accurate before you ever need to execute a real failover.


Prerequisites

Before running a test failover, confirm the following:

  • Two registered and reachable sites: Both your primary and secondary sites must be registered in Trilio Site Recovery and reporting status: active. Run openstack dr site list to verify.
  • Protection Group in active status: The Protection Group you intend to test must have at least one VM member and must not currently be in a failover, failback, or error state.
  • Replication policy configured: A replication policy must be attached to the Protection Group, and at least one replicated snapshot must exist on the secondary site. Run openstack dr protection-group show <pg> and confirm last_replication_at is populated.
  • Volume types with replication enabled: All volumes in the Protection Group's Consistency Group must use a Cinder volume type with replication_enabled='<is> True'. Volumes on non-replicating types will block the operation.
  • Resource mappings prepared: Network and (optionally) flavor mappings for the secondary site must be available. You can supply them at invocation time or pre-configure them on the Protection Group.
  • Glance image parity (mock storage mode only): If you are using the mock storage driver, every image used to boot VMs in the Protection Group must also exist on the secondary site with the same UUID or a configured image mapping. The mock driver simulates volume data from Glance images rather than from a real FlashArray snapshot.
  • Cinder policy updated on secondary site: The volume_extension:volume_manage policy must allow the member role on the secondary site so that the service can import snapshot-backed volumes into Cinder. See the deployment prerequisites guide if this has not been configured.
  • OSC plugin installed: The protectorclient OpenStack CLI plugin must be installed in your environment and openstack dr commands must resolve correctly.
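The volume-type prerequisite can be checked mechanically. Below is a minimal sketch: the extra_specs string is a hard-coded sample, and in practice you would populate it from the properties field of `openstack volume type show <type>` for each type in the Consistency Group.

```shell
#!/bin/sh
# Sketch: decide whether a volume type's extra specs enable replication.
# The specs string is a sample; in practice, populate it from the
# `properties` field of `openstack volume type show <type>`.
specs="replication_enabled='<is> True', volume_backend_name='array-1'"

case "$specs" in
  *"replication_enabled='<is> True'"*) result=enabled ;;
  *) result=disabled ;;
esac
echo "replication: $result"
```

Any type reporting disabled must be migrated to a replication-enabled type before the test failover will proceed.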

Installation

The test failover capability is part of the core Trilio Site Recovery service and requires no additional installation beyond the base Protector deployment. If you have not yet deployed the service, follow the main deployment guide first.

To confirm your environment is ready to execute test failovers:

Step 1 — Verify the CLI plugin is available

openstack dr --help

You should see test-failover and test cleanup among the available actions. If either is missing, reinstall the protectorclient plugin:

pip install openstack-protectorclient

Step 2 — Confirm both sites are reachable

openstack dr site list

Expected output:

+--------+-----------+--------+----------------------------+
| Name   | Type      | Status | Auth URL                   |
+--------+-----------+--------+----------------------------+
| site-a | primary   | active | http://site-a-ctrl:5000/v3 |
| site-b | secondary | active | http://site-b-ctrl:5000/v3 |
+--------+-----------+--------+----------------------------+

Both sites must show active before proceeding.

Step 3 — Confirm at least one replication snapshot exists

openstack dr protection-group show <pg-name-or-id>

Check that last_replication_at is set and that replication_status is not error. If no snapshot exists yet, wait for the next replication cycle or force one:

openstack dr consistency-group sync <pg-name-or-id>
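Steps 1–3 can be wrapped into a pre-drill guard. This is a sketch: both values are stubbed samples, and it assumes the plugin supports the standard OpenStackClient -f value -c <column> output flags for filling them in from a real deployment.

```shell
#!/bin/sh
# Sketch: refuse to start a drill without a replication snapshot.
# Both values are stubbed samples; in practice, fill them in with e.g.:
#   openstack dr protection-group show <pg> -f value -c last_replication_at
#   openstack dr protection-group show <pg> -f value -c replication_status
last_replication_at="2025-06-10T13:55:00Z"
replication_status="enabled"

if [ -z "$last_replication_at" ] || [ "$replication_status" = "error" ]; then
  ready=no
  echo "not ready: force a cycle with 'openstack dr consistency-group sync <pg>'"
else
  ready=yes
  echo "ready: last replication at $last_replication_at"
fi
```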

Configuration

Test failover behaviour is controlled by flags passed at invocation time and by a small set of options in protector.conf. The table below covers the settings most relevant to test failover.

CLI flags (per-invocation)

  • --network-mapping <src>=<dst> (required): Maps each primary-site network UUID to a secondary-site network UUID. Repeat the flag for each network.
  • --flavor-mapping <src>=<dst> (default: none; the same flavor ID is reused): Maps primary flavor IDs to secondary flavor IDs. Omit if secondary flavors match by ID.
  • --auto-network (default: off): Instructs the service to create a temporary isolated network on the secondary site for test VMs. Use this for quick drills when you do not want to pre-configure a mapping. Note: requires the create_network:shared policy on the secondary Neutron; see the deployment prerequisites guide.
  • --no-cleanup (default: off; cleanup runs automatically): Leaves test VMs, volumes, and networks in place after verification completes. Useful when you want to inspect test instances manually before tearing them down. Cleanup must then be triggered manually.
  • --health-check tcp:<port> (default: none): Adds a TCP connectivity check against the specified port on each recovered VM. Repeat for multiple ports.
  • --health-check http:<path> (default: none): Adds an HTTP GET check. The service polls the path on each VM until it returns 2xx or the timeout expires.
  • --health-check exec:<command> (default: none): Runs an arbitrary command inside each test VM via the Nova console API. The VM must support a guest agent for exec checks.
  • --health-check-timeout <seconds> (default: 120): Maximum time in seconds to wait for all health checks to pass before marking the operation failed.
  • --test-project <project-id> (default: the Protection Group's own project): Boots test VMs into a different project for isolation. The service account must have the member role in the target project.

protector.conf options

The following options in the [test_failover] section of protector.conf set service-wide defaults. Per-invocation flags override them.

[test_failover]
# Automatically clean up test resources after verification (true/false).
# Set to false during initial DR programme setup if you want engineers to
# inspect recovered instances before cleanup runs.
auto_cleanup = true

# Default health check timeout in seconds.
health_check_timeout = 120

# Maximum number of concurrent test failover operations across all
# protection groups. Prevents secondary-site resource exhaustion during
# large-scale drills.
max_concurrent_tests = 3
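The precedence rule (per-invocation flags override protector.conf defaults) can be sketched as follows; the values are illustrative, not from a real deployment.

```shell
#!/bin/sh
# Sketch of flag-over-config precedence for the health check timeout.
conf_timeout=120    # health_check_timeout from [test_failover] in protector.conf
flag_timeout=180    # value passed via --health-check-timeout (empty if omitted)

# Use the flag when supplied, otherwise fall back to the configured default.
effective=${flag_timeout:-$conf_timeout}
echo "effective health check timeout: ${effective}s"
```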

Operation phase transitions

Understanding the phase sequence helps you interpret progress output and diagnose failures:

  • pending: Operation accepted; the engine is picking it up.
  • creating_snapshots: The engine is requesting a point-in-time snapshot of the Consistency Group on the secondary FlashArray (or mock backend).
  • creating_volumes: Cinder volumes are being created from the snapshot on the secondary site using the volume manage API.
  • creating_vms: Nova instances are being booted on the secondary site using the cloned volumes and mapped networks/flavors.
  • verifying: Configured health checks are running against the test VMs.
  • succeeded: All checks passed. Cleanup will run immediately unless --no-cleanup or auto_cleanup = false.
  • failed: One or more checks did not pass within the timeout, or a phase error occurred. Cleanup still runs (unless disabled).
  • cleaning_up: Test VMs, snapshot-backed volumes, and any auto-created networks are being deleted.
  • cleaned: All test resources removed. Terminal state.
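A drill script can poll until a terminal phase is reached. In this sketch, get_status is a stub that simply walks the phase sequence; in a real drill you would replace its body with a call such as `openstack dr operation show "$OP_ID" -f value -c status` (assuming the plugin supports the standard -f/-c output flags).

```shell
#!/bin/sh
# Sketch: poll an operation until it hits a terminal phase (cleaned or failed).
# get_status is a stub that walks a plausible phase sequence; replace it with
# a real `openstack dr operation show` call when polling an actual operation.
i=0
get_status() {
  set -- pending creating_snapshots creating_volumes creating_vms \
         verifying succeeded cleaning_up cleaned
  shift "$i"
  echo "$1"
}

while :; do
  status=$(get_status)
  i=$((i + 1))
  echo "phase: $status"
  case "$status" in
    cleaned|failed) break ;;
  esac
  # sleep 5   # poll interval when watching a real operation
done
```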

Usage

Initiating a test failover

Authenticate to your primary site to issue the command; the protectorclient plugin then authenticates to both sites on your behalf and coordinates the operation.

source ~/site-a-openrc

openstack dr protection-group test-failover prod-web-app \
  --network-mapping net-primary-web=net-secondary-web \
  --network-mapping net-primary-db=net-secondary-db

The CLI returns an operation ID immediately. The operation runs asynchronously.

Monitoring progress

Poll the operation until it reaches succeeded, failed, or cleaned:

watch -n 5 openstack dr operation show <operation-id>

For a one-shot status check:

openstack dr operation show <operation-id>

The progress field (0–100) and steps_completed list give you granular visibility into which phase is executing.

Inspecting test VMs before cleanup

If you want to log in to test instances or run manual checks before they are deleted, pass --no-cleanup:

openstack dr protection-group test-failover prod-web-app \
  --network-mapping net-primary-web=net-secondary-web \
  --no-cleanup

Once you have finished inspecting the instances, trigger cleanup manually:

openstack dr test cleanup <operation-id>

The operation transitions from succeeded (or failed) → cleaning_up → cleaned.

Running a drill with automatic network isolation

If your secondary site has an appropriate Neutron policy, you can skip manual network mapping and let the service create an isolated test network:

openstack dr protection-group test-failover prod-web-app \
  --auto-network

The auto-created network is scoped to the test operation and is deleted during cleanup. It has no external routing, so test VMs are reachable only from within the secondary site.

What the operation does NOT do

  • It does not stop, snapshot, or in any way interrupt VMs running on the primary site.
  • It does not promote the secondary site's replicated volumes to writable state permanently — cloned volumes are ephemeral and removed during cleanup.
  • It does not change the Protection Group's current_primary_site_id or failover count. The Protection Group status remains active throughout.

Examples

Example 1 — Basic test failover with network mapping

The simplest complete drill: map two networks, run the default verification (boot success only), and let auto-cleanup remove everything.

source ~/site-a-openrc

openstack dr protection-group test-failover prod-web-app \
  --network-mapping a1b2c3d4-net-web=f5e6d7c8-net-web \
  --network-mapping a1b2c3d4-net-db=f5e6d7c8-net-db

Expected output:

+------------------+-----------------------------------------+
| Field            | Value                                   |
+------------------+-----------------------------------------+
| operation_id     | op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a4b5c |
| operation_type   | test_failover                           |
| status           | pending                                 |
| progress         | 0                                       |
| protection_group | prod-web-app                            |
| source_site      | site-a                                  |
| target_site      | site-b                                  |
+------------------+-----------------------------------------+

Poll until complete:

openstack dr operation show op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a4b5c

Final state when successful:

+-------------------+----------------------------------------------+
| Field             | Value                                        |
+-------------------+----------------------------------------------+
| status            | cleaned                                      |
| progress          | 100                                          |
| steps_completed   | ["creating_snapshots", "creating_volumes",   |
|                   |  "creating_vms", "verifying",               |
|                   |  "cleaning_up"]                              |
| started_at        | 2025-06-10T14:02:11Z                         |
| completed_at      | 2025-06-10T14:09:43Z                         |
| error_message     |                                              |
+-------------------+----------------------------------------------+

Example 2 — Test failover with TCP and HTTP health checks

Validate that your web tier is actually serving traffic before cleanup runs.

openstack dr protection-group test-failover prod-web-app \
  --network-mapping a1b2c3d4-net-web=f5e6d7c8-net-web \
  --network-mapping a1b2c3d4-net-db=f5e6d7c8-net-db \
  --health-check tcp:22 \
  --health-check tcp:443 \
  --health-check http:/healthz \
  --health-check-timeout 180

During the verifying phase, the engine checks TCP reachability on ports 22 and 443 and then polls http://<vm-ip>/healthz on each recovered VM. If all checks pass within 180 seconds, the operation moves to succeeded. If any check times out, the operation moves to failed — and cleanup still runs (unless --no-cleanup was also passed).
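The all-checks-against-one-deadline semantics can be sketched as below. The check functions are stubs that always succeed; the real engine probes the recovered VMs over the test network.

```shell
#!/bin/sh
# Sketch of the verifying phase: every configured check must pass before a
# single shared deadline, or the whole operation is marked failed.
check_tcp_22()       { return 0; }  # stands in for a probe of port 22
check_tcp_443()      { return 0; }  # stands in for a probe of port 443
check_http_healthz() { return 0; }  # stands in for GET /healthz returning 2xx

timeout=180 elapsed=0 interval=5 result=succeeded
for check in check_tcp_22 check_tcp_443 check_http_healthz; do
  until "$check"; do
    elapsed=$((elapsed + interval))   # all checks share one elapsed clock
    if [ "$elapsed" -ge "$timeout" ]; then result=failed; break; fi
  done
  if [ "$result" = failed ]; then break; fi
done
echo "operation: $result"
```

Because the clock is shared, a slow first check eats into the budget of every later check; size --health-check-timeout for the whole set, not for one probe.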


Example 3 — Drill with post-test inspection (no auto-cleanup)

Use this pattern when you want engineers to manually verify recovered instances — for example, to check application logs or database integrity — before resources are deleted.

openstack dr protection-group test-failover prod-web-app \
  --network-mapping a1b2c3d4-net-web=f5e6d7c8-net-web \
  --no-cleanup

After the operation reaches succeeded, list the test instances on the secondary site:

# Switch to secondary site credentials
source ~/site-b-openrc
openstack server list --name 'test-*'

Inspect as needed, then clean up:

# Switch back to primary site credentials to issue the cleanup
source ~/site-a-openrc
openstack dr test cleanup op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a4b5c

Expected cleanup output:

Cleanup initiated for operation op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a4b5c
Status: cleaning_up

Resources to be deleted:
  Instances : 3
  Volumes   : 5
  Networks  : 0 (user-supplied, not auto-created)

Monitor with: openstack dr operation show op-9f3a1b2c-...

Example 4 — Mock storage mode (lab / CI environment)

When running against the mock storage driver, volume data is synthesised from Glance images rather than from a real FlashArray snapshot. Ensure the same image exists on both sites before running:

# Verify image is present on site-b
source ~/site-b-openrc
openstack image show <image-uuid>

Then run the test failover exactly as in Example 1. The mock driver intercepts the snapshot and volume-creation phases and substitutes Glance-backed volumes transparently. The operation phases, health checks, and cleanup behaviour are identical to production mode.


Example 5 — Flavor mapping for asymmetric sites

If your secondary site uses different flavor names or IDs, supply a flavor mapping alongside the network mapping:

openstack dr protection-group test-failover prod-web-app \
  --network-mapping a1b2c3d4-net-web=f5e6d7c8-net-web \
  --flavor-mapping <primary-flavor-id>=<secondary-flavor-id>

Unmapped flavors fall back to using the same flavor ID on the secondary site. If that ID does not exist, the creating_vms phase fails with a FlavorNotFound error.
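The lookup-with-fallback behaviour described above can be sketched as follows; the mapping pairs are illustrative, not from a real deployment.

```shell
#!/bin/sh
# Sketch: flavor-mapping lookup that falls through to the same ID when no
# mapping matches. m1.large is deliberately absent from the mapping.
mappings="m1.small=s1.small m1.medium=s1.medium"
primary_flavor="m1.large"

target_flavor=$primary_flavor          # default: reuse the same flavor ID
for pair in $mappings; do
  case "$pair" in
    "$primary_flavor"=*) target_flavor=${pair#*=} ;;
  esac
done
echo "boot with flavor: $target_flavor"
# If the fallen-through ID does not exist on the secondary site, the
# creating_vms phase fails with FlavorNotFound.
```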


Troubleshooting

Issue: Operation stuck in pending for more than a few minutes

Symptom: openstack dr operation show returns status: pending and progress: 0 without advancing.

Likely cause: The protector-engine on the secondary site is not running or is not consuming the task queue.

Fix:

  1. On the secondary site controller, check the engine service: systemctl status protector-engine
  2. Review engine logs: journalctl -u protector-engine -n 100 --no-pager
  3. Confirm RabbitMQ is reachable from the engine host.
  4. Restart the engine if it is in a fault state: systemctl restart protector-engine

Issue: creating_volumes phase fails with a Cinder policy error

Symptom: Operation reaches failed at the creating_volumes step. The error_message field contains something like Policy doesn't allow volume_extension:volume_manage to be performed.

Likely cause: The volume_manage Cinder policy on the secondary site has not been updated to allow the member role.

Fix: Add the following to /etc/cinder/policy.yaml on the secondary site and reconfigure Cinder:

"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"

For Kolla-Ansible deployments:

kolla-ansible -i inventory reconfigure -t cinder

Issue: creating_vms fails with NetworkNotFound or No valid host

Symptom: Operation fails during VM boot. Error references a network UUID that does not exist on the secondary site, or Nova cannot schedule the instance.

Likely cause (network): The network mapping is incorrect or the target network UUID does not exist on the secondary site.

Fix: Verify the target network UUID on the secondary site:

source ~/site-b-openrc
openstack network list

Correct the --network-mapping argument and re-run the test failover.

Likely cause (scheduling): The secondary site does not have sufficient compute capacity, or the flavor mapped from the primary does not exist.

Fix: Confirm the flavor exists on the secondary site:

openstack flavor list  # against site-b credentials

Add a --flavor-mapping argument if IDs differ, or provision additional compute capacity on the secondary site.


Issue: Health checks time out — operation moves to failed

Symptom: The operation reaches verifying and then transitions to failed after the timeout. The steps_failed field lists verifying.

Likely cause: The test VMs booted successfully but the application did not start in time, or the test network does not have a path to the health-check target port.

Fix:

  1. Re-run with --no-cleanup to preserve test instances after the failed verification.
  2. Log in to the secondary site, locate the test VMs, and check application logs.
  3. If the issue is network routing (TCP checks unreachable), verify that security groups on the secondary site allow the checked ports, and that the test network has connectivity from the Protector engine host.
  4. If the application is simply slow to start, increase the timeout: --health-check-timeout 300.
  5. Once root cause is resolved, manually clean up the failed operation: openstack dr test cleanup <operation-id> before re-running.

Issue: Mock mode test failover fails with ImageNotFound on secondary site

Symptom: In mock storage mode, the creating_volumes or creating_vms phase fails with an error referencing a missing Glance image.

Likely cause: The base image used to boot VMs in the Protection Group does not exist on the secondary site. Mock volume data is synthesised from Glance images, so both sites must have the same image.

Fix: Upload the image to the secondary site with the same UUID (use --id when creating), or configure an image mapping. Verify presence before retrying:

source ~/site-b-openrc
openstack image show <image-uuid>

Issue: Cleanup does not run automatically after succeeded

Symptom: Operation reaches succeeded but stays there rather than transitioning to cleaning_up.

Likely cause: auto_cleanup = false is set in protector.conf on the secondary site, or --no-cleanup was passed.

Fix: Trigger cleanup manually:

openstack dr test cleanup <operation-id>

If you want auto-cleanup to be the default, set auto_cleanup = true in [test_failover] in protector.conf and restart protector-engine on the secondary site.


Issue: Protection Group modification blocked after a test failover completes

Symptom: After a test failover, attempting to add or remove a VM from the Protection Group returns an error like Cannot modify protection group — remote site unreachable.

Likely cause: The metadata sync between sites did not complete before or after the test operation, leaving the remote sync status in a non-SYNCED state.

Fix:

# Check sync status
openstack dr protection-group sync-status prod-web-app

# If remote site is reachable, force a sync
openstack dr protection-group sync-force prod-web-app

Once both sites report version parity and SYNCED status, Protection Group modifications will be unblocked.