Site Recovery for OpenStack
Guide

Planned Failover

Graceful failover procedure with zero or near-zero data loss


Overview

A planned failover is a graceful, operator-initiated migration of protected workloads from the primary site to the secondary (DR) site. Unlike an unplanned failover triggered by a disaster, a planned failover allows time to quiesce VMs, force a final replication sync, and bring up instances on the secondary site with zero or near-zero data loss. Use this procedure for scheduled maintenance windows, capacity migrations, or any scenario where the primary site remains reachable and you can control the timing of the cutover. Because Trilio Site Recovery treats site designations as workload-relative, the secondary site becomes the new primary after the operation completes, and the original primary becomes the standby ready for a subsequent failback.


Prerequisites

Before executing a planned failover, confirm the following:

Environment

  • Two fully operational OpenStack clouds are registered as sites in Trilio Site Recovery (protector-api and protector-engine running on each site).
  • The protectorclient OSC plugin is installed and your clouds.yaml contains valid credentials for both sites.
  • Both sites are reachable from your workstation — metadata synchronization is blocked if the peer site is unreachable, and a planned failover will refuse to proceed if either site cannot be contacted.

Protection Group readiness

  • The target Protection Group has status: active and current_primary_site pointing to the site you are failing over from.
  • Metadata is in sync across both sites. Run openstack protector protection-group sync-status <pg-name> and confirm Sync Status: IN SYNC before proceeding.
  • All Protection Group members show status: protected.

Storage

  • The Cinder volume type used by the Protection Group has replication_enabled='<is> True' and a matching replication_type property on both sites.
  • The Pure Storage FlashArray replication link between sites is healthy. Verify in the FlashArray management UI or via the replication policy validation step.
  • For async replication: the most recent replication interval has completed successfully (check openstack protector consistency-group show <pg-name> for lag).
  • For sync replication: no additional lag check is required — data is current by definition.
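The volume-type check can be scripted as part of a pre-flight runbook. A minimal sketch, assuming the sample properties string below stands in for the output of openstack volume type show &lt;type&gt; -c properties -f value (standard OSC output flags):

```shell
#!/usr/bin/env bash
# Sketch: gate on the volume type's replication properties.
# The string below is sample data standing in for the real CLI output,
# so the check is self-contained and runnable.
props="replication_enabled='<is> True', replication_type='async'"

if grep -q "replication_enabled='<is> True'" <<<"$props"; then
  echo "volume type is replication-enabled"
else
  echo "volume type is NOT replication-enabled" >&2
fi
```

Run the same check against the volume type on both sites, since the spec must match on each.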

Resource mappings

  • You have identified the network and (optionally) flavor mappings between the primary and secondary sites. Networks and flavors are site-local resources and do not replicate automatically.
  • Security groups referenced by protected VMs exist on the secondary site with equivalent rules.
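Because an incomplete network mapping causes the failover to fail late (during instance recreation), it can help to build the mapping arguments from a single lookup table. A minimal bash sketch; the network names are illustrative placeholders, not values from your sites:

```shell
#!/usr/bin/env bash
# Sketch: assemble --network-mapping arguments from one table so no
# attached network is accidentally omitted. Names are placeholders.
declare -A NET_MAP=(
  [net-primary-web]=net-secondary-web
  [net-primary-db]=net-secondary-db
)

mapping_args=()
for src in "${!NET_MAP[@]}"; do
  mapping_args+=("${src}=${NET_MAP[$src]}")
done

# The assembled pairs would be passed to the failover command.
echo "--network-mapping ${mapping_args[*]}"
```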

Permissions

  • Your OpenStack user has the member role (or higher) in the tenant that owns the Protection Group on both sites.
  • Cinder policy on the secondary site permits volume_extension:volume_manage and volume_extension:volume_unmanage for the member role. See the deployment prerequisites documentation if this has not been configured.

Installation

Trilio Site Recovery does not require a separate installation step for planned failover — the capability is built into the protector-engine service and the protectorclient CLI plugin. If you have not yet installed the CLI plugin, complete the following steps.

Step 1: Install the OSC protectorclient plugin

pip install python-protectorclient

Verify the plugin is available:

openstack protector --help

You should see protection-group, operation, site, and related command groups listed.

Step 2: Configure clouds.yaml for both sites

Ensure ~/.config/openstack/clouds.yaml contains entries for both sites. The CLI authenticates to both sites independently during a planned failover to orchestrate metadata sync.

clouds:
  site-a:
    auth:
      auth_url: http://site-a-controller:5000/v3
      project_name: production-project
      username: your-user
      password: your-password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne

  site-b:
    auth:
      auth_url: http://site-b-controller:5000/v3
      project_name: production-project
      username: your-user
      password: your-password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne

Step 3: Verify connectivity to both protector-api endpoints

# Check Site A protector API
curl http://site-a-controller:8788/

# Check Site B protector API
curl http://site-b-controller:8788/

Both requests should return an API version response. If either endpoint is unreachable, resolve the connectivity issue before proceeding — a planned failover requires both sites to be reachable throughout the operation.


Configuration

A planned failover is driven by parameters supplied at execution time rather than persistent configuration. The following options control its behavior:

--network-mapping <primary-net-uuid>=<secondary-net-uuid> Maps one or more networks on the primary site to their equivalents on the secondary site. This is required whenever your VMs are attached to tenant networks, which is almost always the case. You can supply this flag multiple times or as a space-separated list of key=value pairs. Without a mapping, the engine cannot attach NICs on the secondary site and the operation will fail.

--flavor-mapping <primary-flavor-id>=<secondary-flavor-id> Maps compute flavors between sites. This is optional — if omitted, the engine attempts to use the same flavor ID on the secondary site. Supply this flag when flavor IDs or names differ between sites, or when you want to rightsize instances during the migration.

--force false (default) When false, the engine performs pre-flight validation (site reachability, sync status, member health) and aborts if any check fails. Set to true only to bypass pre-flight checks in exceptional circumstances. For a planned failover, always use the default (false) — the checks exist specifically to prevent data loss.

retain_primary (API field, default false) When false (the default for a planned failover), the engine shuts down VMs on the primary site before activating them on the secondary site. This is what distinguishes a planned failover from a test failover, where retain_primary is true. Do not set this to true for a production planned failover — doing so results in the same VM running on both sites simultaneously.

Replication interval and RPO (set on the replication policy) For async replication, the replication_interval (in seconds) and rpo_minutes values on the replication policy determine how much data can be in-flight at the moment the failover is initiated. A planned failover triggers a forced final sync before cutover, so the effective RPO at failover time is the lag accumulated since the last successful replication cycle, which approaches zero when both sites are healthy. These values are configured when you create the replication policy and are not changed at failover time.


Usage

The planned failover workflow consists of four stages: pre-flight validation, initiating the failover, monitoring progress, and post-failover verification. Work through each stage in order.

Stage 1: Pre-flight validation

Confirm the Protection Group is healthy and metadata is synchronized before touching anything:

# Check protection group status
openstack protector protection-group show prod-web-app

# Check metadata sync status
openstack protector protection-group sync-status prod-web-app

The sync-status output must show Sync Status: IN SYNC and matching version numbers on both sites. If the status shows OUT OF SYNC or FAILED, resolve the sync issue first:

# Force a metadata sync to the secondary site
openstack protector protection-group sync-force prod-web-app

Re-run sync-status after the force sync completes and confirm both sites are at the same version before continuing.
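In a runbook script, the sync check can be turned into a hard gate. A sketch, with a sample string standing in for the real sync-status output so the logic is self-contained:

```shell
#!/usr/bin/env bash
# Sketch: refuse to continue unless sync-status reports IN SYNC.
# Sample data stands in for:
#   openstack protector protection-group sync-status prod-web-app
sync_output="Sync Status: IN SYNC"

if grep -q "IN SYNC" <<<"$sync_output"; then
  echo "pre-flight sync check passed"
else
  echo "metadata out of sync; run sync-force and re-check" >&2
  exit 1
fi
```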

Stage 2: Initiate the planned failover

Authenticate against the primary site (the site where workloads are currently running) and submit the failover action. The CLI coordinates with both sites on your behalf.

export OS_CLOUD=site-a

openstack protector protection-group failover prod-web-app \
  --network-mapping \
    net-primary-web=net-secondary-web \
    net-primary-db=net-secondary-db \
  --flavor-mapping \
    m1.large=m2.large

The command returns an operation ID immediately. The actual failover runs asynchronously in the protector-engine.

Stage 3: Monitor progress

# Poll the operation until it reaches 'completed' or 'failed'
watch openstack protector operation show <operation-id>

# Or list all recent operations
openstack protector operation list

The operation progresses through four phases, reflected in the progress field (0–100) and steps_completed list:

  1. Preparation (0–20%) — Validates both sites, sets Protection Group status to failing_over, records the DR operation.
  2. Storage failover (20–60%) — Identifies the latest replicated snapshot on FlashArray B, creates Cinder volumes from that snapshot on the secondary site, and imports them via the Cinder manage API.
  3. Instance recreation (60–90%) — Shuts down VMs on the primary site, then recreates each instance on the secondary site using stored metadata (flavor, keypair, security groups), the network mappings you supplied, and the volumes imported in the previous phase.
  4. Finalization (90–100%) — Updates the Protection Group: status becomes failed_over, current_primary_site_id is updated to point to the secondary site, failover_count is incremented, and metadata is synced back to the original primary site.
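Instead of watching manually, a script can poll until the operation reaches a terminal state. A sketch; the simulated_polls array stands in for repeated operation show calls so the loop is runnable, and the -f value -c status flags are standard OSC output options assumed to be supported by this plugin:

```shell
#!/usr/bin/env bash
# Sketch: poll an operation until it completes or fails.
# simulated_polls stands in for successive values of:
#   openstack protector operation show "$op_id" -f value -c status
simulated_polls=(running running completed)

i=0
while :; do
  status=${simulated_polls[$i]}
  case "$status" in
    completed|failed) break ;;
  esac
  i=$((i + 1))
  # A real loop would pause between polls, e.g.: sleep 10
done

echo "terminal status: ${status}"
```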

Stage 4: Post-failover verification

Once the operation shows status: completed:

# Confirm protection group reflects the new primary
openstack protector protection-group show prod-web-app
# Expect: current_primary_site = site-b, status = failed_over

# Confirm VMs are running on the secondary site
export OS_CLOUD=site-b
openstack server list --project production-project

# Verify application connectivity
curl http://<site-b-application-endpoint>

At this point the secondary site is the active site. The original primary site is now the standby. When you are ready to return workloads to the original site, follow the failback procedure.
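The verification steps above can also be scripted as a single gate. A sketch with simulated show output standing in for the real CLI calls:

```shell
#!/usr/bin/env bash
# Sketch: scripted post-failover verification. The string below is sample
# data standing in for protection-group show output on the new primary.
pg_show="status: failed_over
current_primary_site: site-b"

if grep -q "status: failed_over" <<<"$pg_show" \
   && grep -q "current_primary_site: site-b" <<<"$pg_show"; then
  verified=yes
  echo "failover verified: site-b is now active"
fi
```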


Examples

Example 1: Simple planned failover with network mapping

Fail over a protection group from Site A to Site B, mapping one tenant network.

export OS_CLOUD=site-a

openstack protector protection-group failover prod-web-app \
  --network-mapping net-primary-web=net-secondary-web

Expected output:

+------------------------+-----------------------------------------+
| Field                  | Value                                   |
+------------------------+-----------------------------------------+
| operation_id           | op-a1b2c3d4-0000-0000-0000-000000000001 |
| operation_type         | failover                                |
| status                 | running                                 |
| progress               | 0                                       |
| source_site            | site-a                                  |
| target_site            | site-b                                  |
+------------------------+-----------------------------------------+

Example 2: Planned failover with multiple network and flavor mappings

A three-tier application with separate web and database networks, and a flavor difference between sites.

export OS_CLOUD=site-a

openstack protector protection-group failover prod-three-tier-app \
  --network-mapping \
    net-primary-web=net-secondary-web \
    net-primary-db=net-secondary-db \
    net-primary-mgmt=net-secondary-mgmt \
  --flavor-mapping \
    m1.large=m2.large \
    m1.xlarge=m2.xlarge

Expected output:

+------------------------+-----------------------------------------+
| Field                  | Value                                   |
+------------------------+-----------------------------------------+
| operation_id           | op-b2c3d4e5-0000-0000-0000-000000000002 |
| operation_type         | failover                                |
| status                 | running                                 |
| progress               | 0                                       |
+------------------------+-----------------------------------------+

Example 3: Monitoring a failover operation to completion

openstack protector operation show op-a1b2c3d4-0000-0000-0000-000000000001

Output at ~50% through storage failover phase:

+------------------------+----------------------------------------------+
| Field                  | Value                                        |
+------------------------+----------------------------------------------+
| id                     | op-a1b2c3d4-0000-0000-0000-000000000001      |
| operation_type         | failover                                     |
| status                 | running                                      |
| progress               | 45                                           |
| steps_completed        | ["validate_sites", "create_dr_record",       |
|                        | "identify_snapshot", "create_volumes_site_b"]|
| steps_failed           | []                                           |
| started_at             | 2025-06-01T09:00:00Z                         |
| completed_at           | null                                         |
| error_message          | null                                         |
+------------------------+----------------------------------------------+

Output when complete:

+------------------------+----------------------------------------------+
| Field                  | Value                                        |
+------------------------+----------------------------------------------+
| id                     | op-a1b2c3d4-0000-0000-0000-000000000001      |
| operation_type         | failover                                     |
| status                 | completed                                    |
| progress               | 100                                          |
| started_at             | 2025-06-01T09:00:00Z                         |
| completed_at           | 2025-06-01T09:04:23Z                         |
| error_message          | null                                         |
+------------------------+----------------------------------------------+

Example 4: Verifying protection group state after failover

export OS_CLOUD=site-b
openstack protector protection-group show prod-web-app

Expected output:

+---------------------------+--------------------------------------+
| Field                     | Value                                |
+---------------------------+--------------------------------------+
| id                        | pg-12345678-1234-1234-1234-12345678  |
| name                      | prod-web-app                         |
| status                    | failed_over                          |
| current_primary_site      | site-b                               |
| primary_site              | site-a                               |
| secondary_site            | site-b                               |
| failover_count            | 1                                    |
| last_failover_at          | 2025-06-01T09:04:23Z                 |
| replication_type          | async                                |
+---------------------------+--------------------------------------+

Example 5: Forcing a metadata sync before failover

If sync-status shows a version mismatch, resolve it before initiating failover:

export OS_CLOUD=site-a

# Check sync status
openstack protector protection-group sync-status prod-web-app
# Output shows: Remote Version: 6, Local Version: 7 — OUT OF SYNC

# Force sync to secondary site
openstack protector protection-group sync-force prod-web-app
# Output: Both sites now at version 7

# Now initiate planned failover
openstack protector protection-group failover prod-web-app \
  --network-mapping net-primary-web=net-secondary-web

Troubleshooting

Failover blocked: remote site unreachable

Symptom: The failover command returns an error such as Cannot execute failover — remote site unreachable or the operation immediately transitions to status: failed with an error message referencing site connectivity.

Likely cause: The protector-api on the secondary site is unreachable from your workstation or from the primary site's protector-engine. Because metadata sync requires both sites to be reachable, the engine refuses to proceed to prevent metadata divergence.

Fix:

  1. Verify the protector-api endpoint on the secondary site is accessible: curl http://site-b-controller:8788/.
  2. Check that port 8788 is open in your network security groups and firewall rules between the two sites.
  3. Confirm the secondary site's protector-api and protector-engine services are running: systemctl status protector-api protector-engine on the secondary controller.
  4. Once connectivity is restored, re-run the failover command.

Failover blocked: metadata out of sync

Symptom: The operation fails pre-flight with a message such as Protection group metadata is out of sync with remote site — resolve sync before proceeding.

Likely cause: A previous operation (member add/remove, policy update) partially synced or the secondary site was temporarily unreachable when a change was made. The version numbers on the two sites do not match.

Fix:

# Confirm the mismatch
openstack protector protection-group sync-status prod-web-app

# Force a sync from the current primary to the secondary
openstack protector protection-group sync-force prod-web-app

# Re-check before retrying the failover
openstack protector protection-group sync-status prod-web-app

Operation fails during storage phase: volume manage error

Symptom: The operation reaches 30–50% progress and then transitions to status: failed. The error_message field references a Cinder volume manage operation or mentions volume_extension:volume_manage.

Likely cause: The Cinder policy on the secondary site does not permit the member role to call the volume manage API. This is a deployment prerequisite that is sometimes missed.

Fix:

  1. On the secondary site, add the following to /etc/cinder/policy.yaml (or the Kolla-Ansible equivalent):
"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"
  2. Restart the Cinder API service: systemctl restart cinder-api (or kolla-ansible reconfigure -t cinder).
  3. Retry the failover operation.

Operation fails during instance recreation: network not found

Symptom: The operation reaches 60–80% progress (instance recreation phase) and fails with an error referencing a network UUID that does not exist on the secondary site.

Likely cause: The --network-mapping argument was incomplete. One or more networks attached to protected VMs were not included in the mapping, so the engine attempted to use the primary site's network UUID directly on the secondary site, where it does not exist.

Fix:

  1. List all networks attached to VMs in the Protection Group:
export OS_CLOUD=site-a
openstack protector protection-group member-list prod-web-app
# Note the network IDs from the member metadata
  2. Find the corresponding networks on the secondary site:
export OS_CLOUD=site-b
openstack network list
  3. Retry the failover with a complete mapping covering all networks.
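To avoid retrying with another incomplete mapping, compare the networks attached to members against the mapping keys first. A bash sketch with illustrative sample data in place of the member-list and network-list output:

```shell
#!/usr/bin/env bash
# Sketch: detect member networks missing from the planned mapping.
# Both lists are sample data; populate them from member-list output
# and your intended --network-mapping arguments.
member_nets=(net-primary-web net-primary-db net-primary-mgmt)
declare -A NET_MAP=(
  [net-primary-web]=net-secondary-web
  [net-primary-db]=net-secondary-db
)

missing=()
for n in "${member_nets[@]}"; do
  [[ -v NET_MAP[$n] ]] || missing+=("$n")
done

echo "unmapped networks: ${missing[*]}"
```

Any network reported here needs an entry in the mapping before the failover is retried.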

Operation completes but VMs are in ERROR state on secondary site

Symptom: The DR operation shows status: completed and progress: 100, but on the secondary site one or more instances show status: ERROR in Nova.

Likely cause: Nova accepted the boot request but encountered an error during scheduling or volume attachment. Common causes include: the target flavor does not exist on the secondary site, a security group referenced in the VM metadata does not exist on the secondary site, or the Cinder volume failed to attach (for example, due to an iSCSI connectivity issue between the compute node and the FlashArray).

Fix:

  1. Check the Nova instance fault on the secondary site:
export OS_CLOUD=site-b
openstack server show <failed-instance-id> -f json | jq '.fault'
  2. Check the Cinder volume status for volumes that should be attached:
openstack volume list --project production-project
  3. For missing flavors: create matching flavors on the secondary site and retry using --flavor-mapping.
  4. For missing security groups: create equivalent security groups on the secondary site with the same rules, then retry the failover.
  5. For volume attach failures: check iSCSI connectivity between the compute host and the FlashArray on the secondary site, then manually re-attach the volume if needed.

Failover appears to hang at a specific progress percentage

Symptom: The operation stays at the same progress value for an unusually long time (more than 10–15 minutes) without advancing or failing.

Likely cause: The protector-engine on the primary site has stalled, typically due to a Pure Storage API timeout, a Cinder operation that is stuck pending, or an RPC issue between the API and engine processes.

Fix:

  1. Check the protector-engine log on the primary site for errors or stack traces:
tail -f /var/log/protector/protector-engine.log
# Or via systemd:
journalctl -u protector-engine -f
  2. Check whether the Cinder operation is stuck:
export OS_CLOUD=site-b
openstack volume list  # Look for volumes in 'creating' or 'downloading' state
  3. Verify Pure Storage API connectivity from the primary site controller to both FlashArray management IPs.
  4. If the engine process has deadlocked, restart it: systemctl restart protector-engine. The operation will transition to status: failed and can be retried once the underlying issue is resolved.
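A simple way to detect a stall programmatically is to compare progress samples taken a few minutes apart. A sketch using hardcoded samples in place of repeated operation show calls:

```shell
#!/usr/bin/env bash
# Sketch: flag a stalled operation when progress does not change across
# several polls. The samples array stands in for repeated values of the
# operation's progress field, taken a few minutes apart.
samples=(45 45 45 45)

stalled=yes
for ((i = 1; i < ${#samples[@]}; i++)); do
  if [ "${samples[$i]}" != "${samples[0]}" ]; then
    stalled=no
  fi
done

if [ "$stalled" = yes ]; then
  echo "progress unchanged across ${#samples[@]} polls; inspect protector-engine logs"
fi
```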