Site Recovery for OpenStack
Guide

Unplanned Failover

Emergency failover when primary site is unavailable


Overview

Unplanned failover is the emergency procedure you execute when your primary site becomes unavailable due to an unexpected outage — hardware failure, network partition, datacenter loss, or any condition that makes the primary site inaccessible. Unlike a planned failover, you cannot gracefully shut down workloads or perform a final replication sync before acting. This guide walks you through executing an unplanned failover from the secondary (DR) site, using the metadata and replicated storage snapshots that were synchronized before the outage occurred. Understanding the trade-offs of unplanned failover — specifically around recovery point and the sync state of protection group metadata — is critical before you execute this procedure in production.


Prerequisites

Before executing an unplanned failover, verify the following conditions:

Environment

  • Trilio Site Recovery (Protector) is deployed and running on both sites. Even though the primary site is unavailable, the secondary site's protector-api and protector-engine services must be healthy.
  • The protectorclient OSC CLI plugin is installed on the workstation you are operating from.
  • A clouds.yaml file is configured with credentials for both sites. You will authenticate to the secondary site to drive the operation.
  • The secondary site's Keystone, Nova, Cinder, and Neutron endpoints are reachable from your workstation.

Protection Group state

  • The Protection Group you are failing over must exist on the secondary site with synchronized metadata. Metadata is replicated to the secondary site on every change; if the last sync completed before the outage, the secondary site has a complete copy.
  • The Protection Group status must not already be failing_over, failed_over, or error — resolve any pre-existing operation before proceeding.
  • Replication must have been active before the outage. The Pure Storage FlashArray on the secondary site must hold at least one replicated snapshot of the Consistency Group.

Resource mappings

  • You need the network UUIDs on the secondary site that correspond to each network used by the protected VMs on the primary site. Prepare your --network-mapping arguments in advance.
  • If the secondary site uses different Nova flavors, prepare your --flavor-mapping arguments.
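Assembling mapping flags by hand in the middle of an outage is error-prone. One way to prepare is to keep the pairs in a file maintained ahead of time and expand them at failover time. The helper below is an illustrative sketch; the pairs-file format is a local convention, not something the protectorclient CLI defines:

```shell
#!/usr/bin/env bash
# Expand a pairs file (one "primary-uuid secondary-uuid" per line) into
# repeated --network-mapping flags. The file format is a local convention
# for this sketch, not part of the protectorclient CLI.
set -euo pipefail

build_network_mapping_flags() {
  local pairs_file=$1 flags=""
  while read -r primary secondary; do
    [ -z "${primary:-}" ] && continue          # skip blank lines
    case $primary in "#"*) continue ;; esac    # skip comment lines
    flags="$flags --network-mapping ${primary}=${secondary}"
  done < "$pairs_file"
  printf '%s\n' "${flags# }"
}

# Example usage with a throwaway pairs file:
pairs_file=$(mktemp)
cat > "$pairs_file" <<'EOF'
a1b2c3d4-web e5f6a7b8-web
c9d0e1f2-db f3a4b5c6-db
EOF
build_network_mapping_flags "$pairs_file"
```

The printed flags can be spliced directly into the failover command; the same pattern works for --flavor-mapping.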

Cinder policy

  • The Cinder policy on the secondary site must allow volume_extension:volume_manage and volume_extension:volume_unmanage for the member role. See the deployment prerequisites documentation if this has not been configured.

Awareness of unplanned failover constraints

  • Because the primary site is unreachable, the metadata sync that normally accompanies every Protection Group modification cannot complete. Protector will mark the sync status as UNREACHABLE on the secondary site's local record and proceed. You must re-synchronize metadata after the primary site recovers before you can modify the Protection Group or execute failback.
  • Recovery Point Objective (RPO): your workloads will be restored from the most recent replicated snapshot on the secondary FlashArray. Data written to the primary site after the last successful replication cycle will be lost.
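To make the recovery-point trade-off concrete: the worst-case data loss is the gap between the last successful replication cycle and the moment of the outage. A quick sketch (timestamps are illustrative; assumes GNU date, as on Linux):

```shell
#!/usr/bin/env bash
# Worst-case data loss in an unplanned failover is the gap between the
# last successful replication cycle and the outage. Timestamps below are
# illustrative; in practice take them from the replication status and
# from your monitoring. Assumes GNU date.
set -euo pipefail

last_replication="2025-06-15T03:30:00Z"
outage_detected="2025-06-15T03:42:00Z"

loss_seconds=$(( $(date -ud "$outage_detected" +%s) - $(date -ud "$last_replication" +%s) ))
echo "Potential data loss window: $(( loss_seconds / 60 )) minutes"
```

Here the window is 12 minutes; in the worst case it approaches the full replication_interval plus the duration of the interrupted cycle.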

Installation

No additional installation is required specifically for unplanned failover. The protectorclient CLI plugin and Protector services must already be deployed as part of your standard Trilio Site Recovery setup. If they are not, complete the deployment guide before proceeding.

Verify your CLI is functional and pointed at the secondary site before continuing:

# Source your secondary site credentials
source ~/site-b-openrc
# OR, if using clouds.yaml:
export OS_CLOUD=site-b

# Confirm the Protector API on the secondary site is reachable
openstack protector protection-group list

Expected output: a table listing your Protection Groups as they were last synchronized to the secondary site. If this command fails, resolve secondary site connectivity before proceeding — you cannot execute an unplanned failover if the DR site itself is unhealthy.


Configuration

Unplanned failover does not require changes to protector.conf. The behavior of the operation is governed by the arguments you pass at runtime and by the Protection Group's existing replication policy. The following parameters are relevant:

--type unplanned Passes "force": true in the underlying API action payload. This instructs the engine to skip the graceful primary-side shutdown steps and proceed directly to storage promotion on the secondary site. Without this flag, the engine attempts to contact the primary site, which will time out and fail when the primary is down.

--network-mapping <primary-net-uuid>=<secondary-net-uuid> Required when the secondary site uses different network UUIDs (which is the case for all two-cluster deployments). You must provide a mapping for every network attached to every VM in the Protection Group. Unmapped networks will cause instance recreation to fail for the affected VMs.

Example:

--network-mapping a1b2c3d4-web=e5f6a7b8-web --network-mapping c9d0e1f2-db=f3a4b5c6-db

--flavor-mapping <primary-flavor-id>=<secondary-flavor-id> Optional. If the secondary site has identical flavor IDs (same Keystone, same region), this mapping is not needed. For separate OpenStack clusters where flavor UUIDs differ, provide a mapping for each flavor used by protected VMs. If a flavor is not mapped and the same flavor ID does not exist on the secondary site, instance recreation will fail for that VM.

--network-mapping sourced from Protection Group resource mappings If you have pre-configured resource mappings on the Protection Group (the recommended approach for production), the CLI uses those stored mappings by default. Runtime --network-mapping and --flavor-mapping flags override stored mappings for that execution only.

Replication policy (pre-configured) The rpo_minutes and replication_interval values set in the Protection Group's replication policy determine how stale the latest available snapshot may be. These are read-only at failover time — they inform your recovery point, not the failover procedure itself.


Usage

Execute unplanned failover from the secondary site. You must authenticate to the secondary site's Keystone — the primary site is unavailable and cannot issue tokens.

Step 1: Authenticate to the secondary site

source ~/site-b-openrc
# OR:
export OS_CLOUD=site-b

Step 2: Confirm Protection Group metadata is available on the secondary site

openstack protector protection-group show prod-web-app

Review the output. Key fields to check before proceeding:

  • status — must be active (not error, failing_over, or failed_over)
  • current_primary_site — should show the primary site name, confirming the metadata reflects the pre-outage state
  • remote_sync_status — may show UNREACHABLE already if the primary went down while the Protection Group was last modified; this is expected and does not block unplanned failover
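These checks can be scripted as a gate at the top of a failover runbook. In the sketch below the CLI call is stubbed so the gating logic stands alone; it assumes the plugin supports the standard OSC -f value -c <column> formatter flags:

```shell
#!/usr/bin/env bash
# Gate a runbook on the Protection Group status field.
# The stub stands in for:
#   openstack protector protection-group show prod-web-app -f value -c status
set -euo pipefail

pg_status() { echo "active"; }   # stub: replace with the real CLI call

check_pg_ready() {
  local status
  status=$(pg_status)
  case $status in
    active)
      echo "OK: status is active, safe to proceed" ;;
    failing_over|failed_over|error)
      echo "ABORT: status is $status, resolve the pre-existing operation first" >&2
      return 1 ;;
    *)
      echo "ABORT: unexpected status $status" >&2
      return 1 ;;
  esac
}

check_pg_ready
```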

Step 3: Check available replicated snapshots (optional but recommended)

Before triggering the failover, you can confirm the FlashArray on the secondary site holds replicated snapshots by reviewing the replication policy and consistency group status:

openstack protector protection-group show prod-web-app
openstack protector consistency-group show prod-web-app

Step 4: Execute the unplanned failover

openstack protector protection-group failover prod-web-app \
  --type unplanned \
  --network-mapping <primary-net-uuid>=<secondary-net-uuid> \
  --flavor-mapping <primary-flavor-id>=<secondary-flavor-id>

The CLI returns an operation ID immediately. The failover runs asynchronously on the secondary site's protector-engine.

Step 5: Monitor operation progress

# Poll the operation until status is 'completed' or 'failed'
watch openstack protector operation show <operation-id>

# Or list all recent operations
openstack protector operation list

Progress increments through four phases (see the Examples section for expected output at each phase).
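If you prefer a non-interactive script over watch, a minimal polling loop looks like the sketch below. The CLI call is stubbed so the sketch runs standalone; it assumes the standard OSC -f value -c status formatter flags are available:

```shell
#!/usr/bin/env bash
# Poll a failover operation until it reaches a terminal state.
# The stub below stands in for the real call:
#   openstack protector operation show "$OP_ID" -f value -c status
set -euo pipefail

count_file=$(mktemp)
echo 0 > "$count_file"

operation_status() {             # stub: reports running twice, then completed
  local n
  n=$(( $(cat "$count_file") + 1 ))
  echo "$n" > "$count_file"
  if [ "$n" -lt 3 ]; then echo running; else echo completed; fi
}

wait_for_operation() {
  local status
  while true; do
    status=$(operation_status)
    echo "status=$status"
    case $status in
      completed) return 0 ;;
      failed)    return 1 ;;
    esac
    sleep 1                      # use a longer interval (e.g. 30s) in practice
  done
}

wait_for_operation
```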

Step 6: Verify workloads are running on the secondary site

# Confirm VMs were recreated
openstack server list

# Confirm Protection Group reflects the new active site
openstack protector protection-group show prod-web-app
# Expect: current_primary_site = site-b, status = failed_over

Step 7: After primary site recovery — re-synchronize metadata

Once the primary site is back online, you must synchronize metadata before you can modify the Protection Group or initiate failback:

# Switch to primary site credentials to verify it is reachable
export OS_CLOUD=site-a
openstack catalog list

# Switch back to secondary (active) site and force sync
export OS_CLOUD=site-b
openstack protector protection-group sync-status prod-web-app
openstack protector protection-group sync-force prod-web-app

Do not attempt failback or Protection Group modifications until sync status returns SYNCED and both sites show the same metadata version.
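The version comparison can be automated by parsing the sync-status output. The stub below mirrors the field layout shown in Example 4; treat that layout as an assumption to verify against your CLI version:

```shell
#!/usr/bin/env bash
# Compare local and remote metadata versions before attempting failback.
# The stub mirrors the sync-status output shown in Example 4; verify the
# field layout against your CLI version before relying on it.
set -euo pipefail

sync_status_output() {           # stub: replace with the real CLI call
  cat <<'EOF'
Local Metadata:
  Version: 4
Remote Sync:
  Status: SYNCED
  Remote Version: 4
EOF
}

versions_match() {
  local out local_v remote_v
  out=$(sync_status_output)
  local_v=$(printf '%s\n' "$out" | awk '/^  Version:/ {print $2; exit}')
  remote_v=$(printf '%s\n' "$out" | awk '/Remote Version:/ {print $3; exit}')
  if [ "$local_v" = "$remote_v" ]; then
    echo "versions match ($local_v), failback may proceed"
  else
    echo "version mismatch (local $local_v, remote $remote_v), do not fail back" >&2
    return 1
  fi
}

versions_match
```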


Examples

Example 1: Execute unplanned failover with network and flavor mappings

This is the standard execution path for a two-cluster deployment where Site A (primary) is down.

export OS_CLOUD=site-b

openstack protector protection-group failover prod-web-app \
  --type unplanned \
  --network-mapping a1b2c3d4-0000-0000-0000-net-primary-web=e5f6a7b8-0000-0000-0000-net-dr-web \
  --network-mapping c9d0e1f2-0000-0000-0000-net-primary-db=f3a4b5c6-0000-0000-0000-net-dr-db \
  --flavor-mapping m1.large=m1.xlarge

Expected output:

+----------------+--------------------------------------+
| Field          | Value                                |
+----------------+--------------------------------------+
| operation_id   | op-9a8b7c6d-5e4f-3a2b-1c0d-abcdef12 |
| operation_type | failover                             |
| status         | running                              |
| progress       | 5                                    |
| source_site    | site-a                               |
| target_site    | site-b                               |
+----------------+--------------------------------------+

Example 2: Monitor operation progress through all phases

openstack protector operation show op-9a8b7c6d-5e4f-3a2b-1c0d-abcdef12

Output during Phase 1 — Preparation (0–20%):

+------------------+-----------------------------------------------+
| Field            | Value                                         |
+------------------+-----------------------------------------------+
| status           | running                                       |
| progress         | 15                                            |
| steps_completed  | ["validate_target_site",                      |
|                  |  "create_dr_operation_record"]                |
| steps_failed     | []                                            |
| error_message    | None                                          |
+------------------+-----------------------------------------------+

Output during Phase 2 — Storage Failover (20–60%):

+------------------+-----------------------------------------------+
| Field            | Value                                         |
+------------------+-----------------------------------------------+
| status           | running                                       |
| progress         | 45                                            |
| steps_completed  | ["validate_target_site",                      |
|                  |  "create_dr_operation_record",                |
|                  |  "identify_latest_snapshot",                  |
|                  |  "promote_volumes_from_snapshot",             |
|                  |  "manage_volumes_into_cinder"]                |
+------------------+-----------------------------------------------+

Output during Phase 3 — Instance Recreation (60–90%):

+------------------+-----------------------------------------------+
| Field            | Value                                         |
+------------------+-----------------------------------------------+
| status           | running                                       |
| progress         | 75                                            |
| steps_completed  | [..., "recreate_vm_web-server-1",             |
|                  |  "recreate_vm_web-server-2"]                  |
+------------------+-----------------------------------------------+

Output on successful completion:

+------------------+-----------------------------------------------+
| Field            | Value                                         |
+------------------+-----------------------------------------------+
| status           | completed                                     |
| progress         | 100                                           |
| started_at       | 2025-06-15T03:42:11Z                          |
| completed_at     | 2025-06-15T03:49:38Z                          |
| error_message    | None                                          |
+------------------+-----------------------------------------------+

Example 3: Verify Protection Group state after failover

openstack protector protection-group show prod-web-app

Expected output:

+-------------------------+--------------------------------------+
| Field                   | Value                                |
+-------------------------+--------------------------------------+
| id                      | pg-12345678-1234-1234-1234-12345678 |
| name                    | prod-web-app                         |
| status                  | failed_over                          |
| current_primary_site    | site-b                               |
| primary_site            | site-a                               |
| secondary_site          | site-b                               |
| failover_count          | 1                                    |
| last_failover_at        | 2025-06-15T03:49:38Z                 |
| remote_sync_status      | UNREACHABLE                          |
+-------------------------+--------------------------------------+

Note that remote_sync_status is UNREACHABLE — this is expected because the primary site was down when the failover completed. This field will update to SYNCED after you run sync-force once Site A recovers.


Example 4: Re-synchronize metadata after primary site recovery

export OS_CLOUD=site-b

# Check current sync status
openstack protector protection-group sync-status prod-web-app
Sync Status: ⚠️  OUT OF SYNC

Local Metadata:
  Version: 4
  Current Site: site-b
  Last Modified: 2025-06-15T03:49:38Z

Remote Sync:
  Status: UNREACHABLE
  Remote Version: 3
  Last Sync: 2025-06-15T03:42:00Z
  Error: Connection timeout

Action Required:
  1. Check remote site connectivity
  2. Force sync once remote site is available

# Push current metadata to the recovered primary site
openstack protector protection-group sync-force prod-web-app
Force Sync Initiated...

Checking remote site connectivity...
  ✅ site-a is reachable

Syncing metadata (version 4)...
  Gathering current metadata... ✓
  Calculating checksum... ✓
  Pushing to site-a... ✓

Remote Site Response:
  Status: success
  Version: 4
  Duration: 312ms

✅ Sync completed successfully
Both sites now at version 4

Troubleshooting

Issue: openstack protector protection-group list returns no results or authentication error on the secondary site

Symptom: Running openstack protector protection-group list against the secondary site returns an empty list or a 401/403 error.

Likely cause: Your shell environment is still sourced with primary site credentials, or the OS_CLOUD variable points to the wrong site.

Fix: Explicitly source the secondary site credentials before running any commands:

source ~/site-b-openrc
# or:
export OS_CLOUD=site-b
openstack catalog show protector

Confirm the Protector endpoint shown is the secondary site's endpoint.


Issue: Failover operation fails immediately with remote_sync_status: BLOCKED or a version mismatch error

Symptom: The operation status transitions to failed within seconds of submission with an error referencing metadata version conflict.

Likely cause: The Protection Group's metadata on the secondary site is at a lower version than expected, indicating a sync was in progress or had recently failed before the outage.

Fix: Inspect the sync status to understand the version gap:

openstack protector protection-group sync-status prod-web-app

If the local version on the secondary site is behind what the primary had last written, you may need to proceed with the understanding that some recent configuration changes (member additions, policy updates) are not reflected. Review the last successful sync timestamp and compare it against any recent Protection Group changes. If the metadata is sufficient to recover your workloads, resubmit the failover. If metadata is critically out of date, contact Trilio support before proceeding.


Issue: Failover operation reaches ~45% and then fails with a storage error

Symptom: steps_failed includes promote_volumes_from_snapshot or manage_volumes_into_cinder. The error message references the Pure Storage array or Cinder volume management.

Likely cause (storage): The FlashArray on the secondary site has no replicated snapshot for one or more volumes, or the replication connection between arrays was broken before the outage.

Likely cause (Cinder policy): The volume_extension:volume_manage permission is missing from the secondary site's Cinder policy, preventing Protector from importing volumes.

Fix (Cinder policy): Add the required policy rule to /etc/cinder/policy.yaml on the secondary site:

"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"

Then reconfigure Cinder and retry the failover.

Fix (no snapshot): If no replicated snapshot exists on FlashArray B for a given volume, that volume — and the VM it was attached to — cannot be recovered. Review your replication policy's replication_interval and rpo_minutes settings to understand when the last successful replication cycle completed before the outage.


Issue: Instance recreation fails for one or more VMs (operation reaches ~75% and stalls or partially fails)

Symptom: steps_failed includes recreate_vm_<instance-name>. Some VMs come up on the secondary site; others do not.

Likely cause: A network UUID provided in --network-mapping does not exist on the secondary site, or a flavor UUID was not mapped and does not exist on the secondary site.

Fix: Check the operation's error_message field for the specific VM and network or flavor involved:

openstack protector operation show <operation-id>

Verify the target network and flavor exist on the secondary site:

openstack network list
openstack flavor list

Correct your mappings and re-execute the failover. Note: volumes that were already managed into Cinder during Phase 2 may need to be cleaned up manually before retrying. Check with openstack volume list on the secondary site.
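A starting point for finding leftover volumes is to flag anything in an available or error state after the failed attempt. The sketch below stubs the volume listing (names and statuses are illustrative); confirm every candidate manually before deleting anything:

```shell
#!/usr/bin/env bash
# Flag volumes that may have been left behind by an interrupted Phase 2.
# The stub stands in for:
#   openstack volume list -f value -c ID -c Name -c Status
# Names and statuses are illustrative; confirm each candidate manually.
set -euo pipefail

volume_list() {                  # stub: replace with the real CLI call
  cat <<'EOF'
vol-aaaa web-server-1-root in-use
vol-bbbb web-server-2-root available
vol-cccc db-server-1-root error
EOF
}

list_orphan_candidates() {
  volume_list | awk '$3 == "available" || $3 == "error"'
}

list_orphan_candidates
```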


Issue: After primary recovery, sync-force fails with a conflict error

Symptom: Running openstack protector protection-group sync-force prod-web-app returns an error indicating the primary site's local metadata version is newer than the secondary site's version.

Likely cause: The primary site's Protector database was partially updated before the outage (e.g., a modification was committed locally on the primary but not yet pushed to the secondary). The sites now have divergent metadata.

Fix: This is the metadata divergence scenario that unplanned failover's design aims to prevent through the strict sync requirement on modifications. Do not attempt to manually reconcile the database. Use the secondary site's metadata as the authoritative source (it reflects the post-failover state) and force-push it:

openstack protector protection-group sync-force prod-web-app --force-direction secondary-to-primary

If this flag is not available in your version of the CLI, contact Trilio support for guidance on resolving the divergence safely before executing failback.


Issue: Protection Group status is stuck at failing_over after an interrupted attempt

Symptom: A previous failover attempt was interrupted (engine restart, network loss to secondary site's API). The Protection Group status remains failing_over and a new failover submission is rejected.

Fix: Check whether a DR operation record exists in a running state and confirm the engine is not actively working:

openstack protector operation list

If the operation is confirmed dead (engine was restarted, started_at is old, no progress updates), force-reset the Protection Group status:

openstack protector protection-group reset-state prod-web-app --state active

Then resubmit the unplanned failover. Review engine logs on the secondary site before retrying to understand what partial steps completed:

journalctl -u protector-engine -n 200