Unplanned Failover
Emergency failover when primary site is unavailable
Unplanned failover is the emergency procedure you execute when your primary site becomes unavailable due to an unexpected outage: hardware failure, network partition, datacenter loss, or any condition that makes the primary site inaccessible. Unlike a planned failover, you cannot gracefully shut down workloads or perform a final replication sync before acting. This guide walks you through executing an unplanned failover from the secondary (DR) site, using the metadata and replicated storage snapshots that were synchronized before the outage occurred. Understanding the trade-offs of unplanned failover, specifically around recovery point and the sync state of Protection Group metadata, is critical before you execute this procedure in production.
Before executing an unplanned failover, verify the following conditions:
Environment
- Trilio Site Recovery (Protector) is deployed and running on both sites. Even though the primary site is unavailable, the secondary site's protector-api and protector-engine services must be healthy.
- The protectorclient OSC CLI plugin is installed on the workstation you are operating from.
- A clouds.yaml file is configured with credentials for both sites. You will authenticate to the secondary site to drive the operation.
- The secondary site's Keystone, Nova, Cinder, and Neutron endpoints are reachable from your workstation.
Protection Group state
- The Protection Group you are failing over must exist on the secondary site with synchronized metadata. Metadata is replicated to the secondary site on every change; if the last sync completed before the outage, the secondary site has a complete copy.
- The Protection Group status must not already be failing_over, failed_over, or error; resolve any pre-existing operation before proceeding.
- Replication must have been active before the outage. The Pure Storage FlashArray on the secondary site must hold at least one replicated snapshot of the Consistency Group.
Resource mappings
- You need the network UUIDs on the secondary site that correspond to each network used by the protected VMs on the primary site. Prepare your --network-mapping arguments in advance.
- If the secondary site uses different Nova flavors, prepare your --flavor-mapping arguments.
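Because you will be working under pressure during a real outage, it helps to stage the mapping flags ahead of time. The snippet below is a sketch using the placeholder UUIDs from the example later in this guide; substitute the real network IDs from openstack network list on each site.

```shell
# Pre-stage failover mapping flags (placeholder UUIDs; substitute your own).
declare -A NET_MAP=(
  ["a1b2c3d4-0000-0000-0000-net-primary-web"]="e5f6a7b8-0000-0000-0000-net-dr-web"
  ["c9d0e1f2-0000-0000-0000-net-primary-db"]="f3a4b5c6-0000-0000-0000-net-dr-db"
)
MAPPING_ARGS=()
for src in "${!NET_MAP[@]}"; do
  MAPPING_ARGS+=(--network-mapping "${src}=${NET_MAP[$src]}")
done
# Later, expand the array into the failover command:
#   openstack protector protection-group failover prod-web-app \
#     --type unplanned "${MAPPING_ARGS[@]}"
echo "${#MAPPING_ARGS[@]} mapping arguments prepared"
```

Keeping the mappings in a version-controlled script means the runbook operator only has to verify them, not reconstruct them, during an emergency.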
Cinder policy
- The Cinder policy on the secondary site must allow volume_extension:volume_manage and volume_extension:volume_unmanage for the member role. See the deployment prerequisites documentation if this has not been configured.
Awareness of unplanned failover constraints
- Because the primary site is unreachable, the metadata sync that normally accompanies every Protection Group modification cannot complete. Protector will mark the sync status as UNREACHABLE on the secondary site's local record and proceed. You must re-synchronize metadata after the primary site recovers before you can modify the Protection Group or execute failback.
- Recovery Point Objective (RPO): your workloads will be restored from the most recent replicated snapshot on the secondary FlashArray. Data written to the primary site after the last successful replication cycle will be lost.
No additional installation is required specifically for unplanned failover. The protectorclient CLI plugin and Protector services must already be deployed as part of your standard Trilio Site Recovery setup. If they are not, complete the deployment guide before proceeding.
Verify your CLI is functional and pointed at the secondary site before continuing:
# Source your secondary site credentials
source ~/site-b-openrc
# OR, if using clouds.yaml:
export OS_CLOUD=site-b
# Confirm the Protector API on the secondary site is reachable
openstack protector protection-group list
Expected output: a table listing your Protection Groups as they were last synchronized to the secondary site. If this command fails, resolve secondary site connectivity before proceeding; you cannot execute an unplanned failover if the DR site itself is unhealthy.
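If you script your runbook, the same check can act as a hard gate before any destructive step. This is a sketch, not part of the product: OS_CMD is a stand-in so the gate can be exercised without a live cloud, and real runs use the openstack client.

```shell
# Abort the runbook early if the secondary-site Protector API is unreachable.
# OS_CMD is overridable for dry runs; it defaults to the real openstack client.
preflight_secondary() {
  local cmd="${OS_CMD:-openstack}"
  if $cmd protector protection-group list >/dev/null 2>&1; then
    echo "secondary site reachable"
  else
    echo "secondary site unreachable; resolve DR-site health first" >&2
    return 1
  fi
}
```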
Unplanned failover does not require changes to protector.conf. The behavior of the operation is governed by the arguments you pass at runtime and by the Protection Group's existing replication policy. The following parameters are relevant:
--type unplanned
Passes "force": true in the underlying API action payload. This instructs the engine to skip the graceful primary-side shutdown steps and proceed directly to storage promotion on the secondary site. Without this flag, the engine attempts to contact the primary site, which will time out and fail when the primary is down.
--network-mapping <primary-net-uuid>=<secondary-net-uuid>
Required when the secondary site uses different network UUIDs (which is the case for all two-cluster deployments). You must provide a mapping for every network attached to every VM in the Protection Group. Unmapped networks will cause instance recreation to fail for the affected VMs.
Example:
--network-mapping a1b2c3d4-web=e5f6a7b8-web --network-mapping c9d0e1f2-db=f3a4b5c6-db
--flavor-mapping <primary-flavor-id>=<secondary-flavor-id>
Optional. If the secondary site has identical flavor IDs (same Keystone, same region), this mapping is not needed. For separate OpenStack clusters where flavor UUIDs differ, provide a mapping for each flavor used by protected VMs. If a flavor is not mapped and the same flavor ID does not exist on the secondary site, instance recreation will fail for that VM.
--network-mapping sourced from Protection Group resource mappings
If you have pre-configured resource mappings on the Protection Group (the recommended approach for production), the CLI uses those stored mappings by default. Runtime --network-mapping and --flavor-mapping flags override stored mappings for that execution only.
Replication policy (pre-configured)
The rpo_minutes and replication_interval values set in the Protection Group's replication policy determine how stale the latest available snapshot may be. These are read-only at failover time; they inform your recovery point, not the failover procedure itself.
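To make the recovery-point math concrete: the worst-case data loss is roughly the time between the last successful replication cycle and the moment the primary went down. The timestamps below are hypothetical; read your actual last-replication time from the consistency group status.

```shell
# Estimate the data-loss window from hypothetical timestamps (GNU date).
LAST_REPLICATION="2025-06-15T03:30:00Z"   # last successful replication cycle
OUTAGE_AT="2025-06-15T03:42:11Z"          # when the primary site went down
last_s=$(date -u -d "$LAST_REPLICATION" +%s)
outage_s=$(date -u -d "$OUTAGE_AT" +%s)
echo "Worst case: data from the last $(( (outage_s - last_s) / 60 )) minutes is lost"
```

With a 15-minute replication_interval, this window can never exceed roughly one interval plus the replication transfer time.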
Execute unplanned failover from the secondary site. You must authenticate to the secondary site's Keystone; the primary site is unavailable and cannot issue tokens.
Step 1: Authenticate to the secondary site
source ~/site-b-openrc
# OR:
export OS_CLOUD=site-b
Step 2: Confirm Protection Group metadata is available on the secondary site
openstack protector protection-group show prod-web-app
Review the output. Key fields to check before proceeding:
- status: must be active (not error, failing_over, or failed_over)
- current_primary_site: should show the primary site name, confirming the metadata reflects the pre-outage state
- remote_sync_status: may already show UNREACHABLE if the primary went down while the Protection Group was last modified; this is expected and does not block unplanned failover
Step 3: Check available replicated snapshots (optional but recommended)
Before triggering the failover, you can confirm the FlashArray on the secondary site holds replicated snapshots by reviewing the replication policy and consistency group status:
openstack protector protection-group show prod-web-app
openstack protector consistency-group show prod-web-app
Step 4: Execute the unplanned failover
openstack protector protection-group failover prod-web-app \
--type unplanned \
--network-mapping <primary-net-uuid>=<secondary-net-uuid> \
--flavor-mapping <primary-flavor-id>=<secondary-flavor-id>
The CLI returns an operation ID immediately. The failover runs asynchronously on the secondary site's protector-engine.
Step 5: Monitor operation progress
# Poll the operation until status is 'completed' or 'failed'
watch openstack protector operation show <operation-id>
# Or list all recent operations
openstack protector operation list
Progress increments through four phases (see the Examples section for expected output at each phase).
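Instead of watch, a scripted poll loop can block until a terminal state. This is a sketch, not a supplied tool: STATUS_CMD is a placeholder so the loop can be dry-run, and in production it would be something like openstack protector operation show <operation-id> -f value -c status, assuming your CLI supports the standard output formatters.

```shell
# Poll until the operation reaches a terminal state (completed or failed),
# or the attempt budget is exhausted. STATUS_CMD must print the status field.
poll_operation() {
  local attempts=0 status
  while [ "$attempts" -lt 60 ]; do
    status=$($STATUS_CMD)
    case "$status" in
      completed|failed) echo "$status"; return 0 ;;
    esac
    sleep "${POLL_INTERVAL:-10}"
    attempts=$((attempts + 1))
  done
  echo "timeout"
  return 1
}
```

Bounding the attempts keeps an automated runbook from hanging forever if the engine stalls mid-operation.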
Step 6: Verify workloads are running on the secondary site
# Confirm VMs were recreated
openstack server list
# Confirm Protection Group reflects the new active site
openstack protector protection-group show prod-web-app
# Expect: current_primary_site = site-b, status = failed_over
Step 7: After primary site recovery, re-synchronize metadata
Once the primary site is back online, you must synchronize metadata before you can modify the Protection Group or initiate failback:
# Switch to primary site credentials to verify it is reachable
export OS_CLOUD=site-a
openstack catalog list
# Switch back to secondary (active) site and force sync
export OS_CLOUD=site-b
openstack protector protection-group sync-status prod-web-app
openstack protector protection-group sync-force prod-web-app
Do not attempt failback or Protection Group modifications until sync status returns SYNCED and both sites show the same metadata version.
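If you automate the failback gate, a simple textual check of the sync-status output is enough to stop an accidental early failback. A sketch; the strings are based on the sync-status output shown in this guide and may need adjusting for your version.

```shell
# Gate failback on a clean sync state, given captured sync-status output.
check_synced() {
  local report="$1"
  if grep -q "SYNCED" <<<"$report" && ! grep -q "OUT OF SYNC" <<<"$report"; then
    echo "safe to proceed with failback"
  else
    echo "metadata not in sync; do not fail back"
  fi
}
# Usage sketch:
#   check_synced "$(openstack protector protection-group sync-status prod-web-app)"
```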
Example 1: Execute unplanned failover with network and flavor mappings
This is the standard execution path for a two-cluster deployment where Site A (primary) is down.
export OS_CLOUD=site-b
openstack protector protection-group failover prod-web-app \
--type unplanned \
--network-mapping a1b2c3d4-0000-0000-0000-net-primary-web=e5f6a7b8-0000-0000-0000-net-dr-web \
--network-mapping c9d0e1f2-0000-0000-0000-net-primary-db=f3a4b5c6-0000-0000-0000-net-dr-db \
--flavor-mapping m1.large=m1.xlarge
Expected output:
+----------------+--------------------------------------+
| Field | Value |
+----------------+--------------------------------------+
| operation_id | op-9a8b7c6d-5e4f-3a2b-1c0d-abcdef12 |
| operation_type | failover |
| status | running |
| progress | 5 |
| source_site | site-a |
| target_site | site-b |
+----------------+--------------------------------------+
Example 2: Monitor operation progress through all phases
openstack protector operation show op-9a8b7c6d-5e4f-3a2b-1c0d-abcdef12
Output during Phase 1 (Preparation, 0-20%):
+------------------+-----------------------------------------------+
| Field | Value |
+------------------+-----------------------------------------------+
| status | running |
| progress | 15 |
| steps_completed | ["validate_target_site", |
| | "create_dr_operation_record"] |
| steps_failed | [] |
| error_message | None |
+------------------+-----------------------------------------------+
Output during Phase 2 (Storage Failover, 20-60%):
+------------------+-----------------------------------------------+
| Field | Value |
+------------------+-----------------------------------------------+
| status | running |
| progress | 45 |
| steps_completed | ["validate_target_site", |
| | "create_dr_operation_record", |
| | "identify_latest_snapshot", |
| | "promote_volumes_from_snapshot", |
| | "manage_volumes_into_cinder"] |
+------------------+-----------------------------------------------+
Output during Phase 3 (Instance Recreation, 60-90%):
+------------------+-----------------------------------------------+
| Field | Value |
+------------------+-----------------------------------------------+
| status | running |
| progress | 75 |
| steps_completed | [..., "recreate_vm_web-server-1", |
| | "recreate_vm_web-server-2"] |
+------------------+-----------------------------------------------+
Output on successful completion:
+------------------+-----------------------------------------------+
| Field | Value |
+------------------+-----------------------------------------------+
| status | completed |
| progress | 100 |
| started_at | 2025-06-15T03:42:11Z |
| completed_at | 2025-06-15T03:49:38Z |
| error_message | None |
+------------------+-----------------------------------------------+
Example 3: Verify Protection Group state after failover
openstack protector protection-group show prod-web-app
Expected output:
+-------------------------+--------------------------------------+
| Field | Value |
+-------------------------+--------------------------------------+
| id | pg-12345678-1234-1234-1234-12345678 |
| name | prod-web-app |
| status | failed_over |
| current_primary_site | site-b |
| primary_site | site-a |
| secondary_site | site-b |
| failover_count | 1 |
| last_failover_at | 2025-06-15T03:49:38Z |
| remote_sync_status | UNREACHABLE |
+-------------------------+--------------------------------------+
Note that remote_sync_status is UNREACHABLE; this is expected because the primary site was down when the failover completed. This field will update to SYNCED after you run sync-force once Site A recovers.
Example 4: Re-synchronize metadata after primary site recovery
export OS_CLOUD=site-b
# Check current sync status
openstack protector protection-group sync-status prod-web-app
Sync Status: ⚠️ OUT OF SYNC
Local Metadata:
Version: 4
Current Site: site-b
Last Modified: 2025-06-15T03:49:38Z
Remote Sync:
Status: UNREACHABLE
Remote Version: 3
Last Sync: 2025-06-15T03:42:00Z
Error: Connection timeout
Action Required:
1. Check remote site connectivity
2. Force sync once remote site is available
# Push current metadata to the recovered primary site
openstack protector protection-group sync-force prod-web-app
Force Sync Initiated...
Checking remote site connectivity...
✓ site-a is reachable
Syncing metadata (version 4)...
Gathering current metadata... ✓
Calculating checksum... ✓
Pushing to site-a... ✓
Remote Site Response:
Status: success
Version: 4
Duration: 312ms
✓ Sync completed successfully
Both sites now at version 4
Issue: openstack protector protection-group show returns no results or authentication error on the secondary site
Symptom: Running openstack protector protection-group list against the secondary site returns an empty list or a 401/403 error.
Likely cause: Your shell environment is still sourced with primary site credentials, or the OS_CLOUD variable points to the wrong site.
Fix: Explicitly source the secondary site credentials before running any commands:
source ~/site-b-openrc
# or:
export OS_CLOUD=site-b
openstack catalog show protector
Confirm the Protector endpoint shown is the secondary site's endpoint.
Issue: Failover operation fails immediately with remote_sync_status: BLOCKED or a version mismatch error
Symptom: The operation status transitions to failed within seconds of submission with an error referencing metadata version conflict.
Likely cause: The Protection Group's metadata on the secondary site is at a lower version than expected, indicating a sync was in progress or had recently failed before the outage.
Fix: Inspect the sync status to understand the version gap:
openstack protector protection-group sync-status prod-web-app
If the local version on the secondary site is behind what the primary had last written, you may need to proceed with the understanding that some recent configuration changes (member additions, policy updates) are not reflected. Review the last successful sync timestamp and compare it against any recent Protection Group changes. If the metadata is sufficient to recover your workloads, resubmit the failover. If metadata is critically out of date, contact Trilio support before proceeding.
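The "is the metadata recent enough" decision can be reduced to a timestamp comparison. A sketch (the function name is ours, not part of the product): the last-sync timestamp comes from the sync-status output, and the last-change time comes from your own change records.

```shell
# Decide whether the secondary site's metadata covers your most recent
# Protection Group change (GNU date; ISO 8601 timestamps).
metadata_covers_change() {
  local last_sync="$1" last_change="$2"
  if [ "$(date -u -d "$last_sync" +%s)" -ge "$(date -u -d "$last_change" +%s)" ]; then
    echo "secondary metadata includes the last change"
  else
    echo "secondary metadata is stale; review what is missing before failing over"
  fi
}
# e.g. metadata_covers_change "2025-06-15T03:42:00Z" "2025-06-15T03:40:00Z"
```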
Issue: Failover operation reaches ~45% and then fails with a storage error
Symptom: steps_failed includes promote_volumes_from_snapshot or manage_volumes_into_cinder. The error message references the Pure Storage array or Cinder volume management.
Likely cause (storage): The FlashArray on the secondary site has no replicated snapshot for one or more volumes, or the replication connection between arrays was broken before the outage.
Likely cause (Cinder policy): The volume_extension:volume_manage permission is missing from the secondary site's Cinder policy, preventing Protector from importing volumes.
Fix (Cinder policy): Add the required policy rule to /etc/cinder/policy.yaml on the secondary site:
"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"
Then reconfigure Cinder and retry the failover.
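Before retrying, you can sanity-check that the rules actually landed in the policy file. A sketch (the function is ours, for illustration; it only checks the two rules that block volume import):

```shell
# Verify the required volume-manage rules exist in a Cinder policy.yaml.
check_cinder_policy() {
  local policy_file="${1:-/etc/cinder/policy.yaml}"
  local rule missing=0
  for rule in volume_extension:volume_manage volume_extension:volume_unmanage; do
    grep -q "\"$rule\"" "$policy_file" 2>/dev/null || { echo "missing: $rule"; missing=1; }
  done
  [ "$missing" -eq 0 ] && echo "required Cinder policy rules present"
}
```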
Fix (no snapshot): If no replicated snapshot exists on FlashArray B for a given volume, that volume, and the VM it was attached to, cannot be recovered. Review your replication policy's replication_interval and rpo_minutes settings to understand when the last successful replication cycle completed before the outage.
Issue: Instance recreation fails for one or more VMs (operation reaches ~75% and stalls or partially fails)
Symptom: steps_failed includes recreate_vm_<instance-name>. Some VMs come up on the secondary site; others do not.
Likely cause: A network UUID provided in --network-mapping does not exist on the secondary site, or a flavor UUID was not mapped and does not exist on the secondary site.
Fix: Check the operation's error_message field for the specific VM and network or flavor involved:
openstack protector operation show <operation-id>
Verify the target network and flavor exist on the secondary site:
openstack network list
openstack flavor list
Correct your mappings and re-execute the failover. Note: volumes that were already managed into Cinder during Phase 2 may need to be cleaned up manually before retrying. Check with openstack volume list on the secondary site.
Issue: After primary recovery, sync-force fails with a conflict error
Symptom: Running openstack protector protection-group sync-force prod-web-app returns an error indicating the primary site's local metadata version is newer than the secondary site's version.
Likely cause: The primary site's Protector database was partially updated before the outage (e.g., a modification was committed locally on the primary but not yet pushed to the secondary). The sites now have divergent metadata.
Fix: This is the metadata divergence scenario that unplanned failover's design aims to prevent through the strict sync requirement on modifications. Do not attempt to manually reconcile the database. Use the secondary site's metadata as the authoritative source (it reflects the post-failover state) and force-push it:
openstack protector protection-group sync-force prod-web-app --force-direction secondary-to-primary
If this flag is not available in your version of the CLI, contact Trilio support for guidance on resolving the divergence safely before executing failback.
Issue: Protection Group status is stuck at failing_over after an interrupted attempt
Symptom: A previous failover attempt was interrupted (engine restart, network loss to secondary site's API). The Protection Group status remains failing_over and a new failover submission is rejected.
Fix: Check whether a DR operation record exists in a running state and confirm the engine is not actively working:
openstack protector operation list
If the operation is confirmed dead (engine was restarted, started_at is old, no progress updates), force-reset the Protection Group status:
openstack protector protection-group reset-state prod-web-app --state active
Then resubmit the unplanned failover. Review engine logs on the secondary site before retrying to understand what partial steps completed:
journalctl -u protector-engine -n 200