Failover Issues
Stuck or failed failover operations; how to interpret steps_failed; manual recovery
This page helps you diagnose and recover from stuck or failed failover operations in Trilio Site Recovery for OpenStack. Failover operations involve coordinated steps across storage, compute, and networking layers on the target site — any one of those steps can fail independently, leaving the operation in a running or failed state with partial progress. You will learn how to interpret the steps_failed field in operation output, trace errors to their root cause in the engine log, and recover manually when automated recovery is not possible.
Before using this guide, ensure you have:
- CLI access to both the primary and secondary OpenStack sites, with valid credentials (`clouds.yaml` or `openrc`) for each
- The `protectorclient` OSC plugin installed and functional (`openstack dr operation show` must work against both sites)
- SSH or console access to the host running `protector-engine` on the target site (the site the failover is directed toward), so you can read engine logs
- Familiarity with the Protection Group and DR operation concepts described in the architecture documentation
- Sufficient Keystone permissions to inspect Nova, Cinder, and Neutron resources on the target site
- Pure Storage FlashArray credentials if you need to verify array-level replication state
No additional configuration changes are required to troubleshoot failover issues. However, the following configuration points directly affect the behavior described in this guide and are worth verifying before you begin diagnosis.
Engine log location
The protector-engine service on each site writes its log to:
/opt/openstack-protector/bin/logs/engine.log
This is the primary diagnostic artifact for any stuck or failed failover. Always check the log on the target site (the site the workloads are being moved to), because that is where the engine executes the failover steps.
Replication policy: pure_pg_name
The pure_pg_name field in the replication policy must exactly match the Protection Group name (or Pod name for sync replication) configured on the Pure Storage FlashArray. A mismatch causes promote_volumes to fail. Verify the value with:
openstack dr protection-group policy show <pg-id> -f json
Cinder volume type replication properties
Volume types used by a Protection Group must have both of the following extra specs set, or storage promotion will be rejected:
replication_enabled='<is> True'
replication_type='<in> async' # or '<in> sync'
Metadata sync requirement
Modifications to a Protection Group — including any manual recovery steps that involve updating group membership — are blocked when the peer site is unreachable. This is by design to prevent metadata divergence. If the peer site is down during an unplanned failover, you will see sync errors in the log; this is expected and does not block the failover itself, but it will block subsequent modifications until the peer recovers and you run openstack dr protection-group sync-force <pg-name>.
Reverse replication requirement for failback
After any failover (planned or unplanned), the replication direction must be explicitly reversed before failback can proceed. This is not done automatically. The replication not configured error during failback indicates this step was skipped. See the Failback fails with 'replication not configured' issue below.
The primary tool for diagnosing a failover operation is the openstack dr operation show command with JSON output, which exposes the full step-level detail of what the engine attempted:
openstack dr operation show <operation-id> -f json
The fields most relevant to troubleshooting are:
| Field | What it tells you |
|---|---|
| `status` | Overall state: `running`, `failed`, `completed`, or `rolling_back` |
| `progress` | Integer 0–100 indicating how far the operation got |
| `steps_completed` | JSON array of step names that finished successfully |
| `steps_failed` | JSON array of step names that encountered an error |
| `error_message` | High-level error string; often a summary of the first failure |
| `result_data` | JSON blob with step-specific output, including volume and instance IDs created so far |
Use steps_failed to identify exactly where the operation stopped, then cross-reference with the engine log on the target site to get the full stack trace and upstream API error.
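When triaging many operations, the same extraction can be scripted. The sketch below parses the payload of `openstack dr operation show <id> -f json` (field names as in the table above) into a one-line summary. It is a minimal illustrative helper, not part of the plugin:

```python
import json

def summarize_operation(raw_json):
    """Condense an `openstack dr operation show <id> -f json` payload
    into a one-line triage string using the fields described above."""
    op = json.loads(raw_json)
    status = op.get("status", "unknown")
    if status == "completed":
        return "%s completed" % op.get("operation_type", "operation")
    done = op.get("steps_completed") or []
    failed = op.get("steps_failed") or []
    return (
        "status=%s progress=%s%% last_completed=%s failed=%s: %s"
        % (
            status,
            op.get("progress"),
            done[-1] if done else "none",
            failed[0] if failed else "none",
            op.get("error_message", ""),
        )
    )
```

Feed it the raw string from the CLI (e.g. via `subprocess.run(..., capture_output=True)`), then cross-reference the failed step name in the engine log as described above.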
Identifying the target site
The target_site_id field in the operation output tells you which site the engine was running on. Match that ID against your registered sites:
openstack dr operation show <operation-id> -f json | python3 -m json.tool
openstack dr site list
Then SSH to the controller node of that site and tail the engine log:
tail -n 200 /opt/openstack-protector/bin/logs/engine.log
Search for the operation ID within the log to isolate the relevant entries:
grep <operation-id> /opt/openstack-protector/bin/logs/engine.log
Retrieve full step-level detail for a failed operation
Run this immediately after observing a failed or stuck running status:
openstack dr operation show a1b2c3d4-e5f6-7890-abcd-ef1234567890 -f json
Expected output (truncated for clarity):
{
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"protection_group_id": "pg-uuid-here",
"operation_type": "failover",
"status": "failed",
"source_site_id": "site-a-uuid",
"target_site_id": "site-b-uuid",
"progress": 35,
"steps_completed": ["validate_sites", "get_latest_snapshot"],
"steps_failed": ["promote_volumes"],
"error_message": "StorageDriver error: Protection Group 'pg-prod-web-app' not found on FlashArray",
"result_data": {},
"started_at": "2025-11-03T10:00:00Z",
"completed_at": null
}
The steps_failed value promote_volumes tells you the failure occurred during the storage layer promotion, before any Nova resources were created on the target site.
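When automating triage, it can help to map the failed step name to the layer it belongs to, so alerts route to the right team. The mapping below is hypothetical, inferred only from the step names shown in this guide — your engine version may emit a different step set:

```python
# Assumed step-to-layer mapping, based on the step names that appear
# in this guide; extend it to match the steps your engine emits.
STEP_LAYER = {
    "validate_sites": "control plane",
    "get_latest_snapshot": "storage",
    "promote_volumes": "storage",
    "recreate_instances": "compute/network",
}

def failed_layer(steps_failed):
    """Return the layer of the first failed step, or None if nothing failed."""
    if not steps_failed:
        return None
    return STEP_LAYER.get(steps_failed[0], "unknown")
```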
Verify the pure_pg_name in the replication policy
After a promote_volumes failure, confirm that the Protection Group name in the policy matches what exists on the array:
openstack dr protection-group policy show prod-web-app -f json
Expected output:
{
"pure_pg_name": "pg-prod-web-app",
"primary_fa_url": "https://flasharray-a.example.com",
"secondary_fa_url": "https://flasharray-b.example.com",
"replication_interval": 300,
"rpo_minutes": 15
}
Log into FlashArray B and confirm that pg-prod-web-app exists as a Protection Group (async) or Pod (sync). If the name differs, update the policy:
openstack dr protection-group policy set prod-web-app \
--pure-pg-name <correct-name-from-array>
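The name comparison can be made mechanical. The helper below checks the policy's `pure_pg_name` against a set of group or Pod names you have retrieved from the array (how you fetch those names — API call or UI export — is out of scope here); it also flags case-only mismatches, a common typo. This is an illustrative sketch, not part of the tooling:

```python
def check_pg_name(policy, array_group_names):
    """Return None if the policy's pure_pg_name exists on the array,
    otherwise a human-readable hint about the mismatch.

    policy: parsed output of `openstack dr protection-group policy show ... -f json`
    array_group_names: set of Protection Group / Pod names from the FlashArray
    """
    name = policy.get("pure_pg_name", "")
    if name in array_group_names:
        return None
    # A case-insensitive pass often catches simple typos.
    for candidate in array_group_names:
        if candidate.lower() == name.lower():
            return "case mismatch: policy has %r, array has %r" % (name, candidate)
    return "%r not found on array; existing groups: %s" % (name, sorted(array_group_names))
```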
Check Neutron network availability on the target site before retrying
If steps_failed contains recreate_instances, verify that the mapped network exists on the target site before retrying:
# Authenticate to the target site
export OS_CLOUD=site-b
# Check the network referenced in your --network-mapping
openstack network show net-secondary-web
# Check Nova quota headroom
openstack quota show --detail
Force a metadata sync after an unplanned failover recovers the primary
After an unplanned failover where Site A was unreachable, once Site A comes back online, re-establish metadata consistency before attempting any further operations:
export OS_CLOUD=site-b
openstack dr protection-group sync-status prod-web-app
openstack dr protection-group sync-force prod-web-app
Expected output after a successful sync:
Syncing to Site A...
✅ Site A is now reachable
✅ Pushing version 4 to Site A
✅ Site A updated successfully
Sync Status: SYNCED
Both sites now at version 4
Establish reverse replication before failback
This step is required after every failover. Without it, the failback operation will fail with replication not configured.
# Authenticated to the site currently running the workloads (e.g., site-b)
export OS_CLOUD=site-b
openstack dr protection-group policy set prod-web-app \
--enable-reverse-replication
Then verify replication is active before initiating failback:
openstack dr protection-group show prod-web-app -f json | python3 -m json.tool
# Confirm: "status": "failed_over" and replication direction is site-b → site-a
Each issue below follows the same format: Symptom → Likely cause → Fix.
Operation stuck in running state
Symptom: openstack dr operation show <id> returns "status": "running" for an extended period (more than 10–15 minutes for a typical async failover) and the progress value is not advancing.
Likely cause: The protector-engine process on the target site has stalled waiting on a downstream API call — most commonly a Pure Storage StorageDriver timeout, a Cinder volume manage call that never returned, or a Nova boot request that is queued but not scheduled.
Fix:
- Identify the target site from the operation's `target_site_id` field.
- On the target site controller, open the engine log:
  `grep <operation-id> /opt/openstack-protector/bin/logs/engine.log | tail -50`
- Look for lines containing `StorageDriver`, `TimeoutError`, `ConnectionError`, or an HTTP status code from Nova or Cinder.
- If you see a StorageDriver error, verify FlashArray B is reachable from the target site controller and that the API token in the replication policy is valid.
- If you see a Cinder or Nova API timeout, check the health of those services on the target site:
  `openstack --os-cloud site-b volume service list`
  `openstack --os-cloud site-b compute service list`
- If services are healthy and the operation has been stuck for over 30 minutes, you may need to manually mark the operation as failed and clean up any partially created resources before retrying. Contact your platform team before manually modifying operation state in the database.
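The "progress not advancing" check above can be scripted rather than eyeballed. The sketch below is a hypothetical poller: `fetch_progress` is any callable you supply that returns the operation's current integer `progress` — for example, a wrapper that shells out to `openstack dr operation show <id> -f json` and parses the result:

```python
import time

def wait_or_flag_stall(fetch_progress, interval_s=60, stall_limit=3, sleep=time.sleep):
    """Poll an operation's progress; return 'completed' when it reaches 100,
    or 'stalled' when the value stops advancing for `stall_limit`
    consecutive polls. `sleep` is injectable for testing."""
    last = -1
    repeats = 0
    while True:
        progress = fetch_progress()
        if progress >= 100:
            return "completed"
        if progress == last:
            repeats += 1
            if repeats >= stall_limit:
                return "stalled"
        else:
            last, repeats = progress, 0
        sleep(interval_s)
```

With the default 60-second interval and `stall_limit=3`, a "stalled" result means roughly three minutes without movement — tune both to your typical failover duration.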
steps_failed contains promote_volumes
Symptom: The operation completes with "status": "failed" and "steps_failed": ["promote_volumes"]. The error_message references a Pure Storage Protection Group or Pod not found, or a FlashArray connectivity error.
Likely cause: One of two root causes:
- FlashArray B is unreachable from the target site, or the API token is expired.
- The `pure_pg_name` in the replication policy does not match the actual Protection Group or Pod name on the array. For sync replication, a Pod name is expected; for async, a Protection Group name.
Fix:
- Retrieve the policy and note `pure_pg_name` and `secondary_fa_url`:
  `openstack dr protection-group policy show <pg-name> -f json`
- Test connectivity to FlashArray B from the target site controller (e.g., `curl -k https://<secondary_fa_url>/api/api_version`).
- Log into the FlashArray B management interface and confirm the Protection Group or Pod named in `pure_pg_name` exists and has a healthy replication link.
- If the name is wrong, update the policy:
  `openstack dr protection-group policy set <pg-name> --pure-pg-name <corrected-name>`
- Retry the failover operation.
VM recreation fails on target site (steps_failed contains recreate_instances)
Symptom: The storage phase (promote_volumes) succeeds — volumes appear in Cinder on the target site — but the operation fails during recreate_instances. VMs are not created on the target site.
Likely cause: One or more of:
- A network referenced in your `--network-mapping` does not exist on the target site, or the mapped network UUID is wrong.
- A Glance image referenced by the VM's metadata (used for non-volume-backed boot or as a source reference) is not present on the target site.
- Nova quota on the target site is insufficient for the number or flavor of instances being created.
Fix:
- Check the engine log for the specific VM that failed:
  `grep -A 20 'recreate_instances' /opt/openstack-protector/bin/logs/engine.log`
- Verify that every network in your mapping exists on the target site:
  `openstack --os-cloud site-b network list`
- Verify Glance image availability on the target site for any image referenced in the member metadata:
  `openstack --os-cloud site-b image list`
- Check Nova quota on the target site:
  `openstack --os-cloud site-b quota show --detail`
  Request a quota increase or free resources before retrying.
- Clean up any orphaned Cinder volumes on the target site that were created during the partial failover before retrying, or the next attempt may fail with duplicate volume name conflicts.
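To spot the duplicate-name conflicts mentioned in the last step, you can parse the JSON output of `openstack --os-cloud site-b volume list -f json` (a list of objects that includes a `"Name"` key) and count repeated names. A minimal sketch:

```python
from collections import Counter

def duplicate_volume_names(volumes):
    """Given parsed `openstack volume list -f json` output (list of dicts
    with a "Name" key), return names appearing more than once — candidates
    for orphan cleanup before retrying the failover."""
    counts = Counter(v.get("Name", "") for v in volumes)
    return sorted(name for name, n in counts.items() if n > 1 and name)
```

Any name it returns should be inspected by ID and creation time before deletion — one copy may be the legitimately promoted volume.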
Failback fails with replication not configured
Symptom: After a successful failover, initiating failback returns an error: replication not configured or the operation immediately transitions to failed with steps_failed indicating a replication setup step.
Likely cause: After a failover, the original replication direction (Site A → Site B) is broken by design — the volumes now live on Site B and replication must be explicitly re-established in the reverse direction (Site B → Site A) before failback can proceed. This step is not performed automatically.
Fix:
- Confirm the Protection Group is in `failed_over` status and that `current_primary_site` points to the site currently running the workloads:
  `openstack dr protection-group show <pg-name> -f json`
- Enable reverse replication from the currently active site:
  `openstack dr protection-group policy set <pg-name> --enable-reverse-replication`
- Allow time for the initial reverse sync to complete (for async replication, at least one replication interval, as configured in `replication_interval`).
- Verify replication health on FlashArray A before proceeding — confirm that FlashArray A is receiving replicated data from FlashArray B.
- Retry the failback:
  `openstack dr protection-group failback <pg-name> --network-mapping net-secondary-web=net-primary-web`
Modification to Protection Group blocked after unplanned failover
Symptom: After an unplanned failover (where the primary site was down), any attempt to modify the Protection Group on the target site returns an error indicating the remote site is unreachable and the operation is blocked.
Likely cause: This is expected behavior. Metadata sync requires both sites to be reachable to prevent divergence. During an unplanned failover the sync to the failed site is skipped, leaving the two sites at different metadata versions. Until the failed site recovers and is explicitly re-synced, modifications are blocked.
Fix:
- Wait for the primary site to recover and its Protector API to become reachable.
- Check the current sync status:
  `openstack dr protection-group sync-status <pg-name>`
- Force a sync to bring the recovered site up to the current metadata version:
  `openstack dr protection-group sync-force <pg-name>`
- Confirm both sites report `SYNCED` before retrying any modifications.