Site Recovery for OpenStack
Guide

Failover Issues

Stuck or failed failover operations; how to interpret steps_failed; manual recovery


Overview

This page helps you diagnose and recover from stuck or failed failover operations in Trilio Site Recovery for OpenStack. Failover operations involve coordinated steps across storage, compute, and networking layers on the target site — any one of those steps can fail independently, leaving the operation in a running or failed state with partial progress. You will learn how to interpret the steps_failed field in operation output, trace errors to their root cause in the engine log, and recover manually when automated recovery is not possible.


Prerequisites

Before using this guide, ensure you have:

  • CLI access to both the primary and secondary OpenStack sites, with valid credentials (clouds.yaml or openrc) for each
  • The protectorclient OSC plugin installed and functional (openstack dr operation show must work against both sites)
  • SSH or console access to the host running protector-engine on the target site (the site the failover is directed toward), so you can read engine logs
  • Familiarity with the Protection Group and DR operation concepts described in the architecture documentation
  • Sufficient Keystone permissions to inspect Nova, Cinder, and Neutron resources on the target site
  • Pure Storage FlashArray credentials if you need to verify array-level replication state

Installation

No additional installation is required for this guide; the protectorclient OSC plugin listed in the prerequisites is sufficient.

Configuration

No additional configuration changes are required to troubleshoot failover issues. However, the following configuration points directly affect the behavior described in this guide and are worth verifying before you begin diagnosis.

Engine log location

The protector-engine service on each site writes its log to:

/opt/openstack-protector/bin/logs/engine.log

This is the primary diagnostic artifact for any stuck or failed failover. Always check the log on the target site (the site the workloads are being moved to), because that is where the engine executes the failover steps.

Replication policy: pure_pg_name

The pure_pg_name field in the replication policy must exactly match the Protection Group name (or Pod name for sync replication) configured on the Pure Storage FlashArray. A mismatch causes promote_volumes to fail. Verify the value with:

openstack dr protection-group policy show <pg-id> -f json

Cinder volume type replication properties

Volume types used by a Protection Group must have both of the following extra specs set, or storage promotion will be rejected:

replication_enabled='<is> True'
replication_type='<in> async'   # or '<in> sync'

Metadata sync requirement

Modifications to a Protection Group, including any manual recovery steps that update group membership, are blocked when the peer site is unreachable. This is by design, to prevent metadata divergence. If the peer site is down during an unplanned failover, you will see sync errors in the log. This is expected and does not block the failover itself, but it does block subsequent modifications until the peer recovers and you run openstack dr protection-group sync-force <pg-name>.

Reverse replication requirement for failback

After any failover (planned or unplanned), the replication direction must be explicitly reversed before failback can proceed. This is not done automatically. The replication not configured error during failback indicates this step was skipped. See the Failback fails with 'replication not configured' issue below.


Usage

The primary tool for diagnosing a failover operation is the openstack dr operation show command with JSON output, which exposes the full step-level detail of what the engine attempted:

openstack dr operation show <operation-id> -f json

The fields most relevant to troubleshooting are:

Field              What it tells you
---------------    -----------------
status             Overall state: running, failed, completed, or rolling_back
progress           Integer 0–100 indicating how far the operation got
steps_completed    JSON array of step names that finished successfully
steps_failed       JSON array of step names that encountered an error
error_message      High-level error string; often a summary of the first failure
result_data        JSON blob with step-specific output, including volume and instance IDs created so far

Use steps_failed to identify exactly where the operation stopped, then cross-reference with the engine log on the target site to get the full stack trace and upstream API error.
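If you triage failovers often, this JSON lends itself to scripting. A minimal Python sketch that produces a one-line triage summary from the operation payload (field names are taken from the table above; the sample values mirror the example output later in this page):

```python
import json

# Sample payload in the shape returned by `openstack dr operation show <id> -f json`
raw = '''{
  "status": "failed",
  "progress": 35,
  "steps_completed": ["validate_sites", "get_latest_snapshot"],
  "steps_failed": ["promote_volumes"]
}'''

def summarize(op: dict) -> str:
    """Return a one-line triage summary: status, progress, and first failed step."""
    failed = op.get("steps_failed") or []
    first = failed[0] if failed else "none"
    return f"{op['status']} at {op['progress']}% (first failed step: {first})"

print(summarize(json.loads(raw)))  # failed at 35% (first failed step: promote_volumes)
```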

Identifying the target site

The target_site_id field in the operation output tells you which site the engine was running on. Match that ID against your registered sites:

openstack dr operation show <operation-id> -f json | python3 -m json.tool
openstack dr site list

Then SSH to the controller node of that site and tail the engine log:

tail -n 200 /opt/openstack-protector/bin/logs/engine.log

Search for the operation ID within the log to isolate the relevant entries:

grep <operation-id> /opt/openstack-protector/bin/logs/engine.log

Examples

Retrieve full step-level detail for a failed operation

Run this immediately after observing a failed or stuck running status:

openstack dr operation show a1b2c3d4-e5f6-7890-abcd-ef1234567890 -f json

Expected output (truncated for clarity):

{
  "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "protection_group_id": "pg-uuid-here",
  "operation_type": "failover",
  "status": "failed",
  "source_site_id": "site-a-uuid",
  "target_site_id": "site-b-uuid",
  "progress": 35,
  "steps_completed": ["validate_sites", "get_latest_snapshot"],
  "steps_failed": ["promote_volumes"],
  "error_message": "StorageDriver error: Protection Group 'pg-prod-web-app' not found on FlashArray",
  "result_data": {},
  "started_at": "2025-11-03T10:00:00Z",
  "completed_at": null
}

The steps_failed value promote_volumes tells you the failure occurred during the storage layer promotion, before any Nova resources were created on the target site.
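When scripting triage, it can help to map each failed step to the layer that owns it. The mapping below covers only the step names that appear in this guide's examples; the full step list is product-specific and may differ in your release:

```python
# Illustrative mapping from the step names shown on this page to the layer
# that owns them; treat unmapped steps as "unknown" rather than guessing.
STEP_LAYER = {
    "validate_sites": "control plane",
    "get_latest_snapshot": "storage (replication)",
    "promote_volumes": "storage (promotion)",
    "recreate_instances": "compute/network",
}

def blame_layer(steps_failed):
    """Return the layer implicated by each failed step."""
    return [STEP_LAYER.get(step, "unknown") for step in steps_failed]

print(blame_layer(["promote_volumes"]))  # ['storage (promotion)']
```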


Verify the pure_pg_name in the replication policy

After a promote_volumes failure, confirm that the Protection Group name in the policy matches what exists on the array:

openstack dr protection-group policy show prod-web-app -f json

Expected output:

{
  "pure_pg_name": "pg-prod-web-app",
  "primary_fa_url": "https://flasharray-a.example.com",
  "secondary_fa_url": "https://flasharray-b.example.com",
  "replication_interval": 300,
  "rpo_minutes": 15
}

Log into FlashArray B and confirm that pg-prod-web-app exists as a Protection Group (async) or Pod (sync). If the name differs, update the policy:

openstack dr protection-group policy set prod-web-app \
  --pure-pg-name <correct-name-from-array>
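Because most pure_pg_name failures are typos or prefix drift, a fuzzy comparison between the policy value and the names listed on the array can point straight at the intended name. A sketch (the array names would come from the FlashArray management interface or API; the list here is illustrative):

```python
import difflib

def check_pg_name(policy_pg_name, array_pg_names):
    """Return (ok, suggestion): exact match against the array's Protection Group
    (or Pod) names, plus the closest candidate when the name is wrong."""
    if policy_pg_name in array_pg_names:
        return True, policy_pg_name
    close = difflib.get_close_matches(policy_pg_name, array_pg_names, n=1)
    return False, close[0] if close else None

# Names as listed on FlashArray B (illustrative)
ok, hint = check_pg_name("pg-prod-webapp", ["pg-prod-web-app", "pg-prod-db"])
print(ok, hint)  # False pg-prod-web-app
```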

Check Neutron network availability on the target site before retrying

If steps_failed contains recreate_instances, verify that the mapped network exists on the target site before retrying:

# Authenticate to the target site
export OS_CLOUD=site-b

# Check the network referenced in your --network-mapping
openstack network show net-secondary-web

# Check Nova quota headroom
openstack quota show --detail
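To judge whether quota will accommodate the instances the failover recreates, compare free quota against what you expect to create. A sketch over the per-resource {limit, in_use, reserved} shape that `quota show --detail` reports (the resource counts here are assumptions for illustration):

```python
def headroom(quota_detail, needed):
    """Return the resources that lack enough free quota for the planned
    creations; a limit of -1 means unlimited and is skipped."""
    short = {}
    for resource, need in needed.items():
        q = quota_detail[resource]
        if q["limit"] == -1:
            continue  # unlimited quota
        free = q["limit"] - q["in_use"] - q["reserved"]
        if free < need:
            short[resource] = {"free": free, "needed": need}
    return short

quota = {"instances": {"limit": 10, "in_use": 9, "reserved": 0},
         "cores": {"limit": 40, "in_use": 20, "reserved": 0}}
print(headroom(quota, {"instances": 3, "cores": 6}))
# {'instances': {'free': 1, 'needed': 3}}
```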

Force a metadata sync after an unplanned failover recovers the primary

After an unplanned failover where Site A was unreachable, once Site A comes back online, re-establish metadata consistency before attempting any further operations:

export OS_CLOUD=site-b
openstack dr protection-group sync-status prod-web-app
openstack dr protection-group sync-force prod-web-app

Expected output after a successful sync:

Syncing to Site A...
  ✅ Site A is now reachable
  ✅ Pushing version 4 to Site A
  ✅ Site A updated successfully

Sync Status: SYNCED
Both sites now at version 4

Establish reverse replication before failback

This step is required after every failover. Without it, the failback operation will fail with replication not configured.

# Authenticated to the site currently running the workloads (e.g., site-b)
export OS_CLOUD=site-b

openstack dr protection-group policy set prod-web-app \
  --enable-reverse-replication

Then verify replication is active before initiating failback:

openstack dr protection-group show prod-web-app -f json | python3 -m json.tool
# Confirm: "status": "failed_over" and replication direction is site-b → site-a
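The preconditions above can be checked mechanically against the group JSON. A sketch; the status value follows the `protection-group show` output referenced in this guide, while the reverse_replication_enabled flag name is an assumption for illustration (check your actual output for the equivalent field):

```python
def failback_ready(group):
    """Return a list of precondition problems blocking failback (empty = ready)."""
    problems = []
    if group.get("status") != "failed_over":
        problems.append(f"status is {group.get('status')!r}, expected 'failed_over'")
    if not group.get("reverse_replication_enabled", False):  # assumed field name
        problems.append("reverse replication not enabled")
    return problems

print(failback_ready({"status": "failed_over"}))  # ['reverse replication not enabled']
```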

Troubleshooting

Each issue below follows the same format: Symptom → Likely cause → Fix.


Operation stuck in running state

Symptom: openstack dr operation show <id> returns "status": "running" for an extended period (more than 10–15 minutes for a typical async failover) and the progress value is not advancing.

Likely cause: The protector-engine process on the target site has stalled waiting on a downstream API call — most commonly a Pure Storage StorageDriver timeout, a Cinder volume manage call that never returned, or a Nova boot request that is queued but not scheduled.

Fix:

  1. Identify the target site from the operation's target_site_id field.
  2. On the target site controller, open the engine log:
    grep <operation-id> /opt/openstack-protector/bin/logs/engine.log | tail -50
    
  3. Look for lines containing StorageDriver, TimeoutError, ConnectionError, or an HTTP status code from Nova or Cinder.
  4. If you see a StorageDriver error, verify FlashArray B is reachable from the target site controller and that the API token in the replication policy is valid.
  5. If you see a Cinder or Nova API timeout, check the health of those services on the target site:
    openstack --os-cloud site-b volume service list
    openstack --os-cloud site-b compute service list
    
  6. If services are healthy and the operation has been stuck for over 30 minutes, you may need to manually mark the operation as failed and clean up any partially created resources before retrying. Contact your platform team before manually modifying operation state in the database.
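The "stuck for over 30 minutes" judgment in step 6 can be automated as a stall watchdog that watches progress rather than wall-clock time alone. A sketch, where fetch_progress() would wrap `openstack dr operation show <id> -f json` and return the progress integer:

```python
import time

def wait_for_progress(fetch_progress, stall_timeout=1800, poll=30):
    """Poll an operation's progress; return 'done' when it reaches 100, or
    'stalled' if progress stops advancing for stall_timeout seconds."""
    last, last_change = -1, time.monotonic()
    while True:
        p = fetch_progress()
        if p >= 100:
            return "done"
        if p > last:
            last, last_change = p, time.monotonic()
        elif time.monotonic() - last_change >= stall_timeout:
            return "stalled"
        time.sleep(poll)
```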

steps_failed contains promote_volumes

Symptom: The operation completes with "status": "failed" and "steps_failed": ["promote_volumes"]. The error_message references a Pure Storage Protection Group or Pod not found, or a FlashArray connectivity error.

Likely cause: One of two root causes:

  • The FlashArray B is unreachable from the target site, or the API token is expired.
  • The pure_pg_name in the replication policy does not match the actual Protection Group or Pod name on the array. For sync replication, a Pod name is expected; for async, a Protection Group name.

Fix:

  1. Retrieve the policy and note pure_pg_name, secondary_fa_url:
    openstack dr protection-group policy show <pg-name> -f json
    
  2. Test connectivity to FlashArray B from the target site controller (e.g., curl -k https://<secondary_fa_url>/api/api_version).
  3. Log into the FlashArray B management interface and confirm the Protection Group or Pod named in pure_pg_name exists and has a healthy replication link.
  4. If the name is wrong, update the policy:
    openstack dr protection-group policy set <pg-name> \
      --pure-pg-name <corrected-name>
    
  5. Retry the failover operation.

VM recreation fails on target site (steps_failed contains recreate_instances)

Symptom: The storage phase (promote_volumes) succeeds — volumes appear in Cinder on the target site — but the operation fails during recreate_instances. VMs are not created on the target site.

Likely cause: One or more of:

  • A network referenced in your --network-mapping does not exist on the target site, or the mapped network UUID is wrong.
  • A Glance image referenced by the VM's metadata (used for non-volume-backed boot or as a source reference) is not present on the target site.
  • Nova quota on the target site is insufficient for the number or flavor of instances being created.

Fix:

  1. Check the engine log for the specific VM that failed:
    grep -A 20 'recreate_instances' /opt/openstack-protector/bin/logs/engine.log
    
  2. Verify that every network in your mapping exists on the target site:
    openstack --os-cloud site-b network list
    
  3. Verify Glance image availability on the target site for any image referenced in the member metadata:
    openstack --os-cloud site-b image list
    
  4. Check Nova quota on the target site:
    openstack --os-cloud site-b quota show --detail
    
    Request a quota increase or free resources before retrying.
  5. Clean up any orphaned Cinder volumes on the target site that were created during the partial failover before retrying, or the next attempt may fail with duplicate volume name conflicts.
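For step 5, the operation's result_data (see the field table in Usage) records what was created before the failure, which you can cross-reference against the target site's volume list to find cleanup candidates. A sketch; the "volumes_created" key is an assumed result_data shape for illustration (inspect your actual result_data for the equivalent list):

```python
def find_orphans(result_data, target_volumes):
    """Return target-site volumes whose IDs appear in the failed operation's
    result_data; these are candidates for cleanup before a retry."""
    created = set(result_data.get("volumes_created", []))  # assumed key name
    return [v for v in target_volumes if v["id"] in created]

result_data = {"volumes_created": ["vol-1", "vol-2"]}
target = [{"id": "vol-1", "name": "web-root"}, {"id": "vol-9", "name": "other"}]
print([v["id"] for v in find_orphans(result_data, target)])  # ['vol-1']
```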

Failback fails with replication not configured

Symptom: After a successful failover, initiating failback returns an error: replication not configured or the operation immediately transitions to failed with steps_failed indicating a replication setup step.

Likely cause: After a failover, the original replication direction (Site A → Site B) is broken by design — the volumes now live on Site B and replication must be explicitly re-established in the reverse direction (Site B → Site A) before failback can proceed. This step is not performed automatically.

Fix:

  1. Confirm the Protection Group is in failed_over status and that current_primary_site points to the site currently running the workloads:
    openstack dr protection-group show <pg-name> -f json
    
  2. Enable reverse replication from the currently active site:
    openstack dr protection-group policy set <pg-name> \
      --enable-reverse-replication
    
  3. Allow time for the initial reverse sync to complete (for async replication, at least one replication interval, as configured in replication_interval).
  4. Verify replication health on FlashArray A before proceeding — confirm that FlashArray A is receiving replicated data from FlashArray B.
  5. Retry the failback:
    openstack dr protection-group failback <pg-name> \
      --network-mapping net-secondary-web=net-primary-web
    

Modification to Protection Group blocked after unplanned failover

Symptom: After an unplanned failover (where the primary site was down), any attempt to modify the Protection Group on the target site returns an error indicating the remote site is unreachable and the operation is blocked.

Likely cause: This is expected behavior. Metadata sync requires both sites to be reachable to prevent divergence. During an unplanned failover the sync to the failed site is skipped, leaving the two sites at different metadata versions. Until the failed site recovers and is explicitly re-synced, modifications are blocked.

Fix:

  1. Wait for the primary site to recover and its Protector API to become reachable.
  2. Check the current sync status:
    openstack dr protection-group sync-status <pg-name>
    
  3. Force a sync to bring the recovered site up to the current metadata version:
    openstack dr protection-group sync-force <pg-name>
    
  4. Confirm both sites report SYNCED before retrying any modifications.
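The SYNCED check in step 4 amounts to comparing the metadata version each site reports. A sketch of that decision, using None to represent an unreachable site:

```python
def sync_state(site_versions):
    """Given metadata versions per site (e.g. parsed from sync-status output),
    report SYNCED only when all sites are reachable and at the same version."""
    if None in site_versions.values():  # None = site unreachable
        return "DEGRADED (peer unreachable)"
    return "SYNCED" if len(set(site_versions.values())) == 1 else "DIVERGED"

print(sync_state({"site-a": 4, "site-b": 4}))  # SYNCED
```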