Site Recovery for OpenStack
Guide

Operation Monitoring Commands

openstack dr operation list/show; watching progress; interpreting steps_completed and steps_failed


Overview

This page explains how to monitor the progress of disaster recovery operations — failover, failback, test failover, and related actions — using the openstack dr operation CLI commands. DR operations are long-running, multi-step processes; understanding how to list active operations, inspect individual operation records, and interpret the steps_completed and steps_failed fields lets you confirm that a failover is proceeding correctly, diagnose failures mid-operation, and determine whether a rollback has been triggered. All examples assume you are authenticated to the site where the operation was initiated.


Prerequisites

Before monitoring DR operations, ensure the following:

  • Trilio Site Recovery (Protector) is deployed and running on both sites (protector-api and protector-engine services healthy)
  • The protectorclient OSC plugin is installed (pip show python-protectorclient should return a version)
  • You have valid credentials for the site where the operation is running — operations are site-local records and are not visible cross-site through the operation list command
  • A Protection Group exists and at least one DR action (failover, failback, test failover, or sync) has been initiated — the operations list is empty until an action is triggered
  • Your clouds.yaml or environment variables point to the correct site:
# Confirm you are authenticated to the correct site
openstack catalog show protector

Installation

The monitoring commands are part of the protectorclient OSC plugin. No separate installation is required beyond the plugin itself.

Step 1 — Verify the plugin is installed

pip show python-protectorclient

Expected output includes a Location and Version line. If the package is missing, install it:

pip install python-protectorclient

Step 2 — Confirm the commands are registered

openstack dr operation --help

You should see list and show as available sub-commands. If the output shows 'dr' is not an openstack command, the plugin entry points have not been registered — reinstall with:

pip install --force-reinstall python-protectorclient

Step 3 — Confirm API reachability

openstack dr operation list

An empty table (rather than a connection error) confirms the API is reachable and your token is valid.
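The three verification steps above can be combined into a single pre-flight check. This is a sketch using only the commands shown in Steps 1–3; the `preflight` function name is illustrative:

```shell
# Pre-flight sketch combining Steps 1-3: plugin installed, commands
# registered, and API reachable with a valid token.
preflight() {
  pip show python-protectorclient >/dev/null 2>&1 \
    || { echo "plugin missing: pip install python-protectorclient"; return 1; }
  openstack dr operation --help >/dev/null 2>&1 \
    || { echo "entry points not registered: pip install --force-reinstall python-protectorclient"; return 1; }
  openstack dr operation list >/dev/null 2>&1 \
    || { echo "API unreachable or token invalid"; return 1; }
  echo "ok"
}
```

Running `preflight` on each site before a planned failover window catches missing-plugin and authentication problems up front.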


Configuration

Operation monitoring is read-only and requires no persistent configuration beyond valid credentials. The following fields in the dr_operations data model directly affect what you see when inspecting an operation:

  • status (enum): Current lifecycle state. Valid values: pending, running, completed, failed, rolling_back.
  • progress (integer, 0–100): Coarse percentage of overall operation completion. Updated at major phase boundaries, not continuously.
  • steps_completed (JSON array): Ordered list of step names that have finished successfully. Populated incrementally as the engine advances.
  • steps_failed (JSON array): List of step names that encountered an error. A non-empty array always accompanies a failed or rolling_back status.
  • error_message (text): Human-readable description of the first fatal error, if any.
  • result_data (JSON object): Outcome details after completion, such as instance IDs created on the target site and volume mappings. Present only when status is completed.
  • started_at / completed_at (datetime): Wall-clock timestamps for calculating operation duration. completed_at is null while the operation is still running.

The replication_interval and rpo_minutes values from the replication policy affect when replication snapshots are available before a failover, but they are not surfaced in the operation record itself.
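Because steps_completed and steps_failed are JSON arrays, machine-readable output is often easier to script against than the table view. A sketch assuming the standard OSC -f json formatter is supported; python3 is used for parsing to avoid a jq dependency, and the `failed_steps` helper name is illustrative:

```shell
# Print an operation's failed steps, one per line, via "-f json" output.
failed_steps() {
  openstack dr operation show "$1" -f json | python3 -c '
import json, sys
for step in json.load(sys.stdin).get("steps_failed") or []:
    print(step)'
}
```

An empty result means no step has failed so far.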


Usage

Listing all operations

To see all DR operations for your project, run:

openstack dr operation list

By default, the table shows id, operation_type, status, progress, protection_group_id, started_at, and completed_at. Operations are returned most-recent-first.

Filter by protection group to narrow results when you manage multiple groups:

openstack dr operation list --protection-group prod-web-app

Filter by status to find only active or failed operations:

openstack dr operation list --status running
openstack dr operation list --status failed

Watching a running operation

The show command returns a point-in-time snapshot. To watch progress update in place, use the shell watch utility:

watch -n 5 openstack dr operation show <operation-id>

This re-runs the command every 5 seconds and refreshes the terminal. The progress field and the length of steps_completed will increase as the engine advances through its phases.
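When scripting rather than watching interactively, a polling loop that blocks until a terminal status is reached can be more useful than watch. A sketch, assuming the standard -f value -c formatter flags; the function name and default interval are arbitrary:

```shell
# Block until the operation reaches a terminal status, then print it.
# Per the Configuration table, completed and failed are terminal
# (rolling_back eventually transitions to failed).
wait_for_operation() {
  local op_id="$1" interval="${2:-10}" status
  while :; do
    status=$(openstack dr operation show "$op_id" -f value -c status)
    case "$status" in
      completed|failed) break ;;
    esac
    sleep "$interval"
  done
  echo "$status"
}
```

For example, `wait_for_operation <operation-id> 5` polls every 5 seconds and prints the final status when the operation finishes.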

Showing full operation details

openstack dr operation show <operation-id>

This returns all fields including steps_completed, steps_failed, error_message, and result_data. The steps_completed and steps_failed arrays are the primary diagnostic tool — they tell you exactly which steps finished before any failure occurred.

Retrieving the operation ID after triggering an action

Every DR action command (failover, failback, test-failover) returns an operation record inline:

openstack protector protection-group failover prod-web-app \
  --network-mapping net-primary-web=net-secondary-web

The output includes operation_id. Note this value as soon as the command returns so you do not have to hunt for it later in the operation list.
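In scripts, the ID can be captured directly rather than copied by hand. This sketch assumes the action command honours the standard -f value -c formatter flags; the `trigger_failover` wrapper name is illustrative:

```shell
# Trigger a failover and return only the operation_id for later polling.
trigger_failover() {
  openstack protector protection-group failover "$1" \
    --network-mapping "$2" \
    -f value -c operation_id
}
```

Usage: `OP_ID=$(trigger_failover prod-web-app net-primary-web=net-secondary-web)`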


Examples

Example 1 — List all operations for a protection group

List operations scoped to a specific protection group to avoid noise from other groups:

openstack dr operation list --protection-group prod-web-app

Expected output:

+--------------------------------------+---------------+-----------+----------+--------------------------------------+----------------------+----------------------+
| id                                   | operation_type| status    | progress | protection_group_id                  | started_at           | completed_at         |
+--------------------------------------+---------------+-----------+----------+--------------------------------------+----------------------+----------------------+
| op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a | failover      | completed | 100      | pg-12345678-1234-1234-1234-123456789 | 2025-03-12T09:14:02Z | 2025-03-12T09:21:47Z |
| op-1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b | test_failover | failed    | 45       | pg-12345678-1234-1234-1234-123456789 | 2025-03-10T14:30:00Z | 2025-03-10T14:33:12Z |
+--------------------------------------+---------------+-----------+----------+--------------------------------------+----------------------+----------------------+

Example 2 — Show a running failover operation

Inspect a failover that is currently in the instance recreation phase:

openstack dr operation show op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a

Expected output (mid-operation, storage phase complete):

+-------------------+-----------------------------------------------------------------------+
| Field             | Value                                                                 |
+-------------------+-----------------------------------------------------------------------+
| id                | op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a                                  |
| operation_type    | failover                                                              |
| status            | running                                                               |
| progress          | 65                                                                    |
| protection_group  | pg-12345678-1234-1234-1234-123456789 (prod-web-app)                   |
| source_site       | site-a                                                                |
| target_site       | site-b                                                                |
| started_at        | 2025-03-12T09:14:02Z                                                  |
| completed_at      | None                                                                  |
| error_message     | None                                                                  |
| steps_completed   | ["validate_sites", "create_operation_record",                        |
|                   |  "get_latest_snapshot", "promote_volumes",                           |
|                   |  "manage_volumes_cinder", "update_volume_records"]                    |
| steps_failed      | []                                                                    |
| result_data       | None                                                                  |
+-------------------+-----------------------------------------------------------------------+

The progress value of 65 and the completed steps confirm the operation has finished the storage failover phase (all steps through manage_volumes_cinder) and is now working through instance recreation.


Example 3 — Show a completed failover

Once the operation reaches status: completed, result_data is populated:

openstack dr operation show op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a

Expected output:

+-------------------+-----------------------------------------------------------------------+
| Field             | Value                                                                 |
+-------------------+-----------------------------------------------------------------------+
| id                | op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a                                  |
| operation_type    | failover                                                              |
| status            | completed                                                             |
| progress          | 100                                                                   |
| source_site       | site-a                                                                |
| target_site       | site-b                                                                |
| started_at        | 2025-03-12T09:14:02Z                                                  |
| completed_at      | 2025-03-12T09:21:47Z                                                  |
| error_message     | None                                                                  |
| steps_completed   | ["validate_sites", "create_operation_record",                        |
|                   |  "get_latest_snapshot", "promote_volumes",                           |
|                   |  "manage_volumes_cinder", "update_volume_records",                   |
|                   |  "recreate_instances", "attach_volumes",                             |
|                   |  "update_protection_group", "finalize"]                               |
| steps_failed      | []                                                                    |
| result_data       | {"instances_created": {"web-server-1": "<site-b-instance-uuid>",     |
|                   |   "db-server": "<site-b-instance-uuid>"},                            |
|                   |  "volumes_managed": 4, "failover_count": 1}                          |
+-------------------+-----------------------------------------------------------------------+
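Because result_data is itself a JSON object, post-failover verification can extract the new instance IDs programmatically. A sketch using -f json with python3 for parsing; the instances_created key is taken from the output above, and the `created_instances` helper name is illustrative:

```shell
# Print "name uuid" pairs for instances created on the target site.
created_instances() {
  openstack dr operation show "$1" -f json | python3 -c '
import json, sys
data = json.load(sys.stdin).get("result_data") or {}
for name, uuid in (data.get("instances_created") or {}).items():
    print(name, uuid)'
}
```

Feeding each UUID to `openstack server show` on the target site confirms the recreated instances are ACTIVE.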

Example 4 — Show a failed operation

When a step fails, steps_failed is non-empty and error_message explains the failure:

openstack dr operation show op-1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b

Expected output:

+-------------------+-----------------------------------------------------------------------+
| Field             | Value                                                                 |
+-------------------+-----------------------------------------------------------------------+
| id                | op-1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b                                  |
| operation_type    | test_failover                                                         |
| status            | failed                                                                |
| progress          | 45                                                                    |
| source_site       | site-a                                                                |
| target_site       | site-b                                                                |
| started_at        | 2025-03-10T14:30:00Z                                                  |
| completed_at      | 2025-03-10T14:33:12Z                                                  |
| error_message     | Volume manage failed: Cinder volume service host 'pure@backend-b'    |
|                   | not found. Verify volume_extension:services:index policy on Site B.  |
| steps_completed   | ["validate_sites", "create_operation_record",                        |
|                   |  "get_latest_snapshot", "promote_volumes"]                            |
| steps_failed      | ["manage_volumes_cinder"]                                             |
| result_data       | None                                                                  |
+-------------------+-----------------------------------------------------------------------+

The combination of steps_failed: ["manage_volumes_cinder"] and the error_message immediately directs you to the Cinder policy configuration on Site B as the root cause.


Troubleshooting

Operation stuck at the same progress value for an extended period

Symptom: openstack dr operation show repeatedly returns the same progress integer and status: running for more than 10–15 minutes.

Likely cause: The protector-engine process on the active site has stalled or crashed mid-operation. The operation record is updated by the engine, so if the engine is down, progress stops.

Fix:

  1. Check the engine service on the site where the operation is running:
    systemctl status protector-engine
    journalctl -u protector-engine -n 100 --no-pager
    
  2. If the engine has crashed, restart it:
    systemctl restart protector-engine
    
  3. The engine will detect the in-progress operation on startup and attempt to resume or roll it back depending on which step was interrupted.
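Before restarting the engine, you can confirm the stall mechanically by sampling progress twice. A sketch assuming the standard -f value -c flags; the 60-second default interval is an arbitrary choice:

```shell
# Exit 0 if progress is unchanged across the sample interval (stalled).
progress_stalled() {
  local op_id="$1" interval="${2:-60}" p1 p2
  p1=$(openstack dr operation show "$op_id" -f value -c progress)
  sleep "$interval"
  p2=$(openstack dr operation show "$op_id" -f value -c progress)
  [ "$p1" = "$p2" ]
}
```

Usage: `progress_stalled <operation-id> && echo "stalled: check protector-engine"`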

steps_failed contains manage_volumes_cinder with a policy error

Symptom: Operation fails at the manage_volumes_cinder step. error_message contains a 403 status or a "Policy does not allow" string.

Likely cause: The Cinder policy on the target site does not grant volume_extension:volume_manage to the member role used by the Protector service trust.

Fix: On the target site, add the required policy overrides to /etc/cinder/policy.yaml:

"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"

For Kolla-Ansible deployments, apply under /etc/kolla/config/cinder/policy.yaml and reconfigure:

kolla-ansible -i inventory reconfigure -t cinder

After the policy is applied, re-trigger the failed operation.


status transitions to rolling_back unexpectedly

Symptom: An operation that was running changes to rolling_back without user intervention.

Likely cause: A non-recoverable error occurred after at least one side-effect had already been applied (for example, volumes were promoted but instance creation failed). The engine automatically initiates rollback to avoid leaving resources in a partially-failed state.

Fix:

  1. Wait for rolling_back to complete — do not attempt to manually delete resources while rollback is in progress.
  2. Once the operation reaches failed (after rollback completes), read steps_failed and error_message to identify the root cause.
  3. Resolve the underlying issue, then re-trigger the operation.

openstack dr operation list returns an empty table

Symptom: No operations appear even after triggering a failover.

Likely cause A: You are authenticated to the wrong site. Operations are site-local. If you triggered the failover on Site B, you must query Site B's API to see the record.

Fix: Switch your credentials to the correct site and re-run:

# Source the Site B credentials (sets OS_AUTH_URL and related variables)
source ~/site-b-openrc
openstack dr operation list

Likely cause B: The operation_type or status filter is too restrictive.

Fix: Run without filters first:

openstack dr operation list

status shows failed but steps_failed is an empty array

Symptom: The operation is clearly failed (status: failed, progress stopped) but steps_failed contains [] and error_message is None.

Likely cause: The engine was killed (OOM, SIGKILL, host reboot) before it could write the failure metadata to the database. The API record was marked failed by a recovery sweep but the step details were never persisted.

Fix: Examine the engine logs from the time window between started_at and completed_at:

journalctl -u protector-engine \
  --since "<started_at value>" \
  --until "<completed_at value>" \
  --no-pager | grep -E 'ERROR|CRITICAL|step'

The log will contain the step name and exception that caused the failure.