Operation Monitoring Commands
openstack dr operation list/show; watching progress; interpreting steps_completed and steps_failed
This page explains how to monitor the progress of disaster recovery operations — failover, failback, test failover, and related actions — using the openstack dr operation CLI commands. DR operations are long-running, multi-step processes; understanding how to list active operations, inspect individual operation records, and interpret the steps_completed and steps_failed fields lets you confirm that a failover is proceeding correctly, diagnose failures mid-operation, and determine whether a rollback has been triggered. All examples assume you are authenticated to the site where the operation was initiated.
Before monitoring DR operations, ensure the following:
- Trilio Site Recovery (Protector) is deployed and running on both sites (protector-api and protector-engine services healthy)
- The protectorclient OSC plugin is installed (pip show python-protectorclient should return a version)
- You have valid credentials for the site where the operation is running — operations are site-local records and are not visible cross-site through the operation list command
- A Protection Group exists and at least one DR action (failover, failback, test failover, or sync) has been initiated — the operations list is empty until an action is triggered
- Your clouds.yaml or environment variables point to the correct site:
# Confirm you are authenticated to the correct site
openstack catalog show protector
The monitoring commands are part of the protectorclient OSC plugin. No separate installation is required beyond the plugin itself.
Step 1 — Verify the plugin is installed
pip show python-protectorclient
Expected output includes a Location and Version line. If the package is missing, install it:
pip install python-protectorclient
Step 2 — Confirm the commands are registered
openstack dr operation --help
You should see list and show as available sub-commands. If the output shows 'dr' is not an openstack command, the plugin entry points have not been registered — reinstall with:
pip install --force-reinstall python-protectorclient
Step 3 — Confirm API reachability
openstack dr operation list
An empty table (rather than a connection error) confirms the API is reachable and your token is valid.
Operation monitoring is read-only and requires no persistent configuration beyond valid credentials. The following fields in the dr_operations data model directly affect what you see when inspecting an operation:
| Field | Type | Description |
|---|---|---|
| status | enum | Current lifecycle state. Valid values: pending, running, completed, failed, rolling_back. |
| progress | integer (0–100) | Coarse percentage of overall operation completion. Updated at major phase boundaries, not continuously. |
| steps_completed | JSON array | Ordered list of step names that have finished successfully. Populated incrementally as the engine advances. |
| steps_failed | JSON array | List of step names that encountered an error. A non-empty array always accompanies a failed or rolling_back status. |
| error_message | text | Human-readable description of the first fatal error, if any. |
| result_data | JSON object | Outcome details after completion — e.g., instance IDs created on the target site, volume mappings. Present only when status is completed. |
| started_at / completed_at | datetime | Wall-clock timestamps for operation duration calculation. completed_at is null while the operation is still running. |
The replication_interval and rpo_minutes values from the replication policy affect when replication snapshots are available before a failover, but they are not surfaced in the operation record itself.
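The invariants in the field table can be sketched as a small local model. The DrOperation dataclass below is hypothetical (it is not part of the protectorclient library); it only illustrates the relationships between the fields, with a consistency check you could run against records fetched from the API:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical local model of the dr_operations record described above,
# for illustrating field invariants. Not the actual client classes.
@dataclass
class DrOperation:
    status: str                  # pending | running | completed | failed | rolling_back
    progress: int                # 0-100, updated at major phase boundaries
    steps_completed: list = field(default_factory=list)
    steps_failed: list = field(default_factory=list)
    error_message: Optional[str] = None
    result_data: Optional[dict] = None

def check_invariants(op: DrOperation) -> list:
    """Return a list of consistency problems implied by the field table."""
    problems = []
    if op.status in ("failed", "rolling_back") and not op.steps_failed:
        # A non-empty steps_failed should accompany failed/rolling_back;
        # an empty one suggests the engine died before persisting details.
        problems.append("failed/rolling_back but steps_failed is empty")
    if op.result_data is not None and op.status != "completed":
        # result_data is present only on completion.
        problems.append("result_data present before completion")
    if not 0 <= op.progress <= 100:
        problems.append("progress out of range")
    return problems
```

The "steps_failed is empty" check corresponds to the last troubleshooting scenario on this page, where the engine was killed before it could write failure metadata.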
Listing all operations
To see all DR operations for your project, run:
openstack dr operation list
By default, the table shows id, operation_type, status, progress, protection_group_id, started_at, and completed_at. Operations are returned most-recent-first.
Filter by protection group to narrow results when you manage multiple groups:
openstack dr operation list --protection-group prod-web-app
Filter by status to find only active or failed operations:
openstack dr operation list --status running
openstack dr operation list --status failed
Watching a running operation
The show command returns a point-in-time snapshot. To watch progress update in place, use the shell watch utility:
watch -n 5 openstack dr operation show <operation-id>
This re-runs the command every 5 seconds and refreshes the terminal. The progress field and the length of steps_completed will increase as the engine advances through its phases.
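If you prefer a script to a terminal refresh, the same polling loop can be sketched in Python. The fetch callable is an assumption: in practice it might wrap openstack dr operation show &lt;id&gt; -f json via subprocess, using the standard OSC JSON formatter, but any callable returning the record as a dict works:

```python
import time

TERMINAL = {"completed", "failed"}

def wait_for_operation(fetch, poll_seconds=5, max_polls=720):
    """Poll fetch() until the operation reaches a terminal status.

    fetch() must return a dict with at least 'status', 'progress',
    and 'steps_completed'. Returns the final record.
    """
    for _ in range(max_polls):
        op = fetch()
        # One status line per poll, mirroring what `watch` would show.
        print(f"{op['status']:>12}  {op['progress']:3d}%  "
              f"{len(op.get('steps_completed', []))} steps done")
        if op["status"] in TERMINAL:
            return op
        time.sleep(poll_seconds)
    raise TimeoutError("operation did not reach a terminal status")
```

A plausible fetch, assuming the OSC -f json formatter: lambda: json.loads(subprocess.check_output(["openstack", "dr", "operation", "show", op_id, "-f", "json"])).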
Showing full operation details
openstack dr operation show <operation-id>
This returns all fields including steps_completed, steps_failed, error_message, and result_data. The steps_completed and steps_failed arrays are the primary diagnostic tool — they tell you exactly which steps finished before any failure occurred.
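A minimal helper showing how the two arrays combine into a one-line diagnosis (the function name and output phrasing are illustrative, not part of the client):

```python
def summarize_steps(steps_completed, steps_failed, error_message=None):
    """Condense the step arrays into a one-line diagnosis."""
    if steps_failed:
        # The last completed step tells you how far the operation got
        # before the first failing step.
        last_ok = steps_completed[-1] if steps_completed else "(none)"
        msg = f"failed at '{steps_failed[0]}' after '{last_ok}'"
        if error_message:
            msg += f": {error_message}"
        return msg
    return f"{len(steps_completed)} step(s) completed, no failures"
```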
Retrieving the operation ID after triggering an action
Every DR action command (failover, failback, test-failover) returns an operation record inline:
openstack protector protection-group failover prod-web-app \
--network-mapping net-primary-web=net-secondary-web
The output includes operation_id. Copy this value from the command output so you do not have to hunt for it in the list afterwards.
Example 1 — List all operations for a protection group
List operations scoped to a specific protection group to avoid noise from other groups:
openstack dr operation list --protection-group prod-web-app
Expected output:
+--------------------------------------+---------------+-----------+----------+--------------------------------------+----------------------+----------------------+
| id | operation_type| status | progress | protection_group_id | started_at | completed_at |
+--------------------------------------+---------------+-----------+----------+--------------------------------------+----------------------+----------------------+
| op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a | failover | completed | 100 | pg-12345678-1234-1234-1234-123456789 | 2025-03-12T09:14:02Z | 2025-03-12T09:21:47Z |
| op-1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b | test_failover | failed | 45 | pg-12345678-1234-1234-1234-123456789 | 2025-03-10T14:30:00Z | 2025-03-10T14:33:12Z |
+--------------------------------------+---------------+-----------+----------+--------------------------------------+----------------------+----------------------+
Example 2 — Show a running failover operation
Inspect a failover that is currently in the instance recreation phase:
openstack dr operation show op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a
Expected output (mid-operation, storage phase complete):
+-------------------+-----------------------------------------------------------------------+
| Field | Value |
+-------------------+-----------------------------------------------------------------------+
| id | op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a |
| operation_type | failover |
| status | running |
| progress | 65 |
| protection_group | pg-12345678-1234-1234-1234-123456789 (prod-web-app) |
| source_site | site-a |
| target_site | site-b |
| started_at | 2025-03-12T09:14:02Z |
| completed_at | None |
| error_message | None |
| steps_completed | ["validate_sites", "create_operation_record", |
| | "get_latest_snapshot", "promote_volumes", |
| | "manage_volumes_cinder", "update_volume_records"] |
| steps_failed | [] |
| result_data | None |
+-------------------+-----------------------------------------------------------------------+
The progress value of 65 and the completed steps confirm the operation has finished the storage failover phase (every step up through manage_volumes_cinder and update_volume_records) and is now working through instance recreation.
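The current phase can be inferred mechanically from steps_completed. The step ordering below is taken from the examples on this page and should be treated as illustrative, not as a guaranteed contract of the engine:

```python
# Step ordering observed in the examples on this page (illustrative only).
STORAGE_STEPS = ["validate_sites", "create_operation_record",
                 "get_latest_snapshot", "promote_volumes",
                 "manage_volumes_cinder", "update_volume_records"]
COMPUTE_STEPS = ["recreate_instances", "attach_volumes",
                 "update_protection_group", "finalize"]

def current_phase(steps_completed):
    """Map the steps_completed array to a coarse phase label."""
    done = set(steps_completed)
    if all(s in done for s in COMPUTE_STEPS):
        return "done"
    if all(s in done for s in STORAGE_STEPS):
        return "instance recreation"
    return "storage failover"
```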
Example 3 — Show a completed failover
Once the operation reaches status: completed, result_data is populated:
openstack dr operation show op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a
Expected output:
+-------------------+-----------------------------------------------------------------------+
| Field | Value |
+-------------------+-----------------------------------------------------------------------+
| id | op-9f3a1b2c-4d5e-6f7a-8b9c-0d1e2f3a |
| operation_type | failover |
| status | completed |
| progress | 100 |
| source_site | site-a |
| target_site | site-b |
| started_at | 2025-03-12T09:14:02Z |
| completed_at | 2025-03-12T09:21:47Z |
| error_message | None |
| steps_completed | ["validate_sites", "create_operation_record", |
| | "get_latest_snapshot", "promote_volumes", |
| | "manage_volumes_cinder", "update_volume_records", |
| | "recreate_instances", "attach_volumes", |
| | "update_protection_group", "finalize"] |
| steps_failed | [] |
| result_data | {"instances_created": {"web-server-1": "<site-b-instance-uuid>", |
| | "db-server": "<site-b-instance-uuid>"}, |
| | "volumes_managed": 4, "failover_count": 1} |
+-------------------+-----------------------------------------------------------------------+
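With both timestamps populated, operation duration is a simple subtraction. A small sketch that parses the ISO-8601 values shown in the records above:

```python
from datetime import datetime

def operation_duration(started_at, completed_at):
    """Compute wall-clock duration from the record's ISO-8601 timestamps.

    Returns None while the operation is still running (completed_at null).
    """
    if completed_at in (None, "None"):
        return None
    # fromisoformat does not accept a trailing 'Z', so map it to +00:00.
    parse = lambda s: datetime.fromisoformat(s.replace("Z", "+00:00"))
    return parse(completed_at) - parse(started_at)
```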
Example 4 — Show a failed operation
When a step fails, steps_failed is non-empty and error_message explains the failure:
openstack dr operation show op-1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b
Expected output:
+-------------------+-----------------------------------------------------------------------+
| Field | Value |
+-------------------+-----------------------------------------------------------------------+
| id | op-1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b |
| operation_type | test_failover |
| status | failed |
| progress | 45 |
| source_site | site-a |
| target_site | site-b |
| started_at | 2025-03-10T14:30:00Z |
| completed_at | 2025-03-10T14:33:12Z |
| error_message | Volume manage failed: Cinder volume service host 'pure@backend-b' |
| | not found. Verify volume_extension:services:index policy on Site B. |
| steps_completed | ["validate_sites", "create_operation_record", |
| | "get_latest_snapshot", "promote_volumes"] |
| steps_failed | ["manage_volumes_cinder"] |
| result_data | None |
+-------------------+-----------------------------------------------------------------------+
The combination of steps_failed: ["manage_volumes_cinder"] and the error_message immediately directs you to the Cinder policy configuration on Site B as the root cause.
Operation stuck at the same progress value for an extended period
Symptom: openstack dr operation show repeatedly returns the same progress integer and status: running for more than 10–15 minutes.
Likely cause: The protector-engine process on the active site has stalled or crashed mid-operation. The operation record is updated by the engine, so if the engine is down, progress stops.
Fix:
- Check the engine service on the site where the operation is running:
systemctl status protector-engine
journalctl -u protector-engine -n 100 --no-pager
- If the engine has crashed, restart it:
systemctl restart protector-engine
- The engine will detect the in-progress operation on startup and attempt to resume or roll it back depending on which step was interrupted.
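The 10–15 minute heuristic can be automated. Below is a sketch of a stall detector: fetch is an assumed callable returning the current record as a dict (for example, wrapping the show command), and the injectable sleep parameter exists only to make the helper testable:

```python
import time

def detect_stall(fetch, window_seconds=900, poll_seconds=30, sleep=time.sleep):
    """Return True if a running operation's progress is unchanged over the
    window (default 15 minutes), suggesting a stalled protector-engine.
    """
    baseline = fetch()
    if baseline["status"] != "running":
        return False  # only running operations can be "stuck"
    waited = 0
    while waited < window_seconds:
        sleep(poll_seconds)
        waited += poll_seconds
        op = fetch()
        if op["status"] != "running" or op["progress"] != baseline["progress"]:
            return False  # progressed, or reached a terminal state
    return True
```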
steps_failed contains manage_volumes_cinder with a policy error
Symptom: Operation fails at the manage_volumes_cinder step. error_message references a 403 or a "Policy does not allow" string.
Likely cause: The Cinder policy on the target site does not grant volume_extension:volume_manage to the member role used by the Protector service trust.
Fix: On the target site, add the required policy overrides to /etc/cinder/policy.yaml:
"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"
For Kolla-Ansible deployments, apply under /etc/kolla/config/cinder/policy.yaml and reconfigure:
kolla-ansible -i inventory reconfigure -t cinder
After the policy is applied, re-trigger the failed operation.
status transitions to rolling_back unexpectedly
Symptom: An operation that was running changes to rolling_back without user intervention.
Likely cause: A non-recoverable error occurred after at least one side-effect had already been applied (for example, volumes were promoted but instance creation failed). The engine automatically initiates rollback to avoid leaving resources in a partially-failed state.
Fix:
- Wait for rolling_back to complete — do not attempt to manually delete resources while rollback is in progress.
- Once the operation reaches failed (after rollback completes), read steps_failed and error_message to identify the root cause.
- Resolve the underlying issue, then re-trigger the operation.
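The lifecycle described in this section implies a small set of legal status transitions. The mapping below is a sketch inferred from this page (running can move to rolling_back, and rollback ends in failed), not an authoritative state machine from the engine source:

```python
# Transitions implied by the status descriptions on this page (a sketch).
ALLOWED_TRANSITIONS = {
    "pending": {"running"},
    "running": {"completed", "failed", "rolling_back"},
    "rolling_back": {"failed"},  # rollback always settles on failed
    "completed": set(),          # terminal
    "failed": set(),             # terminal
}

def is_valid_transition(old, new):
    """Check whether a status change matches the documented lifecycle."""
    return new in ALLOWED_TRANSITIONS.get(old, set())
```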
openstack dr operation list returns an empty table
Symptom: No operations appear even after triggering a failover.
Likely cause A: You are authenticated to the wrong site. Operations are site-local. If you triggered the failover on Site B, you must query Site B's API to see the record.
Fix: Switch your credentials to the correct site and re-run:
export OS_AUTH_URL=http://site-b:5000/v3
source ~/site-b-openrc
openstack dr operation list
Likely cause B: The operation_type or status filter is too restrictive.
Fix: Run without filters first:
openstack dr operation list
status shows failed but steps_failed is an empty array
Symptom: The operation is clearly failed (status: failed, progress stopped) but steps_failed contains [] and error_message is None.
Likely cause: The engine was killed (OOM, SIGKILL, host reboot) before it could write the failure metadata to the database. The API record was marked failed by a recovery sweep but the step details were never persisted.
Fix: Examine the engine logs from the time window between started_at and completed_at:
journalctl -u protector-engine \
--since "<started_at value>" \
--until "<completed_at value>" \
--no-pager | grep -E 'ERROR|CRITICAL|step'
The log will contain the step name and exception that caused the failure.
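If you need to post-process the logs rather than eyeball them, the same time-window filter can be sketched in Python. It assumes each line begins with an ISO-8601 timestamp like the record's started_at value; real journalctl output uses a different timestamp format and would need adjusted parsing:

```python
import re
from datetime import datetime

def filter_log_window(lines, started_at, completed_at,
                      pattern=r"ERROR|CRITICAL|step"):
    """Keep lines whose leading ISO-8601 timestamp falls between the
    operation's started_at and completed_at and that match the pattern.
    """
    parse = lambda s: datetime.fromisoformat(s.replace("Z", "+00:00"))
    lo, hi = parse(started_at), parse(completed_at)
    rx = re.compile(pattern)
    out = []
    for line in lines:
        m = re.match(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)", line)
        if m and lo <= parse(m.group(1)) <= hi and rx.search(line):
            out.append(line)
    return out
```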