Operations Panel
Monitoring active and completed DR operations, interpreting step-level detail
The Operations panel gives you a real-time and historical view of every DR operation that Trilio Site Recovery has executed or is currently executing on your behalf — failovers, failbacks, test failovers, test cleanups, and consistency group syncs. Each operation record exposes step-level progress, percentage completion, source and target site context, and any error detail that caused a failure, giving you the diagnostic depth to distinguish a transient storage glitch from a configuration problem. Use this panel to confirm that a planned failover completed cleanly, to investigate a stalled DR drill, or to build an audit trail of protection activity across your fleet of Protection Groups.
Before using the Operations panel, ensure the following are in place:
- Trilio Site Recovery deployed on both sites — protector-api and protector-engine must be running on the site you are querying.
- OSC CLI plugin installed — the protectorclient plugin must be installed in your Python environment: pip install python-protectorclient. Horizon dashboard users require access to the Trilio Site Recovery panel within your Horizon deployment.
- Valid OpenStack credentials — your clouds.yaml or environment variables must be configured for the site whose operation history you want to inspect. Operations are stored per-site; the site where the operation was initiated holds the authoritative record.
- At least one Protection Group created — operation records are scoped to a Protection Group. There are no operations to inspect until a Protection Group has been created and at least one DR action has been triggered.
- RBAC — your Keystone project role must satisfy the protector:operations:index and protector:operations:show policy rules (default: admin_or_owner). Operators with the member role within the owning project have read access to operations by default.
The Operations panel is part of the core Trilio Site Recovery service — no additional installation is required beyond the base protectorclient plugin and the running protector-api service.
Step 1 — Verify the CLI plugin is available
pip show python-protectorclient
Expected output includes a Version line. If the package is missing, install it:
pip install python-protectorclient
Step 2 — Confirm the API is reachable on the target site
curl -s http://<site-controller>:8788/
A version document is returned if the API is healthy. If you receive a connection error, check that protector-api is running:
systemctl status protector-api
Step 3 — Source credentials for the site you want to inspect
source ~/site-a-openrc
# or, using clouds.yaml:
export OS_CLOUD=site-a
Step 4 — Run a quick connectivity check
openstack protector operation list
An empty table (or a list of past operations) confirms the plugin and API are communicating correctly.
Operation records are created automatically by the protector-engine when a DR action is triggered — you do not configure the Operations panel itself. The behavior that affects what you see in the panel is controlled by the following settings in /etc/protector/protector.conf.
[DEFAULT] section
| Option | Default | Effect |
|---|---|---|
| debug | False | When True, the engine writes verbose step-level diagnostic output to the log. Operation records themselves are unchanged, but log correlation is richer. |
Operation record fields
Each operation stored in the dr_operations table (and returned by the API) exposes the following fields that are relevant to monitoring:
| Field | Type | Description |
|---|---|---|
| id | UUID | Unique identifier for this operation. Use this to retrieve step-level detail. |
| operation_type | enum | One of: failover, failback, test_failover, test_cleanup, sync_volumes. |
| status | enum | One of: pending, running, completed, failed, rolling_back. |
| progress | integer | Completion percentage, 0–100. Updated by the engine as phases complete. |
| source_site_id | UUID | The site from which the operation was initiated. |
| target_site_id | UUID | The site to which workloads are being moved or synced. |
| started_at | datetime | UTC timestamp when the engine began execution. |
| completed_at | datetime | UTC timestamp when the operation reached a terminal state (completed or failed). |
| error_message | text | Populated only on failed or rolling_back status. Contains the first fatal error encountered. |
| steps_completed | JSON array | Ordered list of step identifiers that finished successfully. |
| steps_failed | JSON array | List of step identifiers that produced an error. |
| result_data | JSON | Operation-type-specific output — for example, the IDs of instances created on the target site during a failover. |
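The fields above are easiest to consume as the JSON form of a record (for example, via the standard openstackclient formatter `-f json`). A minimal sketch, assuming only the field names in the table, that turns a record into a one-line status summary:

```python
def summarize_operation(record: dict) -> str:
    """One-line summary of an operation record.

    Field names follow the dr_operations schema above; the record is
    assumed to be the JSON form of the operation.
    """
    line = (f"{record['operation_type']} {record['id']}: "
            f"{record['status']} ({record['progress']}%)")
    # On failure, surface the failing step(s) and the fatal error.
    if record.get("status") in ("failed", "rolling_back"):
        line += f" - failed steps: {record.get('steps_failed', [])}"
        if record.get("error_message"):
            line += f" - {record['error_message']}"
    return line

# Example with a minimal (hypothetical) record:
op = {
    "id": "op-a1b2c3d4",
    "operation_type": "failover",
    "status": "failed",
    "progress": 38,
    "steps_failed": ["storage_failover"],
    "error_message": "Volume manage failed",
}
print(summarize_operation(op))
```

A summary like this is convenient when tailing many operations from a cron job or dashboard exporter.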
RBAC policy
The relevant policy rules in /etc/protector/policy.yaml are:
"protector:operations:index": "rule:default"
"protector:operations:show": "rule:default"
"protector:operations:action": "rule:default"
The default admin_or_owner rule means any project member can list and inspect operations scoped to their own project. Adjust these rules if your organisation requires stricter read access.
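If your organisation does require stricter read access, the rules use standard oslo.policy syntax. An illustrative fragment (verify the role names against your own deployment's role model before applying):

```yaml
# Illustrative only: restrict listing and inspecting operations
# to users holding the admin role.
"protector:operations:index": "role:admin"
"protector:operations:show": "role:admin"
```

Restart protector-api after editing policy.yaml so the new rules take effect.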
Listing operations
To see all DR operations across all Protection Groups in your project:
openstack protector operation list
The output is a table sorted by creation time, most recent first. The STATUS and PROGRESS columns give you an at-a-glance view of what is running and what has finished.
To scope the list to a specific Protection Group:
openstack protector operation list --protection-group <pg-id-or-name>
Inspecting a single operation
Once you have an operation ID, retrieve the full detail record:
openstack protector operation show <operation-id>
The detail view includes:
- status and progress — where the operation is right now.
- steps_completed — the sequence of phases that have finished. This is your primary signal for how far a long-running failover has progressed.
- steps_failed — if the operation entered failed or rolling_back, this array identifies which phase produced the fault.
- error_message — the human-readable description of the terminal error.
- result_data — on a completed failover or test_failover, this JSON object contains the Nova instance IDs that were created on the target site, which you can use to verify the recovered VMs.
Watching a live operation
For operations that take minutes (such as a full failover of a large Protection Group), poll the detail record at a regular interval:
watch -n 5 openstack protector operation show <operation-id>
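If you would rather script the wait than leave a `watch` session open, a small polling loop can stop automatically at a terminal state. A sketch, where `fetch_operation` is a placeholder for however you retrieve the record (for example, by shelling out to the show command with a JSON formatter):

```python
import time

# Per the terminal-states table, completed and failed are terminal;
# rolling_back is transient and eventually becomes failed.
TERMINAL_STATES = {"completed", "failed"}

def wait_for_operation(fetch_operation, interval=5, sleep=time.sleep):
    """Poll an operation until it reaches a terminal state.

    fetch_operation is a callable returning the current operation record
    as a dict (status and progress fields, as in the schema above).
    Returns the final record.
    """
    while True:
        record = fetch_operation()
        print(f"{record['status']}: {record['progress']}%")
        if record["status"] in TERMINAL_STATES:
            return record
        sleep(interval)

# Usage sketch with a stubbed fetcher standing in for the real API call:
states = iter([
    {"status": "running", "progress": 45},
    {"status": "completed", "progress": 100},
])
final = wait_for_operation(lambda: next(states), sleep=lambda s: None)
```

Injecting the sleep function keeps the loop testable and lets you adjust the polling interval without editing the logic.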
The progress field increments as each of the four phases completes:
| Progress range | Phase |
|---|---|
| 0–20 % | Preparation — validating reachability, retrieving latest snapshot |
| 20–60 % | Storage failover — promoting replicated volumes into Cinder on the target site |
| 60–90 % | Instance recreation — booting VMs on the target site with mapped networks and flavors |
| 90–100 % | Finalization — updating Protection Group state, writing completion record |
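The phase table can be turned into a lookup for log or alert messages. A sketch; since the published ranges share their endpoints, this assumes half-open intervals (a boundary value such as 20 is attributed to the later phase):

```python
# Phase boundaries from the table above, treated as half-open
# intervals [low, high); 101 is used so that progress == 100 matches.
PHASES = [
    (0, 20, "Preparation"),
    (20, 60, "Storage failover"),
    (60, 90, "Instance recreation"),
    (90, 101, "Finalization"),
]

def phase_for(progress: int) -> str:
    """Map a 0-100 progress value to the phase it falls in."""
    for low, high, name in PHASES:
        if low <= progress < high:
            return name
    raise ValueError(f"progress out of range: {progress}")
```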
Understanding terminal states
| Status | Meaning | Action required |
|---|---|---|
completed | All phases finished without error. | Verify application health on the target site. |
failed | A phase produced a fatal error and the engine stopped. | Inspect steps_failed and error_message, then remediate and retry. |
rolling_back | The engine encountered an error and is attempting to undo partial changes. | Wait for rollback to complete before retrying. Do not issue a second operation while rollback is in progress. |
Operations and Protection Group status
Operation state and Protection Group state are coupled. When an operation moves to completed, the Protection Group status field transitions accordingly — for example, from failing_over to failed_over. If an operation fails, the Protection Group may enter the error state. Always check the Protection Group status alongside the operation record to understand the full picture:
openstack protector protection-group show <pg-id-or-name>
Example 1 — List all operations in the current project
export OS_CLOUD=site-a
openstack protector operation list
Expected output:
+--------------------------------------+------------------+-----------------------+----------+--------------------+---------------------+
| ID | PROTECTION_GROUP | TYPE | STATUS | PROGRESS | STARTED_AT |
+--------------------------------------+------------------+-----------------------+----------+--------------------+---------------------+
| op-a1b2c3d4-... | prod-web-app | failover | completed | 100 | 2025-03-12T09:14:02 |
| op-e5f6a7b8-... | prod-web-app | test_failover | completed | 100 | 2025-03-10T14:30:00 |
| op-c9d0e1f2-... | batch-jobs-pg | sync_volumes | completed | 100 | 2025-03-10T08:00:00 |
+--------------------------------------+------------------+-----------------------+----------+--------------------+---------------------+
Example 2 — Inspect a completed failover operation
openstack protector operation show op-a1b2c3d4-...
Expected output:
+---------------------+-----------------------------------------------+
| Field | Value |
+---------------------+-----------------------------------------------+
| id | op-a1b2c3d4-1234-1234-1234-a1b2c3d4e5f6 |
| protection_group_id | pg-12345678-1234-1234-1234-12345678abcd |
| operation_type | failover |
| status | completed |
| progress | 100 |
| source_site_id | site-a-uuid |
| target_site_id | site-b-uuid |
| started_at | 2025-03-12T09:14:02Z |
| completed_at | 2025-03-12T09:21:47Z |
| error_message | None |
| steps_completed | ["prepare", "storage_failover", |
| | "instance_recreation", "finalization"] |
| steps_failed | [] |
| result_data | {"instances_created": [ |
| | {"name": "web-server-1", |
| | "id": "nova-uuid-site-b-ws1"}, |
| | {"name": "db-server", |
| | "id": "nova-uuid-site-b-db"}]} |
+---------------------+-----------------------------------------------+
The result_data.instances_created array gives you the Nova instance UUIDs on site-b that you can pass directly to openstack server show to confirm the VMs are ACTIVE.
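Pulling those UUIDs out programmatically is straightforward. A sketch that assumes only the result_data shape shown in the example above:

```python
import json

def instance_ids(result_data: dict) -> list:
    """Extract Nova instance UUIDs from a failover's result_data.

    Assumes the {"instances_created": [{"name": ..., "id": ...}, ...]}
    shape shown above; returns [] for operation types that do not
    create instances (sync_volumes, test_cleanup).
    """
    return [i["id"] for i in result_data.get("instances_created", [])]

# The result_data from the example failover above:
result = json.loads(
    '{"instances_created": ['
    '{"name": "web-server-1", "id": "nova-uuid-site-b-ws1"},'
    '{"name": "db-server", "id": "nova-uuid-site-b-db"}]}'
)
```

Each returned ID can then be fed to openstack server show on the target site to confirm the VM is ACTIVE.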
Example 3 — Monitor a failover in progress
watch -n 5 openstack protector operation show op-e5f6a7b8-...
Output at the storage failover phase:
| status | running |
| progress | 45 |
| steps_completed | ["prepare", "storage_failover"] |
| steps_failed | [] |
| error_message | None |
At 45 % progress, the prepare and storage_failover steps have finished. The engine is about to begin instance_recreation.
Example 4 — Inspect a failed operation
openstack protector operation show op-c9d0e1f2-...
Expected output:
| status | failed |
| progress | 38 |
| steps_completed | ["prepare"] |
| steps_failed | ["storage_failover"] |
| error_message | Volume manage failed: no replicated snapshot found for volume |
| | vol-abcdef on FlashArray B. Verify replication policy and RPO. |
| result_data | {} |
The operation stalled at 38 % — inside the storage_failover phase. The error_message identifies the exact volume and points to the replication policy as the area to investigate.
Example 5 — List operations for a specific Protection Group
openstack protector operation list --protection-group prod-web-app
This scopes the output to only operations associated with prod-web-app, which is useful when you manage many Protection Groups and want to review the DR history for a single workload.
Operation stuck in running state
Symptom: openstack protector operation show returns status: running and progress has not incremented for more than 10 minutes.
Likely cause: The protector-engine process has stalled or lost connectivity to a required service — the Cinder API on the target site, the Pure Storage FlashArray management IP, or the Nova API.
Fix:
- Check the engine log on the site where the operation was initiated:
journalctl -u protector-engine -f
# or:
tail -f /var/log/protector/protector-engine.log
- Look for connection timeout or authentication errors referencing Nova, Cinder, or the FlashArray URL.
- Verify the target site's API endpoints are reachable from the engine host.
- If the engine process has crashed, restart it:
systemctl restart protector-engine
The engine will reconcile in-progress operations on startup and either resume or transition them to failed.
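The symptom above, progress unchanged for more than 10 minutes, can be detected mechanically if you are already polling the operation. A minimal sketch; timestamps are unix seconds, and how you collect the (timestamp, progress) samples is up to you:

```python
def is_stalled(samples, threshold_seconds=600):
    """Decide whether an operation looks stalled.

    samples: list of (unix_timestamp, progress) pairs, oldest first,
    collected while polling. Returns True if progress has not changed
    for at least threshold_seconds, judged from the samples.
    """
    if len(samples) < 2:
        return False
    last_ts, last_progress = samples[-1]
    # Walk backwards to the earliest consecutive sample with the same
    # progress value as the newest one.
    since = last_ts
    for ts, progress in reversed(samples[:-1]):
        if progress != last_progress:
            break
        since = ts
    return (last_ts - since) >= threshold_seconds
```

A monitoring job could call this on each poll and raise an alert instead of relying on an operator noticing a frozen progress column.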
Operation shows failed with error_message: Volume manage failed
Symptom: A failover or test failover operation fails during the storage_failover step. The error_message mentions volume manage or missing snapshot.
Likely cause: The Cinder volume manage policy has not been applied on the target site, or the replicated snapshot does not exist on FlashArray B — either because replication has not yet run (RPO not yet met) or the replication policy is misconfigured.
Fix:
- Confirm the Cinder policy on the target site includes:
"volume_extension:volume_manage": "rule:admin_or_owner"
- Verify replication status on the FlashArray B management interface or via the protector-engine log — look for the Protection Group name specified in the replication policy (pure_pg_name).
- If replication is lagging, force a consistency group sync on the primary site and wait for the RPO interval before retrying:
openstack protector consistency-group sync <pg-name>
rolling_back state — what to do
Symptom: An operation transitions to rolling_back rather than completed or failed.
Likely cause: The engine encountered an error mid-operation (typically during instance_recreation) and is attempting to undo the partial changes made on the target site — removing any VMs that were created and unmanaging any volumes that were imported into Cinder.
Fix:
- Do not issue another failover or failback while rollback is in progress. Wait for the operation to reach failed.
- Once failed, inspect steps_failed and error_message to identify the root cause.
- Check the Protection Group status:
openstack protector protection-group show <pg-name>
If status is error, the rollback completed. You can now remediate the underlying issue and retry the operation.
result_data is empty after a completed failover
Symptom: The operation shows status: completed but result_data contains {} and you cannot find the recovered VMs on the target site.
Likely cause: This is expected for sync_volumes and test_cleanup operation types, which do not create instances. For failover or test_failover, an empty result_data may indicate the operation completed at the metadata level but instance creation produced no output — check the steps_completed array to confirm instance_recreation is listed.
Fix:
- If instance_recreation is not in steps_completed, the operation did not reach that phase despite the completed status — check the engine log for a silent error.
- Query Nova directly on the target site to check whether instances with the expected names exist:
export OS_CLOUD=site-b
openstack server list
Cannot list operations — 403 Forbidden
Symptom: openstack protector operation list returns a 403 error.
Likely cause: Your Keystone token does not satisfy the protector:operations:index policy rule on the site you are querying. The default rule requires your project role to satisfy admin_or_owner.
Fix:
- Verify you are sourcing credentials for the correct project:
openstack token issue
- Confirm your role in the project:
openstack role assignment list --user <your-user> --project <your-project>
- If your role is correct, check whether the policy.yaml on the target site has been customised to restrict operation visibility.