Operations Panel
Monitoring active and completed DR operations, interpreting step-level detail
The Operations panel gives you a real-time and historical view of every DR operation that Trilio Site Recovery has executed or is currently executing on your behalf — failovers, failbacks, test failovers, test cleanups, and consistency group syncs. Each operation record exposes step-level progress, percentage completion, source and target site context, and any error detail that caused a failure, giving you the diagnostic depth to distinguish a transient storage glitch from a configuration problem. Use this panel to confirm that a planned failover completed cleanly, to investigate a stalled DR drill, or to build an audit trail of protection activity across your fleet of Protection Groups.
Before using the Operations panel, ensure the following are in place:
- Trilio Site Recovery deployed on both sites — protector-api and protector-engine must be running on the site you are querying.
- OSC CLI plugin installed — the protectorclient plugin must be installed in your Python environment: pip install python-protectorclient. Horizon dashboard users require access to the Trilio Site Recovery panel within your Horizon deployment.
- Valid OpenStack credentials — your clouds.yaml or environment variables must be configured for the site whose operation history you want to inspect. Operations are stored per-site; the site where the operation was initiated holds the authoritative record.
- At least one Protection Group created — operation records are scoped to a Protection Group. There are no operations to inspect until a Protection Group has been created and at least one DR action has been triggered.
- RBAC — your Keystone project role must satisfy the protector:operations:index and protector:operations:show policy rules (default: admin_or_owner). Operators with the member role within the owning project have read access to operations by default.
The Operations panel is part of the core Trilio Site Recovery service — no additional installation is required beyond the base protectorclient plugin and the running protector-api service.
Step 1 — Verify the CLI plugin is available
pip show python-protectorclient
Expected output includes a Version line. If the package is missing, install it:
pip install python-protectorclient
Step 2 — Confirm the API is reachable on the target site
curl -s http://<site-controller>:8788/
A version document is returned if the API is healthy. If you receive a connection error, check that protector-api is running:
systemctl status protector-api
Step 3 — Source credentials for the site you want to inspect
source ~/site-a-openrc
# or, using clouds.yaml:
export OS_CLOUD=site-a
Step 4 — Run a quick connectivity check
openstack protector operation list
An empty table (or a list of past operations) confirms the plugin and API are communicating correctly.
Operation records are created automatically by the protector-engine when a DR action is triggered — you do not configure the Operations panel itself. The behavior that affects what you see in the panel is controlled by the following settings in /etc/protector/protector.conf.
[DEFAULT] section
| Option | Default | Effect |
|---|---|---|
| debug | False | When True, the engine writes verbose step-level diagnostic output to the log. Operation records themselves are unchanged, but log correlation is richer. |
Operation record fields
Each operation stored in the dr_operations table (and returned by the API) exposes the following fields that are relevant to monitoring:
| Field | Type | Description |
|---|---|---|
| id | UUID | Unique identifier for this operation. Use this to retrieve step-level detail. |
| operation_type | enum | One of: failover, failback, test_failover, test_cleanup, sync_volumes. |
| status | enum | One of: pending, running, completed, failed, rolling_back. |
| progress | integer | Completion percentage, 0–100. Updated by the engine as phases complete. |
| source_site_id | UUID | The site from which the operation was initiated. |
| target_site_id | UUID | The site to which workloads are being moved or synced. |
| started_at | datetime | UTC timestamp when the engine began execution. |
| completed_at | datetime | UTC timestamp when the operation reached a terminal state (completed or failed). |
| error_message | text | Populated only on failed or rolling_back status. Contains the first fatal error encountered. |
| steps_completed | JSON array | Ordered list of step identifiers that finished successfully. |
| steps_failed | JSON array | List of step identifiers that produced an error. |
| result_data | JSON | Operation-type-specific output — for example, the IDs of instances created on the target site during a failover. |
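The fields above are easiest to consume as the JSON form of a record (for example, via the standard openstackclient formatter `-f json`). A minimal sketch, assuming only the field names in the table, that turns a record into a one-line status summary:

```python
def summarize_operation(record: dict) -> str:
    """One-line summary of an operation record.

    Field names follow the dr_operations schema above; the record is
    assumed to be the JSON form of the operation.
    """
    line = (f"{record['operation_type']} {record['id']}: "
            f"{record['status']} ({record['progress']}%)")
    # On failure, surface the failing step(s) and the fatal error.
    if record.get("status") in ("failed", "rolling_back"):
        line += f" - failed steps: {record.get('steps_failed', [])}"
        if record.get("error_message"):
            line += f" - {record['error_message']}"
    return line

# Example with a minimal (hypothetical) record:
op = {
    "id": "op-a1b2c3d4",
    "operation_type": "failover",
    "status": "failed",
    "progress": 38,
    "steps_failed": ["storage_failover"],
    "error_message": "Volume manage failed",
}
print(summarize_operation(op))
```

A summary like this is convenient when tailing many operations from a cron job or dashboard exporter.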
RBAC policy
The relevant policy rules in /etc/protector/policy.yaml are:
"protector:operations:index": "rule:default"
"protector:operations:show": "rule:default"
"protector:operations:action": "rule:default"
The default admin_or_owner rule means any project member can list and inspect operations scoped to their own project. Adjust these rules if your organisation requires stricter read access.
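If your organisation does require stricter read access, the rules use standard oslo.policy syntax. An illustrative fragment (verify the role names against your own deployment's role model before applying):

```yaml
# Illustrative only: restrict listing and inspecting operations
# to users holding the admin role.
"protector:operations:index": "role:admin"
"protector:operations:show": "role:admin"
```

Restart protector-api after editing policy.yaml so the new rules take effect.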
Listing operations
To see all DR operations across all Protection Groups in your project:
openstack protector operation list
The output is a table sorted by creation time, most recent first. The STATUS and PROGRESS columns give you an at-a-glance view of what is running and what has finished.
To scope the list to a specific Protection Group:
openstack protector operation list --protection-group <pg-id-or-name>
Inspecting a single operation
Once you have an operation ID, retrieve the full detail record:
openstack protector operation show <operation-id>
The detail view includes:
- status and progress — where the operation is right now.
- steps_completed — the sequence of phases that have finished. This is your primary signal for how far a long-running failover has progressed.
- steps_failed — if the operation entered failed or rolling_back, this array identifies which phase produced the fault.
- error_message — the human-readable description of the terminal error.
- result_data — on a completed failover or test_failover, this JSON object contains the Nova instance IDs that were created on the target site, which you can use to verify the recovered VMs.
Watching a live operation
For operations that take minutes (such as a full failover of a large Protection Group), poll the detail record at a regular interval:
watch -n 5 openstack protector operation show <operation-id>
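If you would rather script the wait than leave a `watch` session open, a small polling loop can stop automatically at a terminal state. A sketch, where `fetch_operation` is a placeholder for however you retrieve the record (for example, by shelling out to the show command with a JSON formatter):

```python
import time

# Per the terminal-states table, completed and failed are terminal;
# rolling_back is transient and eventually becomes failed.
TERMINAL_STATES = {"completed", "failed"}

def wait_for_operation(fetch_operation, interval=5, sleep=time.sleep):
    """Poll an operation until it reaches a terminal state.

    fetch_operation is a callable returning the current operation record
    as a dict (status and progress fields, as in the schema above).
    Returns the final record.
    """
    while True:
        record = fetch_operation()
        print(f"{record['status']}: {record['progress']}%")
        if record["status"] in TERMINAL_STATES:
            return record
        sleep(interval)

# Usage sketch with a stubbed fetcher standing in for the real API call:
states = iter([
    {"status": "running", "progress": 45},
    {"status": "completed", "progress": 100},
])
final = wait_for_operation(lambda: next(states), sleep=lambda s: None)
```

Injecting the sleep function keeps the loop testable and lets you adjust the polling interval without editing the logic.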
The progress field increments as each of the four phases completes:
| Progress range | Phase |
|---|---|
| 0–20 % | Preparation — validating reachability, retrieving latest snapshot |
| 20–60 % | Storage failover — promoting replicated volumes into Cinder on the target site |
| 60–90 % | Instance recreation — booting VMs on the target site with mapped networks and flavors |
| 90–100 % | Finalization — updating Protection Group state, writing completion record |
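The phase table can be turned into a lookup for log or alert messages. A sketch; since the published ranges share their endpoints, this assumes half-open intervals (a boundary value such as 20 is attributed to the later phase):

```python
# Phase boundaries from the table above, treated as half-open
# intervals [low, high); 101 is used so that progress == 100 matches.
PHASES = [
    (0, 20, "Preparation"),
    (20, 60, "Storage failover"),
    (60, 90, "Instance recreation"),
    (90, 101, "Finalization"),
]

def phase_for(progress: int) -> str:
    """Map a 0-100 progress value to the phase it falls in."""
    for low, high, name in PHASES:
        if low <= progress < high:
            return name
    raise ValueError(f"progress out of range: {progress}")
```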
Understanding terminal states
| Status | Meaning | Action required |
|---|---|---|
completed | All phases finished without error. | Verify application health on the target site. |
failed | A phase produced a fatal error and the engine stopped. | Inspect steps_failed and error_message, then remediate and retry. |
rolling_back | The engine encountered an error and is attempting to undo partial changes. | Wait for rollback to complete before retrying. Do not issue a second operation while rollback is in progress. |
Operations and Protection Group status
Operation state and Protection Group state are coupled. When an operation moves to completed, the Protection Group status field transitions accordingly — for example, from failing_over to failed_over. If an operation fails, the Protection Group may enter the error state. Always check the Protection Group status alongside the operation record to understand the full picture:
openstack protector protection-group show <pg-id-or-name>
Example 1 — List all operations in the current project
export OS_CLOUD=site-a
openstack protector operation list
Expected output:
+--------------------------------------+------------------+-----------------------+----------+--------------------+---------------------+
| ID | PROTECTION_GROUP | TYPE | STATUS | PROGRESS | STARTED_AT |
+--------------------------------------+------------------+-----------------------+----------+--------------------+---------------------+
| op-a1b2c3d4-... | prod-web-app | failover | completed | 100 | 2025-03-12T09:14:02 |
| op-e5f6a7b8-... | prod-web-app | test_failover | completed | 100 | 2025-03-10T14:30:00 |
| op-c9d0e1f2-... | batch-jobs-pg | sync_volumes | completed | 100 | 2025-03-10T08:00:00 |
+--------------------------------------+------------------+-----------------------+----------+--------------------+---------------------+
Example 2 — Inspect a completed failover operation
openstack protector operation show op-a1b2c3d4-...
Expected output:
+---------------------+-----------------------------------------------+
| Field | Value |
+---------------------+-----------------------------------------------+
| id | op-a1b2c3d4-1234-1234-1234-a1b2c3d4e5f6 |
| protection_group_id | pg-12345678-1234-1234-1234-12345678abcd |
| operation_type | failover |
| status | completed |
| progress | 100 |
| source_site_id | site-a-uuid |
| target_site_id | site-b-uuid |
| started_at | 2025-03-12T09:14:02Z |
| completed_at | 2025-03-12T09:21:47Z |
| error_message | None |
| steps_completed | ["prepare", "storage_failover", |
| | "instance_recreation", "finalization"] |
| steps_failed | [] |
| result_data | {"instances_created": [ |
| | {"name": "web-server-1", |
| | "id": "nova-uuid-site-b-ws1"}, |
| | {"name": "db-server", |
| | "id": "nova-uuid-site-b-db"}]} |
+---------------------+-----------------------------------------------+
The result_data.instances_created array gives you the Nova instance UUIDs on site-b that you can pass directly to openstack server show to confirm the VMs are ACTIVE.
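Pulling those UUIDs out programmatically is straightforward. A sketch that assumes only the result_data shape shown in the example above:

```python
import json

def instance_ids(result_data: dict) -> list:
    """Extract Nova instance UUIDs from a failover's result_data.

    Assumes the {"instances_created": [{"name": ..., "id": ...}, ...]}
    shape shown above; returns [] for operation types that do not
    create instances (sync_volumes, test_cleanup).
    """
    return [i["id"] for i in result_data.get("instances_created", [])]

# The result_data from the example failover above:
result = json.loads(
    '{"instances_created": ['
    '{"name": "web-server-1", "id": "nova-uuid-site-b-ws1"},'
    '{"name": "db-server", "id": "nova-uuid-site-b-db"}]}'
)
```

Each returned ID can then be fed to openstack server show on the target site to confirm the VM is ACTIVE.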
Example 3 — Monitor a failover in progress
watch -n 5 openstack protector operation show op-e5f6a7b8-...
Output at the storage failover phase:
| status | running |
| progress | 45 |
| steps_completed | ["prepare", "storage_failover"] |
| steps_failed | [] |
| error_message | None |
At 45 % progress, the prepare and storage_failover steps have finished. The engine is about to begin instance_recreation.
Example 4 — Inspect a failed operation
openstack protector operation show op-c9d0e1f2-...
Expected output:
| status | failed |
| progress | 38 |
| steps_completed | ["prepare"] |
| steps_failed | ["storage_failover"] |
| error_message | Volume manage failed: no replicated snapshot found for volume |
| | vol-abcdef on FlashArray B. Verify replication policy and RPO. |
| result_data | {} |
The operation stalled at 38 % — inside the storage_failover phase. The error_message identifies the exact volume and points to the replication policy as the area to investigate.
Example 5 — List operations for a specific Protection Group
openstack protector operation list --protection-group prod-web-app
This scopes the output to only operations associated with prod-web-app, which is useful when you manage many Protection Groups and want to review the DR history for a single workload.
Operation stuck in running state
Symptom: openstack protector operation show returns status: running and progress has not incremented for more than 10 minutes.
Likely cause: The protector-engine process has stalled or lost connectivity to a required service — the Cinder API on the target site, the Pure Storage FlashArray management IP, or the Nova API.
Fix:
- Check the engine log on the site where the operation was initiated:
journalctl -u protector-engine -f
# or:
tail -f /var/log/protector/protector-engine.log
- Look for connection timeout or authentication errors referencing Nova, Cinder, or the FlashArray URL.
- Verify the target site's API endpoints are reachable from the engine host.
- If the engine process has crashed, restart it:
systemctl restart protector-engine
The engine will reconcile in-progress operations on startup and either resume or transition them to failed.
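The symptom above, progress unchanged for more than 10 minutes, can be detected mechanically if you are already polling the operation. A minimal sketch; timestamps are unix seconds, and how you collect the (timestamp, progress) samples is up to you:

```python
def is_stalled(samples, threshold_seconds=600):
    """Decide whether an operation looks stalled.

    samples: list of (unix_timestamp, progress) pairs, oldest first,
    collected while polling. Returns True if progress has not changed
    for at least threshold_seconds, judged from the samples.
    """
    if len(samples) < 2:
        return False
    last_ts, last_progress = samples[-1]
    # Walk backwards to the earliest consecutive sample with the same
    # progress value as the newest one.
    since = last_ts
    for ts, progress in reversed(samples[:-1]):
        if progress != last_progress:
            break
        since = ts
    return (last_ts - since) >= threshold_seconds
```

A monitoring job could call this on each poll and raise an alert instead of relying on an operator noticing a frozen progress column.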
Operation shows failed with error_message: Volume manage failed
Symptom: A failover or test failover operation fails during the storage_failover step. The error_message mentions volume manage or missing snapshot.
Likely cause: The Cinder volume manage policy has not been applied on the target site, or the replicated snapshot does not exist on FlashArray B — either because replication has not yet run (RPO not yet met) or the replication policy is misconfigured.
Fix:
- Confirm the Cinder policy on the target site includes:
"volume_extension:volume_manage": "rule:admin_or_owner"
- Verify replication status on the FlashArray B management interface or via the protector-engine log — look for the Protection Group name specified in the replication policy (pure_pg_name).
- If replication is lagging, force a consistency group sync on the primary site and wait for the RPO interval before retrying:
openstack protector consistency-group sync <pg-name>
rolling_back state — what to do
Symptom: An operation transitions to rolling_back rather than completed or failed.
Likely cause: The engine encountered an error mid-operation (typically during instance_recreation) and is attempting to undo the partial changes made on the target site — removing any VMs that were created and unmanaging any volumes that were imported into Cinder.
Fix:
- Do not issue another failover or failback while rollback is in progress. Wait for the operation to reach failed.
- Once failed, inspect steps_failed and error_message to identify the root cause.
- Check the Protection Group status:
openstack protector protection-group show <pg-name>
If status is error, the rollback completed. You can now remediate the underlying issue and retry the operation.
result_data is empty after a completed failover
Symptom: The operation shows status: completed but result_data contains {} and you cannot find the recovered VMs on the target site.
Likely cause: This is expected for sync_volumes and test_cleanup operation types, which do not create instances. For failover or test_failover, an empty result_data may indicate the operation completed at the metadata level but instance creation produced no output — check the steps_completed array to confirm instance_recreation is listed.
Fix:
- If instance_recreation is not in steps_completed, the operation did not reach that phase despite the completed status — check the engine log for a silent error.
- Query Nova directly on the target site to check whether instances with the expected names exist:
export OS_CLOUD=site-b
openstack server list
Cannot list operations — 403 Forbidden
Symptom: openstack protector operation list returns a 403 error.
Likely cause: Your Keystone token does not satisfy the protector:operations:index policy rule on the site you are querying. The default rule requires your project role to satisfy admin_or_owner.
Fix:
- Verify you are sourcing credentials for the correct project:
openstack token issue
- Confirm your role in the project:
openstack role assignment list --user <your-user> --project <your-project>
- If your role is correct, check whether the policy.yaml on the target site has been customised to restrict operation visibility.