Site Recovery for OpenStack
Guide

Health Panel

Site reachability, replication lag, RPO compliance, failover readiness


Overview

The Health panel gives you a unified, real-time view of your DR estate's operational status across both OpenStack sites. It surfaces site reachability, per-Protection-Group replication lag, RPO compliance, and failover readiness so you can identify problems before a real disaster forces your hand. Because Trilio Site Recovery blocks Protection Group modifications when the peer site is unreachable, the Health panel is your first stop when something feels wrong — it tells you exactly which sites, groups, or volumes are degraded and why. Use it as your daily operational dashboard and as the pre-flight checklist before executing any planned failover or DR drill.


Prerequisites

Before using the Health panel, ensure the following conditions are met:

  • Both OpenStack sites are registered — protector-api and protector-engine must be running independently on each site, and both sites must be registered in the Protector database with valid auth_url and service credentials.
  • protectorclient OSC plugin is installed — this is the CLI coordination layer that authenticates to both sites and collects health data. It must be configured with a clouds.yaml entry for each site.
  • At least one Protection Group exists — Health panel metrics are scoped to Protection Groups. Groups must have a replication policy attached (with rpo_minutes set) for RPO compliance checks to function.
  • Network reachability between CLI host and both sites — the protectorclient must be able to reach the Keystone, Nova, Cinder, and Protector API endpoints on both sites (ports 5000, 8774, 8776, and 8788 respectively).
  • Cinder volume types with replication_enabled='<is> True' — only volumes using eligible replication-enabled volume types contribute replication lag and RPO metrics. Volumes on non-replicated types are excluded from health calculations.
  • OpenStack Victoria or later on both sites.
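
Whether a volume counts toward Health panel metrics depends on its volume type's extra specs. A minimal sketch of that eligibility check, assuming the properties string is in the key='value' format that `openstack volume type show <type> -f value -c properties` prints:

```shell
# Sketch: decide whether a Cinder volume type contributes to Health panel
# lag/RPO metrics, i.e. whether its extra specs carry
# replication_enabled='<is> True'. The filter itself is a plain grep.
is_replicated_type() {
  # $1: the volume type's properties string
  printf '%s' "$1" | grep -q "replication_enabled='<is> True'"
}

# Assumed invocation (volume type name is illustrative):
#   specs=$(openstack volume type show repl-pure -f value -c properties)
#   is_replicated_type "$specs" && echo "contributes to lag/RPO metrics"
```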

Installation

The Health panel is part of the protectorclient OSC plugin and the Horizon dashboard integration. No separate installation is required beyond the standard Trilio Site Recovery deployment. If you have not yet installed the plugin, follow the steps below.

Step 1 — Install the protectorclient OSC plugin

pip install python-protectorclient

Step 2 — Verify the plugin is registered

openstack --os-cloud site-a protector --help

Expected output includes health, protection-group, operation, and site command groups.

Step 3 — Configure clouds.yaml for both sites

Create or update ~/.config/openstack/clouds.yaml so that both sites are addressable by name:

clouds:
  site-a:
    auth:
      auth_url: http://site-a-controller:5000/v3
      project_name: admin
      username: admin
      password: SITE_A_PASSWORD
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne

  site-b:
    auth:
      auth_url: http://site-b-controller:5000/v3
      project_name: admin
      username: admin
      password: SITE_B_PASSWORD
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne

Step 4 — Validate connectivity to both sites

openstack --os-cloud site-a token issue
openstack --os-cloud site-b token issue

Both commands must return a valid token. If either fails, resolve authentication errors before proceeding — the Health panel requires authenticated access to both sites simultaneously.

Step 5 — (Optional) Access via Horizon

If your deployment includes the Horizon dashboard plugin, navigate to Project → Disaster Recovery → Health after logging in. No additional installation is required; the dashboard panel reads from the same Protector API endpoints.


Configuration

The Health panel's behaviour is controlled by settings in the replication policy attached to each Protection Group, and by site-level status fields maintained by the Protector engine. The following parameters directly affect what the Health panel reports.

Replication policy fields

| Field | Default | Valid values | Effect |
| --- | --- | --- | --- |
| rpo_minutes | None (unset) | Positive integer, minutes | Sets the Recovery Point Objective threshold. If replication lag exceeds this value, the Protection Group is flagged as RPO Violated in the Health panel. If unset, no RPO compliance check is performed for that group. |
| replication_interval | None (unset) | Positive integer, seconds | For async replication, this is how often Pure Storage snapshots are transferred to the secondary array. A value larger than rpo_minutes × 60 will guarantee RPO violations under normal operation — set this lower than your RPO target. |
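
The interval/RPO relationship can be sanity-checked with plain shell arithmetic before attaching a policy; the values below are illustrative (a 15-minute RPO with a 10-minute transfer interval):

```shell
# Sketch: confirm replication_interval fits inside the RPO window.
rpo_minutes=15
replication_interval=600   # seconds between async snapshot transfers

rpo_seconds=$((rpo_minutes * 60))
if [ "$replication_interval" -ge "$rpo_seconds" ]; then
  echo "WARNING: interval >= RPO window - violations guaranteed"
else
  echo "OK: $((rpo_seconds - replication_interval))s of headroom per cycle"
fi
```

With these values the check passes with 300 seconds of headroom per replication cycle.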

Site status values

The Health panel reads the status field from the sites table. Protector updates this automatically during connectivity probes.

| Value | Meaning | Health panel indicator |
| --- | --- | --- |
| active | Site is reachable and responding | Green / Reachable |
| unreachable | Connectivity probe failed | Red / Unreachable — all Protection Group operations blocked |
| error | Site reached but in an error state | Amber / Degraded |

Protection Group status values

The Health panel uses the Protection Group status field to determine failover readiness.

| Value | Failover-ready? | Notes |
| --- | --- | --- |
| active | Yes | Normal operation |
| failed_over | No (already failed over) | Failback may be available |
| failing_over | No | Operation in progress |
| failing_back | No | Operation in progress |
| error | No | Investigate before attempting failover |

Metadata sync strictness

Metadata sync between sites is strict by design: any attempt to modify a Protection Group is blocked when the peer site is unreachable. This cannot be configured — it is a hard constraint that prevents split-brain divergence. The Health panel will display a Sync Blocked warning on affected groups when the peer is down.


Usage

You interact with the Health panel through two interfaces: the openstack protector health CLI commands and the Horizon dashboard panel. Both surfaces read from the same underlying Protector API endpoints.

Check overall site reachability

Run this first whenever you suspect connectivity problems or before scheduling a planned failover:

openstack --os-cloud site-a protector site list

This returns the registered sites and their current status. A site showing unreachable means the Protector engine on that site cannot be contacted, and all Protection Group writes targeting that site are blocked until connectivity is restored.

Check the health of all Protection Groups

openstack --os-cloud site-a protector health list

The output summarises every Protection Group visible to the authenticated tenant, including:

  • Current status (active, failed_over, error, etc.)
  • Replication type (sync or async)
  • Current primary site
  • RPO target and whether it is currently being met
  • Failover readiness (ready, not_ready, blocked)
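
A quick triage filter over that summary can surface only the groups that are not failover-ready. This sketch reads "name readiness" pairs on stdin so the filter itself needs no API access; the invocation shown assumes the plugin follows the usual OSC `-f value -c` column-selection convention, and the column names are illustrative:

```shell
# Sketch: print only Protection Groups whose readiness is not "ready".
check_readiness() {
  while read -r name readiness; do
    [ "$readiness" = "ready" ] || echo "NOT READY: $name ($readiness)"
  done
}

# Assumed invocation (column names may differ in your plugin version):
#   openstack --os-cloud site-a protector health list -f value \
#     -c Name -c Failover | check_readiness
```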

Inspect a single Protection Group in detail

openstack --os-cloud site-a protector health show <protection-group-id>

This surfaces per-volume replication status, the timestamp of the last successful replication, calculated lag against the RPO target, and any error messages from the Protector engine.

Validate replication readiness before a DR drill

Before executing a test failover, use the site validation endpoint to confirm both sites are healthy and the storage layer is in a consistent state:

openstack --os-cloud site-a protector site validate <site-id>

Run this for both sites. The command checks Keystone reachability, Nova and Cinder endpoint health, and whether each Protection Group's Consistency Group is in active status on the primary Cinder backend.

Monitor an in-progress DR operation

Once a failover or failback is underway, track progress through the operations interface:

openstack --os-cloud site-a protector operation list
openstack --os-cloud site-a protector operation show <operation-id>

The progress field returns a value from 0 to 100. The steps_completed and steps_failed arrays give you step-level granularity so you can identify exactly where a long-running operation is stalled.
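
For unattended tracking, the show command can be wrapped in a polling loop. A minimal sketch, assuming any status other than running (completed, error, and so on) is terminal; the polling interval is arbitrary:

```shell
# Sketch: decide when to stop polling a DR operation.
poll_done() {
  # $1: the operation's status field
  case "$1" in
    running) return 1 ;;   # still in progress - keep polling
    *)       return 0 ;;   # terminal state - stop
  esac
}

# Assumed polling loop:
#   until poll_done "$(openstack --os-cloud site-a protector operation show \
#       <operation-id> -f value -c status)"; do
#     sleep 30
#   done
```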


Examples

Example 1 — List all sites and confirm reachability

openstack --os-cloud site-a protector site list

Expected output:

+--------------------------------------+---------------------+-----------+--------------------------------+
| ID                                   | Name                | Status    | Auth URL                       |
+--------------------------------------+---------------------+-----------+--------------------------------+
| a1b2c3d4-0000-0000-0000-111111111111 | site-a              | active    | http://site-a-controller:5000  |
| e5f6a7b8-0000-0000-0000-222222222222 | site-b              | active    | http://site-b-controller:5000  |
+--------------------------------------+---------------------+-----------+--------------------------------+

Both sites showing active means the Protector engine on each site is reachable and Protection Group modifications are permitted.


Example 2 — Health summary across all Protection Groups

openstack --os-cloud site-a protector health list

Expected output:

+--------------------------------------+--------------+------------------+----------+-------------+-----------+-----------+
| PG ID                                | Name         | Current Primary  | Repl Type| RPO Target  | RPO Status| Failover  |
+--------------------------------------+--------------+------------------+----------+-------------+-----------+-----------+
| f1e2d3c4-aaaa-bbbb-cccc-000000000001 | prod-app-pg  | site-a           | async    | 15 min      | OK        | ready     |
| f1e2d3c4-aaaa-bbbb-cccc-000000000002 | analytics-pg | site-a           | async    | 30 min      | VIOLATED  | not_ready |
| f1e2d3c4-aaaa-bbbb-cccc-000000000003 | db-cluster   | site-a           | sync     | 0 min       | OK        | ready     |
+--------------------------------------+--------------+------------------+----------+-------------+-----------+-----------+

The analytics-pg group shows VIOLATED RPO status, indicating that the last successful replication snapshot is older than the configured 30-minute RPO target. The group is marked not_ready for failover until replication catches up or you explicitly choose to accept data loss with --force.


Example 3 — Detailed health for a single Protection Group

openstack --os-cloud site-a protector health show f1e2d3c4-aaaa-bbbb-cccc-000000000001

Expected output:

+-----------------------------+----------------------------------------------+
| Field                       | Value                                        |
+-----------------------------+----------------------------------------------+
| id                          | f1e2d3c4-aaaa-bbbb-cccc-000000000001         |
| name                        | prod-app-pg                                  |
| status                      | active                                       |
| replication_type            | async                                        |
| current_primary_site        | site-a                                       |
| secondary_site              | site-b                                       |
| rpo_minutes                 | 15                                           |
| last_replication_at         | 2024-03-15T10:42:00Z                         |
| replication_lag_minutes     | 3                                            |
| rpo_status                  | OK                                           |
| failover_readiness          | ready                                        |
| consistency_group_status    | active                                       |
| member_count                | 3                                            |
| volumes_replicating         | 3                                            |
| volumes_in_error            | 0                                            |
+-----------------------------+----------------------------------------------+
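
The reported replication_lag_minutes is simply the age of last_replication_at. It can be recomputed locally as a cross-check; this sketch assumes GNU date (for `-d` parsing of ISO-8601 timestamps):

```shell
# Sketch: recompute replication lag from last_replication_at to cross-check
# the Health panel's replication_lag_minutes field.
lag_minutes() {
  # $1: last_replication_at (ISO-8601 UTC), $2: reference time (ISO-8601 UTC)
  local last now
  last=$(date -u -d "$1" +%s)
  now=$(date -u -d "$2" +%s)
  echo $(( (now - last) / 60 ))
}

# With the values from the output above, a probe at 10:45 UTC yields 3:
#   lag_minutes 2024-03-15T10:42:00Z 2024-03-15T10:45:00Z
```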

Example 4 — Validate a site before a DR drill

openstack --os-cloud site-a protector site validate a1b2c3d4-0000-0000-0000-111111111111

Expected output:

+------------------------------+--------+---------------------------------------------------+
| Check                        | Result | Detail                                            |
+------------------------------+--------+---------------------------------------------------+
| keystone_reachable           | PASS   | http://site-a-controller:5000/v3 responded 200    |
| nova_reachable               | PASS   | 8774 endpoint responding                          |
| cinder_reachable             | PASS   | 8776 endpoint responding                          |
| service_credentials          | PASS   | protector-service authenticated successfully      |
| consistency_group_active     | PASS   | All CGs in active state                           |
+------------------------------+--------+---------------------------------------------------+
Validation result: PASS

Run this command for both site-a and site-b before any planned failover or test failover. A FAIL on any check should be resolved before proceeding.
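
In drill runbooks it can help to turn the validation output into a hard gate. A sketch that reads the validation output on stdin and aborts on any FAIL (so the gate itself needs no API access):

```shell
# Sketch: pre-flight gate over `protector site validate` output.
preflight_gate() {
  if grep -q 'FAIL'; then
    echo "pre-flight FAILED - resolve failed checks before failover"
    return 1
  fi
  echo "pre-flight passed"
}

# Assumed invocation, once per site:
#   openstack --os-cloud site-a protector site validate <site-id> | preflight_gate
```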


Example 5 — Monitor a failover operation in progress

openstack --os-cloud site-b protector operation show op-9900aabb-1122-3344-5566-ccddee001122

Expected output:

+---------------------+--------------------------------------------------+
| Field               | Value                                            |
+---------------------+--------------------------------------------------+
| id                  | op-9900aabb-1122-3344-5566-ccddee001122          |
| operation_type      | failover                                         |
| status              | running                                          |
| progress            | 62                                               |
| source_site         | site-a                                           |
| target_site         | site-b                                           |
| started_at          | 2024-03-15T11:00:00Z                             |
| completed_at        | None                                             |
| steps_completed     | ["pre_validation", "vm_shutdown", "final_sync",  |
|                     |  "storage_promote"]                              |
| steps_failed        | []                                               |
| error_message       | None                                             |
+---------------------+--------------------------------------------------+

Troubleshooting

Issue: Site shows unreachable in the Health panel

Symptom: One or both sites have status: unreachable. The Health panel shows all Protection Groups on that site as blocked. Any attempt to modify a Protection Group returns an error.

Likely cause: The Protector engine on the unreachable site cannot be contacted. This may be due to a network partition, the protector-api or protector-engine service being down, or a Keystone authentication failure for the service account.

Fix:

  1. Test network connectivity from your CLI host to the affected site's Keystone endpoint: curl -k https://<site-keystone>:5000/v3
  2. SSH to the affected site's controller and check service status: systemctl status protector-api protector-engine
  3. Inspect service logs: journalctl -u protector-api -n 100 and journalctl -u protector-engine -n 100
  4. Verify the service account can authenticate: openstack --os-cloud <site> token issue
  5. Once the root cause is resolved, the Protector engine will automatically re-probe the site and update its status to active.

Issue: RPO status shows VIOLATED for an async Protection Group

Symptom: The Health panel reports RPO Status: VIOLATED for a Protection Group. replication_lag_minutes exceeds rpo_minutes.

Likely cause: The Pure Storage Protection Group snapshot has not been successfully transferred to the secondary array within the RPO window. This can be caused by: storage backend connectivity issues between arrays, the replication_interval being set higher than rpo_minutes × 60, or a transient network issue between the two Pure Storage arrays.

Fix:

  1. Check that replication_interval (in seconds) is less than rpo_minutes × 60. For example, an RPO of 15 minutes requires replication_interval of at most 900 seconds.
  2. Verify Pure Storage array-to-array connectivity independently of OpenStack.
  3. Force an immediate sync via: openstack --os-cloud <primary-site> protector consistency-group sync <pg-id>
  4. Re-check the Health panel after a few minutes. last_replication_at should update and replication_lag_minutes should fall below the RPO threshold.

Issue: failover_readiness shows not_ready even though the site appears healthy

Symptom: Site status is active, RPO is not violated, but the Protection Group still reports failover_readiness: not_ready.

Likely cause: The Consistency Group or one or more member volumes is not in active status, or the Protection Group itself is in error state.

Fix:

  1. Run openstack --os-cloud <site> protector health show <pg-id> and examine consistency_group_status and volumes_in_error.
  2. If consistency_group_status is not active, inspect Cinder: openstack --os-cloud <site> consistency group show <cinder-cg-id>.
  3. If individual volumes are in error, check the Cinder volume status and the underlying backend.
  4. If the Protection Group status is error, review the most recent DR operation for error details: openstack --os-cloud <site> protector operation list --protection-group <pg-id>

Issue: Health panel shows Sync Blocked on a Protection Group

Symptom: A Protection Group displays a Sync Blocked warning. Attempts to add or remove members, or update the group, are rejected.

Likely cause: The peer site is currently unreachable. Trilio Site Recovery enforces strict metadata synchronisation — modifications are blocked when the secondary site cannot be confirmed as consistent, to prevent the two sites from diverging.

Fix: Restore connectivity to the peer site first. See the unreachable site troubleshooting entry above. Once both sites are active, the block is automatically lifted and you can retry the modification.


Issue: protector site validate returns FAIL on service_credentials

Symptom: The site validation command fails the service_credentials check even though the site was previously working.

Likely cause: The service account password stored in the Protector database no longer matches the password in Keystone on that site, or the service account has been locked or deleted.

Fix:

  1. Confirm the service user exists on the affected site: openstack --os-cloud <site> user show protector-service
  2. Reset the password in Keystone if necessary: openstack --os-cloud <site> user set --password <new-password> protector-service
  3. Update the stored credentials in the Protector site record: openstack --os-cloud <site> protector site set <site-id> --service-password <new-password>
  4. Re-run protector site validate to confirm the check passes.