Health Panel
Site reachability, replication lag, RPO compliance, failover readiness
The Health panel gives you a unified, real-time view of your DR estate's operational status across both OpenStack sites. It surfaces site reachability, per-Protection-Group replication lag, RPO compliance, and failover readiness so you can identify problems before a real disaster forces your hand. Because Trilio Site Recovery blocks Protection Group modifications when the peer site is unreachable, the Health panel is your first stop when something feels wrong — it tells you exactly which sites, groups, or volumes are degraded and why. Use it as your daily operational dashboard and as the pre-flight checklist before executing any planned failover or DR drill.
Before using the Health panel, ensure the following conditions are met:
- Both OpenStack sites are registered — `protector-api` and `protector-engine` must be running independently on each site, and both sites must be registered in the Protector database with a valid `auth_url` and service credentials.
- `protectorclient` OSC plugin is installed — this is the CLI coordination layer that authenticates to both sites and collects health data. It must be configured with a `clouds.yaml` entry for each site.
- At least one Protection Group exists — Health panel metrics are scoped to Protection Groups. Groups must have a replication policy attached (with `rpo_minutes` set) for RPO compliance checks to function.
- Network reachability between CLI host and both sites — the `protectorclient` must be able to reach the Keystone, Nova, Cinder, and Protector API endpoints on both sites (ports 5000, 8774, 8776, and 8788 respectively).
- Cinder volume types with `replication_enabled='<is> True'` — only volumes using eligible replication-enabled volume types contribute replication lag and RPO metrics. Volumes on non-replicated types are excluded from health calculations.
- OpenStack Victoria or later on both sites.
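The port requirements above can be smoke-tested from the CLI host before anything is registered. The sketch below is a minimal example, assuming plain TCP reachability is a reasonable proxy for endpoint availability; the hostname is a placeholder.

```python
import socket

# Service ports listed in the prerequisites above.
REQUIRED_PORTS = {
    "keystone": 5000,   # Identity
    "nova": 8774,       # Compute
    "cinder": 8776,     # Block Storage
    "protector": 8788,  # Protector API
}

def check_reachability(host, ports=REQUIRED_PORTS, timeout=3.0):
    """Return {service: bool} indicating TCP reachability of each endpoint."""
    results = {}
    for service, port in ports.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[service] = True
        except OSError:
            results[service] = False
    return results

# Example: check_reachability("site-a-controller") should be all-True
# before the Health panel can collect data from that site.
```

Run it against both controllers; any `False` entry points at a firewall or service problem to fix before proceeding.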
The Health panel is part of the protectorclient OSC plugin and the Horizon dashboard integration. No separate installation is required beyond the standard Trilio Site Recovery deployment. If you have not yet installed the plugin, follow the steps below.
Step 1 — Install the protectorclient OSC plugin
pip install python-protectorclient
Step 2 — Verify the plugin is registered
openstack --os-cloud site-a protector --help
Expected output includes `health`, `protection-group`, `operation`, and `site` command groups.
Step 3 — Configure clouds.yaml for both sites
Create or update ~/.config/openstack/clouds.yaml so that both sites are addressable by name:
clouds:
site-a:
auth:
auth_url: http://site-a-controller:5000/v3
project_name: admin
username: admin
password: SITE_A_PASSWORD
user_domain_name: Default
project_domain_name: Default
region_name: RegionOne
site-b:
auth:
auth_url: http://site-b-controller:5000/v3
project_name: admin
username: admin
password: SITE_B_PASSWORD
user_domain_name: Default
project_domain_name: Default
region_name: RegionOne
Step 4 — Validate connectivity to both sites
openstack --os-cloud site-a token issue
openstack --os-cloud site-b token issue
Both commands must return a valid token. If either fails, resolve authentication errors before proceeding — the Health panel requires authenticated access to both sites simultaneously.
Step 5 — (Optional) Access via Horizon
If your deployment includes the Horizon dashboard plugin, navigate to Project → Disaster Recovery → Health after logging in. No additional installation is required; the dashboard panel reads from the same Protector API endpoints.
The Health panel's behaviour is controlled by settings in the replication policy attached to each Protection Group, and by site-level status fields maintained by the Protector engine. The following parameters directly affect what the Health panel reports.
Replication policy fields
| Field | Default | Valid values | Effect |
|---|---|---|---|
| `rpo_minutes` | None (unset) | Positive integer, minutes | Sets the Recovery Point Objective threshold. If replication lag exceeds this value, the Protection Group is flagged as RPO Violated in the Health panel. If unset, no RPO compliance check is performed for that group. |
| `replication_interval` | None (unset) | Positive integer, seconds | For async replication, how often Pure Storage snapshots are transferred to the secondary array. A value larger than `rpo_minutes` × 60 will guarantee RPO violations under normal operation — set this lower than your RPO target. |
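The interaction between these two fields reduces to a single comparison. A minimal sketch (the field names match the table above; the function itself is illustrative, not part of the product):

```python
def rpo_compatible(rpo_minutes, replication_interval_seconds):
    """True when the transfer cadence can satisfy the RPO target."""
    if rpo_minutes is None:
        return True  # no RPO compliance check is performed for this group
    return replication_interval_seconds <= rpo_minutes * 60

# An RPO of 15 minutes needs transfers at least every 900 seconds:
# rpo_compatible(15, 600)  -> True
# rpo_compatible(15, 1200) -> False (guaranteed violations)
```

Running this check against every replication policy before attaching it to a Protection Group avoids self-inflicted RPO violations.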
Site status values
The Health panel reads the status field from the sites table. Protector updates this automatically during connectivity probes.
| Value | Meaning | Health panel indicator |
|---|---|---|
active | Site is reachable and responding | Green / Reachable |
unreachable | Connectivity probe failed | Red / Unreachable — all Protection Group operations blocked |
error | Site reached but in an error state | Amber / Degraded |
Protection Group status values
The Health panel uses the Protection Group status field to determine failover readiness.
| Value | Failover-ready? | Notes |
|---|---|---|
active | Yes | Normal operation |
failed_over | No (already failed over) | Failback may be available |
failing_over | No | Operation in progress |
failing_back | No | Operation in progress |
error | No | Investigate before attempting failover |
Metadata sync strictness
Metadata sync between sites is strict by design: any attempt to modify a Protection Group is blocked when the peer site is unreachable. This cannot be configured — it is a hard constraint that prevents split-brain divergence. The Health panel will display a Sync Blocked warning on affected groups when the peer is down.
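In pseudocode terms the constraint amounts to one guard on every write. This sketch is purely illustrative — the real enforcement happens server-side in the Protector engine:

```python
def modification_allowed(local_site_status, peer_site_status):
    """Protection Group writes require both sites to be 'active'."""
    return local_site_status == "active" and peer_site_status == "active"

# modification_allowed("active", "unreachable") -> False: the Health panel
# shows Sync Blocked and the write is rejected to prevent split-brain divergence.
```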
You interact with the Health panel through two interfaces: the openstack protector health CLI commands and the Horizon dashboard panel. Both surfaces read from the same underlying Protector API endpoints.
Check overall site reachability
Run this first whenever you suspect connectivity problems or before scheduling a planned failover:
openstack --os-cloud site-a protector site list
This returns the registered sites and their current status. A site showing unreachable means the Protector engine on that site cannot be contacted, and all Protection Group writes targeting that site are blocked until connectivity is restored.
Check the health of all Protection Groups
openstack --os-cloud site-a protector health list
The output summarises every Protection Group visible to the authenticated tenant, including:
- Current status (`active`, `failed_over`, `error`, etc.)
- Replication type (`sync` or `async`)
- Current primary site
- RPO target and whether it is currently being met
- Failover readiness (`ready`, `not_ready`, `blocked`)
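The readiness column can be thought of as derived from the site and group state described in the configuration tables. The decision order below is an assumption made for illustration, not the engine's actual implementation:

```python
def failover_readiness(peer_site_status, pg_status, rpo_ok):
    """Classify a Protection Group's failover readiness.

    peer_site_status: status of the site you would fail over to.
    pg_status: Protection Group status ('active', 'error', ...).
    rpo_ok: False when replication lag exceeds rpo_minutes.
    """
    if peer_site_status != "active":
        return "blocked"    # peer unreachable: all operations are blocked
    if pg_status != "active" or not rpo_ok:
        return "not_ready"  # in-flight operation, error state, or RPO violated
    return "ready"
```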
Inspect a single Protection Group in detail
openstack --os-cloud site-a protector health show <protection-group-id>
This surfaces per-volume replication status, the timestamp of the last successful replication, calculated lag against the RPO target, and any error messages from the Protector engine.
Validate replication readiness before a DR drill
Before executing a test failover, use the site validation endpoint to confirm both sites are healthy and the storage layer is in a consistent state:
openstack --os-cloud site-a protector site validate <site-id>
Run this for both sites. The command checks Keystone reachability, Nova and Cinder endpoint health, and whether the Protection Group's Consistency Group is in active status on the primary Cinder backend.
Monitor an in-progress DR operation
Once a failover or failback is underway, track progress through the operations interface:
openstack --os-cloud site-a protector operation list
openstack --os-cloud site-a protector operation show <operation-id>
The progress field returns a value from 0 to 100. The steps_completed and steps_failed arrays give you step-level granularity so you can identify exactly where a long-running operation is stalled.
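Step-level granularity makes it straightforward to pinpoint a stall programmatically. A sketch, assuming a fixed step order — the first four names appear in Example 5's output, while the final two are hypothetical:

```python
# Step names taken from Example 5 output; the last two are hypothetical.
FAILOVER_STEPS = [
    "pre_validation", "vm_shutdown", "final_sync",
    "storage_promote", "vm_boot", "post_validation",
]

def current_step(steps_completed, steps_failed):
    """Return (state, step) identifying where an operation stands."""
    if steps_failed:
        return ("failed", steps_failed[0])
    for step in FAILOVER_STEPS:
        if step not in steps_completed:
            return ("running", step)
    return ("done", None)
```

Feeding it the `steps_completed` and `steps_failed` arrays from `operation show` tells you immediately which step a long-running operation is sitting on.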
Example 1 — List all sites and confirm reachability
openstack --os-cloud site-a protector site list
Expected output:
+--------------------------------------+---------------------+-----------+--------------------------------+
| ID | Name | Status | Auth URL |
+--------------------------------------+---------------------+-----------+--------------------------------+
| a1b2c3d4-0000-0000-0000-111111111111 | site-a | active | http://site-a-controller:5000 |
| e5f6a7b8-0000-0000-0000-222222222222 | site-b | active | http://site-b-controller:5000 |
+--------------------------------------+---------------------+-----------+--------------------------------+
Both sites showing active means the Protector engine on each site is reachable and Protection Group modifications are permitted.
Example 2 — Health summary across all Protection Groups
openstack --os-cloud site-a protector health list
Expected output:
+--------------------------------------+--------------+------------------+----------+-------------+-----------+-----------+
| PG ID | Name | Current Primary | Repl Type| RPO Target | RPO Status| Failover |
+--------------------------------------+--------------+------------------+----------+-------------+-----------+-----------+
| f1e2d3c4-aaaa-bbbb-cccc-000000000001 | prod-app-pg | site-a | async | 15 min | OK | ready |
| f1e2d3c4-aaaa-bbbb-cccc-000000000002 | analytics-pg | site-a | async | 30 min | VIOLATED | not_ready |
| f1e2d3c4-aaaa-bbbb-cccc-000000000003 | db-cluster | site-a | sync | 0 min | OK | ready |
+--------------------------------------+--------------+------------------+----------+-------------+-----------+-----------+
The `analytics-pg` group shows a VIOLATED RPO status, indicating that the last successful replication snapshot is older than the configured 30-minute RPO target. The group is marked `not_ready` for failover until replication catches up or you explicitly choose to accept data loss with `--force`.
Example 3 — Detailed health for a single Protection Group
openstack --os-cloud site-a protector health show f1e2d3c4-aaaa-bbbb-cccc-000000000001
Expected output:
+-----------------------------+----------------------------------------------+
| Field | Value |
+-----------------------------+----------------------------------------------+
| id | f1e2d3c4-aaaa-bbbb-cccc-000000000001 |
| name | prod-app-pg |
| status | active |
| replication_type | async |
| current_primary_site | site-a |
| secondary_site | site-b |
| rpo_minutes | 15 |
| last_replication_at | 2024-03-15T10:42:00Z |
| replication_lag_minutes | 3 |
| rpo_status | OK |
| failover_readiness | ready |
| consistency_group_status | active |
| member_count | 3 |
| volumes_replicating | 3 |
| volumes_in_error | 0 |
+-----------------------------+----------------------------------------------+
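The lag and RPO fields above are simple to recompute from `last_replication_at`. A sketch using the values from this example (parsing assumes the ISO-8601 `Z`-suffixed form shown):

```python
from datetime import datetime, timezone

def replication_lag_minutes(last_replication_at, now):
    """Minutes elapsed since the last successful replication snapshot."""
    last = datetime.fromisoformat(last_replication_at.replace("Z", "+00:00"))
    return (now - last).total_seconds() / 60

def rpo_status(lag_minutes, rpo_minutes):
    return "VIOLATED" if lag_minutes > rpo_minutes else "OK"

# With the values above, evaluated at 10:45 UTC: 3 minutes of lag,
# well inside the 15-minute RPO target.
now = datetime(2024, 3, 15, 10, 45, tzinfo=timezone.utc)
lag = replication_lag_minutes("2024-03-15T10:42:00Z", now)
```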
Example 4 — Validate a site before a DR drill
openstack --os-cloud site-a protector site validate a1b2c3d4-0000-0000-0000-111111111111
Expected output:
+------------------------------+--------+---------------------------------------------------+
| Check | Result | Detail |
+------------------------------+--------+---------------------------------------------------+
| keystone_reachable | PASS | http://site-a-controller:5000/v3 responded 200 |
| nova_reachable | PASS | 8774 endpoint responding |
| cinder_reachable | PASS | 8776 endpoint responding |
| service_credentials | PASS | protector-service authenticated successfully |
| consistency_group_active | PASS | All CGs in active state |
+------------------------------+--------+---------------------------------------------------+
Validation result: PASS
Run this command for both site-a and site-b before any planned failover or test failover. A FAIL on any check should be resolved before proceeding.
Example 5 — Monitor a failover operation in progress
openstack --os-cloud site-b protector operation show op-9900aabb-1122-3344-5566-ccddee001122
Expected output:
+---------------------+--------------------------------------------------+
| Field | Value |
+---------------------+--------------------------------------------------+
| id | op-9900aabb-1122-3344-5566-ccddee001122 |
| operation_type | failover |
| status | running |
| progress | 62 |
| source_site | site-a |
| target_site | site-b |
| started_at | 2024-03-15T11:00:00Z |
| completed_at | None |
| steps_completed | ["pre_validation", "vm_shutdown", "final_sync", |
| | "storage_promote"] |
| steps_failed | [] |
| error_message | None |
+---------------------+--------------------------------------------------+
Issue: Site shows unreachable in the Health panel
Symptom: One or both sites have status: unreachable. The Health panel shows all Protection Groups on that site as blocked. Any attempt to modify a Protection Group returns an error.
Likely cause: The Protector engine on the unreachable site cannot be contacted. This may be due to a network partition, the protector-api or protector-engine service being down, or a Keystone authentication failure for the service account.
Fix:
- Test network connectivity from your CLI host to the affected site's Keystone endpoint: `curl -k https://<site-keystone>:5000/v3`
- SSH to the affected site's controller and check service status: `systemctl status protector-api protector-engine`
- Inspect service logs: `journalctl -u protector-api -n 100` and `journalctl -u protector-engine -n 100`
- Verify the service account can authenticate: `openstack --os-cloud <site> token issue`
- Once the root cause is resolved, the Protector engine will automatically re-probe the site and update its status to `active`.
Issue: RPO status shows VIOLATED for an async Protection Group
Symptom: The Health panel reports RPO Status: VIOLATED for a Protection Group. replication_lag_minutes exceeds rpo_minutes.
Likely cause: The Pure Storage Protection Group snapshot has not been successfully transferred to the secondary array within the RPO window. This can be caused by: storage backend connectivity issues between arrays, the replication_interval being set higher than rpo_minutes × 60, or a transient network issue between the two Pure Storage arrays.
Fix:
- Check that
replication_interval(in seconds) is less thanrpo_minutes × 60. For example, an RPO of 15 minutes requiresreplication_intervalof at most 900 seconds. - Verify Pure Storage array-to-array connectivity independently of OpenStack.
- Force an immediate sync via:
openstack --os-cloud <primary-site> protector consistency-group sync <pg-id> - Re-check the Health panel after a few minutes.
last_replication_atshould update andreplication_lag_minutesshould fall below the RPO threshold.
Issue: failover_readiness shows not_ready even though the site appears healthy
Symptom: Site status is active, RPO is not violated, but the Protection Group still reports failover_readiness: not_ready.
Likely cause: The Consistency Group or one or more member volumes is not in active status, or the Protection Group itself is in error state.
Fix:
- Run
openstack --os-cloud <site> protector health show <pg-id>and examineconsistency_group_statusandvolumes_in_error. - If
consistency_group_statusis notactive, inspect Cinder:openstack --os-cloud <site> consistency group show <cinder-cg-id>. - If individual volumes are in
error, check the Cinder volume status and the underlying backend. - If the Protection Group status is
error, review the most recent DR operation for error details:openstack --os-cloud <site> protector operation list --protection-group <pg-id>
Issue: Health panel shows Sync Blocked on a Protection Group
Symptom: A Protection Group displays a Sync Blocked warning. Attempts to add or remove members, or update the group, are rejected.
Likely cause: The peer site is currently unreachable. Trilio Site Recovery enforces strict metadata synchronisation — modifications are blocked when the secondary site cannot be confirmed as consistent, to prevent the two sites from diverging.
Fix: Restore connectivity to the peer site first. See the unreachable site troubleshooting entry above. Once both sites are active, the block is automatically lifted and you can retry the modification.
Issue: protector site validate returns FAIL on service_credentials
Symptom: The site validation command fails the service_credentials check even though the site was previously working.
Likely cause: The service account password stored in the Protector database no longer matches the password in Keystone on that site, or the service account has been locked or deleted.
Fix:
- Confirm the service user exists on the affected site: `openstack --os-cloud <site> user show protector-service`
- Reset the password in Keystone if necessary: `openstack --os-cloud <site> user set --password <new-password> protector-service`
- Update the stored credentials in the Protector site record: `openstack --os-cloud <site> protector site set <site-id> --service-password <new-password>`
- Re-run `protector site validate` to confirm the check passes.