Replication Health Monitoring
Monitoring replication link health, sync status, RPO compliance, and failover readiness.
Replication health monitoring gives you continuous visibility into the state of your disaster recovery infrastructure — before you need it. The protector-engine exposes a ReplicationHealthReport for each Protection Group through both the /v1/health REST endpoint and the openstack dr health show CLI command, surfacing the metrics that matter most: replication link health, synchronization status, replication lag, RPO compliance, and an aggregated failover_ready flag that drives pre-failover validation. Staying ahead of replication degradation lets you remediate problems during normal operations rather than discovering them mid-disaster.
Before using the health monitoring features described on this page, ensure the following are in place:
- Trilio Site Recovery deployed on both sites — `protector-api` and `protector-engine` must be running independently on your primary and secondary OpenStack clouds. Each site requires its own Nova, Cinder, Neutron, and Keystone endpoints.
- OSC CLI plugin installed — the `protectorclient` plugin for `python-openstackclient` must be installed on the machine where you run commands. This plugin authenticates to both sites and is the coordination layer for cross-site operations.
- Protection Group configured and active — at least one Protection Group must exist with a replication policy attached, including `primary_fa_url`, `secondary_fa_url`, and `rpo_minutes`. See Configure a replication policy before proceeding.
- Replication-enabled volume types — all volumes under protection must use a Cinder volume type with `replication_enabled='<is> True'` and a matching `replication_type` property.
- `clouds.yaml` configured for both sites — the CLI must be able to authenticate to both the primary and secondary Keystone endpoints. See the multi-site `clouds.yaml` example in the Deployment Guide.
- Both sites reachable from the operator workstation — health status queries are served by the local `protector-api`, but certain aggregated fields (such as `failover_ready`) reflect cross-site replication state derived by the engine.
The health monitoring capability is built into protector-engine and protector-api. No additional packages are required beyond the standard Trilio Site Recovery installation.
Step 1 — Verify both services are running on each site
On the primary site:
systemctl status protector-api
systemctl status protector-engine
On the secondary site:
systemctl status protector-api
systemctl status protector-engine
Both services must report active (running) on both sites before health data is meaningful.
Step 2 — Confirm the CLI plugin is installed
openstack dr health show --help
If the command is not found, install the OSC plugin:
pip install python-protectorclient
Step 3 — Confirm your clouds.yaml includes both sites
clouds:
  site-a:
    auth:
      auth_url: http://site-a-controller:5000/v3
      project_name: admin
      username: admin
      password: password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
  site-b:
    auth:
      auth_url: http://site-b-controller:5000/v3
      project_name: admin
      username: admin
      password: password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
Step 4 — Test the health endpoint directly (optional)
You can verify the API is responding by calling the health endpoint with a token:
TOKEN=$(openstack --os-cloud site-a token issue -f value -c id)
TENANT_ID=$(openstack --os-cloud site-a token issue -f value -c project_id)
PG_ID="<your-protection-group-uuid>"
curl -s \
-H "X-Auth-Token: $TOKEN" \
-H "OpenStack-API-Version: protector 1.2" \
http://site-a-controller:8788/v1/${TENANT_ID}/protection-groups/${PG_ID}/health \
| python3 -m json.tool
A successful response returns a ReplicationHealthReport object (see Usage for field descriptions).
Health monitoring behavior is governed by the replication policy attached to each Protection Group and by engine-level configuration in protector.conf.
Replication policy fields that affect health reporting
The following fields in the replication_policies record directly influence what the health endpoint reports and how violations are evaluated:
| Field | Type | Default | Effect |
|---|---|---|---|
| rpo_minutes | integer | — | The Recovery Point Objective in minutes. The engine compares replication_lag against this threshold to determine RPO compliance and whether to record an RPOEvent. Required. |
| replication_interval | integer (seconds) | — | For async replication: the target interval between snapshots. The engine uses this to assess whether replication is falling behind. |
| primary_fa_url | string | — | URL of the primary FlashArray. Used by the engine to poll live replication link state. |
| secondary_fa_url | string | — | URL of the secondary FlashArray. Used to verify connectivity at both ends of the replication link. |
Update the policy for a Protection Group:
openstack dr policy update <pg-name-or-id> \
--rpo-minutes 10 \
--replication-interval 300
Engine-level health configuration (protector.conf)
The following options may appear under [engine] or a dedicated [health] section:
| Option | Effect |
|---|---|
| health_poll_interval | How frequently (in seconds) the engine refreshes replication metrics from FlashArray. Lower values increase API call frequency to the arrays. |
| rpo_event_retention_days | How long RPOEvent audit records are retained in the database before being purged. |
RPO event recording
Whenever the engine detects that replication_lag (in seconds) exceeds rpo_minutes × 60, it writes an RPOEvent record to the database. These records accumulate as an audit trail and are visible via the health history commands described in Usage. They do not auto-resolve — an RPOEvent persists even after replication catches up, so you have a complete history of every SLA breach for compliance review.
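The violation test itself is simple arithmetic, and it is worth internalizing when you choose a value for rpo_minutes. A minimal sketch of the comparison (illustrative only, not the engine's actual code; the parameter names mirror the policy and report fields above):

```python
def is_rpo_violation(replication_lag_secs: int, rpo_minutes: int) -> bool:
    """True when the current replication lag exceeds the RPO budget.

    The engine writes an RPOEvent record whenever this condition holds.
    """
    return replication_lag_secs > rpo_minutes * 60

# A 1247-second lag against a 15-minute RPO (900-second budget) is a violation:
print(is_rpo_violation(1247, 15))  # True
print(is_rpo_violation(47, 15))    # False
```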
Checking replication health for a Protection Group
The primary workflow is to query the ReplicationHealthReport for a specific Protection Group. The report is the canonical structure returned by all health endpoints and consumed by the Horizon health panel.
openstack dr health show <pg-name-or-id>
This command authenticates to the local protector-api, which queries the protector-engine for a freshly computed ReplicationHealthReport. The report contains the following key fields:
| Field | Values | Meaning |
|---|---|---|
| link_health | connected, degraded, disconnected | State of the replication link between the two FlashArrays. degraded means the link is up but experiencing elevated error rates or latency. |
| sync_status | in_sync, syncing, out_of_sync | Whether the secondary copy is current. syncing means replication is actively transferring. out_of_sync means data is behind and the secondary cannot be considered a valid recovery point. |
| replication_lag | integer (seconds) | Seconds since the last confirmed data transfer completed. Compare this to your rpo_minutes × 60 budget. |
| last_successful_sync | ISO 8601 timestamp | Timestamp of the most recent successful replication completion. |
| rpo_compliant | boolean | true if replication_lag is within the configured rpo_minutes threshold. |
| failover_ready | boolean | Aggregated readiness flag computed by compute_failover_readiness(). true only when link_health is connected, sync_status is in_sync, and RPO is compliant. This flag gates the pre-failover validation check. |
| data_at_risk_seconds | integer | Seconds of data that would be lost if a failover were executed right now. Derived from replication_lag. Zero when fully in sync. |
| throughput_mbps | float | Current replication transfer throughput in MB/s. |
| iops | integer | Current replication IOPS. |
| volume_health | list | Per-volume replication status for every volume in the Protection Group's Consistency Group. |
Understanding failover_ready and data_at_risk_seconds
The compute_failover_readiness() helper in health_metrics.py derives failover_ready and data_at_risk_seconds from the current health state. These two fields are the primary inputs to:
- Horizon status badges — the dashboard renders a green/amber/red badge on each Protection Group card based on `failover_ready`.
- Pre-failover validation — when you execute a failover or test failover, the engine checks `failover_ready` first. If it is `false`, the operation is blocked (unless `--force` is specified) and `data_at_risk_seconds` is included in the error message so you know the exposure.
This means that keeping replication healthy is not just a monitoring concern — it directly affects whether your failover will proceed.
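As an illustration, the readiness rule can be reconstructed from the worked examples later on this page: the report is ready only when the link is fully connected, the secondary holds a current copy, and the RPO budget is not exceeded. The sketch below is consistent with those example outputs but is not the actual health_metrics.py source:

```python
def compute_failover_readiness(link_health: str, sync_status: str,
                               rpo_compliant: bool,
                               replication_lag: int) -> tuple[bool, int]:
    """Illustrative derivation of (failover_ready, data_at_risk_seconds)."""
    ready = (link_health == "connected"
             and sync_status == "in_sync"
             and rpo_compliant)
    # Data at risk equals the current lag; zero once the secondary is in sync.
    data_at_risk = 0 if sync_status == "in_sync" else replication_lag
    return ready, data_at_risk

# Degraded link, still syncing: not failover-ready despite RPO compliance.
print(compute_failover_readiness("degraded", "syncing", True, 340))
# (False, 340)
```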
Listing RPO events for audit
Every RPO violation is recorded as an RPOEvent in the database. To review the audit trail:
openstack dr rpo-events list <pg-name-or-id>
Filter by time range:
openstack dr rpo-events list <pg-name-or-id> \
--start-time 2025-01-01T00:00:00 \
--end-time 2025-01-31T23:59:59
Checking health via the REST API directly
For integration with external monitoring systems (Prometheus, Nagios, etc.), call the health endpoint directly:
GET /v1/{tenant_id}/protection-groups/{pg_id}/health
The response is a ReplicationHealthReport object. Parse failover_ready and data_at_risk_seconds for alerting thresholds.
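As a starting point for such an integration, the sketch below maps a health report onto the standard Nagios/Icinga exit-code convention (0 OK, 1 WARNING, 2 CRITICAL). The `health_report` envelope key matches the response shape shown in the examples on this page; the function names and the 300-second default budget are illustrative choices, not part of the product:

```python
import json
import urllib.request

def classify_report(report: dict, max_risk_secs: int = 300) -> tuple[int, str]:
    """Map a ReplicationHealthReport dict to a Nagios-style (code, message)."""
    risk = report["data_at_risk_seconds"]
    if not report["failover_ready"]:
        return 2, f"CRITICAL: failover_ready=False, data_at_risk_seconds={risk}"
    if risk > max_risk_secs:
        return 1, f"WARNING: data_at_risk_seconds={risk} exceeds alert budget"
    return 0, "OK: failover ready"

def check_pg_health(base_url: str, tenant_id: str, pg_id: str,
                    token: str) -> tuple[int, str]:
    """Fetch the health report for one Protection Group and classify it."""
    url = f"{base_url}/v1/{tenant_id}/protection-groups/{pg_id}/health"
    req = urllib.request.Request(url, headers={
        "X-Auth-Token": token,
        "OpenStack-API-Version": "protector 1.2",
    })
    with urllib.request.urlopen(req) as resp:
        return classify_report(json.load(resp)["health_report"])
```

A wrapper script can call `check_pg_health` and pass the returned code to `sys.exit()` so the monitoring system picks up the state directly.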
Example 1 — Healthy Protection Group in Full Sync
openstack dr health show prod-web-app
Expected output when replication is healthy and RPO-compliant:
+------------------------------+------------------------------------------+
| Field | Value |
+------------------------------+------------------------------------------+
| protection_group | prod-web-app |
| link_health | connected |
| sync_status | in_sync |
| replication_lag | 47 |
| last_successful_sync | 2025-06-10T14:22:13Z |
| rpo_compliant | True |
| failover_ready | True |
| data_at_risk_seconds | 0 |
| throughput_mbps | 312.4 |
| iops | 8240 |
| volume_health[0].volume_id | a1b2c3d4-... |
| volume_health[0].status | replicating |
| volume_health[1].volume_id | e5f6g7h8-... |
| volume_health[1].status | replicating |
+------------------------------+------------------------------------------+
failover_ready: True with data_at_risk_seconds: 0 means this Protection Group is ready for immediate failover with zero data exposure.
Example 2 — RPO Violation Detected
This example shows the output when replication lag has exceeded the configured rpo_minutes threshold (here: 15 minutes):
openstack dr health show prod-web-app
+------------------------------+------------------------------------------+
| Field | Value |
+------------------------------+------------------------------------------+
| protection_group | prod-web-app |
| link_health | connected |
| sync_status | out_of_sync |
| replication_lag | 1247 |
| last_successful_sync | 2025-06-10T14:01:26Z |
| rpo_compliant | False |
| failover_ready | False |
| data_at_risk_seconds | 1247 |
| throughput_mbps | 0.0 |
| iops | 0 |
+------------------------------+------------------------------------------+
replication_lag of 1247 seconds (~20 minutes) exceeds the 15-minute RPO. An RPOEvent record has been written to the database. failover_ready is false — a failover attempted now would require --force and would result in up to 1247 seconds of data loss.
Example 3 — Replication Link Degraded
openstack dr health show prod-web-app
+------------------------------+------------------------------------------+
| Field | Value |
+------------------------------+------------------------------------------+
| link_health | degraded |
| sync_status | syncing |
| replication_lag | 340 |
| rpo_compliant | True |
| failover_ready | False |
| data_at_risk_seconds | 340 |
| throughput_mbps | 12.1 |
+------------------------------+------------------------------------------+
The link is up but degraded. Throughput has dropped sharply and replication is falling behind. RPO is still technically compliant but failover_ready is false because link_health is not connected. Investigate the FlashArray replication link and inter-site network path.
Example 4 — Querying the REST API for Monitoring Integration
Retrieve the health report as JSON and extract the two alerting fields:
TOKEN=$(openstack --os-cloud site-a token issue -f value -c id)
TENANT_ID=$(openstack --os-cloud site-a token issue -f value -c project_id)
PG_ID="pg-12345678-1234-1234-1234-123456789012"
curl -s \
-H "X-Auth-Token: $TOKEN" \
-H "OpenStack-API-Version: protector 1.2" \
"http://site-a-controller:8788/v1/${TENANT_ID}/protection-groups/${PG_ID}/health" \
| python3 -c "
import json, sys
report = json.load(sys.stdin)['health_report']
print('failover_ready:', report['failover_ready'])
print('data_at_risk_seconds:', report['data_at_risk_seconds'])
"
Expected output when healthy:
failover_ready: True
data_at_risk_seconds: 0
Example 5 — Reviewing RPO Event History
openstack dr rpo-events list prod-web-app --limit 5
+--------------------------------------+----------------------+---------------------+-----------------------+
| id | protection_group | detected_at | lag_at_detection_secs |
+--------------------------------------+----------------------+---------------------+-----------------------+
| evt-aabb1122-... | prod-web-app | 2025-06-10T14:22:00Z| 1247 |
| evt-ccdd3344-... | prod-web-app | 2025-06-08T03:11:00Z| 980 |
+--------------------------------------+----------------------+---------------------+-----------------------+
These records persist for compliance auditing regardless of whether replication subsequently recovered.
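For compliance reporting it is often useful to aggregate the event list, for example counting breaches per calendar month. A small sketch assuming records shaped like the table above (the helper itself is illustrative, not a product API):

```python
from collections import Counter
from datetime import datetime

def breaches_per_month(events: list[dict]) -> Counter:
    """Count RPOEvent records per calendar month using their detected_at stamps."""
    return Counter(
        datetime.fromisoformat(e["detected_at"].replace("Z", "+00:00"))
        .strftime("%Y-%m")
        for e in events
    )

events = [
    {"detected_at": "2025-06-10T14:22:00Z", "lag_at_detection_secs": 1247},
    {"detected_at": "2025-06-08T03:11:00Z", "lag_at_detection_secs": 980},
]
print(breaches_per_month(events))  # Counter({'2025-06': 2})
```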
Issue: failover_ready is false but replication appears healthy in the FlashArray UI
Symptom: openstack dr health show reports failover_ready: False and link_health: connected, but the Pure Storage array management interface shows replication is running normally.
Likely cause: The engine's cached health state has not yet refreshed, or sync_status is still syncing from a recent replication cycle that has not completed. compute_failover_readiness() returns false whenever sync_status is not in_sync, even if the link is healthy.
Fix: Wait one full replication_interval cycle (configured in the replication policy) and re-query. If the status does not resolve to in_sync, check protector-engine logs for errors communicating with the FlashArray API:
journalctl -u protector-engine -f | grep -i health
Issue: link_health: disconnected — failover is blocked
Symptom: link_health reports disconnected. failover_ready is false. Attempting a failover returns a pre-validation error.
Likely cause 1: The inter-site network path between the two FlashArrays is down.
Likely cause 2: The FlashArray API credentials in the replication policy have expired or been rotated.
Likely cause 3: The protector-engine cannot reach the FlashArray management IP (primary_fa_url or secondary_fa_url in the policy).
Fix:
- Verify network connectivity from the controller hosting `protector-engine` to both FlashArray management IPs.
- Confirm the replication link in the FlashArray management UI or via Pure1.
- If credentials were rotated, update the replication policy:
openstack dr policy update <pg-name-or-id> \
--primary-fa-token "<new-token>" \
--secondary-fa-token "<new-token>"
- If the primary site is genuinely unreachable (disaster scenario), use `--force` to bypass the `failover_ready` check — but understand this accepts the `data_at_risk_seconds` exposure reported in the health report.
Issue: RPO events accumulating rapidly
Symptom: openstack dr rpo-events list shows frequent violations even though the link appears connected.
Likely cause: Replication throughput is insufficient for the write workload on the protected volumes. replication_lag is consistently exceeding rpo_minutes × 60 between cycles.
Fix:
- Check `throughput_mbps` and `iops` in the health report during a violation period. Compare against your inter-site bandwidth budget.
- Consider increasing `rpo_minutes` in the replication policy to a value that matches your realistic replication window, or reduce the write workload on protected volumes.
- For async replication, ensure `replication_interval` is achievable given the data change rate.
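A back-of-envelope feasibility check helps when sizing the link. The helper below is purely illustrative (neither the function nor its inputs are part of the product); it converts a sustained data change rate into the minimum replication throughput, in the same MB/s units as throughput_mbps:

```python
def required_throughput_mbps(change_rate_gb_per_hour: float) -> float:
    """Minimum sustained replication throughput (MB/s) needed to keep up
    with a steady data change rate on the protected volumes."""
    return change_rate_gb_per_hour * 1024 / 3600

# 500 GB/h of changed blocks needs ~142 MB/s sustained. If the health
# report shows throughput_mbps well below this, lag grows every cycle.
print(round(required_throughput_mbps(500.0), 1))  # 142.2
```

If the observed throughput cannot meet this floor even briefly, no rpo_minutes value will stay compliant under that workload.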
Issue: Health endpoint returns 404 or 401
Symptom: Calling GET /v1/{tenant_id}/protection-groups/{pg_id}/health via curl returns HTTP 404 or 401.
Likely cause: Either the Protection Group ID or tenant ID in the URL is incorrect, the API version header is missing or too low, or the token has expired.
Fix:
# Verify the correct tenant ID
TENANT_ID=$(openstack --os-cloud site-a token issue -f value -c project_id)
# Verify the Protection Group exists and belongs to this tenant
openstack dr protection-group show <pg-name-or-id>
# Ensure you are sending the correct API version header
curl -H "OpenStack-API-Version: protector 1.2" ...
Issue: sync_status: out_of_sync after site recovered from unreachable state
Symptom: After a temporary network partition between sites, the protection group shows out_of_sync even though both sites are now reachable.
Likely cause: The metadata sync was blocked during the outage (by design — modifications are blocked when the peer site is unreachable). The protector-engine needs to reconcile state.
Fix: Force a metadata sync and then verify health:
openstack dr protection-group sync-force <pg-name-or-id>
openstack dr health show <pg-name-or-id>
Also verify that replication policies are consistent on both sites. If the engine detects a version mismatch during sync, it will log a conflict warning that requires manual resolution.