Site Recovery for OpenStack
Guide

Replication Health Monitoring

Monitoring replication link health, sync status, RPO compliance, and failover readiness.


Overview

Replication health monitoring gives you continuous visibility into the state of your disaster recovery infrastructure — before you need it. The protector-engine exposes a ReplicationHealthReport for each Protection Group through both the /v1/health REST endpoint and the openstack dr health show CLI command, surfacing the metrics that matter most: replication link health, synchronization status, replication lag, RPO compliance, and an aggregated failover_ready flag that drives pre-failover validation. Staying ahead of replication degradation lets you remediate problems during normal operations rather than discovering them mid-disaster.


Prerequisites

Before using the health monitoring features described on this page, ensure the following are in place:

  • Trilio Site Recovery deployed on both sites — protector-api and protector-engine must be running independently on your primary and secondary OpenStack clouds. Each site requires its own Nova, Cinder, Neutron, and Keystone endpoints.
  • OSC CLI plugin installed — the protectorclient plugin for python-openstackclient must be installed on the machine where you run commands. This plugin authenticates to both sites and is the coordination layer for cross-site operations.
  • Protection Group configured and active — at least one Protection Group must exist with a replication policy attached, including primary_fa_url, secondary_fa_url, and rpo_minutes. See Configure a replication policy before proceeding.
  • Replication-enabled volume types — all volumes under protection must use a Cinder volume type with replication_enabled='<is> True' and a matching replication_type property.
  • clouds.yaml configured for both sites — the CLI must be able to authenticate to both the primary and secondary Keystone endpoints. See the multi-site clouds.yaml example in the Deployment Guide.
  • Both sites reachable from the operator workstation — health status queries are served by the local protector-api, but certain aggregated fields (such as failover_ready) reflect cross-site replication state derived by the engine.

Installation

The health monitoring capability is built into protector-engine and protector-api. No additional packages are required beyond the standard Trilio Site Recovery installation.

Step 1 — Verify both services are running on each site

On the primary site:

systemctl status protector-api
systemctl status protector-engine

On the secondary site:

systemctl status protector-api
systemctl status protector-engine

Both services must report active (running) on both sites before health data is meaningful.

Step 2 — Confirm the CLI plugin is installed

openstack dr health show --help

If the command is not found, install the OSC plugin:

pip install python-protectorclient

Step 3 — Confirm your clouds.yaml includes both sites

clouds:
  site-a:
    auth:
      auth_url: http://site-a-controller:5000/v3
      project_name: admin
      username: admin
      password: password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
  site-b:
    auth:
      auth_url: http://site-b-controller:5000/v3
      project_name: admin
      username: admin
      password: password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne

Step 4 — Test the health endpoint directly (optional)

You can verify the API is responding by calling the health endpoint with a token:

TOKEN=$(openstack --os-cloud site-a token issue -f value -c id)
TENANT_ID=$(openstack --os-cloud site-a token issue -f value -c project_id)
PG_ID="<your-protection-group-uuid>"

curl -s \
  -H "X-Auth-Token: $TOKEN" \
  -H "OpenStack-API-Version: protector 1.2" \
  http://site-a-controller:8788/v1/${TENANT_ID}/protection-groups/${PG_ID}/health \
  | python3 -m json.tool

A successful response returns a ReplicationHealthReport object (see Usage for field descriptions).


Configuration

Health monitoring behavior is governed by the replication policy attached to each Protection Group and by engine-level configuration in protector.conf.

Replication policy fields that affect health reporting

The following fields in the replication_policies record directly influence what the health endpoint reports and how violations are evaluated:

  • rpo_minutes (integer, required) — The Recovery Point Objective in minutes. The engine compares replication_lag against this threshold to determine RPO compliance and whether to record an RPOEvent.
  • replication_interval (integer, seconds) — For async replication: the target interval between snapshots. The engine uses this to assess whether replication is falling behind.
  • primary_fa_url (string) — URL of the primary FlashArray. Used by the engine to poll live replication link state.
  • secondary_fa_url (string) — URL of the secondary FlashArray. Used to verify connectivity at both ends of the replication link.

Update the policy for a Protection Group:

openstack dr policy update <pg-name-or-id> \
  --rpo-minutes 10 \
  --replication-interval 300

Engine-level health configuration (protector.conf)

The following options may appear under [engine] or a dedicated [health] section:

  • health_poll_interval — How frequently (in seconds) the engine refreshes replication metrics from FlashArray. Lower values increase API call frequency to the arrays.
  • rpo_event_retention_days — How long RPOEvent audit records are retained in the database before being purged.

RPO event recording

Whenever the engine detects that replication_lag (in seconds) exceeds rpo_minutes × 60, it writes an RPOEvent record to the database. These records accumulate as an audit trail and are visible via the health history commands described in Usage. They do not auto-resolve — an RPOEvent persists even after replication catches up, so you have a complete history of every SLA breach for compliance review.
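The check itself is simple arithmetic. The following sketch is illustrative only — the function and dataclass names are hypothetical, not the actual protector-engine internals — but it shows the comparison the engine performs each polling cycle:

```python
# Hypothetical sketch of the RPO-violation check described above.
# Names are illustrative; they are not the protector-engine API.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class RPOEvent:
    protection_group: str
    detected_at: datetime
    lag_at_detection_secs: int

def check_rpo(pg_name: str, replication_lag: int, rpo_minutes: int):
    """Return an RPOEvent when lag exceeds the RPO budget, else None."""
    budget_seconds = rpo_minutes * 60
    if replication_lag > budget_seconds:
        # The real engine persists this record; it never auto-resolves.
        return RPOEvent(pg_name, datetime.now(timezone.utc), replication_lag)
    return None
```

For a 15-minute RPO the budget is 900 seconds, so a 1247-second lag produces an event while a 47-second lag does not.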


Usage

Checking replication health for a Protection Group

The primary workflow is to query the ReplicationHealthReport for a specific Protection Group. The report is the canonical structure returned by all health endpoints and consumed by the Horizon health panel.

openstack dr health show <pg-name-or-id>

This command authenticates to the local protector-api, which queries the protector-engine for a freshly computed ReplicationHealthReport. The report contains the following key fields:

  • link_health (connected, degraded, disconnected) — State of the replication link between the two FlashArrays. degraded means the link is up but experiencing elevated error rates or latency.
  • sync_status (in_sync, syncing, out_of_sync) — Whether the secondary copy is current. syncing means replication is actively transferring; out_of_sync means data is behind and the secondary cannot be considered a valid recovery point.
  • replication_lag (integer, seconds) — Seconds since the last confirmed data transfer completed. Compare this to your rpo_minutes × 60 budget.
  • last_successful_sync (ISO 8601 timestamp) — Timestamp of the most recent successful replication completion.
  • rpo_compliant (boolean) — true if replication_lag is within the configured rpo_minutes threshold.
  • failover_ready (boolean) — Aggregated readiness flag computed by compute_failover_readiness(). true only when link_health is connected, sync_status is in_sync, and RPO is compliant. This flag gates the pre-failover validation check.
  • data_at_risk_seconds (integer) — Seconds of data that would be lost if a failover were executed right now. Derived from replication_lag; zero when fully in sync.
  • throughput_mbps (float) — Current replication transfer throughput in MB/s.
  • iops (integer) — Current replication IOPS.
  • volume_health (list) — Per-volume replication status for every volume in the Protection Group's Consistency Group.

Understanding failover_ready and data_at_risk_seconds

The compute_failover_readiness() helper in health_metrics.py derives failover_ready and data_at_risk_seconds from the current health state. These two fields are the primary inputs to:

  • Horizon status badges — the dashboard renders a green/amber/red badge on each Protection Group card based on failover_ready.
  • Pre-failover validation — when you execute a failover or test failover, the engine checks failover_ready first. If it is false, the operation is blocked (unless --force is specified) and data_at_risk_seconds is included in the error message so you know the exposure.

This means that keeping replication healthy is not just a monitoring concern — it directly affects whether your failover will proceed.
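The derivation can be sketched in a few lines. This is an illustrative reconstruction of the rules documented on this page — the real compute_failover_readiness() in health_metrics.py may differ in detail — matching the behavior shown in the Examples section (ready only when connected, in sync, and RPO-compliant; data at risk equals the lag whenever the secondary is not fully in sync):

```python
# Illustrative sketch of the readiness rules described above; the real
# compute_failover_readiness() in health_metrics.py may differ in detail.
def compute_failover_readiness(link_health: str, sync_status: str,
                               rpo_compliant: bool, replication_lag: int):
    """Derive (failover_ready, data_at_risk_seconds) from current health."""
    ready = (link_health == "connected"
             and sync_status == "in_sync"
             and rpo_compliant)
    # data_at_risk_seconds is derived from replication_lag: zero when the
    # secondary is fully in sync, otherwise the current lag in seconds.
    data_at_risk = 0 if sync_status == "in_sync" else replication_lag
    return ready, data_at_risk
```

Applied to the three example reports below, this yields (True, 0), (False, 1247), and (False, 340) respectively.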

Listing RPO events for audit

Every RPO violation is recorded as an RPOEvent in the database. To review the audit trail:

openstack dr rpo-events list <pg-name-or-id>

Filter by time range:

openstack dr rpo-events list <pg-name-or-id> \
  --start-time 2025-01-01T00:00:00 \
  --end-time 2025-01-31T23:59:59

Checking health via the REST API directly

For integration with external monitoring systems (Prometheus, Nagios, etc.), call the health endpoint directly:

GET /v1/{tenant_id}/protection-groups/{pg_id}/health

The response is a ReplicationHealthReport object. Parse failover_ready and data_at_risk_seconds for alerting thresholds.
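A minimal integration might fetch the report and map it to a severity level for an external alerting system. The endpoint path, headers, and the health_report/failover_ready/data_at_risk_seconds fields below come from this page; the green/amber/red thresholds in severity() are an assumption for the example, not a documented mapping:

```python
# Monitoring-integration sketch. Endpoint path and headers are as
# documented on this page; the severity mapping is an illustrative
# assumption, not part of the product.
import json
import urllib.request

def fetch_health(base_url: str, tenant_id: str, pg_id: str, token: str) -> dict:
    """Fetch the ReplicationHealthReport for one Protection Group."""
    url = f"{base_url}/v1/{tenant_id}/protection-groups/{pg_id}/health"
    req = urllib.request.Request(url, headers={
        "X-Auth-Token": token,
        "OpenStack-API-Version": "protector 1.2",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["health_report"]

def severity(report: dict) -> str:
    """Map a health report to an alert level (assumed thresholds)."""
    if report["failover_ready"]:
        return "green"
    if report["data_at_risk_seconds"] > 0:
        return "red"     # not ready AND data would be lost on failover
    return "amber"       # not ready, but no data currently at risk
```

An exporter or check script would call fetch_health() on a schedule and page on red, warn on amber.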


Examples

Example 1 — Healthy Protection Group in Full Sync

openstack dr health show prod-web-app

Expected output when replication is healthy and RPO-compliant:

+------------------------------+------------------------------------------+
| Field                        | Value                                    |
+------------------------------+------------------------------------------+
| protection_group             | prod-web-app                             |
| link_health                  | connected                                |
| sync_status                  | in_sync                                  |
| replication_lag              | 47                                       |
| last_successful_sync         | 2025-06-10T14:22:13Z                     |
| rpo_compliant                | True                                     |
| failover_ready               | True                                     |
| data_at_risk_seconds         | 0                                        |
| throughput_mbps              | 312.4                                    |
| iops                         | 8240                                     |
| volume_health[0].volume_id   | a1b2c3d4-...                             |
| volume_health[0].status      | replicating                              |
| volume_health[1].volume_id   | e5f6g7h8-...                             |
| volume_health[1].status      | replicating                              |
+------------------------------+------------------------------------------+

failover_ready: True with data_at_risk_seconds: 0 means this Protection Group is ready for immediate failover with zero data exposure.


Example 2 — RPO Violation Detected

This example shows the output when replication lag has exceeded the configured rpo_minutes threshold (here: 15 minutes):

openstack dr health show prod-web-app
+------------------------------+------------------------------------------+
| Field                        | Value                                    |
+------------------------------+------------------------------------------+
| protection_group             | prod-web-app                             |
| link_health                  | connected                                |
| sync_status                  | out_of_sync                              |
| replication_lag              | 1247                                     |
| last_successful_sync         | 2025-06-10T14:01:26Z                     |
| rpo_compliant                | False                                    |
| failover_ready               | False                                    |
| data_at_risk_seconds         | 1247                                     |
| throughput_mbps              | 0.0                                      |
| iops                         | 0                                        |
+------------------------------+------------------------------------------+

replication_lag of 1247 seconds (~20 minutes) exceeds the 15-minute RPO. An RPOEvent record has been written to the database. failover_ready is false — a failover attempted now would require --force and would result in up to 1247 seconds of data loss.


Example 3 — Replication Link Degraded

openstack dr health show prod-web-app
+------------------------------+------------------------------------------+
| Field                        | Value                                    |
+------------------------------+------------------------------------------+
| link_health                  | degraded                                 |
| sync_status                  | syncing                                  |
| replication_lag              | 340                                      |
| rpo_compliant                | True                                     |
| failover_ready               | False                                    |
| data_at_risk_seconds         | 340                                      |
| throughput_mbps              | 12.1                                     |
+------------------------------+------------------------------------------+

The link is up but degraded. Throughput has dropped sharply and replication is falling behind. RPO is still technically compliant but failover_ready is false because link_health is not connected. Investigate the FlashArray replication link and inter-site network path.


Example 4 — Querying the REST API for Monitoring Integration

Retrieve the health report as JSON and extract the two alerting fields:

TOKEN=$(openstack --os-cloud site-a token issue -f value -c id)
TENANT_ID=$(openstack --os-cloud site-a token issue -f value -c project_id)
PG_ID="pg-12345678-1234-1234-1234-123456789012"

curl -s \
  -H "X-Auth-Token: $TOKEN" \
  -H "OpenStack-API-Version: protector 1.2" \
  "http://site-a-controller:8788/v1/${TENANT_ID}/protection-groups/${PG_ID}/health" \
  | python3 -c "
import json, sys
report = json.load(sys.stdin)['health_report']
print('failover_ready:', report['failover_ready'])
print('data_at_risk_seconds:', report['data_at_risk_seconds'])
"

Expected output when healthy:

failover_ready: True
data_at_risk_seconds: 0

Example 5 — Reviewing RPO Event History

openstack dr rpo-events list prod-web-app --limit 5
+--------------------------------------+----------------------+---------------------+-----------------------+
| id                                   | protection_group     | detected_at         | lag_at_detection_secs |
+--------------------------------------+----------------------+---------------------+-----------------------+
| evt-aabb1122-...                     | prod-web-app         | 2025-06-10T14:22:00Z| 1247                  |
| evt-ccdd3344-...                     | prod-web-app         | 2025-06-08T03:11:00Z| 980                   |
+--------------------------------------+----------------------+---------------------+-----------------------+

These records persist for compliance auditing regardless of whether replication subsequently recovered.


Troubleshooting

Issue: failover_ready is false but replication appears healthy in the FlashArray UI

Symptom: openstack dr health show reports failover_ready: False and link_health: connected, but the Pure Storage array management interface shows replication is running normally.

Likely cause: The engine's cached health state has not yet refreshed, or sync_status is still syncing from a recent replication cycle that has not completed. compute_failover_readiness() returns false whenever sync_status is not in_sync, even if the link is healthy.

Fix: Wait one full replication_interval cycle (configured in the replication policy) and re-query. If the status does not resolve to in_sync, check protector-engine logs for errors communicating with the FlashArray API:

journalctl -u protector-engine -f | grep -i health

Issue: link_health: disconnected — failover is blocked

Symptom: link_health reports disconnected. failover_ready is false. Attempting a failover returns a pre-validation error.

Likely cause 1: The inter-site network path between the two FlashArrays is down. Likely cause 2: The FlashArray API credentials in the replication policy have expired or been rotated. Likely cause 3: The protector-engine cannot reach the FlashArray management IP (primary_fa_url or secondary_fa_url in the policy).

Fix:

  1. Verify network connectivity from the controller hosting protector-engine to both FlashArray management IPs.
  2. Confirm the replication link in the FlashArray management UI or via Pure1.
  3. If credentials were rotated, update the replication policy:
openstack dr policy update <pg-name-or-id> \
  --primary-fa-token "<new-token>" \
  --secondary-fa-token "<new-token>"
  4. If the primary site is genuinely unreachable (disaster scenario), use --force to bypass the failover_ready check — but understand this accepts the data_at_risk_seconds exposure reported in the health report.

Issue: RPO events accumulating rapidly

Symptom: openstack dr rpo-events list shows frequent violations even though the link appears connected.

Likely cause: Replication throughput is insufficient for the write workload on the protected volumes. replication_lag is consistently exceeding rpo_minutes × 60 between cycles.

Fix:

  1. Check throughput_mbps and iops in the health report during a violation period. Compare against your inter-site bandwidth budget.
  2. Consider increasing rpo_minutes in the replication policy to a value that matches your realistic replication window, or reduce the write workload on protected volumes.
  3. For async replication, ensure replication_interval is achievable given the data change rate.
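A quick feasibility check for step 3 is whether one interval's worth of changed data can transfer within the RPO budget. This back-of-the-envelope sketch is illustrative (it ignores compression, dedupe, and bursty write patterns):

```python
# Back-of-the-envelope check (illustrative): can the observed replication
# throughput drain one interval's worth of changed data within the RPO?
# Ignores compression/dedupe and assumes a steady write rate.
def replication_keeps_up(change_rate_mbps: float, throughput_mbps: float,
                         replication_interval_s: int, rpo_minutes: int) -> bool:
    if throughput_mbps <= 0:
        return False  # link is stalled; nothing will transfer
    data_per_cycle_mb = change_rate_mbps * replication_interval_s
    transfer_time_s = data_per_cycle_mb / throughput_mbps
    # Each cycle's data must complete transfer within the RPO budget.
    return transfer_time_s <= rpo_minutes * 60
```

For example, a 5 MB/s change rate over a 300-second interval produces 1500 MB per cycle; at 12.1 MB/s that transfers in about 124 seconds, well inside a 15-minute (900-second) RPO, while at 1 MB/s it would take 1500 seconds and violate it.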

Issue: Health endpoint returns 404 or 401

Symptom: Calling GET /v1/{tenant_id}/protection-groups/{pg_id}/health via curl returns HTTP 404 or 401.

Likely cause: Either the Protection Group ID or tenant ID in the URL is incorrect, the API version header is missing or too low, or the token has expired.

Fix:

# Verify the correct tenant ID
TENANT_ID=$(openstack --os-cloud site-a token issue -f value -c project_id)

# Verify the Protection Group exists and belongs to this tenant
openstack dr protection-group show <pg-name-or-id>

# Ensure you are sending the correct API version header
curl -H "OpenStack-API-Version: protector 1.2" ...

Issue: sync_status: out_of_sync after site recovered from unreachable state

Symptom: After a temporary network partition between sites, the protection group shows out_of_sync even though both sites are now reachable.

Likely cause: The metadata sync was blocked during the outage (by design — modifications are blocked when the peer site is unreachable). The protector-engine needs to reconcile state.

Fix: Force a metadata sync and then verify health:

openstack dr protection-group sync-force <pg-name-or-id>
openstack dr health show <pg-name-or-id>

Also verify that replication policies are consistent on both sites. If the engine detects a version mismatch during sync, it will log a conflict warning that requires manual resolution.