Health API
Site reachability, replication health, component status, failover readiness
The Health API provides a unified view of your Trilio Site Recovery deployment's operational state — spanning both the primary and secondary OpenStack sites. Use it to verify site reachability, assess replication health for Protection Groups, inspect individual component status (protector-api, protector-engine, storage backend), and confirm that a Protection Group is ready to execute a failover before you commit to one. Because Trilio Site Recovery spans two independent OpenStack clouds with no direct service-to-service channel between them, the Health API is your authoritative source of truth for cross-site connectivity and replication fidelity at any given moment.
Before querying the Health API, ensure the following are in place:
- Both OpenStack sites registered — Each site must be registered in the Protector database with a valid `auth_url` and service credentials. See the site registration guide.
- `protector-api` and `protector-engine` running on both sites — Health checks that touch the secondary site rely on the local engine making outbound calls using stored service credentials. If either service is down, those checks degrade gracefully but return partial data.
- `protectorclient` OSC plugin installed — The CLI examples on this page use the `openstack protector` command group. Install it alongside `python-openstackclient`.
- `clouds.yaml` configured for both sites — The CLI plugin authenticates to both sites independently. Your `~/.config/openstack/clouds.yaml` must contain named entries for each site (for example, `site-a` and `site-b`).
- API microversion 1.2 or later — Health detail fields were introduced in microversion 1.2. Pin the version with the `OpenStack-API-Version: protector 1.2` header.
- Network path open between the Protector engine and both Keystone endpoints — The engine must be able to reach each site's Keystone (port 5000), Nova (port 8774), and Cinder (port 8776) to perform component-level checks.
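For reference, a minimal clouds.yaml covering both sites might look like the following sketch. The endpoint URLs, user, project, and domain names are placeholders, not values from any particular deployment:

```yaml
# ~/.config/openstack/clouds.yaml — one named entry per site.
clouds:
  site-a:
    auth:
      auth_url: http://site-a-controller:5000/v3
      username: admin                 # placeholder credentials
      password: <password>
      project_name: admin
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
  site-b:
    auth:
      auth_url: http://site-b-controller:5000/v3
      username: admin
      password: <password>
      project_name: admin
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
```

With entries like these, `--os-cloud site-a` and `--os-cloud site-b` select the target cloud per command.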
The Health API is part of the protector-api service. No separate installation is required beyond a standard Trilio Site Recovery deployment. The steps below confirm that the relevant endpoints are reachable after deployment.
Step 1: Verify the API service is listening
Run this on each controller node:
systemctl status protector-api
Expected output includes Active: active (running). If the service is stopped, start it:
systemctl start protector-api
Step 2: Confirm the root endpoint responds
The root path returns API version information and serves as the simplest liveness check:
curl http://controller:8788/
A 200 OK response confirms the API process is up and accepting connections.
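If you want this liveness check in a script rather than a raw curl, a small wrapper can classify the result by HTTP status code. This is a sketch; `controller:8788` is the endpoint used throughout this page:

```shell
#!/bin/bash
# Liveness wrapper around the root-endpoint check: succeeds (returns 0)
# only when the API answers 200 within 5 seconds.
check_api_liveness() {
  local url="$1" code
  # curl prints the HTTP status code; on connection failure it prints 000.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code=000
  if [ "$code" = "200" ]; then
    echo "API up (HTTP $code)"
  else
    echo "API down or unreachable (HTTP $code)"
    return 1
  fi
}

# Usage: check_api_liveness http://controller:8788/
```

A non-zero exit status makes the wrapper easy to use as a gate in larger pre-flight scripts.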
Step 3: Confirm the engine is running
Health checks that include component-level detail (storage backend, replication lag) are fulfilled by the engine. Verify it on each site:
systemctl status protector-engine
Step 4: Confirm cross-site credentials are stored
The health endpoints that probe the peer site use the service credentials stored in the sites table. Verify that both site records exist and include service_username:
mysql -u protector -p protector \
-e "SELECT name, auth_url, service_username, status FROM sites;"
Both rows must be present. A NULL service_username means cross-site health checks will fail with authentication errors.
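The same check can be scripted against mysql's batch-mode (`-B`, tab-separated) output. In this sketch the query result is stubbed with captured text so the filter itself is what's illustrated; in practice you would pipe in the output of the mysql command above with `-B` added:

```shell
#!/bin/bash
# Flag site rows whose service_username is NULL or empty, i.e. rows that
# will make cross-site health checks fail with authentication errors.
# Reads tab-separated "name<TAB>service_username" rows (header first) on stdin.
check_site_credentials() {
  awk -F'\t' 'NR > 1 && ($2 == "NULL" || $2 == "") {
    print "WARNING: site " $1 " has no service_username"; bad = 1
  } END { exit bad }'
}

# Example run against captured output (site-b is missing credentials);
# the non-zero exit status flags the problem.
printf 'name\tservice_username\nsite-a\tprotector-service\nsite-b\tNULL\n' \
  | check_site_credentials || true
```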
Health check behavior is governed by the following settings in /etc/protector/protector.conf. Defaults are shown; override them in the relevant section.
[engine] section
| Option | Default | Valid values | Effect |
|---|---|---|---|
| `site_connectivity_timeout` | 30 | Integer (seconds) | How long the engine waits for a TCP response from the peer site's Keystone before marking that site unreachable. Lower this in environments where fast failover detection matters; increase it over high-latency WAN links. |
| `replication_lag_warning_threshold` | 300 | Integer (seconds) | When async replication lag exceeds this value, the Protection Group's replication health transitions to degraded. Set this to match or be slightly tighter than your configured RPO. |
| `health_check_interval` | 60 | Integer (seconds) | How often the engine refreshes cached health state for all Protection Groups in the background. More frequent checks increase load on both the engine and the storage backend. |
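As an illustration, a deployment with a 5-minute RPO over a high-latency WAN might override the defaults like this. The values are examples chosen to show the relationships between the settings, not recommendations:

```ini
# /etc/protector/protector.conf (excerpt)
[engine]
# Tolerate slower WAN round-trips before declaring the peer unreachable.
site_connectivity_timeout = 60
# Warn slightly before the 300 s RPO is breached.
replication_lag_warning_threshold = 240
# Refresh cached health twice as often as the default.
health_check_interval = 30
```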
[api] section
| Option | Default | Valid values | Effect |
|---|---|---|---|
| `health_detail_policy` | rule:default | Oslo Policy rule | Controls which roles may retrieve component-level health detail (storage array status, replication lag values). The summary endpoint is readable by any project member by default. |
Microversion header
All Health API requests must include:
OpenStack-API-Version: protector 1.2
Omitting this header causes the API to respond at the base version (1.0), which returns only the site status field and omits replication health, component status, and failover readiness fields introduced in 1.1 and 1.2.
Site status values
The status field on each site record reflects the outcome of the most recent engine-driven connectivity probe:
| Value | Meaning |
|---|---|
| `active` | Site is reachable and all probed components responded normally. |
| `unreachable` | Keystone or one or more required services did not respond within `site_connectivity_timeout`. |
| `error` | The site responded but reported an internal error (for example, a Cinder backend in error state). |
| `maintenance` | Site has been administratively marked as under maintenance; health probes are suspended. |
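A monitoring script can map these status values onto actions. The gating decisions below are one reasonable policy sketch, not behavior mandated by the API:

```shell
#!/bin/bash
# Decide whether a DR operation may proceed, given a site status value
# from the table above.
site_gate() {
  case "$1" in
    active)             echo "proceed" ;;
    maintenance)        echo "skip: probes suspended, re-check after maintenance" ;;
    unreachable|error)  echo "block: investigate before any DR operation" ;;
    *)                  echo "block: unknown status '$1'" ;;
  esac
}

site_gate active         # prints: proceed
site_gate maintenance    # prints a skip message
```

Treating unknown values as blocking is a deliberately conservative choice for DR tooling.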
Check overall site reachability
The simplest health query retrieves the status of all registered sites. Run this before any DR operation to confirm both sites are active:
openstack protector site list --os-cloud site-a
This command authenticates to site-a and returns each registered site's name, site_type, and status. Because site designations are workload-relative and swap on failover, interpret site_type as the current role of each site for a given Protection Group, not a permanent attribute of the underlying infrastructure.
Validate a specific site
To force an immediate connectivity probe against a site (rather than reading cached state), use the validate action:
openstack protector site validate <site-id> --os-cloud site-a
The engine authenticates to the target site using the stored service credentials, probes Keystone, Nova, and Cinder, and returns a structured result. Use this after a network change or service restart to confirm reachability before relying on cached status values.
Check replication health for a Protection Group
Replication health is scoped to a Protection Group because each group has its own associated Consistency Group and replication policy:
openstack protector protection-group show <pg-id> --os-cloud site-a
The response includes status (the DR state machine value such as active or failed_over) and the nested Consistency Group block, which carries the replication status (active, error, or replicating). For async groups, the replication policy block shows rpo_minutes and replication_interval — cross-reference these against the last-sync timestamp to assess replication currency.
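The cross-reference of RPO against the last-sync timestamp can be scripted. A sketch follows; `LAST_SYNC` and `RPO_MINUTES` are illustrative values, since this page does not specify the exact field name carrying the last-sync timestamp:

```shell
#!/bin/bash
# Compare elapsed time since the last sync against the configured RPO.
LAST_SYNC="2024-03-15T08:20:00Z"   # example timestamp, not live API output
RPO_MINUTES=5
NOW=$(date -u +%s)
# GNU date syntax first; fall back to BSD date for portability.
SYNC_EPOCH=$(date -u -d "$LAST_SYNC" +%s 2>/dev/null \
  || date -u -j -f '%Y-%m-%dT%H:%M:%SZ' "$LAST_SYNC" +%s)
LAG=$((NOW - SYNC_EPOCH))
if [ "$LAG" -le $((RPO_MINUTES * 60)) ]; then
  echo "replication current (lag ${LAG}s)"
else
  echo "replication behind RPO (lag ${LAG}s)"
fi
```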
Assess failover readiness before executing
Before triggering a planned failover, verify that the peer site is reachable and that metadata sync is not blocked. Metadata sync is intentionally strict: modifications to a Protection Group are rejected when the peer site is unreachable to prevent divergence. This means a Protection Group in error or a site in unreachable state will block planned operations. Run a site validate (see above) and confirm both sites show active before issuing a failover.
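These readiness rules can be condensed into one gate function. A minimal sketch, assuming the three status strings have already been fetched via the site list and Protection Group show calls described above:

```shell
#!/bin/bash
# Pre-failover readiness gate: both sites must be "active" and the
# Protection Group must not be in "error", per the metadata-sync rules above.
failover_ready() {
  local site_a="$1" site_b="$2" pg_status="$3"
  if [ "$site_a" = "active" ] && [ "$site_b" = "active" ] \
      && [ "$pg_status" != "error" ]; then
    echo "READY"
  else
    echo "BLOCKED: site-a=$site_a site-b=$site_b pg=$pg_status"
  fi
}

failover_ready active active active        # prints: READY
failover_ready active unreachable active   # prints a BLOCKED message
```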
Check component status
Component-level status (protector-api, protector-engine, storage backend) is surfaced via the site detail endpoint:
curl -s \
-H "X-Auth-Token: $TOKEN" \
-H "OpenStack-API-Version: protector 1.2" \
http://controller:8788/v1/admin/sites/<site-id>
The capabilities JSON field in the response contains the last-known state of each component as reported by the engine's most recent health probe.
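Pulling the per-component states out of that response and flagging anything that is not reachable can be done with a short parse. In this sketch `RESPONSE` is a stubbed body in the shape shown later in Example 2; in practice it would come from the curl call above:

```shell
#!/bin/bash
# Report each component's last-known state, flagging anything not "reachable".
RESPONSE='{"site": {"name": "site-b", "capabilities":
  {"keystone": "reachable", "nova": "unknown", "cinder": "reachable"}}}'
REPORT=$(printf '%s' "$RESPONSE" | python3 -c '
import json, sys
caps = json.load(sys.stdin)["site"]["capabilities"]
for component, state in sorted(caps.items()):
    flag = "" if state == "reachable" else "  <-- needs attention"
    print(f"{component}: {state}{flag}")
')
echo "$REPORT"   # nova is flagged in the printed report
```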
Example 1: List all sites and their status
Confirm both sites are reachable before a DR drill.
export OS_CLOUD=site-a
export TOKEN=$(openstack token issue -f value -c id)
curl -s \
-H "X-Auth-Token: $TOKEN" \
-H "OpenStack-API-Version: protector 1.2" \
http://controller:8788/v1/admin/sites
Expected output (both sites healthy):
{
"sites": [
{
"id": "a1b2c3d4-0001-0001-0001-000000000001",
"name": "site-a",
"site_type": "primary",
"auth_url": "http://site-a-controller:5000/v3",
"status": "active",
"capabilities": {},
"created_at": "2024-03-01T10:00:00Z",
"updated_at": "2024-03-15T08:22:11Z"
},
{
"id": "a1b2c3d4-0002-0002-0002-000000000002",
"name": "site-b",
"site_type": "secondary",
"auth_url": "http://site-b-controller:5000/v3",
"status": "active",
"capabilities": {},
"created_at": "2024-03-01T10:05:00Z",
"updated_at": "2024-03-15T08:22:14Z"
}
]
}
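A drill script can gate on this response mechanically. The sketch below counts active sites in a stubbed copy of the JSON above; a plain grep suffices for this fixed shape, though a real JSON parser is safer for anything more elaborate:

```shell
#!/bin/bash
# Both registered sites must report "active" before a drill proceeds.
# SITES_JSON is an abbreviated stand-in for the response body shown above.
SITES_JSON='{"sites": [{"name": "site-a", "status": "active"},
                       {"name": "site-b", "status": "active"}]}'
ACTIVE=$(printf '%s' "$SITES_JSON" | grep -o '"status": "active"' | wc -l)
if [ "$ACTIVE" -eq 2 ]; then
  echo "both sites active"     # printed for this sample
else
  echo "not ready: only $ACTIVE site(s) active"
fi
```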
Example 2: Validate a site on demand
Force an immediate probe of site-b from site-a's engine.
curl -s -X POST \
-H "X-Auth-Token: $TOKEN" \
-H "OpenStack-API-Version: protector 1.2" \
http://controller:8788/v1/admin/sites/a1b2c3d4-0002-0002-0002-000000000002/validate
Expected output (site reachable, all components up):
{
"site": {
"id": "a1b2c3d4-0002-0002-0002-000000000002",
"name": "site-b",
"status": "active",
"capabilities": {
"keystone": "reachable",
"nova": "reachable",
"cinder": "reachable"
}
}
}
Expected output (site unreachable — Keystone timeout):
{
"site": {
"id": "a1b2c3d4-0002-0002-0002-000000000002",
"name": "site-b",
"status": "unreachable",
"capabilities": {
"keystone": "timeout",
"nova": "unknown",
"cinder": "unknown"
}
}
}
Example 3: Check replication health for a Protection Group
Show the Protection Group detail, including its Consistency Group replication status, to confirm replication is active.
TENANT_ID=$(openstack token issue -f value -c project_id)
PG_ID="pg-uuid-here"
curl -s \
-H "X-Auth-Token: $TOKEN" \
-H "OpenStack-API-Version: protector 1.2" \
http://controller:8788/v1/${TENANT_ID}/protection-groups/${PG_ID}
Expected output (replication healthy, async group):
{
"protection_group": {
"id": "pg-uuid-here",
"name": "prod-app-pg",
"status": "active",
"replication_type": "async",
"primary_site_id": "a1b2c3d4-0001-0001-0001-000000000001",
"secondary_site_id": "a1b2c3d4-0002-0002-0002-000000000002",
"current_primary_site_id": "a1b2c3d4-0001-0001-0001-000000000001",
"failover_count": 0,
"last_failover_at": null,
"consistency_group": {
"id": "cg-uuid-here",
"status": "active",
"volume_type_name": "replicated-async",
"backend_name": "pure@backend1",
"volume_count": 3
}
}
}
A consistency_group.status of error indicates the storage backend has reported a problem and the Protection Group is not safe to fail over.
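That safety rule is easy to enforce in a script by extracting the nested status field. `PG_JSON` below is a stubbed, abbreviated response in the shape shown above, chosen to demonstrate the unsafe case:

```shell
#!/bin/bash
# Gate on the nested Consistency Group status before any failover.
PG_JSON='{"protection_group": {"name": "prod-app-pg", "status": "active",
  "consistency_group": {"id": "cg-uuid-here", "status": "error"}}}'
CG_STATUS=$(printf '%s' "$PG_JSON" | python3 -c \
  'import json, sys; print(json.load(sys.stdin)["protection_group"]["consistency_group"]["status"])')
if [ "$CG_STATUS" = "error" ]; then
  echo "NOT SAFE to fail over: consistency group reports error"
fi
```

Note that the top-level `status` can still read `active` while the nested group is in `error`, which is exactly why the nested field must be checked.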
Example 4: OSC CLI — full health check sequence before a DR drill
This sequence is a recommended pre-drill checklist you can script.
#!/bin/bash
set -euo pipefail
export OS_CLOUD=site-a
echo "=== Checking site status ==="
openstack protector site list
echo "=== Validating peer site connectivity ==="
SITE_B_ID=$(openstack protector site list -f value -c ID -c Name \
| grep site-b | awk '{print $1}')
openstack protector site validate "$SITE_B_ID"
echo "=== Checking Protection Group replication status ==="
PG_ID="your-pg-id-here"
openstack protector protection-group show "$PG_ID"
echo "=== Pre-drill checks complete ==="
All three commands must return active/reachable status before you proceed with openstack protector protection-group failover.
Issue: Site shows status: unreachable immediately after registration
Symptom: A newly registered site shows unreachable even though the remote OpenStack cloud is up and reachable from your workstation.
Likely cause: The service credentials stored in the site record are incorrect, or the protector-engine process cannot reach the remote Keystone endpoint from the controller node (as opposed to your workstation).
Fix:
- Confirm network reachability from the engine host (not your workstation):
  curl -k https://remote-keystone:5000/v3
- Verify the service account exists on the remote site:
  openstack --os-cloud site-b user show protector-service
- Test authentication manually using the stored credentials:
  openstack --os-auth-url https://remote-keystone:5000 --os-username protector-service --os-password <password> --os-project-name service token issue
- If the credentials are wrong, update the site record and re-run the validate action.
Issue: openstack protector site validate returns keystone: reachable but nova: unknown and cinder: unknown
Symptom: Keystone authenticates successfully but Nova and Cinder are reported as unknown.
Likely cause: The service account lacks the admin role on the service project on the remote site, so the engine can authenticate but cannot query the compute and volume service catalogs.
Fix: Grant the role on the remote site:
openstack --os-cloud site-b role add \
--user protector-service \
--project service \
admin
Then re-run openstack protector site validate.
Issue: Protection Group modifications are rejected with a sync error when the peer site is reachable
Symptom: PUT /protection-groups/<pg-id> returns 409 Conflict with a message like peer site metadata sync failed even though both sites show status: active.
Likely cause: The protector-api or protector-engine on the remote site is down, even if the underlying OpenStack services are healthy. The site status field reflects OpenStack service reachability; it does not directly probe the remote Protector services. Metadata sync requires the remote Protector API to be running.
Fix:
- SSH to the remote controller and check:
  systemctl status protector-api protector-engine
- If either service is stopped, restart it:
  systemctl start protector-api protector-engine
- Retry the Protection Group modification.
Issue: consistency_group.status is error after adding a VM to a Protection Group
Symptom: The Protection Group shows status: active but the nested Consistency Group shows status: error.
Likely cause: The volume type used by one or more of the VM's Cinder volumes does not have replication_enabled='<is> True' set, or the replication_type property does not match the Protection Group's configured replication type (sync or async).
Fix:
- Identify the volume type in use:
  openstack volume show <vol-id> -c volume_type
- Inspect its extra specs:
  openstack volume type show <type-name> -c properties
- If `replication_enabled` is missing or false, set it:
  openstack volume type set <type-name> --property replication_enabled='<is> True'
- Ensure `replication_type` matches the Protection Group:
  openstack volume type set <type-name> --property replication_type='<in> async' (or sync).
- Force a Consistency Group sync to pick up the corrected metadata:
  POST /v1/{tenant_id}/protection-groups/{pg_id}/consistency-group/sync
Issue: Health API returns stale status values after a network partition is resolved
Symptom: Both sites are now reachable, but the API still reports status: unreachable for the peer site.
Likely cause: The engine refreshes cached site health on a health_check_interval cycle (default: 60 seconds). During a partition, the cached value is frozen.
Fix: Trigger an immediate re-probe rather than waiting for the background cycle:
curl -s -X POST \
-H "X-Auth-Token: $TOKEN" \
-H "OpenStack-API-Version: protector 1.2" \
http://controller:8788/v1/admin/sites/<site-id>/validate
The validate action forces a synchronous probe and updates the cached status immediately. If the site is genuinely reachable, it will return status: active in the response.
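After resolving a partition, a recovery script can poll until the refreshed status flips to active, with a bounded retry count. In this sketch `probe_site` is a stub standing in for "POST the validate endpoint and read `site.status` from the response"; it reports active on the third call so the loop logic is demonstrable:

```shell
#!/bin/bash
# Poll the site status until it reads "active", up to a bounded retry count.
ATTEMPTS_MADE=0
probe_site() {
  ATTEMPTS_MADE=$((ATTEMPTS_MADE + 1))
  # Stub result: "active" from the third probe onward.
  if [ "$ATTEMPTS_MADE" -ge 3 ]; then
    PROBE_RESULT=active
  else
    PROBE_RESULT=unreachable
  fi
}

wait_for_active() {
  local max="$1" i=1
  while [ "$i" -le "$max" ]; do
    probe_site
    if [ "$PROBE_RESULT" = "active" ]; then
      echo "site active after $i probe(s)"
      return 0
    fi
    i=$((i + 1))
    # A real script would sleep between probes here, e.g. sleep 5.
  done
  echo "site still not active after $max probe(s)"
  return 1
}

wait_for_active 5   # prints: site active after 3 probe(s)
```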