Health API
Site reachability, replication health, component status, failover readiness
The Health API provides a unified view of your Trilio Site Recovery deployment's operational state — spanning both the primary and secondary OpenStack sites. Use it to verify site reachability, assess replication health for Protection Groups, inspect individual component status (protector-api, protector-engine, storage backend), and confirm that a Protection Group is ready to execute a failover before you commit to one. Because Trilio Site Recovery spans two independent OpenStack clouds with no direct service-to-service channel between them, the Health API is your authoritative source of truth for cross-site connectivity and replication fidelity at any given moment.
Before querying the Health API, ensure the following are in place:
- Both OpenStack sites registered — Each site must be registered in the Protector database with a valid `auth_url` and service credentials. See the site registration guide.
- `protector-api` and `protector-engine` running on both sites — Health checks that touch the secondary site rely on the local engine making outbound calls using stored service credentials. If either service is down, those checks degrade gracefully but return partial data.
- `protectorclient` OSC plugin installed — The CLI examples on this page use the `openstack protector` command group. Install it alongside `python-openstackclient`.
- `clouds.yaml` configured for both sites — The CLI plugin authenticates to both sites independently. Your `~/.config/openstack/clouds.yaml` must contain named entries for each site (for example, `site-a` and `site-b`).
- API microversion 1.2 or later — Health detail fields were introduced in microversion 1.2. Pin the version with the `OpenStack-API-Version: protector 1.2` header.
- Network path open between the Protector engine and both Keystone endpoints — The engine must be able to reach each site's Keystone (port 5000), Nova (port 8774), and Cinder (port 8776) to perform component-level checks.
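For reference, a minimal clouds.yaml covering both sites might look like the following sketch. The endpoint URLs, user, project, and domain names are placeholders, not values from any particular deployment:

```yaml
# ~/.config/openstack/clouds.yaml — one named entry per site.
clouds:
  site-a:
    auth:
      auth_url: http://site-a-controller:5000/v3
      username: admin                 # placeholder credentials
      password: <password>
      project_name: admin
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
  site-b:
    auth:
      auth_url: http://site-b-controller:5000/v3
      username: admin
      password: <password>
      project_name: admin
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
```

With entries like these, `--os-cloud site-a` and `--os-cloud site-b` select the target cloud per command.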
The Health API is part of the protector-api service. No separate installation is required beyond a standard Trilio Site Recovery deployment. The steps below confirm that the relevant endpoints are reachable after deployment.
Step 1: Verify the API service is listening
Run this on each controller node:
systemctl status protector-api
Expected output includes Active: active (running). If the service is stopped, start it:
systemctl start protector-api
Step 2: Confirm the root endpoint responds
The root path returns API version information and serves as the simplest liveness check:
curl http://controller:8788/
A 200 OK response confirms the API process is up and accepting connections.
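If you want this liveness check in a script rather than a raw curl, a small wrapper can classify the result by HTTP status code. This is a sketch; `controller:8788` is the endpoint used throughout this page:

```shell
#!/bin/bash
# Liveness wrapper around the root-endpoint check: succeeds (returns 0)
# only when the API answers 200 within 5 seconds.
check_api_liveness() {
  local url="$1" code
  # curl prints the HTTP status code; on connection failure it prints 000.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code=000
  if [ "$code" = "200" ]; then
    echo "API up (HTTP $code)"
  else
    echo "API down or unreachable (HTTP $code)"
    return 1
  fi
}

# Usage: check_api_liveness http://controller:8788/
```

A non-zero exit status makes the wrapper easy to use as a gate in larger pre-flight scripts.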
Step 3: Confirm the engine is running
Health checks that include component-level detail (storage backend, replication lag) are fulfilled by the engine. Verify it on each site:
systemctl status protector-engine
Step 4: Confirm cross-site credentials are stored
The health endpoints that probe the peer site use the service credentials stored in the sites table. Verify that both site records exist and include service_username:
mysql -u protector -p protector \
-e "SELECT name, auth_url, service_username, status FROM sites;"
Both rows must be present. A NULL service_username means cross-site health checks will fail with authentication errors.
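The same check can be scripted against mysql's batch-mode (`-B`, tab-separated) output. In this sketch the query result is stubbed with captured text so the filter itself is what's illustrated; in practice you would pipe in the output of the mysql command above with `-B` added:

```shell
#!/bin/bash
# Flag site rows whose service_username is NULL or empty, i.e. rows that
# will make cross-site health checks fail with authentication errors.
# Reads tab-separated "name<TAB>service_username" rows (header first) on stdin.
check_site_credentials() {
  awk -F'\t' 'NR > 1 && ($2 == "NULL" || $2 == "") {
    print "WARNING: site " $1 " has no service_username"; bad = 1
  } END { exit bad }'
}

# Example run against captured output (site-b is missing credentials);
# the non-zero exit status flags the problem.
printf 'name\tservice_username\nsite-a\tprotector-service\nsite-b\tNULL\n' \
  | check_site_credentials || true
```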
Health check behavior is governed by the following settings in /etc/protector/protector.conf. Defaults are shown; override them in the relevant section.
[engine] section
| Option | Default | Valid values | Effect |
|---|---|---|---|
| `site_connectivity_timeout` | 30 | Integer (seconds) | How long the engine waits for a TCP response from the peer site's Keystone before marking that site unreachable. Lower this in environments where fast failover detection matters; increase it over high-latency WAN links. |
| `replication_lag_warning_threshold` | 300 | Integer (seconds) | When async replication lag exceeds this value, the Protection Group's replication health transitions to degraded. Set this to match or be slightly tighter than your configured RPO. |
| `health_check_interval` | 60 | Integer (seconds) | How often the engine refreshes cached health state for all Protection Groups in the background. More frequent checks increase load on both the engine and the storage backend. |
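As an illustration, a deployment with a 5-minute RPO over a high-latency WAN might override the defaults like this. The values are examples chosen to show the relationships between the settings, not recommendations:

```ini
# /etc/protector/protector.conf (excerpt)
[engine]
# Tolerate slower WAN round-trips before declaring the peer unreachable.
site_connectivity_timeout = 60
# Warn slightly before the 300 s RPO is breached.
replication_lag_warning_threshold = 240
# Refresh cached health twice as often as the default.
health_check_interval = 30
```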
[api] section
| Option | Default | Valid values | Effect |
|---|---|---|---|
| `health_detail_policy` | rule:default | Oslo Policy rule | Controls which roles may retrieve component-level health detail (storage array status, replication lag values). The summary endpoint is readable by any project member by default. |
Microversion header
All Health API requests must include:
OpenStack-API-Version: protector 1.2
Omitting this header causes the API to respond at the base version (1.0), which returns only the site status field and omits replication health, component status, and failover readiness fields introduced in 1.1 and 1.2.
Site status values
The status field on each site record reflects the outcome of the most recent engine-driven connectivity probe:
| Value | Meaning |
|---|---|
| `active` | Site is reachable and all probed components responded normally. |
| `unreachable` | Keystone or one or more required services did not respond within `site_connectivity_timeout`. |
| `error` | The site responded but reported an internal error (for example, a Cinder backend in error state). |
| `maintenance` | Site has been administratively marked as under maintenance; health probes are suspended. |
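A monitoring script can map these status values onto actions. The gating decisions below are one reasonable policy sketch, not behavior mandated by the API:

```shell
#!/bin/bash
# Decide whether a DR operation may proceed, given a site status value
# from the table above.
site_gate() {
  case "$1" in
    active)             echo "proceed" ;;
    maintenance)        echo "skip: probes suspended, re-check after maintenance" ;;
    unreachable|error)  echo "block: investigate before any DR operation" ;;
    *)                  echo "block: unknown status '$1'" ;;
  esac
}

site_gate active         # prints: proceed
site_gate maintenance    # prints a skip message
```

Treating unknown values as blocking is a deliberately conservative choice for DR tooling.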
Check overall site reachability
The simplest health query retrieves the status of all registered sites. Run this before any DR operation to confirm both sites are active:
openstack protector site list --os-cloud site-a
This command authenticates to site-a and returns each registered site's name, site_type, and status. Because site designations are workload-relative and swap on failover, interpret site_type as the current role of each site for a given Protection Group, not a permanent attribute of the underlying infrastructure.
Validate a specific site
To force an immediate connectivity probe against a site (rather than reading cached state), use the validate action:
openstack protector site validate <site-id> --os-cloud site-a
The engine authenticates to the target site using the stored service credentials, probes Keystone, Nova, and Cinder, and returns a structured result. Use this after a network change or service restart to confirm reachability before relying on cached status values.
Check replication health for a Protection Group
Replication health is scoped to a Protection Group because each group has its own associated Consistency Group and replication policy:
openstack protector protection-group show <pg-id> --os-cloud site-a
The response includes status (the DR state machine value such as active or failed_over) and the nested Consistency Group block, which carries the replication status (active, error, or replicating). For async groups, the replication policy block shows rpo_minutes and replication_interval — cross-reference these against the last-sync timestamp to assess replication currency.
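The cross-reference of RPO against the last-sync timestamp can be scripted. A sketch follows; `LAST_SYNC` and `RPO_MINUTES` are illustrative values, since this page does not specify the exact field name carrying the last-sync timestamp:

```shell
#!/bin/bash
# Compare elapsed time since the last sync against the configured RPO.
LAST_SYNC="2024-03-15T08:20:00Z"   # example timestamp, not live API output
RPO_MINUTES=5
NOW=$(date -u +%s)
# GNU date syntax first; fall back to BSD date for portability.
SYNC_EPOCH=$(date -u -d "$LAST_SYNC" +%s 2>/dev/null \
  || date -u -j -f '%Y-%m-%dT%H:%M:%SZ' "$LAST_SYNC" +%s)
LAG=$((NOW - SYNC_EPOCH))
if [ "$LAG" -le $((RPO_MINUTES * 60)) ]; then
  echo "replication current (lag ${LAG}s)"
else
  echo "replication behind RPO (lag ${LAG}s)"
fi
```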
Assess failover readiness before executing
Before triggering a planned failover, verify that the peer site is reachable and that metadata sync is not blocked. Metadata sync is intentionally strict: modifications to a Protection Group are rejected when the peer site is unreachable to prevent divergence. This means a Protection Group in error or a site in unreachable state will block planned operations. Run a site validate (see above) and confirm both sites show active before issuing a failover.
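These readiness rules can be condensed into one gate function. A minimal sketch, assuming the three status strings have already been fetched via the site list and Protection Group show calls described above:

```shell
#!/bin/bash
# Pre-failover readiness gate: both sites must be "active" and the
# Protection Group must not be in "error", per the metadata-sync rules above.
failover_ready() {
  local site_a="$1" site_b="$2" pg_status="$3"
  if [ "$site_a" = "active" ] && [ "$site_b" = "active" ] \
      && [ "$pg_status" != "error" ]; then
    echo "READY"
  else
    echo "BLOCKED: site-a=$site_a site-b=$site_b pg=$pg_status"
  fi
}

failover_ready active active active        # prints: READY
failover_ready active unreachable active   # prints a BLOCKED message
```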
Check component status
Component-level status (protector-api, protector-engine, storage backend) is surfaced via the site detail endpoint:
curl -s \
-H "X-Auth-Token: $TOKEN" \
-H "OpenStack-API-Version: protector 1.2" \
http://controller:8788/v1/admin/sites/<site-id>
The capabilities JSON field in the response contains the last-known state of each component as reported by the engine's most recent health probe.
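Pulling the per-component states out of that response and flagging anything that is not reachable can be done with a short parse. In this sketch `RESPONSE` is a stubbed body in the shape shown later in Example 2; in practice it would come from the curl call above:

```shell
#!/bin/bash
# Report each component's last-known state, flagging anything not "reachable".
RESPONSE='{"site": {"name": "site-b", "capabilities":
  {"keystone": "reachable", "nova": "unknown", "cinder": "reachable"}}}'
REPORT=$(printf '%s' "$RESPONSE" | python3 -c '
import json, sys
caps = json.load(sys.stdin)["site"]["capabilities"]
for component, state in sorted(caps.items()):
    flag = "" if state == "reachable" else "  <-- needs attention"
    print(f"{component}: {state}{flag}")
')
echo "$REPORT"   # nova is flagged in the printed report
```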
Example 1: List all sites and their status
Confirm both sites are reachable before a DR drill.
export OS_CLOUD=site-a
export TOKEN=$(openstack token issue -f value -c id)
curl -s \
-H "X-Auth-Token: $TOKEN" \
-H "OpenStack-API-Version: protector 1.2" \
http://controller:8788/v1/admin/sites
Expected output (both sites healthy):
{
"sites": [
{
"id": "a1b2c3d4-0001-0001-0001-000000000001",
"name": "site-a",
"site_type": "primary",
"auth_url": "http://site-a-controller:5000/v3",
"status": "active",
"capabilities": {},
"created_at": "2024-03-01T10:00:00Z",
"updated_at": "2024-03-15T08:22:11Z"
},
{
"id": "a1b2c3d4-0002-0002-0002-000000000002",
"name": "site-b",
"site_type": "secondary",
"auth_url": "http://site-b-controller:5000/v3",
"status": "active",
"capabilities": {},
"created_at": "2024-03-01T10:05:00Z",
"updated_at": "2024-03-15T08:22:14Z"
}
]
}
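A drill script can gate on this response mechanically. The sketch below counts active sites in a stubbed copy of the JSON above; a plain grep suffices for this fixed shape, though a real JSON parser is safer for anything more elaborate:

```shell
#!/bin/bash
# Both registered sites must report "active" before a drill proceeds.
# SITES_JSON is an abbreviated stand-in for the response body shown above.
SITES_JSON='{"sites": [{"name": "site-a", "status": "active"},
                       {"name": "site-b", "status": "active"}]}'
ACTIVE=$(printf '%s' "$SITES_JSON" | grep -o '"status": "active"' | wc -l)
if [ "$ACTIVE" -eq 2 ]; then
  echo "both sites active"     # printed for this sample
else
  echo "not ready: only $ACTIVE site(s) active"
fi
```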
Example 2: Validate a site on demand
Force an immediate probe of site-b from site-a's engine.
curl -s -X POST \
-H "X-Auth-Token: $TOKEN" \
-H "OpenStack-API-Version: protector 1.2" \
http://controller:8788/v1/admin/sites/a1b2c3d4-0002-0002-0002-000000000002/validate
Expected output (site reachable, all components up):
{
"site": {
"id": "a1b2c3d4-0002-0002-0002-000000000002",
"name": "site-b",
"status": "active",
"capabilities": {
"keystone": "reachable",
"nova": "reachable",
"cinder": "reachable"
}
}
}
Expected output (site unreachable — Keystone timeout):
{
"site": {
"id": "a1b2c3d4-0002-0002-0002-000000000002",
"name": "site-b",
"status": "unreachable",
"capabilities": {
"keystone": "timeout",
"nova": "unknown",
"cinder": "unknown"
}
}
}
Example 3: Check replication health for a Protection Group
Show the Protection Group detail, including its Consistency Group replication status, to confirm replication is active.
TENANT_ID=$(openstack token issue -f value -c project_id)
PG_ID="pg-uuid-here"
curl -s \
-H "X-Auth-Token: $TOKEN" \
-H "OpenStack-API-Version: protector 1.2" \
http://controller:8788/v1/${TENANT_ID}/protection-groups/${PG_ID}
Expected output (replication healthy, async group):
{
"protection_group": {
"id": "pg-uuid-here",
"name": "prod-app-pg",
"status": "active",
"replication_type": "async",
"primary_site_id": "a1b2c3d4-0001-0001-0001-000000000001",
"secondary_site_id": "a1b2c3d4-0002-0002-0002-000000000002",
"current_primary_site_id": "a1b2c3d4-0001-0001-0001-000000000001",
"failover_count": 0,
"last_failover_at": null,
"consistency_group": {
"id": "cg-uuid-here",
"status": "active",
"volume_type_name": "replicated-async",
"backend_name": "pure@backend1",
"volume_count": 3
}
}
}
A consistency_group.status of error indicates the storage backend has reported a problem and the Protection Group is not safe to fail over.
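That safety rule is easy to enforce in a script by extracting the nested status field. `PG_JSON` below is a stubbed, abbreviated response in the shape shown above, chosen to demonstrate the unsafe case:

```shell
#!/bin/bash
# Gate on the nested Consistency Group status before any failover.
PG_JSON='{"protection_group": {"name": "prod-app-pg", "status": "active",
  "consistency_group": {"id": "cg-uuid-here", "status": "error"}}}'
CG_STATUS=$(printf '%s' "$PG_JSON" | python3 -c \
  'import json, sys; print(json.load(sys.stdin)["protection_group"]["consistency_group"]["status"])')
if [ "$CG_STATUS" = "error" ]; then
  echo "NOT SAFE to fail over: consistency group reports error"
fi
```

Note that the top-level `status` can still read `active` while the nested group is in `error`, which is exactly why the nested field must be checked.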
Example 4: OSC CLI — full health check sequence before a DR drill
This sequence is a recommended pre-drill checklist you can script.
#!/bin/bash
set -euo pipefail
export OS_CLOUD=site-a
echo "=== Checking site status ==="
openstack protector site list
echo "=== Validating peer site connectivity ==="
SITE_B_ID=$(openstack protector site list -f value -c ID -c Name \
| grep site-b | awk '{print $1}')
openstack protector site validate "$SITE_B_ID"
echo "=== Checking Protection Group replication status ==="
PG_ID="your-pg-id-here"
openstack protector protection-group show "$PG_ID"
echo "=== Pre-drill checks complete ==="
All three commands must return active/reachable status before you proceed with openstack protector protection-group failover.
Issue: Site shows status: unreachable immediately after registration
Symptom: A newly registered site shows unreachable even though the remote OpenStack cloud is up and reachable from your workstation.
Likely cause: The service credentials stored in the site record are incorrect, or the protector-engine process cannot reach the remote Keystone endpoint from the controller node (as opposed to your workstation).
Fix:
- Confirm network reachability from the engine host (not your workstation):
  curl -k https://remote-keystone:5000/v3
- Verify the service account exists on the remote site:
  openstack --os-cloud site-b user show protector-service
- Test authentication manually using the stored credentials:
  openstack --os-auth-url https://remote-keystone:5000 --os-username protector-service --os-password <password> --os-project-name service token issue
- If the credentials are wrong, update the site record and re-run the validate action.
Issue: openstack protector site validate returns keystone: reachable but nova: unknown and cinder: unknown
Symptom: Keystone authenticates successfully but Nova and Cinder are reported as unknown.
Likely cause: The service account lacks the admin role on the service project on the remote site, so the engine can authenticate but cannot query the compute and volume service catalogs.
Fix: Grant the role on the remote site:
openstack --os-cloud site-b role add \
--user protector-service \
--project service \
admin
Then re-run openstack protector site validate.
Issue: Protection Group modifications are rejected with a sync error when the peer site is reachable
Symptom: PUT /protection-groups/<pg-id> returns 409 Conflict with a message like peer site metadata sync failed even though both sites show status: active.
Likely cause: The protector-api or protector-engine on the remote site is down, even if the underlying OpenStack services are healthy. The site status field reflects OpenStack service reachability; it does not directly probe the remote Protector services. Metadata sync requires the remote Protector API to be running.
Fix:
- SSH to the remote controller and check:
  systemctl status protector-api protector-engine
- If either service is stopped, restart it:
  systemctl start protector-api protector-engine
- Retry the Protection Group modification.
Issue: consistency_group.status is error after adding a VM to a Protection Group
Symptom: The Protection Group shows status: active but the nested Consistency Group shows status: error.
Likely cause: The volume type used by one or more of the VM's Cinder volumes does not have replication_enabled='<is> True' set, or the replication_type property does not match the Protection Group's configured replication type (sync or async).
Fix:
- Identify the volume type in use:
  openstack volume show <vol-id> -c volume_type
- Inspect its extra specs:
  openstack volume type show <type-name> -c properties
- If `replication_enabled` is missing or false, set it:
  openstack volume type set <type-name> --property replication_enabled='<is> True'
- Ensure `replication_type` matches the Protection Group:
  openstack volume type set <type-name> --property replication_type='<in> async' (or sync).
- Force a Consistency Group sync to pick up the corrected metadata:
  POST /v1/{tenant_id}/protection-groups/{pg_id}/consistency-group/sync
Issue: Health API returns stale status values after a network partition is resolved
Symptom: Both sites are now reachable, but the API still reports status: unreachable for the peer site.
Likely cause: The engine refreshes cached site health on a health_check_interval cycle (default: 60 seconds). During a partition, the cached value is frozen.
Fix: Trigger an immediate re-probe rather than waiting for the background cycle:
curl -s -X POST \
-H "X-Auth-Token: $TOKEN" \
-H "OpenStack-API-Version: protector 1.2" \
http://controller:8788/v1/admin/sites/<site-id>/validate
The validate action forces a synchronous probe and updates the cached status immediately. If the site is genuinely reachable, it will return status: active in the response.
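After resolving a partition, a recovery script can poll until the refreshed status flips to active, with a bounded retry count. In this sketch `probe_site` is a stub standing in for "POST the validate endpoint and read `site.status` from the response"; it reports active on the third call so the loop logic is demonstrable:

```shell
#!/bin/bash
# Poll the site status until it reads "active", up to a bounded retry count.
ATTEMPTS_MADE=0
probe_site() {
  ATTEMPTS_MADE=$((ATTEMPTS_MADE + 1))
  # Stub result: "active" from the third probe onward.
  if [ "$ATTEMPTS_MADE" -ge 3 ]; then
    PROBE_RESULT=active
  else
    PROBE_RESULT=unreachable
  fi
}

wait_for_active() {
  local max="$1" i=1
  while [ "$i" -le "$max" ]; do
    probe_site
    if [ "$PROBE_RESULT" = "active" ]; then
      echo "site active after $i probe(s)"
      return 0
    fi
    i=$((i + 1))
    # A real script would sleep between probes here, e.g. sleep 5.
  done
  echo "site still not active after $max probe(s)"
  return 1
}

wait_for_active 5   # prints: site active after 3 probe(s)
```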