Site Recovery for OpenStack
Guide

Automation and Scripting

Scripting DR operations with the OpenStack CLI plugin, REST API, and Python client libraries.


Overview

All Trilio Site Recovery operations are fully API-driven, making every DR workflow automatable — from initial site registration through failover, failback, and ongoing health monitoring. This page covers three integration paths: the openstack dr OSC CLI plugin for shell scripting and operator workflows, the Protector REST API for direct HTTP automation, and the python-protectorclient library for programmatic access from monitoring tools, orchestration pipelines, and CI/CD systems. Understanding these interfaces lets you embed DR operations into your existing automation stack without relying on manual intervention during time-critical incidents.


Prerequisites

Before scripting DR operations, ensure the following are in place:

  • Two operational OpenStack sites — each with independent Nova, Cinder, Neutron, and Keystone endpoints. You need valid credentials (OpenRC files or equivalent) for both sites.
  • Protector services deployed on both sites — protector-api and protector-engine must be running and reachable. The default API port is 8788.
  • python-protectorclient installed — version compatible with API microversion protector 1.2 or later. Install with pip install python-protectorclient.
  • Protection Groups configured — at least one Protection Group with replication-enabled volume types (replication_enabled='<is> True'), replication policy, VM members, and resource mappings must exist before you can script failover operations.
  • Network access to both site APIs — the host running your automation scripts must be able to reach the Protector API endpoints on both sites.
  • OpenStack CLI (python-openstackclient) installed if using the OSC plugin integration.

Installation

Install the python-protectorclient package

From PyPI (recommended):

pip install python-protectorclient

From source:

git clone https://github.com/openstack/python-protectorclient
cd python-protectorclient
pip install .

Register the OSC plugin

The python-protectorclient package registers itself as an OpenStack CLI extension via setup.cfg entry points. After installation, verify the plugin is loaded:

openstack --help | grep "^  dr"

You should see the dr command group listed. To enumerate all available DR commands:

openstack help dr

Verify API connectivity on both sites

Before running any automation, confirm both site APIs are reachable:

# Site A
curl -s http://<site-a-controller>:8788/

# Site B
curl -s http://<site-b-controller>:8788/

A successful response from each confirms the Protector API is up and listening.


Configuration

OpenRC files for both sites

Because the OSC plugin and python-protectorclient must authenticate to both sites independently, maintain a separate OpenRC file for each:

# ~/site-a-openrc
export OS_AUTH_URL=http://site-a-controller:5000/v3
export OS_USERNAME=your-username
export OS_PASSWORD=your-password
export OS_PROJECT_NAME=your-project
export OS_USER_DOMAIN_NAME=Default
export OS_PROJECT_DOMAIN_NAME=Default

# ~/site-b-openrc
export OS_AUTH_URL=http://site-b-controller:5000/v3
export OS_USERNAME=your-username
export OS_PASSWORD=your-password
export OS_PROJECT_NAME=your-project
export OS_USER_DOMAIN_NAME=Default
export OS_PROJECT_DOMAIN_NAME=Default

Source the appropriate file before running commands targeted at a specific site. For operations that span both sites (failover, metadata sync), the CLI plugin authenticates to both sites using credentials you provide.

API microversion header

The Protector REST API uses microversioning. Always include the following header to access the latest feature set:

OpenStack-API-Version: protector 1.2

Omitting this header causes the API to respond using the base version, which may not expose all fields or operations documented here.
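In automation code it helps to build this header set in one place so every request carries the microversion. A minimal sketch (the header names come from the section above; the helper name is illustrative):

```python
def protector_headers(token, microversion="1.2"):
    """Build the standard header set for Protector REST calls.

    The OpenStack-API-Version header opts in to the requested
    microversion; omitting it falls back to the base version.
    """
    return {
        "X-Auth-Token": token,
        "Content-Type": "application/json",
        "OpenStack-API-Version": f"protector {microversion}",
    }

headers = protector_headers("gAAAAAB-example-token")
print(headers["OpenStack-API-Version"])  # protector 1.2
```

Pass the resulting dict to whatever HTTP layer your automation uses, so a microversion bump is a one-line change.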

Output format for scripting

The OSC plugin supports all standard OpenStack CLI output formats. For scripting, -f json and -f value are most useful:

Format                 Use case
-f json                Parsing with jq or Python; preserves all fields
-f value -c <field>    Extracting a single field value into a shell variable
-f yaml                Human-readable structured output for logging

Metadata sync strictness

Modifications to a Protection Group — including adding members, updating mappings, or initiating failover — are blocked if the peer site is unreachable. This is by design to prevent metadata divergence between the two independent site databases. Your automation scripts must handle 503 or equivalent errors that indicate peer-site unavailability and should not retry modification operations without first confirming both sites are reachable.
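Scripts can encode this rule as a pre-flight gate: probe both site APIs and only then attempt the modification, instead of retrying blindly on a 503. A sketch with the reachability probes injected as callables (the function and exception names are illustrative; real probes would hit each site's port-8788 endpoint):

```python
class PeerUnreachableError(RuntimeError):
    """Raised when a modification is attempted while a site is down."""

def guarded_modify(modify, site_probes):
    """Run a Protection Group modification only if every site responds.

    modify      -- zero-argument callable performing the change
    site_probes -- mapping of site name -> zero-argument probe callable
                   returning True when that site's Protector API answers
    """
    down = [name for name, probe in site_probes.items() if not probe()]
    if down:
        raise PeerUnreachableError(
            f"refusing to modify: unreachable site(s): {', '.join(down)}")
    return modify()

# Stubbed probes for illustration; both sites report healthy here
result = guarded_modify(
    lambda: "member added",
    {"site-a": lambda: True, "site-b": lambda: True},
)
print(result)  # member added
```

Raising instead of retrying keeps the decision explicit: the caller must re-confirm both sites before trying the modification again.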


Usage

Pattern 1: Shell scripting with the OSC plugin

The most common pattern for operator scripts is to invoke openstack dr commands with -f json output, parse the returned operation ID, and poll until the operation reaches a terminal state.

#!/bin/bash
source ~/site-a-openrc

PG_NAME="prod-web-app"

# Initiate failover and capture the operation ID
OP_ID=$(openstack dr failover "$PG_NAME" \
  --failover-type unplanned \
  -f json | jq -r '.operation_id')

echo "Failover started: $OP_ID"

# Poll until the operation completes or fails
while true; do
  STATUS=$(openstack dr operation show "$OP_ID" -f value -c status)
  PROGRESS=$(openstack dr operation show "$OP_ID" -f value -c progress)
  echo "Status: $STATUS | Progress: $PROGRESS"

  if [[ "$STATUS" == "completed" || "$STATUS" == "failed" ]]; then
    break
  fi
  sleep 15
done

if [[ "$STATUS" == "failed" ]]; then
  echo "Failover failed. Check operation details:"
  openstack dr operation show "$OP_ID"
  exit 1
fi

echo "Failover completed successfully."

The watch command is useful for interactive monitoring during manual DR drills:

watch openstack dr operation show "$OP_ID"

Pattern 2: REST API automation

For automation systems that call APIs directly (Ansible, custom agents, monitoring integrations), use the Protector REST API. Obtain a token from Keystone, then POST to the failover endpoint:

TOKEN=$(openstack token issue -f value -c id)
TENANT_ID=$(openstack token issue -f value -c project_id)
PG_ID="<protection-group-uuid>"

# Initiate failover
curl -s -X POST \
  "http://<site-a-controller>:8788/v1/$TENANT_ID/protection-groups/$PG_ID/failover" \
  -H "X-Auth-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -H "OpenStack-API-Version: protector 1.2" \
  -d '{"failover": {"type": "unplanned"}}'

# Poll operation progress
OP_ID="<operation-id-from-response>"
curl -s -X GET \
  "http://<site-a-controller>:8788/v1/$TENANT_ID/operations/$OP_ID" \
  -H "X-Auth-Token: $TOKEN" \
  -H "OpenStack-API-Version: protector 1.2"

Pattern 3: Python client library

Use python-protectorclient directly in Python scripts for monitoring tools, orchestration engines, or CI/CD pipelines:

from protectorclient import client as protector_client

# Instantiate client authenticated to Site A
pc = protector_client.Client(
    auth_url="http://site-a-controller:5000/v3",
    username="your-username",
    password="your-password",
    project_name="your-project",
    user_domain_name="Default",
    project_domain_name="Default",
)

# Trigger failover
operation = pc.protection_groups.failover(
    protection_group_id="<pg-uuid>",
    failover_type="planned",
)

# Poll until terminal state
import time
while operation.status not in ("completed", "failed"):
    time.sleep(15)
    operation = pc.operations.get(operation.id)
    print(f"Progress: {operation.progress} | Status: {operation.status}")

Pattern 4: Infrastructure-as-Code integration

For IaC workflows, include site registration and Protection Group configuration in your templates. This allows DR topology to be version-controlled and reproduced alongside the workloads it protects.

OpenStack Heat example (resource snippet):

resources:
  dr_protection_group:
    type: OS::Protector::ProtectionGroup
    properties:
      name: prod-web-app
      replication_type: async
      primary_site: site-a
      secondary_site: site-b
      volume_type: replicated-ssd

Terraform OpenStack provider example:

resource "openstack_protector_protection_group" "web_app" {
  name             = "prod-web-app"
  replication_type = "async"
  primary_site     = "site-a"
  secondary_site   = "site-b"
  volume_type      = "replicated-ssd"
}

Examples

Example 1: Validate metadata sync before a scripted failover

Always verify that both sites are in sync before initiating a planned failover. This prevents operating on stale metadata.

source ~/site-a-openrc

openstack dr metadata sync prod-web-app

Expected output:

+------------------+----------------------------+
| Field            | Value                      |
+------------------+----------------------------+
| protection_group | prod-web-app               |
| sync_status      | in_sync                    |
| site_a_version   | 23                         |
| site_b_version   | 23                         |
| checked_at       | 2025-12-15T14:32:00Z       |
+------------------+----------------------------+

If sync_status is not in_sync, do not proceed with a planned failover until the discrepancy is resolved.
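The same gate is easy to automate: parse the JSON form of the sync output and refuse to continue unless the status and version counters agree. A sketch assuming the field names shown in the table above:

```python
import json

def safe_to_failover(sync_json):
    """Return True only if the sync report shows both sites in step."""
    report = json.loads(sync_json)
    return (report.get("sync_status") == "in_sync"
            and report.get("site_a_version") == report.get("site_b_version"))

sample = '{"sync_status": "in_sync", "site_a_version": 23, "site_b_version": 23}'
print(safe_to_failover(sample))  # True
```

Wire this in front of any planned-failover step so the script aborts cleanly rather than operating on stale metadata.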


Example 2: Execute a non-disruptive DR drill (test failover)

A test failover spins up instances on the secondary site from replicated snapshots without affecting the primary workload.

source ~/site-a-openrc

# Start the test failover and capture the operation ID
TEST_OP=$(openstack dr test failover prod-web-app -f value -c id)
echo "Test failover operation: $TEST_OP"

# Monitor progress
openstack dr operation show "$TEST_OP"

Expected output (once complete):

+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| id              | op-456abc...                         |
| operation_type  | test_failover                        |
| status          | completed                            |
| progress        | 100%                                 |
| started_at      | 2025-12-15T14:40:00Z                 |
| completed_at    | 2025-12-15T14:47:23Z                 |
+-----------------+--------------------------------------+

After validating the test environment, clean up the DR resources:

openstack dr test failover cleanup "$TEST_OP"

Example 3: Unplanned failover via REST API with polling loop

This example is suitable for integration with alerting or monitoring systems that trigger DR automatically on site failure detection.

#!/bin/bash
set -euo pipefail

SOURCE_API="http://site-a-controller:8788"
PG_ID="<protection-group-uuid>"

# Obtain token from Site A Keystone (adjust for your auth method)
TOKEN=$(openstack token issue -f value -c id)
TENANT_ID=$(openstack token issue -f value -c project_id)

# POST failover request
RESPONSE=$(curl -s -X POST \
  "$SOURCE_API/v1/$TENANT_ID/protection-groups/$PG_ID/failover" \
  -H "X-Auth-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -H "OpenStack-API-Version: protector 1.2" \
  -d '{"failover": {"type": "unplanned"}}')

OP_ID=$(echo "$RESPONSE" | jq -r '.operation.id')
echo "Failover initiated. Operation ID: $OP_ID"

# Poll for completion
while true; do
  OP=$(curl -s -X GET \
    "$SOURCE_API/v1/$TENANT_ID/operations/$OP_ID" \
    -H "X-Auth-Token: $TOKEN" \
    -H "OpenStack-API-Version: protector 1.2")

  STATUS=$(echo "$OP" | jq -r '.operation.status')
  PROGRESS=$(echo "$OP" | jq -r '.operation.progress')
  echo "[$(date -u +%H:%M:%S)] Status: $STATUS | Progress: $PROGRESS"

  [[ "$STATUS" == "completed" || "$STATUS" == "failed" ]] && break
  sleep 20
done

[[ "$STATUS" == "failed" ]] && { echo "Failover FAILED"; exit 1; }
echo "Failover COMPLETED successfully."
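The polling loops in Patterns 1-3 all share the same shape; in larger automation it is worth factoring that into one helper with a timeout and a pluggable status fetcher. A sketch (nothing here is Protector-specific; the terminal states come from the examples above):

```python
import time

def wait_for_operation(fetch_status, timeout=1800, interval=20,
                       terminal=("completed", "failed")):
    """Poll fetch_status() until it returns a terminal state or timeout.

    fetch_status -- zero-argument callable returning the current status
    Returns the final status; raises TimeoutError if none is reached.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in terminal:
            return status
        time.sleep(interval)
    raise TimeoutError("operation did not reach a terminal state")

# Stubbed fetcher that completes on the third poll
states = iter(["running", "running", "completed"])
print(wait_for_operation(lambda: next(states), interval=0))  # completed
```

In a real script the fetcher would wrap `openstack dr operation show` or the GET call from Example 3; the timeout prevents a stuck operation from blocking the pipeline forever.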

Example 4: Add a newly provisioned VM to a Protection Group in a CI/CD pipeline

This pattern is useful when VMs are created by a deployment pipeline and must be enrolled in DR automatically.

source ~/site-a-openrc

# Create the VM (your existing deployment step)
VM_ID=$(openstack server create \
  --flavor m1.large \
  --volume new-app-server-boot \
  --nic net-id=app-network \
  new-app-server \
  -f value -c id)

# Wait for ACTIVE state
while [[ $(openstack server show "$VM_ID" -f value -c status) != "ACTIVE" ]]; do
  sleep 5
done

# Enroll in Protection Group
openstack dr protection group member add prod-web-app \
  --instance "$VM_ID"

Expected output:

+-------------------+------------------------------------------+
| Field             | Value                                    |
+-------------------+------------------------------------------+
| id                | member-uuid-...                          |
| instance_id       | <vm-uuid>                                |
| instance_name     | new-app-server                           |
| status            | protected                                |
| volumes_added     | 2                                        |
+-------------------+------------------------------------------+

Example 5: Python script for scheduled replication health reporting

import time
from protectorclient import client as protector_client

pc = protector_client.Client(
    auth_url="http://site-a-controller:5000/v3",
    username="monitor-user",
    password="your-password",
    project_name="your-project",
    user_domain_name="Default",
    project_domain_name="Default",
)

pgs = pc.protection_groups.list()
for pg in pgs:
    detail = pc.protection_groups.get(pg.id)
    print(f"{detail.name}: status={detail.status}, "
          f"current_primary={detail.current_primary_site}, "
          f"failover_count={detail.failover_count}")

Sample output:

prod-web-app: status=active, current_primary=site-a, failover_count=1
prod-db: status=active, current_primary=site-a, failover_count=0

Troubleshooting

Issue: openstack dr commands not found after installation

Symptom: Running openstack dr site list returns openstack: 'dr' is not an openstack command.

Likely cause: The python-protectorclient package is not installed in the same Python environment as python-openstackclient, or the entry points were not registered.

Fix:

  1. Confirm both packages are installed in the same environment: pip show python-protectorclient python-openstackclient
  2. If installed from source, ensure you used pip install -e . or pip install . (not just python setup.py install).
  3. Re-run openstack --help | grep dr to confirm the plugin is now loaded.

Issue: Protection Group modification blocked with a 503 or sync error

Symptom: An API call to add a member, update a mapping, or initiate a planned failover returns an error indicating the peer site is unreachable or metadata sync cannot be confirmed.

Likely cause: The Protector API on the secondary site is unreachable from the automation host, or the protector-api service on the secondary site is down. Modifications are intentionally blocked when the peer site cannot be reached to prevent metadata divergence.

Fix:

  1. Verify the secondary site API is reachable: curl -s http://<site-b-controller>:8788/
  2. Check that protector-api and protector-engine are running on the secondary site.
  3. Confirm network/firewall rules allow your automation host to reach port 8788 on both sites.
  4. Once both sites are reachable, retry the operation.
  5. For unplanned failover (where the primary is genuinely unreachable), use --failover-type unplanned — this path is designed to proceed without primary-site confirmation.

Issue: 401 Unauthorized when calling the REST API directly

Symptom: REST API calls return HTTP 401 with a message about an invalid or expired token.

Likely cause: The Keystone token has expired (default token lifetime is 1 hour), or the token was issued against the wrong site's Keystone endpoint.

Fix:

  1. Re-issue the token: TOKEN=$(openstack token issue -f value -c id)
  2. Confirm the token was issued against the correct site's Keystone (OS_AUTH_URL points to the site whose Protector API you are calling).
  3. For long-running polling loops, refresh the token before it expires or implement token renewal logic in your script.
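One workable renewal pattern is to cache the token alongside its expiry and re-issue it a little before that deadline. A sketch with the issuing call injected (a real issuer would shell out to `openstack token issue` or call Keystone; the class name is illustrative):

```python
import time

class TokenCache:
    """Re-issue a Keystone token shortly before it expires.

    issue    -- zero-argument callable returning a fresh token string
    lifetime -- token validity in seconds (Keystone default: 3600)
    margin   -- renew this many seconds before expiry
    """
    def __init__(self, issue, lifetime=3600, margin=300):
        self.issue, self.lifetime, self.margin = issue, lifetime, margin
        self._token, self._expires = None, 0.0

    def get(self):
        # Renew when inside the safety margin (or on first use)
        if time.monotonic() >= self._expires - self.margin:
            self._token = self.issue()
            self._expires = time.monotonic() + self.lifetime
        return self._token

cache = TokenCache(lambda: "token-1")
print(cache.get())  # token-1
```

Call `cache.get()` inside the polling loop in place of a fixed `$TOKEN`, so a loop that outlives the token lifetime keeps authenticating.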

Issue: Operation stuck in running state indefinitely

Symptom: openstack dr operation show <id> continues to report status: running with no progress increment for an extended period.

Likely cause: The protector-engine process on the active site has encountered an unhandled exception or has lost connectivity to storage (Pure FlashArray) or to Nova/Cinder on the target site.

Fix:

  1. Check the engine log on the site executing the operation: tail -f /opt/openstack-protector/bin/logs/engine.log
  2. Verify that the Pure FlashArray endpoints defined in the replication policy are reachable.
  3. Confirm Nova and Cinder are healthy on the target site.
  4. If the engine is unrecoverable, attempt to cancel the operation: openstack dr operation cancel <id>. Note that cancellation may not be possible for all operation phases.
  5. After resolving the underlying issue, retry the operation.

Issue: Volumes fail validation when adding a VM to a Protection Group

Symptom: openstack dr protection group member add returns an error indicating one or more volumes are not eligible for protection.

Likely cause: One or more volumes attached to the VM use a Cinder volume type that does not have replication_enabled='<is> True' set, or the volume type does not have a matching replication_type property. Volumes may also already be members of a different Consistency Group.

Fix:

  1. Check the volume type properties on the affected volumes: openstack volume type show <type-name>
  2. Confirm replication_enabled='<is> True' and replication_type are set correctly on both sites.
  3. If a volume is already in another Consistency Group, remove it from that group first before adding it to the new Protection Group.
  4. If the volume type itself is wrong, you will need to retype the volume to a replication-enabled type before it can be protected.
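A pipeline can pre-screen a VM's volumes against the same criteria and fail fast with a readable message before calling member add. A sketch that checks the extra-spec values named above (the input shape is illustrative; real specs come from `openstack volume type show -f json`):

```python
def ineligible_volumes(volumes):
    """Return (volume_name, reason) pairs for volumes that would fail.

    volumes -- iterable of dicts with 'name' and 'extra_specs' keys,
               mirroring the relevant Cinder volume-type properties.
    """
    problems = []
    for vol in volumes:
        specs = vol.get("extra_specs", {})
        if specs.get("replication_enabled") != "<is> True":
            problems.append((vol["name"], "replication_enabled not set"))
        elif "replication_type" not in specs:
            problems.append((vol["name"], "replication_type missing"))
    return problems

vols = [
    {"name": "boot", "extra_specs": {"replication_enabled": "<is> True",
                                     "replication_type": "async"}},
    {"name": "scratch", "extra_specs": {}},
]
print(ineligible_volumes(vols))  # [('scratch', 'replication_enabled not set')]
```

Running this before enrollment turns an opaque validation error into a specific list of volumes to retype.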

Issue: jq parsing fails on OSC JSON output

Symptom: openstack dr failover <pg> -f json | jq -r '.operation_id' returns null.

Likely cause: The JSON field name may differ from what is shown in tabular output. OSC JSON output uses the raw API field names, which may be nested differently.

Fix:

  1. Inspect the full JSON response first: openstack dr failover <pg> -f json without piping to jq.
  2. Adjust your jq path to match the actual structure (for example, .operation.id instead of .operation_id).
  3. Alternatively, use -f value -c id if the operation ID is available as a top-level column in the tabular output.