Shell Scripting Patterns
OSC plugin usage in bash scripts with JSON output and operation polling
This page covers practical shell scripting patterns for automating Trilio Site Recovery workflows using the openstack dr CLI plugin. Because DR operations are asynchronous and span two independent OpenStack sites, effective automation requires machine-parseable output, robust operation polling, and pre-flight readiness checks before committing to a failover. The patterns here address all three concerns and compose into production-grade runbooks for planned failovers, DR drills, and failback sequences.
Before working through these patterns, ensure the following are in place:
- protectorclient installed — the OSC plugin must be registered with your openstack CLI (pip install -e . from the repository root, or equivalent package install)
- openstack CLI — any version compatible with the protectorclient plugin entry points
- jq 1.6 or later — used throughout for JSON field extraction; confirm with jq --version
- clouds.yaml configured — both your primary and secondary site credentials must be present; see the deployment guide for the multi-site clouds.yaml format
- Environment sourced — either source <siterc> or --os-cloud <name> available for both sites
- At least one Protection Group in ACTIVE state with replication policy applied and replication readiness confirmed
- Shell: patterns below assume bash 4.0+; set -euo pipefail is recommended at the top of any production script
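A minimal preamble implementing these recommendations might look like the sketch below; the default values are only a starting point and should be tuned for your environment:

```shell
#!/bin/bash
# Standard preamble for production DR scripts (a sketch; tune for your site)
set -euo pipefail

POLL_INTERVAL="${POLL_INTERVAL:-15}"   # seconds between status checks
MAX_WAIT="${MAX_WAIT:-1800}"           # hard polling timeout in seconds
OS_CLOUD="${OS_CLOUD:-site-a}"         # clouds.yaml entry to target

echo "cloud=$OS_CLOUD poll=${POLL_INTERVAL}s timeout=${MAX_WAIT}s"
```

Using ${VAR:-default} keeps the script overridable from the environment while remaining safe under set -u.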
The OSC plugin ships with the openstack-protector repository. Install it once on any host from which you run automation:
Step 1 — Install the plugin
cd /path/to/openstack-protector
pip install -e .
Step 2 — Verify the plugin is loaded
openstack --help | grep '^ dr'
You should see dr listed as a command group. If it is absent, the entry-point registration did not succeed — re-run pip install -e . and confirm there are no import errors.
Step 3 — Confirm jq is available
jq --version
If jq is missing, install it via your system package manager (e.g., apt install jq or yum install jq).
Step 4 — Verify end-to-end connectivity
# Primary site
openstack --os-cloud site-a dr site list
# Secondary site
openstack --os-cloud site-b dr site list
Both commands should return without authentication errors before you proceed to scripting.
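The two checks can be folded into a single pass. The helper below is a sketch; check_site is a hypothetical wrapper (not part of the plugin) that reports a cloud as reachable based on the exit status of whatever command you hand it:

```shell
# check_site CLOUD CMD... : hypothetical helper. Runs CMD and reports the
# named cloud as reachable (exit 0) or unreachable (exit 1).
check_site() {
  local cloud="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "$cloud: reachable"
  else
    echo "$cloud: UNREACHABLE" >&2
    return 1
  fi
}

# Usage against both sites:
# check_site site-a openstack --os-cloud site-a dr site list
# check_site site-b openstack --os-cloud site-b dr site list
```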
Script behaviour is controlled by a small set of environment variables and CLI flags. Understanding their effect lets you tune polling cadence and output fidelity without modifying script logic.
Output format flag: -f json
Pass -f json to any openstack dr command to receive a JSON object instead of the default human-readable table. This is the foundation of all machine-parseable patterns on this page — without it, field extraction requires fragile text parsing.
openstack dr operation show <id> -f json
OS_PROTECTOR_API_VERSION
Sets the microversion sent to the protector-api. Unset defaults to 1. Set explicitly in scripts that must target a specific API surface:
export OS_PROTECTOR_API_VERSION=1
--os-cloud <name>
Selects the named entry from clouds.yaml. Prefer this over sourcing RC files in scripts so that both sites can be addressed in the same script without re-sourcing:
openstack --os-cloud site-a dr protection group list -f json
openstack --os-cloud site-b dr protection group list -f json
Polling interval and timeout
The openstack dr plugin does not implement built-in --wait semantics — you control polling in your own loop. Two variables to set at the top of any polling script:
| Variable | Recommended value | Effect |
|---|---|---|
| POLL_INTERVAL | 15 (seconds) | Time between status checks; increase for long-running async replication |
| MAX_WAIT | 1800 (seconds) | Hard timeout before the script declares the operation stalled |
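The two knobs compose into a reusable polling helper. The function below is a sketch (wait_for_status is a hypothetical name, not a plugin command); it runs whatever status command you pass it and maps terminal states to exit codes:

```shell
# wait_for_status INTERVAL MAX_WAIT CMD... : hypothetical polling helper.
# Runs CMD every INTERVAL seconds; returns 0 on "completed", 1 on
# "failed", 2 once MAX_WAIT seconds elapse without a terminal state.
wait_for_status() {
  local interval="$1" max_wait="$2"; shift 2
  local elapsed=0 status
  while true; do
    status=$("$@")
    case "$status" in
      completed) return 0 ;;
      failed)    return 1 ;;
    esac
    if (( elapsed >= max_wait )); then
      return 2
    fi
    sleep "$interval"
    (( elapsed += interval ))
  done
}

# Usage with an operation ID captured from a failover:
# wait_for_status "$POLL_INTERVAL" "$MAX_WAIT" \
#   sh -c "openstack dr operation show $OP_ID -f json | jq -r .status"
```

Returning distinct exit codes for failure and timeout lets callers distinguish "the operation failed" from "the script gave up waiting".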
Debug output
Add --debug before the subcommand to emit full HTTP request/response traces to stderr. Redirect stderr separately in scripts to avoid polluting captured JSON:
openstack --debug dr operation show <id> -f json 2>debug.log | jq .status
Use -f json for all machine-parseable output
Every field-extraction pattern in automation should start with -f json. Avoid parsing table output — column widths and formatting can change across plugin versions.
# Capture the ID of a newly created Protection Group
PG_ID=$(openstack dr protection group create prod-app \
--primary-site cluster1 \
--secondary-site cluster2 \
--replication-type async \
-f json | jq -r '.id')
echo "Created PG: $PG_ID"
Capture the operation ID from failover output
All DR actions (failover, failback, test failover) return an operation object immediately. The operation runs asynchronously — capture its id before doing anything else:
OP_ID=$(openstack dr failover "$PG_ID" \
--failover-type planned \
-f json | jq -r '.id')
echo "Failover operation: $OP_ID"
If you lose the operation ID, recover it with:
openstack dr operation list -f json \
| jq -r --arg pg "$PG_ID" '.[] | select(.protection_group_id==$pg) | .id' \
| head -1
Poll with a while loop until terminal state
Operation status transitions through intermediate values (pending, in_progress) before reaching a terminal state (completed or failed). Poll openstack dr operation show in a loop:
POLL_INTERVAL=15
MAX_WAIT=1800
elapsed=0
while true; do
STATUS=$(openstack dr operation show "$OP_ID" -f json | jq -r '.status')
echo "[$(date -u +%H:%M:%S)] status: $STATUS"
if [[ "$STATUS" == "completed" ]]; then
echo "Operation completed successfully."
break
elif [[ "$STATUS" == "failed" ]]; then
echo "ERROR: Operation failed. Check 'openstack dr operation show $OP_ID' for details."
exit 1
fi
if (( elapsed >= MAX_WAIT )); then
echo "ERROR: Timed out after ${MAX_WAIT}s waiting for operation $OP_ID."
exit 2
fi
sleep "$POLL_INTERVAL"
(( elapsed += POLL_INTERVAL ))
done
Run pre-flight checks before scripted failovers
Before initiating any failover — especially in automated runbooks — verify that the Protection Group is in a replication-ready state. Use openstack dr health show to retrieve the current readiness flag:
FAILOVER_READY=$(openstack dr health show -f json | jq -r '.failover_ready')
if [[ "$FAILOVER_READY" != "true" ]]; then
echo "ERROR: Site is not failover-ready (failover_ready=$FAILOVER_READY). Aborting."
exit 3
fi
This check surfaces issues such as replication lag exceeding threshold, unreachable peer sites, or Protection Groups not in ACTIVE state — all of which would cause a failover to fail or produce inconsistent results. Because metadata sync is blocked when the peer site is unreachable, a failed health check is an authoritative signal to abort rather than proceed.
Cancel a running operation
If a scripted failover must be aborted during execution:
openstack dr operation cancel "$OP_ID"
After cancellation, poll until status reaches failed or completed before taking further action — the cancel is itself asynchronous.
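That post-cancel wait can be factored into a small helper. The sketch below uses a hypothetical function named settle, which polls whatever status command you give it and echoes the terminal status once the operation stops moving:

```shell
# settle INTERVAL CMD... : hypothetical helper. Polls CMD until it prints
# a terminal status ("completed" or "failed"), then echoes that status.
settle() {
  local interval="$1"; shift
  local status
  while true; do
    status=$("$@")
    if [[ "$status" == "completed" || "$status" == "failed" ]]; then
      echo "$status"
      return 0
    fi
    sleep "$interval"
  done
}

# Usage after a cancel:
# openstack dr operation cancel "$OP_ID"
# settle "$POLL_INTERVAL" \
#   sh -c "openstack dr operation show $OP_ID -f json | jq -r .status"
```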
Example 1 — Pre-flight health check
Verify failover readiness before any automated workflow begins.
#!/bin/bash
set -euo pipefail
OS_CLOUD="${OS_CLOUD:-site-a}"
FAILOVER_READY=$(openstack --os-cloud "$OS_CLOUD" dr health show -f json \
| jq -r '.failover_ready')
echo "failover_ready: $FAILOVER_READY"
if [[ "$FAILOVER_READY" != "true" ]]; then
echo "Aborting: site is not ready for failover."
exit 1
fi
echo "Pre-flight check passed."
Expected output (healthy site):
failover_ready: true
Pre-flight check passed.
Example 2 — Planned failover with polling
Execute a planned failover and block until the operation reaches a terminal state.
#!/bin/bash
set -euo pipefail
PG_ID="$1" # Pass PG ID as first argument
POLL_INTERVAL=15
MAX_WAIT=1800
# Pre-flight
FAILOVER_READY=$(openstack dr health show -f json | jq -r '.failover_ready')
if [[ "$FAILOVER_READY" != "true" ]]; then
echo "ERROR: failover_ready=$FAILOVER_READY — aborting."
exit 1
fi
# Initiate failover
OP_ID=$(openstack dr failover "$PG_ID" \
--failover-type planned \
-f json | jq -r '.id')
echo "Failover operation started: $OP_ID"
# Poll
elapsed=0
while true; do
STATUS=$(openstack dr operation show "$OP_ID" -f json | jq -r '.status')
echo "[$(date -u +%H:%M:%S)] $OP_ID → $STATUS"
[[ "$STATUS" == "completed" ]] && { echo "Failover complete."; exit 0; }
[[ "$STATUS" == "failed" ]] && { echo "Failover FAILED."; exit 1; }
(( elapsed >= MAX_WAIT )) && { echo "Timeout."; exit 2; }
sleep "$POLL_INTERVAL"
(( elapsed += POLL_INTERVAL ))
done
Expected output:
Failover operation started: a3f9c1d2-7e44-4b2a-9f01-bc34e8120def
[14:02:10] a3f9c1d2-7e44-4b2a-9f01-bc34e8120def → in_progress
[14:02:25] a3f9c1d2-7e44-4b2a-9f01-bc34e8120def → in_progress
[14:02:40] a3f9c1d2-7e44-4b2a-9f01-bc34e8120def → completed
Failover complete.
Example 3 — Non-disruptive DR drill (test failover + cleanup)
Run a DR test, verify the test instances appeared on the secondary, then clean up.
#!/bin/bash
set -euo pipefail
PG_ID="$1"
POLL_INTERVAL=15
MAX_WAIT=900
# Start test failover
TEST_OP_ID=$(openstack dr test failover "$PG_ID" \
-f json | jq -r '.id')
echo "Test failover operation: $TEST_OP_ID"
# Poll until ready
elapsed=0
while true; do
STATUS=$(openstack dr operation show "$TEST_OP_ID" -f json | jq -r '.status')
echo "[$(date -u +%H:%M:%S)] $STATUS"
[[ "$STATUS" == "completed" ]] && break
[[ "$STATUS" == "failed" ]] && { echo "Test failover failed."; exit 1; }
(( elapsed >= MAX_WAIT )) && { echo "Timeout."; exit 2; }
sleep "$POLL_INTERVAL"
(( elapsed += POLL_INTERVAL ))
done
echo "Test instances are running. Inspect the secondary site now."
echo "Press ENTER to proceed with cleanup, or Ctrl-C to abort."
read -r
# Cleanup
openstack dr test failover cleanup "$TEST_OP_ID"
echo "Test resources cleaned up."
Expected output:
Test failover operation: 77b2e901-cc13-4d8a-b5f3-192a44de0031
[09:15:02] in_progress
[09:15:17] in_progress
[09:15:32] completed
Test instances are running. Inspect the secondary site now.
Press ENTER to proceed with cleanup, or Ctrl-C to abort.
Test resources cleaned up.
Example 4 — Bulk status report across all Protection Groups
Generate a one-line status summary for every PG — useful in monitoring cron jobs or pre-maintenance checks.
#!/bin/bash
set -euo pipefail
echo "Protection Group DR Readiness Report — $(date -u)"
echo "---"
openstack dr protection group list -f json \
  | jq -r '.[] | "\(.id) \(.name) \(.status)"'
Expected output:
Protection Group DR Readiness Report — Mon Jan 13 14:00:00 UTC 2025
---
b1c3e2a0-... prod-app ACTIVE
4d9f8c11-... dev-vms ACTIVE
e7a21b03-... legacy-batch ERROR
Issue: jq returns null instead of a field value
Symptom: A variable like $OP_ID contains the literal string null after JSON extraction.
Likely cause: The field name passed to jq does not match the actual key in the API response. Field names can differ between API versions or between list and show responses (e.g., id vs operation_id).
Fix: Inspect the raw JSON before piping to jq:
openstack dr operation show <id> -f json
Identify the exact key name, then update your jq filter. Use jq -r '.id // empty' so a missing key yields an empty string you can test for, rather than passing the literal string null downstream.
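The same guard can be enforced in the script itself. The sketch below uses a hypothetical validator named require_value, which fails loudly when an extraction produced nothing or the literal string null:

```shell
# require_value NAME VALUE : hypothetical guard. Fails if VALUE is empty
# or the literal string "null", so a bad jq filter stops the script
# instead of propagating "null" into later commands.
require_value() {
  local name="$1" value="$2"
  if [[ -z "$value" || "$value" == "null" ]]; then
    echo "ERROR: $name extraction returned '$value'" >&2
    return 1
  fi
  printf '%s\n' "$value"
}

# Usage:
# OP_ID=$(require_value OP_ID \
#   "$(openstack dr operation show <id> -f json | jq -r '.id // empty')")
```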
Issue: Polling loop never exits
Symptom: The while loop runs past MAX_WAIT with status perpetually in_progress.
Likely cause 1: The operation has stalled on the engine due to a storage or connectivity error, but the status has not been marked failed. Check the protector-engine log on the primary site:
tail -f /var/log/protector/protector-engine.log
Likely cause 2: MAX_WAIT is too short for the replication workload (e.g., large async RPO window). Increase MAX_WAIT or check replication lag on the Pure FlashArray.
Fix: After diagnosing, either cancel the operation (openstack dr operation cancel <id>) or increase MAX_WAIT and re-run.
Issue: openstack dr health show returns failover_ready: false unexpectedly
Symptom: Pre-flight check fails even though the environment appears healthy.
Likely cause 1: The secondary site is unreachable. Because metadata sync is blocked when the peer site cannot be contacted, the service conservatively reports not ready.
Likely cause 2: One or more Protection Groups are in FAILED_OVER or ERROR state, preventing a new failover from being initiated.
Fix:
# Check site connectivity
openstack --os-cloud site-b dr site list
# Check PG states
openstack dr protection group list -f json | jq -r '.[] | "\(.name) \(.status)"'
Resolve any PGs in ERROR state before retrying.
Issue: openstack dr failover returns immediately with an error, no operation ID
Symptom: The failover command exits non-zero and prints an API error; there is no operation object to poll.
Likely cause: The Protection Group is not in ACTIVE state (it may already be FAILED_OVER, FAILING_OVER, or ERROR), or the replication policy has not been applied.
Fix:
openstack dr protection group show "$PG_ID" -f json \
| jq '{status: .status, replication_type: .replication_type}'
The PG must be ACTIVE with a replication policy configured before a failover can be accepted.
Issue: Script captures an empty string instead of the operation ID
Symptom: OP_ID is empty after the command substitution; subsequent operation show calls fail with "operation not found".
Likely cause: The failover command printed an error to stderr (not stdout), and the JSON on stdout was empty or absent. Command substitution only captures stdout.
Fix: Capture both streams and inspect:
RAW=$(openstack dr failover "$PG_ID" --failover-type planned -f json 2>&1) || true
echo "$RAW"
OP_ID=$(echo "$RAW" | jq -r '.id // empty' 2>/dev/null) || true
This surfaces error messages that would otherwise be silently discarded; the || true guards keep a set -euo pipefail script alive long enough to print the diagnostics.
Issue: Plugin not found — openstack: 'dr' is not an openstack command
Symptom: Running any openstack dr command produces an unknown command error.
Likely cause: The protectorclient package is not installed in the Python environment used by the openstack CLI, or the entry points were not registered.
Fix:
# Confirm which Python environment openstack uses
which openstack
# Install into that environment
pip install -e /path/to/openstack-protector
# Verify registration
openstack --help | grep '^ dr'