Shell Scripting Patterns
OSC plugin usage in bash scripts with JSON output and operation polling
This page covers practical shell scripting patterns for automating Trilio Site Recovery workflows using the openstack dr CLI plugin. Because DR operations are asynchronous and span two independent OpenStack sites, effective automation requires machine-parseable output, robust operation polling, and pre-flight readiness checks before committing to a failover. The patterns here address all three concerns and compose into production-grade runbooks for planned failovers, DR drills, and failback sequences.
Before working through these patterns, ensure the following are in place:
- protectorclient installed — the OSC plugin must be registered with your openstack CLI (pip install -e . from the repository root, or equivalent package install)
- openstack CLI — any version compatible with the protectorclient plugin entry points
- jq 1.6 or later — used throughout for JSON field extraction; confirm with jq --version
- clouds.yaml configured — both your primary and secondary site credentials must be present; see the deployment guide for the multi-site clouds.yaml format
- Environment sourced — either source <siterc> or --os-cloud <name> available for both sites
- At least one Protection Group in ACTIVE state with replication policy applied and replication readiness confirmed
- Shell: patterns below assume bash 4.0+; set -euo pipefail is recommended at the top of any production script
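A minimal preamble implementing these recommendations might look like the sketch below; the default values are only a starting point and should be tuned for your environment:

```shell
#!/bin/bash
# Standard preamble for production DR scripts (a sketch; tune for your site)
set -euo pipefail

POLL_INTERVAL="${POLL_INTERVAL:-15}"   # seconds between status checks
MAX_WAIT="${MAX_WAIT:-1800}"           # hard polling timeout in seconds
OS_CLOUD="${OS_CLOUD:-site-a}"         # clouds.yaml entry to target

echo "cloud=$OS_CLOUD poll=${POLL_INTERVAL}s timeout=${MAX_WAIT}s"
```

Using ${VAR:-default} keeps the script overridable from the environment while remaining safe under set -u.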
The OSC plugin ships with the openstack-protector repository. Install it once on any host from which you run automation:
Step 1 — Install the plugin
cd /path/to/openstack-protector
pip install -e .
Step 2 — Verify the plugin is loaded
openstack --help | grep '^ dr'
You should see dr listed as a command group. If it is absent, the entry-point registration did not succeed — re-run pip install -e . and confirm there are no import errors.
Step 3 — Confirm jq is available
jq --version
If jq is missing, install it via your system package manager (e.g., apt install jq or yum install jq).
Step 4 — Verify end-to-end connectivity
# Primary site
openstack --os-cloud site-a dr site list
# Secondary site
openstack --os-cloud site-b dr site list
Both commands should return without authentication errors before you proceed to scripting.
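The two checks can be folded into a single pass. The helper below is a sketch; check_site is a hypothetical wrapper (not part of the plugin) that reports a cloud as reachable based on the exit status of whatever command you hand it:

```shell
# check_site CLOUD CMD... : hypothetical helper. Runs CMD and reports the
# named cloud as reachable (exit 0) or unreachable (exit 1).
check_site() {
  local cloud="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "$cloud: reachable"
  else
    echo "$cloud: UNREACHABLE" >&2
    return 1
  fi
}

# Usage against both sites:
# check_site site-a openstack --os-cloud site-a dr site list
# check_site site-b openstack --os-cloud site-b dr site list
```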
Script behaviour is controlled by a small set of environment variables and CLI flags. Understanding their effect lets you tune polling cadence and output fidelity without modifying script logic.
Output format flag: -f json
Pass -f json to any openstack dr command to receive a JSON object instead of the default human-readable table. This is the foundation of all machine-parseable patterns on this page — without it, field extraction requires fragile text parsing.
openstack dr operation show <id> -f json
OS_PROTECTOR_API_VERSION
Sets the microversion sent to the protector-api. Unset defaults to 1. Set explicitly in scripts that must target a specific API surface:
export OS_PROTECTOR_API_VERSION=1
--os-cloud <name>
Selects the named entry from clouds.yaml. Prefer this over sourcing RC files in scripts so that both sites can be addressed in the same script without re-sourcing:
openstack --os-cloud site-a dr protection group list -f json
openstack --os-cloud site-b dr protection group list -f json
Polling interval and timeout
The openstack dr plugin does not implement built-in --wait semantics — you control polling in your own loop. Two variables to set at the top of any polling script:
| Variable | Recommended value | Effect |
|---|---|---|
| POLL_INTERVAL | 15 (seconds) | Time between status checks; increase for long-running async replication |
| MAX_WAIT | 1800 (seconds) | Hard timeout before the script declares the operation stalled |
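The two knobs compose into a reusable polling helper. The function below is a sketch (wait_for_status is a hypothetical name, not a plugin command); it runs whatever status command you pass it and maps terminal states to exit codes:

```shell
# wait_for_status INTERVAL MAX_WAIT CMD... : hypothetical polling helper.
# Runs CMD every INTERVAL seconds; returns 0 on "completed", 1 on
# "failed", 2 once MAX_WAIT seconds elapse without a terminal state.
wait_for_status() {
  local interval="$1" max_wait="$2"; shift 2
  local elapsed=0 status
  while true; do
    status=$("$@")
    case "$status" in
      completed) return 0 ;;
      failed)    return 1 ;;
    esac
    if (( elapsed >= max_wait )); then
      return 2
    fi
    sleep "$interval"
    (( elapsed += interval ))
  done
}

# Usage with an operation ID captured from a failover:
# wait_for_status "$POLL_INTERVAL" "$MAX_WAIT" \
#   sh -c "openstack dr operation show $OP_ID -f json | jq -r .status"
```

Returning distinct exit codes for failure and timeout lets callers distinguish "the operation failed" from "the script gave up waiting".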
Debug output
Add --debug before the subcommand to emit full HTTP request/response traces to stderr. Redirect stderr separately in scripts to avoid polluting captured JSON:
openstack --debug dr operation show <id> -f json 2>debug.log | jq .status
Use -f json for all machine-parseable output
Every field-extraction pattern in automation should start with -f json. Avoid parsing table output — column widths and formatting can change across plugin versions.
# Capture the ID of a newly created Protection Group
PG_ID=$(openstack dr protection group create prod-app \
--primary-site cluster1 \
--secondary-site cluster2 \
--replication-type async \
-f json | jq -r '.id')
echo "Created PG: $PG_ID"
Capture the operation ID from failover output
All DR actions (failover, failback, test failover) return an operation object immediately. The operation runs asynchronously — capture its id before doing anything else:
OP_ID=$(openstack dr failover "$PG_ID" \
--failover-type planned \
-f json | jq -r '.id')
echo "Failover operation: $OP_ID"
If you lose the operation ID, recover it with:
openstack dr operation list -f json \
| jq -r --arg pg "$PG_ID" '.[] | select(.protection_group_id==$pg) | .id' \
| head -1
Poll with a while loop until terminal state
Operation status transitions through intermediate values (pending, in_progress) before reaching a terminal state (completed or failed). Poll openstack dr operation show in a loop:
POLL_INTERVAL=15
MAX_WAIT=1800
elapsed=0
while true; do
STATUS=$(openstack dr operation show "$OP_ID" -f json | jq -r '.status')
echo "[$(date -u +%H:%M:%S)] status: $STATUS"
if [[ "$STATUS" == "completed" ]]; then
echo "Operation completed successfully."
break
elif [[ "$STATUS" == "failed" ]]; then
echo "ERROR: Operation failed. Check 'openstack dr operation show $OP_ID' for details."
exit 1
fi
if (( elapsed >= MAX_WAIT )); then
echo "ERROR: Timed out after ${MAX_WAIT}s waiting for operation $OP_ID."
exit 2
fi
sleep "$POLL_INTERVAL"
(( elapsed += POLL_INTERVAL ))
done
Run pre-flight checks before scripted failovers
Before initiating any failover — especially in automated runbooks — verify that the Protection Group is in a replication-ready state. Use openstack dr health show to retrieve the current readiness flag:
FAILOVER_READY=$(openstack dr health show -f json | jq -r '.failover_ready')
if [[ "$FAILOVER_READY" != "true" ]]; then
echo "ERROR: Site is not failover-ready (failover_ready=$FAILOVER_READY). Aborting."
exit 3
fi
This check surfaces issues such as replication lag exceeding threshold, unreachable peer sites, or Protection Groups not in ACTIVE state — all of which would cause a failover to fail or produce inconsistent results. Because metadata sync is blocked when the peer site is unreachable, a failed health check is an authoritative signal to abort rather than proceed.
Cancel a running operation
If a scripted failover must be aborted during execution:
openstack dr operation cancel "$OP_ID"
After cancellation, poll until status reaches failed or completed before taking further action — the cancel is itself asynchronous.
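That post-cancel wait can be factored into a small helper. The sketch below uses a hypothetical function named settle, which polls whatever status command you give it and echoes the terminal status once the operation stops moving:

```shell
# settle INTERVAL CMD... : hypothetical helper. Polls CMD until it prints
# a terminal status ("completed" or "failed"), then echoes that status.
settle() {
  local interval="$1"; shift
  local status
  while true; do
    status=$("$@")
    if [[ "$status" == "completed" || "$status" == "failed" ]]; then
      echo "$status"
      return 0
    fi
    sleep "$interval"
  done
}

# Usage after a cancel:
# openstack dr operation cancel "$OP_ID"
# settle "$POLL_INTERVAL" \
#   sh -c "openstack dr operation show $OP_ID -f json | jq -r .status"
```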
Example 1 — Pre-flight health check
Verify failover readiness before any automated workflow begins.
#!/bin/bash
set -euo pipefail
OS_CLOUD="${OS_CLOUD:-site-a}"
FAILOVER_READY=$(openstack --os-cloud "$OS_CLOUD" dr health show -f json \
| jq -r '.failover_ready')
echo "failover_ready: $FAILOVER_READY"
if [[ "$FAILOVER_READY" != "true" ]]; then
echo "Aborting: site is not ready for failover."
exit 1
fi
echo "Pre-flight check passed."
Expected output (healthy site):
failover_ready: true
Pre-flight check passed.
Example 2 — Planned failover with polling
Execute a planned failover and block until the operation reaches a terminal state.
#!/bin/bash
set -euo pipefail
PG_ID="$1" # Pass PG ID as first argument
POLL_INTERVAL=15
MAX_WAIT=1800
# Pre-flight
FAILOVER_READY=$(openstack dr health show -f json | jq -r '.failover_ready')
if [[ "$FAILOVER_READY" != "true" ]]; then
echo "ERROR: failover_ready=$FAILOVER_READY — aborting."
exit 1
fi
# Initiate failover
OP_ID=$(openstack dr failover "$PG_ID" \
--failover-type planned \
-f json | jq -r '.id')
echo "Failover operation started: $OP_ID"
# Poll
elapsed=0
while true; do
STATUS=$(openstack dr operation show "$OP_ID" -f json | jq -r '.status')
echo "[$(date -u +%H:%M:%S)] $OP_ID → $STATUS"
[[ "$STATUS" == "completed" ]] && { echo "Failover complete."; exit 0; }
[[ "$STATUS" == "failed" ]] && { echo "Failover FAILED."; exit 1; }
(( elapsed >= MAX_WAIT )) && { echo "Timeout."; exit 2; }
sleep "$POLL_INTERVAL"
(( elapsed += POLL_INTERVAL ))
done
Expected output:
Failover operation started: a3f9c1d2-7e44-4b2a-9f01-bc34e8120def
[14:02:10] a3f9c1d2-7e44-4b2a-9f01-bc34e8120def → in_progress
[14:02:25] a3f9c1d2-7e44-4b2a-9f01-bc34e8120def → in_progress
[14:02:40] a3f9c1d2-7e44-4b2a-9f01-bc34e8120def → completed
Failover complete.
Example 3 — Non-disruptive DR drill (test failover + cleanup)
Run a DR test, verify the test instances appeared on the secondary, then clean up.
#!/bin/bash
set -euo pipefail
PG_ID="$1"
POLL_INTERVAL=15
MAX_WAIT=900
# Start test failover
TEST_OP_ID=$(openstack dr test failover "$PG_ID" \
-f json | jq -r '.id')
echo "Test failover operation: $TEST_OP_ID"
# Poll until ready
elapsed=0
while true; do
STATUS=$(openstack dr operation show "$TEST_OP_ID" -f json | jq -r '.status')
echo "[$(date -u +%H:%M:%S)] $STATUS"
[[ "$STATUS" == "completed" ]] && break
[[ "$STATUS" == "failed" ]] && { echo "Test failover failed."; exit 1; }
(( elapsed >= MAX_WAIT )) && { echo "Timeout."; exit 2; }
sleep "$POLL_INTERVAL"
(( elapsed += POLL_INTERVAL ))
done
echo "Test instances are running. Inspect the secondary site now."
echo "Press ENTER to proceed with cleanup, or Ctrl-C to abort."
read -r
# Cleanup
openstack dr test failover cleanup "$TEST_OP_ID"
echo "Test resources cleaned up."
Expected output:
Test failover operation: 77b2e901-cc13-4d8a-b5f3-192a44de0031
[09:15:02] in_progress
[09:15:17] in_progress
[09:15:32] completed
Test instances are running. Inspect the secondary site now.
Press ENTER to proceed with cleanup, or Ctrl-C to abort.
Test resources cleaned up.
Example 4 — Bulk status report across all Protection Groups
Generate a one-line status summary for every PG — useful in monitoring cron jobs or pre-maintenance checks.
#!/bin/bash
set -euo pipefail
echo "Protection Group DR Readiness Report — $(date -u)"
echo "---"
openstack dr protection group list -f json \
  | jq -r '.[] | "\(.id) \(.name) \(.status)"'
Expected output:
Protection Group DR Readiness Report — Mon Jan 13 14:00:00 UTC 2025
---
b1c3e2a0-... prod-app ACTIVE
4d9f8c11-... dev-vms ACTIVE
e7a21b03-... legacy-batch ERROR
Issue: jq returns null instead of a field value
Symptom: A variable like $OP_ID contains the literal string null after JSON extraction.
Likely cause: The field name passed to jq does not match the actual key in the API response. Field names can differ between API versions or between list and show responses (e.g., id vs operation_id).
Fix: Inspect the raw JSON before piping to jq:
openstack dr operation show <id> -f json
Identify the exact key name, then update your jq filter. Use jq -r '.id // empty' so a missing key yields an empty string you can test for, rather than passing the literal string null downstream.
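The same guard can be enforced in the script itself. The sketch below uses a hypothetical validator named require_value, which fails loudly when an extraction produced nothing or the literal string null:

```shell
# require_value NAME VALUE : hypothetical guard. Fails if VALUE is empty
# or the literal string "null", so a bad jq filter stops the script
# instead of propagating "null" into later commands.
require_value() {
  local name="$1" value="$2"
  if [[ -z "$value" || "$value" == "null" ]]; then
    echo "ERROR: $name extraction returned '$value'" >&2
    return 1
  fi
  printf '%s\n' "$value"
}

# Usage:
# OP_ID=$(require_value OP_ID \
#   "$(openstack dr operation show <id> -f json | jq -r '.id // empty')")
```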
Issue: Polling loop never exits
Symptom: The while loop runs past MAX_WAIT with status perpetually in_progress.
Likely cause 1: The operation has stalled on the engine due to a storage or connectivity error, but the status has not been marked failed. Check the protector-engine log on the primary site:
tail -f /var/log/protector/protector-engine.log
Likely cause 2: MAX_WAIT is too short for the replication workload (e.g., large async RPO window). Increase MAX_WAIT or check replication lag on the Pure FlashArray.
Fix: After diagnosing, either cancel the operation (openstack dr operation cancel <id>) or increase MAX_WAIT and re-run.
Issue: openstack dr health show returns failover_ready: false unexpectedly
Symptom: Pre-flight check fails even though the environment appears healthy.
Likely cause 1: The secondary site is unreachable. Because metadata sync is blocked when the peer site cannot be contacted, the service conservatively reports not ready.
Likely cause 2: One or more Protection Groups are in FAILED_OVER or ERROR state, preventing a new failover from being initiated.
Fix:
# Check site connectivity
openstack --os-cloud site-b dr site list
# Check PG states
openstack dr protection group list -f json | jq -r '.[] | "\(.name) \(.status)"'
Resolve any PGs in ERROR state before retrying.
Issue: openstack dr failover returns immediately with an error, no operation ID
Symptom: The failover command exits non-zero and prints an API error; there is no operation object to poll.
Likely cause: The Protection Group is not in ACTIVE state (it may already be FAILED_OVER, FAILING_OVER, or ERROR), or the replication policy has not been applied.
Fix:
openstack dr protection group show "$PG_ID" -f json \
| jq '{status: .status, replication_type: .replication_type}'
The PG must be ACTIVE with a replication policy configured before a failover can be accepted.
Issue: Script captures an empty string instead of the operation ID
Symptom: OP_ID is empty after the command substitution; subsequent operation show calls fail with "operation not found".
Likely cause: The failover command printed an error to stderr (not stdout), and the JSON on stdout was empty or absent. Command substitution only captures stdout.
Fix: Capture both streams and inspect:
RAW=$(openstack dr failover "$PG_ID" --failover-type planned -f json 2>&1) || true
echo "$RAW"
OP_ID=$(echo "$RAW" | jq -r '.id // empty' 2>/dev/null) || true
This surfaces error messages that would otherwise be silently discarded; the || true guards keep a set -euo pipefail script alive long enough to print the diagnostics.
Issue: Plugin not found — openstack: 'dr' is not an openstack command
Symptom: Running any openstack dr command produces an unknown command error.
Likely cause: The protectorclient package is not installed in the Python environment used by the openstack CLI, or the entry points were not registered.
Fix:
# Confirm which Python environment openstack uses
which openstack
# Install into that environment
pip install -e /path/to/openstack-protector
# Verify registration
openstack --help | grep '^ dr'