CI/CD Integration
Running test failover in CI/CD pipelines to continuously validate DR readiness
This guide explains how to integrate Trilio Site Recovery's test failover capability into CI/CD pipelines so you can continuously validate DR readiness without disrupting production workloads. A test failover spins up replicated VMs on the secondary site using the latest available snapshot while leaving the primary site fully operational — making it safe to run on a schedule or as a gate in your deployment pipeline. By automating this workflow you get early warning when replication lag, resource mapping drift, or infrastructure changes would cause a real failover to fail.
Before configuring a CI/CD pipeline for DR validation, ensure the following are in place:
Infrastructure
- Two independent OpenStack clouds (primary and secondary sites), each with Nova, Cinder, Neutron, and Keystone endpoints reachable from your CI runner
- Trilio Site Recovery
- `protector-api` and `protector-engine` services running on both sites
- Network connectivity from the CI runner host to both sites on ports 5000 (Keystone), 8774 (Nova), 8776 (Cinder), and 8788 (Protector API)
Trilio Site Recovery configuration
- Both sites registered and in `active` status (`openstack protector site validate` passes on both)
- At least one Protection Group in `active` status with member VMs added
- Replication policy configured on the Protection Group (FlashArray URLs/tokens or mock mode enabled)
- Volume types on both sites have `replication_enabled='<is> True'` and a matching `replication_type` property
- Resource mappings (network, flavor) already validated manually at least once before automating
CI runner environment
- Python 3.9 or later
- `python-openstackclient` and the `protectorclient` OSC plugin installed (`pip install python-protectorclient`)
- Service account credentials for both sites stored as CI secrets (never hard-coded in pipeline files)
- `jq` available for JSON parsing in shell scripts
Mock storage (optional but recommended for non-production pipelines)
- `use_mock_storage = True` and `use_mock_cinder = True` set in `protector.conf` on both sites
- Identical Glance images (same name) present on both sites — mock failover creates volumes from images to simulate replication
- Mock storage directory created: `mkdir -p /var/lib/protector/mock_storage`
Install the protectorclient OSC plugin on every machine or container image that will run pipeline jobs. The plugin provides the openstack protector subcommands used throughout this guide.
Step 1 — Install the OSC plugin
pip install python-openstackclient python-protectorclient
Verify the plugin is detected:
openstack protector --help
You should see protection-group, operation, site, and related subcommand groups listed.
Step 2 — Create a service account on each site
Create a dedicated service account on each OpenStack site rather than using personal credentials. The account needs at minimum the member role on the tenant that owns the Protection Groups.
On the primary site:
openstack user create ci-dr-validator \
--password "${CI_SA_PASSWORD_PRIMARY}" \
--domain Default
openstack role add \
--user ci-dr-validator \
--project production-project \
member
Repeat on the secondary site using the same username so credential rotation stays symmetric.
Step 3 — Store credentials as CI secrets
In your CI system (GitHub Actions, GitLab CI, Jenkins, etc.) create the following secrets. Names shown here are illustrative — adapt them to your platform's conventions.
| Secret name | Value |
|---|---|
| OS_AUTH_URL_PRIMARY | Keystone endpoint of the primary site |
| OS_AUTH_URL_SECONDARY | Keystone endpoint of the secondary site |
| OS_USERNAME | ci-dr-validator |
| OS_PASSWORD | Service account password |
| OS_PROJECT_NAME | Tenant owning the Protection Groups |
| OS_USER_DOMAIN_NAME | Default |
| OS_PROJECT_DOMAIN_NAME | Default |
Step 4 — Create clouds.yaml for multi-site access
The protectorclient plugin authenticates to both sites and orchestrates metadata sync. Provide a clouds.yaml file so your scripts can reference each site by name rather than repeating credential flags.
# clouds.yaml — generated at pipeline start from CI secrets
clouds:
site_primary:
auth:
auth_url: "{{ OS_AUTH_URL_PRIMARY }}"
username: "{{ OS_USERNAME }}"
password: "{{ OS_PASSWORD }}"
project_name: "{{ OS_PROJECT_NAME }}"
user_domain_name: "{{ OS_USER_DOMAIN_NAME }}"
project_domain_name: "{{ OS_PROJECT_DOMAIN_NAME }}"
verify: true
site_secondary:
auth:
auth_url: "{{ OS_AUTH_URL_SECONDARY }}"
username: "{{ OS_USERNAME }}"
password: "{{ OS_PASSWORD }}"
project_name: "{{ OS_PROJECT_NAME }}"
user_domain_name: "{{ OS_USER_DOMAIN_NAME }}"
project_domain_name: "{{ OS_PROJECT_DOMAIN_NAME }}"
verify: true
Generate this file at the start of each pipeline run using environment variable substitution — do not commit it to source control.
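One way to do that substitution is a small shell function that expands the CI-injected environment variables through a heredoc. This is a sketch: the function name `write_clouds_yaml` and the `:-Default` domain fallbacks are our conventions, not part of the product.

```shell
# Sketch: render clouds.yaml from CI-injected env vars at job start.
# The heredoc expands ${VAR} at generation time; the subshell-scoped
# umask keeps the password-bearing file readable only by the runner user.
write_clouds_yaml() (
  umask 077
  cat > clouds.yaml << EOF
clouds:
  site_primary:
    auth:
      auth_url: "${OS_AUTH_URL_PRIMARY}"
      username: "${OS_USERNAME}"
      password: "${OS_PASSWORD}"
      project_name: "${OS_PROJECT_NAME}"
      user_domain_name: "${OS_USER_DOMAIN_NAME:-Default}"
      project_domain_name: "${OS_PROJECT_DOMAIN_NAME:-Default}"
    verify: true
  site_secondary:
    auth:
      auth_url: "${OS_AUTH_URL_SECONDARY}"
      username: "${OS_USERNAME}"
      password: "${OS_PASSWORD}"
      project_name: "${OS_PROJECT_NAME}"
      user_domain_name: "${OS_USER_DOMAIN_NAME:-Default}"
      project_domain_name: "${OS_PROJECT_DOMAIN_NAME:-Default}"
    verify: true
EOF
)
```

Call `write_clouds_yaml` as the first step of each job, before any `openstack` command runs.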
Step 5 — Validate connectivity before the first automated run
export OS_CLOUD=site_primary
openstack protector site validate site-a
export OS_CLOUD=site_secondary
openstack protector site validate site-b
Both commands must return status: active before proceeding.
The following parameters control test failover behavior in a CI/CD context. Configure them in your pipeline scripts or as environment variables passed to the OSC CLI.
--retain-primary (test failover flag)
Default: required (must be explicitly set for test failover)
Effect: Keeps primary site VMs running during the test. Without this flag a failover command shuts down primary VMs, causing production impact. Always pass --retain-primary in automated DR drills.
--network-mapping
Default: none — required
Effect: Maps primary-site network UUIDs to secondary-site network UUIDs for VM recreation. If the mapping is wrong or a network UUID no longer exists, the test failover fails with a resource-not-found error. Store mappings as pipeline variables and validate them in a pre-check step. The format is <primary-net-uuid>=<secondary-net-uuid> and you can repeat the flag for multiple networks.
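Because a malformed mapping only surfaces mid-operation, a cheap format check in the pre-check step catches obvious mistakes early. The helper below is a sketch (the function name is ours); pair it with `openstack network show <uuid>` against each site to confirm the UUIDs still exist.

```shell
# Sketch: fail fast on malformed --network-mapping values before they
# reach the test-failover call. Expects "<primary-uuid>=<secondary-uuid>".
valid_uuid_mapping() {
  local uuid='[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
  printf '%s' "$1" | grep -Eq "^${uuid}=${uuid}$"
}

# Usage in a pre-check step:
# valid_uuid_mapping "${PRIMARY_NET_UUID}=${SECONDARY_NET_UUID}" \
#   || { echo "ERROR: malformed network mapping"; exit 1; }
```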
--flavor-mapping
Default: none — optional
Effect: Maps primary-site flavor IDs to secondary-site flavor IDs. Omit this flag only when both sites have identical flavor names and IDs. In practice, flavor IDs differ between independent clouds, so define explicit mappings.
--wait / poll loop
Default: the CLI returns immediately with an operation_id
Effect: Test failover is asynchronous. Use a polling loop in your script (see the Usage section) to block the pipeline step until the operation reaches completed or failed. Failing to poll means your pipeline may report success before the DR test has actually finished.
--force
Default: false
Effect: Skips primary-site validation and uses the latest available snapshot without a final sync. Do not use this flag for scheduled DR drills — it is intended for unplanned (disaster) failover. Using --force in a test context produces misleading results because it bypasses the checks that would catch replication lag.
protector.conf — mock mode (test environments only)
[DEFAULT]
use_mock_storage = True # Simulate Pure FlashArray; no real array required
use_mock_cinder = True # Simulate Cinder consistency group operations
Set use_mock_storage = False and use_mock_cinder = False for pipelines that validate against real storage. Mock mode is appropriate for pipeline environments that do not have FlashArray access but still need to exercise the full DR workflow logic.
RPO thresholds as pipeline gates
The Protection Group's replication policy stores rpo_minutes. You can use the openstack protector protection-group show output to assert that the last replication timestamp is within your RPO before triggering the test failover. Failing this pre-check surfaces replication lag as a pipeline failure rather than as a data-loss risk discovered only during a real disaster.
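A sketch of such a gate follows. The JSON field names `last_replicated_at` and `rpo_minutes` in the commented usage are assumptions, so inspect `openstack protector protection-group show "${PG_NAME}" -f json` in your deployment for the exact keys; `date -d` assumes GNU coreutils.

```shell
# Sketch: fail the pipeline if the last replication is older than the RPO.
# rpo_ok <last_replicated_epoch> <rpo_minutes> [now_epoch]
rpo_ok() {
  local now="${3:-$(date +%s)}"
  local age_min=$(( (now - $1) / 60 ))  # integer minutes since last replication
  [ "${age_min}" -le "$2" ]
}

# Usage (field names are assumptions; verify against your JSON output):
# PG_JSON=$(openstack protector protection-group show "${PG_NAME}" -f json)
# LAST_EPOCH=$(date -d "$(echo "${PG_JSON}" | jq -r '.last_replicated_at')" +%s)
# RPO=$(echo "${PG_JSON}" | jq -r '.rpo_minutes')
# rpo_ok "${LAST_EPOCH}" "${RPO}" \
#   || { echo "ERROR: replication lag exceeds RPO"; exit 1; }
```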
A CI/CD DR validation job follows five phases: pre-checks, test failover execution, polling for completion, result validation, and cleanup. Structure your pipeline around these phases so each failure mode produces a distinct, actionable error.
Phase 1 — Pre-checks
Before triggering the test failover, confirm that the Protection Group is in a healthy state and that replication is current. This prevents the pipeline from exercising a broken DR path and mistaking a pre-existing problem for a pipeline-specific failure.
export OS_CLOUD=site_primary
# Confirm Protection Group is active
PG_STATUS=$(openstack protector protection-group show "${PG_NAME}" \
-f value -c status)
if [ "${PG_STATUS}" != "active" ]; then
echo "ERROR: Protection Group is not active (status: ${PG_STATUS}). Aborting."
exit 1
fi
# Confirm both sites are reachable
openstack protector site validate "${PRIMARY_SITE_NAME}"
openstack protector site validate "${SECONDARY_SITE_NAME}"
Phase 2 — Trigger test failover
The test-failover action keeps primary VMs running. Pass all required resource mappings. The command returns an operation_id immediately.
OP_ID=$(openstack protector protection-group test-failover "${PG_NAME}" \
--retain-primary \
--network-mapping "${PRIMARY_NET_UUID}=${SECONDARY_NET_UUID}" \
--flavor-mapping "${PRIMARY_FLAVOR_ID}=${SECONDARY_FLAVOR_ID}" \
-f value -c operation_id)
echo "Test failover started: ${OP_ID}"
Phase 3 — Poll for completion
The DR operation is asynchronous. Poll the operation endpoint until status reaches a terminal state. Set a timeout appropriate for your RPO — if the test failover takes longer than your RTO target, that is itself a finding worth failing the pipeline on.
TIMEOUT=600 # seconds
INTERVAL=15
ELAPSED=0
while [ "${ELAPSED}" -lt "${TIMEOUT}" ]; do
OP_STATUS=$(openstack protector operation show "${OP_ID}" \
-f value -c status)
PROGRESS=$(openstack protector operation show "${OP_ID}" \
-f value -c progress)
echo "[${ELAPSED}s] Operation ${OP_ID}: ${OP_STATUS} (${PROGRESS}%)"
if [ "${OP_STATUS}" = "completed" ]; then
echo "Test failover completed successfully."
break
elif [ "${OP_STATUS}" = "failed" ]; then
ERROR_MSG=$(openstack protector operation show "${OP_ID}" \
-f value -c error_message)
echo "ERROR: Test failover failed — ${ERROR_MSG}"
exit 1
fi
sleep "${INTERVAL}"
ELAPSED=$(( ELAPSED + INTERVAL ))
done
if [ "${ELAPSED}" -ge "${TIMEOUT}" ]; then
echo "ERROR: Test failover did not complete within ${TIMEOUT}s."
exit 1
fi
Phase 4 — Validate DR instances
After the operation completes, confirm that VMs are actually running on the secondary site. This catches cases where the operation reports success but individual VMs failed to start.
export OS_CLOUD=site_secondary
# List instances recreated by the test failover
openstack server list --project "${OS_PROJECT_NAME}" --status ACTIVE
# Optionally check each expected instance by name
for INSTANCE_NAME in "${EXPECTED_INSTANCES[@]}"; do
STATUS=$(openstack server show "${INSTANCE_NAME}" -f value -c status 2>/dev/null || echo "NOT_FOUND")
if [ "${STATUS}" != "ACTIVE" ]; then
echo "ERROR: Instance ${INSTANCE_NAME} is ${STATUS} on secondary site."
exit 1
fi
echo "OK: ${INSTANCE_NAME} is ACTIVE on secondary site."
done
Phase 5 — Cleanup
Run test cleanup to remove the instances created on the secondary site and return the Protection Group to active status. Skipping this step leaves orphaned VMs consuming quota and leaves the Protection Group in a test_failed_over state that blocks subsequent operations.
export OS_CLOUD=site_primary
openstack protector protection-group test-cleanup "${PG_NAME}"
# Confirm Protection Group is active again
PG_STATUS=$(openstack protector protection-group show "${PG_NAME}" \
-f value -c status)
if [ "${PG_STATUS}" != "active" ]; then
echo "ERROR: Cleanup did not restore active status (status: ${PG_STATUS})."
exit 1
fi
The following examples are self-contained and show expected output. Adapt environment variable names to match your CI platform's secret injection conventions.
Example 1 — GitHub Actions workflow: scheduled weekly DR drill
This workflow runs every Sunday at 02:00 UTC, generates a clouds.yaml from secrets, runs the full five-phase DR drill, and posts the result as a workflow annotation.
# .github/workflows/dr-drill.yml
name: Weekly DR Drill
on:
schedule:
- cron: '0 2 * * 0'
workflow_dispatch: # Allow manual trigger
env:
PG_NAME: prod-web-app
PRIMARY_SITE_NAME: site-a
SECONDARY_SITE_NAME: site-b
PRIMARY_NET_UUID: ${{ secrets.PRIMARY_NET_UUID }}
SECONDARY_NET_UUID: ${{ secrets.SECONDARY_NET_UUID }}
PRIMARY_FLAVOR_ID: ${{ secrets.PRIMARY_FLAVOR_ID }}
SECONDARY_FLAVOR_ID: ${{ secrets.SECONDARY_FLAVOR_ID }}
jobs:
dr-drill:
runs-on: ubuntu-22.04
steps:
- name: Install dependencies
run: pip install python-openstackclient python-protectorclient
- name: Generate clouds.yaml
run: |
cat > clouds.yaml << EOF
clouds:
site_primary:
auth:
auth_url: ${{ secrets.OS_AUTH_URL_PRIMARY }}
username: ${{ secrets.OS_USERNAME }}
password: ${{ secrets.OS_PASSWORD }}
project_name: ${{ secrets.OS_PROJECT_NAME }}
user_domain_name: Default
project_domain_name: Default
site_secondary:
auth:
auth_url: ${{ secrets.OS_AUTH_URL_SECONDARY }}
username: ${{ secrets.OS_USERNAME }}
password: ${{ secrets.OS_PASSWORD }}
project_name: ${{ secrets.OS_PROJECT_NAME }}
user_domain_name: Default
project_domain_name: Default
EOF
- name: Run DR drill
run: bash scripts/dr-drill.sh
- name: Cleanup on failure
if: failure()
run: |
export OS_CLOUD=site_primary
openstack protector protection-group test-cleanup "${PG_NAME}" || true
Example 2 — Core drill script (scripts/dr-drill.sh)
This reusable script implements all five phases and can be called from any CI platform.
#!/usr/bin/env bash
set -euo pipefail
export OS_CLOUD=site_primary
echo "=== Phase 1: Pre-checks ==="
PG_STATUS=$(openstack protector protection-group show "${PG_NAME}" \
-f value -c status)
echo "Protection Group status: ${PG_STATUS}"
[ "${PG_STATUS}" = "active" ] || { echo "ERROR: PG not active"; exit 1; }
openstack protector site validate "${PRIMARY_SITE_NAME}"
openstack protector site validate "${SECONDARY_SITE_NAME}"
echo "Both sites reachable."
echo "=== Phase 2: Trigger test failover ==="
OP_ID=$(openstack protector protection-group test-failover "${PG_NAME}" \
--retain-primary \
--network-mapping "${PRIMARY_NET_UUID}=${SECONDARY_NET_UUID}" \
--flavor-mapping "${PRIMARY_FLAVOR_ID}=${SECONDARY_FLAVOR_ID}" \
-f value -c operation_id)
echo "Operation ID: ${OP_ID}"
echo "=== Phase 3: Poll for completion ==="
TIMEOUT=600
INTERVAL=15
ELAPSED=0
while [ "${ELAPSED}" -lt "${TIMEOUT}" ]; do
OP_JSON=$(openstack protector operation show "${OP_ID}" -f json)
OP_STATUS=$(echo "${OP_JSON}" | jq -r '.status')
PROGRESS=$(echo "${OP_JSON}" | jq -r '.progress')
echo " [${ELAPSED}s] ${OP_STATUS} — ${PROGRESS}%"
[ "${OP_STATUS}" = "completed" ] && break
if [ "${OP_STATUS}" = "failed" ]; then
ERROR=$(echo "${OP_JSON}" | jq -r '.error_message')
echo "ERROR: ${ERROR}"
exit 1
fi
sleep "${INTERVAL}"
ELAPSED=$(( ELAPSED + INTERVAL ))
done
[ "${ELAPSED}" -lt "${TIMEOUT}" ] || { echo "ERROR: Timed out after ${TIMEOUT}s"; exit 1; }
echo "=== Phase 4: Validate secondary instances ==="
export OS_CLOUD=site_secondary
ACTIVE_COUNT=$(openstack server list \
--project "${OS_PROJECT_NAME}" \
--status ACTIVE \
-f value -c ID | wc -l)
echo "Active instances on secondary site: ${ACTIVE_COUNT}"
[ "${ACTIVE_COUNT}" -gt 0 ] || { echo "ERROR: No active instances found on secondary site"; exit 1; }
echo "=== Phase 5: Cleanup ==="
export OS_CLOUD=site_primary
openstack protector protection-group test-cleanup "${PG_NAME}"
FINAL_STATUS=$(openstack protector protection-group show "${PG_NAME}" \
-f value -c status)
echo "Protection Group status after cleanup: ${FINAL_STATUS}"
[ "${FINAL_STATUS}" = "active" ] || { echo "ERROR: Cleanup did not restore active status"; exit 1; }
echo "DR drill completed successfully."
Expected terminal output on success:
=== Phase 1: Pre-checks ===
Protection Group status: active
site-a: status active
site-b: status active
Both sites reachable.
=== Phase 2: Trigger test failover ===
Operation ID: op-7a3c1f82-...
=== Phase 3: Poll for completion ===
[0s] running — 10%
[15s] running — 35%
[30s] running — 60%
[45s] running — 85%
[60s] completed — 100%
=== Phase 4: Validate secondary instances ===
Active instances on secondary site: 3
=== Phase 5: Cleanup ===
Protection Group status after cleanup: active
DR drill completed successfully.
Example 3 — GitLab CI pipeline stage
Insert this stage into an existing .gitlab-ci.yml to gate releases on DR readiness.
# .gitlab-ci.yml (excerpt)
dr-drill:
stage: dr-validation
image: python:3.11-slim
rules:
- if: '$CI_PIPELINE_SOURCE == "schedule"'
- if: '$CI_COMMIT_BRANCH == "main"'
before_script:
- pip install --quiet python-openstackclient python-protectorclient
- |
cat > clouds.yaml << EOF
clouds:
site_primary:
auth:
auth_url: ${OS_AUTH_URL_PRIMARY}
username: ${OS_USERNAME}
password: ${OS_PASSWORD}
project_name: ${OS_PROJECT_NAME}
user_domain_name: Default
project_domain_name: Default
site_secondary:
auth:
auth_url: ${OS_AUTH_URL_SECONDARY}
username: ${OS_USERNAME}
password: ${OS_PASSWORD}
project_name: ${OS_PROJECT_NAME}
user_domain_name: Default
project_domain_name: Default
EOF
script:
- bash scripts/dr-drill.sh
after_script:
# Best-effort cleanup even if script failed
- export OS_CLOUD=site_primary
- openstack protector protection-group test-cleanup "${PG_NAME}" || true
variables:
PG_NAME: prod-web-app
PRIMARY_SITE_NAME: site-a
SECONDARY_SITE_NAME: site-b
artifacts:
when: always
reports:
      dotenv: dr-drill-results.env
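Note that the `reports: dotenv:` artifact above expects the drill script to produce `dr-drill-results.env`, which the core script does not do on its own. A small helper like the sketch below (variable names are illustrative) can be appended to `scripts/dr-drill.sh`; the file name must match the artifact path.

```shell
# Sketch: emit a dotenv report for GitLab's `reports: dotenv:` artifact.
# write_drill_results <status> <elapsed_seconds> <active_count>
write_drill_results() {
  {
    echo "DR_DRILL_STATUS=$1"
    echo "DR_DRILL_ELAPSED_SECONDS=$2"
    echo "DR_DRILL_ACTIVE_INSTANCES=$3"
  } > dr-drill-results.env
}

# e.g. as the last line of dr-drill.sh:
# write_drill_results passed "${ELAPSED}" "${ACTIVE_COUNT}"
```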
Use the following reference to diagnose failures in automated DR drill pipelines. Each entry follows the pattern: symptom → likely cause → fix.
Symptom: Pre-check fails with Protection Group is not active (status: failing_over)
Likely cause: A previous pipeline run triggered a test failover that did not reach the cleanup phase — either because the job was cancelled or because cleanup itself failed. The Protection Group is stuck in a transitional state.
Fix: Manually run the cleanup command, then re-run the pipeline.
export OS_CLOUD=site_primary
openstack protector protection-group test-cleanup "${PG_NAME}"
openstack protector protection-group show "${PG_NAME}" -c status
If the status does not return to active, check the Protector engine log on the primary site (/var/log/protector/engine.log) for the operation that left it in this state.
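A quick way to pull the relevant lines from that log is a helper like this sketch (the function name and match pattern are ours; the log path is parameterized so it can be pointed elsewhere):

```shell
# Sketch: surface recent failover/cleanup errors from the Protector engine log.
# recent_failover_errors [logfile]
recent_failover_errors() {
  tail -n 500 "${1:-/var/log/protector/engine.log}" \
    | grep -iE 'error|test_failover|cleanup' \
    | tail -n 20
}
```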
Symptom: openstack protector site validate returns status: unreachable for the secondary site
Likely cause: Network connectivity between the CI runner (or Protector engine) and the secondary site's Keystone or Protector API endpoint is blocked. This can also happen if the secondary site's protector-api service is down.
Fix:
- From the CI runner, test raw connectivity: `curl -k https://<secondary-keystone>:5000/v3`
- Verify `protector-api` is running on the secondary site: `ps aux | grep protector-api` and `netstat -tlnp | grep 8788`
- Check that your `clouds.yaml` `auth_url` for `site_secondary` is the external (not internal/management) Keystone endpoint reachable from the CI runner.
Important: Because metadata sync between sites is blocked when a peer site is unreachable, any modifications to the Protection Group will also fail until connectivity is restored. The pipeline correctly surfaces this as a DR readiness gap.
Symptom: Test failover operation reaches failed status with error_message: No valid backend was found
Likely cause: The volume_backend_name property on the secondary site's volume type does not match any enabled Cinder backend on that site.
Fix:
export OS_CLOUD=site_secondary
# Check active Cinder backends
openstack volume service list
# Check volume type properties
openstack volume type show replicated-async
# Correct the backend name
openstack volume type set replicated-async \
--property volume_backend_name='<correct-backend-name>'
Symptom: Test failover completes but Phase 4 reports zero active instances on the secondary site
Likely cause 1 (real storage): The flavor mapping is missing or incorrect, causing VM creation to silently fail during the operation. The operation status may show completed at the Protection Group level while individual member statuses show error.
Fix: Check per-member status.
export OS_CLOUD=site_primary
openstack protector protection-group member-list "${PG_NAME}"
Look for members with status: error. Correct the --flavor-mapping values and re-run the drill.
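When the member list is long, a jq filter can isolate the failing members. This is a sketch; the JSON field names `name` and `status` are assumptions about the `member-list -f json` output, so verify them in your deployment first.

```shell
# Sketch: print only the names of Protection Group members in error state.
# Field names `name` and `status` are assumptions about the JSON output.
failed_members() {
  openstack protector protection-group member-list "$1" -f json \
    | jq -r '.[] | select(.status == "error") | .name'
}

# failed_members "${PG_NAME}"
```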
Likely cause 2 (mock storage): The Glance image referenced during mock failover does not exist on the secondary site or has a different name.
Fix:
export OS_CLOUD=site_secondary
openstack image list
Ensure the same image name exists on both sites. Upload it if missing (see Prerequisites).
Symptom: Pipeline times out in Phase 3 before operation reaches completed
Likely cause: The default 600-second timeout is too short for your environment's replication snapshot retrieval or VM boot time. This is most common when async replication lag means the engine must wait for a new snapshot to be available.
Fix: Increase TIMEOUT in your drill script. Also check the operation's steps_completed and steps_failed fields to identify which phase is slow:
export OS_CLOUD=site_primary
openstack protector operation show "${OP_ID}" -f json | jq '.steps_completed, .steps_failed'
If snapshot retrieval is the bottleneck, check the replication interval in your policy (openstack protector protection-group policy-show "${PG_NAME}") and ensure replication is running on schedule.
Symptom: Cleanup step fails with Cannot modify Protection Group: peer site unreachable
Likely cause: Strict metadata sync enforcement is working as designed. The Protector service blocks all Protection Group modifications — including test cleanup — when the peer site is unreachable, to prevent metadata divergence between sites.
Fix: Restore connectivity to the peer site before retrying cleanup. Do not attempt to force cleanup by directly modifying the database — this will cause metadata divergence that can prevent future failover operations from working correctly. After restoring connectivity:
export OS_CLOUD=site_primary
openstack protector protection-group test-cleanup "${PG_NAME}"
Symptom: Authentication error 401 Unauthorized when the pipeline switches from OS_CLOUD=site_primary to OS_CLOUD=site_secondary
Likely cause: The service account exists on the primary site but was not created on the secondary site, or the password differs.
Fix: Create a matching service account on the secondary site and confirm it has the member role on the project that owns the Protection Groups. Tokens issued by the primary site's Keystone are not valid on the secondary site; each site requires independent authentication.
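To pinpoint which side rejects the credentials, request a token from each site separately. The helper below is a sketch built on the standard `openstack token issue` command (the function name is ours):

```shell
# Sketch: verify credentials against each site's Keystone independently.
# Returns non-zero (e.g. on 401 Unauthorized) if auth fails for the named cloud.
check_site_auth() {
  openstack --os-cloud "$1" token issue -f value -c id > /dev/null 2>&1
}

# for cloud in site_primary site_secondary; do
#   check_site_auth "${cloud}" \
#     && echo "OK: ${cloud}" \
#     || echo "ERROR: authentication failed on ${cloud}"
# done
```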