CI/CD Integration
Running test failover in CI/CD pipelines to continuously validate DR readiness
This guide explains how to integrate Trilio Site Recovery's test failover capability into CI/CD pipelines so you can continuously validate DR readiness without disrupting production workloads. A test failover spins up replicated VMs on the secondary site using the latest available snapshot while leaving the primary site fully operational — making it safe to run on a schedule or as a gate in your deployment pipeline. By automating this workflow you get early warning when replication lag, resource mapping drift, or infrastructure changes would cause a real failover to fail.
Before configuring a CI/CD pipeline for DR validation, ensure the following are in place:
Infrastructure
- Two independent OpenStack clouds (primary and secondary sites), each with Nova, Cinder, Neutron, and Keystone endpoints reachable from your CI runner
- Trilio Site Recovery
- `protector-api` and `protector-engine` services running on both sites
- Network connectivity from the CI runner host to both sites on ports 5000 (Keystone), 8774 (Nova), 8776 (Cinder), and 8788 (Protector API)
Trilio Site Recovery configuration
- Both sites registered and in `active` status (`openstack protector site validate` passes on both)
- At least one Protection Group in `active` status with member VMs added
- Replication policy configured on the Protection Group (FlashArray URLs/tokens or mock mode enabled)
- Volume types on both sites have `replication_enabled='<is> True'` and a matching `replication_type` property
- Resource mappings (network, flavor) already validated manually at least once before automating
CI runner environment
- Python 3.9 or later
- `python-openstackclient` and the `protectorclient` OSC plugin installed (`pip install python-protectorclient`)
- Service account credentials for both sites stored as CI secrets (never hard-coded in pipeline files)
- `jq` available for JSON parsing in shell scripts
Mock storage (optional but recommended for non-production pipelines)
- `use_mock_storage = True` and `use_mock_cinder = True` set in `protector.conf` on both sites
- Identical Glance images (same name) present on both sites — mock failover creates volumes from images to simulate replication
- Mock storage directory created: `mkdir -p /var/lib/protector/mock_storage`
Install the protectorclient OSC plugin on every machine or container image that will run pipeline jobs. The plugin provides the openstack protector subcommands used throughout this guide.
Step 1 — Install the OSC plugin
pip install python-openstackclient python-protectorclient
Verify the plugin is detected:
openstack protector --help
You should see protection-group, operation, site, and related subcommand groups listed.
Step 2 — Create a service account on each site
Create a dedicated service account on each OpenStack site rather than using personal credentials. The account needs at minimum the member role on the tenant that owns the Protection Groups.
On the primary site:
openstack user create ci-dr-validator \
--password "${CI_SA_PASSWORD_PRIMARY}" \
--domain Default
openstack role add \
--user ci-dr-validator \
--project production-project \
member
Repeat on the secondary site using the same username so credential rotation stays symmetric.
Step 3 — Store credentials as CI secrets
In your CI system (GitHub Actions, GitLab CI, Jenkins, etc.) create the following secrets. Names shown here are illustrative — adapt them to your platform's conventions.
| Secret name | Value |
|---|---|
| OS_AUTH_URL_PRIMARY | Keystone endpoint of the primary site |
| OS_AUTH_URL_SECONDARY | Keystone endpoint of the secondary site |
| OS_USERNAME | ci-dr-validator |
| OS_PASSWORD | Service account password |
| OS_PROJECT_NAME | Tenant owning the Protection Groups |
| OS_USER_DOMAIN_NAME | Default |
| OS_PROJECT_DOMAIN_NAME | Default |
Step 4 — Create clouds.yaml for multi-site access
The protectorclient plugin authenticates to both sites and orchestrates metadata sync. Provide a clouds.yaml file so your scripts can reference each site by name rather than repeating credential flags.
# clouds.yaml — generated at pipeline start from CI secrets
clouds:
site_primary:
auth:
auth_url: "{{ OS_AUTH_URL_PRIMARY }}"
username: "{{ OS_USERNAME }}"
password: "{{ OS_PASSWORD }}"
project_name: "{{ OS_PROJECT_NAME }}"
user_domain_name: "{{ OS_USER_DOMAIN_NAME }}"
project_domain_name: "{{ OS_PROJECT_DOMAIN_NAME }}"
verify: true
site_secondary:
auth:
auth_url: "{{ OS_AUTH_URL_SECONDARY }}"
username: "{{ OS_USERNAME }}"
password: "{{ OS_PASSWORD }}"
project_name: "{{ OS_PROJECT_NAME }}"
user_domain_name: "{{ OS_USER_DOMAIN_NAME }}"
project_domain_name: "{{ OS_PROJECT_DOMAIN_NAME }}"
verify: true
Generate this file at the start of each pipeline run using environment variable substitution — do not commit it to source control.
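One way to do that substitution is a small shell function that expands the CI-injected environment variables through a heredoc. This is a sketch: the function name `write_clouds_yaml` and the `:-Default` domain fallbacks are our conventions, not part of the product.

```shell
# Sketch: render clouds.yaml from CI-injected env vars at job start.
# The heredoc expands ${VAR} at generation time; the subshell-scoped
# umask keeps the password-bearing file readable only by the runner user.
write_clouds_yaml() (
  umask 077
  cat > clouds.yaml << EOF
clouds:
  site_primary:
    auth:
      auth_url: "${OS_AUTH_URL_PRIMARY}"
      username: "${OS_USERNAME}"
      password: "${OS_PASSWORD}"
      project_name: "${OS_PROJECT_NAME}"
      user_domain_name: "${OS_USER_DOMAIN_NAME:-Default}"
      project_domain_name: "${OS_PROJECT_DOMAIN_NAME:-Default}"
    verify: true
  site_secondary:
    auth:
      auth_url: "${OS_AUTH_URL_SECONDARY}"
      username: "${OS_USERNAME}"
      password: "${OS_PASSWORD}"
      project_name: "${OS_PROJECT_NAME}"
      user_domain_name: "${OS_USER_DOMAIN_NAME:-Default}"
      project_domain_name: "${OS_PROJECT_DOMAIN_NAME:-Default}"
    verify: true
EOF
)
```

Call `write_clouds_yaml` as the first step of each job, before any `openstack` command runs.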
Step 5 — Validate connectivity before the first automated run
export OS_CLOUD=site_primary
openstack protector site validate site-a
export OS_CLOUD=site_secondary
openstack protector site validate site-b
Both commands must return status: active before proceeding.
The following parameters control test failover behavior in a CI/CD context. Configure them in your pipeline scripts or as environment variables passed to the OSC CLI.
--retain-primary (test failover flag)
Default: required (must be explicitly set for test failover)
Effect: Keeps primary site VMs running during the test. Without this flag a failover command shuts down primary VMs, causing production impact. Always pass --retain-primary in automated DR drills.
--network-mapping
Default: none — required
Effect: Maps primary-site network UUIDs to secondary-site network UUIDs for VM recreation. If the mapping is wrong or a network UUID no longer exists, the test failover fails with a resource-not-found error. Store mappings as pipeline variables and validate them in a pre-check step. The format is <primary-net-uuid>=<secondary-net-uuid> and you can repeat the flag for multiple networks.
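Because a malformed mapping only surfaces mid-operation, a cheap format check in the pre-check step catches obvious mistakes early. The helper below is a sketch (the function name is ours); pair it with `openstack network show <uuid>` against each site to confirm the UUIDs still exist.

```shell
# Sketch: fail fast on malformed --network-mapping values before they
# reach the test-failover call. Expects "<primary-uuid>=<secondary-uuid>".
valid_uuid_mapping() {
  local uuid='[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
  printf '%s' "$1" | grep -Eq "^${uuid}=${uuid}$"
}

# Usage in a pre-check step:
# valid_uuid_mapping "${PRIMARY_NET_UUID}=${SECONDARY_NET_UUID}" \
#   || { echo "ERROR: malformed network mapping"; exit 1; }
```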
--flavor-mapping
Default: none — optional
Effect: Maps primary-site flavor IDs to secondary-site flavor IDs. Omit this flag only when both sites have identical flavor names and IDs. In practice, flavor IDs differ between independent clouds, so define explicit mappings.
--wait / poll loop
Default: the CLI returns immediately with an operation_id
Effect: Test failover is asynchronous. Use a polling loop in your script (see the Usage section) to block the pipeline step until the operation reaches completed or failed. Failing to poll means your pipeline may report success before the DR test has actually finished.
--force
Default: false
Effect: Skips primary-site validation and uses the latest available snapshot without a final sync. Do not use this flag for scheduled DR drills — it is intended for unplanned (disaster) failover. Using --force in a test context produces misleading results because it bypasses the checks that would catch replication lag.
protector.conf — mock mode (test environments only)
[DEFAULT]
use_mock_storage = True # Simulate Pure FlashArray; no real array required
use_mock_cinder = True # Simulate Cinder consistency group operations
Set use_mock_storage = False and use_mock_cinder = False for pipelines that validate against real storage. Mock mode is appropriate for pipeline environments that do not have FlashArray access but still need to exercise the full DR workflow logic.
RPO thresholds as pipeline gates
The Protection Group's replication policy stores rpo_minutes. You can use the openstack protector protection-group show output to assert that the last replication timestamp is within your RPO before triggering the test failover. Failing this pre-check surfaces replication lag as a pipeline failure rather than as a data-loss risk discovered only during a real disaster.
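A sketch of such a gate follows. The JSON field names `last_replicated_at` and `rpo_minutes` in the commented usage are assumptions, so inspect `openstack protector protection-group show "${PG_NAME}" -f json` in your deployment for the exact keys; `date -d` assumes GNU coreutils.

```shell
# Sketch: fail the pipeline if the last replication is older than the RPO.
# rpo_ok <last_replicated_epoch> <rpo_minutes> [now_epoch]
rpo_ok() {
  local now="${3:-$(date +%s)}"
  local age_min=$(( (now - $1) / 60 ))  # integer minutes since last replication
  [ "${age_min}" -le "$2" ]
}

# Usage (field names are assumptions; verify against your JSON output):
# PG_JSON=$(openstack protector protection-group show "${PG_NAME}" -f json)
# LAST_EPOCH=$(date -d "$(echo "${PG_JSON}" | jq -r '.last_replicated_at')" +%s)
# RPO=$(echo "${PG_JSON}" | jq -r '.rpo_minutes')
# rpo_ok "${LAST_EPOCH}" "${RPO}" \
#   || { echo "ERROR: replication lag exceeds RPO"; exit 1; }
```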
A CI/CD DR validation job follows five phases: pre-checks, test failover execution, polling for completion, result validation, and cleanup. Structure your pipeline around these phases so each failure mode produces a distinct, actionable error.
Phase 1 — Pre-checks
Before triggering the test failover, confirm that the Protection Group is in a healthy state and that replication is current. This prevents the pipeline from exercising a broken DR path and mistaking a pre-existing problem for a pipeline-specific failure.
export OS_CLOUD=site_primary
# Confirm Protection Group is active
PG_STATUS=$(openstack protector protection-group show "${PG_NAME}" \
-f value -c status)
if [ "${PG_STATUS}" != "active" ]; then
echo "ERROR: Protection Group is not active (status: ${PG_STATUS}). Aborting."
exit 1
fi
# Confirm both sites are reachable
openstack protector site validate "${PRIMARY_SITE_NAME}"
openstack protector site validate "${SECONDARY_SITE_NAME}"
Phase 2 — Trigger test failover
The test-failover action keeps primary VMs running. Pass all required resource mappings. The command returns an operation_id immediately.
OP_ID=$(openstack protector protection-group test-failover "${PG_NAME}" \
--retain-primary \
--network-mapping "${PRIMARY_NET_UUID}=${SECONDARY_NET_UUID}" \
--flavor-mapping "${PRIMARY_FLAVOR_ID}=${SECONDARY_FLAVOR_ID}" \
-f value -c operation_id)
echo "Test failover started: ${OP_ID}"
Phase 3 — Poll for completion
The DR operation is asynchronous. Poll the operation endpoint until status reaches a terminal state. Set a timeout appropriate for your RPO — if the test failover takes longer than your RTO target, that is itself a finding worth failing the pipeline on.
TIMEOUT=600 # seconds
INTERVAL=15
ELAPSED=0
while [ "${ELAPSED}" -lt "${TIMEOUT}" ]; do
OP_STATUS=$(openstack protector operation show "${OP_ID}" \
-f value -c status)
PROGRESS=$(openstack protector operation show "${OP_ID}" \
-f value -c progress)
echo "[${ELAPSED}s] Operation ${OP_ID}: ${OP_STATUS} (${PROGRESS}%)"
if [ "${OP_STATUS}" = "completed" ]; then
echo "Test failover completed successfully."
break
elif [ "${OP_STATUS}" = "failed" ]; then
ERROR_MSG=$(openstack protector operation show "${OP_ID}" \
-f value -c error_message)
echo "ERROR: Test failover failed — ${ERROR_MSG}"
exit 1
fi
sleep "${INTERVAL}"
ELAPSED=$(( ELAPSED + INTERVAL ))
done
if [ "${ELAPSED}" -ge "${TIMEOUT}" ]; then
echo "ERROR: Test failover did not complete within ${TIMEOUT}s."
exit 1
fi
Phase 4 — Validate DR instances
After the operation completes, confirm that VMs are actually running on the secondary site. This catches cases where the operation reports success but individual VMs failed to start.
export OS_CLOUD=site_secondary
# List instances recreated by the test failover
openstack server list --project "${OS_PROJECT_NAME}" --status ACTIVE
# Optionally check each expected instance by name
for INSTANCE_NAME in "${EXPECTED_INSTANCES[@]}"; do
STATUS=$(openstack server show "${INSTANCE_NAME}" -f value -c status 2>/dev/null || echo "NOT_FOUND")
if [ "${STATUS}" != "ACTIVE" ]; then
echo "ERROR: Instance ${INSTANCE_NAME} is ${STATUS} on secondary site."
exit 1
fi
echo "OK: ${INSTANCE_NAME} is ACTIVE on secondary site."
done
Phase 5 — Cleanup
Run test cleanup to remove the instances created on the secondary site and return the Protection Group to active status. Skipping this step leaves orphaned VMs consuming quota and leaves the Protection Group in a test_failed_over state that blocks subsequent operations.
export OS_CLOUD=site_primary
openstack protector protection-group test-cleanup "${PG_NAME}"
# Confirm Protection Group is active again
PG_STATUS=$(openstack protector protection-group show "${PG_NAME}" \
-f value -c status)
if [ "${PG_STATUS}" != "active" ]; then
echo "ERROR: Cleanup did not restore active status (status: ${PG_STATUS})."
exit 1
fi
The following examples are self-contained and show expected output. Adapt environment variable names to match your CI platform's secret injection conventions.
Example 1 — GitHub Actions workflow: scheduled weekly DR drill
This workflow runs every Sunday at 02:00 UTC, generates a clouds.yaml from secrets, runs the full five-phase DR drill, and posts the result as a workflow annotation.
# .github/workflows/dr-drill.yml
name: Weekly DR Drill
on:
schedule:
- cron: '0 2 * * 0'
workflow_dispatch: # Allow manual trigger
env:
PG_NAME: prod-web-app
PRIMARY_SITE_NAME: site-a
SECONDARY_SITE_NAME: site-b
PRIMARY_NET_UUID: ${{ secrets.PRIMARY_NET_UUID }}
SECONDARY_NET_UUID: ${{ secrets.SECONDARY_NET_UUID }}
PRIMARY_FLAVOR_ID: ${{ secrets.PRIMARY_FLAVOR_ID }}
SECONDARY_FLAVOR_ID: ${{ secrets.SECONDARY_FLAVOR_ID }}
jobs:
dr-drill:
runs-on: ubuntu-22.04
steps:
- name: Install dependencies
run: pip install python-openstackclient python-protectorclient
- name: Generate clouds.yaml
run: |
cat > clouds.yaml << EOF
clouds:
site_primary:
auth:
auth_url: ${{ secrets.OS_AUTH_URL_PRIMARY }}
username: ${{ secrets.OS_USERNAME }}
password: ${{ secrets.OS_PASSWORD }}
project_name: ${{ secrets.OS_PROJECT_NAME }}
user_domain_name: Default
project_domain_name: Default
site_secondary:
auth:
auth_url: ${{ secrets.OS_AUTH_URL_SECONDARY }}
username: ${{ secrets.OS_USERNAME }}
password: ${{ secrets.OS_PASSWORD }}
project_name: ${{ secrets.OS_PROJECT_NAME }}
user_domain_name: Default
project_domain_name: Default
EOF
- name: Run DR drill
run: bash scripts/dr-drill.sh
- name: Cleanup on failure
if: failure()
run: |
export OS_CLOUD=site_primary
openstack protector protection-group test-cleanup "${PG_NAME}" || true
Example 2 — Core drill script (scripts/dr-drill.sh)
This reusable script implements all five phases and can be called from any CI platform.
#!/usr/bin/env bash
set -euo pipefail
export OS_CLOUD=site_primary
echo "=== Phase 1: Pre-checks ==="
PG_STATUS=$(openstack protector protection-group show "${PG_NAME}" \
-f value -c status)
echo "Protection Group status: ${PG_STATUS}"
[ "${PG_STATUS}" = "active" ] || { echo "ERROR: PG not active"; exit 1; }
openstack protector site validate "${PRIMARY_SITE_NAME}"
openstack protector site validate "${SECONDARY_SITE_NAME}"
echo "Both sites reachable."
echo "=== Phase 2: Trigger test failover ==="
OP_ID=$(openstack protector protection-group test-failover "${PG_NAME}" \
--retain-primary \
--network-mapping "${PRIMARY_NET_UUID}=${SECONDARY_NET_UUID}" \
--flavor-mapping "${PRIMARY_FLAVOR_ID}=${SECONDARY_FLAVOR_ID}" \
-f value -c operation_id)
echo "Operation ID: ${OP_ID}"
echo "=== Phase 3: Poll for completion ==="
TIMEOUT=600
INTERVAL=15
ELAPSED=0
while [ "${ELAPSED}" -lt "${TIMEOUT}" ]; do
OP_JSON=$(openstack protector operation show "${OP_ID}" -f json)
OP_STATUS=$(echo "${OP_JSON}" | jq -r '.status')
PROGRESS=$(echo "${OP_JSON}" | jq -r '.progress')
echo " [${ELAPSED}s] ${OP_STATUS} — ${PROGRESS}%"
[ "${OP_STATUS}" = "completed" ] && break
if [ "${OP_STATUS}" = "failed" ]; then
ERROR=$(echo "${OP_JSON}" | jq -r '.error_message')
echo "ERROR: ${ERROR}"
exit 1
fi
sleep "${INTERVAL}"
ELAPSED=$(( ELAPSED + INTERVAL ))
done
[ "${ELAPSED}" -lt "${TIMEOUT}" ] || { echo "ERROR: Timed out after ${TIMEOUT}s"; exit 1; }
echo "=== Phase 4: Validate secondary instances ==="
export OS_CLOUD=site_secondary
ACTIVE_COUNT=$(openstack server list \
--project "${OS_PROJECT_NAME}" \
--status ACTIVE \
-f value -c ID | wc -l)
echo "Active instances on secondary site: ${ACTIVE_COUNT}"
[ "${ACTIVE_COUNT}" -gt 0 ] || { echo "ERROR: No active instances found on secondary site"; exit 1; }
echo "=== Phase 5: Cleanup ==="
export OS_CLOUD=site_primary
openstack protector protection-group test-cleanup "${PG_NAME}"
FINAL_STATUS=$(openstack protector protection-group show "${PG_NAME}" \
-f value -c status)
echo "Protection Group status after cleanup: ${FINAL_STATUS}"
[ "${FINAL_STATUS}" = "active" ] || { echo "ERROR: Cleanup did not restore active status"; exit 1; }
echo "DR drill completed successfully."
Expected terminal output on success:
=== Phase 1: Pre-checks ===
Protection Group status: active
site-a: status active
site-b: status active
Both sites reachable.
=== Phase 2: Trigger test failover ===
Operation ID: op-7a3c1f82-...
=== Phase 3: Poll for completion ===
[0s] running — 10%
[15s] running — 35%
[30s] running — 60%
[45s] running — 85%
[60s] completed — 100%
=== Phase 4: Validate secondary instances ===
Active instances on secondary site: 3
=== Phase 5: Cleanup ===
Protection Group status after cleanup: active
DR drill completed successfully.
Example 3 — GitLab CI pipeline stage
Insert this stage into an existing .gitlab-ci.yml to gate releases on DR readiness.
# .gitlab-ci.yml (excerpt)
dr-drill:
stage: dr-validation
image: python:3.11-slim
rules:
- if: '$CI_PIPELINE_SOURCE == "schedule"'
- if: '$CI_COMMIT_BRANCH == "main"'
before_script:
- pip install --quiet python-openstackclient python-protectorclient
- |
cat > clouds.yaml << EOF
clouds:
site_primary:
auth:
auth_url: ${OS_AUTH_URL_PRIMARY}
username: ${OS_USERNAME}
password: ${OS_PASSWORD}
project_name: ${OS_PROJECT_NAME}
user_domain_name: Default
project_domain_name: Default
site_secondary:
auth:
auth_url: ${OS_AUTH_URL_SECONDARY}
username: ${OS_USERNAME}
password: ${OS_PASSWORD}
project_name: ${OS_PROJECT_NAME}
user_domain_name: Default
project_domain_name: Default
EOF
script:
- bash scripts/dr-drill.sh
after_script:
# Best-effort cleanup even if script failed
- export OS_CLOUD=site_primary
- openstack protector protection-group test-cleanup "${PG_NAME}" || true
variables:
PG_NAME: prod-web-app
PRIMARY_SITE_NAME: site-a
SECONDARY_SITE_NAME: site-b
artifacts:
when: always
reports:
      dotenv: dr-drill-results.env
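Note that the `reports: dotenv:` artifact above expects the drill script to produce `dr-drill-results.env`, which the core script does not do on its own. A small helper like the sketch below (variable names are illustrative) can be appended to `scripts/dr-drill.sh`; the file name must match the artifact path.

```shell
# Sketch: emit a dotenv report for GitLab's `reports: dotenv:` artifact.
# write_drill_results <status> <elapsed_seconds> <active_count>
write_drill_results() {
  {
    echo "DR_DRILL_STATUS=$1"
    echo "DR_DRILL_ELAPSED_SECONDS=$2"
    echo "DR_DRILL_ACTIVE_INSTANCES=$3"
  } > dr-drill-results.env
}

# e.g. as the last line of dr-drill.sh:
# write_drill_results passed "${ELAPSED}" "${ACTIVE_COUNT}"
```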
Use the following reference to diagnose failures in automated DR drill pipelines. Each entry follows the pattern: symptom → likely cause → fix.
Symptom: Pre-check fails with Protection Group is not active (status: failing_over)
Likely cause: A previous pipeline run triggered a test failover that did not reach the cleanup phase — either because the job was cancelled or because cleanup itself failed. The Protection Group is stuck in a transitional state.
Fix: Manually run the cleanup command, then re-run the pipeline.
export OS_CLOUD=site_primary
openstack protector protection-group test-cleanup "${PG_NAME}"
openstack protector protection-group show "${PG_NAME}" -c status
If the status does not return to active, check the Protector engine log on the primary site (/var/log/protector/engine.log) for the operation that left it in this state.
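A quick way to pull the relevant lines from that log is a helper like this sketch (the function name and match pattern are ours; the log path is parameterized so it can be pointed elsewhere):

```shell
# Sketch: surface recent failover/cleanup errors from the Protector engine log.
# recent_failover_errors [logfile]
recent_failover_errors() {
  tail -n 500 "${1:-/var/log/protector/engine.log}" \
    | grep -iE 'error|test_failover|cleanup' \
    | tail -n 20
}
```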
Symptom: openstack protector site validate returns status: unreachable for the secondary site
Likely cause: Network connectivity between the CI runner (or Protector engine) and the secondary site's Keystone or Protector API endpoint is blocked. This can also happen if the secondary site's protector-api service is down.
Fix:
- From the CI runner, test raw connectivity: `curl -k https://<secondary-keystone>:5000/v3`
- Verify `protector-api` is running on the secondary site: `ps aux | grep protector-api` and `netstat -tlnp | grep 8788`
- Check that your `clouds.yaml` `auth_url` for `site_secondary` is the external (not internal/management) Keystone endpoint reachable from the CI runner.
Important: Because metadata sync between sites is blocked when a peer site is unreachable, any modifications to the Protection Group will also fail until connectivity is restored. The pipeline correctly surfaces this as a DR readiness gap.
Symptom: Test failover operation reaches failed status with error_message: No valid backend was found
Likely cause: The volume_backend_name property on the secondary site's volume type does not match any enabled Cinder backend on that site.
Fix:
export OS_CLOUD=site_secondary
# Check active Cinder backends
openstack volume service list
# Check volume type properties
openstack volume type show replicated-async
# Correct the backend name
openstack volume type set replicated-async \
--property volume_backend_name='<correct-backend-name>'
Symptom: Test failover completes but Phase 4 reports zero active instances on the secondary site
Likely cause 1 (real storage): The flavor mapping is missing or incorrect, causing VM creation to silently fail during the operation. The operation status may show completed at the Protection Group level while individual member statuses show error.
Fix: Check per-member status.
export OS_CLOUD=site_primary
openstack protector protection-group member-list "${PG_NAME}"
Look for members with status: error. Correct the --flavor-mapping values and re-run the drill.
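When the member list is long, a jq filter can isolate the failing members. This is a sketch; the JSON field names `name` and `status` are assumptions about the `member-list -f json` output, so verify them in your deployment first.

```shell
# Sketch: print only the names of Protection Group members in error state.
# Field names `name` and `status` are assumptions about the JSON output.
failed_members() {
  openstack protector protection-group member-list "$1" -f json \
    | jq -r '.[] | select(.status == "error") | .name'
}

# failed_members "${PG_NAME}"
```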
Likely cause 2 (mock storage): The Glance image referenced during mock failover does not exist on the secondary site or has a different name.
Fix:
export OS_CLOUD=site_secondary
openstack image list
Ensure the same image name exists on both sites. Upload it if missing (see Prerequisites).
Symptom: Pipeline times out in Phase 3 before operation reaches completed
Likely cause: The default 600-second timeout is too short for your environment's replication snapshot retrieval or VM boot time. This is most common when async replication lag means the engine must wait for a new snapshot to be available.
Fix: Increase TIMEOUT in your drill script. Also check the operation's steps_completed and steps_failed fields to identify which phase is slow:
export OS_CLOUD=site_primary
openstack protector operation show "${OP_ID}" -f json | jq '.steps_completed, .steps_failed'
If snapshot retrieval is the bottleneck, check the replication interval in your policy (openstack protector protection-group policy-show "${PG_NAME}") and ensure replication is running on schedule.
Symptom: Cleanup step fails with Cannot modify Protection Group: peer site unreachable
Likely cause: Strict metadata sync enforcement is working as designed. The Protector service blocks all Protection Group modifications — including test cleanup — when the peer site is unreachable, to prevent metadata divergence between sites.
Fix: Restore connectivity to the peer site before retrying cleanup. Do not attempt to force cleanup by directly modifying the database — this will cause metadata divergence that can prevent future failover operations from working correctly. After restoring connectivity:
export OS_CLOUD=site_primary
openstack protector protection-group test-cleanup "${PG_NAME}"
Symptom: Authentication error 401 Unauthorized when the pipeline switches from OS_CLOUD=site_primary to OS_CLOUD=site_secondary
Likely cause: The service account exists on the primary site but was not created on the secondary site, or the password differs.
Fix: Create a matching service account on the secondary site and confirm it has the member role on the project that owns the Protection Groups. Tokens issued by the primary site's Keystone are not valid on the secondary site; each site requires independent authentication.
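To pinpoint which side rejects the credentials, request a token from each site separately. The helper below is a sketch built on the standard `openstack token issue` command (the function name is ours):

```shell
# Sketch: verify credentials against each site's Keystone independently.
# Returns non-zero (e.g. on 401 Unauthorized) if auth fails for the named cloud.
check_site_auth() {
  openstack --os-cloud "$1" token issue -f value -c id > /dev/null 2>&1
}

# for cloud in site_primary site_secondary; do
#   check_site_auth "${cloud}" \
#     && echo "OK: ${cloud}" \
#     || echo "ERROR: authentication failed on ${cloud}"
# done
```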