Site Recovery for OpenStack
Guide

Operations API

Trigger and monitor DR operations: failover, failback, test failover, sync


Overview

The Operations API is the control plane for executing and monitoring disaster recovery workflows in Trilio Site Recovery. It exposes four primary operation types — failover, failback, test failover, and consistency group sync — all triggered through a single action endpoint on a Protection Group. Because DR operations are long-running and stateful, the API returns an operation record immediately and lets you poll progress asynchronously. Understanding this API is essential for automating DR drills, executing planned maintenance failovers, and responding to unplanned outages.


Prerequisites

Before triggering any DR operation, verify the following:

  • Two registered and reachable sites — both the primary and secondary OpenStack sites must be registered in Protector and return status: active from GET /v1/admin/sites/{site_id}.
  • Protection Group in active or failed_over state — failover requires active; failback requires failed_over. A group in error or failing_over must be resolved before new operations are accepted.
  • Replication policy configured — the Protection Group must have a replication policy with valid FlashArray credentials for both sites (GET /v1/{tenant_id}/protection-groups/{pg_id}/policy must return a policy record).
  • Metadata in sync — the Protection Group metadata version must match on both sites. Check with the sync-status endpoint before operating. Modifications — including triggering operations — are blocked if the peer site is unreachable.
  • Volume types eligible for replication — all volumes in the Protection Group's Consistency Group must use a Cinder volume type with replication_enabled='<is> True' and a matching replication_type property.
  • Cinder policy configured — the volume_extension:volume_manage and volume_extension:volume_unmanage policies must allow the member role on the target site (see the deployment prerequisites guide).
  • Protector API version 1.0 or later — send OpenStack-API-Version: protector 1.0 (or higher) on all requests. Current version is 1.2.
  • Valid Keystone token — scoped to the tenant project. The protectorclient CLI and Horizon dashboard handle token acquisition automatically when configured with clouds.yaml.

Installation

The Operations API is part of the protector-api service. No separate installation is required beyond a functioning Trilio Site Recovery deployment. To interact with the API directly, install the protectorclient OSC plugin on any host that can reach both sites' Keystone endpoints.

Step 1 — Confirm the API service is running on both sites

# On Site A
systemctl status protector-api

# On Site B
systemctl status protector-api

Both services must report Active: active (running).

Step 2 — Verify API reachability

# Replace with your controller IP and tenant ID
curl -s http://site-a-controller:8788/
curl -s http://site-b-controller:8788/

A successful response returns the API version document. A connection refused or timeout means protector-api is not reachable on that host.

Step 3 — Install the protectorclient OSC plugin

pip install python-protectorclient

Step 4 — Configure clouds.yaml for both sites

The OSC plugin must authenticate to both sites to coordinate metadata. Create or update ~/.config/openstack/clouds.yaml:

clouds:
  site-a:
    auth:
      auth_url: http://site-a-controller:5000/v3
      project_name: your-project
      username: your-user
      password: your-password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne

  site-b:
    auth:
      auth_url: http://site-b-controller:5000/v3
      project_name: your-project
      username: your-user
      password: your-password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne

Step 5 — Verify CLI access

export OS_CLOUD=site-a
openstack protector protection-group list

A table of your Protection Groups confirms that authentication and API connectivity are working.


Configuration

All DR operations share a common set of parameters that control how resources are mapped and how the operation behaves under failure conditions. These are supplied in the request body of the action endpoint.

Common operation parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| network_mapping | object | {} | Maps source network UUIDs to target network UUIDs. Keys are UUIDs on the current primary; values are UUIDs on the target site. Required when network names differ across sites. |
| flavor_mapping | object | {} | Maps source flavor IDs to target flavor IDs. Use when the target site has different flavor names or sizes. If omitted, Protector attempts to match by name. |
| force | boolean | false | Bypasses the peer-site reachability check. Use only for unplanned failover when the primary site is genuinely down. Setting force: true when both sites are reachable risks metadata divergence. |
| retain_primary | boolean | false | Applies to test_failover only. When true, primary-site VMs remain running during the test. When false, primary VMs are quiesced for the duration. |
| reverse_replication | boolean | false | Applies to failback only. When true, Protector reconfigures Pure Storage replication so that Site A becomes the replication target again after failback completes. Recommended for planned failbacks. |
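A small helper can assemble these parameters into a request body before posting. This is an illustrative sketch only: `build_failover_body` and its `src=dst` flag syntax are this guide's invention (loosely mirroring the CLI flags shown later), not a shipped tool.

```shell
# Sketch: compose the failover action body from src=dst mapping pairs.
# (build_failover_body is a hypothetical helper, not part of protectorclient.)
build_failover_body() {
  python3 - "$@" <<'PY'
import json, sys

net, flv, force = {}, {}, False
for arg in sys.argv[1:]:
    if arg == "--force":
        force = True
    elif arg.startswith("--network-mapping="):
        _, src, dst = arg.split("=", 2)
        net[src] = dst
    elif arg.startswith("--flavor-mapping="):
        _, src, dst = arg.split("=", 2)
        flv[src] = dst
print(json.dumps({"failover": {"network_mapping": net,
                               "flavor_mapping": flv,
                               "force": force}}))
PY
}

build_failover_body \
  --network-mapping=aaa11111-net-primary-web=bbb22222-net-secondary-web \
  --flavor-mapping=m1.large=m2.large
```

The printed JSON has the same shape as the failover body in Example 1 below, so it can be fed to the action endpoint with `curl ... -d @-`.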

Operation types and valid states

The action endpoint accepts one operation key per request. The accepted key determines what Protector executes:

| Action key | Allowed when PG status is | Result status |
|---|---|---|
| failover | active | failed_over |
| failback | failed_over | active |
| test_failover | active | active (primary unchanged) |
| test_cleanup | active (after test failover) | active |
| sync_volumes | active or failed_over | no status change |
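This state table can be encoded as a client-side guard so automation fails fast instead of posting an action that will be rejected. A minimal sketch — the `check_action` name is ours, and the server remains the authority on state:

```shell
# Client-side mirror of the state table above; the API performs the real check.
check_action() {
  local action=$1 status=$2
  case "$action:$status" in
    failover:active|test_failover:active|test_cleanup:active) echo allowed ;;
    failback:failed_over)                                     echo allowed ;;
    sync_volumes:active|sync_volumes:failed_over)             echo allowed ;;
    *) echo "rejected: $action is not valid while the group is $status" ;;
  esac
}

check_action failover active   # prints: allowed
check_action failback active   # prints: rejected: ...
```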

Metadata sync behavior

By default, every operation that changes Protection Group state requires both sites to be reachable before it begins — Protector blocks the operation and returns an error if the peer site cannot be contacted. This prevents metadata divergence. The force flag overrides this block for failover only, and only when you are responding to a genuine site outage. Forced operations log a sync warning and mark the remote sync status as UNREACHABLE; once the failed site recovers, run a forced sync before resuming normal operations.


Usage

Triggering operations

All DR operations are submitted as HTTP POST requests to the Protection Group action endpoint:

POST /v1/{tenant_id}/protection-groups/{pg_id}/action

The request body contains exactly one top-level key identifying the operation. Protector validates the Protection Group state, checks peer-site reachability (unless force is set), and returns an operation record with status pending or running. The operation then executes asynchronously.
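Because the body must carry exactly one recognized top-level key, a quick local check before posting can catch malformed requests early. A sketch — `validate_action_body` is this guide's illustration, not an API call:

```shell
# Sketch: confirm the action body has exactly one recognized top-level key.
# (validate_action_body is a hypothetical helper, not part of the API.)
validate_action_body() {
  printf '%s' "$1" | python3 -c '
import json, sys

KNOWN = {"failover", "failback", "test_failover", "test_cleanup", "sync_volumes"}
keys = set(json.load(sys.stdin))
if len(keys) == 1 and keys <= KNOWN:
    print("valid:", keys.pop())
else:
    print("invalid top-level keys:", sorted(keys))
'
}

validate_action_body '{"failover": {"force": false}}'   # prints: valid: failover
```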

Polling operation progress

Protector does not block the HTTP response until the operation completes. Instead, poll the operation detail endpoint:

GET /v1/{tenant_id}/operations/{op_id}

The progress field is an integer from 0 to 100. The status field moves through pending → running → completed on success, or pending → running → rolling_back → failed on failure. The steps_completed and steps_failed fields are arrays that let you see exactly where a long operation stands or where it broke.
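That polling pattern can be scripted as a loop. In this sketch, `poll_operation` is our own wrapper name; its first argument is any command that prints the operation JSON — in practice, the GET request above:

```shell
# Sketch: loop until the operation reaches a terminal status.
# $1 is any command printing the operation JSON (in practice, the GET above).
poll_operation() {
  local fetch=$1 interval=${2:-5} json status progress
  while :; do
    json=$($fetch) || return 1
    status=$(printf '%s' "$json" |
      python3 -c 'import sys, json; print(json.load(sys.stdin)["operation"]["status"])')
    progress=$(printf '%s' "$json" |
      python3 -c 'import sys, json; print(json.load(sys.stdin)["operation"]["progress"])')
    echo "status=$status progress=$progress"
    case "$status" in
      completed|failed) return 0 ;;
    esac
    sleep "$interval"
  done
}

# Example invocation (endpoint as in the cURL examples below):
# poll_operation "curl -s -H \"X-Auth-Token: $TOKEN\" http://site-a-controller:8788/v1/$TENANT_ID/operations/$OP_ID" 10
```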

Operation lifecycle phases

Every failover-class operation passes through four phases. Progress values are approximate:

| Phase | Progress range | What happens |
|---|---|---|
| Preparation | 0–20% | PG status set to failing_over, DR operation record created, target site validated, latest CG snapshot retrieved from Pure Storage |
| Storage failover | 20–60% | Replicated snapshot identified on target FlashArray; volumes created from snapshot; volumes managed into Cinder on target site |
| Instance recreation | 60–90% | VMs rebuilt on target site using stored metadata, network/flavor mappings applied, volumes attached |
| Finalization | 90–100% | PG status updated, current_primary_site_id swapped, failover_count incremented, operation marked completed |
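Given only a progress value, the current phase can be inferred from these ranges. A sketch for dashboards or log messages — the boundaries are approximate, as noted above, and `phase_for_progress` is our name:

```shell
# Sketch: map a progress value to its approximate lifecycle phase.
# Boundaries follow the table above and are approximate by design.
phase_for_progress() {
  local p=$1
  if   [ "$p" -lt 20 ]; then echo "preparation"
  elif [ "$p" -lt 60 ]; then echo "storage failover"
  elif [ "$p" -lt 90 ]; then echo "instance recreation"
  else                       echo "finalization"
  fi
}

phase_for_progress 55   # prints: storage failover
```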

Listing operations

To see all operations for your tenant — across all Protection Groups — use:

GET /v1/{tenant_id}/operations

Filter by protection_group_id or operation_type as query parameters to narrow results. This is useful when monitoring multiple Protection Groups during a site-level event.


Examples

All examples below use cURL with environment variables for readability. Replace $TOKEN, $TENANT_ID, $PG_ID, and $OP_ID with your actual values. Set OpenStack-API-Version: protector 1.1 or higher.


Obtain a token and tenant ID

export TOKEN=$(openstack token issue -f value -c id)
export TENANT_ID=$(openstack token issue -f value -c project_id)

Example 1 — Planned failover (primary site maintenance)

Use this pattern before scheduled maintenance on Site A. Both sites must be reachable.

curl -s -X POST \
  http://site-a-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/action \
  -H "X-Auth-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -H "OpenStack-API-Version: protector 1.1" \
  -d '{
    "failover": {
      "network_mapping": {
        "aaa11111-net-primary-web": "bbb22222-net-secondary-web",
        "aaa11111-net-primary-db":  "bbb22222-net-secondary-db"
      },
      "flavor_mapping": {
        "m1.large": "m2.large"
      },
      "force": false
    }
  }'

Expected response:

{
  "operation": {
    "id": "op-456abc78-...",
    "operation_type": "failover",
    "status": "running",
    "progress": 0,
    "source_site_id": "site-a-uuid",
    "target_site_id": "site-b-uuid",
    "started_at": "2025-11-03T10:00:00Z",
    "completed_at": null,
    "error_message": null,
    "steps_completed": [],
    "steps_failed": []
  }
}

Example 2 — Unplanned failover (Site A is down)

Site A is unreachable. Set force: true to bypass the peer-site reachability check and proceed from Site B.

# Authenticate to Site B
export OS_AUTH_URL=http://site-b-controller:5000/v3
source ~/site-b-openrc
export TOKEN=$(openstack token issue -f value -c id)
export TENANT_ID=$(openstack token issue -f value -c project_id)

curl -s -X POST \
  http://site-b-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/action \
  -H "X-Auth-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -H "OpenStack-API-Version: protector 1.1" \
  -d '{
    "failover": {
      "network_mapping": {
        "aaa11111-net-primary-web": "bbb22222-net-secondary-web"
      },
      "force": true
    }
  }'

Expected response: Same shape as Example 1. After the operation completes, the remote sync status will be UNREACHABLE. Run a forced metadata sync once Site A recovers before attempting any further operations.


Example 3 — Test failover (DR drill, non-disruptive)

Runs a full failover to Site B while leaving Site A VMs running. Use this for DR drills.

curl -s -X POST \
  http://site-a-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/action \
  -H "X-Auth-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -H "OpenStack-API-Version: protector 1.1" \
  -d '{
    "test_failover": {
      "network_mapping": {
        "aaa11111-net-primary-web": "bbb22222-net-secondary-web"
      },
      "retain_primary": true
    }
  }'

When retain_primary is true, Protector creates test instances on Site B from the latest replicated snapshot without touching Site A. After validation, clean up test instances with the test_cleanup action:

curl -s -X POST \
  http://site-a-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/action \
  -H "X-Auth-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -H "OpenStack-API-Version: protector 1.1" \
  -d '{"test_cleanup": {}}'

Example 4 — Failback with reverse replication

Site A has recovered. Fail back from Site B and re-establish replication so Site A is the protected copy again.

curl -s -X POST \
  http://site-b-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/action \
  -H "X-Auth-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -H "OpenStack-API-Version: protector 1.1" \
  -d '{
    "failback": {
      "network_mapping": {
        "bbb22222-net-secondary-web": "aaa11111-net-primary-web",
        "bbb22222-net-secondary-db":  "aaa11111-net-primary-db"
      },
      "reverse_replication": true,
      "force": false
    }
  }'

Example 5 — Force consistency group sync

Manually trigger a snapshot and sync of the Consistency Group without a full failover:

curl -s -X POST \
  http://site-a-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/consistency-group/sync \
  -H "X-Auth-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -H "OpenStack-API-Version: protector 1.1"

Expected response:

{
  "operation": {
    "id": "op-999def00-...",
    "operation_type": "sync_volumes",
    "status": "running",
    "progress": 0
  }
}

Example 6 — Poll operation progress

curl -s \
  http://site-a-controller:8788/v1/$TENANT_ID/operations/$OP_ID \
  -H "X-Auth-Token: $TOKEN" \
  -H "OpenStack-API-Version: protector 1.1"

Expected response (mid-operation):

{
  "operation": {
    "id": "op-456abc78-...",
    "operation_type": "failover",
    "status": "running",
    "progress": 55,
    "steps_completed": [
      "validate_target_site",
      "get_latest_snapshot",
      "promote_volumes_on_flasharray"
    ],
    "steps_failed": [],
    "error_message": null,
    "started_at": "2025-11-03T10:00:00Z",
    "completed_at": null
  }
}

Expected response (completed):

{
  "operation": {
    "id": "op-456abc78-...",
    "operation_type": "failover",
    "status": "completed",
    "progress": 100,
    "steps_completed": [
      "validate_target_site",
      "get_latest_snapshot",
      "promote_volumes_on_flasharray",
      "manage_volumes_into_cinder",
      "recreate_instances",
      "update_protection_group_state"
    ],
    "steps_failed": [],
    "error_message": null,
    "started_at": "2025-11-03T10:00:00Z",
    "completed_at": "2025-11-03T10:07:43Z",
    "result_data": {
      "instances_created": 3,
      "volumes_managed": 5,
      "current_primary_site": "site-b"
    }
  }
}

Using the OSC CLI instead of cURL

The protectorclient plugin provides named subcommands that map to the same API calls:

# Planned failover
openstack protector protection-group failover prod-web-app \
  --network-mapping net-primary-web=net-secondary-web \
  --network-mapping net-primary-db=net-secondary-db \
  --flavor-mapping m1.large=m2.large

# Unplanned failover
openstack protector protection-group failover prod-web-app \
  --type unplanned \
  --network-mapping net-primary-web=net-secondary-web

# Test failover
openstack protector protection-group test-failover prod-web-app \
  --retain-primary \
  --network-mapping net-primary-web=net-secondary-web

# Failback
openstack protector protection-group failback prod-web-app \
  --reverse-replication \
  --network-mapping net-secondary-web=net-primary-web

# Monitor an operation
openstack protector operation show op-456abc78-...

# Watch progress in a loop
watch -n 5 openstack protector operation show op-456abc78-...

# List all recent operations
openstack protector operation list

Troubleshooting

Use the following patterns to diagnose failed or stuck DR operations. Check the operation's error_message field and steps_failed array first — they often identify the exact failure point.


Symptom: Operation rejected with Cannot modify protection group — remote site unreachable

Cause: The peer site's Protector API is not responding. Protector blocks all state-changing operations by default to prevent metadata divergence between sites.

Fix:

  1. Verify network connectivity to the peer site's API endpoint: curl http://peer-site-controller:8788/
  2. Confirm protector-api is running on the peer site: systemctl status protector-api
  3. If this is a genuine unplanned outage (not a connectivity blip), use "force": true in the failover request body to override the check. Do not use force if both sites are healthy.
  4. After the peer site recovers, check sync status and run a forced metadata sync before resuming normal operations.
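To distinguish a transient connectivity blip from a genuine outage before reaching for force, a short retry loop helps. A sketch — `wait_for_peer` is our own helper name; its first argument is any probe command, such as the curl check from step 1:

```shell
# Sketch: retry a reachability probe before concluding the peer site is down.
# $1 is any probe command (e.g. curl -sf http://peer-site-controller:8788/).
wait_for_peer() {
  local probe=$1 tries=${2:-5} delay=${3:-10} i
  for i in $(seq "$tries"); do
    if $probe >/dev/null 2>&1; then
      echo reachable
      return 0
    fi
    sleep "$delay"
  done
  echo unreachable
  return 1
}

# wait_for_peer "curl -sf http://peer-site-controller:8788/" 5 10
```

Only if the probe stays unreachable across all retries should an unplanned failover with `"force": true` be considered.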

Symptom: Operation fails at the manage_volumes_into_cinder step with a 403 or policy error

Cause: The Cinder policy on the target site does not allow the member role to call volume_manage or volume_unmanage.

Fix: Add the following to the target site's Cinder policy file (/etc/cinder/policy.yaml):

"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"

For Kolla-Ansible deployments, update /etc/kolla/config/cinder/policy.yaml and run:

kolla-ansible -i inventory reconfigure -t cinder

Then retry the operation.


Symptom: Operation fails at get_latest_snapshot or promote_volumes_on_flasharray with a storage error

Cause: The replication policy credentials are incorrect, the FlashArray on the target site is unreachable, or no replicated snapshot exists yet (replication may never have run).

Fix:

  1. Confirm the replication policy is configured: GET /v1/{tenant_id}/protection-groups/{pg_id}/policy
  2. Verify the FlashArray management IP is reachable from the controller node: curl -k https://flasharray-b.example.com
  3. Check that at least one replication cycle has completed. For async replication, the interval is set by replication_interval in the policy (default: 300 seconds). Force an immediate sync with POST /v1/{tenant_id}/protection-groups/{pg_id}/consistency-group/sync and wait for it to complete before retrying the failover.
  4. If credentials have changed, update the replication policy with POST /v1/{tenant_id}/protection-groups/{pg_id}/policy.

Symptom: Operation status stuck at running with no progress change for more than 15 minutes

Cause: The protector-engine process on the active site may have died or lost its database connection mid-operation.

Fix:

  1. Check engine health: systemctl status protector-engine and journalctl -u protector-engine -n 100
  2. If the engine is stopped, restart it: systemctl restart protector-engine. The engine will pick up pending operations from the database on startup.
  3. If the operation remains stuck after engine restart, check the steps_completed field to see how far it progressed. Depending on the step, you may need to manually clean up partial resources (orphaned volumes or instances on the target site) before retrying.

Symptom: Failback rejected with Protection Group is not in failed_over state

Cause: The Protection Group status is not failed_over. Failback is only valid from this state.

Fix: Run GET /v1/{tenant_id}/protection-groups/{pg_id} and inspect status and current_primary_site_id. If the group shows active and current_primary_site_id points to what you consider the secondary, a previous failover may not have completed or may have rolled back. Review operation history with GET /v1/{tenant_id}/operations filtered by the Protection Group ID.


Symptom: After unplanned failover, metadata sync shows OUT OF SYNC and version mismatch

Cause: During the unplanned failover, Protector could not reach the failed primary site to push the updated metadata. This is expected behavior.

Fix: Once the failed site recovers, force a metadata sync from the currently active site:

openstack protector protection-group sync-force prod-web-app

Verify both sites are at the same version before attempting any further modifications or operations:

openstack protector protection-group sync-status prod-web-app

Symptom: Test cleanup leaves orphaned volumes or instances on the target site

Cause: The test_cleanup operation failed partway through, or was never run after a test failover.

Fix: Re-issue the test_cleanup action. Protector tracks which resources were created during the test failover in the operation's result_data field and will attempt to remove them:

curl -s -X POST \
  http://site-a-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/action \
  -H "X-Auth-Token: $TOKEN" \
  -H "Content-Type: application/json" \
  -H "OpenStack-API-Version: protector 1.1" \
  -d '{"test_cleanup": {}}'

If cleanup continues to fail, identify the resources from the test failover operation's result_data and remove them manually via the Nova and Cinder APIs on the target site.