Operations API
Trigger and monitor DR operations: failover, failback, test failover, sync
The Operations API is the control plane for executing and monitoring disaster recovery workflows in Trilio Site Recovery. It exposes four primary operation types — failover, failback, test failover, and consistency group sync — all triggered through a single action endpoint on a Protection Group. Because DR operations are long-running and stateful, the API returns an operation record immediately and lets you poll progress asynchronously. Understanding this API is essential for automating DR drills, executing planned maintenance failovers, and responding to unplanned outages.
Before triggering any DR operation, verify the following:
- Two registered and reachable sites — both the primary and secondary OpenStack sites must be registered in Protector and return `status: active` from `GET /v1/admin/sites/{site_id}`.
- Protection Group in `active` or `failed_over` state — failover requires `active`; failback requires `failed_over`. A group in `error` or `failing_over` must be resolved before new operations are accepted.
- Replication policy configured — the Protection Group must have a replication policy with valid FlashArray credentials for both sites (`GET /v1/{tenant_id}/protection-groups/{pg_id}/policy` must return a policy record).
- Metadata in sync — the Protection Group metadata version must match on both sites. Check with the sync-status endpoint before operating. Modifications — including triggering operations — are blocked if the peer site is unreachable.
- Volume types eligible for replication — all volumes in the Protection Group's Consistency Group must use a Cinder volume type with `replication_enabled='<is> True'` and a matching `replication_type` property.
- Cinder policy configured — the `volume_extension:volume_manage` and `volume_extension:volume_unmanage` policies must allow the `member` role on the target site (see the deployment prerequisites guide).
- Protector API version 1.0 or later — send `OpenStack-API-Version: protector 1.0` (or higher) on all requests. The current version is 1.2.
- Valid Keystone token — scoped to the tenant project. The `protectorclient` CLI and Horizon dashboard handle token acquisition automatically when configured with `clouds.yaml`.
The Operations API is part of the protector-api service. No separate installation is required beyond a functioning Trilio Site Recovery deployment. To interact with the API directly, install the protectorclient OSC plugin on any host that can reach both sites' Keystone endpoints.
Step 1 — Confirm the API service is running on both sites
# On Site A
systemctl status protector-api
# On Site B
systemctl status protector-api
Both services must report Active: active (running).
Step 2 — Verify API reachability
# Replace with your controller IP and tenant ID
curl -s http://site-a-controller:8788/
curl -s http://site-b-controller:8788/
A successful response returns the API version document. A connection refused or timeout means protector-api is not reachable on that host.
Step 3 — Install the protectorclient OSC plugin
pip install python-protectorclient
Step 4 — Configure clouds.yaml for both sites
The OSC plugin must authenticate to both sites to coordinate metadata. Create or update ~/.config/openstack/clouds.yaml:
clouds:
  site-a:
    auth:
      auth_url: http://site-a-controller:5000/v3
      project_name: your-project
      username: your-user
      password: your-password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
  site-b:
    auth:
      auth_url: http://site-b-controller:5000/v3
      project_name: your-project
      username: your-user
      password: your-password
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
Step 5 — Verify CLI access
export OS_CLOUD=site-a
openstack protector protection-group list
A table of your Protection Groups confirms that authentication and API connectivity are working.
All DR operations share a common set of parameters that control how resources are mapped and how the operation behaves under failure conditions. These are supplied in the request body of the action endpoint.
Common operation parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `network_mapping` | object | `{}` | Maps source network UUIDs to target network UUIDs. Keys are UUIDs on the current primary; values are UUIDs on the target site. Required when network names differ across sites. |
| `flavor_mapping` | object | `{}` | Maps source flavor IDs to target flavor IDs. Use when the target site has different flavor names or sizes. If omitted, Protector attempts to match by name. |
| `force` | boolean | `false` | Bypasses the peer-site reachability check. Use only for unplanned failover when the primary site is genuinely down. Setting `force: true` when both sites are reachable risks metadata divergence. |
| `retain_primary` | boolean | `false` | Applies to `test_failover` only. When true, primary-site VMs remain running during the test. When false, primary VMs are quiesced for the duration. |
| `reverse_replication` | boolean | `false` | Applies to `failback` only. When true, Protector reconfigures Pure Storage replication so that Site A becomes the replication target again after failback completes. Recommended for planned failbacks. |
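A request body combining these parameters can be assembled in the shell before posting it. The helper below is illustrative only (it is not part of protectorclient); it turns `key=value` network-mapping pairs into the JSON structure described above, using python3 for correct JSON quoting:

```shell
# Hypothetical helper: build a failover request body from key=value
# network-mapping pairs. Not part of protectorclient; shown only to
# illustrate the body structure.
build_failover_body() {
  python3 - "$@" <<'PY'
import json, sys
# Each CLI argument is "source=target"; collect them into network_mapping.
nets = dict(p.split("=", 1) for p in sys.argv[1:])
print(json.dumps({"failover": {"network_mapping": nets, "force": False}}))
PY
}

# Usage: build_failover_body src-net-uuid=dst-net-uuid | curl ... -d @-
```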
Operation types and valid states
The action endpoint accepts one operation key per request. The accepted key determines what Protector executes:
| Action key | Allowed when PG status is | Result status |
|---|---|---|
| `failover` | `active` | `failed_over` |
| `failback` | `failed_over` | `active` |
| `test_failover` | `active` | `active` (primary unchanged) |
| `test_cleanup` | `active` (after test failover) | `active` |
| `sync_volumes` | `active` or `failed_over` | no status change |
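For automation, the state table can be encoded as a client-side pre-check. This is a sketch only; protector-api performs the authoritative validation, and the PG status would come from `GET /v1/{tenant_id}/protection-groups/{pg_id}`:

```shell
# Returns 0 if the action key is accepted for the given Protection Group
# status, mirroring the state table. Illustrative pre-check only.
op_allowed() {
  local action=$1 pg_status=$2
  case "$action:$pg_status" in
    failover:active)          return 0 ;;
    failback:failed_over)     return 0 ;;
    test_failover:active)     return 0 ;;
    test_cleanup:active)      return 0 ;;
    sync_volumes:active)      return 0 ;;
    sync_volumes:failed_over) return 0 ;;
    *)                        return 1 ;;
  esac
}
```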
Metadata sync behavior
By default, every operation that changes Protection Group state requires both sites to be reachable before it begins — Protector blocks the operation and returns an error if the peer site cannot be contacted. This prevents metadata divergence. The force flag overrides this block for failover only, and only when you are responding to a genuine site outage. Forced operations log a sync warning and mark the remote sync status as UNREACHABLE; once the failed site recovers, run a forced sync before resuming normal operations.
Triggering operations
All DR operations are submitted as HTTP POST requests to the Protection Group action endpoint:
POST /v1/{tenant_id}/protection-groups/{pg_id}/action
The request body contains exactly one top-level key identifying the operation. Protector validates the Protection Group state, checks peer-site reachability (unless force is set), and returns an operation record with status pending or running. The operation then executes asynchronously.
Polling operation progress
Protector does not block the HTTP response until the operation completes. Instead, poll the operation detail endpoint:
GET /v1/{tenant_id}/operations/{op_id}
The progress field is an integer from 0 to 100. The status field moves through pending → running → completed on success, or pending → running → rolling_back → failed on failure. The steps_completed and steps_failed fields are arrays that let you see exactly where a long operation stands or where it broke.
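A common automation pattern polls this endpoint until a terminal status is reached. The loop below is a generic sketch; `fetch_status` is a placeholder you would implement with the GET request above, extracting `.operation.status` from the JSON response:

```shell
# Generic poll loop: waits until the operation reaches a terminal status.
# fetch_status is a placeholder for the curl + JSON extraction shown above.
poll_operation() {
  local op_id=$1 max_tries=${2:-120} interval=${POLL_INTERVAL:-5}
  local tries=0 status
  while [ "$tries" -lt "$max_tries" ]; do
    status=$(fetch_status "$op_id")
    case "$status" in
      completed) echo "completed"; return 0 ;;
      failed)    echo "failed";    return 1 ;;
    esac
    tries=$((tries + 1))
    sleep "$interval"
  done
  echo "timeout"
  return 2
}
```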
Operation lifecycle phases
Every failover-class operation passes through four phases. Progress values are approximate:
| Phase | Progress range | What happens |
|---|---|---|
| Preparation | 0–20% | PG status set to failing_over, DR operation record created, target site validated, latest CG snapshot retrieved from Pure Storage |
| Storage failover | 20–60% | Replicated snapshot identified on target FlashArray; volumes created from snapshot; volumes managed into Cinder on target site |
| Instance recreation | 60–90% | VMs rebuilt on target site using stored metadata, network/flavor mappings applied, volumes attached |
| Finalization | 90–100% | PG status updated, current_primary_site_id swapped, failover_count incremented, operation marked completed |
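When reporting drill progress, the phase table above can be mapped from the numeric `progress` field. The boundaries below follow the approximate ranges in the table and are illustrative only:

```shell
# Map a progress value (0-100) to its approximate lifecycle phase,
# per the table above. Boundaries are approximate.
phase_of() {
  local p=$1
  if   [ "$p" -lt 20 ]; then echo "Preparation"
  elif [ "$p" -lt 60 ]; then echo "Storage failover"
  elif [ "$p" -lt 90 ]; then echo "Instance recreation"
  else                       echo "Finalization"
  fi
}
```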
Listing operations
To see all operations for your tenant — across all Protection Groups — use:
GET /v1/{tenant_id}/operations
Filter by protection_group_id or operation_type as query parameters to narrow results. This is useful when monitoring multiple Protection Groups during a site-level event.
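As a sketch, the filtered listing URL can be assembled like this (host and variable names are placeholders; the query parameter names match those described above):

```shell
# Build the operations listing URL with optional filters. Empty arguments
# skip the corresponding filter.
ops_url() {
  local base=$1 tenant=$2 pg=$3 op_type=$4
  local url="$base/v1/$tenant/operations" sep="?"
  if [ -n "$pg" ]; then url="$url${sep}protection_group_id=$pg"; sep="&"; fi
  if [ -n "$op_type" ]; then url="$url${sep}operation_type=$op_type"; fi
  echo "$url"
}

# e.g. curl -s -H "X-Auth-Token: $TOKEN" \
#   "$(ops_url http://site-a-controller:8788 $TENANT_ID $PG_ID failover)"
```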
All examples below use cURL with environment variables for readability. Replace $TOKEN, $TENANT_ID, $PG_ID, and $OP_ID with your actual values. Set OpenStack-API-Version: protector 1.1 or higher.
Obtain a token and tenant ID
export TOKEN=$(openstack token issue -f value -c id)
export TENANT_ID=$(openstack token issue -f value -c project_id)
Example 1 — Planned failover (primary site maintenance)
Use this pattern before scheduled maintenance on Site A. Both sites must be reachable.
curl -s -X POST \
http://site-a-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/action \
-H "X-Auth-Token: $TOKEN" \
-H "Content-Type: application/json" \
-H "OpenStack-API-Version: protector 1.1" \
-d '{
"failover": {
"network_mapping": {
"aaa11111-net-primary-web": "bbb22222-net-secondary-web",
"aaa11111-net-primary-db": "bbb22222-net-secondary-db"
},
"flavor_mapping": {
"m1.large": "m2.large"
},
"force": false
}
}'
Expected response:
{
"operation": {
"id": "op-456abc78-...",
"operation_type": "failover",
"status": "running",
"progress": 0,
"source_site_id": "site-a-uuid",
"target_site_id": "site-b-uuid",
"started_at": "2025-11-03T10:00:00Z",
"completed_at": null,
"error_message": null,
"steps_completed": [],
"steps_failed": []
}
}
Example 2 — Unplanned failover (Site A is down)
Site A is unreachable. Set force: true to bypass the peer-site reachability check and proceed from Site B.
# Authenticate to Site B
export OS_AUTH_URL=http://site-b-controller:5000/v3
source ~/site-b-openrc
export TOKEN=$(openstack token issue -f value -c id)
export TENANT_ID=$(openstack token issue -f value -c project_id)
curl -s -X POST \
http://site-b-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/action \
-H "X-Auth-Token: $TOKEN" \
-H "Content-Type: application/json" \
-H "OpenStack-API-Version: protector 1.1" \
-d '{
"failover": {
"network_mapping": {
"aaa11111-net-primary-web": "bbb22222-net-secondary-web"
},
"force": true
}
}'
Expected response: Same shape as Example 1. After the operation completes, the remote sync status will be UNREACHABLE. Run a forced metadata sync once Site A recovers before attempting any further operations.
Example 3 — Test failover (DR drill, non-disruptive)
Runs a full failover to Site B while leaving Site A VMs running. Use this for DR drills.
curl -s -X POST \
http://site-a-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/action \
-H "X-Auth-Token: $TOKEN" \
-H "Content-Type: application/json" \
-H "OpenStack-API-Version: protector 1.1" \
-d '{
"failover": {
"network_mapping": {
"aaa11111-net-primary-web": "bbb22222-net-secondary-web"
},
"retain_primary": true
}
}'
When retain_primary is true, Protector creates test instances on Site B from the latest replicated snapshot without touching Site A. After validation, clean up test instances with the test_cleanup action:
curl -s -X POST \
http://site-a-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/action \
-H "X-Auth-Token: $TOKEN" \
-H "Content-Type: application/json" \
-H "OpenStack-API-Version: protector 1.1" \
-d '{"test_cleanup": {}}'
Example 4 — Failback with reverse replication
Site A has recovered. Fail back from Site B and re-establish replication so Site A is the protected copy again.
curl -s -X POST \
http://site-b-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/action \
-H "X-Auth-Token: $TOKEN" \
-H "Content-Type: application/json" \
-H "OpenStack-API-Version: protector 1.1" \
-d '{
"failback": {
"network_mapping": {
"bbb22222-net-secondary-web": "aaa11111-net-primary-web",
"bbb22222-net-secondary-db": "aaa11111-net-primary-db"
},
"reverse_replication": true,
"force": false
}
}'
Example 5 — Force consistency group sync
Manually trigger a snapshot and sync of the Consistency Group without a full failover:
curl -s -X POST \
http://site-a-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/consistency-group/sync \
-H "X-Auth-Token: $TOKEN" \
-H "Content-Type: application/json" \
-H "OpenStack-API-Version: protector 1.1"
Expected response:
{
"operation": {
"id": "op-999def00-...",
"operation_type": "sync_volumes",
"status": "running",
"progress": 0
}
}
Example 6 — Poll operation progress
curl -s \
http://site-a-controller:8788/v1/$TENANT_ID/operations/$OP_ID \
-H "X-Auth-Token: $TOKEN" \
-H "OpenStack-API-Version: protector 1.1"
Expected response (mid-operation):
{
"operation": {
"id": "op-456abc78-...",
"operation_type": "failover",
"status": "running",
"progress": 55,
"steps_completed": [
"validate_target_site",
"get_latest_snapshot",
"promote_volumes_on_flasharray"
],
"steps_failed": [],
"error_message": null,
"started_at": "2025-11-03T10:00:00Z",
"completed_at": null
}
}
Expected response (completed):
{
"operation": {
"id": "op-456abc78-...",
"operation_type": "failover",
"status": "completed",
"progress": 100,
"steps_completed": [
"validate_target_site",
"get_latest_snapshot",
"promote_volumes_on_flasharray",
"manage_volumes_into_cinder",
"recreate_instances",
"update_protection_group_state"
],
"steps_failed": [],
"error_message": null,
"started_at": "2025-11-03T10:00:00Z",
"completed_at": "2025-11-03T10:07:43Z",
"result_data": {
"instances_created": 3,
"volumes_managed": 5,
"current_primary_site": "site-b"
}
}
}
Using the OSC CLI instead of cURL
The protectorclient plugin provides named subcommands that map to the same API calls:
# Planned failover
openstack protector protection-group failover prod-web-app \
--network-mapping net-primary-web=net-secondary-web \
--network-mapping net-primary-db=net-secondary-db \
--flavor-mapping m1.large=m2.large
# Unplanned failover
openstack protector protection-group failover prod-web-app \
--type unplanned \
--network-mapping net-primary-web=net-secondary-web
# Test failover
openstack protector protection-group test-failover prod-web-app \
--retain-primary \
--network-mapping net-primary-web=net-secondary-web
# Failback
openstack protector protection-group failback prod-web-app \
--reverse-replication \
--network-mapping net-secondary-web=net-primary-web
# Monitor an operation
openstack protector operation show op-456abc78-...
# Watch progress in a loop
watch -n 5 openstack protector operation show op-456abc78-...
# List all recent operations
openstack protector operation list
Use the following patterns to diagnose failed or stuck DR operations. Check the operation's error_message field and steps_failed array first — they often identify the exact failure point.
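That first check can be scripted. The sketch below assumes the operation record was saved to a file (e.g. `curl ... > op.json`) and uses python3 for JSON parsing, since jq may not be installed on the controller:

```shell
# Print the failure point of a saved operation record: status, error
# message, and any failed steps.
triage_operation() {
  python3 - "$1" <<'PY'
import json, sys
with open(sys.argv[1]) as f:
    op = json.load(f)["operation"]
print("status:", op["status"])
if op.get("error_message"):
    print("error:", op["error_message"])
for step in op.get("steps_failed", []):
    print("failed step:", step)
PY
}
```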
Symptom: Operation rejected with Cannot modify protection group — remote site unreachable
Cause: The peer site's Protector API is not responding. Protector blocks all state-changing operations by default to prevent metadata divergence between sites.
Fix:
- Verify network connectivity to the peer site's API endpoint: `curl http://peer-site-controller:8788/`
- Confirm `protector-api` is running on the peer site: `systemctl status protector-api`
- If this is a genuine unplanned outage (not a connectivity blip), use `"force": true` in the failover request body to override the check. Do not use `force` if both sites are healthy.
- After the peer site recovers, check sync status and run a forced metadata sync before resuming normal operations.
Symptom: Operation fails at the manage_volumes_into_cinder step with a 403 or policy error
Cause: The Cinder policy on the target site does not allow the member role to call volume_manage or volume_unmanage.
Fix: Add the following to the target site's Cinder policy file (/etc/cinder/policy.yaml):
"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"
For Kolla-Ansible deployments, update /etc/kolla/config/cinder/policy.yaml and run:
kolla-ansible -i inventory reconfigure -t cinder
Then retry the operation.
Symptom: Operation fails at get_latest_snapshot or promote_volumes_on_flasharray with a storage error
Cause: The replication policy credentials are incorrect, the FlashArray on the target site is unreachable, or no replicated snapshot exists yet (replication may never have run).
Fix:
- Confirm the replication policy is configured: `GET /v1/{tenant_id}/protection-groups/{pg_id}/policy`
- Verify the FlashArray management IP is reachable from the controller node: `curl -k https://flasharray-b.example.com`
- Check that at least one replication cycle has completed. For async replication, the interval is set by `replication_interval` in the policy (default: 300 seconds). Force an immediate sync with `POST /v1/{tenant_id}/protection-groups/{pg_id}/consistency-group/sync` and wait for it to complete before retrying the failover.
- If credentials have changed, update the replication policy with `POST /v1/{tenant_id}/protection-groups/{pg_id}/policy`.
Symptom: Operation status stuck at running with no progress change for more than 15 minutes
Cause: The protector-engine process on the active site may have died or lost its database connection mid-operation.
Fix:
- Check engine health: `systemctl status protector-engine` and `journalctl -u protector-engine -n 100`
- If the engine is stopped, restart it: `systemctl restart protector-engine`. The engine will pick up pending operations from the database on startup.
- If the operation remains stuck after an engine restart, check the `steps_completed` field to see how far it progressed. Depending on the step, you may need to manually clean up partial resources (orphaned volumes or instances on the target site) before retrying.
Symptom: Failback rejected with Protection Group is not in failed_over state
Cause: The Protection Group status is not failed_over. Failback is only valid from this state.
Fix: Run GET /v1/{tenant_id}/protection-groups/{pg_id} and inspect status and current_primary_site_id. If the group shows active and current_primary_site_id points to what you consider the secondary, a previous failover may not have completed or may have rolled back. Review operation history with GET /v1/{tenant_id}/operations filtered by the Protection Group ID.
Symptom: After unplanned failover, metadata sync shows OUT OF SYNC and version mismatch
Cause: During the unplanned failover, Protector could not reach the failed primary site to push the updated metadata. This is expected behavior.
Fix: Once the failed site recovers, force a metadata sync from the currently active site:
openstack protector protection-group sync-force prod-web-app
Verify both sites are at the same version before attempting any further modifications or operations:
openstack protector protection-group sync-status prod-web-app
Symptom: Test cleanup leaves orphaned volumes or instances on the target site
Cause: The test_cleanup operation failed partway through, or was never run after a test failover.
Fix: Re-issue the test_cleanup action. Protector tracks which resources were created during the test failover in the operation's result_data field and will attempt to remove them:
curl -s -X POST \
http://site-a-controller:8788/v1/$TENANT_ID/protection-groups/$PG_ID/action \
-H "X-Auth-Token: $TOKEN" \
-H "Content-Type: application/json" \
-H "OpenStack-API-Version: protector 1.1" \
-d '{"test_cleanup": {}}'
If cleanup continues to fail, identify the resources from the test failover operation's result_data and remove them manually via the Nova and Cinder APIs on the target site.