Site Topology
Two-site symmetric topology with independent service stacks per site
Trilio Site Recovery for OpenStack uses a two-site symmetric topology: a primary site and a secondary (DR) site, each running a fully independent OpenStack control plane. This page explains how those two sites are structured, how the Protector service components are distributed across them, and why the architecture is deliberately symmetric rather than hub-and-spoke. Understanding this topology is essential before you register sites, create Protection Groups, or execute any DR operation — every workflow assumption in this documentation flows from it.
Before working with the two-site topology, confirm the following:
- Two independent OpenStack deployments — each with its own Nova, Cinder, Neutron, and Keystone endpoints. These can be separate physical clusters or separate regions within the same cluster.
- Trilio Protector services deployed on both sites — `protector-api` and `protector-engine` must be installed, configured, and reachable on each site independently. See the Deployment guide for installation steps.
- Network reachability between sites — each site's Keystone and Protector API endpoint (default port `8788`) must be reachable from the operator workstation running the OSC CLI plugin (`protectorclient`). The sites do not need to reach each other's Protector API directly.
- Pure Storage FlashArray replication configured — both arrays must have an async or sync replication connection established at the storage layer before Protector can manage Protection Groups.
- Matching Cinder volume types on both sites — each site must have a volume type with `replication_enabled='<is> True'` and a `replication_type` property set to `async` or `sync`. The volume type name does not need to match, but the replication properties and the backing Pure Storage driver must align.
- OpenStack Victoria or later on both sites.
- Python 3.8+ on the workstation where you run the OSC CLI plugin.
- `clouds.yaml` configured with credentials for both sites — the CLI plugin authenticates to both sites independently and requires named cloud entries for each.
The Protector service stack is installed identically on each site. There is no "primary" or "secondary" installation variant — both sites run the same software. Repeat every step below on both Site A and Site B.
1. Create the Protector database
mysql -u root -p << EOF
CREATE DATABASE protector CHARACTER SET utf8;
GRANT ALL PRIVILEGES ON protector.* TO 'protector'@'localhost' IDENTIFIED BY 'PROTECTOR_DBPASS';
GRANT ALL PRIVILEGES ON protector.* TO 'protector'@'%' IDENTIFIED BY 'PROTECTOR_DBPASS';
FLUSH PRIVILEGES;
EOF
2. Create the Protector service user in Keystone
source ~/admin-openrc
openstack user create --domain default --password-prompt protector
openstack role add --project service --user protector admin
openstack service create --name protector \
--description "OpenStack Disaster Recovery Service" protector
openstack endpoint create --region RegionOne \
protector public http://controller:8788/v1/%\(tenant_id\)s
openstack endpoint create --region RegionOne \
protector internal http://controller:8788/v1/%\(tenant_id\)s
openstack endpoint create --region RegionOne \
protector admin http://controller:8788/v1/%\(tenant_id\)s
The Protector service user requires the admin role in the service project. This is the standard OpenStack service account pattern and is required for Keystone trust operations.
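You can confirm the service account and the registered endpoints with standard Keystone queries. A quick check after step 2:

```shell
# Verify the protector user holds the admin role in the service project,
# and that all three endpoint interfaces were registered.
openstack role assignment list --user protector --project service --names
openstack endpoint list --service protector
```

The role assignment listing should show `admin` on `service@Default`, and the endpoint listing should show public, internal, and admin interfaces on port 8788.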
3. Install the Protector package
git clone https://github.com/your-org/openstack-protector.git
cd openstack-protector
pip install -r requirements.txt
python setup.py install
4. Create the operating system user and directories
useradd --system --shell /bin/false protector
mkdir -p /var/log/protector
mkdir -p /var/lib/protector
mkdir -p /etc/protector
chown -R protector:protector /var/log/protector
chown -R protector:protector /var/lib/protector
chown -R protector:protector /etc/protector
5. Initialize the database schema
`protector-manage` reads the `[database]` connection string from `/etc/protector/protector.conf`, so create that file (see the configuration section below) before running this step:
protector-manage db sync
6. Install systemd unit files
Create /etc/systemd/system/protector-api.service:
[Unit]
Description=OpenStack Protector API Service
After=network.target
[Service]
Type=simple
User=protector
Group=protector
ExecStart=/usr/local/bin/protector-api --config-file /etc/protector/protector.conf
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Create /etc/systemd/system/protector-engine.service:
[Unit]
Description=OpenStack Protector Engine Service
After=network.target
[Service]
Type=simple
User=protector
Group=protector
ExecStart=/usr/local/bin/protector-engine --config-file /etc/protector/protector.conf
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
7. Enable and start services
systemctl daemon-reload
systemctl enable protector-api protector-engine
systemctl start protector-api protector-engine
systemctl status protector-api
systemctl status protector-engine
Repeat steps 1–7 on the second site before proceeding to configuration.
Each site has its own /etc/protector/protector.conf. The configuration is site-local — there is no shared configuration file between sites.
Core configuration file (/etc/protector/protector.conf)
[DEFAULT]
debug = False
log_dir = /var/log/protector
state_path = /var/lib/protector
[api]
bind_host = 0.0.0.0
bind_port = 8788
workers = 4
[database]
connection = mysql+pymysql://protector:PROTECTOR_DBPASS@controller/protector
[keystone_authtoken]
www_authenticate_uri = http://controller:5000
auth_url = http://controller:5000
memcached_servers = controller:11211
auth_type = password
project_domain_name = Default
user_domain_name = Default
project_name = service
username = protector
password = PROTECTOR_PASS
[service_credentials]
default_trust_roles = member,_member_
[oslo_policy]
policy_file = /etc/protector/policy.yaml
Key options and their effect:
| Option | Section | Default | Effect |
|---|---|---|---|
| `bind_port` | `[api]` | `8788` | Port the `protector-api` process listens on. Must be reachable from operator workstations running the CLI plugin; the peer site's services do not need to reach it directly. |
| `workers` | `[api]` | `4` | Number of API worker processes. Increase on sites with high request concurrency. |
| `connection` | `[database]` | — | SQLAlchemy DSN for the site-local MariaDB/MySQL database. Each site has its own database; there is no shared or replicated database between sites. |
| `default_trust_roles` | `[service_credentials]` | `member,_member_` | Keystone roles delegated to the Protector service via trusts. These roles must exist in the local Keystone and must be sufficient (with any required policy overrides) for Cinder volume manage/unmanage and Nova instance operations. |
| `debug` | `[DEFAULT]` | `False` | Set to `True` to enable verbose logging. Do not use in production; log output includes API request bodies. |
Multi-site clouds.yaml (operator workstation)
The OSC CLI plugin (protectorclient) authenticates to both sites independently. Configure a named entry for each site:
clouds:
site-a:
auth:
auth_url: http://site-a-controller:5000/v3
project_name: admin
username: admin
password: password
user_domain_name: Default
project_domain_name: Default
region_name: RegionOne
site-b:
auth:
auth_url: http://site-b-controller:5000/v3
project_name: admin
username: admin
password: password
user_domain_name: Default
project_domain_name: Default
region_name: RegionOne
Save this file to ~/.config/openstack/clouds.yaml. The cloud names (site-a, site-b) are referenced in CLI commands with --os-cloud.
Cinder policy overrides (both sites)
The Protector service manages and unmanages Cinder volumes during failover using the tenant's delegated credentials. Cinder restricts these operations to admin by default. Add the following to /etc/cinder/policy.yaml on both sites:
# Required for DR failover: import volumes from replicated snapshots
"volume_extension:volume_manage": "rule:admin_or_owner"
# Required for failback: release volumes before cleanup
"volume_extension:volume_unmanage": "rule:admin_or_owner"
# Required to discover the correct Cinder volume service host
"volume_extension:services:index": "rule:admin_or_owner"
For Kolla-Ansible deployments, write these to /etc/kolla/config/cinder/policy.yaml and reconfigure:
kolla-ansible -i inventory reconfigure -t cinder
Metadata sync behavior
Protector enforces strict metadata consistency: any modification to a Protection Group (adding members, updating policy, initiating failover) requires the peer site to be reachable at the time of the operation. If the peer site is unreachable, the operation is blocked. This prevents the two sites from diverging to an irreconcilable state. There is no configuration knob to relax this constraint for planned operations — it is a design invariant of the system.
The two-site topology is transparent during most day-to-day operations, but you need to understand two things about how site roles work before you start:
Site roles are workload-relative and dynamic. When you create a Protection Group, you designate a primary site and a secondary site for that group. Those designations track where the workload is running — they are not fixed infrastructure roles. After a failover, the secondary site becomes the effective primary for that group. The current_primary_site_id field on the Protection Group reflects this live state.
The CLI plugin, not the service, coordinates across sites. The protector-api and protector-engine processes on each site have no direct communication channel to their peers on the other site. The OSC CLI plugin (protectorclient) and Horizon dashboard are the coordination layer — they authenticate independently to both sites and push metadata between them. This means your operator workstation must be able to reach both sites' Keystone and Protector API endpoints simultaneously.
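Because the workstation is the coordination layer, a useful preflight is to confirm TCP reachability of both sites' Keystone (5000) and Protector API (8788) ports before running any cross-site operation. A minimal bash sketch, assuming the hypothetical hostnames site-a-controller and site-b-controller:

```shell
#!/usr/bin/env bash
# Probe a TCP port using bash's /dev/tcp; returns 0 on a successful connect.
reachable() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Hypothetical controller hostnames -- substitute your own.
for host in site-a-controller site-b-controller; do
  for port in 5000 8788; do
    if reachable "$host" "$port"; then
      echo "OK   $host:$port"
    else
      echo "FAIL $host:$port"
    fi
  done
done
```

All four probes must succeed from the workstation; a failure here will surface later as an `unreachable` site status or a blocked Protection Group modification.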
Registering sites
Before creating any Protection Groups, register both sites with the Protector service. Registration is a one-time admin operation:
# Authenticate to Site A
export OS_CLOUD=site-a
# Register Site A
openstack protector site create \
--name site-a \
--description "Primary datacenter - Boston" \
--site-type primary \
--auth-url http://10.0.1.10:5000/v3 \
--region-name RegionOne
# Register Site B
openstack protector site create \
--name site-b \
--description "Secondary datacenter - Seattle" \
--site-type secondary \
--auth-url http://10.0.2.10:5000/v3 \
--region-name RegionOne
Validate connectivity to both sites before proceeding:
openstack protector site validate site-a
openstack protector site validate site-b
Checking site reachability and sync state
The status field on a registered site reflects Protector's last-known connectivity state: active, unreachable, or error. You can also check the metadata sync state of a Protection Group at any time:
openstack protector protection-group sync-status prod-web-app
If the remote site is behind by one or more metadata versions (for example, after recovering from an outage), force a sync before making changes:
openstack protector protection-group sync-force prod-web-app
Tenant mapping
Because each site has its own Keystone, project UUIDs differ between sites. Create a tenant mapping once per tenant pair so the Protector service can associate the correct project context across both sites:
openstack dr tenant mapping create \
--local-tenant <local-project-id> \
--remote-site <remote-site-id> \
--remote-tenant <remote-project-id> \
--description "Production tenant mapping"
This mapping syncs automatically to the remote site after creation.
Mock storage driver for lab environments
If you do not have physical Pure FlashArray hardware, the Mock storage driver simulates FlashArray behavior using a local SQLite database. It supports the full DR workflow — Protection Group creation, replication policy, failover, failback, and test failover — without requiring real arrays. Use this driver in development and CI environments. To enable it, set the storage driver to mock in protector.conf on both sites. The mock driver state is local to each site's SQLite file, so cross-site replication simulation is handled by the driver itself.
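A sketch of what enabling the mock driver might look like in `protector.conf` on each site. The option name `storage_driver` and its section are assumptions for illustration; confirm the exact option against your release's configuration reference:

```ini
# Hypothetical option name -- check your protector.conf reference.
[DEFAULT]
storage_driver = mock
```

Restart `protector-api` and `protector-engine` after changing the driver so the new backend is loaded.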
Example 1: Verify both sites are registered and active
export OS_CLOUD=site-a
openstack protector site list
Expected output:
+--------------------------------------+--------+----------+------------------------------+---------+
| id | name | type | auth_url | status |
+--------------------------------------+--------+----------+------------------------------+---------+
| 1a2b3c4d-0000-0000-0000-000000000001 | site-a | primary | http://10.0.1.10:5000/v3 | active |
| 1a2b3c4d-0000-0000-0000-000000000002 | site-b | secondary| http://10.0.2.10:5000/v3 | active |
+--------------------------------------+--------+----------+------------------------------+---------+
Both sites must show active before creating Protection Groups. If either shows unreachable, run openstack protector site validate <name> to diagnose.
Example 2: Confirm replication-eligible volume types exist on both sites
Run on Site A:
export OS_CLOUD=site-a
openstack protector site list-volume-types site-a
Run on Site B:
export OS_CLOUD=site-b
openstack protector site list-volume-types site-b
Expected output (both sites):
+--------------------------------------+----------------+--------------------+------------------+
| id | name | replication_enabled| replication_type |
+--------------------------------------+----------------+--------------------+------------------+
| abcd1234-0000-0000-0000-000000000010 | replicated-ssd | <is> True | <in> async |
+--------------------------------------+----------------+--------------------+------------------+
A volume type that does not appear in this list — either because replication_enabled is absent or set to False, or because the backend does not support the configured replication_type — cannot be used in a Protection Group.
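If a site is missing an eligible volume type, one can be created with standard Cinder commands. This sketch assumes a Pure-backed Cinder backend exposed under the hypothetical `volume_backend_name` of `pure-1`; adjust names to your deployment and repeat on both sites:

```shell
# Create a volume type and mark it replication-capable.
openstack volume type create replicated-ssd
openstack volume type set replicated-ssd \
  --property volume_backend_name='pure-1' \
  --property replication_enabled='<is> True' \
  --property replication_type='<in> async'

# Confirm the extra specs took effect.
openstack volume type show replicated-ssd -f json | jq '.extra_specs'
```

After this, the type should appear in `openstack protector site list-volume-types` for that site.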
Example 3: Create and verify a Protection Group with metadata in sync on both sites
export OS_CLOUD=site-a
openstack protector protection-group create \
--name prod-web-app \
--description "Production web application" \
--replication-type async \
--primary-site site-a \
--secondary-site site-b \
--volume-type replicated-ssd
Expected output:
+------------------------+--------------------------------------+
| Field | Value |
+------------------------+--------------------------------------+
| id | pg-12345678-1234-1234-1234-12345678 |
| name | prod-web-app |
| status | active |
| consistency_group_id | cg-87654321-4321-4321-4321-87654321 |
| primary_site | site-a |
| secondary_site | site-b |
| current_primary_site | site-a |
| failover_count | 0 |
+------------------------+--------------------------------------+
Now verify that the metadata was pushed to Site B:
export OS_CLOUD=site-a
openstack protector protection-group sync-status prod-web-app
Expected output:
Sync Status: ✅ IN SYNC
Local Metadata:
Version: 1
Current Site: Site A
Last Modified: 2025-11-03T09:00:00Z
Remote Sync:
Status: SYNCED
Remote Version: 1
Last Sync: 2025-11-03T09:00:03Z (3 seconds ago)
Validation:
✅ Versions match (1 = 1)
✅ Sync status is 'synced'
✅ Last sync is recent
Example 4: Observe a blocked modification when the peer site is unreachable
If the remote site goes down while workloads are running on the current primary, any attempt to modify the Protection Group is blocked:
openstack protector protection-group member-add prod-web-app \
--instance-id <new-vm-uuid>
Expected error:
Error: Cannot modify protection group - remote site unreachable.
Reason: Modification requires both sites to be reachable to prevent metadata divergence.
Action: Wait for the remote site to recover, then run:
openstack protector protection-group sync-force prod-web-app
Once the remote site recovers, force a sync and retry:
openstack protector protection-group sync-force prod-web-app
openstack protector protection-group member-add prod-web-app \
--instance-id <new-vm-uuid>
Issue: openstack protector site validate returns unreachable for one site
Symptom: The validate command completes without error but the site status is unreachable, or it returns a connection timeout.
Likely cause: The Protector API on the target site is not running, or the API endpoint is not reachable from the operator workstation on port 8788.
Fix:
- SSH to the target site controller and verify the service is running: `systemctl status protector-api`
- Confirm the port is listening: `ss -tlnp | grep 8788`
- Verify firewall rules permit inbound TCP on port `8788` from your workstation IP.
- Confirm the `auth_url` registered for the site matches the actual Keystone endpoint: `openstack protector site show <name>`.
Issue: Protection Group creation fails with "volume type does not support replication"
Symptom: openstack protector protection-group create returns an error stating the volume type is not eligible for protection.
Likely cause: The Cinder volume type is missing the required extra specs on one or both sites.
Fix: On each site, verify the volume type has both required properties:
openstack volume type show replicated-ssd -f json | jq '.extra_specs'
Expected:
{
"volume_backend_name": "pure@backend-a",
"replication_enabled": "<is> True",
"replication_type": "<in> async"
}
If either property is missing, set it:
openstack volume type set replicated-ssd \
--property replication_enabled='<is> True' \
--property replication_type='<in> async'
Repeat on both sites.
Issue: Metadata sync is stuck in OUT OF SYNC after the peer site recovers
Symptom: openstack protector protection-group sync-status shows Remote Sync Status: UNREACHABLE or FAILED even though the peer site is now reachable.
Likely cause: The sync status is not updated automatically after recovery — a manual trigger is required.
Fix:
openstack protector protection-group sync-force prod-web-app
Verify the result:
openstack protector protection-group sync-status prod-web-app
Both sites should now show the same version number and status SYNCED.
Issue: Cinder volume_manage fails during failover with a 403 Forbidden error
Symptom: A failover operation reaches the storage phase (progress ~20–60%) and then fails. The operation error message includes Policy doesn't allow volume_extension:volume_manage to be performed.
Likely cause: The Cinder policy on the secondary site has not been updated to allow the member role to manage volumes.
Fix: Add the required policy overrides to /etc/cinder/policy.yaml on the secondary site (the failover target):
"volume_extension:volume_manage": "rule:admin_or_owner"
"volume_extension:volume_unmanage": "rule:admin_or_owner"
"volume_extension:services:index": "rule:admin_or_owner"
Restart the Cinder API and volume services, or reconfigure via Kolla-Ansible:
kolla-ansible -i inventory reconfigure -t cinder
Issue: clouds.yaml entries cause authentication errors for one site
Symptom: CLI commands targeting one site succeed, but the same commands targeting the other site return 401 Unauthorized or Could not find project.
Likely cause: The project_name, username, or user_domain_name in the clouds.yaml entry for the affected site does not match what that site's Keystone expects. Because both sites have independent Keystones, credentials and domain names can differ.
Fix: Test each cloud entry independently:
OS_CLOUD=site-a openstack token issue
OS_CLOUD=site-b openstack token issue
For any site that fails, inspect the matching entry in ~/.config/openstack/clouds.yaml and correct the credentials against that site's Keystone configuration. Ensure auth_url points to that site's Keystone v3 endpoint specifically.