Site Recovery for OpenStack
Runbook

Troubleshooting

Common failure modes and remediation steps


Objective

This runbook identifies the most common failure modes in Trilio Site Recovery for OpenStack and provides step-by-step remediation procedures; use it whenever a DR operation (protection group management, failover, failback, or site-level failover) behaves unexpectedly or services fail to respond.


Scope

This runbook covers:

  • Trilio Site Recovery service failures (protector-api, protector-engine)
  • Database connectivity and schema issues
  • Keystone authentication failures for both local and cross-site service accounts
  • Metadata sync failures and out-of-sync protection groups
  • Volume type misconfiguration blocking protection group creation
  • Failover and failback operation failures (per-PG and site-level)
  • Network connectivity problems between sites
  • Mock storage driver issues during testing

This runbook does not cover:

  • Pure Storage FlashArray hardware faults or firmware upgrades
  • OpenStack Nova, Cinder, or Neutron service failures unrelated to Trilio Site Recovery
  • Keystone federation setup or SAML/OIDC configuration
  • RabbitMQ cluster administration beyond basic connectivity checks
  • Data recovery after a failed unplanned failover where volumes are corrupted

Prerequisites

Before using this runbook, confirm you have:

  • Access to both sites: SSH access to the controller nodes on both the primary and secondary OpenStack clouds, and valid admin credentials (admin-openrc or equivalent) for each site's Keystone.
  • OSC CLI plugin installed: The protectorclient plugin must be installed and your ~/.config/openstack/clouds.yaml must define both site-a and site-b cloud entries.
  • Database access: Ability to run queries against the MariaDB/MySQL protector database on both sites (the protector DB user or root).
  • Log access: Read access to /var/log/protector/protector-api.log and /var/log/protector/protector-engine.log on both controllers.
  • dr_site_admin role (for site-level failover troubleshooting): Your user must hold the dr_site_admin role in addition to admin.
  • Network reachability: The workstation or jump host running CLI commands must be able to reach port 5000 (Keystone), 8774 (Nova), 8776 (Cinder), and 8788 (Protector API) on both sites.
  • No in-progress DR operations should be running unless you are specifically troubleshooting a stuck operation.
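
The network-reachability prerequisite can be checked in one pass with a small bash loop. This is a sketch: `check_port` is a helper name introduced here, the hostnames are placeholders for your actual controllers, and the `/dev/tcp` redirection requires bash (not POSIX sh).

```shell
#!/usr/bin/env bash
# Probe one TCP port; prints OPEN or CLOSED.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OPEN"
  else
    echo "CLOSED"
  fi
}

# Placeholder hostname -- substitute your own controllers and
# repeat for each site.
for port in 5000 8774 8776 8788; do
  printf '%-5s %s\n' "$port" "$(check_port site-b-controller "$port")"
done
```

Any CLOSED result points at the firewall or routing checks in Section 7.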

Steps

Work through the sections below in order. Each section addresses a distinct failure domain. Stop at the section that matches your symptom and follow the numbered steps within it.


1. Protector Services Not Responding

Symptom: openstack protector protection-group list returns a connection error or HTTP 503.

  1. Check service status on the affected site:

    systemctl status protector-api
    systemctl status protector-engine
    

    A healthy service shows active (running). If either shows failed or inactive, proceed to step 2.

  2. Review recent journal entries for the failed service:

    journalctl -u protector-api -n 100 --no-pager
    journalctl -u protector-engine -n 100 --no-pager
    

    Look for Python tracebacks, ImportError, or database connection strings in the output.

  3. Check whether the API port is bound:

    netstat -tlnp | grep 8788
    

    If no output, the API process did not start successfully. Confirm the config file path is correct:

    grep ExecStart /etc/systemd/system/protector-api.service
    
  4. Verify the configuration file is parseable:

    protector-manage config --config-file /etc/protector/protector.conf
    

    Correct any syntax errors reported.

  5. Restart the services:

    systemctl restart protector-api
    systemctl restart protector-engine
    
  6. Confirm the health endpoint responds:

    curl -s http://controller:8788/
    

    A JSON version document confirms the API is up.
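
Immediately after a restart the API may take a few seconds to bind, so a one-shot curl can report a false failure. The retry wrapper below is a sketch; `wait_for_api` is a helper name introduced here, and the URL and retry budget are illustrative.

```shell
#!/usr/bin/env bash
# Retry the Protector API health endpoint before declaring it down.
wait_for_api() {
  local url=$1 tries=${2:-5}
  local i
  for ((i = 1; i <= tries; i++)); do
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo "up"
      return 0
    fi
    (( i < tries )) && sleep 2
  done
  echo "down"
  return 1
}

# Usage: wait_for_api http://controller:8788/ 10
```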


2. Database Connection Errors

Symptom: Service logs contain OperationalError, Can't connect to MySQL server, or Access denied for user 'protector'.

  1. Verify the database is reachable from the controller:

    mysql -h controller -u protector -p protector -e "SELECT 1;"
    

    If this fails, the database host, port, or credentials in protector.conf are incorrect.

  2. Check the connection string in the configuration file:

    grep ^connection /etc/protector/protector.conf
    

    The value must match the format mysql+pymysql://protector:PROTECTOR_DBPASS@<host>/protector.

  3. Ensure the schema is current (run after upgrades):

    protector-manage db sync
    

    Re-run this if the log shows Table '...' doesn't exist.

  4. If using a Galera cluster on the remote site and the remote Protector returns HTTP 500 during metadata sync, check whether the Galera node needs bootstrapping:

    # On the remote DB node
    mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"
    

    A status of Non-Primary indicates a split-brain condition. Resolve the Galera cluster state before retrying Protector operations.
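
Two of the checks above lend themselves to small helper functions. The sketch below assumes bash on the controller and the tab-separated output that `mysql -e` produces; `validate_conn` and `galera_primary` are helper names introduced here, and the connection-string regex is a rough format check only (it will reject unusual but valid passwords containing `@` or `:`).

```shell
#!/usr/bin/env bash
# Rough format check for the protector.conf connection string (step 2).
validate_conn() {
  local re='^mysql\+pymysql://[^:/@]+:[^@]+@[^/]+/protector$'
  if [[ $1 =~ $re ]]; then echo "OK"; else echo "MALFORMED"; fi
}

# Succeeds only if the Galera node reports Primary (step 4).
# Expects the output of:
#   mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"
galera_primary() {
  awk -F'\t' '$1 == "wsrep_cluster_status" { s = $2 }
              END { exit (s == "Primary" ? 0 : 1) }'
}

validate_conn "mysql+pymysql://protector:s3cret@controller/protector"  # -> OK
validate_conn "mysql://protector@controller/protector"                 # -> MALFORMED
```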


3. Keystone Authentication Failures

Symptom: API returns HTTP 401 or keystonemiddleware errors appear in protector-api.log.

3a. Local Keystone (user-facing authentication)

  1. Verify the service user exists:

    openstack user show protector
    
  2. Confirm the admin role is assigned to the service project:

    openstack role assignment list --user protector --project service --names
    
  3. Test token issuance with the service account:

    openstack --os-auth-url http://controller:5000/v3 \
      --os-username protector \
      --os-password <PROTECTOR_PASS> \
      --os-project-name service \
      token issue
    
  4. Check [keystone_authtoken] in /etc/protector/protector.conf — the auth_url, username, password, project_name, and domain fields must match the values above.

3b. Cross-site service account authentication

Protector uses dedicated service credentials (not user tokens) to authenticate to the remote site. Failures here block metadata sync and failover.

  1. Test connectivity to the remote Keystone:

    curl -k https://remote-keystone:5000/v3
    

    A JSON response confirms basic reachability. A Connection refused or timeout indicates a network or firewall issue — see Section 7.

  2. Test the remote service account credentials:

    openstack --os-auth-url https://remote-keystone:5000 \
      --os-username protector-service \
      --os-password <password> \
      --os-project-name service \
      token issue
    
  3. If authentication fails, verify the service account exists on the remote site and holds the admin role on the service project:

    # Run on the remote site
    openstack user show protector-service
    openstack role assignment list --user protector-service --project service --names
    
  4. Verify the site record in the database has correct credentials:

    mysql -u protector -p protector \
      -e "SELECT name, auth_url, service_username FROM sites;"
    

    Update incorrect values directly if needed:

    mysql -u protector -p protector \
      -e "UPDATE sites SET service_password='<new-password>' WHERE name='cluster2-secondary';"
    

    Then restart protector-engine for the change to take effect.


4. Metadata Sync Failures

Symptom: openstack protector protection-group sync-status <pg-name> reports OUT OF SYNC, FAILED, or UNREACHABLE; modification commands are blocked with "Cannot modify protection group — remote site unreachable".

  1. Check current sync status:

    openstack protector protection-group sync-status <pg-name>
    

    Note the local version, remote version, and last sync timestamp.

  2. Determine whether the remote site is reachable:

    # Replace with actual remote Protector API endpoint
    curl -k https://remote-site-controller:8788/
    

    If unreachable, resolve the connectivity issue (see Section 7) before proceeding. Metadata sync is deliberately blocked while the remote site is down so the two copies cannot diverge.

  3. Once the remote site is back online, force a sync:

    openstack protector protection-group sync-force <pg-name>
    

    Expected output confirms both sites reach the same version number.

  4. Verify sync succeeded:

    openstack protector protection-group sync-status <pg-name>
    

    Both Local Version and Remote Version should match, and Remote Sync Status should read SYNCED.

  5. Review the sync audit trail if you need to understand what diverged:

    openstack protector protection-group sync-log <pg-name> --limit 20
    
  6. If sync-force itself fails, check the protector-engine.log on both sites for the underlying error. Common causes are expired service credentials (Section 3b) or a Galera cluster issue (Section 2, step 4).
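
Steps 3 and 4 can be combined into a small polling wrapper. This is a sketch: `is_synced` and `wait_for_sync` are helper names introduced here, and the grep pattern assumes the sync-status output contains the literal string SYNCED, as described in step 4.

```shell
#!/usr/bin/env bash
# Succeeds if raw sync-status output (on stdin) reports SYNCED.
is_synced() {
  grep -q 'SYNCED'
}

# Poll a protection group's sync status until SYNCED or timeout.
wait_for_sync() {
  local pg=$1 tries=${2:-10}
  local i
  for ((i = 1; i <= tries; i++)); do
    if openstack protector protection-group sync-status "$pg" | is_synced; then
      echo "synced"
      return 0
    fi
    sleep 30
  done
  echo "timed out"
  return 1
}

# Usage: wait_for_sync <pg-name> 20
```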


5. Volume Type Misconfiguration

Symptom: Protection group creation fails with an error about volume types, or volumes are not being included in the Cinder Consistency Group.

For a volume type to be eligible for protection, it must carry the replication_enabled='<is> True' extra spec and a replication_type property.

  1. Inspect the volume type extra specs:

    openstack volume type show <volume-type-name>
    

    Look for replication_type and replication_enabled in the properties field.

  2. If properties are missing, add them:

    openstack volume type set <volume-type-name> \
      --property replication_enabled='<is> True' \
      --property replication_type='async'
    

    Use replication_type='sync' for synchronous (ActiveCluster/Pod) replication.

  3. Verify the volume_backend_name is a valid Cinder backend:

    openstack volume service list
    

    The value after @ in the Host column is the backend name. If volume_backend_name in the volume type does not match any of these, volume creation will fail with "No valid backend was found":

    openstack volume type set <volume-type-name> \
      --property volume_backend_name='<correct-backend-name>'
    
  4. Confirm the same volume type (by name and properties) exists on the secondary site:

    # Switch to secondary site credentials
    source ~/site-b-openrc
    openstack volume type show <volume-type-name>
    

    Create or update the type on the secondary site if it is missing or mismatched.
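
Step 4's cross-site comparison can be scripted with `comm`. This is a sketch: `list_types` and `missing_from` are helper names introduced here, and the `--os-cloud` names assume the site-a/site-b clouds.yaml entries from the prerequisites.

```shell
#!/usr/bin/env bash
# List volume type names for one site, sorted for comparison.
list_types() {
  openstack --os-cloud "$1" volume type list -f value -c Name | sort
}

# Print names present in the first file but absent from the second.
missing_from() {
  comm -23 "$1" "$2"
}

# Usage:
#   list_types site-a > /tmp/types-a
#   list_types site-b > /tmp/types-b
#   missing_from /tmp/types-a /tmp/types-b   # types to create on site-b
```

Name parity alone is not sufficient; still compare the extra specs with `openstack volume type show` as in step 1.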


6. Failover and Failback Operation Failures

Symptom: openstack protector protection-group failover or failback returns an error, or the resulting operation shows failed status.

6a. Per-protection-group failover

  1. Check operation status and error message:

    openstack protector operation list
    openstack protector operation show <operation-id>
    

    The error message in the operation record is the most precise starting point.

  2. Common error: replication not healthy. Verify replication status before retrying:

    openstack dr replication health <pg-id>
    

    If replication lag is high or the state is not healthy, wait for replication to catch up or investigate Pure Storage connectivity.

  3. Common error: resource mapping missing. Failover requires network and (optionally) flavor mappings so that VMs can be recreated on the secondary site. Confirm mappings are configured:

    openstack protector protection-group show <pg-name>
    

    If network or flavor mappings are absent, add them before retrying the failover.

  4. Common error: volume attachment timeout. This appears in the operation error detail. Check Cinder volume status on the secondary site:

    source ~/site-b-openrc
    openstack volume list --all-projects
    

    Volumes stuck in attaching state may need to be manually reset:

    openstack volume set --state available <volume-id>
    
  5. Review engine logs for the full traceback:

    tail -n 200 /var/log/protector/protector-engine.log | grep -A 20 "ERROR"
    
  6. Retry the operation after resolving the root cause:

    openstack protector protection-group failover <pg-name> --type planned
    

6b. Site-level failover (dr site failover) partial failures

  1. Show the site operation to identify which PGs failed:

    openstack dr site operation show <site-operation-id>
    

    The child_operations and error_summary fields list failed PGs with their error messages.

  2. Investigate each failed PG individually:

    openstack dr operation show <child-operation-id>
    
  3. Retry only the failed PGs after fixing the underlying issue:

    openstack dr failover <failed-pg-id> --failover-type planned
    
  4. If the operation is stuck in running, check whether the engine is still processing:

    tail -f /var/log/protector/protector-engine.log
    

    The site-level failover uses a thread pool with a maximum of 10 concurrent workers; a large number of PGs will complete in batches. Allow time proportional to the PG count before concluding the operation is stuck.

  5. Insufficient role error: If you receive "Site actions require dr_site_admin role", assign the role and retry:

    openstack role add --user <username> --project <project> dr_site_admin
    
  6. Executing from the wrong site: Planned site failovers must be initiated from the secondary (DR) site. If you are on the primary site and the command is blocked, either switch to the secondary site credentials or use --force:

    openstack dr site failover <site-id> --failover-type planned --force
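
As step 4 notes, the engine processes at most 10 PGs concurrently, so N protection groups complete in roughly ceil(N/10) batches. A quick way to size your waiting budget before concluding an operation is stuck (`batches` is a helper name introduced here):

```shell
#!/usr/bin/env bash
# Batches needed for N protection groups with 10 concurrent workers.
batches() {
  local pgs=$1 workers=10
  echo $(( (pgs + workers - 1) / workers ))
}

batches 25   # -> 3
batches 100  # -> 10
```

Multiply the batch count by your typical per-PG failover time to get a rough lower bound on total duration.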
    

7. Inter-Site Network Connectivity

Symptom: Metadata sync fails, cross-site authentication fails, or failover cannot reach secondary Cinder/Nova.

  1. Test each required port from the Protector engine host to the remote site:

    Remote Service   Port   Test command
    Keystone         5000   curl -k https://remote-controller:5000/v3
    Nova             8774   curl -k https://remote-controller:8774/
    Cinder           8776   curl -k https://remote-controller:8776/
    Protector API    8788   curl -k https://remote-controller:8788/
  2. Check firewall rules on both the source and destination hosts. Ensure the above ports are open for traffic originating from the Protector engine's IP address.

  3. Check SSL/TLS certificate errors in the engine log. If you see SSL: CERTIFICATE_VERIFY_FAILED, either install the remote site's CA certificate into the system trust store, or (for non-production environments only) set verify=False in the site client configuration.


8. Mock Storage Driver Issues

Symptom: End-to-end testing fails with errors referencing SQLite, missing Glance images, or mock storage paths.

  1. Verify mock mode is enabled on both sites:

    grep -E 'use_mock_cinder|use_mock_storage' /etc/protector/protector.conf
    

    Both use_mock_cinder = True and use_mock_storage = True must be set.

  2. Verify the mock storage directory exists and is writable:

    ls -ld /var/lib/protector/mock_storage
    

    If missing, create it:

    mkdir -p /var/lib/protector/mock_storage
    chown protector:protector /var/lib/protector/mock_storage
    
  3. Verify that matching Glance images exist on both sites. In mock mode, failover creates bootable volumes from Glance images to simulate replication. The image name must be identical on both clusters:

    # On Site A
    source ~/site-a-openrc && openstack image list
    # On Site B
    source ~/site-b-openrc && openstack image list
    

    If an image is missing on one site, upload it:

    openstack image create cirros \
      --disk-format qcow2 \
      --container-format bare \
      --public \
      --file cirros-0.5.2-x86_64-disk.img
    
  4. Verify python-glanceclient is installed on both controller nodes:

    python3 -c "import glanceclient; print(glanceclient.__version__)"
    

    Install it if missing:

    pip install python-glanceclient
    

Verification

After completing the relevant remediation steps, confirm resolution using the following checks:

  1. Services are running on both sites:

    systemctl is-active protector-api protector-engine
    

    Both return active.

  2. API health check passes on both sites:

    curl -s http://site-a-controller:8788/
    curl -s http://site-b-controller:8788/
    

    Both return a JSON version document.

  3. Protection groups are visible and in sync:

    openstack protector protection-group list
    openstack protector protection-group sync-status <pg-name>
    

    Sync status shows SYNCED, and the local and remote versions match.

  4. Cross-site service account authentication succeeds:

    openstack --os-auth-url https://remote-keystone:5000 \
      --os-username protector-service \
      --os-password <password> \
      --os-project-name service \
      token issue
    

    A token is returned without errors.

  5. Volume types have correct properties on both sites:

    openstack volume type show <replicated-volume-type>
    

    Output includes replication_enabled='<is> True' and a replication_type value.

  6. Any previously failed operation has been resolved or retried successfully:

    openstack protector operation show <operation-id>
    

    Status shows completed (not failed or running).

  7. For site-level failover, all child PGs report success:

    openstack dr site operation show <site-operation-id>
    

    completed_count equals total_protection_groups and failed_count is 0.


Rollback

The appropriate rollback depends on which step introduced a change. The following guidance covers the most common interventions in this runbook.

Service restarts (Section 1, step 5): If restarting protector-api or protector-engine caused a regression, revert any configuration change made to /etc/protector/protector.conf and restart again. Keep a backup before editing:

cp /etc/protector/protector.conf /etc/protector/protector.conf.bak

Database schema changes (Section 2, step 3): Alembic migrations (protector-manage db sync) are not automatically reversible. If a migration caused instability, restore from your most recent database backup:

mysql -u root -p protector < protector_backup.sql

Then restore the configuration backup and restart services.

Volume type property changes (Section 5): If adding or changing extra specs on a volume type caused unexpected behavior, remove the property:

openstack volume type unset <volume-type-name> --property replication_enabled
openstack volume type unset <volume-type-name> --property replication_type

Site credential updates (Section 3b): If a database UPDATE to the sites table used an incorrect password, re-run the update with the correct value and restart protector-engine.

Forced sync (Section 4, step 3): sync-force pushes the local version to the remote site, overwriting the remote copy. If this was done in error (e.g., syncing from the wrong site), re-run sync-force from the authoritative site (the one where VMs are running) to restore the correct state.

Failed or partial failover: A partially completed failover leaves both sites in a transitional state. Do not attempt to reverse it manually by deleting VMs or volumes. Instead:

  1. Use openstack protector operation show to determine exactly what succeeded.
  2. If VMs were created on the secondary site, complete the failover for any remaining PGs rather than rolling back.
  3. Once all PGs are in a consistent failed-over state, plan a controlled failback (openstack protector protection-group failback) when the primary site is healthy.

Escalation

If the steps in this runbook do not resolve the issue, escalate with the following information prepared:

Required information to provide:

  1. Service logs from both sites (last 500 lines minimum, or full logs since the failure began):

    journalctl -u protector-api --since "1 hour ago" > /tmp/protector-api-site-a.log
    journalctl -u protector-engine --since "1 hour ago" > /tmp/protector-engine-site-a.log
    # Repeat on site B
    
  2. Protection group details and sync status:

    openstack protector protection-group show <pg-name>
    openstack protector protection-group sync-status <pg-name>
    openstack protector protection-group sync-log <pg-name> --limit 20
    
  3. Operation details for any failed DR operation:

    openstack protector operation show <operation-id>
    
  4. Site configuration (redact passwords before sharing):

    mysql -u protector -p protector \
      -e "SELECT id, name, auth_url, service_username, site_type, status FROM sites;"
    
  5. Volume type configuration on both sites:

    openstack volume type list --long
    
  6. The exact CLI command that failed, the full error output, and the timestamp.

  7. Whether mock storage mode is in use (use_mock_storage = True/False from protector.conf).
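
The items above can be gathered into a single archive per site. This is a sketch under the paths already used in this runbook; the output location and filename scheme are illustrative, and you should redact passwords before attaching the archive.

```shell
#!/usr/bin/env bash
# Bundle escalation artifacts from this controller; run on both sites.
out=/tmp/protector-escalation-$(hostname)-$(date +%Y%m%d%H%M)
mkdir -p "$out"

journalctl -u protector-api --since "1 hour ago" > "$out/protector-api.log" 2>/dev/null
journalctl -u protector-engine --since "1 hour ago" > "$out/protector-engine.log" 2>/dev/null

# Record whether mock storage mode is in use (item 7).
grep -E 'use_mock_cinder|use_mock_storage' /etc/protector/protector.conf \
  > "$out/mock-flags.txt" 2>/dev/null

tar -czf "$out.tar.gz" -C /tmp "$(basename "$out")"
echo "Created $out.tar.gz"
```

Collect the protection-group, operation, and site details (items 2-5) into the same directory before creating the archive.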

Escalation path: