Troubleshooting
Common failure modes and remediation steps
This runbook identifies the most common failure modes in Trilio Site Recovery for OpenStack and provides step-by-step remediation procedures; use it whenever a DR operation (protection group management, failover, failback, or site-level failover) behaves unexpectedly or services fail to respond.
This runbook covers:
- Trilio Site Recovery service failures (`protector-api`, `protector-engine`)
- Database connectivity and schema issues
- Keystone authentication failures for both local and cross-site service accounts
- Metadata sync failures and out-of-sync protection groups
- Volume type misconfiguration blocking protection group creation
- Failover and failback operation failures (per-PG and site-level)
- Network connectivity problems between sites
- Mock storage driver issues during testing
This runbook does not cover:
- Pure Storage FlashArray hardware faults or firmware upgrades
- OpenStack Nova, Cinder, or Neutron service failures unrelated to Trilio Site Recovery
- Keystone federation setup or SAML/OIDC configuration
- RabbitMQ cluster administration beyond basic connectivity checks
- Data recovery after a failed unplanned failover where volumes are corrupted
Before using this runbook, confirm you have:
- Access to both sites: SSH access to the controller nodes on both the primary and secondary OpenStack clouds, and valid admin credentials (`admin-openrc` or equivalent) for each site's Keystone.
- OSC CLI plugin installed: The `protectorclient` plugin must be installed, and your `~/.config/openstack/clouds.yaml` must define both `site-a` and `site-b` cloud entries.
- Database access: Ability to run queries against the MariaDB/MySQL `protector` database on both sites (the `protector` DB user or root).
- Log access: Read access to `/var/log/protector/protector-api.log` and `/var/log/protector/protector-engine.log` on both controllers.
- `dr_site_admin` role (for site-level failover troubleshooting): Your user must hold the `dr_site_admin` role in addition to `admin`.
- Network reachability: The workstation or jump host running CLI commands must be able to reach port `5000` (Keystone), `8774` (Nova), `8776` (Cinder), and `8788` (Protector API) on both sites.
- No in-progress DR operations should be running unless you are specifically troubleshooting a stuck operation.
Work through the sections below in order. Each section addresses a distinct failure domain. Stop at the section that matches your symptom and follow the numbered steps within it.
1. Protector Services Not Responding
Symptom: `openstack protector protection-group list` returns a connection error or HTTP 503.
1. Check service status on the affected site:

   ```bash
   systemctl status protector-api
   systemctl status protector-engine
   ```

   A healthy service shows `active (running)`. If either shows `failed` or `inactive`, proceed to step 2.

2. Review recent journal entries for the failed service:

   ```bash
   journalctl -u protector-api -n 100 --no-pager
   journalctl -u protector-engine -n 100 --no-pager
   ```

   Look for Python tracebacks, `ImportError`, or database connection errors in the output.

3. Check whether the API port is bound:

   ```bash
   netstat -tlnp | grep 8788
   ```

   If there is no output, the API process did not start successfully. Confirm the config file path is correct:

   ```bash
   grep ExecStart /etc/systemd/system/protector-api.service
   ```

4. Verify the configuration file is parseable:

   ```bash
   protector-manage config --config-file /etc/protector/protector.conf
   ```

   Correct any syntax errors reported.

5. Restart the services:

   ```bash
   systemctl restart protector-api
   systemctl restart protector-engine
   ```

6. Confirm the health endpoint responds:

   ```bash
   curl -s http://controller:8788/
   ```

   A JSON version document confirms the API is up.
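When you monitor several controllers, the per-service checks above lend themselves to a small helper. The sketch below is illustrative and not shipped with Trilio Site Recovery; it succeeds only when every state it reads is `active`, and in practice you would feed it the output of `systemctl is-active protector-api protector-engine`:

```shell
# Succeeds only if every line on stdin reads exactly "active".
# In practice:  systemctl is-active protector-api protector-engine | all_active
all_active() {
  # grep -qv matches any line that is NOT "active"; inverting its exit
  # status gives "success iff all lines are active".
  ! grep -qv '^active$'
}

# Demo with canned input:
printf 'active\nactive\n' | all_active && echo "all services active"
```

Because the helper only reads stdin, the same check works unchanged over SSH output collected from both sites.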
2. Database Connection Errors
Symptom: Service logs contain `OperationalError`, `Can't connect to MySQL server`, or `Access denied for user 'protector'`.
1. Verify the database is reachable from the controller:

   ```bash
   mysql -h controller -u protector -p protector -e "SELECT 1;"
   ```

   If this fails, the database host, port, or credentials in `protector.conf` are incorrect.

2. Check the connection string in the configuration file:

   ```bash
   grep ^connection /etc/protector/protector.conf
   ```

   The value must match the format `mysql+pymysql://protector:PROTECTOR_DBPASS@<host>/protector`.

3. Ensure the schema is current (run after upgrades):

   ```bash
   protector-manage db sync
   ```

   Re-run this if the log shows `Table '...' doesn't exist`.

4. If using a Galera cluster on the remote site and the remote Protector returns HTTP 500 during metadata sync, check whether the Galera node needs bootstrapping:

   ```bash
   # On the remote DB node
   mysql -e "SHOW STATUS LIKE 'wsrep_cluster_status';"
   ```

   A status of `Non-Primary` indicates a split-brain condition. Resolve the Galera cluster state before retrying Protector operations.
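If you script configuration audits, the connection string format from step 2 can be checked mechanically. This is a minimal sketch under the assumption that the password contains no `:` or `@` characters; the function name and regex are illustrative, not part of the product:

```shell
# Validate that a Protector 'connection' value matches the expected
# mysql+pymysql://USER:PASSWORD@HOST/DATABASE shape (illustrative only).
valid_connection_string() {
  printf '%s\n' "$1" | grep -Eq '^mysql\+pymysql://[^:/@]+:[^@]+@[^/]+/[^/]+$'
}

conn="mysql+pymysql://protector:PROTECTOR_DBPASS@controller/protector"
if valid_connection_string "$conn"; then
  echo "connection string OK"
else
  echo "connection string MALFORMED"
fi
```

Passwords containing `@` or `:` would need a smarter parser; for those, rely on step 1's `SELECT 1;` connectivity test instead.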
3. Keystone Authentication Failures
Symptom: The API returns HTTP 401, or `keystonemiddleware` errors appear in `protector-api.log`.
3a. Local Keystone (user-facing authentication)
1. Verify the service user exists:

   ```bash
   openstack user show protector
   ```

2. Confirm the `admin` role is assigned to the service user on the `service` project:

   ```bash
   openstack role assignment list --user protector --project service --names
   ```

3. Test token issuance with the service account:

   ```bash
   openstack --os-auth-url http://controller:5000/v3 \
     --os-username protector \
     --os-password <PROTECTOR_PASS> \
     --os-project-name service \
     token issue
   ```

4. Check `[keystone_authtoken]` in `/etc/protector/protector.conf`: the `auth_url`, `username`, `password`, `project_name`, and domain fields must match the values above.
3b. Cross-site service account authentication
Protector uses dedicated service credentials (not user tokens) to authenticate to the remote site. Failures here block metadata sync and failover.
1. Test connectivity to the remote Keystone:

   ```bash
   curl -k https://remote-keystone:5000/v3
   ```

   A JSON response confirms basic reachability. A `Connection refused` or timeout indicates a network or firewall issue; see Section 7.

2. Test the remote service account credentials:

   ```bash
   openstack --os-auth-url https://remote-keystone:5000 \
     --os-username protector-service \
     --os-password <password> \
     --os-project-name service \
     token issue
   ```

3. If authentication fails, verify the service account exists on the remote site and holds the `admin` role on the `service` project:

   ```bash
   # Run on the remote site
   openstack user show protector-service
   openstack role assignment list --user protector-service --project service --names
   ```

4. Verify the site record in the database has correct credentials:

   ```bash
   mysql -u protector -p protector \
     -e "SELECT name, auth_url, service_username FROM sites;"
   ```

   Update incorrect values directly if needed:

   ```bash
   mysql -u protector -p protector \
     -e "UPDATE sites SET service_password='<new-password>' WHERE name='cluster2-secondary';"
   ```

   Then restart `protector-engine` for the change to take effect.
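When the openstack CLI itself is suspect, you can exercise the remote Keystone password flow directly with curl. The sketch below assumes the user and project live in the `Default` domain (adjust if your deployment differs); per the Keystone v3 API, a 201 response carrying an `X-Subject-Token` header confirms the credentials work:

```shell
# Build a Keystone v3 password-auth request body. Domain name "Default"
# is an assumption; substitute your deployment's domains as needed.
build_auth_payload() {
  local user="$1" password="$2" project="$3"
  cat <<EOF
{"auth": {"identity": {"methods": ["password"],
  "password": {"user": {"name": "$user", "domain": {"name": "Default"},
    "password": "$password"}}},
  "scope": {"project": {"name": "$project", "domain": {"name": "Default"}}}}}
EOF
}

# POST it to the remote Keystone; the first response line shows the
# HTTP status (expect "201 Created" on success).
build_auth_payload protector-service '<password>' service |
  curl -k -s -i --max-time 5 -H 'Content-Type: application/json' -d @- \
    https://remote-keystone:5000/v3/auth/tokens | head -n 1
```

This isolates the failure: if curl succeeds but the CLI does not, the problem is in the client configuration rather than the credentials.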
4. Metadata Sync Failures
Symptom: `openstack protector protection-group sync-status <pg-name>` reports `OUT OF SYNC`, `FAILED`, or `UNREACHABLE`; modification commands are blocked with "Cannot modify protection group — remote site unreachable".
1. Check current sync status:

   ```bash
   openstack protector protection-group sync-status <pg-name>
   ```

   Note the local version, remote version, and last sync timestamp.

2. Determine whether the remote site is reachable:

   ```bash
   # Replace with actual remote Protector API endpoint
   curl -k https://remote-site-controller:8788/
   ```

   If unreachable, resolve the connectivity issue (see Section 7) before proceeding. Metadata sync is intentionally blocked while the remote site is down to prevent divergence; this is by design.

3. Once the remote site is back online, force a sync:

   ```bash
   openstack protector protection-group sync-force <pg-name>
   ```

   Expected output confirms both sites reach the same version number.

4. Verify the sync succeeded:

   ```bash
   openstack protector protection-group sync-status <pg-name>
   ```

   Both `Local Version` and `Remote Version` should match, and `Remote Sync Status` should read `SYNCED`.

5. Review the sync audit trail if you need to understand what diverged:

   ```bash
   openstack protector protection-group sync-log <pg-name> --limit 20
   ```

6. If `sync-force` itself fails, check `protector-engine.log` on both sites for the underlying error. Common causes are expired service credentials (Section 3b) or a Galera cluster issue (Section 2, step 4).
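If you watch many protection groups, comparing the version fields from saved `sync-status` output can be scripted. The table field names below mirror this runbook's examples, and the parsing is an illustrative sketch, not part of the CLI:

```shell
# Given captured sync-status table output, succeed only when the
# Local Version and Remote Version fields carry the same value.
versions_match() {
  local output="$1" local_v remote_v
  local_v=$(printf '%s\n' "$output" | awk -F'|' '/Local Version/  {gsub(/ /,"",$3); print $3}')
  remote_v=$(printf '%s\n' "$output" | awk -F'|' '/Remote Version/ {gsub(/ /,"",$3); print $3}')
  [ -n "$local_v" ] && [ "$local_v" = "$remote_v" ]
}

# Demo with a canned two-row excerpt of the status table:
status_table='| Local Version  | 14 |
| Remote Version | 14 |'
if versions_match "$status_table"; then
  echo "versions match"
else
  echo "versions diverged"
fi
```

Run it over output saved from each PG to get a quick fleet-wide divergence report.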
5. Volume Type Misconfiguration
Symptom: Protection group creation fails with an error about volume types, or volumes are not being included in the Cinder Consistency Group.
For a volume type to be eligible for protection, it must carry the `replication_enabled='<is> True'` extra spec and a `replication_type` property.
1. Inspect the volume type extra specs:

   ```bash
   openstack volume type show <volume-type-name>
   ```

   Look for `replication_type` and `replication_enabled` in the `properties` field.

2. If the properties are missing, add them:

   ```bash
   openstack volume type set <volume-type-name> \
     --property replication_enabled='<is> True' \
     --property replication_type='async'
   ```

   Use `replication_type='sync'` for synchronous (ActiveCluster/Pod) replication.

3. Verify the `volume_backend_name` refers to a valid Cinder backend:

   ```bash
   openstack volume service list
   ```

   The value after `@` in the `Host` column is the backend name. If `volume_backend_name` in the volume type does not match any of these, volume creation will fail with "No valid backend was found":

   ```bash
   openstack volume type set <volume-type-name> \
     --property volume_backend_name='<correct-backend-name>'
   ```

4. Confirm the same volume type (by name and properties) exists on the secondary site:

   ```bash
   # Switch to secondary site credentials
   source ~/site-b-openrc
   openstack volume type show <volume-type-name>
   ```

   Create or update the type on the secondary site if it is missing or mismatched.
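A quick way to audit many volume types is to test each one's properties string for the two required extra specs. This helper is a sketch; feed it the `properties` value printed by `openstack volume type show` (the sample backend name below is illustrative):

```shell
# Succeed only when a properties string contains both required
# replication extra specs.
has_replication_specs() {
  local props="$1"
  case "$props" in
    *"replication_enabled='<is> True'"*) ;;
    *) return 1 ;;
  esac
  case "$props" in
    *"replication_type="*) return 0 ;;
    *) return 1 ;;
  esac
}

# Demo with a sample properties string:
props="replication_enabled='<is> True', replication_type='async', volume_backend_name='backend-1'"
has_replication_specs "$props" && echo "volume type is replication-ready"
```

Run the same check against the output captured on both sites to catch a type that is correct on one cloud but not the other.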
6. Failover and Failback Operation Failures
Symptom: `openstack protector protection-group failover` or `failback` returns an error, or the resulting operation shows `failed` status.
6a. Per-protection-group failover
1. Check the operation status and error message:

   ```bash
   openstack protector operation list
   openstack protector operation show <operation-id>
   ```

   The error message in the operation record is the most precise starting point.

2. Common error: replication not healthy. Verify replication status before retrying:

   ```bash
   openstack dr replication health <pg-id>
   ```

   If replication lag is high or the state is not `healthy`, wait for replication to catch up or investigate Pure Storage connectivity.

3. Common error: resource mapping missing. Failover requires network and (optionally) flavor mappings so that VMs can be recreated on the secondary site. Confirm mappings are configured:

   ```bash
   openstack protector protection-group show <pg-name>
   ```

   If network or flavor mappings are absent, add them before retrying the failover.

4. Common error: volume attachment timeout. This appears in the operation error detail. Check Cinder volume status on the secondary site:

   ```bash
   source ~/site-b-openrc
   openstack volume list --all-projects
   ```

   Volumes stuck in the `attaching` state may need to be manually reset:

   ```bash
   openstack volume set --state available <volume-id>
   ```

5. Review the engine logs for the full traceback:

   ```bash
   tail -n 200 /var/log/protector/protector-engine.log | grep -A 20 "ERROR"
   ```

6. Retry the operation after resolving the root cause:

   ```bash
   openstack protector protection-group failover <pg-name> --type planned
   ```
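Rather than re-running `operation show` by hand while a retried failover progresses, a small polling loop helps. The status-fetching command here is a placeholder stub; substitute something like `openstack protector operation show <operation-id> -f value -c status` for real use:

```shell
# Poll a status command until it stops reporting "running", then print
# the terminal state. The command passed in is a placeholder for the
# real CLI invocation.
wait_for_operation() {
  local get_status="$1" status
  while :; do
    status=$("$get_status")
    if [ "$status" = "running" ]; then
      sleep 5          # still in progress; poll again
    else
      echo "$status"   # terminal state: completed, failed, etc.
      return 0
    fi
  done
}

# Demo with a stub that reports "completed" immediately:
demo_status() { echo completed; }
wait_for_operation demo_status
```

Keep the poll interval generous; hammering the API during a site-level failover adds load without speeding anything up.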
6b. Site-level failover (`dr site failover`) partial failures
1. Show the site operation to identify which PGs failed:

   ```bash
   openstack dr site operation show <site-operation-id>
   ```

   The `child_operations` and `error_summary` fields list failed PGs with their error messages.

2. Investigate each failed PG individually:

   ```bash
   openstack dr operation show <child-operation-id>
   ```

3. Retry only the failed PGs after fixing the underlying issue:

   ```bash
   openstack dr failover <failed-pg-id> --failover-type planned
   ```

4. If the operation is stuck in `running`, check whether the engine is still processing:

   ```bash
   tail -f /var/log/protector/protector-engine.log
   ```

   Site-level failover uses a thread pool with a maximum of 10 concurrent workers, so a large number of PGs will complete in batches. Allow time proportional to the PG count before concluding the operation is stuck.

5. Insufficient role error: If you receive "Site actions require dr_site_admin role", assign the role and retry:

   ```bash
   openstack role add --user <username> --project <project> dr_site_admin
   ```

6. Executing from the wrong site: Planned site failovers must be initiated from the secondary (DR) site. If you are on the primary site and the command is blocked, either switch to the secondary site credentials or use `--force`:

   ```bash
   openstack dr site failover <site-id> --failover-type planned --force
   ```
7. Inter-Site Network Connectivity
Symptom: Metadata sync fails, cross-site authentication fails, or failover cannot reach the secondary site's Cinder or Nova endpoints.
1. Test each required port from the Protector engine host to the remote site:

   | Remote service | Port | Test command |
   |---|---|---|
   | Keystone | 5000 | `curl -k https://remote-controller:5000/v3` |
   | Nova | 8774 | `curl -k https://remote-controller:8774/` |
   | Cinder | 8776 | `curl -k https://remote-controller:8776/` |
   | Protector API | 8788 | `curl -k https://remote-controller:8788/` |

2. Check the firewall rules on both the source and destination hosts. Ensure the above ports are open for traffic originating from the Protector engine's IP address.

3. Check for SSL/TLS certificate errors in the engine log. If you see `SSL: CERTIFICATE_VERIFY_FAILED`, either install the remote site's CA certificate into the system trust store, or (for non-production environments only) set `verify=False` in the site client configuration.
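The port checks above can be swept in one pass without curl, using bash's `/dev/tcp`. This is a sketch; `remote-controller` is a placeholder for your actual remote controller hostname:

```shell
# Report OPEN/CLOSED for one TCP port, with a short timeout. Requires
# bash (the /dev/tcp redirection is a bash feature, not POSIX sh).
check_port() {
  local host="$1" port="$2"
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port OPEN"
  else
    echo "$host:$port CLOSED"
  fi
}

# Sweep the four cross-site ports against a placeholder host:
for port in 5000 8774 8776 8788; do
  check_port remote-controller "$port"
done
```

An `OPEN` result only proves TCP reachability; a service can accept connections and still return TLS or HTTP errors, so follow up with the curl commands from the table for any port that matters.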
8. Mock Storage Driver Issues
Symptom: End-to-end testing fails with errors referencing SQLite, missing Glance images, or mock storage paths.
1. Verify mock mode is enabled on both sites:

   ```bash
   grep -E 'use_mock_cinder|use_mock_storage' /etc/protector/protector.conf
   ```

   Both `use_mock_cinder = True` and `use_mock_storage = True` must be set.

2. Verify the mock storage directory exists and is writable:

   ```bash
   ls -ld /var/lib/protector/mock_storage
   ```

   If it is missing, create it:

   ```bash
   mkdir -p /var/lib/protector/mock_storage
   chown protector:protector /var/lib/protector/mock_storage
   ```

3. Verify that matching Glance images exist on both sites. In mock mode, failover creates bootable volumes from Glance images to simulate replication. The image name must be identical on both clusters:

   ```bash
   # On Site A
   source ~/site-a-openrc && openstack image list
   # On Site B
   source ~/site-b-openrc && openstack image list
   ```

   If an image is missing on one site, upload it:

   ```bash
   openstack image create cirros \
     --disk-format qcow2 \
     --container-format bare \
     --public \
     --file cirros-0.5.2-x86_64-disk.img
   ```

4. Verify `python-glanceclient` is installed on both controller nodes:

   ```bash
   python3 -c "import glanceclient; print(glanceclient.__version__)"
   ```

   Install it if missing:

   ```bash
   pip install python-glanceclient
   ```
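To spot images that exist on one site but not the other, diff the two name lists. A sketch, assuming you feed it output from `openstack image list -f value -c Name` captured on each site:

```shell
# Print image names present on site A but missing on site B.
# $1: newline-separated names from site A; $2: names from site B.
missing_images() {
  comm -23 <(printf '%s\n' "$1" | sort -u) <(printf '%s\n' "$2" | sort -u)
}

# Demo with canned lists: ubuntu-22.04 is missing on site B.
site_a='cirros
ubuntu-22.04'
site_b='cirros'
missing_images "$site_a" "$site_b"
```

Run it in both directions (A vs. B, then B vs. A) to catch images missing on either side before a mock failover test.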
After completing the relevant remediation steps, confirm resolution using the following checks:
- Services are running on both sites:

  ```bash
  systemctl is-active protector-api protector-engine
  ```

  Both return `active`.

- API health check passes on both sites:

  ```bash
  curl -s http://site-a-controller:8788/
  curl -s http://site-b-controller:8788/
  ```

  Both return a JSON version document.

- Protection groups are visible and in sync:

  ```bash
  openstack protector protection-group list
  openstack protector protection-group sync-status <pg-name>
  ```

  Sync status shows `IN SYNC`, and the local and remote versions match.

- Cross-site service account authentication succeeds:

  ```bash
  openstack --os-auth-url https://remote-keystone:5000 \
    --os-username protector-service \
    --os-password <password> \
    --os-project-name service \
    token issue
  ```

  A token is returned without errors.

- Volume types have correct properties on both sites:

  ```bash
  openstack volume type show <replicated-volume-type>
  ```

  Output includes `replication_enabled='<is> True'` and a `replication_type` value.

- Any previously failed operation has been resolved or retried successfully:

  ```bash
  openstack protector operation show <operation-id>
  ```

  Status shows `completed` (not `failed` or `running`).

- For site-level failover, all child PGs report success:

  ```bash
  openstack dr site operation show <site-operation-id>
  ```

  `completed_count` equals `total_protection_groups` and `failed_count` is `0`.
The appropriate rollback depends on which step introduced a change. The following guidance covers the most common interventions in this runbook.
Service restarts (Section 1, step 5): If restarting `protector-api` or `protector-engine` caused a regression, revert any configuration change made to `/etc/protector/protector.conf` and restart again. Keep a backup before editing:

```bash
cp /etc/protector/protector.conf /etc/protector/protector.conf.bak
```
Database schema changes (Section 2, step 3): Alembic migrations (`protector-manage db sync`) are not automatically reversible. If a migration caused instability, restore from your most recent database backup:

```bash
mysql -u root -p protector < protector_backup.sql
```

Then restore the configuration backup and restart the services.
Volume type property changes (Section 5): If adding or changing extra specs on a volume type caused unexpected behavior, remove the properties:

```bash
openstack volume type unset <volume-type-name> --property replication_enabled
openstack volume type unset <volume-type-name> --property replication_type
```
Site credential updates (Section 3b): If a database `UPDATE` to the `sites` table used an incorrect password, re-run the update with the correct value and restart `protector-engine`.
Forced sync (Section 4, step 3): `sync-force` pushes the local version to the remote site, overwriting the remote copy. If this was done in error (e.g., syncing from the wrong site), re-run `sync-force` from the authoritative site (the one where the VMs are running) to restore the correct state.
Failed or partial failover: A partially completed failover leaves both sites in a transitional state. Do not attempt to reverse it manually by deleting VMs or volumes. Instead:

- Use `openstack protector operation show` to determine exactly what succeeded.
- If VMs were created on the secondary site, complete the failover for any remaining PGs rather than rolling back.
- Once all PGs are in a consistent failed-over state, plan a controlled failback (`openstack protector protection-group failback`) when the primary site is healthy.
If the steps in this runbook do not resolve the issue, escalate with the following information prepared:
Required information to provide:
- Service logs from both sites (last 500 lines minimum, or full logs since the failure began):

  ```bash
  journalctl -u protector-api --since "1 hour ago" > /tmp/protector-api-site-a.log
  journalctl -u protector-engine --since "1 hour ago" > /tmp/protector-engine-site-a.log
  # Repeat on site B
  ```

- Protection group details and sync status:

  ```bash
  openstack protector protection-group show <pg-name>
  openstack protector protection-group sync-status <pg-name>
  openstack protector protection-group sync-log <pg-name> --limit 20
  ```

- Operation details for any failed DR operation:

  ```bash
  openstack protector operation show <operation-id>
  ```

- Site configuration (redact passwords before sharing):

  ```bash
  mysql -u protector -p protector \
    -e "SELECT id, name, auth_url, service_username, site_type, status FROM sites;"
  ```

- Volume type configuration on both sites:

  ```bash
  openstack volume type list --long
  ```

- The exact CLI command that failed, the full error output, and the timestamp.

- Whether mock storage mode is in use (`use_mock_storage = True/False` from `protector.conf`).
Escalation path: