Service Issues
protector-api or protector-engine failures, RPC connectivity, Keystone auth errors
This page covers the most common failure modes for the protector-api and protector-engine services, including Keystone authentication errors, RabbitMQ/oslo.messaging connectivity problems, SQLAlchemy session hygiene issues, and missing OSC plugin installations. Understanding these failures matters because Trilio Site Recovery depends on both services being healthy on each site independently — a misconfigured or unreachable service on either site blocks all DR operations for protection groups associated with that site. Use this page to diagnose symptoms quickly and apply targeted fixes without restarting unaffected components.
Before using this guide, confirm the following:
- You have shell access to the host running `protector-api` and `protector-engine` on the affected site.
- You can read logs at `/opt/openstack-protector/bin/logs/api.log` and `/opt/openstack-protector/bin/logs/engine.log`.
- You have `dev-launch.sh` available to start and stop services (development deployments) or `systemctl` access for production deployments.
- You have valid OpenStack credentials for the affected site (a sourced `openrc` file or a configured `clouds.yaml` entry).
- The `protectorclient` OSC plugin is installed, or you are diagnosing precisely why it is not.
- You know which site (`site-a` or `site-b`) is exhibiting the problem, because each site runs its own independent `protector-api` and `protector-engine` instances with no direct service-to-service communication between sites.
Service management for development deployments uses dev-launch.sh. For production deployments that use systemd, substitute the equivalent systemctl commands.
Start all services on a site
# From the openstack-protector project root on the affected site
bash dev-launch.sh start
Stop all services on a site
bash dev-launch.sh stop
Restart a single service (development)
bash dev-launch.sh stop
bash dev-launch.sh start
Restart a single service (production systemd)
# Restart only the API
systemctl restart protector-api
# Restart only the engine
systemctl restart protector-engine
# Check status
systemctl status protector-api
systemctl status protector-engine
Reinstall the OSC CLI plugin
If the openstack protector commands are not found, reinstall the plugin from source:
# Navigate to the protectorclient source directory
cd /path/to/protectorclient
# Install in editable mode
pip install -e .
# Verify the plugin is registered
pip list | grep protector
Expected output after a successful install:
protectorclient 0.1.0 /path/to/protectorclient
If openstack protector --help still fails after reinstalling, confirm that the Python environment used by the openstack command is the same one where you ran pip install -e ..
The following configuration sections in protector.conf are most relevant to the failures described on this page.
[keystone_authtoken] — controls how protector-api validates incoming tokens
| Key | Effect |
|---|---|
| `auth_url` | Keystone endpoint this site's API uses to validate tokens. Must match the Keystone URL registered in the service catalog for this site. A mismatch causes 401 responses on every request. |
| `www_authenticate_uri` | Returned to clients in `WWW-Authenticate` headers on 401 responses. Typically the same as `auth_url`. |
| `username` / `password` / `project_name` | Credentials for the `protector` service account. This account must exist in Keystone on this site and hold the `admin` role in the `service` project. |
| `memcached_servers` | Token cache. If Memcached is unreachable, every request forces a round-trip to Keystone, which can cause latency-induced 401s under load. |
Example [keystone_authtoken] block:
[keystone_authtoken]
www_authenticate_uri = http://controller:5000
auth_url = http://controller:5000
memcached_servers = controller:11211
auth_type = password
project_domain_name = Default
user_domain_name = Default
project_name = service
username = protector
password = PROTECTOR_PASS
[oslo_messaging_rabbit] — controls how protector-engine connects to RabbitMQ
| Key | Effect |
|---|---|
| `rabbit_host` / `transport_url` | Hostname or URL of the RabbitMQ broker. If unreachable, the engine cannot receive tasks from the API and silently stops processing operations. |
| `rabbit_userid` / `rabbit_password` | Broker credentials. Authentication failures appear as oslo.messaging errors in the engine log. |
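As a sketch, a minimal broker configuration using `transport_url` might look like the following. The hostname and credentials are placeholders, not values from this deployment, and the exact section where `transport_url` lives can vary between oslo.messaging releases:

```ini
[oslo_messaging_rabbit]
transport_url = rabbit://protector:RABBIT_PASS@rabbit-host:5672/
```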
[database] — controls SQLAlchemy session behavior
| Key | Effect |
|---|---|
| `connection` | Database DSN. A misconfigured DSN causes immediate startup failure. |
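As a sketch, the block might look like this. The driver, credentials, and host are placeholders; adjust them for your deployment:

```ini
[database]
connection = mysql+pymysql://protector:DB_PASS@controller/protector
```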
SQLAlchemy session hygiene is not a protector.conf option — it is enforced in code. The engine must use expire_on_commit=False on database functions that return objects accessed after the session commits. If this is missing on a function, the engine log will show DetachedInstanceError. This requires a code fix, not a configuration change; see the Troubleshooting section for details.
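In code, the correct pattern looks roughly like the following sketch. The engine and factory names are illustrative, not the engine's actual API:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Stand-in engine; in the real service this would be built from the
# [database] connection DSN in protector.conf.
engine = create_engine("sqlite:///:memory:")

# expire_on_commit=False keeps attribute values loaded after commit(),
# so objects returned by functions using this factory stay readable
# after the session that loaded them has closed.
SessionFactory = sessionmaker(bind=engine, expire_on_commit=False)
```

Setting this once on the factory, rather than per session, ensures every code path that uses the factory gets the same behavior.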
Checking service health before investigating specific errors
Before diving into a specific symptom, confirm which service is affected:
# Check API log for recent errors
tail -n 100 /opt/openstack-protector/bin/logs/api.log
# Check engine log for recent errors
tail -n 100 /opt/openstack-protector/bin/logs/engine.log
For production deployments using systemd, also check the journal:
journalctl -u protector-api --since "10 minutes ago"
journalctl -u protector-engine --since "10 minutes ago"
Verifying Keystone registration for this site
A 401 from protector-api almost always means the service endpoint is not registered in Keystone on that site, or auth_url in protector.conf points to the wrong Keystone. Run these checks against the site returning 401s:
# Source credentials for the affected site
source ~/site-a-openrc # or site-b-openrc
# Confirm the protector service exists
openstack service show protector
# Confirm endpoints are registered
openstack endpoint list --service protector
# Confirm the service user exists and has the admin role in 'service'
openstack user show protector
openstack role assignment list --user protector --project service
Verifying RabbitMQ connectivity from the engine host
If operations submitted through the API never progress, check whether the engine can reach the broker:
# Test TCP connectivity to RabbitMQ
nc -zv <rabbit_host> 5672
# Search the engine log for oslo.messaging errors
grep -i 'oslo.messaging' /opt/openstack-protector/bin/logs/engine.log | tail -20
Checking metadata sync after a service restart
After restarting services on one site, verify that protection group metadata is still synchronized. A service restart does not itself cause metadata divergence, but if the restart was caused by a broader site issue, check sync status before resuming DR operations:
openstack protector protection-group sync-status <pg-name>
If the remote site was unreachable during the outage that caused the restart, the sync status may show UNREACHABLE or FAILED. Run a force sync once the site is reachable:
openstack protector protection-group sync-force <pg-name>
Example 1 — Diagnosing a 401 response from protector-api
You run an OSC command and receive a 401:
openstack protector protection-group list
The request you have made requires authentication. (HTTP 401)
Check the API log on the affected site:
grep '401\|auth\|Unauthorized' /opt/openstack-protector/bin/logs/api.log | tail -20
Expected log output pointing to a missing endpoint registration:
WARNING keystonemiddleware.auth_token [-] Identity response: 401 Unauthorized
ERROR keystonemiddleware.auth_token [-] Unable to validate token: Could not find endpoint for service 'protector' in catalog
Fix — register the endpoint on the affected site:
source ~/site-a-openrc
openstack endpoint create --region RegionOne \
protector public http://controller:8788/v1/%\(tenant_id\)s
openstack endpoint create --region RegionOne \
protector internal http://controller:8788/v1/%\(tenant_id\)s
openstack endpoint create --region RegionOne \
protector admin http://controller:8788/v1/%\(tenant_id\)s
Then restart protector-api and retry the command.
Example 2 — Engine not processing operations (RabbitMQ unreachable)
You create a DR operation (e.g., a test failover) and its status stays pending indefinitely:
openstack protector operation show op-456abc
+----------------+------------------------------------------+
| Field | Value |
+----------------+------------------------------------------+
| status | pending |
| progress | 0 |
| error_message | |
+----------------+------------------------------------------+
Search the engine log for messaging errors:
grep 'oslo.messaging\|amqp\|rabbit' /opt/openstack-protector/bin/logs/engine.log | tail -30
Expected log output:
ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on rabbit-host:5672 is unreachable: [Errno 111] Connection refused. Trying again in 2 seconds...
Fix — restore RabbitMQ connectivity and restart the engine:
# Verify RabbitMQ is reachable
nc -zv rabbit-host 5672
# If the broker is down, start it
systemctl start rabbitmq-server
# Restart the engine
systemctl restart protector-engine
# Confirm operations begin processing
tail -f /opt/openstack-protector/bin/logs/engine.log
Example 3 — DetachedInstanceError in the engine log
An operation fails with a Python traceback in the engine log:
grep -A 10 'DetachedInstanceError' /opt/openstack-protector/bin/logs/engine.log
Expected log output:
ERROR protector.engine.manager [-] Unhandled error in engine task
Traceback (most recent call last):
...
sqlalchemy.orm.exc.DetachedInstanceError: Instance <ProtectionGroup at 0x...> is not bound to a Session; attribute refresh operation cannot proceed
This is a code-level SQLAlchemy session hygiene issue — a database object is being accessed after the session that loaded it has been committed and closed. The fix is to ensure expire_on_commit=False is set on the SQLAlchemy session factory used by the affected engine function, and to avoid passing ORM objects across session boundaries. This requires a code change in the engine's database access layer; it cannot be resolved through configuration or a service restart alone. After applying the fix, restart the engine and re-trigger the failed operation.
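The failure and the fix can be reproduced outside the engine with a minimal, self-contained SQLAlchemy sketch. The model and table names here are illustrative stand-ins, not the engine's actual schema:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy.orm.exc import DetachedInstanceError

Base = declarative_base()

class ProtectionGroup(Base):  # illustrative stand-in for the engine's model
    __tablename__ = "protection_groups"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

def load_group(factory):
    """Return an ORM object after its session has committed and closed."""
    session = factory()
    pg = ProtectionGroup(name="pg-demo")
    session.add(pg)
    session.commit()
    session.close()
    return pg  # detached from any session at this point

# Default factory expires attributes on commit: accessing one later
# triggers a refresh that fails because the object is detached.
bad = load_group(sessionmaker(bind=engine))
try:
    bad.name
except DetachedInstanceError:
    print("DetachedInstanceError, as seen in the engine log")

# With expire_on_commit=False the attribute values survive the commit.
good = load_group(sessionmaker(bind=engine, expire_on_commit=False))
print(good.name)  # pg-demo
```

The second call succeeds because the loaded attribute values are retained on the object rather than being expired at commit time.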
Example 4 — OSC plugin not found
openstack protector protection-group list
openstack: 'protector' is not an openstack command. See 'openstack --help'.
Verify whether the plugin is installed:
pip list | grep protector
If there is no output, reinstall:
cd /path/to/protectorclient
pip install -e .
Verify after install:
pip list | grep protector
# Expected:
# protectorclient 0.1.0 /path/to/protectorclient
openstack protector --help
# Expected: protector subcommand help text
Each issue below follows the same format: Symptom → Likely cause → Fix.
Issue: protector-api returns HTTP 401 on all requests
Symptom: Every API call or OSC command against this site returns 401 Unauthorized, regardless of the token used.
Likely cause: One or more of the following:
- The `protector` service or its endpoints are not registered in Keystone on this site.
- `auth_url` in the `[keystone_authtoken]` section of `protector.conf` points to the wrong Keystone endpoint (e.g., the peer site's Keystone).
- The `protector` service user does not exist, or does not hold the `admin` role in the `service` project on this site.
Fix:
- Source credentials for the affected site and verify the service and endpoint registrations:
  openstack service show protector
  openstack endpoint list --service protector

- If missing, register the service and endpoints (see the Examples section).
- Verify `auth_url` in `protector.conf` matches this site's Keystone:

  [keystone_authtoken]
  auth_url = http://<this-site-controller>:5000

- Verify the service user has the correct role:

  openstack role assignment list --user protector --project service

- Restart `protector-api` after any configuration change.
Issue: DR operations stay in pending state — protector-engine is not processing them
Symptom: Operations created via the API or OSC show status: pending and progress: 0 and do not advance, even after several minutes.
Likely cause: The protector-engine cannot connect to RabbitMQ. Without a working oslo.messaging connection, the engine cannot receive task messages from the API.
Fix:
- Check the engine log for oslo.messaging errors:
  grep -i 'oslo.messaging\|amqp\|rabbit' /opt/openstack-protector/bin/logs/engine.log | tail -20

- Test TCP connectivity to the broker:

  nc -zv <rabbit_host> 5672

- If the broker is unreachable, restore network connectivity or start RabbitMQ:

  systemctl start rabbitmq-server

- Verify RabbitMQ credentials in `protector.conf` are correct.
- Restart `protector-engine`:

  systemctl restart protector-engine
  # or for dev deployments:
  bash dev-launch.sh stop && bash dev-launch.sh start

- Confirm operations begin processing by tailing the engine log.
Issue: DetachedInstanceError in the engine log, causing operation failures
Symptom: A DR operation fails and the engine log contains a traceback ending in sqlalchemy.orm.exc.DetachedInstanceError.
Likely cause: A SQLAlchemy ORM object loaded in one database session is being accessed after that session has been committed and expired. This is a session hygiene bug in the engine code — the object is "detached" from any active session when an attribute is accessed, triggering a lazy-load that cannot succeed.
Fix:
- Identify the engine function named in the traceback.
- Ensure the session factory for that function sets `expire_on_commit=False`. This prevents SQLAlchemy from expiring object attributes on commit, so they remain accessible after the session closes.
- Audit the code path for cross-session object references: do not pass ORM objects returned from one session into a different session or into async task boundaries.
- Apply the code fix and restart `protector-engine`.
- Re-trigger the failed operation. If it was a failover or failback, check the protection group status first and resolve any intermediate state before retrying.

Note: A service restart alone does not fix this issue. The bug will recur on the same code path until `expire_on_commit=False` is applied.
Issue: openstack protector commands not found — OSC plugin missing
Symptom: Running any openstack protector ... command returns 'protector' is not an openstack command.
Likely cause: The protectorclient OSC plugin is not installed in the Python environment used by the openstack CLI, or it was installed in a different virtual environment.
Fix:
- Reinstall the plugin from source:
  cd /path/to/protectorclient
  pip install -e .

- Verify the plugin is registered:

  pip list | grep protector

- Confirm the `openstack` CLI now recognizes the plugin:

  openstack protector --help

- If the command is still not found, confirm that `openstack` and `pip` both resolve to the same Python environment:

  which openstack
  which pip
  python -c "import sys; print(sys.prefix)"

  If they differ, activate the correct virtual environment and repeat the `pip install -e .` step.
Issue: Protection group modifications blocked after a service outage
Symptom: After restarting services on one site (or after recovering a site from an outage), attempts to modify a protection group return an error indicating the remote site is unreachable or metadata is out of sync.
Likely cause: Metadata sync between sites requires both sites to be reachable. If the peer site was unreachable during the outage, the local metadata version may differ from the remote copy. Modifications are intentionally blocked to prevent divergence.
Fix:
- Check sync status:

  openstack protector protection-group sync-status <pg-name>

- Once the peer site is reachable, force a sync:

  openstack protector protection-group sync-force <pg-name>

- Confirm both sites report the same version before resuming modifications.