Site Recovery for OpenStack
Guide

Service Issues

protector-api or protector-engine failures, RPC connectivity, Keystone auth errors


Overview

This page covers the most common failure modes for the protector-api and protector-engine services, including Keystone authentication errors, RabbitMQ/oslo.messaging connectivity problems, SQLAlchemy session hygiene issues, and missing OSC plugin installations. Understanding these failures matters because Trilio Site Recovery depends on both services being healthy on each site independently — a misconfigured or unreachable service on either site blocks all DR operations for protection groups associated with that site. Use this page to diagnose symptoms quickly and apply targeted fixes without restarting unaffected components.


Prerequisites

Before using this guide, confirm the following:

  • You have shell access to the host running protector-api and protector-engine on the affected site.
  • You can read logs at /opt/openstack-protector/bin/logs/api.log and /opt/openstack-protector/bin/logs/engine.log.
  • You have dev-launch.sh available to start and stop services (development deployments) or systemctl access for production deployments.
  • You have valid OpenStack credentials for the affected site (a sourced openrc file or a configured clouds.yaml entry).
  • The protectorclient OSC plugin is installed — or you are diagnosing precisely why it is not.
  • You know which site (site-a or site-b) is exhibiting the problem, because each site runs its own independent protector-api and protector-engine instances with no direct service-to-service communication between sites.

Installation

Service management for development deployments uses dev-launch.sh. For production deployments that use systemd, substitute the equivalent systemctl commands.

Start all services on a site

# From the openstack-protector project root on the affected site
bash dev-launch.sh start

Stop all services on a site

bash dev-launch.sh stop

Restart services (development)

bash dev-launch.sh stop
bash dev-launch.sh start

Restart a single service (production systemd)

# Restart only the API
systemctl restart protector-api

# Restart only the engine
systemctl restart protector-engine

# Check status
systemctl status protector-api
systemctl status protector-engine

Reinstall the OSC CLI plugin

If the openstack protector commands are not found, reinstall the plugin from source:

# Navigate to the protectorclient source directory
cd /path/to/protectorclient

# Install in editable mode
pip install -e .

# Verify the plugin is registered
pip list | grep protector

Expected output after a successful install:

protectorclient   0.1.0     /path/to/protectorclient

If openstack protector --help still fails after reinstalling, confirm that the Python environment used by the openstack command is the same one where you ran pip install -e ..
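Because OSC discovers plugins through Python entry points, you can also check registration directly from the interpreter. The sketch below is illustrative: "openstack.cli.extension" is the group OSC conventionally uses for plugin discovery, and the entry-point name "protector" is an assumption about how protectorclient registers itself.

```python
# Sketch: check whether an OSC plugin entry point is visible to the current
# Python environment. Group and entry-point names are assumptions; adjust
# them to match protectorclient's setup metadata if they differ.
from importlib.metadata import entry_points

def osc_plugin_registered(name: str, group: str = "openstack.cli.extension") -> bool:
    """Return True if an entry point called `name` exists in `group`."""
    try:
        eps = entry_points(group=group)          # Python 3.10+
    except TypeError:
        eps = entry_points().get(group, [])      # Python 3.8/3.9 fallback
    return any(ep.name == name for ep in eps)

if __name__ == "__main__":
    print(osc_plugin_registered("protector"))
```

If this prints False in the same interpreter that backs the openstack command, the plugin was installed into a different environment.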


Configuration

The following configuration sections in protector.conf are most relevant to the failures described on this page.

[keystone_authtoken] — controls how protector-api validates incoming tokens

Key settings:

  • auth_url: Keystone endpoint this site's API uses to validate tokens. Must match the Keystone URL registered in the service catalog for this site. A mismatch causes 401 responses on every request.
  • www_authenticate_uri: Returned to clients in WWW-Authenticate headers on 401 responses. Typically the same as auth_url.
  • username / password / project_name: Credentials for the protector service account. This account must exist in Keystone on this site and hold the admin role in the service project.
  • memcached_servers: Token cache. If Memcached is unreachable, every request forces a round-trip to Keystone, which can cause latency-induced 401s under load.

Example [keystone_authtoken] block:

[keystone_authtoken]
www_authenticate_uri = http://controller:5000
auth_url = http://controller:5000
memcached_servers = controller:11211
auth_type = password
project_domain_name = Default
user_domain_name = Default
project_name = service
username = protector
password = PROTECTOR_PASS

[oslo_messaging_rabbit] — controls how protector-engine connects to RabbitMQ

Key settings:

  • rabbit_host / transport_url: Hostname or URL of the RabbitMQ broker. If unreachable, the engine cannot receive tasks from the API and silently stops processing operations.
  • rabbit_userid / rabbit_password: Broker credentials. Authentication failures appear as oslo.messaging errors in the engine log.

[database] — controls SQLAlchemy session behavior

Key settings:

  • connection: Database DSN. A misconfigured DSN causes immediate startup failure.

SQLAlchemy session hygiene is not a protector.conf option; it is enforced in code. The engine must create sessions with expire_on_commit=False for any database function that returns ORM objects accessed after the session commits. If a function is missing this setting, the engine log shows DetachedInstanceError. This requires a code fix, not a configuration change; see the Troubleshooting section for details.


Usage

Checking service health before investigating specific errors

Before diving into a specific symptom, confirm which service is affected:

# Check API log for recent errors
tail -n 100 /opt/openstack-protector/bin/logs/api.log

# Check engine log for recent errors
tail -n 100 /opt/openstack-protector/bin/logs/engine.log

For production deployments using systemd, also check the journal:

journalctl -u protector-api --since "10 minutes ago"
journalctl -u protector-engine --since "10 minutes ago"
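When triaging, it can help to summarize which loggers are producing errors rather than reading raw tails. The helper below is a hypothetical convenience, not part of the protector tooling; it assumes the oslo.log-style format shown in the log excerpts on this page ("ERROR <logger> [-] message").

```python
# Sketch: count recent ERROR lines per logger in a protector log file.
# The regex assumes oslo.log-style lines; adjust it if your format differs.
import collections
import os
import re

ERROR_RE = re.compile(r"\bERROR\s+(\S+)")

def error_counts(path: str, last_n: int = 100) -> collections.Counter:
    """Count ERROR lines per logger in the last `last_n` lines of a log file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        tail = f.readlines()[-last_n:]
    return collections.Counter(
        m.group(1) for line in tail if (m := ERROR_RE.search(line))
    )

if __name__ == "__main__":
    log = "/opt/openstack-protector/bin/logs/engine.log"
    if os.path.exists(log):
        for logger, count in error_counts(log).most_common():
            print(f"{count:5d}  {logger}")
```

A spike under keystonemiddleware points at the auth issues below; a spike under oslo.messaging points at the RabbitMQ issues.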

Verifying Keystone registration for this site

A 401 from protector-api almost always means the service endpoint is not registered in Keystone on that site, or auth_url in protector.conf points to the wrong Keystone. Run these checks against the site returning 401s:

# Source credentials for the affected site
source ~/site-a-openrc   # or site-b-openrc

# Confirm the protector service exists
openstack service show protector

# Confirm endpoints are registered
openstack endpoint list --service protector

# Confirm the service user exists and has the admin role in 'service'
openstack user show protector
openstack role assignment list --user protector --project service

Verifying RabbitMQ connectivity from the engine host

If operations submitted through the API never progress, check whether the engine can reach the broker:

# Test TCP connectivity to RabbitMQ
nc -zv <rabbit_host> 5672

# Search the engine log for oslo.messaging errors
grep -i 'oslo.messaging' /opt/openstack-protector/bin/logs/engine.log | tail -20
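On hosts where nc is not installed, the same TCP probe can be done from Python's standard library. This is a minimal sketch; the hostname below is a placeholder for your broker.

```python
# Sketch: TCP reachability probe equivalent to `nc -zv <rabbit_host> 5672`.
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # "rabbit-host" is a placeholder; substitute your broker's hostname.
    print(port_open("rabbit-host", 5672))
```

Note that an open port only proves network reachability; credential or vhost problems still show up as oslo.messaging errors in the engine log.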

Checking metadata sync after a service restart

After restarting services on one site, verify that protection group metadata is still synchronized. A service restart does not itself cause metadata divergence, but if the restart was caused by a broader site issue, check sync status before resuming DR operations:

openstack protector protection-group sync-status <pg-name>

If the remote site was unreachable during the outage that caused the restart, the sync status may show UNREACHABLE or FAILED. Run a force sync once the site is reachable:

openstack protector protection-group sync-force <pg-name>

Examples

Example 1 — Diagnosing a 401 response from protector-api

You run an OSC command and receive a 401:

openstack protector protection-group list
The request you have made requires authentication. (HTTP 401)

Check the API log on the affected site:

grep '401\|auth\|Unauthorized' /opt/openstack-protector/bin/logs/api.log | tail -20

Expected log output pointing to a missing endpoint registration:

WARNING keystonemiddleware.auth_token [-] Identity response: 401 Unauthorized
ERROR keystonemiddleware.auth_token [-] Unable to validate token: Could not find endpoint for service 'protector' in catalog

Fix — register the endpoint on the affected site:

source ~/site-a-openrc

openstack endpoint create --region RegionOne \
  protector public http://controller:8788/v1/%\(tenant_id\)s

openstack endpoint create --region RegionOne \
  protector internal http://controller:8788/v1/%\(tenant_id\)s

openstack endpoint create --region RegionOne \
  protector admin http://controller:8788/v1/%\(tenant_id\)s

Then restart protector-api and retry the command.


Example 2 — Engine not processing operations (RabbitMQ unreachable)

You create a DR operation (e.g., a test failover) and its status stays pending indefinitely:

openstack protector operation show op-456abc
+----------------+------------------------------------------+
| Field          | Value                                    |
+----------------+------------------------------------------+
| status         | pending                                  |
| progress       | 0                                        |
| error_message  |                                          |
+----------------+------------------------------------------+

Search the engine log for messaging errors:

grep 'oslo.messaging\|amqp\|rabbit' /opt/openstack-protector/bin/logs/engine.log | tail -30

Expected log output:

ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on rabbit-host:5672 is unreachable: [Errno 111] Connection refused. Trying again in 2 seconds...

Fix — restore RabbitMQ connectivity and restart the engine:

# Verify RabbitMQ is reachable
nc -zv rabbit-host 5672

# If the broker is down, start it
systemctl start rabbitmq-server

# Restart the engine
systemctl restart protector-engine

# Confirm operations begin processing
tail -f /opt/openstack-protector/bin/logs/engine.log

Example 3 — DetachedInstanceError in the engine log

An operation fails with a Python traceback in the engine log:

grep -A 10 'DetachedInstanceError' /opt/openstack-protector/bin/logs/engine.log

Expected log output:

ERROR protector.engine.manager [-] Unhandled error in engine task
Traceback (most recent call last):
  ...
sqlalchemy.orm.exc.DetachedInstanceError: Instance <ProtectionGroup at 0x...> is not bound to a Session; attribute refresh operation cannot proceed

This is a code-level SQLAlchemy session hygiene issue — a database object is being accessed after the session that loaded it has been committed and closed. The fix is to ensure expire_on_commit=False is set on the SQLAlchemy session factory used by the affected engine function, and to avoid passing ORM objects across session boundaries. This requires a code change in the engine's database access layer; it cannot be resolved through configuration or a service restart alone. After applying the fix, restart the engine and re-trigger the failed operation.
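The failure mode and the fix can be reproduced in a few lines. This is a minimal sketch assuming SQLAlchemy 1.4 or later; the ProtectionGroup model and in-memory database here are illustrative only, not the engine's real schema.

```python
# Sketch: why expire_on_commit=False matters. Model and table names are
# illustrative, not the engine's actual database layer.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class ProtectionGroup(Base):
    __tablename__ = "protection_group"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")          # in-memory DB for the demo
Base.metadata.create_all(engine)

# Buggy factory: expire_on_commit defaults to True, so commit() expires every
# loaded attribute; touching one after the session closes triggers a refresh
# with no session attached -> DetachedInstanceError.
BuggySession = sessionmaker(bind=engine)

# Fixed factory: committed objects keep their loaded attribute values.
SafeSession = sessionmaker(bind=engine, expire_on_commit=False)

def create_pg(factory, name):
    """Create a row and return the ORM object after the session has closed."""
    with factory() as session:
        pg = ProtectionGroup(name=name)
        session.add(pg)
        session.commit()
        return pg   # detached once the with-block exits

if __name__ == "__main__":
    pg = create_pg(SafeSession, "pg-demo")
    print(pg.name)                            # prints "pg-demo"
    try:
        create_pg(BuggySession, "pg-demo-2").name
    except Exception as exc:
        print(type(exc).__name__)             # prints "DetachedInstanceError"
```

The second part of the audit, not shown here, is just as important: never pass an ORM object loaded in one session into another session or across a task boundary.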


Example 4 — OSC plugin not found

openstack protector protection-group list
openstack: 'protector' is not an openstack command. See 'openstack --help'.

Verify whether the plugin is installed:

pip list | grep protector

If there is no output, reinstall:

cd /path/to/protectorclient
pip install -e .

Verify after install:

pip list | grep protector
# Expected:
# protectorclient   0.1.0     /path/to/protectorclient

openstack protector --help
# Expected: protector subcommand help text

Troubleshooting

Each issue below follows the same format: Symptom → Likely cause → Fix.


Issue: protector-api returns HTTP 401 on all requests

Symptom: Every API call or OSC command against this site returns 401 Unauthorized, regardless of the token used.

Likely cause: One or more of the following:

  • The protector service or its endpoints are not registered in Keystone on this site.
  • auth_url in the [keystone_authtoken] section of protector.conf points to the wrong Keystone endpoint (e.g., the peer site's Keystone).
  • The protector service user does not exist, or does not hold the admin role in the service project on this site.

Fix:

  1. Source credentials for the affected site and verify the service and endpoint registrations:
    openstack service show protector
    openstack endpoint list --service protector
    
  2. If missing, register the service and endpoints (see the Examples section).
  3. Verify auth_url in protector.conf matches this site's Keystone:
    [keystone_authtoken]
    auth_url = http://<this-site-controller>:5000
    
  4. Verify the service user has the correct role:
    openstack role assignment list --user protector --project service
    
  5. Restart protector-api after any configuration change.

Issue: DR operations stay in pending state — protector-engine is not processing them

Symptom: Operations created via the API or OSC show status: pending and progress: 0 and do not advance, even after several minutes.

Likely cause: The protector-engine cannot connect to RabbitMQ. Without a working oslo.messaging connection, the engine cannot receive task messages from the API.

Fix:

  1. Check the engine log for oslo.messaging errors:
    grep -i 'oslo.messaging\|amqp\|rabbit' /opt/openstack-protector/bin/logs/engine.log | tail -20
    
  2. Test TCP connectivity to the broker:
    nc -zv <rabbit_host> 5672
    
  3. If the broker is unreachable, restore network connectivity or start RabbitMQ:
    systemctl start rabbitmq-server
    
  4. Verify RabbitMQ credentials in protector.conf are correct.
  5. Restart protector-engine:
    systemctl restart protector-engine
    # or for dev deployments:
    bash dev-launch.sh stop && bash dev-launch.sh start
    
  6. Confirm operations begin processing by tailing the engine log.

Issue: DetachedInstanceError in the engine log, causing operation failures

Symptom: A DR operation fails and the engine log contains a traceback ending in sqlalchemy.orm.exc.DetachedInstanceError.

Likely cause: A SQLAlchemy ORM object loaded in one database session is being accessed after that session has been committed and expired. This is a session hygiene bug in the engine code — the object is "detached" from any active session when an attribute is accessed, triggering a lazy-load that cannot succeed.

Fix:

  1. Identify the engine function named in the traceback.
  2. Ensure the session factory for that function sets expire_on_commit=False. This prevents SQLAlchemy from expiring object attributes on commit, so they remain accessible after the session closes.
  3. Audit the code path for cross-session object references — do not pass ORM objects returned from one session into a different session or into async task boundaries.
  4. Apply the code fix and restart protector-engine.
  5. Re-trigger the failed operation. If it was a failover or failback, check the protection group status first and resolve any intermediate state before retrying.

Note: A service restart alone does not fix this issue. The bug will recur on the same code path until expire_on_commit=False is applied.


Issue: openstack protector commands not found — OSC plugin missing

Symptom: Running any openstack protector ... command returns 'protector' is not an openstack command.

Likely cause: The protectorclient OSC plugin is not installed in the Python environment used by the openstack CLI, or it was installed in a different virtual environment.

Fix:

  1. Reinstall the plugin from source:
    cd /path/to/protectorclient
    pip install -e .
    
  2. Verify the plugin is registered:
    pip list | grep protector
    
  3. Confirm the openstack CLI now recognizes the plugin:
    openstack protector --help
    
  4. If the command is still not found, confirm that openstack and pip both resolve to the same Python environment:
    which openstack
    which pip
    python -c "import sys; print(sys.prefix)"
    
    If they differ, activate the correct virtual environment and repeat the pip install -e . step.
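On Unix systems, another quick way to see which interpreter a console script uses is to read its shebang line. The helper below is hypothetical, not part of any Trilio or OpenStack tooling, and assumes Unix-style entry-point scripts.

```python
# Sketch: report the interpreter behind a console script (e.g. `openstack`)
# by reading its shebang. Unix-specific; the helper name is hypothetical.
import shutil
import sys

def script_interpreter(cmd: str):
    """Return the interpreter path from `cmd`'s shebang, or None."""
    path = shutil.which(cmd)
    if path is None:
        return None
    with open(path, "rb") as f:
        first = f.readline().decode(errors="replace").strip()
    if not first.startswith("#!"):
        return None
    parts = first[2:].split()
    return parts[0] if parts else None

if __name__ == "__main__":
    print("openstack runs under:", script_interpreter("openstack"))
    print("current interpreter: ", sys.executable)
```

If the two paths belong to different environments, repeat pip install -e . using the interpreter that backs openstack.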

Issue: Protection group modifications blocked after a service outage

Symptom: After restarting services on one site (or after recovering a site from an outage), attempts to modify a protection group return an error indicating the remote site is unreachable or metadata is out of sync.

Likely cause: Metadata sync between sites requires both sites to be reachable. If the peer site was unreachable during the outage, the local metadata version may differ from the remote copy. Modifications are intentionally blocked to prevent divergence.

Fix:

  1. Check sync status:
    openstack protector protection-group sync-status <pg-name>
    
  2. Once the peer site is reachable, force a sync:
    openstack protector protection-group sync-force <pg-name>
    
  3. Confirm both sites report the same version before resuming modifications.