Horizon Dashboard Guide
Trilio Site Recovery Horizon plugin workflows for tenant self-service DR management.
The Trilio Site Recovery Horizon plugin extends the OpenStack dashboard with a dedicated Site Recovery panel under the Project menu, giving cloud tenants a self-service interface for managing disaster recovery without leaving their familiar OpenStack UI. This guide walks you through every page and workflow available in the plugin: registering sites, creating and managing Protection Groups, adding VMs, monitoring live DR operations, assessing replication health, and executing failover, failback, and DR drill operations. The plugin coordinates authentication to both your primary and secondary OpenStack sites, keeping metadata in sync across them so that every action you take in the UI is reflected on both sides of your DR topology.
Before you use the Horizon Site Recovery plugin, ensure the following are in place:
- Two independent OpenStack deployments – a primary site and a secondary (DR) site, each with its own Nova, Cinder, Neutron, and Keystone endpoints.
- Trilio Site Recovery services deployed on both sites – protector-api and protector-engine must be running and reachable on each site independently.
- Horizon Site Recovery plugin installed on the Horizon server (see Installation below).
- Cinder volume types configured for replication on both sites – each eligible volume type must carry replication_enabled='<is> True' and a replication_type property set to either async or sync.
- Sites registered in the Protector service (Admin panel → Site Registration) before tenants can create Protection Groups.
- Tenant DR capabilities assigned (Admin panel → Tenant Mapping) so that the Site Recovery panel is visible to the intended projects.
- dr_site_admin role assigned to any user who needs to trigger site-level failover from the Admin panel.
- A user account with at least the Member role on the target project to manage Protection Groups and run DR operations.
Step 1 – Install the plugin package
On the server running Horizon, install the trilio-horizon-dr plugin package. The package ships the Django app, static assets, and dashboard configuration.
pip install trilio-horizon-dr
If your environment uses distribution packages instead of pip, install the corresponding RPM or DEB provided by your Trilio distribution.
Step 2 – Enable the dashboard
Copy the plugin's enabled file into Horizon's enabled directory so that Django discovers it on startup.
cp /usr/lib/python3/dist-packages/trilio_horizon_dr/enabled/_90_dr_site_recovery.py \
/etc/openstack-dashboard/enabled/
The exact source path varies by installation method; check the output of pip show trilio-horizon-dr for the package location.
Step 3 – Configure site endpoints
Add the plugin's settings block to your Horizon local settings file (typically /etc/openstack-dashboard/local_settings.py or a file in local_settings.d/).
# /etc/openstack-dashboard/local_settings.d/_90_trilio_dr.py
TRILIO_DR_PRIMARY_API_URL = "http://site-a-controller:8788"
TRILIO_DR_SECONDARY_API_URL = "http://site-b-controller:8788"
The plugin authenticates to both sites using the Keystone credentials of the currently logged-in user, so no shared service account is required here.
Step 4 – Collect static assets
python /usr/share/openstack-dashboard/manage.py collectstatic --noinput
python /usr/share/openstack-dashboard/manage.py compress --force
Step 5 – Restart Horizon
systemctl restart apache2 # Debian/Ubuntu
# or
systemctl restart httpd # RHEL/CentOS
Step 6 – Verify the panel is visible
Log in to Horizon and confirm that Project → Site Recovery appears in the left navigation. If it does not, review the Horizon error log for import errors:
journalctl -u apache2 --since "5 minutes ago" | grep trilio
The plugin has two layers of configuration: operator-level settings (in local_settings.py) that apply to all users, and per-tenant settings managed through the Admin panel.
Operator settings (local_settings.py)
| Setting | Default | Valid values | Effect |
|---|---|---|---|
| TRILIO_DR_PRIMARY_API_URL | (required) | Any HTTP/HTTPS URL | Base URL of the protector-api service on the primary site. |
| TRILIO_DR_SECONDARY_API_URL | (required) | Any HTTP/HTTPS URL | Base URL of the protector-api service on the secondary site. |
| TRILIO_DR_OPERATIONS_POLL_INTERVAL | 5 | Integer (seconds) | How often the Operations page auto-refreshes live operation progress. Lower values give faster feedback but increase API load. |
| TRILIO_DR_HEALTH_POLL_INTERVAL | 30 | Integer (seconds) | Refresh interval for the Health page replication-lag metrics. |
| TRILIO_DR_ALLOW_FORCE_FAILOVER | True | True / False | When False, the Force checkbox is hidden in the failover dialog, preventing tenants from skipping pre-flight validation. Set to False in environments where RPO guarantees must be strictly enforced. |
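For reference, a local_settings.d override combining the optional settings from the table above might look like the following. This is an illustrative sketch; only the two API URL settings are required, and the values shown for the tuning knobs are examples, not recommendations.

```python
# /etc/openstack-dashboard/local_settings.d/_90_trilio_dr.py
# Required: protector-api endpoints on both sites.
TRILIO_DR_PRIMARY_API_URL = "http://site-a-controller:8788"
TRILIO_DR_SECONDARY_API_URL = "http://site-b-controller:8788"

# Optional: slow polling down on a large deployment to reduce API load.
TRILIO_DR_OPERATIONS_POLL_INTERVAL = 10   # seconds (default: 5)
TRILIO_DR_HEALTH_POLL_INTERVAL = 60       # seconds (default: 30)

# Optional: hide the Force checkbox where RPO guarantees must be strict.
TRILIO_DR_ALLOW_FORCE_FAILOVER = False
```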
Admin panel → Site Registration
Each registered site stores:
- Name – a human-readable label referenced throughout the UI.
- Auth URL – the Keystone v3 endpoint for the site (e.g., http://site-a:5000/v3).
- Region name – optional; used when the site is a region within a shared Keystone.
- Site type – primary or secondary. Note that these designations are workload-relative and swap automatically on failover; the label here is the initial designation.
Site credentials entered during registration are validated immediately by the plugin via POST /v1/admin/sites/{site_id}/validate. If the site is unreachable, registration is blocked.
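The validation call can be sketched as follows. This builds (but does not send) the request; the endpoint path comes from the text above, while the token header and payload field are assumptions about the API's shape, not a documented schema.

```python
import json
import urllib.request

def build_validate_request(base_url, site_id, token):
    """Construct the site-validation POST described above.
    Payload and header names are illustrative assumptions."""
    url = f"{base_url}/v1/admin/sites/{site_id}/validate"
    body = json.dumps({"check_auth": True}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"X-Auth-Token": token, "Content-Type": "application/json"},
    )

req = build_validate_request("http://site-a-controller:8788", "3f8a1b2c", "gAAAA...")
# req.full_url -> "http://site-a-controller:8788/v1/admin/sites/3f8a1b2c/validate"
```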
Admin panel → Tenant Mapping
Tenant Mapping controls which OpenStack projects have access to the Site Recovery panel. Until a project is mapped, tenants in that project see no Site Recovery panel. Each mapping entry specifies:
- Tenant/Project ID – the OpenStack project to enable.
- Allowed replication types – async, sync, or both. Restricting a tenant to async prevents them from creating sync Protection Groups even if the volume type supports it.
Protection Group replication policy
Each Protection Group's replication policy is configured through the Create Protection Group wizard or the policy edit dialog and is stored per-PG. Key fields:
| Field | Effect |
|---|---|
| Replication interval (async only) | Seconds between Pure Storage protection group snapshots. |
| RPO minutes | Maximum acceptable recovery point age. The Health page and pre-failover validation use this value to determine compliance. |
| Pure Storage FlashArray URLs and API tokens | Credentials for the physical arrays backing the Cinder backends on each site. Stored encrypted. |
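The RPO check that the Health page and pre-failover validation apply can be expressed as a simple age comparison. A minimal sketch (function name and signature are illustrative, not the plugin's internals):

```python
from datetime import datetime, timedelta, timezone

def rpo_compliant(last_replicated_at, rpo_minutes, now=None):
    """True when the newest replicated snapshot is no older than the RPO."""
    now = now or datetime.now(timezone.utc)
    return (now - last_replicated_at) <= timedelta(minutes=rpo_minutes)

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
ok = rpo_compliant(now - timedelta(minutes=10), rpo_minutes=15, now=now)    # within RPO
late = rpo_compliant(now - timedelta(minutes=22), rpo_minutes=15, now=now)  # exceeds RPO
```

A snapshot 22 minutes old against a 15-minute RPO is exactly the case the Health page flags ("Snapshot age 22 min exceeds RPO 15 min").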
Navigating the Site Recovery panel
After logging in to Horizon, expand Project in the left sidebar and select Site Recovery. The panel contains the following pages, accessible via the sub-navigation tabs:
| Tab | Purpose |
|---|---|
| Protection Groups | Create, view, and act on Protection Groups. |
| VM Members | Add or remove VMs from a selected Protection Group. |
| Operations | Monitor active and historical DR operations. |
| Health | Check site connectivity, replication lag, and per-PG failover readiness. |
| Admin → Site Registration (admin only) | Register primary and secondary sites. |
| Admin → Tenant Mapping (admin only) | Grant DR capabilities to specific projects. |
Protection Groups page
The Protection Groups list shows every PG owned by your project, with a colour-coded status badge:
| Badge colour | Status meaning |
|---|---|
| Green – Active | PG is healthy and replicating normally. |
| Blue – Failed Over | Workloads are currently running on the secondary site. |
| Amber – Failing Over / Failing Back | A DR operation is in progress. |
| Red – Error | A DR operation did not complete successfully. |
| Grey – Deleting | PG deletion is in progress. |
Creating a Protection Group
- Click Create Protection Group to open the wizard.
- Step 1 – Details: Enter a name, optional description, select primary and secondary sites, choose replication type (async or sync), and select a replication-enabled volume type. Only volume types with replication_enabled='<is> True' and a matching replication_type appear in the dropdown.
- Step 2 – Replication Policy: Provide Pure Storage FlashArray URLs and API tokens for both sites, set the replication interval (async only), and define the RPO in minutes.
- Step 3 – Review: Confirm all settings. Click Create to submit. The wizard calls POST /v1/{tenant_id}/protection-groups, which automatically creates the backing Cinder Consistency Group.
Creating a Protection Group also creates a 1:1 Cinder Consistency Group and a corresponding Pure Storage Protection Group (or Pod for sync replication). No separate Cinder CG management is needed.
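The request body the wizard submits might be assembled along these lines. The field names here are assumptions inferred from the wizard steps, not a documented request schema; the point is that the replication interval applies only to async PGs.

```python
def build_pg_payload(name, primary, secondary, replication_type,
                     volume_type, interval_s, rpo_min):
    """Sketch of a POST /v1/{tenant_id}/protection-groups body.
    Key names are illustrative assumptions."""
    payload = {
        "name": name,
        "primary_site": primary,
        "secondary_site": secondary,
        "replication_type": replication_type,  # "async" or "sync"
        "volume_type": volume_type,
        "rpo_minutes": rpo_min,
    }
    if replication_type == "async":
        payload["replication_interval"] = interval_s  # async only
    return payload

body = build_pg_payload("prod-web-app", "site-a", "site-b",
                        "async", "replicated-ssd", 300, 15)
```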
Acting on a Protection Group
Click the Actions dropdown (⋮) next to any PG row to access:
- Test Failover (DR Drill) – Boots workloads on the secondary site from replicated snapshots without disrupting the primary. Use this for scheduled DR exercises. Prompts for a recovery point selection (see below).
- Planned Failover – Graceful failover: primary VMs are shut down, a final snapshot is created and replicated, then VMs are recreated on the secondary site.
- Unplanned Failover – Emergency failover when the primary is unreachable. Boots from the most recent available replicated snapshot.
- Failback – Returns workloads to the original primary site after a failover.
- Delete – Removes the PG, its Consistency Group, and all associated metadata. Blocked if a DR operation is in progress.
Modifications to a Protection Group, including triggering actions, are blocked when the peer site is unreachable, to prevent metadata divergence between sites.
Failover dialog – Recovery Point selection
All failover and test-failover dialogs include a Recovery Point dropdown:
- Latest (create new snapshot) – default for planned failovers; creates and replicates a final snapshot before switching.
- Named recovery points – listed with age and replication status. Selecting a specific point enables point-in-time recovery, useful after ransomware, accidental deletion, or data corruption.
For sync replication, the recovery point dropdown is not shown because sync replication does not maintain discrete snapshots.
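The selection logic can be sketched as a filter over the available points: take the newest fully replicated snapshot, optionally constrained to be older than some cutoff (as in the corruption scenario in Example 3). The tuple shape and helper name here are illustrative.

```python
def pick_recovery_point(points, min_age_minutes=None):
    """Return the newest replicated point, optionally at least
    `min_age_minutes` old (e.g. to predate a bad deployment).
    `points` is a list of (name, age_minutes, replicated) tuples."""
    candidates = [
        (name, age) for name, age, replicated in points
        if replicated and (min_age_minutes is None or age >= min_age_minutes)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p[1])[0]  # smallest age = newest

points = [
    ("prod-web-app.30", 135, True),   # 2h 15m ago
    ("prod-web-app.29", 375, True),   # 6h 15m ago
    ("prod-web-app.28", 1560, True),  # 1d 2h ago
]
latest = pick_recovery_point(points)                          # newest overall
pre_deploy = pick_recovery_point(points, min_age_minutes=180) # predates a 3h-old deploy
```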
VM Members page
Select a Protection Group from the list and navigate to VM Members to manage which Nova instances are protected.
- The table shows each member VM's name, flavor, volume count, and per-VM replication status (Protected, Unprotected, Error, or Failed Over).
- Click Add VM to attach an instance. Only VMs whose Cinder volumes belong to the PG's configured replication-enabled volume type are eligible. Adding a VM automatically adds its volumes to the backing Consistency Group.
- Click Remove on any row to detach a VM. Its volumes are removed from the Consistency Group.
Operations page
The Operations page shows a live-updating table of all DR operations for your project – both active and historical.
- Progress bars display completion percentage for running operations.
- Click any operation row to expand step-level detail: each phase (validation, quiesce, storage failover, instance recreation, finalization) is listed with its status and timestamp.
- The table auto-refreshes every TRILIO_DR_OPERATIONS_POLL_INTERVAL seconds (default: 5). You can pause auto-refresh with the toggle in the table header.
- Operation types shown: failover, failback, test_failover, test_cleanup, sync_volumes.
- Terminal statuses: completed (green), failed (red), rolling_back (amber).
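The same poll-until-terminal pattern is useful when scripting against the API instead of watching the page. A minimal sketch with an injectable fetch function (the status names come from the list above; everything else is illustrative):

```python
import time

def watch_operation(fetch_status, poll_interval=5, sleep=time.sleep, max_polls=1000):
    """Poll an operation until it reaches a terminal status,
    mirroring the Operations page auto-refresh."""
    terminal = {"completed", "failed", "rolling_back"}
    for _ in range(max_polls):
        status = fetch_status()
        if status in terminal:
            return status
        sleep(poll_interval)
    raise TimeoutError("operation did not reach a terminal status")

# Usage with a canned status sequence instead of a live API:
statuses = iter(["running", "running", "completed"])
result = watch_operation(lambda: next(statuses), poll_interval=0, sleep=lambda s: None)
```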
Health page
The Health page gives you a real-time snapshot of your DR readiness:
- Site connectivity status – shows whether both the primary and secondary sites are reachable by the plugin. A red indicator here means PG modifications will be blocked until connectivity is restored.
- Replication lag metrics – for each PG, displays the age of the latest replicated snapshot against the configured RPO. A lag exceeding the RPO is flagged in amber or red.
- Failover readiness indicators – a per-PG summary of all pre-flight checks (replication validation, write-order consistency for async, pod sync state for sync replication). A green checkmark means the PG is ready to fail over without --force.
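The readiness verdict is essentially a conjunction of the per-type checks listed above. A sketch of that aggregation (the dict keys and function are illustrative, not the plugin's data model):

```python
def failover_ready(pg):
    """Combine the pre-flight checks into (ready, failing_check_names).
    `pg` is a plain dict with illustrative key names."""
    if pg["replication_type"] == "async":
        checks = [
            ("snapshot exists", pg.get("latest_snapshot") is not None),
            ("write-order consistency", pg.get("write_order_ok", False)),
            ("within RPO", pg.get("snapshot_age_min", float("inf")) <= pg["rpo_minutes"]),
        ]
    else:  # sync replication
        checks = [
            ("pod state sync", pg.get("pod_state") == "sync"),
            ("zero lag", pg.get("lag_seconds", 1) == 0),
        ]
    failing = [name for name, ok in checks if not ok]
    return (not failing, failing)

ready, _ = failover_ready({"replication_type": "async", "latest_snapshot": "pg.30",
                           "write_order_ok": True, "snapshot_age_min": 10,
                           "rpo_minutes": 15})
```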
Admin → Site Registration
(Requires admin role)
- Navigate to Admin → Site Recovery → Site Registration.
- Click Register Site and fill in the site name, Keystone auth URL, optional region, and initial site type.
- The plugin validates connectivity immediately. Fix any auth URL or credential issues before saving.
- To update a site (e.g., after a Keystone URL change), click Edit on the site row.
- To deregister a site, click Delete. This is blocked if any Protection Groups reference the site.
Admin → Tenant Mapping
(Requires admin role)
- Navigate to Admin → Site Recovery → Tenant Mapping.
- Click Add Mapping, select a project from the dropdown, and choose which replication types (async, sync) the project is permitted to use.
- To revoke DR access from a project, click Remove on its mapping row.
Example 1 – Create a Protection Group and add a VM
Scenario: You want to protect a web application VM (web-server-1) using async replication between Site A and Site B.
- Navigate to Project → Site Recovery → Protection Groups and click Create Protection Group.
- Fill in the wizard:
Name: prod-web-app
Description: Production web application
Primary site: site-a
Secondary site: site-b
Replication type: async
Volume type: replicated-ssd – must have replication_enabled='<is> True'
Replication interval: 300 seconds
RPO: 15 minutes
Primary FA URL: https://array-a.example.com
Primary FA API token: <redacted>
Secondary FA URL: https://array-b.example.com
Secondary FA API token: <redacted>
- Click Create. The Protection Groups list now shows prod-web-app with status badge Active.
- Open the Actions menu for prod-web-app → VM Members (or navigate to the VM Members tab with prod-web-app selected).
- Click Add VM, select web-server-1 from the instance list, and confirm. The VM's volumes are automatically enrolled in the backing Consistency Group.
Expected result – VM Members table:
+---------------+----------+---------+---------------------+
| Instance Name | Flavor | Volumes | Replication Status |
+---------------+----------+---------+---------------------+
| web-server-1 | m1.large | 2 | Protected |
+---------------+----------+---------+---------------------+
Example 2 – Run a DR drill (test failover)
Scenario: You want to validate that prod-web-app can recover on Site B without affecting production.
- On the Protection Groups page, open the Actions menu for prod-web-app and select Test Failover.
- In the dialog:
- Failover Type: Test
- Recovery Point: Latest (create new snapshot) (default)
- Leave Force unchecked to run pre-flight validation.
- Click Initiate Failover.
- Navigate to the Operations tab. You will see a new row with type test_failover and a progress bar.
Expected result – Operations table (during drill):
+--------------------------------------+--------------+---------+----------+
| Operation ID | Type | PG | Progress |
+--------------------------------------+--------------+---------+----------+
| 3f8a1b2c-... | test_failover| prod-.. | 42% |
+--------------------------------------+--------------+---------+----------+
- Click the row to expand step-level detail:
Step 1: Replication validation – Completed
Step 2: Snapshot (latest) – Completed
Step 3: Storage failover (test) – Running
Step 4: Instance recreation on Site B – Pending
Step 5: Finalization / cleanup – Pending
- Once the operation shows Completed, verify workloads are running on Site B and then clean up the test environment via Actions → Test Cleanup.
Example 3 – Point-in-time failover after suspected data corruption
Scenario: A bad application deployment has corrupted data. You want to recover prod-web-app to the snapshot taken before the deployment.
- On the Protection Groups page, open Actions → Planned Failover for prod-web-app.
- In the failover dialog, open the Recovery Point dropdown. You will see entries similar to:
• Latest (create new snapshot)
• prod-web-app.30 – 2h 15m ago (Replicated)
• prod-web-app.29 – 6h 15m ago (Replicated) ← before the bad deploy
• prod-web-app.28 – 1d 2h ago (Replicated)
- Select prod-web-app.29 (the snapshot predating the corruption).
- Confirm Failover Type: Planned.
- Click Initiate Failover.
The primary VMs are shut down to prevent further writes, then volumes are restored from snapshot prod-web-app.29 on the secondary array, and VMs are recreated on Site B from that point in time.
Expected result – Operations detail:
Step 1: Replication validation – Completed
Step 2: Resolve recovery point .29 – Completed
Step 3: Shutdown primary VMs – Completed
Step 4: Storage failover (PiT) – Completed (recovery_point_used: prod-web-app.29)
Step 5: Instance recreation on Site B – Completed
Step 6: Finalization – Completed
Example 4 – Site-level failover from the Admin panel
Scenario: An entire datacenter (Site A) is being taken offline for emergency maintenance. You need to fail over all Protection Groups simultaneously.
- Navigate to Admin → Disaster Recovery.
- In the DR Sites table, locate site-a and click Failover Site (red button with warning icon).
- In the modal:
  - Confirm the site name: site-a
  - Select Failover Type: Planned
  - Check the confirmation checkbox.
- Click Execute Failover.
Expected result – Site Operations table:
+-------------------+----------------+---------+-------+----------+
| Operation ID | Type | Site | PGs | Progress |
+-------------------+----------------+---------+-------+----------+
| a1b2c3d4-... | site_failover | site-a | 5 | 0% |
+-------------------+----------------+---------+-------+----------+
- Click View Details to monitor per-PG progress. Failovers run in parallel (up to 10 concurrent). A final status of partial_success means some PGs succeeded and at least one failed – check the error summary and retry failed PGs individually.
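The fan-out and the partial_success verdict can be sketched with a thread pool, matching the 10-worker concurrency described above. The function names and the rule that a failed per-PG failover raises an exception are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def site_failover(pg_ids, failover_one, max_workers=10):
    """Run per-PG failovers in parallel and summarize the outcome.
    `failover_one` is assumed to raise on failure (illustrative)."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(failover_one, pg): pg for pg in pg_ids}
        for fut, pg in futures.items():
            try:
                fut.result()
                results[pg] = "completed"
            except Exception as exc:
                results[pg] = f"failed: {exc}"
    outcomes = {v.split(":")[0] for v in results.values()}
    if outcomes == {"completed"}:
        overall = "completed"
    elif "completed" in outcomes:
        overall = "partial_success"  # some succeeded, at least one failed
    else:
        overall = "failed"
    return overall, results

def fake_failover(pg):  # stand-in for the real per-PG operation
    if pg == "pg-3":
        raise RuntimeError("replication not healthy")

overall, per_pg = site_failover(["pg-1", "pg-2", "pg-3"], fake_failover)
```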
Example 5 – Checking replication health before a planned maintenance failover
- Navigate to Project → Site Recovery → Health.
- Review the Site Connectivity row – both sites must show green.
- Review the per-PG Failover Readiness column. A green checkmark confirms:
- For async PGs: latest snapshot exists, write-order consistency verified, snapshot age is within RPO.
- For sync PGs: pod replication state is sync, all volumes are in-sync with zero lag.
- For any amber/red PG, hover over the indicator for the specific failing check (e.g., "Snapshot age 22 min exceeds RPO 15 min"). Resolve before proceeding with failover.
Protection Group actions are greyed out / blocked
Symptom: The Actions dropdown for a Protection Group shows items as disabled, or clicking them produces a banner: "Cannot modify Protection Group: peer site is unreachable."
Likely cause: The Horizon plugin cannot reach the protector-api on the peer site. Because metadata must remain strictly in sync across both sites, all modifications (including triggering DR actions) are blocked when the peer is unreachable.
Fix:
- Navigate to Health and check the Site Connectivity panel to confirm which site is unreachable.
- Verify that the protector-api service is running on the unreachable site: systemctl status openstack-protector-api.
- Check network connectivity from the Horizon server to the unreachable site's API endpoint.
- If this is a genuine disaster (the primary site is down), use the CLI (openstack dr failover --type unplanned --force) from the secondary site, which can bypass the connectivity check with --force.
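The blocking rule and its force escape hatch amount to a simple guard. A sketch of that decision logic (function name and return values are illustrative, not the plugin's code):

```python
def guard_pg_modification(primary_reachable, secondary_reachable, force=False):
    """Mirror the rule above: block PG modifications unless both sites
    are reachable, with a --force escape hatch for genuine disasters."""
    if primary_reachable and secondary_reachable:
        return "allowed"
    if force:
        # Metadata may diverge until the sites resync after recovery.
        return "allowed (forced)"
    return "blocked: peer site is unreachable"

guard_pg_modification(True, True)               # normal operation
guard_pg_modification(True, False)              # peer down -> blocked
guard_pg_modification(True, False, force=True)  # disaster path via CLI
```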
The Site Recovery panel does not appear after installation
Symptom: After installing the plugin and restarting Apache, the Project → Site Recovery menu entry is absent.
Likely cause: The enabled file was not placed in the correct directory, or the static asset collection step was skipped.
Fix:
- Confirm the enabled file exists: ls /etc/openstack-dashboard/enabled/ | grep dr.
- Re-run static collection and compression:
  python /usr/share/openstack-dashboard/manage.py collectstatic --noinput
  python /usr/share/openstack-dashboard/manage.py compress --force
  systemctl restart apache2
- Check the Horizon error log for import errors: journalctl -u apache2 | grep trilio.
- Ensure the project is mapped in Admin → Tenant Mapping – unmapped projects do not see the panel.
Volume type not appearing in the Create Protection Group wizard
Symptom: The Volume Type dropdown in the Create Protection Group wizard is empty or missing the expected type.
Likely cause: The Cinder volume type does not have replication_enabled='<is> True' set, or the replication_type property is missing or mismatched.
Fix:
- As a Cinder admin, inspect the volume type: openstack volume type show replicated-ssd
- Ensure both extra specs are present:
  openstack volume type set replicated-ssd \
    --property replication_enabled='<is> True' \
    --property replication_type='<in> async'
- Reload the Create Protection Group wizard – the type should now appear.
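The wizard's eligibility filter boils down to checking those two extra specs. A sketch of that filtering logic (the list-of-tuples input shape is an illustrative stand-in for the real Cinder API response):

```python
def eligible_volume_types(volume_types, pg_replication_type):
    """Return names of volume types the wizard would show.
    `volume_types` is a list of (name, extra_specs dict) pairs."""
    eligible = []
    for name, specs in volume_types:
        if specs.get("replication_enabled") != "<is> True":
            continue
        # replication_type is stored like "<in> async" / "<in> sync";
        # compare the last token to avoid "sync" matching inside "async".
        tokens = specs.get("replication_type", "").split()
        if not tokens or tokens[-1] != pg_replication_type:
            continue
        eligible.append(name)
    return eligible

types = [
    ("replicated-ssd", {"replication_enabled": "<is> True",
                        "replication_type": "<in> async"}),
    ("standard-hdd", {}),  # no replication extra specs -> hidden
]
shown = eligible_volume_types(types, "async")
```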
Failover readiness shows amber/red on the Health page
Symptom: The Health page shows one or more Protection Groups with a non-green failover readiness indicator.
Likely cause and fix (check in order):
| Indicator message | Cause | Fix |
|---|---|---|
| "Snapshot age X min exceeds RPO Y min" | Replication has stalled or fallen behind. | Check Pure Storage replication status and network connectivity between arrays. Verify the replication schedule on the FlashArray matches the configured interval. |
| "Write order consistency violation" | A volume's snapshot has a different suffix from the rest of the PG snapshot. | Wait for the next replication cycle. If the problem persists, check Cinder backend logs for volume attachment issues that may be preventing consistent snapshots. |
| "Pod replication state is not sync" (sync PGs) | ActiveCluster pod is in a degraded or paused state. | Check the Pure Storage Purity console for pod health. Resolve any inter-array connectivity issues. |
| "Protection group not found on secondary array" | The Pure Storage PG was deleted or renamed on the secondary array outside of Trilio. | Re-register the replication policy in the PG's policy edit dialog to recreate the Pure Storage PG. |
Site-level failover results in partial_success
Symptom: After triggering a site failover from the Admin panel, the Site Operations table shows status partial_success with a non-zero failed count.
Likely cause: One or more Protection Groups had pre-flight validation failures, resource mapping gaps, or insufficient resources on the DR site.
Fix:
- Click View Details on the site operation to expand per-PG results.
- Note the PG IDs marked as failed and read the error summary.
- Navigate to Project → Site Recovery → Operations and find the child operation for each failed PG to read step-level error detail.
- Common per-PG failure causes and fixes:
- "Replication not healthy" ā fix replication health first (see above), then retry: Actions ā Planned Failover on the individual PG.
- "Resource mapping missing" ā open the PG detail and add the missing network or flavor mapping before retrying.
- "Volume attachment timeout" ā check Nova and Cinder logs on the secondary site for hypervisor-level issues; retry when resolved.
Operations page shows a DR operation stuck in running
Symptom: An operation has been in running state for an unexpectedly long time with no progress bar movement.
Likely cause: The protector-engine on the executing site may have encountered an unhandled exception, or a long-running storage or Nova operation is blocking a step.
Fix:
- Expand the operation row to identify which step last completed.
- On the relevant site's controller, check the engine log: tail -f /var/log/protector/engine.log
- If the engine process has crashed, restart it: systemctl restart openstack-protector-engine. The engine will detect the interrupted operation and attempt to roll back or resume depending on the step.
- For site-level failovers, note that the ThreadPoolExecutor runs up to 10 concurrent PG failovers – large sites with many PGs may take longer than expected before all child operations complete.