Site Recovery for OpenStack
Concept

Key Concepts

Definitions of terms a new user must understand before working with the product.


Overview

This page defines the core concepts you need to understand before configuring or operating Trilio Site Recovery for OpenStack. Each term maps directly to a resource, behavior, or architectural constraint you will encounter when setting up protection, executing failovers, or troubleshooting replication. Reading this page first will make the task-oriented guides significantly easier to follow.


Content

Sites

A Site is a registered OpenStack deployment — an independent cloud with its own Keystone, Nova, Cinder, and Neutron endpoints. Trilio Site Recovery requires exactly two sites: a primary site where your workloads normally run, and a secondary (DR) site to which they fail over. Sites can be separate physical datacenters, separate OpenStack clusters, or different regions within the same cluster.

Site designations are workload-relative and dynamic. When a failover completes, the secondary site becomes the new primary for that workload. Both sites are treated symmetrically by the architecture — neither is permanently "the DR site." The current_primary_site_id field on a Protection Group tracks which site is currently active for that workload.

Trilio Site Recovery services (protector-api and protector-engine) run independently on each site and do not communicate directly with each other. The OSC CLI plugin (protectorclient) and Horizon dashboard act as the coordination layer, authenticating to both sites and orchestrating metadata synchronization between them. Because of this design, modifications to a Protection Group are blocked if the peer site is unreachable — this prevents the two sites from diverging into inconsistent states.
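The peer-reachability rule can be sketched as a preflight check. This is a hedged illustration only: real code would probe each site's Keystone endpoint (for example with curl against the identity URL), whereas here reachability is stubbed so the control flow runs anywhere, and both URLs are placeholders.

```shell
# Hedged sketch of the peer-reachability rule described above.
# site_reachable is stubbed; a real check might curl "$url/v3".
site_reachable() {
  # stub: treat any non-empty URL as reachable
  [ -n "$1" ]
}
PRIMARY_KEYSTONE="https://site-a.example.com:5000"    # illustrative
SECONDARY_KEYSTONE="https://site-b.example.com:5000"  # illustrative

if site_reachable "$PRIMARY_KEYSTONE" && site_reachable "$SECONDARY_KEYSTONE"; then
  decision="allowed"
else
  decision="blocked"   # peer unreachable: sites could diverge
fi
echo "protection group modification: $decision"
```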


Protection Groups

A Protection Group is a named, tenant-owned collection of Nova VMs that fail over together as a unit. All members of a Protection Group are treated as a single recoverable application: when you trigger a failover, every VM in the group moves to the secondary site in a coordinated sequence.

Each Protection Group carries a replication_type (either sync or async), references a primary site and a secondary site, and maintains a status that reflects its current DR state (for example, active, failing_over, failed_over, or failing_back). You interact with Protection Groups for every major DR workflow — creating them, adding VMs, configuring policies, and triggering operations.

Every Protection Group has a strict 1:1 relationship with a Cinder Consistency Group and a Pure Storage Protection Group (or Pod, for synchronous replication). These lower-level constructs are created and managed automatically when you create or modify a Protection Group — you do not manage them directly.


Consistency Groups

A Consistency Group is a Cinder construct that groups volumes together so that snapshots are taken across all of them simultaneously. This crash-consistent snapshot boundary is what makes it safe to recover a multi-volume, multi-VM application to a known-good point in time.

Every volume in a Consistency Group must share the same Cinder volume type, that volume type must have replication_enabled='<is> True' set, and all volumes must reside on the same storage backend. When you add a VM to a Protection Group, Trilio Site Recovery automatically discovers the VM's attached volumes and adds them to the associated Consistency Group — you do not add volumes manually.

Consistency Groups are maintained on both sites: a primary Cinder CG on the current primary site, and a secondary Cinder CG on the secondary site. These map directly to Pure Storage Protection Groups at the array layer.


Volume Types and Replication Eligibility

Not every Cinder volume type is eligible for DR protection. To participate in a Consistency Group, a volume type must have both of the following properties set:

  • replication_enabled='<is> True'
  • replication_type='<in> async' or replication_type='<in> sync'

The same volume type name and properties must exist on both sites. If a VM's volumes do not use an eligible volume type, that VM cannot be added to a Protection Group. Preparing the correct volume types on both sites is a prerequisite for all protection workflows.
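The eligibility rule can be checked from the properties string that `openstack volume type show -f value -c properties` returns. The sample value below is illustrative; in practice you would run the real command against each site and feed its output into the same check.

```shell
# Sketch: eligibility check on a volume type's properties string.
# Sample value stands in for real `openstack volume type show` output.
props="replication_enabled='<is> True', replication_type='<in> async'"

if printf '%s' "$props" | grep -q "replication_enabled='<is> True'" &&
   printf '%s' "$props" | grep -Eq "replication_type='<in> (async|sync)'"; then
  eligible="yes"
else
  eligible="no"
fi
echo "volume type eligible: $eligible"
```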


Replication Policy

A Replication Policy is the configuration object that links a Protection Group to the underlying Pure Storage FlashArrays and defines the RPO target. It holds the management URLs and API tokens for both the primary and secondary FlashArrays, the name of the corresponding Pure Storage Protection Group, the replication interval (for async), and the RPO in minutes.

Each Protection Group has at most one Replication Policy. The policy is what gives the Protector engine the credentials and parameters it needs to interact with the arrays during failover and failback operations.
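Rendered as JSON, a Replication Policy might carry fields like the following. This is a hypothetical illustration of the fields listed above, not the product's exact schema; field names, URLs, and values are all placeholders.

```shell
# Hypothetical Replication Policy payload (illustrative only;
# the actual schema may differ).
policy='{
  "primary_array_url": "https://flasharray-a.example.com",
  "primary_array_api_token": "<redacted>",
  "secondary_array_url": "https://flasharray-b.example.com",
  "secondary_array_api_token": "<redacted>",
  "pure_protection_group": "prod-web-app-pg",
  "replication_interval_seconds": 300,
  "rpo_minutes": 15
}'
echo "$policy"
```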


Replication Modes

Synchronous Replication

Synchronous replication uses Pure Storage ActiveCluster Pods. Every write to the primary array is acknowledged only after it has been committed on the secondary array, giving an effective RPO of zero — no data is lost even in a sudden failure. This mode requires the two FlashArrays to be connected with sub-50 ms round-trip latency, which typically limits it to metropolitan-distance deployments.

Asynchronous Replication

Asynchronous replication uses Pure Storage snapshot-based Protection Groups. Snapshots are taken at a configurable interval and replicated to the secondary array. The RPO is non-zero and is set by the replication_interval in the Replication Policy. Because writes are not synchronously acknowledged across sites, this mode imposes no latency requirement and supports unlimited distances between sites, at the cost of a potential data loss window equal to the replication interval.
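The data-loss window follows directly from the interval: writes made since the last successfully replicated snapshot can be lost, so the worst-case RPO equals the interval. A minimal sketch, using an illustrative 300-second interval and ignoring replication transfer time:

```shell
# Sketch: worst-case async RPO is bounded by the snapshot interval.
replication_interval=300   # seconds; illustrative value
worst_case_rpo_min=$(( replication_interval / 60 ))
echo "worst-case RPO: ${worst_case_rpo_min} minutes"
```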


DR Operations

A DR Operation is an auditable, trackable record created every time you initiate a failover, failback, test failover, test cleanup, or manual sync. Each operation records its type, current status (pending, running, completed, failed, or rolling_back), source and target sites, progress as a percentage from 0–100, completed and failed steps, and any error messages.

Operations are the primary way to monitor long-running DR workflows. Because failovers involve multiple coordinated phases across two sites — storage promotion, instance recreation, and status updates — having a persistent operation record lets you inspect exactly where a workflow is, whether it succeeded, and what went wrong if it failed.
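Monitoring typically means polling the operation until it reaches a terminal state. In this hedged sketch, get_status is stubbed so the loop runs as-is; in practice it would wrap `openstack protector operation show <id> -f value -c status`.

```shell
# Sketch: poll a DR Operation until it reaches a terminal state.
# get_status is a stub that "completes" on the third poll.
poll_count=0
get_status() {
  poll_count=$(( poll_count + 1 ))
  if [ "$poll_count" -ge 3 ]; then status="completed"; else status="running"; fi
}
while :; do
  get_status
  case "$status" in
    completed|failed) break ;;
  esac
  # sleep 10   # a real poller would pause between checks
done
echo "operation reached terminal status: $status"
```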


Failover

Failover is the process of cutting workloads over from the current primary site to the secondary site. Failover can be:

  • Planned — initiated deliberately, for example before scheduled maintenance. The primary workload is quiesced before cutover, minimizing data loss.
  • Unplanned — initiated in response to an outage on the primary site. Trilio Site Recovery recovers workloads from the most recent replicated snapshot available on the secondary array.

After a failover completes, the Protection Group's current_primary_site_id is updated to the secondary site, and the status transitions to failed_over. The original primary site is now logically the secondary.
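The designation swap can be illustrated in a few lines. This is a pure illustration of the bookkeeping described above, using the site names from this page, not the product's internal logic.

```shell
# Illustration: site designations recorded on a Protection Group
# swap when a failover completes.
current_primary_site="site-a"
secondary_site="site-b"
pg_status="active"

# ... failover to the secondary completes ...
tmp=$current_primary_site
current_primary_site=$secondary_site
secondary_site=$tmp
pg_status="failed_over"

echo "current primary: $current_primary_site (status: $pg_status)"
```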


Failback

Failback is the return of workloads to the original primary site after a failover. Failback follows a similar process to failover — data written since the failover is replicated back, volumes are restored on the original primary, and instances are recreated there. After failback completes, the Protection Group's current_primary_site_id returns to the original primary site.


Test Failover

A Test Failover (also called a DR drill) is a non-disruptive validation of your DR readiness. It spins up snapshot-backed VMs on the secondary site using the most recent replicated snapshot, without stopping or modifying the production workload on the primary site. You can verify application behavior, validate network mappings and flavor mappings, and confirm RTO estimates — all without affecting live traffic.

After the drill is complete, a test cleanup operation tears down the test instances and volumes on the secondary site, leaving the Protection Group in its pre-drill state.


Storage Driver and Mock Mode

The StorageDriver is a pluggable interface inside protector-engine that decouples DR orchestration logic from the specifics of any storage implementation. The production driver targets Pure Storage FlashArrays. This abstraction makes it possible to swap in alternative drivers without changing the orchestration layer.

Mock Mode is a StorageDriver implementation backed by SQLite rather than physical arrays. It simulates Pure FlashArray behavior — Protection Group creation, snapshot scheduling, replication, and volume promotion — entirely in software. Mock Mode enables full end-to-end DR workflow testing, CI/CD pipeline integration, and development work without access to FlashArray hardware.


RPO and RTO

RPO (Recovery Point Objective) is the maximum amount of data loss, expressed as time, that your organization can tolerate. For synchronous replication, RPO is zero. For asynchronous replication, RPO is bounded by the replication_interval configured in the Replication Policy — data written after the last successful snapshot replication may be lost in an unplanned failover.

RTO (Recovery Time Objective) is the maximum amount of downtime your organization can tolerate after a failure. RTO is not a configuration value — it is an outcome determined by the number of VMs in a Protection Group, the size and number of volumes, the speed of the secondary site's compute and storage, and the complexity of your network and flavor mappings. Test Failover drills are the primary tool for measuring and validating your actual RTO.
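Since RTO is measured rather than configured, a drill yields it as simple arithmetic over two timestamps: when the drill started and when the recovered application first answered health checks. Both epoch values below are illustrative.

```shell
# Sketch: measured RTO from a Test Failover drill (illustrative
# epoch timestamps, 540 seconds apart).
drill_started=1710512521    # drill initiated
app_responding=1710513061   # recovered app first passed health checks
measured_rto_min=$(( (app_responding - drill_started) / 60 ))
echo "measured RTO: ${measured_rto_min} minutes"
```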


Key terms

Site — A registered OpenStack deployment with independent Keystone, Nova, Cinder, and Neutron endpoints. One site acts as primary, the other as secondary (DR). Designations are workload-relative and swap on failover.

Protection Group — A tenant-owned, named collection of Nova VMs that fail over together as a unit. Has a 1:1 relationship with a Cinder Consistency Group and a Pure Storage Protection Group or Pod.

Consistency Group — A Cinder construct that groups volumes for crash-consistent, simultaneous snapshots. Automatically created and managed when you create or modify a Protection Group. All volumes must share the same replication-enabled volume type.

Replication Policy — Configuration attached to a Protection Group that supplies Pure Storage FlashArray connection details (URLs, API tokens), the Pure Protection Group name, replication interval, and RPO target.

DR Operation — A persistent, auditable record of any failover, failback, test failover, test cleanup, or sync operation. Tracks status, progress percentage, completed and failed steps, and error messages.

Synchronous Replication — ActiveCluster Pod-based replication with RPO = 0. Requires sub-50 ms round-trip latency between arrays.

Asynchronous Replication — Snapshot-based replication with a configurable RPO. Supports unlimited distance between sites.

Failover — The cutover of workloads from the current primary site to the secondary site. Can be planned (graceful) or unplanned (emergency).

Failback — The return of workloads to the original primary site after a failover has occurred.

Test Failover — A non-disruptive DR drill that starts snapshot-backed VMs on the secondary site without interrupting the live primary workload. Used to validate RTO, network mappings, and application recovery.

StorageDriver — A pluggable interface inside protector-engine that abstracts storage operations, enabling different storage backends to be used without changing orchestration logic.

Mock Mode — A StorageDriver implementation backed by SQLite that simulates Pure FlashArray behavior for development, testing, and CI/CD pipelines without physical arrays.

RPO (Recovery Point Objective) — The maximum tolerable data loss expressed as a time window. Determined by replication mode (zero for sync, interval-bounded for async).

RTO (Recovery Time Objective) — The maximum tolerable downtime after a failure. Measured empirically through Test Failover drills, not configured as a system parameter.

replication_enabled — A required Cinder volume type property (replication_enabled='<is> True') that marks a volume type as eligible for DR protection.

replication_type — A required Cinder volume type property specifying the replication mode: '<in> async' or '<in> sync'. Must match the Protection Group's configured replication type.


Examples

Check whether a volume type is eligible for protection

Before creating a Protection Group, confirm that your volume type has the required properties on both sites.

openstack volume type show replicated-ssd

Expected output (abbreviated):

+--------------------+---------------------------------------------+
| Field              | Value                                       |
+--------------------+---------------------------------------------+
| name               | replicated-ssd                              |
| properties         | replication_enabled='<is> True',            |
|                    | replication_type='<in> async',              |
|                    | volume_backend_name=pure@backend-a          |
+--------------------+---------------------------------------------+

If either replication_enabled or replication_type is absent, volumes of this type cannot be added to a Consistency Group and VMs using them cannot be protected.


Inspect a Protection Group to understand its current DR state

After creating a Protection Group, you can see its 1:1 Consistency Group link, current primary site, and status at any time.

openstack protector protection-group show prod-web-app

Expected output (abbreviated):

+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| id                         | pg-12345678-1234-1234-1234-123456ab  |
| name                       | prod-web-app                         |
| status                     | active                               |
| replication_type           | async                                |
| primary_site               | site-a                               |
| secondary_site             | site-b                               |
| current_primary_site       | site-a                               |
| consistency_group_id       | cg-87654321-4321-4321-4321-87654321  |
| failover_count             | 0                                    |
| last_failover_at           | None                                 |
+----------------------------+--------------------------------------+

After a failover, current_primary_site will change to site-b and status will show failed_over, illustrating how site designations are dynamic.


Monitor a DR Operation in progress

Every failover, failback, or test failover creates a DR Operation record you can poll for progress.

openstack protector operation show op-456abc

Expected output during an in-progress failover:

+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| id                | op-456abc                            |
| operation_type    | failover                             |
| status            | running                              |
| progress          | 45                                   |
| source_site       | site-a                               |
| target_site       | site-b                               |
| started_at        | 2024-03-15T14:22:01Z                 |
| completed_at      | None                                 |
| steps_completed   | ["validate_target", "storage_promo"] |
| steps_failed      | []                                   |
| error_message     | None                                 |
+-------------------+--------------------------------------+

When progress reaches 100 and status becomes completed, the failover is finished. If status shows failed, inspect error_message and steps_failed to diagnose the issue.


Related concepts
  • Architecture overview — Explains how protector-api, protector-engine, and the CLI plugin interact across sites, and why there is no direct service-to-service communication between them.
  • Prepare replication-enabled volume types — Step-by-step instructions for creating Cinder volume types with the required replication_enabled and replication_type properties on both sites.
  • Deploy and configure Protector — Covers installing the Protector services and setting up the database on each site before you register sites or create Protection Groups.
  • Register sites — How to register primary and secondary OpenStack deployments with Trilio Site Recovery and validate connectivity.
  • Create a Protection Group — Walks through the full creation workflow, including how the Consistency Group and Pure Storage Protection Group are provisioned automatically.
  • Configure a replication policy — How to attach Pure Storage FlashArray credentials, replication intervals, and RPO targets to a Protection Group.
  • Execute test failover (DR drill) — How to use Test Failover to validate your RTO, resource mappings, and application recovery without disrupting production.
  • Mock Mode reference — Configuration and usage of the SQLite-backed storage driver for development and CI/CD environments.