Site Recovery for OpenStack
Concept

Problem Being Solved

The DR challenges this product addresses for OpenStack-hosted VMs.


Overview

OpenStack provides powerful infrastructure for running virtual machines at scale, but its native tooling leaves a critical gap: there is no built-in mechanism for tenant-driven disaster recovery. When a site fails, individual tenants have no self-service way to fail their workloads over to a secondary location, and platform operators face an all-or-nothing, manual recovery process. Trilio Site Recovery for OpenStack closes this gap by giving cloud platform engineers and SREs a structured, tenant-aware DR framework built on top of Pure Storage FlashArray replication.


Content

The OpenStack DR Gap

OpenStack Nova, Cinder, and Neutron are designed to manage compute, storage, and networking within a single cloud. They have no native concept of a secondary site, no cross-cloud replication policy, and no coordinated failover workflow. When a datacenter incident occurs, the tools that operators normally reach for—Nova evacuate, Cinder volume migration—operate only within a single cluster and cannot move workloads across independent OpenStack clouds.

For organizations that run production workloads on OpenStack, this creates several compounding problems.

Problem 1: No Cross-Site Coordination Layer

A production DR setup requires two independent OpenStack clouds: a primary site and a secondary (DR) site, each with its own Nova, Cinder, Neutron, and Keystone endpoints. These sites have no awareness of each other by default. There is no shared control plane, no cross-site VM inventory, and no mechanism to synchronize the state needed to reconstruct workloads on the far side.

Without a coordination layer, even organizations that have invested in Pure Storage FlashArray replication at the storage level cannot easily translate a storage-level failover into a working OpenStack compute environment. Volumes may arrive on the DR site, but the Nova instance records, network port assignments, flavor requirements, and security group memberships that describe a running VM do not travel with them.

Problem 2: Storage Replication Is Not Enough

Pure Storage FlashArray replication—both synchronous (ActiveCluster) and asynchronous—reliably moves block data between arrays. But replication alone does not make a VM recoverable on a second OpenStack cloud. To restart a VM after failover, you need:

  • The replicated volumes imported and visible to Cinder on the DR site
  • A Nova instance record created with the correct flavor, availability zone, and boot volume
  • Network ports recreated on the correct networks, with IP addresses that are valid in the DR environment
  • Security groups and key pairs that exist on the DR site

None of this happens automatically after a storage-level failover. Without tooling to orchestrate these steps, recovery devolves into a manual, error-prone process whose effective recovery time far exceeds what the storage layer's replication targets would suggest.
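Without orchestration, each of the steps above has to be driven by hand with the base OpenStack clients. A rough sketch of that manual sequence follows; all names, hosts, and identifiers are placeholders, and the exact invocations vary by deployment:

```shell
# Illustrative manual recovery on the DR site, without Site Recovery tooling.
# Names, hosts, and IDs below are placeholders.

# 1. Surface the replicated volume to Cinder on the DR site
cinder manage --name web-boot-vol cinder-dr-host@pure#pure <volume-identifier>

# 2. Recreate the network port with an address valid in the DR environment
openstack port create --network dr-net \
    --fixed-ip ip-address=10.1.0.5 web-port

# 3. Rebuild the Nova instance from the imported boot volume
openstack server create --flavor m1.large --volume web-boot-vol \
    --port web-port --security-group web-sg web-01
```

Repeating this by hand for every VM, under incident pressure, is exactly the gap the coordination layer closes.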

Problem 3: Tenants Cannot Drive Their Own Recovery

In a multi-tenant OpenStack environment, different teams own different workloads with different criticality levels and different recovery objectives. A platform-level DR tool that requires admin intervention for every failover creates a bottleneck: tenants must wait for operators to act on their behalf, and operators must make recovery decisions without full application context.

What tenants need is the ability to organize their VMs into logical protection groups—by application, by tier, or by business unit—set their own replication policies, and execute or test failover without filing a ticket. What operators need is guardrails: role-based access, metadata consistency enforcement, and visibility into what is happening across all tenants during a site event.

Problem 4: No DR Drill Capability

A DR plan that has never been tested is not a DR plan—it is a hypothesis. But testing failover against a live environment risks disrupting production, and testing against a secondary site can leave that site in an inconsistent state if the test is not cleanly reversed.

OpenStack provides no native mechanism for a non-disruptive DR drill: creating temporary VMs from a replicated recovery point on an isolated network, validating application behavior, and then cleaning up without affecting the production replication stream.

Problem 5: Site-Level Events Have No Coordinated Response

When an entire datacenter becomes unavailable—power failure, network partition, catastrophic hardware loss—the recovery requirement changes in character. It is no longer about failing over one application; it is about moving every tenant's workloads simultaneously. Without a coordinated site-level failover capability, operators must manually identify every protection scope, determine the correct order of operations, and execute each one individually while the clock runs.

This is precisely the moment when a "big red button" is needed: a single operation that discovers all workloads at the affected site, fans out the failover across all of them in parallel, handles partial failures gracefully, and provides real-time progress visibility to the operators managing the incident.

What Trilio Site Recovery Addresses

Trilio Site Recovery for OpenStack is purpose-built to solve each of these problems for OpenStack Nova VMs backed by Pure Storage FlashArray:

  • It provides the cross-site coordination layer that OpenStack lacks, using the OSC CLI plugin and Horizon dashboard to authenticate to both clouds and synchronize the metadata needed for workload recovery.
  • It bridges storage replication and compute recovery, automating the Cinder import, Nova instance reconstruction, and network port creation steps that turn a storage failover into a running VM.
  • It gives tenants self-service DR control through Protection Groups—logical groupings of VMs with independently configurable replication policies and resource mappings.
  • It enables non-disruptive DR drills via test failover, which creates temporary VMs from replicated snapshots on an isolated network without affecting production replication.
  • It supports site-level coordinated failover that discovers all protection groups at a site, executes their failovers in parallel, and surfaces per-group progress and error detail to administrators managing a real disaster.

Key terms

Primary site — The OpenStack cloud where a set of VMs is currently running. Primary/secondary designations are workload-relative and swap on failover; neither site is permanently "primary."

Secondary site (DR site) — The OpenStack cloud designated to receive workloads during a failover. Must have its own independent Nova, Cinder, Neutron, and Keystone endpoints.

Protection Group — A tenant-defined logical grouping of Nova VMs that share a replication policy and are failed over together. Maps 1:1 with a Cinder Consistency Group and a Pure Storage Protection Group or Pod.

Replication type — The mode of storage replication: synchronous (ActiveCluster/Pods, zero RPO) or asynchronous (Protection Groups, non-zero RPO). Determined by the Cinder volume type assigned to a Protection Group.
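As an illustration of how replication type follows from the volume type, a replication-enabled Cinder volume type might be defined as below. The type name is arbitrary, and the extra-spec values follow common Cinder replication conventions; consult the storage driver documentation for the exact values your deployment expects:

```shell
# Hypothetical replication-enabled volume type; extra-spec values follow
# common Cinder replication conventions and may differ per driver.
openstack volume type create replicated-ssd
openstack volume type set \
    --property replication_enabled='<is> True' \
    --property replication_type='<in> async' \
    replicated-ssd
```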

Recovery Time Objective (RTO) — The target duration from a failure event to workloads being operational on the DR site.

Recovery Point Objective (RPO) — The maximum acceptable data loss measured in time—determined by replication lag for async replication, or effectively zero for sync replication.

Test failover (DR drill) — A non-disruptive operation that creates temporary VMs from a replicated snapshot on an isolated network to validate DR readiness without affecting production.

Planned failover — A coordinated, graceful failover executed when the primary site is still reachable, typically for scheduled maintenance or migration.

Unplanned failover — A failover executed when the primary site is unreachable, typically in response to an actual disaster. Requires --force to override the peer-site reachability check.

Failback — The operation that returns workloads to the original primary site after a failover, once that site is restored.

Site-level failover — A cross-tenant operation that simultaneously fails over all Protection Groups at a site. Requires the dr_site_admin role to prevent accidental execution.

Resource mapping — Configuration that translates primary-site resource identifiers (networks, flavors, security groups) to their DR-site equivalents, enabling Nova instance reconstruction after failover.
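Conceptually, a resource mapping is a lookup from primary-site identifiers to their DR-site equivalents. A minimal sketch in shell, using invented network and flavor names (real mappings are stored and applied by the DR service, not by a script):

```shell
# Sketch only: resource mapping as a simple lookup. All names are invented.
map_network() {
  case "$1" in
    prod-net)  echo "dr-net" ;;
    mgmt-net)  echo "dr-mgmt-net" ;;
    *)         echo "UNMAPPED" ;;
  esac
}

map_flavor() {
  case "$1" in
    m1.large)  echo "m1.large" ;;   # same flavor name exists on both sites
    *)         echo "UNMAPPED" ;;
  esac
}

# Instance reconstruction would consume the translated values:
echo "boot on network: $(map_network prod-net), flavor: $(map_flavor m1.large)"
```

An unmapped resource is a reconstruction failure waiting to happen, which is why mappings are validated up front rather than discovered mid-failover.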

Metadata sync — The process by which Protection Group configuration is kept consistent across both sites. Modifications are blocked if the peer site is unreachable, preventing state divergence.

Mock storage driver — A SQLite-backed simulation of Pure Storage FlashArray behavior that enables complete end-to-end DR testing without physical arrays.


Examples

The following examples illustrate the concrete consequences of the problems described above and how they surface in practice.

Tenant workload organization by application tier

A tenant running an e-commerce application organizes VMs into three Protection Groups with different replication types reflecting different data criticality:

# Web tier — async replication, higher RTO acceptable
openstack protector protection-group create \
  --name prod-ecommerce-web \
  --replication-type async \
  --primary-site site-a \
  --secondary-site site-b \
  --volume-type replicated-ssd

# Database tier — sync replication, zero RPO required
openstack protector protection-group create \
  --name prod-ecommerce-db \
  --replication-type sync \
  --primary-site site-a \
  --secondary-site site-b \
  --volume-type replicated-ssd

Each group fails over independently, letting the tenant prioritize database recovery before web tier recovery.


Non-disruptive DR drill on an isolated test network

Before relying on DR in a real incident, a tenant validates recovery by running a test failover against the latest replicated snapshot. This creates temporary VMs on a non-routable network without touching production:

# Create an isolated test network on the DR site
openstack network create dr-test-net
openstack subnet create dr-test-subnet \
    --network dr-test-net \
    --subnet-range 10.99.0.0/24 \
    --no-dhcp

# Run test failover — production VMs are unaffected
openstack dr test failover prod-ecommerce-db \
    --network-mapping "<prod-net-id>=<dr-test-net-id>" \
    --all-projects

Expected output (operation status after completion):

Status: completed
Test VMs created with prefix: test-
Production replication: unaffected

After validating application behavior on the test VMs, clean up all temporary resources:

openstack dr test failover cleanup prod-ecommerce-db

Site-level failover during a datacenter incident

When an entire datacenter becomes unavailable, a DR administrator with the dr_site_admin role triggers failover of all Protection Groups simultaneously from the DR site:

openstack dr site failover cluster1 \
    --failover-type unplanned \
    --force

Expected initial response:

+-------------------------+--------------------------------------+
| Field                   | Value                                |
+-------------------------+--------------------------------------+
| operation_type          | site_failover                        |
| status                  | pending                              |
| total_protection_groups | 5                                    |
| completed_count         | 0                                    |
| progress                | 0%                                   |
+-------------------------+--------------------------------------+

Progress is tracked in real time. Partial failures are reported per-group so individual issues can be diagnosed and retried without re-running the entire site operation.


Related concepts
  • Deploy and Configure Protector — How to install and configure the protector-api and protector-engine services on each site before DR workflows can be used.
  • Register Sites — How to make each OpenStack cloud aware of its peer site so that metadata sync and cross-site operations can function.
  • Prepare Replication-Enabled Volume Types — How to create Cinder volume types with the replication_enabled and replication_type properties required for Protection Group membership.
  • Create a Protection Group — How to define the logical grouping of VMs that will be recovered together, including its 1:1 relationship with a Cinder Consistency Group and a Pure Storage Protection Group or Pod.
  • Configure Resource Mappings — How to map primary-site networks, flavors, and security groups to their DR-site equivalents so that Nova instances can be reconstructed correctly after failover.
  • Execute Test Failover (DR Drill) — Step-by-step procedure for running a non-disruptive validation of DR readiness against a replicated recovery point.
  • Execute Site-Level Failover — How to use the site failover capability to move all tenant workloads simultaneously during a datacenter-scale incident.
  • Mock Storage Testing Guide — How to use the SQLite-backed mock storage driver to validate DR workflows end-to-end without Pure Storage hardware.