Alternatives
What teams use instead of Trilio Site Recovery for OpenStack and the trade-offs.
This page compares Trilio Site Recovery for OpenStack against the four categories of DR solutions you are most likely to encounter in an OpenStack environment. Understanding these trade-offs helps you articulate why a storage-layer, tenant-driven approach fits workloads that other tools handle poorly, and where those other tools may still be the right choice for your organization.
The problem with general-purpose DR tools in OpenStack
OpenStack environments present a specific challenge: VMs are tenant resources, but the infrastructure they depend on (Cinder volumes, Neutron networks, Nova compute) is managed across multiple independent services. A DR solution that does not model this topology natively forces you to reconstruct it manually at failover time, which is exactly when you can least afford to do so.
The four alternatives described here each address part of the problem, but none of them address it as a coherent whole.
Traditional backup tools
Tools such as Trilio's own TrilioVault and Relax-and-Recover (ReaR) operate by taking periodic snapshots or image-level backups of VMs and their volumes.
How they work: A backup job runs on a schedule, captures the state of a VM at a point in time, and stores that state in a backup repository. Recovery means restoring that snapshot to a new or existing VM.
Trade-offs:
- RPO: 15–60 minutes, depending on your backup schedule. Any work done between the last backup and the failure event is lost.
- RTO: 1–4 hours. Restoring a VM from a backup repository is a sequential, data-transfer-intensive process.
- DR testing: Testing requires a full restore into a staging environment. This is time-consuming and typically disruptive to the backup schedule itself.
- Operational fit: These tools are well-suited for data protection against accidental deletion, corruption, and ransomware. They are not designed for continuous replication or sub-minute failover.
If your organization's RPO and RTO requirements are measured in hours rather than minutes, a traditional backup tool may be sufficient. If you need near-zero RPO or fast automated failover, the snapshot model is a structural limitation, not a configuration problem.
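The snapshot model's RPO limit is simple arithmetic: a failure just before a backup completes loses everything written since the previous backup started. A minimal sketch of that worst case (the interval and duration figures are illustrative assumptions, not measurements of any particular tool):

```python
def worst_case_rpo_minutes(backup_interval_min: float,
                           backup_duration_min: float) -> float:
    """Worst-case window of lost writes for schedule-based backups.

    The last restorable point is the *start* of the previous successful
    backup, so the loss window is the interval plus the backup runtime.
    """
    return backup_interval_min + backup_duration_min

# Hourly backups that take 10 minutes to run:
loss = worst_case_rpo_minutes(60, 10)
print(f"worst-case RPO: {loss} minutes")  # prints "worst-case RPO: 70 minutes"
```

No schedule tuning removes this window entirely; it can only shrink toward the backup runtime itself, which is why sub-minute RPO requires replication rather than snapshots.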
OpenStack built-in: Cinder failover-host
Cinder provides a failover-host command that can redirect volume traffic from a failed storage backend to a replicated secondary backend.
How it works: When a Cinder backend is marked failed, the Cinder service reassigns its volumes to a preconfigured replication target. This is a host-level operation managed by the cloud operator.
Trade-offs:
- Granularity: Failover is host-scoped, not workload-scoped. You cannot select a logical group of VMs belonging to a tenant application and fail them over independently. Every volume on that host moves together.
- Nova orchestration: Cinder failover-host moves block storage, but it does not recreate Nova VM instances on the secondary site. After the Cinder failover completes, your volumes exist on the secondary site but your VMs do not. You must handle VM recreation separately.
- No tenant self-service: failover-host is an administrative operation. Tenants cannot initiate, monitor, or test their own failover. There is no concept of a protection group, replication policy, or per-tenant RTO/RPO target.
- No consistency grouping: There is no mechanism to ensure that all volumes belonging to a multi-volume VM, or a group of related VMs, are failed over to a consistent point in time.
failover-host is a building block, not a DR solution. It solves the storage layer problem for operators but leaves the orchestration problem entirely to you.
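The granularity gap is easiest to see in a toy model. The sketch below uses a hypothetical volume inventory (not real Cinder API data) to contrast host-scoped failover, which moves every volume sharing a backend, with workload-scoped failover, which selects only one application's volumes:

```python
# Hypothetical inventory: volume -> (cinder backend host, owning application).
volumes = {
    "vol-app1-data": ("cinder@pure-a", "app1"),
    "vol-app1-logs": ("cinder@pure-a", "app1"),
    "vol-app2-data": ("cinder@pure-a", "app2"),
    "vol-app3-data": ("cinder@pure-b", "app3"),
}

def host_scoped_failover(host: str) -> list[str]:
    """failover-host model: everything on the failed backend moves together."""
    return sorted(v for v, (h, _) in volumes.items() if h == host)

def workload_scoped_failover(app: str) -> list[str]:
    """Protection-group model: only the application's own volumes move."""
    return sorted(v for v, (_, a) in volumes.items() if a == app)

# app2's volume is dragged along even if only app1 needed to fail over:
print(host_scoped_failover("cinder@pure-a"))
# ['vol-app1-data', 'vol-app1-logs', 'vol-app2-data']
print(workload_scoped_failover("app1"))
# ['vol-app1-data', 'vol-app1-logs']
```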
Application-level replication
For stateful workloads such as databases, some teams replicate at the application layer, using MySQL Group Replication, PostgreSQL streaming replication, Cassandra multi-datacenter topologies, and similar mechanisms.
How it works: The application itself manages data synchronization between instances running at the primary and secondary site. Failover is an application-level operation, sometimes automated by the application's own HA tooling.
Trade-offs:
- Per-application configuration: Every application requires its own replication topology, its own monitoring, and its own failover runbook. In an environment with dozens of distinct workloads, this creates significant operational overhead.
- Consistency across volumes: A VM may have multiple Cinder volumes (for example, a separate data volume, a log volume, and a temp volume). Application-level replication typically covers only the data the application itself manages. The consistency relationship between those volumes during a failover is not guaranteed unless the application explicitly coordinates it.
- Guest coupling: Your VMs must run specific agents, configurations, or software versions to participate in the replication scheme. This creates a dependency between infrastructure lifecycle and application lifecycle that is difficult to manage at scale.
- Coverage gaps: VMs running workloads that do not have mature HA tooling (legacy applications, third-party software, internal tools) are simply not covered.
Application-level replication is appropriate when a specific workload demands application-aware consistency (for example, a database that needs to guarantee transaction-level durability across sites). It is not a substitute for infrastructure-level DR across a full tenant environment.
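The consistency and coverage gaps above can be made concrete with a small audit sketch. The inventory below is hypothetical: it models which volumes each VM has versus which ones the application's own replication actually protects:

```python
# Hypothetical inventory, not data from any real replication tool.
vms = {
    # MySQL streaming replication covers the database files only:
    "db-vm":     {"volumes": ["data", "logs", "tmp"], "app_replicated": ["data"]},
    # Legacy workload with no HA tooling at all:
    "legacy-vm": {"volumes": ["data"],                "app_replicated": []},
}

def unprotected_volumes(vm: str) -> list[str]:
    """Volumes that would be lost or inconsistent after app-level failover."""
    info = vms[vm]
    return [v for v in info["volumes"] if v not in info["app_replicated"]]

for vm in vms:
    print(vm, "unprotected:", unprotected_volumes(vm))
```

Every non-empty result is a volume whose state after failover depends on luck rather than design, which is the gap infrastructure-level replication closes.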
Vendor-managed DR appliances
Some storage and DR vendors offer dedicated appliances or software platforms that provide replication, failover orchestration, and runbook automation.
How they work: A dedicated DR management plane, often running outside of OpenStack, coordinates replication between storage arrays and orchestrates VM recovery steps. Agents are commonly required inside guest VMs.
Trade-offs:
- Additional infrastructure: You must deploy, license, and maintain the DR appliance or platform separately from your OpenStack environment. This adds cost, a new operational domain, and another failure surface.
- Proprietary guest agents: Many vendor appliances require agents installed inside VMs. This ties your VM images to the vendor's agent version, complicates image lifecycle management, and creates a dependency you must manage across every workload.
- OpenStack integration gaps: Vendor appliances are typically designed for VMware or bare-metal environments and adapted for OpenStack as an afterthought. They often lack native integration with Nova, Cinder, Neutron, or Keystone, meaning failover may not correctly reconstruct network topology, security groups, or flavor assignments.
- Tenant opacity: These platforms are typically operator-managed. Tenants have limited or no visibility into the replication status of their own workloads, and self-service failover testing is rarely supported.
Why Trilio Site Recovery for OpenStack takes a different approach
Trilio Site Recovery operates at the storage block layer, using Pure Storage FlashArray replication as the transport mechanism. This means:
- No guest agents, no application changes. Any Nova VM whose root and data volumes reside on Cinder-managed Pure FlashArray volumes is automatically eligible for protection. You do not need to modify the VM image, install software inside the guest, or coordinate with application teams.
- Workload-scoped protection groups. You define logical groups of VMs that must fail over together, with a consistent recovery point. These groups map directly to Cinder Consistency Groups and Pure Storage Protection Groups, ensuring storage-layer crash consistency across all volumes in the group.
- Tenant self-service. Tenants create and manage their own protection groups through the standard OpenStack CLI (openstack with the protectorclient plugin) or the Horizon dashboard. Operators do not need to be involved in day-to-day DR operations.
- Both synchronous and asynchronous replication. Synchronous replication provides RPO=0 for same-metro deployments. Asynchronous replication supports configurable RPO for geographically separated sites. Both modes are managed through the same platform and the same workflows.
- Nova-aware failover. At failover time, Trilio Site Recovery does not just move block storage: it orchestrates the recreation of Nova VM instances on the secondary site, using the resource mappings you have configured to translate flavors, networks, and security groups between sites.
The result is a DR solution that fits naturally into the OpenStack operational model: APIs, CLI, and dashboard rather than a separate management plane; tenant ownership rather than operator dependency; storage-layer efficiency rather than full data copies.
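The resource-mapping idea behind Nova-aware failover can be sketched as a simple translation table. All names below, and the lookup shape itself, are illustrative assumptions rather than the product's actual schema:

```python
# Hypothetical per-site resource mappings: primary name -> secondary name.
MAPPINGS = {
    "flavor":   {"m1.large": "dr.m1.large"},
    "network":  {"prod-net": "dr-prod-net"},
    "secgroup": {"web-sg":   "dr-web-sg"},
}

def translate(kind: str, name: str) -> str:
    """Resolve a primary-site resource to its secondary-site equivalent."""
    try:
        return MAPPINGS[kind][name]
    except KeyError:
        # An unmapped resource is exactly the failure mode that makes
        # un-orchestrated failover (e.g. bare failover-host) painful.
        raise LookupError(f"no {kind} mapping for {name!r}; "
                          "failover would need manual intervention")

print(translate("flavor", "m1.large"))  # prints "dr.m1.large"
```

The design point is that this translation is configured once, before a disaster, so failover does not depend on anyone remembering the secondary site's naming at 3 a.m.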
Glossary
RPO (Recovery Point Objective) – The maximum acceptable amount of data loss, measured in time. An RPO of 15 minutes means you can tolerate losing up to 15 minutes of writes in a disaster event.
RTO (Recovery Time Objective) – The maximum acceptable time to restore service after a failure. An RTO of 2 hours means your workloads must be accessible again within 2 hours of declaring a disaster.
Cinder failover-host – An OpenStack Cinder operator command that redirects volumes from a failed storage backend to a replication target. Operates at the storage host level with no VM orchestration.
Application-level replication – A DR strategy where the application itself (for example, a database engine) is responsible for synchronizing state between primary and secondary instances, without involvement from the storage or infrastructure layer.
Crash consistency – A consistency guarantee that ensures all volumes in a group are captured at the same instant, as if the system had experienced a sudden power loss. The resulting snapshot is consistent at the block level; applications may need to replay logs on recovery, but data is not corrupted.
Protection Group – In Trilio Site Recovery, a logical grouping of VMs that are replicated and failed over together. Maps 1:1 to a Cinder Consistency Group and a Pure Storage Protection Group.
Guest agent – Software that must be installed inside a VM to participate in a backup, replication, or DR workflow. Trilio Site Recovery does not require guest agents.
Synchronous replication – A replication mode in which a write is acknowledged to the application only after it has been committed on both the primary and secondary storage arrays. Provides RPO=0 but requires low-latency connectivity between sites.
Asynchronous replication – A replication mode in which writes are acknowledged on the primary site and synchronized to the secondary site on a configurable interval. Supports longer distances at the cost of a non-zero RPO.
Vendor DR appliance – A dedicated hardware or software platform, typically sold by a storage or DR vendor, that provides replication and failover orchestration outside of the native cloud management plane.
Comparison: DR approaches for a 10-VM tenant application on OpenStack
The following table summarizes how each approach handles a representative scenario: a tenant running a 10-VM application across two OpenStack sites, requiring a 5-minute RPO and a 30-minute RTO.
| Approach | RPO | RTO | DR test method | Tenant self-service |
|---|---|---|---|---|
| Traditional backup (TrilioVault) | 15–60 min | 1–4 hours | Full restore | Limited |
| Cinder failover-host | Array-dependent | Manual | None built-in | None |
| Application-level replication | App-dependent | App-dependent | App-specific | App-specific |
| Vendor DR appliance | Varies | Varies | Vendor-specific | Typically none |
| Trilio Site Recovery | 0 (sync); configurable (async) | Automated | Non-disruptive test failover | Full (CLI/Horizon) |
Interpretation: Only Trilio Site Recovery meets both the 5-minute RPO requirement (using asynchronous replication with an appropriate replication interval) and the 30-minute RTO requirement (using automated failover with pre-configured resource mappings). It also provides the only non-disruptive DR test capability and the only tenant self-service interface in this comparison.
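That interpretation can be restated as a simple filter over the table. The figures below paraphrase the table's best-case numbers in minutes; entries that are manual, array-dependent, or vendor-dependent are modeled as unknown, which is itself part of the point:

```python
# Best-case (RPO, RTO) in minutes per approach; None = unknown or entirely
# dependent on external factors. Figures paraphrase the comparison table.
approaches = {
    "traditional backup":    (15, 60),
    "cinder failover-host":  (None, None),  # array-dependent RPO, manual RTO
    "app-level replication": (None, None),  # varies per application
    "vendor DR appliance":   (None, None),  # varies per vendor
    "trilio site recovery":  (0, 30),       # sync RPO=0, automated failover
}

def meets(req_rpo_min: int, req_rto_min: int) -> list[str]:
    """Approaches that demonstrably satisfy both objectives."""
    return [name for name, (rpo, rto) in approaches.items()
            if rpo is not None and rto is not None
            and rpo <= req_rpo_min and rto <= req_rto_min]

print(meets(5, 30))  # prints "['trilio site recovery']"
```

Relaxing the objectives changes the answer: with an hour of RPO and four hours of RTO, traditional backup qualifies too, which is why the right tool depends on your actual requirements rather than on feature lists.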
Related topics
- Architecture overview – Describes the two-site topology, the role of protector-api and protector-engine at each site, and how the OSC CLI plugin coordinates metadata across sites without direct service-to-service communication.
- Protection Groups – Explains the 1:1:1 mapping between Trilio Protection Groups, Cinder Consistency Groups, and Pure Storage Protection Groups, and why this mapping is the foundation of crash-consistent DR.
- Replication policies – Covers how synchronous and asynchronous replication modes are configured, the RPO implications of each, and how to choose the right mode for your distance and latency constraints.
- Test failover (DR drill) – Describes the non-disruptive DR testing workflow that allows you to validate recovery without impacting production workloads, a capability absent from most alternatives described on this page.
- Resource mappings – Explains how Nova flavors, Neutron networks, and security groups are translated between the primary and secondary site at failover time, which is the gap that Cinder failover-host leaves unaddressed.
- Mock storage driver – Documents how you can evaluate Trilio Site Recovery end-to-end, including failover and failback, without physical Pure Storage arrays, which is relevant if you are comparing this solution against others during a proof-of-concept phase.