Key Concepts
Definitions of terms a new user must understand.
This page defines the core terms you need to understand before deploying and operating Site Recovery. The concepts here appear throughout the documentation, CLI output, and Kubernetes resources you will work with directly. Familiarity with these definitions will help you make informed decisions about deployment models, replication configuration, and recovery planning.
Deployment Models
Site Recovery supports two distinct storage backends. Every deployment decision—cluster topology, which CRDs you create, and which controllers run where—flows from this choice.
LINSTOR: A centralized LINSTOR controller runs on the quorum cluster, with LINSTOR satellites running on worker nodes in the primary and DR clusters. This model requires exactly three clusters and is well-suited to large-scale deployments where centralized storage orchestration is preferred.
DRBD Operator: The DRBD Operator is installed directly on the primary and DR clusters. This model requires a minimum of two clusters and supports an optional quorum cluster for management plane isolation. It is commonly used with OpenShift Virtualization and simpler topologies.
When reading any procedure in this documentation, confirm which deployment model applies before following steps—instructions are not interchangeable between the two models.
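The cluster-count rules for the two models can be expressed as a small pre-flight check. This is a sketch for illustration only: `validate_topology` and the model labels `linstor` and `drbd-operator` are hypothetical names, not part of pgctl or Site Recovery. It assumes that a third cluster in the DRBD Operator model is the optional quorum cluster.

```shell
# Pre-flight check for the cluster counts each deployment model requires.
# ASSUMPTION: validate_topology and the model labels are hypothetical
# helpers for illustration; they are not part of Site Recovery.
validate_topology() {
  local model="$1" clusters="$2"
  case "$model" in
    linstor)
      # LINSTOR needs primary + DR + quorum: exactly three clusters.
      if [ "$clusters" -eq 3 ]; then
        echo "ok"
        return 0
      fi
      echo "linstor requires exactly 3 clusters, got $clusters"
      return 1
      ;;
    drbd-operator)
      # Primary + DR are mandatory; a third (quorum) cluster is optional.
      if [ "$clusters" -eq 2 ] || [ "$clusters" -eq 3 ]; then
        echo "ok"
        return 0
      fi
      echo "drbd-operator requires 2 clusters (3 with quorum), got $clusters"
      return 1
      ;;
    *)
      echo "unknown deployment model: $model"
      return 2
      ;;
  esac
}
```

For example, `validate_topology linstor 2` reports an error because the LINSTOR model cannot run without a quorum cluster, while `validate_topology drbd-operator 2` prints `ok`.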
Cluster Roles
Primary Cluster
The active production environment where your KubeVirt virtual machines run. The primary cluster hosts Protection Group controllers (LINSTOR model) or the DRBD Operator (DRBD Operator model), and its worker nodes participate directly in DRBD block replication to the DR cluster.
DR Cluster
The standby environment that mirrors production data. VMs exist on the DR cluster in a stopped state, ready to start during a failover. Like the primary cluster, DR worker nodes participate in DRBD replication traffic directly.
Quorum Cluster
A management-only cluster that hosts orchestration components: failover controllers, and (in LINSTOR deployments) the LINSTOR controller. The quorum cluster does not run application workloads. It does not relay or participate in DRBD storage replication—replication traffic flows directly between primary and DR worker nodes. Its role is to provide an isolated control plane that remains available even when one of the data clusters is impaired.
In LINSTOR deployments, the quorum cluster is mandatory. In DRBD Operator deployments, it is optional but recommended for management plane isolation.
Protection Constructs
Protection Zone
A named DR configuration that links a primary cluster to a DR cluster. The Protection Zone establishes the relationship between the two sites—encoding cluster identities, credentials, and connectivity details—so that higher-level constructs like Protection Groups know where to replicate data and where to fail over.
Protection Group
A logical grouping of virtual machines that are protected and failed over together as a unit. A Protection Group ensures that VMs with dependencies—such as a web server and its database—transition to the DR cluster in a coordinated sequence. All VMs in a Protection Group share the same failover lifecycle: they start, stop, and switch sides together.
In LINSTOR deployments, you declare a Protection Group using the ProtectionGroup CRD applied to the primary cluster. In DRBD Operator deployments, individual VMs are first enrolled via a ProtectionRequest, and then grouped.
ProtectionRequest (DRBD Operator deployments only)
A Kubernetes custom resource you create to enroll a specific VM in DR protection. When the protection controller processes a ProtectionRequest, it validates the VM and its PVCs, creates the necessary DRBDVolume resources, and migrates the VM's storage to DRBD-backed frontend PVCs. You create ProtectionRequest resources on the quorum cluster.
Storage Constructs
DRBDReplicationPolicy (DRBD Operator deployments only)
A custom resource that defines the cross-cluster replication configuration for DRBD Operator deployments. It specifies the DRBD protocol (synchronous or asynchronous), the replication endpoints for primary and DR clusters, and the storage class mappings that translate between storage classes available on each cluster. A single DRBDReplicationPolicy governs how all replicated volumes in a deployment communicate.
DRBDVolume (DRBD Operator deployments only)
A custom resource representing a single replicated block volume managed by the DRBD Operator. Each DRBDVolume corresponds to one PVC that has been brought under DRBD replication. The DRBD Operator creates and reconciles DRBDVolume resources automatically when processing a ProtectionRequest.
Frontend PVC
After a VM is enrolled in protection, its original PVC is replaced by a DRBD-backed PVC called the frontend PVC. From the VM's perspective, the frontend PVC is a normal PVC—the VM continues to read and write to it without modification. Beneath the surface, all writes flow through DRBD and are synchronously or asynchronously replicated to the DR cluster. The frontend PVC is what makes transparent replication possible without changes to the VM definition.
Operations
Failover
A cluster switchover that moves your workloads from the primary cluster to the DR cluster. Failover can be planned (initiated intentionally, for example during maintenance) or unplanned (triggered in response to a primary cluster outage). During failover, Site Recovery stops VMs on the primary (if reachable), promotes the DR-side DRBD volumes to primary role, and starts VMs on the DR cluster. You execute failovers using pgctl or the Site Manager UI.
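The ordering of the failover steps, and how it differs when the primary cluster is unreachable, can be sketched as a dry-run plan. `failover_plan` is a hypothetical helper written for this page. It only prints the sequence described above and does not contact any cluster, and it is not how Site Recovery implements failover internally.

```shell
# Print the failover order described above as a dry-run plan.
# ASSUMPTION: failover_plan is a hypothetical illustration. It runs nothing
# against a cluster; real failovers are executed with pgctl or the UI.
failover_plan() {
  local primary_reachable="$1"   # "yes" or "no"
  if [ "$primary_reachable" = "yes" ]; then
    echo "1. stop VMs on the primary cluster"
  else
    echo "1. skip stopping VMs (primary cluster unreachable)"
  fi
  echo "2. promote DR-side DRBD volumes to primary role"
  echo "3. start VMs on the DR cluster"
}
```

Calling `failover_plan yes` corresponds to a planned failover, and `failover_plan no` to an unplanned failover during a primary outage. Only the first step changes.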
Failback
The process of returning workloads to the original primary cluster after a failover. Failback follows a similar sequence to failover but in reverse: data that changed on the DR cluster while it was active is resynchronized back to the primary cluster before VMs are restarted there.
Test Failover
A non-disruptive DR validation that exercises your recovery capability without affecting production workloads. Test failover works by creating point-in-time snapshots of your replicated volumes on the DR cluster and starting temporary copies of your VMs from those snapshots—typically in an isolated namespace. Production replication continues uninterrupted throughout the test. When validation is complete, you delete the TestFailover resource and all temporary resources are cleaned up. Test failover is the recommended way to verify DR readiness on a regular schedule.
Recovery Objectives
RPO (Recovery Point Objective)
The maximum amount of data loss your organization can tolerate, measured as time. An RPO of zero means no data loss is acceptable—every write on the primary must be confirmed on the DR cluster before the application is notified of success. A non-zero RPO means you can tolerate losing some recent writes in a failure scenario.
RTO (Recovery Time Objective)
The maximum amount of downtime your organization can tolerate after a failure, measured as time. RTO encompasses the time from failure detection through VM restart on the DR cluster. Site Recovery targets automated failover completion in the range of minutes.
Replication Protocols
DRBD supports multiple replication protocols. The choice of protocol directly determines your RPO.
Protocol C (Synchronous Replication)
With Protocol C, a write operation on the primary is not acknowledged to the application until the data has been written to both the primary's local storage and the DR cluster's storage. This guarantees RPO = 0—no data can be lost in a failure because every committed write is already present on the DR side. Protocol C requires low-latency network connectivity between clusters (typically under 10 ms round-trip time) to avoid unacceptable write latency.
Protocol A (Asynchronous Replication)
With Protocol A, a write operation is acknowledged to the application as soon as it is written to the primary's local storage. Replication to the DR cluster happens asynchronously in the background. This yields near-zero RPO rather than a hard guarantee of zero—in a sudden failure, the most recent writes that had not yet been transmitted to the DR cluster may be lost. Protocol A tolerates higher network latency and is suitable for geographically distant clusters where Protocol C latency would be prohibitive.
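The latency guidance for the two protocols can be turned into a simple decision helper. The sketch below is an assumption-laden illustration: `suggest_protocol` is a hypothetical function, and its default 10 ms cutoff is the "typical" figure quoted above for Protocol C rather than a hard product limit. Measure the round-trip time between your own replication endpoints before relying on it.

```shell
# Suggest a DRBD protocol from a measured inter-cluster round-trip time.
# ASSUMPTION: suggest_protocol is a hypothetical helper, and the 10 ms
# cutoff is the typical figure quoted on this page, not a product limit.
suggest_protocol() {
  local rtt_ms="$1" threshold_ms="${2:-10}"
  # awk handles fractional RTT values such as 0.8 or 12.5.
  awk -v rtt="$rtt_ms" -v max="$threshold_ms" 'BEGIN {
    if (rtt + 0 < max + 0) print "C"; else print "A"
  }'
}
```

With these defaults, `suggest_protocol 2.5` prints `C` and `suggest_protocol 45` prints `A`. Treat the result as a starting point: choosing Protocol A also means accepting a non-zero RPO.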
Multi-Tenant Concepts
Deployment
In the context of the quorum cluster's multi-tenant architecture, a deployment is an independent DR setup consisting of one primary cluster, one DR cluster, and a dedicated namespace on the quorum cluster (named dr-<deployment-name>). Each deployment has its own isolated failover controller and its own kubeconfig secrets. Multiple deployments can coexist on a single quorum cluster without interfering with one another—for example, dr-prod, dr-staging, and dr-dev running side by side.
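The `dr-<deployment-name>` convention can be applied mechanically when scripting against the quorum cluster. In the sketch below, `deployment_namespace` is a hypothetical helper, not a pgctl feature. It builds the namespace name and checks it against the Kubernetes rule that namespace names are RFC 1123 labels (lowercase alphanumerics and hyphens, at most 63 characters).

```shell
# Derive the quorum-cluster namespace for a deployment name.
# ASSUMPTION: deployment_namespace is a hypothetical helper; only the
# dr-<deployment-name> convention comes from this page.
deployment_namespace() {
  local ns="dr-$1"
  # Namespace names must be RFC 1123 labels: lowercase alphanumerics or
  # "-", starting and ending with an alphanumeric, at most 63 characters.
  if [ "${#ns}" -gt 63 ] ||
     ! printf '%s\n' "$ns" | LC_ALL=C grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?$'; then
    echo "invalid namespace: $ns" >&2
    return 1
  fi
  echo "$ns"
}
```

For example, `deployment_namespace staging` prints `dr-staging`, which can then be passed to `kubectl -n` when inspecting that deployment's resources on the quorum cluster.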
pgctl
pgctl is the primary command-line tool for Site Recovery operations. You use it to manage Protection Groups, execute and monitor failover and failback operations, validate cluster connectivity, and administer multi-tenant deployments. Most operational workflows documented in this guide involve pgctl commands.
Declaring a Protection Group (LINSTOR model)
The following manifest groups two VMs together so they fail over as a unit. Apply this to the primary cluster.
apiVersion: siterecovery.trilio.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: web-tier
  namespace: default
spec:
  desiredState: running
  virtualMachines:
    - name: vm-web-server
    - name: vm-database
  replicationPolicy:
    type: synchronous
    remoteCluster: cluster2
After applying, verify the group is recognized:
./cmds/pgctl get pg web-tier
Expected output:
NAME       STATE     VMS   REPLICATION   SYNC
web-tier   running   2     synchronous   InSync
Requesting VM Protection (DRBD Operator model)
The following manifest enrolls a single VM in DR protection. Apply this to the quorum cluster.
apiVersion: siterecovery.trilio.io/v1alpha1
kind: ProtectionRequest
metadata:
  name: protect-vm-database
  namespace: dr-prod
spec:
  vmName: vm-database
  vmNamespace: default
  sourceCluster: cluster1
Check the status after applying:
kubectl --kubeconfig ~/.kube/config-quorum \
  get protectionrequest protect-vm-database -n dr-prod -o yaml
The controller progresses through validation, DRBDVolume creation, and frontend PVC attachment. When complete, the VM's original PVC has been replaced by a DRBD-backed frontend PVC and replication is active.
Defining a DRBDReplicationPolicy (DRBD Operator model)
This policy configures synchronous (Protocol C) replication between two clusters and maps storage classes between sites.
apiVersion: siterecovery.trilio.io/v1
kind: DRBDReplicationPolicy
metadata:
  name: cross-cluster-policy
  namespace: dr-prod
spec:
  drbdProtocol: C
  storageClassMappings:
    - primaryStorageClass: ocs-storagecluster-ceph-rbd
      drStorageClass: ocs-storagecluster-ceph-rbd
  primaryCluster:
    name: cluster1
    replicationEndpoint: "10.0.0.1:7000"
  drCluster:
    name: cluster2
    replicationEndpoint: "10.0.0.2:7000"
drbdProtocol: C sets synchronous replication (RPO = 0). Change this to A for asynchronous replication when inter-cluster latency makes Protocol C impractical.
Initiating a Test Failover
This manifest creates a test failover in an isolated namespace, leaving production replication undisturbed.
apiVersion: siterecovery.trilio.io/v1alpha1
kind: TestFailover
metadata:
  name: validate-web-tier
  namespace: default
spec:
  protectionGroupRef:
    name: web-tier
    namespace: default
  testNamespace: dr-test-isolated
  snapshotClassName: linstor-csi-snapshot-class
  verification:
    dataConsistency: true
  cleanupPolicy: Manual
  retentionTime: 2h
Apply to the DR cluster and monitor progress:
kubectl --kubeconfig ~/.kube/config-dr \
  get testfailover validate-web-tier -w
Expected phase progression:
NAME                PHASE
validate-web-tier   CreatingSnapshots
validate-web-tier   CreatingVolumes
validate-web-tier   CreatingVMs
validate-web-tier   VerifyingData
validate-web-tier   Succeeded
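For scripted DR drills, watching the phase by hand can be replaced with a polling loop. The following is a minimal sketch under stated assumptions: `wait_for_test_failover` and the `DR_KUBECONFIG` variable are hypothetical, the PHASE column is assumed to be backed by `.status.phase`, and `Failed` is an assumed terminal phase. Inspect a real resource with `-o yaml` to confirm the field path and phase names before using it.

```shell
# Poll a TestFailover on the DR cluster until it finishes.
# ASSUMPTIONS: wait_for_test_failover and DR_KUBECONFIG are hypothetical;
# the PHASE column is assumed to be backed by .status.phase, and "Failed"
# is an assumed terminal phase. Confirm both on a real resource.
wait_for_test_failover() {
  local name="$1" attempts="${2:-60}" interval="${3:-10}"
  local kubeconfig="${DR_KUBECONFIG:-$HOME/.kube/config-dr}"
  local i=1 phase
  while [ "$i" -le "$attempts" ]; do
    # An error from kubectl is treated as "phase not known yet".
    phase="$(kubectl --kubeconfig "$kubeconfig" get testfailover "$name" \
      -o jsonpath='{.status.phase}' 2>/dev/null)" || phase=""
    echo "attempt $i: phase=${phase:-unknown}"
    case "$phase" in
      Succeeded) return 0 ;;
      Failed) return 1 ;;
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timed out waiting for $name" >&2
  return 2
}
```

A call such as `wait_for_test_failover validate-web-tier 60 10` checks every 10 seconds for up to 10 minutes. It returns 0 on success, 1 on the assumed `Failed` phase, and 2 on timeout.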
When validation is complete, delete the resource to trigger cleanup:
kubectl --kubeconfig ~/.kube/config-dr \
  delete testfailover validate-web-tier
- Architecture Overview: Understand how the three cluster roles (primary, DR, quorum) relate to each other physically and how DRBD replication traffic flows between sites.
- Deployment Models: A detailed comparison of the LINSTOR and DRBD Operator deployment models, including topology diagrams and component responsibilities for each.
- Replication and Storage: Deep-dive on DRBD Protocol C and Protocol A trade-offs, storage class mapping, and how frontend PVCs integrate with KubeVirt.
- Protect a VM: Step-by-step procedures for creating a ProtectionRequest (DRBD Operator) or a ProtectionGroup (LINSTOR) and verifying that replication is active.
- Test Failover: Procedures for running and interpreting test failovers, including how to validate data consistency and clean up test resources.
- Failover and Failback: Operational runbooks for executing planned and unplanned failovers with pgctl, and for failing back to the primary cluster after recovery.
- Multi-Tenant Architecture: How a single quorum cluster supports multiple independent DR deployments using namespace isolation, and how to add or remove deployments.