Site Recovery for Kubernetes Virtual Machines
Concept

Key Concepts

Definitions of terms a new user must understand.


Overview

This page defines the core terms you need to understand before deploying and operating Site Recovery. The concepts here appear throughout the documentation, CLI output, and Kubernetes resources you will work with directly. Familiarity with these definitions will help you make informed decisions about deployment models, replication configuration, and recovery planning.


Content

Deployment Models

Site Recovery supports two distinct storage backends. Every deployment decision—cluster topology, which CRDs you create, and which controllers run where—flows from this choice.

The LINSTOR model uses a centralized LINSTOR controller hosted on the quorum cluster, with LINSTOR satellites running on worker nodes in the primary and DR clusters. This model requires exactly three clusters (primary, DR, and quorum) and is well-suited to large-scale deployments where centralized storage orchestration is preferred.

The DRBD Operator model installs the DRBD Operator directly on the primary and DR clusters. This model requires a minimum of two clusters and supports an optional quorum cluster for management plane isolation. It is commonly used with OpenShift Virtualization and simpler topologies.

When reading any procedure in this documentation, confirm which deployment model applies before following steps—instructions are not interchangeable between the two models.
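
The cluster requirements of the two models can be summarized as a small check. The following Python sketch is illustrative only: the function name validate_topology and the model labels "linstor" and "drbd-operator" are assumptions made for this example and are not part of Site Recovery.

```python
from __future__ import annotations


def validate_topology(model: str, has_quorum: bool) -> list[str]:
    """Return the cluster roles a deployment uses, or raise if the model's rules are violated.

    The model labels are hypothetical names chosen for this sketch.
    """
    roles = ["primary", "dr"]
    if model == "linstor":
        # LINSTOR needs exactly three clusters: the quorum cluster hosts the LINSTOR controller.
        if not has_quorum:
            raise ValueError("the LINSTOR model requires a quorum cluster")
        return roles + ["quorum"]
    if model == "drbd-operator":
        # Two clusters are the minimum; the quorum cluster is optional.
        return roles + (["quorum"] if has_quorum else [])
    raise ValueError(f"unknown deployment model: {model!r}")


print(validate_topology("drbd-operator", has_quorum=False))
```

A check like this is only a planning aid. It does not inspect real clusters.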


Cluster Roles

Primary Cluster

The active production environment where your KubeVirt virtual machines run. The primary cluster hosts Protection Group controllers (LINSTOR model) or the DRBD Operator (DRBD Operator model), and its worker nodes participate directly in DRBD block replication to the DR cluster.

DR Cluster

The standby environment that mirrors production data. VMs exist on the DR cluster in a stopped state, ready to start during a failover. Like the worker nodes in the primary cluster, DR worker nodes participate directly in DRBD replication traffic.

Quorum Cluster

A management-only cluster that hosts orchestration components: failover controllers and, in LINSTOR deployments, the LINSTOR controller. The quorum cluster does not run application workloads, and it does not relay or participate in DRBD storage replication; replication traffic flows directly between primary and DR worker nodes. Its role is to provide an isolated control plane that remains available even when one of the data clusters is impaired.

In LINSTOR deployments, the quorum cluster is mandatory. In DRBD Operator deployments, it is optional but recommended for management plane isolation.


Protection Constructs

Protection Zone

A named DR configuration that links a primary cluster to a DR cluster. The Protection Zone establishes the relationship between the two sites—encoding cluster identities, credentials, and connectivity details—so that higher-level constructs like Protection Groups know where to replicate data and where to fail over.

Protection Group

A logical grouping of virtual machines that are protected and failed over together as a unit. A Protection Group ensures that VMs with dependencies—such as a web server and its database—transition to the DR cluster in a coordinated sequence. All VMs in a Protection Group share the same failover lifecycle: they start, stop, and switch sides together.

In LINSTOR deployments, you declare a Protection Group using the ProtectionGroup CRD applied to the primary cluster. In DRBD Operator deployments, individual VMs are first enrolled via a ProtectionRequest, and then grouped.

ProtectionRequest (DRBD Operator deployments only)

A Kubernetes custom resource you create to enroll a specific VM in DR protection. When the protection controller processes a ProtectionRequest, it validates the VM and its PVCs, creates the necessary DRBDVolume resources, and migrates the VM's storage to DRBD-backed frontend PVCs. You create ProtectionRequest resources on the quorum cluster.


Storage Constructs

DRBDReplicationPolicy (DRBD Operator deployments only)

A custom resource that defines the cross-cluster replication configuration for DRBD Operator deployments. It specifies the DRBD protocol (synchronous or asynchronous), the replication endpoints for primary and DR clusters, and the storage class mappings that translate between storage classes available on each cluster. A single DRBDReplicationPolicy governs how all replicated volumes in a deployment communicate.

DRBDVolume (DRBD Operator deployments only)

A custom resource representing a single replicated block volume managed by the DRBD Operator. Each DRBDVolume corresponds to one PVC that has been brought under DRBD replication. The DRBD Operator creates and reconciles DRBDVolume resources automatically when processing a ProtectionRequest.

Frontend PVC

After a VM is enrolled in protection, its original PVC is replaced by a DRBD-backed PVC called the frontend PVC. From the VM's perspective, the frontend PVC is a normal PVC: the VM continues to read and write to it without modification. Beneath the surface, all writes flow through DRBD and are replicated to the DR cluster, synchronously or asynchronously depending on the configured replication protocol. The frontend PVC is what makes transparent replication possible without changes to the VM definition.


Operations

Failover

A cluster switchover that moves your workloads from the primary cluster to the DR cluster. Failover can be planned (initiated intentionally, for example during maintenance) or unplanned (triggered in response to a primary cluster outage). During failover, Site Recovery stops VMs on the primary (if reachable), promotes the DR-side DRBD volumes to primary role, and starts VMs on the DR cluster. You execute failovers using pgctl or the Site Manager UI.
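
The ordering of the failover steps described above can be sketched in a few lines of Python. This is a conceptual model only: the function name failover_steps and the step descriptions are made up for this example, and the sketch does not call pgctl or any Site Recovery API.

```python
from __future__ import annotations


def failover_steps(primary_reachable: bool) -> list[str]:
    """Return the ordered failover steps as plain descriptions (illustrative only)."""
    steps = []
    if primary_reachable:
        # Planned failover, or an outage where the primary still responds.
        steps.append("stop VMs on the primary cluster")
    steps.append("promote DR-side DRBD volumes to primary role")
    steps.append("start VMs on the DR cluster")
    return steps


for step in failover_steps(primary_reachable=False):
    print(step)
```

In an unplanned failover with an unreachable primary, the stop step is skipped and the sequence begins with promoting the DR-side volumes.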

Failback

The process of returning workloads to the original primary cluster after a failover. Failback follows a similar sequence to failover but in reverse: data that changed on the DR cluster while it was active is resynchronized back to the primary cluster before VMs are restarted there.

Test Failover

A non-disruptive DR validation that exercises your recovery capability without affecting production workloads. Test failover works by creating point-in-time snapshots of your replicated volumes on the DR cluster and starting temporary copies of your VMs from those snapshots—typically in an isolated namespace. Production replication continues uninterrupted throughout the test. When validation is complete, you delete the TestFailover resource and all temporary resources are cleaned up. Test failover is the recommended way to verify DR readiness on a regular schedule.


Recovery Objectives

RPO (Recovery Point Objective)

The maximum amount of data loss your organization can tolerate, measured as time. An RPO of zero means no data loss is acceptable—every write on the primary must be confirmed on the DR cluster before the application is notified of success. A non-zero RPO means you can tolerate losing some recent writes in a failure scenario.

RTO (Recovery Time Objective)

The maximum amount of downtime your organization can tolerate after a failure, measured as time. RTO encompasses the time from failure detection through VM restart on the DR cluster. Site Recovery targets automated failover completion in the range of minutes.
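
To make the two objectives concrete, the following Python sketch compares an achieved recovery point and recovery time against targets. All timestamps and targets are hypothetical values chosen for illustration, and the helper names are assumptions made for this example.

```python
from datetime import datetime, timedelta


def achieved_recovery_point(failure_at: datetime, last_replicated_write_at: datetime) -> timedelta:
    """Window of writes that never reached the DR cluster before the failure."""
    return max(failure_at - last_replicated_write_at, timedelta(0))


def achieved_recovery_time(failure_detected_at: datetime, vms_running_at: datetime) -> timedelta:
    """Time from failure detection through VM restart on the DR cluster."""
    return vms_running_at - failure_detected_at


# Hypothetical incident timeline.
failure = datetime(2024, 1, 1, 12, 0, 0)
last_replicated = failure - timedelta(seconds=2)
detected = failure + timedelta(seconds=30)
running = detected + timedelta(minutes=4)

rpo_target = timedelta(seconds=5)
rto_target = timedelta(minutes=10)

rpo_met = achieved_recovery_point(failure, last_replicated) <= rpo_target
rto_met = achieved_recovery_time(detected, running) <= rto_target
print(rpo_met, rto_met)
```

In this timeline, 2 seconds of writes are lost and recovery takes 4 minutes, so both targets are met.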


Replication Protocols

DRBD supports multiple replication protocols. The choice of protocol directly determines your RPO.

Protocol C (Synchronous Replication)

With Protocol C, a write operation on the primary is not acknowledged to the application until the data has been written to both the primary's local storage and the DR cluster's storage. This guarantees RPO = 0—no data can be lost in a failure because every committed write is already present on the DR side. Protocol C requires low-latency network connectivity between clusters (typically under 10 ms round-trip time) to avoid unacceptable write latency.

Protocol A (Asynchronous Replication)

With Protocol A, a write operation is acknowledged to the application as soon as it is written to the primary's local storage. Replication to the DR cluster happens asynchronously in the background. This yields near-zero RPO rather than a hard guarantee of zero—in a sudden failure, the most recent writes that had not yet been transmitted to the DR cluster may be lost. Protocol A tolerates higher network latency and is suitable for geographically distant clusters where Protocol C latency would be prohibitive.
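
The trade-off between the two protocols can be illustrated with a simplified model. The Python sketch below is not how DRBD is implemented. It assumes that local and remote writes proceed in parallel under Protocol C and that replication under Protocol A lags by a fixed delay. The function names and all numbers are hypothetical.

```python
from __future__ import annotations


def ack_latency_ms(protocol: str, local_write_ms: float, rtt_ms: float, remote_write_ms: float) -> float:
    """Approximate time until the application sees a write acknowledged."""
    if protocol == "C":
        # Acknowledged only after both the local and the DR-side write are done.
        return max(local_write_ms, rtt_ms + remote_write_ms)
    if protocol == "A":
        # Acknowledged as soon as the local write is done.
        return local_write_ms
    raise ValueError(f"unsupported protocol in this sketch: {protocol!r}")


def writes_lost(protocol: str, ack_times_ms: list[float], lag_ms: float, failure_ms: float) -> int:
    """Count acknowledged writes that had not reached the DR cluster at failure time."""
    if protocol == "C":
        return 0  # every acknowledged write is already on the DR side
    return sum(1 for t in ack_times_ms if t <= failure_ms < t + lag_ms)


print(ack_latency_ms("C", 1.0, 8.0, 1.0), ack_latency_ms("A", 1.0, 8.0, 1.0))
print(writes_lost("A", [0, 10, 20, 30], lag_ms=15, failure_ms=32))
```

With an 8 ms round trip, the modeled Protocol C acknowledgment takes 9 ms compared with 1 ms for Protocol A. In exchange, Protocol A loses the two most recent writes in the failure scenario shown.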


Multi-Tenant Concepts

Deployment

In the context of the quorum cluster's multi-tenant architecture, a deployment is an independent DR setup consisting of one primary cluster, one DR cluster, and a dedicated namespace on the quorum cluster (named dr-<deployment-name>). Each deployment has its own isolated failover controller and its own kubeconfig secrets. Multiple deployments can coexist on a single quorum cluster without interfering with one another—for example, dr-prod, dr-staging, and dr-dev running side by side.
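
Because the quorum namespace is derived from the deployment name, the deployment name must produce a valid Kubernetes namespace name (an RFC 1123 label of at most 63 characters). The Python sketch below shows one way to check this before creating a deployment. The helper name quorum_namespace is an assumption made for this example and is not part of pgctl.

```python
import re

# RFC 1123 label: lowercase alphanumerics and '-', starting and ending with an alphanumeric.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def quorum_namespace(deployment_name: str) -> str:
    """Derive the dr-<deployment-name> namespace and reject names Kubernetes would refuse."""
    namespace = f"dr-{deployment_name}"
    if len(namespace) > 63 or not _DNS_LABEL.fullmatch(namespace):
        raise ValueError(f"{namespace!r} is not a valid Kubernetes namespace name")
    return namespace


for name in ("prod", "staging", "dev"):
    print(quorum_namespace(name))
```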

pgctl

pgctl is the primary command-line tool for Site Recovery operations. You use it to manage Protection Groups, execute and monitor failover and failback operations, validate cluster connectivity, and administer multi-tenant deployments. Most operational workflows documented in this guide involve pgctl commands.


Examples

Declaring a Protection Group (LINSTOR model)

The following manifest groups two VMs together so they fail over as a unit. Apply this to the primary cluster.

apiVersion: siterecovery.trilio.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: web-tier
  namespace: default
spec:
  desiredState: running
  virtualMachines:
    - name: vm-web-server
    - name: vm-database
  replicationPolicy:
    type: synchronous
    remoteCluster: cluster2

After applying, verify the group is recognized:

./cmds/pgctl get pg web-tier

Expected output:

NAME       STATE     VMS   REPLICATION   SYNC
web-tier   running   2     synchronous   InSync
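
If you want to check this output in a script, a whitespace-delimited table like the one above is straightforward to parse. The Python sketch below assumes the column layout shown in the sample output. Actual pgctl output may differ between versions, so treat the column names as assumptions.

```python
from __future__ import annotations


def parse_table(text: str) -> list[dict[str, str]]:
    """Parse a whitespace-delimited table whose first non-empty line is the header."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


# Sample copied from the expected output above.
sample = """\
NAME       STATE     VMS   REPLICATION   SYNC
web-tier   running   2     synchronous   InSync
"""

rows = parse_table(sample)
out_of_sync = [row["NAME"] for row in rows if row["SYNC"] != "InSync"]
print(out_of_sync)
```

An empty list means every listed Protection Group reports InSync.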

Requesting VM Protection (DRBD Operator model)

The following manifest enrolls a single VM in DR protection. Apply this to the quorum cluster.

apiVersion: siterecovery.trilio.io/v1alpha1
kind: ProtectionRequest
metadata:
  name: protect-vm-database
  namespace: dr-prod
spec:
  vmName: vm-database
  vmNamespace: default
  sourceCluster: cluster1

Check the status after applying:

kubectl --kubeconfig ~/.kube/config-quorum \
  get protectionrequest protect-vm-database -n dr-prod -o yaml

The controller progresses through validation, DRBDVolume creation, and frontend PVC attachment. When complete, the VM's original PVC has been replaced by a DRBD-backed frontend PVC and replication is active.


Defining a DRBDReplicationPolicy (DRBD Operator model)

This policy configures synchronous (Protocol C) replication between two clusters and maps storage classes between sites.

apiVersion: siterecovery.trilio.io/v1
kind: DRBDReplicationPolicy
metadata:
  name: cross-cluster-policy
  namespace: dr-prod
spec:
  drbdProtocol: C
  storageClassMappings:
    - primaryStorageClass: ocs-storagecluster-ceph-rbd
      drStorageClass: ocs-storagecluster-ceph-rbd
  primaryCluster:
    name: cluster1
    replicationEndpoint: "10.0.0.1:7000"
  drCluster:
    name: cluster2
    replicationEndpoint: "10.0.0.2:7000"

drbdProtocol: C sets synchronous replication (RPO = 0). Change this to A for asynchronous replication when inter-cluster latency makes Protocol C impractical.
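
Before applying a policy, it can help to sanity-check the fields that are easy to mistype. The Python sketch below validates a spec that has already been loaded into a dictionary. It covers only the fields shown in the manifest above and only the two protocols this page describes. The function name validate_policy_spec and its rules are assumptions for illustration, not the controller's actual validation.

```python
from __future__ import annotations


def validate_policy_spec(spec: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if spec.get("drbdProtocol") not in ("A", "C"):
        problems.append("drbdProtocol must be 'A' or 'C'")
    for key in ("primaryCluster", "drCluster"):
        endpoint = spec.get(key, {}).get("replicationEndpoint", "")
        host, sep, port = endpoint.rpartition(":")
        if not (sep and host and port.isdecimal() and 0 < int(port) < 65536):
            problems.append(f"{key}.replicationEndpoint must look like host:port")
    if not spec.get("storageClassMappings"):
        problems.append("at least one storage class mapping is expected")
    return problems


spec = {
    "drbdProtocol": "C",
    "storageClassMappings": [
        {"primaryStorageClass": "ocs-storagecluster-ceph-rbd",
         "drStorageClass": "ocs-storagecluster-ceph-rbd"},
    ],
    "primaryCluster": {"name": "cluster1", "replicationEndpoint": "10.0.0.1:7000"},
    "drCluster": {"name": "cluster2", "replicationEndpoint": "10.0.0.2:7000"},
}
print(validate_policy_spec(spec))
```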


Initiating a Test Failover

This manifest creates a test failover in an isolated namespace, leaving production replication undisturbed.

apiVersion: siterecovery.trilio.io/v1alpha1
kind: TestFailover
metadata:
  name: validate-web-tier
  namespace: default
spec:
  protectionGroupRef:
    name: web-tier
    namespace: default
  testNamespace: dr-test-isolated
  snapshotClassName: linstor-csi-snapshot-class
  verification:
    dataConsistency: true
  cleanupPolicy: Manual
  retentionTime: 2h

Apply to the DR cluster and monitor progress:

kubectl --kubeconfig ~/.kube/config-dr \
  get testfailover validate-web-tier -w

Expected phase progression:

NAME                 PHASE
validate-web-tier    CreatingSnapshots
validate-web-tier    CreatingVolumes
validate-web-tier    CreatingVMs
validate-web-tier    VerifyingData
validate-web-tier    Succeeded
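
When you automate test failovers, you may want to confirm that the observed phases never move backwards. The Python sketch below checks a list of observed phases against the progression shown above. The phase names come from the sample output only. Other phases, such as failure states, may exist and are not modeled here.

```python
from __future__ import annotations

# Phase order taken from the expected progression above.
EXPECTED_PHASES = [
    "CreatingSnapshots",
    "CreatingVolumes",
    "CreatingVMs",
    "VerifyingData",
    "Succeeded",
]


def phases_in_order(observed: list[str]) -> bool:
    """True if the observed phases never regress; raises ValueError on an unknown phase."""
    indices = [EXPECTED_PHASES.index(phase) for phase in observed]
    return indices == sorted(indices)


print(phases_in_order(["CreatingSnapshots", "CreatingSnapshots", "CreatingVMs", "Succeeded"]))
```

Repeated or skipped phases are accepted, because a watch can emit the same phase more than once or miss a short-lived one.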

When validation is complete, delete the resource to trigger cleanup:

kubectl --kubeconfig ~/.kube/config-dr \
  delete testfailover validate-web-tier

Related concepts
  • Architecture Overview — Understand how the three cluster roles (primary, DR, quorum) relate to each other physically and how DRBD replication traffic flows between sites.
  • Deployment Models — A detailed comparison of the LINSTOR and DRBD Operator deployment models, including topology diagrams and component responsibilities for each.
  • Replication and Storage — Deep-dive on DRBD Protocol C and Protocol A trade-offs, storage class mapping, and how frontend PVCs integrate with KubeVirt.
  • Protect a VM — Step-by-step procedures for creating a ProtectionRequest (DRBD Operator) or a ProtectionGroup (LINSTOR) and verifying that replication is active.
  • Test Failover — Procedures for running and interpreting test failovers, including how to validate data consistency and clean up test resources.
  • Failover and Failback — Operational runbooks for executing planned and unplanned failovers with pgctl, and for failing back to the primary cluster after recovery.
  • Multi-Tenant Architecture — How a single quorum cluster supports multiple independent DR deployments using namespace isolation, and how to add or remove deployments.