Site Recovery for Kubernetes Virtual Machines
Guide

ProtectionRequest

Request VM protection (DRBD Operator deployments)


Overview

A ProtectionRequest (short names: pr, protect) is the primary mechanism in a DRBD Operator deployment for placing a single KubeVirt VM under DRBD-backed disaster recovery protection. When you create a ProtectionRequest, the Protection Controller, running on the quorum cluster, creates DRBDVolume resources for each of the VM's PVCs and waits for the initial data synchronization to the DR cluster to finish while the VM keeps running. It then stops the VM, switches it to DRBD-backed frontend PVCs, and restarts it. From that point on, every write the VM makes is synchronously replicated to the DR cluster via DRBD Protocol C, giving you an RPO of zero. This resource is specific to DRBD Operator deployments; if you are using a LINSTOR-based deployment, use a ProtectionGroup instead.


Prerequisites

Before creating a ProtectionRequest, confirm the following are in place:

  • Deployment model: You are running a DRBD Operator deployment (two or three clusters). This resource does not apply to LINSTOR deployments.
  • Clusters: A primary workload cluster and a DR workload cluster are deployed and reachable. A quorum cluster is strongly recommended for management plane isolation; the Protection Controller must be running there.
  • DRBD Operator: Installed and healthy on both the primary and DR workload clusters.
  • DRBDReplicationPolicy: A DRBDReplicationPolicy exists in the same namespace as the ProtectionRequest, with storage class mappings covering the VM's PVCs.
  • Controllers: The Protection Controller (src/protection-controller.py) is deployed to the quorum cluster and has kubeconfig secrets for both workload clusters (primary-kubeconfig, dr-kubeconfig).
  • VM: The target VM exists on the primary cluster, is reachable via the kubeconfig secret, and all of its PVCs use storage classes covered by your DRBDReplicationPolicy.
  • Kubernetes: 1.28+ on all clusters; OpenShift 4.17+ is supported.
  • Network: Less than 10 ms round-trip latency between primary and DR cluster nodes on DRBD replication ports (7000–7999). DRBD replication traffic flows directly between primary and DR cluster nodes — the quorum cluster does not relay it.
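The latency requirement can be spot-checked before you create a ProtectionRequest. The following is a minimal sketch, not part of the product: rtt_ok is a hypothetical helper that compares a measured average round-trip time in milliseconds against the 10 ms limit stated above. How you obtain the measurement (for example, from ping between a primary node and a DR node) is left to you.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the product): succeed when a measured
# average RTT in milliseconds is below a limit (default 10 ms, the value
# stated in the prerequisites).
rtt_ok() {
  rtt_ms=$1
  limit_ms=${2:-10}
  awk -v rtt="$rtt_ms" -v limit="$limit_ms" 'BEGIN { exit !(rtt + 0 < limit + 0) }'
}

# Example with a literal value; in practice, substitute the average RTT you
# measured from a primary cluster node to a DR cluster node.
if rtt_ok 4.2; then
  echo "latency OK"
else
  echo "latency too high for synchronous replication"
fi
```

A passing check only covers latency; it says nothing about whether the DRBD ports themselves are reachable.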

Installation

The ProtectionRequest CRD is installed as part of the DRBD Operator deployment. If you have already run the Ansible playbooks for your DRBD Operator deployment, the CRD is present on the quorum cluster. To verify:

kubectl --kubeconfig $KUBECONFIG_QUORUM get crd protectionrequests.siterecovery.trilio.io

If the CRD is missing, re-run the Ansible infrastructure playbook targeting the quorum cluster:

ansible-playbook ansible/deploy-dr-deployment.yml \
  --extra-vars "target_cluster=quorum"

Confirm the Protection Controller pod is running on the quorum cluster:

kubectl --kubeconfig $KUBECONFIG_QUORUM get pods \
  -n <your-dr-namespace> \
  -l app=protection-controller

The controller must be in Running state with no crash loops before you create a ProtectionRequest.
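A pod can report the Running phase while its container is repeatedly restarting, so it can help to look at restart counts as well. The sketch below assumes that a non-zero restart count is a reasonable proxy for a crash loop; controller_healthy is a hypothetical helper, and the threshold is an illustration rather than a product rule.

```shell
#!/bin/sh
# Hypothetical helper: read "pod-name phase restart-count" lines on stdin and
# succeed only if at least one pod is listed, every pod is Running, and no
# restart count exceeds max_restarts (default 0).
controller_healthy() {
  max_restarts=${1:-0}
  awk -v max="$max_restarts" '
    NF >= 3 { seen++; if ($2 != "Running" || $3 + 0 > max + 0) bad++ }
    END { if (seen > 0 && bad == 0) exit 0; exit 1 }
  '
}

# Intended usage against the quorum cluster (requires a live cluster):
#   kubectl --kubeconfig "$KUBECONFIG_QUORUM" get pods -n <your-dr-namespace> \
#     -l app=protection-controller \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{" "}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
#     | controller_healthy && echo "controller healthy"
```

Pass a higher threshold (for example, controller_healthy 2) if the pod has restarted for known, benign reasons.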


Configuration

A ProtectionRequest manifest requires the following fields:

Field                Required  Description
metadata.name        Yes       Name for this protection request. Typically descriptive of the VM being protected.
metadata.namespace   Yes       The namespace on the quorum cluster where DR resources are managed (e.g., dr-deployment).
spec.vmName          Yes       Name of the VirtualMachine resource on the primary cluster.
spec.vmNamespace     Yes       Namespace of the VM on the primary cluster.
spec.sourceCluster   Yes       Identifier for the primary cluster, matching the cluster name in your DRBDReplicationPolicy.

The target DR cluster is derived from the DRBDReplicationPolicy in the same namespace — you do not specify it directly in the ProtectionRequest.
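Because each ProtectionRequest covers a single VM, protecting several VMs means writing several near-identical manifests. The sketch below generates them from the five required fields. render_protection_request is a hypothetical helper, and the protect-<vm> naming scheme is only a convention borrowed from the examples in this guide.

```shell
#!/bin/sh
# Hypothetical helper: print a ProtectionRequest manifest built from the
# required fields. Arguments: VM name, VM namespace, source cluster, and
# (optionally) the DR namespace on the quorum cluster.
render_protection_request() {
  vm_name=$1; vm_namespace=$2; source_cluster=$3; dr_namespace=${4:-dr-deployment}
  cat <<EOF
apiVersion: siterecovery.trilio.io/v1alpha1
kind: ProtectionRequest
metadata:
  name: protect-${vm_name}
  namespace: ${dr_namespace}
spec:
  vmName: ${vm_name}
  vmNamespace: ${vm_namespace}
  sourceCluster: ${source_cluster}
EOF
}

# Emit one manifest per VM, separated by YAML document markers.
for vm in web-server database; do
  echo "---"
  render_protection_request "$vm" production cluster1
done
```

The output can be reviewed first and then piped to kubectl --kubeconfig $KUBECONFIG_QUORUM apply -f - once it looks right.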

Status phases (written by the controller, read-only for you):

Phase            Meaning
Pending          Resource created; controller has not yet begun processing.
Validating       Controller is verifying the VM exists, its PVCs are accessible, and a matching DRBDReplicationPolicy covers all storage classes.
CreatingDRBD     Controller is creating DRBDVolume resources for each PVC on both workload clusters.
Syncing          DRBD initial sync (full data copy) is in progress from primary to DR.
ReadyToActivate  Initial sync is complete; all data is consistent across clusters.
Activating       Controller is stopping the VM and switching it to DRBD-backed frontend PVCs.
Protected        VM is running on DRBD-backed PVCs; synchronous replication is active.

A terminal Failed phase may also appear if the controller encounters an unrecoverable error. Inspect .status.message and the Protection Controller logs for details.
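Scripts that react to a ProtectionRequest usually only need to know whether a phase is finished, failed, or still in progress. The sketch below groups the phase names listed above accordingly; classify_phase is a hypothetical helper, and any phase not documented here is reported as unknown rather than guessed at.

```shell
#!/bin/sh
# Hypothetical helper: classify a ProtectionRequest phase name.
# Prints "done", "failed", "in-progress", or "unknown".
classify_phase() {
  case "$1" in
    Protected) echo "done" ;;
    Failed) echo "failed" ;;
    Pending|Validating|CreatingDRBD|Syncing|ReadyToActivate|Activating)
      echo "in-progress" ;;
    *) echo "unknown" ;;
  esac
}

for phase in Pending Syncing Protected Failed; do
  printf '%-10s %s\n' "$phase" "$(classify_phase "$phase")"
done
```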


Usage

You manage ProtectionRequest resources with kubectl against the quorum cluster. The DRBD Operator's Protection Controller watches for new ProtectionRequest objects and drives the VM through the protection workflow automatically.

List all protection requests in a namespace:

kubectl --kubeconfig $KUBECONFIG_QUORUM get protectionrequest -n dr-deployment
# or using short names:
kubectl --kubeconfig $KUBECONFIG_QUORUM get pr -n dr-deployment
kubectl --kubeconfig $KUBECONFIG_QUORUM get protect -n dr-deployment

Watch protection progress in real time:

kubectl --kubeconfig $KUBECONFIG_QUORUM get pr -n dr-deployment -w

Inspect a specific request for detailed status:

kubectl --kubeconfig $KUBECONFIG_QUORUM get pr protect-my-vm -n dr-deployment -o yaml

Pay attention to .status.phase and .status.message. During Syncing, the controller will surface sync progress information in .status.conditions if available.
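In automation, watching interactively is not practical, so a polling loop on .status.phase is one option. The sketch below is an illustration under stated assumptions: wait_for_protected is a hypothetical helper, the default timeout and interval are arbitrary, and the interval must be greater than zero for the timeout to advance.

```shell
#!/bin/sh
# Hypothetical helper: poll .status.phase until the request is Protected
# (return 0), Failed (return 1), or the timeout elapses (return 2).
# Arguments: name, namespace, timeout in seconds, poll interval in seconds.
wait_for_protected() {
  name=$1; namespace=${2:-dr-deployment}; timeout=${3:-1800}; interval=${4:-10}
  elapsed=0
  while :; do
    phase=$(kubectl --kubeconfig "$KUBECONFIG_QUORUM" get pr "$name" \
      -n "$namespace" -o jsonpath='{.status.phase}' 2>/dev/null)
    echo "phase=${phase:-<none>} elapsed=${elapsed}s"
    case "$phase" in
      Protected) return 0 ;;
      Failed) return 1 ;;
    esac
    if [ "$elapsed" -ge "$timeout" ]; then
      return 2
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Intended usage (requires a live cluster):
#   wait_for_protected protect-my-vm dr-deployment 3600 15
```

Unlike a plain watch, this returns as soon as the request fails, so a calling script can go straight to .status.message and the controller logs.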

What the controller does on your behalf (you do not need to perform these steps manually):

  1. Validates the VM and all of its PVCs on the primary cluster.
  2. Confirms a DRBDReplicationPolicy covers every storage class in use.
  3. Creates DRBDVolume resources on both workload clusters for each PVC.
  4. Waits for DRBD initial synchronization to complete (the VM remains running during this phase).
  5. Stops the VM, replaces its PVCs with DRBD-backed frontend PVCs, and restarts it.
  6. Sets the phase to Protected once the VM is running on replicated storage.

All API calls from the controller flow from the quorum cluster outward to the workload clusters. The workload clusters have no direct knowledge of the quorum cluster.


Examples

Example 1: Protect a single VM

Create a manifest for a VM named web-server running in the production namespace on cluster1:

apiVersion: siterecovery.trilio.io/v1alpha1
kind: ProtectionRequest
metadata:
  name: protect-web-server
  namespace: dr-deployment
spec:
  vmName: web-server
  vmNamespace: production
  sourceCluster: cluster1

Apply it to the quorum cluster:

kubectl --kubeconfig $KUBECONFIG_QUORUM apply -f protect-web-server.yaml

Expected progression when you watch:

NAME                   PHASE             AGE
protect-web-server     Pending           0s
protect-web-server     Validating        3s
protect-web-server     CreatingDRBD      12s
protect-web-server     Syncing           30s
protect-web-server     ReadyToActivate   4m
protect-web-server     Activating        4m15s
protect-web-server     Protected         4m45s

The Syncing phase duration depends on the size of the VM's disks and available network bandwidth between clusters.
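A rough lower bound for the Syncing phase follows from dividing the data to copy by the sustained throughput. The sketch below assumes the full provisioned disk size is transferred and ignores protocol overhead, competing traffic, and disk speed, so real syncs will take longer. estimate_sync_seconds is a hypothetical helper.

```shell
#!/bin/sh
# Rough lower bound for initial sync time: bits to copy / sustained bit rate.
# Arguments: total disk size in GiB, usable bandwidth in Mbit/s (must be > 0).
# Assumes the full provisioned size is copied; ignores all overhead.
estimate_sync_seconds() {
  awk -v gib="$1" -v mbps="$2" 'BEGIN {
    bits = gib * 1024 * 1024 * 1024 * 8
    printf "%d\n", bits / (mbps * 1000000)
  }'
}

# 100 GiB of disks over a sustained 1000 Mbit/s link
secs=$(estimate_sync_seconds 100 1000)
echo "estimated minimum sync time: ${secs}s (~$((secs / 60)) min)"
```

If a sync runs far beyond this kind of estimate, the Troubleshooting section on slow or stalled syncs is the place to start.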


Example 2: Inspect detailed status during sync

kubectl --kubeconfig $KUBECONFIG_QUORUM get pr protect-web-server \
  -n dr-deployment -o yaml

Expected .status block while syncing:

status:
  phase: Syncing
  message: "DRBD initial sync in progress for 2 volumes"

Example 3: List all protected VMs across a deployment namespace

kubectl --kubeconfig $KUBECONFIG_QUORUM get pr -n dr-deployment

Example output:

NAME                    PHASE       AGE
protect-web-server      Protected   2d
protect-database        Protected   2d
protect-cache           Syncing     5m
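
With many requests in a namespace, a per-phase count is often easier to read than the full list. The sketch below summarizes the default table output shown above; summarize_phases is a hypothetical helper and assumes the NAME, PHASE, AGE column layout with a single header row (so it is not suited to -w watch output).

```shell
#!/bin/sh
# Hypothetical helper: count ProtectionRequests per phase from the default
# `kubectl get pr` table on stdin (header row, then NAME PHASE AGE columns).
summarize_phases() {
  awk 'NR > 1 && NF >= 2 { count[$2]++ }
       END { for (p in count) printf "%s %d\n", p, count[p] }' | sort
}

# Demonstration using the example output above as input.
summarize_phases <<'EOF'
NAME                    PHASE       AGE
protect-web-server      Protected   2d
protect-database        Protected   2d
protect-cache           Syncing     5m
EOF
```

Against a live cluster, the same function can be fed from kubectl --kubeconfig $KUBECONFIG_QUORUM get pr -n dr-deployment.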

Example 4: Check what DRBDVolumes were created for a protected VM

Once the ProtectionRequest reaches Protected, you can inspect the resulting DRBDVolume resources on the quorum cluster:

kubectl --kubeconfig $KUBECONFIG_QUORUM get drbdvolumes \
  -n dr-deployment \
  -l siterecovery.trilio.io/protected-vm=web-server

Troubleshooting

Issue: ProtectionRequest stays in Pending indefinitely

Symptom: Phase does not advance past Pending after several minutes.

Likely cause: The Protection Controller is not running or is not watching the namespace.

Fix:

# Check controller pod status on the quorum cluster
kubectl --kubeconfig $KUBECONFIG_QUORUM get pods \
  -n dr-deployment -l app=protection-controller

# View controller logs
kubectl --kubeconfig $KUBECONFIG_QUORUM logs \
  -n dr-deployment -l app=protection-controller --tail=50

If the pod is not present, redeploy it:

kubectl --kubeconfig $KUBECONFIG_QUORUM apply \
  -f deploy/crds/protection-controller-deployment.yaml

Issue: Validating phase fails — VM or PVCs not found

Symptom: Phase transitions to Failed; .status.message references a missing VM or PVC.

Likely cause: spec.vmName, spec.vmNamespace, or spec.sourceCluster does not match the actual VM, or the primary-kubeconfig secret does not have access to that cluster/namespace.

Fix: Verify the VM name and namespace directly on the primary cluster:

kubectl --kubeconfig $KUBECONFIG_CLUSTER1 get vm \
  -n <vmNamespace> <vmName>

Also confirm the kubeconfig secret is present on the quorum cluster:

kubectl --kubeconfig $KUBECONFIG_QUORUM get secret primary-kubeconfig \
  -n dr-deployment

Issue: Validating phase fails — no matching DRBDReplicationPolicy

Symptom: Phase transitions to Failed; message indicates a storage class is not covered.

Likely cause: One or more of the VM's PVCs uses a storage class not listed in any DRBDReplicationPolicy in the same namespace.

Fix: Check which storage classes the VM's PVCs use:

kubectl --kubeconfig $KUBECONFIG_CLUSTER1 get pvc \
  -n <vmNamespace> -o wide

Then inspect your DRBDReplicationPolicy and add any missing storageClassMappings:

kubectl --kubeconfig $KUBECONFIG_QUORUM get drbdreplicationpolicies \
  -n dr-deployment -o yaml
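
Comparing the two lists by eye is error-prone when a VM has several disks. The sketch below prints the storage classes that are in use but not covered. uncovered_classes is a hypothetical helper, the class names in the demonstration are invented, and extracting the covered list from the policy is left to you because the policy schema is not reproduced here.

```shell
#!/bin/sh
# Hypothetical helper: print storage classes that appear in the first list
# (classes used by PVCs) but not in the second (classes mapped by the policy).
# Each argument is a whitespace-separated list of class names.
uncovered_classes() {
  used=$1; covered=$2
  for sc in $used; do
    found=no
    for c in $covered; do
      if [ "$sc" = "$c" ]; then found=yes; fi
    done
    if [ "$found" = no ]; then echo "$sc"; fi
  done | sort -u
}

# Demonstration with made-up class names: local-path is used but not covered.
uncovered_classes "ceph-rbd local-path ceph-rbd" "ceph-rbd"

# The "used" list can be gathered from the primary cluster, for example:
#   kubectl --kubeconfig "$KUBECONFIG_CLUSTER1" get pvc -n <vmNamespace> \
#     -o jsonpath='{.items[*].spec.storageClassName}'
# Note that this lists every PVC in the namespace, not only the VM's.
```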

Issue: Syncing phase takes much longer than expected or never completes

Symptom: Phase stays in Syncing for hours; expected sync time based on disk size and bandwidth is much lower.

Likely cause: Network latency or bandwidth between primary and DR cluster nodes is insufficient. DRBD replication runs directly between workload cluster nodes — if those nodes cannot reach each other on ports 7000–7999, sync stalls.

Fix: Verify DRBD port connectivity between a primary cluster node and a DR cluster node:

# From a node on the primary cluster:
nc -zv <dr-node-ip> 7000
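
A single port may not be representative, since each DRBD resource listens on its own port within the range. The sketch below probes a list of ports you supply; check_drbd_ports is a hypothetical helper, and a port with no DRBD resource listening on it will also show as unreachable, so pass only ports you know are in use.

```shell
#!/bin/sh
# Hypothetical helper: probe TCP ports on a DR node and print the ones that
# are not reachable. Returns non-zero if any port is unreachable.
# Arguments: host, then one or more port numbers.
check_drbd_ports() {
  host=$1; shift
  failed=0
  for port in "$@"; do
    # -z: connect without sending data; -w 3: give up after 3 seconds
    if ! nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
      echo "unreachable: ${host}:${port}"
      failed=1
    fi
  done
  return "$failed"
}

# Intended usage from a node on the primary cluster:
#   check_drbd_ports <dr-node-ip> 7000 7001 7002
```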

Check DRBD resource status on the primary cluster nodes:

kubectl --kubeconfig $KUBECONFIG_CLUSTER1 exec -n <drbd-namespace> \
  <drbd-node-agent-pod> -- drbdadm status

Issue: VM does not restart after Activating phase

Symptom: Phase reaches Activating but the VM does not come back up; phase may regress or show Failed.

Likely cause: The frontend PVC swap encountered an error, or the node where the VM was scheduled does not yet have DRBD resources up (the node agent applies a NoSchedule taint during startup).

Fix: Check node taints on primary cluster nodes:

kubectl --kubeconfig $KUBECONFIG_CLUSTER1 get nodes \
  -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints'

If siterecovery.trilio.io/not-ready:NoSchedule is present on the target node, wait for the DRBD node agent to complete its startup sequence and remove the taint. This taint is automatically removed once all DRBD resources on that node are in a Connected state (up to a 120-second timeout). Also inspect the Protection Controller logs for the specific error encountered during the PVC swap.
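
On clusters with many nodes, the custom-columns output can be hard to scan. The sketch below filters a node list down to the nodes that still carry the taint; nodes_with_not_ready_taint is a hypothetical helper that matches on the taint key only and expects tab-separated input as produced by the jsonpath expression in the comment.

```shell
#!/bin/sh
# Hypothetical helper: read "node-name<TAB>space-separated taint keys" lines
# on stdin and print the nodes that still carry the not-ready taint.
nodes_with_not_ready_taint() {
  awk -F '\t' '{
    n = split($2, keys, " ")
    for (i = 1; i <= n; i++)
      if (keys[i] == "siterecovery.trilio.io/not-ready") { print $1; break }
  }'
}

# Intended usage against the primary cluster (requires a live cluster):
#   kubectl --kubeconfig "$KUBECONFIG_CLUSTER1" get nodes \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}' \
#     | nodes_with_not_ready_taint
```

An empty result means no node is currently tainted, in which case the Protection Controller logs are the next place to look.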