Site Recovery for Kubernetes Virtual Machines
Guide

Custom Resource Definitions (CRDs)

Complete reference for all Site Recovery CRDs, the declarative API for managing disaster recovery. CRDs are the primary interface for automating DR operations via kubectl, GitOps, or any Kubernetes-native tooling.


Overview

Site Recovery exposes its entire control plane as Kubernetes Custom Resource Definitions (CRDs), making disaster recovery a fully declarative, API-driven operation. You interact with the system by creating, updating, and deleting Custom Resources (CRs) — controllers watch those resources and continuously reconcile actual state toward your desired state. This architecture enables GitOps workflows (Argo CD, Flux), kubectl-based automation, and integration with any Kubernetes-native tooling without requiring proprietary clients. Two API groups organize the CRDs: dr.linstor.io for DR orchestration resources and drbd.io for storage replication resources.


Prerequisites

Before working with Site Recovery CRDs, ensure the following are in place:

  • Kubernetes clusters provisioned — at minimum, a primary cluster and a DR cluster. For LINSTOR deployments, a third quorum cluster is required; for DRBD Operator deployments, a quorum cluster is optional but recommended.
  • Infrastructure deployed via Ansible — DRBD, LINSTOR or DRBD Operator, and all controllers (Protection Group controller, Failover controller) must be deployed to your clusters before any CRs will reconcile.
  • Site Manager UI deployed via Helm to the quorum cluster (management plane).
  • kubectl configured with kubeconfig contexts for each cluster you intend to manage.
  • pgctl installed — the primary CLI for Protection Group management, failover operations, and multi-tenant deployment management.
  • RBAC permissions to create, patch, and delete custom resources in the dr.linstor.io and drbd.io API groups in the namespaces where you will operate.
  • LINSTOR-backed PVCs — all VMs you intend to protect must use PVCs provisioned by linstor.csi.linbit.com with placementCount >= 2. Single-replica or non-LINSTOR PVCs are rejected at Protection Group creation time.

Installation

CRD schemas are applied to your clusters as part of the Ansible-driven infrastructure deployment. If you need to apply or refresh them manually, follow these steps.

Step 1 — Apply the Protection Group CRD

Apply the CRD manifest to every cluster that will host the Protection Group controller (primary, DR, and quorum for LINSTOR; primary and DR at minimum for DRBD Operator):

kubectl apply -f protection-group-crd.yaml

Step 2 — Apply the Failover CRD

kubectl apply -f failover-crd.yaml

Step 3 — Verify CRD registration

Confirm both CRDs are registered before creating any Custom Resources:

kubectl get crds | grep -E 'dr\.linstor\.io|drbd\.io'

Expected output (names will vary by release):

protectiongroups.dr.linstor.io       2025-01-01T00:00:00Z
failovers.dr.linstor.io              2025-01-01T00:00:00Z
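
The listing above only shows that the CRD names are registered. A stricter check reads the Established condition that the API server records on each CRD. The following is a minimal sketch, assuming the input is the text of `kubectl get crds -o json`; the helper name `unestablished_crds` is hypothetical, and the expected names are copied from the sample output above and may differ in your release.

```python
import json

# Names copied from the sample output above; adjust them for your release.
EXPECTED_CRDS = ["protectiongroups.dr.linstor.io", "failovers.dr.linstor.io"]


def unestablished_crds(crd_list_json, expected=EXPECTED_CRDS):
    """Return the expected CRD names that are missing or not yet Established.

    crd_list_json is the text produced by `kubectl get crds -o json`.
    """
    items = json.loads(crd_list_json).get("items", [])
    established = set()
    for crd in items:
        conditions = crd.get("status", {}).get("conditions", [])
        # A CRD is usable once its Established condition is "True".
        if any(c.get("type") == "Established" and c.get("status") == "True"
               for c in conditions):
            established.add(crd["metadata"]["name"])
    return [name for name in expected if name not in established]
```

An empty result means every expected CRD is ready to accept Custom Resources.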

Step 4 — Deploy the controllers

Controllers must be running for CRs to reconcile. Apply the controller deployment manifests:

kubectl apply -f protection-group-controller-deployment.yaml
kubectl apply -f failover-controller-deployment.yaml

Step 5 — Verify controllers are running

kubectl get pods -l app=protection-group-controller
kubectl get pods -l app=failover-controller

Both pods should be in the Running state before you create any CRs.


Configuration

Protection Group (dr.linstor.io)

The Protection Group CR is the primary unit of VM protection. It declares which VMs belong together as a logical DR unit and what their desired lifecycle state is.

| Field | Type | Default | Valid Values | Effect |
| --- | --- | --- | --- | --- |
| spec.virtualMachines | list | - | List of {name, namespace} objects | Declares which KubeVirt VMs are members of this group. All listed VMs are managed as a unit during failover. |
| spec.desiredState | string | running | running, stopped | The target lifecycle state for all VMs in the group. The Protection Group controller reconciles actual VM state to match. Setting this to stopped is how the Failover controller halts VMs on the source cluster without directly patching VM objects. |
| status.currentState | string | unknown | running, stopped, mixed, unknown | Reflects the actual observed state. mixed means reconciliation is in progress. The Failover controller reads this field to determine when it is safe to proceed to the next failover phase. |
| status.replicationHealth | string | - | Healthy, Degraded, Unknown | Reports whether DRBD replication for all member PVCs is synchronized. |
| status.warnings | list | - | Strings | Non-blocking advisory messages, such as missing replicasOnDifferent geo-placement rules on a storage class. |
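
One way to read the status.currentState values is as an aggregate over the member VMs. The sketch below is an assumption about how such an aggregate could be computed, not the controller's actual logic: `aggregate_state` is a hypothetical helper, and treating an empty or partly unobserved group as unknown is a guess.

```python
def aggregate_state(vm_states):
    """Collapse per-VM states ("running", "stopped", "unknown") into one group state."""
    states = set(vm_states)
    if not states or "unknown" in states:
        return "unknown"   # nothing observed yet, or at least one VM unobserved
    if states == {"running"}:
        return "running"
    if states == {"stopped"}:
        return "stopped"
    return "mixed"          # members disagree: reconciliation still in progress
```

Under this reading, a failover can only advance past the source-side stop once every member VM reports stopped.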

Geo-replication validation is enforced at create and update time. The controller rejects any Protection Group whose member VMs have PVCs that do not meet these requirements:

  • Storage class provisioner must be linstor.csi.linbit.com
  • placementCount must be >= 2

Missing replicasOnDifferent parameters generate warnings but do not block creation.
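
The split between blocking errors and non-blocking warnings can be sketched as follows. This is an illustration under stated assumptions rather than the controller's code: `validate_storage_class` is a hypothetical helper, the parameter keys are written exactly as they appear in this guide, and a missing placementCount is assumed to mean 1.

```python
REQUIRED_PROVISIONER = "linstor.csi.linbit.com"
MIN_PLACEMENT_COUNT = 2


def validate_storage_class(name, provisioner, parameters):
    """Return (errors, warnings) for one storage class.

    Errors block Protection Group creation; warnings do not.
    """
    errors, warnings = [], []
    if provisioner != REQUIRED_PROVISIONER:
        errors.append(f"{name}: provisioner {provisioner!r} is not {REQUIRED_PROVISIONER}")
    # Storage class parameters are strings, so convert before comparing.
    # Assumption: an absent placementCount means a single replica.
    try:
        count = int(parameters.get("placementCount", "1"))
    except (TypeError, ValueError):
        count = 0
    if count < MIN_PLACEMENT_COUNT:
        errors.append(
            f"{name}: placementCount={count}, minimum {MIN_PLACEMENT_COUNT} required")
    if not parameters.get("replicasOnDifferent"):
        warnings.append(f"{name}: no replicasOnDifferent geo-placement rule")
    return errors, warnings
```

A storage class with placementCount 3 and no replicasOnDifferent rule would return no errors and one warning, so the Protection Group would still be accepted.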


Failover CR (dr.linstor.io)

The Failover CR is how you trigger a failover operation. Creating this resource is the only correct way to initiate failover — the controller orchestrates all subsequent steps by patching Protection Group desiredState fields. It never directly patches VM objects.

| Field | Type | Default | Valid Values | Effect |
| --- | --- | --- | --- | --- |
| spec.protectionGroup | string | - | Any valid Protection Group name | Identifies which Protection Group to fail over. |
| spec.targetCluster | string | - | Cluster identifier | The cluster where VMs should run after failover completes. |
| spec.failoverType | string | planned | planned, emergency | Controls pre-failover validation stringency. |
| spec.force | bool | false | true, false | Bypasses concurrent-safety checks that prevent taint removal when other Protection Groups have running VMs on the target cluster. Use only in emergencies and only after coordinating with other operators. |
| status.phase | string | Pending | Pending, StoppingOnSource, StartingOnTarget, Completed, Failed | Current phase of the failover state machine. |
| status.state | string | - | Completed, Failed | Terminal state. Poll or use kubectl wait on this field. |

Failover state machine phases:

  1. Pending — CR created, controller analyzing source/target.
  2. StoppingOnSource — Controller has patched source Protection Group desiredState=stopped; waiting for currentState=stopped.
  3. StartingOnTarget — Source VMs confirmed stopped; DRBD taint safety check performed; controller has patched target Protection Group desiredState=running.
  4. Completed — Target Protection Group currentState=running confirmed.
  5. Failed — Failover failed after retries.
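
The phases above can be modeled as a small transition function. This is a simplified sketch, not the controller's implementation: `next_phase` and `is_terminal` are hypothetical names, retries are omitted, and the assumption that Failed is reachable from every non-terminal phase is not stated in this guide.

```python
# Allowed transitions, following the phase list above. Treating Failed as
# reachable from any non-terminal phase is an assumption of this sketch.
TRANSITIONS = {
    "Pending": {"StoppingOnSource", "Failed"},
    "StoppingOnSource": {"StartingOnTarget", "Failed"},
    "StartingOnTarget": {"Completed", "Failed"},
    "Completed": set(),
    "Failed": set(),
}


def next_phase(phase, source_state, target_state):
    """Return the next phase given the observed Protection Group currentState values."""
    if phase == "Pending":
        return "StoppingOnSource"
    if phase == "StoppingOnSource" and source_state == "stopped":
        return "StartingOnTarget"
    if phase == "StartingOnTarget" and target_state == "running":
        return "Completed"
    return phase  # no progress yet, or already terminal


def is_terminal(phase):
    """Completed and Failed have no outgoing transitions."""
    return not TRANSITIONS[phase]
```

The function only advances when the observed state confirms the previous step, which mirrors the wait-then-proceed behavior described for StoppingOnSource and StartingOnTarget.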

Concurrent failover safety: The controller acquires a per-Protection Group Kubernetes Lease (failover-lock-{pg-name}) before beginning. A second Failover CR targeting the same Protection Group will fail to acquire the lock and report an error. Concurrent failovers of different Protection Groups are permitted, but the safety check on taint removal (which is node-level, not per-resource) may block a second failover if the first group's VMs are already running on the target cluster nodes.
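
The per-group lock can be illustrated with an in-memory model. This sketch does not talk to the Kubernetes API: `LeaseTable` is a hypothetical stand-in for Lease objects, and the expiry duration is a constructor parameter chosen by the caller. Only the lease name pattern failover-lock-{pg-name} comes from this guide.

```python
import time


class LeaseTable:
    """In-memory stand-in for Kubernetes Lease objects (illustration only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.leases = {}  # lease name -> (holder, expiry time)

    def acquire(self, pg_name, holder):
        """Try to take failover-lock-{pg_name}; return True on success."""
        name = f"failover-lock-{pg_name}"
        now = self.clock()
        current = self.leases.get(name)
        if current and current[0] != holder and current[1] > now:
            return False  # another holder owns an unexpired lease
        self.leases[name] = (holder, now + self.ttl)
        return True

    def release(self, pg_name, holder):
        """Drop the lease, but only if this holder owns it."""
        name = f"failover-lock-{pg_name}"
        if self.leases.get(name, (None,))[0] == holder:
            del self.leases[name]
```

Two failovers for the same group contend for one lease name, while failovers for different groups use different names and never block each other at this level.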


DRBDReplicationPolicy (drbd.io)

Configures DRBD block replication behavior between the primary and DR cluster nodes. Replication traffic flows directly between data-plane nodes — the quorum cluster does not relay replication.


ProtectionRequest (drbd.io — DRBD Operator model only)

Used in place of ProtectionGroup (LINSTOR model) when deploying with the DRBD Operator. Provides equivalent VM protection semantics for two-cluster (primary + DR) deployments.


Usage

Protecting VMs with a Protection Group

Create a Protection Group CR to declare which VMs should be protected together as a DR unit. The controller validates that all referenced VMs use geo-replicated LINSTOR PVCs before accepting the resource.

apiVersion: dr.linstor.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: production-app
  namespace: production
spec:
  desiredState: running
  virtualMachines:
    - name: app-server-1
      namespace: production
    - name: app-server-2
      namespace: production
    - name: database-primary
      namespace: production

Apply with:

kubectl apply -f protection-group.yaml

Check status:

kubectl get protectiongroup production-app -n production

The STATE column should show Active and HEALTH should show Healthy once DRBD replication is synchronized across clusters.


Triggering a Planned Failover

Create a Failover CR to move a Protection Group from the primary cluster to the DR cluster. The Failover controller orchestrates the operation by patching Protection Group desiredState — you do not need to manually stop or start VMs.

apiVersion: dr.linstor.io/v1alpha1
kind: Failover
metadata:
  name: production-app-failover-2025
  namespace: production
spec:
  protectionGroup: production-app
  targetCluster: dr-cluster
  failoverType: planned

Apply and watch progress:

kubectl apply -f failover.yaml
kubectl wait --for=jsonpath='{.status.state}'=Completed \
  failover/production-app-failover-2025 \
  -n production \
  --timeout=600s

Follow the state machine in real time:

kubectl get failover production-app-failover-2025 -n production -w

Alternatively, use pgctl for managed failover operations, which wraps this CRD pattern with additional pre-flight checks.
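
For scripted automation, the polling that kubectl wait performs on status.state can be sketched as below. This is a minimal illustration: `wait_for_failover` and its `fetch_status` callback are hypothetical, and the caller is assumed to supply a function that returns the Failover CR's status as a dict (for example, parsed from `kubectl get failover <name> -n <namespace> -o json`).

```python
import time


def wait_for_failover(fetch_status, timeout_seconds=600, interval_seconds=5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until status.state is Completed or Failed, or the timeout passes.

    fetch_status is a caller-supplied function returning the status dict
    of the Failover CR (it may return None before the status is populated).
    """
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status() or {}
        state = status.get("state")
        if state in ("Completed", "Failed"):
            return state
        if clock() >= deadline:
            raise TimeoutError(f"failover still in phase {status.get('phase')!r}")
        sleep(interval_seconds)
```

Unlike the kubectl wait command shown above, which waits only for Completed, this loop also returns as soon as the state becomes Failed, so a failed failover does not hold the script until the timeout.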


Checking Replication Health

You can inspect the health of all Protection Groups across a namespace at a glance:

kubectl get protectiongroups -n production

For detailed status including per-VM replication state and any warnings:

kubectl describe protectiongroup production-app -n production
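
To turn the same check into an automated report, the JSON form of the listing can be filtered on status.replicationHealth. The sketch below assumes the input is the text of `kubectl get protectiongroups -A -o json`; `unhealthy_groups` is a hypothetical helper, and a group with no reported health is treated as Unknown.

```python
import json


def unhealthy_groups(pg_list_json):
    """Return (namespace, name, health) for groups whose replication is not Healthy.

    pg_list_json is the text of `kubectl get protectiongroups -A -o json`.
    """
    result = []
    for pg in json.loads(pg_list_json).get("items", []):
        # Assumption: a missing field means the controller has not reported yet.
        health = pg.get("status", {}).get("replicationHealth", "Unknown")
        if health != "Healthy":
            meta = pg["metadata"]
            result.append((meta.get("namespace"), meta["name"], health))
    return result
```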

GitOps Integration

Because all operations are CRD-driven, you can store Protection Group definitions and Failover CRs in Git and apply them through Argo CD or Flux. Protection Group definitions (declaring which VMs are protected) are long-lived and suit continuous GitOps management. Failover CRs are ephemeral — create one per failover event, and delete or archive it after completion.
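
Because each failover event needs its own Failover CR, a pipeline can generate the manifest with a unique, timestamped name. The following is a minimal sketch: `build_failover` and the naming scheme are hypothetical, while the apiVersion, kind, and spec fields are the ones documented in this guide.

```python
from datetime import datetime, timezone


def build_failover(protection_group, namespace, target_cluster,
                   failover_type="planned", force=False, now=None):
    """Return a Failover manifest (as a dict) with a timestamped, unique name."""
    if failover_type not in ("planned", "emergency"):
        raise ValueError(f"unknown failoverType: {failover_type}")
    now = now or datetime.now(timezone.utc)
    # Hypothetical naming scheme: <group>-failover-<UTC timestamp>.
    name = f"{protection_group}-failover-{now:%Y%m%d-%H%M%S}"
    return {
        "apiVersion": "dr.linstor.io/v1alpha1",
        "kind": "Failover",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "protectionGroup": protection_group,
            "targetCluster": target_cluster,
            "failoverType": failover_type,
            "force": force,
        },
    }
```

The returned dict can be serialized with json.dumps and committed to Git or piped to kubectl apply -f -, since kubectl accepts JSON manifests as well as YAML.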


Examples

Example 1 — Create a Protection Group (LINSTOR deployment)

This example protects two VMs in the production namespace. The VMs use a storage class with placementCount: 3 and cross-zone placement, so they pass geo-replication validation.

apiVersion: dr.linstor.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: web-tier
  namespace: production
spec:
  desiredState: running
  virtualMachines:
    - name: web-vm-1
      namespace: production
    - name: web-vm-2
      namespace: production

Apply the manifest and check the result:

kubectl apply -f web-tier-pg.yaml
kubectl get protectiongroup web-tier -n production

Expected output:

NAME       STATE    VMS   REPLICATION   HEALTH    AGE
web-tier   Active   2     synchronous   Healthy   30s

Example 2 — Planned failover to DR cluster

Fail over the web-tier Protection Group to dr-cluster during a maintenance window.

apiVersion: dr.linstor.io/v1alpha1
kind: Failover
metadata:
  name: web-tier-planned-failover
  namespace: production
spec:
  protectionGroup: web-tier
  targetCluster: dr-cluster
  failoverType: planned
  force: false

Apply the manifest and watch progress:

kubectl apply -f web-tier-failover.yaml
kubectl get failover web-tier-planned-failover -n production -w

Expected output (as phases progress):

NAME                         PHASE               STATE     AGE
web-tier-planned-failover    StoppingOnSource              10s
web-tier-planned-failover    StartingOnTarget              35s
web-tier-planned-failover    Completed           Completed 60s

Example 3 — Reject: Protection Group with single-replica PVC

Attempting to protect a VM whose PVC uses a storage class with placementCount: 1 will fail validation.

apiVersion: dr.linstor.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: invalid-pg
  namespace: staging
spec:
  desiredState: running
  virtualMachines:
    - name: local-only-vm
      namespace: staging

Apply the manifest and inspect the result:

kubectl apply -f invalid-pg.yaml
kubectl describe protectiongroup invalid-pg -n staging

Expected status:

Status:
  State: Failed
  Conditions:
    Type:    ValidationFailed
    Status:  True
    Reason:  InvalidConfiguration
    Message: PVC local-only-vm-disk for VM local-only-vm: Storage class linstor-local
             has placementCount=1, minimum 2 required for geo-replication

Fix: Reprovision the VM's PVC using a storage class with placementCount >= 2 and provisioner: linstor.csi.linbit.com.


Example 4 — Emergency failover with force flag

Use force: true only in genuine emergencies where you have confirmed that bypassing concurrent-safety checks will not cause data corruption. This flag skips the check that prevents taint removal when other Protection Groups have running VMs on the target cluster nodes.

apiVersion: dr.linstor.io/v1alpha1
kind: Failover
metadata:
  name: web-tier-emergency-failover
  namespace: production
spec:
  protectionGroup: web-tier
  targetCluster: dr-cluster
  failoverType: emergency
  force: true

⚠️ Warning: force: true bypasses the concurrent-safety checks on taint removal. Using it while other Protection Groups have VMs running on the target cluster may disrupt those workloads. Coordinate with other operators before applying.


Troubleshooting

Issue 1 — Protection Group stuck in Failed state after creation

Symptom:

NAME         STATE    VMS   AGE
my-pg        Failed   0     10s

kubectl describe protectiongroup my-pg shows ValidationFailed condition.

Likely cause: One or more VMs in spec.virtualMachines have PVCs that fail geo-replication validation — either the storage class provisioner is not linstor.csi.linbit.com, or placementCount is less than 2.

Fix:

  1. Read the Message field in the ValidationFailed condition to identify the specific PVC and storage class.
  2. If the storage class has placementCount: 1, you must reprovision the PVC using a geo-replication-capable storage class before the Protection Group can be created.
  3. Correct the VM's PVC, then delete and recreate the Protection Group CR.

Issue 2 — Failover stuck in StoppingOnSource phase

Symptom: The Failover CR has been in StoppingOnSource for several minutes with no progress.

Likely cause: The source cluster's Protection Group controller is unable to stop one or more VMs, or the Protection Group currentState is not transitioning from running to stopped. This can happen if a VM is stuck in a terminating state, or if the Protection Group controller pod is not running on the source cluster.

Fix:

  1. Check the Protection Group status on the source cluster: kubectl get protectiongroup <name> -n <namespace>.
  2. If currentState is mixed, the controller is still reconciling — wait and check controller logs: kubectl logs -l app=protection-group-controller.
  3. If the controller pod is not running, redeploy it: kubectl apply -f protection-group-controller-deployment.yaml.
  4. If a VM is stuck terminating, investigate the KubeVirt VM object directly: kubectl describe vm <vm-name> -n <namespace>.

Issue 3 — Failover aborted with "UNSAFE to remove taints" error

Symptom: The Failover CR fails with a message similar to:

UNSAFE to remove taints on dr-cluster - other Protection Groups have running VMs
Conflicting PGs: database-protection-group, web-servers-protection-group

Likely cause: Another Protection Group already has VMs running on the target cluster. Because DRBD quorum taints are node-level (not per-Protection Group), removing taints for your failover would affect all VMs on those nodes.

Fix:

  1. Wait for any in-progress failovers to the same target cluster to complete before initiating yours.
  2. If this is a genuine emergency and you have confirmed the risk, re-create the Failover CR with spec.force: true. Document the decision — force mode may transiently disrupt VMs belonging to the conflicting Protection Groups.
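
The decision behind this error can be sketched as a small function. It is an illustration of the rule described above, not the controller's code: `check_taint_removal` is a hypothetical helper, and counting only groups whose currentState is exactly running as conflicts is a simplification.

```python
def check_taint_removal(target_pg, groups_on_target, force=False):
    """Decide whether node taints on the target cluster may be removed.

    groups_on_target maps Protection Group name -> currentState on the
    target cluster. Returns (allowed, conflicting_group_names).
    """
    # Simplification: only groups reported as fully running count as conflicts.
    conflicts = sorted(
        name for name, state in groups_on_target.items()
        if name != target_pg and state == "running"
    )
    allowed = force or not conflicts
    return allowed, conflicts
```

With force set, the conflicts are still reported, which gives the operator a list of the groups whose VMs may be disrupted.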

Issue 4 — "Failed to acquire failover lock" error

Symptom: A Failover CR fails immediately with:

ERROR: Failed to acquire failover lock for production-protection-group
Failover lock is held by: failover-controller-12345

Likely cause: A previous Failover CR for the same Protection Group is still in progress (or its controller crashed without releasing the lock). The system uses a Kubernetes Lease (failover-lock-{pg-name}) to prevent concurrent failovers of the same group.

Fix:

  1. Check if a previous Failover CR is still running: kubectl get failovers -n <namespace>.
  2. If the previous operation has genuinely completed or failed, check whether the Lease object was left behind: kubectl get leases -n <namespace> | grep failover-lock.
  3. If the Lease is stale (the holder process no longer exists), delete it: kubectl delete lease failover-lock-<pg-name> -n <namespace>. The lock also expires automatically after 5 minutes, so a stale Lease clears on its own if you can wait.
  4. Retry the Failover CR after the lock is released.
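
Before deleting a Lease by hand, it helps to confirm that it has actually expired. The sketch below reads the standard Lease spec fields (renewTime, acquireTime, leaseDurationSeconds) from the text of `kubectl get lease failover-lock-<pg-name> -n <namespace> -o json`. Whether the Failover controller populates these fields is an assumption, `lease_is_stale` is a hypothetical helper, and the 300-second fallback reflects the 5-minute expiry mentioned above.

```python
import json
from datetime import datetime, timedelta, timezone


def lease_is_stale(lease_json, now=None, default_duration_seconds=300):
    """Return True if the Lease's last renewal plus its duration is in the past."""
    spec = json.loads(lease_json).get("spec", {})
    renew = spec.get("renewTime") or spec.get("acquireTime")
    if renew is None:
        return False  # cannot tell from the object; leave the Lease alone
    # Lease timestamps end in "Z"; normalize so fromisoformat accepts them.
    renewed_at = datetime.fromisoformat(renew.replace("Z", "+00:00"))
    duration = spec.get("leaseDurationSeconds", default_duration_seconds)
    now = now or datetime.now(timezone.utc)
    return now > renewed_at + timedelta(seconds=duration)
```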

Issue 5 — Protection Group shows HEALTH: Degraded

Symptom:

NAME         STATE    VMS   REPLICATION   HEALTH     AGE
my-pg        Active   2     synchronous   Degraded   5m

Likely cause: DRBD replication for one or more member PVCs is not fully synchronized between the primary and DR cluster nodes. This can occur after a network partition, after a node restart, or if the DR cluster is unreachable.

Fix:

  1. Inspect the Protection Group status for per-volume replication details: kubectl describe protectiongroup my-pg -n <namespace>.
  2. Verify network connectivity between primary and DR cluster data-plane nodes — DRBD replication traffic flows directly between these nodes and does not pass through the quorum cluster.
  3. Check DRBD resource status on the affected nodes using your LINSTOR or DRBD Operator tooling.
  4. Do not initiate a failover while replicationHealth is Degraded unless it is an emergency — failing over with unsynchronized data risks data loss.
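
The last rule can be encoded as a pre-flight gate in automation. This is a minimal sketch with hypothetical names and return values; it treats any health value other than Healthy, including Unknown, as a reason to hold a planned failover.

```python
def failover_advice(replication_health, failover_type):
    """Return "proceed", "proceed-with-risk", or "wait" for a proposed failover."""
    if replication_health == "Healthy":
        return "proceed"
    if failover_type == "emergency":
        return "proceed-with-risk"  # unsynchronized data may be lost
    return "wait"                   # planned failover: let replication catch up first
```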