Custom Resource Definitions (CRDs)
Complete reference for all Site Recovery CRDs - the declarative API for managing disaster recovery. CRDs are the primary interface for automating DR operations via kubectl, GitOps, or any Kubernetes-native tooling.
Site Recovery exposes its entire control plane as Kubernetes Custom Resource Definitions (CRDs), making disaster recovery a fully declarative, API-driven operation. You interact with the system by creating, updating, and deleting Custom Resources (CRs) — controllers watch those resources and continuously reconcile actual state toward your desired state. This architecture enables GitOps workflows (Argo CD, Flux), kubectl-based automation, and integration with any Kubernetes-native tooling without requiring proprietary clients. Two API groups organize the CRDs: dr.linstor.io for DR orchestration resources and drbd.io for storage replication resources.
Before working with Site Recovery CRDs, ensure the following are in place:
- Kubernetes clusters provisioned — at minimum, a primary cluster and a DR cluster. For LINSTOR deployments, a third quorum cluster is required; for DRBD Operator deployments, a quorum cluster is optional but recommended.
- Infrastructure deployed via Ansible — DRBD, LINSTOR or DRBD Operator, and all controllers (Protection Group controller, Failover controller) must be deployed to your clusters before any CRs will reconcile.
- Site Manager UI deployed via Helm to the quorum cluster (management plane).
- kubectl configured with kubeconfig contexts for each cluster you intend to manage.
- pgctl installed. This is the primary CLI for Protection Group management, failover operations, and multi-tenant deployment management.
- RBAC permissions to create, patch, and delete custom resources in the dr.linstor.io and drbd.io API groups in the namespaces where you will operate.
- LINSTOR-backed PVCs: all VMs you intend to protect must use PVCs provisioned by linstor.csi.linbit.com with placementCount >= 2. Single-replica or non-LINSTOR PVCs are rejected at Protection Group creation time.
CRD schemas are applied to your clusters as part of the Ansible-driven infrastructure deployment. If you need to apply or refresh them manually, follow these steps.
Step 1 — Apply the Protection Group CRD
Apply the CRD manifest to every cluster that will host the Protection Group controller (primary, DR, and quorum for LINSTOR; primary and DR at minimum for DRBD Operator):
kubectl apply -f protection-group-crd.yaml
Step 2 — Apply the Failover CRD
kubectl apply -f failover-crd.yaml
Step 3 — Verify CRD registration
Confirm both CRDs are established before creating any Custom Resources:
kubectl get crds | grep -E 'dr\.linstor\.io|drbd\.io'
Expected output (names will vary by release):
protectiongroups.dr.linstor.io 2025-01-01T00:00:00Z
failovers.dr.linstor.io 2025-01-01T00:00:00Z
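Grepping the CRD list shows that the CRDs exist, but not that the API server has finished registering them. As a minimal sketch, the helper below blocks until each CRD reports the Established condition. The function name wait_for_crds is hypothetical, and the CRD names are taken from the expected output above, so adjust them if your release differs.

```shell
#!/bin/sh
# Sketch: wait until both Site Recovery CRDs report the Established condition.
# Assumption: CRD names match the expected output shown above.
wait_for_crds() {
  timeout="${1:-60s}"
  for crd in protectiongroups.dr.linstor.io failovers.dr.linstor.io; do
    if ! kubectl wait --for=condition=Established "crd/${crd}" --timeout="${timeout}"; then
      echo "CRD ${crd} is not established" >&2
      return 1
    fi
  done
  echo "all CRDs established"
}
```

Call it as `wait_for_crds 90s` against each cluster context before creating any Custom Resources.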
Step 4 — Deploy the controllers
Controllers must be running for CRs to reconcile. Apply the controller deployment manifests:
kubectl apply -f protection-group-controller-deployment.yaml
kubectl apply -f failover-controller-deployment.yaml
Step 5 — Verify controllers are running
kubectl get pods -l app=protection-group-controller
kubectl get pods -l app=failover-controller
Both pods should be in Running state before you proceed to create any CRs.
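Rather than re-running kubectl get pods by hand, the check can be scripted. The sketch below waits for the Ready condition on the pods selected by the labels used above. The function name and the namespace argument are assumptions; the original commands do not name a namespace, so pass whichever one your deployment uses.

```shell
#!/bin/sh
# Sketch: wait for both controller pods to become Ready.
# Assumption: controllers carry the app=... labels shown above and run in
# the namespace passed as the first argument (hypothetical default: default).
wait_for_controllers() {
  ns="${1:-default}"
  timeout="${2:-120s}"
  for app in protection-group-controller failover-controller; do
    kubectl wait --for=condition=Ready pod -l "app=${app}" \
      -n "${ns}" --timeout="${timeout}" || {
      echo "controller ${app} is not Ready in namespace ${ns}" >&2
      return 1
    }
  done
  echo "controllers ready"
}
```

Waiting on the Ready condition is slightly stricter than checking for the Running phase, because a Running pod may still be failing its readiness probe.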
Protection Group (dr.linstor.io)
The Protection Group CR is the primary unit of VM protection. It declares which VMs belong together as a logical DR unit and what their desired lifecycle state is.
| Field | Type | Default | Valid Values | Effect |
|---|---|---|---|---|
| spec.virtualMachines | list | - | List of {name, namespace} objects | Declares which KubeVirt VMs are members of this group. All listed VMs are managed as a unit during failover. |
| spec.desiredState | string | running | running, stopped | The target lifecycle state for all VMs in the group. The Protection Group controller reconciles actual VM state to match. Setting this to stopped is how the Failover controller halts VMs on the source cluster without directly patching VM objects. |
| status.currentState | string | unknown | running, stopped, mixed, unknown | Reflects the actual observed state. mixed means reconciliation is in progress. The Failover controller reads this field to determine when it is safe to proceed to the next failover phase. |
| status.replicationHealth | string | - | Healthy, Degraded, Unknown | Reports whether DRBD replication for all member PVCs is synchronized. |
| status.warnings | list | - | Strings | Non-blocking advisory messages, such as missing replicasOnDifferent geo-placement rules on a storage class. |
Geo-replication validation is enforced at create and update time. The controller rejects any Protection Group whose member VMs have PVCs that do not meet these requirements:
- Storage class provisioner must be linstor.csi.linbit.com
- placementCount must be >= 2
Missing replicasOnDifferent parameters generate warnings but do not block creation.
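You can approximate these rules with a pre-flight check before creating a Protection Group. The sketch below is not the controller's implementation; it only mirrors the three rules stated above. It assumes the storage class stores its settings under the unprefixed parameter keys placementCount and replicasOnDifferent. Some LINSTOR CSI versions use prefixed keys, so adjust the jsonpath if yours does. Both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: mirror the documented validation rules for one storage class.
# Prints "ok", "warn: ..." or "reject: ..."; returns non-zero on reject.
classify_storage_class() {
  provisioner="$1"; placement_count="$2"; replicas_on_different="$3"
  if [ "${provisioner}" != "linstor.csi.linbit.com" ]; then
    echo "reject: provisioner is '${provisioner}', expected linstor.csi.linbit.com"
    return 1
  fi
  # Treat a missing or non-numeric placementCount as 0.
  case "${placement_count}" in
    ''|*[!0-9]*) placement_count=0 ;;
  esac
  if [ "${placement_count}" -lt 2 ]; then
    echo "reject: placementCount=${placement_count}, minimum 2 required"
    return 1
  fi
  if [ -z "${replicas_on_different}" ]; then
    echo "warn: replicasOnDifferent is not set"
    return 0
  fi
  echo "ok"
}

# Assumption: parameters live under unprefixed keys in .parameters.
check_storage_class() {
  sc="$1"
  classify_storage_class \
    "$(kubectl get storageclass "${sc}" -o jsonpath='{.provisioner}')" \
    "$(kubectl get storageclass "${sc}" -o jsonpath='{.parameters.placementCount}')" \
    "$(kubectl get storageclass "${sc}" -o jsonpath='{.parameters.replicasOnDifferent}')"
}
```

For example, `check_storage_class linstor-local` prints a reject line if that class is configured with placementCount: 1.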
Failover CR (dr.linstor.io)
The Failover CR is how you trigger a failover operation. Creating this resource is the only correct way to initiate failover — the controller orchestrates all subsequent steps by patching Protection Group desiredState fields. It never directly patches VM objects.
| Field | Type | Default | Valid Values | Effect |
|---|---|---|---|---|
| spec.protectionGroup | string | - | Any valid Protection Group name | Identifies which Protection Group to fail over. |
| spec.targetCluster | string | - | Cluster identifier | The cluster where VMs should run after failover completes. |
| spec.failoverType | string | planned | planned, emergency | Controls pre-failover validation stringency. |
| spec.force | bool | false | true, false | Bypasses concurrent-safety checks that prevent taint removal when other Protection Groups have running VMs on the target cluster. Use only in emergencies and only after coordinating with other operators. |
| status.phase | string | Pending | Pending, StoppingOnSource, StartingOnTarget, Completed, Failed | Current phase of the failover state machine. |
| status.state | string | - | Completed, Failed | Terminal state. Poll or use kubectl wait on this field. |
Failover state machine phases:
- Pending: CR created, controller analyzing source/target.
- StoppingOnSource: Controller has patched source Protection Group desiredState=stopped; waiting for currentState=stopped.
- StartingOnTarget: Source VMs confirmed stopped; DRBD taint safety check performed; controller has patched target Protection Group desiredState=running.
- Completed: Target Protection Group currentState=running confirmed.
- Failed: Failover failed after retries.
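Because the state machine has two terminal phases, a script that waits only for Completed keeps blocking until its timeout when a failover ends in Failed. The sketch below polls status.phase, prints each transition, and returns as soon as either terminal phase is reached. The function name, poll count, and interval are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: follow a Failover CR until it reaches a terminal phase.
# Returns 0 on Completed, 1 on Failed, 2 if max_polls is exhausted.
watch_failover() {
  name="$1"; ns="$2"
  max_polls="${3:-120}"; interval="${4:-5}"
  last="__none__"
  i=0
  while [ "${i}" -lt "${max_polls}" ]; do
    # A failed read is treated as "no phase yet" and retried.
    phase="$(kubectl get failover "${name}" -n "${ns}" \
      -o jsonpath='{.status.phase}')" || phase=""
    if [ "${phase}" != "${last}" ]; then
      echo "phase: ${phase:-<unset>}"
      last="${phase}"
    fi
    case "${phase}" in
      Completed) return 0 ;;
      Failed)    return 1 ;;
    esac
    i=$((i + 1))
    sleep "${interval}"
  done
  echo "timed out waiting for ${name}" >&2
  return 2
}
```

Distinguishing the three exit codes lets a calling script tell a failed failover apart from one that is merely slow.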
Concurrent failover safety: The controller acquires a per-Protection Group Kubernetes Lease (failover-lock-{pg-name}) before beginning. A second Failover CR targeting the same Protection Group will fail to acquire the lock and report an error. Concurrent failovers of different Protection Groups are permitted, but the safety check on taint removal (which is node-level, not per-resource) may block a second failover if the first group's VMs are already running on the target cluster nodes.
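Before creating a Failover CR, you can check whether the lock is currently held. The sketch below reads spec.holderIdentity from the Lease, which is a standard field of coordination.k8s.io Lease objects. It assumes the Lease lives in the same namespace as the Protection Group; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: report who (if anyone) holds the failover lock for a Protection Group.
# Assumption: the Lease is created in the Protection Group's namespace.
lock_holder() {
  pg="$1"; ns="$2"
  # A missing Lease makes kubectl fail; treat that as "no holder".
  holder="$(kubectl get lease "failover-lock-${pg}" -n "${ns}" \
    -o jsonpath='{.spec.holderIdentity}' 2>/dev/null)" || holder=""
  if [ -n "${holder}" ]; then
    echo "locked by ${holder}"
  else
    echo "unlocked"
  fi
}
```

For example, `lock_holder production-app production` prints either the holder identity or `unlocked`.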
DRBDReplicationPolicy (drbd.io)
Configures DRBD block replication behavior between the primary and DR cluster nodes. Replication traffic flows directly between data-plane nodes — the quorum cluster does not relay replication.
ProtectionRequest (drbd.io — DRBD Operator model only)
Used in place of ProtectionGroup (LINSTOR model) when deploying with the DRBD Operator. Provides equivalent VM protection semantics for two-cluster (primary + DR) deployments.
Protecting VMs with a Protection Group
Create a Protection Group CR to declare which VMs should be protected together as a DR unit. The controller validates that all referenced VMs use geo-replicated LINSTOR PVCs before accepting the resource.
apiVersion: dr.linstor.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: production-app
  namespace: production
spec:
  desiredState: running
  virtualMachines:
    - name: app-server-1
      namespace: production
    - name: app-server-2
      namespace: production
    - name: database-primary
      namespace: production
Apply with:
kubectl apply -f protection-group.yaml
Check status:
kubectl get protectiongroup production-app -n production
The STATE column should show Active and HEALTH should show Healthy once DRBD replication is synchronized across clusters.
Triggering a Planned Failover
Create a Failover CR to move a Protection Group from the primary cluster to the DR cluster. The Failover controller orchestrates the operation by patching Protection Group desiredState — you do not need to manually stop or start VMs.
apiVersion: dr.linstor.io/v1alpha1
kind: Failover
metadata:
  name: production-app-failover-2025
  namespace: production
spec:
  protectionGroup: production-app
  targetCluster: dr-cluster
  failoverType: planned
Apply and watch progress:
kubectl apply -f failover.yaml
kubectl wait --for=jsonpath='{.status.state}'=Completed \
failover/production-app-failover-2025 \
-n production \
--timeout=600s
Follow the state machine in real time:
kubectl get failover production-app-failover-2025 -n production -w
Alternatively, use pgctl for managed failover operations, which wraps this CRD pattern with additional pre-flight checks.
Checking Replication Health
You can inspect the health of all Protection Groups across a namespace at a glance:
kubectl get protectiongroups -n production
For detailed status including per-VM replication state and any warnings:
kubectl describe protectiongroup production-app -n production
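For monitoring or cron jobs, the same information can be read in machine-friendly form. As a sketch, the helper below lists only the Protection Groups whose status.replicationHealth is not Healthy, using kubectl's jsonpath range syntax. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print "<name>: <health>" for every Protection Group in a namespace
# whose replicationHealth is anything other than Healthy.
unhealthy_groups() {
  ns="$1"
  kubectl get protectiongroups -n "${ns}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.replicationHealth}{"\n"}{end}' |
  while IFS="$(printf '\t')" read -r pg health; do
    # An empty health field is reported as Unknown.
    [ "${health}" = "Healthy" ] || echo "${pg}: ${health:-Unknown}"
  done
}
```

An empty result from `unhealthy_groups production` means every group in that namespace reports Healthy.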
GitOps Integration
Because all operations are CRD-driven, you can store Protection Group definitions and Failover CRs in Git and apply them through Argo CD or Flux. Protection Group definitions (declaring which VMs are protected) are long-lived and suit continuous GitOps management. Failover CRs are ephemeral — create one per failover event, and delete or archive it after completion.
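One way to handle the archive step is to save the finished Failover CR as YAML before deleting it. The following is a sketch under stated assumptions: the function name and the archive directory are hypothetical, and the terminal check relies on the status.state values documented above.

```shell
#!/bin/sh
# Sketch: archive a finished Failover CR to a local directory, then delete it.
# Refuses to touch a Failover that has not reached a terminal state.
archive_failover() {
  name="$1"; ns="$2"; dir="${3:-./failover-archive}"
  state="$(kubectl get failover "${name}" -n "${ns}" \
    -o jsonpath='{.status.state}')" || state=""
  case "${state}" in
    Completed|Failed) ;;
    *) echo "skip ${name}: not in a terminal state" >&2; return 1 ;;
  esac
  mkdir -p "${dir}" &&
    kubectl get failover "${name}" -n "${ns}" -o yaml > "${dir}/${name}.yaml" &&
    kubectl delete failover "${name}" -n "${ns}"
}
```

If the Failover CR is itself managed by Argo CD or Flux with pruning or self-healing enabled, remove it from Git rather than from the cluster, since the GitOps tool may otherwise re-create it.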
Example 1 — Create a Protection Group (LINSTOR deployment)
This example protects two VMs in the production namespace. The VMs use a storage class with placementCount: 3 and cross-zone placement, so they pass geo-replication validation.
apiVersion: dr.linstor.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: web-tier
  namespace: production
spec:
  desiredState: running
  virtualMachines:
    - name: web-vm-1
      namespace: production
    - name: web-vm-2
      namespace: production
kubectl apply -f web-tier-pg.yaml
kubectl get protectiongroup web-tier -n production
Expected output:
NAME STATE VMS REPLICATION HEALTH AGE
web-tier Active 2 synchronous Healthy 30s
Example 2 — Planned failover to DR cluster
Fail over the web-tier Protection Group to dr-cluster during a maintenance window.
apiVersion: dr.linstor.io/v1alpha1
kind: Failover
metadata:
  name: web-tier-planned-failover
  namespace: production
spec:
  protectionGroup: web-tier
  targetCluster: dr-cluster
  failoverType: planned
  force: false
kubectl apply -f web-tier-failover.yaml
kubectl get failover web-tier-planned-failover -n production -w
Expected output (as phases progress):
NAME                        PHASE              STATE       AGE
web-tier-planned-failover   StoppingOnSource               10s
web-tier-planned-failover   StartingOnTarget               35s
web-tier-planned-failover   Completed          Completed   60s
Example 3 — Reject: Protection Group with single-replica PVC
Attempting to protect a VM whose PVC uses a storage class with placementCount: 1 will fail validation.
apiVersion: dr.linstor.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: invalid-pg
  namespace: staging
spec:
  desiredState: running
  virtualMachines:
    - name: local-only-vm
      namespace: staging
kubectl apply -f invalid-pg.yaml
kubectl describe protectiongroup invalid-pg -n staging
Expected status:
Status:
  State: Failed
  Conditions:
    Type: ValidationFailed
    Status: True
    Reason: InvalidConfiguration
    Message: PVC local-only-vm-disk for VM local-only-vm: Storage class linstor-local
      has placementCount=1, minimum 2 required for geo-replication
Fix: Reprovision the VM's PVC using a storage class with placementCount >= 2 and provisioner: linstor.csi.linbit.com.
Example 4 — Emergency failover with force flag
Use force: true only in genuine emergencies where you have confirmed that bypassing concurrent-safety checks will not cause data corruption. This flag skips the check that prevents taint removal when other Protection Groups have running VMs on the target cluster nodes.
apiVersion: dr.linstor.io/v1alpha1
kind: Failover
metadata:
  name: web-tier-emergency-failover
  namespace: production
spec:
  protectionGroup: web-tier
  targetCluster: dr-cluster
  failoverType: emergency
  force: true
⚠️ Warning: force: true bypasses all concurrent-safety checks. Using it while other Protection Groups have VMs running on the target cluster may disrupt those workloads. Coordinate with other operators before applying.
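To see which workloads a forced failover could affect, you can list the Protection Groups that currently report running VMs on the target cluster. This is a sketch, not the controller's own safety check. It assumes a kubeconfig context named after the target cluster and that status.currentState on the target reflects VMs running there. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: list "<namespace>/<name>" of Protection Groups on the target cluster
# whose currentState is running.
# Assumption: a kubeconfig context exists for the target cluster.
running_groups_on_target() {
  ctx="$1"
  kubectl --context "${ctx}" get protectiongroups --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.currentState}{"\n"}{end}' |
  while IFS="$(printf '\t')" read -r pg state; do
    [ "${state}" != "running" ] || echo "${pg}"
  done
}
```

Any group printed by `running_groups_on_target dr-cluster` belongs to a team you should contact before setting force: true.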
Issue 1 — Protection Group stuck in Failed state after creation
Symptom:
NAME STATE VMS AGE
my-pg Failed 0 10s
kubectl describe protectiongroup my-pg shows ValidationFailed condition.
Likely cause: One or more VMs in spec.virtualMachines have PVCs that fail geo-replication validation — either the storage class provisioner is not linstor.csi.linbit.com, or placementCount is less than 2.
Fix:
- Read the Message field in the ValidationFailed condition to identify the specific PVC and storage class.
- If the storage class has placementCount: 1, you must reprovision the PVC using a geo-replication-capable storage class before the Protection Group can be created.
- Correct the VM's PVC, then delete and recreate the Protection Group CR.
Issue 2 — Failover stuck in StoppingOnSource phase
Symptom: The Failover CR has been in StoppingOnSource for several minutes with no progress.
Likely cause: The source cluster's Protection Group controller is unable to stop one or more VMs, or the Protection Group currentState is not transitioning from running to stopped. This can happen if a VM is stuck in a terminating state, or if the Protection Group controller pod is not running on the source cluster.
Fix:
- Check the Protection Group status on the source cluster: kubectl get protectiongroup <name> -n <namespace>.
- If currentState is mixed, the controller is still reconciling. Wait, and check the controller logs: kubectl logs -l app=protection-group-controller.
- If the controller pod is not running, redeploy it: kubectl apply -f protection-group-controller-deployment.yaml.
- If a VM is stuck terminating, investigate the KubeVirt VM object directly: kubectl describe vm <vm-name> -n <namespace>.
Issue 3 — Failover aborted with "UNSAFE to remove taints" error
Symptom: The Failover CR fails with a message similar to:
UNSAFE to remove taints on dr-cluster - other Protection Groups have running VMs
Conflicting PGs: database-protection-group, web-servers-protection-group
Likely cause: Another Protection Group already has VMs running on the target cluster. Because DRBD quorum taints are node-level (not per-Protection Group), removing taints for your failover would affect all VMs on those nodes.
Fix:
- Wait for any in-progress failovers to the same target cluster to complete before initiating yours.
- If this is a genuine emergency and you have confirmed the risk, re-create the Failover CR with spec.force: true. Document the decision, because force mode may transiently disrupt VMs belonging to the conflicting Protection Groups.
Issue 4 — "Failed to acquire failover lock" error
Symptom: A Failover CR fails immediately with:
ERROR: Failed to acquire failover lock for production-protection-group
Failover lock is held by: failover-controller-12345
Likely cause: A previous Failover CR for the same Protection Group is still in progress (or its controller crashed without releasing the lock). The system uses a Kubernetes Lease (failover-lock-{pg-name}) to prevent concurrent failovers of the same group.
Fix:
- Check if a previous Failover CR is still running: kubectl get failovers -n <namespace>.
- If the previous operation has genuinely completed or failed, check whether the Lease object was left behind: kubectl get leases -n <namespace> | grep failover-lock.
- If the Lease is stale (holder process no longer exists), delete it: kubectl delete lease failover-lock-<pg-name> -n <namespace>. The lock has a 5-minute automatic expiry, so it will also self-clean.
- Retry the Failover CR after the lock is released.
Issue 5 — Protection Group shows HEALTH: Degraded
Symptom:
NAME STATE VMS REPLICATION HEALTH AGE
my-pg Active 2 synchronous Degraded 5m
Likely cause: DRBD replication for one or more member PVCs is not fully synchronized between the primary and DR cluster nodes. This can occur after a network partition, after a node restart, or if the DR cluster is unreachable.
Fix:
- Inspect the Protection Group status for per-volume replication details: kubectl describe protectiongroup my-pg -n <namespace>.
- Verify network connectivity between primary and DR cluster data-plane nodes. DRBD replication traffic flows directly between these nodes and does not pass through the quorum cluster.
- Check DRBD resource status on the affected nodes using your LINSTOR or DRBD Operator tooling.
- Do not initiate a failover while replicationHealth is Degraded unless it is an emergency. Failing over with unsynchronized data risks data loss.
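The last point can be enforced in automation with a small guard. The sketch below applies a Failover manifest only when the Protection Group reports replicationHealth as Healthy. The function name and manifest path are hypothetical, and the guard is a convenience wrapper rather than a substitute for the controller's own checks.

```shell
#!/bin/sh
# Sketch: apply a Failover manifest only if replication is Healthy.
failover_if_healthy() {
  pg="$1"; ns="$2"; manifest="$3"
  health="$(kubectl get protectiongroup "${pg}" -n "${ns}" \
    -o jsonpath='{.status.replicationHealth}')" || health=""
  if [ "${health}" != "Healthy" ]; then
    echo "refusing failover: ${pg} replicationHealth is '${health:-Unknown}'" >&2
    return 1
  fi
  kubectl apply -f "${manifest}"
}
```

For example, `failover_if_healthy web-tier production web-tier-failover.yaml` refuses to proceed while the group is Degraded. In a genuine emergency, apply the manifest directly instead of going through the guard.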