Ecosystem Integration
LINSTOR, DRBD Operator, KubeVirt, OpenShift Virtualization compatibility
This page describes how Site Recovery integrates with the storage and virtualization components in your Kubernetes environment. Site Recovery supports two storage management backends — LINSTOR and DRBD Operator — and works with both upstream KubeVirt and Red Hat OpenShift Virtualization (OpenShift CNV) for VM lifecycle management. Understanding these integrations helps you select the right deployment model, validate compatibility before installation, and diagnose issues that cross component boundaries. The Site Manager UI automatically detects your deployment type at runtime and presents only the options relevant to your configuration.
Before reviewing ecosystem integration requirements, ensure you have:
- A supported Kubernetes distribution (upstream Kubernetes or OpenShift) running on your primary and DR clusters
- One of the following storage backends deployed and operational:
  - LINSTOR: requires a three-cluster topology (primary, DR, and quorum). The quorum cluster hosts the LINSTOR controller and failover controllers.
  - DRBD Operator: requires a minimum two-cluster topology (primary and DR). A quorum cluster is optional but recommended for management plane isolation.
- DRBD installed on all nodes that participate in block replication (primary and DR cluster nodes only — the quorum cluster does not participate in storage replication)
- One of the following VM platforms:
  - Upstream KubeVirt
  - Red Hat OpenShift Virtualization (CNV)
- kubectl access to all clusters involved in your deployment
- pgctl installed and configured with kubeconfigs for each cluster (see pgctl configuration)
- Ansible available on the machine used to deploy infrastructure components
- Helm 3 available on the machine used to deploy the Site Manager UI
Infrastructure components (DRBD, LINSTOR or DRBD Operator, and the Protection Group and Failover controllers) are deployed using Ansible. The Site Manager UI is deployed separately using Helm and targets the quorum cluster. The steps below apply to both deployment models unless noted.
Step 1: Deploy infrastructure components with Ansible
Run the Ansible playbook against your primary and DR clusters, and against the quorum cluster if your topology includes one. The playbook installs DRBD on primary and DR cluster nodes and deploys the appropriate storage operator and controllers to each cluster.
# LINSTOR model (three clusters required)
ansible-playbook site-recovery-infra.yml \
-e deployment_model=linstor \
-e primary_cluster=cluster1 \
-e dr_cluster=cluster2 \
-e quorum_cluster=cluster3
# DRBD Operator model (two or three clusters)
ansible-playbook site-recovery-infra.yml \
-e deployment_model=drbd-operator \
-e primary_cluster=cluster1 \
-e dr_cluster=cluster2
# Add -e quorum_cluster=cluster3 if using an optional quorum cluster
Ansible deploys controllers to all specified clusters. The Protection Group Controller runs on each cluster and manages VM lifecycle locally. The Failover Controller orchestrates cross-cluster operations.
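The two invocations above differ only in the deployment_model value and in whether quorum_cluster must be supplied. The following is a minimal sketch of a wrapper that makes that rule explicit. The function name infra_playbook_cmd is hypothetical, and the function only prints the command (using the playbook name and variables shown above) rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ansible-playbook command for a deployment model.
# Usage: infra_playbook_cmd <linstor|drbd-operator> <primary> <dr> [quorum]
infra_playbook_cmd() {
  local model="$1" primary="$2" dr="$3" quorum="${4:-}"

  if [ -z "$primary" ] || [ -z "$dr" ]; then
    echo "error: primary and DR cluster names are required" >&2
    return 1
  fi

  case "$model" in
    linstor)
      # The LINSTOR model needs a quorum cluster to host the LINSTOR controller.
      if [ -z "$quorum" ]; then
        echo "error: the linstor model requires a quorum cluster" >&2
        return 1
      fi
      ;;
    drbd-operator)
      ;;  # The quorum cluster is optional for this model.
    *)
      echo "error: unknown deployment model: $model" >&2
      return 1
      ;;
  esac

  local cmd="ansible-playbook site-recovery-infra.yml -e deployment_model=$model -e primary_cluster=$primary -e dr_cluster=$dr"
  if [ -n "$quorum" ]; then
    cmd="$cmd -e quorum_cluster=$quorum"
  fi
  echo "$cmd"
}
```

Printing the command instead of executing it lets you review the arguments first; piping the output to a shell runs it once you are satisfied.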
Step 2: Apply Protection Group CRDs to primary and DR clusters
kubectl --kubeconfig ~/.kube/config-cluster1 apply -f protection-group-crd.yaml
kubectl --kubeconfig ~/.kube/config-cluster2 apply -f protection-group-crd.yaml
If you are using a three-cluster topology, also apply the CRD to the quorum cluster:
kubectl --kubeconfig ~/.kube/config-cluster3 apply -f protection-group-crd.yaml
Step 3: Verify controller pods are running
# Check Protection Group Controller on primary
kubectl --kubeconfig ~/.kube/config-cluster1 get pods -l app=protection-group-controller
# Check Failover Controller on primary
kubectl --kubeconfig ~/.kube/config-cluster1 get pods -l app=failover-controller
# Repeat for DR cluster
kubectl --kubeconfig ~/.kube/config-cluster2 get pods -l app=protection-group-controller
kubectl --kubeconfig ~/.kube/config-cluster2 get pods -l app=failover-controller
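The four checks above can be folded into one loop. This is a sketch only: the function names are hypothetical, and it uses the same kubeconfig paths and labels as the commands above. It treats a pod as healthy when its STATUS column reads Running and does not inspect container readiness.

```shell
#!/usr/bin/env bash
# Succeeds only if at least one pod matches the label and every match is Running.
# Usage: controller_running <kubeconfig> <app-label>
controller_running() {
  local kubeconfig="$1" app="$2" pods
  pods="$(kubectl --kubeconfig "$kubeconfig" get pods -l "app=$app" --no-headers 2>/dev/null)" || return 1
  # No matching pods counts as a failure.
  [ -n "$pods" ] || return 1
  # Column 3 of the default `kubectl get pods` output is STATUS.
  echo "$pods" | awk '$3 != "Running" { bad = 1 } END { exit bad }'
}

# Check both controllers on the primary and DR clusters and report each result.
check_all_controllers() {
  local rc=0 cfg app
  for cfg in ~/.kube/config-cluster1 ~/.kube/config-cluster2; do
    for app in protection-group-controller failover-controller; do
      if controller_running "$cfg" "$app"; then
        echo "OK    $app ($cfg)"
      else
        echo "FAIL  $app ($cfg)"
        rc=1
      fi
    done
  done
  return "$rc"
}
```

Call check_all_controllers after the Ansible run; a non-zero exit status means at least one controller needs attention.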
Step 4: Deploy the Site Manager UI with Helm
The Site Manager UI is deployed to the quorum cluster only. It does not run application workloads.
helm install site-manager ./charts/site-manager \
--namespace site-recovery \
--create-namespace \
--kubeconfig ~/.kube/config-cluster3
Step 5: Initialize pgctl
pgctl is the CLI tool you use for Protection Group management, failover, and failback. Initialize it and register your clusters:
pgctl init
Copy your cluster kubeconfigs into the pgctl configuration directory:
cp ~/.kube/config-cluster1 ~/.kube/pgctl/cluster1.kubeconfig
cp ~/.kube/config-cluster2 ~/.kube/pgctl/cluster2.kubeconfig
# If using a quorum cluster:
cp ~/.kube/config-cluster3 ~/.kube/pgctl/cluster3.kubeconfig
Add each cluster to the pgctl configuration:
pgctl config add-cluster \
--name cluster1 \
--kubeconfig cluster1.kubeconfig \
--role primary
pgctl config add-cluster \
--name cluster2 \
--kubeconfig cluster2.kubeconfig \
--role dr
Verify connectivity to all clusters:
pgctl config test
Expected output:
▶ Testing primary cluster: cluster1
✓ Primary cluster is accessible (version: v1.30.6)
▶ Testing DR cluster: cluster2
✓ DR cluster is accessible (version: v1.30.6)
✓ All clusters are accessible
The following describes the key configuration surfaces across the ecosystem components Site Recovery integrates with.
pgctl configuration (~/.kube/pgctl/config)
The pgctl configuration file controls cluster registration, failover behavior, and safety settings.
| Setting | Default | Description |
|---|---|---|
| PRIMARY_CLUSTER | (required) | Name of the primary cluster as registered in pgctl |
| DR_CLUSTER | (required) | Name of the DR cluster as registered in pgctl |
| DEFAULT_NAMESPACE | default | Kubernetes namespace used when none is specified |
| FAILOVER_TIMEOUT | 300 | Maximum seconds to wait for a complete failover operation |
| VM_START_TIMEOUT | 120 | Maximum seconds to wait for VMs to reach running state on the target cluster |
| DRBD_WAIT_TIME | 30 | Seconds to wait for DRBD locks to release after stopping VMs on the source cluster |
| TAINT_REMOVAL_WINDOW | 30 | Duration of the aggressive taint removal loop applied when starting VMs on the target cluster |
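Because FAILOVER_TIMEOUT bounds the whole operation, one plausible sanity check is that it is at least as large as DRBD_WAIT_TIME plus VM_START_TIMEOUT. The sketch below tests that relationship. This is an inference from the descriptions in the table, not a rule this page states, and the function name check_timeouts is hypothetical. Values are passed as arguments because the on-disk format of the configuration file is not described here.

```shell
#!/usr/bin/env bash
# Usage: check_timeouts <FAILOVER_TIMEOUT> <VM_START_TIMEOUT> <DRBD_WAIT_TIME>
# Arguments default to the values listed in the table above.
check_timeouts() {
  local failover="${1:-300}" vm_start="${2:-120}" drbd_wait="${3:-30}"
  local needed=$(( drbd_wait + vm_start ))
  if [ "$failover" -lt "$needed" ]; then
    echo "FAILOVER_TIMEOUT=$failover is below DRBD_WAIT_TIME + VM_START_TIMEOUT = $needed"
    return 1
  fi
  echo "ok: FAILOVER_TIMEOUT=$failover leaves $(( failover - needed ))s of headroom"
}
```

With the defaults, the DRBD wait and VM start budget add up to 150 seconds, which leaves 150 seconds of the 300-second failover timeout for the remaining steps.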
Protection Group CRD (spec.desiredState)
The ProtectionGroup custom resource exposes a spec.desiredState field that controls the lifecycle of all VMs in the group. The Protection Group Controller reconciles VM state on the local cluster to match this field.
| Field | Values | Default | Description |
|---|---|---|---|
| spec.desiredState | running, stopped | running | Declares the desired power state for all VMs in the group. The controller patches spec.running on each VM to match. |
| status.currentState | running, stopped, mixed, unknown | unknown | Reflects the actual observed state of VMs. mixed indicates VMs are in inconsistent states and may require manual intervention. |
Example resource showing both fields:
apiVersion: siterecovery.trilio.io/v1alpha1
kind: ProtectionGroup
metadata:
name: production-protection-group
spec:
desiredState: running
virtualMachines:
- name: prod-vm-1
- name: prod-vm-2
status:
currentState: running
The desiredState field is the integration point between the Failover Controller (which patches it during orchestration) and the Protection Group Controller (which acts on it locally). You should not typically set this field manually in production — use pgctl failover instead.
Cluster role and validation behavior
Controllers use a CLUSTER_ROLE environment variable (set by Ansible during deployment) to apply cluster-appropriate validation logic:
| Role | Value | Validation behavior |
|---|---|---|
| Primary | primary | Full validation: VMs must exist and be running, PVCs must be Bound, geo-replication must be enabled (placementCount >= 2) |
| DR | dr | Minimal validation: VM and PVC definitions must exist; PVC binding state is not checked because CSI binds volumes during failover |
This distinction is critical: PVCs on the DR cluster are intentionally unbound until a failover occurs. If the controller applied primary-cluster validation rules to the DR cluster, it would incorrectly reject a valid configuration.
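The role-dependent PVC rule can be illustrated with a few lines of shell. This is a sketch of the decision described above, not the controller's actual code; the function name pvc_phase_ok is hypothetical.

```shell
#!/usr/bin/env bash
# Usage: pvc_phase_ok <cluster-role> <pvc-phase>
# Returns 0 when the PVC phase is acceptable for the given CLUSTER_ROLE.
pvc_phase_ok() {
  local role="$1" phase="$2"
  case "$role" in
    primary)
      # Full validation: the PVC must be Bound.
      [ "$phase" = "Bound" ]
      ;;
    dr)
      # Minimal validation: the binding state is not checked.
      return 0
      ;;
    *)
      echo "unknown CLUSTER_ROLE: $role" >&2
      return 2
      ;;
  esac
}
```

For example, pvc_phase_ok dr Pending succeeds while pvc_phase_ok primary Pending fails, which is the behavior a DR cluster with a wrong role setting would trip over.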
DRBD replication
DRBD replication runs directly between primary and DR cluster nodes. The quorum cluster does not relay or participate in replication traffic — it is a management plane component only. Ensure that primary and DR cluster nodes have direct network connectivity on the port DRBD uses for replication.
Once your ecosystem components are installed and connected, the primary workflows you perform are: creating and syncing Protection Groups across clusters, validating DR readiness, and executing failover and failback operations. All of these are performed through pgctl.
Detecting your deployment type
You do not need to manually configure the Site Manager UI with your deployment type. When you open the UI, it queries the local cluster and automatically detects whether LINSTOR or DRBD Operator is managing storage, and whether the VM platform is upstream KubeVirt or OpenShift Virtualization. The UI then shows only the options and workflows applicable to your environment.
Creating a Protection Group
To protect VMs, you create a Protection Group that tracks the VMs and their storage on both the primary and DR clusters. pgctl validates resources on the primary cluster, syncs VM and PVC definitions to the DR cluster, and creates the Protection Group CR on both clusters.
pgctl create protectiongroup production-pg \
--vm prod-vm-1 \
--vm prod-vm-2 \
--namespace default
During creation, pgctl performs cluster-specific validation. On the primary cluster it verifies that VMs exist, PVCs are bound, and geo-replication is enabled. On the DR cluster it verifies that synced definitions are present and the storage class supports geo-replication. PVCs on the DR cluster remain unbound until failover — this is expected.
Syncing resources to the DR cluster
If you add a VM to an existing Protection Group, or if you discover that a VM definition is missing from the DR cluster, use pgctl sync to bring the DR cluster into alignment:
# Sync a single VM
pgctl sync vm prod-vm-3
# Sync all VMs in a Protection Group
pgctl sync protectiongroup production-pg
When syncing, pgctl copies the VM definition, PVC definition, and PV definition from the primary cluster to the DR cluster. It removes fields that must not carry over (such as resourceVersion, uid, and the PVC volumeName) so that the CSI driver can bind the geo-replicated DRBD volume correctly during failover.
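To make the field removal concrete, the sketch below filters a YAML manifest and drops the three fields named above. It is a rough, text-based illustration and not how pgctl implements sync: it removes any single-line resourceVersion, uid, or volumeName key at any nesting depth, so it would also drop a uid inside ownerReferences. The function name strip_cluster_fields is hypothetical.

```shell
#!/usr/bin/env bash
# Reads a YAML manifest on stdin and writes it to stdout without
# single-line resourceVersion, uid, and volumeName keys.
strip_cluster_fields() {
  grep -Ev '^[[:space:]]*(resourceVersion|uid|volumeName):'
}
```

A typical use would be to pipe the output of kubectl --kubeconfig ~/.kube/config-cluster1 get pvc <pvc-name> -n default -o yaml through strip_cluster_fields and inspect the result before applying anything to the DR cluster.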
Validating DR readiness
Before executing a failover — or as part of a regular DR readiness check — validate that all resources exist and are correctly configured on both clusters:
pgctl validate protectiongroup production-pg
Use the dry-run flag to simulate a failover without making any changes:
pgctl failover production-pg --dry-run
The dry run checks that Protection Group, VM, PVC, and PV definitions exist on the target cluster, that DRBD replication is healthy, and that no other Protection Groups are already running on the target cluster. If the dry run passes, the environment is ready.
Executing failover
Use pgctl failover to move a Protection Group from the primary cluster to the DR cluster:
pgctl failover production-pg
pgctl orchestrates the complete sequence: stopping VMs on the source cluster via the Protection Group's desiredState, waiting for DRBD locks to release, removing quorum taints on the target cluster, and starting VMs on the target cluster. You can monitor progress in the terminal output.
Executing failback
To return a Protection Group to the primary cluster after a failover:
pgctl failback production-pg
Failback is equivalent to running pgctl failover production-pg --to cluster1. The same orchestration sequence applies in reverse.
Example 1: Verify multi-cluster VM status
Before and after failover operations, check the current state of a Protection Group across both clusters.
pgctl get protectiongroup production-pg --all-clusters
Expected output (pre-failover):
Protection Group: production-pg
PRIMARY CLUSTER (cluster1):
NAME STATE VMS HEALTH
production-pg Active 2 Healthy
DR CLUSTER (cluster2):
NAME STATE VMS HEALTH
production-pg Active 2 Healthy
VM STATUS:
VM CLUSTER STATUS
prod-vm-1 cluster1 Running
prod-vm-1 cluster2 Stopped
prod-vm-2 cluster1 Running
prod-vm-2 cluster2 Stopped
VMs on the DR cluster should be Stopped in a healthy pre-failover state. If any VMs show as Running on the DR cluster while the primary is also active, you have a split-brain condition that requires immediate investigation.
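The split-brain check can be automated by scanning the VM STATUS rows. The sketch below assumes those rows keep the three-column layout shown in the expected output above (VM, cluster, status); the function name find_split_brain_vms is hypothetical.

```shell
#!/usr/bin/env bash
# Reads `pgctl get protectiongroup <name> --all-clusters` output on stdin.
# Prints each VM reported as Running on more than one cluster and
# returns non-zero if any such VM is found.
find_split_brain_vms() {
  awk '
    # VM STATUS rows have exactly three fields: VM, CLUSTER, STATUS.
    NF == 3 && $3 == "Running" { running[$1]++ }
    END {
      for (vm in running) {
        if (running[vm] > 1) { print vm; found = 1 }
      }
      exit found
    }
  '
}
```

Pipe the output of pgctl get protectiongroup production-pg --all-clusters into find_split_brain_vms; no output and a zero exit status mean no VM is running on both clusters.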
Example 2: Validate Protection Group readiness
Run validation to confirm all resources are present and correctly configured on both clusters before a planned failover.
pgctl validate protectiongroup production-pg
Expected output (healthy configuration):
▶ Validating Protection Group: production-pg
PRIMARY CLUSTER (cluster1):
✓ Protection Group exists
✓ VMs exist and are valid
✓ PVCs are bound
✓ Geo-replication enabled (placementCount >= 2)
✓ DRBD resources healthy
DR CLUSTER (cluster2):
✓ Protection Group exists
✓ VM definitions exist
✓ PVC definitions exist
✓ PV definitions exist
✓ Storage class supports geo-replication
OVERALL: ✓ Configuration valid, ready for failover
If validation reports a missing VM definition on the DR cluster, sync it before proceeding:
pgctl sync protectiongroup production-pg
Example 3: Perform a test failover (dry run)
Use the dry run mode to simulate a failover without stopping or starting any VMs. This is the recommended way to validate DR readiness on a schedule.
pgctl failover production-pg --dry-run
Expected output:
✓ Dry run mode - no changes will be made
Pre-flight checks:
✓ Protection Group exists on both clusters
✓ VM definitions exist on target cluster
✓ PVC definitions exist on target cluster
✓ PV definitions exist on target cluster
✓ No other Protection Groups running on target
✓ DRBD replication healthy
Estimated failover time: 90-120 seconds
Ready for failover: YES
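For a scheduled readiness check, the dry run can be wrapped in a small function. This sketch matches on the "Ready for failover: YES" summary line from the expected output above, because this page does not document pgctl exit codes. The function name dr_ready is hypothetical.

```shell
#!/usr/bin/env bash
# Usage: dr_ready <protection-group>
# Runs a dry-run failover and reports whether the summary line says YES.
dr_ready() {
  local pg="$1" output
  # Capture output even if pgctl exits non-zero; the text is what we inspect.
  output="$(pgctl failover "$pg" --dry-run 2>&1)" || true
  if echo "$output" | grep -q 'Ready for failover: YES'; then
    echo "$pg: ready"
  else
    echo "$pg: NOT ready"
    # Show the full dry-run output so the failing check is visible.
    echo "$output"
    return 1
  fi
}
```

Running dr_ready production-pg from a scheduler and alerting on a non-zero exit status gives an early warning when a pre-flight check starts failing.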
Example 4: Execute a planned failover
Fail over a Protection Group from the primary cluster to the DR cluster.
pgctl failover production-pg
Expected output:
⚠ FAILOVER OPERATION
Protection Group: production-pg
Source: cluster1 (2 VMs running)
Target: cluster2
This will:
1. Stop VMs on cluster1
2. Wait for DRBD locks to release
3. Start VMs on cluster2
Continue? [y/N]: y
▶ Step 1: Stopping VMs on cluster1
✓ prod-vm-1 stopped
✓ prod-vm-2 stopped
▶ Step 2: Waiting for DRBD locks (30s)
✓ Locks released
▶ Step 3: Removing quorum taints on cluster2
✓ Taints removed
▶ Step 4: Starting VMs on cluster2
✓ prod-vm-1 starting...
✓ prod-vm-2 starting...
▶ Step 5: Monitoring startup
✓ prod-vm-1 running (45s)
✓ prod-vm-2 running (52s)
✓ Failover completed successfully!
All VMs now running on cluster2
Example 5: Manually inspect Protection Group state
You can directly query the currentState field on a Protection Group CR to check whether the controller has reconciled VMs to the desired state. This is useful when troubleshooting a stuck failover.
kubectl --kubeconfig ~/.kube/config-cluster2 \
get protectiongroup production-pg \
-n default \
-o jsonpath='{.spec.desiredState} / {.status.currentState}'
Expected output when VMs have started successfully:
running / running
If desiredState is running but currentState is stopped or mixed, the Protection Group Controller has not finished reconciling. Check the controller logs for errors.
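Rather than re-running the query by hand, you can poll until the two fields agree. This is a minimal sketch: the function name wait_for_reconcile and the default timeout and poll interval are assumptions, not documented values.

```shell
#!/usr/bin/env bash
# Usage: wait_for_reconcile <kubeconfig> <protection-group> <namespace> [timeout-seconds] [poll-seconds]
# Polls until status.currentState equals spec.desiredState or the timeout elapses.
wait_for_reconcile() {
  local kubeconfig="$1" pg="$2" ns="$3" timeout="${4:-120}" poll="${5:-5}"
  local elapsed=0 state desired current
  while :; do
    state="$(kubectl --kubeconfig "$kubeconfig" get protectiongroup "$pg" -n "$ns" \
      -o jsonpath='{.spec.desiredState} {.status.currentState}' 2>/dev/null)" || state=""
    desired="${state%% *}"   # text before the first space
    current="${state##* }"   # text after the last space
    if [ -n "$state" ] && [ "$desired" = "$current" ]; then
      echo "reconciled: $current"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out: desired=$desired current=$current"
      return 1
    fi
    sleep "$poll"
    elapsed=$(( elapsed + poll ))
  done
}
```

For example, wait_for_reconcile ~/.kube/config-cluster2 production-pg default returns once both fields read running, or fails after the timeout so you can move on to the controller logs.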
Issue: Protection Group currentState is stuck in mixed
Symptom: kubectl get protectiongroup shows status.currentState: mixed and the state does not resolve on its own.
Likely cause: One or more VMs in the group could not be started or stopped by the Protection Group Controller. This can occur if a VM is in an error state, if a PVC failed to bind after failover, or if a node is unavailable.
Fix:
- Identify which VMs are in an inconsistent state:
  kubectl --kubeconfig ~/.kube/config-cluster2 get vm,vmi -n default
- Check events on any VM that is not in the expected state:
  kubectl --kubeconfig ~/.kube/config-cluster2 describe vm <vm-name> -n default
- Check Protection Group Controller logs for reconciliation errors:
  kubectl logs deployment/protection-group-controller -n <namespace> | grep reconcile_vm_state
- Resolve the underlying VM or PVC issue, then verify the controller reconciles currentState automatically within its next reconciliation cycle (every 60 seconds).
Issue: PVC on DR cluster is in Lost or unbound state after failover
Symptom: A VM fails to schedule on the DR cluster after failover. Describing the VM shows: persistentvolumeclaim "<name>" bound to non-existent persistentvolume.
Likely cause: The PV definition was not successfully synced to the DR cluster before the failover. pgctl sync may have silently skipped PV creation if it encountered an error.
Fix:
- Confirm the PV exists on the DR cluster:
  kubectl --kubeconfig ~/.kube/config-cluster2 get pv
- If the PV is missing, re-run sync for the affected VM:
  pgctl sync vm <vm-name>
- After sync, confirm the PVC binds:
  kubectl --kubeconfig ~/.kube/config-cluster2 get pvc -n default
- If the PVC remains in Lost state, delete and recreate it after confirming the PV is present. Ensure PV creation precedes PVC creation so the CSI driver can bind correctly.
Issue: VM validation fails on DR cluster with "PVC not Bound"
Symptom: pgctl validate or the Protection Group Controller reports a PVC binding error on the DR cluster even though the configuration looks correct.
Likely cause: The controller is running with CLUSTER_ROLE=primary on a DR cluster, causing it to apply full validation rules (which require PVCs to be Bound) instead of DR validation rules (which allow unbound PVCs).
Fix:
- Check the CLUSTER_ROLE environment variable set on the Protection Group Controller deployment on the DR cluster:
  kubectl --kubeconfig ~/.kube/config-cluster2 \
    get deployment protection-group-controller -n <namespace> \
    -o jsonpath='{.spec.template.spec.containers[0].env}'
- If CLUSTER_ROLE is missing or set to primary, update the deployment:
  kubectl --kubeconfig ~/.kube/config-cluster2 \
    set env deployment/protection-group-controller \
    CLUSTER_ROLE=dr -n <namespace>
- Wait for the controller pod to restart and re-run validation.
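The role check in the first step can be narrowed to the single variable with a jsonpath filter. This is a sketch under the same assumptions as the commands above (deployment name protection-group-controller, variable set on the first container); the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Usage: controller_role <kubeconfig> <namespace>
# Prints the value of CLUSTER_ROLE on the controller's first container.
controller_role() {
  kubectl --kubeconfig "$1" get deployment protection-group-controller -n "$2" \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="CLUSTER_ROLE")].value}'
}

# Usage: expect_role <kubeconfig> <namespace> <expected-role>
# Succeeds only if the deployed CLUSTER_ROLE equals the expected role.
expect_role() {
  local actual
  actual="$(controller_role "$1" "$2")" || return 1
  if [ "$actual" = "$3" ]; then
    echo "ok: CLUSTER_ROLE=$actual"
  else
    echo "mismatch: CLUSTER_ROLE='$actual', expected '$3'"
    return 1
  fi
}
```

Running expect_role ~/.kube/config-cluster2 <namespace> dr before validation confirms that the DR cluster applies DR validation rules; an empty value in the mismatch message means the variable is not set at all.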
Issue: pgctl config test fails with connection errors
Symptom: Running pgctl config test returns an error such as Unable to connect to the server for one or more clusters.
Likely cause: The kubeconfig file for the affected cluster is missing, uses an incorrect context name, or points to an unreachable API endpoint.
Fix:
- Verify the kubeconfig files are present in ~/.kube/pgctl/:
  ls -la ~/.kube/pgctl/
- Test the kubeconfig independently:
  kubectl --kubeconfig ~/.kube/pgctl/cluster2.kubeconfig get nodes
- If the file is present but the context is wrong, specify the correct context when adding the cluster:
  pgctl config add-cluster \
    --name cluster2 \
    --kubeconfig cluster2.kubeconfig \
    --role dr
- Confirm the cluster API endpoint is reachable from the machine running pgctl.
Issue: Failover times out waiting for DRBD locks
Symptom: The failover operation stalls at "Waiting for DRBD locks" and eventually times out.
Likely cause: VMs on the source cluster did not fully stop before DRBD attempted to release locks, or there is a DRBD resource in an error state holding a lock.
Fix:
- Confirm all VMs in the Protection Group are stopped on the source cluster:
  kubectl --kubeconfig ~/.kube/config-cluster1 get vmi -n default
  There should be no running VMIs. If any are present, the Protection Group Controller may not have finished reconciling.
- Check the Protection Group currentState on the source cluster:
  kubectl --kubeconfig ~/.kube/config-cluster1 \
    get protectiongroup production-pg \
    -n default \
    -o jsonpath='{.status.currentState}'
  It should read stopped. If it reads mixed or running, see the mixed troubleshooting entry above.
- Once all VMs are confirmed stopped, re-run the failover. If DRBD locks persist without running VMs, check DRBD resource status on the primary cluster nodes directly.