Ecosystem Integration

LINSTOR, DRBD Operator, KubeVirt, OpenShift Virtualization compatibility

Overview

This page describes how Site Recovery integrates with the storage and virtualization components in your Kubernetes environment. Site Recovery supports two storage management backends — LINSTOR and DRBD Operator — and works with both upstream KubeVirt and Red Hat OpenShift Virtualization (OpenShift CNV) for VM lifecycle management. Understanding these integrations helps you select the right deployment model, validate compatibility before installation, and diagnose issues that cross component boundaries. The Site Manager UI automatically detects your deployment type at runtime and presents only the options relevant to your configuration.

Prerequisites

Before reviewing ecosystem integration requirements, ensure you have:

- A supported Kubernetes distribution (upstream Kubernetes or OpenShift) running on your primary and DR clusters
- One of the following storage backends deployed and operational:
  - LINSTOR: requires a three-cluster topology (primary, DR, and quorum). The quorum cluster hosts the LINSTOR controller and failover controllers.
  - DRBD Operator: requires a minimum two-cluster topology (primary and DR). A quorum cluster is optional but recommended for management plane isolation.
- DRBD installed on all nodes that participate in block replication (primary and DR cluster nodes only; the quorum cluster does not participate in storage replication)
- One of the following VM platforms:
  - Upstream KubeVirt
  - Red Hat OpenShift Virtualization (CNV)
- kubectl access to all clusters involved in your deployment
- pgctl installed and configured with kubeconfigs for each cluster (see pgctl configuration)
- Ansible available on the machine used to deploy infrastructure components
- Helm 3 available on the machine used to deploy the Site Manager UI
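
Before starting, it can help to confirm that the command-line tools named above are actually on the PATH of the deployment machine. The following is a minimal sketch; the function name `missing_tools` is hypothetical and not part of Site Recovery.

```shell
# Print each required tool that is not found on PATH, one per line.
# Exit status is non-zero if at least one tool is missing.
missing_tools() (
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  exit "$status"
)

# Usage on the deployment machine:
#   missing_tools kubectl pgctl ansible-playbook helm
```

This only checks that the binaries exist. It does not verify versions (for example, that Helm is version 3) or cluster access.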

Installation

Infrastructure components (DRBD, LINSTOR or DRBD Operator, and the Protection Group and Failover controllers) are deployed using Ansible. The Site Manager UI is deployed separately using Helm and targets the quorum cluster. The steps below apply to both deployment models unless noted.

Step 1: Deploy infrastructure components with Ansible

Run the Ansible playbook against your primary and DR clusters, plus the quorum cluster if your topology includes one. The playbook installs DRBD on primary and DR cluster nodes and deploys the appropriate storage operator and controllers to each cluster.

# LINSTOR model (three clusters required)
ansible-playbook site-recovery-infra.yml \
  -e deployment_model=linstor \
  -e primary_cluster=cluster1 \
  -e dr_cluster=cluster2 \
  -e quorum_cluster=cluster3

# DRBD Operator model (two or three clusters)
ansible-playbook site-recovery-infra.yml \
  -e deployment_model=drbd-operator \
  -e primary_cluster=cluster1 \
  -e dr_cluster=cluster2
  # Add -e quorum_cluster=cluster3 if using an optional quorum cluster

Ansible deploys controllers to all specified clusters. The Protection Group Controller runs on each cluster and manages VM lifecycle locally. The Failover Controller orchestrates cross-cluster operations.
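
Because the LINSTOR model requires a quorum cluster and the DRBD Operator model does not, a small wrapper can catch a missing quorum argument before the playbook runs. This is a sketch only: `infra_extra_vars` is a hypothetical helper, and it uses just the variable names shown in the commands above.

```shell
# Print the playbook extra-vars for a deployment model, one per line.
# Fails if the LINSTOR model is requested without a quorum cluster.
infra_extra_vars() (
  model=$1 primary=$2 dr=$3 quorum=${4:-}
  case "$model" in
    linstor)
      [ -n "$quorum" ] || { echo "linstor requires a quorum cluster" >&2; exit 1; }
      ;;
    drbd-operator) ;;
    *) echo "unknown deployment model: $model" >&2; exit 1 ;;
  esac
  echo "deployment_model=$model"
  echo "primary_cluster=$primary"
  echo "dr_cluster=$dr"
  if [ -n "$quorum" ]; then echo "quorum_cluster=$quorum"; fi
)

# Usage:
#   vars=$(infra_extra_vars linstor cluster1 cluster2 cluster3) || exit 1
#   ansible-playbook site-recovery-infra.yml $(printf ' -e %s' $vars)
```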

Step 2: Apply Protection Group CRDs to primary and DR clusters

kubectl --kubeconfig ~/.kube/config-cluster1 apply -f protection-group-crd.yaml
kubectl --kubeconfig ~/.kube/config-cluster2 apply -f protection-group-crd.yaml

If you are using a three-cluster topology, also apply the CRD to the quorum cluster:

kubectl --kubeconfig ~/.kube/config-cluster3 apply -f protection-group-crd.yaml

Step 3: Verify controller pods are running

# Check Protection Group Controller on primary
kubectl --kubeconfig ~/.kube/config-cluster1 get pods -l app=protection-group-controller

# Check Failover Controller on primary
kubectl --kubeconfig ~/.kube/config-cluster1 get pods -l app=failover-controller

# Repeat for DR cluster
kubectl --kubeconfig ~/.kube/config-cluster2 get pods -l app=protection-group-controller
kubectl --kubeconfig ~/.kube/config-cluster2 get pods -l app=failover-controller
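
To turn the visual check above into a pass/fail result, you can pipe the pod listing through a small filter. The sketch below assumes the standard `kubectl get pods --no-headers` column order (NAME, READY, STATUS, ...); the function name `all_pods_ready` is hypothetical.

```shell
# Read `kubectl get pods --no-headers` output on stdin. Succeed only if at
# least one pod is listed and every pod is Running with all containers ready.
all_pods_ready() {
  awk '
    { split($2, r, "/"); if ($3 != "Running" || r[1] != r[2]) bad = 1; n++ }
    END { exit !(n > 0 && !bad) }
  '
}

# Usage:
#   kubectl --kubeconfig ~/.kube/config-cluster1 get pods \
#     -l app=protection-group-controller --no-headers | all_pods_ready
```

An empty listing is treated as a failure, since it usually means the label selector or namespace did not match any controller pod.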

Step 4: Deploy the Site Manager UI with Helm

The Site Manager UI is deployed to the quorum cluster only. The quorum cluster does not run application workloads.

helm install site-manager ./charts/site-manager \
  --namespace site-recovery \
  --create-namespace \
  --kubeconfig ~/.kube/config-cluster3

Step 5: Initialize pgctl

pgctl is the CLI tool you use for Protection Group management, failover, and failback. Initialize it and register your clusters:

pgctl init

Copy your cluster kubeconfigs into the pgctl configuration directory:

cp ~/.kube/config-cluster1 ~/.kube/pgctl/cluster1.kubeconfig
cp ~/.kube/config-cluster2 ~/.kube/pgctl/cluster2.kubeconfig
# If using a quorum cluster:
cp ~/.kube/config-cluster3 ~/.kube/pgctl/cluster3.kubeconfig

Add each cluster to the pgctl configuration:

pgctl config add-cluster \
  --name cluster1 \
  --kubeconfig cluster1.kubeconfig \
  --role primary

pgctl config add-cluster \
  --name cluster2 \
  --kubeconfig cluster2.kubeconfig \
  --role dr

Verify connectivity to all clusters:

pgctl config test

Expected output:

▶ Testing primary cluster: cluster1
✓ Primary cluster is accessible (version: v1.30.6)
▶ Testing DR cluster: cluster2
✓ DR cluster is accessible (version: v1.30.6)
✓ All clusters are accessible

Configuration

The following describes the key configuration surfaces across the ecosystem components Site Recovery integrates with.

pgctl configuration (`~/.kube/pgctl/config`)

The pgctl configuration file controls cluster registration, failover behavior, and safety settings.

| Setting | Default | Description |
| --- | --- | --- |
| `PRIMARY_CLUSTER` | (required) | Name of the primary cluster as registered in pgctl |
| `DR_CLUSTER` | (required) | Name of the DR cluster as registered in pgctl |
| `DEFAULT_NAMESPACE` | `default` | Kubernetes namespace used when none is specified |
| `FAILOVER_TIMEOUT` | `300` | Maximum seconds to wait for a complete failover operation |
| `VM_START_TIMEOUT` | `120` | Maximum seconds to wait for VMs to reach running state on the target cluster |
| `DRBD_WAIT_TIME` | `30` | Seconds to wait for DRBD locks to release after stopping VMs on the source cluster |
| `TAINT_REMOVAL_WINDOW` | `30` | Duration of the aggressive taint removal loop applied when starting VMs on the target cluster |
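
If you change these timeouts, a rough sanity check is that the per-phase waits still fit inside `FAILOVER_TIMEOUT`. The sketch below assumes the phases run one after another and that `TAINT_REMOVAL_WINDOW` is expressed in seconds like the other settings (the table does not state its unit). The function name `timeout_budget_ok` is hypothetical, and pgctl itself may account for these values differently.

```shell
# Sum the per-phase waits and compare them with the overall failover timeout.
# Arguments: FAILOVER_TIMEOUT VM_START_TIMEOUT DRBD_WAIT_TIME TAINT_REMOVAL_WINDOW
timeout_budget_ok() (
  total=$(( $2 + $3 + $4 ))
  echo "phases=${total}s limit=${1}s"
  [ "$total" -le "$1" ]
)

# With the documented defaults:
#   timeout_budget_ok 300 120 30 30
```

With the defaults the phases add up to 180 seconds, which leaves 120 seconds of the 300-second limit for stopping VMs on the source cluster.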

Protection Group CRD (`spec.desiredState`)

The ProtectionGroup custom resource exposes a spec.desiredState field that controls the lifecycle of all VMs in the group. The Protection Group Controller reconciles VM state on the local cluster to match this field.

| Field | Values | Default | Description |
| --- | --- | --- | --- |
| `spec.desiredState` | `running`, `stopped` | `running` | Declares the desired power state for all VMs in the group. The controller patches `spec.running` on each VM to match. |
| `status.currentState` | `running`, `stopped`, `mixed`, `unknown` | `unknown` | Reflects the actual observed state of VMs. `mixed` indicates VMs are in inconsistent states and may require manual intervention. |

Example resource showing both fields:

apiVersion: siterecovery.trilio.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: production-protection-group
spec:
  desiredState: running
  virtualMachines:
    - name: prod-vm-1
    - name: prod-vm-2
status:
  currentState: running

The desiredState field is the integration point between the Failover Controller (which patches it during orchestration) and the Protection Group Controller (which acts on it locally). You should not typically set this field manually in production — use pgctl failover instead.

Cluster role and validation behavior

Controllers use a CLUSTER_ROLE environment variable (set by Ansible during deployment) to apply cluster-appropriate validation logic:

| Role | Value | Validation behavior |
| --- | --- | --- |
| Primary | `primary` | Full validation: VMs must exist and be running, PVCs must be `Bound`, geo-replication must be enabled (`placementCount >= 2`) |
| DR | `dr` | Minimal validation: VM and PVC definitions must exist; PVC binding state is not checked because CSI binds volumes during failover |

This distinction is critical: PVCs on the DR cluster are intentionally unbound until a failover occurs. If the controller applied primary-cluster validation rules to the DR cluster, it would incorrectly reject a valid configuration.
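
The PVC part of this rule can be expressed in a few lines. This is an illustration of the behavior described above, not the controller's actual code; the function name `pvc_phase_ok` is hypothetical.

```shell
# Decide whether a PVC phase is acceptable for a given CLUSTER_ROLE value.
pvc_phase_ok() (
  role=$1 phase=$2
  case "$role" in
    primary) [ "$phase" = "Bound" ] ;;  # full validation: PVC must be Bound
    dr)      exit 0 ;;                  # binding state is not checked on DR
    *)       echo "unknown CLUSTER_ROLE: $role" >&2; exit 2 ;;
  esac
)

# Example: an unbound PVC is acceptable on the DR cluster but not on primary.
#   pvc_phase_ok dr Pending       # succeeds
#   pvc_phase_ok primary Pending  # fails
```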

DRBD replication

DRBD replication runs directly between primary and DR cluster nodes. The quorum cluster does not relay or participate in replication traffic — it is a management plane component only. Ensure that primary and DR cluster nodes have direct network connectivity on the port DRBD uses for replication.

Usage

Once your ecosystem components are installed and connected, the primary workflows you perform are: creating and syncing Protection Groups across clusters, validating DR readiness, and executing failover and failback operations. All of these are performed through pgctl.

Detecting your deployment type

You do not need to manually configure the Site Manager UI with your deployment type. When you open the UI, it queries the local cluster and automatically detects whether LINSTOR or DRBD Operator is managing storage, and whether the VM platform is upstream KubeVirt or OpenShift Virtualization. The UI then shows only the options and workflows applicable to your environment.

Creating a Protection Group

To protect VMs, you create a Protection Group that tracks the VMs and their storage on both the primary and DR clusters. pgctl validates resources on the primary cluster, syncs VM and PVC definitions to the DR cluster, and creates the Protection Group CR on both clusters.

pgctl create protectiongroup production-pg \
  --vm prod-vm-1 \
  --vm prod-vm-2 \
  --namespace default

During creation, pgctl performs cluster-specific validation. On the primary cluster it verifies that VMs exist, PVCs are bound, and geo-replication is enabled. On the DR cluster it verifies that synced definitions are present and the storage class supports geo-replication. PVCs on the DR cluster remain unbound until failover — this is expected.

Syncing resources to the DR cluster

If you add a VM to an existing Protection Group, or if you discover that a VM definition is missing from the DR cluster, use pgctl sync to bring the DR cluster into alignment:

# Sync a single VM
pgctl sync vm prod-vm-3

# Sync all VMs in a Protection Group
pgctl sync protectiongroup production-pg

When syncing, pgctl copies the VM definition, PVC definition, and PV definition from the primary cluster to the DR cluster. It removes fields that must not carry over (such as resourceVersion, uid, and the PVC volumeName) so that the CSI driver can bind the geo-replicated DRBD volume correctly during failover.
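
To make the field removal concrete, here is a rough sketch that drops the three fields named above from a manifest exported with `kubectl get ... -o yaml`. It is not how pgctl implements sync. It assumes kubectl's default two-space YAML indentation, and the function name `strip_cluster_fields` is hypothetical. A YAML-aware tool would be more robust.

```shell
# Drop resourceVersion, uid, and volumeName when they appear at exactly two
# spaces of indentation (directly under metadata or spec). Deeper fields with
# the same name, such as ownerReferences[].uid, are left untouched.
strip_cluster_fields() {
  awk '!/^  (resourceVersion|uid|volumeName):/'
}

# Usage:
#   kubectl --kubeconfig ~/.kube/config-cluster1 get pvc <pvc-name> \
#     -n default -o yaml | strip_cluster_fields
```

The source lists these fields only as examples, so a real sync may remove others as well.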

Validating DR readiness

Before executing a failover — or as part of a regular DR readiness check — validate that all resources exist and are correctly configured on both clusters:

pgctl validate protectiongroup production-pg

Use the dry-run flag to simulate a failover without making any changes:

pgctl failover production-pg --dry-run

The dry run checks that Protection Group, VM, PVC, and PV definitions exist on the target cluster, that DRBD replication is healthy, and that no other Protection Groups are already running on the target cluster. If the dry run passes, the environment is ready.

Executing failover

Use pgctl failover to move a Protection Group from the primary cluster to the DR cluster:

pgctl failover production-pg

pgctl orchestrates the complete sequence: stopping VMs on the source cluster via the Protection Group's desiredState, waiting for DRBD locks to release, removing quorum taints on the target cluster, and starting VMs on the target cluster. You can monitor progress in the terminal output.

Executing failback

To return a Protection Group to the primary cluster after a failover:

pgctl failback production-pg

Failback is equivalent to running pgctl failover production-pg --to cluster1. The same orchestration sequence applies in reverse.

Examples

Example 1: Verify multi-cluster VM status

Before and after failover operations, check the current state of a Protection Group across both clusters.

pgctl get protectiongroup production-pg --all-clusters

Expected output (pre-failover):

Protection Group: production-pg

PRIMARY CLUSTER (cluster1):
NAME           STATE   VMS   HEALTH
production-pg  Active  2     Healthy

DR CLUSTER (cluster2):
NAME           STATE   VMS   HEALTH
production-pg  Active  2     Healthy

VM STATUS:
VM           CLUSTER   STATUS
prod-vm-1    cluster1  Running
prod-vm-1    cluster2  Stopped
prod-vm-2    cluster1  Running
prod-vm-2    cluster2  Stopped

VMs on the DR cluster should be Stopped in a healthy pre-failover state. If any VMs show as Running on the DR cluster while the primary is also active, you have a split-brain condition that requires immediate investigation.
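
The split-brain check can be automated by scanning the VM STATUS rows for any VM that is Running on more than one cluster. The sketch below assumes the output layout shown above (three whitespace-separated columns per VM row); the function name `find_split_brain` is hypothetical.

```shell
# Read "VM CLUSTER STATUS" rows on stdin and print every VM that is Running
# on more than one cluster. Exit status 1 means at least one was found.
find_split_brain() {
  awk '
    NF == 3 && $3 == "Running" { running[$1]++ }
    END {
      for (vm in running) if (running[vm] > 1) { print vm; found = 1 }
      exit found + 0
    }
  '
}

# Usage:
#   pgctl get protectiongroup production-pg --all-clusters | find_split_brain
```

Header and summary lines are ignored because they either have a different column count or do not end in `Running`.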

Example 2: Validate Protection Group readiness

Run validation to confirm all resources are present and correctly configured on both clusters before a planned failover.

pgctl validate protectiongroup production-pg

Expected output (healthy configuration):

▶ Validating Protection Group: production-pg

PRIMARY CLUSTER (cluster1):
  ✓ Protection Group exists
  ✓ VMs exist and are valid
  ✓ PVCs are bound
  ✓ Geo-replication enabled (placementCount >= 2)
  ✓ DRBD resources healthy

DR CLUSTER (cluster2):
  ✓ Protection Group exists
  ✓ VM definitions exist
  ✓ PVC definitions exist
  ✓ PV definitions exist
  ✓ Storage class supports geo-replication

OVERALL: ✓ Configuration valid, ready for failover

If validation reports a missing VM definition on the DR cluster, sync it before proceeding:

pgctl sync protectiongroup production-pg

Example 3: Perform a test failover (dry run)

Use the dry run mode to simulate a failover without stopping or starting any VMs. This is the recommended way to validate DR readiness on a schedule.

pgctl failover production-pg --dry-run

Expected output:

✓ Dry run mode - no changes will be made

Pre-flight checks:
  ✓ Protection Group exists on both clusters
  ✓ VM definitions exist on target cluster
  ✓ PVC definitions exist on target cluster
  ✓ PV definitions exist on target cluster
  ✓ No other Protection Groups running on target
  ✓ DRBD replication healthy

Estimated failover time: 90-120 seconds

Ready for failover: YES
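
For a scheduled readiness check, the dry-run output can be reduced to a pass/fail result by looking for the final line shown above. This sketch relies on that exact wording. The source does not show what a failing dry run prints, so anything other than `Ready for failover: YES` is treated as not ready. The function name `dry_run_ready` is hypothetical.

```shell
# Succeed only if the dry-run output on stdin contains a line that starts
# with "Ready for failover: YES".
dry_run_ready() {
  grep -q '^Ready for failover: YES'
}

# Usage (for example from a cron job):
#   pgctl failover production-pg --dry-run | dry_run_ready \
#     || echo "DR readiness check failed for production-pg" >&2
```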

Example 4: Execute a planned failover

Fail over a Protection Group from the primary cluster to the DR cluster.

pgctl failover production-pg

Expected output:

⚠ FAILOVER OPERATION
  Protection Group: production-pg
  Source: cluster1 (2 VMs running)
  Target: cluster2

This will:
  1. Stop VMs on cluster1
  2. Wait for DRBD locks to release
  3. Start VMs on cluster2

Continue? [y/N]: y

▶ Step 1: Stopping VMs on cluster1
  ✓ prod-vm-1 stopped
  ✓ prod-vm-2 stopped

▶ Step 2: Waiting for DRBD locks (30s)
  ✓ Locks released

▶ Step 3: Removing quorum taints on cluster2
  ✓ Taints removed

▶ Step 4: Starting VMs on cluster2
  ✓ prod-vm-1 starting...
  ✓ prod-vm-2 starting...

▶ Step 5: Monitoring startup
  ✓ prod-vm-1 running (45s)
  ✓ prod-vm-2 running (52s)

✓ Failover completed successfully!
  All VMs now running on cluster2

Example 5: Manually inspect Protection Group state

You can directly query the currentState field on a Protection Group CR to check whether the controller has reconciled VMs to the desired state. This is useful when troubleshooting a stuck failover.

kubectl --kubeconfig ~/.kube/config-cluster2 \
  get protectiongroup production-pg \
  -n default \
  -o jsonpath='{.spec.desiredState} / {.status.currentState}'

Expected output when VMs have started successfully:

running / running

If desiredState is running but currentState is stopped or mixed, the Protection Group Controller has not finished reconciling. Check the controller logs for errors.
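
When you are waiting for reconciliation, the comparison above can be wrapped in a small helper and polled. The sketch assumes the `<desiredState> / <currentState>` format produced by the jsonpath expression in this example; the function name `is_reconciled` is hypothetical.

```shell
# Succeed when a "<desiredState> / <currentState>" string has a non-empty
# desired state that equals the current state.
is_reconciled() {
  case "$1" in
    *" / "*) [ -n "${1%% / *}" ] && [ "${1%% / *}" = "${1##* / }" ] ;;
    *) return 1 ;;
  esac
}

# Usage (polls every 10 seconds until the states match):
#   until is_reconciled "$(kubectl --kubeconfig ~/.kube/config-cluster2 \
#       get protectiongroup production-pg -n default \
#       -o jsonpath='{.spec.desiredState} / {.status.currentState}')"; do
#     sleep 10
#   done
```

The loop in the usage comment has no upper bound. In a script you would normally cap the number of attempts so that a stuck failover does not block forever.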

Troubleshooting

Issue: Protection Group `currentState` is stuck in `mixed`

Symptom: kubectl get protectiongroup shows status.currentState: mixed and the state does not resolve on its own.

Likely cause: One or more VMs in the group could not be started or stopped by the Protection Group Controller. This can occur if a VM is in an error state, if a PVC failed to bind after failover, or if a node is unavailable.

Fix:

Identify which VMs are in an inconsistent state:

kubectl --kubeconfig ~/.kube/config-cluster2 get vm,vmi -n default

Check events on any VM that is not in the expected state:

kubectl --kubeconfig ~/.kube/config-cluster2 describe vm <vm-name> -n default

Check Protection Group Controller logs for reconciliation errors:

kubectl --kubeconfig ~/.kube/config-cluster2 logs deployment/protection-group-controller -n <namespace> | grep reconcile_vm_state

Resolve the underlying VM or PVC issue, then verify the controller reconciles currentState automatically within its next reconciliation cycle (every 60 seconds).

Issue: PVC on DR cluster is in `Lost` or unbound state after failover

Symptom: A VM fails to schedule on the DR cluster after failover. Describing the VM shows: persistentvolumeclaim "<name>" bound to non-existent persistentvolume.

Likely cause: The PV definition was not successfully synced to the DR cluster before the failover. pgctl sync may have silently skipped PV creation if it encountered an error.

Fix:

Confirm the PV exists on the DR cluster:

kubectl --kubeconfig ~/.kube/config-cluster2 get pv

If the PV is missing, re-run sync for the affected VM:
```
pgctl sync vm <vm-name>
```

After sync, confirm the PVC binds:

kubectl --kubeconfig ~/.kube/config-cluster2 get pvc -n default

If the PVC remains in Lost state, delete and recreate it after confirming the PV is present. Ensure PV creation precedes PVC creation so the CSI driver can bind correctly.

Issue: VM validation fails on DR cluster with "PVC not Bound"

Symptom: pgctl validate or the Protection Group Controller reports a PVC binding error on the DR cluster even though the configuration looks correct.

Likely cause: The controller is running with CLUSTER_ROLE=primary on a DR cluster, causing it to apply full validation rules (which require PVCs to be Bound) instead of DR validation rules (which allow unbound PVCs).

Fix:

Check the CLUSTER_ROLE environment variable set on the Protection Group Controller deployment on the DR cluster:

kubectl --kubeconfig ~/.kube/config-cluster2 \
  get deployment protection-group-controller -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[0].env}'

If CLUSTER_ROLE is missing or set to primary, update the deployment:

kubectl --kubeconfig ~/.kube/config-cluster2 \
  set env deployment/protection-group-controller \
  CLUSTER_ROLE=dr -n <namespace>

Wait for the controller pod to restart and re-run validation.

Issue: `pgctl config test` fails with connection errors

Symptom: Running pgctl config test returns an error such as Unable to connect to the server for one or more clusters.

Likely cause: The kubeconfig file for the affected cluster is missing, uses an incorrect context name, or points to an unreachable API endpoint.

Fix:

Verify the kubeconfig files are present in ~/.kube/pgctl/:
```
ls -la ~/.kube/pgctl/
```

Test the kubeconfig independently:

kubectl --kubeconfig ~/.kube/pgctl/cluster2.kubeconfig get nodes

If the file is present but its current context is wrong, correct the context in the kubeconfig file (for example with kubectl config use-context), then add the cluster again:
```
pgctl config add-cluster \
  --name cluster2 \
  --kubeconfig cluster2.kubeconfig \
  --role dr
```
Confirm the cluster API endpoint is reachable from the machine running pgctl.

Issue: Failover times out waiting for DRBD locks

Symptom: The failover operation stalls at "Waiting for DRBD locks" and eventually times out.

Likely cause: VMs on the source cluster did not fully stop before DRBD attempted to release locks, or there is a DRBD resource in an error state holding a lock.

Fix:

Confirm all VMs in the Protection Group are stopped on the source cluster:
```
kubectl --kubeconfig ~/.kube/config-cluster1 get vmi -n default
```
There should be no running VMIs. If any are present, the Protection Group Controller may not have finished reconciling.

Check the Protection Group currentState on the source cluster:
```
kubectl --kubeconfig ~/.kube/config-cluster1 \
  get protectiongroup production-pg \
  -n default \
  -o jsonpath='{.status.currentState}'
```
It should read stopped. If it reads mixed or running, see "Issue: Protection Group `currentState` is stuck in `mixed`" above.

Once all VMs are confirmed stopped, re-run the failover. If DRBD locks persist without running VMs, check DRBD resource status on the primary cluster nodes directly.
