Site Recovery for Kubernetes Virtual Machines
Tutorial

Getting Started

End-to-end setup from zero to a protected VM.


Overview
[Figure: Getting-started flowchart for Trilio Site Recovery for OpenShift Virtualization. Begin with two OpenShift clusters (primary and recovery site) with KubeVirt installed on both. If the sites are not network reachable, configure inter-cluster networking first. Install the Trilio Operator from OperatorHub on both sites, create the Deployment CR that defines the primary and recovery sites, and fix the configuration until DRBD is ready. Create a Protection Group (POST /protection-groups) and assign KubeVirt VMs to it. Trigger a test failover (POST …/test-failover), monitor it (GET …/test-failover/status), and retry if it does not pass. Clean up the test failover (DELETE …/test-failover); the VM is then protected.]

This tutorial walks you through the complete setup of Trilio Site Recovery for OpenShift Virtualization, from infrastructure deployment to your first protected virtual machine. You will follow one of two deployment paths—LINSTOR (requiring three clusters) or DRBD Operator (requiring a minimum of two clusters)—and end with a VM that is actively replicated and ready for failover. Completing this guide gives you a working foundation for every subsequent operational workflow: test failover, production failover, failback, and ongoing monitoring.


Prerequisites

Before you begin, confirm the following are in place:

Clusters

  • LINSTOR path: Three OpenShift clusters are required — a primary cluster, a DR cluster, and a quorum cluster. The quorum cluster hosts the LINSTOR controller and failover controllers; it does not run application workloads and does not participate in storage replication.
  • DRBD Operator path: Two OpenShift clusters are required — a primary cluster and a DR cluster. A third quorum cluster is optional but recommended for management plane isolation.

Tooling

  • Ansible installed and configured with inventory access to all target clusters
  • helm CLI (v3) available on the machine running the UI deployment
  • pgctl CLI installed — this is the primary tool for Protection Group management, failover operations, and multi-tenant deployment management
  • kubectl configured with valid contexts for each cluster

Networking

  • Direct node-level network connectivity between primary and DR cluster nodes (DRBD replication traffic flows directly between nodes; the quorum cluster does not relay it)
  • API server reachability from the quorum cluster to both primary and DR clusters

Access

  • Cluster-admin rights on all clusters involved in the deployment
  • Ansible playbooks and Helm chart obtained from your Trilio distribution or image registry
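A quick preflight can catch a missing tool or kubectl context before any playbook runs. The sketch below assumes the context names primary, dr, and quorum that the later kubectl commands in this tutorial use; the function names are illustrative and are not part of any Trilio tooling.

```shell
# Preflight sketch: report required CLIs and kubectl contexts that are missing.

# Print each named tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Print each named context that `kubectl config get-contexts` does not list.
missing_contexts() {
  known=$(kubectl config get-contexts -o name 2>/dev/null)
  for ctx in "$@"; do
    printf '%s\n' "$known" | grep -qxF "$ctx" || echo "$ctx"
  done
}

# Usage (both print nothing when everything is in place):
#   missing_tools ansible-playbook helm pgctl kubectl
#   missing_contexts primary dr quorum
```

On a two-cluster DRBD Operator deployment, leave quorum out of the context list.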

Quick start

Choose your deployment model and follow the matching sequence. Both paths end at the same place: a protected VM visible in the Site Manager UI.

Path A — LINSTOR (three-cluster)

  1. Deploy the LINSTOR controller to the quorum cluster:
    ansible-playbook site.yml -i inventory/quorum --tags linstor-controller
    
  2. Deploy LINSTOR satellites to the primary and DR clusters:
    ansible-playbook site.yml -i inventory/primary -i inventory/dr --tags linstor-satellite
    
  3. Deploy the failover controllers to all clusters:
    ansible-playbook site.yml -i inventory/all --tags controllers
    
  4. Deploy the Site Manager UI to the quorum cluster:
    helm install site-manager <chart> -n site-manager --create-namespace -f values-quorum.yaml
    
  5. Create a Deployment and Protection Group using pgctl, then verify replication status in the UI.

Path B — DRBD Operator (two-cluster minimum)

  1. Install the DRBD Operator on the primary and DR clusters:
    ansible-playbook site.yml -i inventory/primary -i inventory/dr --tags drbd-operator
    
  2. Create a DRBDReplicationPolicy to define replication parameters between the two clusters.
  3. Deploy the failover controllers (to the quorum cluster if present, otherwise to the primary cluster):
    ansible-playbook site.yml -i inventory/quorum --tags controllers
    
  4. Deploy the Site Manager UI:
    helm install site-manager <chart> -n site-manager --create-namespace -f values-quorum.yaml
    
  5. Create a ProtectionRequest for each VM via the UI or kubectl.

Steps

The steps below expand both deployment paths with detail on what each command does and how to confirm it succeeded.


LINSTOR Path

Step 1 — Deploy the LINSTOR controller to the quorum cluster

The LINSTOR controller is the management brain for storage resource allocation. It runs exclusively on the quorum cluster to isolate it from application workload disruptions on the primary or DR side.

ansible-playbook site.yml -i inventory/quorum --tags linstor-controller

Success: The play completes with no failed tasks. Verify the controller pod is running:

kubectl --context=quorum get pods -n linstor

Expect a linstor-controller-* pod in Running state.


Step 2 — Deploy LINSTOR satellites to primary and DR clusters

Satellites run on the storage nodes of the primary and DR clusters and handle the actual DRBD replication. Replication traffic flows directly between these nodes — the quorum cluster is not in the data path.

ansible-playbook site.yml -i inventory/primary -i inventory/dr --tags linstor-satellite

Success: Satellite pods appear in Running state on both clusters:

kubectl --context=primary get pods -n linstor
kubectl --context=dr get pods -n linstor
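Reading pod lists by eye becomes error-prone across several clusters. The helper below is a sketch that filters the default `kubectl get pods --no-headers` columns (NAME, READY, STATUS) and reports anything that is not Running or Completed. The function names are illustrative, and the namespace is passed in rather than assumed.

```shell
# Print the name and status of every pod on stdin that is neither Running nor Completed.
# Expects `kubectl get pods --no-headers` output: NAME READY STATUS ...
not_running() {
  awk 'NF >= 3 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# check_pods NAMESPACE CONTEXT [CONTEXT...]
# Returns non-zero if any context has unhealthy pods, no pods, or a kubectl error.
check_pods() {
  ns=$1; shift
  rc=0
  for ctx in "$@"; do
    if ! pods=$(kubectl --context="$ctx" get pods -n "$ns" --no-headers); then
      echo "[$ctx] kubectl failed" >&2
      rc=1
      continue
    fi
    bad=$(printf '%s\n' "$pods" | not_running)
    if [ -z "$pods" ] || [ -n "$bad" ]; then
      echo "[$ctx] $ns: ${bad:-no pods found}"
      rc=1
    fi
  done
  return $rc
}

# Usage: check_pods linstor primary dr
```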

Step 3 — Deploy controllers to all clusters

The failover controllers coordinate protection group state and orchestrate failover decisions. Ansible deploys them to the primary, DR, and quorum clusters.

ansible-playbook site.yml -i inventory/all --tags controllers

Success: Controller pods are Running on all three clusters.


Step 4 — Deploy the Site Manager UI to the quorum cluster

The Helm-deployed UI provides the management dashboard. It is deployed to the quorum cluster so it remains available during primary-cluster disruptions.

helm install site-manager <chart> -n site-manager --create-namespace -f values-quorum.yaml

Success: All Site Manager pods reach Running state:

kubectl --context=quorum get pods -n site-manager

Navigate to the exposed service URL to confirm the login screen loads.


Step 5 — Create a Deployment and Protection Group

A Deployment object registers the cluster topology with the Site Manager. A Protection Group defines which VMs are protected together and under what RPO policy.

Use pgctl to create the Deployment:

pgctl deployment create --name my-deployment \
  --primary-context primary \
  --dr-context dr \
  --quorum-context quorum

Then create a Protection Group:

pgctl pg create --deployment my-deployment \
  --name my-pg \
  --namespace my-vm-namespace

Success: The Protection Group appears in the Site Manager UI with replication status shown as Healthy or Syncing. You can also query replication status directly:

pgctl pg status --deployment my-deployment --name my-pg
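The pgctl calls in this step depend on each other: the Protection Group can only be created once the Deployment exists. One way to chain them so that a failure stops the sequence is sketched below. It uses only the subcommands and flags shown in this step; the wrapper name bootstrap_pg is hypothetical.

```shell
# bootstrap_pg DEPLOYMENT PG NAMESPACE
# Creates the Deployment, then the Protection Group, then prints the group's status.
# Stops at the first pgctl call that fails.
bootstrap_pg() {
  deployment=$1 pg=$2 namespace=$3
  pgctl deployment create --name "$deployment" \
    --primary-context primary \
    --dr-context dr \
    --quorum-context quorum || return 1
  pgctl pg create --deployment "$deployment" \
    --name "$pg" \
    --namespace "$namespace" || return 1
  pgctl pg status --deployment "$deployment" --name "$pg"
}

# Usage: bootstrap_pg my-deployment my-pg my-vm-namespace
```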

DRBD Operator Path

Step 1 — Install the DRBD Operator on primary and DR clusters

The DRBD Operator manages DRBD resources as Kubernetes-native objects, eliminating the need for a LINSTOR controller. Ansible installs it on both the primary and DR clusters.

ansible-playbook site.yml -i inventory/primary -i inventory/dr --tags drbd-operator

Success: Operator pods are Running on both clusters:

kubectl --context=primary get pods -n drbd-operator
kubectl --context=dr get pods -n drbd-operator

Step 2 — Create a DRBDReplicationPolicy

The DRBDReplicationPolicy custom resource defines how replication is configured between the primary and DR clusters — including sync mode and target cluster references. Apply it to the primary cluster:

kubectl --context=primary apply -f drbd-replication-policy.yaml

Success: The resource is accepted and the operator begins establishing DRBD connections between primary and DR nodes:

kubectl --context=primary get drbdreplicationpolicy

Expect STATUS to transition to Connected.


Step 3 — Deploy controllers

Deploy the failover controllers using Ansible. If you have a quorum cluster, target it; otherwise target the primary cluster.

# With quorum cluster
ansible-playbook site.yml -i inventory/quorum --tags controllers

# Without quorum cluster (two-cluster minimum deployment)
ansible-playbook site.yml -i inventory/primary --tags controllers

Success: Controller pods are Running on the target cluster.
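The choice between the two commands in this step can be scripted. The sketch below assumes that a kubectl context named quorum exists only when a quorum cluster is part of the deployment; that convention and the function name are assumptions of this example.

```shell
# Print the inventory to target for the controllers:
# inventory/quorum when a "quorum" kubectl context exists, otherwise inventory/primary.
controller_inventory() {
  if kubectl config get-contexts -o name 2>/dev/null | grep -qxF quorum; then
    echo inventory/quorum
  else
    echo inventory/primary
  fi
}

# Usage: ansible-playbook site.yml -i "$(controller_inventory)" --tags controllers
```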


Step 4 — Deploy the Site Manager UI

Deploy the UI via Helm to the quorum cluster (or primary cluster for two-cluster deployments):

helm install site-manager <chart> -n site-manager --create-namespace -f values-quorum.yaml

Success: All Site Manager pods reach Running state and the UI is accessible.


Step 5 — Create a ProtectionRequest for each VM

With the DRBD Operator model, each VM is individually enrolled for protection via a ProtectionRequest resource. You can do this from the UI or with kubectl.

Via kubectl:

kubectl --context=primary apply -f protection-request-my-vm.yaml

Via the UI: navigate to Virtual Machines, select your VM, and click Protect.

Success: Query the VM's protection status:

kubectl --context=primary get protectionrequest my-vm -n my-vm-namespace

Expect STATUS to show Protected. The Site Manager UI will also display the VM under its Protection Group with a green replication health indicator.
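Because each VM needs its own ProtectionRequest, enrolling several VMs means applying several manifests. The loop below is a sketch that assumes one manifest per VM, kept in a single directory and named protection-request-<vm>.yaml after the file name used in this step. The manifest contents are not covered here and must come from your Trilio distribution.

```shell
# apply_protection_requests [DIR]
# Applies every protection-request-*.yaml in DIR (default: current directory)
# to the primary cluster, stopping at the first failure.
apply_protection_requests() {
  dir=${1:-.}
  for manifest in "$dir"/protection-request-*.yaml; do
    # An unmatched glob stays literal, so check that the file really exists.
    [ -e "$manifest" ] || { echo "no manifests found in $dir" >&2; return 1; }
    kubectl --context=primary apply -f "$manifest" || return 1
  done
}

# Usage: apply_protection_requests ./protection-requests
```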


Examples

The following concrete examples illustrate the commands and outputs you should see at key points in the workflow.


Example 1 — Query Protection Group status with pgctl

After creating a Protection Group in the LINSTOR path, confirm it is healthy:

pgctl pg status --deployment my-deployment --name my-pg

Expected output:

Protection Group: my-pg
Deployment:       my-deployment
Namespace:        my-vm-namespace
Replication:      Healthy
Last Sync:        2024-01-15T10:32:00Z
VMs Protected:    3
RPO Compliance:   OK
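For scripting, the replication state can be pulled out of that report. The filter below is a sketch that assumes the plain-text "Field: value" layout shown in the expected output above; if your pgctl version formats its output differently, the pattern needs adjusting.

```shell
# Extract the value of the "Replication:" field from `pgctl pg status` output on stdin.
replication_state() {
  awk -F': *' '$1 == "Replication" { print $2 }'
}

# Usage:
#   state=$(pgctl pg status --deployment my-deployment --name my-pg | replication_state)
#   [ "$state" = "Healthy" ] || echo "replication is $state" >&2
```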

Example 2 — List all Protection Groups across namespaces via the API

The Site Manager backend exposes a REST endpoint that discovers Protection Groups across all namespaces by default. You can query it directly for scripting or troubleshooting:

curl -s "https://<site-manager-url>/api/deployments/<deployment-id>/protection-groups?all_namespaces=true" \
  -H "Authorization: Bearer $TOKEN" | jq .

Expected output (abbreviated):

[
  {
    "name": "my-pg",
    "namespace": "my-vm-namespace",
    "status": "Healthy",
    "vmCount": 3
  }
]

Example 3 — Check replication status for a deployment

Query per-namespace replication health after your Protection Group is established:

curl -s "https://<site-manager-url>/api/deployments/<deployment-id>/replication-status?namespace=my-vm-namespace" \
  -H "Authorization: Bearer $TOKEN" | jq .

Expected output (abbreviated):

{
  "namespace": "my-vm-namespace",
  "replicationState": "Synced",
  "lagSeconds": 2
}

Example 4 — Initiate and monitor a test failover

Once a Protection Group is healthy, trigger a test failover to validate your DR readiness without impacting production:

# Initiate test failover
curl -s -X POST \
  "https://<site-manager-url>/api/deployments/<deployment-id>/protection-groups/my-pg/test-failover" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"namespace": "my-vm-namespace"}'

Poll status:

curl -s \
  "https://<site-manager-url>/api/deployments/<deployment-id>/protection-groups/my-pg/test-failover/status?namespace=my-vm-namespace" \
  -H "Authorization: Bearer $TOKEN" | jq .status

Expected output when complete:

"TestFailoverSucceeded"

Clean up the test failover environment when done:

curl -s -X DELETE \
  "https://<site-manager-url>/api/deployments/<deployment-id>/protection-groups/my-pg/test-failover?namespace=my-vm-namespace" \
  -H "Authorization: Bearer $TOKEN"
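The poll request in this example is normally repeated until the status settles. A minimal sketch of such a loop follows. SITE_MANAGER_URL and DEPLOYMENT_ID are placeholder variables introduced for this example, and the function names are illustrative. Only the success value TestFailoverSucceeded is known from this tutorial, so any other outcome simply runs until the attempts are used up.

```shell
# poll_until WANT ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it prints WANT; returns 1 if that never happens.
poll_until() {
  want=$1 attempts=$2 interval=$3
  shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    got=$("$@") || got=""
    [ "$got" = "$want" ] && return 0
    echo "attempt $i/$attempts: status is '${got:-unknown}'" >&2
    [ "$i" -lt "$attempts" ] && sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Status command built from the poll request; jq -r prints the value without quotes.
test_failover_status() {
  curl -s \
    "$SITE_MANAGER_URL/api/deployments/$DEPLOYMENT_ID/protection-groups/my-pg/test-failover/status?namespace=my-vm-namespace" \
    -H "Authorization: Bearer $TOKEN" | jq -r .status
}

# Usage: poll_until TestFailoverSucceeded 60 10 test_failover_status
```

Run the DELETE cleanup only after the loop returns, since the status resource disappears once the test failover is removed.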

Example 5 — Check individual VM protection status

Verify that a specific VM is actively protected within a deployment:

curl -s \
  "https://<site-manager-url>/api/deployments/<deployment-id>/virtual-machines/my-vm/protection-status?namespace=my-vm-namespace" \
  -H "Authorization: Bearer $TOKEN" | jq .

Expected output:

{
  "vmName": "my-vm",
  "namespace": "my-vm-namespace",
  "protectionStatus": "Protected",
  "replicationState": "Synced"
}

Troubleshooting

Use the following entries to diagnose the most common setup failures. Each entry follows the format: Symptom → Likely cause → Fix.


LINSTOR satellite pods stuck in CrashLoopBackOff

Symptom: After running the satellite Ansible play, pods on primary or DR clusters repeatedly restart.

Likely cause: The satellite cannot reach the LINSTOR controller on the quorum cluster. This is typically a network policy or firewall rule blocking the controller API port.

Fix: Confirm node-level connectivity from a satellite node to the quorum cluster's LINSTOR controller service. Check that no NetworkPolicy in the linstor namespace on the quorum cluster blocks ingress from primary/DR node CIDRs. Review satellite pod logs:

kubectl --context=primary logs -n linstor -l app=linstor-satellite --tail=50

DRBDReplicationPolicy status remains Pending (DRBD Operator path)

Symptom: After applying the DRBDReplicationPolicy, the resource does not transition to Connected.

Likely cause: DRBD replication requires direct TCP connectivity between storage nodes on the primary and DR clusters. The quorum cluster does not relay this traffic. A firewall or missing route between the two clusters' node networks will stall connection establishment.

Fix: Verify direct node-to-node reachability between primary and DR cluster nodes on the DRBD replication ports (by default in the 7000 range). Check DRBD Operator logs:

kubectl --context=primary logs -n drbd-operator -l app=drbd-operator --tail=100
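Node-to-node reachability can be probed with a plain TCP connection attempt from a primary storage node to a DR storage node. The sketch below relies on bash's /dev/tcp pseudo-device and the coreutils timeout command. The address and port in the usage line are placeholders: take the real port from your replication configuration, since only the 7000 range is stated here.

```shell
# port_open HOST PORT
# Succeeds if a TCP connection to HOST:PORT can be opened within 3 seconds.
port_open() {
  host=$1 port=$2
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Usage (placeholder address and port):
#   port_open 192.0.2.10 7000 && echo reachable || echo blocked
```

A refused or timed-out connection points to a firewall rule or missing route; an open port only shows that the path is clear, not that DRBD itself is healthy.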

Site Manager UI pods fail to start after helm install

Symptom: One or more site-manager-* pods are in Error or CrashLoopBackOff after Helm deployment.

Likely cause: The UI cannot reach the backend controller API, or the values file does not correctly reference the quorum cluster's controller endpoint.

Fix: Inspect the pod logs for connection errors:

kubectl --context=quorum logs -n site-manager -l app=site-manager --tail=50

Verify that the controller service is running on the quorum cluster and that the Helm values file points to the correct internal service name or external URL.


pgctl pg create returns an error about an unknown deployment

Symptom: Running pgctl pg create fails with a message indicating the deployment ID or name is not found.

Likely cause: The Deployment object was not successfully created, or pgctl is not pointing at the correct Site Manager endpoint.

Fix: First confirm the Deployment exists:

pgctl deployment list

If the deployment is absent, re-run pgctl deployment create with correct --primary-context, --dr-context, and --quorum-context flags matching your kubectl context names. If the list command itself errors, check that pgctl is configured to reach the Site Manager API (check your pgctl config file or environment variable for the endpoint URL).


Protection Group shows Degraded replication status

Symptom: The Protection Group is created successfully but the Site Manager UI or pgctl pg status reports Degraded instead of Healthy.

Likely cause: DRBD sync between primary and DR nodes has not completed, or a node is unreachable. This is common immediately after initial setup when initial full-sync is in progress.

Fix: Allow time for initial sync to complete — large volumes may take minutes to hours. Check the replication lag:

curl -s "https://<site-manager-url>/api/deployments/<deployment-id>/replication-status?namespace=my-vm-namespace" \
  -H "Authorization: Bearer $TOKEN" | jq .lagSeconds

If lag is not decreasing, inspect DRBD status on the primary nodes directly and verify node-to-node connectivity between primary and DR cluster storage nodes.
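Whether lag is shrinking is easier to judge from several readings than from one. The helper below is a sketch that takes lagSeconds samples, oldest first and one per line, and compares the first with the last. The function name and the three result labels are inventions of this example.

```shell
# Read lagSeconds samples on stdin (oldest first) and print the trend.
lag_trend() {
  awk 'NR == 1 { first = $1 + 0 }
       { last = $1 + 0 }
       END {
         if (NR < 2)            print "insufficient-data"
         else if (last < first) print "decreasing"
         else                   print "not-decreasing"
       }'
}

# Usage: collect a few readings a minute apart, then
#   printf '%s\n' 120 80 45 | lag_trend    # prints: decreasing
```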


Test failover status endpoint returns 404

Symptom: Polling /api/deployments/<deployment-id>/protection-groups/<pgName>/test-failover/status returns HTTP 404.

Likely cause: No test failover has been initiated for that Protection Group in that namespace, or the namespace query parameter does not match the namespace where the test failover was started.

Fix: Confirm you are passing the correct namespace query parameter; if it is omitted, the request targets the namespace named default. Ensure the POST that initiates the test failover succeeded before you poll its status. If the test failover was cleaned up via DELETE, the status resource no longer exists and a new test failover must be initiated.
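When scripting against the status endpoint, it helps to look at the HTTP status code before parsing the body. The sketch below uses curl's --write-out option to fetch only the code and maps it to a hint. The 404 hint restates the causes described in this entry, the 401/403 hint reflects general HTTP semantics rather than documented Site Manager behavior, and both function names are illustrative.

```shell
# status_code URL
# Print only the HTTP status code returned for URL.
status_code() {
  curl -s -o /dev/null -w '%{http_code}' -H "Authorization: Bearer $TOKEN" "$1"
}

# explain_code CODE
# Translate a status code from the test-failover status endpoint into a hint.
explain_code() {
  case $1 in
    200)     echo "a test failover exists for this Protection Group and namespace" ;;
    401|403) echo "the request was rejected; check the bearer token and its permissions" ;;
    404)     echo "no test failover found; check the namespace parameter or start a new one" ;;
    *)       echo "unexpected HTTP status $1" ;;
  esac
}

# Usage: explain_code "$(status_code "$STATUS_URL")"
```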