Site Recovery for Kubernetes Virtual Machines
Guide

Operations API

List active/completed operations, get operation details


Overview

The Operations API provides visibility into active and completed failover operations across your Site Recovery deployment. Use it to monitor in-flight failovers, audit historical operations, and retrieve detailed status for individual Failover custom resources. These capabilities are critical for validating DR readiness and diagnosing issues in production. This page also documents test scenarios that validate the pgctl CLI and Failover Controller behavior, so you can confirm your deployment behaves correctly before relying on it in an actual disaster recovery event.


Prerequisites

Before running these test scenarios, ensure you have:

  • Access to all relevant clusters (primary, DR, and quorum if applicable) with valid kubeconfigs
  • pgctl installed and configured with credentials for each cluster
  • Protection Group CRDs applied: protection-group-crd.yaml and failover-crd.yaml
  • Protection Group Controller and Failover Controller deployments running and healthy
  • At least one Protection Group defined with two or more KubeVirt VMs assigned
  • kubectl available and context-switched to the appropriate cluster for each step
  • For LINSTOR deployments: three clusters (primary, DR, quorum) confirmed reachable
  • For DRBD Operator deployments: two clusters minimum (primary, DR); quorum cluster optional
  • DRBD replication confirmed active between primary and DR cluster nodes before testing
  • Sufficient RBAC permissions to create and patch protectiongroups and failovers custom resources, and to read/write Kubernetes Leases in the coordination.k8s.io/v1 API group
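
The RBAC prerequisite above can be checked before you start. The following is a minimal sketch, not part of the product: check_rbac is a hypothetical helper name, the resource names are taken from the CRDs listed on this page, and the specific verbs (create and patch for the custom resources; get, create, and update for Leases) are an assumed reading of "create and patch" and "read/write".

```shell
#!/bin/sh
# Sketch: report which of the permissions listed in the prerequisites are
# missing for the current kubeconfig context.
# Assumption: the verb list below is one interpretation of the prerequisite.

check_rbac() {
  ns="$1"
  missing=0
  while read -r verb resource; do
    # `kubectl auth can-i` prints "yes" or "no" (and exits non-zero on "no").
    answer=$(kubectl auth can-i "$verb" "$resource" -n "$ns" </dev/null 2>/dev/null || true)
    if [ "$answer" != "yes" ]; then
      echo "missing: $verb $resource"
      missing=$((missing + 1))
    fi
  done <<'EOF'
create protectiongroups.siterecovery.trilio.io
patch protectiongroups.siterecovery.trilio.io
create failovers.siterecovery.trilio.io
patch failovers.siterecovery.trilio.io
get leases.coordination.k8s.io
create leases.coordination.k8s.io
update leases.coordination.k8s.io
EOF
  # Succeed only if nothing was reported as missing.
  [ "$missing" -eq 0 ]
}

# Usage (run once per cluster context):
#   check_rbac site-recovery
```

The function prints one "missing:" line per denied permission and returns non-zero if any are missing, so it can gate the rest of a test run.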

Installation

The Operations API is exposed through the Failover Controller and is accessed via pgctl or directly via kubectl against the Failover CRD. No separate installation is required beyond the standard Site Recovery controller deployment.

Step 1 — Verify the Failover Controller is running

kubectl get deployment failover-controller -n site-recovery

Expected output:

NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
failover-controller  1/1     1            1           5d

Step 2 — Verify the Protection Group Controller is running

kubectl get deployment protection-group-controller -n site-recovery

Step 3 — Confirm CRDs are registered

kubectl get crd | grep siterecovery.trilio.io

Expected output should include both:

failovers.siterecovery.trilio.io
protectiongroups.siterecovery.trilio.io
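
If you want this CRD check to fail loudly in a script rather than rely on reading grep output, a small filter along these lines may help. It is a sketch only: missing_crds is a hypothetical helper, and it assumes the input comes from kubectl get crd -o name, which prints each CRD as customresourcedefinition.apiextensions.k8s.io/<name>.

```shell
#!/bin/sh
# Sketch: read `kubectl get crd -o name` output on stdin and print every
# required Site Recovery CRD that is NOT present.

missing_crds() {
  input=$(cat)
  for crd in failovers.siterecovery.trilio.io protectiongroups.siterecovery.trilio.io; do
    # Match the whole line exactly (-x) as a fixed string (-F).
    if ! printf '%s\n' "$input" | grep -qxF "customresourcedefinition.apiextensions.k8s.io/$crd"; then
      echo "$crd"
    fi
  done
}

# Usage:
#   kubectl get crd -o name | missing_crds
```

Empty output means both CRDs are registered.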

Step 4 — Confirm pgctl can reach your clusters

pgctl status --namespace site-recovery

If pgctl returns connection errors, verify your kubeconfig contexts and that the Site Manager UI has been deployed to the quorum cluster via Helm.


Configuration

The following fields on the Failover custom resource control how the Failover Controller processes operations. Understanding these options is important for both normal failover execution and for the test scenarios documented below.

Field | Default | Valid Values | Effect
spec.protectionGroup | (required) | Any valid Protection Group name | Identifies which Protection Group this failover operation targets
spec.targetCluster | (required) | Cluster identifier string | The cluster to which VMs will be failed over
spec.force | false | true, false | When true, bypasses the concurrent-safety check that blocks taint removal if other Protection Groups have running VMs on the target cluster. Use only in emergencies and only after coordinating with other operators.
status.phase | (set by controller) | Pending, Running, Succeeded, Failed | Reflects the current phase of the failover operation as updated by the Failover Controller
status.message | (set by controller) | Free-form string | Human-readable description of current progress or failure reason

The ProtectionGroup resource has two fields directly relevant to operation monitoring:

Field | Default | Valid Values | Effect
spec.desiredState | running | running, stopped | Drives the Protection Group Controller to start or stop all VMs in the group atomically. The Failover Controller patches this field rather than touching VMs directly.
status.currentState | (set by controller) | running, stopped, mixed, unknown | Reflects the actual observed state of all VMs in the group. The Failover Controller waits for this to match spec.desiredState before proceeding.

The Failover Controller uses Kubernetes Leases (coordination.k8s.io/v1) to enforce per-Protection Group locking. Each lock is named failover-lock-{protection-group-name} in the same namespace as the Protection Group, and expires after 300 seconds.


Usage

You interact with the Operations API primarily through pgctl for high-level management and through kubectl for low-level inspection of Failover CRs and Protection Group status.

Listing active operations

Use pgctl to list all in-flight and recently completed failover operations:

pgctl operations list --namespace site-recovery

To filter by Protection Group:

pgctl operations list --namespace site-recovery --protection-group production-protection-group

Getting details for a specific operation

Failover operations are represented as Failover custom resources. You can inspect them directly:

kubectl get failover -n site-recovery
kubectl describe failover <failover-name> -n site-recovery
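
For scripting, the two status fields are often enough. The sketch below reads status.phase and status.message with kubectl's JSONPath output and maps the phase to an exit code. failover_result is a hypothetical helper name, and the exit-code convention (0 Succeeded, 1 Failed, 2 anything else) is an assumption made for this example, not product behavior.

```shell
#!/bin/sh
# Sketch: print "<phase>: <message>" for one Failover CR and return an exit
# code derived from the phase.

failover_result() {
  name="$1"
  ns="$2"
  phase=$(kubectl get failover "$name" -n "$ns" -o jsonpath='{.status.phase}' 2>/dev/null || true)
  message=$(kubectl get failover "$name" -n "$ns" -o jsonpath='{.status.message}' 2>/dev/null || true)
  echo "${phase:-unknown}: ${message:-no message}"
  case "$phase" in
    Succeeded) return 0 ;;
    Failed)    return 1 ;;
    *)         return 2 ;;  # Pending, Running, or not yet set
  esac
}

# Usage:
#   failover_result <failover-name> site-recovery
```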

Monitoring Protection Group state transitions

During a failover, the Failover Controller patches spec.desiredState on the Protection Group and waits for status.currentState to converge. You can watch this in real time:

kubectl get protectiongroup production-protection-group -n site-recovery -w
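
When you need a script to block until the group has converged, rather than watching interactively, a polling loop is one option. This is a sketch under assumptions: wait_for_pg_converged is a hypothetical helper, and the default timeout and interval are arbitrary values chosen for the example.

```shell
#!/bin/sh
# Sketch: poll a Protection Group until status.currentState equals
# spec.desiredState, or give up after a timeout (seconds).

wait_for_pg_converged() {
  pg="$1"
  ns="$2"
  timeout="${3:-300}"
  interval="${4:-5}"
  elapsed=0
  while :; do
    # One call returns both fields separated by a single space.
    states=$(kubectl get protectiongroup "$pg" -n "$ns" \
      -o jsonpath='{.spec.desiredState} {.status.currentState}' 2>/dev/null || true)
    desired=${states%% *}
    current=${states#* }
    if [ -n "$desired" ] && [ "$desired" = "$current" ]; then
      echo "converged: $current"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      break
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out: desired=$desired current=$current"
  return 1
}

# Usage:
#   wait_for_pg_converged production-protection-group site-recovery 300 5
```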

Checking the failover lock

If a failover appears stuck or you need to confirm whether a lock is held, inspect the Kubernetes Lease:

kubectl get lease failover-lock-production-protection-group -n site-recovery -o yaml

The spec.holderIdentity field shows which process holds the lock. The lock expires automatically after 300 seconds if the holder crashes.
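
To estimate how long a held lock has left, you can do the arithmetic on the Lease fields. The sketch below assumes the controller populates the standard spec.renewTime field and that the 300-second expiry corresponds to the lease duration; lease_seconds_left is a hypothetical helper, and the date -d conversion in the usage comment requires GNU date.

```shell
#!/bin/sh
# Sketch: seconds remaining before a lease expires.
# Arguments: renew time (epoch seconds), lease duration (seconds), now (epoch seconds).

lease_seconds_left() {
  renew_epoch="$1"
  duration="$2"
  now_epoch="$3"
  left=$((renew_epoch + duration - now_epoch))
  # An already-expired lease has zero seconds left, not a negative number.
  if [ "$left" -lt 0 ]; then
    left=0
  fi
  echo "$left"
}

# Usage (GNU date assumed):
#   renew=$(kubectl get lease failover-lock-production-protection-group \
#     -n site-recovery -o jsonpath='{.spec.renewTime}')
#   lease_seconds_left "$(date -d "$renew" +%s)" 300 "$(date +%s)"
```

For example, a lease renewed 100 seconds ago with a 300-second duration has 200 seconds left.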

Triggering a failover via CRD (recommended for production)

Create a Failover CR to trigger a Kubernetes-native, auditable failover:

kubectl apply -f - <<EOF
apiVersion: siterecovery.trilio.io/v1alpha1
kind: Failover
metadata:
  name: failover-production-pg-to-dr
  namespace: site-recovery
spec:
  protectionGroup: production-protection-group
  targetCluster: dr-cluster
  force: false
EOF

Alternatively, use the CRD-based shell script:

./intelligent-pg-failover-crd.sh production-protection-group --to-cluster2
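
If you trigger failovers from your own automation, you may prefer to generate the manifest from parameters instead of editing YAML by hand. The following sketch renders the same Failover CR shown above. render_failover is a hypothetical helper, and the default name pattern failover-<pg>-<timestamp> is borrowed from the script output shown later on this page rather than from any documented naming rule.

```shell
#!/bin/sh
# Sketch: print a Failover manifest for the given Protection Group and target.
# Arguments: protection group, target cluster, force (true|false), optional name.

render_failover() {
  pg="$1"
  target="$2"
  force="${3:-false}"
  name="${4:-failover-$pg-$(date +%s)}"
  # spec.force only accepts true or false.
  case "$force" in
    true|false) ;;
    *) echo "force must be true or false" >&2; return 1 ;;
  esac
  cat <<EOF
apiVersion: siterecovery.trilio.io/v1alpha1
kind: Failover
metadata:
  name: $name
  namespace: site-recovery
spec:
  protectionGroup: $pg
  targetCluster: $target
  force: $force
EOF
}

# Usage:
#   render_failover production-protection-group dr-cluster | kubectl apply -f -
```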

Examples

The following scenarios validate the Operations API, pgctl, and Failover Controller behavior end-to-end. Run these in order, as later scenarios build on earlier ones.


Scenario 1: Single Protection Group failover (baseline happy path)

This is the simplest validation. One Protection Group, one failover, no concurrent operations.

# Confirm PG is running on primary cluster
kubectl get protectiongroup pg-a -n site-recovery -o jsonpath='{.status.currentState}'

Expected output:

running

# Trigger failover to DR cluster
./intelligent-pg-failover-crd.sh pg-a --to-cluster2

Expected output:

[INFO] Validating authentication...
[INFO] Detecting VM locations for pg-a...
[INFO] Creating Failover CR: failover-pg-a-<timestamp>
[INFO] Monitoring Failover CR status...
[INFO] Phase: Running
[INFO] Phase: Succeeded
[SUCCESS] Failover of pg-a to cluster2 completed successfully

# Verify PG state on DR cluster
kubectl --kubeconfig=dr-cluster.kubeconfig \
  get protectiongroup pg-a -n site-recovery \
  -o jsonpath='{.status.currentState}'

Expected output:

running

# Verify via pgctl
pgctl operations list --namespace site-recovery --protection-group pg-a

Expected output (representative):

NAME                      PROTECTION-GROUP  TARGET     PHASE      AGE
failover-pg-a-1730400000  pg-a              cluster2   Succeeded  2m

Scenario 2: Concurrent failover of the same Protection Group (lock enforcement)

This validates that the per-Protection Group Kubernetes Lease prevents two simultaneous failovers of the same group.

# Terminal 1: Start a failover
./intelligent-pg-failover-crd.sh pg-a --to-cluster2

While Terminal 1 is running, in a second terminal:

# Terminal 2: Attempt a concurrent failover of the same PG
./intelligent-pg-failover-crd.sh pg-a --to-cluster2

Expected output in Terminal 2:

[ERROR] Failed to acquire failover lock for pg-a
Failover lock is held by: failover-controller-12345
[ERROR] Failover aborted — another failover is already in progress for this Protection Group

Verify the lock exists:

kubectl get lease failover-lock-pg-a -n site-recovery -o yaml

Expected: The lease shows a holderIdentity corresponding to the first failover process. Terminal 1 should complete successfully.
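
The two-terminal procedure can also be automated. The sketch below starts two commands in the background and passes only if exactly one of them exits successfully, which is the outcome this scenario expects. exactly_one_succeeds is a hypothetical helper, and it assumes the failover script exits non-zero when it fails to acquire the lock, which this page does not state explicitly.

```shell
#!/bin/sh
# Sketch: run two commands concurrently and succeed only if exactly one of
# them exits with status 0.
# Arguments: command 1, command 2, optional delay (seconds) before starting command 2.

exactly_one_succeeds() {
  sh -c "$1" >/dev/null 2>&1 &
  pid1=$!
  # Optional head start so the first command can acquire the lock.
  sleep "${3:-0}"
  sh -c "$2" >/dev/null 2>&1 &
  pid2=$!
  ok=0
  wait "$pid1" && ok=$((ok + 1))
  wait "$pid2" && ok=$((ok + 1))
  [ "$ok" -eq 1 ]
}

# Usage:
#   exactly_one_succeeds \
#     "./intelligent-pg-failover-crd.sh pg-a --to-cluster2" \
#     "./intelligent-pg-failover-crd.sh pg-a --to-cluster2" 2
```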


Scenario 3: Concurrent failover of different Protection Groups to the same target cluster (safety check enforcement)

This validates that the concurrent-safety check blocks taint removal when another Protection Group has running VMs on the target cluster.

# Setup: pg-a and pg-b both running on cluster1
kubectl get protectiongroup pg-a -n site-recovery -o jsonpath='{.status.currentState}'
# Expected: running

kubectl get protectiongroup pg-b -n site-recovery -o jsonpath='{.status.currentState}'
# Expected: running

# Terminal 1: Failover pg-a to cluster2 (starts first)
./intelligent-pg-failover.sh pg-a --to-cluster2

While Terminal 1 is running (pg-a VMs now starting on cluster2), in Terminal 2:

# Terminal 2: Attempt to failover pg-b to the same target
./intelligent-pg-failover.sh pg-b --to-cluster2

Expected output in Terminal 2:

⚠ SAFETY CHECK: Other Protection Groups have running VMs on cluster2:
    - pg-a
UNSAFE to remove taints on cluster2 - other Protection Groups have running VMs
This could cause scheduling issues for those VMs
To override this safety check, use --force flag (NOT RECOMMENDED)
[ERROR] Failover aborted

This is the correct behavior. Wait for Terminal 1 (pg-a failover) to complete, then retry pg-b.
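
To see in advance which groups would trigger the safety check, you can reproduce its logic approximately. This sketch is an illustration of the rule as described on this page (other Protection Groups reporting running on the target cluster), not the controller's actual implementation; blocking_pgs is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: read "<name> <currentState>" lines on stdin and print every
# Protection Group, other than the one being failed over, that reports running.
# Argument: name of the Protection Group you intend to fail over.

blocking_pgs() {
  awk -v self="$1" '$1 != self && $2 == "running" { print $1 }'
}

# Usage:
#   kubectl --kubeconfig=cluster2.kubeconfig get protectiongroups -n site-recovery \
#     --no-headers \
#     -o custom-columns='NAME:.metadata.name,CURRENT:.status.currentState' \
#     | blocking_pgs pg-b
```

Empty output suggests the safety check should not block the failover.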


Scenario 4: Force override in emergency (safety bypass)

This validates that --force allows an operator to override the concurrent-safety check when they have accepted the risk. Use this only in controlled conditions.

# pg-a is already running on cluster2
# Emergency: you must also failover pg-b to cluster2 immediately

./intelligent-pg-failover.sh pg-b --to-cluster2 --force

Expected output:

⚠️  WARNING: Force mode enabled - safety checks bypassed
⚠️  Using --force can disrupt other Protection Groups
[INFO] Removing quorum taints on cluster2 (force mode)...
[INFO] Phase: Running
[INFO] Phase: Succeeded
[SUCCESS] Failover of pg-b to cluster2 completed (force mode)

After the failover, verify both Protection Groups are in the expected state:

kubectl --kubeconfig=cluster2.kubeconfig \
  get protectiongroups -n site-recovery \
  -o custom-columns='NAME:.metadata.name,DESIRED:.spec.desiredState,CURRENT:.status.currentState'

Expected output:

NAME  DESIRED   CURRENT
pg-a  running   running
pg-b  running   running

Scenario 5: Protection Group desiredState reconciliation (controller-level validation)

This validates that the Protection Group Controller correctly stops and starts all VMs when spec.desiredState is patched, independent of failover scripts.

# Manually set desiredState to stopped
kubectl patch protectiongroup pg-a -n site-recovery \
  --type=merge \
  -p '{"spec":{"desiredState":"stopped"}}'

# Watch status.currentState converge
kubectl get protectiongroup pg-a -n site-recovery -w

Expected progression:

NAME  DESIRED   CURRENT
pg-a  stopped   running
pg-a  stopped   mixed
pg-a  stopped   stopped

# Verify VMs are actually stopped
kubectl get vmi -n site-recovery

Expected: No VMIs running for VMs in pg-a.

# Restore desired state
kubectl patch protectiongroup pg-a -n site-recovery \
  --type=merge \
  -p '{"spec":{"desiredState":"running"}}'

# Confirm recovery
kubectl get protectiongroup pg-a -n site-recovery -o jsonpath='{.status.currentState}'

Expected output:

running

Scenario 6: Audit trail — querying completed operations

This validates that Failover CRs persist after completion and can be queried for audit purposes.

# List all Failover CRs including completed ones
kubectl get failovers -n site-recovery

Expected output (representative, your names will differ):

NAME                          PROTECTION-GROUP  PHASE      AGE
failover-pg-a-1730400000      pg-a              Succeeded  1h
failover-pg-b-1730403600      pg-b              Succeeded  30m
failover-pg-a-1730407200      pg-a              Failed     5m

# Get full detail on a specific operation
kubectl describe failover failover-pg-a-1730407200 -n site-recovery

The Status section will show the phase, message, and event history — use this to diagnose failed operations without needing log access.

# Same via pgctl
pgctl operations get failover-pg-a-1730407200 --namespace site-recovery
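
For a quick audit, you can filter the listing down to failed operations. The sketch below assumes the table has NAME, PROTECTION-GROUP, and PHASE columns, as in the representative output above, and that no column value contains spaces; failed_operations is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: read a `kubectl get failovers` table on stdin and print
# "<name> <protection-group>" for each row whose PHASE is Failed.

failed_operations() {
  awk '
    # Header row: remember the position of each column by its title.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["PHASE"]) == "Failed" { print $(col["NAME"]), $(col["PROTECTION-GROUP"]) }
  '
}

# Usage:
#   kubectl get failovers -n site-recovery | failed_operations
```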

Troubleshooting

Use the following reference for common failures you may encounter when running failover operations or inspecting the Operations API.


Symptom: pgctl operations list returns no results even though a failover was recently triggered

Likely cause: The Failover CR was created in a different namespace than the one you are querying, or the Failover Controller is not running and never processed the CR.

Fix:

# Search all namespaces
kubectl get failovers --all-namespaces

# Verify Failover Controller is healthy
kubectl get pods -n site-recovery -l app=failover-controller
kubectl logs -n site-recovery -l app=failover-controller --tail=50

Symptom: Failover is stuck in phase: Running and never advances to Succeeded or Failed

Likely cause: The Protection Group status.currentState is not converging to match spec.desiredState, which means the Protection Group Controller is blocked — often because one or more VMs cannot be stopped or started. Alternatively, the DRBD lock release is timing out.

Fix:

# Check Protection Group state
kubectl describe protectiongroup <pg-name> -n site-recovery

# Check Protection Group Controller logs
kubectl logs -n site-recovery -l app=protection-group-controller --tail=100

# Check individual VM state
kubectl get vmi -n site-recovery

If a VM is stuck terminating, investigate the VMI directly:

kubectl describe vmi <vm-name> -n site-recovery

Symptom: [ERROR] Failed to acquire failover lock for <pg-name> when no failover appears to be running

Likely cause: A previous failover process crashed without releasing its Kubernetes Lease. The lease has a 300-second TTL, so it will expire automatically. If you cannot wait, you can delete it manually.

Fix:

# Confirm the lock exists and check holder identity
kubectl get lease failover-lock-<pg-name> -n site-recovery -o yaml

# Delete the stale lock (only if you are certain no failover is in progress)
kubectl delete lease failover-lock-<pg-name> -n site-recovery
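
Before deleting a lock, it is worth confirming that no Failover CR for the group is still in progress. The following sketch makes that check scriptable. has_active_failover is a hypothetical helper, and treating Pending and Running as "in progress" is an assumption based on the phase values listed in the Configuration section.

```shell
#!/bin/sh
# Sketch: read "<protectionGroup> <phase>" lines on stdin and return 0 if any
# Failover for the named group is Pending or Running, 1 otherwise.

has_active_failover() {
  awk -v pg="$1" '
    $1 == pg && ($2 == "Pending" || $2 == "Running") { found = 1 }
    END { exit !found }
  '
}

# Usage:
#   if kubectl get failovers -n site-recovery --no-headers \
#        -o custom-columns='PG:.spec.protectionGroup,PHASE:.status.phase' \
#        | has_active_failover pg-a; then
#     echo "a failover for pg-a is still active; leave the lease in place"
#   else
#     kubectl delete lease failover-lock-pg-a -n site-recovery
#   fi
```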

Symptom: UNSAFE to remove taints safety check blocks failover even though you believe no other Protection Groups are running on the target cluster

Likely cause: A Protection Group on the target cluster has status.currentState: running but its VMs may actually be in a failed or unknown state. The safety check reads the status field of the ProtectionGroup custom resource, which may be stale if the Protection Group Controller was restarted recently.

Fix:

# Inspect all Protection Groups on the target cluster
kubectl --kubeconfig=<target-cluster>.kubeconfig \
  get protectiongroups -n site-recovery \
  -o custom-columns='NAME:.metadata.name,CURRENT:.status.currentState'

# If a PG is incorrectly reporting 'running', reconcile it manually
kubectl --kubeconfig=<target-cluster>.kubeconfig \
  patch protectiongroup <stale-pg-name> -n site-recovery \
  --type=merge \
  -p '{"spec":{"desiredState":"stopped"}}'

If you must proceed immediately and have confirmed the risk, use --force:

./intelligent-pg-failover.sh <pg-name> --to-<target> --force

Symptom: Failover CR reaches phase: Failed with message referencing split-brain detection

Likely cause: The Failover Controller detected that VMs in the Protection Group appear to be running on both clusters simultaneously (location: both). This is a critical safety stop.

Fix: Do not use --force to bypass this. Instead:

  1. Stop VMs on one cluster manually and confirm they are fully terminated before retrying:

kubectl patch protectiongroup <pg-name> -n site-recovery \
  --type=merge \
  -p '{"spec":{"desiredState":"stopped"}}'

kubectl get protectiongroup <pg-name> -n site-recovery -o jsonpath='{.status.currentState}'

  2. Verify no VMIs remain on the cluster you are stopping:

kubectl get vmi -n site-recovery

  3. Only after confirming a single authoritative copy of each VM, retry the failover.
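
To confirm where a group is active before retrying, you can compare its status.currentState on both clusters. The sketch below classifies the pair of states. Only the value both appears on this page; the labels cluster1, cluster2, and none, the helper names, and the choice to count mixed as active (the conservative reading) are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: classify where a Protection Group is active, given its
# status.currentState on cluster 1 and on cluster 2.

is_active() {
  # Treat a partially running group as active to stay on the safe side.
  [ "$1" = "running" ] || [ "$1" = "mixed" ]
}

pg_location() {
  if is_active "$1" && is_active "$2"; then
    echo both
  elif is_active "$1"; then
    echo cluster1
  elif is_active "$2"; then
    echo cluster2
  else
    echo none
  fi
}

# Usage:
#   s1=$(kubectl --kubeconfig=<cluster1>.kubeconfig get protectiongroup <pg-name> \
#     -n site-recovery -o jsonpath='{.status.currentState}')
#   s2=$(kubectl --kubeconfig=<cluster2>.kubeconfig get protectiongroup <pg-name> \
#     -n site-recovery -o jsonpath='{.status.currentState}')
#   pg_location "$s1" "$s2"
```

Retry the failover only when the result is a single cluster, never while it is both.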

Symptom: pgctl returns authentication or connection errors against the quorum cluster

Likely cause: The Site Manager UI was not successfully deployed via Helm to the quorum cluster, or your pgctl configuration is pointing at the wrong endpoint.

Fix:

# Verify Site Manager is running on quorum cluster
kubectl --kubeconfig=quorum-cluster.kubeconfig get pods -n site-manager

# Re-check pgctl configuration
pgctl config view

Note: For LINSTOR deployments, the quorum cluster is required and hosts the LINSTOR controller and failover controllers. For DRBD Operator deployments, the quorum cluster is optional but recommended for management plane isolation. Confirm which model your deployment uses before troubleshooting connectivity.