Operations API
List active/completed operations, get operation details
The Operations API provides visibility into active and completed failover operations across your Site Recovery deployment. Use it to monitor in-flight failovers, audit historical operations, and retrieve detailed status for individual Failover custom resources. These capabilities are central to validating DR readiness and diagnosing issues in production. This page documents test scenarios that validate the pgctl CLI and Failover Controller behavior, so you can confirm your deployment behaves correctly before relying on it in an actual disaster recovery event.
Before running these test scenarios, ensure you have:
- Access to all relevant clusters (primary, DR, and quorum if applicable) with valid kubeconfigs
- pgctl installed and configured with credentials for each cluster
- Protection Group CRDs applied: protection-group-crd.yaml and failover-crd.yaml
- Protection Group Controller and Failover Controller deployments running and healthy
- At least one Protection Group defined with two or more KubeVirt VMs assigned
- kubectl available and context-switched to the appropriate cluster for each step
- For LINSTOR deployments: three clusters (primary, DR, quorum) confirmed reachable
- For DRBD Operator deployments: two clusters minimum (primary, DR); quorum cluster optional
- DRBD replication confirmed active between primary and DR cluster nodes before testing
- Sufficient RBAC permissions to create and patch protectiongroups and failovers custom resources, and to read/write Kubernetes Leases in the coordination.k8s.io/v1 API group
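The RBAC prerequisite can be checked before you start with `kubectl auth can-i`. The following is a minimal sketch, not part of the product: the helper name `check_rbac` is hypothetical, the resource names are derived from the CRD names listed on this page, and the Lease verbs (`get`, `create`, `update`) are an assumption about what "read/write" requires in your deployment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the current kubeconfig context holds
# the permissions the test scenarios rely on.
# Assumption: resource names match the CRDs listed on this page, and
# "read/write" on Leases means get, create, and update.

check_rbac() {
  local ns="${1:-site-recovery}" missing=0 verb resource
  while read -r verb resource; do
    # kubectl auth can-i exits 0 when the action is allowed, non-zero otherwise
    if kubectl auth can-i "$verb" "$resource" -n "$ns" </dev/null >/dev/null 2>&1; then
      echo "ok      $verb $resource"
    else
      echo "MISSING $verb $resource"
      missing=1
    fi
  done <<'EOF'
create protectiongroups.siterecovery.trilio.io
patch protectiongroups.siterecovery.trilio.io
create failovers.siterecovery.trilio.io
patch failovers.siterecovery.trilio.io
get leases.coordination.k8s.io
create leases.coordination.k8s.io
update leases.coordination.k8s.io
EOF
  return "$missing"
}

# Example: check_rbac site-recovery
```

Run it once per cluster context. A non-zero return code means at least one permission is missing.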
The Operations API is exposed through the Failover Controller and is accessed via pgctl or directly via kubectl against the Failover CRD. No separate installation is required beyond the standard Site Recovery controller deployment.
Step 1 — Verify the Failover Controller is running
kubectl get deployment failover-controller -n site-recovery
Expected output:
NAME READY UP-TO-DATE AVAILABLE AGE
failover-controller 1/1 1 1 5d
Step 2 — Verify the Protection Group Controller is running
kubectl get deployment protection-group-controller -n site-recovery
Step 3 — Confirm CRDs are registered
kubectl get crd | grep siterecovery.trilio.io
Expected output should include both:
failovers.siterecovery.trilio.io
protectiongroups.siterecovery.trilio.io
Step 4 — Confirm pgctl can reach your clusters
pgctl status --namespace site-recovery
If pgctl returns connection errors, verify your kubeconfig contexts and that the Site Manager UI has been deployed to the quorum cluster via Helm.
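Steps 1 through 3 can be combined into a single preflight check. This is a sketch under stated assumptions: the function name `preflight` and the 60-second default timeout are hypothetical, and it uses only the deployment and CRD names shown above together with standard `kubectl rollout status` and `kubectl wait` commands.

```shell
#!/usr/bin/env bash
# Hypothetical preflight wrapper around Steps 1-3.
# Assumption: deployment and CRD names are exactly those shown on this page.

preflight() {
  local ns="${1:-site-recovery}" timeout="${2:-60s}" failed=0 item
  # Both controllers must have finished rolling out
  for item in deployment/failover-controller deployment/protection-group-controller; do
    kubectl rollout status "$item" -n "$ns" --timeout="$timeout" >/dev/null 2>&1 \
      || { echo "NOT READY: $item"; failed=1; }
  done
  # Both CRDs must be registered and established in the API server
  for item in failovers.siterecovery.trilio.io protectiongroups.siterecovery.trilio.io; do
    kubectl wait --for=condition=Established "crd/$item" --timeout="$timeout" >/dev/null 2>&1 \
      || { echo "NOT ESTABLISHED: crd/$item"; failed=1; }
  done
  [ "$failed" -eq 0 ] && echo "preflight passed"
  return "$failed"
}

# Example: preflight site-recovery 120s
```

The function reports every failing item rather than stopping at the first one, so a single run shows everything that needs attention.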
The following fields on the Failover custom resource control how the Failover Controller processes operations. Understanding these options is important for both normal failover execution and for the test scenarios documented below.
| Field | Default | Valid Values | Effect |
|---|---|---|---|
| spec.protectionGroup | (required) | Any valid Protection Group name | Identifies which Protection Group this failover operation targets |
| spec.targetCluster | (required) | Cluster identifier string | The cluster to which VMs will be failed over |
| spec.force | false | true, false | When true, bypasses the concurrent-safety check that blocks taint removal if other Protection Groups have running VMs on the target cluster. Use only in emergencies and only after coordinating with other operators. |
| status.phase | (set by controller) | Pending, Running, Succeeded, Failed | Reflects the current phase of the failover operation as updated by the Failover Controller |
| status.message | (set by controller) | Free-form string | Human-readable description of current progress or failure reason |
The ProtectionGroup resource has two fields directly relevant to operation monitoring:
| Field | Default | Valid Values | Effect |
|---|---|---|---|
| spec.desiredState | running | running, stopped | Drives the Protection Group Controller to start or stop all VMs in the group atomically. The Failover Controller patches this field rather than touching VMs directly. |
| status.currentState | (set by controller) | running, stopped, mixed, unknown | Reflects the actual observed state of all VMs in the group. The Failover Controller waits for this to match spec.desiredState before proceeding. |
The Failover Controller uses Kubernetes Leases (coordination.k8s.io/v1) to enforce per-Protection Group locking. Each lock is named failover-lock-{protection-group-name} in the same namespace as the Protection Group, and expires after 300 seconds.
You interact with the Operations API primarily through pgctl for high-level management and through kubectl for low-level inspection of Failover CRs and Protection Group status.
Listing active operations
Use pgctl to list all in-flight and recently completed failover operations:
pgctl operations list --namespace site-recovery
To filter by Protection Group:
pgctl operations list --namespace site-recovery --protection-group production-protection-group
Getting details for a specific operation
Failover operations are represented as Failover custom resources. You can inspect them directly:
kubectl get failover -n site-recovery
kubectl describe failover <failover-name> -n site-recovery
Monitoring Protection Group state transitions
During a failover, the Failover Controller patches spec.desiredState on the Protection Group and waits for status.currentState to converge. You can watch this in real time:
kubectl get protectiongroup production-protection-group -n site-recovery -w
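For scripted checks, where an interactive watch is not practical, you can block until status.currentState reaches a given value. A minimal sketch, assuming kubectl 1.23 or later (for `kubectl wait --for=jsonpath`); the function name `wait_for_pg_state` and the 300-second default timeout are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until a Protection Group reports the wanted
# status.currentState, or fail when the timeout elapses.
# Assumption: kubectl >= 1.23, which supports --for=jsonpath.

wait_for_pg_state() {
  local pg="$1" want="$2" ns="${3:-site-recovery}" timeout="${4:-300s}"
  kubectl wait "protectiongroup/$pg" -n "$ns" \
    --for=jsonpath='{.status.currentState}'="$want" --timeout="$timeout"
}

# Example: wait_for_pg_state production-protection-group running
```

The return code is that of `kubectl wait`: zero once the state matches, non-zero on timeout or error.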
Checking the failover lock
If a failover appears stuck or you need to confirm whether a lock is held, inspect the Kubernetes Lease:
kubectl get lease failover-lock-production-protection-group -n site-recovery -o yaml
The spec.holderIdentity field shows which process holds the lock. The lock expires automatically after 300 seconds if the holder crashes.
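To estimate how long a held lock has left before it expires, compare the lease's last renewal time with the 300-second duration. The sketch below is illustrative only: `lease_seconds_left` is a hypothetical helper that does the arithmetic on epoch seconds, and the wiring shown in the comments assumes the controller populates the standard spec.renewTime field and that GNU `date` is available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: seconds remaining before a lease expires.
# Arguments: renewal time (epoch seconds), lease duration (seconds),
# current time (epoch seconds). Never returns a negative value.

lease_seconds_left() {
  local renew_epoch="$1" duration="$2" now_epoch="$3" left
  left=$(( renew_epoch + duration - now_epoch ))
  if [ "$left" -lt 0 ]; then
    left=0
  fi
  echo "$left"
}

# Example wiring (assumes spec.renewTime is set and GNU date is installed):
#   renew=$(kubectl get lease failover-lock-production-protection-group \
#             -n site-recovery -o jsonpath='{.spec.renewTime}')
#   lease_seconds_left "$(date -d "$renew" +%s)" 300 "$(date +%s)"
```

A result of 0 means the lease has already passed its expiry window.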
Triggering a failover via CRD (recommended for production)
Create a Failover CR to trigger a Kubernetes-native, auditable failover:
kubectl apply -f - <<EOF
apiVersion: siterecovery.trilio.io/v1alpha1
kind: Failover
metadata:
  name: failover-production-pg-to-dr
  namespace: site-recovery
spec:
  protectionGroup: production-protection-group
  targetCluster: dr-cluster
  force: false
EOF
Alternatively, use the CRD-based shell script:
./intelligent-pg-failover-crd.sh production-protection-group --to-cluster2
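After creating a Failover CR, you typically want to wait until status.phase reaches Succeeded or Failed. The following polling loop is a sketch, not the controller's or script's actual monitoring logic: `wait_for_failover` is a hypothetical name, and the retry count and interval defaults are arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a Failover CR until it reaches a terminal phase.
# Returns 0 on Succeeded, 1 on Failed, 2 if no terminal phase is seen in time.

wait_for_failover() {
  local name="$1" ns="${2:-site-recovery}" tries="${3:-60}" interval="${4:-5}" phase
  while [ "$tries" -gt 0 ]; do
    phase="$(kubectl get failover "$name" -n "$ns" -o jsonpath='{.status.phase}' 2>/dev/null)"
    case "$phase" in
      Succeeded)
        echo "Succeeded"
        return 0 ;;
      Failed)
        # Surface the controller's status.message alongside the failure
        echo "Failed: $(kubectl get failover "$name" -n "$ns" -o jsonpath='{.status.message}' 2>/dev/null)"
        return 1 ;;
    esac
    tries=$(( tries - 1 ))
    sleep "$interval"
  done
  echo "Timed out waiting for $name (last phase: ${phase:-unknown})"
  return 2
}

# Example: wait_for_failover failover-production-pg-to-dr site-recovery 120 5
```

Pending and Running are treated as non-terminal, matching the status.phase values listed in the Failover field table.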
The following scenarios validate the Operations API, pgctl, and Failover Controller behavior end-to-end. Run these in order, as later scenarios build on earlier ones.
Scenario 1: Single Protection Group failover (baseline happy path)
This is the simplest validation. One Protection Group, one failover, no concurrent operations.
# Confirm PG is running on primary cluster
kubectl get protectiongroup pg-a -n site-recovery -o jsonpath='{.status.currentState}'
Expected output:
running
# Trigger failover to DR cluster
./intelligent-pg-failover-crd.sh pg-a --to-cluster2
Expected output:
[INFO] Validating authentication...
[INFO] Detecting VM locations for pg-a...
[INFO] Creating Failover CR: failover-pg-a-<timestamp>
[INFO] Monitoring Failover CR status...
[INFO] Phase: Running
[INFO] Phase: Succeeded
[SUCCESS] Failover of pg-a to cluster2 completed successfully
# Verify PG state on DR cluster
kubectl --kubeconfig=dr-cluster.kubeconfig \
get protectiongroup pg-a -n site-recovery \
-o jsonpath='{.status.currentState}'
Expected output:
running
# Verify via pgctl
pgctl operations list --namespace site-recovery --protection-group pg-a
Expected output (representative):
NAME PROTECTION-GROUP TARGET PHASE AGE
failover-pg-a-1730400000 pg-a cluster2 Succeeded 2m
Scenario 2: Concurrent failover of the same Protection Group (lock enforcement)
This validates that the per-Protection Group Kubernetes Lease prevents two simultaneous failovers of the same group.
# Terminal 1: Start a failover
./intelligent-pg-failover-crd.sh pg-a --to-cluster2
While Terminal 1 is running, in a second terminal:
# Terminal 2: Attempt a concurrent failover of the same PG
./intelligent-pg-failover-crd.sh pg-a --to-cluster2
Expected output in Terminal 2:
[ERROR] Failed to acquire failover lock for pg-a
Failover lock is held by: failover-controller-12345
[ERROR] Failover aborted — another failover is already in progress for this Protection Group
Verify the lock exists:
kubectl get lease failover-lock-pg-a -n site-recovery -o yaml
Expected: The lease shows a holderIdentity corresponding to the first failover process. Terminal 1 should complete successfully.
Scenario 3: Concurrent failover of different Protection Groups to the same target cluster (safety check enforcement)
This validates that the concurrent-safety check blocks taint removal when another Protection Group has running VMs on the target cluster.
# Setup: pg-a and pg-b both running on cluster1
kubectl get protectiongroup pg-a -n site-recovery -o jsonpath='{.status.currentState}'
# Expected: running
kubectl get protectiongroup pg-b -n site-recovery -o jsonpath='{.status.currentState}'
# Expected: running
# Terminal 1: Failover pg-a to cluster2 (starts first)
./intelligent-pg-failover.sh pg-a --to-cluster2
While Terminal 1 is running (pg-a VMs now starting on cluster2), in Terminal 2:
# Terminal 2: Attempt to failover pg-b to the same target
./intelligent-pg-failover.sh pg-b --to-cluster2
Expected output in Terminal 2:
⚠ SAFETY CHECK: Other Protection Groups have running VMs on cluster2:
- pg-a
UNSAFE to remove taints on cluster2 - other Protection Groups have running VMs
This could cause scheduling issues for those VMs
To override this safety check, use --force flag (NOT RECOMMENDED)
[ERROR] Failover aborted
This is the correct behavior. Wait for Terminal 1 (pg-a failover) to complete, then retry pg-b.
Scenario 4: Force override in emergency (safety bypass)
This validates that --force allows an operator to override the concurrent-safety check when they have accepted the risk. Use this only in controlled conditions.
# pg-a is already running on cluster2
# Emergency: you must also failover pg-b to cluster2 immediately
./intelligent-pg-failover.sh pg-b --to-cluster2 --force
Expected output:
⚠️ WARNING: Force mode enabled - safety checks bypassed
⚠️ Using --force can disrupt other Protection Groups
[INFO] Removing quorum taints on cluster2 (force mode)...
[INFO] Phase: Running
[INFO] Phase: Succeeded
[SUCCESS] Failover of pg-b to cluster2 completed (force mode)
After the failover, verify both Protection Groups are in the expected state:
kubectl --kubeconfig=cluster2.kubeconfig \
get protectiongroups -n site-recovery \
-o custom-columns='NAME:.metadata.name,DESIRED:.spec.desiredState,CURRENT:.status.currentState'
Expected output:
NAME DESIRED CURRENT
pg-a running running
pg-b running running
Scenario 5: Protection Group desiredState reconciliation (controller-level validation)
This validates that the Protection Group Controller correctly stops and starts all VMs when spec.desiredState is patched, independent of failover scripts.
# Manually set desiredState to stopped
kubectl patch protectiongroup pg-a -n site-recovery \
--type=merge \
-p '{"spec":{"desiredState":"stopped"}}'
# Watch status.currentState converge
kubectl get protectiongroup pg-a -n site-recovery -w
Expected progression:
NAME DESIRED CURRENT
pg-a stopped running
pg-a stopped mixed
pg-a stopped stopped
# Verify VMs are actually stopped
kubectl get vmi -n site-recovery
Expected: No VMIs running for VMs in pg-a.
# Restore desired state
kubectl patch protectiongroup pg-a -n site-recovery \
--type=merge \
-p '{"spec":{"desiredState":"running"}}'
# Confirm recovery
kubectl get protectiongroup pg-a -n site-recovery -o jsonpath='{.status.currentState}'
Expected output:
running
Scenario 6: Audit trail — querying completed operations
This validates that Failover CRs persist after completion and can be queried for audit purposes.
# List all Failover CRs including completed ones
kubectl get failovers -n site-recovery
Expected output (representative; your names will differ):
NAME PROTECTION-GROUP PHASE AGE
failover-pg-a-1730400000 pg-a Succeeded 1h
failover-pg-b-1730403600 pg-b Succeeded 30m
failover-pg-a-1730407200 pg-a Failed 5m
# Get full detail on a specific operation
kubectl describe failover failover-pg-a-1730407200 -n site-recovery
The output shows the phase and message under Status, followed by any events recorded for the resource. Use it to diagnose failed operations without needing log access.
# Same via pgctl
pgctl operations get failover-pg-a-1730407200 --namespace site-recovery
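For audits it is often useful to list only the operations in a given phase, for example every Failed failover with its message. A minimal sketch using plain kubectl and awk: the function name `failovers_in_phase` is hypothetical, and the field paths are the ones documented in the Failover field table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print Failover CRs in a given phase, oldest first, as
# tab-separated name, protection group, phase, and message.

failovers_in_phase() {
  local phase="$1" ns="${2:-site-recovery}"
  kubectl get failovers -n "$ns" \
    --sort-by=.metadata.creationTimestamp \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.protectionGroup}{"\t"}{.status.phase}{"\t"}{.status.message}{"\n"}{end}' \
    | awk -F '\t' -v want="$phase" '$3 == want'
}

# Example: failovers_in_phase Failed
```

Filtering happens in awk rather than in the JSONPath expression, which keeps the quoting simple and makes it easy to add further conditions.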
Use the following reference for common failures you may encounter when running failover operations or inspecting the Operations API.
Symptom: pgctl operations list returns no results even though a failover was recently triggered
Likely cause: The Failover CR was created in a different namespace than the one you are querying, or the Failover Controller is not running and never processed the CR.
Fix:
# Search all namespaces
kubectl get failovers --all-namespaces
# Verify Failover Controller is healthy
kubectl get pods -n site-recovery -l app=failover-controller
kubectl logs -n site-recovery -l app=failover-controller --tail=50
Symptom: Failover is stuck in phase: Running and never advances to Succeeded or Failed
Likely cause: The Protection Group status.currentState is not converging to match spec.desiredState, which means the Protection Group Controller is blocked — often because one or more VMs cannot be stopped or started. Alternatively, the DRBD lock release is timing out.
Fix:
# Check Protection Group state
kubectl describe protectiongroup <pg-name> -n site-recovery
# Check Protection Group Controller logs
kubectl logs -n site-recovery -l app=protection-group-controller --tail=100
# Check individual VM state
kubectl get vmi -n site-recovery
If a VM is stuck terminating, investigate the VMI directly:
kubectl describe vmi <vm-name> -n site-recovery
Symptom: [ERROR] Failed to acquire failover lock for <pg-name> when no failover appears to be running
Likely cause: A previous failover process crashed without releasing its Kubernetes Lease. The lease has a 300-second TTL, so it will expire automatically. If you cannot wait, you can delete it manually.
Fix:
# Confirm the lock exists and check holder identity
kubectl get lease failover-lock-<pg-name> -n site-recovery -o yaml
# Delete the stale lock (only if you are certain no failover is in progress)
kubectl delete lease failover-lock-<pg-name> -n site-recovery
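The "only if you are certain" condition can be made explicit by refusing to delete the lease while any Failover CR for that Protection Group is still Pending or Running. This is a sketch of one possible guard, not a supported tool: `delete_stale_lock` is a hypothetical name, and a CR left in Running by a crashed controller will still block it, in which case the decision remains a manual one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete a failover lock only when no Failover CR for the
# Protection Group is in a non-terminal phase.
# Returns 1 if failovers are still in progress, 2 if they could not be listed.

delete_stale_lock() {
  local pg="$1" ns="${2:-site-recovery}" list active
  list="$(kubectl get failovers -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.protectionGroup}{"\t"}{.status.phase}{"\n"}{end}')" \
    || { echo "Could not list Failover CRs; leaving lock in place"; return 2; }
  # Keep only this Protection Group's failovers that have not finished
  active="$(printf '%s\n' "$list" \
    | awk -F '\t' -v pg="$pg" '$2 == pg && ($3 == "Pending" || $3 == "Running") { print $1 }')"
  if [ -n "$active" ]; then
    echo "Refusing to delete lock; failovers still in progress for $pg:"
    echo "$active"
    return 1
  fi
  kubectl delete lease "failover-lock-$pg" -n "$ns"
}

# Example: delete_stale_lock pg-a site-recovery
```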
Symptom: UNSAFE to remove taints safety check blocks failover even though you believe no other Protection Groups are running on the target cluster
Likely cause: A Protection Group on the target cluster has status.currentState: running but its VMs may actually be in a failed or unknown state. The safety check reads the CRD status field, which may be stale if the Protection Group Controller was restarted recently.
Fix:
# Inspect all Protection Groups on the target cluster
kubectl --kubeconfig=<target-cluster>.kubeconfig \
get protectiongroups -n site-recovery \
-o custom-columns='NAME:.metadata.name,CURRENT:.status.currentState'
# If a PG is incorrectly reporting 'running', reconcile it manually
kubectl --kubeconfig=<target-cluster>.kubeconfig \
patch protectiongroup <stale-pg-name> -n site-recovery \
--type=merge \
-p '{"spec":{"desiredState":"stopped"}}'
If you must proceed immediately and have confirmed the risk, use --force:
./intelligent-pg-failover.sh <pg-name> --to-<target> --force
Symptom: Failover CR reaches phase: Failed with message referencing split-brain detection
Likely cause: The Failover Controller detected that VMs in the Protection Group appear to be running on both clusters simultaneously (location: both). This is a critical safety stop.
Fix: Do not use --force to bypass this. Instead:
- Stop VMs on one cluster manually and confirm they are fully terminated before retrying:
kubectl patch protectiongroup <pg-name> -n site-recovery \
--type=merge \
-p '{"spec":{"desiredState":"stopped"}}'
kubectl get protectiongroup <pg-name> -n site-recovery -o jsonpath='{.status.currentState}'
- Verify no VMIs remain on the cluster you are stopping:
kubectl get vmi -n site-recovery
- Only after confirming a single authoritative copy of each VM, retry the failover.
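To confirm that no VM is still present on both clusters before retrying, you can compare the VMI lists from each side. A minimal sketch: `vmis_on_both` is a hypothetical helper, the kubeconfig file names are placeholders, and it compares every VMI in the namespace rather than only those belonging to one Protection Group.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print VMIs that exist in the same namespace on two
# clusters. Returns 0 if any overlap is found, 1 if none, 2 on kubectl errors.

vmis_on_both() {
  local kc_a="$1" kc_b="$2" ns="${3:-site-recovery}" list_a list_b
  list_a="$(kubectl --kubeconfig="$kc_a" get vmi -n "$ns" -o name)" || return 2
  list_b="$(kubectl --kubeconfig="$kc_b" get vmi -n "$ns" -o name)" || return 2
  # Names are unique within each list, so a duplicate after merging means the
  # VMI is present on both clusters. grep . drops blank lines and sets the
  # exit code (1 when nothing overlaps).
  printf '%s\n%s\n' "$list_a" "$list_b" | sort | uniq -d | grep .
}

# Example: vmis_on_both primary-cluster.kubeconfig dr-cluster.kubeconfig
```

Any name printed by this check is a VM that still needs to be fully stopped on one side before the failover is retried.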
Symptom: pgctl returns authentication or connection errors against the quorum cluster
Likely cause: The Site Manager UI was not successfully deployed via Helm to the quorum cluster, or your pgctl configuration is pointing at the wrong endpoint.
Fix:
# Verify Site Manager is running on quorum cluster
kubectl --kubeconfig=quorum-cluster.kubeconfig get pods -n site-manager
# Re-check pgctl configuration
pgctl config view
Note: For LINSTOR deployments, the quorum cluster is required and hosts the LINSTOR controller and failover controllers. For DRBD Operator deployments, the quorum cluster is optional but recommended for management plane isolation. Confirm which model your deployment uses before troubleshooting connectivity.