Operations Page
Monitoring active and completed operations
The Operations page lets you monitor the status of active and completed failover operations across your Site Recovery deployment. You can track Protection Group state transitions, inspect Failover CR lifecycle events, and validate the safety mechanisms that govern concurrent failover behavior. Understanding how operations are tracked is critical during DR exercises and production failover events, where visibility into controller decisions — including safety check outcomes and lock acquisition — determines how quickly you can diagnose and respond to issues.
Before using the operations monitoring workflows described on this page, ensure the following are in place:
- Kubernetes clusters: Primary and DR clusters deployed and reachable. For LINSTOR deployments, the quorum cluster must also be operational.
- CRDs applied: Both protection-group-crd.yaml and failover-crd.yaml must be present on all relevant clusters.
- Controllers running:
  - protection-group-controller deployment is healthy on primary and DR clusters.
  - failover-controller deployment is healthy on the target cluster (DR cluster for failover, primary cluster for failback).
- pgctl installed and configured with kubeconfigs for all clusters you intend to monitor.
- kubectl access with sufficient RBAC to read protectiongroups, failovers, and leases resources in the target namespace.
- Site Manager UI deployed to the quorum cluster via Helm (for LINSTOR model) or to your designated management cluster (for DRBD Operator model), if you prefer a graphical view of operations.
The operations monitoring capabilities are delivered by the same controllers and CRDs you deploy during initial infrastructure setup — no additional installation is required. If you are setting up monitoring access for the first time, verify that the required resources are in place.
Step 1: Confirm the Protection Group CRD is installed on each cluster
kubectl get crd protectiongroups.siterecovery.trilio.io
Expected output:
NAME CREATED AT
protectiongroups.siterecovery.trilio.io 2025-01-15T10:00:00Z
Step 2: Confirm the Failover CRD is installed on each cluster
kubectl get crd failovers.siterecovery.trilio.io
Step 3: Verify controllers are running
Run the following on each cluster where a controller is expected:
# Check Protection Group controller
kubectl get deployment protection-group-controller -n <namespace>
# Check Failover controller
kubectl get deployment failover-controller -n <namespace>
Both deployments should show AVAILABLE equal to their DESIRED replica count.
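The replica comparison can also be scripted. The following is a minimal sketch, not part of the product tooling: the function name deployment_ready is hypothetical, and it relies only on the standard Deployment fields .spec.replicas and .status.availableReplicas.

```shell
# deployment_ready <deployment> <namespace>
# Succeeds when the available replica count equals the desired count.
deployment_ready() {
  counts=$(kubectl get deployment "$1" -n "$2" \
    -o jsonpath='{.spec.replicas} {.status.availableReplicas}') || return 2
  desired=${counts%% *}
  available=${counts#* }
  # Kubernetes omits availableReplicas from status when it is zero.
  [ -n "$desired" ] && [ "$desired" = "${available:-0}" ]
}
```

For example, `deployment_ready failover-controller <namespace> && echo healthy` prints healthy only when every desired replica is available.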
Step 4: Restart controllers if CRDs were recently updated
If you applied updated CRD definitions (for example, adding desiredState or currentState fields), restart the controllers to pick up the schema changes:
kubectl rollout restart deployment protection-group-controller -n <namespace>
kubectl rollout restart deployment failover-controller -n <namespace>
Step 5: Verify pgctl can reach all clusters
pgctl status --all-clusters
This command confirms that pgctl can authenticate and list Protection Groups across your primary, DR, and (where applicable) quorum clusters.
Operations monitoring behavior is governed by fields in the ProtectionGroup and Failover CRDs, as well as controller-level settings. The key configuration options are described below.
ProtectionGroup CRD — spec.desiredState
| Field | Type | Default | Valid Values | Effect |
|---|---|---|---|---|
| spec.desiredState | string | running | running, stopped | Instructs the Protection Group controller to start or stop all VMs in the group atomically. The controller watches for changes to this field and reconciles VMs accordingly. |
| status.currentState | string | unknown | running, stopped, mixed, unknown | Reflects the actual observed state of all VMs in the group. Set by the Protection Group controller after each reconciliation. A value of mixed indicates VMs are in different states and may require investigation. |
The separation between desiredState and currentState is intentional: it gives you a clear signal of whether the controller has finished reconciling. During a failover, you should expect currentState to lag behind desiredState briefly while VMs are being stopped or started.
Failover CRD — spec.force
| Field | Type | Default | Effect |
|---|---|---|---|
| spec.force | boolean | false | When true, bypasses all concurrent-safety checks before removing DRBD quorum taints. Use only in emergency scenarios where you have confirmed that disrupting other Protection Groups on the target cluster is acceptable. |
Failover Lock (Kubernetes Lease)
The Failover controller creates a Kubernetes Lease object in coordination.k8s.io/v1 for each active failover operation. This is not a field you configure directly, but you can inspect it:
| Property | Value |
|---|---|
| Lease name | failover-lock-<protection-group-name> |
| Namespace | Same as the Protection Group |
| Holder identity | failover-controller-<pid> |
| Lease duration | 300 seconds (auto-expires if the controller crashes) |
This lock prevents two concurrent failover operations from running against the same Protection Group simultaneously. It does not block failovers of different Protection Groups.
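To estimate how long a stale lock will persist, the expiry arithmetic can be sketched as below. This assumes the controller treats the lock as expired once the lease duration has elapsed since the timestamp recorded on the Lease (for example, its acquireTime); the function name lease_seconds_left is hypothetical, and all times are Unix epoch seconds.

```shell
# lease_seconds_left <start-epoch> <duration-seconds> <now-epoch>
# Prints the seconds remaining before the lease lapses (0 if already lapsed).
lease_seconds_left() {
  left=$(( $1 + $2 - $3 ))
  [ "$left" -gt 0 ] || left=0
  echo "$left"
}
```

With a 300-second lease that started 120 seconds ago, `lease_seconds_left 1000 300 1120` prints 180.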
Concurrent Safety Check Behavior
Before removing DRBD quorum taints from the target cluster, the Failover controller checks whether any other Protection Groups in the same namespace have a status.currentState of running on that cluster. If conflicts are found, the operation aborts unless spec.force: true is set. This check is scoped to the same namespace — Protection Groups in other namespaces are not currently evaluated.
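You can approximate this check from the command line before starting a failover. The sketch below is a client-side illustration, not the controller's implementation: the function name running_peers is hypothetical, and it assumes your current kubeconfig context (or the KUBECONFIG environment variable) points at the target cluster.

```shell
# running_peers <namespace> <pg-being-failed-over>
# Prints the names of other Protection Groups in the namespace whose
# status.currentState is "running".
running_peers() {
  kubectl get protectiongroups -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.currentState}{"\n"}{end}' |
    awk -v self="$2" '$2 == "running" && $1 != self { print $1 }'
}
```

Empty output suggests the safety check would find no conflicts in that namespace; any names printed are the Protection Groups the controller is likely to report.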
Checking the current state of a Protection Group
Use pgctl to inspect the desired and current state of any Protection Group:
pgctl get protection-group <pg-name> -n <namespace>
You can also use kubectl directly if you prefer:
kubectl get protectiongroup <pg-name> -n <namespace> -o yaml
Look at spec.desiredState and status.currentState together. If they differ, the Protection Group controller is still reconciling.
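If you want a script to block until reconciliation finishes, a polling loop along these lines may help. It is a sketch under stated assumptions: the function name wait_reconciled, the 5-second poll interval, and the 120-second default timeout are illustrative choices, not product defaults.

```shell
# wait_reconciled <pg-name> <namespace> [timeout-seconds]
# Polls until status.currentState equals spec.desiredState.
wait_reconciled() {
  limit=${3:-120}
  waited=0
  while :; do
    states=$(kubectl get protectiongroup "$1" -n "$2" \
      -o jsonpath='{.spec.desiredState} {.status.currentState}') || return 2
    desired=${states%% *}
    current=${states#* }
    if [ -n "$desired" ] && [ "$desired" = "$current" ]; then
      echo "reconciled: $desired"
      return 0
    fi
    # Give up once the timeout has been reached.
    [ "$waited" -lt "$limit" ] || break
    sleep 5
    waited=$((waited + 5))
  done
  echo "timed out: desired=$desired current=$current" >&2
  return 1
}
```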
Watching a failover operation in progress
To follow a Failover CR as it progresses through its lifecycle:
kubectl get failover <failover-name> -n <namespace> -w
Alternatively, watch the controller logs for real-time decision output:
kubectl logs -f deployment/failover-controller -n <namespace>
Checking whether a failover lock is held
Before initiating a failover, confirm no lock is already held for your Protection Group:
kubectl get lease failover-lock-<pg-name> -n <namespace>
If the lease exists, another failover operation is in progress or a previous one did not release its lock cleanly. Check the holderIdentity field to identify the process.
Listing all active and completed failover operations
pgctl list failovers -n <namespace>
Or with kubectl:
kubectl get failovers -n <namespace>
Completed Failover CRs persist after the operation finishes, giving you a full audit trail.
Initiating a monitored failover with pgctl
pgctl failover <pg-name> --to-cluster <dr-cluster-name> -n <namespace>
Add --force only in emergency scenarios where you have confirmed it is safe to bypass the concurrent-safety checks:
pgctl failover <pg-name> --to-cluster <dr-cluster-name> -n <namespace> --force
Warning: --force bypasses all safety checks protecting other running Protection Groups on the target cluster. Coordinate with all operators before using this flag.
Example 1: Verifying Protection Group state before a planned failover
Before initiating a DR exercise, confirm that the Protection Group is healthy and fully running on the primary cluster.
kubectl get protectiongroup production-protection-group -n dr-system -o yaml
Expected output (abbreviated):
apiVersion: siterecovery.trilio.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: production-protection-group
  namespace: dr-system
spec:
  desiredState: running
  virtualMachines:
    - name: prod-vm-1
    - name: prod-vm-2
status:
  currentState: running
desiredState and currentState both showing running confirms the Protection Group controller has fully reconciled and all VMs are active.
Example 2: Initiating a failover and tracking its progress
Start the failover using pgctl, then watch the resulting Failover CR.
pgctl failover production-protection-group --to-cluster dr-cluster -n dr-system
Then watch the Failover CR status:
kubectl get failover -n dr-system -w
Expected output progression:
NAME STATUS AGE
production-protection-group-failover Pending 2s
production-protection-group-failover InProgress 5s
production-protection-group-failover InProgress 30s
production-protection-group-failover Completed 87s
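For unattended DR exercises, the watch can be replaced by a loop that waits for the Completed status. This sketch assumes the NAME, STATUS, AGE column layout shown in the output above; the function names are hypothetical. Because this page does not list any failure statuses, the loop recognizes only Completed and otherwise runs until its timeout.

```shell
# failover_status <failover-name> <namespace>
# Prints the STATUS column for one Failover CR.
failover_status() {
  kubectl get failovers -n "$2" --no-headers |
    awk -v name="$1" '$1 == name { print $2 }'
}

# wait_failover <failover-name> <namespace> [timeout-seconds]
# Succeeds once the Failover CR reports Completed; fails on timeout.
wait_failover() {
  limit=${3:-600}
  waited=0
  while :; do
    status=$(failover_status "$1" "$2")
    [ "$status" = "Completed" ] && return 0
    [ "$waited" -lt "$limit" ] || return 1
    sleep 5
    waited=$((waited + 5))
  done
}
```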
Example 3: Monitoring Protection Group state transitions during failover
In a separate terminal, watch the Protection Group's currentState change as the failover progresses. Run this against the source (primary) cluster:
kubectl get protectiongroup production-protection-group -n dr-system -w --kubeconfig primary-kubeconfig.yaml
Expected output:
NAME DESIRED STATE CURRENT STATE
production-protection-group running running
production-protection-group stopped running
production-protection-group stopped stopped
Then run the same watch against the target (DR) cluster to see VMs come online:
kubectl get protectiongroup production-protection-group -n dr-system -w --kubeconfig dr-kubeconfig.yaml
Expected output:
NAME DESIRED STATE CURRENT STATE
production-protection-group running unknown
production-protection-group running stopped
production-protection-group running running
Example 4: Checking for an active failover lock
Before starting a second failover operation involving the same Protection Group, check whether a lock is already held:
kubectl get lease failover-lock-production-protection-group -n dr-system -o yaml
Example output if a lock is active:
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: failover-lock-production-protection-group
  namespace: dr-system
spec:
  holderIdentity: failover-controller-12345
  leaseDurationSeconds: 300
  acquireTime: "2025-11-01T10:32:00Z"
If the lease does not exist, no failover lock is currently held and it is safe to proceed.
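The same check can be wrapped in a pre-flight script. The sketch below uses the failover-lock-<pg-name> naming convention described on this page; the function names lock_holder and lock_is_free are hypothetical. It treats a lease with an empty holderIdentity the same as a missing lease, which is an assumption about how the controller releases the lock.

```shell
# lock_holder <pg-name> <namespace>
# Prints the lock's holderIdentity, or nothing when no lease exists.
lock_holder() {
  kubectl get lease "failover-lock-$1" -n "$2" --ignore-not-found \
    -o jsonpath='{.spec.holderIdentity}'
}

# lock_is_free <pg-name> <namespace>
# Succeeds when no holder is recorded for the failover lock.
lock_is_free() {
  holder=$(lock_holder "$1" "$2") || return 2
  if [ -n "$holder" ]; then
    echo "lock held by $holder" >&2
    return 1
  fi
  echo "no failover lock held"
}
```

The --ignore-not-found flag makes kubectl exit successfully with empty output when the lease is absent, so a missing lease is not reported as an error.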
Example 5: Concurrent safety check blocking a second failover
This scenario shows what happens when you attempt to fail over a second Protection Group to a cluster that already has a running Protection Group.
Attempt the failover:
pgctl failover database-protection-group --to-cluster dr-cluster -n dr-system
Expected output (aborted by safety check):
⚠ SAFETY CHECK: Other Protection Groups have running VMs on dr-cluster:
- production-protection-group
- web-servers-protection-group
UNSAFE to remove taints on dr-cluster - other Protection Groups have running VMs
This could cause scheduling issues for those VMs
To override this safety check, use --force flag (NOT RECOMMENDED)
ERROR: Failover aborted
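This is expected behavior. The check fired because other Protection Groups report a currentState of running on dr-cluster, so waiting alone will not clear it. Confirm whether those workloads must stay protected, then follow the resolution steps in Issue 3 before retrying.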
Example 6: Reviewing completed failover audit trail
After a failover completes, the Failover CR persists. Inspect it for a full record of the operation:
kubectl describe failover production-protection-group-failover -n dr-system
The Events section will contain timestamped entries for each step: lock acquisition, VM stop, taint removal, DRBD synchronization, and VM start on the target cluster.
Issue 1: currentState stuck in mixed after failover
Symptom: The Protection Group's status.currentState shows mixed after a failover completes. Some VMs are running, others are not.
Likely cause: One or more VMs in the Protection Group failed to start on the target cluster. This can be caused by insufficient resources on target nodes, unresolved DRBD replication lag, or a node taint that was not removed.
Fix:
- Identify which VMs are not running:
  kubectl get vmi -n <namespace> --kubeconfig dr-kubeconfig.yaml
- Check events on non-running VMs:
  kubectl describe vm <vm-name> -n <namespace> --kubeconfig dr-kubeconfig.yaml
- Check for remaining DRBD quorum taints on target nodes:
  kubectl get nodes --kubeconfig dr-kubeconfig.yaml -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
- If taints remain due to a blocked safety check, review whether another Protection Group is blocking taint removal (see Issue 3), then retry or use --force with caution.
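On clusters with many nodes, it can be easier to list only the nodes that still carry a given taint. The sketch below takes the taint key as an argument because this page does not name the DRBD quorum taint key; look it up in your deployment and substitute it. The function name nodes_with_taint is hypothetical.

```shell
# nodes_with_taint <taint-key>
# Prints the names of nodes that carry a taint with the given key.
nodes_with_taint() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}' |
    awk -F '\t' -v key="$1" '{
      # Field 2 holds the space-separated taint keys for the node.
      n = split($2, keys, " ")
      for (i = 1; i <= n; i++) if (keys[i] == key) { print $1; break }
    }'
}
```

Set KUBECONFIG to dr-kubeconfig.yaml before calling the function so that it queries the target cluster.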
Issue 2: Failover blocked — cannot acquire lock
Symptom: pgctl failover or the failover script exits with Failed to acquire failover lock for <pg-name>. No failover operation appears to be running.
Likely cause: A previous failover operation crashed or was interrupted without releasing the Kubernetes Lease lock. The lease has not yet expired (timeout is 300 seconds).
Fix:
- Check whether the lock holder process is still running:
  kubectl get lease failover-lock-<pg-name> -n <namespace> -o yaml
- If the holderIdentity process is no longer active and the lock is stale, delete it manually:
  kubectl delete lease failover-lock-<pg-name> -n <namespace>
- Alternatively, wait for the 300-second lease duration to expire; the lock will auto-release.
- Retry the failover operation once the lock is cleared.
Note: Do not delete the lease if a failover is genuinely in progress. Check controller logs first to confirm the holder process is not active.
Issue 3: Safety check blocks taint removal — other PGs reported as conflicting
Symptom: Failover aborts with a message listing other Protection Groups that have running VMs on the target cluster.
Likely cause: The concurrent-safety check detected other Protection Groups in the same namespace with status.currentState: running on the target cluster. This is expected behavior designed to protect those VMs.
Fix:
- List all Protection Groups and their states on the target cluster:
  kubectl get protectiongroups -n <namespace> --kubeconfig dr-kubeconfig.yaml -o custom-columns=NAME:.metadata.name,STATE:.status.currentState
- Determine whether the conflicting Protection Groups genuinely have running workloads that must be protected.
- If this is a planned maintenance window and disruption is acceptable, coordinate with all operators and use --force:
  pgctl failover <pg-name> --to-cluster <dr-cluster> -n <namespace> --force
- If the conflicting PG's currentState is stale (the VMs are not actually running), manually patch the state or restart the Protection Group controller to force a reconciliation:
  kubectl rollout restart deployment protection-group-controller -n <namespace>
Issue 4: Failover CR stuck in InProgress
Symptom: The Failover CR status remains InProgress for more than a few minutes.
Likely cause: The Failover controller is waiting for a condition that has not been met — most commonly, DRBD replication has not fully synchronized, or VMs on the source cluster have not reached stopped state.
Fix:
- Check Failover controller logs for the specific step where it is blocked:
  kubectl logs -f deployment/failover-controller -n <namespace> --kubeconfig dr-kubeconfig.yaml
- Check the Protection Group currentState on the source cluster to confirm VMs have stopped:
  kubectl get protectiongroup <pg-name> -n <namespace> --kubeconfig primary-kubeconfig.yaml -o jsonpath='{.status.currentState}'
- If the Protection Group controller on the source cluster is not reconciling, check its logs:
  kubectl logs -f deployment/protection-group-controller -n <namespace> --kubeconfig primary-kubeconfig.yaml
- If DRBD synchronization is the bottleneck, do not force-continue. Wait for replication to complete to avoid data loss.
Issue 5: Controller logs show desiredState changes not being reconciled
Symptom: You patched spec.desiredState on a Protection Group manually, but the VMs did not start or stop, and status.currentState did not update.
Likely cause: The Protection Group controller is not running, was not restarted after a CRD schema update, or has an RBAC issue preventing it from patching VM resources.
Fix:
- Verify the controller is running:
  kubectl get deployment protection-group-controller -n <namespace>
- Restart the controller to pick up any CRD changes:
  kubectl rollout restart deployment protection-group-controller -n <namespace>
- Check controller logs for RBAC errors:
  kubectl logs deployment/protection-group-controller -n <namespace> | grep -i "forbidden\|unauthorized"
- Verify the controller's ServiceAccount has permission to patch virtualmachines.kubevirt.io resources in the target namespace.
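The ServiceAccount permission can be checked with kubectl auth can-i and impersonation. This is a sketch: the function name controller_can_patch_vms is hypothetical, the ServiceAccount name is not documented on this page and must be taken from your controller deployment, and your own user needs impersonation rights for --as to work.

```shell
# controller_can_patch_vms <target-namespace> <sa-namespace> <sa-name>
# Succeeds when the ServiceAccount may patch KubeVirt VirtualMachines
# in the target namespace.
controller_can_patch_vms() {
  answer=$(kubectl auth can-i patch virtualmachines.kubevirt.io \
    -n "$1" --as="system:serviceaccount:$2:$3")
  [ "$answer" = "yes" ]
}
```

To find the ServiceAccount name, read spec.template.spec.serviceAccountName from the protection-group-controller deployment.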