Site Recovery for Kubernetes Virtual Machines
Guide

Operations Page

Monitoring active and completed operations


Overview

The Operations page lets you monitor the status of active and completed failover operations across your Site Recovery deployment. You can track Protection Group state transitions, inspect Failover CR lifecycle events, and validate the safety mechanisms that govern concurrent failover behavior. This visibility matters most during DR exercises and production failover events: seeing the controller's decisions, including safety check outcomes and lock acquisition, determines how quickly you can diagnose and respond to issues.


Prerequisites

Before using the operations monitoring workflows described on this page, ensure the following are in place:

  • Kubernetes clusters: Primary and DR clusters deployed and reachable. For LINSTOR deployments, the quorum cluster must also be operational.
  • CRDs applied: Both protection-group-crd.yaml and failover-crd.yaml must be present on all relevant clusters.
  • Controllers running:
    • protection-group-controller deployment is healthy on primary and DR clusters.
    • failover-controller deployment is healthy on the target cluster (DR cluster for failover, primary cluster for failback).
  • pgctl installed and configured with kubeconfigs for all clusters you intend to monitor.
  • kubectl access with sufficient RBAC to read protectiongroups, failovers, and leases resources in the target namespace.
  • Site Manager UI deployed to the quorum cluster via Helm (for LINSTOR model) or to your designated management cluster (for DRBD Operator model), if you prefer a graphical view of operations.

Installation

The operations monitoring capabilities are delivered by the same controllers and CRDs you deploy during initial infrastructure setup — no additional installation is required. If you are setting up monitoring access for the first time, verify that the required resources are in place.

Step 1: Confirm the Protection Group CRD is installed on each cluster

kubectl get crd protectiongroups.siterecovery.trilio.io

Expected output:

NAME                                          CREATED AT
protectiongroups.siterecovery.trilio.io       2025-01-15T10:00:00Z

Step 2: Confirm the Failover CRD is installed on each cluster

kubectl get crd failovers.siterecovery.trilio.io

Step 3: Verify controllers are running

Run the following on each cluster where a controller is expected:

# Check Protection Group controller
kubectl get deployment protection-group-controller -n <namespace>

# Check Failover controller
kubectl get deployment failover-controller -n <namespace>

Both deployments should report all replicas as ready (for example, READY 1/1), with AVAILABLE matching the desired replica count.
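If you want to script this check, the sketch below reads the output of kubectl get deployment <name> -n <namespace> -o json and compares the standard Deployment fields spec.replicas and status.availableReplicas. The function name deployment_available is an illustrative assumption, not part of pgctl or the controllers.

```python
import json


def deployment_available(deployment_json: str) -> bool:
    """Return True when every desired replica is reported available."""
    dep = json.loads(deployment_json)
    # Kubernetes defaults spec.replicas to 1 when it is not set.
    desired = dep.get("spec", {}).get("replicas", 1)
    # status.availableReplicas is omitted from the object when it is zero.
    available = dep.get("status", {}).get("availableReplicas", 0)
    return desired > 0 and available >= desired


# Sample document trimmed to the two fields the check reads.
sample = json.dumps({"spec": {"replicas": 2}, "status": {"availableReplicas": 2}})
print(deployment_available(sample))
```

A deployment scaled to zero replicas is treated as unhealthy here, because a controller with no running pods cannot reconcile anything.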

Step 4: Restart controllers if CRDs were recently updated

If you applied updated CRD definitions (for example, adding desiredState or currentState fields), restart the controllers to pick up the schema changes:

kubectl rollout restart deployment protection-group-controller -n <namespace>
kubectl rollout restart deployment failover-controller -n <namespace>

Step 5: Verify pgctl can reach all clusters

pgctl status --all-clusters

This command confirms that pgctl can authenticate and list Protection Groups across your primary, DR, and (where applicable) quorum clusters.


Configuration

Operations monitoring behavior is governed by fields in the ProtectionGroup and Failover CRDs, as well as controller-level settings. The key configuration options are described below.

ProtectionGroup CRD — spec.desiredState

| Field | Type | Default | Valid Values | Effect |
|---|---|---|---|---|
| spec.desiredState | string | running | running, stopped | Instructs the Protection Group controller to start or stop all VMs in the group atomically. The controller watches for changes to this field and reconciles VMs accordingly. |
| status.currentState | string | unknown | running, stopped, mixed, unknown | Reflects the actual observed state of all VMs in the group. Set by the Protection Group controller after each reconciliation. A value of mixed indicates VMs are in different states and may require investigation. |

The separation between desiredState and currentState is intentional: it gives you a clear signal of whether the controller has finished reconciling. During a failover, you should expect currentState to lag behind desiredState briefly while VMs are being stopped or started.
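To make the four currentState values concrete, here is a minimal sketch of how per-VM observations could roll up into a single group state. The actual controller logic is not documented on this page; the function aggregate_state and its handling of an empty group are assumptions made for illustration.

```python
from typing import Iterable


def aggregate_state(vm_states: Iterable[str]) -> str:
    """Roll per-VM states up into one group-level state (illustrative only)."""
    states = set(vm_states)
    if not states:
        # Assumption: nothing observed yet maps to "unknown".
        return "unknown"
    if states == {"running"}:
        return "running"
    if states == {"stopped"}:
        return "stopped"
    # Any other combination means the VMs disagree or report something unexpected.
    return "mixed"


print(aggregate_state(["running", "running"]))
print(aggregate_state(["running", "stopped"]))
```

With a rule like this, mixed is what you would see midway through a stop or start, which is why a brief mixed reading during failover is less alarming than one that persists afterwards.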

Failover CRD — spec.force

| Field | Type | Default | Effect |
|---|---|---|---|
| spec.force | boolean | false | When true, bypasses all concurrent-safety checks before removing DRBD quorum taints. Use only in emergency scenarios where you have confirmed that disrupting other Protection Groups on the target cluster is acceptable. |

Failover Lock (Kubernetes Lease)

The Failover controller creates a Kubernetes Lease object in coordination.k8s.io/v1 for each active failover operation. This is not a field you configure directly, but you can inspect it:

| Property | Value |
|---|---|
| Lease name | failover-lock-<protection-group-name> |
| Namespace | Same as the Protection Group |
| Holder identity | failover-controller-<pid> |
| Lease duration | 300 seconds (auto-expires if the controller crashes) |

This lock prevents two failover operations from running against the same Protection Group at the same time. It does not block failovers of different Protection Groups.
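The naming and expiry rules can be sketched as follows. This is an illustration of the arithmetic only: which timestamp the controller measures from (the Lease's acquireTime or renewTime) is not stated here, so the last_renewed parameter is an assumption, as are the function names.

```python
from datetime import datetime, timedelta, timezone

LEASE_DURATION_SECONDS = 300  # lease duration documented for the failover lock


def lock_name(protection_group: str) -> str:
    """Build the Lease name used for a Protection Group's failover lock."""
    return f"failover-lock-{protection_group}"


def lease_expired(last_renewed: datetime, now: datetime,
                  duration_seconds: int = LEASE_DURATION_SECONDS) -> bool:
    """Return True once the lease duration has fully elapsed."""
    return now >= last_renewed + timedelta(seconds=duration_seconds)


acquired = datetime(2025, 11, 1, 10, 32, 0, tzinfo=timezone.utc)
print(lock_name("production-protection-group"))
print(lease_expired(acquired, acquired + timedelta(seconds=120)))
```

Because the name is derived from the Protection Group, two groups never contend for the same Lease, which is why the lock does not serialize failovers of different groups.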

Concurrent Safety Check Behavior

Before removing DRBD quorum taints from the target cluster, the Failover controller checks whether any other Protection Groups in the same namespace have a status.currentState of running on that cluster. If conflicts are found, the operation aborts unless spec.force: true is set. This check is scoped to the same namespace — Protection Groups in other namespaces are not currently evaluated.
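The decision described above can be sketched in a few lines. This is not the controller's source; the ProtectionGroupInfo type and both function names are hypothetical, and the sketch assumes the list passed in already reflects the Protection Groups visible on the target cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProtectionGroupInfo:
    name: str
    namespace: str
    current_state: str  # value of status.currentState


def conflicting_groups(groups: List[ProtectionGroupInfo],
                       target_name: str, namespace: str) -> List[str]:
    """Names of other groups in the same namespace whose state is running."""
    return [
        g.name for g in groups
        if g.namespace == namespace
        and g.name != target_name
        and g.current_state == "running"
    ]


def may_remove_taints(groups: List[ProtectionGroupInfo], target_name: str,
                      namespace: str, force: bool = False) -> bool:
    """Allow taint removal when forced or when no conflicts are found."""
    return force or not conflicting_groups(groups, target_name, namespace)


groups = [
    ProtectionGroupInfo("production-protection-group", "dr-system", "running"),
    ProtectionGroupInfo("database-protection-group", "dr-system", "stopped"),
    ProtectionGroupInfo("other-team-group", "other-ns", "running"),
]
print(conflicting_groups(groups, "database-protection-group", "dr-system"))
```

Note that other-team-group is ignored even though it is running, which mirrors the namespace scoping limitation described above.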


Usage

Checking the current state of a Protection Group

Use pgctl to inspect the desired and current state of any Protection Group:

pgctl get protection-group <pg-name> -n <namespace>

You can also use kubectl directly if you prefer:

kubectl get protectiongroup <pg-name> -n <namespace> -o yaml

Look at spec.desiredState and status.currentState together. If they differ, the Protection Group controller is still reconciling.
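For scripted checks, the comparison can be automated against the JSON form of the same object (kubectl get protectiongroup <pg-name> -n <namespace> -o json). The sketch below is a minimal example; reconcile_summary is a hypothetical helper, and the fallback values simply reuse the CRD defaults listed in the Configuration section.

```python
import json


def reconcile_summary(pg_json: str) -> str:
    """Summarize whether a ProtectionGroup has finished reconciling."""
    pg = json.loads(pg_json)
    # Fall back to the documented CRD defaults when a field is absent.
    desired = pg.get("spec", {}).get("desiredState", "running")
    current = pg.get("status", {}).get("currentState", "unknown")
    if desired == current:
        return f"reconciled ({current})"
    return f"reconciling: desired={desired}, current={current}"


sample = json.dumps({
    "spec": {"desiredState": "stopped"},
    "status": {"currentState": "running"},
})
print(reconcile_summary(sample))  # reconciling: desired=stopped, current=running
```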

Watching a failover operation in progress

To follow a Failover CR as it progresses through its lifecycle:

kubectl get failover <failover-name> -n <namespace> -w

Alternatively, watch the controller logs for real-time decision output:

kubectl logs -f deployment/failover-controller -n <namespace>

Checking whether a failover lock is held

Before initiating a failover, confirm no lock is already held for your Protection Group:

kubectl get lease failover-lock-<pg-name> -n <namespace>

If the lease exists, another failover operation is in progress or a previous one did not release its lock cleanly. Check the holderIdentity field to identify the process.

Listing all active and completed failover operations

pgctl list failovers -n <namespace>

Or with kubectl:

kubectl get failovers -n <namespace>

Completed Failover CRs persist after the operation finishes, giving you a full audit trail.

Initiating a monitored failover with pgctl

pgctl failover <pg-name> --to-cluster <dr-cluster-name> -n <namespace>

Add --force only in emergency scenarios where you have confirmed it is safe to bypass the concurrent-safety checks:

pgctl failover <pg-name> --to-cluster <dr-cluster-name> -n <namespace> --force

Warning: --force bypasses all safety checks protecting other running Protection Groups on the target cluster. Coordinate with all operators before using this flag.


Examples

Example 1: Verifying Protection Group state before a planned failover

Before initiating a DR exercise, confirm that the Protection Group is healthy and fully running on the primary cluster.

kubectl get protectiongroup production-protection-group -n dr-system -o yaml

Expected output (abbreviated):

apiVersion: siterecovery.trilio.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: production-protection-group
  namespace: dr-system
spec:
  desiredState: running
  virtualMachines:
    - name: prod-vm-1
    - name: prod-vm-2
status:
  currentState: running

desiredState and currentState both showing running confirms the Protection Group controller has fully reconciled and all VMs are active.


Example 2: Initiating a failover and tracking its progress

Start the failover using pgctl, then watch the resulting Failover CR.

pgctl failover production-protection-group --to-cluster dr-cluster -n dr-system

Then watch the Failover CR status:

kubectl get failover -n dr-system -w

Expected output progression:

NAME                                    STATUS       AGE
production-protection-group-failover    Pending      2s
production-protection-group-failover    InProgress   5s
production-protection-group-failover    InProgress   30s
production-protection-group-failover    Completed    87s
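When you capture watch output like this in a log, it can help to reduce it to the distinct status changes. The sketch below assumes the three-column layout shown above and a single Failover CR in the stream; phase_transitions is an illustrative helper, not a pgctl feature.

```python
from typing import List


def phase_transitions(watch_output: str) -> List[str]:
    """Collapse repeated rows of watch output into distinct status changes."""
    phases: List[str] = []
    for line in watch_output.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        status = fields[1]
        if not phases or phases[-1] != status:
            phases.append(status)
    return phases


sample = """\
NAME                                    STATUS       AGE
production-protection-group-failover    Pending      2s
production-protection-group-failover    InProgress   5s
production-protection-group-failover    InProgress   30s
production-protection-group-failover    Completed    87s
"""
print(phase_transitions(sample))  # ['Pending', 'InProgress', 'Completed']
```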

Example 3: Monitoring Protection Group state transitions during failover

In a separate terminal, watch the Protection Group's currentState change as the failover progresses. Run this against the source (primary) cluster:

kubectl get protectiongroup production-protection-group -n dr-system -w --kubeconfig primary-kubeconfig.yaml

Expected output:

NAME                           DESIRED STATE   CURRENT STATE
production-protection-group    running         running
production-protection-group    stopped         running
production-protection-group    stopped         stopped

Then run the same watch against the target (DR) cluster to see VMs come online:

kubectl get protectiongroup production-protection-group -n dr-system -w --kubeconfig dr-kubeconfig.yaml

Expected output:

NAME                           DESIRED STATE   CURRENT STATE
production-protection-group    running         unknown
production-protection-group    running         stopped
production-protection-group    running         running

Example 4: Checking for an active failover lock

Before starting a second failover operation involving the same Protection Group, check whether a lock is already held:

kubectl get lease failover-lock-production-protection-group -n dr-system -o yaml

Example output if a lock is active:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: failover-lock-production-protection-group
  namespace: dr-system
spec:
  holderIdentity: failover-controller-12345
  leaseDurationSeconds: 300
  acquireTime: "2025-11-01T10:32:00Z"

If the lease does not exist (kubectl returns a NotFound error), no failover lock is currently held and it is safe to proceed.


Example 5: Concurrent safety check blocking a second failover

This scenario shows what happens when you attempt to fail over a second Protection Group to a cluster that already hosts running Protection Groups.

Attempt the failover:

pgctl failover database-protection-group --to-cluster dr-cluster -n dr-system

Expected output (aborted by safety check):

⚠ SAFETY CHECK: Other Protection Groups have running VMs on dr-cluster:
    - production-protection-group
    - web-servers-protection-group
UNSAFE to remove taints on dr-cluster - other Protection Groups have running VMs
This could cause scheduling issues for those VMs
To override this safety check, use --force flag (NOT RECOMMENDED)
ERROR: Failover aborted

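This is expected behavior. The check aborts the failover because other Protection Groups report a status.currentState of running on dr-cluster. Confirm whether those groups have workloads that must stay protected (see Issue 3 under Troubleshooting) before retrying or, as a last resort, using --force.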
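If you collect this output in automation, the conflicting group names can be pulled out of it. The sketch below relies only on the message layout shown in this example, which may differ between versions; conflicts_from_output is a hypothetical helper.

```python
import re
from typing import List


def conflicts_from_output(output: str) -> List[str]:
    """Extract the Protection Group names listed under the SAFETY CHECK line."""
    names: List[str] = []
    in_list = False
    for line in output.splitlines():
        if "SAFETY CHECK" in line:
            in_list = True
            continue
        if in_list:
            match = re.match(r"\s*-\s+(\S+)\s*$", line)
            if not match:
                break  # first non-bullet line ends the list
            names.append(match.group(1))
    return names


sample = """\
\u26a0 SAFETY CHECK: Other Protection Groups have running VMs on dr-cluster:
    - production-protection-group
    - web-servers-protection-group
UNSAFE to remove taints on dr-cluster - other Protection Groups have running VMs
ERROR: Failover aborted
"""
print(conflicts_from_output(sample))
```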


Example 6: Reviewing completed failover audit trail

After a failover completes, the Failover CR persists. Inspect it for a full record of the operation:

kubectl describe failover production-protection-group-failover -n dr-system

The Events section will contain timestamped entries for each step: lock acquisition, VM stop, taint removal, DRBD synchronization, and VM start on the target cluster.


Troubleshooting

Issue 1: currentState stuck in mixed after failover

Symptom: The Protection Group's status.currentState shows mixed after a failover completes. Some VMs are running, others are not.

Likely cause: One or more VMs in the Protection Group failed to start on the target cluster. This can be caused by insufficient resources on target nodes, unresolved DRBD replication lag, or a node taint that was not removed.

Fix:

  1. Identify which VMs are not running:
    kubectl get vmi -n <namespace> --kubeconfig dr-kubeconfig.yaml
    
  2. Check events on non-running VMs:
    kubectl describe vm <vm-name> -n <namespace> --kubeconfig dr-kubeconfig.yaml
    
  3. Check for remaining DRBD quorum taints on target nodes:
    kubectl get nodes --kubeconfig dr-kubeconfig.yaml -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
    
  4. If taints remain due to a blocked safety check, review whether another Protection Group is blocking taint removal (see Issue 3), then retry or use --force with caution.
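
To narrow step 3 to a specific taint, you can filter the JSON form of the node list (kubectl get nodes -o json). The sketch below is a minimal example; the taint key example.com/drbd-quorum is a placeholder, since the actual DRBD quorum taint key is not given on this page, and tainted_nodes is a hypothetical helper.

```python
import json
from typing import Dict, List


def tainted_nodes(nodes_json: str, key_fragment: str) -> Dict[str, List[str]]:
    """Map node name to the taint keys that contain key_fragment."""
    result: Dict[str, List[str]] = {}
    for node in json.loads(nodes_json).get("items", []):
        # spec.taints is absent on nodes that carry no taints.
        taints = node.get("spec", {}).get("taints") or []
        keys = [t["key"] for t in taints if key_fragment in t.get("key", "")]
        if keys:
            result[node["metadata"]["name"]] = keys
    return result


# Placeholder taint key; substitute the key your deployment actually uses.
sample = json.dumps({"items": [
    {"metadata": {"name": "dr-node-1"},
     "spec": {"taints": [{"key": "example.com/drbd-quorum", "effect": "NoSchedule"}]}},
    {"metadata": {"name": "dr-node-2"}, "spec": {}},
]})
print(tainted_nodes(sample, "drbd-quorum"))
```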

Issue 2: Failover blocked — cannot acquire lock

Symptom: pgctl failover or the failover script exits with Failed to acquire failover lock for <pg-name>. No failover operation appears to be running.

Likely cause: A previous failover operation crashed or was interrupted without releasing the Kubernetes Lease lock. The lease has not yet expired (timeout is 300 seconds).

Fix:

  1. Inspect the lease to identify the holder and when it acquired the lock:
    kubectl get lease failover-lock-<pg-name> -n <namespace> -o yaml
    
  2. If the holderIdentity process is no longer active and the lock is stale, delete it manually:
    kubectl delete lease failover-lock-<pg-name> -n <namespace>
    
  3. Alternatively, wait for the 300-second lease duration to expire; the lock will auto-release.
  4. Retry the failover operation once the lock is cleared.

Note: Do not delete the lease if a failover is genuinely in progress. Check controller logs first to confirm the holder process is not active.


Issue 3: Safety check blocks taint removal — other PGs reported as conflicting

Symptom: Failover aborts with a message listing other Protection Groups that have running VMs on the target cluster.

Likely cause: The concurrent-safety check detected other Protection Groups in the same namespace with status.currentState: running on the target cluster. This is expected behavior designed to protect those VMs.

Fix:

  1. List all Protection Groups and their states on the target cluster:
    kubectl get protectiongroups -n <namespace> --kubeconfig dr-kubeconfig.yaml -o custom-columns=NAME:.metadata.name,STATE:.status.currentState
    
  2. Determine whether the conflicting Protection Groups genuinely have running workloads that must be protected.
  3. If this is a planned maintenance window and disruption is acceptable, coordinate with all operators and use --force:
    pgctl failover <pg-name> --to-cluster <dr-cluster> -n <namespace> --force
    
  4. If the conflicting PG's currentState is stale (the VMs are not actually running), manually patch the state or restart the Protection Group controller to force a reconciliation:
    kubectl rollout restart deployment protection-group-controller -n <namespace>
    

Issue 4: Failover CR stuck in InProgress

Symptom: The Failover CR status does not advance beyond InProgress for more than a few minutes.

Likely cause: The Failover controller is waiting for a condition that has not been met — most commonly, DRBD replication has not fully synchronized, or VMs on the source cluster have not reached stopped state.

Fix:

  1. Check Failover controller logs for the specific step where it is blocked:
    kubectl logs -f deployment/failover-controller -n <namespace> --kubeconfig dr-kubeconfig.yaml
    
  2. Check the Protection Group currentState on the source cluster to confirm VMs have stopped:
    kubectl get protectiongroup <pg-name> -n <namespace> --kubeconfig primary-kubeconfig.yaml -o jsonpath='{.status.currentState}'
    
  3. If the Protection Group controller on the source cluster is not reconciling, check its logs:
    kubectl logs -f deployment/protection-group-controller -n <namespace> --kubeconfig primary-kubeconfig.yaml
    
  4. If DRBD synchronization is the bottleneck, do not force-continue. Wait for replication to complete to avoid data loss.

Issue 5: Controller logs show desiredState changes not being reconciled

Symptom: You patched spec.desiredState on a Protection Group manually, but the VMs did not start or stop, and status.currentState did not update.

Likely cause: The Protection Group controller is not running, was not restarted after a CRD schema update, or has an RBAC issue preventing it from patching VM resources.

Fix:

  1. Verify the controller is running:
    kubectl get deployment protection-group-controller -n <namespace>
    
  2. Restart the controller to pick up any CRD changes:
    kubectl rollout restart deployment protection-group-controller -n <namespace>
    
  3. Check controller logs for RBAC errors:
    kubectl logs deployment/protection-group-controller -n <namespace> | grep -iE "forbidden|unauthorized"
    
  4. Verify the controller's ServiceAccount has permission to patch virtualmachines.kubevirt.io resources in the target namespace.
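
One way to perform the check in step 4 is kubectl auth can-i with impersonation. The sketch below only builds the command line so you can review it before running it; the ServiceAccount name protection-group-controller is an assumption (use the name from your deployment's pod spec), and can_i_patch_vms is a hypothetical helper.

```python
from typing import List


def can_i_patch_vms(namespace: str, service_account: str) -> List[str]:
    """Build a kubectl auth can-i command that impersonates a ServiceAccount."""
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return [
        "kubectl", "auth", "can-i", "patch", "virtualmachines.kubevirt.io",
        "-n", namespace,
        "--as", subject,
    ]


# ServiceAccount name below is assumed for illustration.
print(" ".join(can_i_patch_vms("dr-system", "protection-group-controller")))
```

The command prints yes or no. Running it requires that your own user is allowed to impersonate ServiceAccounts in the cluster.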