Controller Issues
Controller pod failures, connectivity issues
This page helps you diagnose and resolve common issues with Site Recovery controllers — the pods responsible for orchestrating Protection Group lifecycle, replication health monitoring, and cross-cluster failover operations. Controller failures can silently block VM protection, prevent failovers from completing, and leave Protection Groups in an inconsistent state, so fast diagnosis is critical. The guidance here covers pod-level failures, missing credentials, permission errors, and health probe failures across both LINSTOR and DRBD Operator deployment models.
Before troubleshooting controller issues, ensure you have:
- kubectl access to the relevant cluster(s): primary, DR, and (for LINSTOR deployments) the quorum cluster
- Sufficient RBAC permissions to read pods, deployments, secrets, ClusterRoles, and ClusterRoleBindings in the controller namespace
- The namespace where controllers are deployed (referred to as <namespace> throughout this page)
- The names of the controller deployments in your environment (referred to as <controller> throughout this page)
- For LINSTOR deployments: kubeconfig access to all three clusters (primary, DR, quorum)
- For DRBD Operator deployments: kubeconfig access to both primary and DR clusters, plus the optional quorum cluster if deployed
Controllers rely on the following configuration inputs at runtime. Misconfigurations in any of these are a frequent root cause of the issues described on this page.
Kubeconfig secrets
Each controller uses Kubernetes secrets to authenticate against remote clusters. The expected secret names follow the pattern <cluster-name>-kubeconfig. For a two-cluster DRBD Operator deployment these are typically cluster1-kubeconfig and cluster2-kubeconfig. For a three-cluster LINSTOR deployment an additional secret for the quorum cluster is required. These secrets must exist in the same namespace as the controller.
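The naming pattern above can be turned into a checklist of secret names. The following is a minimal sketch, not part of the product: the helper name expected_secrets is hypothetical, and the cluster names passed to it (including "quorum" for the third LINSTOR cluster) are assumptions that must match your own environment.

```shell
# Print the kubeconfig secret name expected for each cluster name given,
# following the <cluster-name>-kubeconfig pattern.
# expected_secrets is a hypothetical helper, not a Site Recovery command.
expected_secrets() {
  for cluster in "$@"; do
    printf '%s-kubeconfig\n' "$cluster"
  done
}

# Two-cluster DRBD Operator deployment (example cluster names):
#   expected_secrets cluster1 cluster2
# Three-cluster LINSTOR deployment, assuming the quorum cluster is named "quorum":
#   expected_secrets cluster1 cluster2 quorum
```

The output is one secret name per line, which can be compared against the secrets that actually exist in the controller namespace.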
Health probe endpoint
All controllers expose a /healthz endpoint on port 8080. Kubernetes uses this endpoint for liveness and readiness probes. If the liveness probe fails repeatedly, kubelet restarts the container; while the readiness probe fails, the pod is marked not ready. You can verify the endpoint manually using kubectl port-forward (see Troubleshooting).
RBAC resources
Controllers require a ClusterRole and a corresponding ClusterRoleBinding to interact with KubeVirt VMs, Protection Group custom resources, Failover custom resources, and core Kubernetes resources such as nodes, leases, and secrets. Missing or misconfigured RBAC is a common cause of permission-related CrashLoopBackOff events.
Protection Group and Failover CRDs
Controllers watch ProtectionGroup and Failover custom resources under the siterecovery.trilio.io API group. If the CRDs are not installed or are at an incompatible version, controllers will fail to start or will log errors on their watch initialization.
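To see which CRDs from this API group are installed, the output of kubectl get crd -o name can be filtered by the group suffix. This is a rough sketch: the helper name siterecovery_crds is hypothetical, and the exact CRD names in your cluster depend on the installed release.

```shell
# Read `kubectl get crd -o name` output on stdin and print the names of the
# CRDs that belong to the siterecovery.trilio.io API group.
# siterecovery_crds is a hypothetical helper name.
siterecovery_crds() {
  grep -E '\.siterecovery\.trilio\.io$' \
    | sed 's|^customresourcedefinition\.apiextensions\.k8s\.io/||'
}

# Example usage against a live cluster:
#   kubectl get crd -o name | siterecovery_crds
```

An empty result suggests the CRDs are not installed on the cluster you are connected to. The sketch does not check CRD versions.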
The primary tool for interacting with controllers during troubleshooting is kubectl. Use it to inspect pod state, stream logs, verify secrets and RBAC, and test health endpoints.
Check controller pod status
Start by confirming which pods are running and whether any are in a failed state:
kubectl get pods -n <namespace>
Look for pods with a status of CrashLoopBackOff, Error, or Pending. A CrashLoopBackOff status means the pod is starting, crashing, and being restarted repeatedly — always check logs immediately after observing this state.
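When a namespace contains many pods, the status check can be narrowed with a small filter. The sketch below assumes the default kubectl get pods column layout (NAME, READY, STATUS, RESTARTS, AGE); the helper name failing_pods is hypothetical.

```shell
# Read `kubectl get pods` table output on stdin and print the name and status
# of every pod whose STATUS column is CrashLoopBackOff, Error, or Pending.
# Assumes STATUS is the third column, so do not combine it with -A or
# --all-namespaces, which prepend a NAMESPACE column.
failing_pods() {
  awk 'NR > 1 && ($3 == "CrashLoopBackOff" || $3 == "Error" || $3 == "Pending") { print $1, $3 }'
}

# Example usage against a live cluster:
#   kubectl get pods -n <namespace> | failing_pods
```

No output means none of the listed pods is in one of the three states.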
Stream controller logs
kubectl logs -f deployment/<controller> -n <namespace>
If the pod has already restarted, retrieve logs from the previous instance to see the crash reason:
kubectl logs deployment/<controller> -n <namespace> --previous
Verify kubeconfig secrets
kubectl get secret cluster1-kubeconfig cluster2-kubeconfig -n <namespace>
For LINSTOR three-cluster deployments, also verify the quorum cluster secret:
kubectl get secret cluster1-kubeconfig cluster2-kubeconfig quorum-kubeconfig -n <namespace>
Verify RBAC resources
kubectl get clusterrole <controller-clusterrole-name>
kubectl get clusterrolebinding <controller-clusterrolebinding-name>
Test the health probe manually
kubectl port-forward deployment/<controller> 8080:8080 -n <namespace>
In a second terminal:
curl http://localhost:8080/healthz
A healthy controller returns an HTTP 200 response.
Example 1: Identifying a CrashLoopBackOff and reading the crash reason
List pods to spot the failing controller:
kubectl get pods -n site-recovery
Expected output showing a problem:
NAME                                       READY   STATUS             RESTARTS   AGE
protection-group-controller-7d9f6b-xk2lp   0/1     CrashLoopBackOff   5          8m
failover-controller-5c8d4f-mn3qt           1/1     Running            0          8m
Stream logs from the failing pod:
kubectl logs -f deployment/protection-group-controller -n site-recovery
If the pod has already restarted, retrieve the previous instance's logs:
kubectl logs deployment/protection-group-controller -n site-recovery --previous
Expected output indicating a missing secret:
ERROR - Failed to load kubeconfig: secret "cluster2-kubeconfig" not found in namespace "site-recovery"
Example 2: Verifying kubeconfig secrets exist
kubectl get secret cluster1-kubeconfig cluster2-kubeconfig -n site-recovery
Expected output when both secrets are present:
NAME                  TYPE     DATA   AGE
cluster1-kubeconfig   Opaque   1      2d
cluster2-kubeconfig   Opaque   1      2d
If a secret is missing, the command returns an error:
Error from server (NotFound): secrets "cluster2-kubeconfig" not found
Example 3: Checking RBAC resources
kubectl get clusterrole protection-group-controller-role
kubectl get clusterrolebinding protection-group-controller-rolebinding
Expected output when both resources exist:
NAME                               CREATED AT
protection-group-controller-role   2025-10-01T10:00:00Z
NAME                                      ROLE                                           AGE
protection-group-controller-rolebinding   ClusterRole/protection-group-controller-role   2d
Example 4: Testing the /healthz health probe
Forward the controller's health port locally:
kubectl port-forward deployment/protection-group-controller 8080:8080 -n site-recovery
In a second terminal, send a request to the health endpoint:
curl -v http://localhost:8080/healthz
Expected output from a healthy controller:
* Connected to localhost (127.0.0.1) port 8080
< HTTP/1.1 200 OK
< Content-Type: text/plain
ok
A non-200 response or a connection refused error indicates the controller process is unhealthy or not listening.
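For scripted checks, curl can print only the HTTP status code, which is easier to interpret than verbose output. The sketch below classifies that code; the helper name healthz_verdict is hypothetical, and it assumes the port-forward from the previous step is still running.

```shell
# Interpret the status code printed by:
#   curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/healthz
# curl prints 000 when no HTTP response was received (for example, when the
# connection is refused). healthz_verdict is a hypothetical helper name.
healthz_verdict() {
  case "$1" in
    200) echo "healthy" ;;
    000) echo "unreachable" ;;
    *)   echo "unhealthy (HTTP $1)" ;;
  esac
}

# Example usage while the port-forward is active:
#   healthz_verdict "$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/healthz)"
```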
Each issue below follows the same format: Symptom, Likely cause, Fix.
Issue 1: Controller pod is in CrashLoopBackOff
Symptom: kubectl get pods -n <namespace> shows the controller pod with status CrashLoopBackOff and an increasing restart count.
Likely cause: The controller process is encountering a fatal error on startup — most commonly a missing kubeconfig secret, a missing CRD, or an unhandled exception during initialization.
Fix:
- Retrieve logs from the previous pod instance to identify the specific error (omit --previous to read the current instance):
  kubectl logs deployment/<controller> -n <namespace> --previous
- Search the output for keywords such as ERROR, not found, permission denied, or failed to load.
- Address the root cause identified in the logs (see Issues 2 and 3 below for the most common causes).
- After fixing the root cause, restart the deployment to clear the backoff:
  kubectl rollout restart deployment/<controller> -n <namespace>
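The keyword search in the steps above can be done with grep. This is a minimal sketch; the helper name scan_controller_log is hypothetical, and the keyword list is the one given on this page rather than an exhaustive list of controller error messages.

```shell
# Read controller log text on stdin and print only the lines that contain one
# of the keywords listed on this page (case-insensitive), each prefixed with
# its line number. Exits non-zero when nothing matches, as grep does.
# scan_controller_log is a hypothetical helper name.
scan_controller_log() {
  grep -n -i -E 'error|not found|permission denied|failed to load'
}

# Example usage against a live cluster:
#   kubectl logs deployment/<controller> -n <namespace> --previous | scan_controller_log
```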
Issue 2: Controller fails to start — missing kubeconfig secrets
Symptom: Controller logs contain an error similar to secret "cluster2-kubeconfig" not found or failed to load kubeconfig. The pod enters CrashLoopBackOff.
Likely cause: One or more kubeconfig secrets required by the controller are absent from the controller namespace. For LINSTOR three-cluster deployments, secrets for the primary, DR, and quorum clusters must all be present. For DRBD Operator two-cluster deployments, secrets for the primary and DR clusters are required.
Fix:
- Verify which secrets are present:
  kubectl get secret cluster1-kubeconfig cluster2-kubeconfig -n <namespace>
  For LINSTOR deployments, also check:
  kubectl get secret quorum-kubeconfig -n <namespace>
- For any missing secret, create it from the appropriate kubeconfig file:
  kubectl create secret generic cluster2-kubeconfig \
    --from-file=kubeconfig=/path/to/cluster2-kubeconfig \
    -n <namespace>
- Restart the controller:
  kubectl rollout restart deployment/<controller> -n <namespace>
Important: The quorum cluster hosts the LINSTOR controller and failover controllers in LINSTOR deployments, but it does not run application workloads or participate in storage replication. Ensure its kubeconfig grants access to management-plane resources only.
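To find every missing secret in one pass rather than reading NotFound errors one at a time, the list of existing secrets can be compared against the expected names. The following is a sketch under assumptions: the helper name missing_secrets is hypothetical, and the secret names passed to it must match those used in your deployment.

```shell
# Usage: kubectl get secrets -n <namespace> -o name | missing_secrets NAME...
# Reads `secret/<name>` lines on stdin and prints each expected NAME that is
# absent from that list. missing_secrets is a hypothetical helper name.
missing_secrets() {
  present="$(cat)"
  for name in "$@"; do
    # -x matches the whole line, -F treats the name as a fixed string.
    printf '%s\n' "$present" | grep -q -x -F "secret/$name" || printf '%s\n' "$name"
  done
}

# Example for a LINSTOR three-cluster deployment:
#   kubectl get secrets -n <namespace> -o name \
#     | missing_secrets cluster1-kubeconfig cluster2-kubeconfig quorum-kubeconfig
```

Each name printed is a secret that still needs to be created with the kubectl create secret generic command shown in the fix steps.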
Issue 3: Controller logs show RBAC or permission denied errors
Symptom: Controller logs contain messages such as User "system:serviceaccount:..." cannot list resource "protectiongroups", forbidden, or RBAC: access denied. The controller may start but fail to reconcile resources.
Likely cause: The ClusterRole or ClusterRoleBinding for the controller is missing, was deleted, or does not grant permissions to the required API groups and resources (for example, siterecovery.trilio.io, kubevirt.io, coordination.k8s.io).
Fix:
- Confirm whether the ClusterRole and ClusterRoleBinding exist:
  kubectl get clusterrole <controller-clusterrole-name>
  kubectl get clusterrolebinding <controller-clusterrolebinding-name>
- If either resource is missing, reapply the RBAC manifests from your original deployment package (delivered via Ansible for infrastructure components).
- If the resources exist but the error persists, describe them to check their rules:
  kubectl describe clusterrole <controller-clusterrole-name>
- Add any missing rules to the ClusterRole, then restart the controller:
  kubectl rollout restart deployment/<controller> -n <namespace>
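A forbidden message in the logs already names the service account, verb, resource, and API group involved, so it can be converted into a kubectl auth can-i check to confirm a fix before restarting the controller. The sketch below assumes the standard Kubernetes wording (User "..." cannot <verb> resource "..." in API group "..."); the helper name forbidden_to_can_i and the service account name in the usage example are hypothetical.

```shell
# Read controller log lines on stdin. For each Kubernetes "forbidden" message,
# print a `kubectl auth can-i` command that re-tests the same verb and resource
# while impersonating the service account named in the message.
# forbidden_to_can_i is a hypothetical helper name.
forbidden_to_can_i() {
  sed -n -E '/User "[^"]+" cannot [a-z]+ resource "[^"]+" in API group "/{
    s/.*User "([^"]+)" cannot ([a-z]+) resource "([^"]+)" in API group "([^"]*)".*/kubectl auth can-i \2 \3.\4 --as=\1/
    s/\. --as=/ --as=/
    p
  }'
}

# Example usage against a live cluster:
#   kubectl logs deployment/<controller> -n <namespace> | forbidden_to_can_i
```

The second substitution drops the trailing dot for core-group resources, whose API group is empty. Running the printed command requires impersonation rights, and for namespaced resources you may need to append -n <namespace> yourself.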
Issue 4: Health probe failures cause repeated pod restarts
Symptom: Kubernetes events for the controller pod show Liveness probe failed or Readiness probe failed. The pod is restarted by kubelet even when no application crash is obvious in logs.
Likely cause: The controller's /healthz endpoint on port 8080 is not responding within the probe timeout. This can be caused by the controller process hanging on a blocking operation (for example, a stalled API call to a remote cluster), resource exhaustion, or a deadlock.
Fix:
- While the pod is running, manually test the health endpoint:
  kubectl port-forward deployment/<controller> 8080:8080 -n <namespace>
  Then in a second terminal:
  curl -v http://localhost:8080/healthz
- If the endpoint returns non-200 or the connection is refused, check current controller logs for blocking operations or error loops:
  kubectl logs -f deployment/<controller> -n <namespace>
- If the controller is stuck waiting on a remote cluster (for example, a cluster that is unreachable), verify connectivity to the target cluster using the kubeconfig secret.
- Restart the controller to clear a transient hang:
  kubectl rollout restart deployment/<controller> -n <namespace>
- If the problem recurs, check whether the probe timeout and failure threshold values in the controller deployment spec are appropriate for your environment's API server latency.
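When reviewing probe settings, it helps to estimate how long the endpoint can stay unresponsive before a restart. As a rough approximation, kubelet restarts the container after failureThreshold consecutive liveness failures spaced periodSeconds apart. The sketch below only does that arithmetic; the helper name liveness_window_seconds is hypothetical, and the jsonpath in the comment assumes the controller is the first container in the pod spec.

```shell
# Approximate number of seconds of continuous /healthz failure before kubelet
# restarts the container: periodSeconds multiplied by failureThreshold.
# This is an estimate; it ignores timeoutSeconds and initialDelaySeconds.
# liveness_window_seconds is a hypothetical helper name.
liveness_window_seconds() {
  period="$1"     # periodSeconds from the livenessProbe
  failures="$2"   # failureThreshold from the livenessProbe
  echo $(( period * failures ))
}

# Read the configured values (assumes the controller is containers[0]):
#   kubectl get deployment <controller> -n <namespace> \
#     -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'
# With the Kubernetes defaults (periodSeconds=10, failureThreshold=3):
#   liveness_window_seconds 10 3
```

If remote API calls in your environment can stall for longer than this window, raising failureThreshold or timeoutSeconds may be more appropriate than repeated restarts.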