Controller Issues

Controller pod failures, connectivity issues

Overview

This page helps you diagnose and resolve common issues with Site Recovery controllers — the pods responsible for orchestrating Protection Group lifecycle, replication health monitoring, and cross-cluster failover operations. Controller failures can silently block VM protection, prevent failovers from completing, and leave Protection Groups in an inconsistent state, so fast diagnosis is critical. The guidance here covers pod-level failures, missing credentials, permission errors, and health probe failures across both LINSTOR and DRBD Operator deployment models.

Prerequisites

Before troubleshooting controller issues, ensure you have:

kubectl access to the relevant cluster(s) — primary, DR, and (for LINSTOR deployments) the quorum cluster
Sufficient RBAC permissions to read pods, deployments, secrets, ClusterRoles, and ClusterRoleBindings in the controller namespace
The namespace where controllers are deployed (referred to as <namespace> throughout this page)
The names of the controller deployments in your environment (referred to as <controller> throughout this page)
For LINSTOR deployments: kubeconfig access to all three clusters (primary, DR, quorum)
For DRBD Operator deployments: kubeconfig access to both primary and DR clusters, plus the optional quorum cluster if deployed

Configuration

Controllers rely on the following configuration inputs at runtime. Misconfigurations in any of these are a frequent root cause of the issues described on this page.

Kubeconfig secrets

Each controller uses Kubernetes secrets to authenticate against remote clusters. The expected secret names follow the pattern <cluster-name>-kubeconfig. For a two-cluster DRBD Operator deployment these are typically cluster1-kubeconfig and cluster2-kubeconfig. For a three-cluster LINSTOR deployment an additional secret for the quorum cluster is required. These secrets must exist in the same namespace as the controller.

Health probe endpoint

All controllers expose a /healthz endpoint on port 8080. Kubernetes uses this endpoint for liveness and readiness probes. When the liveness probe fails repeatedly, the kubelet restarts the container; when the readiness probe fails, the pod is marked not ready. You can verify the endpoint manually using kubectl port-forward (see Usage and Troubleshooting).

RBAC resources

Controllers require a ClusterRole and a corresponding ClusterRoleBinding to interact with KubeVirt VMs, Protection Group custom resources, Failover custom resources, and core Kubernetes resources such as nodes, leases, and secrets. Missing or misconfigured RBAC is a common cause of permission-related CrashLoopBackOff events.

Protection Group and Failover CRDs

Controllers watch ProtectionGroup and Failover custom resources under the siterecovery.trilio.io API group. If the CRDs are not installed or are at an incompatible version, controllers will fail to start or will log errors on their watch initialization.
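
One way to confirm that the CRDs are present is to list all CRDs and filter by API group. The sketch below assumes only the group name stated above; the helper name siterecovery_crds is illustrative and not part of the product.

```shell
# Keep only CRDs that belong to the siterecovery.trilio.io API group.
# Reads `kubectl get crd -o name` output on stdin.
siterecovery_crds() {
  if ! grep -F '.siterecovery.trilio.io'; then
    echo "no siterecovery.trilio.io CRDs found" >&2
    return 1
  fi
}

# Usage (repeat against each cluster where a controller watches resources):
#   kubectl get crd -o name | siterecovery_crds
```

An empty result suggests the CRDs were never installed on that cluster. This check does not detect an incompatible CRD version.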

Usage

The primary tool for interacting with controllers during troubleshooting is kubectl. Use it to inspect pod state, stream logs, verify secrets and RBAC, and test health endpoints.

Check controller pod status

Start by confirming which pods are running and whether any are in a failed state:

kubectl get pods -n <namespace>

Look for pods with a status of CrashLoopBackOff, Error, or Pending. A CrashLoopBackOff status means the pod is starting, crashing, and being restarted repeatedly — always check logs immediately after observing this state.
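
In a namespace with many pods, it can help to filter the listing down to the three statuses named above. This is a minimal sketch; the helper name failing_pods is illustrative, and it assumes the default kubectl column layout in which STATUS is the third column.

```shell
# Print NAME and STATUS for pods in CrashLoopBackOff, Error, or Pending.
# Reads `kubectl get pods --no-headers` output on stdin (STATUS is column 3).
failing_pods() {
  awk '$3 ~ /^(CrashLoopBackOff|Error|Pending)$/ { print $1, $3 }'
}

# Usage:
#   kubectl get pods -n <namespace> --no-headers | failing_pods
```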

Stream controller logs

kubectl logs -f deployment/<controller> -n <namespace>

If the pod has already restarted, retrieve logs from the previous instance to see the crash reason:

kubectl logs deployment/<controller> -n <namespace> --previous

Verify kubeconfig secrets

kubectl get secret cluster1-kubeconfig cluster2-kubeconfig -n <namespace>

For LINSTOR three-cluster deployments, also verify the quorum cluster secret:

kubectl get secret cluster1-kubeconfig cluster2-kubeconfig quorum-kubeconfig -n <namespace>
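
When several secrets are expected, a small loop that reports only the absent ones can be easier to read than the combined listing. The sketch below is an illustration; the helper name missing_secrets is hypothetical, and the secret names you pass should match your deployment.

```shell
# Print the name of every listed secret that cannot be read from the namespace.
# Arguments: <namespace> <secret-name>...
missing_secrets() {
  ns="$1"; shift
  for secret in "$@"; do
    # `kubectl get secret` exits non-zero when the secret is absent
    # (or when the caller is not allowed to read it).
    kubectl get secret "$secret" -n "$ns" >/dev/null 2>&1 || echo "$secret"
  done
}

# Usage:
#   missing_secrets <namespace> cluster1-kubeconfig cluster2-kubeconfig
```

No output means every listed secret was found.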

Verify RBAC resources

kubectl get clusterrole <controller-clusterrole-name>
kubectl get clusterrolebinding <controller-clusterrolebinding-name>

Test the health probe manually

kubectl port-forward deployment/<controller> 8080:8080 -n <namespace>

In a second terminal:

curl http://localhost:8080/healthz

A healthy controller returns an HTTP 200 response.
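
For scripted checks, the same request can be reduced to its status code. This is a sketch that assumes the port-forward above is already running; the helper name healthz_check is illustrative.

```shell
# Report whether a health endpoint answers with HTTP 200.
# Argument: URL (defaults to the port-forwarded endpoint used above).
healthz_check() {
  url="${1:-http://localhost:8080/healthz}"
  # -w '%{http_code}' prints only the status code; curl prints 000 if it cannot connect.
  code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url")
  if [ "$code" = "200" ]; then
    echo "healthy"
  else
    echo "unhealthy (HTTP ${code:-none})"
    return 1
  fi
}

# Usage:
#   healthz_check
```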

Examples

Example 1: Identifying a CrashLoopBackOff and reading the crash reason

List pods to spot the failing controller:

kubectl get pods -n site-recovery

Expected output showing a problem:

NAME                                        READY   STATUS             RESTARTS   AGE
protection-group-controller-7d9f6b-xk2lp   0/1     CrashLoopBackOff   5          8m
failover-controller-5c8d4f-mn3qt           1/1     Running            0          8m

Stream logs from the failing pod:

kubectl logs -f deployment/protection-group-controller -n site-recovery

If the pod has already restarted, retrieve the previous instance's logs:

kubectl logs deployment/protection-group-controller -n site-recovery --previous

Expected output indicating a missing secret:

ERROR - Failed to load kubeconfig: secret "cluster2-kubeconfig" not found in namespace "site-recovery"

Example 2: Verifying kubeconfig secrets exist

kubectl get secret cluster1-kubeconfig cluster2-kubeconfig -n site-recovery

Expected output when both secrets are present:

NAME                  TYPE     DATA   AGE
cluster1-kubeconfig   Opaque   1      2d
cluster2-kubeconfig   Opaque   1      2d

If a secret is missing, the command returns an error:

Error from server (NotFound): secrets "cluster2-kubeconfig" not found

Example 3: Checking RBAC resources

kubectl get clusterrole protection-group-controller-role
kubectl get clusterrolebinding protection-group-controller-rolebinding

Expected output when both resources exist:

NAME                              CREATED AT
protection-group-controller-role  2025-10-01T10:00:00Z

NAME                                     ROLE                                          AGE
protection-group-controller-rolebinding   ClusterRole/protection-group-controller-role   2d

Example 4: Testing the /healthz health probe

Forward the controller's health port locally:

kubectl port-forward deployment/protection-group-controller 8080:8080 -n site-recovery

In a second terminal, send a request to the health endpoint:

curl -v http://localhost:8080/healthz

Expected output from a healthy controller:

* Connected to localhost (127.0.0.1) port 8080
< HTTP/1.1 200 OK
< Content-Type: text/plain
ok

A non-200 response or a connection refused error indicates the controller process is unhealthy or not listening.

Troubleshooting

Each issue below is described in the same format: Symptom, Likely cause, Fix.

Issue 1: Controller pod is in CrashLoopBackOff

Symptom: kubectl get pods -n <namespace> shows the controller pod with status CrashLoopBackOff and an increasing restart count.

Likely cause: The controller process is encountering a fatal error on startup — most commonly a missing kubeconfig secret, a missing CRD, or an unhandled exception during initialization.

Fix:

Retrieve logs from the previous pod instance to identify the specific error (omit --previous to read the currently running instance):
```
kubectl logs deployment/<controller> -n <namespace> --previous
```
Search the output for keywords such as ERROR, not found, permission denied, or failed to load.
Address the root cause identified in the logs (see Issues 2 and 3 below for the most common causes).
After fixing the root cause, restart the deployment to clear the backoff:
```
kubectl rollout restart deployment/<controller> -n <namespace>
```
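
The keyword search in the steps above can be sketched as a small filter over the log output. The helper name scan_crash_log is illustrative; the keywords are the ones listed in the fix, matched case-insensitively.

```shell
# Show log lines that match common failure keywords, prefixed with line numbers.
# Reads controller log text on stdin.
scan_crash_log() {
  grep -n -i -E 'error|not found|permission denied|failed to load'
}

# Usage:
#   kubectl logs deployment/<controller> -n <namespace> --previous | scan_crash_log
```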

Issue 2: Controller fails to start — missing kubeconfig secrets

Symptom: Controller logs contain an error similar to secret "cluster2-kubeconfig" not found or failed to load kubeconfig. The pod enters CrashLoopBackOff.

Likely cause: One or more kubeconfig secrets required by the controller are absent from the controller namespace. For LINSTOR three-cluster deployments, secrets for the primary, DR, and quorum clusters must all be present. For DRBD Operator two-cluster deployments, secrets for the primary and DR clusters are required.

Fix:

Verify which secrets are present:

kubectl get secret cluster1-kubeconfig cluster2-kubeconfig -n <namespace>

For LINSTOR deployments, also check:

kubectl get secret quorum-kubeconfig -n <namespace>

For any missing secret, create it from the appropriate kubeconfig file:

kubectl create secret generic cluster2-kubeconfig \
  --from-file=kubeconfig=/path/to/cluster2-kubeconfig \
  -n <namespace>

Restart the controller:

kubectl rollout restart deployment/<controller> -n <namespace>

Important: The quorum cluster hosts the LINSTOR controller and failover controllers in LINSTOR deployments, but it does not run application workloads or participate in storage replication. Ensure its kubeconfig grants access to management-plane resources only.
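
A secret can exist and still hold a kubeconfig that no longer reaches its cluster. The sketch below decodes the secret and queries the remote API server. It assumes the data key is named kubeconfig, as in the create command above; secrets created another way may use a different key. The helper name check_remote_cluster is illustrative.

```shell
# Decode the kubeconfig stored in a secret and query the remote API server.
# Arguments: <namespace> <secret-name>
check_remote_cluster() {
  ns="$1"; secret="$2"
  tmp=$(mktemp) || return 1
  # Assumption: the secret stores the file under the data key "kubeconfig".
  kubectl get secret "$secret" -n "$ns" -o jsonpath='{.data.kubeconfig}' \
    | base64 --decode > "$tmp"
  if [ ! -s "$tmp" ]; then
    echo "secret $secret has no 'kubeconfig' key or it is empty" >&2
    rm -f "$tmp"
    return 1
  fi
  # /readyz is the API server readiness endpoint; the timeout avoids a long hang.
  kubectl --kubeconfig "$tmp" --request-timeout=10s get --raw /readyz
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Usage:
#   check_remote_cluster <namespace> cluster2-kubeconfig
```

The temporary file holds cluster credentials, so the function removes it before returning.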

Issue 3: Controller logs show RBAC or permission denied errors

Symptom: Controller logs contain messages such as User "system:serviceaccount:..." cannot list resource "protectiongroups", forbidden, or RBAC: access denied. The controller may start but fail to reconcile resources.

Likely cause: The ClusterRole or ClusterRoleBinding for the controller is missing, was deleted, or does not grant permissions to the required API groups and resources (for example, siterecovery.trilio.io, kubevirt.io, coordination.k8s.io).

Fix:

Confirm whether the ClusterRole and ClusterRoleBinding exist:

kubectl get clusterrole <controller-clusterrole-name>
kubectl get clusterrolebinding <controller-clusterrolebinding-name>

If either resource is missing, reapply the RBAC manifests from your original deployment package (delivered via Ansible for infrastructure components).
If the resources exist but the error persists, describe them to check their rules:
```
kubectl describe clusterrole <controller-clusterrole-name>
```
Add any missing rules to the ClusterRole, then restart the controller:
```
kubectl rollout restart deployment/<controller> -n <namespace>
```
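
To see which permissions are actually missing, kubectl auth can-i can impersonate the controller's service account. This is a sketch under assumptions: the service account name is specific to your deployment, your own user must be allowed to impersonate service accounts, and the helper name rbac_report is illustrative.

```shell
# Report whether a service account may get, list, and watch each resource.
# Arguments: <namespace> <serviceaccount> <resource>...
rbac_report() {
  ns="$1"; sa="$2"; shift 2
  for res in "$@"; do
    for verb in get list watch; do
      # `kubectl auth can-i` prints "yes" or "no" and exits non-zero on "no".
      answer=$(kubectl auth can-i "$verb" "$res" \
        --as="system:serviceaccount:${ns}:${sa}" 2>/dev/null)
      printf '%s %s: %s\n' "$verb" "$res" "${answer:-unknown}"
    done
  done
}

# Usage:
#   rbac_report <namespace> <serviceaccount> \
#     protectiongroups.siterecovery.trilio.io leases.coordination.k8s.io
```

Any line ending in "no" points to a rule that the ClusterRole should grant but does not.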

Issue 4: Health probe failures cause repeated pod restarts

Symptom: Kubernetes events for the controller pod show Liveness probe failed or Readiness probe failed. The kubelet restarts the container after repeated liveness probe failures, even when no application crash is visible in the logs.

Likely cause: The controller's /healthz endpoint on port 8080 is not responding within the probe timeout. This can be caused by the controller process hanging on a blocking operation (for example, a stalled API call to a remote cluster), resource exhaustion, or a deadlock.
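
To confirm the symptom, the probe-failure events for a pod can be listed directly. The sketch below relies on the kubelet recording failed probes with the event reason Unhealthy; the helper name probe_failure_events is illustrative.

```shell
# List probe-failure events recorded for one pod, oldest first.
# Arguments: <namespace> <pod-name>
probe_failure_events() {
  # The kubelet records failed liveness and readiness probes with reason "Unhealthy".
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2,reason=Unhealthy" \
    --sort-by=.lastTimestamp
}

# Usage:
#   probe_failure_events <namespace> <pod-name>
```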

Fix:

While the pod is running, manually test the health endpoint:

kubectl port-forward deployment/<controller> 8080:8080 -n <namespace>

Then in a second terminal:

curl -v http://localhost:8080/healthz

If the endpoint returns non-200 or the connection is refused, check current controller logs for blocking operations or error loops:
```
kubectl logs -f deployment/<controller> -n <namespace>
```
If the controller is stuck waiting on a remote cluster (for example, a cluster that is unreachable), verify connectivity to the target cluster using the kubeconfig secret.

Restart the controller to clear a transient hang:

kubectl rollout restart deployment/<controller> -n <namespace>

If the problem recurs, check whether the probe timeout and failure threshold values in the controller deployment spec are appropriate for your environment's API server latency.
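
The probe settings can be read from the deployment spec as follows. This is a sketch that assumes the controller is the first container in the pod template; the helper name show_probes is illustrative. Recent kubectl versions print the probe objects as JSON.

```shell
# Print the liveness and readiness probe settings of a deployment.
# Arguments: <namespace> <deployment-name>
show_probes() {
  for probe in livenessProbe readinessProbe; do
    printf '%s: ' "$probe"
    # Assumption: the controller is container index 0 in the pod template.
    kubectl get deployment "$2" -n "$1" \
      -o jsonpath="{.spec.template.spec.containers[0].${probe}}"
    printf '\n'
  done
}

# Usage:
#   show_probes <namespace> <controller>
```

For reference, the Kubernetes defaults are timeoutSeconds 1, periodSeconds 10, and failureThreshold 3, which can be tight when health checks depend on a slow remote API server.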