Site Recovery for Kubernetes Virtual Machines
Guide

GitOps Patterns

Using ArgoCD or Flux with Site Recovery CRDs


Overview

This guide explains how to manage Trilio Site Recovery using GitOps tooling—specifically ArgoCD or Flux. Because Site Recovery exposes all of its operational primitives as Kubernetes Custom Resource Definitions (CRDs), your DR configuration, protection group definitions, and failover intent are all first-class Kubernetes objects that can be declared in Git, reviewed through pull requests, and reconciled automatically. Adopting a GitOps pattern gives you a full audit trail of every state change, enables drift detection on your DR configuration, and allows you to treat disaster recovery as code with the same rigor you apply to application deployments.


Prerequisites

Before you can adopt a GitOps pattern with Site Recovery, ensure the following are in place:

  • Kubernetes clusters: Primary and DR clusters deployed and joined to Site Recovery. A quorum cluster is required for LINSTOR deployments; it is optional but recommended for DRBD Operator deployments.
  • Site Recovery infrastructure: Ansible-deployed DRBD, LINSTOR or DRBD Operator, and controllers are running on all relevant clusters. The Site Manager UI is deployed via Helm to the quorum cluster.
  • Site Recovery CRDs installed: ProtectionGroup, Failover, and related CRDs must be applied to the primary and DR clusters before GitOps tooling can manage resources of those types.
  • ArgoCD or Flux: A functioning ArgoCD (v2.6+) or Flux v2 (v0.40+) installation. ArgoCD or the Flux controllers must have network access to the Kubernetes API servers of each cluster you intend to manage.
  • pgctl CLI: Available on any workstation used for out-of-band operations and multi-tenant deployment management. GitOps does not replace pgctl; it complements it for the declarative configuration layer.
  • RBAC: The GitOps controller's service account on each cluster must have get, list, watch, create, update, and patch permissions on all siterecovery.trilio.io API group resources. See the RBAC guidance in the Multi-Tenant Architecture reference for a ClusterRole template you can extend.
  • Git repository: A repository that your GitOps tooling is configured to watch, with branch protection and review policies appropriate for production DR configuration.

Installation

These steps configure ArgoCD or Flux to manage Site Recovery CRD-based resources. The same logical steps apply to both tools; tool-specific differences are noted inline.

Step 1: Verify CRDs are present on each cluster

Before GitOps tooling can reconcile Site Recovery resources, the CRDs themselves must exist. CRDs are installed by the Ansible infrastructure playbooks—confirm they are present on both primary and DR clusters:

# Check on primary cluster
kubectl --kubeconfig ~/.kube/config-primary \
  get crds | grep siterecovery.trilio.io

# Check on DR cluster
kubectl --kubeconfig ~/.kube/config-dr \
  get crds | grep siterecovery.trilio.io

Expected output includes at minimum:

protectiongroups.siterecovery.trilio.io
failovers.siterecovery.trilio.io
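
To make this check scriptable, for example as a pre-flight step before enabling sync, the comparison can be sketched as a small shell function. This is an illustrative sketch rather than part of Site Recovery: the function name missing_crds is hypothetical, and the required list contains only the two CRD names shown above.

```shell
# CRD names taken from the expected output above.
REQUIRED_CRDS="protectiongroups.siterecovery.trilio.io failovers.siterecovery.trilio.io"

# missing_crds: read `kubectl get crds -o name` output on stdin and print
# each required CRD that is absent. Prints nothing when all are present.
missing_crds() {
  present=$(cat)
  for crd in $REQUIRED_CRDS; do
    # `-o name` emits "customresourcedefinition.apiextensions.k8s.io/<name>"
    printf '%s\n' "$present" | grep -q "/${crd}\$" || printf '%s\n' "$crd"
  done
}

# Usage (run once per cluster):
#   kubectl --kubeconfig ~/.kube/config-primary get crds -o name | missing_crds
```

An empty result means both CRDs are installed on that cluster; any printed name identifies a CRD that still needs to be applied.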

Step 2: Create a dedicated service account for the GitOps controller

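Create a service account in the deployment namespace and bind it to a ClusterRole that grants the required permissions, on each cluster the GitOps tool will manage. The dr-<deployment-name> namespace must already exist on each cluster, because a ServiceAccount cannot be created in a missing namespace. Replace <deployment-name> with your deployment identifier (for example, prod):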

# gitops-sa.yaml — apply to primary and DR clusters
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gitops-siterecovery
  namespace: dr-<deployment-name>
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gitops-siterecovery-<deployment-name>
rules:
- apiGroups: ["siterecovery.trilio.io"]
  resources: ["protectiongroups", "protectiongroups/status",
              "failovers", "failovers/status"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gitops-siterecovery-<deployment-name>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gitops-siterecovery-<deployment-name>
subjects:
- kind: ServiceAccount
  name: gitops-siterecovery
  namespace: dr-<deployment-name>

Apply the manifest to both clusters:

kubectl --kubeconfig ~/.kube/config-primary apply -f gitops-sa.yaml
kubectl --kubeconfig ~/.kube/config-dr apply -f gitops-sa.yaml
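
After applying the manifest, it can be useful to confirm that the binding grants every verb listed in the ClusterRole. The following is a minimal sketch built on kubectl auth can-i with impersonation. It assumes your own user is allowed to impersonate service accounts. The function name check_gitops_rbac and the KUBECTL override variable are hypothetical helpers introduced here for illustration.

```shell
# check_gitops_rbac <kubeconfig> <deployment-name>
# Prints "<verb> <resource>: <answer>" for every verb/resource pair in the
# ClusterRole above. KUBECTL is a hypothetical override used for dry runs.
check_gitops_rbac() {
  kubeconfig=$1
  deployment=$2
  sa="system:serviceaccount:dr-${deployment}:gitops-siterecovery"
  for resource in protectiongroups failovers; do
    for verb in get list watch create update patch; do
      # `auth can-i` prints "yes" or "no"; an empty answer is shown as "error".
      answer=$(${KUBECTL:-kubectl} --kubeconfig "$kubeconfig" auth can-i "$verb" \
        "${resource}.siterecovery.trilio.io" \
        --as="$sa" -n "dr-${deployment}" 2>/dev/null)
      printf '%s %s: %s\n' "$verb" "$resource" "${answer:-error}"
    done
  done
}

# Usage:
#   check_gitops_rbac ~/.kube/config-primary prod
#   check_gitops_rbac ~/.kube/config-dr prod
```

Any line ending in no or error points to a verb or resource that the ClusterRole or ClusterRoleBinding does not yet cover.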

Step 3 (ArgoCD): Register clusters and create an Application

If your primary and DR clusters are not already registered with ArgoCD, add them:

argocd cluster add primary-context-name --name primary
argocd cluster add dr-context-name --name dr

Create an ArgoCD Application that points to the directory in your Git repository containing Site Recovery manifests:

# argocd-app-siterecovery.yaml — apply to the ArgoCD cluster
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: siterecovery-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/infra/site-recovery.git
    targetRevision: main
    path: clusters/prod/protection-groups
  destination:
    server: https://<primary-cluster-api>
    namespace: dr-prod
  syncPolicy:
    automated:
      prune: false        # Do not auto-delete ProtectionGroups
      selfHeal: true      # Re-apply on drift
    syncOptions:
    - CreateNamespace=true

Important: Set prune: false for ProtectionGroup and Failover resources. Automatically deleting a ProtectionGroup removes DR protection from the VMs it covers. Require human review for deletions.

kubectl apply -f argocd-app-siterecovery.yaml

Step 3 (Flux): Create a GitRepository and Kustomization

# Add the Git source
flux create source git site-recovery \
  --url=https://git.example.com/infra/site-recovery.git \
  --branch=main \
  --interval=1m

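# Create a Kustomization for the primary cluster. As written, this applies to
# the cluster Flux runs on; for a remote cluster, add --kubeconfig-secret-ref=<secret-name>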
flux create kustomization siterecovery-prod \
  --source=site-recovery \
  --path="./clusters/prod/protection-groups" \
  --prune=false \
  --interval=5m \
  --target-namespace=dr-prod

Step 4: Commit your first ProtectionGroup manifest to the repository

Add a ProtectionGroup manifest to the path your GitOps tool is watching:

mkdir -p clusters/prod/protection-groups
cat > clusters/prod/protection-groups/production-pg.yaml << 'EOF'
apiVersion: siterecovery.trilio.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: production-protection-group
  namespace: dr-prod
spec:
  desiredState: running
  virtualMachines:
    - name: prod-vm-1
    - name: prod-vm-2
EOF

git add clusters/prod/protection-groups/production-pg.yaml
git commit -m "feat: add production protection group"
git push origin main

ArgoCD or Flux will detect the change within its configured polling interval and apply the manifest to the primary cluster.

Step 5: Verify reconciliation

# ArgoCD
argocd app sync siterecovery-prod
argocd app get siterecovery-prod

# Flux
flux reconcile kustomization siterecovery-prod --with-source
flux get kustomizations siterecovery-prod

# Confirm the ProtectionGroup exists on the primary cluster
kubectl --kubeconfig ~/.kube/config-primary \
  get protectiongroup production-protection-group -n dr-prod

Configuration

The following CRD fields and GitOps tool settings are most important to understand when managing Site Recovery declaratively.

ProtectionGroup spec fields

Field | Type | Default | Valid Values | Effect
spec.desiredState | string | running | running, stopped | Drives the Protection Group Controller to start or stop all VMs in the group atomically. Changing this field in Git triggers reconciliation on the cluster.
spec.virtualMachines | list | - | List of {name: <vm-name>} objects | Defines which VMs belong to this protection group. The controller manages lifecycle for all listed VMs together.
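
Because only running and stopped are valid, a typo in spec.desiredState is best caught before the change is merged. Below is a minimal sketch of such a pre-merge check. It assumes the value is written unquoted on a single line, as in the manifests in this guide, and the function names pg_desired_state and validate_pg are hypothetical.

```shell
# pg_desired_state <manifest>: print the value of the first desiredState key.
pg_desired_state() {
  # Capture the bare word after "desiredState:" and drop any trailing comment.
  sed -n 's/^[[:space:]]*desiredState:[[:space:]]*\([A-Za-z]*\).*$/\1/p' "$1" | head -n 1
}

# validate_pg <manifest>: succeed only if desiredState is running or stopped.
validate_pg() {
  case "$(pg_desired_state "$1")" in
    running|stopped) return 0 ;;
    *) echo "invalid or missing spec.desiredState in $1" >&2; return 1 ;;
  esac
}

# Usage, for example in a CI job or pre-commit hook:
#   validate_pg clusters/prod/protection-groups/production-pg.yaml
```

This text-based check is deliberately simple. A quoted value or an unusual YAML layout would be reported as invalid, so treat a failure as a prompt to inspect the file rather than as proof of an error.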

ProtectionGroup status fields (read-only)

Field | Type | Possible Values | Meaning
status.currentState | string | running, stopped, mixed, unknown | The Protection Group Controller updates this to reflect the actual observed state. In a correctly reconciled group, currentState matches spec.desiredState. A persistent mixed state indicates that one or more VMs failed to transition.

Failover CR spec fields

The Failover CR expresses declarative failover intent. You typically do not commit Failover CRs to Git for automated reconciliation—doing so would trigger a failover on every sync. Instead, use Failover CRs for controlled, reviewed operations (see the Usage section).

Field | Type | Effect
spec.protectionGroup | string | Name of the ProtectionGroup to fail over.
spec.targetCluster | string | Identifier of the cluster that should become active after failover.

GitOps tool settings that affect Site Recovery

Pruning: Disable automatic pruning (syncPolicy.automated.prune: false in an ArgoCD Application; spec.prune: false in a Flux Kustomization) for ProtectionGroup resources. A prune event deletes the CR, which stops the controller from managing those VMs.

Self-healing / drift correction: Enable self-healing so that if someone manually patches a ProtectionGroup's spec.desiredState outside of Git, the GitOps tool reverts the change to the Git-declared state. This enforces Git as the authoritative source for VM desired state.

Sync interval: For Protection Groups, a 1–5 minute sync interval is appropriate. Shorter intervals increase API server load without meaningful benefit, because the declared desiredState changes only when a commit is merged.

Namespace: Scope each ArgoCD Application or Flux Kustomization to the deployment namespace (dr-<deployment-name>). This aligns with the multi-tenant isolation model where each deployment has its own namespace on the quorum cluster and workload clusters.


Usage

Managing VM desired state through Git

The most direct GitOps use case is controlling spec.desiredState on a ProtectionGroup. To stop all VMs in a group (for example, for planned maintenance), open a pull request that changes the field:

# clusters/prod/protection-groups/production-pg.yaml
apiVersion: siterecovery.trilio.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: production-protection-group
  namespace: dr-prod
spec:
  desiredState: stopped   # Changed from: running
  virtualMachines:
    - name: prod-vm-1
    - name: prod-vm-2

Once the pull request is reviewed and merged, ArgoCD or Flux applies the change. The Protection Group Controller detects the spec.desiredState change and stops all listed VMs atomically. The status.currentState field transitions from running to stopped. To restart the VMs, revert the field to running through another pull request.

This pattern gives you a peer-reviewed, auditable record in Git of every intentional VM state change.
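
The edit itself can be scripted so that the pull request contains only the intended one-line change. The sketch below is one possible helper, not a Site Recovery tool: the function name set_desired_state and the branch name in the usage comment are hypothetical, and the sed expression assumes a single desiredState line per manifest.

```shell
# set_desired_state <manifest> <running|stopped>
# Rewrites the desiredState line in place, keeping its indentation.
# Any trailing comment on that line is dropped.
set_desired_state() {
  target=$1
  state=$2
  case "$state" in
    running|stopped) ;;
    *) echo "desiredState must be running or stopped" >&2; return 1 ;;
  esac
  scratch=$(mktemp) || return 1
  # Write to a scratch file first, then copy back to preserve file permissions.
  sed "s/^\([[:space:]]*desiredState:\).*\$/\1 ${state}/" "$target" > "$scratch" &&
    cat "$scratch" > "$target"
  rc=$?
  rm -f "$scratch"
  return "$rc"
}

# Typical use on a feature branch:
#   git checkout -b maintenance/stop-prod
#   set_desired_state clusters/prod/protection-groups/production-pg.yaml stopped
#   git commit -am "chore: stop production protection group for maintenance"
#   git push origin maintenance/stop-prod
```

Rejecting values other than running and stopped in the helper keeps an invalid state from ever reaching review.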

Adding or removing VMs from a Protection Group

To add a VM to an existing group, add it to spec.virtualMachines in the manifest and merge the change:

spec:
  desiredState: running
  virtualMachines:
    - name: prod-vm-1
    - name: prod-vm-2
    - name: prod-vm-3   # Newly added

The controller reconciles the group and begins managing prod-vm-3 alongside the existing VMs. To remove a VM, delete it from the list and merge.

Triggering failover with a reviewed Failover CR

Do not store Failover CRs in the continuously reconciled path: a failover is not an idempotent operation, so re-applying the same CR on every sync cycle is unsafe. Instead, commit failover intent as a reviewable artifact in a separate path that is watched by a manually synced ArgoCD Application or by a Flux Kustomization with suspend: true:

# clusters/prod/failovers/failover-to-dr-2024-01-15.yaml
apiVersion: siterecovery.trilio.io/v1alpha1
kind: Failover
metadata:
  name: failover-to-dr-20240115-001
  namespace: dr-prod
spec:
  protectionGroup: production-protection-group
  targetCluster: cluster2

Commit this file, open a pull request, obtain approval, and then manually trigger the sync. The Failover Controller watches for the new CR, orchestrates the cross-cluster failover sequence, and updates status on the CR as it progresses. The CR persists after completion, providing a permanent audit record.
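
For ArgoCD, one way to get a manually synced path is an Application that has no syncPolicy.automated block, since ArgoCD then syncs only on request. The sketch below writes such a manifest. The Application name siterecovery-prod-failovers and the file location are illustrative choices, and the repository URL and server placeholder are copied from the Installation section.

```shell
mkdir -p argocd/applications
cat > argocd/applications/siterecovery-prod-failovers.yaml << 'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: siterecovery-prod-failovers
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/infra/site-recovery.git
    targetRevision: main
    path: clusters/prod/failovers
  destination:
    server: https://<primary-cluster-api>
    namespace: dr-prod
  # No automated sync policy here: ArgoCD syncs this Application only on request.
EOF
```

Once the Failover pull request is merged, running argocd app sync siterecovery-prod-failovers performs the one-off sync that creates the CR.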

For unplanned failovers where the review workflow is not feasible, use pgctl directly or the intelligent-pg-failover-crd.sh script, both of which create the Failover CR programmatically.

Monitoring reconciliation status

After merging a change, verify that the GitOps tool applied it and that the Protection Group Controller converged:

# Check GitOps sync status
# ArgoCD
argocd app get siterecovery-prod

# Flux
flux get kustomizations siterecovery-prod

# Check ProtectionGroup desired vs actual state
kubectl --kubeconfig ~/.kube/config-primary \
  get protectiongroup production-protection-group \
  -n dr-prod \
  -o jsonpath='desiredState={.spec.desiredState} currentState={.status.currentState}'

# Watch for convergence
kubectl --kubeconfig ~/.kube/config-primary \
  get protectiongroup production-protection-group \
  -n dr-prod -w
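
In automation, an open-ended watch is usually replaced by a bounded wait. The following sketch polls status.currentState until it matches the expected value or a timeout expires. The function name wait_for_pg_state and the KUBECTL and POLL_INTERVAL variables are hypothetical, and the resource name and namespace are the ones used in this guide.

```shell
# wait_for_pg_state <expected-state> [timeout-seconds]
# Polls the ProtectionGroup until status.currentState equals the expected
# value. Returns 0 on convergence and 1 on timeout.
wait_for_pg_state() {
  expected=$1
  timeout=${2:-300}
  interval=${POLL_INTERVAL:-5}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    current=$(${KUBECTL:-kubectl} --kubeconfig "$HOME/.kube/config-primary" \
      get protectiongroup production-protection-group -n dr-prod \
      -o jsonpath='{.status.currentState}' 2>/dev/null)
    if [ "$current" = "$expected" ]; then
      echo "converged: $current"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out; last observed state: ${current:-none}" >&2
  return 1
}

# Usage: wait up to 10 minutes for the group to stop.
#   wait_for_pg_state stopped 600
```

A timeout with a last observed state of mixed is a signal to follow the steps for a mixed currentState in the Troubleshooting section.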

Multi-tenant repository layout

For environments with multiple DR deployments, organize your repository so that each deployment has its own path. This lets you create separate ArgoCD Applications or Flux Kustomizations per deployment, honoring the namespace-based isolation model:

repo/
  clusters/
    prod/
      protection-groups/
        production-pg.yaml
    staging/
      protection-groups/
        staging-pg.yaml
    dev/
      protection-groups/
        dev-pg.yaml

Each path maps to a dedicated namespace (dr-prod, dr-staging, dr-dev) and is reconciled independently.
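
Since every deployment follows the same path and namespace convention, the per-deployment ArgoCD Applications can be generated from a single template. The sketch below shows one way to do that. The function name render_app is hypothetical, and the manifest fields mirror the Application shown in the Installation section.

```shell
# render_app <deployment-name>: print an ArgoCD Application for one deployment.
render_app() {
  cat << EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: siterecovery-$1
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/infra/site-recovery.git
    targetRevision: main
    path: clusters/$1/protection-groups
  destination:
    server: https://<primary-cluster-api>
    namespace: dr-$1
  syncPolicy:
    automated:
      prune: false
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
EOF
}

# Example: write one manifest per deployment.
#   mkdir -p argocd/applications
#   for d in prod staging dev; do
#     render_app "$d" > "argocd/applications/siterecovery-$d.yaml"
#   done
```

Generating the manifests keeps the prune and selfHeal settings identical across deployments, so a new deployment cannot silently start with pruning enabled.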


Examples

Example 1: Declare a protection group with two VMs

This is the foundational manifest. Committing this file to Git and merging it causes ArgoCD or Flux to apply it to the primary cluster, where the Protection Group Controller takes ownership of the listed VMs.

# clusters/prod/protection-groups/production-pg.yaml
apiVersion: siterecovery.trilio.io/v1alpha1
kind: ProtectionGroup
metadata:
  name: production-protection-group
  namespace: dr-prod
spec:
  desiredState: running
  virtualMachines:
    - name: prod-vm-1
    - name: prod-vm-2

After the GitOps tool applies this manifest, confirm the controller has reconciled it:

kubectl --kubeconfig ~/.kube/config-primary \
  get protectiongroup production-protection-group \
  -n dr-prod -o yaml

Expected status section:

status:
  currentState: running

Example 2: Controlled VM shutdown via pull request

You need to stop all VMs in the production group for a storage maintenance window. You open a pull request changing only spec.desiredState:

spec:
  desiredState: stopped   # was: running
  virtualMachines:
    - name: prod-vm-1
    - name: prod-vm-2

After merge and sync, poll until convergence:

kubectl --kubeconfig ~/.kube/config-primary \
  get protectiongroup production-protection-group \
  -n dr-prod \
  -o jsonpath='{.status.currentState}'

Expected output:

stopped

Verify no VMIs are running:

kubectl --kubeconfig ~/.kube/config-primary get vmi -n dr-prod

Expected output:

No resources found in dr-prod namespace.

Example 3: Declarative failover CR committed for review

You want DR failover to go through the same pull-request review process as other infrastructure changes. Commit the Failover CR to a dedicated path that is synced only on manual trigger:

# clusters/prod/failovers/failover-dr-20240115.yaml
apiVersion: siterecovery.trilio.io/v1alpha1
kind: Failover
metadata:
  name: failover-dr-20240115-001
  namespace: dr-prod
spec:
  protectionGroup: production-protection-group
  targetCluster: cluster2

After the PR is approved and the sync is manually triggered, monitor the Failover CR status:

kubectl --kubeconfig ~/.kube/config-primary \
  get failover failover-dr-20240115-001 \
  -n dr-prod -o jsonpath='{.status}' | jq .

Expected output once the Failover Controller has completed orchestration:

{
  "phase": "Succeeded",
  "targetCluster": "cluster2",
  "completedAt": "2024-01-15T14:32:00Z"
}

Verify VMs are running on the DR cluster:

kubectl --kubeconfig ~/.kube/config-dr get vm,vmi -n dr-prod

Example 4: ArgoCD Application manifest for multi-tenant deployment

This example shows how to create a separate ArgoCD Application per deployment, respecting the namespace isolation model:

# argocd/applications/siterecovery-prod.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: siterecovery-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/infra/site-recovery.git
    targetRevision: main
    path: clusters/prod/protection-groups
  destination:
    server: https://<primary-cluster-api>
    namespace: dr-prod
  syncPolicy:
    automated:
      prune: false
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: siterecovery-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/infra/site-recovery.git
    targetRevision: main
    path: clusters/staging/protection-groups
  destination:
    server: https://<primary-cluster-api>
    namespace: dr-staging
  syncPolicy:
    automated:
      prune: false
      selfHeal: true
    syncOptions:
    - CreateNamespace=true

Apply both Applications:

kubectl apply -f argocd/applications/siterecovery-prod.yaml
kubectl apply -f argocd/applications/siterecovery-staging.yaml

Expected output:

application.argoproj.io/siterecovery-prod created
application.argoproj.io/siterecovery-staging created

Troubleshooting

Issue 1: ProtectionGroup currentState remains unknown after sync

Symptom: The GitOps tool reports the Application or Kustomization as Synced, but kubectl get protectiongroup shows status.currentState: unknown.

Likely cause: The Protection Group Controller is not running on the target cluster, or it has not been restarted after the CRD was updated.

Fix:

# Confirm the controller pod is running
kubectl --kubeconfig ~/.kube/config-primary \
  get pods -l app=protection-group-controller -n <controller-namespace>

# Check controller logs for errors
kubectl --kubeconfig ~/.kube/config-primary \
  logs -f deployment/protection-group-controller \
  -n <controller-namespace>

# If the controller is running but not reconciling, restart it
kubectl --kubeconfig ~/.kube/config-primary \
  rollout restart deployment/protection-group-controller \
  -n <controller-namespace>

Issue 2: ProtectionGroup currentState is mixed

Symptom: After a desiredState change is applied, status.currentState is mixed rather than the expected running or stopped.

Likely cause: One or more VMs listed in spec.virtualMachines could not be started or stopped. This can occur if a VM does not exist on the cluster, if a PVC binding has not resolved, or if the VM is in an error state.

Fix:

# Check individual VM and VMI statuses
kubectl --kubeconfig ~/.kube/config-primary get vm,vmi -n dr-prod

# Review Protection Group Controller logs for per-VM results
kubectl --kubeconfig ~/.kube/config-primary \
  logs deployment/protection-group-controller \
  -n <controller-namespace> | grep reconcile_vm_state

# Inspect events on the ProtectionGroup
kubectl --kubeconfig ~/.kube/config-primary \
  describe protectiongroup production-protection-group -n dr-prod

Resolve the individual VM issue (missing PVC, scheduler taint, etc.), then confirm the controller re-reconciles and currentState converges.
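
To narrow down which VMs are holding the group in a mixed state, the VM list can be filtered by status. This is a rough sketch that assumes the VMs are KubeVirt VirtualMachine objects exposing .status.printableStatus. The function name vms_not_in_state is hypothetical.

```shell
# vms_not_in_state <expected-status>
# Reads "<name> <status>" lines on stdin and prints the VMs whose status
# differs from the expected one.
vms_not_in_state() {
  awk -v want="$1" 'NF >= 2 && $2 != want { print $1 " (" $2 ")" }'
}

# Usage, assuming KubeVirt's printableStatus field:
#   kubectl --kubeconfig ~/.kube/config-primary get vm -n dr-prod --no-headers \
#     -o custom-columns=NAME:.metadata.name,STATUS:.status.printableStatus \
#     | vms_not_in_state Running
```

Each printed VM is a candidate for the per-VM investigation described above. The list also includes VMs in the group that are still transitioning.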


Issue 3: GitOps tool reports OutOfSync immediately after sync

Symptom: ArgoCD or Flux repeatedly marks the ProtectionGroup resource as OutOfSync even though the manifest in Git has not changed.

Likely cause: Most commonly, the tool is comparing the status subresource, which the Protection Group Controller updates as it reconciles and which should be excluded from the comparison. Less often, the controller or a webhook mutates spec fields after the GitOps tool applies the resource, and the tool detects that change as drift.

Fix (ArgoCD): Add an ignoreDifferences rule for the status subresource:

# In your ArgoCD Application spec
spec:
  ignoreDifferences:
  - group: siterecovery.trilio.io
    kind: ProtectionGroup
    jsonPointers:
    - /status

Fix (Flux): Status subresources are excluded from Flux drift detection by default. If the issue persists, check whether a controller or webhook is mutating spec fields and identify which field is changing:

kubectl --kubeconfig ~/.kube/config-primary \
  get protectiongroup production-protection-group \
  -n dr-prod -o yaml

Compare the live resource to your Git manifest to identify the diverging field.


Issue 4: Failover CR committed to Git triggers failover on every sync

Symptom: Every time the GitOps tool syncs, it re-applies the Failover CR and the Failover Controller initiates a new failover operation.

Likely cause: The Failover CR is stored in the continuously-reconciled path and the GitOps tool is re-creating or re-patching it, which the Failover Controller interprets as a new failover request.

Fix: Move Failover CRs out of the automatically-reconciled path. Use a separate ArgoCD Application with automated sync disabled, or a Flux Kustomization with suspend: true. Trigger the sync manually only when a failover is intended. After the failover completes, archive or rename the manifest so it is not re-applied.

Alternatively, use pgctl or the intelligent-pg-failover-crd.sh script to create Failover CRs imperatively when immediate failover is required, keeping the GitOps path reserved for ProtectionGroup configuration.


Issue 5: GitOps controller lacks permissions to manage Site Recovery CRDs

Symptom: ArgoCD or Flux logs show 403 Forbidden errors when attempting to apply ProtectionGroup or Failover resources.

Likely cause: The ClusterRole bound to the GitOps service account does not include permissions on the siterecovery.trilio.io API group.

Fix:

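# Check whether the GitOps service account can create ProtectionGroups.
# The ArgoCD application controller is shown; substitute the account your tool uses.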
kubectl --kubeconfig ~/.kube/config-primary \
  auth can-i create protectiongroups \
  --as=system:serviceaccount:argocd:argocd-application-controller

# If the answer is 'no', apply the RBAC manifest from the Installation section
# and confirm the ClusterRoleBinding references the correct service account name
kubectl --kubeconfig ~/.kube/config-primary \
  get clusterrolebinding gitops-siterecovery-prod -o yaml