Replication Issues
FlashArray connectivity errors, sync degraded, RPO violations, split-brain
This page helps you diagnose and resolve replication issues in Trilio Site Recovery for OpenStack, covering FlashArray connectivity errors, degraded sync states, RPO violations, and split-brain conditions. Replication health is the foundation of your DR readiness: a degraded or broken replication path means that when you need to fail over, your data may be stale, incomplete, or unavailable. Because Trilio Site Recovery relies on Pure Storage FlashArray replication mapped through Cinder Consistency Groups and Protection Groups, failures can originate at the storage layer, the OpenStack layer, the metadata sync layer, or across combinations of all three. Use this guide to identify which layer is failing, understand the blast radius, and restore replication health before your RPO window is breached.
Before troubleshooting replication issues, confirm the following:
- You have access to both the primary and secondary OpenStack clouds and can authenticate to each site's Keystone endpoint independently
- You have the `protectorclient` OSC plugin installed and `clouds.yaml` configured with entries for both sites
- You have credentials with at least the `member` role on the tenant owning the affected Protection Group, and `admin` access to the `service` project for site-level diagnostics
- You know the Protection Group name or UUID and the associated Cinder Consistency Group IDs on both sites
- You have network access to both FlashArray management IPs from your workstation or jump host
- You have the FlashArray API tokens stored in your replication policy (retrievable via `openstack protector protection-group policy-show <pg>`)
- Both `protector-api` and `protector-engine` services are running on each site; check with `systemctl status protector-api protector-engine` on each controller
- Relevant log files are accessible: `/var/log/protector/protector-api.log` and `/var/log/protector/protector-engine.log` on both sites
No additional software installation is required to troubleshoot replication issues. All diagnostic commands are available through the `protectorclient` OSC plugin and standard OpenStack CLI tools that you installed during initial deployment.
If the OSC plugin is missing on your workstation, reinstall it:
pip install python-protectorclient
Verify the plugin is loaded:
openstack protector --help
If you need direct FlashArray API access for low-level replication diagnostics, install the Pure Storage Python SDK:
pip install py-pure-client
This is optional but useful when you need to inspect the FlashArray Protection Group or Pod state independently of OpenStack.
The following configuration parameters directly affect replication behavior and are the first places to check when diagnosing issues.
Replication Policy (openstack protector protection-group policy-show <pg>)
| Parameter | Description | Effect on behavior |
|---|---|---|
| `primary_fa_url` | HTTPS management URL of the primary FlashArray | An incorrect value causes all primary-side storage API calls to fail |
| `secondary_fa_url` | HTTPS management URL of the secondary FlashArray | An incorrect value causes storage calls on the secondary to fail; failover will not complete |
| `primary_fa_api_token` / `secondary_fa_api_token` | API tokens for each array (stored encrypted) | Expired or rotated tokens produce 401 errors from the FlashArray API |
| `pure_pg_name` | Name of the Pure Storage Protection Group on the primary array | Must match exactly what exists on the array; a mismatch causes sync and snapshot operations to fail |
| `replication_interval` | Seconds between async replication snapshots | Lower values reduce RPO but increase array load; must be ≥ the minimum interval configured on the FlashArray replication schedule |
| `rpo_minutes` | Recovery Point Objective threshold in minutes | Used by Protector to determine whether the replication lag constitutes an RPO violation; does not change array behavior |
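The interactions between these parameters can be sanity-checked before a policy is applied. The following is an illustrative sketch, not part of protectorclient: field names mirror the table above, and the array-minimum value passed in is an assumed example (check your FlashArray replication schedule for the real minimum).

```python
# Hypothetical pre-flight validation of a replication policy dict.
# Field names follow the policy table above; the 300 s array minimum
# below is an assumed example value, not a documented constant.

def validate_policy(policy: dict, array_min_interval: int) -> list[str]:
    """Return a list of problems found in a replication policy dict."""
    problems = []
    for key in ("primary_fa_url", "secondary_fa_url", "pure_pg_name"):
        if not policy.get(key):
            problems.append(f"{key} is missing or empty")
    for key in ("primary_fa_url", "secondary_fa_url"):
        url = policy.get(key, "")
        if url and not url.startswith("https://"):
            problems.append(f"{key} should be an HTTPS management URL")
    interval = policy.get("replication_interval", 0)
    if interval < array_min_interval:
        problems.append(
            f"replication_interval {interval}s is below the array minimum "
            f"of {array_min_interval}s"
        )
    # The snapshot interval must fit inside the RPO window, or the RPO
    # can never be met even when replication is healthy.
    if interval > policy.get("rpo_minutes", 0) * 60:
        problems.append("replication_interval exceeds the rpo_minutes window")
    return problems

issues = validate_policy(
    {"primary_fa_url": "https://flasharray-a.example.com",
     "secondary_fa_url": "http://flasharray-b.example.com",  # wrong scheme
     "pure_pg_name": "pg-prod-web-app",
     "replication_interval": 60,
     "rpo_minutes": 15},
    array_min_interval=300,
)
```

Catching a scheme typo or an interval below the array minimum this way is cheaper than discovering it mid-failover.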
Cinder Volume Type extra specs (on both sites)
For a volume to be eligible for protection and replication, its Cinder volume type must carry both of these extra specs:
replication_enabled='<is> True'
replication_type='<in> async' # or '<in> sync'
If either property is missing or mismatched between sites, the protector-engine will reject the volume when attempting to add it to a Consistency Group, or will fail during failover when it tries to manage the replicated volume into Cinder on the secondary site.
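The eligibility rule above reduces to a simple check over a volume type's extra specs (as read, for example, from `openstack volume type show -f json`). This sketch is illustrative; the exact rejection messages are not protector-engine's own:

```python
# Hypothetical eligibility check mirroring the extra-spec requirements
# listed above. The property strings come from this section; the
# function itself is illustrative, not part of protectorclient.

REQUIRED_REPLICATION = "<is> True"
VALID_TYPES = {"<in> async", "<in> sync"}

def volume_type_eligible(extra_specs: dict) -> tuple:
    """Return (eligible, reason) for a volume type's extra specs."""
    if extra_specs.get("replication_enabled") != REQUIRED_REPLICATION:
        return False, "replication_enabled='<is> True' is missing"
    if extra_specs.get("replication_type") not in VALID_TYPES:
        return False, "replication_type must be '<in> async' or '<in> sync'"
    return True, "eligible"
```

Running every volume type through a check like this on both sites before creating a Protection Group avoids failover-time surprises.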
Metadata sync behavior
Protector enforces strict metadata consistency: any modification to a Protection Group (adding/removing members, policy changes) is blocked if the remote site is unreachable. This is by design: it prevents the two sites from diverging into a split-brain state. The sync status is visible in:
openstack protector protection-group sync-status <pg-name>
The `remote_sync_status` field will be one of `synced`, `failed`, or `unreachable`. A version number tracks every change; both sites must be at the same version before modifications are allowed.
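The gating rule can be expressed compactly. This is a sketch of the behavior described above (field names follow the sync-status output), not Protector's actual implementation:

```python
# Illustrative decision logic for the modification gate described above:
# changes are allowed only when the remote site is reachable and both
# sites are at the same metadata version.
from typing import Optional

def modification_allowed(local_version: int,
                         remote_version: Optional[int],
                         remote_sync_status: str) -> tuple:
    """Return (allowed, reason) for a proposed Protection Group change."""
    if remote_sync_status == "unreachable":
        return False, "remote site unreachable - modifications blocked"
    if remote_sync_status == "failed" or remote_version != local_version:
        return False, (f"version mismatch (local {local_version}, "
                       f"remote {remote_version}) - force a sync first")
    return True, "sites in sync - modification permitted"
```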
Checking overall replication health
Start every investigation by pulling the current state of the Protection Group and its metadata sync status:
# Show Protection Group state
openstack protector protection-group show <pg-name>
# Check metadata sync between sites
openstack protector protection-group sync-status <pg-name>
# Show the Consistency Group and its volumes
openstack protector consistency-group show <pg-name>
The sync-status output shows you the local version, the remote version, when the last successful sync occurred, and whether the remote site is reachable. A version mismatch combined with a failed or unreachable sync status is your primary indicator that the sites have diverged.
Forcing a metadata re-sync
Once the remote site is reachable again, you can push the current local metadata to the remote site:
openstack protector protection-group sync-force <pg-name>
This is safe to run: it pushes the local authoritative copy to the remote and updates the version number on both sides. It does not affect running VMs or storage replication.
Forcing a Consistency Group storage sync
To trigger an immediate async replication snapshot outside the normal schedule:
openstack protector consistency-group sync <pg-name>
Use this after resolving a connectivity gap to get back within your RPO window before it matters operationally.
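Whether you are back inside the RPO window is a matter of comparing the last successful snapshot timestamp against `rpo_minutes`. A minimal sketch of that arithmetic, assuming ISO-8601 timestamps like those shown in this guide's sync-status examples:

```python
# Illustrative RPO-lag computation: not a protectorclient API, just the
# arithmetic behind the rpo_minutes threshold described in this guide.
from datetime import datetime, timezone
from typing import Optional

def rpo_violated(last_snapshot_iso: str, rpo_minutes: int,
                 now: Optional[datetime] = None) -> tuple:
    """Return (violated, lag_minutes) for the given last snapshot time."""
    last = datetime.fromisoformat(last_snapshot_iso.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    lag_minutes = (now - last).total_seconds() / 60
    return lag_minutes > rpo_minutes, lag_minutes

# A snapshot from 14:30:05 checked at 15:00 is ~29.9 minutes old,
# which violates a 15-minute RPO.
violated, lag = rpo_violated(
    "2025-11-03T14:30:05Z", rpo_minutes=15,
    now=datetime(2025, 11, 3, 15, 0, tzinfo=timezone.utc))
```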
Validating site reachability
If you suspect a site-level connectivity problem rather than a storage-layer problem:
openstack protector site validate site-a
openstack protector site validate site-b
This checks that the Keystone, Nova, Cinder, and Neutron endpoints on each site are reachable from the Protector service and that the service account credentials are valid.
Example 1: Diagnosing a sync-degraded Protection Group
Symptom: `openstack protector protection-group show prod-web-app` returns `status: error` and you are unsure why.
openstack protector protection-group sync-status prod-web-app
Expected output when out of sync:
Sync Status: ❌ OUT OF SYNC

Local Metadata:
  Version: 8
  Current Site: Site A
  Last Modified: 2025-11-03T14:35:00Z

Remote Sync:
  Status: FAILED
  Remote Version: 7
  Last Sync: 2025-11-03T14:30:05Z (5 minutes ago)
  Error: Connection timeout

Validation:
  ❌ Version mismatch (local: 8, remote: 7)
  ❌ Sync status is 'failed'
  ⚠️  Remote site may be unreachable

Action Required:
  1. Check remote site connectivity
  2. Force sync once remote site is available
Once the remote site is back:
openstack protector protection-group sync-force prod-web-app
Expected output:
Force Sync Initiated...
Checking remote site connectivity...
✅ Site B is reachable

Syncing metadata (version 8)...
  Gathering current metadata... ✓
  Calculating checksum... ✓
  Pushing to Site B... ✓

Remote Site Response:
  Status: success
  Version: 8
  Duration: 450ms

✅ Sync completed successfully
Both sites now at version 8
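The "Calculating checksum" step above implies a content comparison between sites. Illustrative only, and not necessarily how Protector computes it: a stable checksum over canonicalized metadata lets both sides compare content without shipping the full document.

```python
# Hypothetical sketch of a metadata checksum: hash a canonical JSON
# rendering so key order does not affect the result. The real protector
# implementation may differ.
import hashlib
import json

def metadata_checksum(metadata: dict) -> str:
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

local = {"version": 8, "members": ["vm-1", "vm-2"], "site": "site-a"}
remote = {"site": "site-a", "version": 8, "members": ["vm-1", "vm-2"]}
# Key order does not matter once canonicalized:
assert metadata_checksum(local) == metadata_checksum(remote)
```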
Example 2: Identifying an RPO violation
If you have configured `rpo_minutes: 15` in your replication policy and suspect replication has fallen behind:
# Check the consistency group for replication lag indicators
openstack protector consistency-group show prod-web-app
# Force an immediate async snapshot to recover
openstack protector consistency-group sync prod-web-app
Then review protector-engine logs on the primary site for snapshot scheduling errors:
grep -i 'rpo\|snapshot\|replication' /var/log/protector/protector-engine.log | tail -50
Look for lines containing `RPO violation`, `snapshot failed`, or `pure_pg_name not found`; each points to a different root cause described in the Troubleshooting section.
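If you triage these logs often, the phrase-to-cause mapping can be scripted. The phrases matched here are the ones this guide tells you to look for; the mapping itself is an illustrative helper, not a Trilio tool:

```python
# Hypothetical log-line triage for the grep output above. Each matched
# phrase maps to the corresponding root cause in the Troubleshooting
# section of this guide.

CAUSES = [
    ("rpo violation", "replication lag exceeded rpo_minutes - force a CG sync"),
    ("snapshot failed", "array-side snapshot error - check FlashArray health"),
    ("pure_pg_name not found", "policy/array Protection Group name mismatch"),
]

def classify_log_line(line: str) -> str:
    lowered = line.lower()
    for phrase, diagnosis in CAUSES:
        if phrase in lowered:
            return diagnosis
    return "no known replication error pattern"
```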
Example 3: Detecting and resolving a split-brain condition
Split-brain occurs when both sites believe they are the authoritative source of truth, typically after an unplanned failover where Site A recovered independently before metadata sync was re-established.
Check for version divergence:
# Authenticate to Site A
export OS_CLOUD=site-a
openstack protector protection-group sync-status prod-web-app
# Note the local version and current_primary_site
# Authenticate to Site B
export OS_CLOUD=site-b
openstack protector protection-group sync-status prod-web-app
# Compare local version and current_primary_site
If both sites report themselves as `current_primary_site` with divergent version numbers, you have a split-brain condition. Do not attempt modifications on either site until you resolve it. Identify which site has the most recent successful DR operation:
# On whichever site you believe is authoritative
export OS_CLOUD=site-b
openstack protector operation list --protection-group prod-web-app
Once you have identified the authoritative site, force-sync from that site:
export OS_CLOUD=site-b
openstack protector protection-group sync-force prod-web-app
This overwrites the remote site's metadata with the local authoritative copy. Verify both sides converge to the same version before resuming operations.
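The divergence test in this example can be sketched as a small check over each site's sync-status fields. This is illustrative: `site_name` (each site's own identity) is an assumed helper field, not part of the documented output.

```python
# Hypothetical split-brain detector over per-site sync-status data.
# "site_name" is an assumed field identifying the site the status was
# collected from; the other keys follow the output fields above.

def detect_split_brain(site_a: dict, site_b: dict) -> bool:
    """True when both sites claim to be primary with divergent versions."""
    both_primary = (site_a["current_primary_site"] == site_a["site_name"] and
                    site_b["current_primary_site"] == site_b["site_name"])
    diverged = site_a["version"] != site_b["version"]
    return both_primary and diverged
```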
Example 4: Verifying FlashArray connectivity manually
If Protector storage operations fail but you are unsure whether the array itself is reachable:
# Retrieve the FlashArray URL and token from the policy
openstack protector protection-group policy-show prod-web-app
# Test array connectivity with curl (replace with your array URL)
curl -sk https://flasharray-a.example.com/api/2.0/array \
-H "x-auth-token: T-12345678-abcd-..." | python3 -m json.tool
A successful response returns array metadata including the array name and Purity version. A 401 Unauthorized indicates a rotated or invalid API token. A connection timeout or refused connection indicates a network path problem between the protector-engine host and the array management IP.
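The interpretation rules above can be folded into a small helper when scripting these checks. The status-code meanings match this section; the function itself is illustrative, not part of protectorclient:

```python
# Hypothetical diagnosis of the curl outcomes described above:
# 2xx = healthy, 401 = rotated/invalid token, timeout or refused
# connection = network path problem.
from typing import Optional

def diagnose_fa_response(status_code: Optional[int],
                         error: Optional[str] = None) -> str:
    if status_code is None:
        if error and ("timed out" in error or "refused" in error):
            return "network path problem between engine host and array mgmt IP"
        return "request failed before reaching the array"
    if status_code == 401:
        return "rotated or invalid API token - update the replication policy"
    if 200 <= status_code < 300:
        return "array reachable and token valid"
    return f"unexpected FlashArray API response: HTTP {status_code}"
```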
Use the following structured reference to match your observed symptom to a cause and fix. Each issue lists the symptom as you would see it, the most likely cause, and the steps to resolve it.
Issue: protector-engine cannot reach FlashArray (storage operations fail)
Symptom: DR operations (failover, consistency-group sync) fail with an error message containing `connection refused`, `timed out`, or `SSL certificate verification failed` referencing the FlashArray URL. The protector-engine log shows repeated failures against `primary_fa_url` or `secondary_fa_url`.
Likely causes:
- The FlashArray management IP is unreachable from the host running `protector-engine` (firewall, routing, or VLAN change)
- The FlashArray management interface is down or the array is in a maintenance state
- The URL stored in the replication policy is incorrect (wrong IP, wrong protocol, typo)
- TLS certificate on the array has changed and certificate verification is failing
Fix:
- From the `protector-engine` host, test raw connectivity:
  curl -sk https://<primary_fa_url>/api/2.0/array -H "x-auth-token: <token>"
- If the array is unreachable, check routing and firewall rules between the protector-engine host and the array management IP on both sites.
- If the URL or token is wrong, update the replication policy:
  openstack protector protection-group policy-create <pg-name> \
    --primary-fa-url https://corrected-array-a.example.com \
    --primary-fa-token "T-correct-token-..."
  Note: `policy-create` is idempotent; re-running it on an existing policy updates the values.
- If TLS verification is the issue, ensure the array's certificate is valid and trusted by the system CA bundle on the engine host, or update the array's certificate.
Issue: FlashArray API returns 401 Unauthorized
Symptom: Storage calls fail with a 401 error. The protector-engine log shows Authentication failed or Invalid API token against the FlashArray.
Likely cause: The API token stored in the replication policy has been rotated or deleted on the FlashArray.
Fix:
- Log into the FlashArray Purity UI or CLI and generate a new API token for the service account.
- Update the replication policy with the new token:
  openstack protector protection-group policy-create <pg-name> \
    --primary-fa-url https://flasharray-a.example.com \
    --primary-fa-token "T-new-token-..."
- Retry the failed operation.
Issue: Pure Storage Protection Group name not found
Symptom: Sync or failover operations fail with an error like `Protection Group 'pg-prod-web-app' not found on array`. The `pure_pg_name` in the policy does not match what exists on the FlashArray.
Likely causes:
- The Protection Group was renamed or deleted on the FlashArray directly (outside of Protector)
- The `pure_pg_name` in the replication policy was entered incorrectly at creation time
- The FlashArray Protection Group was never created (incomplete initial setup)
Fix:
- Log into the primary FlashArray and list Protection Groups to find the correct name.
- Update the replication policy to use the correct name:
  openstack protector protection-group policy-create <pg-name> \
    --pure-pg-name "correct-pg-name-on-array"
- If the FlashArray Protection Group was deleted, you must recreate it on the array with the correct volumes and replication target configured, then update the policy.
Issue: Metadata sync is blocked (remote site unreachable)
Symptom: Any attempt to modify the Protection Group (add/remove members, update policy) returns: `Cannot modify protection group - remote site unreachable`. The sync-status command shows `Remote Sync Status: UNREACHABLE`.
Likely cause: The protector-api or Keystone endpoint on the remote site is not reachable from the local site. This is enforced by design: Protector blocks modifications when it cannot sync, to prevent metadata divergence.
Fix:
- Validate the remote site's connectivity:
  openstack protector site validate site-b
- Check whether the remote `protector-api` is running:
  # On the remote site controller
  systemctl status protector-api
- Check that the remote Keystone endpoint is reachable from the local protector-engine host:
  curl http://site-b-controller:5000/v3
- Check firewall rules between sites; the protector API runs on port `8788` by default and must be reachable between controllers.
- Once the remote site is reachable, force a sync before making your intended changes:
  openstack protector protection-group sync-force <pg-name>
Issue: RPO violation (replication lag exceeds configured threshold)
Symptom: The Protection Group status reflects replication lag beyond the rpo_minutes value in your policy. Async snapshots are not being created at the expected replication_interval. Engine logs show snapshot scheduling errors or gaps in snapshot timestamps.
Likely causes:
- A transient network interruption between the two FlashArrays interrupted the replication link and it has not fully recovered
- The `replication_interval` set in the Protector policy is lower than the minimum allowed by the FlashArray replication schedule (the array enforces its own minimum)
- The FlashArray replication link is degraded (bandwidth-limited, high latency)
- The consistency group contains a volume whose Cinder volume type no longer has `replication_enabled='<is> True'` (volume type was modified)
Fix:
- Check the replication link status on the FlashArray directly (via Purity UI or API); look for the Protection Group's replication status and any reported link errors.
- Force an immediate sync to recover the lag:
  openstack protector consistency-group sync <pg-name>
- If the `replication_interval` is below the array minimum, increase it in the policy:
  openstack protector protection-group policy-create <pg-name> \
    --replication-interval 300
- Verify all Cinder volume types still carry the required extra specs:
  openstack volume type show replicated-ssd
  # Confirm replication_enabled='<is> True' is present
- If a volume type was modified and no longer has `replication_enabled`, restore the property and then remove and re-add the affected volumes from the Protection Group.
Issue: Split-brain (both sites report themselves as primary)
Symptom: After an unplanned failover and subsequent recovery of the original primary site, sync-status run against each site independently shows different current_primary_site values and divergent version numbers. Operations on either site fail with metadata conflict errors.
Likely cause: Site A recovered and was left in its pre-failover state (current_primary_site = site-a, version N) while Site B successfully failed over and updated its metadata (current_primary_site = site-b, version N+1). The sites were never re-synced after Site A came back online.
Fix:
- Do not make any changes to the Protection Group on either site until this is resolved.
- Determine which site is the true current primary; this is the site where VMs are actively running after the failover:
  export OS_CLOUD=site-b
  openstack server list  # If VMs are here, Site B is authoritative
- Identify the authoritative site's metadata version:
  export OS_CLOUD=site-b
  openstack protector protection-group sync-status prod-web-app
  # Note: Local Version and current_primary_site
- Force-sync from the authoritative site to overwrite the stale site:
  export OS_CLOUD=site-b
  openstack protector protection-group sync-force prod-web-app
- Verify both sites now report the same version and `current_primary_site`:
  export OS_CLOUD=site-a
  openstack protector protection-group sync-status prod-web-app
  # Should now match Site B's version and show current_primary_site = site-b
- If `sync-force` is blocked because Site A's metadata has a higher version number (indicating conflicting writes occurred on Site A after it recovered), contact support; manual database reconciliation may be required.
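The guard rail in that last step reduces to a version comparison. A minimal sketch of the rule as described in this guide (the version semantics follow the sync-status examples; the function is illustrative):

```python
# Hypothetical sketch of the sync-force guard: overwriting is only safe
# when the stale site's version is not ahead of the authoritative copy,
# since a higher stale version implies conflicting post-recovery writes.

def can_force_sync(authoritative_version: int, stale_version: int) -> tuple:
    """Return (allowed, reason) for a proposed force-sync."""
    if stale_version > authoritative_version:
        return False, ("stale site has conflicting newer writes - "
                       "manual reconciliation required")
    return True, "safe to overwrite stale site with authoritative metadata"
```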
Issue: Volume excluded from Consistency Group (`replication_enabled` missing)
Symptom: Adding a VM to a Protection Group fails with an error indicating one or more of the VM's volumes cannot be added to the Consistency Group. The error references a volume type property check.
Likely cause: One or more volumes attached to the VM use a Cinder volume type that is missing replication_enabled='<is> True' or replication_type extra specs. All volumes in a Consistency Group must use a replication-enabled volume type.
Fix:
- Identify which volumes are attached to the VM:
  openstack server show <instance-id> -f json | python3 -m json.tool | grep -A5 volumes
- For each volume, check its type:
  openstack volume show <volume-id> -c volume_type
  openstack volume type show <volume-type-name>
- If the type is missing the required properties, set them (requires admin):
  openstack volume type set replicated-ssd \
    --property replication_enabled='<is> True' \
    --property replication_type='<in> async'
- If the volume is on a non-replicated backend (e.g., local SSD without FlashArray replication configured), the volume must be migrated to a replication-enabled backend before it can be protected. Consult your Cinder backend migration procedure.
- After fixing the volume type, retry adding the VM to the Protection Group.