Site Recovery for OpenStack
Guide

Resource Lifecycle and Cleanup

Orphaned volume and VM cleanup after failover/failback cycles


Overview

After each failover or failback cycle, the source site retains resources that are no longer active: the original VMs (now recreated on the target site) and their detached Cinder volumes. Left unaddressed, these orphaned resources accumulate across DR cycles and consume quota, storage capacity, and operational attention. This page explains how the protector-engine handles cleanup automatically as a final step of each operation, how cleanup failures are surfaced without rolling back the failover itself, and how you can trigger manual cleanup when automatic cleanup does not complete.


Prerequisites

Before performing or troubleshooting resource cleanup, ensure the following are in place:

  • A completed failover or failback operation against a Protection Group (status failed_over or active after failback)
  • The protectorclient OSC plugin installed and configured with credentials for both the source and target sites
  • The Cinder policy on the source site permits volume_extension:volume_unmanage for the member role — this is required for the engine to delete detached volumes on behalf of the tenant (see Deployment Prerequisites)
  • The Nova member role policy permits instance deletion within the tenant — no additional policy changes are required for Nova
  • The DR operation record for the completed failover or failback is accessible (openstack dr operation show <op-id>), so you can inspect steps_failed before triggering manual cleanup

Installation

Resource lifecycle cleanup is built into the protector-engine service and requires no separate installation. The openstack dr operation cleanup subcommand is provided by the protectorclient OSC plugin.

1. Verify the OSC plugin is installed and recognizes the cleanup command:

openstack dr operation cleanup --help

Expected output includes the --protection-group and --operation-id flags. If the command is not found, reinstall or upgrade the protectorclient package:

pip install --upgrade python-protectorclient

2. Confirm the engine service is running on the site where cleanup will execute (the source site of the completed operation):

# On the source site
systemctl status openstack-protector-engine

The engine must be active (running) to process cleanup requests. If the source site suffered a disaster and is fully down, see Manual Volume Deletion in the troubleshooting section.


Configuration

Cleanup behavior is controlled by the following settings in protector.conf on each site. These settings apply to the automatic cleanup step that runs at the end of every failover, failback, and test-failover-cleanup operation.

cleanup_timeout_seconds (section [engine], default 300, valid values: integer ≄ 60)
  Maximum time the engine waits for a single resource deletion (VM or volume) before recording a steps_failed entry and moving on. Increase this on slow storage backends.

cleanup_retry_attempts (section [engine], default 3, valid values: integer ≄ 1)
  Number of times the engine retries a failed deletion before recording it as a failure. Does not apply to manual cleanup triggered via openstack dr operation cleanup.

cleanup_on_test_failover (section [engine], default true, valid values: true / false)
  When true, the engine automatically deletes test-failover VMs and volumes when test_cleanup is executed. Set to false if you want to inspect test resources before manual deletion.

retain_source_volumes_on_failover (section [engine], default false, valid values: true / false)
  When true, the engine skips automatic volume deletion after failover, leaving source volumes detached but intact. Useful when you need to preserve the last-known-good data set on the source. Manual cleanup is then required.
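As an illustration, a protector.conf [engine] section tuned for a slow storage backend might look like the following; every value except the raised timeout is the default:

```ini
[engine]
# Allow up to 10 minutes per resource deletion (default is 300 seconds).
cleanup_timeout_seconds = 600
cleanup_retry_attempts = 3
cleanup_on_test_failover = true
retain_source_volumes_on_failover = false
```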

Why these options matter: Cleanup runs after the target-side VMs are confirmed healthy. A timeout or transient API error during cleanup does not roll back the failover — the operation is marked completed and the failed cleanup steps are recorded in steps_failed. This design ensures a storage or Nova hiccup on the now-inactive source site cannot undo a successful failover.

After changing any of these values, restart the engine:

systemctl restart openstack-protector-engine

Usage

Automatic cleanup during failover

You do not need to take any action for normal cleanup. After the target-site VMs are confirmed healthy, the protector-engine automatically:

  1. Deletes each source-site VM (Nova DELETE /servers/{id}).
  2. Waits for volumes to detach (Cinder reports volume status available).
  3. Deletes each detached source-site volume (Cinder DELETE /volumes/{id}).
  4. Records each completed deletion in steps_completed and each failure in steps_failed on the DR operation record.

The Protection Group's overall operation status transitions to completed regardless of whether cleanup succeeded, so a partial cleanup does not leave the Protection Group in an error state.
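The control flow of this automatic step can be sketched in Python. This is an illustration only, not the engine's actual implementation; the `nova` and `cinder` client objects, their method names, and the shape of the `op` record are all assumptions:

```python
import time

def cleanup_source_site(nova, cinder, vms, volumes, op,
                        timeout=300, poll_interval=5):
    """Sketch of the engine's post-failover cleanup (illustrative only).

    `nova` and `cinder` stand in for service clients; `op` is the DR
    operation record holding steps_completed / steps_failed lists.
    """
    # Step 1: delete each source-site VM.
    for vm in vms:
        try:
            nova.delete_server(vm)
            op["steps_completed"].append(f"delete_source_vm:{vm}")
        except Exception as exc:
            op["steps_failed"].append(
                {"step": f"delete_source_vm:{vm}", "error": str(exc)})

    # Steps 2-3: wait for each volume to detach, then delete it.
    for vol in volumes:
        deadline = time.monotonic() + timeout
        while cinder.get_status(vol) != "available":
            if time.monotonic() >= deadline:
                # Record the timeout and move on; never roll back.
                op["steps_failed"].append(
                    {"step": f"delete_source_volume:{vol}",
                     "error": f"timeout after {timeout}s"})
                break
            time.sleep(poll_interval)
        else:
            try:
                cinder.delete_volume(vol)
                op["steps_completed"].append(f"delete_source_volume:{vol}")
            except Exception as exc:
                op["steps_failed"].append(
                    {"step": f"delete_source_volume:{vol}",
                     "error": str(exc)})

    # Step 4 happens implicitly above: every outcome lands in
    # steps_completed or steps_failed, and the failover stays committed.
    return op
```

Note the asymmetry: a failure is recorded but never escalated, which is why the operation finishes as completed even when cleanup is only partial.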

Checking cleanup results

After a failover or failback, inspect the operation record to confirm cleanup ran cleanly:

openstack dr operation show <operation-id>

Look for steps_failed. An empty list means all cleanup succeeded. A non-empty list identifies which resources were not removed and why.

Triggering manual cleanup

If steps_failed contains cleanup entries, or if you set retain_source_volumes_on_failover = true, trigger manual cleanup against the completed operation:

openstack dr operation cleanup <operation-id>

This command re-attempts deletion of every resource listed in steps_failed for the given operation. Resources that were already successfully deleted are skipped — the command is safe to run multiple times.
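That idempotency can be sketched as follows. This is an illustration of the behavior, not the actual client or engine code; `NotFound`, `delete_fn`, and the record shape are assumptions:

```python
class NotFound(Exception):
    """Stand-in for a 404 from Nova or Cinder."""

def retry_cleanup(op, delete_fn):
    """Illustrative sketch of idempotent manual cleanup.

    Re-attempts only the steps listed in steps_failed; a step already
    present in steps_completed is skipped, and a 404 means the resource
    is already gone, which counts as success.
    """
    still_failed = []
    for entry in op["steps_failed"]:
        step = entry["step"]
        if step in op["steps_completed"]:
            continue  # already cleaned in an earlier pass
        try:
            delete_fn(step)
        except NotFound:
            pass  # deleted out of band: treat as success
        except Exception as exc:
            still_failed.append({"step": step, "error": str(exc)})
            continue
        op["steps_completed"].append(step)
    op["steps_failed"] = still_failed
    return op
```

Because every pass only shrinks steps_failed and never re-deletes a completed step, running the command repeatedly is harmless.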

Cleanup after test failover

Test failover creates VMs and volumes on the secondary site. These are cleaned up by running the test_cleanup action, not the general cleanup command:

openstack protector protection-group action prod-web-app \
  --action test-cleanup \
  --test-operation-id <test-operation-id>

The engine then deletes the test VMs and volumes on the secondary site and marks the test operation completed. The primary-site VMs are untouched throughout.


Examples

Example 1: Inspect cleanup results after a planned failover

After failing over prod-web-app from site-a to site-b, check whether source-side resources were cleaned up:

openstack dr operation show op-7a3f2c11-0001-4b22-a918-000000000001
+------------------+-------------------------------------------------------------+
| Field            | Value                                                       |
+------------------+-------------------------------------------------------------+
| id               | op-7a3f2c11-0001-4b22-a918-000000000001                    |
| operation_type   | failover                                                    |
| status           | completed                                                   |
| progress         | 100                                                         |
| source_site      | site-a                                                      |
| target_site      | site-b                                                      |
| started_at       | 2025-11-03T14:00:00Z                                        |
| completed_at     | 2025-11-03T14:07:42Z                                        |
| steps_completed  | ["promote_volumes", "manage_volumes", "recreate_vms",      |
|                  |  "delete_source_vm:web-server-1",                          |
|                  |  "delete_source_vm:web-server-2",                          |
|                  |  "delete_source_volume:vol-boot-ws1",                      |
|                  |  "delete_source_volume:vol-data-ws1",                      |
|                  |  "delete_source_volume:vol-boot-ws2",                      |
|                  |  "delete_source_volume:vol-data-ws2"]                      |
| steps_failed     | []                                                          |
+------------------+-------------------------------------------------------------+

steps_failed is empty — all source-side VMs and volumes were deleted automatically.


Example 2: Partial cleanup failure — manual remediation

A Cinder timeout during cleanup left one volume behind:

openstack dr operation show op-7a3f2c11-0002-4b22-a918-000000000002
| steps_completed  | ["promote_volumes", "manage_volumes", "recreate_vms",      |
|                  |  "delete_source_vm:web-server-1",                          |
|                  |  "delete_source_vm:db-server",                             |
|                  |  "delete_source_volume:vol-boot-ws1",                      |
|                  |  "delete_source_volume:vol-boot-db",                       |
|                  |  "delete_source_volume:vol-data-db"]                       |
| steps_failed     | [{"step": "delete_source_volume:vol-data-ws1",            |
|                  |    "error": "Cinder timeout after 300s",                   |
|                  |    "resource_id": "cinder-vol-abc123"}]                    |
| status           | completed                                                   |

The operation is completed — the failover itself succeeded. Trigger manual cleanup to remove the orphaned volume:

openstack dr operation cleanup op-7a3f2c11-0002-4b22-a918-000000000002
Retrying cleanup for operation op-7a3f2c11-0002-4b22-a918-000000000002...

Skipping already-cleaned resources (8 skipped)

Retrying: delete_source_volume:vol-data-ws1
  Checking volume status on site-a... available
  Deleting volume cinder-vol-abc123... done

āœ… Cleanup completed. 1 resource deleted, 0 failures.

Example 3: Manual cleanup after retain_source_volumes_on_failover = true

If your protector.conf has retain_source_volumes_on_failover = true, no volumes are deleted automatically. After you have confirmed the failover is successful and you no longer need the source volumes, run:

openstack dr operation cleanup op-7a3f2c11-0003-4b22-a918-000000000003
Retrying cleanup for operation op-7a3f2c11-0003-4b22-a918-000000000003...

Volumes retained by configuration — processing now:
  Deleting volume cinder-vol-boot-ws1... done
  Deleting volume cinder-vol-data-ws1... done
  Deleting volume cinder-vol-boot-ws2... done
  Deleting volume cinder-vol-data-ws2... done
  Deleting volume cinder-vol-boot-db...  done
  Deleting volume cinder-vol-data-db...  done
  Deleting volume cinder-vol-log-db...   done

āœ… Cleanup completed. 7 resources deleted, 0 failures.

Example 4: Clean up after a test failover

# Run test cleanup to remove test VMs and volumes from site-b
openstack protector protection-group action prod-web-app \
  --action test-cleanup \
  --test-operation-id op-7a3f2c11-0004-4b22-a918-000000000004
+------------------+---------------------------------------------+
| Field            | Value                                       |
+------------------+---------------------------------------------+
| operation_id     | op-7a3f2c11-0005-4b22-a918-000000000005    |
| operation_type   | test_cleanup                                |
| status           | running                                     |
| progress         | 0                                           |
+------------------+---------------------------------------------+

# Monitor until complete
watch openstack dr operation show op-7a3f2c11-0005-4b22-a918-000000000005

Troubleshooting

Cleanup fails with "Volume not found" on source site

Symptom: steps_failed contains delete_source_volume:<vol-id> with error Volume not found (404).

Likely cause: The volume was already deleted manually, or the source-site Cinder database was reset after a disaster and the volume record no longer exists.

Fix: This is a non-issue: the resource is already gone, and the entry in steps_failed is informational. Running openstack dr operation cleanup again will attempt deletion, receive the same 404, and report success (the delete is idempotent). If you prefer not to retry at all, note that the Protection Group itself is in the failed_over or active state and fully operational; the stale steps_failed entry does not block future DR operations.


Cleanup fails with "Volume status must be available" on source site

Symptom: steps_failed contains delete_source_volume:<vol-id> with error Invalid volume: Volume status must be available to delete.

Likely cause: The volume is still reported as in-use because Nova did not fully detach it before the engine attempted deletion. This can happen if the source-site Nova compute service was degraded during the failover.

Fix:

  1. On the source site, check the volume's current state:
    openstack volume show <vol-id>
    
  2. If the volume is in-use but the source VM no longer exists, reset the volume state so Cinder will allow deletion:
    openstack volume set --state available <vol-id>
    
  3. Re-run manual cleanup:
    openstack dr operation cleanup <operation-id>
    

openstack dr operation cleanup reports "Operation not in completed state"

Symptom: Running openstack dr operation cleanup <op-id> returns an error that the operation must be in completed or failed state.

Likely cause: The operation is still running. Cleanup can only be triggered against a terminal operation; the engine must finish (or fail) the main failover steps before cleanup can be requested.

Fix: Wait for the operation to reach completed or failed, then retry. Monitor with:

watch openstack dr operation show <op-id>
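The gate the engine applies here can be sketched as a simple state check. This is an assumption about the behavior for illustration, not the engine's actual code:

```python
# States in which a DR operation is terminal and cleanup may target it.
TERMINAL_STATES = {"completed", "failed"}

def ensure_cleanup_allowed(operation):
    """Raise if the operation is not terminal (illustrative sketch)."""
    if operation["status"] not in TERMINAL_STATES:
        raise ValueError(
            "Operation not in completed state: cleanup requires "
            f"'completed' or 'failed', got '{operation['status']}'")
    return True
```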

Source-site VMs still visible in Nova after cleanup

Symptom: After a completed failover, openstack server list on the source site still shows the original VMs. steps_failed contains delete_source_vm:<instance-id> entries.

Likely cause: The source-site Nova API was unreachable or returned a 5xx error when the engine attempted deletion. The VMs may be in ERROR state or still ACTIVE.

Fix:

  1. Verify the source-site Nova API is healthy:
    openstack server list --all-projects  # on source site
    
  2. Attempt deletion directly if the engine cannot reach Nova:
    openstack server delete <instance-id>  # on source site
    
  3. If Nova reports the server is locked or in a non-deletable state, force-delete:
    openstack server delete --force <instance-id>  # on source site
    
  4. Once manual deletion succeeds, the openstack dr operation cleanup command can be run — it will skip the already-deleted VMs and handle any remaining orphaned volumes.

Cleanup is blocked: "Remote site unreachable — metadata sync required"

Symptom: openstack dr operation cleanup fails immediately with a sync error rather than attempting resource deletion.

Likely cause: The cleanup command also updates Protection Group metadata, and metadata sync requires both sites to be reachable. If the peer site is down, the engine blocks the update to prevent metadata divergence.

Fix: Cleanup of source-side resources (Nova VMs and Cinder volumes) only touches the source site. If the peer is temporarily unreachable, wait until it recovers, verify sync status, and retry:

openstack protector protection-group sync-status prod-web-app
openstack protector protection-group sync-force prod-web-app  # once peer is back
openstack dr operation cleanup <operation-id>

Orphaned volumes accumulate across multiple DR cycles

Symptom: After several failover/failback cycles, the source site has many unattached volumes with names matching old Protection Group members. Quota is being consumed.

Likely cause: Repeated cleanup failures (due to timeouts or API errors) were not addressed after each cycle, and retain_source_volumes_on_failover may be set to true.

Fix: List all operations for the Protection Group and run cleanup against each completed operation that has non-empty steps_failed:

openstack dr operation list --protection-group prod-web-app

For each operation with failures:

openstack dr operation cleanup <operation-id>

This is safe to run against older completed operations — resources already deleted are skipped. After all cleanup passes complete, audit remaining volumes:

openstack volume list --status available  # on source site

Any volumes not associated with an active Protection Group member can be deleted manually after confirming they are not in use.
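For that final audit, cross-checking the available-volume list against the volume IDs still referenced by active Protection Group members can be sketched as a simple filter. The record shapes here are assumptions for illustration, not the engine's data model:

```python
def orphan_candidates(available_volumes, active_member_volume_ids):
    """Return detached volumes not referenced by any active member.

    `available_volumes` is a list of {"id": ..., "name": ...} dicts,
    e.g. parsed from `openstack volume list --status available -f json`.
    """
    active = set(active_member_volume_ids)
    return [v for v in available_volumes if v["id"] not in active]
```

Anything this filter returns is a candidate for manual deletion, after confirming the volume is genuinely unused.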