Resource Lifecycle and Cleanup
Orphaned volume and VM cleanup after failover/failback cycles
After each failover or failback cycle, the source site retains resources that are no longer active: the original VMs (now recreated on the target site) and their detached Cinder volumes. Left unaddressed, these orphaned resources accumulate across DR cycles and consume quota, storage capacity, and operational attention. This page explains how the protector-engine handles cleanup automatically as a final step of each operation, how cleanup failures are surfaced without rolling back the failover itself, and how you can trigger manual cleanup when automatic cleanup does not complete.
Before performing or troubleshooting resource cleanup, ensure the following are in place:
- A completed failover or failback operation against a Protection Group (status failed_over, or active after failback)
- The protectorclient OSC plugin installed and configured with credentials for both the source and target sites
- The Cinder policy on the source site permits volume_extension:volume_unmanage for the member role; this is required for the engine to delete detached volumes on behalf of the tenant (see Deployment Prerequisites)
- The Nova member role policy permits instance deletion within the tenant; no additional policy changes are required for Nova
- The DR operation record for the completed failover or failback is accessible (openstack dr operation show <op-id>), so you can inspect steps_failed before triggering manual cleanup
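The last prerequisite can be scripted. Here is a minimal sketch that decides whether a finished operation still warrants manual cleanup; the field names (status, steps_failed) are assumptions based on the operation records shown in the examples later on this page:

```python
# Sketch: decide whether a DR operation record warrants manual cleanup.
# Field names (status, steps_failed) are assumptions drawn from the
# operation records shown in the examples on this page.

TERMINAL_STATES = {"completed", "failed"}

def needs_manual_cleanup(operation: dict) -> bool:
    """True if the operation is terminal and left resources behind."""
    if operation.get("status") not in TERMINAL_STATES:
        return False  # cleanup can only target a finished operation
    return bool(operation.get("steps_failed"))

# A completed failover with one leftover volume still needs cleanup:
op = {
    "status": "completed",
    "steps_failed": [{"step": "delete_source_volume:vol-data-ws1"}],
}
print(needs_manual_cleanup(op))  # -> True; a running operation returns False
```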
Resource lifecycle cleanup is built into the protector-engine service and requires no separate installation. The openstack dr operation cleanup subcommand is provided by the protectorclient OSC plugin.
1. Verify the OSC plugin is installed and recognizes the cleanup command:
openstack dr operation cleanup --help
Expected output includes the --protection-group and --operation-id flags. If the command is not found, reinstall or upgrade the protectorclient package:
pip install --upgrade python-protectorclient
2. Confirm the engine service is running on the site where cleanup will execute (the source site of the completed operation):
# On the source site
systemctl status openstack-protector-engine
The engine must be active (running) to process cleanup requests. If the source site suffered a disaster and is fully down, see Manual Volume Deletion in the troubleshooting section.
Cleanup behavior is controlled by the following settings in protector.conf on each site. These settings apply to the automatic cleanup step that runs at the end of every failover, failback, and test-failover-cleanup operation.
| Option | Section | Default | Valid values | Effect |
|---|---|---|---|---|
| cleanup_timeout_seconds | [engine] | 300 | Integer ≥ 60 | Maximum time the engine waits for a single resource deletion (VM or volume) before recording a steps_failed entry and moving on. Increase this on slow storage backends. |
| cleanup_retry_attempts | [engine] | 3 | Integer ≥ 1 | Number of times the engine retries a failed deletion before recording it as a failure. Does not apply to manual cleanup triggered via openstack dr operation cleanup. |
| cleanup_on_test_failover | [engine] | true | true / false | When true, the engine automatically deletes test-failover VMs and volumes when test_cleanup is executed. Set to false if you want to inspect test resources before manual deletion. |
| retain_source_volumes_on_failover | [engine] | false | true / false | When true, the engine skips automatic volume deletion after failover, leaving source volumes detached but intact. Useful when you need to preserve the last-known-good data set on the source. Manual cleanup is then required. |
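For reference, a protector.conf fragment that raises the deletion timeout for a slow backend and preserves source volumes after failover might look like this (the values are illustrative, not recommendations):

```ini
[engine]
# Wait up to 10 minutes per resource deletion on a slow storage backend.
cleanup_timeout_seconds = 600
# Retry each failed deletion up to 5 times before recording it in steps_failed.
cleanup_retry_attempts = 5
# Delete test-failover resources automatically when test_cleanup runs.
cleanup_on_test_failover = true
# Keep source volumes after failover; manual cleanup is then required.
retain_source_volumes_on_failover = true
```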
Why these options matter: Cleanup runs after the target-side VMs are confirmed healthy. A timeout or transient API error during cleanup does not roll back the failover: the operation is marked completed and the failed cleanup steps are recorded in steps_failed. This design ensures a storage or Nova hiccup on the now-inactive source site cannot undo a successful failover.
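The retry-then-record behavior can be sketched as follows. This is an illustrative model of the semantics just described, not the engine's actual code:

```python
# Illustrative model of the cleanup semantics described above: each deletion
# is retried up to cleanup_retry_attempts times, and a persistent failure is
# recorded in steps_failed instead of failing (or rolling back) the operation.

def run_cleanup(resources, delete_fn, retry_attempts=3):
    """Attempt to delete each (step, resource_id) pair; never raise."""
    steps_completed, steps_failed = [], []
    for step, resource_id in resources:
        last_error = None
        for _ in range(retry_attempts):
            try:
                delete_fn(resource_id)
                steps_completed.append(step)
                last_error = None
                break
            except Exception as exc:  # timeout or transient API error
                last_error = exc
        if last_error is not None:
            steps_failed.append(
                {"step": step, "error": str(last_error), "resource_id": resource_id}
            )
    # The operation is marked completed regardless of cleanup failures.
    return {"status": "completed", "steps_completed": steps_completed,
            "steps_failed": steps_failed}
```

A deletion that exhausts all attempts lands in steps_failed while the operation status stays completed, matching the no-rollback design.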
After changing any of these values, restart the engine:
systemctl restart openstack-protector-engine
Automatic cleanup during failover
You do not need to take any action for normal cleanup. After the target-site VMs are confirmed healthy, the protector-engine automatically:
- Deletes each source-site VM (Nova DELETE /servers/{id}).
- Waits for volumes to detach (Cinder reports volume status available).
- Deletes each detached source-site volume (Cinder DELETE /volumes/{id}).
- Records each completed deletion in steps_completed and each failure in steps_failed on the DR operation record.
The Protection Group's overall operation status transitions to completed regardless of whether cleanup succeeded, so a partial cleanup does not leave the Protection Group in an error state.
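The ordering above matters: a volume cannot be deleted while Nova still holds its attachment. A minimal sketch of the sequence, with hypothetical client objects standing in for the Nova and Cinder APIs:

```python
import time

# Sketch of the automatic cleanup ordering described above. The nova and
# cinder arguments are hypothetical stand-ins for the real API clients.

def cleanup_source_site(vms, volumes, nova, cinder, poll_interval=0.0):
    """Delete source VMs first, then wait for and delete their volumes."""
    for vm_id in vms:
        nova.delete_server(vm_id)           # DELETE /servers/{id}
    for vol_id in volumes:
        # Deleting the VM triggers detach; wait until Cinder reports the
        # volume as available before deleting it.
        while cinder.volume_status(vol_id) != "available":
            time.sleep(poll_interval)
        cinder.delete_volume(vol_id)        # DELETE /volumes/{id}
```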
Checking cleanup results
After a failover or failback, inspect the operation record to confirm cleanup ran cleanly:
openstack dr operation show <operation-id>
Look for steps_failed. An empty list means all cleanup succeeded. A non-empty list identifies which resources were not removed and why.
Triggering manual cleanup
If steps_failed contains cleanup entries, or if you set retain_source_volumes_on_failover = true, trigger manual cleanup against the completed operation:
openstack dr operation cleanup <operation-id>
This command re-attempts deletion of every resource listed in steps_failed for the given operation. Resources that were already successfully deleted are skipped, so the command is safe to run multiple times.
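The skip behavior can be modeled as a set difference over the operation record; the field names are assumptions based on the operation records shown in the examples on this page:

```python
# Sketch of why re-running cleanup is safe: only steps still listed in
# steps_failed are retried, and anything in steps_completed is skipped.
# Field names are assumptions based on the operation records on this page.

def steps_to_retry(operation: dict) -> list:
    completed = set(operation.get("steps_completed", []))
    return [entry["step"] for entry in operation.get("steps_failed", [])
            if entry["step"] not in completed]

op = {
    "steps_completed": ["delete_source_vm:web-server-1",
                        "delete_source_volume:vol-boot-ws1"],
    "steps_failed": [{"step": "delete_source_volume:vol-data-ws1",
                      "error": "Cinder timeout after 300s"}],
}
print(steps_to_retry(op))  # -> ['delete_source_volume:vol-data-ws1']
```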
Cleanup after test failover
Test failover creates VMs and volumes on the secondary site. These are cleaned up by running the test_cleanup action, not the general cleanup command:
openstack protector protection-group action prod-web-app \
--action test-cleanup \
--test-operation-id <test-operation-id>
The engine then deletes the test VMs and volumes on the secondary site and marks the test operation completed. The primary-site VMs are untouched throughout.
Example 1: Inspect cleanup results after a planned failover
After failing over prod-web-app from site-a to site-b, check whether source-side resources were cleaned up:
openstack dr operation show op-7a3f2c11-0001-4b22-a918-000000000001
+------------------+-------------------------------------------------------------+
| Field | Value |
+------------------+-------------------------------------------------------------+
| id | op-7a3f2c11-0001-4b22-a918-000000000001 |
| operation_type | failover |
| status | completed |
| progress | 100 |
| source_site | site-a |
| target_site | site-b |
| started_at | 2025-11-03T14:00:00Z |
| completed_at | 2025-11-03T14:07:42Z |
| steps_completed | ["promote_volumes", "manage_volumes", "recreate_vms", |
| | "delete_source_vm:web-server-1", |
| | "delete_source_vm:web-server-2", |
| | "delete_source_volume:vol-boot-ws1", |
| | "delete_source_volume:vol-data-ws1", |
| | "delete_source_volume:vol-boot-ws2", |
| | "delete_source_volume:vol-data-ws2"] |
| steps_failed | [] |
+------------------+-------------------------------------------------------------+
steps_failed is empty: all source-side VMs and volumes were deleted automatically.
Example 2: Partial cleanup failure and manual remediation
A Cinder timeout during cleanup left one volume behind:
openstack dr operation show op-7a3f2c11-0002-4b22-a918-000000000002
| steps_completed | ["promote_volumes", "manage_volumes", "recreate_vms", |
| | "delete_source_vm:web-server-1", |
| | "delete_source_vm:db-server", |
| | "delete_source_volume:vol-boot-ws1", |
| | "delete_source_volume:vol-boot-db", |
| | "delete_source_volume:vol-data-db"] |
| steps_failed | [{"step": "delete_source_volume:vol-data-ws1", |
| | "error": "Cinder timeout after 300s", |
| | "resource_id": "cinder-vol-abc123"}] |
| status | completed |
The operation is marked completed: the failover itself succeeded. Trigger manual cleanup to remove the orphaned volume:
openstack dr operation cleanup op-7a3f2c11-0002-4b22-a918-000000000002
Retrying cleanup for operation op-7a3f2c11-0002-4b22-a918-000000000002...
Skipping already-cleaned resources (8 skipped)
Retrying: delete_source_volume:vol-data-ws1
Checking volume status on site-a... available
Deleting volume cinder-vol-abc123... done
✓
Cleanup completed. 1 resource deleted, 0 failures.
Example 3: Manual cleanup after retain_source_volumes_on_failover = true
If your protector.conf has retain_source_volumes_on_failover = true, no volumes are deleted automatically. After you have confirmed the failover is successful and you no longer need the source volumes, run:
openstack dr operation cleanup op-7a3f2c11-0003-4b22-a918-000000000003
Retrying cleanup for operation op-7a3f2c11-0003-4b22-a918-000000000003...
Volumes retained by configuration; processing now:
Deleting volume cinder-vol-boot-ws1... done
Deleting volume cinder-vol-data-ws1... done
Deleting volume cinder-vol-boot-ws2... done
Deleting volume cinder-vol-data-ws2... done
Deleting volume cinder-vol-boot-db... done
Deleting volume cinder-vol-data-db... done
Deleting volume cinder-vol-log-db... done
✓
Cleanup completed. 7 resources deleted, 0 failures.
Example 4: Clean up after a test failover
# Run test cleanup to remove test VMs and volumes from site-b
openstack protector protection-group action prod-web-app \
--action test-cleanup \
--test-operation-id op-7a3f2c11-0004-4b22-a918-000000000004
+------------------+---------------------------------------------+
| Field | Value |
+------------------+---------------------------------------------+
| operation_id | op-7a3f2c11-0005-4b22-a918-000000000005 |
| operation_type | test_cleanup |
| status | running |
| progress | 0 |
+------------------+---------------------------------------------+
# Monitor until complete
watch openstack dr operation show op-7a3f2c11-0005-4b22-a918-000000000005
Cleanup fails with "Volume not found" on source site
Symptom: steps_failed contains delete_source_volume:<vol-id> with error Volume not found (404).
Likely cause: The volume was already deleted manually, or the source-site Cinder database was reset after a disaster and the volume record no longer exists.
Fix: This is a non-issue; the resource is already gone. The entry in steps_failed is informational. Running openstack dr operation cleanup again will attempt deletion, receive the same 404, and report success (idempotent delete). If you prefer to leave it without retrying, note that the Protection Group itself is in the failed_over or active state and fully operational; the stale steps_failed entry does not block future DR operations.
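The idempotent-delete behavior can be sketched like this; the 404-handling is the key point, and the exception class here is a hypothetical stand-in for the real client's not-found error:

```python
# Sketch: treat "already gone" as success when retrying a deletion.
# VolumeNotFound is a hypothetical stand-in for the Cinder client's 404 error.

class VolumeNotFound(Exception):
    pass

def delete_volume_idempotent(delete_fn, volume_id) -> str:
    """Delete a volume; a 404 means the desired end state is already reached."""
    try:
        delete_fn(volume_id)
        return "deleted"
    except VolumeNotFound:
        return "already_gone"  # 404: nothing left to remove, report success
```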
Cleanup fails with "Volume status must be available" on source site
Symptom: steps_failed contains delete_source_volume:<vol-id> with error Invalid volume: Volume status must be available to delete.
Likely cause: The volume is still reported as in-use because Nova did not fully detach it before the engine attempted deletion. This can happen if the source-site Nova compute service was degraded during the failover.
Fix:
- On the source site, check the volume's current state: openstack volume show <vol-id>
- If the volume is in-use but the source VM no longer exists, reset the volume state so it can be deleted: openstack volume set --state available <vol-id>
- Re-run manual cleanup: openstack dr operation cleanup <operation-id>
openstack dr operation cleanup reports "Operation not in completed state"
Symptom: Running openstack dr operation cleanup <op-id> returns an error that the operation must be in completed or failed state.
Likely cause: The operation is still running. Cleanup can only be triggered on a terminal operation; the engine must finish (or fail) the main failover steps before cleanup is addressable.
Fix: Wait for the operation to reach completed or failed, then retry. Monitor with:
watch openstack dr operation show <op-id>
Source-site VMs still visible in Nova after cleanup
Symptom: After a completed failover, openstack server list on the source site still shows the original VMs. steps_failed contains delete_source_vm:<instance-id> entries.
Likely cause: The source-site Nova API was unreachable or returned a 5xx error when the engine attempted deletion. The VMs may be in ERROR state or still ACTIVE.
Fix:
- Verify the source-site Nova API is healthy: openstack server list --all-projects  # on source site
- Attempt deletion directly if the engine cannot reach Nova: openstack server delete <instance-id>  # on source site
- If Nova reports the server is locked or in a non-deletable state, force-delete: openstack server delete --force <instance-id>  # on source site
- Once manual deletion succeeds, the openstack dr operation cleanup command can be run; it will skip the already-deleted VMs and handle any remaining orphaned volumes.
Cleanup is blocked: "Remote site unreachable - metadata sync required"
Symptom: openstack dr operation cleanup fails immediately with a sync error rather than attempting resource deletion.
Likely cause: The cleanup command also updates Protection Group metadata, and metadata sync requires both sites to be reachable. If the peer site is down, the engine blocks the update to prevent metadata divergence.
Fix: Cleanup of source-side resources (Nova VMs and Cinder volumes) only touches the source site. If the peer is temporarily unreachable, wait until it recovers, verify sync status, and retry:
openstack protector protection-group sync-status prod-web-app
openstack protector protection-group sync-force prod-web-app # once peer is back
openstack dr operation cleanup <operation-id>
Orphaned volumes accumulate across multiple DR cycles
Symptom: After several failover/failback cycles, the source site has many unattached volumes with names matching old Protection Group members. Quota is being consumed.
Likely cause: Repeated cleanup failures (due to timeouts or API errors) were not addressed after each cycle, and retain_source_volumes_on_failover may be set to true.
Fix: List all operations for the Protection Group and run cleanup against each completed operation that has non-empty steps_failed:
openstack dr operation list --protection-group prod-web-app
For each operation with failures:
openstack dr operation cleanup <operation-id>
This is safe to run against older completed operations; resources already deleted are skipped. After all cleanup passes complete, audit remaining volumes:
openstack volume list --status available # on source site
Any volumes not associated with an active Protection Group member can be deleted manually after confirming they are not in use.
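That final audit can be scripted against the JSON output of the volume listing. A sketch, where the volume names and member list are illustrative and the input mirrors `openstack volume list --status available -f json`:

```python
# Sketch: flag available volumes that do not belong to any active
# Protection Group member. Names here are illustrative; feed it the
# parsed output of `openstack volume list --status available -f json`.

def find_orphan_volumes(available_volumes, active_member_volumes):
    """Return names of available volumes not owned by an active member."""
    active = set(active_member_volumes)
    return [v["Name"] for v in available_volumes if v["Name"] not in active]

volumes = [{"Name": "vol-data-ws1"}, {"Name": "vol-scratch-old"}]
members = ["vol-data-ws1"]
print(find_orphan_volumes(volumes, members))  # -> ['vol-scratch-old']
```

Review each flagged volume manually before deleting it; the listing cannot distinguish a true orphan from a volume that is intentionally detached.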