# PR Preview Cleanup Improvements

## Summary

This document analyzes the PR preview environment cleanup process, identifies gaps, and proposes improvements to deletion ordering and resource cleanup.

## Current Cleanup Flow
1. PR closed → GitHub Actions cleanup workflow triggers
2. Workflow deletes RabbitMQ vhost (direct API call)
3. Workflow deletes pr-{n}/ directory from cluster-gitops
4. ArgoCD ApplicationSet removes all Applications for PR
5. Each Application's finalizer deletes its resources
6. Kubernetes cascade-deletes everything when namespace is deleted
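Step 2 is a direct call to the RabbitMQ management API. A minimal sketch of such a workflow step, assuming hypothetical secret names (`RABBITMQ_MGMT_URL`, `RABBITMQ_MGMT_CREDS` as `user:password`) and the `pr-{n}` vhost naming convention:

```yaml
- name: Delete RabbitMQ vhost
  env:
    RABBITMQ_API: ${{ secrets.RABBITMQ_MGMT_URL }}     # hypothetical secret name
    RABBITMQ_CREDS: ${{ secrets.RABBITMQ_MGMT_CREDS }} # hypothetical, "user:password"
  run: |
    # Deleting a vhost via the management API also removes its queues,
    # exchanges, and bindings in one call.
    curl -fsS -u "$RABBITMQ_CREDS" -X DELETE \
      "$RABBITMQ_API/api/vhosts/pr-${{ github.event.pull_request.number }}"
```

The actual step in the workflow may differ; this only illustrates the shape of the API call.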
## Resource Inventory

### Managed by the pr-{n}-namespace Application
| Resource | External System | Cleanup Behavior |
|---|---|---|
| Namespace | - | Cascade-deletes all contained resources |
| ExternalSecret | GCP Secret Manager | Secret synced, no external cleanup needed |
| Secret (mongodb-pr-password) | - | Deleted with namespace |
| AtlasDatabaseUser CR | MongoDB Atlas | Atlas Operator deletes user from Atlas |
| Connection Secret | - | Created by Atlas Operator, deleted with namespace |
| ServiceAccount, Role, RoleBinding | - | Deleted with namespace |
| Job (db-reset) | - | Deleted with namespace |
| ConfigMap (db-reset-marker) | - | Deleted with namespace |
### Managed by Service Applications
| Resource | External System | Cleanup Behavior |
|---|---|---|
| Deployments, Services | - | Deleted with namespace |
| Ingress | external-dns, cert-manager | DNS records removed, certs cleaned |
| ServiceAccounts | - | Deleted with namespace |
### External Resources (Not Kubernetes-managed)
| Resource | Current Cleanup | Gap |
|---|---|---|
| RabbitMQ vhost | ✅ Workflow deletes before git push | None |
| MongoDB database (syrf_pr_{n}) | ❌ NOT DELETED | Data orphaned in Atlas |
## Identified Issues

### Issue 1: Orphaned MongoDB Databases (HIGH PRIORITY)
**Problem:** When a PR closes, the AtlasDatabaseUser is deleted (removing access), but the database itself (`syrf_pr_{n}`) and all its collections remain in MongoDB Atlas indefinitely.

**Impact:**

- Data accumulates over time
- Potential storage costs
- Security concern (orphaned data)

**Solution:** Add database cleanup to the workflow before deleting the cluster-gitops files.
```yaml
- name: Drop MongoDB database
  env:
    MONGO_ADMIN_URI: ${{ secrets.MONGO_ADMIN_URI }}
    # PR number exposed to the shell below
    PR_NUM: ${{ github.event.pull_request.number }}
  run: |
    mongosh "$MONGO_ADMIN_URI" --eval "
      const dbName = 'syrf_pr_${PR_NUM}';
      db.getSiblingDB(dbName).dropDatabase();
      print('Dropped database ' + dbName);
    "
```
### Issue 2: No Guaranteed Deletion Order

**Problem:** The ArgoCD ApplicationSet does not guarantee deletion order when removing Applications; all apps for a PR may be deleted simultaneously.

**Potential effects:**

- Services may lose MongoDB connectivity before graceful shutdown
- DNS records might not be cleaned up if the Ingress is cascade-deleted before the external-dns finalizer runs

**Assessment:** For ephemeral PR previews, this is acceptable. Services do not need graceful shutdown, and DNS cleanup usually works.

If stricter ordering is needed in the future, options include:

1. Sequential deletion in the workflow (delete services, wait, delete the namespace)
2. An App-of-Apps pattern with sync waves
3. A custom controller with finalizer ordering
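For illustration, option 1 (sequential deletion) could look roughly like the following workflow step. This is a sketch only: the app names are hypothetical, and it assumes the runner has `argocd` and `kubectl` access to the cluster:

```yaml
- name: Sequential teardown  # hypothetical step, not currently implemented
  run: |
    PR=${{ github.event.pull_request.number }}
    # 1. Delete the service apps first so pods terminate while the
    #    database user and DNS records still exist.
    argocd app delete "pr-${PR}-api" "pr-${PR}-web" --yes
    # 2. Wait for workloads to drain before removing shared infrastructure.
    kubectl wait --for=delete pods --all -n "pr-${PR}" --timeout=120s
    # 3. Finally remove the namespace app; cascade deletion is now safe.
    argocd app delete "pr-${PR}-namespace" --yes
```

The cost is visible here: each `wait` adds minutes to cleanup, which is why the current all-at-once approach is preferred for previews.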
### Issue 3: Potential DNS Record Orphaning

**Problem:** If the namespace cascade-deletes the Ingress before external-dns processes the deletion, DNS records may not be removed.

**Assessment:** Low probability, but worth monitoring. external-dns runs a reconciliation loop that should eventually clean up stale records.

**Mitigation:** Periodically audit DNS records against active PR previews.
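Such an audit can be a simple diff between DNS records and live preview namespaces. A sketch assuming Google Cloud DNS and a hypothetical zone name (`preview-zone`):

```yaml
- name: Audit preview DNS records  # hypothetical scheduled job
  run: |
    # Collect PR identifiers that still have DNS records.
    gcloud dns record-sets list --zone=preview-zone \
      --filter="name~pr-" --format="value(name)" \
      | grep -o 'pr-[0-9]*' | sort -u > dns.txt
    # Collect PR identifiers that still have a live namespace.
    kubectl get ns -o name | grep -o 'pr-[0-9]*' | sort -u > live.txt
    # Lines only in dns.txt are orphaned records.
    comm -23 dns.txt live.txt
```

A non-empty output from the final `comm` would be the signal to alert on.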
## Deletion Order Considerations

### Why Order Could Matter
| Scenario | If Deleted First | Impact |
|---|---|---|
| AtlasDatabaseUser | Services lose DB auth | Log errors (harmless for PRs) |
| Namespace | Everything cascade-deleted | Fast but potentially messy |
| Services | Clean pod termination | Ideal but slower |
| Ingress | DNS/cert cleanup may race | Usually fine |
### ArgoCD Sync Waves and Deletion

**Important:** Sync waves on Applications affect sync order, not deletion order when an ApplicationSet removes them.

- During creation: wave -1 syncs before wave 0
- During deletion: order is NOT guaranteed by waves

Controlled deletion order would require one of:

- An App-of-Apps pattern (waves affect child app processing)
- Sequential deletion in the workflow
- A custom controller
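For reference, a sync wave is set via an annotation on a child Application in an App-of-Apps layout. A minimal sketch with a hypothetical app name:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: pr-123-namespace               # hypothetical child app
  annotations:
    argocd.argoproj.io/sync-wave: "-1" # synced before wave-0 service apps
  finalizers:
    - resources-finalizer.argocd.argoproj.io # enables cascade deletion
```

Again, the wave only orders syncs; the finalizer ensures resources are cascade-deleted, but neither guarantees deletion order across sibling apps.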
## Recommendations

### Short-term (Implement Now)

- **Add MongoDB database cleanup to the workflow**
    - Priority: HIGH
    - Effort: LOW (simple mongosh command)
    - Prevents data accumulation
### Medium-term (Consider for Future)

- **Monitor DNS record cleanup**
    - Add a periodic audit script
    - Alert on orphaned records
- **Document cleanup expectations**
    - Set expectations that PR cleanup is "best effort"
    - Not production-grade graceful shutdown
### Long-term (If Needed)

- **Implement sequential deletion** (only if issues arise)
    - More complex workflow
    - Slower cleanup
    - Only worth it if the current approach causes problems
## Implementation Notes

### MongoDB Cleanup Requirements

Dropping a PR database requires:

- A MongoDB connection string with admin privileges
- Permission to run `dropDatabase()` on `syrf_pr_{n}`

Options:

1. Use the Atlas Admin API (REST call)
2. Use mongosh with an admin connection string
3. Create a cleanup Job in Kubernetes (similar to db-reset)
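Option 3 could reuse the db-reset pattern. A minimal Job sketch, assuming the `mongo:7` image (which ships with mongosh) and a hypothetical `mongo-admin-uri` Secret; names and namespace are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-cleanup
  namespace: pr-123                # hypothetical PR namespace
spec:
  ttlSecondsAfterFinished: 300     # Job removes itself after completion
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: drop-db
          image: mongo:7           # hypothetical image choice; ships mongosh
          # $(VAR) is expanded by Kubernetes from the env entries below.
          command: ["mongosh", "$(MONGO_ADMIN_URI)", "--eval",
                    "db.getSiblingDB('syrf_pr_123').dropDatabase()"]
          env:
            - name: MONGO_ADMIN_URI
              valueFrom:
                secretKeyRef:
                  name: mongo-admin-uri # hypothetical Secret
                  key: uri
```

One caveat with this option: the Job lives in the namespace being torn down, so it would have to complete before the namespace deletion begins, which argues for the workflow-side mongosh step instead.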
### Workflow Integration Point

MongoDB cleanup should happen:

- AFTER the RabbitMQ vhost deletion
- BEFORE deleting the cluster-gitops files
- While the AtlasDatabaseUser still exists (credentials are still available)
```yaml
cleanup-pr-preview:
  steps:
    - name: Delete RabbitMQ vhost
      # ... existing ...
    - name: Drop MongoDB database  # NEW
      # ... mongosh command ...
    - name: Delete cluster-gitops files
      # ... existing ...
```
## Related Documents

## Decision Log
| Date | Decision | Rationale |
|---|---|---|
| 2026-01-15 | Accept cascade deletion for PR previews | Ephemeral environments don't need graceful shutdown |
| 2026-01-15 | Prioritize MongoDB cleanup | Real data accumulation issue vs theoretical ordering issues |