
PR Preview Cleanup Improvements

Summary

Analysis of PR preview environment cleanup process, identifying gaps and proposing improvements for deletion ordering and resource cleanup.

Current Cleanup Flow

1. PR closed → GitHub Actions cleanup workflow triggers
2. Workflow deletes RabbitMQ vhost (direct API call)
3. Workflow deletes pr-{n}/ directory from cluster-gitops
4. ArgoCD ApplicationSet removes all Applications for PR
5. Each Application's finalizer deletes its resources
6. Kubernetes cascade-deletes everything when namespace is deleted
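The flow above is driven by a close-triggered GitHub Actions workflow. A minimal sketch of the trigger follows; the workflow name, host, and credential handling are illustrative assumptions, not the actual workflow file:

```yaml
# Hypothetical trigger sketch; the real workflow may differ.
name: pr-preview-cleanup
on:
  pull_request:
    types: [closed]   # fires on merge and on close-without-merge

jobs:
  cleanup-pr-preview:
    runs-on: ubuntu-latest
    steps:
      - name: Delete RabbitMQ vhost
        run: |
          # RabbitMQ management API: DELETE /api/vhosts/{name}
          # (host and credentials are assumptions)
          curl -fsS -u "$RABBITMQ_USER:$RABBITMQ_PASS" -X DELETE \
            "https://rabbitmq.example.com/api/vhosts/pr-${{ github.event.pull_request.number }}"
```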

Resource Inventory

Managed by pr-{n}-namespace Application

| Resource | External System | Cleanup Behavior |
|---|---|---|
| Namespace | - | Cascade-deletes all contained resources |
| ExternalSecret | GCP Secret Manager | Secret synced, no external cleanup needed |
| Secret (mongodb-pr-password) | - | Deleted with namespace |
| AtlasDatabaseUser CR | MongoDB Atlas | Atlas Operator deletes user from Atlas |
| Connection Secret | - | Created by Atlas Operator, deleted with namespace |
| ServiceAccount, Role, RoleBinding | - | Deleted with namespace |
| Job (db-reset) | - | Deleted with namespace |
| ConfigMap (db-reset-marker) | - | Deleted with namespace |

Managed by Service Applications

| Resource | External System | Cleanup Behavior |
|---|---|---|
| Deployments, Services | - | Deleted with namespace |
| Ingress | external-dns, cert-manager | DNS records removed, certs cleaned |
| ServiceAccounts | - | Deleted with namespace |

External Resources (Not Kubernetes-managed)

| Resource | Current Cleanup | Gap |
|---|---|---|
| RabbitMQ vhost | ✅ Workflow deletes before git push | None |
| MongoDB database (syrf_pr_{n}) | NOT DELETED | Data orphaned in Atlas |

Identified Issues

Issue 1: Orphaned MongoDB Databases (HIGH PRIORITY)

Problem: When a PR closes, the AtlasDatabaseUser is deleted (removing access), but the database itself (syrf_pr_{n}) and all its collections remain in MongoDB Atlas indefinitely.

Impact:

- Data accumulates over time
- Potential storage costs
- Security concern (orphaned data)

Solution: Add database cleanup to workflow before deleting cluster-gitops files.

```yaml
- name: Drop MongoDB database
  env:
    MONGO_ADMIN_URI: ${{ secrets.MONGO_ADMIN_URI }}
    PR_NUM: ${{ github.event.pull_request.number }}  # PR number for the database name
  run: |
    mongosh "$MONGO_ADMIN_URI" --eval "
      const dbName = 'syrf_pr_${PR_NUM}';
      db.getSiblingDB(dbName).dropDatabase();
      print('Dropped database ' + dbName);
    "
```

Issue 2: No Guaranteed Deletion Order

Problem: ArgoCD ApplicationSet doesn't guarantee deletion order when removing Applications. All apps for a PR may be deleted simultaneously.

Potential Effects:

- Services may lose MongoDB connectivity before graceful shutdown
- DNS records might not be cleaned if the Ingress is cascade-deleted before the external-dns finalizer runs

Assessment: For ephemeral PR previews, this is acceptable. Services don't need graceful shutdown, and DNS cleanup usually works.

If stricter ordering is needed in the future, options include:

1. Sequential deletion in workflow (delete services, wait, delete namespace)
2. App-of-Apps pattern with sync waves
3. Custom controller with finalizer ordering
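Option 1 could be sketched as extra workflow steps. The `argocd` CLI availability and the `pr-{n}-<service>` Application naming convention are both assumptions here:

```yaml
# Hypothetical sequential deletion; app naming and CLI access are assumed.
- name: Delete service Applications first
  run: |
    for app in $(argocd app list -o name | grep "pr-${PR_NUM}-" | grep -v namespace); do
      argocd app delete "$app" --yes
    done

- name: Wait for pods to terminate
  run: kubectl wait --for=delete pods --all -n "pr-${PR_NUM}" --timeout=120s

- name: Delete the namespace Application last
  run: argocd app delete "pr-${PR_NUM}-namespace" --yes
```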

Issue 3: Potential DNS Record Orphaning

Problem: If namespace cascade-deletes Ingress before external-dns processes the deletion, DNS records may not be removed.

Assessment: Low probability, but worth monitoring. external-dns runs a periodic reconciliation loop that should eventually remove stale records (assuming it is configured with --policy=sync rather than upsert-only).

Mitigation: Periodic audit of DNS records vs active PR previews.
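The audit could be a scheduled workflow that diffs provider DNS records against open PRs. In this sketch, the `pr-{n}.preview.example.com` record naming and the use of the `gh` CLI are assumptions:

```yaml
# Hypothetical scheduled audit; domain naming is assumed.
name: dns-orphan-audit
on:
  schedule:
    - cron: "0 6 * * *"   # daily

jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - name: Flag DNS records without an open PR
        run: |
          gh pr list --state open --json number --jq '.[].number' | sort > open_prs.txt
          # List pr-*.preview.example.com records via the DNS provider's API,
          # extract their PR numbers, and alert on any number missing from open_prs.txt.
```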

Deletion Order Considerations

Why Order Could Matter

| Scenario | If Deleted First | Impact |
|---|---|---|
| AtlasDatabaseUser | Services lose DB auth | Log errors (harmless for PRs) |
| Namespace | Everything cascade-deleted | Fast but potentially messy |
| Services | Clean pod termination | Ideal but slower |
| Ingress | DNS/cert cleanup may race | Usually fine |

ArgoCD Sync Waves and Deletion

Important: Sync waves on Applications affect sync order, not deletion order when ApplicationSet removes them.

  • During creation: wave -1 syncs before wave 0
  • During deletion: order is NOT guaranteed by waves

For controlled deletion order, you would need one of:

- App-of-Apps pattern (waves affect child app processing)
- Sequential deletion in the workflow
- Custom controller
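For reference, a sync wave is set with an annotation on each child Application in an App-of-Apps; the Application name below is illustrative:

```yaml
# Hypothetical child Application; the annotation is the point.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: pr-123-namespace               # illustrative name
  annotations:
    argocd.argoproj.io/sync-wave: "-1" # syncs before wave-0 apps
spec:
  # ... source, destination, syncPolicy ...
```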

Recommendations

Short-term (Implement Now)

1. Add MongoDB database cleanup to workflow
   - Priority: HIGH
   - Effort: LOW (simple mongosh command)
   - Prevents data accumulation

Medium-term (Consider for Future)

1. Monitor DNS record cleanup
   - Add a periodic audit script
   - Alert on orphaned records
2. Document cleanup expectations
   - Set expectations that PR cleanup is "best effort"
   - Not production-grade graceful shutdown

Long-term (If Needed)

1. Implement sequential deletion (only if issues arise)
   - More complex workflow
   - Slower cleanup
   - Only worth it if the current approach causes problems

Implementation Notes

MongoDB Cleanup Requirements

To drop a PR database, you need:

- A MongoDB connection string with admin privileges
- Access to run dropDatabase() on syrf_pr_{n}

Options:

1. Use Atlas admin API (REST call)
2. Use mongosh with admin connection string
3. Create a cleanup Job in Kubernetes (similar to db-reset)
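Option 3 could mirror the existing db-reset Job pattern. In this sketch the container image and the admin secret name are assumptions:

```yaml
# Hypothetical cleanup Job; image and secret names are assumed.
apiVersion: batch/v1
kind: Job
metadata:
  name: db-cleanup
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: mongosh
          image: mongo:7                # assumed image
          command: ["mongosh"]
          args:
            - "$(MONGO_ADMIN_URI)"      # expanded by Kubernetes from env
            - --eval
            - "db.getSiblingDB('syrf_pr_' + process.env.PR_NUM).dropDatabase()"
          env:
            - name: PR_NUM
              value: "123"              # injected per PR in practice
            - name: MONGO_ADMIN_URI
              valueFrom:
                secretKeyRef:
                  name: mongo-admin     # assumed secret name
                  key: uri
```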

Workflow Integration Point

MongoDB cleanup should happen:

- AFTER RabbitMQ vhost deletion
- BEFORE deleting cluster-gitops files
- While the AtlasDatabaseUser still exists (has credentials)

```yaml
cleanup-pr-preview:
  steps:
    - name: Delete RabbitMQ vhost
      # ... existing ...

    - name: Drop MongoDB database  # NEW
      # ... mongosh command ...

    - name: Delete cluster-gitops files
      # ... existing ...
```

Decision Log

| Date | Decision | Rationale |
|---|---|---|
| 2026-01-15 | Accept cascade deletion for PR previews | Ephemeral environments don't need graceful shutdown |
| 2026-01-15 | Prioritize MongoDB cleanup | Real data accumulation issue vs theoretical ordering issues |