Preview Environment Cleanup Improvements¶
Problem Statement¶
When the preview label is removed from a PR, the cleanup workflow deletes the PR's configuration from cluster-gitops. This triggers ArgoCD to delete the Application and all managed resources. However, two race conditions could cause resources to get stuck in deletion:
Issue 1: Stuck ArgoCD Applications¶
Symptoms:
- Application has deletionTimestamp but never completes deletion
- Error: unable to create new content in namespace pr-N because it is being terminated
Root Cause:
1. Cleanup deletes the PR folder from cluster-gitops
2. ApplicationSet marks the Application for deletion
3. ArgoCD attempts one more sync before deletion completes
4. The namespace is already terminating, so the sync fails
5. The Application is stuck because the sync can never complete
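The stuck state is visible in the Application's metadata: deletionTimestamp is set while finalizers remain. A minimal sketch of that check, run against a simulated JSON dump (the Application name is illustrative; in the cluster the JSON would come from kubectl get application <name> -n argocd -o json):

```shell
# Simulated Application manifest; live, this would come from:
#   kubectl get application <name> -n argocd -o json
cat > /tmp/app.json <<'EOF'
{
  "metadata": {
    "name": "pr-123-api",
    "deletionTimestamp": "2024-01-01T00:00:00Z",
    "finalizers": ["resources-finalizer.argocd.argoproj.io"]
  }
}
EOF

# An Application is stuck when deletionTimestamp is set but its
# finalizers list is non-empty, blocking garbage collection.
DELETION_TS=$(grep -o '"deletionTimestamp": *"[^"]*"' /tmp/app.json | cut -d'"' -f4)
if [ -n "$DELETION_TS" ]; then
  echo "stuck since $DELETION_TS"
fi
```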
Issue 2: Stuck AtlasDatabaseUser¶
Symptoms:
- Namespace stuck in Terminating state for hours/days
- AtlasDatabaseUser has mongodbatlas/finalizer that won't clear
Root Cause:
1. ExternalSecret atlas-operator-api-key is deleted (provides Atlas API credentials)
2. AtlasDatabaseUser needs those credentials to delete the user from Atlas
3. MongoDB Atlas Operator can't complete deletion without credentials
4. Finalizer blocks, namespace can't terminate
Error message:
failed to read Atlas API credentials from the secret pr-N/atlas-operator-api-key:
Secret "atlas-operator-api-key" not found
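To confirm this failure mode, check whether the Atlas finalizer is still on the AtlasDatabaseUser. A sketch of the check with a simulated finalizer list (live, it would come from kubectl get atlasdatabaseuser pr-user -n pr-N -o jsonpath='{.metadata.finalizers}'):

```shell
# Simulated finalizer list; live, this would come from:
#   kubectl get atlasdatabaseuser pr-user -n "pr-$PR_NUM" \
#     -o jsonpath='{.metadata.finalizers}'
FINALIZERS='["mongodbatlas/finalizer"]'

# If the Atlas finalizer is present, the operator has not finished
# (or cannot finish) deleting the user from Atlas.
if printf '%s' "$FINALIZERS" | grep -q 'mongodbatlas/finalizer'; then
  STUCK=yes
  echo "Atlas finalizer still present; namespace cannot terminate"
else
  STUCK=no
fi
```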
Current Solution (PR #2285)¶
Added fixes to both the cleanup and deploy workflows in .github/workflows/pr-preview.yml:
Fix 1: Clear Stuck Applications in Cleanup Job¶
A new step, added before the pre-cleanup phase, checks all Applications for the PR and removes finalizers from any stuck in deletion. The step itself is listed after Fix 2, below.
Fix 2: Clear Stuck Applications in Deploy Job (write-versions)¶
Critical fix for rapid remove/re-add: When the preview label is re-added quickly after removal, Applications may still have deletionTimestamp from the cleanup. The deploy job now checks ALL Applications and clears their finalizers before pushing new config:
- name: Clear stuck Applications if needed
  run: |
    PR_NUM=${{ github.event.pull_request.number }}
    for APP in $(kubectl get applications -n argocd -l pr-number="$PR_NUM" -o name 2>/dev/null); do
      APP_NAME=$(basename "$APP")
      DELETION_TS=$(kubectl get application "$APP_NAME" -n argocd \
        -o jsonpath='{.metadata.deletionTimestamp}' 2>/dev/null || echo "")
      if [ -n "$DELETION_TS" ]; then
        # Remove finalizers to allow deletion to complete
        kubectl patch application "$APP_NAME" -n argocd \
          --type json -p '[{"op": "remove", "path": "/metadata/finalizers"}]' 2>/dev/null || true
      fi
    done
    # Wait for namespace termination if needed
    NS_PHASE=$(kubectl get ns "pr-$PR_NUM" -o jsonpath='{.status.phase}' 2>/dev/null || echo "")
    if [ "$NS_PHASE" = "Terminating" ]; then
      kubectl wait --for=delete "namespace/pr-$PR_NUM" --timeout=120s || true
    fi
This ensures that when the label is re-added:
- Any Applications stuck in deletion are cleared
- The namespace finishes terminating (if needed)
- ApplicationSet creates fresh Applications with clean state
The cleanup-job step referenced under Fix 1 above:
- name: Clear stuck ArgoCD Applications
  if: ${{ steps.check-gcp.outputs.configured == 'true' }}
  continue-on-error: true
  run: |
    PR_NUM=${{ github.event.pull_request.number }}
    for APP in $(kubectl get applications -n argocd -l pr-number="$PR_NUM" -o name 2>/dev/null); do
      APP_NAME=$(basename "$APP")
      DELETION_TS=$(kubectl get application "$APP_NAME" -n argocd \
        -o jsonpath='{.metadata.deletionTimestamp}' 2>/dev/null || echo "")
      if [ -n "$DELETION_TS" ]; then
        echo "⚠️ Application $APP_NAME stuck in deletion since $DELETION_TS"
        kubectl patch application "$APP_NAME" -n argocd \
          --type json -p '[{"op": "remove", "path": "/metadata/finalizers"}]' 2>/dev/null || true
      fi
    done
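One detail worth noting in these steps: kubectl get ... -o name returns type-qualified names, so basename is needed before the name can be fed back into kubectl get application. Illustrated without a cluster (the Application name here is hypothetical):

```shell
# `kubectl get applications -o name` yields type-qualified names like
# this (hypothetical Application name for illustration):
APP="application.argoproj.io/pr-2285-api"

# basename strips everything up to the last slash, leaving the bare
# name that `kubectl get application <name>` expects.
APP_NAME=$(basename "$APP")
echo "$APP_NAME"
```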
Fix 3: Remove ALL Finalizers from AtlasDatabaseUser¶
Updated the pre-cleanup step to remove ALL finalizers from AtlasDatabaseUser, not just ArgoCD hook finalizers:
# AtlasDatabaseUser - handle both ArgoCD hook finalizer and MongoDB Atlas finalizer
if kubectl get atlasdatabaseuser pr-user -n "$NS" &>/dev/null; then
FINALIZERS=$(kubectl get atlasdatabaseuser pr-user -n "$NS" -o jsonpath='{.metadata.finalizers}' 2>/dev/null)
if [ -n "$FINALIZERS" ] && [ "$FINALIZERS" != "[]" ]; then
echo " Removing all finalizers from atlasdatabaseuser/pr-user"
kubectl patch atlasdatabaseuser pr-user -n "$NS" --type=merge \
-p '{"metadata":{"finalizers":null}}' 2>/dev/null || true
fi
fi
Known Caveat: Orphaned MongoDB Users¶
When we remove the mongodbatlas/finalizer without the Atlas Operator completing its cleanup, the MongoDB user may not be deleted from Atlas. This leaves an orphaned user.
Why this is acceptable for previews:
- The preview database (syrf_pr_N) is also being deleted - orphaned user has no data to access
- It's a low-privilege preview user, not production credentials
- The alternative (namespace stuck forever) is worse
- Orphaned users can be cleaned up periodically if needed
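If periodic cleanup is ever needed, the preview naming convention (syrf_pr_<N>_app, per the workflow below) makes orphans easy to identify. A sketch of the matching logic only, with a simulated username list (live, the list would come from the Atlas database-users API):

```shell
# Simulated username list; live, this would come from the Atlas
# database-users API for the project.
USERS="app_prod syrf_pr_101_app syrf_pr_204_app admin"

# Collect usernames matching the preview user naming convention.
CANDIDATES=""
for U in $USERS; do
  case "$U" in
    syrf_pr_*_app) CANDIDATES="$CANDIDATES $U"; echo "orphan candidate: $U" ;;
  esac
done
```

Before deleting, each candidate would still need to be cross-checked against open PRs so that users for live previews are left alone.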
Future Improvement: Complete Atlas User Cleanup¶
For a cleaner solution that ensures the Atlas user is actually deleted, consider implementing one of these approaches:
Option A: Dedicated API Key in Operator Namespace¶
Store the Atlas API key ExternalSecret in the mongodb-atlas-system namespace (where the operator runs) instead of the preview namespace. The operator can then always access credentials.
Pros:
- No workflow changes needed
- Operator handles cleanup naturally

Cons:
- Requires changes to the preview-infrastructure Helm chart
- Need to configure the operator to look in a different namespace for credentials
Option B: Direct Atlas API Cleanup in Workflow¶
Add a cleanup step that uses the Atlas API directly to delete the user before removing Kubernetes resources:
- name: Delete Atlas user via API
if: ${{ steps.check-gcp.outputs.configured == 'true' }}
env:
ATLAS_PUBLIC_KEY: ${{ secrets.ATLAS_PUBLIC_KEY }}
ATLAS_PRIVATE_KEY: ${{ secrets.ATLAS_PRIVATE_KEY }}
ATLAS_PROJECT_ID: ${{ secrets.ATLAS_PROJECT_ID }}
run: |
PR_NUM=${{ github.event.pull_request.number }}
USERNAME="syrf_pr_${PR_NUM}_app"
# Delete the user from Atlas directly. The Atlas Admin API uses HTTP
# digest auth; -f makes curl fail on HTTP errors so the fallback fires.
curl -s -f --digest -u "${ATLAS_PUBLIC_KEY}:${ATLAS_PRIVATE_KEY}" \
  -X DELETE \
  "https://cloud.mongodb.com/api/atlas/v1.0/groups/${ATLAS_PROJECT_ID}/databaseUsers/admin/${USERNAME}" \
  || echo "User may already be deleted or not exist"
Pros:
- Guarantees the user is deleted from Atlas
- Works even if the Kubernetes resources are already gone

Cons:
- Requires additional secrets (Atlas API keys)
- Duplicates logic the operator should handle
Option C: Use Atlas Operator's Deletion Protection¶
Configure the AtlasDatabaseUser with deletionProtection: false (already done) and ensure the ExternalSecret has a higher deletion wave than the AtlasDatabaseUser.
Current wave configuration:
- ExternalSecret atlas-operator-api-key: wave -10 (created early, deleted late)
- AtlasDatabaseUser pr-user: wave +10 (created late, deleted early)
This should work, but ArgoCD's cascade deletion doesn't always respect waves. The resources-finalizer triggers deletion of all resources simultaneously.
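The wave configuration above corresponds to annotations along these lines in the chart templates (a sketch using the standard ArgoCD sync-wave annotation, not the literal chart contents):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: atlas-operator-api-key
  annotations:
    argocd.argoproj.io/sync-wave: "-10"  # created early, deleted late
---
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  name: pr-user
  annotations:
    argocd.argoproj.io/sync-wave: "10"   # created late, deleted early
```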
Recommendation¶
For preview environments, the current solution (remove finalizers, accept potential orphaned users) is pragmatic and sufficient. If orphaned users become a problem, implement Option B (direct Atlas API cleanup) as it's the most reliable.
Testing¶
To verify the fix works:
1. Create a PR with the preview label
2. Wait for the preview environment to deploy
3. Remove the preview label
4. Verify:
   - No Applications stuck in the argocd namespace
   - No namespace stuck in Terminating state
   - The cleanup workflow completes successfully
Related Files¶
- .github/workflows/pr-preview.yml - Cleanup job implementation
- src/charts/preview-infrastructure/ - Helm chart for infrastructure resources
- cluster-gitops/argocd/applicationsets/syrf-previews.yaml - ApplicationSet configuration