
Preview Environment Cleanup Improvements

Problem Statement

When the preview label is removed from a PR, the cleanup workflow deletes the PR's configuration from cluster-gitops. This triggers ArgoCD to delete the Application and all managed resources. However, two race conditions could cause resources to get stuck in deletion:

Issue 1: Stuck ArgoCD Applications

Symptoms:

- Application has deletionTimestamp but never completes deletion
- Error: unable to create new content in namespace pr-N because it is being terminated

Root Cause:

1. Cleanup deletes the PR folder from cluster-gitops
2. The ApplicationSet marks the Application for deletion
3. ArgoCD attempts one more sync before deletion completes
4. The namespace is already terminating, so the sync fails
5. The Application is stuck because the sync can never complete
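On a live cluster, this state can be spotted with a short check like the following sketch (PR_NUM is a hypothetical example; pr-number is the label this repo's ApplicationSet applies):

```shell
# Diagnostic sketch for Issue 1: list Applications whose deletion has
# started (deletionTimestamp set) but never finished.
PR_NUM=123  # hypothetical example PR number

# Pure helper: a non-empty deletionTimestamp means deletion is in progress.
is_stuck() {
  [ -n "$1" ]
}

for APP in $(kubectl get applications -n argocd -l pr-number="$PR_NUM" -o name 2>/dev/null); do
  TS=$(kubectl get "$APP" -n argocd -o jsonpath='{.metadata.deletionTimestamp}' 2>/dev/null)
  if is_stuck "$TS"; then
    echo "stuck: $APP (deleting since $TS)"
  fi
done
```

Empty output means no Application for the PR is stuck mid-deletion.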

Issue 2: Stuck AtlasDatabaseUser

Symptoms:

- Namespace stuck in the Terminating state for hours or days
- AtlasDatabaseUser has a mongodbatlas/finalizer that won't clear

Root Cause:

1. The ExternalSecret atlas-operator-api-key (which provides the Atlas API credentials) is deleted
2. The AtlasDatabaseUser needs those credentials to delete the user from Atlas
3. The MongoDB Atlas Operator can't complete deletion without credentials
4. The finalizer blocks, so the namespace can't terminate

Error message:

failed to read Atlas API credentials from the secret pr-N/atlas-operator-api-key:
Secret "atlas-operator-api-key" not found
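The blocked finalizer can be confirmed directly; a sketch (PR_NUM is a hypothetical example; pr-user is the AtlasDatabaseUser name used by this repo's preview chart):

```shell
# Diagnostic sketch for Issue 2: see what is holding a preview namespace
# in Terminating.
PR_NUM=123  # hypothetical example PR number

# Pure helper: map a PR number to its preview namespace.
ns_for_pr() {
  echo "pr-$1"
}

NS=$(ns_for_pr "$PR_NUM")

# Finalizers still attached to the AtlasDatabaseUser block teardown.
kubectl get atlasdatabaseuser pr-user -n "$NS" \
  -o jsonpath='{.metadata.finalizers}' 2>/dev/null

# The namespace's status conditions usually name the blocking resources.
kubectl get ns "$NS" -o jsonpath='{.status.conditions[*].message}' 2>/dev/null
```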

Current Solution (PR #2285)

Added fixes to both the cleanup and deploy workflows in .github/workflows/pr-preview.yml:

Fix 1: Clear Stuck Applications in Cleanup Job

A new step runs before pre-cleanup: it checks all of the PR's Applications and removes finalizers from any that are stuck in deletion; the implementation is shown below.

Fix 2: Clear Stuck Applications in Deploy Job (write-versions)

Critical fix for rapid remove/re-add: When the preview label is re-added quickly after removal, Applications may still have deletionTimestamp from the cleanup. The deploy job now checks ALL Applications and clears their finalizers before pushing new config:

- name: Clear stuck Applications if needed
  run: |
    PR_NUM=${{ github.event.pull_request.number }}

    for APP in $(kubectl get applications -n argocd -l pr-number="$PR_NUM" -o name); do
      APP_NAME=$(basename "$APP")
      DELETION_TS=$(kubectl get application "$APP_NAME" -n argocd \
        -o jsonpath='{.metadata.deletionTimestamp}' || echo "")

      if [ -n "$DELETION_TS" ]; then
        # Remove finalizers to allow deletion to complete
        kubectl patch application "$APP_NAME" -n argocd \
          --type json -p '[{"op": "remove", "path": "/metadata/finalizers"}]'
      fi
    done

    # Wait for namespace termination if needed
    NS_PHASE=$(kubectl get ns "pr-$PR_NUM" -o jsonpath='{.status.phase}' 2>/dev/null || echo "")
    if [ "$NS_PHASE" = "Terminating" ]; then
      kubectl wait --for=delete "namespace/pr-$PR_NUM" --timeout=120s
    fi

This ensures that when the label is re-added:

  1. Any Applications stuck in deletion are cleared
  2. The namespace finishes terminating (if needed)
  3. ApplicationSet creates fresh Applications with clean state

The cleanup-job step from Fix 1:

- name: Clear stuck ArgoCD Applications
  if: ${{ steps.check-gcp.outputs.configured == 'true' }}
  continue-on-error: true
  run: |
    PR_NUM=${{ github.event.pull_request.number }}

    for APP in $(kubectl get applications -n argocd -l pr-number="$PR_NUM" -o name 2>/dev/null); do
      APP_NAME=$(basename "$APP")
      DELETION_TS=$(kubectl get application "$APP_NAME" -n argocd \
        -o jsonpath='{.metadata.deletionTimestamp}' 2>/dev/null || echo "")

      if [ -n "$DELETION_TS" ]; then
        echo "⚠️ Application $APP_NAME stuck in deletion since $DELETION_TS"
        kubectl patch application "$APP_NAME" -n argocd \
          --type json -p '[{"op": "remove", "path": "/metadata/finalizers"}]' 2>/dev/null || true
      fi
    done

Fix 3: Remove ALL Finalizers from AtlasDatabaseUser

Updated the pre-cleanup step to remove ALL finalizers from AtlasDatabaseUser, not just ArgoCD hook finalizers:

# AtlasDatabaseUser - handle both ArgoCD hook finalizer and MongoDB Atlas finalizer
if kubectl get atlasdatabaseuser pr-user -n "$NS" &>/dev/null; then
  FINALIZERS=$(kubectl get atlasdatabaseuser pr-user -n "$NS" -o jsonpath='{.metadata.finalizers}' 2>/dev/null)
  if [ -n "$FINALIZERS" ] && [ "$FINALIZERS" != "[]" ]; then
    echo "  Removing all finalizers from atlasdatabaseuser/pr-user"
    kubectl patch atlasdatabaseuser pr-user -n "$NS" --type=merge \
      -p '{"metadata":{"finalizers":null}}' 2>/dev/null || true
  fi
fi

Known Caveat: Orphaned MongoDB Users

When we remove the mongodbatlas/finalizer without the Atlas Operator completing its cleanup, the MongoDB user may not be deleted from Atlas. This leaves an orphaned user.

Why this is acceptable for previews:

- The preview database (syrf_pr_N) is also being deleted, so an orphaned user has no data to access
- It's a low-privilege preview user, not production credentials
- The alternative (a namespace stuck forever) is worse
- Orphaned users can be cleaned up periodically if needed
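If periodic cleanup is ever needed, a sketch like this could list candidates via the Atlas Admin API (assumes the same Atlas secrets as Option B below plus jq; syrf_pr_N_app is this repo's naming convention, and the actual deletion is left commented so active PRs can be cross-checked first):

```shell
# Sketch: list database users that match the preview naming convention,
# as candidates for orphan cleanup.

# Pure helper: does a username follow the preview pattern syrf_pr_N_app?
matches_preview_user() {
  case "$1" in
    syrf_pr_*_app) return 0 ;;
    *) return 1 ;;
  esac
}

# The Atlas Admin API uses HTTP digest auth.
USERS=$(curl -sf --digest -u "${ATLAS_PUBLIC_KEY}:${ATLAS_PRIVATE_KEY}" \
  "https://cloud.mongodb.com/api/atlas/v1.0/groups/${ATLAS_PROJECT_ID}/databaseUsers" \
  2>/dev/null | jq -r '.results[].username' 2>/dev/null)

for U in $USERS; do
  if matches_preview_user "$U"; then
    echo "candidate orphan: $U"
    # After confirming the PR is closed and its namespace is gone:
    # curl -sf --digest -u "${ATLAS_PUBLIC_KEY}:${ATLAS_PRIVATE_KEY}" -X DELETE \
    #   "https://cloud.mongodb.com/api/atlas/v1.0/groups/${ATLAS_PROJECT_ID}/databaseUsers/admin/$U"
  fi
done
```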

Future Improvement: Complete Atlas User Cleanup

For a cleaner solution that ensures the Atlas user is actually deleted, consider implementing one of these approaches:

Option A: Dedicated API Key in Operator Namespace

Store the Atlas API key ExternalSecret in the mongodb-atlas-system namespace (where the operator runs) instead of the preview namespace. The operator can then always access credentials.

Pros:

- No workflow changes needed
- The operator handles cleanup naturally

Cons:

- Requires changes to the preview-infrastructure Helm chart
- The operator must be configured to look in a different namespace for credentials
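A sketch of what Option A might look like (the secret-store and remote-key names are hypothetical placeholders; the namespace change is the point):

```yaml
# Sketch: render the Atlas API key into the operator's namespace so the
# operator always has credentials available at deletion time.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: atlas-operator-api-key
  namespace: mongodb-atlas-system    # operator namespace, not pr-N
spec:
  secretStoreRef:
    name: gcp-secret-store           # hypothetical ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: atlas-operator-api-key
  dataFrom:
    - extract:
        key: atlas-operator-api-key  # hypothetical remote secret key
```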

Option B: Direct Atlas API Cleanup in Workflow

Add a cleanup step that uses the Atlas API directly to delete the user before removing Kubernetes resources:

- name: Delete Atlas user via API
  if: ${{ steps.check-gcp.outputs.configured == 'true' }}
  env:
    ATLAS_PUBLIC_KEY: ${{ secrets.ATLAS_PUBLIC_KEY }}
    ATLAS_PRIVATE_KEY: ${{ secrets.ATLAS_PRIVATE_KEY }}
    ATLAS_PROJECT_ID: ${{ secrets.ATLAS_PROJECT_ID }}
  run: |
    PR_NUM=${{ github.event.pull_request.number }}
    USERNAME="syrf_pr_${PR_NUM}_app"

    # Delete the user from Atlas directly (the Admin API uses HTTP digest auth;
    # -f makes curl fail on HTTP errors so the fallback message fires)
    curl -sf --digest -u "${ATLAS_PUBLIC_KEY}:${ATLAS_PRIVATE_KEY}" \
      -X DELETE \
      "https://cloud.mongodb.com/api/atlas/v1.0/groups/${ATLAS_PROJECT_ID}/databaseUsers/admin/${USERNAME}" \
      || echo "User may already be deleted or not exist"

Pros:

- Guarantees the user is deleted from Atlas
- Works even if the Kubernetes resources are already gone

Cons:

- Requires additional secrets (the Atlas API keys)
- Duplicates logic that the operator should handle

Option C: Use Atlas Operator's Deletion Protection

Configure the AtlasDatabaseUser with deletionProtection: false (already done) and ensure the ExternalSecret has a higher deletion wave than the AtlasDatabaseUser.

Current wave configuration:

- ExternalSecret atlas-operator-api-key: wave -10 (created early, deleted late)
- AtlasDatabaseUser pr-user: wave +10 (created late, deleted early)

This should work, but ArgoCD's cascade deletion doesn't always respect waves. The resources-finalizer triggers deletion of all resources simultaneously.
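The wave setup above presumably corresponds to ArgoCD sync-wave annotations along these lines (a sketch; only the relevant metadata is shown):

```yaml
# Sketch: sync-wave annotations matching the wave configuration above.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: atlas-operator-api-key
  annotations:
    argocd.argoproj.io/sync-wave: "-10"  # created early, deleted late
---
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  name: pr-user
  annotations:
    argocd.argoproj.io/sync-wave: "10"   # created late, deleted early
```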

Recommendation

For preview environments, the current solution (remove finalizers, accept potential orphaned users) is pragmatic and sufficient. If orphaned users become a problem, implement Option B (direct Atlas API cleanup) as it's the most reliable.

Testing

To verify the fix works:

  1. Create a PR with preview label
  2. Wait for preview environment to deploy
  3. Remove the preview label
  4. Verify that:
     • No Applications are stuck in the argocd namespace
     • No namespace is stuck in the Terminating state
     • The cleanup workflow completes successfully
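The verification steps can be scripted; a sketch (PR_NUM is a hypothetical example):

```shell
# Sketch: post-cleanup verification for a preview environment.
PR_NUM=123  # hypothetical example PR number

# Pure helper: succeed only when the argument is empty.
expect_empty() {
  [ -z "$1" ]
}

APPS=$(kubectl get applications -n argocd -l pr-number="$PR_NUM" -o name 2>/dev/null)
expect_empty "$APPS" && echo "OK: no Applications remain for PR $PR_NUM"

NS_PHASE=$(kubectl get ns "pr-$PR_NUM" -o jsonpath='{.status.phase}' 2>/dev/null)
expect_empty "$NS_PHASE" && echo "OK: namespace pr-$PR_NUM is gone"
```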
Related Files

  • .github/workflows/pr-preview.yml - Cleanup job implementation
  • src/charts/preview-infrastructure/ - Helm chart for infrastructure resources
  • cluster-gitops/argocd/applicationsets/syrf-previews.yaml - ApplicationSet configuration