MongoDB Permissions for Data Snapshot Automation

This document explains how MongoDB permissions work for the snapshot/restore feature, clarifying common misconceptions and documenting the actual permission model.

Critical Misconception: No Wildcard Database Permissions

⚠️ MongoDB does NOT support wildcard patterns for database names in role grants.

You cannot do this:

// ❌ THIS IS NOT POSSIBLE
db.grantRolesToUser("cleanup-user", [
  { role: "dbAdmin", db: "syrf_pr_*" }  // INVALID - wildcards don't work
])

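Each role grant names one specific database. To cover many databases with a single grant, you need a role that applies cluster-wide, such as the built-in roles below. On Atlas, database users are normally managed through the Atlas UI, the Administration API or the Atlas Operator. The mongosh snippets in this document illustrate the underlying permission model.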
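Because a pattern in a grant fails silently rather than matching anything, it can help to reject pattern-like database names before they reach a user definition. The following is a minimal sketch of such a pre-flight check. `findPatternLikeGrants` is a hypothetical helper, and role entries are assumed to use mongosh's `{ role, db }` shape:

```javascript
// Hypothetical pre-flight check: flags role grants whose database name
// contains glob-style characters. MongoDB does not expand these, so such a
// grant never covers the databases the author probably intended.
const PATTERN_CHARS = /[*?[\]]/;

function findPatternLikeGrants(roles) {
  return roles.filter((r) => PATTERN_CHARS.test(r.db));
}

const proposed = [
  { role: "dbAdmin", db: "syrf_pr_*" },     // mistaken wildcard
  { role: "readWrite", db: "syrf_pr_123" }, // explicit name: fine
];

for (const r of findPatternLikeGrants(proposed)) {
  console.log(`Rejecting ${r.role} on "${r.db}": wildcards are not supported`);
}
```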

MongoDB Role Scoping Options

Option 1: Specific Database Roles

Roles are granted per database. Each database must be explicitly named:

// ✅ Valid - explicit database names
db.grantRolesToUser("pr-123-user", [
  { role: "readWrite", db: "syrf_pr_123" },  // Specific to PR 123
  { role: "dbAdmin", db: "syrf_pr_123" },    // Can drop its own DB
  { role: "read", db: "syrf_snapshot" }      // Read from snapshot
])

Implication: You must create/update permissions for each new PR database. This is what the Atlas Operator does automatically.
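Since each PR database must be named explicitly, whatever provisions the user has to build the role list per PR. The following is a minimal sketch of that step. It assumes the `syrf_pr_N` naming and `use-snapshot` label described in this document, and emits roles in mongosh's `{ role, db }` shape. `prUserRoles` is a hypothetical helper; in practice the Atlas Operator resource shown later plays this role.

```javascript
// Hypothetical helper: builds the explicit per-database grants for one PR
// user. Wildcards are impossible, so the PR number is baked into each name.
function prUserRoles(prNumber, useSnapshot) {
  if (!Number.isInteger(prNumber) || prNumber <= 0) {
    throw new Error(`invalid PR number: ${prNumber}`);
  }
  const prDb = `syrf_pr_${prNumber}`;
  const roles = [
    { role: "readWrite", db: prDb }, // application read/write
    { role: "dbAdmin", db: prDb },   // lets the user drop its own database
  ];
  if (useSnapshot) {
    // Read access to the shared snapshot, only when restoring from it
    roles.push({ role: "read", db: "syrf_snapshot" });
  }
  return roles;
}

// Usage: the grants for PR 123 with the use-snapshot label.
console.log(JSON.stringify(prUserRoles(123, true)));
```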

Option 2: Cluster-Wide Built-in Roles

MongoDB provides cluster-wide built-in roles. When granted on the `admin` database, they apply to every database except `local` and `config`:

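| Role | Access Level | Databases Affected |
| --- | --- | --- |
| `readAnyDatabase` | Read | All except `local` and `config` |
| `readWriteAnyDatabase` | Read/Write | All except `local` and `config` |
| `dbAdminAnyDatabase` | Admin (drop, create, etc.) | All except `local` and `config` |
| `userAdminAnyDatabase` | User management | All except `local` and `config` |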

// ✅ Valid - cluster-wide role
db.grantRolesToUser("admin-user", [
  { role: "dbAdminAnyDatabase", db: "admin" }  // Can drop ANY database
])

⚠️ WARNING: These roles provide access to ALL databases including production! Use with extreme caution.
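Given that risk, it is worth auditing which users hold these roles. The following is a sketch. It assumes user records shaped like the entries of the `users` array that `db.getUsers()` returns in mongosh: a `user` field plus `roles: [{ role, db }]`. `usersWithClusterWideRoles` is hypothetical. It checks only the four roles in the table above, not other broad roles such as `root`.

```javascript
// The cluster-wide built-in roles from the table above. They take effect
// when granted on the admin database.
const ANY_DATABASE_ROLES = new Set([
  "readAnyDatabase",
  "readWriteAnyDatabase",
  "dbAdminAnyDatabase",
  "userAdminAnyDatabase",
]);

// Returns the names of users holding at least one of those roles.
function usersWithClusterWideRoles(users) {
  return users
    .filter((u) =>
      u.roles.some((r) => r.db === "admin" && ANY_DATABASE_ROLES.has(r.role)))
    .map((u) => u.user);
}

// Usage with illustrative records:
const users = [
  { user: "admin-user", roles: [{ role: "dbAdminAnyDatabase", db: "admin" }] },
  { user: "pr-123-user", roles: [{ role: "readWrite", db: "syrf_pr_123" }] },
];
console.log(usersWithClusterWideRoles(users)); // [ 'admin-user' ]
```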

Users in the Data Snapshot System

1. Snapshot Producer User

Purpose: Weekly CronJob that copies syrftest → syrf_snapshot

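| Database | Role | Purpose |
| --- | --- | --- |
| `syrftest` | `read` | Read production data (read-only) |
| `syrf_snapshot` | `readWrite` | Write snapshot data (also permits dropping individual collections) |
| `syrf_snapshot` | `dbAdmin` | Drop the whole database (`dropDatabase`) and other admin tasks before refresh |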

// Created manually in Atlas Console
{
  "user": "snapshot-producer",
  "roles": [
    { "role": "read", "db": "syrftest" },
    { "role": "readWrite", "db": "syrf_snapshot" },
    { "role": "dbAdmin", "db": "syrf_snapshot" }
  ]
}

Security: This user cannot write to production, so even a bug in the snapshot job cannot corrupt syrftest.
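This guarantee can be checked mechanically before the user is created. The following is a sketch under a simplified model that considers only database-scoped built-in roles and ignores cluster-wide and custom roles. `modifiableDatabases` is a hypothetical helper:

```javascript
// Database-scoped built-in roles that can change or destroy data:
// readWrite (writes, dropping collections), dbAdmin (dropDatabase),
// dbOwner (includes both).
const MODIFYING_ROLES = new Set(["readWrite", "dbAdmin", "dbOwner"]);

// Returns the set of databases a role list can modify or drop.
function modifiableDatabases(roles) {
  return new Set(
    roles.filter((r) => MODIFYING_ROLES.has(r.role)).map((r) => r.db)
  );
}

const snapshotProducerRoles = [
  { role: "read", db: "syrftest" },
  { role: "readWrite", db: "syrf_snapshot" },
  { role: "dbAdmin", db: "syrf_snapshot" },
];

// Fail fast if a change to the role list ever makes production writable.
if (modifiableDatabases(snapshotProducerRoles).has("syrftest")) {
  throw new Error("snapshot-producer must not be able to modify syrftest");
}
```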

2. PR-Specific Users (via Atlas Operator)

Purpose: Each PR gets an isolated user that can write only to its own database (and, when restoring, read the shared snapshot)

| Database | Role | Purpose |
| --- | --- | --- |
| `syrf_pr_N` | `readWrite` | Application read/write |
| `syrf_pr_N` | `dbAdmin` | Can drop its own database |
| `syrf_snapshot` | `read` | Read from snapshot for restore |

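# Created automatically by Atlas Operator for each PR
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  name: syrf-pr-123-user                # illustrative name
spec:
  projectRef:
    name: syrf-atlas-project            # hypothetical AtlasProject resource name
  username: syrf-pr-123-user
  passwordSecretRef:
    name: syrf-pr-123-user-password     # hypothetical Secret holding the password
  roles:
    - roleName: readWrite
      databaseName: syrf_pr_123
    - roleName: dbAdmin
      databaseName: syrf_pr_123
    - roleName: read
      databaseName: syrf_snapshot  # Only when the use-snapshot label is present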

Security: Each PR user is confined to its own database plus read-only access to the snapshot. It cannot access other PR databases or production.
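That boundary can be expressed as a check on the role list, for example before applying the operator manifests. The following is a sketch. It assumes the roles have been converted from the operator's `roleName`/`databaseName` fields to mongosh's `{ role, db }` shape. `isIsolatedPrUser` is hypothetical:

```javascript
// Roles a PR user may hold on its own database (per the table above).
const OWN_DB_ROLES = new Set(["readWrite", "dbAdmin"]);

// Hypothetical validator: true only if every grant is either an allowed role
// on the user's own syrf_pr_N database or read-only access to syrf_snapshot.
function isIsolatedPrUser(prNumber, roles) {
  const ownDb = `syrf_pr_${prNumber}`;
  return roles.every(
    (r) =>
      (r.db === ownDb && OWN_DB_ROLES.has(r.role)) ||
      (r.db === "syrf_snapshot" && r.role === "read")
  );
}

const pr123Roles = [
  { role: "readWrite", db: "syrf_pr_123" },
  { role: "dbAdmin", db: "syrf_pr_123" },
  { role: "read", db: "syrf_snapshot" },
];
console.log(isIsolatedPrUser(123, pr123Roles)); // true
console.log(isIsolatedPrUser(124, pr123Roles)); // false: wrong PR database
```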

3. Staging User (Existing)

Purpose: Application access for staging environment

| Database | Role | Purpose |
| --- | --- | --- |
| `syrf_staging` | `readWrite` | Application access |
| `syrf_staging` | `dbAdmin` | Schema management |

⚠️ Problem: Staging user does NOT have access to syrf_pr_* databases. It cannot drop PR databases.

The Cleanup Problem

When a PR is closed, who drops the syrf_pr_N database?

Challenge

- The PR user (syrf-pr-N-user) has dbAdmin on its own database and CAN drop it
- BUT the user is managed by Atlas Operator via an AtlasDatabaseUser custom resource in the PR namespace
- When the PR closes, the namespace (and the custom resource) gets deleted
- Atlas Operator then deletes the MongoDB user
- Race condition: Can we drop the database before the user is deleted?

Option A: PreDelete Hook with PR User (Current Approach)

The PR user drops its own database before ArgoCD deletes the namespace:

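# ArgoCD PreDelete hook - runs BEFORE namespace deletion
apiVersion: batch/v1
kind: Job
metadata:
  name: mongodb-cleanup              # illustrative name
  annotations:
    argocd.argoproj.io/hook: PreDelete
spec:
  template:
    spec:
      restartPolicy: Never           # Jobs require Never or OnFailure
      containers:
        - name: cleanup
          image: mongo:7.0
          command: ["/bin/bash", "-c"]
          args:
            - |
              set -euo pipefail
              # Abort instead of targeting "syrf_pr_" if PR_NUM is unset or empty
              : "${PR_NUM:?PR_NUM must be set}"
              mongosh "$MONGODB_URI" --eval "
                db.getSiblingDB('syrf_pr_${PR_NUM}').dropDatabase();
              "
          env:
            - name: PR_NUM
              value: "123"           # hypothetical; supply the real PR number via your templating
            - name: MONGODB_URI
              valueFrom:
                secretKeyRef:
                  name: mongodb-credentials  # PR user's credentials
                  key: connectionString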

Why it works:
1. PreDelete hook runs FIRST (before any resource deletion)
2. PR user credentials still exist at this point
3. PR user has dbAdmin on syrf_pr_N (its own DB only)
4. Database is dropped
5. THEN ArgoCD deletes the namespace
6. THEN Atlas Operator cleans up the MongoDB user

Pros:
- Least privilege: each user can drop only its own database
- No cluster-wide admin access needed

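Cons:
- If the hook fails, the database is orphaned
- Requires ArgoCD to run the hook to completion before it deletes any resources. Confirm that your ArgoCD version supports `PreDelete` hooks; `PostDelete` hooks, which run after deletion, have been available longer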
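The inline `--eval` in the hook above interpolates `PR_NUM` directly into the database name. An empty or unexpected value therefore silently targets the wrong name. The following is a sketch of a standalone script that validates the value first. It assumes the script is mounted into the container and run with `mongosh "$MONGODB_URI" --file cleanup.js`; the file name is hypothetical. It relies on mongosh's `process.env`, `print` and `printjson`.

```javascript
// cleanup.js (hypothetical): drop this PR's database from the PreDelete Job.
// Builds the database name from PR_NUM, refusing anything that is not a
// plain positive integer so the script can never target another database.
function prDatabaseName(prNum) {
  if (typeof prNum !== "string" || !/^[1-9][0-9]*$/.test(prNum)) {
    throw new Error(`PR_NUM must be a positive integer, got ${JSON.stringify(prNum)}`);
  }
  return `syrf_pr_${prNum}`;
}

// Only contact the server when running inside mongosh, where `db` exists;
// under plain Node this block is skipped, so the helper can be unit-tested.
if (typeof db !== "undefined") {
  const target = prDatabaseName(process.env.PR_NUM);
  print(`Dropping ${target}`);
  printjson(db.getSiblingDB(target).dropDatabase());
}
```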

Option B: Dedicated Cleanup User with `dbAdminAnyDatabase`

Create a powerful service account for cleanup operations:

{
  "user": "syrf-cleanup-admin",
  "roles": [
    { "role": "dbAdminAnyDatabase", "db": "admin" }
  ]
}

Pros:
- Can clean up any orphaned database
- Works as a fallback if PreDelete hooks fail

Cons:
- EXTREME RISK: Can drop production (syrftest) or any other database!
- Requires defense in depth: script validation, allowlists, etc.

Mandatory Safety Script:

# MUST validate before ANY operation
validate_target_db() {
  local target="$1"

  # ONLY allow syrf_pr_N databases
  if [[ ! "$target" =~ ^syrf_pr_[0-9]+$ ]]; then
    echo "FATAL: Cannot target database: $target" >&2
    exit 1
  fi

  # Explicit blocklist (redundant with the pattern above; kept as defense in depth)
  case "$target" in
    syrftest|syrfdev|syrf_snapshot|syrf_staging|admin|local|config)
      echo "FATAL: Protected database: $target" >&2
      exit 1
      ;;
  esac
}
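If the cleanup user ever exists, the same check also belongs immediately before each `dropDatabase` call, inside the mongosh script itself. That way, a code path that skips the shell function is still stopped. The following is a sketch mirroring `validate_target_db`. `assertSafeToDrop` and `safeDrop` are hypothetical names:

```javascript
// Mirrors validate_target_db: allow only syrf_pr_N names, and keep an explicit
// blocklist as a second, independent guard.
const PR_DB_PATTERN = /^syrf_pr_[0-9]+$/;
const PROTECTED_DBS = new Set([
  "syrftest", "syrfdev", "syrf_snapshot", "syrf_staging",
  "admin", "local", "config",
]);

function assertSafeToDrop(dbName) {
  if (!PR_DB_PATTERN.test(dbName)) {
    throw new Error(`FATAL: Cannot target database: ${dbName}`);
  }
  if (PROTECTED_DBS.has(dbName)) {
    throw new Error(`FATAL: Protected database: ${dbName}`);
  }
  return dbName;
}

// Inside mongosh (where `db` exists), every drop goes through the guard.
function safeDrop(dbName) {
  return db.getSiblingDB(assertSafeToDrop(dbName)).dropDatabase();
}
```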

Option C: Atlas Admin API

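Use the MongoDB Atlas Administration API instead of database users. The request below is illustrative and unverified. The Administration API manages Atlas resources such as projects, clusters and database users, and it may not expose an operation for dropping a database. Confirm the endpoint in the API reference before pursuing this option.

# Illustrative only: confirm this endpoint exists before use.
# Atlas programmatic API keys authenticate with HTTP Digest;
# ATLAS_PUBLIC_KEY / ATLAS_PRIVATE_KEY are placeholder variable names.
curl --digest --user "$ATLAS_PUBLIC_KEY:$ATLAS_PRIVATE_KEY" \
  -H "Accept: application/vnd.atlas.2023-01-01+json" \
  -X DELETE \
  "https://cloud.mongodb.com/api/atlas/v2/groups/{projectId}/clusters/{clusterName}/databases/syrf_pr_123"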

Pros:
- Doesn't require a powerful MongoDB user
- Uses the same API key that Atlas Operator already has

Cons:
- More complex implementation
- Different authentication mechanism

Option D: Accept Orphans + Manual Cleanup

Don't automatically drop databases. Let them accumulate and clean up periodically.

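# Manual cleanup script (run occasionally by admin).
# KEEP_PRS (hypothetical variable): space-separated numbers of PRs that are
# still open, e.g. "101 123", so their live databases are skipped.
export KEEP_PRS="${KEEP_PRS?set KEEP_PRS to the open PR numbers (may be empty)}"
mongosh "$ADMIN_URI" --eval '
  const keep = new Set((process.env.KEEP_PRS || "").split(/\s+/).filter(Boolean));
  db.adminCommand("listDatabases").databases
    .filter(d => /^syrf_pr_[0-9]+$/.test(d.name))
    .filter(d => !keep.has(d.name.slice("syrf_pr_".length)))
    .forEach(d => {
      print("Dropping: " + d.name);
      db.getSiblingDB(d.name).dropDatabase();
    });
'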

Pros:
- Simplest implementation
- No new automated credentials: an admin runs the cleanup by hand with their own access

Cons:
- Databases accumulate (storage cost)
- Requires manual intervention

Recommended Approach

Primary: Option A (PreDelete hook with PR user)
Fallback: Option D (Manual cleanup for orphans)

The PreDelete hook handles the normal case. For edge cases where hooks fail, periodic manual cleanup (weekly/monthly) handles orphans.

Why NOT Option B?

Having a user with dbAdminAnyDatabase is dangerous. Even with script validation, a single bug or typo could drop production. The blast radius is too large.

Why NOT Option C?

Added complexity without significant benefit. The Atlas API approach requires different auth and more code.

Implementation Notes

Current State (pr-preview.yml)

The cleanup-tags job in the workflow currently uses staging credentials (mongo-db secret):

- name: Cleanup MongoDB database
  env:
    MONGO_URI: ${{ secrets.MONGO_DB_URI }}  # Staging credentials

Problem: Staging user doesn't have access to syrf_pr_* databases.

Fix Options:
1. Use the PreDelete hook (primary cleanup mechanism)
2. Accept that this fallback won't work (rely on the PreDelete hook)
3. Create a syrf-cleanup-admin user (increases risk surface)

Recommended Fix

Remove the MongoDB cleanup from the cleanup-tags job and rely on the PreDelete hook for database cleanup. The workflow should only:
- Delete the namespace from cluster-gitops
- Delete the ArgoCD Application (if needed)
- Trust that the PreDelete hook dropped the database

If databases are orphaned, clean them up manually/periodically.

Summary Table

| User | syrftest | syrf_snapshot | syrf_staging | syrf_pr_N |
| --- | --- | --- | --- | --- |
| `snapshot-producer` | 📖 READ | ✏️ WRITE + 🗑️ DROP | ❌ | ❌ |
| `syrf-pr-N-user` | ❌ | 📖 READ | ❌ | ✏️ WRITE + 🗑️ DROP |
| Staging user | ❌ | ❌ | ✏️ WRITE + 🗑️ DROP | ❌ |
| Production user | ✏️ WRITE | ❌ | ❌ | ❌ |
| `syrf-cleanup-admin`* | ⚠️ DROP | ⚠️ DROP | ⚠️ DROP | ⚠️ DROP |

*Only if Option B is implemented (not recommended)
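The table can be derived from the role grants listed in the earlier sections instead of being maintained by hand. The following is a sketch under a simplified model of the built-in roles. It covers only `read`, `readWrite`, `dbAdmin` and `dbAdminAnyDatabase`, and "drop" means `dropDatabase`. The PR user is shown for a hypothetical PR 123, and "staging-user" is a placeholder name. The production user is omitted because its roles are not documented here.

```javascript
// Role grants from the sections above, in mongosh's { role, db } shape.
const USER_ROLES = {
  "snapshot-producer": [
    { role: "read", db: "syrftest" },
    { role: "readWrite", db: "syrf_snapshot" },
    { role: "dbAdmin", db: "syrf_snapshot" },
  ],
  "syrf-pr-123-user": [ // hypothetical PR 123 with the use-snapshot label
    { role: "readWrite", db: "syrf_pr_123" },
    { role: "dbAdmin", db: "syrf_pr_123" },
    { role: "read", db: "syrf_snapshot" },
  ],
  "staging-user": [ // placeholder name
    { role: "readWrite", db: "syrf_staging" },
    { role: "dbAdmin", db: "syrf_staging" },
  ],
  "syrf-cleanup-admin": [{ role: "dbAdminAnyDatabase", db: "admin" }],
};

// Capabilities one grant confers on dbName ("drop" = dropDatabase).
function grantCapabilities(grant, dbName) {
  if (grant.role === "dbAdminAnyDatabase" && grant.db === "admin") {
    // Applies to every database except local and config.
    return dbName === "local" || dbName === "config" ? [] : ["drop"];
  }
  if (grant.db !== dbName) return [];
  switch (grant.role) {
    case "read": return ["read"];
    case "readWrite": return ["read", "write"];
    case "dbAdmin": return ["drop"];
    default: return [];
  }
}

// Sorted list of capabilities a user has on dbName.
function capabilities(user, dbName) {
  const caps = new Set();
  for (const grant of USER_ROLES[user] ?? []) {
    for (const cap of grantCapabilities(grant, dbName)) caps.add(cap);
  }
  return [...caps].sort();
}

console.log(capabilities("snapshot-producer", "syrftest")); // [ 'read' ]
console.log(capabilities("syrf-cleanup-admin", "syrftest")); // [ 'drop' ]
```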
