
Data Snapshot Automation - Implementation Specification

Status: Draft - Pending Review
Author: Claude (Senior Systems Architect)
Date: 2026-01-13
Target Environment: GKE Cluster (camaradesuk) / MongoDB Atlas


Executive Summary

This specification details the implementation of automated production data snapshots for use in PR preview and staging environments. The solution uses a snapshot database approach where a dedicated MongoDB database (syrf_snapshot) on the Atlas cluster is refreshed weekly (or on-demand) from production, and other environments copy data from it.

Key Design Decisions:

  • No PII anonymization required (confirmed by stakeholder)
  • Single snapshot database instead of file-based storage (GCS)
  • PR label-based configuration (use-snapshot for data source, persist-db for lock)
  • Weekly refresh (Sunday 3 AM UTC) with on-demand capability
  • ~20GB database size
  • $out aggregation for both jobs (fast, data stays in Atlas):
      • Producer (prod → snapshot): uses the syrf_snapshot_producer user (read prod, write snapshot)
      • Restore (snapshot → PR): uses the PR-specific user (read snapshot, write own DB only)
  • Defense in depth: no snapshot/restore user can write to production (syrftest)

Table of Contents

  1. High-Level Architecture
  2. Detailed Design
  3. Execution Flow
  4. Edge Cases & Mitigations
  5. Testing Strategy
  6. Implementation Checklist
  7. Open Questions
  8. References

1. High-Level Architecture

1.1 Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                        MongoDB Atlas Cluster (Cluster0)                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐                      ┌─────────────────┐               │
│  │   Production    │  Weekly CronJob      │    Snapshot     │               │
│  │   syrftest      │─────────────────────▶│  syrf_snapshot  │               │
│  │   (~20GB)       │  (copy collections)  │   (~20GB)       │               │
│  └─────────────────┘                      └────────┬────────┘               │
│                                                    │                        │
│                           ┌────────────────────────┼────────────────────┐   │
│                           │                        │                    │   │
│                           ▼                        ▼                    ▼   │
│                  ┌─────────────────┐    ┌─────────────────┐    ┌──────────┐ │
│                  │    Staging      │    │   PR Preview    │    │   PR N   │ │
│                  │  syrf_staging   │    │  syrf_pr_123    │    │ syrf_pr..│ │
│                  └─────────────────┘    └─────────────────┘    └──────────┘ │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

1.2 Key Components

| Component | Purpose | Location |
|---|---|---|
| Snapshot Producer | Weekly CronJob that copies production → syrf_snapshot | GKE syrf-system namespace |
| Snapshot Restore Job | PreSync Job that copies syrf_snapshot → target DB | Per-environment namespace |
| PR Preview Workflow | Detects use-snapshot label, generates restore job | GitHub Actions |
| Staging Values | GitOps config to enable/disable snapshot source | cluster-gitops |

1.3 Dependencies

  • MongoDB Atlas cluster with read access to production (syrftest)
  • MongoDB Atlas database user for snapshot operations
  • GKE cluster with jobs capability
  • Existing PR preview infrastructure (Atlas Operator, Kyverno)

1.4 Collections to Copy

Based on codebase analysis, these collections are required:

INCLUDE (Core Application Data):

pmProject              # Projects, stages, memberships
pmStudy                # Studies, annotations, screening
pmInvestigator         # User accounts
pmSystematicSearch     # Literature searches
pmDataExportJob        # Export job tracking
pmStudyCorrection      # PDF correction requests
pmInvestigatorUsage    # Usage statistics
pmRiskOfBiasAiJob      # AI risk of bias jobs
pmProjectDailyStat     # Daily statistics
pmPotential            # Pre-registration records
pmInvestigatorEmail    # Email lookup cache

EXCLUDE (Infrastructure/Temporary):

resumePoints           # Change stream tokens
system.*               # MongoDB system collections
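The producer job pins the INCLUDE list explicitly; as a cross-check, the EXCLUDE rules above could also be applied to a live collection listing. A minimal sketch (the filter_snapshot_collections helper is illustrative, not part of the spec):

```shell
# Hypothetical helper: given collection names on stdin (one per line, e.g.
# the output of db.getCollectionNames()), drop the excluded infrastructure
# collections (resumePoints and anything under system.*).
filter_snapshot_collections() {
  grep -v -e '^resumePoints$' -e '^system\.' || true
}

# Example: a mixed listing from a production database
printf '%s\n' pmProject resumePoints system.views pmStudy \
  | filter_snapshot_collections
# prints: pmProject and pmStudy (one per line)
```

Comparing this filtered list against the pinned INCLUDE list would flag new production collections that the snapshot job does not yet copy.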

1.5 Label Interaction Model

The PR preview system uses three labels with distinct purposes:

| Label | Purpose | Behavior |
|---|---|---|
| preview | Trigger - enables preview environment | Creates namespace, deploys services |
| use-snapshot | Source - determines data source | When DB is created, use snapshot instead of seed data |
| persist-db | Lock - protects database | Prevents ALL database modifications |

1.5.1 Core Principle: persist-db as a Lock

persist-db ALWAYS takes precedence. When present:

  • Database is NEVER dropped
  • Database is NEVER refreshed/recreated
  • Label changes (use-snapshot) are reverted with an explanatory comment
  • PR close/merge does NOT delete the database (requires manual cleanup)

1.5.2 Label State Matrix

| persist-db | use-snapshot | On Sync | Database Action |
|---|---|---|---|
| absent | absent | Normal | Drop DB → Seed fresh data (5 sample projects) |
| absent | present | Normal | Drop DB → Restore from snapshot |
| present | absent | Normal | No action - DB protected |
| present | present | Normal | No action - DB protected |
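The state matrix reduces to a small decision function, with persist-db checked first. A sketch in shell (function name and return strings are illustrative):

```shell
# Hypothetical sketch of the label state matrix: given whether each label is
# present ("true"/"false"), print the action a normal sync should take.
# persist-db always wins, per section 1.5.1.
decide_db_action() {
  persist_db="$1"   # "true" if the persist-db label is present
  use_snapshot="$2" # "true" if the use-snapshot label is present

  if [ "$persist_db" = "true" ]; then
    echo "no-action (DB protected)"
  elif [ "$use_snapshot" = "true" ]; then
    echo "drop-and-restore-from-snapshot"
  else
    echo "drop-and-seed"
  fi
}

decide_db_action false false  # prints: drop-and-seed
decide_db_action false true   # prints: drop-and-restore-from-snapshot
decide_db_action true  true   # prints: no-action (DB protected)
```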

1.5.3 Label Change Behavior

When use-snapshot is added:

persist-db absent?
    ├─ YES → Immediately drop DB, restore from snapshot
    │        Post comment: "🗄️ Database recreated from production snapshot (taken: <timestamp>)"
    └─ NO  → Revert label addition, post comment:
             "⚠️ Cannot enable snapshot mode: database is locked by persist-db label.
              Remove persist-db first if you want to refresh the database."

When use-snapshot is removed:

persist-db absent?
    ├─ YES → Immediately drop DB (next sync will seed fresh data)
    │        Post comment: "🌱 Database dropped. Fresh seed data will be created on next sync."
    └─ NO  → Revert label removal, post comment:
             "⚠️ Cannot disable snapshot mode: database is locked by persist-db label.
              Remove persist-db first if you want to change data source."

When persist-db is added:

Post comment: "🔒 Database is now LOCKED. It will not be modified or deleted,
               even when this PR is closed/merged. Remove this label to unlock."

When persist-db is removed:

Is PR still open?
    ├─ YES → Post comment: "🔓 Database unlocked. Next sync will apply current data source
    │        (snapshot: <yes/no>). If you don't push a new commit, manually trigger a sync."
    │        → Next sync applies current use-snapshot state
    └─ NO (PR closed/merged) → Immediately drop database
             Post comment: "🗑️ Database syrf_pr_N dropped (persist-db lock removed on closed PR)"
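The persist-db removal branch above depends only on the PR's open/closed state, and can be sketched as a tiny helper (names are illustrative):

```shell
# Hypothetical sketch of the "persist-db removed" decision tree: the action
# depends only on whether the PR is still open.
on_persist_db_removed() {
  pr_state="$1" # "open" or "closed"
  if [ "$pr_state" = "open" ]; then
    echo "unlock: next sync applies current use-snapshot state"
  else
    echo "drop-database-immediately"
  fi
}

on_persist_db_removed open    # prints: unlock: next sync applies current use-snapshot state
on_persist_db_removed closed  # prints: drop-database-immediately
```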

1.5.4 PR Close/Merge Behavior

PR Closed/Merged
┌─────────────────────────────────────────────────────────────┐
│ Check: Is persist-db label present?                         │
├─────────────────────────────────────────────────────────────┤
│ YES → DO NOT delete database                                │
│       Post comment: "⚠️ Database syrf_pr_N was NOT deleted  │
│       because persist-db label is present.                  │
│                                                             │
│       To delete: Remove the persist-db label from this PR   │
│       (even though it's closed) and the database will be    │
│       automatically dropped."                               │
├─────────────────────────────────────────────────────────────┤
│ NO  → Delete database as normal                             │
│       Post comment: "✅ Database syrf_pr_N deleted"         │
└─────────────────────────────────────────────────────────────┘

1.5.5 First Deploy with Both Labels

When a new PR is created with BOTH use-snapshot AND persist-db labels from the start:

  1. First sync: Create DB from snapshot (honoring use-snapshot)
  2. Subsequent syncs: Database is locked (honoring persist-db)

This allows users to initialize with production data and then lock it for testing.

1.5.6 Orphan Database Cleanup

Databases from closed PRs with persist-db remain until manually cleaned:

Manual Cleanup Options:

  1. Remove the persist-db label from the closed PR → triggers automatic deletion
  2. Direct MongoDB cleanup via admin tools (for bulk cleanup)

No automatic expiration - databases persist indefinitely until explicitly cleaned.
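For bulk cleanup via admin tools, one approach is to diff the cluster's database names against currently open PR numbers. A hedged sketch (the helper name is illustrative, and any candidate should be confirmed as intentionally abandoned before dropping, since persisted databases may still be in use):

```shell
# Hypothetical helper for bulk-cleanup tooling: given database names on stdin
# (one per line) and a space-separated list of open PR numbers, print the
# syrf_pr_N databases whose PR is no longer open.
list_orphan_pr_dbs() {
  open_prs="$1" # e.g. "123 456"
  while IFS= read -r db; do
    case "$db" in
      syrf_pr_*)
        num="${db#syrf_pr_}"
        found=false
        for pr in $open_prs; do
          [ "$pr" = "$num" ] && found=true
        done
        [ "$found" = "false" ] && echo "$db"
        ;;
    esac
  done
}

# Example: PR 101 is still open; syrf_pr_202 is an orphan candidate
printf '%s\n' syrf_pr_101 syrf_pr_202 syrf_staging syrf_snapshot \
  | list_orphan_pr_dbs "101"
# prints: syrf_pr_202
```

Non-PR databases (syrf_staging, syrf_snapshot) are never reported because they do not match the syrf_pr_N pattern.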


2. Detailed Design

2.1 MongoDB Permission Model

📖 Detailed Reference: See MongoDB Permissions Explained for comprehensive documentation of the permission model, including common misconceptions and cleanup strategies.

2.1.1 Security Principles

⚠️ CRITICAL: Production (syrftest) must NEVER be writable by any snapshot/restore job.

⚠️ IMPORTANT: MongoDB does NOT support wildcard database permissions. You cannot grant permissions on a pattern like syrf_pr_*. Each database must be explicitly named in role grants, OR you must use cluster-wide roles like dbAdminAnyDatabase (which provides access to ALL databases including production).

The permission model follows defense-in-depth:

  1. Principle of Least Privilege: Each user has minimum permissions needed
  2. Database-Level Isolation: No user can write to production
  3. Explicit Database Grants: Each PR user gets permissions on its specific syrf_pr_N database only (via Atlas Operator)
  4. Script-Level Validation: All scripts validate targets before operations
  5. Audit Trail: All operations are logged

2.1.2 MongoDB Atlas Users

| User | Purpose | Databases | Permissions |
|---|---|---|---|
| syrf_snapshot_producer | Weekly snapshot CronJob | syrftest, syrf_snapshot | READ prod, WRITE snapshot |
| syrf-pr-N-user | PR-specific user (Atlas Operator) | syrf_snapshot, syrf_pr_N | READ snapshot, WRITE own DB |

Note: No separate syrf_snapshot_reader user is needed. Each PR user has read access to syrf_snapshot for restore operations.

2.1.3 User: syrf_snapshot_producer

Purpose: Weekly CronJob that copies syrftest → syrf_snapshot

MongoDB Atlas Roles:

| Database | Role | Justification |
|---|---|---|
| syrftest | read | Read production collections for copying |
| syrf_snapshot | readWrite | Write copied data |
| syrf_snapshot | dbAdmin | Drop collections before refresh |

Critical Safety: This user has NO write access to production. Even a bug in the script cannot corrupt syrftest.

// MongoDB Atlas User Configuration
{
  "user": "syrf_snapshot_producer",
  "roles": [
    { "role": "read", "db": "syrftest" },
    { "role": "readWrite", "db": "syrf_snapshot" },
    { "role": "dbAdmin", "db": "syrf_snapshot" }
  ]
}

Secret Storage: GCP Secret Manager → External Secrets → Kubernetes Secret

# GCP Secret Manager: snapshot-producer-mongodb
# Only username/password needed - connection string is constructed in CronJob
# from the mongodbHost value in Helm chart configuration
{
  "username": "snapshot-producer",
  "password": "<secure-password>"
}
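The GCP Secret Manager → External Secrets → Kubernetes chain could be wired with an ExternalSecret resembling the following sketch (the ClusterSecretStore name, refresh interval, and key mapping are assumptions, not part of the spec):

```yaml
# Hypothetical ExternalSecret syncing the snapshot producer credentials
# from GCP Secret Manager into the syrf-system namespace
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: snapshot-producer-credentials
  namespace: syrf-system
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-manager   # assumed store name
  target:
    name: snapshot-producer-credentials
  data:
    - secretKey: connectionString
      remoteRef:
        key: snapshot-producer-mongodb
```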

2.1.4 User: syrf_snapshot_reader (NOT NEEDED)

UPDATE: A separate syrf_snapshot_reader user is NOT required.

Each PR-specific user (created by Atlas Operator) already gets read access to syrf_snapshot. See section 2.1.5 below and section 2.1.9 for details.

2.1.5 PR-Specific Users (Atlas Operator)

Purpose: Each PR gets its own MongoDB user with access only to its database

The existing Atlas Operator creates users like syrf-pr-123-user with:

| Database | Role | Justification |
|---|---|---|
| syrf_pr_123 | readWrite | Application access |
| syrf_pr_123 | dbAdmin | Schema management |

For snapshot restore, we extend this to also grant:

| Database | Role | Justification |
|---|---|---|
| syrf_snapshot | read | Read snapshot for restore |

This is configured in the Atlas Operator's AtlasDatabaseUser CRD:

# Example: PR user with snapshot read access
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  name: syrf-pr-123-user
  namespace: pr-123
spec:
  username: syrf-pr-123-user
  databaseName: admin
  roles:
    # Existing: Access to PR database
    - roleName: readWrite
      databaseName: syrf_pr_123
    - roleName: dbAdmin
      databaseName: syrf_pr_123
    # NEW: Read access to snapshot for restore
    - roleName: read
      databaseName: syrf_snapshot

2.1.6 PR Database Cleanup Strategy

Problem: Who can drop syrf_pr_N databases when a PR is closed?

The PR-specific user (syrf-pr-N-user) has dbAdmin on its own database, so it CAN drop it. However, when the PR is closed:

  1. The AtlasDatabaseUser CRD is deleted by ArgoCD
  2. Atlas Operator deletes the MongoDB user
  3. The credentials are gone before the workflow can use them to drop the database

Solution: Use ArgoCD PreDelete hook to drop the database BEFORE the user is deleted.

# PreDelete hook - runs before namespace/resources are deleted
apiVersion: batch/v1
kind: Job
metadata:
  name: mongodb-cleanup-${PR_NUM}
  namespace: pr-${PR_NUM}
  annotations:
    argocd.argoproj.io/hook: PreDelete
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: cleanup
          image: mongo:7.0
          command: ["/bin/bash", "-c"]
          args:
            - |
              echo "Dropping database syrf_pr_${PR_NUM}..."
              mongosh "$MONGODB_URI" --quiet --eval "
                db.getSiblingDB('syrf_pr_${PR_NUM}').dropDatabase();
                print('Database dropped successfully');
              "
          env:
            - name: MONGODB_URI
              valueFrom:
                secretKeyRef:
                  name: mongodb-credentials
                  key: connectionString

Why this works:

  • PreDelete hook runs BEFORE the AtlasDatabaseUser CRD is deleted
  • The PR user still has credentials at this point
  • PR user has dbAdmin on its own database (can drop it)
  • After hook completes, ArgoCD deletes the namespace (including the CRD)
  • Atlas Operator then cleans up the user

Fallback: If the hook fails or times out, the GitHub workflow cleanup step can attempt manual cleanup.

⚠️ Current gap in pr-preview.yml: The cleanup step uses mongo-db secret from staging namespace, which (per least privilege) should only have access to syrf_staging. Options:

  1. Preferred: Rely on PreDelete hook (uses PR user's own credentials)
  2. Fallback: Create a dedicated syrf_cleanup_user with dbAdmin on the syrf_pr_N databases (MongoDB has no wildcard grants, see 2.1.1, so grants must be added per database)
  3. Not recommended: Give staging user dbAdminAnyDatabase (violates least privilege)

2.1.7 Permission Matrix

| User | syrftest | syrf_snapshot | syrf_staging | syrf_pr_N |
|---|---|---|---|---|
| syrf_snapshot_producer | 📖 READ | ✏️ WRITE | - | - |
| syrf-pr-N-user | - | 📖 READ | - | ✏️ WRITE + 🗑️ DROP (own DB only) |
| Application (staging) | - | - | ✏️ WRITE | - |
| Application (production) | ✏️ WRITE | - | - | - |

Key Insight: No snapshot/restore user can write to syrftest. Production is protected at the MongoDB permission level.

Why the PR user is safe for restore:

  • The restore flow is syrf_snapshot → syrf_pr_N (never touches syrftest)
  • PR user has read on syrf_snapshot (can use as $out source)
  • PR user has readWrite on syrf_pr_N (can use as $out target)
  • PR user has ZERO access to syrftest - it physically cannot touch production!

MongoDB $out Permission Requirements (per official docs):

| Permission | Required On | Action |
|---|---|---|
| find | Source collection | Read documents for aggregation |
| insert | Destination collection | Write output documents |
| remove | Destination collection | Replace existing collection |
Key insight: Read-only access to source is SUFFICIENT. No write access to source is needed.

2.1.8 Script-Level Validation (Defense in Depth)

Even with proper permissions, all scripts include validation:

# CRITICAL: Validate target database before ANY operation
validate_target_db() {
  local target="$1"

  # Must match syrf_pr_N pattern
  if [[ ! "$target" =~ ^syrf_pr_[0-9]+$ ]]; then
    echo "FATAL: Invalid target database name: $target"
    echo "Expected pattern: syrf_pr_N (e.g., syrf_pr_123)"
    exit 1
  fi

  # Explicit blocklist (belt and suspenders)
  case "$target" in
    syrftest|syrfdev|syrf_snapshot|syrf_staging|admin|local|config)
      echo "FATAL: Cannot target protected database: $target"
      exit 1
      ;;
  esac

  echo "✓ Target database validated: $target"
}

# Called at start of restore job
validate_target_db "$TARGET_DB"

2.1.9 Restore Job Connection Strategy

Key Insight: The restore job flow is syrf_snapshot → syrf_pr_N. It never touches syrftest!

The PR-specific user (created by Atlas Operator) can safely use $out aggregation because:

  1. It has read access to syrf_snapshot (source)
  2. It has readWrite access to syrf_pr_N (target)
  3. It has ZERO access to syrftest - it physically cannot touch production

# Restore job uses PR-specific credentials
# This user can ONLY read syrf_snapshot and write to syrf_pr_N
PR_USER_URI="$MONGODB_CONNECTION_STRING"  # syrf-pr-N-user

# $out aggregation: syrf_snapshot → syrf_pr_N
# Safe because the user has NO access to syrftest
mongosh "$PR_USER_URI" --quiet --eval "
  db.getSiblingDB('syrf_snapshot').getCollection('pmProject').aggregate([
    { \$out: { db: 'syrf_pr_${PR_NUM}', coll: 'pmProject' } }
  ]).toArray();
"

Why this is safe:

| If script tries to... | Result |
|---|---|
| Read from syrftest | FAILS - user has no read access |
| Write to syrftest | FAILS - user has no write access |
| Write to syrf_snapshot | FAILS - user only has read access |
| Write to other syrf_pr_* | FAILS - user only has access to own DB |

Conclusion: Use $out aggregation for both producer AND restore jobs. The PR user cannot touch production because it has no permissions on syrftest.

2.1.10 No Shared Restore User Needed

We do NOT need a shared syrf_snapshot_reader user. Each PR's user already has the permissions needed:

  • read on syrf_snapshot (for $out source)
  • readWrite on syrf_pr_N (for $out target)

This is cleaner and more secure than a shared user.

2.2 Snapshot Producer CronJob

Weekly job that copies production data to the snapshot database using MongoDB's $out aggregation.

Why $out instead of mongodump/mongorestore?

Since source and destination databases are on the same Atlas cluster, $out aggregation is significantly better:

| Factor | mongodump/mongorestore | $out aggregation |
|---|---|---|
| Data path | MongoDB → K8s pod → MongoDB | Internal to Atlas |
| Network transfer | 20GB through pod | 0 bytes through pod |
| K8s resources | High (holds data in memory) | Minimal (sends commands) |
| Execution time | 30-60 minutes | 5-10 minutes |
| GKE cost | Higher compute + egress | Minimal |

# cluster-gitops/syrf/services/snapshot-producer/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: snapshot-producer
  namespace: syrf-system
  labels:
    app.kubernetes.io/name: snapshot-producer
    app.kubernetes.io/component: data-management
spec:
  schedule: "0 3 * * 0"  # Sunday 3:00 AM UTC
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 1800  # 30 minute timeout (faster with $out)
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: snapshot
              image: mongo:7.0
              command: ["/bin/bash", "-c"]
              args:
                - |
                  set -e
                  echo "=== Starting Production Snapshot (using \$out aggregation) ==="
                  echo "Timestamp: $(date -Iseconds)"

                  # Collections to copy (pm-prefixed + core)
                  COLLECTIONS="pmProject pmStudy pmInvestigator pmSystematicSearch pmDataExportJob pmStudyCorrection pmInvestigatorUsage pmRiskOfBiasAiJob pmProjectDailyStat pmPotential pmInvestigatorEmail"

                  echo "Collections to copy: $COLLECTIONS"

                  # Drop existing snapshot collections first
                  echo "Clearing existing snapshot database..."
                  mongosh "$MONGO_URI" --quiet --eval "
                    const snapDb = db.getSiblingDB('syrf_snapshot');
                    snapDb.getCollectionNames().forEach(c => {
                      print('Dropping: ' + c);
                      snapDb.getCollection(c).drop();
                    });
                  "

                  # Copy each collection using $out (stays within Atlas cluster)
                  for col in $COLLECTIONS; do
                    echo "Copying collection: $col"

                    mongosh "$MONGO_URI" --quiet --eval "
                      const startTime = Date.now();
                      const result = db.getSiblingDB('syrftest').getCollection('$col').aggregate([
                        { \$out: { db: 'syrf_snapshot', coll: '$col' } }
                      ]).toArray();
                      const count = db.getSiblingDB('syrf_snapshot').getCollection('$col').countDocuments();
                      const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
                      print('  ✓ $col: ' + count + ' documents (' + elapsed + 's)');
                    "
                  done

                  # Write metadata
                  echo "Writing snapshot metadata..."
                  mongosh "$MONGO_URI" --quiet --eval "
                    const collections = '$COLLECTIONS'.split(' ');
                    db.getSiblingDB('syrf_snapshot').snapshot_metadata.updateOne(
                      { _id: 'latest' },
                      {
                          \$set: {
                          timestamp: new Date(),
                          source: 'syrftest',
                          collections: collections,
                          status: 'complete',
                          method: '\$out aggregation'
                        }
                      },
                      { upsert: true }
                    );
                    print('Metadata written successfully');
                  "

                  echo "=== Snapshot Complete ==="
                  echo "Timestamp: $(date -Iseconds)"
              env:
                - name: MONGO_URI
                  valueFrom:
                    secretKeyRef:
                      name: snapshot-producer-credentials
                      key: connectionString
              resources:
                requests:
                  memory: "64Mi"
                  cpu: "50m"
                limits:
                  memory: "256Mi"
                  cpu: "200m"

Note: The job only needs minimal resources since it just sends commands to MongoDB. All data movement happens within the Atlas cluster.
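The on-demand capability mentioned in the Executive Summary could be exposed as a manually triggered workflow that instantiates a one-off Job from this CronJob. A sketch (the workflow file name and authentication step are assumptions; `kubectl create job --from=cronjob/...` reuses the CronJob's job template):

```yaml
# Hypothetical .github/workflows/snapshot-on-demand.yml (sketch)
name: On-demand production snapshot
on:
  workflow_dispatch:
jobs:
  trigger-snapshot:
    runs-on: ubuntu-latest
    steps:
      - name: Authenticate to GKE
        # cluster auth (e.g. via google-github-actions/auth) elided
        run: echo "authenticate to the camaradesuk cluster here"
      - name: Create one-off Job from the CronJob
        run: |
          kubectl create job "snapshot-manual-$(date +%s)" \
            --from=cronjob/snapshot-producer \
            -n syrf-system
```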

2.3 Snapshot Restore Job (PreSync)

Job that copies from snapshot database to target PR database using $out aggregation.

Why $out is safe for restore jobs?

The restore job flow is syrf_snapshot → syrf_pr_N. It NEVER touches syrftest!

The PR-specific user (created by Atlas Operator) has:

  • read on syrf_snapshot (source for $out)
  • readWrite on syrf_pr_N (target for $out)
  • ZERO access to syrftest - physically cannot touch production!

| If script tries to... | Result |
|---|---|
| Read from syrftest | FAILS - user has no read access |
| Write to syrftest | FAILS - user has no write access |
| Write to syrf_snapshot | FAILS - user only has read access |

Decision: Use $out aggregation for both producer and restore jobs. This is faster (data stays in Atlas) and equally secure (PR user has no production access).

# Template for PR preview (generated by workflow)
apiVersion: batch/v1
kind: Job
metadata:
  name: snapshot-restore-${PR_NUM}-${SHORT_SHA}
  namespace: pr-${PR_NUM}
  labels:
    app.kubernetes.io/managed-by: pr-preview-workflow
    app.kubernetes.io/component: snapshot-restore
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    argocd.argoproj.io/sync-wave: "3"  # After MongoDB user created
spec:
  ttlSecondsAfterFinished: 600
  backoffLimit: 3
  activeDeadlineSeconds: 900  # 15 minute timeout ($out is fast)
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore
          image: mongo:7.0
          command: ["/bin/bash", "-c"]
          args:
            - |
              set -e
              TARGET_DB="syrf_pr_${PR_NUM}"

              echo "=== Restoring Snapshot to $TARGET_DB ==="
              echo "Using \$out aggregation (data stays in Atlas)"
              echo "Timestamp: $(date -Iseconds)"

              # ============================================================
              # CRITICAL: Validate target database name (defense in depth)
              # ============================================================
              validate_target_db() {
                local target="$1"

                # Must match syrf_pr_N pattern
                if [[ ! "$target" =~ ^syrf_pr_[0-9]+$ ]]; then
                  echo "FATAL: Invalid target database name: $target"
                  echo "Expected pattern: syrf_pr_N (e.g., syrf_pr_123)"
                  exit 1
                fi

                # Explicit blocklist (belt and suspenders)
                case "$target" in
                  syrftest|syrfdev|syrf_snapshot|syrf_staging|admin|local|config)
                    echo "FATAL: Cannot target protected database: $target"
                    exit 1
                    ;;
                esac

                echo "✓ Target database validated: $target"
              }

              validate_target_db "$TARGET_DB"

              # ============================================================
              # Wait for snapshot to be available
              # ============================================================
              MAX_RETRIES=12
              RETRY_INTERVAL=30
              echo "Checking snapshot availability..."

              for i in $(seq 1 $MAX_RETRIES); do
                METADATA=$(mongosh "$MONGODB_URI" --quiet --eval "
                  JSON.stringify(db.getSiblingDB('syrf_snapshot').snapshot_metadata.findOne({_id: 'latest'}))
                ")

                if [ -n "$METADATA" ] && [ "$METADATA" != "null" ]; then
                  echo "Snapshot available: $METADATA"
                  SNAPSHOT_TIME=$(echo "$METADATA" | grep -oP '"timestamp":\s*"\K[^"]+' || echo "unknown")
                  echo "Snapshot timestamp: $SNAPSHOT_TIME"
                  break
                fi

                if [ $i -eq $MAX_RETRIES ]; then
                  echo "ERROR: Snapshot not available after $MAX_RETRIES retries"
                  echo "Options:"
                  echo "  1. Wait for Sunday 3 AM UTC weekly snapshot"
                  echo "  2. Trigger on-demand snapshot via GitHub Actions"
                  echo "  3. Remove 'use-snapshot' label to use seed data"
                  exit 1
                fi

                echo "Waiting for snapshot... (attempt $i/$MAX_RETRIES)"
                sleep $RETRY_INTERVAL
              done

              # ============================================================
              # Get collection list from snapshot metadata
              # ============================================================
              COLLECTIONS=$(mongosh "$MONGODB_URI" --quiet --eval "
                db.getSiblingDB('syrf_snapshot').snapshot_metadata.findOne({_id: 'latest'}).collections.join(' ')
              ")
              echo "Collections to restore: $COLLECTIONS"

              # ============================================================
              # Drop existing collections in target database
              # ============================================================
              echo ""
              echo "Step 1: Clearing target database..."
              mongosh "$MONGODB_URI" --quiet --eval "
                const targetDb = db.getSiblingDB('$TARGET_DB');
                targetDb.getCollectionNames().forEach(c => {
                  print('Dropping: ' + c);
                  targetDb.getCollection(c).drop();
                });
              "

              # ============================================================
              # Copy each collection using $out (stays within Atlas)
              # ============================================================
              echo ""
              echo "Step 2: Copying collections via \$out aggregation..."
              echo "(Data stays within Atlas cluster - no network transfer)"

              for col in $COLLECTIONS; do
                echo "Copying: $col"
                mongosh "$MONGODB_URI" --quiet --eval "
                  const startTime = Date.now();
                  db.getSiblingDB('syrf_snapshot').getCollection('$col').aggregate([
                    { \$out: { db: '$TARGET_DB', coll: '$col' } }
                  ]).toArray();
                  const count = db.getSiblingDB('$TARGET_DB').getCollection('$col').countDocuments();
                  const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
                  print('  ✓ $col: ' + count + ' documents (' + elapsed + 's)');
                "
              done

              # ============================================================
              # Verify restore
              # ============================================================
              echo ""
              echo "Step 3: Verification summary..."
              mongosh "$MONGODB_URI" --quiet --eval "
                const targetDb = db.getSiblingDB('$TARGET_DB');
                let total = 0;
                targetDb.getCollectionNames().filter(c => c !== 'system.profile').forEach(c => {
                  const count = targetDb.getCollection(c).countDocuments();
                  total += count;
                  print('  ' + c + ': ' + count);
                });
                print('Total documents: ' + total);
              "

              echo ""
              echo "=== Restore Complete ==="
              echo "Timestamp: $(date -Iseconds)"
          env:
            # PR-specific connection (read on syrf_snapshot, readWrite on syrf_pr_N)
            # This user has ZERO access to syrftest - it cannot touch production!
            - name: MONGODB_URI
              valueFrom:
                secretKeyRef:
                  name: mongodb-credentials  # Created by Atlas Operator for this PR
                  key: connectionString
            - name: PR_NUM
              value: "${PR_NUM}"
          resources:
            requests:
              memory: "64Mi"    # Minimal - just sends commands to MongoDB
              cpu: "50m"
            limits:
              memory: "256Mi"
              cpu: "200m"

Note: The restore job only needs minimal resources since it just sends commands to MongoDB. All data movement happens within the Atlas cluster. For a 20GB database, expect ~5-10 minutes restore time.

2.4 PR Preview Workflow Integration

Updates to .github/workflows/pr-preview.yml:

2.4.1 Label Detection Enhancement

# Add to check-label job
- name: Check for data source labels
  id: snapshot-check
  env:
    GH_TOKEN: ${{ github.token }}
    PR_NUM: ${{ steps.pr-info.outputs.pr_number }}
  run: |
    LABELS=$(gh pr view "$PR_NUM" --json labels -q '.labels[].name' 2>/dev/null || echo "")

    # Check persist-db FIRST (it takes precedence as a lock)
    if echo "$LABELS" | grep -q "persist-db"; then
      echo "persist_db=true" >> "$GITHUB_OUTPUT"
      echo "🔒 Label 'persist-db' found - database is LOCKED"
    else
      echo "persist_db=false" >> "$GITHUB_OUTPUT"
    fi

    # Check use-snapshot
    if echo "$LABELS" | grep -q "use-snapshot"; then
      echo "use_snapshot=true" >> "$GITHUB_OUTPUT"
      echo "🗄️ Label 'use-snapshot' found - will use production snapshot"
    else
      echo "use_snapshot=false" >> "$GITHUB_OUTPUT"
      echo "🌱 No 'use-snapshot' label - will use seed data"
    fi

- name: Handle label change with persist-db lock
  id: label-lock-check
  env:
    GH_TOKEN: ${{ github.token }}
    PR_NUM: ${{ steps.pr-info.outputs.pr_number }}
    PERSIST_DB: ${{ steps.snapshot-check.outputs.persist_db }}
    EVENT_ACTION: ${{ github.event.action }}
    LABEL_NAME: ${{ github.event.label.name }}
  run: |
    # If persist-db is present and use-snapshot label was just changed, revert it
    if [ "$PERSIST_DB" = "true" ] && [ "$LABEL_NAME" = "use-snapshot" ]; then
      if [ "$EVENT_ACTION" = "labeled" ]; then
        # use-snapshot was just ADDED while persist-db present - revert
        echo "⚠️ Cannot add use-snapshot: database locked by persist-db"
        gh pr edit "$PR_NUM" --remove-label "use-snapshot"
        gh pr comment "$PR_NUM" --body "⚠️ **Cannot enable snapshot mode**: Database is locked by \`persist-db\` label.

    The \`use-snapshot\` label has been automatically removed.

    To refresh the database from production snapshot:
    1. Remove the \`persist-db\` label first
    2. Then add the \`use-snapshot\` label"
        echo "label_reverted=true" >> "$GITHUB_OUTPUT"
      elif [ "$EVENT_ACTION" = "unlabeled" ]; then
        # use-snapshot was just REMOVED while persist-db present - re-add it
        echo "⚠️ Cannot remove use-snapshot: database locked by persist-db"
        gh pr edit "$PR_NUM" --add-label "use-snapshot"
        gh pr comment "$PR_NUM" --body "⚠️ **Cannot disable snapshot mode**: Database is locked by \`persist-db\` label.

    The \`use-snapshot\` label has been automatically restored.

    To change the data source:
    1. Remove the \`persist-db\` label first
    2. Then remove the \`use-snapshot\` label"
        echo "label_reverted=true" >> "$GITHUB_OUTPUT"
      fi
    else
      echo "label_reverted=false" >> "$GITHUB_OUTPUT"
    fi

2.4.2 PR Comment Enhancement

# Add new job: post-preview-comment
post-preview-comment:
  name: Post preview environment comment
  needs: [check-label, detect-changes]
  if: |
    needs.check-label.outputs.should_build == 'true' &&
    github.event.action == 'labeled' &&
    github.event.label.name == 'preview'
  runs-on: ubuntu-latest
  permissions:
    pull-requests: write
  steps:
    - name: Get snapshot timestamp
      id: snapshot-info
      if: needs.check-label.outputs.use_snapshot == 'true'
      env:
        MONGO_URI: ${{ secrets.SNAPSHOT_READER_URI }}
      run: |
        # Query snapshot metadata for timestamp
        TIMESTAMP=$(mongosh "$MONGO_URI" --quiet --eval "
          const meta = db.getSiblingDB('syrf_snapshot').snapshot_metadata.findOne({_id: 'latest'});
          if (meta && meta.timestamp) {
            print(meta.timestamp.toISOString());
          } else {
            print('unknown');
          }
        ")
        echo "timestamp=$TIMESTAMP" >> "$GITHUB_OUTPUT"

        # Format for display
        if [ "$TIMESTAMP" != "unknown" ]; then
          FORMATTED=$(date -d "$TIMESTAMP" "+%Y-%m-%d %H:%M UTC" 2>/dev/null || echo "$TIMESTAMP")
          echo "formatted=$FORMATTED" >> "$GITHUB_OUTPUT"
        else
          echo "formatted=Not yet available" >> "$GITHUB_OUTPUT"
        fi

    - name: Post preview environment comment
      env:
        GH_TOKEN: ${{ github.token }}
        PR_NUM: ${{ needs.check-label.outputs.pr_number }}
        USE_SNAPSHOT: ${{ needs.check-label.outputs.use_snapshot }}
        PERSIST_DB: ${{ needs.check-label.outputs.persist_db }}
        HEAD_SHA: ${{ needs.check-label.outputs.head_short_sha }}
        SNAPSHOT_TIME: ${{ steps.snapshot-info.outputs.formatted }}
      run: |
        if [ "$USE_SNAPSHOT" = "true" ]; then
          DB_SOURCE="🗄️ **Production Snapshot** (\`syrf_snapshot\` → \`syrf_pr_${PR_NUM}\`)"
          DB_NOTE="📅 Snapshot taken: **${SNAPSHOT_TIME}**"
          if [ "$PERSIST_DB" = "true" ]; then
            DB_NOTE="${DB_NOTE}
        🔒 Database is **LOCKED** - it will not be modified or deleted on rebuild or PR close."
          else
            DB_NOTE="${DB_NOTE}
        Data will be refreshed from snapshot on each rebuild."
          fi
        else
          DB_SOURCE="🌱 **Seed Data** (5 sample projects, ~100 studies)"
          if [ "$PERSIST_DB" = "true" ]; then
            DB_NOTE="🔒 Database is **LOCKED** - it will not be modified or deleted on rebuild or PR close.
        Add the \`use-snapshot\` label to use production data instead (must remove \`persist-db\` first)."
          else
            DB_NOTE="Add the \`use-snapshot\` label to use production data instead."
          fi
        fi

        COMMENT=$(cat <<EOF
        ## 🚀 Preview Environment Building

        A preview environment is being built for this PR.

        ### Environment Details

        | Setting | Value |
        |---------|-------|
        | **Namespace** | \`pr-${PR_NUM}\` |
        | **Database** | \`syrf_pr_${PR_NUM}\` |
        | **Data Source** | ${DB_SOURCE} |
        | **Commit** | \`${HEAD_SHA}\` |

        ### URLs (available after deployment)

        - 🌐 **Web**: https://pr-${PR_NUM}.syrf.org.uk
        - 🔌 **API**: https://api.pr-${PR_NUM}.syrf.org.uk
        - 📊 **PM**: https://project-management.pr-${PR_NUM}.syrf.org.uk

        ### Notes

        ${DB_NOTE}

        ---
        *This comment was automatically generated by the PR Preview workflow.*
        EOF
        )

        gh pr comment "$PR_NUM" --body "$COMMENT"

2.4.3 persist-db Label Added Comment

# Post when persist-db label is added
- name: Post persist-db lock comment
  if: |
    github.event.action == 'labeled' &&
    github.event.label.name == 'persist-db'
  env:
    GH_TOKEN: ${{ github.token }}
    PR_NUM: ${{ needs.check-label.outputs.pr_number }}
  run: |
    gh pr comment "$PR_NUM" --body "🔒 **Database is now LOCKED**

    The \`persist-db\` label has been added. Your database (\`syrf_pr_${PR_NUM}\`) will now be protected:

    - ✅ Database will NOT be dropped on rebuild
    - ✅ Database will NOT be deleted when PR is closed/merged
    - ✅ Changes to \`use-snapshot\` label will be blocked

    **To unlock:** Remove the \`persist-db\` label. If the PR is closed, this will immediately delete the database."

2.4.4 persist-db Label Removed Comment

# Post when persist-db label is removed
- name: Handle persist-db removal
  if: |
    github.event.action == 'unlabeled' &&
    github.event.label.name == 'persist-db'
  env:
    GH_TOKEN: ${{ github.token }}
    PR_NUM: ${{ needs.check-label.outputs.pr_number }}
    PR_STATE: ${{ github.event.pull_request.state }}
    USE_SNAPSHOT: ${{ needs.check-label.outputs.use_snapshot }}
  run: |
    if [ "$PR_STATE" = "open" ]; then
      # PR is still open - database will be handled on next sync
      if [ "$USE_SNAPSHOT" = "true" ]; then
        DATA_ACTION="refreshed from production snapshot"
      else
        DATA_ACTION="reset with fresh seed data"
      fi

      gh pr comment "$PR_NUM" --body "🔓 **Database unlocked**

    The \`persist-db\` label has been removed. On the next sync, your database will be ${DATA_ACTION}.

    If you don't push a new commit, you can manually trigger a sync from ArgoCD."
    else
      # PR is closed/merged - delete the database immediately
      echo "PR is closed - triggering immediate database cleanup"
      # Note: This triggers the cleanup workflow or direct deletion
      gh pr comment "$PR_NUM" --body "🗑️ **Database deleted**

    The \`persist-db\` label was removed from this closed PR, triggering immediate database cleanup.

    Database \`syrf_pr_${PR_NUM}\` has been dropped."
      # Actual deletion logic is in the cleanup workflow
    fi
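The immediate deletion mentioned above is performed by the cleanup workflow; its core operation is a single `dropDatabase` call. A minimal dry-run sketch (the `drop_eval` helper, the `PR_NUM=123` value, and the echoed command are illustrative, not the actual workflow code):

```shell
# Dry-run sketch of the cleanup workflow's core database drop (illustrative).
PR_NUM=123

drop_eval() {
  # Build the mongosh --eval body that drops one PR database.
  printf 'db.getSiblingDB("syrf_pr_%s").dropDatabase()' "$1"
}

# The real workflow would run this against Atlas with its own credentials:
echo "would run: mongosh \"\$MONGO_URI\" --eval '$(drop_eval "$PR_NUM")'"
```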

2.5 Staging Configuration

GitOps-based configuration for staging environment.

# cluster-gitops/syrf/environments/staging/staging.values.yaml
# Add database source configuration
database:
  # Data source: "seed" or "snapshot"
  source: snapshot

  # Snapshot database to copy from (when source=snapshot)
  snapshotDatabase: syrf_snapshot

Staging restore is handled by a similar PreSync job, but configured via Helm values rather than generated by workflow.
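As a sketch of how the values above could gate the restore job, a hypothetical Helm template conditional (the chart path, job name, and body are assumptions, not the actual chart):

```yaml
# Hypothetical sketch of charts/<staging-chart>/templates/snapshot-restore-job.yaml
{{- if eq .Values.database.source "snapshot" }}
apiVersion: batch/v1
kind: Job
metadata:
  name: staging-snapshot-restore
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "3"
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore
          image: mongo:7
          # Copies from .Values.database.snapshotDatabase into the staging DB,
          # using the same per-collection $out loop as the PR restore job.
          command: ["/bin/sh", "-c", "echo 'restore script as in section 2.3'"]
{{- end }}
```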

2.6 On-Demand Snapshot Trigger

GitHub Actions workflow for manual snapshot trigger.

# .github/workflows/snapshot-on-demand.yml
name: Trigger Snapshot Refresh

on:
  workflow_dispatch:
    inputs:
      confirm:
        description: 'Type "refresh-snapshot" to confirm'
        required: true

jobs:
  trigger-snapshot:
    name: Trigger snapshot refresh
    if: inputs.confirm == 'refresh-snapshot'
    runs-on: ubuntu-latest
    steps:
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
          service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}

      - name: Set up Cloud SDK
        uses: google-github-actions/setup-gcloud@v2

      - name: Get GKE credentials
        run: |
          gcloud container clusters get-credentials camaradesuk \
            --zone europe-west2-a \
            --project camarades-net

      - name: Trigger snapshot CronJob
        run: |
          kubectl create job \
            --from=cronjob/snapshot-producer \
            snapshot-manual-$(date +%Y%m%d-%H%M%S) \
            -n syrf-system

          echo "### Snapshot Job Triggered" >> "$GITHUB_STEP_SUMMARY"
          echo "A manual snapshot job has been created." >> "$GITHUB_STEP_SUMMARY"
          echo "Check the syrf-system namespace for job status." >> "$GITHUB_STEP_SUMMARY"

3. Execution Flow

3.1 Weekly Snapshot (Happy Path)

Sunday 3:00 AM UTC
┌─────────────────────────────────────────────────────────┐
│ 1. CronJob triggers snapshot-producer                   │
├─────────────────────────────────────────────────────────┤
│ 2. Connect to MongoDB Atlas                             │
│ 3. Clear existing syrf_snapshot collections             │
│ 4. For each collection (11 total):                      │
│    a. Run $out aggregation: syrftest → syrf_snapshot    │
│    b. (Data stays within Atlas - no network transfer)   │
│ 5. Write snapshot_metadata with timestamp               │
│ 6. Job completes (estimated: 5-10 minutes for 20GB)     │
└─────────────────────────────────────────────────────────┘
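Step 4's per-collection copy can be sketched as a mongosh `$out` aggregation. This dry-run snippet emits the commands rather than executing them; the `producer_eval` helper and the three-collection list are illustrative (the real job covers all 11 collections):

```shell
# Dry-run sketch of step 4: emit (rather than execute) the per-collection
# $out copy from syrftest into syrf_snapshot. Helper name and collection
# list are illustrative.
producer_eval() {
  printf 'db.getSiblingDB("syrftest").getCollection("%s").aggregate([{ $out: { db: "syrf_snapshot", coll: "%s" } }])' \
    "$1" "$1"
}

for c in pmProject pmStudy pmInvestigator; do
  echo "would run: mongosh \"\$MONGO_URI\" --eval '$(producer_eval "$c")'"
done
```

Because `$out` replaces the target collection atomically on completion, each collection in `syrf_snapshot` is either the old copy or the new one, never a partial copy.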

3.2 PR Preview with Snapshot

Developer adds 'use-snapshot' label to PR
┌─────────────────────────────────────────────────────────┐
│ 1. Workflow detects 'use-snapshot' label                │
│ 2. Check if 'persist-db' is present:                    │
│    ├─ YES: REVERT label change, post comment            │
│    │       (database is locked - no changes allowed)    │
│    └─ NO:  Continue to step 3                           │
│ 3. Get snapshot timestamp from syrf_snapshot metadata   │
│ 4. Post preview environment comment with:               │
│    - Data source (snapshot)                             │
│    - Snapshot timestamp                                 │
│ 5. Generate snapshot-restore-job.yaml                   │
│ 6. Commit to cluster-gitops                             │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ ArgoCD Sync (only if persist-db NOT present):           │
│ 1. PreSync: Create MongoDB user (sync-wave: 1)          │
│ 2. PreSync: Drop existing database                      │
│ 3. PreSync: Run snapshot-restore job (sync-wave: 3)     │
│    a. Check snapshot_metadata for availability          │
│    b. Retry up to 12 times (6 minutes) if not ready     │
│    c. Copy collections from syrf_snapshot → syrf_pr_N   │
│ 4. Sync: Deploy services                                │
│ 5. Services connect to syrf_pr_N with real data         │
└─────────────────────────────────────────────────────────┘

3.2.1 PR Preview with persist-db Lock

Developer has 'persist-db' label on PR
┌─────────────────────────────────────────────────────────┐
│ 1. Workflow detects 'persist-db' label                  │
│ 2. Any attempt to add/remove 'use-snapshot' is REVERTED │
│ 3. Post explanation comment                             │
│ 4. NO database changes occur on sync                    │
│ 5. On PR close/merge: Database is NOT deleted           │
│    - Warning comment posted with cleanup instructions   │
└─────────────────────────────────────────────────────────┘

Cleanup (when persist-db removed from closed PR):
┌─────────────────────────────────────────────────────────┐
│ 1. Workflow detects 'persist-db' label removal          │
│ 2. Check PR state:                                      │
│    ├─ OPEN:   Post "database unlocked" comment          │
│    │          Next sync applies current data source     │
│    └─ CLOSED: Immediately drop database                 │
│               Post "database deleted" comment           │
└─────────────────────────────────────────────────────────┘

3.3 Sequence Diagram

Developer          GitHub Actions       cluster-gitops        ArgoCD          MongoDB Atlas
    │                    │                    │                  │                  │
    │─Add 'use-snapshot' │                    │                  │                  │
    │    label           │                    │                  │                  │
    │                    │                    │                  │                  │
    │                    ├─Detect label───────│                  │                  │
    │                    │                    │                  │                  │
    │                    ├─Generate restore───│                  │                  │
    │                    │   job YAML         │                  │                  │
    │                    │                    │                  │                  │
    │                    ├─Commit + push─────▶│                  │                  │
    │                    │                    │                  │                  │
    │                    │                    ├─Git webhook─────▶│                  │
    │                    │                    │                  │                  │
    │                    │                    │                  ├─PreSync: Create──│
    │                    │                    │                  │  MongoDB user    │
    │                    │                    │                  │                  │
    │                    │                    │                  ├─PreSync: Restore─│
    │                    │                    │                  │  snapshot        │
    │                    │                    │                  │                  │
    │                    │                    │                  │         ┌────────┤
    │                    │                    │                  │         │Copy    │
    │                    │                    │                  │         │data    │
    │                    │                    │                  │         └────────┤
    │                    │                    │                  │                  │
    │                    │                    │                  ├─Sync: Deploy─────│
    │                    │                    │                  │  services        │
    │                    │                    │                  │                  │
    │◀─────────────────Preview ready──────────│                  │                  │
    │                    │                    │                  │                  │

4. Edge Cases & Mitigations

| # | Edge Case / Failure Mode | Impact | Mitigation Strategy |
|---|--------------------------|--------|---------------------|
| 1 | Snapshot not available when PR deploys (first week) | PR deployment blocked | Wait and retry (12 attempts, 30s intervals = 6 min max wait). After retries are exhausted, fail with a clear error message. |
| 2 | Snapshot producer job fails mid-copy | Incomplete snapshot database | Job uses collection-by-collection copy with atomic drop. Metadata only written on success. Restore job checks metadata. |
| 3 | Production database unavailable during snapshot | Weekly snapshot skipped | CronJob retries (backoffLimit: 2). Alert on repeated failures. Previous snapshot remains valid. |
| 4 | 20GB copy takes longer than expected | Job timeout | activeDeadlineSeconds: 1800 (30 min) for producer, 900 (15 min) for restore. Both use $out aggregation (fast, data stays in Atlas). |
| 5 | use-snapshot label added while persist-db present | Label change blocked | Revert label change automatically, post comment explaining database is locked. User must remove persist-db first. |
| 6 | use-snapshot label removed while persist-db present | Label change blocked | Same as above: revert automatically and post an explanatory comment. |
| 7 | persist-db removed on closed/merged PR | Orphan database cleanup | Immediately drop database, post confirmation comment. This is the cleanup mechanism for orphan DBs. |
| 8 | PR closed/merged with persist-db label | Database persists as orphan | Do NOT delete database; post warning comment with cleanup instructions (remove persist-db label to trigger deletion). |
| 9 | First deploy with both use-snapshot AND persist-db | User wants snapshot data then lock | Create database from snapshot on first sync, then lock for subsequent syncs. Both labels are honored in sequence. |
| 10 | MongoDB connection issues during restore | Restore fails | backoffLimit: 3 with exponential backoff. Clear error in ArgoCD. |
| 11 | Snapshot database runs out of space | Atlas storage limit | Monitor Atlas storage. A 20GB snapshot should fit in an M10+ tier. |
| 12 | Collection schema changes between production and test | Potential data issues | Schema version tracked in metadata. Services handle schema migration. |
| 13 | Multiple PRs requesting snapshot simultaneously | Parallel restore from same source | Each PR gets its own copy; syrf_snapshot is read-only during restores. No conflicts. |
| 14 | Snapshot producer runs during restore | Stale data mid-restore | Restore checks metadata timestamp. If changed mid-restore, log a warning but continue (acceptable for testing). |
| 15 | Manual changes to syrf_snapshot | Corruption risk | Document as read-only. Only snapshot-producer should write. A Kyverno policy can enforce this. |

4.1 Detailed Mitigation: Snapshot Availability Wait

# Implemented in restore job
MAX_RETRIES=12
RETRY_INTERVAL=30  # 30 seconds

for i in $(seq 1 $MAX_RETRIES); do
  METADATA=$(mongosh ... --eval "db.snapshot_metadata.findOne({_id: 'latest'})")

  if [ -n "$METADATA" ] && [ "$METADATA" != "null" ]; then
    echo "Snapshot available"
    break
  fi

  if [ $i -eq $MAX_RETRIES ]; then
    echo "ERROR: Snapshot not available after $(($MAX_RETRIES * $RETRY_INTERVAL / 60)) minutes"
    echo "This is expected on first deployment before weekly snapshot runs."
    echo "Options:"
    echo "  1. Wait for Sunday 3 AM UTC weekly snapshot"
    echo "  2. Trigger on-demand snapshot via GitHub Actions"
    echo "  3. Remove 'use-snapshot' label to use seed data"
    exit 1
  fi

  echo "Waiting for snapshot... (attempt $i/$MAX_RETRIES)"
  sleep $RETRY_INTERVAL
done
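The wait loop above can also be factored into a small reusable helper. This is a sketch: `retry_until` and `check_stub` are illustrative names, and the stub stands in for the real mongosh metadata query:

```shell
# retry_until MAX INTERVAL CMD...: run CMD until it succeeds, sleeping
# INTERVAL seconds between attempts, for at most MAX attempts.
retry_until() {
  max="$1"; interval="$2"; shift 2
  i=1
  while [ "$i" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$max" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  return 1
}

# Stand-in for the mongosh metadata check: succeeds on the 3rd attempt.
attempts=0
check_stub() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}

if retry_until 5 0 check_stub; then
  echo "snapshot available after $attempts attempts"
fi
```

With the production values (`retry_until 12 30 <metadata check>`) this gives the same 6-minute maximum wait as the inline loop.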

5. Testing Strategy

5.1 Unit Tests

Not applicable - this feature is infrastructure-only (no application code changes).

5.2 Integration Tests

  • Snapshot Producer Job: Run manually, verify all collections copied
  • Restore Job: Deploy test PR with use-snapshot, verify data present
  • Label Conflict: Add use-snapshot while persist-db is present, verify use-snapshot is reverted and a comment is posted
  • Preview Comment: Add preview label, verify environment comment posted

5.3 Manual Verification Steps

# 1. Verify snapshot producer CronJob is deployed
kubectl get cronjob snapshot-producer -n syrf-system

# 2. Trigger manual snapshot
kubectl create job --from=cronjob/snapshot-producer snapshot-test -n syrf-system

# 3. Monitor job progress
kubectl logs -f job/snapshot-test -n syrf-system

# 4. Verify snapshot metadata
mongosh "mongodb+srv://..." --eval "
  db.getSiblingDB('syrf_snapshot').snapshot_metadata.findOne({_id: 'latest'})
"

# 5. Verify collection counts match production
mongosh "mongodb+srv://..." --eval "
  const snap = db.getSiblingDB('syrf_snapshot');
  const prod = db.getSiblingDB('syrftest');
  ['pmProject', 'pmStudy', 'pmInvestigator'].forEach(c => {
    print(c + ': snap=' + snap.getCollection(c).countDocuments() +
          ' prod=' + prod.getCollection(c).countDocuments());
  });
"

# 6. Create test PR with use-snapshot label
gh pr create --title "Test snapshot restore" --body "Testing snapshot feature"
gh pr edit <PR_NUM> --add-label "preview,use-snapshot"

# 7. Verify PR comments
gh pr view <PR_NUM> --comments

# 8. Verify restore job ran
kubectl get jobs -n pr-<PR_NUM>
kubectl logs job/snapshot-restore-<PR_NUM>-<SHA> -n pr-<PR_NUM>

# 9. Verify data in preview database
mongosh "mongodb+srv://..." --eval "
  db.getSiblingDB('syrf_pr_<PR_NUM>').pmProject.countDocuments()
"

6. Implementation Checklist

Implementation Status: Phases 1 and 3 core components implemented (2026-01-13). See Implementation Notes below for changes from the original spec.

Phase 1: Infrastructure Setup

  • 1.1 Create MongoDB Atlas user snapshot-producer with required roles
      • Requires manual creation in Atlas Console (same pattern as prod/staging users)
      • Roles: read on syrftest, readWrite on syrf_snapshot
  • 1.2 Add credentials to GCP Secret Manager (snapshot-producer-mongodb)
      • Keys: username, password (connection string constructed from cluster host in Helm values)
  • 1.3 Create ExternalSecret for snapshot-producer credentials (cluster-gitops)
      • Added to plugins/local/extra-secrets-staging/values.yaml
  • 1.4 Update Kyverno policy to allow PR users read on syrf_snapshot
      • Updated plugins/helm/kyverno/resources/atlas-pr-user-policy.yaml
      • Rule 5 now allows syrf_snapshot (read-only) in addition to syrf_pr_*
  • 1.5 Create snapshot-producer Helm chart
      • Created charts/snapshot-producer/ with CronJob template
      • Plugin config at plugins/local/snapshot-producer/

Phase 2: Snapshot Producer

  • 2.1 Create CronJob manifest in cluster-gitops
      • Located at charts/snapshot-producer/templates/cronjob.yaml
      • Schedule: Sunday 2 AM UTC
  • 2.2 Test manual job trigger
  • 2.3 Verify all 11 collections are copied correctly
  • 2.4 Verify snapshot_metadata is written on success (see Implementation Notes)
  • 2.5 Monitor first automated weekly run (Sunday 2 AM)

Phase 3: PR Preview Integration

  • 3.1 Add use-snapshot label detection to pr-preview.yml
      • Label triggers workflow on add/remove
      • Check step outputs use_snapshot flag
  • 3.2 Add persist-db conflict resolution logic
      • persist-db takes precedence and blocks snapshot restore when present
  • 3.3 Add PR comment for preview environment (with DB details)
  • 3.4 Add PR comment for label conflict resolution
  • 3.5 Generate snapshot-restore-job.yaml when label present
      • Only when use-snapshot=true AND persist-db=false
      • Uses PR user credentials (now has read on syrf_snapshot)
  • 3.6 Update AtlasDatabaseUser CRD to add snapshot read role conditionally
      • Role added only when use-snapshot label is present
  • 3.7 Test with real PR (add both labels, verify behaviour)

Phase 4: Staging Configuration

  • 4.1 Add database.source config to staging.values.yaml
  • 4.2 Create staging-specific restore job template in Helm chart
  • 4.3 Test staging with snapshot source enabled
  • 4.4 Document staging configuration in cluster-gitops

Phase 5: On-Demand Trigger

  • 5.1 Create snapshot-on-demand.yml workflow
  • 5.2 Test manual trigger via workflow_dispatch
  • 5.3 Document in how-to guide

Phase 6: Documentation & Cleanup

  • 6.1 Update PR preview how-to guide with snapshot option
  • 6.2 Update MongoDB testing strategy doc
  • 6.3 Delete planning documents (clarify.md)
  • 6.4 Update CLAUDE.md with snapshot feature

Implementation Notes

Changes from Original Specification:

| Aspect | Original | Implemented | Reason |
|--------|----------|-------------|--------|
| User naming | syrf_snapshot_operator | snapshot-producer | Simpler naming |
| Secret name | syrf-snapshot-operator-credentials | snapshot-producer-mongodb | Consistent with existing pattern |
| User creation | Operator CRD | Manual Atlas Console | Follows prod/staging pattern for long-lived users |
| Kyverno policy | Not specified | Updated Rule 5 | Required to allow syrf_snapshot read access |
| Namespace | syrf-system | syrf-staging | Plugin pattern requires namespace |
| snapshot_metadata | Separate collection | Implemented | Now written to all databases (snapshot, PR, empty seed) |

Related Documentation:


7. Open Questions

All questions have been resolved through the clarification process:

| Question | Resolution |
|----------|------------|
| PII handling | Not required; skip anonymization |
| Storage location | MongoDB Atlas database (syrf_snapshot), not GCS |
| Retention policy | Single snapshot, replaced weekly |
| PR configuration | Label-based (use-snapshot for data source, persist-db for lock) |
| Collection scope | All pm-prefixed collections (11 total) |
| Label interaction | persist-db is a LOCK that takes precedence; it blocks all label changes to use-snapshot |
| Orphan databases | Manual cleanup: remove persist-db label from a closed PR to trigger deletion |
| Snapshot visibility | Snapshot timestamp shown in PR comments so users know data freshness |
| Permission model | Defense in depth: separate users with minimal permissions. No snapshot/restore user can write to syrftest. Producer uses read-only prod access; restore uses a PR-specific user that can only write to its own DB. |
| Restore method | $out aggregation for both producer and restore (fast; data stays in Atlas). PR user has read on snapshot, readWrite on its own DB, and ZERO access to syrftest. |

8. References


Document End

This document must be reviewed and approved before implementation begins.