
Data Snapshot Automation - Implementation Specification

Status: Draft - Pending Review
Author: Claude (Senior Systems Architect)
Date: 2026-01-13
Target Environment: GKE Cluster (camaradesuk) / MongoDB Atlas


Executive Summary

This specification details the implementation of automated production data snapshots for use in PR preview and staging environments. The solution uses a snapshot database approach where a dedicated MongoDB database (syrf_snapshot) on the Atlas cluster is refreshed weekly (or on-demand) from production, and other environments copy data from it.

Key Design Decisions:

  • No PII anonymization required (confirmed by stakeholder)
  • Single snapshot database instead of file-based storage (GCS)
  • PR label-based configuration (use-snapshot for data source, persist-db for lock)
  • Weekly refresh (Sunday 3 AM UTC) with on-demand capability
  • ~20GB database size
  • $out aggregation for both jobs (fast, data stays in Atlas):
      • Producer (prod → snapshot): uses the syrf_snapshot_producer user (read prod, write snapshot)
      • Restore (snapshot → PR): uses the PR-specific user (read snapshot, write own DB only)
  • Defense in depth: no snapshot/restore user can write to production (syrftest)

Table of Contents

  1. High-Level Architecture
  2. Detailed Design
  3. Execution Flow
  4. Edge Cases & Mitigations
  5. Testing Strategy
  6. Implementation Checklist
  7. Open Questions
  8. References

1. High-Level Architecture

1.1 Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                        MongoDB Atlas Cluster (Cluster0)                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐                      ┌─────────────────┐               │
│  │   Production    │  Weekly CronJob      │    Snapshot     │               │
│  │   syrftest      │─────────────────────▶│  syrf_snapshot  │               │
│  │   (~20GB)       │  (copy collections)  │   (~20GB)       │               │
│  └─────────────────┘                      └────────┬────────┘               │
│                                                    │                        │
│                           ┌────────────────────────┼────────────────────┐   │
│                           │                        │                    │   │
│                           ▼                        ▼                    ▼   │
│                  ┌─────────────────┐    ┌─────────────────┐    ┌──────────┐ │
│                  │    Staging      │    │   PR Preview    │    │   PR N   │ │
│                  │  syrf_staging   │    │  syrf_pr_123    │    │ syrf_pr..│ │
│                  └─────────────────┘    └─────────────────┘    └──────────┘ │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

1.2 Key Components

| Component | Purpose | Location |
|---|---|---|
| Snapshot Producer | Weekly CronJob that copies production → syrf_snapshot | GKE syrf-system namespace |
| Snapshot Restore Job | PreSync Job that copies syrf_snapshot → target DB | Per-environment namespace |
| PR Preview Workflow | Detects use-snapshot label, generates restore job | GitHub Actions |
| Staging Values | GitOps config to enable/disable snapshot source | cluster-gitops |

1.3 Dependencies

  • MongoDB Atlas cluster with read access to production (syrftest)
  • MongoDB Atlas database user for snapshot operations
  • GKE cluster with jobs capability
  • Existing PR preview infrastructure (Atlas Operator, Kyverno)

1.4 Collections to Copy

Based on codebase analysis, these collections are required:

INCLUDE (Core Application Data):

pmProject              # Projects, stages, memberships
pmStudy                # Studies, annotations, screening
pmInvestigator         # User accounts
pmSystematicSearch     # Literature searches
pmDataExportJob        # Export job tracking
pmStudyCorrection      # PDF correction requests
pmInvestigatorUsage    # Usage statistics
pmRiskOfBiasAiJob      # AI risk of bias jobs
pmProjectDailyStat     # Daily statistics
pmPotential            # Pre-registration records
pmInvestigatorEmail    # Email lookup cache

EXCLUDE (Infrastructure/Temporary):

resumePoints           # Change stream tokens
system.*               # MongoDB system collections
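The producer job pins the INCLUDE list explicitly; as a cross-check, the EXCLUDE rules above could also be applied to a live collection listing. A minimal sketch (the filter_snapshot_collections helper is illustrative, not part of the spec):

```shell
# Hypothetical helper: given collection names on stdin (one per line, e.g.
# the output of db.getCollectionNames()), drop the excluded infrastructure
# collections (resumePoints and anything under system.*).
filter_snapshot_collections() {
  grep -v -e '^resumePoints$' -e '^system\.' || true
}

# Example: a mixed listing from a production database
printf '%s\n' pmProject resumePoints system.views pmStudy \
  | filter_snapshot_collections
# prints: pmProject and pmStudy (one per line)
```

Comparing this filtered list against the pinned INCLUDE list would flag new production collections that the snapshot job does not yet copy.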

1.5 Label Interaction Model

The PR preview system uses three labels with distinct purposes:

| Label | Purpose | Behavior |
|---|---|---|
| preview | Trigger - enables preview environment | Creates namespace, deploys services |
| use-snapshot | Source - determines data source | When DB is created, use snapshot instead of seed data |
| persist-db | Lock - protects database | Prevents ALL database modifications |

1.5.1 Core Principle: persist-db as a Lock

persist-db ALWAYS takes precedence. When present:

  • Database is NEVER dropped
  • Database is NEVER refreshed/recreated
  • Label changes (use-snapshot) are reverted with an explanatory comment
  • PR close/merge does NOT delete the database (requires manual cleanup)

1.5.2 Label State Matrix

| persist-db | use-snapshot | On Sync | Database Action |
|---|---|---|---|
| absent | absent | Normal | Drop DB → Seed fresh data (5 sample projects) |
| absent | present | Normal | Drop DB → Restore from snapshot |
| present | absent | Normal | No action - DB protected |
| present | present | Normal | No action - DB protected |
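The state matrix reduces to a small decision function, with persist-db checked first. A sketch in shell (function name and return strings are illustrative):

```shell
# Hypothetical sketch of the label state matrix: given whether each label is
# present ("true"/"false"), print the action a normal sync should take.
# persist-db always wins, per section 1.5.1.
decide_db_action() {
  persist_db="$1"   # "true" if the persist-db label is present
  use_snapshot="$2" # "true" if the use-snapshot label is present

  if [ "$persist_db" = "true" ]; then
    echo "no-action (DB protected)"
  elif [ "$use_snapshot" = "true" ]; then
    echo "drop-and-restore-from-snapshot"
  else
    echo "drop-and-seed"
  fi
}

decide_db_action false false  # prints: drop-and-seed
decide_db_action false true   # prints: drop-and-restore-from-snapshot
decide_db_action true  true   # prints: no-action (DB protected)
```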

1.5.3 Label Change Behavior

When use-snapshot is added:

persist-db absent?
    ├─ YES → Immediately drop DB, restore from snapshot
    │        Post comment: "🗄️ Database recreated from production snapshot (taken: <timestamp>)"
    └─ NO  → Revert label addition, post comment:
             "⚠️ Cannot enable snapshot mode: database is locked by persist-db label.
              Remove persist-db first if you want to refresh the database."

When use-snapshot is removed:

persist-db absent?
    ├─ YES → Immediately drop DB (next sync will seed fresh data)
    │        Post comment: "🌱 Database dropped. Fresh seed data will be created on next sync."
    └─ NO  → Revert label removal, post comment:
             "⚠️ Cannot disable snapshot mode: database is locked by persist-db label.
              Remove persist-db first if you want to change data source."

When persist-db is added:

Post comment: "🔒 Database is now LOCKED. It will not be modified or deleted,
               even when this PR is closed/merged. Remove this label to unlock."

When persist-db is removed:

Is PR still open?
    ├─ YES → Post comment: "🔓 Database unlocked. Next sync will apply current data source
    │        (snapshot: <yes/no>). If you don't push a new commit, manually trigger a sync."
    │        → Next sync applies current use-snapshot state
    └─ NO (PR closed/merged) → Immediately drop database
             Post comment: "🗑️ Database syrf_pr_N dropped (persist-db lock removed on closed PR)"
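The persist-db removal branch above depends only on the PR's open/closed state, and can be sketched as a tiny helper (names are illustrative):

```shell
# Hypothetical sketch of the "persist-db removed" decision tree: the action
# depends only on whether the PR is still open.
on_persist_db_removed() {
  pr_state="$1" # "open" or "closed"
  if [ "$pr_state" = "open" ]; then
    echo "unlock: next sync applies current use-snapshot state"
  else
    echo "drop-database-immediately"
  fi
}

on_persist_db_removed open    # prints: unlock: next sync applies current use-snapshot state
on_persist_db_removed closed  # prints: drop-database-immediately
```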

1.5.4 PR Close/Merge Behavior

PR Closed/Merged
┌─────────────────────────────────────────────────────────────┐
│ Check: Is persist-db label present?                         │
├─────────────────────────────────────────────────────────────┤
│ YES → DO NOT delete database                                │
│       Post comment: "⚠️ Database syrf_pr_N was NOT deleted  │
│       because persist-db label is present.                  │
│                                                             │
│       To delete: Remove the persist-db label from this PR   │
│       (even though it's closed) and the database will be    │
│       automatically dropped."                               │
├─────────────────────────────────────────────────────────────┤
│ NO  → Delete database as normal                             │
│       Post comment: "✅ Database syrf_pr_N deleted"         │
└─────────────────────────────────────────────────────────────┘

1.5.5 First Deploy with Both Labels

When a new PR is created with BOTH use-snapshot AND persist-db labels from the start:

  1. First sync: Create DB from snapshot (honoring use-snapshot)
  2. Subsequent syncs: Database is locked (honoring persist-db)

This allows users to initialize with production data and then lock it for testing.

1.5.6 Orphan Database Cleanup

Databases from closed PRs with persist-db remain until manually cleaned:

Manual Cleanup Options:

  1. Remove the persist-db label from the closed PR → triggers automatic deletion
  2. Direct MongoDB cleanup via admin tools (for bulk cleanup)

No automatic expiration - databases persist indefinitely until explicitly cleaned.
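For bulk cleanup via admin tools, one approach is to diff the cluster's database names against currently open PR numbers. A hedged sketch (the helper name is illustrative, and any candidate should be confirmed as intentionally abandoned before dropping, since persisted databases may still be in use):

```shell
# Hypothetical helper for bulk-cleanup tooling: given database names on stdin
# (one per line) and a space-separated list of open PR numbers, print the
# syrf_pr_N databases whose PR is no longer open.
list_orphan_pr_dbs() {
  open_prs="$1" # e.g. "123 456"
  while IFS= read -r db; do
    case "$db" in
      syrf_pr_*)
        num="${db#syrf_pr_}"
        found=false
        for pr in $open_prs; do
          [ "$pr" = "$num" ] && found=true
        done
        [ "$found" = "false" ] && echo "$db"
        ;;
    esac
  done
}

# Example: PR 101 is still open; syrf_pr_202 is an orphan candidate
printf '%s\n' syrf_pr_101 syrf_pr_202 syrf_staging syrf_snapshot \
  | list_orphan_pr_dbs "101"
# prints: syrf_pr_202
```

Non-PR databases (syrf_staging, syrf_snapshot) are never reported because they do not match the syrf_pr_N pattern.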


2. Detailed Design

2.1 MongoDB Permission Model

📖 Detailed Reference: See MongoDB Permissions Explained for comprehensive documentation of the permission model, including common misconceptions and cleanup strategies.

2.1.1 Security Principles

⚠️ CRITICAL: Production (syrftest) must NEVER be writable by any snapshot/restore job.

⚠️ IMPORTANT: MongoDB does NOT support wildcard database permissions. You cannot grant permissions on a pattern like syrf_pr_*. Each database must be explicitly named in role grants, OR you must use cluster-wide roles like dbAdminAnyDatabase (which provides access to ALL databases including production).

The permission model follows defense-in-depth:

  1. Principle of Least Privilege: Each user has minimum permissions needed
  2. Database-Level Isolation: No user can write to production
  3. Explicit Database Grants: Each PR user gets permissions on its specific syrf_pr_N database only (via Atlas Operator)
  4. Script-Level Validation: All scripts validate targets before operations
  5. Audit Trail: All operations are logged

2.1.2 MongoDB Atlas Users

| User | Purpose | Databases | Permissions |
|---|---|---|---|
| syrf_snapshot_producer | Weekly snapshot CronJob | syrftest, syrf_snapshot | READ prod, WRITE snapshot |
| syrf-pr-N-user | PR-specific user (Atlas Operator) | syrf_snapshot, syrf_pr_N | READ snapshot, WRITE own DB |

Note: No separate syrf_snapshot_reader user is needed. Each PR user has read access to syrf_snapshot for restore operations.

2.1.3 User: syrf_snapshot_producer

Purpose: Weekly CronJob that copies syrftest → syrf_snapshot

MongoDB Atlas Roles:

| Database | Role | Justification |
|---|---|---|
| syrftest | read | Read production collections for copying |
| syrf_snapshot | readWrite | Write copied data |
| syrf_snapshot | dbAdmin | Drop collections before refresh |

Critical Safety: This user has NO write access to production. Even a bug in the script cannot corrupt syrftest.

// MongoDB Atlas User Configuration
{
  "user": "syrf_snapshot_producer",
  "roles": [
    { "role": "read", "db": "syrftest" },
    { "role": "readWrite", "db": "syrf_snapshot" },
    { "role": "dbAdmin", "db": "syrf_snapshot" }
  ]
}

Secret Storage: GCP Secret Manager → External Secrets → Kubernetes Secret

# GCP Secret Manager: snapshot-producer-mongodb
# Only username/password needed - connection string is constructed in CronJob
# from the mongodbHost value in Helm chart configuration
{
  "username": "snapshot-producer",
  "password": "<secure-password>"
}
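The GCP Secret Manager → External Secrets → Kubernetes chain could be wired with an ExternalSecret resembling the following sketch (the ClusterSecretStore name, refresh interval, and key mapping are assumptions, not part of the spec):

```yaml
# Hypothetical ExternalSecret syncing the snapshot producer credentials
# from GCP Secret Manager into the syrf-system namespace
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: snapshot-producer-credentials
  namespace: syrf-system
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-manager   # assumed store name
  target:
    name: snapshot-producer-credentials
  data:
    - secretKey: connectionString
      remoteRef:
        key: snapshot-producer-mongodb
```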

2.1.4 User: syrf_snapshot_reader (NOT NEEDED)

UPDATE: A separate syrf_snapshot_reader user is NOT required.

Each PR-specific user (created by Atlas Operator) already gets read access to syrf_snapshot. See section 2.1.5 below and section 2.1.9 for details.

2.1.5 PR-Specific Users (Atlas Operator)

Purpose: Each PR gets its own MongoDB user with access only to its database

The existing Atlas Operator creates users like syrf-pr-123-user with:

| Database | Role | Justification |
|---|---|---|
| syrf_pr_123 | readWrite | Application access |
| syrf_pr_123 | dbAdmin | Schema management |

For snapshot restore, we extend this to also grant:

| Database | Role | Justification |
|---|---|---|
| syrf_snapshot | read | Read snapshot for restore |

This is configured in the Atlas Operator's AtlasDatabaseUser CRD:

# Example: PR user with snapshot read access
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  name: syrf-pr-123-user
  namespace: pr-123
spec:
  username: syrf-pr-123-user
  databaseName: admin
  roles:
    # Existing: Access to PR database
    - roleName: readWrite
      databaseName: syrf_pr_123
    - roleName: dbAdmin
      databaseName: syrf_pr_123
    # NEW: Read access to snapshot for restore
    - roleName: read
      databaseName: syrf_snapshot

2.1.6 PR Database Cleanup Strategy

Problem: Who can drop syrf_pr_N databases when a PR is closed?

The PR-specific user (syrf-pr-N-user) has dbAdmin on its own database, so it CAN drop it. However, when the PR is closed:

  1. The AtlasDatabaseUser CRD is deleted by ArgoCD
  2. Atlas Operator deletes the MongoDB user
  3. The credentials are gone before the workflow can use them to drop the database

Solution: Use ArgoCD PreDelete hook to drop the database BEFORE the user is deleted.

# PreDelete hook - runs before namespace/resources are deleted
apiVersion: batch/v1
kind: Job
metadata:
  name: mongodb-cleanup-${PR_NUM}
  namespace: pr-${PR_NUM}
  annotations:
    argocd.argoproj.io/hook: PreDelete
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: cleanup
          image: mongo:7.0
          command: ["/bin/bash", "-c"]
          args:
            - |
              echo "Dropping database syrf_pr_${PR_NUM}..."
              mongosh "$MONGODB_URI" --quiet --eval "
                db.getSiblingDB('syrf_pr_${PR_NUM}').dropDatabase();
                print('Database dropped successfully');
              "
          env:
            - name: MONGODB_URI
              valueFrom:
                secretKeyRef:
                  name: mongodb-credentials
                  key: connectionString

Why this works:

  • PreDelete hook runs BEFORE the AtlasDatabaseUser CRD is deleted
  • The PR user still has credentials at this point
  • PR user has dbAdmin on its own database (can drop it)
  • After hook completes, ArgoCD deletes the namespace (including the CRD)
  • Atlas Operator then cleans up the user

Fallback: If the hook fails or times out, the GitHub workflow cleanup step can attempt manual cleanup.

⚠️ Current gap in pr-preview.yml: The cleanup step uses mongo-db secret from staging namespace, which (per least privilege) should only have access to syrf_staging. Options:

  1. Preferred: Rely on PreDelete hook (uses PR user's own credentials)
  2. Fallback: Create a dedicated syrf_cleanup_user with dbAdmin on the syrf_pr_N databases (MongoDB has no wildcard grants, see 2.1.1, so grants must be added per database)
  3. Not recommended: Give staging user dbAdminAnyDatabase (violates least privilege)

2.1.7 Permission Matrix

| User | syrftest | syrf_snapshot | syrf_staging | syrf_pr_N |
|---|---|---|---|---|
| syrf_snapshot_producer | 📖 READ | ✏️ WRITE | - | - |
| syrf-pr-N-user | - | 📖 READ | - | ✏️ WRITE + 🗑️ DROP (own DB only) |
| Application (staging) | - | - | ✏️ WRITE | - |
| Application (production) | ✏️ WRITE | - | - | - |

Key Insight: No snapshot/restore user can write to syrftest. Production is protected at the MongoDB permission level.

Why the PR user is safe for restore:

  • The restore flow is syrf_snapshot → syrf_pr_N (never touches syrftest)
  • PR user has read on syrf_snapshot (can use as $out source)
  • PR user has readWrite on syrf_pr_N (can use as $out target)
  • PR user has ZERO access to syrftest - it physically cannot touch production!

MongoDB $out Permission Requirements (per official docs):

| Permission | Required On | Action |
|---|---|---|
| find | Source collection | Read documents for aggregation |
| insert | Destination collection | Write output documents |
| remove | Destination collection | Replace existing collection |
Key insight: Read-only access to source is SUFFICIENT. No write access to source is needed.

2.1.8 Script-Level Validation (Defense in Depth)

Even with proper permissions, all scripts include validation:

# CRITICAL: Validate target database before ANY operation
validate_target_db() {
  local target="$1"

  # Must match syrf_pr_N pattern
  if [[ ! "$target" =~ ^syrf_pr_[0-9]+$ ]]; then
    echo "FATAL: Invalid target database name: $target"
    echo "Expected pattern: syrf_pr_N (e.g., syrf_pr_123)"
    exit 1
  fi

  # Explicit blocklist (belt and suspenders)
  case "$target" in
    syrftest|syrfdev|syrf_snapshot|syrf_staging|admin|local|config)
      echo "FATAL: Cannot target protected database: $target"
      exit 1
      ;;
  esac

  echo "✓ Target database validated: $target"
}

# Called at start of restore job
validate_target_db "$TARGET_DB"

2.1.9 Restore Job Connection Strategy

Key Insight: The restore job flow is syrf_snapshot → syrf_pr_N. It never touches syrftest!

The PR-specific user (created by Atlas Operator) can safely use $out aggregation because:

  1. It has read access to syrf_snapshot (source)
  2. It has readWrite access to syrf_pr_N (target)
  3. It has ZERO access to syrftest - it physically cannot touch production

# Restore job uses PR-specific credentials
# This user can ONLY read syrf_snapshot and write to syrf_pr_N
PR_USER_URI="$MONGODB_CONNECTION_STRING"  # syrf-pr-N-user

# $out aggregation: syrf_snapshot → syrf_pr_N
# Safe because the user has NO access to syrftest
mongosh "$PR_USER_URI" --quiet --eval "
  db.getSiblingDB('syrf_snapshot').getCollection('pmProject').aggregate([
    { \$out: { db: 'syrf_pr_${PR_NUM}', coll: 'pmProject' } }
  ]).toArray();
"

Why this is safe:

| If script tries to... | Result |
|---|---|
| Read from syrftest | FAILS - user has no read access |
| Write to syrftest | FAILS - user has no write access |
| Write to syrf_snapshot | FAILS - user only has read access |
| Write to other syrf_pr_* | FAILS - user only has access to own DB |

Conclusion: Use $out aggregation for both producer AND restore jobs. The PR user cannot touch production because it has no permissions on syrftest.

2.1.10 No Shared Restore User Needed

We do NOT need a shared syrf_snapshot_reader user. Each PR's user already has the permissions needed:

  • read on syrf_snapshot (for $out source)
  • readWrite on syrf_pr_N (for $out target)

This is cleaner and more secure than a shared user.

2.2 Snapshot Producer CronJob

Weekly job that copies production data to the snapshot database using MongoDB's $out aggregation.

Why $out instead of mongodump/mongorestore?

Since source and destination databases are on the same Atlas cluster, $out aggregation is significantly better:

| Factor | mongodump/mongorestore | $out aggregation |
|---|---|---|
| Data path | MongoDB → K8s pod → MongoDB | Internal to Atlas |
| Network transfer | 20GB through pod | 0 bytes through pod |
| K8s resources | High (holds data in memory) | Minimal (sends commands) |
| Execution time | 30-60 minutes | 5-10 minutes |
| GKE cost | Higher compute + egress | Minimal |

# cluster-gitops/syrf/services/snapshot-producer/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: snapshot-producer
  namespace: syrf-system
  labels:
    app.kubernetes.io/name: snapshot-producer
    app.kubernetes.io/component: data-management
spec:
  schedule: "0 3 * * 0"  # Sunday 3:00 AM UTC
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 1800  # 30 minute timeout (faster with $out)
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: snapshot
              image: mongo:7.0
              command: ["/bin/bash", "-c"]
              args:
                - |
                  set -e
                  echo "=== Starting Production Snapshot (using \$out aggregation) ==="
                  echo "Timestamp: $(date -Iseconds)"

                  # Collections to copy (pm-prefixed + core)
                  COLLECTIONS="pmProject pmStudy pmInvestigator pmSystematicSearch pmDataExportJob pmStudyCorrection pmInvestigatorUsage pmRiskOfBiasAiJob pmProjectDailyStat pmPotential pmInvestigatorEmail"

                  echo "Collections to copy: $COLLECTIONS"

                  # Drop existing snapshot collections first
                  echo "Clearing existing snapshot database..."
                  mongosh "$MONGO_URI" --quiet --eval "
                    const snapDb = db.getSiblingDB('syrf_snapshot');
                    snapDb.getCollectionNames().forEach(c => {
                      print('Dropping: ' + c);
                      snapDb.getCollection(c).drop();
                    });
                  "

                  # Copy each collection using $out (stays within Atlas cluster)
                  for col in $COLLECTIONS; do
                    echo "Copying collection: $col"

                    mongosh "$MONGO_URI" --quiet --eval "
                      const startTime = Date.now();
                      const result = db.getSiblingDB('syrftest').getCollection('$col').aggregate([
                        { \$out: { db: 'syrf_snapshot', coll: '$col' } }
                      ]).toArray();
                      const count = db.getSiblingDB('syrf_snapshot').getCollection('$col').countDocuments();
                      const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
                      print('  ✓ $col: ' + count + ' documents (' + elapsed + 's)');
                    "
                  done

                  # Write metadata
                  echo "Writing snapshot metadata..."
                  mongosh "$MONGO_URI" --quiet --eval "
                    const collections = '$COLLECTIONS'.split(' ');
                    db.getSiblingDB('syrf_snapshot').snapshot_metadata.updateOne(
                      { _id: 'latest' },
                      {
                          \$set: {
                          timestamp: new Date(),
                          source: 'syrftest',
                          collections: collections,
                          status: 'complete',
                          method: '\$out aggregation'
                        }
                      },
                      { upsert: true }
                    );
                    print('Metadata written successfully');
                  "

                  echo "=== Snapshot Complete ==="
                  echo "Timestamp: $(date -Iseconds)"
              env:
                - name: MONGO_URI
                  valueFrom:
                    secretKeyRef:
                      name: snapshot-producer-credentials
                      key: connectionString
              resources:
                requests:
                  memory: "64Mi"
                  cpu: "50m"
                limits:
                  memory: "256Mi"
                  cpu: "200m"

Note: The job only needs minimal resources since it just sends commands to MongoDB. All data movement happens within the Atlas cluster.
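The on-demand capability mentioned in the Executive Summary could be exposed as a manually triggered workflow that instantiates a one-off Job from this CronJob. A sketch (the workflow file name and authentication step are assumptions; `kubectl create job --from=cronjob/...` reuses the CronJob's job template):

```yaml
# Hypothetical .github/workflows/snapshot-on-demand.yml (sketch)
name: On-demand production snapshot
on:
  workflow_dispatch:
jobs:
  trigger-snapshot:
    runs-on: ubuntu-latest
    steps:
      - name: Authenticate to GKE
        # cluster auth (e.g. via google-github-actions/auth) elided
        run: echo "authenticate to the camaradesuk cluster here"
      - name: Create one-off Job from the CronJob
        run: |
          kubectl create job "snapshot-manual-$(date +%s)" \
            --from=cronjob/snapshot-producer \
            -n syrf-system
```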

2.3 Snapshot Restore Job (PreSync)

Job that copies from snapshot database to target PR database using $out aggregation.

Why $out is safe for restore jobs?

The restore job flow is syrf_snapshot → syrf_pr_N. It NEVER touches syrftest!

The PR-specific user (created by Atlas Operator) has:

  • read on syrf_snapshot (source for $out)
  • readWrite on syrf_pr_N (target for $out)
  • ZERO access to syrftest - physically cannot touch production!

| If script tries to... | Result |
|---|---|
| Read from syrftest | FAILS - user has no read access |
| Write to syrftest | FAILS - user has no write access |
| Write to syrf_snapshot | FAILS - user only has read access |

Decision: Use $out aggregation for both producer and restore jobs. This is faster (data stays in Atlas) and equally secure (PR user has no production access).

# Template for PR preview (generated by workflow)
apiVersion: batch/v1
kind: Job
metadata:
  name: snapshot-restore-${PR_NUM}-${SHORT_SHA}
  namespace: pr-${PR_NUM}
  labels:
    app.kubernetes.io/managed-by: pr-preview-workflow
    app.kubernetes.io/component: snapshot-restore
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    argocd.argoproj.io/sync-wave: "3"  # After MongoDB user created
spec:
  ttlSecondsAfterFinished: 600
  backoffLimit: 3
  activeDeadlineSeconds: 900  # 15 minute timeout ($out is fast)
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore
          image: mongo:7.0
          command: ["/bin/bash", "-c"]
          args:
            - |
              set -e
              TARGET_DB="syrf_pr_${PR_NUM}"

              echo "=== Restoring Snapshot to $TARGET_DB ==="
              echo "Using \$out aggregation (data stays in Atlas)"
              echo "Timestamp: $(date -Iseconds)"

              # ============================================================
              # CRITICAL: Validate target database name (defense in depth)
              # ============================================================
              validate_target_db() {
                local target="$1"

                # Must match syrf_pr_N pattern
                if [[ ! "$target" =~ ^syrf_pr_[0-9]+$ ]]; then
                  echo "FATAL: Invalid target database name: $target"
                  echo "Expected pattern: syrf_pr_N (e.g., syrf_pr_123)"
                  exit 1
                fi

                # Explicit blocklist (belt and suspenders)
                case "$target" in
                  syrftest|syrfdev|syrf_snapshot|syrf_staging|admin|local|config)
                    echo "FATAL: Cannot target protected database: $target"
                    exit 1
                    ;;
                esac

                echo "✓ Target database validated: $target"
              }

              validate_target_db "$TARGET_DB"

              # ============================================================
              # Wait for snapshot to be available
              # ============================================================
              MAX_RETRIES=12
              RETRY_INTERVAL=30
              echo "Checking snapshot availability..."

              for i in $(seq 1 $MAX_RETRIES); do
                METADATA=$(mongosh "$MONGODB_URI" --quiet --eval "
                  JSON.stringify(db.getSiblingDB('syrf_snapshot').snapshot_metadata.findOne({_id: 'latest'}))
                ")

                if [ -n "$METADATA" ] && [ "$METADATA" != "null" ]; then
                  echo "Snapshot available: $METADATA"
                  SNAPSHOT_TIME=$(echo "$METADATA" | grep -oP '"timestamp":\s*"\K[^"]+' || echo "unknown")
                  echo "Snapshot timestamp: $SNAPSHOT_TIME"
                  break
                fi

                if [ $i -eq $MAX_RETRIES ]; then
                  echo "ERROR: Snapshot not available after $MAX_RETRIES retries"
                  echo "Options:"
                  echo "  1. Wait for Sunday 3 AM UTC weekly snapshot"
                  echo "  2. Trigger on-demand snapshot via GitHub Actions"
                  echo "  3. Remove 'use-snapshot' label to use seed data"
                  exit 1
                fi

                echo "Waiting for snapshot... (attempt $i/$MAX_RETRIES)"
                sleep $RETRY_INTERVAL
              done

              # ============================================================
              # Get collection list from snapshot metadata
              # ============================================================
              COLLECTIONS=$(mongosh "$MONGODB_URI" --quiet --eval "
                db.getSiblingDB('syrf_snapshot').snapshot_metadata.findOne({_id: 'latest'}).collections.join(' ')
              ")
              echo "Collections to restore: $COLLECTIONS"

              # ============================================================
              # Drop existing collections in target database
              # ============================================================
              echo ""
              echo "Step 1: Clearing target database..."
              mongosh "$MONGODB_URI" --quiet --eval "
                const targetDb = db.getSiblingDB('$TARGET_DB');
                targetDb.getCollectionNames().forEach(c => {
                  print('Dropping: ' + c);
                  targetDb.getCollection(c).drop();
                });
              "

              # ============================================================
              # Copy each collection using $out (stays within Atlas)
              # ============================================================
              echo ""
              echo "Step 2: Copying collections via \$out aggregation..."
              echo "(Data stays within Atlas cluster - no network transfer)"

              for col in $COLLECTIONS; do
                echo "Copying: $col"
                mongosh "$MONGODB_URI" --quiet --eval "
                  const startTime = Date.now();
                  db.getSiblingDB('syrf_snapshot').getCollection('$col').aggregate([
                    { \$out: { db: '$TARGET_DB', coll: '$col' } }
                  ]).toArray();
                  const count = db.getSiblingDB('$TARGET_DB').getCollection('$col').countDocuments();
                  const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
                  print('  ✓ $col: ' + count + ' documents (' + elapsed + 's)');
                "
              done

              # ============================================================
              # Verify restore
              # ============================================================
              echo ""
              echo "Step 3: Verification summary..."
              mongosh "$MONGODB_URI" --quiet --eval "
                const targetDb = db.getSiblingDB('$TARGET_DB');
                let total = 0;
                targetDb.getCollectionNames().filter(c => c !== 'system.profile').forEach(c => {
                  const count = targetDb.getCollection(c).countDocuments();
                  total += count;
                  print('  ' + c + ': ' + count);
                });
                print('Total documents: ' + total);
              "

              echo ""
              echo "=== Restore Complete ==="
              echo "Timestamp: $(date -Iseconds)"
          env:
            # PR-specific connection (read on syrf_snapshot, readWrite on syrf_pr_N)
            # This user has ZERO access to syrftest - it cannot touch production!
            - name: MONGODB_URI
              valueFrom:
                secretKeyRef:
                  name: mongodb-credentials  # Created by Atlas Operator for this PR
                  key: connectionString
            - name: PR_NUM
              value: "${PR_NUM}"
          resources:
            requests:
              memory: "64Mi"    # Minimal - just sends commands to MongoDB
              cpu: "50m"
            limits:
              memory: "256Mi"
              cpu: "200m"

Note: The restore job only needs minimal resources since it just sends commands to MongoDB. All data movement happens within the Atlas cluster. For a 20GB database, expect ~5-10 minutes restore time.

2.4 PR Preview Workflow Integration

Updates to .github/workflows/pr-preview.yml:

2.4.1 Label Detection Enhancement

# Add to check-label job
- name: Check for data source labels
  id: snapshot-check
  env:
    GH_TOKEN: ${{ github.token }}
    PR_NUM: ${{ steps.pr-info.outputs.pr_number }}
  run: |
    LABELS=$(gh pr view "$PR_NUM" --json labels -q '.labels[].name' 2>/dev/null || echo "")

    # Check persist-db FIRST (it takes precedence as a lock)
    if echo "$LABELS" | grep -q "persist-db"; then
      echo "persist_db=true" >> "$GITHUB_OUTPUT"
      echo "🔒 Label 'persist-db' found - database is LOCKED"
    else
      echo "persist_db=false" >> "$GITHUB_OUTPUT"
    fi

    # Check use-snapshot
    if echo "$LABELS" | grep -q "use-snapshot"; then
      echo "use_snapshot=true" >> "$GITHUB_OUTPUT"
      echo "🗄️ Label 'use-snapshot' found - will use production snapshot"
    else
      echo "use_snapshot=false" >> "$GITHUB_OUTPUT"
      echo "🌱 No 'use-snapshot' label - will use seed data"
    fi

- name: Handle label change with persist-db lock
  id: label-lock-check
  env:
    GH_TOKEN: ${{ github.token }}
    PR_NUM: ${{ steps.pr-info.outputs.pr_number }}
    PERSIST_DB: ${{ steps.snapshot-check.outputs.persist_db }}
    EVENT_ACTION: ${{ github.event.action }}
    LABEL_NAME: ${{ github.event.label.name }}
  run: |
    # If persist-db is present and use-snapshot label was just changed, revert it
    if [ "$PERSIST_DB" = "true" ] && [ "$LABEL_NAME" = "use-snapshot" ]; then
      if [ "$EVENT_ACTION" = "labeled" ]; then
        # use-snapshot was just ADDED while persist-db present - revert
        echo "⚠️ Cannot add use-snapshot: database locked by persist-db"
        gh pr edit "$PR_NUM" --remove-label "use-snapshot"
        gh pr comment "$PR_NUM" --body "⚠️ **Cannot enable snapshot mode**: Database is locked by \`persist-db\` label.

    The \`use-snapshot\` label has been automatically removed.

    To refresh the database from production snapshot:
    1. Remove the \`persist-db\` label first
    2. Then add the \`use-snapshot\` label"
        echo "label_reverted=true" >> "$GITHUB_OUTPUT"
      elif [ "$EVENT_ACTION" = "unlabeled" ]; then
        # use-snapshot was just REMOVED while persist-db present - re-add it
        echo "⚠️ Cannot remove use-snapshot: database locked by persist-db"
        gh pr edit "$PR_NUM" --add-label "use-snapshot"
        gh pr comment "$PR_NUM" --body "⚠️ **Cannot disable snapshot mode**: Database is locked by \`persist-db\` label.

    The \`use-snapshot\` label has been automatically restored.

    To change the data source:
    1. Remove the \`persist-db\` label first
    2. Then remove the \`use-snapshot\` label"
        echo "label_reverted=true" >> "$GITHUB_OUTPUT"
      fi
    else
      echo "label_reverted=false" >> "$GITHUB_OUTPUT"
    fi

2.4.2 PR Comment Enhancement

# Add new job: post-preview-comment
post-preview-comment:
  name: Post preview environment comment
  needs: [check-label, detect-changes]
  if: |
    needs.check-label.outputs.should_build == 'true' &&
    github.event.action == 'labeled' &&
    github.event.label.name == 'preview'
  runs-on: ubuntu-latest
  permissions:
    pull-requests: write
  steps:
    - name: Get snapshot timestamp
      id: snapshot-info
      if: needs.check-label.outputs.use_snapshot == 'true'
      env:
        MONGO_URI: ${{ secrets.SNAPSHOT_READER_URI }}
      run: |
        # Query snapshot metadata for timestamp
        TIMESTAMP=$(mongosh "$MONGO_URI" --quiet --eval "
          const meta = db.getSiblingDB('syrf_snapshot').snapshot_metadata.findOne({_id: 'latest'});
          if (meta && meta.timestamp) {
            print(meta.timestamp.toISOString());
          } else {
            print('unknown');
          }
        ")
        echo "timestamp=$TIMESTAMP" >> "$GITHUB_OUTPUT"

        # Format for display
        if [ "$TIMESTAMP" != "unknown" ]; then
          FORMATTED=$(date -d "$TIMESTAMP" "+%Y-%m-%d %H:%M UTC" 2>/dev/null || echo "$TIMESTAMP")
          echo "formatted=$FORMATTED" >> "$GITHUB_OUTPUT"
        else
          echo "formatted=Not yet available" >> "$GITHUB_OUTPUT"
        fi

    - name: Post preview environment comment
      env:
        GH_TOKEN: ${{ github.token }}
        PR_NUM: ${{ needs.check-label.outputs.pr_number }}
        USE_SNAPSHOT: ${{ needs.check-label.outputs.use_snapshot }}
        PERSIST_DB: ${{ needs.check-label.outputs.persist_db }}
        HEAD_SHA: ${{ needs.check-label.outputs.head_short_sha }}
        SNAPSHOT_TIME: ${{ steps.snapshot-info.outputs.formatted }}
      run: |
        if [ "$USE_SNAPSHOT" = "true" ]; then
          DB_SOURCE="🗄️ **Production Snapshot** (\`syrf_snapshot\` → \`syrf_pr_${PR_NUM}\`)"
          DB_NOTE="📅 Snapshot taken: **${SNAPSHOT_TIME}**"
          if [ "$PERSIST_DB" = "true" ]; then
            DB_NOTE="${DB_NOTE}
        🔒 Database is **LOCKED** - it will not be modified or deleted on rebuild or PR close."
          else
            DB_NOTE="${DB_NOTE}
        Data will be refreshed from snapshot on each rebuild."
          fi
        else
          DB_SOURCE="🌱 **Seed Data** (5 sample projects, ~100 studies)"
          if [ "$PERSIST_DB" = "true" ]; then
            DB_NOTE="🔒 Database is **LOCKED** - it will not be modified or deleted on rebuild or PR close.
        Add the \`use-snapshot\` label to use production data instead (must remove \`persist-db\` first)."
          else
            DB_NOTE="Add the \`use-snapshot\` label to use production data instead."
          fi
        fi

        COMMENT=$(cat <<EOF
        ## 🚀 Preview Environment Building

        A preview environment is being built for this PR.

        ### Environment Details

        | Setting | Value |
        |---------|-------|
        | **Namespace** | \`pr-${PR_NUM}\` |
        | **Database** | \`syrf_pr_${PR_NUM}\` |
        | **Data Source** | ${DB_SOURCE} |
        | **Commit** | \`${HEAD_SHA}\` |

        ### URLs (available after deployment)

        - 🌐 **Web**: https://pr-${PR_NUM}.syrf.org.uk
        - 🔌 **API**: https://api.pr-${PR_NUM}.syrf.org.uk
        - 📊 **PM**: https://project-management.pr-${PR_NUM}.syrf.org.uk

        ### Notes

        ${DB_NOTE}

        ---
        *This comment was automatically generated by the PR Preview workflow.*
        EOF
        )

        gh pr comment "$PR_NUM" --body "$COMMENT"

2.4.3 persist-db Label Added Comment

# Post when persist-db label is added
- name: Post persist-db lock comment
  if: |
    github.event.action == 'labeled' &&
    github.event.label.name == 'persist-db'
  env:
    GH_TOKEN: ${{ github.token }}
    PR_NUM: ${{ needs.check-label.outputs.pr_number }}
  run: |
    gh pr comment "$PR_NUM" --body "🔒 **Database is now LOCKED**

    The \`persist-db\` label has been added. Your database (\`syrf_pr_${PR_NUM}\`) will now be protected:

    - ✅ Database will NOT be dropped on rebuild
    - ✅ Database will NOT be deleted when PR is closed/merged
    - ✅ Changes to \`use-snapshot\` label will be blocked

    **To unlock:** Remove the \`persist-db\` label. If the PR is closed, this will immediately delete the database."

2.4.4 persist-db Label Removed Comment

# Post when persist-db label is removed
- name: Handle persist-db removal
  if: |
    github.event.action == 'unlabeled' &&
    github.event.label.name == 'persist-db'
  env:
    GH_TOKEN: ${{ github.token }}
    PR_NUM: ${{ needs.check-label.outputs.pr_number }}
    PR_STATE: ${{ github.event.pull_request.state }}
    USE_SNAPSHOT: ${{ needs.check-label.outputs.use_snapshot }}
  run: |
    if [ "$PR_STATE" = "open" ]; then
      # PR is still open - database will be handled on next sync
      if [ "$USE_SNAPSHOT" = "true" ]; then
        DATA_ACTION="refreshed from production snapshot"
      else
        DATA_ACTION="reset with fresh seed data"
      fi

      gh pr comment "$PR_NUM" --body "🔓 **Database unlocked**

    The \`persist-db\` label has been removed. On the next sync, your database will be ${DATA_ACTION}.

    If you don't push a new commit, you can manually trigger a sync from ArgoCD."
    else
      # PR is closed/merged - delete the database immediately
      echo "PR is closed - triggering immediate database cleanup"
      # Note: This triggers the cleanup workflow or direct deletion
      gh pr comment "$PR_NUM" --body "🗑️ **Database deleted**

    The \`persist-db\` label was removed from this closed PR, triggering immediate database cleanup.

    Database \`syrf_pr_${PR_NUM}\` has been dropped."
      # Actual deletion logic is in the cleanup workflow
    fi
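The immediate deletion mentioned above is performed by the cleanup workflow; its core operation is a single `dropDatabase` call. A minimal dry-run sketch (the `drop_eval` helper, the `PR_NUM=123` value, and the echoed command are illustrative, not the actual workflow code):

```shell
# Dry-run sketch of the cleanup workflow's core database drop (illustrative).
PR_NUM=123

drop_eval() {
  # Build the mongosh --eval body that drops one PR database.
  printf 'db.getSiblingDB("syrf_pr_%s").dropDatabase()' "$1"
}

# The real workflow would run this against Atlas with its own credentials:
echo "would run: mongosh \"\$MONGO_URI\" --eval '$(drop_eval "$PR_NUM")'"
```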

2.5 Staging Configuration

GitOps-based configuration for staging environment.

# cluster-gitops/syrf/environments/staging/staging.values.yaml
# Add database source configuration
database:
  # Data source: "seed" or "snapshot"
  source: snapshot

  # Snapshot database to copy from (when source=snapshot)
  snapshotDatabase: syrf_snapshot

Staging restore is handled by a similar PreSync job, but configured via Helm values rather than generated by workflow.
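As a sketch of how the values above could gate the restore job, a hypothetical Helm template conditional (the chart path, job name, and body are assumptions, not the actual chart):

```yaml
# Hypothetical sketch of charts/<staging-chart>/templates/snapshot-restore-job.yaml
{{- if eq .Values.database.source "snapshot" }}
apiVersion: batch/v1
kind: Job
metadata:
  name: staging-snapshot-restore
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "3"
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore
          image: mongo:7
          # Copies from .Values.database.snapshotDatabase into the staging DB,
          # using the same per-collection $out loop as the PR restore job.
          command: ["/bin/sh", "-c", "echo 'restore script as in section 2.3'"]
{{- end }}
```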

2.6 On-Demand Snapshot Trigger

GitHub Actions workflow for manual snapshot trigger.

# .github/workflows/snapshot-on-demand.yml
name: Trigger Snapshot Refresh

on:
  workflow_dispatch:
    inputs:
      confirm:
        description: 'Type "refresh-snapshot" to confirm'
        required: true

jobs:
  trigger-snapshot:
    name: Trigger snapshot refresh
    if: inputs.confirm == 'refresh-snapshot'
    runs-on: ubuntu-latest
    steps:
      - name: Authenticate to Google Cloud
        uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
          service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}

      - name: Set up Cloud SDK
        uses: google-github-actions/setup-gcloud@v2

      - name: Get GKE credentials
        run: |
          gcloud container clusters get-credentials camaradesuk \
            --zone europe-west2-a \
            --project camarades-net

      - name: Trigger snapshot CronJob
        run: |
          kubectl create job \
            --from=cronjob/snapshot-producer \
            snapshot-manual-$(date +%Y%m%d-%H%M%S) \
            -n syrf-system

          echo "### Snapshot Job Triggered" >> "$GITHUB_STEP_SUMMARY"
          echo "A manual snapshot job has been created." >> "$GITHUB_STEP_SUMMARY"
          echo "Check the syrf-system namespace for job status." >> "$GITHUB_STEP_SUMMARY"

3. Execution Flow

3.1 Weekly Snapshot (Happy Path)

Sunday 3:00 AM UTC
┌─────────────────────────────────────────────────────────┐
│ 1. CronJob triggers snapshot-producer                   │
├─────────────────────────────────────────────────────────┤
│ 2. Connect to MongoDB Atlas                             │
│ 3. Clear existing syrf_snapshot collections             │
│ 4. For each collection (11 total):                      │
│    a. Run $out aggregation: syrftest → syrf_snapshot    │
│    b. (Data stays within Atlas - no network transfer)   │
│ 5. Write snapshot_metadata with timestamp               │
│ 6. Job completes (estimated: 5-10 minutes for 20GB)     │
└─────────────────────────────────────────────────────────┘
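Step 4's per-collection copy can be sketched as a mongosh `$out` aggregation. This dry-run snippet emits the commands rather than executing them; the `producer_eval` helper and the three-collection list are illustrative (the real job covers all 11 collections):

```shell
# Dry-run sketch of step 4: emit (rather than execute) the per-collection
# $out copy from syrftest into syrf_snapshot. Helper name and collection
# list are illustrative.
producer_eval() {
  printf 'db.getSiblingDB("syrftest").getCollection("%s").aggregate([{ $out: { db: "syrf_snapshot", coll: "%s" } }])' \
    "$1" "$1"
}

for c in pmProject pmStudy pmInvestigator; do
  echo "would run: mongosh \"\$MONGO_URI\" --eval '$(producer_eval "$c")'"
done
```

Because `$out` replaces the target collection atomically on completion, each collection in `syrf_snapshot` is either the old copy or the new one, never a partial copy.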

3.2 PR Preview with Snapshot

Developer adds 'use-snapshot' label to PR
┌─────────────────────────────────────────────────────────┐
│ 1. Workflow detects 'use-snapshot' label                │
│ 2. Check if 'persist-db' is present:                    │
│    ├─ YES: REVERT label change, post comment            │
│    │       (database is locked - no changes allowed)    │
│    └─ NO:  Continue to step 3                           │
│ 3. Get snapshot timestamp from syrf_snapshot metadata   │
│ 4. Post preview environment comment with:               │
│    - Data source (snapshot)                             │
│    - Snapshot timestamp                                 │
│ 5. Generate snapshot-restore-job.yaml                   │
│ 6. Commit to cluster-gitops                             │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ ArgoCD Sync (only if persist-db NOT present):           │
│ 1. PreSync: Create MongoDB user (sync-wave: 1)          │
│ 2. PreSync: Drop existing database                      │
│ 3. PreSync: Run snapshot-restore job (sync-wave: 3)     │
│    a. Check snapshot_metadata for availability          │
│    b. Retry up to 12 times (6 minutes) if not ready     │
│    c. Copy collections from syrf_snapshot → syrf_pr_N   │
│ 4. Sync: Deploy services                                │
│ 5. Services connect to syrf_pr_N with real data         │
└─────────────────────────────────────────────────────────┘

3.2.1 PR Preview with persist-db Lock

Developer has 'persist-db' label on PR
┌─────────────────────────────────────────────────────────┐
│ 1. Workflow detects 'persist-db' label                  │
│ 2. Any attempt to add/remove 'use-snapshot' is REVERTED │
│ 3. Post explanation comment                             │
│ 4. NO database changes occur on sync                    │
│ 5. On PR close/merge: Database is NOT deleted           │
│    - Warning comment posted with cleanup instructions   │
└─────────────────────────────────────────────────────────┘

Cleanup (when persist-db removed from closed PR):
┌─────────────────────────────────────────────────────────┐
│ 1. Workflow detects 'persist-db' label removal          │
│ 2. Check PR state:                                      │
│    ├─ OPEN:   Post "database unlocked" comment          │
│    │          Next sync applies current data source     │
│    └─ CLOSED: Immediately drop database                 │
│               Post "database deleted" comment           │
└─────────────────────────────────────────────────────────┘

3.3 Sequence Diagram

Developer          GitHub Actions       cluster-gitops        ArgoCD          MongoDB Atlas
    │                    │                    │                  │                  │
    │─Add 'use-snapshot' │                    │                  │                  │
    │    label           │                    │                  │                  │
    │                    │                    │                  │                  │
    │                    ├─Detect label───────│                  │                  │
    │                    │                    │                  │                  │
    │                    ├─Generate restore───│                  │                  │
    │                    │   job YAML         │                  │                  │
    │                    │                    │                  │                  │
    │                    ├─Commit + push─────▶│                  │                  │
    │                    │                    │                  │                  │
    │                    │                    ├─Git webhook─────▶│                  │
    │                    │                    │                  │                  │
    │                    │                    │                  ├─PreSync: Create──│
    │                    │                    │                  │  MongoDB user    │
    │                    │                    │                  │                  │
    │                    │                    │                  ├─PreSync: Restore─│
    │                    │                    │                  │  snapshot        │
    │                    │                    │                  │                  │
    │                    │                    │                  │         ┌────────┤
    │                    │                    │                  │         │Copy    │
    │                    │                    │                  │         │data    │
    │                    │                    │                  │         └────────┤
    │                    │                    │                  │                  │
    │                    │                    │                  ├─Sync: Deploy─────│
    │                    │                    │                  │  services        │
    │                    │                    │                  │                  │
    │◀─────────────────Preview ready──────────│                  │                  │
    │                    │                    │                  │                  │

4. Edge Cases & Mitigations

| # | Edge Case / Failure Mode | Impact | Mitigation Strategy |
|---|--------------------------|--------|---------------------|
| 1 | Snapshot not available when PR deploys (first week) | PR deployment blocked | Wait and retry (12 attempts, 30s intervals = 6 min max wait). After retries are exhausted, fail with a clear error message. |
| 2 | Snapshot producer job fails mid-copy | Incomplete snapshot database | Job uses collection-by-collection copy with atomic drop. Metadata only written on success. Restore job checks metadata. |
| 3 | Production database unavailable during snapshot | Weekly snapshot skipped | CronJob retries (backoffLimit: 2). Alert on repeated failures. Previous snapshot remains valid. |
| 4 | 20GB copy takes longer than expected | Job timeout | activeDeadlineSeconds: 1800 (30 min) for producer, 900 (15 min) for restore. Both use $out aggregation (fast, data stays in Atlas). |
| 5 | use-snapshot label added while persist-db present | Label change blocked | Revert label change automatically, post comment explaining database is locked. User must remove persist-db first. |
| 6 | use-snapshot label removed while persist-db present | Label change blocked | Same as above: revert automatically and post an explanatory comment. |
| 7 | persist-db removed on closed/merged PR | Orphan database cleanup | Immediately drop database, post confirmation comment. This is the cleanup mechanism for orphan DBs. |
| 8 | PR closed/merged with persist-db label | Database persists as orphan | Do NOT delete database; post warning comment with cleanup instructions (remove persist-db label to trigger deletion). |
| 9 | First deploy with both use-snapshot AND persist-db | User wants snapshot data then lock | Create database from snapshot on first sync, then lock for subsequent syncs. Both labels are honored in sequence. |
| 10 | MongoDB connection issues during restore | Restore fails | backoffLimit: 3 with exponential backoff. Clear error in ArgoCD. |
| 11 | Snapshot database runs out of space | Atlas storage limit | Monitor Atlas storage. A 20GB snapshot should fit in an M10+ tier. |
| 12 | Collection schema changes between production and test | Potential data issues | Schema version tracked in metadata. Services handle schema migration. |
| 13 | Multiple PRs requesting snapshot simultaneously | Parallel restore from same source | Each PR gets its own copy; syrf_snapshot is read-only during restores. No conflicts. |
| 14 | Snapshot producer runs during restore | Stale data mid-restore | Restore checks metadata timestamp. If changed mid-restore, log a warning but continue (acceptable for testing). |
| 15 | Manual changes to syrf_snapshot | Corruption risk | Document as read-only. Only snapshot-producer should write. A Kyverno policy can enforce this. |

4.1 Detailed Mitigation: Snapshot Availability Wait

# Implemented in restore job
MAX_RETRIES=12
RETRY_INTERVAL=30  # 30 seconds

for i in $(seq 1 $MAX_RETRIES); do
  METADATA=$(mongosh ... --eval "db.snapshot_metadata.findOne({_id: 'latest'})")

  if [ -n "$METADATA" ] && [ "$METADATA" != "null" ]; then
    echo "Snapshot available"
    break
  fi

  if [ $i -eq $MAX_RETRIES ]; then
    echo "ERROR: Snapshot not available after $(($MAX_RETRIES * $RETRY_INTERVAL / 60)) minutes"
    echo "This is expected on first deployment before weekly snapshot runs."
    echo "Options:"
    echo "  1. Wait for Sunday 3 AM UTC weekly snapshot"
    echo "  2. Trigger on-demand snapshot via GitHub Actions"
    echo "  3. Remove 'use-snapshot' label to use seed data"
    exit 1
  fi

  echo "Waiting for snapshot... (attempt $i/$MAX_RETRIES)"
  sleep $RETRY_INTERVAL
done
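The wait loop above can also be factored into a small reusable helper. This is a sketch: `retry_until` and `check_stub` are illustrative names, and the stub stands in for the real mongosh metadata query:

```shell
# retry_until MAX INTERVAL CMD...: run CMD until it succeeds, sleeping
# INTERVAL seconds between attempts, for at most MAX attempts.
retry_until() {
  max="$1"; interval="$2"; shift 2
  i=1
  while [ "$i" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$max" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  return 1
}

# Stand-in for the mongosh metadata check: succeeds on the 3rd attempt.
attempts=0
check_stub() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}

if retry_until 5 0 check_stub; then
  echo "snapshot available after $attempts attempts"
fi
```

With the production values (`retry_until 12 30 <metadata check>`) this gives the same 6-minute maximum wait as the inline loop.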

5. Testing Strategy

5.1 Unit Tests

Not applicable - this feature is infrastructure-only (no application code changes).

5.2 Integration Tests

  • Snapshot Producer Job: Run manually, verify all collections copied
  • Restore Job: Deploy test PR with use-snapshot, verify data present
  • Label Conflict: Add use-snapshot while persist-db is present, verify use-snapshot is reverted and a comment is posted
  • Preview Comment: Add preview label, verify environment comment posted

5.3 Manual Verification Steps

# 1. Verify snapshot producer CronJob is deployed
kubectl get cronjob snapshot-producer -n syrf-system

# 2. Trigger manual snapshot
kubectl create job --from=cronjob/snapshot-producer snapshot-test -n syrf-system

# 3. Monitor job progress
kubectl logs -f job/snapshot-test -n syrf-system

# 4. Verify snapshot metadata
mongosh "mongodb+srv://..." --eval "
  db.getSiblingDB('syrf_snapshot').snapshot_metadata.findOne({_id: 'latest'})
"

# 5. Verify collection counts match production
mongosh "mongodb+srv://..." --eval "
  const snap = db.getSiblingDB('syrf_snapshot');
  const prod = db.getSiblingDB('syrftest');
  ['pmProject', 'pmStudy', 'pmInvestigator'].forEach(c => {
    print(c + ': snap=' + snap.getCollection(c).countDocuments() +
          ' prod=' + prod.getCollection(c).countDocuments());
  });
"

# 6. Create test PR with use-snapshot label
gh pr create --title "Test snapshot restore" --body "Testing snapshot feature"
gh pr edit <PR_NUM> --add-label "preview,use-snapshot"

# 7. Verify PR comments
gh pr view <PR_NUM> --comments

# 8. Verify restore job ran
kubectl get jobs -n pr-<PR_NUM>
kubectl logs job/snapshot-restore-<PR_NUM>-<SHA> -n pr-<PR_NUM>

# 9. Verify data in preview database
mongosh "mongodb+srv://..." --eval "
  db.getSiblingDB('syrf_pr_<PR_NUM>').pmProject.countDocuments()
"

6. Implementation Checklist

Implementation Status: Phases 1 and 3 core components implemented (2026-01-13). See Implementation Notes below for changes from the original spec.

Phase 1: Infrastructure Setup

  • 1.1 Create MongoDB Atlas user snapshot-producer with required roles
      • Requires manual creation in Atlas Console (same pattern as prod/staging users)
      • Roles: read on syrftest, readWrite on syrf_snapshot
  • 1.2 Add credentials to GCP Secret Manager (snapshot-producer-mongodb)
      • Keys: username, password (connection string constructed from cluster host in Helm values)
  • 1.3 Create ExternalSecret for snapshot-producer credentials (cluster-gitops)
      • Added to plugins/local/extra-secrets-staging/values.yaml
  • 1.4 Update Kyverno policy to allow PR users read on syrf_snapshot
      • Updated plugins/helm/kyverno/resources/atlas-pr-user-policy.yaml
      • Rule 5 now allows syrf_snapshot (read-only) in addition to syrf_pr_*
  • 1.5 Create snapshot-producer Helm chart
      • Created charts/snapshot-producer/ with CronJob template
      • Plugin config at plugins/local/snapshot-producer/

Phase 2: Snapshot Producer

  • 2.1 Create CronJob manifest in cluster-gitops
      • Located at charts/snapshot-producer/templates/cronjob.yaml
      • Schedule: Sunday 2 AM UTC
  • 2.2 Test manual job trigger
  • 2.3 Verify all 11 collections are copied correctly
  • 2.4 Verify snapshot_metadata is written on success (see Implementation Notes)
  • 2.5 Monitor first automated weekly run (Sunday 2 AM)

Phase 3: PR Preview Integration

  • 3.1 Add use-snapshot label detection to pr-preview.yml
      • Label triggers workflow on add/remove
      • Check step outputs use_snapshot flag
  • 3.2 Add persist-db conflict resolution logic
      • persist-db takes precedence and blocks snapshot restore when present
  • 3.3 Add PR comment for preview environment (with DB details)
  • 3.4 Add PR comment for label conflict resolution
  • 3.5 Generate snapshot-restore-job.yaml when label present
      • Only when use-snapshot=true AND persist-db=false
      • Uses PR user credentials (now has read on syrf_snapshot)
  • 3.6 Update AtlasDatabaseUser CRD to add snapshot read role conditionally
      • Role added only when use-snapshot label is present
  • 3.7 Test with real PR (add both labels, verify behaviour)

Phase 4: Staging Configuration

  • 4.1 Add database.source config to staging.values.yaml
  • 4.2 Create staging-specific restore job template in Helm chart
  • 4.3 Test staging with snapshot source enabled
  • 4.4 Document staging configuration in cluster-gitops

Phase 5: On-Demand Trigger

  • 5.1 Create snapshot-on-demand.yml workflow
  • 5.2 Test manual trigger via workflow_dispatch
  • 5.3 Document in how-to guide

Phase 6: Documentation & Cleanup

  • 6.1 Update PR preview how-to guide with snapshot option
  • 6.2 Update MongoDB testing strategy doc
  • 6.3 Delete planning documents (clarify.md)
  • 6.4 Update CLAUDE.md with snapshot feature

Implementation Notes

Changes from Original Specification:

| Aspect | Original | Implemented | Reason |
|--------|----------|-------------|--------|
| User naming | syrf_snapshot_operator | snapshot-producer | Simpler naming |
| Secret name | syrf-snapshot-operator-credentials | snapshot-producer-mongodb | Consistent with existing pattern |
| User creation | Operator CRD | Manual Atlas Console | Follows prod/staging pattern for long-lived users |
| Kyverno policy | Not specified | Updated Rule 5 | Required to allow syrf_snapshot read access |
| Namespace | syrf-system | syrf-staging | Plugin pattern requires namespace |
| snapshot_metadata | Separate collection | Implemented | Now written to all databases (snapshot, PR, empty seed) |

Related Documentation:


7. Open Questions

All questions have been resolved through the clarification process:

| Question | Resolution |
|----------|------------|
| PII handling | Not required; skip anonymization |
| Storage location | MongoDB Atlas database (syrf_snapshot), not GCS |
| Retention policy | Single snapshot, replaced weekly |
| PR configuration | Label-based (use-snapshot for data source, persist-db for lock) |
| Collection scope | All pm-prefixed collections (11 total) |
| Label interaction | persist-db is a LOCK that takes precedence; it blocks all label changes to use-snapshot |
| Orphan databases | Manual cleanup: remove persist-db label from a closed PR to trigger deletion |
| Snapshot visibility | Snapshot timestamp shown in PR comments so users know data freshness |
| Permission model | Defense in depth: separate users with minimal permissions. No snapshot/restore user can write to syrftest. Producer uses read-only prod access; restore uses a PR-specific user that can only write to its own DB. |
| Restore method | $out aggregation for both producer and restore (fast; data stays in Atlas). PR user has read on snapshot, readWrite on its own DB, and ZERO access to syrftest. |

8. References


Document End

This document must be reviewed and approved before implementation begins.