Data Snapshot Automation - Implementation Specification¶
Status: Draft - Pending Review
Author: Claude (Senior Systems Architect)
Date: 2026-01-13
Target Environment: GKE Cluster (camaradesuk) / MongoDB Atlas
Executive Summary¶
This specification details the implementation of automated production data snapshots for use in PR preview and staging environments. The solution uses a snapshot database approach where a dedicated MongoDB database (syrf_snapshot) on the Atlas cluster is refreshed weekly (or on-demand) from production, and other environments copy data from it.
Key Design Decisions:
- No PII anonymization required (confirmed by stakeholder)
- Single snapshot database instead of file-based storage (GCS)
- PR label-based configuration (`use-snapshot` for data source, `persist-db` for lock)
- Weekly refresh (Sunday 3 AM UTC) with on-demand capability
- ~20GB database size
- `$out` aggregation for both jobs (fast, data stays in Atlas):
  - Producer (prod → snapshot): uses the `syrf_snapshot_producer` user (read prod, write snapshot)
  - Restore (snapshot → PR): uses the PR-specific user (read snapshot, write own DB only)
- Defense in depth: no snapshot/restore user can write to production (`syrftest`)
Table of Contents¶
- High-Level Architecture
- Detailed Design
- Execution Flow
- Edge Cases & Mitigations
- Testing Strategy
- Implementation Checklist
- Open Questions
- References
1. High-Level Architecture¶
1.1 Overview¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ MongoDB Atlas Cluster (Cluster0) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Production │ Weekly CronJob │ Snapshot │ │
│ │ syrftest │─────────────────────▶│ syrf_snapshot │ │
│ │ (~20GB) │ (copy collections) │ (~20GB) │ │
│ └─────────────────┘ └────────┬────────┘ │
│ │ │
│ ┌────────────────────────┼────────────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌──────────┐ │
│ │ Staging │ │ PR Preview │ │ PR N │ │
│ │ syrf_staging │ │ syrf_pr_123 │ │ syrf_pr..│ │
│ └─────────────────┘ └─────────────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
1.2 Key Components¶
| Component | Purpose | Location |
|---|---|---|
| Snapshot Producer | Weekly CronJob that copies production → syrf_snapshot | GKE syrf-system namespace |
| Snapshot Restore Job | PreSync Job that copies syrf_snapshot → target DB | Per-environment namespace |
| PR Preview Workflow | Detects `use-snapshot` label, generates restore job | GitHub Actions |
| Staging Values | GitOps config to enable/disable snapshot source | cluster-gitops |
1.3 Dependencies¶
- MongoDB Atlas cluster with read access to production (`syrftest`)
- MongoDB Atlas database user for snapshot operations
- GKE cluster with Jobs capability
- Existing PR preview infrastructure (Atlas Operator, Kyverno)
1.4 Collections to Copy¶
Based on codebase analysis, these collections are required:
INCLUDE (Core Application Data):
pmProject # Projects, stages, memberships
pmStudy # Studies, annotations, screening
pmInvestigator # User accounts
pmSystematicSearch # Literature searches
pmDataExportJob # Export job tracking
pmStudyCorrection # PDF correction requests
pmInvestigatorUsage # Usage statistics
pmRiskOfBiasAiJob # AI risk of bias jobs
pmProjectDailyStat # Daily statistics
pmPotential # Pre-registration records
pmInvestigatorEmail # Email lookup cache
EXCLUDE (Infrastructure/Temporary):
1.5 Label Interaction Model¶
The PR preview system uses three labels with distinct purposes:
| Label | Purpose | Behavior |
|---|---|---|
| `preview` | Trigger - Enables preview environment | Creates namespace, deploys services |
| `use-snapshot` | Source - Determines data source | When DB is created, use snapshot instead of seed data |
| `persist-db` | Lock - Protects database | Prevents ALL database modifications |
1.5.1 Core Principle: persist-db as a Lock¶
persist-db ALWAYS takes precedence. When present:
- Database is NEVER dropped
- Database is NEVER refreshed/recreated
- Label changes (use-snapshot) are reverted with explanatory comment
- PR close/merge does NOT delete database (requires manual cleanup)
1.5.2 Label State Matrix¶
| persist-db | use-snapshot | On Sync | Database Action |
|---|---|---|---|
| ❌ | ❌ | Normal | Drop DB → Seed fresh data (5 sample projects) |
| ❌ | ✅ | Normal | Drop DB → Restore from snapshot |
| ✅ | ❌ | Normal | No action - DB protected |
| ✅ | ✅ | Normal | No action - DB protected |
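The matrix above can be modeled as a small decision function. This is an illustrative sketch (the names are ours, not taken from the codebase); the real logic lives in the GitHub Actions workflow in section 2.4:

```javascript
// Illustrative model of the label state matrix. persist-db always wins:
// when the lock is present, the data-source label is irrelevant on sync.
function dbActionOnSync({ persistDb, useSnapshot }) {
  if (persistDb) return "no action - DB protected"; // lock takes precedence
  return useSnapshot
    ? "drop DB, restore from snapshot"
    : "drop DB, seed fresh data";
}

console.log(dbActionOnSync({ persistDb: false, useSnapshot: true }));
// -> drop DB, restore from snapshot
console.log(dbActionOnSync({ persistDb: true, useSnapshot: true }));
// -> no action - DB protected
```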
1.5.3 Label Change Behavior¶
When use-snapshot is added:
persist-db absent?
│
├─ YES → Immediately drop DB, restore from snapshot
│ Post comment: "🗄️ Database recreated from production snapshot (taken: <timestamp>)"
│
└─ NO → Revert label addition, post comment:
"⚠️ Cannot enable snapshot mode: database is locked by persist-db label.
Remove persist-db first if you want to refresh the database."
When use-snapshot is removed:
persist-db absent?
│
├─ YES → Immediately drop DB (next sync will seed fresh data)
│ Post comment: "🌱 Database dropped. Fresh seed data will be created on next sync."
│
└─ NO → Revert label removal, post comment:
"⚠️ Cannot disable snapshot mode: database is locked by persist-db label.
Remove persist-db first if you want to change data source."
When persist-db is added:
Post comment: "🔒 Database is now LOCKED. It will not be modified or deleted,
even when this PR is closed/merged. Remove this label to unlock."
When persist-db is removed:
Is PR still open?
│
├─ YES → Post comment: "🔓 Database unlocked. Next sync will apply current data source
│ (snapshot: <yes/no>). If you don't push a new commit, manually trigger a sync."
│ → Next sync applies current use-snapshot state
│
└─ NO (PR closed/merged) → Immediately drop database
Post comment: "🗑️ Database syrf_pr_N dropped (persist-db lock removed on closed PR)"
1.5.4 PR Close/Merge Behavior¶
PR Closed/Merged
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Check: Is persist-db label present? │
├─────────────────────────────────────────────────────────────┤
│ YES → DO NOT delete database │
│ Post comment: "⚠️ Database syrf_pr_N was NOT deleted │
│ because persist-db label is present. │
│ │
│ To delete: Remove the persist-db label from this PR │
│ (even though it's closed) and the database will be │
│ automatically dropped." │
├─────────────────────────────────────────────────────────────┤
│ NO → Delete database as normal │
│ Post comment: "✅ Database syrf_pr_N deleted" │
└─────────────────────────────────────────────────────────────┘
1.5.5 First Deploy with Both Labels¶
When a new PR is created with BOTH use-snapshot AND persist-db labels from the start:
- First sync: create the DB from the snapshot (honoring `use-snapshot`)
- Subsequent syncs: the database is locked (honoring `persist-db`)
This allows users to initialize with production data and then lock it for testing.
1.5.6 Orphan Database Cleanup¶
Databases from closed PRs with persist-db remain until manually cleaned:
Manual Cleanup Options:
1. Remove persist-db label from the closed PR → triggers automatic deletion
2. Direct MongoDB cleanup via admin tools (for bulk cleanup)
No automatic expiration - databases persist indefinitely until explicitly cleaned.
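For option 2 (bulk cleanup via admin tools), the candidate-selection logic can be sketched as below. This is a hypothetical helper, not part of the spec's tooling: the name filtering is plain JS so it runs standalone; in mongosh, the database list would come from `db.adminCommand({ listDatabases: 1 }).databases.map(d => d.name)` and the open-PR numbers from `gh pr list --state open --json number`:

```javascript
// Given all database names on the cluster and the set of still-open PR
// numbers, return syrf_pr_N databases whose PR is closed (cleanup candidates).
function orphanPrDatabases(allDbNames, openPrNumbers) {
  return allDbNames.filter((name) => {
    const m = name.match(/^syrf_pr_(\d+)$/);
    // Non-PR databases (syrftest, syrf_snapshot, ...) never match the pattern.
    return m !== null && !openPrNumbers.includes(Number(m[1]));
  });
}

console.log(
  orphanPrDatabases(
    ["syrftest", "syrf_snapshot", "syrf_pr_101", "syrf_pr_202"],
    [101] // PRs still open
  )
); // -> [ 'syrf_pr_202' ]
```

An operator would review this list before issuing `dropDatabase()` calls, since `persist-db` databases on closed PRs may be intentionally retained.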
2. Detailed Design¶
2.1 MongoDB Permission Model¶
📖 Detailed Reference: See MongoDB Permissions Explained for comprehensive documentation of the permission model, including common misconceptions and cleanup strategies.
2.1.1 Security Principles¶
⚠️ CRITICAL: Production (syrftest) must NEVER be writable by any snapshot/restore job.
⚠️ IMPORTANT: MongoDB does NOT support wildcard database permissions. You cannot grant permissions on a pattern like syrf_pr_*. Each database must be explicitly named in role grants, OR you must use cluster-wide roles like dbAdminAnyDatabase (which provides access to ALL databases including production).
The permission model follows defense-in-depth:
- Principle of Least Privilege: each user has the minimum permissions needed
- Database-Level Isolation: no user can write to production
- Explicit Database Grants: each PR user gets permissions on its specific `syrf_pr_N` database only (via Atlas Operator)
- Script-Level Validation: all scripts validate targets before operations
- Audit Trail: all operations are logged
2.1.2 MongoDB Atlas Users¶
| User | Purpose | Databases | Permissions |
|---|---|---|---|
| `syrf_snapshot_producer` | Weekly snapshot CronJob | `syrftest` → `syrf_snapshot` | READ prod, WRITE snapshot |
| `syrf-pr-N-user` | PR-specific user (Atlas Operator) | `syrf_snapshot`, `syrf_pr_N` | READ snapshot, WRITE own DB |
Note: No separate syrf_snapshot_reader user is needed. Each PR user has read access to syrf_snapshot for restore operations.
2.1.3 User: syrf_snapshot_producer¶
Purpose: Weekly CronJob that copies syrftest → syrf_snapshot
MongoDB Atlas Roles:
| Database | Role | Justification |
|---|---|---|
| `syrftest` | `read` | Read production collections for copying |
| `syrf_snapshot` | `readWrite` | Write copied data |
| `syrf_snapshot` | `dbAdmin` | Drop collections before refresh |
Critical Safety: This user has NO write access to production. Even a bug in the script cannot corrupt syrftest.
// MongoDB Atlas User Configuration
{
"user": "syrf_snapshot_producer",
"roles": [
{ "role": "read", "db": "syrftest" },
{ "role": "readWrite", "db": "syrf_snapshot" },
{ "role": "dbAdmin", "db": "syrf_snapshot" }
]
}
Secret Storage: GCP Secret Manager → External Secrets → Kubernetes Secret
# GCP Secret Manager: snapshot-producer-mongodb
# Only username/password needed - connection string is constructed in CronJob
# from the mongodbHost value in Helm chart configuration
{
  "username": "syrf_snapshot_producer",
"password": "<secure-password>"
}
2.1.4 User: syrf_snapshot_reader (NOT NEEDED)¶
> UPDATE: A separate `syrf_snapshot_reader` user is NOT required. Each PR-specific user (created by Atlas Operator) already gets `read` access to `syrf_snapshot`. See section 2.1.5 below and section 2.1.9 for details.
2.1.5 PR-Specific Users (Atlas Operator)¶
Purpose: Each PR gets its own MongoDB user with access only to its database
The existing Atlas Operator creates users like syrf-pr-123-user with:
| Database | Role | Justification |
|---|---|---|
| `syrf_pr_123` | `readWrite` | Application access |
| `syrf_pr_123` | `dbAdmin` | Schema management |
For snapshot restore, we extend this to also grant:
| Database | Role | Justification |
|---|---|---|
| `syrf_snapshot` | `read` | Read snapshot for restore |
This is configured in the Atlas Operator's AtlasDatabaseUser CRD:
# Example: PR user with snapshot read access
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
name: syrf-pr-123-user
namespace: pr-123
spec:
username: syrf-pr-123-user
databaseName: admin
roles:
# Existing: Access to PR database
- roleName: readWrite
databaseName: syrf_pr_123
- roleName: dbAdmin
databaseName: syrf_pr_123
# NEW: Read access to snapshot for restore
- roleName: read
databaseName: syrf_snapshot
2.1.6 PR Database Cleanup Strategy¶
Problem: Who can drop syrf_pr_N databases when a PR is closed?
The PR-specific user (syrf-pr-N-user) has dbAdmin on its own database, so it CAN drop it. However, when the PR is closed:
- The AtlasDatabaseUser CRD is deleted by ArgoCD
- Atlas Operator deletes the MongoDB user
- The credentials are gone before the workflow can use them to drop the database
Solution: Use ArgoCD PreDelete hook to drop the database BEFORE the user is deleted.
# PreDelete hook - runs before namespace/resources are deleted
apiVersion: batch/v1
kind: Job
metadata:
name: mongodb-cleanup-${PR_NUM}
namespace: pr-${PR_NUM}
annotations:
argocd.argoproj.io/hook: PreDelete
argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
template:
spec:
restartPolicy: Never
containers:
- name: cleanup
image: mongo:7.0
command: ["/bin/bash", "-c"]
args:
- |
echo "Dropping database syrf_pr_${PR_NUM}..."
mongosh "$MONGODB_URI" --quiet --eval "
db.getSiblingDB('syrf_pr_${PR_NUM}').dropDatabase();
print('Database dropped successfully');
"
env:
- name: MONGODB_URI
valueFrom:
secretKeyRef:
name: mongodb-credentials
key: connectionString
Why this works:
- The PreDelete hook runs BEFORE the AtlasDatabaseUser CRD is deleted
- The PR user still has credentials at this point
- The PR user has `dbAdmin` on its own database (so it can drop it)
- After the hook completes, ArgoCD deletes the namespace (including the CRD)
- The Atlas Operator then cleans up the user
Fallback: If the hook fails or times out, the GitHub workflow cleanup step can attempt manual cleanup.
> ⚠️ Current gap in `pr-preview.yml`: The cleanup step uses the `mongo-db` secret from the staging namespace, which (per least privilege) should only have access to `syrf_staging`. Options:
>
> - Preferred: Rely on the PreDelete hook (uses the PR user's own credentials)
> - Fallback: Create a dedicated `syrf_cleanup_user` with `dbAdmin` on each `syrf_pr_N` database (note: wildcard grants like `syrf_pr_*` are not supported, per section 2.1.1, so this requires explicit per-database grants)
> - Not recommended: Give the staging user `dbAdminAnyDatabase` (violates least privilege)
2.1.7 Permission Matrix¶
| User | syrftest | syrf_snapshot | syrf_staging | syrf_pr_N |
|---|---|---|---|---|
| `syrf_snapshot_producer` | 📖 READ | ✏️ WRITE | ❌ | ❌ |
| `syrf-pr-N-user` | ❌ | 📖 READ | ❌ | ✏️ WRITE + 🗑️ DROP (own DB only) |
| Application (staging) | ❌ | ❌ | ✏️ WRITE | ❌ |
| Application (production) | ✏️ WRITE | ❌ | ❌ | ❌ |
Key Insight: No snapshot/restore user can write to syrftest. Production is protected at the MongoDB permission level.
Why the PR user is safe for restore:
- The restore flow is `syrf_snapshot` → `syrf_pr_N` (it never touches `syrftest`)
- The PR user has `read` on `syrf_snapshot` (can use it as the `$out` source)
- The PR user has `readWrite` on `syrf_pr_N` (can use it as the `$out` target)
- The PR user has ZERO access to `syrftest` - it physically cannot touch production
MongoDB $out Permission Requirements (per official docs):
| Permission | Required On | Action |
|---|---|---|
| `find` | Source collection | Read documents for aggregation |
| `insert` | Destination collection | Write output documents |
| `remove` | Destination collection | Replace existing collection |
Key insight: Read-only access to source is SUFFICIENT. No write access to source is needed.
2.1.8 Script-Level Validation (Defense in Depth)¶
Even with proper permissions, all scripts include validation:
# CRITICAL: Validate target database before ANY operation
validate_target_db() {
local target="$1"
# Must match syrf_pr_N pattern
if [[ ! "$target" =~ ^syrf_pr_[0-9]+$ ]]; then
echo "FATAL: Invalid target database name: $target"
echo "Expected pattern: syrf_pr_N (e.g., syrf_pr_123)"
exit 1
fi
# Explicit blocklist (belt and suspenders)
case "$target" in
syrftest|syrfdev|syrf_snapshot|syrf_staging|admin|local|config)
echo "FATAL: Cannot target protected database: $target"
exit 1
;;
esac
echo "✓ Target database validated: $target"
}
# Called at start of restore job
validate_target_db "$TARGET_DB"
2.1.9 Restore Job Connection Strategy¶
Key Insight: The restore job flow is syrf_snapshot → syrf_pr_N. It never touches syrftest!
The PR-specific user (created by Atlas Operator) can safely use $out aggregation because:
1. It has read access to syrf_snapshot (source)
2. It has readWrite access to syrf_pr_N (target)
3. It has ZERO access to syrftest - physically cannot touch production!
# Restore job uses PR-specific credentials
# This user can ONLY read syrf_snapshot and write to syrf_pr_N
PR_USER_URI="$MONGODB_CONNECTION_STRING" # syrf-pr-N-user
# $out aggregation: syrf_snapshot → syrf_pr_N
# Safe because the user has NO access to syrftest
mongosh "$PR_USER_URI" --quiet --eval "
db.getSiblingDB('syrf_snapshot').getCollection('pmProject').aggregate([
{ \$out: { db: 'syrf_pr_${PR_NUM}', coll: 'pmProject' } }
]).toArray();
"
Why this is safe:
| If script tries to... | Result |
|---|---|
| Read from `syrftest` | FAILS - user has no read access |
| Write to `syrftest` | FAILS - user has no write access |
| Write to `syrf_snapshot` | FAILS - user only has read access |
| Write to other `syrf_pr_*` | FAILS - user only has access to own DB |
Conclusion: Use $out aggregation for both producer AND restore jobs. The PR user cannot touch production because it has no permissions on syrftest.
2.1.10 No Shared Restore User Needed¶
We do NOT need a shared syrf_snapshot_reader user. Each PR's user already has the permissions needed:
- read on syrf_snapshot (for $out source)
- readWrite on syrf_pr_N (for $out target)
This is cleaner and more secure than a shared user.
2.2 Snapshot Producer CronJob¶
Weekly job that copies production data to the snapshot database using MongoDB's $out aggregation.
Why $out instead of mongodump/mongorestore?
Since source and destination databases are on the same Atlas cluster, $out aggregation is significantly better:
| Factor | mongodump/mongorestore | $out aggregation |
|---|---|---|
| Data path | MongoDB → K8s pod → MongoDB | Internal to Atlas |
| Network transfer | 20GB through pod | 0 bytes through pod |
| K8s resources | High (holds data in memory) | Minimal (sends commands) |
| Execution time | 30-60 minutes | 5-10 minutes |
| GKE cost | Higher compute + egress | Minimal |
# cluster-gitops/syrf/services/snapshot-producer/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: snapshot-producer
namespace: syrf-system
labels:
app.kubernetes.io/name: snapshot-producer
app.kubernetes.io/component: data-management
spec:
schedule: "0 3 * * 0" # Sunday 3:00 AM UTC
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
jobTemplate:
spec:
backoffLimit: 2
activeDeadlineSeconds: 1800 # 30 minute timeout (faster with $out)
template:
spec:
restartPolicy: Never
containers:
- name: snapshot
image: mongo:7.0
command: ["/bin/bash", "-c"]
args:
- |
set -e
echo "=== Starting Production Snapshot (using \$out aggregation) ==="
echo "Timestamp: $(date -Iseconds)"
# Collections to copy (pm-prefixed + core)
COLLECTIONS="pmProject pmStudy pmInvestigator pmSystematicSearch pmDataExportJob pmStudyCorrection pmInvestigatorUsage pmRiskOfBiasAiJob pmProjectDailyStat pmPotential pmInvestigatorEmail"
echo "Collections to copy: $COLLECTIONS"
# Drop existing snapshot collections first
echo "Clearing existing snapshot database..."
mongosh "$MONGO_URI" --quiet --eval "
const snapDb = db.getSiblingDB('syrf_snapshot');
snapDb.getCollectionNames().forEach(c => {
print('Dropping: ' + c);
snapDb.getCollection(c).drop();
});
"
# Copy each collection using $out (stays within Atlas cluster)
for col in $COLLECTIONS; do
echo "Copying collection: $col"
mongosh "$MONGO_URI" --quiet --eval "
const startTime = Date.now();
const result = db.getSiblingDB('syrftest').getCollection('$col').aggregate([
                  { \$out: { db: 'syrf_snapshot', coll: '$col' } }
]).toArray();
const count = db.getSiblingDB('syrf_snapshot').getCollection('$col').countDocuments();
const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
print(' ✓ $col: ' + count + ' documents (' + elapsed + 's)');
"
done
# Write metadata
echo "Writing snapshot metadata..."
mongosh "$MONGO_URI" --quiet --eval "
const collections = '$COLLECTIONS'.split(' ');
db.getSiblingDB('syrf_snapshot').snapshot_metadata.updateOne(
{ _id: 'latest' },
{
                  \$set: {
timestamp: new Date(),
source: 'syrftest',
collections: collections,
status: 'complete',
                    method: '\$out aggregation'
}
},
{ upsert: true }
);
print('Metadata written successfully');
"
echo "=== Snapshot Complete ==="
echo "Timestamp: $(date -Iseconds)"
env:
- name: MONGO_URI
valueFrom:
secretKeyRef:
name: snapshot-producer-credentials
key: connectionString
resources:
requests:
memory: "64Mi"
cpu: "50m"
limits:
memory: "256Mi"
cpu: "200m"
Note: The job only needs minimal resources since it just sends commands to MongoDB. All data movement happens within the Atlas cluster.
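A consumer can use the `snapshot_metadata` document written above to detect a stale or failed snapshot (e.g. if the Sunday CronJob silently stopped running). The helper below is a hedged sketch, pure JS for illustration; in mongosh, `meta` would be fetched with `db.getSiblingDB('syrf_snapshot').snapshot_metadata.findOne({ _id: 'latest' })`:

```javascript
// Age of the latest snapshot in days, or null if metadata is missing or the
// last run did not complete. An age > 8 days means the weekly job was missed.
function snapshotAgeDays(meta, now = new Date()) {
  if (!meta || meta.status !== "complete" || !meta.timestamp) return null;
  return (now - new Date(meta.timestamp)) / (1000 * 60 * 60 * 24);
}

const meta = { timestamp: "2026-01-11T03:00:00Z", status: "complete" };
console.log(snapshotAgeDays(meta, new Date("2026-01-13T03:00:00Z"))); // -> 2
```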
2.3 Snapshot Restore Job (PreSync)¶
Job that copies from snapshot database to target PR database using $out aggregation.
Why $out is safe for restore jobs?
The restore job flow is syrf_snapshot → syrf_pr_N. It NEVER touches syrftest!
The PR-specific user (created by Atlas Operator) has:
- `read` on `syrf_snapshot` (source for `$out`)
- `readWrite` on `syrf_pr_N` (target for `$out`)
- ZERO access to `syrftest` - it physically cannot touch production
| If script tries to... | Result |
|---|---|
| Read from `syrftest` | FAILS - user has no read access |
| Write to `syrftest` | FAILS - user has no write access |
| Write to `syrf_snapshot` | FAILS - user only has read access |
Decision: Use $out aggregation for both producer and restore jobs. This is faster (data stays in Atlas) and equally secure (PR user has no production access).
# Template for PR preview (generated by workflow)
apiVersion: batch/v1
kind: Job
metadata:
name: snapshot-restore-${PR_NUM}-${SHORT_SHA}
namespace: pr-${PR_NUM}
labels:
app.kubernetes.io/managed-by: pr-preview-workflow
app.kubernetes.io/component: snapshot-restore
annotations:
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
argocd.argoproj.io/sync-wave: "3" # After MongoDB user created
spec:
ttlSecondsAfterFinished: 600
backoffLimit: 3
activeDeadlineSeconds: 900 # 15 minute timeout ($out is fast)
template:
spec:
restartPolicy: Never
containers:
- name: restore
image: mongo:7.0
command: ["/bin/bash", "-c"]
args:
- |
set -e
TARGET_DB="syrf_pr_${PR_NUM}"
echo "=== Restoring Snapshot to $TARGET_DB ==="
echo "Using \$out aggregation (data stays in Atlas)"
echo "Timestamp: $(date -Iseconds)"
# ============================================================
# CRITICAL: Validate target database name (defense in depth)
# ============================================================
validate_target_db() {
local target="$1"
# Must match syrf_pr_N pattern
if [[ ! "$target" =~ ^syrf_pr_[0-9]+$ ]]; then
echo "FATAL: Invalid target database name: $target"
echo "Expected pattern: syrf_pr_N (e.g., syrf_pr_123)"
exit 1
fi
# Explicit blocklist (belt and suspenders)
case "$target" in
syrftest|syrfdev|syrf_snapshot|syrf_staging|admin|local|config)
echo "FATAL: Cannot target protected database: $target"
exit 1
;;
esac
echo "✓ Target database validated: $target"
}
validate_target_db "$TARGET_DB"
# ============================================================
# Wait for snapshot to be available
# ============================================================
MAX_RETRIES=12
RETRY_INTERVAL=30
echo "Checking snapshot availability..."
for i in $(seq 1 $MAX_RETRIES); do
METADATA=$(mongosh "$MONGODB_URI" --quiet --eval "
JSON.stringify(db.getSiblingDB('syrf_snapshot').snapshot_metadata.findOne({_id: 'latest'}))
")
if [ -n "$METADATA" ] && [ "$METADATA" != "null" ]; then
echo "Snapshot available: $METADATA"
SNAPSHOT_TIME=$(echo "$METADATA" | grep -oP '"timestamp":\s*"\K[^"]+' || echo "unknown")
echo "Snapshot timestamp: $SNAPSHOT_TIME"
break
fi
if [ $i -eq $MAX_RETRIES ]; then
echo "ERROR: Snapshot not available after $MAX_RETRIES retries"
echo "Options:"
echo " 1. Wait for Sunday 3 AM UTC weekly snapshot"
echo " 2. Trigger on-demand snapshot via GitHub Actions"
echo " 3. Remove 'use-snapshot' label to use seed data"
exit 1
fi
echo "Waiting for snapshot... (attempt $i/$MAX_RETRIES)"
sleep $RETRY_INTERVAL
done
# ============================================================
# Get collection list from snapshot metadata
# ============================================================
COLLECTIONS=$(mongosh "$MONGODB_URI" --quiet --eval "
db.getSiblingDB('syrf_snapshot').snapshot_metadata.findOne({_id: 'latest'}).collections.join(' ')
")
echo "Collections to restore: $COLLECTIONS"
# ============================================================
# Drop existing collections in target database
# ============================================================
echo ""
echo "Step 1: Clearing target database..."
mongosh "$MONGODB_URI" --quiet --eval "
const targetDb = db.getSiblingDB('$TARGET_DB');
targetDb.getCollectionNames().forEach(c => {
print('Dropping: ' + c);
targetDb.getCollection(c).drop();
});
"
# ============================================================
# Copy each collection using $out (stays within Atlas)
# ============================================================
echo ""
echo "Step 2: Copying collections via \$out aggregation..."
echo "(Data stays within Atlas cluster - no network transfer)"
for col in $COLLECTIONS; do
echo "Copying: $col"
mongosh "$MONGODB_URI" --quiet --eval "
const startTime = Date.now();
db.getSiblingDB('syrf_snapshot').getCollection('$col').aggregate([
                  { \$out: { db: '$TARGET_DB', coll: '$col' } }
]).toArray();
const count = db.getSiblingDB('$TARGET_DB').getCollection('$col').countDocuments();
const elapsed = ((Date.now() - startTime) / 1000).toFixed(1);
print(' ✓ $col: ' + count + ' documents (' + elapsed + 's)');
"
done
# ============================================================
# Verify restore
# ============================================================
echo ""
echo "Step 3: Verification summary..."
mongosh "$MONGODB_URI" --quiet --eval "
const targetDb = db.getSiblingDB('$TARGET_DB');
let total = 0;
targetDb.getCollectionNames().filter(c => c !== 'system.profile').forEach(c => {
const count = targetDb.getCollection(c).countDocuments();
total += count;
print(' ' + c + ': ' + count);
});
print('Total documents: ' + total);
"
echo ""
echo "=== Restore Complete ==="
echo "Timestamp: $(date -Iseconds)"
env:
# PR-specific connection (read on syrf_snapshot, readWrite on syrf_pr_N)
# This user has ZERO access to syrftest - it cannot touch production!
- name: MONGODB_URI
valueFrom:
secretKeyRef:
name: mongodb-credentials # Created by Atlas Operator for this PR
key: connectionString
- name: PR_NUM
value: "${PR_NUM}"
resources:
requests:
memory: "64Mi" # Minimal - just sends commands to MongoDB
cpu: "50m"
limits:
memory: "256Mi"
cpu: "200m"
Note: The restore job only needs minimal resources since it just sends commands to MongoDB. All data movement happens within the Atlas cluster. For a 20GB database, expect ~5-10 minutes restore time.
2.4 PR Preview Workflow Integration¶
Updates to .github/workflows/pr-preview.yml:
2.4.1 Label Detection Enhancement¶
# Add to check-label job
- name: Check for data source labels
id: snapshot-check
env:
GH_TOKEN: ${{ github.token }}
PR_NUM: ${{ steps.pr-info.outputs.pr_number }}
run: |
LABELS=$(gh pr view "$PR_NUM" --json labels -q '.labels[].name' 2>/dev/null || echo "")
# Check persist-db FIRST (it takes precedence as a lock)
if echo "$LABELS" | grep -q "persist-db"; then
echo "persist_db=true" >> "$GITHUB_OUTPUT"
echo "🔒 Label 'persist-db' found - database is LOCKED"
else
echo "persist_db=false" >> "$GITHUB_OUTPUT"
fi
# Check use-snapshot
if echo "$LABELS" | grep -q "use-snapshot"; then
echo "use_snapshot=true" >> "$GITHUB_OUTPUT"
echo "🗄️ Label 'use-snapshot' found - will use production snapshot"
else
echo "use_snapshot=false" >> "$GITHUB_OUTPUT"
echo "🌱 No 'use-snapshot' label - will use seed data"
fi
- name: Handle label change with persist-db lock
id: label-lock-check
env:
GH_TOKEN: ${{ github.token }}
PR_NUM: ${{ steps.pr-info.outputs.pr_number }}
PERSIST_DB: ${{ steps.snapshot-check.outputs.persist_db }}
EVENT_ACTION: ${{ github.event.action }}
LABEL_NAME: ${{ github.event.label.name }}
run: |
# If persist-db is present and use-snapshot label was just changed, revert it
if [ "$PERSIST_DB" = "true" ] && [ "$LABEL_NAME" = "use-snapshot" ]; then
if [ "$EVENT_ACTION" = "labeled" ]; then
# use-snapshot was just ADDED while persist-db present - revert
echo "⚠️ Cannot add use-snapshot: database locked by persist-db"
gh pr edit "$PR_NUM" --remove-label "use-snapshot"
gh pr comment "$PR_NUM" --body "⚠️ **Cannot enable snapshot mode**: Database is locked by \`persist-db\` label.
The \`use-snapshot\` label has been automatically removed.
To refresh the database from production snapshot:
1. Remove the \`persist-db\` label first
2. Then add the \`use-snapshot\` label"
echo "label_reverted=true" >> "$GITHUB_OUTPUT"
elif [ "$EVENT_ACTION" = "unlabeled" ]; then
# use-snapshot was just REMOVED while persist-db present - re-add it
echo "⚠️ Cannot remove use-snapshot: database locked by persist-db"
gh pr edit "$PR_NUM" --add-label "use-snapshot"
gh pr comment "$PR_NUM" --body "⚠️ **Cannot disable snapshot mode**: Database is locked by \`persist-db\` label.
The \`use-snapshot\` label has been automatically restored.
To change the data source:
1. Remove the \`persist-db\` label first
2. Then remove the \`use-snapshot\` label"
echo "label_reverted=true" >> "$GITHUB_OUTPUT"
fi
else
echo "label_reverted=false" >> "$GITHUB_OUTPUT"
fi
2.4.2 PR Comment Enhancement¶
# Add new job: post-preview-comment
post-preview-comment:
name: Post preview environment comment
needs: [check-label, detect-changes]
if: |
needs.check-label.outputs.should_build == 'true' &&
github.event.action == 'labeled' &&
github.event.label.name == 'preview'
runs-on: ubuntu-latest
permissions:
pull-requests: write
steps:
- name: Get snapshot timestamp
id: snapshot-info
if: needs.check-label.outputs.use_snapshot == 'true'
env:
MONGO_URI: ${{ secrets.SNAPSHOT_READER_URI }}
run: |
# Query snapshot metadata for timestamp
TIMESTAMP=$(mongosh "$MONGO_URI" --quiet --eval "
const meta = db.getSiblingDB('syrf_snapshot').snapshot_metadata.findOne({_id: 'latest'});
if (meta && meta.timestamp) {
print(meta.timestamp.toISOString());
} else {
print('unknown');
}
")
echo "timestamp=$TIMESTAMP" >> "$GITHUB_OUTPUT"
# Format for display
if [ "$TIMESTAMP" != "unknown" ]; then
FORMATTED=$(date -d "$TIMESTAMP" "+%Y-%m-%d %H:%M UTC" 2>/dev/null || echo "$TIMESTAMP")
echo "formatted=$FORMATTED" >> "$GITHUB_OUTPUT"
else
echo "formatted=Not yet available" >> "$GITHUB_OUTPUT"
fi
- name: Post preview environment comment
env:
GH_TOKEN: ${{ github.token }}
PR_NUM: ${{ needs.check-label.outputs.pr_number }}
USE_SNAPSHOT: ${{ needs.check-label.outputs.use_snapshot }}
PERSIST_DB: ${{ needs.check-label.outputs.persist_db }}
HEAD_SHA: ${{ needs.check-label.outputs.head_short_sha }}
SNAPSHOT_TIME: ${{ steps.snapshot-info.outputs.formatted }}
run: |
if [ "$USE_SNAPSHOT" = "true" ]; then
DB_SOURCE="🗄️ **Production Snapshot** (\`syrf_snapshot\` → \`syrf_pr_${PR_NUM}\`)"
DB_NOTE="📅 Snapshot taken: **${SNAPSHOT_TIME}**"
if [ "$PERSIST_DB" = "true" ]; then
DB_NOTE="${DB_NOTE}
🔒 Database is **LOCKED** - it will not be modified or deleted on rebuild or PR close."
else
DB_NOTE="${DB_NOTE}
Data will be refreshed from snapshot on each rebuild."
fi
else
DB_SOURCE="🌱 **Seed Data** (5 sample projects, ~100 studies)"
if [ "$PERSIST_DB" = "true" ]; then
DB_NOTE="🔒 Database is **LOCKED** - it will not be modified or deleted on rebuild or PR close.
Add the \`use-snapshot\` label to use production data instead (must remove \`persist-db\` first)."
else
DB_NOTE="Add the \`use-snapshot\` label to use production data instead."
fi
fi
COMMENT=$(cat <<EOF
## 🚀 Preview Environment Building
A preview environment is being built for this PR.
### Environment Details
| Setting | Value |
|---------|-------|
| **Namespace** | \`pr-${PR_NUM}\` |
| **Database** | \`syrf_pr_${PR_NUM}\` |
| **Data Source** | ${DB_SOURCE} |
| **Commit** | \`${HEAD_SHA}\` |
### URLs (available after deployment)
- 🌐 **Web**: https://pr-${PR_NUM}.syrf.org.uk
- 🔌 **API**: https://api.pr-${PR_NUM}.syrf.org.uk
- 📊 **PM**: https://project-management.pr-${PR_NUM}.syrf.org.uk
### Notes
${DB_NOTE}
---
*This comment was automatically generated by the PR Preview workflow.*
EOF
)
gh pr comment "$PR_NUM" --body "$COMMENT"
2.4.3 persist-db Label Added Comment¶
# Post when persist-db label is added
- name: Post persist-db lock comment
if: |
github.event.action == 'labeled' &&
github.event.label.name == 'persist-db'
env:
GH_TOKEN: ${{ github.token }}
PR_NUM: ${{ needs.check-label.outputs.pr_number }}
run: |
gh pr comment "$PR_NUM" --body "🔒 **Database is now LOCKED**
The \`persist-db\` label has been added. Your database (\`syrf_pr_${PR_NUM}\`) will now be protected:
- ✅ Database will NOT be dropped on rebuild
- ✅ Database will NOT be deleted when PR is closed/merged
- ✅ Changes to \`use-snapshot\` label will be blocked
**To unlock:** Remove the \`persist-db\` label. If the PR is closed, this will immediately delete the database."
2.4.4 persist-db Label Removed Comment¶
# Post when persist-db label is removed
- name: Handle persist-db removal
if: |
github.event.action == 'unlabeled' &&
github.event.label.name == 'persist-db'
env:
GH_TOKEN: ${{ github.token }}
PR_NUM: ${{ needs.check-label.outputs.pr_number }}
PR_STATE: ${{ github.event.pull_request.state }}
USE_SNAPSHOT: ${{ needs.check-label.outputs.use_snapshot }}
run: |
if [ "$PR_STATE" = "open" ]; then
# PR is still open - database will be handled on next sync
if [ "$USE_SNAPSHOT" = "true" ]; then
DATA_ACTION="refreshed from production snapshot"
else
DATA_ACTION="reset with fresh seed data"
fi
gh pr comment "$PR_NUM" --body "🔓 **Database unlocked**
The \`persist-db\` label has been removed. On the next sync, your database will be ${DATA_ACTION}.
If you don't push a new commit, you can manually trigger a sync from ArgoCD."
else
# PR is closed/merged - delete the database immediately
echo "PR is closed - triggering immediate database cleanup"
# Note: This triggers the cleanup workflow or direct deletion
gh pr comment "$PR_NUM" --body "🗑️ **Database deleted**
The \`persist-db\` label was removed from this closed PR, triggering immediate database cleanup.
Database \`syrf_pr_${PR_NUM}\` has been dropped."
# Actual deletion logic is in the cleanup workflow
fi
2.5 Staging Configuration¶
GitOps-based configuration for staging environment.
# cluster-gitops/syrf/environments/staging/staging.values.yaml
# Add database source configuration
database:
# Data source: "seed" or "snapshot"
source: snapshot
# Snapshot database to copy from (when source=snapshot)
snapshotDatabase: syrf_snapshot
Staging restore is handled by a similar PreSync job, configured via Helm values rather than generated by the workflow.
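The PreSync job can be wired to these values with a simple conditional in the Helm chart. A minimal sketch (the template path, job name, image, and restore script are assumptions; only `database.source` and `database.snapshotDatabase` come from the values above):

```yaml
# Hypothetical template: charts/<staging-chart>/templates/snapshot-restore-job.yaml
{{- if eq .Values.database.source "snapshot" }}
apiVersion: batch/v1
kind: Job
metadata:
  name: snapshot-restore-staging
  annotations:
    argocd.argoproj.io/hook: PreSync   # run before the main sync, like the PR jobs
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 900
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: restore
          image: mongo:7   # any image that ships mongosh
          # Copy collections from the configured snapshot database
          args: ["/scripts/restore.sh", "{{ .Values.database.snapshotDatabase }}"]
{{- end }}
```

When `source: seed`, the template renders nothing and staging falls back to seed data.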
2.6 On-Demand Snapshot Trigger¶
GitHub Actions workflow for manual snapshot trigger.
# .github/workflows/snapshot-on-demand.yml
name: Trigger Snapshot Refresh
on:
workflow_dispatch:
inputs:
confirm:
description: 'Type "refresh-snapshot" to confirm'
required: true
jobs:
trigger-snapshot:
name: Trigger snapshot refresh
if: inputs.confirm == 'refresh-snapshot'
runs-on: ubuntu-latest
steps:
- name: Authenticate to Google Cloud
uses: google-github-actions/auth@v2
with:
workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}
- name: Set up Cloud SDK
uses: google-github-actions/setup-gcloud@v2
- name: Get GKE credentials
run: |
gcloud container clusters get-credentials camaradesuk \
--zone europe-west2-a \
--project camarades-net
- name: Trigger snapshot CronJob
run: |
kubectl create job \
--from=cronjob/snapshot-producer \
snapshot-manual-$(date +%Y%m%d-%H%M%S) \
-n syrf-system
echo "### Snapshot Job Triggered" >> "$GITHUB_STEP_SUMMARY"
echo "A manual snapshot job has been created." >> "$GITHUB_STEP_SUMMARY"
echo "Check the syrf-system namespace for job status." >> "$GITHUB_STEP_SUMMARY"
3. Execution Flow¶
3.1 Weekly Snapshot (Happy Path)¶
Sunday 3:00 AM UTC
│
▼
┌─────────────────────────────────────────────────────────┐
│ 1. CronJob triggers snapshot-producer │
├─────────────────────────────────────────────────────────┤
│ 2. Connect to MongoDB Atlas │
│ 3. Clear existing syrf_snapshot collections │
│ 4. For each collection (11 total): │
│ a. Run $out aggregation: syrftest → syrf_snapshot │
│ b. (Data stays within Atlas - no network transfer) │
│ 5. Write snapshot_metadata with timestamp │
│ 6. Job completes (estimated: 5-10 minutes for 20GB) │
└─────────────────────────────────────────────────────────┘
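The per-collection copy in step 4a reduces to a single server-side aggregation. A sketch of how the producer might build the `mongosh --eval` payload (`build_copy_eval` is a hypothetical helper; the `$out: {db, coll}` form, available since MongoDB 4.4, writes across databases without pulling data to the client):

```shell
# Build the mongosh --eval snippet that copies one collection from
# production (syrftest) into the snapshot DB (syrf_snapshot) via $out.
build_copy_eval() {
  coll="$1"
  printf "db.getSiblingDB('syrftest').getCollection('%s').aggregate([{ \$out: { db: 'syrf_snapshot', coll: '%s' } }])" "$coll" "$coll"
}

# The producer job would loop over the 11 pm* collections, roughly:
#   for c in pmProject pmStudy pmInvestigator; do
#     mongosh "$ATLAS_URI" --eval "$(build_copy_eval "$c")"
#   done
build_copy_eval pmProject
```

Because `$out` replaces the target collection atomically, a failed copy leaves the previous snapshot collection intact.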
3.2 PR Preview with Snapshot¶
Developer adds 'use-snapshot' label to PR
│
▼
┌─────────────────────────────────────────────────────────┐
│ 1. Workflow detects 'use-snapshot' label │
│ 2. Check if 'persist-db' is present: │
│ ├─ YES: REVERT label change, post comment │
│ │ (database is locked - no changes allowed) │
│ └─ NO: Continue to step 3 │
│ 3. Get snapshot timestamp from syrf_snapshot metadata │
│ 4. Post preview environment comment with: │
│ - Data source (snapshot) │
│ - Snapshot timestamp │
│ 5. Generate snapshot-restore-job.yaml │
│ 6. Commit to cluster-gitops │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ ArgoCD Sync (only if persist-db NOT present): │
│ 1. PreSync: Create MongoDB user (sync-wave: 1) │
│ 2. PreSync: Drop existing database │
│ 3. PreSync: Run snapshot-restore job (sync-wave: 3) │
│ a. Check snapshot_metadata for availability │
│ b. Retry up to 12 times (6 minutes) if not ready │
│ c. Copy collections from syrf_snapshot → syrf_pr_N │
│ 4. Sync: Deploy services │
│ 5. Services connect to syrf_pr_N with real data │
└─────────────────────────────────────────────────────────┘
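The PreSync ordering above is expressed with ArgoCD hook annotations. A sketch of the two hook resources (resource names are illustrative; the wave numbers follow the flow above):

```yaml
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser            # step 1 - user must exist before the restore runs
metadata:
  name: pr-42-db-user
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "1"
---
apiVersion: batch/v1
kind: Job                          # step 3 - restore job runs in a later wave
metadata:
  name: snapshot-restore-42
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "3"
```

ArgoCD completes all wave-1 PreSync hooks before starting wave 3, so the restore job never races user creation.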
3.2.1 PR Preview with persist-db Lock¶
Developer has 'persist-db' label on PR
│
▼
┌─────────────────────────────────────────────────────────┐
│ 1. Workflow detects 'persist-db' label │
│ 2. Any attempt to add/remove 'use-snapshot' is REVERTED │
│ 3. Post explanation comment │
│ 4. NO database changes occur on sync │
│ 5. On PR close/merge: Database is NOT deleted │
│ - Warning comment posted with cleanup instructions │
└─────────────────────────────────────────────────────────┘
Cleanup (when persist-db removed from closed PR):
│
▼
┌─────────────────────────────────────────────────────────┐
│ 1. Workflow detects 'persist-db' label removal │
│ 2. Check PR state: │
│ ├─ OPEN: Post "database unlocked" comment │
│ │ Next sync applies current data source │
│ └─ CLOSED: Immediately drop database │
│ Post "database deleted" comment │
└─────────────────────────────────────────────────────────┘
3.3 Sequence Diagram¶
Developer GitHub Actions cluster-gitops ArgoCD MongoDB Atlas
│ │ │ │ │
│─Add 'use-snapshot' │ │ │ │
│ label │ │ │ │
│ │ │ │ │
│ ├─Detect label───────│ │ │
│ │ │ │ │
│ ├─Generate restore───│ │ │
│ │ job YAML │ │ │
│ │ │ │ │
│ ├─Commit + push─────▶│ │ │
│ │ │ │ │
│ │ ├─Git webhook─────▶│ │
│ │ │ │ │
│ │ │ ├─PreSync: Create──│
│ │ │ │ MongoDB user │
│ │ │ │ │
│ │ │ ├─PreSync: Restore─│
│ │ │ │ snapshot │
│ │ │ │ │
│ │ │ │ ┌────────┤
│ │ │ │ │Copy │
│ │ │ │ │data │
│ │ │ │ └────────┤
│ │ │ │ │
│ │ │ ├─Sync: Deploy─────│
│ │ │ │ services │
│ │ │ │ │
│◀─────────────────Preview ready──────────│ │ │
│ │ │ │ │
4. Edge Cases & Mitigations¶
| # | Edge Case / Failure Mode | Impact | Mitigation Strategy |
|---|---|---|---|
| 1 | Snapshot not available when PR deploys (first week) | PR deployment blocked | Wait and retry (12 attempts, 30s intervals = 6 min max wait). After retries exhausted, fail with clear error message. |
| 2 | Snapshot producer job fails mid-copy | Incomplete snapshot database | Job uses collection-by-collection copy with atomic drop. Metadata only written on success. Restore job checks metadata. |
| 3 | Production database unavailable during snapshot | Weekly snapshot skipped | CronJob retries (backoffLimit: 2). Alert on repeated failures. Previous snapshot remains valid. |
| 4 | 20GB copy takes longer than expected | Job timeout | activeDeadlineSeconds: 1800 (30 min) for producer, 900 (15 min) for restore. Both use $out aggregation (fast, data stays in Atlas). |
| 5 | `use-snapshot` label added while `persist-db` present | Label change blocked | Revert label change automatically, post comment explaining database is locked. User must remove `persist-db` first. |
| 6 | `use-snapshot` label removed while `persist-db` present | Label change blocked | Revert label change automatically, post comment explaining database is locked. User must remove `persist-db` first. |
| 7 | `persist-db` removed on closed/merged PR | Orphan database cleanup | Immediately drop database, post confirmation comment. This is the cleanup mechanism for orphan DBs. |
| 8 | PR closed/merged with `persist-db` label | Database persists as orphan | Do NOT delete database, post warning comment with cleanup instructions (remove `persist-db` label to trigger deletion). |
| 9 | First deploy with both `use-snapshot` AND `persist-db` | User wants snapshot data then lock | Create database from snapshot on first sync, then lock for subsequent syncs. Both labels are honored in sequence. |
| 10 | MongoDB connection issues during restore | Restore fails | backoffLimit: 3 with exponential backoff. Clear error in ArgoCD. |
| 11 | Snapshot database runs out of space | Atlas storage limit | Monitor Atlas storage. 20GB snapshot should fit in M10+ tier. |
| 12 | Collection schema changes between production and test | Potential data issues | Schema version tracked in metadata. Services handle schema migration. |
| 13 | Multiple PRs requesting snapshot simultaneously | Parallel restore from same source | Each PR gets its own copy. syrf_snapshot is read-only during restores. No conflicts. |
| 14 | Snapshot producer runs during restore | Stale data mid-restore | Restore checks metadata timestamp. If changed mid-restore, log warning but continue (acceptable for testing). |
| 15 | Manual changes to syrf_snapshot | Corruption risk | Document as read-only. Only snapshot-producer should write. Kyverno policy can enforce. |
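The label-interaction rows (5-9) collapse into a single decision. A sketch of the guard the sync could apply (`db_action` is a hypothetical helper; its inputs mirror the two label states and whether the PR database already exists):

```shell
# Decide what a sync may do to the PR database, honouring the
# persist-db lock described in rows 5-9 above.
db_action() {
  use_snapshot="$1"; persist_db="$2"; db_exists="$3"
  if [ "$persist_db" = "true" ] && [ "$db_exists" = "true" ]; then
    echo "skip"             # locked: leave the database untouched
  elif [ "$use_snapshot" = "true" ]; then
    echo "restore-snapshot" # drop, then copy from syrf_snapshot
  else
    echo "seed"             # drop, then load seed data
  fi
}

db_action true true false   # edge case 9: first deploy with both labels
# -> restore-snapshot (created from snapshot now, locked on later syncs)
```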
4.1 Detailed Mitigation: Snapshot Availability Wait¶
# Implemented in restore job
MAX_RETRIES=12
RETRY_INTERVAL=30 # 30 seconds
for i in $(seq 1 $MAX_RETRIES); do
METADATA=$(mongosh ... --eval "db.snapshot_metadata.findOne({_id: 'latest'})")
if [ -n "$METADATA" ] && [ "$METADATA" != "null" ]; then
echo "Snapshot available"
break
fi
if [ $i -eq $MAX_RETRIES ]; then
echo "ERROR: Snapshot not available after $(($MAX_RETRIES * $RETRY_INTERVAL / 60)) minutes"
echo "This is expected on first deployment before weekly snapshot runs."
echo "Options:"
echo " 1. Wait for Sunday 3 AM UTC weekly snapshot"
echo " 2. Trigger on-demand snapshot via GitHub Actions"
echo " 3. Remove 'use-snapshot' label to use seed data"
exit 1
fi
echo "Waiting for snapshot... (attempt $i/$MAX_RETRIES)"
sleep $RETRY_INTERVAL
done
5. Testing Strategy¶
5.1 Unit Tests¶
Not applicable - this feature is infrastructure-only (no application code changes).
5.2 Integration Tests¶
- Snapshot Producer Job: Run manually, verify all collections copied
- Restore Job: Deploy test PR with `use-snapshot`, verify data present
- Label Conflict: Add both labels, verify `persist-db` removed and comment posted
- Preview Comment: Add `preview` label, verify environment comment posted
5.3 Manual Verification Steps¶
# 1. Verify snapshot producer CronJob is deployed
kubectl get cronjob snapshot-producer -n syrf-system
# 2. Trigger manual snapshot
kubectl create job --from=cronjob/snapshot-producer snapshot-test -n syrf-system
# 3. Monitor job progress
kubectl logs -f job/snapshot-test -n syrf-system
# 4. Verify snapshot metadata
mongosh "mongodb+srv://..." --eval "
db.getSiblingDB('syrf_snapshot').snapshot_metadata.findOne({_id: 'latest'})
"
# 5. Verify collection counts match production
mongosh "mongodb+srv://..." --eval "
const snap = db.getSiblingDB('syrf_snapshot');
const prod = db.getSiblingDB('syrftest');
['pmProject', 'pmStudy', 'pmInvestigator'].forEach(c => {
print(c + ': snap=' + snap.getCollection(c).countDocuments() +
' prod=' + prod.getCollection(c).countDocuments());
});
"
# 6. Create test PR with use-snapshot label
gh pr create --title "Test snapshot restore" --body "Testing snapshot feature"
gh pr edit <PR_NUM> --add-label "preview,use-snapshot"
# 7. Verify PR comments
gh pr view <PR_NUM> --comments
# 8. Verify restore job ran
kubectl get jobs -n pr-<PR_NUM>
kubectl logs job/snapshot-restore-<PR_NUM>-<SHA> -n pr-<PR_NUM>
# 9. Verify data in preview database
mongosh "mongodb+srv://..." --eval "
db.getSiblingDB('syrf_pr_<PR_NUM>').pmProject.countDocuments()
"
6. Implementation Checklist¶
Implementation Status: Phases 1 and 3 core components implemented (2026-01-13). See Implementation Notes below for changes from the original spec.
Phase 1: Infrastructure Setup¶
- 1.1 Create MongoDB Atlas user `snapshot-producer` with required roles
  - Requires manual creation in Atlas Console (same pattern as prod/staging users)
  - Roles: `read` on `syrftest`, `readWrite` on `syrf_snapshot`
- 1.2 Add credentials to GCP Secret Manager (`snapshot-producer-mongodb`)
  - Keys: `username`, `password` (connection string constructed from cluster host in Helm values)
- 1.3 Create ExternalSecret for snapshot-producer credentials (cluster-gitops)
  - Added to `plugins/local/extra-secrets-staging/values.yaml`
- 1.4 Update Kyverno policy to allow PR users `read` on `syrf_snapshot`
  - Updated `plugins/helm/kyverno/resources/atlas-pr-user-policy.yaml`
  - Rule 5 now allows `syrf_snapshot` (read-only) in addition to `syrf_pr_*`
- 1.5 Create snapshot-producer Helm chart
  - Created `charts/snapshot-producer/` with CronJob template
  - Plugin config at `plugins/local/snapshot-producer/`
Phase 2: Snapshot Producer¶
- 2.1 Create CronJob manifest in cluster-gitops
  - Located at `charts/snapshot-producer/templates/cronjob.yaml`
  - Schedule: Sunday 2 AM UTC
- 2.2 Test manual job trigger
- 2.3 Verify all 11 collections are copied correctly
- 2.4 Verify snapshot runs successfully and `snapshot_metadata` is written (see Implementation Notes)
- 2.5 Monitor first automated weekly run (Sunday 2 AM)
Phase 3: PR Preview Integration¶
- 3.1 Add `use-snapshot` label detection to pr-preview.yml
  - Label triggers workflow on add/remove
  - Check step outputs `use_snapshot` flag
- 3.2 Add `persist-db` conflict resolution logic
  - `persist-db` takes precedence - blocks snapshot restore when present
- 3.3 Add PR comment for preview environment (with DB details)
- 3.4 Add PR comment for label conflict resolution
- 3.5 Generate snapshot-restore-job.yaml when label present
  - Only when `use-snapshot=true` AND `persist-db=false`
  - Uses PR user credentials (now has `read` on `syrf_snapshot`)
- 3.6 Update AtlasDatabaseUser CRD to add snapshot read role conditionally
  - Role added only when `use-snapshot` label is present
- 3.7 Test with real PR (add both labels, verify behaviour)
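Item 3.6 can be sketched as a Helm-templated AtlasDatabaseUser (field names follow the Atlas Operator's AtlasDatabaseUser schema; the `useSnapshot` value and PR number are illustrative):

```yaml
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  name: pr-42-db-user
spec:
  username: syrf_pr_42
  roles:
    - roleName: readWrite
      databaseName: syrf_pr_42      # own database only - never syrftest
    {{- if .Values.useSnapshot }}
    - roleName: read                # granted only while use-snapshot is present
      databaseName: syrf_snapshot
    {{- end }}
```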
Phase 4: Staging Configuration¶
- 4.1 Add database.source config to staging.values.yaml
- 4.2 Create staging-specific restore job template in Helm chart
- 4.3 Test staging with snapshot source enabled
- 4.4 Document staging configuration in cluster-gitops
Phase 5: On-Demand Trigger¶
- 5.1 Create snapshot-on-demand.yml workflow
- 5.2 Test manual trigger via workflow_dispatch
- 5.3 Document in how-to guide
Phase 6: Documentation & Cleanup¶
- 6.1 Update PR preview how-to guide with snapshot option
- 6.2 Update MongoDB testing strategy doc
- 6.3 Delete planning documents (clarify.md)
- 6.4 Update CLAUDE.md with snapshot feature
Implementation Notes¶
Changes from Original Specification:
| Aspect | Original | Implemented | Reason |
|---|---|---|---|
| User naming | `syrf_snapshot_operator` | `snapshot-producer` | Simpler naming |
| Secret name | `syrf-snapshot-operator-credentials` | `snapshot-producer-mongodb` | Consistent with existing pattern |
| User creation | Operator CRD | Manual Atlas Console | Follows prod/staging pattern for long-lived users |
| Kyverno policy | Not specified | Updated Rule 5 | Required to allow syrf_snapshot read access |
| Namespace | `syrf-system` | `syrf-staging` | Plugin pattern requires namespace |
| snapshot_metadata | Separate collection | Implemented | Now written to all databases (snapshot, PR, empty seed) |
Related Documentation:
- MongoDB Permissions Explained - Comprehensive reference for the permission model, including why wildcards don't work and cleanup strategies
7. Open Questions¶
All questions have been resolved through the clarification process:
| Question | Resolution |
|---|---|
| PII handling | Not required - skip anonymization |
| Storage location | MongoDB Atlas database (syrf_snapshot), not GCS |
| Retention policy | Single snapshot, replaced weekly |
| PR configuration | Label-based (use-snapshot for data source, persist-db for lock) |
| Collection scope | All pm-prefixed collections (11 total) |
| Label interaction | persist-db is a LOCK that takes precedence - blocks all changes to the use-snapshot label |
| Orphan databases | Manual cleanup - remove persist-db label from closed PR to trigger deletion |
| Snapshot visibility | Snapshot timestamp shown in PR comments so users know data freshness |
| Permission model | Defense in depth: Separate users with minimal permissions. No snapshot/restore user can write to syrftest. Producer uses read-only prod access. Restore uses PR-specific user that can only write to its own DB. |
| Restore method | $out aggregation for both producer and restore (fast, data stays in Atlas). PR user has read on snapshot, readWrite on own DB, ZERO access to syrftest. |
8. References¶
- MongoDB Permissions Explained - Permission model, wildcards, cleanup strategies
- MongoDB Testing Strategy - Database isolation architecture
- PR Preview Environments - Current preview setup
- MongoDB Reference - Collection naming, CSUUID format
- cluster-gitops Repository - GitOps configuration
Document End
This document must be reviewed and approved before implementation begins.