Snapshot Producer Reference¶
Overview¶
The snapshot-producer is a Kubernetes CronJob that copies production MongoDB data into a snapshot database once a week. PR preview environments use this snapshot to get realistic test data without accessing production directly.
Architecture¶
┌─────────────────────────────────────────────────────────────────────────┐
│ MongoDB Atlas │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Cluster0 │ │ Preview Cluster │ │
│ │ (Production) │ │ │ │
│ │ │ │ │ │
│ │ ┌───────────────┐ │ stream │ ┌───────────────┐ │ │
│ │ │ syrftest │──┼──────────────┼─▶│ syrf_snapshot │ │ │
│ │ │ (prod data) │ │ mongodump │ │ (weekly copy) │ │ │
│ │ └───────────────┘ │ mongorestore│ └───────────────┘ │ │
│ │ │ │ │ │ │
│ └─────────────────────┘ │ ▼ │ │
│ │ ┌───────────────┐ │ │
│ │ │ syrf_pr_123 │◀─┼─ restore │
│ │ │ syrf_pr_456 │ │ │
│ │ │ syrf_pr_789 │ │ │
│ │ └───────────────┘ │ │
│ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│ GKE Cluster │
│ ┌─────────────────────┐ │
│ │ staging namespace │ │
│ │ │ │
│ │ CronJob: │ Schedule: Sunday 3 AM UTC │
│ │ snapshot-producer │───────────────────────────────────────────────▶│
│ │ │ │
│ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Helm Chart¶
Location: cluster-gitops/charts/snapshot-producer/
Chart Structure¶
snapshot-producer/
├── Chart.yaml # Chart metadata (version 0.1.0)
├── values.yaml # Default configuration
└── templates/
├── _helpers.tpl # Template helpers (name, labels)
├── cronjob.yaml # Main CronJob definition
└── serviceaccount.yaml # Service account for pod
values.yaml Reference¶
# =============================================================================
# Schedule
# =============================================================================
schedule: "0 3 * * 0"  # Cron expression (Sunday 3 AM UTC)

# =============================================================================
# Image Configuration
# =============================================================================
image:
  repository: mongo         # Official MongoDB image
  tag: "7"                  # MongoDB 7.x (includes mongodump, mongorestore, mongosh)
  pullPolicy: IfNotPresent

# =============================================================================
# MongoDB Credentials
# =============================================================================
credentials:
  secretName: snapshot-producer-credentials  # Kubernetes secret name
  usernameKey: username                      # Key for username in secret
  passwordKey: password                      # Key for password in secret

# =============================================================================
# Source Cluster (Production - Cluster0)
# =============================================================================
source:
  cluster: Cluster0    # Human-readable name (for logging)
  host: ""             # MUST override: cluster0-pri.siwfo.mongodb.net
  database: syrftest   # Production database name

# =============================================================================
# Target Cluster (Preview)
# =============================================================================
target:
  cluster: Preview          # Human-readable name (for logging)
  host: ""                  # MUST override: preview-pri.siwfo.mongodb.net
  database: syrf_snapshot   # Snapshot database name

# =============================================================================
# Collections to Copy
# =============================================================================
collections:
  - pmProject
  - pmStudy
  - pmInvestigator
  - pmSystematicSearch
  - pmDataExportJob
  - pmStudyCorrection
  - pmInvestigatorUsage
  - pmRiskOfBiasAiJob
  - pmProjectDailyStat
  - pmPotential
  - pmInvestigatorEmail

# =============================================================================
# Streaming Options
# =============================================================================
streaming:
  numParallelCollections: 4   # mongorestore parallelism (per-collection)
  gzip: true                  # Compress data in transit

# =============================================================================
# Job Configuration
# =============================================================================
successfulJobsHistoryLimit: 3   # Keep last 3 successful job pods
failedJobsHistoryLimit: 3       # Keep last 3 failed job pods
activeDeadlineSeconds: 3600     # 1 hour timeout (job killed if exceeded)
ttlSecondsAfterFinished: 3600   # Delete completed pods after 1 hour

# =============================================================================
# Retry Configuration
# =============================================================================
retry:
  maxAttempts: 3     # Retry each collection up to 3 times
  delaySeconds: 30   # Wait 30 seconds between retries

# =============================================================================
# Resource Limits
# =============================================================================
resources:
  limits:
    cpu: 1000m      # 1 CPU core max
    memory: 1Gi     # 1 GB RAM max
  requests:
    cpu: 250m       # 0.25 CPU cores guaranteed
    memory: 512Mi   # 512 MB RAM guaranteed

# =============================================================================
# Pod Configuration
# =============================================================================
podAnnotations: {}
podLabels: {}
serviceAccount:
  create: true
  name: snapshot-producer
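The retry settings above drive a loop along these lines. This is an illustrative sketch, not the chart's actual script: the `retry` function name and structure are assumptions, and `copy_collection` is a hypothetical wrapper around the copy pipeline.

```shell
#!/usr/bin/env bash
# Illustrative retry loop: run a command up to $1 attempts, sleeping
# $2 seconds between failures. Maps to retry.maxAttempts and
# retry.delaySeconds in values.yaml.
retry() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage sketch (copy_collection is hypothetical):
#   retry 3 30 copy_collection "$collection"
```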
CronJob Behavior¶
Execution Flow¶
- Pre-flight checks:
    - Validate required values (`source.host`, `target.host`)
    - URL-encode the MongoDB password for special characters
    - Test connectivity to both clusters (fail fast on connection errors)
- Collection copy loop (sequential). For each collection:
    - Count source documents
    - Skip if empty (with warning)
    - Stream copy using a `mongodump | mongorestore` pipeline
    - Retry up to N times on failure
    - Verify target document count
    - Log progress with timestamps
- Metadata write:
    - Calculate total duration
    - Write a `snapshot_metadata` document to the target database
    - Warn (don't fail) if the metadata write fails
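The URL-encoding step in the pre-flight checks can be sketched as follows. The `urlencode` helper is an assumption (not copied from the job's script); it shells out to Python's `urllib.parse.quote` so characters like `@`, `:`, and `/` survive inside a connection URI.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the pre-flight URL-encoding step. The urlencode
# helper is an assumption, not the job's actual code; it delegates to
# Python's urllib.parse.quote with no characters exempted (safe="").
urlencode() {
  python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1"
}

ENCODED_PASS=$(urlencode 'p@ss:w/rd')
echo "$ENCODED_PASS"   # prints p%40ss%3Aw%2Frd
SOURCE_URI="mongodb+srv://user:${ENCODED_PASS}@cluster0-pri.siwfo.mongodb.net/syrftest"
```

Without this step, an unencoded `@` or `/` in the password would be parsed as part of the URI's host or path.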
Streaming Copy Mechanism¶
The copy uses a streaming pipeline that avoids disk I/O:
# Stream copy, annotated:
#   --archive         stream to stdout / read from stdin (no files on disk)
#   --gzip            compress in transit
#   --drop            replace the existing target collection
#   --nsFrom/--nsTo   rename the database during restore
#   --verbose         progress logging
mongodump \
  --uri="$SOURCE_URI" \
  --collection="$collection" \
  --archive \
  --gzip \
  --verbose \
| mongorestore \
  --uri="$TARGET_URI" \
  --archive \
  --gzip \
  --drop \
  --nsFrom="source_db.collection" \
  --nsTo="target_db.collection" \
  --verbose
Benefits:
- No temporary files (streams directly between clusters)
- Network compression reduces bandwidth
- `--drop` ensures clean replacement (idempotent)
- `--nsFrom`/`--nsTo` handles the database rename
Error Handling¶
| Scenario | Behavior |
|---|---|
| Connection failure (pre-flight) | Job fails immediately with clear error |
| Collection copy failure | Retry up to maxAttempts times, then mark failed |
| Any collection fails | Job exits with error after all collections attempted |
| Metadata write fails | Warning logged, job still succeeds (data was copied) |
| Timeout exceeded | Job killed by Kubernetes, marked as failed |
PIPESTATUS Diagnostics¶
The job captures exit codes from both sides of the pipeline:
DUMP_EXIT=${PIPESTATUS[0]}
RESTORE_EXIT=${PIPESTATUS[1]}
if [ "$DUMP_EXIT" -ne 0 ] || [ "$RESTORE_EXIT" -ne 0 ]; then
  echo "Pipeline failed: mongodump=$DUMP_EXIT, mongorestore=$RESTORE_EXIT"
fi
This provides clear diagnostics when failures occur.
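A self-contained illustration of the same pattern (bash-specific; `PIPESTATUS` is not available in plain `sh`). `false | true` exits 0 overall, which is exactly why the per-stage codes matter: a failing `mongodump` would otherwise be masked by a successful `mongorestore`.

```shell
#!/usr/bin/env bash
# Minimal PIPESTATUS demo: the pipeline's overall exit code is that of the
# last command, but PIPESTATUS records one exit code per stage.
false | true
DUMP_EXIT=${PIPESTATUS[0]}
RESTORE_EXIT=${PIPESTATUS[1]}
echo "mongodump=$DUMP_EXIT, mongorestore=$RESTORE_EXIT"   # prints mongodump=1, mongorestore=0
```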
Metadata Document¶
After successful completion, the job writes a metadata document:
Collection: syrf_snapshot.snapshot_metadata
Document ID: "latest" (always upserts the same document)
{
  _id: "latest",

  // Timing
  createdAt: ISODate("2026-01-26T03:45:00Z"),   // When metadata was written
  startedAt: ISODate("2026-01-26T03:00:00Z"),   // Job start time
  finishedAt: ISODate("2026-01-26T03:45:00Z"),  // Job end time
  durationSeconds: 2700,                        // Total runtime

  // Source info
  sourceCluster: "Cluster0",
  sourceDatabase: "syrftest",
  sourceHost: "cluster0-pri.siwfo.mongodb.net",

  // Target info
  targetCluster: "Preview",
  targetDatabase: "syrf_snapshot",
  targetHost: "preview-pri.siwfo.mongodb.net",

  // Collection details
  collections: ["pmProject", "pmStudy", ...],
  collectionsCount: 11,
  documentCounts: {
    pmProject: 1234,
    pmStudy: 56789,
    pmInvestigator: 2345,
    // ... one entry per collection
  },
  totalDocuments: 123456,

  // Method info
  method: "mongodump | mongorestore streaming",
  crossCluster: true,
  status: "complete"
}
Querying Metadata¶
// Check snapshot freshness
db.snapshot_metadata.findOne({ _id: "latest" })

// Get snapshot age in hours
db.snapshot_metadata.aggregate([
  { $match: { _id: "latest" } },
  { $project: {
      ageHours: {
        $divide: [
          { $subtract: [new Date(), "$createdAt"] },
          1000 * 60 * 60
        ]
      }
  }}
])
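When `mongosh` is not at hand, the same freshness check can be approximated in the shell by comparing the stored `createdAt` value against the current time. This sketch assumes GNU `date` (for `-d`), and the hard-coded timestamp stands in for the value read from `snapshot_metadata`.

```shell
#!/usr/bin/env bash
# Compute snapshot age in whole hours from an ISO-8601 createdAt value.
# The timestamp below is an example; in practice it would be read from
# the snapshot_metadata document. GNU date (-d) is assumed.
created_at="2026-01-26T03:45:00Z"
created_s=$(date -u -d "$created_at" +%s)
now_s=$(date -u +%s)
echo "snapshot age: $(( (now_s - created_s) / 3600 )) hours"
```

A snapshot older than roughly a week indicates the CronJob has missed a run.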
Security¶
Pod Security Context¶
# Pod level
securityContext:
  seccompProfile:
    type: RuntimeDefault   # Use container runtime's default seccomp profile

# Container level
securityContext:
  allowPrivilegeEscalation: false   # Cannot gain elevated privileges
  capabilities:
    drop:
      - ALL   # Drop all Linux capabilities
MongoDB Credentials¶
- Stored in a Kubernetes Secret (`snapshot-producer-credentials`)
- Secret synced from GCP Secret Manager via External Secrets Operator
- Same credentials work for both clusters (user has "All Resources" access in Atlas project)
Network Access¶
- Uses the `-pri` hostname suffix for VPC Peering (private IP routing)
- No data transfer charges within the same Atlas project
- MongoDB Atlas IP allowlist must include GKE cluster's egress IPs
Operations¶
Manual Trigger¶
# Create one-time job from CronJob
kubectl create job --from=cronjob/snapshot-producer snapshot-manual-$(date +%s) -n staging
# Watch logs
kubectl logs -f -l job-name=snapshot-manual-<timestamp> -n staging
# Check job status
kubectl get jobs -n staging | grep snapshot
Suspend/Resume CronJob¶
# Suspend (prevent scheduled runs)
kubectl patch cronjob snapshot-producer -n staging -p '{"spec":{"suspend":true}}'
# Resume
kubectl patch cronjob snapshot-producer -n staging -p '{"spec":{"suspend":false}}'
Check History¶
# List recent jobs
kubectl get jobs -n staging -l app.kubernetes.io/name=snapshot-producer
# View CronJob status
kubectl get cronjob snapshot-producer -n staging -o wide
View Logs from Last Run¶
# Get most recent job
LAST_JOB=$(kubectl get jobs -n staging -l app.kubernetes.io/name=snapshot-producer \
--sort-by=.status.startTime -o jsonpath='{.items[-1].metadata.name}')
# View logs
kubectl logs -n staging job/$LAST_JOB
Troubleshooting¶
Job Stuck in Pending¶
Symptoms: Pod shows Pending status, doesn't start
Causes:
- Insufficient cluster resources (CPU/memory)
- Secret not found
- Image pull failure
Debug:
# Check scheduling events and pod conditions
kubectl describe pod -l app.kubernetes.io/name=snapshot-producer -n staging
# Confirm the credentials secret exists
kubectl get secret snapshot-producer-credentials -n staging
Connection Timeouts¶
Symptoms: "Cannot connect to source/target cluster"
Causes:
- MongoDB Atlas IP allowlist doesn't include GKE egress IPs
- VPC Peering not configured (if using `-pri` hosts)
- Credentials incorrect or expired
Debug:
# Test from a debug pod
kubectl run mongo-debug --rm -it --image=mongo:7 -n staging -- \
mongosh "mongodb+srv://user:pass@cluster0.xxx.mongodb.net/test" --eval "db.runCommand({ping:1})"
Timeout Exceeded¶
Symptoms: Job shows DeadlineExceeded failure
Causes:
- Large collections take longer than `activeDeadlineSeconds`
- Network throttling
- Atlas cluster performance tier too low
Fix:
- Increase `activeDeadlineSeconds` in values
- Consider upgrading the Atlas cluster tier
- Check if specific collections are unusually large
Partial Failure¶
Symptoms: Some collections succeed, others fail
Causes:
- Transient network errors (should be handled by retry)
- Collection-specific issues (size, indexes)
- Insufficient memory for large documents
Debug:
- Check logs for specific collection that failed
- Look for PIPESTATUS output showing which command failed
- Verify collection exists in source database
Related Documentation¶
- Using PR Preview Environments - How previews use snapshots
- Data Snapshot Automation - Feature architecture
- MongoDB Reference - Database architecture and CSUUID format