
Snapshot Producer Reference

Overview

The snapshot-producer is a Kubernetes CronJob that copies production MongoDB data into a snapshot database on a weekly schedule. PR preview environments restore from this snapshot to get realistic test data without touching production directly.

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         MongoDB Atlas                                    │
│  ┌─────────────────────┐              ┌─────────────────────┐          │
│  │    Cluster0         │              │   Preview Cluster    │          │
│  │   (Production)      │              │                      │          │
│  │                     │              │                      │          │
│  │  ┌───────────────┐  │   stream     │  ┌───────────────┐  │          │
│  │  │   syrftest    │──┼──────────────┼─▶│ syrf_snapshot │  │          │
│  │  │  (prod data)  │  │  mongodump   │  │ (weekly copy) │  │          │
│  │  └───────────────┘  │  mongorestore│  └───────────────┘  │          │
│  │                     │              │          │          │          │
│  └─────────────────────┘              │          ▼          │          │
│                                       │  ┌───────────────┐  │          │
│                                       │  │ syrf_pr_123   │◀─┼─ restore │
│                                       │  │ syrf_pr_456   │  │          │
│                                       │  │ syrf_pr_789   │  │          │
│                                       │  └───────────────┘  │          │
│                                       └─────────────────────┘          │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│                         GKE Cluster                                      │
│  ┌─────────────────────┐                                                │
│  │   staging namespace │                                                │
│  │                     │                                                │
│  │  CronJob:           │   Schedule: Sunday 3 AM UTC                    │
│  │  snapshot-producer  │───────────────────────────────────────────────▶│
│  │                     │                                                │
│  └─────────────────────┘                                                │
└─────────────────────────────────────────────────────────────────────────┘

Helm Chart

Location: cluster-gitops/charts/snapshot-producer/

Chart Structure

snapshot-producer/
├── Chart.yaml              # Chart metadata (version 0.1.0)
├── values.yaml             # Default configuration
└── templates/
    ├── _helpers.tpl        # Template helpers (name, labels)
    ├── cronjob.yaml        # Main CronJob definition
    └── serviceaccount.yaml # Service account for pod

values.yaml Reference

# =============================================================================
# Schedule
# =============================================================================
schedule: "0 3 * * 0"  # Cron expression (Sunday 3 AM UTC)

# =============================================================================
# Image Configuration
# =============================================================================
image:
  repository: mongo      # Official MongoDB image
  tag: "7"               # MongoDB 7.x (includes mongodump, mongorestore, mongosh)
  pullPolicy: IfNotPresent

# =============================================================================
# MongoDB Credentials
# =============================================================================
credentials:
  secretName: snapshot-producer-credentials  # Kubernetes secret name
  usernameKey: username                      # Key for username in secret
  passwordKey: password                      # Key for password in secret

# =============================================================================
# Source Cluster (Production - Cluster0)
# =============================================================================
source:
  cluster: Cluster0                          # Human-readable name (for logging)
  host: ""                                   # MUST override: cluster0-pri.siwfo.mongodb.net
  database: syrftest                         # Production database name

# =============================================================================
# Target Cluster (Preview)
# =============================================================================
target:
  cluster: Preview                           # Human-readable name (for logging)
  host: ""                                   # MUST override: preview-pri.siwfo.mongodb.net
  database: syrf_snapshot                    # Snapshot database name

# =============================================================================
# Collections to Copy
# =============================================================================
collections:
  - pmProject
  - pmStudy
  - pmInvestigator
  - pmSystematicSearch
  - pmDataExportJob
  - pmStudyCorrection
  - pmInvestigatorUsage
  - pmRiskOfBiasAiJob
  - pmProjectDailyStat
  - pmPotential
  - pmInvestigatorEmail

# =============================================================================
# Streaming Options
# =============================================================================
streaming:
  numParallelCollections: 4   # Passed to mongorestore --numParallelCollections
  gzip: true                  # Compress data in transit

# =============================================================================
# Job Configuration
# =============================================================================
successfulJobsHistoryLimit: 3    # Keep last 3 successful job pods
failedJobsHistoryLimit: 3        # Keep last 3 failed job pods
activeDeadlineSeconds: 3600      # 1 hour timeout (job killed if exceeded)
ttlSecondsAfterFinished: 3600    # Delete finished Jobs (and their pods) 1 hour after completion

# =============================================================================
# Retry Configuration
# =============================================================================
retry:
  maxAttempts: 3      # Retry each collection up to 3 times
  delaySeconds: 30    # Wait 30 seconds between retries

# =============================================================================
# Resource Limits
# =============================================================================
resources:
  limits:
    cpu: 1000m        # 1 CPU core max
    memory: 1Gi       # 1 GB RAM max
  requests:
    cpu: 250m         # 0.25 CPU cores guaranteed
    memory: 512Mi     # 512 MB RAM guaranteed

# =============================================================================
# Pod Configuration
# =============================================================================
podAnnotations: {}
podLabels: {}

serviceAccount:
  create: true
  name: snapshot-producer

CronJob Behavior

Execution Flow

  1. Pre-flight checks:

     • Validate required values (source.host, target.host)
     • URL-encode the MongoDB password so special characters survive in the URI
     • Test connectivity to both clusters (fail fast on connection errors)

  2. Collection copy loop (sequential), for each collection:

     • Count source documents
     • Skip if empty (with a warning)
     • Stream copy using the mongodump | mongorestore pipeline
     • Retry up to retry.maxAttempts times on failure
     • Verify the target document count
     • Log progress with timestamps

  3. Metadata write:

     • Calculate total duration
     • Write the snapshot_metadata document to the target database
     • Warn (don't fail) if the metadata write fails
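The password-encoding step can be sketched as below. This is a hypothetical illustration, not the job's actual script: the `urlencode` helper and the placeholder credentials are invented here, and it assumes a `python3` binary is available in the container (the real job may encode differently).

```shell
# Hypothetical sketch: URL-encode the MongoDB password so characters
# like @, /, and : do not break URI parsing.
urlencode() {
  python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1"
}

# Placeholder values, not real credentials
MONGO_USER="snapshot_user"
MONGO_PASS='p@ss/word:1'
SOURCE_HOST="cluster0-pri.siwfo.mongodb.net"

ENCODED_PASS=$(urlencode "$MONGO_PASS")
SOURCE_URI="mongodb+srv://${MONGO_USER}:${ENCODED_PASS}@${SOURCE_HOST}/syrftest"
echo "$SOURCE_URI"
```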

Streaming Copy Mechanism

The copy uses a streaming pipeline that avoids disk I/O:

# --archive streams to stdout/stdin (no temp files), --gzip compresses
# in transit, --drop replaces the existing target collection, and
# --nsFrom/--nsTo rename the database during restore.
mongodump \
  --uri="$SOURCE_URI" \
  --collection="$collection" \
  --archive \
  --gzip \
  --verbose \
| mongorestore \
  --uri="$TARGET_URI" \
  --archive \
  --gzip \
  --drop \
  --nsFrom="source_db.$collection" \
  --nsTo="target_db.$collection" \
  --verbose

Benefits:

  • No temporary files (streams directly between clusters)
  • Network compression reduces bandwidth
  • --drop ensures clean replacement (idempotent)
  • --nsFrom/--nsTo handles database rename

Error Handling

Scenario                          Behavior
--------------------------------  ----------------------------------------------------------
Connection failure (pre-flight)   Job fails immediately with a clear error
Collection copy failure           Retried up to retry.maxAttempts times, then marked failed
Any collection fails              Job exits with an error once all collections are attempted
Metadata write fails              Warning logged; job still succeeds (data was copied)
Timeout exceeded                  Job killed by Kubernetes and marked failed

PIPESTATUS Diagnostics

The job captures exit codes from both sides of the pipeline:

DUMP_EXIT=${PIPESTATUS[0]}
RESTORE_EXIT=${PIPESTATUS[1]}

if [ "$DUMP_EXIT" -ne 0 ] || [ "$RESTORE_EXIT" -ne 0 ]; then
  echo "Pipeline failed: mongodump=$DUMP_EXIT, mongorestore=$RESTORE_EXIT"
fi

This provides clear diagnostics when failures occur.

Metadata Document

After successful completion, the job writes a metadata document:

Collection: syrf_snapshot.snapshot_metadata
Document ID: "latest" (the job always upserts the same document)

{
  _id: "latest",

  // Timing
  createdAt: ISODate("2026-01-26T03:45:00Z"),    // When metadata was written
  startedAt: ISODate("2026-01-26T03:00:00Z"),   // Job start time
  finishedAt: ISODate("2026-01-26T03:45:00Z"),  // Job end time
  durationSeconds: 2700,                         // Total runtime

  // Source info
  sourceCluster: "Cluster0",
  sourceDatabase: "syrftest",
  sourceHost: "cluster0-pri.siwfo.mongodb.net",

  // Target info
  targetCluster: "Preview",
  targetDatabase: "syrf_snapshot",
  targetHost: "preview-pri.siwfo.mongodb.net",

  // Collection details
  collections: ["pmProject", "pmStudy", ...],
  collectionsCount: 11,
  documentCounts: {
    pmProject: 1234,
    pmStudy: 56789,
    pmInvestigator: 2345,
    // ... one entry per collection
  },
  totalDocuments: 123456,

  // Method info
  method: "mongodump | mongorestore streaming",
  crossCluster: true,
  status: "complete"
}
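The warn-don't-fail behaviour around this write can be sketched as below. `write_metadata` is a hypothetical stand-in for the real upsert (something like mongosh's `db.snapshot_metadata.replaceOne({_id: "latest"}, doc, {upsert: true})`); it is stubbed to fail here so the error handling is visible.

```shell
write_metadata() {
  # Stub for the mongosh upsert against syrf_snapshot.snapshot_metadata;
  # returns failure to demonstrate the handling below.
  return 1
}

if write_metadata; then
  echo "Snapshot metadata written"
else
  # The collections were already copied, so a metadata failure only warns
  echo "WARN: metadata write failed; snapshot data is still intact"
fi
```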

Querying Metadata

// Check snapshot freshness
db.snapshot_metadata.findOne({ _id: "latest" })

// Get snapshot age in hours
db.snapshot_metadata.aggregate([
  { $match: { _id: "latest" } },
  { $project: {
    ageHours: {
      $divide: [
        { $subtract: [new Date(), "$createdAt"] },
        1000 * 60 * 60
      ]
    }
  }}
])

Security

Pod Security Context

# Pod level
securityContext:
  seccompProfile:
    type: RuntimeDefault    # Use container runtime's default seccomp profile

# Container level
securityContext:
  allowPrivilegeEscalation: false   # Cannot gain elevated privileges
  capabilities:
    drop:
      - ALL                          # Drop all Linux capabilities

MongoDB Credentials

  • Stored in Kubernetes Secret (snapshot-producer-credentials)
  • Secret synced from GCP Secret Manager via External Secrets Operator
  • Same credentials work for both clusters (user has "All Resources" access in Atlas project)

Network Access

  • Uses -pri hostname suffix for VPC Peering (private IP routing)
  • No data transfer charges within same Atlas project
  • MongoDB Atlas IP allowlist must include GKE cluster's egress IPs

Operations

Manual Trigger

# Create one-time job from CronJob
kubectl create job --from=cronjob/snapshot-producer snapshot-manual-$(date +%s) -n staging

# Watch logs
kubectl logs -f -l job-name=snapshot-manual-<timestamp> -n staging

# Check job status
kubectl get jobs -n staging | grep snapshot

Suspend/Resume CronJob

# Suspend (prevent scheduled runs)
kubectl patch cronjob snapshot-producer -n staging -p '{"spec":{"suspend":true}}'

# Resume
kubectl patch cronjob snapshot-producer -n staging -p '{"spec":{"suspend":false}}'

Check History

# List recent jobs
kubectl get jobs -n staging -l app.kubernetes.io/name=snapshot-producer

# View CronJob status
kubectl get cronjob snapshot-producer -n staging -o wide

View Logs from Last Run

# Get most recent job
LAST_JOB=$(kubectl get jobs -n staging -l app.kubernetes.io/name=snapshot-producer \
  --sort-by=.status.startTime -o jsonpath='{.items[-1].metadata.name}')

# View logs
kubectl logs -n staging job/$LAST_JOB

Troubleshooting

Job Stuck in Pending

Symptoms: Pod shows Pending status, doesn't start

Causes:

  • Insufficient cluster resources (CPU/memory)
  • Secret not found
  • Image pull failure

Debug:

kubectl describe pod -n staging -l job-name=<job-name>

Connection Timeouts

Symptoms: "Cannot connect to source/target cluster"

Causes:

  • MongoDB Atlas IP allowlist doesn't include GKE egress IPs
  • VPC Peering not configured (if using -pri hosts)
  • Credentials incorrect or expired

Debug:

# Test from a debug pod
kubectl run mongo-debug --rm -it --image=mongo:7 -n staging -- \
  mongosh "mongodb+srv://user:pass@cluster0.xxx.mongodb.net/test" --eval "db.runCommand({ping:1})"

Timeout Exceeded

Symptoms: Job shows DeadlineExceeded failure

Causes:

  • Large collections take longer than activeDeadlineSeconds
  • Network throttling
  • Atlas cluster performance tier too low

Fix:

  • Increase activeDeadlineSeconds in values
  • Consider upgrading Atlas cluster tier
  • Check if specific collections are unusually large
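As a sketch, the timeout can be raised through the chart values (7200 is an illustrative number, not a tested recommendation):

```yaml
# In values.yaml or an environment-specific override
activeDeadlineSeconds: 7200    # 2 hour timeout instead of the default 1 hour
```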

Partial Failure

Symptoms: Some collections succeed, others fail

Causes:

  • Transient network errors (should be handled by retry)
  • Collection-specific issues (size, indexes)
  • Insufficient memory for large documents

Debug:

  • Check logs for specific collection that failed
  • Look for PIPESTATUS output showing which command failed
  • Verify collection exists in source database