
Snapshot Producer Reference

Overview

The snapshot-producer is a Kubernetes CronJob that copies production MongoDB data into a snapshot database on a weekly schedule. PR preview environments restore from this snapshot to get realistic test data without touching production directly.

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         MongoDB Atlas                                    │
│  ┌─────────────────────┐              ┌─────────────────────┐          │
│  │    Cluster0         │              │   Preview Cluster    │          │
│  │   (Production)      │              │                      │          │
│  │                     │              │                      │          │
│  │  ┌───────────────┐  │   stream     │  ┌───────────────┐  │          │
│  │  │   syrftest    │──┼──────────────┼─▶│ syrf_snapshot │  │          │
│  │  │  (prod data)  │  │  mongodump   │  │ (weekly copy) │  │          │
│  │  └───────────────┘  │  mongorestore│  └───────────────┘  │          │
│  │                     │              │          │          │          │
│  └─────────────────────┘              │          ▼          │          │
│                                       │  ┌───────────────┐  │          │
│                                       │  │ syrf_pr_123   │◀─┼─ restore │
│                                       │  │ syrf_pr_456   │  │          │
│                                       │  │ syrf_pr_789   │  │          │
│                                       │  └───────────────┘  │          │
│                                       └─────────────────────┘          │
└─────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────┐
│                         GKE Cluster                                      │
│  ┌─────────────────────┐                                                │
│  │   staging namespace │                                                │
│  │                     │                                                │
│  │  CronJob:           │   Schedule: Sunday 3 AM UTC                    │
│  │  snapshot-producer  │───────────────────────────────────────────────▶│
│  │                     │                                                │
│  └─────────────────────┘                                                │
└─────────────────────────────────────────────────────────────────────────┘

Helm Chart

Location: cluster-gitops/charts/snapshot-producer/

Chart Structure

snapshot-producer/
├── Chart.yaml              # Chart metadata (version 0.1.0)
├── values.yaml             # Default configuration
└── templates/
    ├── _helpers.tpl        # Template helpers (name, labels)
    ├── cronjob.yaml        # Main CronJob definition
    └── serviceaccount.yaml # Service account for pod

values.yaml Reference

# =============================================================================
# Schedule
# =============================================================================
schedule: "0 3 * * 0"  # Cron expression (Sunday 3 AM UTC)

# =============================================================================
# Image Configuration
# =============================================================================
image:
  repository: mongo      # Official MongoDB image
  tag: "7"               # MongoDB 7.x (includes mongodump, mongorestore, mongosh)
  pullPolicy: IfNotPresent

# =============================================================================
# MongoDB Credentials
# =============================================================================
credentials:
  secretName: snapshot-producer-credentials  # Kubernetes secret name
  usernameKey: username                      # Key for username in secret
  passwordKey: password                      # Key for password in secret

# =============================================================================
# Source Cluster (Production - Cluster0)
# =============================================================================
source:
  cluster: Cluster0                          # Human-readable name (for logging)
  host: ""                                   # MUST override: cluster0-pri.siwfo.mongodb.net
  database: syrftest                         # Production database name

# =============================================================================
# Target Cluster (Preview)
# =============================================================================
target:
  cluster: Preview                           # Human-readable name (for logging)
  host: ""                                   # MUST override: preview-pri.siwfo.mongodb.net
  database: syrf_snapshot                    # Snapshot database name

# =============================================================================
# Collections to Copy
# =============================================================================
collections:
  - pmProject
  - pmStudy
  - pmInvestigator
  - pmSystematicSearch
  - pmDataExportJob
  - pmStudyCorrection
  - pmInvestigatorUsage
  - pmRiskOfBiasAiJob
  - pmProjectDailyStat
  - pmPotential
  - pmInvestigatorEmail

# =============================================================================
# Streaming Options
# =============================================================================
streaming:
  numParallelCollections: 4   # Passed to mongorestore --numParallelCollections
  gzip: true                  # Compress data in transit

# =============================================================================
# Job Configuration
# =============================================================================
successfulJobsHistoryLimit: 3    # Keep last 3 successful job pods
failedJobsHistoryLimit: 3        # Keep last 3 failed job pods
activeDeadlineSeconds: 3600      # 1 hour timeout (job killed if exceeded)
ttlSecondsAfterFinished: 3600    # Delete finished Jobs (and their pods) 1 hour after completion

# =============================================================================
# Retry Configuration
# =============================================================================
retry:
  maxAttempts: 3      # Retry each collection up to 3 times
  delaySeconds: 30    # Wait 30 seconds between retries

# =============================================================================
# Resource Limits
# =============================================================================
resources:
  limits:
    cpu: 1000m        # 1 CPU core max
    memory: 1Gi       # 1 GB RAM max
  requests:
    cpu: 250m         # 0.25 CPU cores guaranteed
    memory: 512Mi     # 512 MB RAM guaranteed

# =============================================================================
# Pod Configuration
# =============================================================================
podAnnotations: {}
podLabels: {}

serviceAccount:
  create: true
  name: snapshot-producer

CronJob Behavior

Execution Flow

  1. Pre-flight checks:

     • Validate required values (source.host, target.host)
     • URL-encode the MongoDB password so special characters survive in the URI
     • Test connectivity to both clusters (fail fast on connection errors)

  2. Collection copy loop (sequential), for each collection:

     • Count source documents
     • Skip if empty (with a warning)
     • Stream copy using the mongodump | mongorestore pipeline
     • Retry up to retry.maxAttempts times on failure
     • Verify the target document count
     • Log progress with timestamps

  3. Metadata write:

     • Calculate total duration
     • Write the snapshot_metadata document to the target database
     • Warn (don't fail) if the metadata write fails
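The password-encoding step can be sketched as below. This is a hypothetical illustration, not the job's actual script: the `urlencode` helper and the placeholder credentials are invented here, and it assumes a `python3` binary is available in the container (the real job may encode differently).

```shell
# Hypothetical sketch: URL-encode the MongoDB password so characters
# like @, /, and : do not break URI parsing.
urlencode() {
  python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1"
}

# Placeholder values, not real credentials
MONGO_USER="snapshot_user"
MONGO_PASS='p@ss/word:1'
SOURCE_HOST="cluster0-pri.siwfo.mongodb.net"

ENCODED_PASS=$(urlencode "$MONGO_PASS")
SOURCE_URI="mongodb+srv://${MONGO_USER}:${ENCODED_PASS}@${SOURCE_HOST}/syrftest"
echo "$SOURCE_URI"
```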

Streaming Copy Mechanism

The copy uses a streaming pipeline that avoids disk I/O:

# --archive streams to stdout/stdin (no temp files), --gzip compresses
# in transit, --drop replaces the existing target collection, and
# --nsFrom/--nsTo rename the database during restore.
mongodump \
  --uri="$SOURCE_URI" \
  --collection="$collection" \
  --archive \
  --gzip \
  --verbose \
| mongorestore \
  --uri="$TARGET_URI" \
  --archive \
  --gzip \
  --drop \
  --nsFrom="source_db.$collection" \
  --nsTo="target_db.$collection" \
  --verbose

Benefits:

  • No temporary files (streams directly between clusters)
  • Network compression reduces bandwidth
  • --drop ensures clean replacement (idempotent)
  • --nsFrom/--nsTo handles database rename

Error Handling

Scenario                          Behavior
--------------------------------  ----------------------------------------------------------
Connection failure (pre-flight)   Job fails immediately with a clear error
Collection copy failure           Retried up to retry.maxAttempts times, then marked failed
Any collection fails              Job exits with an error once all collections are attempted
Metadata write fails              Warning logged; job still succeeds (data was copied)
Timeout exceeded                  Job killed by Kubernetes and marked failed

PIPESTATUS Diagnostics

The job captures exit codes from both sides of the pipeline:

DUMP_EXIT=${PIPESTATUS[0]}
RESTORE_EXIT=${PIPESTATUS[1]}

if [ "$DUMP_EXIT" -ne 0 ] || [ "$RESTORE_EXIT" -ne 0 ]; then
  echo "Pipeline failed: mongodump=$DUMP_EXIT, mongorestore=$RESTORE_EXIT"
fi

This provides clear diagnostics when failures occur.

Metadata Document

After successful completion, the job writes a metadata document:

Collection: syrf_snapshot.snapshot_metadata
Document ID: "latest" (the job always upserts the same document)

{
  _id: "latest",

  // Timing
  createdAt: ISODate("2026-01-26T03:45:00Z"),    // When metadata was written
  startedAt: ISODate("2026-01-26T03:00:00Z"),   // Job start time
  finishedAt: ISODate("2026-01-26T03:45:00Z"),  // Job end time
  durationSeconds: 2700,                         // Total runtime

  // Source info
  sourceCluster: "Cluster0",
  sourceDatabase: "syrftest",
  sourceHost: "cluster0-pri.siwfo.mongodb.net",

  // Target info
  targetCluster: "Preview",
  targetDatabase: "syrf_snapshot",
  targetHost: "preview-pri.siwfo.mongodb.net",

  // Collection details
  collections: ["pmProject", "pmStudy", ...],
  collectionsCount: 11,
  documentCounts: {
    pmProject: 1234,
    pmStudy: 56789,
    pmInvestigator: 2345,
    // ... one entry per collection
  },
  totalDocuments: 123456,

  // Method info
  method: "mongodump | mongorestore streaming",
  crossCluster: true,
  status: "complete"
}
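The warn-don't-fail behaviour around this write can be sketched as below. `write_metadata` is a hypothetical stand-in for the real upsert (something like mongosh's `db.snapshot_metadata.replaceOne({_id: "latest"}, doc, {upsert: true})`); it is stubbed to fail here so the error handling is visible.

```shell
write_metadata() {
  # Stub for the mongosh upsert against syrf_snapshot.snapshot_metadata;
  # returns failure to demonstrate the handling below.
  return 1
}

if write_metadata; then
  echo "Snapshot metadata written"
else
  # The collections were already copied, so a metadata failure only warns
  echo "WARN: metadata write failed; snapshot data is still intact"
fi
```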

Querying Metadata

// Check snapshot freshness
db.snapshot_metadata.findOne({ _id: "latest" })

// Get snapshot age in hours
db.snapshot_metadata.aggregate([
  { $match: { _id: "latest" } },
  { $project: {
    ageHours: {
      $divide: [
        { $subtract: [new Date(), "$createdAt"] },
        1000 * 60 * 60
      ]
    }
  }}
])

Security

Pod Security Context

# Pod level
securityContext:
  seccompProfile:
    type: RuntimeDefault    # Use container runtime's default seccomp profile

# Container level
securityContext:
  allowPrivilegeEscalation: false   # Cannot gain elevated privileges
  capabilities:
    drop:
      - ALL                          # Drop all Linux capabilities

MongoDB Credentials

  • Stored in Kubernetes Secret (snapshot-producer-credentials)
  • Secret synced from GCP Secret Manager via External Secrets Operator
  • Same credentials work for both clusters (user has "All Resources" access in Atlas project)

Network Access

  • Uses -pri hostname suffix for VPC Peering (private IP routing)
  • No data transfer charges within same Atlas project
  • MongoDB Atlas IP allowlist must include GKE cluster's egress IPs

Operations

Manual Trigger

# Create one-time job from CronJob
kubectl create job --from=cronjob/snapshot-producer snapshot-manual-$(date +%s) -n staging

# Watch logs
kubectl logs -f -l job-name=snapshot-manual-<timestamp> -n staging

# Check job status
kubectl get jobs -n staging | grep snapshot

Suspend/Resume CronJob

# Suspend (prevent scheduled runs)
kubectl patch cronjob snapshot-producer -n staging -p '{"spec":{"suspend":true}}'

# Resume
kubectl patch cronjob snapshot-producer -n staging -p '{"spec":{"suspend":false}}'

Check History

# List recent jobs
kubectl get jobs -n staging -l app.kubernetes.io/name=snapshot-producer

# View CronJob status
kubectl get cronjob snapshot-producer -n staging -o wide

View Logs from Last Run

# Get most recent job
LAST_JOB=$(kubectl get jobs -n staging -l app.kubernetes.io/name=snapshot-producer \
  --sort-by=.status.startTime -o jsonpath='{.items[-1].metadata.name}')

# View logs
kubectl logs -n staging job/$LAST_JOB

Troubleshooting

Job Stuck in Pending

Symptoms: Pod shows Pending status, doesn't start

Causes:

  • Insufficient cluster resources (CPU/memory)
  • Secret not found
  • Image pull failure

Debug:

kubectl describe pod -n staging -l job-name=<job-name>

Connection Timeouts

Symptoms: "Cannot connect to source/target cluster"

Causes:

  • MongoDB Atlas IP allowlist doesn't include GKE egress IPs
  • VPC Peering not configured (if using -pri hosts)
  • Credentials incorrect or expired

Debug:

# Test from a debug pod
kubectl run mongo-debug --rm -it --image=mongo:7 -n staging -- \
  mongosh "mongodb+srv://user:pass@cluster0.xxx.mongodb.net/test" --eval "db.runCommand({ping:1})"

Timeout Exceeded

Symptoms: Job shows DeadlineExceeded failure

Causes:

  • Large collections take longer than activeDeadlineSeconds
  • Network throttling
  • Atlas cluster performance tier too low

Fix:

  • Increase activeDeadlineSeconds in values
  • Consider upgrading Atlas cluster tier
  • Check if specific collections are unusually large
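As a sketch, the timeout can be raised through the chart values (7200 is an illustrative number, not a tested recommendation):

```yaml
# In values.yaml or an environment-specific override
activeDeadlineSeconds: 7200    # 2 hour timeout instead of the default 1 hour
```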

Partial Failure

Symptoms: Some collections succeed, others fail

Causes:

  • Transient network errors (should be handled by retry)
  • Collection-specific issues (size, indexes)
  • Insufficient memory for large documents

Debug:

  • Check logs for specific collection that failed
  • Look for PIPESTATUS output showing which command failed
  • Verify collection exists in source database