
Feature: Automated Data Snapshot Copy for Preview/Staging Environments

Documentation

| Document | Description |
| --- | --- |
| README.md (this file) | Feature brief and high-level strategy |
| App-of-Apps Architecture | ArgoCD App-of-Apps pattern with full auto-sync |
| Implementation Spec | Detailed implementation specification |
| Edge Case Analysis | Analysis of edge cases and scenarios |
| MongoDB Permissions | Permission model reference |

Overview

Automate the copying of production data snapshots to preview and staging databases, with the data source policy configurable via PR description. This enables developers to test features against realistic production data while maintaining isolation and data safety.

Problem Statement

Current State:

  • Preview environments use DatabaseSeeder with hardcoded sample data (5 projects, ~100 studies)
  • Staging environment has no standardized data population strategy
  • No mechanism to test against production-like data volumes or structures
  • Developers cannot reproduce production bugs that depend on specific data patterns
  • Real-world edge cases (large projects, complex annotations) are not represented in test data

Impact:

  • Features may work in preview but fail in production with real data
  • Performance issues only discovered after production deployment
  • Bug reproduction requires manual data manipulation
  • No confidence that migrations will work on production data patterns

Goals

  1. Production Snapshot Capture: Automated periodic snapshots of production data
  2. Policy-Based Data Source: PR description config to choose data source (seed, snapshot, staging)
  3. Data Anonymization: PII masking when restoring production data to non-production environments
  4. Staging Refresh: Ability to refresh staging with anonymized production snapshot
  5. Selective Restore: Choose which collections/data subsets to restore

Non-Goals

  • Real-time data replication (too complex, unnecessary for testing)
  • Full production data without anonymization in non-production (security risk)
  • Automatic production promotion based on preview testing (separate concern)

Data Source Policies

The feature introduces three data source policies configurable via PR description:

Policy 1: Seed Data (Default - Current Behavior)

```yaml
#preview-config
database:
  source: seed
```
  • Uses existing DatabaseSeeder with 5 sample projects
  • Fast, deterministic, no external dependencies
  • Good for: UI testing, new feature development, quick iterations

Policy 2: Production Snapshot

```yaml
#preview-config
database:
  source: snapshot
  snapshotId: latest        # or specific: "2026-01-09-daily"
  anonymize: true           # Required for non-production (enforced)
  collections:              # Optional: subset of collections
    - pmProject
    - pmStudy
```
  • Restores anonymized production data from snapshot
  • Realistic data volumes and patterns
  • Good for: Performance testing, bug reproduction, migration testing

Policy 3: Staging Clone

```yaml
#preview-config
database:
  source: staging
  anonymize: false          # Staging already anonymized
```
  • Clones current staging database to preview
  • Useful when staging has specific test scenarios set up
  • Good for: Testing against staging-specific configurations

Solution Architecture

High-Level Data Flow

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         Production Data Snapshot Flow                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────────┐    ┌─────────────────────────────┐  │
│  │  Production │    │  Scheduled Job  │    │      GCS Bucket             │  │
│  │  Database   │───▶│  (Daily 3AM)    │───▶│  gs://syrf-db-snapshots/    │  │
│  │  syrftest   │    │  mongodump      │    │  └── prod/                  │  │
│  └─────────────┘    └─────────────────┘    │      ├── 2026-01-09/        │  │
│                                            │      │   ├── manifest.json  │  │
│                                            │      │   └── dump.gz        │  │
│                                            │      └── latest -> 2026-01-09│ │
│                                            └─────────────────────────────┘  │
│                                                         │                   │
│  ┌─────────────────────────────────────────────────────┼───────────────────┐│
│  │                    Preview Environment Restore       │                   ││
│  │                                                      ▼                   ││
│  │  ┌─────────────┐    ┌─────────────────┐    ┌─────────────────┐          ││
│  │  │ PR Preview  │◀───│  Restore Job    │◀───│   Anonymizer    │          ││
│  │  │ syrf_pr_123 │    │  mongorestore   │    │   (streaming)   │          ││
│  │  └─────────────┘    └─────────────────┘    └─────────────────┘          ││
│  └──────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
```

Component Overview

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                          Snapshot Automation Components                       │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. SNAPSHOT CAPTURE (Daily Scheduled)                                       │
│     ┌─────────────────────────────────────────────────────────────────────┐ │
│     │  Kubernetes CronJob: snapshot-producer                              │ │
│     │  - Runs: Daily at 3:00 AM UTC                                       │ │
│     │  - Tool: mongodump with gzip compression                            │ │
│     │  - Output: GCS bucket with versioned directories                    │ │
│     │  - Retention: 7 daily, 4 weekly snapshots                           │ │
│     └─────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  2. SNAPSHOT REGISTRY (Metadata Service)                                     │
│     ┌─────────────────────────────────────────────────────────────────────┐ │
│     │  GCS manifest.json per snapshot:                                    │ │
│     │  - Timestamp, size, collection list                                 │ │
│     │  - Schema versions, checksums                                       │ │
│     │  - Source database, MongoDB version                                 │ │
│     └─────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  3. ANONYMIZATION ENGINE (Transform Layer)                                   │
│     ┌─────────────────────────────────────────────────────────────────────┐ │
│     │  anonymization-rules.yaml:                                          │ │
│     │  - Field mappings: email → hash, name → "User N"                    │ │
│     │  - Exclusions: audit logs, sessions, tokens                         │ │
│     │  - Streaming transform during restore (no intermediate files)       │ │
│     └─────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  4. RESTORE ORCHESTRATOR (PR Preview Integration)                            │
│     ┌─────────────────────────────────────────────────────────────────────┐ │
│     │  PreSync Job: snapshot-restore                                      │ │
│     │  - Reads policy from PR description                                 │ │
│     │  - Downloads snapshot from GCS                                      │ │
│     │  - Applies anonymization rules                                      │ │
│     │  - Restores to target database (syrf_pr_N or syrf_staging)          │ │
│     └─────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```

Database Collections and Anonymization

Collection Categories

| Category | Collections | Snapshot? | Anonymize? |
| --- | --- | --- | --- |
| Core Data | pmProject, pmStudy, pmSystematicSearch | Yes | Partial |
| User Data | pmInvestigator | Yes | Full |
| Audit/Logs | asFileListing, logs, sessions | No | N/A |
| AI Jobs | pmRiskOfBiasAiJob | Yes | Partial |
| Exports | pmDataExportJob | Optional | Partial |

Anonymization Rules

```yaml
# anonymization-rules.yaml
rules:
  pmInvestigator:
    fields:
      Email:
        transform: hash_email  # user@example.com → user-abc123@anonymized.syrf.local
      FirstName:
        transform: sequential  # "John" → "User"
      LastName:
        transform: sequential_suffix  # "Smith" → "1234"
      Auth0Id:
        transform: hash  # Preserve uniqueness, hide real ID

  pmProject:
    fields:
      ContactEmail:
        transform: hash_email
      # Project names, descriptions: Keep as-is (research metadata)

  pmStudy:
    # Most study data is research content, keep as-is
    fields:
      # No PII in studies typically

exclusions:
  # Collections to skip entirely
  - asFileListing      # Contains file paths, not needed for testing
  - pmAuditLog         # Sensitive audit trail
  - system.sessions    # Auth sessions

size_limits:
  # For preview environments, limit data volume
  pmStudy:
    max_documents: 10000  # ~10k studies sufficient for testing
  pmProject:
    max_documents: 100    # ~100 projects
```
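During restore, the `size_limits` above can be enforced by truncating each collection's document stream before it reaches `mongorestore`. A minimal sketch of the selection logic (function and constant names are hypothetical, and the limits are hard-coded here rather than read from the YAML):

```python
from itertools import islice

# Hypothetical in-memory mirror of the size_limits section above
SIZE_LIMITS = {"pmStudy": 10000, "pmProject": 100}

def capped(collection: str, documents):
    """Yield at most max_documents for a limited collection;
    collections without a configured limit pass through unchanged."""
    limit = SIZE_LIMITS.get(collection)
    return islice(documents, limit) if limit is not None else documents
```

Note that a naive per-collection cap can break referential integrity (e.g. studies whose project was not sampled); a production implementation would want to sample projects first and then include their studies.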

Implementation Plan

Phase 1: Snapshot Infrastructure (Foundation)

Goal: Automated production snapshot capture and storage

1.1 GCS Bucket Setup

```bash
# Create snapshot bucket with lifecycle management
gsutil mb -l europe-west2 gs://syrf-db-snapshots

# Configure lifecycle (delete snapshots older than 30 days)
gsutil lifecycle set lifecycle-config.json gs://syrf-db-snapshots
```
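The `lifecycle-config.json` file is referenced but not shown; a minimal version implementing the 30-day deletion, in the standard GCS lifecycle configuration format, would look like:

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30}
    }
  ]
}
```

Age-based deletion is only a backstop; the finer-grained 7-daily/4-weekly retention policy still needs its own pruning step in the snapshot job.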

1.2 Snapshot Producer CronJob

```yaml
# cluster-gitops/syrf/services/snapshot-producer/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: snapshot-producer
  namespace: syrf-system
spec:
  schedule: "0 3 * * *"  # Daily at 3 AM UTC
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: mongodump
              image: mongo:7.0
              command:
                - /bin/bash
                - -c
                - |
                  DATE=$(date +%Y-%m-%d)
                  mongodump \
                    --uri="$MONGO_URI" \
                    --db=syrftest \
                    --gzip \
                    --archive=/tmp/dump.gz

                  # Get collection names for manifest
                  COLLECTIONS=$(mongosh "$MONGO_URI/syrftest" --quiet --eval 'JSON.stringify(db.getCollectionNames())')

                  # Create manifest (using wc -c for portability)
                  cat > /tmp/manifest.json <<EOF
                  {
                    "timestamp": "$(date -Iseconds)",
                    "source_db": "syrftest",
                    "mongo_version": "7.0",
                    "size_bytes": $(wc -c < /tmp/dump.gz),
                    "collections": $COLLECTIONS
                  }
                  EOF

                  # Upload to GCS
                  gsutil cp /tmp/dump.gz gs://syrf-db-snapshots/prod/$DATE/dump.gz
                  gsutil cp /tmp/manifest.json gs://syrf-db-snapshots/prod/$DATE/manifest.json

                  # Update latest marker (GCS has no symlinks, use marker file)
                  echo "prod/$DATE" > /tmp/latest.txt
                  gsutil cp /tmp/latest.txt gs://syrf-db-snapshots/prod/latest.txt
              env:
                - name: MONGO_URI
                  valueFrom:
                    secretKeyRef:
                      name: mongo-db-prod
                      key: connection-string
          restartPolicy: OnFailure
```
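The CronJob above only uploads; the "7 daily, 4 weekly" retention target from the Deliverables needs a pruning step that decides which dated prefixes to keep. A Python sketch of just the selection logic (function name hypothetical; listing and deleting the GCS prefixes is left out):

```python
from datetime import date, timedelta

def snapshots_to_keep(snapshot_dates, daily=7, weekly=4):
    """Select which dated snapshots to retain: the `daily` most recent
    snapshots, plus the newest snapshot from each of the last `weekly`
    ISO weeks. Everything else is a candidate for deletion."""
    recent = sorted(snapshot_dates, reverse=True)
    keep = set(recent[:daily])
    newest_per_week = {}
    for d in recent:
        week = d.isocalendar()[:2]           # (ISO year, ISO week)
        newest_per_week.setdefault(week, d)  # first hit is newest (desc order)
    for week in sorted(newest_per_week, reverse=True)[:weekly]:
        keep.add(newest_per_week[week])
    return keep
```

The weekly tier overlaps the daily tier by design: the newest daily snapshot is also its week's representative, so the total kept is typically daily + (weekly − 1) extra snapshots.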

1.3 Deliverables

  • GCS bucket syrf-db-snapshots created with appropriate IAM
  • Workload Identity configured for GCS access from GKE
  • CronJob manifest in cluster-gitops
  • Snapshot retention policy (7 daily, 4 weekly)
  • Monitoring/alerting for snapshot failures

Phase 2: Anonymization Engine

Goal: PII masking for non-production data

2.1 Anonymization Service

Create a lightweight Go or Python service that:

  • Streams BSON data from the mongodump archive
  • Applies transformation rules
  • Outputs anonymized BSON for mongorestore

```python
# Conceptual implementation
# NOTE: This uses deterministic hashing, which provides pseudonymization, not true anonymization.
# For stronger protection, consider using HMAC with a secret key not present in preview environments.
import hashlib

import yaml


class Anonymizer:
    def __init__(self, rules_path: str):
        with open(rules_path) as f:
            self.rules = yaml.safe_load(f)
        # Maintain per-field counters for sequential transforms
        self._sequential_counters: dict[str, int] = {}

    def transform_document(self, collection: str, doc: dict) -> dict:
        rules = self.rules.get('rules', {}).get(collection) or {}
        # `fields:` may be empty in the rules file (e.g. pmStudy), hence `or {}`
        for field, config in (rules.get('fields') or {}).items():
            if field in doc:
                doc[field] = self.apply_transform(field, doc[field], config['transform'])
        return doc

    def apply_transform(self, field_name: str, value, transform: str):
        if transform == 'hash_email':
            # Safely handle malformed email values
            if not isinstance(value, str) or value.count('@') != 1:
                return value  # Return unchanged if malformed
            local, domain = value.split('@', 1)
            local_hash = hashlib.sha256(local.encode()).hexdigest()[:8]
            domain_hash = hashlib.sha256(domain.encode()).hexdigest()[:4]
            return f"{local_hash}.{domain_hash}@anonymized.syrf.local"
        elif transform == 'sequential':
            # Generate sequential values: User1, User2, etc.
            current = self._sequential_counters.get(field_name, 0) + 1
            self._sequential_counters[field_name] = current
            return f"User{current}"
        elif transform == 'sequential_suffix':
            # Generate bare numeric suffixes: "1", "2", etc.
            current = self._sequential_counters.get(field_name, 0) + 1
            self._sequential_counters[field_name] = current
            return str(current)
        elif transform == 'hash':
            return hashlib.sha256(str(value).encode()).hexdigest()[:16]
        return value  # Unknown transform: pass value through unchanged
```
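A property worth pinning in the unit tests (see 2.2) is determinism: `hash_email` must map equal inputs to equal pseudonyms so that email-based joins across collections survive anonymization, and malformed values must pass through unchanged rather than crash the stream. Restated standalone for illustration:

```python
import hashlib

def hash_email(value):
    """Standalone restatement of the hash_email transform:
    deterministic, non-routable output domain, malformed input untouched."""
    if not isinstance(value, str) or value.count('@') != 1:
        return value
    local, domain = value.split('@', 1)
    local_hash = hashlib.sha256(local.encode()).hexdigest()[:8]
    domain_hash = hashlib.sha256(domain.encode()).hexdigest()[:4]
    return f"{local_hash}.{domain_hash}@anonymized.syrf.local"
```

Because the hash is unkeyed, anyone who can guess a candidate email can confirm it by re-hashing; that is the pseudonymization caveat noted in the implementation above.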

2.2 Deliverables

  • Anonymization rules configuration (YAML)
  • Anonymizer container image
  • Unit tests for each transform type
  • Documentation of anonymization approach

Phase 3: PR Preview Integration

Goal: Database source policy via PR description

3.1 Extend Preview Config Parser

Update .github/workflows/pr-preview.yml to parse database configuration:

```yaml
- name: Parse database config from preview-config
  id: db-config
  run: |
    CONFIG_B64="${{ needs.check-label.outputs.preview_config }}"
    if [ -n "$CONFIG_B64" ]; then
      CONFIG=$(echo "$CONFIG_B64" | base64 -d)

      # Extract database settings
      # NOTE: Uses mikefarah/yq v4+ syntax. Install via: brew install yq (macOS) or snap install yq (Linux)
      DB_SOURCE=$(echo "$CONFIG" | yq '.database.source // "seed"')
      SNAPSHOT_ID=$(echo "$CONFIG" | yq '.database.snapshotId // "latest"')
      ANONYMIZE=$(echo "$CONFIG" | yq '.database.anonymize // "true"')
      # Emit the collection list as compact JSON (yq outputs JSON directly; no jq needed)
      COLLECTIONS=$(echo "$CONFIG" | yq -o=json -I=0 '.database.collections // []')

      echo "db_source=$DB_SOURCE" >> "$GITHUB_OUTPUT"
      echo "snapshot_id=$SNAPSHOT_ID" >> "$GITHUB_OUTPUT"
      echo "anonymize=$ANONYMIZE" >> "$GITHUB_OUTPUT"
      echo "collections=$COLLECTIONS" >> "$GITHUB_OUTPUT"
    else
      # Defaults
      echo "db_source=seed" >> "$GITHUB_OUTPUT"
      echo "snapshot_id=" >> "$GITHUB_OUTPUT"
      echo "anonymize=true" >> "$GITHUB_OUTPUT"
      echo "collections=[]" >> "$GITHUB_OUTPUT"
    fi
```
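The same defaulting rules, plus the "anonymization is enforced for snapshots" policy from the Data Source Policies section, can be stated compactly in Python for reference (helper name hypothetical; the real logic lives in the workflow step):

```python
def resolve_db_config(preview_config):
    """Apply preview-config defaults: seed source, latest snapshot,
    anonymization on, no collection filter. Anonymization cannot be
    disabled for production snapshots in non-production environments."""
    db = (preview_config or {}).get("database") or {}
    resolved = {
        "source": db.get("source", "seed"),
        "snapshot_id": db.get("snapshotId", "latest"),
        "anonymize": db.get("anonymize", True),
        "collections": db.get("collections") or [],
    }
    if resolved["source"] == "snapshot":
        resolved["anonymize"] = True  # enforced, regardless of PR config
    return resolved
```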

3.2 Snapshot Restore Job

Replace/extend db-reset-job.yaml to support snapshot restore:

```yaml
# Generated when database.source=snapshot
apiVersion: batch/v1
kind: Job
metadata:
  name: snapshot-restore-{{ .prNumber }}
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      containers:
        - name: restore
          # TODO: Build and publish this image as part of Phase 2 implementation
          image: ghcr.io/camaradesuk/syrf-db-tools:latest
          command:
            - /bin/bash
            - -c
            - |
              # Download snapshot
              gsutil cp gs://syrf-db-snapshots/prod/{{ .snapshotId }}/dump.gz /tmp/dump.gz

              # Anonymize and restore
              # NOTE: syrf-anonymize is a custom tool to be built as part of Phase 2
              # It streams BSON, applies anonymization rules, and outputs to mongorestore
              syrf-anonymize \
                --input /tmp/dump.gz \
                --rules /config/anonymization-rules.yaml \
                --output - | \
              mongorestore \
                --uri="$MONGO_URI" \
                --db={{ .databaseName }} \
                --drop \
                --gzip \
                --archive=-
          env:
            - name: MONGO_URI
              valueFrom:
                secretKeyRef:
                  name: {{ .mongoSecretName }}
                  key: connection-string
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          configMap:
            name: anonymization-rules
      restartPolicy: OnFailure
```
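The optional `collections` subset from the PR config is not yet wired into the job above. One way to honor it, assuming mongorestore's `--nsInclude` namespace filtering, is to generate filter arguments when rendering the job template (function name hypothetical):

```python
def ns_include_args(source_db, collections):
    """Translate the optional `collections` subset from the PR config
    into mongorestore --nsInclude filters; an empty or missing subset
    means 'restore every collection in the source database'."""
    if not collections:
        return [f"--nsInclude={source_db}.*"]
    return [f"--nsInclude={source_db}.{c}" for c in collections]
```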

3.3 Deliverables

  • Extended preview-config parser for database settings
  • Conditional job generation (seed vs snapshot)
  • Snapshot restore job template
  • Integration with existing db-reset-job flow
  • Documentation in PR template

Phase 4: Staging Refresh

Goal: Periodic staging database refresh from production snapshot

4.1 Manual Staging Refresh

Create a workflow for manual staging refresh:

```yaml
# .github/workflows/staging-db-refresh.yml
name: Refresh Staging Database

on:
  workflow_dispatch:
    inputs:
      snapshot_id:
        description: 'Snapshot ID (or "latest")'
        default: 'latest'
        required: true
      confirm:
        description: 'Type "refresh-staging" to confirm'
        required: true

jobs:
  refresh:
    if: inputs.confirm == 'refresh-staging'
    runs-on: ubuntu-latest
    steps:
      - name: Trigger staging restore
        run: |
          # Create ArgoCD hook job for staging restore
          # Similar to preview but targets syrf_staging database
```

4.2 Deliverables

  • Staging refresh workflow
  • Staging-specific restore job
  • Confirmation gate to prevent accidents
  • Notification on completion

Phase 5: Monitoring and Operations

Goal: Observability and operational tooling

5.1 Metrics and Alerts

  • Snapshot job success/failure rate
  • Snapshot size over time
  • Restore duration metrics
  • Storage usage alerts

5.2 Deliverables

  • Prometheus metrics for snapshot jobs
  • Grafana dashboard for snapshot operations
  • PagerDuty alerts for snapshot failures
  • Runbook for manual snapshot/restore

PR Description Template

Add to PR template:

````markdown
## Preview Environment Configuration (Optional)

To customize your preview environment, add a YAML block with `#preview-config`:

```yaml
#preview-config
database:
  # Data source: seed (default), snapshot, or staging
  source: seed

  # For snapshot source:
  # snapshotId: latest        # or specific date: "2026-01-09"
  # collections:              # Optional: restore only specific collections
  #   - pmProject
  #   - pmStudy

web:
  featureFlags:
    experimentalFeature: true
```

**Available database sources:**
- `seed` - Sample data (fast, deterministic) - **default**
- `snapshot` - Anonymized production data (realistic volumes)
- `staging` - Clone of current staging database
````

Security Considerations

Data Protection

  1. Anonymization is mandatory for production snapshots in non-production
  2. No raw production data ever reaches preview environments
  3. Audit logging of all snapshot access and restores
  4. GCS bucket policies restrict access to authorized service accounts only

Access Control

| Action | Who Can Do It |
| --- | --- |
| Create production snapshot | CronJob only (automated) |
| Download snapshot | Restore jobs with Workload Identity |
| Restore to preview | PR preview workflow (automated) |
| Restore to staging | Requires refresh-staging workflow with confirmation |
| View snapshot contents | SRE team only (manual access) |

Compliance

  • GDPR: Anonymization removes identifying information
  • Research data: Scientific content preserved (not PII)
  • Retention: Snapshots auto-deleted after 30 days

Rollout Strategy

Phase 1 (Weeks 1-2): Snapshot Infrastructure

  • Create GCS bucket and IAM
  • Deploy snapshot CronJob
  • Verify daily snapshots are working
  • Set up monitoring

Phase 2 (Weeks 3-4): Anonymization

  • Define anonymization rules
  • Build and test anonymizer
  • Validate anonymized data quality
  • Security review of anonymization

Phase 3 (Weeks 5-6): Preview Integration

  • Extend PR preview workflow
  • Add restore job template
  • Test with real PRs
  • Update documentation

Phase 4 (Week 7): Staging Refresh

  • Create staging refresh workflow
  • Test staging restore
  • Document runbook

Phase 5 (Week 8): Polish

  • Monitoring dashboard
  • Alert configuration
  • Team training
  • Retrospective

Success Metrics

| Metric | Current | Target |
| --- | --- | --- |
| Production bug reproduction time | Hours (manual) | Minutes (automated) |
| Data realism in preview | Low (5 projects) | High (anonymized prod) |
| Staging data freshness | Stale/manual | Weekly automated |
| Snapshot reliability | N/A | 99.9% success rate |

Open Questions

  1. Snapshot size management: How to handle large production databases (currently ~X GB)?
     • Option A: Full database snapshots (simple, large)
     • Option B: Incremental snapshots (complex, smaller)
     • Option C: Collection-level snapshots (flexible, medium complexity)

  2. Restore time optimization: How to speed up restore for large snapshots?
     • Option A: Pre-warm snapshot in preview namespace
     • Option B: Lazy restore (restore on first access)
     • Option C: Subset restore based on PR needs

  3. Staging snapshot frequency: Daily, weekly, or on-demand?
     • Daily: More current, more storage
     • Weekly: Good balance
     • On-demand: Manual trigger only

Dependencies

  • GCS bucket with Workload Identity access
  • MongoDB Atlas operator (already deployed)
  • mongodump/mongorestore tools in container images
  • yq for YAML parsing in workflows

References