
Feature: Automated Data Snapshot Copy for Preview/Staging Environments

Documentation

| Document | Description |
| --- | --- |
| README.md (this file) | Feature brief and high-level strategy |
| App-of-Apps Architecture | ArgoCD App-of-Apps pattern with full auto-sync |
| Implementation Spec | Detailed implementation specification |
| Edge Case Analysis | Analysis of edge cases and scenarios |
| MongoDB Permissions | Permission model reference |

Overview

Automate the copying of production data snapshots to preview and staging databases, with the data source policy configurable via PR description. This enables developers to test features against realistic production data while maintaining isolation and data safety.

Problem Statement

Current State:

  • Preview environments use DatabaseSeeder with hardcoded sample data (5 projects, ~100 studies)
  • Staging environment has no standardized data population strategy
  • No mechanism to test against production-like data volumes or structures
  • Developers cannot reproduce production bugs that depend on specific data patterns
  • Real-world edge cases (large projects, complex annotations) are not represented in test data

Impact:

  • Features may work in preview but fail in production with real data
  • Performance issues only discovered after production deployment
  • Bug reproduction requires manual data manipulation
  • No confidence that migrations will work on production data patterns

Goals

  1. Production Snapshot Capture: Automated periodic snapshots of production data
  2. Policy-Based Data Source: PR description config to choose data source (seed, snapshot, staging)
  3. Data Anonymization: PII masking when restoring production data to non-production environments
  4. Staging Refresh: Ability to refresh staging with anonymized production snapshot
  5. Selective Restore: Choose which collections/data subsets to restore

Non-Goals

  • Real-time data replication (too complex, unnecessary for testing)
  • Full production data without anonymization in non-production (security risk)
  • Automatic production promotion based on preview testing (separate concern)

Data Source Policies

The feature introduces three data source policies configurable via PR description:

Policy 1: Seed Data (Default - Current Behavior)

```yaml
#preview-config
database:
  source: seed
```
  • Uses existing DatabaseSeeder with 5 sample projects
  • Fast, deterministic, no external dependencies
  • Good for: UI testing, new feature development, quick iterations

Policy 2: Production Snapshot

```yaml
#preview-config
database:
  source: snapshot
  snapshotId: latest        # or specific: "2026-01-09-daily"
  anonymize: true           # Required for non-production (enforced)
  collections:              # Optional: subset of collections
    - pmProject
    - pmStudy
```
  • Restores anonymized production data from snapshot
  • Realistic data volumes and patterns
  • Good for: Performance testing, bug reproduction, migration testing

Policy 3: Staging Clone

```yaml
#preview-config
database:
  source: staging
  anonymize: false          # Staging already anonymized
```
  • Clones current staging database to preview
  • Useful when staging has specific test scenarios set up
  • Good for: Testing against staging-specific configurations

Solution Architecture

High-Level Data Flow

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         Production Data Snapshot Flow                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    ┌─────────────────┐    ┌─────────────────────────────┐  │
│  │  Production │    │  Scheduled Job  │    │      GCS Bucket             │  │
│  │  Database   │───▶│  (Daily 3AM)    │───▶│  gs://syrf-db-snapshots/    │  │
│  │  syrftest   │    │  mongodump      │    │  └── prod/                  │  │
│  └─────────────┘    └─────────────────┘    │      ├── 2026-01-09/        │  │
│                                            │      │   ├── manifest.json  │  │
│                                            │      │   └── dump.gz        │  │
│                                            │      └── latest -> 2026-01-09│ │
│                                            └─────────────────────────────┘  │
│                                                         │                   │
│  ┌─────────────────────────────────────────────────────┼───────────────────┐│
│  │                    Preview Environment Restore       │                   ││
│  │                                                      ▼                   ││
│  │  ┌─────────────┐    ┌─────────────────┐    ┌─────────────────┐          ││
│  │  │ PR Preview  │◀───│  Restore Job    │◀───│   Anonymizer    │          ││
│  │  │ syrf_pr_123 │    │  mongorestore   │    │   (streaming)   │          ││
│  │  └─────────────┘    └─────────────────┘    └─────────────────┘          ││
│  └──────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
```

Component Overview

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                          Snapshot Automation Components                       │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. SNAPSHOT CAPTURE (Daily Scheduled)                                       │
│     ┌─────────────────────────────────────────────────────────────────────┐ │
│     │  Kubernetes CronJob: snapshot-producer                              │ │
│     │  - Runs: Daily at 3:00 AM UTC                                       │ │
│     │  - Tool: mongodump with gzip compression                            │ │
│     │  - Output: GCS bucket with versioned directories                    │ │
│     │  - Retention: 7 daily, 4 weekly snapshots                           │ │
│     └─────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  2. SNAPSHOT REGISTRY (Metadata Service)                                     │
│     ┌─────────────────────────────────────────────────────────────────────┐ │
│     │  GCS manifest.json per snapshot:                                    │ │
│     │  - Timestamp, size, collection list                                 │ │
│     │  - Schema versions, checksums                                       │ │
│     │  - Source database, MongoDB version                                 │ │
│     └─────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  3. ANONYMIZATION ENGINE (Transform Layer)                                   │
│     ┌─────────────────────────────────────────────────────────────────────┐ │
│     │  anonymization-rules.yaml:                                          │ │
│     │  - Field mappings: email → hash, name → "User N"                    │ │
│     │  - Exclusions: audit logs, sessions, tokens                         │ │
│     │  - Streaming transform during restore (no intermediate files)       │ │
│     └─────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
│  4. RESTORE ORCHESTRATOR (PR Preview Integration)                            │
│     ┌─────────────────────────────────────────────────────────────────────┐ │
│     │  PreSync Job: snapshot-restore                                      │ │
│     │  - Reads policy from PR description                                 │ │
│     │  - Downloads snapshot from GCS                                      │ │
│     │  - Applies anonymization rules                                      │ │
│     │  - Restores to target database (syrf_pr_N or syrf_staging)          │ │
│     └─────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```

Database Collections and Anonymization

Collection Categories

| Category | Collections | Snapshot? | Anonymize? |
| --- | --- | --- | --- |
| Core Data | pmProject, pmStudy, pmSystematicSearch | Yes | Partial |
| User Data | pmInvestigator | Yes | Full |
| Audit/Logs | asFileListing, logs, sessions | No | N/A |
| AI Jobs | pmRiskOfBiasAiJob | Yes | Partial |
| Exports | pmDataExportJob | Optional | Partial |

Anonymization Rules

```yaml
# anonymization-rules.yaml
rules:
  pmInvestigator:
    fields:
      Email:
        transform: hash_email  # user@example.com → user-abc123@anonymized.syrf.local
      FirstName:
        transform: sequential  # "John" → "User"
      LastName:
        transform: sequential_suffix  # "Smith" → "1234"
      Auth0Id:
        transform: hash  # Preserve uniqueness, hide real ID

  pmProject:
    fields:
      ContactEmail:
        transform: hash_email
      # Project names, descriptions: Keep as-is (research metadata)

  pmStudy:
    # Most study data is research content, keep as-is
    fields:
      # No PII in studies typically

exclusions:
  # Collections to skip entirely
  - asFileListing      # Contains file paths, not needed for testing
  - pmAuditLog         # Sensitive audit trail
  - system.sessions    # Auth sessions

size_limits:
  # For preview environments, limit data volume
  pmStudy:
    max_documents: 10000  # ~10k studies sufficient for testing
  pmProject:
    max_documents: 100    # ~100 projects
```
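During restore, the `size_limits` above can be enforced by truncating each collection's document stream before it reaches `mongorestore`. A minimal sketch of the selection logic (function and constant names are hypothetical, and the limits are hard-coded here rather than read from the YAML):

```python
from itertools import islice

# Hypothetical in-memory mirror of the size_limits section above
SIZE_LIMITS = {"pmStudy": 10000, "pmProject": 100}

def capped(collection: str, documents):
    """Yield at most max_documents for a limited collection;
    collections without a configured limit pass through unchanged."""
    limit = SIZE_LIMITS.get(collection)
    return islice(documents, limit) if limit is not None else documents
```

Note that a naive per-collection cap can break referential integrity (e.g. studies whose project was not sampled); a production implementation would want to sample projects first and then include their studies.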

Implementation Plan

Phase 1: Snapshot Infrastructure (Foundation)

Goal: Automated production snapshot capture and storage

1.1 GCS Bucket Setup

```bash
# Create snapshot bucket with lifecycle management
gsutil mb -l europe-west2 gs://syrf-db-snapshots

# Configure lifecycle (delete snapshots older than 30 days)
gsutil lifecycle set lifecycle-config.json gs://syrf-db-snapshots
```
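The `lifecycle-config.json` file is referenced but not shown; a minimal version implementing the 30-day deletion, in the standard GCS lifecycle configuration format, would look like:

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30}
    }
  ]
}
```

Age-based deletion is only a backstop; the finer-grained 7-daily/4-weekly retention policy still needs its own pruning step in the snapshot job.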

1.2 Snapshot Producer CronJob

```yaml
# cluster-gitops/syrf/services/snapshot-producer/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: snapshot-producer
  namespace: syrf-system
spec:
  schedule: "0 3 * * *"  # Daily at 3 AM UTC
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: mongodump
              image: mongo:7.0
              command:
                - /bin/bash
                - -c
                - |
                  DATE=$(date +%Y-%m-%d)
                  mongodump \
                    --uri="$MONGO_URI" \
                    --db=syrftest \
                    --gzip \
                    --archive=/tmp/dump.gz

                  # Get collection names for manifest
                  COLLECTIONS=$(mongosh "$MONGO_URI/syrftest" --quiet --eval 'JSON.stringify(db.getCollectionNames())')

                  # Create manifest (using wc -c for portability)
                  cat > /tmp/manifest.json <<EOF
                  {
                    "timestamp": "$(date -Iseconds)",
                    "source_db": "syrftest",
                    "mongo_version": "7.0",
                    "size_bytes": $(wc -c < /tmp/dump.gz),
                    "collections": $COLLECTIONS
                  }
                  EOF

                  # Upload to GCS
                  gsutil cp /tmp/dump.gz gs://syrf-db-snapshots/prod/$DATE/dump.gz
                  gsutil cp /tmp/manifest.json gs://syrf-db-snapshots/prod/$DATE/manifest.json

                  # Update latest marker (GCS has no symlinks, use marker file)
                  echo "prod/$DATE" > /tmp/latest.txt
                  gsutil cp /tmp/latest.txt gs://syrf-db-snapshots/prod/latest.txt
              env:
                - name: MONGO_URI
                  valueFrom:
                    secretKeyRef:
                      name: mongo-db-prod
                      key: connection-string
          restartPolicy: OnFailure
```
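The CronJob above only uploads; the "7 daily, 4 weekly" retention target from the Deliverables needs a pruning step that decides which dated prefixes to keep. A Python sketch of just the selection logic (function name hypothetical; listing and deleting the GCS prefixes is left out):

```python
from datetime import date, timedelta

def snapshots_to_keep(snapshot_dates, daily=7, weekly=4):
    """Select which dated snapshots to retain: the `daily` most recent
    snapshots, plus the newest snapshot from each of the last `weekly`
    ISO weeks. Everything else is a candidate for deletion."""
    recent = sorted(snapshot_dates, reverse=True)
    keep = set(recent[:daily])
    newest_per_week = {}
    for d in recent:
        week = d.isocalendar()[:2]           # (ISO year, ISO week)
        newest_per_week.setdefault(week, d)  # first hit is newest (desc order)
    for week in sorted(newest_per_week, reverse=True)[:weekly]:
        keep.add(newest_per_week[week])
    return keep
```

The weekly tier overlaps the daily tier by design: the newest daily snapshot is also its week's representative, so the total kept is typically daily + (weekly − 1) extra snapshots.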

1.3 Deliverables

  • GCS bucket syrf-db-snapshots created with appropriate IAM
  • Workload Identity configured for GCS access from GKE
  • CronJob manifest in cluster-gitops
  • Snapshot retention policy (7 daily, 4 weekly)
  • Monitoring/alerting for snapshot failures

Phase 2: Anonymization Engine

Goal: PII masking for non-production data

2.1 Anonymization Service

Create a lightweight Go or Python service that:

  • Streams BSON data from the mongodump archive
  • Applies transformation rules
  • Outputs anonymized BSON for mongorestore

```python
# Conceptual implementation
# NOTE: This uses deterministic hashing, which provides pseudonymization, not true anonymization.
# For stronger protection, consider using HMAC with a secret key not present in preview environments.
import hashlib

import yaml


class Anonymizer:
    def __init__(self, rules_path: str):
        with open(rules_path) as f:
            self.rules = yaml.safe_load(f)
        # Maintain per-field counters for sequential transforms
        self._sequential_counters: dict[str, int] = {}

    def transform_document(self, collection: str, doc: dict) -> dict:
        rules = self.rules.get('rules', {}).get(collection) or {}
        # `fields:` may be empty in the rules file (e.g. pmStudy), hence `or {}`
        for field, config in (rules.get('fields') or {}).items():
            if field in doc:
                doc[field] = self.apply_transform(field, doc[field], config['transform'])
        return doc

    def apply_transform(self, field_name: str, value, transform: str):
        if transform == 'hash_email':
            # Safely handle malformed email values
            if not isinstance(value, str) or value.count('@') != 1:
                return value  # Return unchanged if malformed
            local, domain = value.split('@', 1)
            local_hash = hashlib.sha256(local.encode()).hexdigest()[:8]
            domain_hash = hashlib.sha256(domain.encode()).hexdigest()[:4]
            return f"{local_hash}.{domain_hash}@anonymized.syrf.local"
        elif transform == 'sequential':
            # Generate sequential values: User1, User2, etc.
            current = self._sequential_counters.get(field_name, 0) + 1
            self._sequential_counters[field_name] = current
            return f"User{current}"
        elif transform == 'sequential_suffix':
            # Generate bare numeric suffixes: "1", "2", etc.
            current = self._sequential_counters.get(field_name, 0) + 1
            self._sequential_counters[field_name] = current
            return str(current)
        elif transform == 'hash':
            return hashlib.sha256(str(value).encode()).hexdigest()[:16]
        return value  # Unknown transform: pass value through unchanged
```
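A property worth pinning in the unit tests (see 2.2) is determinism: `hash_email` must map equal inputs to equal pseudonyms so that email-based joins across collections survive anonymization, and malformed values must pass through unchanged rather than crash the stream. Restated standalone for illustration:

```python
import hashlib

def hash_email(value):
    """Standalone restatement of the hash_email transform:
    deterministic, non-routable output domain, malformed input untouched."""
    if not isinstance(value, str) or value.count('@') != 1:
        return value
    local, domain = value.split('@', 1)
    local_hash = hashlib.sha256(local.encode()).hexdigest()[:8]
    domain_hash = hashlib.sha256(domain.encode()).hexdigest()[:4]
    return f"{local_hash}.{domain_hash}@anonymized.syrf.local"
```

Because the hash is unkeyed, anyone who can guess a candidate email can confirm it by re-hashing; that is the pseudonymization caveat noted in the implementation above.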

2.2 Deliverables

  • Anonymization rules configuration (YAML)
  • Anonymizer container image
  • Unit tests for each transform type
  • Documentation of anonymization approach

Phase 3: PR Preview Integration

Goal: Database source policy via PR description

3.1 Extend Preview Config Parser

Update .github/workflows/pr-preview.yml to parse database configuration:

```yaml
- name: Parse database config from preview-config
  id: db-config
  run: |
    CONFIG_B64="${{ needs.check-label.outputs.preview_config }}"
    if [ -n "$CONFIG_B64" ]; then
      CONFIG=$(echo "$CONFIG_B64" | base64 -d)

      # Extract database settings
      # NOTE: Uses mikefarah/yq v4+ syntax. Install via: brew install yq (macOS) or snap install yq (Linux)
      DB_SOURCE=$(echo "$CONFIG" | yq '.database.source // "seed"')
      SNAPSHOT_ID=$(echo "$CONFIG" | yq '.database.snapshotId // "latest"')
      ANONYMIZE=$(echo "$CONFIG" | yq '.database.anonymize // "true"')
      # Emit the collection list as compact JSON (yq outputs JSON directly; no jq needed)
      COLLECTIONS=$(echo "$CONFIG" | yq -o=json -I=0 '.database.collections // []')

      echo "db_source=$DB_SOURCE" >> "$GITHUB_OUTPUT"
      echo "snapshot_id=$SNAPSHOT_ID" >> "$GITHUB_OUTPUT"
      echo "anonymize=$ANONYMIZE" >> "$GITHUB_OUTPUT"
      echo "collections=$COLLECTIONS" >> "$GITHUB_OUTPUT"
    else
      # Defaults
      echo "db_source=seed" >> "$GITHUB_OUTPUT"
      echo "snapshot_id=" >> "$GITHUB_OUTPUT"
      echo "anonymize=true" >> "$GITHUB_OUTPUT"
      echo "collections=[]" >> "$GITHUB_OUTPUT"
    fi
```
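The same defaulting rules, plus the "anonymization is enforced for snapshots" policy from the Data Source Policies section, can be stated compactly in Python for reference (helper name hypothetical; the real logic lives in the workflow step):

```python
def resolve_db_config(preview_config):
    """Apply preview-config defaults: seed source, latest snapshot,
    anonymization on, no collection filter. Anonymization cannot be
    disabled for production snapshots in non-production environments."""
    db = (preview_config or {}).get("database") or {}
    resolved = {
        "source": db.get("source", "seed"),
        "snapshot_id": db.get("snapshotId", "latest"),
        "anonymize": db.get("anonymize", True),
        "collections": db.get("collections") or [],
    }
    if resolved["source"] == "snapshot":
        resolved["anonymize"] = True  # enforced, regardless of PR config
    return resolved
```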

3.2 Snapshot Restore Job

Replace/extend db-reset-job.yaml to support snapshot restore:

```yaml
# Generated when database.source=snapshot
apiVersion: batch/v1
kind: Job
metadata:
  name: snapshot-restore-{{ .prNumber }}
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      containers:
        - name: restore
          # TODO: Build and publish this image as part of Phase 2 implementation
          image: ghcr.io/camaradesuk/syrf-db-tools:latest
          command:
            - /bin/bash
            - -c
            - |
              # Download snapshot
              gsutil cp gs://syrf-db-snapshots/prod/{{ .snapshotId }}/dump.gz /tmp/dump.gz

              # Anonymize and restore
              # NOTE: syrf-anonymize is a custom tool to be built as part of Phase 2
              # It streams BSON, applies anonymization rules, and outputs to mongorestore
              syrf-anonymize \
                --input /tmp/dump.gz \
                --rules /config/anonymization-rules.yaml \
                --output - | \
              mongorestore \
                --uri="$MONGO_URI" \
                --db={{ .databaseName }} \
                --drop \
                --gzip \
                --archive=-
          env:
            - name: MONGO_URI
              valueFrom:
                secretKeyRef:
                  name: {{ .mongoSecretName }}
                  key: connection-string
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          configMap:
            name: anonymization-rules
      restartPolicy: OnFailure
```
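The optional `collections` subset from the PR config is not yet wired into the job above. One way to honor it, assuming mongorestore's `--nsInclude` namespace filtering, is to generate filter arguments when rendering the job template (function name hypothetical):

```python
def ns_include_args(source_db, collections):
    """Translate the optional `collections` subset from the PR config
    into mongorestore --nsInclude filters; an empty or missing subset
    means 'restore every collection in the source database'."""
    if not collections:
        return [f"--nsInclude={source_db}.*"]
    return [f"--nsInclude={source_db}.{c}" for c in collections]
```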

3.3 Deliverables

  • Extended preview-config parser for database settings
  • Conditional job generation (seed vs snapshot)
  • Snapshot restore job template
  • Integration with existing db-reset-job flow
  • Documentation in PR template

Phase 4: Staging Refresh

Goal: Periodic staging database refresh from production snapshot

4.1 Manual Staging Refresh

Create a workflow for manual staging refresh:

```yaml
# .github/workflows/staging-db-refresh.yml
name: Refresh Staging Database

on:
  workflow_dispatch:
    inputs:
      snapshot_id:
        description: 'Snapshot ID (or "latest")'
        default: 'latest'
        required: true
      confirm:
        description: 'Type "refresh-staging" to confirm'
        required: true

jobs:
  refresh:
    if: inputs.confirm == 'refresh-staging'
    runs-on: ubuntu-latest
    steps:
      - name: Trigger staging restore
        run: |
          # Create ArgoCD hook job for staging restore
          # Similar to preview but targets syrf_staging database
```

4.2 Deliverables

  • Staging refresh workflow
  • Staging-specific restore job
  • Confirmation gate to prevent accidents
  • Notification on completion

Phase 5: Monitoring and Operations

Goal: Observability and operational tooling

5.1 Metrics and Alerts

  • Snapshot job success/failure rate
  • Snapshot size over time
  • Restore duration metrics
  • Storage usage alerts

5.2 Deliverables

  • Prometheus metrics for snapshot jobs
  • Grafana dashboard for snapshot operations
  • PagerDuty alerts for snapshot failures
  • Runbook for manual snapshot/restore

PR Description Template

Add to PR template:

````markdown
## Preview Environment Configuration (Optional)

To customize your preview environment, add a YAML block with `#preview-config`:

```yaml
#preview-config
database:
  # Data source: seed (default), snapshot, or staging
  source: seed

  # For snapshot source:
  # snapshotId: latest        # or specific date: "2026-01-09"
  # collections:              # Optional: restore only specific collections
  #   - pmProject
  #   - pmStudy

web:
  featureFlags:
    experimentalFeature: true
```

**Available database sources:**
- `seed` - Sample data (fast, deterministic) - **default**
- `snapshot` - Anonymized production data (realistic volumes)
- `staging` - Clone of current staging database
````

Security Considerations

Data Protection

  1. Anonymization is mandatory for production snapshots in non-production
  2. No raw production data ever reaches preview environments
  3. Audit logging of all snapshot access and restores
  4. GCS bucket policies restrict access to authorized service accounts only

Access Control

| Action | Who Can Do It |
| --- | --- |
| Create production snapshot | CronJob only (automated) |
| Download snapshot | Restore jobs with Workload Identity |
| Restore to preview | PR preview workflow (automated) |
| Restore to staging | Requires refresh-staging workflow with confirmation |
| View snapshot contents | SRE team only (manual access) |

Compliance

  • GDPR: Anonymization removes identifying information
  • Research data: Scientific content preserved (not PII)
  • Retention: Snapshots auto-deleted after 30 days

Rollout Strategy

Phase 1 (Weeks 1-2): Snapshot Infrastructure

  • Create GCS bucket and IAM
  • Deploy snapshot CronJob
  • Verify daily snapshots are working
  • Set up monitoring

Phase 2 (Weeks 3-4): Anonymization

  • Define anonymization rules
  • Build and test anonymizer
  • Validate anonymized data quality
  • Security review of anonymization

Phase 3 (Weeks 5-6): Preview Integration

  • Extend PR preview workflow
  • Add restore job template
  • Test with real PRs
  • Update documentation

Phase 4 (Week 7): Staging Refresh

  • Create staging refresh workflow
  • Test staging restore
  • Document runbook

Phase 5 (Week 8): Polish

  • Monitoring dashboard
  • Alert configuration
  • Team training
  • Retrospective

Success Metrics

| Metric | Current | Target |
| --- | --- | --- |
| Production bug reproduction time | Hours (manual) | Minutes (automated) |
| Data realism in preview | Low (5 projects) | High (anonymized prod) |
| Staging data freshness | Stale/manual | Weekly automated |
| Snapshot reliability | N/A | 99.9% success rate |

Open Questions

  1. Snapshot size management: How to handle large production databases (currently ~X GB)?
     • Option A: Full database snapshots (simple, large)
     • Option B: Incremental snapshots (complex, smaller)
     • Option C: Collection-level snapshots (flexible, medium complexity)

  2. Restore time optimization: How to speed up restore for large snapshots?
     • Option A: Pre-warm snapshot in preview namespace
     • Option B: Lazy restore (restore on first access)
     • Option C: Subset restore based on PR needs

  3. Staging snapshot frequency: Daily, weekly, or on-demand?
     • Daily: More current, more storage
     • Weekly: Good balance
     • On-demand: Manual trigger only

Dependencies

  • GCS bucket with Workload Identity access
  • MongoDB Atlas operator (already deployed)
  • mongodump/mongorestore tools in container images
  • yq for YAML parsing in workflows

References