# Feature: Automated Data Snapshot Copy for Preview/Staging Environments

## Documentation
| Document | Description |
|---|---|
| README.md (this file) | Feature brief and high-level strategy |
| App-of-Apps Architecture | ArgoCD App-of-Apps pattern with full auto-sync |
| Implementation Spec | Detailed implementation specification |
| Edge Case Analysis | Analysis of edge cases and scenarios |
| MongoDB Permissions | Permission model reference |
## Overview
Automate the copying of production data snapshots to preview and staging databases, with the data source policy configurable via PR description. This enables developers to test features against realistic production data while maintaining isolation and data safety.
## Problem Statement

**Current State:**
- Preview environments use `DatabaseSeeder` with hardcoded sample data (5 projects, ~100 studies)
- Staging environment has no standardized data population strategy
- No mechanism to test against production-like data volumes or structures
- Developers cannot reproduce production bugs that depend on specific data patterns
- Real-world edge cases (large projects, complex annotations) are not represented in test data

**Impact:**
- Features may work in preview but fail in production with real data
- Performance issues only discovered after production deployment
- Bug reproduction requires manual data manipulation
- No confidence that migrations will work on production data patterns
## Goals
- Production Snapshot Capture: Automated periodic snapshots of production data
- Policy-Based Data Source: PR description config to choose data source (seed, snapshot, staging)
- Data Anonymization: PII masking when restoring production data to non-production environments
- Staging Refresh: Ability to refresh staging with anonymized production snapshot
- Selective Restore: Choose which collections/data subsets to restore
## Non-Goals
- Real-time data replication (too complex, unnecessary for testing)
- Full production data without anonymization in non-production (security risk)
- Automatic production promotion based on preview testing (separate concern)
## Data Source Policies
The feature introduces three data source policies configurable via PR description:
### Policy 1: Seed Data (Default - Current Behavior)

- Uses existing `DatabaseSeeder` with 5 sample projects
- Fast, deterministic, no external dependencies
- Good for: UI testing, new feature development, quick iterations
### Policy 2: Production Snapshot

```yaml
#preview-config
database:
  source: snapshot
  snapshotId: latest   # or specific: "2026-01-09-daily"
  anonymize: true      # Required for non-production (enforced)
  collections:         # Optional: subset of collections
    - pmProject
    - pmStudy
```
- Restores anonymized production data from snapshot
- Realistic data volumes and patterns
- Good for: Performance testing, bug reproduction, migration testing
### Policy 3: Staging Clone
- Clones current staging database to preview
- Useful when staging has specific test scenarios set up
- Good for: Testing against staging-specific configurations
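For comparison with the Policy 2 example, a staging clone could be requested with a minimal config block (hypothetical, but following the same `#preview-config` format):

```yaml
#preview-config
database:
  source: staging
```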
## Solution Architecture
### High-Level Data Flow

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Production Data Snapshot Flow │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────────┐ ┌─────────────────────────────┐ │
│ │ Production │ │ Scheduled Job │ │ GCS Bucket │ │
│ │ Database │───▶│ (Daily 3AM) │───▶│ gs://syrf-db-snapshots/ │ │
│ │ syrftest │ │ mongodump │ │ └── prod/ │ │
│ └─────────────┘ └─────────────────┘ │ ├── 2026-01-09/ │ │
│ │ │ ├── manifest.json │ │
│ │ │ └── dump.gz │ │
│ │ └── latest -> 2026-01-09│ │
│ └─────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────┼───────────────────┐│
│ │ Preview Environment Restore │ ││
│ │ ▼ ││
│ │ ┌─────────────┐ ┌─────────────────┐ ┌─────────────────┐ ││
│ │ │ PR Preview │◀───│ Restore Job │◀───│ Anonymizer │ ││
│ │ │ syrf_pr_123 │ │ mongorestore │ │ (streaming) │ ││
│ │ └─────────────┘ └─────────────────┘ └─────────────────┘ ││
│ └──────────────────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────────────────┘
```
### Component Overview

```
┌──────────────────────────────────────────────────────────────────────────────┐
│ Snapshot Automation Components │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. SNAPSHOT CAPTURE (Daily Scheduled) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Kubernetes CronJob: snapshot-producer │ │
│ │ - Runs: Daily at 3:00 AM UTC │ │
│ │ - Tool: mongodump with gzip compression │ │
│ │ - Output: GCS bucket with versioned directories │ │
│ │ - Retention: 7 daily, 4 weekly snapshots │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ 2. SNAPSHOT REGISTRY (Metadata Service) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ GCS manifest.json per snapshot: │ │
│ │ - Timestamp, size, collection list │ │
│ │ - Schema versions, checksums │ │
│ │ - Source database, MongoDB version │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ 3. ANONYMIZATION ENGINE (Transform Layer) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ anonymization-rules.yaml: │ │
│ │ - Field mappings: email → hash, name → "User N" │ │
│ │ - Exclusions: audit logs, sessions, tokens │ │
│ │ - Streaming transform during restore (no intermediate files) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ 4. RESTORE ORCHESTRATOR (PR Preview Integration) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ PreSync Job: snapshot-restore │ │
│ │ - Reads policy from PR description │ │
│ │ - Downloads snapshot from GCS │ │
│ │ - Applies anonymization rules │ │
│ │ - Restores to target database (syrf_pr_N or syrf_staging) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
```
## Database Collections and Anonymization

### Collection Categories
| Category | Collections | Snapshot? | Anonymize? |
|---|---|---|---|
| Core Data | pmProject, pmStudy, pmSystematicSearch | Yes | Partial |
| User Data | pmInvestigator | Yes | Full |
| Audit/Logs | asFileListing, logs, sessions | No | N/A |
| AI Jobs | pmRiskOfBiasAiJob | Yes | Partial |
| Exports | pmDataExportJob | Optional | Partial |
### Anonymization Rules

```yaml
# anonymization-rules.yaml
rules:
  pmInvestigator:
    fields:
      Email:
        transform: hash_email          # user@example.com → user-abc123@anonymized.syrf.local
      FirstName:
        transform: sequential          # "John" → "User"
      LastName:
        transform: sequential_suffix   # "Smith" → "1234"
      Auth0Id:
        transform: hash                # Preserve uniqueness, hide real ID
  pmProject:
    fields:
      ContactEmail:
        transform: hash_email
    # Project names, descriptions: keep as-is (research metadata)
  pmStudy:
    # Most study data is research content, keep as-is
    fields: {}  # No PII in studies typically

exclusions:
  # Collections to skip entirely
  - asFileListing     # Contains file paths, not needed for testing
  - pmAuditLog        # Sensitive audit trail
  - system.sessions   # Auth sessions

size_limits:
  # For preview environments, limit data volume
  pmStudy:
    max_documents: 10000  # ~10k studies sufficient for testing
  pmProject:
    max_documents: 100    # ~100 projects
```
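The `size_limits` section implies the restore stream must cap documents per collection. One way to sketch that enforcement (class name and wiring hypothetical, not part of the spec) is a per-collection counter consulted for each streamed document:

```python
# Sketch: enforce per-collection document caps while streaming a restore.
class SizeLimiter:
    def __init__(self, limits: dict[str, int]):
        self.limits = limits          # e.g. {"pmStudy": 10000, "pmProject": 100}
        self.counts: dict[str, int] = {}

    def allow(self, collection: str) -> bool:
        """Return True if one more document from `collection` may be restored."""
        limit = self.limits.get(collection)
        if limit is None:
            return True               # No cap configured for this collection
        seen = self.counts.get(collection, 0)
        if seen >= limit:
            return False              # Cap reached: drop the document
        self.counts[collection] = seen + 1
        return True
```

The restore pipeline would call `allow()` once per document and skip any document for which it returns `False`.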
## Implementation Plan

### Phase 1: Snapshot Infrastructure (Foundation)

**Goal:** Automated production snapshot capture and storage
#### 1.1 GCS Bucket Setup

```bash
# Create snapshot bucket with lifecycle management
gsutil mb -l europe-west2 gs://syrf-db-snapshots

# Configure lifecycle (delete snapshots older than 30 days)
gsutil lifecycle set lifecycle-config.json gs://syrf-db-snapshots
```
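The `lifecycle-config.json` referenced above is not shown in this brief; a minimal sketch matching the 30-day deletion policy might be:

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 30}
    }
  ]
}
```

Note that a pure age rule cannot express the 7-daily/4-weekly retention policy below; that pruning has to be done by the snapshot CronJob or a separate script.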
#### 1.2 Snapshot Producer CronJob

```yaml
# cluster-gitops/syrf/services/snapshot-producer/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: snapshot-producer
  namespace: syrf-system
spec:
  schedule: "0 3 * * *"  # Daily at 3 AM UTC
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: mongodump
              # NOTE: mongo:7.0 does not ship gsutil; in practice this needs an
              # image with both the MongoDB tools and the gcloud SDK installed.
              image: mongo:7.0
              command:
                - /bin/bash
                - -c
                - |
                  set -euo pipefail
                  DATE=$(date +%Y-%m-%d)
                  mongodump \
                    --uri="$MONGO_URI" \
                    --db=syrftest \
                    --gzip \
                    --archive=/tmp/dump.gz

                  # Get collection names for manifest
                  COLLECTIONS=$(mongosh "$MONGO_URI/syrftest" --quiet --eval 'JSON.stringify(db.getCollectionNames())')

                  # Create manifest (using wc -c for portability)
                  cat > /tmp/manifest.json <<EOF
                  {
                    "timestamp": "$(date -Iseconds)",
                    "source_db": "syrftest",
                    "mongo_version": "7.0",
                    "size_bytes": $(wc -c < /tmp/dump.gz),
                    "collections": $COLLECTIONS
                  }
                  EOF

                  # Upload to GCS
                  gsutil cp /tmp/dump.gz gs://syrf-db-snapshots/prod/$DATE/dump.gz
                  gsutil cp /tmp/manifest.json gs://syrf-db-snapshots/prod/$DATE/manifest.json

                  # Update latest marker (GCS has no symlinks, use marker file)
                  echo "prod/$DATE" > /tmp/latest.txt
                  gsutil cp /tmp/latest.txt gs://syrf-db-snapshots/prod/latest.txt
              env:
                - name: MONGO_URI
                  valueFrom:
                    secretKeyRef:
                      name: mongo-db-prod
                      key: connection-string
          restartPolicy: OnFailure
```
#### 1.3 Deliverables

- GCS bucket `syrf-db-snapshots` created with appropriate IAM
- Workload Identity configured for GCS access from GKE
- CronJob manifest in cluster-gitops
- Snapshot retention policy (7 daily, 4 weekly)
- Monitoring/alerting for snapshot failures
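The 7-daily/4-weekly retention policy cannot be expressed as a single GCS age rule, so a pruning step (for example, run at the end of each snapshot job) would need to compute which dated prefixes to keep. A sketch, with the function name and parameters hypothetical:

```python
from datetime import date, timedelta

def snapshots_to_keep(snapshot_dates, today, daily=7, weekly=4):
    """Return the subset of snapshot dates to retain: everything from the
    last `daily` days, plus the newest snapshot in each of the most recent
    `weekly` ISO weeks (the daily window may overlap the first weekly picks)."""
    keep = set()
    dates = sorted(snapshot_dates, reverse=True)

    # Daily window: keep every snapshot newer than the cutoff
    cutoff = today - timedelta(days=daily)
    keep.update(d for d in dates if d > cutoff)

    # Weekly: keep the newest snapshot per ISO week, up to `weekly` weeks
    weeks_seen = []
    for d in dates:
        week = d.isocalendar()[:2]  # (iso_year, iso_week)
        if week not in weeks_seen:
            weeks_seen.append(week)
            keep.add(d)
        if len(weeks_seen) >= weekly:
            break
    return keep
```

Anything returned by this function stays in the bucket; every other dated prefix is deleted by the pruning step.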
### Phase 2: Anonymization Engine

**Goal:** PII masking for non-production data

#### 2.1 Anonymization Service

Create a lightweight Go or Python service that:
- Streams BSON data from the mongodump archive
- Applies transformation rules
- Outputs anonymized BSON for mongorestore
```python
# Conceptual implementation
# NOTE: This uses deterministic hashing, which provides pseudonymization,
# not true anonymization. For stronger protection, consider HMAC with a
# secret key that is not present in preview environments.
import hashlib
import yaml

class Anonymizer:
    def __init__(self, rules_path: str):
        with open(rules_path) as f:
            self.rules = yaml.safe_load(f)
        # Maintain per-field counters for sequential transforms
        self._sequential_counters: dict[str, int] = {}

    def transform_document(self, collection: str, doc: dict) -> dict:
        rules = self.rules.get('rules', {}).get(collection, {})
        for field, config in (rules.get('fields') or {}).items():
            if field in doc:
                doc[field] = self.apply_transform(field, doc[field], config['transform'])
        return doc

    def apply_transform(self, field_name: str, value: str, transform: str) -> str:
        if transform == 'hash_email':
            # Safely handle malformed email values
            if not isinstance(value, str) or value.count('@') != 1:
                return value  # Return unchanged if malformed
            local, domain = value.split('@', 1)
            local_hash = hashlib.sha256(local.encode()).hexdigest()[:8]
            domain_hash = hashlib.sha256(domain.encode()).hexdigest()[:4]
            return f"{local_hash}.{domain_hash}@anonymized.syrf.local"
        elif transform == 'sequential':
            # Generate sequential values: User1, User2, etc.
            current = self._sequential_counters.get(field_name, 0) + 1
            self._sequential_counters[field_name] = current
            return f"User{current}"
        elif transform == 'sequential_suffix':
            # Numeric surname replacements: "Smith" → "1001", "1002", ...
            current = self._sequential_counters.get(field_name, 0) + 1
            self._sequential_counters[field_name] = current
            return str(1000 + current)
        elif transform == 'hash':
            return hashlib.sha256(str(value).encode()).hexdigest()[:16]
        return value  # Unknown transform: pass through unchanged
```
#### 2.2 Deliverables
- Anonymization rules configuration (YAML)
- Anonymizer container image
- Unit tests for each transform type
- Documentation of anonymization approach
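As a sketch of the per-transform unit tests listed above, a standalone re-implementation of `hash_email` (mirroring the conceptual code, not the final service) can be property-tested for determinism and masking:

```python
import hashlib

def hash_email(value):
    """Standalone mirror of the hash_email transform, for testing."""
    if not isinstance(value, str) or value.count('@') != 1:
        return value  # Malformed values pass through unchanged
    local, domain = value.split('@', 1)
    local_hash = hashlib.sha256(local.encode()).hexdigest()[:8]
    domain_hash = hashlib.sha256(domain.encode()).hexdigest()[:4]
    return f"{local_hash}.{domain_hash}@anonymized.syrf.local"

def test_hash_email():
    out = hash_email("user@example.com")
    # Pseudonym lands in the reserved anonymized domain
    assert out.endswith("@anonymized.syrf.local")
    # Deterministic: the same input always yields the same pseudonym
    assert out == hash_email("user@example.com")
    # Distinct inputs yield distinct pseudonyms
    assert out != hash_email("other@example.com")
    # Malformed input is returned unchanged
    assert hash_email("not-an-email") == "not-an-email"
```

A real suite would also cover `sequential`, `sequential_suffix`, and `hash`, plus round-tripping a whole document through `transform_document`.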
### Phase 3: PR Preview Integration

**Goal:** Database source policy via PR description

#### 3.1 Extend Preview Config Parser

Update `.github/workflows/pr-preview.yml` to parse database configuration:
```yaml
- name: Parse database config from preview-config
  id: db-config
  run: |
    CONFIG_B64="${{ needs.check-label.outputs.preview_config }}"
    if [ -n "$CONFIG_B64" ]; then
      CONFIG=$(echo "$CONFIG_B64" | base64 -d)
      # Extract database settings
      # NOTE: Uses mikefarah/yq v4+ syntax. Install via: brew install yq (macOS)
      # or snap install yq (Linux)
      DB_SOURCE=$(echo "$CONFIG" | yq -r '.database.source // "seed"')
      SNAPSHOT_ID=$(echo "$CONFIG" | yq -r '.database.snapshotId // "latest"')
      ANONYMIZE=$(echo "$CONFIG" | yq -r '.database.anonymize // "true"')
      # -o=json -I=0 emits compact single-line JSON suitable for job outputs
      COLLECTIONS=$(echo "$CONFIG" | yq -o=json -I=0 '.database.collections // []')
      echo "db_source=$DB_SOURCE" >> "$GITHUB_OUTPUT"
      echo "snapshot_id=$SNAPSHOT_ID" >> "$GITHUB_OUTPUT"
      echo "anonymize=$ANONYMIZE" >> "$GITHUB_OUTPUT"
      echo "collections=$COLLECTIONS" >> "$GITHUB_OUTPUT"
    else
      # Defaults
      echo "db_source=seed" >> "$GITHUB_OUTPUT"
      echo "snapshot_id=" >> "$GITHUB_OUTPUT"
      echo "anonymize=true" >> "$GITHUB_OUTPUT"
      echo "collections=[]" >> "$GITHUB_OUTPUT"
    fi
```
#### 3.2 Snapshot Restore Job

Replace/extend `db-reset-job.yaml` to support snapshot restore:
```yaml
# Generated when database.source=snapshot
apiVersion: batch/v1
kind: Job
metadata:
  name: snapshot-restore-{{ .prNumber }}
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      containers:
        - name: restore
          # TODO: Build and publish this image as part of Phase 2 implementation
          image: ghcr.io/camaradesuk/syrf-db-tools:latest
          command:
            - /bin/bash
            - -c
            - |
              # Download snapshot
              gsutil cp gs://syrf-db-snapshots/prod/{{ .snapshotId }}/dump.gz /tmp/dump.gz

              # Anonymize and restore
              # NOTE: syrf-anonymize is a custom tool to be built as part of Phase 2.
              # It streams BSON, applies anonymization rules, and pipes to mongorestore.
              syrf-anonymize \
                --input /tmp/dump.gz \
                --rules /config/anonymization-rules.yaml \
                --output - | \
              mongorestore \
                --uri="$MONGO_URI" \
                --db={{ .databaseName }} \
                --drop \
                --gzip \
                --archive=-
          env:
            - name: MONGO_URI
              valueFrom:
                secretKeyRef:
                  name: {{ .mongoSecretName }}
                  key: connection-string
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          configMap:
            name: anonymization-rules
      restartPolicy: OnFailure
```
#### 3.3 Deliverables
- Extended preview-config parser for database settings
- Conditional job generation (seed vs snapshot)
- Snapshot restore job template
- Integration with existing db-reset-job flow
- Documentation in PR template
### Phase 4: Staging Refresh

**Goal:** Periodic staging database refresh from production snapshot

#### 4.1 Manual Staging Refresh

Create a workflow for manual staging refresh:
```yaml
# .github/workflows/staging-db-refresh.yml
name: Refresh Staging Database
on:
  workflow_dispatch:
    inputs:
      snapshot_id:
        description: 'Snapshot ID (or "latest")'
        default: 'latest'
        required: true
      confirm:
        description: 'Type "refresh-staging" to confirm'
        required: true
jobs:
  refresh:
    if: inputs.confirm == 'refresh-staging'
    runs-on: ubuntu-latest
    steps:
      - name: Trigger staging restore
        run: |
          # Create ArgoCD hook job for staging restore
          # Similar to preview but targets syrf_staging database
```
#### 4.2 Deliverables
- Staging refresh workflow
- Staging-specific restore job
- Confirmation gate to prevent accidents
- Notification on completion
### Phase 5: Monitoring and Operations

**Goal:** Observability and operational tooling

#### 5.1 Metrics and Alerts
- Snapshot job success/failure rate
- Snapshot size over time
- Restore duration metrics
- Storage usage alerts
#### 5.2 Deliverables
- Prometheus metrics for snapshot jobs
- Grafana dashboard for snapshot operations
- PagerDuty alerts for snapshot failures
- Runbook for manual snapshot/restore
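As an illustration of the alerting deliverable, a failure alert for the snapshot CronJob might look like the following PrometheusRule (a sketch; it assumes the Prometheus Operator CRDs and kube-state-metrics are available in the cluster):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: snapshot-alerts
  namespace: syrf-system
spec:
  groups:
    - name: snapshot
      rules:
        - alert: SnapshotJobFailed
          # kube-state-metrics exposes per-Job failure counts
          expr: kube_job_status_failed{job_name=~"snapshot-producer.*"} > 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Production snapshot job failed"
```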
## PR Description Template

Add to PR template:
````markdown
## Preview Environment Configuration (Optional)

To customize your preview environment, add a YAML block with `#preview-config`:

```yaml
#preview-config
database:
  # Data source: seed (default), snapshot, or staging
  source: seed
  # For snapshot source:
  # snapshotId: latest   # or specific date: "2026-01-09"
  # collections:         # Optional: restore only specific collections
  #   - pmProject
  #   - pmStudy
web:
  featureFlags:
    experimentalFeature: true
```
````
**Available database sources:**
- `seed` - Sample data (fast, deterministic) - **default**
- `snapshot` - Anonymized production data (realistic volumes)
- `staging` - Clone of current staging database
## Security Considerations

### Data Protection
- Anonymization is mandatory for production snapshots in non-production
- No raw production data ever reaches preview environments
- Audit logging of all snapshot access and restores
- GCS bucket policies restrict access to authorized service accounts only
### Access Control
| Action | Who Can Do It |
|---|---|
| Create production snapshot | CronJob only (automated) |
| Download snapshot | Restore jobs with Workload Identity |
| Restore to preview | PR preview workflow (automated) |
| Restore to staging | Requires refresh-staging workflow with confirmation |
| View snapshot contents | SRE team only (manual access) |
### Compliance
- GDPR: Anonymization removes identifying information
- Research data: Scientific content preserved (not PII)
- Retention: Snapshots auto-deleted after 30 days
## Rollout Strategy

### Phase 1 (Weeks 1-2): Snapshot Infrastructure
- Create GCS bucket and IAM
- Deploy snapshot CronJob
- Verify daily snapshots are working
- Set up monitoring
### Phase 2 (Weeks 3-4): Anonymization
- Define anonymization rules
- Build and test anonymizer
- Validate anonymized data quality
- Security review of anonymization
### Phase 3 (Weeks 5-6): Preview Integration
- Extend PR preview workflow
- Add restore job template
- Test with real PRs
- Update documentation
### Phase 4 (Week 7): Staging Refresh
- Create staging refresh workflow
- Test staging restore
- Document runbook
### Phase 5 (Week 8): Polish
- Monitoring dashboard
- Alert configuration
- Team training
- Retrospective
## Success Metrics
| Metric | Current | Target |
|---|---|---|
| Production bug reproduction time | Hours (manual) | Minutes (automated) |
| Data realism in preview | Low (5 projects) | High (anonymized prod) |
| Staging data freshness | Stale/manual | Weekly automated |
| Snapshot reliability | N/A | 99.9% success rate |
## Open Questions

- **Snapshot size management:** How to handle large production databases (currently ~X GB)?
  - Option A: Full database snapshots (simple, large)
  - Option B: Incremental snapshots (complex, smaller)
  - Option C: Collection-level snapshots (flexible, medium complexity)
- **Restore time optimization:** How to speed up restore for large snapshots?
  - Option A: Pre-warm snapshot in preview namespace
  - Option B: Lazy restore (restore on first access)
  - Option C: Subset restore based on PR needs
- **Staging snapshot frequency:** Daily, weekly, or on-demand?
  - Daily: More current, more storage
  - Weekly: Good balance
  - On-demand: Manual trigger only
## Dependencies
- GCS bucket with Workload Identity access
- MongoDB Atlas operator (already deployed)
- mongodump/mongorestore tools in container images
- yq for YAML parsing in workflows
## References
- MongoDB Testing Strategy - Database isolation architecture
- PR Preview Environments - Current preview setup
- MongoDB Reference - Collection naming, CSUUID format