
Using PR Preview Environments

Purpose

This guide explains how to use PR (Pull Request) preview environments to test changes before merging to main. Preview environments are ephemeral, automatically deployed instances of the SyRF platform that match your PR's code.

What Are PR Preview Environments?

PR preview environments provide:

  • Isolated testing: Each PR gets its own complete environment
  • Automatic deployment: ArgoCD deploys your changes automatically
  • Unique URLs: Access your preview at pr-{number}.syrf.org.uk
  • Auto-cleanup: Environments are deleted when PR closes
  • Full stack: All 6 services (API, PM, Quartz, Web, Docs, User Guide) deployed together
  • GitHub Deployments: Native GitHub UI integration with clickable environment URLs

How It Works

1. Open PR (any PR to main)
2. pr-tests.yml runs automatically:
   - test-dotnet (if .NET changed) ─┬─ MUST PASS
   - test-web (if Angular changed) ─┘
3. Add 'preview' label (optional, for preview environment)
4. TWO workflows trigger in parallel:
   ├─ pr-preview.yml (Kubernetes services):
   │   - Tag-based detection: find last tag for each service
   │   - Build Docker images for changed services
   │   - Write version files to cluster-gitops
   │   - ArgoCD deploys to pr-{number} namespace
   │   - Updates PR description with K8s status
   └─ pr-preview-lambda.yml (S3 Notifier Lambda):
       - Build Lambda package from s3-notifier code
       - Deploy syrfAppUploadS3Notifier-pr-{number} to AWS
       - Configure S3 trigger for preview/pr-{number}/ prefix
       - Updates PR description with Lambda status
5. ArgoCD ApplicationSet detects version files
6. ArgoCD creates namespace: pr-{number}
7. ArgoCD deploys all services to preview namespace
8. Preview URLs and file upload processing available within 5 minutes
9. PR description shows unified status table at top
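Given a PR number, the preview endpoints from the steps above are deterministic. A quick shell sketch (PR 42 and the `gh` label command are illustrative, not from the workflow itself):

```shell
# Hypothetical example: derive the preview endpoints for PR #42.
# The preview label can also be added from the CLI, e.g.:
#   gh pr edit 42 --add-label preview
PR=42

echo "Web:       https://pr-${PR}.syrf.org.uk"
echo "API:       https://api.pr-${PR}.syrf.org.uk"
echo "S3 prefix: preview/pr-${PR}/"
```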

PR Description Status

Preview status is displayed directly in the PR description (not comments) so it stays visible at the top of the PR. Both workflows update the same status table:

| Component | Status |
|---|---|
| S3 Notifier Lambda | 0.1.5 |
| K8s Services | ✅ Ready |

Preview URLs (once ArgoCD syncs):

  • 🌐 Web: https://pr-{number}.syrf.org.uk
  • 🔌 API: https://api.pr-{number}.syrf.org.uk
  • 📁 S3 Prefix: preview/pr-{number}/

This approach:

  • Keeps status always visible (not buried in comments)
  • Shows both Lambda and K8s status in one place
  • Includes version numbers and links to workflow runs

GitHub Deployments

Preview environments are tracked via the GitHub Deployments API, providing native integration with GitHub's UI:

Where to Find Deployments:

  • PR Sidebar: Look for "Environments" section with clickable pr-{number} link
  • Repository Deployments: Navigate to repository → Deployments tab
  • Commit Status: Deployment status appears on commits in the PR

Deployment Status Flow:

```
pending → in_progress → queued → success
                              ↘ failure
```

| Status | Description | Trigger |
|---|---|---|
| pending | Deployment created, waiting to start | PR preview workflow starts |
| in_progress | Building Docker images | Build job begins |
| queued | Pushed to GitOps, waiting for ArgoCD | After cluster-gitops push |
| success | ArgoCD sync complete, preview live | ArgoCD PostSync hook |
| failure | Build or deployment failed | Workflow failure |
| inactive | Environment cleaned up | PR closed/label removed |

Benefits:

  • Click environment URL directly from PR sidebar
  • See deployment history for the PR
  • Track deployment state changes
  • Automatic cleanup marking when PR closes

Fork PR Limitation: PRs from forked repositories don't create GitHub Deployments due to token permission restrictions. The preview environment still deploys normally, but without the GitHub Deployment tracking.

Why Two Workflows?

The preview environment needs both Kubernetes services AND a Lambda function to work properly:

| Component | Workflow | Purpose |
|---|---|---|
| API, PM, Quartz, Web | pr-preview.yml | Application logic, UI, background jobs |
| S3 Notifier Lambda | pr-preview-lambda.yml | File upload notifications to RabbitMQ |

Without the Lambda, file uploads in preview environments wouldn't trigger the study import process. The Lambda listens to S3 events on the preview/pr-{number}/ prefix and publishes messages to RabbitMQ.
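The prefix scoping can be illustrated with a small shell sketch (the key, project ID, and PR number are hypothetical):

```shell
# Hypothetical example: only keys under this PR's prefix reach its Lambda.
PR=42
key="preview/pr-${PR}/Projects/abc123/upload.csv"

case "$key" in
  "preview/pr-${PR}/"*) result="handled by syrfAppUploadS3Notifier-pr-${PR}" ;;
  *)                    result="ignored by this PR's Lambda" ;;
esac
echo "$result"
```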

Automated Testing

All PRs run automated tests via the pr-tests.yml workflow, regardless of whether they have a preview environment.

Test Workflow

| Job | Trigger | Timeout | What it tests |
|---|---|---|---|
| test-dotnet | .NET code changed | 10 min | xUnit tests for API, PM, Quartz, S3 Notifier |
| test-web | Angular code changed | 5 min | Vitest tests with coverage thresholds |

Both jobs run in parallel after change detection.

Coverage Requirements

Angular tests enforce minimum coverage thresholds:

| Metric | Threshold |
|---|---|
| Statements | 50% |
| Branches | 40% |
| Functions | 50% |
| Lines | 50% |

Tests fail if coverage drops below these thresholds.

Test Results

Test results are uploaded as GitHub Actions artifacts:

  • dotnet-test-results: TRX test result files
  • dotnet-coverage: Cobertura XML coverage
  • web-coverage: lcov, cobertura, text coverage reports
  • web-test-results: JUnit XML test results

Code Quality (SonarCloud)

Coverage and code quality are analyzed by SonarCloud, which provides:

  • Quality Gate: Pass/fail status based on coverage, bugs, and code smells
  • PR Decoration: Inline comments on issues in changed code
  • Coverage Report: Line-by-line coverage visualization
  • Security Analysis: Detection of vulnerabilities and hotspots

See How to: Run Tests for local testing and SonarCloud setup.

Version File Structure (in cluster-gitops):

```
syrf/environments/preview/
  services/                       # Service defaults (all previews)
    api/
      config.yaml                 # hostPrefix, imageRepo
      values.yaml                 # Default Helm values
    web/
      config.yaml
      values.yaml
    ...
  pr-123/
    pr.yaml                       # PR trigger file (prNumber, headSha, branch)
    namespace.yaml                # Kubernetes namespace manifest
    services/                     # PR-specific values (one file per service)
      api.values.yaml             # image.tag from git tag or new build
      project-management.values.yaml
      quartz.values.yaml
      web.values.yaml
      docs.values.yaml
      user-guide.values.yaml
```
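As a sketch, the per-PR files might look like this. Only the fields named above (prNumber, headSha, branch, image.tag) are from the source; the concrete values are illustrative:

```yaml
# pr-123/pr.yaml (illustrative values)
prNumber: 123
headSha: 0123abc
branch: feature/my-awesome-feature
---
# pr-123/services/api.values.yaml (illustrative values)
image:
  tag: pr-123
```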

Prerequisites

  • Open pull request in syrf repository
  • Changes to at least one service
  • GitHub Actions workflows enabled
  • ArgoCD installed and configured (cluster requirement)

Optional: GCP Secrets for RabbitMQ Cleanup

When a PR is closed, the cleanup job can delete the PR's RabbitMQ vhost to free resources. This requires GCP credentials to access the GKE cluster.

Required GitHub Secrets (optional but recommended):

| Secret | Description | Format |
|---|---|---|
| GCP_WORKLOAD_IDENTITY_PROVIDER | Workload Identity Federation provider | projects/{project-number}/locations/global/workloadIdentityPools/{pool}/providers/{provider} |
| GCP_SERVICE_ACCOUNT | GCP service account email | {sa-name}@{project}.iam.gserviceaccount.com |

If not configured:

  • RabbitMQ vhost cleanup is skipped (not a failure)
  • Orphan vhosts accumulate until manually cleaned
  • All other cleanup (Lambda, K8s namespace) still works

To configure (requires GCP admin):

  1. Create a Workload Identity Pool for GitHub Actions
  2. Create a service account with container.developer role
  3. Add the secrets to GitHub repository settings

See GCP Workload Identity Federation documentation for setup details.
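In the workflow, these two secrets are typically consumed by Google's official auth action. A hedged sketch (the step name and action version are assumptions, not taken from the SyRF workflows):

```yaml
# Sketch of a workflow step using the two secrets
- name: Authenticate to GCP via Workload Identity Federation
  uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
    service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}
```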

Creating a Preview Environment

Step 1: Create Your PR

```shell
# Create feature branch
git checkout -b feature/my-awesome-feature

# Make changes
# ... edit files ...

# Commit and push
git add .
git commit -m "feat(api): add awesome new feature"
git push origin feature/my-awesome-feature
```

Step 2: Open Pull Request

  1. Go to GitHub repository
  2. Click "Compare & pull request"
  3. Fill in PR title and description
  4. Add the preview label to the PR
  5. Click "Create pull request"

Step 3: Wait for Build

The pr-preview.yml workflow will automatically:

  • ✅ Detect changed services
  • ✅ Build Docker images with pr-{number} tag
  • ✅ Push images to GHCR
  • ✅ Comment on PR with preview info

Build time: ~5-10 minutes depending on changed services

Step 4: ArgoCD Deploys

ArgoCD will automatically:

  • ✅ Detect the PR (checks every 5 minutes)
  • ✅ Create namespace pr-{number}
  • ✅ Deploy all changed services
  • ✅ Configure ingress with preview URLs
  • ✅ Set up TLS certificates

Deployment time: ~5-10 minutes after build completes

Step 5: Access Your Preview

Once deployed, access your preview at:

  • Web UI: https://pr-{number}.syrf.org.uk
  • API: https://api.pr-{number}.syrf.org.uk
  • PM Service: https://project-management.pr-{number}.syrf.org.uk
  • Docs: https://docs.pr-{number}.syrf.org.uk
  • User Guide: https://help.pr-{number}.syrf.org.uk

Replace {number} with your PR number (e.g., PR #42 → https://pr-42.syrf.org.uk)

Managing Preview Environments

Updating Your Preview

Push new commits to your PR branch:

```shell
# Make more changes
git add .
git commit -m "fix(api): address review feedback"
git push
```

The preview workflow will automatically:

  1. Rebuild changed images
  2. Push new images with same pr-{number} tag
  3. ArgoCD detects new image
  4. ArgoCD syncs updated deployment

Update time: ~10-15 minutes total

Checking Preview Status

GitHub Actions:

  1. Go to PR → "Checks" tab
  2. View "PR Preview Build" workflow
  3. Check which services were built

ArgoCD UI (if accessible):

  1. Open ArgoCD dashboard
  2. Look for application syrf-pr-{number}
  3. View sync status and health

Kubernetes (if you have kubectl access):

```shell
# List preview namespaces
kubectl get namespaces | grep pr-

# Check pods in your preview
kubectl get pods -n pr-{number}

# View preview ingresses
kubectl get ingress -n pr-{number}
```

Disabling Preview

Remove the preview label from your PR:

  1. Go to PR on GitHub
  2. Click "Labels" gear icon
  3. Uncheck "preview"

Both workflows then trigger cleanup automatically:

  • Lambda function deleted via Terraform
  • K8s namespace deleted via ArgoCD
  • PR description updated with cleanup status
  • S3 files preserved for debugging

Deleting Preview

Preview environments are automatically deleted when:

  • PR is closed (merged or closed without merging)
  • PR is converted to draft
  • preview label is removed (triggers unlabeled event)

Cleanup includes:

  • Lambda function deletion
  • K8s namespace and resources deletion
  • Git tags cleanup
  • PR description update showing cleanup reason

Preserved for debugging:

  • S3 uploaded files in preview/pr-{number}/ prefix

Cleanup time: Usually complete within 2-3 minutes

Cleanup Architecture

Understanding how preview cleanup works helps diagnose issues when cleanup fails.

ArgoCD Hook Lifecycle

Preview environments use ArgoCD PreSync hooks for database reset operations. These hooks have finalizers that control when resources can be deleted:

| Policy | When Finalizer Removed | Use Case |
|---|---|---|
| BeforeHookCreation | On next sync only | Resources that must persist across syncs |
| HookSucceeded | Immediately after success | Ephemeral resources (db-reset job) |
| HookFailed | Immediately after failure | Ephemeral resources (db-reset job) |

The db-reset resources use HookSucceeded,HookFailed so they clean up immediately after completion.
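In manifest terms, the db-reset Job carries ArgoCD hook annotations along these lines (a sketch: the Job name is illustrative, following the job/db-reset-{N}-{sha} pattern, and the spec is omitted):

```yaml
# Sketch: hook annotations on the db-reset Job (name illustrative, spec omitted)
apiVersion: batch/v1
kind: Job
metadata:
  name: db-reset-123-0123abc
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded,HookFailed
```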

Cleanup Flow

```
PR closed / preview label removed
        ↓
cleanup-tags job starts
        ↓
1. Pre-cleanup: remove ArgoCD hook finalizers
   (prevents deadlock if hooks are still running)
2. Delete PR version files from cluster-gitops
3. ArgoCD detects missing files
4. ArgoCD deletes Application
5. Namespace and all resources deleted
```

Why Pre-Cleanup Matters

If git files are deleted before hooks complete, ArgoCD loses visibility of the hook annotations (they were in git). Without knowing the hook-delete-policy, ArgoCD doesn't know how to handle the finalizers, causing resources to get stuck in Terminating state indefinitely.

The pre-cleanup step removes argocd.argoproj.io/hook-finalizer from all hook resources before deleting git files, ensuring clean deletion regardless of hook state.

Resources with Hook Finalizers

| Resource | Finalizer Source | Cleanup Behavior |
|---|---|---|
| job/db-reset-{N}-{sha} | ArgoCD PreSync hook | Deleted immediately after completion |
| serviceaccount/db-reset-sa | ArgoCD PreSync hook | Deleted immediately after completion |
| role/db-reset-marker-role | ArgoCD PreSync hook | Deleted immediately after completion |
| rolebinding/db-reset-marker-binding | ArgoCD PreSync hook | Deleted immediately after completion |
| externalsecret/atlas-operator-api-key | ArgoCD PreSync hook | Persists until Application deletion |
| secret/mongodb-pr-password | ArgoCD PreSync hook | Persists until Application deletion |
| atlasdatabaseuser/pr-user | ArgoCD PreSync hook | Persists until Application deletion |

Troubleshooting

Preview Not Building

Symptom: No "PR Preview Build" workflow run

Causes:

  1. PR doesn't have preview label
  2. No changes to service code (only docs/config changed)
  3. Workflow file syntax error

Fix:

  1. Add preview label to PR
  2. Make a small change to service code
  3. Check .github/workflows/pr-preview.yml syntax

Build Fails

Symptom: Red X on "PR Preview Build" workflow

Causes:

  1. Docker build errors
  2. TypeScript/compile errors
  3. Missing dependencies

Fix:

  1. Click on failed workflow run
  2. Review job logs
  3. Fix errors in your branch
  4. Push fix to trigger rebuild

Preview Not Deploying

Symptom: Build succeeds but no preview URLs work

Causes:

  1. ArgoCD not installed yet (cluster not ready)
  2. ApplicationSet not configured
  3. GHCR image pull issues
  4. DNS not configured

Fix (requires cluster access):

  1. Check ArgoCD application exists: kubectl get application -n argocd | grep pr-{number}
  2. Check ApplicationSet: kubectl get applicationset -n argocd syrf-preview
  3. View ArgoCD logs: kubectl logs -n argocd -l app.kubernetes.io/name=argocd-applicationset-controller

404 Not Found on Preview URL

Symptom: Preview URL returns 404

Causes:

  1. Ingress not created yet (DNS propagation)
  2. Certificate not ready
  3. Service not healthy

Fix:

  1. Wait 5-10 minutes for DNS and cert
  2. Check service health: kubectl get pods -n pr-{number}
  3. Check ingress: kubectl get ingress -n pr-{number}

Preview Shows Old Code

Symptom: Preview doesn't reflect latest changes

Causes:

  1. Image tag not updated (cached)
  2. ArgoCD not synced yet
  3. Browser caching old assets

Fix:

  1. Check image tag in deployment: kubectl describe deployment -n pr-{number}
  2. Force ArgoCD sync (if accessible)
  3. Hard refresh browser (Ctrl+Shift+R)

Namespace Stuck in Terminating State

Symptom: After PR closes, namespace shows Terminating status but doesn't delete. ArgoCD Application shows Unknown sync status.

Diagnosis:

```shell
# Check namespace status
kubectl get namespace pr-{number}

# Check for resources with finalizers blocking deletion
kubectl api-resources --verbs=list --namespaced -o name | \
  xargs -I {} kubectl get {} -n pr-{number} \
  -o custom-columns='KIND:.kind,NAME:.metadata.name,FINALIZERS:.metadata.finalizers' \
  --ignore-not-found 2>/dev/null | grep -v "<none>"

# Check ArgoCD Application
kubectl get application pr-{number}-namespace -n argocd -o yaml | \
  grep -A5 "deletionTimestamp\|finalizers"
```

Common Causes:

  1. ArgoCD hook finalizer stuck on Job - Most common cause
  2. External Secrets Operator waiting - ESO finalizers blocking deletion
  3. Atlas Database User finalizer - MongoDB operator cleanup pending

Manual Fix:

```shell
# Remove finalizer from stuck Job (find the exact name first)
kubectl get jobs -n pr-{number}
kubectl patch job {job-name} -n pr-{number} --type=merge \
  -p '{"metadata":{"finalizers":null}}'

# If RBAC resources are stuck
kubectl patch serviceaccount db-reset-sa -n pr-{number} --type=merge \
  -p '{"metadata":{"finalizers":null}}'
kubectl patch role db-reset-marker-role -n pr-{number} --type=merge \
  -p '{"metadata":{"finalizers":null}}'
kubectl patch rolebinding db-reset-marker-binding -n pr-{number} --type=merge \
  -p '{"metadata":{"finalizers":null}}'

# If MongoDB resources are stuck
kubectl patch externalsecret atlas-operator-api-key -n pr-{number} --type=merge \
  -p '{"metadata":{"finalizers":null}}'
kubectl patch secret mongodb-pr-password -n pr-{number} --type=merge \
  -p '{"metadata":{"finalizers":null}}'
kubectl patch atlasdatabaseuser pr-user -n pr-{number} --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```

Prevention: This issue is prevented by the pre-cleanup step in the workflow. If you see this issue, the pre-cleanup step may have failed or GCP credentials weren't configured.

ArgoCD Application Won't Delete

Symptom: Application shows Unknown sync status and has deletionTimestamp set but won't complete deletion.

Causes:

  1. Resources with finalizers in the namespace blocking deletion
  2. ArgoCD controller unable to access the namespace
  3. Custom resource definitions (CRDs) blocking deletion

Fix:

```shell
# First, fix any stuck resources in the namespace (see above)

# If Application still won't delete, check its finalizers
kubectl get application pr-{number}-namespace -n argocd \
  -o jsonpath='{.metadata.finalizers}'

# Remove ArgoCD resource finalizer (last resort)
kubectl patch application pr-{number}-namespace -n argocd --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```

Warning: Removing the Application finalizer skips ArgoCD's cascade deletion. Only do this after manually cleaning up namespace resources.

Best Practices

When to Use Previews

Good use cases:

  • ✅ Testing new features before review
  • ✅ Validating bug fixes with real data
  • ✅ Demonstrating changes to stakeholders
  • ✅ QA testing before merge
  • ✅ Integration testing across services

Avoid for:

  • ❌ Every single PR (resource intensive)
  • ❌ Documentation-only changes
  • ❌ Config-only changes
  • ❌ Very small typo fixes

Resource Limits

Preview environments have reduced resources compared to staging:

| Service | Memory Limit | CPU Limit | Replicas |
|---|---|---|---|
| API | 256Mi | 250m | 1 |
| PM | 256Mi | 250m | 1 |
| Quartz | 256Mi | 250m | 1 |
| Web | 128Mi | 100m | 1 |

Implications:

  • Slower response times than production
  • Not suitable for load testing
  • May hit memory/CPU limits under heavy use

Testing Checklist

Before merging, verify in your preview:

  • Application starts without errors
  • Core functionality works
  • API endpoints respond correctly
  • UI renders properly
  • No console errors in browser
  • Authentication works (if applicable)
  • Database operations succeed

Cleanup

Always remove the preview label or close your PR when done testing to free up cluster resources.

Preview Environment Details

Namespace Structure

Each preview gets its own Kubernetes namespace:

```
pr-{number}/
├── deployments/
│   ├── syrf-api
│   ├── syrf-pm
│   ├── syrf-quartz
│   └── syrf-web
├── services/
│   ├── syrf-api
│   ├── syrf-pm
│   ├── syrf-quartz
│   └── syrf-web
├── ingresses/
│   ├── pr-{number}-api
│   ├── pr-{number}-pm
│   └── pr-{number}-web
├── configmaps/
│   └── preview-info
└── secrets/
    └── (TLS certificates)
```

Angular Development Build

Preview environments build the Angular app with --configuration development instead of production. This provides:

  • Verbose error messages: Full stack traces with component context
  • Development mode checks: Extra change detection cycles to catch issues
  • Source maps included: Debug TypeScript directly in browser DevTools
  • Named chunks: Easier to identify modules when debugging

Trade-off: Bundle size is larger, but this is acceptable for preview testing.

devMode Feature Flag

Preview environments have the devMode feature flag enabled by default. This can be used in the app to:

  • Enable additional console logging
  • Show debug panels/overlays
  • Enable verbose API request logging
  • Show feature flag debugging UI
  • Enable performance profiling helpers

Access via the NgRx selector selectDevMode

Environment Variables

Preview services run with:

  • ASPNETCORE_ENVIRONMENT=Preview (.NET services)
  • ENVIRONMENT=preview (Web service)
  • REMOVE_SOURCEMAPS=false (Web keeps sourcemaps for debugging)
  • SYRF__FeatureFlags__DevMode=true (Web devMode enabled)
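In a Deployment manifest, the Web service's variables above translate to container env entries roughly like this (a sketch; only the variables listed above come from the source):

```yaml
# Sketch: container env for the Web service in a preview Deployment
env:
  - name: ENVIRONMENT
    value: "preview"
  - name: REMOVE_SOURCEMAPS
    value: "false"
  - name: SYRF__FeatureFlags__DevMode
    value: "true"
```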

Data Isolation

Important: Preview environments share the same backend resources as staging:

  • Same MongoDB instance
  • Same RabbitMQ instance
  • Same S3 buckets

Be careful:

  • Use test data only
  • Don't delete production-like data
  • Consider data namespace isolation in code

Customizing Your Preview

Feature Flags via PR Description

You can enable or disable feature flags for your preview environment by adding a config block to your PR description.

Format: Add a YAML code block starting with #preview-config:

```yaml
#preview-config
web:
  featureFlags:
    experimentalFeature: true
    newDashboard: true
```

Rules:

  1. The YAML block MUST start with #preview-config on the first line
  2. Top-level keys are service names (api, web, project-management, quartz, docs, user-guide)
  3. Only featureFlags are allowed - other settings are ignored for security
  4. The config is parsed when the preview workflow runs

Example - Enable multiple flags:

```yaml
#preview-config
web:
  featureFlags:
    newScreeningOverview: true
    newStageOverview: true
api:
  featureFlags:
    enableBetaEndpoints: true
```
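A rough sketch of how a workflow step could pull the config block out of the PR body; the actual parsing step in pr-preview.yml may differ:

```shell
# Hypothetical sketch: extract the "#preview-config" YAML block from a PR body.
fence='```'
body="Some PR description text.

${fence}yaml
#preview-config
web:
  featureFlags:
    newDashboard: true
${fence}

More text after the block."

config=$(printf '%s\n' "$body" | awk -v f="$fence" '
  $0 == f "yaml" { infence = 1; next }                  # opening yaml fence
  infence && $0 == "#preview-config" { cfg = 1; next }  # required marker line
  $0 == f { infence = 0; cfg = 0 }                      # closing fence
  cfg { print }')

echo "$config"
```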

What You Can and Cannot Customize

| Setting | How to Customize |
|---|---|
| Feature flags | PR description (any PR creator) |
| Resources (memory, CPU) | Requires cluster-gitops write access |
| Logging level | Requires cluster-gitops write access |
| Replica count | Requires cluster-gitops write access |

Why this restriction? Locking down resource settings prevents accidental cost increases and ensures fair resource sharing across previews.

Service-Specific Defaults

Platform operators can set service-specific defaults for all previews in:

cluster-gitops/syrf/environments/preview/services/{service}/values.yaml

These values apply to all preview environments and can set things like:

  • Default feature flags for previews
  • Resource limits for specific services
  • Logging configuration

S3 Notifier Lambda (File Upload Processing)

Each preview environment gets its own AWS Lambda function to handle file uploads.

How It Works

```
User uploads file → S3 bucket → S3 Event Notification
        ↓
Lambda: syrfAppUploadS3Notifier-pr-{number}
        ↓
RabbitMQ message → Project Management service
        ↓
Study import processing begins
```

Lambda Details

| Aspect | Value |
|---|---|
| Function Name | syrfAppUploadS3Notifier-pr-{number} |
| S3 Prefix | preview/pr-{number}/ |
| Runtime | .NET 10 (linux-x64) |
| Managed By | Terraform (in camarades-infrastructure/) |

S3 Key Format

Files uploaded in preview environments use a special prefix:

```
s3://syrfapp-uploads/preview/pr-{number}/Projects/{projectId}/...
```

This ensures:

  • Preview uploads don't interfere with production data
  • Each PR's Lambda only processes its own files
  • Files are isolated per PR for debugging purposes

Change Detection

The Lambda workflow uses smart change detection to avoid unnecessary deployments:

  1. On synchronize (new commits), checks if s3-notifier code changed
  2. Compares current commit against last deployed SHA (stored in Lambda tags)
  3. Only deploys if files in src/services/s3-notifier/ changed since last deploy
  4. First deploy for a PR always runs (no previous SHA)

This prevents rebuilding the Lambda on every push when only K8s services changed.

Cleanup

When a PR is closed or the preview label is removed, pr-preview-lambda.yml automatically:

  1. Preserves S3 files in preview/pr-{number}/ (for debugging)
  2. Removes the Lambda function via Terraform
  3. Deletes the Lambda package from S3 state bucket
  4. Updates PR description with cleanup status

Note: S3 files are intentionally preserved after cleanup to help debug any issues. They can be manually deleted if needed.

PR Labels Reference

Preview environments are controlled by GitHub labels. Each label has specific effects on the workflow behavior.

Label Effects Matrix

| Label | Effect When Added (PR labeled) | Effect When Removed (PR unlabeled) |
|---|---|---|
| preview | Triggers full preview build/deploy | Triggers complete cleanup (namespace, Lambda, MongoDB user, tags) |
| persist-db | Skips database reset on subsequent syncs (preserves data) | Re-enables database reset; runs db-reset job on next sync |
| use-snapshot | Uses production snapshot data instead of seed data | Reverts to seed data |

Label: preview

Purpose: Primary control for preview environment lifecycle.

When Added:

1. Workflow triggers on 'labeled' event
2. check-should-run job validates preview should be created
3. All service images built with pr-{number} tag
4. Version files written to cluster-gitops
5. ArgoCD detects files, creates namespace and Application
6. Lambda function deployed for S3 file notifications
7. GitHub Deployment created for tracking

When Removed:

1. Workflow triggers on 'unlabeled' event
2. cleanup-tags job executes
3. Pre-cleanup removes ArgoCD hook finalizers (prevents deadlock)
4. MongoDB database user deleted
5. Quartz SQL schema dropped
6. Git tags cleaned up
7. Version files deleted from cluster-gitops
8. ArgoCD detects missing files, deletes Application
9. Lambda function destroyed via Terraform
10. RabbitMQ vhost deleted (if GCP credentials configured)
11. GitHub Deployment marked inactive

Timing: Label changes are processed immediately when the workflow runs.

Label: persist-db

Purpose: Prevents database reset between preview environment syncs.

Behavior:

| persist-db Label State | Database Reset Job | Use Case |
|---|---|---|
| Not present (default) | Runs on every sync | Clean slate for each push |
| Present | Skipped | Preserve test data across commits |

When Added:

  • db-reset-job.yaml file is deleted from cluster-gitops
  • ArgoCD no longer runs PreSync hook to reset database
  • Existing data in syrf_pr_{number} is preserved

When Removed:

  • db-reset-job.yaml file is re-created in cluster-gitops
  • Database reset runs on next ArgoCD sync
  • All data in preview database is replaced with seed data

Important: Adding persist-db does NOT retroactively restore deleted data. It only prevents future resets.

Label: use-snapshot

Purpose: Initialize preview database with production snapshot data instead of empty/seed data.

Current Status: ✅ Fully Implemented - See Data Snapshot Automation for architecture details.

Behavior:

| use-snapshot Label State | Data Source | Database Initialization |
|---|---|---|
| Not present (default) | Empty database | Services create data as needed |
| Present | Production snapshot (syrf_snapshot) | Snapshot restore job copies collections |

How it works:

  1. MongoDB User Updated: PR user gets additional read role on syrf_snapshot database
  2. Snapshot Restore Job Created: PreSync hook job copies 11 collections from syrf_snapshot to syrf_pr_{number}
  3. db-reset Job Skipped: When use-snapshot=true, the database reset job is NOT generated (snapshot-restore handles initialization)
  4. Idempotency: Completion marker (ConfigMap) prevents duplicate restores on manual ArgoCD syncs

Collections Copied (via $out aggregation):

  • pmProject, pmStudy, pmInvestigator, pmSystematicSearch
  • pmDataExportJob, pmStudyCorrection, pmInvestigatorUsage
  • pmRiskOfBiasAiJob, pmProjectDailyStat, pmPotential, pmInvestigatorEmail

Security: Kyverno policy enforces that PR users can ONLY have read access (not readWrite) on syrf_snapshot. See Kyverno Security Policy below.

When Added:

1. Workflow detects 'use-snapshot' label
2. AtlasDatabaseUser gets additional role:
   - roleName: read
   - databaseName: syrf_snapshot
3. snapshot-restore-job.yaml generated instead of db-reset-job.yaml
4. ArgoCD runs PreSync hook to copy data
5. Services start with production-like data

When Removed:

1. AtlasDatabaseUser loses syrf_snapshot read role
2. snapshot-restore-job.yaml removed
3. On next sync with service changes, db-reset-job.yaml may be generated
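With the label present, the generated user resource carries both roles, roughly as follows. This is a sketch: the role entries and names come from the source, but the spec is abbreviated (projectRef, password secret, and scopes are omitted):

```yaml
# Sketch: AtlasDatabaseUser for PR 123 with the use-snapshot label (abbreviated)
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  name: pr-user
spec:
  username: pr-123-user
  roles:
    - roleName: dbOwner
      databaseName: syrf_pr_123
    - roleName: read          # added by the use-snapshot label
      databaseName: syrf_snapshot
```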

Database Isolation

Per-Environment Databases

Each environment type has its own database isolation strategy:

| Environment | Database Name | Isolation Level | Data Source |
|---|---|---|---|
| Production | syrftest | Full (dedicated) | Live data |
| Staging | syrftest | ⚠️ Shared with production | Production data |
| Preview | syrf_pr_{number} | Full (per-PR) | Seed data |

⚠️ CRITICAL: Staging currently shares the production database (syrftest). This is a known issue documented in the MongoDB Testing Strategy.

Preview Database Lifecycle

```
PR opens + preview label added
1. MongoDB Atlas database user created: pr-{number}-user
2. Database syrf_pr_{number} created (on first write)
3. DatabaseSeeder runs, populates with sample data
4. Preview services connect using pr-{number}-user credentials
   [PR active - multiple syncs may occur]
5. PR closes OR preview label removed
6. Database user deleted (pr-{number}-user)
7. Database syrf_pr_{number} becomes orphaned
8. Manual cleanup required for orphan databases
```

MongoDB Atlas User Permissions

Preview environments use dedicated MongoDB Atlas database users with scoped permissions:

| User | Database Access | Role |
|---|---|---|
| pr-{number}-user | syrf_pr_{number} only | dbOwner |

dbOwner role provides:

  • Read/write access to all collections
  • Create/drop collections
  • Create/drop indexes
  • Run aggregation pipelines

Cleanup note: Users created before the dbOwner role update may have insufficient permissions for some cleanup operations.

Quartz SQL Schema Isolation

The Quartz service (background jobs) uses SQL Server with per-environment schema isolation:

| Environment | Schema Name | Isolation |
|---|---|---|
| Production | [production] | Dedicated |
| Staging | [staging] | Dedicated |
| Preview | [preview_{number}] | Per-PR |

Cleanup: When a PR closes, the cleanup-tags job drops the [preview_{number}] schema.

RabbitMQ Vhost Isolation

Each preview environment gets its own RabbitMQ virtual host:

| Environment | Vhost Name |
|---|---|
| Production | / (default) |
| Staging | staging |
| Preview | pr-{number} |

Cleanup requirement: Requires GCP credentials to access the GKE cluster and delete vhosts.

Edge Cases and Known Issues

Fork PR Limitations

Issue: PRs from forked repositories cannot create GitHub Deployments.

Cause: GITHUB_TOKEN in fork PRs has restricted permissions and cannot create deployments on the upstream repository.

Impact:

  • Preview environment deploys normally
  • GitHub Deployments UI shows no environment for the PR
  • Users must manually check ArgoCD or workflow logs for status

Workaround: None. This is a GitHub security limitation.

ArgoCD Hook Finalizer Deadlock

Issue: Namespace gets stuck in Terminating state indefinitely.

Scenario:

1. PR closes while db-reset job is running
2. Git files deleted from cluster-gitops
3. ArgoCD loses visibility of hook annotations (they were in git)
4. ArgoCD doesn't know hook-delete-policy
5. Resources with finalizers block namespace deletion

Prevention: The cleanup-tags job runs a pre-cleanup step that removes argocd.argoproj.io/hook-finalizer from all resources BEFORE deleting git files.

Manual Fix (if pre-cleanup fails):

```shell
# Remove finalizers from stuck jobs
kubectl get jobs -n pr-{number}
kubectl patch job {job-name} -n pr-{number} --type=merge \
  -p '{"metadata":{"finalizers":null}}'

# Remove from RBAC resources
kubectl patch serviceaccount db-reset-sa -n pr-{number} --type=merge \
  -p '{"metadata":{"finalizers":null}}'
kubectl patch role db-reset-marker-role -n pr-{number} --type=merge \
  -p '{"metadata":{"finalizers":null}}'
kubectl patch rolebinding db-reset-marker-binding -n pr-{number} --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```

MongoDB Cleanup Failures for Older PRs

Issue: MongoDB database user cleanup may fail silently for PRs created before the dbOwner role change.

Cause: Users created with older roles may not have dbOwner permissions required for some cleanup operations.

Impact: Orphan database users may remain in MongoDB Atlas.

Resolution: Manual cleanup via MongoDB Atlas UI or CLI.

Race Condition: Closed PR and Build

Issue: Build job could recreate files that cleanup job deleted.

Scenario:

1. PR closes
2. cleanup-tags job starts, deletes files
3. build-images job (already running) finishes, writes new files
4. Files recreated after cleanup

Prevention: The workflow checks github.event.action == 'closed' at the start of check-should-run job and immediately exits if true. This ensures build jobs don't run for closed PRs.

Code reference (pr-preview.yml:70-77):

# Skip build if PR is closed
if [ "${{ github.event.action }}" == "closed" ]; then
  echo "result=false" >> "$GITHUB_OUTPUT"
  echo "skip_reason=PR is closed" >> "$GITHUB_OUTPUT"
  exit 0
fi

GCP Credentials Not Configured

Issue: Some cleanup steps fail silently when GCP credentials are not configured.

Affected Operations:

| Operation | Without GCP Credentials |
|---|---|
| RabbitMQ vhost deletion | Skipped |
| ArgoCD finalizer pre-cleanup | Skipped |
| Quartz SQL schema cleanup | Skipped |

Impact: Orphan resources accumulate until manually cleaned.

Required Secrets:

  • GCP_WORKLOAD_IDENTITY_PROVIDER
  • GCP_SERVICE_ACCOUNT

Tag-Based Change Detection Edge Cases

Issue: Services may not rebuild when expected due to tag-based detection.

How it works:

1. Find last git tag for service (e.g., api-v1.2.3)
2. Compare current commit against tagged commit
3. If files in service path changed → rebuild
4. If no changes → reuse existing image

Edge cases:

| Scenario | Behavior |
|---|---|
| First PR ever (no tags) | Uses base branch comparison |
| Service has no tags | Always rebuilds |
| Tag deleted manually | May cause unexpected rebuilds |
| Shared library changed | Detected via DEPENDENCY-MAP.yaml |
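The edge cases above follow from one decision rule, which can be sketched as a small shell function. This is illustrative only: `should_rebuild` and its inputs are hypothetical, and the real logic lives in pr-preview.yml.

```shell
# Sketch of tag-based rebuild detection for one service (assumed logic).
# $1 = last tag found for the service ("" if none),
# $2 = "changed"/"unchanged" result of diffing the service path.
should_rebuild() {
  last_tag="$1"
  diff_status="$2"
  if [ -z "$last_tag" ]; then
    echo "rebuild"          # no tag yet: always rebuild
  elif [ "$diff_status" = "changed" ]; then
    echo "rebuild"          # files under the service path changed
  else
    echo "reuse"            # unchanged: reuse the existing image
  fi
}

# In the real workflow the inputs would come from something like:
#   last_tag=$(git tag --list 'api-v*' --sort=-v:refname | head -n1)
#   git diff --quiet "$last_tag"..HEAD -- services/api/ && d=unchanged || d=changed
```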

PR Description Parsing Failures

Issue: Malformed YAML in PR description can cause config parsing to fail silently.

Scenarios:

  • Invalid YAML syntax → defaults used
  • Missing #preview-config marker → block ignored
  • Unsupported settings → ignored (security feature)

Debug: Check workflow logs for "Parse PR description for preview config" step.
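As a rough sketch of the extraction step (the marker handling here is assumed, not the workflow's actual parser), pulling the config lines out of a PR description could look like:

```shell
# Hypothetical: print the lines between a '#preview-config' marker and the
# next blank line in a PR description. The real block format may differ.
extract_preview_config() {
  printf '%s\n' "$1" | awk '/#preview-config/ {found=1; next}
                            found && /^$/    {exit}
                            found'
}
```

If the marker is absent, the function prints nothing, matching the "block ignored" behavior above.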

Lambda S3 Prefix Routing

Issue: File uploads must use correct S3 prefix or they won't trigger the Lambda.

Expected prefix: preview/pr-{number}/Projects/{projectId}/...

Common mistakes:

  • Using production prefix (Projects/...) → triggers production Lambda
  • Missing preview/pr-{number} prefix → no Lambda triggered
  • Wrong PR number in prefix → wrong Lambda triggered
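A small helper like the following (hypothetical, not part of the codebase) can make it harder to construct the wrong prefix:

```shell
# Hypothetical helper: build the S3 object key a preview upload must use
# so the PR's Lambda (and only that Lambda) is triggered.
preview_key() {
  pr_number="$1"; project_id="$2"; filename="$3"
  echo "preview/pr-${pr_number}/Projects/${project_id}/${filename}"
}
```

For example, `preview_key 123 abc report.pdf` yields a key under `preview/pr-123/`, which routes to `syrfAppUploadS3Notifier-pr-123` rather than the production Lambda.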

Label Interaction Matrix

Understanding how labels interact is critical for predictable preview behavior.

Label Precedence Rules

persist-db > use-snapshot > reset-db (implicit)

Decision Flow:

Has 'persist-db' label?
├── YES → Skip ALL database operations (highest priority)
│         Database preserved exactly as-is
└── NO → Has 'use-snapshot' label?
         ├── YES → Run snapshot-restore job (copies from syrf_snapshot)
         │         Skip db-reset job (snapshot handles initialization)
         └── NO → Were MongoDB services rebuilt?
                  ├── YES → Run db-reset job (drop all collections)
                  └── NO → Skip db-reset job (no changes to reset for)

Complete Label Scenario Matrix

| persist-db | use-snapshot | Services Rebuilt | db-reset Job | snapshot-restore Job | Result |
|---|---|---|---|---|---|
| ❌ | ❌ | ❌ | ❌ Skip | ❌ Skip | Database unchanged |
| ❌ | ❌ | ✅ | ✅ Run | ❌ Skip | Collections dropped, empty database |
| ❌ | ✅ | ❌ | ❌ Skip | ✅ Run | Snapshot data copied |
| ❌ | ✅ | ✅ | ❌ Skip | ✅ Run | Snapshot data copied (reset skipped) |
| ✅ | ❌ | ❌ | ❌ Skip | ❌ Skip | Database preserved |
| ✅ | ❌ | ✅ | ❌ Skip | ❌ Skip | Database preserved despite rebuild |
| ✅ | ✅ | ❌ | ❌ Skip | ❌ Skip | Database preserved (persist wins) |
| ✅ | ✅ | ✅ | ❌ Skip | ❌ Skip | Database preserved (persist wins) |

Key Insight: persist-db always wins. When present, no database initialization jobs run regardless of other labels or rebuild status.
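The precedence can be sketched as a function (illustrative only; `db_init_job` is hypothetical and the real decision happens in the workflow):

```shell
# Sketch of the label precedence rules. Inputs are "true"/"false" strings;
# output names which database job (if any) the workflow would generate.
db_init_job() {
  persist_db="$1"; use_snapshot="$2"; services_rebuilt="$3"
  if [ "$persist_db" = "true" ]; then
    echo "none"              # persist-db wins: skip all database operations
  elif [ "$use_snapshot" = "true" ]; then
    echo "snapshot-restore"  # snapshot replaces collections; db-reset skipped
  elif [ "$services_rebuilt" = "true" ]; then
    echo "db-reset"          # rebuilt MongoDB services get a fresh database
  else
    echo "none"              # nothing to reset
  fi
}
```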

Why use-snapshot Skips db-reset

When use-snapshot is true, the db-reset job is NOT generated because:

  1. Redundant: $out aggregation in snapshot-restore completely replaces target collections
  2. Order Problem: db-reset runs at sync wave -1 (AFTER PreSync hooks), so it would DROP data that snapshot-restore just copied
  3. Efficiency: No need to drop then copy; just copy (which replaces)

ArgoCD Sync Wave Ordering

Preview environments use ArgoCD sync waves to ensure resources are created in the correct order.

Complete Sync Order

Wave -5: AtlasDatabaseUser (MongoDB user creation)
         ↓ MongoDB Atlas Operator creates user in Atlas
         ↓ Connection secret becomes available

Wave -2: db-reset RBAC resources (if db-reset enabled)
         - ServiceAccount: db-reset-sa
         - Role: db-reset-marker-role
         - RoleBinding: db-reset-marker-binding

Wave -1: db-reset Job (if db-reset enabled)
         ↓ Drops all collections in syrf_pr_{number}
         ↓ Creates completion marker ConfigMap

Wave 0+: Application services (API, PM, Quartz, Web)
         ↓ Services start with fresh/empty database
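The ordering above is expressed through ArgoCD sync-wave annotations on each resource. A hedged sketch (annotation values are illustrative; the actual manifests live in cluster-gitops):

```yaml
# Illustrative only: ArgoCD orders resources by this annotation.
apiVersion: batch/v1
kind: Job
metadata:
  name: db-reset
  annotations:
    argocd.argoproj.io/sync-wave: "-1"  # after wave -2 RBAC, before services
```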

PreSync Hooks (snapshot-restore)

When use-snapshot label is present, PreSync hooks run BEFORE the sync wave sequence:

[PreSync Phase - runs before wave sequence]

PreSync Wave 1: snapshot-restore RBAC resources
                - ServiceAccount: db-reset-sa
                - Role: db-reset-marker-role
                - RoleBinding: db-reset-marker-binding

PreSync Wave 3: snapshot-restore Job
                ↓ Copies 11 collections from syrf_snapshot to syrf_pr_{number}
                ↓ Creates completion marker ConfigMap

[/PreSync Phase]

Wave -5: AtlasDatabaseUser (already has syrf_snapshot read role)
Wave 0+: Application services start with snapshot data

Hook Delete Policies

| Resource Type | Delete Policy | Behavior |
|---|---|---|
| db-reset RBAC | None (regular sync resource) | Persists as long as file exists in git |
| snapshot-restore RBAC | BeforeHookCreation | Deleted before next sync, recreated |
| snapshot-restore Job | BeforeHookCreation | Old job deleted before new one created |

Why BeforeHookCreation? Resources persist throughout the current sync (Job can use them), then get cleaned up before the next sync. HookSucceeded would delete them immediately after creation, before the Job runs.
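These policies are set with ArgoCD hook annotations. A hedged sketch of what the snapshot-restore Job's metadata might contain (values illustrative; the real manifests live in cluster-gitops):

```yaml
# Illustrative annotations for the snapshot-restore Job (PreSync hook).
metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
    argocd.argoproj.io/sync-wave: "3"
```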

Database Coordination with Init Containers

Preview environments use a sophisticated coordination mechanism to ensure services don't start until the database is ready. This prevents race conditions where services might try to access a database that hasn't been seeded yet.

Architecture Overview

PR Preview Environment
└── pr-{number} (Parent Application - App-of-Apps)
    ├── pr-{number}-infrastructure (AUTO-SYNC ✓)
    │   ├── Namespace, ExternalSecret, AtlasDatabaseUser
    │   └── DatabaseLifecycle CR → manages "db-ready" ConfigMap
    ├── pr-{number}-api (AUTO-SYNC ✓)
    ├── pr-{number}-project-management (AUTO-SYNC ✓)
    ├── pr-{number}-quartz (AUTO-SYNC ✓)
    └── pr-{number}-web (AUTO-SYNC ✓)
        └── Init containers wait for ConfigMap with MATCHING seedVersion

How seedVersion Matching Works

Simply waiting for a db-ready ConfigMap is NOT sufficient. Here's why:

Race Condition During Reseed:

1. New seedVersion pushed to cluster-gitops
2. ArgoCD syncs both infrastructure AND service apps (independently!)
3. Service pods restart (annotation changed)
4. Init container checks for db-ready ConfigMap
5. OLD ConfigMap still exists (operator hasn't updated it yet)
6. Init container PASSES with stale ConfigMap ← BUG!
7. Pods start while database is being reseeded ← DATA CORRUPTION

The Fix: Init containers wait for ConfigMap with matching seedVersion:

# ConfigMap created by DatabaseLifecycle operator
apiVersion: v1
kind: ConfigMap
metadata:
  name: db-ready
  namespace: pr-{number}
data:
  status: "ready"
  seedVersion: "abc123"  # Must match pod's expected version
  seededAt: "2026-01-17T12:00:00Z"
  sourceDatabase: "syrf_snapshot"

Per-Service waitForDatabase Configuration

Not all services need to wait for the database. The waitForDatabase flag is configured per-service in cluster-gitops:

| Service | waitForDatabase | Reason |
|---|---|---|
| api | true | Connects to MongoDB |
| project-management | true | Connects to MongoDB |
| quartz | true | Connects to MongoDB |
| web | false | Frontend only, no database |
| docs | false | Static documentation site |
| user-guide | false | Static documentation site |

Configuration Location: cluster-gitops/syrf/environments/preview/services/{service}/config.yaml

# Example: api/config.yaml
serviceName: api
hostPrefix: "api."
imageRepo: ghcr.io/camaradesuk/syrf-api
waitForDatabase: true  # Init container waits for db-ready ConfigMap

Init Container Behavior

Services with waitForDatabase: true get an init container added automatically:

initContainers:
  - name: wait-for-database
    image: bitnami/kubectl:latest
    command: ['sh', '-c']
    args:
      - |
        EXPECTED_VERSION="${SEED_VERSION}"
        echo "Waiting for db-ready ConfigMap with seedVersion=$EXPECTED_VERSION..."

        while true; do
          CURRENT=$(kubectl get configmap db-ready -n ${NAMESPACE} \
            -o jsonpath='{.data.seedVersion}' 2>/dev/null || echo "")

          if [ "$CURRENT" = "$EXPECTED_VERSION" ]; then
            echo "Database is ready with correct seedVersion!"
            exit 0
          fi

          echo "Current: '$CURRENT', Expected: '$EXPECTED_VERSION'. Waiting..."
          sleep 5
        done

DatabaseLifecycle Operator

The DatabaseLifecycle operator (deployed in the cluster) handles database seeding:

  1. Watches DatabaseLifecycle custom resources
  2. Waits for watched deployments to have 0 ready replicas (via watchedDeployments field)
  3. Seeds database from snapshot when seedVersion changes
  4. Creates/Updates db-ready ConfigMap after successful seeding

Coordination Flow:

Init containers block → Pods stuck at Init:0/1 → Operator sees 0 ready pods
→ Operator seeds database → Operator creates db-ready ConfigMap
→ Init containers pass → Main containers start

Key Design Principle: The operator does NOT scale down services. Instead:

  • Init containers block pods from starting (by waiting for db-ready ConfigMap)
  • Operator waits for watched deployments to have 0 ready replicas
  • This happens naturally because init containers prevent pods from becoming ready

watchedDeployments Configuration (in DatabaseLifecycle CR):

spec:
  watchedDeployments:
    labelSelector: "syrf.org.uk/uses-database=true"
    timeout: 300  # seconds

Services with waitForDatabase: true get the label syrf.org.uk/uses-database=true automatically.

Scenario Walkthroughs

Initial Deployment:

1. PR gets 'preview' + 'use-snapshot' labels
2. Workflow pushes seedVersion to cluster-gitops
3. ArgoCD syncs ApplicationSet, creates apps for infrastructure + services
4. Apps sync in parallel:
   - Infrastructure: creates DatabaseLifecycle CR with watchedDeployments
   - Services: create Deployments with init containers
5. Service pods enter Init:0/1 state, blocked waiting for db-ready ConfigMap
6. DatabaseLifecycle operator:
   - Checks watchedDeployments → all have 0 ready replicas ✓
   - Seeds database from snapshot
   - Creates db-ready ConfigMap with seedVersion
7. Init containers detect matching seedVersion → pods start

Normal Code Push (no database changes):

1. Developer pushes code
2. Workflow pushes new headSha, same seedVersion
3. ArgoCD syncs child apps
4. Kubernetes does rolling update
5. Init container checks ConfigMap - seedVersion matches!
6. Init container passes immediately (~1 second)
7. No database work needed

Reseed Trigger (seedVersion changes):

1. New seedVersion pushed to cluster-gitops
2. ArgoCD syncs apps (infrastructure + services)
3. Services use Recreate strategy → old pods terminated first
4. New pods created with new seedVersion
5. New pods enter Init:0/1 state (waiting for ConfigMap with new seedVersion)
6. DatabaseLifecycle operator:
   - Checks watchedDeployments → all have 0 ready replicas ✓
   - Drops database, seeds from snapshot
   - Updates ConfigMap with new seedVersion
7. Init containers detect matching seedVersion → pods start

Manual Reseed via /reseed-db Command

To trigger a database reseed on an existing preview environment, comment /reseed-db on any PR that has the preview label.

What happens:

1. Workflow detects /reseed-db command in comment
2. Checks that PR has 'preview' label and NOT 'persist-db' label
3. Updates seedVersion in pr.yaml (single source of truth)
4. ArgoCD detects change, syncs all apps
5. Services restart (Recreate strategy terminates old pods first)
6. Init containers wait for new db-ready ConfigMap
7. Operator reseeds database, creates ConfigMap with new seedVersion
8. Services start successfully

Blocked when:

  • persist-db label present → "Remove persist-db label first"
  • preview label missing → "Add preview label first"

Use cases:

  • Database corruption during testing
  • Want fresh snapshot data
  • Recovering from stuck deployments (Init:0/1 state)
  • Schema migration testing

Recreate Deployment Strategy

Services with waitForDatabase: true use Recreate deployment strategy instead of RollingUpdate:

Why? RollingUpdate keeps old pods running until new pods are ready. With init containers blocking new pods, this creates a deadlock:

RollingUpdate (would cause deadlock):
1. Old pods: 1/1 Running (no init container)
2. New pods: Init:0/1 (waiting for db-ready ConfigMap)
3. Operator waits for 0 ready replicas (watchedDeployments check)
4. Old pod has 1 ready replica → operator waits forever
5. db-ready ConfigMap never created → new pods wait forever

Recreate strategy breaks the deadlock:

1. Recreate terminates old pods first
2. 0 ready replicas achieved
3. Operator safe to seed
4. ConfigMap created
5. New pods start

| Environment | Strategy | Reason |
|---|---|---|
| Production | RollingUpdate | Zero-downtime required |
| Staging | RollingUpdate | Zero-downtime preferred |
| Preview (waitForDatabase=false) | RollingUpdate | No coordination needed |
| Preview (waitForDatabase=true) | Recreate | Enables safe database seeding |

This is acceptable for previews: brief downtime during deployments is tolerable in non-production environments.
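In the rendered Deployment this is a one-line strategy setting. A sketch (illustrative; the actual template lives in cluster-gitops):

```yaml
# Illustrative: strategy block rendered for waitForDatabase services.
spec:
  strategy:
    type: Recreate  # terminate old pods first so ready replicas reach 0
```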

Startup Probe for MongoDB Index Creation

Services in preview environments with waitForDatabase: true have a startupProbe to handle slow MongoDB startup:

Problem: MongoDB creates indexes on freshly seeded databases, taking 60+ seconds. The default liveness probe allows only ~90 seconds total before killing the pod, causing restart loops.

Solution: A startupProbe runs only during initial startup, allowing up to 310 seconds before liveness probes take over.

# Configured automatically in _deployment-dotnet.tpl
startupProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 30  # 10 + (30 * 10) = 310 seconds max

| Environment | Max Startup Time |
|---|---|
| Production/Staging | 90 seconds (liveness probe only) |
| Preview (waitForDatabase=true) | 310 seconds (startup probe) |

Symptoms of insufficient startup time (fixed by startup probe):

  • Pods showing 0/1 Running with multiple restarts
  • Logs showing "Now listening" 60+ seconds after startup begins
  • Eventual success after 3-5 restarts (when indexes are cached)

Kyverno Security Policy

A Kyverno ClusterPolicy enforces that PR preview database users can only access appropriate databases.

Policy: atlas-block-production-access

Location: cluster-gitops/plugins/helm/kyverno/resources/atlas-pr-user-policy.yaml

Enforcement: Blocks creation/update of AtlasDatabaseUser resources in pr-* namespaces that violate rules.

Rules Summary

| Rule | Blocked Pattern | Purpose |
|---|---|---|
| 1. block-any-database-roles | readWriteAnyDatabase, dbAdminAnyDatabase, root | Prevent broad access |
| 2. block-production-database | databaseName: syrftest | Protect production data |
| 3. block-staging-database | databaseName: syrf_staging | Protect staging data |
| 4. block-admin-database | databaseName: admin | Protect system database |
| 5. validate-pr-database-pattern | See below | Enforce allowed patterns |
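A simplified sketch of what rule 1 might look like in Kyverno (this is NOT the actual policy; see the location above for the real file, and treat the selectors and JMESPath here as assumptions):

```yaml
# Simplified sketch of one rule; the real policy has five.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: atlas-block-production-access
spec:
  validationFailureAction: Enforce
  rules:
    - name: block-any-database-roles
      match:
        any:
          - resources:
              kinds: [AtlasDatabaseUser]
              namespaces: ["pr-*"]
      validate:
        message: "Broad Atlas roles are not allowed in preview namespaces"
        deny:
          conditions:
            any:
              - key: "{{ request.object.spec.roles[].roleName }}"
                operator: AnyIn
                value: [readWriteAnyDatabase, dbAdminAnyDatabase, root]
```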

Rule 5: Allowed Database Access

PR users can ONLY have:

| Database Pattern | Allowed Roles | Purpose |
|---|---|---|
| syrf_pr_* | Any (readWrite, dbOwner, etc.) | PR-specific database |
| syrf_snapshot | read ONLY | Snapshot data source |

Denied Examples:

# ❌ BLOCKED - syrf_snapshot with readWrite
roles:
  - roleName: readWrite
    databaseName: syrf_snapshot

# ❌ BLOCKED - accessing production
roles:
  - roleName: read
    databaseName: syrftest

# ❌ BLOCKED - accessing unrecognized database
roles:
  - roleName: readWrite
    databaseName: some_other_db

Allowed Examples:

# ✅ ALLOWED - PR database with readWrite
roles:
  - roleName: readWrite
    databaseName: syrf_pr_123

# ✅ ALLOWED - PR database + snapshot read
roles:
  - roleName: readWrite
    databaseName: syrf_pr_123
  - roleName: read
    databaseName: syrf_snapshot

Policy Violation Response

If a PR attempts to create an AtlasDatabaseUser that violates the policy:

  1. Kyverno blocks the resource creation
  2. ArgoCD sync fails with policy violation message
  3. PR preview deployment halts at the AtlasDatabaseUser step
  4. GitHub Deployment status shows failure

Fix: Remove the violating role from the AtlasDatabaseUser definition in the workflow.

Snapshot Producer (Weekly Data Snapshots)

The snapshot-producer CronJob creates weekly copies of production data to the syrf_snapshot database, which is then used by preview environments with the use-snapshot label.

How It Works

Weekly Schedule (Sunday 3 AM UTC)
Snapshot Producer CronJob starts
1. Test connectivity to source (Cluster0) and target (Preview) clusters
2. For each collection (11 total):
   - Count source documents
   - Stream copy: mongodump | mongorestore (no disk writes)
   - Verify target document count
   - Retry up to 3 times on failure
3. Write snapshot_metadata document with:
   - Timestamp, duration, document counts
   - Source/target cluster info
   - Collections copied
Preview environments can now use fresh snapshot data
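The per-collection retry in step 2 can be sketched as a generic wrapper (hypothetical; the real job's copy command and the 30-second delay between attempts are omitted here):

```shell
# Sketch: run a command, retrying up to $1 attempts total, as in
# retry.maxAttempts=3. The real job also sleeps retry.delaySeconds between tries.
retry() {
  max_attempts="$1"; shift
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1            # give up after the final attempt fails
    fi
    attempt=$((attempt + 1))
  done
  return 0
}

# Usage (illustrative): retry 3 copy_collection pmProject
```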

Collections Copied

The following collections are copied from syrftest (production) to syrf_snapshot:

| Collection | Description |
|---|---|
| pmProject | Projects with stages, memberships, questions |
| pmStudy | Studies with screening, extraction, annotations |
| pmInvestigator | User accounts and profiles |
| pmSystematicSearch | Literature searches |
| pmDataExportJob | Export job tracking |
| pmStudyCorrection | PDF correction requests |
| pmInvestigatorUsage | Usage statistics |
| pmRiskOfBiasAiJob | AI risk-of-bias jobs |
| pmProjectDailyStat | Daily statistics |
| pmPotential | Potential references |
| pmInvestigatorEmail | Email records |

Snapshot Metadata

After each successful run, a metadata document is written to syrf_snapshot.snapshot_metadata:

{
  _id: "latest",
  createdAt: ISODate("2026-01-26T03:45:00Z"),
  startedAt: ISODate("2026-01-26T03:00:00Z"),
  finishedAt: ISODate("2026-01-26T03:45:00Z"),
  durationSeconds: 2700,
  sourceCluster: "Cluster0",
  sourceDatabase: "syrftest",
  sourceHost: "cluster0-pri.siwfo.mongodb.net",
  targetCluster: "Preview",
  targetDatabase: "syrf_snapshot",
  targetHost: "preview-pri.siwfo.mongodb.net",
  collections: ["pmProject", "pmStudy", ...],
  collectionsCount: 11,
  documentCounts: {
    pmProject: 1234,
    pmStudy: 56789,
    // ...
  },
  totalDocuments: 123456,
  method: "mongodump | mongorestore streaming",
  crossCluster: true,
  status: "complete"
}

Preview environments can query this document to verify snapshot freshness before using data.

Configuration

The snapshot-producer is deployed to the staging namespace. Configuration is in:

  • Chart: cluster-gitops/charts/snapshot-producer/
  • Values: cluster-gitops/plugins/local/snapshot-producer/values.yaml

Key configuration options:

| Setting | Default | Description |
|---|---|---|
| schedule | "0 3 * * 0" | Cron schedule (Sunday 3 AM UTC) |
| activeDeadlineSeconds | 3600 | 1 hour timeout |
| retry.maxAttempts | 3 | Retries per collection |
| retry.delaySeconds | 30 | Delay between retries |
| streaming.gzip | true | Compress data in transit |

Manual Trigger

To manually trigger a snapshot (useful for testing or recovery):

# Create a one-time Job from the CronJob
kubectl create job --from=cronjob/snapshot-producer snapshot-manual-$(date +%s) -n staging

# Watch progress
kubectl logs -f job/snapshot-manual-<timestamp> -n staging

Troubleshooting

Snapshot job fails with connection error:

  • Check MongoDB Atlas network access (IP allowlist)
  • Verify credentials in snapshot-producer-credentials secret
  • Check if VPC Peering is configured for -pri hostnames

Snapshot takes too long:

  • Large collections may exceed the 1-hour timeout
  • Consider increasing activeDeadlineSeconds
  • Check MongoDB Atlas cluster tier (affects throughput)

Preview shows stale data:

  • Check snapshot_metadata.createdAt timestamp
  • Verify CronJob is running: kubectl get cronjob -n staging
  • Check recent job status: kubectl get jobs -n staging | grep snapshot

For detailed chart configuration, see Snapshot Producer Reference.

Future Enhancements

Planned improvements for PR previews:

  1. Performance metrics: Show load times and resource usage
  2. Visual regression: Screenshot comparison with base branch
  3. E2E tests: Automated testing against preview environment
  4. Cost tracking: Monitor resource usage per preview
  5. Orphan database cleanup: Automated cleanup of syrf_pr_* databases

Recently Completed:

  • Database per preview: Isolated MongoDB database per PR (syrf_pr_{number})
  • Seed data: Auto-populate preview with test data on first startup
  • GitHub Deployments: Native GitHub UI integration with deployment tracking
  • Snapshot restore: Use production data snapshot via use-snapshot label
  • Kyverno security: Policy enforcement for PR database access patterns