Using PR Preview Environments¶
Purpose¶
This guide explains how to use PR (Pull Request) preview environments for testing changes before merging to main. Preview environments are ephemeral, automatically-deployed instances of the SyRF platform that match your PR's code.
What Are PR Preview Environments?¶
PR preview environments provide:
- Isolated testing: Each PR gets its own complete environment
- Automatic deployment: ArgoCD deploys your changes automatically
- Unique URLs: Access your preview at pr-{number}.syrf.org.uk
- Auto-cleanup: Environments are deleted when PR closes
- Full stack: All 6 services (API, PM, Quartz, Web, Docs, User Guide) deployed together
- GitHub Deployments: Native GitHub UI integration with clickable environment URLs
How It Works¶
1. Open PR (any PR to main)
↓
2. pr-tests.yml runs automatically:
- test-dotnet (if .NET changed) ─┬─ MUST PASS
- test-web (if Angular changed) ─┘
↓
3. Add 'preview' label (optional, for preview environment)
↓
4. TWO workflows trigger in parallel:
│
├─ pr-preview.yml (Kubernetes services):
│ - Tag-based detection: find last tag for each service
│ - Build Docker images for changed services
│ - Write version files to cluster-gitops
│ - ArgoCD deploys to pr-{number} namespace
│ - Updates PR description with K8s status
│
└─ pr-preview-lambda.yml (S3 Notifier Lambda):
- Build Lambda package from s3-notifier code
- Deploy syrfAppUploadS3Notifier-pr-{number} to AWS
- Configure S3 trigger for preview/pr-{number}/ prefix
- Updates PR description with Lambda status
↓
5. ArgoCD ApplicationSet detects version files
↓
6. ArgoCD creates namespace: pr-{number}
↓
7. ArgoCD deploys all services to preview namespace
↓
8. Preview URLs and file upload processing available within 5 minutes
↓
9. PR description shows unified status table at top
PR Description Status¶
Preview status is displayed directly in the PR description (not comments) so it stays visible at the top of the PR. Both workflows update the same status table:
| Component | Status |
|---|---|
| S3 Notifier Lambda | ✅ 0.1.5 |
| K8s Services | ✅ Ready |
Preview URLs (once ArgoCD syncs):
- 🌐 Web: https://pr-{number}.syrf.org.uk
- 🔌 API: https://api.pr-{number}.syrf.org.uk
- 📁 S3 Prefix: preview/pr-{number}/
This approach:
- Keeps status always visible (not buried in comments)
- Shows both Lambda and K8s status in one place
- Includes version numbers and links to workflow runs
GitHub Deployments¶
Preview environments are tracked via the GitHub Deployments API, providing native integration with GitHub's UI:
Where to Find Deployments:
- PR Sidebar: Look for the "Environments" section with a clickable pr-{number} link
- Repository Deployments: Navigate to repository → Deployments tab
- Commit Status: Deployment status appears on commits in the PR
Deployment Status Flow:
| Status | Description | Trigger |
|---|---|---|
| pending | Deployment created, waiting to start | PR preview workflow starts |
| in_progress | Building Docker images | Build job begins |
| queued | Pushed to GitOps, waiting for ArgoCD | After cluster-gitops push |
| success | ArgoCD sync complete, preview live | ArgoCD PostSync hook |
| failure | Build or deployment failed | Workflow failure |
| inactive | Environment cleaned up | PR closed/label removed |
Benefits:
- Click environment URL directly from PR sidebar
- See deployment history for the PR
- Track deployment state changes
- Automatic cleanup marking when PR closes
Fork PR Limitation: PRs from forked repositories don't create GitHub Deployments due to token permission restrictions. The preview environment still deploys normally, but without the GitHub Deployment tracking.
Why Two Workflows?¶
The preview environment needs both Kubernetes services AND a Lambda function to work properly:
| Component | Workflow | Purpose |
|---|---|---|
| API, PM, Quartz, Web | pr-preview.yml | Application logic, UI, background jobs |
| S3 Notifier Lambda | pr-preview-lambda.yml | File upload notifications to RabbitMQ |
Without the Lambda, file uploads in preview environments wouldn't trigger the study import process. The Lambda listens to S3 events on the preview/pr-{number}/ prefix and publishes messages to RabbitMQ.
Automated Testing¶
All PRs run automated tests via the pr-tests.yml workflow, regardless of whether they have a preview environment.
Test Workflow¶
| Job | Trigger | Timeout | What it tests |
|---|---|---|---|
| test-dotnet | .NET code changed | 10 min | xUnit tests for API, PM, Quartz, S3 Notifier |
| test-web | Angular code changed | 5 min | Vitest tests with coverage thresholds |
Both jobs run in parallel after change detection.
Coverage Requirements¶
Angular tests enforce minimum coverage thresholds:
| Metric | Threshold |
|---|---|
| Statements | 50% |
| Branches | 40% |
| Functions | 50% |
| Lines | 50% |
Tests fail if coverage drops below these thresholds.
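As a minimal illustration of the gate (a toy sketch, not the actual Vitest configuration), the pass/fail decision amounts to comparing a measured percentage against the threshold:

```shell
# Toy sketch of the coverage gate: compare a measured percentage against
# the statement threshold from the table above. Values are illustrative.
MEASURED=62    # example: measured statement coverage, percent
THRESHOLD=50   # statement threshold enforced in CI
if [ "$MEASURED" -ge "$THRESHOLD" ]; then
  RESULT="pass"
else
  RESULT="fail"
fi
echo "coverage gate: $RESULT"
```

The same comparison applies independently to branches, functions, and lines.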
Test Results¶
Test results are uploaded as GitHub Actions artifacts:
- dotnet-test-results: TRX test result files
- dotnet-coverage: Cobertura XML coverage
- web-coverage: lcov, cobertura, text coverage reports
- web-test-results: JUnit XML test results
Code Quality (SonarCloud)¶
Coverage and code quality are analyzed by SonarCloud which provides:
- Quality Gate: Pass/fail status based on coverage, bugs, and code smells
- PR Decoration: Inline comments on issues in changed code
- Coverage Report: Line-by-line coverage visualization
- Security Analysis: Detection of vulnerabilities and hotspots
See How to: Run Tests for local testing and SonarCloud setup.
Version File Structure (in cluster-gitops):
syrf/environments/preview/
services/ # Service defaults (all previews)
api/
config.yaml # hostPrefix, imageRepo
values.yaml # Default Helm values
web/
config.yaml
values.yaml
...
pr-123/
pr.yaml # PR trigger file (prNumber, headSha, branch)
namespace.yaml # Kubernetes namespace manifest
services/ # PR-specific values (one file per service)
api.values.yaml # image.tag from git tag or new build
project-management.values.yaml
quartz.values.yaml
web.values.yaml
docs.values.yaml
user-guide.values.yaml
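For reference, a pr.yaml trigger file might look like this (the field names come from the comment in the tree above; the values are illustrative):

```yaml
# Illustrative pr.yaml - the PR trigger file ArgoCD's ApplicationSet watches.
prNumber: 123
headSha: 4f2a9c1d8e7b6a5f4e3d2c1b0a9f8e7d6c5b4a39
branch: feature/my-awesome-feature
```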
Prerequisites¶
- Open pull request in syrf repository
- Changes to at least one service
- GitHub Actions workflows enabled
- ArgoCD installed and configured (cluster requirement)
Optional: GCP Secrets for RabbitMQ Cleanup¶
When a PR is closed, the cleanup job can delete the PR's RabbitMQ vhost to free resources. This requires GCP credentials to access the GKE cluster.
Required GitHub Secrets (optional but recommended):
| Secret | Description | How to Get |
|---|---|---|
| GCP_WORKLOAD_IDENTITY_PROVIDER | Workload Identity Federation provider | projects/{project-number}/locations/global/workloadIdentityPools/{pool}/providers/{provider} |
| GCP_SERVICE_ACCOUNT | GCP service account email | {sa-name}@{project}.iam.gserviceaccount.com |
If not configured:
- RabbitMQ vhost cleanup is skipped (not a failure)
- Orphan vhosts accumulate until manually cleaned
- All other cleanup (Lambda, K8s namespace) still works
To configure (requires GCP admin):
- Create a Workload Identity Pool for GitHub Actions
- Create a service account with the container.developer role
- Add the secrets to GitHub repository settings
See GCP Workload Identity Federation documentation for setup details.
Creating a Preview Environment¶
Step 1: Create Your PR¶
# Create feature branch
git checkout -b feature/my-awesome-feature
# Make changes
# ... edit files ...
# Commit and push
git add .
git commit -m "feat(api): add awesome new feature"
git push origin feature/my-awesome-feature
Step 2: Open Pull Request¶
- Go to GitHub repository
- Click "Compare & pull request"
- Fill in PR title and description
- Add the preview label to the PR
- Click "Create pull request"
Step 3: Wait for Build¶
The pr-preview.yml workflow will automatically:
- ✅ Detect changed services
- ✅ Build Docker images with the pr-{number} tag
- ✅ Push images to GHCR
- ✅ Comment on PR with preview info
Build time: ~5-10 minutes depending on changed services
Step 4: ArgoCD Deploys¶
ArgoCD will automatically:
- ✅ Detect the PR (checks every 5 minutes)
- ✅ Create namespace pr-{number}
- ✅ Deploy all changed services
- ✅ Configure ingress with preview URLs
- ✅ Set up TLS certificates
Deployment time: ~5-10 minutes after build completes
Step 5: Access Your Preview¶
Once deployed, access your preview at:
- Web UI: https://pr-{number}.syrf.org.uk
- API: https://api.pr-{number}.syrf.org.uk
- PM Service: https://project-management.pr-{number}.syrf.org.uk
- Docs: https://docs.pr-{number}.syrf.org.uk
- User Guide: https://help.pr-{number}.syrf.org.uk
Replace {number} with your PR number (e.g., PR #42 → https://pr-42.syrf.org.uk)
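The URL pattern can be scripted for quick smoke checks. This sketch only composes the URLs; the /health probe path is an assumption (your API may expose a different endpoint), so it is left commented out:

```shell
# Compose preview URLs for a given PR number (pattern from the list above).
PR=42
WEB_URL="https://pr-${PR}.syrf.org.uk"
API_URL="https://api.pr-${PR}.syrf.org.uk"
echo "Web: ${WEB_URL}"
echo "API: ${API_URL}"
# curl -fsS "${API_URL}/health"   # probe path is an assumption
```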
Managing Preview Environments¶
Updating Your Preview¶
Push new commits to your PR branch:
The preview workflow will automatically:
- Rebuild changed images
- Push new images with the same pr-{number} tag
- ArgoCD detects the new image
- ArgoCD syncs updated deployment
Update time: ~10-15 minutes total
Checking Preview Status¶
GitHub Actions:
- Go to PR → "Checks" tab
- View "PR Preview Build" workflow
- Check which services were built
ArgoCD UI (if accessible):
- Open ArgoCD dashboard
- Look for the application syrf-pr-{number}
- View sync status and health
Kubernetes (if you have kubectl access):
# List preview namespaces
kubectl get namespaces | grep pr-
# Check pods in your preview
kubectl get pods -n pr-{number}
# View preview ingresses
kubectl get ingress -n pr-{number}
Disabling Preview¶
Remove the preview label from your PR:
- Go to PR on GitHub
- Click "Labels" gear icon
- Uncheck "preview"
- Both workflows trigger cleanup automatically:
- Lambda function deleted via Terraform
- K8s namespace deleted via ArgoCD
- PR description updated with cleanup status
- S3 files are preserved for debugging
Deleting Preview¶
Preview environments are automatically deleted when:
- PR is closed (merged or closed without merging)
- PR is converted to draft
- preview label is removed (triggers the unlabeled event)
Cleanup includes:
- Lambda function deletion
- K8s namespace and resources deletion
- Git tags cleanup
- PR description update showing cleanup reason
Preserved for debugging:
- S3 uploaded files under the preview/pr-{number}/ prefix
Cleanup time: Usually complete within 2-3 minutes
Cleanup Architecture¶
Understanding how preview cleanup works helps diagnose issues when cleanup fails.
ArgoCD Hook Lifecycle¶
Preview environments use ArgoCD PreSync hooks for database reset operations. These hooks have finalizers that control when resources can be deleted:
| Policy | When Finalizer Removed | Use Case |
|---|---|---|
| BeforeHookCreation | On next sync only | Resources that must persist across syncs |
| HookSucceeded | Immediately after success | Ephemeral resources (db-reset job) |
| HookFailed | Immediately after failure | Ephemeral resources (db-reset job) |
The db-reset resources use HookSucceeded,HookFailed so they clean up immediately after completion.
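In manifest form, the db-reset Job's hook configuration has roughly this shape (a sketch based on the policies above, not the actual manifest; the image and job body are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-reset                    # actual jobs are named db-reset-{N}-{sha}
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded,HookFailed
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: db-reset
          image: example/db-reset:latest   # illustrative image
```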
Cleanup Flow¶
PR Closed/Label Removed
↓
cleanup-tags job starts
↓
1. Pre-cleanup: Remove ArgoCD hook finalizers
(Prevents deadlock if hooks are still running)
↓
2. Delete PR version files from cluster-gitops
↓
3. ArgoCD detects missing files
↓
4. ArgoCD deletes Application
↓
5. Namespace and all resources deleted
Why Pre-Cleanup Matters¶
If git files are deleted before hooks complete, ArgoCD loses visibility of the hook annotations (they were in git). Without knowing the hook-delete-policy, ArgoCD doesn't know how to handle the finalizers, causing resources to get stuck in Terminating state indefinitely.
The pre-cleanup step removes argocd.argoproj.io/hook-finalizer from all hook resources before deleting git files, ensuring clean deletion regardless of hook state.
Resources with Hook Finalizers¶
| Resource | Finalizer Source | Cleanup Behavior |
|---|---|---|
| job/db-reset-{N}-{sha} | ArgoCD PreSync hook | Deleted immediately after completion |
| serviceaccount/db-reset-sa | ArgoCD PreSync hook | Deleted immediately after completion |
| role/db-reset-marker-role | ArgoCD PreSync hook | Deleted immediately after completion |
| rolebinding/db-reset-marker-binding | ArgoCD PreSync hook | Deleted immediately after completion |
| externalsecret/atlas-operator-api-key | ArgoCD PreSync hook | Persists until Application deletion |
| secret/mongodb-pr-password | ArgoCD PreSync hook | Persists until Application deletion |
| atlasdatabaseuser/pr-user | ArgoCD PreSync hook | Persists until Application deletion |
Troubleshooting¶
Preview Not Building¶
Symptom: No "PR Preview Build" workflow run
Causes:
- PR doesn't have the preview label
- No changes to service code (only docs/config changed)
- Workflow file syntax error
Fix:
- Add the preview label to the PR
- Make a small change to service code
- Check .github/workflows/pr-preview.yml syntax
Build Fails¶
Symptom: Red X on "PR Preview Build" workflow
Causes:
- Docker build errors
- TypeScript/compile errors
- Missing dependencies
Fix:
- Click on failed workflow run
- Review job logs
- Fix errors in your branch
- Push fix to trigger rebuild
Preview Not Deploying¶
Symptom: Build succeeds but no preview URLs work
Causes:
- ArgoCD not installed yet (cluster not ready)
- ApplicationSet not configured
- GHCR image pull issues
- DNS not configured
Fix (requires cluster access):
- Check that the ArgoCD application exists: kubectl get application -n argocd | grep pr-{number}
- Check the ApplicationSet: kubectl get applicationset -n argocd syrf-preview
- View ArgoCD logs: kubectl logs -n argocd -l app.kubernetes.io/name=argocd-applicationset-controller
404 Not Found on Preview URL¶
Symptom: Preview URL returns 404
Causes:
- Ingress not created yet (DNS propagation)
- Certificate not ready
- Service not healthy
Fix:
- Wait 5-10 minutes for DNS and cert
- Check service health: kubectl get pods -n pr-{number}
- Check ingress: kubectl get ingress -n pr-{number}
Preview Shows Old Code¶
Symptom: Preview doesn't reflect latest changes
Causes:
- Image tag not updated (cached)
- ArgoCD not synced yet
- Browser caching old assets
Fix:
- Check the image tag in the deployment: kubectl describe deployment -n pr-{number}
- Force an ArgoCD sync (if accessible)
- Hard refresh browser (Ctrl+Shift+R)
Namespace Stuck in Terminating State¶
Symptom: After PR closes, namespace shows Terminating status but doesn't delete. ArgoCD Application shows Unknown sync status.
Diagnosis:
# Check namespace status
kubectl get namespace pr-{number}
# Check for resources with finalizers blocking deletion
kubectl api-resources --verbs=list --namespaced -o name | \
xargs -I {} kubectl get {} -n pr-{number} \
-o custom-columns='KIND:.kind,NAME:.metadata.name,FINALIZERS:.metadata.finalizers' \
--ignore-not-found 2>/dev/null | grep -v "<none>"
# Check ArgoCD Application
kubectl get application pr-{number}-namespace -n argocd -o yaml | \
grep -A5 "deletionTimestamp\|finalizers"
Common Causes:
- ArgoCD hook finalizer stuck on Job - Most common cause
- External Secrets Operator waiting - ESO finalizers blocking deletion
- Atlas Database User finalizer - MongoDB operator cleanup pending
Manual Fix:
# Remove finalizer from stuck Job (find the exact name first)
kubectl get jobs -n pr-{number}
kubectl patch job {job-name} -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
# If RBAC resources are stuck
kubectl patch serviceaccount db-reset-sa -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
kubectl patch role db-reset-marker-role -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
kubectl patch rolebinding db-reset-marker-binding -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
# If MongoDB resources are stuck
kubectl patch externalsecret atlas-operator-api-key -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
kubectl patch secret mongodb-pr-password -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
kubectl patch atlasdatabaseuser pr-user -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
Prevention: This issue is prevented by the pre-cleanup step in the workflow. If you see this issue, the pre-cleanup step may have failed or GCP credentials weren't configured.
ArgoCD Application Won't Delete¶
Symptom: Application shows Unknown sync status and has deletionTimestamp set but won't complete deletion.
Causes:
- Resources with finalizers in the namespace blocking deletion
- ArgoCD controller unable to access the namespace
- Custom resource definitions (CRDs) blocking deletion
Fix:
# First, fix any stuck resources in the namespace (see above)
# If Application still won't delete, check its finalizers
kubectl get application pr-{number}-namespace -n argocd \
-o jsonpath='{.metadata.finalizers}'
# Remove ArgoCD resource finalizer (last resort)
kubectl patch application pr-{number}-namespace -n argocd --type=merge \
-p '{"metadata":{"finalizers":null}}'
Warning: Removing the Application finalizer skips ArgoCD's cascade deletion. Only do this after manually cleaning up namespace resources.
Best Practices¶
When to Use Previews¶
Good use cases:
- ✅ Testing new features before review
- ✅ Validating bug fixes with real data
- ✅ Demonstrating changes to stakeholders
- ✅ QA testing before merge
- ✅ Integration testing across services
Avoid for:
- ❌ Every single PR (resource intensive)
- ❌ Documentation-only changes
- ❌ Config-only changes
- ❌ Very small typo fixes
Resource Limits¶
Preview environments have reduced resources compared to staging:
| Service | Memory Limit | CPU Limit | Replicas |
|---|---|---|---|
| API | 256Mi | 250m | 1 |
| PM | 256Mi | 250m | 1 |
| Quartz | 256Mi | 250m | 1 |
| Web | 128Mi | 100m | 1 |
Implications:
- Slower response times than production
- Not suitable for load testing
- May hit memory/CPU limits under heavy use
Testing Checklist¶
Before merging, verify in your preview:
- [ ] Application starts without errors
- [ ] Core functionality works
- [ ] API endpoints respond correctly
- [ ] UI renders properly
- [ ] No console errors in browser
- [ ] Authentication works (if applicable)
- [ ] Database operations succeed
Cleanup¶
Always remove the preview label or close your PR when done testing to free up cluster resources.
Preview Environment Details¶
Namespace Structure¶
Each preview gets its own Kubernetes namespace:
pr-{number}/
├── deployments/
│ ├── syrf-api
│ ├── syrf-pm
│ ├── syrf-quartz
│ └── syrf-web
├── services/
│ ├── syrf-api
│ ├── syrf-pm
│ ├── syrf-quartz
│ └── syrf-web
├── ingresses/
│ ├── pr-{number}-api
│ ├── pr-{number}-pm
│ └── pr-{number}-web
├── configmaps/
│ └── preview-info
└── secrets/
└── (TLS certificates)
Angular Development Build¶
Preview environments build the Angular app with --configuration development instead of production. This provides:
- Verbose error messages: Full stack traces with component context
- Development mode checks: Extra change detection cycles to catch issues
- Source maps included: Debug TypeScript directly in browser DevTools
- Named chunks: Easier to identify modules when debugging
Trade-off: Bundle size is larger, but this is acceptable for preview testing.
devMode Feature Flag¶
Preview environments have the devMode feature flag enabled by default. This can be used in the app to:
- Enable additional console logging
- Show debug panels/overlays
- Enable verbose API request logging
- Show feature flag debugging UI
- Enable performance profiling helpers
Access via ngrx selector: selectDevMode
Environment Variables¶
Preview services run with:
- ASPNETCORE_ENVIRONMENT=Preview (.NET services)
- ENVIRONMENT=preview (Web service)
- REMOVE_SOURCEMAPS=false (Web keeps sourcemaps for debugging)
- SYRF__FeatureFlags__DevMode=true (Web devMode enabled)
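In Helm values form, the web service's preview environment block might look like this (the surrounding key structure is an assumption; the variable names and values come from the list above):

```yaml
# Illustrative preview values for the web service - env var names from
# this page, surrounding structure assumed.
env:
  ENVIRONMENT: preview
  REMOVE_SOURCEMAPS: "false"
  SYRF__FeatureFlags__DevMode: "true"
```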
Data Isolation¶
Important: Preview environments share the same backend resources as staging:
- Same MongoDB instance
- Same RabbitMQ instance
- Same S3 buckets
Be careful:
- Use test data only
- Don't delete production-like data
- Consider data namespace isolation in code
Customizing Your Preview¶
Feature Flags via PR Description¶
You can enable or disable feature flags for your preview environment by adding a config block to your PR description.
Format: Add a YAML code block starting with #preview-config:
Rules:
- The YAML block MUST start with #preview-config on the first line
- Top-level keys are service names (api, web, project-management, quartz, docs, user-guide)
- Only featureFlags is allowed - other settings are ignored for security
- The config is parsed when the preview workflow runs
Example - Enable multiple flags:
```yaml
#preview-config
web:
featureFlags:
newScreeningOverview: true
newStageOverview: true
api:
featureFlags:
enableBetaEndpoints: true
```
What You Can and Cannot Customize¶
| Setting | How to Customize |
|---|---|
| Feature flags | PR description (any PR creator) |
| Resources (memory, CPU) | Requires cluster-gitops write access |
| Logging level | Requires cluster-gitops write access |
| Replica count | Requires cluster-gitops write access |
Why this restriction? Resource limits are restricted to prevent accidental cost increases and ensure fair resource sharing across previews.
Service-Specific Defaults¶
Platform operators can set service-specific defaults for all previews in syrf/environments/preview/services/{service}/ (config.yaml and values.yaml in cluster-gitops).
These values apply to all preview environments and can set things like:
- Default feature flags for previews
- Resource limits for specific services
- Logging configuration
S3 Notifier Lambda (File Upload Processing)¶
Each preview environment gets its own AWS Lambda function to handle file uploads.
How It Works¶
User uploads file → S3 bucket → S3 Event Notification
↓
Lambda: syrfAppUploadS3Notifier-pr-{number}
↓
RabbitMQ message → Project Management service
↓
Study import processing begins
Lambda Details¶
| Aspect | Value |
|---|---|
| Function Name | syrfAppUploadS3Notifier-pr-{number} |
| S3 Prefix | preview/pr-{number}/ |
| Runtime | .NET 10 (linux-x64) |
| Managed By | Terraform (in camarades-infrastructure/) |
S3 Key Format¶
Files uploaded in preview environments use a special prefix: preview/pr-{number}/Projects/{projectId}/...
This ensures:
- Preview uploads don't interfere with production data
- Each PR's Lambda only processes its own files
- Files are isolated per PR for debugging purposes
Change Detection¶
The Lambda workflow uses smart change detection to avoid unnecessary deployments:
- On synchronize (new commits), checks whether s3-notifier code changed
- Compares the current commit against the last deployed SHA (stored in Lambda tags)
- Only deploys if files in src/services/s3-notifier/ changed since the last deploy
- The first deploy for a PR always runs (no previous SHA)
This prevents rebuilding the Lambda on every push when only K8s services changed.
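The detection logic can be sketched as a self-contained demo (the repository layout and SHAs here are illustrative; the real workflow reads the last deployed SHA from the Lambda's AWS tags):

```shell
# Self-contained demo of SHA-based change detection for the s3-notifier.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git -c user.email=ci@example -c user.name=ci commit -q --allow-empty -m base
LAST_SHA=$(git rev-parse HEAD)          # stand-in for the SHA in Lambda tags
mkdir -p src/services/s3-notifier
echo 'notifier code' > src/services/s3-notifier/Function.cs
git add -A
git -c user.email=ci@example -c user.name=ci commit -q -m "change notifier"
# Deploy only if anything under the s3-notifier path changed since LAST_SHA.
if git diff --quiet "$LAST_SHA" HEAD -- src/services/s3-notifier/; then
  DECISION="skip"
else
  DECISION="deploy"
fi
echo "Lambda: $DECISION"
```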
Cleanup¶
When a PR is closed or the preview label is removed, pr-preview-lambda.yml automatically:
- Preserves S3 files in preview/pr-{number}/ (for debugging)
- Removes the Lambda function via Terraform
- Deletes the Lambda package from S3 state bucket
- Updates PR description with cleanup status
Note: S3 files are intentionally preserved after cleanup to help debug any issues. They can be manually deleted if needed.
Related Documentation¶
- PR Preview Workflow: .github/workflows/pr-preview.yml
- PR Preview Lambda Workflow (archived): .github/workflows/archived/pr-preview-lambda.yml
- Lambda Permissions: configure-lambda-deployment-permissions.md
- ApplicationSet Config: See cluster-gitops repository
- CI/CD Workflow: ci-cd-workflow.md
- ArgoCD Docs: https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/
PR Labels Reference¶
Preview environments are controlled by GitHub labels. Each label has specific effects on the workflow behavior.
Label Effects Matrix¶
| Label | When Added | When Removed | Effects |
|---|---|---|---|
| preview | PR labeled | PR unlabeled | Add: Triggers full preview build/deploy. Remove: Triggers complete cleanup (namespace, Lambda, MongoDB user, tags) |
| persist-db | PR labeled | PR unlabeled | Add: Skips database reset on subsequent syncs (preserves data). Remove: Re-enables database reset, runs db-reset job on next sync |
| use-snapshot | PR labeled | PR unlabeled | Add: Uses production snapshot data instead of seed data. Remove: Reverts to seed data |
Label: preview¶
Purpose: Primary control for preview environment lifecycle.
When Added:
1. Workflow triggers on 'labeled' event
2. check-should-run job validates preview should be created
3. All service images built with pr-{number} tag
4. Version files written to cluster-gitops
5. ArgoCD detects files, creates namespace and Application
6. Lambda function deployed for S3 file notifications
7. GitHub Deployment created for tracking
When Removed:
1. Workflow triggers on 'unlabeled' event
2. cleanup-tags job executes
3. Pre-cleanup removes ArgoCD hook finalizers (prevents deadlock)
4. MongoDB database user deleted
5. Quartz SQL schema dropped
6. Git tags cleaned up
7. Version files deleted from cluster-gitops
8. ArgoCD detects missing files, deletes Application
9. Lambda function destroyed via Terraform
10. RabbitMQ vhost deleted (if GCP credentials configured)
11. GitHub Deployment marked inactive
Timing: Label changes are processed immediately when the workflow runs.
Label: persist-db¶
Purpose: Prevents database reset between preview environment syncs.
Behavior:
| persist-db Label State | Database Reset Job | Use Case |
|---|---|---|
| Not present (default) | Runs on every sync | Clean slate for each push |
| Present | Skipped | Preserve test data across commits |
When Added:
- The db-reset-job.yaml file is deleted from cluster-gitops
- ArgoCD no longer runs the PreSync hook to reset the database
- Existing data in syrf_pr_{number} is preserved
When Removed:
- The db-reset-job.yaml file is re-created in cluster-gitops
- The database reset runs on the next ArgoCD sync
- All data in the preview database is replaced with seed data
Important: Adding persist-db does NOT retroactively restore deleted data. It only prevents future resets.
Label: use-snapshot¶
Purpose: Initialize preview database with production snapshot data instead of empty/seed data.
Current Status: ✅ Fully Implemented - See Data Snapshot Automation for architecture details.
Behavior:
| use-snapshot Label State | Data Source | Database Initialization |
|---|---|---|
| Not present (default) | Empty database | Services create data as needed |
| Present | Production snapshot (syrf_snapshot) | Snapshot restore job copies collections |
How it works:
- MongoDB User Updated: the PR user gets an additional read role on the syrf_snapshot database
- Snapshot Restore Job Created: a PreSync hook job copies 11 collections from syrf_snapshot to syrf_pr_{number}
- db-reset Job Skipped: when use-snapshot=true, the database reset job is NOT generated (snapshot-restore handles initialization)
- Idempotency: a completion marker (ConfigMap) prevents duplicate restores on manual ArgoCD syncs
Collections Copied (via $out aggregation):
pmProject, pmStudy, pmInvestigator, pmSystematicSearch, pmDataExportJob, pmStudyCorrection, pmInvestigatorUsage, pmRiskOfBiasAiJob, pmProjectDailyStat, pmPotential, pmInvestigatorEmail
Security: Kyverno policy enforces that PR users can ONLY have read access (not readWrite) on syrf_snapshot. See Kyverno Security Policy below.
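The enforcement rule might have roughly this shape (an illustrative sketch only, not the actual policy; the policy name and the field paths into AtlasDatabaseUser are assumptions):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-snapshot-access      # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: snapshot-read-only
      match:
        any:
          - resources:
              kinds:
                - AtlasDatabaseUser
      validate:
        message: "PR users may only have the read role on syrf_snapshot"
        deny:
          conditions:
            any:
              # Deny if any role scoped to syrf_snapshot is not 'read'.
              - key: "{{ request.object.spec.roles[?databaseName=='syrf_snapshot'].roleName || `[]` }}"
                operator: AnyNotIn
                value:
                  - read
```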
When Added:
1. Workflow detects 'use-snapshot' label
2. AtlasDatabaseUser gets additional role:
- roleName: read
- databaseName: syrf_snapshot
3. snapshot-restore-job.yaml generated instead of db-reset-job.yaml
4. ArgoCD runs PreSync hook to copy data
5. Services start with production-like data
When Removed:
1. AtlasDatabaseUser loses syrf_snapshot read role
2. snapshot-restore-job.yaml removed
3. On next sync with service changes, db-reset-job.yaml may be generated
Database Isolation¶
Per-Environment Databases¶
Each environment type has its own database isolation strategy:
| Environment | Database Name | Isolation Level | Data Source |
|---|---|---|---|
| Production | syrftest | Full (dedicated) | Live data |
| Staging | syrftest ⚠️ | Shared with production | Production data |
| Preview | syrf_pr_{number} | Full (per-PR) | Seed data |
⚠️ CRITICAL: Staging currently shares the production database (syrftest). This is a known issue documented in the MongoDB Testing Strategy.
Preview Database Lifecycle¶
PR Opens + preview label added
↓
1. MongoDB Atlas database user created: pr-{number}-user
↓
2. Database syrf_pr_{number} created (on first write)
↓
3. DatabaseSeeder runs, populates with sample data
↓
4. Preview services connect using pr-{number}-user credentials
↓
[PR active - multiple syncs may occur]
↓
5. PR closes OR preview label removed
↓
6. Database user deleted (pr-{number}-user)
↓
7. Database syrf_pr_{number} becomes orphaned
↓
8. Manual cleanup required for orphan databases
MongoDB Atlas User Permissions¶
Preview environments use dedicated MongoDB Atlas database users with scoped permissions:
| User | Database Access | Role |
|---|---|---|
| pr-{number}-user | syrf_pr_{number} only | dbOwner |
dbOwner role provides:
- Read/write access to all collections
- Create/drop collections
- Create/drop indexes
- Run aggregation pipelines
Cleanup note: Users created before the dbOwner role update may have insufficient permissions for some cleanup operations.
Quartz SQL Schema Isolation¶
The Quartz service (background jobs) uses SQL Server with per-environment schema isolation:
| Environment | Schema Name | Isolation |
|---|---|---|
| Production | [production] | Dedicated |
| Staging | [staging] | Dedicated |
| Preview | [preview_{number}] | Per-PR |
Cleanup: When a PR closes, the cleanup-tags job drops the [preview_{number}] schema.
RabbitMQ Vhost Isolation¶
Each preview environment gets its own RabbitMQ virtual host:
| Environment | Vhost Name |
|---|---|
| Production | / (default) |
| Staging | staging |
| Preview | pr-{number} |
Cleanup requirement: Requires GCP credentials to access the GKE cluster and delete vhosts.
Edge Cases and Known Issues¶
Fork PR Limitations¶
Issue: PRs from forked repositories cannot create GitHub Deployments.
Cause: GITHUB_TOKEN in fork PRs has restricted permissions and cannot create deployments on the upstream repository.
Impact:
- Preview environment deploys normally
- GitHub Deployments UI shows no environment for the PR
- Users must manually check ArgoCD or workflow logs for status
Workaround: None. This is a GitHub security limitation.
ArgoCD Hook Finalizer Deadlock¶
Issue: Namespace gets stuck in Terminating state indefinitely.
Scenario:
1. PR closes while db-reset job is running
2. Git files deleted from cluster-gitops
3. ArgoCD loses visibility of hook annotations (they were in git)
4. ArgoCD doesn't know hook-delete-policy
5. Resources with finalizers block namespace deletion
Prevention: The cleanup-tags job runs a pre-cleanup step that removes argocd.argoproj.io/hook-finalizer from all resources BEFORE deleting git files.
Manual Fix (if pre-cleanup fails):
# Remove finalizers from stuck jobs
kubectl get jobs -n pr-{number}
kubectl patch job {job-name} -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
# Remove from RBAC resources
kubectl patch serviceaccount db-reset-sa -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
kubectl patch role db-reset-marker-role -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
kubectl patch rolebinding db-reset-marker-binding -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
MongoDB Cleanup Failures for Older PRs¶
Issue: MongoDB database user cleanup may fail silently for PRs created before the dbOwner role change.
Cause: Users created with older roles may not have dbOwner permissions required for some cleanup operations.
Impact: Orphan database users may remain in MongoDB Atlas.
Resolution: Manual cleanup via MongoDB Atlas UI or CLI.
Race Condition: Closed PR and Build¶
Issue: Build job could recreate files that cleanup job deleted.
Scenario:
1. PR closes
2. cleanup-tags job starts, deletes files
3. build-images job (already running) finishes, writes new files
4. Files recreated after cleanup
Prevention: The workflow checks github.event.action == 'closed' at the start of check-should-run job and immediately exits if true. This ensures build jobs don't run for closed PRs.
Code reference (pr-preview.yml:70-77):
# Skip build if PR is closed
if [ "${{ github.event.action }}" == "closed" ]; then
echo "result=false" >> "$GITHUB_OUTPUT"
echo "skip_reason=PR is closed" >> "$GITHUB_OUTPUT"
exit 0
fi
GCP Credentials Not Configured¶
Issue: Some cleanup steps fail silently when GCP credentials are not configured.
Affected Operations:
| Operation | Without GCP Credentials |
|---|---|
| RabbitMQ vhost deletion | Skipped |
| ArgoCD finalizer pre-cleanup | Skipped |
| Quartz SQL schema cleanup | Skipped |
Impact: Orphan resources accumulate until manually cleaned.
Required Secrets:
- `GCP_WORKLOAD_IDENTITY_PROVIDER`
- `GCP_SERVICE_ACCOUNT`
Tag-Based Change Detection Edge Cases¶
Issue: Services may not rebuild when expected due to tag-based detection.
How it works:
1. Find last git tag for service (e.g., api-v1.2.3)
2. Compare current commit against tagged commit
3. If files in service path changed → rebuild
4. If no changes → reuse existing image
Edge cases:
| Scenario | Behavior |
|---|---|
| First PR ever (no tags) | Uses base branch comparison |
| Service has no tags | Always rebuilds |
| Tag deleted manually | May cause unexpected rebuilds |
| Shared library changed | Detected via DEPENDENCY-MAP.yaml |
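The detection rule can be sketched in shell. The names here (service `api`, path `src/api`, tag pattern `api-v*`) are assumptions for illustration, not the workflow's exact code; the sketch is demonstrated against a throwaway git repo:

```shell
# Illustrative sketch of tag-based change detection, run against a temp repo.
set -eu

detect() {  # $1=service  $2=service path  -> prints "rebuild" or "reuse"
  last_tag=$(git tag --list "$1-v*" --sort=-v:refname | head -n1)
  if [ -z "$last_tag" ]; then
    echo rebuild                                  # no tag yet: always build
  elif git diff --quiet "$last_tag" HEAD -- "$2"; then
    echo reuse                                    # nothing changed under the path
  else
    echo rebuild
  fi
}

# Demo in a throwaway repository
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email ci@example.com
git config user.name ci
mkdir -p src/api
echo v1 > src/api/Program.cs
git add . && git commit -qm "api v1"
git tag api-v1.0.0
detect api src/api        # prints: reuse
echo v2 >> src/api/Program.cs
git add . && git commit -qm "api v2"
detect api src/api        # prints: rebuild
```

This also makes the edge cases visible: with no matching tag the first branch always rebuilds, and deleting a tag manually changes what `git diff` compares against.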
PR Description Parsing Failures¶
Issue: Malformed YAML in PR description can cause config parsing to fail silently.
Scenarios:
- Invalid YAML syntax → defaults used
- Missing `#preview-config` marker → block ignored
- Unsupported settings → ignored (security feature)
Debug: Check workflow logs for "Parse PR description for preview config" step.
Lambda S3 Prefix Routing¶
Issue: File uploads must use correct S3 prefix or they won't trigger the Lambda.
Expected prefix: preview/pr-{number}/Projects/{projectId}/...
Common mistakes:
- Using production prefix (`Projects/...`) → triggers production Lambda
- Missing `preview/pr-{number}` prefix → no Lambda triggered
- Wrong PR number in prefix → wrong Lambda triggered
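These mistakes can be caught before uploading with a small guard. `check_prefix` is a hypothetical helper that only mirrors the rules above; the actual routing is configured on the S3 bucket notification:

```shell
# Illustrative prefix guard; mirrors the routing rules, does not replace them.
check_prefix() {  # $1=PR number  $2=S3 object key
  case "$2" in
    "preview/pr-$1/Projects/"*) echo "ok" ;;
    "Projects/"*)               echo "would trigger the PRODUCTION Lambda" ;;
    preview/pr-*)               echo "wrong PR number: routes to another preview Lambda" ;;
    *)                          echo "no Lambda triggered" ;;
  esac
}

check_prefix 123 "preview/pr-123/Projects/p1/file.pdf"   # ok
check_prefix 123 "Projects/p1/file.pdf"                  # would trigger the PRODUCTION Lambda
check_prefix 123 "preview/pr-999/Projects/p1/file.pdf"   # wrong PR number: routes to another preview Lambda
```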
Label Interaction Matrix¶
Understanding how labels interact is critical for predictable preview behavior.
Label Precedence Rules¶
Decision Flow:
Has 'persist-db' label?
├── YES → Skip ALL database operations (highest priority)
│ Database preserved exactly as-is
└── NO → Has 'use-snapshot' label?
├── YES → Run snapshot-restore job (copies from syrf_snapshot)
│ Skip db-reset job (snapshot handles initialization)
└── NO → Were MongoDB services rebuilt?
├── YES → Run db-reset job (drop all collections)
└── NO → Skip db-reset job (no changes to reset for)
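The decision flow above can be condensed into a small sketch (not the workflow's actual code; inputs are simple true/false strings):

```shell
# Condensed rendition of the label precedence rules.
plan_db_jobs() {  # $1=persist-db  $2=use-snapshot  $3=services rebuilt
  if [ "$1" = true ]; then
    echo "skip all database jobs"        # persist-db wins unconditionally
  elif [ "$2" = true ]; then
    echo "run snapshot-restore"          # snapshot handles initialization
  elif [ "$3" = true ]; then
    echo "run db-reset"                  # rebuilt services get a clean DB
  else
    echo "skip (nothing to reset for)"
  fi
}

plan_db_jobs true  true  true    # skip all database jobs
plan_db_jobs false true  true    # run snapshot-restore
plan_db_jobs false false true    # run db-reset
```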
Complete Label Scenario Matrix¶
| persist-db | use-snapshot | Services Rebuilt | db-reset Job | snapshot-restore Job | Result |
|---|---|---|---|---|---|
| ❌ | ❌ | ❌ | ❌ Skip | ❌ Skip | Database unchanged |
| ❌ | ❌ | ✅ | ✅ Run | ❌ Skip | Collections dropped, empty database |
| ❌ | ✅ | ❌ | ❌ Skip | ✅ Run | Snapshot data copied |
| ❌ | ✅ | ✅ | ❌ Skip | ✅ Run | Snapshot data copied (reset skipped) |
| ✅ | ❌ | ❌ | ❌ Skip | ❌ Skip | Database preserved |
| ✅ | ❌ | ✅ | ❌ Skip | ❌ Skip | Database preserved despite rebuild |
| ✅ | ✅ | ❌ | ❌ Skip | ❌ Skip | Database preserved (persist wins) |
| ✅ | ✅ | ✅ | ❌ Skip | ❌ Skip | Database preserved (persist wins) |
Key Insight: persist-db always wins. When present, no database initialization jobs run regardless of other labels or rebuild status.
Why use-snapshot Skips db-reset¶
When use-snapshot is true, the db-reset job is NOT generated because:
- Redundant: The `$out` aggregation in snapshot-restore completely replaces target collections
- Order Problem: db-reset runs at sync wave -1 (AFTER PreSync hooks), so it would DROP data that snapshot-restore just copied
- Efficiency: No need to drop then copy; just copy (which replaces)
ArgoCD Sync Wave Ordering¶
Preview environments use ArgoCD sync waves to ensure resources are created in the correct order.
Complete Sync Order¶
Wave -5: AtlasDatabaseUser (MongoDB user creation)
↓ MongoDB Atlas Operator creates user in Atlas
↓ Connection secret becomes available
Wave -2: db-reset RBAC resources (if db-reset enabled)
- ServiceAccount: db-reset-sa
- Role: db-reset-marker-role
- RoleBinding: db-reset-marker-binding
Wave -1: db-reset Job (if db-reset enabled)
↓ Drops all collections in syrf_pr_{number}
↓ Creates completion marker ConfigMap
Wave 0+: Application services (API, PM, Quartz, Web)
↓ Services start with fresh/empty database
PreSync Hooks (snapshot-restore)¶
When use-snapshot label is present, PreSync hooks run BEFORE the sync wave sequence:
[PreSync Phase - runs before wave sequence]
PreSync Wave 1: snapshot-restore RBAC resources
- ServiceAccount: db-reset-sa
- Role: db-reset-marker-role
- RoleBinding: db-reset-marker-binding
PreSync Wave 3: snapshot-restore Job
↓ Copies 11 collections from syrf_snapshot to syrf_pr_{number}
↓ Creates completion marker ConfigMap
[/PreSync Phase]
Wave -5: AtlasDatabaseUser (already has syrf_snapshot read role)
Wave 0+: Application services start with snapshot data
Hook Delete Policies¶
| Resource Type | Delete Policy | Behavior |
|---|---|---|
| db-reset RBAC | None (regular sync resource) | Persists as long as file exists in git |
| snapshot-restore RBAC | `BeforeHookCreation` | Deleted before next sync, recreated |
| snapshot-restore Job | `BeforeHookCreation` | Old job deleted before new one created |
Why BeforeHookCreation? Resources persist throughout the current sync (Job can use them), then get cleaned up before the next sync. HookSucceeded would delete them immediately after creation, before the Job runs.
Database Coordination with Init Containers¶
Preview environments use a sophisticated coordination mechanism to ensure services don't start until the database is ready. This prevents race conditions where services might try to access a database that hasn't been seeded yet.
Architecture Overview¶
PR Preview Environment
└── pr-{number} (Parent Application - App-of-Apps)
├── pr-{number}-infrastructure (AUTO-SYNC ✓)
│ ├── Namespace, ExternalSecret, AtlasDatabaseUser
│ └── DatabaseLifecycle CR → manages "db-ready" ConfigMap
│
├── pr-{number}-api (AUTO-SYNC ✓)
├── pr-{number}-project-management (AUTO-SYNC ✓)
├── pr-{number}-quartz (AUTO-SYNC ✓)
└── pr-{number}-web (AUTO-SYNC ✓)
└── Init containers wait for ConfigMap with MATCHING seedVersion
How seedVersion Matching Works¶
Simply waiting for a db-ready ConfigMap is NOT sufficient. Here's why:
Race Condition During Reseed:
1. New seedVersion pushed to cluster-gitops
2. ArgoCD syncs both infrastructure AND service apps (independently!)
3. Service pods restart (annotation changed)
4. Init container checks for db-ready ConfigMap
5. OLD ConfigMap still exists (operator hasn't updated it yet)
6. Init container PASSES with stale ConfigMap ← BUG!
7. Pods start while database is being reseeded ← DATA CORRUPTION
The Fix: Init containers wait for ConfigMap with matching seedVersion:
# ConfigMap created by DatabaseLifecycle operator
apiVersion: v1
kind: ConfigMap
metadata:
name: db-ready
namespace: pr-{number}
data:
status: "ready"
seedVersion: "abc123" # Must match pod's expected version
seededAt: "2026-01-17T12:00:00Z"
sourceDatabase: "syrf_snapshot"
Per-Service waitForDatabase Configuration¶
Not all services need to wait for the database. The waitForDatabase flag is configured per-service in cluster-gitops:
| Service | waitForDatabase | Reason |
|---|---|---|
| api | `true` | Connects to MongoDB |
| project-management | `true` | Connects to MongoDB |
| quartz | `true` | Connects to MongoDB |
| web | `false` | Frontend only, no database |
| docs | `false` | Static documentation site |
| user-guide | `false` | Static documentation site |
Configuration Location: cluster-gitops/syrf/environments/preview/services/{service}/config.yaml
# Example: api/config.yaml
serviceName: api
hostPrefix: "api."
imageRepo: ghcr.io/camaradesuk/syrf-api
waitForDatabase: true # Init container waits for db-ready ConfigMap
Init Container Behavior¶
Services with waitForDatabase: true get an init container added automatically:
initContainers:
- name: wait-for-database
image: bitnami/kubectl:latest
command: ['sh', '-c']
args:
- |
EXPECTED_VERSION="${SEED_VERSION}"
echo "Waiting for db-ready ConfigMap with seedVersion=$EXPECTED_VERSION..."
while true; do
CURRENT=$(kubectl get configmap db-ready -n ${NAMESPACE} \
-o jsonpath='{.data.seedVersion}' 2>/dev/null || echo "")
if [ "$CURRENT" = "$EXPECTED_VERSION" ]; then
echo "Database is ready with correct seedVersion!"
exit 0
fi
echo "Current: '$CURRENT', Expected: '$EXPECTED_VERSION'. Waiting..."
sleep 5
done
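The matching rule can be exercised locally by polling a file in place of the ConfigMap. This is only a simulation of the gate, not the deployed script; the real container polls with kubectl as shown above:

```shell
# Local simulation: a file stands in for the db-ready ConfigMap's seedVersion.
EXPECTED_VERSION=abc123
state=$(mktemp)
printf 'stale000\n' > "$state"                 # operator has not reseeded yet

( sleep 1; printf 'abc123\n' > "$state" ) &    # "operator" finishes seeding later

while true; do
  CURRENT=$(cat "$state" 2>/dev/null || echo "")
  if [ "$CURRENT" = "$EXPECTED_VERSION" ]; then
    echo "Database is ready with correct seedVersion!"
    break
  fi
  sleep 0.2                                    # the real container sleeps 5s
done
wait
```

Note that the stale value present at startup never passes the gate, which is exactly the property that prevents the reseed race described above.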
DatabaseLifecycle Operator¶
The DatabaseLifecycle operator (deployed in the cluster) handles database seeding:
- Watches DatabaseLifecycle custom resources
- Waits for watched deployments to have 0 ready replicas (via the `watchedDeployments` field)
- Seeds database from snapshot when `seedVersion` changes
- Creates/Updates the db-ready ConfigMap after successful seeding
Coordination Flow:
Init containers block → Pods stuck at Init:0/1 → Operator sees 0 ready pods
→ Operator seeds database → Operator creates db-ready ConfigMap
→ Init containers pass → Main containers start
Key Design Principle: The operator does NOT scale down services. Instead:
- Init containers block pods from starting (by waiting for db-ready ConfigMap)
- Operator waits for watched deployments to have 0 ready replicas
- This happens naturally because init containers prevent pods from becoming ready
watchedDeployments Configuration (in the DatabaseLifecycle CR): services with waitForDatabase: true get the label `syrf.org.uk/uses-database=true` automatically.
Scenario Walkthroughs¶
Initial Deployment:
1. PR gets 'preview' + 'use-snapshot' labels
2. Workflow pushes seedVersion to cluster-gitops
3. ArgoCD syncs ApplicationSet, creates apps for infrastructure + services
4. Apps sync in parallel:
- Infrastructure: creates DatabaseLifecycle CR with watchedDeployments
- Services: create Deployments with init containers
5. Service pods enter Init:0/1 state, blocked waiting for db-ready ConfigMap
6. DatabaseLifecycle operator:
- Checks watchedDeployments → all have 0 ready replicas ✓
- Seeds database from snapshot
- Creates db-ready ConfigMap with seedVersion
7. Init containers detect matching seedVersion → pods start
Normal Code Push (no database changes):
1. Developer pushes code
2. Workflow pushes new headSha, same seedVersion
3. ArgoCD syncs child apps
4. Kubernetes does rolling update
5. Init container checks ConfigMap - seedVersion matches!
6. Init container passes immediately (~1 second)
7. No database work needed
Reseed Trigger (seedVersion changes):
1. New seedVersion pushed to cluster-gitops
2. ArgoCD syncs apps (infrastructure + services)
3. Services use Recreate strategy → old pods terminated first
4. New pods created with new seedVersion
5. New pods enter Init:0/1 state (waiting for ConfigMap with new seedVersion)
6. DatabaseLifecycle operator:
- Checks watchedDeployments → all have 0 ready replicas ✓
- Drops database, seeds from snapshot
- Updates ConfigMap with new seedVersion
7. Init containers detect matching seedVersion → pods start
Manual Reseed via /reseed-db Command¶
To trigger a database reseed on an existing preview environment, comment /reseed-db on the PR:
Command: Comment /reseed-db on any PR with the preview label.
What happens:
1. Workflow detects /reseed-db command in comment
2. Checks that PR has 'preview' label and NOT 'persist-db' label
3. Updates seedVersion in pr.yaml (single source of truth)
4. ArgoCD detects change, syncs all apps
5. Services restart (Recreate strategy terminates old pods first)
6. Init containers wait for new db-ready ConfigMap
7. Operator reseeds database, creates ConfigMap with new seedVersion
8. Services start successfully
Blocked when:
- `persist-db` label present → "Remove persist-db label first"
- `preview` label missing → "Add preview label first"
Use cases:
- Database corruption during testing
- Want fresh snapshot data
- Recovering from stuck deployments (Init:0/1 state)
- Schema migration testing
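The guard can be sketched as a small function over the PR's label names. This is illustrative only; the exact check lives in the workflow:

```shell
# Sketch of the /reseed-db guard: persist-db blocks, preview is required.
reseed_decision() {  # $@ = PR label names
  case " $* " in
    *" persist-db "*) echo "blocked: Remove persist-db label first" ;;
    *" preview "*)    echo "allowed: update seedVersion in pr.yaml" ;;
    *)                echo "blocked: Add preview label first" ;;
  esac
}

reseed_decision preview use-snapshot   # allowed: update seedVersion in pr.yaml
reseed_decision preview persist-db     # blocked: Remove persist-db label first
reseed_decision enhancement            # blocked: Add preview label first
```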
Recreate Deployment Strategy¶
Services with waitForDatabase: true use Recreate deployment strategy instead of RollingUpdate:
Why? RollingUpdate keeps old pods running until new pods are ready. With init containers blocking new pods, this creates a deadlock:
RollingUpdate (would cause deadlock):
1. Old pods: 1/1 Running (no init container)
2. New pods: Init:0/1 (waiting for db-ready ConfigMap)
3. Operator waits for 0 ready replicas (watchedDeployments check)
4. Old pod has 1 ready replica → operator waits forever
5. db-ready ConfigMap never created → new pods wait forever
Recreate strategy breaks the deadlock:
1. Recreate terminates old pods first
2. 0 ready replicas achieved
3. Operator safe to seed
4. ConfigMap created
5. New pods start
| Environment | Strategy | Reason |
|---|---|---|
| Production | RollingUpdate | Zero-downtime required |
| Staging | RollingUpdate | Zero-downtime preferred |
| Preview (waitForDatabase=false) | RollingUpdate | No coordination needed |
| Preview (waitForDatabase=true) | Recreate | Enables safe database seeding |
This tradeoff is acceptable for previews because brief downtime during deployments is tolerable in non-production environments.
Startup Probe for MongoDB Index Creation¶
Services in preview environments with waitForDatabase: true have a startupProbe to handle slow MongoDB startup:
Problem: MongoDB creates indexes on freshly seeded databases, taking 60+ seconds. The default liveness probe allows only ~90 seconds total before killing the pod, causing restart loops.
Solution: A startupProbe runs only during initial startup, allowing up to 310 seconds before liveness probes take over.
# Configured automatically in _deployment-dotnet.tpl
startupProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 30 # 10 + (30 * 10) = 310 seconds max
| Environment | Max Startup Time |
|---|---|
| Production/Staging | 90 seconds (liveness probe only) |
| Preview (waitForDatabase=true) | 310 seconds (startup probe) |
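As a quick arithmetic check of the probe budget from the template above (first probe after initialDelaySeconds, then up to failureThreshold probes at periodSeconds intervals):

```shell
# Startup probe budget from the template values above.
initial=10; period=10; threshold=30
budget=$(( initial + threshold * period ))
echo "max startup time: ${budget}s"    # max startup time: 310s
```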
Symptoms of insufficient startup time (fixed by startup probe):
- Pods showing `0/1 Running` with multiple restarts
- Logs showing "Now listening" 60+ seconds after startup begins
- Eventual success after 3-5 restarts (when indexes are cached)
Related Documentation¶
For detailed architecture documentation, see:
Kyverno Security Policy¶
A Kyverno ClusterPolicy enforces that PR preview database users can only access appropriate databases.
Policy: atlas-block-production-access¶
Location: cluster-gitops/plugins/helm/kyverno/resources/atlas-pr-user-policy.yaml
Enforcement: Blocks creation/update of AtlasDatabaseUser resources in pr-* namespaces that violate rules.
Rules Summary¶
| Rule | Blocked Pattern | Purpose |
|---|---|---|
| 1. `block-any-database-roles` | `readWriteAnyDatabase`, `dbAdminAnyDatabase`, `root` | Prevent broad access |
| 2. `block-production-database` | `databaseName: syrftest` | Protect production data |
| 3. `block-staging-database` | `databaseName: syrf_staging` | Protect staging data |
| 4. `block-admin-database` | `databaseName: admin` | Protect system database |
| 5. `validate-pr-database-pattern` | See below | Enforce allowed patterns |
Rule 5: Allowed Database Access¶
PR users can ONLY have:
| Database Pattern | Allowed Roles | Purpose |
|---|---|---|
| `syrf_pr_*` | Any (`readWrite`, `dbOwner`, etc.) | PR-specific database |
| `syrf_snapshot` | `read` ONLY | Snapshot data source |
Denied Examples:
# ❌ BLOCKED - syrf_snapshot with readWrite
roles:
- roleName: readWrite
databaseName: syrf_snapshot
# ❌ BLOCKED - accessing production
roles:
- roleName: read
databaseName: syrftest
# ❌ BLOCKED - accessing unrecognized database
roles:
- roleName: readWrite
databaseName: some_other_db
Allowed Examples:
# ✅ ALLOWED - PR database with readWrite
roles:
- roleName: readWrite
databaseName: syrf_pr_123
# ✅ ALLOWED - PR database + snapshot read
roles:
- roleName: readWrite
databaseName: syrf_pr_123
- roleName: read
databaseName: syrf_snapshot
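The decision table can be mirrored in shell for quick reasoning; actual enforcement is the Kyverno ClusterPolicy, and this sketch only restates its rules:

```shell
# Mirrors the five policy rules for a single (roleName, databaseName) pair.
check_role() {  # $1=roleName  $2=databaseName
  case "$1" in
    readWriteAnyDatabase|dbAdminAnyDatabase|root)
      echo "BLOCKED: any-database role"; return ;;
  esac
  case "$2" in
    syrftest)      echo "BLOCKED: production database" ;;
    syrf_staging)  echo "BLOCKED: staging database" ;;
    admin)         echo "BLOCKED: admin database" ;;
    syrf_pr_*)     echo "ALLOWED" ;;
    syrf_snapshot) if [ "$1" = read ]; then echo "ALLOWED"
                   else echo "BLOCKED: snapshot is read-only"; fi ;;
    *)             echo "BLOCKED: unrecognized database" ;;
  esac
}

check_role readWrite syrf_pr_123     # ALLOWED
check_role read      syrf_snapshot   # ALLOWED
check_role readWrite syrf_snapshot   # BLOCKED: snapshot is read-only
check_role read      syrftest        # BLOCKED: production database
```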
Policy Violation Response¶
If a PR attempts to create an AtlasDatabaseUser that violates the policy:
- Kyverno blocks the resource creation
- ArgoCD sync fails with policy violation message
- PR preview deployment halts at the AtlasDatabaseUser step
- GitHub Deployment status shows failure
Fix: Remove the violating role from the AtlasDatabaseUser definition in the workflow.
Snapshot Producer (Weekly Data Snapshots)¶
The snapshot-producer CronJob creates weekly copies of production data to the syrf_snapshot database, which is then used by preview environments with the use-snapshot label.
How It Works¶
Weekly Schedule (Sunday 3 AM UTC)
↓
Snapshot Producer CronJob starts
↓
1. Test connectivity to source (Cluster0) and target (Preview) clusters
↓
2. For each collection (11 total):
- Count source documents
- Stream copy: mongodump | mongorestore (no disk writes)
- Verify target document count
- Retry up to 3 times on failure
↓
3. Write snapshot_metadata document with:
- Timestamp, duration, document counts
- Source/target cluster info
- Collections copied
↓
Preview environments can now use fresh snapshot data
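The per-collection retry behaviour (step 2 above) can be sketched like this. `copy_collection` here is a stand-in that fails twice then succeeds; the real job streams mongodump into mongorestore with no intermediate disk writes:

```shell
# Sketch of the retry loop (retry.maxAttempts defaults to 3).
copy_with_retry() {  # $1 = collection name
  fails=0
  until copy_collection "$1"; do
    fails=$((fails + 1))
    if [ "$fails" -ge 3 ]; then
      echo "giving up on $1 after 3 attempts" >&2
      return 1
    fi
    sleep 1   # the deployed job waits retry.delaySeconds (default 30)
  done
  echo "$1 copied"
}

# Demo stand-in: fails on the first two calls, succeeds on the third.
calls=$(mktemp); echo 0 > "$calls"
copy_collection() {
  n=$(cat "$calls"); echo $((n + 1)) > "$calls"
  [ "$n" -ge 2 ]
}

copy_with_retry pmProject    # pmProject copied
```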
Collections Copied¶
The following collections are copied from syrftest (production) to syrf_snapshot:
| Collection | Description |
|---|---|
| `pmProject` | Projects with stages, memberships, questions |
| `pmStudy` | Studies with screening, extraction, annotations |
| `pmInvestigator` | User accounts and profiles |
| `pmSystematicSearch` | Literature searches |
| `pmDataExportJob` | Export job tracking |
| `pmStudyCorrection` | PDF correction requests |
| `pmInvestigatorUsage` | Usage statistics |
| `pmRiskOfBiasAiJob` | AI risk-of-bias jobs |
| `pmProjectDailyStat` | Daily statistics |
| `pmPotential` | Potential references |
| `pmInvestigatorEmail` | Email records |
Snapshot Metadata¶
After each successful run, a metadata document is written to syrf_snapshot.snapshot_metadata:
{
_id: "latest",
createdAt: ISODate("2026-01-26T03:45:00Z"),
startedAt: ISODate("2026-01-26T03:00:00Z"),
finishedAt: ISODate("2026-01-26T03:45:00Z"),
durationSeconds: 2700,
sourceCluster: "Cluster0",
sourceDatabase: "syrftest",
sourceHost: "cluster0-pri.siwfo.mongodb.net",
targetCluster: "Preview",
targetDatabase: "syrf_snapshot",
targetHost: "preview-pri.siwfo.mongodb.net",
collections: ["pmProject", "pmStudy", ...],
collectionsCount: 11,
documentCounts: {
pmProject: 1234,
pmStudy: 56789,
// ...
},
totalDocuments: 123456,
method: "mongodump | mongorestore streaming",
crossCluster: true,
status: "complete"
}
Preview environments can query this document to verify snapshot freshness before using data.
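A hedged sketch of such a freshness check follows. Timestamps are hard-coded for illustration; in practice `createdAt` would come from a mongosh query against `snapshot_metadata`, and `date -d` here assumes GNU date:

```shell
# Flag snapshots older than the weekly cadence.
created_at="2026-01-26T03:45:00Z"   # snapshot_metadata.createdAt (hard-coded)
now="2026-02-05T09:00:00Z"          # substitute "$(date -u +%FT%TZ)" in real use
age_days=$(( ( $(date -u -d "$now" +%s) - $(date -u -d "$created_at" +%s) ) / 86400 ))
echo "snapshot age: $age_days days"          # snapshot age: 10 days
if [ "$age_days" -gt 7 ]; then
  echo "stale: check the weekly CronJob in staging"
fi
```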
Configuration¶
The snapshot-producer is deployed to the staging namespace. Configuration is in:
- Chart: `cluster-gitops/charts/snapshot-producer/`
- Values: `cluster-gitops/plugins/local/snapshot-producer/values.yaml`
Key configuration options:
| Setting | Default | Description |
|---|---|---|
| `schedule` | `"0 3 * * 0"` | Cron schedule (Sunday 3 AM UTC) |
| `activeDeadlineSeconds` | `3600` | 1 hour timeout |
| `retry.maxAttempts` | `3` | Retries per collection |
| `retry.delaySeconds` | `30` | Delay between retries |
| `streaming.gzip` | `true` | Compress data in transit |
Manual Trigger¶
To manually trigger a snapshot (useful for testing or recovery):
# Create a one-time Job from the CronJob
kubectl create job --from=cronjob/snapshot-producer snapshot-manual-$(date +%s) -n staging
# Watch progress
kubectl logs -f job/snapshot-manual-<timestamp> -n staging
Troubleshooting¶
Snapshot job fails with connection error:
- Check MongoDB Atlas network access (IP allowlist)
- Verify credentials in the `snapshot-producer-credentials` secret
- Check if VPC Peering is configured for `-pri` hostnames
Snapshot takes too long:
- Large collections may exceed the 1-hour timeout
- Consider increasing `activeDeadlineSeconds`
Preview shows stale data:
- Check the `snapshot_metadata.createdAt` timestamp
- Verify the CronJob is running: `kubectl get cronjob -n staging`
- Check recent job status: `kubectl get jobs -n staging | grep snapshot`
For detailed chart configuration, see Snapshot Producer Reference.
Future Enhancements¶
Planned improvements for PR previews:
- Performance metrics: Show load times and resource usage
- Visual regression: Screenshot comparison with base branch
- E2E tests: Automated testing against preview environment
- Cost tracking: Monitor resource usage per preview
- Orphan database cleanup: Automated cleanup of `syrf_pr_*` databases
Recently Completed:
- ✅ Database per preview: Isolated MongoDB database per PR (syrf_pr_{number})
- ✅ Seed data: Auto-populate preview with test data on first startup
- ✅ GitHub Deployments: Native GitHub UI integration with deployment tracking
- ✅ Snapshot restore: Use production data snapshot via `use-snapshot` label
- ✅ Kyverno security: Policy enforcement for PR database access patterns