Using PR Preview Environments¶
Purpose¶
This guide explains how to use PR (Pull Request) preview environments for testing changes before merging to main. Preview environments are ephemeral, automatically-deployed instances of the SyRF platform that match your PR's code.
What Are PR Preview Environments?¶
PR preview environments provide:
- Isolated testing: Each PR gets its own complete environment
- Automatic deployment: ArgoCD deploys your changes automatically
- Unique URLs: Access your preview at pr-{number}.syrf.org.uk
- Auto-cleanup: Environments are deleted when PR closes
- Full stack: All 6 services (API, PM, Quartz, Web, Docs, User Guide) deployed together
- GitHub Deployments: Native GitHub UI integration with clickable environment URLs
How It Works¶
1. Open PR (any PR to main)
↓
2. pr-tests.yml runs automatically:
- test-dotnet (if .NET changed) ─┬─ MUST PASS
- test-web (if Angular changed) ─┘
↓
3. Add 'preview' label (optional, for preview environment)
↓
4. TWO workflows trigger in parallel:
│
├─ pr-preview.yml (Kubernetes services):
│ - Tag-based detection: find last tag for each service
│ - Build Docker images for changed services
│ - Write version files to cluster-gitops
│ - ArgoCD deploys to pr-{number} namespace
│ - Updates PR description with K8s status
│
└─ pr-preview-lambda.yml (S3 Notifier Lambda):
- Build Lambda package from s3-notifier code
- Deploy syrfAppUploadS3Notifier-pr-{number} to AWS
- Configure S3 trigger for preview/pr-{number}/ prefix
- Updates PR description with Lambda status
↓
5. ArgoCD ApplicationSet detects version files
↓
6. ArgoCD creates namespace: pr-{number}
↓
7. ArgoCD deploys all services to preview namespace
↓
8. Preview URLs and file upload processing available within 5 minutes
↓
9. PR description shows unified status table at top
PR Description Status¶
Preview status is displayed directly in the PR description (not comments) so it stays visible at the top of the PR. Both workflows update the same status table:
| Component | Status |
|---|---|
| S3 Notifier Lambda | ✅ 0.1.5 |
| K8s Services | ✅ Ready |
Preview URLs (once ArgoCD syncs):
- 🌐 Web: https://pr-{number}.syrf.org.uk
- 🔌 API: https://api.pr-{number}.syrf.org.uk
- 📁 S3 Prefix: preview/pr-{number}/
This approach:
- Keeps status always visible (not buried in comments)
- Shows both Lambda and K8s status in one place
- Includes version numbers and links to workflow runs
GitHub Deployments¶
Preview environments are tracked via the GitHub Deployments API, providing native integration with GitHub's UI:
Where to Find Deployments:
- PR Sidebar: Look for the "Environments" section with a clickable pr-{number} link
- Repository Deployments: Navigate to repository → Deployments tab
- Commit Status: Deployment status appears on commits in the PR
Deployment Status Flow:
| Status | Description | Trigger |
|---|---|---|
| pending | Deployment created, waiting to start | PR preview workflow starts |
| in_progress | Building Docker images | Build job begins |
| queued | Pushed to GitOps, waiting for ArgoCD | After cluster-gitops push |
| success | ArgoCD sync complete, preview live | ArgoCD PostSync hook |
| failure | Build or deployment failed | Workflow failure |
| inactive | Environment cleaned up | PR closed/label removed |
Benefits:
- Click environment URL directly from PR sidebar
- See deployment history for the PR
- Track deployment state changes
- Automatic cleanup marking when PR closes
Fork PR Limitation: PRs from forked repositories don't create GitHub Deployments due to token permission restrictions. The preview environment still deploys normally, but without the GitHub Deployment tracking.
Why Two Workflows?¶
The preview environment needs both Kubernetes services AND a Lambda function to work properly:
| Component | Workflow | Purpose |
|---|---|---|
| API, PM, Quartz, Web | pr-preview.yml | Application logic, UI, background jobs |
| S3 Notifier Lambda | pr-preview-lambda.yml | File upload notifications to RabbitMQ |
Without the Lambda, file uploads in preview environments wouldn't trigger the study import process. The Lambda listens to S3 events on the preview/pr-{number}/ prefix and publishes messages to RabbitMQ.
Automated Testing¶
All PRs run automated tests via the pr-tests.yml workflow, regardless of whether they have a preview environment.
Test Workflow¶
| Job | Trigger | Timeout | What it tests |
|---|---|---|---|
| test-dotnet | .NET code changed | 10 min | xUnit tests for API, PM, Quartz, S3 Notifier |
| test-web | Angular code changed | 5 min | Vitest tests with coverage thresholds |
Both jobs run in parallel after change detection.
Coverage Requirements¶
Angular tests enforce minimum coverage thresholds:
| Metric | Threshold |
|---|---|
| Statements | 50% |
| Branches | 40% |
| Functions | 50% |
| Lines | 50% |
Tests fail if coverage drops below these thresholds.
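As a minimal illustration of the gate (a toy sketch, not the actual Vitest configuration), the pass/fail decision amounts to comparing a measured percentage against the threshold:

```shell
# Toy sketch of the coverage gate: compare a measured percentage against
# the statement threshold from the table above. Values are illustrative.
MEASURED=62    # example: measured statement coverage, percent
THRESHOLD=50   # statement threshold enforced in CI
if [ "$MEASURED" -ge "$THRESHOLD" ]; then
  RESULT="pass"
else
  RESULT="fail"
fi
echo "coverage gate: $RESULT"
```

The same comparison applies independently to branches, functions, and lines.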
Test Results¶
Test results are uploaded as GitHub Actions artifacts:
- dotnet-test-results: TRX test result files
- dotnet-coverage: Cobertura XML coverage
- web-coverage: lcov, cobertura, text coverage reports
- web-test-results: JUnit XML test results
Code Quality (SonarCloud)¶
Coverage and code quality are analyzed by SonarCloud which provides:
- Quality Gate: Pass/fail status based on coverage, bugs, and code smells
- PR Decoration: Inline comments on issues in changed code
- Coverage Report: Line-by-line coverage visualization
- Security Analysis: Detection of vulnerabilities and hotspots
See How to: Run Tests for local testing and SonarCloud setup.
Version File Structure (in cluster-gitops):
syrf/environments/preview/
services/ # Service defaults (all previews)
api/
config.yaml # hostPrefix, imageRepo
values.yaml # Default Helm values
web/
config.yaml
values.yaml
...
pr-123/
pr.yaml # PR trigger file (prNumber, headSha, branch)
namespace.yaml # Kubernetes namespace manifest
services/ # PR-specific values (one file per service)
api.values.yaml # image.tag from git tag or new build
project-management.values.yaml
quartz.values.yaml
web.values.yaml
docs.values.yaml
user-guide.values.yaml
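For reference, a pr.yaml trigger file might look like this (the field names come from the comment in the tree above; the values are illustrative):

```yaml
# Illustrative pr.yaml - the PR trigger file ArgoCD's ApplicationSet watches.
prNumber: 123
headSha: 4f2a9c1d8e7b6a5f4e3d2c1b0a9f8e7d6c5b4a39
branch: feature/my-awesome-feature
```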
Prerequisites¶
- Open pull request in syrf repository
- Changes to at least one service
- GitHub Actions workflows enabled
- ArgoCD installed and configured (cluster requirement)
Optional: GCP Secrets for RabbitMQ Cleanup¶
When a PR is closed, the cleanup job can delete the PR's RabbitMQ vhost to free resources. This requires GCP credentials to access the GKE cluster.
Required GitHub Secrets (optional but recommended):
| Secret | Description | How to Get |
|---|---|---|
| GCP_WORKLOAD_IDENTITY_PROVIDER | Workload Identity Federation provider | projects/{project-number}/locations/global/workloadIdentityPools/{pool}/providers/{provider} |
| GCP_SERVICE_ACCOUNT | GCP service account email | {sa-name}@{project}.iam.gserviceaccount.com |
If not configured:
- RabbitMQ vhost cleanup is skipped (not a failure)
- Orphan vhosts accumulate until manually cleaned
- All other cleanup (Lambda, K8s namespace) still works
To configure (requires GCP admin):
- Create a Workload Identity Pool for GitHub Actions
- Create a service account with the container.developer role
- Add the secrets to GitHub repository settings
See GCP Workload Identity Federation documentation for setup details.
Creating a Preview Environment¶
Step 1: Create Your PR¶
# Create feature branch
git checkout -b feature/my-awesome-feature
# Make changes
# ... edit files ...
# Commit and push
git add .
git commit -m "feat(api): add awesome new feature"
git push origin feature/my-awesome-feature
Step 2: Open Pull Request¶
- Go to GitHub repository
- Click "Compare & pull request"
- Fill in PR title and description
- Add the preview label to the PR
- Click "Create pull request"
Step 3: Wait for Build¶
The pr-preview.yml workflow will automatically:
- ✅ Detect changed services
- ✅ Build Docker images with the pr-{number} tag
- ✅ Push images to GHCR
- ✅ Comment on PR with preview info
Build time: ~5-10 minutes depending on changed services
Step 4: ArgoCD Deploys¶
ArgoCD will automatically:
- ✅ Detect the PR (checks every 5 minutes)
- ✅ Create namespace pr-{number}
- ✅ Deploy all changed services
- ✅ Configure ingress with preview URLs
- ✅ Set up TLS certificates
Deployment time: ~5-10 minutes after build completes
Step 5: Access Your Preview¶
Once deployed, access your preview at:
- Web UI: https://pr-{number}.syrf.org.uk
- API: https://api.pr-{number}.syrf.org.uk
- PM Service: https://project-management.pr-{number}.syrf.org.uk
- Docs: https://docs.pr-{number}.syrf.org.uk
- User Guide: https://help.pr-{number}.syrf.org.uk
Replace {number} with your PR number (e.g., PR #42 → https://pr-42.syrf.org.uk)
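The URL pattern can be scripted for quick smoke checks. This sketch only composes the URLs; the /health probe path is an assumption (your API may expose a different endpoint), so it is left commented out:

```shell
# Compose preview URLs for a given PR number (pattern from the list above).
PR=42
WEB_URL="https://pr-${PR}.syrf.org.uk"
API_URL="https://api.pr-${PR}.syrf.org.uk"
echo "Web: ${WEB_URL}"
echo "API: ${API_URL}"
# curl -fsS "${API_URL}/health"   # probe path is an assumption
```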
Managing Preview Environments¶
Updating Your Preview¶
Push new commits to your PR branch:
The preview workflow will automatically:
- Rebuild changed images
- Push new images with the same pr-{number} tag
- ArgoCD detects the new image
- ArgoCD syncs updated deployment
Update time: ~10-15 minutes total
Checking Preview Status¶
GitHub Actions:
- Go to PR → "Checks" tab
- View "PR Preview Build" workflow
- Check which services were built
ArgoCD UI (if accessible):
- Open ArgoCD dashboard
- Look for the application syrf-pr-{number}
- View sync status and health
Kubernetes (if you have kubectl access):
# List preview namespaces
kubectl get namespaces | grep pr-
# Check pods in your preview
kubectl get pods -n pr-{number}
# View preview ingresses
kubectl get ingress -n pr-{number}
Disabling Preview¶
Remove the preview label from your PR:
- Go to PR on GitHub
- Click "Labels" gear icon
- Uncheck "preview"
- Both workflows trigger cleanup automatically:
- Lambda function deleted via Terraform
- K8s namespace deleted via ArgoCD
- PR description updated with cleanup status
- S3 files are preserved for debugging
Deleting Preview¶
Preview environments are automatically deleted when:
- PR is closed (merged or closed without merging)
- PR is converted to draft
- preview label is removed (triggers the unlabeled event)
Cleanup includes:
- Lambda function deletion
- K8s namespace and resources deletion
- Git tags cleanup
- PR description update showing cleanup reason
Preserved for debugging:
- S3 uploaded files under the preview/pr-{number}/ prefix
Cleanup time: Usually complete within 2-3 minutes
Cleanup Architecture¶
Understanding how preview cleanup works helps diagnose issues when cleanup fails.
ArgoCD Hook Lifecycle¶
Preview environments use ArgoCD PreSync hooks for database reset operations. These hooks have finalizers that control when resources can be deleted:
| Policy | When Finalizer Removed | Use Case |
|---|---|---|
| BeforeHookCreation | On next sync only | Resources that must persist across syncs |
| HookSucceeded | Immediately after success | Ephemeral resources (db-reset job) |
| HookFailed | Immediately after failure | Ephemeral resources (db-reset job) |
The db-reset resources use HookSucceeded,HookFailed so they clean up immediately after completion.
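In manifest form, the db-reset Job's hook configuration has roughly this shape (a sketch based on the policies above, not the actual manifest; the image and job body are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-reset                    # actual jobs are named db-reset-{N}-{sha}
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded,HookFailed
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: db-reset
          image: example/db-reset:latest   # illustrative image
```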
Cleanup Flow¶
PR Closed/Label Removed
↓
cleanup-tags job starts
↓
1. Pre-cleanup: Remove ArgoCD hook finalizers
(Prevents deadlock if hooks are still running)
↓
2. Delete PR version files from cluster-gitops
↓
3. ArgoCD detects missing files
↓
4. ArgoCD deletes Application
↓
5. Namespace and all resources deleted
Why Pre-Cleanup Matters¶
If git files are deleted before hooks complete, ArgoCD loses visibility of the hook annotations (they were in git). Without knowing the hook-delete-policy, ArgoCD doesn't know how to handle the finalizers, causing resources to get stuck in Terminating state indefinitely.
The pre-cleanup step removes argocd.argoproj.io/hook-finalizer from all hook resources before deleting git files, ensuring clean deletion regardless of hook state.
Resources with Hook Finalizers¶
| Resource | Finalizer Source | Cleanup Behavior |
|---|---|---|
| job/db-reset-{N}-{sha} | ArgoCD PreSync hook | Deleted immediately after completion |
| serviceaccount/db-reset-sa | ArgoCD PreSync hook | Deleted immediately after completion |
| role/db-reset-marker-role | ArgoCD PreSync hook | Deleted immediately after completion |
| rolebinding/db-reset-marker-binding | ArgoCD PreSync hook | Deleted immediately after completion |
| externalsecret/atlas-operator-api-key | ArgoCD PreSync hook | Persists until Application deletion |
| secret/mongodb-pr-password | ArgoCD PreSync hook | Persists until Application deletion |
| atlasdatabaseuser/pr-user | ArgoCD PreSync hook | Persists until Application deletion |
Troubleshooting¶
Preview Not Building¶
Symptom: No "PR Preview Build" workflow run
Causes:
- PR doesn't have the preview label
- No changes to service code (only docs/config changed)
- Workflow file syntax error
Fix:
- Add the preview label to the PR
- Make a small change to service code
- Check .github/workflows/pr-preview.yml syntax
Build Fails¶
Symptom: Red X on "PR Preview Build" workflow
Causes:
- Docker build errors
- TypeScript/compile errors
- Missing dependencies
Fix:
- Click on failed workflow run
- Review job logs
- Fix errors in your branch
- Push fix to trigger rebuild
Preview Not Deploying¶
Symptom: Build succeeds but no preview URLs work
Causes:
- ArgoCD not installed yet (cluster not ready)
- ApplicationSet not configured
- GHCR image pull issues
- DNS not configured
Fix (requires cluster access):
- Check that the ArgoCD application exists: kubectl get application -n argocd | grep pr-{number}
- Check the ApplicationSet: kubectl get applicationset -n argocd syrf-preview
- View ArgoCD logs: kubectl logs -n argocd -l app.kubernetes.io/name=argocd-applicationset-controller
404 Not Found on Preview URL¶
Symptom: Preview URL returns 404
Causes:
- Ingress not created yet (DNS propagation)
- Certificate not ready
- Service not healthy
Fix:
- Wait 5-10 minutes for DNS and cert
- Check service health: kubectl get pods -n pr-{number}
- Check ingress: kubectl get ingress -n pr-{number}
Preview Shows Old Code¶
Symptom: Preview doesn't reflect latest changes
Causes:
- Image tag not updated (cached)
- ArgoCD not synced yet
- Browser caching old assets
Fix:
- Check the image tag in the deployment: kubectl describe deployment -n pr-{number}
- Force an ArgoCD sync (if accessible)
- Hard refresh browser (Ctrl+Shift+R)
Namespace Stuck in Terminating State¶
Symptom: After PR closes, namespace shows Terminating status but doesn't delete. ArgoCD Application shows Unknown sync status.
Diagnosis:
# Check namespace status
kubectl get namespace pr-{number}
# Check for resources with finalizers blocking deletion
kubectl api-resources --verbs=list --namespaced -o name | \
xargs -I {} kubectl get {} -n pr-{number} \
-o custom-columns='KIND:.kind,NAME:.metadata.name,FINALIZERS:.metadata.finalizers' \
--ignore-not-found 2>/dev/null | grep -v "<none>"
# Check ArgoCD Application
kubectl get application pr-{number}-namespace -n argocd -o yaml | \
grep -A5 "deletionTimestamp\|finalizers"
Common Causes:
- ArgoCD hook finalizer stuck on Job - Most common cause
- External Secrets Operator waiting - ESO finalizers blocking deletion
- Atlas Database User finalizer - MongoDB operator cleanup pending
Manual Fix:
# Remove finalizer from stuck Job (find the exact name first)
kubectl get jobs -n pr-{number}
kubectl patch job {job-name} -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
# If RBAC resources are stuck
kubectl patch serviceaccount db-reset-sa -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
kubectl patch role db-reset-marker-role -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
kubectl patch rolebinding db-reset-marker-binding -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
# If MongoDB resources are stuck
kubectl patch externalsecret atlas-operator-api-key -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
kubectl patch secret mongodb-pr-password -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
kubectl patch atlasdatabaseuser pr-user -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
Prevention: This issue is prevented by the pre-cleanup step in the workflow. If you see this issue, the pre-cleanup step may have failed or GCP credentials weren't configured.
ArgoCD Application Won't Delete¶
Symptom: Application shows Unknown sync status and has deletionTimestamp set but won't complete deletion.
Causes:
- Resources with finalizers in the namespace blocking deletion
- ArgoCD controller unable to access the namespace
- Custom resource definitions (CRDs) blocking deletion
Fix:
# First, fix any stuck resources in the namespace (see above)
# If Application still won't delete, check its finalizers
kubectl get application pr-{number}-namespace -n argocd \
-o jsonpath='{.metadata.finalizers}'
# Remove ArgoCD resource finalizer (last resort)
kubectl patch application pr-{number}-namespace -n argocd --type=merge \
-p '{"metadata":{"finalizers":null}}'
Warning: Removing the Application finalizer skips ArgoCD's cascade deletion. Only do this after manually cleaning up namespace resources.
Best Practices¶
When to Use Previews¶
Good use cases:
- ✅ Testing new features before review
- ✅ Validating bug fixes with real data
- ✅ Demonstrating changes to stakeholders
- ✅ QA testing before merge
- ✅ Integration testing across services
Avoid for:
- ❌ Every single PR (resource intensive)
- ❌ Documentation-only changes
- ❌ Config-only changes
- ❌ Very small typo fixes
Resource Limits¶
Preview environments have reduced resources compared to staging:
| Service | Memory Limit | CPU Limit | Replicas |
|---|---|---|---|
| API | 256Mi | 250m | 1 |
| PM | 256Mi | 250m | 1 |
| Quartz | 256Mi | 250m | 1 |
| Web | 128Mi | 100m | 1 |
Implications:
- Slower response times than production
- Not suitable for load testing
- May hit memory/CPU limits under heavy use
Testing Checklist¶
Before merging, verify in your preview:
- [ ] Application starts without errors
- [ ] Core functionality works
- [ ] API endpoints respond correctly
- [ ] UI renders properly
- [ ] No console errors in browser
- [ ] Authentication works (if applicable)
- [ ] Database operations succeed
Cleanup¶
Always remove the preview label or close your PR when done testing to free up cluster resources.
Preview Environment Details¶
Namespace Structure¶
Each preview gets its own Kubernetes namespace:
pr-{number}/
├── deployments/
│ ├── syrf-api
│ ├── syrf-pm
│ ├── syrf-quartz
│ └── syrf-web
├── services/
│ ├── syrf-api
│ ├── syrf-pm
│ ├── syrf-quartz
│ └── syrf-web
├── ingresses/
│ ├── pr-{number}-api
│ ├── pr-{number}-pm
│ └── pr-{number}-web
├── configmaps/
│ └── preview-info
└── secrets/
└── (TLS certificates)
Angular Development Build¶
Preview environments build the Angular app with --configuration development instead of production. This provides:
- Verbose error messages: Full stack traces with component context
- Development mode checks: Extra change detection cycles to catch issues
- Source maps included: Debug TypeScript directly in browser DevTools
- Named chunks: Easier to identify modules when debugging
Trade-off: Bundle size is larger, but this is acceptable for preview testing.
devMode Feature Flag¶
Preview environments have the devMode feature flag enabled by default. This can be used in the app to:
- Enable additional console logging
- Show debug panels/overlays
- Enable verbose API request logging
- Show feature flag debugging UI
- Enable performance profiling helpers
Access via ngrx selector: selectDevMode
Environment Variables¶
Preview services run with:
- ASPNETCORE_ENVIRONMENT=Preview (.NET services)
- ENVIRONMENT=preview (Web service)
- REMOVE_SOURCEMAPS=false (Web keeps sourcemaps for debugging)
- SYRF__FeatureFlags__DevMode=true (Web devMode enabled)
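In Helm values form, the web service's preview environment block might look like this (the surrounding key structure is an assumption; the variable names and values come from the list above):

```yaml
# Illustrative preview values for the web service - env var names from
# this page, surrounding structure assumed.
env:
  ENVIRONMENT: preview
  REMOVE_SOURCEMAPS: "false"
  SYRF__FeatureFlags__DevMode: "true"
```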
Data Isolation¶
Important: Preview environments share the same backend resources as staging:
- Same MongoDB instance
- Same RabbitMQ instance
- Same S3 buckets
Be careful:
- Use test data only
- Don't delete production-like data
- Consider data namespace isolation in code
Customizing Your Preview¶
Feature Flags via PR Description¶
You can enable or disable feature flags for your preview environment by adding a config block to your PR description.
Format: Add a YAML code block starting with #preview-config:
Rules:
- The YAML block MUST start with #preview-config on the first line
- Top-level keys are service names (api, web, project-management, quartz, docs, user-guide)
- Only featureFlags is allowed - other settings are ignored for security
- The config is parsed when the preview workflow runs
Example - Enable multiple flags:
```yaml
#preview-config
web:
featureFlags:
newScreeningOverview: true
newStageOverview: true
api:
featureFlags:
enableBetaEndpoints: true
```
What You Can and Cannot Customize¶
| Setting | How to Customize |
|---|---|
| Feature flags | PR description (any PR creator) |
| Resources (memory, CPU) | Requires cluster-gitops write access |
| Logging level | Requires cluster-gitops write access |
| Replica count | Requires cluster-gitops write access |
Why this restriction? Resource limits are restricted to prevent accidental cost increases and ensure fair resource sharing across previews.
Service-Specific Defaults¶
Platform operators can set service-specific defaults for all previews in syrf/environments/preview/services/{service}/ (config.yaml and values.yaml in cluster-gitops).
These values apply to all preview environments and can set things like:
- Default feature flags for previews
- Resource limits for specific services
- Logging configuration
S3 Notifier Lambda (File Upload Processing)¶
Each preview environment gets its own AWS Lambda function to handle file uploads.
How It Works¶
User uploads file → S3 bucket → S3 Event Notification
↓
Lambda: syrfAppUploadS3Notifier-pr-{number}
↓
RabbitMQ message → Project Management service
↓
Study import processing begins
Lambda Details¶
| Aspect | Value |
|---|---|
| Function Name | syrfAppUploadS3Notifier-pr-{number} |
| S3 Prefix | preview/pr-{number}/ |
| Runtime | .NET 10 (linux-x64) |
| Managed By | Terraform (in camarades-infrastructure/) |
S3 Key Format¶
Files uploaded in preview environments use a special prefix: preview/pr-{number}/Projects/{projectId}/...
This ensures:
- Preview uploads don't interfere with production data
- Each PR's Lambda only processes its own files
- Files are isolated per PR for debugging purposes
Change Detection¶
The Lambda workflow uses smart change detection to avoid unnecessary deployments:
- On synchronize (new commits), checks whether s3-notifier code changed
- Compares the current commit against the last deployed SHA (stored in Lambda tags)
- Only deploys if files in src/services/s3-notifier/ changed since the last deploy
- The first deploy for a PR always runs (no previous SHA)
This prevents rebuilding the Lambda on every push when only K8s services changed.
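The detection logic can be sketched as a self-contained demo (the repository layout and SHAs here are illustrative; the real workflow reads the last deployed SHA from the Lambda's AWS tags):

```shell
# Self-contained demo of SHA-based change detection for the s3-notifier.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git -c user.email=ci@example -c user.name=ci commit -q --allow-empty -m base
LAST_SHA=$(git rev-parse HEAD)          # stand-in for the SHA in Lambda tags
mkdir -p src/services/s3-notifier
echo 'notifier code' > src/services/s3-notifier/Function.cs
git add -A
git -c user.email=ci@example -c user.name=ci commit -q -m "change notifier"
# Deploy only if anything under the s3-notifier path changed since LAST_SHA.
if git diff --quiet "$LAST_SHA" HEAD -- src/services/s3-notifier/; then
  DECISION="skip"
else
  DECISION="deploy"
fi
echo "Lambda: $DECISION"
```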
Cleanup¶
When a PR is closed or the preview label is removed, pr-preview-lambda.yml automatically:
- Preserves S3 files in preview/pr-{number}/ (for debugging)
- Removes the Lambda function via Terraform
- Deletes the Lambda package from S3 state bucket
- Updates PR description with cleanup status
Note: S3 files are intentionally preserved after cleanup to help debug any issues. They can be manually deleted if needed.
Related Documentation¶
- PR Preview Workflow: .github/workflows/pr-preview.yml
- PR Preview Lambda Workflow (archived): .github/workflows/archived/pr-preview-lambda.yml
- Lambda Permissions: configure-lambda-deployment-permissions.md
- ApplicationSet Config: See cluster-gitops repository
- CI/CD Workflow: ci-cd-workflow.md
- ArgoCD Docs: https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/
PR Labels Reference¶
Preview environments are controlled by GitHub labels. Each label has specific effects on the workflow behavior.
Label Effects Matrix¶
| Label | When Added | When Removed | Effects |
|---|---|---|---|
| preview | PR labeled | PR unlabeled | Add: Triggers full preview build/deploy. Remove: Triggers complete cleanup (namespace, Lambda, MongoDB user, tags) |
| persist-db | PR labeled | PR unlabeled | Add: Skips database reset on subsequent syncs (preserves data). Remove: Re-enables database reset, runs db-reset job on next sync |
| use-snapshot | PR labeled | PR unlabeled | Add: Uses production snapshot data instead of seed data. Remove: Reverts to seed data |
Label: preview¶
Purpose: Primary control for preview environment lifecycle.
When Added:
1. Workflow triggers on 'labeled' event
2. check-should-run job validates preview should be created
3. All service images built with pr-{number} tag
4. Version files written to cluster-gitops
5. ArgoCD detects files, creates namespace and Application
6. Lambda function deployed for S3 file notifications
7. GitHub Deployment created for tracking
When Removed:
1. Workflow triggers on 'unlabeled' event
2. cleanup-tags job executes
3. Pre-cleanup removes ArgoCD hook finalizers (prevents deadlock)
4. MongoDB database user deleted
5. Quartz SQL schema dropped
6. Git tags cleaned up
7. Version files deleted from cluster-gitops
8. ArgoCD detects missing files, deletes Application
9. Lambda function destroyed via Terraform
10. RabbitMQ vhost deleted (if GCP credentials configured)
11. GitHub Deployment marked inactive
Timing: Label changes are processed immediately when the workflow runs.
Label: persist-db¶
Purpose: Prevents database reset between preview environment syncs.
Behavior:
| persist-db Label State | Database Reset Job | Use Case |
|---|---|---|
| Not present (default) | Runs on every sync | Clean slate for each push |
| Present | Skipped | Preserve test data across commits |
When Added:
- The db-reset-job.yaml file is deleted from cluster-gitops
- ArgoCD no longer runs the PreSync hook to reset the database
- Existing data in syrf_pr_{number} is preserved
When Removed:
- The db-reset-job.yaml file is re-created in cluster-gitops
- The database reset runs on the next ArgoCD sync
- All data in the preview database is replaced with seed data
Important: Adding persist-db does NOT retroactively restore deleted data. It only prevents future resets.
Label: use-snapshot¶
Purpose: Initialize preview database with production snapshot data instead of empty/seed data.
Current Status: ✅ Fully Implemented - See Data Snapshot Automation for architecture details.
Behavior:
| use-snapshot Label State | Data Source | Database Initialization |
|---|---|---|
| Not present (default) | Empty database | Services create data as needed |
| Present | Production snapshot (syrf_snapshot) | Snapshot restore job copies collections |
How it works:
- MongoDB User Updated: the PR user gets an additional read role on the syrf_snapshot database
- Snapshot Restore Job Created: a PreSync hook job copies 11 collections from syrf_snapshot to syrf_pr_{number}
- db-reset Job Skipped: when use-snapshot=true, the database reset job is NOT generated (snapshot-restore handles initialization)
- Idempotency: a completion marker (ConfigMap) prevents duplicate restores on manual ArgoCD syncs
Collections Copied (via $out aggregation):
pmProject, pmStudy, pmInvestigator, pmSystematicSearch, pmDataExportJob, pmStudyCorrection, pmInvestigatorUsage, pmRiskOfBiasAiJob, pmProjectDailyStat, pmPotential, pmInvestigatorEmail
Security: Kyverno policy enforces that PR users can ONLY have read access (not readWrite) on syrf_snapshot. See Kyverno Security Policy below.
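The enforcement rule might have roughly this shape (an illustrative sketch only, not the actual policy; the policy name and the field paths into AtlasDatabaseUser are assumptions):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-snapshot-access      # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: snapshot-read-only
      match:
        any:
          - resources:
              kinds:
                - AtlasDatabaseUser
      validate:
        message: "PR users may only have the read role on syrf_snapshot"
        deny:
          conditions:
            any:
              # Deny if any role scoped to syrf_snapshot is not 'read'.
              - key: "{{ request.object.spec.roles[?databaseName=='syrf_snapshot'].roleName || `[]` }}"
                operator: AnyNotIn
                value:
                  - read
```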
When Added:
1. Workflow detects 'use-snapshot' label
2. AtlasDatabaseUser gets additional role:
- roleName: read
- databaseName: syrf_snapshot
3. snapshot-restore-job.yaml generated instead of db-reset-job.yaml
4. ArgoCD runs PreSync hook to copy data
5. Services start with production-like data
When Removed:
1. AtlasDatabaseUser loses syrf_snapshot read role
2. snapshot-restore-job.yaml removed
3. On next sync with service changes, db-reset-job.yaml may be generated
Database Isolation¶
Per-Environment Databases¶
Each environment type has its own database isolation strategy:
| Environment | Database Name | Isolation Level | Data Source |
|---|---|---|---|
| Production | syrftest | Full (dedicated) | Live data |
| Staging | syrftest ⚠️ | Shared with production | Production data |
| Preview | syrf_pr_{number} | Full (per-PR) | Seed data |
⚠️ CRITICAL: Staging currently shares the production database (syrftest). This is a known issue documented in the MongoDB Testing Strategy.
Preview Database Lifecycle¶
PR Opens + preview label added
↓
1. MongoDB Atlas database user created: pr-{number}-user
↓
2. Database syrf_pr_{number} created (on first write)
↓
3. DatabaseSeeder runs, populates with sample data
↓
4. Preview services connect using pr-{number}-user credentials
↓
[PR active - multiple syncs may occur]
↓
5. PR closes OR preview label removed
↓
6. Database user deleted (pr-{number}-user)
↓
7. Database syrf_pr_{number} becomes orphaned
↓
8. Manual cleanup required for orphan databases
MongoDB Atlas User Permissions¶
Preview environments use dedicated MongoDB Atlas database users with scoped permissions:
| User | Database Access | Role |
|---|---|---|
| pr-{number}-user | syrf_pr_{number} only | dbOwner |
dbOwner role provides:
- Read/write access to all collections
- Create/drop collections
- Create/drop indexes
- Run aggregation pipelines
Cleanup note: Users created before the dbOwner role update may have insufficient permissions for some cleanup operations.
Quartz SQL Schema Isolation¶
The Quartz service (background jobs) uses SQL Server with per-environment schema isolation:
| Environment | Schema Name | Isolation |
|---|---|---|
| Production | [production] | Dedicated |
| Staging | [staging] | Dedicated |
| Preview | [preview_{number}] | Per-PR |
Cleanup: When a PR closes, the cleanup-tags job drops the [preview_{number}] schema.
RabbitMQ Vhost Isolation¶
Each preview environment gets its own RabbitMQ virtual host:
| Environment | Vhost Name |
|---|---|
| Production | / (default) |
| Staging | staging |
| Preview | pr-{number} |
Cleanup requirement: Requires GCP credentials to access the GKE cluster and delete vhosts.
Edge Cases and Known Issues¶
Fork PR Limitations¶
Issue: PRs from forked repositories cannot create GitHub Deployments.
Cause: GITHUB_TOKEN in fork PRs has restricted permissions and cannot create deployments on the upstream repository.
Impact:
- Preview environment deploys normally
- GitHub Deployments UI shows no environment for the PR
- Users must manually check ArgoCD or workflow logs for status
Workaround: None. This is a GitHub security limitation.
ArgoCD Hook Finalizer Deadlock¶
Issue: Namespace gets stuck in Terminating state indefinitely.
Scenario:
1. PR closes while db-reset job is running
2. Git files deleted from cluster-gitops
3. ArgoCD loses visibility of hook annotations (they were in git)
4. ArgoCD doesn't know hook-delete-policy
5. Resources with finalizers block namespace deletion
Prevention: The cleanup-tags job runs a pre-cleanup step that removes argocd.argoproj.io/hook-finalizer from all resources BEFORE deleting git files.
Manual Fix (if pre-cleanup fails):
# Remove finalizers from stuck jobs
kubectl get jobs -n pr-{number}
kubectl patch job {job-name} -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
# Remove from RBAC resources
kubectl patch serviceaccount db-reset-sa -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
kubectl patch role db-reset-marker-role -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
kubectl patch rolebinding db-reset-marker-binding -n pr-{number} --type=merge \
-p '{"metadata":{"finalizers":null}}'
MongoDB Cleanup Failures for Older PRs¶
Issue: MongoDB database user cleanup may fail silently for PRs created before the dbOwner role change.
Cause: Users created with older roles may not have dbOwner permissions required for some cleanup operations.
Impact: Orphan database users may remain in MongoDB Atlas.
Resolution: Manual cleanup via MongoDB Atlas UI or CLI.
Race Condition: Closed PR and Build¶
Issue: Build job could recreate files that cleanup job deleted.
Scenario:
1. PR closes
2. cleanup-tags job starts, deletes files
3. build-images job (already running) finishes, writes new files
4. Files recreated after cleanup
Prevention: The workflow checks github.event.action == 'closed' at the start of check-should-run job and immediately exits if true. This ensures build jobs don't run for closed PRs.
Code reference (pr-preview.yml:70-77):
# Skip build if PR is closed
if [ "${{ github.event.action }}" == "closed" ]; then
echo "result=false" >> "$GITHUB_OUTPUT"
echo "skip_reason=PR is closed" >> "$GITHUB_OUTPUT"
exit 0
fi
GCP Credentials Not Configured¶
Issue: Some cleanup steps fail silently when GCP credentials are not configured.
Affected Operations:
| Operation | Without GCP Credentials |
|---|---|
| RabbitMQ vhost deletion | Skipped |
| ArgoCD finalizer pre-cleanup | Skipped |
| Quartz SQL schema cleanup | Skipped |
Impact: Orphan resources accumulate until manually cleaned.
Required Secrets:
- `GCP_WORKLOAD_IDENTITY_PROVIDER`
- `GCP_SERVICE_ACCOUNT`
Tag-Based Change Detection Edge Cases¶
Issue: Services may not rebuild when expected due to tag-based detection.
How it works:
1. Find last git tag for service (e.g., api-v1.2.3)
2. Compare current commit against tagged commit
3. If files in service path changed → rebuild
4. If no changes → reuse existing image
Edge cases:
| Scenario | Behavior |
|---|---|
| First PR ever (no tags) | Uses base branch comparison |
| Service has no tags | Always rebuilds |
| Tag deleted manually | May cause unexpected rebuilds |
| Shared library changed | Detected via DEPENDENCY-MAP.yaml |
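The detection rule can be sketched in shell. The names here (service `api`, path `src/api`, tag pattern `api-v*`) are assumptions for illustration, not the workflow's exact code; the sketch is demonstrated against a throwaway git repo:

```shell
# Illustrative sketch of tag-based change detection, run against a temp repo.
set -eu

detect() {  # $1=service  $2=service path  -> prints "rebuild" or "reuse"
  last_tag=$(git tag --list "$1-v*" --sort=-v:refname | head -n1)
  if [ -z "$last_tag" ]; then
    echo rebuild                                  # no tag yet: always build
  elif git diff --quiet "$last_tag" HEAD -- "$2"; then
    echo reuse                                    # nothing changed under the path
  else
    echo rebuild
  fi
}

# Demo in a throwaway repository
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email ci@example.com
git config user.name ci
mkdir -p src/api
echo v1 > src/api/Program.cs
git add . && git commit -qm "api v1"
git tag api-v1.0.0
detect api src/api        # prints: reuse
echo v2 >> src/api/Program.cs
git add . && git commit -qm "api v2"
detect api src/api        # prints: rebuild
```

This also makes the edge cases visible: with no matching tag the first branch always rebuilds, and deleting a tag manually changes what `git diff` compares against.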
PR Description Parsing Failures¶
Issue: Malformed YAML in PR description can cause config parsing to fail silently.
Scenarios:
- Invalid YAML syntax → defaults used
- Missing `#preview-config` marker → block ignored
- Unsupported settings → ignored (security feature)
Debug: Check workflow logs for "Parse PR description for preview config" step.
Lambda S3 Prefix Routing¶
Issue: File uploads must use correct S3 prefix or they won't trigger the Lambda.
Expected prefix: preview/pr-{number}/Projects/{projectId}/...
Common mistakes:
- Using production prefix (`Projects/...`) → triggers production Lambda
- Missing `preview/pr-{number}` prefix → no Lambda triggered
- Wrong PR number in prefix → wrong Lambda triggered
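These mistakes can be caught before uploading with a small guard. `check_prefix` is a hypothetical helper that only mirrors the rules above; the actual routing is configured on the S3 bucket notification:

```shell
# Illustrative prefix guard; mirrors the routing rules, does not replace them.
check_prefix() {  # $1=PR number  $2=S3 object key
  case "$2" in
    "preview/pr-$1/Projects/"*) echo "ok" ;;
    "Projects/"*)               echo "would trigger the PRODUCTION Lambda" ;;
    preview/pr-*)               echo "wrong PR number: routes to another preview Lambda" ;;
    *)                          echo "no Lambda triggered" ;;
  esac
}

check_prefix 123 "preview/pr-123/Projects/p1/file.pdf"   # ok
check_prefix 123 "Projects/p1/file.pdf"                  # would trigger the PRODUCTION Lambda
check_prefix 123 "preview/pr-999/Projects/p1/file.pdf"   # wrong PR number: routes to another preview Lambda
```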
Label Interaction Matrix¶
Understanding how labels interact is critical for predictable preview behavior.
Label Precedence Rules¶
Decision Flow:
Has 'persist-db' label?
├── YES → Skip ALL database operations (highest priority)
│ Database preserved exactly as-is
└── NO → Has 'use-snapshot' label?
├── YES → Run snapshot-restore job (copies from syrf_snapshot)
│ Skip db-reset job (snapshot handles initialization)
└── NO → Were MongoDB services rebuilt?
├── YES → Run db-reset job (drop all collections)
└── NO → Skip db-reset job (no changes to reset for)
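The decision flow above can be condensed into a small sketch (not the workflow's actual code; inputs are simple true/false strings):

```shell
# Condensed rendition of the label precedence rules.
plan_db_jobs() {  # $1=persist-db  $2=use-snapshot  $3=services rebuilt
  if [ "$1" = true ]; then
    echo "skip all database jobs"        # persist-db wins unconditionally
  elif [ "$2" = true ]; then
    echo "run snapshot-restore"          # snapshot handles initialization
  elif [ "$3" = true ]; then
    echo "run db-reset"                  # rebuilt services get a clean DB
  else
    echo "skip (nothing to reset for)"
  fi
}

plan_db_jobs true  true  true    # skip all database jobs
plan_db_jobs false true  true    # run snapshot-restore
plan_db_jobs false false true    # run db-reset
```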
Complete Label Scenario Matrix¶
| persist-db | use-snapshot | Services Rebuilt | db-reset Job | snapshot-restore Job | Result |
|---|---|---|---|---|---|
| ❌ | ❌ | ❌ | ❌ Skip | ❌ Skip | Database unchanged |
| ❌ | ❌ | ✅ | ✅ Run | ❌ Skip | Collections dropped, empty database |
| ❌ | ✅ | ❌ | ❌ Skip | ✅ Run | Snapshot data copied |
| ❌ | ✅ | ✅ | ❌ Skip | ✅ Run | Snapshot data copied (reset skipped) |
| ✅ | ❌ | ❌ | ❌ Skip | ❌ Skip | Database preserved |
| ✅ | ❌ | ✅ | ❌ Skip | ❌ Skip | Database preserved despite rebuild |
| ✅ | ✅ | ❌ | ❌ Skip | ❌ Skip | Database preserved (persist wins) |
| ✅ | ✅ | ✅ | ❌ Skip | ❌ Skip | Database preserved (persist wins) |
Key Insight: persist-db always wins. When present, no database initialization jobs run regardless of other labels or rebuild status.
Why use-snapshot Skips db-reset¶
When use-snapshot is true, the db-reset job is NOT generated because:
- Redundant: The `$out` aggregation in snapshot-restore completely replaces target collections
- Order Problem: db-reset runs at sync wave -1 (AFTER PreSync hooks), so it would DROP data that snapshot-restore just copied
- Efficiency: No need to drop then copy; just copy (which replaces)
ArgoCD Sync Wave Ordering¶
Preview environments use ArgoCD sync waves to ensure resources are created in the correct order.
Complete Sync Order¶
Wave -5: AtlasDatabaseUser (MongoDB user creation)
↓ MongoDB Atlas Operator creates user in Atlas
↓ Connection secret becomes available
Wave -2: db-reset RBAC resources (if db-reset enabled)
- ServiceAccount: db-reset-sa
- Role: db-reset-marker-role
- RoleBinding: db-reset-marker-binding
Wave -1: db-reset Job (if db-reset enabled)
↓ Drops all collections in syrf_pr_{number}
↓ Creates completion marker ConfigMap
Wave 0+: Application services (API, PM, Quartz, Web)
↓ Services start with fresh/empty database
PreSync Hooks (snapshot-restore)¶
When use-snapshot label is present, PreSync hooks run BEFORE the sync wave sequence:
[PreSync Phase - runs before wave sequence]
PreSync Wave 1: snapshot-restore RBAC resources
- ServiceAccount: db-reset-sa
- Role: db-reset-marker-role
- RoleBinding: db-reset-marker-binding
PreSync Wave 3: snapshot-restore Job
↓ Copies 11 collections from syrf_snapshot to syrf_pr_{number}
↓ Creates completion marker ConfigMap
[/PreSync Phase]
Wave -5: AtlasDatabaseUser (already has syrf_snapshot read role)
Wave 0+: Application services start with snapshot data
Hook Delete Policies¶
| Resource Type | Delete Policy | Behavior |
|---|---|---|
| db-reset RBAC | None (regular sync resource) | Persists as long as file exists in git |
| snapshot-restore RBAC | `BeforeHookCreation` | Deleted before next sync, recreated |
| snapshot-restore Job | `BeforeHookCreation` | Old job deleted before new one created |
Why BeforeHookCreation? Resources persist throughout the current sync (Job can use them), then get cleaned up before the next sync. HookSucceeded would delete them immediately after creation, before the Job runs.
Database Coordination with Init Containers¶
Preview environments use a sophisticated coordination mechanism to ensure services don't start until the database is ready. This prevents race conditions where services might try to access a database that hasn't been seeded yet.
Architecture Overview¶
PR Preview Environment
└── pr-{number} (Parent Application - App-of-Apps)
├── pr-{number}-infrastructure (AUTO-SYNC ✓)
│ ├── Namespace, ExternalSecret, AtlasDatabaseUser
│ └── DatabaseLifecycle CR → manages "db-ready" ConfigMap
│
├── pr-{number}-api (AUTO-SYNC ✓)
├── pr-{number}-project-management (AUTO-SYNC ✓)
├── pr-{number}-quartz (AUTO-SYNC ✓)
└── pr-{number}-web (AUTO-SYNC ✓)
└── Init containers wait for ConfigMap with MATCHING seedVersion
How seedVersion Matching Works¶
Simply waiting for a db-ready ConfigMap is NOT sufficient. Here's why:
Race Condition During Reseed:
1. New seedVersion pushed to cluster-gitops
2. ArgoCD syncs both infrastructure AND service apps (independently!)
3. Service pods restart (annotation changed)
4. Init container checks for db-ready ConfigMap
5. OLD ConfigMap still exists (operator hasn't updated it yet)
6. Init container PASSES with stale ConfigMap ← BUG!
7. Pods start while database is being reseeded ← DATA CORRUPTION
The Fix: Init containers wait for ConfigMap with matching seedVersion:
# ConfigMap created by DatabaseLifecycle operator
apiVersion: v1
kind: ConfigMap
metadata:
name: db-ready
namespace: pr-{number}
data:
status: "ready"
seedVersion: "abc123" # Must match pod's expected version
seededAt: "2026-01-17T12:00:00Z"
sourceDatabase: "syrf_snapshot"
Per-Service waitForDatabase Configuration¶
Not all services need to wait for the database. The waitForDatabase flag is configured per-service in cluster-gitops:
| Service | waitForDatabase | Reason |
|---|---|---|
| api | `true` | Connects to MongoDB |
| project-management | `true` | Connects to MongoDB |
| quartz | `true` | Connects to MongoDB |
| web | `false` | Frontend only, no database |
| docs | `false` | Static documentation site |
| user-guide | `false` | Static documentation site |
Configuration Location: cluster-gitops/syrf/environments/preview/services/{service}/config.yaml
# Example: api/config.yaml
serviceName: api
hostPrefix: "api."
imageRepo: ghcr.io/camaradesuk/syrf-api
waitForDatabase: true # Init container waits for db-ready ConfigMap
Init Container Behavior¶
Services with waitForDatabase: true get an init container added automatically:
initContainers:
- name: wait-for-database
image: bitnami/kubectl:latest
command: ['sh', '-c']
args:
- |
EXPECTED_VERSION="${SEED_VERSION}"
echo "Waiting for db-ready ConfigMap with seedVersion=$EXPECTED_VERSION..."
while true; do
CURRENT=$(kubectl get configmap db-ready -n ${NAMESPACE} \
-o jsonpath='{.data.seedVersion}' 2>/dev/null || echo "")
if [ "$CURRENT" = "$EXPECTED_VERSION" ]; then
echo "Database is ready with correct seedVersion!"
exit 0
fi
echo "Current: '$CURRENT', Expected: '$EXPECTED_VERSION'. Waiting..."
sleep 5
done
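The matching rule can be exercised locally by polling a file in place of the ConfigMap. This is only a simulation of the gate, not the deployed script; the real container polls with kubectl as shown above:

```shell
# Local simulation: a file stands in for the db-ready ConfigMap's seedVersion.
EXPECTED_VERSION=abc123
state=$(mktemp)
printf 'stale000\n' > "$state"                 # operator has not reseeded yet

( sleep 1; printf 'abc123\n' > "$state" ) &    # "operator" finishes seeding later

while true; do
  CURRENT=$(cat "$state" 2>/dev/null || echo "")
  if [ "$CURRENT" = "$EXPECTED_VERSION" ]; then
    echo "Database is ready with correct seedVersion!"
    break
  fi
  sleep 0.2                                    # the real container sleeps 5s
done
wait
```

Note that the stale value present at startup never passes the gate, which is exactly the property that prevents the reseed race described above.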
DatabaseLifecycle Operator¶
The DatabaseLifecycle operator (deployed in the cluster) handles database seeding:
- Watches DatabaseLifecycle custom resources
- Waits for watched deployments to have 0 ready replicas (via the `watchedDeployments` field)
- Seeds database from snapshot when `seedVersion` changes
- Creates/Updates the db-ready ConfigMap after successful seeding
Coordination Flow:
Init containers block → Pods stuck at Init:0/1 → Operator sees 0 ready pods
→ Operator seeds database → Operator creates db-ready ConfigMap
→ Init containers pass → Main containers start
Key Design Principle: The operator does NOT scale down services. Instead:
- Init containers block pods from starting (by waiting for db-ready ConfigMap)
- Operator waits for watched deployments to have 0 ready replicas
- This happens naturally because init containers prevent pods from becoming ready
watchedDeployments Configuration (in the DatabaseLifecycle CR): services with waitForDatabase: true get the label `syrf.org.uk/uses-database=true` automatically.
Scenario Walkthroughs¶
Initial Deployment:
1. PR gets 'preview' + 'use-snapshot' labels
2. Workflow pushes seedVersion to cluster-gitops
3. ArgoCD syncs ApplicationSet, creates apps for infrastructure + services
4. Apps sync in parallel:
- Infrastructure: creates DatabaseLifecycle CR with watchedDeployments
- Services: create Deployments with init containers
5. Service pods enter Init:0/1 state, blocked waiting for db-ready ConfigMap
6. DatabaseLifecycle operator:
- Checks watchedDeployments → all have 0 ready replicas ✓
- Seeds database from snapshot
- Creates db-ready ConfigMap with seedVersion
7. Init containers detect matching seedVersion → pods start
Normal Code Push (no database changes):
1. Developer pushes code
2. Workflow pushes new headSha, same seedVersion
3. ArgoCD syncs child apps
4. Kubernetes does rolling update
5. Init container checks ConfigMap - seedVersion matches!
6. Init container passes immediately (~1 second)
7. No database work needed
Reseed Trigger (seedVersion changes):
1. New seedVersion pushed to cluster-gitops
2. ArgoCD syncs apps (infrastructure + services)
3. Services use Recreate strategy → old pods terminated first
4. New pods created with new seedVersion
5. New pods enter Init:0/1 state (waiting for ConfigMap with new seedVersion)
6. DatabaseLifecycle operator:
- Checks watchedDeployments → all have 0 ready replicas ✓
- Drops database, seeds from snapshot
- Updates ConfigMap with new seedVersion
7. Init containers detect matching seedVersion → pods start
Manual Reseed via /reseed-db Command¶
To trigger a database reseed on an existing preview environment, comment /reseed-db on the PR:
Command: Comment /reseed-db on any PR with the preview label.
What happens:
1. Workflow detects /reseed-db command in comment
2. Checks that PR has 'preview' label and NOT 'persist-db' label
3. Updates seedVersion in pr.yaml (single source of truth)
4. ArgoCD detects change, syncs all apps
5. Services restart (Recreate strategy terminates old pods first)
6. Init containers wait for new db-ready ConfigMap
7. Operator reseeds database, creates ConfigMap with new seedVersion
8. Services start successfully
Blocked when:
- `persist-db` label present → "Remove persist-db label first"
- `preview` label missing → "Add preview label first"
Use cases:
- Database corruption during testing
- Want fresh snapshot data
- Recovering from stuck deployments (Init:0/1 state)
- Schema migration testing
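The guard can be sketched as a small function over the PR's label names. This is illustrative only; the exact check lives in the workflow:

```shell
# Sketch of the /reseed-db guard: persist-db blocks, preview is required.
reseed_decision() {  # $@ = PR label names
  case " $* " in
    *" persist-db "*) echo "blocked: Remove persist-db label first" ;;
    *" preview "*)    echo "allowed: update seedVersion in pr.yaml" ;;
    *)                echo "blocked: Add preview label first" ;;
  esac
}

reseed_decision preview use-snapshot   # allowed: update seedVersion in pr.yaml
reseed_decision preview persist-db     # blocked: Remove persist-db label first
reseed_decision enhancement            # blocked: Add preview label first
```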
Recreate Deployment Strategy¶
Services with waitForDatabase: true use Recreate deployment strategy instead of RollingUpdate:
Why? RollingUpdate keeps old pods running until new pods are ready. With init containers blocking new pods, this creates a deadlock:
RollingUpdate (would cause deadlock):
1. Old pods: 1/1 Running (no init container)
2. New pods: Init:0/1 (waiting for db-ready ConfigMap)
3. Operator waits for 0 ready replicas (watchedDeployments check)
4. Old pod has 1 ready replica → operator waits forever
5. db-ready ConfigMap never created → new pods wait forever
Recreate strategy breaks the deadlock:
1. Recreate terminates old pods first
2. 0 ready replicas achieved
3. Operator safe to seed
4. ConfigMap created
5. New pods start
| Environment | Strategy | Reason |
|---|---|---|
| Production | RollingUpdate | Zero-downtime required |
| Staging | RollingUpdate | Zero-downtime preferred |
| Preview (waitForDatabase=false) | RollingUpdate | No coordination needed |
| Preview (waitForDatabase=true) | Recreate | Enables safe database seeding |
This tradeoff is acceptable for previews because brief downtime during deployments is tolerable in non-production environments.
Startup Probe for MongoDB Index Creation¶
Services in preview environments with waitForDatabase: true have a startupProbe to handle slow MongoDB startup:
Problem: MongoDB creates indexes on freshly seeded databases, taking 60+ seconds. The default liveness probe allows only ~90 seconds total before killing the pod, causing restart loops.
Solution: A startupProbe runs only during initial startup, allowing up to 310 seconds before liveness probes take over.
# Configured automatically in _deployment-dotnet.tpl
startupProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 30 # 10 + (30 * 10) = 310 seconds max
| Environment | Max Startup Time |
|---|---|
| Production/Staging | 90 seconds (liveness probe only) |
| Preview (waitForDatabase=true) | 310 seconds (startup probe) |
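As a quick arithmetic check of the probe budget from the template above (first probe after initialDelaySeconds, then up to failureThreshold probes at periodSeconds intervals):

```shell
# Startup probe budget from the template values above.
initial=10; period=10; threshold=30
budget=$(( initial + threshold * period ))
echo "max startup time: ${budget}s"    # max startup time: 310s
```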
Symptoms of insufficient startup time (fixed by startup probe):
- Pods showing `0/1 Running` with multiple restarts
- Logs showing "Now listening" 60+ seconds after startup begins
- Eventual success after 3-5 restarts (when indexes are cached)
Related Documentation¶
For detailed architecture documentation, see:
Kyverno Security Policy¶
A Kyverno ClusterPolicy enforces that PR preview database users can only access appropriate databases.
Policy: atlas-block-production-access¶
Location: cluster-gitops/plugins/helm/kyverno/resources/atlas-pr-user-policy.yaml
Enforcement: Blocks creation/update of AtlasDatabaseUser resources in pr-* namespaces that violate rules.
Rules Summary¶
| Rule | Blocked Pattern | Purpose |
|---|---|---|
| 1. `block-any-database-roles` | `readWriteAnyDatabase`, `dbAdminAnyDatabase`, `root` | Prevent broad access |
| 2. `block-production-database` | `databaseName: syrftest` | Protect production data |
| 3. `block-staging-database` | `databaseName: syrf_staging` | Protect staging data |
| 4. `block-admin-database` | `databaseName: admin` | Protect system database |
| 5. `validate-pr-database-pattern` | See below | Enforce allowed patterns |
Rule 5: Allowed Database Access¶
PR users can ONLY have:
| Database Pattern | Allowed Roles | Purpose |
|---|---|---|
| `syrf_pr_*` | Any (`readWrite`, `dbOwner`, etc.) | PR-specific database |
| `syrf_snapshot` | `read` ONLY | Snapshot data source |
Denied Examples:
# ❌ BLOCKED - syrf_snapshot with readWrite
roles:
- roleName: readWrite
databaseName: syrf_snapshot
# ❌ BLOCKED - accessing production
roles:
- roleName: read
databaseName: syrftest
# ❌ BLOCKED - accessing unrecognized database
roles:
- roleName: readWrite
databaseName: some_other_db
Allowed Examples:
# ✅ ALLOWED - PR database with readWrite
roles:
- roleName: readWrite
databaseName: syrf_pr_123
# ✅ ALLOWED - PR database + snapshot read
roles:
- roleName: readWrite
databaseName: syrf_pr_123
- roleName: read
databaseName: syrf_snapshot
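The decision table can be mirrored in shell for quick reasoning; actual enforcement is the Kyverno ClusterPolicy, and this sketch only restates its rules:

```shell
# Mirrors the five policy rules for a single (roleName, databaseName) pair.
check_role() {  # $1=roleName  $2=databaseName
  case "$1" in
    readWriteAnyDatabase|dbAdminAnyDatabase|root)
      echo "BLOCKED: any-database role"; return ;;
  esac
  case "$2" in
    syrftest)      echo "BLOCKED: production database" ;;
    syrf_staging)  echo "BLOCKED: staging database" ;;
    admin)         echo "BLOCKED: admin database" ;;
    syrf_pr_*)     echo "ALLOWED" ;;
    syrf_snapshot) if [ "$1" = read ]; then echo "ALLOWED"
                   else echo "BLOCKED: snapshot is read-only"; fi ;;
    *)             echo "BLOCKED: unrecognized database" ;;
  esac
}

check_role readWrite syrf_pr_123     # ALLOWED
check_role read      syrf_snapshot   # ALLOWED
check_role readWrite syrf_snapshot   # BLOCKED: snapshot is read-only
check_role read      syrftest        # BLOCKED: production database
```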
Policy Violation Response¶
If a PR attempts to create an AtlasDatabaseUser that violates the policy:
- Kyverno blocks the resource creation
- ArgoCD sync fails with policy violation message
- PR preview deployment halts at the AtlasDatabaseUser step
- GitHub Deployment status shows failure
Fix: Remove the violating role from the AtlasDatabaseUser definition in the workflow.
Snapshot Producer (Weekly Data Snapshots)¶
The snapshot-producer CronJob creates weekly copies of production data to the syrf_snapshot database, which is then used by preview environments with the use-snapshot label.
How It Works¶
Weekly Schedule (Sunday 3 AM UTC)
↓
Snapshot Producer CronJob starts
↓
1. Test connectivity to source (Cluster0) and target (Preview) clusters
↓
2. For each collection (11 total):
- Count source documents
- Stream copy: mongodump | mongorestore (no disk writes)
- Verify target document count
- Retry up to 3 times on failure
↓
3. Write snapshot_metadata document with:
- Timestamp, duration, document counts
- Source/target cluster info
- Collections copied
↓
Preview environments can now use fresh snapshot data
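The per-collection retry behaviour (step 2 above) can be sketched like this. `copy_collection` here is a stand-in that fails twice then succeeds; the real job streams mongodump into mongorestore with no intermediate disk writes:

```shell
# Sketch of the retry loop (retry.maxAttempts defaults to 3).
copy_with_retry() {  # $1 = collection name
  fails=0
  until copy_collection "$1"; do
    fails=$((fails + 1))
    if [ "$fails" -ge 3 ]; then
      echo "giving up on $1 after 3 attempts" >&2
      return 1
    fi
    sleep 1   # the deployed job waits retry.delaySeconds (default 30)
  done
  echo "$1 copied"
}

# Demo stand-in: fails on the first two calls, succeeds on the third.
calls=$(mktemp); echo 0 > "$calls"
copy_collection() {
  n=$(cat "$calls"); echo $((n + 1)) > "$calls"
  [ "$n" -ge 2 ]
}

copy_with_retry pmProject    # pmProject copied
```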
Collections Copied¶
The following collections are copied from syrftest (production) to syrf_snapshot:
| Collection | Description |
|---|---|
| `pmProject` | Projects with stages, memberships, questions |
| `pmStudy` | Studies with screening, extraction, annotations |
| `pmInvestigator` | User accounts and profiles |
| `pmSystematicSearch` | Literature searches |
| `pmDataExportJob` | Export job tracking |
| `pmStudyCorrection` | PDF correction requests |
| `pmInvestigatorUsage` | Usage statistics |
| `pmRiskOfBiasAiJob` | AI risk-of-bias jobs |
| `pmProjectDailyStat` | Daily statistics |
| `pmPotential` | Potential references |
| `pmInvestigatorEmail` | Email records |
Snapshot Metadata¶
After each successful run, a metadata document is written to syrf_snapshot.snapshot_metadata:
{
_id: "latest",
createdAt: ISODate("2026-01-26T03:45:00Z"),
startedAt: ISODate("2026-01-26T03:00:00Z"),
finishedAt: ISODate("2026-01-26T03:45:00Z"),
durationSeconds: 2700,
sourceCluster: "Cluster0",
sourceDatabase: "syrftest",
sourceHost: "cluster0-pri.siwfo.mongodb.net",
targetCluster: "Preview",
targetDatabase: "syrf_snapshot",
targetHost: "preview-pri.siwfo.mongodb.net",
collections: ["pmProject", "pmStudy", ...],
collectionsCount: 11,
documentCounts: {
pmProject: 1234,
pmStudy: 56789,
// ...
},
totalDocuments: 123456,
method: "mongodump | mongorestore streaming",
crossCluster: true,
status: "complete"
}
Preview environments can query this document to verify snapshot freshness before using data.
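A hedged sketch of such a freshness check follows. Timestamps are hard-coded for illustration; in practice `createdAt` would come from a mongosh query against `snapshot_metadata`, and `date -d` here assumes GNU date:

```shell
# Flag snapshots older than the weekly cadence.
created_at="2026-01-26T03:45:00Z"   # snapshot_metadata.createdAt (hard-coded)
now="2026-02-05T09:00:00Z"          # substitute "$(date -u +%FT%TZ)" in real use
age_days=$(( ( $(date -u -d "$now" +%s) - $(date -u -d "$created_at" +%s) ) / 86400 ))
echo "snapshot age: $age_days days"          # snapshot age: 10 days
if [ "$age_days" -gt 7 ]; then
  echo "stale: check the weekly CronJob in staging"
fi
```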
Configuration¶
The snapshot-producer is deployed to the staging namespace. Configuration is in:
- Chart: `cluster-gitops/charts/snapshot-producer/`
- Values: `cluster-gitops/plugins/local/snapshot-producer/values.yaml`
Key configuration options:
| Setting | Default | Description |
|---|---|---|
| `schedule` | `"0 3 * * 0"` | Cron schedule (Sunday 3 AM UTC) |
| `activeDeadlineSeconds` | `3600` | 1 hour timeout |
| `retry.maxAttempts` | `3` | Retries per collection |
| `retry.delaySeconds` | `30` | Delay between retries |
| `streaming.gzip` | `true` | Compress data in transit |
Manual Trigger¶
To manually trigger a snapshot (useful for testing or recovery):
# Create a one-time Job from the CronJob
kubectl create job --from=cronjob/snapshot-producer snapshot-manual-$(date +%s) -n staging
# Watch progress
kubectl logs -f job/snapshot-manual-<timestamp> -n staging
Troubleshooting¶
Snapshot job fails with connection error:
- Check MongoDB Atlas network access (IP allowlist)
- Verify credentials in the `snapshot-producer-credentials` secret
- Check if VPC Peering is configured for `-pri` hostnames
Snapshot takes too long:
- Large collections may exceed the 1-hour timeout
- Consider increasing `activeDeadlineSeconds`
Preview shows stale data:
- Check the `snapshot_metadata.createdAt` timestamp
- Verify the CronJob is running: `kubectl get cronjob -n staging`
- Check recent job status: `kubectl get jobs -n staging | grep snapshot`
For detailed chart configuration, see Snapshot Producer Reference.
Future Enhancements¶
Planned improvements for PR previews:
- Performance metrics: Show load times and resource usage
- Visual regression: Screenshot comparison with base branch
- E2E tests: Automated testing against preview environment
- Cost tracking: Monitor resource usage per preview
- Orphan database cleanup: Automated cleanup of `syrf_pr_*` databases
Recently Completed:
- ✅ Database per preview: Isolated MongoDB database per PR (syrf_pr_{number})
- ✅ Seed data: Auto-populate preview with test data on first startup
- ✅ GitHub Deployments: Native GitHub UI integration with deployment tracking
- ✅ Snapshot restore: Use production data snapshot via `use-snapshot` label
- ✅ Kyverno security: Policy enforcement for PR database access patterns