DBL Operator Redesign - Implementation Specification¶
Status: Ready for Approval
Target Environment: GKE Kubernetes Cluster / ArgoCD GitOps
Executive Summary¶
This specification defines the redesigned Database Lifecycle (DBL) operator architecture for PR preview environments in the SyRF monorepo. The redesign addresses several limitations of the current implementation:
- Seeding as Jobs: Move database seeding from the operator pod to separate Kubernetes Jobs for better resource isolation, logging, and failure handling
- Custom Seeding Templates: Support both snapshot restore AND mock data seeding via configurable Job templates
- Enhanced Tracking: Split tracking into `seedId` (trigger) and `seedSha` (audit) for clearer coordination
- New Label Semantics: Add `reset-db-on-sync` label and clarify `lock-db` behavior (replaces `persist-db`)
- Post-Seed Job Support: Optional post-seeding jobs for migrations, indexing, or custom operations
The specification describes the desired end-state architecture. Implementation should adapt the current shell-operator based system to match this specification.
Table of Contents¶
- High-Level Architecture
- Detailed Design
- Execution Flow
- Edge Cases & Mitigations
- Testing Strategy
- Implementation Checklist
- CI Workflow Architecture
- Open Questions
- Migration Mapping: Current → New
- References
1. High-Level Architecture¶
1.1 Overview¶
┌─────────────────────────────────────────────────────────────────────────────┐
│ PR Preview Environment Flow │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ GitHub PR CI Workflow cluster-gitops │
│ ┌─────────┐ ┌──────────┐ ┌─────────────┐ │
│ │ Labels: │ push/label │ Detect │ commit │ pr-{n}/ │ │
│ │ preview │ ───────────────>│ changes │ ───────────────>│ pr.yaml │ │
│ │ lock-db │ │ Generate │ │ values.yaml │ │
│ │ use-snap│ │ seedId │ └──────┬──────┘ │
│ │ reset.. │ └──────────┘ │ │
│ └─────────┘ │ │
│ │ ▼ │
│ │ /reseed-db ┌─────────────────┐ │
│ │ comment │ ArgoCD Sync │ │
│ └──────────────────────────────────────────────>│ ApplicationSet │ │
│ └────────┬────────┘ │
│ │ │
│ Kubernetes Cluster │ │
│ ┌─────────────────────────────────────────────────────────────┼──────────┐ │
│ │ ▼ │ │
│ │ ┌──────────────────┐ ┌──────────────────────────────────────────┐ │ │
│ │ │ DBL Operator │ │ PR Namespace (pr-{n}) │ │ │
│ │ │ ┌──────────────┐ │ │ │ │ │
│ │ │ │Reconciliation│ │ │ ┌─────────────┐ ┌─────────────────┐ │ │ │
│ │ │ │ Loop │─┼───>│ │ Seeding Job │──>│ db-ready │ │ │ │
│ │ │ └──────────────┘ │ │ │ (snapshot/ │ │ ConfigMap │ │ │ │
│ │ │ │ │ │ mock data) │ │ - seedId │ │ │ │
│ │ │ Watches: │ │ └─────────────┘ │ - seedSha │ │ │ │
│ │ │ DatabaseLifecycle│ │ │ │ - status │ │ │ │
│ │ │ CRs │ │ ▼ └────────┬────────┘ │ │ │
│ │ └──────────────────┘ │ ┌─────────────┐ │ │ │ │
│ │ │ │ Post-Seed │ │ │ │ │
│ │ │ │ Job (opt.) │ │ │ │ │
│ │ │ │ (indexes/ │ │ │ │ │
│ │ │ │ migrations)│ │ │ │ │
│ │ │ └─────────────┘ │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ ▼ │ │ │ │
│ │ │ ┌─────────────────────────┴──────────┐ │ │ │
│ │ │ │ Service Pods (api, pm, quartz) │ │ │ │
│ │ │ │ ┌────────────────┐ │ │ │ │
│ │ │ │ │ Init Container │ waits for │ │ │ │
│ │ │ │ │ (wait-for-db) │ ConfigMap match │ │ │ │
│ │ │ │ └────────────────┘ │ │ │ │
│ │ │ └────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
1.2 Key Components¶
| Component | Location | Purpose |
|---|---|---|
| DBL Operator | syrf-system namespace | Watches DatabaseLifecycle CRs, creates/manages seeding Jobs |
| DatabaseLifecycle CR | PR namespace | Declares desired database state, seeding config, job templates |
| Seeding Job | PR namespace | Kubernetes Job that performs snapshot restore or mock data seeding |
| Post-Seed Job | PR namespace | Optional Job for migrations, indexing, or custom operations |
| db-ready ConfigMap | PR namespace | Coordination mechanism - services wait for matching seedId |
| CI Workflow | GitHub Actions | Generates seedId/seedSha, updates cluster-gitops values |
1.3 Dependencies¶
Internal:
- ArgoCD ApplicationSet (syrf-previews.yaml)
- preview-infrastructure Helm chart
- syrf-common Helm library chart (init container templates)
- MongoDB Atlas Operator (user provisioning)
External:
- MongoDB Atlas (database hosting)
- GitHub Actions (CI/CD)
- cluster-gitops repository (GitOps values)
1.4 Integration Points¶
- GitHub → CI Workflow: PR events (push, label, comment) trigger workflow
- CI Workflow → cluster-gitops: Workflow commits `seedId`, `seedSha`, and label-derived values
- ArgoCD → Kubernetes: ApplicationSet generates Applications from cluster-gitops
- DBL Operator → Seeding Job: Operator creates Job when CR seedId doesn't match ConfigMap
- ConfigMap → Service Init Containers: Services poll ConfigMap until seedId matches
2. Detailed Design¶
2.1 DatabaseLifecycle Custom Resource Definition¶
```yaml
apiVersion: database.syrf.org.uk/v1alpha1
kind: DatabaseLifecycle
metadata:
  name: pr-database
  namespace: pr-123
  annotations:
    argocd.argoproj.io/sync-wave: "15"
spec:
  # Target database configuration
  database: syrf_pr_123

  # Connection details
  connection:
    secretRef:
      name: syrfdb-cluster0-syrf-pr-123-app
      connectionStringKey: connectionStringStandardSrv

  # Seed tracking (updated by CI workflow)
  seedId: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"  # GUID - triggers reseed when changed
  seedSha: "abc123def456"                         # Commit SHA - audit/traceability only

  # Seeding configuration
  seeding:
    enabled: true

    # Option 1: Built-in snapshot restore
    type: snapshot  # or "custom"
    sourceDatabase: syrf_snapshot
    collections:
      - pmProject
      - pmStudy
      - pmInvestigator
      - pmSystematicSearch
      - pmDataExportJob
      - pmStudyCorrection
      - pmInvestigatorUsage
      - pmRiskOfBiasAiJob
      - pmProjectDailyStat
      - pmPotential
      - pmInvestigatorEmail

    # Option 2: Custom seeding job (for mock data or other sources)
    # type: custom
    # jobTemplate:
    #   inline: { ... }  # Inline Job spec
    #   # OR
    #   configMapRef:
    #     name: mock-data-seeder-template
    #     key: job.yaml

    # Timeouts
    timeout: 1800  # 30 minutes for seeding

  # Post-seed job (optional - NO DEFAULT)
  postSeedJob:
    enabled: true
    jobTemplate:
      inline:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: index-init
                  image: "ghcr.io/camaradesuk/syrf-project-management:sha-abc123"
                  env:
                    - name: SYRF_INDEX_INIT_MODE
                      value: "true"
                  envFrom:
                    - secretRef:
                        name: syrfdb-cluster0-syrf-pr-123-app
      # OR
      # configMapRef:
      #   name: index-init-template
      #   key: job.yaml
    timeout: 1800  # 30 minutes for post-seed

  # Existing database policy
  existingDatabasePolicy: drop  # drop | skip | fail

  # Cleanup configuration
  cleanupOnDelete: true  # Drop database when CR is deleted
  lockDatabase: false    # When true, prevents drop even on CR delete

  # Watched deployments (wait for 0 ready replicas before seeding)
  watchedDeployments:
    - name: syrf-api
    - name: syrf-projectmanagement
    - name: syrf-quartz
```
2.2 Seed Tracking Fields¶
| Field | Type | Purpose | When Updated |
|---|---|---|---|
| `seedId` | GUID | Triggers reseed detection | On PR push (if reset-db-on-sync), on /reseed-db command, on initial deployment |
| `seedSha` | String | Audit trail - which commit triggered seeding | Always updated with seedId |
Key Behavior:
- Operator compares spec.seedId with db-ready ConfigMap's seedId
- If different → seeding needed
- If same → skip seeding (already done)
2.3 db-ready ConfigMap Structure¶
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: db-ready
  namespace: pr-123
data:
  seedId: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
  seedSha: "abc123def456"
  status: "complete"  # pending | seeding | post-seed | complete | failed
  lastUpdated: "2026-01-20T10:30:00Z"
  errorMessage: ""  # Populated if status=failed
```
Status Transitions: pending → seeding → post-seed → complete. Any state may transition to failed; a changed seedId (e.g. via /reseed-db) restarts the cycle from pending.
2.4 Label Semantics¶
| Label | Effect | Details |
|---|---|---|
| `preview` | Master switch | Required for all DBL functionality. Without it, DBL CR is not created. |
| `use-snapshot` | Snapshot seeding | Uses syrf_snapshot as source. Without it, uses custom seeding job template (mock data) or empty DB. |
| `lock-db` | Prevents DB operations | Prevents drop/reseed operations. Overrides preview removal - keeps CR active. On PR close, namespace is removed but DB is orphaned. |
| `reset-db-on-sync` | Reseed on every push | Updates seedId to new GUID on each PR push. Incompatible with `lock-db` - if both present, `reset-db-on-sync` is removed and a comment is added to the PR. |
Label Precedence: `preview` must be present for any other label to take effect; when both are present, `lock-db` overrides `reset-db-on-sync` (the latter is auto-removed).
2.5 PR Commands¶
| Command | Effect |
|---|---|
| `/reseed-db` | Updates seedId to new GUID (triggers reseed). If `lock-db` present, adds a comment explaining the incompatibility and does nothing. |
2.6 Custom Seeding Job Template¶
The DBL CR supports two ways to define custom seeding jobs:
Option 1: Inline Job Spec
```yaml
spec:
  seeding:
    type: custom
    jobTemplate:
      inline:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: mock-seeder
                  image: ghcr.io/camaradesuk/syrf-mock-seeder:latest
                  env:
                    - name: TARGET_DATABASE
                      value: syrf_pr_123
```
Option 2: ConfigMap Reference
```yaml
spec:
  seeding:
    type: custom
    jobTemplate:
      configMapRef:
        name: mock-data-seeder-template
        key: job.yaml
```
The operator substitutes template variables:
- {{ .Database }} → target database name
- {{ .Namespace }} → PR namespace
- {{ .SeedId }} → current seedId
- {{ .SeedSha }} → current seedSha
- {{ .ConnectionSecret }} → connection secret name
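The substitution step amounts to a straight string replacement over the placeholder names listed above. A minimal sketch only: the operator may equally use Go's text/template, and this helper is not part of its actual API.

```python
def render_job_template(template: str, values: dict) -> str:
    """Replace {{ .Key }} placeholders with their values.
    Supported keys: Database, Namespace, SeedId, SeedSha, ConnectionSecret."""
    out = template
    for key, value in values.items():
        out = out.replace("{{ ." + key + " }}", value)
    return out

rendered = render_job_template(
    "image: seeder:sha-{{ .SeedSha }}\ndb: {{ .Database }}",
    {"SeedSha": "abc123", "Database": "syrf_pr_123"},
)
assert rendered == "image: seeder:sha-abc123\ndb: syrf_pr_123"
```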
2.7 Mock Data Seeding (Reusing DatabaseSeeder.cs)¶
Existing Implementation Analysis:
The PM service already has a complete mock data seeder at:
src/libs/project-management/SyRF.ProjectManagement.Core/Seeding/DatabaseSeeder.cs
What it creates:
- 5 projects at different workflow stages:
  - Quick Start Demo: 10 studies, ready for screening
  - Screening In Progress: 30 studies with dual screening decisions
  - Ready for Annotation: 20 studies with annotation questions
  - Complete Review: 15 fully annotated studies
  - Private Research: 8 studies, private project
- 3 seed investigators (fake Auth0 IDs - cannot be logged into)
- Sample studies loaded from embedded JSON resource
- Annotation questions across multiple categories
Current Activation: SYRF_SEED_DATA_ENABLED=true environment variable
Proposed Activation for Jobs: SYRF_SEED_DATA_MODE=true
(Similar to existing SYRF_INDEX_INIT_MODE=true)
Reuse Assessment:
| Aspect | Assessment |
|---|---|
| Data Quality | ✅ Good - Creates realistic projects at various stages |
| Idempotency | ✅ Built-in - Checks if seed bot exists before seeding |
| Error Handling | ✅ Has corruption detection and cleanup |
| Dependencies | ⚠️ Requires PM service DI container (MongoDB, config) |
| Resource Usage | ✅ Lightweight - No external calls, just MongoDB writes |
No Significant Downsides - The implementation is well-designed and battle-tested.
Mock Data Seeding Job Template:
```yaml
spec:
  seeding:
    type: custom
    jobTemplate:
      inline:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: mock-seeder
                  image: "ghcr.io/camaradesuk/syrf-project-management:sha-{{ .SeedSha }}"
                  env:
                    - name: SYRF_SEED_DATA_MODE
                      value: "true"
                  envFrom:
                    - secretRef:
                        name: "{{ .ConnectionSecret }}"
          activeDeadlineSeconds: 600  # 10 minute timeout (Job-level field)
          backoffLimit: 2
```
PM Service Changes Required:
Add to Program.cs (before DI container build, similar to index-init mode):
```csharp
// Check for seed data mode (runs seeding then exits)
if (Environment.GetEnvironmentVariable("SYRF_SEED_DATA_MODE") == "true")
{
    // Build minimal DI container (data services only)
    var builder = WebApplication.CreateBuilder(args);
    builder.AddDataServicesOnly(); // MongoDB, config, no RabbitMQ

    var app = builder.Build();

    // Run database seeder
    var seeder = app.Services.GetRequiredService<DatabaseSeeder>();
    seeder.Execute();

    Console.WriteLine("Mock data seeding completed successfully.");
    return; // Exit without starting web server
}
```
3. Execution Flow¶
3.1 Happy Path: PR with preview + use-snapshot Labels¶
1. Developer adds `preview` label to PR
│
2. CI Workflow triggers (labeled event)
│
3. Workflow generates:
├─ seedId: new GUID (uuidgen)
├─ seedSha: HEAD commit SHA
└─ commits to cluster-gitops/syrf/environments/preview/pr-{n}/
│
4. ArgoCD detects change, syncs:
├─ Wave -10: ExternalSecret (Atlas API key)
├─ Wave 0: Namespace
├─ Wave 10: AtlasDatabaseUser (creates connection secret)
└─ Wave 15: DatabaseLifecycle CR
│
5. DBL Operator reconciles CR:
├─ Check: Does db-ready ConfigMap exist with matching seedId?
│ NO → Continue with seeding
│
├─ Update ConfigMap: status=pending
│
├─ Wait for watched deployments to have 0 ready replicas
│ (Init containers block pods from becoming ready)
│
├─ Update ConfigMap: status=seeding
│
├─ Create Seeding Job (snapshot restore)
│ Job copies collections from syrf_snapshot → syrf_pr_{n}
│
├─ Wait for Job completion (30 min timeout)
│ SUCCESS → Continue
│ FAILURE → Update ConfigMap: status=failed, add PR comment, STOP
│
├─ Update ConfigMap: status=post-seed
│
├─ Create Post-Seed Job (if spec.postSeedJob.enabled)
│ Job runs index initialization
│
├─ Wait for Post-Seed Job completion (30 min timeout)
│ SUCCESS → Continue
│ FAILURE → Update ConfigMap: status=failed, add PR comment, STOP
│
└─ Update ConfigMap: seedId={new}, seedSha={sha}, status=complete
│
6. Service init containers detect ConfigMap with matching seedId + status=complete
│
7. Service pods start with seeded database
│
8. ArgoCD PostSync hook runs (github-notifier-job):
├─ Waits for db-ready ConfigMap with matching seedId AND status=complete
├─ Waits for all service Deployments to be healthy:
│ - syrf-api: Ready replicas == desired replicas
│ - syrf-projectmanagement: Ready replicas == desired replicas
│ - syrf-quartz: Ready replicas == desired replicas
│ - syrf-web: Ready replicas == desired replicas
├─ Authenticates with GitHub via GitHub App credentials
├─ Updates GitHub Deployment status to "success"
├─ Creates commit status (context: "preview/deploy")
└─ Posts PR comment with deployment URLs:
- Web: https://pr-{n}.syrf.org.uk
- API: https://api.pr-{n}.syrf.org.uk
- PM: https://project-management.pr-{n}.syrf.org.uk
3.2 Subsequent Push (No Label Changes)¶
1. Developer pushes code to PR
│
2. CI Workflow triggers (synchronize event)
│
3. Workflow checks labels:
├─ `reset-db-on-sync` present?
│ YES → Generate new seedId, update cluster-gitops
│ NO → Keep existing seedId, only update service image tags
│
4. ArgoCD syncs service deployments
│
5. DBL Operator reconciles (if seedId unchanged):
├─ Check: Does db-ready ConfigMap exist with matching seedId?
│ YES → Skip seeding entirely
│
6. Service pods restart with new code, same database
│
7. ArgoCD PostSync hook runs (github-notifier-job):
├─ Waits for db-ready ConfigMap with matching seedId AND status=complete
├─ Waits for all service Deployments to be healthy (Ready == Desired)
├─ Updates GitHub Deployment status to "success"
├─ Creates commit status
└─ Posts PR comment with deployment URLs
3.3 /reseed-db Command¶
1. User comments `/reseed-db` on PR
│
2. CI Workflow triggers (issue_comment event)
│
3. Workflow checks:
├─ Is `lock-db` label present?
│ YES → Add comment: "Cannot reseed: lock-db label prevents database operations.
│ Remove lock-db label first, then retry /reseed-db."
│ STOP
│ NO → Continue
│
4. Workflow generates new seedId, updates cluster-gitops
│
5. ArgoCD syncs DatabaseLifecycle CR with new seedId
│
6. DBL Operator detects seedId mismatch → triggers seeding flow
│
7. Services restart after seeding completes
3.4 PR Close with lock-db Label¶
1. PR is merged/closed
│
2. CI Workflow triggers (closed event)
│
3. Workflow checks: Is `lock-db` label present?
│
YES (lock-db present):
│ ├─ Add PR comment: "Database syrf_pr_{n} has been preserved (lock-db).
│ │ The Kubernetes namespace and resources have been removed.
│ │ To clean up the orphaned database, contact a database administrator."
│ │
│ ├─ Update DatabaseLifecycle CR: cleanupOnDelete=false
│ │
│ ├─ Remove namespace (ArgoCD cascade delete)
│ │ - Services removed
│ │ - DatabaseLifecycle CR removed (but DB preserved due to cleanupOnDelete=false)
│ │
│ └─ Database remains in MongoDB Atlas (orphaned)
│
NO (no lock-db):
│ ├─ DatabaseLifecycle CR has cleanupOnDelete=true
│ │
│ ├─ Remove namespace (ArgoCD cascade delete)
│ │ - DBL Operator finalizer runs
│ │ - Operator drops database
│ │ - All resources removed
│ │
│ └─ Database cleaned up
3.5 Label Conflict: lock-db + reset-db-on-sync¶
1. User adds `reset-db-on-sync` label while `lock-db` is present
│
2. CI Workflow triggers (labeled event)
│
3. Workflow detects conflict:
├─ Both `lock-db` AND `reset-db-on-sync` present
│
4. Workflow resolves conflict:
├─ Remove `reset-db-on-sync` label from PR (via GitHub API)
│
└─ Add PR comment:
"⚠️ Label conflict: `reset-db-on-sync` is incompatible with `lock-db`.
- `lock-db` prevents all database modifications
- `reset-db-on-sync` requests database reset on every push
The `reset-db-on-sync` label has been automatically removed.
To enable reset-on-sync, first remove the `lock-db` label."
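The conflict-resolution step above is a small pure function in the CI workflow. A sketch under the assumption that the workflow models labels as a set (the function name and comment text are illustrative, not the real pr-preview.yml implementation):

```python
def resolve_label_conflict(labels: set) -> tuple:
    """If lock-db and reset-db-on-sync are both present, drop
    reset-db-on-sync and return the PR comment explaining why."""
    if {"lock-db", "reset-db-on-sync"} <= labels:
        return labels - {"reset-db-on-sync"}, (
            "Label conflict: reset-db-on-sync is incompatible with lock-db "
            "and has been removed."
        )
    return labels, None  # no conflict, nothing to say

labels, comment = resolve_label_conflict({"preview", "lock-db", "reset-db-on-sync"})
assert labels == {"preview", "lock-db"}
assert comment is not None
```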
3.6 Sequence Diagram: Seeding Flow¶
┌──────┐ ┌──────────┐ ┌───────────┐ ┌─────────┐ ┌─────────┐
│ArgoCD│ │DBL Op │ │Seeding Job│ │Post-Seed│ │Services │
└──┬───┘ └────┬─────┘ └─────┬─────┘ └────┬────┘ └────┬────┘
│ │ │ │ │
│ Sync CR │ │ │ │
│─────────────>│ │ │ │
│ │ │ │ │
│ │ Check seedId │ │ │
│ │ vs ConfigMap │ │ │
│ │ │ │ │
│ │ Mismatch! │ │ │
│ │ Set status= │ │ │
│ │ pending │ │ │
│ │ │ │ │
│ │ Wait for 0 │ │ │
│ │ ready replicas │ │ │
│ │<────────────────│────────────────│───────────────│
│ │ │ │ │
│ │ Create Job │ │ │
│ │────────────────>│ │ │
│ │ │ │ │
│ │ Set status= │ Copy │ │
│ │ seeding │ collections │ │
│ │ │ ──────────> │ │
│ │ │ MongoDB │ │
│ │ │ │ │
│ │ Job complete │ │ │
│ │<────────────────│ │ │
│ │ │ │ │
│ │ Create Job │ │ │
│ │────────────────────────────────->│ │
│ │ │ │ │
│ │ Set status= │ │ Create │
│ │ post-seed │ │ indexes │
│ │ │ │ ──────────> │
│ │ │ │ MongoDB │
│ │ │ │ │
│ │ Job complete │ │ │
│ │<────────────────────────────────│ │
│ │ │ │ │
│ │ Set status= │ │ │
│ │ complete │ │ │
│ │ Update seedId │ │ │
│ │ │ │ │
│ │ │ │ │
│ │ │ │ Poll ConfigMap
│ │ │ │<──────────────│
│ │ │ │ │
│ │ │ │ seedId match!│
│ │ │ │ Start pod │
│ │ │ │──────────────>│
│ │ │ │ │
4. Edge Cases & Mitigations¶
| # | Edge Case / Failure Mode | Impact | Mitigation Strategy |
|---|---|---|---|
| 1 | Seeding Job fails (MongoDB timeout, quota exceeded) | High - PR environment unusable | Fail status in ConfigMap, PR comment with error details, user retries with /reseed-db |
| 2 | Post-Seed Job fails (index creation OOM) | High - Services may start with missing indexes | Fail status in ConfigMap, PR comment, services blocked until manual intervention |
| 3 | New push during active seeding | Medium - Race condition, stale data | Cancel current seeding Job, start new with latest seedId |
| 4 | lock-db + reset-db-on-sync added together | Low - Conflicting intent | Auto-remove reset-db-on-sync, add explanatory PR comment |
| 5 | lock-db present but /reseed-db command issued | Low - User confusion | Add PR comment explaining lock-db prevents reseed, do nothing |
| 6 | MongoDB Atlas user not ready when seeding starts | High - Connection failures | Sync wave ordering (user wave 10, DBL wave 15), retry loop with backoff |
| 7 | Connection secret missing | High - Job fails immediately | Operator validates secret exists before creating Job |
| 8 | PR closed while seeding in progress | Medium - Orphaned resources | Finalizer waits for Job completion or timeout before cleanup |
| 9 | Multiple PRs seeding simultaneously (resource contention) | Low - Slower seeding | Jobs run in separate namespaces, MongoDB handles concurrency |
| 10 | ConfigMap deleted manually | Medium - Coordination broken | Operator recreates ConfigMap on next reconciliation |
| 11 | Service pods stuck in init (ConfigMap never updated) | High - Deployment hangs | Startup probe timeout (15 min), operator monitors for stuck states |
| 12 | Mock data seeding Job template invalid | High - Seeding never starts | Validate Job spec on CR creation, reject invalid templates |
| 13 | preview label removed while lock-db present | Medium - Ambiguous intent | Keep CR active, DB preserved. Remove namespace only on PR close. |
| 14 | Orphaned database after lock-db PR close | Low - Resource leak | Add PR comment with database name, require manual admin cleanup |
| 15 | Seeding Job takes longer than timeout | Medium - Incomplete data | Increase timeout in CR spec, or fail and require /reseed-db |
4.1 Detailed Mitigation: Concurrent Seed Requests¶
When a new push arrives while seeding is in progress:
1. CI Workflow generates a new `seedId` (if `reset-db-on-sync` or `/reseed-db`)
2. ArgoCD updates the DatabaseLifecycle CR with the new `seedId`
3. Operator detects the CR update during reconciliation
4. Operator checks for an active seeding Job:
   - YES → Delete the current Job (`kubectl delete job`) and wait for termination
5. Operator starts new seeding with the latest `seedId`
Rationale: Latest commit should always win in CI. Completing an old seed while new code waits is wasteful.
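The cancel-and-restart decision can be sketched as follows. This is illustrative only: the action names are placeholders, not operator API, and the real operator issues Kubernetes API calls rather than returning strings.

```python
def on_cr_update(active_job_seed_id, new_seed_id: str) -> list:
    """Ordered actions for the operator when the CR's seedId changes
    while a seeding Job may still be running."""
    actions = []
    if active_job_seed_id is not None and active_job_seed_id != new_seed_id:
        # Stale seed in flight: latest commit wins, so cancel it first
        actions += ["delete-active-job", "wait-for-termination"]
    actions.append(f"create-seeding-job:{new_seed_id}")
    return actions

assert on_cr_update("old-guid", "new-guid") == [
    "delete-active-job", "wait-for-termination", "create-seeding-job:new-guid"]
assert on_cr_update(None, "new-guid") == ["create-seeding-job:new-guid"]
```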
4.2 Detailed Mitigation: Failure Notification¶
When seeding or post-seed Job fails:
- Operator detects Job failure (status.failed > 0)
- Operator updates ConfigMap: status=failed, with errorMessage populated from the Job
- Operator creates a GitHub PR comment via the GitHub API:

  Seeding Job failed: container 'seeder' exited with code 1
  [last 50 lines of Job logs]

- Services remain blocked (init containers waiting for `status: complete`)
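Building the failure comment is a pure string operation: trim the Job log tail and prepend the failure summary. A sketch only; the helper name and 50-line cutoff mirror the description above, not existing code.

```python
def build_failure_comment(container: str, exit_code: int, logs: str,
                          max_lines: int = 50) -> str:
    """Compose the PR comment posted on Job failure, keeping only the
    last `max_lines` lines of the Job's logs."""
    tail = "\n".join(logs.splitlines()[-max_lines:])
    return (f"Seeding Job failed: container '{container}' exited "
            f"with code {exit_code}\n{tail}")

logs = "\n".join(f"line {i}" for i in range(100))
comment = build_failure_comment("seeder", 1, logs)
assert comment.startswith("Seeding Job failed: container 'seeder' exited with code 1")
assert "line 99" in comment      # newest log line kept
assert "line 49" not in comment  # older lines trimmed
```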
5. Testing Strategy¶
5.1 Unit Tests¶
- CI Workflow: Label conflict detection (`lock-db` + `reset-db-on-sync`)
- CI Workflow: seedId generation (valid GUID format)
- CI Workflow: seedSha extraction (valid commit SHA)
- Helm template: DatabaseLifecycle CR generation with all label combinations
- Helm template: ConfigMap RBAC for init containers
5.2 Integration Tests¶
- DBL Operator: Reconciles CR and creates Seeding Job
- DBL Operator: Updates ConfigMap on Job completion
- DBL Operator: Handles Job failure correctly
- DBL Operator: Cancels in-progress Job on new seedId
- Init Container: Waits for ConfigMap with matching seedId
- Init Container: Proceeds when status=complete
- ArgoCD: Sync wave ordering (secrets → user → DBL)
5.3 End-to-End Tests¶
- Full PR preview lifecycle: create → push → reseed → close
- `lock-db` + PR close: Database preserved, namespace removed
- `reset-db-on-sync`: New seedId on every push
- `/reseed-db` command: Triggers reseed
- Label conflict: `reset-db-on-sync` auto-removed when `lock-db` present
5.4 Manual Verification Steps¶
```bash
# 1. Create PR with preview label
gh pr create --title "Test PR" --body "Testing preview"
gh pr edit 123 --add-label preview

# 2. Wait for deployment, verify ConfigMap
kubectl get configmap db-ready -n pr-123 -o yaml

# 3. Verify seedId in ConfigMap matches CR
kubectl get dbl pr-database -n pr-123 -o jsonpath='{.spec.seedId}'

# 4. Test /reseed-db command
gh pr comment 123 --body "/reseed-db"
# Wait and verify new seedId

# 5. Test lock-db + reset-db-on-sync conflict
gh pr edit 123 --add-label lock-db
gh pr edit 123 --add-label reset-db-on-sync
# Verify reset-db-on-sync removed, comment added

# 6. Test PR close with lock-db
gh pr close 123
# Verify namespace deleted, database preserved
kubectl get ns pr-123  # Should not exist
# Check MongoDB Atlas for syrf_pr_123 database
```
6. Implementation Checklist¶
Phase 1: Core Infrastructure¶
- [ ] Update DatabaseLifecycle CRD with new fields (`seedId`, `seedSha`, `lockDatabase`, `postSeedJob`)
- [ ] Update DBL Operator reconciliation logic:
  - Compare `spec.seedId` with ConfigMap `seedId`
  - Create Seeding Job instead of inline seeding
  - Handle Job lifecycle (create, monitor, cleanup)
  - Create Post-Seed Job when enabled
  - Update ConfigMap with status transitions
- [ ] Update db-ready ConfigMap structure (`seedId`, `seedSha`, `status`, `errorMessage`)
- [ ] Update service init container to check `status=complete`
Phase 2: CI Workflow Updates¶
- [ ] Add `seedId` generation (new GUID via `uuidgen`)
- [ ] Add `seedSha` tracking (HEAD commit SHA)
- [ ] Implement label conflict detection (`lock-db` + `reset-db-on-sync`)
- [ ] Implement `/reseed-db` command with `lock-db` check
- [ ] Implement `reset-db-on-sync` behavior (new seedId on each push)
- [ ] Update cluster-gitops values generation
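The first two Phase 2 items can be sketched together. A sketch under stated assumptions: the workflow itself uses `uuidgen` and `git rev-parse HEAD` in shell, and this helper is illustrative.

```python
import re
import uuid

def generate_seed_tracking(head_sha: str) -> dict:
    """Produce the two tracking values the CI workflow commits to
    cluster-gitops: a fresh GUID as seedId, the HEAD SHA as seedSha."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", head_sha):
        raise ValueError(f"not a commit SHA: {head_sha!r}")
    return {"seedId": str(uuid.uuid4()), "seedSha": head_sha}

values = generate_seed_tracking("abc123def456")
assert values["seedSha"] == "abc123def456"
uuid.UUID(values["seedId"])  # raises if seedId is not a valid GUID
```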
Phase 3: Helm Chart Updates¶
- [ ] Update preview-infrastructure chart:
  - Pass `seedId` and `seedSha` to DatabaseLifecycle CR
  - Add `lockDatabase` field based on `lock-db` label
  - Configure `cleanupOnDelete` based on `lock-db` label
- [ ] Update syrf-common library:
  - Init container checks `status=complete` (not just seedId match)
Phase 4: Mock Data Seeding (via PM seed-data mode)¶
- [ ] Add `SYRF_SEED_DATA_MODE` environment variable handling to PM `Program.cs`:
  - Check before DI container build (similar to `SYRF_INDEX_INIT_MODE`)
  - Build minimal DI container (data services only, no RabbitMQ/ROB)
  - Execute `DatabaseSeeder.Execute()`
  - Exit after completion
- [ ] Ensure `DatabaseSeeder` is registered in minimal DI container
- [ ] Add integration test for seed-data mode
- [ ] Create default mock data seeding Job template in preview-infrastructure chart
- [ ] Document mock data seeding in docs/how-to/use-pr-preview-environments.md
Phase 5: Documentation & Cleanup¶
User Documentation:
- [ ] Update docs/how-to/use-pr-preview-environments.md:
- New label names and behaviors (lock-db, reset-db-on-sync)
- Updated /reseed-db command behavior with lock-db check
- Explanation of seedId vs seedSha tracking
- Mock data seeding vs snapshot seeding decision guide
- Troubleshooting: What to do when seeding fails
Migration Documentation:
- [ ] Create docs/how-to/migrate-pr-preview-labels.md:
- persist-db → lock-db label rename
- Behavioral differences (lock-db now keeps CR active)
- One-time migration steps for existing PRs
Architecture Documentation:
- [ ] Update docs/architecture/pr-preview-environments.md (or create if not exists):
- DBL Operator Job-based seeding architecture
- ConfigMap coordination pattern with seedId/seedSha/status
- Sequence diagrams for all flows
- Component responsibility matrix
Reference Updates:
- [ ] Update CLAUDE.md:
- New label semantics
- seedId/seedSha tracking explanation
- Updated workflow triggers
- [ ] Update src/charts/preview-infrastructure/README.md:
- New Helm values for seedId/seedSha
- Custom seeding job template examples
Code Cleanup:
- [ ] Remove references to seedVersion (replaced by seedId)
- [ ] Remove deprecated persist-db label handling (replaced by lock-db)
- [ ] Update github-notifier-job.yaml to use seedId
7. CI Workflow Architecture¶
This section documents the complete CI workflow architecture for PR preview environments.
7.0 Responsibility Boundaries (CRITICAL)¶
The CI workflow (pr-preview.yml) ONLY commits to cluster-gitops. It does NOT:
- Communicate with ArgoCD directly (no ArgoCD API calls)
- Create or modify ConfigMaps on the cluster directly
- Execute kubectl commands to modify cluster state
Cluster operations are handled by ArgoCD and operators:
- ArgoCD syncs cluster-gitops changes to create Kubernetes resources
- DBL Operator creates/updates the `db-ready` ConfigMap
- `github-notifier-job` (ArgoCD PostSync hook) updates GitHub Deployment status
┌─────────────────────────────────────────────────────────────────────────────────┐
│ Responsibility Boundary │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ GitHub Actions (CI) │ Kubernetes Cluster │
│ ─────────────────── │ ────────────────── │
│ │ │
│ ✅ Build Docker images │ ✅ ArgoCD syncs from gitops │
│ ✅ Push to GHCR │ ✅ DBL Operator seeds database │
│ ✅ Calculate versions (GitVersion) │ ✅ DBL Operator creates ConfigMap│
│ ✅ Commit to cluster-gitops │ ✅ PostSync hook notifies GitHub │
│ ✅ Create GitHub Deployment (pending) │ ✅ Service init containers wait │
│ │ │
│ ❌ NO kubectl apply │ │
│ ❌ NO ArgoCD API calls │ │
│ ❌ NO direct ConfigMap creation │ │
│ │ │
└─────────────────────────────────────────────────────────────────────────────────┘
7.1 Full CI Pipeline Stages¶
The pr-preview.yml workflow executes the following stages:
┌──────────────────────────────────────────────────────────────────────────────────┐
│ PR Preview CI Pipeline │
├──────────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. check-label │
│ ├─ Verify PR has 'preview' label │
│ ├─ Handle label-specific events (persist-db, use-snapshot) │
│ ├─ Extract preview config from PR description (feature flags) │
│ └─ Output: should_build, pr_number, head_sha │
│ │
│ 2. create-deployment │
│ ├─ Create GitHub Deployment for preview environment │
│ └─ Set initial status to "pending" │
│ │
│ 3. detect-changes (tag-based) │
│ ├─ Compare HEAD SHA against last service tag │
│ ├─ For each service: determine action (build | use-existing | retag) │
│ ├─ Build matrix for changed services │
│ └─ Output: *_changed, *_action, *_last_version, preview_services_matrix │
│ │
│ 4. version-* jobs (parallel, per-service) │
│ ├─ Uses reusable workflow: _gitversion.yml │
│ ├─ Calculates semantic version from git history │
│ └─ Output: version, semver, fullsemver, informationalVersion │
│ │
│ 5. build-web-artifacts (if web changed) │
│ ├─ npm ci && ng build --configuration development │
│ ├─ Upload dist artifact for Docker build │
│ └─ Sentry sourcemaps upload │
│ │
│ 6. build-and-push-images (matrix, parallel) │
│ ├─ Uses reusable workflow: _docker-build.yml │
│ ├─ Build Docker image with version tag │
│ ├─ Push to ghcr.io/camaradesuk/{service}:{version} │
│ └─ Tag with sha-{shortsha} for ArgoCD deployment │
│ │
│ 7. retag-unchanged │
│ ├─ For unchanged services: crane copy :latest → :sha-{shortsha} │
│ └─ Ensures all services have sha-{shortsha} tag for this commit │
│ │
│ 8. update-pr-status │
│ ├─ Update PR description with deployment status │
│ └─ Write GitHub Actions job summary │
│ │
│ 9. write-versions (commits to cluster-gitops) │
│ ├─ Checkout cluster-gitops repository │
│ ├─ Check labels (persist-db, use-snapshot) │
│ ├─ Determine database reset trigger │
│ ├─ Write pr.yaml (PR metadata, seedVersion, deploymentNotification) │
│ ├─ Write infrastructure.values.yaml (MongoDB, DatabaseLifecycle config) │
│ ├─ Write services/*.values.yaml (image tags, GitVersion values) │
│ └─ Git commit and push │
│ │
└──────────────────────────────────────────────────────────────────────────────────┘
7.2 Workflow Triggers¶
The workflow responds to:
- PR comment containing /reseed-db
- Label changes (preview, lock-db, use-snapshot, reset-db-on-sync)
- Push events to a PR branch
- PR closed events
- workflow_dispatch (manual trigger with PR number)
7.3 Testing and Code Quality (Future Enhancement)¶
Note: The current pr-preview.yml workflow does NOT include:
- Unit test execution
- Integration test execution
- SonarQube code coverage analysis
- SonarQube static analysis
These are handled by separate CI workflows or are planned enhancements. The DBL redesign focuses on database lifecycle management - testing integration should be addressed separately.
7.4 Decision Tree (Database/Label Logic)¶
trigger: PR comment | label change | push | pr closed
1. Is comment JUST added that includes '/reseed-db'?
YES →
Is 'lock-db' label set?
YES → Comment "Cannot reseed: lock-db prevents database operations"
NO → Set seedId to new GUID, set locked=false, comment "DB is being reseeded"
NO → Continue to step 2
2. Are 'lock-db' AND 'reset-db-on-sync' BOTH set?
YES → Comment about conflict, remove 'reset-db-on-sync' label, set locked=true
NO → Continue to step 3
3. Was 'use-snapshot' label JUST set/unset AND 'lock-db' ALREADY set?
YES → Undo the set/unset of 'use-snapshot', comment explaining conflict
NO → Continue to step 4
4. Is (preview JUST set OR push with preview set OR PR JUST opened) AND PR is open?
YES →
Is label combo valid?
NO → Comment about invalid combo, STOP
YES →
Does seedId already exist in cluster-gitops for this PR?
YES →
Is 'reset-db-on-sync' set?
YES → Set seedId to new GUID
NO →
Is 'lock-db' set?
YES → Use existing seedId, set locked=true
NO → Use existing seedId, set locked=false
NO → Set seedId to new GUID
NO → Continue to step 5
5. Was 'preview' JUST unset OR PR JUST closed?
YES →
Is 'lock-db' set?
YES → Set locked=true in values, sync DBL, then delete PR folder (DB preserved)
NO → Delete PR folder (DB will be dropped)
NO → Continue to step 6
6. Was 'lock-db' label JUST changed?
JUST SET → Update values to set locked=true
JUST UNSET → Update values to set locked=false
NO CHANGE → Stop processing (nothing to do)
Note: The original logic had a bug at step 6 - it didn't distinguish between lock-db being set vs unset. This corrected version handles both cases.
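The decision tree can be sketched as a single pure function. This is illustrative only — event names, argument conventions, and outputs are hypothetical; the real pr-preview.yml spreads this logic across GitHub Actions job conditions. Step 3 (use-snapshot toggled under lock-db) is omitted for brevity.

```shell
# decide <event> <lock-db:yes|no> <reset-db-on-sync:yes|no> <seedId-exists:yes|no>
decide() {
  local event=$1 lock=$2 reset=$3 exists=$4
  # Step 1: /reseed-db comment just added
  if [[ $event == reseed-comment ]]; then
    if [[ $lock == yes ]]; then echo "comment: lock-db prevents reseed"; return; fi
    echo "new-seedId locked=false"; return
  fi
  # Step 2: conflicting labels
  if [[ $lock == yes && $reset == yes ]]; then
    echo "remove reset-db-on-sync, locked=true"; return
  fi
  # Step 4: deploy-triggering events while the PR is open
  if [[ $event == push || $event == preview-set || $event == pr-opened ]]; then
    if [[ $exists == yes ]]; then
      if [[ $reset == yes ]]; then echo "new-seedId"; return; fi
      if [[ $lock == yes ]]; then echo "reuse-seedId locked=true"; return; fi
      echo "reuse-seedId locked=false"; return
    fi
    echo "new-seedId"; return
  fi
  # Step 5: teardown events
  if [[ $event == preview-unset || $event == pr-closed ]]; then
    if [[ $lock == yes ]]; then echo "locked=true, delete PR folder (DB preserved)"; return; fi
    echo "delete PR folder (DB dropped)"; return
  fi
  # Step 6: lock-db toggled
  case $event in
    lock-set)   echo "locked=true" ;;
    lock-unset) echo "locked=false" ;;
    *)          echo "no-op" ;;
  esac
}
```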
7.5 Label Detection Timing¶
RESOLVED: Labels are detected at workflow start time, not when the workflow is queued.
This is correct behavior because:
1. The "interruptible" queue pattern cancels stale workflows
2. Detecting labels at run time ensures the workflow acts on current state
3. If label changes occur while a workflow is queued, the workflow sees the new state
7.6 Invocation Queue Management¶
The workflow implements a selective "interruptible" queue pattern to optimize CI efficiency:
Interruptible Events (Expensive Operations):
- Git push events
- PR opened/closed events
- preview label set/unset
- /reseed-db command (triggers database seeding - stale if new push arrives)
These events trigger expensive operations: build container images, push to registries, seed database, deploy. If a new event arrives, the old work is stale anyway - cancel and start fresh.
Non-Interruptible Events (Cheap Operations):
- `lock-db` label changes
- `reset-db-on-sync` label changes
- `use-snapshot` label changes
These events only update cluster-gitops values - no builds required. They're fast and should complete before processing the next event.
Analysis of Implementation Options:
Option 1: Separate concurrency groups (NOT RECOMMENDED)
The idea: use conditional concurrency groups based on event type:
```yaml
concurrency:
  group: pr-${{ pr_number }}-${{ is_interruptible && 'build' || 'config' }}
  cancel-in-progress: ${{ is_interruptible }}
```
Problem: Race conditions. If both groups have active workflows, they both try to commit to cluster-gitops simultaneously, causing git conflicts.
Option 2: Single group, always cancel (RECOMMENDED)
Why this works:
- All events eventually commit to cluster-gitops (single writer)
- Config-only changes are fast - unlikely to be interrupted
- If cancelled, the subsequent push includes current label state anyway
- User can re-apply label if needed (rare edge case)
- Simple, predictable behavior
Option 3: Separate workflow files (COMPLEX)
Split into pr-preview-build.yml and pr-preview-config.yml. Still has race condition issues and adds coordination complexity.
Recommendation: Use Option 2 (single group, always cancel). Simpler, safer, and the rare case of a config change being cancelled by a push is acceptable since the push will include current labels.
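Under Option 2, the concurrency block reduces to a single per-PR group. A sketch — the group naming and number-resolution expression are assumptions, not the committed implementation:

```yaml
concurrency:
  # issue_comment events carry the PR number on the issue payload
  group: pr-preview-${{ github.event.pull_request.number || github.event.issue.number }}
  cancel-in-progress: true
```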
7.7 Invalid Label Combinations¶
The decision tree checks for "valid label combo" but doesn't specify invalid combinations. Define:
| Invalid Combination | Reason | Resolution |
|---|---|---|
| `lock-db` + `reset-db-on-sync` | Conflicting: lock prevents reseeds, reset requests them | Auto-remove `reset-db-on-sync` |
Currently, no other combinations are invalid. `use-snapshot` + `lock-db` is allowed (though changing `use-snapshot` while `lock-db` is set is blocked as meaningless).
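The single invalid combination can be checked with a small validator. A hypothetical helper — the real workflow performs this check inline in job conditions:

```shell
# validate_labels <label>... -> prints "valid" (exit 0) or a reason (exit 1).
# lock-db + reset-db-on-sync is currently the only invalid combination.
validate_labels() {
  local labels=" $* "
  if [[ $labels == *" lock-db "* && $labels == *" reset-db-on-sync "* ]]; then
    echo "invalid: lock-db conflicts with reset-db-on-sync"
    return 1
  fi
  echo "valid"
}
```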
8. Open Questions¶
Resolved Questions¶
- Mock Data Extraction: RESOLVED - Reuse the existing `DatabaseSeeder.cs` via the `SYRF_SEED_DATA_MODE=true` environment variable. No extraction needed - run the PM image in seed-data mode (similar to index-init mode).
- GitHub App for PR Comments: RESOLVED - Reuse the existing GitHub App (`github-app-credentials` secret). The `github-notifier-job.yaml` already uses this for deployment notifications.
- Seeding Job Resource Limits: RESOLVED - Use normal amounts; tune later based on observed performance.
- Orphaned Database Cleanup: RESOLVED - No automated cleanup. Leave the orphaned database and add a comment on the PR stating the database name and that manual cleanup is required.
- Label Detection Timing: RESOLVED - Labels are detected at workflow start time (not when queued). This is correct behavior: the interruptible queue pattern cancels stale builds, and detecting labels at run time ensures the workflow acts on current state.
- Interruptible Event Scope: RESOLVED - Selective interruptibility is correct:
  - Interruptible (push, preview, PR open/close): trigger expensive builds - cancel-in-progress avoids wasted CI
  - Non-interruptible (lock-db, reset-db-on-sync, use-snapshot): only update values - queue and complete (fast, cheap)
- Notification Content: RESOLVED - Include both `seedId` and `seedSha` in deployment notification comments for full traceability.
- Failure Comments: RESOLVED - The DBL Operator posts failure comments directly using the same GitHub App credentials as `github-notifier-job`. No CI workflow involvement needed.
Remaining Questions¶
All questions resolved.
9. Migration Mapping: Current → New¶
This section shows how existing components map to the new architecture.
9.1 Terminology Changes¶
| Current Term | New Term | Notes |
|---|---|---|
| `seedVersion` | `seedId` | GUID that triggers reseed detection |
| (none) | `seedSha` | Commit SHA for audit trail |
| `persist-db` label | `lock-db` label | Enhanced: now keeps CR active when preview removed |
| (none) | `reset-db-on-sync` label | New: reseed on every push |
| (none) | `status` field | New: pending/seeding/post-seed/complete/failed |
9.2 Component Changes¶
| Component | Current Location | Change Required |
|---|---|---|
| DBL CRD | `cluster-gitops/charts/database-lifecycle-operator/crds/` | Add seedId, seedSha, lockDatabase, postSeedJob fields |
| DBL Operator Hook | `cluster-gitops/charts/database-lifecycle-operator/templates/configmap-hooks.yaml` | Refactor to create Jobs instead of inline seeding |
| db-ready ConfigMap | Created by operator | Add seedId, seedSha, status, errorMessage, lastUpdated |
| Init Container | `src/charts/syrf-common/templates/_deployment-dotnet.tpl` | Check status=complete AND seedId match |
| GitHub Notifier | `src/charts/preview-infrastructure/templates/github-notifier-job.yaml` | Replace seedVersion with seedId |
| PR Preview Workflow | `.github/workflows/pr-preview.yml` | Add seedId/seedSha generation, label conflict handling |
| PM Service | `src/services/project-management/` | Add SYRF_SEED_DATA_MODE handling in Program.cs |
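The init-container change can be sketched as a gate that blocks application start until the db-ready ConfigMap reports a matching, completed seed. The mount path `/etc/db-ready`, the key-per-file projection, and the function names are assumptions for illustration — the actual logic lives in `_deployment-dotnet.tpl`:

```shell
# check_ready <configmap-dir> <expected-seed-id>
# True only when status=complete AND the recorded seedId matches.
check_ready() {
  [[ "$(cat "$1/status" 2>/dev/null)" == "complete" \
  && "$(cat "$1/seedId" 2>/dev/null)" == "$2" ]]
}

# wait_for_db <configmap-dir> <expected-seed-id> -- poll until ready
wait_for_db() {
  until check_ready "$1" "$2"; do
    echo "waiting for database seeding (expected seedId=$2)..."
    sleep 5
  done
}
```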
9.3 Helm Values Changes¶
preview-infrastructure chart values:
```yaml
# Current
seedVersion: "abc123def"  # Commit SHA

# New
seedId: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"  # GUID
seedSha: "abc123def"                            # Commit SHA (audit only)
lockDatabase: false                             # From lock-db label
```
9.4 CI Workflow Changes¶
The current workflow generates only a `seedVersion` (the commit SHA, as shown in 9.3). The new workflow generates:
```yaml
seedId: $(uuidgen)          # New GUID on initial deploy, /reseed-db, or reset-db-on-sync
seedSha: ${{ github.sha }}  # Always current commit
lockDatabase: ${{ contains(github.event.pull_request.labels.*.name, 'lock-db') }}
```
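The seedId rule ("new GUID only on initial deploy, /reseed-db, or reset-db-on-sync") reduces to a small selection function. A sketch — the function names and the uuid fallback source are illustrative assumptions, not the committed workflow script:

```shell
# new_seed_id: emit a fresh GUID (uuidgen where available, else the kernel source)
new_seed_id() { uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid; }

# choose_seed_id <existing-seed-id-or-empty> <reset:true|false>
# Reuse the existing seedId unless none exists yet or a reseed was requested.
choose_seed_id() {
  local existing=$1 reset=$2
  if [[ -z $existing || $reset == true ]]; then
    new_seed_id
  else
    echo "$existing"
  fi
}
```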
9.5 Files to Modify¶
| File | Changes |
|---|---|
| `cluster-gitops/charts/database-lifecycle-operator/crds/databaselifecycle.yaml` | Add new CRD fields |
| `cluster-gitops/charts/database-lifecycle-operator/templates/configmap-hooks.yaml` | Job-based seeding logic |
| `src/charts/preview-infrastructure/templates/database-lifecycle.yaml` | Pass new values |
| `src/charts/preview-infrastructure/templates/github-notifier-job.yaml` | Use seedId |
| `src/charts/syrf-common/templates/_deployment-dotnet.tpl` | Check status + seedId |
| `.github/workflows/pr-preview.yml` | seedId generation, label handling |
| `src/services/project-management/SyRF.ProjectManagement.Endpoint/Program.cs` | Add seed-data mode |
10. References¶
Internal Documentation¶
- Current DBL Operator Implementation: `cluster-gitops/charts/database-lifecycle-operator/` (separate repo) - shell-operator based controller
- PR Preview Environments How-To - User guide
- MongoDB Testing Strategy - Database isolation strategy
- Data Snapshot Automation - Snapshot architecture
External Resources¶
- Kubernetes Jobs - Job controller documentation
- Shell Operator - Current operator framework
- ArgoCD Sync Waves - Resource ordering
Document End
This document must be reviewed and approved before implementation begins.