DBL Operator Redesign - Implementation Specification

Status: Ready for Approval
Target Environment: GKE Kubernetes Cluster / ArgoCD GitOps


Executive Summary

This specification defines the redesigned Database Lifecycle (DBL) operator architecture for PR preview environments in the SyRF monorepo. The redesign addresses several limitations of the current implementation:

  1. Seeding as Jobs: Move database seeding from the operator pod to separate Kubernetes Jobs for better resource isolation, logging, and failure handling
  2. Custom Seeding Templates: Support both snapshot restore AND mock data seeding via configurable Job templates
  3. Enhanced Tracking: Split tracking into seedId (trigger) and seedSha (audit) for clearer coordination
  4. New Label Semantics: Add reset-db-on-sync label and clarify lock-db behavior (replaces persist-db)
  5. Post-Seed Job Support: Optional post-seeding jobs for migrations, indexing, or custom operations

The specification describes the desired end-state architecture. Implementation should adapt the current shell-operator based system to match this specification.


Table of Contents

  1. High-Level Architecture
  2. Detailed Design
  3. Execution Flow
  4. Edge Cases & Mitigations
  5. Testing Strategy
  6. Implementation Checklist
  7. CI Workflow Architecture
  8. Open Questions
  9. Migration Mapping: Current → New
  10. References

1. High-Level Architecture

1.1 Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                         PR Preview Environment Flow                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  GitHub PR                    CI Workflow                  cluster-gitops   │
│  ┌─────────┐                 ┌──────────┐                 ┌─────────────┐   │
│  │ Labels: │  push/label     │ Detect   │   commit        │ pr-{n}/     │   │
│  │ preview │ ───────────────>│ changes  │ ───────────────>│ pr.yaml     │   │
│  │ lock-db │                 │ Generate │                 │ values.yaml │   │
│  │ use-snap│                 │ seedId   │                 └──────┬──────┘   │
│  │ reset.. │                 └──────────┘                        │          │
│  └─────────┘                                                     │          │
│       │                                                          ▼          │
│       │ /reseed-db                                    ┌─────────────────┐   │
│       │ comment                                       │ ArgoCD Sync     │   │
│       └──────────────────────────────────────────────>│ ApplicationSet  │   │
│                                                       └────────┬────────┘   │
│                                                                │            │
│                            Kubernetes Cluster                  │            │
│  ┌─────────────────────────────────────────────────────────────┼──────────┐ │
│  │                                                             ▼          │ │
│  │  ┌──────────────────┐    ┌──────────────────────────────────────────┐ │ │
│  │  │ DBL Operator     │    │ PR Namespace (pr-{n})                    │ │ │
│  │  │ ┌──────────────┐ │    │                                          │ │ │
│  │  │ │Reconciliation│ │    │  ┌─────────────┐   ┌─────────────────┐   │ │ │
│  │  │ │    Loop      │─┼───>│  │ Seeding Job │──>│ db-ready        │   │ │ │
│  │  │ └──────────────┘ │    │  │ (snapshot/  │   │ ConfigMap       │   │ │ │
│  │  │                  │    │  │  mock data) │   │ - seedId        │   │ │ │
│  │  │ Watches:         │    │  └─────────────┘   │ - seedSha       │   │ │ │
│  │  │ DatabaseLifecycle│    │         │          │ - status        │   │ │ │
│  │  │ CRs              │    │         ▼          └────────┬────────┘   │ │ │
│  │  └──────────────────┘    │  ┌─────────────┐           │            │ │ │
│  │                          │  │ Post-Seed   │           │            │ │ │
│  │                          │  │ Job (opt.)  │           │            │ │ │
│  │                          │  │ (indexes/   │           │            │ │ │
│  │                          │  │  migrations)│           │            │ │ │
│  │                          │  └─────────────┘           │            │ │ │
│  │                          │         │                  │            │ │ │
│  │                          │         ▼                  │            │ │ │
│  │                          │  ┌─────────────────────────┴──────────┐ │ │ │
│  │                          │  │ Service Pods (api, pm, quartz)     │ │ │ │
│  │                          │  │ ┌────────────────┐                 │ │ │ │
│  │                          │  │ │ Init Container │ waits for       │ │ │ │
│  │                          │  │ │ (wait-for-db)  │ ConfigMap match │ │ │ │
│  │                          │  │ └────────────────┘                 │ │ │ │
│  │                          │  └────────────────────────────────────┘ │ │ │
│  │                          └──────────────────────────────────────────┘ │ │
│  └──────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘

1.2 Key Components

| Component | Location | Purpose |
|---|---|---|
| DBL Operator | syrf-system namespace | Watches DatabaseLifecycle CRs, creates/manages seeding Jobs |
| DatabaseLifecycle CR | PR namespace | Declares desired database state, seeding config, job templates |
| Seeding Job | PR namespace | Kubernetes Job that performs snapshot restore or mock data seeding |
| Post-Seed Job | PR namespace | Optional Job for migrations, indexing, or custom operations |
| db-ready ConfigMap | PR namespace | Coordination mechanism - services wait for matching seedId |
| CI Workflow | GitHub Actions | Generates seedId/seedSha, updates cluster-gitops values |

1.3 Dependencies

Internal:

  • ArgoCD ApplicationSet (syrf-previews.yaml)
  • preview-infrastructure Helm chart
  • syrf-common Helm library chart (init container templates)
  • MongoDB Atlas Operator (user provisioning)

External:

  • MongoDB Atlas (database hosting)
  • GitHub Actions (CI/CD)
  • cluster-gitops repository (GitOps values)

1.4 Integration Points

  1. GitHub → CI Workflow: PR events (push, label, comment) trigger workflow
  2. CI Workflow → cluster-gitops: Workflow commits seedId, seedSha, label-derived values
  3. ArgoCD → Kubernetes: ApplicationSet generates Applications from cluster-gitops
  4. DBL Operator → Seeding Job: Operator creates Job when CR seedId doesn't match ConfigMap
  5. ConfigMap → Service Init Containers: Services poll ConfigMap until seedId matches
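
The polling in point 5 can be sketched as a wait-for-db init container script. This is a hypothetical sketch: the kubectl-based polling and the `EXPECTED_SEED_ID` environment variable are assumptions about how the syrf-common template might inject values, not the actual implementation.

```shell
#!/bin/sh
# check_db_ready SEED_ID STATUS EXPECTED_SEED_ID
# True only when the ConfigMap reports the expected seedId AND seeding finished.
check_db_ready() {
  [ "$1" = "$3" ] && [ "$2" = "complete" ]
}

# wait_for_db: poll the db-ready ConfigMap until it matches, blocking pod startup.
wait_for_db() {
  while true; do
    seed_id=$(kubectl get configmap db-ready -o jsonpath='{.data.seedId}' 2>/dev/null || true)
    status=$(kubectl get configmap db-ready -o jsonpath='{.data.status}' 2>/dev/null || true)
    if check_db_ready "$seed_id" "$status" "$EXPECTED_SEED_ID"; then
      echo "db-ready matched seedId=$seed_id; releasing pod"
      return 0
    fi
    echo "waiting for db-ready: seedId=$seed_id status=$status"
    sleep 10
  done
}
```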

2. Detailed Design

2.1 DatabaseLifecycle Custom Resource Definition

apiVersion: database.syrf.org.uk/v1alpha1
kind: DatabaseLifecycle
metadata:
  name: pr-database
  namespace: pr-123
  annotations:
    argocd.argoproj.io/sync-wave: "15"
spec:
  # Target database configuration
  database: syrf_pr_123

  # Connection details
  connection:
    secretRef:
      name: syrfdb-cluster0-syrf-pr-123-app
      connectionStringKey: connectionStringStandardSrv

  # Seed tracking (updated by CI workflow)
  seedId: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"  # GUID - triggers reseed when changed
  seedSha: "abc123def456"                          # Commit SHA - audit/traceability only

  # Seeding configuration
  seeding:
    enabled: true

    # Option 1: Built-in snapshot restore
    type: snapshot  # or "custom"
    sourceDatabase: syrf_snapshot
    collections:
      - pmProject
      - pmStudy
      - pmInvestigator
      - pmSystematicSearch
      - pmDataExportJob
      - pmStudyCorrection
      - pmInvestigatorUsage
      - pmRiskOfBiasAiJob
      - pmProjectDailyStat
      - pmPotential
      - pmInvestigatorEmail

    # Option 2: Custom seeding job (for mock data or other sources)
    # type: custom
    # jobTemplate:
    #   inline: { ... }  # Inline Job spec
    #   # OR
    #   configMapRef:
    #     name: mock-data-seeder-template
    #     key: job.yaml

    # Timeouts
    timeout: 1800  # 30 minutes for seeding

  # Post-seed job (optional - NO DEFAULT)
  postSeedJob:
    enabled: true
    jobTemplate:
      inline:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: index-init
                  image: "ghcr.io/camaradesuk/syrf-project-management:sha-abc123"
                  env:
                    - name: SYRF_INDEX_INIT_MODE
                      value: "true"
                  envFrom:
                    - secretRef:
                        name: syrfdb-cluster0-syrf-pr-123-app
      # OR
      # configMapRef:
      #   name: index-init-template
      #   key: job.yaml
    timeout: 1800  # 30 minutes for post-seed

  # Existing database policy
  existingDatabasePolicy: drop  # drop | skip | fail

  # Cleanup configuration
  cleanupOnDelete: true  # Drop database when CR is deleted
  lockDatabase: false    # When true, prevents drop even on CR delete

  # Watched deployments (wait for 0 ready replicas before seeding)
  watchedDeployments:
    - name: syrf-api
    - name: syrf-projectmanagement
    - name: syrf-quartz

2.2 Seed Tracking Fields

| Field | Type | Purpose | When Updated |
|---|---|---|---|
| seedId | GUID | Triggers reseed detection | On PR push (if reset-db-on-sync), on /reseed-db command, on initial deployment |
| seedSha | String | Audit trail - which commit triggered seeding | Always updated with seedId |

Key Behavior:

  • Operator compares spec.seedId with the db-ready ConfigMap's seedId
  • If different → seeding needed
  • If same → skip seeding (already done)
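
In shell-operator terms, this comparison could look like the following sketch (the kubectl lookups and resource names are assumptions about the hook implementation):

```shell
# needs_seeding CR_SEED_ID CM_SEED_ID
# Seeding is required when the ConfigMap is absent (empty value) or the IDs differ.
needs_seeding() {
  [ -z "$2" ] || [ "$1" != "$2" ]
}

# Hook sketch: fetch both sides for a PR namespace and decide.
reconcile_seed_check() {
  ns="$1"
  cr_seed_id=$(kubectl get dbl pr-database -n "$ns" -o jsonpath='{.spec.seedId}')
  cm_seed_id=$(kubectl get configmap db-ready -n "$ns" -o jsonpath='{.data.seedId}' 2>/dev/null || true)
  if needs_seeding "$cr_seed_id" "$cm_seed_id"; then
    echo "seed"
  else
    echo "skip"
  fi
}
```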

2.3 db-ready ConfigMap Structure

apiVersion: v1
kind: ConfigMap
metadata:
  name: db-ready
  namespace: pr-123
data:
  seedId: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
  seedSha: "abc123def456"
  status: "complete"  # pending | seeding | post-seed | complete | failed
  lastUpdated: "2026-01-20T10:30:00Z"
  errorMessage: ""    # Populated if status=failed

Status Transitions:

pending → seeding → post-seed → complete
                ↘         ↘
                 failed    failed
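
The transitions above can be encoded as a guard. This is a sketch: enforcing the state machine in the operator is an assumption, the spec only documents the diagram (which has no seeding → complete shortcut for the postSeedJob-disabled case).

```shell
# valid_transition FROM TO — allow only the documented status transitions.
valid_transition() {
  case "$1>$2" in
    "pending>seeding") return 0 ;;
    "seeding>post-seed"|"seeding>failed") return 0 ;;
    "post-seed>complete"|"post-seed>failed") return 0 ;;
    *) return 1 ;;
  esac
}
```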

2.4 Label Semantics

| Label | Effect | Details |
|---|---|---|
| preview | Master switch | Required for all DBL functionality. Without it, DBL CR is not created. |
| use-snapshot | Snapshot seeding | Uses syrf_snapshot as source. Without it, uses custom seeding job template (mock data) or empty DB. |
| lock-db | Prevents DB operations | Prevents drop/reseed operations. Overrides preview removal - keeps CR active. On PR close, namespace is removed but DB is orphaned. |
| reset-db-on-sync | Reseed on every push | Updates seedId to new GUID on each PR push. Incompatible with lock-db - if both present, reset-db-on-sync is removed and comment added to PR. |

Label Precedence:

lock-db > reset-db-on-sync > use-snapshot > (default empty)
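
The precedence rule can be expressed as a small resolver (a sketch; the mode names returned here are illustrative, not defined by the spec):

```shell
# resolve_db_mode "label1 label2 ..." — highest-precedence label wins:
# lock-db > reset-db-on-sync > use-snapshot > default (empty DB).
resolve_db_mode() {
  case " $1 " in
    *" lock-db "*)          echo "locked" ;;
    *" reset-db-on-sync "*) echo "reset-on-sync" ;;
    *" use-snapshot "*)     echo "snapshot" ;;
    *)                      echo "empty" ;;
  esac
}
```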

2.5 PR Commands

| Command | Effect |
|---|---|
| /reseed-db | Updates seedId to new GUID (triggers reseed). If lock-db present, adds comment explaining incompatibility and does nothing. |
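
A minimal guard for the lock-db check (sketch; how the workflow reads labels is left out):

```shell
# can_reseed "label1 label2 ..." — /reseed-db is refused while lock-db is present.
can_reseed() {
  case " $1 " in
    *" lock-db "*) return 1 ;;
    *) return 0 ;;
  esac
}
```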

2.6 Custom Seeding Job Template

The DBL CR supports two ways to define custom seeding jobs:

Option 1: Inline Job Spec

spec:
  seeding:
    type: custom
    jobTemplate:
      inline:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: mock-seeder
                  image: ghcr.io/camaradesuk/syrf-mock-seeder:latest
                  env:
                    - name: TARGET_DATABASE
                      value: syrf_pr_123

Option 2: ConfigMap Reference

spec:
  seeding:
    type: custom
    jobTemplate:
      configMapRef:
        name: mock-data-seeder-template
        key: job.yaml

The operator substitutes template variables:

  • {{ .Database }} → target database name
  • {{ .Namespace }} → PR namespace
  • {{ .SeedId }} → current seedId
  • {{ .SeedSha }} → current seedSha
  • {{ .ConnectionSecret }} → connection secret name
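
A sed-based rendering sketch (the substitution mechanism is an assumption; only the variable names come from the spec):

```shell
# render_job_template: read a Job template on stdin, write the rendered
# manifest on stdout. Expects DATABASE, NAMESPACE, SEED_ID, SEED_SHA and
# CONNECTION_SECRET to be set by the operator before invocation.
render_job_template() {
  sed -e "s|{{ \.Database }}|$DATABASE|g" \
      -e "s|{{ \.Namespace }}|$NAMESPACE|g" \
      -e "s|{{ \.SeedId }}|$SEED_ID|g" \
      -e "s|{{ \.SeedSha }}|$SEED_SHA|g" \
      -e "s|{{ \.ConnectionSecret }}|$CONNECTION_SECRET|g"
}
```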

2.7 Mock Data Seeding (Reusing DatabaseSeeder.cs)

Existing Implementation Analysis:

The PM service already has a complete mock data seeder at: src/libs/project-management/SyRF.ProjectManagement.Core/Seeding/DatabaseSeeder.cs

What it creates:

  • 5 projects at different workflow stages:
    • Quick Start Demo: 10 studies, ready for screening
    • Screening In Progress: 30 studies with dual screening decisions
    • Ready for Annotation: 20 studies with annotation questions
    • Complete Review: 15 fully annotated studies
    • Private Research: 8 studies, private project

  • 3 seed investigators (fake Auth0 IDs - cannot be logged into)
  • Sample studies loaded from embedded JSON resource
  • Annotation questions across multiple categories

Current Activation: SYRF_SEED_DATA_ENABLED=true environment variable

Proposed Activation for Jobs: SYRF_SEED_DATA_MODE=true (Similar to existing SYRF_INDEX_INIT_MODE=true)

Reuse Assessment:

| Aspect | Assessment |
|---|---|
| Data Quality | ✅ Good - Creates realistic projects at various stages |
| Idempotency | ✅ Built-in - Checks if seed bot exists before seeding |
| Error Handling | ✅ Has corruption detection and cleanup |
| Dependencies | ⚠️ Requires PM service DI container (MongoDB, config) |
| Resource Usage | ✅ Lightweight - No external calls, just MongoDB writes |

No Significant Downsides - The implementation is well-designed and battle-tested.

Mock Data Seeding Job Template:

spec:
  seeding:
    type: custom
    jobTemplate:
      inline:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: mock-seeder
                  image: "ghcr.io/camaradesuk/syrf-project-management:sha-{{ .SeedSha }}"
                  env:
                    - name: SYRF_SEED_DATA_MODE
                      value: "true"
                  envFrom:
                    - secretRef:
                        name: "{{ .ConnectionSecret }}"
              activeDeadlineSeconds: 600  # 10 minute timeout
          backoffLimit: 2

PM Service Changes Required:

Add to Program.cs (before DI container build, similar to index-init mode):

// Check for seed data mode (runs seeding then exits)
if (Environment.GetEnvironmentVariable("SYRF_SEED_DATA_MODE") == "true")
{
    // Build minimal DI container (data services only)
    var builder = WebApplication.CreateBuilder(args);
    builder.AddDataServicesOnly();  // MongoDB, config, no RabbitMQ
    var app = builder.Build();

    // Run database seeder
    var seeder = app.Services.GetRequiredService<DatabaseSeeder>();
    seeder.Execute();

    Console.WriteLine("Mock data seeding completed successfully.");
    return;  // Exit without starting web server
}

3. Execution Flow

3.1 Happy Path: PR with preview + use-snapshot Labels

1. Developer adds `preview` label to PR
2. CI Workflow triggers (labeled event)
3. Workflow generates:
   ├─ seedId: new GUID (uuidgen)
   ├─ seedSha: HEAD commit SHA
   └─ commits to cluster-gitops/syrf/environments/preview/pr-{n}/
4. ArgoCD detects change, syncs:
   ├─ Wave -10: ExternalSecret (Atlas API key)
   ├─ Wave 0: Namespace
   ├─ Wave 10: AtlasDatabaseUser (creates connection secret)
   └─ Wave 15: DatabaseLifecycle CR
5. DBL Operator reconciles CR:
   ├─ Check: Does db-ready ConfigMap exist with matching seedId?
   │         NO → Continue with seeding
   ├─ Update ConfigMap: status=pending
   ├─ Wait for watched deployments to have 0 ready replicas
   │   (Init containers block pods from becoming ready)
   ├─ Update ConfigMap: status=seeding
   ├─ Create Seeding Job (snapshot restore)
   │   Job copies collections from syrf_snapshot → syrf_pr_{n}
   ├─ Wait for Job completion (30 min timeout)
   │   SUCCESS → Continue
   │   FAILURE → Update ConfigMap: status=failed, add PR comment, STOP
   ├─ Update ConfigMap: status=post-seed
   ├─ Create Post-Seed Job (if spec.postSeedJob.enabled)
   │   Job runs index initialization
   ├─ Wait for Post-Seed Job completion (30 min timeout)
   │   SUCCESS → Continue
   │   FAILURE → Update ConfigMap: status=failed, add PR comment, STOP
   └─ Update ConfigMap: seedId={new}, seedSha={sha}, status=complete
6. Service init containers detect ConfigMap with matching seedId + status=complete
7. Service pods start with seeded database
8. ArgoCD PostSync hook runs (github-notifier-job):
   ├─ Waits for db-ready ConfigMap with matching seedId AND status=complete
   ├─ Waits for all service Deployments to be healthy:
   │   - syrf-api: Ready replicas == desired replicas
   │   - syrf-projectmanagement: Ready replicas == desired replicas
   │   - syrf-quartz: Ready replicas == desired replicas
   │   - syrf-web: Ready replicas == desired replicas
   ├─ Authenticates with GitHub via GitHub App credentials
   ├─ Updates GitHub Deployment status to "success"
   ├─ Creates commit status (context: "preview/deploy")
   └─ Posts PR comment with deployment URLs:
      - Web: https://pr-{n}.syrf.org.uk
      - API: https://api.pr-{n}.syrf.org.uk
      - PM: https://project-management.pr-{n}.syrf.org.uk

3.2 Subsequent Push (No Label Changes)

1. Developer pushes code to PR
2. CI Workflow triggers (synchronize event)
3. Workflow checks labels:
   ├─ `reset-db-on-sync` present?
   │   YES → Generate new seedId, update cluster-gitops
   │   NO  → Keep existing seedId, only update service image tags
4. ArgoCD syncs service deployments
5. DBL Operator reconciles (if seedId unchanged):
   ├─ Check: Does db-ready ConfigMap exist with matching seedId?
   │         YES → Skip seeding entirely
6. Service pods restart with new code, same database
7. ArgoCD PostSync hook runs (github-notifier-job):
   ├─ Waits for db-ready ConfigMap with matching seedId AND status=complete
   ├─ Waits for all service Deployments to be healthy (Ready == Desired)
   ├─ Updates GitHub Deployment status to "success"
   ├─ Creates commit status
   └─ Posts PR comment with deployment URLs

3.3 /reseed-db Command

1. User comments `/reseed-db` on PR
2. CI Workflow triggers (issue_comment event)
3. Workflow checks:
   ├─ Is `lock-db` label present?
   │   YES → Add comment: "Cannot reseed: lock-db label prevents database operations.
   │          Remove lock-db label first, then retry /reseed-db."
   │          STOP
   │   NO  → Continue
4. Workflow generates new seedId, updates cluster-gitops
5. ArgoCD syncs DatabaseLifecycle CR with new seedId
6. DBL Operator detects seedId mismatch → triggers seeding flow
7. Services restart after seeding completes

3.4 PR Close with lock-db Label

1. PR is merged/closed
2. CI Workflow triggers (closed event)
3. Workflow checks: Is `lock-db` label present?
   YES (lock-db present):
   │ ├─ Add PR comment: "Database syrf_pr_{n} has been preserved (lock-db).
   │ │   The Kubernetes namespace and resources have been removed.
   │ │   To clean up the orphaned database, contact a database administrator."
   │ │
   │ ├─ Update DatabaseLifecycle CR: cleanupOnDelete=false
   │ │
   │ ├─ Remove namespace (ArgoCD cascade delete)
   │ │   - Services removed
   │ │   - DatabaseLifecycle CR removed (but DB preserved due to cleanupOnDelete=false)
   │ │
   │ └─ Database remains in MongoDB Atlas (orphaned)
   NO (no lock-db):
   │ ├─ DatabaseLifecycle CR has cleanupOnDelete=true
   │ │
   │ ├─ Remove namespace (ArgoCD cascade delete)
   │ │   - DBL Operator finalizer runs
   │ │   - Operator drops database
   │ │   - All resources removed
   │ │
   │ └─ Database cleaned up

3.5 Label Conflict: lock-db + reset-db-on-sync

1. User adds `reset-db-on-sync` label while `lock-db` is present
2. CI Workflow triggers (labeled event)
3. Workflow detects conflict:
   ├─ Both `lock-db` AND `reset-db-on-sync` present
4. Workflow resolves conflict:
   ├─ Remove `reset-db-on-sync` label from PR (via GitHub API)
   └─ Add PR comment:
      "⚠️ Label conflict: `reset-db-on-sync` is incompatible with `lock-db`.

      - `lock-db` prevents all database modifications
      - `reset-db-on-sync` requests database reset on every push

      The `reset-db-on-sync` label has been automatically removed.
      To enable reset-on-sync, first remove the `lock-db` label."

3.6 Sequence Diagram: Seeding Flow

┌──────┐     ┌──────────┐     ┌───────────┐     ┌─────────┐     ┌─────────┐
│ArgoCD│     │DBL Op    │     │Seeding Job│     │Post-Seed│     │Services │
└──┬───┘     └────┬─────┘     └─────┬─────┘     └────┬────┘     └────┬────┘
   │              │                 │                │               │
   │ Sync CR      │                 │                │               │
   │─────────────>│                 │                │               │
   │              │                 │                │               │
   │              │ Check seedId    │                │               │
   │              │ vs ConfigMap    │                │               │
   │              │                 │                │               │
   │              │ Mismatch!       │                │               │
   │              │ Set status=     │                │               │
   │              │ pending         │                │               │
   │              │                 │                │               │
   │              │ Wait for 0      │                │               │
   │              │ ready replicas  │                │               │
   │              │<────────────────│────────────────│───────────────│
   │              │                 │                │               │
   │              │ Create Job      │                │               │
   │              │────────────────>│                │               │
   │              │                 │                │               │
   │              │ Set status=     │ Copy           │               │
   │              │ seeding         │ collections    │               │
   │              │                 │ ──────────>    │               │
   │              │                 │ MongoDB        │               │
   │              │                 │                │               │
   │              │ Job complete    │                │               │
   │              │<────────────────│                │               │
   │              │                 │                │               │
   │              │ Create Job      │                │               │
   │              │────────────────────────────────->│               │
   │              │                 │                │               │
   │              │ Set status=     │                │ Create        │
   │              │ post-seed       │                │ indexes       │
   │              │                 │                │ ──────────>   │
   │              │                 │                │ MongoDB       │
   │              │                 │                │               │
   │              │ Job complete    │                │               │
   │              │<────────────────────────────────│               │
   │              │                 │                │               │
   │              │ Set status=     │                │               │
   │              │ complete        │                │               │
   │              │ Update seedId   │                │               │
   │              │                 │                │               │
   │              │                 │                │               │
   │              │                 │                │  Poll ConfigMap
   │              │                 │                │<──────────────│
   │              │                 │                │               │
   │              │                 │                │  seedId match!│
   │              │                 │                │  Start pod    │
   │              │                 │                │──────────────>│
   │              │                 │                │               │

4. Edge Cases & Mitigations

| # | Edge Case / Failure Mode | Impact | Mitigation Strategy |
|---|---|---|---|
| 1 | Seeding Job fails (MongoDB timeout, quota exceeded) | High - PR environment unusable | Fail status in ConfigMap, PR comment with error details, user retries with /reseed-db |
| 2 | Post-Seed Job fails (index creation OOM) | High - Services may start with missing indexes | Fail status in ConfigMap, PR comment, services blocked until manual intervention |
| 3 | New push during active seeding | Medium - Race condition, stale data | Cancel current seeding Job, start new with latest seedId |
| 4 | lock-db + reset-db-on-sync added together | Low - Conflicting intent | Auto-remove reset-db-on-sync, add explanatory PR comment |
| 5 | lock-db present but /reseed-db command issued | Low - User confusion | Add PR comment explaining lock-db prevents reseed, do nothing |
| 6 | MongoDB Atlas user not ready when seeding starts | High - Connection failures | Sync wave ordering (user wave 10, DBL wave 15), retry loop with backoff |
| 7 | Connection secret missing | High - Job fails immediately | Operator validates secret exists before creating Job |
| 8 | PR closed while seeding in progress | Medium - Orphaned resources | Finalizer waits for Job completion or timeout before cleanup |
| 9 | Multiple PRs seeding simultaneously (resource contention) | Low - Slower seeding | Jobs run in separate namespaces, MongoDB handles concurrency |
| 10 | ConfigMap deleted manually | Medium - Coordination broken | Operator recreates ConfigMap on next reconciliation |
| 11 | Service pods stuck in init (ConfigMap never updated) | High - Deployment hangs | Startup probe timeout (15 min), operator monitors for stuck states |
| 12 | Mock data seeding Job template invalid | High - Seeding never starts | Validate Job spec on CR creation, reject invalid templates |
| 13 | preview label removed while lock-db present | Medium - Ambiguous intent | Keep CR active, DB preserved. Remove namespace only on PR close. |
| 14 | Orphaned database after lock-db PR close | Low - Resource leak | Add PR comment with database name, require manual admin cleanup |
| 15 | Seeding Job takes longer than timeout | Medium - Incomplete data | Increase timeout in CR spec, or fail and require /reseed-db |

4.1 Detailed Mitigation: Concurrent Seed Requests

When a new push arrives while seeding is in progress:

  1. CI Workflow generates new seedId (if reset-db-on-sync or /reseed-db)
  2. ArgoCD updates DatabaseLifecycle CR with new seedId
  3. Operator detects CR update during reconciliation
  4. Operator checks: Is there an active seeding Job?
  5. YES → Delete current Job (kubectl delete job)
  6. Wait for Job termination
  7. Operator starts new seeding with latest seedId

Rationale: Latest commit should always win in CI. Completing an old seed while new code waits is wasteful.
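
The staleness decision in steps 4-5 could be isolated as a small helper (sketch; the Job name `db-seed` and the seedId label on the Job are assumptions):

```shell
# should_cancel ACTIVE_JOB_SEED_ID NEW_SEED_ID
# An in-flight Job is stale when it exists (non-empty id) and was created
# for a different seedId than the CR now carries.
should_cancel() {
  [ -n "$1" ] && [ "$1" != "$2" ]
}

# Operator-side usage sketch:
#   job_seed=$(kubectl get job db-seed -n "$NS" -o jsonpath='{.metadata.labels.seedId}' 2>/dev/null || true)
#   if should_cancel "$job_seed" "$NEW_SEED_ID"; then
#     kubectl delete job db-seed -n "$NS" --wait=true
#   fi
```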

4.2 Detailed Mitigation: Failure Notification

When seeding or post-seed Job fails:

  1. Operator detects Job failure (status.failed > 0)
  2. Operator updates ConfigMap:
    status: failed
    errorMessage: "Seeding Job failed: container 'seeder' exited with code 1"
    
  3. Operator creates GitHub PR comment via GitHub API:
    ## ❌ Database Seeding Failed
    
    **Namespace:** pr-123
    **Database:** syrf_pr_123
    **seedId:** a1b2c3d4-...
    **seedSha:** abc123def
    
    **Error:**
    
    Seeding Job failed: container 'seeder' exited with code 1
    **Job Logs:**
    
    [last 50 lines of Job logs]
    **To retry:** Comment `/reseed-db` on this PR.
    
  4. Services remain blocked (init containers waiting for status: complete)
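
Assembling the comment body might look like the following sketch (posting would go through the GitHub API or `gh pr comment`; the helper name and argument order are illustrative):

```shell
# build_failure_comment NAMESPACE DATABASE SEED_ID SEED_SHA ERROR
# Emits the Markdown body for the seeding-failure notification.
build_failure_comment() {
  cat <<EOF
## ❌ Database Seeding Failed

**Namespace:** $1
**Database:** $2
**seedId:** $3
**seedSha:** $4

**Error:**

$5

**To retry:** Comment \`/reseed-db\` on this PR.
EOF
}
```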

5. Testing Strategy

5.1 Unit Tests

  • CI Workflow: Label conflict detection (lock-db + reset-db-on-sync)
  • CI Workflow: seedId generation (valid GUID format)
  • CI Workflow: seedSha extraction (valid commit SHA)
  • Helm template: DatabaseLifecycle CR generation with all label combinations
  • Helm template: ConfigMap RBAC for init containers

5.2 Integration Tests

  • DBL Operator: Reconciles CR and creates Seeding Job
  • DBL Operator: Updates ConfigMap on Job completion
  • DBL Operator: Handles Job failure correctly
  • DBL Operator: Cancels in-progress Job on new seedId
  • Init Container: Waits for ConfigMap with matching seedId
  • Init Container: Proceeds when status=complete
  • ArgoCD: Sync wave ordering (secrets → user → DBL)

5.3 End-to-End Tests

  • Full PR preview lifecycle: create → push → reseed → close
  • lock-db + PR close: Database preserved, namespace removed
  • reset-db-on-sync: New seedId on every push
  • /reseed-db command: Triggers reseed
  • Label conflict: reset-db-on-sync auto-removed when lock-db present

5.4 Manual Verification Steps

# 1. Create PR with preview label
gh pr create --title "Test PR" --body "Testing preview"
gh pr edit 123 --add-label preview

# 2. Wait for deployment, verify ConfigMap
kubectl get configmap db-ready -n pr-123 -o yaml

# 3. Verify seedId in ConfigMap matches CR
kubectl get dbl pr-database -n pr-123 -o jsonpath='{.spec.seedId}'

# 4. Test /reseed-db command
gh pr comment 123 --body "/reseed-db"
# Wait and verify new seedId

# 5. Test lock-db + reset-db-on-sync conflict
gh pr edit 123 --add-label lock-db
gh pr edit 123 --add-label reset-db-on-sync
# Verify reset-db-on-sync removed, comment added

# 6. Test PR close with lock-db
gh pr close 123
# Verify namespace deleted, database preserved
kubectl get ns pr-123  # Should not exist
# Check MongoDB Atlas for syrf_pr_123 database

6. Implementation Checklist

Phase 1: Core Infrastructure

  • Update DatabaseLifecycle CRD with new fields (seedId, seedSha, lockDatabase, postSeedJob)
  • Update DBL Operator reconciliation logic:
    • Compare spec.seedId with ConfigMap seedId
    • Create Seeding Job instead of inline seeding
    • Handle Job lifecycle (create, monitor, cleanup)
    • Create Post-Seed Job when enabled
    • Update ConfigMap with status transitions
  • Update db-ready ConfigMap structure (seedId, seedSha, status, errorMessage)
  • Update service init container to check status=complete

Phase 2: CI Workflow Updates

  • Add seedId generation (new GUID via uuidgen)
  • Add seedSha tracking (HEAD commit SHA)
  • Implement label conflict detection (lock-db + reset-db-on-sync)
  • Implement /reseed-db command with lock-db check
  • Implement reset-db-on-sync behavior (new seedId on each push)
  • Update cluster-gitops values generation
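
A workflow-step sketch for the first three items (uuidgen and HEAD SHA usage per the spec; the conflict helper is illustrative):

```shell
# generate_seed_values: new seedId (GUID) plus the audit seedSha (HEAD commit).
generate_seed_values() {
  SEED_ID=$(uuidgen)
  SEED_SHA=$(git rev-parse HEAD)
  export SEED_ID SEED_SHA
}

# labels_conflict "label1 label2 ..." — true when lock-db and reset-db-on-sync
# are both present; the workflow then removes reset-db-on-sync and comments.
labels_conflict() {
  case " $1 " in
    *" lock-db "*)
      case " $1 " in
        *" reset-db-on-sync "*) return 0 ;;
      esac
      ;;
  esac
  return 1
}
```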

Phase 3: Helm Chart Updates

  • Update preview-infrastructure chart:
    • Pass seedId and seedSha to DatabaseLifecycle CR
    • Add lockDatabase field based on lock-db label
    • Configure cleanupOnDelete based on lock-db label
  • Update syrf-common library:
    • Init container checks status=complete (not just seedId match)

Phase 4: Mock Data Seeding (via PM seed-data mode)

  • Add SYRF_SEED_DATA_MODE environment variable handling to PM Program.cs:
    • Check before DI container build (similar to SYRF_INDEX_INIT_MODE)
    • Build minimal DI container (data services only, no RabbitMQ/ROB)
    • Execute DatabaseSeeder.Execute()
    • Exit after completion
  • Ensure DatabaseSeeder is registered in minimal DI container
  • Add integration test for seed-data mode
  • Create default mock data seeding Job template in preview-infrastructure chart
  • Document mock data seeding in docs/how-to/use-pr-preview-environments.md

Phase 5: Documentation & Cleanup

User Documentation:

- [ ] Update docs/how-to/use-pr-preview-environments.md:
  - New label names and behaviors (lock-db, reset-db-on-sync)
  - Updated /reseed-db command behavior with lock-db check
  - Explanation of seedId vs seedSha tracking
  - Mock data seeding vs snapshot seeding decision guide
  - Troubleshooting: What to do when seeding fails

Migration Documentation:

- [ ] Create docs/how-to/migrate-pr-preview-labels.md:
  - persist-db → lock-db label rename
  - Behavioral differences (lock-db now keeps CR active)
  - One-time migration steps for existing PRs

Architecture Documentation:

- [ ] Update docs/architecture/pr-preview-environments.md (or create it if it does not exist):
  - DBL Operator Job-based seeding architecture
  - ConfigMap coordination pattern with seedId/seedSha/status
  - Sequence diagrams for all flows
  - Component responsibility matrix

Reference Updates:

- [ ] Update CLAUDE.md:
  - New label semantics
  - seedId/seedSha tracking explanation
  - Updated workflow triggers
- [ ] Update src/charts/preview-infrastructure/README.md:
  - New Helm values for seedId/seedSha
  - Custom seeding job template examples

Code Cleanup: - [ ] Remove references to seedVersion (replaced by seedId) - [ ] Remove deprecated persist-db label handling (replaced by lock-db) - [ ] Update github-notifier-job.yaml to use seedId


7. CI Workflow Architecture

This section documents the complete CI workflow architecture for PR preview environments.

7.0 Responsibility Boundaries (CRITICAL)

The CI workflow (pr-preview.yml) ONLY commits to cluster-gitops. It does NOT:

  • Communicate with ArgoCD directly (no ArgoCD API calls)
  • Create or modify ConfigMaps on the cluster directly
  • Execute kubectl commands to modify cluster state

Cluster operations are handled by ArgoCD and operators:

  • ArgoCD syncs cluster-gitops changes to create Kubernetes resources
  • DBL Operator creates/updates the db-ready ConfigMap
  • github-notifier-job (ArgoCD PostSync hook) updates GitHub Deployment status
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         Responsibility Boundary                                  │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│   GitHub Actions (CI)                     │    Kubernetes Cluster               │
│   ───────────────────                     │    ──────────────────               │
│                                           │                                     │
│   ✅ Build Docker images                  │    ✅ ArgoCD syncs from gitops     │
│   ✅ Push to GHCR                         │    ✅ DBL Operator seeds database   │
│   ✅ Calculate versions (GitVersion)      │    ✅ DBL Operator creates ConfigMap│
│   ✅ Commit to cluster-gitops             │    ✅ PostSync hook notifies GitHub │
│   ✅ Create GitHub Deployment (pending)   │    ✅ Service init containers wait  │
│                                           │                                     │
│   ❌ NO kubectl apply                     │                                     │
│   ❌ NO ArgoCD API calls                  │                                     │
│   ❌ NO direct ConfigMap creation         │                                     │
│                                           │                                     │
└─────────────────────────────────────────────────────────────────────────────────┘

7.1 Full CI Pipeline Stages

The pr-preview.yml workflow executes the following stages:

┌──────────────────────────────────────────────────────────────────────────────────┐
│                           PR Preview CI Pipeline                                  │
├──────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│  1. check-label                                                                  │
│     ├─ Verify PR has 'preview' label                                            │
│     ├─ Handle label-specific events (persist-db, use-snapshot)                  │
│     ├─ Extract preview config from PR description (feature flags)               │
│     └─ Output: should_build, pr_number, head_sha                                │
│                                                                                  │
│  2. create-deployment                                                            │
│     ├─ Create GitHub Deployment for preview environment                         │
│     └─ Set initial status to "pending"                                          │
│                                                                                  │
│  3. detect-changes (tag-based)                                                   │
│     ├─ Compare HEAD SHA against last service tag                                │
│     ├─ For each service: determine action (build | use-existing | retag)        │
│     ├─ Build matrix for changed services                                        │
│     └─ Output: *_changed, *_action, *_last_version, preview_services_matrix     │
│                                                                                  │
│  4. version-* jobs (parallel, per-service)                                       │
│     ├─ Uses reusable workflow: _gitversion.yml                                  │
│     ├─ Calculates semantic version from git history                             │
│     └─ Output: version, semver, fullsemver, informationalVersion                │
│                                                                                  │
│  5. build-web-artifacts (if web changed)                                         │
│     ├─ npm ci && ng build --configuration development                           │
│     ├─ Upload dist artifact for Docker build                                    │
│     └─ Sentry sourcemaps upload                                                 │
│                                                                                  │
│  6. build-and-push-images (matrix, parallel)                                     │
│     ├─ Uses reusable workflow: _docker-build.yml                                │
│     ├─ Build Docker image with version tag                                      │
│     ├─ Push to ghcr.io/camaradesuk/{service}:{version}                         │
│     └─ Tag with sha-{shortsha} for ArgoCD deployment                           │
│                                                                                  │
│  7. retag-unchanged                                                              │
│     ├─ For unchanged services: crane copy :latest → :sha-{shortsha}            │
│     └─ Ensures all services have sha-{shortsha} tag for this commit            │
│                                                                                  │
│  8. update-pr-status                                                             │
│     ├─ Update PR description with deployment status                             │
│     └─ Write GitHub Actions job summary                                         │
│                                                                                  │
│  9. write-versions (commits to cluster-gitops)                                   │
│     ├─ Checkout cluster-gitops repository                                       │
│     ├─ Check labels (persist-db, use-snapshot)                                  │
│     ├─ Determine database reset trigger                                         │
│     ├─ Write pr.yaml (PR metadata, seedVersion, deploymentNotification)         │
│     ├─ Write infrastructure.values.yaml (MongoDB, DatabaseLifecycle config)    │
│     ├─ Write services/*.values.yaml (image tags, GitVersion values)            │
│     └─ Git commit and push                                                      │
│                                                                                  │
└──────────────────────────────────────────────────────────────────────────────────┘

7.2 Workflow Triggers

The workflow responds to:

  • PR comment containing /reseed-db
  • Label changes (preview, lock-db, use-snapshot, reset-db-on-sync)
  • Push events to a PR branch
  • PR closed events
  • workflow_dispatch (manual trigger with PR number)
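
Expressed as a workflow-trigger sketch (an assumption about how pr-preview.yml is wired - in practice the /reseed-db comment filter and label checks run as job-level conditions, since GitHub Actions `on:` cannot filter on comment content):

```yaml
on:
  issue_comment:
    types: [created]        # job condition checks the comment body for '/reseed-db'
  pull_request:
    types: [opened, closed, labeled, unlabeled, synchronize]
  workflow_dispatch:
    inputs:
      pr_number:
        description: "PR number to (re)deploy"
        required: true
```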

7.3 Testing and Code Quality (Future Enhancement)

Note: The current pr-preview.yml workflow does NOT include:

  • Unit test execution
  • Integration test execution
  • SonarQube code coverage analysis
  • SonarQube static analysis

These are handled by separate CI workflows or are planned enhancements. The DBL redesign focuses on database lifecycle management - testing integration should be addressed separately.

7.4 Decision Tree (Database/Label Logic)

trigger: PR comment | label change | push | pr closed

1. Is comment JUST added that includes '/reseed-db'?
   YES →
     Is 'lock-db' label set?
       YES → Comment "Cannot reseed: lock-db prevents database operations"
       NO  → Set seedId to new GUID, set locked=false, comment "DB is being reseeded"
   NO → Continue to step 2

2. Are 'lock-db' AND 'reset-db-on-sync' BOTH set?
   YES → Comment about conflict, remove 'reset-db-on-sync' label, set locked=true
   NO  → Continue to step 3

3. Was 'use-snapshot' label JUST set/unset AND 'lock-db' ALREADY set?
   YES → Undo the set/unset of 'use-snapshot', comment explaining conflict
   NO  → Continue to step 4

4. Is (preview JUST set OR push with preview set OR PR JUST opened) AND PR is open?
   YES →
     Is label combo valid?
       NO  → Comment about invalid combo, STOP
       YES →
         Does seedId already exist in cluster-gitops for this PR?
           YES →
             Is 'reset-db-on-sync' set?
               YES → Set seedId to new GUID
               NO  →
                 Is 'lock-db' set?
                   YES → Use existing seedId, set locked=true
                   NO  → Use existing seedId, set locked=false
           NO → Set seedId to new GUID
   NO → Continue to step 5

5. Was 'preview' JUST unset OR PR JUST closed?
   YES →
     Is 'lock-db' set?
       YES → Set locked=true in values, sync DBL, then delete PR folder (DB preserved)
       NO  → Delete PR folder (DB will be dropped)
   NO → Continue to step 6

6. Was 'lock-db' label JUST changed?
   JUST SET   → Update values to set locked=true
   JUST UNSET → Update values to set locked=false
   NO CHANGE  → Stop processing (nothing to do)

Note: The original logic had a bug at step 6 - it didn't distinguish between lock-db being set vs unset. This corrected version handles both cases.
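
The seedId branch of step 4 can be sketched as a small shell helper (illustrative only - the real logic is implemented inline in pr-preview.yml; step 2 guarantees lock-db and reset-db-on-sync never coexist here):

```shell
# Decide which seedId and locked flag to write to cluster-gitops.
# $1 = existing seedId ("" if none), $2 = "true" if reset-db-on-sync is set,
# $3 = "true" if lock-db is set. Prints "<seedId> <locked>".
decide_seed() {
  existing="$1"; reset="$2"; lock="$3"
  if [ -z "$existing" ] || [ "$reset" = "true" ]; then
    # Fresh GUID triggers a reseed; reset-db-on-sync implies unlocked (step 2)
    new_id=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)
    printf '%s %s\n' "$new_id" "false"
  elif [ "$lock" = "true" ]; then
    printf '%s %s\n' "$existing" "true"     # keep existing seed, lock database
  else
    printf '%s %s\n' "$existing" "false"    # keep existing seed, unlocked
  fi
}
```

Because the operator only reseeds when seedId changes, reusing the existing GUID on an ordinary push is what makes the database survive across deployments.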

7.5 Label Detection Timing

RESOLVED: Labels are detected at workflow start time, not when the workflow is queued.

This is correct behavior because:

  1. The "interruptible" queue pattern cancels stale workflows
  2. Detecting labels at run time ensures the workflow acts on current state
  3. If label changes occur while a workflow is queued, the workflow sees the new state

7.6 Invocation Queue Management

The workflow implements a selective "interruptible" queue pattern to optimize CI efficiency:

Interruptible Events (Expensive Operations):

  • Git push events
  • PR opened/closed events
  • preview label set/unset
  • /reseed-db command (triggers database seeding - stale if a new push arrives)

These events trigger expensive operations: build container images, push to registries, seed database, deploy. If a new event arrives, the old work is stale anyway - cancel and start fresh.

Non-Interruptible Events (Cheap Operations):

  • lock-db label changes
  • reset-db-on-sync label changes
  • use-snapshot label changes

These events only update cluster-gitops values - no builds required. They're fast and should complete before processing the next event.

Analysis of Implementation Options:

Option 1: Separate concurrency groups (NOT RECOMMENDED)

The idea: use conditional concurrency groups based on event type:

```yaml
concurrency:
  group: pr-${{ pr_number }}-${{ is_interruptible && 'build' || 'config' }}
  cancel-in-progress: ${{ is_interruptible }}
```

Problem: Race conditions. If both groups have active workflows, they both try to commit to cluster-gitops simultaneously, causing git conflicts.

Option 2: Single group, always cancel (RECOMMENDED)

```yaml
concurrency:
  group: pr-${{ pr_number }}
  cancel-in-progress: true
```

Why this works:

  • All events eventually commit to cluster-gitops (single writer)
  • Config-only changes are fast - unlikely to be interrupted
  • If cancelled, the subsequent push includes current label state anyway
  • User can re-apply label if needed (rare edge case)
  • Simple, predictable behavior

Option 3: Separate workflow files (COMPLEX)

Split into pr-preview-build.yml and pr-preview-config.yml. Still has race condition issues and adds coordination complexity.

Recommendation: Use Option 2 (single group, always cancel). Simpler, safer, and the rare case of a config change being cancelled by a push is acceptable since the push will include current labels.

7.7 Invalid Label Combinations

The decision tree checks for "valid label combo" but doesn't specify invalid combinations. Define:

| Invalid Combination | Reason | Resolution |
|---------------------|--------|------------|
| lock-db + reset-db-on-sync | Conflicting: lock prevents reseeds, reset requests them | Auto-remove reset-db-on-sync |

Currently, no other combinations are invalid. use-snapshot + lock-db is allowed (but changing use-snapshot while lock-db is set is blocked, since it would have no effect on a locked database).
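
A sketch of the validity check the workflow would run before writing values (function name and flag convention are assumptions, not the actual implementation):

```shell
# Returns 0 if the label combination is valid; reports the conflict otherwise.
# $1 = "true" if lock-db is set, $2 = "true" if reset-db-on-sync is set.
validate_labels() {
  lock="$1"; reset="$2"
  if [ "$lock" = "true" ] && [ "$reset" = "true" ]; then
    echo "conflict: lock-db prevents reseeds, reset-db-on-sync requests them" >&2
    return 1
  fi
  return 0
}
```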


8. Open Questions

Resolved Questions

  1. Mock Data Extraction: RESOLVED - Reuse existing DatabaseSeeder.cs via SYRF_SEED_DATA_MODE=true environment variable. No extraction needed - run PM image in seed-data mode (similar to index-init mode).

  2. GitHub App for PR Comments: RESOLVED - Reuse existing GitHub App (github-app-credentials secret). The github-notifier-job.yaml already uses this for deployment notifications.

  3. Seeding Job Resource Limits: RESOLVED - Use normal amounts, can be tweaked later based on observed performance.

  4. Orphaned Database Cleanup: RESOLVED - No automated cleanup. Just leave the orphaned database and add a comment on the PR explaining the database name and that manual cleanup is required.

  5. Label Detection Timing: RESOLVED - Labels are detected at workflow start time (not when queued). This is correct behavior - the interruptible queue pattern ensures stale builds are cancelled, and detecting labels at run time ensures the workflow acts on current state.

  6. Interruptible Event Scope: RESOLVED - Selective interruptibility is correct:
     • Interruptible events (push, preview, PR open/close): trigger expensive builds - cancel-in-progress avoids wasted CI
     • Non-interruptible events (lock-db, reset-db-on-sync, use-snapshot): only update values - queue and complete (fast, cheap)

  7. Notification Content: RESOLVED - Include both seedId and seedSha in deployment notification comments for full traceability.

  8. Failure Comments: RESOLVED - DBL Operator posts failure comments directly using the same GitHub App credentials as github-notifier-job. No need for CI workflow involvement.

Remaining Questions

All questions resolved.


9. Migration Mapping: Current → New

This section shows how existing components map to the new architecture.

9.1 Terminology Changes

| Current Term | New Term | Notes |
|--------------|----------|-------|
| seedVersion | seedId | GUID that triggers reseed detection |
| (none) | seedSha | Commit SHA for audit trail |
| persist-db label | lock-db label | Enhanced: now keeps CR active when preview removed |
| (none) | reset-db-on-sync label | New: reseed on every push |
| (none) | status field | New: pending/seeding/post-seed/complete/failed |

9.2 Component Changes

| Component | Current Location | Change Required |
|-----------|------------------|-----------------|
| DBL CRD | cluster-gitops/charts/database-lifecycle-operator/crds/ | Add seedId, seedSha, lockDatabase, postSeedJob fields |
| DBL Operator Hook | cluster-gitops/charts/database-lifecycle-operator/templates/configmap-hooks.yaml | Refactor to create Jobs instead of inline seeding |
| db-ready ConfigMap | Created by operator | Add seedId, seedSha, status, errorMessage, lastUpdated |
| Init Container | src/charts/syrf-common/templates/_deployment-dotnet.tpl | Check status=complete AND seedId match |
| GitHub Notifier | src/charts/preview-infrastructure/templates/github-notifier-job.yaml | Replace seedVersion with seedId |
| PR Preview Workflow | .github/workflows/pr-preview.yml | Add seedId/seedSha generation, label conflict handling |
| PM Service | src/services/project-management/ | Add SYRF_SEED_DATA_MODE handling in Program.cs |

9.3 Helm Values Changes

preview-infrastructure chart values:

```yaml
# Current
seedVersion: "abc123def"  # Commit SHA

# New
seedId: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"  # GUID
seedSha: "abc123def"  # Commit SHA (audit only)
lockDatabase: false  # From lock-db label
```
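
For illustration, these values could be templated into the DatabaseLifecycle CR roughly as follows (the apiVersion/group, resource name, and exact spec field spellings are assumptions based on section 9.2):

```yaml
# Illustrative only - the real CRD group/version is defined in
# cluster-gitops/charts/database-lifecycle-operator/crds/.
apiVersion: syrf.io/v1alpha1
kind: DatabaseLifecycle
metadata:
  name: pr-database
spec:
  seedId: {{ .Values.seedId | quote }}
  seedSha: {{ .Values.seedSha | quote }}
  lockDatabase: {{ .Values.lockDatabase }}
```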

9.4 CI Workflow Changes

Current workflow generates:

```yaml
seedVersion: ${{ github.sha }}
```

New workflow generates:

```yaml
seedId: $(uuidgen)  # New GUID on initial deploy, /reseed-db, or reset-db-on-sync
seedSha: ${{ github.sha }}  # Always current commit
lockDatabase: ${{ contains(github.event.pull_request.labels.*.name, 'lock-db') }}
```

9.5 Files to Modify

| File | Changes |
|------|---------|
| cluster-gitops/charts/database-lifecycle-operator/crds/databaselifecycle.yaml | Add new CRD fields |
| cluster-gitops/charts/database-lifecycle-operator/templates/configmap-hooks.yaml | Job-based seeding logic |
| src/charts/preview-infrastructure/templates/database-lifecycle.yaml | Pass new values |
| src/charts/preview-infrastructure/templates/github-notifier-job.yaml | Use seedId |
| src/charts/syrf-common/templates/_deployment-dotnet.tpl | Check status + seedId |
| .github/workflows/pr-preview.yml | seedId generation, label handling |
| src/services/project-management/SyRF.ProjectManagement.Endpoint/Program.cs | Add seed-data mode |

10. References

Internal Documentation

External Resources


Document End

This document must be reviewed and approved before implementation begins.