DBL Operator Redesign - Implementation Specification

Status: Ready for Approval
Target Environment: GKE Kubernetes Cluster / ArgoCD GitOps


Executive Summary

This specification defines the redesigned Database Lifecycle (DBL) operator architecture for PR preview environments in the SyRF monorepo. The redesign addresses several limitations of the current implementation:

  1. Seeding as Jobs: Move database seeding from the operator pod to separate Kubernetes Jobs for better resource isolation, logging, and failure handling
  2. Custom Seeding Templates: Support both snapshot restore AND mock data seeding via configurable Job templates
  3. Enhanced Tracking: Split tracking into seedId (trigger) and seedSha (audit) for clearer coordination
  4. New Label Semantics: Add reset-db-on-sync label and clarify lock-db behavior (replaces persist-db)
  5. Post-Seed Job Support: Optional post-seeding jobs for migrations, indexing, or custom operations

The specification describes the desired end-state architecture. Implementation should adapt the current shell-operator based system to match this specification.


Table of Contents

  1. High-Level Architecture
  2. Detailed Design
  3. Execution Flow
  4. Edge Cases & Mitigations
  5. Testing Strategy
  6. Implementation Checklist
  7. CI Workflow Architecture
  8. Open Questions
  9. Migration Mapping: Current → New
  10. References

1. High-Level Architecture

1.1 Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                         PR Preview Environment Flow                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  GitHub PR                    CI Workflow                  cluster-gitops   │
│  ┌─────────┐                 ┌──────────┐                 ┌─────────────┐   │
│  │ Labels: │  push/label     │ Detect   │   commit        │ pr-{n}/     │   │
│  │ preview │ ───────────────>│ changes  │ ───────────────>│ pr.yaml     │   │
│  │ lock-db │                 │ Generate │                 │ values.yaml │   │
│  │ use-snap│                 │ seedId   │                 └──────┬──────┘   │
│  │ reset.. │                 └──────────┘                        │          │
│  └─────────┘                                                     │          │
│       │                                                          ▼          │
│       │ /reseed-db                                    ┌─────────────────┐   │
│       │ comment                                       │ ArgoCD Sync     │   │
│       └──────────────────────────────────────────────>│ ApplicationSet  │   │
│                                                       └────────┬────────┘   │
│                                                                │            │
│                            Kubernetes Cluster                  │            │
│  ┌─────────────────────────────────────────────────────────────┼──────────┐ │
│  │                                                             ▼          │ │
│  │  ┌──────────────────┐    ┌──────────────────────────────────────────┐ │ │
│  │  │ DBL Operator     │    │ PR Namespace (pr-{n})                    │ │ │
│  │  │ ┌──────────────┐ │    │                                          │ │ │
│  │  │ │Reconciliation│ │    │  ┌─────────────┐   ┌─────────────────┐   │ │ │
│  │  │ │    Loop      │─┼───>│  │ Seeding Job │──>│ db-ready        │   │ │ │
│  │  │ └──────────────┘ │    │  │ (snapshot/  │   │ ConfigMap       │   │ │ │
│  │  │                  │    │  │  mock data) │   │ - seedId        │   │ │ │
│  │  │ Watches:         │    │  └─────────────┘   │ - seedSha       │   │ │ │
│  │  │ DatabaseLifecycle│    │         │          │ - status        │   │ │ │
│  │  │ CRs              │    │         ▼          └────────┬────────┘   │ │ │
│  │  └──────────────────┘    │  ┌─────────────┐           │            │ │ │
│  │                          │  │ Post-Seed   │           │            │ │ │
│  │                          │  │ Job (opt.)  │           │            │ │ │
│  │                          │  │ (indexes/   │           │            │ │ │
│  │                          │  │  migrations)│           │            │ │ │
│  │                          │  └─────────────┘           │            │ │ │
│  │                          │         │                  │            │ │ │
│  │                          │         ▼                  │            │ │ │
│  │                          │  ┌─────────────────────────┴──────────┐ │ │ │
│  │                          │  │ Service Pods (api, pm, quartz)     │ │ │ │
│  │                          │  │ ┌────────────────┐                 │ │ │ │
│  │                          │  │ │ Init Container │ waits for       │ │ │ │
│  │                          │  │ │ (wait-for-db)  │ ConfigMap match │ │ │ │
│  │                          │  │ └────────────────┘                 │ │ │ │
│  │                          │  └────────────────────────────────────┘ │ │ │
│  │                          └──────────────────────────────────────────┘ │ │
│  └──────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘

1.2 Key Components

| Component | Location | Purpose |
|---|---|---|
| DBL Operator | syrf-system namespace | Watches DatabaseLifecycle CRs, creates/manages seeding Jobs |
| DatabaseLifecycle CR | PR namespace | Declares desired database state, seeding config, job templates |
| Seeding Job | PR namespace | Kubernetes Job that performs snapshot restore or mock data seeding |
| Post-Seed Job | PR namespace | Optional Job for migrations, indexing, or custom operations |
| db-ready ConfigMap | PR namespace | Coordination mechanism - services wait for matching seedId |
| CI Workflow | GitHub Actions | Generates seedId/seedSha, updates cluster-gitops values |

1.3 Dependencies

Internal:

  • ArgoCD ApplicationSet (syrf-previews.yaml)
  • preview-infrastructure Helm chart
  • syrf-common Helm library chart (init container templates)
  • MongoDB Atlas Operator (user provisioning)

External:

  • MongoDB Atlas (database hosting)
  • GitHub Actions (CI/CD)
  • cluster-gitops repository (GitOps values)

1.4 Integration Points

  1. GitHub → CI Workflow: PR events (push, label, comment) trigger workflow
  2. CI Workflow → cluster-gitops: Workflow commits seedId, seedSha, label-derived values
  3. ArgoCD → Kubernetes: ApplicationSet generates Applications from cluster-gitops
  4. DBL Operator → Seeding Job: Operator creates Job when CR seedId doesn't match ConfigMap
  5. ConfigMap → Service Init Containers: Services poll ConfigMap until seedId matches
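
The polling in point 5 can be sketched as a wait-for-db init container script. This is a hypothetical sketch: the kubectl-based polling and the `EXPECTED_SEED_ID` environment variable are assumptions about how the syrf-common template might inject values, not the actual implementation.

```shell
#!/bin/sh
# check_db_ready SEED_ID STATUS EXPECTED_SEED_ID
# True only when the ConfigMap reports the expected seedId AND seeding finished.
check_db_ready() {
  [ "$1" = "$3" ] && [ "$2" = "complete" ]
}

# wait_for_db: poll the db-ready ConfigMap until it matches, blocking pod startup.
wait_for_db() {
  while true; do
    seed_id=$(kubectl get configmap db-ready -o jsonpath='{.data.seedId}' 2>/dev/null || true)
    status=$(kubectl get configmap db-ready -o jsonpath='{.data.status}' 2>/dev/null || true)
    if check_db_ready "$seed_id" "$status" "$EXPECTED_SEED_ID"; then
      echo "db-ready matched seedId=$seed_id; releasing pod"
      return 0
    fi
    echo "waiting for db-ready: seedId=$seed_id status=$status"
    sleep 10
  done
}
```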

2. Detailed Design

2.1 DatabaseLifecycle Custom Resource Definition

apiVersion: database.syrf.org.uk/v1alpha1
kind: DatabaseLifecycle
metadata:
  name: pr-database
  namespace: pr-123
  annotations:
    argocd.argoproj.io/sync-wave: "15"
spec:
  # Target database configuration
  database: syrf_pr_123

  # Connection details
  connection:
    secretRef:
      name: syrfdb-cluster0-syrf-pr-123-app
      connectionStringKey: connectionStringStandardSrv

  # Seed tracking (updated by CI workflow)
  seedId: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"  # GUID - triggers reseed when changed
  seedSha: "abc123def456"                          # Commit SHA - audit/traceability only

  # Seeding configuration
  seeding:
    enabled: true

    # Option 1: Built-in snapshot restore
    type: snapshot  # or "custom"
    sourceDatabase: syrf_snapshot
    collections:
      - pmProject
      - pmStudy
      - pmInvestigator
      - pmSystematicSearch
      - pmDataExportJob
      - pmStudyCorrection
      - pmInvestigatorUsage
      - pmRiskOfBiasAiJob
      - pmProjectDailyStat
      - pmPotential
      - pmInvestigatorEmail

    # Option 2: Custom seeding job (for mock data or other sources)
    # type: custom
    # jobTemplate:
    #   inline: { ... }  # Inline Job spec
    #   # OR
    #   configMapRef:
    #     name: mock-data-seeder-template
    #     key: job.yaml

    # Timeouts
    timeout: 1800  # 30 minutes for seeding

  # Post-seed job (optional - NO DEFAULT)
  postSeedJob:
    enabled: true
    jobTemplate:
      inline:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: index-init
                  image: "ghcr.io/camaradesuk/syrf-project-management:sha-abc123"
                  env:
                    - name: SYRF_INDEX_INIT_MODE
                      value: "true"
                  envFrom:
                    - secretRef:
                        name: syrfdb-cluster0-syrf-pr-123-app
      # OR
      # configMapRef:
      #   name: index-init-template
      #   key: job.yaml
    timeout: 1800  # 30 minutes for post-seed

  # Existing database policy
  existingDatabasePolicy: drop  # drop | skip | fail

  # Cleanup configuration
  cleanupOnDelete: true  # Drop database when CR is deleted
  lockDatabase: false    # When true, prevents drop even on CR delete

  # Watched deployments (wait for 0 ready replicas before seeding)
  watchedDeployments:
    - name: syrf-api
    - name: syrf-projectmanagement
    - name: syrf-quartz

2.2 Seed Tracking Fields

| Field | Type | Purpose | When Updated |
|---|---|---|---|
| seedId | GUID | Triggers reseed detection | On PR push (if reset-db-on-sync), on /reseed-db command, on initial deployment |
| seedSha | String | Audit trail - which commit triggered seeding | Always updated with seedId |

Key Behavior:

  • Operator compares spec.seedId with the db-ready ConfigMap's seedId
  • If different → seeding needed
  • If same → skip seeding (already done)
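
In shell-operator terms, this comparison could look like the following sketch (the kubectl lookups and resource names are assumptions about the hook implementation):

```shell
# needs_seeding CR_SEED_ID CM_SEED_ID
# Seeding is required when the ConfigMap is absent (empty value) or the IDs differ.
needs_seeding() {
  [ -z "$2" ] || [ "$1" != "$2" ]
}

# Hook sketch: fetch both sides for a PR namespace and decide.
reconcile_seed_check() {
  ns="$1"
  cr_seed_id=$(kubectl get dbl pr-database -n "$ns" -o jsonpath='{.spec.seedId}')
  cm_seed_id=$(kubectl get configmap db-ready -n "$ns" -o jsonpath='{.data.seedId}' 2>/dev/null || true)
  if needs_seeding "$cr_seed_id" "$cm_seed_id"; then
    echo "seed"
  else
    echo "skip"
  fi
}
```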

2.3 db-ready ConfigMap Structure

apiVersion: v1
kind: ConfigMap
metadata:
  name: db-ready
  namespace: pr-123
data:
  seedId: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
  seedSha: "abc123def456"
  status: "complete"  # pending | seeding | post-seed | complete | failed
  lastUpdated: "2026-01-20T10:30:00Z"
  errorMessage: ""    # Populated if status=failed

Status Transitions:

pending → seeding → post-seed → complete
                ↘         ↘
                 failed    failed
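
The transitions above can be encoded as a guard. This is a sketch: enforcing the state machine in the operator is an assumption, the spec only documents the diagram (which has no seeding → complete shortcut for the postSeedJob-disabled case).

```shell
# valid_transition FROM TO — allow only the documented status transitions.
valid_transition() {
  case "$1>$2" in
    "pending>seeding") return 0 ;;
    "seeding>post-seed"|"seeding>failed") return 0 ;;
    "post-seed>complete"|"post-seed>failed") return 0 ;;
    *) return 1 ;;
  esac
}
```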

2.4 Label Semantics

| Label | Effect | Details |
|---|---|---|
| preview | Master switch | Required for all DBL functionality. Without it, DBL CR is not created. |
| use-snapshot | Snapshot seeding | Uses syrf_snapshot as source. Without it, uses custom seeding job template (mock data) or empty DB. |
| lock-db | Prevents DB operations | Prevents drop/reseed operations. Overrides preview removal - keeps CR active. On PR close, namespace is removed but DB is orphaned. |
| reset-db-on-sync | Reseed on every push | Updates seedId to new GUID on each PR push. Incompatible with lock-db - if both present, reset-db-on-sync is removed and comment added to PR. |

Label Precedence:

lock-db > reset-db-on-sync > use-snapshot > (default empty)
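
The precedence rule can be expressed as a small resolver (a sketch; the mode names returned here are illustrative, not defined by the spec):

```shell
# resolve_db_mode "label1 label2 ..." — highest-precedence label wins:
# lock-db > reset-db-on-sync > use-snapshot > default (empty DB).
resolve_db_mode() {
  case " $1 " in
    *" lock-db "*)          echo "locked" ;;
    *" reset-db-on-sync "*) echo "reset-on-sync" ;;
    *" use-snapshot "*)     echo "snapshot" ;;
    *)                      echo "empty" ;;
  esac
}
```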

2.5 PR Commands

| Command | Effect |
|---|---|
| /reseed-db | Updates seedId to new GUID (triggers reseed). If lock-db present, adds comment explaining incompatibility and does nothing. |
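
A minimal guard for the lock-db check (sketch; how the workflow reads labels is left out):

```shell
# can_reseed "label1 label2 ..." — /reseed-db is refused while lock-db is present.
can_reseed() {
  case " $1 " in
    *" lock-db "*) return 1 ;;
    *) return 0 ;;
  esac
}
```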

2.6 Custom Seeding Job Template

The DBL CR supports two ways to define custom seeding jobs:

Option 1: Inline Job Spec

spec:
  seeding:
    type: custom
    jobTemplate:
      inline:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: mock-seeder
                  image: ghcr.io/camaradesuk/syrf-mock-seeder:latest
                  env:
                    - name: TARGET_DATABASE
                      value: syrf_pr_123

Option 2: ConfigMap Reference

spec:
  seeding:
    type: custom
    jobTemplate:
      configMapRef:
        name: mock-data-seeder-template
        key: job.yaml

The operator substitutes template variables:

  • {{ .Database }} → target database name
  • {{ .Namespace }} → PR namespace
  • {{ .SeedId }} → current seedId
  • {{ .SeedSha }} → current seedSha
  • {{ .ConnectionSecret }} → connection secret name
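
A sed-based rendering sketch (the substitution mechanism is an assumption; only the variable names come from the spec):

```shell
# render_job_template: read a Job template on stdin, write the rendered
# manifest on stdout. Expects DATABASE, NAMESPACE, SEED_ID, SEED_SHA and
# CONNECTION_SECRET to be set by the operator before invocation.
render_job_template() {
  sed -e "s|{{ \.Database }}|$DATABASE|g" \
      -e "s|{{ \.Namespace }}|$NAMESPACE|g" \
      -e "s|{{ \.SeedId }}|$SEED_ID|g" \
      -e "s|{{ \.SeedSha }}|$SEED_SHA|g" \
      -e "s|{{ \.ConnectionSecret }}|$CONNECTION_SECRET|g"
}
```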

2.7 Mock Data Seeding (Reusing DatabaseSeeder.cs)

Existing Implementation Analysis:

The PM service already has a complete mock data seeder at: src/libs/project-management/SyRF.ProjectManagement.Core/Seeding/DatabaseSeeder.cs

What it creates:

  • 5 projects at different workflow stages:
    • Quick Start Demo: 10 studies, ready for screening
    • Screening In Progress: 30 studies with dual screening decisions
    • Ready for Annotation: 20 studies with annotation questions
    • Complete Review: 15 fully annotated studies
    • Private Research: 8 studies, private project

  • 3 seed investigators (fake Auth0 IDs - cannot be logged into)
  • Sample studies loaded from embedded JSON resource
  • Annotation questions across multiple categories

Current Activation: SYRF_SEED_DATA_ENABLED=true environment variable

Proposed Activation for Jobs: SYRF_SEED_DATA_MODE=true (Similar to existing SYRF_INDEX_INIT_MODE=true)

Reuse Assessment:

| Aspect | Assessment |
|---|---|
| Data Quality | ✅ Good - Creates realistic projects at various stages |
| Idempotency | ✅ Built-in - Checks if seed bot exists before seeding |
| Error Handling | ✅ Has corruption detection and cleanup |
| Dependencies | ⚠️ Requires PM service DI container (MongoDB, config) |
| Resource Usage | ✅ Lightweight - No external calls, just MongoDB writes |

No Significant Downsides - The implementation is well-designed and battle-tested.

Mock Data Seeding Job Template:

spec:
  seeding:
    type: custom
    jobTemplate:
      inline:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: mock-seeder
                  image: "ghcr.io/camaradesuk/syrf-project-management:sha-{{ .SeedSha }}"
                  env:
                    - name: SYRF_SEED_DATA_MODE
                      value: "true"
                  envFrom:
                    - secretRef:
                        name: "{{ .ConnectionSecret }}"
              activeDeadlineSeconds: 600  # 10 minute timeout
          backoffLimit: 2

PM Service Changes Required:

Add to Program.cs (before DI container build, similar to index-init mode):

// Check for seed data mode (runs seeding then exits)
if (Environment.GetEnvironmentVariable("SYRF_SEED_DATA_MODE") == "true")
{
    // Build minimal DI container (data services only)
    var builder = WebApplication.CreateBuilder(args);
    builder.AddDataServicesOnly();  // MongoDB, config, no RabbitMQ
    var app = builder.Build();

    // Run database seeder
    var seeder = app.Services.GetRequiredService<DatabaseSeeder>();
    seeder.Execute();

    Console.WriteLine("Mock data seeding completed successfully.");
    return;  // Exit without starting web server
}

3. Execution Flow

3.1 Happy Path: PR with preview + use-snapshot Labels

1. Developer adds `preview` label to PR
2. CI Workflow triggers (labeled event)
3. Workflow generates:
   ├─ seedId: new GUID (uuidgen)
   ├─ seedSha: HEAD commit SHA
   └─ commits to cluster-gitops/syrf/environments/preview/pr-{n}/
4. ArgoCD detects change, syncs:
   ├─ Wave -10: ExternalSecret (Atlas API key)
   ├─ Wave 0: Namespace
   ├─ Wave 10: AtlasDatabaseUser (creates connection secret)
   └─ Wave 15: DatabaseLifecycle CR
5. DBL Operator reconciles CR:
   ├─ Check: Does db-ready ConfigMap exist with matching seedId?
   │         NO → Continue with seeding
   ├─ Update ConfigMap: status=pending
   ├─ Wait for watched deployments to have 0 ready replicas
   │   (Init containers block pods from becoming ready)
   ├─ Update ConfigMap: status=seeding
   ├─ Create Seeding Job (snapshot restore)
   │   Job copies collections from syrf_snapshot → syrf_pr_{n}
   ├─ Wait for Job completion (30 min timeout)
   │   SUCCESS → Continue
   │   FAILURE → Update ConfigMap: status=failed, add PR comment, STOP
   ├─ Update ConfigMap: status=post-seed
   ├─ Create Post-Seed Job (if spec.postSeedJob.enabled)
   │   Job runs index initialization
   ├─ Wait for Post-Seed Job completion (30 min timeout)
   │   SUCCESS → Continue
   │   FAILURE → Update ConfigMap: status=failed, add PR comment, STOP
   └─ Update ConfigMap: seedId={new}, seedSha={sha}, status=complete
6. Service init containers detect ConfigMap with matching seedId + status=complete
7. Service pods start with seeded database
8. ArgoCD PostSync hook runs (github-notifier-job):
   ├─ Waits for db-ready ConfigMap with matching seedId AND status=complete
   ├─ Waits for all service Deployments to be healthy:
   │   - syrf-api: Ready replicas == desired replicas
   │   - syrf-projectmanagement: Ready replicas == desired replicas
   │   - syrf-quartz: Ready replicas == desired replicas
   │   - syrf-web: Ready replicas == desired replicas
   ├─ Authenticates with GitHub via GitHub App credentials
   ├─ Updates GitHub Deployment status to "success"
   ├─ Creates commit status (context: "preview/deploy")
   └─ Posts PR comment with deployment URLs:
      - Web: https://pr-{n}.syrf.org.uk
      - API: https://api.pr-{n}.syrf.org.uk
      - PM: https://project-management.pr-{n}.syrf.org.uk

3.2 Subsequent Push (No Label Changes)

1. Developer pushes code to PR
2. CI Workflow triggers (synchronize event)
3. Workflow checks labels:
   ├─ `reset-db-on-sync` present?
   │   YES → Generate new seedId, update cluster-gitops
   │   NO  → Keep existing seedId, only update service image tags
4. ArgoCD syncs service deployments
5. DBL Operator reconciles (if seedId unchanged):
   ├─ Check: Does db-ready ConfigMap exist with matching seedId?
   │         YES → Skip seeding entirely
6. Service pods restart with new code, same database
7. ArgoCD PostSync hook runs (github-notifier-job):
   ├─ Waits for db-ready ConfigMap with matching seedId AND status=complete
   ├─ Waits for all service Deployments to be healthy (Ready == Desired)
   ├─ Updates GitHub Deployment status to "success"
   ├─ Creates commit status
   └─ Posts PR comment with deployment URLs

3.3 /reseed-db Command

1. User comments `/reseed-db` on PR
2. CI Workflow triggers (issue_comment event)
3. Workflow checks:
   ├─ Is `lock-db` label present?
   │   YES → Add comment: "Cannot reseed: lock-db label prevents database operations.
   │          Remove lock-db label first, then retry /reseed-db."
   │          STOP
   │   NO  → Continue
4. Workflow generates new seedId, updates cluster-gitops
5. ArgoCD syncs DatabaseLifecycle CR with new seedId
6. DBL Operator detects seedId mismatch → triggers seeding flow
7. Services restart after seeding completes

3.4 PR Close with lock-db Label

1. PR is merged/closed
2. CI Workflow triggers (closed event)
3. Workflow checks: Is `lock-db` label present?
   YES (lock-db present):
   │ ├─ Add PR comment: "Database syrf_pr_{n} has been preserved (lock-db).
   │ │   The Kubernetes namespace and resources have been removed.
   │ │   To clean up the orphaned database, contact a database administrator."
   │ │
   │ ├─ Update DatabaseLifecycle CR: cleanupOnDelete=false
   │ │
   │ ├─ Remove namespace (ArgoCD cascade delete)
   │ │   - Services removed
   │ │   - DatabaseLifecycle CR removed (but DB preserved due to cleanupOnDelete=false)
   │ │
   │ └─ Database remains in MongoDB Atlas (orphaned)
   NO (no lock-db):
   │ ├─ DatabaseLifecycle CR has cleanupOnDelete=true
   │ │
   │ ├─ Remove namespace (ArgoCD cascade delete)
   │ │   - DBL Operator finalizer runs
   │ │   - Operator drops database
   │ │   - All resources removed
   │ │
   │ └─ Database cleaned up

3.5 Label Conflict: lock-db + reset-db-on-sync

1. User adds `reset-db-on-sync` label while `lock-db` is present
2. CI Workflow triggers (labeled event)
3. Workflow detects conflict:
   ├─ Both `lock-db` AND `reset-db-on-sync` present
4. Workflow resolves conflict:
   ├─ Remove `reset-db-on-sync` label from PR (via GitHub API)
   └─ Add PR comment:
      "⚠️ Label conflict: `reset-db-on-sync` is incompatible with `lock-db`.

      - `lock-db` prevents all database modifications
      - `reset-db-on-sync` requests database reset on every push

      The `reset-db-on-sync` label has been automatically removed.
      To enable reset-on-sync, first remove the `lock-db` label."

3.6 Sequence Diagram: Seeding Flow

┌──────┐     ┌──────────┐     ┌───────────┐     ┌─────────┐     ┌─────────┐
│ArgoCD│     │DBL Op    │     │Seeding Job│     │Post-Seed│     │Services │
└──┬───┘     └────┬─────┘     └─────┬─────┘     └────┬────┘     └────┬────┘
   │              │                 │                │               │
   │ Sync CR      │                 │                │               │
   │─────────────>│                 │                │               │
   │              │                 │                │               │
   │              │ Check seedId    │                │               │
   │              │ vs ConfigMap    │                │               │
   │              │                 │                │               │
   │              │ Mismatch!       │                │               │
   │              │ Set status=     │                │               │
   │              │ pending         │                │               │
   │              │                 │                │               │
   │              │ Wait for 0      │                │               │
   │              │ ready replicas  │                │               │
   │              │<────────────────│────────────────│───────────────│
   │              │                 │                │               │
   │              │ Create Job      │                │               │
   │              │────────────────>│                │               │
   │              │                 │                │               │
   │              │ Set status=     │ Copy           │               │
   │              │ seeding         │ collections    │               │
   │              │                 │ ──────────>    │               │
   │              │                 │ MongoDB        │               │
   │              │                 │                │               │
   │              │ Job complete    │                │               │
   │              │<────────────────│                │               │
   │              │                 │                │               │
   │              │ Create Job      │                │               │
   │              │────────────────────────────────->│               │
   │              │                 │                │               │
   │              │ Set status=     │                │ Create        │
   │              │ post-seed       │                │ indexes       │
   │              │                 │                │ ──────────>   │
   │              │                 │                │ MongoDB       │
   │              │                 │                │               │
   │              │ Job complete    │                │               │
   │              │<────────────────────────────────│               │
   │              │                 │                │               │
   │              │ Set status=     │                │               │
   │              │ complete        │                │               │
   │              │ Update seedId   │                │               │
   │              │                 │                │               │
   │              │                 │                │               │
   │              │                 │                │  Poll ConfigMap
   │              │                 │                │<──────────────│
   │              │                 │                │               │
   │              │                 │                │  seedId match!│
   │              │                 │                │  Start pod    │
   │              │                 │                │──────────────>│
   │              │                 │                │               │

4. Edge Cases & Mitigations

| # | Edge Case / Failure Mode | Impact | Mitigation Strategy |
|---|---|---|---|
| 1 | Seeding Job fails (MongoDB timeout, quota exceeded) | High - PR environment unusable | Fail status in ConfigMap, PR comment with error details, user retries with /reseed-db |
| 2 | Post-Seed Job fails (index creation OOM) | High - Services may start with missing indexes | Fail status in ConfigMap, PR comment, services blocked until manual intervention |
| 3 | New push during active seeding | Medium - Race condition, stale data | Cancel current seeding Job, start new with latest seedId |
| 4 | lock-db + reset-db-on-sync added together | Low - Conflicting intent | Auto-remove reset-db-on-sync, add explanatory PR comment |
| 5 | lock-db present but /reseed-db command issued | Low - User confusion | Add PR comment explaining lock-db prevents reseed, do nothing |
| 6 | MongoDB Atlas user not ready when seeding starts | High - Connection failures | Sync wave ordering (user wave 10, DBL wave 15), retry loop with backoff |
| 7 | Connection secret missing | High - Job fails immediately | Operator validates secret exists before creating Job |
| 8 | PR closed while seeding in progress | Medium - Orphaned resources | Finalizer waits for Job completion or timeout before cleanup |
| 9 | Multiple PRs seeding simultaneously (resource contention) | Low - Slower seeding | Jobs run in separate namespaces, MongoDB handles concurrency |
| 10 | ConfigMap deleted manually | Medium - Coordination broken | Operator recreates ConfigMap on next reconciliation |
| 11 | Service pods stuck in init (ConfigMap never updated) | High - Deployment hangs | Startup probe timeout (15 min), operator monitors for stuck states |
| 12 | Mock data seeding Job template invalid | High - Seeding never starts | Validate Job spec on CR creation, reject invalid templates |
| 13 | preview label removed while lock-db present | Medium - Ambiguous intent | Keep CR active, DB preserved. Remove namespace only on PR close. |
| 14 | Orphaned database after lock-db PR close | Low - Resource leak | Add PR comment with database name, require manual admin cleanup |
| 15 | Seeding Job takes longer than timeout | Medium - Incomplete data | Increase timeout in CR spec, or fail and require /reseed-db |

4.1 Detailed Mitigation: Concurrent Seed Requests

When a new push arrives while seeding is in progress:

  1. CI Workflow generates new seedId (if reset-db-on-sync or /reseed-db)
  2. ArgoCD updates DatabaseLifecycle CR with new seedId
  3. Operator detects CR update during reconciliation
  4. Operator checks: Is there an active seeding Job?
  5. YES → Delete current Job (kubectl delete job)
  6. Wait for Job termination
  7. Operator starts new seeding with latest seedId

Rationale: Latest commit should always win in CI. Completing an old seed while new code waits is wasteful.
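
The staleness decision in steps 4-5 could be isolated as a small helper (sketch; the Job name `db-seed` and the seedId label on the Job are assumptions):

```shell
# should_cancel ACTIVE_JOB_SEED_ID NEW_SEED_ID
# An in-flight Job is stale when it exists (non-empty id) and was created
# for a different seedId than the CR now carries.
should_cancel() {
  [ -n "$1" ] && [ "$1" != "$2" ]
}

# Operator-side usage sketch:
#   job_seed=$(kubectl get job db-seed -n "$NS" -o jsonpath='{.metadata.labels.seedId}' 2>/dev/null || true)
#   if should_cancel "$job_seed" "$NEW_SEED_ID"; then
#     kubectl delete job db-seed -n "$NS" --wait=true
#   fi
```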

4.2 Detailed Mitigation: Failure Notification

When seeding or post-seed Job fails:

  1. Operator detects Job failure (status.failed > 0)
  2. Operator updates ConfigMap:
    status: failed
    errorMessage: "Seeding Job failed: container 'seeder' exited with code 1"
    
  3. Operator creates GitHub PR comment via GitHub API:
    ## ❌ Database Seeding Failed
    
    **Namespace:** pr-123
    **Database:** syrf_pr_123
    **seedId:** a1b2c3d4-...
    **seedSha:** abc123def
    
    **Error:**
    
    Seeding Job failed: container 'seeder' exited with code 1
    **Job Logs:**
    
    [last 50 lines of Job logs]
    **To retry:** Comment `/reseed-db` on this PR.
    
  4. Services remain blocked (init containers waiting for status: complete)
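
Assembling the comment body might look like the following sketch (posting would go through the GitHub API or `gh pr comment`; the helper name and argument order are illustrative):

```shell
# build_failure_comment NAMESPACE DATABASE SEED_ID SEED_SHA ERROR
# Emits the Markdown body for the seeding-failure notification.
build_failure_comment() {
  cat <<EOF
## ❌ Database Seeding Failed

**Namespace:** $1
**Database:** $2
**seedId:** $3
**seedSha:** $4

**Error:**

$5

**To retry:** Comment \`/reseed-db\` on this PR.
EOF
}
```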

5. Testing Strategy

5.1 Unit Tests

  • CI Workflow: Label conflict detection (lock-db + reset-db-on-sync)
  • CI Workflow: seedId generation (valid GUID format)
  • CI Workflow: seedSha extraction (valid commit SHA)
  • Helm template: DatabaseLifecycle CR generation with all label combinations
  • Helm template: ConfigMap RBAC for init containers

5.2 Integration Tests

  • DBL Operator: Reconciles CR and creates Seeding Job
  • DBL Operator: Updates ConfigMap on Job completion
  • DBL Operator: Handles Job failure correctly
  • DBL Operator: Cancels in-progress Job on new seedId
  • Init Container: Waits for ConfigMap with matching seedId
  • Init Container: Proceeds when status=complete
  • ArgoCD: Sync wave ordering (secrets → user → DBL)

5.3 End-to-End Tests

  • Full PR preview lifecycle: create → push → reseed → close
  • lock-db + PR close: Database preserved, namespace removed
  • reset-db-on-sync: New seedId on every push
  • /reseed-db command: Triggers reseed
  • Label conflict: reset-db-on-sync auto-removed when lock-db present

5.4 Manual Verification Steps

# 1. Create PR with preview label
gh pr create --title "Test PR" --body "Testing preview"
gh pr edit 123 --add-label preview

# 2. Wait for deployment, verify ConfigMap
kubectl get configmap db-ready -n pr-123 -o yaml

# 3. Verify seedId in ConfigMap matches CR
kubectl get dbl pr-database -n pr-123 -o jsonpath='{.spec.seedId}'

# 4. Test /reseed-db command
gh pr comment 123 --body "/reseed-db"
# Wait and verify new seedId

# 5. Test lock-db + reset-db-on-sync conflict
gh pr edit 123 --add-label lock-db
gh pr edit 123 --add-label reset-db-on-sync
# Verify reset-db-on-sync removed, comment added

# 6. Test PR close with lock-db
gh pr close 123
# Verify namespace deleted, database preserved
kubectl get ns pr-123  # Should not exist
# Check MongoDB Atlas for syrf_pr_123 database

6. Implementation Checklist

Phase 1: Core Infrastructure

  • Update DatabaseLifecycle CRD with new fields (seedId, seedSha, lockDatabase, postSeedJob)
  • Update DBL Operator reconciliation logic:
    • Compare spec.seedId with ConfigMap seedId
    • Create Seeding Job instead of inline seeding
    • Handle Job lifecycle (create, monitor, cleanup)
    • Create Post-Seed Job when enabled
    • Update ConfigMap with status transitions
  • Update db-ready ConfigMap structure (seedId, seedSha, status, errorMessage)
  • Update service init container to check status=complete

Phase 2: CI Workflow Updates

  • Add seedId generation (new GUID via uuidgen)
  • Add seedSha tracking (HEAD commit SHA)
  • Implement label conflict detection (lock-db + reset-db-on-sync)
  • Implement /reseed-db command with lock-db check
  • Implement reset-db-on-sync behavior (new seedId on each push)
  • Update cluster-gitops values generation
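
A workflow-step sketch for the first three items (uuidgen and HEAD SHA usage per the spec; the conflict helper is illustrative):

```shell
# generate_seed_values: new seedId (GUID) plus the audit seedSha (HEAD commit).
generate_seed_values() {
  SEED_ID=$(uuidgen)
  SEED_SHA=$(git rev-parse HEAD)
  export SEED_ID SEED_SHA
}

# labels_conflict "label1 label2 ..." — true when lock-db and reset-db-on-sync
# are both present; the workflow then removes reset-db-on-sync and comments.
labels_conflict() {
  case " $1 " in
    *" lock-db "*)
      case " $1 " in
        *" reset-db-on-sync "*) return 0 ;;
      esac
      ;;
  esac
  return 1
}
```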

Phase 3: Helm Chart Updates

  • Update preview-infrastructure chart:
    • Pass seedId and seedSha to DatabaseLifecycle CR
    • Add lockDatabase field based on lock-db label
    • Configure cleanupOnDelete based on lock-db label
  • Update syrf-common library:
    • Init container checks status=complete (not just seedId match)

Phase 4: Mock Data Seeding (via PM seed-data mode)

  • Add SYRF_SEED_DATA_MODE environment variable handling to PM Program.cs:
    • Check before DI container build (similar to SYRF_INDEX_INIT_MODE)
    • Build minimal DI container (data services only, no RabbitMQ/ROB)
    • Execute DatabaseSeeder.Execute()
    • Exit after completion
  • Ensure DatabaseSeeder is registered in minimal DI container
  • Add integration test for seed-data mode
  • Create default mock data seeding Job template in preview-infrastructure chart
  • Document mock data seeding in docs/how-to/use-pr-preview-environments.md

Phase 5: Documentation & Cleanup

User Documentation:

- [ ] Update docs/how-to/use-pr-preview-environments.md:
  - New label names and behaviors (lock-db, reset-db-on-sync)
  - Updated /reseed-db command behavior with lock-db check
  - Explanation of seedId vs seedSha tracking
  - Mock data seeding vs snapshot seeding decision guide
  - Troubleshooting: What to do when seeding fails

Migration Documentation:

- [ ] Create docs/how-to/migrate-pr-preview-labels.md:
  - persist-db → lock-db label rename
  - Behavioral differences (lock-db now keeps CR active)
  - One-time migration steps for existing PRs

Architecture Documentation:

- [ ] Update docs/architecture/pr-preview-environments.md (or create it if it does not exist):
  - DBL Operator Job-based seeding architecture
  - ConfigMap coordination pattern with seedId/seedSha/status
  - Sequence diagrams for all flows
  - Component responsibility matrix

Reference Updates:

- [ ] Update CLAUDE.md:
  - New label semantics
  - seedId/seedSha tracking explanation
  - Updated workflow triggers
- [ ] Update src/charts/preview-infrastructure/README.md:
  - New Helm values for seedId/seedSha
  - Custom seeding job template examples

Code Cleanup: - [ ] Remove references to seedVersion (replaced by seedId) - [ ] Remove deprecated persist-db label handling (replaced by lock-db) - [ ] Update github-notifier-job.yaml to use seedId


7. CI Workflow Architecture

This section documents the complete CI workflow architecture for PR preview environments.

7.0 Responsibility Boundaries (CRITICAL)

The CI workflow (pr-preview.yml) ONLY commits to cluster-gitops. It does NOT:

  • Communicate with ArgoCD directly (no ArgoCD API calls)
  • Create or modify ConfigMaps on the cluster directly
  • Execute kubectl commands to modify cluster state

Cluster operations are handled by ArgoCD and operators:

  • ArgoCD syncs cluster-gitops changes to create Kubernetes resources
  • DBL Operator creates/updates the db-ready ConfigMap
  • github-notifier-job (ArgoCD PostSync hook) updates GitHub Deployment status
┌─────────────────────────────────────────────────────────────────────────────────┐
│                         Responsibility Boundary                                  │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│   GitHub Actions (CI)                     │    Kubernetes Cluster               │
│   ───────────────────                     │    ──────────────────               │
│                                           │                                     │
│   ✅ Build Docker images                  │    ✅ ArgoCD syncs from gitops     │
│   ✅ Push to GHCR                         │    ✅ DBL Operator seeds database   │
│   ✅ Calculate versions (GitVersion)      │    ✅ DBL Operator creates ConfigMap│
│   ✅ Commit to cluster-gitops             │    ✅ PostSync hook notifies GitHub │
│   ✅ Create GitHub Deployment (pending)   │    ✅ Service init containers wait  │
│                                           │                                     │
│   ❌ NO kubectl apply                     │                                     │
│   ❌ NO ArgoCD API calls                  │                                     │
│   ❌ NO direct ConfigMap creation         │                                     │
│                                           │                                     │
└─────────────────────────────────────────────────────────────────────────────────┘

7.1 Full CI Pipeline Stages

The pr-preview.yml workflow executes the following stages:

┌──────────────────────────────────────────────────────────────────────────────────┐
│                           PR Preview CI Pipeline                                  │
├──────────────────────────────────────────────────────────────────────────────────┤
│                                                                                  │
│  1. check-label                                                                  │
│     ├─ Verify PR has 'preview' label                                            │
│     ├─ Handle label-specific events (persist-db, use-snapshot)                  │
│     ├─ Extract preview config from PR description (feature flags)               │
│     └─ Output: should_build, pr_number, head_sha                                │
│                                                                                  │
│  2. create-deployment                                                            │
│     ├─ Create GitHub Deployment for preview environment                         │
│     └─ Set initial status to "pending"                                          │
│                                                                                  │
│  3. detect-changes (tag-based)                                                   │
│     ├─ Compare HEAD SHA against last service tag                                │
│     ├─ For each service: determine action (build | use-existing | retag)        │
│     ├─ Build matrix for changed services                                        │
│     └─ Output: *_changed, *_action, *_last_version, preview_services_matrix     │
│                                                                                  │
│  4. version-* jobs (parallel, per-service)                                       │
│     ├─ Uses reusable workflow: _gitversion.yml                                  │
│     ├─ Calculates semantic version from git history                             │
│     └─ Output: version, semver, fullsemver, informationalVersion                │
│                                                                                  │
│  5. build-web-artifacts (if web changed)                                         │
│     ├─ npm ci && ng build --configuration development                           │
│     ├─ Upload dist artifact for Docker build                                    │
│     └─ Sentry sourcemaps upload                                                 │
│                                                                                  │
│  6. build-and-push-images (matrix, parallel)                                     │
│     ├─ Uses reusable workflow: _docker-build.yml                                │
│     ├─ Build Docker image with version tag                                      │
│     ├─ Push to ghcr.io/camaradesuk/{service}:{version}                         │
│     └─ Tag with sha-{shortsha} for ArgoCD deployment                           │
│                                                                                  │
│  7. retag-unchanged                                                              │
│     ├─ For unchanged services: crane copy :latest → :sha-{shortsha}            │
│     └─ Ensures all services have sha-{shortsha} tag for this commit            │
│                                                                                  │
│  8. update-pr-status                                                             │
│     ├─ Update PR description with deployment status                             │
│     └─ Write GitHub Actions job summary                                         │
│                                                                                  │
│  9. write-versions (commits to cluster-gitops)                                   │
│     ├─ Checkout cluster-gitops repository                                       │
│     ├─ Check labels (persist-db, use-snapshot)                                  │
│     ├─ Determine database reset trigger                                         │
│     ├─ Write pr.yaml (PR metadata, seedVersion, deploymentNotification)         │
│     ├─ Write infrastructure.values.yaml (MongoDB, DatabaseLifecycle config)    │
│     ├─ Write services/*.values.yaml (image tags, GitVersion values)            │
│     └─ Git commit and push                                                      │
│                                                                                  │
└──────────────────────────────────────────────────────────────────────────────────┘

7.2 Workflow Triggers

The workflow responds to:

  • PR comment containing /reseed-db
  • Label changes (preview, lock-db, use-snapshot, reset-db-on-sync)
  • Push events to a PR branch
  • PR closed events
  • workflow_dispatch (manual trigger with PR number)
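
Expressed as a workflow-trigger sketch (an assumption about how pr-preview.yml is wired - in practice the /reseed-db comment filter and label checks run as job-level conditions, since GitHub Actions `on:` cannot filter on comment content):

```yaml
on:
  issue_comment:
    types: [created]        # job condition checks the comment body for '/reseed-db'
  pull_request:
    types: [opened, closed, labeled, unlabeled, synchronize]
  workflow_dispatch:
    inputs:
      pr_number:
        description: "PR number to (re)deploy"
        required: true
```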

7.3 Testing and Code Quality (Future Enhancement)

Note: The current pr-preview.yml workflow does NOT include:

  • Unit test execution
  • Integration test execution
  • SonarQube code coverage analysis
  • SonarQube static analysis

These are handled by separate CI workflows or are planned enhancements. The DBL redesign focuses on database lifecycle management - testing integration should be addressed separately.

7.4 Decision Tree (Database/Label Logic)

trigger: PR comment | label change | push | pr closed

1. Is comment JUST added that includes '/reseed-db'?
   YES →
     Is 'lock-db' label set?
       YES → Comment "Cannot reseed: lock-db prevents database operations"
       NO  → Set seedId to new GUID, set locked=false, comment "DB is being reseeded"
   NO → Continue to step 2

2. Are 'lock-db' AND 'reset-db-on-sync' BOTH set?
   YES → Comment about conflict, remove 'reset-db-on-sync' label, set locked=true
   NO  → Continue to step 3

3. Was 'use-snapshot' label JUST set/unset AND 'lock-db' ALREADY set?
   YES → Undo the set/unset of 'use-snapshot', comment explaining conflict
   NO  → Continue to step 4

4. Is (preview JUST set OR push with preview set OR PR JUST opened) AND PR is open?
   YES →
     Is label combo valid?
       NO  → Comment about invalid combo, STOP
       YES →
         Does seedId already exist in cluster-gitops for this PR?
           YES →
             Is 'reset-db-on-sync' set?
               YES → Set seedId to new GUID
               NO  →
                 Is 'lock-db' set?
                   YES → Use existing seedId, set locked=true
                   NO  → Use existing seedId, set locked=false
           NO → Set seedId to new GUID
   NO → Continue to step 5

5. Was 'preview' JUST unset OR PR JUST closed?
   YES →
     Is 'lock-db' set?
       YES → Set locked=true in values, sync DBL, then delete PR folder (DB preserved)
       NO  → Delete PR folder (DB will be dropped)
   NO → Continue to step 6

6. Was 'lock-db' label JUST changed?
   JUST SET   → Update values to set locked=true
   JUST UNSET → Update values to set locked=false
   NO CHANGE  → Stop processing (nothing to do)

Note: The original logic had a bug at step 6 - it didn't distinguish between lock-db being set vs unset. This corrected version handles both cases.
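
The seedId branch of step 4 can be sketched as a small shell helper (illustrative only - the real logic is implemented inline in pr-preview.yml; step 2 guarantees lock-db and reset-db-on-sync never coexist here):

```shell
# Decide which seedId and locked flag to write to cluster-gitops.
# $1 = existing seedId ("" if none), $2 = "true" if reset-db-on-sync is set,
# $3 = "true" if lock-db is set. Prints "<seedId> <locked>".
decide_seed() {
  existing="$1"; reset="$2"; lock="$3"
  if [ -z "$existing" ] || [ "$reset" = "true" ]; then
    # Fresh GUID triggers a reseed; reset-db-on-sync implies unlocked (step 2)
    new_id=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)
    printf '%s %s\n' "$new_id" "false"
  elif [ "$lock" = "true" ]; then
    printf '%s %s\n' "$existing" "true"     # keep existing seed, lock database
  else
    printf '%s %s\n' "$existing" "false"    # keep existing seed, unlocked
  fi
}
```

Because the operator only reseeds when seedId changes, reusing the existing GUID on an ordinary push is what makes the database survive across deployments.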

7.5 Label Detection Timing

RESOLVED: Labels are detected at workflow start time, not when the workflow is queued.

This is correct behavior because:

  1. The "interruptible" queue pattern cancels stale workflows
  2. Detecting labels at run time ensures the workflow acts on current state
  3. If label changes occur while a workflow is queued, the workflow sees the new state

7.6 Invocation Queue Management

The workflow implements a selective "interruptible" queue pattern to optimize CI efficiency:

Interruptible Events (Expensive Operations):

  • Git push events
  • PR opened/closed events
  • preview label set/unset
  • /reseed-db command (triggers database seeding - stale if a new push arrives)

These events trigger expensive operations: build container images, push to registries, seed database, deploy. If a new event arrives, the old work is stale anyway - cancel and start fresh.

Non-Interruptible Events (Cheap Operations):

  • lock-db label changes
  • reset-db-on-sync label changes
  • use-snapshot label changes

These events only update cluster-gitops values - no builds required. They're fast and should complete before processing the next event.

Analysis of Implementation Options:

Option 1: Separate concurrency groups (NOT RECOMMENDED)

The idea: use conditional concurrency groups based on event type:

```yaml
concurrency:
  group: pr-${{ pr_number }}-${{ is_interruptible && 'build' || 'config' }}
  cancel-in-progress: ${{ is_interruptible }}
```

Problem: Race conditions. If both groups have active workflows, they both try to commit to cluster-gitops simultaneously, causing git conflicts.

Option 2: Single group, always cancel (RECOMMENDED)

```yaml
concurrency:
  group: pr-${{ pr_number }}
  cancel-in-progress: true
```

Why this works:

  • All events eventually commit to cluster-gitops (single writer)
  • Config-only changes are fast - unlikely to be interrupted
  • If cancelled, the subsequent push includes current label state anyway
  • User can re-apply label if needed (rare edge case)
  • Simple, predictable behavior

Option 3: Separate workflow files (COMPLEX)

Split into pr-preview-build.yml and pr-preview-config.yml. Still has race condition issues and adds coordination complexity.

Recommendation: Use Option 2 (single group, always cancel). Simpler, safer, and the rare case of a config change being cancelled by a push is acceptable since the push will include current labels.

7.7 Invalid Label Combinations

The decision tree checks for "valid label combo" but doesn't specify invalid combinations. Define:

| Invalid Combination | Reason | Resolution |
|---------------------|--------|------------|
| lock-db + reset-db-on-sync | Conflicting: lock prevents reseeds, reset requests them | Auto-remove reset-db-on-sync |

Currently, no other combinations are invalid. use-snapshot + lock-db is allowed (but changing use-snapshot while lock-db is set is blocked, since it would have no effect on a locked database).
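
A sketch of the validity check the workflow would run before writing values (function name and flag convention are assumptions, not the actual implementation):

```shell
# Returns 0 if the label combination is valid; reports the conflict otherwise.
# $1 = "true" if lock-db is set, $2 = "true" if reset-db-on-sync is set.
validate_labels() {
  lock="$1"; reset="$2"
  if [ "$lock" = "true" ] && [ "$reset" = "true" ]; then
    echo "conflict: lock-db prevents reseeds, reset-db-on-sync requests them" >&2
    return 1
  fi
  return 0
}
```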


8. Open Questions

Resolved Questions

  1. Mock Data Extraction: RESOLVED - Reuse existing DatabaseSeeder.cs via SYRF_SEED_DATA_MODE=true environment variable. No extraction needed - run PM image in seed-data mode (similar to index-init mode).

  2. GitHub App for PR Comments: RESOLVED - Reuse existing GitHub App (github-app-credentials secret). The github-notifier-job.yaml already uses this for deployment notifications.

  3. Seeding Job Resource Limits: RESOLVED - Use normal amounts, can be tweaked later based on observed performance.

  4. Orphaned Database Cleanup: RESOLVED - No automated cleanup. Just leave the orphaned database and add a comment on the PR explaining the database name and that manual cleanup is required.

  5. Label Detection Timing: RESOLVED - Labels are detected at workflow start time (not when queued). This is correct behavior - the interruptible queue pattern ensures stale builds are cancelled, and detecting labels at run time ensures the workflow acts on current state.

  6. Interruptible Event Scope: RESOLVED - Selective interruptibility is correct:
     • Interruptible events (push, preview, PR open/close): trigger expensive builds - cancel-in-progress avoids wasted CI
     • Non-interruptible events (lock-db, reset-db-on-sync, use-snapshot): only update values - queue and complete (fast, cheap)

  7. Notification Content: RESOLVED - Include both seedId and seedSha in deployment notification comments for full traceability.

  8. Failure Comments: RESOLVED - DBL Operator posts failure comments directly using the same GitHub App credentials as github-notifier-job. No need for CI workflow involvement.

Remaining Questions

All questions resolved.


9. Migration Mapping: Current → New

This section shows how existing components map to the new architecture.

9.1 Terminology Changes

| Current Term | New Term | Notes |
|--------------|----------|-------|
| seedVersion | seedId | GUID that triggers reseed detection |
| (none) | seedSha | Commit SHA for audit trail |
| persist-db label | lock-db label | Enhanced: now keeps CR active when preview removed |
| (none) | reset-db-on-sync label | New: reseed on every push |
| (none) | status field | New: pending/seeding/post-seed/complete/failed |

9.2 Component Changes

| Component | Current Location | Change Required |
|-----------|------------------|-----------------|
| DBL CRD | cluster-gitops/charts/database-lifecycle-operator/crds/ | Add seedId, seedSha, lockDatabase, postSeedJob fields |
| DBL Operator Hook | cluster-gitops/charts/database-lifecycle-operator/templates/configmap-hooks.yaml | Refactor to create Jobs instead of inline seeding |
| db-ready ConfigMap | Created by operator | Add seedId, seedSha, status, errorMessage, lastUpdated |
| Init Container | src/charts/syrf-common/templates/_deployment-dotnet.tpl | Check status=complete AND seedId match |
| GitHub Notifier | src/charts/preview-infrastructure/templates/github-notifier-job.yaml | Replace seedVersion with seedId |
| PR Preview Workflow | .github/workflows/pr-preview.yml | Add seedId/seedSha generation, label conflict handling |
| PM Service | src/services/project-management/ | Add SYRF_SEED_DATA_MODE handling in Program.cs |

9.3 Helm Values Changes

preview-infrastructure chart values:

```yaml
# Current
seedVersion: "abc123def"  # Commit SHA

# New
seedId: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"  # GUID
seedSha: "abc123def"  # Commit SHA (audit only)
lockDatabase: false  # From lock-db label
```
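
For illustration, these values could be templated into the DatabaseLifecycle CR roughly as follows (the apiVersion/group, resource name, and exact spec field spellings are assumptions based on section 9.2):

```yaml
# Illustrative only - the real CRD group/version is defined in
# cluster-gitops/charts/database-lifecycle-operator/crds/.
apiVersion: syrf.io/v1alpha1
kind: DatabaseLifecycle
metadata:
  name: pr-database
spec:
  seedId: {{ .Values.seedId | quote }}
  seedSha: {{ .Values.seedSha | quote }}
  lockDatabase: {{ .Values.lockDatabase }}
```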

9.4 CI Workflow Changes

Current workflow generates:

```yaml
seedVersion: ${{ github.sha }}
```

New workflow generates:

```yaml
seedId: $(uuidgen)  # New GUID on initial deploy, /reseed-db, or reset-db-on-sync
seedSha: ${{ github.sha }}  # Always current commit
lockDatabase: ${{ contains(github.event.pull_request.labels.*.name, 'lock-db') }}
```

9.5 Files to Modify

| File | Changes |
|------|---------|
| cluster-gitops/charts/database-lifecycle-operator/crds/databaselifecycle.yaml | Add new CRD fields |
| cluster-gitops/charts/database-lifecycle-operator/templates/configmap-hooks.yaml | Job-based seeding logic |
| src/charts/preview-infrastructure/templates/database-lifecycle.yaml | Pass new values |
| src/charts/preview-infrastructure/templates/github-notifier-job.yaml | Use seedId |
| src/charts/syrf-common/templates/_deployment-dotnet.tpl | Check status + seedId |
| .github/workflows/pr-preview.yml | seedId generation, label handling |
| src/services/project-management/SyRF.ProjectManagement.Endpoint/Program.cs | Add seed-data mode |

10. References

Internal Documentation

External Resources


Document End

This document must be reviewed and approved before implementation begins.