
SyRF GitOps Migration - Product Backlog

Last Updated: 2025-12-01 Project: SyRF Monorepo + GitOps Migration Sprint Planning: ZenHub/Scrum Board Format

Executive Summary

This backlog tracks the migration from Jenkins X to a GitOps-based deployment architecture using GitHub Actions and ArgoCD. The project is organized as one Epic containing Work Items, with each Work Item containing multiple Child Work Items.

Overall Progress

| Work Item | Status | Child Work Items Complete | Total Child Work Items | Progress |
|---|---|---|---|---|
| Work Item 1: Monorepo Foundation | ✅ Complete | 8/8 | 8 | 100% |
| Work Item 2: CI/CD Automation | ✅ Complete | 7/7 | 7 | 100% |
| Work Item 3: GitOps Infrastructure | ✅ Complete | 5/5 | 5 | 100% |
| Work Item 4: ArgoCD Deployment | 🔄 In Progress | 9/14 | 14 | 64% |
| Work Item 5: Production Migration | ⏳ Planned | 0/4 | 4 | 0% |
| TOTAL | | 29/38 | 38 | 76% |

Burn-down Estimate

  • Total Story Points: 210 (updated: +3 for dynamic matrix 2025-11-20)
  • Completed: 174 points (83%)
  • Remaining: 36 points (17%)
  • Estimated Time to Complete: 2-3 weeks

Legend

Status Icons:

  • ✅ Complete
  • 🔄 In Progress
  • ⏳ Planned / Blocked / Waiting
  • 📋 Ready
  • 🔮 Future/Backlog

Story Point Scale:

  • 1 point = 1-2 hours
  • 2 points = 2-4 hours
  • 3 points = 4-8 hours (half day)
  • 5 points = 1 full day
  • 8 points = 2 days
  • 13 points = 1 week
  • 21 points = 2 weeks

Epic: SyRF GitOps Migration

GitHub Issue: #2128 Goal: Migrate from Jenkins X to a GitOps-based deployment architecture using GitHub Actions, ArgoCD, and Kubernetes

Total Work Items: 5 Total Child Work Items: 38 Total Story Points: 210 (updated: +3 for dynamic matrix 2025-11-20) Completed: 174 points (83%) Remaining: 36 points (17%) Overall Status: 🔄 In Progress

Work Items Overview:

  1. Work Item 1: Monorepo Foundation (58 pts) - ✅ Complete (8/8 child work items)
  2. Work Item 2: CI/CD Automation (57 pts) - ✅ Complete (7/7 child work items)
  3. Work Item 3: GitOps Infrastructure (34 pts) - ✅ Complete (5/5 child work items, 100%)
  4. Work Item 4: ArgoCD Deployment (53 pts) - 🔄 In Progress (9/14 child work items, 64%)
  5. Work Item 5: Production Migration (34 pts) - ⏳ Planned (0/4 child work items)

Work Item 1: Monorepo Foundation ✅ COMPLETE

GitHub Issue: #2129 Goal: Establish monorepo structure with automated semantic versioning

Total Story Points: 58 Status: ✅ Complete (100%)


Child Work Item 1.1: Monorepo Structure Setup ✅

GitHub Issue: #2130 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 0 (Completed)

As a developer I want all services and libraries consolidated into a single monorepo So that I can make atomic changes across service boundaries and simplify dependency management

Acceptance Criteria:

  • All 4 services moved to src/services/ (api, project-management, quartz, web)
  • All shared libraries moved to src/libs/
  • Helm charts organized in src/services/{service}/charts/
  • Root solution file syrf.sln created with proper folder structure
  • Solution filters (.slnf) created for each service
  • Git history preserved from original repositories
  • All projects build successfully with dotnet build
  • Directory.Build.props centralized at repository root

Dependencies: None

Technical Notes:

  • Completed via migration scripts
  • Repository: camaradesuk/syrf-monorepo (production ready)
  • Test repository: camaradesuk/syrf-test

Child Work Item 1.2: GitVersion Configuration ✅

GitHub Issue: #2131 Status: ✅ Complete Priority: P0 (Critical) Story Points: 5 Sprint: Sprint 1 (Completed)

As a developer I want automated semantic versioning based on conventional commits So that versions are calculated automatically without manual intervention

Acceptance Criteria:

  • GitVersion.yml created for all 5 services (api, pm, quartz, web, s3-notifier)
  • All services use mode: ContinuousDeployment
  • Conventional commit patterns configured (feat:, fix:, chore:)
  • Service-specific tag prefixes defined (api-v, pm-v, quartz-v, web-v, s3-notifier-v)
  • Path filtering working (services version independently)
  • GitVersion.yml removed from shared libraries
  • Test commit successfully calculates version

Dependencies: Story 1.1 (Monorepo Structure)

Technical Notes:

  • Decision documented in: GITVERSION-MODE-DECISION.md
  • Used ContinuousDeployment mode instead of Mainline
  • All services at 0.1.0 baseline
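A per-service config consistent with the criteria above might look like this sketch (the bump-message regexes are illustrative assumptions; the authoritative patterns live in each service's GitVersion.yml):

```yaml
# Sketch: src/services/api/GitVersion.yml
mode: ContinuousDeployment
tag-prefix: 'api-v'                          # service-specific tag prefix
major-version-bump-message: '^(feat|fix|chore)(\(.+\))?!:'  # breaking change
minor-version-bump-message: '^feat(\(.+\))?:'               # feat: -> minor
patch-version-bump-message: '^fix(\(.+\))?:'                # fix:  -> patch
```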

Child Work Item 1.3: Chart Version Stabilization ✅

GitHub Issue: #2132 Status: ✅ Complete Priority: P0 (Critical) Story Points: 2 Sprint: Sprint 1 (Completed)

As a platform engineer I want Helm Chart versions to remain stable at 0.0.0 So that deployment versions are controlled via git refs and image tags, not chart versions

Acceptance Criteria:

  • All Chart.yaml files set to version: 0.0.0
  • Comment added: "Stable version; deployments via git ref + image tag"
  • Policy documented in CLUSTER ARCHITECTURE GOALS.md
  • CI/CD workflows do NOT update Chart.yaml versions
  • Charts still valid for Helm deployment

Dependencies: None

Technical Notes:

  • Aligns with GitOps best practices
  • Commit: 941e2a1b (2025-11-03)
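For reference, the stabilized chart metadata might look like this minimal sketch (chart name is illustrative):

```yaml
# Sketch: a service Chart.yaml pinned per the policy above
apiVersion: v2
name: api
version: 0.0.0  # Stable version; deployments via git ref + image tag
```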

Child Work Item 1.4: Dependency Mapping ✅

GitHub Issue: #2133 Status: ✅ Complete Priority: P1 (High) Story Points: 8 Sprint: Sprint 1 (Completed)

As a developer I want a clear dependency map of all services and libraries So that I can understand impact of changes and optimize builds

Acceptance Criteria:

  • DEPENDENCY-MAP.yaml created as single source of truth
  • Complete dependency trees documented for all services
  • Docker build context requirements specified
  • CI/CD workflow trigger paths defined
  • Impact analysis for library changes documented
  • Zero circular dependencies verified
  • Validation script created (validate-dependencies.sh)

Dependencies: Story 1.1 (Monorepo Structure)

Technical Notes:

  • File: architecture/dependency-map.yaml
  • SharedKernel is most critical (affects 3 services)
  • Web service has no .NET dependencies

Child Work Item 1.5: CI/CD Path Filtering Optimization ✅

GitHub Issue: #2134 Status: ✅ Complete Priority: P1 (High) Story Points: 5 Sprint: Sprint 1 (Completed)

As a developer I want CI/CD workflows to build only changed services So that builds are fast and resource-efficient

Acceptance Criteria:

  • Path filters use precise library paths (not broad src/libs/**)
  • API triggers on 6 specific library paths
  • PM triggers on 7 specific library paths
  • Quartz triggers on 2 library paths (minimal dependencies)
  • Web has no library dependencies
  • Test: Change to SharedKernel triggers API, PM, Quartz (not Web)
  • Test: Change to Web triggers only Web service

Dependencies: Story 1.4 (Dependency Mapping)

Technical Notes:

  • Uses dorny/paths-filter@v3 action
  • Prevents unnecessary builds when unrelated libraries change
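A sketch of how the precise filters might be wired with dorny/paths-filter@v3 (library lists abridged and illustrative; the real trigger paths come from DEPENDENCY-MAP.yaml):

```yaml
# Sketch: precise per-service path filters
- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      api:
        - 'src/services/api/**'
        - 'src/libs/SharedKernel/**'   # one of API's 6 library paths
      web:
        - 'src/services/web/**'        # Web has no library dependencies
```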

Child Work Item 1.6: Documentation Consolidation ✅

GitHub Issue: #2135 Status: ✅ Complete Priority: P2 (Medium) Story Points: 8 Sprint: Sprint 1 (Completed)

As a developer I want clear, non-redundant documentation So that I can understand the current state and make informed decisions

Acceptance Criteria:

  • CLAUDE.md updated with current architecture
  • PROJECT-STATUS.md reflects current implementation status
  • IMPLEMENTATION-PLAN.md aligned with actual progress
  • Obsolete documents deleted (preserved in git history)
  • Path references standardized (src/services/ not services/)
  • GitVersion mode contradiction resolved (all docs use ContinuousDeployment)
  • Documentation anti-patterns documented
  • README.md rewritten as navigation entry point

Dependencies: All previous stories

Technical Notes:

  • Deleted 3 obsolete analysis files
  • Adopted hybrid redundancy strategy
  • DEPENDENCY-MAP.yaml is now authoritative

Child Work Item 1.7: Build Configuration Optimization ✅

GitHub Issue: #2136 Status: ✅ Complete Priority: P2 (Medium) Story Points: 3 Sprint: Sprint 1 (Completed)

As a developer I want optimized Docker build contexts and .dockerignore So that builds are faster and use less disk space

Acceptance Criteria:

  • .dockerignore excludes planning/, .github/, docs
  • .dockerignore organized by category with comments
  • Estimated 20-30% reduction in build context size
  • Directory.Build.props enhanced with:
      • Common build settings
      • Code quality settings
      • NuGet package metadata
      • Deterministic builds for CI/CD
  • Redundant Directory.Build.props files removed from service subdirectories

Dependencies: Story 1.1 (Monorepo Structure)

Technical Notes:

  • Root Directory.Build.props is single source of MSBuild configuration
  • .dockerignore tested and validated
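An abridged sketch of the categorized .dockerignore (only planning/, .github/, and docs are confirmed above; the build-output entries are assumptions):

```text
# --- Repo tooling (never needed in images) ---
planning/
.github/
docs/
# --- Build output (assumed category) ---
**/bin/
**/obj/
```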

Child Work Item 1.8: Repository Migration to Production Name ✅

GitHub Issue: #2163 Status: ✅ Complete Priority: P1 (High) Story Points: 5 Sprint: Sprint 1 (Completed 2025-11-13)

As a developer I want the monorepo migrated to the production repository name So that all GitHub metadata is preserved in one consolidated location

Acceptance Criteria:

  • Create backup of syrf-web in syrf-web-legacy repository
  • Force push monorepo branches and tags from syrf-test to syrf-web
  • Rename syrf-web to syrf via GitHub settings
  • Update all documentation references from syrf-monorepo to syrf
  • Update all references from syrf-test to syrf
  • Update git remote URLs locally
  • Verify all issues (470+) preserved with original IDs
  • Verify all PRs (47) preserved and accessible
  • Verify ZenHub workspace continues functioning
  • Verify branches coexist (no conflicts)
  • Verify tags coexist (no conflicts)
  • Verify old URLs redirect to new repository
  • Create comprehensive migration documentation (ADR-005, migration guide)

Dependencies:

  • Story 1.1 (Monorepo Structure Setup)
  • Story 1.6 (Documentation Consolidation)

Technical Notes:

  • Strategy: Force push + rename to preserve GitHub metadata
  • Backup created: camaradesuk/syrf-web-legacy
  • GitHub automatic redirects: syrf-web URLs → syrf URLs
  • Git history preserved: syrf-web main is part of monorepo via git mv
  • Branches: 3 monorepo + 93 syrf-web = no conflicts
  • Tags: Prefixed (api-v*, pm-v*) vs unprefixed (v*) = no conflicts
  • ZenHub: Repository rename transparent (same internal repo ID)
  • Files created:
      • docs/decisions/ADR-005-repository-migration-strategy.md
      • docs/how-to/repository-migration-guide.md
  • Documentation updated: 28 files (syrf-monorepo → syrf, syrf-test → syrf)
  • Commits: 5db1d9e9 (migration docs), [user executed migration]

Estimated Effort: 5 story points (1 day)


Work Item 2: CI/CD Automation ✅ COMPLETE

GitHub Issue: #2137 Goal: Build and push Docker images with automated tagging and promotion

Total Story Points: 57 (updated: +2 for version continuity, +8 for production promotion, +8 for deployment notifications, +3 for dynamic matrix) Status: ✅ Complete (100%) - 7/7 child work items


Child Work Item 2.1: Auto-Version Workflow Cleanup ✅

GitHub Issue: #2138 Status: ✅ Complete Priority: P0 (Critical) Story Points: 7 (updated: +2 for version continuity) Sprint: Sprint 2 (Completed)

As a developer I want the auto-version workflow to create tags without polluting git history So that versioning is clean and doesn't create commit noise

Acceptance Criteria:

  • Remove VERSION file operations from workflow
  • Remove Chart.yaml update operations from workflow
  • Remove commit creation steps
  • Keep tag creation steps
  • Modify push step to only push tags (not commits)
  • Simplify workflow structure (remove file restoration logic)
  • Test: Workflow creates tags but NO commits
  • Test: GitVersion still calculates versions correctly
  • Ensure version continuity from polyrepos:
      • Create baseline tags for each service continuing from last polyrepo version:
          • API: Last polyrepo v8.20.1 → create baseline tag api-v8.20.1
          • PM: Last polyrepo v10.44.1 → create baseline tag pm-v10.44.1
          • Web: Last polyrepo v11.27.0 → create baseline tag web-v11.27.0
      • Update GitVersion.yml configs with next-version if needed
      • Test: Next versions increment correctly (api-v8.21.0, pm-v10.45.0, web-v11.28.0)
      • Document version mapping in ADR

Dependencies:

  • Story 1.2 (GitVersion Configuration)
  • Story 1.3 (Chart Version Stabilization)

Technical Notes:

  • Aligns with GitOps principle (no auto-commits to source repo)
  • Tags are lightweight references, not commits
  • File: .github/workflows/ci-cd.yml (formerly auto-version.yml - already merged)
  • Version Continuity Strategy:
      • Polyrepo tags (v8.20.1) migrated with git history
      • Create prefixed baseline tags at same commits (api-v8.20.1)
      • GitVersion recognizes prefixed tags via tag-prefix config
      • Next versions increment from baseline: feat → minor, fix → patch
      • Maintains semantic versioning continuity across migration
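The baseline-tag step can be sketched as follows. This demo runs in a throwaway repo; in the real monorepo the migrated polyrepo tag (v8.20.1) already exists on the imported history, and the new tag would then be pushed with `git push origin api-v8.20.1` (tags only, no commits):

```shell
set -eu
# Scratch repo standing in for the monorepo with migrated polyrepo history.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m "fix: last polyrepo commit"
git tag v8.20.1                        # unprefixed tag migrated from the polyrepo
commit="$(git rev-list -n 1 v8.20.1)"  # commit the old tag points at
git tag api-v8.20.1 "$commit"          # prefixed baseline at the same commit
git tag -l 'api-v*'                    # → api-v8.20.1
```

Because the prefixed tag sits on the same commit as the old one, GitVersion (with `tag-prefix: 'api-v'`) resumes counting from 8.20.1 rather than restarting at 0.1.0.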

Estimated Effort: 7 story points (updated from 5: +2 for version continuity)


Child Work Item 2.2: Docker Image Build Integration ✅

GitHub Issue: #2139 Status: ✅ Complete Priority: P0 (Critical) Story Points: 13 Sprint: Sprint 2 (Completed)

As a platform engineer I want Docker images built and pushed to GHCR automatically So that every version has an immutable container image

Acceptance Criteria:

  • Review and validate all Dockerfiles for monorepo structure
  • Add Docker build job to auto-version workflow (after version jobs)
  • Use matrix strategy for changed services
  • Build images with correct build context
  • Tag images with the following patterns:
      • {version} (e.g., 1.2.3)
      • {version}-sha.{shortsha} (e.g., 1.2.3-sha.abc123)
      • latest (updates with each push from main)
  • Push to GHCR using GITHUB_TOKEN
  • Test: Trigger workflow with code change
  • Test: Verify images exist in GHCR
  • Test: Both tags exist and point to same image

Dependencies:

  • Story 2.1 (Auto-Version Workflow Cleanup)
  • Story 1.4 (Dependency Mapping)

Technical Notes:

  • Registry: ghcr.io/camaradesuk/syrf-{service}
  • Auth: GITHUB_TOKEN (automatic, no PAT needed)
  • Build context must include entire monorepo (MSBuild requirement)
  • Reference DEPENDENCY-MAP.yaml for required paths
  • Implementation Details:
      • Created automated Dockerfile generation script (scripts/generate-dockerfiles.py)
      • Generates cache-optimized Dockerfiles with 5-layer structure
      • All Dockerfiles regenerated from dependency-map.yaml
      • Fixed PM and Quartz build contexts to use monorepo root
      • API and Web contexts already correct
      • Cache optimization: ~70% time savings for source code changes
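The cache-friendly layering the generator emits might look like this sketch (paths, base images, and assembly name are illustrative; the exact 5-layer split is defined by scripts/generate-dockerfiles.py):

```dockerfile
# Sketch: restore stays cached while source code churns
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
# Copy only project files first so `dotnet restore` hits the layer cache.
COPY src/services/api/*.csproj src/services/api/
COPY src/libs/SharedKernel/*.csproj src/libs/SharedKernel/
RUN dotnet restore src/services/api
# Then copy the rest of the monorepo build context and publish.
COPY . .
RUN dotnet publish src/services/api -c Release -o /app

FROM mcr.microsoft.com/dotnet/aspnet:8.0
WORKDIR /app
COPY --from=build /app .
ENTRYPOINT ["dotnet", "Api.dll"]
```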

Estimated Effort: 13 story points (1 week)


Child Work Item 2.3: Build Optimization - Conditional Rebuild ✅

GitHub Issue: #2140 Status: ✅ Complete Priority: P2 (Medium) Story Points: 8 Sprint: Sprint 3 (Completed 2025-11-19)

As a platform engineer I want to avoid rebuilding Docker images when only non-code files change So that CI/CD is faster and more resource-efficient

Acceptance Criteria:

  • Install crane CLI tool in workflow
  • Implement change detection logic:
      • Detect code vs non-code changes
      • Compare files changed since last git tag
      • Include shared libraries in detection
  • Implement conditional build/retag:
      • If no code changes and source image exists: retag using crane tag
      • If code changed or source missing: build from scratch
  • Add monitoring and summary to workflow output
  • Test: Chart-only change triggers retag (not rebuild)
  • Test: Code change triggers full rebuild
  • Test: Shared library change triggers full rebuild (logic verified)
  • Test: Missing source image intentionally errors rather than silently rebuilding (signals a configuration issue)
  • Measure time savings (target: 2-5 min per optimized build) - Achieved: 12s vs 4+ min

Dependencies:

  • Story 2.2 (Docker Image Build Integration)

Technical Notes:

  • Uses crane for manifest-based retagging (no download)
  • Transparent to GitOps (ArgoCD only cares that the tag exists)
  • Detailed spec in: CLUSTER ARCHITECTURE GOALS.md section 10a
  • Implementation Notes (2025-11-19):
      • Initial approach using dorny/paths-filter negation patterns didn't work (patterns are OR'd)
      • Fixed by using list-files: shell and analyzing actual file paths
      • Chart-only detection checks that ALL changed files match the chart/ path pattern
      • Successfully tested: API chart-only change retagged 9.4.3 → 9.4.4 in 12s
      • Missing source image intentionally errors rather than falling back to a build (signals a configuration issue)
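The chart-only decision described in the notes can be sketched in POSIX shell; the file list here is a stand-in for dorny/paths-filter's list-files output:

```shell
set -eu
# Rebuild unless ALL changed files live under a service's chart/ directory.
changed_files="src/services/api/chart/values.yaml
src/services/api/chart/Chart.yaml"
chart_only=true
for f in $changed_files; do
  case "$f" in
    */chart/*) ;;              # chart change: keep checking
    *) chart_only=false ;;     # any non-chart file forces a full rebuild
  esac
done
echo "chart_only=$chart_only"  # → chart_only=true
```

On the retag path the workflow can then run `crane tag ghcr.io/camaradesuk/syrf-api:9.4.3 9.4.4` (image name illustrative), which adds the new tag to the existing manifest without pulling any layers.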

Estimated Effort: 8 story points (2 days)


Child Work Item 2.4: Promotion PR Automation ✅

GitHub Issue: #2141 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 2 (Completed)

As a platform engineer I want automatic PRs to cluster-gitops after successful image push So that staging deployments are triggered declaratively

Acceptance Criteria:

  • Create GitHub PAT with repo scope for cluster-gitops access
  • Add PAT as secret GITOPS_PAT to app-monorepo repository
  • Add promotion PR job to auto-version workflow
  • Install yq tool for YAML manipulation
  • Update staging values files for changed services:
      • environments/staging/{service}.values.yaml
      • Set image.tag: {version}
  • Create PR with:
      • Title: "Promote {services} to {version} (staging)"
      • Body: Image details, source tag, changelog link
      • Auto-label: promotion, staging, auto-generated
  • Test: Code change creates promotion PR in cluster-gitops
  • Test: PR contains correct version information
  • Test: PR is properly formatted and reviewable

Dependencies:

  • Story 2.2 (Docker Image Build Integration)
  • Story 3.1 (cluster-gitops Repository Complete)

Technical Notes:

  • Uses yq for YAML updates (preserves formatting)
  • PR can be auto-merged or require approval (configurable)
  • File updated: .github/workflows/auto-version.yml
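The values bump itself is a one-line yq edit, e.g. `yq -i '.image.tag = "1.2.3"' environments/staging/api.values.yaml`. The sketch below uses sed as a dependency-free stand-in so it runs anywhere; the file contents and version are illustrative:

```shell
set -eu
# Stand-in for the workflow's yq step (the real job uses yq, which
# preserves YAML comments and formatting).
f="$(mktemp)"
printf 'image:\n  repository: ghcr.io/camaradesuk/syrf-api\n  tag: 1.2.2\n' > "$f"
sed -i.bak 's/^  tag: .*/  tag: 1.2.3/' "$f"
grep 'tag:' "$f"   # →   tag: 1.2.3
```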

Estimated Effort: 8 story points (2 days)


Child Work Item 2.5: Production Promotion Automation ✅

GitHub Issue: #2203 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 2 (Completed 2025-11-13)

As a platform engineer I want automated production promotion PRs after successful staging deployment So that production updates are tracked and require manual approval

Acceptance Criteria:

  • Add promote-to-production job to ci-cd.yml workflow
  • Job triggers automatically after successful staging promotion
  • Copies service versions from staging to production
  • Creates PR to cluster-gitops updating production service values
  • PR labeled requires-review with review checklist
  • PR does NOT auto-merge (requires manual administrator approval)
  • Workflow completes (shows green checkmark) after PR creation
  • Administrator can review and manually merge PR
  • After PR merge, ArgoCD syncs production automatically
  • Documentation created: docs/how-to/production-promotion-and-notifications.md

Dependencies:

  • Story 2.4 (Promotion PR Automation for staging)

Technical Notes:

  • Uses GitHub App authentication for PR creation
  • No GitHub Environment configuration needed (works on free tier)
  • Manual gate happens at PR merge step in cluster-gitops
  • Workflow shows success after PR creation, not after deployment
  • Commit: 3d4edccd (initial), 42a46855 (simplified)

Estimated Effort: 8 story points (2 days)


Child Work Item 2.6: Deployment Success Notifications ✅

GitHub Issue: #2204 Status: ✅ Complete Priority: P1 (High) Story Points: 8 Sprint: Sprint 2 (Completed 2025-11-13)

As a developer I want GitHub commit statuses when ArgoCD successfully deploys services So that I can see deployment status directly on commits and PRs

Acceptance Criteria:

  • Create PostSync hook template for all service charts
  • Job authenticates with GitHub App
  • Creates commit status on source repository
  • Status context: argocd/deploy-{environment}
  • Status description includes service name and version
  • Links to deployed service URL
  • Optional: Create GitHub Releases for production deployments
  • Configuration consolidated in environment shared-values.yaml (DRY principle)
  • Services enable with single flag: deploymentNotification.enabled: true
  • Common config inherited from shared values
  • Documentation updated with DRY configuration approach
  • PostSync jobs auto-cleanup after 5 minutes

Dependencies:

  • Story 4.2 (ArgoCD Installation) - for testing
  • Story 4.3 (Platform Add-ons) - for secrets

Technical Notes:

  • PostSync hook runs Kubernetes Job after successful sync
  • Uses curlimages/curl:8.10.1 container
  • JWT-based GitHub App authentication
  • Staging: commit statuses only (createReleaseNote: false)
  • Production: commit statuses + releases (createReleaseNote: true)
  • DRY: Common config in shared-values, services only set enabled flag
  • Files created:
      • src/services/*/chart/templates/postsync-notify.yaml (all services)
      • docs/how-to/production-promotion-and-notifications.md
      • environments/staging/shared-values.yaml (deploymentNotification section)
      • environments/production/shared-values.yaml (deploymentNotification section)
  • Commits: 3d4edccd, 118648da, 74bee73 (DRY config)
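The hook's shape might be sketched as below (names, env wiring, and API call abridged; the GitHub App JWT exchange is omitted, and the real template lives in each service's chart):

```yaml
# Sketch of the PostSync notification Job
apiVersion: batch/v1
kind: Job
metadata:
  name: notify-deploy
  annotations:
    argocd.argoproj.io/hook: PostSync            # runs after a successful sync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  ttlSecondsAfterFinished: 300                   # auto-cleanup after 5 minutes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: notify
          image: curlimages/curl:8.10.1
          command: ["sh", "-c"]
          args:
            - >
              curl -s -X POST
              -H "Authorization: Bearer $GITHUB_TOKEN"
              https://api.github.com/repos/camaradesuk/syrf/statuses/$GIT_SHA
              -d '{"state":"success","context":"argocd/deploy-staging"}'
```

Here $GITHUB_TOKEN and $GIT_SHA are assumed to be injected by the chart (token via the GitHub App flow, SHA from the deployed revision).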

Estimated Effort: 8 story points (2 days)


Child Work Item 2.7: Dynamic Matrix for Docker Builds ✅

GitHub Issue: #2202 Status: ✅ Complete Priority: P2 (Medium) Story Points: 3 Sprint: Sprint 3 (Completed 2025-11-20)

As a developer I want the CI/CD workflow to show unchanged services as "Skipped" rather than "Succeeded" So that I can clearly see which services were actually built in each workflow run

Acceptance Criteria:

  • Implement dynamic matrix generation in detect-changes job
  • Matrix only includes services that have actually changed
  • Each matrix entry contains full service metadata (name, image, dockerfile, context, flags)
  • Build-docker job uses dynamic matrix instead of static matrix
  • Remove service_changed skip logic from reusable workflow
  • Unchanged services show as "Skipped" in GitHub UI (correct behavior)
  • Changed services build normally with all metadata preserved
  • Web service artifact handling preserved
  • Docs service additional checkouts preserved
  • Workflow validates successfully

Dependencies:

  • Story 2.3 (Build Optimization - Conditional Rebuild)

Technical Notes:

  • Replaces static 6-service matrix with dynamic matrix
  • Uses jq for reliable JSON generation
  • Matrix entries include: name, image, dockerfile, context, changed_output, and service-specific flags
  • GitHub Actions can only show "Skipped" at job level, not step level
  • With static matrix, all jobs run and succeed early (confusing UI)
  • With dynamic matrix, jobs for unchanged services don't exist (clean UI)
  • Commits: c4da6e23, fc2f97ca, a52d0be2
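The matrix generation might be sketched like this (service metadata abridged and illustrative; the real entries also carry dockerfile, changed_output, and service-specific flags):

```shell
set -eu
# Build the dynamic matrix JSON from the list of changed services,
# as the detect-changes job might do with jq.
changed='["api","web"]'
matrix="$(jq -nc --argjson names "$changed" \
  '{include: [$names[] | {name: ., image: "ghcr.io/camaradesuk/syrf-\(.)", context: "."}]}')"
echo "$matrix"   # one matrix entry per changed service
```

The build-docker job then consumes it via `strategy.matrix: ${{ fromJson(needs.detect-changes.outputs.matrix) }}`, so jobs for unchanged services simply never exist and the GitHub UI shows them as skipped.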

Estimated Effort: 3 story points (half day)


Work Item 3: GitOps Infrastructure ✅ COMPLETE

GitHub Issue: #2142 Goal: Establish cluster-gitops repository with ArgoCD configuration

Total Story Points: 34 Status: ✅ Complete (5/5 child items complete, 34/34 pts = 100%)


Child Work Item 3.1: cluster-gitops Repository Complete ✅

GitHub Issue: #2143 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 1 (Completed)

As a platform engineer I want a complete cluster-gitops repository structure So that ArgoCD can declaratively manage cluster state

Acceptance Criteria:

  • Repository created: camaradesuk/cluster-gitops
  • Directory structure established:
      • bootstrap/ (App-of-Apps)
      • projects/ (AppProjects)
      • clusters/{staging,prod}/apps/
      • applicationsets/
      • envs/_global/
      • envs/syrf/{api,project-management,quartz,web}/
  • Initial values files created for all 4 services
  • README and SETUP-INSTRUCTIONS.md documented
  • Initial skeleton committed and pushed
  • PLANNING.md created with migration strategy

Dependencies: None

Technical Notes:

  • Repository: github.com/camaradesuk/cluster-gitops
  • Visibility: Private
  • Multi-source pattern ready for ArgoCD ≥2.6

Child Work Item 3.2: ArgoCD Application Manifests ✅

GitHub Issue: #2144 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 2 (Completed 2025-11-12)

As a platform engineer I want ArgoCD Application definitions for all services So that services can be deployed via GitOps

Acceptance Criteria:

  • Create AppProject definitions (6 projects: syrf-staging, syrf-production, preview, plugins, default, bootstrap)
  • Create Application manifests via ApplicationSets:
      • argocd/applicationsets/syrf.yaml - Matrix generator for all services
      • argocd/applicationsets/plugins.yaml - Infrastructure components
      • argocd/applicationsets/argocd-infrastructure.yaml - ArgoCD components
  • Configure multi-source pattern:
      • Source 1: Chart from monorepo at specific targetRevision tag
      • Source 2: Values from cluster-gitops repository
      • Source 3: Optional resources directory
  • Configure sync policies:
      • Staging: automated (prune + selfHeal)
      • Production: automated with selfHeal disabled
  • Set targetRevision policy using service tags ({service}-vX.Y.Z)
  • Test: Render manifests locally with helm template

Dependencies:

  • Story 3.1 (cluster-gitops Repository Complete) ✅

Technical Notes:

  • Uses ArgoCD multi-source pattern (≥2.6)
  • ApplicationSets auto-generate Applications from environment configs
  • Values interpolation via $values reference
  • CreateNamespace=true for automatic namespace creation
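The multi-source pattern described above can be sketched as follows (repo URLs, tag, and chart path are illustrative):

```yaml
# Sketch: multi-source Application spec an ApplicationSet renders
spec:
  sources:
    - repoURL: https://github.com/camaradesuk/syrf
      targetRevision: api-v8.21.0           # service tag pins the chart version
      path: src/services/api/chart
      helm:
        valueFiles:
          - $values/syrf/environments/staging/api/values.yaml
    - repoURL: https://github.com/camaradesuk/cluster-gitops
      targetRevision: main
      ref: values                            # referenced as $values above
```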

Estimated Effort: 8 story points (2 days)


Child Work Item 3.3: Environment Values Configuration ✅

GitHub Issue: #2145 Status: ✅ Complete Priority: P0 (Critical) Story Points: 5 Sprint: Sprint 2 (Completed 2025-11-12)

As a platform engineer I want environment-specific values for all services So that staging and production have appropriate resource allocations

Acceptance Criteria:

  • Create global values (global/values.yaml)
  • Create environment-specific shared values:
      • syrf/environments/staging/shared-values.yaml
      • syrf/environments/production/shared-values.yaml
  • Create service-specific values for 6 services × 2 environments:
      • syrf/environments/{staging,production}/{api,web,project-management,quartz,docs,user-guide}/
  • Each service has config.yaml (chart reference) and values.yaml (Helm values)
  • Configure for each environment:
      • Image repository and tag (via CI/CD promotion)
      • Replica counts
      • Resource requests/limits
      • Ingress hosts and TLS
      • Environment variables
      • Health check settings
  • Document configuration knobs in comments
  • Validate YAML syntax

Dependencies:

  • Story 3.1 (cluster-gitops Repository Complete) ✅

Technical Notes:

  • Environment namespace.yaml contains sync policies
  • Shared-values.yaml contains common config (deployment notifications, etc.)
  • Service config.yaml updated automatically by CI/CD promotion workflow
  • Staging: automated sync; Production: automated with manual PR merge gate

Estimated Effort: 5 story points (1 day)


Child Work Item 3.4: ApplicationSet for PR Previews ✅

GitHub Issue: #2146 Status: ✅ Complete (manually tested and verified 2025-12-01) Priority: P1 (High) Story Points: 8 Sprint: Sprint 3 (Completed 2025-12-01)

As a developer I want ephemeral preview environments for PRs So that I can test changes before merging

Acceptance Criteria:

  • Create ApplicationSet definition (applicationsets/syrf-previews.yaml)
  • Configure Pull Request Generator:
      • Watch syrf PRs with preview label
      • GitHub App credentials (github-app-repo-creds secret)
      • Requeue every 300 seconds
  • Template Application spec:
      • Name: syrf-pr-{{number}}-{{serviceName}}
      • Namespace: pr-{{number}}
      • Chart source: PR head SHA
      • Image tag: pr-{{number}}
      • Ingress: pr-{{number}}-{{serviceName}}.staging.syrf.org.uk
  • Configure sync policy:
      • Automated (prune + selfHeal)
      • CreateNamespace=true
  • Test: Open PR creates preview environment
  • Test: PR close deletes preview environment
  • Document preview URL pattern (docs/how-to/use-pr-preview-environments.md)

Completed Components:

  • GitHub Actions workflow (pr-preview.yml) - builds images with pr-{number} tag
  • Preview AppProject (argocd/projects/preview.yaml) - allows pr-* namespaces
  • Preview common values (syrf/environments/preview/common.values.yaml)
  • Documentation (381 lines comprehensive guide)
  • ApplicationSet with PullRequest generator - syrf-previews.yaml created
  • GitHub credentials secret config - github-app-repo-creds ExternalSecret added

Final Verification Steps (completed 2025-12-01):

  • Created camarades-github-app-installation-id secret in GCP Secret Manager
  • Pushed cluster-gitops changes and verified ArgoCD sync
  • Tested: Opening a PR with the preview label creates the preview environment
  • Tested: Closing the PR deletes the preview environment

Final State: All components complete. PR Preview environments fully operational - manually tested and verified 2025-12-01.

Dependencies:

  • Story 3.2 (ArgoCD Application Manifests) ✅
  • Story 4.2 (ArgoCD Installation) ✅

Technical Notes:

  • Requires ApplicationSet with pullRequest generator
  • GitHub PAT or GitHub App credentials needed
  • Ephemeral namespaces automatically cleaned up on PR close
  • Preview URLs: pr-{number}.staging.syrf.org.uk
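The generator described above might be sketched as follows (credentials wiring omitted; only one service template shown, and field values are illustrative):

```yaml
# Sketch: syrf-previews.yaml ApplicationSet (abridged)
spec:
  generators:
    - pullRequest:
        github:
          owner: camaradesuk
          repo: syrf
          labels:
            - preview              # only PRs carrying this label get previews
        requeueAfterSeconds: 300   # re-check PRs every 5 minutes
  template:
    metadata:
      name: 'syrf-pr-{{number}}-api'
    spec:
      destination:
        namespace: 'pr-{{number}}'
      syncPolicy:
        automated:
          prune: true              # closing the PR removes the Application
          selfHeal: true
```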

Estimated Effort: 8 story points (2 days)


Child Work Item 3.5: Infrastructure Dependencies Analysis ✅

GitHub Issue: #2147 Status: ✅ Complete Priority: P0 (Critical) Story Points: 5 Sprint: Sprint 2 (Completed 2025-11-12)

As a platform engineer I want to identify all infrastructure dependencies for SyRF services So that the new cluster has all required components before migration

Acceptance Criteria:

  • Document required infrastructure components:
      • Ingress controller (ingress-nginx v4.11.1)
      • cert-manager (v1.15.0) for TLS
      • external-dns (v1.14.5) for DNS management
      • RabbitMQ (v14.6.6) for inter-service messaging
      • External Secrets Operator for secret management (Google Secret Manager)
  • Create Helm charts or manifests for each component (plugins/helm/ directory)
  • Define installation order (documented in docs/cluster-bootstrap.md)
  • Create bootstrap Application for platform add-ons (argocd/bootstrap/root.yaml)
  • Document configuration requirements (per-component values.yaml files)
  • Create smoke test checklist for each component

Dependencies:

  • Story 3.1 (cluster-gitops Repository Complete) ✅

Technical Notes:

  • All components deployed via GitOps (plugins ApplicationSet)
  • Each component has config.yaml + values.yaml + resources/ directory
  • RabbitMQ is CRITICAL (required by all .NET services)
  • ESO uses ClusterSecretStore with Google Secret Manager backend
  • Workload Identity configured for external-dns and ESO
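The ESO side of this setup might be sketched as below (resource name, project ID, and service account are assumptions; Workload Identity binds the Kubernetes service account to a GCP service account with Secret Manager access):

```yaml
# Sketch: ClusterSecretStore backed by Google Secret Manager
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: gcp-secret-manager
spec:
  provider:
    gcpsm:
      projectID: camaradesuk            # illustrative GCP project ID
      auth:
        workloadIdentity:
          clusterLocation: europe-west2-a
          clusterName: camaradesuk
          serviceAccountRef:
            name: external-secrets      # KSA bound via Workload Identity
```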

Estimated Effort: 5 story points (1 day)


Work Item 4: ArgoCD Deployment 🔄 IN PROGRESS

GitHub Issue: #2148 Goal: Install and configure ArgoCD on new Kubernetes cluster

Total Story Points: 53 (32 + 21 cluster remediation issues discovered 2025-11-17) Status: 🔄 In Progress (64% - 9/14 child work items complete)

Resolved Blockers:

  • Cluster provisioned on 2025-11-12
  • ExternalSecrets fixed on 2025-11-18 (Story 4.8)
  • Image tags fixed on 2025-11-18 (Story 4.9)
  • identity-server dependency removed on 2025-11-18 (Story 4.9)

Current Blockers: None


Child Work Item 4.1: Kubernetes Cluster Provisioning ✅

GitHub Issue: #2149 Status: ✅ Complete Priority: P0 (Critical) Story Points: 13 Sprint: Sprint 2 (Completed 2025-11-12)

As a platform engineer I want a new Kubernetes cluster provisioned So that I can install ArgoCD and deploy services

Acceptance Criteria:

  • Decision made: GKE (Google Kubernetes Engine)
  • Cluster provisioned with Terraform:
      • Cluster: camaradesuk, europe-west2-a
      • Kubernetes version: 1.33.5-gke.1201000
      • Nodes: 3-6 (autoscaling), e2-standard-2
      • Features: Workload Identity, VPA, Shielded Nodes
  • kubectl access configured locally
  • Cluster connectivity validated
  • Basic namespaces created via ArgoCD
  • Document cluster details in camarades-infrastructure repo

Dependencies: None (but blocks all other Epic 4 stories)

Technical Notes:

  • Recommended: GKE europe-west2-a (continuity with Jenkins X)
  • Alternative: Any Kubernetes 1.27+ cluster
  • This is the PRIMARY BLOCKER for GitOps migration

Estimated Effort: 13 story points (1 week - including approval/provisioning time)


Child Work Item 4.2: ArgoCD Installation ✅

GitHub Issue: #2150 Status: ✅ Complete Priority: P0 (Critical) Story Points: 5 Sprint: Sprint 2 (Completed 2025-11-12)

As a platform engineer I want ArgoCD installed on the new cluster So that GitOps-based deployments can begin

Acceptance Criteria:

  • Install ArgoCD in argocd namespace (HA mode with Helm)
  • Verify all ArgoCD components are running
  • Access ArgoCD UI via Ingress (argocd.camarades.net)
  • TLS certificate configured with Let's Encrypt
  • ArgoCD admin password available via secret
  • GitHub credential template created for repository access

Dependencies:

  • Story 4.1 (K8s Cluster Provisioning)

Technical Notes:

  • Install command: kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
  • Wait for: kubectl wait --for=condition=available --timeout=300s deployment/argocd-server -n argocd
  • Password: kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
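The password retrieval works because Kubernetes stores Secret .data fields base64-encoded, which is why the command pipes through base64 -d. A minimal local illustration (no cluster required; the password value is made up):

```shell
# Secret .data values are base64-encoded; decoding recovers the plaintext.
encoded=$(printf 'hunter2' | base64)
printf '%s' "$encoded" | base64 -d   # → hunter2
```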

Estimated Effort: 5 story points (1 day)


Child Work Item 4.3: Platform Add-ons Installation ✅

GitHub Issue: #2151 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 2 (Completed 2025-11-12)

As a platform engineer I want all required infrastructure components installed So that SyRF services have the dependencies they need

Acceptance Criteria:

  • Install cert-manager v1.15.0 for TLS certificates
  • Install ingress-nginx v4.11.1 for HTTP routing (LoadBalancer: 34.13.36.98)
  • Install external-dns v1.14.5 for DNS management (with Workload Identity)
  • Install RabbitMQ v14.6.6 (REQUIRED for SyRF services)
  • Configure each component via ArgoCD Applications
  • Verify all components are healthy and synced
  • Document configuration in cluster-gitops/docs/cluster-bootstrap.md

Dependencies:

  • Story 4.2 (ArgoCD Installation)
  • Story 3.5 (Infrastructure Dependencies Analysis)

Technical Notes:

  • RabbitMQ is CRITICAL - services cannot start without it
  • Secret management: ESO with Google Secret Manager (current setup)
  • Use ArgoCD Applications for declarative installation
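The declarative installation called for above can be sketched as an ArgoCD Application per add-on. This is a minimal sketch for cert-manager; the chart repo URL and Helm values shown are assumptions, not necessarily what cluster-gitops uses:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://charts.jetstack.io   # upstream Helm repo (assumed)
    chart: cert-manager
    targetRevision: v1.15.0               # version from the acceptance criteria
    helm:
      values: |
        installCRDs: true
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
```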

Estimated Effort: 8 story points (2 days)


Child Work Item 4.4: App-of-Apps Bootstrap ✅

GitHub Issue: #2152 Status: ✅ Complete Priority: P0 (Critical) Story Points: 3 Sprint: Sprint 3 (Completed 2025-11-12)

As a platform engineer I want ArgoCD bootstrapped via App-of-Apps pattern So that all applications are managed declaratively from Git

Acceptance Criteria:

  • Create bootstrap Application (bootstrap/root.yaml)
  • Configure to watch apps/ directory
  • Apply bootstrap Application to cluster
  • Verify ArgoCD creates child Applications
  • All Applications appear in ArgoCD UI
  • Sync status is healthy
  • Document bootstrap procedure

Dependencies:

  • Story 4.2 (ArgoCD Installation)
  • Story 3.2 (ArgoCD Application Manifests)

Technical Notes:

  • Bootstrap Application lives in cluster-gitops/bootstrap/
  • Creates Applications recursively from apps/ directory
  • Once applied, entire cluster state is Git-driven
  • Tested pruning: Applications auto-delete when YAML removed from Git
  • Updated cluster-bootstrap.md with App-of-Apps pattern
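The App-of-Apps bootstrap described above amounts to one Application whose source is the apps/ directory itself. A minimal sketch (the repo URL is a placeholder; the real manifest lives at bootstrap/root.yaml in cluster-gitops):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/cluster-gitops.git   # placeholder
    targetRevision: main
    path: apps
    directory:
      recurse: true   # create child Applications from apps/ recursively
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true     # matches the tested behavior: Applications auto-delete when YAML is removed
```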

Estimated Effort: 3 story points (half day)


Child Work Item 4.5: First Service Deployment (Canary) 🔄

GitHub Issue: #2153 Status: 🔄 In Progress (75% complete) Priority: P1 (High) Story Points: 5 Sprint: Sprint 3 (Started 2025-11-12)

As a platform engineer I want to deploy one service to staging as a canary So that I can validate the entire GitOps flow before deploying all services

Acceptance Criteria:

  • Choose canary service (API selected)
  • Apply API Application manifest to ArgoCD (via App-of-Apps)
  • Verify ArgoCD syncs successfully (Synced)
  • Service pods are running and healthy (Progressing - waiting for secrets)
  • Ingress is accessible (smoke test endpoint) - BLOCKED by missing secrets
  • Check logs for errors - BLOCKED by missing secrets
  • Verify RabbitMQ connectivity - BLOCKED by missing secrets
  • Document any issues encountered
  • Create runbook for common operations

Dependencies:

  • Story 4.3 (Platform Add-ons Installation) ✅
  • Story 4.4 (App-of-Apps Bootstrap) ✅

Progress Summary (2025-11-12):

✅ Completed:

  1. All 4 .NET services deployed (API, PM, Quartz, Web)
  2. ArgoCD Applications created via App-of-Apps
  3. All showing Synced status
  4. Charts successfully templated
  5. Pods created (Progressing state)
  6. Fixed 2 critical deployment issues:
  • Environment variable format: changed from array to map format in all staging values files
  • Image references: updated Helm templates from Jenkins X pattern to standard Values pattern
  7. Documentation created:
  • /docs/how-to/required-secrets.md - complete guide for all 14 required secrets
  • Includes YAML templates, verification commands, ESO examples, troubleshooting
  8. Triggered documentation service builds:
  • Committed changes to trigger CI/CD for syrf-user-guide and syrf-docs
  • Docker images building (commit: 421f76b5)

⏳ Blockers Identified:

  1. Missing Kubernetes Secrets (Critical - blocks all .NET services):
  • auth0, identity-server, swagger-auth, public-api
  • mongo-db, elastic-db, dev-postgres-credentials
  • rabbit-mq, aws-s3, aws-ses
  • google-sheets, rob-api-credentials
  • elastic-apm, sentry
  • Recommendation: set up External Secrets Operator
  2. Missing Docker Images (syrf-docs, syrf-user-guide):
  • Images don't exist yet in GHCR
  • Build triggered via commit 421f76b5
  • Expected completion: ~5-10 minutes

Current Application Status:

Platform Services:
✅ ingress-nginx: Synced, Healthy
✅ cert-manager: Synced, Healthy
✅ external-dns: Synced, Healthy
🔄 rabbitmq: Synced, Progressing

SyRF Services:
🔄 syrf-api: Synced, Progressing (waiting for secrets)
🔄 syrf-project-management: Synced, Progressing (waiting for secrets)
🔄 syrf-quartz: Synced, Progressing (waiting for secrets)
🔄 syrf-web: Synced, Progressing (waiting for secrets)
❌ syrf-docs: Synced, Degraded (ImagePullBackOff - building)
❌ syrf-user-guide: Synced, Degraded (ImagePullBackOff - building)

Next Steps:

  1. Wait for user-guide/docs images to build (~5-10 min)
  2. Set up External Secrets Operator OR create secrets manually
  3. Verify all services start successfully
  4. Test ingress endpoints
  5. Complete acceptance criteria

Technical Notes:

  • Use baseline version from Jenkins X (see PLANNING.md)
  • API service is good canary (simpler than PM)
  • Validate entire stack before other services
  • App-of-Apps pattern validated successfully

Estimated Effort: 5 story points (1 day)


Child Work Item 4.6: End-to-End GitOps Flow Validation ⏳

GitHub Issue: #2154 Status: ⏳ Blocked Priority: P1 (High) Story Points: 8 Sprint: TBD

As a platform engineer I want to validate the complete GitOps workflow So that I can confirm all automation works as designed

Acceptance Criteria:

  • Test: Make code change to one service
  • Test: Verify auto-version creates tag
  • Test: Verify Docker image is built and pushed
  • Test: Verify promotion PR is created to cluster-gitops
  • Test: Merge promotion PR
  • Test: Verify ArgoCD syncs staging environment
  • Test: Verify service is deployed with new version
  • Test: Open PR in app-monorepo
  • Test: Verify preview environment is created
  • Test: Close PR and verify preview cleanup
  • Test: Create manual production promotion
  • Test: Verify production deployment
  • Test: Rollback by reverting promotion PR
  • Document timing metrics (commit → staging deployment time)

Dependencies:

  • Story 4.5 (First Service Deployment)
  • Story 2.2 (Docker Image Build Integration)
  • Story 2.4 (Promotion PR Automation)
  • Story 3.4 (ApplicationSet for PR Previews)

Technical Notes:

  • This validates the ENTIRE GitOps architecture
  • Target: commit → staging < 10 min p50
  • Target: preview ready < 2 min
  • Document any issues for optimization
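The timing metric above (commit → staging deployment time) can be computed from two timestamps. A sketch, with illustrative (not measured) values:

```shell
# Latency between a commit timestamp and the observed staging deployment.
# Timestamps are illustrative; substitute real values from git log / ArgoCD.
commit_ts=$(date -u -d '2025-11-18T10:00:00Z' +%s)
deploy_ts=$(date -u -d '2025-11-18T10:08:30Z' +%s)
echo "commit → staging: $(( (deploy_ts - commit_ts) / 60 )) min"   # → commit → staging: 8 min
```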

Estimated Effort: 8 story points (2 days)


Child Work Item 4.7: Helm Chart Standardization - Jenkins X Pattern Removal ✅

GitHub Issue: #2172 Status: ✅ Complete Priority: P0 (Critical) Story Points: 3 Sprint: Sprint 3 (Completed 2025-11-14)

As a platform engineer I want all Jenkins X legacy patterns removed from Helm charts So that charts use standard Kubernetes conventions and are maintainable

Acceptance Criteria:

  • Remove all jx.imagePullSecrets references (use top-level imagePullSecrets array)
  • Remove all jxRequirements.ingress.* references (use ingress.*)
  • Remove all draft label patterns
  • Update all 4 service charts (api, project-management, quartz, web)
  • Validate all charts render successfully with helm template
  • Document changes in ADR-006
  • Update environment values in cluster-gitops to match new structure

Dependencies:

  • Story 4.5 (First Service Deployment) - discovered issue during deployment

Scope Summary:

  • 52 jx references removed across 16 files (4 services × 4 files)
  • Services updated: api, project-management, quartz, web
  • Files per service: values.yaml, deployment.yaml, ingress.yaml, canary.yaml
  • Root cause: syrf-web ImagePullBackOff due to jx.imagePullSecrets vs top-level imagePullSecrets mismatch

Technical Notes:

  • Web service had 30 jx references in ingress.yaml alone (complex host name construction)
  • Used bulk sed replacements for efficiency in web ingress.yaml
  • All charts validated with helm template after changes
  • ADR-006 created: docs/decisions/ADR-006-helm-chart-standardization.md
  • Follow-up required: Update cluster-gitops environment values to use new structure
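The bulk sed replacements mentioned above can be reconstructed roughly as follows; the file content and name here are illustrative, not the actual chart templates:

```shell
# Rewrite Jenkins X value paths to standard top-level keys in a chart template.
f=$(mktemp)
cat > "$f" <<'EOF'
host: {{ .Values.jxRequirements.ingress.domain }}
pullSecrets: {{ .Values.jx.imagePullSecrets }}
EOF
sed -i \
  -e 's/\.Values\.jxRequirements\.ingress\./.Values.ingress./g' \
  -e 's/\.Values\.jx\.imagePullSecrets/.Values.imagePullSecrets/g' "$f"
cat "$f"
```

After the rewrite the template references .Values.ingress.domain and .Values.imagePullSecrets, matching the standard structure the charts were migrated to.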

Estimated Effort: 3 story points (half day)


Child Work Item 4.8: Fix SecretStore Configuration (External Secrets Migration) ✅

GitHub Issue: #2195 Status: ✅ Complete (2025-11-18) Priority: P0 (Critical - blocks ALL services) Story Points: 8 Sprint: Sprint 3

As a platform engineer I want all ExternalSecrets migrated from v1beta1 SecretStore to v1 ClusterSecretStore So that services can retrieve secrets from GCP Secret Manager and start successfully

Acceptance Criteria:

  • Migrate extra-secrets-staging to use ClusterSecretStore pattern
  • Migrate extra-secrets-production to use ClusterSecretStore pattern
  • Update all ExternalSecret references from SecretStore to ClusterSecretStore
  • Verify ClusterSecretStore is READY in staging and production
  • Test 3-5 critical secrets sync successfully (auth0, mongo-db, rabbit-mq)
  • All 40 ExternalSecrets show READY: True status (20 staging + 20 production)
  • Document migration in cluster-gitops (chart templates documented)
  • Update environment values to reference ClusterSecretStore

Completion Summary (2025-11-18):

What was done:

  1. Chart Template Updates:
  • Added shorthand secrets format to extra-secrets chart for cleaner values files
  • Created ClusterExternalSecret template (cluster-external-secrets.yaml) for cluster-wide secrets
  • All ExternalSecrets now use kind: ClusterSecretStore instead of kind: SecretStore
  • Updated API version from v1beta1 to v1
  2. ClusterExternalSecrets Created (deployed via argocd-secrets):
  • ghcr-secret → argocd, syrf-staging, syrf-production
  • rabbit-mq → rabbitmq, syrf-staging, syrf-production
  • github-app-credentials → argocd, syrf-staging, syrf-production
  • Added ClusterExternalSecret to argocd project whitelist
  3. Values Files Simplified:
  • Staging: 17 namespace-scoped secrets using shorthand format
  • Production: 17 namespace-scoped secrets using shorthand format
  • Removed duplicates (ghcr-secret, rabbit-mq, github-app-credentials) now handled by ClusterExternalSecrets
  4. Additional Fixes:
  • Fixed webhook HMAC verification (trailing newline in GCP secret)
  • Removed stuck finalizers from ExternalSecrets blocking deletion

Final State:

  • ✅ 20 ExternalSecrets in syrf-staging: All SecretSynced
  • ✅ 20 ExternalSecrets in syrf-production: All SecretSynced
  • ✅ 3 ClusterExternalSecrets: All Ready, provisioned to all target namespaces
  • ✅ GitHub webhook working (instant sync on push)
  • ✅ ClusterSecretStore gcpsm-secret-store serving all namespaces

Files Modified:

  • charts/extra-secrets/templates/external-secrets.yaml - Added shorthand format
  • charts/extra-secrets/templates/cluster-external-secrets.yaml - New file
  • charts/extra-secrets/values.yaml - Documented new formats
  • argocd/local/argocd-secrets/values.yaml - Added ClusterExternalSecrets config
  • argocd/projects/argocd.yaml - Added ClusterExternalSecret to whitelist
  • plugins/local/extra-secrets-staging/values.yaml - Simplified with shorthand format
  • plugins/local/extra-secrets-production/values.yaml - Simplified with shorthand format

Dependencies:

  • Story 4.3 (Platform Add-ons) ✅ Complete - ESO installed
  • Story 4.5 (First Service Deployment) - blocked by this issue

Technical Notes:

  • Reference working pattern: argocd/local/argocd-secrets/values.yaml
  • ClusterSecretStore enables cross-namespace secret access
  • Workload Identity already configured for ESO: external-secrets@camarades-net.iam.gserviceaccount.com
  • IAM binding already exists: roles/iam.workloadIdentityUser
  • Only chart updates needed, no infrastructure changes
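The migrated pattern can be sketched as follows; the secret name matches the required-secrets list, but the refresh interval and GCP key path are assumptions:

```yaml
apiVersion: external-secrets.io/v1   # migrated from v1beta1
kind: ExternalSecret
metadata:
  name: auth0
  namespace: syrf-staging
spec:
  refreshInterval: 1h                # assumed
  secretStoreRef:
    name: gcpsm-secret-store         # the ClusterSecretStore serving all namespaces
    kind: ClusterSecretStore         # was: SecretStore (namespace-scoped)
  target:
    name: auth0                      # resulting Kubernetes Secret
  dataFrom:
    - extract:
        key: auth0                   # GCP Secret Manager secret name (assumed)
```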

Estimated Effort: 8 story points (2 days - includes testing and verification)


Child Work Item 4.9: Fix Staging Environment Image Tags ✅

GitHub Issue: #2196 Status: ✅ Complete (2025-11-18) Priority: P0 (Critical - staging completely broken) Story Points: 5 Sprint: Sprint 3

As a developer I want staging service pods to have valid image tags So that staging environment is functional for testing

Acceptance Criteria:

  • Identify root cause of empty image tags in staging
  • Fix deployment manifests or Helm values causing empty tags
  • Update staging values with correct image tags for all services
  • Document fix in cluster-gitops troubleshooting
  • Verify all staging pods transition to Running state
  • Delete failed pods (InvalidImageName/ImagePullBackOff)
  • Verify new pods start successfully with valid images

Completion Summary (2025-11-18):

Root Cause Identified:

  • ApplicationSet removed image.tag parameters (commit bbfd0cd)
  • Staging values files didn't have explicit image configuration
  • Result: Helm rendered an empty image reference (":" - no repository, no tag)

Fixes Applied:

  1. syrf monorepo (commit df49793a):
  • Renamed pm → project-management throughout CI/CD workflow
  • Updated Docker image name: syrf-pm → syrf-project-management
  • Updated git tag prefix: pm-v → project-management-v
  • CI/CD now sets both chartTag and imageTag when promoting
  2. cluster-gitops (commit 4b5eab1):
  • Added image.repository to all service base values.yaml files
  • Added ApplicationSet parameter to derive image.tag from service.imageTag
  • Updated all environment configs with imageTag field
  • Updated project-management chartTag: pm-v11.2.0 → project-management-v11.2.0
  • Created compatibility git tag: project-management-v11.2.0
  3. Temporary workaround (commit 09eb23b):
  • project-management uses syrf-pm image until next CI/CD build
  • TODO in values.yaml to change to syrf-project-management after build
  4. Versioning error fix (2025-11-18):
  • CI/CD created incorrect project-management-v1.0.0 tag (GitVersion ran before compatibility tag)
  • Deleted incorrect tag, created correct project-management-v11.3.0 tag
  • Updated staging config: chartTag: project-management-v11.3.0, imageTag: "11.2.0"
  • Commit 2205291 in cluster-gitops
  5. Project-management rename completion (2025-11-18):
  • Triggered new CI/CD build which created project-management-v11.3.1 tag
  • Updated cluster-gitops image.repository from syrf-pm to syrf-project-management
  • Final staging config: chartTag: project-management-v11.3.2, imageTag: "11.3.2"
  6. Identity-server removal (2025-11-18):
  • Removed unused IdentityServer4.AccessTokenValidation package from API (now using Auth0)
  • Removed identityServer config blocks from all 4 service Helm values.yaml files
  • Removed identity-server secret environment variables from 3 deployment templates
  • Added IdentityModel.AspNetCore.OAuth2Introspection for TokenRetrieval (was a transitive dependency)
  • Updated required-secrets.md - removed identity-server from required secrets list

Final Service Versions (All Healthy):

  • api: 9.2.3
  • project-management: 11.3.2
  • quartz: 0.5.1
  • web: 5.4.2
  • docs: 1.6.5
  • user-guide: 1.1.0

Architecture Improvement:

  • image.repository is now explicit in service values.yaml (static)
  • image.tag is derived via ApplicationSet from service.imageTag
  • CI/CD sets both chartTag (chart version) and imageTag (Docker tag) on promotion
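A sketch of how this derivation might look in the ApplicationSet template; everything beyond the image.tag parameter name and the environment config's imageTag field is an assumption about the generator's variable naming:

```yaml
# Hypothetical fragment of the services ApplicationSet template:
# image.repository stays static in each service's values.yaml, while
# image.tag is injected from the environment config's imageTag field.
spec:
  template:
    spec:
      source:
        helm:
          parameters:
            - name: image.tag
              value: '{{ .imageTag }}'   # placeholder for the generator variable
```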

Previous State (2025-11-17):

  • ❌ 5 failed pods in syrf-staging namespace
  • syrf-api-657c97878c-jzdd6: InvalidImageName (Image: : - no registry, no tag)
  • syrf-projectmanagement-9cdbf4465-tpkpw: InvalidImageName
  • syrf-projectmanagement-6fc8d864f5-65pf7: ImagePullBackOff
  • syrf-quartz-d6696d6d5-sr846: InvalidImageName
  • syrf-web-7468b67d77-t7bl6: InvalidImageName
  • ✅ Older pods still running: syrf-api-5758596878, syrf-quartz-68b97d8994, syrf-web-65c666df64

Root Cause Analysis (1 hour):

  1. Check staging values files in cluster-gitops for image.tag configuration
  2. Compare with production values (production is working)
  3. Check ApplicationSet or Application manifests for templating issues
  4. Review recent commits that may have introduced the issue
  5. Check if CI/CD promotion PR left tags empty

Remediation Steps (2-4 hours):

  1. Immediate fix: update staging values with working image tags (30 min)
  • Set explicit image.tag values for all services
  • Use last known working versions from old pods
  • Create PR to cluster-gitops
  2. Sync and verify: ArgoCD sync and pod replacement (1 hour)
  • Sync staging Applications
  • Delete failed pods
  • Wait for new pods to start
  • Verify Running status
  3. Permanent fix: fix root cause in values/templates (1-2 hours)
  • Fix Helm templating if issue is in chart
  • Fix CI/CD promotion workflow if issue is in automation
  • Add validation to prevent empty tags in future
  • Test with dry-run deployment
  4. Documentation: update troubleshooting guide (30 min)
  • Document common image tag issues
  • Add verification commands
  • Create runbook for fixing failed pods

Dependencies:

  • Story 4.5 (First Service Deployment) - blocked by this issue
  • May depend on Story 2.4 (Promotion PR Automation) if CI/CD is root cause

Technical Notes:

  • Old working pods can provide reference for correct image tags
  • Issue likely introduced in recent deployment or promotion
  • Compare staging vs production Application specs
  • May need to rollback recent changes to cluster-gitops

Estimated Effort: 5 story points (1 day - includes analysis and fix)


Child Work Item 4.10: Fix Extra-Secrets ApplicationSet Directory Structure ✅

GitHub Issue: #2197 Status: ✅ Completed (2025-11-18) Priority: P1 (High - blocks extra-secrets deployment) Story Points: 2 Sprint: Sprint 3

As a platform engineer I want extra-secrets Applications to sync successfully So that ClusterSecretStore and ExternalSecrets can be deployed

Acceptance Criteria:

  • Create missing resources/ directories in cluster-gitops
  • Add .gitkeep files to preserve empty directories
  • Verify extra-secrets-staging Application syncs successfully
  • Verify extra-secrets-production Application syncs successfully
  • Both Applications show Synced status (not Degraded)
  • Document ApplicationSet directory requirements

Current State (2025-11-18):

  • ✅ extra-secrets-production: Synced, Healthy
  • ✅ extra-secrets-staging: Synced, Healthy
  • Resolution: Added missing resources/ directories with .gitkeep files (commit 8233395)
  • Documentation: Added troubleshooting section to cluster-gitops/docs/applicationsets.md

Affected Files (cluster-gitops repo):

  • Missing: plugins/local/extra-secrets-production/resources/
  • Missing: plugins/local/extra-secrets-staging/resources/

Remediation Steps (1-2 hours):

  1. Create directories (15 min)
cd cluster-gitops
mkdir -p plugins/local/extra-secrets-production/resources
mkdir -p plugins/local/extra-secrets-staging/resources
touch plugins/local/extra-secrets-production/resources/.gitkeep
touch plugins/local/extra-secrets-staging/resources/.gitkeep
  2. Commit and push (15 min)
  • Create PR or commit directly to cluster-gitops
  • Include documentation in commit message
  3. Sync and verify (30 min)
  • Trigger ArgoCD sync for both Applications
  • Verify Degraded status clears
  • Verify no more "path does not exist" errors
  • Check Application health in ArgoCD UI
  4. Document pattern (30 min)
  • Document ApplicationSet multi-source requirements
  • Add note about resources/ directory purpose
  • Update cluster-gitops README if needed

Dependencies:

  • Story 4.4 (App-of-Apps Bootstrap) ✅ Complete
  • Related to Story 4.8 (SecretStore fix) - both affect extra-secrets

Technical Notes:

  • Same pattern already used for: argocd/local/argocd-secrets/resources/.gitkeep
  • ApplicationSet template unconditionally adds third source
  • Empty resources/ directory satisfies template requirement
  • .gitkeep ensures git tracks empty directory

Estimated Effort: 2 story points (2-4 hours)


Child Work Item 4.11: Fix User Guide TLS Certificate ✅

GitHub Issue: #2198 Status: ✅ Complete Priority: P2 (Medium - cert-manager issue) Story Points: 3 Sprint: Sprint 3 Completed: 2025-11-18

As a platform engineer I want the user-guide TLS certificate to issue successfully So that help.staging.syrf.org.uk is accessible over HTTPS

Acceptance Criteria:

  • Investigate why user-guide-tls certificate is stuck in "Issuing" state
  • Identify blocker (DNS, ACME challenge, rate limit, etc.)
  • Resolve issue preventing certificate issuance
  • Verify certificate transitions to Ready: True
  • Verify TLS secret is created
  • Test HTTPS access to staging URLs
  • Document resolution

Root Causes Identified (2025-11-18):

  1. Staging using production URLs: Staging ingresses were configured with production hostnames (e.g., help.syrf.org.uk) instead of staging hostnames (e.g., help.staging.syrf.org.uk)
  2. DNS mismatch: Production URLs point to GitHub Pages or legacy cluster, not GKE LoadBalancer, so ACME HTTP-01 challenges failed with 404
  3. Wrong Let's Encrypt issuer: Staging shared-values.yaml used letsencrypt-staging issuer
  4. Chart defaults pollution: Helm chart defaults had hardcoded ingress values not fully overridden

Resolution Applied:

  1. Updated all Helm chart defaults to ingress: {} (api, pm, quartz, web, user-guide, docs)
  2. Created staging environment values with correct staging URLs
  3. Changed staging shared-values to use letsencrypt-prod issuer
  4. Deleted old certificates to trigger reissuance with correct configuration

Final State (2025-11-18):

  • ✅ All staging certificates using letsencrypt-prod
  • ✅ All staging certificates Ready: True
  • ✅ Staging URLs configured (updated to new convention 2025-11-30):
  • api: api.staging.syrf.org.uk
  • web: staging.syrf.org.uk
  • project-management: project-management.staging.syrf.org.uk
  • docs: docs.staging.syrf.org.uk
  • user-guide: help.staging.syrf.org.uk
  • ✅ All production certificates Ready: True

Diagnostic Steps (1-2 hours):

  1. Check cert-manager logs (30 min)
  • Look for errors related to user-guide-tls
  • Check ACME challenge status
  • Identify specific failure reason
  2. Check ACME challenge resources (30 min)
  • List Challenge resources for this certificate
  • Check if HTTP-01 challenge pod exists
  • Verify challenge endpoint is reachable
  • Check if cert-manager solver ingress exists
  3. Check DNS and ingress (30 min)
  • Verify help.syrf.org.uk DNS resolves to ingress IP
  • Check ingress routes for ACME challenge path
  • Verify no conflicts with other certificates
  4. Check Let's Encrypt rate limits (15 min)
  • Check if domain hit rate limits
  • Verify staging vs production ACME server usage

Potential Root Causes:

  1. DNS issue: help.syrf.org.uk not resolving or resolving to wrong IP
  2. ACME challenge failure: HTTP-01 challenge endpoint not reachable
  3. Rate limiting: Let's Encrypt rate limits exceeded
  4. Ingress conflict: Multiple ingresses competing for same hostname
  5. cert-manager bug: Controller not processing certificate request

Remediation Steps (1-2 hours):

  1. Delete and recreate (if stuck in bad state)
kubectl delete certificate user-guide-tls -n syrf-staging
# ArgoCD will recreate from manifest
  2. Force a new order (if ACME issue)
kubectl delete certificaterequest -n syrf-staging -l cert-manager.io/certificate-name=user-guide-tls
kubectl delete challenge -n syrf-staging --all
  3. Update certificate spec (if configuration issue)
  • Switch to DNS-01 challenge if HTTP-01 is failing
  • Use Let's Encrypt staging server to avoid rate limits
  • Adjust dnsNames or issuer reference
  4. Verify and monitor (30 min)
  • Watch certificate events
  • Monitor cert-manager logs
  • Verify certificate reaches Ready state
  • Test HTTPS access

Dependencies:

  • Story 4.3 (Platform Add-ons) ✅ Complete - cert-manager installed
  • May need coordination with DNS/ingress configuration

Technical Notes:

  • cert-manager version: v1.15.0
  • Issuer: Let's Encrypt (production)
  • Challenge type: HTTP-01 (assumed)
  • Other certificates issuing successfully (suggests cert-manager is healthy)
  • Issue specific to user-guide service

Estimated Effort: 3 story points (4-8 hours - investigation heavy)


Child Work Item 4.12: Sync Out-of-Sync Applications 🔄

GitHub Issue: #2199 Status: 🔄 In Progress (Plugins complete, SyRF apps pending) Priority: P2 (Medium - operational hygiene) Story Points: 2 Sprint: Sprint 3 (In Progress)

As a platform engineer I want all ArgoCD Applications to show Synced status So that cluster state matches Git and drift is eliminated

Acceptance Criteria:

  • Review all Applications with OutOfSync or Unknown status
  • Identify reason for each drift (manual change, missing config, etc.)
  • Sync or fix each Application
  • Verify all Applications show Synced status
  • Document any manual steps taken
  • Configure sync policies if needed (auto-sync, prune, self-heal)

Current State (2025-11-17):

OutOfSync Applications (4):

  • argocd-secrets: OutOfSync
  • cert-manager: OutOfSync
  • rabbitmq: OutOfSync
  • root: OutOfSync

Unknown Status Applications (9):

  • docs-production: Unknown
  • external-dns: Unknown
  • extra-secrets-production: Unknown (also Degraded)
  • extra-secrets-staging: Unknown (also Degraded)
  • ingress-nginx: Unknown
  • quartz-production: Unknown
  • rabbitmq-secrets: Unknown
  • user-guide-production: Unknown
  • user-guide-staging: Synced (but has failing certificate)

Sync Process (2-4 hours):

  1. Categorize issues (30 min)
  • Group by root cause
  • Identify which can auto-sync vs need manual intervention
  • Check if blocked by other child work items
  2. Sync core infrastructure (1 hour)
  • argocd-secrets (likely just committed changes)
  • cert-manager (check for drift)
  • rabbitmq (verify no manual changes)
  • root (app-of-apps - sync to propagate changes)
  3. Investigate Unknown status (1 hour)
  • Check why Application status is Unknown
  • May indicate health check issues
  • May be transient during deployment
  • Review Application logs and events
  4. Document and prevent (30 min)
  • Document why each Application was out of sync
  • Configure sync policies to prevent future drift
  • Consider enabling auto-sync for infrastructure apps

Dependencies:

  • May depend on Story 4.8 (SecretStore fix) for extra-secrets apps
  • May depend on Story 4.10 (directory structure fix) for extra-secrets apps

Technical Notes:

  • Unknown status often means ArgoCD can't determine health
  • May need to configure custom health checks
  • OutOfSync is expected during active development
  • Root Application sync propagates to all child Applications
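For the Unknown-status cases, the usual fix is a custom health check in the argocd-cm ConfigMap. A sketch for ExternalSecret resources; the key format is standard ArgoCD (group_Kind), but the Lua body is an assumption modeled on the ESO Ready condition:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.external-secrets.io_ExternalSecret: |
    hs = {}
    hs.status = "Progressing"
    hs.message = "Waiting for secret sync"
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, c in ipairs(obj.status.conditions) do
        if c.type == "Ready" and c.status == "True" then
          hs.status = "Healthy"
          hs.message = c.message
        end
      end
    end
    return hs
```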

Estimated Effort: 2 story points (2-4 hours)


Child Work Item 4.13: Clean Up Orphaned Resources 🔄

GitHub Issue: #2200 Status: 🔄 In Progress (Plugins cleanup complete, SyRF apps pending) Priority: P3 (Low - cleanup task) Story Points: 1 Sprint: Sprint 3-4 (In Progress)

As a platform engineer I want orphaned resources cleaned up So that cluster is tidy and ArgoCD warnings are eliminated

Acceptance Criteria:

  • Review orphaned resources in api-staging (7 resources)
  • Review orphaned resources in extra-secrets-production (3 resources)
  • Determine if resources should be:
  • Deleted (if truly orphaned)
  • Adopted by ArgoCD (if should be managed)
  • Ignored (if intentionally manual)
  • Execute cleanup or adoption
  • Verify OrphanedResourceWarning clears from Applications
  • Document decisions for future reference

Current State (2025-11-17):

  • api-staging: 7 orphaned resources
  • extra-secrets-production: 3 orphaned resources

Orphaned Resource Analysis (1-2 hours):

  1. Identify resources (30 min)
kubectl get application api-staging -n argocd -o yaml | yq '.status.resources'
kubectl get application extra-secrets-production -n argocd -o yaml | yq '.status.resources'
  2. Determine ownership (30 min)
  • Check resource labels and annotations
  • Verify whether resources are managed by Helm
  • Check whether resources should exist in manifests
  • Identify why ArgoCD considers them orphaned
  3. Decide action (30 min)
  • Delete: if resources are leftover from old deployments
  • Adopt: if resources should be in Git but aren't
  • Ignore: if resources are intentionally manual (add to the AppProject orphanedResources ignore list)

Cleanup Steps (30 min - 1 hour):

  1. Option A: delete orphaned resources
kubectl delete <resource-type> <resource-name> -n <namespace>
  2. Option B: adopt into ArgoCD
  • Add resource manifests to Git
  • Configure ownerReferences
  • Sync Application
  3. Option C: ignore
  • Add to the AppProject orphanedResources ignore list
  • Document why resources are manual

Dependencies:

  • None (independent cleanup task)

Technical Notes:

  • Orphaned resources don't break functionality
  • Warnings indicate resources exist but not tracked in Git
  • Common causes: manual kubectl apply, Helm 2 migration, renamed resources
  • ArgoCD can adopt resources with proper annotations
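Silencing an intentionally-manual resource is configured on the AppProject (orphan warnings come from orphanedResources, not from Application ignoreDifferences, which only affects diffing). A sketch with a hypothetical resource name:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default
  namespace: argocd
spec:
  orphanedResources:
    warn: true            # keep warning about genuinely untracked resources
    ignore:
      - group: ""
        kind: ConfigMap
        name: manually-managed-config   # hypothetical intentionally-manual resource
```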

Estimated Effort: 1 story point (1-2 hours)


Child Work Item 4.14: Configure ArgoCD Sync Policies and Drift Prevention 🔄

GitHub Issue: #2201 Status: 🔄 In Progress (Plugins complete, SyRF apps pending) Priority: P1 (High - prevents future issues) Story Points: 5 Sprint: Sprint 3 (In Progress)

As a platform engineer I want proper ArgoCD sync policies configured for all Applications So that the cluster stays synchronized with Git and drift is prevented automatically

Acceptance Criteria:

  • Audit all Applications and categorize by sync strategy needs
  • Configure automated sync policies where appropriate
  • Enable prune and self-heal for automated applications
  • Configure sync waves for ordered deployments
  • Set up retry logic for transient failures
  • Add PostSync hooks for critical validations (beyond deployment notifications)
  • Configure health checks for custom resources
  • Document sync policy decisions and rationale
  • Test drift detection and auto-remediation
  • Create runbook for monitoring sync status

Current State (2025-11-17):

  • Mixed sync policies: Some auto-sync, some manual, inconsistent configuration
  • No self-heal configured: Manual changes to cluster not automatically reverted
  • No prune configured: Deleted resources in Git remain in cluster
  • No sync waves: Dependencies deployed in random order
  • Limited health checks: ArgoCD can't determine health for some resources

Sync Policy Categories (1 hour):

  1. Full Automation (auto-sync + prune + self-heal):
    • Infrastructure: ingress-nginx, cert-manager, external-dns
    • Platform: external-secrets-operator, rabbitmq-secrets
    • Staging services: All syrf-staging services
    • Justification: Non-critical, fast feedback needed
  2. Semi-Automated (auto-sync + self-heal, NO prune):
    • Production services: All syrf-production services
    • ArgoCD itself: Self-managing, but prune with care
    • Justification: Auto-sync for speed, manual pruning for safety
  3. Manual Only (no auto-sync):
    • Production database configs (if added)
    • Security-critical resources (if added)
    • Justification: Require explicit approval before changes
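
As a sketch, the Semi-Automated category above maps to an Application syncPolicy with self-heal enabled and prune disabled:

```yaml
# Semi-automated policy (e.g., a syrf-production service):
# drift is reverted automatically, but resources deleted from Git
# stay in the cluster until someone prunes them manually.
syncPolicy:
  automated:
    prune: false     # deletions require an explicit manual sync
    selfHeal: true   # manual cluster edits are reverted to Git state
  syncOptions:
    - CreateNamespace=true
```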

Configuration Tasks (3-4 hours):

  1. Infrastructure Applications (1 hour)
# Example: ingress-nginx
syncPolicy:
  automated:
    prune: true
    selfHeal: true
  syncOptions:
    - CreateNamespace=true
    - PrunePropagationPolicy=foreground
    - PruneLast=true
  retry:
    limit: 5
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m

  2. Service Applications (1-2 hours)
    • Add sync waves to ensure dependencies deploy first
    • Configure health checks for custom resources
    • Set appropriate retry policies
    • Enable auto-sync for staging, semi-automated sync for production
  3. PostSync Validation Hooks (1 hour)
    • Add hooks beyond deployment notifications
    • Validate that critical resources exist after sync
    • Check for common misconfigurations
    • Alert on unexpected drift
  4. Health Checks (30 min)
    • Configure custom health checks for ExternalSecrets
    • Configure health for Jobs (treat successful completion as healthy)
    • Configure health for StatefulSets (readiness)
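
One way to implement the ExternalSecret health check above is a custom health check written in Lua in the argocd-cm ConfigMap. A sketch, assuming ESO reports a Ready condition under status.conditions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Key format: resource.customizations.health.<group>_<kind>
  resource.customizations.health.external-secrets.io_ExternalSecret: |
    hs = {}
    hs.status = "Progressing"
    hs.message = "Waiting for ExternalSecret to sync"
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, condition in ipairs(obj.status.conditions) do
        if condition.type == "Ready" and condition.status == "True" then
          hs.status = "Healthy"
          hs.message = condition.message
        end
      end
    end
    return hs
```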

Example PostSync Validation Hook:

apiVersion: batch/v1
kind: Job
metadata:
  name: validate-deployment
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      containers:
      - name: validate
        image: bitnami/kubectl:latest
        command:
        - /bin/bash
        - -c
        - |
          # Validate ExternalSecrets are ready
          kubectl wait --for=condition=Ready \
            externalsecret/rabbit-mq -n {{ .Values.namespace }} \
            --timeout=60s

          # Validate pods are running
          kubectl wait --for=condition=Ready \
            pod -l app={{ .Values.appName }} \
            -n {{ .Values.namespace }} \
            --timeout=120s
      restartPolicy: Never

Sync Wave Strategy (deployment order):

# Wave -1: Prerequisites
- ClusterSecretStore (wave: -1)
- Namespaces (wave: -1)

# Wave 0: Infrastructure
- ingress-nginx (wave: 0)
- cert-manager (wave: 0)

# Wave 1: Platform Services
- external-secrets-operator (wave: 1)
- rabbitmq (wave: 1)

# Wave 2: Secrets
- extra-secrets (wave: 2)
- rabbitmq-secrets (wave: 2)

# Wave 3: Application Services
- syrf-api (wave: 3)
- syrf-pm (wave: 3)
- syrf-quartz (wave: 3)
- syrf-web (wave: 3)
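
Each wave above corresponds to an argocd.argoproj.io/sync-wave annotation (lower values sync first; negative waves run before the default wave 0). For example, pinning the syrf-api Application to wave 3:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: syrf-api
  annotations:
    argocd.argoproj.io/sync-wave: "3"  # syncs after waves -1 through 2
```

The same annotation works on any resource ArgoCD manages, so it applies equally to Applications in an app-of-apps layout and to raw manifests inside a single Application.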

Testing and Validation (1 hour):

  1. Test auto-sync (20 min)
    • Make a Git change to an auto-sync app
    • Verify ArgoCD detects the change and syncs automatically
    • Verify the sync completes within 3 minutes
  2. Test self-heal (20 min)
    • Make a manual change to the cluster (kubectl edit)
    • Verify ArgoCD detects the drift
    • Verify auto-remediation within 5 minutes
  3. Test prune (20 min)
    • Delete a resource from Git
    • Verify ArgoCD removes it from the cluster
    • Verify the prune happens in the correct order

Documentation (30 min):

  • Document sync policy for each Application
  • Create decision matrix (when to use each policy)
  • Document sync wave strategy
  • Create troubleshooting guide for sync failures
  • Update cluster-gitops README with sync policies

Dependencies:

  • Story 4.8 (SecretStore fix) - needed for testing ExternalSecret health
  • Story 4.12 (Sync out-of-sync apps) - clean slate before configuring policies

Technical Notes:

  • Prune safety: Use PruneLast=true to ensure new resources deploy before old ones delete
  • Self-heal interval: ArgoCD checks every 3 minutes by default
  • Sync waves: Lower numbers deploy first, use negative for prerequisites
  • Health checks: Custom Lua scripts for complex resources
  • Retry logic: Exponential backoff prevents sync storms
  • PostSync hooks: Run AFTER sync succeeds, useful for validation
  • PreSync hooks: Run BEFORE sync, useful for migrations

Estimated Effort: 5 story points (1 day)


Work Item 5: Production Migration ⏳ PLANNED

GitHub Issue: #2155 Goal: Migrate production traffic from Jenkins X to new cluster

Total Story Points: 34 Status: ⏳ Planned (0%)


Child Work Item 5.1: Production Deployment Validation 📋

GitHub Issue: #2156 Status: 📋 Ready (after Work Item 4) Priority: P0 (Critical) Story Points: 8 Sprint: TBD

As a platform engineer I want all services deployed to production environment So that production environment is ready for traffic

Acceptance Criteria:

  • Deploy all 4 services to production namespace
  • Use current production versions (from Jenkins X cluster)
  • Verify all pods are running and healthy
  • Verify ingress routes are configured
  • Verify RabbitMQ connectivity
  • Verify database connectivity
  • Run smoke tests for each service
  • Verify monitoring and logging
  • Document production configuration

Dependencies:

  • Story 4.6 (End-to-End GitOps Flow Validation)
  • All Work Item 4 child work items complete

Technical Notes:

  • Use versions from PLANNING.md (Jenkins X baseline)
  • Do NOT switch traffic yet (parallel running)
  • Validate in isolation first

Estimated Effort: 8 story points (2 days)


Child Work Item 5.2: Traffic Cutover Planning 📋

GitHub Issue: #2157 Status: 📋 Ready (after Work Item 4) Priority: P0 (Critical) Story Points: 5 Sprint: TBD

As a platform engineer I want a detailed cutover plan with rollback procedures So that production migration is safe and reversible

Acceptance Criteria:

  • Document cutover strategy:
    • Blue-green deployment
    • Canary rollout
    • DNS switch
  • Define success criteria
  • Create rollback plan
  • Schedule maintenance window
  • Prepare communication plan
  • Create monitoring checklist
  • Define rollback triggers
  • Test rollback procedure in staging

Dependencies:

  • Story 5.1 (Production Deployment Validation)

Technical Notes:

  • Recommended: DNS-based cutover (fastest rollback)
  • Alternative: Load balancer reconfiguration
  • Monitor for 24-48 hours before Jenkins X decommission

Estimated Effort: 5 story points (1 day)


Child Work Item 5.3: Production Cutover Execution 📋

GitHub Issue: #2158 Status: 📋 Ready (after Work Item 4) Priority: P0 (Critical) Story Points: 13 Sprint: TBD

As a platform engineer I want to switch production traffic to the new cluster So that SyRF runs on the GitOps architecture

Acceptance Criteria:

  • Announce maintenance window
  • Verify new cluster is ready
  • Switch traffic (DNS or load balancer)
  • Monitor application health
  • Monitor error rates
  • Monitor latency metrics
  • Verify all services responding
  • Run production smoke tests
  • Monitor for 2 hours
  • Confirm no critical errors
  • Document cutover results
  • Update status page

Dependencies:

  • Story 5.2 (Traffic Cutover Planning)

Technical Notes:

  • This is the GO-LIVE event
  • Rollback plan must be ready
  • Team on standby during cutover

Estimated Effort: 13 story points (1 week - includes monitoring period)


Child Work Item 5.4: Jenkins X Cluster Decommission 📋

GitHub Issue: #2159 Status: 📋 Ready (after Child Work Item 5.3) Priority: P2 (Medium) Story Points: 8 Sprint: TBD

As a platform engineer I want to decommission the Jenkins X cluster So that we don't pay for unused infrastructure

Acceptance Criteria:

  • Monitor new cluster for 1 week post-cutover
  • Confirm no critical issues
  • Export Jenkins X configuration (backup)
  • Document lessons learned
  • Archive Jenkins X logs and metrics
  • Delete Jenkins X Applications
  • Delete Jenkins X cluster
  • Remove DNS entries
  • Update documentation
  • Celebrate migration success! 🎉

Dependencies:

  • Story 5.3 (Production Cutover Execution)
  • 1 week monitoring period

Technical Notes:

  • Do NOT delete until 100% confident in new cluster
  • Keep backups of Jenkins X configs
  • Final step of the migration

Estimated Effort: 8 story points (2 days)


Sprint Planning Recommendations

Sprint 2 (Current - 2 weeks)

Goal: Complete CI/CD automation and cluster-gitops setup

Child Work Items to Include (42 story points - ambitious):

  • ✅ Child Work Item 2.1: Auto-Version Workflow Cleanup (5 pts)
  • ✅ Child Work Item 2.2: Docker Image Build Integration (13 pts)
  • ✅ Child Work Item 3.2: ArgoCD Application Manifests (8 pts)
  • ✅ Child Work Item 3.3: Environment Values Configuration (5 pts)
  • ✅ Child Work Item 3.5: Infrastructure Dependencies Analysis (5 pts)

Stretch Goals:

  • Child Work Item 2.4: Promotion PR Automation (8 pts)

Deliverables:

  • Docker images building and pushing to GHCR
  • ArgoCD manifests ready to apply
  • Environment values configured
  • Infrastructure requirements documented

Sprint 3 (In Progress - 2 weeks)

Goal: Complete remaining CI/CD and GitOps infrastructure

Child Work Items to Include (24 story points):

  • ✅ Child Work Item 2.4: Promotion PR Automation (8 pts) - Complete
  • ✅ Child Work Item 3.4: ApplicationSet for PR Previews (8 pts) - Complete
  • ✅ Child Work Item 2.3: Build Optimization (8 pts) - Complete (2025-11-19)

Deliverables:

  • ✅ Full CI/CD automation working
  • ✅ PR preview environments configured
  • ✅ Build optimization implemented

Sprint 4+ (Blocked on K8s cluster)

Goal: Deploy to new cluster and validate

Child Work Items:

  • All Work Item 4 child work items (29 points)
  • Requires: Kubernetes cluster provisioned

Sprint 5+ (Blocked on Sprint 4)

Goal: Production migration

Child Work Items:

  • All Work Item 5 child work items (34 points)

Risk Register

High-Risk Items

| Risk | Probability | Impact | Mitigation | Status |
|------|-------------|--------|------------|--------|
| K8s cluster not available | High | Critical | Work on items that don't require cluster (Work Items 2, 3) | ⚠️ Active |
| Docker build failures in monorepo | Medium | High | Test builds locally first, review Dockerfiles | 📋 Planned |
| RabbitMQ connectivity issues | Low | Critical | Test in staging first, document config | 📋 Planned |
| Production cutover problems | Medium | Critical | Detailed rollback plan, gradual cutover | 📋 Planned |
| Data migration issues | Low | Critical | Verify data access before cutover | 📋 Planned |

Medium-Risk Items

| Risk | Probability | Impact | Mitigation | Status |
|------|-------------|--------|------------|--------|
| DNS propagation delays | Medium | Medium | Plan for TTL, use low TTL before cutover | 📋 Planned |
| Secret management migration | Medium | Medium | Test ESO in staging, document secrets | 📋 Planned |
| Performance degradation | Low | Medium | Load testing, monitoring, gradual rollout | 📋 Planned |

Metrics & KPIs

Development Velocity

  • Sprint Capacity: ~40 story points per 2-week sprint (1 developer)
  • Completed: 174 story points
  • Remaining: 36 story points
  • Estimated Sprints: 1-2 sprints (2-3 weeks)

GitOps Success Criteria

Once Work Item 4 is complete, measure:

| Metric | Target | Current | Status |
|--------|--------|---------|--------|
| Commit → Staging Deploy Time | < 10 min p50 | N/A | ⏳ Not measured |
| Preview Env Creation Time | < 2 min | N/A | ⏳ Not measured |
| Deployment via Git PRs | 100% | N/A | ⏳ Not measured |
| Untracked Drift | 0 instances | N/A | ⏳ Not measured |
| Rollback Time | < 5 min | N/A | ⏳ Not measured |

Dependencies Graph

Work Item 1 (✅ Complete)
  └── Work Item 2 (🔄 In Progress)
        ├── Child Work Item 2.1 (Auto-Version Cleanup)
        ├── Child Work Item 2.2 (Docker Builds) → depends on 2.1
        ├── Child Work Item 2.3 (Build Optimization) → depends on 2.2
        └── Child Work Item 2.4 (Promotion PRs) → depends on 2.2, 3.1

Work Item 1 (✅ Complete)
  └── Work Item 3 (🔄 In Progress)
        ├── Child Work Item 3.1 (cluster-gitops) ✅ Complete
        ├── Child Work Item 3.2 (ArgoCD Apps) → depends on 3.1
        ├── Child Work Item 3.3 (Env Values) → depends on 3.1
        ├── Child Work Item 3.4 (ApplicationSets) → depends on 3.2
        └── Child Work Item 3.5 (Infrastructure) → depends on 3.1

Work Item 4 (⏳ Blocked) → BLOCKER: K8s Cluster
  ├── Child Work Item 4.1 (K8s Cluster) ⚠️ PRIMARY BLOCKER
  ├── Child Work Item 4.2 (ArgoCD Install) → depends on 4.1
  ├── Child Work Item 4.3 (Platform Add-ons) → depends on 4.2, 3.5
  ├── Child Work Item 4.4 (Bootstrap) → depends on 4.2, 3.2
  ├── Child Work Item 4.5 (First Service) → depends on 4.3, 4.4
  └── Child Work Item 4.6 (E2E Validation) → depends on 4.5, 2.2, 2.4, 3.4

Work Item 5 (⏳ Planned) → depends on Work Item 4
  ├── Child Work Item 5.1 (Prod Validation) → depends on 4.6
  ├── Child Work Item 5.2 (Cutover Plan) → depends on 5.1
  ├── Child Work Item 5.3 (Cutover) → depends on 5.2
  └── Child Work Item 5.4 (Decommission) → depends on 5.3 + 1 week

Changelog

2025-11-19 (Build Optimization - Complete)

  • Child Work Item 2.3: Build Optimization ✅ Complete
  • Implemented crane-based image retagging for chart-only changes
  • Added list-files: shell to dorny/paths-filter for file analysis
  • Chart-only detection checks if ALL changed files match .chart/ pattern
  • Initial approach using negation patterns didn't work (patterns are OR'd)
  • Fixed by analyzing actual file paths in combined step
  • Successfully tested: API chart-only change retagged 9.4.3 → 9.4.4
  • Time savings: 12 seconds vs 4+ minutes for full build
  • Sprint 3 now fully complete - All 24 story points delivered

2025-11-18 (Plugins ApplicationSet Fixes - Complete)

  • Plugins Project ArgoCD Applications Fixed: All plugins apps now Synced/Healthy
  • Sync Policy Configuration (Child Work Item 4.14 - Partial):
    • Enabled selfHeal: true for drift prevention (Git is source of truth)
    • Added ServerSideApply=true for large CRDs (ESO)
    • Added ApplyOutOfSyncOnly=true for efficient syncing
    • Added ignoreDifferences for ESO controller default fields (conversionStrategy, decodingStrategy, metadataPolicy)
    • Removed blocking SyncWindows from plugins and argocd projects
  • Orphaned Resources Cleanup (Child Work Item 4.13 - Partial):
    • Deleted orphaned rabbitmq secret from syrf-staging
    • Removed redundant rabbitmq-secrets plugin (ClusterExternalSecret handles this)
    • Configured RabbitMQ existingErlangSecret to prevent drift
  • Directory Structure Fix (Child Work Item 4.10 - Complete):
    • Created missing resources directories with .gitkeep for external-dns, ingress-nginx
  • ClusterIssuer Resolution:
    • Deleted ClusterIssuers for GitOps regeneration with correct tracking ID
  • ESO CRD Fix:
    • Removed kubectl.kubernetes.io/last-applied-configuration annotations
    • Deleted and recreated CRDs with ServerSideApply
  • Commits:
    • 1c9ebe5: fix(plugins): resolve ArgoCD application issues
    • 8b6678a: fix(plugins): enable selfHeal and remove redundant rabbitmq-secrets
    • e97f189: fix(projects): remove blocking SyncWindows from argocd and plugins
    • 422194c: fix(plugins): add ApplyOutOfSyncOnly to help with large CRD sync
    • af29deb: fix(plugins): ignore ESO controller default fields in ExternalSecrets
  • Final Status: All 7 plugins apps showing Synced/Healthy:
    • cert-manager, external-dns, external-secrets-operator
    • extra-secrets-production, extra-secrets-staging, ingress-nginx, rabbitmq

2025-11-18 (TLS Certificate and Ingress Configuration Fixes)

  • Child Work Item 4.11 Complete: Fixed TLS certificates for all staging and production services
  • Root Causes Identified:
    • Staging ingresses using production URLs (e.g., help.syrf.org.uk instead of help.staging.syrf.org.uk)
    • DNS mismatch causing ACME HTTP-01 challenges to fail with 404
    • Staging using letsencrypt-staging issuer instead of letsencrypt-prod
    • Helm chart defaults with hardcoded ingress values not fully overridden
  • Fixes Applied:
    • Updated all Helm chart defaults to ingress: {} (api, pm, quartz, web, user-guide, docs)
    • Created staging environment values with correct staging URLs
    • Changed staging shared-values to use letsencrypt-prod issuer
    • All certificates now issued by Let's Encrypt production
  • Current State:
    • All staging certificates: Ready ✅
    • All production certificates: Ready ✅
    • Correct staging URLs configured for all services
  • Progress Update: Work Item 4 now 8/14 complete (57%), overall 27/37 (73%)

2025-11-17 (Cluster Health Assessment - 6 New Issues Discovered)

  • Cluster Health & Remediation Issues: Added 6 new child work items to Work Item 4 after comprehensive cluster health check
  • Critical Issues:
    • Child Work Item 4.8: Fix SecretStore Configuration (8 pts, P0) - 44 ExternalSecrets failing, all reference non-existent SecretStores
    • Child Work Item 4.9: Fix Staging Image Tags (5 pts, P0) - 5 staging pods with InvalidImageName/ImagePullBackOff
  • High Priority:
    • Child Work Item 4.10: Fix Extra-Secrets Directory Structure (2 pts, P1) - Missing resources/ directories blocking sync
  • Medium Priority:
    • Child Work Item 4.11: Fix User Guide TLS Certificate (3 pts, P2) - ✅ Complete
    • Child Work Item 4.12: Sync Out-of-Sync Applications (2 pts, P2) - 13 apps OutOfSync or Unknown status
  • Low Priority:
    • Child Work Item 4.13: Clean Up Orphaned Resources (1 pt, P3) - 10 orphaned resources across 2 apps
  • Impact on Progress:
    • Total story points increased from 181 to 202 (+21 points)
    • Work Item 4: Now 4/13 complete (31%) instead of 4/7 (57%)
    • Overall progress: 23/36 child work items (64%) vs 23/30 (77%)
    • Completion estimate updated: 2-3 weeks vs 1-2 weeks
    • Identified 2 critical blockers for staging environment and all services

2025-11-14 (Helm Chart Standardization)

  • Helm Chart Standardization - Jenkins X Pattern Removal (Child Work Item 4.7, 3 pts):
    • Removed all 52 Jenkins X legacy patterns from service Helm charts
    • Updated all 4 services (api, project-management, quartz, web) × 4 files each = 16 files total
    • Replaced jx.imagePullSecrets with standard K8s top-level imagePullSecrets array
    • Replaced jxRequirements.ingress.* with clean ingress.* namespace
    • Removed draft label patterns from all services
    • Root cause: syrf-web ImagePullBackOff due to mismatch between global values and chart expectations
    • All charts validated successfully with helm template
    • Documentation: Created ADR-006-helm-chart-standardization.md
    • Web service had 30 references in ingress.yaml alone (complex host name construction)
  • Work Item 4 Progress: 4/7 child work items complete (57%)
  • Overall Progress: 23/30 child work items (77%), 163/181 story points (90%)

2025-11-13 (Updated - Repository Migration)

  • Repository Migration Completed:
    • Migrated monorepo from camaradesuk/syrf-test to camaradesuk/syrf
    • Backup created at camaradesuk/syrf-web-legacy
    • Force push + rename strategy preserved all GitHub metadata (470+ issues, 47 PRs, discussions)
    • ZenHub workspace continues functioning (same internal repo ID)
    • All branches coexist (3 monorepo + 93 syrf-web = no conflicts)
    • All tags coexist (prefixed vs unprefixed = no conflicts)
    • GitHub automatic redirects: syrf-web URLs → syrf URLs
    • Documentation created: ADR-005 and migration guide
    • NEW Child Work Item 1.8: Repository Migration to Production Name (5 pts) ✅ Complete
    • Updated backlog: 22/29 child work items (76%), 160/178 story points (90%)
    • Updated all issue URLs from syrf-web to syrf
  • External-DNS CrashLoopBackOff Issue RESOLVED:
    • Problem: External-DNS pod crashing with Precondition not met error
    • Root Cause: Trying to delete DNS records from legacy Jenkins X cluster with a different owner ID
    • Solution: Changed policy from sync to upsert-only in infrastructure/external-dns/values.yaml
    • Status: External-DNS now running successfully, creating/updating records without deletion attempts
    • Legacy DNS Records: Orphaned TXT records from legacy cluster preserved until migration complete
    • Documentation: Created cluster-gitops/docs/troubleshooting/external-dns-crashes.md
    • Commits: 6c3de9d (fix), 8ee375d (docs)
  • NEW FEATURES COMPLETED - Child Work Items 2.5 and 2.6:

  • Production Promotion Automation (Child Work Item 2.5, 8 pts):

    • Automated PR creation for production promotion after successful staging deployment
    • PR requires manual review and merge (no GitHub Environment needed)
    • Workflow completes with green checkmark after PR creation
    • PR labeled requires-review with review checklist
    • Implementation: promote-to-production job in ci-cd.yml
    • Commits: 3d4edccd (initial), 42a46855 (simplified for free tier)
  • Deployment Success Notifications (Child Work Item 2.6, 8 pts):

    • ArgoCD PostSync hooks create GitHub commit statuses after successful deployments
    • Kubernetes Job authenticates with GitHub App
    • Status context: argocd/deploy-{environment}
    • Configuration consolidated in shared-values.yaml (DRY principle)
    • Services enable with single flag: deploymentNotification.enabled: true
    • Staging: commit statuses only
    • Production: commit statuses + GitHub Releases
    • Commits: 3d4edccd, 118648da, 74bee73 (DRY config), 034158d0 (docs)
  • Documentation:

    • Created: docs/how-to/production-promotion-and-notifications.md
    • Updated: CLAUDE.md with CI/CD workflow changes
    • Updated: cluster-gitops shared-values.yaml (both environments)
  • Work Item 2 Status: Now 100% complete (6/6 child work items)

  • Total Story Points: Increased from 157 to 173 (+16 points)
  • Overall Progress: 21/28 child work items (75%), 155/173 story points (90%)

2025-11-07

  • Reorganized hierarchy for ZenHub alignment
  • Changed from 5 Epics to 1 Epic containing 5 Work Items
  • Changed 26 User Stories to 26 Child Work Items
  • Updated all terminology throughout document (Executive Summary, Progress table, Sprint Planning, Dependencies Graph)
  • All content and metadata preserved
  • Created GitHub issues in syrf-web repository:
    • Epic #2128: SyRF GitOps Migration (Short-Term Goals pipeline)
    • Work Items #2129-#2155 (5 work items, Sprint Backlog pipeline)
    • Child Work Items #2130-#2159 (26 child work items, Sprint Backlog pipeline)
  • All issues have proper hierarchy, estimates, and pipeline placement

2025-11-03

  • Initial backlog created
  • Analyzed all planning documents
  • Created 26 stories across 5 epics
  • Identified K8s cluster as primary blocker
  • Defined acceptance criteria and story points
  • Organized into sprint recommendations

References

  • PROJECT-STATUS.md - Current implementation status
  • IMPLEMENTATION-PLAN.md - Phase-by-phase plan
  • CLUSTER ARCHITECTURE GOALS.md - Target architecture
  • DEPENDENCY-MAP.yaml - Service/library dependencies
  • CI-CD-DECISIONS.md - Strategic CI/CD decisions
  • cluster-gitops/PLANNING.md - Migration strategy and Jenkins X baseline

Next Update: After Sprint 2 completion or when K8s cluster becomes available