
SyRF GitOps Migration - Product Backlog

Last Updated: 2025-12-01 Project: SyRF Monorepo + GitOps Migration Sprint Planning: ZenHub/Scrum Board Format

Executive Summary

This backlog tracks the migration from Jenkins X to a GitOps-based deployment architecture using GitHub Actions and ArgoCD. The project is organized as one Epic containing Work Items, with each Work Item containing multiple Child Work Items.

Overall Progress

| Work Item | Status | Child Work Items Complete | Total Child Work Items | Progress |
|---|---|---|---|---|
| Work Item 1: Monorepo Foundation | ✅ Complete | 8/8 | 8 | 100% |
| Work Item 2: CI/CD Automation | ✅ Complete | 7/7 | 7 | 100% |
| Work Item 3: GitOps Infrastructure | ✅ Complete | 5/5 | 5 | 100% |
| Work Item 4: ArgoCD Deployment | 🔄 In Progress | 9/14 | 14 | 64% |
| Work Item 5: Production Migration | ⏳ Planned | 0/4 | 4 | 0% |
| TOTAL | | 29/38 | 38 | 76% |

Burn-down Estimate

  • Total Story Points: 210 (updated: +3 for dynamic matrix 2025-11-20)
  • Completed: 174 points (83%)
  • Remaining: 36 points (17%)
  • Estimated Time to Complete: 2-3 weeks

Legend

Status Icons:

  • ✅ Complete
  • 🔄 In Progress
  • ⏳ Planned / Blocked / Waiting
  • 📋 Ready
  • 🔮 Future/Backlog

Story Point Scale:

  • 1 point = 1-2 hours
  • 2 points = 2-4 hours
  • 3 points = 4-8 hours (half day)
  • 5 points = 1 full day
  • 8 points = 2 days
  • 13 points = 1 week
  • 21 points = 2 weeks

Epic: SyRF GitOps Migration

GitHub Issue: #2128 Goal: Migrate from Jenkins X to a GitOps-based deployment architecture using GitHub Actions, ArgoCD, and Kubernetes

Total Work Items: 5 Total Child Work Items: 38 Total Story Points: 210 (updated: +3 for dynamic matrix 2025-11-20) Completed: 174 points (83%) Remaining: 36 points (17%) Overall Status: 🔄 In Progress

Work Items Overview:

  1. Work Item 1: Monorepo Foundation (58 pts) - ✅ Complete (8/8 child work items)
  2. Work Item 2: CI/CD Automation (57 pts) - ✅ Complete (7/7 child work items)
  3. Work Item 3: GitOps Infrastructure (34 pts) - ✅ Complete (5/5 child work items, 100%)
  4. Work Item 4: ArgoCD Deployment (53 pts) - 🔄 In Progress (9/14 child work items, 64%)
  5. Work Item 5: Production Migration (34 pts) - ⏳ Planned (0/4 child work items)

Work Item 1: Monorepo Foundation ✅ COMPLETE

GitHub Issue: #2129 Goal: Establish monorepo structure with automated semantic versioning

Total Story Points: 58 Status: ✅ Complete (100%)


Child Work Item 1.1: Monorepo Structure Setup ✅

GitHub Issue: #2130 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 0 (Completed)

As a developer I want all services and libraries consolidated into a single monorepo So that I can make atomic changes across service boundaries and simplify dependency management

Acceptance Criteria:

  • All 4 services moved to src/services/ (api, project-management, quartz, web)
  • All shared libraries moved to src/libs/
  • Helm charts organized in src/services/{service}/charts/
  • Root solution file syrf.sln created with proper folder structure
  • Solution filters (.slnf) created for each service
  • Git history preserved from original repositories
  • All projects build successfully with dotnet build
  • Directory.Build.props centralized at repository root

Dependencies: None

Technical Notes:

  • Completed via migration scripts
  • Repository: camaradesuk/syrf-monorepo (production ready)
  • Test repository: camaradesuk/syrf-test

Child Work Item 1.2: GitVersion Configuration ✅

GitHub Issue: #2131 Status: ✅ Complete Priority: P0 (Critical) Story Points: 5 Sprint: Sprint 1 (Completed)

As a developer I want automated semantic versioning based on conventional commits So that versions are calculated automatically without manual intervention

Acceptance Criteria:

  • GitVersion.yml created for all 5 services (api, pm, quartz, web, s3-notifier)
  • All services use mode: ContinuousDeployment
  • Conventional commit patterns configured (feat:, fix:, chore:)
  • Service-specific tag prefixes defined (api-v, pm-v, quartz-v, web-v, s3-notifier-v)
  • Path filtering working (services version independently)
  • GitVersion.yml removed from shared libraries
  • Test commit successfully calculates version

Dependencies: Story 1.1 (Monorepo Structure)

Technical Notes:

  • Decision documented in: GITVERSION-MODE-DECISION.md
  • Used ContinuousDeployment mode instead of Mainline
  • All services at 0.1.0 baseline
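A per-service config consistent with the criteria above might look like this sketch (the bump-message regexes are illustrative assumptions; the authoritative patterns live in each service's GitVersion.yml):

```yaml
# Sketch: src/services/api/GitVersion.yml
mode: ContinuousDeployment
tag-prefix: 'api-v'                          # service-specific tag prefix
major-version-bump-message: '^(feat|fix|chore)(\(.+\))?!:'  # breaking change
minor-version-bump-message: '^feat(\(.+\))?:'               # feat: -> minor
patch-version-bump-message: '^fix(\(.+\))?:'                # fix:  -> patch
```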

Child Work Item 1.3: Chart Version Stabilization ✅

GitHub Issue: #2132 Status: ✅ Complete Priority: P0 (Critical) Story Points: 2 Sprint: Sprint 1 (Completed)

As a platform engineer I want Helm Chart versions to remain stable at 0.0.0 So that deployment versions are controlled via git refs and image tags, not chart versions

Acceptance Criteria:

  • All Chart.yaml files set to version: 0.0.0
  • Comment added: "Stable version; deployments via git ref + image tag"
  • Policy documented in CLUSTER ARCHITECTURE GOALS.md
  • CI/CD workflows do NOT update Chart.yaml versions
  • Charts still valid for Helm deployment

Dependencies: None

Technical Notes:

  • Aligns with GitOps best practices
  • Commit: 941e2a1b (2025-11-03)
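For reference, the stabilized chart metadata might look like this minimal sketch (chart name is illustrative):

```yaml
# Sketch: a service Chart.yaml pinned per the policy above
apiVersion: v2
name: api
version: 0.0.0  # Stable version; deployments via git ref + image tag
```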

Child Work Item 1.4: Dependency Mapping ✅

GitHub Issue: #2133 Status: ✅ Complete Priority: P1 (High) Story Points: 8 Sprint: Sprint 1 (Completed)

As a developer I want a clear dependency map of all services and libraries So that I can understand impact of changes and optimize builds

Acceptance Criteria:

  • DEPENDENCY-MAP.yaml created as single source of truth
  • Complete dependency trees documented for all services
  • Docker build context requirements specified
  • CI/CD workflow trigger paths defined
  • Impact analysis for library changes documented
  • Zero circular dependencies verified
  • Validation script created (validate-dependencies.sh)

Dependencies: Story 1.1 (Monorepo Structure)

Technical Notes:

  • File: architecture/dependency-map.yaml
  • SharedKernel is most critical (affects 3 services)
  • Web service has no .NET dependencies

Child Work Item 1.5: CI/CD Path Filtering Optimization ✅

GitHub Issue: #2134 Status: ✅ Complete Priority: P1 (High) Story Points: 5 Sprint: Sprint 1 (Completed)

As a developer I want CI/CD workflows to build only changed services So that builds are fast and resource-efficient

Acceptance Criteria:

  • Path filters use precise library paths (not broad src/libs/**)
  • API triggers on 6 specific library paths
  • PM triggers on 7 specific library paths
  • Quartz triggers on 2 library paths (minimal dependencies)
  • Web has no library dependencies
  • Test: Change to SharedKernel triggers API, PM, Quartz (not Web)
  • Test: Change to Web triggers only Web service

Dependencies: Story 1.4 (Dependency Mapping)

Technical Notes:

  • Uses dorny/paths-filter@v3 action
  • Prevents unnecessary builds when unrelated libraries change
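A sketch of how the precise filters might be wired with dorny/paths-filter@v3 (library lists abridged and illustrative; the real trigger paths come from DEPENDENCY-MAP.yaml):

```yaml
# Sketch: precise per-service path filters
- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      api:
        - 'src/services/api/**'
        - 'src/libs/SharedKernel/**'   # one of API's 6 library paths
      web:
        - 'src/services/web/**'        # Web has no library dependencies
```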

Child Work Item 1.6: Documentation Consolidation ✅

GitHub Issue: #2135 Status: ✅ Complete Priority: P2 (Medium) Story Points: 8 Sprint: Sprint 1 (Completed)

As a developer I want clear, non-redundant documentation So that I can understand the current state and make informed decisions

Acceptance Criteria:

  • CLAUDE.md updated with current architecture
  • PROJECT-STATUS.md reflects current implementation status
  • IMPLEMENTATION-PLAN.md aligned with actual progress
  • Obsolete documents deleted (preserved in git history)
  • Path references standardized (src/services/ not services/)
  • GitVersion mode contradiction resolved (all docs use ContinuousDeployment)
  • Documentation anti-patterns documented
  • README.md rewritten as navigation entry point

Dependencies: All previous stories

Technical Notes:

  • Deleted 3 obsolete analysis files
  • Adopted hybrid redundancy strategy
  • DEPENDENCY-MAP.yaml is now authoritative

Child Work Item 1.7: Build Configuration Optimization ✅

GitHub Issue: #2136 Status: ✅ Complete Priority: P2 (Medium) Story Points: 3 Sprint: Sprint 1 (Completed)

As a developer I want optimized Docker build contexts and .dockerignore So that builds are faster and use less disk space

Acceptance Criteria:

  • .dockerignore excludes planning/, .github/, docs
  • .dockerignore organized by category with comments
  • Estimated 20-30% reduction in build context size
  • Directory.Build.props enhanced with:
      • Common build settings
      • Code quality settings
      • NuGet package metadata
      • Deterministic builds for CI/CD
  • Redundant Directory.Build.props files removed from service subdirectories

Dependencies: Story 1.1 (Monorepo Structure)

Technical Notes:

  • Root Directory.Build.props is single source of MSBuild configuration
  • .dockerignore tested and validated
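An abridged sketch of the categorized .dockerignore (only planning/, .github/, and docs are confirmed above; the build-output entries are assumptions):

```text
# --- Repo tooling (never needed in images) ---
planning/
.github/
docs/
# --- Build output (assumed category) ---
**/bin/
**/obj/
```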

Child Work Item 1.8: Repository Migration to Production Name ✅

GitHub Issue: #2163 Status: ✅ Complete Priority: P1 (High) Story Points: 5 Sprint: Sprint 1 (Completed 2025-11-13)

As a developer I want the monorepo migrated to the production repository name So that all GitHub metadata is preserved in one consolidated location

Acceptance Criteria:

  • Create backup of syrf-web in syrf-web-legacy repository
  • Force push monorepo branches and tags from syrf-test to syrf-web
  • Rename syrf-web to syrf via GitHub settings
  • Update all documentation references from syrf-monorepo to syrf
  • Update all references from syrf-test to syrf
  • Update git remote URLs locally
  • Verify all issues (470+) preserved with original IDs
  • Verify all PRs (47) preserved and accessible
  • Verify ZenHub workspace continues functioning
  • Verify branches coexist (no conflicts)
  • Verify tags coexist (no conflicts)
  • Verify old URLs redirect to new repository
  • Create comprehensive migration documentation (ADR-005, migration guide)

Dependencies:

  • Story 1.1 (Monorepo Structure Setup)
  • Story 1.6 (Documentation Consolidation)

Technical Notes:

  • Strategy: Force push + rename to preserve GitHub metadata
  • Backup created: camaradesuk/syrf-web-legacy
  • GitHub automatic redirects: syrf-web URLs → syrf URLs
  • Git history preserved: syrf-web main is part of monorepo via git mv
  • Branches: 3 monorepo + 93 syrf-web = no conflicts
  • Tags: Prefixed (api-v*, pm-v*) vs unprefixed (v*) = no conflicts
  • ZenHub: Repository rename transparent (same internal repo ID)
  • Files created:
      • docs/decisions/ADR-005-repository-migration-strategy.md
      • docs/how-to/repository-migration-guide.md
  • Documentation updated: 28 files (syrf-monorepo → syrf, syrf-test → syrf)
  • Commits: 5db1d9e9 (migration docs), [user executed migration]

Estimated Effort: 5 story points (1 day)


Work Item 2: CI/CD Automation ✅ COMPLETE

GitHub Issue: #2137 Goal: Build and push Docker images with automated tagging and promotion

Total Story Points: 57 (updated: +2 for version continuity, +8 for production promotion, +8 for deployment notifications, +3 for dynamic matrix) Status: ✅ Complete (100%) - 7/7 child work items


Child Work Item 2.1: Auto-Version Workflow Cleanup ✅

GitHub Issue: #2138 Status: ✅ Complete Priority: P0 (Critical) Story Points: 7 (updated: +2 for version continuity) Sprint: Sprint 2 (Completed)

As a developer I want the auto-version workflow to create tags without polluting git history So that versioning is clean and doesn't create commit noise

Acceptance Criteria:

  • Remove VERSION file operations from workflow
  • Remove Chart.yaml update operations from workflow
  • Remove commit creation steps
  • Keep tag creation steps
  • Modify push step to only push tags (not commits)
  • Simplify workflow structure (remove file restoration logic)
  • Test: Workflow creates tags but NO commits
  • Test: GitVersion still calculates versions correctly
  • Ensure version continuity from polyrepos:
      • Create baseline tags for each service continuing from last polyrepo version:
          • API: Last polyrepo v8.20.1 → create baseline tag api-v8.20.1
          • PM: Last polyrepo v10.44.1 → create baseline tag pm-v10.44.1
          • Web: Last polyrepo v11.27.0 → create baseline tag web-v11.27.0
      • Update GitVersion.yml configs with next-version if needed
      • Test: Next versions increment correctly (api-v8.21.0, pm-v10.45.0, web-v11.28.0)
      • Document version mapping in ADR

Dependencies:

  • Story 1.2 (GitVersion Configuration)
  • Story 1.3 (Chart Version Stabilization)

Technical Notes:

  • Aligns with GitOps principle (no auto-commits to source repo)
  • Tags are lightweight references, not commits
  • File: .github/workflows/ci-cd.yml (formerly auto-version.yml - already merged)
  • Version Continuity Strategy:
      • Polyrepo tags (v8.20.1) migrated with git history
      • Create prefixed baseline tags at same commits (api-v8.20.1)
      • GitVersion recognizes prefixed tags via tag-prefix config
      • Next versions increment from baseline: feat → minor, fix → patch
      • Maintains semantic versioning continuity across migration
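The baseline-tag step can be sketched as follows. This demo runs in a throwaway repo; in the real monorepo the migrated polyrepo tag (v8.20.1) already exists on the imported history, and the new tag would then be pushed with `git push origin api-v8.20.1` (tags only, no commits):

```shell
set -eu
# Scratch repo standing in for the monorepo with migrated polyrepo history.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m "fix: last polyrepo commit"
git tag v8.20.1                        # unprefixed tag migrated from the polyrepo
commit="$(git rev-list -n 1 v8.20.1)"  # commit the old tag points at
git tag api-v8.20.1 "$commit"          # prefixed baseline at the same commit
git tag -l 'api-v*'                    # → api-v8.20.1
```

Because the prefixed tag sits on the same commit as the old one, GitVersion (with `tag-prefix: 'api-v'`) resumes counting from 8.20.1 rather than restarting at 0.1.0.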

Estimated Effort: 7 story points (updated from 5: +2 for version continuity)


Child Work Item 2.2: Docker Image Build Integration ✅

GitHub Issue: #2139 Status: ✅ Complete Priority: P0 (Critical) Story Points: 13 Sprint: Sprint 2 (Completed)

As a platform engineer I want Docker images built and pushed to GHCR automatically So that every version has an immutable container image

Acceptance Criteria:

  • Review and validate all Dockerfiles for monorepo structure
  • Add Docker build job to auto-version workflow (after version jobs)
  • Use matrix strategy for changed services
  • Build images with correct build context
  • Tag images with the following patterns:
      • {version} (e.g., 1.2.3)
      • {version}-sha.{shortsha} (e.g., 1.2.3-sha.abc123)
      • latest (updates with each push from main)
  • Push to GHCR using GITHUB_TOKEN
  • Test: Trigger workflow with code change
  • Test: Verify images exist in GHCR
  • Test: Both tags exist and point to same image

Dependencies:

  • Story 2.1 (Auto-Version Workflow Cleanup)
  • Story 1.4 (Dependency Mapping)

Technical Notes:

  • Registry: ghcr.io/camaradesuk/syrf-{service}
  • Auth: GITHUB_TOKEN (automatic, no PAT needed)
  • Build context must include entire monorepo (MSBuild requirement)
  • Reference DEPENDENCY-MAP.yaml for required paths
  • Implementation Details:
      • Created automated Dockerfile generation script (scripts/generate-dockerfiles.py)
      • Generates cache-optimized Dockerfiles with 5-layer structure
      • All Dockerfiles regenerated from dependency-map.yaml
      • Fixed PM and Quartz build contexts to use monorepo root
      • API and Web contexts already correct
      • Cache optimization: ~70% time savings for source code changes
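The cache-friendly layering the generator emits might look like this sketch (paths, base images, and assembly name are illustrative; the exact 5-layer split is defined by scripts/generate-dockerfiles.py):

```dockerfile
# Sketch: restore stays cached while source code churns
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
# Copy only project files first so `dotnet restore` hits the layer cache.
COPY src/services/api/*.csproj src/services/api/
COPY src/libs/SharedKernel/*.csproj src/libs/SharedKernel/
RUN dotnet restore src/services/api
# Then copy the rest of the monorepo build context and publish.
COPY . .
RUN dotnet publish src/services/api -c Release -o /app

FROM mcr.microsoft.com/dotnet/aspnet:8.0
WORKDIR /app
COPY --from=build /app .
ENTRYPOINT ["dotnet", "Api.dll"]
```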

Estimated Effort: 13 story points (1 week)


Child Work Item 2.3: Build Optimization - Conditional Rebuild ✅

GitHub Issue: #2140 Status: ✅ Complete Priority: P2 (Medium) Story Points: 8 Sprint: Sprint 3 (Completed 2025-11-19)

As a platform engineer I want to avoid rebuilding Docker images when only non-code files change So that CI/CD is faster and more resource-efficient

Acceptance Criteria:

  • Install crane CLI tool in workflow
  • Implement change detection logic:
      • Detect code vs non-code changes
      • Compare files changed since last git tag
      • Include shared libraries in detection
  • Implement conditional build/retag:
      • If no code changes and source image exists: retag using crane tag
      • If code changed or source missing: build from scratch
  • Add monitoring and summary to workflow output
  • Test: Chart-only change triggers retag (not rebuild)
  • Test: Code change triggers full rebuild
  • Test: Shared library change triggers full rebuild (logic verified)
  • Test: Missing source image intentionally errors rather than silently rebuilding (signals a configuration issue)
  • Measure time savings (target: 2-5 min per optimized build) - Achieved: 12s vs 4+ min

Dependencies:

  • Story 2.2 (Docker Image Build Integration)

Technical Notes:

  • Uses crane for manifest-based retagging (no download)
  • Transparent to GitOps (ArgoCD only cares that the tag exists)
  • Detailed spec in: CLUSTER ARCHITECTURE GOALS.md section 10a
  • Implementation Notes (2025-11-19):
      • Initial approach using dorny/paths-filter negation patterns didn't work (patterns are OR'd)
      • Fixed by using list-files: shell and analyzing actual file paths
      • Chart-only detection checks that ALL changed files match the chart/ path pattern
      • Successfully tested: API chart-only change retagged 9.4.3 → 9.4.4 in 12s
      • Missing source image intentionally errors rather than falling back to a build (signals a configuration issue)
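The chart-only decision described in the notes can be sketched in POSIX shell; the file list here is a stand-in for dorny/paths-filter's list-files output:

```shell
set -eu
# Rebuild unless ALL changed files live under a service's chart/ directory.
changed_files="src/services/api/chart/values.yaml
src/services/api/chart/Chart.yaml"
chart_only=true
for f in $changed_files; do
  case "$f" in
    */chart/*) ;;              # chart change: keep checking
    *) chart_only=false ;;     # any non-chart file forces a full rebuild
  esac
done
echo "chart_only=$chart_only"  # → chart_only=true
```

On the retag path the workflow can then run `crane tag ghcr.io/camaradesuk/syrf-api:9.4.3 9.4.4` (image name illustrative), which adds the new tag to the existing manifest without pulling any layers.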

Estimated Effort: 8 story points (2 days)


Child Work Item 2.4: Promotion PR Automation ✅

GitHub Issue: #2141 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 2 (Completed)

As a platform engineer I want automatic PRs to cluster-gitops after successful image push So that staging deployments are triggered declaratively

Acceptance Criteria:

  • Create GitHub PAT with repo scope for cluster-gitops access
  • Add PAT as secret GITOPS_PAT to app-monorepo repository
  • Add promotion PR job to auto-version workflow
  • Install yq tool for YAML manipulation
  • Update staging values files for changed services:
      • environments/staging/{service}.values.yaml
      • Set image.tag: {version}
  • Create PR with:
      • Title: "Promote {services} to {version} (staging)"
      • Body: Image details, source tag, changelog link
      • Auto-label: promotion, staging, auto-generated
  • Test: Code change creates promotion PR in cluster-gitops
  • Test: PR contains correct version information
  • Test: PR is properly formatted and reviewable

Dependencies:

  • Story 2.2 (Docker Image Build Integration)
  • Story 3.1 (cluster-gitops Repository Complete)

Technical Notes:

  • Uses yq for YAML updates (preserves formatting)
  • PR can be auto-merged or require approval (configurable)
  • File updated: .github/workflows/auto-version.yml
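The values bump itself is a one-line yq edit, e.g. `yq -i '.image.tag = "1.2.3"' environments/staging/api.values.yaml`. The sketch below uses sed as a dependency-free stand-in so it runs anywhere; the file contents and version are illustrative:

```shell
set -eu
# Stand-in for the workflow's yq step (the real job uses yq, which
# preserves YAML comments and formatting).
f="$(mktemp)"
printf 'image:\n  repository: ghcr.io/camaradesuk/syrf-api\n  tag: 1.2.2\n' > "$f"
sed -i.bak 's/^  tag: .*/  tag: 1.2.3/' "$f"
grep 'tag:' "$f"   # →   tag: 1.2.3
```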

Estimated Effort: 8 story points (2 days)


Child Work Item 2.5: Production Promotion Automation ✅

GitHub Issue: #2203 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 2 (Completed 2025-11-13)

As a platform engineer I want automated production promotion PRs after successful staging deployment So that production updates are tracked and require manual approval

Acceptance Criteria:

  • Add promote-to-production job to ci-cd.yml workflow
  • Job triggers automatically after successful staging promotion
  • Copies service versions from staging to production
  • Creates PR to cluster-gitops updating production service values
  • PR labeled requires-review with review checklist
  • PR does NOT auto-merge (requires manual administrator approval)
  • Workflow completes (shows green checkmark) after PR creation
  • Administrator can review and manually merge PR
  • After PR merge, ArgoCD syncs production automatically
  • Documentation created: docs/how-to/production-promotion-and-notifications.md

Dependencies:

  • Story 2.4 (Promotion PR Automation for staging)

Technical Notes:

  • Uses GitHub App authentication for PR creation
  • No GitHub Environment configuration needed (works on free tier)
  • Manual gate happens at PR merge step in cluster-gitops
  • Workflow shows success after PR creation, not after deployment
  • Commit: 3d4edccd (initial), 42a46855 (simplified)

Estimated Effort: 8 story points (2 days)


Child Work Item 2.6: Deployment Success Notifications ✅

GitHub Issue: #2204 Status: ✅ Complete Priority: P1 (High) Story Points: 8 Sprint: Sprint 2 (Completed 2025-11-13)

As a developer I want GitHub commit statuses when ArgoCD successfully deploys services So that I can see deployment status directly on commits and PRs

Acceptance Criteria:

  • Create PostSync hook template for all service charts
  • Job authenticates with GitHub App
  • Creates commit status on source repository
  • Status context: argocd/deploy-{environment}
  • Status description includes service name and version
  • Links to deployed service URL
  • Optional: Create GitHub Releases for production deployments
  • Configuration consolidated in environment shared-values.yaml (DRY principle)
  • Services enable with single flag: deploymentNotification.enabled: true
  • Common config inherited from shared values
  • Documentation updated with DRY configuration approach
  • PostSync jobs auto-cleanup after 5 minutes

Dependencies:

  • Story 4.2 (ArgoCD Installation) - for testing
  • Story 4.3 (Platform Add-ons) - for secrets

Technical Notes:

  • PostSync hook runs Kubernetes Job after successful sync
  • Uses curlimages/curl:8.10.1 container
  • JWT-based GitHub App authentication
  • Staging: commit statuses only (createReleaseNote: false)
  • Production: commit statuses + releases (createReleaseNote: true)
  • DRY: Common config in shared-values, services only set enabled flag
  • Files created:
      • src/services/*/chart/templates/postsync-notify.yaml (all services)
      • docs/how-to/production-promotion-and-notifications.md
      • environments/staging/shared-values.yaml (deploymentNotification section)
      • environments/production/shared-values.yaml (deploymentNotification section)
  • Commits: 3d4edccd, 118648da, 74bee73 (DRY config)
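The hook's shape might be sketched as below (names, env wiring, and API call abridged; the GitHub App JWT exchange is omitted, and the real template lives in each service's chart):

```yaml
# Sketch of the PostSync notification Job
apiVersion: batch/v1
kind: Job
metadata:
  name: notify-deploy
  annotations:
    argocd.argoproj.io/hook: PostSync            # runs after a successful sync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  ttlSecondsAfterFinished: 300                   # auto-cleanup after 5 minutes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: notify
          image: curlimages/curl:8.10.1
          command: ["sh", "-c"]
          args:
            - >
              curl -s -X POST
              -H "Authorization: Bearer $GITHUB_TOKEN"
              https://api.github.com/repos/camaradesuk/syrf/statuses/$GIT_SHA
              -d '{"state":"success","context":"argocd/deploy-staging"}'
```

Here $GITHUB_TOKEN and $GIT_SHA are assumed to be injected by the chart (token via the GitHub App flow, SHA from the deployed revision).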

Estimated Effort: 8 story points (2 days)


Child Work Item 2.7: Dynamic Matrix for Docker Builds ✅

GitHub Issue: #2202 Status: ✅ Complete Priority: P2 (Medium) Story Points: 3 Sprint: Sprint 3 (Completed 2025-11-20)

As a developer I want the CI/CD workflow to show unchanged services as "Skipped" rather than "Succeeded" So that I can clearly see which services were actually built in each workflow run

Acceptance Criteria:

  • Implement dynamic matrix generation in detect-changes job
  • Matrix only includes services that have actually changed
  • Each matrix entry contains full service metadata (name, image, dockerfile, context, flags)
  • Build-docker job uses dynamic matrix instead of static matrix
  • Remove service_changed skip logic from reusable workflow
  • Unchanged services show as "Skipped" in GitHub UI (correct behavior)
  • Changed services build normally with all metadata preserved
  • Web service artifact handling preserved
  • Docs service additional checkouts preserved
  • Workflow validates successfully

Dependencies:

  • Story 2.3 (Build Optimization - Conditional Rebuild)

Technical Notes:

  • Replaces static 6-service matrix with dynamic matrix
  • Uses jq for reliable JSON generation
  • Matrix entries include: name, image, dockerfile, context, changed_output, and service-specific flags
  • GitHub Actions can only show "Skipped" at job level, not step level
  • With static matrix, all jobs run and succeed early (confusing UI)
  • With dynamic matrix, jobs for unchanged services don't exist (clean UI)
  • Commits: c4da6e23, fc2f97ca, a52d0be2
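The matrix generation might be sketched like this (service metadata abridged and illustrative; the real entries also carry dockerfile, changed_output, and service-specific flags):

```shell
set -eu
# Build the dynamic matrix JSON from the list of changed services,
# as the detect-changes job might do with jq.
changed='["api","web"]'
matrix="$(jq -nc --argjson names "$changed" \
  '{include: [$names[] | {name: ., image: "ghcr.io/camaradesuk/syrf-\(.)", context: "."}]}')"
echo "$matrix"   # one matrix entry per changed service
```

The build-docker job then consumes it via `strategy.matrix: ${{ fromJson(needs.detect-changes.outputs.matrix) }}`, so jobs for unchanged services simply never exist and the GitHub UI shows them as skipped.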

Estimated Effort: 3 story points (half day)


Work Item 3: GitOps Infrastructure ✅ COMPLETE

GitHub Issue: #2142 Goal: Establish cluster-gitops repository with ArgoCD configuration

Total Story Points: 34 Status: ✅ Complete (5/5 child items complete, 34/34 pts = 100%)


Child Work Item 3.1: cluster-gitops Repository Complete ✅

GitHub Issue: #2143 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 1 (Completed)

As a platform engineer I want a complete cluster-gitops repository structure So that ArgoCD can declaratively manage cluster state

Acceptance Criteria:

  • Repository created: camaradesuk/cluster-gitops
  • Directory structure established:
      • bootstrap/ (App-of-Apps)
      • projects/ (AppProjects)
      • clusters/{staging,prod}/apps/
      • applicationsets/
      • envs/_global/
      • envs/syrf/{api,project-management,quartz,web}/
  • Initial values files created for all 4 services
  • README and SETUP-INSTRUCTIONS.md documented
  • Initial skeleton committed and pushed
  • PLANNING.md created with migration strategy

Dependencies: None

Technical Notes:

  • Repository: github.com/camaradesuk/cluster-gitops
  • Visibility: Private
  • Multi-source pattern ready for ArgoCD ≥2.6

Child Work Item 3.2: ArgoCD Application Manifests ✅

GitHub Issue: #2144 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 2 (Completed 2025-11-12)

As a platform engineer I want ArgoCD Application definitions for all services So that services can be deployed via GitOps

Acceptance Criteria:

  • Create AppProject definitions (6 projects: syrf-staging, syrf-production, preview, plugins, default, bootstrap)
  • Create Application manifests via ApplicationSets:
      • argocd/applicationsets/syrf.yaml - Matrix generator for all services
      • argocd/applicationsets/plugins.yaml - Infrastructure components
      • argocd/applicationsets/argocd-infrastructure.yaml - ArgoCD components
  • Configure multi-source pattern:
      • Source 1: Chart from monorepo at specific targetRevision tag
      • Source 2: Values from cluster-gitops repository
      • Source 3: Optional resources directory
  • Configure sync policies:
      • Staging: automated (prune + selfHeal)
      • Production: automated with selfHeal disabled
  • Set targetRevision policy using service tags ({service}-vX.Y.Z)
  • Test: Render manifests locally with helm template

Dependencies:

  • Story 3.1 (cluster-gitops Repository Complete) ✅

Technical Notes:

  • Uses ArgoCD multi-source pattern (≥2.6)
  • ApplicationSets auto-generate Applications from environment configs
  • Values interpolation via $values reference
  • CreateNamespace=true for automatic namespace creation
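The multi-source pattern described above can be sketched as follows (repo URLs, tag, and chart path are illustrative):

```yaml
# Sketch: multi-source Application spec an ApplicationSet renders
spec:
  sources:
    - repoURL: https://github.com/camaradesuk/syrf
      targetRevision: api-v8.21.0           # service tag pins the chart version
      path: src/services/api/chart
      helm:
        valueFiles:
          - $values/syrf/environments/staging/api/values.yaml
    - repoURL: https://github.com/camaradesuk/cluster-gitops
      targetRevision: main
      ref: values                            # referenced as $values above
```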

Estimated Effort: 8 story points (2 days)


Child Work Item 3.3: Environment Values Configuration ✅

GitHub Issue: #2145 Status: ✅ Complete Priority: P0 (Critical) Story Points: 5 Sprint: Sprint 2 (Completed 2025-11-12)

As a platform engineer I want environment-specific values for all services So that staging and production have appropriate resource allocations

Acceptance Criteria:

  • Create global values (global/values.yaml)
  • Create environment-specific shared values:
      • syrf/environments/staging/shared-values.yaml
      • syrf/environments/production/shared-values.yaml
  • Create service-specific values for 6 services × 2 environments:
      • syrf/environments/{staging,production}/{api,web,project-management,quartz,docs,user-guide}/
  • Each service has config.yaml (chart reference) and values.yaml (Helm values)
  • Configure for each environment:
      • Image repository and tag (via CI/CD promotion)
      • Replica counts
      • Resource requests/limits
      • Ingress hosts and TLS
      • Environment variables
      • Health check settings
  • Document configuration knobs in comments
  • Validate YAML syntax

Dependencies:

  • Story 3.1 (cluster-gitops Repository Complete) ✅

Technical Notes:

  • Environment namespace.yaml contains sync policies
  • Shared-values.yaml contains common config (deployment notifications, etc.)
  • Service config.yaml updated automatically by CI/CD promotion workflow
  • Staging: automated sync; Production: automated with manual PR merge gate

Estimated Effort: 5 story points (1 day)


Child Work Item 3.4: ApplicationSet for PR Previews ✅

GitHub Issue: #2146 Status: ✅ Complete (manually tested and verified 2025-12-01) Priority: P1 (High) Story Points: 8 Sprint: Sprint 3 (Completed 2025-12-01)

As a developer I want ephemeral preview environments for PRs So that I can test changes before merging

Acceptance Criteria:

  • Create ApplicationSet definition (applicationsets/syrf-previews.yaml)
  • Configure Pull Request Generator:
      • Watch syrf PRs with preview label
      • GitHub App credentials (github-app-repo-creds secret)
      • Requeue every 300 seconds
  • Template Application spec:
      • Name: syrf-pr-{{number}}-{{serviceName}}
      • Namespace: pr-{{number}}
      • Chart source: PR head SHA
      • Image tag: pr-{{number}}
      • Ingress: pr-{{number}}-{{serviceName}}.staging.syrf.org.uk
  • Configure sync policy:
      • Automated (prune + selfHeal)
      • CreateNamespace=true
  • Test: Open PR creates preview environment
  • Test: PR close deletes preview environment
  • Document preview URL pattern (docs/how-to/use-pr-preview-environments.md)

Completed Components:

  • GitHub Actions workflow (pr-preview.yml) - builds images with pr-{number} tag
  • Preview AppProject (argocd/projects/preview.yaml) - allows pr-* namespaces
  • Preview common values (syrf/environments/preview/common.values.yaml)
  • Documentation (381 lines comprehensive guide)
  • ApplicationSet with PullRequest generator - syrf-previews.yaml created
  • GitHub credentials secret config - github-app-repo-creds ExternalSecret added

Final Verification Steps (completed 2025-12-01):

  • Created camarades-github-app-installation-id secret in GCP Secret Manager
  • Pushed cluster-gitops changes and verified ArgoCD sync
  • Tested: Opening a PR with the preview label creates the preview environment
  • Tested: Closing the PR deletes the preview environment

Final State: All components complete. PR Preview environments fully operational - manually tested and verified 2025-12-01.

Dependencies:

  • Story 3.2 (ArgoCD Application Manifests) ✅
  • Story 4.2 (ArgoCD Installation) ✅

Technical Notes:

  • Requires ApplicationSet with pullRequest generator
  • GitHub PAT or GitHub App credentials needed
  • Ephemeral namespaces automatically cleaned up on PR close
  • Preview URLs: pr-{number}.staging.syrf.org.uk
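The generator described above might be sketched as follows (credentials wiring omitted; only one service template shown, and field values are illustrative):

```yaml
# Sketch: syrf-previews.yaml ApplicationSet (abridged)
spec:
  generators:
    - pullRequest:
        github:
          owner: camaradesuk
          repo: syrf
          labels:
            - preview              # only PRs carrying this label get previews
        requeueAfterSeconds: 300   # re-check PRs every 5 minutes
  template:
    metadata:
      name: 'syrf-pr-{{number}}-api'
    spec:
      destination:
        namespace: 'pr-{{number}}'
      syncPolicy:
        automated:
          prune: true              # closing the PR removes the Application
          selfHeal: true
```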

Estimated Effort: 8 story points (2 days)


Child Work Item 3.5: Infrastructure Dependencies Analysis ✅

GitHub Issue: #2147 Status: ✅ Complete Priority: P0 (Critical) Story Points: 5 Sprint: Sprint 2 (Completed 2025-11-12)

As a platform engineer I want to identify all infrastructure dependencies for SyRF services So that the new cluster has all required components before migration

Acceptance Criteria:

  • Document required infrastructure components:
      • Ingress controller (ingress-nginx v4.11.1)
      • cert-manager (v1.15.0) for TLS
      • external-dns (v1.14.5) for DNS management
      • RabbitMQ (v14.6.6) for inter-service messaging
      • External Secrets Operator for secret management (Google Secret Manager)
  • Create Helm charts or manifests for each component (plugins/helm/ directory)
  • Define installation order (documented in docs/cluster-bootstrap.md)
  • Create bootstrap Application for platform add-ons (argocd/bootstrap/root.yaml)
  • Document configuration requirements (per-component values.yaml files)
  • Create smoke test checklist for each component

Dependencies:

  • Story 3.1 (cluster-gitops Repository Complete) ✅

Technical Notes:

  • All components deployed via GitOps (plugins ApplicationSet)
  • Each component has config.yaml + values.yaml + resources/ directory
  • RabbitMQ is CRITICAL (required by all .NET services)
  • ESO uses ClusterSecretStore with Google Secret Manager backend
  • Workload Identity configured for external-dns and ESO
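The ESO side of this setup might be sketched as below (resource name, project ID, and service account are assumptions; Workload Identity binds the Kubernetes service account to a GCP service account with Secret Manager access):

```yaml
# Sketch: ClusterSecretStore backed by Google Secret Manager
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: gcp-secret-manager
spec:
  provider:
    gcpsm:
      projectID: camaradesuk            # illustrative GCP project ID
      auth:
        workloadIdentity:
          clusterLocation: europe-west2-a
          clusterName: camaradesuk
          serviceAccountRef:
            name: external-secrets      # KSA bound via Workload Identity
```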

Estimated Effort: 5 story points (1 day)


Work Item 4: ArgoCD Deployment 🔄 IN PROGRESS

GitHub Issue: #2148 Goal: Install and configure ArgoCD on new Kubernetes cluster

Total Story Points: 53 (32 + 21 cluster remediation issues discovered 2025-11-17) Status: 🔄 In Progress (64% - 9/14 child work items complete)

Resolved Blockers:

  • Cluster provisioned on 2025-11-12
  • ExternalSecrets fixed on 2025-11-18 (Story 4.8)
  • Image tags fixed on 2025-11-18 (Story 4.9)
  • identity-server dependency removed on 2025-11-18 (Story 4.9)

Current Blockers: None


Child Work Item 4.1: Kubernetes Cluster Provisioning ✅

GitHub Issue: #2149 Status: ✅ Complete Priority: P0 (Critical) Story Points: 13 Sprint: Sprint 2 (Completed 2025-11-12)

As a platform engineer I want a new Kubernetes cluster provisioned So that I can install ArgoCD and deploy services

Acceptance Criteria:

  • Decision made: GKE (Google Kubernetes Engine)
  • Cluster provisioned with Terraform:
      • Cluster: camaradesuk, europe-west2-a
      • Kubernetes version: 1.33.5-gke.1201000
      • Nodes: 3-6 (autoscaling), e2-standard-2
      • Features: Workload Identity, VPA, Shielded Nodes
  • kubectl access configured locally
  • Cluster connectivity validated
  • Basic namespaces created via ArgoCD
  • Document cluster details in camarades-infrastructure repo

Dependencies: None (but blocks all other Epic 4 stories)

Technical Notes:

  • Recommended: GKE europe-west2-a (continuity with Jenkins X)
  • Alternative: Any Kubernetes 1.27+ cluster
  • This is the PRIMARY BLOCKER for GitOps migration

Estimated Effort: 13 story points (1 week - including approval/provisioning time)


Child Work Item 4.2: ArgoCD Installation ✅

GitHub Issue: #2150 Status: ✅ Complete Priority: P0 (Critical) Story Points: 5 Sprint: Sprint 2 (Completed 2025-11-12)

As a platform engineer I want ArgoCD installed on the new cluster So that GitOps-based deployments can begin

Acceptance Criteria:

  • Install ArgoCD in argocd namespace (HA mode with Helm)
  • Verify all ArgoCD components are running
  • Access ArgoCD UI via Ingress (argocd.camarades.net)
  • TLS certificate configured with Let's Encrypt
  • ArgoCD admin password available via secret
  • GitHub credential template created for repository access

Dependencies:

  • Story 4.1 (K8s Cluster Provisioning)

Technical Notes:

  • Install command: kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
  • Wait for: kubectl wait --for=condition=available --timeout=300s deployment/argocd-server -n argocd
  • Password: kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
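The password retrieval works because Kubernetes stores Secret .data fields base64-encoded, which is why the command pipes through base64 -d. A minimal local illustration (no cluster required; the password value is made up):

```shell
# Secret .data values are base64-encoded; decoding recovers the plaintext.
encoded=$(printf 'hunter2' | base64)
printf '%s' "$encoded" | base64 -d   # → hunter2
```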

Estimated Effort: 5 story points (1 day)


Child Work Item 4.3: Platform Add-ons Installation ✅

GitHub Issue: #2151 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 2 (Completed 2025-11-12)

As a platform engineer I want all required infrastructure components installed So that SyRF services have the dependencies they need

Acceptance Criteria:

  • Install cert-manager v1.15.0 for TLS certificates
  • Install ingress-nginx v4.11.1 for HTTP routing (LoadBalancer: 34.13.36.98)
  • Install external-dns v1.14.5 for DNS management (with Workload Identity)
  • Install RabbitMQ v14.6.6 (REQUIRED for SyRF services)
  • Configure each component via ArgoCD Applications
  • Verify all components are healthy and synced
  • Document configuration in cluster-gitops/docs/cluster-bootstrap.md

Dependencies:

  • Story 4.2 (ArgoCD Installation)
  • Story 3.5 (Infrastructure Dependencies Analysis)

Technical Notes:

  • RabbitMQ is CRITICAL - services cannot start without it
  • Secret management: ESO with Google Secret Manager (current setup)
  • Use ArgoCD Applications for declarative installation
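The declarative installation called for above can be sketched as an ArgoCD Application per add-on. This is a minimal sketch for cert-manager; the chart repo URL and Helm values shown are assumptions, not necessarily what cluster-gitops uses:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://charts.jetstack.io   # upstream Helm repo (assumed)
    chart: cert-manager
    targetRevision: v1.15.0               # version from the acceptance criteria
    helm:
      values: |
        installCRDs: true
  destination:
    server: https://kubernetes.default.svc
    namespace: cert-manager
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
```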

Estimated Effort: 8 story points (2 days)


Child Work Item 4.4: App-of-Apps Bootstrap ✅

GitHub Issue: #2152 Status: ✅ Complete Priority: P0 (Critical) Story Points: 3 Sprint: Sprint 3 (Completed 2025-11-12)

As a platform engineer I want ArgoCD bootstrapped via App-of-Apps pattern So that all applications are managed declaratively from Git

Acceptance Criteria:

  • Create bootstrap Application (bootstrap/root.yaml)
  • Configure to watch apps/ directory
  • Apply bootstrap Application to cluster
  • Verify ArgoCD creates child Applications
  • All Applications appear in ArgoCD UI
  • Sync status is healthy
  • Document bootstrap procedure

Dependencies:

  • Story 4.2 (ArgoCD Installation)
  • Story 3.2 (ArgoCD Application Manifests)

Technical Notes:

  • Bootstrap Application lives in cluster-gitops/bootstrap/
  • Creates Applications recursively from apps/ directory
  • Once applied, entire cluster state is Git-driven
  • Tested pruning: Applications auto-delete when YAML removed from Git
  • Updated cluster-bootstrap.md with App-of-Apps pattern
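The App-of-Apps bootstrap described above amounts to one Application whose source is the apps/ directory itself. A minimal sketch (the repo URL is a placeholder; the real manifest lives at bootstrap/root.yaml in cluster-gitops):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/cluster-gitops.git   # placeholder
    targetRevision: main
    path: apps
    directory:
      recurse: true   # create child Applications from apps/ recursively
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true     # matches the tested behavior: Applications auto-delete when YAML is removed
```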

Estimated Effort: 3 story points (half day)


Child Work Item 4.5: First Service Deployment (Canary) 🔄

GitHub Issue: #2153 Status: 🔄 In Progress (75% complete) Priority: P1 (High) Story Points: 5 Sprint: Sprint 3 (Started 2025-11-12)

As a platform engineer I want to deploy one service to staging as a canary So that I can validate the entire GitOps flow before deploying all services

Acceptance Criteria:

  • Choose canary service (API selected)
  • Apply API Application manifest to ArgoCD (via App-of-Apps)
  • Verify ArgoCD syncs successfully (Synced)
  • Service pods are running and healthy (Progressing - waiting for secrets)
  • Ingress is accessible (smoke test endpoint) - BLOCKED by missing secrets
  • Check logs for errors - BLOCKED by missing secrets
  • Verify RabbitMQ connectivity - BLOCKED by missing secrets
  • Document any issues encountered
  • Create runbook for common operations

Dependencies:

  • Story 4.3 (Platform Add-ons Installation) ✅
  • Story 4.4 (App-of-Apps Bootstrap) ✅

Progress Summary (2025-11-12):

✅ Completed:

  1. All 4 .NET services deployed (API, PM, Quartz, Web)
  2. ArgoCD Applications created via App-of-Apps
  3. All showing Synced status
  4. Charts successfully templated
  5. Pods created (Progressing state)
  6. Fixed 2 critical deployment issues:
  • Environment variable format: changed from array to map format in all staging values files
  • Image references: updated Helm templates from Jenkins X pattern to standard Values pattern
  7. Documentation created:
  • /docs/how-to/required-secrets.md - complete guide for all 14 required secrets
  • Includes YAML templates, verification commands, ESO examples, troubleshooting
  8. Triggered documentation service builds:
  • Committed changes to trigger CI/CD for syrf-user-guide and syrf-docs
  • Docker images building (commit: 421f76b5)

⏳ Blockers Identified:

  1. Missing Kubernetes Secrets (Critical - blocks all .NET services):
  • auth0, identity-server, swagger-auth, public-api
  • mongo-db, elastic-db, dev-postgres-credentials
  • rabbit-mq, aws-s3, aws-ses
  • google-sheets, rob-api-credentials
  • elastic-apm, sentry
  • Recommendation: set up External Secrets Operator
  2. Missing Docker Images (syrf-docs, syrf-user-guide):
  • Images don't exist yet in GHCR
  • Build triggered via commit 421f76b5
  • Expected completion: ~5-10 minutes

Current Application Status:

Platform Services:
✅ ingress-nginx: Synced, Healthy
✅ cert-manager: Synced, Healthy
✅ external-dns: Synced, Healthy
🔄 rabbitmq: Synced, Progressing

SyRF Services:
🔄 syrf-api: Synced, Progressing (waiting for secrets)
🔄 syrf-project-management: Synced, Progressing (waiting for secrets)
🔄 syrf-quartz: Synced, Progressing (waiting for secrets)
🔄 syrf-web: Synced, Progressing (waiting for secrets)
❌ syrf-docs: Synced, Degraded (ImagePullBackOff - building)
❌ syrf-user-guide: Synced, Degraded (ImagePullBackOff - building)

Next Steps:

  1. Wait for user-guide/docs images to build (~5-10 min)
  2. Set up External Secrets Operator OR create secrets manually
  3. Verify all services start successfully
  4. Test ingress endpoints
  5. Complete acceptance criteria

Technical Notes:

  • Use baseline version from Jenkins X (see PLANNING.md)
  • API service is good canary (simpler than PM)
  • Validate entire stack before other services
  • App-of-Apps pattern validated successfully

Estimated Effort: 5 story points (1 day)


Child Work Item 4.6: End-to-End GitOps Flow Validation ⏳

GitHub Issue: #2154 Status: ⏳ Blocked Priority: P1 (High) Story Points: 8 Sprint: TBD

As a platform engineer I want to validate the complete GitOps workflow So that I can confirm all automation works as designed

Acceptance Criteria:

  • Test: Make code change to one service
  • Test: Verify auto-version creates tag
  • Test: Verify Docker image is built and pushed
  • Test: Verify promotion PR is created to cluster-gitops
  • Test: Merge promotion PR
  • Test: Verify ArgoCD syncs staging environment
  • Test: Verify service is deployed with new version
  • Test: Open PR in app-monorepo
  • Test: Verify preview environment is created
  • Test: Close PR and verify preview cleanup
  • Test: Create manual production promotion
  • Test: Verify production deployment
  • Test: Rollback by reverting promotion PR
  • Document timing metrics (commit → staging deployment time)

Dependencies:

  • Story 4.5 (First Service Deployment)
  • Story 2.2 (Docker Image Build Integration)
  • Story 2.4 (Promotion PR Automation)
  • Story 3.4 (ApplicationSet for PR Previews)

Technical Notes:

  • This validates the ENTIRE GitOps architecture
  • Target: commit → staging < 10 min p50
  • Target: preview ready < 2 min
  • Document any issues for optimization
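The timing metric above (commit → staging deployment time) can be computed from two timestamps. A sketch, with illustrative (not measured) values:

```shell
# Latency between a commit timestamp and the observed staging deployment.
# Timestamps are illustrative; substitute real values from git log / ArgoCD.
commit_ts=$(date -u -d '2025-11-18T10:00:00Z' +%s)
deploy_ts=$(date -u -d '2025-11-18T10:08:30Z' +%s)
echo "commit → staging: $(( (deploy_ts - commit_ts) / 60 )) min"   # → commit → staging: 8 min
```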

Estimated Effort: 8 story points (2 days)


Child Work Item 4.7: Helm Chart Standardization - Jenkins X Pattern Removal ✅

GitHub Issue: #2172 Status: ✅ Complete Priority: P0 (Critical) Story Points: 3 Sprint: Sprint 3 (Completed 2025-11-14)

As a platform engineer I want all Jenkins X legacy patterns removed from Helm charts So that charts use standard Kubernetes conventions and are maintainable

Acceptance Criteria:

  • Remove all jx.imagePullSecrets references (use top-level imagePullSecrets array)
  • Remove all jxRequirements.ingress.* references (use ingress.*)
  • Remove all draft label patterns
  • Update all 4 service charts (api, project-management, quartz, web)
  • Validate all charts render successfully with helm template
  • Document changes in ADR-006
  • Update environment values in cluster-gitops to match new structure

Dependencies:

  • Story 4.5 (First Service Deployment) - discovered issue during deployment

Scope Summary:

  • 52 jx references removed across 16 files (4 services × 4 files)
  • Services updated: api, project-management, quartz, web
  • Files per service: values.yaml, deployment.yaml, ingress.yaml, canary.yaml
  • Root cause: syrf-web ImagePullBackOff due to jx.imagePullSecrets vs top-level imagePullSecrets mismatch

Technical Notes:

  • Web service had 30 jx references in ingress.yaml alone (complex host name construction)
  • Used bulk sed replacements for efficiency in web ingress.yaml
  • All charts validated with helm template after changes
  • ADR-006 created: docs/decisions/ADR-006-helm-chart-standardization.md
  • Follow-up required: Update cluster-gitops environment values to use new structure
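The bulk sed replacements mentioned above can be reconstructed roughly as follows; the file content and name here are illustrative, not the actual chart templates:

```shell
# Rewrite Jenkins X value paths to standard top-level keys in a chart template.
f=$(mktemp)
cat > "$f" <<'EOF'
host: {{ .Values.jxRequirements.ingress.domain }}
pullSecrets: {{ .Values.jx.imagePullSecrets }}
EOF
sed -i \
  -e 's/\.Values\.jxRequirements\.ingress\./.Values.ingress./g' \
  -e 's/\.Values\.jx\.imagePullSecrets/.Values.imagePullSecrets/g' "$f"
cat "$f"
```

After the rewrite the template references .Values.ingress.domain and .Values.imagePullSecrets, matching the standard structure the charts were migrated to.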

Estimated Effort: 3 story points (half day)


Child Work Item 4.8: Fix SecretStore Configuration (External Secrets Migration) ✅

GitHub Issue: #2195 Status: ✅ Complete (2025-11-18) Priority: P0 (Critical - blocks ALL services) Story Points: 8 Sprint: Sprint 3

As a platform engineer I want all ExternalSecrets migrated from v1beta1 SecretStore to v1 ClusterSecretStore So that services can retrieve secrets from GCP Secret Manager and start successfully

Acceptance Criteria:

  • Migrate extra-secrets-staging to use ClusterSecretStore pattern
  • Migrate extra-secrets-production to use ClusterSecretStore pattern
  • Update all ExternalSecret references from SecretStore to ClusterSecretStore
  • Verify ClusterSecretStore is READY in staging and production
  • Test 3-5 critical secrets sync successfully (auth0, mongo-db, rabbit-mq)
  • All 40 ExternalSecrets show READY: True status (20 staging + 20 production)
  • Document migration in cluster-gitops (chart templates documented)
  • Update environment values to reference ClusterSecretStore

Completion Summary (2025-11-18):

What was done:

  1. Chart Template Updates:
  • Added shorthand secrets format to extra-secrets chart for cleaner values files
  • Created ClusterExternalSecret template (cluster-external-secrets.yaml) for cluster-wide secrets
  • All ExternalSecrets now use kind: ClusterSecretStore instead of kind: SecretStore
  • Updated API version from v1beta1 to v1
  2. ClusterExternalSecrets Created (deployed via argocd-secrets):
  • ghcr-secret → argocd, syrf-staging, syrf-production
  • rabbit-mq → rabbitmq, syrf-staging, syrf-production
  • github-app-credentials → argocd, syrf-staging, syrf-production
  • Added ClusterExternalSecret to argocd project whitelist
  3. Values Files Simplified:
  • Staging: 17 namespace-scoped secrets using shorthand format
  • Production: 17 namespace-scoped secrets using shorthand format
  • Removed duplicates (ghcr-secret, rabbit-mq, github-app-credentials) now handled by ClusterExternalSecrets
  4. Additional Fixes:
  • Fixed webhook HMAC verification (trailing newline in GCP secret)
  • Removed stuck finalizers from ExternalSecrets blocking deletion

Final State:

  • ✅ 20 ExternalSecrets in syrf-staging: All SecretSynced
  • ✅ 20 ExternalSecrets in syrf-production: All SecretSynced
  • ✅ 3 ClusterExternalSecrets: All Ready, provisioned to all target namespaces
  • ✅ GitHub webhook working (instant sync on push)
  • ✅ ClusterSecretStore gcpsm-secret-store serving all namespaces

Files Modified:

  • charts/extra-secrets/templates/external-secrets.yaml - Added shorthand format
  • charts/extra-secrets/templates/cluster-external-secrets.yaml - New file
  • charts/extra-secrets/values.yaml - Documented new formats
  • argocd/local/argocd-secrets/values.yaml - Added ClusterExternalSecrets config
  • argocd/projects/argocd.yaml - Added ClusterExternalSecret to whitelist
  • plugins/local/extra-secrets-staging/values.yaml - Simplified with shorthand format
  • plugins/local/extra-secrets-production/values.yaml - Simplified with shorthand format

Dependencies:

  • Story 4.3 (Platform Add-ons) ✅ Complete - ESO installed
  • Story 4.5 (First Service Deployment) - blocked by this issue

Technical Notes:

  • Reference working pattern: argocd/local/argocd-secrets/values.yaml
  • ClusterSecretStore enables cross-namespace secret access
  • Workload Identity already configured for ESO: external-secrets@camarades-net.iam.gserviceaccount.com
  • IAM binding already exists: roles/iam.workloadIdentityUser
  • Only chart updates needed, no infrastructure changes
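The migrated pattern can be sketched as follows; the secret name matches the required-secrets list, but the refresh interval and GCP key path are assumptions:

```yaml
apiVersion: external-secrets.io/v1   # migrated from v1beta1
kind: ExternalSecret
metadata:
  name: auth0
  namespace: syrf-staging
spec:
  refreshInterval: 1h                # assumed
  secretStoreRef:
    name: gcpsm-secret-store         # the ClusterSecretStore serving all namespaces
    kind: ClusterSecretStore         # was: SecretStore (namespace-scoped)
  target:
    name: auth0                      # resulting Kubernetes Secret
  dataFrom:
    - extract:
        key: auth0                   # GCP Secret Manager secret name (assumed)
```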

Estimated Effort: 8 story points (2 days - includes testing and verification)


Child Work Item 4.9: Fix Staging Environment Image Tags ✅

GitHub Issue: #2196 Status: ✅ Complete (2025-11-18) Priority: P0 (Critical - staging completely broken) Story Points: 5 Sprint: Sprint 3

As a developer I want staging service pods to have valid image tags So that staging environment is functional for testing

Acceptance Criteria:

  • Identify root cause of empty image tags in staging
  • Fix deployment manifests or Helm values causing empty tags
  • Update staging values with correct image tags for all services
  • Document fix in cluster-gitops troubleshooting
  • Verify all staging pods transition to Running state
  • Delete failed pods (InvalidImageName/ImagePullBackOff)
  • Verify new pods start successfully with valid images

Completion Summary (2025-11-18):

Root Cause Identified:

  • ApplicationSet removed image.tag parameters (commit bbfd0cd)
  • Staging values files didn't have explicit image configuration
  • Result: Helm rendered an empty image reference (":" - no repository, no tag)

Fixes Applied:

  1. syrf monorepo (commit df49793a):
  • Renamed pm → project-management throughout CI/CD workflow
  • Updated Docker image name: syrf-pm → syrf-project-management
  • Updated git tag prefix: pm-v → project-management-v
  • CI/CD now sets both chartTag and imageTag when promoting
  2. cluster-gitops (commit 4b5eab1):
  • Added image.repository to all service base values.yaml files
  • Added ApplicationSet parameter to derive image.tag from service.imageTag
  • Updated all environment configs with imageTag field
  • Updated project-management chartTag: pm-v11.2.0 → project-management-v11.2.0
  • Created compatibility git tag: project-management-v11.2.0
  3. Temporary workaround (commit 09eb23b):
  • project-management uses syrf-pm image until next CI/CD build
  • TODO in values.yaml to change to syrf-project-management after build
  4. Versioning error fix (2025-11-18):
  • CI/CD created incorrect project-management-v1.0.0 tag (GitVersion ran before compatibility tag)
  • Deleted incorrect tag, created correct project-management-v11.3.0 tag
  • Updated staging config: chartTag: project-management-v11.3.0, imageTag: "11.2.0"
  • Commit 2205291 in cluster-gitops
  5. Project-management rename completion (2025-11-18):
  • Triggered new CI/CD build which created project-management-v11.3.1 tag
  • Updated cluster-gitops image.repository from syrf-pm to syrf-project-management
  • Final staging config: chartTag: project-management-v11.3.2, imageTag: "11.3.2"
  6. Identity-server removal (2025-11-18):
  • Removed unused IdentityServer4.AccessTokenValidation package from API (now using Auth0)
  • Removed identityServer config blocks from all 4 service Helm values.yaml files
  • Removed identity-server secret environment variables from 3 deployment templates
  • Added IdentityModel.AspNetCore.OAuth2Introspection for TokenRetrieval (was a transitive dependency)
  • Updated required-secrets.md - removed identity-server from required secrets list

Final Service Versions (All Healthy):

  • api: 9.2.3
  • project-management: 11.3.2
  • quartz: 0.5.1
  • web: 5.4.2
  • docs: 1.6.5
  • user-guide: 1.1.0

Architecture Improvement:

  • image.repository is now explicit in service values.yaml (static)
  • image.tag is derived via ApplicationSet from service.imageTag
  • CI/CD sets both chartTag (chart version) and imageTag (Docker tag) on promotion
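A sketch of how this derivation might look in the ApplicationSet template; everything beyond the image.tag parameter name and the environment config's imageTag field is an assumption about the generator's variable naming:

```yaml
# Hypothetical fragment of the services ApplicationSet template:
# image.repository stays static in each service's values.yaml, while
# image.tag is injected from the environment config's imageTag field.
spec:
  template:
    spec:
      source:
        helm:
          parameters:
            - name: image.tag
              value: '{{ .imageTag }}'   # placeholder for the generator variable
```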

Previous State (2025-11-17):

  • ❌ 5 failed pods in syrf-staging namespace
  • syrf-api-657c97878c-jzdd6: InvalidImageName (Image: : - no registry, no tag)
  • syrf-projectmanagement-9cdbf4465-tpkpw: InvalidImageName
  • syrf-projectmanagement-6fc8d864f5-65pf7: ImagePullBackOff
  • syrf-quartz-d6696d6d5-sr846: InvalidImageName
  • syrf-web-7468b67d77-t7bl6: InvalidImageName
  • ✅ Older pods still running: syrf-api-5758596878, syrf-quartz-68b97d8994, syrf-web-65c666df64

Root Cause Analysis (1 hour):

  1. Check staging values files in cluster-gitops for image.tag configuration
  2. Compare with production values (production is working)
  3. Check ApplicationSet or Application manifests for templating issues
  4. Review recent commits that may have introduced the issue
  5. Check if CI/CD promotion PR left tags empty

Remediation Steps (2-4 hours):

  1. Immediate fix: update staging values with working image tags (30 min)
  • Set explicit image.tag values for all services
  • Use last known working versions from old pods
  • Create PR to cluster-gitops
  2. Sync and verify: ArgoCD sync and pod replacement (1 hour)
  • Sync staging Applications
  • Delete failed pods
  • Wait for new pods to start
  • Verify Running status
  3. Permanent fix: fix root cause in values/templates (1-2 hours)
  • Fix Helm templating if issue is in chart
  • Fix CI/CD promotion workflow if issue is in automation
  • Add validation to prevent empty tags in future
  • Test with dry-run deployment
  4. Documentation: update troubleshooting guide (30 min)
  • Document common image tag issues
  • Add verification commands
  • Create runbook for fixing failed pods

Dependencies:

  • Story 4.5 (First Service Deployment) - blocked by this issue
  • May depend on Story 2.4 (Promotion PR Automation) if CI/CD is root cause

Technical Notes:

  • Old working pods can provide reference for correct image tags
  • Issue likely introduced in recent deployment or promotion
  • Compare staging vs production Application specs
  • May need to rollback recent changes to cluster-gitops

Estimated Effort: 5 story points (1 day - includes analysis and fix)


Child Work Item 4.10: Fix Extra-Secrets ApplicationSet Directory Structure ✅

GitHub Issue: #2197 Status: ✅ Completed (2025-11-18) Priority: P1 (High - blocks extra-secrets deployment) Story Points: 2 Sprint: Sprint 3

As a platform engineer I want extra-secrets Applications to sync successfully So that ClusterSecretStore and ExternalSecrets can be deployed

Acceptance Criteria:

  • Create missing resources/ directories in cluster-gitops
  • Add .gitkeep files to preserve empty directories
  • Verify extra-secrets-staging Application syncs successfully
  • Verify extra-secrets-production Application syncs successfully
  • Both Applications show Synced status (not Degraded)
  • Document ApplicationSet directory requirements

Current State (2025-11-18):

  • ✅ extra-secrets-production: Synced, Healthy
  • ✅ extra-secrets-staging: Synced, Healthy
  • Resolution: Added missing resources/ directories with .gitkeep files (commit 8233395)
  • Documentation: Added troubleshooting section to cluster-gitops/docs/applicationsets.md

Affected Files (cluster-gitops repo):

  • Missing: plugins/local/extra-secrets-production/resources/
  • Missing: plugins/local/extra-secrets-staging/resources/

Remediation Steps (1-2 hours):

  1. Create directories (15 min)
cd cluster-gitops
mkdir -p plugins/local/extra-secrets-production/resources
mkdir -p plugins/local/extra-secrets-staging/resources
touch plugins/local/extra-secrets-production/resources/.gitkeep
touch plugins/local/extra-secrets-staging/resources/.gitkeep
  2. Commit and push (15 min)
  • Create PR or commit directly to cluster-gitops
  • Include documentation in commit message
  3. Sync and verify (30 min)
  • Trigger ArgoCD sync for both Applications
  • Verify Degraded status clears
  • Verify no more "path does not exist" errors
  • Check Application health in ArgoCD UI
  4. Document pattern (30 min)
  • Document ApplicationSet multi-source requirements
  • Add note about resources/ directory purpose
  • Update cluster-gitops README if needed

Dependencies:

  • Story 4.4 (App-of-Apps Bootstrap) ✅ Complete
  • Related to Story 4.8 (SecretStore fix) - both affect extra-secrets

Technical Notes:

  • Same pattern already used for: argocd/local/argocd-secrets/resources/.gitkeep
  • ApplicationSet template unconditionally adds third source
  • Empty resources/ directory satisfies template requirement
  • .gitkeep ensures git tracks empty directory

Estimated Effort: 2 story points (2-4 hours)


Child Work Item 4.11: Fix User Guide TLS Certificate ✅

GitHub Issue: #2198 Status: ✅ Complete Priority: P2 (Medium - cert-manager issue) Story Points: 3 Sprint: Sprint 3 Completed: 2025-11-18

As a platform engineer I want the user-guide TLS certificate to issue successfully So that help.staging.syrf.org.uk is accessible over HTTPS

Acceptance Criteria:

  • Investigate why user-guide-tls certificate is stuck in "Issuing" state
  • Identify blocker (DNS, ACME challenge, rate limit, etc.)
  • Resolve issue preventing certificate issuance
  • Verify certificate transitions to Ready: True
  • Verify TLS secret is created
  • Test HTTPS access to staging URLs
  • Document resolution

Root Causes Identified (2025-11-18):

  1. Staging using production URLs: Staging ingresses were configured with production hostnames (e.g., help.syrf.org.uk) instead of staging hostnames (e.g., help.staging.syrf.org.uk)
  2. DNS mismatch: Production URLs point to GitHub Pages or legacy cluster, not GKE LoadBalancer, so ACME HTTP-01 challenges failed with 404
  3. Wrong Let's Encrypt issuer: Staging shared-values.yaml used letsencrypt-staging issuer
  4. Chart defaults pollution: Helm chart defaults had hardcoded ingress values not fully overridden

Resolution Applied:

  1. Updated all Helm chart defaults to ingress: {} (api, pm, quartz, web, user-guide, docs)
  2. Created staging environment values with correct staging URLs
  3. Changed staging shared-values to use letsencrypt-prod issuer
  4. Deleted old certificates to trigger reissuance with correct configuration

Final State (2025-11-18):

  • ✅ All staging certificates using letsencrypt-prod
  • ✅ All staging certificates Ready: True
  • ✅ Staging URLs configured (updated to new convention 2025-11-30):
  • api: api.staging.syrf.org.uk
  • web: staging.syrf.org.uk
  • project-management: project-management.staging.syrf.org.uk
  • docs: docs.staging.syrf.org.uk
  • user-guide: help.staging.syrf.org.uk
  • ✅ All production certificates Ready: True

Diagnostic Steps (1-2 hours):

  1. Check cert-manager logs (30 min)
  • Look for errors related to user-guide-tls
  • Check ACME challenge status
  • Identify specific failure reason
  2. Check ACME challenge resources (30 min)
  • List Challenge resources for this certificate
  • Check if HTTP-01 challenge pod exists
  • Verify challenge endpoint is reachable
  • Check if cert-manager solver ingress exists
  3. Check DNS and ingress (30 min)
  • Verify help.syrf.org.uk DNS resolves to ingress IP
  • Check ingress routes for ACME challenge path
  • Verify no conflicts with other certificates
  4. Check Let's Encrypt rate limits (15 min)
  • Check if domain hit rate limits
  • Verify staging vs production ACME server usage

Potential Root Causes:

  1. DNS issue: help.syrf.org.uk not resolving or resolving to wrong IP
  2. ACME challenge failure: HTTP-01 challenge endpoint not reachable
  3. Rate limiting: Let's Encrypt rate limits exceeded
  4. Ingress conflict: Multiple ingresses competing for same hostname
  5. cert-manager bug: Controller not processing certificate request

Remediation Steps (1-2 hours):

  1. Delete and recreate (if stuck in bad state)
kubectl delete certificate user-guide-tls -n syrf-staging
# ArgoCD will recreate from manifest
  2. Force a new order (if ACME issue)
kubectl delete certificaterequest -n syrf-staging -l cert-manager.io/certificate-name=user-guide-tls
kubectl delete challenge -n syrf-staging --all
  3. Update certificate spec (if configuration issue)
  • Switch to DNS-01 challenge if HTTP-01 is failing
  • Use Let's Encrypt staging server to avoid rate limits
  • Adjust dnsNames or issuer reference
  4. Verify and monitor (30 min)
  • Watch certificate events
  • Monitor cert-manager logs
  • Verify certificate reaches Ready state
  • Test HTTPS access

Dependencies:

  • Story 4.3 (Platform Add-ons) ✅ Complete - cert-manager installed
  • May need coordination with DNS/ingress configuration

Technical Notes:

  • cert-manager version: v1.15.0
  • Issuer: Let's Encrypt (production)
  • Challenge type: HTTP-01 (assumed)
  • Other certificates issuing successfully (suggests cert-manager is healthy)
  • Issue specific to user-guide service

Estimated Effort: 3 story points (4-8 hours - investigation heavy)


Child Work Item 4.12: Sync Out-of-Sync Applications 🔄

GitHub Issue: #2199 Status: 🔄 In Progress (Plugins complete, SyRF apps pending) Priority: P2 (Medium - operational hygiene) Story Points: 2 Sprint: Sprint 3 (In Progress)

As a platform engineer I want all ArgoCD Applications to show Synced status So that cluster state matches Git and drift is eliminated

Acceptance Criteria:

  • Review all Applications with OutOfSync or Unknown status
  • Identify reason for each drift (manual change, missing config, etc.)
  • Sync or fix each Application
  • Verify all Applications show Synced status
  • Document any manual steps taken
  • Configure sync policies if needed (auto-sync, prune, self-heal)

Current State (2025-11-17):

OutOfSync Applications (4):

  • argocd-secrets: OutOfSync
  • cert-manager: OutOfSync
  • rabbitmq: OutOfSync
  • root: OutOfSync

Unknown Status Applications (9):

  • docs-production: Unknown
  • external-dns: Unknown
  • extra-secrets-production: Unknown (also Degraded)
  • extra-secrets-staging: Unknown (also Degraded)
  • ingress-nginx: Unknown
  • quartz-production: Unknown
  • rabbitmq-secrets: Unknown
  • user-guide-production: Unknown
  • user-guide-staging: Synced (but has failing certificate)

Sync Process (2-4 hours):

  1. Categorize issues (30 min)
  • Group by root cause
  • Identify which can auto-sync vs need manual intervention
  • Check if blocked by other child work items
  2. Sync core infrastructure (1 hour)
  • argocd-secrets (likely just committed changes)
  • cert-manager (check for drift)
  • rabbitmq (verify no manual changes)
  • root (app-of-apps - sync to propagate changes)
  3. Investigate Unknown status (1 hour)
  • Check why Application status is Unknown
  • May indicate health check issues
  • May be transient during deployment
  • Review Application logs and events
  4. Document and prevent (30 min)
  • Document why each Application was out of sync
  • Configure sync policies to prevent future drift
  • Consider enabling auto-sync for infrastructure apps

Dependencies:

  • May depend on Story 4.8 (SecretStore fix) for extra-secrets apps
  • May depend on Story 4.10 (directory structure fix) for extra-secrets apps

Technical Notes:

  • Unknown status often means ArgoCD can't determine health
  • May need to configure custom health checks
  • OutOfSync is expected during active development
  • Root Application sync propagates to all child Applications
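For the Unknown-status cases, the usual fix is a custom health check in the argocd-cm ConfigMap. A sketch for ExternalSecret resources; the key format is standard ArgoCD (group_Kind), but the Lua body is an assumption modeled on the ESO Ready condition:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.external-secrets.io_ExternalSecret: |
    hs = {}
    hs.status = "Progressing"
    hs.message = "Waiting for secret sync"
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, c in ipairs(obj.status.conditions) do
        if c.type == "Ready" and c.status == "True" then
          hs.status = "Healthy"
          hs.message = c.message
        end
      end
    end
    return hs
```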

Estimated Effort: 2 story points (2-4 hours)


Child Work Item 4.13: Clean Up Orphaned Resources 🔄

GitHub Issue: #2200 Status: 🔄 In Progress (Plugins cleanup complete, SyRF apps pending) Priority: P3 (Low - cleanup task) Story Points: 1 Sprint: Sprint 3-4 (In Progress)

As a platform engineer I want orphaned resources cleaned up So that cluster is tidy and ArgoCD warnings are eliminated

Acceptance Criteria:

  • Review orphaned resources in api-staging (7 resources)
  • Review orphaned resources in extra-secrets-production (3 resources)
  • Determine if resources should be:
  • Deleted (if truly orphaned)
  • Adopted by ArgoCD (if should be managed)
  • Ignored (if intentionally manual)
  • Execute cleanup or adoption
  • Verify OrphanedResourceWarning clears from Applications
  • Document decisions for future reference

Current State (2025-11-17):

  • api-staging: 7 orphaned resources
  • extra-secrets-production: 3 orphaned resources

Orphaned Resource Analysis (1-2 hours):

  1. Identify resources (30 min)
kubectl get application api-staging -n argocd -o yaml | yq '.status.resources'
kubectl get application extra-secrets-production -n argocd -o yaml | yq '.status.resources'
  2. Determine ownership (30 min)
  • Check resource labels and annotations
  • Verify whether resources are managed by Helm
  • Check whether resources should exist in manifests
  • Identify why ArgoCD considers them orphaned
  3. Decide action (30 min)
  • Delete: if resources are leftover from old deployments
  • Adopt: if resources should be in Git but aren't
  • Ignore: if resources are intentionally manual (add to the AppProject orphanedResources ignore list)

Cleanup Steps (30 min - 1 hour):

  1. Option A: delete orphaned resources
kubectl delete <resource-type> <resource-name> -n <namespace>
  2. Option B: adopt into ArgoCD
  • Add resource manifests to Git
  • Configure ownerReferences
  • Sync Application
  3. Option C: ignore
  • Add to the AppProject orphanedResources ignore list
  • Document why resources are manual

Dependencies:

  • None (independent cleanup task)

Technical Notes:

  • Orphaned resources don't break functionality
  • Warnings indicate resources exist but not tracked in Git
  • Common causes: manual kubectl apply, Helm 2 migration, renamed resources
  • ArgoCD can adopt resources with proper annotations
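Silencing an intentionally-manual resource is configured on the AppProject (orphan warnings come from orphanedResources, not from Application ignoreDifferences, which only affects diffing). A sketch with a hypothetical resource name:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default
  namespace: argocd
spec:
  orphanedResources:
    warn: true            # keep warning about genuinely untracked resources
    ignore:
      - group: ""
        kind: ConfigMap
        name: manually-managed-config   # hypothetical intentionally-manual resource
```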

Estimated Effort: 1 story point (1-2 hours)


Child Work Item 4.14: Configure ArgoCD Sync Policies and Drift Prevention 🔄

GitHub Issue: #2201 Status: 🔄 In Progress (Plugins complete, SyRF apps pending) Priority: P1 (High - prevents future issues) Story Points: 5 Sprint: Sprint 3 (In Progress)

As a platform engineer I want proper ArgoCD sync policies configured for all Applications So that the cluster stays synchronized with Git and drift is prevented automatically

Acceptance Criteria:

  • Audit all Applications and categorize by sync strategy needs
  • Configure automated sync policies where appropriate
  • Enable prune and self-heal for automated applications
  • Configure sync waves for ordered deployments
  • Set up retry logic for transient failures
  • Add PostSync hooks for critical validations (beyond deployment notifications)
  • Configure health checks for custom resources
  • Document sync policy decisions and rationale
  • Test drift detection and auto-remediation
  • Create runbook for monitoring sync status

Current State (2025-11-17):

  • Mixed sync policies: Some auto-sync, some manual, inconsistent configuration
  • No self-heal configured: Manual changes to cluster not automatically reverted
  • No prune configured: Deleted resources in Git remain in cluster
  • No sync waves: Dependencies deployed in random order
  • Limited health checks: ArgoCD can't determine health for some resources

Sync Policy Categories (1 hour):

  1. Full Automation (auto-sync + prune + self-heal):
    • Infrastructure: ingress-nginx, cert-manager, external-dns
    • Platform: external-secrets-operator, rabbitmq-secrets
    • Staging services: All syrf-staging services
    • Justification: Non-critical, fast feedback needed
  2. Semi-Automated (auto-sync + self-heal, NO prune):
    • Production services: All syrf-production services
    • ArgoCD itself: Self-managing, but prune with care
    • Justification: Auto-sync for speed, manual pruning for safety
  3. Manual Only (no auto-sync):
    • Production database configs (if added)
    • Security-critical resources (if added)
    • Justification: Require explicit approval before changes
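
As a sketch, the Semi-Automated category above maps to an Application syncPolicy with self-heal enabled and prune disabled:

```yaml
# Semi-automated policy (e.g., a syrf-production service):
# drift is reverted automatically, but resources deleted from Git
# stay in the cluster until someone prunes them manually.
syncPolicy:
  automated:
    prune: false     # deletions require an explicit manual sync
    selfHeal: true   # manual cluster edits are reverted to Git state
  syncOptions:
    - CreateNamespace=true
```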

Configuration Tasks (3-4 hours):

  1. Infrastructure Applications (1 hour)
# Example: ingress-nginx
syncPolicy:
  automated:
    prune: true
    selfHeal: true
  syncOptions:
    - CreateNamespace=true
    - PrunePropagationPolicy=foreground
    - PruneLast=true
  retry:
    limit: 5
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m

  2. Service Applications (1-2 hours)
    • Add sync waves to ensure dependencies deploy first
    • Configure health checks for custom resources
    • Set appropriate retry policies
    • Enable auto-sync for staging, semi-automated sync for production
  3. PostSync Validation Hooks (1 hour)
    • Add hooks beyond deployment notifications
    • Validate that critical resources exist after sync
    • Check for common misconfigurations
    • Alert on unexpected drift
  4. Health Checks (30 min)
    • Configure custom health checks for ExternalSecrets
    • Configure health for Jobs (treat successful completion as healthy)
    • Configure health for StatefulSets (readiness)
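
One way to implement the ExternalSecret health check above is a custom health check written in Lua in the argocd-cm ConfigMap. A sketch, assuming ESO reports a Ready condition under status.conditions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Key format: resource.customizations.health.<group>_<kind>
  resource.customizations.health.external-secrets.io_ExternalSecret: |
    hs = {}
    hs.status = "Progressing"
    hs.message = "Waiting for ExternalSecret to sync"
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, condition in ipairs(obj.status.conditions) do
        if condition.type == "Ready" and condition.status == "True" then
          hs.status = "Healthy"
          hs.message = condition.message
        end
      end
    end
    return hs
```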

Example PostSync Validation Hook:

apiVersion: batch/v1
kind: Job
metadata:
  name: validate-deployment
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      containers:
      - name: validate
        image: bitnami/kubectl:latest
        command:
        - /bin/bash
        - -c
        - |
          # Validate ExternalSecrets are ready
          kubectl wait --for=condition=Ready \
            externalsecret/rabbit-mq -n {{ .Values.namespace }} \
            --timeout=60s

          # Validate pods are running
          kubectl wait --for=condition=Ready \
            pod -l app={{ .Values.appName }} \
            -n {{ .Values.namespace }} \
            --timeout=120s
      restartPolicy: Never

Sync Wave Strategy (deployment order):

# Wave -1: Prerequisites
- ClusterSecretStore (wave: -1)
- Namespaces (wave: -1)

# Wave 0: Infrastructure
- ingress-nginx (wave: 0)
- cert-manager (wave: 0)

# Wave 1: Platform Services
- external-secrets-operator (wave: 1)
- rabbitmq (wave: 1)

# Wave 2: Secrets
- extra-secrets (wave: 2)
- rabbitmq-secrets (wave: 2)

# Wave 3: Application Services
- syrf-api (wave: 3)
- syrf-pm (wave: 3)
- syrf-quartz (wave: 3)
- syrf-web (wave: 3)
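
Each wave above corresponds to an argocd.argoproj.io/sync-wave annotation (lower values sync first; negative waves run before the default wave 0). For example, pinning the syrf-api Application to wave 3:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: syrf-api
  annotations:
    argocd.argoproj.io/sync-wave: "3"  # syncs after waves -1 through 2
```

The same annotation works on any resource ArgoCD manages, so it applies equally to Applications in an app-of-apps layout and to raw manifests inside a single Application.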

Testing and Validation (1 hour):

  1. Test auto-sync (20 min)
    • Make a Git change to an auto-sync app
    • Verify ArgoCD detects the change and syncs automatically
    • Verify the sync completes within 3 minutes
  2. Test self-heal (20 min)
    • Make a manual change to the cluster (kubectl edit)
    • Verify ArgoCD detects the drift
    • Verify auto-remediation within 5 minutes
  3. Test prune (20 min)
    • Delete a resource from Git
    • Verify ArgoCD removes it from the cluster
    • Verify the prune happens in the correct order

Documentation (30 min):

  • Document sync policy for each Application
  • Create decision matrix (when to use each policy)
  • Document sync wave strategy
  • Create troubleshooting guide for sync failures
  • Update cluster-gitops README with sync policies

Dependencies:

  • Story 4.8 (SecretStore fix) - needed for testing ExternalSecret health
  • Story 4.12 (Sync out-of-sync apps) - clean slate before configuring policies

Technical Notes:

  • Prune safety: Use PruneLast=true to ensure new resources deploy before old ones delete
  • Self-heal interval: ArgoCD checks every 3 minutes by default
  • Sync waves: Lower numbers deploy first, use negative for prerequisites
  • Health checks: Custom Lua scripts for complex resources
  • Retry logic: Exponential backoff prevents sync storms
  • PostSync hooks: Run AFTER sync succeeds, useful for validation
  • PreSync hooks: Run BEFORE sync, useful for migrations

Estimated Effort: 5 story points (1 day)


Work Item 5: Production Migration ⏳ PLANNED

GitHub Issue: #2155 Goal: Migrate production traffic from Jenkins X to new cluster

Total Story Points: 34 Status: ⏳ Planned (0%)


Child Work Item 5.1: Production Deployment Validation 📋

GitHub Issue: #2156 Status: 📋 Ready (after Work Item 4) Priority: P0 (Critical) Story Points: 8 Sprint: TBD

As a platform engineer I want all services deployed to production environment So that production environment is ready for traffic

Acceptance Criteria:

  • Deploy all 4 services to production namespace
  • Use current production versions (from Jenkins X cluster)
  • Verify all pods are running and healthy
  • Verify ingress routes are configured
  • Verify RabbitMQ connectivity
  • Verify database connectivity
  • Run smoke tests for each service
  • Verify monitoring and logging
  • Document production configuration

Dependencies:

  • Story 4.6 (End-to-End GitOps Flow Validation)
  • All Work Item 4 child work items complete

Technical Notes:

  • Use versions from PLANNING.md (Jenkins X baseline)
  • Do NOT switch traffic yet (parallel running)
  • Validate in isolation first

Estimated Effort: 8 story points (2 days)


Child Work Item 5.2: Traffic Cutover Planning 📋

GitHub Issue: #2157 Status: 📋 Ready (after Work Item 4) Priority: P0 (Critical) Story Points: 5 Sprint: TBD

As a platform engineer I want a detailed cutover plan with rollback procedures So that production migration is safe and reversible

Acceptance Criteria:

  • Document cutover strategy:
    • Blue-green deployment
    • Canary rollout
    • DNS switch
  • Define success criteria
  • Create rollback plan
  • Schedule maintenance window
  • Prepare communication plan
  • Create monitoring checklist
  • Define rollback triggers
  • Test rollback procedure in staging

Dependencies:

  • Story 5.1 (Production Deployment Validation)

Technical Notes:

  • Recommended: DNS-based cutover (fastest rollback)
  • Alternative: Load balancer reconfiguration
  • Monitor for 24-48 hours before Jenkins X decommission

Estimated Effort: 5 story points (1 day)


Child Work Item 5.3: Production Cutover Execution 📋

GitHub Issue: #2158 Status: 📋 Ready (after Work Item 4) Priority: P0 (Critical) Story Points: 13 Sprint: TBD

As a platform engineer I want to switch production traffic to the new cluster So that SyRF runs on the GitOps architecture

Acceptance Criteria:

  • Announce maintenance window
  • Verify new cluster is ready
  • Switch traffic (DNS or load balancer)
  • Monitor application health
  • Monitor error rates
  • Monitor latency metrics
  • Verify all services responding
  • Run production smoke tests
  • Monitor for 2 hours
  • Confirm no critical errors
  • Document cutover results
  • Update status page

Dependencies:

  • Story 5.2 (Traffic Cutover Planning)

Technical Notes:

  • This is the GO-LIVE event
  • Rollback plan must be ready
  • Team on standby during cutover

Estimated Effort: 13 story points (1 week - includes monitoring period)


Child Work Item 5.4: Jenkins X Cluster Decommission 📋

GitHub Issue: #2159 Status: 📋 Ready (after Child Work Item 5.3) Priority: P2 (Medium) Story Points: 8 Sprint: TBD

As a platform engineer I want to decommission the Jenkins X cluster So that we don't pay for unused infrastructure

Acceptance Criteria:

  • Monitor new cluster for 1 week post-cutover
  • Confirm no critical issues
  • Export Jenkins X configuration (backup)
  • Document lessons learned
  • Archive Jenkins X logs and metrics
  • Delete Jenkins X Applications
  • Delete Jenkins X cluster
  • Remove DNS entries
  • Update documentation
  • Celebrate migration success! 🎉

Dependencies:

  • Story 5.3 (Production Cutover Execution)
  • 1 week monitoring period

Technical Notes:

  • Do NOT delete until 100% confident in new cluster
  • Keep backups of Jenkins X configs
  • Final step of the migration

Estimated Effort: 8 story points (2 days)


Sprint Planning Recommendations

Sprint 2 (Current - 2 weeks)

Goal: Complete CI/CD automation and cluster-gitops setup

Child Work Items to Include (42 story points - ambitious):

  • ✅ Child Work Item 2.1: Auto-Version Workflow Cleanup (5 pts)
  • ✅ Child Work Item 2.2: Docker Image Build Integration (13 pts)
  • ✅ Child Work Item 3.2: ArgoCD Application Manifests (8 pts)
  • ✅ Child Work Item 3.3: Environment Values Configuration (5 pts)
  • ✅ Child Work Item 3.5: Infrastructure Dependencies Analysis (5 pts)

Stretch Goals:

  • Child Work Item 2.4: Promotion PR Automation (8 pts)

Deliverables:

  • Docker images building and pushing to GHCR
  • ArgoCD manifests ready to apply
  • Environment values configured
  • Infrastructure requirements documented

Sprint 3 (In Progress - 2 weeks)

Goal: Complete remaining CI/CD and GitOps infrastructure

Child Work Items to Include (24 story points):

  • ✅ Child Work Item 2.4: Promotion PR Automation (8 pts) - Complete
  • ✅ Child Work Item 3.4: ApplicationSet for PR Previews (8 pts) - Complete
  • ✅ Child Work Item 2.3: Build Optimization (8 pts) - Complete (2025-11-19)

Deliverables:

  • ✅ Full CI/CD automation working
  • ✅ PR preview environments configured
  • ✅ Build optimization implemented

Sprint 4+ (Blocked on K8s cluster)

Goal: Deploy to new cluster and validate

Child Work Items:

  • All Work Item 4 child work items (29 points)
  • Requires: Kubernetes cluster provisioned

Sprint 5+ (Blocked on Sprint 4)

Goal: Production migration

Child Work Items:

  • All Work Item 5 child work items (34 points)

Risk Register

High-Risk Items

| Risk | Probability | Impact | Mitigation | Status |
|------|-------------|--------|------------|--------|
| K8s cluster not available | High | Critical | Work on items that don't require cluster (Work Items 2, 3) | ⚠️ Active |
| Docker build failures in monorepo | Medium | High | Test builds locally first, review Dockerfiles | 📋 Planned |
| RabbitMQ connectivity issues | Low | Critical | Test in staging first, document config | 📋 Planned |
| Production cutover problems | Medium | Critical | Detailed rollback plan, gradual cutover | 📋 Planned |
| Data migration issues | Low | Critical | Verify data access before cutover | 📋 Planned |

Medium-Risk Items

| Risk | Probability | Impact | Mitigation | Status |
|------|-------------|--------|------------|--------|
| DNS propagation delays | Medium | Medium | Plan for TTL, use low TTL before cutover | 📋 Planned |
| Secret management migration | Medium | Medium | Test ESO in staging, document secrets | 📋 Planned |
| Performance degradation | Low | Medium | Load testing, monitoring, gradual rollout | 📋 Planned |

Metrics & KPIs

Development Velocity

  • Sprint Capacity: ~40 story points per 2-week sprint (1 developer)
  • Completed: 174 story points
  • Remaining: 36 story points
  • Estimated Sprints: 1-2 sprints (2-3 weeks)

GitOps Success Criteria

Once Work Item 4 is complete, measure:

| Metric | Target | Current | Status |
|--------|--------|---------|--------|
| Commit → Staging Deploy Time | < 10 min p50 | N/A | ⏳ Not measured |
| Preview Env Creation Time | < 2 min | N/A | ⏳ Not measured |
| Deployment via Git PRs | 100% | N/A | ⏳ Not measured |
| Untracked Drift | 0 instances | N/A | ⏳ Not measured |
| Rollback Time | < 5 min | N/A | ⏳ Not measured |

Dependencies Graph

Work Item 1 (✅ Complete)
  └── Work Item 2 (🔄 In Progress)
        ├── Child Work Item 2.1 (Auto-Version Cleanup)
        ├── Child Work Item 2.2 (Docker Builds) → depends on 2.1
        ├── Child Work Item 2.3 (Build Optimization) → depends on 2.2
        └── Child Work Item 2.4 (Promotion PRs) → depends on 2.2, 3.1

Work Item 1 (✅ Complete)
  └── Work Item 3 (🔄 In Progress)
        ├── Child Work Item 3.1 (cluster-gitops) ✅ Complete
        ├── Child Work Item 3.2 (ArgoCD Apps) → depends on 3.1
        ├── Child Work Item 3.3 (Env Values) → depends on 3.1
        ├── Child Work Item 3.4 (ApplicationSets) → depends on 3.2
        └── Child Work Item 3.5 (Infrastructure) → depends on 3.1

Work Item 4 (⏳ Blocked) → BLOCKER: K8s Cluster
  ├── Child Work Item 4.1 (K8s Cluster) ⚠️ PRIMARY BLOCKER
  ├── Child Work Item 4.2 (ArgoCD Install) → depends on 4.1
  ├── Child Work Item 4.3 (Platform Add-ons) → depends on 4.2, 3.5
  ├── Child Work Item 4.4 (Bootstrap) → depends on 4.2, 3.2
  ├── Child Work Item 4.5 (First Service) → depends on 4.3, 4.4
  └── Child Work Item 4.6 (E2E Validation) → depends on 4.5, 2.2, 2.4, 3.4

Work Item 5 (⏳ Planned) → depends on Work Item 4
  ├── Child Work Item 5.1 (Prod Validation) → depends on 4.6
  ├── Child Work Item 5.2 (Cutover Plan) → depends on 5.1
  ├── Child Work Item 5.3 (Cutover) → depends on 5.2
  └── Child Work Item 5.4 (Decommission) → depends on 5.3 + 1 week

Changelog

2025-11-19 (Build Optimization - Complete)

  • Child Work Item 2.3: Build Optimization ✅ Complete
  • Implemented crane-based image retagging for chart-only changes
  • Added list-files: shell to dorny/paths-filter for file analysis
  • Chart-only detection checks if ALL changed files match .chart/ pattern
  • Initial approach using negation patterns didn't work (patterns are OR'd)
  • Fixed by analyzing actual file paths in combined step
  • Successfully tested: API chart-only change retagged 9.4.3 → 9.4.4
  • Time savings: 12 seconds vs 4+ minutes for full build
  • Sprint 3 now fully complete - All 24 story points delivered

2025-11-18 (Plugins ApplicationSet Fixes - Complete)

  • Plugins Project ArgoCD Applications Fixed: All plugins apps now Synced/Healthy
  • Sync Policy Configuration (Child Work Item 4.14 - Partial):
    • Enabled selfHeal: true for drift prevention (Git is source of truth)
    • Added ServerSideApply=true for large CRDs (ESO)
    • Added ApplyOutOfSyncOnly=true for efficient syncing
    • Added ignoreDifferences for ESO controller default fields (conversionStrategy, decodingStrategy, metadataPolicy)
    • Removed blocking SyncWindows from plugins and argocd projects
  • Orphaned Resources Cleanup (Child Work Item 4.13 - Partial):
    • Deleted orphaned rabbitmq secret from syrf-staging
    • Removed redundant rabbitmq-secrets plugin (ClusterExternalSecret handles this)
    • Configured RabbitMQ existingErlangSecret to prevent drift
  • Directory Structure Fix (Child Work Item 4.10 - Complete):
    • Created missing resources directories with .gitkeep for external-dns, ingress-nginx
  • ClusterIssuer Resolution:
    • Deleted ClusterIssuers for GitOps regeneration with correct tracking ID
  • ESO CRD Fix:
    • Removed kubectl.kubernetes.io/last-applied-configuration annotations
    • Deleted and recreated CRDs with ServerSideApply
  • Commits:
    • 1c9ebe5: fix(plugins): resolve ArgoCD application issues
    • 8b6678a: fix(plugins): enable selfHeal and remove redundant rabbitmq-secrets
    • e97f189: fix(projects): remove blocking SyncWindows from argocd and plugins
    • 422194c: fix(plugins): add ApplyOutOfSyncOnly to help with large CRD sync
    • af29deb: fix(plugins): ignore ESO controller default fields in ExternalSecrets
  • Final Status: All 7 plugins apps showing Synced/Healthy:
    • cert-manager, external-dns, external-secrets-operator
    • extra-secrets-production, extra-secrets-staging, ingress-nginx, rabbitmq

2025-11-18 (TLS Certificate and Ingress Configuration Fixes)

  • Child Work Item 4.11 Complete: Fixed TLS certificates for all staging and production services
  • Root Causes Identified:
    • Staging ingresses using production URLs (e.g., help.syrf.org.uk instead of help.staging.syrf.org.uk)
    • DNS mismatch causing ACME HTTP-01 challenges to fail with 404
    • Staging using letsencrypt-staging issuer instead of letsencrypt-prod
    • Helm chart defaults with hardcoded ingress values not fully overridden
  • Fixes Applied:
    • Updated all Helm chart defaults to ingress: {} (api, pm, quartz, web, user-guide, docs)
    • Created staging environment values with correct staging URLs
    • Changed staging shared-values to use letsencrypt-prod issuer
    • All certificates now issued by Let's Encrypt production
  • Current State:
    • All staging certificates: Ready ✅
    • All production certificates: Ready ✅
    • Correct staging URLs configured for all services
  • Progress Update: Work Item 4 now 8/14 complete (57%), overall 27/37 (73%)

2025-11-17 (Cluster Health Assessment - 6 New Issues Discovered)

  • Cluster Health & Remediation Issues: Added 6 new child work items to Work Item 4 after comprehensive cluster health check
  • Critical Issues:
    • Child Work Item 4.8: Fix SecretStore Configuration (8 pts, P0) - 44 ExternalSecrets failing, all reference non-existent SecretStores
    • Child Work Item 4.9: Fix Staging Image Tags (5 pts, P0) - 5 staging pods with InvalidImageName/ImagePullBackOff
  • High Priority:
    • Child Work Item 4.10: Fix Extra-Secrets Directory Structure (2 pts, P1) - Missing resources/ directories blocking sync
  • Medium Priority:
    • Child Work Item 4.11: Fix User Guide TLS Certificate (3 pts, P2) - ✅ Complete
    • Child Work Item 4.12: Sync Out-of-Sync Applications (2 pts, P2) - 13 apps OutOfSync or Unknown status
  • Low Priority:
    • Child Work Item 4.13: Clean Up Orphaned Resources (1 pt, P3) - 10 orphaned resources across 2 apps
  • Impact on Progress:
    • Total story points increased from 181 to 202 (+21 points)
    • Work Item 4: Now 4/13 complete (31%) instead of 4/7 (57%)
    • Overall progress: 23/36 child work items (64%) vs 23/30 (77%)
    • Completion estimate updated: 2-3 weeks vs 1-2 weeks
    • Identified 2 critical blockers for staging environment and all services

2025-11-14 (Helm Chart Standardization)

  • Helm Chart Standardization - Jenkins X Pattern Removal (Child Work Item 4.7, 3 pts):
    • Removed all 52 Jenkins X legacy patterns from service Helm charts
    • Updated all 4 services (api, project-management, quartz, web) × 4 files each = 16 files total
    • Replaced jx.imagePullSecrets with standard K8s top-level imagePullSecrets array
    • Replaced jxRequirements.ingress.* with clean ingress.* namespace
    • Removed draft label patterns from all services
    • Root cause: syrf-web ImagePullBackOff due to mismatch between global values and chart expectations
    • All charts validated successfully with helm template
    • Documentation: Created ADR-006-helm-chart-standardization.md
    • Web service had 30 references in ingress.yaml alone (complex host name construction)
  • Work Item 4 Progress: 4/7 child work items complete (57%)
  • Overall Progress: 23/30 child work items (77%), 163/181 story points (90%)

2025-11-13 (Updated - Repository Migration)

  • Repository Migration Completed:
    • Migrated monorepo from camaradesuk/syrf-test to camaradesuk/syrf
    • Backup created at camaradesuk/syrf-web-legacy
    • Force push + rename strategy preserved all GitHub metadata (470+ issues, 47 PRs, discussions)
    • ZenHub workspace continues functioning (same internal repo ID)
    • All branches coexist (3 monorepo + 93 syrf-web = no conflicts)
    • All tags coexist (prefixed vs unprefixed = no conflicts)
    • GitHub automatic redirects: syrf-web URLs → syrf URLs
    • Documentation created: ADR-005 and migration guide
    • NEW Child Work Item 1.8: Repository Migration to Production Name (5 pts) ✅ Complete
    • Updated backlog: 22/29 child work items (76%), 160/178 story points (90%)
    • Updated all issue URLs from syrf-web to syrf
  • External-DNS CrashLoopBackOff Issue RESOLVED:
    • Problem: External-DNS pod crashing with Precondition not met error
    • Root Cause: Trying to delete DNS records from legacy Jenkins X cluster with a different owner ID
    • Solution: Changed policy from sync to upsert-only in infrastructure/external-dns/values.yaml
    • Status: External-DNS now running successfully, creating/updating records without deletion attempts
    • Legacy DNS Records: Orphaned TXT records from legacy cluster preserved until migration complete
    • Documentation: Created cluster-gitops/docs/troubleshooting/external-dns-crashes.md
    • Commits: 6c3de9d (fix), 8ee375d (docs)
  • NEW FEATURES COMPLETED - Child Work Items 2.5 and 2.6:

  • Production Promotion Automation (Child Work Item 2.5, 8 pts):

    • Automated PR creation for production promotion after successful staging deployment
    • PR requires manual review and merge (no GitHub Environment needed)
    • Workflow completes with green checkmark after PR creation
    • PR labeled requires-review with review checklist
    • Implementation: promote-to-production job in ci-cd.yml
    • Commits: 3d4edccd (initial), 42a46855 (simplified for free tier)
  • Deployment Success Notifications (Child Work Item 2.6, 8 pts):

    • ArgoCD PostSync hooks create GitHub commit statuses after successful deployments
    • Kubernetes Job authenticates with GitHub App
    • Status context: argocd/deploy-{environment}
    • Configuration consolidated in shared-values.yaml (DRY principle)
    • Services enable with single flag: deploymentNotification.enabled: true
    • Staging: commit statuses only
    • Production: commit statuses + GitHub Releases
    • Commits: 3d4edccd, 118648da, 74bee73 (DRY config), 034158d0 (docs)
  • Documentation:

    • Created: docs/how-to/production-promotion-and-notifications.md
    • Updated: CLAUDE.md with CI/CD workflow changes
    • Updated: cluster-gitops shared-values.yaml (both environments)
  • Work Item 2 Status: Now 100% complete (6/6 child work items)

  • Total Story Points: Increased from 157 to 173 (+16 points)
  • Overall Progress: 21/28 child work items (75%), 155/173 story points (90%)

2025-11-07

  • Reorganized hierarchy for ZenHub alignment
  • Changed from 5 Epics to 1 Epic containing 5 Work Items
  • Changed 26 User Stories to 26 Child Work Items
  • Updated all terminology throughout document (Executive Summary, Progress table, Sprint Planning, Dependencies Graph)
  • All content and metadata preserved
  • Created GitHub issues in syrf-web repository:
    • Epic #2128: SyRF GitOps Migration (Short-Term Goals pipeline)
    • Work Items #2129-#2155 (5 work items, Sprint Backlog pipeline)
    • Child Work Items #2130-#2159 (26 child work items, Sprint Backlog pipeline)
  • All issues have proper hierarchy, estimates, and pipeline placement

2025-11-03

  • Initial backlog created
  • Analyzed all planning documents
  • Created 26 stories across 5 epics
  • Identified K8s cluster as primary blocker
  • Defined acceptance criteria and story points
  • Organized into sprint recommendations

References

  • PROJECT-STATUS.md - Current implementation status
  • IMPLEMENTATION-PLAN.md - Phase-by-phase plan
  • CLUSTER ARCHITECTURE GOALS.md - Target architecture
  • DEPENDENCY-MAP.yaml - Service/library dependencies
  • CI-CD-DECISIONS.md - Strategic CI/CD decisions
  • cluster-gitops/PLANNING.md - Migration strategy and Jenkins X baseline

Next Update: After Sprint 2 completion or when K8s cluster becomes available