SyRF GitOps Migration - Product Backlog¶
Last Updated: 2025-12-01 Project: SyRF Monorepo + GitOps Migration Sprint Planning: ZenHub/Scrum Board Format
Executive Summary¶
This backlog tracks the migration from Jenkins X to a GitOps-based deployment architecture using GitHub Actions and ArgoCD. The project is organized as one Epic containing Work Items, with each Work Item containing multiple Child Work Items.
Overall Progress¶
| Work Item | Status | Child Work Items Complete | Total Child Work Items | Progress |
|---|---|---|---|---|
| Work Item 1: Monorepo Foundation | ✅ Complete | 8/8 | 8 | 100% |
| Work Item 2: CI/CD Automation | ✅ Complete | 7/7 | 7 | 100% |
| Work Item 3: GitOps Infrastructure | ✅ Complete | 5/5 | 5 | 100% |
| Work Item 4: ArgoCD Deployment | 🔄 In Progress | 9/14 | 14 | 64% |
| Work Item 5: Production Migration | ⏳ Planned | 0/4 | 4 | 0% |
| TOTAL | | 29/38 | 38 | 76% |
Burn-down Estimate¶
- Total Story Points: 210 (updated: +3 for dynamic matrix 2025-11-20)
- Completed: 174 points (83%)
- Remaining: 36 points (17%)
- Estimated Time to Complete: 2-3 weeks
Legend¶
Status Icons:
- ✅ Complete
- 🔄 In Progress
- ⏳ Blocked/Waiting
- 📋 Ready
- 🔮 Future/Backlog
Story Point Scale:
- 1 point = 1-2 hours
- 2 points = 2-4 hours
- 3 points = 4-8 hours (half day)
- 5 points = 1 full day
- 8 points = 2 days
- 13 points = 1 week
- 21 points = 2 weeks
Epic: SyRF GitOps Migration¶
GitHub Issue: #2128 Goal: Migrate from Jenkins X to a GitOps-based deployment architecture using GitHub Actions, ArgoCD, and Kubernetes
Total Work Items: 5 Total Child Work Items: 38 Total Story Points: 210 (updated: +3 for dynamic matrix 2025-11-20) Completed: 174 points (83%) Remaining: 36 points (17%) Overall Status: 🔄 In Progress
Work Items Overview:
- Work Item 1: Monorepo Foundation (58 pts) - ✅ Complete (8/8 child work items)
- Work Item 2: CI/CD Automation (57 pts) - ✅ Complete (7/7 child work items)
- Work Item 3: GitOps Infrastructure (34 pts) - ✅ Complete (5/5 child work items, 100%)
- Work Item 4: ArgoCD Deployment (53 pts) - 🔄 In Progress (9/14 child work items, 64%)
- Work Item 5: Production Migration (34 pts) - ⏳ Planned (0/4 child work items)
Work Item 1: Monorepo Foundation ✅ COMPLETE¶
GitHub Issue: #2129 Goal: Establish monorepo structure with automated semantic versioning
Total Story Points: 58 Status: ✅ Complete (100%)
Child Work Item 1.1: Monorepo Structure Setup ✅¶
GitHub Issue: #2130 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 0 (Completed)
As a developer I want all services and libraries consolidated into a single monorepo So that I can make atomic changes across service boundaries and simplify dependency management
Acceptance Criteria:
- All 4 services moved to `src/services/` (api, project-management, quartz, web)
- All shared libraries moved to `src/libs/`
- Helm charts organized in `src/services/{service}/charts/`
- Root solution file `syrf.sln` created with proper folder structure
- Solution filters (`.slnf`) created for each service
- Git history preserved from original repositories
- All projects build successfully with `dotnet build`
- Directory.Build.props centralized at repository root
Dependencies: None
Technical Notes:
- Completed via migration scripts
- Repository: `camaradesuk/syrf-monorepo` (production ready)
- Test repository: `camaradesuk/syrf-test`
Child Work Item 1.2: GitVersion Configuration ✅¶
GitHub Issue: #2131 Status: ✅ Complete Priority: P0 (Critical) Story Points: 5 Sprint: Sprint 1 (Completed)
As a developer I want automated semantic versioning based on conventional commits So that versions are calculated automatically without manual intervention
Acceptance Criteria:
- GitVersion.yml created for all 5 services (api, pm, quartz, web, s3-notifier)
- All services use `mode: ContinuousDeployment`
- Conventional commit patterns configured (feat:, fix:, chore:)
- Service-specific tag prefixes defined (api-v, pm-v, quartz-v, web-v, s3-notifier-v)
- Path filtering working (services version independently)
- GitVersion.yml removed from shared libraries
- Test commit successfully calculates version
Dependencies: Story 1.1 (Monorepo Structure)
Technical Notes:
- Decision documented in: `GITVERSION-MODE-DECISION.md`
- Used ContinuousDeployment mode instead of Mainline
- All services at 0.1.0 baseline
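Under this scheme, a per-service GitVersion.yml might look like the following sketch. The branch and bump-message values are illustrative assumptions, not the repository's actual configuration; only the mode and the `api-v` tag prefix are taken from this document.

```yaml
# src/services/api/GitVersion.yml — illustrative sketch, not the real config
mode: ContinuousDeployment
tag-prefix: 'api-v'          # service-specific prefix so tags don't collide
branches:
  main:
    increment: Minor
# Conventional-commit bump rules (assumed patterns: feat → minor, fix → patch)
minor-version-bump-message: '^feat(\(.+\))?:'
patch-version-bump-message: '^fix(\(.+\))?:'
```

Because each service has its own `tag-prefix`, GitVersion only considers that service's tags when calculating the next version, which is what makes independent versioning in a monorepo possible.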
Child Work Item 1.3: Chart Version Stabilization ✅¶
GitHub Issue: #2132 Status: ✅ Complete Priority: P0 (Critical) Story Points: 2 Sprint: Sprint 1 (Completed)
As a platform engineer I want Helm Chart versions to remain stable at 0.0.0 So that deployment versions are controlled via git refs and image tags, not chart versions
Acceptance Criteria:
- All Chart.yaml files set to `version: 0.0.0`
- Comment added: "Stable version; deployments via git ref + image tag"
- Policy documented in CLUSTER ARCHITECTURE GOALS.md
- CI/CD workflows do NOT update Chart.yaml versions
- Charts still valid for Helm deployment
Dependencies: None
Technical Notes:
- Aligns with GitOps best practices
- Commit: 941e2a1b (2025-11-03)
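The resulting Chart.yaml is trivially small; a minimal sketch (chart name and appVersion are illustrative):

```yaml
# src/services/api/chart/Chart.yaml — sketch of the stabilized chart header
apiVersion: v2
name: api
version: 0.0.0  # Stable version; deployments via git ref + image tag
appVersion: "0.0.0"
```

Keeping the chart version constant means ArgoCD's `targetRevision` (a git tag) and the image tag in values are the only moving parts, which is exactly the GitOps control point this story establishes.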
Child Work Item 1.4: Dependency Mapping ✅¶
GitHub Issue: #2133 Status: ✅ Complete Priority: P1 (High) Story Points: 8 Sprint: Sprint 1 (Completed)
As a developer I want a clear dependency map of all services and libraries So that I can understand impact of changes and optimize builds
Acceptance Criteria:
- DEPENDENCY-MAP.yaml created as single source of truth
- Complete dependency trees documented for all services
- Docker build context requirements specified
- CI/CD workflow trigger paths defined
- Impact analysis for library changes documented
- Zero circular dependencies verified
- Validation script created (`validate-dependencies.sh`)
Dependencies: Story 1.1 (Monorepo Structure)
Technical Notes:
- File: `architecture/dependency-map.yaml`
- SharedKernel is most critical (affects 3 services)
- Web service has no .NET dependencies
Child Work Item 1.5: CI/CD Path Filtering Optimization ✅¶
GitHub Issue: #2134 Status: ✅ Complete Priority: P1 (High) Story Points: 5 Sprint: Sprint 1 (Completed)
As a developer I want CI/CD workflows to build only changed services So that builds are fast and resource-efficient
Acceptance Criteria:
- Path filters use precise library paths (not broad `src/libs/**`)
- API triggers on 6 specific library paths
- PM triggers on 7 specific library paths
- Quartz triggers on 2 library paths (minimal dependencies)
- Web has no library dependencies
- Test: Change to SharedKernel triggers API, PM, Quartz (not Web)
- Test: Change to Web triggers only Web service
Dependencies: Story 1.4 (Dependency Mapping)
Technical Notes:
- Uses `dorny/paths-filter@v3` action
- Prevents unnecessary builds when unrelated libraries change
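A detect-changes job using this action might look like the sketch below. The specific library paths and filter names are assumptions for illustration; the real filters come from DEPENDENCY-MAP.yaml.

```yaml
# Sketch of a detect-changes job with dorny/paths-filter@v3 (paths illustrative)
jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      api: ${{ steps.filter.outputs.api }}
      web: ${{ steps.filter.outputs.web }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            api:
              - 'src/services/api/**'
              - 'src/libs/SharedKernel/**'  # precise library path, not src/libs/**
            web:
              - 'src/services/web/**'       # Web has no library dependencies
```

Downstream build jobs then gate on `needs.detect-changes.outputs.api == 'true'`, so a SharedKernel change triggers the .NET services but never the Web build.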
Child Work Item 1.6: Documentation Consolidation ✅¶
GitHub Issue: #2135 Status: ✅ Complete Priority: P2 (Medium) Story Points: 8 Sprint: Sprint 1 (Completed)
As a developer I want clear, non-redundant documentation So that I can understand the current state and make informed decisions
Acceptance Criteria:
- CLAUDE.md updated with current architecture
- PROJECT-STATUS.md reflects current implementation status
- IMPLEMENTATION-PLAN.md aligned with actual progress
- Obsolete documents deleted (preserved in git history)
- Path references standardized (`src/services/`, not `services/`)
- GitVersion mode contradiction resolved (all docs use ContinuousDeployment)
- Documentation anti-patterns documented
- README.md rewritten as navigation entry point
Dependencies: All previous stories
Technical Notes:
- Deleted 3 obsolete analysis files
- Adopted hybrid redundancy strategy
- DEPENDENCY-MAP.yaml is now authoritative
Child Work Item 1.7: Build Configuration Optimization ✅¶
GitHub Issue: #2136 Status: ✅ Complete Priority: P2 (Medium) Story Points: 3 Sprint: Sprint 1 (Completed)
As a developer I want optimized Docker build contexts and .dockerignore So that builds are faster and use less disk space
Acceptance Criteria:
- .dockerignore excludes planning/, .github/, docs
- .dockerignore organized by category with comments
- Estimated 20-30% reduction in build context size
- Directory.Build.props enhanced with:
- Common build settings
- Code quality settings
- NuGet package metadata
- Deterministic builds for CI/CD
- Redundant Directory.Build.props files removed from service subdirectories
Dependencies: Story 1.1 (Monorepo Structure)
Technical Notes:
- Root Directory.Build.props is single source of MSBuild configuration
- .dockerignore tested and validated
Child Work Item 1.8: Repository Migration to Production Name ✅¶
GitHub Issue: #2163 Status: ✅ Complete Priority: P1 (High) Story Points: 5 Sprint: Sprint 1 (Completed 2025-11-13)
As a developer I want the monorepo migrated to the production repository name So that all GitHub metadata is preserved in one consolidated location
Acceptance Criteria:
- Create backup of syrf-web in syrf-web-legacy repository
- Force push monorepo branches and tags from syrf-test to syrf-web
- Rename syrf-web to syrf via GitHub settings
- Update all documentation references from syrf-monorepo to syrf
- Update all references from syrf-test to syrf
- Update git remote URLs locally
- Verify all issues (470+) preserved with original IDs
- Verify all PRs (47) preserved and accessible
- Verify ZenHub workspace continues functioning
- Verify branches coexist (no conflicts)
- Verify tags coexist (no conflicts)
- Verify old URLs redirect to new repository
- Create comprehensive migration documentation (ADR-005, migration guide)
Dependencies:
- Story 1.1 (Monorepo Structure Setup)
- Story 1.6 (Documentation Consolidation)
Technical Notes:
- Strategy: Force push + rename to preserve GitHub metadata
- Backup created: `camaradesuk/syrf-web-legacy`
- GitHub automatic redirects: syrf-web URLs → syrf URLs
- Git history preserved: syrf-web main is part of monorepo via git mv
- Branches: 3 monorepo + 93 syrf-web = no conflicts
- Tags: Prefixed (api-v*, pm-v*) vs unprefixed (v*) = no conflicts
- ZenHub: Repository rename transparent (same internal repo ID)
- Files created:
- docs/decisions/ADR-005-repository-migration-strategy.md
- docs/how-to/repository-migration-guide.md
- Documentation updated: 28 files (syrf-monorepo → syrf, syrf-test → syrf)
- Commits: 5db1d9e9 (migration docs), [user executed migration]
Estimated Effort: 5 story points (1 day)
Work Item 2: CI/CD Automation ✅ COMPLETE¶
GitHub Issue: #2137 Goal: Build and push Docker images with automated tagging and promotion
Total Story Points: 57 (updated: +2 for version continuity, +8 for production promotion, +8 for deployment notifications, +3 for dynamic matrix) Status: ✅ Complete (100%) - 7/7 child work items
Child Work Item 2.1: Auto-Version Workflow Cleanup ✅¶
GitHub Issue: #2138 Status: ✅ Complete Priority: P0 (Critical) Story Points: 5 Sprint: Sprint 2 (Planned)
As a developer I want the auto-version workflow to create tags without polluting git history So that versioning is clean and doesn't create commit noise
Acceptance Criteria:
- Remove VERSION file operations from workflow
- Remove Chart.yaml update operations from workflow
- Remove commit creation steps
- Keep tag creation steps
- Modify push step to only push tags (not commits)
- Simplify workflow structure (remove file restoration logic)
- Test: Workflow creates tags but NO commits
- Test: GitVersion still calculates versions correctly
- Ensure version continuity from polyrepos:
- Create baseline tags for each service continuing from last polyrepo version
- API: Last polyrepo v8.20.1 → create baseline tag `api-v8.20.1`
- PM: Last polyrepo v10.44.1 → create baseline tag `pm-v10.44.1`
- Web: Last polyrepo v11.27.0 → create baseline tag `web-v11.27.0`
- Update GitVersion.yml configs with `next-version` if needed
- Test: Next versions increment correctly (api-v8.21.0, pm-v10.45.0, web-v11.28.0)
- Document version mapping in ADR
Dependencies:
- Story 1.2 (GitVersion Configuration)
- Story 1.3 (Chart Version Stabilization)
Technical Notes:
- Aligns with GitOps principle (no auto-commits to source repo)
- Tags are lightweight references, not commits
- File: `.github/workflows/ci-cd.yml` (formerly auto-version.yml, already merged)
- Version Continuity Strategy:
  - Polyrepo tags (v8.20.1) migrated with git history
  - Create prefixed baseline tags at same commits (api-v8.20.1)
  - GitVersion recognizes prefixed tags via `tag-prefix` config
  - Next versions increment from baseline: feat → minor, fix → patch
- Maintains semantic versioning continuity across migration
Estimated Effort: 5 story points (1 day) - UPDATED: +2 points for version continuity = 7 points total
Child Work Item 2.2: Docker Image Build Integration ✅¶
GitHub Issue: #2139 Status: ✅ Complete Priority: P0 (Critical) Story Points: 13 Sprint: Sprint 2 (Planned)
As a platform engineer I want Docker images built and pushed to GHCR automatically So that every version has an immutable container image
Acceptance Criteria:
- Review and validate all Dockerfiles for monorepo structure
- Add Docker build job to auto-version workflow (after version jobs)
- Use matrix strategy for changed services
- Build images with correct build context
- Tag images with the following patterns:
  - `{version}` (e.g., `1.2.3`)
  - `{version}-sha.{shortsha}` (e.g., `1.2.3-sha.abc123`)
  - `latest` (updates with each push from main)
- Push to GHCR using GITHUB_TOKEN
- Test: Trigger workflow with code change
- Test: Verify images exist in GHCR
- Test: Both tags exist and point to same image
Dependencies:
- Story 2.1 (Auto-Version Workflow Cleanup)
- Story 1.4 (Dependency Mapping)
Technical Notes:
- Registry: `ghcr.io/camaradesuk/syrf-{service}`
- Auth: GITHUB_TOKEN (automatic, no PAT needed)
- Build context must include entire monorepo (MSBuild requirement)
- Reference DEPENDENCY-MAP.yaml for required paths
- Implementation Details:
- Created automated Dockerfile generation script (`scripts/generate-dockerfiles.py`)
- Generates cache-optimized Dockerfiles with 5-layer structure
- All Dockerfiles regenerated from dependency-map.yaml
- Fixed PM and Quartz build contexts to use monorepo root
- API and Web contexts already correct
- Cache optimization: ~70% time savings for source code changes
Estimated Effort: 13 story points (1 week)
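The build-and-tag step described above could be sketched as follows. The job/output names (`needs.version.outputs.semver`, `shortsha`) and the Dockerfile path are assumptions; only the registry pattern, the three tag patterns, and the monorepo-root build context come from this story.

```yaml
# Sketch of a matrix Docker build step pushing all three tag patterns to GHCR
- uses: docker/login-action@v3
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}  # automatic token, no PAT needed
- uses: docker/build-push-action@v6
  with:
    context: .                             # monorepo root (MSBuild requirement)
    file: src/services/${{ matrix.name }}/Dockerfile
    push: true
    tags: |
      ghcr.io/camaradesuk/syrf-${{ matrix.name }}:${{ needs.version.outputs.semver }}
      ghcr.io/camaradesuk/syrf-${{ matrix.name }}:${{ needs.version.outputs.semver }}-sha.${{ needs.version.outputs.shortsha }}
      ghcr.io/camaradesuk/syrf-${{ matrix.name }}:latest
```

All three tags point at the same image digest, so the `-sha.` tag gives a permanently unambiguous reference even if a version tag is ever re-pushed.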
Child Work Item 2.3: Build Optimization - Conditional Rebuild ✅¶
GitHub Issue: #2140 Status: ✅ Complete Priority: P2 (Medium) Story Points: 8 Sprint: Sprint 3 (Completed 2025-11-19)
As a platform engineer I want to avoid rebuilding Docker images when only non-code files change So that CI/CD is faster and more resource-efficient
Acceptance Criteria:
- Install `crane` CLI tool in workflow
- Implement change detection logic:
- Detect code vs non-code changes
- Compare files changed since last git tag
- Include shared libraries in detection
- Implement conditional build/retag:
- If no code changes and source image exists: retag using `crane tag`
- If code changed or source missing: build from scratch
- Add monitoring and summary to workflow output
- Test: Chart-only change triggers retag (not rebuild)
- Test: Code change triggers full rebuild
- Test: Shared library change triggers full rebuild (logic verified)
- Test: Missing source image falls back to rebuild (intentionally errors - signals config issue)
- Measure time savings (target: 2-5 min per optimized build) - Achieved: 12s vs 4+ min
Dependencies:
- Story 2.2 (Docker Image Build Integration)
Technical Notes:
- Uses `crane` for manifest-based retagging (no download)
- Transparent to GitOps (ArgoCD only cares that the tag exists)
- Detailed spec in: CLUSTER ARCHITECTURE GOALS.md section 10a
- Implementation Notes (2025-11-19):
- Initial approach using dorny/paths-filter negation patterns didn't work (patterns are OR'd)
- Fixed by using `list-files: shell` and analyzing actual file paths
- Chart-only detection checks if ALL changed files match the `chart/` pattern
- Successfully tested: API chart-only change retagged 9.4.3 → 9.4.4 in 12s
- Missing source image intentionally errors rather than falling back to build (signals configuration issue)
Estimated Effort: 8 story points (2 days)
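The retag path can be sketched as a conditional workflow step. The step id (`steps.detect.outputs.code_changed`) and the `PREV_VERSION`/`NEW_VERSION` variables are hypothetical names for illustration; `crane tag` itself is the mechanism named in this story.

```yaml
# Sketch of the conditional retag path — runs only when no code changed
- name: Retag instead of rebuild (chart-only change)
  if: steps.detect.outputs.code_changed == 'false'
  run: |
    # crane adds a tag to the existing manifest in the registry —
    # no image layers are downloaded or rebuilt
    crane auth login ghcr.io -u "${{ github.actor }}" -p "${{ secrets.GITHUB_TOKEN }}"
    crane tag "ghcr.io/camaradesuk/syrf-api:${PREV_VERSION}" "${NEW_VERSION}"
```

Because the new tag points at the identical digest, the downstream promotion PR and ArgoCD sync behave exactly as they would after a full rebuild, which is why the optimization is transparent to GitOps.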
Child Work Item 2.4: Promotion PR Automation ✅¶
GitHub Issue: #2141 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 2 (Completed)
As a platform engineer I want automatic PRs to cluster-gitops after successful image push So that staging deployments are triggered declaratively
Acceptance Criteria:
- Create GitHub PAT with repo scope for cluster-gitops access
- Add PAT as secret `GITOPS_PAT` to app-monorepo repository
- Add promotion PR job to auto-version workflow
- Install `yq` tool for YAML manipulation
- Update staging values files for changed services: `environments/staging/{service}.values.yaml`
- Set `image.tag: {version}`
- Create PR with:
  - Title: "Promote {services} to {version} (staging)"
  - Body: Image details, source tag, changelog link
  - Auto-label: `promotion`, `staging`, `auto-generated`
- Test: Code change creates promotion PR in cluster-gitops
- Test: PR contains correct version information
- Test: PR is properly formatted and reviewable
Dependencies:
- Story 2.2 (Docker Image Build Integration)
- Story 3.1 (cluster-gitops Repository Complete)
Technical Notes:
- Uses `yq` for YAML updates (preserves formatting)
- PR can be auto-merged or require approval (configurable)
- File updated: `.github/workflows/auto-version.yml`
Estimated Effort: 8 story points (2 days)
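The values bump plus PR creation could be sketched as below. The `SERVICE`/`VERSION` environment variables and the use of `peter-evans/create-pull-request` are assumptions for illustration; `yq`, the file path pattern, the PR title, and the labels follow the criteria above.

```yaml
# Sketch of the staging promotion job steps (action choice is an assumption)
- name: Bump staging image tag
  run: |
    # yq -i edits in place and preserves YAML formatting/comments
    yq -i '.image.tag = strenv(VERSION)' \
      "environments/staging/${SERVICE}.values.yaml"
  env:
    SERVICE: api
    VERSION: ${{ needs.version.outputs.semver }}
- name: Open promotion PR against cluster-gitops
  uses: peter-evans/create-pull-request@v6
  with:
    token: ${{ secrets.GITOPS_PAT }}
    title: "Promote ${{ env.SERVICE }} to ${{ env.VERSION }} (staging)"
    labels: promotion,staging,auto-generated
```

Using `strenv(VERSION)` rather than shell interpolation keeps the version safely quoted inside the yq expression.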
Child Work Item 2.5: Production Promotion Automation ✅¶
GitHub Issue: #2203 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 2 (Completed 2025-11-13)
As a platform engineer I want automated production promotion PRs after successful staging deployment So that production updates are tracked and require manual approval
Acceptance Criteria:
- Add `promote-to-production` job to ci-cd.yml workflow
- Job triggers automatically after successful staging promotion
- Copies service versions from staging to production
- Creates PR to cluster-gitops updating production service values
- PR labeled `requires-review` with review checklist
- PR does NOT auto-merge (requires manual administrator approval)
- Workflow completes (shows green checkmark) after PR creation
- Administrator can review and manually merge PR
- After PR merge, ArgoCD syncs production automatically
- Documentation created: `docs/how-to/production-promotion-and-notifications.md`
Dependencies:
- Story 2.4 (Promotion PR Automation for staging)
Technical Notes:
- Uses GitHub App authentication for PR creation
- No GitHub Environment configuration needed (works on free tier)
- Manual gate happens at PR merge step in cluster-gitops
- Workflow shows success after PR creation, not after deployment
- Commit: 3d4edccd (initial), 42a46855 (simplified)
Estimated Effort: 8 story points (2 days)
Child Work Item 2.6: Deployment Success Notifications ✅¶
GitHub Issue: #2204 Status: ✅ Complete Priority: P1 (High) Story Points: 8 Sprint: Sprint 2 (Completed 2025-11-13)
As a developer I want GitHub commit statuses when ArgoCD successfully deploys services So that I can see deployment status directly on commits and PRs
Acceptance Criteria:
- Create PostSync hook template for all service charts
- Job authenticates with GitHub App
- Creates commit status on source repository
- Status context: `argocd/deploy-{environment}`
- Status description includes service name and version
- Links to deployed service URL
- Optional: Create GitHub Releases for production deployments
- Configuration consolidated in environment shared-values.yaml (DRY principle)
- Services enable with a single flag: `deploymentNotification.enabled: true`
- Common config inherited from shared values
- Documentation updated with DRY configuration approach
- PostSync jobs auto-cleanup after 5 minutes
Dependencies:
- Story 4.2 (ArgoCD Installation) - for testing
- Story 4.3 (Platform Add-ons) - for secrets
Technical Notes:
- PostSync hook runs Kubernetes Job after successful sync
- Uses curlimages/curl:8.10.1 container
- JWT-based GitHub App authentication
- Staging: commit statuses only (createReleaseNote: false)
- Production: commit statuses + releases (createReleaseNote: true)
- DRY: Common config in shared-values, services only set enabled flag
- Files created:
  - `src/services/*/chart/templates/postsync-notify.yaml` (all services)
  - `docs/how-to/production-promotion-and-notifications.md`
  - `environments/staging/shared-values.yaml` (deploymentNotification section)
  - `environments/production/shared-values.yaml` (deploymentNotification section)
- Commits: 3d4edccd, 118648da, 74bee73 (DRY config)
Estimated Effort: 8 story points (2 days)
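Structurally, the PostSync hook is a Kubernetes Job annotated for ArgoCD; a minimal sketch follows. The script body is abbreviated and the `INSTALLATION_TOKEN`/`GIT_SHA` environment variables are assumed to be injected from the GitHub App secret — the real template handles the JWT exchange.

```yaml
# Sketch of a PostSync notification hook (Helm chart template, abbreviated)
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-notify
  annotations:
    argocd.argoproj.io/hook: PostSync            # runs after a successful sync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  ttlSecondsAfterFinished: 300                   # auto-cleanup after 5 minutes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: notify
          image: curlimages/curl:8.10.1
          command: ["sh", "-c"]
          args:
            - |
              # POST a commit status to the source repo (token wiring omitted)
              curl -sf -X POST \
                -H "Authorization: Bearer ${INSTALLATION_TOKEN}" \
                "https://api.github.com/repos/camaradesuk/syrf/statuses/${GIT_SHA}" \
                -d '{"state":"success","context":"argocd/deploy-staging"}'
```

The `HookSucceeded` delete policy plus `ttlSecondsAfterFinished` together guarantee the notification Jobs never accumulate in the namespace.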
Child Work Item 2.7: Dynamic Matrix for Docker Builds ✅¶
GitHub Issue: #2202 Status: ✅ Complete Priority: P2 (Medium) Story Points: 3 Sprint: Sprint 3 (Completed 2025-11-20)
As a developer I want the CI/CD workflow to show unchanged services as "Skipped" rather than "Succeeded" So that I can clearly see which services were actually built in each workflow run
Acceptance Criteria:
- Implement dynamic matrix generation in detect-changes job
- Matrix only includes services that have actually changed
- Each matrix entry contains full service metadata (name, image, dockerfile, context, flags)
- Build-docker job uses dynamic matrix instead of static matrix
- Remove service_changed skip logic from reusable workflow
- Unchanged services show as "Skipped" in GitHub UI (correct behavior)
- Changed services build normally with all metadata preserved
- Web service artifact handling preserved
- Docs service additional checkouts preserved
- Workflow validates successfully
Dependencies:
- Story 2.3 (Build Optimization - Conditional Rebuild)
Technical Notes:
- Replaces static 6-service matrix with dynamic matrix
- Uses `jq` for reliable JSON generation
- Matrix entries include: name, image, dockerfile, context, changed_output, and service-specific flags
- GitHub Actions can only show "Skipped" at job level, not step level
- With static matrix, all jobs run and succeed early (confusing UI)
- With dynamic matrix, jobs for unchanged services don't exist (clean UI)
- Commits: c4da6e23, fc2f97ca, a52d0be2
Estimated Effort: 3 story points (half day)
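One way to generate such a matrix is to filter a service-metadata file against the list of changed filters; a sketch, where `services.json` is a hypothetical array of per-service entries (`name`, `image`, `dockerfile`, `context`, flags) and the filter step is assumed to be `dorny/paths-filter` (whose `changes` output is a JSON array of matched filter names):

```yaml
# Sketch: dynamic matrix — jobs for unchanged services simply don't exist
detect-changes:
  runs-on: ubuntu-latest
  outputs:
    matrix: ${{ steps.matrix.outputs.matrix }}
  steps:
    - uses: actions/checkout@v4
    - uses: dorny/paths-filter@v3
      id: filter
      with:
        filters: .github/filters.yml        # hypothetical filters file
    - id: matrix
      run: |
        # keep only services whose change filter fired
        jq -c --argjson changed '${{ steps.filter.outputs.changes }}' \
          '{include: [.[] | select(.name as $n | $changed | index($n))]}' \
          services.json | sed 's/^/matrix=/' >> "$GITHUB_OUTPUT"

build-docker:
  needs: detect-changes
  if: needs.detect-changes.outputs.matrix != '{"include":[]}'
  strategy:
    matrix: ${{ fromJSON(needs.detect-changes.outputs.matrix) }}
  runs-on: ubuntu-latest
  steps:
    - run: echo "building ${{ matrix.name }} from ${{ matrix.dockerfile }}"
```

Because the matrix only ever contains changed services, GitHub shows the other jobs as "Skipped" (or not at all), which is the clean-UI behavior this story targets.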
Work Item 3: GitOps Infrastructure ✅ COMPLETE¶
GitHub Issue: #2142 Goal: Establish cluster-gitops repository with ArgoCD configuration
Total Story Points: 34 Status: ✅ Complete (5/5 child items complete, 34/34 pts = 100%)
Child Work Item 3.1: cluster-gitops Repository Complete ✅¶
GitHub Issue: #2143 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 1 (Completed)
As a platform engineer I want a complete cluster-gitops repository structure So that ArgoCD can declaratively manage cluster state
Acceptance Criteria:
- Repository created: `camaradesuk/cluster-gitops`
- Directory structure established:
- bootstrap/ (App-of-Apps)
- projects/ (AppProjects)
- clusters/{staging,prod}/apps/
- applicationsets/
- envs/_global/
- envs/syrf/{api,project-management,quartz,web}/
- Initial values files created for all 4 services
- README and SETUP-INSTRUCTIONS.md documented
- Initial skeleton committed and pushed
- PLANNING.md created with migration strategy
Dependencies: None
Technical Notes:
- Repository: `github.com/camaradesuk/cluster-gitops`
- Visibility: Private
- Multi-source pattern ready for ArgoCD ≥2.6
Child Work Item 3.2: ArgoCD Application Manifests ✅¶
GitHub Issue: #2144 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 2 (Completed 2025-11-12)
As a platform engineer I want ArgoCD Application definitions for all services So that services can be deployed via GitOps
Acceptance Criteria:
- Create AppProject definitions (6 projects: syrf-staging, syrf-production, preview, plugins, default, bootstrap)
- Create Application manifests via ApplicationSets:
  - `argocd/applicationsets/syrf.yaml` - Matrix generator for all services
  - `argocd/applicationsets/plugins.yaml` - Infrastructure components
  - `argocd/applicationsets/argocd-infrastructure.yaml` - ArgoCD components
- Configure multi-source pattern:
- Source 1: Chart from monorepo at specific `targetRevision` tag
- Source 2: Values from cluster-gitops repository
- Source 3: Optional resources directory
- Configure sync policies:
- Staging: automated (prune + selfHeal)
- Production: automated with selfHeal disabled
- Set `targetRevision` policy using service tags (`{service}-vX.Y.Z`)
- Test: Render manifests locally with `helm template`
Dependencies:
- Story 3.1 (cluster-gitops Repository Complete) ✅
Technical Notes:
- Uses ArgoCD multi-source pattern (≥2.6)
- ApplicationSets auto-generate Applications from environment configs
- Values interpolation via `$values` reference
- CreateNamespace=true for automatic namespace creation
Estimated Effort: 8 story points (2 days)
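The multi-source pattern above can be sketched as a single rendered Application (in practice the ApplicationSet generates these). The version tag and exact values path are illustrative; the two-repo split and `$values` reference are the pattern this story implements.

```yaml
# Sketch of a multi-source Application (ArgoCD ≥2.6) generated for staging
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: syrf-api-staging
  namespace: argocd
spec:
  project: syrf-staging
  sources:
    - repoURL: https://github.com/camaradesuk/syrf
      targetRevision: api-v8.21.0            # service tag pins the chart (illustrative)
      path: src/services/api/chart
      helm:
        valueFiles:
          - $values/syrf/environments/staging/api/values.yaml
    - repoURL: https://github.com/camaradesuk/cluster-gitops
      targetRevision: main
      ref: values                            # referenced as $values above
  destination:
    server: https://kubernetes.default.svc
    namespace: syrf-staging
  syncPolicy:
    automated:
      prune: true
      selfHeal: true                         # staging only; disabled in production
    syncOptions:
      - CreateNamespace=true
```

The chart and the values live in different repositories at different revisions, which is exactly what lets CI/CD promote by editing only cluster-gitops.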
Child Work Item 3.3: Environment Values Configuration ✅¶
GitHub Issue: #2145 Status: ✅ Complete Priority: P0 (Critical) Story Points: 5 Sprint: Sprint 2 (Completed 2025-11-12)
As a platform engineer I want environment-specific values for all services So that staging and production have appropriate resource allocations
Acceptance Criteria:
- Create global values (`global/values.yaml`)
- Create environment-specific shared values:
  - `syrf/environments/staging/shared-values.yaml`
  - `syrf/environments/production/shared-values.yaml`
- Create service-specific values for 6 services × 2 environments: `syrf/environments/{staging,production}/{api,web,project-management,quartz,docs,user-guide}/`
- Each service has `config.yaml` (chart reference) and `values.yaml` (Helm values)
- Configure for each environment:
- Image repository and tag (via CI/CD promotion)
- Replica counts
- Resource requests/limits
- Ingress hosts and TLS
- Environment variables
- Health check settings
- Document configuration knobs in comments
- Validate YAML syntax
Dependencies:
- Story 3.1 (cluster-gitops Repository Complete) ✅
Technical Notes:
- Environment namespace.yaml contains sync policies
- Shared-values.yaml contains common config (deployment notifications, etc.)
- Service config.yaml updated automatically by CI/CD promotion workflow
- Staging: automated sync; Production: automated with manual PR merge gate
Estimated Effort: 5 story points (1 day)
Child Work Item 3.4: ApplicationSet for PR Previews ✅¶
GitHub Issue: #2146 Status: ✅ Complete (manually tested and verified 2025-12-01) Priority: P1 (High) Story Points: 8 Sprint: Sprint 3 (Completed 2025-12-01)
As a developer I want ephemeral preview environments for PRs So that I can test changes before merging
Acceptance Criteria:
- Create ApplicationSet definition (`applicationsets/syrf-previews.yaml`)
- Configure Pull Request Generator:
  - Watch syrf PRs with `preview` label
  - GitHub App credentials (github-app-repo-creds secret)
  - Requeue every 300 seconds
- Template Application spec:
  - Name: `syrf-pr-{{number}}-{{serviceName}}`
  - Namespace: `pr-{{number}}`
  - Chart source: PR head SHA
  - Image tag: `pr-{{number}}`
  - Ingress: `pr-{{number}}-{{serviceName}}.staging.syrf.org.uk`
- Configure sync policy:
- Automated (prune + selfHeal)
- CreateNamespace=true
- Test: Open PR creates preview environment
- Test: PR close deletes preview environment
- Document preview URL pattern (`docs/how-to/use-pr-preview-environments.md`)
Completed Components:
- GitHub Actions workflow (`pr-preview.yml`) - builds images with `pr-{number}` tag
- Preview AppProject (`argocd/projects/preview.yaml`) - allows `pr-*` namespaces
- Preview common values (`syrf/environments/preview/common.values.yaml`)
- Documentation (381-line comprehensive guide)
- ApplicationSet with PullRequest generator - `syrf-previews.yaml` created
- GitHub credentials secret config - github-app-repo-creds ExternalSecret added
Remaining Work:
- Create `camarades-github-app-installation-id` secret in GCP Secret Manager
- Push cluster-gitops changes and verify ArgoCD sync
- Test: Open PR with 'preview' label → preview environment created
- Test: Close PR → preview environment deleted
Final State: All components complete. PR Preview environments fully operational - manually tested and verified 2025-12-01.
Dependencies:
- Story 3.2 (ArgoCD Application Manifests) ✅
- Story 4.2 (ArgoCD Installation) ✅
Technical Notes:
- Requires ApplicationSet with `pullRequest` generator
- GitHub PAT or GitHub App credentials needed
- Ephemeral namespaces automatically cleaned up on PR close
- Preview URLs: `pr-{number}.staging.syrf.org.uk`
Estimated Effort: 8 story points
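The generator and template described above could be sketched as follows, simplified to a single service (the real ApplicationSet supplies `{{serviceName}}` per service). The `{{head_sha}}` variable and the secret wiring are illustrative; label filtering, namespace, image tag, and requeue interval follow the criteria above.

```yaml
# Sketch of the preview ApplicationSet (single-service simplification)
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: syrf-previews
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: camaradesuk
          repo: syrf
          labels: [preview]                  # only PRs carrying the preview label
        requeueAfterSeconds: 300
  template:
    metadata:
      name: 'syrf-pr-{{number}}-api'
    spec:
      project: preview
      source:
        repoURL: https://github.com/camaradesuk/syrf
        targetRevision: '{{head_sha}}'       # chart taken from the PR head
        path: src/services/api/chart
        helm:
          parameters:
            - name: image.tag
              value: 'pr-{{number}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: 'pr-{{number}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
```

When the PR closes, the generator stops emitting the Application and automated pruning tears the `pr-{number}` namespace's resources down, which is what makes the environments ephemeral.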
Child Work Item 3.5: Infrastructure Dependencies Analysis ✅¶
GitHub Issue: #2147 Status: ✅ Complete Priority: P0 (Critical) Story Points: 5 Sprint: Sprint 2 (Completed 2025-11-12)
As a platform engineer I want to identify all infrastructure dependencies for SyRF services So that the new cluster has all required components before migration
Acceptance Criteria:
- Document required infrastructure components:
- Ingress controller (ingress-nginx v4.11.1)
- cert-manager (v1.15.0) for TLS
- external-dns (v1.14.5) for DNS management
- RabbitMQ (v14.6.6) for inter-service messaging
- External Secrets Operator for secret management (Google Secret Manager)
- Create Helm charts or manifests for each component (`plugins/helm/` directory)
- Define installation order (documented in `docs/cluster-bootstrap.md`)
- Create bootstrap Application for platform add-ons (`argocd/bootstrap/root.yaml`)
- Document configuration requirements (per-component values.yaml files)
- Create smoke test checklist for each component
Dependencies:
- Story 3.1 (cluster-gitops Repository Complete) ✅
Technical Notes:
- All components deployed via GitOps (plugins ApplicationSet)
- Each component has `config.yaml` + `values.yaml` + `resources/` directory
- RabbitMQ is CRITICAL (required by all .NET services)
- ESO uses ClusterSecretStore with Google Secret Manager backend
- Workload Identity configured for external-dns and ESO
Estimated Effort: 5 story points (1 day)
Work Item 4: ArgoCD Deployment 🔄 IN PROGRESS¶
GitHub Issue: #2148 Goal: Install and configure ArgoCD on new Kubernetes cluster
Total Story Points: 53 (32 + 21 cluster remediation issues discovered 2025-11-17) Status: 🔄 In Progress (64% - 9/14 child work items complete) Blocker Resolved: Cluster provisioned on 2025-11-12 Blocker Resolved: ExternalSecrets fixed on 2025-11-18 (Story 4.8) Blocker Resolved: Image tags fixed on 2025-11-18 (Story 4.9) Blocker Resolved: identity-server dependency removed on 2025-11-18 (Story 4.9) Current Blockers: None
Child Work Item 4.1: Kubernetes Cluster Provisioning ✅¶
GitHub Issue: #2149 Status: ✅ Complete Priority: P0 (Critical) Story Points: 13 Sprint: Sprint 2 (Completed 2025-11-12)
As a platform engineer I want a new Kubernetes cluster provisioned So that I can install ArgoCD and deploy services
Acceptance Criteria:
- Decision made: GKE (Google Kubernetes Engine)
- Cluster provisioned with Terraform:
- Cluster: camaradesuk, europe-west2-a
- Kubernetes version: 1.33.5-gke.1201000
- Nodes: 3-6 (autoscaling), e2-standard-2
- Features: Workload Identity, VPA, Shielded Nodes
- kubectl access configured locally
- Cluster connectivity validated
- Basic namespaces created via ArgoCD
- Document cluster details in camarades-infrastructure repo
Dependencies: None (but blocks all other Epic 4 stories)
Technical Notes:
- Recommended: GKE europe-west2-a (continuity with Jenkins X)
- Alternative: Any Kubernetes 1.27+ cluster
- This is the PRIMARY BLOCKER for GitOps migration
Estimated Effort: 13 story points (1 week - including approval/provisioning time)
Child Work Item 4.2: ArgoCD Installation ✅¶
GitHub Issue: #2150 Status: ✅ Complete Priority: P0 (Critical) Story Points: 5 Sprint: Sprint 2 (Completed 2025-11-12)
As a platform engineer I want ArgoCD installed on the new cluster So that GitOps-based deployments can begin
Acceptance Criteria:
- Install ArgoCD in `argocd` namespace (HA mode with Helm)
- Verify all ArgoCD components are running
- Access ArgoCD UI via Ingress (argocd.camarades.net)
- TLS certificate configured with Let's Encrypt
- ArgoCD admin password available via secret
- GitHub credential template created for repository access
Dependencies:
- Story 4.1 (K8s Cluster Provisioning)
Technical Notes:
- Install command: `kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml`
- Wait for: `kubectl wait --for=condition=available --timeout=300s deployment/argocd-server -n argocd`
- Password: `kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d`
Estimated Effort: 5 story points (1 day)
Child Work Item 4.3: Platform Add-ons Installation ✅¶
GitHub Issue: #2151 Status: ✅ Complete Priority: P0 (Critical) Story Points: 8 Sprint: Sprint 2 (Completed 2025-11-12)
As a platform engineer I want all required infrastructure components installed So that SyRF services have the dependencies they need
Acceptance Criteria:
- Install cert-manager v1.15.0 for TLS certificates
- Install ingress-nginx v4.11.1 for HTTP routing (LoadBalancer: 34.13.36.98)
- Install external-dns v1.14.5 for DNS management (with Workload Identity)
- Install RabbitMQ v14.6.6 (REQUIRED for SyRF services)
- Configure each component via ArgoCD Applications
- Verify all components are healthy and synced
- Document configuration in cluster-gitops/docs/cluster-bootstrap.md
Dependencies:
- Story 4.2 (ArgoCD Installation)
- Story 3.5 (Infrastructure Dependencies Analysis)
Technical Notes:
- RabbitMQ is CRITICAL - services cannot start without it
- Secret management: ESO with Google Secret Manager (current setup)
- Use ArgoCD Applications for declarative installation
Estimated Effort: 8 story points (2 days)
Child Work Item 4.4: App-of-Apps Bootstrap ✅¶
GitHub Issue: #2152 Status: ✅ Complete Priority: P0 (Critical) Story Points: 3 Sprint: Sprint 3 (Completed 2025-11-12)
As a platform engineer I want ArgoCD bootstrapped via App-of-Apps pattern So that all applications are managed declaratively from Git
Acceptance Criteria:
- Create bootstrap Application (`bootstrap/root.yaml`)
- Configure to watch `apps/` directory
- Apply bootstrap Application to cluster
- Verify ArgoCD creates child Applications
- All Applications appear in ArgoCD UI
- Sync status is healthy
- Document bootstrap procedure
Dependencies:
- Story 4.2 (ArgoCD Installation)
- Story 3.2 (ArgoCD Application Manifests)
Technical Notes:
- Bootstrap Application lives in cluster-gitops/bootstrap/
- Creates Applications recursively from apps/ directory
- Once applied, entire cluster state is Git-driven
- Tested pruning: Applications auto-delete when YAML removed from Git
- Updated cluster-bootstrap.md with App-of-Apps pattern
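The bootstrap pattern described in the notes above can be sketched as a single root Application; this is an illustrative sketch only — the `repoURL` and exact paths are assumptions, not the actual manifest in `cluster-gitops/bootstrap/`:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/camaradesuk/cluster-gitops  # assumed URL
    targetRevision: main
    path: apps
    directory:
      recurse: true   # create child Applications from apps/ recursively
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true     # matches the tested pruning behaviour noted above
      selfHeal: true
```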
Estimated Effort: 3 story points (half day)
Child Work Item 4.5: First Service Deployment (Canary) 🔄¶
GitHub Issue: #2153 Status: 🔄 In Progress (75% complete) Priority: P1 (High) Story Points: 5 Sprint: Sprint 3 (Started 2025-11-12)
As a platform engineer I want to deploy one service to staging as a canary So that I can validate the entire GitOps flow before deploying all services
Acceptance Criteria:
- Choose canary service (API selected)
- Apply API Application manifest to ArgoCD (via App-of-Apps)
- Verify ArgoCD syncs successfully (Synced)
- Service pods are running and healthy (Progressing - waiting for secrets)
- Ingress is accessible (smoke test endpoint) - BLOCKED by missing secrets
- Check logs for errors - BLOCKED by missing secrets
- Verify RabbitMQ connectivity - BLOCKED by missing secrets
- Document any issues encountered
- Create runbook for common operations
Dependencies:
- Story 4.3 (Platform Add-ons Installation) ✅
- Story 4.4 (App-of-Apps Bootstrap) ✅
Progress Summary (2025-11-12):
✅ Completed:
- All 4 .NET services deployed (API, PM, Quartz, Web)
- ArgoCD Applications created via App-of-Apps
- All showing `Synced` status
- Charts successfully templated
- Pods created (Progressing state)
- Fixed 2 critical deployment issues:
  - Environment variable format: changed from array to map format in all staging values files
  - Image references: updated Helm templates from Jenkins X pattern to standard Values pattern
- Documentation created:
  - `/docs/how-to/required-secrets.md` - complete guide for all 14 required secrets
  - Includes YAML templates, verification commands, ESO examples, troubleshooting
- Triggered documentation service builds:
  - Committed changes to trigger CI/CD for `syrf-user-guide` and `syrf-docs`
  - Docker images building (commit: 421f76b5)
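The environment-variable format fix mentioned above can be illustrated as follows; the variable name and value are hypothetical, not taken from the actual staging values files:

```yaml
# Old (array format) - hypothetical example
env:
  - name: RABBITMQ_HOST
    value: rabbitmq.rabbitmq.svc

# New (map format) - hypothetical example
env:
  RABBITMQ_HOST: rabbitmq.rabbitmq.svc
```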
⏳ Blockers Identified:
- Missing Kubernetes Secrets (Critical - blocks all .NET services):
  - auth0, identity-server, swagger-auth, public-api
  - mongo-db, elastic-db, dev-postgres-credentials
  - rabbit-mq, aws-s3, aws-ses
  - google-sheets, rob-api-credentials
  - elastic-apm, sentry
  - Recommendation: set up External Secrets Operator
- Missing Docker Images (syrf-docs, syrf-user-guide):
  - Images don't exist yet in GHCR
  - Build triggered via commit 421f76b5
  - Expected completion: ~5-10 minutes
Current Application Status:
Platform Services:
✅ ingress-nginx: Synced, Healthy
✅ cert-manager: Synced, Healthy
✅ external-dns: Synced, Healthy
🔄 rabbitmq: Synced, Progressing
SyRF Services:
🔄 syrf-api: Synced, Progressing (waiting for secrets)
🔄 syrf-project-management: Synced, Progressing (waiting for secrets)
🔄 syrf-quartz: Synced, Progressing (waiting for secrets)
🔄 syrf-web: Synced, Progressing (waiting for secrets)
❌ syrf-docs: Synced, Degraded (ImagePullBackOff - building)
❌ syrf-user-guide: Synced, Degraded (ImagePullBackOff - building)
Next Steps:
- Wait for user-guide/docs images to build (~5-10 min)
- Set up External Secrets Operator OR create secrets manually
- Verify all services start successfully
- Test ingress endpoints
- Complete acceptance criteria
Technical Notes:
- Use baseline version from Jenkins X (see PLANNING.md)
- API service is good canary (simpler than PM)
- Validate entire stack before other services
- App-of-Apps pattern validated successfully
Estimated Effort: 5 story points (1 day)
Child Work Item 4.6: End-to-End GitOps Flow Validation ⏳¶
GitHub Issue: #2154 Status: ⏳ Blocked Priority: P1 (High) Story Points: 8 Sprint: TBD
As a platform engineer I want to validate the complete GitOps workflow So that I can confirm all automation works as designed
Acceptance Criteria:
- Test: Make code change to one service
- Test: Verify auto-version creates tag
- Test: Verify Docker image is built and pushed
- Test: Verify promotion PR is created to cluster-gitops
- Test: Merge promotion PR
- Test: Verify ArgoCD syncs staging environment
- Test: Verify service is deployed with new version
- Test: Open PR in app-monorepo
- Test: Verify preview environment is created
- Test: Close PR and verify preview cleanup
- Test: Create manual production promotion
- Test: Verify production deployment
- Test: Rollback by reverting promotion PR
- Document timing metrics (commit → staging deployment time)
Dependencies:
- Story 4.5 (First Service Deployment)
- Story 2.2 (Docker Image Build Integration)
- Story 2.4 (Promotion PR Automation)
- Story 3.4 (ApplicationSet for PR Previews)
Technical Notes:
- This validates the ENTIRE GitOps architecture
- Target: commit → staging < 10 min p50
- Target: preview ready < 2 min
- Document any issues for optimization
Estimated Effort: 8 story points (2 days)
Child Work Item 4.7: Helm Chart Standardization - Jenkins X Pattern Removal ✅¶
GitHub Issue: #2172 Status: ✅ Complete Priority: P0 (Critical) Story Points: 3 Sprint: Sprint 3 (Completed 2025-11-14)
As a platform engineer I want all Jenkins X legacy patterns removed from Helm charts So that charts use standard Kubernetes conventions and are maintainable
Acceptance Criteria:
- Remove all `jx.imagePullSecrets` references (use top-level `imagePullSecrets` array)
- Remove all `jxRequirements.ingress.*` references (use `ingress.*`)
- Remove all `draft` label patterns
- Update all 4 service charts (api, project-management, quartz, web)
- Validate all charts render successfully with `helm template`
- Document changes in ADR-006
- Update environment values in cluster-gitops to match new structure
Dependencies:
- Story 4.5 (First Service Deployment) - discovered issue during deployment
Scope Summary:
- 52 `jx` references removed across 16 files (4 services × 4 files)
- Services updated: api, project-management, quartz, web
- Files per service: values.yaml, deployment.yaml, ingress.yaml, canary.yaml
- Root cause: syrf-web ImagePullBackOff due to `jx.imagePullSecrets` vs top-level `imagePullSecrets` mismatch
Technical Notes:
- Web service had 30 jx references in ingress.yaml alone (complex host name construction)
- Used bulk `sed` replacements for efficiency in web ingress.yaml
- All charts validated with `helm template` after changes
- ADR-006 created: docs/decisions/ADR-006-helm-chart-standardization.md
- Follow-up required: Update cluster-gitops environment values to use new structure
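A minimal before/after sketch of the pattern change (field names come from the acceptance criteria; the secret name is illustrative):

```yaml
# Before: Jenkins X legacy pattern
jx:
  imagePullSecrets:
    - ghcr-secret   # illustrative secret name

# After: standard top-level imagePullSecrets array
imagePullSecrets:
  - name: ghcr-secret
```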
Estimated Effort: 3 story points (half day)
Child Work Item 4.8: Fix SecretStore Configuration (External Secrets Migration) ✅¶
GitHub Issue: #2195 Status: ✅ Complete (2025-11-18) Priority: P0 (Critical - blocks ALL services) Story Points: 8 Sprint: Sprint 3
As a platform engineer I want all ExternalSecrets migrated from v1beta1 SecretStore to v1 ClusterSecretStore So that services can retrieve secrets from GCP Secret Manager and start successfully
Acceptance Criteria:
- Migrate extra-secrets-staging to use ClusterSecretStore pattern
- Migrate extra-secrets-production to use ClusterSecretStore pattern
- Update all ExternalSecret references from SecretStore to ClusterSecretStore
- Verify ClusterSecretStore is READY in staging and production
- Test 3-5 critical secrets sync successfully (auth0, mongo-db, rabbit-mq)
- All 40 ExternalSecrets show READY: True status (20 staging + 20 production)
- Document migration in cluster-gitops (chart templates documented)
- Update environment values to reference ClusterSecretStore
Completion Summary (2025-11-18):
What was done:
- Chart Template Updates:
  - Added shorthand `secrets` format to extra-secrets chart for cleaner values files
  - Created ClusterExternalSecret template (`cluster-external-secrets.yaml`) for cluster-wide secrets
  - All ExternalSecrets now use `kind: ClusterSecretStore` instead of `kind: SecretStore`
  - Updated API version from `v1beta1` to `v1`
- ClusterExternalSecrets Created (deployed via argocd-secrets):
  - `ghcr-secret` → argocd, syrf-staging, syrf-production
  - `rabbit-mq` → rabbitmq, syrf-staging, syrf-production
  - `github-app-credentials` → argocd, syrf-staging, syrf-production
  - Added `ClusterExternalSecret` to argocd project whitelist
- Values Files Simplified:
  - Staging: 17 namespace-scoped secrets using shorthand format
  - Production: 17 namespace-scoped secrets using shorthand format
  - Removed duplicates (ghcr-secret, rabbit-mq, github-app-credentials) now handled by ClusterExternalSecrets
- Additional Fixes:
  - Fixed webhook HMAC verification (trailing newline in GCP secret)
  - Removed stuck finalizers from ExternalSecrets blocking deletion
Final State:
- ✅ 20 ExternalSecrets in syrf-staging: All SecretSynced
- ✅ 20 ExternalSecrets in syrf-production: All SecretSynced
- ✅ 3 ClusterExternalSecrets: All Ready, provisioned to all target namespaces
- ✅ GitHub webhook working (instant sync on push)
- ✅ ClusterSecretStore `gcpsm-secret-store` serving all namespaces
Files Modified:
- `charts/extra-secrets/templates/external-secrets.yaml` - added shorthand format
- `charts/extra-secrets/templates/cluster-external-secrets.yaml` - new file
- `charts/extra-secrets/values.yaml` - documented new formats
- `argocd/local/argocd-secrets/values.yaml` - added ClusterExternalSecrets config
- `argocd/projects/argocd.yaml` - added ClusterExternalSecret to whitelist
- `plugins/local/extra-secrets-staging/values.yaml` - simplified with shorthand format
- `plugins/local/extra-secrets-production/values.yaml` - simplified with shorthand format
Dependencies:
- Story 4.3 (Platform Add-ons) ✅ Complete - ESO installed
- Story 4.5 (First Service Deployment) - blocked by this issue
Technical Notes:
- Reference working pattern: `argocd/local/argocd-secrets/values.yaml`
- ClusterSecretStore enables cross-namespace secret access
- Workload Identity already configured for ESO: `external-secrets@camarades-net.iam.gserviceaccount.com`
- IAM binding already exists: `roles/iam.workloadIdentityUser`
- Only chart updates needed, no infrastructure changes
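The migrated pattern described above might look roughly like this for a single secret; the store name matches the document, but the GCP Secret Manager key mapping is an assumption for illustration:

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: rabbit-mq
  namespace: syrf-staging
spec:
  secretStoreRef:
    kind: ClusterSecretStore    # was kind: SecretStore under v1beta1
    name: gcpsm-secret-store
  target:
    name: rabbit-mq             # Kubernetes Secret to create
  dataFrom:
    - extract:
        key: rabbit-mq          # assumed GCP Secret Manager key name
```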
Estimated Effort: 8 story points (2 days - includes testing and verification)
Child Work Item 4.9: Fix Staging Environment Image Tags ✅¶
GitHub Issue: #2196 Status: ✅ Complete (2025-11-18) Priority: P0 (Critical - staging completely broken) Story Points: 5 Sprint: Sprint 3
As a developer I want staging service pods to have valid image tags So that staging environment is functional for testing
Acceptance Criteria:
- Identify root cause of empty image tags in staging
- Fix deployment manifests or Helm values causing empty tags
- Update staging values with correct image tags for all services
- Document fix in cluster-gitops troubleshooting
- Verify all staging pods transition to Running state
- Delete failed pods (InvalidImageName/ImagePullBackOff)
- Verify new pods start successfully with valid images
Completion Summary (2025-11-18):
Root Cause Identified:
- ApplicationSet removed image.tag parameters (commit bbfd0cd)
- Staging values files didn't have explicit image configuration
- Result: Helm rendered `:` (empty repository and tag)
Fixes Applied:
- syrf monorepo (commit df49793a):
  - Renamed `pm` → `project-management` throughout CI/CD workflow
  - Updated Docker image name: `syrf-pm` → `syrf-project-management`
  - Updated git tag prefix: `pm-v` → `project-management-v`
  - CI/CD now sets both `chartTag` and `imageTag` when promoting
- cluster-gitops (commit 4b5eab1):
  - Added `image.repository` to all service base values.yaml files
  - Added ApplicationSet parameter to derive `image.tag` from `service.imageTag`
  - Updated all environment configs with `imageTag` field
  - Updated project-management `chartTag`: `pm-v11.2.0` → `project-management-v11.2.0`
  - Created compatibility git tag: `project-management-v11.2.0`
- Temporary workaround (commit 09eb23b):
  - project-management uses `syrf-pm` image until next CI/CD build
  - TODO in values.yaml to change to `syrf-project-management` after build
- Versioning error fix (2025-11-18):
  - CI/CD created incorrect `project-management-v1.0.0` tag (GitVersion ran before compatibility tag)
  - Deleted incorrect tag, created correct `project-management-v11.3.0` tag
  - Updated staging config: `chartTag: project-management-v11.3.0`, `imageTag: "11.2.0"`
  - Commit 2205291 in cluster-gitops
- Project-management rename completion (2025-11-18):
  - Triggered new CI/CD build which created `project-management-v11.3.1` tag
  - Updated cluster-gitops image.repository from `syrf-pm` to `syrf-project-management`
  - Final staging config: `chartTag: project-management-v11.3.2`, `imageTag: "11.3.2"`
- Identity-server removal (2025-11-18):
  - Removed unused IdentityServer4.AccessTokenValidation package from API (now using Auth0)
  - Removed identityServer config blocks from all 4 service Helm values.yaml files
  - Removed identity-server secret environment variables from 3 deployment templates
  - Added IdentityModel.AspNetCore.OAuth2Introspection for TokenRetrieval (was a transitive dependency)
  - Updated required-secrets.md - removed identity-server from required secrets list
Final Service Versions (All Healthy):
- api: 9.2.3
- project-management: 11.3.2
- quartz: 0.5.1
- web: 5.4.2
- docs: 1.6.5
- user-guide: 1.1.0
Architecture Improvement:
- `image.repository` is now explicit in service values.yaml (static)
- `image.tag` is derived via ApplicationSet from `service.imageTag`
- CI/CD sets both `chartTag` (chart version) and `imageTag` (Docker tag) on promotion
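The repository/tag split described in this section might look roughly like the following; this is an illustrative sketch, not the exact files in cluster-gitops (the registry path and templating syntax are assumptions):

```yaml
# Service base values.yaml - repository is static
image:
  repository: ghcr.io/camaradesuk/syrf-api   # assumed registry path

# Environment config consumed by the ApplicationSet
service:
  imageTag: "9.2.3"

# ApplicationSet template - derives image.tag from the environment config
helm:
  parameters:
    - name: image.tag
      value: '{{ .service.imageTag }}'       # assumed templating syntax
```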
Previous State (2025-11-17):
- ❌ 5 failed pods in syrf-staging namespace
- ❌ `syrf-api-657c97878c-jzdd6`: InvalidImageName (image `:` - no registry, no tag)
- ❌ `syrf-projectmanagement-9cdbf4465-tpkpw`: InvalidImageName
- ❌ `syrf-projectmanagement-6fc8d864f5-65pf7`: ImagePullBackOff
- ❌ `syrf-quartz-d6696d6d5-sr846`: InvalidImageName
- ❌ `syrf-web-7468b67d77-t7bl6`: InvalidImageName
- ✅ Older pods still running: `syrf-api-5758596878`, `syrf-quartz-68b97d8994`, `syrf-web-65c666df64`
Root Cause Analysis (1 hour):
- Check staging values files in cluster-gitops for image.tag configuration
- Compare with production values (production is working)
- Check ApplicationSet or Application manifests for templating issues
- Review recent commits that may have introduced the issue
- Check if CI/CD promotion PR left tags empty
Remediation Steps (2-4 hours):
- Immediate fix: update staging values with working image tags (30 min)
  - Set explicit image.tag values for all services
  - Use last known working versions from old pods
  - Create PR to cluster-gitops
- Sync and verify: ArgoCD sync and pod replacement (1 hour)
  - Sync staging Applications
  - Delete failed pods
  - Wait for new pods to start
  - Verify Running status
- Permanent fix: fix root cause in values/templates (1-2 hours)
  - Fix Helm templating if the issue is in the chart
  - Fix CI/CD promotion workflow if the issue is in automation
  - Add validation to prevent empty tags in the future
  - Test with dry-run deployment
- Documentation: update troubleshooting guide (30 min)
  - Document common image tag issues
  - Add verification commands
  - Create runbook for fixing failed pods
Dependencies:
- Story 4.5 (First Service Deployment) - blocked by this issue
- May depend on Story 2.4 (Promotion PR Automation) if CI/CD is root cause
Technical Notes:
- Old working pods can provide reference for correct image tags
- Issue likely introduced in recent deployment or promotion
- Compare staging vs production Application specs
- May need to rollback recent changes to cluster-gitops
Estimated Effort: 5 story points (1 day - includes analysis and fix)
Child Work Item 4.10: Fix Extra-Secrets ApplicationSet Directory Structure ✅¶
GitHub Issue: #2197 Status: ✅ Completed (2025-11-18) Priority: P1 (High - blocks extra-secrets deployment) Story Points: 2 Sprint: Sprint 3
As a platform engineer I want extra-secrets Applications to sync successfully So that ClusterSecretStore and ExternalSecrets can be deployed
Acceptance Criteria:
- Create missing `resources/` directories in cluster-gitops
- Add `.gitkeep` files to preserve empty directories
- Verify extra-secrets-staging Application syncs successfully
- Verify extra-secrets-production Application syncs successfully
- Both Applications show Synced status (not Degraded)
- Document ApplicationSet directory requirements
Current State (2025-11-18):
- ✅ extra-secrets-production: Synced, Healthy
- ✅ extra-secrets-staging: Synced, Healthy
- Resolution: added missing `resources/` directories with `.gitkeep` files (commit 8233395)
- Documentation: added troubleshooting section to `cluster-gitops/docs/applicationsets.md`
Affected Files (cluster-gitops repo):
- Missing: `plugins/local/extra-secrets-production/resources/`
- Missing: `plugins/local/extra-secrets-staging/resources/`
Remediation Steps (1-2 hours):
- Create directories (15 min)

```shell
cd cluster-gitops
mkdir -p plugins/local/extra-secrets-production/resources
mkdir -p plugins/local/extra-secrets-staging/resources
touch plugins/local/extra-secrets-production/resources/.gitkeep
touch plugins/local/extra-secrets-staging/resources/.gitkeep
```
- Commit and push (15 min)
  - Create PR or commit directly to cluster-gitops
  - Include documentation in commit message
- Sync and verify (30 min)
  - Trigger ArgoCD sync for both Applications
  - Verify Degraded status clears
  - Verify no more "path does not exist" errors
  - Check Application health in ArgoCD UI
- Document pattern (30 min)
  - Document ApplicationSet multi-source requirements
  - Add note about the resources/ directory purpose
  - Update cluster-gitops README if needed
Dependencies:
- Story 4.4 (App-of-Apps Bootstrap) ✅ Complete
- Related to Story 4.8 (SecretStore fix) - both affect extra-secrets
Technical Notes:
- Same pattern already used for: `argocd/local/argocd-secrets/resources/.gitkeep`
- ApplicationSet template unconditionally adds a third source
- Empty `resources/` directory satisfies the template requirement
- `.gitkeep` ensures git tracks the empty directory
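The third source the template unconditionally adds might look roughly like this; the structure is a sketch of the described behaviour, not the actual ApplicationSet in cluster-gitops (repoURL and path templating are assumptions):

```yaml
# Inside the ApplicationSet's Application template
sources:
  # ... chart source and values source omitted ...
  - repoURL: https://github.com/camaradesuk/cluster-gitops   # assumed URL
    targetRevision: main
    path: 'plugins/local/{{ .name }}/resources'              # assumed path template
    directory:
      recurse: true
# The path must exist in Git even when empty - hence resources/.gitkeep
```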
Estimated Effort: 2 story points (2-4 hours)
Child Work Item 4.11: Fix User Guide TLS Certificate ✅¶
GitHub Issue: #2198 Status: ✅ Complete Priority: P2 (Medium - cert-manager issue) Story Points: 3 Sprint: Sprint 3 Completed: 2025-11-18
As a platform engineer I want the user-guide TLS certificate to issue successfully So that help.staging.syrf.org.uk is accessible over HTTPS
Acceptance Criteria:
- Investigate why user-guide-tls certificate is stuck in "Issuing" state
- Identify blocker (DNS, ACME challenge, rate limit, etc.)
- Resolve issue preventing certificate issuance
- Verify certificate transitions to Ready: True
- Verify TLS secret is created
- Test HTTPS access to staging URLs
- Document resolution
Root Causes Identified (2025-11-18):
- Staging using production URLs: staging ingresses were configured with production hostnames (e.g., `help.syrf.org.uk`) instead of staging hostnames (e.g., `help.staging.syrf.org.uk`)
- DNS mismatch: production URLs point to GitHub Pages or the legacy cluster, not the GKE LoadBalancer, so ACME HTTP-01 challenges failed with 404
- Wrong Let's Encrypt issuer: staging shared-values.yaml used the `letsencrypt-staging` issuer
- Chart defaults pollution: Helm chart defaults had hardcoded ingress values not fully overridden
Resolution Applied:
- Updated all Helm chart defaults to `ingress: {}` (api, pm, quartz, web, user-guide, docs)
- Created staging environment values with correct staging URLs
- Changed staging shared-values to use the `letsencrypt-prod` issuer
- Deleted old certificates to trigger reissuance with correct configuration
Final State (2025-11-18):
- ✅ All staging certificates using `letsencrypt-prod`
- ✅ All staging certificates Ready: True
- ✅ Staging URLs configured (updated to new convention 2025-11-30):
  - api: `api.staging.syrf.org.uk`
  - web: `staging.syrf.org.uk`
  - project-management: `project-management.staging.syrf.org.uk`
  - docs: `docs.staging.syrf.org.uk`
  - user-guide: `help.staging.syrf.org.uk`
- ✅ All production certificates Ready: True
Diagnostic Steps (1-2 hours):
- Check cert-manager logs (30 min)
  - Look for errors related to user-guide-tls
  - Check ACME challenge status
  - Identify the specific failure reason
- Check ACME challenge resources (30 min)
  - List Challenge resources for this certificate
  - Check if the HTTP-01 challenge pod exists
  - Verify the challenge endpoint is reachable
  - Check if the cert-manager solver ingress exists
- Check DNS and ingress (30 min)
  - Verify help.syrf.org.uk DNS resolves to the ingress IP
  - Check ingress routes for the ACME challenge path
  - Verify no conflicts with other certificates
- Check Let's Encrypt rate limits (15 min)
  - Check if the domain hit rate limits
  - Verify staging vs production ACME server usage
Potential Root Causes:
- DNS issue: help.syrf.org.uk not resolving or resolving to wrong IP
- ACME challenge failure: HTTP-01 challenge endpoint not reachable
- Rate limiting: Let's Encrypt rate limits exceeded
- Ingress conflict: Multiple ingresses competing for same hostname
- cert-manager bug: Controller not processing certificate request
Remediation Steps (1-2 hours):
- Delete and recreate (if stuck in a bad state)
- Force a new order (if ACME issue):

```shell
kubectl delete certificaterequest -n syrf-staging -l cert-manager.io/certificate-name=user-guide-tls
kubectl delete challenge -n syrf-staging --all
```

- Update certificate spec (if configuration issue)
  - Switch to DNS-01 challenge if HTTP-01 is failing
  - Use the Let's Encrypt staging server to avoid rate limits
  - Adjust dnsNames or the issuer reference
- Verify and monitor (30 min)
  - Watch certificate events
  - Monitor cert-manager logs
  - Verify the certificate reaches the Ready state
  - Test HTTPS access
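If switching to DNS-01 as suggested in the remediation steps, a ClusterIssuer solver might look like this sketch; the issuer name matches the document, but the contact email and Cloud DNS project are assumptions for illustration:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.org                  # assumed contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key    # assumed account key secret
    solvers:
      - dns01:
          cloudDNS:
            project: camarades-net          # assumed GCP project
```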
Dependencies:
- Story 4.3 (Platform Add-ons) ✅ Complete - cert-manager installed
- May need coordination with DNS/ingress configuration
Technical Notes:
- cert-manager version: v1.15.0
- Issuer: Let's Encrypt (production)
- Challenge type: HTTP-01 (assumed)
- Other certificates issuing successfully (suggests cert-manager is healthy)
- Issue specific to user-guide service
Estimated Effort: 3 story points (4-8 hours - investigation heavy)
Child Work Item 4.12: Sync Out-of-Sync Applications 🔄¶
GitHub Issue: #2199 Status: 🔄 In Progress (Plugins complete, SyRF apps pending) Priority: P2 (Medium - operational hygiene) Story Points: 2 Sprint: Sprint 3 (In Progress)
As a platform engineer I want all ArgoCD Applications to show Synced status So that cluster state matches Git and drift is eliminated
Acceptance Criteria:
- Review all Applications with OutOfSync or Unknown status
- Identify reason for each drift (manual change, missing config, etc.)
- Sync or fix each Application
- Verify all Applications show Synced status
- Document any manual steps taken
- Configure sync policies if needed (auto-sync, prune, self-heal)
Current State (2025-11-17): OutOfSync Applications (4):
- argocd-secrets: OutOfSync
- cert-manager: OutOfSync
- rabbitmq: OutOfSync
- root: OutOfSync
Unknown Status Applications (9):
- docs-production: Unknown
- external-dns: Unknown
- extra-secrets-production: Unknown (also Degraded)
- extra-secrets-staging: Unknown (also Degraded)
- ingress-nginx: Unknown
- quartz-production: Unknown
- rabbitmq-secrets: Unknown
- user-guide-production: Unknown
- user-guide-staging: Synced (but has failing certificate)
Sync Process (2-4 hours):
- Categorize issues (30 min)
  - Group by root cause
  - Identify which can auto-sync vs need manual intervention
  - Check if blocked by other child work items
- Sync core infrastructure (1 hour)
  - argocd-secrets (likely just committed changes)
  - cert-manager (check for drift)
  - rabbitmq (verify no manual changes)
  - root (app-of-apps - sync to propagate changes)
- Investigate Unknown status (1 hour)
  - Check why Application status is Unknown
  - May indicate health check issues
  - May be transient during deployment
  - Review Application logs and events
- Document and prevent (30 min)
  - Document why each Application was out of sync
  - Configure sync policies to prevent future drift
  - Consider enabling auto-sync for infrastructure apps
Dependencies:
- May depend on Story 4.8 (SecretStore fix) for extra-secrets apps
- May depend on Story 4.10 (directory structure fix) for extra-secrets apps
Technical Notes:
- Unknown status often means ArgoCD can't determine health
- May need to configure custom health checks
- OutOfSync is expected during active development
- Root Application sync propagates to all child Applications
Estimated Effort: 2 story points (2-4 hours)
Child Work Item 4.13: Clean Up Orphaned Resources 🔄¶
GitHub Issue: #2200 Status: 🔄 In Progress (Plugins cleanup complete, SyRF apps pending) Priority: P3 (Low - cleanup task) Story Points: 1 Sprint: Sprint 3-4 (In Progress)
As a platform engineer I want orphaned resources cleaned up So that cluster is tidy and ArgoCD warnings are eliminated
Acceptance Criteria:
- Review orphaned resources in api-staging (7 resources)
- Review orphaned resources in extra-secrets-production (3 resources)
- Determine if resources should be:
- Deleted (if truly orphaned)
- Adopted by ArgoCD (if should be managed)
- Ignored (if intentionally manual)
- Execute cleanup or adoption
- Verify OrphanedResourceWarning clears from Applications
- Document decisions for future reference
Current State (2025-11-17):
- api-staging: 7 orphaned resources
- extra-secrets-production: 3 orphaned resources
Orphaned Resource Analysis (1-2 hours):
- Identify resources (30 min)

```shell
kubectl get application api-staging -n argocd -o yaml | yq '.status.resources'
kubectl get application extra-secrets-production -n argocd -o yaml | yq '.status.resources'
```
- Determine ownership (30 min)
  - Check resource labels and annotations
  - Verify if resources are managed by Helm
  - Check if resources should exist in manifests
  - Identify why ArgoCD considers them orphaned
- Decide action (30 min)
  - Delete: if resources are leftover from old deployments
  - Adopt: if resources should be in Git but aren't
  - Ignore: if resources are intentionally manual (add to ignoreDifferences)
Cleanup Steps (30 min - 1 hour):
- Option A: Delete orphaned resources
- Option B: Adopt into ArgoCD
  - Add resource manifests to Git
  - Configure ownerReferences
  - Sync Application
- Option C: Ignore
  - Add to Application ignoreDifferences
  - Document why resources are manual
Dependencies:
- None (independent cleanup task)
Technical Notes:
- Orphaned resources don't break functionality
- Warnings indicate resources exist but not tracked in Git
- Common causes: manual kubectl apply, Helm 2 migration, renamed resources
- ArgoCD can adopt resources with proper annotations
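For Option C above, an `ignoreDifferences` entry on the Application might look like this sketch; the resource name and field are hypothetical examples, not actual cluster resources:

```yaml
# Application spec fragment: ignore intentionally manual fields
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      name: manually-tuned-deployment   # hypothetical resource name
      jsonPointers:
        - /spec/replicas                # e.g. replicas managed by an autoscaler
```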
Estimated Effort: 1 story point (1-2 hours)
Child Work Item 4.14: Configure ArgoCD Sync Policies and Drift Prevention 🔄¶
GitHub Issue: #2201 Status: 🔄 In Progress (Plugins complete, SyRF apps pending) Priority: P1 (High - prevents future issues) Story Points: 5 Sprint: Sprint 3 (In Progress)
As a platform engineer I want proper ArgoCD sync policies configured for all Applications So that the cluster stays synchronized with Git and drift is prevented automatically
Acceptance Criteria:
- Audit all Applications and categorize by sync strategy needs
- Configure automated sync policies where appropriate
- Enable prune and self-heal for automated applications
- Configure sync waves for ordered deployments
- Set up retry logic for transient failures
- Add PostSync hooks for critical validations (beyond deployment notifications)
- Configure health checks for custom resources
- Document sync policy decisions and rationale
- Test drift detection and auto-remediation
- Create runbook for monitoring sync status
Current State (2025-11-17):
- Mixed sync policies: Some auto-sync, some manual, inconsistent configuration
- No self-heal configured: Manual changes to cluster not automatically reverted
- No prune configured: Deleted resources in Git remain in cluster
- No sync waves: Dependencies deployed in random order
- Limited health checks: ArgoCD can't determine health for some resources
Sync Policy Categories (1 hour):
- Full Automation (auto-sync + prune + self-heal):
  - Infrastructure: ingress-nginx, cert-manager, external-dns
  - Platform: external-secrets-operator, rabbitmq-secrets
  - Staging services: all syrf-staging services
  - Justification: non-critical, fast feedback needed
- Semi-Automated (auto-sync + self-heal, NO prune):
  - Production services: all syrf-production services
  - ArgoCD itself: self-managing but careful pruning
  - Justification: auto-sync for speed, manual pruning for safety
- Manual Only (no auto-sync):
  - Production database configs (if added)
  - Security-critical resources (if added)
  - Justification: require explicit approval before changes
Configuration Tasks (3-4 hours):
- Infrastructure Applications (1 hour)

```yaml
# Example: ingress-nginx
syncPolicy:
  automated:
    prune: true
    selfHeal: true
  syncOptions:
    - CreateNamespace=true
    - PrunePropagationPolicy=foreground
    - PruneLast=true
  retry:
    limit: 5
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m
```
- Service Applications (1-2 hours)
  - Add sync waves to ensure dependencies deploy first
  - Configure health checks for custom resources
  - Set appropriate retry policies
  - Enable auto-sync for staging, semi-auto for production
- PostSync Validation Hooks (1 hour)
  - Add hooks beyond deployment notifications
  - Validate critical resources exist after sync
  - Check for common misconfigurations
  - Alert on unexpected drift
- Health Checks (30 min)
  - Configure custom health checks for ExternalSecrets
  - Configure health for Jobs (consider successful completion)
  - Configure health for StatefulSets (readiness)
Example PostSync Validation Hook:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: validate-deployment
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      containers:
        - name: validate
          image: bitnami/kubectl:latest
          command:
            - /bin/bash
            - -c
            - |
              # Validate ExternalSecrets are ready
              kubectl wait --for=condition=Ready \
                externalsecret/rabbit-mq -n {{ .Values.namespace }} \
                --timeout=60s
              # Validate pods are running
              kubectl wait --for=condition=Ready \
                pod -l app={{ .Values.appName }} \
                -n {{ .Values.namespace }} \
                --timeout=120s
      restartPolicy: Never
```
Sync Wave Strategy (deployment order):
```yaml
# Wave -1: Prerequisites
- ClusterSecretStore (wave: -1)
- Namespaces (wave: -1)
# Wave 0: Infrastructure
- ingress-nginx (wave: 0)
- cert-manager (wave: 0)
# Wave 1: Platform Services
- external-secrets-operator (wave: 1)
- rabbitmq (wave: 1)
# Wave 2: Secrets
- extra-secrets (wave: 2)
- rabbitmq-secrets (wave: 2)
# Wave 3: Application Services
- syrf-api (wave: 3)
- syrf-pm (wave: 3)
- syrf-quartz (wave: 3)
- syrf-web (wave: 3)
```
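Waves are attached via the `argocd.argoproj.io/sync-wave` annotation on each Application or manifest; lower numbers sync first. An illustrative fragment:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ingress-nginx
  annotations:
    # Wave 0: infrastructure, syncs after wave -1 prerequisites
    argocd.argoproj.io/sync-wave: "0"
```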
Testing and Validation (1 hour):
- Test auto-sync (20 min)
- Make Git change to auto-sync app
- Verify ArgoCD detects and syncs automatically
- Verify sync completes within 3 minutes
- Test self-heal (20 min)
- Make manual change to cluster (kubectl edit)
- Verify ArgoCD detects drift
- Verify auto-remediation within 5 minutes
- Test prune (20 min)
- Delete resource from Git
- Verify ArgoCD removes from cluster
- Verify prune happens in correct order
Documentation (30 min):
- Document sync policy for each Application
- Create decision matrix (when to use each policy)
- Document sync wave strategy
- Create troubleshooting guide for sync failures
- Update cluster-gitops README with sync policies
Dependencies:
- Story 4.8 (SecretStore fix) - needed for testing ExternalSecret health
- Story 4.12 (Sync out-of-sync apps) - clean slate before configuring policies
Technical Notes:
- Prune safety: Use `PruneLast=true` to ensure new resources deploy before old ones are deleted
- Self-heal interval: ArgoCD checks every 3 minutes by default
- Sync waves: Lower numbers deploy first, use negative for prerequisites
- Health checks: Custom Lua scripts for complex resources
- Retry logic: Exponential backoff prevents sync storms
- PostSync hooks: Run AFTER sync succeeds, useful for validation
- PreSync hooks: Run BEFORE sync, useful for migrations
Estimated Effort: 5 story points (1 day)
Work Item 5: Production Migration ⏳ PLANNED¶
GitHub Issue: #2155 Goal: Migrate production traffic from Jenkins X to new cluster
Total Story Points: 34 Status: ⏳ Planned (0%)
Child Work Item 5.1: Production Deployment Validation 📋¶
GitHub Issue: #2156 Status: 📋 Ready (after Work Item 4) Priority: P0 (Critical) Story Points: 8 Sprint: TBD
As a platform engineer I want all services deployed to production environment So that production environment is ready for traffic
Acceptance Criteria:
- Deploy all 4 services to production namespace
- Use current production versions (from Jenkins X cluster)
- Verify all pods are running and healthy
- Verify ingress routes are configured
- Verify RabbitMQ connectivity
- Verify database connectivity
- Run smoke tests for each service
- Verify monitoring and logging
- Document production configuration
Dependencies:
- Story 4.6 (End-to-End GitOps Flow Validation)
- All Work Item 4 stories complete
Technical Notes:
- Use versions from PLANNING.md (Jenkins X baseline)
- Do NOT switch traffic yet (parallel running)
- Validate in isolation first
Estimated Effort: 8 story points (2 days)
Child Work Item 5.2: Traffic Cutover Planning 📋¶
GitHub Issue: #2157 Status: 📋 Ready (after Work Item 4) Priority: P0 (Critical) Story Points: 5 Sprint: TBD
As a platform engineer I want a detailed cutover plan with rollback procedures So that production migration is safe and reversible
Acceptance Criteria:
- Document cutover strategy:
- Blue-green deployment
- Canary rollout
- DNS switch
- Define success criteria
- Create rollback plan
- Schedule maintenance window
- Prepare communication plan
- Create monitoring checklist
- Define rollback triggers
- Test rollback procedure in staging
Dependencies:
- Story 5.1 (Production Deployment Validation)
Technical Notes:
- Recommended: DNS-based cutover (fastest rollback)
- Alternative: Load balancer reconfiguration
- Monitor for 24-48 hours before Jenkins X decommission
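One way to keep a DNS-based rollback fast is to lower record TTLs ahead of the maintenance window. With external-dns (already running in the cluster) this can be set per ingress via annotation; a sketch with an illustrative ingress name:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: syrf-web  # illustrative
  annotations:
    # external-dns honours this TTL (seconds); restore the normal
    # value once the cutover has settled.
    external-dns.alpha.kubernetes.io/ttl: "60"
```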
Estimated Effort: 5 story points (1 day)
Child Work Item 5.3: Production Cutover Execution 📋¶
GitHub Issue: #2158 Status: 📋 Ready (after Work Item 4) Priority: P0 (Critical) Story Points: 13 Sprint: TBD
As a platform engineer I want to switch production traffic to the new cluster So that SyRF runs on the GitOps architecture
Acceptance Criteria:
- Announce maintenance window
- Verify new cluster is ready
- Switch traffic (DNS or load balancer)
- Monitor application health
- Monitor error rates
- Monitor latency metrics
- Verify all services responding
- Run production smoke tests
- Monitor for 2 hours
- Confirm no critical errors
- Document cutover results
- Update status page
Dependencies:
- Story 5.2 (Traffic Cutover Planning)
Technical Notes:
- This is the GO-LIVE event
- Rollback plan must be ready
- Team on standby during cutover
Estimated Effort: 13 story points (1 week - includes monitoring period)
Child Work Item 5.4: Jenkins X Cluster Decommission 📋¶
GitHub Issue: #2159 Status: 📋 Ready (after Work Item 5) Priority: P2 (Medium) Story Points: 8 Sprint: TBD
As a platform engineer I want to decommission the Jenkins X cluster So that we don't pay for unused infrastructure
Acceptance Criteria:
- Monitor new cluster for 1 week post-cutover
- Confirm no critical issues
- Export Jenkins X configuration (backup)
- Document lessons learned
- Archive Jenkins X logs and metrics
- Delete Jenkins X Applications
- Delete Jenkins X cluster
- Remove DNS entries
- Update documentation
- Celebrate migration success! 🎉
Dependencies:
- Story 5.3 (Production Cutover Execution)
- 1 week monitoring period
Technical Notes:
- Do NOT delete until 100% confident in new cluster
- Keep backups of Jenkins X configs
- Final step of the migration
Estimated Effort: 8 story points (2 days)
Sprint Planning Recommendations¶
Sprint 2 (Current - 2 weeks)¶
Goal: Complete CI/CD automation and cluster-gitops setup
Child Work Items to Include (42 story points - ambitious):
- ✅ Child Work Item 2.1: Auto-Version Workflow Cleanup (5 pts)
- ✅ Child Work Item 2.2: Docker Image Build Integration (13 pts)
- ✅ Child Work Item 3.2: ArgoCD Application Manifests (8 pts)
- ✅ Child Work Item 3.3: Environment Values Configuration (5 pts)
- ✅ Child Work Item 3.5: Infrastructure Dependencies Analysis (5 pts)
Stretch Goals:
- Child Work Item 2.4: Promotion PR Automation (8 pts)
Deliverables:
- Docker images building and pushing to GHCR
- ArgoCD manifests ready to apply
- Environment values configured
- Infrastructure requirements documented
Sprint 3 (In Progress - 2 weeks)¶
Goal: Complete remaining CI/CD and GitOps infrastructure
Child Work Items to Include (24 story points):
- ✅ Child Work Item 2.4: Promotion PR Automation (8 pts) - Complete
- ✅ Child Work Item 3.4: ApplicationSet for PR Previews (8 pts) - Complete
- ✅ Child Work Item 2.3: Build Optimization (8 pts) - Complete (2025-11-19)
Deliverables:
- ✅ Full CI/CD automation working
- ✅ PR preview environments configured
- ✅ Build optimization implemented
Sprint 4+ (Blocked on K8s cluster)¶
Goal: Deploy to new cluster and validate
Child Work Items:
- All Work Item 4 child work items (29 points)
- Requires: Kubernetes cluster provisioned
Sprint 5+ (Blocked on Sprint 4)¶
Goal: Production migration
Child Work Items:
- All Work Item 5 child work items (34 points)
Risk Register¶
High-Risk Items¶
| Risk | Probability | Impact | Mitigation | Status |
|---|---|---|---|---|
| K8s cluster not available | High | Critical | Work on items that don't require cluster (Work Items 2, 3) | ⚠️ Active |
| Docker build failures in monorepo | Medium | High | Test builds locally first, review Dockerfiles | 📋 Planned |
| RabbitMQ connectivity issues | Low | Critical | Test in staging first, document config | 📋 Planned |
| Production cutover problems | Medium | Critical | Detailed rollback plan, gradual cutover | 📋 Planned |
| Data migration issues | Low | Critical | Verify data access before cutover | 📋 Planned |
Medium-Risk Items¶
| Risk | Probability | Impact | Mitigation | Status |
|---|---|---|---|---|
| DNS propagation delays | Medium | Medium | Plan for TTL, use low TTL before cutover | 📋 Planned |
| Secret management migration | Medium | Medium | Test ESO in staging, document secrets | 📋 Planned |
| Performance degradation | Low | Medium | Load testing, monitoring, gradual rollout | 📋 Planned |
Metrics & KPIs¶
Development Velocity¶
- Sprint Capacity: ~40 story points per 2-week sprint (1 developer)
- Completed: 174 story points (83%)
- Remaining: 36 story points (17%)
- Estimated Sprints: 1-2 sprints (2-3 weeks)
GitOps Success Criteria¶
Once Work Item 4 is complete, measure:
| Metric | Target | Current | Status |
|---|---|---|---|
| Commit → Staging Deploy Time | < 10 min p50 | N/A | ⏳ Not measured |
| Preview Env Creation Time | < 2 min | N/A | ⏳ Not measured |
| Deployment via Git PRs | 100% | N/A | ⏳ Not measured |
| Untracked Drift | 0 instances | N/A | ⏳ Not measured |
| Rollback Time | < 5 min | N/A | ⏳ Not measured |
Dependencies Graph¶
Work Item 1 (✅ Complete)
└── Work Item 2 (✅ Complete)
├── Child Work Item 2.1 (Auto-Version Cleanup)
├── Child Work Item 2.2 (Docker Builds) → depends on 2.1
├── Child Work Item 2.3 (Build Optimization) → depends on 2.2
└── Child Work Item 2.4 (Promotion PRs) → depends on 2.2, 3.1
Work Item 1 (✅ Complete)
└── Work Item 3 (✅ Complete)
├── Child Work Item 3.1 (cluster-gitops) ✅ Complete
├── Child Work Item 3.2 (ArgoCD Apps) → depends on 3.1
├── Child Work Item 3.3 (Env Values) → depends on 3.1
├── Child Work Item 3.4 (ApplicationSets) → depends on 3.2
└── Child Work Item 3.5 (Infrastructure) → depends on 3.1
Work Item 4 (🔄 In Progress)
├── Child Work Item 4.1 (K8s Cluster) ⚠️ PRIMARY BLOCKER
├── Child Work Item 4.2 (ArgoCD Install) → depends on 4.1
├── Child Work Item 4.3 (Platform Add-ons) → depends on 4.2, 3.5
├── Child Work Item 4.4 (Bootstrap) → depends on 4.2, 3.2
├── Child Work Item 4.5 (First Service) → depends on 4.3, 4.4
└── Child Work Item 4.6 (E2E Validation) → depends on 4.5, 2.2, 2.4, 3.4
Work Item 5 (⏳ Planned) → depends on Work Item 4
├── Child Work Item 5.1 (Prod Validation) → depends on 4.6
├── Child Work Item 5.2 (Cutover Plan) → depends on 5.1
├── Child Work Item 5.3 (Cutover) → depends on 5.2
└── Child Work Item 5.4 (Decommission) → depends on 5.3 + 1 week
Changelog¶
2025-11-19 (Build Optimization - Complete)¶
- Child Work Item 2.3: Build Optimization ✅ Complete
- Implemented crane-based image retagging for chart-only changes
- Added `list-files: shell` to dorny/paths-filter for file analysis
- Chart-only detection checks if ALL changed files match the `.chart/` pattern
- Initial approach using negation patterns didn't work (patterns are OR'd)
- Fixed by analyzing actual file paths in combined step
- Successfully tested: API chart-only change retagged 9.4.3 → 9.4.4
- Time savings: 12 seconds vs 4+ minutes for full build
- Sprint 3 now fully complete - All 24 story points delivered
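The chart-only decision rule described above can be sketched in shell; file names and the exact glob here are illustrative, not the workflow's actual code:

```shell
# A change set is "chart-only" when every changed path contains a
# "/.chart/" segment; any other file forces a full image build.
changed_files='api/.chart/Chart.yaml api/.chart/values.yaml'
chart_only=true
for f in $changed_files; do
  case "$f" in
    */.chart/*) ;;          # chart file: keep checking
    *) chart_only=false ;;  # non-chart file disqualifies
  esac
done
echo "chart_only=$chart_only"
```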
2025-11-18 (Plugins ApplicationSet Fixes - Complete)¶
- Plugins Project ArgoCD Applications Fixed: All plugins apps now Synced/Healthy
- Sync Policy Configuration (Child Work Item 4.14 - Partial):
- Enabled `selfHeal: true` for drift prevention (Git is source of truth)
- Added `ServerSideApply=true` for large CRDs (ESO)
- Added `ApplyOutOfSyncOnly=true` for efficient syncing
- Added `ignoreDifferences` for ESO controller default fields (conversionStrategy, decodingStrategy, metadataPolicy)
- Removed blocking SyncWindows from plugins and argocd projects
- Orphaned Resources Cleanup (Child Work Item 4.13 - Partial):
- Deleted orphaned rabbitmq secret from syrf-staging
- Removed redundant rabbitmq-secrets plugin (ClusterExternalSecret handles this)
- Configured RabbitMQ `existingErlangSecret` to prevent drift
- Directory Structure Fix (Child Work Item 4.10 - Complete):
- Created missing resources directories with .gitkeep for external-dns, ingress-nginx
- ClusterIssuer Resolution:
- Deleted ClusterIssuers for GitOps regeneration with correct tracking ID
- ESO CRD Fix:
- Removed kubectl.kubernetes.io/last-applied-configuration annotations
- Deleted and recreated CRDs with ServerSideApply
- Commits:
- 1c9ebe5: fix(plugins): resolve ArgoCD application issues
- 8b6678a: fix(plugins): enable selfHeal and remove redundant rabbitmq-secrets
- e97f189: fix(projects): remove blocking SyncWindows from argocd and plugins
- 422194c: fix(plugins): add ApplyOutOfSyncOnly to help with large CRD sync
- af29deb: fix(plugins): ignore ESO controller default fields in ExternalSecrets
- Final Status: All 7 plugins apps showing Synced/Healthy
- cert-manager, external-dns, external-secrets-operator
- extra-secrets-production, extra-secrets-staging, ingress-nginx, rabbitmq
2025-11-18 (TLS Certificate and Ingress Configuration Fixes)¶
- Child Work Item 4.11 Complete: Fixed TLS certificates for all staging and production services
- Root Causes Identified:
- Staging ingresses using production URLs (e.g., `help.syrf.org.uk` instead of `help.staging.syrf.org.uk`)
- DNS mismatch causing ACME HTTP-01 challenges to fail with 404
- Staging using the `letsencrypt-staging` issuer instead of `letsencrypt-prod`
- Helm chart defaults with hardcoded ingress values not fully overridden
- Fixes Applied:
- Updated all Helm chart defaults to `ingress: {}` (api, pm, quartz, web, user-guide, docs)
- Created staging environment values with correct staging URLs
- Changed staging shared-values to use the `letsencrypt-prod` issuer
- All certificates now issued by Let's Encrypt production
- Current State:
- All staging certificates: Ready ✅
- All production certificates: Ready ✅
- Correct staging URLs configured for all services
- Progress Update: Work Item 4 now 8/14 complete (57%), overall 27/37 (73%)
2025-11-17 (Cluster Health Assessment - 6 New Issues Discovered)¶
- Cluster Health & Remediation Issues: Added 6 new child work items to Work Item 4 after comprehensive cluster health check
- Critical Issues:
- Child Work Item 4.8: Fix SecretStore Configuration (8 pts, P0) - 44 ExternalSecrets failing, all reference non-existent SecretStores
- Child Work Item 4.9: Fix Staging Image Tags (5 pts, P0) - 5 staging pods with InvalidImageName/ImagePullBackOff
- High Priority:
- Child Work Item 4.10: Fix Extra-Secrets Directory Structure (2 pts, P1) - Missing resources/ directories blocking sync
- Medium Priority:
- Child Work Item 4.11: Fix User Guide TLS Certificate (3 pts, P2) - ✅ Complete
- Child Work Item 4.12: Sync Out-of-Sync Applications (2 pts, P2) - 13 apps OutOfSync or Unknown status
- Low Priority:
- Child Work Item 4.13: Clean Up Orphaned Resources (1 pt, P3) - 10 orphaned resources across 2 apps
- Impact on Progress:
- Total story points increased from 181 to 202 (+21 points)
- Work Item 4: Now 4/13 complete (31%) instead of 4/7 (57%)
- Overall progress: 23/36 child work items (64%) vs 23/30 (77%)
- Completion estimate updated: 2-3 weeks vs 1-2 weeks
- Identified 2 critical blockers for staging environment and all services
2025-11-14 (Helm Chart Standardization)¶
- Helm Chart Standardization - Jenkins X Pattern Removal (Child Work Item 4.7, 3 pts):
- Removed all 52 Jenkins X legacy patterns from service Helm charts
- Updated all 4 services (api, project-management, quartz, web) × 4 files each = 16 files total
- Replaced `jx.imagePullSecrets` with the standard K8s top-level `imagePullSecrets` array
- Replaced `jxRequirements.ingress.*` with a clean `ingress.*` namespace
- Removed `draft` label patterns from all services
- Root cause: syrf-web ImagePullBackOff due to mismatch between global values and chart expectations
- All charts validated successfully with `helm template`
- Documentation: Created ADR-006-helm-chart-standardization.md
- Web service had 30 references in ingress.yaml alone (complex host name construction)
- Work Item 4 Progress: 4/7 child work items complete (57%)
- Overall Progress: 23/30 child work items (77%), 163/181 story points (90%)
2025-11-13 (Updated - Repository Migration)¶
- Repository Migration Completed:
- Migrated monorepo from `camaradesuk/syrf-test` to `camaradesuk/syrf`
- Backup created at `camaradesuk/syrf-web-legacy`
- Force push + rename strategy preserved all GitHub metadata (470+ issues, 47 PRs, discussions)
- ZenHub workspace continues functioning (same internal repo ID)
- All branches coexist (3 monorepo + 93 syrf-web = no conflicts)
- All tags coexist (prefixed vs unprefixed = no conflicts)
- GitHub automatic redirects: syrf-web URLs → syrf URLs
- Documentation created: ADR-005 and migration guide
- NEW Child Work Item 1.8: Repository Migration to Production Name (5 pts) ✅ Complete
- Updated backlog: 22/29 child work items (76%), 160/178 story points (90%)
- Updated all issue URLs from syrf-web to syrf
- External-DNS CrashLoopBackOff Issue RESOLVED:
- Problem: External-DNS pod crashing with `Precondition not met` error
- Root Cause: Trying to delete DNS records from the legacy Jenkins X cluster with a different owner ID
- Solution: Changed policy from `sync` to `upsert-only` in `infrastructure/external-dns/values.yaml`
- Status: External-DNS now running successfully, creating/updating records without deletion attempts
- Legacy DNS Records: Orphaned TXT records from legacy cluster preserved until migration complete
- Documentation: Created `cluster-gitops/docs/troubleshooting/external-dns-crashes.md`
- Commits: 6c3de9d (fix), 8ee375d (docs)
NEW FEATURES COMPLETED - Child Work Items 2.5 and 2.6:
-
Production Promotion Automation (Child Work Item 2.5, 8 pts):
- Automated PR creation for production promotion after successful staging deployment
- PR requires manual review and merge (no GitHub Environment needed)
- Workflow completes with green checkmark after PR creation
- PR labeled `requires-review` with a review checklist
- Implementation: `promote-to-production` job in ci-cd.yml
- Commits: 3d4edccd (initial), 42a46855 (simplified for free tier)
- Deployment Success Notifications (Child Work Item 2.6, 8 pts):
- ArgoCD PostSync hooks create GitHub commit statuses after successful deployments
- Kubernetes Job authenticates with GitHub App
- Status context: `argocd/deploy-{environment}`
- Configuration consolidated in shared-values.yaml (DRY principle)
- Services enable with a single flag: `deploymentNotification.enabled: true`
- Staging: commit statuses only
- Production: commit statuses + GitHub Releases
- Commits: 3d4edccd, 118648da, 74bee73 (DRY config), 034158d0 (docs)
- Documentation:
- Created: `docs/how-to/production-promotion-and-notifications.md`
- Updated: CLAUDE.md with CI/CD workflow changes
- Updated: cluster-gitops shared-values.yaml (both environments)
- Work Item 2 Status: Now 100% complete (6/6 child work items)
- Total Story Points: Increased from 157 to 173 (+16 points)
- Overall Progress: 21/28 child work items (75%), 155/173 story points (90%)
2025-11-07¶
- Reorganized hierarchy for ZenHub alignment
- Changed from 5 Epics to 1 Epic containing 5 Work Items
- Changed 26 User Stories to 26 Child Work Items
- Updated all terminology throughout document (Executive Summary, Progress table, Sprint Planning, Dependencies Graph)
- All content and metadata preserved
- Created GitHub issues in syrf-web repository:
- Epic #2128: SyRF GitOps Migration (Short-Term Goals pipeline)
- Work Items #2129-#2155 (5 work items, Sprint Backlog pipeline)
- Child Work Items #2130-#2159 (26 child work items, Sprint Backlog pipeline)
- All issues have proper hierarchy, estimates, and pipeline placement
2025-11-03¶
- Initial backlog created
- Analyzed all planning documents
- Created 26 stories across 5 epics
- Identified K8s cluster as primary blocker
- Defined acceptance criteria and story points
- Organized into sprint recommendations
References¶
- PROJECT-STATUS.md - Current implementation status
- IMPLEMENTATION-PLAN.md - Phase-by-phase plan
- CLUSTER ARCHITECTURE GOALS.md - Target architecture
- DEPENDENCY-MAP.yaml - Service/library dependencies
- CI-CD-DECISIONS.md - Strategic CI/CD decisions
- cluster-gitops/PLANNING.md - Migration strategy and Jenkins X baseline
Next Update: After Sprint 2 completion or when K8s cluster becomes available