
GitOps Architecture for SyRF Platform

Overview

This document describes the GitOps architecture for deploying the SyRF platform using ArgoCD. The architecture follows the two-repository model and implements ArgoCD best practices for production environments.

GCP Configuration: Project camarades-net | Region europe-west2 | Cluster camaradesuk in zone europe-west2-a

For complete infrastructure details, see CLAUDE.md (search for "GCP & Infrastructure Configuration").

Repository Structure

Two-Repository Model

syrf-monorepo (Application Code):

  • Source code for all services
  • Helm charts with templates
  • Docker build definitions
  • CI/CD workflows (GitHub Actions)
  • Semantic versioning with GitVersion

cluster-gitops (Environment Configuration):

  • Environment-specific values files
  • ArgoCD Application manifests
  • ArgoCD AppProject definitions
  • Infrastructure configurations

Why Two Repositories?

  1. Separation of Concerns: App code changes don't trigger infrastructure reviews
  2. Access Control: Different teams can manage apps vs infrastructure
  3. Audit Trail: Clear history of configuration changes
  4. ArgoCD Best Practice: Recommended pattern for production

Key Architectural Decisions

1. Multi-Source Applications ✅

Each ArgoCD Application uses 2 sources:

  • Source 1: Helm chart from syrf-monorepo
  • Source 2: Values file from cluster-gitops (using $values reference)

Benefits:

  • Chart updates don't require config repo changes
  • Environment-specific configs isolated
  • Follows ArgoCD recommended pattern (2-3 sources max)
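
As a sketch, an Application combining the two sources might look like this (names, branches, and paths are illustrative, not the exact manifests in cluster-gitops):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-staging
  namespace: argocd
spec:
  project: staging
  sources:
    # Source 1: Helm chart from the application repo
    - repoURL: https://github.com/camaradesuk/syrf-monorepo
      targetRevision: main               # staging tracks main; production pins a tag
      path: src/services/api/charts/api
      helm:
        valueFiles:
          - $values/syrf/environments/staging/api/values.yaml
    # Source 2: values-only source, exposed to Source 1 via the $values reference
    - repoURL: https://github.com/camaradesuk/cluster-gitops
      targetRevision: main
      ref: values
  destination:
    server: https://kubernetes.default.svc
    namespace: syrf-staging
```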

2. Secret Management: External Secrets Operator + Google Secret Manager ✅

Decision Date: 2025-11-03

Implementation:

  • Secrets stored in Google Secret Manager (GSM)
  • External Secrets Operator (ESO) syncs secrets to Kubernetes
  • ArgoCD never has access to secret values
  • Workload Identity binds ESO to GSM
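
A minimal ESO sketch of this flow, with illustrative store, secret, and GSM key names (the real keys are not shown here):

```yaml
# SecretStore: connects ESO to Google Secret Manager via Workload Identity
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: gsm-store
  namespace: syrf-staging
spec:
  provider:
    gcpsm:
      projectID: camarades-net   # GSM project; ESO's service account needs Secret Accessor
---
# ExternalSecret: materializes a GSM secret as a Kubernetes Secret
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-secrets
  namespace: syrf-staging
spec:
  refreshInterval: 1h            # automatic refresh cadence
  secretStoreRef:
    name: gsm-store
    kind: SecretStore
  target:
    name: api-secrets            # resulting Kubernetes Secret
  data:
    - secretKey: MONGO_URI
      remoteRef:
        key: syrf-mongo-uri      # secret name in Google Secret Manager (illustrative)
```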

Why ESO?:

  • ArgoCD best practice: Destination cluster secret management
  • Maintains continuity with Jenkins X (already using GSM)
  • Enhanced security: Secrets don't persist in Redis cache
  • Better for multi-environment (separate GSM projects)

Rejected Alternative: Sealed Secrets

  • Would require migrating secrets out of GSM
  • Team already familiar with GSM patterns

3. Git Pinning Strategy ✅

Production: Pin to git tags (e.g., api-v8.20.1)

  • Immutable manifests
  • Explicit version control
  • Prevents "manifests suddenly changing meaning"

Staging: Track main branch

  • Rapid iteration
  • Automatic updates
  • Fast feedback loop
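
The two strategies differ only in the source's targetRevision; for example (tag value illustrative):

```yaml
# Production source: immutable git tag
targetRevision: api-v8.20.1

# Staging source: moving branch head
targetRevision: main
```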

4. Sync Policies ✅

Staging:

syncPolicy:
  automated:
    prune: true      # Remove resources not in Git
    selfHeal: true   # Correct drift automatically
  syncOptions:
    - CreateNamespace=true

Production:

syncPolicy:
  # NO automated sync - manual approval required
  syncOptions:
    - CreateNamespace=true
    - PruneLast=true

5. High Availability Requirements ✅

ArgoCD HA Mode:

  • Minimum 3 nodes for pod anti-affinity
  • Multiple replicas: repo-server, application-controller, server
  • Resource limits to prevent OOM
  • Redis HA for state persistence
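
A sketch of these HA settings, assuming the community argo-cd Helm chart (exact value keys vary by chart version):

```yaml
redis-ha:
  enabled: true            # Redis HA for state persistence
controller:
  replicas: 2              # application-controller replicas
server:
  replicas: 2              # argocd-server replicas
repoServer:
  replicas: 2
  resources:               # limits to prevent OOM on large chart renders
    requests: {cpu: 250m, memory: 256Mi}
    limits: {cpu: "1", memory: 1Gi}
```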

Cluster Sizing:

  • Provider: GKE (Google Kubernetes Engine)
  • Version: Kubernetes 1.28+
  • Nodes: 3-4 (n1-standard-4 or e2-standard-4)
  • Zone: europe-west2-a (London)

6. Security via AppProjects ✅

project-staging.yaml:

  • Source repository whitelisting
  • Destination: syrf-staging namespace only
  • Prevents accidental production changes

project-production.yaml:

  • Stricter controls
  • Manual sync only
  • Destination: syrf-production namespace only
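
A sketch of project-production.yaml illustrating these controls (repo URLs assumed from the two-repository model):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: SyRF production workloads
  sourceRepos:                       # whitelist of allowed source repositories
    - https://github.com/camaradesuk/syrf-monorepo
    - https://github.com/camaradesuk/cluster-gitops
  destinations:                      # deployments allowed only to production
    - server: https://kubernetes.default.svc
      namespace: syrf-production
```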

Application Inventory

| Service | Chart Location | Image Registry | Purpose |
| --- | --- | --- | --- |
| syrf-api | syrf-monorepo/src/services/api/charts/api/ | ghcr.io/camaradesuk/syrf-api | Main REST API |
| syrf-project-management | syrf-monorepo/src/services/project-management/charts/project-management/ | ghcr.io/camaradesuk/syrf-project-management | Project & study management |
| syrf-quartz | syrf-monorepo/src/services/quartz/charts/quartz/ | ghcr.io/camaradesuk/syrf-quartz | Background job scheduler |
| syrf-web | syrf-monorepo/src/services/web/charts/web/ | ghcr.io/camaradesuk/syrf-web | Angular frontend |

Note: syrf-s3-notifier is deployed as AWS Lambda, not in Kubernetes.

Infrastructure Components

Required Platform Components

  1. ArgoCD (GitOps Engine)
       • HA mode with 3+ replicas
       • Redis for state management
       • Repo-server for chart fetching

  2. Ingress Controller (nginx-ingress)
       • External LoadBalancer
       • TLS termination
       • Path-based routing

  3. cert-manager (TLS Certificate Management)
       • Let's Encrypt integration
       • Automatic certificate renewal
       • ClusterIssuer: letsencrypt-prod

  4. External DNS (Automatic DNS Management)
       • Google Cloud DNS integration
       • Domain: syrf.org.uk
       • Automatic record creation/deletion

  5. External Secrets Operator (Secret Synchronization)
       • Google Secret Manager integration
       • Workload Identity authentication
       • Automatic secret refresh

  6. RabbitMQ (Message Queue)
       • In-cluster deployment (Helm chart)
       • 3 replicas for HA
       • Persistent storage

  7. MongoDB Atlas (Database)
       • Managed cloud service
       • Existing deployment
       • Connection via ExternalSecrets
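
For the in-cluster RabbitMQ, the HA requirements translate into Helm values along these lines (a sketch assuming Bitnami-style value names; the secret name is illustrative):

```yaml
replicaCount: 3                # 3 replicas for HA
persistence:
  enabled: true                # persistent storage
  size: 8Gi
auth:
  existingPasswordSecret: rabbitmq-credentials   # synced from GSM via ESO
```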

Environment Configuration

Configuration Structure

Updated 2025-11-13: Environment configurations use a service-per-file structure for better version tracking with ArgoCD.

Directory Layout:

cluster-gitops/
├── argocd/
│   └── applicationsets/
│       ├── syrf.yaml                  # ApplicationSet for staging/production
│       └── syrf-previews.yaml         # ApplicationSet for PR previews
└── syrf/
    ├── global.values.yaml             # Universal defaults for ALL environments
    ├── services/{svc}/                # Service base configurations
    │   ├── config.yaml                # Service metadata (chartPath, chartRepo)
    │   └── values.yaml                # Base Helm values
    └── environments/
        ├── staging/
        │   ├── namespace.yaml         # Environment metadata
        │   ├── staging.values.yaml    # Environment-specific Helm values
        │   └── {svc}/                 # Service configurations (CI/CD updates)
        │       ├── config.yaml        # chartTag, imageTag
        │       └── values.yaml        # Service-specific values
        ├── production/
        │   ├── namespace.yaml
        │   ├── production.values.yaml
        │   └── {svc}/                 # Service configurations (manual promotion)
        │       ├── config.yaml
        │       └── values.yaml
        └── preview/
            ├── preview.values.yaml    # Preview defaults (all PRs)
            ├── services/{svc}/        # Preview service defaults
            │   └── values.yaml
            └── pr-{n}/                # PR-specific (auto-generated)
                ├── pr.yaml            # PR metadata (headSha, branch)
                └── services/{svc}.values.yaml

Values Hierarchy (lowest to highest priority):

1. syrf/global.values.yaml                           # Universal defaults
2. syrf/services/{svc}/values.yaml                   # Service base values
3. syrf/environments/{env}/{env}.values.yaml         # Environment defaults
4. syrf/environments/{env}/{svc}/values.yaml         # Environment+service values

Service Configuration File Format:

# environments/staging/services/api.yaml
service:
  name: api
  enabled: true
  chartTag: api-v8.21.0      # Updated by CI/CD
  chartRepo: https://github.com/camaradesuk/syrf-test
  chartPath: src/services/api/.chart

Staging Environment

Namespace: syrf-staging

Characteristics:

  • Auto-sync enabled
  • Fast feedback for development
  • Lower resource requests
  • Debug mode enabled

Service Files: Self-contained configs in environments/staging/services/

  • Includes chartTag (updated automatically by CI/CD)
  • Includes chartRepo and chartPath
  • Helm values in separate syrf/environments/staging/{service}/values.yaml files

Production Environment

Namespace: syrf-production

Characteristics:

  • Manual sync only
  • Requires PR approval and merge
  • Right-sized resource requests (see GKE analysis)
  • Debug mode disabled

Service Files: Self-contained configs in environments/production/services/

  • Includes chartTag (updated via manual promotion PRs)
  • Includes chartRepo and chartPath
  • Helm values in separate syrf/environments/production/{service}/values.yaml files

ApplicationSet Pattern

Updated 2025-11-13: SyRF uses ArgoCD ApplicationSet with a matrix generator to automatically create and manage Applications for all services across environments.

ApplicationSet Structure

# cluster-gitops/argocd/applicationsets/syrf.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
spec:
  generators:
    - matrix:
        generators:
          # Read environment metadata
          - git:
              files:
                - path: "environments/*/namespace.yaml"

          # Read service configurations for that environment
          - git:
              files:
                - path: "environments/{{.environment.name}}/services/*.yaml"

  template:
    spec:
      sources:
        - repoURL: '{{.service.chartRepo}}'
          targetRevision: '{{.service.chartTag}}'  # Version from service file
          path: '{{.service.chartPath}}'
          helm:
            valueFiles:
              - $values/syrf/global.values.yaml
              - $values/syrf/services/{{.serviceName}}/values.yaml
              - $values/syrf/environments/{{.envName}}/{{.envName}}.values.yaml
              - $values/syrf/environments/{{.envName}}/{{.serviceName}}/values.yaml
        - repoURL: https://github.com/camaradesuk/cluster-gitops
          ref: values

How it works:

  1. Matrix Generator creates a cartesian product: environment × services
  2. For each combination, creates an Application (e.g., api-staging, api-production)
  3. Each Application's targetRevision comes from the service file's chartTag
  4. When CI/CD updates chartTag, ArgoCD automatically detects and syncs

Benefits:

  • DRY: Single ApplicationSet generates all Applications
  • Automatic: New services/environments automatically get Applications
  • Version Tracking: targetRevision updates automatically from git
  • Composable: Environment metadata + service configs = Application

Deployment Workflow

1. Code Change in syrf-monorepo

Developer → Code Change → GitHub PR → Merge to main
    ↓
CI/CD Workflow
    ↓
GitVersion calculates version
    ↓
Docker build & push to GHCR
    ↓
Create git tag (e.g., api-v8.21.0)

2. Automated Staging Promotion

CI/CD Workflow → Create PR to cluster-gitops
    ↓
Update environments/staging/services/{service}.yaml
(Updates service.chartTag field)
    ↓
YAML validation passes → PR auto-merged
    ↓
Workflow waits for merge completion
    ↓
ArgoCD detects chartTag change
    ↓
ApplicationSet regenerates Application with new targetRevision
    ↓
Auto-sync to syrf-staging namespace

Key Points:

  • Individual service files updated (not a single YAML with array)
  • Workflow waits for PR merge before completing (ensures deployment ready)
  • ApplicationSet automatically picks up new chartTag value
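
The chartTag bump itself can be sketched as a workflow step (step name and tag value are illustrative; assumes mikefarah yq v4):

```yaml
- name: Promote api to staging
  run: |
    # Rewrite the version field in place, then commit for the auto-merged PR
    yq -i '.service.chartTag = "api-v8.21.0"' \
      environments/staging/services/api.yaml
    git add environments/staging/services/api.yaml
    git commit -m "chore(staging): promote api to v8.21.0"
```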

3. Manual Production Promotion

Team validates staging → Trigger "Promote to Production" workflow
    ↓
Options: copy all from staging OR specify versions
    ↓
Create PR to cluster-gitops
Update environments/production/services/{service}.yaml
(Updates service.chartTag field)
    ↓
YAML validation passes
    ↓
PR review & approval (manual gate)
    ↓
Merge PR
    ↓
ArgoCD detects chartTag change
    ↓
ApplicationSet regenerates Application with new targetRevision
    ↓
Auto-sync to syrf-production namespace

Key Points:

  • Manual workflow trigger (not automatic)
  • Option to copy all versions from staging OR cherry-pick specific versions
  • Requires human approval before merge
  • After merge, ArgoCD auto-syncs (no manual sync needed in UI)

Resource Optimization

Based on GKE Cluster Analysis, the legacy cluster was significantly overprovisioned:

Legacy Issues

  • Production API: Requesting 1500m CPU, using 20m (98.7% waste)
  • Staging API: Requesting 200m CPU, using 15m (92.5% waste)
  • Staging Web: Requesting 200m CPU, using 2m (99% waste)
  • Cluster-wide: 10 nodes maintained for workloads that fit on 3-4 nodes
  • Estimated waste: $150-200/month

Right-Sized Recommendations

See the GKE Cluster Analysis document for detailed recommendations.

Key Principles:

  1. Start with actual usage + small buffer
  2. Set limits = requests for Guaranteed QoS
  3. Enable VPA for continuous optimization
  4. Monitor and adjust based on real metrics

Monitoring & Observability

Prometheus & Grafana

  • Metrics Collection: Prometheus scrapes all services
  • Dashboards: Grafana visualizes cluster health
  • Retention: 15 days of metrics
  • AlertManager: Configured for critical alerts

Cost Monitoring

  • GKE Cost Allocation: Enabled for namespace-level cost tracking
  • Recommendations: GCP Recommender API for optimization insights
  • Target: 40-60% node utilization (healthy efficiency)

Key Metrics to Track

  1. Node CPU/Memory Utilization: Target 40-70%
  2. Pod CPU/Memory Requests vs Usage: Identify over-provisioning
  3. Cluster Autoscaler Events: Scale-up/down frequency
  4. VPA Recommendations: Automatic right-sizing suggestions
  5. ArgoCD Sync Status: Application health
  6. Cost per Namespace: Track spending trends
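
Metric 2 can be watched with a PrometheusRule along these lines (a sketch; the threshold is illustrative and metric labels depend on the kube-state-metrics version):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: overprovisioning
spec:
  groups:
    - name: capacity
      rules:
        - alert: CpuRequestsFarAboveUsage
          expr: |
            sum by (namespace) (rate(container_cpu_usage_seconds_total[5m]))
              /
            sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
              < 0.10
          for: 1h
          labels: {severity: warning}
          annotations:
            summary: "Namespace uses less than 10% of requested CPU"
```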

Rollback Strategies

Application Rollback

# Revert to previous version (via Git)
cd cluster-gitops
git revert <commit-hash>
git push origin master

# ArgoCD will sync the rollback
argocd app sync syrf-api-production

Emergency Rollback

# Manual kubectl rollback
kubectl rollout undo deployment/syrf-api -n syrf-production

# Or scale to zero and back
kubectl scale deployment/syrf-api -n syrf-production --replicas=0
kubectl scale deployment/syrf-api -n syrf-production --replicas=1

Full Cluster Rollback

If DNS cutover fails:

  1. Update DNS to point back to old cluster (TTL-dependent)
  2. Monitor old cluster for stability
  3. Investigate new cluster issues
  4. Fix and retry cutover

Migration Strategy

Parallel Deployment (Zero Downtime)

  1. Week 1: Setup new cluster, install infrastructure
  2. Week 2: Deploy to staging, validate functionality
  3. Week 3: Deploy to production (both clusters running)
  4. Week 3-4: Validate new cluster, parallel operation
  5. Week 4: DNS cutover to new cluster
  6. Week 5: Decommission old cluster

Dual-Run Period: 3-5 days to minimize costs while ensuring stability

ArgoCD Best Practices Applied

Based on official ArgoCD documentation review (2025-11-03):

  1. ✅ Repository Separation - Separate app code from config
  2. ✅ Manifest Immutability - Production pins to git tags
  3. ✅ Security via AppProjects - Environment isolation
  4. ✅ Secret Management - Destination cluster pattern (ESO)
  5. ✅ Sync Policies - Auto for staging, manual for production
  6. ✅ High Availability - 3+ nodes, HA manifest
  7. ✅ Multi-Source Pattern - 2 sources per app (within recommended limit)

Success Metrics

Technical Metrics

  • Deployment Lead Time: < 10 minutes from merge to staging
  • Preview Environment Creation: < 2 minutes (future enhancement)
  • Production Deployment: Manual approval process in place
  • Rollback Time: < 5 minutes via PR revert
  • Drift Detection: Zero untracked changes in cluster

Business Metrics

  • Migration Downtime: Target zero (parallel deployment)
  • Post-Migration Incidents: Target < 2 in first week
  • User Impact: No service degradation
  • Cost Savings: $150-200/month through right-sizing
  • Node Utilization: 40-60% (healthy efficiency)

Document Status: Approved | Owner: DevOps Team | Last Review: 2025-11-13

Changelog

2025-11-13

  • Added ApplicationSet pattern section
  • Updated environment configuration to service-per-file structure
  • Updated deployment workflows to reflect new promotion process
  • Fixed ApplicationSet version tracking issue documentation