GitOps Architecture for SyRF Platform¶
Overview¶
This document describes the GitOps architecture for deploying the SyRF platform using ArgoCD. The architecture follows the two-repository model and implements ArgoCD best practices for production environments.
GCP Configuration: Project camarades-net | Region europe-west2 | Cluster camaradesuk in zone europe-west2-a
For complete infrastructure details, see CLAUDE.md (search for "GCP & Infrastructure Configuration").
Repository Structure¶
Two-Repository Model¶
syrf-monorepo (Application Code):
- Source code for all services
- Helm charts with templates
- Docker build definitions
- CI/CD workflows (GitHub Actions)
- Semantic versioning with GitVersion
cluster-gitops (Environment Configuration):
- Environment-specific values files
- ArgoCD Application manifests
- ArgoCD AppProject definitions
- Infrastructure configurations
Why Two Repositories?¶
- Separation of Concerns: App code changes don't trigger infrastructure reviews
- Access Control: Different teams can manage apps vs infrastructure
- Audit Trail: Clear history of configuration changes
- ArgoCD Best Practice: Recommended pattern for production
Key Architectural Decisions¶
1. Multi-Source Applications ✅¶
Each ArgoCD Application uses 2 sources:
- Source 1: Helm chart from syrf-monorepo
- Source 2: Values file from cluster-gitops (using the $values reference)
Benefits:
- Chart updates don't require config repo changes
- Environment-specific configs isolated
- Follows ArgoCD recommended pattern (2-3 sources max)
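A minimal sketch of a two-source Application, to make the `$values` mechanism concrete. The monorepo URL, the Application name, and the project name are assumptions for illustration; the chart path and values file paths are taken from elsewhere in this document.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: syrf-api-production          # illustrative name
  namespace: argocd                  # assumes ArgoCD runs in the "argocd" namespace
spec:
  project: production                # hypothetical AppProject name
  destination:
    server: https://kubernetes.default.svc
    namespace: syrf-production
  sources:
    # Source 1: the Helm chart from the application repository
    - repoURL: https://github.com/camaradesuk/syrf-monorepo   # assumed URL
      targetRevision: api-v8.20.1
      path: src/services/api/charts/api
      helm:
        valueFiles:
          # $values resolves to the root of the source declared with "ref: values"
          - $values/syrf/global.values.yaml
          - $values/syrf/environments/production/production.values.yaml
    # Source 2: the config repository, referenced only for its values files
    - repoURL: https://github.com/camaradesuk/cluster-gitops
      targetRevision: HEAD
      ref: values
```

Because the second source has a `ref` and no `path`, ArgoCD uses it only to resolve `$values/...` paths and does not render manifests from it.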
2. Secret Management: External Secrets Operator + Google Secret Manager ✅¶
Decision Date: 2025-11-03
Implementation:
- Secrets stored in Google Secret Manager (GSM)
- External Secrets Operator (ESO) syncs secrets to Kubernetes
- ArgoCD never has access to secret values
- Workload Identity binds ESO to GSM
Why ESO?
- ArgoCD best practice: Destination cluster secret management
- Maintains continuity with Jenkins X (already using GSM)
- Enhanced security: Secrets don't persist in Redis cache
- Better for multi-environment (separate GSM projects)
Rejected Alternative: Sealed Secrets
- Would require migrating secrets out of GSM
- Team already familiar with GSM patterns
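The ESO-to-GSM wiring can be sketched as a store plus one ExternalSecret. This is an illustration under stated assumptions, not the deployed configuration: the store name, service account, secret names, and keys are hypothetical, while the project, cluster, and zone come from the GCP configuration above.

```yaml
# ClusterSecretStore: tells ESO how to reach Google Secret Manager
apiVersion: external-secrets.io/v1beta1   # newer ESO releases also serve "v1"
kind: ClusterSecretStore
metadata:
  name: gcp-secret-manager                # hypothetical name
spec:
  provider:
    gcpsm:
      projectID: camarades-net
      auth:
        workloadIdentity:
          clusterLocation: europe-west2-a
          clusterName: camaradesuk
          serviceAccountRef:
            name: external-secrets        # hypothetical KSA bound to a GCP service account
            namespace: external-secrets   # hypothetical namespace
---
# ExternalSecret: asks ESO to materialise one GSM secret as a Kubernetes Secret
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: syrf-api-secrets                  # hypothetical
  namespace: syrf-staging
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: gcp-secret-manager
  target:
    name: syrf-api-secrets                # Kubernetes Secret that ESO creates
  data:
    - secretKey: mongodb-connection-string        # key inside the Kubernetes Secret
      remoteRef:
        key: staging-mongodb-connection-string    # hypothetical GSM secret name
```

Only the ExternalSecret manifest lives in Git; the secret value is fetched inside the cluster, so it never passes through ArgoCD.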
3. Git Pinning Strategy ✅¶
Production: Pin to git tags (e.g., api-v8.20.1)
- Immutable manifests
- Explicit version control
- Prevents "manifests suddenly changing meaning"
Staging: Track main branch
- Rapid iteration
- Automatic updates
- Fast feedback loop
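In Application terms, the two strategies differ only in `targetRevision`. The fragment below is a sketch; the repository URL is an assumption and the chart path is the one listed in the Application Inventory.

```yaml
# Production: pinned to an immutable chart tag
source:
  repoURL: https://github.com/camaradesuk/syrf-monorepo   # assumed URL
  path: src/services/api/charts/api
  targetRevision: api-v8.20.1
---
# Staging: follows the tip of the main branch
source:
  repoURL: https://github.com/camaradesuk/syrf-monorepo   # assumed URL
  path: src/services/api/charts/api
  targetRevision: main
```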
4. Sync Policies ✅¶
Staging:
syncPolicy:
  automated:
    prune: true      # Remove resources not in Git
    selfHeal: true   # Correct drift automatically
  syncOptions:
    - CreateNamespace=true
Production:
syncPolicy:
  # NO automated sync - manual approval required
  syncOptions:
    - CreateNamespace=true
    - PruneLast=true
5. High Availability Requirements ✅¶
ArgoCD HA Mode:
- Minimum 3 nodes for pod anti-affinity
- Multiple replicas: repo-server, application-controller, server
- Resource limits to prevent OOM
- Redis HA for the cache layer
Cluster Sizing:
- Provider: GKE (Google Kubernetes Engine)
- Version: Kubernetes 1.28+
- Nodes: 3-4 (n1-standard-4 or e2-standard-4)
- Zone: europe-west2-a (London; region europe-west2)
6. Security via AppProjects ✅¶
project-staging.yaml:
- Source repository whitelisting
- Destination: syrf-staging namespace only
- Prevents accidental production changes
project-production.yaml:
- Stricter controls
- Manual sync only
- Destination: syrf-production namespace only
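As a rough illustration of what project-staging.yaml might contain, the sketch below restricts both the allowed source repositories and the destination namespace. The project name and the monorepo URL are assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: syrf-staging               # hypothetical project name
  namespace: argocd                # assumes ArgoCD runs in the "argocd" namespace
spec:
  description: SyRF staging workloads
  # Only these repositories may be used as Application sources
  sourceRepos:
    - https://github.com/camaradesuk/syrf-monorepo    # assumed URL
    - https://github.com/camaradesuk/cluster-gitops
  # Applications in this project may deploy only to syrf-staging
  destinations:
    - server: https://kubernetes.default.svc
      namespace: syrf-staging
```

An Application assigned to this project that targets `syrf-production` would be rejected by ArgoCD, which is what prevents accidental production changes.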
Application Inventory¶
| Service | Chart Location | Image Registry | Purpose |
|---|---|---|---|
| syrf-api | syrf-monorepo/src/services/api/charts/api/ | ghcr.io/camaradesuk/syrf-api | Main REST API |
| syrf-project-management | syrf-monorepo/src/services/project-management/charts/project-management/ | ghcr.io/camaradesuk/syrf-project-management | Project & study management |
| syrf-quartz | syrf-monorepo/src/services/quartz/charts/quartz/ | ghcr.io/camaradesuk/syrf-quartz | Background job scheduler |
| syrf-web | syrf-monorepo/src/services/web/charts/web/ | ghcr.io/camaradesuk/syrf-web | Angular frontend |
Note: syrf-s3-notifier is deployed as AWS Lambda, not in Kubernetes.
Infrastructure Components¶
Required Platform Components¶
- ArgoCD (GitOps Engine)
  - HA mode with 3+ replicas
  - Redis for state management
  - Repo-server for chart fetching
- Ingress Controller (nginx-ingress)
  - External LoadBalancer
  - TLS termination
  - Path-based routing
- cert-manager (TLS Certificate Management)
  - Let's Encrypt integration
  - Automatic certificate renewal
  - ClusterIssuer: letsencrypt-prod
- External DNS (Automatic DNS Management)
  - Google Cloud DNS integration
  - Domain: syrf.org.uk
  - Automatic record creation/deletion
- External Secrets Operator (Secret Synchronization)
  - Google Secret Manager integration
  - Workload Identity authentication
  - Automatic secret refresh
- RabbitMQ (Message Queue)
  - In-cluster deployment (Helm chart)
  - 3 replicas for HA
  - Persistent storage
- MongoDB Atlas (Database)
  - Managed cloud service
  - Existing deployment
  - Connection via ExternalSecrets
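For reference, a `letsencrypt-prod` ClusterIssuer could look like the sketch below. It assumes HTTP-01 validation through the nginx ingress class; the contact email and the account-key Secret name are placeholders, and a DNS-01 solver would be an equally plausible choice.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@example.org               # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key    # Secret holding the ACME account key (placeholder name)
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx         # older cert-manager versions use "class: nginx"
```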
Environment Configuration¶
Configuration Structure¶
Updated 2025-11-13: Environment configurations use a service-per-file structure for better version tracking with ArgoCD.
Directory Layout:
cluster-gitops/
├── argocd/
│   └── applicationsets/
│       ├── syrf.yaml                  # ApplicationSet for staging/production
│       └── syrf-previews.yaml         # ApplicationSet for PR previews
└── syrf/
    ├── global.values.yaml             # Universal defaults for ALL environments
    ├── services/{svc}/                # Service base configurations
    │   ├── config.yaml                # Service metadata (chartPath, chartRepo)
    │   └── values.yaml                # Base Helm values
    └── environments/
        ├── staging/
        │   ├── namespace.yaml         # Environment metadata
        │   ├── staging.values.yaml    # Environment-specific Helm values
        │   └── {svc}/                 # Service configurations (CI/CD updates)
        │       ├── config.yaml        # chartTag, imageTag
        │       └── values.yaml        # Service-specific values
        ├── production/
        │   ├── namespace.yaml
        │   ├── production.values.yaml
        │   └── {svc}/                 # Service configurations (manual promotion)
        │       ├── config.yaml
        │       └── values.yaml
        └── preview/
            ├── preview.values.yaml    # Preview defaults (all PRs)
            ├── services/{svc}/        # Preview service defaults
            │   └── values.yaml
            └── pr-{n}/                # PR-specific (auto-generated)
                ├── pr.yaml            # PR metadata (headSha, branch)
                └── services/{svc}.values.yaml
Values Hierarchy (lowest to highest priority):
1. syrf/global.values.yaml # Universal defaults
2. syrf/services/{svc}/values.yaml # Service base values
3. syrf/environments/{env}/{env}.values.yaml # Environment defaults
4. syrf/environments/{env}/{svc}/values.yaml # Environment+service values
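A small worked example of how the layers combine. Helm applies value files in order, with later files overriding earlier ones and maps merged key by key. The keys and numbers below are hypothetical and only illustrate the precedence.

```yaml
# 1. syrf/global.values.yaml
replicaCount: 1
resources:
  requests:
    cpu: 50m
    memory: 128Mi
---
# 3. syrf/environments/production/production.values.yaml
replicaCount: 2
---
# 4. syrf/environments/production/api/values.yaml
resources:
  requests:
    cpu: 100m
---
# Effective values for the api service in production
replicaCount: 2          # from layer 3
resources:
  requests:
    cpu: 100m            # from layer 4
    memory: 128Mi        # inherited from layer 1 (sibling keys survive the merge)
```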
Service Configuration File Format:
# environments/staging/services/api.yaml
service:
  name: api
  enabled: true
  chartTag: api-v8.21.0   # Updated by CI/CD
  chartRepo: https://github.com/camaradesuk/syrf-test
  chartPath: src/services/api/.chart
Staging Environment¶
Namespace: syrf-staging
Characteristics:
- Auto-sync enabled
- Fast feedback for development
- Lower resource requests
- Debug mode enabled
Service Files: Self-contained configs in environments/staging/services/
- Includes chartTag (updated automatically by CI/CD)
- Includes chartRepo and chartPath
- Helm values in separate syrf/{service}/values-staging.yaml files
Production Environment¶
Namespace: syrf-production
Characteristics:
- Manual sync only
- Requires PR approval and merge
- Right-sized resource requests (see GKE analysis)
- Debug mode disabled
Service Files: Self-contained configs in environments/production/services/
- Includes chartTag (updated via manual promotion PRs)
- Includes chartRepo and chartPath
- Helm values in separate syrf/{service}/values-production.yaml files
ApplicationSet Pattern¶
Updated 2025-11-13: SyRF uses ArgoCD ApplicationSet with a matrix generator to automatically create and manage Applications for all services across environments.
ApplicationSet Structure¶
# cluster-gitops/applicationsets/syrf.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
spec:
  generators:
    - matrix:
        generators:
          # Read environment metadata
          - git:
              files:
                - path: "environments/*/namespace.yaml"
          # Read service configurations for that environment
          - git:
              files:
                - path: "environments/{{.environment.name}}/services/*.yaml"
  template:
    spec:
      sources:
        - repoURL: '{{.service.chartRepo}}'
          targetRevision: '{{.service.chartTag}}'   # Version from service file
          path: '{{.service.chartPath}}'
          helm:
            valueFiles:
              - $values/syrf/global.values.yaml
              - $values/syrf/services/{{.serviceName}}/values.yaml
              - $values/syrf/environments/{{.envName}}/{{.envName}}.values.yaml
              - $values/syrf/environments/{{.envName}}/{{.serviceName}}/values.yaml
        - repoURL: https://github.com/camaradesuk/cluster-gitops
          ref: values
How it works:
- Matrix Generator creates a cartesian product: environment × services
- For each combination, creates an Application (e.g., api-staging, api-production)
- Each Application's targetRevision comes from the service file's chartTag
- When CI/CD updates chartTag, ArgoCD automatically detects and syncs
Benefits:
- ✅ DRY: Single ApplicationSet generates all Applications
- ✅ Automatic: New services/environments automatically get Applications
- ✅ Version Tracking: targetRevision updates automatically from git
- ✅ Composable: Environment metadata + service configs = Application
Deployment Workflow¶
1. Code Change in syrf-monorepo¶
Developer → Code Change → GitHub PR → Merge to main
↓
CI/CD Workflow
↓
GitVersion calculates version
↓
Docker build & push to GHCR
↓
Create git tag (e.g., api-v8.21.0)
2. Automated Staging Promotion¶
CI/CD Workflow → Create PR to cluster-gitops
↓
Update environments/staging/services/{service}.yaml
(Updates service.chartTag field)
↓
YAML validation passes
↓
PR auto-merged
↓
Workflow waits for merge completion
↓
ArgoCD detects chartTag change
↓
ApplicationSet regenerates Application with new targetRevision
↓
Auto-sync to syrf-staging namespace
Key Points:
- Individual service files updated (not a single YAML with array)
- Workflow waits for PR merge before completing (ensures deployment ready)
- ApplicationSet automatically picks up new chartTag value
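The promotion step could be implemented along the following lines in GitHub Actions. This is a sketch under several assumptions: the job name, the `GITOPS_TOKEN` secret, and the branch naming are hypothetical, auto-merge must be enabled on cluster-gitops, and the step that waits for the merge to complete is omitted.

```yaml
# Hypothetical job in a syrf-monorepo workflow; names and secrets are placeholders
jobs:
  promote-staging:
    runs-on: ubuntu-latest
    env:
      SERVICE: api
      CHART_TAG: api-v8.21.0                   # in practice, the tag created earlier in the pipeline
      GH_TOKEN: ${{ secrets.GITOPS_TOKEN }}    # hypothetical token with write access to cluster-gitops
    steps:
      - uses: actions/checkout@v4
        with:
          repository: camaradesuk/cluster-gitops
          token: ${{ secrets.GITOPS_TOKEN }}
      - name: Update chartTag in the service file
        run: |
          yq -i '.service.chartTag = strenv(CHART_TAG)' \
            "environments/staging/services/${SERVICE}.yaml"
      - name: Open PR and enable auto-merge
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git checkout -b "promote/${SERVICE}-${CHART_TAG}"
          git commit -am "chore(staging): ${SERVICE} -> ${CHART_TAG}"
          git push origin HEAD
          gh pr create --fill
          gh pr merge --auto --squash
```

Editing a single field with `yq` keeps the diff to one line per promotion, which makes the cluster-gitops history easy to audit and to revert.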
3. Manual Production Promotion¶
Team validates staging → Trigger "Promote to Production" workflow
↓
Options: Copy all from staging OR specify versions
↓
Create PR to cluster-gitops
↓
Update environments/production/services/{service}.yaml
(Updates service.chartTag field)
↓
YAML validation passes
↓
PR review & approval (manual gate)
↓
Merge PR
↓
ArgoCD detects chartTag change
↓
ApplicationSet regenerates Application with new targetRevision
↓
Auto-sync to syrf-production namespace
Key Points:
- Manual workflow trigger (not automatic)
- Option to copy all versions from staging OR cherry-pick specific versions
- Requires human approval before merge
- After merge, ArgoCD auto-syncs (no manual sync needed in UI)
Resource Optimization¶
Based on GKE Cluster Analysis, the legacy cluster was significantly overprovisioned:
Legacy Issues¶
- Production API: Requesting 1500m CPU, using 20m (98.7% waste)
- Staging API: Requesting 200m CPU, using 15m (92.5% waste)
- Staging Web: Requesting 200m CPU, using 2m (99% waste)
- Cluster-wide: 10 nodes maintained for workloads that fit on 3-4 nodes
- Estimated waste: $150-200/month
Right-Sized Recommendations¶
See the GKE Cluster Analysis document for detailed recommendations.
Key Principles:
- Start with actual usage + small buffer
- Set limits = requests for Guaranteed QoS
- Enable VPA for continuous optimization
- Monitor and adjust based on real metrics
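These principles might translate into a values fragment and a recommendation-only VPA such as the sketch below. The numbers are illustrative (for example, an API observed at about 20m CPU), the `resources` key depends on how the chart exposes it, and the VPA name is hypothetical.

```yaml
# Helm values fragment: requests equal to limits for both CPU and memory
resources:
  requests:
    cpu: 50m            # illustrative: observed usage plus headroom
    memory: 256Mi       # illustrative
  limits:
    cpu: 50m            # equal to the request
    memory: 256Mi       # equal to the request
---
# VPA in recommendation-only mode: it reports suggestions but does not evict pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-api                  # hypothetical
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-api
  updatePolicy:
    updateMode: "Off"             # apply recommended changes through Git instead
```

A pod is classed as Guaranteed only when every container in it sets requests equal to limits for both CPU and memory. Keeping the VPA in "Off" mode avoids it fighting ArgoCD over the live resource values.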
Monitoring & Observability¶
Prometheus & Grafana¶
- Metrics Collection: Prometheus scrapes all services
- Dashboards: Grafana visualizes cluster health
- Retention: 15 days of metrics
- AlertManager: Configured for critical alerts
Cost Monitoring¶
- GKE Cost Allocation: Enabled for namespace-level cost tracking
- Recommendations: GCP Recommender API for optimization insights
- Target: 40-60% node utilization (healthy efficiency)
Key Metrics to Track¶
- Node CPU/Memory Utilization: Target 40-60%
- Pod CPU/Memory Requests vs Usage: Identify over-provisioning
- Cluster Autoscaler Events: Scale-up/down frequency
- VPA Recommendations: Automatic right-sizing suggestions
- ArgoCD Sync Status: Application health
- Cost per Namespace: Track spending trends
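As one example of tracking ArgoCD sync status, the rule below alerts when an Application stays OutOfSync. It assumes the Prometheus Operator is installed (for the PrometheusRule resource) and that ArgoCD's application-controller metrics are being scraped; the rule name, namespace, and thresholds are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-sync-alerts        # hypothetical
  namespace: monitoring           # hypothetical
spec:
  groups:
    - name: argocd
      rules:
        - alert: ArgoCDAppOutOfSync
          # argocd_app_info exposes one series per Application with a sync_status label
          expr: argocd_app_info{sync_status="OutOfSync"} == 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.name }} has been OutOfSync for 15 minutes"
```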
Rollback Strategies¶
Application Rollback¶
# Revert to previous version (via Git)
cd cluster-gitops
git revert <commit-hash>
git push origin master
# ArgoCD will sync the rollback
argocd app sync syrf-api-production
Emergency Rollback¶
# Manual kubectl rollback
kubectl rollout undo deployment/syrf-api -n syrf-production
# Or scale to zero and back
kubectl scale deployment/syrf-api -n syrf-production --replicas=0
kubectl scale deployment/syrf-api -n syrf-production --replicas=1
Full Cluster Rollback¶
If DNS cutover fails:
- Update DNS to point back to old cluster (TTL-dependent)
- Monitor old cluster for stability
- Investigate new cluster issues
- Fix and retry cutover
Migration Strategy¶
Parallel Deployment (Zero Downtime)¶
- Week 1: Setup new cluster, install infrastructure
- Week 2: Deploy to staging, validate functionality
- Week 3: Deploy to production (both clusters running)
- Week 3-4: Validate new cluster, parallel operation
- Week 4: DNS cutover to new cluster
- Week 5: Decommission old cluster
Dual-Run Period: 3-5 days to minimize costs while ensuring stability
ArgoCD Best Practices Applied¶
Based on official ArgoCD documentation review (2025-11-03):
- ✅ Repository Separation - Separate app code from config
- ✅ Manifest Immutability - Production pins to git tags
- ✅ Security via AppProjects - Environment isolation
- ✅ Secret Management - Destination cluster pattern (ESO)
- ✅ Sync Policies - Auto for staging, manual for production
- ✅ High Availability - 3+ nodes, HA manifest
- ✅ Multi-Source Pattern - 2 sources per app (within recommended limit)
Success Metrics¶
Technical Metrics¶
- Deployment Lead Time: < 10 minutes from merge to staging
- Preview Environment Creation: < 2 minutes (future enhancement)
- Production Deployment: Manual approval process in place
- Rollback Time: < 5 minutes via PR revert
- Drift Detection: Zero untracked changes in cluster
Business Metrics¶
- Migration Downtime: Target zero (parallel deployment)
- Post-Migration Incidents: Target < 2 in first week
- User Impact: No service degradation
- Cost Savings: $150-200/month through right-sizing
- Node Utilization: 40-60% (healthy efficiency)
References¶
- GKE Cluster Analysis - Production cluster performance analysis
- ADR-003: Cluster Architecture - Architectural decisions
- ArgoCD Documentation
- External Secrets Operator
- syrf-monorepo
- cluster-gitops
Document Status: Approved | Owner: DevOps Team | Last Review: 2025-11-13
Changelog¶
2025-11-13¶
- Added ApplicationSet pattern section
- Updated environment configuration to service-per-file structure
- Updated deployment workflows to reflect new promotion process
- Fixed ApplicationSet version tracking issue documentation