SyRF Kubernetes Cluster Setup Guide¶
This guide provides step-by-step instructions for provisioning and configuring a new GKE cluster for the SyRF platform, incorporating lessons learned from production cluster analysis.
Table of Contents¶
- Prerequisites
- Cluster Provisioning
- Foundation Setup
- ArgoCD Installation
- Secret Management Setup
- Application Deployment
- Resource Optimization
- Monitoring & Alerts
- Verification Checklist
- Troubleshooting
- Next Steps
Prerequisites¶
Required Access¶
- GCP project access (`camarades-net`)
- `gcloud` CLI installed and authenticated
- `kubectl` CLI installed
- GitHub repository access: `camaradesuk/syrf-monorepo`, `camaradesuk/cluster-gitops`
- GitHub Personal Access Token with `repo` scope
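Before starting, it is worth confirming the required CLIs are actually on your PATH. A minimal preflight sketch (the `check_tools` helper is illustrative, not part of any official tooling):

```shell
# Fail fast if a required CLI is missing from PATH.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return $missing
}

# Preflight for this guide:
check_tools gcloud kubectl helm || echo "install the tools above before continuing"
```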
Required Knowledge¶
- Kubernetes fundamentals
- Helm chart basics
- GitOps concepts
- GCP services (GKE, GSM, Cloud DNS)
Cost Considerations¶
IMPORTANT: Based on production cluster analysis (see gke-cluster-analysis.md), the legacy cluster is significantly overprovisioned with estimated savings of $150-200/month (30-60%) achievable through right-sizing.
Recommended Dual-Run Period: 3-5 days to minimize costs while ensuring stability.
Cluster Provisioning¶
Cluster Specifications¶
Based on ArgoCD HA requirements and right-sized resource analysis:
| Specification | Value | Rationale |
|---|---|---|
| Provider | GKE (Google Kubernetes Engine) | Team familiarity, GSM integration |
| Project | `camarades-net` | Existing GCP project |
| Zone | `europe-west2-a` (London, in the `europe-west2` region) | Data residency, low latency |
| Kubernetes Version | 1.28+ (latest stable) | Long-term support, feature compatibility |
| Node Count | 3-4 nodes minimum | ArgoCD HA (pod anti-affinity) + workloads |
| Machine Type | `n1-standard-4` or `e2-standard-4` | 4 vCPU, 15-16GB RAM per node |
| Preemptible | Mixed (2 standard + 1-2 preemptible) | Cost savings with reliability |
| Workload Identity | Enabled | Secure GSM access for ESO |
| Auto-upgrade | Enabled | Security patches, K8s updates |
| Auto-repair | Enabled | Node health management |
Create Cluster¶
# Set project context
gcloud config set project camarades-net
# Create GKE cluster with Workload Identity
gcloud container clusters create syrf-cluster \
--zone=europe-west2-a \
--num-nodes=3 \
--machine-type=n1-standard-4 \
--disk-size=100 \
--enable-autoscaling \
--min-nodes=3 \
--max-nodes=6 \
--enable-autorepair \
--enable-autoupgrade \
--workload-pool=camarades-net.svc.id.goog \
--enable-shielded-nodes \
--addons=HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver
# Get cluster credentials
gcloud container clusters get-credentials syrf-cluster --zone=europe-west2-a
Verify Cluster¶
# Check cluster info
kubectl cluster-info
# Verify nodes
kubectl get nodes
# Expected output: 3 nodes in Ready state
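If you want to verify readiness in a script rather than by eye, the STATUS column can be counted. A small sketch (the `count_ready` helper and the sample output are illustrative; on a live cluster, pipe `kubectl get nodes --no-headers` into it):

```shell
# count_ready: count lines whose STATUS column is exactly "Ready"
# (input in the shape of `kubectl get nodes --no-headers`)
count_ready() {
  awk '$2 == "Ready" { n++ } END { print n+0 }'
}

# Example with captured output for illustration:
sample='node-1   Ready      <none>   5m   v1.28.3
node-2   Ready      <none>   5m   v1.28.3
node-3   NotReady   <none>   5m   v1.28.3'
printf '%s\n' "$sample" | count_ready   # prints 2
```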
Foundation Setup¶
1. Install Ingress Controller¶
# Install nginx-ingress controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--create-namespace \
--set controller.service.type=LoadBalancer \
--set controller.metrics.enabled=true
# Wait for external IP assignment
kubectl get svc -n ingress-nginx -w
# Note the EXTERNAL-IP for DNS configuration
2. Install cert-manager¶
# Install cert-manager (the release manifest bundles the CRDs, controller, and webhook)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
# Wait for cert-manager to be ready
kubectl wait --for=condition=available --timeout=300s deployment/cert-manager -n cert-manager
kubectl wait --for=condition=available --timeout=300s deployment/cert-manager-webhook -n cert-manager
# Create Let's Encrypt ClusterIssuer
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: chris.sena@ed.ac.uk
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
EOF
3. Install External DNS (Optional but Recommended)¶
# Create service account for External DNS
kubectl create serviceaccount external-dns -n kube-system
# Bind to Cloud DNS role (requires GCP IAM setup)
# See: https://kubernetes-sigs.github.io/external-dns/v0.13.5/tutorials/gke/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install external-dns bitnami/external-dns \
--namespace kube-system \
--set provider=google \
--set google.project=camarades-net \
--set domainFilters[0]=syrf.org.uk \
--set policy=sync \
--set txtOwnerId=syrf-cluster
4. Deploy RabbitMQ¶
# Add Bitnami Helm repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
# Create namespace
kubectl create namespace rabbitmq
# Install RabbitMQ with persistence and HA
helm install rabbitmq bitnami/rabbitmq \
--namespace rabbitmq \
--set auth.username=rabbit \
--set auth.password=<secure-password> \
--set replicaCount=3 \
--set persistence.enabled=true \
--set persistence.size=10Gi \
--set metrics.enabled=true
# Verify RabbitMQ is running
kubectl get pods -n rabbitmq
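Services connect to the broker via its in-cluster DNS name, which for the Bitnami chart follows `<release>.<namespace>.svc.cluster.local`. A sketch of building the AMQP URL (the `amqp_url` helper is illustrative; the password placeholder is whatever you set above):

```shell
# Build the in-cluster AMQP URL for the Bitnami release installed above.
# Service DNS follows <release>.<namespace>.svc.cluster.local on port 5672.
amqp_url() {
  user=$1 pass=$2 release=$3 namespace=$4
  echo "amqp://${user}:${pass}@${release}.${namespace}.svc.cluster.local:5672"
}

amqp_url rabbit '<secure-password>' rabbitmq rabbitmq
# -> amqp://rabbit:<secure-password>@rabbitmq.rabbitmq.svc.cluster.local:5672
```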
ArgoCD Installation¶
Install ArgoCD in HA Mode¶
# Create argocd namespace
kubectl create namespace argocd
# Install ArgoCD HA manifest
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml
# Wait for ArgoCD to be ready
kubectl wait --for=condition=available --timeout=600s deployment/argocd-server -n argocd
# Get initial admin password
ARGOCD_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
echo "ArgoCD admin password: $ARGOCD_PASSWORD"
Configure ArgoCD for Production¶
# Patch ArgoCD repo-server for better performance
kubectl patch deployment argocd-repo-server -n argocd --type='json' \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--parallelismlimit=50"},
{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--repo-cache-expiration=1h"}]'
# Patch ArgoCD application-controller for scalability
# (in the HA manifest the controller is a StatefulSet, not a Deployment)
kubectl patch statefulset argocd-application-controller -n argocd --type='json' \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--status-processors=50"},
{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--operation-processors=25"}]'
# Set resource limits on all ArgoCD components
kubectl set resources deployment argocd-server -n argocd \
--requests=cpu=100m,memory=128Mi \
--limits=cpu=500m,memory=512Mi
kubectl set resources deployment argocd-repo-server -n argocd \
--requests=cpu=100m,memory=256Mi \
--limits=cpu=500m,memory=1Gi
kubectl set resources statefulset argocd-application-controller -n argocd \
--requests=cpu=250m,memory=512Mi \
--limits=cpu=1000m,memory=2Gi
Access ArgoCD UI¶
# Port forward to ArgoCD server
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Access UI: https://localhost:8080
# Username: admin
# Password: (from previous step)
# Change admin password after first login
argocd login localhost:8080
argocd account update-password
Connect cluster-gitops Repository¶
# Via ArgoCD CLI
argocd repo add https://github.com/camaradesuk/cluster-gitops.git \
--type git \
--name cluster-gitops \
--username <github-username> \
--password <github-pat>
# Add syrf-monorepo as well
argocd repo add https://github.com/camaradesuk/syrf-monorepo.git \
--type git \
--name syrf-monorepo \
--username <github-username> \
--password <github-pat>
Secret Management Setup¶
Install External Secrets Operator¶
# Note: no separate CRD apply is needed; the Helm chart below installs the CRDs (installCRDs=true)
# Add External Secrets Helm repository
helm repo add external-secrets https://charts.external-secrets.io
helm repo update
# Install ESO operator
helm install external-secrets external-secrets/external-secrets \
--namespace external-secrets-operator \
--create-namespace \
--set installCRDs=true
Configure Workload Identity for ESO¶
# Create GCP service account for ESO
gcloud iam service-accounts create eso-secret-accessor \
--display-name "External Secrets Operator Secret Accessor"
# Grant Secret Manager Secret Accessor role
gcloud projects add-iam-policy-binding camarades-net \
--member="serviceAccount:eso-secret-accessor@camarades-net.iam.gserviceaccount.com" \
--role="roles/secretmanager.secretAccessor"
# Create Kubernetes service account
kubectl create serviceaccount external-secrets-sa -n external-secrets-operator
# Bind Workload Identity
gcloud iam service-accounts add-iam-policy-binding \
eso-secret-accessor@camarades-net.iam.gserviceaccount.com \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:camarades-net.svc.id.goog[external-secrets-operator/external-secrets-sa]"
# Annotate Kubernetes service account
kubectl annotate serviceaccount external-secrets-sa \
-n external-secrets-operator \
iam.gke.io/gcp-service-account=eso-secret-accessor@camarades-net.iam.gserviceaccount.com
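The Workload Identity member string in the binding above is easy to get wrong. A sketch of its structure (the `wi_member` helper is illustrative; the format `serviceAccount:<project>.svc.id.goog[<namespace>/<ksa>]` is the one used in the command above):

```shell
# Compose the Workload Identity member string used in the IAM binding:
#   serviceAccount:<project>.svc.id.goog[<k8s-namespace>/<k8s-serviceaccount>]
wi_member() {
  project=$1 namespace=$2 ksa=$3
  echo "serviceAccount:${project}.svc.id.goog[${namespace}/${ksa}]"
}

wi_member camarades-net external-secrets-operator external-secrets-sa
# -> serviceAccount:camarades-net.svc.id.goog[external-secrets-operator/external-secrets-sa]
```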
Create SecretStores¶
# Create SecretStore for staging
cat <<EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: google-secret-manager
  namespace: syrf-staging
spec:
  provider:
    gcpsm:
      projectID: camarades-net
      auth:
        workloadIdentity:
          clusterLocation: europe-west2-a
          clusterName: syrf-cluster
          serviceAccountRef:
            name: external-secrets-sa
EOF
# Create SecretStore for production
cat <<EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: google-secret-manager
  namespace: syrf-production
spec:
  provider:
    gcpsm:
      projectID: camarades-net
      auth:
        workloadIdentity:
          clusterLocation: europe-west2-a
          clusterName: syrf-cluster
          serviceAccountRef:
            name: external-secrets-sa
EOF
Test Secret Synchronization¶
# Create a test secret in GSM
echo -n "test-value" | gcloud secrets create test-secret --data-file=-
# Create ExternalSecret resource
cat <<EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: test-secret
  namespace: syrf-staging
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: google-secret-manager
    kind: SecretStore
  target:
    name: test-secret
    creationPolicy: Owner
  data:
  - secretKey: value
    remoteRef:
      key: test-secret
EOF
# Verify secret was created
kubectl get secret test-secret -n syrf-staging
kubectl get secret test-secret -n syrf-staging -o jsonpath='{.data.value}' | base64 -d
# Cleanup test secret
kubectl delete externalsecret test-secret -n syrf-staging
gcloud secrets delete test-secret --quiet
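The `jsonpath` + `base64 -d` pattern used in the verification above simply reverses the base64 encoding Kubernetes applies to Secret data, and can be sanity-checked locally without a cluster:

```shell
# Kubernetes stores Secret values base64-encoded; the jsonpath + `base64 -d`
# pipeline above reverses exactly this encoding.
encoded=$(printf '%s' "test-value" | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"   # prints: test-value
```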
Application Deployment¶
Deploy ArgoCD AppProjects¶
# Apply AppProject for staging
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/project-staging.yaml
# Apply AppProject for production
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/project-production.yaml
# Verify projects
argocd proj list
Deploy Applications¶
# Deploy all applications
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/api.yaml
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/project-management.yaml
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/quartz.yaml
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/web.yaml
# Check application status
argocd app list
# Sync applications (if not auto-syncing)
argocd app sync syrf-api-staging
argocd app sync syrf-project-management-staging
argocd app sync syrf-quartz-staging
argocd app sync syrf-web-staging
Verify Deployments¶
# Check pods in staging namespace
kubectl get pods -n syrf-staging
# Check services
kubectl get svc -n syrf-staging
# Check ingresses
kubectl get ingress -n syrf-staging
# View logs
kubectl logs -n syrf-staging -l app.kubernetes.io/name=syrf-api
Resource Optimization¶
Background: GKE Cluster Analysis¶
Based on analysis of the legacy production cluster (see gke-cluster-analysis.md):
- Production API: Requesting 1500m CPU, using 20m (98.7% waste)
- Staging API: Requesting 200m CPU, using 15m (92.5% waste)
- Staging Web: Requesting 200m CPU, using 2m (99% waste)
- Cluster-wide CPU utilization: 1-4% (10 nodes maintained for workloads that fit on 3-4)
- Estimated savings: $150-200/month (30-60%) through right-sizing
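The waste figures above follow directly from requested vs. observed millicores, e.g. (1500 - 20) / 1500 ≈ 98.7%. A one-liner to reproduce them (the `waste_pct` helper is illustrative):

```shell
# Percentage of a CPU request left unused, from requested vs observed millicores.
# awk handles the floating-point division.
waste_pct() {
  awk -v req="$1" -v used="$2" 'BEGIN { printf "%.1f\n", (req - used) * 100 / req }'
}

waste_pct 1500 20   # production API -> 98.7
waste_pct 200 15    # staging API    -> 92.5
waste_pct 200 2     # staging web    -> 99.0
```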
Recommended Resource Values¶
Update cluster-gitops/environments/{staging,production}/*.values.yaml with these right-sized values:
API Service¶
# cluster-gitops/environments/production/api.values.yaml
resources:
  requests:
    cpu: 20m       # was: 1500m (-98.7%)
    memory: 795Mi  # was: 3Gi (-74%)
  limits:
    cpu: 100m      # allow 5x burst for peak loads
    memory: 795Mi  # memory limit equals request (no memory overcommit)
Project Management Service¶
# cluster-gitops/environments/production/project-management.values.yaml
resources:
  requests:
    cpu: 50m       # conservative estimate
    memory: 512Mi  # based on similar workloads
  limits:
    cpu: 200m      # allow 4x burst
    memory: 512Mi  # memory limit equals request (no memory overcommit)
Quartz Service¶
# cluster-gitops/environments/production/quartz.values.yaml
resources:
  requests:
    cpu: 30m       # background job processor
    memory: 256Mi  # moderate memory needs
  limits:
    cpu: 150m      # allow 5x burst
    memory: 256Mi  # memory limit equals request (no memory overcommit)
Web Service (Angular/NGINX)¶
# cluster-gitops/environments/production/web.values.yaml
resources:
  requests:
    cpu: 2m       # was: 200m (-99%)
    memory: 7Mi   # was: 128Mi (-94.5%)
  limits:
    cpu: 10m      # allow 5x burst
    memory: 14Mi  # 2x request
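Note that Kubernetes grants Guaranteed QoS only when every container's requests equal its limits for both CPU and memory; with the CPU burst headroom above, these pods land in the Burstable class (memory is still protected because the memory limit equals the request). A tiny illustration of the rule (the `qos_class` helper is a simplified, single-container sketch, not the full kubelet logic):

```shell
# Simplified single-container QoS rule: Guaranteed only if requests == limits
# for both CPU and memory; otherwise (with any request set) Burstable.
qos_class() {
  cpu_req=$1 cpu_lim=$2 mem_req=$3 mem_lim=$4
  if [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 20m 100m 795Mi 795Mi   # API values above -> Burstable
qos_class 100m 100m 795Mi 795Mi  # equal requests/limits -> Guaranteed
```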
Enable Vertical Pod Autoscaler (VPA)¶
# Enable VPA on cluster
gcloud container clusters update syrf-cluster \
--enable-vertical-pod-autoscaling \
--zone=europe-west2-a
# Deploy VPA for API service (recommendation mode)
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-api-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-api
  updatePolicy:
    updateMode: "Off"  # Start with recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: syrf-api
      minAllowed:
        cpu: 10m
        memory: 100Mi
      maxAllowed:
        cpu: 500m
        memory: 2Gi
EOF
# After 24-48 hours, review recommendations
kubectl describe vpa syrf-api-vpa -n syrf-production
# If recommendations look good, switch to Auto mode
kubectl patch vpa syrf-api-vpa -n syrf-production \
--type='json' \
-p='[{"op": "replace", "path": "/spec/updatePolicy/updateMode", "value": "Auto"}]'
Enable Cluster Autoscaler¶
# Enable autoscaling on default node pool
gcloud container clusters update syrf-cluster \
--enable-autoscaling \
--node-pool=default-pool \
--min-nodes=3 \
--max-nodes=6 \
--zone=europe-west2-a
# Monitor autoscaler activity. On GKE the cluster autoscaler runs on the
# control plane, so there is no cluster-autoscaler deployment in kube-system;
# review its decisions via scaling events (or the cluster-autoscaler-visibility
# logs in Cloud Logging)
kubectl get events -A | grep -i "TriggeredScaleUp\|ScaleDown\|autoscaler"
Monitoring & Alerts¶
Enable GKE Cost Allocation¶
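This section's command is not spelled out above; a sketch, assuming GKE's `--enable-cost-allocation` flag (verify the flag name against `gcloud container clusters update --help` for your gcloud version). The snippet prints the command by default and only executes it when `APPLY=1`:

```shell
# Build (and by default just print) the cost-allocation update command,
# which surfaces per-namespace/workload costs in Cloud Billing exports.
# Flag name assumed from GKE docs; run with APPLY=1 to execute.
cmd="gcloud container clusters update syrf-cluster --enable-cost-allocation --zone=europe-west2-a"
if [ "${APPLY:-0}" = "1" ]; then
  $cmd
else
  echo "$cmd"
fi
```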
Install Prometheus & Grafana¶
# Add Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (Prometheus + Grafana + AlertManager)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=15d \
--set grafana.adminPassword=<secure-password>
# Access Grafana
kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80
# URL: http://localhost:3000 (admin / <password>)
Configure Alerts¶
# Create alerts via a PrometheusRule, which the kube-prometheus-stack
# operator discovers automatically (a plain ConfigMap would be ignored;
# the release label below must match your Helm release name)
cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: resource-optimization
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: resource-optimization
    rules:
    - alert: NodeLowCPUUtilization
      expr: 100 - (avg by(node) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) < 20
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Node {{ \$labels.node }} has low CPU utilization"
        description: "CPU utilization is {{ \$value }}% for more than 1 hour"
    - alert: PodPending
      expr: kube_pod_status_phase{phase="Pending"} > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ \$labels.pod }} is pending"
        description: "Pod has been pending for more than 5 minutes"
EOF
Verification Checklist¶
Phase 1: Foundation¶
- GKE cluster created with 3+ nodes
- `kubectl` configured and working
- Ingress controller running with external IP
- cert-manager installed and ClusterIssuer created
- External DNS configured (optional)
- RabbitMQ running with 3 replicas
Phase 2: ArgoCD¶
- ArgoCD installed in HA mode (3+ replicas)
- ArgoCD UI accessible
- Admin password changed
- GitHub repositories connected
- AppProjects created (staging, production)
Phase 3: Secret Management¶
- External Secrets Operator installed
- Workload Identity configured
- SecretStores created (staging, production)
- Test secret synchronization successful
Phase 4: Applications¶
- All ArgoCD Applications deployed
- Staging applications synced
- Pods running in syrf-staging namespace
- Ingress working with TLS certificates
- Services accessible via domain names
Phase 5: Optimization¶
- Resource requests right-sized based on GKE analysis
- VPA enabled with recommendations reviewed
- Cluster autoscaler enabled
- Cost allocation enabled
- Prometheus & Grafana deployed
- Alerts configured
Troubleshooting¶
ArgoCD Can't Access Repository¶
- Check repository credentials in ArgoCD
- Verify GitHub PAT has `repo` scope
- Test `git clone` manually
Pods Not Starting¶
# Check pod events
kubectl describe pod <pod-name> -n syrf-staging
# Check pod logs
kubectl logs <pod-name> -n syrf-staging
# Check node resources
kubectl top nodes
Secret Sync Failing¶
# Check ESO logs
kubectl logs -n external-secrets-operator deployment/external-secrets
# Check ExternalSecret status
kubectl describe externalsecret <name> -n syrf-staging
# Verify Workload Identity
gcloud iam service-accounts get-iam-policy \
eso-secret-accessor@camarades-net.iam.gserviceaccount.com
High Resource Usage¶
# Check actual resource usage
kubectl top pods -n syrf-staging
# Compare to requests
kubectl describe pod <pod-name> -n syrf-staging | grep -A 5 Requests
# Review VPA recommendations
kubectl describe vpa -n syrf-staging
Next Steps¶
- Week 1: Deploy to staging, validate functionality
- Week 2: Fine-tune resource requests based on actual usage
- Week 3: Deploy to production (parallel with old cluster)
- Week 3-4: Monitor stability, validate with users
- Week 4: DNS cutover to new cluster
- Week 5: Decommission old cluster
References¶
- GKE Cluster Analysis - Detailed analysis of legacy cluster
- ADR-003: Cluster Architecture - Architectural decisions
- ArgoCD Documentation
- External Secrets Operator
- GKE Best Practices
Document Status: Approved for implementation Owner: DevOps Team Last Review: 2025-11-11