SyRF Kubernetes Cluster Setup Guide

This guide provides step-by-step instructions for provisioning and configuring a new GKE cluster for the SyRF platform, incorporating lessons learned from production cluster analysis.

Table of Contents

  1. Prerequisites
  2. Cluster Provisioning
  3. Foundation Setup
  4. ArgoCD Installation
  5. Secret Management Setup
  6. Application Deployment
  7. Resource Optimization
  8. Monitoring & Alerts
  9. Verification Checklist

Prerequisites

Required Access

  • GCP project access (camarades-net)
  • gcloud CLI installed and authenticated
  • kubectl CLI installed
  • GitHub repository access:
  • camaradesuk/syrf-monorepo
  • camaradesuk/cluster-gitops
  • GitHub Personal Access Token with repo scope

Required Knowledge

  • Kubernetes fundamentals
  • Helm chart basics
  • GitOps concepts
  • GCP services (GKE, GSM, Cloud DNS)

Cost Considerations

IMPORTANT: Production cluster analysis (see gke-cluster-analysis.md) shows the legacy cluster is significantly overprovisioned; right-sizing is estimated to save $150-200/month (30-60%).

Recommended Dual-Run Period: 3-5 days to minimize costs while ensuring stability.


Cluster Provisioning

Cluster Specifications

Based on ArgoCD HA requirements and right-sized resource analysis:

Specification        Value                                  Rationale
-------------        -----                                  ---------
Provider             GKE (Google Kubernetes Engine)         Team familiarity, GSM integration
Project              camarades-net                          Existing GCP project
Zone                 europe-west2-a (London)                Data residency, low latency (note: europe-west2-a is a zone; the region is europe-west2)
Kubernetes Version   1.28+ (latest stable)                  Long-term support, feature compatibility
Node Count           3-4 nodes minimum                      ArgoCD HA (pod anti-affinity) + workloads
Machine Type         n1-standard-4 or e2-standard-4         4 vCPU, 15 GB RAM per node
Preemptible          Mixed (2 standard + 1-2 preemptible)   Cost savings with reliability
Workload Identity    Enabled                                Secure GSM access for ESO
Auto-upgrade         Enabled                                Security patches, K8s updates
Auto-repair          Enabled                                Node health management
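As a rough sanity check on the node count, the aggregate capacity of the minimum pool follows from the machine shape in the table above (a sketch, not a sizing tool; per-node allocatable is somewhat lower once system daemons are reserved):

```shell
# Aggregate capacity of the minimum node pool (3 x n1-standard-4)
nodes=3
vcpu_per_node=4
ram_gb_per_node=15
echo "Total capacity: $((nodes * vcpu_per_node)) vCPU, $((nodes * ram_gb_per_node)) GB RAM"
# -> Total capacity: 12 vCPU, 45 GB RAM
```

Against the right-sized requests later in this guide, that leaves ample headroom for ArgoCD HA, RabbitMQ and the monitoring stack.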

Create Cluster

# Set project context
gcloud config set project camarades-net

# Create a zonal GKE cluster with Workload Identity
# (europe-west2-a is a zone, not a region; with --region, --num-nodes would apply per zone)
gcloud container clusters create syrf-cluster \
  --zone=europe-west2-a \
  --num-nodes=3 \
  --machine-type=n1-standard-4 \
  --disk-size=100 \
  --enable-autoscaling \
  --min-nodes=3 \
  --max-nodes=6 \
  --enable-autorepair \
  --enable-autoupgrade \
  --workload-pool=camarades-net.svc.id.goog \
  --enable-shielded-nodes \
  --addons=HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver

# Get cluster credentials
gcloud container clusters get-credentials syrf-cluster --zone=europe-west2-a

Verify Cluster

# Check cluster info
kubectl cluster-info

# Verify nodes
kubectl get nodes

# Expected output: 3 nodes in Ready state

Foundation Setup

1. Install Ingress Controller

# Install nginx-ingress controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  --set controller.service.type=LoadBalancer \
  --set controller.metrics.enabled=true

# Wait for external IP assignment
kubectl get svc -n ingress-nginx -w

# Note the EXTERNAL-IP for DNS configuration

2. Install cert-manager

# Install cert-manager (this manifest includes the CRDs, controller, webhook and cainjector)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# Wait for cert-manager to be ready
kubectl wait --for=condition=available --timeout=300s deployment/cert-manager -n cert-manager
kubectl wait --for=condition=available --timeout=300s deployment/cert-manager-webhook -n cert-manager

# Create Let's Encrypt ClusterIssuer
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: chris.sena@ed.ac.uk
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
EOF
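Once the issuer exists, an application Ingress only needs the cluster-issuer annotation and a tls block to get certificates automatically. A minimal sketch (the hostname, app and service names here are placeholders, not actual SyRF values):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app                # hypothetical name
  namespace: syrf-staging
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app.syrf.org.uk              # placeholder hostname
    secretName: example-app-tls    # cert-manager creates and renews this Secret
  rules:
  - host: app.syrf.org.uk
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: example-app      # hypothetical service
            port:
              number: 80
```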
3. Install External DNS (Optional)

# Create service account for External DNS
kubectl create serviceaccount external-dns -n kube-system

# Bind to Cloud DNS role (requires GCP IAM setup)
# See: https://kubernetes-sigs.github.io/external-dns/v0.13.5/tutorials/gke/

helm repo add bitnami https://charts.bitnami.com/bitnami
helm install external-dns bitnami/external-dns \
  --namespace kube-system \
  --set provider=google \
  --set google.project=camarades-net \
  --set domainFilters[0]=syrf.org.uk \
  --set policy=sync \
  --set txtOwnerId=syrf-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=external-dns

4. Deploy RabbitMQ

# Add Bitnami Helm repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

# Create namespace
kubectl create namespace rabbitmq

# Install RabbitMQ with persistence and HA
helm install rabbitmq bitnami/rabbitmq \
  --namespace rabbitmq \
  --set auth.username=rabbit \
  --set auth.password=<secure-password> \
  --set replicaCount=3 \
  --set persistence.enabled=true \
  --set persistence.size=10Gi \
  --set metrics.enabled=true

# Verify RabbitMQ is running
kubectl get pods -n rabbitmq
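Workloads in other namespaces reach the broker via its cluster DNS name. A hedged sketch of the container env wiring (the Bitnami chart stores the password in a Secret named rabbitmq under the rabbitmq-password key; the env variable names are illustrative, not what the SyRF services actually read):

```yaml
# Fragment of a pod spec (illustrative)
env:
- name: RABBITMQ_HOST
  value: rabbitmq.rabbitmq.svc.cluster.local   # <service>.<namespace>.svc.cluster.local
- name: RABBITMQ_PORT
  value: "5672"
- name: RABBITMQ_USERNAME
  value: rabbit                                # matches auth.username above
- name: RABBITMQ_PASSWORD
  valueFrom:
    secretKeyRef:
      name: rabbitmq                           # created by the Bitnami chart
      key: rabbitmq-password
```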

ArgoCD Installation

Install ArgoCD in HA Mode

# Create argocd namespace
kubectl create namespace argocd

# Install ArgoCD HA manifest
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml

# Wait for ArgoCD to be ready
kubectl wait --for=condition=available --timeout=600s deployment/argocd-server -n argocd

# Get initial admin password
ARGOCD_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
echo "ArgoCD admin password: $ARGOCD_PASSWORD"

Configure ArgoCD for Production

# Patch ArgoCD repo-server for better performance
kubectl patch deployment argocd-repo-server -n argocd --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--parallelismlimit=50"},
       {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--repo-cache-expiration=1h"}]'

# Patch ArgoCD application-controller for scalability
# (in the HA manifests the application controller is a StatefulSet, not a Deployment)
kubectl patch statefulset argocd-application-controller -n argocd --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--status-processors=50"},
       {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--operation-processors=25"}]'

# Set resource limits on all ArgoCD components
kubectl set resources deployment argocd-server -n argocd \
  --requests=cpu=100m,memory=128Mi \
  --limits=cpu=500m,memory=512Mi

kubectl set resources deployment argocd-repo-server -n argocd \
  --requests=cpu=100m,memory=256Mi \
  --limits=cpu=500m,memory=1Gi

kubectl set resources statefulset argocd-application-controller -n argocd \
  --requests=cpu=250m,memory=512Mi \
  --limits=cpu=1000m,memory=2Gi

Access ArgoCD UI

# Port forward to ArgoCD server
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Access UI: https://localhost:8080
# Username: admin
# Password: (from previous step)

# Change admin password after first login
# (--insecure is needed because the port-forwarded server presents a self-signed certificate)
argocd login localhost:8080 --insecure
argocd account update-password

Connect cluster-gitops Repository

# Via ArgoCD CLI
argocd repo add https://github.com/camaradesuk/cluster-gitops.git \
  --type git \
  --name cluster-gitops \
  --username <github-username> \
  --password <github-pat>

# Add syrf-monorepo as well
argocd repo add https://github.com/camaradesuk/syrf-monorepo.git \
  --type git \
  --name syrf-monorepo \
  --username <github-username> \
  --password <github-pat>
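With the repositories registered, each app in cluster-gitops is an ArgoCD Application pointing back at one of them. A hedged sketch of the shape such a manifest takes (the chart path, values-file layout and target revision here are assumptions, not the real repo structure):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: syrf-api-staging
  namespace: argocd
spec:
  project: staging                       # AppProject applied later in this guide
  source:
    repoURL: https://github.com/camaradesuk/cluster-gitops.git
    targetRevision: master
    path: charts/api                     # hypothetical path
    helm:
      valueFiles:
      - ../../environments/staging/api.values.yaml   # hypothetical layout
  destination:
    server: https://kubernetes.default.svc
    namespace: syrf-staging
  syncPolicy:
    automated:
      prune: true                        # delete resources removed from Git
      selfHeal: true                     # revert manual drift
```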

Secret Management Setup

Install External Secrets Operator

# CRDs are installed by the Helm chart below (installCRDs=true), so no separate CRD apply is needed

# Add External Secrets Helm repository
helm repo add external-secrets https://charts.external-secrets.io
helm repo update

# Install ESO operator
helm install external-secrets external-secrets/external-secrets \
  --namespace external-secrets-operator \
  --create-namespace \
  --set installCRDs=true

Configure Workload Identity for ESO

# Create GCP service account for ESO
gcloud iam service-accounts create eso-secret-accessor \
  --display-name "External Secrets Operator Secret Accessor"

# Grant Secret Manager Secret Accessor role
gcloud projects add-iam-policy-binding camarades-net \
  --member="serviceAccount:eso-secret-accessor@camarades-net.iam.gserviceaccount.com" \
  --role="roles/secretmanager.secretAccessor"

# Create Kubernetes service account
kubectl create serviceaccount external-secrets-sa -n external-secrets-operator

# Bind Workload Identity
gcloud iam service-accounts add-iam-policy-binding \
  eso-secret-accessor@camarades-net.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:camarades-net.svc.id.goog[external-secrets-operator/external-secrets-sa]"

# Annotate Kubernetes service account
kubectl annotate serviceaccount external-secrets-sa \
  -n external-secrets-operator \
  iam.gke.io/gcp-service-account=eso-secret-accessor@camarades-net.iam.gserviceaccount.com

Create SecretStores

# Create SecretStore for staging
cat <<EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: google-secret-manager
  namespace: syrf-staging
spec:
  provider:
    gcpsm:
      projectID: camarades-net
      auth:
        workloadIdentity:
          clusterLocation: europe-west2-a
          clusterName: syrf-cluster
          serviceAccountRef:
            name: external-secrets-sa
EOF

# Create SecretStore for production
cat <<EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: google-secret-manager
  namespace: syrf-production
spec:
  provider:
    gcpsm:
      projectID: camarades-net
      auth:
        workloadIdentity:
          clusterLocation: europe-west2-a
          clusterName: syrf-cluster
          serviceAccountRef:
            name: external-secrets-sa
EOF

Test Secret Synchronization

# Create a test secret in GSM
echo -n "test-value" | gcloud secrets create test-secret --data-file=-

# Create ExternalSecret resource
cat <<EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: test-secret
  namespace: syrf-staging
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: google-secret-manager
    kind: SecretStore
  target:
    name: test-secret
    creationPolicy: Owner
  data:
  - secretKey: value
    remoteRef:
      key: test-secret
EOF

# Verify secret was created
kubectl get secret test-secret -n syrf-staging
kubectl get secret test-secret -n syrf-staging -o jsonpath='{.data.value}' | base64 -d

# Cleanup test secret
kubectl delete externalsecret test-secret -n syrf-staging
gcloud secrets delete test-secret --quiet

Application Deployment

Deploy ArgoCD AppProjects

# Apply AppProject for staging
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/project-staging.yaml

# Apply AppProject for production
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/project-production.yaml

# Verify projects
argocd proj list

Deploy Applications

# Deploy all applications
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/api.yaml
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/project-management.yaml
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/quartz.yaml
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/web.yaml

# Check application status
argocd app list

# Sync applications (if not auto-syncing)
argocd app sync syrf-api-staging
argocd app sync syrf-project-management-staging
argocd app sync syrf-quartz-staging
argocd app sync syrf-web-staging

Verify Deployments

# Check pods in staging namespace
kubectl get pods -n syrf-staging

# Check services
kubectl get svc -n syrf-staging

# Check ingresses
kubectl get ingress -n syrf-staging

# View logs
kubectl logs -n syrf-staging -l app.kubernetes.io/name=syrf-api

Resource Optimization

Background: GKE Cluster Analysis

Based on analysis of the legacy production cluster (see gke-cluster-analysis.md):

  • Production API: Requesting 1500m CPU, using 20m (98.7% waste)
  • Staging API: Requesting 200m CPU, using 15m (92.5% waste)
  • Staging Web: Requesting 200m CPU, using 2m (99% waste)
  • Cluster-wide CPU utilization: 1-4% (10 nodes maintained for workloads that fit on 3-4)
  • Estimated savings: $150-200/month (30-60%) through right-sizing
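The waste percentages above follow directly from request vs. observed usage; a quick shell sketch to reproduce them:

```shell
# waste% = (request - usage) / request * 100, in millicores
waste() { awk -v r="$1" -v u="$2" 'BEGIN { printf "%.1f\n", (r - u) / r * 100 }'; }
waste 1500 20   # production API: 98.7
waste 200 15    # staging API:    92.5
waste 200 2     # staging web:    99.0
```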

Update cluster-gitops/environments/{staging,production}/*.values.yaml with these right-sized values:

API Service

# cluster-gitops/environments/production/api.values.yaml
resources:
  requests:
    cpu: 20m        # was: 1500m (-98.7%)
    memory: 795Mi   # was: 3Gi (-74%)
  limits:
    cpu: 100m       # allow 5x burst for peak loads
    memory: 795Mi   # Guaranteed QoS

Project Management Service

# cluster-gitops/environments/production/project-management.values.yaml
resources:
  requests:
    cpu: 50m        # conservative estimate
    memory: 512Mi   # based on similar workloads
  limits:
    cpu: 200m       # allow 4x burst
    memory: 512Mi   # Guaranteed QoS

Quartz Service

# cluster-gitops/environments/production/quartz.values.yaml
resources:
  requests:
    cpu: 30m        # background job processor
    memory: 256Mi   # moderate memory needs
  limits:
    cpu: 150m       # allow 5x burst
    memory: 256Mi   # Guaranteed QoS

Web Service (Angular/NGINX)

# cluster-gitops/environments/production/web.values.yaml
resources:
  requests:
    cpu: 2m         # was: 200m (-99%)
    memory: 7Mi     # was: 128Mi (-94.5%)
  limits:
    cpu: 10m        # allow 5x burst
    memory: 14Mi    # 2x request

Enable Vertical Pod Autoscaler (VPA)

# Enable VPA on cluster
gcloud container clusters update syrf-cluster \
  --enable-vertical-pod-autoscaling \
  --zone=europe-west2-a

# Deploy VPA for API service (recommendation mode)
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-api-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-api
  updatePolicy:
    updateMode: "Off"  # Start with recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: syrf-api
      minAllowed:
        cpu: 10m
        memory: 100Mi
      maxAllowed:
        cpu: 500m
        memory: 2Gi
EOF

# After 24-48 hours, review recommendations
kubectl describe vpa syrf-api-vpa -n syrf-production

# If recommendations look good, switch to Auto mode
kubectl patch vpa syrf-api-vpa -n syrf-production \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/updatePolicy/updateMode", "value": "Auto"}]'

Enable Cluster Autoscaler

# Autoscaling was already enabled at creation time; use this to adjust bounds on the default node pool
gcloud container clusters update syrf-cluster \
  --enable-autoscaling \
  --node-pool=default-pool \
  --min-nodes=3 \
  --max-nodes=6 \
  --zone=europe-west2-a

# On GKE the cluster autoscaler runs on the managed control plane, so there is no
# cluster-autoscaler pod to tail; review its decisions via the status ConfigMap or Cloud Logging
kubectl describe configmap cluster-autoscaler-status -n kube-system

Monitoring & Alerts

Enable GKE Cost Allocation

gcloud container clusters update syrf-cluster \
  --enable-cost-allocation \
  --zone=europe-west2-a

Install Prometheus & Grafana

# Add Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (Prometheus + Grafana + AlertManager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=15d \
  --set grafana.adminPassword=<secure-password>

# Access Grafana
kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80
# URL: http://localhost:3000 (admin / <password>)

Configure Alerts

# Create alerting rules for low CPU utilization and pending pods.
# With kube-prometheus-stack, rules must be delivered as PrometheusRule resources
# (a plain ConfigMap is not picked up); the release label must match the Helm release name.
cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: resource-optimization
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: resource-optimization
    rules:
    - alert: NodeLowCPUUtilization
      expr: 100 - (avg by(node) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) < 20
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Node {{ \$labels.node }} has low CPU utilization"
        description: "CPU utilization is {{ \$value }}% for more than 1 hour"
    - alert: PodPending
      expr: kube_pod_status_phase{phase="Pending"} > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ \$labels.pod }} is pending"
        description: "Pod has been pending for more than 5 minutes"
EOF

Verification Checklist

Phase 1: Foundation

  • GKE cluster created with 3+ nodes
  • kubectl configured and working
  • Ingress controller running with external IP
  • cert-manager installed and ClusterIssuer created
  • External DNS configured (optional)
  • RabbitMQ running with 3 replicas

Phase 2: ArgoCD

  • ArgoCD installed in HA mode (3+ replicas)
  • ArgoCD UI accessible
  • Admin password changed
  • GitHub repositories connected
  • AppProjects created (staging, production)

Phase 3: Secret Management

  • External Secrets Operator installed
  • Workload Identity configured
  • SecretStores created (staging, production)
  • Test secret synchronization successful

Phase 4: Applications

  • All ArgoCD Applications deployed
  • Staging applications synced
  • Pods running in syrf-staging namespace
  • Ingress working with TLS certificates
  • Services accessible via domain names

Phase 5: Optimization

  • Resource requests right-sized based on GKE analysis
  • VPA enabled with recommendations reviewed
  • Cluster autoscaler enabled
  • Cost allocation enabled
  • Prometheus & Grafana deployed
  • Alerts configured

Troubleshooting

ArgoCD Can't Access Repository

  • Check repository credentials in ArgoCD
  • Verify GitHub PAT has repo scope
  • Test git clone manually

Pods Not Starting

# Check pod events
kubectl describe pod <pod-name> -n syrf-staging

# Check pod logs
kubectl logs <pod-name> -n syrf-staging

# Check node resources
kubectl top nodes

Secret Sync Failing

# Check ESO logs
kubectl logs -n external-secrets-operator deployment/external-secrets

# Check ExternalSecret status
kubectl describe externalsecret <name> -n syrf-staging

# Verify Workload Identity
gcloud iam service-accounts get-iam-policy \
  eso-secret-accessor@camarades-net.iam.gserviceaccount.com

High Resource Usage

# Check actual resource usage
kubectl top pods -n syrf-staging

# Compare to requests
kubectl describe pod <pod-name> -n syrf-staging | grep -A 5 Requests

# Review VPA recommendations
kubectl describe vpa -n syrf-staging

Next Steps

  1. Week 1: Deploy to staging, validate functionality
  2. Week 2: Fine-tune resource requests based on actual usage
  3. Week 3: Deploy to production (parallel with old cluster)
  4. Week 3-4: Monitor stability, validate with users
  5. Week 4: DNS cutover to new cluster
  6. Week 5: Decommission old cluster

Document Status: Approved for implementation
Owner: DevOps Team
Last Review: 2025-11-11