SyRF Kubernetes Cluster Setup Guide¶
This guide provides step-by-step instructions for provisioning and configuring a new GKE cluster for the SyRF platform, incorporating lessons learned from production cluster analysis.
Table of Contents¶
- Prerequisites
- Cluster Provisioning
- Foundation Setup
- ArgoCD Installation
- Secret Management Setup
- Application Deployment
- Resource Optimization
- Monitoring & Alerts
- Verification Checklist
- Troubleshooting
- Next Steps
Prerequisites¶
Required Access¶
- GCP project access (`camarades-net`)
- `gcloud` CLI installed and authenticated
- `kubectl` CLI installed
- GitHub repository access: `camaradesuk/syrf-monorepo`, `camaradesuk/cluster-gitops`
- GitHub Personal Access Token with `repo` scope
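Before starting, it is worth confirming the required CLIs are actually on your PATH. A minimal preflight sketch (the `check_tools` helper is illustrative, not part of any official tooling):

```shell
# Fail fast if a required CLI is missing from PATH.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return $missing
}

# Preflight for this guide:
check_tools gcloud kubectl helm || echo "install the tools above before continuing"
```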
Required Knowledge¶
- Kubernetes fundamentals
- Helm chart basics
- GitOps concepts
- GCP services (GKE, GSM, Cloud DNS)
Cost Considerations¶
IMPORTANT: Based on production cluster analysis (see gke-cluster-analysis.md), the legacy cluster is significantly overprovisioned with estimated savings of $150-200/month (30-60%) achievable through right-sizing.
Recommended Dual-Run Period: 3-5 days to minimize costs while ensuring stability.
Cluster Provisioning¶
Cluster Specifications¶
Based on ArgoCD HA requirements and right-sized resource analysis:
| Specification | Value | Rationale |
|---|---|---|
| Provider | GKE (Google Kubernetes Engine) | Team familiarity, GSM integration |
| Project | `camarades-net` | Existing GCP project |
| Zone | `europe-west2-a` (London, in the `europe-west2` region) | Data residency, low latency |
| Kubernetes Version | 1.28+ (latest stable) | Long-term support, feature compatibility |
| Node Count | 3-4 nodes minimum | ArgoCD HA (pod anti-affinity) + workloads |
| Machine Type | `n1-standard-4` or `e2-standard-4` | 4 vCPU, 15-16GB RAM per node |
| Preemptible | Mixed (2 standard + 1-2 preemptible) | Cost savings with reliability |
| Workload Identity | Enabled | Secure GSM access for ESO |
| Auto-upgrade | Enabled | Security patches, K8s updates |
| Auto-repair | Enabled | Node health management |
Create Cluster¶
# Set project context
gcloud config set project camarades-net
# Create GKE cluster with Workload Identity
gcloud container clusters create syrf-cluster \
--zone=europe-west2-a \
--num-nodes=3 \
--machine-type=n1-standard-4 \
--disk-size=100 \
--enable-autoscaling \
--min-nodes=3 \
--max-nodes=6 \
--enable-autorepair \
--enable-autoupgrade \
--workload-pool=camarades-net.svc.id.goog \
--enable-shielded-nodes \
--addons=HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver
# Get cluster credentials
gcloud container clusters get-credentials syrf-cluster --zone=europe-west2-a
Verify Cluster¶
# Check cluster info
kubectl cluster-info
# Verify nodes
kubectl get nodes
# Expected output: 3 nodes in Ready state
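If you want to verify readiness in a script rather than by eye, the STATUS column can be counted. A small sketch (the `count_ready` helper and the sample output are illustrative; on a live cluster, pipe `kubectl get nodes --no-headers` into it):

```shell
# count_ready: count lines whose STATUS column is exactly "Ready"
# (input in the shape of `kubectl get nodes --no-headers`)
count_ready() {
  awk '$2 == "Ready" { n++ } END { print n+0 }'
}

# Example with captured output for illustration:
sample='node-1   Ready      <none>   5m   v1.28.3
node-2   Ready      <none>   5m   v1.28.3
node-3   NotReady   <none>   5m   v1.28.3'
printf '%s\n' "$sample" | count_ready   # prints 2
```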
Foundation Setup¶
1. Install Ingress Controller¶
# Install nginx-ingress controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--create-namespace \
--set controller.service.type=LoadBalancer \
--set controller.metrics.enabled=true
# Wait for external IP assignment
kubectl get svc -n ingress-nginx -w
# Note the EXTERNAL-IP for DNS configuration
2. Install cert-manager¶
# Install cert-manager (the release manifest bundles the CRDs, controller, and webhook)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
# Wait for cert-manager to be ready
kubectl wait --for=condition=available --timeout=300s deployment/cert-manager -n cert-manager
kubectl wait --for=condition=available --timeout=300s deployment/cert-manager-webhook -n cert-manager
# Create Let's Encrypt ClusterIssuer
cat <<EOF | kubectl apply -f -
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: chris.sena@ed.ac.uk
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
EOF
3. Install External DNS (Optional but Recommended)¶
# Create service account for External DNS
kubectl create serviceaccount external-dns -n kube-system
# Bind to Cloud DNS role (requires GCP IAM setup)
# See: https://kubernetes-sigs.github.io/external-dns/v0.13.5/tutorials/gke/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install external-dns bitnami/external-dns \
--namespace kube-system \
--set provider=google \
--set google.project=camarades-net \
--set domainFilters[0]=syrf.org.uk \
--set policy=sync \
--set txtOwnerId=syrf-cluster
4. Deploy RabbitMQ¶
# Add Bitnami Helm repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
# Create namespace
kubectl create namespace rabbitmq
# Install RabbitMQ with persistence and HA
helm install rabbitmq bitnami/rabbitmq \
--namespace rabbitmq \
--set auth.username=rabbit \
--set auth.password=<secure-password> \
--set replicaCount=3 \
--set persistence.enabled=true \
--set persistence.size=10Gi \
--set metrics.enabled=true
# Verify RabbitMQ is running
kubectl get pods -n rabbitmq
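Services connect to the broker via its in-cluster DNS name, which for the Bitnami chart follows `<release>.<namespace>.svc.cluster.local`. A sketch of building the AMQP URL (the `amqp_url` helper is illustrative; the password placeholder is whatever you set above):

```shell
# Build the in-cluster AMQP URL for the Bitnami release installed above.
# Service DNS follows <release>.<namespace>.svc.cluster.local on port 5672.
amqp_url() {
  user=$1 pass=$2 release=$3 namespace=$4
  echo "amqp://${user}:${pass}@${release}.${namespace}.svc.cluster.local:5672"
}

amqp_url rabbit '<secure-password>' rabbitmq rabbitmq
# -> amqp://rabbit:<secure-password>@rabbitmq.rabbitmq.svc.cluster.local:5672
```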
ArgoCD Installation¶
Install ArgoCD in HA Mode¶
# Create argocd namespace
kubectl create namespace argocd
# Install ArgoCD HA manifest
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml
# Wait for ArgoCD to be ready
kubectl wait --for=condition=available --timeout=600s deployment/argocd-server -n argocd
# Get initial admin password
ARGOCD_PASSWORD=$(kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d)
echo "ArgoCD admin password: $ARGOCD_PASSWORD"
Configure ArgoCD for Production¶
# Patch ArgoCD repo-server for better performance
kubectl patch deployment argocd-repo-server -n argocd --type='json' \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--parallelismlimit=50"},
{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--repo-cache-expiration=1h"}]'
# Patch ArgoCD application-controller for scalability
# (in the HA manifest the controller is a StatefulSet, not a Deployment)
kubectl patch statefulset argocd-application-controller -n argocd --type='json' \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--status-processors=50"},
{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--operation-processors=25"}]'
# Set resource limits on all ArgoCD components
kubectl set resources deployment argocd-server -n argocd \
--requests=cpu=100m,memory=128Mi \
--limits=cpu=500m,memory=512Mi
kubectl set resources deployment argocd-repo-server -n argocd \
--requests=cpu=100m,memory=256Mi \
--limits=cpu=500m,memory=1Gi
kubectl set resources statefulset argocd-application-controller -n argocd \
--requests=cpu=250m,memory=512Mi \
--limits=cpu=1000m,memory=2Gi
Access ArgoCD UI¶
# Port forward to ArgoCD server
kubectl port-forward svc/argocd-server -n argocd 8080:443
# Access UI: https://localhost:8080
# Username: admin
# Password: (from previous step)
# Change admin password after first login
argocd login localhost:8080
argocd account update-password
Connect cluster-gitops Repository¶
# Via ArgoCD CLI
argocd repo add https://github.com/camaradesuk/cluster-gitops.git \
--type git \
--name cluster-gitops \
--username <github-username> \
--password <github-pat>
# Add syrf-monorepo as well
argocd repo add https://github.com/camaradesuk/syrf-monorepo.git \
--type git \
--name syrf-monorepo \
--username <github-username> \
--password <github-pat>
Secret Management Setup¶
Install External Secrets Operator¶
# Note: no separate CRD apply is needed; the Helm chart below installs the CRDs (installCRDs=true)
# Add External Secrets Helm repository
helm repo add external-secrets https://charts.external-secrets.io
helm repo update
# Install ESO operator
helm install external-secrets external-secrets/external-secrets \
--namespace external-secrets-operator \
--create-namespace \
--set installCRDs=true
Configure Workload Identity for ESO¶
# Create GCP service account for ESO
gcloud iam service-accounts create eso-secret-accessor \
--display-name "External Secrets Operator Secret Accessor"
# Grant Secret Manager Secret Accessor role
gcloud projects add-iam-policy-binding camarades-net \
--member="serviceAccount:eso-secret-accessor@camarades-net.iam.gserviceaccount.com" \
--role="roles/secretmanager.secretAccessor"
# Create Kubernetes service account
kubectl create serviceaccount external-secrets-sa -n external-secrets-operator
# Bind Workload Identity
gcloud iam service-accounts add-iam-policy-binding \
eso-secret-accessor@camarades-net.iam.gserviceaccount.com \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:camarades-net.svc.id.goog[external-secrets-operator/external-secrets-sa]"
# Annotate Kubernetes service account
kubectl annotate serviceaccount external-secrets-sa \
-n external-secrets-operator \
iam.gke.io/gcp-service-account=eso-secret-accessor@camarades-net.iam.gserviceaccount.com
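The Workload Identity member string in the binding above is easy to get wrong. A sketch of its structure (the `wi_member` helper is illustrative; the format `serviceAccount:<project>.svc.id.goog[<namespace>/<ksa>]` is the one used in the command above):

```shell
# Compose the Workload Identity member string used in the IAM binding:
#   serviceAccount:<project>.svc.id.goog[<k8s-namespace>/<k8s-serviceaccount>]
wi_member() {
  project=$1 namespace=$2 ksa=$3
  echo "serviceAccount:${project}.svc.id.goog[${namespace}/${ksa}]"
}

wi_member camarades-net external-secrets-operator external-secrets-sa
# -> serviceAccount:camarades-net.svc.id.goog[external-secrets-operator/external-secrets-sa]
```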
Create SecretStores¶
# Create SecretStore for staging
cat <<EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: google-secret-manager
  namespace: syrf-staging
spec:
  provider:
    gcpsm:
      projectID: camarades-net
      auth:
        workloadIdentity:
          clusterLocation: europe-west2-a
          clusterName: syrf-cluster
          serviceAccountRef:
            name: external-secrets-sa
EOF
# Create SecretStore for production
cat <<EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: google-secret-manager
  namespace: syrf-production
spec:
  provider:
    gcpsm:
      projectID: camarades-net
      auth:
        workloadIdentity:
          clusterLocation: europe-west2-a
          clusterName: syrf-cluster
          serviceAccountRef:
            name: external-secrets-sa
EOF
Test Secret Synchronization¶
# Create a test secret in GSM
echo -n "test-value" | gcloud secrets create test-secret --data-file=-
# Create ExternalSecret resource
cat <<EOF | kubectl apply -f -
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: test-secret
  namespace: syrf-staging
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: google-secret-manager
    kind: SecretStore
  target:
    name: test-secret
    creationPolicy: Owner
  data:
  - secretKey: value
    remoteRef:
      key: test-secret
EOF
# Verify secret was created
kubectl get secret test-secret -n syrf-staging
kubectl get secret test-secret -n syrf-staging -o jsonpath='{.data.value}' | base64 -d
# Cleanup test secret
kubectl delete externalsecret test-secret -n syrf-staging
gcloud secrets delete test-secret --quiet
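The `jsonpath` + `base64 -d` pattern used in the verification above simply reverses the base64 encoding Kubernetes applies to Secret data, and can be sanity-checked locally without a cluster:

```shell
# Kubernetes stores Secret values base64-encoded; the jsonpath + `base64 -d`
# pipeline above reverses exactly this encoding.
encoded=$(printf '%s' "test-value" | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)
echo "$decoded"   # prints: test-value
```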
Application Deployment¶
Deploy ArgoCD AppProjects¶
# Apply AppProject for staging
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/project-staging.yaml
# Apply AppProject for production
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/project-production.yaml
# Verify projects
argocd proj list
Deploy Applications¶
# Deploy all applications
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/api.yaml
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/project-management.yaml
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/quartz.yaml
kubectl apply -f https://raw.githubusercontent.com/camaradesuk/cluster-gitops/master/apps/web.yaml
# Check application status
argocd app list
# Sync applications (if not auto-syncing)
argocd app sync syrf-api-staging
argocd app sync syrf-project-management-staging
argocd app sync syrf-quartz-staging
argocd app sync syrf-web-staging
Verify Deployments¶
# Check pods in staging namespace
kubectl get pods -n syrf-staging
# Check services
kubectl get svc -n syrf-staging
# Check ingresses
kubectl get ingress -n syrf-staging
# View logs
kubectl logs -n syrf-staging -l app.kubernetes.io/name=syrf-api
Resource Optimization¶
Background: GKE Cluster Analysis¶
Based on analysis of the legacy production cluster (see gke-cluster-analysis.md):
- Production API: Requesting 1500m CPU, using 20m (98.7% waste)
- Staging API: Requesting 200m CPU, using 15m (92.5% waste)
- Staging Web: Requesting 200m CPU, using 2m (99% waste)
- Cluster-wide CPU utilization: 1-4% (10 nodes maintained for workloads that fit on 3-4)
- Estimated savings: $150-200/month (30-60%) through right-sizing
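The waste figures above follow directly from requested vs. observed millicores, e.g. (1500 - 20) / 1500 ≈ 98.7%. A one-liner to reproduce them (the `waste_pct` helper is illustrative):

```shell
# Percentage of a CPU request left unused, from requested vs observed millicores.
# awk handles the floating-point division.
waste_pct() {
  awk -v req="$1" -v used="$2" 'BEGIN { printf "%.1f\n", (req - used) * 100 / req }'
}

waste_pct 1500 20   # production API -> 98.7
waste_pct 200 15    # staging API    -> 92.5
waste_pct 200 2     # staging web    -> 99.0
```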
Recommended Resource Values¶
Update cluster-gitops/environments/{staging,production}/*.values.yaml with these right-sized values:
API Service¶
# cluster-gitops/environments/production/api.values.yaml
resources:
  requests:
    cpu: 20m       # was: 1500m (-98.7%)
    memory: 795Mi  # was: 3Gi (-74%)
  limits:
    cpu: 100m      # allow 5x burst for peak loads
    memory: 795Mi  # memory limit equals request (no memory overcommit)
Project Management Service¶
# cluster-gitops/environments/production/project-management.values.yaml
resources:
  requests:
    cpu: 50m       # conservative estimate
    memory: 512Mi  # based on similar workloads
  limits:
    cpu: 200m      # allow 4x burst
    memory: 512Mi  # memory limit equals request (no memory overcommit)
Quartz Service¶
# cluster-gitops/environments/production/quartz.values.yaml
resources:
  requests:
    cpu: 30m       # background job processor
    memory: 256Mi  # moderate memory needs
  limits:
    cpu: 150m      # allow 5x burst
    memory: 256Mi  # memory limit equals request (no memory overcommit)
Web Service (Angular/NGINX)¶
# cluster-gitops/environments/production/web.values.yaml
resources:
  requests:
    cpu: 2m       # was: 200m (-99%)
    memory: 7Mi   # was: 128Mi (-94.5%)
  limits:
    cpu: 10m      # allow 5x burst
    memory: 14Mi  # 2x request
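Note that Kubernetes grants Guaranteed QoS only when every container's requests equal its limits for both CPU and memory; with the CPU burst headroom above, these pods land in the Burstable class (memory is still protected because the memory limit equals the request). A tiny illustration of the rule (the `qos_class` helper is a simplified, single-container sketch, not the full kubelet logic):

```shell
# Simplified single-container QoS rule: Guaranteed only if requests == limits
# for both CPU and memory; otherwise (with any request set) Burstable.
qos_class() {
  cpu_req=$1 cpu_lim=$2 mem_req=$3 mem_lim=$4
  if [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 20m 100m 795Mi 795Mi   # API values above -> Burstable
qos_class 100m 100m 795Mi 795Mi  # equal requests/limits -> Guaranteed
```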
Enable Vertical Pod Autoscaler (VPA)¶
# Enable VPA on cluster
gcloud container clusters update syrf-cluster \
--enable-vertical-pod-autoscaling \
--zone=europe-west2-a
# Deploy VPA for API service (recommendation mode)
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-api-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-api
  updatePolicy:
    updateMode: "Off"  # Start with recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: syrf-api
      minAllowed:
        cpu: 10m
        memory: 100Mi
      maxAllowed:
        cpu: 500m
        memory: 2Gi
EOF
# After 24-48 hours, review recommendations
kubectl describe vpa syrf-api-vpa -n syrf-production
# If recommendations look good, switch to Auto mode
kubectl patch vpa syrf-api-vpa -n syrf-production \
--type='json' \
-p='[{"op": "replace", "path": "/spec/updatePolicy/updateMode", "value": "Auto"}]'
Enable Cluster Autoscaler¶
# Enable autoscaling on default node pool
gcloud container clusters update syrf-cluster \
--enable-autoscaling \
--node-pool=default-pool \
--min-nodes=3 \
--max-nodes=6 \
--zone=europe-west2-a
# Monitor autoscaler activity. On GKE the cluster autoscaler runs on the
# control plane, so there is no cluster-autoscaler deployment in kube-system;
# review its decisions via scaling events (or the cluster-autoscaler-visibility
# logs in Cloud Logging)
kubectl get events -A | grep -i "TriggeredScaleUp\|ScaleDown\|autoscaler"
Monitoring & Alerts¶
Enable GKE Cost Allocation¶
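This section's command is not spelled out above; a sketch, assuming GKE's `--enable-cost-allocation` flag (verify the flag name against `gcloud container clusters update --help` for your gcloud version). The snippet prints the command by default and only executes it when `APPLY=1`:

```shell
# Build (and by default just print) the cost-allocation update command,
# which surfaces per-namespace/workload costs in Cloud Billing exports.
# Flag name assumed from GKE docs; run with APPLY=1 to execute.
cmd="gcloud container clusters update syrf-cluster --enable-cost-allocation --zone=europe-west2-a"
if [ "${APPLY:-0}" = "1" ]; then
  $cmd
else
  echo "$cmd"
fi
```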
Install Prometheus & Grafana¶
# Add Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (Prometheus + Grafana + AlertManager)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=15d \
--set grafana.adminPassword=<secure-password>
# Access Grafana
kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80
# URL: http://localhost:3000 (admin / <password>)
Configure Alerts¶
# Create alerts via a PrometheusRule, which the kube-prometheus-stack
# operator discovers automatically (a plain ConfigMap would be ignored;
# the release label below must match your Helm release name)
cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: resource-optimization
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: resource-optimization
    rules:
    - alert: NodeLowCPUUtilization
      expr: 100 - (avg by(node) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) < 20
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Node {{ \$labels.node }} has low CPU utilization"
        description: "CPU utilization is {{ \$value }}% for more than 1 hour"
    - alert: PodPending
      expr: kube_pod_status_phase{phase="Pending"} > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ \$labels.pod }} is pending"
        description: "Pod has been pending for more than 5 minutes"
EOF
Verification Checklist¶
Phase 1: Foundation¶
- GKE cluster created with 3+ nodes
- `kubectl` configured and working
- Ingress controller running with external IP
- cert-manager installed and ClusterIssuer created
- External DNS configured (optional)
- RabbitMQ running with 3 replicas
Phase 2: ArgoCD¶
- ArgoCD installed in HA mode (3+ replicas)
- ArgoCD UI accessible
- Admin password changed
- GitHub repositories connected
- AppProjects created (staging, production)
Phase 3: Secret Management¶
- External Secrets Operator installed
- Workload Identity configured
- SecretStores created (staging, production)
- Test secret synchronization successful
Phase 4: Applications¶
- All ArgoCD Applications deployed
- Staging applications synced
- Pods running in syrf-staging namespace
- Ingress working with TLS certificates
- Services accessible via domain names
Phase 5: Optimization¶
- Resource requests right-sized based on GKE analysis
- VPA enabled with recommendations reviewed
- Cluster autoscaler enabled
- Cost allocation enabled
- Prometheus & Grafana deployed
- Alerts configured
Troubleshooting¶
ArgoCD Can't Access Repository¶
- Check repository credentials in ArgoCD
- Verify GitHub PAT has `repo` scope
- Test `git clone` manually
Pods Not Starting¶
# Check pod events
kubectl describe pod <pod-name> -n syrf-staging
# Check pod logs
kubectl logs <pod-name> -n syrf-staging
# Check node resources
kubectl top nodes
Secret Sync Failing¶
# Check ESO logs
kubectl logs -n external-secrets-operator deployment/external-secrets
# Check ExternalSecret status
kubectl describe externalsecret <name> -n syrf-staging
# Verify Workload Identity
gcloud iam service-accounts get-iam-policy \
eso-secret-accessor@camarades-net.iam.gserviceaccount.com
High Resource Usage¶
# Check actual resource usage
kubectl top pods -n syrf-staging
# Compare to requests
kubectl describe pod <pod-name> -n syrf-staging | grep -A 5 Requests
# Review VPA recommendations
kubectl describe vpa -n syrf-staging
Next Steps¶
- Week 1: Deploy to staging, validate functionality
- Week 2: Fine-tune resource requests based on actual usage
- Week 3: Deploy to production (parallel with old cluster)
- Week 3-4: Monitor stability, validate with users
- Week 4: DNS cutover to new cluster
- Week 5: Decommission old cluster
References¶
- GKE Cluster Analysis - Detailed analysis of legacy cluster
- ADR-003: Cluster Architecture - Architectural decisions
- ArgoCD Documentation
- External Secrets Operator
- GKE Best Practices
Document Status: Approved for implementation Owner: DevOps Team Last Review: 2025-11-11