Kubernetes Resource Optimization Guide¶
This guide provides actionable steps to optimize Kubernetes resource usage for the SyRF platform, based on analysis of the legacy production cluster.
Executive Summary¶
Problem: The legacy cluster was massively overprovisioned, with 98.7% CPU waste on the production API.
Root Cause: Pod resource requests were set far higher than actual usage, forcing the scheduler to keep unnecessary nodes alive.
Solution: Right-size resource requests based on actual usage, and enable VPA for continuous optimization.
Expected Savings: $150-200/month (a 30-60% cost reduction)
Part 1: Update Helm Chart Default Values¶
Purpose of Chart Defaults¶
Chart values.yaml files provide sensible defaults for development/local environments. These should be:
- Conservative (enough to run without issues)
- Not wasteful (not production-sized)
- Easily overridden by environment-specific values
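As a concrete illustration of the override flow, an environment file only needs to carry the keys that differ from the chart defaults; everything else falls through to `values.yaml`. The path and numbers below match the staging values proposed in Part 2:

```yaml
# cluster-gitops/environments/staging/api.values.yaml
# Only the overridden keys are listed; all other settings fall back
# to the chart's values.yaml defaults.
resources:
  requests:
    cpu: 15m
    memory: 239Mi
  limits:
    cpu: 50m
    memory: 239Mi
```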
Current Chart Values vs Recommendations¶
| Service | Current Requests | Current Limits | Recommended Requests | Recommended Limits |
|---|---|---|---|---|
| API | 200m CPU, 1Gi | 400m CPU, 3Gi | 50m CPU, 256Mi | 200m CPU, 512Mi |
| Project Management | 500m CPU, 1Gi | 700m CPU, 3Gi | 75m CPU, 384Mi | 300m CPU, 768Mi |
| Quartz | 200m CPU, 128Mi | 400m CPU, 256Mi | 50m CPU, 192Mi | 200m CPU, 384Mi |
| Web | 200m CPU, 128Mi | 400m CPU, 256Mi | 5m CPU, 32Mi | 20m CPU, 64Mi |
Proposed Chart Value Updates¶
1. API Service (src/services/api/charts/api/values.yaml)¶
Current (lines 180-186):
Proposed:
```yaml
resources:
  limits:
    cpu: 200m       # Reduced from 400m
    memory: 512Mi   # Reduced from 3Gi
  requests:
    cpu: 50m        # Reduced from 200m
    memory: 256Mi   # Reduced from 1Gi
```
Rationale:
- Legacy staging API: requesting 200m, using 15m (92.5% waste)
- New defaults suitable for dev/staging
- Production will override with even lower values (20m CPU)
2. Project Management Service (src/services/project-management/charts/project-management/values.yaml)¶
Current (lines 150-156):
Proposed:
```yaml
resources:
  limits:
    cpu: 300m       # Reduced from 700m
    memory: 768Mi   # Reduced from 3Gi
  requests:
    cpu: 75m        # Reduced from 500m
    memory: 384Mi   # Reduced from 1Gi
```
Rationale:
- Similar workload to the API
- Conservative estimate based on typical .NET API resource usage
- Production will further optimize based on actual metrics
3. Quartz Service (src/services/quartz/charts/quartz/values.yaml)¶
Current (lines 142-148):
Proposed:
```yaml
resources:
  limits:
    cpu: 200m       # Reduced from 400m
    memory: 384Mi   # Increased from 256Mi (job processor needs memory)
  requests:
    cpu: 50m        # Reduced from 200m
    memory: 192Mi   # Increased from 128Mi
```
Rationale:
- Background job processor; CPU usage is typically low between jobs
- Memory needs are moderate for job state
4. Web Service (src/services/web/charts/syrf-web/values.yaml)¶
Current (lines 145-151):
Proposed:
```yaml
resources:
  limits:
    cpu: 20m        # Reduced from 400m
    memory: 64Mi    # Reduced from 256Mi
  requests:
    cpu: 5m         # Reduced from 200m
    memory: 32Mi    # Reduced from 128Mi
```
Rationale:
- Legacy staging web: requesting 200m, using 2m (99% waste)
- Angular static site served by NGINX; minimal resource requirements
- Production will use even lower values (2m CPU, 7Mi memory)
Part 2: Environment-Specific Overrides¶
Staging Environment Values¶
These go in cluster-gitops/environments/staging/*.values.yaml and override chart defaults.
API (cluster-gitops/environments/staging/api.values.yaml)¶
```yaml
resources:
  requests:
    cpu: 15m        # Based on actual usage (was using ~15m)
    memory: 239Mi   # Based on GKE analysis
  limits:
    cpu: 50m        # Allow ~3x burst
    memory: 239Mi   # Equal to request (no memory overcommit)
```

Note: because the CPU limit exceeds the request, these pods are Burstable QoS, not Guaranteed; setting the memory limit equal to the request still prevents memory overcommit.
Project Management (cluster-gitops/environments/staging/project-management.values.yaml)¶
```yaml
resources:
  requests:
    cpu: 40m        # Conservative estimate
    memory: 384Mi   # Based on similar workloads
  limits:
    cpu: 150m       # Allow 3-4x burst
    memory: 384Mi   # Equal to request (no memory overcommit)
```
Quartz (cluster-gitops/environments/staging/quartz.values.yaml)¶
```yaml
resources:
  requests:
    cpu: 30m        # Background processor
    memory: 192Mi   # Moderate memory needs
  limits:
    cpu: 120m       # Allow 4x burst
    memory: 192Mi   # Equal to request (no memory overcommit)
```
Web (cluster-gitops/environments/staging/web.values.yaml)¶
```yaml
resources:
  requests:
    cpu: 2m         # Based on actual usage (was using ~2m)
    memory: 7Mi     # Based on GKE analysis
  limits:
    cpu: 10m        # Allow 5x burst
    memory: 14Mi    # 2x request
```
Production Environment Values¶
These go in cluster-gitops/environments/production/*.values.yaml.
API (cluster-gitops/environments/production/api.values.yaml)¶
```yaml
resources:
  requests:
    cpu: 20m        # Legacy: 1500m → 20m (98.7% reduction)
    memory: 795Mi   # Legacy: 3Gi → 795Mi (74% reduction)
  limits:
    cpu: 100m       # Allow 5x burst for peak loads
    memory: 795Mi   # Equal to request (no memory overcommit)
```
Impact Analysis:
- Each pod was reserving 1.5 CPU cores but using only 0.02 cores
- This single deployment blocked resources equivalent to ~4 e2-standard-2 nodes
- It requested 75x more CPU than needed
Project Management (cluster-gitops/environments/production/project-management.values.yaml)¶
```yaml
resources:
  requests:
    cpu: 50m        # Conservative estimate (no legacy data)
    memory: 512Mi   # Based on similar workloads
  limits:
    cpu: 200m       # Allow 4x burst
    memory: 512Mi   # Equal to request (no memory overcommit)
```
Quartz (cluster-gitops/environments/production/quartz.values.yaml)¶
```yaml
resources:
  requests:
    cpu: 30m        # Background job processor
    memory: 256Mi   # Moderate memory needs
  limits:
    cpu: 150m       # Allow 5x burst
    memory: 256Mi   # Equal to request (no memory overcommit)
```
Web (cluster-gitops/environments/production/web.values.yaml)¶
```yaml
resources:
  requests:
    cpu: 2m         # Legacy: 200m → 2m (99% reduction)
    memory: 7Mi     # Legacy: 128Mi → 7Mi (94.5% reduction)
  limits:
    cpu: 10m        # Allow 5x burst
    memory: 14Mi    # 2x request
```
Impact Analysis:
- Requesting 100x more CPU than needed (the most extreme case)
- Requesting 18x more memory than needed
- An Angular static site behind NGINX has minimal resource needs
Part 3: Testing Protocol¶
Step 1: Apply to Staging First¶
```shell
# Update chart default values (for new deployments)
cd syrf-monorepo
# Edit values.yaml files as shown above
git commit -am "chore: right-size default resource requests based on GKE analysis"
git push

# Update staging environment overrides
cd cluster-gitops
# Edit environments/staging/*.values.yaml files
git commit -am "chore(staging): optimize resource requests based on actual usage"
git push

# ArgoCD will auto-sync if enabled, or sync manually
argocd app sync syrf-api-staging
argocd app sync syrf-project-management-staging
argocd app sync syrf-quartz-staging
argocd app sync syrf-web-staging
```
Step 2: Monitor for 24-48 Hours¶
```shell
# Watch pod metrics
kubectl top pods -n syrf-staging --watch

# Check for OOMKilled or CPU throttling
kubectl get events -n syrf-staging --watch | grep -i "oom\|throttl"

# View pod resource usage vs requests
kubectl describe pods -n syrf-staging | grep -A 10 "Requests"

# Check application logs for errors
kubectl logs -f -n syrf-staging -l app.kubernetes.io/name=syrf-api
```
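A quick way to turn `kubectl top` output into waste percentages is a small awk pipeline. The snippet below inlines sample lines (the pod names and the 200m request are illustrative assumptions) so the arithmetic is visible without a cluster; in practice, pipe in `kubectl top pods -n syrf-staging --no-headers` instead:

```shell
# Estimate CPU waste per pod: compare usage (column 2 of `kubectl top pods`)
# against the request. The 200m request and the sample pods below are
# assumptions for illustration only.
report=$(
cat <<'EOF' | awk '{
  gsub(/m$/, "", $2)                    # strip the "m" millicore suffix
  req = 200                             # assumed CPU request in millicores
  printf "%s: using %sm of %dm requested (%.1f%% waste)\n",
         $1, $2, req, (req - $2) * 100 / req
}'
syrf-api-5d8f 15m 239Mi
syrf-web-7c9a 2m 7Mi
EOF
)
printf '%s\n' "$report"
```

With the sample data this reproduces the 92.5% (API) and 99% (web) waste figures cited above.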
Step 3: Verify Application Health¶
Health Checks:
- [ ] All pods running (not restarting)
- [ ] No OOMKilled events
- [ ] No CPU throttling warnings
- [ ] Application logs clean (no resource-related errors)
- [ ] API response times normal
- [ ] Web UI loads correctly
- [ ] Background jobs processing
Load Testing (optional):
```shell
# Use your existing load testing tools
# Simulate typical user traffic
# Monitor resource usage under load
```
Step 4: Promote to Production¶
```shell
# After staging validation
cd cluster-gitops
# Edit environments/production/*.values.yaml files
git commit -am "chore(production): optimize resource requests based on staging validation"
git push

# Create a PR for the production changes, get team approval, and merge

# Manually sync in ArgoCD (production requires manual sync)
argocd app sync syrf-api-production
argocd app sync syrf-project-management-production
argocd app sync syrf-quartz-production
argocd app sync syrf-web-production
```
Step 5: Monitor Production for 48 Hours¶
```shell
# Same monitoring as staging
kubectl top pods -n syrf-production --watch
kubectl get events -n syrf-production --watch
kubectl logs -f -n syrf-production -l app.kubernetes.io/name=syrf-api

# Monitor user-reported issues
# Check error rates in Sentry
# Review application performance metrics
```
Part 4: Enable Vertical Pod Autoscaler¶
Install VPA on Cluster¶
```shell
# Note: europe-west2-a is a zone, so use --zone (not --region)
gcloud container clusters update syrf-cluster \
  --enable-vertical-pod-autoscaling \
  --zone=europe-west2-a
```
Deploy VPA in Recommendation Mode¶
Start with recommendation-only mode to review suggestions before auto-applying:
API Service VPA¶
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-api-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-api
  updatePolicy:
    updateMode: "Off"   # Start with recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: syrf-api
        minAllowed:
          cpu: 10m
          memory: 100Mi
        maxAllowed:
          cpu: 500m
          memory: 2Gi
        controlledResources:
          - cpu
          - memory
```
Project Management Service VPA¶
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-pm-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-project-management
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: syrf-project-management
        minAllowed:
          cpu: 20m
          memory: 256Mi
        maxAllowed:
          cpu: 500m
          memory: 1Gi
```
Quartz Service VPA¶
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-quartz-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-quartz
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: syrf-quartz
        minAllowed:
          cpu: 10m
          memory: 128Mi
        maxAllowed:
          cpu: 300m
          memory: 512Mi
```
Web Service VPA¶
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-web-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-web
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: syrf-web
        minAllowed:
          cpu: 2m
          memory: 7Mi
        maxAllowed:
          cpu: 50m
          memory: 128Mi
```
Review VPA Recommendations¶
After 24-48 hours:
```shell
# Get VPA recommendations for API
kubectl describe vpa syrf-api-vpa -n syrf-production

# Look for the "Recommendation" section:
# - Target: recommended values
# - Lower Bound: minimum safe values
# - Upper Bound: maximum safe values

# Example output:
# Recommendation:
#   Container Recommendations:
#     Container Name:  syrf-api
#     Lower Bound:
#       Cpu:     15m
#       Memory:  600Mi
#     Target:
#       Cpu:     25m
#       Memory:  750Mi
#     Upper Bound:
#       Cpu:     100m
#       Memory:  1Gi
```
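If only the target values are needed (e.g. for scripting), a JSONPath query avoids parsing `describe` output; the field is populated once the VPA has collected enough metrics:

```shell
# Print just the recommended (target) resources for the first container
kubectl get vpa syrf-api-vpa -n syrf-production \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'
```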
Enable Auto-Updates (After Validation)¶
Once confident in recommendations:
```shell
# Switch to Auto mode for API
kubectl patch vpa syrf-api-vpa -n syrf-production \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/updatePolicy/updateMode", "value": "Auto"}]'

# Repeat for other services
kubectl patch vpa syrf-pm-vpa -n syrf-production \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/updatePolicy/updateMode", "value": "Auto"}]'
```
Note: In "Auto" mode, VPA evicts and recreates pods to apply new resource requests, which causes brief per-pod disruption. For tighter control, consider updateMode: "Initial", which applies recommendations only when pods are created.
Part 5: Enable Cluster Autoscaler¶
```shell
# Enable autoscaling on default node pool
# (europe-west2-a is a zone, so use --zone rather than --region)
gcloud container clusters update syrf-cluster \
  --enable-autoscaling \
  --node-pool=default-pool \
  --min-nodes=3 \
  --max-nodes=6 \
  --zone=europe-west2-a

# On GKE the cluster autoscaler runs on the managed control plane,
# so inspect its status ConfigMap rather than a kube-system deployment
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Check node pool sizes
watch kubectl get nodes
```
Expected Behavior¶
After right-sizing resource requests:
- Cluster autoscaler recognizes excess capacity
- Begins draining underutilized nodes
- Scales down to minimum nodes (3 total)
- Scales up automatically when pods are pending
- Scales down during off-peak hours
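Scale-down drains nodes by evicting their pods, so it is worth pairing the autoscaler with PodDisruptionBudgets that keep at least one replica of each service available. A sketch for the API (the label selector is an assumption; it must match the labels the chart actually applies):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: syrf-api-pdb
  namespace: syrf-production
spec:
  minAvailable: 1        # keep at least one API pod up during node drains
  selector:
    matchLabels:
      app.kubernetes.io/name: syrf-api   # assumed chart label
```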
Part 6: Monitoring & Alerts¶
Key Metrics Dashboard¶
Create Grafana dashboard tracking:
- Node CPU/Memory Utilization (target: 40-70%)
- Pod CPU/Memory Requests vs Usage
- Cluster Autoscaler Events
- VPA Recommendation Application Rate
- Cost per Namespace (if GKE Cost Allocation enabled)
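The "requests vs usage" panel can be driven by PromQL along these lines (assuming cAdvisor and kube-state-metrics are scraped; the metric names are the standard ones those exporters expose):

```promql
# CPU actually used, per namespace
sum by (namespace) (rate(container_cpu_usage_seconds_total{namespace=~"syrf-.*"}[5m]))

# CPU requested, per namespace
sum by (namespace) (kube_pod_container_resource_requests{resource="cpu", namespace=~"syrf-.*"})

# Ratio of usage to requests (roughly 0.4-0.7 is the goal after right-sizing)
sum by (namespace) (rate(container_cpu_usage_seconds_total{namespace=~"syrf-.*"}[5m]))
  / sum by (namespace) (kube_pod_container_resource_requests{resource="cpu", namespace=~"syrf-.*"})
```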
Alerting Rules¶
```yaml
groups:
  - name: resource-optimization
    rules:
      - alert: PodOOMKilled
        expr: kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} was OOMKilled"
          description: "Memory limit too low, increase requests"
      - alert: PodCPUThrottling
        expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} experiencing CPU throttling"
          description: "Consider increasing CPU limits"
      - alert: NodeLowUtilization
        expr: 100 - (avg by(node) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) < 20
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} has low CPU utilization"
          description: "CPU utilization is {{ $value }}% for more than 1 hour"
```
Part 7: Weekly Review Process¶
Checklist¶
- [ ] Review GCP Recommender suggestions
- [ ] Check VPA recommendations (if not in auto mode)
- [ ] Review cost trends in GCP Cost Explorer
- [ ] Check node utilization
- [ ] Adjust autoscaler thresholds if needed
Rollback Procedures¶
If Resource Changes Cause Issues¶
Symptoms of Under-Provisioning¶
- OOMKilled pods (check `kubectl get events`)
- CPU throttling warnings in logs
- Increased response times
- 502/503 errors
Quick Rollback Steps¶
1. Revert to the previous manifest.
2. Or manually increase resources on the affected deployment.
3. Monitor recovery.
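Concretely, those steps might look like this (deployment and app names follow the conventions used elsewhere in this guide; treat the exact resource values as placeholders):

```shell
# 1. Revert the offending commit and let ArgoCD sync the old manifests
git -C cluster-gitops revert HEAD --no-edit
git -C cluster-gitops push
argocd app sync syrf-api-production

# 2. Or bump resources directly while the revert propagates
kubectl -n syrf-production set resources deployment/syrf-api \
  --requests=cpu=200m,memory=1Gi --limits=cpu=400m,memory=1Gi

# 3. Watch the pods recover
kubectl -n syrf-production get pods -w
kubectl top pods -n syrf-production
```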
If Autoscaler Scales Down Too Aggressively¶
1. Adjust the minimum node count.
2. Or disable autoscaling temporarily.
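Both adjustments are node-pool updates (the min/max values here are examples; pick numbers that match the observed load):

```shell
# Raise the floor if scale-down is too aggressive
gcloud container clusters update syrf-cluster \
  --node-pool=default-pool \
  --enable-autoscaling --min-nodes=4 --max-nodes=6 \
  --zone=europe-west2-a

# Or switch autoscaling off entirely while investigating
gcloud container clusters update syrf-cluster \
  --node-pool=default-pool \
  --no-enable-autoscaling \
  --zone=europe-west2-a
```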
Success Metrics¶
Short-term (1-2 weeks)¶
- Node CPU utilization increases from 1-4% to 30-50%
- Node count reduces from 10 to 6-7 nodes (if migrating from legacy cluster)
- No increase in pod restarts or OOMKilled events
- Application response times remain stable
Medium-term (1 month)¶
- Monthly GCP bill decreases by $75-150
- Cluster autoscaler successfully scales down during off-peak
- VPA recommendations align with actual usage (validation)
- Zero production incidents related to resource constraints
Long-term (3 months)¶
- Sustained 40-60% node utilization
- Node count stabilizes at 3-4 nodes average
- Cost savings of $150-200/month achieved
- Automated optimization reduces manual intervention
References¶
- GKE Cluster Analysis - Detailed analysis of legacy cluster
- Cluster Setup Guide - Infrastructure setup instructions
- GKE Best Practices: Resource Requests and Limits
- Vertical Pod Autoscaling
- Cluster Autoscaling
- Document Status: Approved for implementation
- Owner: DevOps Team
- Last Review: 2025-11-11