Kubernetes Resource Optimization Guide

This guide provides actionable steps to optimize Kubernetes resource usage for the SyRF platform, based on analysis of the legacy production cluster.

Executive Summary

Problem: The legacy cluster was massively overprovisioned, with 98.7% CPU waste on the production API.

Root Cause: Pod resource requests were set far higher than actual usage, causing the scheduler to maintain unnecessary nodes.

Solution: Right-size resource requests based on actual usage, and enable VPA for continuous optimization.

Expected Savings: $150-200/month (30-60% cost reduction)


Part 1: Update Helm Chart Default Values

Purpose of Chart Defaults

Chart values.yaml files provide sensible defaults for development/local environments. These should be:

  • Conservative (enough to run without issues)
  • Not wasteful (not production-sized)
  • Easily overridden by environment-specific values

Current Chart Values vs Recommendations

| Service            | Current Requests | Current Limits  | Recommended Requests | Recommended Limits |
|--------------------|------------------|-----------------|----------------------|--------------------|
| API                | 200m CPU, 1Gi    | 400m CPU, 3Gi   | 50m CPU, 256Mi       | 200m CPU, 512Mi    |
| Project Management | 500m CPU, 1Gi    | 700m CPU, 3Gi   | 75m CPU, 384Mi       | 300m CPU, 768Mi    |
| Quartz             | 200m CPU, 128Mi  | 400m CPU, 256Mi | 50m CPU, 192Mi       | 200m CPU, 384Mi    |
| Web                | 200m CPU, 128Mi  | 400m CPU, 256Mi | 5m CPU, 32Mi         | 20m CPU, 64Mi      |
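
The waste percentages quoted throughout this guide follow from a simple calculation. A quick sketch reproducing them from the legacy figures (requested vs. observed CPU, in millicores):

```shell
# Waste = (1 - actual_usage / requested) * 100
waste() {
  awk -v req="$1" -v used="$2" 'BEGIN { printf "%.1f%%\n", (1 - used / req) * 100 }'
}
waste 200 15    # staging API:  200m requested, 15m used  -> 92.5%
waste 1500 20   # prod API:    1500m requested, 20m used  -> 98.7%
waste 200 2     # prod web:     200m requested,  2m used  -> 99.0%
```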

Proposed Chart Value Updates

1. API Service (src/services/api/charts/api/values.yaml)

Current (lines 180-186):

resources:
  limits:
    cpu: 400m
    memory: 3Gi
  requests:
    cpu: 200m
    memory: 1Gi

Proposed:

resources:
  limits:
    cpu: 200m        # Reduced from 400m
    memory: 512Mi    # Reduced from 3Gi
  requests:
    cpu: 50m         # Reduced from 200m
    memory: 256Mi    # Reduced from 1Gi

Rationale:

  • Legacy staging API: requesting 200m, using 15m (92.5% waste)
  • New defaults are suitable for dev/staging
  • Production will override with even lower values (20m CPU)

2. Project Management Service (src/services/project-management/charts/project-management/values.yaml)

Current (lines 150-156):

resources:
  limits:
    cpu: 700m
    memory: 3Gi
  requests:
    cpu: 500m
    memory: 1Gi

Proposed:

resources:
  limits:
    cpu: 300m        # Reduced from 700m
    memory: 768Mi    # Reduced from 3Gi
  requests:
    cpu: 75m         # Reduced from 500m
    memory: 384Mi    # Reduced from 1Gi

Rationale:

  • Similar workload to API
  • Conservative estimate based on typical .NET API resource usage
  • Production will further optimize based on actual metrics

3. Quartz Service (src/services/quartz/charts/quartz/values.yaml)

Current (lines 142-148):

resources:
  limits:
    cpu: 400m
    memory: 256Mi
  requests:
    cpu: 200m
    memory: 128Mi

Proposed:

resources:
  limits:
    cpu: 200m        # Reduced from 400m
    memory: 384Mi    # Increased from 256Mi (job processor needs memory)
  requests:
    cpu: 50m         # Reduced from 200m
    memory: 192Mi    # Increased from 128Mi

Rationale:

  • Background job processor
  • CPU usage is typically low between jobs
  • Moderate memory needed for job state

4. Web Service (src/services/web/charts/syrf-web/values.yaml)

Current (lines 145-151):

resources:
  limits:
    cpu: 400m
    memory: 256Mi
  requests:
    cpu: 200m
    memory: 128Mi

Proposed:

resources:
  limits:
    cpu: 20m         # Reduced from 400m
    memory: 64Mi     # Reduced from 256Mi
  requests:
    cpu: 5m          # Reduced from 200m
    memory: 32Mi     # Reduced from 128Mi

Rationale:

  • Legacy staging web: requesting 200m, using 2m (99% waste)
  • Angular static site served by NGINX
  • Minimal resource requirements
  • Production will use even lower values (2m CPU, 7Mi memory)


Part 2: Environment-Specific Overrides

Staging Environment Values

These go in cluster-gitops/environments/staging/*.values.yaml and override chart defaults.
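
How this layering is wired depends on the ArgoCD setup. As an illustration only (the application name, repo URLs, and revisions below are assumptions, not taken from this cluster), a multi-source Application can pull the chart from the monorepo and layer the environment file from cluster-gitops on top of the chart's values.yaml:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: syrf-api-staging                                 # illustrative name
spec:
  project: default
  sources:
    - repoURL: https://example.com/cluster-gitops.git    # assumed URL
      targetRevision: main
      ref: values                                        # referenced as $values below
    - repoURL: https://example.com/syrf-monorepo.git     # assumed URL
      targetRevision: main
      path: src/services/api/charts/api
      helm:
        valueFiles:
          - $values/environments/staging/api.values.yaml # overrides chart defaults
  destination:
    server: https://kubernetes.default.svc
    namespace: syrf-staging
```

Multi-source Applications with `$values` references require ArgoCD 2.6+; later entries in valueFiles override earlier ones and the chart's bundled values.yaml.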

API (cluster-gitops/environments/staging/api.values.yaml)

resources:
  requests:
    cpu: 15m        # Based on actual usage (was using ~15m)
    memory: 239Mi   # Based on GKE analysis
  limits:
    cpu: 50m        # Allow 3x burst
    memory: 239Mi   # Limit equals request (avoids memory overcommit)

Project Management (cluster-gitops/environments/staging/project-management.values.yaml)

resources:
  requests:
    cpu: 40m        # Conservative estimate
    memory: 384Mi   # Based on similar workloads
  limits:
    cpu: 150m       # Allow 3-4x burst
    memory: 384Mi   # Limit equals request (avoids memory overcommit)

Quartz (cluster-gitops/environments/staging/quartz.values.yaml)

resources:
  requests:
    cpu: 30m        # Background processor
    memory: 192Mi   # Moderate memory needs
  limits:
    cpu: 120m       # Allow 4x burst
    memory: 192Mi   # Limit equals request (avoids memory overcommit)

Web (cluster-gitops/environments/staging/web.values.yaml)

resources:
  requests:
    cpu: 2m         # Based on actual usage (was using ~2m)
    memory: 7Mi     # Based on GKE analysis
  limits:
    cpu: 10m        # Allow 5x burst
    memory: 14Mi    # 2x request

Production Environment Values

These go in cluster-gitops/environments/production/*.values.yaml.

API (cluster-gitops/environments/production/api.values.yaml)

resources:
  requests:
    cpu: 20m        # Legacy: 1500m → 20m (98.7% reduction)
    memory: 795Mi   # Legacy: 3Gi → 795Mi (74% reduction)
  limits:
    cpu: 100m       # Allow 5x burst for peak loads
    memory: 795Mi   # Limit equals request (avoids memory overcommit)

Impact Analysis:

  • Each pod was reserving 1.5 CPU cores but using only 0.02 cores
  • This single deployment blocked resources equivalent to ~4 e2-standard-2 nodes
  • Requesting 75x more CPU than needed

Project Management (cluster-gitops/environments/production/project-management.values.yaml)

resources:
  requests:
    cpu: 50m        # Conservative estimate (no legacy data)
    memory: 512Mi   # Based on similar workloads
  limits:
    cpu: 200m       # Allow 4x burst
    memory: 512Mi   # Limit equals request (avoids memory overcommit)

Quartz (cluster-gitops/environments/production/quartz.values.yaml)

resources:
  requests:
    cpu: 30m        # Background job processor
    memory: 256Mi   # Moderate memory needs
  limits:
    cpu: 150m       # Allow 5x burst
    memory: 256Mi   # Limit equals request (avoids memory overcommit)

Web (cluster-gitops/environments/production/web.values.yaml)

resources:
  requests:
    cpu: 2m         # Legacy: 200m → 2m (99% reduction)
    memory: 7Mi     # Legacy: 128Mi → 7Mi (94.5% reduction)
  limits:
    cpu: 10m        # Allow 5x burst
    memory: 14Mi    # 2x request

Impact Analysis:

  • Requesting 100x more CPU than needed (most extreme case)
  • Requesting 18x more memory than needed
  • Angular static site with NGINX has minimal resource needs


Part 3: Testing Protocol

Step 1: Apply to Staging First

# Update chart default values (for new deployments)
cd syrf-monorepo
# Edit values.yaml files as shown above
git commit -am "chore: right-size default resource requests based on GKE analysis"
git push

# Update staging environment overrides
cd cluster-gitops
# Edit environments/staging/*.values.yaml files
git commit -am "chore(staging): optimize resource requests based on actual usage"
git push

# ArgoCD will auto-sync if enabled, or manually sync
argocd app sync syrf-api-staging
argocd app sync syrf-project-management-staging
argocd app sync syrf-quartz-staging
argocd app sync syrf-web-staging

Step 2: Monitor for 24-48 Hours

# Watch pod metrics
kubectl top pods -n syrf-staging --watch

# Check for OOMKilled or CPU throttling
kubectl get events -n syrf-staging --watch | grep -i "oom\|throttl"

# View pod resource usage vs requests
kubectl describe pods -n syrf-staging | grep -A 10 "Requests"

# Check application logs for errors
kubectl logs -f -n syrf-staging -l app.kubernetes.io/name=syrf-api

Step 3: Verify Application Health

Health Checks:

  • [ ] All pods running (not restarting)
  • [ ] No OOMKilled events
  • [ ] No CPU throttling warnings
  • [ ] Application logs clean (no resource-related errors)
  • [ ] API response times normal
  • [ ] Web UI loads correctly
  • [ ] Background jobs processing
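
The first two checks can be scripted. A sketch that flags non-zero restart counts from `kubectl get pods` output (the sample lines below stand in for live cluster output):

```shell
# Flag pods with restarts; against the cluster, pipe real output in with:
#   kubectl get pods -n syrf-staging --no-headers | flag_restarts
flag_restarts() {
  awk '$4 > 0 { print $1 ": " $4 " restarts" }'   # column 4 is RESTARTS
}
printf '%s\n' \
  "syrf-api-6f9c-x2k  1/1  Running  0  26h" \
  "syrf-web-7d4b-p9q  1/1  Running  3  26h" | flag_restarts
# -> syrf-web-7d4b-p9q: 3 restarts
```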

Load Testing (optional):

# Use your existing load testing tools
# Simulate typical user traffic
# Monitor resource usage under load

Step 4: Promote to Production

# After staging validation
cd cluster-gitops
# Edit environments/production/*.values.yaml files
git commit -am "chore(production): optimize resource requests based on staging validation"
git push

# Create PR for production changes
# Get approval from team
# Merge PR

# Manually sync in ArgoCD (production requires manual sync)
argocd app sync syrf-api-production
argocd app sync syrf-project-management-production
argocd app sync syrf-quartz-production
argocd app sync syrf-web-production

Step 5: Monitor Production for 48 Hours

# Same monitoring as staging
kubectl top pods -n syrf-production --watch
kubectl get events -n syrf-production --watch
kubectl logs -f -n syrf-production -l app.kubernetes.io/name=syrf-api

# Monitor user-reported issues
# Check error rates in Sentry
# Review application performance metrics

Part 4: Enable Vertical Pod Autoscaler

Install VPA on Cluster

gcloud container clusters update syrf-cluster \
  --enable-vertical-pod-autoscaling \
  --zone=europe-west2-a

Deploy VPA in Recommendation Mode

Start with recommendation-only mode to review suggestions before auto-applying:

API Service VPA

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-api-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-api
  updatePolicy:
    updateMode: "Off"  # Start with recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: syrf-api
      minAllowed:
        cpu: 10m
        memory: 100Mi
      maxAllowed:
        cpu: 500m
        memory: 2Gi
      controlledResources:
        - cpu
        - memory

Project Management Service VPA

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-pm-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-project-management
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: syrf-project-management
      minAllowed:
        cpu: 20m
        memory: 256Mi
      maxAllowed:
        cpu: 500m
        memory: 1Gi

Quartz Service VPA

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-quartz-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-quartz
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: syrf-quartz
      minAllowed:
        cpu: 10m
        memory: 128Mi
      maxAllowed:
        cpu: 300m
        memory: 512Mi

Web Service VPA

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-web-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-web
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: syrf-web
      minAllowed:
        cpu: 2m
        memory: 7Mi
      maxAllowed:
        cpu: 50m
        memory: 128Mi

Review VPA Recommendations

After 24-48 hours:

# Get VPA recommendations for API
kubectl describe vpa syrf-api-vpa -n syrf-production

# Look for the "Recommendation" section:
# - Target: Recommended values
# - Lower Bound: Minimum safe values
# - Upper Bound: Maximum safe values

# Example output:
#   Recommendation:
#     Container Recommendations:
#       Container Name:  syrf-api
#       Lower Bound:
#         Cpu:     15m
#         Memory:  600Mi
#       Target:
#         Cpu:     25m
#         Memory:  750Mi
#       Upper Bound:
#         Cpu:     100m
#         Memory:  1Gi
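
To act on a Target value, one option is a drift check before committing a change to git. A sketch (the 20% threshold is an arbitrary example, not a figure from this analysis):

```shell
# Pull the current Target from the cluster with:
#   kubectl get vpa syrf-api-vpa -n syrf-production \
#     -o jsonpath='{.status.recommendation.containerRecommendations[0].target.cpu}'
# Then compare it against the request currently committed in git:
needs_update() {
  current=$1; target=$2                           # CPU in millicores
  drift=$(( (target - current) * 100 / current ))
  abs=${drift#-}                                  # absolute value of drift
  if [ "$abs" -ge 20 ]; then
    echo "update: ${current}m -> ${target}m (${drift}% drift)"
  else
    echo "keep: ${current}m (${drift}% drift)"
  fi
}
needs_update 20 25   # -> update: 20m -> 25m (25% drift)
needs_update 30 33   # -> keep: 30m (10% drift)
```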

Enable Auto-Updates (After Validation)

Once confident in recommendations:

# Switch to Auto mode for API
kubectl patch vpa syrf-api-vpa -n syrf-production \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/updatePolicy/updateMode", "value": "Auto"}]'

# Repeat for other services
kubectl patch vpa syrf-pm-vpa -n syrf-production \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/updatePolicy/updateMode", "value": "Auto"}]'

Note: VPA will evict and recreate pods to apply new resource requests, which causes brief downtime per pod (especially for single-replica deployments). For more control, updateMode: "Initial" applies recommendations only when pods are (re)created.


Part 5: Enable Cluster Autoscaler

# Enable autoscaling on default node pool
gcloud container clusters update syrf-cluster \
  --enable-autoscaling \
  --node-pool=default-pool \
  --min-nodes=3 \
  --max-nodes=6 \
  --zone=europe-west2-a

# GKE runs the cluster autoscaler on the control plane, so there is no
# cluster-autoscaler deployment in kube-system to tail; watch its events instead
kubectl get events -A --watch | grep -i "scaleup\|scaledown"

# Check node pool sizes
watch kubectl get nodes

Expected Behavior

After right-sizing resource requests:

  1. Cluster autoscaler recognizes excess capacity
  2. Begins draining underutilized nodes
  3. Scales down to minimum nodes (3 total)
  4. Scales up automatically when pods are pending
  5. Scales down during off-peak hours
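
Scale-down drains evict pods, so single-replica services will see brief gaps; for multi-replica deployments, a PodDisruptionBudget keeps the autoscaler from draining too many at once. A sketch (the selector label is an assumption; match it to your chart's labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: syrf-api-pdb
  namespace: syrf-production
spec:
  minAvailable: 1                 # keep at least one API pod up through node drains
  selector:
    matchLabels:
      app.kubernetes.io/name: syrf-api
```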

Part 6: Monitoring & Alerts

Key Metrics Dashboard

Create Grafana dashboard tracking:

  1. Node CPU/Memory Utilization (target: 40-70%)
  2. Pod CPU/Memory Requests vs Usage
  3. Cluster Autoscaler Events
  4. VPA Recommendation Application Rate
  5. Cost per Namespace (if GKE Cost Allocation enabled)
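
For metric 2, a ratio of actual usage to requests works well as a panel query. A sketch in PromQL, assuming kube-state-metrics and cAdvisor metrics are being scraped:

```promql
# CPU actually used as a fraction of CPU requested, per pod (1.0 = fully used)
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="syrf-production"}[5m]))
/
sum by (pod) (kube_pod_container_resource_requests{namespace="syrf-production", resource="cpu"})
```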

Alerting Rules

groups:
- name: resource-optimization
  rules:
  - alert: PodOOMKilled
    expr: kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.pod }} was OOMKilled"
      description: "Memory limit too low; consider raising the memory limit and request"

  - alert: PodCPUThrottling
    expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} experiencing CPU throttling"
      description: "Consider increasing CPU limits"

  - alert: NodeLowUtilization
    expr: 100 - (avg by(node) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) < 20
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.node }} has low CPU utilization"
      description: "CPU utilization is {{ $value }}% for more than 1 hour"

Part 7: Weekly Review Process

Checklist

  • Review GCP Recommender suggestions

    gcloud recommender recommendations list \
      --project=camarades-net \
      --location=europe-west2-a \
      --recommender=google.container.DiagnosisRecommender \
      --filter="primaryImpact.category=COST AND stateInfo.state=ACTIVE"
    

  • Check VPA recommendations (if not in auto mode)

    kubectl describe vpa -n syrf-production
    kubectl describe vpa -n syrf-staging
    

  • Review cost trends in GCP Cost Explorer

  • Check node utilization

    kubectl top nodes
    

  • Adjust autoscaler thresholds if needed


Rollback Procedures

If Resource Changes Cause Issues

Symptoms of Under-Provisioning

  • OOMKilled pods (check kubectl get events)
  • CPU throttling warnings in logs
  • Increased response times
  • 502/503 errors

Quick Rollback Steps

  1. Revert to the previous manifest (note: if ArgoCD auto-sync is enabled, also revert the change in Git, or it will be re-applied):

    kubectl rollout undo deployment/syrf-api -n syrf-production
    

  2. Or manually increase resources:

    kubectl set resources deployment/syrf-api -n syrf-production \
      --containers=syrf-api \
      --requests=cpu=100m,memory=1Gi \
      --limits=cpu=200m,memory=1Gi
    

  3. Monitor recovery:

    kubectl rollout status deployment/syrf-api -n syrf-production
    kubectl top pods -n syrf-production
    

If Autoscaler Scales Down Too Aggressively

  1. Adjust minimum nodes:

    gcloud container clusters update syrf-cluster \
      --node-pool=default-pool \
      --min-nodes=4 \
      --zone=europe-west2-a
    

  2. Or disable autoscaling temporarily:

    gcloud container clusters update syrf-cluster \
      --no-enable-autoscaling \
      --node-pool=default-pool \
      --zone=europe-west2-a
    


Success Metrics

Short-term (1-2 weeks)

  • Node CPU utilization increases from 1-4% to 30-50%
  • Node count reduces from 10 to 6-7 nodes (if migrating from legacy cluster)
  • No increase in pod restarts or OOMKilled events
  • Application response times remain stable

Medium-term (1 month)

  • Monthly GCP bill decreases by $75-150
  • Cluster autoscaler successfully scales down during off-peak
  • VPA recommendations align with actual usage (validation)
  • Zero production incidents related to resource constraints

Long-term (3 months)

  • Sustained 40-60% node utilization
  • Node count stabilizes at 3-4 nodes average
  • Cost savings of $150-200/month achieved
  • Automated optimization reduces manual intervention


Document Status: Approved for implementation
Owner: DevOps Team
Last Review: 2025-11-11