Kubernetes Resource Optimization Guide

This guide provides actionable steps to optimize Kubernetes resource usage for the SyRF platform, based on analysis of the legacy production cluster.

Executive Summary

Problem: The legacy cluster was massively overprovisioned, with 98.7% CPU waste on the production API.

Root Cause: Pod resource requests were set far higher than actual usage, causing the scheduler to maintain unnecessary nodes.

Solution: Right-size resource requests based on actual usage, and enable VPA for continuous optimization.

Expected Savings: $150-200/month (30-60% cost reduction)


Part 1: Update Helm Chart Default Values

Purpose of Chart Defaults

Chart values.yaml files provide sensible defaults for development/local environments. These should be:

  • Conservative (enough to run without issues)
  • Not wasteful (not production-sized)
  • Easily overridden by environment-specific values

Current Chart Values vs Recommendations

| Service            | Current Requests | Current Limits  | Recommended Requests | Recommended Limits |
|--------------------|------------------|-----------------|----------------------|--------------------|
| API                | 200m CPU, 1Gi    | 400m CPU, 3Gi   | 50m CPU, 256Mi       | 200m CPU, 512Mi    |
| Project Management | 500m CPU, 1Gi    | 700m CPU, 3Gi   | 75m CPU, 384Mi       | 300m CPU, 768Mi    |
| Quartz             | 200m CPU, 128Mi  | 400m CPU, 256Mi | 50m CPU, 192Mi       | 200m CPU, 384Mi    |
| Web                | 200m CPU, 128Mi  | 400m CPU, 256Mi | 5m CPU, 32Mi         | 20m CPU, 64Mi      |
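
The waste percentages quoted throughout this guide follow from a simple calculation. A quick sketch reproducing them from the legacy figures (requested vs. observed CPU, in millicores):

```shell
# Waste = (1 - actual_usage / requested) * 100
waste() {
  awk -v req="$1" -v used="$2" 'BEGIN { printf "%.1f%%\n", (1 - used / req) * 100 }'
}
waste 200 15    # staging API:  200m requested, 15m used  -> 92.5%
waste 1500 20   # prod API:    1500m requested, 20m used  -> 98.7%
waste 200 2     # prod web:     200m requested,  2m used  -> 99.0%
```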

Proposed Chart Value Updates

1. API Service (src/services/api/charts/api/values.yaml)

Current (lines 180-186):

resources:
  limits:
    cpu: 400m
    memory: 3Gi
  requests:
    cpu: 200m
    memory: 1Gi

Proposed:

resources:
  limits:
    cpu: 200m        # Reduced from 400m
    memory: 512Mi    # Reduced from 3Gi
  requests:
    cpu: 50m         # Reduced from 200m
    memory: 256Mi    # Reduced from 1Gi

Rationale:

  • Legacy staging API: requesting 200m, using 15m (92.5% waste)
  • New defaults are suitable for dev/staging
  • Production will override with even lower values (20m CPU)

2. Project Management Service (src/services/project-management/charts/project-management/values.yaml)

Current (lines 150-156):

resources:
  limits:
    cpu: 700m
    memory: 3Gi
  requests:
    cpu: 500m
    memory: 1Gi

Proposed:

resources:
  limits:
    cpu: 300m        # Reduced from 700m
    memory: 768Mi    # Reduced from 3Gi
  requests:
    cpu: 75m         # Reduced from 500m
    memory: 384Mi    # Reduced from 1Gi

Rationale:

  • Similar workload to API
  • Conservative estimate based on typical .NET API resource usage
  • Production will further optimize based on actual metrics

3. Quartz Service (src/services/quartz/charts/quartz/values.yaml)

Current (lines 142-148):

resources:
  limits:
    cpu: 400m
    memory: 256Mi
  requests:
    cpu: 200m
    memory: 128Mi

Proposed:

resources:
  limits:
    cpu: 200m        # Reduced from 400m
    memory: 384Mi    # Increased from 256Mi (job processor needs memory)
  requests:
    cpu: 50m         # Reduced from 200m
    memory: 192Mi    # Increased from 128Mi

Rationale:

  • Background job processor
  • CPU usage is typically low between jobs
  • Moderate memory needed for job state

4. Web Service (src/services/web/charts/syrf-web/values.yaml)

Current (lines 145-151):

resources:
  limits:
    cpu: 400m
    memory: 256Mi
  requests:
    cpu: 200m
    memory: 128Mi

Proposed:

resources:
  limits:
    cpu: 20m         # Reduced from 400m
    memory: 64Mi     # Reduced from 256Mi
  requests:
    cpu: 5m          # Reduced from 200m
    memory: 32Mi     # Reduced from 128Mi

Rationale:

  • Legacy staging web: requesting 200m, using 2m (99% waste)
  • Angular static site served by NGINX
  • Minimal resource requirements
  • Production will use even lower values (2m CPU, 7Mi memory)


Part 2: Environment-Specific Overrides

Staging Environment Values

These go in cluster-gitops/environments/staging/*.values.yaml and override chart defaults.
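
How this layering is wired depends on the ArgoCD setup. As an illustration only (the application name, repo URLs, and revisions below are assumptions, not taken from this cluster), a multi-source Application can pull the chart from the monorepo and layer the environment file from cluster-gitops on top of the chart's values.yaml:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: syrf-api-staging                                 # illustrative name
spec:
  project: default
  sources:
    - repoURL: https://example.com/cluster-gitops.git    # assumed URL
      targetRevision: main
      ref: values                                        # referenced as $values below
    - repoURL: https://example.com/syrf-monorepo.git     # assumed URL
      targetRevision: main
      path: src/services/api/charts/api
      helm:
        valueFiles:
          - $values/environments/staging/api.values.yaml # overrides chart defaults
  destination:
    server: https://kubernetes.default.svc
    namespace: syrf-staging
```

Multi-source Applications with `$values` references require ArgoCD 2.6+; later entries in valueFiles override earlier ones and the chart's bundled values.yaml.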

API (cluster-gitops/environments/staging/api.values.yaml)

resources:
  requests:
    cpu: 15m        # Based on actual usage (was using ~15m)
    memory: 239Mi   # Based on GKE analysis
  limits:
    cpu: 50m        # Allow 3x burst
    memory: 239Mi   # Limit equals request (avoids memory overcommit)

Project Management (cluster-gitops/environments/staging/project-management.values.yaml)

resources:
  requests:
    cpu: 40m        # Conservative estimate
    memory: 384Mi   # Based on similar workloads
  limits:
    cpu: 150m       # Allow 3-4x burst
    memory: 384Mi   # Limit equals request (avoids memory overcommit)

Quartz (cluster-gitops/environments/staging/quartz.values.yaml)

resources:
  requests:
    cpu: 30m        # Background processor
    memory: 192Mi   # Moderate memory needs
  limits:
    cpu: 120m       # Allow 4x burst
    memory: 192Mi   # Limit equals request (avoids memory overcommit)

Web (cluster-gitops/environments/staging/web.values.yaml)

resources:
  requests:
    cpu: 2m         # Based on actual usage (was using ~2m)
    memory: 7Mi     # Based on GKE analysis
  limits:
    cpu: 10m        # Allow 5x burst
    memory: 14Mi    # 2x request

Production Environment Values

These go in cluster-gitops/environments/production/*.values.yaml.

API (cluster-gitops/environments/production/api.values.yaml)

resources:
  requests:
    cpu: 20m        # Legacy: 1500m → 20m (98.7% reduction)
    memory: 795Mi   # Legacy: 3Gi → 795Mi (74% reduction)
  limits:
    cpu: 100m       # Allow 5x burst for peak loads
    memory: 795Mi   # Limit equals request (avoids memory overcommit)

Impact Analysis:

  • Each pod was reserving 1.5 CPU cores but using only 0.02 cores
  • This single deployment blocked resources equivalent to ~4 e2-standard-2 nodes
  • Requesting 75x more CPU than needed

Project Management (cluster-gitops/environments/production/project-management.values.yaml)

resources:
  requests:
    cpu: 50m        # Conservative estimate (no legacy data)
    memory: 512Mi   # Based on similar workloads
  limits:
    cpu: 200m       # Allow 4x burst
    memory: 512Mi   # Limit equals request (avoids memory overcommit)

Quartz (cluster-gitops/environments/production/quartz.values.yaml)

resources:
  requests:
    cpu: 30m        # Background job processor
    memory: 256Mi   # Moderate memory needs
  limits:
    cpu: 150m       # Allow 5x burst
    memory: 256Mi   # Limit equals request (avoids memory overcommit)

Web (cluster-gitops/environments/production/web.values.yaml)

resources:
  requests:
    cpu: 2m         # Legacy: 200m → 2m (99% reduction)
    memory: 7Mi     # Legacy: 128Mi → 7Mi (94.5% reduction)
  limits:
    cpu: 10m        # Allow 5x burst
    memory: 14Mi    # 2x request

Impact Analysis:

  • Requesting 100x more CPU than needed (most extreme case)
  • Requesting 18x more memory than needed
  • Angular static site with NGINX has minimal resource needs


Part 3: Testing Protocol

Step 1: Apply to Staging First

# Update chart default values (for new deployments)
cd syrf-monorepo
# Edit values.yaml files as shown above
git commit -am "chore: right-size default resource requests based on GKE analysis"
git push

# Update staging environment overrides
cd cluster-gitops
# Edit environments/staging/*.values.yaml files
git commit -am "chore(staging): optimize resource requests based on actual usage"
git push

# ArgoCD will auto-sync if enabled, or manually sync
argocd app sync syrf-api-staging
argocd app sync syrf-project-management-staging
argocd app sync syrf-quartz-staging
argocd app sync syrf-web-staging

Step 2: Monitor for 24-48 Hours

# Watch pod metrics
kubectl top pods -n syrf-staging --watch

# Check for OOMKilled or CPU throttling
kubectl get events -n syrf-staging --watch | grep -i "oom\|throttl"

# View pod resource usage vs requests
kubectl describe pods -n syrf-staging | grep -A 10 "Requests"

# Check application logs for errors
kubectl logs -f -n syrf-staging -l app.kubernetes.io/name=syrf-api

Step 3: Verify Application Health

Health Checks:

  • [ ] All pods running (not restarting)
  • [ ] No OOMKilled events
  • [ ] No CPU throttling warnings
  • [ ] Application logs clean (no resource-related errors)
  • [ ] API response times normal
  • [ ] Web UI loads correctly
  • [ ] Background jobs processing
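
The first two checks can be scripted. A sketch that flags non-zero restart counts from `kubectl get pods` output (the sample lines below stand in for live cluster output):

```shell
# Flag pods with restarts; against the cluster, pipe real output in with:
#   kubectl get pods -n syrf-staging --no-headers | flag_restarts
flag_restarts() {
  awk '$4 > 0 { print $1 ": " $4 " restarts" }'   # column 4 is RESTARTS
}
printf '%s\n' \
  "syrf-api-6f9c-x2k  1/1  Running  0  26h" \
  "syrf-web-7d4b-p9q  1/1  Running  3  26h" | flag_restarts
# -> syrf-web-7d4b-p9q: 3 restarts
```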

Load Testing (optional):

# Use your existing load testing tools
# Simulate typical user traffic
# Monitor resource usage under load

Step 4: Promote to Production

# After staging validation
cd cluster-gitops
# Edit environments/production/*.values.yaml files
git commit -am "chore(production): optimize resource requests based on staging validation"
git push

# Create PR for production changes
# Get approval from team
# Merge PR

# Manually sync in ArgoCD (production requires manual sync)
argocd app sync syrf-api-production
argocd app sync syrf-project-management-production
argocd app sync syrf-quartz-production
argocd app sync syrf-web-production

Step 5: Monitor Production for 48 Hours

# Same monitoring as staging
kubectl top pods -n syrf-production --watch
kubectl get events -n syrf-production --watch
kubectl logs -f -n syrf-production -l app.kubernetes.io/name=syrf-api

# Monitor user-reported issues
# Check error rates in Sentry
# Review application performance metrics

Part 4: Enable Vertical Pod Autoscaler

Install VPA on Cluster

gcloud container clusters update syrf-cluster \
  --enable-vertical-pod-autoscaling \
  --zone=europe-west2-a

Deploy VPA in Recommendation Mode

Start with recommendation-only mode to review suggestions before auto-applying:

API Service VPA

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-api-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-api
  updatePolicy:
    updateMode: "Off"  # Start with recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: syrf-api
      minAllowed:
        cpu: 10m
        memory: 100Mi
      maxAllowed:
        cpu: 500m
        memory: 2Gi
      controlledResources:
        - cpu
        - memory

Project Management Service VPA

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-pm-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-project-management
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: syrf-project-management
      minAllowed:
        cpu: 20m
        memory: 256Mi
      maxAllowed:
        cpu: 500m
        memory: 1Gi

Quartz Service VPA

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-quartz-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-quartz
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: syrf-quartz
      minAllowed:
        cpu: 10m
        memory: 128Mi
      maxAllowed:
        cpu: 300m
        memory: 512Mi

Web Service VPA

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: syrf-web-vpa
  namespace: syrf-production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: syrf-web
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: syrf-web
      minAllowed:
        cpu: 2m
        memory: 7Mi
      maxAllowed:
        cpu: 50m
        memory: 128Mi

Review VPA Recommendations

After 24-48 hours:

# Get VPA recommendations for API
kubectl describe vpa syrf-api-vpa -n syrf-production

# Look for the "Recommendation" section:
# - Target: Recommended values
# - Lower Bound: Minimum safe values
# - Upper Bound: Maximum safe values

# Example output:
#   Recommendation:
#     Container Recommendations:
#       Container Name:  syrf-api
#       Lower Bound:
#         Cpu:     15m
#         Memory:  600Mi
#       Target:
#         Cpu:     25m
#         Memory:  750Mi
#       Upper Bound:
#         Cpu:     100m
#         Memory:  1Gi
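
To act on a Target value, one option is a drift check before committing a change to git. A sketch (the 20% threshold is an arbitrary example, not a figure from this analysis):

```shell
# Pull the current Target from the cluster with:
#   kubectl get vpa syrf-api-vpa -n syrf-production \
#     -o jsonpath='{.status.recommendation.containerRecommendations[0].target.cpu}'
# Then compare it against the request currently committed in git:
needs_update() {
  current=$1; target=$2                           # CPU in millicores
  drift=$(( (target - current) * 100 / current ))
  abs=${drift#-}                                  # absolute value of drift
  if [ "$abs" -ge 20 ]; then
    echo "update: ${current}m -> ${target}m (${drift}% drift)"
  else
    echo "keep: ${current}m (${drift}% drift)"
  fi
}
needs_update 20 25   # -> update: 20m -> 25m (25% drift)
needs_update 30 33   # -> keep: 30m (10% drift)
```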

Enable Auto-Updates (After Validation)

Once confident in recommendations:

# Switch to Auto mode for API
kubectl patch vpa syrf-api-vpa -n syrf-production \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/updatePolicy/updateMode", "value": "Auto"}]'

# Repeat for other services
kubectl patch vpa syrf-pm-vpa -n syrf-production \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/updatePolicy/updateMode", "value": "Auto"}]'

Note: VPA will evict and recreate pods to apply new resource requests, which causes brief downtime per pod (especially for single-replica deployments). For more control, updateMode: "Initial" applies recommendations only when pods are (re)created.


Part 5: Enable Cluster Autoscaler

# Enable autoscaling on default node pool
gcloud container clusters update syrf-cluster \
  --enable-autoscaling \
  --node-pool=default-pool \
  --min-nodes=3 \
  --max-nodes=6 \
  --zone=europe-west2-a

# GKE runs the cluster autoscaler on the control plane, so there is no
# cluster-autoscaler deployment in kube-system to tail; watch its events instead
kubectl get events -A --watch | grep -i "scaleup\|scaledown"

# Check node pool sizes
watch kubectl get nodes

Expected Behavior

After right-sizing resource requests:

  1. Cluster autoscaler recognizes excess capacity
  2. Begins draining underutilized nodes
  3. Scales down to minimum nodes (3 total)
  4. Scales up automatically when pods are pending
  5. Scales down during off-peak hours
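
Scale-down drains evict pods, so single-replica services will see brief gaps; for multi-replica deployments, a PodDisruptionBudget keeps the autoscaler from draining too many at once. A sketch (the selector label is an assumption; match it to your chart's labels):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: syrf-api-pdb
  namespace: syrf-production
spec:
  minAvailable: 1                 # keep at least one API pod up through node drains
  selector:
    matchLabels:
      app.kubernetes.io/name: syrf-api
```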

Part 6: Monitoring & Alerts

Key Metrics Dashboard

Create Grafana dashboard tracking:

  1. Node CPU/Memory Utilization (target: 40-70%)
  2. Pod CPU/Memory Requests vs Usage
  3. Cluster Autoscaler Events
  4. VPA Recommendation Application Rate
  5. Cost per Namespace (if GKE Cost Allocation enabled)
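
For metric 2, a ratio of actual usage to requests works well as a panel query. A sketch in PromQL, assuming kube-state-metrics and cAdvisor metrics are being scraped:

```promql
# CPU actually used as a fraction of CPU requested, per pod (1.0 = fully used)
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="syrf-production"}[5m]))
/
sum by (pod) (kube_pod_container_resource_requests{namespace="syrf-production", resource="cpu"})
```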

Alerting Rules

groups:
- name: resource-optimization
  rules:
  - alert: PodOOMKilled
    expr: kube_pod_container_status_terminated_reason{reason="OOMKilled"} > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.pod }} was OOMKilled"
      description: "Memory limit too low; consider raising the memory limit and request"

  - alert: PodCPUThrottling
    expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} experiencing CPU throttling"
      description: "Consider increasing CPU limits"

  - alert: NodeLowUtilization
    expr: 100 - (avg by(node) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) < 20
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.node }} has low CPU utilization"
      description: "CPU utilization is {{ $value }}% for more than 1 hour"

Part 7: Weekly Review Process

Checklist

  • Review GCP Recommender suggestions

    gcloud recommender recommendations list \
      --project=camarades-net \
      --location=europe-west2-a \
      --recommender=google.container.DiagnosisRecommender \
      --filter="primaryImpact.category=COST AND stateInfo.state=ACTIVE"
    

  • Check VPA recommendations (if not in auto mode)

    kubectl describe vpa -n syrf-production
    kubectl describe vpa -n syrf-staging
    

  • Review cost trends in GCP Cost Explorer

  • Check node utilization

    kubectl top nodes
    

  • Adjust autoscaler thresholds if needed


Rollback Procedures

If Resource Changes Cause Issues

Symptoms of Under-Provisioning

  • OOMKilled pods (check kubectl get events)
  • CPU throttling warnings in logs
  • Increased response times
  • 502/503 errors

Quick Rollback Steps

  1. Revert to the previous manifest (note: if ArgoCD auto-sync is enabled, also revert the change in Git, or it will be re-applied):

    kubectl rollout undo deployment/syrf-api -n syrf-production
    

  2. Or manually increase resources:

    kubectl set resources deployment/syrf-api -n syrf-production \
      --containers=syrf-api \
      --requests=cpu=100m,memory=1Gi \
      --limits=cpu=200m,memory=1Gi
    

  3. Monitor recovery:

    kubectl rollout status deployment/syrf-api -n syrf-production
    kubectl top pods -n syrf-production
    

If Autoscaler Scales Down Too Aggressively

  1. Adjust minimum nodes:

    gcloud container clusters update syrf-cluster \
      --node-pool=default-pool \
      --min-nodes=4 \
      --zone=europe-west2-a
    

  2. Or disable autoscaling temporarily:

    gcloud container clusters update syrf-cluster \
      --no-enable-autoscaling \
      --node-pool=default-pool \
      --zone=europe-west2-a
    


Success Metrics

Short-term (1-2 weeks)

  • Node CPU utilization increases from 1-4% to 30-50%
  • Node count reduces from 10 to 6-7 nodes (if migrating from legacy cluster)
  • No increase in pod restarts or OOMKilled events
  • Application response times remain stable

Medium-term (1 month)

  • Monthly GCP bill decreases by $75-150
  • Cluster autoscaler successfully scales down during off-peak
  • VPA recommendations align with actual usage (validation)
  • Zero production incidents related to resource constraints

Long-term (3 months)

  • Sustained 40-60% node utilization
  • Node count stabilizes at 3-4 nodes average
  • Cost savings of $150-200/month achieved
  • Automated optimization reduces manual intervention


Document Status: Approved for implementation
Owner: DevOps Team
Last Review: 2025-11-11