Service Promotion Workflow¶

Guide to promoting services from staging to production using GitOps best practices.

Overview¶

The promotion workflow is the process of moving a validated service version from staging to production. This follows a progressive delivery model:

Development → Staging → Production
    ↓            ↓          ↓
  Code Merge   Auto Deploy  Manual Promotion

Promotion Philosophy¶

Key Principles:

1. Staging is automatic - Deployments to staging happen automatically after CI/CD completes
2. Production is manual - Promotions to production require explicit approval and testing
3. Version matching - Production should only run versions that have been validated in staging
4. Rollback ready - Always have a plan to roll back if issues arise
5. Communication - Notify the team of production deployments

Prerequisites¶

Service successfully deployed and tested in staging
Access to cluster-gitops repository
Production deployment checklist completed
Stakeholder approval (for major changes)

Promotion Process¶

Step 1: Validate in Staging¶

Before promoting, ensure the service is stable in staging:

# 1. Check pod status
kubectl get pods -n syrf-staging -l app=api

# 2. Check recent logs for errors
kubectl logs -n syrf-staging -l app=api --tail=200 | grep -i error

# 3. Verify health endpoints (if configured)
curl -fsS https://staging-api.syrf.org.uk/health || echo "Health check failed or no health endpoint configured"

# 4. Check current version
yq eval '.service.chartTag' syrf/environments/staging/api/config.yaml

# 5. Check ArgoCD sync status
argocd app get api-staging

Staging Validation Checklist:

- [ ] Pods are running and healthy
- [ ] No error logs or warnings
- [ ] Health endpoint returns 200 OK
- [ ] Smoke tests pass (if available)
- [ ] No performance degradation
- [ ] Feature works as expected
- [ ] Database migrations successful (if any)
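The log check in step 2 can be turned into a pass/fail gate for scripted validation. The following is a minimal sketch, not part of the existing workflow: the function names `count_error_lines` and `staging_log_gate` and the default threshold of zero are assumptions made for illustration.

```shell
# Count lines on stdin that mention "error" (case-insensitive),
# mirroring the grep used in step 2.
count_error_lines() {
  grep -ci 'error' || true   # grep -c prints 0 (and exits 1) when nothing matches
}

# Succeed only if the number of error lines is at or below the allowed maximum.
# $1: maximum tolerated error lines (hypothetical default: 0)
staging_log_gate() {
  max_errors=${1:-0}
  found=$(count_error_lines)
  echo "error lines: $found (allowed: $max_errors)"
  [ "$found" -le "$max_errors" ]
}

# Usage (assumes kubectl access to the staging cluster):
#   kubectl logs -n syrf-staging -l app=api --tail=200 | staging_log_gate
```

A plain substring match also counts harmless lines such as "0 errors found", so the result is a prompt to read the logs, not a substitute for doing so.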

Step 2: Promote to Production¶

Updated 2025-11-16: Production promotion is automated up to PR creation, with a manual review gate at PR merge

Production promotion happens automatically after successful staging deployment:

Trigger: After staging promotion succeeds in CI/CD workflow
Job: promote-to-production job starts automatically
Action: Copies all service configs from staging to production
Updates: Sets envName: production and keeps chartTag from staging
Creates PR: Labeled requires-review with review checklist
Manual Gate: Administrator must review and merge PR manually
CI/CD Completes: Workflow shows green checkmark after PR creation
Deployment: After PR merge, ArgoCD auto-syncs to production

Key Points:

- No manual workflow trigger needed
- No GitHub Environment configuration required
- Manual gate happens at PR merge step in cluster-gitops repo
- All services promoted together with matching versions
- See PRs: https://github.com/camaradesuk/cluster-gitops/pulls?q=is:pr+label:requires-review

Example config file updated:

# syrf/environments/production/api/config.yaml
serviceName: api
envName: production  # ← Changed from staging
service:
  enabled: true
  chartTag: api-v9.1.1  # ← Same version as staging
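Principle 3 (version matching) can be checked mechanically by comparing the `chartTag` in the staging and production config files. This is a rough sketch under assumptions: the helper names `chart_tag` and `same_chart_tag` are hypothetical, and `awk` is used only so that the example has no dependencies; where `yq` is installed, `yq eval '.service.chartTag'` reads the same value.

```shell
# Print the chartTag value from a config.yaml laid out like the example above.
# Assumes chartTag appears once, unquoted, as "chartTag: <value>",
# optionally followed by a trailing comment.
chart_tag() {
  awk '$1 == "chartTag:" { print $2; exit }' "$1"
}

# Succeed only when both config files carry the same, non-empty chartTag.
same_chart_tag() {
  staging_tag=$(chart_tag "$1")
  production_tag=$(chart_tag "$2")
  echo "staging=$staging_tag production=$production_tag"
  [ -n "$staging_tag" ] && [ "$staging_tag" = "$production_tag" ]
}

# Usage, run from the cluster-gitops checkout:
#   same_chart_tag syrf/environments/staging/api/config.yaml \
#                  syrf/environments/production/api/config.yaml
```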

Manual Review Process¶

When the automated PR is created:

Review the PR in the cluster-gitops repository
Check the staging validation checklist
Verify the version matches staging
Merge the PR to trigger production deployment

# View the production promotion PR
cd cluster-gitops
gh pr list --label requires-review

# Review changes in the PR
gh pr diff <PR_NUMBER>

# Check what's being promoted (fetch first so the PR branch is available)
git fetch origin
git diff origin/main...origin/production-promotion-<run-id> -- syrf/environments/production/

# Merge after review
gh pr merge <PR_NUMBER> --merge
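Part of the review is confirming that the promotion PR only touches production environment files. A small sketch of that check follows; the function name `paths_outside_prefix` is an assumption, and the usage line assumes a `gh` version that supports `gh pr diff --name-only`.

```shell
# Read changed file paths on stdin (one per line) and print any that do not
# start with the given prefix. Exits non-zero if at least one such path is found.
# $1: path prefix that every changed file is expected to share
paths_outside_prefix() {
  awk -v p="$1" '
    NF && index($0, p) != 1 { print; bad = 1 }
    END { exit bad + 0 }
  '
}

# Usage:
#   gh pr diff <PR_NUMBER> --name-only | paths_outside_prefix syrf/environments/production/
```

No output and a zero exit status mean every changed file sits under the production directory; anything printed deserves a closer look before merging.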

Step 3: Monitor Production Deployment¶

Critical: Closely monitor the production deployment

# 1. Watch pod rollout
kubectl get pods -n production -l app=api -w

# 2. Check deployment progress
kubectl rollout status deployment/api -n production

# 3. Monitor logs in real-time
kubectl logs -n production -l app=api --tail=100 -f

# 4. Verify health endpoint
curl -fsS https://api.syrf.org.uk/health

# 5. Check for errors in last 5 minutes
kubectl logs -n production -l app=api --since=5m | grep -i error

Production Monitoring Checklist:

- [ ] New pods starting successfully
- [ ] Old pods terminating gracefully
- [ ] No error logs during rollout
- [ ] Health endpoint responding
- [ ] User traffic flowing normally
- [ ] No alerts firing (if monitoring set up)
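During a rollout the health endpoint may fail briefly while new pods start, so a single `curl` can give a misleading result. One way to handle this is a small retry wrapper, sketched below; the name `retry` and the attempt and delay values in the usage line are illustrative assumptions.

```shell
# Run a command until it succeeds, up to a maximum number of attempts.
# $1: attempts, $2: seconds to sleep between attempts, rest: command to run
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage: poll the production health endpoint for up to about a minute.
#   retry 10 6 curl -fsS -o /dev/null https://api.syrf.org.uk/health
```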

Step 4: Post-Deployment Verification¶

After deployment completes, perform verification:

# 1. Verify correct image version deployed
kubectl get deployment api -n production -o jsonpath='{.spec.template.spec.containers[0].image}'

# 2. Check all replicas are ready
kubectl get deployment api -n production

# 3. Test critical user flows
# Example: Login, create project, run search, etc.

# 4. Check application metrics (if available)
# Navigate to: https://grafana.syrf.org.uk/d/production-api

# 5. Verify no increase in error rates
# Check logs, metrics, and user reports
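Check 1 can be scripted by extracting the tag from the deployed image reference and comparing it with the version that was promoted. The sketch below makes assumptions: the helper names are hypothetical, and how `chartTag` maps to the image tag is not stated in this guide, so the expected tag is left as a placeholder.

```shell
# Print the tag portion of a container image reference.
# Fails for untagged or digest-pinned references.
image_tag() {
  name=${1##*/}                 # drop registry host (and port) and repository path
  case $name in
    *@*) return 1 ;;            # digest-pinned, e.g. api@sha256:...
    *:*) printf '%s\n' "${name##*:}" ;;
    *)   return 1 ;;            # no explicit tag
  esac
}

# Succeed when the image reference ($1) carries the expected tag ($2).
image_has_tag() {
  [ "$(image_tag "$1")" = "$2" ]
}

# Usage (the expected tag is a placeholder):
#   image=$(kubectl get deployment api -n production \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   image_has_tag "$image" "<expected-tag>"
```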

Rollback Procedure¶

If issues are detected post-deployment, rollback immediately:

Quick Rollback (Recommended)¶

# 1. Open the production config for the service
cd cluster-gitops
nano syrf/environments/production/api/config.yaml

# 2. Change chartTag to the previous stable version, for example:
#   service:
#     chartTag: api-v1.2.2

# 3. Commit and push
git commit -am "rollback(api): revert production to v1.2.2 due to [issue]"
git push

# 4. Monitor rollback
kubectl rollout status deployment/api -n production
kubectl logs -n production -l app=api --tail=100 -f
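A quick rollback depends on knowing which version was running before. That can be read from the git history of the production config, as sketched below. The function name `previous_chart_tag` is an assumption, and the config path in the usage line follows the example file shown in Step 2.

```shell
# Read diff output on stdin (newest commit first), pick out added
# "chartTag: <value>" lines, and print the first value that differs from
# the currently deployed tag ($1). Exits non-zero if none is found.
previous_chart_tag() {
  sed -n 's/^+[[:space:]]*chartTag:[[:space:]]*\([^[:space:]]*\).*/\1/p' |
    awk -v current="$1" '
      $0 != current { print; found = 1; exit }
      END { exit !found }
    '
}

# Usage, run from the cluster-gitops checkout (current tag is an example):
#   git log -p --format= -- syrf/environments/production/api/config.yaml \
#     | previous_chart_tag api-v9.1.1
```

This reports the most recent different tag, which is not necessarily a known-good one; confirm it against deployment notes before rolling back to it.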

Git-Based Rollback¶

# Alternative: Use git revert
git log --oneline -- syrf/environments/production/api/config.yaml
git revert <commit-sha>
git push

Kubernetes Native Rollback¶

# Last resort: kubectl rollback (doesn't update git state!)
# Note: if ArgoCD self-heal is enabled for this application, it may revert
# the undo to match git, so update git straight away.
kubectl rollout undo deployment/api -n production

# IMPORTANT: Update git to match cluster state
nano syrf/environments/production/api/config.yaml
# Update chartTag to match what's running
git commit -am "chore(api): sync git with rolled-back production state"
git push

Service-Specific Promotion Examples¶

API Service¶

# Staging validation
kubectl logs -n staging -l app=api --tail=200
curl https://staging-api.syrf.org.uk/health

# Promote (automated workflow recommended)
# OR manually:
nano syrf/environments/production/api/config.yaml
# Update: chartTag: api-v8.21.0
git commit -am "promote(api): v8.21.0 → production"
git push

# Monitor
kubectl rollout status deployment/api -n production

Web Service¶

# Staging validation
kubectl logs -n staging -l app=web --tail=200
curl https://staging.syrf.org.uk

# Promote (automated workflow recommended)
# OR manually:
nano syrf/environments/production/web/config.yaml
# Update: chartTag: web-v5.0.1
git commit -am "promote(web): v5.0.1 → production"
git push

# Monitor
kubectl rollout status deployment/web -n production

Project Management Service¶

# Staging validation
kubectl logs -n staging -l app=project-management --tail=200
curl https://staging-api.syrf.org.uk/pm/health

# Promote (automated workflow recommended)
# OR manually:
nano syrf/environments/production/project-management/config.yaml
# Update: chartTag: pm-v10.45.0
git commit -am "promote(pm): v10.45.0 → production"
git push

# Monitor
kubectl rollout status deployment/project-management -n production

Best Practices¶

Timing¶

Avoid peak hours: Deploy during low-traffic periods (e.g., evenings, weekends)
Business hours preferred: Where traffic patterns allow, deploy when the team is available to respond to issues
Coordinate with team: Notify team members before production deployments
Maintenance windows: Use scheduled maintenance windows for risky changes

Communication¶

Before Deployment:

🚀 Production Deployment Planned

Service: API v1.2.3
Time: 2025-11-12 18:00 UTC
Changes: User authentication improvements
Validated in staging: ✅ All tests passing
Risk level: Low
Rollback plan: Revert to v1.2.2

After Deployment:

✅ Production Deployment Complete

Service: API v1.2.3
Deployed at: 2025-11-12 18:15 UTC
Status: Healthy - no errors detected
Monitoring: https://grafana.syrf.org.uk/d/api

Testing¶

Always validate in staging first (no shortcuts!)
Run smoke tests before promoting
Test critical user flows post-deployment
Monitor for at least 30 minutes after deployment
Have rollback command ready before deploying

Version Management¶

Never skip untested versions - Promote 1.2.3 → 1.2.4 rather than jumping 1.2.3 → 1.3.0, unless the target version has been validated in staging
Tag format consistency - Always use semantic versions (no latest in production)
Git history - Maintain clear commit messages with rationale
Documentation - Update CHANGELOG or release notes
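The tag format rule is easy to enforce in a script or pre-merge check. The sketch below assumes the `<service>-v<major>.<minor>.<patch>` shape seen in this guide's examples (`api-v9.1.1`, `web-v5.0.1`, `pm-v10.45.0`); the function name `is_release_tag` is hypothetical.

```shell
# Succeed when a chart tag looks like "<service>-v<major>.<minor>.<patch>".
# Floating tags such as "latest" are rejected.
is_release_tag() {
  printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9-]*-v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Usage:
#   is_release_tag "api-v9.1.1" && echo "ok to promote"
#   is_release_tag "latest"     || echo "refusing floating tag"
```

The pattern deliberately excludes pre-release suffixes such as `-rc.1`; loosen it if such tags are ever promoted.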

Troubleshooting¶

Promotion Blocked by ArgoCD¶

Symptom: Production application shows "OutOfSync" but won't sync

# Check sync status
argocd app get api-production

# Check for sync policy issues
kubectl get application api-production -n argocd -o yaml

# Force sync if needed
argocd app sync api-production --force
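Before forcing a sync, it helps to read the application's sync and health status in a form a script can act on. A minimal sketch follows: it reads the `.status.sync.status` and `.status.health.status` fields of the Argo CD Application resource, assumes Applications live in the `argocd` namespace as in the command above, and uses hypothetical function names.

```shell
# Print "<sync status> <health status>" for an Argo CD Application,
# for example "OutOfSync Healthy".
argocd_app_state() {
  kubectl get application "$1" -n argocd \
    -o jsonpath='{.status.sync.status}{" "}{.status.health.status}'
}

# Succeed only when the application is both Synced and Healthy.
app_is_settled() {
  [ "$(argocd_app_state "$1")" = "Synced Healthy" ]
}

# Usage:
#   app_is_settled api-production || argocd app get api-production
```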

Rollout Stuck¶

Symptom: New pods not starting or old pods not terminating

# Check rollout status
kubectl rollout status deployment/api -n production

# Check pod events
kubectl describe pod <pod-name> -n production

# Check for resource constraints
kubectl top pods -n production
kubectl describe nodes

# Manual intervention (use cautiously)
kubectl rollout restart deployment/api -n production
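`kubectl rollout status` waits indefinitely by default, which hides a stuck rollout. The sketch below adds a deadline and prints recent events on failure. The function name `rollout_or_report` and the 5m default are assumptions, not values taken from this workflow.

```shell
# Wait for a rollout with a deadline; on failure, print recent events to stderr.
# $1: deployment name, $2: namespace, $3: timeout (hypothetical default: 5m)
rollout_or_report() {
  if kubectl rollout status "deployment/$1" -n "$2" --timeout="${3:-5m}"; then
    return 0
  fi
  echo "rollout of $1 did not complete; recent events in $2:" >&2
  kubectl get events -n "$2" --sort-by=.lastTimestamp | tail -n 20 >&2
  return 1
}

# Usage:
#   rollout_or_report api production 10m
```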

Health Check Failures¶

Symptom: Readiness probe failing, traffic not routing to new pods

# Check probe configuration
kubectl describe deployment api -n production | grep -A 5 Readiness

# Test health endpoint from within cluster
kubectl exec -n production deployment/api -- curl -f http://localhost:8080/health

# Check logs for startup errors
kubectl logs -n production <new-pod-name> --tail=200
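To find which pods are failing readiness, the READY column of `kubectl get pods` can be filtered. The helper below is a sketch with a hypothetical name; it assumes the default column layout (NAME, READY, STATUS, ...) produced with `--no-headers`.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the name and
# READY column of every pod whose ready count differs from its container count.
not_ready_pods() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1, $2 }'
}

# Usage:
#   kubectl get pods -n production -l app=api --no-headers | not_ready_pods
```

Completed or terminating pods also show fewer ready containers, so read the output together with the STATUS column.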

Related Documentation¶

Deploying Services - How to deploy to staging
Environment Configuration - Managing environment values (TODO)
Troubleshooting - Common deployment issues (TODO)
CI/CD Workflow - How images are built
Cluster Configuration - Production cluster details

External Resources¶

Progressive Delivery: https://www.weave.works/blog/what-is-progressive-delivery-all-about
Argo Rollouts: https://argoproj.github.io/argo-rollouts/
Blue/Green Deployments: https://martinfowler.com/bliki/BlueGreenDeployment.html