Service Promotion Workflow
Guide to promoting services from staging to production using GitOps best practices.
Overview
The promotion workflow moves a validated service version from staging to production. It follows a progressive delivery model.
Promotion Philosophy
Key Principles:
1. Staging is automatic - Deployments to staging happen automatically after CI/CD completes
2. Production is manual - Promotions to production require explicit approval and testing
3. Version matching - Production should only run versions that have been validated in staging
4. Rollback ready - Always have a plan to roll back if issues arise
5. Communication - Notify the team of production deployments
Prerequisites
- Service successfully deployed and tested in staging
- Access to the cluster-gitops repository
- Production deployment checklist completed
- Stakeholder approval (for major changes)
Promotion Process
Step 1: Validate in Staging
Before promoting, ensure the service is stable in staging:
# 1. Check pod status
kubectl get pods -n syrf-staging -l app=api
# 2. Check recent logs for errors
kubectl logs -n syrf-staging -l app=api --tail=200 | grep -i error
# 3. Verify health endpoints (if configured)
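# -f makes curl exit non-zero on HTTP errors (e.g. 404), so the fallback message actually triggers
curl -fsS https://staging-api.syrf.org.uk/health || echo "Health check failed or no health endpoint configured"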
# 4. Check current version
yq eval '.service.chartTag' syrf/environments/staging/api/config.yaml
# 5. Check ArgoCD sync status
argocd app get api-staging
Staging Validation Checklist:
- [ ] Pods are running and healthy
- [ ] No error logs or warnings
- [ ] Health endpoint returns 200 OK
- [ ] Smoke tests pass (if available)
- [ ] No performance degradation
- [ ] Feature works as expected
- [ ] Database migrations successful (if any)
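The "no error logs" item can be turned into a pass/fail check. The following is a minimal sketch, not part of the existing tooling: the function names `count_error_lines` and `logs_are_clean` are hypothetical, and matching on the word "error" is as coarse as the `grep -i error` command above.

```shell
# Count log lines that mention "error" (case-insensitive).
# Reads log text on stdin so it can be fed from `kubectl logs`.
count_error_lines() {
  # grep -c prints the number of matching lines; it exits 1 when that
  # number is 0, so `|| true` keeps the function usable under `set -e`.
  grep -ci 'error' || true
}

# Hypothetical gate: succeeds only when stdin contains no error lines.
logs_are_clean() {
  local count
  count="$(count_error_lines)"
  [ "$count" -eq 0 ]
}

# Example usage (namespace and label taken from the commands above):
#   kubectl logs -n syrf-staging -l app=api --tail=200 | logs_are_clean \
#     && echo "staging logs clean" || echo "errors found in staging logs"
```

A check like this only catches lines containing the literal word, so it complements rather than replaces reading the logs.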
Step 2: Promote to Production
Updated 2025-11-16: Production promotion is automated, with a manual review gate.
Production promotion starts automatically after a successful staging deployment:
- Trigger: After staging promotion succeeds in the CI/CD workflow
- Job: The promote-to-production job starts automatically
- Action: Copies all service configs from staging to production
- Updates: Sets envName: production and keeps chartTag from staging
- Creates PR: Labeled requires-review, with a review checklist
- Manual Gate: An administrator must review and merge the PR manually
- CI/CD Completes: The workflow shows a green checkmark after PR creation
- Deployment: After the PR is merged, ArgoCD auto-syncs to production
Key Points:
- No manual workflow trigger needed
- No GitHub Environment configuration required
- Manual gate happens at PR merge step in cluster-gitops repo
- All services promoted together with matching versions
- See PRs: https://github.com/camaradesuk/cluster-gitops/pulls?q=is:pr+label:requires-review
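To make the copy-and-update step concrete, here is a rough sketch of the transformation for a single service config. It is an illustration only and not the actual promote-to-production job: the function name `promote_config` is hypothetical, and it assumes `envName` is an unquoted top-level key.

```shell
# Copy a staging config to a production path, switching envName to
# "production" and leaving every other line (including chartTag) untouched.
promote_config() {
  local staging_file="$1" production_file="$2"
  mkdir -p "$(dirname "$production_file")" || return 1
  # Replace only the value of the top-level envName key; a trailing
  # comment on that line, if any, is kept as-is.
  sed -E 's|^envName:[[:space:]]*[^[:space:]#]+|envName: production|' \
    "$staging_file" > "$production_file"
}

# Example usage (paths follow the layout used in this guide):
#   promote_config syrf/environments/staging/api/config.yaml \
#                  syrf/environments/production/api/config.yaml
```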
Example config file updated:
# syrf/environments/production/api/config.yaml
serviceName: api
envName: production # ← Changed from staging
service:
  enabled: true
  chartTag: api-v9.1.1 # ← Same version as staging
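The version-matching rule can be checked mechanically before merging. The sketch below compares the `chartTag` values of two config files. The helper names `chart_tag` and `versions_match` are hypothetical, and the parsing assumes an unquoted `chartTag:` value on its own line, as in the example config; `yq` would be the more robust choice for real YAML.

```shell
# Print the chartTag value from a config file.
# Assumes a line of the form "chartTag: <value>" with an unquoted value.
chart_tag() {
  # awk splits on whitespace, so indentation and a trailing comment
  # do not affect the second field.
  awk '$1 == "chartTag:" { print $2; exit }' "$1"
}

# Succeeds only when both files define the same, non-empty chartTag.
versions_match() {
  local staging_tag production_tag
  staging_tag="$(chart_tag "$1")"
  production_tag="$(chart_tag "$2")"
  [ -n "$staging_tag" ] && [ "$staging_tag" = "$production_tag" ]
}

# Example usage:
#   versions_match syrf/environments/staging/api/config.yaml \
#                  syrf/environments/production/api/config.yaml \
#     && echo "versions match" || echo "version mismatch"
```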
Manual Review Process
When the automated PR is created:
- Review the PR in the cluster-gitops repository
- Check the staging validation checklist
- Verify that the version matches staging
- Merge the PR to trigger the production deployment
# View the production promotion PR
cd cluster-gitops
gh pr list --label requires-review
# Review changes in the PR
gh pr diff <PR_NUMBER>
# Check what's being promoted
git diff main...production-promotion-<run-id> -- syrf/environments/production/
# Merge after review
gh pr merge <PR_NUMBER> --merge
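One quick sanity check during review is that the promotion PR touches only production environment files. This is a sketch under that assumption: the helper name `paths_outside_prefix` is hypothetical, and the directory prefix is the one used in the `git diff` command above.

```shell
# Read changed file paths on stdin (one per line) and print every path
# that does not start with the given directory prefix.
paths_outside_prefix() {
  local prefix="$1" path
  # `|| [ -n "$path" ]` also handles a final line without a newline.
  while IFS= read -r path || [ -n "$path" ]; do
    [ -z "$path" ] && continue
    case "$path" in
      "$prefix"*) ;;                      # inside the expected directory
      *) printf '%s\n' "$path" ;;         # unexpected location
    esac
  done
}

# Example usage (branch name as in the command above):
#   git diff --name-only main...production-promotion-<run-id> \
#     | paths_outside_prefix syrf/environments/production/
```

Empty output means every changed file sits under the production directory; any printed path deserves a closer look before merging.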
Step 3: Monitor Production Deployment
Critical: Closely monitor the production deployment.
# 1. Watch pod rollout
kubectl get pods -n production -l app=api -w
# 2. Check deployment progress
kubectl rollout status deployment/api -n production
# 3. Monitor logs in real-time
kubectl logs -n production -l app=api --tail=100 -f
# 4. Verify health endpoint
curl https://api.syrf.org.uk/health
# 5. Check for errors in last 5 minutes
kubectl logs -n production -l app=api --since=5m | grep -i error
Production Monitoring Checklist:
- [ ] New pods starting successfully
- [ ] Old pods terminating gracefully
- [ ] No error logs during rollout
- [ ] Health endpoint responding
- [ ] User traffic flowing normally
- [ ] No alerts firing (if monitoring set up)
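During a rollout the health endpoint may fail briefly while new pods start, so a single `curl` can be misleading. A small retry helper is one way to handle that. This is a sketch only: `retry_until_ok` is a hypothetical name, and the attempt count and delay in the usage line are arbitrary example values.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# Usage: retry_until_ok <attempts> <delay-seconds> <command> [args...]
retry_until_ok() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Wait between attempts, but not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example usage: poll the production health endpoint for up to a minute.
#   retry_until_ok 10 6 curl -fsS -o /dev/null https://api.syrf.org.uk/health \
#     && echo "healthy" || echo "still unhealthy after retries"
```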
Step 4: Post-Deployment Verification
After deployment completes, perform verification:
# 1. Verify correct image version deployed
kubectl get deployment api -n production -o jsonpath='{.spec.template.spec.containers[0].image}'
# 2. Check all replicas are ready
kubectl get deployment api -n production
# 3. Test critical user flows
# Example: Login, create project, run search, etc.
# 4. Check application metrics (if available)
# Navigate to: https://grafana.syrf.org.uk/d/production-api
# 5. Verify no increase in error rates
# Check logs, metrics, and user reports
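To compare the deployed image against the expected version, it helps to isolate the tag from the image reference returned by the `jsonpath` query. The following is a sketch: `image_tag` is a hypothetical helper, the registry name in the usage line is a placeholder, and how image tags relate to `chartTag` values is not specified in this guide.

```shell
# Print the tag portion of a container image reference.
# Returns non-zero when the reference carries no explicit tag.
image_tag() {
  local ref last
  ref="${1%%@*}"        # drop any @sha256:... digest suffix
  last="${ref##*/}"     # keep the last path segment, so a registry port is ignored
  case "$last" in
    *:*) printf '%s\n' "${last##*:}" ;;
    *)   return 1 ;;
  esac
}

# Example usage (registry.example.com is a placeholder):
#   image="$(kubectl get deployment api -n production \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')"
#   image_tag "$image"
```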
Rollback Procedure
If issues are detected post-deployment, roll back immediately:
Quick Rollback (Recommended)
# 1. Revert to previous version
cd cluster-gitops
nano environments/production/services/api.yaml
# 2. Change chartTag to the previous version, for example:
#    service:
#      chartTag: api-v1.2.2 # Previous stable version
# 3. Commit and push
git commit -am "rollback(api): revert production to v1.2.2 due to [issue]"
git push
# 4. Monitor rollback
kubectl rollout status deployment/api -n production
kubectl logs -n production -l app=api --tail=100 -f
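Under pressure, a scripted edit can be less error-prone than opening an editor. The sketch below rewrites the `chartTag` value in place. `set_chart_tag` is a hypothetical helper, and it assumes an unquoted value and a tag without `|`, `&` or backslash characters, since the tag is interpolated into a `sed` expression.

```shell
# Replace the chartTag value in a config file, keeping indentation and
# any trailing comment on the line.
# Usage: set_chart_tag <file> <new-tag>
set_chart_tag() {
  local file="$1" new_tag="$2" tmp status
  tmp="$(mktemp)" || return 1
  # Write to a temp file first, then copy the content back so the
  # original file keeps its permissions.
  sed -E "s|^([[:space:]]*chartTag:[[:space:]]*)[^[:space:]#]+|\1${new_tag}|" \
    "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Example usage (file path as in the rollback steps above):
#   set_chart_tag environments/production/services/api.yaml api-v1.2.2
#   git diff   # confirm only the chartTag line changed before committing
```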
Git-Based Rollback
# Alternative: Use git revert
git log --oneline -- environments/production/services/api.yaml
git revert <commit-sha>
git push
Kubernetes Native Rollback
# Last resort: kubectl rollout undo (doesn't update git state!)
# Note: if ArgoCD self-heal is enabled for this application, it may revert this change
kubectl rollout undo deployment/api -n production
# IMPORTANT: Update git to match cluster state
nano environments/production/services/api.yaml
# Update chartTag to match what's running
git commit -am "chore(api): sync git with rolled-back production state"
git push
Service-Specific Promotion Examples
API Service
# Staging validation
kubectl logs -n staging -l app=api --tail=200
curl https://staging-api.syrf.org.uk/health
# Promote (automated workflow recommended)
# OR manually:
nano environments/production/services/api.yaml
# Update: chartTag: api-v8.21.0
git commit -am "promote(api): v8.21.0 → production"
git push
# Monitor
kubectl rollout status deployment/api -n production
Web Service
# Staging validation
kubectl logs -n staging -l app=web --tail=200
curl https://staging.syrf.org.uk
# Promote (automated workflow recommended)
# OR manually:
nano environments/production/services/web.yaml
# Update: chartTag: web-v5.0.1
git commit -am "promote(web): v5.0.1 → production"
git push
# Monitor
kubectl rollout status deployment/web -n production
Project Management Service
# Staging validation
kubectl logs -n staging -l app=project-management --tail=200
curl https://staging-api.syrf.org.uk/pm/health
# Promote (automated workflow recommended)
# OR manually:
nano environments/production/services/project-management.yaml
# Update: chartTag: pm-v10.45.0
git commit -am "promote(pm): v10.45.0 → production"
git push
# Monitor
kubectl rollout status deployment/project-management -n production
Best Practices
Timing
- Avoid peak hours: Deploy during low-traffic periods (e.g., evenings, weekends)
- Business hours preferred: Where traffic allows, deploy when the team is available to respond to issues
- Coordinate with team: Notify team members before production deployments
- Maintenance windows: Use scheduled maintenance windows for risky changes
Communication
Before Deployment:
🚀 Production Deployment Planned
Service: API v1.2.3
Time: 2025-11-12 18:00 UTC
Changes: User authentication improvements
Validated in staging: ✅ All tests passing
Risk level: Low
Rollback plan: Revert to v1.2.2
After Deployment:
✅ Production Deployment Complete
Service: API v1.2.3
Deployed at: 2025-11-12 18:15 UTC
Status: Healthy - no errors detected
Monitoring: https://grafana.syrf.org.uk/d/api
Testing
- Always validate in staging first (no shortcuts!)
- Run smoke tests before promoting
- Test critical user flows post-deployment
- Monitor for at least 30 minutes after deployment
- Have rollback command ready before deploying
Version Management
- Never skip versions - Promote incrementally (1.2.3 → 1.2.4) rather than jumping ahead (1.2.3 → 1.3.0), unless the target version has been tested in staging
- Tag format consistency - Always use semantic versions (no latest in production)
- Git history - Maintain clear commit messages with rationale
- Documentation - Update CHANGELOG or release notes
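The tag-format rule can be enforced with a small check before committing. This sketch assumes tags follow the `<service>-v<major>.<minor>.<patch>` shape seen in the examples in this guide (such as `api-v9.1.1`). The function name `is_release_tag` is hypothetical, and the pattern deliberately rejects pre-release suffixes and `latest`.

```shell
# Succeed only for tags shaped like <service>-v<major>.<minor>.<patch>,
# e.g. api-v9.1.1 or pm-v10.45.0. Rejects "latest" and partial versions.
is_release_tag() {
  printf '%s\n' "$1" \
    | grep -Eq '^[a-z][a-z0-9-]*-v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Example usage:
#   is_release_tag "api-v9.1.1" && echo "ok" || echo "not a release tag"
```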
Troubleshooting
Promotion Blocked by ArgoCD
Symptom: Production application shows "OutOfSync" but won't sync
# Check sync status
argocd app get api-production
# Check for sync policy issues
kubectl get application api-production -n argocd -o yaml
# Force sync if needed (--force uses a force apply, which can delete and recreate resources; use cautiously)
argocd app sync api-production --force
Rollout Stuck
Symptom: New pods not starting or old pods not terminating
# Check rollout status
kubectl rollout status deployment/api -n production
# Check pod events
kubectl describe pod <pod-name> -n production
# Check for resource constraints
kubectl top pods -n production
kubectl describe nodes
# Manual intervention (use cautiously)
kubectl rollout restart deployment/api -n production
Health Check Failures
Symptom: Readiness probe failing, traffic not routing to new pods
# Check probe configuration
kubectl describe deployment api -n production | grep -A 5 Readiness
# Test health endpoint from within cluster
kubectl exec -n production deployment/api -- curl -f http://localhost:8080/health
# Check logs for startup errors
kubectl logs -n production <new-pod-name> --tail=200
Related Documentation
- Deploying Services - How to deploy to staging
- Environment Configuration - Managing environment values (TODO)
- Troubleshooting - Common deployment issues (TODO)
- CI/CD Workflow - How images are built
- Cluster Configuration - Production cluster details
External Resources
- Progressive Delivery: https://www.weave.works/blog/what-is-progressive-delivery-all-about
- Argo Rollouts: https://argoproj.github.io/argo-rollouts/
- Blue/Green Deployments: https://martinfowler.com/bliki/BlueGreenDeployment.html