Migration Runbook: Production Lambda to ACK
NOTE: This runbook covers Phase 4 (Production Cutover) of the ACK migration. It needs updating before use; see the Technical Plan for the validated approach. Key changes: separate per-environment S3 buckets (e.g. syrfapp-uploads, syrfapp-uploads-staging), a PostSync Job for credentials, and a corrected Lambda handler/runtime.
Overview
This runbook provides step-by-step instructions for migrating the production S3 Notifier Lambda from Terraform/CI-managed deployment to ACK (AWS Controllers for Kubernetes) GitOps management.
Risk Level: Medium - Production bucket contains user data
Estimated Duration: 2-4 hours (with validation pauses)
Rollback Time: 15 minutes
Pre-Migration Checklist
1. Prerequisites Verified
- ACK S3 Controller installed and healthy in ack-system namespace
- ACK Lambda Controller installed and healthy in ack-system namespace
- Cross-cloud IAM (GKE → AWS) tested with staging resources
- Staging deployment completed and validated
- Helm chart tested with helm template locally
2. Access Confirmed
- AWS Console access (eu-west-1)
- GKE cluster access (kubectl configured for camaradesuk)
- ArgoCD admin access
- GitHub write access (for cluster-gitops)
3. Backup Completed
# Create inventory of production bucket
aws s3 ls s3://syrfapp-uploads --recursive > ~/production-bucket-inventory-$(date +%Y%m%d).txt
# Record current Lambda configuration
aws lambda get-function --function-name syrfAppUploadS3Notifier > ~/production-lambda-config-$(date +%Y%m%d).json
# Record current bucket notification config
aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads > ~/production-notification-config-$(date +%Y%m%d).json
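Before moving on, it may be worth confirming that the three backup files were actually written, since a failed AWS call can leave an empty file behind. The following is a minimal sketch; verify_backups is a hypothetical helper name, not part of the AWS CLI.

```shell
# Sketch: report any backup file that is missing or empty.
# verify_backups is a hypothetical helper, not an AWS CLI command.
verify_backups() {
  _status=0
  for _f in "$@"; do
    if [ -s "$_f" ]; then
      echo "ok: $_f"
    else
      echo "MISSING OR EMPTY: $_f" >&2
      _status=1
    fi
  done
  return "$_status"
}

# Example (file names follow the backup commands above):
# stamp=$(date +%Y%m%d)
# verify_backups ~/production-bucket-inventory-$stamp.txt \
#   ~/production-lambda-config-$stamp.json \
#   ~/production-notification-config-$stamp.json
```

A non-zero exit status means at least one backup should be re-run before the migration starts.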
Migration Steps
Step 1: Freeze Current Deployment (5 min)
Purpose: Prevent conflicts during migration
- Disable the CI/CD Lambda deployment:
# In syrf repo, create a temporary branch to disable Lambda deployment
# Or: Communicate to team that Lambda deployments are frozen
- Verify no deployments in progress:
# Check GitHub Actions for running Lambda workflows
gh run list --workflow=ci-cd.yml --status=in_progress
Step 2: Verify Production State (10 min)
Purpose: Establish baseline for validation
- Record current Lambda version:
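# An unqualified get-function reports Version as $LATEST, so also capture the code hash
aws lambda get-function --function-name syrfAppUploadS3Notifier \
--query 'Configuration.[Version,CodeSha256,LastModified]' --output text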
- Test current functionality:
# Upload a test file
echo "migration-test-$(date +%s)" > /tmp/migration-test.txt
aws s3 cp /tmp/migration-test.txt s3://syrfapp-uploads/migration-test/
# Check Lambda was invoked (wait 30 seconds)
aws logs tail /aws/lambda/syrfAppUploadS3Notifier --since 2m | grep migration-test
# Clean up
aws s3 rm s3://syrfapp-uploads/migration-test/migration-test.txt
- Record file count for validation:
aws s3 ls s3://syrfapp-uploads --recursive --summarize | tail -2
# Note: Total Objects and Total Size
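Copying the two totals by hand is error-prone, so one option is to reduce the summary to a single line that can be saved and compared later. This is a sketch under the assumption that the listing is run without --human-readable, so Total Size is a plain byte count; summarize_totals and the baseline file path are hypothetical names.

```shell
# Sketch: reduce `aws s3 ls --recursive --summarize` output (read from stdin)
# to one line of the form "<object count> <total bytes>".
# summarize_totals is a hypothetical helper name.
summarize_totals() {
  awk -F': *' '
    /Total Objects:/ { objects = $2 }
    /Total Size:/    { size = $2 }
    END { print objects, size }
  '
}

# Example (the baseline path is an assumption):
# aws s3 ls s3://syrfapp-uploads --recursive --summarize \
#   | summarize_totals > ~/production-bucket-totals-baseline.txt
# Later, after adoption, print any difference from the baseline:
# aws s3 ls s3://syrfapp-uploads --recursive --summarize \
#   | summarize_totals | diff - ~/production-bucket-totals-baseline.txt
```

Totals can legitimately change if users upload or delete files between the two runs, so a difference is a prompt to investigate rather than proof of data loss.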
Step 3: Create Production Config in cluster-gitops (15 min)
Purpose: Prepare GitOps configuration with adoption flags
- Create production service config:
- Create config.yaml:
# syrf/environments/production/s3-notifier/config.yaml
serviceName: s3-notifier
envName: production
chartTag: main # Or specific commit SHA
lambda:
version: "X.Y.Z" # Current production version
packageKey: "s3-notifier/X.Y.Z.zip"
gitVersion:
sha: "abc123..."
shortSha: "abc123"
- Create values.yaml with adoption flag:
# syrf/environments/production/s3-notifier/values.yaml
envName: production
namespace: syrf-production
bucket:
name: syrfapp-uploads
adopt: true # CRITICAL: Adopt existing bucket
versioning: true
tags:
Environment: production
CriticalData: "true"
function:
name: syrfAppUploadS3Notifier
env:
rabbitmqHost: "amqp://rabbitmq.camarades.net:5672"
extra:
LOG_LEVEL: "Information"
- Commit but DO NOT push yet:
git add syrf/environments/production/s3-notifier/
git commit -m "feat(s3-notifier): add production config with adoption flag"
Step 4: Deploy to Production (20 min)
Purpose: Let ACK adopt existing resources
- Push the config:
- Monitor ArgoCD sync:
# Watch the Application appear and sync
argocd app list | grep s3-notifier
argocd app get production-s3-notifier
- Watch ACK controller logs during adoption:
kubectl logs -n ack-system -l app.kubernetes.io/name=ack-s3-controller -f &
kubectl logs -n ack-system -l app.kubernetes.io/name=ack-lambda-controller -f &
- Verify ACK resources created:
Step 5: Validate Adoption (15 min)
Purpose: Confirm no data loss or recreation
- CRITICAL: Verify bucket was NOT recreated:
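# Check bucket creation date - should be original date, NOT today
# (head-bucket does not report a creation date; list-buckets does)
aws s3api list-buckets --query "Buckets[?Name=='syrfapp-uploads'].CreationDate" --output text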
# List bucket to confirm files exist
aws s3 ls s3://syrfapp-uploads --recursive --summarize | tail -2
# Compare with Step 2 - numbers should match
- Verify ACK shows adopted status:
kubectl describe bucket syrfapp-uploads -n syrf-production | grep -A5 Annotations
# Should show: services.k8s.aws/adopted: "true"
- Verify Lambda configuration unchanged:
aws lambda get-function --function-name syrfAppUploadS3Notifier \
--query 'Configuration.[FunctionName,Runtime,Handler,MemorySize]'
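The "should be original date, NOT today" check above can be made mechanical. The sketch below assumes the bucket's creation timestamp has already been obtained as an ISO 8601 UTC string (for example from aws s3api list-buckets); created_before_today is a hypothetical helper name.

```shell
# Sketch: succeed only if the given ISO 8601 timestamp falls on a day
# before today (UTC). An empty argument (bucket not found) counts as failure.
# created_before_today is a hypothetical helper name.
created_before_today() {
  _created_day=${1%%T*}            # keep only the YYYY-MM-DD part
  _today=$(date -u +%Y-%m-%d)
  [ -n "$_created_day" ] && [ "$_created_day" != "$_today" ]
}

# Example:
# created=$(aws s3api list-buckets \
#   --query "Buckets[?Name=='syrfapp-uploads'].CreationDate" --output text)
# created_before_today "$created" || echo "STOP: bucket looks recreated" >&2
```

A failure here is a reason to pause the migration and inspect the bucket before continuing to Step 6.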
Step 6: Test End-to-End (15 min)
Purpose: Confirm full functionality
- Upload test file:
echo "post-migration-test-$(date +%s)" > /tmp/post-migration-test.txt
aws s3 cp /tmp/post-migration-test.txt s3://syrfapp-uploads/migration-test/
- Verify Lambda invocation:
# Wait 30 seconds
sleep 30
aws logs tail /aws/lambda/syrfAppUploadS3Notifier --since 2m | grep post-migration-test
- Verify RabbitMQ message received:
# Check API/PM service logs for file notification
kubectl logs -n syrf-production -l app=api --since=5m | grep -i "file\|upload"
- Clean up test file:
aws s3 rm s3://syrfapp-uploads/migration-test/post-migration-test.txt
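The fixed 30-second wait in the Lambda invocation check can miss slow log delivery or waste time when logs arrive quickly. One alternative is to poll until the marker appears. This is a sketch only; retry_until and log_has_marker are hypothetical helper names, and the attempt count and delay are arbitrary choices.

```shell
# Sketch: run a command up to <max> times, sleeping <delay> seconds between
# attempts, and succeed as soon as the command succeeds.
# retry_until is a hypothetical helper name.
retry_until() {
  _max=$1
  _delay=$2
  shift 2
  _try=1
  while [ "$_try" -le "$_max" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$_try" -lt "$_max" ]; then
      sleep "$_delay"
    fi
    _try=$((_try + 1))
  done
  return 1
}

# Example: poll the Lambda logs for up to about one minute.
# log_has_marker() {
#   aws logs tail /aws/lambda/syrfAppUploadS3Notifier --since 5m \
#     | grep -q post-migration-test
# }
# retry_until 6 10 log_has_marker || echo "Lambda was not invoked" >&2
```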
Step 7: Decommission Old Management (30 min)
Purpose: Remove Terraform Lambda management
- Comment out Lambda resources in Terraform:
# In camarades-infrastructure repo
# Comment out or remove Lambda-related resources
# Keep bucket management temporarily if using separate Terraform
- Archive old workflow:
# In syrf repo
mkdir -p .github/workflows/archived
git mv .github/workflows/pr-preview-lambda.yml .github/workflows/archived/
git commit -m "chore: archive pr-preview-lambda.yml - now managed by ACK"
- Update documentation:
- Update CLAUDE.md with new architecture
- Mark old docs as deprecated
Rollback Procedure
If issues occur during migration:
Immediate Rollback (ACK just deployed)
- Delete ACK resources (bucket retained due to deletion policy):
kubectl delete bucket syrfapp-uploads -n syrf-production
kubectl delete function syrfAppUploadS3Notifier -n syrf-production
- Verify bucket still exists:
- Re-enable Terraform/CI management
Post-Migration Rollback
- Revert cluster-gitops changes:
- Manually restore notification configuration if needed:
aws s3api put-bucket-notification-configuration \
--bucket syrfapp-uploads \
--notification-configuration file://$HOME/production-notification-config-YYYYMMDD.json
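Because put-bucket-notification-configuration replaces the bucket's whole notification configuration, it may be worth confirming that the saved file really contains a Lambda entry before restoring from it. The check below is a coarse sketch that only looks for the LambdaFunctionConfigurations key; has_lambda_notifications is a hypothetical helper name.

```shell
# Sketch: succeed only if the backup file is non-empty and mentions a
# LambdaFunctionConfigurations section. This does not validate the JSON.
# has_lambda_notifications is a hypothetical helper name.
has_lambda_notifications() {
  [ -s "$1" ] && grep -q '"LambdaFunctionConfigurations"' "$1"
}

# Example (replace YYYYMMDD with the backup date):
# has_lambda_notifications "$HOME/production-notification-config-YYYYMMDD.json" \
#   || echo "Backup has no Lambda notification; do not restore from it" >&2
```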
Post-Migration Monitoring
First 24 Hours
- Monitor Lambda invocation metrics in CloudWatch
- Check for errors in Lambda logs
- Verify ArgoCD shows healthy sync status
- Confirm file uploads working in SyRF application
First Week
- Review ACK controller logs for any reconciliation issues
- Confirm no drift between desired and actual state
- Validate CI/CD promotion workflow works for new Lambda versions
Troubleshooting
ACK Tries to Recreate Bucket
Symptom: ACK creates new bucket instead of adopting
Fix:
# Manually add adoption annotation
kubectl annotate bucket syrfapp-uploads \
services.k8s.aws/adopted=true \
-n syrf-production --overwrite
Lambda Not Triggering After Migration
Symptom: S3 uploads don't invoke Lambda
Check:
# Verify notification configuration
aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads
# Verify Lambda permission
aws lambda get-policy --function-name syrfAppUploadS3Notifier
Fix: May need to manually recreate notification/permission if ACK CRDs don't support them.
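When reading the get-policy output by eye, the thing to look for is a statement that lets S3 invoke the function. A rough automated version of that check is sketched below. It is a plain substring match on the policy text and does not verify the source bucket ARN or that both strings sit in the same statement; policy_allows_s3_invoke is a hypothetical helper name.

```shell
# Sketch: read `aws lambda get-policy` output from stdin and succeed if it
# mentions both the S3 service principal and the invoke action.
# policy_allows_s3_invoke is a hypothetical helper name.
policy_allows_s3_invoke() {
  _policy=$(cat)
  printf '%s' "$_policy" | grep -q 's3\.amazonaws\.com' &&
    printf '%s' "$_policy" | grep -q 'lambda:InvokeFunction'
}

# Example:
# aws lambda get-policy --function-name syrfAppUploadS3Notifier \
#   | policy_allows_s3_invoke || echo "S3 invoke permission missing" >&2
```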
ArgoCD Shows OutOfSync
Symptom: ArgoCD keeps showing drift
Check:
Fix: Usually indicates Helm values don't match actual AWS state. Update values to match.
Success Criteria
Migration is successful when:
- Production bucket exists with all original files
- Lambda responds to S3 uploads within 30 seconds
- ArgoCD shows synced status
- ACK resources show services.k8s.aws/adopted: "true"
- No errors in ACK controller logs
- File upload functionality works in SyRF application
- Old Terraform/CI Lambda management disabled