Technical Plan: Lambda ACK GitOps Migration¶
Overview¶
Migrate the S3 Notifier Lambda (syrfAppUploadS3Notifier) from Terraform/CI-managed deployment to ACK (AWS Controllers for Kubernetes) GitOps management. This brings Lambda and S3 bucket lifecycle management into the same GitOps paradigm as all other SyRF services — ACK controllers in GKE manage Lambda + S3 as Kubernetes CRDs, ArgoCD syncs them, and cluster-gitops is the single source of truth.
Related Documents:
- Migration Runbook - Production migration steps (to be updated)
- Lambda GitOps Integration - Umbrella strategy doc (Tier 1 completed, Tier 3 links here)
Key Architecture Decisions¶
1. Separate S3 buckets per environment (full ACK)¶
Each environment gets its own S3 bucket and Lambda, both managed as ACK CRDs. This aligns with the existing isolation pattern (MongoDB is already per-environment). Each environment is fully self-contained — no shared notification config, no aggregation problem.
| Environment | S3 Bucket | Lambda | Notification |
|---|---|---|---|
| Production | `syrfapp-uploads` (adopt existing) | `syrfAppUploadS3Notifier` (adopt existing) | 1:1, in Bucket CRD |
| Staging | `syrfapp-uploads-staging` (new) | `syrfAppUploadS3Notifier-staging` (new) | 1:1, in Bucket CRD |
| Preview PR N | `syrfapp-uploads-pr-{N}` (new, ephemeral) | `syrfAppUploadS3Notifier-pr-{N}` (new, ephemeral) | 1:1, in Bucket CRD |
App config change: Set `s3.bucketName` per environment in cluster-gitops Helm values. Most S3 operations already read the bucket name from config (`S3Settings.BucketName`), but `S3FileService.WriteStreamToFile` (used by `OverwriteAllLines`) has a hard-coded `BucketName` constant set to `"syrfapp-uploads"` (`S3FileService.cs:24`, line 112). This must be fixed to use `_s3Settings.BucketName` so that writes from non-production environments do not land in the production bucket.
2. ACK manages both Bucket + Function CRDs¶
Since each environment has its own bucket, the Bucket CRD's spec.notification only references one Lambda. No aggregation, no PostSync Job for notifications. Clean declarative management.
- S3 Bucket: ACK S3 controller (`Bucket` CRD) — includes notification config
- Lambda Function: ACK Lambda controller (`Function` CRD)
- Lambda Permission: Sync-hook Job, wave 2 (the ACK Lambda controller lacks a `Permission` CRD)
- Lambda Env Vars: PostSync Job (the CRD doesn't support `secretKeyRef` for env vars)
- IAM Execution Roles: Terraform (already exist, rarely change)
3. Setup Jobs for credentials + permissions¶
ACK's Function CRD `environment.variables` is a plain string map — no `secretKeyRef` support. The RabbitMQ password must never appear in git, and Lambda Permission has no ACK CRD. Solution: two ArgoCD hook Jobs:
- `permission-job` (Sync hook, wave 2): Grants the S3→Lambda invoke permission with `--source-account` for confused-deputy protection. Runs during sync, before the Bucket (wave 3) creates the notification.
- `env-vars-job` (PostSync hook): Reads `RabbitMqPassword` from a K8s Secret (synced via ClusterExternalSecret from GCP Secret Manager) and calls `aws lambda update-function-configuration` to set all 4 env vars.

Both jobs use the `syrf-ack-setup-job` IAM role (least-privilege, separate from the `syrf-ack-controllers` role used by ACK). The Function CRD intentionally omits `environment` — the env-vars-job owns env var management.
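As a sketch, the hook wiring looks like this for the permission job (hook annotations and wave numbers are from this plan; the container command, env var names, and delete policy are illustrative assumptions, not the actual `permission-job.yaml`):

```yaml
# Sketch: ArgoCD hook wiring for permission-job. Wave 2 runs after the
# Function CRD (wave 1) and before the Bucket CRD (wave 3).
apiVersion: batch/v1
kind: Job
metadata:
  name: permission-job
  annotations:
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/sync-wave: "2"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation  # assumption
spec:
  template:
    spec:
      serviceAccountName: ack-setup-job   # bound to the syrf-ack-setup-job IAM role
      restartPolicy: Never
      containers:
        - name: add-permission
          image: amazon/aws-cli:2.15.0
          command: ["/bin/sh", "-c"]
          args:
            - |
              # $FUNCTION_NAME / $BUCKET_NAME / $AWS_ACCOUNT_ID are illustrative
              aws lambda add-permission \
                --function-name "$FUNCTION_NAME" \
                --statement-id s3-invoke \
                --action lambda:InvokeFunction \
                --principal s3.amazonaws.com \
                --source-arn "arn:aws:s3:::$BUCKET_NAME" \
                --source-account "$AWS_ACCOUNT_ID"
```

The real job additionally waits for the Lambda to be Active and handles `ResourceConflictException`, as described in Phase 2.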
4. GKE → AWS cross-cloud auth via OIDC federation¶
ACK controllers authenticate to AWS using projected service account tokens + AWS_WEB_IDENTITY_TOKEN_FILE. GKE projects OIDC tokens for pods via Workload Identity. AWS IAM trusts the GKE OIDC issuer. This is proven technology but untested in this project — Phase 0 validates it.
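A minimal sketch of the mechanism — the pod-identity-webhook injects an equivalent volume and env vars automatically for annotated ServiceAccounts; here it is spelled out manually (pod name and mount path are illustrative):

```yaml
# Sketch: a pod presenting a GKE-issued OIDC token to AWS STS via
# AssumeRoleWithWebIdentity. The AWS SDK/CLI picks up AWS_ROLE_ARN +
# AWS_WEB_IDENTITY_TOKEN_FILE and exchanges the token for credentials.
apiVersion: v1
kind: Pod
metadata:
  name: aws-auth-test            # illustrative
  namespace: ack-system
spec:
  serviceAccountName: ack-s3-controller
  containers:
    - name: cli
      image: amazon/aws-cli:2.15.0
      command: ["sleep", "3600"]
      env:
        - name: AWS_ROLE_ARN
          value: arn:aws:iam::318789018510:role/syrf-ack-controllers
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/aws/token
      volumeMounts:
        - name: aws-token
          mountPath: /var/run/secrets/aws
  volumes:
    - name: aws-token
      projected:
        sources:
          - serviceAccountToken:
              audience: sts.amazonaws.com   # must match the role's aud condition
              expirationSeconds: 3600
              path: token
```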
Codebase-Validated Reference¶
These values were validated against the actual codebase on 2026-02-06:
| Parameter | Correct Value | Source |
|---|---|---|
| Lambda runtime | `dotnet10` | `camarades-infrastructure/terraform/lambda/main.tf:67` |
| Lambda handler | `SyRF.S3FileSavedNotifier.Endpoint::SyRF.S3FileSavedNotifier.Endpoint.S3FileReceivedHandler::HandleEvent` | `camarades-infrastructure/terraform/lambda/main.tf:68` |
| Lambda env vars | `RabbitMqHost`, `RabbitMqUsername`, `RabbitMqPassword`, `S3Region` | `S3FileReceivedFunction.cs:78-80`, Terraform `main.tf:82` |
| Lambda env var note | `S3Region` is set by Terraform but not read by handler code — preserved for compatibility. `--environment` replaces ALL vars, so all 4 must be set together | `env-vars-job.yaml` |
| RabbitMQ host | `amqp://rabbitmq.camarades.net:5672` (public, NOT cluster-internal, plain AMQP — see Security Note) | `camarades-infrastructure/terraform/lambda/variables.tf:45` |
| RabbitMQ username | `rabbit` | `camarades-infrastructure/terraform/lambda/variables.tf:51` |
| RabbitMQ virtual host | From S3 object metadata (`metadata["virtualhost"]`) | `S3FileReceivedFunction.cs:81` |
| S3 bucket name | `syrfapp-uploads` (production) | `camarades-infrastructure/terraform/lambda/variables.tf:57` |
| Lambda packages bucket | `camarades-terraform-state-aws` | `camarades-infrastructure/terraform/lambda/variables.tf:63` |
| Production IAM role | `syrfS3NotifierProductionLambdaRole` | `camarades-infrastructure/terraform/lambda/main.tf:16` |
| Preview IAM role | `syrfS3NotifierPreviewLambdaRole` | `camarades-infrastructure/terraform/lambda/main.tf:97` |
| ApplicationSet glob | `syrf/services/*/config.yaml` (auto-discovers new services) | `cluster-gitops/argocd/applicationsets/syrf.yaml` |
Security Note: RabbitMQ Transport¶
The current Terraform configuration uses plain AMQP (amqp://rabbitmq.camarades.net:5672) for Lambda→RabbitMQ connections. This is existing production behavior, not introduced by this migration. However, it transmits credentials and messages in cleartext over the public internet. Upgrading to TLS (amqps:// on port 5671) is recommended as a separate follow-up but is out of scope for this migration — the goal here is to replicate the existing configuration faithfully under ACK management first.
Phase 0: Proof of Concept — Cross-Cloud Auth¶
Goal: Prove that a pod in GKE can assume an AWS IAM role and manage Lambda/S3 resources. This is the highest-risk component — if it fails, the ACK approach is blocked.
Duration: 1-2 days
Steps¶
- Create AWS IAM OIDC Provider for GKE (Terraform)
  - File: `camarades-infrastructure/terraform/main.tf`
  - Add `aws_iam_openid_connect_provider` trusting the GKE OIDC issuer: `https://container.googleapis.com/v1/projects/camarades-net/zones/europe-west2-a/clusters/camaradesuk`
  - Compute thumbprint from the issuer URL certificate chain
- Create AWS IAM Roles (Terraform)
  - File: `camarades-infrastructure/terraform/lambda/ack-iam.tf`
  - Role `syrf-ack-controllers` (for ACK controllers):
    - Trust policy: Allow `sts:AssumeRoleWithWebIdentity` from the GKE OIDC issuer, scoped to exact service accounts (`ack-lambda-controller`, `ack-s3-controller`) with an `aud: sts.amazonaws.com` constraint
    - Permissions (3 policy statements):
      - ACKLambdaManagement: `lambda:*` scoped to `arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier*` — wildcard because ACK controllers call undocumented APIs (`GetFunctionConcurrency`, `GetFunctionEventInvokeConfig`) that break with explicit action lists
      - ACKS3Management: `s3:*` scoped to `arn:aws:s3:::syrfapp-uploads*` (bucket + objects), plus `s3:ListAllMyBuckets` on `*` (required by the ACK S3 controller for bucket discovery)
      - ACKIAMPassRole: `iam:PassRole` for the Lambda execution roles (`syrfS3NotifierProductionLambdaRole`, `syrfS3NotifierStagingLambdaRole`, `syrfS3NotifierPreviewLambdaRole`)
  - Role `syrf-ack-setup-job` (least-privilege for setup jobs):
    - Trust policy: scoped to `syrf-staging:ack-setup-job`, `syrf-production:ack-setup-job`, and `pr-*:ack-setup-job` (via `StringLike`) with an `aud: sts.amazonaws.com` constraint
    - Permissions: `lambda:GetFunction`, `lambda:GetFunctionConfiguration`, `lambda:UpdateFunctionConfiguration`, `lambda:GetPolicy`, `lambda:AddPermission`, `lambda:RemovePermission` only
- Deploy test pod with projected token (manual, temporary)
  - Create ServiceAccount in `ack-system` with the `eks.amazonaws.com/role-arn` annotation
  - Deploy an `amazon/aws-cli` pod; set `AWS_WEB_IDENTITY_TOKEN_FILE` + `AWS_ROLE_ARN`
  - Test: `aws lambda list-functions --region eu-west-1`
  - Test: `aws s3 ls` (verify S3 access)
- Validate: Pod can call both Lambda and S3 APIs → clean up test pod
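The `syrf-ack-controllers` trust policy can be sketched as follows. This is an assumption-laden illustration: the condition-key prefix is taken to be the issuer URL without the scheme, and the `StringLike` subject pattern is illustrative — the real policy scopes to the exact service accounts listed above:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::318789018510:oidc-provider/container.googleapis.com/v1/projects/camarades-net/zones/europe-west2-a/clusters/camaradesuk"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "container.googleapis.com/v1/projects/camarades-net/zones/europe-west2-a/clusters/camaradesuk:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "container.googleapis.com/v1/projects/camarades-net/zones/europe-west2-a/clusters/camaradesuk:sub": "system:serviceaccount:ack-system:ack-*-controller"
      }
    }
  }]
}
```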
Success gate¶
Pod successfully calls AWS Lambda + S3 APIs. If this fails:
- Try storing AWS creds as K8s Secret instead of OIDC (less elegant but functional)
- Try GCP-to-AWS Workload Identity Federation
- Re-evaluate ACK approach
Phase 1: ACK Controller Installation¶
Goal: Install ACK S3 + Lambda controllers via GitOps, verified working.
Duration: 1 day
Prerequisites¶
- AWS Pod Identity Webhook: Must be installed as a cluster plugin (`pod-identity-webhook`) before the ACK controllers. It mutates pods with `eks.amazonaws.com/role-arn` annotations to inject AWS credentials via projected service account tokens. Without it, ACK controllers and setup jobs cannot authenticate to AWS.
- ArgoCD AppProject config: The plugins project needs `oci://public.ecr.aws/aws-controllers-k8s` and `https://jkroepke.github.io/helm-charts` in `sourceRepos`, and `ack-system` + `pod-identity-webhook` namespace destinations.
- SyRF AppProject config: Staging and production projects need ACK CRD whitelists (`lambda.services.k8s.aws/Function`, `s3.services.k8s.aws/Bucket`, `services.k8s.aws/AdoptedResource`).
- ArgoCD health checks: Custom Lua health checks for Function, Bucket, and AdoptedResource CRDs in `argocd-cm`.
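A sketch of one such health check, assuming the standard ACK status condition type `ACK.ResourceSynced` (Bucket and AdoptedResource would get analogous `resource.customizations.health.<group>_<kind>` entries):

```yaml
# Sketch: custom ArgoCD health check for ACK Function CRDs in argocd-cm.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.lambda.services.k8s.aws_Function: |
    hs = {}
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, condition in ipairs(obj.status.conditions) do
        if condition.type == "ACK.ResourceSynced" then
          if condition.status == "True" then
            hs.status = "Healthy"
            hs.message = condition.message or "Resource synced"
          else
            hs.status = "Progressing"
            hs.message = condition.message or "Waiting for sync"
          end
          return hs
        end
      end
    end
    hs.status = "Progressing"
    hs.message = "Waiting for ACK controller to report status"
    return hs
```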
Steps¶
- Add ACK S3 controller to cluster-gitops plugins
  - New: `cluster-gitops/plugins/helm/ack-s3-controller/config.yaml`
  - New: `cluster-gitops/plugins/helm/ack-s3-controller/values.yaml` — `aws.region: eu-west-1`, ServiceAccount with the AWS role ARN from Phase 0
- Add ACK Lambda controller to cluster-gitops plugins
  - New: `cluster-gitops/plugins/helm/ack-lambda-controller/config.yaml`
  - New: `cluster-gitops/plugins/helm/ack-lambda-controller/values.yaml`
- Verify
  - Both controller pods running in `ack-system`
  - CRDs installed: `buckets.s3.services.k8s.aws`, `functions.lambda.services.k8s.aws`
  - No auth errors in controller logs
  - Smoke test: create test Bucket CRD → verify bucket created in AWS → delete CRD → verify bucket deleted
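The smoke-test manifest can be as small as this (names are illustrative; no `deletion-policy: retain` annotation, since the round-trip being tested is that deleting the CRD deletes the AWS bucket):

```yaml
# Sketch: throwaway Bucket CRD for the Phase 1 smoke test.
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: ack-smoke-test-bucket
  namespace: ack-system
spec:
  name: syrfapp-ack-smoke-test   # must be globally unique; illustrative
```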
Key files (4 new)¶
- `cluster-gitops/plugins/helm/ack-s3-controller/{config,values}.yaml`
- `cluster-gitops/plugins/helm/ack-lambda-controller/{config,values}.yaml`
Phase 2: Helm Chart Development¶
Goal: Create the s3-notifier Helm chart that renders ACK Bucket + Function CRDs and a PostSync Job.
Duration: 1-2 days
Chart structure¶
src/services/s3-notifier/.chart/
├── Chart.yaml # Standalone (no syrf-common dependency)
├── values.yaml
├── templates/
│ ├── _helpers.tpl
│ ├── bucket.yaml # ACK Bucket CRD with notification config (wave 3)
│ ├── function.yaml # ACK Function CRD (no env vars, wave 1)
│ ├── serviceaccount.yaml # SA with eks.amazonaws.com/role-arn annotation
│ ├── permission-job.yaml # Sync hook (wave 2): Lambda invoke permission
│ ├── env-vars-job.yaml # PostSync hook: Lambda env vars from K8s Secret
│ └── adopted-resource.yaml # Conditional: adopts existing prod resources
└── NOTES.txt
bucket.yaml — ACK Bucket CRD with built-in notification¶
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
name: {{ .Values.bucket.name }}
annotations:
services.k8s.aws/deletion-policy: {{ .Values.bucket.deletionPolicy }}
argocd.argoproj.io/sync-wave: "3" # After permission-job (wave 2) — S3 notification needs Lambda permission to exist
{{- if .Values.adoptExisting }}
services.k8s.aws/adopted: "true"
{{- end }}
spec:
name: {{ .Values.bucket.name }}
versioning:
status: {{ if .Values.bucket.versioning }}Enabled{{ else }}Suspended{{ end }}
publicAccessBlock:
blockPublicAcls: true
blockPublicPolicy: true
ignorePublicAcls: true
restrictPublicBuckets: true
notification:
lambdaFunctionConfigurations:
- events:
- s3:ObjectCreated:*
lambdaFunctionARN: {{ include "s3-notifier.lambdaArn" . }}
Key: Notification is declared inline — each bucket points to exactly one Lambda. No aggregation.
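The `s3-notifier.lambdaArn` helper referenced above isn't reproduced in this plan. A plausible `_helpers.tpl` sketch, building the ARN from the chart values (the exact helper bodies are assumptions):

```yaml
{{/* Sketch of _helpers.tpl — builds the Lambda ARN from awsRegion,
     awsAccountId, and the AWS function name (camelCase, env-suffixed). */}}
{{- define "s3-notifier.functionName" -}}
{{- if eq .Values.environmentName "production" -}}
syrfAppUploadS3Notifier
{{- else -}}
syrfAppUploadS3Notifier-{{ .Values.environmentName }}
{{- end -}}
{{- end -}}

{{- define "s3-notifier.lambdaArn" -}}
arn:aws:lambda:{{ .Values.awsRegion }}:{{ .Values.awsAccountId }}:function:{{ include "s3-notifier.functionName" . }}
{{- end -}}
```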
function.yaml — ACK Function CRD¶
apiVersion: lambda.services.k8s.aws/v1alpha1
kind: Function
metadata:
name: {{ include "s3-notifier.k8sName" . }} # DNS-1123 compliant (lowercase)
annotations:
services.k8s.aws/deletion-policy: {{ .Values.lambda.deletionPolicy }}
argocd.argoproj.io/sync-wave: "1" # Before permission-job (wave 2) and Bucket (wave 3)
spec:
name: {{ include "s3-notifier.functionName" . }} # AWS name (camelCase)
runtime: dotnet10
handler: "SyRF.S3FileSavedNotifier.Endpoint::SyRF.S3FileSavedNotifier.Endpoint.S3FileReceivedHandler::HandleEvent"
memorySize: {{ .Values.lambda.memorySize }}
timeout: {{ .Values.lambda.timeout }}
role: {{ .Values.lambda.executionRoleArn }}
code:
s3Bucket: {{ .Values.lambda.code.s3Bucket }}
s3Key: {{ .Values.lambda.code.s3Key }}
# environment.variables intentionally omitted — managed by setup-job (contains secrets)
tags:
Environment: {{ .Values.environmentName }}
ManagedBy: argocd-ack
permission-job.yaml — Sync hook (wave 2) for Lambda invoke permission¶
- ArgoCD Sync hook at wave 2 (runs during sync, before Bucket at wave 3)
- Waits for Lambda to reach Active state (configurable timeout via `setupJob.timeouts.lambdaActive`)
- Grants S3→Lambda invoke permission idempotently with confused-deputy protection (`--source-account`)
- On `ResourceConflictException`, verifies the existing statement matches the current source-arn and source-account; replaces it if stale
- Uses the `amazon/aws-cli` image with Python for JSON parsing (no jq in the image)
- ServiceAccount `ack-setup-job` with the `syrf-ack-setup-job` IAM role (least-privilege)
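The staleness check behind the `ResourceConflictException` handling can be sketched as a small Python helper (the function name and the exact policy shape are illustrative — AWS renders `--source-arn`/`--source-account` as `ArnLike`/`StringEquals` conditions on the statement):

```python
import json

def permission_is_current(policy_json: str, statement_id: str,
                          source_arn: str, source_account: str) -> bool:
    """Return True if the Lambda resource policy already contains a statement
    with the given id whose SourceArn and SourceAccount match — i.e. the
    existing permission is current and need not be replaced."""
    policy = json.loads(policy_json)
    for stmt in policy.get("Statement", []):
        if stmt.get("Sid") != statement_id:
            continue
        cond = stmt.get("Condition", {})
        arn_ok = cond.get("ArnLike", {}).get("AWS:SourceArn") == source_arn
        acct_ok = cond.get("StringEquals", {}).get("AWS:SourceAccount") == source_account
        return arn_ok and acct_ok
    return False

# Example: a statement pointing at a different bucket is stale and gets replaced.
stale = json.dumps({"Statement": [{
    "Sid": "s3-invoke",
    "Condition": {
        "ArnLike": {"AWS:SourceArn": "arn:aws:s3:::old-bucket"},
        "StringEquals": {"AWS:SourceAccount": "318789018510"},
    },
}]})
print(permission_is_current(stale, "s3-invoke",
                            "arn:aws:s3:::syrfapp-uploads-staging", "318789018510"))
# → False (SourceArn differs, so the job removes and re-adds the permission)
```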
env-vars-job.yaml — PostSync hook for Lambda env vars¶
- ArgoCD PostSync hook (runs after all resources synced)
- Pre-checks `RABBITMQ_PASSWORD` availability (from the `rabbit-mq` K8s Secret via `secretKeyRef`)
- Waits for Lambda to accept configuration updates (configurable timeout via `setupJob.timeouts.lambdaConfigReady`)
- Calls `aws lambda update-function-configuration` to set all 4 env vars:
  - `RabbitMqHost=amqp://rabbitmq.camarades.net:5672` (public hostname — Lambda runs outside the cluster)
  - `RabbitMqUsername=rabbit`
  - `RabbitMqPassword=` (from K8s Secret)
  - `S3Region=eu-west-1` (set by Terraform historically, preserved for compatibility)
- NOTE: `--environment` replaces ALL env vars. This is intentional — all 4 vars are declared here as the canonical source of truth
- Uses a Python heredoc for JSON payload construction (special characters in the password)
- Waits for the update to complete; hard-fails on timeout
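The payload construction can be sketched stand-alone — this illustrates why the job uses a Python heredoc rather than shell interpolation (the function name is illustrative, not the actual job script):

```python
import json

def build_environment_payload(host: str, username: str,
                              password: str, s3_region: str) -> str:
    """Build the JSON value for `aws lambda update-function-configuration
    --environment`. json.dumps handles quoting, so passwords containing
    quotes, backslashes, or shell metacharacters survive intact.
    NOTE: --environment replaces ALL variables, so all four are set together."""
    return json.dumps({"Variables": {
        "RabbitMqHost": host,
        "RabbitMqUsername": username,
        "RabbitMqPassword": password,
        "S3Region": s3_region,
    }})

payload = build_environment_payload(
    "amqp://rabbitmq.camarades.net:5672", "rabbit", 'p"a$s\\word', "eu-west-1")
# The awkward password round-trips cleanly through the JSON encoding:
print(json.loads(payload)["RabbitMqPassword"]
      if "RabbitMqPassword" in json.loads(payload)
      else json.loads(payload)["Variables"]["RabbitMqPassword"])
# → p"a$s\word
```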
RabbitMQ secret¶
The rabbit-mq K8s Secret is managed by a ClusterExternalSecret in the plugins project — it targets all service namespaces automatically. No per-chart ExternalSecret is needed.
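A sketch of that ClusterExternalSecret, assuming the External Secrets Operator API — the store name, namespace-selector label, refresh interval, and GCP secret name are illustrative assumptions; only `rabbit-mq` and the `rabbitmq-password` key come from this plan:

```yaml
# Sketch: plugins-project ClusterExternalSecret materialising the rabbit-mq
# Secret in every service namespace.
apiVersion: external-secrets.io/v1beta1
kind: ClusterExternalSecret
metadata:
  name: rabbit-mq
spec:
  namespaceSelector:
    matchLabels:
      syrf.io/environment: "true"    # illustrative label for service namespaces
  externalSecretSpec:
    secretStoreRef:
      kind: ClusterSecretStore
      name: gcp-secret-manager       # illustrative store name
    refreshInterval: 1h
    target:
      name: rabbit-mq
    data:
      - secretKey: rabbitmq-password # matches passwordSecretKey in values.yaml
        remoteRef:
          key: rabbitmq-password     # GCP Secret Manager secret name (assumed)
```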
adopted-resource.yaml (conditional on adoptExisting: true)¶
- Creates an `AdoptedResource` CRD for both Bucket and Function
- Tells ACK to discover and adopt existing AWS resources rather than creating new ones
- Only used during the production cutover (Phase 6)
values.yaml defaults¶
bucket:
name: "" # Required: syrfapp-uploads, syrfapp-uploads-staging, etc.
versioning: true
deletionPolicy: retain # Override to "delete" for previews
tags:
Service: s3-notifier
ManagedBy: ACK
lambda:
memorySize: 512
timeout: 30
deletionPolicy: retain # Override to "delete" for previews
executionRoleArn: "" # Required per environment
code:
s3Bucket: camarades-terraform-state-aws
s3Key: "" # Required: lambda-packages/{env}.zip
awsAccountId: "" # Required for Lambda ARN construction
awsRegion: eu-west-1
rabbitMq:
host: "amqp://rabbitmq.camarades.net:5672"
username: rabbit
passwordSecretName: rabbit-mq
passwordSecretKey: rabbitmq-password # Must match ClusterExternalSecret key
environmentName: "" # staging, production, pr-{N}
adoptExisting: false # true only for production cutover
setupJob:
enabled: true
image: amazon/aws-cli:2.15.0
serviceAccountName: ack-setup-job
iamRoleName: syrf-ack-setup-job # Least-privilege role (not syrf-ack-controllers)
timeouts:
lambdaActive: 120 # seconds to wait for Lambda Active state
lambdaConfigReady: 60 # seconds to wait for LastUpdateStatus
Validation¶
- `helm template` renders valid YAML with no secrets in the output
- Bucket CRD notification references the correct Lambda ARN
- Function CRD has the correct handler (`S3FileReceivedHandler::HandleEvent`) and runtime (`dotnet10`)
- RabbitMQ host is the public hostname (`amqp://rabbitmq.camarades.net:5672`), NOT a cluster-internal address
Key files (8 new in monorepo)¶
- `src/services/s3-notifier/.chart/Chart.yaml`
- `src/services/s3-notifier/.chart/values.yaml`
- `src/services/s3-notifier/.chart/templates/{_helpers,bucket,function,serviceaccount,permission-job,env-vars-job,adopted-resource}.tpl/yaml`
Phase 3: CI/CD Pipeline Updates ✅¶
Goal: Update CI/CD to deploy Lambda code via GitOps (upload zip to S3 + update cluster-gitops) instead of direct aws lambda update-function-code. Must be in place before any ACK-managed environment receives code updates.
Status: Complete
Why before deployment phases¶
Without CI/CD updates, code changes to s3-notifier would still trigger the old pipeline (aws lambda update-function-code directly), conflicting with ACK's management of the Function CRD. Updating CI/CD first ensures all code deployments flow through GitOps from the start.
Key design decision: derive s3Key from image.tag¶
The ApplicationSet passes image.tag (from config.yaml's imageTag) to every service. The chart template defaults lambda.code.s3Key from image.tag:
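A sketch of that default in `function.yaml` — the exact template text isn't reproduced in this plan, but the derived key format (`lambda-packages/s3-notifier-v{image.tag}.zip`) is:

```yaml
# Sketch: s3Key derived from image.tag, with the explicit override still honoured.
code:
  s3Bucket: {{ .Values.lambda.code.s3Bucket }}
  s3Key: {{ .Values.lambda.code.s3Key | default (printf "lambda-packages/s3-notifier-v%s.zip" .Values.image.tag) }}
```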
This means CI/CD only sets chartTag and imageTag in config.yaml (same as Docker services). No explicit s3Key in environment values.
Changes made¶
- Renamed `deploy-lambda` → `package-lambda` in `ci-cd.yml`
  - Removed: Terraform setup, init, plan, apply steps
  - Removed: GitHub App token + infrastructure repo checkout
  - Kept: `dotnet publish` + zip creation, AWS credentials, S3 upload
  - Added: versioned upload (`s3-notifier-v{version}.zip`) alongside backward-compat `production.zip`
- Added s3-notifier to standard promotion flow
  - s3-notifier now appears in `Collect successful services` (promote-to-staging)
  - Flows through the existing `Update service versions` step (generic yq-based config.yaml update)
  - Removed legacy `Update S3 Notifier version in API values` step
  - Removed `s3NotifierVersion` from `syrf/services/api/values.yaml`
- Chart template derives s3Key
  - `values.yaml`: added `image.tag` placeholder
  - `function.yaml`: `s3Key` defaults to `lambda-packages/s3-notifier-v{image.tag}.zip`
  - Explicit `lambda.code.s3Key` override still works (for custom deployments)
- Updated `detect-service-changes.sh`
  - Added chart path: `src/services/s3-notifier/.chart`
  - Updated detection block: both `build` and `retag` trigger packaging (no Docker image to retag)
- Removed explicit s3Key from cluster-gitops environment values
  - Staging and production values no longer specify `lambda.code.s3Key`
  - Chart template derives it from `image.tag` (set by CI/CD promotion)
Backward compatibility¶
- `package-lambda` uploads BOTH `s3-notifier-v{version}.zip` AND `production.zip`
- Terraform-managed production Lambda still reads `production.zip`
- Remove the `production.zip` upload after production cutover (Phase 6)
Key files¶
- `.github/workflows/ci-cd.yml` (renamed job, added promotion, removed legacy handling)
- `.github/scripts/detect-service-changes.sh` (added chart path)
- `src/services/s3-notifier/.chart/values.yaml` (added `image.tag`)
- `src/services/s3-notifier/.chart/templates/function.yaml` (derived s3Key)
Phase 4: Preview Integration ✅¶
Goal: Replace pr-preview-lambda.yml with ACK-managed preview environments. Each PR gets its own isolated S3 bucket + Lambda.
Status: Complete
Changes made¶
- Added s3-notifier to preview ApplicationSet discovery
  - New: `cluster-gitops/syrf/environments/preview/services/s3-notifier/config.yaml`
  - New: `cluster-gitops/syrf/environments/preview/services/s3-notifier/values.yaml` (ephemeral deletion policies)
- Updated `pr-preview.yml` workflow
  - Added s3-notifier to `detect-changes` (outputs + `process_service` call)
  - Added `version-s3-notifier` job (GitVersion)
  - Added `package-lambda` job (dotnet publish → zip → S3 upload with PR-specific key)
  - Added s3-notifier to `write-versions` (per-PR values: bucket.name, environmentName, lambda.code.s3Key)
  - Added per-PR S3 bucket name to API values (`s3.bucketName: syrfapp-uploads-pr-{N}`)
  - Updated `update-pr-status` and `create-tags` needs chains and outputs
- Added AWS cleanup steps to `cleanup-tags` (empty bucket + delete Lambda package from S3)
- Archived `pr-preview-lambda.yml` → `.github/workflows/archived/`
Original steps (for reference)¶
- Add s3-notifier to preview ApplicationSet
  - New: `cluster-gitops/syrf/environments/preview/services/s3-notifier/values.yaml`
  - Modify: `cluster-gitops/argocd/applicationsets/syrf-previews.yaml`
    - Pass per-PR parameters: `bucket.name=syrfapp-uploads-pr-{{.prNumber}}`, `environmentName=pr-{{.prNumber}}`, `lambda.deletionPolicy=delete`, `bucket.deletionPolicy=delete`
- Update pr-preview.yml workflow
  - Add s3-notifier to `detect-changes`
  - Add Lambda build step (moved from `pr-preview-lambda.yml`)
  - Upload zip to S3 with PR-specific key: `lambda-packages/pr-{N}.zip`
  - Write s3-notifier values in the `write-versions` job
  - Set preview API values: `s3.bucketName: syrfapp-uploads-pr-{N}`
- Preview bucket + Lambda have `deletionPolicy: delete` (ephemeral)
- Preview cleanup on PR close
  - ArgoCD deletes Application → ACK deletes Bucket CRD + Function CRD
  - `deletionPolicy: delete` means ACK also deletes the actual AWS resources
  - S3 won't delete a non-empty bucket → add a pre-delete step to empty the bucket first
  - Add to existing cleanup workflow: `aws s3 rm s3://syrfapp-uploads-pr-{N} --recursive`
- Archive `pr-preview-lambda.yml` → `.github/workflows/archived/`
Key files¶
- `.github/workflows/pr-preview.yml` (modify)
- `.github/workflows/pr-preview-lambda.yml` (archive)
- `cluster-gitops/syrf/environments/preview/services/s3-notifier/values.yaml` (new)
- `cluster-gitops/argocd/applicationsets/syrf-previews.yaml` (modify)
Phase 5: Staging Deployment ✅¶
Goal: Deploy a NEW staging S3 bucket + Lambda via ACK. Existing production remains Terraform-managed.
Status: Complete (configuration deployed to cluster-gitops, awaiting ACK controller sync)
Steps¶
- Register s3-notifier as a service in cluster-gitops
  - New: `cluster-gitops/syrf/services/s3-notifier/config.yaml`
  - New: `cluster-gitops/syrf/services/s3-notifier/values.yaml` (base defaults including `awsAccountId`)
- Create staging environment config
  - New: `cluster-gitops/syrf/environments/staging/s3-notifier/config.yaml`
  - New: `cluster-gitops/syrf/environments/staging/s3-notifier/values.yaml`
- Update staging API values to use the new bucket
  - Modify: `cluster-gitops/syrf/environments/staging/api/values.yaml`
  - Same for project-management if it accesses S3 directly
- ArgoCD syncs → ACK creates resources
  - ApplicationSet auto-discovers s3-notifier (glob: `syrf/services/*/config.yaml`)
  - ACK S3 controller creates the `syrfapp-uploads-staging` bucket
  - ACK Lambda controller creates the `syrfAppUploadS3Notifier-staging` function
  - Bucket CRD notification links the two (1:1)
  - Hook Jobs set the Lambda permission and env vars
- Test end-to-end
  - Upload a file via the staging SyRF UI
  - Verify the file lands in `syrfapp-uploads-staging` (not the shared bucket)
  - Verify the Lambda triggers (CloudWatch logs)
  - Verify the RabbitMQ message arrives at staging services
  - ArgoCD shows healthy sync for `s3-notifier-staging`
Risk: ACK reconciliation overwriting env vars¶
After PostSync Job sets env vars, the ACK controller will reconcile the Function CRD. If the CRD omits environment, ACK may either leave existing env vars alone (late initialization) or clear them. Must verify this behavior in staging before proceeding to production. If ACK clears them, mitigation: include non-sensitive env vars (RabbitMqHost, RabbitMqUsername, S3Region) in the CRD spec, and only use the Job for RabbitMqPassword.
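If staging shows ACK clearing Job-set variables, the mitigation would look like this in the Function CRD spec (a sketch of the fallback only — the current chart intentionally omits `environment`):

```yaml
# Sketch: fallback layout if ACK reconciliation clears Job-managed env vars.
# Non-sensitive vars move into the CRD spec; the PostSync Job then manages
# RabbitMqPassword — and must re-send these three values too, since
# --environment replaces ALL variables.
environment:
  variables:
    RabbitMqHost: amqp://rabbitmq.camarades.net:5672
    RabbitMqUsername: rabbit
    S3Region: eu-west-1
```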
Key files¶
- `cluster-gitops/syrf/services/s3-notifier/{config,values}.yaml` (new)
- `cluster-gitops/syrf/environments/staging/s3-notifier/{config,values}.yaml` (new)
- `cluster-gitops/syrf/environments/staging/api/values.yaml` (modify — add `s3.bucketName`)
Phase 6: Production Cutover¶
Goal: Migrate production Lambda + bucket from Terraform to ACK. Zero downtime.
Duration: 1 day (then 1-2 week soak)
Pre-cutover checklist¶
- Staging Lambda running via ACK for 1+ week, no issues
- ACK env var reconciliation behavior confirmed safe (from Phase 5)
- Backup: `aws lambda get-function --function-name syrfAppUploadS3Notifier`
- Backup: `aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads`
- Backup: `aws s3 ls s3://syrfapp-uploads --recursive --summarize > inventory.txt`
- `deletion-policy: retain` confirmed in chart values for production
- Maintenance window communicated
Steps¶
- Create production config with adoption
  - New: `cluster-gitops/syrf/environments/production/s3-notifier/values.yaml`

    bucket:
      name: syrfapp-uploads # Adopt existing
      deletionPolicy: retain
    lambda:
      executionRoleArn: "arn:aws:iam::<ACCOUNT_ID>:role/syrfS3NotifierProductionLambdaRole"
      code:
        s3Key: "lambda-packages/production.zip"
      deletionPolicy: retain
    environmentName: production
    adoptExisting: true # Triggers AdoptedResource CRDs

- Deploy and adopt — push to cluster-gitops, ArgoCD syncs, ACK discovers existing resources
- Verify adoption
  - Bucket creation date unchanged (NOT recreated)
  - `aws s3 ls s3://syrfapp-uploads --recursive --summarize` matches pre-cutover inventory
  - Lambda config unchanged in AWS Console
  - File uploads continue working
- Remove from Terraform state (maintenance window)

    terraform state rm aws_lambda_function.s3_notifier_production
    terraform state rm aws_lambda_permission.s3_invoke_production
    terraform state rm aws_s3_bucket_notification.uploads
    # Preview Lambdas already ACK-managed (Phase 4)

- Remove `adoptExisting: true` — ACK now fully owns production resources
Rollback¶
Re-import into Terraform (terraform import), set service.enabled: false in cluster-gitops.
Phase 7: Terraform Cleanup & Documentation¶
Goal: Remove Terraform Lambda resources (now fully ACK-managed) and document the migration.
Duration: 1 day
Steps¶
- Remove Terraform Lambda resources
  - Gut `camarades-infrastructure/terraform/lambda/main.tf` (remove all Lambda + notification resources)
  - Keep IAM execution roles (shared, rarely change)
  - Keep the `lambda_packages_bucket` variable (still used for zip upload)
- Documentation
  - Update the `CLAUDE.md` S3 Notifier section
  - Create `docs/decisions/ADR-00X-ack-lambda-migration.md`
  - Update this technical plan status to "Completed"
Risk Register¶
| # | Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|---|
| 1 | GKE→AWS OIDC auth doesn't work | Critical | Medium | Phase 0 is entirely a PoC. Fallback: AWS creds as K8s Secret (less elegant but functional) |
| 2 | ACK reconciliation clears env vars set by PostSync Job | High | Medium | Verify in Phase 5 staging. If ACK clears them: put non-sensitive vars in CRD spec, only use Job for password |
| 3 | Production bucket adopted incorrectly (data loss) | Critical | Low | deletion-policy: retain + adopted: true annotation. Backup inventory before cutover. Verify bucket creation date unchanged |
| 4 | S3 bucket name globally unavailable | Low | Low | syrfapp-uploads-staging may already exist. Check availability before Phase 5. Worst case: use alternative naming |
| 5 | Bucket notification → Lambda ordering | Medium | Medium | Bucket CRD references Lambda ARN. If Lambda doesn't exist yet when Bucket syncs, notification fails. Mitigation: ArgoCD sync waves enforce ordering — Function at wave 1, permission-job at wave 2, Bucket at wave 3 |
| 6 | Preview bucket cleanup fails (non-empty bucket) | Medium | Medium | S3 API refuses to delete non-empty buckets. Add aws s3 rm --recursive step before ArgoCD prunes CRDs |
| 7 | CI/CD complexity during transition | Medium | High | Phases 5-6: both Terraform and ACK manage different Lambdas. Document which system owns which Lambda clearly |
| 8 | ACK controller instability | Medium | Low | Pin to stable versions. Test upgrades in staging first. Keep Terraform as rollback for 2 weeks post-production cutover |
| 9 | ACK controller calls undocumented APIs | Medium | High | Use lambda:* / s3:* scoped to resource ARN prefix — explicit action lists break on controller upgrades (see Operational Notes) |
| 10 | Python unavailable in AWS CLI image | Low | Confirmed | amazon/aws-cli:2.15.0 has no standalone python3. Setup jobs search multiple fallback paths including /usr/bin/python2.7. Permission-job skips verification gracefully |
| 11 | IAM policy changes not picked up by ACK | Medium | Confirmed | ACK controllers cache STS sessions. Restart controller deployments after IAM changes (kubectl rollout restart) |
Verification Plan¶
Phase 0¶
# From test pod in GKE:
aws sts get-caller-identity # Confirms cross-cloud auth works
aws lambda list-functions --region eu-west-1
aws s3 ls # Confirms S3 access
Phase 2¶
helm template staging src/services/s3-notifier/.chart/ -f test-staging-values.yaml
# Verify: no secrets in output, valid Bucket + Function CRDs, correct Lambda ARN in notification
Phase 3 ✅¶
# Verified: chart template renders correctly
helm template test src/services/s3-notifier/.chart/ --set image.tag=0.1.3 \
--set awsAccountId=318789018510 --set environmentName=staging \
--set bucket.name=test --set lambda.executionRoleArn=arn:aws:iam::318789018510:role/test
# Result: s3Key: lambda-packages/s3-notifier-v0.1.3.zip ✅
# Explicit override with --set lambda.code.s3Key=custom.zip also works ✅
# After merge: verify package-lambda runs, versioned zip uploaded, promotion PR created
Phase 4 ✅¶
# Preview: create PR with preview label
# Verify: syrfapp-uploads-pr-{N} bucket created, Lambda created, notification linked
# Upload file → Lambda triggers → correct RabbitMQ vhost
# Close PR → verify bucket + Lambda deleted in AWS
aws s3 ls | grep -v syrfapp-uploads-pr-{N} # Should not exist
Phase 5 ✅¶
# ArgoCD
argocd app get s3-notifier-staging
kubectl get bucket,function -n syrf-staging
# AWS — new bucket exists
aws s3 ls | grep syrfapp-uploads-staging
# AWS — Lambda configured correctly
aws lambda get-function-configuration --function-name syrfAppUploadS3Notifier-staging
# AWS — notification links bucket → Lambda
aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads-staging
# End-to-end: upload file via staging UI → Lambda triggers → RabbitMQ message received
Phase 6¶
# Compare pre/post cutover
aws lambda get-function --function-name syrfAppUploadS3Notifier # Config unchanged
aws s3 ls s3://syrfapp-uploads --recursive --summarize # File count unchanged
# Functional: upload file via production SyRF UI, verify end-to-end processing
Appendix: Differences from Original Plan¶
The original technical plan (created 2026-01-15, deprecated) had 12 issues identified during validation against the actual codebase. This updated plan corrects all of them:
| Issue | Original (Wrong) | Corrected |
|---|---|---|
| Lambda handler | `Function::FunctionHandler` | `S3FileReceivedHandler::HandleEvent` |
| Lambda runtime | `dotnet8` | `dotnet10` |
| Lambda env vars | `RABBITMQ_HOST` (uppercase) | `RabbitMqHost` (PascalCase, matching C# code) |
| RabbitMQ host | `amqp://rabbitmq.syrf-{env}.svc.cluster.local:5672` | `amqp://rabbitmq.camarades.net:5672` (public — Lambda runs outside cluster) |
| Missing env vars | Only `RABBITMQ_HOST` | `RabbitMqHost`, `RabbitMqUsername`, `RabbitMqPassword` (3 used by handler; `S3Region` set by Terraform but unused by code) |
| AWS Account ID | `ACCOUNT_ID` placeholders in IAM ARNs | Documented as required value, set per-environment in cluster-gitops |
| Lambda packages bucket | `camarades-lambda-packages` | `camarades-terraform-state-aws` |
| IAM role names | `lambda-s3-notifier-execution-role` | `syrfS3NotifierProductionLambdaRole` / `syrfS3NotifierPreviewLambdaRole` |
| Lambda permission | Terraform-managed | PostSync Job (keeps everything in GitOps) |
| S3 bucket strategy | Shared bucket / values files in chart | Separate buckets per env, values in cluster-gitops only |
| Env var management | Inline in Function CRD spec | PostSync Job (secrets never in git) |
| Bucket notification | Separate from Lambda permission concern | Built into Bucket CRD `spec.notification` (1:1 with Lambda) |
Appendix: ACK CRD Verification¶
Verified via ACK documentation review (2026-01-15).
S3 Controller CRDs¶
| CRD | Status | Notes |
|---|---|---|
| `Bucket` | Supported | Full S3 bucket management with 30+ configuration categories |
| Bucket Notifications | Built-in | Configured via the `spec.notification` field — NOT a separate CRD |
S3 notification config within Bucket CRD:
```yaml
spec:
  notification:
    lambdaFunctionConfigurations:
      - events: ["s3:ObjectCreated:*"]
        lambdaFunctionARN: "arn:aws:lambda:eu-west-1:ACCOUNT_ID:function:syrfAppUploadS3Notifier"
```
Lambda Controller CRDs¶
| CRD | Status | Notes |
|---|---|---|
| `Function` | Supported | Core Lambda function management |
| `Alias` | Supported | Alias with event invoke config |
| `CodeSigningConfig` | Supported | Code signing configuration |
| `EventSourceMapping` | Supported | Kafka, MQ, SQS event sources |
| `FunctionUrlConfig` | Supported | Function URL HTTPS endpoints |
| `LayerVersion` | Supported | Lambda layer management |
| `Version` | Supported | Immutable function versions |
| `Permission` | NOT Supported | Referenced in internal hooks but NOT a top-level CRD |
Critical gap: Lambda Permission (the resource-based policy allowing S3→Lambda invoke) has no ACK CRD. Solved via a PostSync Job calling `aws lambda add-permission`.
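A minimal sketch of such a PostSync hook, assuming the `ack-setup-job` ServiceAccount and production names from this plan; the Job name, statement ID, and hook-delete policy are illustrative choices, not the chart's actual template:

```yaml
# Hypothetical PostSync hook sketch — names and the statement ID are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: lambda-permission-job
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      serviceAccountName: ack-setup-job   # role with lambda:AddPermission / RemovePermission
      restartPolicy: Never
      containers:
        - name: add-permission
          image: amazon/aws-cli:2.15.0
          command:
            - sh
            - -c
            - |
              # Idempotent: drop any stale statement first, then re-add it
              aws lambda remove-permission \
                --function-name syrfAppUploadS3Notifier \
                --statement-id s3-invoke || true
              aws lambda add-permission \
                --function-name syrfAppUploadS3Notifier \
                --statement-id s3-invoke \
                --action lambda:InvokeFunction \
                --principal s3.amazonaws.com \
                --source-arn arn:aws:s3:::syrfapp-uploads
```

The remove-then-add pattern keeps the hook safe to re-run on every sync, which matters because ArgoCD executes PostSync hooks on each successful sync, not just the first.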
Verification commands¶
```shell
# After ACK installation, verify available CRDs:
kubectl get crd | grep s3.services.k8s.aws
# Expected: buckets.s3.services.k8s.aws

kubectl get crd | grep lambda.services.k8s.aws
# Expected: functions, aliases, codesigningconfigs, eventsourcemappings,
#           functionurlconfigs, layerversions, versions
# NOT expected: permissions
```
Appendix: Data Persistence Guarantees¶
Production bucket contains user uploads (PDFs, reference files). Multiple protection layers ensure data safety:
| Layer | Protection Against |
|---|---|
| `services.k8s.aws/deletion-policy: retain` | CRD deletion removing bucket |
| AWS "bucket not empty" check | API-level bucket deletion |
| S3 versioning | Accidental object deletion |
| ACK `AdoptedResource` CRD | Existing resources being recreated |
```yaml
# bucket.yaml — Critical annotations
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: syrfapp-uploads
  annotations:
    services.k8s.aws/deletion-policy: retain  # Never delete bucket when CRD removed
    services.k8s.aws/adopted: "true"          # Adopt existing, don't recreate
spec:
  name: syrfapp-uploads
  versioning:
    status: Enabled  # Object-level recovery
```
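Alternatively, adoption of an existing resource can be expressed with the `AdoptedResource` CRD from the table above; a hypothetical sketch for the production Lambda (the Kubernetes-side Function name is illustrative):

```yaml
# Hypothetical AdoptedResource sketch — the Kubernetes-side name is illustrative.
apiVersion: services.k8s.aws/v1alpha1
kind: AdoptedResource
metadata:
  name: adopt-syrfapp-upload-s3-notifier
spec:
  aws:
    nameOrID: syrfAppUploadS3Notifier      # Existing AWS resource to adopt
  kubernetes:
    group: lambda.services.k8s.aws
    kind: Function
    metadata:
      name: syrfapp-upload-s3-notifier     # Function CRD that ACK will create for it
```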
Appendix: Operational Notes (from Preview Deployment)¶
Lessons learned during the first preview deployment (PR #2328, 2026-02-11). These issues were resolved in the chart and documented here for future reference.
ACK Lambda controller requires ephemeralStorage¶
The ACK Lambda controller calls `UpdateFunctionConfiguration` on every reconciliation. If `ephemeralStorage` is omitted from the Function CRD spec, the API call omits it and AWS returns an error. The chart now includes `ephemeralStorage.size` (defaults to 512 MB in `values.yaml`).
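A Function CRD fragment with the field pinned might look like this (memory and timeout values are illustrative; 512 MB mirrors the AWS default):

```yaml
# Function CRD fragment — always set ephemeralStorage explicitly so that
# the controller's UpdateFunctionConfiguration call carries a complete config.
spec:
  memorySize: 256        # illustrative
  timeout: 30            # illustrative
  ephemeralStorage:
    size: 512            # MB; AWS default, now pinned in values.yaml
```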
ACK controller IAM — use wildcards, not explicit action lists¶
The ACK Lambda controller calls undocumented/internal AWS APIs during reconciliation (e.g. `GetFunctionConcurrency`, `GetFunctionEventInvokeConfig`). An explicit action list broke on controller upgrade. The ACKLambdaManagement policy now uses `lambda:*` scoped to the function ARN prefix. Similarly, the ACK S3 controller requires `s3:ListAllMyBuckets` on `*` for bucket discovery — ACKS3Management uses `s3:*` on the bucket ARN prefix plus the account-level action.
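Sketched as IAM policy JSON, assuming the account ID and ARN prefixes listed later in this plan; the statement IDs are illustrative:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ACKLambdaManagement",
      "Effect": "Allow",
      "Action": "lambda:*",
      "Resource": "arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier*"
    },
    {
      "Sid": "ACKS3BucketScoped",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::syrfapp-uploads",
        "arn:aws:s3:::syrfapp-uploads-*",
        "arn:aws:s3:::syrfapp-uploads/*",
        "arn:aws:s3:::syrfapp-uploads-*/*"
      ]
    },
    {
      "Sid": "ACKS3ListAll",
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "*"
    }
  ]
}
```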
Setup-job trust policy must include preview namespaces¶
The `syrf-ack-setup-job` trust policy originally only allowed the `syrf-staging` and `syrf-production` namespaces. Preview environments use `pr-*` namespaces. The trust policy now uses `StringLike` with `system:serviceaccount:pr-*:ack-setup-job` in addition to the staging/production entries.
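A sketch of the resulting trust policy, with the OIDC provider ARN and condition key left as placeholders. Two separate statements are needed: condition operators inside a single `Condition` block are ANDed, so `StringEquals` and `StringLike` on the same key cannot share one statement:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "StaticNamespaces",
      "Effect": "Allow",
      "Principal": { "Federated": "OIDC_PROVIDER_ARN" },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "OIDC_PROVIDER:sub": [
            "system:serviceaccount:syrf-staging:ack-setup-job",
            "system:serviceaccount:syrf-production:ack-setup-job"
          ]
        }
      }
    },
    {
      "Sid": "PreviewNamespaces",
      "Effect": "Allow",
      "Principal": { "Federated": "OIDC_PROVIDER_ARN" },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "OIDC_PROVIDER:sub": "system:serviceaccount:pr-*:ack-setup-job"
        }
      }
    }
  ]
}
```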
Python is not directly available in amazon/aws-cli:2.15.0¶
The AWS CLI image (Amazon Linux 2) bundles Python inside the `aws` binary (PyInstaller-frozen). There is no standalone `python3` in `$PATH`. However, `python2.7` exists at `/usr/bin/python2.7`. The setup jobs now search multiple paths: `python3`, `python`, `python2.7`, `/usr/bin/python2.7`, and AWS CLI's internal Python paths. The permission-job also gracefully skips policy verification if no Python is found (the permission is already added by that point).
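The search logic can be sketched as a small POSIX shell function (the AWS CLI internal paths mentioned above are elided here; the rest of the candidate list is as documented):

```shell
# Sketch of the interpreter search used by the setup jobs.
find_python() {
  for candidate in python3 python python2.7 /usr/bin/python2.7; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

if py="$(find_python)"; then
  echo "using interpreter: $py"
else
  # permission-job path: the permission is already added by this point,
  # so skipping the verification step is safe
  echo "no python found; skipping policy verification"
fi
```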
ACK controller credential caching after IAM changes¶
After updating IAM policies, ACK controllers continue using cached STS sessions. The controllers must be restarted (`kubectl rollout restart deployment`) to pick up new permissions. This only applies to IAM policy changes — normal CRD operations use the existing session.
ArgoCD retry exhaustion on transient failures¶
If an ACK CRD fails during sync (e.g. due to IAM permission errors), ArgoCD exhausts its 3 retry attempts. After fixing the root cause, the operation state must be cleared manually.
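One way to clear the stuck operation state, sketched with a placeholder Application name (`syrf-preview-2328` is hypothetical, and the exact status path may vary by ArgoCD version):

```shell
# Clear the exhausted operation state on the ArgoCD Application
kubectl patch application syrf-preview-2328 -n argocd --type json \
  -p '[{"op": "remove", "path": "/status/operationState"}]'
```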
Note: this `kubectl patch` targets ArgoCD's own internal state (clearing the operation status), not application resources. It falls under the ArgoCD bootstrap exception in the GitOps policy.
Then trigger a hard refresh in ArgoCD.
AWS IAM Policies — Current State¶
The following IAM policies were updated directly in AWS during the preview deployment. These need to be backported to Terraform (`camarades-infrastructure/terraform/lambda/ack-iam.tf`) before production cutover.
Role: `syrf-ack-controllers`
| Policy | Actions | Resources |
|---|---|---|
| ACKLambdaManagement | `lambda:*` | `arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier*` |
| ACKS3Management | `s3:*` | `arn:aws:s3:::syrfapp-uploads`, `arn:aws:s3:::syrfapp-uploads-*`, `arn:aws:s3:::syrfapp-uploads/*`, `arn:aws:s3:::syrfapp-uploads-*/*` |
| ACKS3Management | `s3:ListAllMyBuckets` | `*` |
| ACKIAMPassRole | `iam:PassRole` | `arn:aws:iam::318789018510:role/syrfS3Notifier*LambdaRole` |
| LambdaPackagesRead | `s3:GetObject`, `s3:ListBucket` | `arn:aws:s3:::camarades-terraform-state-aws`, `arn:aws:s3:::camarades-terraform-state-aws/lambda-packages/*` |
Role: `syrf-ack-setup-job`
| Policy | Actions | Resources |
|---|---|---|
| SetupJobPermissions | `lambda:GetFunction`, `lambda:GetFunctionConfiguration`, `lambda:UpdateFunctionConfiguration`, `lambda:GetPolicy`, `lambda:AddPermission`, `lambda:RemovePermission` | `arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier*` |
| Trust: StringEquals | `sts:AssumeRoleWithWebIdentity` | `system:serviceaccount:syrf-staging:ack-setup-job`, `system:serviceaccount:syrf-production:ack-setup-job` |
| Trust: StringLike | `sts:AssumeRoleWithWebIdentity` | `system:serviceaccount:pr-*:ack-setup-job` |