
Technical Plan: Lambda ACK GitOps Migration

Overview

Migrate the S3 Notifier Lambda (syrfAppUploadS3Notifier) from Terraform/CI-managed deployment to ACK (AWS Controllers for Kubernetes) GitOps management. This brings Lambda and S3 bucket lifecycle management into the same GitOps paradigm as all other SyRF services — ACK controllers in GKE manage Lambda + S3 as Kubernetes CRDs, ArgoCD syncs them, and cluster-gitops is the single source of truth.

Related Documents:


Key Architecture Decisions

1. Separate S3 buckets per environment (full ACK)

Each environment gets its own S3 bucket and Lambda, both managed as ACK CRDs. This aligns with the existing isolation pattern (MongoDB is already per-environment). Each environment is fully self-contained — no shared notification config, no aggregation problem.

| Environment | S3 Bucket | Lambda | Notification |
|---|---|---|---|
| Production | syrfapp-uploads (adopt existing) | syrfAppUploadS3Notifier (adopt existing) | 1:1, in Bucket CRD |
| Staging | syrfapp-uploads-staging (new) | syrfAppUploadS3Notifier-staging (new) | 1:1, in Bucket CRD |
| Preview PR N | syrfapp-uploads-pr-{N} (new, ephemeral) | syrfAppUploadS3Notifier-pr-{N} (new, ephemeral) | 1:1, in Bucket CRD |

App config change: Set s3.bucketName per environment in cluster-gitops Helm values. Most S3 operations already read the bucket name from config (S3Settings.BucketName), but S3FileService.WriteStreamToFile (used by OverwriteAllLines) has a hard-coded BucketName constant set to "syrfapp-uploads" (S3FileService.cs:24, line 112). This must be fixed to use _s3Settings.BucketName to avoid writes going to the production bucket from non-production environments.

2. ACK manages both Bucket + Function CRDs

Since each environment has its own bucket, the Bucket CRD's spec.notification only references one Lambda. No aggregation, no PostSync Job for notifications. Clean declarative management.

  • S3 Bucket: ACK S3 controller (Bucket CRD) — includes notification config
  • Lambda Function: ACK Lambda controller (Function CRD)
  • Lambda Permission: Sync hook Job at wave 2 (ACK Lambda controller lacks a Permission CRD)
  • Lambda Env Vars: PostSync hook Job (Function CRD doesn't support SecretKeyRef for env vars)
  • IAM Execution Roles: Terraform (already exist, rarely change)

3. Setup Jobs for credentials + permissions

ACK's Function CRD environment.variables is a plain string map — no SecretKeyRef support. RabbitMQ password must never appear in git. Lambda Permission has no ACK CRD. Solution: Two ArgoCD hook Jobs:

  • permission-job (Sync hook, wave 2): Grants S3→Lambda invoke permission with --source-account for confused deputy protection. Runs during sync, before Bucket (wave 3) creates the notification.
  • env-vars-job (PostSync hook): Reads RabbitMqPassword from K8s Secret (synced via ClusterExternalSecret from GCP Secret Manager), calls aws lambda update-function-configuration to set all 4 env vars.

Both jobs use the syrf-ack-setup-job IAM role (least-privilege, separate from the syrf-ack-controllers role used by ACK). The Function CRD intentionally omits environment — the env-vars-job owns env var management.

4. GKE → AWS cross-cloud auth via OIDC federation

ACK controllers authenticate to AWS using projected service account tokens + AWS_WEB_IDENTITY_TOKEN_FILE. GKE projects OIDC tokens for pods via Workload Identity. AWS IAM trusts the GKE OIDC issuer. This is proven technology but untested in this project — Phase 0 validates it.
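A sketch of what the resulting IAM trust policy looks like, assuming the OIDC provider is registered under the issuer path shown above (condition key names follow the standard AWS OIDC pattern; verify against the actual provider ARN in Phase 0):

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::318789018510:oidc-provider/container.googleapis.com/v1/projects/camarades-net/zones/europe-west2-a/clusters/camaradesuk"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "container.googleapis.com/v1/projects/camarades-net/zones/europe-west2-a/clusters/camaradesuk:aud": "sts.amazonaws.com",
        "container.googleapis.com/v1/projects/camarades-net/zones/europe-west2-a/clusters/camaradesuk:sub": [
          "system:serviceaccount:ack-system:ack-lambda-controller",
          "system:serviceaccount:ack-system:ack-s3-controller"
        ]
      }
    }
  }]
}
```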


Codebase-Validated Reference

These values were validated against the actual codebase on 2026-02-06:

| Parameter | Correct Value | Source |
|---|---|---|
| Lambda runtime | dotnet10 | camarades-infrastructure/terraform/lambda/main.tf:67 |
| Lambda handler | SyRF.S3FileSavedNotifier.Endpoint::SyRF.S3FileSavedNotifier.Endpoint.S3FileReceivedHandler::HandleEvent | camarades-infrastructure/terraform/lambda/main.tf:68 |
| Lambda env vars | RabbitMqHost, RabbitMqUsername, RabbitMqPassword, S3Region | S3FileReceivedFunction.cs:78-80, Terraform main.tf:82 |
| Lambda env var note | S3Region is set by Terraform but not read by handler code — preserved for compatibility. --environment replaces ALL vars, so all 4 must be set together | env-vars-job.yaml |
| RabbitMQ host | amqp://rabbitmq.camarades.net:5672 (public, NOT cluster-internal, plain AMQP — see Security Note) | camarades-infrastructure/terraform/lambda/variables.tf:45 |
| RabbitMQ username | rabbit | camarades-infrastructure/terraform/lambda/variables.tf:51 |
| RabbitMQ virtual host | From S3 object metadata (metadata["virtualhost"]) | S3FileReceivedFunction.cs:81 |
| S3 bucket name | syrfapp-uploads (production) | camarades-infrastructure/terraform/lambda/variables.tf:57 |
| Lambda packages bucket | camarades-terraform-state-aws | camarades-infrastructure/terraform/lambda/variables.tf:63 |
| Production IAM role | syrfS3NotifierProductionLambdaRole | camarades-infrastructure/terraform/lambda/main.tf:16 |
| Preview IAM role | syrfS3NotifierPreviewLambdaRole | camarades-infrastructure/terraform/lambda/main.tf:97 |
| ApplicationSet glob | syrf/services/*/config.yaml (auto-discovers new services) | cluster-gitops/argocd/applicationsets/syrf.yaml |

Security Note: RabbitMQ Transport

The current Terraform configuration uses plain AMQP (amqp://rabbitmq.camarades.net:5672) for Lambda→RabbitMQ connections. This is existing production behavior, not introduced by this migration. However, it transmits credentials and messages in cleartext over the public internet. Upgrading to TLS (amqps:// on port 5671) is recommended as a separate follow-up but is out of scope for this migration — the goal here is to replicate the existing configuration faithfully under ACK management first.


Phase 0: Proof of Concept — Cross-Cloud Auth

Goal: Prove that a pod in GKE can assume an AWS IAM role and manage Lambda/S3 resources. This is the highest-risk component — if it fails, the ACK approach is blocked.

Duration: 1-2 days

Steps

  1. Create AWS IAM OIDC Provider for GKE (Terraform)
  2. File: camarades-infrastructure/terraform/main.tf
  3. Add aws_iam_openid_connect_provider trusting GKE OIDC issuer: https://container.googleapis.com/v1/projects/camarades-net/zones/europe-west2-a/clusters/camaradesuk
  4. Compute thumbprint from issuer URL certificate chain

  5. Create AWS IAM Roles (Terraform)

  6. File: camarades-infrastructure/terraform/lambda/ack-iam.tf
  7. Role syrf-ack-controllers (for ACK controllers):
    • Trust policy: Allow sts:AssumeRoleWithWebIdentity from GKE OIDC issuer, scoped to exact service accounts (ack-lambda-controller, ack-s3-controller) with aud: sts.amazonaws.com constraint
    • Permissions (3 policy statements):
    • ACKLambdaManagement: lambda:* scoped to arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier* — wildcard because ACK controllers call undocumented APIs (GetFunctionConcurrency, GetFunctionEventInvokeConfig) that break with explicit action lists
    • ACKS3Management: s3:* scoped to arn:aws:s3:::syrfapp-uploads* (bucket + objects), plus s3:ListAllMyBuckets on * (required by ACK S3 controller for bucket discovery)
    • ACKIAMPassRole: iam:PassRole for Lambda execution roles (syrfS3NotifierProductionLambdaRole, syrfS3NotifierStagingLambdaRole, syrfS3NotifierPreviewLambdaRole)
  8. Role syrf-ack-setup-job (least-privilege for setup jobs):

    • Trust policy: scoped to syrf-staging:ack-setup-job, syrf-production:ack-setup-job, and pr-*:ack-setup-job (via StringLike) with aud: sts.amazonaws.com constraint
    • Permissions: lambda:GetFunction, lambda:GetFunctionConfiguration, lambda:UpdateFunctionConfiguration, lambda:GetPolicy, lambda:AddPermission, lambda:RemovePermission only
  9. Deploy test pod with projected token (manual, temporary)

  10. Create ServiceAccount in ack-system with eks.amazonaws.com/role-arn annotation
  11. Deploy amazon/aws-cli pod, set AWS_WEB_IDENTITY_TOKEN_FILE + AWS_ROLE_ARN
  12. Test: aws lambda list-functions --region eu-west-1
  13. Test: aws s3 ls (verify S3 access)

  14. Validate: Pod can call both Lambda and S3 APIs → clean up test pod
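The manual test pod from steps 9-13 can be sketched as follows. Names, the ServiceAccount, and the role ARN are illustrative; the key mechanics are the projected token volume with audience sts.amazonaws.com and the two AWS_* environment variables:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: aws-auth-test
  namespace: ack-system
spec:
  serviceAccountName: aws-auth-test   # annotated with eks.amazonaws.com/role-arn
  containers:
    - name: awscli
      image: amazon/aws-cli
      command: ["sleep", "3600"]
      env:
        - name: AWS_ROLE_ARN
          value: arn:aws:iam::318789018510:role/syrf-ack-controllers
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/tokens/aws-token
      volumeMounts:
        - name: aws-token
          mountPath: /var/run/secrets/tokens
          readOnly: true
  volumes:
    - name: aws-token
      projected:
        sources:
          - serviceAccountToken:
              path: aws-token
              audience: sts.amazonaws.com
              expirationSeconds: 3600
```

With this in place, `kubectl exec` into the pod and run `aws sts get-caller-identity` — the returned ARN should be the assumed syrf-ack-controllers role.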

Success gate

Pod successfully calls AWS Lambda + S3 APIs. If this fails:

  1. Try storing AWS creds as K8s Secret instead of OIDC (less elegant but functional)
  2. Try GCP-to-AWS Workload Identity Federation
  3. Re-evaluate ACK approach

Phase 1: ACK Controller Installation

Goal: Install ACK S3 + Lambda controllers via GitOps, verified working.

Duration: 1 day

Prerequisites

  • AWS Pod Identity Webhook: Must be installed as a cluster plugin (pod-identity-webhook) before ACK controllers. It mutates pods with eks.amazonaws.com/role-arn annotations to inject AWS credentials via projected service account tokens. Without it, ACK controllers and setup jobs cannot authenticate to AWS.
  • ArgoCD AppProject config: The plugins project needs oci://public.ecr.aws/aws-controllers-k8s and https://jkroepke.github.io/helm-charts in sourceRepos, and ack-system + pod-identity-webhook namespace destinations.
  • SyRF AppProject config: Staging and production projects need ACK CRD whitelists (lambda.services.k8s.aws/Function, s3.services.k8s.aws/Bucket, services.k8s.aws/AdoptedResource).
  • ArgoCD health checks: Custom Lua health checks for Function, Bucket, and AdoptedResource CRDs in argocd-cm.

Steps

  1. Add ACK S3 controller to cluster-gitops plugins
  2. New: cluster-gitops/plugins/helm/ack-s3-controller/config.yaml

    plugin:
      name: ack-s3-controller
      repoURL: oci://public.ecr.aws/aws-controllers-k8s
      chart: s3-chart
      version: "1.0.14"  # Pin to stable version
      namespace: ack-system
    
  3. New: cluster-gitops/plugins/helm/ack-s3-controller/values.yaml

    • aws.region: eu-west-1, ServiceAccount with AWS role ARN from Phase 0
  4. Add ACK Lambda controller to cluster-gitops plugins

  5. New: cluster-gitops/plugins/helm/ack-lambda-controller/config.yaml

    plugin:
      name: ack-lambda-controller
      repoURL: oci://public.ecr.aws/aws-controllers-k8s
      chart: lambda-chart
      version: "1.5.2"  # Pin to stable version
      namespace: ack-system
    
  6. New: cluster-gitops/plugins/helm/ack-lambda-controller/values.yaml

  7. Verify

  8. Both controller pods running in ack-system
  9. CRDs installed: buckets.s3.services.k8s.aws, functions.lambda.services.k8s.aws
  10. No auth errors in controller logs
  11. Smoke test: create test Bucket CRD → verify bucket created in AWS → delete CRD → verify bucket deleted
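The smoke test in step 11 can be as simple as applying a throwaway Bucket CRD (the spec.name shown is illustrative — S3 names are global, so pick something unique):

```yaml
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: ack-smoke-test
  namespace: ack-system
spec:
  name: syrf-ack-smoke-test-20260206   # must be globally unique
```

Apply it, confirm the bucket appears in `aws s3 ls`, then delete the CRD and confirm the bucket is removed.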

Key files (4 new)

  • cluster-gitops/plugins/helm/ack-s3-controller/{config,values}.yaml
  • cluster-gitops/plugins/helm/ack-lambda-controller/{config,values}.yaml

Phase 2: Helm Chart Development

Goal: Create the s3-notifier Helm chart that renders ACK Bucket + Function CRDs and a PostSync Job.

Duration: 1-2 days

Chart structure

src/services/s3-notifier/.chart/
├── Chart.yaml                  # Standalone (no syrf-common dependency)
├── values.yaml
├── templates/
│   ├── _helpers.tpl
│   ├── bucket.yaml             # ACK Bucket CRD with notification config (wave 3)
│   ├── function.yaml           # ACK Function CRD (no env vars, wave 1)
│   ├── serviceaccount.yaml     # SA with eks.amazonaws.com/role-arn annotation
│   ├── permission-job.yaml     # Sync hook (wave 2): Lambda invoke permission
│   ├── env-vars-job.yaml       # PostSync hook: Lambda env vars from K8s Secret
│   └── adopted-resource.yaml   # Conditional: adopts existing prod resources
└── NOTES.txt

bucket.yaml — ACK Bucket CRD with built-in notification

apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: {{ .Values.bucket.name }}
  annotations:
    services.k8s.aws/deletion-policy: {{ .Values.bucket.deletionPolicy }}
    argocd.argoproj.io/sync-wave: "3"  # After permission-job (wave 2) — S3 notification needs Lambda permission to exist
    {{- if .Values.adoptExisting }}
    services.k8s.aws/adopted: "true"
    {{- end }}
spec:
  name: {{ .Values.bucket.name }}
  versioning:
    status: {{ if .Values.bucket.versioning }}Enabled{{ else }}Suspended{{ end }}
  publicAccessBlock:
    blockPublicAcls: true
    blockPublicPolicy: true
    ignorePublicAcls: true
    restrictPublicBuckets: true
  notification:
    lambdaFunctionConfigurations:
      - events:
          - s3:ObjectCreated:*
        lambdaFunctionARN: {{ include "s3-notifier.lambdaArn" . }}

Key: Notification is declared inline — each bucket points to exactly one Lambda. No aggregation.

function.yaml — ACK Function CRD

apiVersion: lambda.services.k8s.aws/v1alpha1
kind: Function
metadata:
  name: {{ include "s3-notifier.k8sName" . }}   # DNS-1123 compliant (lowercase)
  annotations:
    services.k8s.aws/deletion-policy: {{ .Values.lambda.deletionPolicy }}
    argocd.argoproj.io/sync-wave: "1"  # Before permission-job (wave 2) and Bucket (wave 3)
spec:
  name: {{ include "s3-notifier.functionName" . }}  # AWS name (camelCase)
  runtime: dotnet10
  handler: "SyRF.S3FileSavedNotifier.Endpoint::SyRF.S3FileSavedNotifier.Endpoint.S3FileReceivedHandler::HandleEvent"
  memorySize: {{ .Values.lambda.memorySize }}
  timeout: {{ .Values.lambda.timeout }}
  role: {{ .Values.lambda.executionRoleArn }}
  code:
    s3Bucket: {{ .Values.lambda.code.s3Bucket }}
    s3Key: {{ .Values.lambda.code.s3Key }}
  # environment.variables intentionally omitted — managed by setup-job (contains secrets)
  tags:
    Environment: {{ .Values.environmentName }}
    ManagedBy: argocd-ack

permission-job.yaml — Sync hook (wave 2) for Lambda invoke permission

  • ArgoCD Sync hook at wave 2 (runs during sync, before Bucket at wave 3)
  • Waits for Lambda to reach Active state (configurable timeout via setupJob.timeouts.lambdaActive)
  • Grants S3→Lambda invoke permission idempotently with confused deputy protection (--source-account)
  • On ResourceConflictException, verifies existing statement matches current source-arn and source-account; replaces if stale
  • Uses amazon/aws-cli image with Python for JSON parsing (no jq in image)
  • ServiceAccount ack-setup-job with syrf-ack-setup-job IAM role (least-privilege)
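A condensed sketch of the Job manifest (hook annotations as described above; the script is abbreviated — the real job also waits for the Lambda to be Active and handles ResourceConflictException as described):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "s3-notifier.k8sName" . }}-permission
  annotations:
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/sync-wave: "2"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      serviceAccountName: {{ .Values.setupJob.serviceAccountName }}
      restartPolicy: Never
      containers:
        - name: add-permission
          image: {{ .Values.setupJob.image }}
          command: ["/bin/sh", "-c"]
          args:
            - |
              aws lambda add-permission \
                --function-name {{ include "s3-notifier.functionName" . }} \
                --statement-id s3-invoke \
                --action lambda:InvokeFunction \
                --principal s3.amazonaws.com \
                --source-arn arn:aws:s3:::{{ .Values.bucket.name }} \
                --source-account {{ .Values.awsAccountId }} \
                --region {{ .Values.awsRegion }}
```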

env-vars-job.yaml — PostSync hook for Lambda env vars

  • ArgoCD PostSync hook (runs after all resources synced)
  • Pre-checks RABBITMQ_PASSWORD availability (from rabbit-mq K8s Secret via secretKeyRef)
  • Waits for Lambda to accept configuration updates (configurable timeout via setupJob.timeouts.lambdaConfigReady)
  • Calls aws lambda update-function-configuration to set all 4 env vars:
  • RabbitMqHost = amqp://rabbitmq.camarades.net:5672 (public hostname — Lambda runs outside cluster)
  • RabbitMqUsername = rabbit
  • RabbitMqPassword = (from K8s Secret)
  • S3Region = eu-west-1 (set by Terraform historically, preserved for compatibility)
  • NOTE: --environment replaces ALL env vars. This is intentional — all 4 vars are declared here as the canonical source of truth
  • Uses Python heredoc for JSON payload construction (special characters in password)
  • Waits for update to complete, hard-fails on timeout
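The Python heredoc exists because interpolating the password into a shell string is unsafe. A minimal Python sketch of the idea (the function name is hypothetical; the real job inlines this):

```python
import json


def build_environment_payload(password):
    """Return the JSON string passed to
    `aws lambda update-function-configuration --environment`.

    json.dumps handles all quoting, so special characters in the
    password cannot break the payload. All 4 variables are included
    because --environment replaces the entire set, not just the
    keys given.
    """
    return json.dumps({
        "Variables": {
            "RabbitMqHost": "amqp://rabbitmq.camarades.net:5672",
            "RabbitMqUsername": "rabbit",
            "RabbitMqPassword": password,
            "S3Region": "eu-west-1",
        }
    })


if __name__ == "__main__":
    # In the real job the password comes from the mounted K8s Secret.
    print(build_environment_payload('p@ss"word with $pecials'))
```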

RabbitMQ secret

The rabbit-mq K8s Secret is managed by a ClusterExternalSecret in the plugins project — it targets all service namespaces automatically. No per-chart ExternalSecret is needed.
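For orientation, a ClusterExternalSecret of this shape would produce the rabbit-mq Secret in matching namespaces — the existing plugin config is authoritative, and the store name, selector label, and remote key here are all assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterExternalSecret
metadata:
  name: rabbit-mq
spec:
  externalSecretName: rabbit-mq
  namespaceSelector:
    matchLabels:
      syrf.camarades.net/service-namespace: "true"   # label assumed
  externalSecretSpec:
    secretStoreRef:
      kind: ClusterSecretStore
      name: gcp-secret-manager                       # store name assumed
    target:
      name: rabbit-mq
    data:
      - secretKey: rabbitmq-password                 # must match passwordSecretKey
        remoteRef:
          key: rabbitmq-password                     # GCP secret name assumed
```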

adopted-resource.yaml (conditional on adoptExisting: true)

  • Creates AdoptedResource CRD for both Bucket and Function
  • Tells ACK to discover and adopt existing AWS resources rather than creating new ones
  • Only used during production cutover (Phase 6)
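For the bucket, the AdoptedResource would look roughly like this (the Function gets an analogous manifest; namespace shown is illustrative):

```yaml
apiVersion: services.k8s.aws/v1alpha1
kind: AdoptedResource
metadata:
  name: adopt-syrfapp-uploads
spec:
  aws:
    nameOrID: syrfapp-uploads        # existing AWS bucket to adopt
  kubernetes:
    group: s3.services.k8s.aws
    kind: Bucket
    metadata:
      name: syrfapp-uploads          # Bucket CRD that ACK will create and bind
      namespace: syrf-production
```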

values.yaml defaults

bucket:
  name: ""               # Required: syrfapp-uploads, syrfapp-uploads-staging, etc.
  versioning: true
  deletionPolicy: retain  # Override to "delete" for previews
  tags:
    Service: s3-notifier
    ManagedBy: ACK

lambda:
  memorySize: 512
  timeout: 30
  deletionPolicy: retain  # Override to "delete" for previews
  executionRoleArn: ""     # Required per environment
  code:
    s3Bucket: camarades-terraform-state-aws
    s3Key: ""              # Required: lambda-packages/{env}.zip

awsAccountId: ""           # Required for Lambda ARN construction
awsRegion: eu-west-1

rabbitMq:
  host: "amqp://rabbitmq.camarades.net:5672"
  username: rabbit
  passwordSecretName: rabbit-mq
  passwordSecretKey: rabbitmq-password  # Must match ClusterExternalSecret key

environmentName: ""         # staging, production, pr-{N}
adoptExisting: false        # true only for production cutover

setupJob:
  enabled: true
  image: amazon/aws-cli:2.15.0
  serviceAccountName: ack-setup-job
  iamRoleName: syrf-ack-setup-job    # Least-privilege role (not syrf-ack-controllers)
  timeouts:
    lambdaActive: 120                # seconds to wait for Lambda Active state
    lambdaConfigReady: 60            # seconds to wait for LastUpdateStatus

Validation

  • helm template renders valid YAML with no secrets in output
  • Bucket CRD notification references correct Lambda ARN
  • Function CRD has correct handler (S3FileReceivedHandler::HandleEvent) and runtime (dotnet10)
  • RabbitMQ host is public hostname (amqp://rabbitmq.camarades.net:5672), NOT cluster-internal address

Key files (8 new in monorepo)

  • src/services/s3-notifier/.chart/Chart.yaml
  • src/services/s3-notifier/.chart/values.yaml
  • src/services/s3-notifier/.chart/templates/_helpers.tpl
  • src/services/s3-notifier/.chart/templates/{bucket,function,serviceaccount,permission-job,env-vars-job,adopted-resource}.yaml

Phase 3: CI/CD Pipeline Updates ✅

Goal: Update CI/CD to deploy Lambda code via GitOps (upload zip to S3 + update cluster-gitops) instead of direct aws lambda update-function-code. Must be in place before any ACK-managed environment receives code updates.

Status: Complete

Why before deployment phases

Without CI/CD updates, code changes to s3-notifier would still trigger the old pipeline (aws lambda update-function-code directly), conflicting with ACK's management of the Function CRD. Updating CI/CD first ensures all code deployments flow through GitOps from the start.

Key design decision: derive s3Key from image.tag

The ApplicationSet passes image.tag (from config.yaml's imageTag) to every service. The chart template defaults lambda.code.s3Key from image.tag:

imageTag: "0.1.3" → image.tag → s3Key: "lambda-packages/s3-notifier-v0.1.3.zip"

This means CI/CD only sets chartTag and imageTag in config.yaml (same as Docker services). No explicit s3Key in environment values.
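In the chart template, the derivation is a one-line default (sketch — the real template in function.yaml may name things differently):

```yaml
# function.yaml (excerpt) — explicit lambda.code.s3Key wins, otherwise derive from image.tag
s3Key: {{ .Values.lambda.code.s3Key | default (printf "lambda-packages/s3-notifier-v%s.zip" .Values.image.tag) }}
```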

Changes made

  1. Renamed deploy-lambdapackage-lambda in ci-cd.yml
  2. Removed: Terraform setup, init, plan, apply steps
  3. Removed: GitHub App token + infrastructure repo checkout
  4. Kept: dotnet publish + zip creation, AWS credentials, S3 upload
  5. Added: versioned upload (s3-notifier-v{version}.zip) alongside backward-compat production.zip

  6. Added s3-notifier to standard promotion flow

  7. s3-notifier now appears in Collect successful services (promote-to-staging)
  8. Flows through existing Update service versions step (generic yq-based config.yaml update)
  9. Removed legacy Update S3 Notifier version in API values step
  10. Removed s3NotifierVersion from syrf/services/api/values.yaml

  11. Chart template derives s3Key

  12. values.yaml: added image.tag placeholder
  13. function.yaml: s3Key defaults to lambda-packages/s3-notifier-v{image.tag}.zip
  14. Explicit lambda.code.s3Key override still works (for custom deployments)

  15. Updated detect-service-changes.sh

  16. Added chart path: src/services/s3-notifier/.chart
  17. Updated detection block: both build and retag trigger packaging (no Docker image to retag)

  18. Removed explicit s3Key from cluster-gitops environment values

  19. Staging and production values no longer specify lambda.code.s3Key
  20. Chart template derives it from image.tag (set by CI/CD promotion)

Backward compatibility

  • package-lambda uploads BOTH s3-notifier-v{version}.zip AND production.zip
  • Terraform-managed production Lambda still reads production.zip
  • Remove production.zip upload after production cutover (Phase 6)

Key files

  • .github/workflows/ci-cd.yml (renamed job, added promotion, removed legacy handling)
  • .github/scripts/detect-service-changes.sh (added chart path)
  • src/services/s3-notifier/.chart/values.yaml (added image.tag)
  • src/services/s3-notifier/.chart/templates/function.yaml (derived s3Key)

Phase 4: Preview Integration ✅

Goal: Replace pr-preview-lambda.yml with ACK-managed preview environments. Each PR gets its own isolated S3 bucket + Lambda.

Status: Complete

Changes made

  1. Added s3-notifier to preview ApplicationSet discovery
  2. New: cluster-gitops/syrf/environments/preview/services/s3-notifier/config.yaml
  3. New: cluster-gitops/syrf/environments/preview/services/s3-notifier/values.yaml (ephemeral deletion policies)

  4. Updated pr-preview.yml workflow

  5. Added s3-notifier to detect-changes (outputs + process_service call)
  6. Added version-s3-notifier job (GitVersion)
  7. Added package-lambda job (dotnet publish → zip → S3 upload with PR-specific key)
  8. Added s3-notifier to write-versions (per-PR values: bucket.name, environmentName, lambda.code.s3Key)
  9. Added per-PR S3 bucket name to API values (s3.bucketName: syrfapp-uploads-pr-{N})
  10. Updated update-pr-status and create-tags needs chains and outputs
  11. Added AWS cleanup steps to cleanup-tags (empty bucket + delete Lambda package from S3)

  12. Archived pr-preview-lambda.yml to .github/workflows/archived/

Original steps (for reference)

  1. Add s3-notifier to preview ApplicationSet
  2. New: cluster-gitops/syrf/environments/preview/services/s3-notifier/values.yaml
  3. Modify: cluster-gitops/argocd/applicationsets/syrf-previews.yaml

    • Pass per-PR parameters: bucket.name=syrfapp-uploads-pr-{{.prNumber}}, environmentName=pr-{{.prNumber}}, lambda.deletionPolicy=delete, bucket.deletionPolicy=delete
  4. Update pr-preview.yml workflow

  5. Add s3-notifier to detect-changes
  6. Add Lambda build step (move from pr-preview-lambda.yml)
  7. Upload zip to S3 with PR-specific key: lambda-packages/pr-{N}.zip
  8. Write s3-notifier values in write-versions job
  9. Set preview API values: s3.bucketName: syrfapp-uploads-pr-{N}
  10. Preview bucket + Lambda have deletionPolicy: delete (ephemeral)

  11. Preview cleanup on PR close

  12. ArgoCD deletes Application → ACK deletes Bucket CRD + Function CRD
  13. deletionPolicy: delete means ACK also deletes the actual AWS resources
  14. S3 won't delete non-empty bucket → add pre-delete step to empty bucket first
  15. Add to existing cleanup workflow: aws s3 rm s3://syrfapp-uploads-pr-{N} --recursive

  16. Archive pr-preview-lambda.yml to .github/workflows/archived/

Key files

  • .github/workflows/pr-preview.yml (modify)
  • .github/workflows/pr-preview-lambda.yml (archive)
  • cluster-gitops/syrf/environments/preview/services/s3-notifier/values.yaml (new)
  • cluster-gitops/argocd/applicationsets/syrf-previews.yaml (modify)

Phase 5: Staging Deployment ✅

Goal: Deploy a NEW staging S3 bucket + Lambda via ACK. Existing production remains Terraform-managed.

Status: Complete (configuration deployed to cluster-gitops, awaiting ACK controller sync)

Steps

  1. Register s3-notifier as a service in cluster-gitops
  2. New: cluster-gitops/syrf/services/s3-notifier/config.yaml

    serviceName: s3-notifier
    service:
      chartPath: src/services/s3-notifier/.chart
      chartRepo: https://github.com/camaradesuk/syrf
    
  3. New: cluster-gitops/syrf/services/s3-notifier/values.yaml (base defaults including awsAccountId)

  4. Create staging environment config

  5. New: cluster-gitops/syrf/environments/staging/s3-notifier/config.yaml

    serviceName: s3-notifier
    envName: staging
    service:
      enabled: true
      chartTag: main
      imageTag: "1.0.0"
    
  6. New: cluster-gitops/syrf/environments/staging/s3-notifier/values.yaml

    bucket:
      name: syrfapp-uploads-staging
    lambda:
      executionRoleArn: "arn:aws:iam::318789018510:role/syrfS3NotifierStagingLambdaRole"
      code:
        s3Key: "lambda-packages/production.zip"
    environmentName: staging
    
  7. Update staging API values to use new bucket

  8. Modify: cluster-gitops/syrf/environments/staging/api/values.yaml

    s3:
      bucketName: syrfapp-uploads-staging
    
  9. Same for project-management if it accesses S3 directly

  10. ArgoCD syncs → ACK creates resources

  11. ApplicationSet auto-discovers s3-notifier (glob: syrf/services/*/config.yaml)
  12. ACK S3 controller creates syrfapp-uploads-staging bucket
  13. ACK Lambda controller creates syrfAppUploadS3Notifier-staging function
  14. Bucket CRD notification links the two (1:1)
  15. Hook Jobs set the Lambda invoke permission (Sync, wave 2) and env vars (PostSync)

  16. Test end-to-end

  17. Upload file via staging SyRF UI
  18. Verify file lands in syrfapp-uploads-staging (not the shared bucket)
  19. Verify Lambda triggers (CloudWatch logs)
  20. Verify RabbitMQ message arrives at staging services
  21. ArgoCD shows healthy sync for s3-notifier-staging

Risk: ACK reconciliation overwriting env vars

After PostSync Job sets env vars, the ACK controller will reconcile the Function CRD. If the CRD omits environment, ACK may either leave existing env vars alone (late initialization) or clear them. Must verify this behavior in staging before proceeding to production. If ACK clears them, mitigation: include non-sensitive env vars (RabbitMqHost, RabbitMqUsername, S3Region) in the CRD spec, and only use the Job for RabbitMqPassword.
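If that mitigation is needed, the Function CRD would carry only the non-sensitive variables (sketch; values keys match the chart defaults above):

```yaml
# function.yaml (excerpt) — only if ACK is found to clear Job-set env vars
spec:
  environment:
    variables:
      RabbitMqHost: {{ .Values.rabbitMq.host }}
      RabbitMqUsername: {{ .Values.rabbitMq.username }}
      S3Region: {{ .Values.awsRegion }}
      # RabbitMqPassword intentionally absent — set by env-vars-job from the K8s Secret
```

Note that this still leaves the controller and the Job contending for the password key, which is exactly what the staging verification must confirm is safe.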

Key files

  • cluster-gitops/syrf/services/s3-notifier/{config,values}.yaml (new)
  • cluster-gitops/syrf/environments/staging/s3-notifier/{config,values}.yaml (new)
  • cluster-gitops/syrf/environments/staging/api/values.yaml (modify — add s3.bucketName)

Phase 6: Production Cutover

Goal: Migrate production Lambda + bucket from Terraform to ACK. Zero downtime.

Duration: 1 day (then 1-2 week soak)

Pre-cutover checklist

  • Staging Lambda running via ACK for 1+ week, no issues
  • ACK env var reconciliation behavior confirmed safe (from Phase 5)
  • Backup: aws lambda get-function --function-name syrfAppUploadS3Notifier
  • Backup: aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads
  • Backup: aws s3 ls s3://syrfapp-uploads --recursive --summarize > inventory.txt
  • deletion-policy: retain confirmed in chart values for production
  • Maintenance window communicated

Steps

  1. Create production config with adoption
  2. New: cluster-gitops/syrf/environments/production/s3-notifier/values.yaml

    bucket:
      name: syrfapp-uploads          # Adopt existing
      deletionPolicy: retain
    lambda:
      executionRoleArn: "arn:aws:iam::<ACCOUNT_ID>:role/syrfS3NotifierProductionLambdaRole"
      code:
        s3Key: "lambda-packages/production.zip"
      deletionPolicy: retain
    environmentName: production
    adoptExisting: true              # Triggers AdoptedResource CRDs
    
  3. Deploy and adopt — push to cluster-gitops, ArgoCD syncs, ACK discovers existing resources

  4. Verify adoption

  5. Bucket creation date unchanged (NOT recreated)
  6. aws s3 ls s3://syrfapp-uploads --recursive --summarize matches pre-cutover inventory
  7. Lambda config unchanged in AWS Console
  8. File uploads continue working

  9. Remove from Terraform state (maintenance window)

terraform state rm aws_lambda_function.s3_notifier_production
terraform state rm aws_lambda_permission.s3_invoke_production
terraform state rm aws_s3_bucket_notification.uploads  # Preview Lambdas already ACK-managed (Phase 4)
  10. Remove adoptExisting: true — ACK now fully owns production resources

Rollback

Re-import into Terraform (terraform import), set service.enabled: false in cluster-gitops.


Phase 7: Terraform Cleanup & Documentation

Goal: Remove Terraform Lambda resources (now fully ACK-managed) and document the migration.

Duration: 1 day

Steps

  1. Remove Terraform Lambda resources
  2. Gut camarades-infrastructure/terraform/lambda/main.tf (remove all Lambda + notification resources)
  3. Keep IAM execution roles (shared, rarely change)
  4. Keep lambda_packages_bucket variable (still used for zip upload)

  5. Documentation

  6. Update CLAUDE.md S3 Notifier section
  7. Create docs/decisions/ADR-00X-ack-lambda-migration.md
  8. Update this technical plan status to "Completed"

Risk Register

| # | Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|---|
| 1 | GKE→AWS OIDC auth doesn't work | Critical | Medium | Phase 0 is a dedicated PoC. Fallback: store AWS creds as a K8s Secret (less elegant but functional) |
| 2 | ACK reconciliation clears env vars set by PostSync Job | High | Medium | Verify in Phase 5 staging. If ACK clears them: put non-sensitive vars in the CRD spec, use the Job only for the password |
| 3 | Production bucket adopted incorrectly (data loss) | Critical | Low | deletion-policy: retain + adopted: true annotation. Backup inventory before cutover. Verify bucket creation date unchanged |
| 4 | S3 bucket name globally unavailable | Low | Low | syrfapp-uploads-staging may already exist. Check availability before Phase 5. Worst case: use alternative naming |
| 5 | Bucket notification → Lambda ordering | Medium | Medium | Bucket CRD references the Lambda ARN; if the Lambda doesn't exist yet when the Bucket syncs, the notification fails. ArgoCD sync waves enforce ordering — Function at wave 1, permission-job at wave 2, Bucket at wave 3 |
| 6 | Preview bucket cleanup fails (non-empty bucket) | Medium | Medium | S3 refuses to delete non-empty buckets. Add an aws s3 rm --recursive step before ArgoCD prunes the CRDs |
| 7 | CI/CD complexity during transition | Medium | High | Phases 5-6: Terraform and ACK manage different Lambdas. Document clearly which system owns which Lambda |
| 8 | ACK controller instability | Medium | Low | Pin to stable versions. Test upgrades in staging first. Keep Terraform as rollback for 2 weeks post-production cutover |
| 9 | ACK controller calls undocumented APIs | Medium | High | Use lambda:* / s3:* scoped to resource ARN prefix — explicit action lists break on controller upgrades (see Operational Notes) |
| 10 | Python unavailable in AWS CLI image | Low | Confirmed | amazon/aws-cli:2.15.0 has no standalone python3. Setup jobs search multiple fallback paths including /usr/bin/python2.7. Permission-job skips verification gracefully |
| 11 | IAM policy changes not picked up by ACK | Medium | Confirmed | ACK controllers cache STS sessions. Restart controller deployments after IAM changes (kubectl rollout restart) |

Verification Plan

Phase 0

# From test pod in GKE:
aws sts get-caller-identity          # Confirms cross-cloud auth works
aws lambda list-functions --region eu-west-1
aws s3 ls                            # Confirms S3 access

Phase 2

helm template staging src/services/s3-notifier/.chart/ -f test-staging-values.yaml
# Verify: no secrets in output, valid Bucket + Function CRDs, correct Lambda ARN in notification

Phase 3 ✅

# Verified: chart template renders correctly
helm template test src/services/s3-notifier/.chart/ --set image.tag=0.1.3 \
  --set awsAccountId=318789018510 --set environmentName=staging \
  --set bucket.name=test --set lambda.executionRoleArn=arn:aws:iam::318789018510:role/test
# Result: s3Key: lambda-packages/s3-notifier-v0.1.3.zip ✅
# Explicit override with --set lambda.code.s3Key=custom.zip also works ✅
# After merge: verify package-lambda runs, versioned zip uploaded, promotion PR created

Phase 4 ✅

# Preview: create PR with preview label
# Verify: syrfapp-uploads-pr-{N} bucket created, Lambda created, notification linked
# Upload file → Lambda triggers → correct RabbitMQ vhost

# Close PR → verify bucket + Lambda deleted in AWS
aws s3 ls | grep -v syrfapp-uploads-pr-{N}  # Should not exist

Phase 5 ✅

# ArgoCD
argocd app get s3-notifier-staging
kubectl get bucket,function -n syrf-staging

# AWS — new bucket exists
aws s3 ls | grep syrfapp-uploads-staging

# AWS — Lambda configured correctly
aws lambda get-function-configuration --function-name syrfAppUploadS3Notifier-staging

# AWS — notification links bucket → Lambda
aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads-staging

# End-to-end: upload file via staging UI → Lambda triggers → RabbitMQ message received

Phase 6

# Compare pre/post cutover
aws lambda get-function --function-name syrfAppUploadS3Notifier  # Config unchanged
aws s3 ls s3://syrfapp-uploads --recursive --summarize            # File count unchanged

# Functional: upload file via production SyRF UI, verify end-to-end processing

Appendix: Differences from Original Plan

The original technical plan (created 2026-01-15, deprecated) had 12 issues identified during validation against the actual codebase. This updated plan corrects all of them:

| Issue | Original (Wrong) | Corrected |
|---|---|---|
| Lambda handler | `Function::FunctionHandler` | `S3FileReceivedHandler::HandleEvent` |
| Lambda runtime | `dotnet8` | `dotnet10` |
| Lambda env vars | `RABBITMQ_HOST` (uppercase) | `RabbitMqHost` (PascalCase, matching C# code) |
| RabbitMQ host | `amqp://rabbitmq.syrf-{env}.svc.cluster.local:5672` | `amqp://rabbitmq.camarades.net:5672` (public — Lambda runs outside cluster) |
| Missing env vars | Only `RABBITMQ_HOST` | `RabbitMqHost`, `RabbitMqUsername`, `RabbitMqPassword` (3 used by handler; `S3Region` set by Terraform but unused by code) |
| AWS Account ID | `ACCOUNT_ID` placeholders in IAM ARNs | Documented as required value, set per-environment in cluster-gitops |
| Lambda packages bucket | `camarades-lambda-packages` | `camarades-terraform-state-aws` |
| IAM role names | `lambda-s3-notifier-execution-role` | `syrfS3NotifierProductionLambdaRole` / `syrfS3NotifierPreviewLambdaRole` |
| Lambda permission | Terraform-managed | PostSync Job (keeps everything in GitOps) |
| S3 bucket strategy | Shared bucket / values files in chart | Separate buckets per env, values in cluster-gitops only |
| Env var management | Inline in Function CRD spec | PostSync Job (secrets never in git) |
| Bucket notification | Separate from Lambda permission concern | Built into Bucket CRD `spec.notification` (1:1 with Lambda) |

Appendix: ACK CRD Verification

Verified via ACK documentation review (2026-01-15).

S3 Controller CRDs

| CRD | Status | Notes |
|---|---|---|
| Bucket | Supported | Full S3 bucket management with 30+ configuration categories |
| Bucket Notifications | Built-in | Configured via `spec.notification` field — NOT a separate CRD |

S3 notification config within Bucket CRD:

spec:
  notification:
    lambdaFunctionConfigurations:
      - events: ["s3:ObjectCreated:*"]
        lambdaFunctionARN: "arn:aws:lambda:eu-west-1:ACCOUNT_ID:function:syrfAppUploadS3Notifier"

Lambda Controller CRDs

| CRD | Status | Notes |
|---|---|---|
| Function | Supported | Core Lambda function management |
| Alias | Supported | Alias with event invoke config |
| CodeSigningConfig | Supported | Code signing configuration |
| EventSourceMapping | Supported | Kafka, MQ, SQS event sources |
| FunctionUrlConfig | Supported | Function URL HTTPS endpoints |
| LayerVersion | Supported | Lambda layer management |
| Version | Supported | Immutable function versions |
| Permission | NOT Supported | Referenced in internal hooks but NOT a top-level CRD |

Critical gap: Lambda Permission (resource-based policy for S3→Lambda invoke) has no ACK CRD. Solved via PostSync Job calling aws lambda add-permission.
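A minimal sketch of such a PostSync Job, assuming the ArgoCD hook annotations and the `ack-setup-job` service account described in this plan; the job name, image tag, statement ID, and env values are illustrative, not the chart's actual manifest:

```yaml
# permission-job.yaml — hypothetical PostSync hook that grants S3 invoke rights.
apiVersion: batch/v1
kind: Job
metadata:
  name: s3-notifier-permission-job        # illustrative name
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      serviceAccountName: ack-setup-job
      restartPolicy: Never
      containers:
        - name: add-permission
          image: amazon/aws-cli:2.15.0
          command:
            - /bin/sh
            - -c
            - |
              # Idempotent: drop any stale statement, then re-add it
              aws lambda remove-permission \
                --function-name "$FUNCTION_NAME" \
                --statement-id AllowS3Invoke 2>/dev/null || true
              aws lambda add-permission \
                --function-name "$FUNCTION_NAME" \
                --statement-id AllowS3Invoke \
                --action lambda:InvokeFunction \
                --principal s3.amazonaws.com \
                --source-arn "arn:aws:s3:::$BUCKET_NAME"
          env:
            - name: FUNCTION_NAME
              value: syrfAppUploadS3Notifier-staging
            - name: BUCKET_NAME
              value: syrfapp-uploads-staging
```

The `remove-permission || true` step keeps the job re-runnable, since `add-permission` fails if the statement ID already exists.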

Verification commands

# After ACK installation, verify available CRDs:
kubectl get crd | grep s3.services.k8s.aws
# Expected: buckets.s3.services.k8s.aws

kubectl get crd | grep lambda.services.k8s.aws
# Expected: functions, aliases, codesigningconfigs, eventsourcemappings,
#           functionurlconfigs, layerversions, versions
# NOT expected: permissions

Appendix: Data Persistence Guarantees

Production bucket contains user uploads (PDFs, reference files). Multiple protection layers ensure data safety:

| Layer | Protection Against |
|---|---|
| `services.k8s.aws/deletion-policy: retain` | CRD deletion removing bucket |
| AWS "bucket not empty" check | API-level bucket deletion |
| S3 versioning | Accidental object deletion |
| ACK AdoptedResource CRD | Existing resources being recreated |
# bucket.yaml — Critical annotations
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: syrfapp-uploads
  annotations:
    services.k8s.aws/deletion-policy: retain    # Never delete bucket when CRD removed
    services.k8s.aws/adopted: "true"            # Adopt existing, don't recreate
spec:
  name: syrfapp-uploads
  versioning:
    status: Enabled                              # Object-level recovery
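For adopting the existing production bucket, the generic ACK AdoptedResource shape looks roughly like this (a sketch; the field layout should be verified against the installed controller version):

```yaml
# adopted-bucket.yaml — hypothetical adoption manifest for the existing bucket
apiVersion: services.k8s.aws/v1alpha1
kind: AdoptedResource
metadata:
  name: adopt-syrfapp-uploads             # illustrative name
spec:
  aws:
    nameOrID: syrfapp-uploads             # existing AWS bucket to adopt
  kubernetes:
    group: s3.services.k8s.aws
    kind: Bucket
    metadata:
      name: syrfapp-uploads               # target Bucket CRD name
```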

Appendix: Operational Notes (from Preview Deployment)

Lessons learned during the first preview deployment (PR #2328, 2026-02-11). These issues were resolved in the chart and documented here for future reference.

ACK Lambda controller requires ephemeralStorage

The ACK Lambda controller calls UpdateFunctionConfiguration on every reconciliation. If ephemeralStorage is omitted from the Function CRD spec, the API call omits it and AWS returns an error. The chart now includes ephemeralStorage.size (defaults to 512 MB in values.yaml).
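In Function CRD terms, the relevant fragment looks like this (a sketch; the Kubernetes metadata name is illustrative, and the 512 MB default matches the chart's `values.yaml`):

```yaml
# function.yaml (fragment) — ephemeralStorage must always be present
apiVersion: lambda.services.k8s.aws/v1alpha1
kind: Function
metadata:
  name: s3-notifier                       # illustrative Kubernetes name
spec:
  name: syrfAppUploadS3Notifier-staging
  ephemeralStorage:
    size: 512   # MB; omitting this field breaks every reconciliation cycle
```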

ACK controller IAM — use wildcards, not explicit action lists

The ACK Lambda controller calls undocumented/internal AWS APIs during reconciliation (e.g. GetFunctionConcurrency, GetFunctionEventInvokeConfig). An explicit action list broke on controller upgrade. The ACKLambdaManagement policy now uses lambda:* scoped to the function ARN prefix. Similarly, the ACK S3 controller requires s3:ListAllMyBuckets on * for bucket discovery — ACKS3Management uses s3:* on the bucket ARN prefix plus the account-level action.
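Sketched as one policy document (in AWS these live as separate inline policies on the controller role; the statement IDs are illustrative, the actions and ARNs are taken from the IAM policy tables in this plan):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ACKLambdaManagement",
      "Effect": "Allow",
      "Action": "lambda:*",
      "Resource": "arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier*"
    },
    {
      "Sid": "ACKS3BucketScoped",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::syrfapp-uploads",
        "arn:aws:s3:::syrfapp-uploads-*",
        "arn:aws:s3:::syrfapp-uploads/*",
        "arn:aws:s3:::syrfapp-uploads-*/*"
      ]
    },
    {
      "Sid": "ACKS3Discovery",
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "*"
    }
  ]
}
```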

Setup-job trust policy must include preview namespaces

The syrf-ack-setup-job trust policy originally only allowed syrf-staging and syrf-production namespaces. Preview environments use pr-* namespaces. The trust policy now uses StringLike with system:serviceaccount:pr-*:ack-setup-job in addition to the staging/production entries.
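The resulting trust policy shape, sketched with `OIDC_PROVIDER` standing in for the cluster's registered OIDC identity provider (exact provider ARN elided). Two statements are used because `StringEquals` and `StringLike` conditions in a single statement would be ANDed together:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Federated": "arn:aws:iam::318789018510:oidc-provider/OIDC_PROVIDER" },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "OIDC_PROVIDER:sub": [
            "system:serviceaccount:syrf-staging:ack-setup-job",
            "system:serviceaccount:syrf-production:ack-setup-job"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": { "Federated": "arn:aws:iam::318789018510:oidc-provider/OIDC_PROVIDER" },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "OIDC_PROVIDER:sub": "system:serviceaccount:pr-*:ack-setup-job"
        }
      }
    }
  ]
}
```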

Python is not directly available in amazon/aws-cli:2.15.0

The AWS CLI image (Amazon Linux 2) bundles Python inside the aws binary (PyInstaller-frozen). There is no standalone python3 in $PATH. However, python2.7 exists at /usr/bin/python2.7. The setup jobs now search multiple paths: python3, python, python2.7, /usr/bin/python2.7, and AWS CLI's internal Python paths. The permission-job also gracefully skips policy verification if no Python is found (the permission is already added by that point).
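The interpreter search can be sketched as a small POSIX-sh helper (the candidate list comes from this document; the helper name and the graceful-skip message are illustrative, and the AWS-CLI-internal Python paths are omitted since they vary by image version):

```shell
# find_first_cmd: print the first argument that resolves to an executable.
find_first_cmd() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Candidate interpreters, in preference order.
if PY=$(find_first_cmd python3 python python2.7 /usr/bin/python2.7); then
  echo "using $PY for policy verification"
else
  # The permission is already added by this point, so skipping is safe.
  echo "no usable Python found; skipping policy verification"
fi
```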

ACK controller credential caching after IAM changes

After updating IAM policies, ACK controllers continue using cached STS sessions. The controllers must be restarted (kubectl rollout restart deployment) to pick up new permissions. This only applies to IAM policy changes — normal CRD operations use the existing session.

ArgoCD retry exhaustion on transient failures

If an ACK CRD fails during sync (e.g. due to IAM permission errors), ArgoCD exhausts its 3 retry attempts. After fixing the root cause, the operation state must be cleared manually.

Note: This kubectl patch targets ArgoCD's own internal state (clearing operation status), not application resources. This falls under the ArgoCD bootstrap exception in the GitOps policy.

kubectl patch application <app> -n argocd --type merge \
  -p '{"status":{"operationState":null}}'

Then trigger a hard refresh in ArgoCD.

AWS IAM Policies — Current State

The following IAM policies were updated directly in AWS during the preview deployment. These need to be backported to Terraform (camarades-infrastructure/terraform/lambda/ack-iam.tf) before production cutover.

Role: syrf-ack-controllers

| Policy | Actions | Resources |
|---|---|---|
| ACKLambdaManagement | `lambda:*` | `arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier*` |
| ACKS3Management | `s3:*` | `arn:aws:s3:::syrfapp-uploads`, `arn:aws:s3:::syrfapp-uploads-*`, `arn:aws:s3:::syrfapp-uploads/*`, `arn:aws:s3:::syrfapp-uploads-*/*` |
| ACKS3Management | `s3:ListAllMyBuckets` | `*` |
| ACKIAMPassRole | `iam:PassRole` | `arn:aws:iam::318789018510:role/syrfS3Notifier*LambdaRole` |
| LambdaPackagesRead | `s3:GetObject`, `s3:ListBucket` | `arn:aws:s3:::camarades-terraform-state-aws`, `arn:aws:s3:::camarades-terraform-state-aws/lambda-packages/*` |

Role: syrf-ack-setup-job

| Policy | Actions | Resources |
|---|---|---|
| SetupJobPermissions | `lambda:GetFunction`, `lambda:GetFunctionConfiguration`, `lambda:UpdateFunctionConfiguration`, `lambda:GetPolicy`, `lambda:AddPermission`, `lambda:RemovePermission` | `arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier*` |
| Trust: `StringEquals` | `sts:AssumeRoleWithWebIdentity` | `system:serviceaccount:syrf-staging:ack-setup-job`, `system:serviceaccount:syrf-production:ack-setup-job` |
| Trust: `StringLike` | `sts:AssumeRoleWithWebIdentity` | `system:serviceaccount:pr-*:ack-setup-job` |

References