
Technical Plan: Lambda ACK GitOps Migration

Overview

Migrate the S3 Notifier Lambda (syrfAppUploadS3Notifier) from Terraform/CI-managed deployment to ACK (AWS Controllers for Kubernetes) GitOps management. This brings Lambda and S3 bucket lifecycle management into the same GitOps paradigm as all other SyRF services — ACK controllers in GKE manage Lambda + S3 as Kubernetes CRDs, ArgoCD syncs them, and cluster-gitops is the single source of truth.

Related Documents:


Key Architecture Decisions

1. Separate S3 buckets per environment (full ACK)

Each environment gets its own S3 bucket and Lambda, both managed as ACK CRDs. This aligns with the existing isolation pattern (MongoDB is already per-environment). Each environment is fully self-contained — no shared notification config, no aggregation problem.

| Environment | S3 Bucket | Lambda | Notification |
|---|---|---|---|
| Production | syrfapp-uploads (adopt existing) | syrfAppUploadS3Notifier (adopt existing) | 1:1, in Bucket CRD |
| Staging | syrfapp-uploads-staging (new) | syrfAppUploadS3Notifier-staging (new) | 1:1, in Bucket CRD |
| Preview PR N | syrfapp-uploads-pr-{N} (new, ephemeral) | syrfAppUploadS3Notifier-pr-{N} (new, ephemeral) | 1:1, in Bucket CRD |

App config change: Set s3.bucketName per environment in cluster-gitops Helm values. Most S3 operations already read the bucket name from config (S3Settings.BucketName), but S3FileService.WriteStreamToFile (used by OverwriteAllLines) has a hard-coded BucketName constant set to "syrfapp-uploads" (S3FileService.cs:24, line 112). This must be fixed to use _s3Settings.BucketName to avoid writes going to the production bucket from non-production environments.

2. ACK manages both Bucket + Function CRDs

Since each environment has its own bucket, the Bucket CRD's spec.notification only references one Lambda. No aggregation, no PostSync Job for notifications. Clean declarative management.

  • S3 Bucket: ACK S3 controller (Bucket CRD) — includes notification config
  • Lambda Function: ACK Lambda controller (Function CRD)
  • Lambda Permission: Sync hook Job at wave 2 (ACK Lambda controller lacks a Permission CRD)
  • Lambda Env Vars: PostSync hook Job (Function CRD doesn't support SecretKeyRef for env vars)
  • IAM Execution Roles: Terraform (already exist, rarely change)

3. Setup Jobs for credentials + permissions

ACK's Function CRD environment.variables is a plain string map — no SecretKeyRef support. RabbitMQ password must never appear in git. Lambda Permission has no ACK CRD. Solution: Two ArgoCD hook Jobs:

  • permission-job (Sync hook, wave 2): Grants S3→Lambda invoke permission with --source-account for confused deputy protection. Runs during sync, before Bucket (wave 3) creates the notification.
  • env-vars-job (PostSync hook): Reads RabbitMqPassword from K8s Secret (synced via ClusterExternalSecret from GCP Secret Manager), calls aws lambda update-function-configuration to set all 4 env vars.

Both jobs use the syrf-ack-setup-job IAM role (least-privilege, separate from the syrf-ack-controllers role used by ACK). The Function CRD intentionally omits environment — the env-vars-job owns env var management.

4. GKE → AWS cross-cloud auth via OIDC federation

ACK controllers authenticate to AWS using projected service account tokens + AWS_WEB_IDENTITY_TOKEN_FILE. GKE projects OIDC tokens for pods via Workload Identity. AWS IAM trusts the GKE OIDC issuer. This is proven technology but untested in this project — Phase 0 validates it.
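A sketch of what the resulting IAM trust policy looks like, assuming the OIDC provider is registered under the issuer path shown above (condition key names follow the standard AWS OIDC pattern; verify against the actual provider ARN in Phase 0):

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::318789018510:oidc-provider/container.googleapis.com/v1/projects/camarades-net/zones/europe-west2-a/clusters/camaradesuk"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "container.googleapis.com/v1/projects/camarades-net/zones/europe-west2-a/clusters/camaradesuk:aud": "sts.amazonaws.com",
        "container.googleapis.com/v1/projects/camarades-net/zones/europe-west2-a/clusters/camaradesuk:sub": [
          "system:serviceaccount:ack-system:ack-lambda-controller",
          "system:serviceaccount:ack-system:ack-s3-controller"
        ]
      }
    }
  }]
}
```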


Codebase-Validated Reference

These values were validated against the actual codebase on 2026-02-06:

| Parameter | Correct Value | Source |
|---|---|---|
| Lambda runtime | dotnet10 | camarades-infrastructure/terraform/lambda/main.tf:67 |
| Lambda handler | SyRF.S3FileSavedNotifier.Endpoint::SyRF.S3FileSavedNotifier.Endpoint.S3FileReceivedHandler::HandleEvent | camarades-infrastructure/terraform/lambda/main.tf:68 |
| Lambda env vars | RabbitMqHost, RabbitMqUsername, RabbitMqPassword, S3Region | S3FileReceivedFunction.cs:78-80, Terraform main.tf:82 |
| Lambda env var note | S3Region is set by Terraform but not read by handler code — preserved for compatibility. --environment replaces ALL vars, so all 4 must be set together | env-vars-job.yaml |
| RabbitMQ host | amqp://rabbitmq.camarades.net:5672 (public, NOT cluster-internal, plain AMQP — see Security Note) | camarades-infrastructure/terraform/lambda/variables.tf:45 |
| RabbitMQ username | rabbit | camarades-infrastructure/terraform/lambda/variables.tf:51 |
| RabbitMQ virtual host | From S3 object metadata (metadata["virtualhost"]) | S3FileReceivedFunction.cs:81 |
| S3 bucket name | syrfapp-uploads (production) | camarades-infrastructure/terraform/lambda/variables.tf:57 |
| Lambda packages bucket | camarades-terraform-state-aws | camarades-infrastructure/terraform/lambda/variables.tf:63 |
| Production IAM role | syrfS3NotifierProductionLambdaRole | camarades-infrastructure/terraform/lambda/main.tf:16 |
| Preview IAM role | syrfS3NotifierPreviewLambdaRole | camarades-infrastructure/terraform/lambda/main.tf:97 |
| ApplicationSet glob | syrf/services/*/config.yaml (auto-discovers new services) | cluster-gitops/argocd/applicationsets/syrf.yaml |

Security Note: RabbitMQ Transport

The current Terraform configuration uses plain AMQP (amqp://rabbitmq.camarades.net:5672) for Lambda→RabbitMQ connections. This is existing production behavior, not introduced by this migration. However, it transmits credentials and messages in cleartext over the public internet. Upgrading to TLS (amqps:// on port 5671) is recommended as a separate follow-up but is out of scope for this migration — the goal here is to replicate the existing configuration faithfully under ACK management first.


Phase 0: Proof of Concept — Cross-Cloud Auth

Goal: Prove that a pod in GKE can assume an AWS IAM role and manage Lambda/S3 resources. This is the highest-risk component — if it fails, the ACK approach is blocked.

Duration: 1-2 days

Steps

  1. Create AWS IAM OIDC Provider for GKE (Terraform)
  2. File: camarades-infrastructure/terraform/main.tf
  3. Add aws_iam_openid_connect_provider trusting GKE OIDC issuer: https://container.googleapis.com/v1/projects/camarades-net/zones/europe-west2-a/clusters/camaradesuk
  4. Compute thumbprint from issuer URL certificate chain

  5. Create AWS IAM Roles (Terraform)

  6. File: camarades-infrastructure/terraform/lambda/ack-iam.tf
  7. Role syrf-ack-controllers (for ACK controllers):
    • Trust policy: Allow sts:AssumeRoleWithWebIdentity from GKE OIDC issuer, scoped to exact service accounts (ack-lambda-controller, ack-s3-controller) with aud: sts.amazonaws.com constraint
    • Permissions (3 policy statements):
    • ACKLambdaManagement: lambda:* scoped to arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier* — wildcard because ACK controllers call undocumented APIs (GetFunctionConcurrency, GetFunctionEventInvokeConfig) that break with explicit action lists
    • ACKS3Management: s3:* scoped to arn:aws:s3:::syrfapp-uploads* (bucket + objects), plus s3:ListAllMyBuckets on * (required by ACK S3 controller for bucket discovery)
    • ACKIAMPassRole: iam:PassRole for Lambda execution roles (syrfS3NotifierProductionLambdaRole, syrfS3NotifierStagingLambdaRole, syrfS3NotifierPreviewLambdaRole)
  8. Role syrf-ack-setup-job (least-privilege for setup jobs):

    • Trust policy: scoped to syrf-staging:ack-setup-job, syrf-production:ack-setup-job, and pr-*:ack-setup-job (via StringLike) with aud: sts.amazonaws.com constraint
    • Permissions: lambda:GetFunction, lambda:GetFunctionConfiguration, lambda:UpdateFunctionConfiguration, lambda:GetPolicy, lambda:AddPermission, lambda:RemovePermission only
  9. Deploy test pod with projected token (manual, temporary)

  10. Create ServiceAccount in ack-system with eks.amazonaws.com/role-arn annotation
  11. Deploy amazon/aws-cli pod, set AWS_WEB_IDENTITY_TOKEN_FILE + AWS_ROLE_ARN
  12. Test: aws lambda list-functions --region eu-west-1
  13. Test: aws s3 ls (verify S3 access)

  14. Validate: Pod can call both Lambda and S3 APIs → clean up test pod
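The manual test pod from steps 9-13 can be sketched as follows. Names, the ServiceAccount, and the role ARN are illustrative; the key mechanics are the projected token volume with audience sts.amazonaws.com and the two AWS_* environment variables:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: aws-auth-test
  namespace: ack-system
spec:
  serviceAccountName: aws-auth-test   # annotated with eks.amazonaws.com/role-arn
  containers:
    - name: awscli
      image: amazon/aws-cli
      command: ["sleep", "3600"]
      env:
        - name: AWS_ROLE_ARN
          value: arn:aws:iam::318789018510:role/syrf-ack-controllers
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/tokens/aws-token
      volumeMounts:
        - name: aws-token
          mountPath: /var/run/secrets/tokens
          readOnly: true
  volumes:
    - name: aws-token
      projected:
        sources:
          - serviceAccountToken:
              path: aws-token
              audience: sts.amazonaws.com
              expirationSeconds: 3600
```

With this in place, `kubectl exec` into the pod and run `aws sts get-caller-identity` — the returned ARN should be the assumed syrf-ack-controllers role.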

Success gate

Pod successfully calls AWS Lambda + S3 APIs. If this fails:

  1. Try storing AWS creds as K8s Secret instead of OIDC (less elegant but functional)
  2. Try GCP-to-AWS Workload Identity Federation
  3. Re-evaluate ACK approach

Phase 1: ACK Controller Installation

Goal: Install ACK S3 + Lambda controllers via GitOps, verified working.

Duration: 1 day

Prerequisites

  • AWS Pod Identity Webhook: Must be installed as a cluster plugin (pod-identity-webhook) before ACK controllers. It mutates pods with eks.amazonaws.com/role-arn annotations to inject AWS credentials via projected service account tokens. Without it, ACK controllers and setup jobs cannot authenticate to AWS.
  • ArgoCD AppProject config: The plugins project needs oci://public.ecr.aws/aws-controllers-k8s and https://jkroepke.github.io/helm-charts in sourceRepos, and ack-system + pod-identity-webhook namespace destinations.
  • SyRF AppProject config: Staging and production projects need ACK CRD whitelists (lambda.services.k8s.aws/Function, s3.services.k8s.aws/Bucket, services.k8s.aws/AdoptedResource).
  • ArgoCD health checks: Custom Lua health checks for Function, Bucket, and AdoptedResource CRDs in argocd-cm.

Steps

  1. Add ACK S3 controller to cluster-gitops plugins
  2. New: cluster-gitops/plugins/helm/ack-s3-controller/config.yaml

    plugin:
      name: ack-s3-controller
      repoURL: oci://public.ecr.aws/aws-controllers-k8s
      chart: s3-chart
      version: "1.0.14"  # Pin to stable version
      namespace: ack-system
    
  3. New: cluster-gitops/plugins/helm/ack-s3-controller/values.yaml

    • aws.region: eu-west-1, ServiceAccount with AWS role ARN from Phase 0
  4. Add ACK Lambda controller to cluster-gitops plugins

  5. New: cluster-gitops/plugins/helm/ack-lambda-controller/config.yaml

    plugin:
      name: ack-lambda-controller
      repoURL: oci://public.ecr.aws/aws-controllers-k8s
      chart: lambda-chart
      version: "1.5.2"  # Pin to stable version
      namespace: ack-system
    
  6. New: cluster-gitops/plugins/helm/ack-lambda-controller/values.yaml

  7. Verify

  8. Both controller pods running in ack-system
  9. CRDs installed: buckets.s3.services.k8s.aws, functions.lambda.services.k8s.aws
  10. No auth errors in controller logs
  11. Smoke test: create test Bucket CRD → verify bucket created in AWS → delete CRD → verify bucket deleted
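The smoke test in step 11 can be as simple as applying a throwaway Bucket CRD (the spec.name shown is illustrative — S3 names are global, so pick something unique):

```yaml
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: ack-smoke-test
  namespace: ack-system
spec:
  name: syrf-ack-smoke-test-20260206   # must be globally unique
```

Apply it, confirm the bucket appears in `aws s3 ls`, then delete the CRD and confirm the bucket is removed.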

Key files (4 new)

  • cluster-gitops/plugins/helm/ack-s3-controller/{config,values}.yaml
  • cluster-gitops/plugins/helm/ack-lambda-controller/{config,values}.yaml

Phase 2: Helm Chart Development

Goal: Create the s3-notifier Helm chart that renders ACK Bucket + Function CRDs and a PostSync Job.

Duration: 1-2 days

Chart structure

src/services/s3-notifier/.chart/
├── Chart.yaml                  # Standalone (no syrf-common dependency)
├── values.yaml
├── templates/
│   ├── _helpers.tpl
│   ├── bucket.yaml             # ACK Bucket CRD with notification config (wave 3)
│   ├── function.yaml           # ACK Function CRD (no env vars, wave 1)
│   ├── serviceaccount.yaml     # SA with eks.amazonaws.com/role-arn annotation
│   ├── permission-job.yaml     # Sync hook (wave 2): Lambda invoke permission
│   ├── env-vars-job.yaml       # PostSync hook: Lambda env vars from K8s Secret
│   └── adopted-resource.yaml   # Conditional: adopts existing prod resources
└── NOTES.txt

bucket.yaml — ACK Bucket CRD with built-in notification

apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: {{ .Values.bucket.name }}
  annotations:
    services.k8s.aws/deletion-policy: {{ .Values.bucket.deletionPolicy }}
    argocd.argoproj.io/sync-wave: "3"  # After permission-job (wave 2) — S3 notification needs Lambda permission to exist
    {{- if .Values.adoptExisting }}
    services.k8s.aws/adopted: "true"
    {{- end }}
spec:
  name: {{ .Values.bucket.name }}
  versioning:
    status: {{ if .Values.bucket.versioning }}Enabled{{ else }}Suspended{{ end }}
  publicAccessBlock:
    blockPublicAcls: true
    blockPublicPolicy: true
    ignorePublicAcls: true
    restrictPublicBuckets: true
  notification:
    lambdaFunctionConfigurations:
      - events:
          - s3:ObjectCreated:*
        lambdaFunctionARN: {{ include "s3-notifier.lambdaArn" . }}

Key: Notification is declared inline — each bucket points to exactly one Lambda. No aggregation.

function.yaml — ACK Function CRD

apiVersion: lambda.services.k8s.aws/v1alpha1
kind: Function
metadata:
  name: {{ include "s3-notifier.k8sName" . }}   # DNS-1123 compliant (lowercase)
  annotations:
    services.k8s.aws/deletion-policy: {{ .Values.lambda.deletionPolicy }}
    argocd.argoproj.io/sync-wave: "1"  # Before permission-job (wave 2) and Bucket (wave 3)
spec:
  name: {{ include "s3-notifier.functionName" . }}  # AWS name (camelCase)
  runtime: dotnet10
  handler: "SyRF.S3FileSavedNotifier.Endpoint::SyRF.S3FileSavedNotifier.Endpoint.S3FileReceivedHandler::HandleEvent"
  memorySize: {{ .Values.lambda.memorySize }}
  timeout: {{ .Values.lambda.timeout }}
  role: {{ .Values.lambda.executionRoleArn }}
  code:
    s3Bucket: {{ .Values.lambda.code.s3Bucket }}
    s3Key: {{ .Values.lambda.code.s3Key }}
  # environment.variables intentionally omitted — managed by setup-job (contains secrets)
  tags:
    Environment: {{ .Values.environmentName }}
    ManagedBy: argocd-ack

permission-job.yaml — Sync hook (wave 2) for Lambda invoke permission

  • ArgoCD Sync hook at wave 2 (runs during sync, before Bucket at wave 3)
  • Waits for Lambda to reach Active state (configurable timeout via setupJob.timeouts.lambdaActive)
  • Grants S3→Lambda invoke permission idempotently with confused deputy protection (--source-account)
  • On ResourceConflictException, verifies existing statement matches current source-arn and source-account; replaces if stale
  • Uses amazon/aws-cli image with Python for JSON parsing (no jq in image)
  • ServiceAccount ack-setup-job with syrf-ack-setup-job IAM role (least-privilege)
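A condensed sketch of the Job manifest (hook annotations as described above; the script is abbreviated — the real job also waits for the Lambda to be Active and handles ResourceConflictException as described):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "s3-notifier.k8sName" . }}-permission
  annotations:
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/sync-wave: "2"
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      serviceAccountName: {{ .Values.setupJob.serviceAccountName }}
      restartPolicy: Never
      containers:
        - name: add-permission
          image: {{ .Values.setupJob.image }}
          command: ["/bin/sh", "-c"]
          args:
            - |
              aws lambda add-permission \
                --function-name {{ include "s3-notifier.functionName" . }} \
                --statement-id s3-invoke \
                --action lambda:InvokeFunction \
                --principal s3.amazonaws.com \
                --source-arn arn:aws:s3:::{{ .Values.bucket.name }} \
                --source-account {{ .Values.awsAccountId }} \
                --region {{ .Values.awsRegion }}
```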

env-vars-job.yaml — PostSync hook for Lambda env vars

  • ArgoCD PostSync hook (runs after all resources synced)
  • Pre-checks RABBITMQ_PASSWORD availability (from rabbit-mq K8s Secret via secretKeyRef)
  • Waits for Lambda to accept configuration updates (configurable timeout via setupJob.timeouts.lambdaConfigReady)
  • Calls aws lambda update-function-configuration to set all 4 env vars:
  • RabbitMqHost = amqp://rabbitmq.camarades.net:5672 (public hostname — Lambda runs outside cluster)
  • RabbitMqUsername = rabbit
  • RabbitMqPassword = (from K8s Secret)
  • S3Region = eu-west-1 (set by Terraform historically, preserved for compatibility)
  • NOTE: --environment replaces ALL env vars. This is intentional — all 4 vars are declared here as the canonical source of truth
  • Uses Python heredoc for JSON payload construction (special characters in password)
  • Waits for update to complete, hard-fails on timeout
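The Python heredoc exists because interpolating the password into a shell string is unsafe. A minimal Python sketch of the idea (the function name is hypothetical; the real job inlines this):

```python
import json


def build_environment_payload(password):
    """Return the JSON string passed to
    `aws lambda update-function-configuration --environment`.

    json.dumps handles all quoting, so special characters in the
    password cannot break the payload. All 4 variables are included
    because --environment replaces the entire set, not just the
    keys given.
    """
    return json.dumps({
        "Variables": {
            "RabbitMqHost": "amqp://rabbitmq.camarades.net:5672",
            "RabbitMqUsername": "rabbit",
            "RabbitMqPassword": password,
            "S3Region": "eu-west-1",
        }
    })


if __name__ == "__main__":
    # In the real job the password comes from the mounted K8s Secret.
    print(build_environment_payload('p@ss"word with $pecials'))
```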

RabbitMQ secret

The rabbit-mq K8s Secret is managed by a ClusterExternalSecret in the plugins project — it targets all service namespaces automatically. No per-chart ExternalSecret is needed.
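For orientation, a ClusterExternalSecret of this shape would produce the rabbit-mq Secret in matching namespaces — the existing plugin config is authoritative, and the store name, selector label, and remote key here are all assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterExternalSecret
metadata:
  name: rabbit-mq
spec:
  externalSecretName: rabbit-mq
  namespaceSelector:
    matchLabels:
      syrf.camarades.net/service-namespace: "true"   # label assumed
  externalSecretSpec:
    secretStoreRef:
      kind: ClusterSecretStore
      name: gcp-secret-manager                       # store name assumed
    target:
      name: rabbit-mq
    data:
      - secretKey: rabbitmq-password                 # must match passwordSecretKey
        remoteRef:
          key: rabbitmq-password                     # GCP secret name assumed
```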

adopted-resource.yaml (conditional on adoptExisting: true)

  • Creates AdoptedResource CRD for both Bucket and Function
  • Tells ACK to discover and adopt existing AWS resources rather than creating new ones
  • Only used during production cutover (Phase 6)
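For the bucket, the AdoptedResource would look roughly like this (the Function gets an analogous manifest; namespace shown is illustrative):

```yaml
apiVersion: services.k8s.aws/v1alpha1
kind: AdoptedResource
metadata:
  name: adopt-syrfapp-uploads
spec:
  aws:
    nameOrID: syrfapp-uploads        # existing AWS bucket to adopt
  kubernetes:
    group: s3.services.k8s.aws
    kind: Bucket
    metadata:
      name: syrfapp-uploads          # Bucket CRD that ACK will create and bind
      namespace: syrf-production
```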

values.yaml defaults

bucket:
  name: ""               # Required: syrfapp-uploads, syrfapp-uploads-staging, etc.
  versioning: true
  deletionPolicy: retain  # Override to "delete" for previews
  tags:
    Service: s3-notifier
    ManagedBy: ACK

lambda:
  memorySize: 512
  timeout: 30
  deletionPolicy: retain  # Override to "delete" for previews
  executionRoleArn: ""     # Required per environment
  code:
    s3Bucket: camarades-terraform-state-aws
    s3Key: ""              # Required: lambda-packages/{env}.zip

awsAccountId: ""           # Required for Lambda ARN construction
awsRegion: eu-west-1

rabbitMq:
  host: "amqp://rabbitmq.camarades.net:5672"
  username: rabbit
  passwordSecretName: rabbit-mq
  passwordSecretKey: rabbitmq-password  # Must match ClusterExternalSecret key

environmentName: ""         # staging, production, pr-{N}
adoptExisting: false        # true only for production cutover

setupJob:
  enabled: true
  image: amazon/aws-cli:2.15.0
  serviceAccountName: ack-setup-job
  iamRoleName: syrf-ack-setup-job    # Least-privilege role (not syrf-ack-controllers)
  timeouts:
    lambdaActive: 120                # seconds to wait for Lambda Active state
    lambdaConfigReady: 60            # seconds to wait for LastUpdateStatus

Validation

  • helm template renders valid YAML with no secrets in output
  • Bucket CRD notification references correct Lambda ARN
  • Function CRD has correct handler (S3FileReceivedHandler::HandleEvent) and runtime (dotnet10)
  • RabbitMQ host is public hostname (amqp://rabbitmq.camarades.net:5672), NOT cluster-internal address

Key files (8 new in monorepo)

  • src/services/s3-notifier/.chart/Chart.yaml
  • src/services/s3-notifier/.chart/values.yaml
  • src/services/s3-notifier/.chart/templates/_helpers.tpl
  • src/services/s3-notifier/.chart/templates/{bucket,function,serviceaccount,permission-job,env-vars-job,adopted-resource}.yaml

Phase 3: CI/CD Pipeline Updates ✅

Goal: Update CI/CD to deploy Lambda code via GitOps (upload zip to S3 + update cluster-gitops) instead of direct aws lambda update-function-code. Must be in place before any ACK-managed environment receives code updates.

Status: Complete

Why before deployment phases

Without CI/CD updates, code changes to s3-notifier would still trigger the old pipeline (aws lambda update-function-code directly), conflicting with ACK's management of the Function CRD. Updating CI/CD first ensures all code deployments flow through GitOps from the start.

Key design decision: derive s3Key from image.tag

The ApplicationSet passes image.tag (from config.yaml's imageTag) to every service. The chart template defaults lambda.code.s3Key from image.tag:

imageTag: "0.1.3" → image.tag → s3Key: "lambda-packages/s3-notifier-v0.1.3.zip"

This means CI/CD only sets chartTag and imageTag in config.yaml (same as Docker services). No explicit s3Key in environment values.
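In the chart template, the derivation is a one-line default (sketch — the real template in function.yaml may name things differently):

```yaml
# function.yaml (excerpt) — explicit lambda.code.s3Key wins, otherwise derive from image.tag
s3Key: {{ .Values.lambda.code.s3Key | default (printf "lambda-packages/s3-notifier-v%s.zip" .Values.image.tag) }}
```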

Changes made

  1. Renamed deploy-lambdapackage-lambda in ci-cd.yml
  2. Removed: Terraform setup, init, plan, apply steps
  3. Removed: GitHub App token + infrastructure repo checkout
  4. Kept: dotnet publish + zip creation, AWS credentials, S3 upload
  5. Added: versioned upload (s3-notifier-v{version}.zip) alongside backward-compat production.zip

  6. Added s3-notifier to standard promotion flow

  7. s3-notifier now appears in Collect successful services (promote-to-staging)
  8. Flows through existing Update service versions step (generic yq-based config.yaml update)
  9. Removed legacy Update S3 Notifier version in API values step
  10. Removed s3NotifierVersion from syrf/services/api/values.yaml

  11. Chart template derives s3Key

  12. values.yaml: added image.tag placeholder
  13. function.yaml: s3Key defaults to lambda-packages/s3-notifier-v{image.tag}.zip
  14. Explicit lambda.code.s3Key override still works (for custom deployments)

  15. Updated detect-service-changes.sh

  16. Added chart path: src/services/s3-notifier/.chart
  17. Updated detection block: both build and retag trigger packaging (no Docker image to retag)

  18. Removed explicit s3Key from cluster-gitops environment values

  19. Staging and production values no longer specify lambda.code.s3Key
  20. Chart template derives it from image.tag (set by CI/CD promotion)

Backward compatibility

  • package-lambda uploads BOTH s3-notifier-v{version}.zip AND production.zip
  • Terraform-managed production Lambda still reads production.zip
  • Remove production.zip upload after production cutover (Phase 6)

Key files

  • .github/workflows/ci-cd.yml (renamed job, added promotion, removed legacy handling)
  • .github/scripts/detect-service-changes.sh (added chart path)
  • src/services/s3-notifier/.chart/values.yaml (added image.tag)
  • src/services/s3-notifier/.chart/templates/function.yaml (derived s3Key)

Phase 4: Preview Integration ✅

Goal: Replace pr-preview-lambda.yml with ACK-managed preview environments. Each PR gets its own isolated S3 bucket + Lambda.

Status: Complete

Changes made

  1. Added s3-notifier to preview ApplicationSet discovery
  2. New: cluster-gitops/syrf/environments/preview/services/s3-notifier/config.yaml
  3. New: cluster-gitops/syrf/environments/preview/services/s3-notifier/values.yaml (ephemeral deletion policies)

  4. Updated pr-preview.yml workflow

  5. Added s3-notifier to detect-changes (outputs + process_service call)
  6. Added version-s3-notifier job (GitVersion)
  7. Added package-lambda job (dotnet publish → zip → S3 upload with PR-specific key)
  8. Added s3-notifier to write-versions (per-PR values: bucket.name, environmentName, lambda.code.s3Key)
  9. Added per-PR S3 bucket name to API values (s3.bucketName: syrfapp-uploads-pr-{N})
  10. Updated update-pr-status and create-tags needs chains and outputs
  11. Added AWS cleanup steps to cleanup-tags (empty bucket + delete Lambda package from S3)

  12. Archived pr-preview-lambda.yml to .github/workflows/archived/

Original steps (for reference)

  1. Add s3-notifier to preview ApplicationSet
  2. New: cluster-gitops/syrf/environments/preview/services/s3-notifier/values.yaml
  3. Modify: cluster-gitops/argocd/applicationsets/syrf-previews.yaml

    • Pass per-PR parameters: bucket.name=syrfapp-uploads-pr-{{.prNumber}}, environmentName=pr-{{.prNumber}}, lambda.deletionPolicy=delete, bucket.deletionPolicy=delete
  4. Update pr-preview.yml workflow

  5. Add s3-notifier to detect-changes
  6. Add Lambda build step (move from pr-preview-lambda.yml)
  7. Upload zip to S3 with PR-specific key: lambda-packages/pr-{N}.zip
  8. Write s3-notifier values in write-versions job
  9. Set preview API values: s3.bucketName: syrfapp-uploads-pr-{N}
  10. Preview bucket + Lambda have deletionPolicy: delete (ephemeral)

  11. Preview cleanup on PR close

  12. ArgoCD deletes Application → ACK deletes Bucket CRD + Function CRD
  13. deletionPolicy: delete means ACK also deletes the actual AWS resources
  14. S3 won't delete non-empty bucket → add pre-delete step to empty bucket first
  15. Add to existing cleanup workflow: aws s3 rm s3://syrfapp-uploads-pr-{N} --recursive

  16. Archive pr-preview-lambda.yml to .github/workflows/archived/

Key files

  • .github/workflows/pr-preview.yml (modify)
  • .github/workflows/pr-preview-lambda.yml (archive)
  • cluster-gitops/syrf/environments/preview/services/s3-notifier/values.yaml (new)
  • cluster-gitops/argocd/applicationsets/syrf-previews.yaml (modify)

Phase 5: Staging Deployment ✅

Goal: Deploy a NEW staging S3 bucket + Lambda via ACK. Existing production remains Terraform-managed.

Status: Complete (configuration deployed to cluster-gitops, awaiting ACK controller sync)

Steps

  1. Register s3-notifier as a service in cluster-gitops
  2. New: cluster-gitops/syrf/services/s3-notifier/config.yaml

    serviceName: s3-notifier
    service:
      chartPath: src/services/s3-notifier/.chart
      chartRepo: https://github.com/camaradesuk/syrf
    
  3. New: cluster-gitops/syrf/services/s3-notifier/values.yaml (base defaults including awsAccountId)

  4. Create staging environment config

  5. New: cluster-gitops/syrf/environments/staging/s3-notifier/config.yaml

    serviceName: s3-notifier
    envName: staging
    service:
      enabled: true
      chartTag: main
      imageTag: "1.0.0"
    
  6. New: cluster-gitops/syrf/environments/staging/s3-notifier/values.yaml

    bucket:
      name: syrfapp-uploads-staging
    lambda:
      executionRoleArn: "arn:aws:iam::318789018510:role/syrfS3NotifierStagingLambdaRole"
      code:
        s3Key: "lambda-packages/production.zip"
    environmentName: staging
    
  7. Update staging API values to use new bucket

  8. Modify: cluster-gitops/syrf/environments/staging/api/values.yaml

    s3:
      bucketName: syrfapp-uploads-staging
    
  9. Same for project-management if it accesses S3 directly

  10. ArgoCD syncs → ACK creates resources

  11. ApplicationSet auto-discovers s3-notifier (glob: syrf/services/*/config.yaml)
  12. ACK S3 controller creates syrfapp-uploads-staging bucket
  13. ACK Lambda controller creates syrfAppUploadS3Notifier-staging function
  14. Bucket CRD notification links the two (1:1)
  15. Hook Jobs set the Lambda invoke permission (Sync, wave 2) and env vars (PostSync)

  16. Test end-to-end

  17. Upload file via staging SyRF UI
  18. Verify file lands in syrfapp-uploads-staging (not the shared bucket)
  19. Verify Lambda triggers (CloudWatch logs)
  20. Verify RabbitMQ message arrives at staging services
  21. ArgoCD shows healthy sync for s3-notifier-staging

Risk: ACK reconciliation overwriting env vars

After PostSync Job sets env vars, the ACK controller will reconcile the Function CRD. If the CRD omits environment, ACK may either leave existing env vars alone (late initialization) or clear them. Must verify this behavior in staging before proceeding to production. If ACK clears them, mitigation: include non-sensitive env vars (RabbitMqHost, RabbitMqUsername, S3Region) in the CRD spec, and only use the Job for RabbitMqPassword.
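If that mitigation is needed, the Function CRD would carry only the non-sensitive variables (sketch; values keys match the chart defaults above):

```yaml
# function.yaml (excerpt) — only if ACK is found to clear Job-set env vars
spec:
  environment:
    variables:
      RabbitMqHost: {{ .Values.rabbitMq.host }}
      RabbitMqUsername: {{ .Values.rabbitMq.username }}
      S3Region: {{ .Values.awsRegion }}
      # RabbitMqPassword intentionally absent — set by env-vars-job from the K8s Secret
```

Note that this still leaves the controller and the Job contending for the password key, which is exactly what the staging verification must confirm is safe.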

Key files

  • cluster-gitops/syrf/services/s3-notifier/{config,values}.yaml (new)
  • cluster-gitops/syrf/environments/staging/s3-notifier/{config,values}.yaml (new)
  • cluster-gitops/syrf/environments/staging/api/values.yaml (modify — add s3.bucketName)

Phase 6: Production Cutover

Goal: Migrate production Lambda + bucket from Terraform to ACK. Zero downtime.

Duration: 1 day (then 1-2 week soak)

Pre-cutover checklist

  • Staging Lambda running via ACK for 1+ week, no issues
  • ACK env var reconciliation behavior confirmed safe (from Phase 5)
  • Backup: aws lambda get-function --function-name syrfAppUploadS3Notifier
  • Backup: aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads
  • Backup: aws s3 ls s3://syrfapp-uploads --recursive --summarize > inventory.txt
  • deletion-policy: retain confirmed in chart values for production
  • Maintenance window communicated

Steps

  1. Create production config with adoption
  2. New: cluster-gitops/syrf/environments/production/s3-notifier/values.yaml

    bucket:
      name: syrfapp-uploads          # Adopt existing
      deletionPolicy: retain
    lambda:
      executionRoleArn: "arn:aws:iam::<ACCOUNT_ID>:role/syrfS3NotifierProductionLambdaRole"
      code:
        s3Key: "lambda-packages/production.zip"
      deletionPolicy: retain
    environmentName: production
    adoptExisting: true              # Triggers AdoptedResource CRDs
    
  3. Deploy and adopt — push to cluster-gitops, ArgoCD syncs, ACK discovers existing resources

  4. Verify adoption

  5. Bucket creation date unchanged (NOT recreated)
  6. aws s3 ls s3://syrfapp-uploads --recursive --summarize matches pre-cutover inventory
  7. Lambda config unchanged in AWS Console
  8. File uploads continue working

  9. Remove from Terraform state (maintenance window)

terraform state rm aws_lambda_function.s3_notifier_production
terraform state rm aws_lambda_permission.s3_invoke_production
terraform state rm aws_s3_bucket_notification.uploads  # Preview Lambdas already ACK-managed (Phase 4)
  10. Remove adoptExisting: true — ACK now fully owns production resources

Rollback

Re-import into Terraform (terraform import), set service.enabled: false in cluster-gitops.


Phase 7: Terraform Cleanup & Documentation

Goal: Remove Terraform Lambda resources (now fully ACK-managed) and document the migration.

Duration: 1 day

Steps

  1. Remove Terraform Lambda resources
  2. Gut camarades-infrastructure/terraform/lambda/main.tf (remove all Lambda + notification resources)
  3. Keep IAM execution roles (shared, rarely change)
  4. Keep lambda_packages_bucket variable (still used for zip upload)

  5. Documentation

  6. Update CLAUDE.md S3 Notifier section
  7. Create docs/decisions/ADR-00X-ack-lambda-migration.md
  8. Update this technical plan status to "Completed"

Risk Register

| # | Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|---|
| 1 | GKE→AWS OIDC auth doesn't work | Critical | Medium | Phase 0 is a dedicated PoC. Fallback: store AWS creds as a K8s Secret (less elegant but functional) |
| 2 | ACK reconciliation clears env vars set by PostSync Job | High | Medium | Verify in Phase 5 staging. If ACK clears them: put non-sensitive vars in the CRD spec, use the Job only for the password |
| 3 | Production bucket adopted incorrectly (data loss) | Critical | Low | deletion-policy: retain + adopted: true annotation. Backup inventory before cutover. Verify bucket creation date unchanged |
| 4 | S3 bucket name globally unavailable | Low | Low | syrfapp-uploads-staging may already exist. Check availability before Phase 5. Worst case: use alternative naming |
| 5 | Bucket notification → Lambda ordering | Medium | Medium | Bucket CRD references the Lambda ARN; if the Lambda doesn't exist yet when the Bucket syncs, the notification fails. ArgoCD sync waves enforce ordering — Function at wave 1, permission-job at wave 2, Bucket at wave 3 |
| 6 | Preview bucket cleanup fails (non-empty bucket) | Medium | Medium | S3 refuses to delete non-empty buckets. Add an aws s3 rm --recursive step before ArgoCD prunes the CRDs |
| 7 | CI/CD complexity during transition | Medium | High | Phases 5-6: Terraform and ACK manage different Lambdas. Document clearly which system owns which Lambda |
| 8 | ACK controller instability | Medium | Low | Pin to stable versions. Test upgrades in staging first. Keep Terraform as rollback for 2 weeks post-production cutover |
| 9 | ACK controller calls undocumented APIs | Medium | High | Use lambda:* / s3:* scoped to resource ARN prefix — explicit action lists break on controller upgrades (see Operational Notes) |
| 10 | Python unavailable in AWS CLI image | Low | Confirmed | amazon/aws-cli:2.15.0 has no standalone python3. Setup jobs search multiple fallback paths including /usr/bin/python2.7. Permission-job skips verification gracefully |
| 11 | IAM policy changes not picked up by ACK | Medium | Confirmed | ACK controllers cache STS sessions. Restart controller deployments after IAM changes (kubectl rollout restart) |

Verification Plan

Phase 0

# From test pod in GKE:
aws sts get-caller-identity          # Confirms cross-cloud auth works
aws lambda list-functions --region eu-west-1
aws s3 ls                            # Confirms S3 access

Phase 2

helm template staging src/services/s3-notifier/.chart/ -f test-staging-values.yaml
# Verify: no secrets in output, valid Bucket + Function CRDs, correct Lambda ARN in notification

Phase 3 ✅

# Verified: chart template renders correctly
helm template test src/services/s3-notifier/.chart/ --set image.tag=0.1.3 \
  --set awsAccountId=318789018510 --set environmentName=staging \
  --set bucket.name=test --set lambda.executionRoleArn=arn:aws:iam::318789018510:role/test
# Result: s3Key: lambda-packages/s3-notifier-v0.1.3.zip ✅
# Explicit override with --set lambda.code.s3Key=custom.zip also works ✅
# After merge: verify package-lambda runs, versioned zip uploaded, promotion PR created

Phase 4 ✅

# Preview: create PR with preview label
# Verify: syrfapp-uploads-pr-{N} bucket created, Lambda created, notification linked
# Upload file → Lambda triggers → correct RabbitMQ vhost

# Close PR → verify bucket + Lambda deleted in AWS
aws s3 ls | grep -v syrfapp-uploads-pr-{N}  # Should not exist

Phase 5 ✅

# ArgoCD
argocd app get s3-notifier-staging
kubectl get bucket,function -n syrf-staging

# AWS — new bucket exists
aws s3 ls | grep syrfapp-uploads-staging

# AWS — Lambda configured correctly
aws lambda get-function-configuration --function-name syrfAppUploadS3Notifier-staging

# AWS — notification links bucket → Lambda
aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads-staging

# End-to-end: upload file via staging UI → Lambda triggers → RabbitMQ message received

Phase 6

# Compare pre/post cutover
aws lambda get-function --function-name syrfAppUploadS3Notifier  # Config unchanged
aws s3 ls s3://syrfapp-uploads --recursive --summarize            # File count unchanged

# Functional: upload file via production SyRF UI, verify end-to-end processing

Appendix: Differences from Original Plan

The original technical plan (created 2026-01-15, deprecated) had 12 issues identified during validation against the actual codebase. This updated plan corrects all of them:

| Issue | Original (Wrong) | Corrected |
|---|---|---|
| Lambda handler | `Function::FunctionHandler` | `S3FileReceivedHandler::HandleEvent` |
| Lambda runtime | `dotnet8` | `dotnet10` |
| Lambda env vars | `RABBITMQ_HOST` (uppercase) | `RabbitMqHost` (PascalCase, matching C# code) |
| RabbitMQ host | `amqp://rabbitmq.syrf-{env}.svc.cluster.local:5672` | `amqp://rabbitmq.camarades.net:5672` (public — Lambda runs outside cluster) |
| Missing env vars | Only `RABBITMQ_HOST` | `RabbitMqHost`, `RabbitMqUsername`, `RabbitMqPassword` (3 used by handler; `S3Region` set by Terraform but unused by code) |
| AWS Account ID | `ACCOUNT_ID` placeholders in IAM ARNs | Documented as required value, set per-environment in cluster-gitops |
| Lambda packages bucket | `camarades-lambda-packages` | `camarades-terraform-state-aws` |
| IAM role names | `lambda-s3-notifier-execution-role` | `syrfS3NotifierProductionLambdaRole` / `syrfS3NotifierPreviewLambdaRole` |
| Lambda permission | Terraform-managed | PostSync Job (keeps everything in GitOps) |
| S3 bucket strategy | Shared bucket / values files in chart | Separate buckets per env, values in cluster-gitops only |
| Env var management | Inline in Function CRD spec | PostSync Job (secrets never in git) |
| Bucket notification | Separate from Lambda permission concern | Built into Bucket CRD `spec.notification` (1:1 with Lambda) |

Appendix: ACK CRD Verification

Verified via ACK documentation review (2026-01-15).

S3 Controller CRDs

| CRD | Status | Notes |
|---|---|---|
| Bucket | Supported | Full S3 bucket management with 30+ configuration categories |
| Bucket Notifications | Built-in | Configured via `spec.notification` field — NOT a separate CRD |

S3 notification config within Bucket CRD:

spec:
  notification:
    lambdaFunctionConfigurations:
      - events: ["s3:ObjectCreated:*"]
        lambdaFunctionARN: "arn:aws:lambda:eu-west-1:ACCOUNT_ID:function:syrfAppUploadS3Notifier"

Lambda Controller CRDs

| CRD | Status | Notes |
|---|---|---|
| Function | Supported | Core Lambda function management |
| Alias | Supported | Alias with event invoke config |
| CodeSigningConfig | Supported | Code signing configuration |
| EventSourceMapping | Supported | Kafka, MQ, SQS event sources |
| FunctionUrlConfig | Supported | Function URL HTTPS endpoints |
| LayerVersion | Supported | Lambda layer management |
| Version | Supported | Immutable function versions |
| Permission | NOT Supported | Referenced in internal hooks but NOT a top-level CRD |

Critical gap: Lambda Permission (resource-based policy for S3→Lambda invoke) has no ACK CRD. Solved via PostSync Job calling aws lambda add-permission.
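A minimal sketch of such a PostSync Job, assuming the ArgoCD hook annotations and the `ack-setup-job` service account described in this plan; the job name, image tag, statement ID, and env values are illustrative, not the chart's actual manifest:

```yaml
# permission-job.yaml — hypothetical PostSync hook that grants S3 invoke rights.
apiVersion: batch/v1
kind: Job
metadata:
  name: s3-notifier-permission-job        # illustrative name
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      serviceAccountName: ack-setup-job
      restartPolicy: Never
      containers:
        - name: add-permission
          image: amazon/aws-cli:2.15.0
          command:
            - /bin/sh
            - -c
            - |
              # Idempotent: drop any stale statement, then re-add it
              aws lambda remove-permission \
                --function-name "$FUNCTION_NAME" \
                --statement-id AllowS3Invoke 2>/dev/null || true
              aws lambda add-permission \
                --function-name "$FUNCTION_NAME" \
                --statement-id AllowS3Invoke \
                --action lambda:InvokeFunction \
                --principal s3.amazonaws.com \
                --source-arn "arn:aws:s3:::$BUCKET_NAME"
          env:
            - name: FUNCTION_NAME
              value: syrfAppUploadS3Notifier-staging
            - name: BUCKET_NAME
              value: syrfapp-uploads-staging
```

The `remove-permission || true` step keeps the job re-runnable, since `add-permission` fails if the statement ID already exists.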

Verification commands

# After ACK installation, verify available CRDs:
kubectl get crd | grep s3.services.k8s.aws
# Expected: buckets.s3.services.k8s.aws

kubectl get crd | grep lambda.services.k8s.aws
# Expected: functions, aliases, codesigningconfigs, eventsourcemappings,
#           functionurlconfigs, layerversions, versions
# NOT expected: permissions

Appendix: Data Persistence Guarantees

Production bucket contains user uploads (PDFs, reference files). Multiple protection layers ensure data safety:

| Layer | Protection Against |
|---|---|
| `services.k8s.aws/deletion-policy: retain` | CRD deletion removing bucket |
| AWS "bucket not empty" check | API-level bucket deletion |
| S3 versioning | Accidental object deletion |
| ACK AdoptedResource CRD | Existing resources being recreated |
# bucket.yaml — Critical annotations
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: syrfapp-uploads
  annotations:
    services.k8s.aws/deletion-policy: retain    # Never delete bucket when CRD removed
    services.k8s.aws/adopted: "true"            # Adopt existing, don't recreate
spec:
  name: syrfapp-uploads
  versioning:
    status: Enabled                              # Object-level recovery
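For adopting the existing production bucket, the generic ACK AdoptedResource shape looks roughly like this (a sketch; the field layout should be verified against the installed controller version):

```yaml
# adopted-bucket.yaml — hypothetical adoption manifest for the existing bucket
apiVersion: services.k8s.aws/v1alpha1
kind: AdoptedResource
metadata:
  name: adopt-syrfapp-uploads             # illustrative name
spec:
  aws:
    nameOrID: syrfapp-uploads             # existing AWS bucket to adopt
  kubernetes:
    group: s3.services.k8s.aws
    kind: Bucket
    metadata:
      name: syrfapp-uploads               # target Bucket CRD name
```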

Appendix: Operational Notes (from Preview Deployment)

Lessons learned during the first preview deployment (PR #2328, 2026-02-11). These issues were resolved in the chart and documented here for future reference.

ACK Lambda controller requires ephemeralStorage

The ACK Lambda controller calls UpdateFunctionConfiguration on every reconciliation. If ephemeralStorage is omitted from the Function CRD spec, the API call omits it and AWS returns an error. The chart now includes ephemeralStorage.size (defaults to 512 MB in values.yaml).
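In Function CRD terms, the relevant fragment looks like this (a sketch; the Kubernetes metadata name is illustrative, and the 512 MB default matches the chart's `values.yaml`):

```yaml
# function.yaml (fragment) — ephemeralStorage must always be present
apiVersion: lambda.services.k8s.aws/v1alpha1
kind: Function
metadata:
  name: s3-notifier                       # illustrative Kubernetes name
spec:
  name: syrfAppUploadS3Notifier-staging
  ephemeralStorage:
    size: 512   # MB; omitting this field breaks every reconciliation cycle
```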

ACK controller IAM — use wildcards, not explicit action lists

The ACK Lambda controller calls undocumented/internal AWS APIs during reconciliation (e.g. GetFunctionConcurrency, GetFunctionEventInvokeConfig). An explicit action list broke on controller upgrade. The ACKLambdaManagement policy now uses lambda:* scoped to the function ARN prefix. Similarly, the ACK S3 controller requires s3:ListAllMyBuckets on * for bucket discovery — ACKS3Management uses s3:* on the bucket ARN prefix plus the account-level action.
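Sketched as one policy document (in AWS these live as separate inline policies on the controller role; the statement IDs are illustrative, the actions and ARNs are taken from the IAM policy tables in this plan):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ACKLambdaManagement",
      "Effect": "Allow",
      "Action": "lambda:*",
      "Resource": "arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier*"
    },
    {
      "Sid": "ACKS3BucketScoped",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::syrfapp-uploads",
        "arn:aws:s3:::syrfapp-uploads-*",
        "arn:aws:s3:::syrfapp-uploads/*",
        "arn:aws:s3:::syrfapp-uploads-*/*"
      ]
    },
    {
      "Sid": "ACKS3Discovery",
      "Effect": "Allow",
      "Action": "s3:ListAllMyBuckets",
      "Resource": "*"
    }
  ]
}
```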

Setup-job trust policy must include preview namespaces

The syrf-ack-setup-job trust policy originally only allowed syrf-staging and syrf-production namespaces. Preview environments use pr-* namespaces. The trust policy now uses StringLike with system:serviceaccount:pr-*:ack-setup-job in addition to the staging/production entries.
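The resulting trust policy shape, sketched with `OIDC_PROVIDER` standing in for the cluster's registered OIDC identity provider (exact provider ARN elided). Two statements are used because `StringEquals` and `StringLike` conditions in a single statement would be ANDed together:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Federated": "arn:aws:iam::318789018510:oidc-provider/OIDC_PROVIDER" },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "OIDC_PROVIDER:sub": [
            "system:serviceaccount:syrf-staging:ack-setup-job",
            "system:serviceaccount:syrf-production:ack-setup-job"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": { "Federated": "arn:aws:iam::318789018510:oidc-provider/OIDC_PROVIDER" },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringLike": {
          "OIDC_PROVIDER:sub": "system:serviceaccount:pr-*:ack-setup-job"
        }
      }
    }
  ]
}
```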

Python is not directly available in amazon/aws-cli:2.15.0

The AWS CLI image (Amazon Linux 2) bundles Python inside the aws binary (PyInstaller-frozen). There is no standalone python3 in $PATH. However, python2.7 exists at /usr/bin/python2.7. The setup jobs now search multiple paths: python3, python, python2.7, /usr/bin/python2.7, and AWS CLI's internal Python paths. The permission-job also gracefully skips policy verification if no Python is found (the permission is already added by that point).
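The interpreter search can be sketched as a small POSIX-sh helper (the candidate list comes from this document; the helper name and the graceful-skip message are illustrative, and the AWS-CLI-internal Python paths are omitted since they vary by image version):

```shell
# find_first_cmd: print the first argument that resolves to an executable.
find_first_cmd() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Candidate interpreters, in preference order.
if PY=$(find_first_cmd python3 python python2.7 /usr/bin/python2.7); then
  echo "using $PY for policy verification"
else
  # The permission is already added by this point, so skipping is safe.
  echo "no usable Python found; skipping policy verification"
fi
```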

ACK controller credential caching after IAM changes

After updating IAM policies, ACK controllers continue using cached STS sessions. The controllers must be restarted (kubectl rollout restart deployment) to pick up new permissions. This only applies to IAM policy changes — normal CRD operations use the existing session.

ArgoCD retry exhaustion on transient failures

If an ACK CRD fails during sync (e.g. due to IAM permission errors), ArgoCD exhausts its 3 retry attempts. After fixing the root cause, the operation state must be cleared manually.

Note: This kubectl patch targets ArgoCD's own internal state (clearing operation status), not application resources. This falls under the ArgoCD bootstrap exception in the GitOps policy.

kubectl patch application <app> -n argocd --type merge \
  -p '{"status":{"operationState":null}}'

Then trigger a hard refresh in ArgoCD.

AWS IAM Policies — Current State

The following IAM policies were updated directly in AWS during the preview deployment. These need to be backported to Terraform (camarades-infrastructure/terraform/lambda/ack-iam.tf) before production cutover.

Role: syrf-ack-controllers

| Policy | Actions | Resources |
|---|---|---|
| ACKLambdaManagement | `lambda:*` | `arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier*` |
| ACKS3Management | `s3:*` | `arn:aws:s3:::syrfapp-uploads`, `arn:aws:s3:::syrfapp-uploads-*`, `arn:aws:s3:::syrfapp-uploads/*`, `arn:aws:s3:::syrfapp-uploads-*/*` |
| ACKS3Management | `s3:ListAllMyBuckets` | `*` |
| ACKIAMPassRole | `iam:PassRole` | `arn:aws:iam::318789018510:role/syrfS3Notifier*LambdaRole` |
| LambdaPackagesRead | `s3:GetObject`, `s3:ListBucket` | `arn:aws:s3:::camarades-terraform-state-aws`, `arn:aws:s3:::camarades-terraform-state-aws/lambda-packages/*` |

Role: syrf-ack-setup-job

| Policy | Actions | Resources |
|---|---|---|
| SetupJobPermissions | `lambda:GetFunction`, `lambda:GetFunctionConfiguration`, `lambda:UpdateFunctionConfiguration`, `lambda:GetPolicy`, `lambda:AddPermission`, `lambda:RemovePermission` | `arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier*` |
| Trust: `StringEquals` | `sts:AssumeRoleWithWebIdentity` | `system:serviceaccount:syrf-staging:ack-setup-job`, `system:serviceaccount:syrf-production:ack-setup-job` |
| Trust: `StringLike` | `sts:AssumeRoleWithWebIdentity` | `system:serviceaccount:pr-*:ack-setup-job` |

References