Plan: Fix Code Review Issues for ACK Migration (v3.15)
v3.15 changes
- Here-strings for multi-line variables — All `printf '%s\n' "$VAR" | grep/awk/sed` pipelines on captured multi-line output (chart values, helm template output, controller logs) replaced with here-strings (`grep ... <<<"$VAR"`, `awk ... <<<"$VAR"`, `sed ... <<<"$VAR"`). Here-strings pass the variable as stdin without a pipe process, eliminating SIGPIPE entirely under `set -euo pipefail`. Remaining pipes are intentional: token-check patterns (`echo "$ROLE_ARN" | tr | grep -qxF`) operate on small single-line jsonpath outputs where the full data fits in one pipe-buffer write. Affects: chart key validation, SA extraction, controller log check.
- Chart key validation: leaf-level `yq -e` predicates — Loose `grep -q 'certManager:'` assertions replaced with leaf-level `yq -e` exit-code predicates (e.g. `.pki.certManager.enabled`, not `.pki.certManager`). Ensures the specific keys our values.yaml overrides actually exist in the chart schema. `yq -e` exits non-zero when the path resolves to `null`, eliminating the intermediate `| grep -v '^null$'` pipe.
- Fix 0 webhook readiness: `kubectl wait` — Standalone Fix 0 verification step 1 changed from informational `kubectl get pods` (exits 0 regardless of Ready state) to `kubectl wait --for=condition=Ready ... --timeout=120s`, consistent with the Phase 2 gate.
- OCI sourceRepos: `yq` exact match — OCI `sourceRepos` fallback verification replaced the brittle `grep -qF` substring check with `yq -e '.spec.sourceRepos[] | select(. == ...)'` for exact array-element matching. Eliminates false positives from partial URL matches.
- `yq` required everywhere, no fallbacks — Prerequisites list `yq` as required (was "optional with hardcoded fallback"). Pre-merge block includes `command -v yq` + major-version gate (v4+ check). Phase 4b expected-value sourcing now hard-fails on missing/old yq instead of falling back to hardcoded values that can silently drift from config.
- Repo root autodetect — `GITOPS_ROOT` and `INFRA_ROOT` now autodetect nested (`$REPO_ROOT/cluster-gitops`) or sibling (`$REPO_ROOT/../cluster-gitops`) layouts. Explicit `export GITOPS_ROOT=...` override still works for CI or non-standard layouts. Actionable error if neither path exists.
- Phase 5: production rollout gate — New section after Phase 4b with 4 concrete steps: (1) enable `adoptExisting: true` via GitOps commit, (2) verify the health+sync gate via `argocd app wait`, (3) run Phase 4b reconciliation against production, (4) reset `adoptExisting: false` after verification. Rollback requires a GitOps revert (not just `argocd app delete`) to prevent ApplicationSet recreation. All commands use `--grpc-web` and `jq` for structured checks.
- Phase 4b parameterized — All hardcoded staging references (`syrfAppUploadS3Notifier-staging`, `syrf-staging`, `s3-notifier-staging`) replaced with `$FUNCTION_NAME`, `$K8S_NAMESPACE`, `$CR_NAME`, `$ARGOCD_APP`, `$ENV_VALUES_PATH`. Defaults to staging if `PHASE4B_*` env vars are unset. Phase 5 Step 3 shows the production overrides.
- Phase 5 hardened — Step 1 now includes both `service.enabled: true` (config.yaml) and `adoptExisting: true` (values.yaml) in a single dedicated commit, since the app won't be generated without enablement. Adoption SHA is a hard gate: auto-detects from HEAD only if the commit message mentions `adoptExisting`; requires explicit `export ADOPTION_SHA=<sha>` otherwise. Lambda state hard-asserted `Active`. S3 notification filtered by ARN. Rollback `deletionPolicy` pre-check handles each CRD independently (present+retain, present+other, or absent). `--grpc-web` on all ArgoCD commands.
- Terraform teardown: correct addresses and scope — Fixed resource names to match the actual Terraform config: `aws_lambda_function.s3_notifier_production`, `aws_lambda_permission.s3_invoke_production`. `aws_s3_bucket_notification.uploads` is NOT removed (it manages both production AND preview notifications); instead, the production `lambda_function` block is removed from the resource config. Preview resources (`s3_notifier_preview`, `s3_invoke_preview`) are explicitly called out as untouched. Documents `removed` blocks (Terraform 1.7+) as the preferred alternative for the Lambda and permission.
- macOS note simplified — Verbose 4-line macOS/BSD `date +%s%N` comment condensed to 1 line. Only relevant to local shell portability, not cluster runtime.
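The here-string change is easiest to see in isolation. A minimal sketch, assuming bash; the variable contents are an invented stand-in for captured helm/log output:

```shell
set -euo pipefail

# Stand-in for large captured output (helm template, controller logs, etc.)
BIG_OUTPUT=$(seq 1 100000)

# Pipe form (old): if the reader exits early (e.g. `head -1`), the writer
# can receive SIGPIPE, and `pipefail` reports the whole pipeline as failed.
# printf '%s\n' "$BIG_OUTPUT" | head -1   # may false-fail under pipefail

# Here-string form (new): the variable is fed to grep as stdin directly,
# with no writer process, so there is nothing to receive SIGPIPE.
grep -cx '42' <<<"$BIG_OUTPUT"   # -> 1
```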
v3.14 changes
- Fail-fast: `set -euo pipefail` — All runnable verification blocks (Prerequisites, Fix 0 standalone, Phases 1-4b) now start with `set -euo pipefail`. Prevents gate commands from failing silently and allowing execution to continue past a broken check.
- Fallback sync error guard — `argocd app sync --force` in the Phase 4b automated retry now has `|| { echo "..."; exit 1; }` to catch ArgoCD connectivity or app-state failures instead of proceeding blind.
- namespaceSelector doc consistency — The initial webhook values comment (failurePolicy: Ignore deploy) now explicitly notes that a `namespaceSelector` should be added when switching to `Fail`, cross-referencing the fail-closed section. Removes the "no namespaceSelector needed" contradiction.
- Namespace label pre-check — Added a runnable pre-check script before the failurePolicy: Fail switch that verifies the `syrf.io/aws-identity=enabled` label exists on `ack-system`, `syrf-staging`, and `syrf-production`. Hard-fails with a remediation command if any label is missing.
- Rollout timeout alignment — Inline remediation example (`--timeout=60s`) aligned to `180s` to match the main Phase 2 rollout waits. No more conflicting timeout guidance.
- Version bump: v3.13 → v3.14 — New version tag for traceability after 6 incremental fixes since v3.13.
- S3 IAM phantom actions — Moved from v3.13 changelog item 42 to a clear note: removed `s3:GetObjectMetadata` and `s3:HeadObject` from the staging Lambda role policy. AWS maps HeadObject auth to `s3:GetObject` — the phantom actions are not valid IAM action names and would be silently ignored by IAM.
- `set -e` + `grep` safety — All `grep` calls inside command substitutions now use `{ grep ... || true; }` to prevent `set -e` from killing the script before custom error handling. Affected: yq version detection, SA-name extraction (helm template pipeline), webhook rollback name lookups. Count checks switched from `wc -l` (counts empty lines) to `grep -c . || true`.
- SA extraction: explicit empty guard — After the helm template + grep pipeline, added explicit `-z` checks for `LAMBDA_SA` and `S3_SA` with actionable error messages, before the trust-policy cross-check.
- Namespace selector structural clarity — The initial values.yaml block now includes a commented-out `namespaceSelector` placeholder explicitly noting it's not needed for `Ignore` but required for `Fail`, mirroring the fail-closed section's structure.
- Label remediation: `--overwrite` — All `kubectl label ns` remediation commands (pre-check script and inline YAML comments) now include `--overwrite` to handle wrong-value correction, not just missing-label cases.
- POSIX redirect: `&>/dev/null` → `>/dev/null 2>&1` — Replaced the bashism `&>` with POSIX-portable `>/dev/null 2>&1` in the yq presence check.
- Prerequisites: context hard gate — `EXPECTED_CONTEXT` was defined but never validated in Prerequisites (only in Fix 0 standalone). Added a `kubectl config current-context` assertion with `exit 1` on mismatch, so skipping the Fix 0 block can't accidentally hit the wrong cluster.
- Phase 2 webhook readiness gate — Replaced `kubectl get pods` (exits 0 even when not Ready) with `kubectl wait --for=condition=Ready pod ... --timeout=120s`. The webhook must be fully Ready before the controller rollout restart, not just listed.
- Pre-merge validation: `set -euo pipefail` — The chart key validation block (helm repo add/show values) was the last runnable block without fail-fast mode. Added for consistency with all other blocks.
- New namespace labeling note — Added an operational warning that any new namespace needing AWS credential injection (e.g. PR previews) must be labeled `syrf.io/aws-identity=enabled` before pod deployment when `failurePolicy: Fail` with `namespaceSelector` is active, or credentials silently won't inject.
- SIGPIPE safety: capture-then-parse — `helm show values ... | head -60` and `helm template ... | grep | head -1` pipelines can false-fail under `set -euo pipefail` when `head` closes the pipe early (upstream gets SIGPIPE, `pipefail` sees non-zero). Rewritten to capture the full output into a variable first, then parse with `sed -n`/`awk` (which read all input, no early close).
- SA extraction: stderr preserved — Removed `2>/dev/null` from `helm template` in SA-name extraction. Stderr is now captured via `2>&1` into the output variable and printed on failure, giving actionable diagnostics (wrong chart version, missing values file, OCI pull error) instead of a silent empty result.
- Rollback block: `set -euo pipefail` — The webhook rollback script (Option 1) was the last runnable block without fail-fast. Added for consistency — all runnable blocks now have it.
- SIGPIPE: here-strings for multi-line variables — Variable-to-command patterns for large captured output (chart values, helm template, controller logs) now use here-strings (`grep ... <<<"$VAR"`, `awk ... <<<"$VAR"`, `sed ... <<<"$VAR"`) instead of pipes. Token-check patterns on small jsonpath outputs retain pipes (no practical SIGPIPE risk for single-buffer writes). Superseded by v3.15 item 1.
- Chart key validation: yq path checks — Pre-merge validation uses `yq -e` exit-code predicates (`yq -e '.pki.certManager'`, etc.) against captured chart values instead of a loose `grep -q 'certManager:'` which can false-pass. Superseded by v3.15 item 2.
- Recovery instruction: `kubectl wait` — Recovery step "Confirm webhook is Ready" changed from `kubectl get pods` (informational, exits 0 regardless) to `kubectl wait --for=condition=Ready ... --timeout=120s`, matching the Phase 2 gate pattern.
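Two of the patterns above (`{ grep ... || true; }` under `set -e`, and `grep -c .` instead of `wc -l`) can be shown together in a minimal, self-contained sketch; the sample data is invented:

```shell
set -euo pipefail
DATA=$'alpha\n\nbeta'   # three lines, one of them empty

# Under set -e, a non-matching grep inside $(...) would abort the script.
# The group-with-`|| true` wrapper preserves the empty result for explicit handling.
MATCH=$({ grep 'gamma' <<<"$DATA" || true; })
if [ -z "$MATCH" ]; then
  echo "no match — handled explicitly instead of dying"
fi

# Counting non-empty lines: wc -l counts the blank line, grep -c . does not.
echo "wc -l:     $(wc -l <<<"$DATA")"       # 3
echo "grep -c .: $(grep -c . <<<"$DATA")"   # 2
```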
v3.13 changes
- Permission-job: Python instead of jq — The conflict-path policy inspection used `jq` for JSON parsing, but `amazon/aws-cli:2.15.0` does not include `jq`. Replaced with Python (bundled internally by the AWS CLI image), using the same discovery logic as env-vars-job.
- Reconcile evidence: `--since-time` instead of `--since=5m` — The 5-minute rolling window could catch log lines from a prior reconcile, causing false-pass. Now captures a `LOG_SINCE` timestamp immediately before the annotation and uses `--since-time=$LOG_SINCE` for an exact window.
- Separate `GITOPS_ROOT` variable — `REPO_ROOT/cluster-gitops` assumed repos are co-located (nested). Added `GITOPS_ROOT` as a separate variable in Prerequisites, defaulting to `${REPO_ROOT}/cluster-gitops` but overridable for different checkout layouts (CI, separate worktrees).
- yq v4+ explicit requirement — The `ireduce` syntax used for the Helm-style deep merge requires yq v4+. Prerequisites now state the version requirement explicitly.
- Auth check: WARNING → hard FAIL — Phase 2 ACK controller auth error detection was a warning (continue). For a deployment gate, credential errors must block — upgraded to `exit 1`.
- Header/date refresh — Bumped version to v3.13, updated date to 2026-02-08.
- Phase 1 paths: `GITOPS_ROOT` + `INFRA_ROOT` — Phase 1 Terraform verification used hardcoded relative paths (`../../../cluster-gitops/...`) that break in non-co-located repo layouts. Now uses `${GITOPS_ROOT}` for helm values and `${INFRA_ROOT}` for the `cd` target. Added an `INFRA_ROOT` variable to Prerequisites alongside `GITOPS_ROOT`.
- Reconcile annotation: nanosecond uniqueness — `date +%s` can repeat on fast reruns (same second). Now uses `date +%s%N` (nanoseconds) with `case`-based detection and an `epoch-$$` fallback (superseded by item 11 for macOS correctness).
- Permission-job: Python existence guard — The conflict-path `PYTHON` discovery lacked the `--version` check present in env-vars-job. Added the same explicit verification and actionable error message.
- Mutation assertions: exact role ARN — All three mutation check blocks (Fix 0, Phase 2, Phase 2b sa-test) now assert the exact expected role ARN (`syrf-ack-controllers` for ACK controllers, `syrf-ack-setup-job` for sa-test) instead of just non-empty. Catches wrong-role injection.
- Nanosecond timestamp: macOS-safe — `date +%s%N` returns a literal `N` on macOS (exit 0, so the `||` fallback never fires). Now uses a `case` check: if the output ends in `N`, falls back to `epoch-$$` (PID, POSIX-portable unlike `$RANDOM`).
- Fix 0 token file path assertion — Fix 0 verification now asserts the exact `AWS_WEB_IDENTITY_TOKEN_FILE` path (was presence-only; Phase 2 already had it).
- Auth grep: web-identity patterns — Added `InvalidIdentityToken`, `ExpiredToken`, `WebIdentityErr` to the auth error grep in Phase 2, covering common OIDC/STS federation failures.
- Fail-fast directory guards — The Prerequisites block now validates that `REPO_ROOT`, `GITOPS_ROOT`, and `INFRA_ROOT` all exist before any phase runs.
- Permission conflict: Python error handling — Both Python JSON-parsing invocations now wrap `json.loads` in try/except. On malformed or empty policy payloads, prints the raw payload and exits with an actionable error instead of an opaque Python traceback.
- Changelog consistency — Item 8 still referenced the `${RANDOM}` fallback; corrected to match item 11's `epoch-$$` implementation.
- ArgoCD checks: gating — Replaced informational `argocd app get` (always exits 0) with `argocd app wait --health --timeout`, which exits non-zero if health/sync is not reached. Phase 2 steps 4-5 are now actual deployment gates.
- Container selector: `containers[*]` — All mutation jsonpath queries now use `containers[*]` instead of `containers[0]`. Sidecar injection (e.g. Istio, Linkerd) can shift the main container from index 0; the wildcard matches across all containers.
- Timestamp: stricter portability — Added `2>/dev/null` to suppress stderr from exotic shells, and a `${RECONCILE_TS:-}` empty-string guard in the `case` check for shells where the variable might be unset.
- Parameterised ARNs — Extracted hardcoded `arn:aws:iam::318789018510:role/...` into `AWS_ACCOUNT_ID`, `ACK_CONTROLLER_ROLE_ARN`, and `SETUP_JOB_ROLE_ARN` variables in Prerequisites. All phases now reference these variables, making the scripts reusable across accounts.
- Fix 0 standalone safety — The Fix 0 verification block referenced `$ACK_CONTROLLER_ROLE_ARN` (defined later in Prerequisites). Now inlines the ARN directly with a comment pointing to Prerequisites for the parameterised version used by the Phases.
- Multi-container value safety — `containers[*]` jsonpath can concatenate values from multiple containers (whitespace-separated). Equality checks (`!=`) replaced with token-splitting + exact match (see item 26). Prevents false-fails when sidecars are present.
- yq v4 runtime validation — The `command -v yq` presence check now also extracts the major version number and requires `>= 4`. If yq v3 is installed, falls back to hardcoded values with a clear message instead of failing with an opaque `ireduce` syntax error.
- AWS account sanity gate — Prerequisites now run `aws sts get-caller-identity` and assert the returned account matches `$AWS_ACCOUNT_ID`. Prevents cross-account mistakes (wrong profile, personal creds).
- sa-test cleanup trap — Phase 2b now sets `trap ... EXIT` before creating the sa-test pod, ensuring cleanup on timeout, assertion failure, or Ctrl-C. Manual `kubectl delete` calls removed from failure paths (the trap handles them).
- Exact-token matching — Replaced `grep -qF` (substring) with `tr ' ' '\n' | grep -qxF` (exact line match after word-splitting). Prevents false-pass when a longer ARN contains the expected ARN as a prefix (e.g. `syrf-ack-controllers-v2` matching `syrf-ack-controllers`).
- Fix 0 context awareness — Added kubectl context visibility (`kubectl config current-context`) and derived `EXPECTED_ACCOUNT` from the inline ARN. The standalone block can't call AWS APIs (no AWS context). Upgraded to a hard gate in item 32.
- yq merge error handling — Wrapped `yq ea` in an `if` guard: on failure (missing files, parse error), captures the error message and falls back to hardcoded values instead of aborting. The fallback message now says "v4+ not available or merge failed" (was misleadingly "not available" even for v3).
- get-policy guard — `aws lambda get-policy` in the permission-job conflict path now has explicit `|| { ... exit 1; }` error handling with an actionable message. Also guards against an empty/None response. Previously, `set -e` would cause an opaque abort on API failure.
- ArgoCD wait timeouts: realistic — Increased the ACK controller health timeout 120s → 300s (image pull + CRD registration + AWS round-trip). Increased the s3-notifier timeout 180s → 600s (3 sync waves + PostSync job, each involving AWS API calls).
- Reconcile annotation: extra entropy on macOS — The `epoch-$$` fallback can repeat if rerun in the same second from the same shell (the PID doesn't change). Now appends a monotonic `_RECONCILE_SEQ` counter that increments on each invocation.
- Fix 0: hard-gate kubectl context — The standalone Fix 0 block now asserts `kubectl config current-context` matches the expected GKE context (`gke_camarades-net_europe-west2-a_camaradesuk`). Exits on mismatch instead of just printing the context.
- Whitespace splitting: `tr -s '[:space:]'` — All `tr ' ' '\n'` instances replaced with `tr -s '[:space:]' '\n'`. Splits on all whitespace characters (tabs, newlines from jsonpath), not just spaces, and collapses consecutive whitespace to avoid empty lines.
- Stale comment cleanup — Comments referencing the old `grep -qF` (contains) semantics updated to reflect the current exact-token approach (`tr -s '[:space:]' '\n' | grep -qxF`). Changelog item 22 also updated to cross-reference item 26.
- Force-reconcile fallback trigger — The Phase 4b annotation-based trigger may not fire if ACK runtime predicates ignore metadata-only changes. Added a documented manual fallback path: `argocd app sync --force` + re-check, with guidance on when to investigate further.
- Parameterised `EXPECTED_CONTEXT` — Hard-coded `gke_camarades-net_europe-west2-a_camaradesuk` moved to Prerequisites alongside the account/role variables. Fix 0 standalone uses a `${EXPECTED_CONTEXT:-...}` default so it works both standalone and after Prerequisites.
- Unique sa-test pod names — The fixed pod name `sa-test` replaced with `sa-test-$$` (PID suffix) to avoid `AlreadyExists` errors on repeated verification runs. All references (create, wait, inspect, cleanup trap, delete) updated to use `$SA_TEST_POD`.
- EXPECTED_VALS: `jq -n --arg` — Phase 4b `EXPECTED_VALS` construction replaced string interpolation (`"\"$EXP_HOST\""`) with `jq -n --arg Host ... --arg Region ... --arg User ...`. Avoids escaping issues if values contain special characters.
- Error message consistency — All "does not contain expected" FAIL messages updated to "missing exact expected token" with both expected and actual values quoted, matching the `grep -qxF` exact-token semantics.
- macOS date note scope — Clarified that the `date +%s%N` / BSD workaround applies only to local shell verification scripts, not to cluster runtime (Kubernetes pods use GNU coreutils).
- Webhook `failurePolicy: Fail` scoping — Added a `namespaceSelector` (label: `syrf.io/aws-identity=enabled`) to the fail-closed webhook configuration. Without scoping, `Fail` blocks ALL pod creation cluster-wide if the webhook is unavailable. Added a risk note explaining the blast-radius difference.
- S3 IAM actions: remove phantoms — Removed `s3:GetObjectMetadata` and `s3:HeadObject` from the staging Lambda role policy. Neither is a real IAM action — AWS authorises HeadObject under `s3:GetObject`.
- ACK rollout timeout: 60s → 180s — `kubectl rollout status` on cold image pulls (first deploy or node reschedule) can exceed 60s. Increased to 180s to match the same order as the ArgoCD health timeouts.
- sa-test pod: timestamp+PID — `sa-test-$$` can collide across repeated runs in the same shell (the PID doesn't change). Changed to `sa-test-$(date +%s)-$$` for uniqueness across both time and shell sessions.
- Reconcile gate: automated fallback — Replaced the manual-only fallback comment with an automated two-attempt loop. Attempt 1 uses the annotation trigger; on timeout with no evidence, attempt 2 automatically runs `argocd app sync --force` and re-checks. Reduces operator variance while still hard-failing if both attempts show no reconcile evidence.
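The exact-token matching fix above can be demonstrated without a cluster; a minimal sketch with hypothetical ARNs (the account ID is a placeholder, not the real one):

```shell
set -euo pipefail
EXPECTED="arn:aws:iam::111111111111:role/syrf-ack-controllers"
ACTUAL="arn:aws:iam::111111111111:role/syrf-ack-controllers-v2"   # wrong role

# Old substring check: false-passes because EXPECTED is a prefix of ACTUAL
if echo "$ACTUAL" | grep -qF "$EXPECTED"; then
  echo "substring check: PASS (false positive)"
fi

# New exact-token check: split on all whitespace, require a whole-line match
if echo "$ACTUAL" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED"; then
  echo "exact check: PASS"
else
  echo "exact check: FAIL (wrong role correctly rejected)"
fi
```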
v3.12 changes
- Reconcile gate: lastTransitionTime-based — Replace the resourceVersion two-bump proof (race-prone: ACK can reconcile before `AFTER_ANNOTATION_RV` is captured) with a `lastTransitionTime` comparison on the ResourceSynced condition. Immune to timing races. Graceful fallback if the condition stays True throughout (env var validation is the authoritative check).
- Distroless-safe token check — Replace `kubectl exec ... ls ...` (fails on distroless/minimal ACK images) with `kubectl get pod ... -o jsonpath` to check the projected volume mount path in the pod spec.
- sa-test race fix — Add `kubectl wait --for=condition=Ready` before inspecting the pod spec (was race-prone: `sleep 10` + immediate inspect).
- Pod selection: newest pod after rollout — Replace `sort | head -1` (picks the oldest/first alphabetically, may miss the newly-mutated pod) with `--sort-by=.metadata.creationTimestamp` + last item (newest pod, the one created after webhook install).
- `--force` fallback — Add an explicit fallback for Fix 9: if `--force` still doesn't rerun hooks, delete the old hook Job so the `BeforeHookCreation` delete policy triggers recreation on the next sync. Scoped to s3-notifier labels to avoid deleting unrelated PostSync jobs. Keep `--force` on the retry path too.
- Reconcile gate: expected-value assertions — After the before/after comparison, also assert env var keys and non-sensitive values match the expected config (Host, Username, Region). Catches the case where both before AND after are wrong.
- Reconcile wait: 30s → 60s — Controller + AWS API round-trip can be slow; 30s was flaky-prone.
- Token-volume check: combined assertion — Replace the weak `grep -c token`/`grep token` with specific jsonpath for the mount path + `AWS_ROLE_ARN` + `AWS_WEB_IDENTITY_TOKEN_FILE`. Fail explicitly if any are missing.
- JSON comparison: jq-normalized — Pipe all `aws --output json` through `jq -Sc '.'` for whitespace/key-order normalization. Prevents brittle raw string comparison across AWS CLI versions.
- Reconcile timeout: controller log check — On timeout, verify controller logs mention `s3-notifier-staging` as secondary evidence that the reconcile ran (guards against false-pass from stale state).
- Hook delete: `--ignore-not-found` — Make the fallback `kubectl delete job` idempotent.
- sa-test: `--restart=Never` — Remove kubectl CLI version ambiguity for `kubectl run`.
- Reconcile proof: hard fail on no evidence — Upgrade the no-log-evidence path from WARNING to hard FAIL. Continuing on stale state defeats the purpose of the reconciliation gate.
- Cross-repo gate: immutable ref — The pin instruction now requires a specific commit SHA (not `main`) to prevent the ref from advancing while prerequisites are pending.
- Expected values: source from config — Use `yq` to extract the expected Host/Username/Region from cluster-gitops Helm values files, falling back to hardcoded values if `yq` is unavailable. Prevents drift.
- Prerequisites section — Added a required CLI tools list (kubectl, helm, argocd, aws, jq, yq) with auth instructions before the verification phases.
- Token file path: exact assertion — Assert `AWS_WEB_IDENTITY_TOKEN_FILE` equals the expected path `/var/run/secrets/eks.amazonaws.com/serviceaccount/token`, not just non-empty.
- yq merge: proper overlay — Replace the broken `tail -1` (drops service defaults when the env file omits a key) with `yq ea '. as $item ireduce ({}; . * $item)'` for a correct Helm-style deep merge.
- Permission idempotency: verify on conflict — On `ResourceConflictException`, inspect the existing policy statement for the correct `source-arn`/`source-account`. If mismatched (stale from a previous bucket), remove and re-add.
- Execution context: `REPO_ROOT` — All phases now use the `REPO_ROOT` variable for path references. Set once in Prerequisites, used in the Phase 1 `cd`, Phase 3 `helm template`, and Phase 4b `yq` paths.
- Auth log check: robust capture — Capture logs into a variable first (fail if retrieval fails), widen the window from `--since=2m` to `--since=5m`, narrow the grep to auth-specific errors only.
- Reconcile evidence grep: multi-pattern — Match the K8s CR name (`s3-notifier-staging`), the AWS function name (`syrfAppUploadS3Notifier-staging`), and the annotation timestamp. Capture logs into a variable with error handling.
- sa-test: full mutation assertion — Replace the `grep AWS` env-name check with exact jsonpath assertions for `AWS_ROLE_ARN`, the `AWS_WEB_IDENTITY_TOKEN_FILE` path, and the token volume mount. Clean up the pod on failure.
- Webhook chart source: aligned — Pre-merge `helm show values` now uses the same Helm repo (jkroepke/helm-charts) as the `config.yaml` deployment, not the OCI alternative.
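The jq normalization item can be illustrated without AWS access. A minimal sketch with invented JSON payloads:

```shell
set -euo pipefail
# Two logically identical payloads, differing only in key order and whitespace
A='{"b":1,"a":2}'
B='{ "a": 2, "b": 1 }'

# Raw string comparison is brittle across AWS CLI versions / formatting
[ "$A" = "$B" ] || echo "raw compare: strings differ"

# jq -Sc sorts keys (-S) and compacts (-c), producing one canonical form
NORM_A=$(jq -Sc . <<<"$A")
NORM_B=$(jq -Sc . <<<"$B")
if [ "$NORM_A" = "$NORM_B" ]; then
  echo "normalized compare: equal ($NORM_A)"
fi
```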
v3.11 changes
- Reconcile gate: two-bump proof — Record the resourceVersion after the annotation write, then wait for it to change again (proving the controller reconciled, not just the annotation write). Previous code checked `!= BEFORE_RV`, which passed after just the annotation write.
- Phase 4b: compare non-sensitive values, not just keys — Use JMESPath to extract `{Host, User, Region}` (redact RabbitMqPassword). Catches value drift that key-only comparison misses.
- Fix 9 immediate fix: `--force` flag — `argocd app sync s3-notifier-staging --force --grpc-web` to ensure PostSync hooks rerun even when the app is already in sync. Without `--force`, hooks may be skipped.
- Env-vars job post-update wait: values-driven — Replace the hardcoded `UPDATE_WAIT=30` with `{{ .Values.setupJob.timeouts.lambdaConfigReady }}` to match the pre-update wait and allow tuning for AWS control-plane lag.
- Pod selection: count guard with deterministic selection — Replace `.items[0]` with `{.items[*].metadata.name}` + a `wc -w` count check + `sort | head -1`. Error if 0 pods, warn if >1 (expected during rollout).
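The count-guard pattern is easiest to see with the cluster call stubbed out; in this sketch, `POD_LIST` stands in for the jsonpath output (pod names are invented):

```shell
set -euo pipefail
# Stand-in for: kubectl get pod ... -o jsonpath='{.items[*].metadata.name}'
POD_LIST="ack-s3-controller-abc ack-s3-controller-def"

POD_COUNT=$(echo "$POD_LIST" | wc -w)
if [ "$POD_COUNT" -eq 0 ]; then
  echo "ERROR: no Running pods found"
  exit 1
fi
if [ "$POD_COUNT" -gt 1 ]; then
  echo "WARNING: $POD_COUNT pods found (may be mid-rollout), using newest"
fi
# With --sort-by=.metadata.creationTimestamp upstream, the last name is the newest
POD=$(echo "$POD_LIST" | tr -s '[:space:]' '\n' | tail -1)
echo "selected pod: $POD"
```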
Context
After implementing the initial ACK migration code (Helm chart, cluster-gitops configs, Phase 0 Terraform), three rounds of code review identified 9 original issues plus 3 critical infrastructure gaps. This plan addresses all fixes.
Critical gaps found in review rounds 2-3:
- Fix 0: Pod Identity Webhook — eks.amazonaws.com/role-arn annotations do nothing on GKE without it
- Fix 1b: Plugins AppProject — missing ACK namespace/source repos blocks ACK controller deployment
- Fix 1c: ACK health checks — ArgoCD can't interpret ACK CRD status, risking PostSync deadlock
Repositories touched:
- syrf.pr2321 (monorepo) — Helm chart template fixes
- cluster-gitops — ArgoCD config, plugins, projects, health checks
- camarades-infrastructure — Terraform IAM roles + policy fixes
Fix 0 (NEW, CRITICAL): Install AWS Pod Identity Webhook
Problem: eks.amazonaws.com/role-arn annotations on ServiceAccounts do NOTHING on GKE. On EKS, a mutating admission webhook intercepts these annotations and injects projected token volumes + AWS_ROLE_ARN/AWS_WEB_IDENTITY_TOKEN_FILE env vars. Without this webhook, both ACK controllers and setup jobs have NO AWS credentials.
Current AWS auth uses static IAM credentials (aws-s3, aws-ses secrets with keyId/accessKey) — no OIDC/web-identity in use today.
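For orientation, the annotation the webhook reacts to lives on the ServiceAccount. A minimal sketch; the SA name and role ARN here are illustrative placeholders, not the real values:

```yaml
# Hypothetical example — the webhook mutates pods whose SA carries this annotation
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ack-s3-controller            # illustrative name
  namespace: ack-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/example-role
```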
Solution: Install amazon-eks-pod-identity-webhook as a cluster-gitops plugin using the jkroepke Helm chart.
Prerequisites: cert-manager (already installed), GKE projected SA token support (available since GKE 1.20+).
Files to create:
1. cluster-gitops/plugins/helm/pod-identity-webhook/config.yaml
plugin:
name: pod-identity-webhook
repoURL: https://jkroepke.github.io/helm-charts
chart: amazon-eks-pod-identity-webhook
version: "2.6.0"
namespace: pod-identity-webhook
2. cluster-gitops/plugins/helm/pod-identity-webhook/values.yaml
# AWS Pod Identity Webhook for GKE -> AWS OIDC federation
# Mutates pods with eks.amazonaws.com/role-arn annotation to inject:
# - AWS_ROLE_ARN env var
# - AWS_WEB_IDENTITY_TOKEN_FILE env var
# - Projected service account token volume (audience: sts.amazonaws.com)
#
# No namespaceSelector needed for initial deploy (failurePolicy: Ignore).
# Mutation only triggers on pods whose SA has the eks.amazonaws.com/role-arn annotation.
# When switching to failurePolicy: Fail later, ADD a namespaceSelector to limit blast radius
# (see "Fail-closed gate" section below).
# TLS via cert-manager (already installed as cluster plugin)
# Chart default is true, but explicit for clarity
pki:
certManager:
enabled: true
# Token audience for AWS STS (chart key: tokenAudience, default: sts.amazonaws.com)
# Explicit for documentation even though it matches the default
config:
tokenAudience: sts.amazonaws.com
# Webhook failure policy
# Ignore = fail-open: pods start without AWS creds if webhook is down (safe for initial rollout)
# Fail = fail-closed: pods blocked if webhook is down (safer once webhook is stable)
# Start with Ignore, switch to Fail after confirming webhook stability
mutatingWebhook:
failurePolicy: Ignore
# namespaceSelector: <-- NOT needed while failurePolicy is Ignore.
# Add when switching to Fail (see "Fail-closed gate" section below).
resources:
requests:
memory: "32Mi"
cpu: "25m"
limits:
memory: "64Mi"
cpu: "50m"
Ordering note: The webhook must be running before ACK controller pods start, otherwise they won't get AWS credentials injected. selfHeal does NOT recreate existing pods — it only reconciles K8s manifests against git. After confirming the webhook is healthy, you must explicitly restart ACK controller deployments:
# After webhook pod is Running AND Ready (both conditions required):
kubectl wait --for=condition=Ready pod -n pod-identity-webhook \
-l app.kubernetes.io/name=amazon-eks-pod-identity-webhook --timeout=60s
# Then restart ACK controllers:
kubectl rollout restart deployment -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller
kubectl rollout restart deployment -n ack-system -l app.kubernetes.io/instance=ack-s3-controller
# New pods will be mutated by the webhook with projected token volumes
If restart happened before webhook was ready (pods have no AWS env vars):
- Confirm webhook is Ready: `kubectl wait --for=condition=Ready pod -n pod-identity-webhook -l app.kubernetes.io/name=amazon-eks-pod-identity-webhook --timeout=120s`
- Re-do the rollout restart: `kubectl rollout restart deployment -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller`
- Wait for completion: `kubectl rollout status deployment -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller --timeout=180s`
- Re-run the verification checks below — this time the new pods should have projected volumes and AWS env vars
Verification (after webhook + ACK controller restart)
set -euo pipefail
# This block runs standalone (before Prerequisites/Phases).
# Define expected ARN inline here; Phases use $ACK_CONTROLLER_ROLE_ARN from Prerequisites.
EXPECTED_ACCOUNT="318789018510"
EXPECTED_CONTROLLER_ROLE="arn:aws:iam::${EXPECTED_ACCOUNT}:role/syrf-ack-controllers"
EXPECTED_TOKEN_PATH="/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
# 0. Hard-gate: verify kubectl context targets the expected cluster.
# Only checks kubectl context, not AWS — this block doesn't call AWS APIs.
# Inline here for standalone use; Phases use $EXPECTED_CONTEXT from Prerequisites.
EXPECTED_CONTEXT="${EXPECTED_CONTEXT:-gke_camarades-net_europe-west2-a_camaradesuk}"
CURRENT_CONTEXT=$(kubectl config current-context 2>/dev/null || echo "NONE")
echo "kubectl context: $CURRENT_CONTEXT (expected: $EXPECTED_CONTEXT)"
if [ "$CURRENT_CONTEXT" != "$EXPECTED_CONTEXT" ]; then
echo "ERROR: kubectl context mismatch — aborting to prevent operating on the wrong cluster."
echo "Run: kubectl config use-context $EXPECTED_CONTEXT"
exit 1
fi
# 1. Verify webhook pod is running and ready (hard gate, not informational)
kubectl wait --for=condition=Ready pod \
-n pod-identity-webhook \
-l app.kubernetes.io/name=amazon-eks-pod-identity-webhook \
--timeout=120s
# 2. Verify mutation on BOTH ACK controller pods
for CONTROLLER in ack-lambda-controller ack-s3-controller; do
echo "--- Checking $CONTROLLER ---"
# Capture a Running pod with count guard (deterministic: newest by creation timestamp)
POD_LIST=$(kubectl get pod -n ack-system -l app.kubernetes.io/instance="$CONTROLLER" \
--field-selector=status.phase=Running --sort-by=.metadata.creationTimestamp \
-o jsonpath='{.items[*].metadata.name}')
POD_COUNT=$(echo "$POD_LIST" | wc -w)
if [ "$POD_COUNT" -eq 0 ] || [ -z "$POD_LIST" ]; then
echo "ERROR: No Running pods found for $CONTROLLER"
echo "All pods:"
kubectl get pods -n ack-system -l app.kubernetes.io/instance="$CONTROLLER"
exit 1
fi
if [ "$POD_COUNT" -gt 1 ]; then
echo "WARNING: $POD_COUNT Running pods found for $CONTROLLER (may be mid-rollout), using newest"
fi
POD=$(echo "$POD_LIST" | tr -s '[:space:]' '\n' | tail -1) # Last = newest by creationTimestamp
echo "Checking pod: $POD"
# Verify webhook mutation: projected token volume mount + AWS env vars (combined assertion)
# Use containers[*] (not containers[0]) — sidecar injection can shift container order.
# With containers[*], jsonpath may concatenate values from multiple containers
# (space-separated); the exact-token checks further down handle that.
TOKEN_MOUNT=$(kubectl get pod "$POD" -n ack-system \
-o jsonpath='{.spec.containers[*].volumeMounts[?(@.mountPath=="/var/run/secrets/eks.amazonaws.com/serviceaccount")].mountPath}')
ROLE_ARN=$(kubectl get pod "$POD" -n ack-system \
-o jsonpath='{.spec.containers[*].env[?(@.name=="AWS_ROLE_ARN")].value}')
TOKEN_FILE=$(kubectl get pod "$POD" -n ack-system \
-o jsonpath='{.spec.containers[*].env[?(@.name=="AWS_WEB_IDENTITY_TOKEN_FILE")].value}')
echo " Token mount: ${TOKEN_MOUNT:-MISSING}"
echo " AWS_ROLE_ARN: ${ROLE_ARN:-MISSING} (expected: $EXPECTED_CONTROLLER_ROLE)"
echo " AWS_WEB_IDENTITY_TOKEN_FILE: ${TOKEN_FILE:-MISSING} (expected: $EXPECTED_TOKEN_PATH)"
# Pass condition: all three must be present
if [ -z "$TOKEN_MOUNT" ] || [ -z "$ROLE_ARN" ] || [ -z "$TOKEN_FILE" ]; then
echo "FAIL: Webhook mutation incomplete for $CONTROLLER pod $POD"
exit 1
fi
# Exact-token check — containers[*] may return whitespace-separated values from multiple containers.
# Split on all whitespace (tr -s '[:space:]') and grep -qxF for exact line match.
if ! echo "$ROLE_ARN" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED_CONTROLLER_ROLE"; then
echo "FAIL: AWS_ROLE_ARN missing exact expected token '$EXPECTED_CONTROLLER_ROLE', got: $ROLE_ARN"
exit 1
fi
if ! echo "$TOKEN_FILE" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED_TOKEN_PATH"; then
echo "FAIL: AWS_WEB_IDENTITY_TOKEN_FILE missing exact expected token '$EXPECTED_TOKEN_PATH', got: $TOKEN_FILE"
exit 1
fi
done
# If any check fails: webhook isn't mutating. Check:
# - Did rollout restart happen AFTER webhook was Running?
# - Does the ACK controller SA have the eks.amazonaws.com/role-arn annotation?
# - kubectl logs -n pod-identity-webhook -l app.kubernetes.io/name=amazon-eks-pod-identity-webhook
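The second troubleshooting item (SA annotation) can also be checked mechanically. A minimal sketch — the kubectl invocation and ServiceAccount names are assumptions (adjust to the chart's actual SA names); the helper is exact-match, in the same spirit as the grep -qxF checks above:

```shell
# check_role_annot: exact-match an IRSA annotation value against the expected
# role ARN; empty counts as missing. Pure helper, so it is testable offline.
check_role_annot() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}
EXPECTED_CONTROLLER_ROLE="arn:aws:iam::318789018510:role/syrf-ack-controllers"
# Live usage (SA name assumed to match the controller instance name):
# ANNOT=$(kubectl get sa ack-lambda-controller -n ack-system \
#   -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')
# check_role_annot "$ANNOT" "$EXPECTED_CONTROLLER_ROLE" \
#   && echo "OK: annotation matches" || echo "FAIL: annotation missing/mismatched"
check_role_annot "$EXPECTED_CONTROLLER_ROLE" "$EXPECTED_CONTROLLER_ROLE" && echo "helper self-check: OK"
```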
Pre-merge verification (chart key validation)¶
Before deploying the webhook, verify the chart values schema matches the actual chart. The values keys below were verified against chart values.yaml v2.6.0 on 2026-02-07, but should be re-checked before merge:
set -euo pipefail
# Tool gate — yq v4+ is required for path-aware key checks below.
command -v yq >/dev/null 2>&1 || { echo "ERROR: yq not found. Install yq v4+: https://github.com/mikefarah/yq"; exit 1; }
YQ_MAJOR=$(yq --version 2>&1 | awk 'match($0, /[0-9]+/) {print substr($0, RSTART, RLENGTH); exit}' || true)
if [ "${YQ_MAJOR:-0}" -lt 4 ]; then
echo "ERROR: yq v4+ required (found v${YQ_MAJOR:-unknown}). yq -e and path queries require v4 syntax."
exit 1
fi
# Pull the chart from the SAME repo used in deployment (config.yaml repoURL)
# to avoid "validated one artifact, deployed another" risk.
helm repo add jkroepke https://jkroepke.github.io/helm-charts 2>/dev/null || true
helm repo update jkroepke
# Capture full output into variable — all subsequent parsing uses here-strings,
# so no pipeline SIGPIPE is possible under set -euo pipefail.
CHART_VALUES=$(helm show values jkroepke/amazon-eks-pod-identity-webhook --version 2.6.0)
echo "First 60 lines of chart values:"
sed -n '1,60p' <<<"$CHART_VALUES"
# Hard-assert required LEAF keys exist via yq path checks (catches chart schema changes before deploy).
# Leaf-level paths (e.g. pki.certManager.enabled, not pki.certManager) ensure the specific keys
# our values.yaml overrides actually exist in the chart schema.
MISSING_KEYS=""
yq -e '.pki.certManager.enabled' <<<"$CHART_VALUES" >/dev/null 2>&1 || MISSING_KEYS="$MISSING_KEYS pki.certManager.enabled"
yq -e '.config.tokenAudience' <<<"$CHART_VALUES" >/dev/null 2>&1 || MISSING_KEYS="$MISSING_KEYS config.tokenAudience"
yq -e '.mutatingWebhook.failurePolicy' <<<"$CHART_VALUES" >/dev/null 2>&1 || MISSING_KEYS="$MISSING_KEYS mutatingWebhook.failurePolicy"
if [ -n "$MISSING_KEYS" ]; then
echo "ERROR: Required chart keys not found:$MISSING_KEYS"
echo "Chart schema may have changed — update values.yaml before deploying."
exit 1
fi
echo "All required chart keys verified."
Fail-closed gate: switching failurePolicy from Ignore to Fail¶
The webhook starts with failurePolicy: Ignore (fail-open) for safe initial rollout. This means pods start without AWS credentials if the webhook is down — they fail later at AWS API call time, not at pod admission.
Switch to Fail (fail-closed) when ALL of these acceptance criteria are met:
- Webhook pod has been Running continuously for 48+ hours without restarts
- At least one ACK controller rollout restart has been verified with mutation working (env vars + projected volume present)
- At least one full ArgoCD sync cycle of s3-notifier has completed successfully (PostSync jobs passed)
- Webhook pod resource usage is stable (not approaching limits)
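The first criterion (48h continuous Running, zero restarts) can be gated mechanically. A sketch, with the decision logic as a pure function; the commented kubectl lines that would feed it assume a single webhook container and are illustrative only:

```shell
# uptime_gate: pass only if restart count is 0 AND the pod has been up >= 48h.
# Args: <restart count> <pod start epoch seconds> <current epoch seconds>
uptime_gate() {
  restarts="$1"; start_epoch="$2"; now_epoch="$3"
  min_uptime=$((48 * 3600))
  [ "$restarts" -eq 0 ] && [ $((now_epoch - start_epoch)) -ge "$min_uptime" ]
}
# Feeding it from the cluster (single-container assumption):
# RESTARTS=$(kubectl get pod -n pod-identity-webhook \
#   -l app.kubernetes.io/name=amazon-eks-pod-identity-webhook \
#   -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}')
# START=$(date -d "$(kubectl get pod -n pod-identity-webhook \
#   -l app.kubernetes.io/name=amazon-eks-pod-identity-webhook \
#   -o jsonpath='{.items[0].status.startTime}')" +%s)   # GNU date
# uptime_gate "$RESTARTS" "$START" "$(date +%s)" && echo "uptime gate: PASS"
uptime_gate 0 "$(( $(date +%s) - 49 * 3600 ))" "$(date +%s)" && echo "example: PASS"
```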
How to switch:
# cluster-gitops/plugins/helm/pod-identity-webhook/values.yaml
mutatingWebhook:
failurePolicy: Fail # Changed from Ignore after acceptance criteria met
# IMPORTANT: Scope the webhook to only namespaces that need AWS credentials.
# Without scoping, failurePolicy: Fail blocks ALL pod creation cluster-wide
# if the webhook is unavailable (e.g. during upgrades, OOM, node drain).
namespaceSelector:
matchExpressions:
- key: syrf.io/aws-identity
operator: In
values: ["enabled"]
# Label the target namespaces (--overwrite handles wrong-value correction):
# kubectl label ns ack-system syrf.io/aws-identity=enabled --overwrite
# kubectl label ns syrf-staging syrf.io/aws-identity=enabled --overwrite
# kubectl label ns syrf-production syrf.io/aws-identity=enabled --overwrite
Pre-check: verify namespace labels exist BEFORE switching to Fail:
set -euo pipefail
REQUIRED_LABEL="syrf.io/aws-identity"
for NS in ack-system syrf-staging syrf-production; do
LABEL_VAL=$(kubectl get ns "$NS" -o jsonpath="{.metadata.labels['syrf\.io/aws-identity']}" 2>/dev/null || echo "")
if [ "$LABEL_VAL" != "enabled" ]; then
echo "ERROR: Namespace '$NS' missing label $REQUIRED_LABEL=enabled"
echo "Run: kubectl label ns $NS $REQUIRED_LABEL=enabled --overwrite"
exit 1
fi
echo "OK: $NS has $REQUIRED_LABEL=enabled"
done
echo "All namespaces labelled. Safe to switch failurePolicy to Fail."
Risk of NOT switching: If the webhook goes down silently, new pods start without AWS creds and fail at runtime with opaque errors (STS auth failures, missing env vars). Fail policy makes the problem immediately visible at pod admission time.
Risk of switching WITHOUT scoping: If the webhook is unavailable (crash, upgrade, node drain), Fail blocks ALL pod creation cluster-wide — including kube-system, monitoring, and ArgoCD itself. The namespaceSelector above limits blast radius to only namespaces that actually need AWS identity injection.
Operational note — new namespaces: When failurePolicy: Fail with namespaceSelector is active, any new namespace that needs AWS credential injection (e.g. syrf-pr-123 for PR previews, or a future syrf-production split) must be labeled syrf.io/aws-identity=enabled before deploying pods that depend on IRSA. Without the label, the webhook won't match the namespace, credentials won't be injected, and pods will fail at AWS API call time with opaque STS errors — the same failure mode as not having the webhook at all. Add the label as part of namespace creation (e.g. in the PR preview workflow or namespace manifest).
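As a sketch, the label can ship in the namespace manifest itself so injection works from the first pod (namespace name taken from the PR-preview example above; adjust to the actual workflow):

```yaml
# Hypothetical PR-preview namespace manifest — label included at creation time
# so the webhook's namespaceSelector matches before any IRSA-dependent pod starts.
apiVersion: v1
kind: Namespace
metadata:
  name: syrf-pr-123
  labels:
    syrf.io/aws-identity: "enabled"
```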
Rollback (if Fail blocks pod admission):
Option 1 — Immediate (manual, overwritten on next ArgoCD sync):
set -euo pipefail
# Find the webhook configuration name (enforce single match)
WEBHOOK_CONFIGS=$(kubectl get mutatingwebhookconfigurations -o name | { grep pod-identity || true; })
CONFIG_COUNT=$(echo "$WEBHOOK_CONFIGS" | grep -c . || true)
if [ -z "$WEBHOOK_CONFIGS" ]; then
echo "ERROR: No MutatingWebhookConfiguration matching 'pod-identity' found."
kubectl get mutatingwebhookconfigurations -o name
exit 1
fi
if [ "$CONFIG_COUNT" -gt 1 ]; then
echo "ERROR: Multiple MutatingWebhookConfigurations match 'pod-identity':"
echo "$WEBHOOK_CONFIGS"
exit 1
fi
WEBHOOK_CONFIG="$WEBHOOK_CONFIGS"
# Get the webhook entry name by matching the pod-identity pattern.
# Enforce exactly one match — if grep returns multiple lines, abort.
WEBHOOK_ENTRIES=$(kubectl get "$WEBHOOK_CONFIG" \
-o jsonpath='{.webhooks[*].name}' | tr -s '[:space:]' '\n' | { grep pod-identity || true; })
MATCH_COUNT=$(echo "$WEBHOOK_ENTRIES" | grep -c . || true)
if [ -z "$WEBHOOK_ENTRIES" ]; then
echo "ERROR: No webhook entry matching 'pod-identity'. All entries:"
kubectl get "$WEBHOOK_CONFIG" -o jsonpath='{.webhooks[*].name}' | tr -s '[:space:]' '\n'
exit 1
fi
if [ "$MATCH_COUNT" -gt 1 ]; then
echo "ERROR: Multiple webhook entries match 'pod-identity' — ambiguous:"
echo "$WEBHOOK_ENTRIES"
echo "Specify the exact webhook name in the patch command instead."
exit 1
fi
WEBHOOK_ENTRY="$WEBHOOK_ENTRIES"
# Patch failurePolicy back to Ignore using strategic merge (safe regardless of webhook order)
kubectl patch "$WEBHOOK_CONFIG" --type='strategic' \
-p="{\"webhooks\":[{\"name\":\"${WEBHOOK_ENTRY}\",\"failurePolicy\":\"Ignore\"}]}"
# Pods can schedule again immediately — ArgoCD will overwrite this on next sync (~5 min)
Option 2 — Durable (via GitOps):
# cluster-gitops/plugins/helm/pod-identity-webhook/values.yaml
mutatingWebhook:
failurePolicy: Ignore # Reverted — investigate webhook health before re-enabling Fail
Commit, push. ArgoCD auto-syncs within ~5 minutes.
Alternative considered:¶
GKE Workload Identity could map GCP SAs to AWS roles, but that adds GCP-to-AWS indirection. The pod identity webhook is the standard pattern for cross-cloud AWS auth and matches our Terraform OIDC provider setup.
Fix 1a (HIGH): ArgoCD SyRF project whitelists missing ACK CRDs¶
ArgoCD projects only whitelist standard K8s resource types. ACK CRDs will be rejected.
Files:
- cluster-gitops/argocd/projects/syrf-staging.yaml — add to namespaceResourceWhitelist
- cluster-gitops/argocd/projects/syrf-production.yaml — same additions
Add these entries to namespaceResourceWhitelist in both files (after existing entries, before namespaceResourceBlacklist):
# ACK CRDs for S3 Notifier (Lambda + S3 bucket management)
- group: 'lambda.services.k8s.aws'
kind: Function
- group: 's3.services.k8s.aws'
kind: Bucket
- group: 'services.k8s.aws'
kind: AdoptedResource
Note: ExternalSecret is not added here — no service in this change creates ExternalSecret resources in syrf namespaces (the chart's own external-secret.yaml is being deleted in Fix 5, and the ClusterExternalSecret is managed by the plugins project). Add it only when a syrf-project chart actually needs to create ExternalSecret resources.
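A post-merge spot check for the whitelist can reuse the exact-match pattern from elsewhere in this plan. Sketch below — the commented kubectl invocation is an assumption (AppProject name/namespace may differ); has_entry itself is a pure stdin helper:

```shell
# has_entry: exact-line match of a "group kind" pair on stdin (no substring
# false-positives), so it can be fed from kubectl/argocd output or a sample.
has_entry() {
  grep -qxF "$1"
}
REQUIRED="lambda.services.k8s.aws Function
s3.services.k8s.aws Bucket
services.k8s.aws AdoptedResource"
# Live usage:
# ENTRIES=$(kubectl get appproject syrf-staging -n argocd \
#   -o jsonpath='{range .spec.namespaceResourceWhitelist[*]}{.group} {.kind}{"\n"}{end}')
ENTRIES="$REQUIRED"   # stand-in sample; replace with the kubectl capture above
while IFS= read -r PAIR; do
  if has_entry "$PAIR" <<<"$ENTRIES"; then echo "OK: $PAIR"; else echo "MISSING: $PAIR"; fi
done <<<"$REQUIRED"
```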
Fix 1b (NEW, CRITICAL): Plugins AppProject — missing ACK destinations + sources¶
Problem: cluster-gitops/argocd/projects/plugins.yaml is missing:
- ack-system namespace in destinations — ACK controllers deploy there
- pod-identity-webhook namespace in destinations
- oci://public.ecr.aws/aws-controllers-k8s in sourceRepos — ACK charts are OCI-based
- https://jkroepke.github.io/helm-charts in sourceRepos — pod identity webhook chart
Without these, ArgoCD silently refuses to sync ACK controller and webhook Applications.
File: cluster-gitops/argocd/projects/plugins.yaml
Add to sourceRepos (after existing entries):
- oci://public.ecr.aws/aws-controllers-k8s # ACK Lambda + S3 controllers
- https://jkroepke.github.io/helm-charts # pod-identity-webhook
OCI sourceRepos format — canonical: oci:// prefix. The ACK plugin configs specify repoURL: oci://public.ecr.aws/aws-controllers-k8s (with oci:// prefix). The ApplicationSet passes this URL through verbatim to generated Applications. ArgoCD matches sourceRepos against the Application's repoURL, so the sourceRepos entry must include the oci:// prefix to match. Verified: ArgoCD docs show OCI sources with oci:// in Application specs.
Fallback if oci:// prefix is rejected (unlikely but documented): strip to public.ecr.aws/aws-controllers-k8s. Verify after merge:
set -euo pipefail
EXPECTED_OCI="oci://public.ecr.aws/aws-controllers-k8s"
# Confirm sourceRepos entry via exact yq match (not substring grep):
PROJ_YAML=$(argocd proj get plugins -o yaml --grpc-web)
if yq -e ".spec.sourceRepos[] | select(. == \"$EXPECTED_OCI\")" <<<"$PROJ_YAML" >/dev/null 2>&1; then
echo "OK: sourceRepos contains exact entry $EXPECTED_OCI"
else
echo "ERROR: sourceRepos does not contain $EXPECTED_OCI"
echo "Current sourceRepos entries:"
yq '.spec.sourceRepos[]' <<<"$PROJ_YAML" 2>/dev/null || echo "(could not parse project YAML)"
exit 1
fi
# Dry-run sync to check permission end-to-end:
argocd app sync ack-lambda-controller --dry-run --grpc-web
echo "OK: dry-run sync succeeded — OCI source permitted."
# If "application repo is not permitted in project", switch to non-prefixed form
Add to destinations (after existing entries):
- namespace: ack-system # ACK controllers
server: https://kubernetes.default.svc
- namespace: pod-identity-webhook # AWS Pod Identity Webhook
server: https://kubernetes.default.svc
Fix 1c (NEW, CRITICAL): ACK health customization in argocd-cm¶
Problem: ArgoCD doesn't know how to interpret ACK CRD status. Without health checks, ACK resources (Function, Bucket) may show as Unknown health, which could prevent PostSync hooks from running — a deadlock.
Solution: Add Lua health checks for ACK CRDs, following the existing pattern in argocd-cm.yaml (which already has checks for ESO and DatabaseLifecycle CRDs).
ACK uses these status conditions:
- ACK.ResourceSynced = True → resource matches AWS state (Healthy)
- ACK.Terminal = True → unrecoverable error (Degraded)
- ACK.Recoverable = True → transient error, controller will retry (Degraded — surfaces the issue instead of hiding behind Progressing)
- Neither → still reconciling (Progressing)
Version note: These condition types are from ACK Lambda Controller v1.5.2 and S3 Controller v1.0.14. If controller versions are upgraded, validate condition names against live CRs: kubectl get function -n syrf-staging -o jsonpath='{.items[0].status.conditions[*].type}'
File: cluster-gitops/argocd/resources/argocd-cm.yaml
Add after the DatabaseLifecycle health check (line 144):
# Custom health checks for ACK (AWS Controllers for Kubernetes) CRDs
# Without these, ACK resources show as Unknown, potentially blocking PostSync hooks
# Precedence: Terminal (unrecoverable) → ResourceSynced (healthy) → Recoverable (transient)
# If both ResourceSynced and Recoverable are True, resource IS synced — treat as Healthy.
# Recoverable only matters when ResourceSynced is NOT True.
resource.customizations.health.lambda.services.k8s.aws_Function: |
hs = {}
if obj.status ~= nil then
if obj.status.conditions ~= nil then
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.Terminal" and condition.status == "True" then
hs.status = "Degraded"
hs.message = condition.message or "Terminal error"
return hs
end
end
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.ResourceSynced" and condition.status == "True" then
hs.status = "Healthy"
hs.message = condition.message or "Resource synced with AWS"
return hs
end
end
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.Recoverable" and condition.status == "True" then
hs.status = "Degraded"
hs.message = condition.message or "Recoverable error (will retry)"
return hs
end
end
end
end
hs.status = "Progressing"
hs.message = "Waiting for Lambda function to sync with AWS"
return hs
# Same precedence as Lambda Function: Terminal → ResourceSynced → Recoverable
resource.customizations.health.s3.services.k8s.aws_Bucket: |
hs = {}
if obj.status ~= nil then
if obj.status.conditions ~= nil then
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.Terminal" and condition.status == "True" then
hs.status = "Degraded"
hs.message = condition.message or "Terminal error"
return hs
end
end
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.ResourceSynced" and condition.status == "True" then
hs.status = "Healthy"
hs.message = condition.message or "Resource synced with AWS"
return hs
end
end
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.Recoverable" and condition.status == "True" then
hs.status = "Degraded"
hs.message = condition.message or "Recoverable error (will retry)"
return hs
end
end
end
end
hs.status = "Progressing"
hs.message = "Waiting for S3 bucket to sync with AWS"
return hs
resource.customizations.health.services.k8s.aws_AdoptedResource: |
hs = {}
if obj.status ~= nil then
if obj.status.conditions ~= nil then
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.Adopted" and condition.status == "True" then
hs.status = "Healthy"
hs.message = condition.message or "Resource adopted"
return hs
end
if condition.type == "ACK.Terminal" and condition.status == "True" then
hs.status = "Degraded"
hs.message = condition.message or "Adoption failed"
return hs
end
end
end
end
hs.status = "Progressing"
hs.message = "Waiting for resource adoption"
return hs
Fix 2 (HIGH): Lambda K8s metadata.name in camelCase (invalid)¶
functionName helper produces syrfAppUploadS3Notifier — invalid for K8s metadata.name (must be DNS-1123 lowercase). Need separate K8s name vs AWS function name.
File: _helpers.tpl
Add new helper s3-notifier.k8sName:
{{/*
Kubernetes resource name (DNS-1123 compliant, lowercase)
Used for metadata.name on Function CRD and AdoptedResource
*/}}
{{- define "s3-notifier.k8sName" -}}
{{- if eq .Values.environmentName "production" -}}
s3-notifier
{{- else -}}
s3-notifier-{{ .Values.environmentName }}
{{- end -}}
{{- end }}
File: function.yaml
- Line 4: Change metadata.name from {{ include "s3-notifier.functionName" . }} to {{ include "s3-notifier.k8sName" . }}
- Line 15: Keep spec.name as {{ include "s3-notifier.functionName" . }} (AWS name, camelCase OK)
File: adopted-resource.yaml
- Line 6: Change metadata.name from {{ include "s3-notifier.functionName" . }} to {{ include "s3-notifier.k8sName" . }}
- Line 19: Change kubernetes.metadata.name from {{ include "s3-notifier.functionName" . }} to {{ include "s3-notifier.k8sName" . }}
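A quick way to verify the fix is to check rendered names against the rule Fix 2 targets. Sketch — the helm/yq invocation in the comments is an assumption (chart path and flags vary); is_dns1123 implements the strict RFC 1123 label form (some resources also accept the looser dotted subdomain form):

```shell
# is_dns1123: lowercase alphanumerics and '-', must start/end alphanumeric,
# max 63 chars — the strict DNS-1123 label check.
is_dns1123() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?$' && [ "${#1}" -le 63 ]
}
is_dns1123 "s3-notifier-staging" && echo "s3-notifier-staging: valid"
is_dns1123 "syrfAppUploadS3Notifier" || echo "syrfAppUploadS3Notifier: INVALID (camelCase)"
# Against the rendered chart (path and values flags are assumptions):
# helm template src/services/s3-notifier/.chart --set environmentName=staging \
#   | yq -N 'select(.metadata.name != null) | .metadata.name' \
#   | while read -r NAME; do is_dns1123 "$NAME" || echo "INVALID: $NAME"; done
```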
Fix 3 (HIGH): PostSync permission ordering deadlock — with improvements¶
Problem: Bucket CRD (wave 2) sets S3 notification pointing to Lambda. But AWS PutBucketNotificationConfiguration validates that the Lambda has an S3 invoke permission. Permission is currently granted by PostSync Job — which runs AFTER all sync waves. Deadlock: bucket can't sync because permission doesn't exist yet.
Solution: Split setup-job.yaml into two separate jobs:
- permission-job.yaml — Sync hook at wave 2 (runs during sync, after Function at wave 1)
- env-vars-job.yaml — PostSync hook (runs after everything syncs successfully)
- Bump bucket.yaml to wave 3 (after permission job at wave 2)
Ordering after fix:
Wave 1: Function CRD (Lambda created in AWS)
Wave 2: permission-job (Sync hook — adds S3 invoke permission to Lambda)
Wave 3: Bucket CRD (S3 notification now valid because permission exists)
PostSync: env-vars-job (sets Lambda env vars including RabbitMQ password from K8s Secret)
Delete: src/services/s3-notifier/.chart/templates/setup-job.yaml
Create: src/services/s3-notifier/.chart/templates/permission-job.yaml
{{- if .Values.setupJob.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
name: s3-notifier-permission-{{ .Values.environmentName }}
labels:
{{- include "s3-notifier.labels" . | nindent 4 }}
annotations:
argocd.argoproj.io/hook: Sync
argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
argocd.argoproj.io/sync-wave: "2"
spec:
ttlSecondsAfterFinished: 600
backoffLimit: 3
template:
metadata:
labels:
{{- include "s3-notifier.labels" . | nindent 8 }}
spec:
serviceAccountName: {{ .Values.setupJob.serviceAccountName }}
restartPolicy: Never
containers:
- name: permission
image: {{ .Values.setupJob.image }}
resources:
{{- toYaml .Values.setupJob.resources | nindent 12 }}
env:
- name: AWS_DEFAULT_REGION
value: {{ .Values.awsRegion }}
- name: AWS_ACCOUNT_ID
value: "{{ .Values.awsAccountId }}"
- name: FUNCTION_NAME
value: {{ include "s3-notifier.functionName" . }}
- name: BUCKET_NAME
value: {{ .Values.bucket.name }}
- name: ENVIRONMENT_NAME
value: {{ .Values.environmentName }}
command:
- /bin/sh
- -c
- |
set -euo pipefail
echo "=== S3 Notifier Permission Setup ==="
echo "Function: $FUNCTION_NAME"
echo "Bucket: $BUCKET_NAME"
# Preflight: verify AWS credentials are working
echo "Verifying AWS credentials..."
aws sts get-caller-identity --output text --query 'Arn'
# Wait for Lambda to reach Active state (ACK may still be creating it)
echo "Waiting for Lambda to become Active..."
MAX_WAIT={{ .Values.setupJob.timeouts.lambdaActive }}
ELAPSED=0
STATE="unknown"
while [ $ELAPSED -lt $MAX_WAIT ]; do
STATE=$(aws lambda get-function --function-name "$FUNCTION_NAME" \
--query 'Configuration.State' --output text 2>/dev/null || echo "NotFound")
if [ "$STATE" = "Active" ]; then
echo "Lambda is Active."
break
fi
echo "Lambda state: $STATE (waiting...)"
sleep 5
ELAPSED=$((ELAPSED + 5))
done
if [ "$STATE" != "Active" ]; then
echo "ERROR: Lambda did not reach Active state within ${MAX_WAIT}s (state: $STATE)"
exit 1
fi
# Grant S3 -> Lambda invoke permission (idempotent via try/catch)
# Instead of parsing get-policy JSON (formatting-dependent), attempt
# add-permission directly and catch ResourceConflictException.
STATEMENT_ID="s3-invoke-${ENVIRONMENT_NAME}"
SOURCE_ARN="arn:aws:s3:::${BUCKET_NAME}"
echo "Adding S3 invoke permission (statement: $STATEMENT_ID)..."
set +e # Temporarily allow non-zero exit
ADD_RESULT=$(aws lambda add-permission \
--function-name "$FUNCTION_NAME" \
--principal s3.amazonaws.com \
--statement-id "$STATEMENT_ID" \
--action lambda:InvokeFunction \
--source-arn "$SOURCE_ARN" \
--source-account "$AWS_ACCOUNT_ID" \
--output text --query 'Statement' 2>&1)
ADD_RC=$?
set -e
if [ $ADD_RC -eq 0 ]; then
echo "Permission added successfully."
elif echo "$ADD_RESULT" | grep -q "ResourceConflictException"; then
# Existing statement found — verify it has the correct source-arn and source-account.
# If mismatched (e.g. stale from a previous bucket), replace it.
#
# NOTE: We use Python (not jq) for JSON parsing because the job image
# (amazon/aws-cli:2.15.0) does not include jq. Python is bundled internally
# by the AWS CLI — same discovery logic as env-vars-job.
echo "Permission '$STATEMENT_ID' already exists. Verifying..."
# get-policy can fail if the resource policy was deleted between add-permission
# and this call, or on transient API errors. Guard explicitly since set -e is active.
EXISTING_POLICY=$(aws lambda get-policy --function-name "$FUNCTION_NAME" \
--query 'Policy' --output text 2>&1) || {
echo "ERROR: Failed to retrieve existing Lambda policy for verification"
echo " get-policy output: ${EXISTING_POLICY:-(empty)}"
echo " Retrying add-permission may resolve this (transient race)."
exit 1
}
if [ -z "$EXISTING_POLICY" ] || [ "$EXISTING_POLICY" = "None" ]; then
echo "ERROR: get-policy returned empty/None — policy may have been deleted mid-flight"
exit 1
fi
PYTHON=$(command -v python3 2>/dev/null \
|| echo "/usr/local/aws-cli/v2/current/dist/python3")
if ! "$PYTHON" --version >/dev/null 2>&1; then
echo "ERROR: python3 not found — cannot parse existing policy for verification"
echo "Workaround: switch setupJob.image to a custom image with python3 in PATH"
exit 1
fi
EXISTING_SOURCE=$("$PYTHON" -c "
import json, sys
try:
policy = json.loads(sys.argv[1])
except (json.JSONDecodeError, IndexError) as e:
print('ERROR: Failed to parse policy JSON: ' + str(e), file=sys.stderr)
sys.exit(1)
stmt = next((s for s in policy.get('Statement', []) if s.get('Sid') == sys.argv[2]), {})
cond = stmt.get('Condition', {})
arn_like = cond.get('ArnLike', {})
print(arn_like.get('AWS:SourceArn', arn_like.get('aws:SourceArn', '')))
" "$EXISTING_POLICY" "$STATEMENT_ID") || {
echo "ERROR: Failed to parse existing policy — payload may be empty or malformed"
echo "Raw policy: ${EXISTING_POLICY:-(empty)}"
exit 1
}
EXISTING_ACCOUNT=$("$PYTHON" -c "
import json, sys
try:
policy = json.loads(sys.argv[1])
except (json.JSONDecodeError, IndexError) as e:
print('ERROR: Failed to parse policy JSON: ' + str(e), file=sys.stderr)
sys.exit(1)
stmt = next((s for s in policy.get('Statement', []) if s.get('Sid') == sys.argv[2]), {})
cond = stmt.get('Condition', {})
str_eq = cond.get('StringEquals', {})
print(str_eq.get('AWS:SourceAccount', str_eq.get('aws:SourceAccount', '')))
" "$EXISTING_POLICY" "$STATEMENT_ID") || {
echo "ERROR: Failed to parse existing policy — payload may be empty or malformed"
exit 1
}
if [ "$EXISTING_SOURCE" = "$SOURCE_ARN" ] && [ "$EXISTING_ACCOUNT" = "$AWS_ACCOUNT_ID" ]; then
echo "Existing statement matches (source-arn: $SOURCE_ARN, source-account: $AWS_ACCOUNT_ID). OK."
else
echo "WARNING: Existing statement has stale config (source-arn: $EXISTING_SOURCE, source-account: $EXISTING_ACCOUNT)."
echo "Replacing with correct values..."
aws lambda remove-permission --function-name "$FUNCTION_NAME" --statement-id "$STATEMENT_ID"
aws lambda add-permission \
--function-name "$FUNCTION_NAME" \
--principal s3.amazonaws.com \
--statement-id "$STATEMENT_ID" \
--action lambda:InvokeFunction \
--source-arn "$SOURCE_ARN" \
--source-account "$AWS_ACCOUNT_ID" \
--output text --query 'Statement'
echo "Permission replaced successfully."
fi
else
echo "ERROR: add-permission failed: $ADD_RESULT"
exit 1
fi
echo "=== Permission setup complete ==="
{{- end }}
Key improvements over original setup-job:
- Wait-for-active loop: Polls Lambda state until Active (max 120s) before attempting add-permission, preventing race with ACK still creating the function.
- --source-account: Constrains the S3 permission to only allow invocations from our AWS account (prevents confused deputy).
- Preflight sts get-caller-identity: Fails fast if OIDC credentials aren't working.
Create: src/services/s3-notifier/.chart/templates/env-vars-job.yaml
{{- if .Values.setupJob.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
name: s3-notifier-env-vars-{{ .Values.environmentName }}
labels:
{{- include "s3-notifier.labels" . | nindent 4 }}
annotations:
argocd.argoproj.io/hook: PostSync
argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
argocd.argoproj.io/sync-wave: "10"
spec:
ttlSecondsAfterFinished: 600
backoffLimit: 3
template:
metadata:
labels:
{{- include "s3-notifier.labels" . | nindent 8 }}
spec:
serviceAccountName: {{ .Values.setupJob.serviceAccountName }}
restartPolicy: Never
containers:
- name: env-vars
image: {{ .Values.setupJob.image }}
resources:
{{- toYaml .Values.setupJob.resources | nindent 12 }}
env:
- name: AWS_DEFAULT_REGION
value: {{ .Values.awsRegion }}
- name: FUNCTION_NAME
value: {{ include "s3-notifier.functionName" . }}
- name: RABBITMQ_HOST
value: {{ .Values.rabbitMq.host }}
- name: RABBITMQ_USERNAME
value: {{ .Values.rabbitMq.username }}
- name: S3_REGION
value: {{ .Values.awsRegion }}
- name: RABBITMQ_PASSWORD
valueFrom:
secretKeyRef:
name: {{ .Values.rabbitMq.passwordSecretName }}
key: {{ .Values.rabbitMq.passwordSecretKey }}
optional: true # Pod must start even if secret is missing — script pre-check catches it
command:
- /bin/sh
- -c
- |
set -euo pipefail
echo "=== S3 Notifier Env Vars Setup ==="
echo "Function: $FUNCTION_NAME"
# Pre-check: verify rabbit-mq secret is available
# Use ${:-} to avoid set -u trap — with optional: true the var may be unset, not empty
if [ -z "${RABBITMQ_PASSWORD:-}" ]; then
echo "ERROR: RABBITMQ_PASSWORD is empty or unset - rabbit-mq secret may be missing from namespace"
echo "Ensure ClusterExternalSecret 'rabbit-mq' targets this namespace"
exit 1
fi
# Preflight: verify AWS credentials are working
echo "Verifying AWS credentials..."
aws sts get-caller-identity --output text --query 'Arn'
# Wait for Lambda to be ready for configuration updates
echo "Waiting for Lambda to accept configuration updates..."
MAX_WAIT={{ .Values.setupJob.timeouts.lambdaConfigReady }}
ELAPSED=0
LAST_STATUS="unknown"
while [ $ELAPSED -lt $MAX_WAIT ]; do
LAST_STATUS=$(aws lambda get-function-configuration \
--function-name "$FUNCTION_NAME" \
--query 'LastUpdateStatus' --output text 2>/dev/null || echo "Unknown")
# NOTE: --output text renders a null LastUpdateStatus as "None", not "null"
if [ "$LAST_STATUS" = "Successful" ] || [ "$LAST_STATUS" = "None" ]; then
break
fi
echo "LastUpdateStatus: $LAST_STATUS (waiting...)"
sleep 5
ELAPSED=$((ELAPSED + 5))
done
if [ "$LAST_STATUS" != "Successful" ] && [ "$LAST_STATUS" != "None" ]; then
echo "ERROR: Lambda not ready for config updates within ${MAX_WAIT}s (status: $LAST_STATUS)"
exit 1
fi
# NOTE: --environment replaces ALL env vars. This is intentional —
# all 4 vars (RabbitMqHost, RabbitMqUsername, RabbitMqPassword, S3Region)
# are declared here as the canonical source of truth. Any env var not
# listed here will be removed. This matches Terraform's existing behavior.
#
# We build JSON via Python rather than CLI shorthand "Variables={k=v,...}"
# because the RabbitMQ password may contain special characters
# (=, commas, braces, quotes) that break the shorthand parser.
#
# python3 is NOT in PATH in amazon/aws-cli — it's bundled internally.
# Locate it, or fail with an actionable error.
# NOTE: We use a heredoc (<<'PYEOF') instead of python -c "..."
# because indented code inside -c causes IndentationError when
# the script is embedded in a YAML block scalar.
echo "Setting Lambda environment variables..."
PYTHON=$(command -v python3 2>/dev/null \
|| echo "/usr/local/aws-cli/v2/current/dist/python3")
if ! "$PYTHON" --version >/dev/null 2>&1; then
echo "ERROR: python3 not found at PATH or /usr/local/aws-cli/v2/current/dist/python3"
echo "The amazon/aws-cli image bundles Python — check if the path has changed"
echo "Workaround: switch setupJob.image to a custom image with python3 in PATH"
exit 1
fi
ENV_JSON=$("$PYTHON" <<'PYEOF'
import json, os
print(json.dumps({'Variables': {
'RabbitMqHost': os.environ['RABBITMQ_HOST'],
'RabbitMqUsername': os.environ['RABBITMQ_USERNAME'],
'RabbitMqPassword': os.environ['RABBITMQ_PASSWORD'],
'S3Region': os.environ['S3_REGION']
}}))
PYEOF
)
aws lambda update-function-configuration \
--function-name "$FUNCTION_NAME" \
--environment "$ENV_JSON" \
--output text --query 'FunctionName'
# Wait for update to complete — Lambda may be in InProgress state
echo "Waiting for update to complete..."
UPDATE_WAIT={{ .Values.setupJob.timeouts.lambdaConfigReady }}
UPDATE_ELAPSED=0
UPDATE_STATUS="InProgress"
while [ $UPDATE_ELAPSED -lt $UPDATE_WAIT ]; do
UPDATE_STATUS=$(aws lambda get-function-configuration \
--function-name "$FUNCTION_NAME" \
--query 'LastUpdateStatus' --output text 2>/dev/null || echo "Unknown")
if [ "$UPDATE_STATUS" = "Successful" ]; then
break
fi
if [ "$UPDATE_STATUS" = "Failed" ]; then
echo "ERROR: Lambda update failed"
aws lambda get-function-configuration --function-name "$FUNCTION_NAME" \
--query 'LastUpdateStatusReason' --output text
exit 1
fi
sleep 2
UPDATE_ELAPSED=$((UPDATE_ELAPSED + 2))
done
if [ "$UPDATE_STATUS" != "Successful" ]; then
echo "ERROR: Update did not complete within ${UPDATE_WAIT}s (status: $UPDATE_STATUS)"
echo "Check: aws lambda get-function-configuration --function-name $FUNCTION_NAME --query LastUpdateStatusReason"
exit 1
fi
echo "Environment variables set successfully."
echo "=== Env vars setup complete ==="
{{- end }}
File: bucket.yaml
- Line 10: Change sync-wave from "2" to "3" and update comment:
argocd.argoproj.io/sync-wave: "3" # After permission-job (wave 2) — S3 notification needs Lambda permission to exist
Fix 3 improvements (v3)¶
3a. Hard fail on env-vars-job timeout — Added exit 1 after the LastUpdateStatus wait loop if Lambda doesn't become ready. Previously the script would silently proceed past the timeout.
3b. Configurable timeouts via values — Timeouts are now templated from values.yaml:
setupJob:
  # ...existing...
  timeouts:
    lambdaActive: 120        # seconds to wait for Lambda Active state (permission-job)
    lambdaConfigReady: 60    # seconds to wait for LastUpdateStatus (env-vars-job)
3c. rabbit-mq secret pre-check — Added early validation in env-vars-job that RABBITMQ_PASSWORD is non-empty before making AWS calls. Provides actionable error message if ClusterExternalSecret hasn't propagated to the namespace.
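The 3c pre-check can be sketched as a small guard at the top of the env-vars-job script (a sketch; the wrapper function and messages are illustrative, the `rabbit-mq` secret name comes from Fix 5):

```shell
# Guard (sketch): fail fast with an actionable message when the rabbit-mq
# ClusterExternalSecret has not yet propagated RABBITMQ_PASSWORD into the namespace.
check_rabbitmq_password() {
  if [ -z "${RABBITMQ_PASSWORD:-}" ]; then
    echo "ERROR: RABBITMQ_PASSWORD is empty; the rabbit-mq secret may not have propagated yet." >&2
    echo "Check: kubectl get secret rabbit-mq -n <namespace>" >&2
    return 1
  fi
}
# Call check_rabbitmq_password before the first aws CLI invocation.
```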
Fix 4 (HIGH): ack-setup-job ServiceAccount missing + least-privilege role¶
No manifest creates the ack-setup-job ServiceAccount. The permission-job and env-vars-job will fail. Additionally, the original plan reused the syrf-ack-controllers role which has full Lambda CRUD + S3 bucket management — far more than the setup jobs need.
4a. Create dedicated IAM role (Terraform)¶
File: camarades-infrastructure/terraform/lambda/ack-iam.tf
Add a new syrf-ack-setup-job role with minimal permissions:
# Least-privilege IAM role for s3-notifier setup jobs
# Only needs: update Lambda config, manage Lambda permissions, read Lambda state
resource "aws_iam_role" "ack_setup_job" {
  name = "syrf-ack-setup-job"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Federated = aws_iam_openid_connect_provider.gke.arn
        }
        Action = "sts:AssumeRoleWithWebIdentity"
        Condition = {
          StringEquals = {
            "${replace(local.gke_oidc_issuer, "https://", "")}:aud" = "sts.amazonaws.com"
            "${replace(local.gke_oidc_issuer, "https://", "")}:sub" = [
              "system:serviceaccount:syrf-staging:ack-setup-job",
              "system:serviceaccount:syrf-production:ack-setup-job"
            ]
          }
        }
      }
    ]
  })
  tags = {
    Purpose   = "Least-privilege role for s3-notifier setup jobs"
    ManagedBy = "Terraform"
  }
}

resource "aws_iam_role_policy" "ack_setup_job_lambda" {
  name = "LambdaConfigAndPermissions"
  role = aws_iam_role.ack_setup_job.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "LambdaConfigManagement"
        Effect = "Allow"
        Action = [
          "lambda:GetFunction",
          "lambda:GetFunctionConfiguration",
          "lambda:UpdateFunctionConfiguration",
          "lambda:GetPolicy",
          "lambda:AddPermission",
          "lambda:RemovePermission"
        ]
        Resource = [
          "arn:aws:lambda:eu-west-1:${data.aws_caller_identity.current.account_id}:function:syrfAppUploadS3Notifier",
          "arn:aws:lambda:eu-west-1:${data.aws_caller_identity.current.account_id}:function:syrfAppUploadS3Notifier-*"
        ]
      }
    ]
  })
}

output "ack_setup_job_role_arn" {
  description = "IAM role ARN for s3-notifier setup jobs (use in Helm SA annotation)"
  value       = aws_iam_role.ack_setup_job.arn
}
Note the trust policy uses exact SA names (not wildcards) for both aud and sub constraints — syrf-staging:ack-setup-job and syrf-production:ack-setup-job only. If a new environment is added, the trust policy must be updated.
4b+4c (MERGED — resolves v2 contradiction)¶
Previous plan had 4b ("add aud to both statements") then 4c ("remove second statement") — contradictory for implementers.
Merged fix: In aws_iam_role.ack_controllers trust policy, replace the existing two-statement policy with a single statement that has BOTH aud and sub constraints:
File: camarades-infrastructure/terraform/lambda/ack-iam.tf
assume_role_policy = jsonencode({
  Version = "2012-10-17"
  Statement = [
    {
      Effect = "Allow"
      Principal = {
        Federated = aws_iam_openid_connect_provider.gke.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${replace(local.gke_oidc_issuer, "https://", "")}:aud" = "sts.amazonaws.com"
          "${replace(local.gke_oidc_issuer, "https://", "")}:sub" = [
            "system:serviceaccount:ack-system:ack-lambda-controller",
            "system:serviceaccount:ack-system:ack-s3-controller"
          ]
        }
      }
    }
  ]
})
The second statement (setup-job SA trust) is removed entirely — setup jobs use their own syrf-ack-setup-job role now (Fix 4a).
4d. Create ServiceAccount manifest¶
Create: src/services/s3-notifier/.chart/templates/serviceaccount.yaml
{{- if .Values.setupJob.enabled }}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ .Values.setupJob.serviceAccountName }}
  labels:
    {{- include "s3-notifier.labels" . | nindent 4 }}
  annotations:
    # Least-privilege IAM role for setup jobs (GKE -> AWS via OIDC federation)
    eks.amazonaws.com/role-arn: "arn:aws:iam::{{ .Values.awsAccountId }}:role/syrf-ack-setup-job"
{{- end }}
4e. Update values.yaml default¶
File: values.yaml
Add setupJob.iamRoleName default:
setupJob:
  enabled: true
  image: amazon/aws-cli:2.15.0
  serviceAccountName: ack-setup-job
  iamRoleName: syrf-ack-setup-job  # Least-privilege role (not syrf-ack-controllers)
  timeouts:
    lambdaActive: 120        # seconds to wait for Lambda Active state
    lambdaConfigReady: 60    # seconds to wait for LastUpdateStatus
Update serviceaccount.yaml to use this value instead of hardcoding the role name:
eks.amazonaws.com/role-arn: "arn:aws:iam::{{ .Values.awsAccountId }}:role/{{ .Values.setupJob.iamRoleName }}"
Fix 5 (HIGH): Secret collision/key mismatch¶
Problem: The chart's external-secret.yaml creates an ExternalSecret for rabbit-mq with key password. But the existing ClusterExternalSecret (cluster-gitops/argocd/local/argocd-secrets/values.yaml:53-62) already creates rabbit-mq in all service namespaces with key rabbitmq-password. Two competing secrets = collision. And the key name doesn't match.
Solution:
1. Delete external-secret.yaml from chart (reuse existing ClusterExternalSecret)
2. Fix passwordSecretKey from password to rabbitmq-password in all values files
Delete: src/services/s3-notifier/.chart/templates/external-secret.yaml
File: values.yaml (chart defaults)
- Line 35: Change passwordSecretKey: password to passwordSecretKey: rabbitmq-password
File: cluster-gitops/syrf/services/s3-notifier/values.yaml
- Line 10: Change passwordSecretKey: password to passwordSecretKey: rabbitmq-password
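A quick sweep catches any values file the rename missed (a sketch; `check_stale_password_key` is illustrative and assumes GNU grep):

```shell
# Fail (non-zero) when any values.yaml under the given root still uses the old
# key name; the offending file:line pairs are printed by grep itself.
check_stale_password_key() {
  if grep -rnE --include='values.yaml' 'passwordSecretKey:[[:space:]]*password[[:space:]]*$' "$1"; then
    echo "ERROR: stale 'passwordSecretKey: password' entries found above" >&2
    return 1
  fi
  echo "OK: no stale passwordSecretKey entries under $1"
}
```

Run once per repo root, e.g. `check_stale_password_key "$GITOPS_ROOT"`.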
Fix 6 (MEDIUM): OIDC issuer URL — /zones/ vs /locations/¶
Terraform is correct. GKE zonal clusters use /zones/ in the OIDC issuer URL (confirmed via gcloud container clusters describe). The technical plan doc incorrectly says /locations/.
No code change needed. Only the technical plan doc needs updating (Fix 10).
Fix 7 (MEDIUM): Staging using production Lambda role¶
Problem: Staging uses syrfS3NotifierProductionLambdaRole whose S3 policy is scoped to arn:aws:s3:::syrfapp-uploads/* (camarades-infrastructure/terraform/lambda/main.tf:57). The staging bucket syrfapp-uploads-staging doesn't match. Widening the production role would weaken environment separation.
Solution: Create a dedicated staging Lambda execution role with S3 policy scoped to syrfapp-uploads-staging/*.
File: camarades-infrastructure/terraform/lambda/main.tf
Add new resources (following the existing production/preview role pattern):
# IAM role for staging Lambda execution
resource "aws_iam_role" "staging_lambda_role" {
  name = "syrfS3NotifierStagingLambdaRole"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })
  tags = {
    Purpose = "Staging Lambda execution role"
  }
}

resource "aws_iam_role_policy_attachment" "staging_lambda_basic" {
  role       = aws_iam_role.staging_lambda_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

resource "aws_iam_role_policy" "staging_lambda_s3" {
  name = "S3ReadAccess"
  role = aws_iam_role.staging_lambda_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject"
          # NOTE: s3:GetObjectMetadata and s3:HeadObject are NOT real IAM actions.
          # AWS authorises HeadObject under s3:GetObject. See:
          # https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-with-s3-actions.html
        ]
        Resource = "arn:aws:s3:::${var.s3_bucket_name}-staging/*"
      }
    ]
  })
}
Also add `iam:PassRole` for the new staging role to `aws_iam_role_policy.ack_lambda` in `camarades-infrastructure/terraform/lambda/ack-iam.tf` (lines 117-119):
Resource = [
  aws_iam_role.production_lambda_role.arn,
  aws_iam_role.preview_lambda_role.arn,
  aws_iam_role.staging_lambda_role.arn
]
File: cluster-gitops/syrf/environments/staging/s3-notifier/values.yaml
Change `executionRoleArn` to reference the new `syrfS3NotifierStagingLambdaRole` role.
Fix 8 (MEDIUM): Runtime inconsistency — NOT A BUG¶
template-local.yaml uses provided.al2023 for local SAM testing only. Production uses dotnet10 correctly. No change needed.
Fix 9 (MEDIUM): Add S3Region to env vars + ACK reconciliation note¶
Problem: The existing Terraform sets S3Region as a Lambda env var. The setup job omits it.
Solution: Already addressed in Fix 3 — the new env-vars-job.yaml includes S3Region=$S3_REGION.
ACK reconciliation risk: Since the Function CRD omits environment.variables, ACK may leave them alone or clear them on reconcile. This is explicitly tested in Phase 4b (reconciliation gate test) during staging deployment.
If Phase 4b FAILS (env vars cleared by reconciliation):
- Immediate fix: Force an ArgoCD sync, which reruns PostSync hooks (including the env-vars-job): `argocd app sync s3-notifier-staging --force --grpc-web`. The `--force` flag is required because without it, a sync on an already-synced app may skip hooks. Fallback if `--force` still doesn't rerun hooks: manually delete the old hook Job so the `BeforeHookCreation` delete policy triggers recreation on next sync: `kubectl delete job --ignore-not-found -n syrf-staging -l argocd.argoproj.io/hook=PostSync,app.kubernetes.io/instance=s3-notifier-staging`, then `argocd app sync s3-notifier-staging --force --grpc-web`. The env-vars-job is a regular Job (PostSync hook), not a CronJob, so `kubectl create job --from=cronjob/...` does not apply here.
- Durable fix: Add non-sensitive vars directly to the `function.yaml` CRD spec:
spec:
  environment:
    variables:
      RabbitMqHost: "{{ .Values.rabbitMq.host }}"
      RabbitMqUsername: "{{ .Values.rabbitMq.username }}"
      S3Region: "{{ .Values.awsRegion }}"
      # RabbitMqPassword still set by env-vars-job (sensitive, from K8s Secret)
This ensures ACK preserves these vars on reconcile.

Important: `update-function-configuration --environment` is a full replacement, not a merge; it replaces the entire `Variables` map. The env-vars-job must therefore still send ALL vars (including the ones now in the CRD spec), or ACK would preserve only the CRD vars while the job overwrites them with a subset. In practice, keep the env-vars-job sending all 4 vars as it does today. The CRD vars act as a safety net for ACK reconciliation, not as a replacement for the job.

No code change now; verify during staging deployment via Phase 4b.
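Because `--environment` is a full replacement, any tooling that wants to update only a subset of vars must first merge with the current configuration. A minimal sketch of that merge (jq assumed; `merge_env` is illustrative, and the env-vars-job avoids the problem by always sending all four vars):

```shell
# Merge new/changed vars into an existing {"Variables": {...}} payload before
# passing it to `aws lambda update-function-configuration --environment`.
merge_env() {
  # $1 = existing environment JSON, $2 = JSON object of vars to add or override
  jq -c --argjson new "$2" '.Variables += $new' <<<"$1"
}
```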
Fix 10: Sync technical plan doc¶
File: technical-plan.md
Updates needed:
- Phase 0 OIDC URL (line 100): Change `/locations/` to `/zones/` with note that zonal clusters use `/zones/`
- Phase 0 IAM: Document new `syrf-ack-setup-job` role (least-privilege for setup jobs) and `aud` constraint on trust policies
- Phase 0 prerequisites: Add pod identity webhook as a prerequisite (Fix 0)
- Phase 2 chart structure (line 198): Replace `external-secret.yaml` with `serviceaccount.yaml`, `permission-job.yaml`, `env-vars-job.yaml`
- Phase 2 setup-job description (lines 259-286): Update to describe split into permission-job + env-vars-job with correct sync-wave ordering, wait-for-active loop, `--source-account`, preflight checks, configurable timeouts, rabbit-mq pre-check
- Phase 2 external-secret description (lines 290-293): Remove — no longer creating ExternalSecret (ClusterExternalSecret handles it)
- Phase 2 values.yaml (line 328): Fix `passwordSecretKey: password` to `passwordSecretKey: rabbitmq-password`, add `iamRoleName`, add `timeouts`
- Phase 3 staging role (line 391): Use `syrfS3NotifierStagingLambdaRole` instead of production role
- Phase 3 risk about env var reconciliation (lines 421-423): Add note about `S3Region` now being included, document intentional overwrite behavior
- Bucket sync-wave in Phase 2 examples: Update from wave 2 to wave 3, add permission-job at wave 2
- ArgoCD prerequisites: Document plugins AppProject updates (Fix 1b) and ACK health checks (Fix 1c)
- Updated date in frontmatter
Summary of all file changes¶
| # | File | Repo | Action |
|---|---|---|---|
| 1 | `plugins/helm/pod-identity-webhook/config.yaml` | cluster-gitops | Create (Fix 0) |
| 2 | `plugins/helm/pod-identity-webhook/values.yaml` | cluster-gitops | Create (Fix 0) |
| 3 | `argocd/projects/plugins.yaml` | cluster-gitops | Modify (Fix 1b: add ACK + webhook sources/destinations) |
| 4 | `argocd/projects/syrf-staging.yaml` | cluster-gitops | Modify (Fix 1a: add CRD whitelist) |
| 5 | `argocd/projects/syrf-production.yaml` | cluster-gitops | Modify (Fix 1a: add CRD whitelist) |
| 6 | `argocd/resources/argocd-cm.yaml` | cluster-gitops | Modify (Fix 1c: add ACK health checks) |
| 7 | `syrf/services/s3-notifier/values.yaml` | cluster-gitops | Modify (Fix 5: passwordSecretKey) |
| 8 | `syrf/environments/staging/s3-notifier/values.yaml` | cluster-gitops | Modify (Fix 7: executionRoleArn) |
| 9 | `.chart/templates/_helpers.tpl` | monorepo | Modify (Fix 2: add k8sName helper) |
| 10 | `.chart/templates/function.yaml` | monorepo | Modify (Fix 2: metadata.name) |
| 11 | `.chart/templates/adopted-resource.yaml` | monorepo | Modify (Fix 2: metadata.name) |
| 12 | `.chart/templates/setup-job.yaml` | monorepo | Delete (Fix 3) |
| 13 | `.chart/templates/permission-job.yaml` | monorepo | Create (Fix 3) |
| 14 | `.chart/templates/env-vars-job.yaml` | monorepo | Create (Fix 3: with timeout fail + secret pre-check) |
| 15 | `.chart/templates/bucket.yaml` | monorepo | Modify (Fix 3: wave 2→3) |
| 16 | `.chart/templates/serviceaccount.yaml` | monorepo | Create (Fix 4d) |
| 17 | `.chart/templates/external-secret.yaml` | monorepo | Delete (Fix 5) |
| 18 | `.chart/values.yaml` | monorepo | Modify (Fix 4e + 5: iamRoleName, passwordSecretKey, timeouts) |
| 19 | `terraform/lambda/main.tf` | camarades-infra | Modify (Fix 7: staging Lambda role) |
| 20 | `terraform/lambda/ack-iam.tf` | camarades-infra | Modify (Fix 4a+4b+4c: setup-job role, merge trust policy) |
| 21 | `docs/features/lambda-ack-gitops/technical-plan.md` | monorepo | Modify (Fix 10: sync all corrections) |
Total: 21 file operations (14 modify, 5 create, 2 delete) — 11 monorepo (rows 9-18, 21), 8 cluster-gitops (rows 1-8), 2 camarades-infra (rows 19-20)
Implementation order¶
- Terraform first (Fixes 4a, 4b+4c, 7): IAM roles must exist before K8s resources reference them
- cluster-gitops infra (Fixes 0, 1b, 1c): Webhook + AppProject + health checks — must be deployed before ACK controllers or s3-notifier app
- cluster-gitops service config (Fixes 1a, 5-values, 7-values): Project whitelists + values corrections
- Monorepo Helm chart (Fixes 2, 3, 4d, 4e, 5-delete): All template changes
- Docs (Fix 10): Technical plan sync
Steps 1-3 must be committed/merged to their repos before step 4 can deploy successfully. Step 5 can happen any time.
Cross-repo rollout race gate: Since staging auto-deploys from main (ArgoCD watches chartTag in cluster-gitops/syrf/environments/staging/s3-notifier/config.yaml), merging the monorepo Helm chart (step 4) before cluster-gitops prerequisites are live (steps 2-3) will cause ArgoCD to sync and fail. To prevent this:
- Before merging step 4, verify steps 1-3 are live: pod-identity-webhook Running, ACK controllers restarted with projected volumes, AppProject updated, health checks in argocd-cm.
- If step 4 must merge before prerequisites are ready, pin a specific immutable commit SHA (not `main`) as `chartTag` in `cluster-gitops/syrf/environments/staging/s3-notifier/config.yaml` so ArgoCD stays on the pre-ACK chart version. Using `main` is not safe since it advances with new commits. Update to the new tag only after prerequisites are confirmed live.
- Do NOT use `setupJob.enabled: false` as a gate — it only disables jobs, but `bucket.yaml` still applies and will fail without the permission-job's S3 invoke permission.
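The "pin an immutable SHA" rule above can be checked mechanically before merging step 4 (a sketch; `is_pinned_sha` is illustrative, and treating only full 40-hex strings as immutable is an assumption):

```shell
# Accept only a full 40-character lowercase hex commit SHA as a pinned chartTag;
# branch names ("main") and abbreviated SHAs are rejected as mutable/ambiguous.
is_pinned_sha() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;   # any non-hex character disqualifies
  esac
  [ "${#1}" -eq 40 ]
}
```

Usage: `is_pinned_sha "$(yq -e '.chartTag' "${GITOPS_ROOT}/syrf/environments/staging/s3-notifier/config.yaml")" || echo "chartTag is not pinned"`.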
Verification¶
Prerequisites¶
The verification scripts below require these CLIs installed and authenticated:
- `kubectl` — configured for the GKE cluster (`gcloud container clusters get-credentials camaradesuk --zone europe-west2-a`)
- `helm` — v3.x for `helm template` validation
- `argocd` — logged in (`argocd login argocd.camarades.net --grpc-web`)
- `aws` — configured for account `318789018510`, region `eu-west-1`
- `jq` — for JSON normalization in Phase 4b comparisons
- `yq` — v4+ required; used for pre-merge chart key validation (`yq -e` path checks) and Helm-style deep merge (`ireduce` syntax)
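The pre-merge yq gate mentioned above can be sketched as a testable helper (the version-string shape is assumed from mikefarah/yq's `yq --version` output; python-yq and v3 are rejected):

```shell
# Major-version gate: accept only mikefarah/yq v4+ style version strings.
# Takes the version string as an argument so the check runs without yq installed.
require_yq_v4() {
  case "$1" in
    *" version v"[4-9].*|*" version "[4-9].*) return 0 ;;
    *)
      echo "ERROR: yq v4+ (mikefarah/yq) required; got: ${1:-nothing}" >&2
      return 1 ;;
  esac
}
# In the pre-merge block:
#   command -v yq >/dev/null || { echo "ERROR: yq not installed" >&2; exit 1; }
#   require_yq_v4 "$(yq --version 2>&1)" || exit 1
```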
All phases assume execution from the syrf monorepo root unless explicitly stated. Set these variables at the start:
set -euo pipefail
# Monorepo root (Helm charts, docs)
REPO_ROOT="$(git rev-parse --show-toplevel)" # or: export REPO_ROOT=/path/to/syrf
# cluster-gitops root (ArgoCD apps, env values)
# Autodetect: nested (monorepo/cluster-gitops) or sibling (../cluster-gitops) layout.
# Override: export GITOPS_ROOT=/path/to/cluster-gitops before running.
if [ -z "${GITOPS_ROOT:-}" ]; then
  if [ -d "${REPO_ROOT}/cluster-gitops" ]; then
    GITOPS_ROOT="${REPO_ROOT}/cluster-gitops"
  elif [ -d "${REPO_ROOT}/../cluster-gitops" ]; then
    GITOPS_ROOT="$(cd "${REPO_ROOT}/../cluster-gitops" && pwd)"
  else
    echo "ERROR: cluster-gitops not found as nested or sibling of $REPO_ROOT"
    echo "Export GITOPS_ROOT before running, e.g.: export GITOPS_ROOT=/path/to/cluster-gitops"
    exit 1
  fi
fi
# camarades-infrastructure root (Terraform)
# Same autodetect: nested or sibling layout, with explicit override.
if [ -z "${INFRA_ROOT:-}" ]; then
  if [ -d "${REPO_ROOT}/camarades-infrastructure" ]; then
    INFRA_ROOT="${REPO_ROOT}/camarades-infrastructure"
  elif [ -d "${REPO_ROOT}/../camarades-infrastructure" ]; then
    INFRA_ROOT="$(cd "${REPO_ROOT}/../camarades-infrastructure" && pwd)"
  else
    echo "ERROR: camarades-infrastructure not found as nested or sibling of $REPO_ROOT"
    echo "Export INFRA_ROOT before running, e.g.: export INFRA_ROOT=/path/to/camarades-infrastructure"
    exit 1
  fi
fi
# Fail-fast: verify all roots exist before running any phase
for _dir in "$REPO_ROOT" "$GITOPS_ROOT" "$INFRA_ROOT"; do
  if [ ! -d "$_dir" ]; then
    echo "ERROR: Directory not found: $_dir"
    echo "Check REPO_ROOT, GITOPS_ROOT, and INFRA_ROOT variables above."
    exit 1
  fi
done
# Cluster context — hard gate to prevent operating on the wrong cluster.
# Format: gke_{PROJECT}_{ZONE}_{CLUSTER}. Update if cluster is recreated or moved.
EXPECTED_CONTEXT="gke_camarades-net_europe-west2-a_camaradesuk"
CURRENT_CONTEXT=$(kubectl config current-context 2>/dev/null) || {
  echo "ERROR: kubectl config current-context failed — is kubectl configured?"
  exit 1
}
if [ "$CURRENT_CONTEXT" != "$EXPECTED_CONTEXT" ]; then
  echo "ERROR: kubectl context mismatch! Expected '$EXPECTED_CONTEXT', got '$CURRENT_CONTEXT'"
  echo "Run: kubectl config use-context $EXPECTED_CONTEXT"
  exit 1
fi
echo "kubectl context verified: $CURRENT_CONTEXT"
# AWS account and IAM role ARNs — centralised here to avoid hardcoding in each phase.
# Update these if deploying to a different account or renaming roles.
AWS_ACCOUNT_ID="318789018510"
ACK_CONTROLLER_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/syrf-ack-controllers"
SETUP_JOB_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/syrf-ack-setup-job"
# Sanity gate: verify AWS CLI is targeting the expected account.
# Prevents cross-account mistakes (e.g. personal creds, wrong profile).
ACTUAL_ACCOUNT=$(aws sts get-caller-identity --query 'Account' --output text 2>/dev/null) || {
  echo "ERROR: aws sts get-caller-identity failed — check AWS credentials"
  exit 1
}
if [ "$ACTUAL_ACCOUNT" != "$AWS_ACCOUNT_ID" ]; then
  echo "ERROR: AWS account mismatch! Expected $AWS_ACCOUNT_ID, got $ACTUAL_ACCOUNT"
  echo "Check AWS_PROFILE or AWS_DEFAULT_PROFILE environment variable."
  exit 1
fi
echo "AWS account verified: $ACTUAL_ACCOUNT"
Phase 1 — Terraform¶
set -euo pipefail
cd "${INFRA_ROOT}/terraform/lambda"
# Pre-check: verify ACK controller SA names match trust policy subjects.
# The trust policy hardcodes exact SA names — if the Helm charts use different
# names (e.g. via nameOverride), AssumeRoleWithWebIdentity will silently fail.
#
# Use `helm template` to render the actual SA name from the chart + our values,
# rather than grepping raw values.yaml (which may not reflect the rendered name).
echo "Verifying ACK controller ServiceAccount names..."
# Capture full helm template output first, then parse — avoids:
# 1. SIGPIPE from head -1 closing the pipe (pipefail sees non-zero exit)
# 2. Lost diagnostics from 2>/dev/null (stderr preserved on failure)
LAMBDA_SA_OUTPUT=$(helm template ack-lambda-controller \
  oci://public.ecr.aws/aws-controllers-k8s/lambda-chart --version 1.5.2 \
  -f "${GITOPS_ROOT}/plugins/helm/ack-lambda-controller/values.yaml" \
  --show-only templates/serviceaccount.yaml 2>&1) || {
  echo "ERROR: helm template failed for ack-lambda-controller:"
  echo "$LAMBDA_SA_OUTPUT"
  exit 1
}
# Parse from captured output — here-string avoids pipeline SIGPIPE under pipefail.
LAMBDA_SA=$(awk '/^[[:space:]]*name:/ {print $2; exit}' <<<"$LAMBDA_SA_OUTPUT")
S3_SA_OUTPUT=$(helm template ack-s3-controller \
  oci://public.ecr.aws/aws-controllers-k8s/s3-chart --version 1.0.14 \
  -f "${GITOPS_ROOT}/plugins/helm/ack-s3-controller/values.yaml" \
  --show-only templates/serviceaccount.yaml 2>&1) || {
  echo "ERROR: helm template failed for ack-s3-controller:"
  echo "$S3_SA_OUTPUT"
  exit 1
}
S3_SA=$(awk '/^[[:space:]]*name:/ {print $2; exit}' <<<"$S3_SA_OUTPUT")
if [ -z "${LAMBDA_SA:-}" ]; then
  echo "ERROR: Could not extract Lambda controller SA name from helm template."
  echo "Check chart version/values: ${GITOPS_ROOT}/plugins/helm/ack-lambda-controller/values.yaml"
  exit 1
fi
if [ -z "${S3_SA:-}" ]; then
  echo "ERROR: Could not extract S3 controller SA name from helm template."
  echo "Check chart version/values: ${GITOPS_ROOT}/plugins/helm/ack-s3-controller/values.yaml"
  exit 1
fi
echo "Lambda controller SA: $LAMBDA_SA"
echo "S3 controller SA: $S3_SA"
# Cross-check against trust policy subjects in ack-iam.tf:
#   system:serviceaccount:ack-system:ack-lambda-controller
#   system:serviceaccount:ack-system:ack-s3-controller
# If either SA name doesn't match the trust policy, STOP and update ack-iam.tf.
if [ "$LAMBDA_SA" != "ack-lambda-controller" ] || [ "$S3_SA" != "ack-s3-controller" ]; then
  echo "WARNING: Rendered SA names do not match trust policy subjects!"
  echo "  Expected: ack-lambda-controller / ack-s3-controller"
  echo "  Got: $LAMBDA_SA / $S3_SA"
  echo "Update ack-iam.tf trust policy before applying."
  exit 1
fi
terraform validate
terraform plan
# Full plan (not -target) — ensures coupled changes (trust policies + iam:PassRole
# updates + new roles) are all captured together. Review the plan output for:
# - New roles: ack_setup_job, staging_lambda_role
# - Modified role: ack_controllers trust policy (single statement with aud + sub)
# - Modified policy: ack_lambda PassRole now includes staging role ARN
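The review bullets above can be partially automated against the machine-readable plan (a sketch; `plan_has_address` is illustrative, jq assumed). Generate the JSON with `terraform plan -out=tfplan && terraform show -json tfplan > plan.json`:

```shell
# Assert a resource address appears in a Terraform JSON plan's resource_changes.
plan_has_address() {
  # $1 = plan JSON file, $2 = resource address (e.g. aws_iam_role.ack_setup_job)
  jq -e --arg addr "$2" '[.resource_changes[].address] | index($addr) != null' "$1" >/dev/null
}
# Usage:
#   plan_has_address plan.json aws_iam_role.ack_setup_job || echo "MISSING: setup-job role"
#   plan_has_address plan.json aws_iam_role.staging_lambda_role || echo "MISSING: staging role"
```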
Phase 2 — cluster-gitops (after merge)¶
set -euo pipefail
# 1. Gate: pod identity webhook must be Running and Ready before restarting controllers.
# kubectl get exits 0 even when pods aren't Ready; kubectl wait is a proper gate.
kubectl wait --for=condition=Ready pod \
  -n pod-identity-webhook -l app.kubernetes.io/name=amazon-eks-pod-identity-webhook \
  --timeout=120s
# 2. Restart ACK controllers so NEW pods get mutated by the webhook
kubectl rollout restart deployment -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller
kubectl rollout restart deployment -n ack-system -l app.kubernetes.io/instance=ack-s3-controller
kubectl rollout status deployment -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller --timeout=180s
kubectl rollout status deployment -n ack-system -l app.kubernetes.io/instance=ack-s3-controller --timeout=180s
# 3. Verify mutation + auth for BOTH ACK controllers (Lambda and S3)
for CONTROLLER in ack-lambda-controller ack-s3-controller; do
  echo "--- Checking $CONTROLLER ---"
  # Capture a Running pod with count guard (deterministic: newest by creation timestamp)
  POD_LIST=$(kubectl get pod -n ack-system -l app.kubernetes.io/instance="$CONTROLLER" \
    --field-selector=status.phase=Running --sort-by=.metadata.creationTimestamp \
    -o jsonpath='{.items[*].metadata.name}')
  POD_COUNT=$(echo "$POD_LIST" | wc -w)
  if [ "$POD_COUNT" -eq 0 ] || [ -z "$POD_LIST" ]; then
    echo "ERROR: No Running pods found for $CONTROLLER"
    echo "All pods:"
    kubectl get pods -n ack-system -l app.kubernetes.io/instance="$CONTROLLER"
    exit 1
  fi
  if [ "$POD_COUNT" -gt 1 ]; then
    echo "WARNING: $POD_COUNT Running pods found for $CONTROLLER (may be mid-rollout), using newest"
  fi
  POD=$(echo "$POD_LIST" | tr -s '[:space:]' '\n' | tail -1)  # Last = newest by creationTimestamp
  echo "Checking pod: $POD"
  # Verify webhook mutation: projected token volume mount + AWS env vars (combined assertion)
  # Use containers[*] (not containers[0]) — sidecar injection can shift container order.
  # With containers[*], jsonpath may concat values from multiple containers (whitespace-separated).
  # Split on all whitespace and match exact token (tr -s '[:space:]' + grep -qxF).
  TOKEN_MOUNT=$(kubectl get pod "$POD" -n ack-system \
    -o jsonpath='{.spec.containers[*].volumeMounts[?(@.mountPath=="/var/run/secrets/eks.amazonaws.com/serviceaccount")].mountPath}')
  ROLE_ARN=$(kubectl get pod "$POD" -n ack-system \
    -o jsonpath='{.spec.containers[*].env[?(@.name=="AWS_ROLE_ARN")].value}')
  TOKEN_FILE=$(kubectl get pod "$POD" -n ack-system \
    -o jsonpath='{.spec.containers[*].env[?(@.name=="AWS_WEB_IDENTITY_TOKEN_FILE")].value}')
  EXPECTED_TOKEN_PATH="/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
  EXPECTED_CONTROLLER_ROLE="$ACK_CONTROLLER_ROLE_ARN"
  echo "  Token mount: ${TOKEN_MOUNT:-MISSING}"
  echo "  AWS_ROLE_ARN: ${ROLE_ARN:-MISSING} (expected: $EXPECTED_CONTROLLER_ROLE)"
  echo "  AWS_WEB_IDENTITY_TOKEN_FILE: ${TOKEN_FILE:-MISSING} (expected: $EXPECTED_TOKEN_PATH)"
  if [ -z "$TOKEN_MOUNT" ] || [ -z "$ROLE_ARN" ] || [ -z "$TOKEN_FILE" ]; then
    echo "FAIL: Webhook mutation incomplete for $CONTROLLER pod $POD"
    exit 1
  fi
  # Exact-token check — see Fix 0 block for rationale (tr + grep -qxF)
  if ! echo "$ROLE_ARN" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED_CONTROLLER_ROLE"; then
    echo "FAIL: AWS_ROLE_ARN missing exact expected token '$EXPECTED_CONTROLLER_ROLE', got: $ROLE_ARN"
    exit 1
  fi
  if ! echo "$TOKEN_FILE" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED_TOKEN_PATH"; then
    echo "FAIL: AWS_WEB_IDENTITY_TOKEN_FILE missing exact expected token '$EXPECTED_TOKEN_PATH', got: $TOKEN_FILE"
    exit 1
  fi
  # Verify auth: capture logs first (fail if retrieval fails), then check for errors
  POD_LOGS=$(kubectl logs -n ack-system "$POD" --since=5m 2>&1) || {
    echo "FAIL: Could not retrieve logs from $POD"
    exit 1
  }
  # Here-string (not echo | grep) per the v3.15 pipe-avoidance convention for multi-line output.
  AUTH_ERRORS=$(grep -iE 'unauthorized|forbidden|NoCredentialProviders|AccessDenied|InvalidIdentityToken|ExpiredToken|WebIdentityErr' <<<"$POD_LOGS" || true)
  if [ -n "$AUTH_ERRORS" ]; then
    echo "FAIL: Auth errors detected in $CONTROLLER logs:"
    echo "$AUTH_ERRORS"
    echo "Fix OIDC trust policy or webhook mutation before proceeding."
    exit 1
  else
    echo "(no auth errors found — good)"
  fi
done
# 4. Gate: ACK controllers must be Healthy in ArgoCD (not Unknown/Progressing)
# `argocd app wait` exits non-zero if health is not reached within timeout.
# `argocd app get` would always exit 0 regardless of health — not a gate.
# 300s allows for image pull, CRD registration, and initial AWS API round-trip.
argocd app wait ack-lambda-controller --health --timeout 300 --grpc-web
argocd app wait ack-s3-controller --health --timeout 300 --grpc-web
# 5. Gate: s3-notifier app must reach Healthy + Synced after deploy
# 600s accommodates sync waves (Function → permission-job → Bucket) + PostSync env-vars-job,
# each involving AWS API calls with potential cold-start latency.
argocd app wait s3-notifier-staging --health --sync --timeout 600 --grpc-web
# Expect: Function and Bucket resources show Healthy (not Unknown/Progressing)
Phase 2b — After monorepo Helm chart deploys (s3-notifier)¶
set -euo pipefail
# 9. Verify ack-setup-job ServiceAccount was created with correct annotation
kubectl get sa ack-setup-job -n syrf-staging -o jsonpath='{.metadata.annotations}'
# Expect: eks.amazonaws.com/role-arn = arn:aws:iam::318789018510:role/syrf-ack-setup-job
# 10. Verify setup-job SA mutation: create a test pod to confirm webhook mutates it
# Use a unique pod name to avoid collisions on repeated runs (timestamp + PID).
SA_TEST_POD="sa-test-$(date +%s)-$$"
# Trap ensures the test pod is cleaned up on ANY exit (timeout, assertion failure, Ctrl-C)
trap "kubectl delete pod $SA_TEST_POD -n syrf-staging --ignore-not-found 2>/dev/null" EXIT
kubectl run "$SA_TEST_POD" --image=amazon/aws-cli:2.15.0 -n syrf-staging --restart=Never \
  --overrides='{"spec":{"serviceAccountName":"ack-setup-job"}}' \
  --command -- sleep 300
# Wait for pod to be Ready before inspecting (avoids race with sleep + immediate inspect)
kubectl wait --for=condition=Ready "pod/$SA_TEST_POD" -n syrf-staging --timeout=60s
# Assert exact env var values and token mount (same rigor as ACK controller checks)
# Use containers[*] (not containers[0]) — sidecar injection can shift container order.
SA_ROLE_ARN=$(kubectl get pod "$SA_TEST_POD" -n syrf-staging \
  -o jsonpath='{.spec.containers[*].env[?(@.name=="AWS_ROLE_ARN")].value}')
SA_TOKEN_FILE=$(kubectl get pod "$SA_TEST_POD" -n syrf-staging \
  -o jsonpath='{.spec.containers[*].env[?(@.name=="AWS_WEB_IDENTITY_TOKEN_FILE")].value}')
SA_TOKEN_MOUNT=$(kubectl get pod "$SA_TEST_POD" -n syrf-staging \
  -o jsonpath='{.spec.containers[*].volumeMounts[?(@.mountPath=="/var/run/secrets/eks.amazonaws.com/serviceaccount")].mountPath}')
EXPECTED_SETUP_ROLE="$SETUP_JOB_ROLE_ARN"
EXPECTED_SA_TOKEN_PATH="/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
echo "sa-test AWS_ROLE_ARN: ${SA_ROLE_ARN:-MISSING} (expected: $EXPECTED_SETUP_ROLE)"
echo "sa-test AWS_WEB_IDENTITY_TOKEN_FILE: ${SA_TOKEN_FILE:-MISSING}"
echo "sa-test Token mount: ${SA_TOKEN_MOUNT:-MISSING}"
# Cleanup is handled by EXIT trap above — no manual kubectl delete needed on failure paths.
if [ -z "$SA_ROLE_ARN" ] || [ -z "$SA_TOKEN_FILE" ] || [ -z "$SA_TOKEN_MOUNT" ]; then
  echo "FAIL: Webhook mutation incomplete for $SA_TEST_POD pod"
  exit 1
fi
# Exact-token check — see Fix 0 block for rationale (tr + grep -qxF)
if ! echo "$SA_ROLE_ARN" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED_SETUP_ROLE"; then
  echo "FAIL: $SA_TEST_POD AWS_ROLE_ARN missing exact expected token '$EXPECTED_SETUP_ROLE', got: $SA_ROLE_ARN"
  exit 1
fi
if ! echo "$SA_TOKEN_FILE" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED_SA_TOKEN_PATH"; then
  echo "FAIL: $SA_TEST_POD AWS_WEB_IDENTITY_TOKEN_FILE missing exact expected token '$EXPECTED_SA_TOKEN_PATH', got: $SA_TOKEN_FILE"
  exit 1
fi
echo "$SA_TEST_POD mutation verified."
# Explicit delete (trap also covers this, but explicit for clarity)
trap - EXIT
kubectl delete pod "$SA_TEST_POD" -n syrf-staging --ignore-not-found
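Step 9 above only prints the annotation map for eyeballing. A stricter variant asserts the exact value (a sketch; `role_arn_from_sa_json` is illustrative, jq assumed, `SETUP_JOB_ROLE_ARN` comes from the Prerequisites block):

```shell
# Extract the IRSA role-arn annotation from a ServiceAccount JSON document;
# prints nothing when the annotation is missing, so an equality check fails safely.
role_arn_from_sa_json() {
  jq -r '.metadata.annotations["eks.amazonaws.com/role-arn"] // empty' <<<"$1"
}
# Usage against the live cluster:
#   SA_JSON=$(kubectl get sa ack-setup-job -n syrf-staging -o json)
#   [ "$(role_arn_from_sa_json "$SA_JSON")" = "$SETUP_JOB_ROLE_ARN" ] || exit 1
```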
Phase 3 — Helm chart¶
set -euo pipefail
helm template staging "${REPO_ROOT}/src/services/s3-notifier/.chart/" \
  --set bucket.name=syrfapp-uploads-staging \
  --set lambda.executionRoleArn=arn:aws:iam::318789018510:role/syrfS3NotifierStagingLambdaRole \
  --set lambda.code.s3Key=test.zip \
  --set awsAccountId=318789018510 \
  --set environmentName=staging
Verify in rendered output:
- Function `metadata.name` is lowercase (`s3-notifier-staging`)
- Function `spec.name` is camelCase (`syrfAppUploadS3Notifier-staging`)
- No `external-secret.yaml` in output
- ServiceAccount created with `syrf-ack-setup-job` role annotation (not `syrf-ack-controllers`)
- permission-job is `Sync` hook at wave 2, includes `--source-account`, wait-for-active loop, preflight check
- env-vars-job is `PostSync`, includes `S3Region`, timeout fail, rabbit-mq pre-check
- bucket is at wave 3
- Secret key is `rabbitmq-password` (not `password`)
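The checklist above can be turned into hard assertions on a single captured render (a sketch; `assert_rendered` is illustrative and uses a here-string, matching the document's pipe-avoidance convention):

```shell
# Fail with a labelled message when an expected literal is absent from the
# captured helm output; grep -qF avoids regex surprises in YAML values.
assert_rendered() {
  # $1 = rendered output, $2 = literal that must appear, $3 = label
  grep -qF "$2" <<<"$1" || { echo "FAIL: $3 (missing: $2)" >&2; return 1; }
}
# Usage:
#   RENDERED=$(helm template staging "${REPO_ROOT}/src/services/s3-notifier/.chart/" ...)
#   assert_rendered "$RENDERED" "name: s3-notifier-staging" "Function metadata.name lowercase"
#   assert_rendered "$RENDERED" "role/syrf-ack-setup-job" "setup-job SA role annotation"
```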
Phase 4 — Staging deployment smoke test¶
set -euo pipefail
# After ArgoCD syncs s3-notifier to staging:
# 1. Verify Lambda function exists and is Active
aws lambda get-function --function-name syrfAppUploadS3Notifier-staging --region eu-west-1
# 2. Verify S3 invoke permission exists
aws lambda get-policy --function-name syrfAppUploadS3Notifier-staging --region eu-west-1
# 3. Verify env var KEYS are set (including S3Region) — keys only, no values (avoids leaking secrets)
aws lambda get-function-configuration --function-name syrfAppUploadS3Notifier-staging \
  --query 'sort(keys(Environment.Variables))' --output json --region eu-west-1
# Expect: ["RabbitMqHost", "RabbitMqPassword", "RabbitMqUsername", "S3Region"]
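The keys-only query shape can be reproduced locally with jq, whose `keys` function always returns object keys sorted. The env map below uses illustrative values; the password value is never printed:

```shell
# Illustrative env var map — value of RabbitMqPassword is a dummy and never echoed.
VARS='{"S3Region":"eu-west-1","RabbitMqUsername":"svc","RabbitMqPassword":"x","RabbitMqHost":"mq.internal"}'
# jq `keys` sorts, mirroring the JMESPath sort(keys(Environment.Variables)).
KEYS=$(jq -c 'keys' <<<"$VARS")
echo "$KEYS"
```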
Phase 4b — ACK reconciliation gate (CRITICAL)¶
The Function CRD does not include environment.variables — env vars are set out-of-band by the PostSync env-vars-job. This means ACK reconciliation may clear them if the controller reconciles the Function spec against the CRD. This phase explicitly tests that.
set -euo pipefail
# Environment parameters — defaults to staging. Override for production (see Phase 5 Step 3).
FUNCTION_NAME="${PHASE4B_FUNCTION_NAME:-syrfAppUploadS3Notifier-staging}"
K8S_NAMESPACE="${PHASE4B_NAMESPACE:-syrf-staging}"
CR_NAME="${PHASE4B_CR_NAME:-s3-notifier-staging}"
ENV_VALUES_PATH="${PHASE4B_ENV_VALUES_PATH:-syrf/environments/staging/s3-notifier/values.yaml}"
ARGOCD_APP="${PHASE4B_ARGOCD_APP:-s3-notifier-staging}"
echo "Phase 4b targets: function=$FUNCTION_NAME namespace=$K8S_NAMESPACE cr=$CR_NAME app=$ARGOCD_APP"
# Compare env var KEYS and non-sensitive VALUES to catch both key removal and value drift.
# RabbitMqPassword is redacted — only keys, Host, Username, and S3Region are compared.
# Use JMESPath sort() for deterministic key ordering.
# 1. Record current env var keys + non-sensitive values (baseline)
# Pipe through jq -Sc to normalize whitespace and key ordering for reliable comparison.
BEFORE_KEYS=$(aws lambda get-function-configuration \
--function-name "$FUNCTION_NAME" --region eu-west-1 \
--query 'sort(keys(Environment.Variables))' --output json | jq -Sc '.')
BEFORE_VALS=$(aws lambda get-function-configuration \
--function-name "$FUNCTION_NAME" --region eu-west-1 \
--query '{Host: Environment.Variables.RabbitMqHost, User: Environment.Variables.RabbitMqUsername, Region: Environment.Variables.S3Region}' \
--output json | jq -Sc '.')
echo "Env var keys before reconcile: $BEFORE_KEYS"
echo "Non-sensitive values before: $BEFORE_VALS"
# 2. Record ResourceSynced lastTransitionTime BEFORE annotation.
# We use lastTransitionTime (not resourceVersion) because it's immune to timing
# races — the old two-bump resourceVersion approach could false-fail if ACK
# reconciled before AFTER_ANNOTATION_RV was captured.
BEFORE_LTT=$(kubectl get function "$CR_NAME" -n "$K8S_NAMESPACE" \
-o jsonpath='{.status.conditions[?(@.type=="ACK.ResourceSynced")].lastTransitionTime}')
echo "ResourceSynced lastTransitionTime before: $BEFORE_LTT"
# 3. Force ACK to reconcile by annotating the Function CR
# Capture log-window start BEFORE annotation so --since-time excludes prior reconcile noise.
LOG_SINCE=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
# Use nanoseconds for uniqueness on fast reruns (date +%s repeats within same second).
# macOS/BSD: date +%s%N returns literal "N" suffix (exit 0). Detect and fall back to epoch-PID-counter.
RECONCILE_TS=$(date +%s%N 2>/dev/null)
case "${RECONCILE_TS:-}" in ""|*N)
_RECONCILE_SEQ=$(( ${_RECONCILE_SEQ:-0} + 1 ))
RECONCILE_TS="$(date +%s)-$$-${_RECONCILE_SEQ}"
;; esac
kubectl annotate function "$CR_NAME" -n "$K8S_NAMESPACE" \
"ack.aws.k8s.aws/force-reconcile=${RECONCILE_TS}" --overwrite
# 4. Wait for reconciliation with automated fallback.
# Two signals checked per attempt:
# (a) lastTransitionTime changed — proves controller updated status after our trigger
# (b) ResourceSynced=True — proves reconcile succeeded
#
# Attempt 1: annotation-based trigger (already applied above).
# Attempt 2 (automatic): argocd app sync --force — re-applies all resources,
# forcing the controller to reconcile even if it ignores metadata-only changes.
#
# NOTE: If ResourceSynced stays True throughout (no transition), lastTransitionTime
# may not change (K8s convention: only updates on status transitions). In that case,
# we check controller logs as secondary evidence. The env var validation below is the
# authoritative check — the reconcile gate is a bonus signal, not a hard gate.
RECON_WAIT=60 # Per-attempt timeout — ACK controller + AWS API round-trip can be slow
RECON_CONFIRMED=false
for RECON_ATTEMPT in 1 2; do
if [ "$RECON_ATTEMPT" -eq 2 ]; then
echo "Annotation-based trigger did not produce reconcile evidence. Attempting fallback: argocd app sync --force..."
# Reset baselines for the second attempt
BEFORE_LTT=$(kubectl get function "$CR_NAME" -n "$K8S_NAMESPACE" \
-o jsonpath='{.status.conditions[?(@.type=="ACK.ResourceSynced")].lastTransitionTime}')
LOG_SINCE=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
argocd app sync "$ARGOCD_APP" --force --grpc-web || {
echo "ERROR: argocd app sync --force failed. Check ArgoCD connectivity and app status."
exit 1
}
sleep 5 # Allow sync propagation before polling
fi
echo "Waiting for ACK reconciliation (attempt $RECON_ATTEMPT)..."
RECON_ELAPSED=0
while [ $RECON_ELAPSED -lt $RECON_WAIT ]; do
CURRENT_LTT=$(kubectl get function "$CR_NAME" -n "$K8S_NAMESPACE" \
-o jsonpath='{.status.conditions[?(@.type=="ACK.ResourceSynced")].lastTransitionTime}' 2>/dev/null)
CURRENT_STATUS=$(kubectl get function "$CR_NAME" -n "$K8S_NAMESPACE" \
-o jsonpath='{.status.conditions[?(@.type=="ACK.ResourceSynced")].status}' 2>/dev/null)
if [ "$CURRENT_LTT" != "$BEFORE_LTT" ] && [ "$CURRENT_STATUS" = "True" ]; then
echo "Reconciliation detected (lastTransitionTime: $BEFORE_LTT -> $CURRENT_LTT)."
RECON_CONFIRMED=true
break
fi
sleep 3
RECON_ELAPSED=$((RECON_ELAPSED + 3))
done
[ "$RECON_CONFIRMED" = true ] && break
# Timeout path — check if condition stayed True and look for log evidence
FINAL_STATUS=$(kubectl get function "$CR_NAME" -n "$K8S_NAMESPACE" \
-o jsonpath='{.status.conditions[?(@.type=="ACK.ResourceSynced")].status}' 2>/dev/null)
if [ "$FINAL_STATUS" != "True" ]; then
echo "ERROR: ResourceSynced is NOT True after ${RECON_WAIT}s (attempt $RECON_ATTEMPT)."
[ "$RECON_ATTEMPT" -eq 2 ] && { echo "Both attempts failed."; exit 1; }
continue # Try fallback
fi
echo "NOTE: lastTransitionTime unchanged after ${RECON_WAIT}s — checking controller logs..."
CTRL_LOGS=$(kubectl logs -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller --since-time="$LOG_SINCE" 2>&1) || {
echo "FAIL: Could not retrieve controller logs."
exit 1
}
if grep -qE "${CR_NAME}|${FUNCTION_NAME}|${RECONCILE_TS}" <<<"$CTRL_LOGS"; then
echo "Controller logs confirm reconcile activity (attempt $RECON_ATTEMPT, matched CR/function name or ts=${RECONCILE_TS})."
RECON_CONFIRMED=true
break
fi
echo "No log evidence of reconcile on attempt $RECON_ATTEMPT."
[ "$RECON_ATTEMPT" -eq 2 ] && {
echo "FAIL: No reconcile evidence after both annotation and forced sync."
echo "Verify ACK controller is healthy: kubectl logs -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller --since-time=$LOG_SINCE"
exit 1
}
done
# 5. Check env var keys + non-sensitive values after reconciliation
AFTER_KEYS=$(aws lambda get-function-configuration \
--function-name "$FUNCTION_NAME" --region eu-west-1 \
--query 'sort(keys(Environment.Variables))' --output json | jq -Sc '.')
AFTER_VALS=$(aws lambda get-function-configuration \
--function-name "$FUNCTION_NAME" --region eu-west-1 \
--query '{Host: Environment.Variables.RabbitMqHost, User: Environment.Variables.RabbitMqUsername, Region: Environment.Variables.S3Region}' \
--output json | jq -Sc '.')
echo "Env var keys after reconcile: $AFTER_KEYS"
echo "Non-sensitive values after: $AFTER_VALS"
# 6. Compare keys — env vars MUST still be present
if [ "$BEFORE_KEYS" != "$AFTER_KEYS" ]; then
echo "FAIL: ACK reconciliation modified env var keys!"
echo "Keys before: $BEFORE_KEYS"
echo "Keys after: $AFTER_KEYS"
echo ""
echo "Mitigation: Add non-sensitive vars to the Function CRD spec"
echo " (spec.environment.variables in function.yaml template)."
echo " Sensitive vars (RabbitMqPassword) must still come from env-vars-job."
echo " See Fix 9 for details."
exit 1
fi
# 7. Compare non-sensitive values — catches value drift that key comparison misses
if [ "$BEFORE_VALS" != "$AFTER_VALS" ]; then
echo "FAIL: ACK reconciliation changed non-sensitive env var values!"
echo "Values before: $BEFORE_VALS"
echo "Values after: $AFTER_VALS"
echo ""
echo "Mitigation: Same as key drift — add non-sensitive vars to CRD spec."
exit 1
fi
# 8. Assert expected values (catches case where both before AND after are wrong)
# Source expected values from cluster-gitops Helm values — no hardcoded fallback.
# yq v4+ is a declared prerequisite; hard-fail if missing to avoid silent drift.
EXPECTED_KEYS='["RabbitMqHost","RabbitMqPassword","RabbitMqUsername","S3Region"]'
command -v yq >/dev/null 2>&1 || { echo "ERROR: yq not found. Required for expected-value sourcing."; exit 1; }
YQ_MAJOR=$(yq --version 2>&1 | awk 'match($0, /[0-9]+/) {print substr($0, RSTART, RLENGTH); exit}' || true)
if [ "${YQ_MAJOR:-0}" -lt 4 ]; then
echo "ERROR: yq v4+ required (found v${YQ_MAJOR:-unknown}). ireduce syntax needs v4."
exit 1
fi
SVC_VALS="${GITOPS_ROOT}/syrf/services/s3-notifier/values.yaml"
ENV_VALS="${GITOPS_ROOT}/${ENV_VALUES_PATH}"
# Proper Helm-style merge: env values override service defaults (yq ea with ireduce)
MERGED=$(yq ea '. as $item ireduce ({}; . * $item)' "$SVC_VALS" "$ENV_VALS") || {
echo "ERROR: yq merge failed. Check files exist and are valid YAML:"
echo " $SVC_VALS"
echo " $ENV_VALS"
exit 1
}
EXP_HOST=$(yq -r '.rabbitMq.host // ""' <<<"$MERGED")
EXP_USER=$(yq -r '.rabbitMq.username // ""' <<<"$MERGED")
EXP_REGION=$(yq -r '.awsRegion // ""' <<<"$MERGED")
EXPECTED_VALS=$(jq -n -Sc --arg Host "$EXP_HOST" --arg Region "$EXP_REGION" --arg User "$EXP_USER" \
'{Host: $Host, Region: $Region, User: $User}')
echo "Expected values sourced from cluster-gitops Helm values (merged: service + env)"
if [ "$AFTER_KEYS" != "$EXPECTED_KEYS" ]; then
echo "FAIL: Env var keys don't match expected set!"
echo "Expected: $EXPECTED_KEYS"
echo "Actual: $AFTER_KEYS"
exit 1
fi
if [ "$AFTER_VALS" != "$EXPECTED_VALS" ]; then
echo "FAIL: Non-sensitive env var values don't match expected config!"
echo "Expected: $EXPECTED_VALS"
echo "Actual: $AFTER_VALS"
exit 1
fi
echo "PASS: Env var keys, values, and expected config all verified after ACK reconciliation."
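The before/after string comparisons in this phase hinge on `jq -Sc` canonicalization: sorted keys plus compact output make two semantically equal objects byte-identical. A standalone check with illustrative values:

```shell
# Same object, different key order — jq -Sc normalizes both to one canonical string.
A=$(jq -Sc '.' <<<'{"User":"svc","Host":"mq.internal","Region":"eu-west-1"}')
B=$(jq -Sc '.' <<<'{"Host":"mq.internal","Region":"eu-west-1","User":"svc"}')
if [ "$A" = "$B" ]; then
  SAME=yes
else
  SAME=no
fi
echo "normalized-equal=$SAME"
```

Without `-S`, key order would depend on input order and a plain string comparison could false-fail even when no drift occurred.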
Phase 5 — Production rollout gate¶
Do not proceed to production until all Phase 4/4b checks pass on staging. Production enablement requires an explicit decision gate — there is no automatic promotion.
Pre-production checklist (all must be true):
- Phase 4 staging smoke test: all Lambda function, S3 permission, and env var checks pass
- Phase 4b reconciliation gate: ACK reconcile does not clear env vars (both attempts pass)
- Staging has been running with ACK-managed resources for at least 24 hours without drift
- ArgoCD shows `s3-notifier-staging` as `Healthy` + `Synced` (no degraded or out-of-sync)
Step 1 — Enable production app with adoption in cluster-gitops (single dedicated GitOps commit):
The production app is currently disabled (service.enabled: false in config.yaml). Both the
config enablement and the adoption flag must be set in the same commit (so the app is
generated with adoptExisting: true from the start, never without it).
# cluster-gitops/syrf/environments/production/s3-notifier/config.yaml
# Enable the production s3-notifier Application.
serviceName: s3-notifier
envName: production
service:
  enabled: true  # Was: false — enables ApplicationSet generation
  chartTag: main
  imageTag: "0.0.0"  # Placeholder (Lambda uses S3 zip, not Docker image)
# cluster-gitops/syrf/environments/production/s3-notifier/values.yaml
# Set adoptExisting: true for initial deployment only.
# This tells ACK to import existing AWS resources instead of creating new ones.
adoptExisting: true
Commit and push, then verify the Application was generated before syncing:
set -euo pipefail
# Gate: confirm the production config exists in cluster-gitops.
# The ApplicationSet only generates an app when the service config directory exists.
PROD_CONFIG="${GITOPS_ROOT}/syrf/environments/production/s3-notifier/config.yaml"
if [ ! -f "$PROD_CONFIG" ]; then
echo "ERROR: Production config not found: $PROD_CONFIG"
echo "Create the production s3-notifier config in cluster-gitops first."
exit 1
fi
echo "OK: Production config exists at $PROD_CONFIG"
# Gate: confirm the ApplicationSet generated the production app.
# Without this, argocd app sync will fail with a confusing "app not found" error.
argocd app get s3-notifier-production --grpc-web >/dev/null 2>&1 || {
echo "ERROR: s3-notifier-production Application does not exist."
echo "Check that the production s3-notifier config was pushed to cluster-gitops"
echo "and the ApplicationSet has generated the Application."
echo " argocd appset list --grpc-web"
exit 1
}
echo "OK: s3-notifier-production Application exists."
# Record the adoption commit SHA for use in rollback (if needed).
# IMPORTANT: The adoption values change MUST be a dedicated commit (no other changes).
# If ADOPTION_SHA is already exported (from a prior run or manual set), use it.
# Otherwise, verify HEAD is the adoption commit by checking its message.
if [ -n "${ADOPTION_SHA:-}" ]; then
echo "Using pre-set ADOPTION_SHA=$ADOPTION_SHA"
else
ADOPTION_SHA=$(git -C "$GITOPS_ROOT" rev-parse HEAD)
ADOPTION_MSG=$(git -C "$GITOPS_ROOT" log -1 --format='%s' HEAD)
if ! grep -qiF "adoptExisting" <<<"$ADOPTION_MSG"; then
echo "ERROR: HEAD commit does not mention 'adoptExisting':"
echo " $ADOPTION_SHA $ADOPTION_MSG"
echo ""
echo "The adoption commit must be a DEDICATED commit. Find the correct SHA:"
echo " git -C \$GITOPS_ROOT log --oneline -10"
echo "Then re-run with: export ADOPTION_SHA=<correct SHA>"
exit 1
fi
echo "Adoption commit: $ADOPTION_SHA ($ADOPTION_MSG)"
fi
# Trigger manual sync (production is not auto-synced)
argocd app sync s3-notifier-production --grpc-web
# Wait for sync + health to settle (timeout 5 min)
argocd app wait s3-notifier-production --health --sync --timeout 300 --grpc-web
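The HEAD-commit guard above is a case-insensitive fixed-string match on the flag name. Its behavior can be checked with hypothetical commit subjects (both strings below are made up):

```shell
# Hypothetical commit subjects — only the one naming the flag should pass the guard.
GOOD='prod: enable s3-notifier (adoptExisting: true)'
BAD='fix: unrelated typo'
# -i matches adoptExisting regardless of case; -F avoids regex surprises.
grep -qiF 'adoptexisting' <<<"$GOOD" && GOOD_HIT=yes || GOOD_HIT=no
grep -qiF 'adoptexisting' <<<"$BAD" && BAD_HIT=yes || BAD_HIT=no
echo "good=$GOOD_HIT bad=$BAD_HIT"
```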
Step 2 — Verify adoption succeeded (health + sync gate):
set -euo pipefail
EXPECTED_LAMBDA_ARN="arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier"
# Combined health + sync check — both must be true.
APP_STATUS=$(argocd app get s3-notifier-production -o json --grpc-web)
HEALTH=$(jq -r '.status.health.status' <<<"$APP_STATUS")
SYNC=$(jq -r '.status.sync.status' <<<"$APP_STATUS")
if [ "$HEALTH" != "Healthy" ] || [ "$SYNC" != "Synced" ]; then
echo "FAIL: s3-notifier-production is $HEALTH/$SYNC (expected Healthy/Synced)"
argocd app get s3-notifier-production --grpc-web
exit 1
fi
echo "OK: s3-notifier-production is Healthy + Synced"
# Verify adopted Lambda is Active (not Pending, Failed, or Inactive)
LAMBDA_STATE=$(aws lambda get-function --function-name syrfAppUploadS3Notifier --region eu-west-1 \
--query 'Configuration.State' --output text)
if [ "$LAMBDA_STATE" != "Active" ]; then
echo "FAIL: Lambda state is '$LAMBDA_STATE' (expected Active)"
aws lambda get-function --function-name syrfAppUploadS3Notifier --region eu-west-1 \
--query 'Configuration.{State: State, LastUpdateStatus: LastUpdateStatus, StateReason: StateReason}' --output table
exit 1
fi
echo "OK: Lambda state is Active"
# Verify S3 notification targets the expected Lambda ARN (filter, not positional index)
NOTIF_ARN=$(aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads --region eu-west-1 \
--query "LambdaFunctionConfigurations[?LambdaFunctionArn=='${EXPECTED_LAMBDA_ARN}'].LambdaFunctionArn | [0]" \
--output text)
if [ "$NOTIF_ARN" != "$EXPECTED_LAMBDA_ARN" ]; then
echo "FAIL: S3 notification does not target expected Lambda ARN"
echo "Expected: $EXPECTED_LAMBDA_ARN"
echo "Got: ${NOTIF_ARN:-null}"
aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads --region eu-west-1
exit 1
fi
echo "OK: S3 notification targets $EXPECTED_LAMBDA_ARN"
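The JMESPath filter above matches the notification by exact ARN rather than positional index. The same logic can be sketched with jq against a canned configuration (both ARNs are placeholders, not real accounts):

```shell
# Canned get-bucket-notification-configuration output with two Lambda targets.
CONFIG='{"LambdaFunctionConfigurations":[
  {"LambdaFunctionArn":"arn:aws:lambda:eu-west-1:111122223333:function:other"},
  {"LambdaFunctionArn":"arn:aws:lambda:eu-west-1:111122223333:function:target"}]}'
WANT='arn:aws:lambda:eu-west-1:111122223333:function:target'
# select() filters by exact equality — order of configurations does not matter.
GOT=$(jq -r --arg a "$WANT" \
  '.LambdaFunctionConfigurations[] | select(.LambdaFunctionArn == $a) | .LambdaFunctionArn' \
  <<<"$CONFIG")
echo "matched=$GOT"
```

A positional `[0]` lookup would break the moment another notification target is added; filtering by ARN does not.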
Step 3 — Run Phase 4b reconciliation gate against production:
Phase 4b is parameterized via environment variables. Set these before running the Phase 4b script:
# Production targets for Phase 4b (override staging defaults)
export PHASE4B_FUNCTION_NAME="syrfAppUploadS3Notifier"
export PHASE4B_NAMESPACE="syrf-production"
export PHASE4B_CR_NAME="s3-notifier-production"
export PHASE4B_ENV_VALUES_PATH="syrf/environments/production/s3-notifier/values.yaml"
export PHASE4B_ARGOCD_APP="s3-notifier-production"
Then run the Phase 4b script. It uses ${PHASE4B_FUNCTION_NAME:-syrfAppUploadS3Notifier-staging} etc., defaulting to staging if unset.
Step 4 — Remove adoption annotation (GitOps commit):
After adoption is verified and Phase 4b passes, set adoptExisting: false so future syncs use normal reconciliation:
# cluster-gitops/syrf/environments/production/s3-notifier/values.yaml
# Adoption complete — reset to normal reconciliation mode.
adoptExisting: false
Sync again to apply:
set -euo pipefail
argocd app sync s3-notifier-production --grpc-web
argocd app wait s3-notifier-production --health --sync --timeout 300 --grpc-web
echo "Adoption complete. ACK now manages production resources in normal mode."
Step 5 — Terraform teardown (after Step 4 is verified stable):
The old Terraform-managed resources are now owned by ACK. Remove them from Terraform without destroying the actual AWS resources. Order matters — remove state first, then config:
The Terraform resources to hand off are (from camarades-infrastructure/terraform/lambda/main.tf):
| Terraform address | What it manages | ACK takes over? |
|---|---|---|
| `aws_lambda_function.s3_notifier_production` (line 65) | Production Lambda | Yes — Function CRD |
| `aws_lambda_permission.s3_invoke_production` (line 180) | S3→Lambda invoke permission | Yes — permission-job |
| `aws_s3_bucket_notification.uploads` (line 202) | Both production AND preview notifications | Partially — ACK manages the production notification via Bucket CRD; previews stay in Terraform |
Critical: `aws_s3_bucket_notification.uploads` is a single resource managing ALL bucket notifications (the production `Projects/` prefix plus all preview `preview/pr-{N}/` prefixes). You cannot `terraform state rm` it without losing preview notifications. Instead, modify the config to remove only the production `lambda_function` block.
set -euo pipefail
cd "$INFRA_ROOT/terraform/lambda"
# 1. Remove production-only resources from Terraform state.
# These are fully replaced by ACK CRDs and setup-job.
terraform state rm aws_lambda_function.s3_notifier_production
terraform state rm aws_lambda_permission.s3_invoke_production
# 2. Verify state removal — these two should no longer appear.
terraform state list | grep -E 's3_notifier_production|s3_invoke_production' && {
echo "ERROR: Production resources still in state after removal!"
exit 1
} || echo "OK: Production Lambda + permission removed from state."
# 3. DO NOT state-rm aws_s3_bucket_notification.uploads!
# It still manages preview notifications. Instead, edit main.tf to remove
# only the production lambda_function block from that resource.
echo "ACTION: Edit main.tf to modify aws_s3_bucket_notification.uploads:"
echo " - REMOVE the production lambda_function block (lines 206-210)"
echo " - KEEP the dynamic preview lambda_function block (lines 214-221)"
echo " - REMOVE aws_lambda_function.s3_notifier_production resource block"
echo " - REMOVE aws_lambda_permission.s3_invoke_production resource block"
echo " - KEEP all preview resources (s3_notifier_preview, s3_invoke_preview)"
echo " - KEEP production IAM role (still needed for Lambda execution)"
# 4. Run plan to confirm ONLY the expected changes.
terraform plan
# Expect:
# - aws_s3_bucket_notification.uploads: 1 to update (remove production entry)
# - No 'destroy' actions for preview resources
# - If plan shows destroy of preview Lambdas or permissions, STOP.
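The `&& { … } || …` pattern in step 2 is what lets a non-matching `grep` survive `set -e`: a command that fails inside an AND-OR list does not trip errexit. A self-contained sketch with stand-in state output:

```shell
set -euo pipefail
# Stand-in for `terraform state list` output after the removals (illustrative).
STATE_LIST='aws_lambda_function.s3_notifier_preview
aws_s3_bucket_notification.uploads'
# grep exits 1 on no match, but inside an &&/|| list that does NOT abort the script.
grep -E 's3_notifier_production|s3_invoke_production' <<<"$STATE_LIST" && {
  RESULT=leftover
} || RESULT=clean
echo "state-check=$RESULT"
```

Had the `grep` stood alone on its own line, `set -e` would have killed the script on the (desired) no-match case.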
Alternative (Terraform 1.7+): Use `removed` blocks for the Lambda and permission. The notification resource is modified in-place (not removed), so `removed` blocks don't apply to it:

removed {
  from = aws_lambda_function.s3_notifier_production
  lifecycle { destroy = false }
}
removed {
  from = aws_lambda_permission.s3_invoke_production
  lifecycle { destroy = false }
}

Then edit `aws_s3_bucket_notification.uploads` to remove the production block, and `terraform apply`. Preferred for auditable GitOps workflow.
Rollback procedure (if production adoption fails):
Rollback requires both removing ACK CRDs (ArgoCD) and reverting the GitOps config (so the ApplicationSet doesn't recreate the app). Use the adoption commit SHA captured in Step 1 rather than reverting HEAD, which is unsafe on a moving branch.
set -euo pipefail
# ADOPTION_SHA was captured in Step 1. Verify it's set.
if [ -z "${ADOPTION_SHA:-}" ]; then
echo "ERROR: ADOPTION_SHA not set. Find it with:"
echo " git -C \$GITOPS_ROOT log --oneline -5"
echo "Then: export ADOPTION_SHA=<the adoption commit>"
exit 1
fi
# 0. Pre-check: verify deletionPolicy on each CRD independently.
# Each CRD can be: present+retain (safe), present+other (FATAL), or absent (already pruned).
ROLLBACK_SAFE=true
for CRD_KIND_NAME in "function/s3-notifier-production" "bucket/syrfapp-uploads"; do
CRD_KIND="${CRD_KIND_NAME%%/*}"
CRD_NAME="${CRD_KIND_NAME##*/}"
DELETION=$(kubectl get "$CRD_KIND" "$CRD_NAME" -n syrf-production \
-o jsonpath='{.metadata.annotations.services\.k8s\.aws/deletion-policy}' 2>/dev/null) || true
if [ -z "$DELETION" ]; then
echo "NOTE: $CRD_KIND/$CRD_NAME already absent — no deletion risk."
elif [ "$DELETION" = "retain" ]; then
echo "OK: $CRD_KIND/$CRD_NAME has deletionPolicy=retain."
else
echo "FATAL: $CRD_KIND/$CRD_NAME has deletionPolicy='$DELETION' (expected 'retain')!"
echo "DO NOT proceed — deleting this CRD will destroy the AWS resource."
ROLLBACK_SAFE=false
fi
done
if [ "$ROLLBACK_SAFE" != true ]; then
exit 1
fi
# 1. Revert the specific adoption commit in cluster-gitops
# (not HEAD — the branch may have moved since adoption)
echo "Reverting adoption commit $ADOPTION_SHA in cluster-gitops..."
echo " git -C \$GITOPS_ROOT revert --no-edit $ADOPTION_SHA"
echo " git -C \$GITOPS_ROOT push"
echo ""
echo "After the revert is pushed, ArgoCD will remove the Application."
echo "deletionPolicy: retain ensures AWS resources are untouched."
# 2. Verify AWS resources are still functional (after ArgoCD prunes the CRDs)
echo ""
echo "After ArgoCD sync completes, verify AWS resources survived:"
aws lambda get-function --function-name syrfAppUploadS3Notifier --region eu-west-1 \
--query 'Configuration.State' --output text
NOTIF_COUNT=$(aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads --region eu-west-1 \
--query 'length(LambdaFunctionConfigurations)' --output text)
if [ "${NOTIF_COUNT:-0}" -eq 0 ]; then
echo "WARNING: S3 bucket has no Lambda notification configurations!"
exit 1
fi
echo "Rollback complete — AWS resources unchanged ($NOTIF_COUNT notification configs), ACK CRDs removed via GitOps revert."
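The `kind/name` split in the rollback pre-check loop relies on POSIX parameter expansion; in isolation:

```shell
# Split "kind/name" without spawning cut/awk.
PAIR='function/s3-notifier-production'
KIND=${PAIR%%/*}   # remove the longest suffix starting at the first '/'
NAME=${PAIR##*/}   # remove the longest prefix ending at the last '/'
echo "kind=$KIND name=$NAME"
```

Because `%%` and `##` take the longest match, names containing further slashes would still split at the first slash for kind and the last for name, which is why the loop keeps CR names slash-free.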