Plan: Fix Code Review Issues for ACK Migration (v3.15)
v3.15 changes
- Here-strings for multi-line variables — All `printf '%s\n' "$VAR" | grep/awk/sed` pipelines on captured multi-line output (chart values, helm template output, controller logs) replaced with here-strings (`grep ... <<<"$VAR"`, `awk ... <<<"$VAR"`, `sed ... <<<"$VAR"`). Here-strings pass the variable as stdin without a pipe process, eliminating SIGPIPE entirely under `set -euo pipefail`. Remaining pipes are intentional: token-check patterns (`echo "$ROLE_ARN" | tr | grep -qxF`) operate on small single-line jsonpath outputs where the full data fits in one pipe-buffer write. Affects: chart key validation, SA extraction, controller log check.
- Chart key validation: leaf-level `yq -e` predicates — Loose `grep -q 'certManager:'` assertions replaced with leaf-level `yq -e` exit-code predicates (e.g. `.pki.certManager.enabled`, not `.pki.certManager`). Ensures the specific keys our values.yaml overrides actually exist in the chart schema. `yq -e` exits non-zero when the path resolves to `null`, eliminating the intermediate `| grep -v '^null$'` pipe.
- Fix 0 webhook readiness: `kubectl wait` — Standalone Fix 0 verification step 1 changed from informational `kubectl get pods` (exits 0 regardless of Ready state) to `kubectl wait --for=condition=Ready ... --timeout=120s`, consistent with the Phase 2 gate.
- OCI sourceRepos: `yq` exact match — OCI `sourceRepos` fallback verification replaced the brittle `grep -qF` substring check with `yq -e '.spec.sourceRepos[] | select(. == ...)'` for exact array-element matching. Eliminates false positives from partial URL matches.
- `yq` required everywhere, no fallbacks — Prerequisites list `yq` as required (was "optional with hardcoded fallback"). Pre-merge block includes `command -v yq` + major-version gate (v4+ check). Phase 4b expected-value sourcing now hard-fails on missing/old yq instead of falling back to hardcoded values that can silently drift from config.
- Repo root autodetect — `GITOPS_ROOT` and `INFRA_ROOT` now autodetect nested (`$REPO_ROOT/cluster-gitops`) or sibling (`$REPO_ROOT/../cluster-gitops`) layouts. Explicit `export GITOPS_ROOT=...` override still works for CI or non-standard layouts. Actionable error if neither path exists.
- Phase 5: production rollout gate — New section after Phase 4b with 4 concrete steps: (1) enable `adoptExisting: true` via GitOps commit, (2) verify the health+sync gate via `argocd app wait`, (3) run Phase 4b reconciliation against production, (4) reset `adoptExisting: false` after verification. Rollback requires a GitOps revert (not just `argocd app delete`) to prevent ApplicationSet recreation. All commands use `--grpc-web` and `jq` for structured checks.
- Phase 4b parameterized — All hardcoded staging references (`syrfAppUploadS3Notifier-staging`, `syrf-staging`, `s3-notifier-staging`) replaced with `$FUNCTION_NAME`, `$K8S_NAMESPACE`, `$CR_NAME`, `$ARGOCD_APP`, `$ENV_VALUES_PATH`. Defaults to staging if `PHASE4B_*` env vars are unset. Phase 5 Step 3 shows the production overrides.
- Phase 5 hardened — Step 1 now includes both `service.enabled: true` (config.yaml) and `adoptExisting: true` (values.yaml) in a single dedicated commit, since the app won't be generated without enablement. Adoption SHA is a hard gate: auto-detects from HEAD only if the commit message mentions `adoptExisting`; requires explicit `export ADOPTION_SHA=<sha>` otherwise. Lambda state hard-asserted `Active`. S3 notification filtered by ARN. Rollback `deletionPolicy` pre-check handles each CRD independently (present+retain, present+other, or absent). `--grpc-web` on all ArgoCD commands.
- Terraform teardown: correct addresses and scope — Fixed resource names to match the actual Terraform config: `aws_lambda_function.s3_notifier_production`, `aws_lambda_permission.s3_invoke_production`. `aws_s3_bucket_notification.uploads` is NOT removed (it manages both production AND preview notifications); instead, the production `lambda_function` block is removed from the resource config. Preview resources (`s3_notifier_preview`, `s3_invoke_preview`) are explicitly called out as untouched. Documents `removed` blocks (Terraform 1.7+) as the preferred alternative for the Lambda and permission.
- macOS note simplified — Verbose 4-line macOS/BSD `date +%s%N` comment condensed to 1 line. Only relevant to local shell portability, not cluster runtime.
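The here-string change is easiest to see in isolation. A minimal sketch, assuming bash; the variable contents are an invented stand-in for captured helm/log output:

```shell
set -euo pipefail

# Stand-in for large captured output (helm template, controller logs, etc.)
BIG_OUTPUT=$(seq 1 100000)

# Pipe form (old): if the reader exits early (e.g. `head -1`), the writer
# can receive SIGPIPE, and `pipefail` reports the whole pipeline as failed.
# printf '%s\n' "$BIG_OUTPUT" | head -1   # may false-fail under pipefail

# Here-string form (new): the variable is fed to grep as stdin directly,
# with no writer process, so there is nothing to receive SIGPIPE.
grep -cx '42' <<<"$BIG_OUTPUT"   # -> 1
```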
v3.14 changes
- Fail-fast: `set -euo pipefail` — All runnable verification blocks (Prerequisites, Fix 0 standalone, Phases 1-4b) now start with `set -euo pipefail`. Prevents gate commands from failing silently and allowing execution to continue past a broken check.
- Fallback sync error guard — `argocd app sync --force` in the Phase 4b automated retry now has `|| { echo "..."; exit 1; }` to catch ArgoCD connectivity or app-state failures instead of proceeding blind.
- namespaceSelector doc consistency — The initial webhook values comment (failurePolicy: Ignore deploy) now explicitly notes that a `namespaceSelector` should be added when switching to `Fail`, cross-referencing the fail-closed section. Removes the "no namespaceSelector needed" contradiction.
- Namespace label pre-check — Added a runnable pre-check script before the failurePolicy: Fail switch that verifies the `syrf.io/aws-identity=enabled` label exists on `ack-system`, `syrf-staging`, and `syrf-production`. Hard-fails with a remediation command if any label is missing.
- Rollout timeout alignment — Inline remediation example (`--timeout=60s`) aligned to `180s` to match the main Phase 2 rollout waits. No more conflicting timeout guidance.
- Version bump: v3.13 → v3.14 — New version tag for traceability after 6 incremental fixes since v3.13.
- S3 IAM phantom actions — Moved from v3.13 changelog item 42 to a clear note: removed `s3:GetObjectMetadata` and `s3:HeadObject` from the staging Lambda role policy. AWS maps HeadObject auth to `s3:GetObject` — the phantom actions are not valid IAM action names and would be silently ignored by IAM.
- `set -e` + `grep` safety — All `grep` calls inside command substitutions now use `{ grep ... || true; }` to prevent `set -e` from killing the script before custom error handling. Affected: yq version detection, SA-name extraction (helm template pipeline), webhook rollback name lookups. Count checks switched from `wc -l` (counts empty lines) to `grep -c . || true`.
- SA extraction: explicit empty guard — After the helm template + grep pipeline, added explicit `-z` checks for `LAMBDA_SA` and `S3_SA` with actionable error messages, before the trust-policy cross-check.
- Namespace selector structural clarity — The initial values.yaml block now includes a commented-out `namespaceSelector` placeholder explicitly noting it's not needed for `Ignore` but required for `Fail`, mirroring the fail-closed section's structure.
- Label remediation: `--overwrite` — All `kubectl label ns` remediation commands (pre-check script and inline YAML comments) now include `--overwrite` to handle wrong-value correction, not just missing-label cases.
- POSIX redirect: `&>/dev/null` → `>/dev/null 2>&1` — Replaced the bashism `&>` with POSIX-portable `>/dev/null 2>&1` in the yq presence check.
- Prerequisites: context hard gate — `EXPECTED_CONTEXT` was defined but never validated in Prerequisites (only in Fix 0 standalone). Added a `kubectl config current-context` assertion with `exit 1` on mismatch, so skipping the Fix 0 block can't accidentally hit the wrong cluster.
- Phase 2 webhook readiness gate — Replaced `kubectl get pods` (exits 0 even when not Ready) with `kubectl wait --for=condition=Ready pod ... --timeout=120s`. The webhook must be fully Ready before the controller rollout restart, not just listed.
- Pre-merge validation: `set -euo pipefail` — The chart key validation block (helm repo add/show values) was the last runnable block without fail-fast mode. Added for consistency with all other blocks.
- New namespace labeling note — Added an operational warning that any new namespace needing AWS credential injection (e.g. PR previews) must be labeled `syrf.io/aws-identity=enabled` before pod deployment when `failurePolicy: Fail` with `namespaceSelector` is active, or credentials silently won't inject.
- SIGPIPE safety: capture-then-parse — `helm show values ... | head -60` and `helm template ... | grep | head -1` pipelines can false-fail under `set -euo pipefail` when `head` closes the pipe early (upstream gets SIGPIPE, `pipefail` sees non-zero). Rewritten to capture the full output into a variable first, then parse with `sed -n`/`awk` (which read all input, no early close).
- SA extraction: stderr preserved — Removed `2>/dev/null` from `helm template` in SA-name extraction. Stderr is now captured via `2>&1` into the output variable and printed on failure, giving actionable diagnostics (wrong chart version, missing values file, OCI pull error) instead of a silent empty result.
- Rollback block: `set -euo pipefail` — The webhook rollback script (Option 1) was the last runnable block without fail-fast. Added for consistency — all runnable blocks now have it.
- SIGPIPE: here-strings for multi-line variables — Variable-to-command patterns for large captured output (chart values, helm template, controller logs) now use here-strings (`grep ... <<<"$VAR"`, `awk ... <<<"$VAR"`, `sed ... <<<"$VAR"`) instead of pipes. Token-check patterns on small jsonpath outputs retain pipes (no practical SIGPIPE risk for single-buffer writes). Superseded by v3.15 item 1.
- Chart key validation: yq path checks — Pre-merge validation uses `yq -e` exit-code predicates (`yq -e '.pki.certManager'`, etc.) against captured chart values instead of a loose `grep -q 'certManager:'` which can false-pass. Superseded by v3.15 item 2.
- Recovery instruction: `kubectl wait` — Recovery step "Confirm webhook is Ready" changed from `kubectl get pods` (informational, exits 0 regardless) to `kubectl wait --for=condition=Ready ... --timeout=120s`, matching the Phase 2 gate pattern.
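Two of the patterns above (`{ grep ... || true; }` under `set -e`, and `grep -c .` instead of `wc -l`) can be shown together in a minimal, self-contained sketch; the sample data is invented:

```shell
set -euo pipefail
DATA=$'alpha\n\nbeta'   # three lines, one of them empty

# Under set -e, a non-matching grep inside $(...) would abort the script.
# The group-with-`|| true` wrapper preserves the empty result for explicit handling.
MATCH=$({ grep 'gamma' <<<"$DATA" || true; })
if [ -z "$MATCH" ]; then
  echo "no match — handled explicitly instead of dying"
fi

# Counting non-empty lines: wc -l counts the blank line, grep -c . does not.
echo "wc -l:     $(wc -l <<<"$DATA")"       # 3
echo "grep -c .: $(grep -c . <<<"$DATA")"   # 2
```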
v3.13 changes
- Permission-job: Python instead of jq — The conflict-path policy inspection used `jq` for JSON parsing, but `amazon/aws-cli:2.15.0` does not include `jq`. Replaced with Python (bundled internally by the AWS CLI image), using the same discovery logic as env-vars-job.
- Reconcile evidence: `--since-time` instead of `--since=5m` — The 5-minute rolling window could catch log lines from a prior reconcile, causing false-pass. Now captures a `LOG_SINCE` timestamp immediately before the annotation and uses `--since-time=$LOG_SINCE` for an exact window.
- Separate `GITOPS_ROOT` variable — `REPO_ROOT/cluster-gitops` assumed repos are co-located (nested). Added `GITOPS_ROOT` as a separate variable in Prerequisites, defaulting to `${REPO_ROOT}/cluster-gitops` but overridable for different checkout layouts (CI, separate worktrees).
- yq v4+ explicit requirement — The `ireduce` syntax used for the Helm-style deep merge requires yq v4+. Prerequisites now state the version requirement explicitly.
- Auth check: WARNING → hard FAIL — Phase 2 ACK controller auth error detection was a warning (continue). For a deployment gate, credential errors must block — upgraded to `exit 1`.
- Header/date refresh — Bumped version to v3.13, updated date to 2026-02-08.
- Phase 1 paths: `GITOPS_ROOT` + `INFRA_ROOT` — Phase 1 Terraform verification used hardcoded relative paths (`../../../cluster-gitops/...`) that break in non-co-located repo layouts. Now uses `${GITOPS_ROOT}` for helm values and `${INFRA_ROOT}` for the `cd` target. Added an `INFRA_ROOT` variable to Prerequisites alongside `GITOPS_ROOT`.
- Reconcile annotation: nanosecond uniqueness — `date +%s` can repeat on fast reruns (same second). Now uses `date +%s%N` (nanoseconds) with `case`-based detection and an `epoch-$$` fallback (superseded by item 11 for macOS correctness).
- Permission-job: Python existence guard — The conflict-path `PYTHON` discovery lacked the `--version` check present in env-vars-job. Added the same explicit verification and actionable error message.
- Mutation assertions: exact role ARN — All three mutation check blocks (Fix 0, Phase 2, Phase 2b sa-test) now assert the exact expected role ARN (`syrf-ack-controllers` for ACK controllers, `syrf-ack-setup-job` for sa-test) instead of just non-empty. Catches wrong-role injection.
- Nanosecond timestamp: macOS-safe — `date +%s%N` returns a literal `N` on macOS (exit 0, so the `||` fallback never fires). Now uses a `case` check: if the output ends in `N`, falls back to `epoch-$$` (PID, POSIX-portable unlike `$RANDOM`).
- Fix 0 token file path assertion — Fix 0 verification now asserts the exact `AWS_WEB_IDENTITY_TOKEN_FILE` path (was presence-only; Phase 2 already had it).
- Auth grep: web-identity patterns — Added `InvalidIdentityToken`, `ExpiredToken`, `WebIdentityErr` to the auth error grep in Phase 2, covering common OIDC/STS federation failures.
- Fail-fast directory guards — The Prerequisites block now validates that `REPO_ROOT`, `GITOPS_ROOT`, and `INFRA_ROOT` all exist before any phase runs.
- Permission conflict: Python error handling — Both Python JSON-parsing invocations now wrap `json.loads` in try/except. On malformed or empty policy payloads, prints the raw payload and exits with an actionable error instead of an opaque Python traceback.
- Changelog consistency — Item 8 still referenced the `${RANDOM}` fallback; corrected to match item 11's `epoch-$$` implementation.
- ArgoCD checks: gating — Replaced informational `argocd app get` (always exits 0) with `argocd app wait --health --timeout`, which exits non-zero if health/sync is not reached. Phase 2 steps 4-5 are now actual deployment gates.
- Container selector: `containers[*]` — All mutation jsonpath queries now use `containers[*]` instead of `containers[0]`. Sidecar injection (e.g. Istio, Linkerd) can shift the main container from index 0; the wildcard matches across all containers.
- Timestamp: stricter portability — Added `2>/dev/null` to suppress stderr from exotic shells, and a `${RECONCILE_TS:-}` empty-string guard in the `case` check for shells where the variable might be unset.
- Parameterised ARNs — Extracted hardcoded `arn:aws:iam::318789018510:role/...` into `AWS_ACCOUNT_ID`, `ACK_CONTROLLER_ROLE_ARN`, and `SETUP_JOB_ROLE_ARN` variables in Prerequisites. All phases now reference these variables, making the scripts reusable across accounts.
- Fix 0 standalone safety — The Fix 0 verification block referenced `$ACK_CONTROLLER_ROLE_ARN` (defined later in Prerequisites). Now inlines the ARN directly with a comment pointing to Prerequisites for the parameterised version used by the Phases.
- Multi-container value safety — `containers[*]` jsonpath can concatenate values from multiple containers (whitespace-separated). Equality checks (`!=`) replaced with token-splitting + exact match (see item 26). Prevents false-fails when sidecars are present.
- yq v4 runtime validation — The `command -v yq` presence check now also extracts the major version number and requires `>= 4`. If yq v3 is installed, falls back to hardcoded values with a clear message instead of failing with an opaque `ireduce` syntax error.
- AWS account sanity gate — Prerequisites now run `aws sts get-caller-identity` and assert the returned account matches `$AWS_ACCOUNT_ID`. Prevents cross-account mistakes (wrong profile, personal creds).
- sa-test cleanup trap — Phase 2b now sets `trap ... EXIT` before creating the sa-test pod, ensuring cleanup on timeout, assertion failure, or Ctrl-C. Manual `kubectl delete` calls removed from failure paths (the trap handles them).
- Exact-token matching — Replaced `grep -qF` (substring) with `tr ' ' '\n' | grep -qxF` (exact line match after word-splitting). Prevents false-pass when a longer ARN contains the expected ARN as a prefix (e.g. `syrf-ack-controllers-v2` matching `syrf-ack-controllers`).
- Fix 0 context awareness — Added kubectl context visibility (`kubectl config current-context`) and derived `EXPECTED_ACCOUNT` from the inline ARN. The standalone block can't call AWS APIs (no AWS context). Upgraded to a hard gate in item 32.
- yq merge error handling — Wrapped `yq ea` in an `if` guard: on failure (missing files, parse error), captures the error message and falls back to hardcoded values instead of aborting. The fallback message now says "v4+ not available or merge failed" (was misleadingly "not available" even for v3).
- get-policy guard — `aws lambda get-policy` in the permission-job conflict path now has explicit `|| { ... exit 1; }` error handling with an actionable message. Also guards against an empty/None response. Previously, `set -e` would cause an opaque abort on API failure.
- ArgoCD wait timeouts: realistic — Increased the ACK controller health timeout 120s → 300s (image pull + CRD registration + AWS round-trip). Increased the s3-notifier timeout 180s → 600s (3 sync waves + PostSync job, each involving AWS API calls).
- Reconcile annotation: extra entropy on macOS — The `epoch-$$` fallback can repeat if rerun in the same second from the same shell (the PID doesn't change). Now appends a monotonic `_RECONCILE_SEQ` counter that increments on each invocation.
- Fix 0: hard-gate kubectl context — The standalone Fix 0 block now asserts `kubectl config current-context` matches the expected GKE context (`gke_camarades-net_europe-west2-a_camaradesuk`). Exits on mismatch instead of just printing the context.
- Whitespace splitting: `tr -s '[:space:]'` — All `tr ' ' '\n'` instances replaced with `tr -s '[:space:]' '\n'`. Splits on all whitespace characters (tabs, newlines from jsonpath), not just spaces, and collapses consecutive whitespace to avoid empty lines.
- Stale comment cleanup — Comments referencing the old `grep -qF` (contains) semantics updated to reflect the current exact-token approach (`tr -s '[:space:]' '\n' | grep -qxF`). Changelog item 22 also updated to cross-reference item 26.
- Force-reconcile fallback trigger — The Phase 4b annotation-based trigger may not fire if ACK runtime predicates ignore metadata-only changes. Added a documented manual fallback path: `argocd app sync --force` + re-check, with guidance on when to investigate further.
- Parameterised `EXPECTED_CONTEXT` — Hard-coded `gke_camarades-net_europe-west2-a_camaradesuk` moved to Prerequisites alongside the account/role variables. Fix 0 standalone uses a `${EXPECTED_CONTEXT:-...}` default so it works both standalone and after Prerequisites.
- Unique sa-test pod names — The fixed pod name `sa-test` replaced with `sa-test-$$` (PID suffix) to avoid `AlreadyExists` errors on repeated verification runs. All references (create, wait, inspect, cleanup trap, delete) updated to use `$SA_TEST_POD`.
- EXPECTED_VALS: `jq -n --arg` — Phase 4b `EXPECTED_VALS` construction replaced string interpolation (`"\"$EXP_HOST\""`) with `jq -n --arg Host ... --arg Region ... --arg User ...`. Avoids escaping issues if values contain special characters.
- Error message consistency — All "does not contain expected" FAIL messages updated to "missing exact expected token" with both expected and actual values quoted, matching the `grep -qxF` exact-token semantics.
- macOS date note scope — Clarified that the `date +%s%N` / BSD workaround applies only to local shell verification scripts, not to cluster runtime (Kubernetes pods use GNU coreutils).
- Webhook `failurePolicy: Fail` scoping — Added a `namespaceSelector` (label: `syrf.io/aws-identity=enabled`) to the fail-closed webhook configuration. Without scoping, `Fail` blocks ALL pod creation cluster-wide if the webhook is unavailable. Added a risk note explaining the blast-radius difference.
- S3 IAM actions: remove phantoms — Removed `s3:GetObjectMetadata` and `s3:HeadObject` from the staging Lambda role policy. Neither is a real IAM action — AWS authorises HeadObject under `s3:GetObject`.
- ACK rollout timeout: 60s → 180s — `kubectl rollout status` on cold image pulls (first deploy or node reschedule) can exceed 60s. Increased to 180s to match the same order as the ArgoCD health timeouts.
- sa-test pod: timestamp+PID — `sa-test-$$` can collide across repeated runs in the same shell (the PID doesn't change). Changed to `sa-test-$(date +%s)-$$` for uniqueness across both time and shell sessions.
- Reconcile gate: automated fallback — Replaced the manual-only fallback comment with an automated two-attempt loop. Attempt 1 uses the annotation trigger; on timeout with no evidence, attempt 2 automatically runs `argocd app sync --force` and re-checks. Reduces operator variance while still hard-failing if both attempts show no reconcile evidence.
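The exact-token matching fix above can be demonstrated without a cluster; a minimal sketch with hypothetical ARNs (the account ID is a placeholder, not the real one):

```shell
set -euo pipefail
EXPECTED="arn:aws:iam::111111111111:role/syrf-ack-controllers"
ACTUAL="arn:aws:iam::111111111111:role/syrf-ack-controllers-v2"   # wrong role

# Old substring check: false-passes because EXPECTED is a prefix of ACTUAL
if echo "$ACTUAL" | grep -qF "$EXPECTED"; then
  echo "substring check: PASS (false positive)"
fi

# New exact-token check: split on all whitespace, require a whole-line match
if echo "$ACTUAL" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED"; then
  echo "exact check: PASS"
else
  echo "exact check: FAIL (wrong role correctly rejected)"
fi
```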
v3.12 changes
- Reconcile gate: lastTransitionTime-based — Replace the resourceVersion two-bump proof (race-prone: ACK can reconcile before `AFTER_ANNOTATION_RV` is captured) with a `lastTransitionTime` comparison on the ResourceSynced condition. Immune to timing races. Graceful fallback if the condition stays True throughout (env var validation is the authoritative check).
- Distroless-safe token check — Replace `kubectl exec ... ls ...` (fails on distroless/minimal ACK images) with `kubectl get pod ... -o jsonpath` to check the projected volume mount path in the pod spec.
- sa-test race fix — Add `kubectl wait --for=condition=Ready` before inspecting the pod spec (was race-prone: `sleep 10` + immediate inspect).
- Pod selection: newest pod after rollout — Replace `sort | head -1` (picks the oldest/first alphabetically, may miss the newly-mutated pod) with `--sort-by=.metadata.creationTimestamp` + last item (newest pod, the one created after webhook install).
- `--force` fallback — Add an explicit fallback for Fix 9: if `--force` still doesn't rerun hooks, delete the old hook Job so the `BeforeHookCreation` delete policy triggers recreation on the next sync. Scoped to s3-notifier labels to avoid deleting unrelated PostSync jobs. Keep `--force` on the retry path too.
- Reconcile gate: expected-value assertions — After the before/after comparison, also assert env var keys and non-sensitive values match the expected config (Host, Username, Region). Catches the case where both before AND after are wrong.
- Reconcile wait: 30s → 60s — Controller + AWS API round-trip can be slow; 30s was flaky-prone.
- Token-volume check: combined assertion — Replace the weak `grep -c token`/`grep token` with specific jsonpath for the mount path + `AWS_ROLE_ARN` + `AWS_WEB_IDENTITY_TOKEN_FILE`. Fail explicitly if any are missing.
- JSON comparison: jq-normalized — Pipe all `aws --output json` through `jq -Sc '.'` for whitespace/key-order normalization. Prevents brittle raw string comparison across AWS CLI versions.
- Reconcile timeout: controller log check — On timeout, verify controller logs mention `s3-notifier-staging` as secondary evidence that the reconcile ran (guards against false-pass from stale state).
- Hook delete: `--ignore-not-found` — Make the fallback `kubectl delete job` idempotent.
- sa-test: `--restart=Never` — Remove kubectl CLI version ambiguity for `kubectl run`.
- Reconcile proof: hard fail on no evidence — Upgrade the no-log-evidence path from WARNING to hard FAIL. Continuing on stale state defeats the purpose of the reconciliation gate.
- Cross-repo gate: immutable ref — The pin instruction now requires a specific commit SHA (not `main`) to prevent the ref from advancing while prerequisites are pending.
- Expected values: source from config — Use `yq` to extract the expected Host/Username/Region from cluster-gitops Helm values files, falling back to hardcoded values if `yq` is unavailable. Prevents drift.
- Prerequisites section — Added a required CLI tools list (kubectl, helm, argocd, aws, jq, yq) with auth instructions before the verification phases.
- Token file path: exact assertion — Assert `AWS_WEB_IDENTITY_TOKEN_FILE` equals the expected path `/var/run/secrets/eks.amazonaws.com/serviceaccount/token`, not just non-empty.
- yq merge: proper overlay — Replace the broken `tail -1` (drops service defaults when the env file omits a key) with `yq ea '. as $item ireduce ({}; . * $item)'` for a correct Helm-style deep merge.
- Permission idempotency: verify on conflict — On `ResourceConflictException`, inspect the existing policy statement for the correct `source-arn`/`source-account`. If mismatched (stale from a previous bucket), remove and re-add.
- Execution context: `REPO_ROOT` — All phases now use the `REPO_ROOT` variable for path references. Set once in Prerequisites, used in the Phase 1 `cd`, Phase 3 `helm template`, and Phase 4b `yq` paths.
- Auth log check: robust capture — Capture logs into a variable first (fail if retrieval fails), widen the window from `--since=2m` to `--since=5m`, narrow the grep to auth-specific errors only.
- Reconcile evidence grep: multi-pattern — Match the K8s CR name (`s3-notifier-staging`), the AWS function name (`syrfAppUploadS3Notifier-staging`), and the annotation timestamp. Capture logs into a variable with error handling.
- sa-test: full mutation assertion — Replace the `grep AWS` env-name check with exact jsonpath assertions for `AWS_ROLE_ARN`, the `AWS_WEB_IDENTITY_TOKEN_FILE` path, and the token volume mount. Clean up the pod on failure.
- Webhook chart source: aligned — Pre-merge `helm show values` now uses the same Helm repo (jkroepke/helm-charts) as the `config.yaml` deployment, not the OCI alternative.
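The jq normalization item can be illustrated without AWS access. A minimal sketch with invented JSON payloads:

```shell
set -euo pipefail
# Two logically identical payloads, differing only in key order and whitespace
A='{"b":1,"a":2}'
B='{ "a": 2, "b": 1 }'

# Raw string comparison is brittle across AWS CLI versions / formatting
[ "$A" = "$B" ] || echo "raw compare: strings differ"

# jq -Sc sorts keys (-S) and compacts (-c), producing one canonical form
NORM_A=$(jq -Sc . <<<"$A")
NORM_B=$(jq -Sc . <<<"$B")
if [ "$NORM_A" = "$NORM_B" ]; then
  echo "normalized compare: equal ($NORM_A)"
fi
```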
v3.11 changes
- Reconcile gate: two-bump proof — Record the resourceVersion after the annotation write, then wait for it to change again (proving the controller reconciled, not just the annotation write). Previous code checked `!= BEFORE_RV`, which passed after just the annotation write.
- Phase 4b: compare non-sensitive values, not just keys — Use JMESPath to extract `{Host, User, Region}` (redact RabbitMqPassword). Catches value drift that key-only comparison misses.
- Fix 9 immediate fix: `--force` flag — `argocd app sync s3-notifier-staging --force --grpc-web` to ensure PostSync hooks rerun even when the app is already in sync. Without `--force`, hooks may be skipped.
- Env-vars job post-update wait: values-driven — Replace the hardcoded `UPDATE_WAIT=30` with `{{ .Values.setupJob.timeouts.lambdaConfigReady }}` to match the pre-update wait and allow tuning for AWS control-plane lag.
- Pod selection: count guard with deterministic selection — Replace `.items[0]` with `{.items[*].metadata.name}` + a `wc -w` count check + `sort | head -1`. Error if 0 pods, warn if >1 (expected during rollout).
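The count-guard pattern is easiest to see with the cluster call stubbed out; in this sketch, `POD_LIST` stands in for the jsonpath output (pod names are invented):

```shell
set -euo pipefail
# Stand-in for: kubectl get pod ... -o jsonpath='{.items[*].metadata.name}'
POD_LIST="ack-s3-controller-abc ack-s3-controller-def"

POD_COUNT=$(echo "$POD_LIST" | wc -w)
if [ "$POD_COUNT" -eq 0 ]; then
  echo "ERROR: no Running pods found"
  exit 1
fi
if [ "$POD_COUNT" -gt 1 ]; then
  echo "WARNING: $POD_COUNT pods found (may be mid-rollout), using newest"
fi
# With --sort-by=.metadata.creationTimestamp upstream, the last name is the newest
POD=$(echo "$POD_LIST" | tr -s '[:space:]' '\n' | tail -1)
echo "selected pod: $POD"
```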
Context
After implementing the initial ACK migration code (Helm chart, cluster-gitops configs, Phase 0 Terraform), three rounds of code review identified 9 original issues plus 3 critical infrastructure gaps. This plan addresses all fixes.
Critical gaps found in review rounds 2-3:
- Fix 0: Pod Identity Webhook — eks.amazonaws.com/role-arn annotations do nothing on GKE without it
- Fix 1b: Plugins AppProject — missing ACK namespace/source repos blocks ACK controller deployment
- Fix 1c: ACK health checks — ArgoCD can't interpret ACK CRD status, risking PostSync deadlock
Repositories touched:
- syrf.pr2321 (monorepo) — Helm chart template fixes
- cluster-gitops — ArgoCD config, plugins, projects, health checks
- camarades-infrastructure — Terraform IAM roles + policy fixes
Fix 0 (NEW, CRITICAL): Install AWS Pod Identity Webhook
Problem: eks.amazonaws.com/role-arn annotations on ServiceAccounts do NOTHING on GKE. On EKS, a mutating admission webhook intercepts these annotations and injects projected token volumes + AWS_ROLE_ARN/AWS_WEB_IDENTITY_TOKEN_FILE env vars. Without this webhook, both ACK controllers and setup jobs have NO AWS credentials.
Current AWS auth uses static IAM credentials (aws-s3, aws-ses secrets with keyId/accessKey) — no OIDC/web-identity in use today.
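For orientation, the annotation the webhook reacts to lives on the ServiceAccount. A minimal sketch; the SA name and role ARN here are illustrative placeholders, not the real values:

```yaml
# Hypothetical example — the webhook mutates pods whose SA carries this annotation
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ack-s3-controller            # illustrative name
  namespace: ack-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/example-role
```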
Solution: Install amazon-eks-pod-identity-webhook as a cluster-gitops plugin using the jkroepke Helm chart.
Prerequisites: cert-manager (already installed), GKE projected SA token support (available since GKE 1.20+).
Files to create:
1. cluster-gitops/plugins/helm/pod-identity-webhook/config.yaml
plugin:
name: pod-identity-webhook
repoURL: https://jkroepke.github.io/helm-charts
chart: amazon-eks-pod-identity-webhook
version: "2.6.0"
namespace: pod-identity-webhook
2. cluster-gitops/plugins/helm/pod-identity-webhook/values.yaml
# AWS Pod Identity Webhook for GKE -> AWS OIDC federation
# Mutates pods with eks.amazonaws.com/role-arn annotation to inject:
# - AWS_ROLE_ARN env var
# - AWS_WEB_IDENTITY_TOKEN_FILE env var
# - Projected service account token volume (audience: sts.amazonaws.com)
#
# No namespaceSelector needed for initial deploy (failurePolicy: Ignore).
# Mutation only triggers on pods whose SA has the eks.amazonaws.com/role-arn annotation.
# When switching to failurePolicy: Fail later, ADD a namespaceSelector to limit blast radius
# (see "Fail-closed gate" section below).
# TLS via cert-manager (already installed as cluster plugin)
# Chart default is true, but explicit for clarity
pki:
certManager:
enabled: true
# Token audience for AWS STS (chart key: tokenAudience, default: sts.amazonaws.com)
# Explicit for documentation even though it matches the default
config:
tokenAudience: sts.amazonaws.com
# Webhook failure policy
# Ignore = fail-open: pods start without AWS creds if webhook is down (safe for initial rollout)
# Fail = fail-closed: pods blocked if webhook is down (safer once webhook is stable)
# Start with Ignore, switch to Fail after confirming webhook stability
mutatingWebhook:
failurePolicy: Ignore
# namespaceSelector: <-- NOT needed while failurePolicy is Ignore.
# Add when switching to Fail (see "Fail-closed gate" section below).
resources:
requests:
memory: "32Mi"
cpu: "25m"
limits:
memory: "64Mi"
cpu: "50m"
Ordering note: The webhook must be running before ACK controller pods start, otherwise they won't get AWS credentials injected. selfHeal does NOT recreate existing pods — it only reconciles K8s manifests against git. After confirming the webhook is healthy, you must explicitly restart ACK controller deployments:
# After webhook pod is Running AND Ready (both conditions required):
kubectl wait --for=condition=Ready pod -n pod-identity-webhook \
-l app.kubernetes.io/name=amazon-eks-pod-identity-webhook --timeout=60s
# Then restart ACK controllers:
kubectl rollout restart deployment -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller
kubectl rollout restart deployment -n ack-system -l app.kubernetes.io/instance=ack-s3-controller
# New pods will be mutated by the webhook with projected token volumes
If restart happened before webhook was ready (pods have no AWS env vars):
- Confirm webhook is Ready: `kubectl wait --for=condition=Ready pod -n pod-identity-webhook -l app.kubernetes.io/name=amazon-eks-pod-identity-webhook --timeout=120s`
- Re-do the rollout restart: `kubectl rollout restart deployment -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller`
- Wait for completion: `kubectl rollout status deployment -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller --timeout=180s`
- Re-run the verification checks below — this time the new pods should have projected volumes and AWS env vars
Verification (after webhook + ACK controller restart)
set -euo pipefail
# This block runs standalone (before Prerequisites/Phases).
# Define expected ARN inline here; Phases use $ACK_CONTROLLER_ROLE_ARN from Prerequisites.
EXPECTED_ACCOUNT="318789018510"
EXPECTED_CONTROLLER_ROLE="arn:aws:iam::${EXPECTED_ACCOUNT}:role/syrf-ack-controllers"
EXPECTED_TOKEN_PATH="/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
# 0. Hard-gate: verify kubectl context targets the expected cluster.
# Only checks kubectl context, not AWS — this block doesn't call AWS APIs.
# Inline here for standalone use; Phases use $EXPECTED_CONTEXT from Prerequisites.
EXPECTED_CONTEXT="${EXPECTED_CONTEXT:-gke_camarades-net_europe-west2-a_camaradesuk}"
CURRENT_CONTEXT=$(kubectl config current-context 2>/dev/null || echo "NONE")
echo "kubectl context: $CURRENT_CONTEXT (expected: $EXPECTED_CONTEXT)"
if [ "$CURRENT_CONTEXT" != "$EXPECTED_CONTEXT" ]; then
echo "ERROR: kubectl context mismatch — aborting to prevent operating on the wrong cluster."
echo "Run: kubectl config use-context $EXPECTED_CONTEXT"
exit 1
fi
# 1. Verify webhook pod is running and ready (hard gate, not informational)
kubectl wait --for=condition=Ready pod \
-n pod-identity-webhook \
-l app.kubernetes.io/name=amazon-eks-pod-identity-webhook \
--timeout=120s
# 2. Verify mutation on BOTH ACK controller pods
for CONTROLLER in ack-lambda-controller ack-s3-controller; do
echo "--- Checking $CONTROLLER ---"
# Capture a Running pod with count guard (deterministic: newest by creation timestamp)
POD_LIST=$(kubectl get pod -n ack-system -l app.kubernetes.io/instance="$CONTROLLER" \
--field-selector=status.phase=Running --sort-by=.metadata.creationTimestamp \
-o jsonpath='{.items[*].metadata.name}')
POD_COUNT=$(echo "$POD_LIST" | wc -w)
if [ "$POD_COUNT" -eq 0 ] || [ -z "$POD_LIST" ]; then
echo "ERROR: No Running pods found for $CONTROLLER"
echo "All pods:"
kubectl get pods -n ack-system -l app.kubernetes.io/instance="$CONTROLLER"
exit 1
fi
if [ "$POD_COUNT" -gt 1 ]; then
echo "WARNING: $POD_COUNT Running pods found for $CONTROLLER (may be mid-rollout), using newest"
fi
POD=$(echo "$POD_LIST" | tr -s '[:space:]' '\n' | tail -1) # Last = newest by creationTimestamp
echo "Checking pod: $POD"
# Verify webhook mutation: projected token volume mount + AWS env vars (combined assertion)
# Use containers[*] (not containers[0]) — sidecar injection can shift container order.
# With containers[*], jsonpath may concatenate values from multiple containers
# (space-separated); the exact-token checks further down handle that.
TOKEN_MOUNT=$(kubectl get pod "$POD" -n ack-system \
-o jsonpath='{.spec.containers[*].volumeMounts[?(@.mountPath=="/var/run/secrets/eks.amazonaws.com/serviceaccount")].mountPath}')
ROLE_ARN=$(kubectl get pod "$POD" -n ack-system \
-o jsonpath='{.spec.containers[*].env[?(@.name=="AWS_ROLE_ARN")].value}')
TOKEN_FILE=$(kubectl get pod "$POD" -n ack-system \
-o jsonpath='{.spec.containers[*].env[?(@.name=="AWS_WEB_IDENTITY_TOKEN_FILE")].value}')
echo " Token mount: ${TOKEN_MOUNT:-MISSING}"
echo " AWS_ROLE_ARN: ${ROLE_ARN:-MISSING} (expected: $EXPECTED_CONTROLLER_ROLE)"
echo " AWS_WEB_IDENTITY_TOKEN_FILE: ${TOKEN_FILE:-MISSING} (expected: $EXPECTED_TOKEN_PATH)"
# Pass condition: all three must be present
if [ -z "$TOKEN_MOUNT" ] || [ -z "$ROLE_ARN" ] || [ -z "$TOKEN_FILE" ]; then
echo "FAIL: Webhook mutation incomplete for $CONTROLLER pod $POD"
exit 1
fi
# Exact-token check — containers[*] may return whitespace-separated values from multiple containers.
# Split on all whitespace (tr -s '[:space:]') and grep -qxF for exact line match.
if ! echo "$ROLE_ARN" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED_CONTROLLER_ROLE"; then
echo "FAIL: AWS_ROLE_ARN missing exact expected token '$EXPECTED_CONTROLLER_ROLE', got: $ROLE_ARN"
exit 1
fi
if ! echo "$TOKEN_FILE" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED_TOKEN_PATH"; then
echo "FAIL: AWS_WEB_IDENTITY_TOKEN_FILE missing exact expected token '$EXPECTED_TOKEN_PATH', got: $TOKEN_FILE"
exit 1
fi
done
# If any check fails: webhook isn't mutating. Check:
# - Did rollout restart happen AFTER webhook was Running?
# - Does the ACK controller SA have the eks.amazonaws.com/role-arn annotation?
# - kubectl logs -n pod-identity-webhook -l app.kubernetes.io/name=amazon-eks-pod-identity-webhook
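The second troubleshooting item (SA annotation) can also be checked mechanically. A minimal sketch — the kubectl invocation and ServiceAccount names are assumptions (adjust to the chart's actual SA names); the helper is exact-match, in the same spirit as the grep -qxF checks above:

```shell
# check_role_annot: exact-match an IRSA annotation value against the expected
# role ARN; empty counts as missing. Pure helper, so it is testable offline.
check_role_annot() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}
EXPECTED_CONTROLLER_ROLE="arn:aws:iam::318789018510:role/syrf-ack-controllers"
# Live usage (SA name assumed to match the controller instance name):
# ANNOT=$(kubectl get sa ack-lambda-controller -n ack-system \
#   -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')
# check_role_annot "$ANNOT" "$EXPECTED_CONTROLLER_ROLE" \
#   && echo "OK: annotation matches" || echo "FAIL: annotation missing/mismatched"
check_role_annot "$EXPECTED_CONTROLLER_ROLE" "$EXPECTED_CONTROLLER_ROLE" && echo "helper self-check: OK"
```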
Pre-merge verification (chart key validation)¶
Before deploying the webhook, verify the chart values schema matches the actual chart. The values keys below were verified against chart values.yaml v2.6.0 on 2026-02-07, but should be re-checked before merge:
set -euo pipefail
# Tool gate — yq v4+ is required for path-aware key checks below.
command -v yq >/dev/null 2>&1 || { echo "ERROR: yq not found. Install yq v4+: https://github.com/mikefarah/yq"; exit 1; }
YQ_MAJOR=$(yq --version 2>&1 | awk 'match($0, /[0-9]+/) {print substr($0, RSTART, RLENGTH); exit}' || true)
if [ "${YQ_MAJOR:-0}" -lt 4 ]; then
echo "ERROR: yq v4+ required (found v${YQ_MAJOR:-unknown}). yq -e and path queries require v4 syntax."
exit 1
fi
# Pull the chart from the SAME repo used in deployment (config.yaml repoURL)
# to avoid "validated one artifact, deployed another" risk.
helm repo add jkroepke https://jkroepke.github.io/helm-charts 2>/dev/null || true
helm repo update jkroepke
# Capture full output into variable — all subsequent parsing uses here-strings,
# so no pipeline SIGPIPE is possible under set -euo pipefail.
CHART_VALUES=$(helm show values jkroepke/amazon-eks-pod-identity-webhook --version 2.6.0)
echo "First 60 lines of chart values:"
sed -n '1,60p' <<<"$CHART_VALUES"
# Hard-assert required LEAF keys exist via yq path checks (catches chart schema changes before deploy).
# Leaf-level paths (e.g. pki.certManager.enabled, not pki.certManager) ensure the specific keys
# our values.yaml overrides actually exist in the chart schema.
MISSING_KEYS=""
yq -e '.pki.certManager.enabled' <<<"$CHART_VALUES" >/dev/null 2>&1 || MISSING_KEYS="$MISSING_KEYS pki.certManager.enabled"
yq -e '.config.tokenAudience' <<<"$CHART_VALUES" >/dev/null 2>&1 || MISSING_KEYS="$MISSING_KEYS config.tokenAudience"
yq -e '.mutatingWebhook.failurePolicy' <<<"$CHART_VALUES" >/dev/null 2>&1 || MISSING_KEYS="$MISSING_KEYS mutatingWebhook.failurePolicy"
if [ -n "$MISSING_KEYS" ]; then
echo "ERROR: Required chart keys not found:$MISSING_KEYS"
echo "Chart schema may have changed — update values.yaml before deploying."
exit 1
fi
echo "All required chart keys verified."
Fail-closed gate: switching failurePolicy from Ignore to Fail¶
The webhook starts with failurePolicy: Ignore (fail-open) for safe initial rollout. This means pods start without AWS credentials if the webhook is down — they fail later at AWS API call time, not at pod admission.
Switch to Fail (fail-closed) when ALL of these acceptance criteria are met:
- Webhook pod has been Running continuously for 48+ hours without restarts
- At least one ACK controller rollout restart has been verified with mutation working (env vars + projected volume present)
- At least one full ArgoCD sync cycle of s3-notifier has completed successfully (PostSync jobs passed)
- Webhook pod resource usage is stable (not approaching limits)
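The first criterion (48h continuous Running, zero restarts) can be gated mechanically. A sketch, with the decision logic as a pure function; the commented kubectl lines that would feed it assume a single webhook container and are illustrative only:

```shell
# uptime_gate: pass only if restart count is 0 AND the pod has been up >= 48h.
# Args: <restart count> <pod start epoch seconds> <current epoch seconds>
uptime_gate() {
  restarts="$1"; start_epoch="$2"; now_epoch="$3"
  min_uptime=$((48 * 3600))
  [ "$restarts" -eq 0 ] && [ $((now_epoch - start_epoch)) -ge "$min_uptime" ]
}
# Feeding it from the cluster (single-container assumption):
# RESTARTS=$(kubectl get pod -n pod-identity-webhook \
#   -l app.kubernetes.io/name=amazon-eks-pod-identity-webhook \
#   -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}')
# START=$(date -d "$(kubectl get pod -n pod-identity-webhook \
#   -l app.kubernetes.io/name=amazon-eks-pod-identity-webhook \
#   -o jsonpath='{.items[0].status.startTime}')" +%s)   # GNU date
# uptime_gate "$RESTARTS" "$START" "$(date +%s)" && echo "uptime gate: PASS"
uptime_gate 0 "$(( $(date +%s) - 49 * 3600 ))" "$(date +%s)" && echo "example: PASS"
```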
How to switch:
# cluster-gitops/plugins/helm/pod-identity-webhook/values.yaml
mutatingWebhook:
failurePolicy: Fail # Changed from Ignore after acceptance criteria met
# IMPORTANT: Scope the webhook to only namespaces that need AWS credentials.
# Without scoping, failurePolicy: Fail blocks ALL pod creation cluster-wide
# if the webhook is unavailable (e.g. during upgrades, OOM, node drain).
namespaceSelector:
matchExpressions:
- key: syrf.io/aws-identity
operator: In
values: ["enabled"]
# Label the target namespaces (--overwrite handles wrong-value correction):
# kubectl label ns ack-system syrf.io/aws-identity=enabled --overwrite
# kubectl label ns syrf-staging syrf.io/aws-identity=enabled --overwrite
# kubectl label ns syrf-production syrf.io/aws-identity=enabled --overwrite
Pre-check: verify namespace labels exist BEFORE switching to Fail:
set -euo pipefail
REQUIRED_LABEL="syrf.io/aws-identity"
for NS in ack-system syrf-staging syrf-production; do
LABEL_VAL=$(kubectl get ns "$NS" -o jsonpath="{.metadata.labels['syrf\.io/aws-identity']}" 2>/dev/null || echo "")
if [ "$LABEL_VAL" != "enabled" ]; then
echo "ERROR: Namespace '$NS' missing label $REQUIRED_LABEL=enabled"
echo "Run: kubectl label ns $NS $REQUIRED_LABEL=enabled --overwrite"
exit 1
fi
echo "OK: $NS has $REQUIRED_LABEL=enabled"
done
echo "All namespaces labelled. Safe to switch failurePolicy to Fail."
Risk of NOT switching: If the webhook goes down silently, new pods start without AWS creds and fail at runtime with opaque errors (STS auth failures, missing env vars). Fail policy makes the problem immediately visible at pod admission time.
Risk of switching WITHOUT scoping: If the webhook is unavailable (crash, upgrade, node drain), Fail blocks ALL pod creation cluster-wide — including kube-system, monitoring, and ArgoCD itself. The namespaceSelector above limits blast radius to only namespaces that actually need AWS identity injection.
Operational note — new namespaces: When failurePolicy: Fail with namespaceSelector is active, any new namespace that needs AWS credential injection (e.g. syrf-pr-123 for PR previews, or a future syrf-production split) must be labeled syrf.io/aws-identity=enabled before deploying pods that depend on IRSA. Without the label, the webhook won't match the namespace, credentials won't be injected, and pods will fail at AWS API call time with opaque STS errors — the same failure mode as not having the webhook at all. Add the label as part of namespace creation (e.g. in the PR preview workflow or namespace manifest).
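As a sketch, the label can ship in the namespace manifest itself so injection works from the first pod (namespace name taken from the PR-preview example above; adjust to the actual workflow):

```yaml
# Hypothetical PR-preview namespace manifest — label included at creation time
# so the webhook's namespaceSelector matches before any IRSA-dependent pod starts.
apiVersion: v1
kind: Namespace
metadata:
  name: syrf-pr-123
  labels:
    syrf.io/aws-identity: "enabled"
```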
Rollback (if Fail blocks pod admission):
Option 1 — Immediate (manual, overwritten on next ArgoCD sync):
set -euo pipefail
# Find the webhook configuration name (enforce single match)
WEBHOOK_CONFIGS=$(kubectl get mutatingwebhookconfigurations -o name | { grep pod-identity || true; })
CONFIG_COUNT=$(echo "$WEBHOOK_CONFIGS" | grep -c . || true)
if [ -z "$WEBHOOK_CONFIGS" ]; then
echo "ERROR: No MutatingWebhookConfiguration matching 'pod-identity' found."
kubectl get mutatingwebhookconfigurations -o name
exit 1
fi
if [ "$CONFIG_COUNT" -gt 1 ]; then
echo "ERROR: Multiple MutatingWebhookConfigurations match 'pod-identity':"
echo "$WEBHOOK_CONFIGS"
exit 1
fi
WEBHOOK_CONFIG="$WEBHOOK_CONFIGS"
# Get the webhook entry name by matching the pod-identity pattern.
# Enforce exactly one match — if grep returns multiple lines, abort.
WEBHOOK_ENTRIES=$(kubectl get "$WEBHOOK_CONFIG" \
-o jsonpath='{.webhooks[*].name}' | tr -s '[:space:]' '\n' | { grep pod-identity || true; })
MATCH_COUNT=$(echo "$WEBHOOK_ENTRIES" | grep -c . || true)
if [ -z "$WEBHOOK_ENTRIES" ]; then
echo "ERROR: No webhook entry matching 'pod-identity'. All entries:"
kubectl get "$WEBHOOK_CONFIG" -o jsonpath='{.webhooks[*].name}' | tr -s '[:space:]' '\n'
exit 1
fi
if [ "$MATCH_COUNT" -gt 1 ]; then
echo "ERROR: Multiple webhook entries match 'pod-identity' — ambiguous:"
echo "$WEBHOOK_ENTRIES"
echo "Specify the exact webhook name in the patch command instead."
exit 1
fi
WEBHOOK_ENTRY="$WEBHOOK_ENTRIES"
# Patch failurePolicy back to Ignore using strategic merge (safe regardless of webhook order)
kubectl patch "$WEBHOOK_CONFIG" --type='strategic' \
-p="{\"webhooks\":[{\"name\":\"${WEBHOOK_ENTRY}\",\"failurePolicy\":\"Ignore\"}]}"
# Pods can schedule again immediately — ArgoCD will overwrite this on next sync (~5 min)
Option 2 — Durable (via GitOps):
# cluster-gitops/plugins/helm/pod-identity-webhook/values.yaml
mutatingWebhook:
failurePolicy: Ignore # Reverted — investigate webhook health before re-enabling Fail
Commit, push. ArgoCD auto-syncs within ~5 minutes.
Alternative considered:¶
GKE Workload Identity could map GCP SAs to AWS roles, but that adds GCP-to-AWS indirection. The pod identity webhook is the standard pattern for cross-cloud AWS auth and matches our Terraform OIDC provider setup.
Fix 1a (HIGH): ArgoCD SyRF project whitelists missing ACK CRDs¶
ArgoCD projects only whitelist standard K8s resource types. ACK CRDs will be rejected.
Files:
- cluster-gitops/argocd/projects/syrf-staging.yaml — add to namespaceResourceWhitelist
- cluster-gitops/argocd/projects/syrf-production.yaml — same additions
Add these entries to namespaceResourceWhitelist in both files (after existing entries, before namespaceResourceBlacklist):
# ACK CRDs for S3 Notifier (Lambda + S3 bucket management)
- group: 'lambda.services.k8s.aws'
kind: Function
- group: 's3.services.k8s.aws'
kind: Bucket
- group: 'services.k8s.aws'
kind: AdoptedResource
Note: ExternalSecret is not added here — no service in this change creates ExternalSecret resources in syrf namespaces (the chart's own external-secret.yaml is being deleted in Fix 5, and the ClusterExternalSecret is managed by the plugins project). Add it only when a syrf-project chart actually needs to create ExternalSecret resources.
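A post-merge spot check for the whitelist can reuse the exact-match pattern from elsewhere in this plan. Sketch below — the commented kubectl invocation is an assumption (AppProject name/namespace may differ); has_entry itself is a pure stdin helper:

```shell
# has_entry: exact-line match of a "group kind" pair on stdin (no substring
# false-positives), so it can be fed from kubectl/argocd output or a sample.
has_entry() {
  grep -qxF "$1"
}
REQUIRED="lambda.services.k8s.aws Function
s3.services.k8s.aws Bucket
services.k8s.aws AdoptedResource"
# Live usage:
# ENTRIES=$(kubectl get appproject syrf-staging -n argocd \
#   -o jsonpath='{range .spec.namespaceResourceWhitelist[*]}{.group} {.kind}{"\n"}{end}')
ENTRIES="$REQUIRED"   # stand-in sample; replace with the kubectl capture above
while IFS= read -r PAIR; do
  if has_entry "$PAIR" <<<"$ENTRIES"; then echo "OK: $PAIR"; else echo "MISSING: $PAIR"; fi
done <<<"$REQUIRED"
```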
Fix 1b (NEW, CRITICAL): Plugins AppProject — missing ACK destinations + sources¶
Problem: cluster-gitops/argocd/projects/plugins.yaml is missing:
- ack-system namespace in destinations — ACK controllers deploy there
- pod-identity-webhook namespace in destinations
- oci://public.ecr.aws/aws-controllers-k8s in sourceRepos — ACK charts are OCI-based
- https://jkroepke.github.io/helm-charts in sourceRepos — pod identity webhook chart
Without these, ArgoCD silently refuses to sync ACK controller and webhook Applications.
File: cluster-gitops/argocd/projects/plugins.yaml
Add to sourceRepos (after existing entries):
- oci://public.ecr.aws/aws-controllers-k8s # ACK Lambda + S3 controllers
- https://jkroepke.github.io/helm-charts # pod-identity-webhook
OCI sourceRepos format — canonical: oci:// prefix. The ACK plugin configs specify repoURL: oci://public.ecr.aws/aws-controllers-k8s (with oci:// prefix). The ApplicationSet passes this URL through verbatim to generated Applications. ArgoCD matches sourceRepos against the Application's repoURL, so the sourceRepos entry must include the oci:// prefix to match. Verified: ArgoCD docs show OCI sources with oci:// in Application specs.
Fallback if oci:// prefix is rejected (unlikely but documented): strip to public.ecr.aws/aws-controllers-k8s. Verify after merge:
set -euo pipefail
EXPECTED_OCI="oci://public.ecr.aws/aws-controllers-k8s"
# Confirm sourceRepos entry via exact yq match (not substring grep):
PROJ_YAML=$(argocd proj get plugins -o yaml --grpc-web)
if yq -e ".spec.sourceRepos[] | select(. == \"$EXPECTED_OCI\")" <<<"$PROJ_YAML" >/dev/null 2>&1; then
echo "OK: sourceRepos contains exact entry $EXPECTED_OCI"
else
echo "ERROR: sourceRepos does not contain $EXPECTED_OCI"
echo "Current sourceRepos entries:"
yq '.spec.sourceRepos[]' <<<"$PROJ_YAML" 2>/dev/null || echo "(could not parse project YAML)"
exit 1
fi
# Dry-run sync to check permission end-to-end:
argocd app sync ack-lambda-controller --dry-run --grpc-web
echo "OK: dry-run sync succeeded — OCI source permitted."
# If "application repo is not permitted in project", switch to non-prefixed form
Add to destinations (after existing entries):
- namespace: ack-system # ACK controllers
server: https://kubernetes.default.svc
- namespace: pod-identity-webhook # AWS Pod Identity Webhook
server: https://kubernetes.default.svc
Fix 1c (NEW, CRITICAL): ACK health customization in argocd-cm¶
Problem: ArgoCD doesn't know how to interpret ACK CRD status. Without health checks, ACK resources (Function, Bucket) may show as Unknown health, which could prevent PostSync hooks from running — a deadlock.
Solution: Add Lua health checks for ACK CRDs, following the existing pattern in argocd-cm.yaml (which already has checks for ESO and DatabaseLifecycle CRDs).
ACK uses these status conditions:
- ACK.ResourceSynced = True → resource matches AWS state (Healthy)
- ACK.Terminal = True → unrecoverable error (Degraded)
- ACK.Recoverable = True → transient error, controller will retry (Degraded — surfaces the issue instead of hiding behind Progressing)
- Neither → still reconciling (Progressing)
Version note: These condition types are from ACK Lambda Controller v1.5.2 and S3 Controller v1.0.14. If controller versions are upgraded, validate condition names against live CRs: kubectl get function -n syrf-staging -o jsonpath='{.items[0].status.conditions[*].type}'
File: cluster-gitops/argocd/resources/argocd-cm.yaml
Add after the DatabaseLifecycle health check (line 144):
# Custom health checks for ACK (AWS Controllers for Kubernetes) CRDs
# Without these, ACK resources show as Unknown, potentially blocking PostSync hooks
# Precedence: Terminal (unrecoverable) → ResourceSynced (healthy) → Recoverable (transient)
# If both ResourceSynced and Recoverable are True, resource IS synced — treat as Healthy.
# Recoverable only matters when ResourceSynced is NOT True.
resource.customizations.health.lambda.services.k8s.aws_Function: |
hs = {}
if obj.status ~= nil then
if obj.status.conditions ~= nil then
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.Terminal" and condition.status == "True" then
hs.status = "Degraded"
hs.message = condition.message or "Terminal error"
return hs
end
end
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.ResourceSynced" and condition.status == "True" then
hs.status = "Healthy"
hs.message = condition.message or "Resource synced with AWS"
return hs
end
end
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.Recoverable" and condition.status == "True" then
hs.status = "Degraded"
hs.message = condition.message or "Recoverable error (will retry)"
return hs
end
end
end
end
hs.status = "Progressing"
hs.message = "Waiting for Lambda function to sync with AWS"
return hs
# Same precedence as Lambda Function: Terminal → ResourceSynced → Recoverable
resource.customizations.health.s3.services.k8s.aws_Bucket: |
hs = {}
if obj.status ~= nil then
if obj.status.conditions ~= nil then
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.Terminal" and condition.status == "True" then
hs.status = "Degraded"
hs.message = condition.message or "Terminal error"
return hs
end
end
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.ResourceSynced" and condition.status == "True" then
hs.status = "Healthy"
hs.message = condition.message or "Resource synced with AWS"
return hs
end
end
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.Recoverable" and condition.status == "True" then
hs.status = "Degraded"
hs.message = condition.message or "Recoverable error (will retry)"
return hs
end
end
end
end
hs.status = "Progressing"
hs.message = "Waiting for S3 bucket to sync with AWS"
return hs
resource.customizations.health.services.k8s.aws_AdoptedResource: |
hs = {}
if obj.status ~= nil then
if obj.status.conditions ~= nil then
for i, condition in ipairs(obj.status.conditions) do
if condition.type == "ACK.Adopted" and condition.status == "True" then
hs.status = "Healthy"
hs.message = condition.message or "Resource adopted"
return hs
end
if condition.type == "ACK.Terminal" and condition.status == "True" then
hs.status = "Degraded"
hs.message = condition.message or "Adoption failed"
return hs
end
end
end
end
hs.status = "Progressing"
hs.message = "Waiting for resource adoption"
return hs
Fix 2 (HIGH): Lambda K8s metadata.name in camelCase (invalid)¶
functionName helper produces syrfAppUploadS3Notifier — invalid for K8s metadata.name (must be DNS-1123 lowercase). Need separate K8s name vs AWS function name.
File: _helpers.tpl
Add new helper s3-notifier.k8sName:
{{/*
Kubernetes resource name (DNS-1123 compliant, lowercase)
Used for metadata.name on Function CRD and AdoptedResource
*/}}
{{- define "s3-notifier.k8sName" -}}
{{- if eq .Values.environmentName "production" -}}
s3-notifier
{{- else -}}
s3-notifier-{{ .Values.environmentName }}
{{- end -}}
{{- end }}
File: function.yaml
- Line 4: Change metadata.name from {{ include "s3-notifier.functionName" . }} to {{ include "s3-notifier.k8sName" . }}
- Line 15: Keep spec.name as {{ include "s3-notifier.functionName" . }} (AWS name, camelCase OK)
File: adopted-resource.yaml
- Line 6: Change metadata.name from {{ include "s3-notifier.functionName" . }} to {{ include "s3-notifier.k8sName" . }}
- Line 19: Change kubernetes.metadata.name from {{ include "s3-notifier.functionName" . }} to {{ include "s3-notifier.k8sName" . }}
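A quick way to verify the fix is to check rendered names against the rule Fix 2 targets. Sketch — the helm/yq invocation in the comments is an assumption (chart path and flags vary); is_dns1123 implements the strict RFC 1123 label form (some resources also accept the looser dotted subdomain form):

```shell
# is_dns1123: lowercase alphanumerics and '-', must start/end alphanumeric,
# max 63 chars — the strict DNS-1123 label check.
is_dns1123() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?$' && [ "${#1}" -le 63 ]
}
is_dns1123 "s3-notifier-staging" && echo "s3-notifier-staging: valid"
is_dns1123 "syrfAppUploadS3Notifier" || echo "syrfAppUploadS3Notifier: INVALID (camelCase)"
# Against the rendered chart (path and values flags are assumptions):
# helm template src/services/s3-notifier/.chart --set environmentName=staging \
#   | yq -N 'select(.metadata.name != null) | .metadata.name' \
#   | while read -r NAME; do is_dns1123 "$NAME" || echo "INVALID: $NAME"; done
```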
Fix 3 (HIGH): PostSync permission ordering deadlock — with improvements¶
Problem: Bucket CRD (wave 2) sets S3 notification pointing to Lambda. But AWS PutBucketNotificationConfiguration validates that the Lambda has an S3 invoke permission. Permission is currently granted by PostSync Job — which runs AFTER all sync waves. Deadlock: bucket can't sync because permission doesn't exist yet.
Solution: Split setup-job.yaml into two separate jobs:
- permission-job.yaml — Sync hook at wave 2 (runs during sync, after Function at wave 1)
- env-vars-job.yaml — PostSync hook (runs after everything syncs successfully)
- Bump bucket.yaml to wave 3 (after permission job at wave 2)
Ordering after fix:
Wave 1: Function CRD (Lambda created in AWS)
Wave 2: permission-job (Sync hook — adds S3 invoke permission to Lambda)
Wave 3: Bucket CRD (S3 notification now valid because permission exists)
PostSync: env-vars-job (sets Lambda env vars including RabbitMQ password from K8s Secret)
Delete: src/services/s3-notifier/.chart/templates/setup-job.yaml
Create: src/services/s3-notifier/.chart/templates/permission-job.yaml
{{- if .Values.setupJob.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
name: s3-notifier-permission-{{ .Values.environmentName }}
labels:
{{- include "s3-notifier.labels" . | nindent 4 }}
annotations:
argocd.argoproj.io/hook: Sync
argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
argocd.argoproj.io/sync-wave: "2"
spec:
ttlSecondsAfterFinished: 600
backoffLimit: 3
template:
metadata:
labels:
{{- include "s3-notifier.labels" . | nindent 8 }}
spec:
serviceAccountName: {{ .Values.setupJob.serviceAccountName }}
restartPolicy: Never
containers:
- name: permission
image: {{ .Values.setupJob.image }}
resources:
{{- toYaml .Values.setupJob.resources | nindent 12 }}
env:
- name: AWS_DEFAULT_REGION
value: {{ .Values.awsRegion }}
- name: AWS_ACCOUNT_ID
value: "{{ .Values.awsAccountId }}"
- name: FUNCTION_NAME
value: {{ include "s3-notifier.functionName" . }}
- name: BUCKET_NAME
value: {{ .Values.bucket.name }}
- name: ENVIRONMENT_NAME
value: {{ .Values.environmentName }}
command:
- /bin/sh
- -c
- |
set -euo pipefail
echo "=== S3 Notifier Permission Setup ==="
echo "Function: $FUNCTION_NAME"
echo "Bucket: $BUCKET_NAME"
# Preflight: verify AWS credentials are working
echo "Verifying AWS credentials..."
aws sts get-caller-identity --output text --query 'Arn'
# Wait for Lambda to reach Active state (ACK may still be creating it)
echo "Waiting for Lambda to become Active..."
MAX_WAIT={{ .Values.setupJob.timeouts.lambdaActive }}
ELAPSED=0
STATE="unknown"
while [ $ELAPSED -lt $MAX_WAIT ]; do
STATE=$(aws lambda get-function --function-name "$FUNCTION_NAME" \
--query 'Configuration.State' --output text 2>/dev/null || echo "NotFound")
if [ "$STATE" = "Active" ]; then
echo "Lambda is Active."
break
fi
echo "Lambda state: $STATE (waiting...)"
sleep 5
ELAPSED=$((ELAPSED + 5))
done
if [ "$STATE" != "Active" ]; then
echo "ERROR: Lambda did not reach Active state within ${MAX_WAIT}s (state: $STATE)"
exit 1
fi
# Grant S3 -> Lambda invoke permission (idempotent via try/catch)
# Instead of parsing get-policy JSON (formatting-dependent), attempt
# add-permission directly and catch ResourceConflictException.
STATEMENT_ID="s3-invoke-${ENVIRONMENT_NAME}"
SOURCE_ARN="arn:aws:s3:::${BUCKET_NAME}"
echo "Adding S3 invoke permission (statement: $STATEMENT_ID)..."
set +e # Temporarily allow non-zero exit
ADD_RESULT=$(aws lambda add-permission \
--function-name "$FUNCTION_NAME" \
--principal s3.amazonaws.com \
--statement-id "$STATEMENT_ID" \
--action lambda:InvokeFunction \
--source-arn "$SOURCE_ARN" \
--source-account "$AWS_ACCOUNT_ID" \
--output text --query 'Statement' 2>&1)
ADD_RC=$?
set -e
if [ $ADD_RC -eq 0 ]; then
echo "Permission added successfully."
elif echo "$ADD_RESULT" | grep -q "ResourceConflictException"; then
# Existing statement found — verify it has the correct source-arn and source-account.
# If mismatched (e.g. stale from a previous bucket), replace it.
#
# NOTE: We use Python (not jq) for JSON parsing because the job image
# (amazon/aws-cli:2.15.0) does not include jq. Python is bundled internally
# by the AWS CLI — same discovery logic as env-vars-job.
echo "Permission '$STATEMENT_ID' already exists. Verifying..."
# get-policy can fail if the resource policy was deleted between add-permission
# and this call, or on transient API errors. Guard explicitly since set -e is active.
EXISTING_POLICY=$(aws lambda get-policy --function-name "$FUNCTION_NAME" \
--query 'Policy' --output text 2>&1) || {
echo "ERROR: Failed to retrieve existing Lambda policy for verification"
echo " get-policy output: ${EXISTING_POLICY:-(empty)}"
echo " Retrying add-permission may resolve this (transient race)."
exit 1
}
if [ -z "$EXISTING_POLICY" ] || [ "$EXISTING_POLICY" = "None" ]; then
echo "ERROR: get-policy returned empty/None — policy may have been deleted mid-flight"
exit 1
fi
PYTHON=$(command -v python3 2>/dev/null \
|| echo "/usr/local/aws-cli/v2/current/dist/python3")
if ! "$PYTHON" --version >/dev/null 2>&1; then
echo "ERROR: python3 not found — cannot parse existing policy for verification"
echo "Workaround: switch setupJob.image to a custom image with python3 in PATH"
exit 1
fi
EXISTING_SOURCE=$("$PYTHON" -c "
import json, sys
try:
policy = json.loads(sys.argv[1])
except (json.JSONDecodeError, IndexError) as e:
print('ERROR: Failed to parse policy JSON: ' + str(e), file=sys.stderr)
sys.exit(1)
stmt = next((s for s in policy.get('Statement', []) if s.get('Sid') == sys.argv[2]), {})
cond = stmt.get('Condition', {})
arn_like = cond.get('ArnLike', {})
print(arn_like.get('AWS:SourceArn', arn_like.get('aws:SourceArn', '')))
" "$EXISTING_POLICY" "$STATEMENT_ID") || {
echo "ERROR: Failed to parse existing policy — payload may be empty or malformed"
echo "Raw policy: ${EXISTING_POLICY:-(empty)}"
exit 1
}
EXISTING_ACCOUNT=$("$PYTHON" -c "
import json, sys
try:
policy = json.loads(sys.argv[1])
except (json.JSONDecodeError, IndexError) as e:
print('ERROR: Failed to parse policy JSON: ' + str(e), file=sys.stderr)
sys.exit(1)
stmt = next((s for s in policy.get('Statement', []) if s.get('Sid') == sys.argv[2]), {})
cond = stmt.get('Condition', {})
str_eq = cond.get('StringEquals', {})
print(str_eq.get('AWS:SourceAccount', str_eq.get('aws:SourceAccount', '')))
" "$EXISTING_POLICY" "$STATEMENT_ID") || {
echo "ERROR: Failed to parse existing policy — payload may be empty or malformed"
exit 1
}
if [ "$EXISTING_SOURCE" = "$SOURCE_ARN" ] && [ "$EXISTING_ACCOUNT" = "$AWS_ACCOUNT_ID" ]; then
echo "Existing statement matches (source-arn: $SOURCE_ARN, source-account: $AWS_ACCOUNT_ID). OK."
else
echo "WARNING: Existing statement has stale config (source-arn: $EXISTING_SOURCE, source-account: $EXISTING_ACCOUNT)."
echo "Replacing with correct values..."
aws lambda remove-permission --function-name "$FUNCTION_NAME" --statement-id "$STATEMENT_ID"
aws lambda add-permission \
--function-name "$FUNCTION_NAME" \
--principal s3.amazonaws.com \
--statement-id "$STATEMENT_ID" \
--action lambda:InvokeFunction \
--source-arn "$SOURCE_ARN" \
--source-account "$AWS_ACCOUNT_ID" \
--output text --query 'Statement'
echo "Permission replaced successfully."
fi
else
echo "ERROR: add-permission failed: $ADD_RESULT"
exit 1
fi
echo "=== Permission setup complete ==="
{{- end }}
Key improvements over original setup-job:
- Wait-for-active loop: Polls Lambda state until Active (max 120s) before attempting add-permission, preventing race with ACK still creating the function.
- --source-account: Constrains the S3 permission to only allow invocations from our AWS account (prevents confused deputy).
- Preflight sts get-caller-identity: Fails fast if OIDC credentials aren't working.
Create: src/services/s3-notifier/.chart/templates/env-vars-job.yaml
{{- if .Values.setupJob.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
name: s3-notifier-env-vars-{{ .Values.environmentName }}
labels:
{{- include "s3-notifier.labels" . | nindent 4 }}
annotations:
argocd.argoproj.io/hook: PostSync
argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
argocd.argoproj.io/sync-wave: "10"
spec:
ttlSecondsAfterFinished: 600
backoffLimit: 3
template:
metadata:
labels:
{{- include "s3-notifier.labels" . | nindent 8 }}
spec:
serviceAccountName: {{ .Values.setupJob.serviceAccountName }}
restartPolicy: Never
containers:
- name: env-vars
image: {{ .Values.setupJob.image }}
resources:
{{- toYaml .Values.setupJob.resources | nindent 12 }}
env:
- name: AWS_DEFAULT_REGION
value: {{ .Values.awsRegion }}
- name: FUNCTION_NAME
value: {{ include "s3-notifier.functionName" . }}
- name: RABBITMQ_HOST
value: {{ .Values.rabbitMq.host }}
- name: RABBITMQ_USERNAME
value: {{ .Values.rabbitMq.username }}
- name: S3_REGION
value: {{ .Values.awsRegion }}
- name: RABBITMQ_PASSWORD
valueFrom:
secretKeyRef:
name: {{ .Values.rabbitMq.passwordSecretName }}
key: {{ .Values.rabbitMq.passwordSecretKey }}
optional: true # Pod must start even if secret is missing — script pre-check catches it
command:
- /bin/sh
- -c
- |
set -euo pipefail
echo "=== S3 Notifier Env Vars Setup ==="
echo "Function: $FUNCTION_NAME"
# Pre-check: verify rabbit-mq secret is available
# Use ${:-} to avoid set -u trap — with optional: true the var may be unset, not empty
if [ -z "${RABBITMQ_PASSWORD:-}" ]; then
echo "ERROR: RABBITMQ_PASSWORD is empty or unset - rabbit-mq secret may be missing from namespace"
echo "Ensure ClusterExternalSecret 'rabbit-mq' targets this namespace"
exit 1
fi
# Preflight: verify AWS credentials are working
echo "Verifying AWS credentials..."
aws sts get-caller-identity --output text --query 'Arn'
# Wait for Lambda to be ready for configuration updates
echo "Waiting for Lambda to accept configuration updates..."
MAX_WAIT={{ .Values.setupJob.timeouts.lambdaConfigReady }}
ELAPSED=0
LAST_STATUS="unknown"
while [ $ELAPSED -lt $MAX_WAIT ]; do
LAST_STATUS=$(aws lambda get-function-configuration \
--function-name "$FUNCTION_NAME" \
--query 'LastUpdateStatus' --output text 2>/dev/null || echo "Unknown")
# NOTE: --output text renders a null LastUpdateStatus as "None", not "null"
if [ "$LAST_STATUS" = "Successful" ] || [ "$LAST_STATUS" = "None" ]; then
break
fi
echo "LastUpdateStatus: $LAST_STATUS (waiting...)"
sleep 5
ELAPSED=$((ELAPSED + 5))
done
if [ "$LAST_STATUS" != "Successful" ] && [ "$LAST_STATUS" != "None" ]; then
echo "ERROR: Lambda not ready for config updates within ${MAX_WAIT}s (status: $LAST_STATUS)"
exit 1
fi
# NOTE: --environment replaces ALL env vars. This is intentional —
# all 4 vars (RabbitMqHost, RabbitMqUsername, RabbitMqPassword, S3Region)
# are declared here as the canonical source of truth. Any env var not
# listed here will be removed. This matches Terraform's existing behavior.
#
# We build JSON via Python rather than CLI shorthand "Variables={k=v,...}"
# because the RabbitMQ password may contain special characters
# (=, commas, braces, quotes) that break the shorthand parser.
#
# python3 is NOT in PATH in amazon/aws-cli — it's bundled internally.
# Locate it, or fail with an actionable error.
# NOTE: We use a heredoc (<<'PYEOF') instead of python -c "..."
# because indented code inside -c causes IndentationError when
# the script is embedded in a YAML block scalar.
echo "Setting Lambda environment variables..."
PYTHON=$(command -v python3 2>/dev/null \
|| echo "/usr/local/aws-cli/v2/current/dist/python3")
if ! "$PYTHON" --version >/dev/null 2>&1; then
echo "ERROR: python3 not found at PATH or /usr/local/aws-cli/v2/current/dist/python3"
echo "The amazon/aws-cli image bundles Python — check if the path has changed"
echo "Workaround: switch setupJob.image to a custom image with python3 in PATH"
exit 1
fi
ENV_JSON=$("$PYTHON" <<'PYEOF'
import json, os
print(json.dumps({'Variables': {
'RabbitMqHost': os.environ['RABBITMQ_HOST'],
'RabbitMqUsername': os.environ['RABBITMQ_USERNAME'],
'RabbitMqPassword': os.environ['RABBITMQ_PASSWORD'],
'S3Region': os.environ['S3_REGION']
}}))
PYEOF
)
aws lambda update-function-configuration \
--function-name "$FUNCTION_NAME" \
--environment "$ENV_JSON" \
--output text --query 'FunctionName'
# Wait for update to complete — Lambda may be in InProgress state
echo "Waiting for update to complete..."
UPDATE_WAIT={{ .Values.setupJob.timeouts.lambdaConfigReady }}
UPDATE_ELAPSED=0
UPDATE_STATUS="InProgress"
while [ $UPDATE_ELAPSED -lt $UPDATE_WAIT ]; do
UPDATE_STATUS=$(aws lambda get-function-configuration \
--function-name "$FUNCTION_NAME" \
--query 'LastUpdateStatus' --output text 2>/dev/null || echo "Unknown")
if [ "$UPDATE_STATUS" = "Successful" ]; then
break
fi
if [ "$UPDATE_STATUS" = "Failed" ]; then
echo "ERROR: Lambda update failed"
aws lambda get-function-configuration --function-name "$FUNCTION_NAME" \
--query 'LastUpdateStatusReason' --output text
exit 1
fi
sleep 2
UPDATE_ELAPSED=$((UPDATE_ELAPSED + 2))
done
if [ "$UPDATE_STATUS" != "Successful" ]; then
echo "ERROR: Update did not complete within ${UPDATE_WAIT}s (status: $UPDATE_STATUS)"
echo "Check: aws lambda get-function-configuration --function-name $FUNCTION_NAME --query LastUpdateStatusReason"
exit 1
fi
echo "Environment variables set successfully."
echo "=== Env vars setup complete ==="
{{- end }}
File: bucket.yaml
- Line 10: Change sync-wave from "2" to "3" and update comment:
argocd.argoproj.io/sync-wave: "3" # After permission-job (wave 2) — S3 notification needs Lambda permission to exist
Fix 3 improvements (v3)¶
3a. Hard fail on env-vars-job timeout — Added exit 1 after the LastUpdateStatus wait loop if Lambda doesn't become ready. Previously the script would silently proceed past the timeout.
3b. Configurable timeouts via values — Timeouts are now templated from values.yaml:
setupJob:
  # ...existing...
  timeouts:
    lambdaActive: 120        # seconds to wait for Lambda Active state (permission-job)
    lambdaConfigReady: 60    # seconds to wait for LastUpdateStatus (env-vars-job)
3c. rabbit-mq secret pre-check — Added early validation in env-vars-job that RABBITMQ_PASSWORD is non-empty before making AWS calls. Provides actionable error message if ClusterExternalSecret hasn't propagated to the namespace.
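The 3c pre-check can be sketched as a small guard at the top of the env-vars-job script (a sketch; the wrapper function and messages are illustrative, the `rabbit-mq` secret name comes from Fix 5):

```shell
# Guard (sketch): fail fast with an actionable message when the rabbit-mq
# ClusterExternalSecret has not yet propagated RABBITMQ_PASSWORD into the namespace.
check_rabbitmq_password() {
  if [ -z "${RABBITMQ_PASSWORD:-}" ]; then
    echo "ERROR: RABBITMQ_PASSWORD is empty; the rabbit-mq secret may not have propagated yet." >&2
    echo "Check: kubectl get secret rabbit-mq -n <namespace>" >&2
    return 1
  fi
}
# Call check_rabbitmq_password before the first aws CLI invocation.
```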
Fix 4 (HIGH): ack-setup-job ServiceAccount missing + least-privilege role¶
No manifest creates the ack-setup-job ServiceAccount. The permission-job and env-vars-job will fail. Additionally, the original plan reused the syrf-ack-controllers role which has full Lambda CRUD + S3 bucket management — far more than the setup jobs need.
4a. Create dedicated IAM role (Terraform)¶
File: camarades-infrastructure/terraform/lambda/ack-iam.tf
Add a new syrf-ack-setup-job role with minimal permissions:
# Least-privilege IAM role for s3-notifier setup jobs
# Only needs: update Lambda config, manage Lambda permissions, read Lambda state
resource "aws_iam_role" "ack_setup_job" {
  name = "syrf-ack-setup-job"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Federated = aws_iam_openid_connect_provider.gke.arn
        }
        Action = "sts:AssumeRoleWithWebIdentity"
        Condition = {
          StringEquals = {
            "${replace(local.gke_oidc_issuer, "https://", "")}:aud" = "sts.amazonaws.com"
            "${replace(local.gke_oidc_issuer, "https://", "")}:sub" = [
              "system:serviceaccount:syrf-staging:ack-setup-job",
              "system:serviceaccount:syrf-production:ack-setup-job"
            ]
          }
        }
      }
    ]
  })
  tags = {
    Purpose   = "Least-privilege role for s3-notifier setup jobs"
    ManagedBy = "Terraform"
  }
}

resource "aws_iam_role_policy" "ack_setup_job_lambda" {
  name = "LambdaConfigAndPermissions"
  role = aws_iam_role.ack_setup_job.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "LambdaConfigManagement"
        Effect = "Allow"
        Action = [
          "lambda:GetFunction",
          "lambda:GetFunctionConfiguration",
          "lambda:UpdateFunctionConfiguration",
          "lambda:GetPolicy",
          "lambda:AddPermission",
          "lambda:RemovePermission"
        ]
        Resource = [
          "arn:aws:lambda:eu-west-1:${data.aws_caller_identity.current.account_id}:function:syrfAppUploadS3Notifier",
          "arn:aws:lambda:eu-west-1:${data.aws_caller_identity.current.account_id}:function:syrfAppUploadS3Notifier-*"
        ]
      }
    ]
  })
}

output "ack_setup_job_role_arn" {
  description = "IAM role ARN for s3-notifier setup jobs (use in Helm SA annotation)"
  value       = aws_iam_role.ack_setup_job.arn
}
Note the trust policy uses exact SA names (not wildcards) for both aud and sub constraints — syrf-staging:ack-setup-job and syrf-production:ack-setup-job only. If a new environment is added, the trust policy must be updated.
4b+4c (MERGED — resolves v2 contradiction)¶
Previous plan had 4b ("add aud to both statements") then 4c ("remove second statement") — contradictory for implementers.
Merged fix: In aws_iam_role.ack_controllers trust policy, replace the existing two-statement policy with a single statement that has BOTH aud and sub constraints:
File: camarades-infrastructure/terraform/lambda/ack-iam.tf
assume_role_policy = jsonencode({
  Version = "2012-10-17"
  Statement = [
    {
      Effect = "Allow"
      Principal = {
        Federated = aws_iam_openid_connect_provider.gke.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${replace(local.gke_oidc_issuer, "https://", "")}:aud" = "sts.amazonaws.com"
          "${replace(local.gke_oidc_issuer, "https://", "")}:sub" = [
            "system:serviceaccount:ack-system:ack-lambda-controller",
            "system:serviceaccount:ack-system:ack-s3-controller"
          ]
        }
      }
    }
  ]
})
The second statement (setup-job SA trust) is removed entirely — setup jobs use their own syrf-ack-setup-job role now (Fix 4a).
4d. Create ServiceAccount manifest¶
Create: src/services/s3-notifier/.chart/templates/serviceaccount.yaml
{{- if .Values.setupJob.enabled }}
apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ .Values.setupJob.serviceAccountName }}
  labels:
    {{- include "s3-notifier.labels" . | nindent 4 }}
  annotations:
    # Least-privilege IAM role for setup jobs (GKE -> AWS via OIDC federation)
    eks.amazonaws.com/role-arn: "arn:aws:iam::{{ .Values.awsAccountId }}:role/syrf-ack-setup-job"
{{- end }}
4e. Update values.yaml default¶
File: values.yaml
Add setupJob.iamRoleName default:
setupJob:
  enabled: true
  image: amazon/aws-cli:2.15.0
  serviceAccountName: ack-setup-job
  iamRoleName: syrf-ack-setup-job  # Least-privilege role (not syrf-ack-controllers)
  timeouts:
    lambdaActive: 120        # seconds to wait for Lambda Active state
    lambdaConfigReady: 60    # seconds to wait for LastUpdateStatus
Update serviceaccount.yaml to use this value instead of hardcoding the role name:
eks.amazonaws.com/role-arn: "arn:aws:iam::{{ .Values.awsAccountId }}:role/{{ .Values.setupJob.iamRoleName }}"
Fix 5 (HIGH): Secret collision/key mismatch¶
Problem: The chart's external-secret.yaml creates an ExternalSecret for rabbit-mq with key password. But the existing ClusterExternalSecret (cluster-gitops/argocd/local/argocd-secrets/values.yaml:53-62) already creates rabbit-mq in all service namespaces with key rabbitmq-password. Two competing secrets = collision. And the key name doesn't match.
Solution:
1. Delete external-secret.yaml from chart (reuse existing ClusterExternalSecret)
2. Fix passwordSecretKey from password to rabbitmq-password in all values files
Delete: src/services/s3-notifier/.chart/templates/external-secret.yaml
File: values.yaml (chart defaults)
- Line 35: Change passwordSecretKey: password to passwordSecretKey: rabbitmq-password
File: cluster-gitops/syrf/services/s3-notifier/values.yaml
- Line 10: Change passwordSecretKey: password to passwordSecretKey: rabbitmq-password
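A quick sweep catches any values file the rename missed (a sketch; `check_stale_password_key` is illustrative and assumes GNU grep):

```shell
# Fail (non-zero) when any values.yaml under the given root still uses the old
# key name; the offending file:line pairs are printed by grep itself.
check_stale_password_key() {
  if grep -rnE --include='values.yaml' 'passwordSecretKey:[[:space:]]*password[[:space:]]*$' "$1"; then
    echo "ERROR: stale 'passwordSecretKey: password' entries found above" >&2
    return 1
  fi
  echo "OK: no stale passwordSecretKey entries under $1"
}
```

Run once per repo root, e.g. `check_stale_password_key "$GITOPS_ROOT"`.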
Fix 6 (MEDIUM): OIDC issuer URL — /zones/ vs /locations/¶
Terraform is correct. GKE zonal clusters use /zones/ in the OIDC issuer URL (confirmed via gcloud container clusters describe). The technical plan doc incorrectly says /locations/.
No code change needed. Only the technical plan doc needs updating (Fix 10).
Fix 7 (MEDIUM): Staging using production Lambda role¶
Problem: Staging uses syrfS3NotifierProductionLambdaRole whose S3 policy is scoped to arn:aws:s3:::syrfapp-uploads/* (camarades-infrastructure/terraform/lambda/main.tf:57). The staging bucket syrfapp-uploads-staging doesn't match. Widening the production role would weaken environment separation.
Solution: Create a dedicated staging Lambda execution role with S3 policy scoped to syrfapp-uploads-staging/*.
File: camarades-infrastructure/terraform/lambda/main.tf
Add new resources (following the existing production/preview role pattern):
# IAM role for staging Lambda execution
resource "aws_iam_role" "staging_lambda_role" {
  name = "syrfS3NotifierStagingLambdaRole"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })
  tags = {
    Purpose = "Staging Lambda execution role"
  }
}

resource "aws_iam_role_policy_attachment" "staging_lambda_basic" {
  role       = aws_iam_role.staging_lambda_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

resource "aws_iam_role_policy" "staging_lambda_s3" {
  name = "S3ReadAccess"
  role = aws_iam_role.staging_lambda_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject"
          # NOTE: s3:GetObjectMetadata and s3:HeadObject are NOT real IAM actions.
          # AWS authorises HeadObject under s3:GetObject. See:
          # https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-with-s3-actions.html
        ]
        Resource = "arn:aws:s3:::${var.s3_bucket_name}-staging/*"
      }
    ]
  })
}
Also add `iam:PassRole` for the new staging role to `aws_iam_role_policy.ack_lambda` in `camarades-infrastructure/terraform/lambda/ack-iam.tf` (lines 117-119):
Resource = [
  aws_iam_role.production_lambda_role.arn,
  aws_iam_role.preview_lambda_role.arn,
  aws_iam_role.staging_lambda_role.arn
]
File: cluster-gitops/syrf/environments/staging/s3-notifier/values.yaml
Change `executionRoleArn` to reference the new `syrfS3NotifierStagingLambdaRole` role.
Fix 8 (MEDIUM): Runtime inconsistency — NOT A BUG¶
template-local.yaml uses provided.al2023 for local SAM testing only. Production uses dotnet10 correctly. No change needed.
Fix 9 (MEDIUM): Add S3Region to env vars + ACK reconciliation note¶
Problem: The existing Terraform sets S3Region as a Lambda env var. The setup job omits it.
Solution: Already addressed in Fix 3 — the new env-vars-job.yaml includes S3Region=$S3_REGION.
ACK reconciliation risk: Since the Function CRD omits environment.variables, ACK may leave them alone or clear them on reconcile. This is explicitly tested in Phase 4b (reconciliation gate test) during staging deployment.
If Phase 4b FAILS (env vars cleared by reconciliation):
- Immediate fix: Force an ArgoCD sync, which reruns PostSync hooks (including the env-vars-job): `argocd app sync s3-notifier-staging --force --grpc-web`. The `--force` flag is required because without it, a sync on an already-synced app may skip hooks. Fallback if `--force` still doesn't rerun hooks: manually delete the old hook Job so the `BeforeHookCreation` delete policy triggers recreation on next sync: `kubectl delete job --ignore-not-found -n syrf-staging -l argocd.argoproj.io/hook=PostSync,app.kubernetes.io/instance=s3-notifier-staging`, then `argocd app sync s3-notifier-staging --force --grpc-web`. The env-vars-job is a regular Job (PostSync hook), not a CronJob, so `kubectl create job --from=cronjob/...` does not apply here.
- Durable fix: Add non-sensitive vars directly to the `function.yaml` CRD spec:
spec:
  environment:
    variables:
      RabbitMqHost: "{{ .Values.rabbitMq.host }}"
      RabbitMqUsername: "{{ .Values.rabbitMq.username }}"
      S3Region: "{{ .Values.awsRegion }}"
      # RabbitMqPassword still set by env-vars-job (sensitive, from K8s Secret)
This ensures ACK preserves these vars on reconcile.

Important: `update-function-configuration --environment` is a full replacement, not a merge; it replaces the entire `Variables` map. The env-vars-job must therefore still send ALL vars (including the ones now in the CRD spec), or ACK would preserve only the CRD vars while the job overwrites them with a subset. In practice, keep the env-vars-job sending all 4 vars as it does today. The CRD vars act as a safety net for ACK reconciliation, not as a replacement for the job.

No code change now; verify during staging deployment via Phase 4b.
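Because `--environment` is a full replacement, any tooling that wants to update only a subset of vars must first merge with the current configuration. A minimal sketch of that merge (jq assumed; `merge_env` is illustrative, and the env-vars-job avoids the problem by always sending all four vars):

```shell
# Merge new/changed vars into an existing {"Variables": {...}} payload before
# passing it to `aws lambda update-function-configuration --environment`.
merge_env() {
  # $1 = existing environment JSON, $2 = JSON object of vars to add or override
  jq -c --argjson new "$2" '.Variables += $new' <<<"$1"
}
```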
Fix 10: Sync technical plan doc¶
File: technical-plan.md
Updates needed:
- Phase 0 OIDC URL (line 100): Change `/locations/` to `/zones/` with note that zonal clusters use `/zones/`
- Phase 0 IAM: Document new `syrf-ack-setup-job` role (least-privilege for setup jobs) and `aud` constraint on trust policies
- Phase 0 prerequisites: Add pod identity webhook as a prerequisite (Fix 0)
- Phase 2 chart structure (line 198): Replace `external-secret.yaml` with `serviceaccount.yaml`, `permission-job.yaml`, `env-vars-job.yaml`
- Phase 2 setup-job description (lines 259-286): Update to describe split into permission-job + env-vars-job with correct sync-wave ordering, wait-for-active loop, `--source-account`, preflight checks, configurable timeouts, rabbit-mq pre-check
- Phase 2 external-secret description (lines 290-293): Remove — no longer creating ExternalSecret (ClusterExternalSecret handles it)
- Phase 2 values.yaml (line 328): Fix `passwordSecretKey: password` to `passwordSecretKey: rabbitmq-password`, add `iamRoleName`, add `timeouts`
- Phase 3 staging role (line 391): Use `syrfS3NotifierStagingLambdaRole` instead of production role
- Phase 3 risk about env var reconciliation (lines 421-423): Add note about `S3Region` now being included, document intentional overwrite behavior
- Bucket sync-wave in Phase 2 examples: Update from wave 2 to wave 3, add permission-job at wave 2
- ArgoCD prerequisites: Document plugins AppProject updates (Fix 1b) and ACK health checks (Fix 1c)
- Updated date in frontmatter
Summary of all file changes¶
| # | File | Repo | Action |
|---|---|---|---|
| 1 | `plugins/helm/pod-identity-webhook/config.yaml` | cluster-gitops | Create (Fix 0) |
| 2 | `plugins/helm/pod-identity-webhook/values.yaml` | cluster-gitops | Create (Fix 0) |
| 3 | `argocd/projects/plugins.yaml` | cluster-gitops | Modify (Fix 1b: add ACK + webhook sources/destinations) |
| 4 | `argocd/projects/syrf-staging.yaml` | cluster-gitops | Modify (Fix 1a: add CRD whitelist) |
| 5 | `argocd/projects/syrf-production.yaml` | cluster-gitops | Modify (Fix 1a: add CRD whitelist) |
| 6 | `argocd/resources/argocd-cm.yaml` | cluster-gitops | Modify (Fix 1c: add ACK health checks) |
| 7 | `syrf/services/s3-notifier/values.yaml` | cluster-gitops | Modify (Fix 5: passwordSecretKey) |
| 8 | `syrf/environments/staging/s3-notifier/values.yaml` | cluster-gitops | Modify (Fix 7: executionRoleArn) |
| 9 | `.chart/templates/_helpers.tpl` | monorepo | Modify (Fix 2: add k8sName helper) |
| 10 | `.chart/templates/function.yaml` | monorepo | Modify (Fix 2: metadata.name) |
| 11 | `.chart/templates/adopted-resource.yaml` | monorepo | Modify (Fix 2: metadata.name) |
| 12 | `.chart/templates/setup-job.yaml` | monorepo | Delete (Fix 3) |
| 13 | `.chart/templates/permission-job.yaml` | monorepo | Create (Fix 3) |
| 14 | `.chart/templates/env-vars-job.yaml` | monorepo | Create (Fix 3: with timeout fail + secret pre-check) |
| 15 | `.chart/templates/bucket.yaml` | monorepo | Modify (Fix 3: wave 2→3) |
| 16 | `.chart/templates/serviceaccount.yaml` | monorepo | Create (Fix 4d) |
| 17 | `.chart/templates/external-secret.yaml` | monorepo | Delete (Fix 5) |
| 18 | `.chart/values.yaml` | monorepo | Modify (Fix 4e + 5: iamRoleName, passwordSecretKey, timeouts) |
| 19 | `terraform/lambda/main.tf` | camarades-infra | Modify (Fix 7: staging Lambda role) |
| 20 | `terraform/lambda/ack-iam.tf` | camarades-infra | Modify (Fix 4a+4b+4c: setup-job role, merge trust policy) |
| 21 | `docs/features/lambda-ack-gitops/technical-plan.md` | monorepo | Modify (Fix 10: sync all corrections) |
Total: 21 file operations (14 modify, 5 create, 2 delete) — 11 monorepo (rows 9-18, 21), 8 cluster-gitops (rows 1-8), 2 camarades-infra (rows 19-20)
Implementation order¶
- Terraform first (Fixes 4a, 4b+4c, 7): IAM roles must exist before K8s resources reference them
- cluster-gitops infra (Fixes 0, 1b, 1c): Webhook + AppProject + health checks — must be deployed before ACK controllers or s3-notifier app
- cluster-gitops service config (Fixes 1a, 5-values, 7-values): Project whitelists + values corrections
- Monorepo Helm chart (Fixes 2, 3, 4d, 4e, 5-delete): All template changes
- Docs (Fix 10): Technical plan sync
Steps 1-3 must be committed/merged to their repos before step 4 can deploy successfully. Step 5 can happen any time.
Cross-repo rollout race gate: Since staging auto-deploys from main (ArgoCD watches chartTag in cluster-gitops/syrf/environments/staging/s3-notifier/config.yaml), merging the monorepo Helm chart (step 4) before cluster-gitops prerequisites are live (steps 2-3) will cause ArgoCD to sync and fail. To prevent this:
- Before merging step 4, verify steps 1-3 are live: pod-identity-webhook Running, ACK controllers restarted with projected volumes, AppProject updated, health checks in argocd-cm.
- If step 4 must merge before prerequisites are ready, pin a specific immutable commit SHA (not `main`) as `chartTag` in `cluster-gitops/syrf/environments/staging/s3-notifier/config.yaml` so ArgoCD stays on the pre-ACK chart version. Using `main` is not safe since it advances with new commits. Update to the new tag only after prerequisites are confirmed live.
- Do NOT use `setupJob.enabled: false` as a gate — it only disables jobs, but `bucket.yaml` still applies and will fail without the permission-job's S3 invoke permission.
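The "pin an immutable SHA" rule above can be checked mechanically before merging step 4 (a sketch; `is_pinned_sha` is illustrative, and treating only full 40-hex strings as immutable is an assumption):

```shell
# Accept only a full 40-character lowercase hex commit SHA as a pinned chartTag;
# branch names ("main") and abbreviated SHAs are rejected as mutable/ambiguous.
is_pinned_sha() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;   # any non-hex character disqualifies
  esac
  [ "${#1}" -eq 40 ]
}
```

Usage: `is_pinned_sha "$(yq -e '.chartTag' "${GITOPS_ROOT}/syrf/environments/staging/s3-notifier/config.yaml")" || echo "chartTag is not pinned"`.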
Verification¶
Prerequisites¶
The verification scripts below require these CLIs installed and authenticated:
- `kubectl` — configured for the GKE cluster (`gcloud container clusters get-credentials camaradesuk --zone europe-west2-a`)
- `helm` — v3.x for `helm template` validation
- `argocd` — logged in (`argocd login argocd.camarades.net --grpc-web`)
- `aws` — configured for account `318789018510`, region `eu-west-1`
- `jq` — for JSON normalization in Phase 4b comparisons
- `yq` — v4+ required; used for pre-merge chart key validation (`yq -e` path checks) and Helm-style deep merge (`ireduce` syntax)
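The pre-merge yq gate mentioned above can be sketched as a testable helper (the version-string shape is assumed from mikefarah/yq's `yq --version` output; python-yq and v3 are rejected):

```shell
# Major-version gate: accept only mikefarah/yq v4+ style version strings.
# Takes the version string as an argument so the check runs without yq installed.
require_yq_v4() {
  case "$1" in
    *" version v"[4-9].*|*" version "[4-9].*) return 0 ;;
    *)
      echo "ERROR: yq v4+ (mikefarah/yq) required; got: ${1:-nothing}" >&2
      return 1 ;;
  esac
}
# In the pre-merge block:
#   command -v yq >/dev/null || { echo "ERROR: yq not installed" >&2; exit 1; }
#   require_yq_v4 "$(yq --version 2>&1)" || exit 1
```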
All phases assume execution from the syrf monorepo root unless explicitly stated. Set these variables at the start:
set -euo pipefail
# Monorepo root (Helm charts, docs)
REPO_ROOT="$(git rev-parse --show-toplevel)" # or: export REPO_ROOT=/path/to/syrf
# cluster-gitops root (ArgoCD apps, env values)
# Autodetect: nested (monorepo/cluster-gitops) or sibling (../cluster-gitops) layout.
# Override: export GITOPS_ROOT=/path/to/cluster-gitops before running.
if [ -z "${GITOPS_ROOT:-}" ]; then
  if [ -d "${REPO_ROOT}/cluster-gitops" ]; then
    GITOPS_ROOT="${REPO_ROOT}/cluster-gitops"
  elif [ -d "${REPO_ROOT}/../cluster-gitops" ]; then
    GITOPS_ROOT="$(cd "${REPO_ROOT}/../cluster-gitops" && pwd)"
  else
    echo "ERROR: cluster-gitops not found as nested or sibling of $REPO_ROOT"
    echo "Export GITOPS_ROOT before running, e.g.: export GITOPS_ROOT=/path/to/cluster-gitops"
    exit 1
  fi
fi
# camarades-infrastructure root (Terraform)
# Same autodetect: nested or sibling layout, with explicit override.
if [ -z "${INFRA_ROOT:-}" ]; then
  if [ -d "${REPO_ROOT}/camarades-infrastructure" ]; then
    INFRA_ROOT="${REPO_ROOT}/camarades-infrastructure"
  elif [ -d "${REPO_ROOT}/../camarades-infrastructure" ]; then
    INFRA_ROOT="$(cd "${REPO_ROOT}/../camarades-infrastructure" && pwd)"
  else
    echo "ERROR: camarades-infrastructure not found as nested or sibling of $REPO_ROOT"
    echo "Export INFRA_ROOT before running, e.g.: export INFRA_ROOT=/path/to/camarades-infrastructure"
    exit 1
  fi
fi
# Fail-fast: verify all roots exist before running any phase
for _dir in "$REPO_ROOT" "$GITOPS_ROOT" "$INFRA_ROOT"; do
  if [ ! -d "$_dir" ]; then
    echo "ERROR: Directory not found: $_dir"
    echo "Check REPO_ROOT, GITOPS_ROOT, and INFRA_ROOT variables above."
    exit 1
  fi
done
# Cluster context — hard gate to prevent operating on the wrong cluster.
# Format: gke_{PROJECT}_{ZONE}_{CLUSTER}. Update if cluster is recreated or moved.
EXPECTED_CONTEXT="gke_camarades-net_europe-west2-a_camaradesuk"
CURRENT_CONTEXT=$(kubectl config current-context 2>/dev/null) || {
  echo "ERROR: kubectl config current-context failed — is kubectl configured?"
  exit 1
}
if [ "$CURRENT_CONTEXT" != "$EXPECTED_CONTEXT" ]; then
  echo "ERROR: kubectl context mismatch! Expected '$EXPECTED_CONTEXT', got '$CURRENT_CONTEXT'"
  echo "Run: kubectl config use-context $EXPECTED_CONTEXT"
  exit 1
fi
echo "kubectl context verified: $CURRENT_CONTEXT"
# AWS account and IAM role ARNs — centralised here to avoid hardcoding in each phase.
# Update these if deploying to a different account or renaming roles.
AWS_ACCOUNT_ID="318789018510"
ACK_CONTROLLER_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/syrf-ack-controllers"
SETUP_JOB_ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/syrf-ack-setup-job"
# Sanity gate: verify AWS CLI is targeting the expected account.
# Prevents cross-account mistakes (e.g. personal creds, wrong profile).
ACTUAL_ACCOUNT=$(aws sts get-caller-identity --query 'Account' --output text 2>/dev/null) || {
  echo "ERROR: aws sts get-caller-identity failed — check AWS credentials"
  exit 1
}
if [ "$ACTUAL_ACCOUNT" != "$AWS_ACCOUNT_ID" ]; then
  echo "ERROR: AWS account mismatch! Expected $AWS_ACCOUNT_ID, got $ACTUAL_ACCOUNT"
  echo "Check AWS_PROFILE or AWS_DEFAULT_PROFILE environment variable."
  exit 1
fi
echo "AWS account verified: $ACTUAL_ACCOUNT"
Phase 1 — Terraform¶
set -euo pipefail
cd "${INFRA_ROOT}/terraform/lambda"
# Pre-check: verify ACK controller SA names match trust policy subjects.
# The trust policy hardcodes exact SA names — if the Helm charts use different
# names (e.g. via nameOverride), AssumeRoleWithWebIdentity will silently fail.
#
# Use `helm template` to render the actual SA name from the chart + our values,
# rather than grepping raw values.yaml (which may not reflect the rendered name).
echo "Verifying ACK controller ServiceAccount names..."
# Capture full helm template output first, then parse — avoids:
# 1. SIGPIPE from head -1 closing the pipe (pipefail sees non-zero exit)
# 2. Lost diagnostics from 2>/dev/null (stderr preserved on failure)
LAMBDA_SA_OUTPUT=$(helm template ack-lambda-controller \
  oci://public.ecr.aws/aws-controllers-k8s/lambda-chart --version 1.5.2 \
  -f "${GITOPS_ROOT}/plugins/helm/ack-lambda-controller/values.yaml" \
  --show-only templates/serviceaccount.yaml 2>&1) || {
  echo "ERROR: helm template failed for ack-lambda-controller:"
  echo "$LAMBDA_SA_OUTPUT"
  exit 1
}
# Parse from captured output — here-string avoids pipeline SIGPIPE under pipefail.
LAMBDA_SA=$(awk '/^[[:space:]]*name:/ {print $2; exit}' <<<"$LAMBDA_SA_OUTPUT")
S3_SA_OUTPUT=$(helm template ack-s3-controller \
  oci://public.ecr.aws/aws-controllers-k8s/s3-chart --version 1.0.14 \
  -f "${GITOPS_ROOT}/plugins/helm/ack-s3-controller/values.yaml" \
  --show-only templates/serviceaccount.yaml 2>&1) || {
  echo "ERROR: helm template failed for ack-s3-controller:"
  echo "$S3_SA_OUTPUT"
  exit 1
}
S3_SA=$(awk '/^[[:space:]]*name:/ {print $2; exit}' <<<"$S3_SA_OUTPUT")
if [ -z "${LAMBDA_SA:-}" ]; then
  echo "ERROR: Could not extract Lambda controller SA name from helm template."
  echo "Check chart version/values: ${GITOPS_ROOT}/plugins/helm/ack-lambda-controller/values.yaml"
  exit 1
fi
if [ -z "${S3_SA:-}" ]; then
  echo "ERROR: Could not extract S3 controller SA name from helm template."
  echo "Check chart version/values: ${GITOPS_ROOT}/plugins/helm/ack-s3-controller/values.yaml"
  exit 1
fi
echo "Lambda controller SA: $LAMBDA_SA"
echo "S3 controller SA: $S3_SA"
# Cross-check against trust policy subjects in ack-iam.tf:
#   system:serviceaccount:ack-system:ack-lambda-controller
#   system:serviceaccount:ack-system:ack-s3-controller
# If either SA name doesn't match the trust policy, STOP and update ack-iam.tf.
if [ "$LAMBDA_SA" != "ack-lambda-controller" ] || [ "$S3_SA" != "ack-s3-controller" ]; then
  echo "WARNING: Rendered SA names do not match trust policy subjects!"
  echo "  Expected: ack-lambda-controller / ack-s3-controller"
  echo "  Got: $LAMBDA_SA / $S3_SA"
  echo "Update ack-iam.tf trust policy before applying."
  exit 1
fi
terraform validate
terraform plan
# Full plan (not -target) — ensures coupled changes (trust policies + iam:PassRole
# updates + new roles) are all captured together. Review the plan output for:
# - New roles: ack_setup_job, staging_lambda_role
# - Modified role: ack_controllers trust policy (single statement with aud + sub)
# - Modified policy: ack_lambda PassRole now includes staging role ARN
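The review bullets above can be partially automated against the machine-readable plan (a sketch; `plan_has_address` is illustrative, jq assumed). Generate the JSON with `terraform plan -out=tfplan && terraform show -json tfplan > plan.json`:

```shell
# Assert a resource address appears in a Terraform JSON plan's resource_changes.
plan_has_address() {
  # $1 = plan JSON file, $2 = resource address (e.g. aws_iam_role.ack_setup_job)
  jq -e --arg addr "$2" '[.resource_changes[].address] | index($addr) != null' "$1" >/dev/null
}
# Usage:
#   plan_has_address plan.json aws_iam_role.ack_setup_job || echo "MISSING: setup-job role"
#   plan_has_address plan.json aws_iam_role.staging_lambda_role || echo "MISSING: staging role"
```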
Phase 2 — cluster-gitops (after merge)¶
set -euo pipefail
# 1. Gate: pod identity webhook must be Running and Ready before restarting controllers.
# kubectl get exits 0 even when pods aren't Ready; kubectl wait is a proper gate.
kubectl wait --for=condition=Ready pod \
  -n pod-identity-webhook -l app.kubernetes.io/name=amazon-eks-pod-identity-webhook \
  --timeout=120s
# 2. Restart ACK controllers so NEW pods get mutated by the webhook
kubectl rollout restart deployment -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller
kubectl rollout restart deployment -n ack-system -l app.kubernetes.io/instance=ack-s3-controller
kubectl rollout status deployment -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller --timeout=180s
kubectl rollout status deployment -n ack-system -l app.kubernetes.io/instance=ack-s3-controller --timeout=180s
# 3. Verify mutation + auth for BOTH ACK controllers (Lambda and S3)
for CONTROLLER in ack-lambda-controller ack-s3-controller; do
  echo "--- Checking $CONTROLLER ---"
  # Capture a Running pod with count guard (deterministic: newest by creation timestamp)
  POD_LIST=$(kubectl get pod -n ack-system -l app.kubernetes.io/instance="$CONTROLLER" \
    --field-selector=status.phase=Running --sort-by=.metadata.creationTimestamp \
    -o jsonpath='{.items[*].metadata.name}')
  POD_COUNT=$(echo "$POD_LIST" | wc -w)
  if [ "$POD_COUNT" -eq 0 ] || [ -z "$POD_LIST" ]; then
    echo "ERROR: No Running pods found for $CONTROLLER"
    echo "All pods:"
    kubectl get pods -n ack-system -l app.kubernetes.io/instance="$CONTROLLER"
    exit 1
  fi
  if [ "$POD_COUNT" -gt 1 ]; then
    echo "WARNING: $POD_COUNT Running pods found for $CONTROLLER (may be mid-rollout), using newest"
  fi
  POD=$(echo "$POD_LIST" | tr -s '[:space:]' '\n' | tail -1)  # Last = newest by creationTimestamp
  echo "Checking pod: $POD"
  # Verify webhook mutation: projected token volume mount + AWS env vars (combined assertion)
  # Use containers[*] (not containers[0]) — sidecar injection can shift container order.
  # With containers[*], jsonpath may concat values from multiple containers (whitespace-separated).
  # Split on all whitespace and match exact token (tr -s '[:space:]' + grep -qxF).
  TOKEN_MOUNT=$(kubectl get pod "$POD" -n ack-system \
    -o jsonpath='{.spec.containers[*].volumeMounts[?(@.mountPath=="/var/run/secrets/eks.amazonaws.com/serviceaccount")].mountPath}')
  ROLE_ARN=$(kubectl get pod "$POD" -n ack-system \
    -o jsonpath='{.spec.containers[*].env[?(@.name=="AWS_ROLE_ARN")].value}')
  TOKEN_FILE=$(kubectl get pod "$POD" -n ack-system \
    -o jsonpath='{.spec.containers[*].env[?(@.name=="AWS_WEB_IDENTITY_TOKEN_FILE")].value}')
  EXPECTED_TOKEN_PATH="/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
  EXPECTED_CONTROLLER_ROLE="$ACK_CONTROLLER_ROLE_ARN"
  echo "  Token mount: ${TOKEN_MOUNT:-MISSING}"
  echo "  AWS_ROLE_ARN: ${ROLE_ARN:-MISSING} (expected: $EXPECTED_CONTROLLER_ROLE)"
  echo "  AWS_WEB_IDENTITY_TOKEN_FILE: ${TOKEN_FILE:-MISSING} (expected: $EXPECTED_TOKEN_PATH)"
  if [ -z "$TOKEN_MOUNT" ] || [ -z "$ROLE_ARN" ] || [ -z "$TOKEN_FILE" ]; then
    echo "FAIL: Webhook mutation incomplete for $CONTROLLER pod $POD"
    exit 1
  fi
  # Exact-token check — see Fix 0 block for rationale (tr + grep -qxF)
  if ! echo "$ROLE_ARN" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED_CONTROLLER_ROLE"; then
    echo "FAIL: AWS_ROLE_ARN missing exact expected token '$EXPECTED_CONTROLLER_ROLE', got: $ROLE_ARN"
    exit 1
  fi
  if ! echo "$TOKEN_FILE" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED_TOKEN_PATH"; then
    echo "FAIL: AWS_WEB_IDENTITY_TOKEN_FILE missing exact expected token '$EXPECTED_TOKEN_PATH', got: $TOKEN_FILE"
    exit 1
  fi
  # Verify auth: capture logs first (fail if retrieval fails), then check for errors
  POD_LOGS=$(kubectl logs -n ack-system "$POD" --since=5m 2>&1) || {
    echo "FAIL: Could not retrieve logs from $POD"
    exit 1
  }
  # Here-string (not echo | grep) per the v3.15 pipe-avoidance convention for multi-line output.
  AUTH_ERRORS=$(grep -iE 'unauthorized|forbidden|NoCredentialProviders|AccessDenied|InvalidIdentityToken|ExpiredToken|WebIdentityErr' <<<"$POD_LOGS" || true)
  if [ -n "$AUTH_ERRORS" ]; then
    echo "FAIL: Auth errors detected in $CONTROLLER logs:"
    echo "$AUTH_ERRORS"
    echo "Fix OIDC trust policy or webhook mutation before proceeding."
    exit 1
  else
    echo "(no auth errors found — good)"
  fi
done
# 4. Gate: ACK controllers must be Healthy in ArgoCD (not Unknown/Progressing)
# `argocd app wait` exits non-zero if health is not reached within timeout.
# `argocd app get` would always exit 0 regardless of health — not a gate.
# 300s allows for image pull, CRD registration, and initial AWS API round-trip.
argocd app wait ack-lambda-controller --health --timeout 300 --grpc-web
argocd app wait ack-s3-controller --health --timeout 300 --grpc-web
# 5. Gate: s3-notifier app must reach Healthy + Synced after deploy
# 600s accommodates sync waves (Function → permission-job → Bucket) + PostSync env-vars-job,
# each involving AWS API calls with potential cold-start latency.
argocd app wait s3-notifier-staging --health --sync --timeout 600 --grpc-web
# Expect: Function and Bucket resources show Healthy (not Unknown/Progressing)
Phase 2b — After monorepo Helm chart deploys (s3-notifier)¶
set -euo pipefail
# 9. Verify ack-setup-job ServiceAccount was created with correct annotation
kubectl get sa ack-setup-job -n syrf-staging -o jsonpath='{.metadata.annotations}'
# Expect: eks.amazonaws.com/role-arn = arn:aws:iam::318789018510:role/syrf-ack-setup-job
# 10. Verify setup-job SA mutation: create a test pod to confirm webhook mutates it
# Use a unique pod name to avoid collisions on repeated runs (timestamp + PID).
SA_TEST_POD="sa-test-$(date +%s)-$$"
# Trap ensures the test pod is cleaned up on ANY exit (timeout, assertion failure, Ctrl-C)
trap "kubectl delete pod $SA_TEST_POD -n syrf-staging --ignore-not-found 2>/dev/null" EXIT
kubectl run "$SA_TEST_POD" --image=amazon/aws-cli:2.15.0 -n syrf-staging --restart=Never \
  --overrides='{"spec":{"serviceAccountName":"ack-setup-job"}}' \
  --command -- sleep 300
# Wait for pod to be Ready before inspecting (avoids race with sleep + immediate inspect)
kubectl wait --for=condition=Ready "pod/$SA_TEST_POD" -n syrf-staging --timeout=60s
# Assert exact env var values and token mount (same rigor as ACK controller checks)
# Use containers[*] (not containers[0]) — sidecar injection can shift container order.
SA_ROLE_ARN=$(kubectl get pod "$SA_TEST_POD" -n syrf-staging \
  -o jsonpath='{.spec.containers[*].env[?(@.name=="AWS_ROLE_ARN")].value}')
SA_TOKEN_FILE=$(kubectl get pod "$SA_TEST_POD" -n syrf-staging \
  -o jsonpath='{.spec.containers[*].env[?(@.name=="AWS_WEB_IDENTITY_TOKEN_FILE")].value}')
SA_TOKEN_MOUNT=$(kubectl get pod "$SA_TEST_POD" -n syrf-staging \
  -o jsonpath='{.spec.containers[*].volumeMounts[?(@.mountPath=="/var/run/secrets/eks.amazonaws.com/serviceaccount")].mountPath}')
EXPECTED_SETUP_ROLE="$SETUP_JOB_ROLE_ARN"
EXPECTED_SA_TOKEN_PATH="/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
echo "sa-test AWS_ROLE_ARN: ${SA_ROLE_ARN:-MISSING} (expected: $EXPECTED_SETUP_ROLE)"
echo "sa-test AWS_WEB_IDENTITY_TOKEN_FILE: ${SA_TOKEN_FILE:-MISSING}"
echo "sa-test Token mount: ${SA_TOKEN_MOUNT:-MISSING}"
# Cleanup is handled by EXIT trap above — no manual kubectl delete needed on failure paths.
if [ -z "$SA_ROLE_ARN" ] || [ -z "$SA_TOKEN_FILE" ] || [ -z "$SA_TOKEN_MOUNT" ]; then
  echo "FAIL: Webhook mutation incomplete for $SA_TEST_POD pod"
  exit 1
fi
# Exact-token check — see Fix 0 block for rationale (tr + grep -qxF)
if ! echo "$SA_ROLE_ARN" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED_SETUP_ROLE"; then
  echo "FAIL: $SA_TEST_POD AWS_ROLE_ARN missing exact expected token '$EXPECTED_SETUP_ROLE', got: $SA_ROLE_ARN"
  exit 1
fi
if ! echo "$SA_TOKEN_FILE" | tr -s '[:space:]' '\n' | grep -qxF "$EXPECTED_SA_TOKEN_PATH"; then
  echo "FAIL: $SA_TEST_POD AWS_WEB_IDENTITY_TOKEN_FILE missing exact expected token '$EXPECTED_SA_TOKEN_PATH', got: $SA_TOKEN_FILE"
  exit 1
fi
echo "$SA_TEST_POD mutation verified."
# Explicit delete (trap also covers this, but explicit for clarity)
trap - EXIT
kubectl delete pod "$SA_TEST_POD" -n syrf-staging --ignore-not-found
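Step 9 above only prints the annotation map for eyeballing. A stricter variant asserts the exact value (a sketch; `role_arn_from_sa_json` is illustrative, jq assumed, `SETUP_JOB_ROLE_ARN` comes from the Prerequisites block):

```shell
# Extract the IRSA role-arn annotation from a ServiceAccount JSON document;
# prints nothing when the annotation is missing, so an equality check fails safely.
role_arn_from_sa_json() {
  jq -r '.metadata.annotations["eks.amazonaws.com/role-arn"] // empty' <<<"$1"
}
# Usage against the live cluster:
#   SA_JSON=$(kubectl get sa ack-setup-job -n syrf-staging -o json)
#   [ "$(role_arn_from_sa_json "$SA_JSON")" = "$SETUP_JOB_ROLE_ARN" ] || exit 1
```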
Phase 3 — Helm chart¶
set -euo pipefail
helm template staging "${REPO_ROOT}/src/services/s3-notifier/.chart/" \
  --set bucket.name=syrfapp-uploads-staging \
  --set lambda.executionRoleArn=arn:aws:iam::318789018510:role/syrfS3NotifierStagingLambdaRole \
  --set lambda.code.s3Key=test.zip \
  --set awsAccountId=318789018510 \
  --set environmentName=staging
Verify in rendered output:
- Function `metadata.name` is lowercase (`s3-notifier-staging`)
- Function `spec.name` is camelCase (`syrfAppUploadS3Notifier-staging`)
- No `external-secret.yaml` in output
- ServiceAccount created with `syrf-ack-setup-job` role annotation (not `syrf-ack-controllers`)
- permission-job is `Sync` hook at wave 2, includes `--source-account`, wait-for-active loop, preflight check
- env-vars-job is `PostSync`, includes `S3Region`, timeout fail, rabbit-mq pre-check
- bucket is at wave 3
- Secret key is `rabbitmq-password` (not `password`)
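The checklist above can be turned into hard assertions on a single captured render (a sketch; `assert_rendered` is illustrative and uses a here-string, matching the document's pipe-avoidance convention):

```shell
# Fail with a labelled message when an expected literal is absent from the
# captured helm output; grep -qF avoids regex surprises in YAML values.
assert_rendered() {
  # $1 = rendered output, $2 = literal that must appear, $3 = label
  grep -qF "$2" <<<"$1" || { echo "FAIL: $3 (missing: $2)" >&2; return 1; }
}
# Usage:
#   RENDERED=$(helm template staging "${REPO_ROOT}/src/services/s3-notifier/.chart/" ...)
#   assert_rendered "$RENDERED" "name: s3-notifier-staging" "Function metadata.name lowercase"
#   assert_rendered "$RENDERED" "role/syrf-ack-setup-job" "setup-job SA role annotation"
```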
Phase 4 — Staging deployment smoke test¶
set -euo pipefail
# After ArgoCD syncs s3-notifier to staging:
# 1. Verify Lambda function exists and is Active
aws lambda get-function --function-name syrfAppUploadS3Notifier-staging --region eu-west-1
# 2. Verify S3 invoke permission exists
aws lambda get-policy --function-name syrfAppUploadS3Notifier-staging --region eu-west-1
# 3. Verify env var KEYS are set (including S3Region) — keys only, no values (avoids leaking secrets)
aws lambda get-function-configuration --function-name syrfAppUploadS3Notifier-staging \
  --query 'sort(keys(Environment.Variables))' --output json --region eu-west-1
# Expect: ["RabbitMqHost", "RabbitMqPassword", "RabbitMqUsername", "S3Region"]
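The keys-only query shape can be reproduced locally with jq, whose `keys` function always returns object keys sorted. The env map below uses illustrative values; the password value is never printed:

```shell
# Illustrative env var map — value of RabbitMqPassword is a dummy and never echoed.
VARS='{"S3Region":"eu-west-1","RabbitMqUsername":"svc","RabbitMqPassword":"x","RabbitMqHost":"mq.internal"}'
# jq `keys` sorts, mirroring the JMESPath sort(keys(Environment.Variables)).
KEYS=$(jq -c 'keys' <<<"$VARS")
echo "$KEYS"
```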
Phase 4b — ACK reconciliation gate (CRITICAL)¶
The Function CRD does not include environment.variables — env vars are set out-of-band by the PostSync env-vars-job. This means ACK reconciliation may clear them if the controller reconciles the Function spec against the CRD. This phase explicitly tests that.
set -euo pipefail
# Environment parameters — defaults to staging. Override for production (see Phase 5 Step 3).
FUNCTION_NAME="${PHASE4B_FUNCTION_NAME:-syrfAppUploadS3Notifier-staging}"
K8S_NAMESPACE="${PHASE4B_NAMESPACE:-syrf-staging}"
CR_NAME="${PHASE4B_CR_NAME:-s3-notifier-staging}"
ENV_VALUES_PATH="${PHASE4B_ENV_VALUES_PATH:-syrf/environments/staging/s3-notifier/values.yaml}"
ARGOCD_APP="${PHASE4B_ARGOCD_APP:-s3-notifier-staging}"
echo "Phase 4b targets: function=$FUNCTION_NAME namespace=$K8S_NAMESPACE cr=$CR_NAME app=$ARGOCD_APP"
# Compare env var KEYS and non-sensitive VALUES to catch both key removal and value drift.
# RabbitMqPassword is redacted — only keys, Host, Username, and S3Region are compared.
# Use JMESPath sort() for deterministic key ordering.
# 1. Record current env var keys + non-sensitive values (baseline)
# Pipe through jq -Sc to normalize whitespace and key ordering for reliable comparison.
BEFORE_KEYS=$(aws lambda get-function-configuration \
--function-name "$FUNCTION_NAME" --region eu-west-1 \
--query 'sort(keys(Environment.Variables))' --output json | jq -Sc '.')
BEFORE_VALS=$(aws lambda get-function-configuration \
--function-name "$FUNCTION_NAME" --region eu-west-1 \
--query '{Host: Environment.Variables.RabbitMqHost, User: Environment.Variables.RabbitMqUsername, Region: Environment.Variables.S3Region}' \
--output json | jq -Sc '.')
echo "Env var keys before reconcile: $BEFORE_KEYS"
echo "Non-sensitive values before: $BEFORE_VALS"
# 2. Record ResourceSynced lastTransitionTime BEFORE annotation.
# We use lastTransitionTime (not resourceVersion) because it's immune to timing
# races — the old two-bump resourceVersion approach could false-fail if ACK
# reconciled before AFTER_ANNOTATION_RV was captured.
BEFORE_LTT=$(kubectl get function "$CR_NAME" -n "$K8S_NAMESPACE" \
-o jsonpath='{.status.conditions[?(@.type=="ACK.ResourceSynced")].lastTransitionTime}')
echo "ResourceSynced lastTransitionTime before: $BEFORE_LTT"
# 3. Force ACK to reconcile by annotating the Function CR
# Capture log-window start BEFORE annotation so --since-time excludes prior reconcile noise.
LOG_SINCE=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
# Use nanoseconds for uniqueness on fast reruns (date +%s repeats within same second).
# macOS/BSD: date +%s%N returns literal "N" suffix (exit 0). Detect and fall back to epoch-PID-counter.
RECONCILE_TS=$(date +%s%N 2>/dev/null)
case "${RECONCILE_TS:-}" in ""|*N)
_RECONCILE_SEQ=$(( ${_RECONCILE_SEQ:-0} + 1 ))
RECONCILE_TS="$(date +%s)-$$-${_RECONCILE_SEQ}"
;; esac
kubectl annotate function "$CR_NAME" -n "$K8S_NAMESPACE" \
"ack.aws.k8s.aws/force-reconcile=${RECONCILE_TS}" --overwrite
# 4. Wait for reconciliation with automated fallback.
# Two signals checked per attempt:
# (a) lastTransitionTime changed — proves controller updated status after our trigger
# (b) ResourceSynced=True — proves reconcile succeeded
#
# Attempt 1: annotation-based trigger (already applied above).
# Attempt 2 (automatic): argocd app sync --force — re-applies all resources,
# forcing the controller to reconcile even if it ignores metadata-only changes.
#
# NOTE: If ResourceSynced stays True throughout (no transition), lastTransitionTime
# may not change (K8s convention: only updates on status transitions). In that case,
# we check controller logs as secondary evidence. The env var validation below is the
# authoritative check — the reconcile gate is a bonus signal, not a hard gate.
RECON_WAIT=60 # Per-attempt timeout — ACK controller + AWS API round-trip can be slow
RECON_CONFIRMED=false
for RECON_ATTEMPT in 1 2; do
if [ "$RECON_ATTEMPT" -eq 2 ]; then
echo "Annotation-based trigger did not produce reconcile evidence. Attempting fallback: argocd app sync --force..."
# Reset baselines for the second attempt
BEFORE_LTT=$(kubectl get function "$CR_NAME" -n "$K8S_NAMESPACE" \
-o jsonpath='{.status.conditions[?(@.type=="ACK.ResourceSynced")].lastTransitionTime}')
LOG_SINCE=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
argocd app sync "$ARGOCD_APP" --force --grpc-web || {
echo "ERROR: argocd app sync --force failed. Check ArgoCD connectivity and app status."
exit 1
}
sleep 5 # Allow sync propagation before polling
fi
echo "Waiting for ACK reconciliation (attempt $RECON_ATTEMPT)..."
RECON_ELAPSED=0
while [ $RECON_ELAPSED -lt $RECON_WAIT ]; do
CURRENT_LTT=$(kubectl get function "$CR_NAME" -n "$K8S_NAMESPACE" \
-o jsonpath='{.status.conditions[?(@.type=="ACK.ResourceSynced")].lastTransitionTime}' 2>/dev/null)
CURRENT_STATUS=$(kubectl get function "$CR_NAME" -n "$K8S_NAMESPACE" \
-o jsonpath='{.status.conditions[?(@.type=="ACK.ResourceSynced")].status}' 2>/dev/null)
if [ "$CURRENT_LTT" != "$BEFORE_LTT" ] && [ "$CURRENT_STATUS" = "True" ]; then
echo "Reconciliation detected (lastTransitionTime: $BEFORE_LTT -> $CURRENT_LTT)."
RECON_CONFIRMED=true
break
fi
sleep 3
RECON_ELAPSED=$((RECON_ELAPSED + 3))
done
[ "$RECON_CONFIRMED" = true ] && break
# Timeout path — check if condition stayed True and look for log evidence
FINAL_STATUS=$(kubectl get function "$CR_NAME" -n "$K8S_NAMESPACE" \
-o jsonpath='{.status.conditions[?(@.type=="ACK.ResourceSynced")].status}' 2>/dev/null)
if [ "$FINAL_STATUS" != "True" ]; then
echo "ERROR: ResourceSynced is NOT True after ${RECON_WAIT}s (attempt $RECON_ATTEMPT)."
[ "$RECON_ATTEMPT" -eq 2 ] && { echo "Both attempts failed."; exit 1; }
continue # Try fallback
fi
echo "NOTE: lastTransitionTime unchanged after ${RECON_WAIT}s — checking controller logs..."
CTRL_LOGS=$(kubectl logs -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller --since-time="$LOG_SINCE" 2>&1) || {
echo "FAIL: Could not retrieve controller logs."
exit 1
}
if grep -qE "${CR_NAME}|${FUNCTION_NAME}|${RECONCILE_TS}" <<<"$CTRL_LOGS"; then
echo "Controller logs confirm reconcile activity (attempt $RECON_ATTEMPT, matched CR/function name or ts=${RECONCILE_TS})."
RECON_CONFIRMED=true
break
fi
echo "No log evidence of reconcile on attempt $RECON_ATTEMPT."
[ "$RECON_ATTEMPT" -eq 2 ] && {
echo "FAIL: No reconcile evidence after both annotation and forced sync."
echo "Verify ACK controller is healthy: kubectl logs -n ack-system -l app.kubernetes.io/instance=ack-lambda-controller --since-time=$LOG_SINCE"
exit 1
}
done
# 5. Check env var keys + non-sensitive values after reconciliation
AFTER_KEYS=$(aws lambda get-function-configuration \
--function-name "$FUNCTION_NAME" --region eu-west-1 \
--query 'sort(keys(Environment.Variables))' --output json | jq -Sc '.')
AFTER_VALS=$(aws lambda get-function-configuration \
--function-name "$FUNCTION_NAME" --region eu-west-1 \
--query '{Host: Environment.Variables.RabbitMqHost, User: Environment.Variables.RabbitMqUsername, Region: Environment.Variables.S3Region}' \
--output json | jq -Sc '.')
echo "Env var keys after reconcile: $AFTER_KEYS"
echo "Non-sensitive values after: $AFTER_VALS"
# 6. Compare keys — env vars MUST still be present
if [ "$BEFORE_KEYS" != "$AFTER_KEYS" ]; then
echo "FAIL: ACK reconciliation modified env var keys!"
echo "Keys before: $BEFORE_KEYS"
echo "Keys after: $AFTER_KEYS"
echo ""
echo "Mitigation: Add non-sensitive vars to the Function CRD spec"
echo " (spec.environment.variables in function.yaml template)."
echo " Sensitive vars (RabbitMqPassword) must still come from env-vars-job."
echo " See Fix 9 for details."
exit 1
fi
# 7. Compare non-sensitive values — catches value drift that key comparison misses
if [ "$BEFORE_VALS" != "$AFTER_VALS" ]; then
echo "FAIL: ACK reconciliation changed non-sensitive env var values!"
echo "Values before: $BEFORE_VALS"
echo "Values after: $AFTER_VALS"
echo ""
echo "Mitigation: Same as key drift — add non-sensitive vars to CRD spec."
exit 1
fi
# 8. Assert expected values (catches case where both before AND after are wrong)
# Source expected values from cluster-gitops Helm values — no hardcoded fallback.
# yq v4+ is a declared prerequisite; hard-fail if missing to avoid silent drift.
EXPECTED_KEYS='["RabbitMqHost","RabbitMqPassword","RabbitMqUsername","S3Region"]'
command -v yq >/dev/null 2>&1 || { echo "ERROR: yq not found. Required for expected-value sourcing."; exit 1; }
YQ_MAJOR=$(yq --version 2>&1 | awk 'match($0, /[0-9]+/) {print substr($0, RSTART, RLENGTH); exit}' || true)
if [ "${YQ_MAJOR:-0}" -lt 4 ]; then
echo "ERROR: yq v4+ required (found v${YQ_MAJOR:-unknown}). ireduce syntax needs v4."
exit 1
fi
SVC_VALS="${GITOPS_ROOT}/syrf/services/s3-notifier/values.yaml"
ENV_VALS="${GITOPS_ROOT}/${ENV_VALUES_PATH}"
# Proper Helm-style merge: env values override service defaults (yq ea with ireduce)
MERGED=$(yq ea '. as $item ireduce ({}; . * $item)' "$SVC_VALS" "$ENV_VALS") || {
echo "ERROR: yq merge failed. Check files exist and are valid YAML:"
echo " $SVC_VALS"
echo " $ENV_VALS"
exit 1
}
EXP_HOST=$(yq -r '.rabbitMq.host // ""' <<<"$MERGED")
EXP_USER=$(yq -r '.rabbitMq.username // ""' <<<"$MERGED")
EXP_REGION=$(yq -r '.awsRegion // ""' <<<"$MERGED")
EXPECTED_VALS=$(jq -n -Sc --arg Host "$EXP_HOST" --arg Region "$EXP_REGION" --arg User "$EXP_USER" \
'{Host: $Host, Region: $Region, User: $User}')
echo "Expected values sourced from cluster-gitops Helm values (merged: service + env)"
if [ "$AFTER_KEYS" != "$EXPECTED_KEYS" ]; then
echo "FAIL: Env var keys don't match expected set!"
echo "Expected: $EXPECTED_KEYS"
echo "Actual: $AFTER_KEYS"
exit 1
fi
if [ "$AFTER_VALS" != "$EXPECTED_VALS" ]; then
echo "FAIL: Non-sensitive env var values don't match expected config!"
echo "Expected: $EXPECTED_VALS"
echo "Actual: $AFTER_VALS"
exit 1
fi
echo "PASS: Env var keys, values, and expected config all verified after ACK reconciliation."
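The before/after string comparisons in this phase hinge on `jq -Sc` canonicalization: sorted keys plus compact output make two semantically equal objects byte-identical. A standalone check with illustrative values:

```shell
# Same object, different key order — jq -Sc normalizes both to one canonical string.
A=$(jq -Sc '.' <<<'{"User":"svc","Host":"mq.internal","Region":"eu-west-1"}')
B=$(jq -Sc '.' <<<'{"Host":"mq.internal","Region":"eu-west-1","User":"svc"}')
if [ "$A" = "$B" ]; then
  SAME=yes
else
  SAME=no
fi
echo "normalized-equal=$SAME"
```

Without `-S`, key order would depend on input order and a plain string comparison could false-fail even when no drift occurred.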
Phase 5 — Production rollout gate¶
Do not proceed to production until all Phase 4/4b checks pass on staging. Production enablement requires an explicit decision gate — there is no automatic promotion.
Pre-production checklist (all must be true):
- Phase 4 staging smoke test: all Lambda function, S3 permission, and env var checks pass
- Phase 4b reconciliation gate: ACK reconcile does not clear env vars (both attempts pass)
- Staging has been running with ACK-managed resources for at least 24 hours without drift
- ArgoCD shows `s3-notifier-staging` as `Healthy` + `Synced` (no degraded or out-of-sync)
Step 1 — Enable production app with adoption in cluster-gitops (single dedicated GitOps commit):
The production app is currently disabled (service.enabled: false in config.yaml). Both the
config enablement and the adoption flag must be set in the same commit (so the app is
generated with adoptExisting: true from the start, never without it).
# cluster-gitops/syrf/environments/production/s3-notifier/config.yaml
# Enable the production s3-notifier Application.
serviceName: s3-notifier
envName: production
service:
  enabled: true  # Was: false — enables ApplicationSet generation
  chartTag: main
  imageTag: "0.0.0"  # Placeholder (Lambda uses S3 zip, not Docker image)
# cluster-gitops/syrf/environments/production/s3-notifier/values.yaml
# Set adoptExisting: true for initial deployment only.
# This tells ACK to import existing AWS resources instead of creating new ones.
adoptExisting: true
Commit and push, then verify the Application was generated before syncing:
set -euo pipefail
# Gate: confirm the production config exists in cluster-gitops.
# The ApplicationSet only generates an app when the service config directory exists.
PROD_CONFIG="${GITOPS_ROOT}/syrf/environments/production/s3-notifier/config.yaml"
if [ ! -f "$PROD_CONFIG" ]; then
echo "ERROR: Production config not found: $PROD_CONFIG"
echo "Create the production s3-notifier config in cluster-gitops first."
exit 1
fi
echo "OK: Production config exists at $PROD_CONFIG"
# Gate: confirm the ApplicationSet generated the production app.
# Without this, argocd app sync will fail with a confusing "app not found" error.
argocd app get s3-notifier-production --grpc-web >/dev/null 2>&1 || {
echo "ERROR: s3-notifier-production Application does not exist."
echo "Check that the production s3-notifier config was pushed to cluster-gitops"
echo "and the ApplicationSet has generated the Application."
echo " argocd appset list --grpc-web"
exit 1
}
echo "OK: s3-notifier-production Application exists."
# Record the adoption commit SHA for use in rollback (if needed).
# IMPORTANT: The adoption values change MUST be a dedicated commit (no other changes).
# If ADOPTION_SHA is already exported (from a prior run or manual set), use it.
# Otherwise, verify HEAD is the adoption commit by checking its message.
if [ -n "${ADOPTION_SHA:-}" ]; then
echo "Using pre-set ADOPTION_SHA=$ADOPTION_SHA"
else
ADOPTION_SHA=$(git -C "$GITOPS_ROOT" rev-parse HEAD)
ADOPTION_MSG=$(git -C "$GITOPS_ROOT" log -1 --format='%s' HEAD)
if ! grep -qiF "adoptExisting" <<<"$ADOPTION_MSG"; then
echo "ERROR: HEAD commit does not mention 'adoptExisting':"
echo " $ADOPTION_SHA $ADOPTION_MSG"
echo ""
echo "The adoption commit must be a DEDICATED commit. Find the correct SHA:"
echo " git -C \$GITOPS_ROOT log --oneline -10"
echo "Then re-run with: export ADOPTION_SHA=<correct SHA>"
exit 1
fi
echo "Adoption commit: $ADOPTION_SHA ($ADOPTION_MSG)"
fi
# Trigger manual sync (production is not auto-synced)
argocd app sync s3-notifier-production --grpc-web
# Wait for sync + health to settle (timeout 5 min)
argocd app wait s3-notifier-production --health --sync --timeout 300 --grpc-web
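The HEAD-commit guard above is a case-insensitive fixed-string match on the flag name. Its behavior can be checked with hypothetical commit subjects (both strings below are made up):

```shell
# Hypothetical commit subjects — only the one naming the flag should pass the guard.
GOOD='prod: enable s3-notifier (adoptExisting: true)'
BAD='fix: unrelated typo'
# -i matches adoptExisting regardless of case; -F avoids regex surprises.
grep -qiF 'adoptexisting' <<<"$GOOD" && GOOD_HIT=yes || GOOD_HIT=no
grep -qiF 'adoptexisting' <<<"$BAD" && BAD_HIT=yes || BAD_HIT=no
echo "good=$GOOD_HIT bad=$BAD_HIT"
```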
Step 2 — Verify adoption succeeded (health + sync gate):
set -euo pipefail
EXPECTED_LAMBDA_ARN="arn:aws:lambda:eu-west-1:318789018510:function:syrfAppUploadS3Notifier"
# Combined health + sync check — both must be true.
APP_STATUS=$(argocd app get s3-notifier-production -o json --grpc-web)
HEALTH=$(jq -r '.status.health.status' <<<"$APP_STATUS")
SYNC=$(jq -r '.status.sync.status' <<<"$APP_STATUS")
if [ "$HEALTH" != "Healthy" ] || [ "$SYNC" != "Synced" ]; then
echo "FAIL: s3-notifier-production is $HEALTH/$SYNC (expected Healthy/Synced)"
argocd app get s3-notifier-production --grpc-web
exit 1
fi
echo "OK: s3-notifier-production is Healthy + Synced"
# Verify adopted Lambda is Active (not Pending, Failed, or Inactive)
LAMBDA_STATE=$(aws lambda get-function --function-name syrfAppUploadS3Notifier --region eu-west-1 \
--query 'Configuration.State' --output text)
if [ "$LAMBDA_STATE" != "Active" ]; then
echo "FAIL: Lambda state is '$LAMBDA_STATE' (expected Active)"
aws lambda get-function --function-name syrfAppUploadS3Notifier --region eu-west-1 \
--query 'Configuration.{State: State, LastUpdateStatus: LastUpdateStatus, StateReason: StateReason}' --output table
exit 1
fi
echo "OK: Lambda state is Active"
# Verify S3 notification targets the expected Lambda ARN (filter, not positional index)
NOTIF_ARN=$(aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads --region eu-west-1 \
--query "LambdaFunctionConfigurations[?LambdaFunctionArn=='${EXPECTED_LAMBDA_ARN}'].LambdaFunctionArn | [0]" \
--output text)
if [ "$NOTIF_ARN" != "$EXPECTED_LAMBDA_ARN" ]; then
echo "FAIL: S3 notification does not target expected Lambda ARN"
echo "Expected: $EXPECTED_LAMBDA_ARN"
echo "Got: ${NOTIF_ARN:-null}"
aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads --region eu-west-1
exit 1
fi
echo "OK: S3 notification targets $EXPECTED_LAMBDA_ARN"
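The JMESPath filter above matches the notification by exact ARN rather than positional index. The same logic can be sketched with jq against a canned configuration (both ARNs are placeholders, not real accounts):

```shell
# Canned get-bucket-notification-configuration output with two Lambda targets.
CONFIG='{"LambdaFunctionConfigurations":[
  {"LambdaFunctionArn":"arn:aws:lambda:eu-west-1:111122223333:function:other"},
  {"LambdaFunctionArn":"arn:aws:lambda:eu-west-1:111122223333:function:target"}]}'
WANT='arn:aws:lambda:eu-west-1:111122223333:function:target'
# select() filters by exact equality — order of configurations does not matter.
GOT=$(jq -r --arg a "$WANT" \
  '.LambdaFunctionConfigurations[] | select(.LambdaFunctionArn == $a) | .LambdaFunctionArn' \
  <<<"$CONFIG")
echo "matched=$GOT"
```

A positional `[0]` lookup would break the moment another notification target is added; filtering by ARN does not.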
Step 3 — Run Phase 4b reconciliation gate against production:
Phase 4b is parameterized via environment variables. Set these before running the Phase 4b script:
# Production targets for Phase 4b (override staging defaults)
export PHASE4B_FUNCTION_NAME="syrfAppUploadS3Notifier"
export PHASE4B_NAMESPACE="syrf-production"
export PHASE4B_CR_NAME="s3-notifier-production"
export PHASE4B_ENV_VALUES_PATH="syrf/environments/production/s3-notifier/values.yaml"
export PHASE4B_ARGOCD_APP="s3-notifier-production"
Then run the Phase 4b script. It uses ${PHASE4B_FUNCTION_NAME:-syrfAppUploadS3Notifier-staging} etc., defaulting to staging if unset.
Step 4 — Remove adoption annotation (GitOps commit):
After adoption is verified and Phase 4b passes, set adoptExisting: false so future syncs use normal reconciliation:
# cluster-gitops/syrf/environments/production/s3-notifier/values.yaml
# Adoption complete — reset to normal reconciliation mode.
adoptExisting: false
Sync again to apply:
set -euo pipefail
argocd app sync s3-notifier-production --grpc-web
argocd app wait s3-notifier-production --health --sync --timeout 300 --grpc-web
echo "Adoption complete. ACK now manages production resources in normal mode."
Step 5 — Terraform teardown (after Step 4 is verified stable):
The old Terraform-managed resources are now owned by ACK. Remove them from Terraform without destroying the actual AWS resources. Order matters — remove state first, then config:
The Terraform resources to hand off are (from camarades-infrastructure/terraform/lambda/main.tf):
| Terraform address | What it manages | ACK takes over? |
|---|---|---|
| `aws_lambda_function.s3_notifier_production` (line 65) | Production Lambda | Yes — Function CRD |
| `aws_lambda_permission.s3_invoke_production` (line 180) | S3→Lambda invoke permission | Yes — permission-job |
| `aws_s3_bucket_notification.uploads` (line 202) | Both production AND preview notifications | Partially — ACK manages the production notification via Bucket CRD; previews stay in Terraform |
Critical: `aws_s3_bucket_notification.uploads` is a single resource managing ALL bucket notifications (the production `Projects/` prefix plus all preview `preview/pr-{N}/` prefixes). You cannot `terraform state rm` it without losing preview notifications. Instead, modify the config to remove only the production `lambda_function` block.
set -euo pipefail
cd "$INFRA_ROOT/terraform/lambda"
# 1. Remove production-only resources from Terraform state.
# These are fully replaced by ACK CRDs and setup-job.
terraform state rm aws_lambda_function.s3_notifier_production
terraform state rm aws_lambda_permission.s3_invoke_production
# 2. Verify state removal — these two should no longer appear.
terraform state list | grep -E 's3_notifier_production|s3_invoke_production' && {
echo "ERROR: Production resources still in state after removal!"
exit 1
} || echo "OK: Production Lambda + permission removed from state."
# 3. DO NOT state-rm aws_s3_bucket_notification.uploads!
# It still manages preview notifications. Instead, edit main.tf to remove
# only the production lambda_function block from that resource.
echo "ACTION: Edit main.tf to modify aws_s3_bucket_notification.uploads:"
echo " - REMOVE the production lambda_function block (lines 206-210)"
echo " - KEEP the dynamic preview lambda_function block (lines 214-221)"
echo " - REMOVE aws_lambda_function.s3_notifier_production resource block"
echo " - REMOVE aws_lambda_permission.s3_invoke_production resource block"
echo " - KEEP all preview resources (s3_notifier_preview, s3_invoke_preview)"
echo " - KEEP production IAM role (still needed for Lambda execution)"
# 4. Run plan to confirm ONLY the expected changes.
terraform plan
# Expect:
# - aws_s3_bucket_notification.uploads: 1 to update (remove production entry)
# - No 'destroy' actions for preview resources
# - If plan shows destroy of preview Lambdas or permissions, STOP.
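The `&& { … } || …` pattern in step 2 is what lets a non-matching `grep` survive `set -e`: a command that fails inside an AND-OR list does not trip errexit. A self-contained sketch with stand-in state output:

```shell
set -euo pipefail
# Stand-in for `terraform state list` output after the removals (illustrative).
STATE_LIST='aws_lambda_function.s3_notifier_preview
aws_s3_bucket_notification.uploads'
# grep exits 1 on no match, but inside an &&/|| list that does NOT abort the script.
grep -E 's3_notifier_production|s3_invoke_production' <<<"$STATE_LIST" && {
  RESULT=leftover
} || RESULT=clean
echo "state-check=$RESULT"
```

Had the `grep` stood alone on its own line, `set -e` would have killed the script on the (desired) no-match case.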
Alternative (Terraform 1.7+): Use `removed` blocks for the Lambda and permission. The notification resource is modified in-place (not removed), so `removed` blocks don't apply to it:

removed {
  from = aws_lambda_function.s3_notifier_production
  lifecycle { destroy = false }
}
removed {
  from = aws_lambda_permission.s3_invoke_production
  lifecycle { destroy = false }
}

Then edit `aws_s3_bucket_notification.uploads` to remove the production block, and `terraform apply`. Preferred for auditable GitOps workflow.
Rollback procedure (if production adoption fails):
Rollback requires both removing ACK CRDs (ArgoCD) and reverting the GitOps config (so the ApplicationSet doesn't recreate the app). Use the adoption commit SHA captured in Step 1 rather than reverting HEAD, which is unsafe on a moving branch.
set -euo pipefail
# ADOPTION_SHA was captured in Step 1. Verify it's set.
if [ -z "${ADOPTION_SHA:-}" ]; then
echo "ERROR: ADOPTION_SHA not set. Find it with:"
echo " git -C \$GITOPS_ROOT log --oneline -5"
echo "Then: export ADOPTION_SHA=<the adoption commit>"
exit 1
fi
# 0. Pre-check: verify deletionPolicy on each CRD independently.
# Each CRD can be: present+retain (safe), present+other (FATAL), or absent (already pruned).
ROLLBACK_SAFE=true
for CRD_KIND_NAME in "function/s3-notifier-production" "bucket/syrfapp-uploads"; do
CRD_KIND="${CRD_KIND_NAME%%/*}"
CRD_NAME="${CRD_KIND_NAME##*/}"
DELETION=$(kubectl get "$CRD_KIND" "$CRD_NAME" -n syrf-production \
-o jsonpath='{.metadata.annotations.services\.k8s\.aws/deletion-policy}' 2>/dev/null) || true
if [ -z "$DELETION" ]; then
echo "NOTE: $CRD_KIND/$CRD_NAME already absent — no deletion risk."
elif [ "$DELETION" = "retain" ]; then
echo "OK: $CRD_KIND/$CRD_NAME has deletionPolicy=retain."
else
echo "FATAL: $CRD_KIND/$CRD_NAME has deletionPolicy='$DELETION' (expected 'retain')!"
echo "DO NOT proceed — deleting this CRD will destroy the AWS resource."
ROLLBACK_SAFE=false
fi
done
if [ "$ROLLBACK_SAFE" != true ]; then
exit 1
fi
# 1. Revert the specific adoption commit in cluster-gitops
# (not HEAD — the branch may have moved since adoption)
echo "Reverting adoption commit $ADOPTION_SHA in cluster-gitops..."
echo " git -C \$GITOPS_ROOT revert --no-edit $ADOPTION_SHA"
echo " git -C \$GITOPS_ROOT push"
echo ""
echo "After the revert is pushed, ArgoCD will remove the Application."
echo "deletionPolicy: retain ensures AWS resources are untouched."
# 2. Verify AWS resources are still functional (after ArgoCD prunes the CRDs)
echo ""
echo "After ArgoCD sync completes, verify AWS resources survived:"
aws lambda get-function --function-name syrfAppUploadS3Notifier --region eu-west-1 \
--query 'Configuration.State' --output text
NOTIF_COUNT=$(aws s3api get-bucket-notification-configuration --bucket syrfapp-uploads --region eu-west-1 \
--query 'length(LambdaFunctionConfigurations)' --output text)
if [ "${NOTIF_COUNT:-0}" -eq 0 ]; then
echo "WARNING: S3 bucket has no Lambda notification configurations!"
exit 1
fi
echo "Rollback complete — AWS resources unchanged ($NOTIF_COUNT notification configs), ACK CRDs removed via GitOps revert."
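The `kind/name` split in the rollback pre-check loop relies on POSIX parameter expansion; in isolation:

```shell
# Split "kind/name" without spawning cut/awk.
PAIR='function/s3-notifier-production'
KIND=${PAIR%%/*}   # remove the longest suffix starting at the first '/'
NAME=${PAIR##*/}   # remove the longest prefix ending at the last '/'
echo "kind=$KIND name=$NAME"
```

Because `%%` and `##` take the longest match, names containing further slashes would still split at the first slash for kind and the last for name, which is why the loop keeps CR names slash-free.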