Troubleshooting: DatabaseLifecycle Post-Script Job RBAC Failure

Problem

A DatabaseLifecycle CR is stuck in the Failed phase with the message "Post-script job execution failed". This affects PR preview environments that use post-script jobs (e.g., index-init for MongoDB index creation).

Symptom:

kubectl get databaselifecycle -n pr-2285
# NAME          DATABASE       PHASE    SOURCE   SEEDED   AGE
# pr-database   syrf_pr_2285   Failed                     52m

Error in operator logs:

jobs.batch "dbl-post-pr-database-1768848824" is forbidden:
User "system:serviceaccount:database-lifecycle-operator:database-lifecycle-operator"
cannot get resource "jobs" in API group "batch" in the namespace "pr-2285"
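The denial message already names everything needed for the later checks: the verb, the resource, the API group, the namespace, and the service account. As a minimal sketch, the hypothetical helper below (parse_forbidden is not part of the operator or of kubectl) pulls those fields out of a message in the format shown above. It assumes the message may be wrapped across several lines, as it is here.

```shell
# Hypothetical helper: split a Kubernetes RBAC "forbidden" message into fields.
# Reads the message on stdin (wrapped lines are joined first) and prints one
# line of key=value pairs. Prints nothing if the message does not match.
parse_forbidden() {
  paste -s -d ' ' - | sed -n -E \
    's/.*User "([^"]+)" +cannot ([a-z]+) resource "([^"]+)" in API group "([^"]*)" in the namespace "([^"]+)".*/verb=\2 resource=\3 group=\4 namespace=\5 user=\1/p'
}

# Example input: the message from the operator logs above.
parse_forbidden <<'EOF'
jobs.batch "dbl-post-pr-database-1768848824" is forbidden:
User "system:serviceaccount:database-lifecycle-operator:database-lifecycle-operator"
cannot get resource "jobs" in API group "batch" in the namespace "pr-2285"
EOF
```

For this message the verb field is get, so the operator is denied reading Jobs as well as creating them.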

Root Cause

The DatabaseLifecycle operator's ClusterRole has no rule for jobs in the batch API group. When the operator tries to read or create a post-script Job in a dynamically created PR namespace (e.g., pr-2285), Kubernetes RBAC denies the request. In the log above, the denied verb is get.

Current ClusterRole permissions (missing batch/jobs):

rules:
  - apiGroups: ["database.syrf.org.uk"]
    resources: ["databaselifecycles", "databaselifecycles/status"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]
  # MISSING: batch/jobs permissions

Investigation Steps

1. Check DatabaseLifecycle status

kubectl describe databaselifecycle pr-database -n pr-2285

Look for:

  • Phase: Failed
  • Conditions showing PostScriptFailed
  • Message: Post-script job execution failed

2. Check operator logs for RBAC errors

kubectl logs -n database-lifecycle-operator deployment/database-lifecycle-operator --since=1h \
  | grep -iE "(forbidden|cannot|RBAC|jobs\.batch)"

Look for lines containing:

  • jobs.batch ... is forbidden
  • cannot get resource "jobs"
  • Failed to create job

3. Verify current ClusterRole permissions

kubectl get clusterrole database-lifecycle-operator -o yaml | grep -A 50 "rules:"

Check if batch/jobs is listed in the rules.

4. Test permissions manually

kubectl auth can-i create jobs \
  --as=system:serviceaccount:database-lifecycle-operator:database-lifecycle-operator \
  -n pr-2285
# Expected (if broken): no
# Expected (if fixed): yes
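The check above covers only create, while the logged denial is for get. A small loop over every verb in the proposed batch/jobs rule gives a fuller picture. This is a sketch: the function name check_job_verbs is hypothetical, and the SA and NS values are copied from the commands in this document.

```shell
# Sketch: ask the API server about each verb in the proposed batch/jobs rule.
# SA and NS come from the commands above; adjust them for other environments.
SA="system:serviceaccount:database-lifecycle-operator:database-lifecycle-operator"
NS="pr-2285"

check_job_verbs() {
  for verb in get list watch create delete; do
    # kubectl auth can-i prints "yes" or "no" and exits non-zero on "no",
    # so the exit status is ignored and only the printed answer is used.
    answer=$(kubectl auth can-i "$verb" jobs --as="$SA" -n "$1" 2>/dev/null) || true
    printf '%-7s %s\n' "$verb" "${answer:-error}"
  done
}

# Usage (needs cluster access and permission to impersonate the service account):
#   check_job_verbs "$NS"
```

Before the fix, every verb is expected to report no; after the fix, every verb should report yes.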

Solution

Add batch/jobs permissions to the operator's ClusterRole.

File: charts/database-lifecycle-operator/templates/rbac.yaml

Add the following rule:

rules:
  # ... existing rules ...

  # Jobs for post-script and health-check execution
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "delete"]

Implementation Steps

  1. Update the Helm chart in cluster-gitops:
cd /home/chris/workspace/cluster-gitops
# Edit charts/database-lifecycle-operator/templates/rbac.yaml
# Add batch/jobs permissions as shown above
  2. Commit and push:
git add charts/database-lifecycle-operator/templates/rbac.yaml
git commit -m "fix(dbl-operator): add batch/jobs RBAC for post-script jobs"
git push
  3. Wait for ArgoCD sync (or trigger manually):
# Check sync status
kubectl get application database-lifecycle-operator -n argocd

# Force sync if needed
argocd app sync database-lifecycle-operator
  4. Verify the ClusterRole was updated:
kubectl get clusterrole database-lifecycle-operator -o yaml | grep -A 5 "batch"

Expected Result

After fixing RBAC:

  1. Re-trigger the DatabaseLifecycle by updating forceReseed or deleting/recreating the CR:
kubectl patch databaselifecycle pr-database -n pr-2285 \
  --type merge -p '{"spec":{"forceReseed":true}}'
  2. Verify post-script job runs:
kubectl get jobs -n pr-2285
# Should see: dbl-post-pr-database-XXXXX
  3. Check DatabaseLifecycle reaches Ready phase:
kubectl get databaselifecycle -n pr-2285
# PHASE should become: Ready (or Seeded if seeding enabled)
  4. Operator logs should show success:
kubectl logs -n database-lifecycle-operator deployment/database-lifecycle-operator --since=5m \
  | grep -E "(post-script|job)"
# Should see: "Running post-script job" followed by success messages
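Rather than polling kubectl get jobs by hand, the verification steps above can be scripted. The sketch below assumes post-script Jobs keep the dbl-post- name prefix seen in the operator logs. The function name wait_for_post_script_job and the 300s default timeout are illustrative choices, not operator settings.

```shell
# Sketch: find the newest post-script Job in a namespace and wait for it.
# $1 = namespace, $2 = optional timeout (defaults to 300s).
wait_for_post_script_job() {
  ns="$1"
  # -o name prints entries such as job.batch/dbl-post-pr-database-XXXXX;
  # sorting by creation time puts the newest Job last.
  job=$(kubectl get jobs -n "$ns" --sort-by=.metadata.creationTimestamp -o name \
        | grep '/dbl-post-' | tail -n 1) || true
  if [ -z "$job" ]; then
    echo "no dbl-post-* job found in $ns" >&2
    return 1
  fi
  kubectl wait --for=condition=complete "$job" -n "$ns" --timeout="${2:-300s}"
}

# Usage (needs cluster access):
#   wait_for_post_script_job pr-2285
```

kubectl wait only watches for the Complete condition here, so a Job that fails outright causes the command to run until the timeout. If that happens, inspect the Job's pods with kubectl logs.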

Related Resources

  • Operator Helm Chart: charts/database-lifecycle-operator/
  • RBAC Template: charts/database-lifecycle-operator/templates/rbac.yaml
  • Operator Namespace: database-lifecycle-operator
  • ArgoCD Application: database-lifecycle-operator
  • Affected Environments: PR preview namespaces (pr-*)

Why This Only Affects Preview Environments

The post-script job feature is primarily used in preview environments to run index initialization after database seeding. Staging and production environments don't use this feature because they connect to pre-existing databases with indexes already in place.
