CI/CD Recovery Guide

This guide covers how to recover from common CI/CD failures in the SyRF project.

Quick Reference

Problem                         Solution
Build failed, tag exists        Force rebuild via workflow dispatch
PR preview not updating         Force rebuild PR preview
Retag failed                    Force rebuild the service
Image missing but tag exists    Delete orphaned tag and rebuild
Staging promotion failed        Check build results and retry

Force Rebuild (Main Branch)

If a build failed but the tag was created (orphaned tag), you can force a rebuild:

  1. Go to Actions > CI/CD Workflow
  2. Click Run workflow
  3. Select main branch
  4. Check the services you want to rebuild:
     • force_rebuild_api - Rebuild API service
     • force_rebuild_project_management - Rebuild PM service
     • force_rebuild_quartz - Rebuild Quartz service
     • force_rebuild_web - Rebuild Web frontend
     • force_rebuild_user_guide - Rebuild User Guide
     • force_rebuild_docs - Rebuild Team Docs
     • force_rebuild_all - Rebuild all services
  5. Click Run workflow

The workflow will:

  • Calculate a new version (may bump version if there were commits since last tag)
  • Build and push Docker images
  • Create new git tags
  • Promote to staging

Force Rebuild (PR Preview)

If a PR preview environment isn't updating correctly:

  1. Go to Actions > PR Preview Build
  2. Click Run workflow
  3. Enter the PR number
  4. Check force_rebuild_all to rebuild all services
  5. Click Run workflow

Diagnose Staging Promotion Failure

When staging promotion fails, check:

  1. Build Results in the workflow summary:
     • Docker Build: Should be success or skipped
     • Retag Images: Should be success or skipped
     • Lambda Deploy: Should be success or skipped

  2. Specific Job Logs:
     • Click on the failed job in the workflow run
     • Expand the failed step to see error details

  3. Common Issues:
     • GHCR authentication failed: Check GITHUB_TOKEN permissions
     • Image doesn't exist: Previous build failed, use force rebuild
     • cluster-gitops PR failed: Check GITOPS_PAT secret
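The "success or skipped" rule from the Build Results check can be expressed as a small helper. This is an illustrative sketch, not the workflow's own logic; promotion_ready is a hypothetical name and the result strings are the ones GitHub Actions reports for jobs.

```shell
# Return 0 when every job result passed in is "success" or "skipped".
# Any other value (failure, cancelled, ...) is reported on stderr and returns 1.
promotion_ready() {
  local result
  for result in "$@"; do
    case "$result" in
      success|skipped) ;;
      *) echo "blocking result: $result" >&2; return 1 ;;
    esac
  done
  return 0
}

# Example: Docker Build, Retag Images, Lambda Deploy results in that order.
#   promotion_ready success skipped success && echo "promotion should proceed"
#
# To read only the failed steps of a run from the terminal (RUN_ID is a placeholder):
#   gh run view "$RUN_ID" --log-failed
```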

Delete Orphaned Tags

If a tag exists but the corresponding image doesn't:

# List tags for a service
git tag -l "api-v*" | tail -5

# Check if image exists
crane manifest ghcr.io/camaradesuk/syrf-api:8.21.0

# If image doesn't exist, delete the orphaned tag
git push --delete origin api-v8.21.0
git tag -d api-v8.21.0

Then trigger a force rebuild to create a new tag with a working image.
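When several tags need checking, the manual steps above can be wrapped in helpers. This sketch assumes every service follows the pattern shown for the API (tag <service>-v<version>, image ghcr.io/camaradesuk/syrf-<service>:<version>); other services may name their images differently.

```shell
# ASSUMPTION: image names follow the syrf-<service> pattern seen for the API.
REGISTRY_PREFIX="ghcr.io/camaradesuk/syrf-"

# Turn a git tag such as api-v8.21.0 into the image reference it should map to.
image_ref_for_tag() {
  local tag=$1
  local service=${tag%-v*}    # strip the trailing "-v<version>"
  local version=${tag##*-v}   # strip everything up to and including "-v"
  printf '%s%s:%s\n' "$REGISTRY_PREFIX" "$service" "$version"
}

# Succeeds when crane cannot fetch the manifest for the tag's image.
# Note: auth or network errors also make crane fail, so confirm before deleting.
tag_is_orphaned() {
  ! crane manifest "$(image_ref_for_tag "$1")" >/dev/null 2>&1
}

# List candidate orphaned tags for one service (read-only, deletes nothing):
#   for t in $(git tag -l "api-v*"); do
#     tag_is_orphaned "$t" && echo "possibly orphaned: $t"
#   done
```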

Check Image Existence

Use crane to verify images exist:

# Install crane (pick one method)
# Option 1: Using Go (requires Go 1.20+)
go install github.com/google/go-containerregistry/cmd/crane@latest

# Option 2: Using Homebrew (macOS/Linux)
brew install crane

# Option 3: Download pre-built binary from GitHub releases
# https://github.com/google/go-containerregistry/releases

# Check if an image exists
crane manifest ghcr.io/camaradesuk/syrf-api:8.21.0

# List tags for an image
crane ls ghcr.io/camaradesuk/syrf-api

# Get image labels (includes commit SHA)
crane config ghcr.io/camaradesuk/syrf-api:8.21.0 | jq '.config.Labels'

Verify Image Content

Images include OCI labels for traceability:

# Get the commit SHA an image was built from
crane config ghcr.io/camaradesuk/syrf-api:8.21.0 | \
  jq -r '.config.Labels["org.opencontainers.image.revision"]'

# Expected output: full commit SHA

If the revision does not match the commit you expected, the image may be stale.
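One way to make that comparison concrete is to check the label against the commit a git tag points at. This is a sketch under the assumption that the tag and the image were created from the same commit; a retagged image would presumably still carry the commit of its original build, so a mismatch is a prompt to investigate rather than proof of staleness.

```shell
# Print the revision label recorded on an image (same crane/jq call as above).
image_revision() {
  crane config "$1" | jq -r '.config.Labels["org.opencontainers.image.revision"]'
}

# Print the commit SHA a git tag resolves to.
tag_commit() {
  git rev-list -n 1 "$1"
}

# Compare an expected SHA with the one found on the image; returns 1 on mismatch.
compare_revisions() {
  if [ "$1" = "$2" ]; then
    echo "match"
  else
    echo "mismatch: expected $1, image has $2"
    return 1
  fi
}

# Usage:
#   compare_revisions "$(tag_commit api-v8.21.0)" \
#     "$(image_revision ghcr.io/camaradesuk/syrf-api:8.21.0)"
```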

Detect Current State

The change detection script can be run locally to understand the current state:

# Check API service status
.github/scripts/detect-service-changes.sh --context main --service api

# Output:
# {
#   "service": "api",
#   "context": "main",
#   "last_tag": "api-v8.21.0",
#   "last_version": "8.21.0",
#   "source_changed": false,
#   "chart_changed": false,
#   "image_exists": "true",
#   "action": "skip"
# }

Actions explained:

  • build: Source changed, will build new image
  • retag: Only chart changed, will retag existing image
  • skip: Nothing changed, will skip (main branch)
  • use-existing: Nothing changed, will use last version (PR context)
  • fail: Chart changed but image doesn't exist (cannot proceed)
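For scripted checks, the action value can be mapped to a next step. The helper below is a hypothetical wrapper, not part of the repository; it restates the list above and assumes the script prints only the JSON object to stdout.

```shell
# Map a detected action to a short description of what to expect or do next.
# Returns 1 for "fail" and 2 for values not listed in this guide.
next_step() {
  case "$1" in
    build)        echo "source changed: a new image will be built" ;;
    retag)        echo "chart changed only: the existing image will be retagged" ;;
    skip)         echo "nothing changed: no action on main" ;;
    use-existing) echo "nothing changed: the PR will use the last version" ;;
    fail)         echo "chart changed but image is missing: force rebuild the service"; return 1 ;;
    *)            echo "unknown action: $1"; return 2 ;;
  esac
}

# Usage with the detection script:
#   action=$(.github/scripts/detect-service-changes.sh --context main --service api | jq -r '.action')
#   next_step "$action"
```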

Test Change Detection

Run the test suite to verify change detection logic:

.github/scripts/test-detect-service-changes.sh

# Expected output:
# Testing: Missing arguments rejection... PASSED
# Testing: Invalid context rejection... PASSED
# ...
# Tests run:    17
# Tests passed: 17
# Tests failed: 0

Emergency Procedures

Complete CI/CD Reset

If the CI/CD system is in an unrecoverable state:

  1. Delete all orphaned tags for affected services
  2. Force rebuild all services via workflow dispatch
  3. Monitor the workflow to ensure all builds succeed
  4. Verify images exist with crane

Skip Staging, Deploy Directly

In emergencies, you can manually update cluster-gitops:

cd /path/to/cluster-gitops

# Edit the service config
vim syrf/environments/staging/api/config.yaml
# Update chartTag to the desired version

git add .
git commit -m "fix: emergency update api to v8.21.0"
git push

ArgoCD will automatically sync the change.
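The manual edit can also be scripted to avoid typos under pressure. This sketch assumes chartTag sits on its own line as `chartTag: <version>` in config.yaml; the real file layout is not shown in this guide, so review the diff before committing.

```shell
# Rewrite the value of the first-level "chartTag:" line in a YAML file.
# ASSUMPTION: the key appears on a single line of the form "chartTag: <value>".
set_chart_tag() {
  local file=$1 version=$2 tmp
  tmp=$(mktemp) || return 1
  # Keep indentation and key, replace only the value.
  sed -E "s/^([[:space:]]*chartTag:[[:space:]]*).*/\1${version}/" "$file" > "$tmp" || return 1
  # Write back in place so the original file's permissions are preserved.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage (inside the cluster-gitops checkout):
#   set_chart_tag syrf/environments/staging/api/config.yaml 8.21.0
#   git diff   # confirm only chartTag changed before committing
```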