CI/CD Recovery Guide
This guide covers how to recover from common CI/CD failures in the SyRF project.
Quick Reference
| Problem | Solution |
|---|---|
| Build failed, tag exists | Force rebuild via workflow dispatch |
| PR preview not updating | Force rebuild PR preview |
| Retag failed | Force rebuild the service |
| Image missing but tag exists | Delete orphaned tag and rebuild |
| Staging promotion failed | Check build results and retry |
Force Rebuild (Main Branch)
If a build failed but the tag was created (orphaned tag), you can force a rebuild:
- Go to Actions > CI/CD Workflow
- Click Run workflow
- Select the main branch
- Check the services you want to rebuild:
  - force_rebuild_api - Rebuild API service
  - force_rebuild_project_management - Rebuild PM service
  - force_rebuild_quartz - Rebuild Quartz service
  - force_rebuild_web - Rebuild Web frontend
  - force_rebuild_user_guide - Rebuild User Guide
  - force_rebuild_docs - Rebuild Team Docs
  - force_rebuild_all - Rebuild all services
- Click Run workflow
The workflow will:
- Calculate a new version (the version may be bumped if there have been commits since the last tag)
- Build and push Docker images
- Create new git tags
- Promote to staging
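The same dispatch can also be triggered from a terminal with the GitHub CLI. The following is a minimal sketch, not the project's documented procedure: it assumes the workflow file is named `ci-cd.yml` (a hypothetical name; check `.github/workflows` for the real one) and that the checkboxes listed above are exposed as `workflow_dispatch` inputs with the same names.

```shell
# Build the "-f input=true" arguments for the services to rebuild.
# Takes short service names (api, web, all, ...) that match the
# force_rebuild_* checkboxes listed above.
build_dispatch_args() {
  local args="" service
  for service in "$@"; do
    args="${args:+$args }-f force_rebuild_${service}=true"
  done
  printf '%s\n' "$args"
}

# Trigger the workflow on main.
# "ci-cd.yml" is an assumed file name, not taken from this guide.
force_rebuild() {
  # Word splitting of the argument string is intentional here.
  # shellcheck disable=SC2046
  gh workflow run ci-cd.yml --ref main $(build_dispatch_args "$@")
}

# Example (not run automatically):
#   force_rebuild api web
```

Keeping the argument construction in its own function makes it easy to print and review the inputs before anything is dispatched.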
Force Rebuild (PR Preview)
If a PR preview environment isn't updating correctly:
- Go to Actions > PR Preview Build
- Click Run workflow
- Enter the PR number
- Check force_rebuild_all to rebuild all services
- Click Run workflow
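A command-line equivalent might look like the sketch below. Both the workflow file name `pr-preview-build.yml` and the input name `pr_number` are assumptions made for illustration; only `force_rebuild_all` comes from the steps above.

```shell
# Print the dispatch arguments for a PR preview rebuild.
# Fails if the PR number is empty or contains non-digit characters.
# "pr_number" is an assumed input name.
preview_dispatch_args() {
  local pr="$1"
  case "$pr" in
    ''|*[!0-9]*)
      echo "PR number must be numeric: '$pr'" >&2
      return 1
      ;;
  esac
  printf '%s\n' "-f pr_number=${pr} -f force_rebuild_all=true"
}

# Dispatch the preview build for one PR.
# "pr-preview-build.yml" is an assumed file name.
rebuild_preview() {
  local args
  args="$(preview_dispatch_args "$1")" || return 1
  # Word splitting of $args is intentional.
  # shellcheck disable=SC2086
  gh workflow run pr-preview-build.yml $args
}

# Example (not run automatically):
#   rebuild_preview 123
```

Validating the PR number first avoids dispatching a run that would fail immediately on a malformed input.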
Diagnose Staging Promotion Failure
When staging promotion fails, check:
- Build Results in the workflow summary:
  - Docker Build: Should be success or skipped
  - Retag Images: Should be success or skipped
  - Lambda Deploy: Should be success or skipped
- Specific Job Logs:
  - Click on the failed job in the workflow run
  - Expand the failed step to see error details
- Common Issues:
  - GHCR authentication failed: Check GITHUB_TOKEN permissions
  - Image doesn't exist: Previous build failed, use force rebuild
  - cluster-gitops PR failed: Check GITOPS_PAT secret
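The checks above can be approximated from a terminal with the GitHub CLI. This is a rough sketch that assumes you already have the run ID of the failed workflow run (for example from `gh run list`); the helper names are hypothetical.

```shell
# Return 0 when a job result is acceptable for promotion,
# i.e. "success" or "skipped" as described above.
result_ok() {
  case "$1" in
    success|skipped) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the names of jobs in a completed run whose conclusion is
# neither success nor skipped. Uses gh's built-in --jq filter.
failed_jobs() {
  local run_id="$1"
  gh run view "$run_id" --json jobs \
    --jq '.jobs[] | select(.conclusion != "success" and .conclusion != "skipped") | .name'
}

# Show only the log output of failed steps (not run automatically):
#   gh run view <run-id> --log-failed
```

Jobs that are still in progress have no conclusion yet and would also be listed by `failed_jobs`, so it is best used once the run has finished.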
Delete Orphaned Tags
If a tag exists but the corresponding image doesn't:
# List tags for a service
git tag -l "api-v*" | tail -5
# Check if image exists
crane manifest ghcr.io/camaradesuk/syrf-api:8.21.0
# If image doesn't exist, delete the orphaned tag
git push --delete origin api-v8.21.0
git tag -d api-v8.21.0
Then trigger a force rebuild to create a new tag with a working image.
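To find orphaned tags without checking them one by one, the lookup can be scripted. The sketch below assumes every service follows the pattern shown for the API, where tag `<service>-v<version>` maps to image `ghcr.io/camaradesuk/syrf-<service>:<version>`. That mapping is an assumption for services other than `api`.

```shell
# Map a git tag such as "api-v8.21.0" to its image reference.
# Assumed pattern: <service>-v<version> -> ghcr.io/camaradesuk/syrf-<service>:<version>
tag_to_image_ref() {
  local tag="$1"
  local service="${tag%-v*}"   # text before the last "-v"
  local version="${tag##*-v}"  # text after the last "-v"
  echo "ghcr.io/camaradesuk/syrf-${service}:${version}"
}

# Print every tag matching the pattern whose image manifest cannot be fetched.
# Requires git and crane.
list_orphaned_tags() {
  local pattern="$1" tag
  for tag in $(git tag -l "$pattern"); do
    if ! crane manifest "$(tag_to_image_ref "$tag")" >/dev/null 2>&1; then
      echo "$tag"
    fi
  done
}

# Example (not run automatically):
#   list_orphaned_tags "api-v*"
```

A failing `crane manifest` can also mean an authentication or network problem, so confirm that a tag is really orphaned before deleting it.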
Check Image Existence
Use crane to verify images exist:
# Install crane (pick one method)
# Option 1: Using Go (requires Go 1.20+)
go install github.com/google/go-containerregistry/cmd/crane@latest
# Option 2: Using Homebrew (macOS/Linux)
brew install crane
# Option 3: Download pre-built binary from GitHub releases
# https://github.com/google/go-containerregistry/releases
# Check if an image exists
crane manifest ghcr.io/camaradesuk/syrf-api:8.21.0
# List tags for an image
crane ls ghcr.io/camaradesuk/syrf-api
# Get image labels (includes commit SHA)
crane config ghcr.io/camaradesuk/syrf-api:8.21.0 | jq '.config.Labels'
Verify Image Content
Images include OCI labels for traceability:
# Get the commit SHA an image was built from
crane config ghcr.io/camaradesuk/syrf-api:8.21.0 | \
jq -r '.config.Labels["org.opencontainers.image.revision"]'
# Expected output: full commit SHA
If the image revision doesn't match the commit you expect, the image may be stale.
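The comparison can be scripted as in the sketch below. Which commit counts as "expected" depends on the situation. The usage example takes the commit that the git tag points to, which is an assumption and will not hold for images that were retagged rather than rebuilt.

```shell
# Return 0 when the image's recorded revision matches the expected commit.
# An abbreviated expected SHA is accepted (prefix match).
# An empty expected value never matches.
revision_matches() {
  local image_rev="$1" expected="$2"
  [ -n "$expected" ] && [ "${image_rev#"$expected"}" != "$image_rev" ]
}

# Example (not run automatically; requires crane, jq and git):
#   image_rev=$(crane config ghcr.io/camaradesuk/syrf-api:8.21.0 |
#     jq -r '.config.Labels["org.opencontainers.image.revision"]')
#   expected=$(git rev-parse 'api-v8.21.0^{commit}')
#   revision_matches "$image_rev" "$expected" || echo "image may be stale"
```

If the label is missing, `jq -r` prints `null`, which does not match any commit SHA and is therefore reported as a mismatch.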
Detect Current State
The change detection script can be run locally to understand the current state:
# Check API service status
.github/scripts/detect-service-changes.sh --context main --service api
# Output:
# {
# "service": "api",
# "context": "main",
# "last_tag": "api-v8.21.0",
# "last_version": "8.21.0",
# "source_changed": false,
# "chart_changed": false,
# "image_exists": "true",
# "action": "skip"
# }
Actions explained:
- build: Source changed, will build new image
- retag: Only chart changed, will retag existing image
- skip: Nothing changed, will skip (main branch)
- use-existing: Nothing changed, will use last version (PR context)
- fail: Chart changed but image doesn't exist (cannot proceed)
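Read as a decision table, the actions could be restated as the function below. It is an illustrative reconstruction from the descriptions above, not the logic of `detect-service-changes.sh` itself, and `decide_action` is a hypothetical name. In particular, how the real script treats "nothing changed but the image is missing" is not stated here, so the sketch ignores `image_exists` in that case.

```shell
# Hypothetical restatement of the action table; not the real script.
# Arguments: context (main|pr), source_changed, chart_changed, image_exists
decide_action() {
  local context="$1" source_changed="$2" chart_changed="$3" image_exists="$4"
  if [ "$source_changed" = "true" ]; then
    echo "build"
  elif [ "$chart_changed" = "true" ]; then
    # Retagging needs an existing image to point the new tag at.
    if [ "$image_exists" = "true" ]; then
      echo "retag"
    else
      echo "fail"
    fi
  elif [ "$context" = "main" ]; then
    echo "skip"
  else
    echo "use-existing"
  fi
}

# Matches the sample output above:
#   decide_action main false false true   # prints "skip"
```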
Test Change Detection
Run the test suite to verify change detection logic:
.github/scripts/test-detect-service-changes.sh
# Expected output:
# Testing: Missing arguments rejection... PASSED
# Testing: Invalid context rejection... PASSED
# ...
# Tests run: 17
# Tests passed: 17
# Tests failed: 0
Emergency Procedures
Complete CI/CD Reset
If the CI/CD system is in an unrecoverable state:
- Delete all orphaned tags for affected services
- Force rebuild all services via workflow dispatch
- Monitor the workflow to ensure all builds succeed
- Verify images exist with crane
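The final verification step can be run as a single pass over all affected images. The sketch below is one possible helper; `verify_images` and the `IMAGE_CHECK` override are hypothetical names, and only the `syrf-api` image reference is taken from this guide.

```shell
# Check each image reference and report which ones are missing.
# Returns non-zero if at least one image could not be found.
# IMAGE_CHECK overrides the check command (default: "crane manifest").
verify_images() {
  local check="${IMAGE_CHECK:-crane manifest}" ref missing=0
  for ref in "$@"; do
    # $check is left unquoted so "crane manifest" splits into command + argument.
    if $check "$ref" >/dev/null 2>&1; then
      echo "ok       $ref"
    else
      echo "MISSING  $ref"
      missing=1
    fi
  done
  return "$missing"
}

# Example (not run automatically):
#   verify_images ghcr.io/camaradesuk/syrf-api:8.21.0
```

Reporting every image instead of stopping at the first failure gives a complete picture of what still needs rebuilding.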
Skip Staging, Deploy Directly
In emergencies, you can manually update cluster-gitops:
cd /path/to/cluster-gitops
# Edit the service config
vim syrf/environments/staging/api/config.yaml
# Update chartTag to the desired version
git add .
git commit -m "fix: emergency update api to v8.21.0"
git push
ArgoCD will automatically sync the change.
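If you prefer not to edit the file by hand, the `chartTag` update can be scripted. This is a minimal sketch assuming `chartTag` sits on its own line in the form `chartTag: <value>`. The actual layout of `config.yaml` is not shown in this guide, so review the diff before committing.

```shell
# Replace the value of the chartTag key in a YAML file.
# Assumes a line of the form "chartTag: <value>"; indentation is preserved.
# The new value must not contain "/" or "&" (they are special to sed).
set_chart_tag() {
  local file="$1" version="$2" tmp
  tmp="$(mktemp)" || return 1
  sed -E "s/^([[:space:]]*chartTag:[[:space:]]*).*/\1${version}/" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  local status=$?
  rm -f "$tmp"
  return "$status"
}

# Example (not run automatically):
#   set_chart_tag syrf/environments/staging/api/config.yaml 8.21.0
#   git diff   # confirm that only the chartTag line changed
```

Pass the value in whatever format the file already uses for `chartTag`; the function writes it verbatim.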