URL Migration Plan - Legacy to New Cluster¶
Overview¶
This document provides the complete migration plan for transitioning SyRF production services from the legacy Jenkins X cluster to the new ArgoCD cluster. The migration includes URL changes and requires careful coordination to minimize downtime.
Single Source of Truth: See url-migration-map.yaml for the authoritative mapping of all URLs.
Migration Strategy¶
Approach: Parallel clusters with temporary testing URLs, followed by DNS cutover
Key Principles:

- Minimize production downtime (target: <15 minutes)
- Maintain rollback capability at each phase
- Test thoroughly before cutover
- Preserve data integrity (no PV deletions)
- Aggressive timeline to reduce dual-cluster costs
URL Changes Summary¶
Services with URL Changes (BREAKING)¶
| Service | Legacy URL | New URL | Impact |
|---|---|---|---|
| API | syrf-api.syrf.org.uk | api.syrf.org.uk | HIGH - Auth0 update required |
| Project Management | syrf-projectmanagement.syrf.org.uk | pm.syrf.org.uk | Medium - Check frontend for hardcoded URLs |
| RabbitMQ | rabbitmq-stats.camarades.net | rabbitmq.camarades.net | Low - Admin access only |
Services with Consistent URLs¶
| Service | URL | Change |
|---|---|---|
| Web | syrf.org.uk | None - same URL |
| Web Redirect | app.syrf.org.uk → syrf.org.uk | None - redirect maintained |
| Quartz | syrf-quartz.syrf.org.uk → quartz.syrf.org.uk | Prefix only |
| User Guide | help.syrf.org.uk | None - same URL |
New Services¶
| Service | URL | Description |
|---|---|---|
| Team Docs | docs.syrf.org.uk | Internal documentation with GitHub OAuth |
Temporary Testing URLs¶
During the migration, production namespace services will be accessible on temporary .prod.camarades.net URLs:
| Service | Testing URL | Purpose |
|---|---|---|
| Web | syrf.prod.camarades.net | Test production deployment |
| Web Redirect | app.prod.camarades.net → syrf.prod.camarades.net | Test redirect behavior |
| API | api.prod.camarades.net | Test new API URL before Auth0 update |
| PM | pm.prod.camarades.net | Test project management |
| Quartz | quartz.prod.camarades.net | Test background jobs |
| User Guide | help.prod.camarades.net | Test documentation |
| Team Docs | docs.prod.camarades.net | Test OAuth2 Proxy integration |
| RabbitMQ | rabbitmq.prod.camarades.net | Test RabbitMQ instance |
Note: These temporary URLs will be deleted after successful cutover.
Migration Phases¶
Phase 1: Fix DNS Conflict (CRITICAL - BLOCKING)¶
Duration: 2-3 hours
Risk: Low
Rollback: Easy - revert external-dns config
Problem: Both clusters are using txt-owner-id: default, so each external-dns instance claims ownership of the same records, causing DNS records to be repeatedly created and deleted.
Steps:

1. Update New Cluster External-DNS (via GitOps):

```bash
cd /home/chris/workspace/syrf/cluster-gitops

# Edit plugins/helm/external-dns/values.yaml
# Add: txtOwnerId: camaradesuk-new

git add plugins/helm/external-dns/values.yaml
git commit -m "fix(external-dns): set unique txt-owner-id for new cluster"
git push

# Wait for ArgoCD to sync (3 minutes or trigger manually)
kubectl get pods -n external-dns -w
```
2. Update Old Cluster External-DNS (via kubectl - one-time):
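Since the legacy cluster is not GitOps-managed, this is a direct one-time edit. A minimal sketch, assuming the legacy external-dns runs as a Deployment named `external-dns` in namespace `external-dns` (the resource names, context name, and owner-id value `camaradesuk-legacy` are all assumptions to adapt):

```shell
# Build a JSON patch that appends a unique owner ID to the container args.
# If --txt-owner-id is already set, replace that args entry instead of appending.
owner_id_patch() {
  echo '[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--txt-owner-id='"$1"'"}]'
}

PATCH=$(owner_id_patch camaradesuk-legacy)
echo "$PATCH"

# Apply on the legacy cluster (uncomment to run):
# kubectl --context <legacy-context> -n external-dns \
#   patch deployment external-dns --type=json -p "$PATCH"
```

Keeping the patch in a variable lets you review it before applying it to a live cluster.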
3. Verify Fix:

```bash
# Wait 60 seconds for both external-dns instances to sync
sleep 60

# Check DNS records persist (should stay stable)
gcloud dns record-sets list --zone=camarades-net-zone | grep argocd
# Should see:
# argocd.camarades.net.  A    300  34.13.63.21
# argocd.camarades.net.  TXT  300  "heritage=external-dns,external-dns/owner=camaradesuk-new..."

# Verify DNS resolution works
dig argocd.camarades.net +short
# Should return: 34.13.63.21

# Wait 5 minutes and check again - should remain stable
sleep 300
dig argocd.camarades.net +short
# Should still return: 34.13.63.21
```
4. Test ArgoCD Access:
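A quick reachability check can be scripted. The helper below classifies an HTTP status code; the commented `curl` line runs the real check against the cluster (this is a generic sketch, not project tooling):

```shell
# Classify an HTTP status: any 2xx/3xx means the endpoint answered.
expect_https_ok() {
  case "$1" in
    2*|3*) echo "OK ($1)" ;;
    *)     echo "FAIL ($1)" ;;
  esac
}

# Run against the real endpoint (requires network access):
# code=$(curl -s -o /dev/null -w '%{http_code}' https://argocd.camarades.net) || code=000
# expect_https_ok "$code"

expect_https_ok 200
```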
Success Criteria:

- ✅ DNS records remain stable for 24 hours
- ✅ ArgoCD UI accessible at https://argocd.camarades.net
- ✅ No more DNS record deletions in change history
Rollback: Revert both external-dns configurations to original state.
Phase 2: Deploy Production Namespace with Testing URLs¶
Duration: 1-2 days
Risk: Low (isolated testing environment)
Rollback: Easy - delete production namespace
Objective: Deploy all services to syrf-production namespace with temporary .prod.camarades.net URLs.
Prerequisites:

- ✅ Phase 1 complete (DNS conflict fixed)
- ✅ All staging services healthy
- ✅ External Secrets Operator working
Steps:

1. Create Missing Production Config Files:

```bash
cd /home/chris/workspace/syrf/cluster-gitops/syrf/environments/production

# User-guide (missing production config)
mkdir -p user-guide

cat > user-guide/config.yaml <<EOF
serviceName: user-guide
envName: production
service:
  enabled: true
  chartTag: user-guide-v1.0.0  # Replace with actual latest version
EOF

cat > user-guide/values.yaml <<EOF
# Temporary testing URL
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: help.prod.camarades.net
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: user-guide-prod-tls
      hosts:
        - help.prod.camarades.net

# Production resource limits (to be tuned)
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "200m"
EOF
```
2. Update Production Ingress Values for All Services:
API (production/api/values.yaml):

```yaml
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: api.prod.camarades.net  # Temporary testing URL
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: api-prod-tls
      hosts:
        - api.prod.camarades.net
```
Web (production/web/values.yaml):

```yaml
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: syrf.prod.camarades.net  # Temporary testing URL
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: web-prod-tls
      hosts:
        - syrf.prod.camarades.net

# Add redirect ingress for app.prod.camarades.net
redirectIngress:
  enabled: true
  className: nginx
  hosts:
    - host: app.prod.camarades.net
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: web-app-redirect-prod-tls
      hosts:
        - app.prod.camarades.net
  annotations:
    nginx.ingress.kubernetes.io/server-snippet: |
      return 301 https://syrf.prod.camarades.net$request_uri;
```
Project Management (production/project-management/values.yaml):

```yaml
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: pm.prod.camarades.net  # Temporary testing URL
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: pm-prod-tls
      hosts:
        - pm.prod.camarades.net
```
Quartz (production/quartz/values.yaml):

```yaml
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: quartz.prod.camarades.net  # Temporary testing URL
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: quartz-prod-tls
      hosts:
        - quartz.prod.camarades.net
```
Docs (already configured in production/docs/values.yaml):

```yaml
ingress:
  enabled: true
  className: nginx
  hosts:
    - host: docs.prod.camarades.net  # Update from docs.syrf.org.uk
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: docs-prod-tls
      hosts:
        - docs.prod.camarades.net
```
3. Update RabbitMQ Ingress (plugins/helm/rabbitmq/values.yaml):
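The exact keys depend on the RabbitMQ chart in use. A minimal sketch assuming a Bitnami-style `ingress` block (`hostname` and `ingressClassName` are assumptions to verify against the actual plugins/helm/rabbitmq/values.yaml):

```yaml
ingress:
  enabled: true
  ingressClassName: nginx
  # Temporary testing URL (was rabbitmq-stats.camarades.net on the legacy cluster)
  hostname: rabbitmq.prod.camarades.net
  tls: true
```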
4. Commit and Push Changes:

```bash
git add syrf/environments/production/
git add plugins/helm/rabbitmq/values.yaml
git commit -m "feat(production): configure temporary testing URLs for production namespace

- Add production config for user-guide
- Update all service ingresses to use .prod.camarades.net URLs
- Configure web app.prod.camarades.net redirect
- Update RabbitMQ ingress

Refs: url-migration-plan.md Phase 2"
git push
```
5. Wait for ArgoCD Sync:
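Sync completion can be polled rather than eyeballed. A sketch using a small filter over `kubectl` output (the custom-columns query and its single-space layout are assumptions; adjust to your kubectl version):

```shell
# Succeed only when every input line reports "Synced" and "Healthy".
# Note: empty input counts as healthy, so make sure the query returned rows.
all_synced() {
  ! grep -Evq 'Synced +Healthy'
}

# Poll the cluster (uncomment to run):
# until kubectl get applications -n argocd --no-headers \
#     -o custom-columns=SYNC:.status.sync.status,HEALTH:.status.health.status \
#   | all_synced; do sleep 10; done

printf 'Synced Healthy\nSynced Healthy\n' | all_synced && echo "all applications ready"
```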
6. Verify DNS Records Created:
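The eight testing records can be checked in one loop. The comparison helper is generic; the hostnames and target IP come from the tables above (running the resolution requires network access and `dig`):

```shell
NEW_IP="34.13.63.21"

# Report whether a resolved IP matches the new cluster's ingress IP.
check_record() {
  if [ "$2" = "$NEW_IP" ]; then
    echo "OK    $1"
  else
    echo "WRONG $1 -> ${2:-<no answer>}"
  fi
}

# Resolve each testing hostname (uncomment to run):
# for h in syrf app api pm quartz help docs rabbitmq; do
#   host="${h}.prod.camarades.net"
#   check_record "$host" "$(dig +short "$host" @8.8.8.8 | head -n1)"
# done

check_record demo.prod.camarades.net "$NEW_IP"
```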
7. Verify TLS Certificates Issued:
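cert-manager exposes readiness on its Certificate resources, so pending certificates can be flagged mechanically. A sketch (the namespace and the READY-in-column-2 layout of `kubectl get certificate` are assumptions):

```shell
# Print only certificates whose READY column is not "True".
not_ready() {
  awk '$2 != "True" { print "PENDING: " $1 }'
}

# Against the cluster (uncomment to run):
# kubectl get certificate -n syrf-production --no-headers | not_ready

printf 'api-prod-tls True\nweb-prod-tls False\n' | not_ready
```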
8. Test Service Access:

```bash
# Test each endpoint
curl -I https://syrf.prod.camarades.net
curl -I https://app.prod.camarades.net   # Should redirect
curl -I https://api.prod.camarades.net/health
curl -I https://pm.prod.camarades.net/health
curl -I https://quartz.prod.camarades.net
curl -I https://help.prod.camarades.net
curl -I https://docs.prod.camarades.net  # Should require GitHub auth
curl -I https://rabbitmq.prod.camarades.net
```
Success Criteria:

- ✅ All services deployed to syrf-production namespace
- ✅ All ingresses created with .prod.camarades.net URLs
- ✅ All DNS records point to 34.13.63.21
- ✅ All TLS certificates issued successfully
- ✅ All services respond to HTTP requests
Rollback: Delete production namespace or disable services in ApplicationSet.
Phase 3: Test Production Services¶
Duration: 3-5 days
Risk: Low (isolated testing)
Rollback: N/A (testing only)
Objective: Comprehensive end-to-end testing on temporary URLs before cutover.
Test Scenarios:

1. User Authentication:
   - ❌ EXPECTED TO FAIL: Auth0 callback URLs still point to legacy
   - Create temporary Auth0 test application with .prod.camarades.net URLs
   - Test user registration, login, logout
   - Test password reset flow
   - Verify token validation on API

2. Project Management:
   - Create new project
   - Invite team members
   - Configure project settings
   - Verify database persistence

3. Study Screening:
   - Upload study list
   - Perform screening workflows
   - Test conflict resolution
   - Export screening results

4. Data Annotation:
   - Upload PDFs
   - Test annotation workflows
   - Test data extraction
   - Verify S3 upload notifications

5. Background Jobs (Quartz):
   - Verify scheduled jobs execute
   - Check RabbitMQ message processing
   - Monitor job logs

6. RabbitMQ:
   - Access management UI at https://rabbitmq.prod.camarades.net
   - Verify queues created
   - Check message flow between services
   - Monitor connection count

7. Documentation Access:
   - Test public user guide (no auth): https://help.prod.camarades.net
   - Test team docs (GitHub OAuth): https://docs.prod.camarades.net
   - Verify OAuth2 Proxy redirects to GitHub
   - Confirm only camaradesuk org members can access

8. Performance Testing:
   - Load testing on API endpoints
   - Concurrent user simulation
   - Database query performance
   - Memory and CPU utilization

9. Monitoring and Logging:
   - Check ArgoCD Application health
   - Review pod logs for errors
   - Monitor resource usage
   - Test alerting (if configured)
Testing Checklist:
### Authentication & Authorization
- [ ] User registration works
- [ ] User login works (with test Auth0 app)
- [ ] Token validation on API works
- [ ] GitHub OAuth for docs works (camaradesuk org only)
- [ ] Logout works
### Core Functionality
- [ ] Project creation works
- [ ] Study upload works
- [ ] Screening workflow works
- [ ] Annotation workflow works
- [ ] Data export works
- [ ] PDF upload to S3 works
- [ ] S3 Lambda notification works
### Background Processing
- [ ] Quartz jobs execute on schedule
- [ ] RabbitMQ messages processed
- [ ] Job logs accessible
- [ ] Failed jobs handled correctly
### Infrastructure
- [ ] All ingresses accessible
- [ ] TLS certificates valid
- [ ] DNS records stable
- [ ] Redirect works (app.prod → syrf.prod)
- [ ] RabbitMQ management UI accessible
### Performance
- [ ] API response times acceptable (<500ms p95)
- [ ] No memory leaks detected
- [ ] No CPU throttling
- [ ] Database queries performant
### Documentation
- [ ] User guide loads correctly
- [ ] Team docs requires GitHub auth
- [ ] All pages render properly
- [ ] Search functionality works
Issue Tracking: Document any issues found in GitHub issues with label migration-testing.
Success Criteria:

- ✅ All test scenarios pass (except Auth0 with legacy URLs)
- ✅ No critical bugs found
- ✅ Performance meets requirements
- ✅ Team sign-off on production readiness
Phase 4: Update Auth0 Configuration¶
Duration: 1-2 hours
Risk: MEDIUM (breaking change for authentication)
Rollback: Revert Auth0 configuration changes
Objective: Migrate Auth0 from legacy API URL (syrf-api.syrf.org.uk) to new URL (api.syrf.org.uk).
Prerequisites:

- ✅ Phase 3 complete (testing passed)
- ✅ Auth0 admin access
- ✅ Backup of current Auth0 configuration
Steps:

1. Backup Current Auth0 Configuration:
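One way to capture a restorable snapshot is the Auth0 Deploy CLI. The `a0deploy` invocation below is a sketch: it assumes the CLI is installed and that a `config.json` with tenant credentials exists, so verify both before relying on it:

```shell
# Date-stamped folder so successive backups don't overwrite each other.
backup_dir() {
  echo "auth0-backup-$1"
}

DIR=$(backup_dir "$(date +%Y%m%d)")
mkdir -p "$DIR"
echo "exporting tenant config to $DIR"

# Export the full tenant configuration (uncomment to run):
# a0deploy export --config_file config.json --format yaml --output_folder "$DIR"
```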
2. Update API Audience:

   - Login to Auth0 Dashboard: https://manage.auth0.com
   - Navigate to Applications → APIs
   - Find SyRF API
   - Update Identifier from https://syrf-api.syrf.org.uk to https://api.syrf.org.uk
   - Save changes
3. Update Web Application Settings:

   - Navigate to Applications → Applications
   - Find SyRF Web Application
   - Update Allowed Callback URLs:
   - Update Allowed Logout URLs:
   - Update Allowed Web Origins:
   - Save changes
4. Update API Application Settings (if separate):

   - Same callback URL updates as web application
   - Ensure API audience references new URL
5. Test Authentication on Testing URLs:
6. Verify No Impact on Legacy Cluster:
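Since DNS has not yet been cut over, the live production hostnames should still resolve to the legacy ingress IP. A sketch using the IPs from the DNS reference tables below (running the resolution requires network access and `dig`):

```shell
LEGACY_IP="35.246.22.231"

# Warn if a hostname has drifted away from the legacy cluster.
still_on_legacy() {
  if [ "$2" = "$LEGACY_IP" ]; then
    echo "OK   $1 still on legacy"
  else
    echo "WARN $1 -> ${2:-<no answer>}"
  fi
}

# Against live DNS (uncomment to run):
# for h in syrf.org.uk syrf-api.syrf.org.uk; do
#   still_on_legacy "$h" "$(dig +short "$h" @8.8.8.8 | head -n1)"
# done

still_on_legacy syrf.org.uk "$LEGACY_IP"
```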
Success Criteria:
- ✅ Auth0 API audience updated to https://api.syrf.org.uk
- ✅ Authentication works on testing URLs
- ✅ Token validation works on new API
- ✅ Legacy cluster auth still works (or documented as breaking)
Rollback:
1. Revert Auth0 API audience to https://syrf-api.syrf.org.uk
2. Revert callback URLs to legacy URLs
3. Test legacy cluster authentication restored
Phase 5: DNS Cutover (Production Migration)¶
Duration: 2-4 hours (nighttime recommended)
Downtime: 5-15 minutes during DNS propagation
Risk: HIGH (production impact)
Rollback: Revert DNS records (5-15 min recovery)
Objective: Switch production DNS from legacy cluster to new cluster.
Prerequisites:

- ✅ All previous phases complete
- ✅ Team availability for monitoring
- ✅ Communication sent to users (maintenance window)
- ✅ Rollback plan tested
- ✅ All stakeholders notified
Pre-Cutover Checklist:
### Infrastructure
- [ ] New cluster all services Healthy in ArgoCD
- [ ] External-DNS running and stable (no crashes)
- [ ] Cert-manager issuing certificates successfully
- [ ] Ingress-nginx LoadBalancer stable
- [ ] RabbitMQ queues empty or processable on new cluster
### Configuration
- [ ] Auth0 updated for new URLs
- [ ] Production ingress values ready (final .syrf.org.uk URLs)
- [ ] All secrets synced via External Secrets
- [ ] Database connection strings verified
### Data
- [ ] Latest database backup completed
- [ ] S3 Lambda function pointing to new RabbitMQ
- [ ] PVs ready for production data
### Team
- [ ] On-call engineer available
- [ ] Rollback procedure documented and reviewed
- [ ] Communication sent to users
- [ ] Monitoring dashboards ready
Cutover Steps:
1. Update Production Ingress Values (30 minutes before cutover):

```bash
cd /home/chris/workspace/syrf/cluster-gitops/syrf/environments/production

# Update all ingress hosts from .prod.camarades.net to final URLs
# API:    api.prod.camarades.net    → api.syrf.org.uk
# Web:    syrf.prod.camarades.net   → syrf.org.uk
# PM:     pm.prod.camarades.net     → pm.syrf.org.uk
# Quartz: quartz.prod.camarades.net → quartz.syrf.org.uk
# Help:   help.prod.camarades.net   → help.syrf.org.uk
# Docs:   docs.prod.camarades.net   → docs.syrf.org.uk

# Example for API:
vim api/values.yaml
# Change:
#   hosts:
#     - host: api.prod.camarades.net
# To:
#   hosts:
#     - host: api.syrf.org.uk

git add syrf/environments/production/
git commit -m "feat(production): update ingress URLs to final production domains"
git push

# Wait for ArgoCD sync
# New ingresses created, old .prod.camarades.net ingresses remain
```
2. Update Auth0 Callback URLs for Final Production (15 minutes before):

   - Update Allowed Callback URLs to: https://api.syrf.org.uk/authentication/signin-oidc, https://syrf.org.uk/authentication/signin-oidc
   - Update Allowed Logout URLs to: https://api.syrf.org.uk/authentication/signout-oidc, https://syrf.org.uk
   - Update Allowed Web Origins to: https://api.syrf.org.uk, https://syrf.org.uk
3. Transfer TXT Ownership (start of cutover window):

This is the KEY step that allows the new cluster's external-dns to manage the production DNS records.
```bash
# Connect to new cluster
kubectl config use-context <new-cluster-context>

# Verify external-dns is using txt-owner-id=camaradesuk-new
kubectl get deployment external-dns -n external-dns -o yaml | grep txt-owner-id
# Should show: - --txt-owner-id=camaradesuk-new

# External-DNS will automatically detect new ingresses and create DNS records
# BUT old cluster still "owns" the records via TXT registry
# We need to manually update TXT ownership records in Cloud DNS
gcloud dns record-sets update syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=TXT \
  --rrdatas="\"heritage=external-dns,external-dns/owner=camaradesuk-new,external-dns/resource=ingress/syrf-production/syrf-web\""

# Repeat for all production domains:
# - syrf.org.uk
# - app.syrf.org.uk
# - api.syrf.org.uk (NEW)
# - pm.syrf.org.uk (NEW)
# - quartz.syrf.org.uk
# - help.syrf.org.uk
# - docs.syrf.org.uk (NEW)
```
Alternative: Delete old records and let new external-dns recreate (faster but higher risk):
```bash
# Delete old A records (will cause brief outage during recreation)
gcloud dns record-sets delete syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A

# Wait 60 seconds for new external-dns to detect and recreate
# Watch external-dns logs:
kubectl logs -n external-dns -l app.kubernetes.io/name=external-dns -f
# Should see: "Add records: syrf.org.uk. A [34.13.63.21] 300"
```
4. Update DNS A Records:
Option A: Manual Update (Safer):
```bash
# Update each A record from old IP to new IP

# syrf.org.uk: 35.246.22.231 → 34.13.63.21
gcloud dns record-sets update syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# app.syrf.org.uk: 35.246.22.231 → 34.13.63.21
gcloud dns record-sets update app.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# api.syrf.org.uk: CREATE NEW (legacy was syrf-api.syrf.org.uk)
gcloud dns record-sets create api.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# pm.syrf.org.uk: CREATE NEW (legacy was syrf-projectmanagement.syrf.org.uk)
gcloud dns record-sets create pm.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# quartz.syrf.org.uk: 35.246.22.231 → 34.13.63.21
gcloud dns record-sets update quartz.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# help.syrf.org.uk: 35.246.22.231 → 34.13.63.21
gcloud dns record-sets update help.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# docs.syrf.org.uk: CREATE NEW
gcloud dns record-sets create docs.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# rabbitmq.camarades.net: CREATE NEW
gcloud dns record-sets create rabbitmq.camarades.net. \
  --zone=camarades-net-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300
```
Option B: Let External-DNS Manage (Recommended if TXT ownership transferred):
```bash
# If TXT ownership was successfully transferred, external-dns will
# automatically update the A records within 60 seconds.

# Watch external-dns logs:
kubectl logs -n external-dns -l app.kubernetes.io/name=external-dns -f
# Should see:
# "Update records: syrf.org.uk. A [35.246.22.231] -> [34.13.63.21]"
# "Add records: api.syrf.org.uk. A [34.13.63.21] 300"
# "Add records: pm.syrf.org.uk. A [34.13.63.21] 300"
# etc.
```
5. Monitor DNS Propagation (5-15 minutes):

```bash
# Check each domain resolves to new IP
for host in syrf.org.uk app.syrf.org.uk api.syrf.org.uk pm.syrf.org.uk quartz.syrf.org.uk help.syrf.org.uk docs.syrf.org.uk; do
  echo "Checking $host..."
  dig $host +short @8.8.8.8
done
# All should return: 34.13.63.21

# Test from local resolver (may be cached)
dig syrf.org.uk +short
# If showing old IP, wait for TTL expiry (300 seconds = 5 minutes)
```
Test Production Services:
# Test each production URL curl -I https://syrf.org.uk curl -I https://app.syrf.org.uk # Should redirect to syrf.org.uk curl -I https://api.syrf.org.uk/health curl -I https://pm.syrf.org.uk/health curl -I https://quartz.syrf.org.uk curl -I https://help.syrf.org.uk curl -I https://docs.syrf.org.uk # Should require GitHub auth curl -I https://rabbitmq.camarades.net # All should return 200 or 30x (redirects) -
7. Test User Authentication:

   - Login to https://syrf.org.uk
   - Verify Auth0 login works
   - Verify token validation on API
   - Test full user workflow

8. Monitor for 1-2 Hours:

```bash
# Watch ArgoCD Applications
kubectl get applications -n argocd -w

# Watch pod logs for errors
kubectl logs -n syrf-production -l app=syrf-api --tail=50 -f

# Watch ingress-nginx logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/component=controller --tail=50 -f

# Monitor RabbitMQ queues
# Access https://rabbitmq.camarades.net
```
Success Criteria:

- ✅ All DNS records resolve to new cluster (34.13.63.21)
- ✅ All production URLs accessible
- ✅ User authentication works
- ✅ No errors in application logs
- ✅ RabbitMQ processing messages
- ✅ No user-reported issues
Rollback Procedure (if issues occur):

1. Revert DNS A Records:

```bash
# Revert each record to old cluster IP
for host in syrf.org.uk app.syrf.org.uk quartz.syrf.org.uk help.syrf.org.uk; do
  gcloud dns record-sets update ${host}. \
    --zone=syrf-org-uk-zone \
    --type=A \
    --rrdatas=35.246.22.231 \
    --ttl=60  # Short TTL for faster propagation
done

# Delete new records
gcloud dns record-sets delete api.syrf.org.uk. --zone=syrf-org-uk-zone --type=A
gcloud dns record-sets delete pm.syrf.org.uk. --zone=syrf-org-uk-zone --type=A
gcloud dns record-sets delete docs.syrf.org.uk. --zone=syrf-org-uk-zone --type=A
```
2. Revert Auth0 (if needed):

   - Restore callback URLs to legacy syrf-api.syrf.org.uk
   - Restore API audience

3. Monitor Recovery:

4. Post-Rollback:

   - Document what went wrong
   - Fix issues in new cluster
   - Reschedule cutover
Time to Rollback: 5-15 minutes (DNS TTL)
Phase 6: Decommission Legacy Cluster¶
Duration: 1 week monitoring + 1 day teardown
Risk: LOW (after stability period)
Rollback: IMPOSSIBLE after cluster deletion
Objective: Safely decommission legacy Jenkins X cluster after confirming new cluster stability.
Prerequisites:

- ✅ Phase 5 complete (DNS cutover successful)
- ✅ New cluster running in production for 1+ week
- ✅ No critical issues reported
- ✅ All data backed up
Stability Monitoring Period (1 week):
### Daily Checklist
- [ ] No user-reported issues
- [ ] All services Healthy in ArgoCD
- [ ] No error spikes in logs
- [ ] Performance metrics within acceptable range
- [ ] Database queries performant
- [ ] RabbitMQ queues processing normally
- [ ] No memory leaks detected
- [ ] No certificate expiry issues
### Weekly Review
- [ ] Team consensus: new cluster is stable
- [ ] All stakeholders approve decommissioning
- [ ] Backup plan confirmed if issues arise
Decommissioning Steps:
1. Export Critical Data from Legacy Cluster:

```bash
# Connect to legacy cluster
kubectl config use-context <legacy-context>

# Export ConfigMaps (if any needed)
kubectl get configmaps -A -o yaml > legacy-configmaps-backup.yaml

# Export Secrets (if any needed)
kubectl get secrets -A -o yaml > legacy-secrets-backup.yaml

# List PVs (DO NOT DELETE - contains package data)
kubectl get pv -o yaml > legacy-pvs-backup.yaml
```
2. Scale Down Legacy Workloads:

```bash
# Scale down all deployments to 0 replicas
# Keep PVs intact
kubectl scale deployment --all --replicas=0 -n jx-production
kubectl scale deployment --all --replicas=0 -n jx-staging
kubectl scale deployment --all --replicas=0 -n jx

# Verify all pods gone
kubectl get pods -A
# Should show no app pods running
```
3. Monitor for 24 Hours:

   - Ensure no services trying to connect to old cluster
   - Verify new cluster handling all traffic
   - Check for any unexpected issues
4. Delete Legacy DNS Records:

```bash
# Delete old production records that are no longer used
gcloud dns record-sets delete syrf-api.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A

gcloud dns record-sets delete syrf-projectmanagement.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A

gcloud dns record-sets delete rabbitmq-stats.camarades.net. \
  --zone=camarades-net-zone \
  --type=A

# Delete legacy infrastructure URLs
for host in dashboard-jx lighthouse-jx hook-jx chartmuseum-jx nexus-jx; do
  gcloud dns record-sets delete ${host}.camarades.net. \
    --zone=camarades-net-zone \
    --type=A
done
```
5. Delete Temporary Testing DNS Records:
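The eight temporary records can be removed in one reviewed batch. This sketch prints the `gcloud` commands rather than running them, so they can be inspected before piping to `sh`:

```shell
# Emit one delete command per temporary testing record.
temp_record_deletes() {
  for h in syrf app api pm quartz help docs rabbitmq; do
    echo "gcloud dns record-sets delete ${h}.prod.camarades.net. --zone=camarades-net-zone --type=A"
  done
}

temp_record_deletes
# Review the output, then execute:
# temp_record_deletes | sh
```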
6. Delete GKE Cluster (FINAL STEP - IRREVERSIBLE):

```bash
# FINAL WARNING: This cannot be undone
# Verify all data backed up
# Verify PVs for package storage are exported/backed up

# Delete cluster via Terraform (if managed)
cd /path/to/camarades-infrastructure/terraform-legacy
terraform destroy

# Or via gcloud
gcloud container clusters delete <legacy-cluster-name> \
  --zone=<zone> \
  --quiet

# This will delete:
# - GKE cluster
# - Node pools
# - Load balancers
# - Disk resources
# BUT: PVs may be retained depending on retention policy
```
7. Cost Verification:
Success Criteria:

- ✅ Legacy cluster fully scaled down
- ✅ Legacy DNS records deleted
- ✅ GKE cluster deleted
- ✅ Cost savings confirmed (~50% reduction)
- ✅ New cluster stable and performant
- ✅ Package data preserved (PVs backed up)
CRITICAL WARNINGS:

- ⚠️ DO NOT DELETE PVs with package data without explicit backup
- ⚠️ CANNOT ROLLBACK after cluster deletion
- ⚠️ VERIFY DATA BACKUP before final deletion
- ⚠️ TEAM CONSENSUS required before deletion
Auth0 Configuration Reference¶
Current Production Settings¶
Tenant: syrf
Domain: signin.syrf.org.uk
Region: eu
Web Application (SPA):
- Client ID: UYpAGmQq1leH2HNh6DTVXer5PwQRypyU
- Application Type: Single Page Application
- Allowed Callback URLs:
- https://syrf.org.uk/authentication/signin-oidc
- https://app.syrf.org.uk/authentication/signin-oidc
- Allowed Logout URLs:
- https://syrf.org.uk
- https://app.syrf.org.uk
- Allowed Web Origins:
- https://syrf.org.uk
- https://api.syrf.org.uk
API Application:
- Client ID: 9BNWn0FrYRTQ1kHg1KpktRBU1tVA1ZKf
- Client Secret: Stored in GCP Secret Manager (camarades-auth0)
- API Audience: https://api.syrf.org.uk (UPDATED in Phase 4)
Migration Updates Required¶
Phase 4 (Testing):
- Add https://api.prod.camarades.net/authentication/signin-oidc to Allowed Callback URLs
- Add https://syrf.prod.camarades.net/authentication/signin-oidc to Allowed Callback URLs
- Update API Audience to https://api.syrf.org.uk
Phase 5 (Production):
- Replace all .prod.camarades.net URLs with final .syrf.org.uk URLs
- Remove legacy syrf-api.syrf.org.uk references
DNS Record Reference¶
Production Domain (syrf.org.uk)¶
Zone: syrf-org-uk-zone
| Record | Type | Current (Legacy) | Target (New) | Change Type |
|---|---|---|---|---|
| syrf.org.uk | A | 35.246.22.231 | 34.13.63.21 | Update |
| app.syrf.org.uk | A | 35.246.22.231 | 34.13.63.21 | Update |
| syrf-api.syrf.org.uk | A | 35.246.22.231 | DELETE | Delete |
| api.syrf.org.uk | A | (none) | 34.13.63.21 | Create |
| syrf-projectmanagement.syrf.org.uk | A | 35.246.22.231 | DELETE | Delete |
| pm.syrf.org.uk | A | (none) | 34.13.63.21 | Create |
| quartz.syrf.org.uk | A | 35.246.22.231 | 34.13.63.21 | Update |
| help.syrf.org.uk | A | 35.246.22.231 | 34.13.63.21 | Update |
| docs.syrf.org.uk | A | (none) | 34.13.63.21 | Create |
Infrastructure Domain (camarades.net)¶
Zone: camarades-net-zone
| Record | Type | Current (Legacy) | Target (New) | Change Type |
|---|---|---|---|---|
| argocd.camarades.net | A | (none) | 34.13.63.21 | Existing (new cluster) |
| rabbitmq-stats.camarades.net | A | 35.246.22.231 | DELETE | Delete |
| rabbitmq.camarades.net | A | (none) | 34.13.63.21 | Create |
Temporary Testing URLs (camarades.net)¶
Zone: camarades-net-zone
All .prod.camarades.net records are temporary and will be deleted after Phase 5:
| Record | Type | IP | Lifecycle |
|---|---|---|---|
| syrf.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| app.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| api.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| pm.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| quartz.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| help.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| docs.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| rabbitmq.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
Risk Mitigation¶
High Risks¶
Risk: Auth0 API URL change breaks authentication
Impact: Users cannot login
Mitigation:
- Test thoroughly on .prod.camarades.net URLs before cutover
- Have Auth0 rollback procedure ready
- Update during low-traffic period
Recovery: Revert Auth0 config (5-10 minutes)
Risk: DNS propagation delays causing service disruption
Impact: Some users unable to access services
Mitigation:
- Use short TTL (300 seconds)
- Perform during low-traffic period (nighttime)
- Monitor DNS propagation globally
Recovery: Wait for DNS TTL expiry (5 minutes max)
Risk: Database connection issues during cutover
Impact: Services cannot connect to MongoDB
Mitigation:
- Verify connection strings in advance
- Test database access on .prod.camarades.net
- Have rollback plan ready
Recovery: Revert DNS, verify legacy cluster connections
Medium Risks¶
Risk: TLS certificate provisioning delays
Impact: HTTPS not available immediately
Mitigation:
- Pre-create certificates for .prod.camarades.net
- Monitor cert-manager before cutover
- Have manual certificate process ready
Recovery: Use manual ACME challenge if needed
Risk: External-DNS ownership conflicts
Impact: DNS records not managed correctly
Mitigation:
- Transfer TXT ownership before A record updates
- Monitor external-dns logs during cutover
- Have manual DNS management procedure ready
Recovery: Manually manage DNS records via gcloud
Risk: RabbitMQ message queue migration
Impact: Messages lost during transition
Mitigation:
- Ensure queues empty before cutover
- Monitor queue depths during migration
- Test message flow on new cluster
Recovery: Drain legacy queues, reprocess on new cluster
Low Risks¶
Risk: Static asset caching issues
Impact: Users see stale frontend assets
Mitigation:
- Use cache-busting query parameters
- Clear CDN cache if applicable
Recovery: Users can hard refresh (Ctrl+F5)
Risk: User bookmarks need updating
Impact: Bookmarks to legacy URLs not working
Mitigation:
- Maintain app.syrf.org.uk redirect
- Document new URLs in release notes
Recovery: User re-bookmarks new URLs
Risk: Documentation link updates needed
Impact: Internal docs reference old URLs
Mitigation:
- Search and replace old URLs before cutover
- Update wiki/confluence pages
Recovery: Update links post-cutover
Communication Plan¶
Pre-Migration Communication¶
Audience: All SyRF users
Subject: Scheduled Maintenance - SyRF Platform Migration
Message:
Dear SyRF Users,
We are upgrading the SyRF platform infrastructure to improve reliability
and performance, and to reduce costs.
Scheduled Maintenance Window:
Date: [DATE]
Time: [TIME] - [TIME] [TIMEZONE]
Expected Downtime: 5-15 minutes
During this window:
- You may experience brief service interruption
- In-progress work will be saved
- No data loss expected
After the migration:
- Faster performance
- Improved reliability
- New team documentation site: https://docs.syrf.org.uk
Please save your work before the maintenance window.
If you experience issues after the migration, please contact:
[SUPPORT EMAIL]
Thank you for your patience.
The SyRF Team
During Migration Communication¶
Slack/Teams Channel: Post real-time updates
[TIMESTAMP] Migration started
[TIMESTAMP] DNS records updated
[TIMESTAMP] Testing authentication
[TIMESTAMP] Migration complete - please test
[TIMESTAMP] All systems nominal
Post-Migration Communication¶
Audience: All SyRF users
Subject: SyRF Platform Migration Complete
Message:
Dear SyRF Users,
The SyRF platform migration is complete. All services are now running
on our new infrastructure.
You may notice:
- Slightly different URLs for API (api.syrf.org.uk instead of syrf-api.syrf.org.uk)
- Faster page load times
- Improved reliability
If you bookmarked the old URLs, they will continue to work via redirects.
Please report any issues to [SUPPORT EMAIL].
Thank you for your patience during the migration.
The SyRF Team
Success Metrics¶
Technical Metrics¶
- Uptime: >99.9% after migration
- Response Time: API p95 <500ms
- Error Rate: <0.1% of requests
- DNS Resolution: <100ms globally
- Certificate Validity: All certs valid for >60 days
Business Metrics¶
- Cost Reduction: ~50% infrastructure cost savings
- User Impact: <1% user-reported issues post-migration
- Downtime: <15 minutes during cutover
- Rollback Usage: 0 rollbacks required
Timeline Metrics¶
- Phase 1: Complete within 3 hours
- Phase 2: Complete within 2 days
- Phase 3: Complete within 5 days
- Phase 4: Complete within 2 hours
- Phase 5: Complete within 4 hours
- Phase 6: Complete within 2 weeks
Total Migration Time: 3-4 weeks from Phase 1 start to Phase 6 completion.
Related Documentation¶
- URL Migration Map - Single source of truth for all URLs
- Cluster Bootstrap Guide - Initial cluster setup
- DNS Management Guide - External-DNS operations
- ADR-003: Cluster Architecture - Architecture decisions
History¶
- 2025-11-16: Initial migration plan created