
URL Migration Plan - Legacy to New Cluster

Overview

This document provides the complete migration plan for transitioning SyRF production services from the legacy Jenkins X cluster to the new ArgoCD cluster. The migration includes URL changes and requires careful coordination to minimize downtime.

Single Source of Truth: See url-migration-map.yaml for the authoritative mapping of all URLs.

Migration Strategy

Approach: Parallel clusters with temporary testing URLs, followed by DNS cutover

Key Principles:

- Minimize production downtime (target: <15 minutes)
- Maintain rollback capability at each phase
- Test thoroughly before cutover
- Preserve data integrity (no PV deletions)
- Aggressive timeline to reduce dual-cluster costs

URL Changes Summary

Services with URL Changes (BREAKING)

| Service | Legacy URL | New URL | Impact |
|---|---|---|---|
| API | syrf-api.syrf.org.uk | api.syrf.org.uk | High - Auth0 update required |
| Project Management | syrf-projectmanagement.syrf.org.uk | pm.syrf.org.uk | Medium - check frontend for hardcoded URLs |
| RabbitMQ | rabbitmq-stats.camarades.net | rabbitmq.camarades.net | Low - admin access only |

Services with Consistent URLs

| Service | URL | Change |
|---|---|---|
| Web | syrf.org.uk | None - same URL |
| Web Redirect | app.syrf.org.uk → syrf.org.uk | None - redirect maintained |
| Quartz | syrf-quartz.syrf.org.uk → quartz.syrf.org.uk | Prefix only |
| User Guide | help.syrf.org.uk | None - same URL |

New Services

| Service | URL | Description |
|---|---|---|
| Team Docs | docs.syrf.org.uk | Internal documentation with GitHub OAuth |

Temporary Testing URLs

During the migration, production namespace services will be accessible on temporary .prod.camarades.net URLs:

| Service | Testing URL | Purpose |
|---|---|---|
| Web | syrf.prod.camarades.net | Test production deployment |
| Web Redirect | app.prod.camarades.net → syrf.prod.camarades.net | Test redirect behavior |
| API | api.prod.camarades.net | Test new API URL before Auth0 update |
| PM | pm.prod.camarades.net | Test project management |
| Quartz | quartz.prod.camarades.net | Test background jobs |
| User Guide | help.prod.camarades.net | Test documentation |
| Team Docs | docs.prod.camarades.net | Test OAuth2 Proxy integration |
| RabbitMQ | rabbitmq.prod.camarades.net | Test RabbitMQ instance |

Note: These temporary URLs will be deleted after successful cutover.

Migration Phases

Phase 1: Fix DNS Conflict (CRITICAL - BLOCKING)

Duration: 2-3 hours
Risk: Low
Rollback: Easy - revert external-dns config

Problem: Both clusters are using txt-owner-id: default, so each external-dns instance believes it owns the same records, causing DNS records to be repeatedly created and deleted.

Steps:

  1. Update New Cluster External-DNS (via GitOps):

    cd /home/chris/workspace/syrf/cluster-gitops
    
    # Edit plugins/helm/external-dns/values.yaml
    # Add: txtOwnerId: camaradesuk-new
    
    git add plugins/helm/external-dns/values.yaml
    git commit -m "fix(external-dns): set unique txt-owner-id for new cluster"
    git push
    
    # Wait for ArgoCD to sync (3 minutes or trigger manually)
    kubectl get pods -n external-dns -w
    

  2. Update Old Cluster External-DNS (via kubectl - one-time):

    # Connect to legacy Jenkins X cluster
    kubectl config use-context <legacy-context>
    
    # Edit external-dns deployment
    kubectl edit deployment external-dns -n jx
    
    # Add to args:
    - --txt-owner-id=camarades-legacy
    
    # Save and exit
    # Wait for pod restart
    kubectl get pods -n jx -l app=external-dns -w
    

  3. Verify Fix:

    # Wait 60 seconds for both external-dns instances to sync
    sleep 60
    
    # Check DNS records persist (should stay stable)
    gcloud dns record-sets list --zone=camarades-net-zone | grep argocd
    
    # Should see:
    # argocd.camarades.net. A 300 34.13.63.21
    # argocd.camarades.net. TXT 300 "heritage=external-dns,external-dns/owner=camaradesuk-new..."
    
    # Verify DNS resolution works
    dig argocd.camarades.net +short
    # Should return: 34.13.63.21
    
    # Wait 5 minutes and check again - should remain stable
    sleep 300
    dig argocd.camarades.net +short
    # Should still return: 34.13.63.21
    

  4. Test ArgoCD Access:

    # ArgoCD UI should now be accessible
    curl -I https://argocd.camarades.net
    # Should return 200 or redirect (not connection refused)
    

Success Criteria:

- ✅ DNS records remain stable for 24 hours
- ✅ ArgoCD UI accessible at https://argocd.camarades.net
- ✅ No more DNS record deletions in change history

Rollback: Revert both external-dns configurations to original state.


Phase 2: Deploy Production Namespace with Testing URLs

Duration: 1-2 days
Risk: Low (isolated testing environment)
Rollback: Easy - delete production namespace

Objective: Deploy all services to syrf-production namespace with temporary .prod.camarades.net URLs.

Prerequisites:

- ✅ Phase 1 complete (DNS conflict fixed)
- ✅ All staging services healthy
- ✅ External Secrets Operator working

Steps:

  1. Create Missing Production Config Files:

    cd /home/chris/workspace/syrf/cluster-gitops/syrf/environments/production
    
    # User-guide (missing production config)
    mkdir -p user-guide
    cat > user-guide/config.yaml <<EOF
    serviceName: user-guide
    envName: production
    service:
      enabled: true
      chartTag: user-guide-v1.0.0  # Replace with actual latest version
    EOF
    
    cat > user-guide/values.yaml <<EOF
    # Temporary testing URL
    ingress:
      enabled: true
      className: nginx
      hosts:
        - host: help.prod.camarades.net
          paths:
            - path: /
              pathType: Prefix
      tls:
        - secretName: user-guide-prod-tls
          hosts:
            - help.prod.camarades.net
    
    # Production resource limits (to be tuned)
    resources:
      requests:
        memory: "128Mi"
        cpu: "100m"
      limits:
        memory: "256Mi"
        cpu: "200m"
    EOF
    

  2. Update Production Ingress Values for All Services:

API (production/api/values.yaml):

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: api.prod.camarades.net  # Temporary testing URL
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: api-prod-tls
      hosts:
        - api.prod.camarades.net

Web (production/web/values.yaml):

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: syrf.prod.camarades.net  # Temporary testing URL
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: web-prod-tls
      hosts:
        - syrf.prod.camarades.net

# Add redirect ingress for app.prod.camarades.net
redirectIngress:
  enabled: true
  className: nginx
  hosts:
    - host: app.prod.camarades.net
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: web-app-redirect-prod-tls
      hosts:
        - app.prod.camarades.net
  annotations:
    nginx.ingress.kubernetes.io/server-snippet: |
      return 301 https://syrf.prod.camarades.net$request_uri;
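
One caveat with the redirect above: newer ingress-nginx releases ship with snippet annotations disabled, so the server-snippet may be silently ignored. If the controller is installed via Helm, enabling them looks roughly like this (a sketch; confirm the value name against the chart version in use):

```yaml
# ingress-nginx Helm values (sketch) -- required for the
# nginx.ingress.kubernetes.io/server-snippet annotation to take effect
# on controller versions where snippets are disabled by default
controller:
  allowSnippetAnnotations: true
```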

Project Management (production/project-management/values.yaml):

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: pm.prod.camarades.net  # Temporary testing URL
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: pm-prod-tls
      hosts:
        - pm.prod.camarades.net

Quartz (production/quartz/values.yaml):

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: quartz.prod.camarades.net  # Temporary testing URL
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: quartz-prod-tls
      hosts:
        - quartz.prod.camarades.net

Docs (already configured in production/docs/values.yaml):

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: docs.prod.camarades.net  # Update from docs.syrf.org.uk
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: docs-prod-tls
      hosts:
        - docs.prod.camarades.net

  3. Update RabbitMQ Ingress (plugins/helm/rabbitmq/values.yaml):

    ingress:
      enabled: true
      className: nginx
      hostname: rabbitmq.prod.camarades.net  # Temporary testing URL
      tls: true
    

  4. Commit and Push Changes:

    git add syrf/environments/production/
    git add plugins/helm/rabbitmq/values.yaml
    git commit -m "feat(production): configure temporary testing URLs for production namespace
    
    - Add production config for user-guide
    - Update all service ingresses to use .prod.camarades.net URLs
    - Configure web app.prod.camarades.net redirect
    - Update RabbitMQ ingress
    
    Refs: url-migration-plan.md Phase 2"
    git push
    

  5. Wait for ArgoCD Sync:

    # Watch Applications sync
    kubectl get applications -n argocd -w
    
    # All syrf-production-* apps should become Healthy
    

  6. Verify DNS Records Created:

    # Check all temporary testing URLs
    for host in syrf api app pm quartz help docs rabbitmq; do
      echo "Checking ${host}.prod.camarades.net..."
      dig ${host}.prod.camarades.net +short
    done
    
    # All should return: 34.13.63.21
    

  7. Verify TLS Certificates Issued:

    # Check all certificates
    kubectl get certificate -n syrf-production
    kubectl get certificate -n rabbitmq
    
    # All should show READY=True within 5 minutes
    

  8. Test Service Access:

    # Test each endpoint
    curl -I https://syrf.prod.camarades.net
    curl -I https://app.prod.camarades.net  # Should redirect
    curl -I https://api.prod.camarades.net/health
    curl -I https://pm.prod.camarades.net/health
    curl -I https://quartz.prod.camarades.net
    curl -I https://help.prod.camarades.net
    curl -I https://docs.prod.camarades.net  # Should require GitHub auth
    curl -I https://rabbitmq.prod.camarades.net
    

Success Criteria:

- ✅ All services deployed to syrf-production namespace
- ✅ All ingresses created with .prod.camarades.net URLs
- ✅ All DNS records point to 34.13.63.21
- ✅ All TLS certificates issued successfully
- ✅ All services respond to HTTP requests

Rollback: Delete production namespace or disable services in ApplicationSet.


Phase 3: Test Production Services

Duration: 3-5 days
Risk: Low (isolated testing)
Rollback: N/A (testing only)

Objective: Comprehensive end-to-end testing on temporary URLs before cutover.

Test Scenarios:

  1. User Authentication:

     - EXPECTED TO FAIL initially: Auth0 callback URLs still point to legacy
     - Create a temporary Auth0 test application with .prod.camarades.net URLs
     - Test user registration, login, logout
     - Test password reset flow
     - Verify token validation on API

  2. Project Management:

     - Create new project
     - Invite team members
     - Configure project settings
     - Verify database persistence

  3. Study Screening:

     - Upload study list
     - Perform screening workflows
     - Test conflict resolution
     - Export screening results

  4. Data Annotation:

     - Upload PDFs
     - Test annotation workflows
     - Test data extraction
     - Verify S3 upload notifications

  5. Background Jobs (Quartz):

     - Verify scheduled jobs execute
     - Check RabbitMQ message processing
     - Monitor job logs

  6. RabbitMQ:

     - Access management UI at https://rabbitmq.prod.camarades.net
     - Verify queues created
     - Check message flow between services
     - Monitor connection count

  7. Documentation Access:

     - Test public user guide (no auth): https://help.prod.camarades.net
     - Test team docs (GitHub OAuth): https://docs.prod.camarades.net
     - Verify OAuth2 Proxy redirects to GitHub
     - Confirm only camaradesuk org members can access

  8. Performance Testing:

     - Load testing on API endpoints
     - Concurrent user simulation
     - Database query performance
     - Memory and CPU utilization

  9. Monitoring and Logging:

     - Check ArgoCD Application health
     - Review pod logs for errors
     - Monitor resource usage
     - Test alerting (if configured)

Testing Checklist:

### Authentication & Authorization
- [ ] User registration works
- [ ] User login works (with test Auth0 app)
- [ ] Token validation on API works
- [ ] GitHub OAuth for docs works (camaradesuk org only)
- [ ] Logout works

### Core Functionality
- [ ] Project creation works
- [ ] Study upload works
- [ ] Screening workflow works
- [ ] Annotation workflow works
- [ ] Data export works
- [ ] PDF upload to S3 works
- [ ] S3 Lambda notification works

### Background Processing
- [ ] Quartz jobs execute on schedule
- [ ] RabbitMQ messages processed
- [ ] Job logs accessible
- [ ] Failed jobs handled correctly

### Infrastructure
- [ ] All ingresses accessible
- [ ] TLS certificates valid
- [ ] DNS records stable
- [ ] Redirect works (app.prod → syrf.prod)
- [ ] RabbitMQ management UI accessible

### Performance
- [ ] API response times acceptable (<500ms p95)
- [ ] No memory leaks detected
- [ ] No CPU throttling
- [ ] Database queries performant

### Documentation
- [ ] User guide loads correctly
- [ ] Team docs requires GitHub auth
- [ ] All pages render properly
- [ ] Search functionality works
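The endpoint checks in this checklist can be scripted. A minimal smoke-check sketch in POSIX shell: the `acceptable` helper treats any 2xx/3xx status as "up", and which URLs count as health endpoints is an assumption to adjust per service:

```shell
# Smoke-check sketch for Phase 3. acceptable() classifies an HTTP status code;
# smoke() probes each URL with curl and counts failures.
acceptable() {  # usage: acceptable HTTP_CODE -> success for 2xx and 3xx
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

smoke() {  # usage: smoke URL... ; prints one line per URL, returns failure count
  fails=0
  for url in "$@"; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if acceptable "$code"; then
      echo "OK   $code $url"
    else
      echo "FAIL $code $url"
      fails=$((fails + 1))
    fi
  done
  return "$fails"
}

# Example (run from a workstation with network access):
# smoke https://syrf.prod.camarades.net https://api.prod.camarades.net/health
```

Running it daily during the testing window gives a quick regression signal between manual test passes.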

Issue Tracking: Document any issues found in GitHub issues with label migration-testing.

Success Criteria:

- ✅ All test scenarios pass (except Auth0 with legacy URLs)
- ✅ No critical bugs found
- ✅ Performance meets requirements
- ✅ Team sign-off on production readiness


Phase 4: Update Auth0 Configuration

Duration: 1-2 hours
Risk: MEDIUM (breaking change for authentication)
Rollback: Revert Auth0 configuration changes

Objective: Migrate Auth0 from legacy API URL (syrf-api.syrf.org.uk) to new URL (api.syrf.org.uk).

Prerequisites:

- ✅ Phase 3 complete (testing passed)
- ✅ Auth0 admin access
- ✅ Backup of current Auth0 configuration

Steps:

  1. Backup Current Auth0 Configuration:

    # Take screenshots of:
    # - Application settings (Allowed Callback URLs, Allowed Logout URLs, Allowed Web Origins)
    # - API Audience settings
    # - Rules/Actions that reference URLs
    

  2. Update API Audience:

     - Login to the Auth0 Dashboard: https://manage.auth0.com
     - Navigate to Applications → APIs
     - Find SyRF API
     - Update Identifier from https://syrf-api.syrf.org.uk to https://api.syrf.org.uk
     - Save changes

  3. Update Web Application Settings:

     - Navigate to Applications → Applications
     - Find SyRF Web Application
     - Update Allowed Callback URLs:
       https://api.prod.camarades.net/authentication/signin-oidc,
       https://syrf.prod.camarades.net/authentication/signin-oidc
     - Update Allowed Logout URLs:
       https://api.prod.camarades.net/authentication/signout-oidc,
       https://syrf.prod.camarades.net
     - Update Allowed Web Origins:
       https://api.prod.camarades.net,
       https://syrf.prod.camarades.net
     - Save changes

  4. Update API Application Settings (if separate):

     - Apply the same callback URL updates as the web application
     - Ensure the API audience references the new URL

  5. Test Authentication on Testing URLs:

    # Test user login on syrf.prod.camarades.net
    # Should successfully authenticate and receive token
    
    # Test API call with token
    curl -H "Authorization: Bearer <token>" \
      https://api.prod.camarades.net/api/projects
    
    # Should return 200 OK, not 401 Unauthorized
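
When debugging a 401 here, it can help to decode the token locally and confirm the `aud` claim actually carries the new audience. A hedged sketch, assuming a POSIX shell with base64(1); the token value is whatever your login flow returned:

```shell
# Decode the payload segment of a JWT (base64url, second dot-separated field)
# so the aud claim can be inspected without any external service.
jwt_payload() {  # usage: jwt_payload TOKEN
  p=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # restore standard base64 padding stripped by base64url encoding
  case $((${#p} % 4)) in
    2) p="${p}==" ;;
    3) p="${p}=" ;;
  esac
  printf '%s' "$p" | base64 -d
}

# usage: jwt_payload "$ACCESS_TOKEN" | grep '"aud"'
# The aud value should match the new audience, https://api.syrf.org.uk
```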
    

  6. Verify No Impact on Legacy Cluster:

    # Legacy cluster should STILL work if using old Auth0 app
    # Or temporarily update legacy to use new Auth0 config
    

Success Criteria:

- ✅ Auth0 API audience updated to https://api.syrf.org.uk
- ✅ Authentication works on testing URLs
- ✅ Token validation works on new API
- ✅ Legacy cluster auth still works (or documented as breaking)

Rollback:

1. Revert Auth0 API audience to https://syrf-api.syrf.org.uk
2. Revert callback URLs to legacy URLs
3. Confirm legacy cluster authentication is restored


Phase 5: DNS Cutover (Production Migration)

Duration: 2-4 hours (nighttime recommended)
Downtime: 5-15 minutes during DNS propagation
Risk: HIGH (production impact)
Rollback: Revert DNS records (5-15 min recovery)

Objective: Switch production DNS from legacy cluster to new cluster.

Prerequisites:

- ✅ All previous phases complete
- ✅ Team availability for monitoring
- ✅ Communication sent to users (maintenance window)
- ✅ Rollback plan tested
- ✅ All stakeholders notified

Pre-Cutover Checklist:

### Infrastructure
- [ ] New cluster all services Healthy in ArgoCD
- [ ] External-DNS running and stable (no crashes)
- [ ] Cert-manager issuing certificates successfully
- [ ] Ingress-nginx LoadBalancer stable
- [ ] RabbitMQ queues empty or processable on new cluster

### Configuration
- [ ] Auth0 updated for new URLs
- [ ] Production ingress values ready (final .syrf.org.uk URLs)
- [ ] All secrets synced via External Secrets
- [ ] Database connection strings verified

### Data
- [ ] Latest database backup completed
- [ ] S3 Lambda function pointing to new RabbitMQ
- [ ] PVs ready for production data

### Team
- [ ] On-call engineer available
- [ ] Rollback procedure documented and reviewed
- [ ] Communication sent to users
- [ ] Monitoring dashboards ready

Cutover Steps:

  1. Update Production Ingress Values (30 minutes before cutover):

    cd /home/chris/workspace/syrf/cluster-gitops/syrf/environments/production
    
    # Update all ingress hosts from .prod.camarades.net to final URLs
    # API: api.prod.camarades.net → api.syrf.org.uk
    # Web: syrf.prod.camarades.net → syrf.org.uk
    # PM: pm.prod.camarades.net → pm.syrf.org.uk
    # Quartz: quartz.prod.camarades.net → quartz.syrf.org.uk
    # Help: help.prod.camarades.net → help.syrf.org.uk
    # Docs: docs.prod.camarades.net → docs.syrf.org.uk
    
    # Example for API:
    vim api/values.yaml
    # Change:
    #   hosts:
    #     - host: api.prod.camarades.net
    # To:
    #   hosts:
    #     - host: api.syrf.org.uk
    
    git add syrf/environments/production/
    git commit -m "feat(production): update ingress URLs to final production domains"
    git push
    
    # Wait for ArgoCD sync
    # New ingresses created, old .prod.camarades.net ingresses remain
    

  2. Update Auth0 Callback URLs for Final Production (15 minutes before):

    Update Allowed Callback URLs to:
    https://api.syrf.org.uk/authentication/signin-oidc,
    https://syrf.org.uk/authentication/signin-oidc
    
    Update Allowed Logout URLs to:
    https://api.syrf.org.uk/authentication/signout-oidc,
    https://syrf.org.uk
    
    Update Allowed Web Origins to:
    https://api.syrf.org.uk,
    https://syrf.org.uk
    

  3. Transfer TXT Ownership (Start of cutover window):

This is the KEY step that allows new cluster's external-dns to manage the production DNS records.

# Connect to new cluster
kubectl config use-context <new-cluster-context>

# Verify external-dns is using txt-owner-id=camaradesuk-new
kubectl get deployment external-dns -n external-dns -o yaml | grep txt-owner-id
# Should show: - --txt-owner-id=camaradesuk-new

# External-DNS will automatically detect new ingresses and create DNS records
# BUT old cluster still "owns" the records via TXT registry

# We need to manually update TXT ownership records in Cloud DNS
gcloud dns record-sets update syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=TXT \
  --rrdatas="\"heritage=external-dns,external-dns/owner=camaradesuk-new,external-dns/resource=ingress/syrf-production/syrf-web\""

# Repeat for all production domains:
# - syrf.org.uk
# - app.syrf.org.uk
# - api.syrf.org.uk (NEW)
# - pm.syrf.org.uk (NEW)
# - quartz.syrf.org.uk
# - help.syrf.org.uk
# - docs.syrf.org.uk (NEW)
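
Rather than hand-typing seven nearly identical commands, the updates can be generated from one table and reviewed before running. A sketch; the `external-dns/resource` values (`ingress/syrf-production/<name>`) are assumptions, so read the real ones from the existing TXT records first:

```shell
# Generate the per-domain TXT-ownership update commands for review.
# usage: gen_txt_updates OWNER "host=resource"...
gen_txt_updates() {
  owner=$1; shift
  for entry in "$@"; do
    host=${entry%%=*}; resource=${entry#*=}
    printf 'gcloud dns record-sets update %s. --zone=syrf-org-uk-zone --type=TXT --rrdatas="\\"heritage=external-dns,external-dns/owner=%s,external-dns/resource=%s\\""\n' \
      "$host" "$owner" "$resource"
  done
}

# Review the output, then pipe to sh once it matches your zone:
gen_txt_updates camaradesuk-new \
  "syrf.org.uk=ingress/syrf-production/syrf-web" \
  "api.syrf.org.uk=ingress/syrf-production/syrf-api"
```

Printing the commands first keeps a human review step in what is otherwise the riskiest part of the cutover.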

Alternative: Delete old records and let new external-dns recreate (faster but higher risk):

# Delete old A records (will cause brief outage during recreation)
gcloud dns record-sets delete syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A

# Wait 60 seconds for new external-dns to detect and recreate
# Watch external-dns logs:
kubectl logs -n external-dns -l app.kubernetes.io/name=external-dns -f

# Should see: "Add records: syrf.org.uk. A [34.13.63.21] 300"

  4. Update DNS A Records:

Option A: Manual Update (Safer):

# Update each A record from old IP to new IP

# syrf.org.uk: 35.246.22.231 → 34.13.63.21
gcloud dns record-sets update syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# app.syrf.org.uk: 35.246.22.231 → 34.13.63.21
gcloud dns record-sets update app.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# api.syrf.org.uk: CREATE NEW (legacy was syrf-api.syrf.org.uk)
gcloud dns record-sets create api.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# pm.syrf.org.uk: CREATE NEW (legacy was syrf-projectmanagement.syrf.org.uk)
gcloud dns record-sets create pm.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# quartz.syrf.org.uk: 35.246.22.231 → 34.13.63.21
gcloud dns record-sets update quartz.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# help.syrf.org.uk: 35.246.22.231 → 34.13.63.21
gcloud dns record-sets update help.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# docs.syrf.org.uk: CREATE NEW
gcloud dns record-sets create docs.syrf.org.uk. \
  --zone=syrf-org-uk-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300

# rabbitmq.camarades.net: CREATE NEW
gcloud dns record-sets create rabbitmq.camarades.net. \
  --zone=camarades-net-zone \
  --type=A \
  --rrdatas=34.13.63.21 \
  --ttl=300
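
The Option A commands above can likewise be printed from a single host-to-action table, with the update-vs-create choice following the DNS Record Reference. A sketch; note rabbitmq.camarades.net lives in camarades-net-zone and is handled separately:

```shell
# Print the Option A cutover commands for syrf-org-uk-zone from one table.
# usage: cutover_cmds NEW_IP "action=host"...
cutover_cmds() {
  ip=$1; shift
  for entry in "$@"; do
    action=${entry%%=*}; host=${entry#*=}
    printf 'gcloud dns record-sets %s %s. --zone=syrf-org-uk-zone --type=A --rrdatas=%s --ttl=300\n' \
      "$action" "$host" "$ip"
  done
}

# Review the output, then pipe to sh:
cutover_cmds 34.13.63.21 \
  "update=syrf.org.uk" "update=app.syrf.org.uk" \
  "create=api.syrf.org.uk" "create=pm.syrf.org.uk" \
  "update=quartz.syrf.org.uk" "update=help.syrf.org.uk" \
  "create=docs.syrf.org.uk"
```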

Option B: Let External-DNS Manage (Recommended if TXT ownership transferred):

# If TXT ownership was successfully transferred, external-dns will
# automatically update the A records within 60 seconds.

# Watch external-dns logs:
kubectl logs -n external-dns -l app.kubernetes.io/name=external-dns -f

# Should see:
# "Update records: syrf.org.uk. A [35.246.22.231] -> [34.13.63.21]"
# "Add records: api.syrf.org.uk. A [34.13.63.21] 300"
# "Add records: pm.syrf.org.uk. A [34.13.63.21] 300"
# etc.

  5. Monitor DNS Propagation (5-15 minutes):

    # Check each domain resolves to new IP
    for host in syrf.org.uk app.syrf.org.uk api.syrf.org.uk pm.syrf.org.uk quartz.syrf.org.uk help.syrf.org.uk docs.syrf.org.uk; do
      echo "Checking $host..."
      dig $host +short @8.8.8.8
    done
    
    # All should return: 34.13.63.21
    
    # Test from local resolver (may be cached)
    dig syrf.org.uk +short
    
    # If showing old IP, wait for TTL expiry (300 seconds = 5 minutes)
    

  6. Test Production Services:

    # Test each production URL
    curl -I https://syrf.org.uk
    curl -I https://app.syrf.org.uk  # Should redirect to syrf.org.uk
    curl -I https://api.syrf.org.uk/health
    curl -I https://pm.syrf.org.uk/health
    curl -I https://quartz.syrf.org.uk
    curl -I https://help.syrf.org.uk
    curl -I https://docs.syrf.org.uk  # Should require GitHub auth
    curl -I https://rabbitmq.camarades.net
    
    # All should return 200 or 30x (redirects)
    

  7. Test User Authentication:

     - Login to https://syrf.org.uk
     - Verify Auth0 login works
     - Verify token validation on API
     - Test the full user workflow

  8. Monitor for 1-2 Hours:

    # Watch ArgoCD Applications
    kubectl get applications -n argocd -w
    
    # Watch pod logs for errors
    kubectl logs -n syrf-production -l app=syrf-api --tail=50 -f
    
    # Watch ingress-nginx logs
    kubectl logs -n ingress-nginx -l app.kubernetes.io/component=controller --tail=50 -f
    
    # Monitor RabbitMQ queues
    # Access https://rabbitmq.camarades.net
    

Success Criteria:

- ✅ All DNS records resolve to new cluster (34.13.63.21)
- ✅ All production URLs accessible
- ✅ User authentication works
- ✅ No errors in application logs
- ✅ RabbitMQ processing messages
- ✅ No user-reported issues

Rollback Procedure (if issues occur):

  1. Revert DNS A Records:

    # Revert each record to old cluster IP
    for host in syrf.org.uk app.syrf.org.uk quartz.syrf.org.uk help.syrf.org.uk; do
      gcloud dns record-sets update ${host}. \
        --zone=syrf-org-uk-zone \
        --type=A \
        --rrdatas=35.246.22.231 \
        --ttl=60  # Short TTL for faster propagation
    done
    
    # Delete new records
    gcloud dns record-sets delete api.syrf.org.uk. --zone=syrf-org-uk-zone --type=A
    gcloud dns record-sets delete pm.syrf.org.uk. --zone=syrf-org-uk-zone --type=A
    gcloud dns record-sets delete docs.syrf.org.uk. --zone=syrf-org-uk-zone --type=A
    

  2. Revert Auth0 (if needed):

     - Restore callback URLs to legacy syrf-api.syrf.org.uk
     - Restore the API audience

  3. Monitor Recovery:

    # Wait 5-15 minutes for DNS propagation
    dig syrf.org.uk +short @8.8.8.8
    # Should return: 35.246.22.231 (old cluster)
    
    # Test legacy cluster
    curl -I https://syrf.org.uk
    

  4. Post-Rollback:

     - Document what went wrong
     - Fix issues in the new cluster
     - Reschedule the cutover

Time to Rollback: 5-15 minutes (DNS TTL)


Phase 6: Decommission Legacy Cluster

Duration: 1 week monitoring + 1 day teardown
Risk: LOW (after stability period)
Rollback: IMPOSSIBLE after cluster deletion

Objective: Safely decommission legacy Jenkins X cluster after confirming new cluster stability.

Prerequisites:

- ✅ Phase 5 complete (DNS cutover successful)
- ✅ New cluster running in production for 1+ week
- ✅ No critical issues reported
- ✅ All data backed up

Stability Monitoring Period (1 week):

### Daily Checklist
- [ ] No user-reported issues
- [ ] All services Healthy in ArgoCD
- [ ] No error spikes in logs
- [ ] Performance metrics within acceptable range
- [ ] Database queries performant
- [ ] RabbitMQ queues processing normally
- [ ] No memory leaks detected
- [ ] No certificate expiry issues

### Weekly Review
- [ ] Team consensus: new cluster is stable
- [ ] All stakeholders approve decommissioning
- [ ] Backup plan confirmed if issues arise

Decommissioning Steps:

  1. Export Critical Data from Legacy Cluster:

    # Connect to legacy cluster
    kubectl config use-context <legacy-context>
    
    # Export ConfigMaps (if any needed)
    kubectl get configmaps -A -o yaml > legacy-configmaps-backup.yaml
    
    # Export Secrets (if any needed)
    kubectl get secrets -A -o yaml > legacy-secrets-backup.yaml
    
    # List PVs (DO NOT DELETE - contains package data)
    kubectl get pv -o yaml > legacy-pvs-backup.yaml
    

  2. Scale Down Legacy Workloads:

    # Scale down all deployments to 0 replicas
    # Keep PVs intact
    kubectl scale deployment --all --replicas=0 -n jx-production
    kubectl scale deployment --all --replicas=0 -n jx-staging
    kubectl scale deployment --all --replicas=0 -n jx
    
    # Verify all pods gone
    kubectl get pods -A
    # Should show no app pods running
    

  3. Monitor for 24 Hours:

     - Ensure no services are still trying to connect to the old cluster
     - Verify the new cluster is handling all traffic
     - Check for any unexpected issues

  4. Delete Legacy DNS Records:

    # Delete old production records that are no longer used
    gcloud dns record-sets delete syrf-api.syrf.org.uk. \
      --zone=syrf-org-uk-zone \
      --type=A
    
    gcloud dns record-sets delete syrf-projectmanagement.syrf.org.uk. \
      --zone=syrf-org-uk-zone \
      --type=A
    
    gcloud dns record-sets delete rabbitmq-stats.camarades.net. \
      --zone=camarades-net-zone \
      --type=A
    
    # Delete legacy infrastructure URLs
    for host in dashboard-jx lighthouse-jx hook-jx chartmuseum-jx nexus-jx; do
      gcloud dns record-sets delete ${host}.camarades.net. \
        --zone=camarades-net-zone \
        --type=A
    done
    

  5. Delete Temporary Testing DNS Records:

    # Delete .prod.camarades.net records (no longer needed)
    for host in syrf api app pm quartz help docs rabbitmq; do
      gcloud dns record-sets delete ${host}.prod.camarades.net. \
        --zone=camarades-net-zone \
        --type=A --quiet
    done
    

  6. Delete GKE Cluster (FINAL STEP - IRREVERSIBLE):

    # FINAL WARNING: This cannot be undone
    # Verify all data backed up
    # Verify PVs for package storage are exported/backed up
    
    # Delete cluster via Terraform (if managed)
    cd /path/to/camarades-infrastructure/terraform-legacy
    terraform destroy
    
    # Or via gcloud
    gcloud container clusters delete <legacy-cluster-name> \
      --zone=<zone> \
      --quiet
    
    # This will delete:
    # - GKE cluster
    # - Node pools
    # - Load balancers
    # - Disk resources
    # BUT: PVs may be retained depending on retention policy
    

  7. Cost Verification:

    # After cluster deletion, monitor GCP billing
    # Legacy cluster cost: ~$520/month
    # New cluster cost: ~$260/month
    # Expected savings: ~$260/month (~50% reduction)
    

Success Criteria:

- ✅ Legacy cluster fully scaled down
- ✅ Legacy DNS records deleted
- ✅ GKE cluster deleted
- ✅ Cost savings confirmed (~50% reduction)
- ✅ New cluster stable and performant
- ✅ Package data preserved (PVs backed up)

CRITICAL WARNINGS:

- ⚠️ DO NOT DELETE PVs with package data without an explicit backup
- ⚠️ CANNOT ROLLBACK after cluster deletion
- ⚠️ VERIFY DATA BACKUP before final deletion
- ⚠️ TEAM CONSENSUS required before deletion


Auth0 Configuration Reference

Current Production Settings

Tenant: syrf
Domain: signin.syrf.org.uk
Region: eu

Web Application (SPA):

- Client ID: UYpAGmQq1leH2HNh6DTVXer5PwQRypyU
- Application Type: Single Page Application
- Allowed Callback URLs:
  - https://syrf.org.uk/authentication/signin-oidc
  - https://app.syrf.org.uk/authentication/signin-oidc
- Allowed Logout URLs:
  - https://syrf.org.uk
  - https://app.syrf.org.uk
- Allowed Web Origins:
  - https://syrf.org.uk
  - https://api.syrf.org.uk

API Application:

- Client ID: 9BNWn0FrYRTQ1kHg1KpktRBU1tVA1ZKf
- Client Secret: Stored in GCP Secret Manager (camarades-auth0)
- API Audience: https://api.syrf.org.uk (UPDATED in Phase 4)

Migration Updates Required

Phase 4 (Testing):

- Add https://api.prod.camarades.net/authentication/signin-oidc to Allowed Callback URLs
- Add https://syrf.prod.camarades.net/authentication/signin-oidc to Allowed Callback URLs
- Update API Audience to https://api.syrf.org.uk

Phase 5 (Production):

- Replace all .prod.camarades.net URLs with final .syrf.org.uk URLs
- Remove legacy syrf-api.syrf.org.uk references


DNS Record Reference

Production Domain (syrf.org.uk)

Zone: syrf-org-uk-zone

| Record | Type | Current (Legacy) | Target (New) | Change Type |
|---|---|---|---|---|
| syrf.org.uk | A | 35.246.22.231 | 34.13.63.21 | Update |
| app.syrf.org.uk | A | 35.246.22.231 | 34.13.63.21 | Update |
| syrf-api.syrf.org.uk | A | 35.246.22.231 | DELETE | Delete |
| api.syrf.org.uk | A | (none) | 34.13.63.21 | Create |
| syrf-projectmanagement.syrf.org.uk | A | 35.246.22.231 | DELETE | Delete |
| pm.syrf.org.uk | A | (none) | 34.13.63.21 | Create |
| quartz.syrf.org.uk | A | 35.246.22.231 | 34.13.63.21 | Update |
| help.syrf.org.uk | A | 35.246.22.231 | 34.13.63.21 | Update |
| docs.syrf.org.uk | A | (none) | 34.13.63.21 | Create |

Infrastructure Domain (camarades.net)

Zone: camarades-net-zone

| Record | Type | Current (Legacy) | Target (New) | Change Type |
|---|---|---|---|---|
| argocd.camarades.net | A | (none) | 34.13.63.21 | Existing (new cluster) |
| rabbitmq-stats.camarades.net | A | 35.246.22.231 | DELETE | Delete |
| rabbitmq.camarades.net | A | (none) | 34.13.63.21 | Create |

Temporary Testing URLs (camarades.net)

Zone: camarades-net-zone

All .prod.camarades.net records are temporary and will be deleted after Phase 5:

| Record | Type | IP | Lifecycle |
|---|---|---|---|
| syrf.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| app.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| api.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| pm.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| quartz.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| help.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| docs.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |
| rabbitmq.prod.camarades.net | A | 34.13.63.21 | Phase 2-5 only |

Risk Mitigation

High Risks

Risk: Auth0 API URL change breaks authentication Impact: Users cannot login Mitigation: - Test thoroughly on .prod.camarades.net URLs before cutover - Have Auth0 rollback procedure ready - Update during low-traffic period Recovery: Revert Auth0 config (5-10 minutes)

Risk: DNS propagation delays causing service disruption
Impact: Some users unable to access services
Mitigation:
- Use a short TTL (300 seconds)
- Perform during a low-traffic period (nighttime)
- Monitor DNS propagation globally
Recovery: Wait for DNS TTL expiry (5 minutes max)
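Global propagation can be monitored with a small `dig` loop against a handful of public resolvers. This is a sketch: the resolver list is an arbitrary choice, and `check_propagation` is an illustrative helper, not an existing tool. It requires `dig` and network access, so it only defines the function here:

```shell
#!/bin/sh
# Check whether public resolvers have picked up the new A record.
TARGET_IP="34.13.63.21"
RESOLVERS="8.8.8.8 1.1.1.1 9.9.9.9"

check_propagation() {
  host="$1"
  for r in $RESOLVERS; do
    got=$(dig +short "@$r" "$host" A | head -n1)
    if [ "$got" = "$TARGET_IP" ]; then
      echo "$r: $host -> $got (OK)"
    else
      echo "$r: $host -> ${got:-NXDOMAIN} (stale)"
    fi
  done
}

# Usage during cutover:
#   check_propagation syrf.org.uk
#   check_propagation api.syrf.org.uk
```

Running this every minute or so after the DNS update gives a concrete signal for when the 300-second TTL has actually expired at the major resolvers.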

Risk: Database connection issues during cutover
Impact: Services cannot connect to MongoDB
Mitigation:
- Verify connection strings in advance
- Test database access on .prod.camarades.net
- Have a rollback plan ready
Recovery: Revert DNS, verify legacy cluster connections

Medium Risks

Risk: TLS certificate provisioning delays
Impact: HTTPS not available immediately
Mitigation:
- Pre-create certificates for .prod.camarades.net
- Monitor cert-manager before cutover
- Have a manual certificate process ready
Recovery: Use a manual ACME challenge if needed

Risk: External-DNS ownership conflicts
Impact: DNS records not managed correctly
Mitigation:
- Transfer TXT ownership before A record updates
- Monitor external-dns logs during cutover
- Have a manual DNS management procedure ready
Recovery: Manually manage DNS records via gcloud

Risk: Message loss during RabbitMQ migration
Impact: In-flight messages lost during the transition
Mitigation:
- Ensure queues are empty before cutover
- Monitor queue depths during migration
- Test message flow on the new cluster
Recovery: Drain legacy queues, reprocess on the new cluster
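The "queues empty" precondition can be checked against the RabbitMQ management API (`/api/queues` returns each queue's `messages` count). The sketch below does only rough, dependency-free parsing of that JSON rather than using `jq`, so treat it as a quick gate, not a precise tool; the `curl` invocation and credentials are placeholders:

```shell
#!/bin/sh
# Succeed only if every "messages" count in management-API JSON (on stdin)
# is zero. Input with no "messages" fields at all also counts as empty.
queues_empty() {
  ! grep -o '"messages": *[0-9][0-9]*' | grep -vq '"messages": *0$'
}

# Usage before cutover (credentials are placeholders):
#   curl -s -u "$RABBITMQ_USER:$RABBITMQ_PASS" \
#     https://rabbitmq-stats.camarades.net/api/queues | queues_empty \
#     && echo "queues empty, safe to cut over" \
#     || echo "queues still have messages, wait"
```

A stricter variant would also check `messages_unacknowledged`, since unacked deliveries are not yet safe to abandon either.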

Low Risks

Risk: Static asset caching issues
Impact: Users see stale frontend assets
Mitigation:
- Use cache-busting query parameters
- Clear the CDN cache if applicable
Recovery: Users can hard refresh (Ctrl+F5)

Risk: User bookmarks need updating
Impact: Bookmarks to legacy URLs stop working
Mitigation:
- Maintain the app.syrf.org.uk redirect
- Document new URLs in the release notes
Recovery: Users re-bookmark the new URLs

Risk: Documentation link updates needed
Impact: Internal docs reference old URLs
Mitigation:
- Search and replace old URLs before cutover
- Update wiki/Confluence pages
Recovery: Update links post-cutover
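Finding lingering references to the retired URLs is a one-liner worth scripting so it can be rerun before and after cutover. The directory argument and the exact URL list below are assumptions for illustration; extend the pattern with any other legacy hostname in scope:

```shell
#!/bin/sh
# Report every file/line that still mentions a retired URL.
LEGACY_URLS='syrf-api\.syrf\.org\.uk|syrf-projectmanagement\.syrf\.org\.uk|rabbitmq-stats\.camarades\.net'

find_legacy_urls() {
  # grep exits non-zero when nothing matches; that is the desired "clean"
  # outcome, so swallow it with || true.
  grep -rnE "$LEGACY_URLS" "$1" || true
}

# Usage: find_legacy_urls ./docs
```

An empty result means the docs tree is clean; any output lists exact file and line numbers to fix.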


Communication Plan

Pre-Migration Communication

Audience: All SyRF users

Subject: Scheduled Maintenance - SyRF Platform Migration

Message:

Dear SyRF Users,

We are upgrading the SyRF platform infrastructure to improve reliability,
performance, and reduce costs.

Scheduled Maintenance Window:
Date: [DATE]
Time: [TIME] - [TIME] [TIMEZONE]
Expected Downtime: 5-15 minutes

During this window:
- You may experience brief service interruption
- In-progress work will be saved
- No data loss expected

After the migration:
- Faster performance
- Improved reliability
- New team documentation site: https://docs.syrf.org.uk

Please save your work before the maintenance window.

If you experience issues after the migration, please contact:
[SUPPORT EMAIL]

Thank you for your patience.

The SyRF Team

During Migration Communication

Slack/Teams Channel: Post real-time updates

[TIMESTAMP] Migration started
[TIMESTAMP] DNS records updated
[TIMESTAMP] Testing authentication
[TIMESTAMP] Migration complete - please test
[TIMESTAMP] All systems nominal

Post-Migration Communication

Audience: All SyRF users

Subject: SyRF Platform Migration Complete

Message:

Dear SyRF Users,

The SyRF platform migration is complete. All services are now running
on our new infrastructure.

You may notice:
- Slightly different URLs for API (api.syrf.org.uk instead of syrf-api.syrf.org.uk)
- Faster page load times
- Improved reliability

If you bookmarked app.syrf.org.uk, it will continue to redirect to syrf.org.uk. Other legacy URLs (such as syrf-api.syrf.org.uk) have been retired, so please update any saved links.

Please report any issues to [SUPPORT EMAIL].

Thank you for your patience during the migration.

The SyRF Team


Success Metrics

Technical Metrics

  • Uptime: >99.9% after migration
  • Response Time: API p95 <500ms
  • Error Rate: <0.1% of requests
  • DNS Resolution: <100ms globally
  • Certificate Validity: All certs valid for >60 days

Business Metrics

  • Cost Reduction: ~50% infrastructure cost savings
  • User Impact: <1% user-reported issues post-migration
  • Downtime: <15 minutes during cutover
  • Rollback Usage: 0 rollbacks required

Timeline Metrics

  • Phase 1: Complete within 3 hours
  • Phase 2: Complete within 2 days
  • Phase 3: Complete within 5 days
  • Phase 4: Complete within 2 hours
  • Phase 5: Complete within 4 hours
  • Phase 6: Complete within 2 weeks

Total Migration Time: 3-4 weeks from Phase 1 start to Phase 6 completion.


History

  • 2025-11-16: Initial migration plan created