Data Snapshot Automation - Edge Case Analysis

This document provides a comprehensive analysis of the edge cases and failure scenarios for the data snapshot automation feature, based on the implemented code.


Executive Summary

| Category | Total | ✅ Handled | ⚠️ Acceptable | ❌ Issues |
|---|---|---|---|---|
| Snapshot Producer | 10 | 8 | 2 | 0 |
| Restore Job | 12 | 11 | 1 | 0 |
| DB Reset Job | 3 | 3 | 0 | 0 |
| Manual Trigger | 5 | 4 | 1 | 0 |
| Metadata | 4 | 4 | 0 | 0 |
| Permissions/Security | 6 | 6 | 0 | 0 |
| Network/Infrastructure | 5 | 3 | 2 | 0 |
| ArgoCD | 5 | 4 | 1 | 0 |
| Data Integrity | 5 | 3 | 2 | 0 |
| Timing | 3 | 2 | 1 | 0 |
| Labels | 4 | 3 | 1 | 0 |
| Resources | 3 | 2 | 1 | 0 |
| **TOTAL** | **65** | **53** | **12** | **0** |

Overall Assessment: The implementation is robust with no critical issues. 12 edge cases have minor concerns that are acceptable for production use but should be monitored.


1. Snapshot Producer CronJob

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 1.1 | First run (no syrf_snapshot exists) | $out creates database and collections automatically | ✅ Works |
| 1.2 | Production unavailable | Connection fails, backoffLimit: 3 retries | ✅ Clean failure |
| 1.3 | Concurrent runs | concurrencyPolicy: Forbid prevents overlap | ✅ Blocked |
| 1.4 | Empty source collections | Logs warning, skips, continues | ✅ Safe |
| 1.5 | Collection schema changes | $out copies BSON as-is | ✅ No impact |
| 1.6 | GCP Secret Manager unavailable | ExternalSecret fails, no credentials | ✅ Clean failure |
| 1.7 | Wrong credentials | mongosh auth fails | ✅ Clear error |
| 1.8 | syrf_snapshot quota exceeded | $out fails, old snapshot preserved | ✅ Safe |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 1.9 | Partial failure mid-copy | Some collections new, some old | Low | Metadata only written on success; restore checks pmProject count |
| 1.10 | Very large database (>20GB) | May hit 30-min timeout | Low | Monitor; $out is fast (server-side); increase timeout if needed |
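The guardrails behind cases 1.2, 1.3, and 1.10 all come from standard CronJob fields. A minimal sketch of how they fit together (resource names, image, and the command are illustrative, not the actual manifest):

```yaml
# Hypothetical sketch of the snapshot producer CronJob guardrails.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: snapshot-producer          # illustrative name
spec:
  schedule: "0 3 * * 0"            # Sunday 3 AM UTC, off-peak (10.1)
  concurrencyPolicy: Forbid        # blocks concurrent scheduled runs (1.3)
  jobTemplate:
    spec:
      backoffLimit: 3              # retry budget on connection failure (1.2)
      activeDeadlineSeconds: 1800  # 30-min hard timeout (1.10)
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: producer
              image: mongo:7
              command: ["mongosh", "--file", "/scripts/snapshot.js"]
```

Note that concurrencyPolicy: Forbid only governs runs created by the CronJob controller; a manually created Job bypasses it (see 4.5).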

2. Snapshot Restore Job

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 2.1 | No snapshot exists | Checks pmProject.countDocuments(), fails loudly with options | ✅ Good UX |
| 2.2 | PR credentials not ready | Wait loop (60s timeout) for secret mount | ✅ Retries |
| 2.3 | Atlas Operator slow | sync-wave ordering ensures user created first | ✅ Ordered |
| 2.4 | DB already has data | Drops all collections first | ✅ Clean slate |
| 2.5 | persist-db with use-snapshot | RESET_DATABASE=false prevents job generation | ✅ Logic correct |
| 2.6 | Label added mid-deployment | Workflow regenerates files, ArgoCD syncs | ✅ Reactive |
| 2.7 | use-snapshot removed with persist-db | Label change reverted by workflow | ✅ Lock enforced |
| 2.8 | PR closed without persist-db | PreDelete hook drops database | ✅ Cleanup |
| 2.9 | PR closed with persist-db | Database preserved, warning posted | ✅ By design |
| 2.10 | Multiple PRs restore simultaneously | Each restores to own DB, no conflicts | ✅ Isolated |
| 2.11 | Marker check for duplicate restores | SHA-based marker prevents re-run | ✅ Idempotent |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 2.12 | PreDelete hook fails | Orphan database remains | Low | Manual cleanup documented; rare scenario |
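The ordering in 2.3 and the cleanup in 2.8 rely on standard ArgoCD hook annotations. A sketch of the relevant metadata (resource names and the wave numbers are assumptions for illustration):

```yaml
# Illustrative annotations only; names and wave values are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: snapshot-restore
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "1"   # runs after the wave-0 AtlasDatabaseUser (2.3)
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
---
apiVersion: batch/v1
kind: Job
metadata:
  name: db-cleanup
  annotations:
    argocd.argoproj.io/hook: PreDelete  # drops the PR database before app deletion (2.8)
```

If the PreDelete hook itself fails (2.12), ArgoCD surfaces the failure but the database is left behind, hence the documented manual cleanup path.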

3. DB Reset Job (Fresh Seed)

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 3.1 | DB doesn't exist | getCollectionNames() returns empty, no error | ✅ Safe |
| 3.2 | Two syncs at same SHA | Marker check, second skips | ✅ Idempotent |
| 3.3 | Marker corrupted/deleted | Reset runs again (just drops/recreates) | ✅ Safe |
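The SHA-based marker guard in 3.2 can be sketched as a check at the top of the reset script. Collection, field, and variable names here are hypothetical; the embedded script is shown as it might appear in the Job's command:

```yaml
# Hypothetical marker guard; collection and env var names are assumptions.
command:
  - mongosh
  - --eval
  - |
    const sha = process.env.GIT_SHA;
    if (db.deploy_markers.findOne({ sha: sha })) {
      print("Reset already ran for this SHA; skipping");
      quit(0);
    }
    // Drop and recreate, then record the marker last so a crash re-runs safely (3.3)
    db.getCollectionNames().forEach(c => db[c].drop());
    db.deploy_markers.insertOne({ sha: sha, at: new Date() });
```

Writing the marker only after the reset completes is what makes a corrupted or deleted marker harmless: the worst case is an extra drop/recreate.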

4. Manual Trigger Workflow

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 4.1 | Wrong confirmation text | Validation fails immediately | ✅ Clear error |
| 4.2 | CronJob doesn't exist | Pre-check fails with instructions | ✅ Helpful |
| 4.3 | GKE credentials invalid | get-credentials fails | ✅ Clean failure |
| 4.4 | Job takes >15 min | Log streaming times out; workflow reports the job as still running and documents a manual check | ✅ Documented |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 4.5 | Manual run during weekly run | Both run to completion (the manual trigger creates a standalone Job, so concurrencyPolicy does not apply) | Low | Last one wins; both produce valid snapshots |
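The confirmation gate in 4.1 and the trigger itself can be sketched as a GitHub Actions workflow. Input, job, and resource names below are illustrative, not the actual workflow:

```yaml
# Sketch of a manual trigger with a confirmation gate; names are assumptions.
on:
  workflow_dispatch:
    inputs:
      confirm:
        description: 'Type "snapshot" to confirm'
        required: true
jobs:
  trigger:
    runs-on: ubuntu-latest
    steps:
      - name: Validate confirmation text
        run: |
          if [ "${{ github.event.inputs.confirm }}" != "snapshot" ]; then
            echo "Confirmation text did not match; aborting." >&2
            exit 1
          fi
      - name: Create a one-off Job from the CronJob template
        run: |
          kubectl create job "snapshot-manual-$(date +%s)" \
            --from=cronjob/snapshot-producer
```

Because kubectl create job --from=cronjob produces a standalone Job, concurrencyPolicy: Forbid on the CronJob does not block it, which is exactly the overlap described in 4.5.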

5. Metadata Handling

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 5.1 | snapshot_metadata doesn't exist | updateOne with upsert: true | ✅ Creates |
| 5.2 | Metadata write fails | Job continues; restore checks pmProject count | ✅ Fallback |
| 5.3 | Different MongoDB version | Standard JavaScript, BSON compatible | ✅ Works |
| 5.4 | Unexpected metadata format | JavaScript handles gracefully | ✅ Safe |
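Cases 5.1 and 5.2 combine an upsert with a non-fatal error path. A sketch of how that might look in the producer script (the document shape and database names are assumptions):

```yaml
# Hypothetical metadata write; document shape is an assumption.
command:
  - mongosh
  - --eval
  - |
    try {
      db.getSiblingDB("syrf_snapshot").snapshot_metadata.updateOne(
        { _id: "latest" },
        { $set: { completedAt: new Date(), source: "syrftest" } },
        { upsert: true }  // creates the collection and document if absent (5.1)
      );
    } catch (e) {
      // 5.2: a metadata failure is non-fatal; restore falls back to pmProject count
      print("metadata write failed: " + e);
    }
```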

6. Permissions & Security

✅ All Cases Handled (Defense in Depth)

| # | Attack Vector | Protection Layer | Status |
|---|---|---|---|
| 6.1 | Snapshot producer writes to syrftest | MongoDB: read-only permission | ✅ Blocked |
| 6.2 | PR user writes to syrf_snapshot | MongoDB: read-only permission | ✅ Blocked |
| 6.3 | PR user accesses other PR's DB | MongoDB: no permission on syrf_pr_X | ✅ Blocked |
| 6.4 | PR user accesses production | MongoDB: no permission on syrftest | ✅ Blocked |
| 6.5 | Script modified to target wrong DB | MongoDB: permission denied anyway | ✅ Defense in depth |
| 6.6 | AtlasDatabaseUser created bypassing workflow | Kyverno: policy validates all resources | ✅ Policy enforced |

Security Assessment: The permission model provides multiple layers of protection. Even if one layer is bypassed, others prevent damage.
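The per-PR permission model in 6.2–6.4 maps directly onto scoped roles in the AtlasDatabaseUser resource. A sketch with hypothetical names (the actual user spec, project reference, and PR number are assumptions):

```yaml
# Illustrative AtlasDatabaseUser; names and the PR number are assumptions.
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  name: syrf-pr-123-user
spec:
  username: syrf-pr-123
  roles:
    - roleName: readWrite
      databaseName: syrf_pr_123    # full access to its own PR database only (6.3)
    - roleName: read
      databaseName: syrf_snapshot  # read-only on the shared snapshot (6.2)
  # no role on syrftest at all, so production writes are impossible (6.4)
```

Because the roles list is an allowlist, anything not granted (other PR databases, production) is denied by MongoDB regardless of what the restore script attempts, which is the defense-in-depth property noted in 6.5.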


7. Network & Infrastructure

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 7.1 | MongoDB Atlas maintenance | Jobs fail, retry later | ✅ Resilient |
| 7.2 | GKE node failure | Pod rescheduled; handled by Kubernetes | ✅ K8s native |
| 7.3 | Service account token expired | Kubernetes auto-refreshes | ✅ Automatic |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 7.4 | Image pull failure (mongo:7) | ImagePullBackOff until success | Low | Standard K8s behavior, retries automatically |
| 7.5 | kubectl install in job fails | apt-get or curl fails | Medium | Consider custom image with kubectl pre-installed |

Recommendation: Build a custom image ghcr.io/camaradesuk/mongo-kubectl:7 with kubectl pre-installed to eliminate runtime dependency installation.


8. ArgoCD Integration

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 8.1 | PreSync hook timeout | Jobs have 15–30 min timeouts; ArgoCD default is 1h | ✅ Within limits |
| 8.2 | Hook ordering | sync-wave annotations ensure correct order | ✅ Ordered |
| 8.3 | Manual sync while workflow running | Marker check prevents duplicates | ✅ Idempotent |
| 8.4 | App deleted | PreDelete hooks run before deletion | ✅ Cleanup works |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 8.5 | Sync fails after restore | DB has data, services don't deploy | Low | Database ready for retry; just sync again |

9. Data Integrity

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 9.1 | CSUUID format | $out copies BSON, no conversion | ✅ Preserved |
| 9.2 | Large documents (>16MB) | MongoDB rejects, clear error | ✅ Standard limit |
| 9.3 | Orphaned references | All 11 collections copied together | ✅ Consistent |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 9.4 | Indexes not copied | $out doesn't copy indexes | Medium | Apps must create indexes on startup (should already do this) |
| 9.5 | GridFS data (fs.files, fs.chunks) | Not in collection list | N/A | Verified: SyRF does not use GridFS (code search confirmed) |

Note: Code search confirmed SyRF does not use GridFS - no IGridFSBucket, fs.files, or fs.chunks references found.
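Because $out copies only documents (9.4), any indexes the apps rely on must be recreated after a restore, either by the apps on startup or by an explicit post-restore step. A sketch of the latter (database, collection, and index specs are purely illustrative):

```yaml
# Hypothetical post-restore index step; index specs are illustrative (9.4).
command:
  - mongosh
  - --eval
  - |
    // $out does not copy indexes, so recreate the ones the apps query on.
    // createIndex is idempotent: re-running with the same spec is a no-op.
    const prDb = db.getSiblingDB("syrf_pr_123");
    prDb.pmProject.createIndex({ createdAt: -1 });
```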


10. Timing & Race Conditions

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 10.1 | Weekly snapshot during peak hours | Scheduled for Sunday 3 AM UTC | ✅ Off-peak |
| 10.2 | Clock skew | NTP keeps drift minimal | ✅ Acceptable |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 10.3 | Restore during snapshot | May get partial data | Very Low | 3 AM timing unlikely to overlap; data still usable |

11. Label State Machine

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 11.1 | Labels added in wrong order | Order doesn't matter; final state determines behavior | ✅ Order-independent |
| 11.2 | All labels removed | preview removal triggers cleanup | ✅ Cleanup |
| 11.3 | Label changed by bot | Same as human change | ✅ Agnostic |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 11.4 | Two label changes simultaneously | Possible race condition | Low | Final state converges; workflows are mostly idempotent |

12. Resource Limits

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 12.1 | Job exceeds memory | OOMKilled, though $out runs server-side so client memory use is minimal | ✅ Unlikely |
| 12.2 | Job exceeds CPU | Throttled, runs slower | ✅ Acceptable slowdown |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 12.3 | Too many concurrent PRs | Cluster resources exhausted | Medium | Operational concern; set PR limits if needed |

Critical Path Analysis

Happy Path (Most Common)

1. Weekly: CronJob → syrftest → syrf_snapshot (3 AM Sunday)
2. PR Created: Add preview + use-snapshot labels
3. Workflow: Detects labels, generates restore job
4. ArgoCD: Creates user → Runs restore → Deploys services
5. Developer: Tests with production data
6. PR Merged: PreDelete drops database, namespace cleaned

Risk Points in Happy Path: None identified.

Failure Recovery Paths

| Failure | Recovery | Automatic? |
|---|---|---|
| Snapshot job fails | Next weekly run succeeds; old snapshot usable | Yes |
| Restore job fails | Fix cause, re-sync ArgoCD | Manual |
| PreDelete hook fails | Manual database cleanup | Manual |
| Credentials wrong | Fix in GCP Secret Manager, wait for sync | Manual |

Recommendations

High Priority

  1. Verify GridFS usage: ✅ Confirmed SyRF does not use GridFS.

Medium Priority

  1. Custom Docker image: Build mongo-kubectl:7 image to avoid runtime apt-get install and curl downloads.

  2. Index documentation: Document that apps must create indexes on startup since $out doesn't copy them.

Low Priority

  1. Concurrent snapshot protection: Add a lock/marker to prevent manual + weekly snapshots from overlapping.

  2. Alerting: Set up alerts for snapshot job failures, orphan databases, and restore job timeouts.


Test Scenarios

Must Test Before Production

| # | Scenario | Expected Outcome | Priority |
|---|---|---|---|
| T1 | Manual trigger first snapshot | syrf_snapshot created with metadata | Critical |
| T2 | PR with use-snapshot (data exists) | Restore completes, data matches | Critical |
| T3 | PR with use-snapshot (no data yet) | Clear error with options | Critical |
| T4 | PR rebuild with persist-db | Database untouched | High |
| T5 | PR close without persist-db | Database dropped | High |
| T6 | use-snapshot added while persist-db present | Label reverted, comment posted | Medium |
| T7 | persist-db removed from closed PR | Database dropped | Medium |

Conclusion

The implementation is production-ready with 53 of 65 edge cases fully handled and 12 having acceptable minor concerns. The permission model provides strong security guarantees with defense in depth.

Key Strengths:

  • MongoDB permissions prevent any write to production
  • Metadata written to all database types for observability
  • Marker-based idempotency prevents duplicate operations
  • Clear error messages with actionable guidance

Key Monitoring Points:

  • Snapshot job success rate
  • Restore job duration (should be <10 minutes)
  • Orphan database count

Document End