Tags: edge-cases, mongodb, risk-analysis, testing
# Data Snapshot Automation - Edge Case Analysis

This document provides a comprehensive analysis of the edge cases and scenarios for the data snapshot automation feature, based on the implemented code.
## Executive Summary

| Category | Total | ✅ Handled | ⚠️ Acceptable | ❌ Issues |
|---|---|---|---|---|
| Snapshot Producer | 10 | 8 | 2 | 0 |
| Restore Job | 12 | 11 | 1 | 0 |
| DB Reset Job | 3 | 3 | 0 | 0 |
| Manual Trigger | 5 | 4 | 1 | 0 |
| Metadata | 4 | 4 | 0 | 0 |
| Permissions/Security | 6 | 6 | 0 | 0 |
| Network/Infrastructure | 5 | 3 | 2 | 0 |
| ArgoCD | 5 | 4 | 1 | 0 |
| Data Integrity | 5 | 3 | 2 | 0 |
| Timing | 3 | 2 | 1 | 0 |
| Labels | 4 | 3 | 1 | 0 |
| Resources | 3 | 2 | 1 | 0 |
| **TOTAL** | **65** | **53** | **12** | **0** |

**Overall Assessment**: The implementation is robust, with no critical issues. Twelve edge cases have minor concerns that are acceptable for production use but should be monitored.
## 1. Snapshot Producer CronJob

### ✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 1.1 | First run (no `syrf_snapshot` exists) | `$out` creates the database and collections automatically | ✅ Works |
| 1.2 | Production unavailable | Connection fails; `backoffLimit: 3` retries | ✅ Clean failure |
| 1.3 | Concurrent runs | `concurrencyPolicy: Forbid` prevents overlap | ✅ Blocked |
| 1.4 | Empty source collections | Logs a warning, skips, continues | ✅ Safe |
| 1.5 | Collection schema changes | `$out` copies BSON as-is | ✅ No impact |
| 1.6 | GCP Secret Manager unavailable | ExternalSecret fails, no credentials | ✅ Clean failure |
| 1.7 | Wrong credentials | mongosh auth fails | ✅ Clear error |
| 1.8 | `syrf_snapshot` quota exceeded | `$out` fails, old snapshot preserved | ✅ Safe |
### ⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 1.9 | Partial failure mid-copy | Some collections new, some old | Low | Metadata only written on success; restore checks `pmProject` count |
| 1.10 | Very large database (>20 GB) | May hit 30-min timeout | Low | Monitor; `$out` is fast (server-side); increase timeout if needed |
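The copy step described above can be sketched as a small shell fragment that emits the mongosh statements the job would run. This is a minimal sketch, not the actual job: collection names other than `pmProject` are illustrative, and the connection string variable `MONGO_URI` is an assumption.

```shell
#!/bin/sh
# Sketch of the weekly copy: one $out aggregation per collection.
# $out runs server-side, so the client needs almost no memory.
SOURCE_DB="syrftest"
TARGET_DB="syrf_snapshot"

# Emit the mongosh statement that copies one collection with $out.
build_copy_js() {
  printf 'db.getSiblingDB("%s").%s.aggregate([{ $out: { db: "%s", coll: "%s" } }]);\n' \
    "$SOURCE_DB" "$1" "$TARGET_DB" "$1"
}

for coll in pmProject pmStudy; do   # the real job copies all 11 collections
  build_copy_js "$coll"
done
# In the real job, the emitted statements are piped into:
#   mongosh "$MONGO_URI" --quiet
```

Because `$out` atomically replaces each target collection only when the aggregation completes, a mid-copy failure leaves the previous snapshot of untouched collections intact (edge case 1.9).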
## 2. Snapshot Restore Job

### ✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 2.1 | No snapshot exists | Checks `pmProject.countDocuments()`, fails loudly with options | ✅ Good UX |
| 2.2 | PR credentials not ready | Wait loop (60 s timeout) for secret mount | ✅ Retries |
| 2.3 | Atlas Operator slow | sync-wave ordering ensures the user is created first | ✅ Ordered |
| 2.4 | DB already has data | Drops all collections first | ✅ Clean slate |
| 2.5 | `persist-db` with `use-snapshot` | `RESET_DATABASE=false` prevents job generation | ✅ Logic correct |
| 2.6 | Label added mid-deployment | Workflow regenerates files, ArgoCD syncs | ✅ Reactive |
| 2.7 | `use-snapshot` removed while `persist-db` present | Label change reverted by workflow | ✅ Lock enforced |
| 2.8 | PR closed without `persist-db` | PreDelete hook drops database | ✅ Cleanup |
| 2.9 | PR closed with `persist-db` | Database preserved, warning posted | ✅ By design |
| 2.10 | Multiple PRs restore simultaneously | Each restores into its own DB, no conflicts | ✅ Isolated |
| 2.11 | Marker check for duplicate restores | SHA-based marker prevents re-runs | ✅ Idempotent |
### ⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 2.12 | PreDelete hook fails | Orphan database remains | Low | Manual cleanup documented; rare scenario |
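The fail-loudly pre-flight check (edge case 2.1) can be sketched as follows. This is an illustration of the logic only: the real job obtains the count via mongosh, while here it is passed in as a parameter so the decision is visible and testable.

```shell
#!/bin/sh
# Sketch of the restore job's pre-flight check: refuse to restore from an
# empty snapshot and print actionable options instead.
verify_snapshot() {
  count="$1"   # db.getSiblingDB("syrf_snapshot").pmProject.countDocuments()
  if [ "$count" -eq 0 ]; then
    echo "ERROR: syrf_snapshot is empty. Options:" >&2
    echo "  - run the manual snapshot trigger workflow, or" >&2
    echo "  - remove the use-snapshot label to seed a fresh database" >&2
    return 1
  fi
  echo "snapshot OK: $count pmProject documents"
}
```

A non-zero exit makes the restore Job fail, which surfaces the message in the ArgoCD sync status rather than silently deploying an empty database.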
## 3. DB Reset Job (Fresh Seed)

### ✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 3.1 | DB doesn't exist | `getCollectionNames()` returns empty, no error | ✅ Safe |
| 3.2 | Two syncs at same SHA | Marker check, second skips | ✅ Idempotent |
| 3.3 | Marker corrupted/deleted | Reset runs again (just drops/recreates) | ✅ Safe |
## 4. Manual Trigger Workflow

### ✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 4.1 | Wrong confirmation text | Validation fails immediately | ✅ Clear error |
| 4.2 | CronJob doesn't exist | Pre-check fails with instructions | ✅ Helpful |
| 4.3 | GKE credentials invalid | `get-credentials` fails | ✅ Clean failure |
| 4.4 | Job takes >15 min | Log wait times out; job reported as still running, with manual check instructions | ✅ Documented |

### ⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 4.5 | Manual run during weekly run | Both run (a Job, not a CronJob) | Low | Last one wins; both produce valid snapshots |
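A sketch of the trigger's guard rails follows. The confirmation phrase and the resource names (`snapshot-producer`) are assumptions, not the exact values in the workflow.

```shell
#!/bin/sh
# Sketch of the manual trigger: validate confirmation text, then run the
# CronJob template as a one-off Job.
confirm_or_fail() {
  [ "$1" = "create-snapshot" ] && return 0
  echo "ERROR: confirmation text did not match 'create-snapshot'" >&2
  return 1
}

trigger_snapshot() {
  confirm_or_fail "$1" || return 1
  # kubectl create job --from=cronjob creates a plain Job from the CronJob's
  # template; concurrencyPolicy: Forbid does not apply to it, which is why a
  # manual run can overlap the weekly run (edge case 4.5).
  kubectl create job "snapshot-manual-$(date +%s)" \
    --from=cronjob/snapshot-producer
}
```

Failing the confirmation check before touching the cluster keeps the destructive path behind an explicit, typo-proof gate.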
## 5. Metadata

### ✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 5.1 | `snapshot_metadata` doesn't exist | `updateOne` with `upsert: true` | ✅ Creates |
| 5.2 | Metadata write fails | Job continues; restore checks `pmProject` count | ✅ Fallback |
| 5.3 | Different MongoDB version | Standard JavaScript, BSON compatible | ✅ Works |
| 5.4 | Unexpected metadata format | JavaScript handles it gracefully | ✅ Safe |
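The upsert behavior in case 5.1 can be sketched as below. The collection name `snapshot_metadata` comes from this document; the document `_id` and field names are illustrative assumptions.

```shell
#!/bin/sh
# Sketch of the metadata write: upsert so the first run creates both the
# collection and the document (edge case 5.1).
metadata_js() {
  cat <<'EOF'
db.getSiblingDB("syrf_snapshot").snapshot_metadata.updateOne(
  { _id: "latest" },
  { $set: { completedAt: new Date(), sourceDb: "syrftest" } },
  { upsert: true }  // creates collection and document if absent
);
EOF
}
metadata_js  # in the real job: metadata_js | mongosh "$MONGO_URI" --quiet
```

Because this write happens only after the copy succeeds, a failed metadata write never masks a failed snapshot; the restore side falls back to the `pmProject` count (case 5.2).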
## 6. Permissions & Security

### ✅ All Cases Handled (Defense in Depth)

| # | Attack Vector | Protection Layer | Status |
|---|---|---|---|
| 6.1 | Snapshot producer writes to `syrftest` | MongoDB: read-only permission | ✅ Blocked |
| 6.2 | PR user writes to `syrf_snapshot` | MongoDB: read-only permission | ✅ Blocked |
| 6.3 | PR user accesses another PR's DB | MongoDB: no permission on `syrf_pr_X` | ✅ Blocked |
| 6.4 | PR user accesses production | MongoDB: no permission on `syrftest` | ✅ Blocked |
| 6.5 | Script modified to target wrong DB | MongoDB: permission denied anyway | ✅ Defense in depth |
| 6.6 | AtlasDatabaseUser created bypassing workflow | Kyverno: policy validates all resources | ✅ Policy enforced |

**Security Assessment**: The permission model provides multiple layers of protection; even if one layer is bypassed, the others prevent damage.
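The role scoping behind cases 6.1-6.4 could take roughly the following shape, shown here as the `roles` fragment of an Atlas Operator `AtlasDatabaseUser` spec emitted by a helper. This is an illustrative assumption about the manifest, not the actual resource; the `syrf_pr_<n>` naming follows this document.

```shell
#!/bin/sh
# Sketch of a per-PR database user's roles: readWrite only on its own PR
# database, read-only on the snapshot, and no role at all on syrftest.
pr_user_roles() {
  cat <<EOF
roles:
  - roleName: readWrite
    databaseName: syrf_pr_$1
  - roleName: read
    databaseName: syrf_snapshot
EOF
}
pr_user_roles 42
```

Since MongoDB denies anything not granted, the absence of a `syrftest` role is itself the protection layer for cases 6.4 and 6.5.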
## 7. Network & Infrastructure

### ✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 7.1 | MongoDB Atlas maintenance | Jobs fail, retry later | ✅ Resilient |
| 7.2 | GKE node failure | Pod rescheduled; K8s handles it | ✅ K8s native |
| 7.3 | Service account token expired | K8s auto-refreshes | ✅ Automatic |

### ⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 7.4 | Image pull failure (`mongo:7`) | ImagePullBackOff until success | Low | Standard K8s; retries automatically |
| 7.5 | kubectl install in job fails | `apt-get` or `curl` fails | Medium | Consider a custom image with kubectl pre-installed |

**Recommendation**: Build a custom image `ghcr.io/camaradesuk/mongo-kubectl:7` with kubectl pre-installed to eliminate runtime dependency installation.
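One possible build for the recommended image is sketched below. The image tag is from this document; the base image layout, kubectl version, and download URL are assumptions to verify before use.

```shell
#!/bin/sh
# Sketch: bake kubectl into the mongo:7 image so jobs never install
# dependencies at runtime (edge case 7.5).
cat > Dockerfile <<'EOF'
FROM mongo:7
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl ca-certificates \
 && curl -fsSLo /usr/local/bin/kubectl \
      https://dl.k8s.io/release/v1.30.0/bin/linux/amd64/kubectl \
 && chmod +x /usr/local/bin/kubectl \
 && rm -rf /var/lib/apt/lists/*
EOF
# Then: docker build -t ghcr.io/camaradesuk/mongo-kubectl:7 .
```

Pinning the kubectl version in the image also makes job behavior reproducible across runs, which runtime `curl` downloads cannot guarantee.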
## 8. ArgoCD Integration

### ✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 8.1 | PreSync hook timeout | Jobs have 15-30 min timeouts; ArgoCD default is 1 h | ✅ Within limits |
| 8.2 | Hook ordering | sync-wave annotations ensure correct order | ✅ Ordered |
| 8.3 | Manual sync while workflow running | Marker check prevents duplicates | ✅ Idempotent |
| 8.4 | App deleted | PreDelete hooks run before deletion | ✅ Cleanup works |

### ⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 8.5 | Sync fails after restore | DB has data, services don't deploy | Low | Database remains ready for retry; just sync again |
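One way to express the hook ordering in case 8.2 is sketched below as annotation fragments. The wave numbers are assumptions; only the relative order matters (database user before restore job), and within a sync phase ArgoCD processes lower waves first.

```shell
#!/bin/sh
# Sketch of the ordering annotations: both resources run in the PreSync
# phase, with the AtlasDatabaseUser in an earlier wave than the restore Job.
ordering_yaml() {
  cat <<'EOF'
# AtlasDatabaseUser: created first
metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "-1"
---
# restore Job: runs after the user exists
metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "0"
EOF
}
ordering_yaml
```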
## 9. Data Integrity

### ✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 9.1 | CSUUID format | `$out` copies BSON, no conversion | ✅ Preserved |
| 9.2 | Large documents (>16 MB) | MongoDB rejects them with a clear error | ✅ Standard limit |
| 9.3 | Orphaned references | All 11 collections copied together | ✅ Consistent |

### ⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 9.4 | Indexes not copied | `$out` doesn't copy indexes | Medium | Apps must create indexes on startup (should already do this) |
| 9.5 | GridFS data (`fs.files`, `fs.chunks`) | Not in the collection list | N/A | Verified: SyRF does not use GridFS (code search confirmed) |

**Note**: The code search found no `IGridFSBucket`, `fs.files`, or `fs.chunks` references, confirming SyRF does not use GridFS.
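Because `$out` copies documents but not indexes (edge case 9.4), indexes must exist again after a restore. A sketch of what the application-startup index creation looks like in mongosh terms; the index keys shown are purely illustrative.

```shell
#!/bin/sh
# Sketch: index creation that apps must perform on startup, since the
# restored database arrives index-free.
index_js() {
  cat <<'EOF'
db.pmProject.createIndex({ createdAt: 1 });
db.pmStudy.createIndex({ projectId: 1 });
EOF
}
index_js  # run in mongosh against the restored PR database
```

`createIndex` is idempotent for an identical key spec, so running it on every startup is safe.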
## 10. Timing & Race Conditions

### ✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 10.1 | Weekly snapshot during peak | Scheduled for Sunday 3 AM UTC | ✅ Off-peak |
| 10.2 | Clock skew | NTP keeps drift minimal | ✅ Acceptable |

### ⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 10.3 | Restore during snapshot | May get partial data | Very Low | 3 AM timing makes overlap unlikely; data still usable |
## 11. Label State Machine

### ✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 11.1 | Labels added in wrong order | Order doesn't matter; the final state determines behavior | ✅ Order-independent |
| 11.2 | All labels removed | `preview` removal triggers cleanup | ✅ Cleanup |
| 11.3 | Label changed by bot | Same as a human change | ✅ Agnostic |

### ⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 11.4 | Two label changes simultaneously | Possible race condition | Low | Final state converges; workflows are mostly idempotent |
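The order-independence in case 11.1 follows from treating the decision as a pure function of the final label set, which can be sketched like this. The label semantics come from this document; the output variable names are assumptions.

```shell
#!/bin/sh
# Sketch of the label state machine: the decision depends only on which
# labels are present, never on the order they were applied.
decide() {
  labels=" $* "
  case "$labels" in
    *" persist-db "*)   echo "RESET_DATABASE=false" ;;       # never touch data (2.5)
    *" use-snapshot "*) echo "RESTORE_FROM_SNAPSHOT=true" ;; # restore job
    *)                  echo "FRESH_SEED=true" ;;            # default reset
  esac
}
```

Because the same label set always yields the same output, racing label changes (case 11.4) converge once the dust settles and the workflow re-runs.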
## 12. Resource Limits

### ✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 12.1 | Job exceeds memory | OOMKilled | ✅ `$out` is server-side, minimal client memory |
| 12.2 | Job exceeds CPU | Throttled, runs slower | ✅ Acceptable slowdown |

### ⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 12.3 | Too many concurrent PRs | Cluster resources exhausted | Medium | Operational concern; set PR limits if needed |
## Critical Path Analysis

### Happy Path (Most Common)

1. Weekly: CronJob → `syrftest` → `syrf_snapshot` (3 AM Sunday)
2. PR created: add `preview` + `use-snapshot` labels
3. Workflow: detects labels, generates restore job
4. ArgoCD: creates user → runs restore → deploys services
5. Developer: tests with production data
6. PR merged: PreDelete drops database, namespace cleaned

**Risk Points in Happy Path**: None identified.
### Failure Recovery Paths

| Failure | Recovery | Automatic? |
|---|---|---|
| Snapshot job fails | Next weekly run succeeds; old snapshot usable | Yes |
| Restore job fails | Fix cause, re-sync ArgoCD | Manual |
| PreDelete hook fails | Manual database cleanup | Manual |
| Credentials wrong | Fix in GCP Secret Manager, wait for sync | Manual |
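The "manual database cleanup" recovery for a failed PreDelete hook amounts to dropping the orphan PR database. A sketch, following this document's `syrf_pr_X` naming scheme:

```shell
#!/bin/sh
# Sketch: emit the mongosh statement that drops an orphan PR database.
cleanup_js() {
  printf 'db.getSiblingDB("syrf_pr_%s").dropDatabase();\n' "$1"
}
cleanup_js 42  # pipe into mongosh with admin credentials to actually run it
```

Running it requires credentials with drop rights on the PR database, which the per-PR users deliberately lack, so this stays an operator-only action.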
## Recommendations

### High Priority

- **Verify GridFS usage**: ✅ Done; confirmed SyRF does not use GridFS.

### Medium Priority

- **Custom Docker image**: Build the `mongo-kubectl:7` image to avoid runtime `apt-get install` and `curl` downloads.
- **Index documentation**: Document that apps must create indexes on startup, since `$out` doesn't copy them.

### Low Priority

- **Concurrent snapshot protection**: Add a lock/marker to prevent manual and weekly snapshots from overlapping.
- **Alerting**: Set up alerts for snapshot job failures, orphan databases, and restore job timeouts.
## Test Scenarios

### Must Test Before Production

| # | Scenario | Expected Outcome | Priority |
|---|---|---|---|
| T1 | Manual trigger first snapshot | `syrf_snapshot` created with metadata | Critical |
| T2 | PR with `use-snapshot` (data exists) | Restore completes, data matches | Critical |
| T3 | PR with `use-snapshot` (no data yet) | Clear error with options | Critical |
| T4 | PR rebuild with `persist-db` | Database untouched | High |
| T5 | PR close without `persist-db` | Database dropped | High |
| T6 | `use-snapshot` added while `persist-db` present | Label reverted, comment posted | Medium |
| T7 | `persist-db` removed from closed PR | Database dropped | Medium |
## Conclusion

The implementation is production-ready, with 53 of 65 edge cases fully handled and 12 having acceptable minor concerns. The permission model provides strong security guarantees with defense in depth.

**Key Strengths**:

- MongoDB permissions prevent any write to production
- Metadata written to all database types for observability
- Marker-based idempotency prevents duplicate operations
- Clear error messages with actionable guidance

**Key Monitoring Points**:

- Snapshot job success rate
- Restore job duration (should be <10 minutes)
- Orphan database count
Document End
2026-04-14 07:56:31