Data Snapshot Automation - Edge Case Analysis

This document provides a comprehensive analysis of the edge cases and failure scenarios for the data snapshot automation feature, based on the implemented code.


Executive Summary

| Category | Total | ✅ Handled | ⚠️ Acceptable | ❌ Issues |
|---|---|---|---|---|
| Snapshot Producer | 10 | 8 | 2 | 0 |
| Restore Job | 12 | 11 | 1 | 0 |
| DB Reset Job | 3 | 3 | 0 | 0 |
| Manual Trigger | 5 | 4 | 1 | 0 |
| Metadata | 4 | 4 | 0 | 0 |
| Permissions/Security | 6 | 6 | 0 | 0 |
| Network/Infrastructure | 5 | 3 | 2 | 0 |
| ArgoCD | 5 | 4 | 1 | 0 |
| Data Integrity | 5 | 3 | 2 | 0 |
| Timing | 3 | 2 | 1 | 0 |
| Labels | 4 | 3 | 1 | 0 |
| Resources | 3 | 2 | 1 | 0 |
| **TOTAL** | **65** | **53** | **12** | **0** |

Overall Assessment: The implementation is robust with no critical issues. 12 edge cases have minor concerns that are acceptable for production use but should be monitored.


1. Snapshot Producer CronJob

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 1.1 | First run (no syrf_snapshot exists) | $out creates database and collections automatically | ✅ Works |
| 1.2 | Production unavailable | Connection fails, backoffLimit: 3 retries | ✅ Clean failure |
| 1.3 | Concurrent runs | concurrencyPolicy: Forbid prevents overlap | ✅ Blocked |
| 1.4 | Empty source collections | Logs warning, skips, continues | ✅ Safe |
| 1.5 | Collection schema changes | $out copies BSON as-is | ✅ No impact |
| 1.6 | GCP Secret Manager unavailable | ExternalSecret fails, no credentials | ✅ Clean failure |
| 1.7 | Wrong credentials | mongosh auth fails | ✅ Clear error |
| 1.8 | syrf_snapshot quota exceeded | $out fails, old snapshot preserved | ✅ Safe |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 1.9 | Partial failure mid-copy | Some collections new, some old | Low | Metadata only written on success; restore checks pmProject count |
| 1.10 | Very large database (>20GB) | May hit 30-min timeout | Low | Monitor; $out is fast (server-side); increase timeout if needed |
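The guardrails behind cases 1.2, 1.3, and 1.10 all come from standard CronJob fields. A minimal sketch of how they fit together (resource names, image, and the command are illustrative, not the actual manifest):

```yaml
# Hypothetical sketch of the snapshot producer CronJob guardrails.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: snapshot-producer          # illustrative name
spec:
  schedule: "0 3 * * 0"            # Sunday 3 AM UTC, off-peak (10.1)
  concurrencyPolicy: Forbid        # blocks concurrent scheduled runs (1.3)
  jobTemplate:
    spec:
      backoffLimit: 3              # retry budget on connection failure (1.2)
      activeDeadlineSeconds: 1800  # 30-min hard timeout (1.10)
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: producer
              image: mongo:7
              command: ["mongosh", "--file", "/scripts/snapshot.js"]
```

Note that concurrencyPolicy: Forbid only governs runs created by the CronJob controller; a manually created Job bypasses it (see 4.5).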

2. Snapshot Restore Job

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 2.1 | No snapshot exists | Checks pmProject.countDocuments(), fails loudly with options | ✅ Good UX |
| 2.2 | PR credentials not ready | Wait loop (60s timeout) for secret mount | ✅ Retries |
| 2.3 | Atlas Operator slow | sync-wave ordering ensures user created first | ✅ Ordered |
| 2.4 | DB already has data | Drops all collections first | ✅ Clean slate |
| 2.5 | persist-db with use-snapshot | RESET_DATABASE=false prevents job generation | ✅ Logic correct |
| 2.6 | Label added mid-deployment | Workflow regenerates files, ArgoCD syncs | ✅ Reactive |
| 2.7 | use-snapshot removed with persist-db | Label change reverted by workflow | ✅ Lock enforced |
| 2.8 | PR closed without persist-db | PreDelete hook drops database | ✅ Cleanup |
| 2.9 | PR closed with persist-db | Database preserved, warning posted | ✅ By design |
| 2.10 | Multiple PRs restore simultaneously | Each restores to own DB, no conflicts | ✅ Isolated |
| 2.11 | Marker check for duplicate restores | SHA-based marker prevents re-run | ✅ Idempotent |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 2.12 | PreDelete hook fails | Orphan database remains | Low | Manual cleanup documented; rare scenario |
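The ordering in 2.3 and the cleanup in 2.8 rely on standard ArgoCD hook annotations. A sketch of the relevant metadata (resource names and the wave numbers are assumptions for illustration):

```yaml
# Illustrative annotations only; names and wave values are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: snapshot-restore
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "1"   # runs after the wave-0 AtlasDatabaseUser (2.3)
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
---
apiVersion: batch/v1
kind: Job
metadata:
  name: db-cleanup
  annotations:
    argocd.argoproj.io/hook: PreDelete  # drops the PR database before app deletion (2.8)
```

If the PreDelete hook itself fails (2.12), ArgoCD surfaces the failure but the database is left behind, hence the documented manual cleanup path.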

3. DB Reset Job (Fresh Seed)

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 3.1 | DB doesn't exist | getCollectionNames() returns empty, no error | ✅ Safe |
| 3.2 | Two syncs at same SHA | Marker check, second skips | ✅ Idempotent |
| 3.3 | Marker corrupted/deleted | Reset runs again (just drops/recreates) | ✅ Safe |
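The SHA-based marker guard in 3.2 can be sketched as a check at the top of the reset script. Collection, field, and variable names here are hypothetical; the embedded script is shown as it might appear in the Job's command:

```yaml
# Hypothetical marker guard; collection and env var names are assumptions.
command:
  - mongosh
  - --eval
  - |
    const sha = process.env.GIT_SHA;
    if (db.deploy_markers.findOne({ sha: sha })) {
      print("Reset already ran for this SHA; skipping");
      quit(0);
    }
    // Drop and recreate, then record the marker last so a crash re-runs safely (3.3)
    db.getCollectionNames().forEach(c => db[c].drop());
    db.deploy_markers.insertOne({ sha: sha, at: new Date() });
```

Writing the marker only after the reset completes is what makes a corrupted or deleted marker harmless: the worst case is an extra drop/recreate.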

4. Manual Trigger Workflow

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 4.1 | Wrong confirmation text | Validation fails immediately | ✅ Clear error |
| 4.2 | CronJob doesn't exist | Pre-check fails with instructions | ✅ Helpful |
| 4.3 | GKE credentials invalid | get-credentials fails | ✅ Clean failure |
| 4.4 | Job takes >15 min | Log streaming times out; workflow reports the job as still running and documents a manual check | ✅ Documented |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 4.5 | Manual run during weekly run | Both run to completion (the manual trigger creates a standalone Job, so concurrencyPolicy does not apply) | Low | Last one wins; both produce valid snapshots |
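The confirmation gate in 4.1 and the trigger itself can be sketched as a GitHub Actions workflow. Input, job, and resource names below are illustrative, not the actual workflow:

```yaml
# Sketch of a manual trigger with a confirmation gate; names are assumptions.
on:
  workflow_dispatch:
    inputs:
      confirm:
        description: 'Type "snapshot" to confirm'
        required: true
jobs:
  trigger:
    runs-on: ubuntu-latest
    steps:
      - name: Validate confirmation text
        run: |
          if [ "${{ github.event.inputs.confirm }}" != "snapshot" ]; then
            echo "Confirmation text did not match; aborting." >&2
            exit 1
          fi
      - name: Create a one-off Job from the CronJob template
        run: |
          kubectl create job "snapshot-manual-$(date +%s)" \
            --from=cronjob/snapshot-producer
```

Because kubectl create job --from=cronjob produces a standalone Job, concurrencyPolicy: Forbid on the CronJob does not block it, which is exactly the overlap described in 4.5.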

5. Metadata Handling

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 5.1 | snapshot_metadata doesn't exist | updateOne with upsert: true | ✅ Creates |
| 5.2 | Metadata write fails | Job continues; restore checks pmProject count | ✅ Fallback |
| 5.3 | Different MongoDB version | Standard JavaScript, BSON compatible | ✅ Works |
| 5.4 | Unexpected metadata format | JavaScript handles gracefully | ✅ Safe |
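Cases 5.1 and 5.2 combine an upsert with a non-fatal error path. A sketch of how that might look in the producer script (the document shape and database names are assumptions):

```yaml
# Hypothetical metadata write; document shape is an assumption.
command:
  - mongosh
  - --eval
  - |
    try {
      db.getSiblingDB("syrf_snapshot").snapshot_metadata.updateOne(
        { _id: "latest" },
        { $set: { completedAt: new Date(), source: "syrftest" } },
        { upsert: true }  // creates the collection and document if absent (5.1)
      );
    } catch (e) {
      // 5.2: a metadata failure is non-fatal; restore falls back to pmProject count
      print("metadata write failed: " + e);
    }
```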

6. Permissions & Security

✅ All Cases Handled (Defense in Depth)

| # | Attack Vector | Protection Layer | Status |
|---|---|---|---|
| 6.1 | Snapshot producer writes to syrftest | MongoDB: read-only permission | ✅ Blocked |
| 6.2 | PR user writes to syrf_snapshot | MongoDB: read-only permission | ✅ Blocked |
| 6.3 | PR user accesses other PR's DB | MongoDB: no permission on syrf_pr_X | ✅ Blocked |
| 6.4 | PR user accesses production | MongoDB: no permission on syrftest | ✅ Blocked |
| 6.5 | Script modified to target wrong DB | MongoDB: permission denied anyway | ✅ Defense in depth |
| 6.6 | AtlasDatabaseUser created bypassing workflow | Kyverno: policy validates all resources | ✅ Policy enforced |

Security Assessment: The permission model provides multiple layers of protection. Even if one layer is bypassed, others prevent damage.
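The per-PR permission model in 6.2–6.4 maps directly onto scoped roles in the AtlasDatabaseUser resource. A sketch with hypothetical names (the actual user spec, project reference, and PR number are assumptions):

```yaml
# Illustrative AtlasDatabaseUser; names and the PR number are assumptions.
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  name: syrf-pr-123-user
spec:
  username: syrf-pr-123
  roles:
    - roleName: readWrite
      databaseName: syrf_pr_123    # full access to its own PR database only (6.3)
    - roleName: read
      databaseName: syrf_snapshot  # read-only on the shared snapshot (6.2)
  # no role on syrftest at all, so production writes are impossible (6.4)
```

Because the roles list is an allowlist, anything not granted (other PR databases, production) is denied by MongoDB regardless of what the restore script attempts, which is the defense-in-depth property noted in 6.5.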


7. Network & Infrastructure

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 7.1 | MongoDB Atlas maintenance | Jobs fail, retry later | ✅ Resilient |
| 7.2 | GKE node failure | Pod rescheduled; handled by Kubernetes | ✅ K8s native |
| 7.3 | Service account token expired | Kubernetes auto-refreshes | ✅ Automatic |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 7.4 | Image pull failure (mongo:7) | ImagePullBackOff until success | Low | Standard K8s behavior, retries automatically |
| 7.5 | kubectl install in job fails | apt-get or curl fails | Medium | Consider custom image with kubectl pre-installed |

Recommendation: Build a custom image ghcr.io/camaradesuk/mongo-kubectl:7 with kubectl pre-installed to eliminate runtime dependency installation.


8. ArgoCD Integration

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 8.1 | PreSync hook timeout | Jobs have 15–30 min timeouts; ArgoCD default is 1h | ✅ Within limits |
| 8.2 | Hook ordering | sync-wave annotations ensure correct order | ✅ Ordered |
| 8.3 | Manual sync while workflow running | Marker check prevents duplicates | ✅ Idempotent |
| 8.4 | App deleted | PreDelete hooks run before deletion | ✅ Cleanup works |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 8.5 | Sync fails after restore | DB has data, services don't deploy | Low | Database ready for retry; just sync again |

9. Data Integrity

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 9.1 | CSUUID format | $out copies BSON, no conversion | ✅ Preserved |
| 9.2 | Large documents (>16MB) | MongoDB rejects, clear error | ✅ Standard limit |
| 9.3 | Orphaned references | All 11 collections copied together | ✅ Consistent |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 9.4 | Indexes not copied | $out doesn't copy indexes | Medium | Apps must create indexes on startup (should already do this) |
| 9.5 | GridFS data (fs.files, fs.chunks) | Not in collection list | N/A | Verified: SyRF does not use GridFS (code search confirmed) |

Note: Code search confirmed SyRF does not use GridFS - no IGridFSBucket, fs.files, or fs.chunks references found.
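Because $out copies only documents (9.4), any indexes the apps rely on must be recreated after a restore, either by the apps on startup or by an explicit post-restore step. A sketch of the latter (database, collection, and index specs are purely illustrative):

```yaml
# Hypothetical post-restore index step; index specs are illustrative (9.4).
command:
  - mongosh
  - --eval
  - |
    // $out does not copy indexes, so recreate the ones the apps query on.
    // createIndex is idempotent: re-running with the same spec is a no-op.
    const prDb = db.getSiblingDB("syrf_pr_123");
    prDb.pmProject.createIndex({ createdAt: -1 });
```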


10. Timing & Race Conditions

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 10.1 | Weekly snapshot during peak hours | Scheduled for Sunday 3 AM UTC | ✅ Off-peak |
| 10.2 | Clock skew | NTP keeps drift minimal | ✅ Acceptable |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 10.3 | Restore during snapshot | May get partial data | Very Low | 3 AM timing unlikely to overlap; data still usable |

11. Label State Machine

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 11.1 | Labels added in wrong order | Order doesn't matter; final state determines behavior | ✅ Order-independent |
| 11.2 | All labels removed | preview removal triggers cleanup | ✅ Cleanup |
| 11.3 | Label changed by bot | Same as human change | ✅ Agnostic |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 11.4 | Two label changes simultaneously | Possible race condition | Low | Final state converges; workflows are mostly idempotent |

12. Resource Limits

✅ Handled Cases

| # | Scenario | Behavior | Status |
|---|---|---|---|
| 12.1 | Job exceeds memory | OOMKilled, though $out runs server-side so client memory use is minimal | ✅ Unlikely |
| 12.2 | Job exceeds CPU | Throttled, runs slower | ✅ Acceptable slowdown |

⚠️ Acceptable Concerns

| # | Scenario | Behavior | Risk | Mitigation |
|---|---|---|---|---|
| 12.3 | Too many concurrent PRs | Cluster resources exhausted | Medium | Operational concern; set PR limits if needed |

Critical Path Analysis

Happy Path (Most Common)

1. Weekly: CronJob → syrftest → syrf_snapshot (3 AM Sunday)
2. PR Created: Add preview + use-snapshot labels
3. Workflow: Detects labels, generates restore job
4. ArgoCD: Creates user → Runs restore → Deploys services
5. Developer: Tests with production data
6. PR Merged: PreDelete drops database, namespace cleaned

Risk Points in Happy Path: None identified.

Failure Recovery Paths

| Failure | Recovery | Automatic? |
|---|---|---|
| Snapshot job fails | Next weekly run succeeds; old snapshot usable | Yes |
| Restore job fails | Fix cause, re-sync ArgoCD | Manual |
| PreDelete hook fails | Manual database cleanup | Manual |
| Credentials wrong | Fix in GCP Secret Manager, wait for sync | Manual |

Recommendations

High Priority

  1. Verify GridFS usage: ✅ Confirmed SyRF does not use GridFS.

Medium Priority

  1. Custom Docker image: Build mongo-kubectl:7 image to avoid runtime apt-get install and curl downloads.

  2. Index documentation: Document that apps must create indexes on startup since $out doesn't copy them.

Low Priority

  1. Concurrent snapshot protection: Add a lock/marker to prevent manual + weekly snapshots from overlapping.

  2. Alerting: Set up alerts for snapshot job failures, orphan databases, and restore job timeouts.


Test Scenarios

Must Test Before Production

| # | Scenario | Expected Outcome | Priority |
|---|---|---|---|
| T1 | Manual trigger first snapshot | syrf_snapshot created with metadata | Critical |
| T2 | PR with use-snapshot (data exists) | Restore completes, data matches | Critical |
| T3 | PR with use-snapshot (no data yet) | Clear error with options | Critical |
| T4 | PR rebuild with persist-db | Database untouched | High |
| T5 | PR close without persist-db | Database dropped | High |
| T6 | use-snapshot added while persist-db present | Label reverted, comment posted | Medium |
| T7 | persist-db removed from closed PR | Database dropped | Medium |

Conclusion

The implementation is production-ready with 53 of 65 edge cases fully handled and 12 having acceptable minor concerns. The permission model provides strong security guarantees with defense in depth.

Key Strengths:

  • MongoDB permissions prevent any write to production
  • Metadata written to all database types for observability
  • Marker-based idempotency prevents duplicate operations
  • Clear error messages with actionable guidance

Key Monitoring Points:

  • Snapshot job success rate
  • Restore job duration (should be <10 minutes)
  • Orphan database count

Document End