MongoDB Serialization Architecture Analysis¶
Executive Summary¶
This document provides a comprehensive analysis of the current MongoDB/BSON serialization architecture in SyRF, identifies problems and technical debt, and presents architectural options for improvement. The current implementation has significant testability issues and maintenance overhead that warrant refactoring.
1. Current State Analysis¶
1.1 Architecture Overview¶
SyRF uses the MongoDB C# driver's BsonClassMap API to configure serialization. The current architecture distributes mapping responsibilities across multiple repositories:
┌─────────────────────────────────────────────────────────────────────┐
│                         Application Startup                         │
│                  CreateIndexesAndMapping.Execute()                  │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                 MongoPmUnitOfWork.CreateMappings()                  │
│  Iterates over all repository properties implementing IHasMappings  │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
       ┌───────────────────────┼───────────────────────────────┐
       │                       │                               │
       ▼                       ▼                               ▼
┌──────────────────┐  ┌──────────────────┐            ┌──────────────────┐
│ InvestigatorRepo │  │ ProjectRepo      │    ...     │ StudyRepo        │
│ CreateMappings() │  │ CreateMappings() │            │ CreateMappings() │
│ ~40 lines        │  │ ~425 lines       │            │ ~80 lines        │
│ 3 types          │  │ 30+ types        │            │ 15+ types        │
└──────────────────┘  └──────────────────┘            └──────────────────┘
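The discovery step in the middle box can be sketched with plain reflection. This is a minimal illustration, not the actual MongoPmUnitOfWork code: FakeRepo and FakeUnitOfWork are hypothetical stand-ins, and the shape of IHasMappings (a single CreateMappings() method) is an assumption based on the diagram.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Assumed shape of the marker interface named in the diagram.
public interface IHasMappings
{
    void CreateMappings();
}

// Hypothetical repository that only counts how often it was asked to map.
public sealed class FakeRepo : IHasMappings
{
    public int Calls { get; private set; }
    public void CreateMappings() => Calls++;
}

// Hypothetical unit of work exposing repositories as public properties.
public sealed class FakeUnitOfWork
{
    public FakeRepo Investigators { get; } = new FakeRepo();
    public FakeRepo Projects { get; } = new FakeRepo();
    public string ConnectionName { get; } = "not a repository";

    // Finds every public instance property whose type implements IHasMappings
    // and calls CreateMappings() on it; returns how many were found.
    public int CreateMappings()
    {
        var repositories = GetType()
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Where(p => typeof(IHasMappings).IsAssignableFrom(p.PropertyType))
            .Select(p => (IHasMappings)p.GetValue(this)!)
            .ToList();

        foreach (var repository in repositories)
        {
            repository.CreateMappings();
        }
        return repositories.Count;
    }
}
```

With this approach, the set of mappings that run is determined implicitly by which properties exist on the unit of work, which is part of why the complete schema is hard to see in one place.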
1.2 Repository Inventory¶
| Repository | Lines in CreateMappings | Types Registered | Complexity |
|---|---|---|---|
| ProjectRepository | ~425 | 30+ | Very High |
| StudyRepository | ~80 | 15+ | High |
| SystematicSearchRepository | ~30 | 1 | Medium |
| InvestigatorRepository | ~40 | 3 | Medium |
| RiskOfBiasAiJobRepository | ~35 | 10+ | Medium |
| InvestigatorUsageRepository | ~15 | 1 | Low |
| DataExportJobRepository | 0 | 0 | None |
| StudyCorrectionRepository | 0 | 0 | None |
1.3 Mapping Patterns Used¶
Pattern 1: Schema Version-Based Conditional Serialization¶
// Investigator: Serialize ProjectHistory only for SchemaVersion > 0
cm.MapProperty(i => i.ProjectHistory)
    .SetShouldSerializeMethod(i => ((Investigator)i).SchemaVersion > 0);
// Investigator: Serialize Email only for SchemaVersion == 0
cm.MapProperty(i => i.Email)
    .SetElementName("Email")
    .SetShouldSerializeMethod(i => ((Investigator)i).SchemaVersion == 0);
Purpose: Enables schema migrations where old documents (SchemaVersion 0) and new documents (SchemaVersion > 0) coexist in the same collection.
Pattern 2: Field-to-Element Name Mapping¶
// Map private field to different BSON element name
cm.MapField(Investigator.InternalEmailsFieldName) // _internalEmails
    .SetElementName(Investigator.EmailsPropertyName); // "Emails"
// Map with version-specific element names
cm.MapField(SystematicSearch.V0StudyIdsFieldName)
    .SetElementName(SystematicSearch.V0StudyIdsElementName)
    .SetShouldSerializeMethod(ss => ((SystematicSearch)ss).SchemaVersion == 0);
Pattern 3: Property Unmapping¶
// Exclude computed/navigation properties from serialization
cm.UnmapProperty(ss => ss.FromLivingSearch); // Computed property
cm.UnmapProperty(ss => ss.ProjectId); // Single project (now multi-project)
cm.UnmapProperty(ss => ss.NumberOfStudies); // Computed aggregate
cm.UnmapProperty(vc => vc.IsUnexpired); // Computed from other fields
Pattern 4: Polymorphic Type Registration¶
// Register base and derived types for polymorphic deserialization
if (!BsonClassMap.IsClassMapRegistered(typeof(Annotation)))
    BsonClassMap.RegisterClassMap<Annotation>();
if (!BsonClassMap.IsClassMapRegistered(typeof(StringAnnotation)))
    BsonClassMap.RegisterClassMap<StringAnnotation>();
if (!BsonClassMap.IsClassMapRegistered(typeof(BoolAnnotation)))
    BsonClassMap.RegisterClassMap<BoolAnnotation>();
// ... 10+ annotation types
Pattern 5: Custom Serializers¶
// Enum as string serialization
cm.MapMember(job => job.Status)
.SetSerializer(new EnumAsStringSerializer<RiskOfBiasAiJobStatus>());
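For comparison, the driver itself ships EnumSerializer<TEnum>, which can be told to write enum values as strings. The sketch below assumes EnumAsStringSerializer<T> in the snippet above is a SyRF-specific class with similar behavior; JobStatus and SampleJob are hypothetical stand-ins, not SyRF types.

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Bson.Serialization;
using MongoDB.Bson.Serialization.Serializers;

// Hypothetical enum and document type used only for this illustration.
public enum JobStatus { Pending, Running, Completed }

public sealed class SampleJob
{
    public JobStatus Status { get; set; }
}

public static class EnumAsStringExample
{
    // Registers the class map once (class maps are process-global),
    // then serializes a sample object to a BsonDocument in memory.
    public static BsonDocument Serialize()
    {
        if (!BsonClassMap.IsClassMapRegistered(typeof(SampleJob)))
        {
            BsonClassMap.RegisterClassMap<SampleJob>(cm =>
            {
                cm.AutoMap();
                // BsonType.String makes the driver write the enum name
                // instead of its underlying integer value.
                cm.MapMember(j => j.Status)
                    .SetSerializer(new EnumSerializer<JobStatus>(BsonType.String));
            });
        }
        return new SampleJob { Status = JobStatus.Running }.ToBsonDocument();
    }
}
```

Storing the name rather than the number keeps documents readable and stable when enum members are reordered, at the cost of breaking reads if a member is renamed.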
Pattern 6: IsClassMapRegistered Guard¶
// Prevent double registration (required due to global static state)
if (!BsonClassMap.IsClassMapRegistered(typeof(Project)))
{
    BsonClassMap.RegisterClassMap<Project>(cm => { ... });
}
1.4 Type Registration Analysis¶
Types with Custom Mappings (require explicit configuration):
- Investigator, NewEmailVerifier, VerificationCode
- Project, Stage, ProjectMembership, Invitation, AnnotationQuestion
- Study, ScreeningInfo, ExtractionInfo, Annotation (+ 10 subtypes)
- SystematicSearch, ReferenceFile
- RiskOfBiasAiJob, AiAnnotationResponse (+ 8 answer types)
- StageUsage
Types with Auto-Mapping (no custom config, just registration):
- Many polymorphic subtypes (e.g., StringAnnotation, BoolAnnotation)
- Simple DTOs and value objects
2. Problems with Current Architecture¶
2.1 Global Static State Anti-Pattern¶
Problem: BsonClassMap is a global, static registry that cannot be reset or scoped.
// Once registered, cannot be unregistered or modified
BsonClassMap.RegisterClassMap<Project>(cm => { ... });
Impact:
- Tests cannot isolate mapping configurations
- Order-dependent behavior across test runs
- Cannot test different mapping scenarios in parallel
- Mapping configuration "leaks" between test fixtures
2.2 Dual Responsibility Violation¶
Problem: Repositories handle both data access AND schema definition.
public class ProjectRepository : MongoRepositoryBase<Project, Guid>
{
    // Responsibility 1: Data access (CRUD, queries)
    public async Task<Project?> GetByIdAsync(Guid id) { ... }
    // Responsibility 2: Schema definition (425 lines!)
    public override void CreateMappings() { ... }
}
Impact:
- Violates Single Responsibility Principle
- Makes repositories harder to test in isolation
- Schema changes require modifying repository code
- Harder to understand mapping behavior without reading repository code
2.3 Scattered Schema Definitions¶
Problem: Schema definitions are spread across six repository files (see Section 1.2).
Impact:
- Difficult to get a complete picture of the schema
- Duplicate type registrations across repositories possible
- No central validation of mapping consistency
- Refactoring type hierarchies requires changes in multiple files
2.4 Untestable Conditional Logic¶
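Problem: SetShouldSerializeMethod lambdas embed schema-migration logic in anonymous functions, so that logic cannot be unit tested on its own; it can only be exercised by serializing an object through the registered class map.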
cm.MapProperty(i => i.ProjectHistory)
.SetShouldSerializeMethod(i => ((Investigator)i).SchemaVersion > 0);
Impact:
- Schema migration logic is implicitly defined in lambdas
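- Serialization behavior can only be verified by serializing real objects through the registered class map (for example with ToBsonDocument(), as in Section 4.2), and no such tests exist today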
- Bugs in conditional serialization only discovered at runtime
- Current tests only verify registration, not configuration
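One low-cost way to make these rules testable is to move each predicate into a named static method and have the lambda delegate to it. The sketch below is an illustration only: InvestigatorSerializationRules is a hypothetical class, and the two rules are copied from the Investigator snippets in Section 1.3.

```csharp
using System;

// Hypothetical home for the rules that currently live inside lambdas.
// Each rule takes the schema version and can be unit tested directly,
// without touching BsonClassMap.
public static class InvestigatorSerializationRules
{
    // ProjectHistory exists only in documents with SchemaVersion > 0.
    public static bool ShouldSerializeProjectHistory(int schemaVersion)
        => schemaVersion > 0;

    // The legacy single Email element is written only for SchemaVersion 0.
    public static bool ShouldSerializeLegacyEmail(int schemaVersion)
        => schemaVersion == 0;
}
```

The mapping would then read `.SetShouldSerializeMethod(i => InvestigatorSerializationRules.ShouldSerializeProjectHistory(((Investigator)i).SchemaVersion))`, which keeps the cast in the mapping layer and the decision in testable code.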
2.5 Magic Strings and Implicit Contracts¶
Problem: Element names are strings, creating implicit contracts between code and database.
cm.MapField(Project.V0RegistrationsName)
.SetElementName("Registrations"); // Must match database field name
Impact:
- Typos in element names cause silent data loss
- Renaming requires careful coordination with existing data
- No compile-time verification of field mappings
2.6 Test Coverage Gaps¶
Current Tests (from BsonClassMapTests.cs):
- ✅ CreateMappings() doesn't throw
- ✅ Classes are registered
- ✅ Idempotent (calling twice doesn't fail)
Not Tested:
- ❌ Actual field-to-element mappings
- ❌ Conditional serialization logic
- ❌ Custom serializer behavior
- ❌ Unmapped properties are actually excluded
- ❌ Schema version behavior (v0 vs v1)
- ❌ Round-trip serialization/deserialization
3. Architectural Options¶
Option A: Centralized Mapping Registry (Recommended)¶
Approach: Extract all mappings to a dedicated registry class (SyRFMappingRegistry) behind an IMongoMappingProvider interface.
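public interface IMongoMappingProvider
{
    void RegisterMappings();
    IReadOnlyList<Type> RegisteredTypes { get; }
}
public class SyRFMappingRegistry : IMongoMappingProvider
{
    private readonly List<Type> _registeredTypes = new();
    public IReadOnlyList<Type> RegisteredTypes => _registeredTypes;
    public void RegisterMappings()
    {
        RegisterInvestigatorMappings();
        RegisterProjectMappings();
        RegisterStudyMappings();
        // ... etc
    }
    private void RegisterInvestigatorMappings()
    {
        RegisterIfNotExists<Investigator>(cm =>
        {
            cm.AutoMap();
            // MapConditionalField is a proposed helper, not a driver API
            MapConditionalField(cm, i => i.ProjectHistory,
                schemaVersion => schemaVersion > 0);
            // ... etc
        });
    }
    // Guards against double registration and records each type for auditing
    private void RegisterIfNotExists<T>(Action<BsonClassMap<T>> configure)
    {
        if (BsonClassMap.IsClassMapRegistered(typeof(T))) return;
        BsonClassMap.RegisterClassMap(configure);
        _registeredTypes.Add(typeof(T));
    }
}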
Benefits:
- Single location for all schema definitions
- Can implement custom conventions
- Easier to audit complete schema
- Repository classes become pure data access
Effort: Medium (refactor existing code)
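The Option A sketch calls MapConditionalField without defining it. One possible shape is shown below as a sketch under stated assumptions: entities would expose their version through a hypothetical IHasSchemaVersion interface, and SampleDoc is a stand-in type. Only MapMember, SetShouldSerializeMethod, RegisterClassMap and ToBsonDocument are driver APIs.

```csharp
using System;
using System.Linq.Expressions;
using MongoDB.Bson;
using MongoDB.Bson.Serialization;

// Hypothetical interface; assumes versioned entities expose SchemaVersion.
public interface IHasSchemaVersion
{
    int SchemaVersion { get; }
}

public static class ConditionalMapping
{
    // Maps a member and serializes it only when the predicate accepts
    // the owning document's schema version.
    public static BsonMemberMap MapConditionalField<T, TMember>(
        BsonClassMap<T> cm,
        Expression<Func<T, TMember>> member,
        Func<int, bool> shouldSerialize) where T : IHasSchemaVersion
    {
        return cm.MapMember(member)
            .SetShouldSerializeMethod(o => shouldSerialize(((T)o).SchemaVersion));
    }
}

// Stand-in document type used only for this illustration.
public sealed class SampleDoc : IHasSchemaVersion
{
    public int SchemaVersion { get; set; }
    public string History { get; set; } = "";
}

public static class ConditionalMappingDemo
{
    // Serializes one v0 and one v1 document in memory (no MongoDB server).
    public static (BsonDocument V0, BsonDocument V1) Run()
    {
        if (!BsonClassMap.IsClassMapRegistered(typeof(SampleDoc)))
        {
            BsonClassMap.RegisterClassMap<SampleDoc>(cm =>
            {
                cm.AutoMap();
                ConditionalMapping.MapConditionalField(cm, d => d.History, v => v > 0);
            });
        }
        var v0 = new SampleDoc { SchemaVersion = 0, History = "h" }.ToBsonDocument();
        var v1 = new SampleDoc { SchemaVersion = 1, History = "h" }.ToBsonDocument();
        return (v0, v1);
    }
}
```

Taking the predicate as Func<int, bool> keeps the object cast in one place and lets the rule itself be tested without any driver types.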
Option B: Attribute-Based Mapping¶
Approach: Define mappings declaratively on model classes.
[BsonDiscriminator("Investigator")]
public class Investigator : AggregateRoot<Guid>
{
    [BsonIgnore]
    public string PrimaryEmail => Emails.First();
    // [BsonElement] opts the private field in and sets its element name
    [BsonElement("Emails")]
    private List<string> _internalEmails;
    // Hypothetical custom attribute; the driver has no built-in equivalent
    [BsonConditional(nameof(SchemaVersion), ">", 0)]
    public List<ProjectAccess> ProjectHistory { get; private set; }
}
Benefits:
- Schema definition co-located with domain model
- More discoverable (no need to find repository code)
- Standard MongoDB driver pattern (some attributes already supported)
Drawbacks:
- Not all mapping scenarios supported by attributes
- Custom attribute for conditional serialization would be needed
- Schema version logic embedded in domain model (coupling)
Effort: High (would require custom attributes + reflection processing)
Option C: Fluent Configuration Files (Schema-as-Code)¶
Approach: Separate configuration classes per aggregate.
public class InvestigatorMappingConfiguration : IMappingConfiguration<Investigator>
{
    public void Configure(BsonClassMap<Investigator> cm)
    {
        cm.AutoMap();
        // Clear declarative style. MapPrivateField, ToElement, SerializeWhen and
        // IgnoreProperty are proposed extension methods, not driver APIs.
        cm.MapPrivateField("_internalEmails")
            .ToElement("Emails");
        cm.MapProperty(i => i.ProjectHistory)
            .SerializeWhen(i => i.SchemaVersion > 0);
        cm.IgnoreProperty(i => i.PrimaryEmail);
        cm.IgnoreProperty(i => i.Email);
    }
}
Benefits:
- Clean separation of concerns
- Each configuration is testable in isolation
- Easy to compose and extend
- Mirrors Entity Framework Core's IEntityTypeConfiguration<T> pattern (familiar)
Effort: Medium-High (new abstraction layer)
Option D: Schema Version Migration System¶
Approach: Formalize schema versions with explicit migration logic.
// Generic so that implementations receive a typed class map
public interface ISchemaVersion<T>
{
    int Version { get; }
    void ConfigureMapping(BsonClassMap<T> cm);
}
public class InvestigatorSchemaV0 : ISchemaVersion<Investigator>
{
    public int Version => 0;
    public void ConfigureMapping(BsonClassMap<Investigator> cm)
    {
        cm.MapProperty(i => i.Email).SetElementName("Email");
        // V0-specific mapping
    }
}
public class InvestigatorSchemaV1 : ISchemaVersion<Investigator>
{
    public int Version => 1;
    public void ConfigureMapping(BsonClassMap<Investigator> cm)
    {
        cm.MapField("_internalEmails").SetElementName("Emails");
        cm.MapProperty(i => i.ProjectHistory);
        // V1-specific mapping
    }
}
Benefits:
- Explicit, documented schema versions
- Each version testable independently
- Clear upgrade path for migrations
- Better alignment with eventual consistency patterns
Effort: High (significant architectural change)
4. Testing Strategy Improvements¶
4.1 BsonClassMap Inspection Tests¶
Test that mappings are configured correctly by inspecting the registered class maps.
[Fact]
public void Investigator_EmailsField_IsMappedCorrectly()
{
    EnsureInvestigatorMappingsCreated();
    // LookupClassMap returns the registered class map after freezing it
    var classMap = BsonClassMap.LookupClassMap(typeof(Investigator));
    var memberMap = classMap.GetMemberMap(Investigator.InternalEmailsFieldName);
    Assert.NotNull(memberMap);
    Assert.Equal("Emails", memberMap.ElementName);
}
[Fact]
public void Investigator_PrimaryEmail_IsUnmapped()
{
    EnsureInvestigatorMappingsCreated();
    // AllMemberMaps is populated when the class map is frozen; on an unfrozen
    // map this assertion could pass vacuously
    var classMap = BsonClassMap.LookupClassMap(typeof(Investigator));
    var memberMap = classMap.AllMemberMaps
        .FirstOrDefault(m => m.MemberName == nameof(Investigator.PrimaryEmail));
    Assert.Null(memberMap);
}
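The tests in this section call EnsureInvestigatorMappingsCreated(), which is not shown. Because class maps are process-global and test runners may execute classes in parallel, one reasonable shape for such a helper is a run-once gate. The sketch below is an assumption about how it could look; MappingBootstrap and its counter are hypothetical, and the real body would call the production mapping code where the comment indicates.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical run-once gate for test fixtures that share global class maps.
public static class MappingBootstrap
{
    private static int _runs;

    // Lazy<T> with ExecutionAndPublication guarantees the factory runs at
    // most once, even when many test threads ask for the value concurrently.
    private static readonly Lazy<bool> Registered = new Lazy<bool>(() =>
    {
        // Assumption: production registration would be invoked here,
        // for example the repository's CreateMappings() method.
        Interlocked.Increment(ref _runs);
        return true;
    }, LazyThreadSafetyMode.ExecutionAndPublication);

    // Exposed only so the sketch can show that registration ran once.
    public static int Runs => Volatile.Read(ref _runs);

    public static void EnsureMappingsCreated()
    {
        _ = Registered.Value;
    }
}
```

This does not give tests isolated mapping state; it only makes the shared state deterministic, which is the most the global registry allows.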
4.2 Round-Trip Serialization Tests¶
Verify documents serialize and deserialize correctly.
[Theory]
[InlineData(0)]
[InlineData(1)]
public void Investigator_RoundTrip_PreservesData(int schemaVersion)
{
    EnsureInvestigatorMappingsCreated();
    var investigator = new Investigator(
        Guid.NewGuid(), "auth0|123",
        new FullName("Test", "User"),
        "test@example.com",
        schemaVersion);
    // Serialize
    var bson = investigator.ToBsonDocument();
    // Verify schema-specific behavior
    if (schemaVersion == 0)
    {
        Assert.True(bson.Contains("Email"));
        Assert.False(bson.Contains("ProjectHistory"));
    }
    else
    {
        Assert.True(bson.Contains("Emails"));
        Assert.True(bson.Contains("ProjectHistory"));
    }
    // Deserialize
    var deserialized = BsonSerializer.Deserialize<Investigator>(bson);
    Assert.Equal(investigator.Id, deserialized.Id);
    Assert.Equal(investigator.PrimaryEmail, deserialized.PrimaryEmail);
}
4.3 Conditional Serialization Tests¶
Explicitly test SetShouldSerializeMethod behavior.
[Fact]
public void Project_Registrations_OnlySerializedForSchemaV0()
{
    EnsureMappingsCreated();
    var projectV0 = CreateProject(schemaVersion: 0);
    projectV0.AddRegistration("test-reg");
    var projectV1 = CreateProject(schemaVersion: 1);
    projectV1.AddMembership(new ProjectMembership(...));
    var bsonV0 = projectV0.ToBsonDocument();
    var bsonV1 = projectV1.ToBsonDocument();
    Assert.True(bsonV0.Contains("Registrations"));
    Assert.False(bsonV0.Contains("Memberships"));
    Assert.False(bsonV1.Contains("Registrations"));
    Assert.True(bsonV1.Contains("Memberships"));
}
4.4 Integration Tests with TestContainers¶
Full integration tests with real MongoDB.
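public class MongoSerializationIntegrationTests : IAsyncLifetime
{
    // Testcontainers.MongoDb: a throwaway MongoDB instance for this test class
    private readonly MongoDbContainer _container = new MongoDbBuilder().Build();
    private IMongoCollection<Investigator> _collection;
    public async Task InitializeAsync()
    {
        await _container.StartAsync();
        // Class maps are process-global: register once, and before the
        // collection is created so the driver uses the custom mapping
        if (!BsonClassMap.IsClassMapRegistered(typeof(Investigator)))
        {
            BsonClassMap.RegisterClassMap<Investigator>(
                cm => new InvestigatorMappingConfiguration().Configure(cm));
        }
        var client = new MongoClient(_container.GetConnectionString());
        var database = client.GetDatabase("test");
        _collection = database.GetCollection<Investigator>("investigators");
    }
    // IAsyncLifetime also requires teardown; stops and removes the container
    public Task DisposeAsync() => _container.DisposeAsync().AsTask();
    [Fact]
    public async Task SaveAndLoad_Investigator_SchemaV1()
    {
        var investigator = new Investigator(..., schemaVersion: 1);
        await _collection.InsertOneAsync(investigator);
        var loaded = await _collection.Find(i => i.Id == investigator.Id)
            .FirstOrDefaultAsync();
        Assert.Equal(investigator.Emails, loaded.Emails);
        Assert.Equal(investigator.ProjectHistory, loaded.ProjectHistory);
    }
}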
5. Recommendations¶
5.1 Short-Term (Low Risk)¶
- Add BsonClassMap inspection tests (Section 4.1)
  - Verify field-to-element mappings
  - Verify unmapped properties
  - Verify polymorphic type registrations
- Add round-trip serialization tests (Section 4.2)
  - Test serialization produces expected BSON structure
  - Test deserialization recreates domain objects correctly
- Document schema version behavior
  - Create explicit documentation of V0 vs V1 differences per entity
  - Add inline comments explaining conditional serialization logic
5.2 Medium-Term (Recommended)¶
- Implement Centralized Mapping Registry (Option A)
  - Extract all CreateMappings() logic to a dedicated registry class
  - Keep repositories focused on data access only
  - Easier to test and audit
- Add conditional serialization tests (Section 4.3)
  - Explicitly test schema version behavior
  - Catch regressions in migration logic
5.3 Long-Term (Optional)¶
- Consider Fluent Configuration (Option C)
  - If mapping complexity continues to grow
  - Provides cleaner separation of concerns
- Schema Version Migration System (Option D)
  - If adding new schema versions frequently
  - Makes versioning explicit and testable
6. Implementation Roadmap¶
Phase 1: Enhanced Testing (1-2 days)¶
- Add BsonClassMap inspection tests for all repositories
- Add round-trip tests for key entities (Investigator, Project, Study)
- Document expected BSON structure for each entity
Phase 2: Centralized Registry (2-3 days)¶
- Create SyRFMappingRegistry class
- Extract mappings from InvestigatorRepository
- Extract mappings from ProjectRepository (largest)
- Extract mappings from remaining repositories
- Update MongoPmUnitOfWork to use registry
- Remove CreateMappings() from repositories
Phase 3: Schema Documentation (1 day)¶
- Document V0 vs V1 schema differences
- Add BSON examples to documentation
- Create schema migration guide
7. Related Documentation¶
- CLAUDE.md - MongoDB Database Architecture section
- MongoDB Reference - CSUUID format and collection naming
- Testing Strategy - TestContainers integration
Appendix A: Code Statistics¶
- Total lines in CreateMappings() across repositories: ~600
- Total types registered: 60+
- Repositories with custom mappings: 6
- Repositories relying on AutoMap: 2
Appendix B: Schema Version Matrix¶
| Entity | V0 Fields | V1 Fields | Migration Notes |
|---|---|---|---|
| Investigator | Email (string) | Emails (array), ProjectHistory | Email → Emails array |
| Project | Registrations | Memberships | Complete restructure |
| Study | (see repo) | (see repo) | Multiple field changes |
| SystematicSearch | StudyIds, ReferenceLibraryDao | LivingSearchId, ReferencesFile | Multi-project support |