MongoDB Serialization Architecture Analysis

Executive Summary

This document provides a comprehensive analysis of the current MongoDB/BSON serialization architecture in SyRF, identifies problems and technical debt, and presents architectural options for improvement. The current implementation has significant testability issues and maintenance overhead that warrant refactoring.


1. Current State Analysis

1.1 Architecture Overview

SyRF uses the MongoDB C# driver's BsonClassMap for serialization configuration. Mapping responsibilities are currently distributed across multiple repositories:

┌─────────────────────────────────────────────────────────────────────┐
│                        Application Startup                          │
│                  CreateIndexesAndMapping.Execute()                  │
└──────────────────────────────┬──────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     MongoPmUnitOfWork.CreateMappings()              │
│  Iterates over all repository properties implementing IHasMappings  │
└──────────────────────────────┬──────────────────────────────────────┘
       ┌───────────────────────┼───────────────────────────────┐
       │                       │                               │
       ▼                       ▼                               ▼
┌─────────────────┐  ┌─────────────────┐           ┌─────────────────┐
│InvestigatorRepo │  │  ProjectRepo    │    ...    │   StudyRepo     │
│ CreateMappings()│  │ CreateMappings()│           │ CreateMappings()│
│  ~40 lines      │  │  ~425 lines     │           │  ~80 lines      │
│  3 types        │  │  30+ types      │           │  15+ types      │
└─────────────────┘  └─────────────────┘           └─────────────────┘
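
The discovery step in the diagram can be sketched with plain reflection. This is a minimal illustration, not SyRF's actual code: the member shape of IHasMappings, the FakeUnitOfWork class, and the two repository stand-ins are all assumptions made for the example.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Assumed shape of the marker interface; the real IHasMappings may differ.
public interface IHasMappings
{
    void CreateMappings();
}

// Hypothetical repository standing in for InvestigatorRepository, ProjectRepository, etc.
public class RepoWithMappings : IHasMappings
{
    public int Calls { get; private set; }
    public void CreateMappings() => Calls++;
}

// Hypothetical repository type that does not implement the interface.
public class RepoWithoutMappings { }

// Hypothetical unit of work exposing repositories as public properties.
public class FakeUnitOfWork
{
    public RepoWithMappings Investigators { get; } = new();
    public RepoWithMappings Projects { get; } = new();
    public RepoWithoutMappings Other { get; } = new();

    // Finds every public instance property whose value implements
    // IHasMappings, asks it to register its mappings, and returns
    // how many repositories were called.
    public int CreateMappings()
    {
        var repositories = GetType()
            .GetProperties(BindingFlags.Public | BindingFlags.Instance)
            .Select(p => p.GetValue(this))
            .OfType<IHasMappings>()
            .ToList();

        foreach (var repository in repositories)
            repository.CreateMappings();

        return repositories.Count;
    }
}
```

In this sketch a property whose type does not implement the interface is skipped, and the order in which repositories run follows reflection order rather than any declared dependency between mappings.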

1.2 Repository Inventory

Repository                    Lines in CreateMappings   Types Registered   Complexity
ProjectRepository             ~425                      30+                Very High
StudyRepository               ~80                       15+                High
SystematicSearchRepository    ~30                       1                  Medium
InvestigatorRepository        ~40                       3                  Medium
RiskOfBiasAiJobRepository     ~35                       10+                Medium
InvestigatorUsageRepository   ~15                       1                  Low
DataExportJobRepository       0                         0                  None
StudyCorrectionRepository     0                         0                  None

1.3 Mapping Patterns Used

Pattern 1: Schema Version-Based Conditional Serialization

// Investigator: Serialize ProjectHistory only for SchemaVersion > 0
cm.MapProperty(i => i.ProjectHistory)
    .SetShouldSerializeMethod(i => ((Investigator)i).SchemaVersion > 0);

// Investigator: Serialize Email only for SchemaVersion == 0
cm.MapProperty(i => i.Email)
    .SetElementName("Email")
    .SetShouldSerializeMethod(i => ((Investigator)i).SchemaVersion == 0);

Purpose: Enables schema migrations where old documents (SchemaVersion 0) and new documents (SchemaVersion > 0) coexist in the same collection.

Pattern 2: Field-to-Element Name Mapping

// Map private field to different BSON element name
cm.MapField(Investigator.InternalEmailsFieldName)  // _internalEmails
    .SetElementName(Investigator.EmailsPropertyName);  // "Emails"

// Map with version-specific element names
cm.MapField(SystematicSearch.V0StudyIdsFieldName)
    .SetElementName(SystematicSearch.V0StudyIdsElementName)
    .SetShouldSerializeMethod(ss => ((SystematicSearch)ss).SchemaVersion == 0);

Pattern 3: Property Unmapping

// Exclude computed/navigation properties from serialization
cm.UnmapProperty(ss => ss.FromLivingSearch);  // Computed property
cm.UnmapProperty(ss => ss.ProjectId);          // Single project (now multi-project)
cm.UnmapProperty(ss => ss.NumberOfStudies);    // Computed aggregate
cm.UnmapProperty(vc => vc.IsUnexpired);        // Computed from other fields

Pattern 4: Polymorphic Type Registration

// Register base and derived types for polymorphic deserialization
if (!BsonClassMap.IsClassMapRegistered(typeof(Annotation)))
    BsonClassMap.RegisterClassMap<Annotation>();
if (!BsonClassMap.IsClassMapRegistered(typeof(StringAnnotation)))
    BsonClassMap.RegisterClassMap<StringAnnotation>();
if (!BsonClassMap.IsClassMapRegistered(typeof(BoolAnnotation)))
    BsonClassMap.RegisterClassMap<BoolAnnotation>();
// ... 10+ annotation types

Pattern 5: Custom Serializers

// Enum as string serialization
cm.MapMember(job => job.Status)
    .SetSerializer(new EnumAsStringSerializer<RiskOfBiasAiJobStatus>());

Pattern 6: IsClassMapRegistered Guard

// Prevent double registration (required due to global static state)
if (!BsonClassMap.IsClassMapRegistered(typeof(Project)))
{
    BsonClassMap.RegisterClassMap<Project>(cm => { ... });
}

1.4 Type Registration Analysis

Types with Custom Mappings (require explicit configuration):

  • Investigator, NewEmailVerifier, VerificationCode
  • Project, Stage, ProjectMembership, Invitation, AnnotationQuestion
  • Study, ScreeningInfo, ExtractionInfo, Annotation (+ 10 subtypes)
  • SystematicSearch, ReferenceFile
  • RiskOfBiasAiJob, AiAnnotationResponse (+ 8 answer types)
  • StageUsage

Types with Auto-Mapping (no custom config, just registration):

  • Many polymorphic subtypes (e.g., StringAnnotation, BoolAnnotation)
  • Simple DTOs and value objects

2. Problems with Current Architecture

2.1 Global Static State Anti-Pattern

Problem: BsonClassMap is a global, static registry that cannot be reset or scoped.

// Once registered, cannot be unregistered or modified
BsonClassMap.RegisterClassMap<Project>(cm => { ... });

Impact:

  • Tests cannot isolate mapping configurations
  • Order-dependent behavior across test runs
  • Cannot test different mapping scenarios in parallel
  • Mapping configuration "leaks" between test fixtures

2.2 Dual Responsibility Violation

Problem: Repositories handle both data access AND schema definition.

public class ProjectRepository : MongoRepositoryBase<Project, Guid>
{
    // Responsibility 1: Data access (CRUD, queries)
    public async Task<Project?> GetByIdAsync(Guid id) { ... }

    // Responsibility 2: Schema definition (425 lines!)
    public override void CreateMappings() { ... }
}

Impact:

  • Violates Single Responsibility Principle
  • Makes repositories harder to test in isolation
  • Schema changes require modifying repository code
  • Harder to understand mapping behavior without reading repository code

2.3 Scattered Schema Definitions

Problem: Schema definitions are spread across 6+ repository files.

Impact:

  • Difficult to get a complete picture of the schema
  • Duplicate type registrations across repositories possible
  • No central validation of mapping consistency
  • Refactoring type hierarchies requires changes in multiple files

2.4 Untestable Conditional Logic

Problem: SetShouldSerializeMethod conditions contain business logic that cannot be unit tested on its own, because it exists only as inline lambdas inside the class map.

cm.MapProperty(i => i.ProjectHistory)
    .SetShouldSerializeMethod(i => ((Investigator)i).SchemaVersion > 0);

Impact:

  • Schema migration logic is implicitly defined in lambdas
  • Serialization behavior can only be verified by serializing real objects through the registered class map
  • Bugs in conditional serialization only discovered at runtime
  • Current tests only verify registration, not configuration
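
One way to make this logic testable without touching BsonClassMap is to move each condition into a named, pure function and hand that function to SetShouldSerializeMethod. The sketch below shows only the predicate side, so it has no MongoDB dependency. The ISchemaVersioned interface, the InvestigatorSerializationRules class, and the VersionedStub record are hypothetical names introduced for illustration; none of them appear in the current code.

```csharp
using System;

// Hypothetical interface; assumes versioned entities expose SchemaVersion.
public interface ISchemaVersioned
{
    int SchemaVersion { get; }
}

// Hypothetical home for the conditions currently written as inline lambdas.
public static class InvestigatorSerializationRules
{
    // Mirrors the existing condition: SchemaVersion > 0.
    public static bool ShouldSerializeProjectHistory(object entity) =>
        ((ISchemaVersioned)entity).SchemaVersion > 0;

    // Mirrors the existing condition: SchemaVersion == 0.
    public static bool ShouldSerializeLegacyEmail(object entity) =>
        ((ISchemaVersioned)entity).SchemaVersion == 0;
}

// Minimal stand-in entity used only to exercise the rules.
public sealed record VersionedStub(int SchemaVersion) : ISchemaVersioned;
```

Both methods have the bool(object) shape that SetShouldSerializeMethod expects, so a mapping could pass them as method groups, for example .SetShouldSerializeMethod(InvestigatorSerializationRules.ShouldSerializeProjectHistory), while ordinary unit tests call them directly with a stub.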

2.5 Magic Strings and Implicit Contracts

Problem: Element names are strings, creating implicit contracts between code and database.

cm.MapField(Project.V0RegistrationsName)
    .SetElementName("Registrations");  // Must match database field name

Impact:

  • Typos in element names cause silent data loss
  • Renaming requires careful coordination with existing data
  • No compile-time verification of field mappings

2.6 Test Coverage Gaps

Current Tests (from BsonClassMapTests.cs):

  • ✅ CreateMappings() doesn't throw
  • ✅ Classes are registered
  • ✅ Idempotent (calling twice doesn't fail)

Not Tested:

  • ❌ Actual field-to-element mappings
  • ❌ Conditional serialization logic
  • ❌ Custom serializer behavior
  • ❌ Unmapped properties are actually excluded
  • ❌ Schema version behavior (v0 vs v1)
  • ❌ Round-trip serialization/deserialization

3. Architectural Options

Option A: Centralized Mapping Registry

Approach: Extract all mappings to a dedicated registry class (SyRFMappingRegistry in the sketch that follows).

public interface IMongoMappingProvider
{
    void RegisterMappings();
    IReadOnlyList<Type> RegisteredTypes { get; }
}

public class SyRFMappingRegistry : IMongoMappingProvider
{
    private readonly List<Type> _registeredTypes = new();

    public void RegisterMappings()
    {
        RegisterInvestigatorMappings();
        RegisterProjectMappings();
        RegisterStudyMappings();
        // ... etc
    }

    private void RegisterInvestigatorMappings()
    {
        RegisterIfNotExists<Investigator>(cm =>
        {
            cm.AutoMap();
            MapConditionalField(cm, i => i.ProjectHistory,
                schemaVersion => schemaVersion > 0);
            // ... etc
        });
    }

    public IReadOnlyList<Type> RegisteredTypes => _registeredTypes;
}

Benefits:

  • Single location for all schema definitions
  • Can implement custom conventions
  • Easier to audit complete schema
  • Repository classes become pure data access

Effort: Medium (refactor existing code)
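
The registry sketch for this option calls two helpers it does not define, RegisterIfNotExists and MapConditionalField. A possible shape for them is shown here, assuming the MongoDB.Bson package. The signatures are guesses inferred from the call sites, and ISchemaVersioned, MappingRegistryBase, SampleEntity, and SampleRegistry are hypothetical names used only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;
using MongoDB.Bson;
using MongoDB.Bson.Serialization;

// Hypothetical interface; assumes versioned entities expose SchemaVersion.
public interface ISchemaVersioned
{
    int SchemaVersion { get; }
}

public abstract class MappingRegistryBase
{
    private readonly List<Type> _registeredTypes = new();

    public IReadOnlyList<Type> RegisteredTypes => _registeredTypes;

    // Registers a class map once; later calls for the same type are no-ops.
    // The check-then-register sequence is not atomic, so this is meant to be
    // called from single-threaded startup code.
    protected void RegisterIfNotExists<T>(Action<BsonClassMap<T>> configure)
    {
        if (BsonClassMap.IsClassMapRegistered(typeof(T)))
            return;

        BsonClassMap.RegisterClassMap<T>(configure);
        _registeredTypes.Add(typeof(T));
    }

    // Maps a member and serializes it only when the entity's schema version
    // satisfies the given predicate.
    protected static BsonMemberMap MapConditionalField<T, TMember>(
        BsonClassMap<T> cm,
        Expression<Func<T, TMember>> member,
        Func<int, bool> versionPredicate)
        where T : ISchemaVersioned
    {
        return cm.MapMember(member)
            .SetShouldSerializeMethod(o => versionPredicate(((T)o).SchemaVersion));
    }
}

// Example entity and registry used only to demonstrate the helpers.
public class SampleEntity : ISchemaVersioned
{
    public int SchemaVersion { get; set; }
    public string History { get; set; } = "";
}

public class SampleRegistry : MappingRegistryBase
{
    public void RegisterMappings() =>
        RegisterIfNotExists<SampleEntity>(cm =>
        {
            cm.AutoMap();
            MapConditionalField(cm, e => e.History, version => version > 0);
        });
}
```

With this shape, SyRFMappingRegistry could inherit from the base class or carry the two methods itself. Serializing a SampleEntity with ToBsonDocument() should then omit History when SchemaVersion is 0 and include it otherwise.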

Option B: Attribute-Based Mapping

Approach: Define mappings declaratively on model classes.

[BsonDiscriminator("Investigator")]
public class Investigator : AggregateRoot<Guid>
{
    [BsonIgnore]
    public string PrimaryEmail => Emails.First();

    // [BsonElement] on a private field opts it in and sets the element name
    [BsonElement("Emails")]
    private List<string> _internalEmails;

    // Hypothetical custom attribute; the driver has no built-in equivalent
    [BsonConditional(nameof(SchemaVersion), ">", 0)]
    public List<ProjectAccess> ProjectHistory { get; private set; }
}

Benefits:

  • Schema definition co-located with domain model
  • More discoverable (no need to find repository code)
  • Standard MongoDB driver pattern (some attributes already supported)

Drawbacks:

  • Not all mapping scenarios supported by attributes
  • Custom attribute for conditional serialization would be needed
  • Schema version logic embedded in domain model (coupling)

Effort: High (would require custom attributes + reflection processing)

Option C: Fluent Configuration Files (Schema-as-Code)

Approach: Separate configuration classes per aggregate.

public class InvestigatorMappingConfiguration : IMappingConfiguration<Investigator>
{
    public void Configure(BsonClassMap<Investigator> cm)
    {
        cm.AutoMap();

        // Clear declarative style
        cm.MapPrivateField("_internalEmails")
            .ToElement("Emails");

        cm.MapProperty(i => i.ProjectHistory)
            .SerializeWhen(i => i.SchemaVersion > 0);

        cm.IgnoreProperty(i => i.PrimaryEmail);
        cm.IgnoreProperty(i => i.Email);
    }
}

Benefits:

  • Clean separation of concerns
  • Each configuration is testable in isolation
  • Easy to compose and extend
  • Follows the Entity Framework Core IEntityTypeConfiguration<T> pattern (familiar)

Effort: Medium-High (new abstraction layer)
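
The fluent calls in the configuration for this option (MapPrivateField, ToElement, SerializeWhen, IgnoreProperty) are not part of the MongoDB driver; they would belong to the new abstraction layer. The following is a minimal sketch of how they might wrap real driver calls, assuming the MongoDB.Bson package. The interface and method names come from the example, the signatures are assumptions, and SampleDoc and SampleDocConfiguration are hypothetical types added for demonstration.

```csharp
using System;
using System.Linq.Expressions;
using MongoDB.Bson.Serialization;

// Assumed shape of the per-aggregate configuration contract.
public interface IMappingConfiguration<T>
{
    void Configure(BsonClassMap<T> cm);
}

public static class MappingExtensions
{
    // Wraps MapField(string), which can map a non-public field by name.
    public static BsonMemberMap MapPrivateField<T>(
        this BsonClassMap<T> cm, string fieldName) =>
        cm.MapField(fieldName);

    // Wraps SetElementName.
    public static BsonMemberMap ToElement(
        this BsonMemberMap map, string elementName) =>
        map.SetElementName(elementName);

    // Typed wrapper over SetShouldSerializeMethod, which takes Func<object, bool>.
    public static BsonMemberMap SerializeWhen<T>(
        this BsonMemberMap map, Func<T, bool> predicate) =>
        map.SetShouldSerializeMethod(o => predicate((T)o));

    // Wraps UnmapProperty.
    public static void IgnoreProperty<T, TMember>(
        this BsonClassMap<T> cm, Expression<Func<T, TMember>> property) =>
        cm.UnmapProperty(property);
}

// Example document type used only to demonstrate the extensions.
public class SampleDoc
{
    private string _secret = "s";
    public int SchemaVersion { get; set; }
    public string Legacy { get; set; } = "";
    public string Scratch { get; set; } = "";
    public string Secret => _secret;
}

public class SampleDocConfiguration : IMappingConfiguration<SampleDoc>
{
    public void Configure(BsonClassMap<SampleDoc> cm)
    {
        cm.AutoMap();
        cm.MapPrivateField("_secret").ToElement("Secret");
        cm.MapProperty(d => d.Legacy)
            .SerializeWhen<SampleDoc>(d => d.SchemaVersion == 0);
        cm.IgnoreProperty(d => d.Scratch);
    }
}
```

Because BsonMemberMap is not generic, SerializeWhen cannot infer the entity type, so the call site needs it spelled out as SerializeWhen<SampleDoc>(...). A configuration written this way can also be applied to a fresh new BsonClassMap<SampleDoc>() that is never registered, which would let a test inspect the member maps without touching the global registry.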

Option D: Schema Version Migration System

Approach: Formalize schema versions with explicit migration logic.

// Generic so that each version receives a typed class map
public interface ISchemaVersion<T>
{
    int Version { get; }
    void ConfigureMapping(BsonClassMap<T> cm);
}

public class InvestigatorSchemaV0 : ISchemaVersion<Investigator>
{
    public int Version => 0;
    public void ConfigureMapping(BsonClassMap<Investigator> cm)
    {
        cm.MapProperty(i => i.Email).SetElementName("Email");
        // V0-specific mapping
    }
}

public class InvestigatorSchemaV1 : ISchemaVersion<Investigator>
{
    public int Version => 1;
    public void ConfigureMapping(BsonClassMap<Investigator> cm)
    {
        cm.MapField("_internalEmails").SetElementName("Emails");
        cm.MapProperty(i => i.ProjectHistory);
        // V1-specific mapping
    }
}

Benefits:

  • Explicit, documented schema versions
  • Each version testable independently
  • Clear upgrade path for migrations
  • Better alignment with eventual consistency patterns

Effort: High (significant architectural change)


4. Testing Strategy Improvements

4.1 BsonClassMap Inspection Tests

Test that mappings are configured correctly by inspecting the registered class maps.

[Fact]
public void Investigator_EmailsField_IsMappedCorrectly()
{
    EnsureInvestigatorMappingsCreated();

    var classMap = BsonClassMap.GetRegisteredClassMaps()
        .First(cm => cm.ClassType == typeof(Investigator));

    var memberMap = classMap.GetMemberMap(Investigator.InternalEmailsFieldName);

    Assert.NotNull(memberMap);
    Assert.Equal("Emails", memberMap.ElementName);
}

[Fact]
public void Investigator_PrimaryEmail_IsUnmapped()
{
    EnsureInvestigatorMappingsCreated();

    var classMap = BsonClassMap.GetRegisteredClassMaps()
        .First(cm => cm.ClassType == typeof(Investigator));

    var memberMap = classMap.AllMemberMaps
        .FirstOrDefault(m => m.MemberName == nameof(Investigator.PrimaryEmail));

    Assert.Null(memberMap);
}

4.2 Round-Trip Serialization Tests

Verify documents serialize and deserialize correctly.

[Theory]
[InlineData(0)]
[InlineData(1)]
public void Investigator_RoundTrip_PreservesData(int schemaVersion)
{
    EnsureInvestigatorMappingsCreated();

    var investigator = new Investigator(
        Guid.NewGuid(), "auth0|123",
        new FullName("Test", "User"),
        "test@example.com",
        schemaVersion);

    // Serialize
    var bson = investigator.ToBsonDocument();

    // Verify schema-specific behavior
    if (schemaVersion == 0)
    {
        Assert.True(bson.Contains("Email"));
        Assert.False(bson.Contains("ProjectHistory"));
    }
    else
    {
        Assert.True(bson.Contains("Emails"));
        Assert.True(bson.Contains("ProjectHistory"));
    }

    // Deserialize
    var deserialized = BsonSerializer.Deserialize<Investigator>(bson);

    Assert.Equal(investigator.Id, deserialized.Id);
    Assert.Equal(investigator.PrimaryEmail, deserialized.PrimaryEmail);
}

4.3 Conditional Serialization Tests

Explicitly test SetShouldSerializeMethod behavior.

[Fact]
public void Project_Registrations_OnlySerializedForSchemaV0()
{
    EnsureMappingsCreated();

    var projectV0 = CreateProject(schemaVersion: 0);
    projectV0.AddRegistration("test-reg");

    var projectV1 = CreateProject(schemaVersion: 1);
    projectV1.AddMembership(new ProjectMembership(...));

    var bsonV0 = projectV0.ToBsonDocument();
    var bsonV1 = projectV1.ToBsonDocument();

    Assert.True(bsonV0.Contains("Registrations"));
    Assert.False(bsonV0.Contains("Memberships"));

    Assert.False(bsonV1.Contains("Registrations"));
    Assert.True(bsonV1.Contains("Memberships"));
}

4.4 Integration Tests with Testcontainers

Full integration tests with real MongoDB.

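public class MongoSerializationIntegrationTests : IAsyncLifetime
{
    // Assumes the Testcontainers.MongoDb package
    private readonly MongoDbContainer _container = new MongoDbBuilder().Build();
    private IMongoCollection<Investigator> _collection;

    public async Task InitializeAsync()
    {
        await _container.StartAsync();

        // Register mappings before the first serializer lookup. The class map
        // registry is process-wide, so guard against an earlier registration.
        if (!BsonClassMap.IsClassMapRegistered(typeof(Investigator)))
        {
            BsonClassMap.RegisterClassMap<Investigator>(
                cm => new InvestigatorMappingConfiguration().Configure(cm));
        }

        var client = new MongoClient(_container.GetConnectionString());
        var database = client.GetDatabase("test");
        _collection = database.GetCollection<Investigator>("investigators");
    }

    // IAsyncLifetime also requires teardown; stop and remove the container
    public Task DisposeAsync() => _container.DisposeAsync().AsTask();

    [Fact]
    public async Task SaveAndLoad_Investigator_SchemaV1()
    {
        var investigator = new Investigator(..., schemaVersion: 1);

        await _collection.InsertOneAsync(investigator);
        var loaded = await _collection.Find(i => i.Id == investigator.Id)
            .FirstOrDefaultAsync();

        Assert.Equal(investigator.Emails, loaded.Emails);
        Assert.Equal(investigator.ProjectHistory, loaded.ProjectHistory);
    }
}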

5. Recommendations

5.1 Short-Term (Low Risk)

  1. Add BsonClassMap inspection tests (Section 4.1)
       • Verify field-to-element mappings
       • Verify unmapped properties
       • Verify polymorphic type registrations

  2. Add round-trip serialization tests (Section 4.2)
       • Test serialization produces expected BSON structure
       • Test deserialization recreates domain objects correctly

  3. Document schema version behavior
       • Create explicit documentation of V0 vs V1 differences per entity
       • Add inline comments explaining conditional serialization logic

5.2 Medium-Term

  1. Implement Centralized Mapping Registry (Option A)
       • Extract all CreateMappings() logic to dedicated registry class
       • Keep repositories focused on data access only
       • Easier to test and audit

  2. Add conditional serialization tests (Section 4.3)
       • Explicitly test schema version behavior
       • Catch regressions in migration logic

5.3 Long-Term (Optional)

  1. Consider Fluent Configuration (Option C)
       • If mapping complexity continues to grow
       • Provides cleaner separation of concerns

  2. Schema Version Migration System (Option D)
       • If adding new schema versions frequently
       • Makes versioning explicit and testable

6. Implementation Roadmap

Phase 1: Enhanced Testing (1-2 days)

  • Add BsonClassMap inspection tests for all repositories
  • Add round-trip tests for key entities (Investigator, Project, Study)
  • Document expected BSON structure for each entity

Phase 2: Centralized Registry (2-3 days)

  • Create SyRFMappingRegistry class
  • Extract mappings from InvestigatorRepository
  • Extract mappings from ProjectRepository (largest)
  • Extract mappings from remaining repositories
  • Update MongoPmUnitOfWork to use registry
  • Remove CreateMappings() from repositories

Phase 3: Schema Documentation (1 day)

  • Document V0 vs V1 schema differences
  • Add BSON examples to documentation
  • Create schema migration guide


Appendix A: Code Statistics

  • Total lines in CreateMappings() across repositories: ~600
  • Total types registered: 60+
  • Repositories with custom mappings: 6
  • Repositories relying on AutoMap: 2

Appendix B: Schema Version Matrix

Entity             V0 Fields                       V1 Fields                        Migration Notes
Investigator       Email (string)                  Emails (array), ProjectHistory   Email → Emails array
Project            Registrations                   Memberships                      Complete restructure
Study              (see repo)                      (see repo)                       Multiple field changes
SystematicSearch   StudyIds, ReferenceLibraryDao   LivingSearchId, ReferencesFile   Multi-project support