Skip to content

Seed Data Quality Analysis and Planning

Executive Summary

This document analyzes the current seed data implementation for preview environments, identifies quality issues, and provides comprehensive planning for creating high-quality seed data.

Key Principles:

  1. Breadth and depth - Seed projects should exemplify common configurations and states users will encounter
  2. Use high-level services - Prefer application services over bespoke seeding code to ensure realistic data
  3. Realistic data - Seeded data should be as close to production data as possible (including any real bugs/issues that would occur)
  4. Testability - Seeding should exercise the same code paths as real user operations
  5. Keep defaults where appropriate - Some projects should use default screening modes with seed investigators performing actual screening

Part 1: Current Implementation Analysis

Seeding Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Application Startup                          │
├─────────────────────────────────────────────────────────────────┤
│  1. IRunAtInit discovered via assembly scanning                  │
│  2. DatabaseSeeder.Execute() called                              │
│  3. Check SYRF_SEED_DATA_ENABLED=true                           │
│  4. Check if database already has data                           │
│  5. Create seed data if conditions met                           │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│                     First User Registration                      │
├─────────────────────────────────────────────────────────────────┤
│  1. UserHasRegisteredHandler fires                               │
│  2. SeedDataOwnershipTransfer.TransferToFirstRealUserAsync()    │
│  3. Add real user as admin to all seed projects                  │
│  4. Transfer ownership from SeedBot to real user                 │
└─────────────────────────────────────────────────────────────────┘

Current Seed Data Inventory

Project Studies Stages Screening State Annotation State AgreementMode
Quick Start Demo 10 2 (Screening active, Annotation inactive) Unscreened N/A Default (Dual)
Screening In Progress 30 1 (Screening active) 50% complete (dual) N/A Default (Dual)
Ready for Annotation 20 2 (Screening inactive, Annotation active) Complete Ready Default (Dual)
Complete Review 15 2 (both inactive) Complete Complete Default (Dual)
Private Research 8 1 (Screening active) Unscreened N/A Default (Dual)

Key Files


Part 2: Identified Issues

Issue 1: Stages Appearing Disabled for Screening

Symptom: Users report that screening stages appear disabled or inaccessible.

Root Cause Analysis:

  1. Project AgreementMode: Projects default to AgreementMode.AutomatedDualScreening:
public Project() : base(Guid.NewGuid())
{
    AgreementMode = AgreementMode.AutomatedDualScreening;
}
  1. Post-Ownership Transfer State: After ownership transfer, the real user is the only member who can log in. Seed reviewers exist but are fake users (system|seed-reviewer-alpha). For dual screening projects, the UI may show screening as blocked because only one real reviewer exists.

Resolution Options:

Option Description Recommended For
Keep dual screening with seed reviewers Demonstrates realistic dual screening workflow "Screening In Progress" project
Set SingleReviewerDecides User can complete all screening alone "Quick Start Demo" project
Add real user to reviewer role User joins existing screening workflow Projects with partial screening

Note: It's OK to keep some projects with default dual screening where seed investigators have already done screening. The new user can add more users and change settings as needed.

Issue 2: Study Stats Not Correct

Symptom: Study statistics (screened/unscreened counts) don't match actual study states.

Root Cause Analysis:

  1. Stats Are Computed Dynamically: Stats use MongoDB aggregation pipelines that query:
  2. ScreeningInfo.AgreementMeasure.NumberScreened
  3. ScreeningInfo.AgreementMeasure.AbsoluteAgreementRatio
  4. ScreeningInfo.Inclusion

  5. TrackedProjectAgreementThresholds Empty: Studies are created with empty thresholds:

trackedProjectAgreementThresholds: Array.Empty<ProjectAgreementThreshold>()

This means studies haven't been evaluated against any threshold, affecting stats calculations.

  1. Direct Domain Model Usage: The seeder uses domain methods directly (study.AddScreening()) rather than going through the same code paths as the API endpoints. This may skip side effects that normally occur.

Issue 3: Ownership Transfer Side Effects

Symptom: Seed reviewers remain as project members but cannot log in.

Analysis: This is actually acceptable behavior:

  • Seed reviewers demonstrate realistic projects with multiple team members
  • Their screening decisions show what dual screening looks like
  • The new owner can add real users and modify settings
  • This reflects how a real user would join an existing team project

Part 3: Architectural Analysis

Current Seeding Approach vs API Endpoints

The current DatabaseSeeder creates entities directly using domain models:

// Current approach - Direct domain model creation
var project = new Project(name, creatorId, ...);
typeof(Project).GetProperty("Id")?.SetValue(project, id);  // Reflection to set ID
project.AddStage(...);
_unitOfWork.Save(project);

Issues with this approach:

  1. Bypasses service layer validation - Misses checks that would occur in production
  2. Uses reflection - Sets IDs directly, not how real projects are created
  3. May miss side effects - Service layer often triggers additional operations
  4. Different code path - Any bugs in seeding won't be found in production (and vice versa)

API Endpoint Analysis

Based on analysis of API controllers and Angular effects, here's how real users create data:

Project Creation Flow (Production)

Angular Component
store.dispatch(projectDetailActions.create({...}))
project-detail.effects.ts → POST /api/projects
ProjectController.CreateProject()
IProjectManagementService.CreateProject(name, creatorId, ...)
new Project(...) + _unitOfWork.Save()

Screening Flow (Production)

Angular Component
store.dispatch(reviewComponentActions.submitScreeningAndFetchNext({...}))
review.effects.ts → POST /api/reviews/screening
ReviewController.SubmitScreeningAndGetNextStudy()
study.AddScreening() + IStageReviewService.GetRandomStudyAsync()
_unitOfWork.Save(study)

Service Layer Mapping

Operation Controller Service Method Seeder Should Use
Create Project ProjectController IProjectManagementService.CreateProject() Yes
Create Investigator ApplicationController IApplicationService.EnsureInvestigatorCreatedAndUpdatedAsync() Yes
Add Stage ProjectController project.AddStage() (domain method) OK - same as controller
Add Screening ReviewController study.AddScreening() (domain method) OK - but verify side effects
Update Inclusion Info StudyController IProjectManagementService.UpdateStudyInclusionInfoForProjectAsync() Yes - should call after screening
Add Search ProjectController IProjectManagementService.GetOrCreateSearchImportJobAsync() Consider for realism

Where Controllers Do More Than Just Call Services

Some controllers contain logic beyond simple service delegation. This affects seeding accuracy:

1. ReviewController - Stats Calculation After Screening

// ReviewController.cs - after screening submission
var stats = await _stageReviewService.GetFullStatsForInvestigatorAsync(
    project.Id, stage.Id, CurrentUserId);

Impact on Seeding: The seeder doesn't calculate/return stats after adding screenings.

Recommendation: Either call UpdateStudyInclusionInfoForProjectAsync() after seeding screenings, or ensure domain model correctly updates stats synchronously.

2. ProjectController - Instrumentation Code

// ProjectController.cs - timing instrumentation in RepositionAnnotationQuestion
var stopwatch = Stopwatch.StartNew();
// ... operation ...
Console.WriteLine($"Get Project Time: {getProjectTime} ms");

Impact: Debug code in production. Not a seeding concern but violates SRP.

Recommendation: Move instrumentation to a decorator or aspect.

3. ApplicationController - Multi-Step Signin

// ApplicationController.cs - signin involves multiple operations
var investigator = await _applicationService.EnsureInvestigatorCreatedAndUpdatedAsync(...);
await _applicationService.RetrieveAndValidateInvestigatorUsage(...);
investigator.RecordSignin(ipAddress);
_pmUnitOfWork.Save(investigator);

Impact on Seeding: Seeder creates investigators directly without going through signin flow.

Recommendation: Consider using IApplicationService.EnsureInvestigatorCreatedAndUpdatedAsync() for seed investigator creation.

SOLID Principle Violations Identified

Issue Principle Location Recommendation
Controller does instrumentation SRP ProjectController:539-574 Move to decorator
Seeder uses reflection for IDs DIP DatabaseSeeder:581 Use factory or accept generated IDs
Stats calculated in controller SRP ReviewController Move to service
Direct UnitOfWork in controllers DIP Multiple OK for simple queries, wrap complex operations

Part 4: Seed Project Planning Matrix

Guiding Principles

  1. Exemplify breadth: Cover all major feature configurations users will encounter
  2. Exemplify depth: Show projects at various workflow stages
  3. Keep some defaults: Not every project needs custom settings - real users start with defaults
  4. Use seed investigators actively: Have seed reviewers do actual screening to show realistic workflows
  5. Annotations matter: Include projects with annotation questions and some annotation data

Proposed Seed Project Matrix

Project Purpose AgreementMode Stages Studies Screening State Annotation State Outcome Data Members
Quick Start Demo Intro project for new users SingleReviewerDecides 1 screening, 1 annotation 10 Unscreened N/A None Owner only
Dual Screening Example Shows dual review workflow AutomatedDualScreening 1 screening 30 50% complete by seed reviewers N/A None Owner + Alpha + Beta
Ready for Annotation Shows annotation workflow SingleReviewerDecides 1 screening (complete), 1 annotation 20 Complete (all included) Ready with questions None Owner
Partial Annotation Shows annotation in progress SingleReviewerDecides 1 screening (complete), 1 annotation 15 Complete 50% annotated 25% studies Owner
Complete Review Shows finished project SingleReviewerDecides Both complete 15 Complete Complete All studies Owner
Data Extraction Demo Shows outcome data workflow SingleReviewerDecides 1 screening (complete), 1 annotation (complete) 12 Complete Complete Varied (see below) Owner
Private Research Shows access control AutomatedDualScreening 1 screening 8 Unscreened N/A None Owner (requires approval)
Large Study Set Performance testing SingleReviewerDecides 1 screening 100 Unscreened N/A None Owner
Three Reviewer Shows 3-reviewer workflow ThreeReviewersOneReconciles 1 screening 20 30% with agreement N/A None Owner + Alpha + Beta + Gamma

Configuration Coverage Matrix

Configuration Covered By Project(s)
Single reviewer screening Quick Start, Ready for Annotation, Complete Review
Dual screening Dual Screening Example, Private Research
Three reviewer Three Reviewer
Screening only stage Quick Start, Dual Screening Example
Annotation stage Quick Start, Ready for Annotation, Partial Annotation, Complete Review
Private project Private Research
Public project All others
Large study count Large Study Set
Complete workflow Complete Review
Auto-approve joins Quick Start (could add)
Require approval Private Research

Annotation Question Coverage

Seed projects with annotation should include questions from all categories:

Category Example Questions Projects
Study Species, Study design Ready for Annotation, Partial Annotation, Complete Review, Data Extraction Demo
Risk of Bias Randomization, Blinding Ready for Annotation, Partial Annotation, Complete Review, Data Extraction Demo
Model Induction Disease model, Induction method Partial Annotation, Complete Review, Data Extraction Demo
Treatment Intervention, Control, Dose Partial Annotation, Complete Review, Data Extraction Demo
Outcome Outcome measure, Timepoint Complete Review, Data Extraction Demo
Experiment Sample size, Duration Complete Review, Data Extraction Demo

Outcome Data Coverage

The OutcomeData entity (OutcomeData.cs) tracks quantitative experimental results per study.

Data Extraction Demo Project - Demonstrates various outcome data scenarios:

Studies Outcome Data State Purpose
3 studies Complete data (all cohorts) Shows fully extracted study
3 studies Partial data (some cohorts) Shows work-in-progress
3 studies Multiple experiments Shows multi-experiment studies
3 studies No outcome data Shows contrast with extracted studies

Sample Outcome Data Structure:

// Example seeded outcome data
var outcomeData = new OutcomeData(
    projectId: projectId,
    annotatorId: SeedDataConstants.SeedBotId,
    stageId: extractionStageId,
    experimentId: experimentId,
    cohortId: cohortId)
{
    Mean = 42.5,
    StandardDeviation = 3.2,
    SampleSize = 10,
    Timepoint = "Day 7",
    OutcomeMeasure = "Tumor volume (mm³)"
};

study.ExtractionInfo.AddOutcomeData(outcomeData);

Seeder Code Path for Outcome Data:

// Should mirror production: Study.AddSessionData() → ExtractionInfo.AddOutcomeData()
var sessionDto = new SessionSubmissionDto
{
    // ... annotations ...
    OutcomeData = GenerateOutcomeDataForStudy(study, project)
};

study.AddSessionData(investigatorId, sessionDto, questionIds, stageExtraction);

Priority 1: Use High-Level Services for Project Creation

Current:

var project = new Project(name, creatorId, ...);
typeof(Project).GetProperty("Id")?.SetValue(project, id);  // Reflection!
_unitOfWork.Save(project);

Recommended:

// Inject IProjectManagementService into DatabaseSeeder
var project = await _projectManagementService.CreateProject(
    name: SeedDataConstants.QuickStartProjectName,
    creatorId: SeedDataConstants.SeedBotId,
    protocolUrl: "",
    isPublic: true,
    inclusionCriteria: "Studies relevant to the research question",
    exclusionCriteria: "Studies not meeting inclusion criteria",
    contactEmail: SeedDataConstants.SeedBotEmail,
    notes: SeedDataConstants.QuickStartProjectDescription,
    keywords: new HashSet<string> { "demo", "quick-start" },
    machineLearningEnabled: false);

// Store generated ID in SeedDataConstants mapping or accept dynamic IDs

Trade-off: Loses deterministic IDs. Consider:

  • Option A: Accept dynamic IDs and store them after creation
  • Option B: Add ID parameter to CreateProject (service layer change)
  • Option C: Keep current approach for IDs but use services for everything else

Priority 2: Use IApplicationService for Investigator Creation

Current:

var investigator = new Investigator(id, auth0Id, fullName, email, schemaVersion);
_unitOfWork.Save(investigator);

Recommended:

var investigator = await _applicationService.EnsureInvestigatorCreatedAndUpdatedAsync(
    investigatorId: SeedDataConstants.SeedBotId,
    auth0Id: SeedDataConstants.SeedBotAuth0Id,
    name: new FullName(SeedDataConstants.SeedBotFirstName, SeedDataConstants.SeedBotLastName),
    email: SeedDataConstants.SeedBotEmail,
    pictureUrl: null,
    ipAddress: "127.0.0.1");

Benefit: Uses same code path as production signin.

Priority 3: Call Inclusion Update After Screening

Add after adding all screenings to a project:

// After seeding all screenings for a project
await _projectManagementService.UpdateStudyInclusionInfoForProjectAsync(
    projectId,
    replaceExisting: true);

_logger.LogDebug("Updated inclusion info for project {ProjectId}", projectId);

Benefit: Ensures stats are calculated correctly using production code.

Priority 4: Set AgreementMode Explicitly

For projects that should use single screening:

project.UpdateAgreementMode(AgreementMode.SingleReviewerDecides, SeedDataConstants.SeedBotId);

For projects demonstrating dual screening:

// Leave as default (AutomatedDualScreening) - this is intentional
// Seed reviewers Alpha and Beta do the screening

Priority 5: Add Annotation Data

For "Partial Annotation" and "Complete Review" projects:

// After adding annotation questions, add annotation sessions
foreach (var study in studies.Take(annotatedCount))
{
    var session = new AnnotationSession(
        study.Id,
        SeedDataConstants.SeedBotId,
        annotationStageId);

    // Add answers to questions
    foreach (var question in project.AnnotationQuestions)
    {
        session.AddAnswer(question.Id, GenerateSampleAnswer(question));
    }

    study.AddAnnotationSession(session);
    _unitOfWork.Save(study);
}

Part 6: Testing Recommendations

Integration Tests for Seeding

[Fact]
public async Task DatabaseSeeder_AfterSeeding_StatsAreCorrect()
{
    // Arrange
    var seeder = new DatabaseSeeder(...);

    // Act
    seeder.Execute();

    // Assert - Quick Start should have 10 unscreened studies
    var quickStartStats = await GetProjectStats(SeedDataConstants.QuickStartProjectId);
    quickStartStats.TotalStudies.Should().Be(10);
    quickStartStats.UnscreenedStudies.Should().Be(10);

    // Assert - Dual Screening In Progress should have correct split
    var dualStats = await GetProjectStats(SeedDataConstants.DualScreeningProjectId);
    dualStats.TotalStudies.Should().Be(30);
    dualStats.SufficientlyScreened.Should().BeGreaterThan(0);
    dualStats.PartiallyScreened.Should().BeGreaterThan(0);
    dualStats.UnscreenedStudies.Should().BeGreaterThan(0);
}

[Fact]
public async Task DatabaseSeeder_AfterSeeding_ProjectsHaveCorrectAgreementMode()
{
    // Arrange & Act
    seeder.Execute();

    // Assert
    var quickStart = await _unitOfWork.Projects.GetAsync(SeedDataConstants.QuickStartProjectId);
    quickStart.AgreementMode.Should().Be(AgreementMode.SingleReviewerDecides);

    var dualScreening = await _unitOfWork.Projects.GetAsync(SeedDataConstants.DualScreeningProjectId);
    dualScreening.AgreementMode.Should().Be(AgreementMode.AutomatedDualScreening);
}

Manual Testing Checklist

For each seed project, verify:

  • Project appears in project list with correct name/description
  • Can open project details without errors
  • Stage shows correct active/inactive state in UI
  • Study count in UI matches seeded count
  • Screening stats show expected numbers (not 0/0/0)
  • Can start screening (for active screening stages)
  • Screening decisions from seed reviewers are visible
  • Can start annotation (for active annotation stages)
  • Annotation questions appear in annotation view
  • Private projects require approval to join

Part 7: Service Layer Improvement Opportunities (Detailed Analysis)

Based on comprehensive codebase analysis, these architectural changes would benefit both seeding and testability.

7.1 Inclusion Calculation - When Called and Gaps

Where UpdateStudyInclusionInfoForProjectAsync IS called:

Location Trigger Context
UpdateStudyScreeningStatsConsumer.cs:17-33 Message handler after AgreementMode change Async via MassTransit
ProjectController.cs:259-267 Manual API call POST /update-study-inclusion On-demand admin action
ProjectController.cs:269-276 Batch admin POST /update-all-study-inclusion Batch processing

Deep Dive: How InclusionInfo Serialization Actually Works

The MongoDB BsonClassMap for ScreeningInfo (StudyRepository.cs:1427-1436) reveals the architecture:

cm.MapProperty(si => si.InclusionInfo).SetIgnoreIfNull(true);   // Computed property IS mapped
cm.UnmapField(ScreeningInfo.TrackedProjectAgreementThresholdsName); // Backing field NOT stored

The serialization flow:

  1. On Save: The InclusionInfo getter computes values from _trackedProjectAgreementThresholds + current Screenings
  2. On Load: Stored InclusionInfo is deserialized → setter extracts ProjectAgreementThreshold into _trackedProjectAgreementThresholds
  3. Subsequent saves: Getter recalculates from loaded thresholds + updated screenings → values ARE updated!

The ACTUAL gap (corrected analysis):

Scenario Behavior Is This a Problem?
Study where UpdateStudyInclusionInfoForProjectAsync was NEVER called InclusionInfo is null YES - MongoDB filters won't find it
Study where it WAS called, then screening added Save recalculates from stored thresholds NO - Auto-updates correctly
Production screening after project setup Works correctly (threshold from AgreementMode change) NO - Normal production flow
Seeded studies with screenings InclusionInfo is null (never initialized) YES - This is the seeding gap

Conclusion: The gap is NOT about individual screenings failing to update. It's about initialization - studies need UpdateStudyInclusionInfoForProjectAsync called at least once to populate the threshold configuration. After that, normal save operations update the computed values correctly.

Flow when AgreementMode changes:

ScreeningController.PostScreeningSettings()
    → project.UpdateAgreementMode()
    → raises ProjectAgreementThresholdUpdatedEvent
    → ProjectAgreementThresholdUpdatedHandler
    → sends IUpdateStudyScreeningStatsCommand
    → UpdateStudyScreeningStatsConsumer.Consume()
    → IStudyRepository.UpdateStudyInclusionInfoForProjectAsync()

Justification for Change: The production design is actually sound - once UpdateStudyInclusionInfoForProjectAsync is called (which happens after AgreementMode changes), subsequent screenings auto-update correctly via the computed property serialization. The gap is specific to seeding where this initialization step is skipped.

Recommendation for Seeding:

// After seeding all screenings for a project, initialize InclusionInfo:
await _projectManagementService.UpdateStudyInclusionInfoForProjectAsync(projectId, replaceExisting: true);

Why this works: This populates each study's InclusionInfo with the project's AgreementThreshold. Once set, any subsequent screening additions (even by seed users in the app) will correctly update via the computed property getter during save.

Affected Code for Implementing This:

  • DatabaseSeeder.cs - Add call after each project's screenings are seeded
  • This is a one-time initialization, not a recurring requirement

7.2 Optional ID Parameter - Deep Design Analysis

Question: Should IProjectManagementService.CreateProject() accept an optional ID?

Current Signature:

Task<Project> CreateProject(string name, Guid creatorId, string protocolUrl, ...);
// ID is generated internally by Project constructor

Deep Analysis: Where ProjectId is Stored

Critical Discovery: ProjectId is stored in many related entities, not just the Project itself:

Entity Storage Relationship
Study.ProjectId Stored property FK to Project
Screening.ProjectId Stored property FK to Project (within Study)
Annotation.ProjectId Stored property FK to Project (within Study.ExtractionInfo)
OutcomeData.ProjectId Stored property FK to Project (within Study.ExtractionInfo)
SystematicSearch.ProjectId Stored property FK to Project
DataExportJob.ProjectId Stored property FK to Project
RiskOfBiasAiJob.ProjectId Stored property FK to Project
StudyPdfCorrection.ProjectId Stored property FK to Project
LivingSearchSetting.ProjectId Stored property FK to Project
ProjectAccess.ProjectId Stored property FK to Project (in Investigator)

Computed Properties (safe to ignore) - these derive from Project.Id at runtime:

  • BulkStudyUpdateJob.ProjectId => Project.Id
  • AnnotationQuestion.ProjectId => Project.Id
  • Stage.ProjectId => Project!.Id
  • ProjectMembership.ProjectId => Project.Id
  • SearchImportJob.ProjectId => Project.Id

Technical Constraint: Entity ID Immutability

The base Entity<TId> class (Entity.cs:15) defines:

public TId Id { get; init; }  // init-only accessor - immutable after construction!

Implication: Post-creation ID changes require reflection to bypass the init accessor.


Where the change happens: IProjectManagementService (production service layer)

// Change to production service interface
public interface IProjectManagementService
{
    // Original signature modified with optional ID
    Task<Project> CreateProject(string name, Guid creatorId, ..., Guid? id = null);
}

// Implementation changes
public async Task<Project> CreateProject(string name, Guid creatorId, ..., Guid? id = null)
{
    var project = new Project(id ?? Guid.NewGuid());  // Use provided ID or generate
    // ... rest of implementation
}

Architectural Location: Service layer (production code path)

Cons:

  • Pollutes production API - Optional id parameter is never used in production
  • Violates Single Responsibility - Production service handles seeding concerns
  • Risk of misuse - Accidental ID collision if someone passes a duplicate ID
  • Testing concerns in production - Method signature reveals testing/seeding backdoor

Verdict: ❌ Rejected - seeding concerns shouldn't affect production service signature


Option B: Seeder Factory with Post-Creation ID Override (NOT Recommended)

public class SeederProjectFactory : ISeederProjectFactory
{
    public async Task<Project> CreateProjectAsync(CreateProjectDto dto, Guid? overrideId = null)
    {
        var project = await _service.CreateProject(...);
        if (overrideId.HasValue)
        {
            typeof(Project).GetProperty("Id")?.SetValue(project, overrideId.Value);
        }
        return project;
    }
}

Critical Consistency Problem:

If seeding creates Studies, Searches, Screenings, etc. after the ID override, they'll have the correct ID. But what about entities created before?

// CreateProject internally calls:
Memberships.Add(ProjectMembership.InitialProjectCreator(this));  // Uses original ID!

The ProjectMembership is created with the original Project.Id, then we override. This creates an inconsistent state.

Even worse: If CreateProject saves immediately (which it does), we'd have two documents in MongoDB with different IDs representing the same project.

Cons:

  • Entity.Id is init-only - requires reflection to override
  • Race condition: Entities created before override have wrong ProjectId
  • Database inconsistency: Original project may already be saved before override
  • Hidden complexity: Developers won't expect ID mutation post-construction

Verdict: ❌ Rejected - causes data consistency issues


// Store generated IDs after creation
var project = await _service.CreateProject(...);
SeedDataConstants.RuntimeProjectIds["QuickStart"] = project.Id;

// Later, ownership transfer looks up dynamically
var projectIds = SeedDataConstants.RuntimeProjectIds.Values.ToList();
await _ownershipService.TransferProjectsToUser(userId, projectIds);

Pros:

  • Clean production code - no seeding concerns leak into production
  • Consistency - all entities use the same ID from construction
  • Simple - no reflection, no special factory

Cons:

  • Non-deterministic IDs - IDs differ between environments
  • Dynamic lookup - ownership transfer must use runtime lookup
  • Test assertions - can't hardcode expected IDs in tests

When This Works Well:

  • Ownership transfer already loops through projects - just needs runtime list
  • Tests can query by name/characteristics instead of ID
  • Preview environments are ephemeral anyway

Verdict: ✅ Recommended for immediate implementation


Where the change happens: Project entity class (domain layer, NOT service layer)

Key Difference from Option A:

  • Option A modifies the service interface (production API changes)
  • Option D adds a factory method on the entity (production API unchanged, new creation path)

Core Insight: The factory method should call the same substantive constructor used by ProjectManagementService.CreateProject(). No duplication of initialization logic.

Current Constructor Chain (lines 27-57 of Project.cs):

// CURRENT: Substantive constructor → parameterless → base(NewGuid)
public Project(string name, Guid creatorId, string protocolUrl, ...)
    : this()  // ← calls parameterless
{
    // initialization
}

public Project() : base(Guid.NewGuid())  // ← ID generated here
{
    // defaults
}

Refactored Constructor Chain (factory method delegates to internal constructor):

// Project.cs - Refactored to support deterministic IDs
public class Project : AggregateRoot<Guid>
{
    // PUBLIC: Substantive constructor - delegates to internal with new ID
    public Project(string name, Guid creatorId, string protocolUrl,
        bool isPublic, string inclusionCriteria, string exclusionCriteria,
        int schemaVersion, ICollection<string> keywords, string contactEmail = "",
        string notes = "", LivingSearchConfig? livingSearchConfig = null)
        : this(Guid.NewGuid(), name, creatorId, protocolUrl, isPublic,
               inclusionCriteria, exclusionCriteria, schemaVersion, keywords,
               contactEmail, notes, livingSearchConfig)
    {
    }

    // PRIVATE: Internal constructor - ALL initialization happens here
    private Project(Guid id, string name, Guid creatorId, string protocolUrl,
        bool isPublic, string inclusionCriteria, string exclusionCriteria,
        int schemaVersion, ICollection<string> keywords, string contactEmail,
        string notes, LivingSearchConfig? livingSearchConfig)
        : this(id)
    {
        DefaultSchemaVersion = schemaVersion;
        SecuritySettings = new ProjectSecuritySettings(this);
        ContactEmail = contactEmail;
        Notes = notes;
        InclusionCriteria = inclusionCriteria;
        ExclusionCriteria = exclusionCriteria;
        Name = name;
        OwnerId = creatorId;
        CreatorId = creatorId;
        AddProtocolLink(protocolUrl);
        Memberships.Add(ProjectMembership.InitialProjectCreator(this));
        IsPublic = isPublic;
        CreationDate = DateTime.Now;
        Keywords = keywords;
        if (livingSearchConfig != null) AddLivingSearch(livingSearchConfig);
    }

    // PRIVATE: ID-only constructor (for deserialization compatibility)
    private Project(Guid id) : base(id)
    {
        LivingSearches = new List<LivingSearch>();
        AgreementMode = AgreementMode.AutomatedDualScreening;
    }

    // PUBLIC: Parameterless (for MongoDB deserialization)
    public Project() : this(Guid.NewGuid())
    {
    }

    // NEW: Factory method for deterministic IDs - calls SAME constructor as service
    public static Project CreateWithId(
        Guid id,
        string name,
        Guid creatorId,
        string protocolUrl = "",
        bool isPublic = false,
        string inclusionCriteria = "",
        string exclusionCriteria = "",
        int schemaVersion = 1,
        ICollection<string>? keywords = null,
        string contactEmail = "",
        string notes = "",
        LivingSearchConfig? livingSearchConfig = null)
    {
        return new Project(id, name, creatorId, protocolUrl, isPublic,
            inclusionCriteria, exclusionCriteria, schemaVersion,
            keywords ?? Array.Empty<string>(), contactEmail, notes,
            livingSearchConfig);
    }
}

// Seeder uses factory method
var project = Project.CreateWithId(
    id: SeedDataConstants.QuickStartProjectId,      // Deterministic!
    name: SeedDataConstants.QuickStartProjectName,
    creatorId: SeedDataConstants.SeedBotId);

// Production code continues using service (unchanged)
var project = await _projectService.CreateProject(name, userId, ...);  // Auto-generates ID

Why This Works:

  • Factory method calls the same private constructor as the public one
  • All initialization logic is in ONE place (the private constructor)
  • No duplication of AddMembership, AddProtocolLink, etc.
  • Public API unchanged - existing code continues working

Architectural Comparison:

Aspect Option A (Service) Option D (Factory)
What changes Service interface signature Entity class (add method)
Production code affected Yes - new optional param No - existing path unchanged
Where ID is provided Service layer Entity construction
Separation of concerns Poor - seeding in production API Good - factory is opt-in
Child entity IDs Correct (ID set before children) Correct (ID set before children)

Pros:

  • ID set at construction - all child entities (Membership, etc.) get correct ID from the start
  • No reflection - uses language features correctly
  • Production API unchanged - IProjectManagementService signature stays the same
  • Explicit opt-in - only code that needs deterministic IDs uses the factory
  • Type-safe - factory enforces required parameters at compile time

Cons:

  • Pattern needed for all aggregates - Studies, Investigators, etc. need similar factories
  • Slightly more code - private constructor + factory method per aggregate

Verdict: ✅ RECOMMENDED for deterministic ID requirement


Final Recommendation

Approach Recommendation Use Case
Option D (Factory Method) RECOMMENDED Deterministic IDs for seeding
Option C (Accept Generated) ✅ Fallback If factory changes are deferred
Option A/B ❌ Avoid Pollutes production API or causes inconsistency

Rationale: Option D provides deterministic IDs with clean architecture - the factory method is opt-in, production code is unchanged, and all child entities get the correct ID from construction.


7.3 Project Template Implementation Plan (Detailed)

Goal: Allow creating projects from templates for both seeding and future user features.

Note: ProjectTemplate does NOT currently exist in the codebase. This is a proposed new entity.


Is ProjectTemplate an Aggregate Root?

Answer: Yes, ProjectTemplate should be an Aggregate Root.

Reasoning:

Criteria Analysis
Identity Templates have their own lifecycle, independent of any Project
Invariants Templates enforce their own rules (valid stage configuration, etc.)
Transactional Boundary Templates are loaded/saved as a unit
Reference Pattern Projects reference templates by ID, not containment

Alternative Considered: Making templates a collection within a "TemplateLibrary" aggregate. This was rejected because:

  • Templates can be created by different users
  • No natural parent entity exists
  • Individual template versioning becomes complex

Phase 1: Template Entity (Aggregate Root)

// New file: SyRF.ProjectManagement.Core/Model/TemplateAggregate/ProjectTemplate.cs
public class ProjectTemplate : AggregateRoot<Guid>
{
    // Identity & Metadata
    public string Name { get; private set; }
    public string Description { get; private set; }
    public bool IsSystemTemplate { get; private set; }  // Seed/built-in templates
    public Guid? CreatedByInvestigatorId { get; private set; }
    public DateTime CreatedAt { get; private set; }

    // Visibility
    public TemplateVisibility Visibility { get; private set; }  // Public, Private, Organization

    // Template Configuration (Value Objects - snapshots, not live references)
    public AgreementMode DefaultAgreementMode { get; private set; }
    public IReadOnlyList<StageTemplate> Stages => _stages.AsReadOnly();
    public IReadOnlyList<AnnotationQuestionTemplate> Questions => _questions.AsReadOnly();
    public HashSet<string> DefaultKeywords { get; private set; }

    private List<StageTemplate> _stages = new();
    private List<AnnotationQuestionTemplate> _questions = new();

    // Factory method for creating from existing project
    public static ProjectTemplate CreateFromProject(Project source, string templateName, Guid creatorId)
    {
        var template = new ProjectTemplate(templateName, creatorId, source.AgreementMode);
        template.CopyStagesFrom(source);
        template.CopyQuestionsFrom(source);
        return template;
    }

    // Factory method for system templates
    public static ProjectTemplate CreateSystemTemplate(string name, string description)
    {
        return new ProjectTemplate(name, null, AgreementMode.AutomatedDualScreening)
        {
            IsSystemTemplate = true,
            Description = description
        };
    }
}

Value Objects (not Aggregate Roots):

// StageTemplate - snapshot of stage configuration
public record StageTemplate(
    string Name,
    StageType Type,
    bool IsActive,
    int Order
);

// AnnotationQuestionTemplate - snapshot of question configuration
public record AnnotationQuestionTemplate(
    string QuestionText,
    string AnswerType,
    bool IsRequired,
    IReadOnlyList<string> Options  // For multiple choice
);

Phase 2: Template Service Methods (Detailed)

public interface IProjectTemplateService
{
    // === Template Management ===

    /// <summary>
    /// Create a new template from an existing project.
    /// Copies stage configuration and annotation questions as snapshots.
    /// </summary>
    Task<ProjectTemplate> CreateTemplateFromProjectAsync(
        Guid sourceProjectId,
        string templateName,
        Guid creatorId);

    /// <summary>
    /// Create a new empty template (for manual configuration).
    /// </summary>
    Task<ProjectTemplate> CreateEmptyTemplateAsync(
        string name,
        string description,
        Guid creatorId);

    /// <summary>
    /// Update template metadata (name, description, visibility).
    /// Does NOT update stage/question configuration after creation.
    /// </summary>
    Task UpdateTemplateMetadataAsync(
        Guid templateId,
        string? name,
        string? description,
        TemplateVisibility? visibility);

    /// <summary>
    /// Delete a user-created template. System templates cannot be deleted.
    /// </summary>
    Task DeleteTemplateAsync(Guid templateId, Guid requesterId);

    // === Template Retrieval ===

    /// <summary>
    /// Get templates available to a user:
    /// - System templates (always visible)
    /// - Public templates
    /// - User's own private templates
    /// </summary>
    Task<IEnumerable<ProjectTemplate>> GetAvailableTemplatesAsync(Guid? investigatorId);

    /// <summary>
    /// Get a specific template by ID.
    /// </summary>
    Task<ProjectTemplate?> GetTemplateByIdAsync(Guid templateId);

    /// <summary>
    /// Get system template by name (for seeding).
    /// </summary>
    Task<ProjectTemplate?> GetSystemTemplateAsync(string templateName);

    // === Project Creation ===

    /// <summary>
    /// Create a new project from a template.
    /// Uses IProjectManagementService internally - same code path as manual creation.
    /// </summary>
    Task<Project> CreateProjectFromTemplateAsync(
        Guid templateId,
        string projectName,
        Guid ownerId,
        string? protocolUrl = null,
        bool? isPublic = null);  // Overrides template defaults
}

Implementation Details:

public class ProjectTemplateService : IProjectTemplateService
{
    private readonly IProjectTemplateRepository _templateRepository;
    private readonly IProjectManagementService _projectService;
    private readonly IPMUnitOfWork _unitOfWork;

    public async Task<Project> CreateProjectFromTemplateAsync(
        Guid templateId,
        string projectName,
        Guid ownerId,
        string? protocolUrl = null,
        bool? isPublic = null)
    {
        var template = await _templateRepository.GetAsync(templateId)
            ?? throw new EntityNotFoundException("Template", templateId);

        // 1. Create base project using EXISTING service (same code path!)
        var project = await _projectService.CreateProject(
            name: projectName,
            creatorId: ownerId,
            protocolUrl: protocolUrl ?? "",
            isPublic: isPublic ?? false,
            inclusionCriteria: "",
            exclusionCriteria: "",
            contactEmail: "",
            notes: $"Created from template: {template.Name}",
            keywords: template.DefaultKeywords,
            machineLearningEnabled: false);

        // 2. Apply template configuration
        project.UpdateAgreementMode(template.DefaultAgreementMode, ownerId);

        foreach (var stageTemplate in template.Stages.OrderBy(s => s.Order))
        {
            project.AddStage(stageTemplate.Name, stageTemplate.Type);
            if (!stageTemplate.IsActive)
            {
                var stage = project.Stages.Last();
                stage.Deactivate();
            }
        }

        foreach (var questionTemplate in template.Questions)
        {
            project.UpsertCustomAnnotationQuestion(
                questionTemplate.QuestionText,
                questionTemplate.AnswerType,
                questionTemplate.IsRequired,
                questionTemplate.Options?.ToList());
        }

        // 3. Save with all configuration applied
        await _unitOfWork.SaveAsync(project);

        return project;
    }
}

Phase 3: End User Workflow

Creating a Project from Template (Future UI):

┌─────────────────────────────────────────────────────┐
│ Create New Project                                   │
├─────────────────────────────────────────────────────┤
│                                                      │
│ ○ Start from scratch                                 │
│ ● Use a template                                     │
│                                                      │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Available Templates                             │ │
│ │                                                 │ │
│ │ 📋 Quick Start (System)                         │ │
│ │    Basic screening workflow                     │ │
│ │                                                 │ │
│ │ 📋 Full Systematic Review (System)              │ │
│ │    Screening + Annotation with ROB questions    │ │
│ │                                                 │ │
│ │ 📋 My Custom Template (Private)                 │ │
│ │    Based on "Cancer Research 2024" project      │ │
│ │                                                 │ │
│ └─────────────────────────────────────────────────┘ │
│                                                      │
│ Project Name: [________________________]             │
│                                                      │
│              [Cancel]  [Create Project]              │
└─────────────────────────────────────────────────────┘

Saving a Project as Template (Future UI):

Project Settings → Actions → "Save as Template"
    → Dialog: Template name, visibility (private/public)
    → Creates snapshot of current stage/question config

Phase 4: Seeder Integration

// In DatabaseSeeder
public async Task SeedProjectsAsync()
{
    // Ensure system templates exist
    await EnsureSystemTemplatesExistAsync();

    // Create seed projects from templates
    var quickStartTemplate = await _templateService.GetSystemTemplateAsync("QuickStart");
    var quickStartProject = await _templateService.CreateProjectFromTemplateAsync(
        quickStartTemplate!.Id,
        SeedDataConstants.QuickStartProjectName,
        SeedDataConstants.SeedBotId);

    // Store runtime ID for ownership transfer
    _runtimeProjectIds["QuickStart"] = quickStartProject.Id;

    // Add seed studies, screenings, etc.
    await SeedStudiesForProject(quickStartProject.Id, 10);
}

private async Task EnsureSystemTemplatesExistAsync()
{
    var existing = await _templateService.GetSystemTemplateAsync("QuickStart");
    if (existing != null) return;

    var template = ProjectTemplate.CreateSystemTemplate(
        "QuickStart",
        "Basic screening project with single-reviewer workflow");
    template.AddStage("Title/Abstract Screening", StageType.Screening, isActive: true);
    await _templateRepository.SaveAsync(template);
}

Files to Create

File Purpose
SyRF.ProjectManagement.Core/Model/TemplateAggregate/ProjectTemplate.cs Aggregate Root
SyRF.ProjectManagement.Core/Model/TemplateAggregate/StageTemplate.cs Value Object
SyRF.ProjectManagement.Core/Model/TemplateAggregate/AnnotationQuestionTemplate.cs Value Object
SyRF.ProjectManagement.Core/Interfaces/IProjectTemplateService.cs Service Interface
SyRF.ProjectManagement.Core/Interfaces/IProjectTemplateRepository.cs Repository Interface
SyRF.ProjectManagement.Core/Services/ProjectTemplateService.cs Service Implementation
SyRF.ProjectManagement.Mongo.Data/Repositories/ProjectTemplateRepository.cs MongoDB Repository

Migration Path

  1. Phase 1: Create entities, repository, basic service (template storage only)
  2. Phase 2: Add CreateProjectFromTemplateAsync (seeder integration)
  3. Phase 3: Add API endpoints for template CRUD
  4. Phase 4: Add UI for "Create from Template" and "Save as Template"

Seeder can use Phase 2 immediately, with user-facing features added later.


7.4 Stats Consolidation - Implementation Plan

Note: This section is part of the improvement plan, not just documentation. Stats consolidation is a prerequisite for reliable seeding.


Problem Statement

Stats calculation is scattered across multiple components, making it difficult to ensure seeded data has correct statistics. This affects both production code quality and seeding reliability.

Current Scattered Locations:

Component Location What It Calculates
StageReviewService.GetFullProjectStatsAsync() StageReviewService.cs:35-40 Overall project stats
StageReviewService.GetReviewerStatsForStageAsync() StageReviewService.cs:75-83 Per-reviewer stats
StudyRepository.UpdateStudyInclusionInfoForProjectAsync() StudyRepository.cs:1078-1104 Inclusion info updates
ExtractionInfo.SessionTallies ExtractionInfo.cs:30-42 Annotation session tallies
ReviewController (stats after screening) ReviewController.cs:61-119 Returns stats in response

Why Stats Need Service-Level Recalculation (or Don't)

Key Principle: Stats write operations at the domain service layer are only needed when the calculation depends on config and state spanning multiple aggregates.

Stat Type Needs Service Recalculation? Reason
Screening Stats ✅ YES Depends on Project config (AgreementMode, threshold) AND Study screenings
Inclusion Info ✅ YES Depends on Project config AND aggregated screening decisions
Annotation Stats ❌ NO SessionTallies is computed within Study aggregate alone
Outcome Data Stats ❌ NO Aggregation query over Studies, no cross-aggregate config needed

Annotation Stats Analysis (ExtractionInfo.cs:30-42):

// SessionTallies is a COMPUTED PROPERTY on ExtractionInfo (part of Study)
public IEnumerable<SessionTally> SessionTallies
{
    get
    {
        return Sessions.GroupBy(ans => ans.StageId).Select(gp =>
            new SessionTally(gp.Key,
                gp.Count(ses => !ses.Reconciliation),
                gp.Count(ses => !ses.Reconciliation && ses.Status == SessionStatus.Completed),
                gp.Any(ses => ses.Reconciliation),
                gp.Any(ses => ses.Reconciliation && ses.Status == SessionStatus.Completed))
        );
    }
}

This calculation:

  • Uses ONLY Sessions from the Study's own ExtractionInfo
  • Does NOT need Project config (no AgreementMode, no question count)
  • Is computed on read, not stored → no recalculation needed

Contrast with Screening Stats which MUST use service layer:

  • "Sufficiently screened" requires Project's AgreementMode and threshold
  • "Included/Excluded" aggregates screening decisions using Project rules
  • Cannot be computed from Study aggregate alone

Screening Stats Analysis (StudyStats.cs, ScreeningInfo.cs):

Why RecalculateScreeningStatsAsync IS needed:

  1. Cross-Aggregate Configuration Dependency:
// StudyStats.cs lines 292-295 - Stats use Project thresholds
var agreementThreshold = project.AgreementThreshold;
var minNumberScreened = agreementThreshold.NumberScreened;
var absoluteAgreementRatio = agreementThreshold.AbsoluteAgreementRatio;
  1. MongoDB Aggregation Uses Project Rules:
// Simplified from AllScreeningPipelineGroupStage (StudyStats.cs:274-276)
{
  SufficientlyScreened: {
    $sum: {
      $cond: [
        {$and: [
          {$gte: ["$ScreeningInfo.AgreementMeasure.NumberScreened", minNumberScreened]},
          {$gt: ["$ScreeningInfo.AgreementMeasure.AbsoluteAgreementRatio", absoluteAgreementRatio]}
        ]},
        1, 0
      ]
    }
  }
}
  1. Cached Thresholds on ScreeningInfo (ScreeningInfo.cs:49-60):
// _trackedProjectAgreementThresholds is CACHED on each Study's ScreeningInfo
private IEnumerable<ProjectAgreementThreshold>? _trackedProjectAgreementThresholds;

// InclusionInfo depends on this cached config
public IEnumerable<TrackedInclusionInfo>? InclusionInfo
{
    get => _trackedProjectAgreementThresholds?.Select(
        pat => new TrackedInclusionInfo(pat, pat.GetInclusionStatus(this), ...));
}

If Project thresholds change, this cached data becomes stale and needs recalculation.

When RecalculateScreeningStatsAsync is triggered:

  • After bulk screening operations
  • When Project AgreementMode or threshold changes
  • After seed data creation (to initialize _trackedProjectAgreementThresholds)
  • Via IUpdateStudyScreeningStatsCommand message bus command

Proposed Architecture (Corrected)

public interface IProjectStatsService
{
    // === Read Operations (Query Side) ===
    Task<ProjectStats> GetProjectStatsAsync(Guid projectId);
    Task<StageStats> GetStageStatsAsync(Guid projectId, Guid stageId);
    Task<ReviewerStats> GetReviewerStatsAsync(Guid projectId, Guid stageId, Guid reviewerId);

    // === Write Operations (Command Side) ===
    // ONLY for stats that span aggregates (Project config + Study data)
    Task RecalculateScreeningStatsAsync(Guid projectId);
    Task RecalculateInclusionInfoAsync(Guid projectId);

    // NOT NEEDED for annotation/outcome - these are computed properties or simple aggregations
    // Task RecalculateAnnotationStatsAsync(Guid projectId);  ← REMOVED
    // Task RecalculateOutcomeDataStatsAsync(Guid projectId); ← REMOVED
}

// Unified stats model
public record ProjectStats(
    int TotalStudies,
    ScreeningStats Screening,
    AnnotationStats Annotation,      // Read from SessionTallies aggregation
    OutcomeDataStats OutcomeData,    // Read from OutcomeData count query
    DateTime LastUpdated
);

public record ScreeningStats(
    int Unscreened,
    int PartiallyScreened,
    int SufficientlyScreened,  // Requires Project AgreementMode
    int Included,              // Requires Project rules
    int Excluded,              // Requires Project rules
    int Conflicts              // Requires Project threshold
);

public record AnnotationStats(
    int NotStarted,
    int InProgress,
    int Complete,
    int TotalSessions
    // Computed by aggregating SessionTallies across studies - no write operation needed
);

public record OutcomeDataStats(
    int StudiesWithData,
    int TotalDataPoints,
    int ExperimentsCount
    // Simple count aggregation - no write operation needed
);

Implementation Steps

Phase 1: Create Service Without Breaking Changes

  1. Create IProjectStatsService interface in SyRF.ProjectManagement.Core/Interfaces/
  2. Create ProjectStatsService implementation in SyRF.ProjectManagement.Core/Services/
  3. Initially delegate to existing methods in StageReviewService
  4. Add DI registration

Phase 2: Migrate Callers

  1. Update ReviewController to call IProjectStatsService
  2. Update DatabaseSeeder to call RecalculateAllStatsAsync() after seeding
  3. Update any Angular API calls that expect stats in responses

Phase 3: Consolidate Implementation

  1. Move actual stats calculation logic from StageReviewService to ProjectStatsService
  2. Keep StageReviewService focused on study retrieval/assignment
  3. Update StudyRepository.UpdateStudyInclusionInfoForProjectAsync() to be called via stats service

Phase 4: Deprecate Old Methods

  1. Mark old StageReviewService stats methods as [Obsolete]
  2. Eventually remove after all callers migrated

Files to Create/Modify

File Action Purpose
SyRF.ProjectManagement.Core/Interfaces/IProjectStatsService.cs Create Service interface
SyRF.ProjectManagement.Core/Services/ProjectStatsService.cs Create Implementation
SyRF.ProjectManagement.Core/Model/ValueObjects/ProjectStats.cs Create Stats model
SyRF.ProjectManagement.Core/Services/StageReviewService.cs Modify Delegate to new service
SyRF.API.Endpoint/Controllers/ReviewController.cs Modify Use new service
SyRF.ProjectManagement.Endpoint/Seeding/DatabaseSeeder.cs Modify Call RecalculateAllStatsAsync()

Seeder Integration

// After seeding all data for a project
await _projectStatsService.RecalculateAllStatsAsync(projectId);
_logger.LogDebug("Recalculated all stats for project {ProjectId}", projectId);

// This single call replaces multiple scattered calls:
// - UpdateStudyInclusionInfoForProjectAsync()
// - Manual stats aggregation
// - Annotation tally updates

Priority Assessment

Priority Rationale
Medium-High Needed for reliable seeding, but can work around by calling existing methods
Technical Debt Reduces code duplication and improves maintainability
Risk Low - new service, no breaking changes initially

Recommendation: Implement Phase 1-2 before major seeding work. Phase 3-4 can be done as ongoing refactoring.


Part 8: Annotation Seeding - Validation Architecture

Production Code Path

HTTP PUT /api/projects/{projectId}/stages/{stageId}/studies/{studyId}/session/{sessionId}
ReviewController.SubmitSession() [ReviewController.cs:61-119]
SessionSubmissionWebDto → SessionSubmissionDto (AutoMapper)
study.AddSessionData() [Study.cs:116-127]
ExtractionInfo.AddSessionData() [ExtractionInfo.cs:50-64]
    ├─ AddAnnotations() - Tree-shakes old annotations, adds new
    └─ AddOutcomeData() - Tree-shakes old outcome data, adds new
pmUnitOfWork.SaveAsync(study)
DispatchEvents() - Publishes domain events

Current Seeder Proposal (INCORRECT)

// This bypasses the ReviewController and service validation
var session = new AnnotationSession(study.Id, investigatorId, stageId);
foreach (var question in questions) {
    session.AddAnswer(question.Id, GenerateSampleAnswer(question));
}
study.AddAnnotationSession(session);  // Direct domain model manipulation

Validation Architecture Analysis

Question: Where should validation live - domain, application service, or frontend?

Current State: Mixed Validation

Validation Type Current Location Problem
Required fields Frontend (Angular forms) Bypassable by seeder
Question answer types Domain model (Annotation) Good - shared
Stage membership Controller (ReviewController) Should be in service
Study belongs to project Controller query Should be in service

Best Practice: Domain-Driven Validation Layers

┌─────────────────────────────────────────────────────────────┐
│ 1. DOMAIN VALIDATION (Innermost - Invariants)              │
│    • Entity state must be valid at all times               │
│    • Throws exceptions for invalid state transitions       │
│    • Example: Annotation answer must match question type   │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. APPLICATION SERVICE VALIDATION (Business Rules)         │
│    • Cross-aggregate validation                             │
│    • Authorization checks                                   │
│    • Example: User must be project member to annotate      │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 3. FRONTEND VALIDATION (UX - Optional for Backend)         │
│    • Early feedback for users                               │
│    • NOT security - backend must revalidate                 │
│    • Example: Show "required" error before submit           │
└─────────────────────────────────────────────────────────────┘

Proposed: Application Service for Annotation Submission

// New file: SyRF.ProjectManagement.Core/Interfaces/IAnnotationService.cs
public interface IAnnotationService
{
    /// <summary>
    /// Submit an annotation session for a study.
    /// Validates: user membership, stage access, question completeness.
    /// </summary>
    Task<AnnotationSessionResult> SubmitSessionAsync(
        Guid projectId,
        Guid stageId,
        Guid studyId,
        Guid annotatorId,
        SessionSubmissionDto session);

    /// <summary>
    /// Validate session data without saving (for preview/draft).
    /// Returns validation errors instead of throwing.
    /// </summary>
    Task<ValidationResult> ValidateSessionAsync(
        Guid projectId,
        Guid stageId,
        Guid studyId,
        SessionSubmissionDto session);
}

// Implementation
public class AnnotationService : IAnnotationService
{
    public async Task<AnnotationSessionResult> SubmitSessionAsync(
        Guid projectId, Guid stageId, Guid studyId, Guid annotatorId,
        SessionSubmissionDto session)
    {
        // 1. Load aggregates
        var project = await _projectRepository.GetAsync(projectId)
            ?? throw new EntityNotFoundException("Project", projectId);
        var study = await _studyRepository.GetAsync(studyId)
            ?? throw new EntityNotFoundException("Study", studyId);

        // 2. APPLICATION VALIDATION (business rules)
        ValidateMembership(project, annotatorId);      // User must be project member
        ValidateStageAccess(project, stageId);         // Stage must be active
        ValidateStudyBelongsToProject(study, projectId); // Study in this project
        ValidateNotAlreadyAnnotated(study, annotatorId, stageId); // No duplicate

        // 3. DOMAIN VALIDATION (via domain method)
        var stage = project.Stages.Single(s => s.Id == stageId);
        study.AddSessionData(
            annotatorId,
            session,
            project.GetStageQuestionIds(stageId),
            stage.Extraction);  // Throws if invalid

        // 4. Persist
        await _unitOfWork.SaveAsync(study);

        return new AnnotationSessionResult(
            Success: true,
            SessionId: session.Id,
            StudyId: studyId);
    }

    private void ValidateMembership(Project project, Guid userId)
    {
        if (!project.Memberships.Any(m => m.InvestigatorId == userId))
            throw new UnauthorizedAccessException("User is not a project member");
    }

    private void ValidateStageAccess(Project project, Guid stageId)
    {
        var stage = project.Stages.SingleOrDefault(s => s.Id == stageId)
            ?? throw new EntityNotFoundException("Stage", stageId);
        if (!stage.IsActive)
            throw new InvalidOperationException("Stage is not active");
    }
}

Seeder Uses Same Service

// Seeder injects IAnnotationService
public class DatabaseSeeder
{
    private readonly IAnnotationService _annotationService;

    public async Task SeedAnnotationsAsync(Guid projectId, Guid stageId, IEnumerable<Study> studies)
    {
        foreach (var study in studies)
        {
            var session = GenerateSessionForStudy(study, stageId);

            // Uses SAME validation as production!
            await _annotationService.SubmitSessionAsync(
                projectId,
                stageId,
                study.Id,
                SeedDataConstants.SeedBotId,
                session);
        }
    }
}

What Exactly Changes? (Detailed Migration)

Answer: Logic and functionality remain IDENTICAL. Only the code location changes.

Current Architecture (Controller-Based)
ReviewController.SubmitSessionAsync()
├─ HTTP deserialization (ASP.NET Core)
├─ Authorization check (inline in controller)
├─ Load project (inline)
├─ Validate membership (inline)
├─ Validate stage (inline)
├─ Call study.AddSessionData() (domain method)
└─ Save via UnitOfWork (inline)
Proposed Architecture (Service-Based)
ReviewController.SubmitSessionAsync()
├─ HTTP deserialization (ASP.NET Core)
└─ Call IAnnotationService.SubmitSessionAsync()  ← SINGLE CHANGE
        ├─ Authorization check (moved here)
        ├─ Load project (moved here)
        ├─ Validate membership (moved here)
        ├─ Validate stage (moved here)
        ├─ Call study.AddSessionData() (unchanged)
        └─ Save via UnitOfWork (moved here)
Specific Code Migration
Current Location New Location Logic Changes?
ReviewController.cs:61-119 AnnotationService.SubmitSessionAsync() NO - extracted as-is
Membership check (controller inline) ValidateMembership() private method NO - same check
Stage access check (controller inline) ValidateStageAccess() private method NO - same check
study.AddSessionData() call Same call in service NO - identical
_unitOfWork.SaveAsync() call Same call in service NO - identical
Controller BEFORE (Simplified)
// ReviewController.cs - Current
[HttpPost("annotation")]
public async Task<IActionResult> SubmitSessionAsync(SessionSubmissionWebDto dto)
{
    var project = await _projectRepo.GetAsync(dto.ProjectId);
    if (!project.Memberships.Any(m => m.InvestigatorId == UserId))
        return Forbid();

    var stage = project.Stages.SingleOrDefault(s => s.Id == dto.StageId);
    if (stage == null || !stage.IsActive)
        return BadRequest("Invalid stage");

    var study = await _studyRepo.GetAsync(dto.StudyId);
    study.AddSessionData(UserId, dto.ToDto(), project.GetStageQuestionIds(dto.StageId), stage.Extraction);

    await _unitOfWork.SaveAsync(study);
    return Ok(new { SessionId = dto.Id });
}
Controller AFTER (Simplified)
// ReviewController.cs - After refactor
[HttpPost("annotation")]
public async Task<IActionResult> SubmitSessionAsync(SessionSubmissionWebDto dto)
{
    var result = await _annotationService.SubmitSessionAsync(
        dto.ProjectId,
        dto.StageId,
        dto.StudyId,
        UserId,
        dto.ToDto());

    return Ok(new { SessionId = result.SessionId });
}

Functionality Changes: NONE

  • Same validation rules
  • Same domain method calls
  • Same persistence
  • Same error conditions

What's Different: Code is now in a testable, reusable service that the seeder can also call.

Files Changed
File Change Type Description
IAnnotationService.cs NEW Interface in Core/Interfaces
AnnotationService.cs NEW Implementation in Core/Services
ReviewController.cs MODIFY Delegate to service (less code)
Startup.cs / DI MODIFY Register new service

Benefits of Service-Layer Validation

Benefit Explanation
Single Source of Truth Validation logic in one place, not duplicated in controller and seeder
Testable Can unit test validation without HTTP context
Seeder Parity Seeder uses exact same code path as production
Future-Proof New validation rules automatically apply to seeding
Separation of Concerns Controller handles HTTP, service handles business logic

Frontend Validation Consideration

The Angular frontend should:

  1. Pre-validate for UX (immediate feedback)
  2. NOT be trusted - backend always re-validates
  3. Mirror service rules where practical (DRY is nice but security is mandatory)
// Angular component - UX validation only
validateSession(session: SessionSubmission): ValidationError[] {
    const errors: ValidationError[] = [];
    if (!session.annotations?.length) {
        errors.push({ field: 'annotations', message: 'At least one answer required' });
    }
    // ... more UX checks
    return errors;
}

The backend IAnnotationService.ValidateSessionAsync() method can be exposed via API for frontend draft validation if needed.


Part 9: Seeding Configuration as Source of Truth

Problem: Duplicated Knowledge

Currently, seeding logic and tests both contain hardcoded values:

// In DatabaseSeeder.cs
var quickStartStudyCount = 10;
CreateStudies(project, quickStartStudyCount);

// In Tests
quickStartStats.TotalStudies.Should().Be(10);  // Duplicated magic number!

If seeding changes to 15 studies, tests break. This coupling is fragile.


Solution: Centralized Seeding Configuration

Create a configuration file that defines seed data structure. Both seeder and tests read from this single source of truth.

Configuration Model

// SyRF.ProjectManagement.Core/Seeding/SeedingConfig.cs
public record SeedingConfig
{
    public required IReadOnlyList<SeedProjectConfig> Projects { get; init; }
    public required IReadOnlyList<SeedInvestigatorConfig> Investigators { get; init; }
}

public record SeedProjectConfig
{
    public required string Key { get; init; }  // "QuickStart", "DualScreening"
    public required string Name { get; init; }
    public required AgreementMode AgreementMode { get; init; }
    public required int StudyCount { get; init; }
    public required IReadOnlyList<SeedStageConfig> Stages { get; init; }
    public required SeedScreeningConfig? Screening { get; init; }
    public required SeedAnnotationConfig? Annotation { get; init; }
    public required SeedOutcomeDataConfig? OutcomeData { get; init; }
    public required IReadOnlyList<string> MemberKeys { get; init; }  // ["Owner", "Alpha", "Beta"]
}

public record SeedStageConfig(
    string Name,
    StageType Type,
    bool IsActive
);

public record SeedScreeningConfig(
    int ScreenedCount,
    int PartialCount,
    int IncludedCount,
    int ExcludedCount
);

public record SeedAnnotationConfig(
    int QuestionCount,
    int AnnotatedStudyCount,
    int CompletedSessionCount
);

public record SeedOutcomeDataConfig(
    int StudiesWithData,
    int DataPointsPerStudy
);

public record SeedInvestigatorConfig(
    string Key,
    string FirstName,
    string LastName,
    string Email,
    bool IsSystemAccount
);

Configuration File (YAML)

# src/libs/project-management/SyRF.ProjectManagement.Core/Seeding/seed-config.yaml
investigators:
  - key: SeedBot
    firstName: SyRF
    lastName: Seed Bot
    email: seedbot@syrf.org.uk
    isSystemAccount: true
  - key: Alpha
    firstName: Alice
    lastName: Alpha
    email: alice.alpha@example.com
    isSystemAccount: false
  - key: Beta
    firstName: Bob
    lastName: Beta
    email: bob.beta@example.com
    isSystemAccount: false

projects:
  - key: QuickStart
    name: Quick Start Demo
    agreementMode: SingleReviewerDecides
    studyCount: 10
    stages:
      - name: Title/Abstract Screening
        type: Screening
        isActive: true
      - name: Data Extraction
        type: Annotation
        isActive: false
    screening: null  # No screening done
    annotation: null
    outcomeData: null
    memberKeys: [SeedBot]

  - key: DualScreening
    name: Dual Screening Example
    agreementMode: AutomatedDualScreening
    studyCount: 30
    stages:
      - name: Title/Abstract Screening
        type: Screening
        isActive: true
    screening:
      screenedCount: 15
      partialCount: 10
      includedCount: 8
      excludedCount: 7
    annotation: null
    outcomeData: null
    memberKeys: [SeedBot, Alpha, Beta]

  - key: DataExtraction
    name: Data Extraction Demo
    agreementMode: SingleReviewerDecides
    studyCount: 12
    stages:
      - name: Screening
        type: Screening
        isActive: false
      - name: Data Extraction
        type: Annotation
        isActive: true
    screening:
      screenedCount: 12
      partialCount: 0
      includedCount: 12
      excludedCount: 0
    annotation:
      questionCount: 10
      annotatedStudyCount: 12
      completedSessionCount: 12
    outcomeData:
      studiesWithData: 9
      dataPointsPerStudy: 3
    memberKeys: [SeedBot]

Config Loader

public interface ISeedingConfigProvider
{
    SeedingConfig GetConfig();
    SeedProjectConfig GetProjectConfig(string key);
    SeedInvestigatorConfig GetInvestigatorConfig(string key);
}

public class YamlSeedingConfigProvider : ISeedingConfigProvider
{
    private readonly Lazy<SeedingConfig> _config;

    public YamlSeedingConfigProvider()
    {
        _config = new Lazy<SeedingConfig>(() =>
        {
            var yaml = EmbeddedResource.ReadAsString("seed-config.yaml");
            return YamlDeserializer.Deserialize<SeedingConfig>(yaml);
        });
    }

    public SeedingConfig GetConfig() => _config.Value;

    public SeedProjectConfig GetProjectConfig(string key) =>
        GetConfig().Projects.Single(p => p.Key == key);
}

Seeder Uses Config

public class DatabaseSeeder
{
    private readonly ISeedingConfigProvider _configProvider;

    public async Task SeedAsync()
    {
        var config = _configProvider.GetConfig();

        foreach (var projectConfig in config.Projects)
        {
            var project = await CreateProjectFromConfig(projectConfig);
            await SeedStudiesFromConfig(project.Id, projectConfig);

            if (projectConfig.Screening != null)
                await SeedScreeningsFromConfig(project.Id, projectConfig.Screening);

            if (projectConfig.Annotation != null)
                await SeedAnnotationsFromConfig(project.Id, projectConfig.Annotation);

            if (projectConfig.OutcomeData != null)
                await SeedOutcomeDataFromConfig(project.Id, projectConfig.OutcomeData);
        }
    }
}

Tests Use Same Config

public class SeedDataValidationTests : IClassFixture<SeededDatabaseFixture>
{
    private readonly ISeedingConfigProvider _configProvider;

    [Fact]
    public async Task QuickStart_HasCorrectStudyCount()
    {
        // Config is SOURCE OF TRUTH
        var config = _configProvider.GetProjectConfig("QuickStart");

        var stats = await _statsService.GetProjectStatsAsync(quickStartProjectId);

        stats.TotalStudies.Should().Be(config.StudyCount);  // No magic numbers!
    }

    [Fact]
    public async Task DualScreening_HasCorrectScreeningProgress()
    {
        var config = _configProvider.GetProjectConfig("DualScreening");

        var stats = await _statsService.GetProjectStatsAsync(dualProjectId);

        stats.Screening.SufficientlyScreened.Should().Be(config.Screening!.ScreenedCount);
        stats.Screening.PartiallyScreened.Should().Be(config.Screening!.PartialCount);
    }

    [Theory]
    [MemberData(nameof(AllProjectConfigs))]
    public async Task AllProjects_HaveCorrectStudyCounts(SeedProjectConfig config)
    {
        var projectId = await GetProjectIdByName(config.Name);
        var stats = await _statsService.GetProjectStatsAsync(projectId);

        stats.TotalStudies.Should().Be(config.StudyCount);
    }

    public static IEnumerable<object[]> AllProjectConfigs =>
        _configProvider.GetConfig().Projects.Select(p => new object[] { p });
}

Benefits of Config-Driven Testing

Benefit Explanation
Single Source of Truth Config defines both seeding AND expected test values
No Magic Numbers Tests read from config, not hardcoded values
Flexible Testing Theory tests can iterate all projects automatically
Documentation YAML file documents what seed data exists
Easy Changes Change config once, seeder and tests both update
Environment Specific Can have different configs for preview vs staging

E2E Tests Also Use Config

// Load config at test startup (exported as JSON or TypeScript)
import { seedConfig } from './seed-config';

describe('Seed Data Validation', () => {
    const quickStart = seedConfig.projects.find(p => p.key === 'QuickStart')!;

    it('Quick Start has correct study count', () => {
        cy.visit('/projects/' + quickStartId);
        cy.get('[data-testid="study-count"]')
            .should('contain', quickStart.studyCount.toString());
    });
});

Implementation Priority

Phase Work Benefit
Phase 1 Create config model + YAML file Documents current/planned seed data
Phase 2 Refactor seeder to use config Centralized seeding logic
Phase 3 Update integration tests to use config No more magic numbers
Phase 4 Export config for E2E tests Frontend tests also use same source

Recommendation: Start with Phase 1-2 immediately. The config file serves as documentation even before tests are updated.


Part 10: Non-Production Visual Indicator (Enhanced UX)

Current State

The web app has a Version Info Dialog (version-info-dialog.component.ts) with color-coded environment badges:

Environment Background Text Color
Production #e8f5e9 (green) #2e7d32
Staging #fff3e0 (orange) #bf360c
Preview #e3f2fd (blue) #1565c0
Development #fce4ec (pink) #c2185b

Gap: Database name (syrftest vs syrfdev) is NOT shown - frontend has no visibility.


Proposed: Banner-Based Warning System

Design Principle: Real estate usage should be similar to production app. The collapsed indicator should be minimal and non-intrusive.

User Flow

┌────────────────────────────────────────────────────────────────────┐
│ User visits non-production URL                                      │
└─────────────────────────┬──────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────┐
│ Check sessionStorage: 'env-banner-collapsed'                        │
└─────────────────────────┬──────────────────────────────────────────┘
           ┌──────────────┴──────────────┐
           ↓                             ↓
     [Not Collapsed]              [Previously Collapsed]
           ↓                             ↓
┌──────────────────────────┐  ┌────────────────────────────────────┐
│ SLIM BANNER (top bar)    │  │ MINIMAL INDICATOR (corner badge)   │
│ "STAGING | Click for     │  │ Small colored pill/badge           │
│  more info"              │  │ Clicking reopens banner            │
│ [Click anywhere]         │  └────────────────────────────────────┘
└───────────┬──────────────┘
            ↓ (user clicks)
┌──────────────────────────┐
│ DETAILED DIALOG          │
│ - Environment name       │
│ - Database name          │
│ - PR number (if preview) │
│ - Important notes        │
│ [Dismiss & Collapse]     │
└───────────┬──────────────┘
            ↓ (user dismisses)
┌────────────────────────────────────────────────────────────────────┐
│ Store 'env-banner-collapsed' = true in sessionStorage               │
│ Show MINIMAL INDICATOR (same as "Previously Collapsed" state)      │
└────────────────────────────────────────────────────────────────────┘

State 1: Slim Banner (Default for Non-Production)

A single-line banner at the top of the viewport, similar height to a small toolbar:

<!-- app.component.html -->
<div *ngIf="!isProduction && !bannerCollapsed"
     class="env-banner"
     [class]="environmentClass"
     (click)="showEnvironmentDialog()">
    <mat-icon class="env-icon">info</mat-icon>
    <span class="env-text">
        <strong>{{ environmentName | uppercase }}</strong>
        <span class="separator">|</span>
        Click for environment details
    </span>
</div>
.env-banner {
    height: 32px;           // Slim - similar to a toolbar
    display: flex;
    align-items: center;
    justify-content: center;
    gap: 8px;
    cursor: pointer;
    font-size: 13px;
    position: sticky;
    top: 0;
    z-index: 1000;

    &.preview { background: #2196f3; color: white; }
    &.staging { background: #ff9800; color: white; }
    &.development { background: #e91e63; color: white; }

    .separator { opacity: 0.7; margin: 0 4px; }
    .env-icon { font-size: 18px; height: 18px; width: 18px; }

    &:hover { filter: brightness(1.1); }
}

State 2: Detailed Dialog (On Banner Click)

Opens when user clicks the banner:

@Component({
    selector: 'app-environment-dialog',
    template: `
        <h2 mat-dialog-title>Environment Information</h2>
        <mat-dialog-content>
            <div class="env-details">
                <div class="detail-row">
                    <span class="label">Environment:</span>
                    <span class="badge" [class]="environmentClass">{{ environmentName }}</span>
                </div>
                <div class="detail-row">
                    <span class="label">Database:</span>
                    <span class="value">{{ databaseName }}</span>
                </div>
                <div class="detail-row" *ngIf="prNumber">
                    <span class="label">PR Preview:</span>
                    <span class="value">#{{ prNumber }}</span>
                </div>
                <div class="detail-row" *ngIf="dataSource?.sampledAt">
                    <span class="label">Data Sampled:</span>
                    <span class="value">{{ dataSource.sampledAt | date:'medium' }}</span>
                </div>
            </div>

            <mat-divider></mat-divider>

            <div class="notes">
                <p><strong>Note:</strong> This is not production. Data may be reset at any time.</p>
            </div>
        </mat-dialog-content>
        <mat-dialog-actions align="end">
            <button mat-button (click)="dismiss(false)">Keep Banner Visible</button>
            <button mat-raised-button color="primary" (click)="dismiss(true)">
                Dismiss & Collapse
            </button>
        </mat-dialog-actions>
    `
})
export class EnvironmentDialogComponent {
    dismiss(collapse: boolean) {
        if (collapse) {
            sessionStorage.setItem('env-banner-collapsed', 'true');
        }
        this.dialogRef.close(collapse);
    }
}

State 3: Minimal Indicator (After Collapse)

A small badge in the corner - minimal screen real estate (similar to production):

<!-- Collapsed state indicator -->
<div *ngIf="!isProduction && bannerCollapsed"
     class="env-indicator"
     [class]="environmentClass"
     (click)="expandBanner()"
     matTooltip="Click to show environment details">
    {{ environmentAbbrev }}
</div>
.env-indicator {
    position: fixed;
    top: 8px;
    right: 8px;
    padding: 4px 10px;
    border-radius: 12px;
    font-size: 11px;
    font-weight: 600;
    text-transform: uppercase;
    cursor: pointer;
    z-index: 999;
    box-shadow: 0 1px 3px rgba(0,0,0,0.2);

    &.preview { background: #2196f3; color: white; }
    &.staging { background: #ff9800; color: white; }
    &.development { background: #e91e63; color: white; }

    &:hover {
        transform: scale(1.05);
        box-shadow: 0 2px 6px rgba(0,0,0,0.3);
    }
}

Abbreviations:

Environment Abbreviation
Development DEV
Staging STG
Preview PR-123
Production (not shown)

Component Logic

@Component({ ... })
export class AppComponent implements OnInit {
    isProduction = false;
    bannerCollapsed = false;
    environmentName = '';
    environmentClass = '';
    environmentAbbrev = '';

    ngOnInit() {
        this.loadEnvironmentInfo();
        this.bannerCollapsed = sessionStorage.getItem('env-banner-collapsed') === 'true';
    }

    async loadEnvironmentInfo() {
        const info = await this.appService.getApplicationInfo();
        this.isProduction = info.environment === 'Production';
        this.environmentName = info.environment;
        this.environmentClass = info.environment.toLowerCase();
        this.environmentAbbrev = this.getAbbreviation(info);
    }

    getAbbreviation(info: AppInfo): string {
        if (info.prNumber) return `PR-${info.prNumber}`;
        switch (info.environment) {
            case 'Staging': return 'STG';
            case 'Development': return 'DEV';
            default: return info.environment.substring(0, 3).toUpperCase();
        }
    }

    showEnvironmentDialog() {
        const dialogRef = this.dialog.open(EnvironmentDialogComponent, {
            data: this.environmentInfo
        });
        dialogRef.afterClosed().subscribe(collapsed => {
            if (collapsed) this.bannerCollapsed = true;
        });
    }

    expandBanner() {
        sessionStorage.removeItem('env-banner-collapsed');
        this.bannerCollapsed = false;
    }
}

Visual Comparison

State Height Location Screen Real Estate
Production 0px N/A None (baseline)
Banner (expanded) 32px Top sticky Small toolbar
Indicator (collapsed) ~24px Top-right corner Badge only
Dialog Modal Center Overlay (temporary)

Goal: Collapsed state uses similar real estate to production (just a small corner badge).


Backend Changes

  1. Add databaseName to /api/application/info endpoint
  2. Add prNumber for preview environments (from environment variable)
  3. Return environment classification
// ApplicationController.cs
[HttpGet("info")]
public IActionResult GetApplicationInfo()
{
    return Ok(new
    {
        Version = _configuration["AppVersion"],
        Environment = _configuration["ASPNETCORE_ENVIRONMENT"],
        DatabaseName = GetDatabaseNameFromConnectionString(),
        PrNumber = Environment.GetEnvironmentVariable("PR_NUMBER"),
        IsProduction = _configuration["ASPNETCORE_ENVIRONMENT"] == "Production"
    });
}

Implementation Files

File Change
app.component.html Add banner and compact indicator
app.component.ts Add state management, dialog trigger
app.component.scss Add styles
environment-warning-dialog.component.ts New dialog component
ApplicationController.cs Add database name to info endpoint
MongoContext.cs or config Extract database name

Part 11: Production Data Sampling for Non-Production Environments

Context: PII Considerations for SyRF

SyRF is a research platform for systematic reviews where:

  • User data consists primarily of researcher emails and names (professional context)
  • Project data is scientific literature metadata (titles, abstracts, DOIs)
  • Studies are references to published academic papers (public information)
  • Annotations are structured research assessments (not personal data)

Conclusion: PII is not a major concern for this application. The primary data (study metadata from academic databases) is already public. User data is limited to professional research context.

This enables practical production sampling without complex anonymization.


Architecture Overview

┌──────────────────────────────────────────────────────────────────┐
│                    Production Database (syrftest)                 │
│   Projects: ~500  Studies: ~2M  Investigators: ~3000             │
└────────────────────────────┬─────────────────────────────────────┘
              ┌──────────────┴──────────────┐
              ↓                             ↓
   ┌─────────────────────┐     ┌─────────────────────────────┐
   │    PR Preview       │     │         Staging             │
   │    (Lightweight)    │     │    (Near-Production)        │
   ├─────────────────────┤     ├─────────────────────────────┤
   │ • 3-5 sampled       │     │ • 10-20 sampled projects    │
   │   projects          │     │ • Full study sets           │
   │ • Max 500 studies   │     │ • All screening data        │
   │ • Minimal screening │     │ • Full annotation sessions  │
   │ • No annotations    │     │ • Performance testing       │
   │ • Fast creation     │     │ • Weekly refresh            │
   │ • PR-specific       │     │                             │
   └─────────────────────┘     └─────────────────────────────┘

Configuration System

YAML Configuration File

# cluster-gitops/syrf/environments/{env}/data-sampling.yaml

# PR Preview - Lightweight for quick testing
preview:
  enabled: true
  schedule: "on-demand"  # Created when PR opens
  source:
    database: syrftest
    connection: ${PROD_MONGODB_URI}  # Read-only connection

  projects:
    strategy: curated        # Use specific project IDs
    count: 5
    selection:
      - id: "uuid-of-active-project-1"      # Large active project
        reason: "High activity, good for testing"
      - id: "uuid-of-dual-screening-project"
        reason: "Demonstrates dual screening workflow"
      - id: "uuid-of-annotation-project"
        reason: "Has extraction data"

    limits:
      maxStudiesPerProject: 500
      maxScreeningsPerStudy: 10
      includeAnnotations: false    # Skip for speed
      includeOutcomeData: false

  investigators:
    strategy: project-members   # Only members of sampled projects
    additionalIds: []           # No extras needed

# Staging - Comprehensive for integration testing
staging:
  enabled: true
  schedule: "0 3 * * 0"   # Weekly, Sunday 3am
  source:
    database: syrftest
    connection: ${PROD_MONGODB_URI}

  projects:
    strategy: mixed           # Curated + random sample
    count: 20
    selection:
      # Curated essential projects
      - id: "uuid-of-flagship-project"
        reason: "Largest project, performance testing"
      - id: "uuid-of-complex-stages"
        reason: "Multi-stage workflow testing"
    randomSampleCount: 15     # Additional random projects
    randomCriteria:
      minStudies: 100
      minMembers: 2
      hasScreening: true

    limits:
      maxStudiesPerProject: null    # No limit
      maxScreeningsPerStudy: null   # Full data
      includeAnnotations: true
      includeOutcomeData: true

  investigators:
    strategy: all-active      # All investigators who logged in last 90 days
    activityThresholdDays: 90

Implementation Components

1. Sampling Job (Kubernetes CronJob or Manual)

// SyRF.DataSampling.Job/SamplingJob.cs
public class ProductionSamplingJob
{
    private readonly IMongoClient _sourceClient;
    private readonly IMongoClient _targetClient;
    private readonly SamplingConfig _config;

    public async Task ExecuteSamplingAsync(CancellationToken ct)
    {
        _logger.LogInformation("Starting production sampling for {Environment}",
            _config.EnvironmentName);

        // 1. Determine which projects to sample
        var projectIds = await GetProjectIdsToSample();
        _logger.LogInformation("Sampling {Count} projects", projectIds.Count);

        // 2. Sample investigators (members of selected projects)
        await SampleInvestigatorsAsync(projectIds, ct);

        // 3. Sample projects with their related data
        foreach (var projectId in projectIds)
        {
            await SampleProjectAsync(projectId, ct);
        }

        // 4. Create sampling metadata record
        await RecordSamplingMetadataAsync(projectIds, ct);
    }

    private async Task SampleProjectAsync(Guid projectId, CancellationToken ct)
    {
        var project = await _sourceDb.GetCollection<Project>("pmProject")
            .Find(p => p.Id == projectId)
            .FirstOrDefaultAsync(ct);

        if (project == null) return;

        // Insert project
        await _targetDb.GetCollection<Project>("pmProject")
            .InsertOneAsync(project, cancellationToken: ct);

        // Sample studies with limit
        var studyQuery = _sourceDb.GetCollection<Study>("pmStudy")
            .Find(s => s.ProjectId == projectId);

        if (_config.Projects.Limits.MaxStudiesPerProject.HasValue)
        {
            studyQuery = studyQuery.Limit(_config.Projects.Limits.MaxStudiesPerProject.Value);
        }

        var studies = await studyQuery.ToListAsync(ct);

        if (studies.Any())
        {
            await _targetDb.GetCollection<Study>("pmStudy")
                .InsertManyAsync(studies, cancellationToken: ct);
        }

        // Sample related collections...
        await SampleSystematicSearchesAsync(projectId, ct);

        if (_config.Projects.Limits.IncludeAnnotations)
        {
            await SampleAnnotationsAsync(projectId, studies.Select(s => s.Id), ct);
        }
    }
}

2. Sampling Metadata Tracking

// Track what was sampled and when
public record SamplingMetadata
{
    public Guid Id { get; init; } = Guid.NewGuid();
    public DateTime SampledAt { get; init; } = DateTime.UtcNow;
    public string SourceDatabase { get; init; }
    public string TargetDatabase { get; init; }
    public string Environment { get; init; }
    public int ProjectCount { get; init; }
    public int StudyCount { get; init; }
    public int InvestigatorCount { get; init; }
    public List<Guid> SampledProjectIds { get; init; }
    public string ConfigHash { get; init; }  // Track config version
}
// Query sampling metadata
db.samplingMetadata.find().sort({ sampledAt: -1 }).limit(1)
// Shows when data was last sampled and what's included

3. GitHub Actions Integration (PR Preview)

# .github/workflows/pr-preview.yml (addition)
jobs:
  sample-production-data:
    needs: [deploy-preview]
    runs-on: ubuntu-latest
    if: ${{ github.event.action == 'opened' }}  # Only on PR open

    steps:
      - name: Run Production Sampling
        uses: actions/github-script@v6
        with:
          script: |
            // Trigger sampling job via kubectl or API
            const prNumber = context.issue.number;
            const targetDb = `syrf_pr_${prNumber}`;

            // Call sampling API or kubectl exec
            await exec.exec('kubectl', [
              'create', 'job',
              `--from=cronjob/production-sampler`,
              `production-sample-pr-${prNumber}`,
              '-n', `pr-${prNumber}`,
              '--', '--target-db', targetDb, '--config', 'preview'
            ]);

  # Or simpler: just use seed data for PRs
  seed-preview-data:
    needs: [deploy-preview]
    runs-on: ubuntu-latest
    steps:
      - name: Run Seeder
        run: |
          kubectl exec -n pr-${{ github.event.number }} \
            deployment/syrf-project-management \
            -- dotnet SyRF.Seeder.dll --config preview

Sampling Strategies Explained

Strategy Use Case How It Works
curated PR Preview Explicitly listed project IDs, maximum control
random Staging Random sample meeting criteria (min studies, etc.)
mixed Staging Curated essentials + random sample for variety
all-active Staging (investigators) All users active in last N days
project-members PR Preview Only investigators who are members of sampled projects

Environment-Specific Recommendations

PR Preview Environments

Goal: Fast, lightweight, focused on feature testing

Setting Value Rationale
Projects 3-5 curated Known good test projects
Max Studies 500 per project Fast query performance
Screening Include (limited) Test screening workflows
Annotations Exclude Speed, not needed for most PRs
Refresh On PR open Fresh data each time

Trigger: Automatic when PR opens, or manual via workflow dispatch

Staging Environment

Goal: Comprehensive, near-production testing

Setting Value Rationale
Projects 15-20 mixed Representative variety
Max Studies Unlimited Full performance testing
Screening Full Complete workflow testing
Annotations Include E2E extraction testing
Refresh Weekly Fresh but stable for testing

Trigger: Scheduled CronJob (Sunday 3am)


PR-Description Driven Configuration

Goal: Developers can easily switch between seeding, sampling, or combo by adding a config block to their PR description - no workflow file changes needed.

PR Description Format

Add a YAML block to the PR description to configure data strategy:

## Description
This PR adds new screening workflow improvements...

## Data Strategy
<!-- syrf-data-config -->
```yaml
data:
  strategy: seed-only       # Options: seed-only, sample-only, hybrid
  sampleProjects: 3         # If sampling, how many projects
  includeProdData: false    # Include production sample?
#### Supported Strategies

| Strategy | Description | Use Case |
|----------|-------------|----------|
| `seed-only` | Only run seeder (default) | Most PRs - fast, predictable |
| `sample-only` | Only sample production | Performance testing |
| `hybrid` | Seed first, then sample | Integration testing |
| `none` | Empty database | Testing fresh install |

#### Workflow Integration

```yaml
# .github/workflows/pr-preview.yml
jobs:
  parse-data-config:
    runs-on: ubuntu-latest
    outputs:
      strategy: ${{ steps.parse.outputs.strategy }}
      sample_projects: ${{ steps.parse.outputs.sample_projects }}
    steps:
      - name: Parse PR Description
        id: parse
        uses: actions/github-script@v7
        with:
          script: |
            const body = context.payload.pull_request?.body || '';

            // Extract YAML from <!-- syrf-data-config --> block
            const match = body.match(/<!--\s*syrf-data-config\s*-->\s*```yaml\s*([\s\S]*?)```\s*<!--\s*\/syrf-data-config\s*-->/);

            let config = { strategy: 'seed-only', sampleProjects: 0 };
            if (match) {
              const yaml = require('js-yaml');
              const parsed = yaml.load(match[1]);
              config = { ...config, ...parsed.data };
            }

            core.setOutput('strategy', config.strategy);
            core.setOutput('sample_projects', config.sampleProjects || 0);

  seed-data:
    needs: [deploy-preview, parse-data-config]
    if: needs.parse-data-config.outputs.strategy != 'sample-only' && needs.parse-data-config.outputs.strategy != 'none'
    runs-on: ubuntu-latest
    steps:
      - name: Run Seeder
        run: |
          kubectl exec -n pr-${{ github.event.number }} \
            deployment/syrf-project-management \
            -- dotnet SyRF.Seeder.dll --config preview

  sample-production:
    needs: [deploy-preview, parse-data-config, seed-data]
    if: |
      needs.parse-data-config.outputs.strategy == 'sample-only' ||
      needs.parse-data-config.outputs.strategy == 'hybrid'
    runs-on: ubuntu-latest
    steps:
      - name: Sample Production Data
        run: |
          kubectl create job \
            --from=cronjob/production-sampler \
            sample-pr-${{ github.event.number }} \
            -n pr-${{ github.event.number }} \
            -- --count=${{ needs.parse-data-config.outputs.sample_projects }}

Quick Reference for Developers

Scenario PR Description Config
Default (most PRs) No config block needed → uses seed-only
Integration testing strategy: hybrid + sampleProjects: 2
Performance testing strategy: sample-only + sampleProjects: 10
Empty database strategy: none

Full Config Options

data:
  strategy: hybrid              # seed-only | sample-only | hybrid | none
  sampleProjects: 5            # Number of projects to sample (if sampling)
  sampleStrategy: curated      # curated | random | mixed
  maxStudiesPerProject: 500    # Limit studies per project (faster)
  includeAnnotations: false    # Include annotation data (slower)
  seedConfig: preview          # Which seed config to use: preview | minimal | full

Alternative: Label-Based Configuration

For simpler use cases, use PR labels instead of YAML config:

Label Equivalent Strategy
data:seed-only Default seeding
data:hybrid Seed + sample
data:sample-only Production sample only
data:none Empty database
data:sample-large Sample with 10+ projects
# Workflow checks for labels
- name: Check Data Strategy Label
  id: labels
  run: |
    if [[ "${{ contains(github.event.pull_request.labels.*.name, 'data:hybrid') }}" == "true" ]]; then
      echo "strategy=hybrid" >> $GITHUB_OUTPUT
    elif [[ "${{ contains(github.event.pull_request.labels.*.name, 'data:sample-only') }}" == "true" ]]; then
      echo "strategy=sample-only" >> $GITHUB_OUTPUT
    else
      echo "strategy=seed-only" >> $GITHUB_OUTPUT
    fi

Recommendation: Use labels for quick toggling, PR description config for fine-grained control.


Data Freshness Indicators

Add sampling metadata to the environment info endpoint:

[HttpGet("info")]
public IActionResult GetApplicationInfo()
{
    var samplingInfo = await _db.GetCollection<SamplingMetadata>("samplingMetadata")
        .Find(_ => true)
        .SortByDescending(s => s.SampledAt)
        .FirstOrDefaultAsync();

    return Ok(new
    {
        Version = _configuration["AppVersion"],
        Environment = _configuration["ASPNETCORE_ENVIRONMENT"],
        DatabaseName = GetDatabaseName(),
        DataSource = samplingInfo != null ? new
        {
            SampledAt = samplingInfo.SampledAt,
            SourceDatabase = samplingInfo.SourceDatabase,
            ProjectCount = samplingInfo.ProjectCount,
            StudyCount = samplingInfo.StudyCount
        } : null
    });
}

Frontend Display (in environment warning dialog):

<div class="detail-row" *ngIf="dataSource">
    <span class="label">Data Sampled:</span>
    <span class="value">
        {{ dataSource.sampledAt | date:'medium' }}
        ({{ dataSource.projectCount }} projects, {{ dataSource.studyCount }} studies)
    </span>
</div>

Implementation Priority

Phase Work Environment Effort
1 Enhanced seed data (Part 4 matrix) All Medium
2 Sampling config YAML schema All Low
3 PR preview sampling job Preview Medium
4 Staging weekly sampling CronJob Staging Medium
5 Sampling metadata tracking All Low
6 Environment info API enhancement All Low

Recommendation: Start with Phase 1-2. Seed data alone may be sufficient for most testing. Add production sampling (Phase 3-4) only if specific production data scenarios are needed for testing.


Part 12: Concrete Next Steps Plan

Immediate Actions (This Sprint)

# Task Files to Change Effort
1 Refactor Project constructor chain Project.cs Low
Add private constructor with ID param, add CreateWithId factory
2 Apply same pattern to Study Study.cs Low
Factory method for deterministic study IDs
3 Apply same pattern to Investigator Investigator.cs Low
Factory method for deterministic investigator IDs
4 Update DatabaseSeeder to use factory methods DatabaseSeeder.cs Medium
Replace reflection-based ID setting with factory calls

Short-Term Actions (Next 2 Sprints)

# Task Files to Change Effort
5 Create ISeedDataOwnershipTransfer interface New file: Core/Seeding/ISeedDataOwnershipTransfer.cs Low
6 Implement comprehensive seed project matrix DatabaseSeeder.cs, SeedDataConstants.cs Medium
Per Part 4 coverage matrix - all AgreementModes, stage states
7 Add screening data seeding DatabaseSeeder.cs Medium
Seed screenings using study.AddScreening() path
8 Call RecalculateScreeningStatsAsync after seeding DatabaseSeeder.cs Low
Initialize _trackedProjectAgreementThresholds on Studies

Medium-Term Actions (Future Sprints)

# Task Effort Dependency
9 Create IProjectStatsService Medium None
10 Add environment warning banner (Part 10) Medium None
11 Production sampling job (Part 11) Medium None
12 PR description config parsing Low Part 11

Implementation Order Diagram

Sprint N (Foundation):
┌─────────────────────────────────────────────────────────────┐
│  1. Refactor constructors (Project, Study, Investigator)    │
│           ↓                                                  │
│  4. Update DatabaseSeeder to use factory methods            │
└─────────────────────────────────────────────────────────────┘

Sprint N+1 (Coverage):
┌─────────────────────────────────────────────────────────────┐
│  5. ISeedDataOwnershipTransfer interface                    │
│  6. Full project matrix (all AgreementModes)                │
│  7. Screening data seeding                                  │
│  8. Stats recalculation                                     │
└─────────────────────────────────────────────────────────────┘

Sprint N+2+ (Polish):
┌─────────────────────────────────────────────────────────────┐
│  9. IProjectStatsService consolidation                      │
│  10. Environment warning banner                             │
│  11. Production sampling (if needed)                        │
└─────────────────────────────────────────────────────────────┘

Definition of Done

Each task should meet these criteria before being marked complete:

  • Code compiles without warnings
  • Unit tests added/updated for new code
  • Existing tests pass
  • Seeded data visible in preview environment
  • Documentation updated (this file or CLAUDE.md)

Risks and Mitigations

Risk Mitigation
Constructor refactoring breaks MongoDB deserialization Keep parameterless public constructor, test deserialization
Factory methods duplicate initialization logic Private constructor with all params, factory delegates to it
Stats calculation fails after seeding Call RecalculateScreeningStatsAsync explicitly after seeding
Seed data differs from production data Use same domain methods as production (AddScreening, etc.)

Appendix A: AgreementMode Reference

Mode Reviewers Needed Agreement Required Description
SingleReviewerDecides 1 N/A One reviewer makes final decision
AutomatedDualScreening 2 100% Both reviewers must agree
ThreeReviewersOneReconciles 2 of 3 67% Two of three must agree, one reconciles
ThreeReviewersTwoReconcile 3 67% All three review, two must agree
Custom varies varies User-defined threshold

Appendix C: API Endpoint Summary

Key endpoints that seeding should mirror:

Endpoint Purpose Service Method
POST /api/projects Create project IProjectManagementService.CreateProject()
POST /api/application/signin Create/update investigator IApplicationService.EnsureInvestigatorCreatedAndUpdatedAsync()
POST /api/projects/{id}/stages Add stage project.AddStage()
PATCH /api/projects/{id}/stages/{stageId} Update stage project.UpdateStage()
POST /api/projects/{id}/questions Add annotation question project.UpsertCustomAnnotationQuestion()
POST /api/reviews/screening Submit screening study.AddScreening()
POST /api/projects/{id}/studies/inclusionInfo Calculate inclusion IProjectManagementService.UpdateStudyInclusionInfoForProjectAsync()