Seed Data Quality Analysis and Planning¶

Executive Summary¶

This document analyzes the current seed data implementation for preview environments, identifies quality issues, and provides comprehensive planning for creating high-quality seed data.

Key Principles:

Breadth and depth - Seed projects should exemplify common configurations and states users will encounter
Use high-level services - Prefer application services over bespoke seeding code to ensure realistic data
Realistic data - Seeded data should be as close to production data as possible (including any real bugs/issues that would occur)
Testability - Seeding should exercise the same code paths as real user operations
Keep defaults where appropriate - Some projects should use default screening modes with seed investigators performing actual screening

Part 1: Current Implementation Analysis¶

Seeding Architecture¶

┌─────────────────────────────────────────────────────────────────┐
│                     Application Startup                          │
├─────────────────────────────────────────────────────────────────┤
│  1. IRunAtInit discovered via assembly scanning                  │
│  2. DatabaseSeeder.Execute() called                              │
│  3. Check SYRF_SEED_DATA_ENABLED=true                           │
│  4. Check if database already has data                           │
│  5. Create seed data if conditions met                           │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                     First User Registration                      │
├─────────────────────────────────────────────────────────────────┤
│  1. UserHasRegisteredHandler fires                               │
│  2. SeedDataOwnershipTransfer.TransferToFirstRealUserAsync()    │
│  3. Add real user as admin to all seed projects                  │
│  4. Transfer ownership from SeedBot to real user                 │
└─────────────────────────────────────────────────────────────────┘

Current Seed Data Inventory¶

Project	Studies	Stages	Screening State	Annotation State	AgreementMode
Quick Start Demo	10	2 (Screening active, Annotation inactive)	Unscreened	N/A	Default (Dual)
Screening In Progress	30	1 (Screening active)	50% complete (dual)	N/A	Default (Dual)
Ready for Annotation	20	2 (Screening inactive, Annotation active)	Complete	Ready	Default (Dual)
Complete Review	15	2 (both inactive)	Complete	Complete	Default (Dual)
Private Research	8	1 (Screening active)	Unscreened	N/A	Default (Dual)

Key Files¶

DatabaseSeeder.cs - Main seeding logic
SeedDataConstants.cs - Constants and IDs
SeedDataOwnershipTransfer.cs - Ownership transfer logic
SampleStudies.json - Sample study data

Part 2: Identified Issues¶

Issue 1: Stages Appearing Disabled for Screening¶

Symptom: Users report that screening stages appear disabled or inaccessible.

Root Cause Analysis:

Project AgreementMode: Projects default to AgreementMode.AutomatedDualScreening:

public Project() : base(Guid.NewGuid())
{
    AgreementMode = AgreementMode.AutomatedDualScreening;
}

Post-Ownership Transfer State: After ownership transfer, the real user is the only member who can log in. Seed reviewers exist but are fake users (system|seed-reviewer-alpha). For dual screening projects, the UI may show screening as blocked because only one real reviewer exists.

Resolution Options:

Option	Description	Recommended For
Keep dual screening with seed reviewers	Demonstrates realistic dual screening workflow	"Screening In Progress" project
Set SingleReviewerDecides	User can complete all screening alone	"Quick Start Demo" project
Add real user to reviewer role	User joins existing screening workflow	Projects with partial screening

Note: It's OK to keep some projects with default dual screening where seed investigators have already done screening. The new user can add more users and change settings as needed.

Issue 2: Study Stats Not Correct¶

Symptom: Study statistics (screened/unscreened counts) don't match actual study states.

Root Cause Analysis:

Stats Are Computed Dynamically: Stats use MongoDB aggregation pipelines that query:
ScreeningInfo.AgreementMeasure.NumberScreened
ScreeningInfo.AgreementMeasure.AbsoluteAgreementRatio
ScreeningInfo.Inclusion
TrackedProjectAgreementThresholds Empty: Studies are created with empty thresholds:

trackedProjectAgreementThresholds: Array.Empty<ProjectAgreementThreshold>()

This means studies haven't been evaluated against any threshold, affecting stats calculations.

Direct Domain Model Usage: The seeder uses domain methods directly (study.AddScreening()) rather than going through the same code paths as the API endpoints. This may skip side effects that normally occur.

Issue 3: Ownership Transfer Side Effects¶

Symptom: Seed reviewers remain as project members but cannot log in.

Analysis: This is actually acceptable behavior:

Seed reviewers demonstrate realistic projects with multiple team members
Their screening decisions show what dual screening looks like
The new owner can add real users and modify settings
This reflects how a real user would join an existing team project

Part 3: Architectural Analysis¶

Current Seeding Approach vs API Endpoints¶

The current DatabaseSeeder creates entities directly using domain models:

// Current approach - Direct domain model creation
var project = new Project(name, creatorId, ...);
typeof(Project).GetProperty("Id")?.SetValue(project, id);  // Reflection to set ID
project.AddStage(...);
_unitOfWork.Save(project);

Issues with this approach:

Bypasses service layer validation - Misses checks that would occur in production
Uses reflection - Sets IDs directly, not how real projects are created
May miss side effects - Service layer often triggers additional operations
Different code path - Any bugs in seeding won't be found in production (and vice versa)

API Endpoint Analysis¶

Based on analysis of API controllers and Angular effects, here's how real users create data:

Project Creation Flow (Production)¶

Angular Component
      ↓
store.dispatch(projectDetailActions.create({...}))
      ↓
project-detail.effects.ts → POST /api/projects
      ↓
ProjectController.CreateProject()
      ↓
IProjectManagementService.CreateProject(name, creatorId, ...)
      ↓
new Project(...) + _unitOfWork.Save()

Screening Flow (Production)¶

Angular Component
      ↓
store.dispatch(reviewComponentActions.submitScreeningAndFetchNext({...}))
      ↓
review.effects.ts → POST /api/reviews/screening
      ↓
ReviewController.SubmitScreeningAndGetNextStudy()
      ↓
study.AddScreening() + IStageReviewService.GetRandomStudyAsync()
      ↓
_unitOfWork.Save(study)

Service Layer Mapping¶

Operation	Controller	Service Method	Seeder Should Use
Create Project	ProjectController	IProjectManagementService.CreateProject()	Yes
Create Investigator	ApplicationController	IApplicationService.EnsureInvestigatorCreatedAndUpdatedAsync()	Yes
Add Stage	ProjectController	project.AddStage() (domain method)	OK - same as controller
Add Screening	ReviewController	study.AddScreening() (domain method)	OK - but verify side effects
Update Inclusion Info	StudyController	IProjectManagementService.UpdateStudyInclusionInfoForProjectAsync()	Yes - should call after screening
Add Search	ProjectController	IProjectManagementService.GetOrCreateSearchImportJobAsync()	Consider for realism

Where Controllers Do More Than Just Call Services¶

Some controllers contain logic beyond simple service delegation. This affects seeding accuracy:

1. ReviewController - Stats Calculation After Screening¶

// ReviewController.cs - after screening submission
var stats = await _stageReviewService.GetFullStatsForInvestigatorAsync(
    project.Id, stage.Id, CurrentUserId);

Impact on Seeding: The seeder doesn't calculate/return stats after adding screenings.

Recommendation: Either call UpdateStudyInclusionInfoForProjectAsync() after seeding screenings, or ensure domain model correctly updates stats synchronously.

2. ProjectController - Instrumentation Code¶

// ProjectController.cs - timing instrumentation in RepositionAnnotationQuestion
var stopwatch = Stopwatch.StartNew();
// ... operation ...
Console.WriteLine($"Get Project Time: {getProjectTime} ms");

Impact: Debug code in production. Not a seeding concern but violates SRP.

Recommendation: Move instrumentation to a decorator or aspect.

// ApplicationController.cs - signin involves multiple operations
var investigator = await _applicationService.EnsureInvestigatorCreatedAndUpdatedAsync(...);
await _applicationService.RetrieveAndValidateInvestigatorUsage(...);
investigator.RecordSignin(ipAddress);
_pmUnitOfWork.Save(investigator);

Impact on Seeding: Seeder creates investigators directly without going through signin flow.

Recommendation: Consider using IApplicationService.EnsureInvestigatorCreatedAndUpdatedAsync() for seed investigator creation.

SOLID Principle Violations Identified¶

Issue	Principle	Location	Recommendation
Controller does instrumentation	SRP	ProjectController:539-574	Move to decorator
Seeder uses reflection for IDs	DIP	DatabaseSeeder:581	Use factory or accept generated IDs
Stats calculated in controller	SRP	ReviewController	Move to service
Direct UnitOfWork in controllers	DIP	Multiple	OK for simple queries, wrap complex operations

Part 4: Seed Project Planning Matrix¶

Guiding Principles¶

Exemplify breadth: Cover all major feature configurations users will encounter
Exemplify depth: Show projects at various workflow stages
Keep some defaults: Not every project needs custom settings - real users start with defaults
Use seed investigators actively: Have seed reviewers do actual screening to show realistic workflows
Annotations matter: Include projects with annotation questions and some annotation data

Proposed Seed Project Matrix¶

Project	Purpose	AgreementMode	Stages	Studies	Screening State	Annotation State	Outcome Data	Members
Quick Start Demo	Intro project for new users	SingleReviewerDecides	1 screening, 1 annotation	10	Unscreened	N/A	None	Owner only
Dual Screening Example	Shows dual review workflow	AutomatedDualScreening	1 screening	30	50% complete by seed reviewers	N/A	None	Owner + Alpha + Beta
Ready for Annotation	Shows annotation workflow	SingleReviewerDecides	1 screening (complete), 1 annotation	20	Complete (all included)	Ready with questions	None	Owner
Partial Annotation	Shows annotation in progress	SingleReviewerDecides	1 screening (complete), 1 annotation	15	Complete	50% annotated	25% studies	Owner
Complete Review	Shows finished project	SingleReviewerDecides	Both complete	15	Complete	Complete	All studies	Owner
Data Extraction Demo	Shows outcome data workflow	SingleReviewerDecides	1 screening (complete), 1 annotation (complete)	12	Complete	Complete	Varied (see below)	Owner
Private Research	Shows access control	AutomatedDualScreening	1 screening	8	Unscreened	N/A	None	Owner (requires approval)
Large Study Set	Performance testing	SingleReviewerDecides	1 screening	100	Unscreened	N/A	None	Owner
Three Reviewer	Shows 3-reviewer workflow	ThreeReviewersOneReconciles	1 screening	20	30% with agreement	N/A	None	Owner + Alpha + Beta + Gamma

Configuration Coverage Matrix¶

Configuration	Covered By Project(s)
Single reviewer screening	Quick Start, Ready for Annotation, Complete Review
Dual screening	Dual Screening Example, Private Research
Three reviewer	Three Reviewer
Screening only stage	Quick Start, Dual Screening Example
Annotation stage	Quick Start, Ready for Annotation, Partial Annotation, Complete Review
Private project	Private Research
Public project	All others
Large study count	Large Study Set
Complete workflow	Complete Review
Auto-approve joins	Quick Start (could add)
Require approval	Private Research

Annotation Question Coverage¶

Seed projects with annotation should include questions from all categories:

Category	Example Questions	Projects
Study	Species, Study design	Ready for Annotation, Partial Annotation, Complete Review, Data Extraction Demo
Risk of Bias	Randomization, Blinding	Ready for Annotation, Partial Annotation, Complete Review, Data Extraction Demo
Model Induction	Disease model, Induction method	Partial Annotation, Complete Review, Data Extraction Demo
Treatment	Intervention, Control, Dose	Partial Annotation, Complete Review, Data Extraction Demo
Outcome	Outcome measure, Timepoint	Complete Review, Data Extraction Demo
Experiment	Sample size, Duration	Complete Review, Data Extraction Demo

Outcome Data Coverage¶

The OutcomeData entity (OutcomeData.cs) tracks quantitative experimental results per study.

Data Extraction Demo Project - Demonstrates various outcome data scenarios:

Studies	Outcome Data State	Purpose
3 studies	Complete data (all cohorts)	Shows fully extracted study
3 studies	Partial data (some cohorts)	Shows work-in-progress
3 studies	Multiple experiments	Shows multi-experiment studies
3 studies	No outcome data	Shows contrast with extracted studies

Sample Outcome Data Structure:

// Example seeded outcome data
var outcomeData = new OutcomeData(
    projectId: projectId,
    annotatorId: SeedDataConstants.SeedBotId,
    stageId: extractionStageId,
    experimentId: experimentId,
    cohortId: cohortId)
{
    Mean = 42.5,
    StandardDeviation = 3.2,
    SampleSize = 10,
    Timepoint = "Day 7",
    OutcomeMeasure = "Tumor volume (mm³)"
};

study.ExtractionInfo.AddOutcomeData(outcomeData);

Seeder Code Path for Outcome Data:

// Should mirror production: Study.AddSessionData() → ExtractionInfo.AddOutcomeData()
var sessionDto = new SessionSubmissionDto
{
    // ... annotations ...
    OutcomeData = GenerateOutcomeDataForStudy(study, project)
};

study.AddSessionData(investigatorId, sessionDto, questionIds, stageExtraction);

Part 5: Recommended Implementation Changes¶

Priority 1: Use High-Level Services for Project Creation¶

Current:

var project = new Project(name, creatorId, ...);
typeof(Project).GetProperty("Id")?.SetValue(project, id);  // Reflection!
_unitOfWork.Save(project);

Recommended:

// Inject IProjectManagementService into DatabaseSeeder
var project = await _projectManagementService.CreateProject(
    name: SeedDataConstants.QuickStartProjectName,
    creatorId: SeedDataConstants.SeedBotId,
    protocolUrl: "",
    isPublic: true,
    inclusionCriteria: "Studies relevant to the research question",
    exclusionCriteria: "Studies not meeting inclusion criteria",
    contactEmail: SeedDataConstants.SeedBotEmail,
    notes: SeedDataConstants.QuickStartProjectDescription,
    keywords: new HashSet<string> { "demo", "quick-start" },
    machineLearningEnabled: false);

// Store generated ID in SeedDataConstants mapping or accept dynamic IDs

Trade-off: Loses deterministic IDs. Consider:

Option A: Accept dynamic IDs and store them after creation
Option B: Add ID parameter to CreateProject (service layer change)
Option C: Keep current approach for IDs but use services for everything else

Priority 2: Use IApplicationService for Investigator Creation¶

Current:

var investigator = new Investigator(id, auth0Id, fullName, email, schemaVersion);
_unitOfWork.Save(investigator);

Recommended:

var investigator = await _applicationService.EnsureInvestigatorCreatedAndUpdatedAsync(
    investigatorId: SeedDataConstants.SeedBotId,
    auth0Id: SeedDataConstants.SeedBotAuth0Id,
    name: new FullName(SeedDataConstants.SeedBotFirstName, SeedDataConstants.SeedBotLastName),
    email: SeedDataConstants.SeedBotEmail,
    pictureUrl: null,
    ipAddress: "127.0.0.1");

Benefit: Uses same code path as production signin.

Priority 3: Call Inclusion Update After Screening¶

Add after adding all screenings to a project:

// After seeding all screenings for a project
await _projectManagementService.UpdateStudyInclusionInfoForProjectAsync(
    projectId,
    replaceExisting: true);

_logger.LogDebug("Updated inclusion info for project {ProjectId}", projectId);

Benefit: Ensures stats are calculated correctly using production code.

Priority 4: Set AgreementMode Explicitly¶

For projects that should use single screening:

project.UpdateAgreementMode(AgreementMode.SingleReviewerDecides, SeedDataConstants.SeedBotId);

For projects demonstrating dual screening:

// Leave as default (AutomatedDualScreening) - this is intentional
// Seed reviewers Alpha and Beta do the screening

Priority 5: Add Annotation Data¶

For "Partial Annotation" and "Complete Review" projects:

// After adding annotation questions, add annotation sessions
foreach (var study in studies.Take(annotatedCount))
{
    var session = new AnnotationSession(
        study.Id,
        SeedDataConstants.SeedBotId,
        annotationStageId);

    // Add answers to questions
    foreach (var question in project.AnnotationQuestions)
    {
        session.AddAnswer(question.Id, GenerateSampleAnswer(question));
    }

    study.AddAnnotationSession(session);
    _unitOfWork.Save(study);
}

Part 6: Testing Recommendations¶

Integration Tests for Seeding¶

[Fact]
public async Task DatabaseSeeder_AfterSeeding_StatsAreCorrect()
{
    // Arrange
    var seeder = new DatabaseSeeder(...);

    // Act
    seeder.Execute();

    // Assert - Quick Start should have 10 unscreened studies
    var quickStartStats = await GetProjectStats(SeedDataConstants.QuickStartProjectId);
    quickStartStats.TotalStudies.Should().Be(10);
    quickStartStats.UnscreenedStudies.Should().Be(10);

    // Assert - Dual Screening In Progress should have correct split
    var dualStats = await GetProjectStats(SeedDataConstants.DualScreeningProjectId);
    dualStats.TotalStudies.Should().Be(30);
    dualStats.SufficientlyScreened.Should().BeGreaterThan(0);
    dualStats.PartiallyScreened.Should().BeGreaterThan(0);
    dualStats.UnscreenedStudies.Should().BeGreaterThan(0);
}

[Fact]
public async Task DatabaseSeeder_AfterSeeding_ProjectsHaveCorrectAgreementMode()
{
    // Arrange & Act
    seeder.Execute();

    // Assert
    var quickStart = await _unitOfWork.Projects.GetAsync(SeedDataConstants.QuickStartProjectId);
    quickStart.AgreementMode.Should().Be(AgreementMode.SingleReviewerDecides);

    var dualScreening = await _unitOfWork.Projects.GetAsync(SeedDataConstants.DualScreeningProjectId);
    dualScreening.AgreementMode.Should().Be(AgreementMode.AutomatedDualScreening);
}

Manual Testing Checklist¶

For each seed project, verify:

Part 7: Service Layer Improvement Opportunities (Detailed Analysis)¶

Based on comprehensive codebase analysis, these architectural changes would benefit both seeding and testability.

7.1 Inclusion Calculation - When Called and Gaps¶

Where UpdateStudyInclusionInfoForProjectAsync IS called:

Location	Trigger	Context
UpdateStudyScreeningStatsConsumer.cs:17-33	Message handler after `AgreementMode` change	Async via MassTransit
ProjectController.cs:259-267	Manual API call `POST /update-study-inclusion`	On-demand admin action
ProjectController.cs:269-276	Batch admin `POST /update-all-study-inclusion`	Batch processing

Deep Dive: How InclusionInfo Serialization Actually Works

The MongoDB BsonClassMap for ScreeningInfo (StudyRepository.cs:1427-1436) reveals the architecture:

cm.MapProperty(si => si.InclusionInfo).SetIgnoreIfNull(true);   // Computed property IS mapped
cm.UnmapField(ScreeningInfo.TrackedProjectAgreementThresholdsName); // Backing field NOT stored

The serialization flow:

On Save: The InclusionInfo getter computes values from _trackedProjectAgreementThresholds + current Screenings
On Load: Stored InclusionInfo is deserialized → setter extracts ProjectAgreementThreshold into _trackedProjectAgreementThresholds
Subsequent saves: Getter recalculates from loaded thresholds + updated screenings → values ARE updated!

The ACTUAL gap (corrected analysis):

Scenario	Behavior	Is This a Problem?
Study where `UpdateStudyInclusionInfoForProjectAsync` was NEVER called	`InclusionInfo` is null	YES - MongoDB filters won't find it
Study where it WAS called, then screening added	Save recalculates from stored thresholds	NO - Auto-updates correctly
Production screening after project setup	Works correctly (threshold from AgreementMode change)	NO - Normal production flow
Seeded studies with screenings	`InclusionInfo` is null (never initialized)	YES - This is the seeding gap

Conclusion: The gap is NOT about individual screenings failing to update. It's about initialization - studies need UpdateStudyInclusionInfoForProjectAsync called at least once to populate the threshold configuration. After that, normal save operations update the computed values correctly.

Flow when AgreementMode changes:

ScreeningController.PostScreeningSettings()
    → project.UpdateAgreementMode()
    → raises ProjectAgreementThresholdUpdatedEvent
    → ProjectAgreementThresholdUpdatedHandler
    → sends IUpdateStudyScreeningStatsCommand
    → UpdateStudyScreeningStatsConsumer.Consume()
    → IStudyRepository.UpdateStudyInclusionInfoForProjectAsync()

Justification for Change: The production design is actually sound - once UpdateStudyInclusionInfoForProjectAsync is called (which happens after AgreementMode changes), subsequent screenings auto-update correctly via the computed property serialization. The gap is specific to seeding where this initialization step is skipped.

Recommendation for Seeding:

// After seeding all screenings for a project, initialize InclusionInfo:
await _projectManagementService.UpdateStudyInclusionInfoForProjectAsync(projectId, replaceExisting: true);

Why this works: This populates each study's InclusionInfo with the project's AgreementThreshold. Once set, any subsequent screening additions (even by seed users in the app) will correctly update via the computed property getter during save.

Affected Code for Implementing This:

DatabaseSeeder.cs - Add call after each project's screenings are seeded
This is a one-time initialization, not a recurring requirement

7.2 Optional ID Parameter - Deep Design Analysis¶

Question: Should IProjectManagementService.CreateProject() accept an optional ID?

Current Signature:

Task<Project> CreateProject(string name, Guid creatorId, string protocolUrl, ...);
// ID is generated internally by Project constructor

Deep Analysis: Where ProjectId is Stored¶

Critical Discovery: ProjectId is stored in many related entities, not just the Project itself:

Entity	Storage	Relationship
`Study.ProjectId`	Stored property	FK to Project
`Screening.ProjectId`	Stored property	FK to Project (within Study)
`Annotation.ProjectId`	Stored property	FK to Project (within Study.ExtractionInfo)
`OutcomeData.ProjectId`	Stored property	FK to Project (within Study.ExtractionInfo)
`SystematicSearch.ProjectId`	Stored property	FK to Project
`DataExportJob.ProjectId`	Stored property	FK to Project
`RiskOfBiasAiJob.ProjectId`	Stored property	FK to Project
`StudyPdfCorrection.ProjectId`	Stored property	FK to Project
`LivingSearchSetting.ProjectId`	Stored property	FK to Project
`ProjectAccess.ProjectId`	Stored property	FK to Project (in Investigator)

Computed Properties (safe to ignore) - these derive from Project.Id at runtime:

BulkStudyUpdateJob.ProjectId => Project.Id
AnnotationQuestion.ProjectId => Project.Id
Stage.ProjectId => Project!.Id
ProjectMembership.ProjectId => Project.Id
SearchImportJob.ProjectId => Project.Id

Technical Constraint: Entity ID Immutability¶

The base Entity<TId> class (Entity.cs:15) defines:

public TId Id { get; init; }  // init-only accessor - immutable after construction!

Implication: Post-creation ID changes require reflection to bypass the init accessor.

Option A: Add Optional Parameter to Service Method (NOT Recommended)¶

Where the change happens: IProjectManagementService (production service layer)

// Change to production service interface
public interface IProjectManagementService
{
    // Original signature modified with optional ID
    Task<Project> CreateProject(string name, Guid creatorId, ..., Guid? id = null);
}

// Implementation changes
public async Task<Project> CreateProject(string name, Guid creatorId, ..., Guid? id = null)
{
    var project = new Project(id ?? Guid.NewGuid());  // Use provided ID or generate
    // ... rest of implementation
}

Architectural Location: Service layer (production code path)

Cons:

Pollutes production API - Optional id parameter is never used in production
Violates Single Responsibility - Production service handles seeding concerns
Risk of misuse - Accidental ID collision if someone passes a duplicate ID
Testing concerns in production - Method signature reveals testing/seeding backdoor

Verdict: ❌ Rejected - seeding concerns shouldn't affect production service signature

Option B: Seeder Factory with Post-Creation ID Override (NOT Recommended)¶

public class SeederProjectFactory : ISeederProjectFactory
{
    public async Task<Project> CreateProjectAsync(CreateProjectDto dto, Guid? overrideId = null)
    {
        var project = await _service.CreateProject(...);
        if (overrideId.HasValue)
        {
            typeof(Project).GetProperty("Id")?.SetValue(project, overrideId.Value);
        }
        return project;
    }
}

Critical Consistency Problem:

If seeding creates Studies, Searches, Screenings, etc. after the ID override, they'll have the correct ID. But what about entities created before?

// CreateProject internally calls:
Memberships.Add(ProjectMembership.InitialProjectCreator(this));  // Uses original ID!

The ProjectMembership is created with the original Project.Id, then we override. This creates an inconsistent state.

Even worse: If CreateProject saves immediately (which it does), we'd have two documents in MongoDB with different IDs representing the same project.

Cons:

Entity.Id is init-only - requires reflection to override
Race condition: Entities created before override have wrong ProjectId
Database inconsistency: Original project may already be saved before override
Hidden complexity: Developers won't expect ID mutation post-construction

Verdict: ❌ Rejected - causes data consistency issues

Option C: Accept Generated IDs (Recommended - Simple Path)¶

// Store generated IDs after creation
var project = await _service.CreateProject(...);
SeedDataConstants.RuntimeProjectIds["QuickStart"] = project.Id;

// Later, ownership transfer looks up dynamically
var projectIds = SeedDataConstants.RuntimeProjectIds.Values.ToList();
await _ownershipService.TransferProjectsToUser(userId, projectIds);

Pros:

Clean production code - no seeding concerns leak into production
Consistency - all entities use the same ID from construction
Simple - no reflection, no special factory

Cons:

Non-deterministic IDs - IDs differ between environments
Dynamic lookup - ownership transfer must use runtime lookup
Test assertions - can't hardcode expected IDs in tests

When This Works Well:

Ownership transfer already loops through projects - just needs runtime list
Tests can query by name/characteristics instead of ID
Preview environments are ephemeral anyway

Verdict: ✅ Recommended for immediate implementation

Option D: Factory Method on Entity (RECOMMENDED for Deterministic IDs)¶

Where the change happens: Project entity class (domain layer, NOT service layer)

Key Difference from Option A:

Option A modifies the service interface (production API changes)
Option D adds a factory method on the entity (production API unchanged, new creation path)

Core Insight: The factory method should call the same substantive constructor used by ProjectManagementService.CreateProject(). No duplication of initialization logic.

Current Constructor Chain (lines 27-57 of Project.cs):

// CURRENT: Substantive constructor → parameterless → base(NewGuid)
public Project(string name, Guid creatorId, string protocolUrl, ...)
    : this()  // ← calls parameterless
{
    // initialization
}

public Project() : base(Guid.NewGuid())  // ← ID generated here
{
    // defaults
}

Refactored Constructor Chain (factory method delegates to internal constructor):

// Project.cs - Refactored to support deterministic IDs
public class Project : AggregateRoot<Guid>
{
    // PUBLIC: Substantive constructor - delegates to internal with new ID
    public Project(string name, Guid creatorId, string protocolUrl,
        bool isPublic, string inclusionCriteria, string exclusionCriteria,
        int schemaVersion, ICollection<string> keywords, string contactEmail = "",
        string notes = "", LivingSearchConfig? livingSearchConfig = null)
        : this(Guid.NewGuid(), name, creatorId, protocolUrl, isPublic,
               inclusionCriteria, exclusionCriteria, schemaVersion, keywords,
               contactEmail, notes, livingSearchConfig)
    {
    }

    // PRIVATE: Internal constructor - ALL initialization happens here
    private Project(Guid id, string name, Guid creatorId, string protocolUrl,
        bool isPublic, string inclusionCriteria, string exclusionCriteria,
        int schemaVersion, ICollection<string> keywords, string contactEmail,
        string notes, LivingSearchConfig? livingSearchConfig)
        : this(id)
    {
        DefaultSchemaVersion = schemaVersion;
        SecuritySettings = new ProjectSecuritySettings(this);
        ContactEmail = contactEmail;
        Notes = notes;
        InclusionCriteria = inclusionCriteria;
        ExclusionCriteria = exclusionCriteria;
        Name = name;
        OwnerId = creatorId;
        CreatorId = creatorId;
        AddProtocolLink(protocolUrl);
        Memberships.Add(ProjectMembership.InitialProjectCreator(this));
        IsPublic = isPublic;
        CreationDate = DateTime.Now;
        Keywords = keywords;
        if (livingSearchConfig != null) AddLivingSearch(livingSearchConfig);
    }

    // PRIVATE: ID-only constructor (for deserialization compatibility)
    private Project(Guid id) : base(id)
    {
        LivingSearches = new List<LivingSearch>();
        AgreementMode = AgreementMode.AutomatedDualScreening;
    }

    // PUBLIC: Parameterless (for MongoDB deserialization)
    public Project() : this(Guid.NewGuid())
    {
    }

    // NEW: Factory method for deterministic IDs - calls SAME constructor as service
    public static Project CreateWithId(
        Guid id,
        string name,
        Guid creatorId,
        string protocolUrl = "",
        bool isPublic = false,
        string inclusionCriteria = "",
        string exclusionCriteria = "",
        int schemaVersion = 1,
        ICollection<string>? keywords = null,
        string contactEmail = "",
        string notes = "",
        LivingSearchConfig? livingSearchConfig = null)
    {
        return new Project(id, name, creatorId, protocolUrl, isPublic,
            inclusionCriteria, exclusionCriteria, schemaVersion,
            keywords ?? Array.Empty<string>(), contactEmail, notes,
            livingSearchConfig);
    }
}

// Seeder uses factory method
var project = Project.CreateWithId(
    id: SeedDataConstants.QuickStartProjectId,      // Deterministic!
    name: SeedDataConstants.QuickStartProjectName,
    creatorId: SeedDataConstants.SeedBotId);

// Production code continues using service (unchanged)
var project = await _projectService.CreateProject(name, userId, ...);  // Auto-generates ID

Why This Works:

Factory method calls the same private constructor as the public one
All initialization logic is in ONE place (the private constructor)
No duplication of AddMembership, AddProtocolLink, etc.
Public API unchanged - existing code continues working

Architectural Comparison:

Aspect	Option A (Service)	Option D (Factory)
What changes	Service interface signature	Entity class (add method)
Production code affected	Yes - new optional param	No - existing path unchanged
Where ID is provided	Service layer	Entity construction
Separation of concerns	Poor - seeding in production API	Good - factory is opt-in
Child entity IDs	Correct (ID set before children)	Correct (ID set before children)

Pros:

ID set at construction - all child entities (Membership, etc.) get correct ID from the start
No reflection - uses language features correctly
Production API unchanged - IProjectManagementService signature stays the same
Explicit opt-in - only code that needs deterministic IDs uses the factory
Type-safe - factory enforces required parameters at compile time

Cons:

Pattern needed for all aggregates - Studies, Investigators, etc. need similar factories
Slightly more code - private constructor + factory method per aggregate

Verdict: ✅ RECOMMENDED for deterministic ID requirement

Final Recommendation¶

Approach	Recommendation	Use Case
Option D (Factory Method)	✅ RECOMMENDED	Deterministic IDs for seeding
Option C (Accept Generated)	✅ Fallback	If factory changes are deferred
Option A/B	❌ Avoid	Pollutes production API or causes inconsistency

Rationale: Option D provides deterministic IDs with clean architecture - the factory method is opt-in, production code is unchanged, and all child entities get the correct ID from construction.

7.3 Project Template Implementation Plan (Detailed)¶

Goal: Allow creating projects from templates for both seeding and future user features.

Note: ProjectTemplate does NOT currently exist in the codebase. This is a proposed new entity.

Is ProjectTemplate an Aggregate Root?¶

Answer: Yes, ProjectTemplate should be an Aggregate Root.

Reasoning:

Criteria	Analysis
Identity	Templates have their own lifecycle, independent of any Project
Invariants	Templates enforce their own rules (valid stage configuration, etc.)
Transactional Boundary	Templates are loaded/saved as a unit
Reference Pattern	Projects reference templates by ID, not containment

Alternative Considered: Making templates a collection within a "TemplateLibrary" aggregate. This was rejected because:

Templates can be created by different users
No natural parent entity exists
Individual template versioning becomes complex

Phase 1: Template Entity (Aggregate Root)¶

// New file: SyRF.ProjectManagement.Core/Model/TemplateAggregate/ProjectTemplate.cs
public class ProjectTemplate : AggregateRoot<Guid>
{
    // Identity & Metadata
    public string Name { get; private set; }
    public string Description { get; private set; }
    public bool IsSystemTemplate { get; private set; }  // Seed/built-in templates
    public Guid? CreatedByInvestigatorId { get; private set; }
    public DateTime CreatedAt { get; private set; }

    // Visibility
    public TemplateVisibility Visibility { get; private set; }  // Public, Private, Organization

    // Template Configuration (Value Objects - snapshots, not live references)
    public AgreementMode DefaultAgreementMode { get; private set; }
    public IReadOnlyList<StageTemplate> Stages => _stages.AsReadOnly();
    public IReadOnlyList<AnnotationQuestionTemplate> Questions => _questions.AsReadOnly();
    public HashSet<string> DefaultKeywords { get; private set; }

    private List<StageTemplate> _stages = new();
    private List<AnnotationQuestionTemplate> _questions = new();

    // Factory method for creating from existing project
    public static ProjectTemplate CreateFromProject(Project source, string templateName, Guid creatorId)
    {
        var template = new ProjectTemplate(templateName, creatorId, source.AgreementMode);
        template.CopyStagesFrom(source);
        template.CopyQuestionsFrom(source);
        return template;
    }

    // Factory method for system templates
    public static ProjectTemplate CreateSystemTemplate(string name, string description)
    {
        return new ProjectTemplate(name, null, AgreementMode.AutomatedDualScreening)
        {
            IsSystemTemplate = true,
            Description = description
        };
    }
}

Value Objects (not Aggregate Roots):

// StageTemplate - snapshot of stage configuration
public record StageTemplate(
    string Name,
    StageType Type,
    bool IsActive,
    int Order
);

// AnnotationQuestionTemplate - snapshot of question configuration
public record AnnotationQuestionTemplate(
    string QuestionText,
    string AnswerType,
    bool IsRequired,
    IReadOnlyList<string> Options  // For multiple choice
);

Phase 2: Template Service Methods (Detailed)¶

public interface IProjectTemplateService
{
    // === Template Management ===

    /// <summary>
    /// Create a new template from an existing project.
    /// Copies stage configuration and annotation questions as snapshots.
    /// </summary>
    Task<ProjectTemplate> CreateTemplateFromProjectAsync(
        Guid sourceProjectId,
        string templateName,
        Guid creatorId);

    /// <summary>
    /// Create a new empty template (for manual configuration).
    /// </summary>
    Task<ProjectTemplate> CreateEmptyTemplateAsync(
        string name,
        string description,
        Guid creatorId);

    /// <summary>
    /// Update template metadata (name, description, visibility).
    /// Does NOT update stage/question configuration after creation.
    /// </summary>
    Task UpdateTemplateMetadataAsync(
        Guid templateId,
        string? name,
        string? description,
        TemplateVisibility? visibility);

    /// <summary>
    /// Delete a user-created template. System templates cannot be deleted.
    /// </summary>
    Task DeleteTemplateAsync(Guid templateId, Guid requesterId);

    // === Template Retrieval ===

    /// <summary>
    /// Get templates available to a user:
    /// - System templates (always visible)
    /// - Public templates
    /// - User's own private templates
    /// </summary>
    Task<IEnumerable<ProjectTemplate>> GetAvailableTemplatesAsync(Guid? investigatorId);

    /// <summary>
    /// Get a specific template by ID.
    /// </summary>
    Task<ProjectTemplate?> GetTemplateByIdAsync(Guid templateId);

    /// <summary>
    /// Get system template by name (for seeding).
    /// </summary>
    Task<ProjectTemplate?> GetSystemTemplateAsync(string templateName);

    // === Project Creation ===

    /// <summary>
    /// Create a new project from a template.
    /// Uses IProjectManagementService internally - same code path as manual creation.
    /// </summary>
    Task<Project> CreateProjectFromTemplateAsync(
        Guid templateId,
        string projectName,
        Guid ownerId,
        string? protocolUrl = null,
        bool? isPublic = null);  // Overrides template defaults
}

Implementation Details:

public class ProjectTemplateService : IProjectTemplateService
{
    private readonly IProjectTemplateRepository _templateRepository;
    private readonly IProjectManagementService _projectService;
    private readonly IPMUnitOfWork _unitOfWork;

    public async Task<Project> CreateProjectFromTemplateAsync(
        Guid templateId,
        string projectName,
        Guid ownerId,
        string? protocolUrl = null,
        bool? isPublic = null)
    {
        var template = await _templateRepository.GetAsync(templateId)
            ?? throw new EntityNotFoundException("Template", templateId);

        // 1. Create base project using EXISTING service (same code path!)
        var project = await _projectService.CreateProject(
            name: projectName,
            creatorId: ownerId,
            protocolUrl: protocolUrl ?? "",
            isPublic: isPublic ?? false,
            inclusionCriteria: "",
            exclusionCriteria: "",
            contactEmail: "",
            notes: $"Created from template: {template.Name}",
            keywords: template.DefaultKeywords,
            machineLearningEnabled: false);

        // 2. Apply template configuration
        project.UpdateAgreementMode(template.DefaultAgreementMode, ownerId);

        foreach (var stageTemplate in template.Stages.OrderBy(s => s.Order))
        {
            project.AddStage(stageTemplate.Name, stageTemplate.Type);
            if (!stageTemplate.IsActive)
            {
                var stage = project.Stages.Last();
                stage.Deactivate();
            }
        }

        foreach (var questionTemplate in template.Questions)
        {
            project.UpsertCustomAnnotationQuestion(
                questionTemplate.QuestionText,
                questionTemplate.AnswerType,
                questionTemplate.IsRequired,
                questionTemplate.Options?.ToList());
        }

        // 3. Save with all configuration applied
        await _unitOfWork.SaveAsync(project);

        return project;
    }
}

Phase 3: End User Workflow¶

Creating a Project from Template (Future UI):

┌─────────────────────────────────────────────────────┐
│ Create New Project                                   │
├─────────────────────────────────────────────────────┤
│                                                      │
│ ○ Start from scratch                                 │
│ ● Use a template                                     │
│                                                      │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Available Templates                             │ │
│ │                                                 │ │
│ │ 📋 Quick Start (System)                         │ │
│ │    Basic screening workflow                     │ │
│ │                                                 │ │
│ │ 📋 Full Systematic Review (System)              │ │
│ │    Screening + Annotation with ROB questions    │ │
│ │                                                 │ │
│ │ 📋 My Custom Template (Private)                 │ │
│ │    Based on "Cancer Research 2024" project      │ │
│ │                                                 │ │
│ └─────────────────────────────────────────────────┘ │
│                                                      │
│ Project Name: [________________________]             │
│                                                      │
│              [Cancel]  [Create Project]              │
└─────────────────────────────────────────────────────┘

Saving a Project as Template (Future UI):

Project Settings → Actions → "Save as Template"
    → Dialog: Template name, visibility (private/public)
    → Creates snapshot of current stage/question config

Phase 4: Seeder Integration¶

// In DatabaseSeeder
public async Task SeedProjectsAsync()
{
    // Ensure system templates exist
    await EnsureSystemTemplatesExistAsync();

    // Create seed projects from templates
    var quickStartTemplate = await _templateService.GetSystemTemplateAsync("QuickStart");
    var quickStartProject = await _templateService.CreateProjectFromTemplateAsync(
        quickStartTemplate!.Id,
        SeedDataConstants.QuickStartProjectName,
        SeedDataConstants.SeedBotId);

    // Store runtime ID for ownership transfer
    _runtimeProjectIds["QuickStart"] = quickStartProject.Id;

    // Add seed studies, screenings, etc.
    await SeedStudiesForProject(quickStartProject.Id, 10);
}

private async Task EnsureSystemTemplatesExistAsync()
{
    var existing = await _templateService.GetSystemTemplateAsync("QuickStart");
    if (existing != null) return;

    var template = ProjectTemplate.CreateSystemTemplate(
        "QuickStart",
        "Basic screening project with single-reviewer workflow");
    template.AddStage("Title/Abstract Screening", StageType.Screening, isActive: true);
    await _templateRepository.SaveAsync(template);
}

Files to Create¶

File	Purpose
`SyRF.ProjectManagement.Core/Model/TemplateAggregate/ProjectTemplate.cs`	Aggregate Root
`SyRF.ProjectManagement.Core/Model/TemplateAggregate/StageTemplate.cs`	Value Object
`SyRF.ProjectManagement.Core/Model/TemplateAggregate/AnnotationQuestionTemplate.cs`	Value Object
`SyRF.ProjectManagement.Core/Interfaces/IProjectTemplateService.cs`	Service Interface
`SyRF.ProjectManagement.Core/Interfaces/IProjectTemplateRepository.cs`	Repository Interface
`SyRF.ProjectManagement.Core/Services/ProjectTemplateService.cs`	Service Implementation
`SyRF.ProjectManagement.Mongo.Data/Repositories/ProjectTemplateRepository.cs`	MongoDB Repository

Migration Path¶

Phase 1: Create entities, repository, basic service (template storage only)
Phase 2: Add CreateProjectFromTemplateAsync (seeder integration)
Phase 3: Add API endpoints for template CRUD
Phase 4: Add UI for "Create from Template" and "Save as Template"

Seeder can use Phase 2 immediately, with user-facing features added later.

7.4 Stats Consolidation - Implementation Plan¶

Note: This section is part of the improvement plan, not just documentation. Stats consolidation is a prerequisite for reliable seeding.

Problem Statement¶

Stats calculation is scattered across multiple components, making it difficult to ensure seeded data has correct statistics. This affects both production code quality and seeding reliability.

Current Scattered Locations:

Component	Location	What It Calculates
`StageReviewService.GetFullProjectStatsAsync()`	StageReviewService.cs:35-40	Overall project stats
`StageReviewService.GetReviewerStatsForStageAsync()`	StageReviewService.cs:75-83	Per-reviewer stats
`StudyRepository.UpdateStudyInclusionInfoForProjectAsync()`	StudyRepository.cs:1078-1104	Inclusion info updates
`ExtractionInfo.SessionTallies`	ExtractionInfo.cs:30-42	Annotation session tallies
`ReviewController` (stats after screening)	ReviewController.cs:61-119	Returns stats in response

Why Stats Need Service-Level Recalculation (or Don't)¶

Key Principle: Stats write operations at the domain service layer are only needed when the calculation depends on config and state spanning multiple aggregates.

Stat Type	Needs Service Recalculation?	Reason
Screening Stats	✅ YES	Depends on Project config (AgreementMode, threshold) AND Study screenings
Inclusion Info	✅ YES	Depends on Project config AND aggregated screening decisions
Annotation Stats	❌ NO	`SessionTallies` is computed within Study aggregate alone
Outcome Data Stats	❌ NO	Aggregation query over Studies, no cross-aggregate config needed

Annotation Stats Analysis (ExtractionInfo.cs:30-42):

// SessionTallies is a COMPUTED PROPERTY on ExtractionInfo (part of Study)
public IEnumerable<SessionTally> SessionTallies
{
    get
    {
        return Sessions.GroupBy(ans => ans.StageId).Select(gp =>
            new SessionTally(gp.Key,
                gp.Count(ses => !ses.Reconciliation),
                gp.Count(ses => !ses.Reconciliation && ses.Status == SessionStatus.Completed),
                gp.Any(ses => ses.Reconciliation),
                gp.Any(ses => ses.Reconciliation && ses.Status == SessionStatus.Completed))
        );
    }
}

This calculation:

Uses ONLY Sessions from the Study's own ExtractionInfo
Does NOT need Project config (no AgreementMode, no question count)
Is computed on read, not stored → no recalculation needed

Contrast with Screening Stats which MUST use service layer:

"Sufficiently screened" requires Project's AgreementMode and threshold
"Included/Excluded" aggregates screening decisions using Project rules
Cannot be computed from Study aggregate alone

Screening Stats Analysis (StudyStats.cs, ScreeningInfo.cs):

Why RecalculateScreeningStatsAsync IS needed:

Cross-Aggregate Configuration Dependency:

// StudyStats.cs lines 292-295 - Stats use Project thresholds
var agreementThreshold = project.AgreementThreshold;
var minNumberScreened = agreementThreshold.NumberScreened;
var absoluteAgreementRatio = agreementThreshold.AbsoluteAgreementRatio;

MongoDB Aggregation Uses Project Rules:

// Simplified from AllScreeningPipelineGroupStage (StudyStats.cs:274-276)
{
  SufficientlyScreened: {
    $sum: {
      $cond: [
        {$and: [
          {$gte: ["$ScreeningInfo.AgreementMeasure.NumberScreened", minNumberScreened]},
          {$gt: ["$ScreeningInfo.AgreementMeasure.AbsoluteAgreementRatio", absoluteAgreementRatio]}
        ]},
        1, 0
      ]
    }
  }
}

Cached Thresholds on ScreeningInfo (ScreeningInfo.cs:49-60):

// _trackedProjectAgreementThresholds is CACHED on each Study's ScreeningInfo
private IEnumerable<ProjectAgreementThreshold>? _trackedProjectAgreementThresholds;

// InclusionInfo depends on this cached config
public IEnumerable<TrackedInclusionInfo>? InclusionInfo
{
    get => _trackedProjectAgreementThresholds?.Select(
        pat => new TrackedInclusionInfo(pat, pat.GetInclusionStatus(this), ...));
}

If Project thresholds change, this cached data becomes stale and needs recalculation.

When RecalculateScreeningStatsAsync is triggered:

After bulk screening operations
When Project AgreementMode or threshold changes
After seed data creation (to initialize _trackedProjectAgreementThresholds)
Via IUpdateStudyScreeningStatsCommand message bus command

Proposed Architecture (Corrected)¶

public interface IProjectStatsService
{
    // === Read Operations (Query Side) ===
    Task<ProjectStats> GetProjectStatsAsync(Guid projectId);
    Task<StageStats> GetStageStatsAsync(Guid projectId, Guid stageId);
    Task<ReviewerStats> GetReviewerStatsAsync(Guid projectId, Guid stageId, Guid reviewerId);

    // === Write Operations (Command Side) ===
    // ONLY for stats that span aggregates (Project config + Study data)
    Task RecalculateScreeningStatsAsync(Guid projectId);
    Task RecalculateInclusionInfoAsync(Guid projectId);

    // NOT NEEDED for annotation/outcome - these are computed properties or simple aggregations
    // Task RecalculateAnnotationStatsAsync(Guid projectId);  ← REMOVED
    // Task RecalculateOutcomeDataStatsAsync(Guid projectId); ← REMOVED
}

// Unified stats model
public record ProjectStats(
    int TotalStudies,
    ScreeningStats Screening,
    AnnotationStats Annotation,      // Read from SessionTallies aggregation
    OutcomeDataStats OutcomeData,    // Read from OutcomeData count query
    DateTime LastUpdated
);

public record ScreeningStats(
    int Unscreened,
    int PartiallyScreened,
    int SufficientlyScreened,  // Requires Project AgreementMode
    int Included,              // Requires Project rules
    int Excluded,              // Requires Project rules
    int Conflicts              // Requires Project threshold
);

public record AnnotationStats(
    int NotStarted,
    int InProgress,
    int Complete,
    int TotalSessions
    // Computed by aggregating SessionTallies across studies - no write operation needed
);

public record OutcomeDataStats(
    int StudiesWithData,
    int TotalDataPoints,
    int ExperimentsCount
    // Simple count aggregation - no write operation needed
);

Implementation Steps¶

Phase 1: Create Service Without Breaking Changes

Create IProjectStatsService interface in SyRF.ProjectManagement.Core/Interfaces/
Create ProjectStatsService implementation in SyRF.ProjectManagement.Core/Services/
Initially delegate to existing methods in StageReviewService
Add DI registration

Phase 2: Migrate Callers

Update ReviewController to call IProjectStatsService
Update DatabaseSeeder to call RecalculateAllStatsAsync() after seeding
Update any Angular API calls that expect stats in responses

Phase 3: Consolidate Implementation

Move actual stats calculation logic from StageReviewService to ProjectStatsService
Keep StageReviewService focused on study retrieval/assignment
Update StudyRepository.UpdateStudyInclusionInfoForProjectAsync() to be called via stats service

Phase 4: Deprecate Old Methods

Mark old StageReviewService stats methods as [Obsolete]
Eventually remove after all callers migrated

Files to Create/Modify¶

File	Action	Purpose
`SyRF.ProjectManagement.Core/Interfaces/IProjectStatsService.cs`	Create	Service interface
`SyRF.ProjectManagement.Core/Services/ProjectStatsService.cs`	Create	Implementation
`SyRF.ProjectManagement.Core/Model/ValueObjects/ProjectStats.cs`	Create	Stats model
`SyRF.ProjectManagement.Core/Services/StageReviewService.cs`	Modify	Delegate to new service
`SyRF.API.Endpoint/Controllers/ReviewController.cs`	Modify	Use new service
`SyRF.ProjectManagement.Endpoint/Seeding/DatabaseSeeder.cs`	Modify	Call `RecalculateAllStatsAsync()`

Seeder Integration¶

// After seeding all data for a project
await _projectStatsService.RecalculateAllStatsAsync(projectId);
_logger.LogDebug("Recalculated all stats for project {ProjectId}", projectId);

// This single call replaces multiple scattered calls:
// - UpdateStudyInclusionInfoForProjectAsync()
// - Manual stats aggregation
// - Annotation tally updates

Priority Assessment¶

Priority	Rationale
Medium-High	Needed for reliable seeding, but can work around by calling existing methods
Technical Debt	Reduces code duplication and improves maintainability
Risk	Low - new service, no breaking changes initially

Recommendation: Implement Phase 1-2 before major seeding work. Phase 3-4 can be done as ongoing refactoring.

Part 8: Annotation Seeding - Validation Architecture¶

Production Code Path¶

HTTP PUT /api/projects/{projectId}/stages/{stageId}/studies/{studyId}/session/{sessionId}
    ↓
ReviewController.SubmitSession() [ReviewController.cs:61-119]
    ↓
SessionSubmissionWebDto → SessionSubmissionDto (AutoMapper)
    ↓
study.AddSessionData() [Study.cs:116-127]
    ↓
ExtractionInfo.AddSessionData() [ExtractionInfo.cs:50-64]
    ├─ AddAnnotations() - Tree-shakes old annotations, adds new
    └─ AddOutcomeData() - Tree-shakes old outcome data, adds new
    ↓
pmUnitOfWork.SaveAsync(study)
    ↓
DispatchEvents() - Publishes domain events

Current Seeder Proposal (INCORRECT)¶

// This bypasses the ReviewController and service validation
var session = new AnnotationSession(study.Id, investigatorId, stageId);
foreach (var question in questions) {
    session.AddAnswer(question.Id, GenerateSampleAnswer(question));
}
study.AddAnnotationSession(session);  // Direct domain model manipulation

Validation Architecture Analysis¶

Question: Where should validation live - domain, application service, or frontend?

Current State: Mixed Validation¶

Validation Type	Current Location	Problem
Required fields	Frontend (Angular forms)	Bypassable by seeder
Question answer types	Domain model (Annotation)	Good - shared
Stage membership	Controller (ReviewController)	Should be in service
Study belongs to project	Controller query	Should be in service

Best Practice: Domain-Driven Validation Layers¶

┌─────────────────────────────────────────────────────────────┐
│ 1. DOMAIN VALIDATION (Innermost - Invariants)              │
│    • Entity state must be valid at all times               │
│    • Throws exceptions for invalid state transitions       │
│    • Example: Annotation answer must match question type   │
└─────────────────────────────────────────────────────────────┘
                           ↑
┌─────────────────────────────────────────────────────────────┐
│ 2. APPLICATION SERVICE VALIDATION (Business Rules)         │
│    • Cross-aggregate validation                             │
│    • Authorization checks                                   │
│    • Example: User must be project member to annotate      │
└─────────────────────────────────────────────────────────────┘
                           ↑
┌─────────────────────────────────────────────────────────────┐
│ 3. FRONTEND VALIDATION (UX - Optional for Backend)         │
│    • Early feedback for users                               │
│    • NOT security - backend must revalidate                 │
│    • Example: Show "required" error before submit           │
└─────────────────────────────────────────────────────────────┘

Proposed: Application Service for Annotation Submission¶

// New file: SyRF.ProjectManagement.Core/Interfaces/IAnnotationService.cs
public interface IAnnotationService
{
    /// <summary>
    /// Submit an annotation session for a study.
    /// Validates: user membership, stage access, question completeness.
    /// </summary>
    Task<AnnotationSessionResult> SubmitSessionAsync(
        Guid projectId,
        Guid stageId,
        Guid studyId,
        Guid annotatorId,
        SessionSubmissionDto session);

    /// <summary>
    /// Validate session data without saving (for preview/draft).
    /// Returns validation errors instead of throwing.
    /// </summary>
    Task<ValidationResult> ValidateSessionAsync(
        Guid projectId,
        Guid stageId,
        Guid studyId,
        SessionSubmissionDto session);
}

// Implementation
public class AnnotationService : IAnnotationService
{
    public async Task<AnnotationSessionResult> SubmitSessionAsync(
        Guid projectId, Guid stageId, Guid studyId, Guid annotatorId,
        SessionSubmissionDto session)
    {
        // 1. Load aggregates
        var project = await _projectRepository.GetAsync(projectId)
            ?? throw new EntityNotFoundException("Project", projectId);
        var study = await _studyRepository.GetAsync(studyId)
            ?? throw new EntityNotFoundException("Study", studyId);

        // 2. APPLICATION VALIDATION (business rules)
        ValidateMembership(project, annotatorId);      // User must be project member
        ValidateStageAccess(project, stageId);         // Stage must be active
        ValidateStudyBelongsToProject(study, projectId); // Study in this project
        ValidateNotAlreadyAnnotated(study, annotatorId, stageId); // No duplicate

        // 3. DOMAIN VALIDATION (via domain method)
        var stage = project.Stages.Single(s => s.Id == stageId);
        study.AddSessionData(
            annotatorId,
            session,
            project.GetStageQuestionIds(stageId),
            stage.Extraction);  // Throws if invalid

        // 4. Persist
        await _unitOfWork.SaveAsync(study);

        return new AnnotationSessionResult(
            Success: true,
            SessionId: session.Id,
            StudyId: studyId);
    }

    private void ValidateMembership(Project project, Guid userId)
    {
        if (!project.Memberships.Any(m => m.InvestigatorId == userId))
            throw new UnauthorizedAccessException("User is not a project member");
    }

    private void ValidateStageAccess(Project project, Guid stageId)
    {
        var stage = project.Stages.SingleOrDefault(s => s.Id == stageId)
            ?? throw new EntityNotFoundException("Stage", stageId);
        if (!stage.IsActive)
            throw new InvalidOperationException("Stage is not active");
    }
}

Seeder Uses Same Service¶

// Seeder injects IAnnotationService
public class DatabaseSeeder
{
    private readonly IAnnotationService _annotationService;

    public async Task SeedAnnotationsAsync(Guid projectId, Guid stageId, IEnumerable<Study> studies)
    {
        foreach (var study in studies)
        {
            var session = GenerateSessionForStudy(study, stageId);

            // Uses SAME validation as production!
            await _annotationService.SubmitSessionAsync(
                projectId,
                stageId,
                study.Id,
                SeedDataConstants.SeedBotId,
                session);
        }
    }
}

What Exactly Changes? (Detailed Migration)¶

Answer: Logic and functionality remain IDENTICAL. Only the code location changes.

Current Architecture (Controller-Based)¶

ReviewController.SubmitSessionAsync()
├─ HTTP deserialization (ASP.NET Core)
├─ Authorization check (inline in controller)
├─ Load project (inline)
├─ Validate membership (inline)
├─ Validate stage (inline)
├─ Call study.AddSessionData() (domain method)
└─ Save via UnitOfWork (inline)

Proposed Architecture (Service-Based)¶

ReviewController.SubmitSessionAsync()
├─ HTTP deserialization (ASP.NET Core)
└─ Call IAnnotationService.SubmitSessionAsync()  ← SINGLE CHANGE
        ├─ Authorization check (moved here)
        ├─ Load project (moved here)
        ├─ Validate membership (moved here)
        ├─ Validate stage (moved here)
        ├─ Call study.AddSessionData() (unchanged)
        └─ Save via UnitOfWork (moved here)

Specific Code Migration¶

Current Location	New Location	Logic Changes?
ReviewController.cs:61-119	`AnnotationService.SubmitSessionAsync()`	NO - extracted as-is
Membership check (controller inline)	`ValidateMembership()` private method	NO - same check
Stage access check (controller inline)	`ValidateStageAccess()` private method	NO - same check
`study.AddSessionData()` call	Same call in service	NO - identical
`_unitOfWork.SaveAsync()` call	Same call in service	NO - identical

Controller BEFORE (Simplified)¶

// ReviewController.cs - Current
[HttpPost("annotation")]
public async Task<IActionResult> SubmitSessionAsync(SessionSubmissionWebDto dto)
{
    var project = await _projectRepo.GetAsync(dto.ProjectId);
    if (!project.Memberships.Any(m => m.InvestigatorId == UserId))
        return Forbid();

    var stage = project.Stages.SingleOrDefault(s => s.Id == dto.StageId);
    if (stage == null || !stage.IsActive)
        return BadRequest("Invalid stage");

    var study = await _studyRepo.GetAsync(dto.StudyId);
    study.AddSessionData(UserId, dto.ToDto(), project.GetStageQuestionIds(dto.StageId), stage.Extraction);

    await _unitOfWork.SaveAsync(study);
    return Ok(new { SessionId = dto.Id });
}

Controller AFTER (Simplified)¶

// ReviewController.cs - After refactor
[HttpPost("annotation")]
public async Task<IActionResult> SubmitSessionAsync(SessionSubmissionWebDto dto)
{
    var result = await _annotationService.SubmitSessionAsync(
        dto.ProjectId,
        dto.StageId,
        dto.StudyId,
        UserId,
        dto.ToDto());

    return Ok(new { SessionId = result.SessionId });
}

Functionality Changes: NONE

Same validation rules
Same domain method calls
Same persistence
Same error conditions

What's Different: Code is now in a testable, reusable service that the seeder can also call.

Files Changed¶

File	Change Type	Description
`IAnnotationService.cs`	NEW	Interface in Core/Interfaces
`AnnotationService.cs`	NEW	Implementation in Core/Services
`ReviewController.cs`	MODIFY	Delegate to service (less code)
`Startup.cs` / DI	MODIFY	Register new service

Benefits of Service-Layer Validation¶

Benefit	Explanation
Single Source of Truth	Validation logic in one place, not duplicated in controller and seeder
Testable	Can unit test validation without HTTP context
Seeder Parity	Seeder uses exact same code path as production
Future-Proof	New validation rules automatically apply to seeding
Separation of Concerns	Controller handles HTTP, service handles business logic

Frontend Validation Consideration¶

The Angular frontend should:

Pre-validate for UX (immediate feedback)
NOT be trusted - backend always re-validates
Mirror service rules where practical (DRY is nice but security is mandatory)

// Angular component - UX validation only
validateSession(session: SessionSubmission): ValidationError[] {
    const errors: ValidationError[] = [];
    if (!session.annotations?.length) {
        errors.push({ field: 'annotations', message: 'At least one answer required' });
    }
    // ... more UX checks
    return errors;
}

The backend IAnnotationService.ValidateSessionAsync() method can be exposed via API for frontend draft validation if needed.

Part 9: Seeding Configuration as Source of Truth¶

Problem: Duplicated Knowledge¶

Currently, seeding logic and tests both contain hardcoded values:

// In DatabaseSeeder.cs
var quickStartStudyCount = 10;
CreateStudies(project, quickStartStudyCount);

// In Tests
quickStartStats.TotalStudies.Should().Be(10);  // Duplicated magic number!

If seeding changes to 15 studies, tests break. This coupling is fragile.

Solution: Centralized Seeding Configuration¶

Create a configuration file that defines seed data structure. Both seeder and tests read from this single source of truth.

Configuration Model¶

// SyRF.ProjectManagement.Core/Seeding/SeedingConfig.cs
public record SeedingConfig
{
    public required IReadOnlyList<SeedProjectConfig> Projects { get; init; }
    public required IReadOnlyList<SeedInvestigatorConfig> Investigators { get; init; }
}

public record SeedProjectConfig
{
    public required string Key { get; init; }  // "QuickStart", "DualScreening"
    public required string Name { get; init; }
    public required AgreementMode AgreementMode { get; init; }
    public required int StudyCount { get; init; }
    public required IReadOnlyList<SeedStageConfig> Stages { get; init; }
    public required SeedScreeningConfig? Screening { get; init; }
    public required SeedAnnotationConfig? Annotation { get; init; }
    public required SeedOutcomeDataConfig? OutcomeData { get; init; }
    public required IReadOnlyList<string> MemberKeys { get; init; }  // ["Owner", "Alpha", "Beta"]
}

public record SeedStageConfig(
    string Name,
    StageType Type,
    bool IsActive
);

public record SeedScreeningConfig(
    int ScreenedCount,
    int PartialCount,
    int IncludedCount,
    int ExcludedCount
);

public record SeedAnnotationConfig(
    int QuestionCount,
    int AnnotatedStudyCount,
    int CompletedSessionCount
);

public record SeedOutcomeDataConfig(
    int StudiesWithData,
    int DataPointsPerStudy
);

public record SeedInvestigatorConfig(
    string Key,
    string FirstName,
    string LastName,
    string Email,
    bool IsSystemAccount
);

Configuration File (YAML)¶

# src/libs/project-management/SyRF.ProjectManagement.Core/Seeding/seed-config.yaml
investigators:
  - key: SeedBot
    firstName: SyRF
    lastName: Seed Bot
    email: seedbot@syrf.org.uk
    isSystemAccount: true
  - key: Alpha
    firstName: Alice
    lastName: Alpha
    email: alice.alpha@example.com
    isSystemAccount: false
  - key: Beta
    firstName: Bob
    lastName: Beta
    email: bob.beta@example.com
    isSystemAccount: false

projects:
  - key: QuickStart
    name: Quick Start Demo
    agreementMode: SingleReviewerDecides
    studyCount: 10
    stages:
      - name: Title/Abstract Screening
        type: Screening
        isActive: true
      - name: Data Extraction
        type: Annotation
        isActive: false
    screening: null  # No screening done
    annotation: null
    outcomeData: null
    memberKeys: [SeedBot]

  - key: DualScreening
    name: Dual Screening Example
    agreementMode: AutomatedDualScreening
    studyCount: 30
    stages:
      - name: Title/Abstract Screening
        type: Screening
        isActive: true
    screening:
      screenedCount: 15
      partialCount: 10
      includedCount: 8
      excludedCount: 7
    annotation: null
    outcomeData: null
    memberKeys: [SeedBot, Alpha, Beta]

  - key: DataExtraction
    name: Data Extraction Demo
    agreementMode: SingleReviewerDecides
    studyCount: 12
    stages:
      - name: Screening
        type: Screening
        isActive: false
      - name: Data Extraction
        type: Annotation
        isActive: true
    screening:
      screenedCount: 12
      partialCount: 0
      includedCount: 12
      excludedCount: 0
    annotation:
      questionCount: 10
      annotatedStudyCount: 12
      completedSessionCount: 12
    outcomeData:
      studiesWithData: 9
      dataPointsPerStudy: 3
    memberKeys: [SeedBot]

Config Loader¶

public interface ISeedingConfigProvider
{
    SeedingConfig GetConfig();
    SeedProjectConfig GetProjectConfig(string key);
    SeedInvestigatorConfig GetInvestigatorConfig(string key);
}

public class YamlSeedingConfigProvider : ISeedingConfigProvider
{
    private readonly Lazy<SeedingConfig> _config;

    public YamlSeedingConfigProvider()
    {
        _config = new Lazy<SeedingConfig>(() =>
        {
            var yaml = EmbeddedResource.ReadAsString("seed-config.yaml");
            return YamlDeserializer.Deserialize<SeedingConfig>(yaml);
        });
    }

    public SeedingConfig GetConfig() => _config.Value;

    public SeedProjectConfig GetProjectConfig(string key) =>
        GetConfig().Projects.Single(p => p.Key == key);
}

Seeder Uses Config¶

public class DatabaseSeeder
{
    private readonly ISeedingConfigProvider _configProvider;

    public async Task SeedAsync()
    {
        var config = _configProvider.GetConfig();

        foreach (var projectConfig in config.Projects)
        {
            var project = await CreateProjectFromConfig(projectConfig);
            await SeedStudiesFromConfig(project.Id, projectConfig);

            if (projectConfig.Screening != null)
                await SeedScreeningsFromConfig(project.Id, projectConfig.Screening);

            if (projectConfig.Annotation != null)
                await SeedAnnotationsFromConfig(project.Id, projectConfig.Annotation);

            if (projectConfig.OutcomeData != null)
                await SeedOutcomeDataFromConfig(project.Id, projectConfig.OutcomeData);
        }
    }
}

Tests Use Same Config¶

public class SeedDataValidationTests : IClassFixture<SeededDatabaseFixture>
{
    private readonly ISeedingConfigProvider _configProvider;

    [Fact]
    public async Task QuickStart_HasCorrectStudyCount()
    {
        // Config is SOURCE OF TRUTH
        var config = _configProvider.GetProjectConfig("QuickStart");

        var stats = await _statsService.GetProjectStatsAsync(quickStartProjectId);

        stats.TotalStudies.Should().Be(config.StudyCount);  // No magic numbers!
    }

    [Fact]
    public async Task DualScreening_HasCorrectScreeningProgress()
    {
        var config = _configProvider.GetProjectConfig("DualScreening");

        var stats = await _statsService.GetProjectStatsAsync(dualProjectId);

        stats.Screening.SufficientlyScreened.Should().Be(config.Screening!.ScreenedCount);
        stats.Screening.PartiallyScreened.Should().Be(config.Screening!.PartialCount);
    }

    [Theory]
    [MemberData(nameof(AllProjectConfigs))]
    public async Task AllProjects_HaveCorrectStudyCounts(SeedProjectConfig config)
    {
        var projectId = await GetProjectIdByName(config.Name);
        var stats = await _statsService.GetProjectStatsAsync(projectId);

        stats.TotalStudies.Should().Be(config.StudyCount);
    }

    public static IEnumerable<object[]> AllProjectConfigs =>
        _configProvider.GetConfig().Projects.Select(p => new object[] { p });
}

Benefits of Config-Driven Testing¶

Benefit	Explanation
Single Source of Truth	Config defines both seeding AND expected test values
No Magic Numbers	Tests read from config, not hardcoded values
Flexible Testing	Theory tests can iterate all projects automatically
Documentation	YAML file documents what seed data exists
Easy Changes	Change config once, seeder and tests both update
Environment Specific	Can have different configs for preview vs staging

E2E Tests Also Use Config¶

// Load config at test startup (exported as JSON or TypeScript)
import { seedConfig } from './seed-config';

describe('Seed Data Validation', () => {
    const quickStart = seedConfig.projects.find(p => p.key === 'QuickStart')!;

    it('Quick Start has correct study count', () => {
        cy.visit('/projects/' + quickStartId);
        cy.get('[data-testid="study-count"]')
            .should('contain', quickStart.studyCount.toString());
    });
});

Implementation Priority¶

Phase	Work	Benefit
Phase 1	Create config model + YAML file	Documents current/planned seed data
Phase 2	Refactor seeder to use config	Centralized seeding logic
Phase 3	Update integration tests to use config	No more magic numbers
Phase 4	Export config for E2E tests	Frontend tests also use same source

Recommendation: Start with Phase 1-2 immediately. The config file serves as documentation even before tests are updated.

Part 10: Non-Production Visual Indicator (Enhanced UX)¶

Current State¶

The web app has a Version Info Dialog (version-info-dialog.component.ts) with color-coded environment badges:

Environment	Background	Text Color
Production	`#e8f5e9` (green)	`#2e7d32`
Staging	`#fff3e0` (orange)	`#bf360c`
Preview	`#e3f2fd` (blue)	`#1565c0`
Development	`#fce4ec` (pink)	`#c2185b`

Gap: Database name (syrftest vs syrfdev) is NOT shown - frontend has no visibility.

Design Principle: Real estate usage should be similar to production app. The collapsed indicator should be minimal and non-intrusive.

User Flow¶

┌────────────────────────────────────────────────────────────────────┐
│ User visits non-production URL                                      │
└─────────────────────────┬──────────────────────────────────────────┘
                          ↓
┌────────────────────────────────────────────────────────────────────┐
│ Check sessionStorage: 'env-banner-collapsed'                        │
└─────────────────────────┬──────────────────────────────────────────┘
                          ↓
           ┌──────────────┴──────────────┐
           ↓                             ↓
     [Not Collapsed]              [Previously Collapsed]
           ↓                             ↓
┌──────────────────────────┐  ┌────────────────────────────────────┐
│ SLIM BANNER (top bar)    │  │ MINIMAL INDICATOR (corner badge)   │
│ "STAGING | Click for     │  │ Small colored pill/badge           │
│  more info"              │  │ Clicking reopens banner            │
│ [Click anywhere]         │  └────────────────────────────────────┘
└───────────┬──────────────┘
            ↓ (user clicks)
┌──────────────────────────┐
│ DETAILED DIALOG          │
│ - Environment name       │
│ - Database name          │
│ - PR number (if preview) │
│ - Important notes        │
│ [Dismiss & Collapse]     │
└───────────┬──────────────┘
            ↓ (user dismisses)
┌────────────────────────────────────────────────────────────────────┐
│ Store 'env-banner-collapsed' = true in sessionStorage               │
│ Show MINIMAL INDICATOR (same as "Previously Collapsed" state)      │
└────────────────────────────────────────────────────────────────────┘

A single-line banner at the top of the viewport, similar height to a small toolbar:

<!-- app.component.html -->
<div *ngIf="!isProduction && !bannerCollapsed"
     class="env-banner"
     [class]="environmentClass"
     (click)="showEnvironmentDialog()">
    <mat-icon class="env-icon">info</mat-icon>
    <span class="env-text">
        <strong>{{ environmentName | uppercase }}</strong>
        <span class="separator">|</span>
        Click for environment details
    </span>
</div>

.env-banner {
    height: 32px;           // Slim - similar to a toolbar
    display: flex;
    align-items: center;
    justify-content: center;
    gap: 8px;
    cursor: pointer;
    font-size: 13px;
    position: sticky;
    top: 0;
    z-index: 1000;

    &.preview { background: #2196f3; color: white; }
    &.staging { background: #ff9800; color: white; }
    &.development { background: #e91e63; color: white; }

    .separator { opacity: 0.7; margin: 0 4px; }
    .env-icon { font-size: 18px; height: 18px; width: 18px; }

    &:hover { filter: brightness(1.1); }
}

Opens when user clicks the banner:

@Component({
    selector: 'app-environment-dialog',
    template: `
        <h2 mat-dialog-title>Environment Information</h2>
        <mat-dialog-content>
            <div class="env-details">
                <div class="detail-row">
                    <span class="label">Environment:</span>
                    <span class="badge" [class]="environmentClass">{{ environmentName }}</span>
                </div>
                <div class="detail-row">
                    <span class="label">Database:</span>
                    <span class="value">{{ databaseName }}</span>
                </div>
                <div class="detail-row" *ngIf="prNumber">
                    <span class="label">PR Preview:</span>
                    <span class="value">#{{ prNumber }}</span>
                </div>
                <div class="detail-row" *ngIf="dataSource?.sampledAt">
                    <span class="label">Data Sampled:</span>
                    <span class="value">{{ dataSource.sampledAt | date:'medium' }}</span>
                </div>
            </div>

            <mat-divider></mat-divider>

            <div class="notes">
                <p><strong>Note:</strong> This is not production. Data may be reset at any time.</p>
            </div>
        </mat-dialog-content>
        <mat-dialog-actions align="end">
            <button mat-button (click)="dismiss(false)">Keep Banner Visible</button>
            <button mat-raised-button color="primary" (click)="dismiss(true)">
                Dismiss & Collapse
            </button>
        </mat-dialog-actions>
    `
})
export class EnvironmentDialogComponent {
    dismiss(collapse: boolean) {
        if (collapse) {
            sessionStorage.setItem('env-banner-collapsed', 'true');
        }
        this.dialogRef.close(collapse);
    }
}

State 3: Minimal Indicator (After Collapse)¶

A small badge in the corner - minimal screen real estate (similar to production):

<!-- Collapsed state indicator -->
<div *ngIf="!isProduction && bannerCollapsed"
     class="env-indicator"
     [class]="environmentClass"
     (click)="expandBanner()"
     matTooltip="Click to show environment details">
    {{ environmentAbbrev }}
</div>

.env-indicator {
    position: fixed;
    top: 8px;
    right: 8px;
    padding: 4px 10px;
    border-radius: 12px;
    font-size: 11px;
    font-weight: 600;
    text-transform: uppercase;
    cursor: pointer;
    z-index: 999;
    box-shadow: 0 1px 3px rgba(0,0,0,0.2);

    &.preview { background: #2196f3; color: white; }
    &.staging { background: #ff9800; color: white; }
    &.development { background: #e91e63; color: white; }

    &:hover {
        transform: scale(1.05);
        box-shadow: 0 2px 6px rgba(0,0,0,0.3);
    }
}

Abbreviations:

Environment	Abbreviation
Development	DEV
Staging	STG
Preview	PR-123
Production	(not shown)

Component Logic¶

@Component({ ... })
export class AppComponent implements OnInit {
    isProduction = false;
    bannerCollapsed = false;
    environmentName = '';
    environmentClass = '';
    environmentAbbrev = '';

    ngOnInit() {
        this.loadEnvironmentInfo();
        this.bannerCollapsed = sessionStorage.getItem('env-banner-collapsed') === 'true';
    }

    async loadEnvironmentInfo() {
        const info = await this.appService.getApplicationInfo();
        this.isProduction = info.environment === 'Production';
        this.environmentName = info.environment;
        this.environmentClass = info.environment.toLowerCase();
        this.environmentAbbrev = this.getAbbreviation(info);
    }

    getAbbreviation(info: AppInfo): string {
        if (info.prNumber) return `PR-${info.prNumber}`;
        switch (info.environment) {
            case 'Staging': return 'STG';
            case 'Development': return 'DEV';
            default: return info.environment.substring(0, 3).toUpperCase();
        }
    }

    showEnvironmentDialog() {
        const dialogRef = this.dialog.open(EnvironmentDialogComponent, {
            data: this.environmentInfo
        });
        dialogRef.afterClosed().subscribe(collapsed => {
            if (collapsed) this.bannerCollapsed = true;
        });
    }

    expandBanner() {
        sessionStorage.removeItem('env-banner-collapsed');
        this.bannerCollapsed = false;
    }
}

Visual Comparison¶

State	Height	Location	Screen Real Estate
Production	0px	N/A	None (baseline)
Banner (expanded)	32px	Top sticky	Small toolbar
Indicator (collapsed)	~24px	Top-right corner	Badge only
Dialog	Modal	Center	Overlay (temporary)

Goal: Collapsed state uses similar real estate to production (just a small corner badge).

Backend Changes¶

Add databaseName to /api/application/info endpoint
Add prNumber for preview environments (from environment variable)
Return environment classification

// ApplicationController.cs
[HttpGet("info")]
public IActionResult GetApplicationInfo()
{
    return Ok(new
    {
        Version = _configuration["AppVersion"],
        Environment = _configuration["ASPNETCORE_ENVIRONMENT"],
        DatabaseName = GetDatabaseNameFromConnectionString(),
        PrNumber = Environment.GetEnvironmentVariable("PR_NUMBER"),
        IsProduction = _configuration["ASPNETCORE_ENVIRONMENT"] == "Production"
    });
}

Implementation Files¶

File	Change
`app.component.html`	Add banner and compact indicator
`app.component.ts`	Add state management, dialog trigger
`app.component.scss`	Add styles
`environment-warning-dialog.component.ts`	New dialog component
`ApplicationController.cs`	Add database name to info endpoint
`MongoContext.cs` or config	Extract database name

Part 11: Production Data Sampling for Non-Production Environments¶

Context: PII Considerations for SyRF¶

SyRF is a research platform for systematic reviews where:

User data consists primarily of researcher emails and names (professional context)
Project data is scientific literature metadata (titles, abstracts, DOIs)
Studies are references to published academic papers (public information)
Annotations are structured research assessments (not personal data)

Conclusion: PII is not a major concern for this application. The primary data (study metadata from academic databases) is already public. User data is limited to professional research context.

This enables practical production sampling without complex anonymization.

Recommended Approach: Configurable Production Sampling¶

Architecture Overview¶

┌──────────────────────────────────────────────────────────────────┐
│                    Production Database (syrftest)                 │
│   Projects: ~500  Studies: ~2M  Investigators: ~3000             │
└────────────────────────────┬─────────────────────────────────────┘
                             │
              ┌──────────────┴──────────────┐
              ↓                             ↓
   ┌─────────────────────┐     ┌─────────────────────────────┐
   │    PR Preview       │     │         Staging             │
   │    (Lightweight)    │     │    (Near-Production)        │
   ├─────────────────────┤     ├─────────────────────────────┤
   │ • 3-5 sampled       │     │ • 10-20 sampled projects    │
   │   projects          │     │ • Full study sets           │
   │ • Max 500 studies   │     │ • All screening data        │
   │ • Minimal screening │     │ • Full annotation sessions  │
   │ • No annotations    │     │ • Performance testing       │
   │ • Fast creation     │     │ • Weekly refresh            │
   │ • PR-specific       │     │                             │
   └─────────────────────┘     └─────────────────────────────┘

Configuration System¶

YAML Configuration File¶

# cluster-gitops/syrf/environments/{env}/data-sampling.yaml

# PR Preview - Lightweight for quick testing
preview:
  enabled: true
  schedule: "on-demand"  # Created when PR opens
  source:
    database: syrftest
    connection: ${PROD_MONGODB_URI}  # Read-only connection

  projects:
    strategy: curated        # Use specific project IDs
    count: 5
    selection:
      - id: "uuid-of-active-project-1"      # Large active project
        reason: "High activity, good for testing"
      - id: "uuid-of-dual-screening-project"
        reason: "Demonstrates dual screening workflow"
      - id: "uuid-of-annotation-project"
        reason: "Has extraction data"

    limits:
      maxStudiesPerProject: 500
      maxScreeningsPerStudy: 10
      includeAnnotations: false    # Skip for speed
      includeOutcomeData: false

  investigators:
    strategy: project-members   # Only members of sampled projects
    additionalIds: []           # No extras needed

# Staging - Comprehensive for integration testing
staging:
  enabled: true
  schedule: "0 3 * * 0"   # Weekly, Sunday 3am
  source:
    database: syrftest
    connection: ${PROD_MONGODB_URI}

  projects:
    strategy: mixed           # Curated + random sample
    count: 20
    selection:
      # Curated essential projects
      - id: "uuid-of-flagship-project"
        reason: "Largest project, performance testing"
      - id: "uuid-of-complex-stages"
        reason: "Multi-stage workflow testing"
    randomSampleCount: 15     # Additional random projects
    randomCriteria:
      minStudies: 100
      minMembers: 2
      hasScreening: true

    limits:
      maxStudiesPerProject: null    # No limit
      maxScreeningsPerStudy: null   # Full data
      includeAnnotations: true
      includeOutcomeData: true

  investigators:
    strategy: all-active      # All investigators who logged in last 90 days
    activityThresholdDays: 90

Implementation Components¶

1. Sampling Job (Kubernetes CronJob or Manual)¶

// SyRF.DataSampling.Job/SamplingJob.cs
public class ProductionSamplingJob
{
    private readonly IMongoClient _sourceClient;
    private readonly IMongoClient _targetClient;
    private readonly SamplingConfig _config;

    public async Task ExecuteSamplingAsync(CancellationToken ct)
    {
        _logger.LogInformation("Starting production sampling for {Environment}",
            _config.EnvironmentName);

        // 1. Determine which projects to sample
        var projectIds = await GetProjectIdsToSample();
        _logger.LogInformation("Sampling {Count} projects", projectIds.Count);

        // 2. Sample investigators (members of selected projects)
        await SampleInvestigatorsAsync(projectIds, ct);

        // 3. Sample projects with their related data
        foreach (var projectId in projectIds)
        {
            await SampleProjectAsync(projectId, ct);
        }

        // 4. Create sampling metadata record
        await RecordSamplingMetadataAsync(projectIds, ct);
    }

    private async Task SampleProjectAsync(Guid projectId, CancellationToken ct)
    {
        var project = await _sourceDb.GetCollection<Project>("pmProject")
            .Find(p => p.Id == projectId)
            .FirstOrDefaultAsync(ct);

        if (project == null) return;

        // Insert project
        await _targetDb.GetCollection<Project>("pmProject")
            .InsertOneAsync(project, cancellationToken: ct);

        // Sample studies with limit
        var studyQuery = _sourceDb.GetCollection<Study>("pmStudy")
            .Find(s => s.ProjectId == projectId);

        if (_config.Projects.Limits.MaxStudiesPerProject.HasValue)
        {
            studyQuery = studyQuery.Limit(_config.Projects.Limits.MaxStudiesPerProject.Value);
        }

        var studies = await studyQuery.ToListAsync(ct);

        if (studies.Any())
        {
            await _targetDb.GetCollection<Study>("pmStudy")
                .InsertManyAsync(studies, cancellationToken: ct);
        }

        // Sample related collections...
        await SampleSystematicSearchesAsync(projectId, ct);

        if (_config.Projects.Limits.IncludeAnnotations)
        {
            await SampleAnnotationsAsync(projectId, studies.Select(s => s.Id), ct);
        }
    }
}

2. Sampling Metadata Tracking¶

// Track what was sampled and when
public record SamplingMetadata
{
    public Guid Id { get; init; } = Guid.NewGuid();
    public DateTime SampledAt { get; init; } = DateTime.UtcNow;
    public string SourceDatabase { get; init; }
    public string TargetDatabase { get; init; }
    public string Environment { get; init; }
    public int ProjectCount { get; init; }
    public int StudyCount { get; init; }
    public int InvestigatorCount { get; init; }
    public List<Guid> SampledProjectIds { get; init; }
    public string ConfigHash { get; init; }  // Track config version
}

// Query sampling metadata
db.samplingMetadata.find().sort({ sampledAt: -1 }).limit(1)
// Shows when data was last sampled and what's included

3. GitHub Actions Integration (PR Preview)¶

# .github/workflows/pr-preview.yml (addition)
jobs:
  sample-production-data:
    needs: [deploy-preview]
    runs-on: ubuntu-latest
    if: ${{ github.event.action == 'opened' }}  # Only on PR open

    steps:
      - name: Run Production Sampling
        uses: actions/github-script@v6
        with:
          script: |
            // Trigger sampling job via kubectl or API
            const prNumber = context.issue.number;
            const targetDb = `syrf_pr_${prNumber}`;

            // Call sampling API or kubectl exec
            await exec.exec('kubectl', [
              'create', 'job',
              `--from=cronjob/production-sampler`,
              `production-sample-pr-${prNumber}`,
              '-n', `pr-${prNumber}`,
              '--', '--target-db', targetDb, '--config', 'preview'
            ]);

  # Or simpler: just use seed data for PRs
  seed-preview-data:
    needs: [deploy-preview]
    runs-on: ubuntu-latest
    steps:
      - name: Run Seeder
        run: |
          kubectl exec -n pr-${{ github.event.number }} \
            deployment/syrf-project-management \
            -- dotnet SyRF.Seeder.dll --config preview

Sampling Strategies Explained¶

Strategy	Use Case	How It Works
curated	PR Preview	Explicitly listed project IDs, maximum control
random	Staging	Random sample meeting criteria (min studies, etc.)
mixed	Staging	Curated essentials + random sample for variety
all-active	Staging (investigators)	All users active in last N days
project-members	PR Preview	Only investigators who are members of sampled projects

Environment-Specific Recommendations¶

PR Preview Environments¶

Goal: Fast, lightweight, focused on feature testing

Setting	Value	Rationale
Projects	3-5 curated	Known good test projects
Max Studies	500 per project	Fast query performance
Screening	Include (limited)	Test screening workflows
Annotations	Exclude	Speed, not needed for most PRs
Refresh	On PR open	Fresh data each time

Trigger: Automatic when PR opens, or manual via workflow dispatch

Staging Environment¶

Goal: Comprehensive, near-production testing

Setting	Value	Rationale
Projects	15-20 mixed	Representative variety
Max Studies	Unlimited	Full performance testing
Screening	Full	Complete workflow testing
Annotations	Include	E2E extraction testing
Refresh	Weekly	Fresh but stable for testing

Trigger: Scheduled CronJob (Sunday 3am)

PR-Description Driven Configuration¶

Goal: Developers can easily switch between seeding, sampling, or combo by adding a config block to their PR description - no workflow file changes needed.

PR Description Format¶

Add a YAML block to the PR description to configure data strategy:

## Description
This PR adds new screening workflow improvements...

## Data Strategy
<!-- syrf-data-config -->
```yaml
data:
  strategy: seed-only       # Options: seed-only, sample-only, hybrid
  sampleProjects: 3         # If sampling, how many projects
  includeProdData: false    # Include production sample?

#### Supported Strategies

| Strategy | Description | Use Case |
|----------|-------------|----------|
| `seed-only` | Only run seeder (default) | Most PRs - fast, predictable |
| `sample-only` | Only sample production | Performance testing |
| `hybrid` | Seed first, then sample | Integration testing |
| `none` | Empty database | Testing fresh install |

#### Workflow Integration

```yaml
# .github/workflows/pr-preview.yml
jobs:
  parse-data-config:
    runs-on: ubuntu-latest
    outputs:
      strategy: ${{ steps.parse.outputs.strategy }}
      sample_projects: ${{ steps.parse.outputs.sample_projects }}
    steps:
      - name: Parse PR Description
        id: parse
        uses: actions/github-script@v7
        with:
          script: |
            const body = context.payload.pull_request?.body || '';

            // Extract YAML from <!-- syrf-data-config --> block
            const match = body.match(/<!--\s*syrf-data-config\s*-->\s*```yaml\s*([\s\S]*?)```\s*<!--\s*\/syrf-data-config\s*-->/);

            let config = { strategy: 'seed-only', sampleProjects: 0 };
            if (match) {
              const yaml = require('js-yaml');
              const parsed = yaml.load(match[1]);
              config = { ...config, ...parsed.data };
            }

            core.setOutput('strategy', config.strategy);
            core.setOutput('sample_projects', config.sampleProjects || 0);

  seed-data:
    needs: [deploy-preview, parse-data-config]
    if: needs.parse-data-config.outputs.strategy != 'sample-only' && needs.parse-data-config.outputs.strategy != 'none'
    runs-on: ubuntu-latest
    steps:
      - name: Run Seeder
        run: |
          kubectl exec -n pr-${{ github.event.number }} \
            deployment/syrf-project-management \
            -- dotnet SyRF.Seeder.dll --config preview

  sample-production:
    needs: [deploy-preview, parse-data-config, seed-data]
    if: |
      needs.parse-data-config.outputs.strategy == 'sample-only' ||
      needs.parse-data-config.outputs.strategy == 'hybrid'
    runs-on: ubuntu-latest
    steps:
      - name: Sample Production Data
        run: |
          kubectl create job \
            --from=cronjob/production-sampler \
            sample-pr-${{ github.event.number }} \
            -n pr-${{ github.event.number }} \
            -- --count=${{ needs.parse-data-config.outputs.sample_projects }}

Quick Reference for Developers¶

Scenario	PR Description Config
Default (most PRs)	No config block needed → uses `seed-only`
Integration testing	`strategy: hybrid` + `sampleProjects: 2`
Performance testing	`strategy: sample-only` + `sampleProjects: 10`
Empty database	`strategy: none`

Full Config Options¶

data:
  strategy: hybrid              # seed-only | sample-only | hybrid | none
  sampleProjects: 5            # Number of projects to sample (if sampling)
  sampleStrategy: curated      # curated | random | mixed
  maxStudiesPerProject: 500    # Limit studies per project (faster)
  includeAnnotations: false    # Include annotation data (slower)
  seedConfig: preview          # Which seed config to use: preview | minimal | full

Alternative: Label-Based Configuration¶

For simpler use cases, use PR labels instead of YAML config:

Label	Equivalent Strategy
`data:seed-only`	Default seeding
`data:hybrid`	Seed + sample
`data:sample-only`	Production sample only
`data:none`	Empty database
`data:sample-large`	Sample with 10+ projects

# Workflow checks for labels
- name: Check Data Strategy Label
  id: labels
  run: |
    if [[ "${{ contains(github.event.pull_request.labels.*.name, 'data:hybrid') }}" == "true" ]]; then
      echo "strategy=hybrid" >> $GITHUB_OUTPUT
    elif [[ "${{ contains(github.event.pull_request.labels.*.name, 'data:sample-only') }}" == "true" ]]; then
      echo "strategy=sample-only" >> $GITHUB_OUTPUT
    else
      echo "strategy=seed-only" >> $GITHUB_OUTPUT
    fi

Recommendation: Use labels for quick toggling, PR description config for fine-grained control.

Data Freshness Indicators¶

Add sampling metadata to the environment info endpoint:

[HttpGet("info")]
public IActionResult GetApplicationInfo()
{
    var samplingInfo = await _db.GetCollection<SamplingMetadata>("samplingMetadata")
        .Find(_ => true)
        .SortByDescending(s => s.SampledAt)
        .FirstOrDefaultAsync();

    return Ok(new
    {
        Version = _configuration["AppVersion"],
        Environment = _configuration["ASPNETCORE_ENVIRONMENT"],
        DatabaseName = GetDatabaseName(),
        DataSource = samplingInfo != null ? new
        {
            SampledAt = samplingInfo.SampledAt,
            SourceDatabase = samplingInfo.SourceDatabase,
            ProjectCount = samplingInfo.ProjectCount,
            StudyCount = samplingInfo.StudyCount
        } : null
    });
}

Frontend Display (in environment warning dialog):

<div class="detail-row" *ngIf="dataSource">
    <span class="label">Data Sampled:</span>
    <span class="value">
        {{ dataSource.sampledAt | date:'medium' }}
        ({{ dataSource.projectCount }} projects, {{ dataSource.studyCount }} studies)
    </span>
</div>

Implementation Priority¶

Phase	Work	Environment	Effort
1	Enhanced seed data (Part 4 matrix)	All	Medium
2	Sampling config YAML schema	All	Low
3	PR preview sampling job	Preview	Medium
4	Staging weekly sampling CronJob	Staging	Medium
5	Sampling metadata tracking	All	Low
6	Environment info API enhancement	All	Low

Recommendation: Start with Phase 1-2. Seed data alone may be sufficient for most testing. Add production sampling (Phase 3-4) only if specific production data scenarios are needed for testing.

Part 12: Concrete Next Steps Plan¶

Immediate Actions (This Sprint)¶

#	Task	Files to Change	Effort
1	Refactor Project constructor chain	Project.cs	Low
	Add private constructor with ID param, add `CreateWithId` factory
2	Apply same pattern to Study	Study.cs	Low
	Factory method for deterministic study IDs
3	Apply same pattern to Investigator	Investigator.cs	Low
	Factory method for deterministic investigator IDs
4	Update DatabaseSeeder to use factory methods	DatabaseSeeder.cs	Medium
	Replace reflection-based ID setting with factory calls

Short-Term Actions (Next 2 Sprints)¶

#	Task	Files to Change	Effort
5	Create ISeedDataOwnershipTransfer interface	New file: `Core/Seeding/ISeedDataOwnershipTransfer.cs`	Low
6	Implement comprehensive seed project matrix	DatabaseSeeder.cs, SeedDataConstants.cs	Medium
	Per Part 4 coverage matrix - all AgreementModes, stage states
7	Add screening data seeding	DatabaseSeeder.cs	Medium
	Seed screenings using `study.AddScreening()` path
8	Call RecalculateScreeningStatsAsync after seeding	DatabaseSeeder.cs	Low
	Initialize `_trackedProjectAgreementThresholds` on Studies

Medium-Term Actions (Future Sprints)¶

#	Task	Effort	Dependency
9	Create IProjectStatsService	Medium	None
10	Add environment warning banner (Part 10)	Medium	None
11	Production sampling job (Part 11)	Medium	None
12	PR description config parsing	Low	Part 11

Implementation Order Diagram¶

Sprint N (Foundation):
┌─────────────────────────────────────────────────────────────┐
│  1. Refactor constructors (Project, Study, Investigator)    │
│           ↓                                                  │
│  4. Update DatabaseSeeder to use factory methods            │
└─────────────────────────────────────────────────────────────┘

Sprint N+1 (Coverage):
┌─────────────────────────────────────────────────────────────┐
│  5. ISeedDataOwnershipTransfer interface                    │
│  6. Full project matrix (all AgreementModes)                │
│  7. Screening data seeding                                  │
│  8. Stats recalculation                                     │
└─────────────────────────────────────────────────────────────┘

Sprint N+2+ (Polish):
┌─────────────────────────────────────────────────────────────┐
│  9. IProjectStatsService consolidation                      │
│  10. Environment warning banner                             │
│  11. Production sampling (if needed)                        │
└─────────────────────────────────────────────────────────────┘

Definition of Done¶

Each task should meet these criteria before being marked complete:

Code compiles without warnings
Unit tests added/updated for new code
Existing tests pass
Seeded data visible in preview environment
Documentation updated (this file or CLAUDE.md)

Risks and Mitigations¶

Risk	Mitigation
Constructor refactoring breaks MongoDB deserialization	Keep parameterless public constructor, test deserialization
Factory methods duplicate initialization logic	Private constructor with all params, factory delegates to it
Stats calculation fails after seeding	Call `RecalculateScreeningStatsAsync` explicitly after seeding
Seed data differs from production data	Use same domain methods as production (AddScreening, etc.)

Appendix A: AgreementMode Reference¶

Mode	Reviewers Needed	Agreement Required	Description
SingleReviewerDecides	1	N/A	One reviewer makes final decision
AutomatedDualScreening	2	100%	Both reviewers must agree
ThreeReviewersOneReconciles	2 of 3	67%	Two of three must agree, one reconciles
ThreeReviewersTwoReconcile	3	67%	All three review, two must agree
Custom	varies	varies	User-defined threshold

MongoDB Testing Strategy - Database isolation for preview environments
PR Preview Environments - Using preview environments
CLAUDE.md - Project overview and database architecture

Appendix C: API Endpoint Summary¶

Key endpoints that seeding should mirror:

Endpoint	Purpose	Service Method
POST /api/projects	Create project	IProjectManagementService.CreateProject()
POST /api/application/signin	Create/update investigator	IApplicationService.EnsureInvestigatorCreatedAndUpdatedAsync()
POST /api/projects/{id}/stages	Add stage	project.AddStage()
PATCH /api/projects/{id}/stages/{stageId}	Update stage	project.UpdateStage()
POST /api/projects/{id}/questions	Add annotation question	project.UpsertCustomAnnotationQuestion()
POST /api/reviews/screening	Submit screening	study.AddScreening()
POST /api/projects/{id}/studies/inclusionInfo	Calculate inclusion	IProjectManagementService.UpdateStudyInclusionInfoForProjectAsync()

Seed Data Quality Analysis and Planning¶

Executive Summary¶

Part 1: Current Implementation Analysis¶

Seeding Architecture¶

Current Seed Data Inventory¶

Key Files¶

Part 2: Identified Issues¶

Issue 1: Stages Appearing Disabled for Screening¶

Issue 2: Study Stats Not Correct¶

Issue 3: Ownership Transfer Side Effects¶

Part 3: Architectural Analysis¶

Current Seeding Approach vs API Endpoints¶

API Endpoint Analysis¶

Project Creation Flow (Production)¶

Screening Flow (Production)¶

Service Layer Mapping¶

Where Controllers Do More Than Just Call Services¶

1. ReviewController - Stats Calculation After Screening¶

2. ProjectController - Instrumentation Code¶

3. ApplicationController - Multi-Step Signin¶

SOLID Principle Violations Identified¶

Part 4: Seed Project Planning Matrix¶

Guiding Principles¶

Proposed Seed Project Matrix¶

Configuration Coverage Matrix¶

Annotation Question Coverage¶

Outcome Data Coverage¶

Part 5: Recommended Implementation Changes¶

Priority 1: Use High-Level Services for Project Creation¶

Priority 2: Use IApplicationService for Investigator Creation¶

Priority 3: Call Inclusion Update After Screening¶

Priority 4: Set AgreementMode Explicitly¶

Priority 5: Add Annotation Data¶

Part 6: Testing Recommendations¶

Integration Tests for Seeding¶

Manual Testing Checklist¶

Part 7: Service Layer Improvement Opportunities (Detailed Analysis)¶

7.1 Inclusion Calculation - When Called and Gaps¶

7.2 Optional ID Parameter - Deep Design Analysis¶

Deep Analysis: Where ProjectId is Stored¶

Technical Constraint: Entity ID Immutability¶

Option A: Add Optional Parameter to Service Method (NOT Recommended)¶

Option B: Seeder Factory with Post-Creation ID Override (NOT Recommended)¶

Option C: Accept Generated IDs (Recommended - Simple Path)¶

Option D: Factory Method on Entity (RECOMMENDED for Deterministic IDs)¶

Final Recommendation¶

7.3 Project Template Implementation Plan (Detailed)¶

Is ProjectTemplate an Aggregate Root?¶

Phase 1: Template Entity (Aggregate Root)¶

Phase 2: Template Service Methods (Detailed)¶

Phase 3: End User Workflow¶

Phase 4: Seeder Integration¶

Files to Create¶

Migration Path¶

7.4 Stats Consolidation - Implementation Plan¶

Problem Statement¶

Why Stats Need Service-Level Recalculation (or Don't)¶

Proposed Architecture (Corrected)¶

Implementation Steps¶

Files to Create/Modify¶

Seeder Integration¶

Priority Assessment¶

Part 8: Annotation Seeding - Validation Architecture¶

Production Code Path¶

Current Seeder Proposal (INCORRECT)¶

Validation Architecture Analysis¶

Current State: Mixed Validation¶

Best Practice: Domain-Driven Validation Layers¶

Proposed: Application Service for Annotation Submission¶

Seeder Uses Same Service¶

What Exactly Changes? (Detailed Migration)¶

Current Architecture (Controller-Based)¶

Proposed Architecture (Service-Based)¶

Specific Code Migration¶

Controller BEFORE (Simplified)¶

Controller AFTER (Simplified)¶

Files Changed¶

Benefits of Service-Layer Validation¶

Frontend Validation Consideration¶

Part 9: Seeding Configuration as Source of Truth¶