Seed Data Quality Analysis and Planning¶
Executive Summary¶
This document analyzes the current seed data implementation for preview environments, identifies quality issues, and provides comprehensive planning for creating high-quality seed data.
Key Principles:
- Breadth and depth - Seed projects should exemplify common configurations and states users will encounter
- Use high-level services - Prefer application services over bespoke seeding code to ensure realistic data
- Realistic data - Seeded data should be as close to production data as possible (including any real bugs/issues that would occur)
- Testability - Seeding should exercise the same code paths as real user operations
- Keep defaults where appropriate - Some projects should use default screening modes with seed investigators performing actual screening
Part 1: Current Implementation Analysis¶
Seeding Architecture¶
┌─────────────────────────────────────────────────────────────────┐
│ Application Startup │
├─────────────────────────────────────────────────────────────────┤
│ 1. IRunAtInit discovered via assembly scanning │
│ 2. DatabaseSeeder.Execute() called │
│ 3. Check SYRF_SEED_DATA_ENABLED=true │
│ 4. Check if database already has data │
│ 5. Create seed data if conditions met │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ First User Registration │
├─────────────────────────────────────────────────────────────────┤
│ 1. UserHasRegisteredHandler fires │
│ 2. SeedDataOwnershipTransfer.TransferToFirstRealUserAsync() │
│ 3. Add real user as admin to all seed projects │
│ 4. Transfer ownership from SeedBot to real user │
└─────────────────────────────────────────────────────────────────┘
Current Seed Data Inventory¶
| Project | Studies | Stages | Screening State | Annotation State | AgreementMode |
|---|---|---|---|---|---|
| Quick Start Demo | 10 | 2 (Screening active, Annotation inactive) | Unscreened | N/A | Default (Dual) |
| Screening In Progress | 30 | 1 (Screening active) | 50% complete (dual) | N/A | Default (Dual) |
| Ready for Annotation | 20 | 2 (Screening inactive, Annotation active) | Complete | Ready | Default (Dual) |
| Complete Review | 15 | 2 (both inactive) | Complete | Complete | Default (Dual) |
| Private Research | 8 | 1 (Screening active) | Unscreened | N/A | Default (Dual) |
Key Files¶
- DatabaseSeeder.cs - Main seeding logic
- SeedDataConstants.cs - Constants and IDs
- SeedDataOwnershipTransfer.cs - Ownership transfer logic
- SampleStudies.json - Sample study data
Part 2: Identified Issues¶
Issue 1: Stages Appearing Disabled for Screening¶
Symptom: Users report that screening stages appear disabled or inaccessible.
Root Cause Analysis:
- Project AgreementMode: Projects default to
AgreementMode.AutomatedDualScreening:
- Post-Ownership Transfer State: After ownership transfer, the real user is the only member who can log in. Seed reviewers exist but are fake users (
system|seed-reviewer-alpha). For dual screening projects, the UI may show screening as blocked because only one real reviewer exists.
Resolution Options:
| Option | Description | Recommended For |
|---|---|---|
| Keep dual screening with seed reviewers | Demonstrates realistic dual screening workflow | "Screening In Progress" project |
| Set SingleReviewerDecides | User can complete all screening alone | "Quick Start Demo" project |
| Add real user to reviewer role | User joins existing screening workflow | Projects with partial screening |
Note: It's OK to keep some projects with default dual screening where seed investigators have already done screening. The new user can add more users and change settings as needed.
Issue 2: Study Stats Not Correct¶
Symptom: Study statistics (screened/unscreened counts) don't match actual study states.
Root Cause Analysis:
- Stats Are Computed Dynamically: Stats use MongoDB aggregation pipelines that query:
ScreeningInfo.AgreementMeasure.NumberScreenedScreeningInfo.AgreementMeasure.AbsoluteAgreementRatio-
ScreeningInfo.Inclusion -
TrackedProjectAgreementThresholds Empty: Studies are created with empty thresholds:
This means studies haven't been evaluated against any threshold, affecting stats calculations.
- Direct Domain Model Usage: The seeder uses domain methods directly (
study.AddScreening()) rather than going through the same code paths as the API endpoints. This may skip side effects that normally occur.
Issue 3: Ownership Transfer Side Effects¶
Symptom: Seed reviewers remain as project members but cannot log in.
Analysis: This is actually acceptable behavior:
- Seed reviewers demonstrate realistic projects with multiple team members
- Their screening decisions show what dual screening looks like
- The new owner can add real users and modify settings
- This reflects how a real user would join an existing team project
Part 3: Architectural Analysis¶
Current Seeding Approach vs API Endpoints¶
The current DatabaseSeeder creates entities directly using domain models:
// Current approach - Direct domain model creation
var project = new Project(name, creatorId, ...);
typeof(Project).GetProperty("Id")?.SetValue(project, id); // Reflection to set ID
project.AddStage(...);
_unitOfWork.Save(project);
Issues with this approach:
- Bypasses service layer validation - Misses checks that would occur in production
- Uses reflection - Sets IDs directly, not how real projects are created
- May miss side effects - Service layer often triggers additional operations
- Different code path - Any bugs in seeding won't be found in production (and vice versa)
API Endpoint Analysis¶
Based on analysis of API controllers and Angular effects, here's how real users create data:
Project Creation Flow (Production)¶
Angular Component
↓
store.dispatch(projectDetailActions.create({...}))
↓
project-detail.effects.ts → POST /api/projects
↓
ProjectController.CreateProject()
↓
IProjectManagementService.CreateProject(name, creatorId, ...)
↓
new Project(...) + _unitOfWork.Save()
Screening Flow (Production)¶
Angular Component
↓
store.dispatch(reviewComponentActions.submitScreeningAndFetchNext({...}))
↓
review.effects.ts → POST /api/reviews/screening
↓
ReviewController.SubmitScreeningAndGetNextStudy()
↓
study.AddScreening() + IStageReviewService.GetRandomStudyAsync()
↓
_unitOfWork.Save(study)
Service Layer Mapping¶
| Operation | Controller | Service Method | Seeder Should Use |
|---|---|---|---|
| Create Project | ProjectController | IProjectManagementService.CreateProject() | Yes |
| Create Investigator | ApplicationController | IApplicationService.EnsureInvestigatorCreatedAndUpdatedAsync() | Yes |
| Add Stage | ProjectController | project.AddStage() (domain method) | OK - same as controller |
| Add Screening | ReviewController | study.AddScreening() (domain method) | OK - but verify side effects |
| Update Inclusion Info | StudyController | IProjectManagementService.UpdateStudyInclusionInfoForProjectAsync() | Yes - should call after screening |
| Add Search | ProjectController | IProjectManagementService.GetOrCreateSearchImportJobAsync() | Consider for realism |
Where Controllers Do More Than Just Call Services¶
Some controllers contain logic beyond simple service delegation. This affects seeding accuracy:
1. ReviewController - Stats Calculation After Screening¶
// ReviewController.cs - after screening submission
var stats = await _stageReviewService.GetFullStatsForInvestigatorAsync(
project.Id, stage.Id, CurrentUserId);
Impact on Seeding: The seeder doesn't calculate/return stats after adding screenings.
Recommendation: Either call UpdateStudyInclusionInfoForProjectAsync() after seeding screenings, or ensure domain model correctly updates stats synchronously.
2. ProjectController - Instrumentation Code¶
// ProjectController.cs - timing instrumentation in RepositionAnnotationQuestion
var stopwatch = Stopwatch.StartNew();
// ... operation ...
Console.WriteLine($"Get Project Time: {getProjectTime} ms");
Impact: Debug code in production. Not a seeding concern but violates SRP.
Recommendation: Move instrumentation to a decorator or aspect.
3. ApplicationController - Multi-Step Signin¶
// ApplicationController.cs - signin involves multiple operations
var investigator = await _applicationService.EnsureInvestigatorCreatedAndUpdatedAsync(...);
await _applicationService.RetrieveAndValidateInvestigatorUsage(...);
investigator.RecordSignin(ipAddress);
_pmUnitOfWork.Save(investigator);
Impact on Seeding: Seeder creates investigators directly without going through signin flow.
Recommendation: Consider using IApplicationService.EnsureInvestigatorCreatedAndUpdatedAsync() for seed investigator creation.
SOLID Principle Violations Identified¶
| Issue | Principle | Location | Recommendation |
|---|---|---|---|
| Controller does instrumentation | SRP | ProjectController:539-574 | Move to decorator |
| Seeder uses reflection for IDs | DIP | DatabaseSeeder:581 | Use factory or accept generated IDs |
| Stats calculated in controller | SRP | ReviewController | Move to service |
| Direct UnitOfWork in controllers | DIP | Multiple | OK for simple queries, wrap complex operations |
Part 4: Seed Project Planning Matrix¶
Guiding Principles¶
- Exemplify breadth: Cover all major feature configurations users will encounter
- Exemplify depth: Show projects at various workflow stages
- Keep some defaults: Not every project needs custom settings - real users start with defaults
- Use seed investigators actively: Have seed reviewers do actual screening to show realistic workflows
- Annotations matter: Include projects with annotation questions and some annotation data
Proposed Seed Project Matrix¶
| Project | Purpose | AgreementMode | Stages | Studies | Screening State | Annotation State | Outcome Data | Members |
|---|---|---|---|---|---|---|---|---|
| Quick Start Demo | Intro project for new users | SingleReviewerDecides | 1 screening, 1 annotation | 10 | Unscreened | N/A | None | Owner only |
| Dual Screening Example | Shows dual review workflow | AutomatedDualScreening | 1 screening | 30 | 50% complete by seed reviewers | N/A | None | Owner + Alpha + Beta |
| Ready for Annotation | Shows annotation workflow | SingleReviewerDecides | 1 screening (complete), 1 annotation | 20 | Complete (all included) | Ready with questions | None | Owner |
| Partial Annotation | Shows annotation in progress | SingleReviewerDecides | 1 screening (complete), 1 annotation | 15 | Complete | 50% annotated | 25% studies | Owner |
| Complete Review | Shows finished project | SingleReviewerDecides | Both complete | 15 | Complete | Complete | All studies | Owner |
| Data Extraction Demo | Shows outcome data workflow | SingleReviewerDecides | 1 screening (complete), 1 annotation (complete) | 12 | Complete | Complete | Varied (see below) | Owner |
| Private Research | Shows access control | AutomatedDualScreening | 1 screening | 8 | Unscreened | N/A | None | Owner (requires approval) |
| Large Study Set | Performance testing | SingleReviewerDecides | 1 screening | 100 | Unscreened | N/A | None | Owner |
| Three Reviewer | Shows 3-reviewer workflow | ThreeReviewersOneReconciles | 1 screening | 20 | 30% with agreement | N/A | None | Owner + Alpha + Beta + Gamma |
Configuration Coverage Matrix¶
| Configuration | Covered By Project(s) |
|---|---|
| Single reviewer screening | Quick Start, Ready for Annotation, Complete Review |
| Dual screening | Dual Screening Example, Private Research |
| Three reviewer | Three Reviewer |
| Screening only stage | Quick Start, Dual Screening Example |
| Annotation stage | Quick Start, Ready for Annotation, Partial Annotation, Complete Review |
| Private project | Private Research |
| Public project | All others |
| Large study count | Large Study Set |
| Complete workflow | Complete Review |
| Auto-approve joins | Quick Start (could add) |
| Require approval | Private Research |
Annotation Question Coverage¶
Seed projects with annotation should include questions from all categories:
| Category | Example Questions | Projects |
|---|---|---|
| Study | Species, Study design | Ready for Annotation, Partial Annotation, Complete Review, Data Extraction Demo |
| Risk of Bias | Randomization, Blinding | Ready for Annotation, Partial Annotation, Complete Review, Data Extraction Demo |
| Model Induction | Disease model, Induction method | Partial Annotation, Complete Review, Data Extraction Demo |
| Treatment | Intervention, Control, Dose | Partial Annotation, Complete Review, Data Extraction Demo |
| Outcome | Outcome measure, Timepoint | Complete Review, Data Extraction Demo |
| Experiment | Sample size, Duration | Complete Review, Data Extraction Demo |
Outcome Data Coverage¶
The OutcomeData entity (OutcomeData.cs) tracks quantitative experimental results per study.
Data Extraction Demo Project - Demonstrates various outcome data scenarios:
| Studies | Outcome Data State | Purpose |
|---|---|---|
| 3 studies | Complete data (all cohorts) | Shows fully extracted study |
| 3 studies | Partial data (some cohorts) | Shows work-in-progress |
| 3 studies | Multiple experiments | Shows multi-experiment studies |
| 3 studies | No outcome data | Shows contrast with extracted studies |
Sample Outcome Data Structure:
// Example seeded outcome data
var outcomeData = new OutcomeData(
projectId: projectId,
annotatorId: SeedDataConstants.SeedBotId,
stageId: extractionStageId,
experimentId: experimentId,
cohortId: cohortId)
{
Mean = 42.5,
StandardDeviation = 3.2,
SampleSize = 10,
Timepoint = "Day 7",
OutcomeMeasure = "Tumor volume (mm³)"
};
study.ExtractionInfo.AddOutcomeData(outcomeData);
Seeder Code Path for Outcome Data:
// Should mirror production: Study.AddSessionData() → ExtractionInfo.AddOutcomeData()
var sessionDto = new SessionSubmissionDto
{
// ... annotations ...
OutcomeData = GenerateOutcomeDataForStudy(study, project)
};
study.AddSessionData(investigatorId, sessionDto, questionIds, stageExtraction);
Part 5: Recommended Implementation Changes¶
Priority 1: Use High-Level Services for Project Creation¶
Current:
var project = new Project(name, creatorId, ...);
typeof(Project).GetProperty("Id")?.SetValue(project, id); // Reflection!
_unitOfWork.Save(project);
Recommended:
// Inject IProjectManagementService into DatabaseSeeder
var project = await _projectManagementService.CreateProject(
name: SeedDataConstants.QuickStartProjectName,
creatorId: SeedDataConstants.SeedBotId,
protocolUrl: "",
isPublic: true,
inclusionCriteria: "Studies relevant to the research question",
exclusionCriteria: "Studies not meeting inclusion criteria",
contactEmail: SeedDataConstants.SeedBotEmail,
notes: SeedDataConstants.QuickStartProjectDescription,
keywords: new HashSet<string> { "demo", "quick-start" },
machineLearningEnabled: false);
// Store generated ID in SeedDataConstants mapping or accept dynamic IDs
Trade-off: Loses deterministic IDs. Consider:
- Option A: Accept dynamic IDs and store them after creation
- Option B: Add ID parameter to CreateProject (service layer change)
- Option C: Keep current approach for IDs but use services for everything else
Priority 2: Use IApplicationService for Investigator Creation¶
Current:
var investigator = new Investigator(id, auth0Id, fullName, email, schemaVersion);
_unitOfWork.Save(investigator);
Recommended:
var investigator = await _applicationService.EnsureInvestigatorCreatedAndUpdatedAsync(
investigatorId: SeedDataConstants.SeedBotId,
auth0Id: SeedDataConstants.SeedBotAuth0Id,
name: new FullName(SeedDataConstants.SeedBotFirstName, SeedDataConstants.SeedBotLastName),
email: SeedDataConstants.SeedBotEmail,
pictureUrl: null,
ipAddress: "127.0.0.1");
Benefit: Uses same code path as production signin.
Priority 3: Call Inclusion Update After Screening¶
Add after adding all screenings to a project:
// After seeding all screenings for a project
await _projectManagementService.UpdateStudyInclusionInfoForProjectAsync(
projectId,
replaceExisting: true);
_logger.LogDebug("Updated inclusion info for project {ProjectId}", projectId);
Benefit: Ensures stats are calculated correctly using production code.
Priority 4: Set AgreementMode Explicitly¶
For projects that should use single screening:
For projects demonstrating dual screening:
// Leave as default (AutomatedDualScreening) - this is intentional
// Seed reviewers Alpha and Beta do the screening
Priority 5: Add Annotation Data¶
For "Partial Annotation" and "Complete Review" projects:
// After adding annotation questions, add annotation sessions
foreach (var study in studies.Take(annotatedCount))
{
var session = new AnnotationSession(
study.Id,
SeedDataConstants.SeedBotId,
annotationStageId);
// Add answers to questions
foreach (var question in project.AnnotationQuestions)
{
session.AddAnswer(question.Id, GenerateSampleAnswer(question));
}
study.AddAnnotationSession(session);
_unitOfWork.Save(study);
}
Part 6: Testing Recommendations¶
Integration Tests for Seeding¶
[Fact]
public async Task DatabaseSeeder_AfterSeeding_StatsAreCorrect()
{
// Arrange
var seeder = new DatabaseSeeder(...);
// Act
seeder.Execute();
// Assert - Quick Start should have 10 unscreened studies
var quickStartStats = await GetProjectStats(SeedDataConstants.QuickStartProjectId);
quickStartStats.TotalStudies.Should().Be(10);
quickStartStats.UnscreenedStudies.Should().Be(10);
// Assert - Dual Screening In Progress should have correct split
var dualStats = await GetProjectStats(SeedDataConstants.DualScreeningProjectId);
dualStats.TotalStudies.Should().Be(30);
dualStats.SufficientlyScreened.Should().BeGreaterThan(0);
dualStats.PartiallyScreened.Should().BeGreaterThan(0);
dualStats.UnscreenedStudies.Should().BeGreaterThan(0);
}
[Fact]
public async Task DatabaseSeeder_AfterSeeding_ProjectsHaveCorrectAgreementMode()
{
// Arrange & Act
seeder.Execute();
// Assert
var quickStart = await _unitOfWork.Projects.GetAsync(SeedDataConstants.QuickStartProjectId);
quickStart.AgreementMode.Should().Be(AgreementMode.SingleReviewerDecides);
var dualScreening = await _unitOfWork.Projects.GetAsync(SeedDataConstants.DualScreeningProjectId);
dualScreening.AgreementMode.Should().Be(AgreementMode.AutomatedDualScreening);
}
Manual Testing Checklist¶
For each seed project, verify:
- Project appears in project list with correct name/description
- Can open project details without errors
- Stage shows correct active/inactive state in UI
- Study count in UI matches seeded count
- Screening stats show expected numbers (not 0/0/0)
- Can start screening (for active screening stages)
- Screening decisions from seed reviewers are visible
- Can start annotation (for active annotation stages)
- Annotation questions appear in annotation view
- Private projects require approval to join
Part 7: Service Layer Improvement Opportunities (Detailed Analysis)¶
Based on comprehensive codebase analysis, these architectural changes would benefit both seeding and testability.
7.1 Inclusion Calculation - When Called and Gaps¶
Where UpdateStudyInclusionInfoForProjectAsync IS called:
| Location | Trigger | Context |
|---|---|---|
| UpdateStudyScreeningStatsConsumer.cs:17-33 | Message handler after AgreementMode change |
Async via MassTransit |
| ProjectController.cs:259-267 | Manual API call POST /update-study-inclusion |
On-demand admin action |
| ProjectController.cs:269-276 | Batch admin POST /update-all-study-inclusion |
Batch processing |
Deep Dive: How InclusionInfo Serialization Actually Works
The MongoDB BsonClassMap for ScreeningInfo (StudyRepository.cs:1427-1436) reveals the architecture:
cm.MapProperty(si => si.InclusionInfo).SetIgnoreIfNull(true); // Computed property IS mapped
cm.UnmapField(ScreeningInfo.TrackedProjectAgreementThresholdsName); // Backing field NOT stored
The serialization flow:
- On Save: The
InclusionInfogetter computes values from_trackedProjectAgreementThresholds+ currentScreenings - On Load: Stored
InclusionInfois deserialized → setter extractsProjectAgreementThresholdinto_trackedProjectAgreementThresholds - Subsequent saves: Getter recalculates from loaded thresholds + updated screenings → values ARE updated!
The ACTUAL gap (corrected analysis):
| Scenario | Behavior | Is This a Problem? |
|---|---|---|
Study where UpdateStudyInclusionInfoForProjectAsync was NEVER called |
InclusionInfo is null |
YES - MongoDB filters won't find it |
| Study where it WAS called, then screening added | Save recalculates from stored thresholds | NO - Auto-updates correctly |
| Production screening after project setup | Works correctly (threshold from AgreementMode change) | NO - Normal production flow |
| Seeded studies with screenings | InclusionInfo is null (never initialized) |
YES - This is the seeding gap |
Conclusion: The gap is NOT about individual screenings failing to update. It's about initialization - studies need UpdateStudyInclusionInfoForProjectAsync called at least once to populate the threshold configuration. After that, normal save operations update the computed values correctly.
Flow when AgreementMode changes:
ScreeningController.PostScreeningSettings()
→ project.UpdateAgreementMode()
→ raises ProjectAgreementThresholdUpdatedEvent
→ ProjectAgreementThresholdUpdatedHandler
→ sends IUpdateStudyScreeningStatsCommand
→ UpdateStudyScreeningStatsConsumer.Consume()
→ IStudyRepository.UpdateStudyInclusionInfoForProjectAsync()
Justification for Change: The production design is actually sound - once UpdateStudyInclusionInfoForProjectAsync is called (which happens after AgreementMode changes), subsequent screenings auto-update correctly via the computed property serialization. The gap is specific to seeding where this initialization step is skipped.
Recommendation for Seeding:
// After seeding all screenings for a project, initialize InclusionInfo:
await _projectManagementService.UpdateStudyInclusionInfoForProjectAsync(projectId, replaceExisting: true);
Why this works: This populates each study's InclusionInfo with the project's AgreementThreshold. Once set, any subsequent screening additions (even by seed users in the app) will correctly update via the computed property getter during save.
Affected Code for Implementing This:
- DatabaseSeeder.cs - Add call after each project's screenings are seeded
- This is a one-time initialization, not a recurring requirement
7.2 Optional ID Parameter - Deep Design Analysis¶
Question: Should IProjectManagementService.CreateProject() accept an optional ID?
Current Signature:
Task<Project> CreateProject(string name, Guid creatorId, string protocolUrl, ...);
// ID is generated internally by Project constructor
Deep Analysis: Where ProjectId is Stored¶
Critical Discovery: ProjectId is stored in many related entities, not just the Project itself:
| Entity | Storage | Relationship |
|---|---|---|
Study.ProjectId |
Stored property | FK to Project |
Screening.ProjectId |
Stored property | FK to Project (within Study) |
Annotation.ProjectId |
Stored property | FK to Project (within Study.ExtractionInfo) |
OutcomeData.ProjectId |
Stored property | FK to Project (within Study.ExtractionInfo) |
SystematicSearch.ProjectId |
Stored property | FK to Project |
DataExportJob.ProjectId |
Stored property | FK to Project |
RiskOfBiasAiJob.ProjectId |
Stored property | FK to Project |
StudyPdfCorrection.ProjectId |
Stored property | FK to Project |
LivingSearchSetting.ProjectId |
Stored property | FK to Project |
ProjectAccess.ProjectId |
Stored property | FK to Project (in Investigator) |
Computed Properties (safe to ignore) - these derive from Project.Id at runtime:
BulkStudyUpdateJob.ProjectId => Project.IdAnnotationQuestion.ProjectId => Project.IdStage.ProjectId => Project!.IdProjectMembership.ProjectId => Project.IdSearchImportJob.ProjectId => Project.Id
Technical Constraint: Entity ID Immutability¶
The base Entity<TId> class (Entity.cs:15) defines:
Implication: Post-creation ID changes require reflection to bypass the init accessor.
Option A: Add Optional Parameter to Service Method (NOT Recommended)¶
Where the change happens: IProjectManagementService (production service layer)
// Change to production service interface
public interface IProjectManagementService
{
// Original signature modified with optional ID
Task<Project> CreateProject(string name, Guid creatorId, ..., Guid? id = null);
}
// Implementation changes
public async Task<Project> CreateProject(string name, Guid creatorId, ..., Guid? id = null)
{
var project = new Project(id ?? Guid.NewGuid()); // Use provided ID or generate
// ... rest of implementation
}
Architectural Location: Service layer (production code path)
Cons:
- Pollutes production API - Optional
idparameter is never used in production - Violates Single Responsibility - Production service handles seeding concerns
- Risk of misuse - Accidental ID collision if someone passes a duplicate ID
- Testing concerns in production - Method signature reveals testing/seeding backdoor
Verdict: ❌ Rejected - seeding concerns shouldn't affect production service signature
Option B: Seeder Factory with Post-Creation ID Override (NOT Recommended)¶
public class SeederProjectFactory : ISeederProjectFactory
{
public async Task<Project> CreateProjectAsync(CreateProjectDto dto, Guid? overrideId = null)
{
var project = await _service.CreateProject(...);
if (overrideId.HasValue)
{
typeof(Project).GetProperty("Id")?.SetValue(project, overrideId.Value);
}
return project;
}
}
Critical Consistency Problem:
If seeding creates Studies, Searches, Screenings, etc. after the ID override, they'll have the correct ID. But what about entities created before?
// CreateProject internally calls:
Memberships.Add(ProjectMembership.InitialProjectCreator(this)); // Uses original ID!
The ProjectMembership is created with the original Project.Id, then we override. This creates an inconsistent state.
Even worse: If CreateProject saves immediately (which it does), we'd have two documents in MongoDB with different IDs representing the same project.
Cons:
- Entity.Id is
init-only - requires reflection to override - Race condition: Entities created before override have wrong ProjectId
- Database inconsistency: Original project may already be saved before override
- Hidden complexity: Developers won't expect ID mutation post-construction
Verdict: ❌ Rejected - causes data consistency issues
Option C: Accept Generated IDs (Recommended - Simple Path)¶
// Store generated IDs after creation
var project = await _service.CreateProject(...);
SeedDataConstants.RuntimeProjectIds["QuickStart"] = project.Id;
// Later, ownership transfer looks up dynamically
var projectIds = SeedDataConstants.RuntimeProjectIds.Values.ToList();
await _ownershipService.TransferProjectsToUser(userId, projectIds);
Pros:
- Clean production code - no seeding concerns leak into production
- Consistency - all entities use the same ID from construction
- Simple - no reflection, no special factory
Cons:
- Non-deterministic IDs - IDs differ between environments
- Dynamic lookup - ownership transfer must use runtime lookup
- Test assertions - can't hardcode expected IDs in tests
When This Works Well:
- Ownership transfer already loops through projects - just needs runtime list
- Tests can query by name/characteristics instead of ID
- Preview environments are ephemeral anyway
Verdict: ✅ Recommended for immediate implementation
Option D: Factory Method on Entity (RECOMMENDED for Deterministic IDs)¶
Where the change happens: Project entity class (domain layer, NOT service layer)
Key Difference from Option A:
- Option A modifies the service interface (production API changes)
- Option D adds a factory method on the entity (production API unchanged, new creation path)
Core Insight: The factory method should call the same substantive constructor used by ProjectManagementService.CreateProject(). No duplication of initialization logic.
Current Constructor Chain (lines 27-57 of Project.cs):
// CURRENT: Substantive constructor → parameterless → base(NewGuid)
public Project(string name, Guid creatorId, string protocolUrl, ...)
: this() // ← calls parameterless
{
// initialization
}
public Project() : base(Guid.NewGuid()) // ← ID generated here
{
// defaults
}
Refactored Constructor Chain (factory method delegates to internal constructor):
// Project.cs - Refactored to support deterministic IDs
public class Project : AggregateRoot<Guid>
{
// PUBLIC: Substantive constructor - delegates to internal with new ID
public Project(string name, Guid creatorId, string protocolUrl,
bool isPublic, string inclusionCriteria, string exclusionCriteria,
int schemaVersion, ICollection<string> keywords, string contactEmail = "",
string notes = "", LivingSearchConfig? livingSearchConfig = null)
: this(Guid.NewGuid(), name, creatorId, protocolUrl, isPublic,
inclusionCriteria, exclusionCriteria, schemaVersion, keywords,
contactEmail, notes, livingSearchConfig)
{
}
// PRIVATE: Internal constructor - ALL initialization happens here
private Project(Guid id, string name, Guid creatorId, string protocolUrl,
bool isPublic, string inclusionCriteria, string exclusionCriteria,
int schemaVersion, ICollection<string> keywords, string contactEmail,
string notes, LivingSearchConfig? livingSearchConfig)
: this(id)
{
DefaultSchemaVersion = schemaVersion;
SecuritySettings = new ProjectSecuritySettings(this);
ContactEmail = contactEmail;
Notes = notes;
InclusionCriteria = inclusionCriteria;
ExclusionCriteria = exclusionCriteria;
Name = name;
OwnerId = creatorId;
CreatorId = creatorId;
AddProtocolLink(protocolUrl);
Memberships.Add(ProjectMembership.InitialProjectCreator(this));
IsPublic = isPublic;
CreationDate = DateTime.Now;
Keywords = keywords;
if (livingSearchConfig != null) AddLivingSearch(livingSearchConfig);
}
// PRIVATE: ID-only constructor (for deserialization compatibility)
private Project(Guid id) : base(id)
{
LivingSearches = new List<LivingSearch>();
AgreementMode = AgreementMode.AutomatedDualScreening;
}
// PUBLIC: Parameterless (for MongoDB deserialization)
public Project() : this(Guid.NewGuid())
{
}
// NEW: Factory method for deterministic IDs - calls SAME constructor as service
public static Project CreateWithId(
Guid id,
string name,
Guid creatorId,
string protocolUrl = "",
bool isPublic = false,
string inclusionCriteria = "",
string exclusionCriteria = "",
int schemaVersion = 1,
ICollection<string>? keywords = null,
string contactEmail = "",
string notes = "",
LivingSearchConfig? livingSearchConfig = null)
{
return new Project(id, name, creatorId, protocolUrl, isPublic,
inclusionCriteria, exclusionCriteria, schemaVersion,
keywords ?? Array.Empty<string>(), contactEmail, notes,
livingSearchConfig);
}
}
// Seeder uses factory method
var project = Project.CreateWithId(
id: SeedDataConstants.QuickStartProjectId, // Deterministic!
name: SeedDataConstants.QuickStartProjectName,
creatorId: SeedDataConstants.SeedBotId);
// Production code continues using service (unchanged)
var project = await _projectService.CreateProject(name, userId, ...); // Auto-generates ID
Why This Works:
- Factory method calls the same private constructor as the public one
- All initialization logic is in ONE place (the private constructor)
- No duplication of
AddMembership,AddProtocolLink, etc. - Public API unchanged - existing code continues working
Architectural Comparison:
| Aspect | Option A (Service) | Option D (Factory) |
|---|---|---|
| What changes | Service interface signature | Entity class (add method) |
| Production code affected | Yes - new optional param | No - existing path unchanged |
| Where ID is provided | Service layer | Entity construction |
| Separation of concerns | Poor - seeding in production API | Good - factory is opt-in |
| Child entity IDs | Correct (ID set before children) | Correct (ID set before children) |
Pros:
- ID set at construction - all child entities (Membership, etc.) get correct ID from the start
- No reflection - uses language features correctly
- Production API unchanged -
IProjectManagementServicesignature stays the same - Explicit opt-in - only code that needs deterministic IDs uses the factory
- Type-safe - factory enforces required parameters at compile time
Cons:
- Pattern needed for all aggregates - Studies, Investigators, etc. need similar factories
- Slightly more code - private constructor + factory method per aggregate
Verdict: ✅ RECOMMENDED for deterministic ID requirement
Final Recommendation¶
| Approach | Recommendation | Use Case |
|---|---|---|
| Option D (Factory Method) | ✅ RECOMMENDED | Deterministic IDs for seeding |
| Option C (Accept Generated) | ✅ Fallback | If factory changes are deferred |
| Option A/B | ❌ Avoid | Pollutes production API or causes inconsistency |
Rationale: Option D provides deterministic IDs with clean architecture - the factory method is opt-in, production code is unchanged, and all child entities get the correct ID from construction.
7.3 Project Template Implementation Plan (Detailed)¶
Goal: Allow creating projects from templates for both seeding and future user features.
Note:
ProjectTemplatedoes NOT currently exist in the codebase. This is a proposed new entity.
Is ProjectTemplate an Aggregate Root?¶
Answer: Yes, ProjectTemplate should be an Aggregate Root.
Reasoning:
| Criteria | Analysis |
|---|---|
| Identity | Templates have their own lifecycle, independent of any Project |
| Invariants | Templates enforce their own rules (valid stage configuration, etc.) |
| Transactional Boundary | Templates are loaded/saved as a unit |
| Reference Pattern | Projects reference templates by ID, not containment |
Alternative Considered: Making templates a collection within a "TemplateLibrary" aggregate. This was rejected because:
- Templates can be created by different users
- No natural parent entity exists
- Individual template versioning becomes complex
Phase 1: Template Entity (Aggregate Root)¶
// New file: SyRF.ProjectManagement.Core/Model/TemplateAggregate/ProjectTemplate.cs
public class ProjectTemplate : AggregateRoot<Guid>
{
// Identity & Metadata
public string Name { get; private set; }
public string Description { get; private set; }
public bool IsSystemTemplate { get; private set; } // Seed/built-in templates
public Guid? CreatedByInvestigatorId { get; private set; }
public DateTime CreatedAt { get; private set; }
// Visibility
public TemplateVisibility Visibility { get; private set; } // Public, Private, Organization
// Template Configuration (Value Objects - snapshots, not live references)
public AgreementMode DefaultAgreementMode { get; private set; }
public IReadOnlyList<StageTemplate> Stages => _stages.AsReadOnly();
public IReadOnlyList<AnnotationQuestionTemplate> Questions => _questions.AsReadOnly();
public HashSet<string> DefaultKeywords { get; private set; }
private List<StageTemplate> _stages = new();
private List<AnnotationQuestionTemplate> _questions = new();
// Factory method for creating from existing project
public static ProjectTemplate CreateFromProject(Project source, string templateName, Guid creatorId)
{
var template = new ProjectTemplate(templateName, creatorId, source.AgreementMode);
template.CopyStagesFrom(source);
template.CopyQuestionsFrom(source);
return template;
}
// Factory method for system templates
public static ProjectTemplate CreateSystemTemplate(string name, string description)
{
return new ProjectTemplate(name, null, AgreementMode.AutomatedDualScreening)
{
IsSystemTemplate = true,
Description = description
};
}
}
Value Objects (not Aggregate Roots):
// StageTemplate - snapshot of stage configuration
public record StageTemplate(
string Name,
StageType Type,
bool IsActive,
int Order
);
// AnnotationQuestionTemplate - snapshot of question configuration
public record AnnotationQuestionTemplate(
string QuestionText,
string AnswerType,
bool IsRequired,
IReadOnlyList<string> Options // For multiple choice
);
Phase 2: Template Service Methods (Detailed)¶
public interface IProjectTemplateService
{
// === Template Management ===
/// <summary>
/// Create a new template from an existing project.
/// Copies stage configuration and annotation questions as snapshots.
/// </summary>
Task<ProjectTemplate> CreateTemplateFromProjectAsync(
Guid sourceProjectId,
string templateName,
Guid creatorId);
/// <summary>
/// Create a new empty template (for manual configuration).
/// </summary>
Task<ProjectTemplate> CreateEmptyTemplateAsync(
string name,
string description,
Guid creatorId);
/// <summary>
/// Update template metadata (name, description, visibility).
/// Does NOT update stage/question configuration after creation.
/// </summary>
Task UpdateTemplateMetadataAsync(
Guid templateId,
string? name,
string? description,
TemplateVisibility? visibility);
/// <summary>
/// Delete a user-created template. System templates cannot be deleted.
/// </summary>
Task DeleteTemplateAsync(Guid templateId, Guid requesterId);
// === Template Retrieval ===
/// <summary>
/// Get templates available to a user:
/// - System templates (always visible)
/// - Public templates
/// - User's own private templates
/// </summary>
Task<IEnumerable<ProjectTemplate>> GetAvailableTemplatesAsync(Guid? investigatorId);
/// <summary>
/// Get a specific template by ID.
/// </summary>
Task<ProjectTemplate?> GetTemplateByIdAsync(Guid templateId);
/// <summary>
/// Get system template by name (for seeding).
/// </summary>
Task<ProjectTemplate?> GetSystemTemplateAsync(string templateName);
// === Project Creation ===
/// <summary>
/// Create a new project from a template.
/// Uses IProjectManagementService internally - same code path as manual creation.
/// </summary>
Task<Project> CreateProjectFromTemplateAsync(
Guid templateId,
string projectName,
Guid ownerId,
string? protocolUrl = null,
bool? isPublic = null); // Overrides template defaults
}
Implementation Details:
public class ProjectTemplateService : IProjectTemplateService
{
private readonly IProjectTemplateRepository _templateRepository;
private readonly IProjectManagementService _projectService;
private readonly IPMUnitOfWork _unitOfWork;
public async Task<Project> CreateProjectFromTemplateAsync(
Guid templateId,
string projectName,
Guid ownerId,
string? protocolUrl = null,
bool? isPublic = null)
{
var template = await _templateRepository.GetAsync(templateId)
?? throw new EntityNotFoundException("Template", templateId);
// 1. Create base project using EXISTING service (same code path!)
var project = await _projectService.CreateProject(
name: projectName,
creatorId: ownerId,
protocolUrl: protocolUrl ?? "",
isPublic: isPublic ?? false,
inclusionCriteria: "",
exclusionCriteria: "",
contactEmail: "",
notes: $"Created from template: {template.Name}",
keywords: template.DefaultKeywords,
machineLearningEnabled: false);
// 2. Apply template configuration
project.UpdateAgreementMode(template.DefaultAgreementMode, ownerId);
foreach (var stageTemplate in template.Stages.OrderBy(s => s.Order))
{
project.AddStage(stageTemplate.Name, stageTemplate.Type);
if (!stageTemplate.IsActive)
{
var stage = project.Stages.Last();
stage.Deactivate();
}
}
foreach (var questionTemplate in template.Questions)
{
project.UpsertCustomAnnotationQuestion(
questionTemplate.QuestionText,
questionTemplate.AnswerType,
questionTemplate.IsRequired,
questionTemplate.Options?.ToList());
}
// 3. Save with all configuration applied
await _unitOfWork.SaveAsync(project);
return project;
}
}
Phase 3: End User Workflow¶
Creating a Project from Template (Future UI):
┌─────────────────────────────────────────────────────┐
│ Create New Project │
├─────────────────────────────────────────────────────┤
│ │
│ ○ Start from scratch │
│ ● Use a template │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Available Templates │ │
│ │ │ │
│ │ 📋 Quick Start (System) │ │
│ │ Basic screening workflow │ │
│ │ │ │
│ │ 📋 Full Systematic Review (System) │ │
│ │ Screening + Annotation with ROB questions │ │
│ │ │ │
│ │ 📋 My Custom Template (Private) │ │
│ │ Based on "Cancer Research 2024" project │ │
│ │ │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Project Name: [________________________] │
│ │
│ [Cancel] [Create Project] │
└─────────────────────────────────────────────────────┘
Saving a Project as Template (Future UI):
Project Settings → Actions → "Save as Template"
→ Dialog: Template name, visibility (private/public)
→ Creates snapshot of current stage/question config
Phase 4: Seeder Integration¶
// In DatabaseSeeder
public async Task SeedProjectsAsync()
{
// Ensure system templates exist
await EnsureSystemTemplatesExistAsync();
// Create seed projects from templates
var quickStartTemplate = await _templateService.GetSystemTemplateAsync("QuickStart");
var quickStartProject = await _templateService.CreateProjectFromTemplateAsync(
quickStartTemplate!.Id,
SeedDataConstants.QuickStartProjectName,
SeedDataConstants.SeedBotId);
// Store runtime ID for ownership transfer
_runtimeProjectIds["QuickStart"] = quickStartProject.Id;
// Add seed studies, screenings, etc.
await SeedStudiesForProject(quickStartProject.Id, 10);
}
private async Task EnsureSystemTemplatesExistAsync()
{
var existing = await _templateService.GetSystemTemplateAsync("QuickStart");
if (existing != null) return;
var template = ProjectTemplate.CreateSystemTemplate(
"QuickStart",
"Basic screening project with single-reviewer workflow");
template.AddStage("Title/Abstract Screening", StageType.Screening, isActive: true);
await _templateRepository.SaveAsync(template);
}
Files to Create¶
| File | Purpose |
|---|---|
SyRF.ProjectManagement.Core/Model/TemplateAggregate/ProjectTemplate.cs |
Aggregate Root |
SyRF.ProjectManagement.Core/Model/TemplateAggregate/StageTemplate.cs |
Value Object |
SyRF.ProjectManagement.Core/Model/TemplateAggregate/AnnotationQuestionTemplate.cs |
Value Object |
SyRF.ProjectManagement.Core/Interfaces/IProjectTemplateService.cs |
Service Interface |
SyRF.ProjectManagement.Core/Interfaces/IProjectTemplateRepository.cs |
Repository Interface |
SyRF.ProjectManagement.Core/Services/ProjectTemplateService.cs |
Service Implementation |
SyRF.ProjectManagement.Mongo.Data/Repositories/ProjectTemplateRepository.cs |
MongoDB Repository |
Migration Path¶
- Phase 1: Create entities, repository, basic service (template storage only)
- Phase 2: Add
CreateProjectFromTemplateAsync(seeder integration) - Phase 3: Add API endpoints for template CRUD
- Phase 4: Add UI for "Create from Template" and "Save as Template"
Seeder can use Phase 2 immediately, with user-facing features added later.
7.4 Stats Consolidation - Implementation Plan¶
Note: This section is part of the improvement plan, not just documentation. Stats consolidation is a prerequisite for reliable seeding.
Problem Statement¶
Stats calculation is scattered across multiple components, making it difficult to ensure seeded data has correct statistics. This affects both production code quality and seeding reliability.
Current Scattered Locations:
| Component | Location | What It Calculates |
|---|---|---|
StageReviewService.GetFullProjectStatsAsync() |
StageReviewService.cs:35-40 | Overall project stats |
StageReviewService.GetReviewerStatsForStageAsync() |
StageReviewService.cs:75-83 | Per-reviewer stats |
StudyRepository.UpdateStudyInclusionInfoForProjectAsync() |
StudyRepository.cs:1078-1104 | Inclusion info updates |
ExtractionInfo.SessionTallies |
ExtractionInfo.cs:30-42 | Annotation session tallies |
ReviewController (stats after screening) |
ReviewController.cs:61-119 | Returns stats in response |
Why Stats Need Service-Level Recalculation (or Don't)¶
Key Principle: Stats write operations at the domain service layer are only needed when the calculation depends on config and state spanning multiple aggregates.
| Stat Type | Needs Service Recalculation? | Reason |
|---|---|---|
| Screening Stats | ✅ YES | Depends on Project config (AgreementMode, threshold) AND Study screenings |
| Inclusion Info | ✅ YES | Depends on Project config AND aggregated screening decisions |
| Annotation Stats | ❌ NO | SessionTallies is computed within Study aggregate alone |
| Outcome Data Stats | ❌ NO | Aggregation query over Studies, no cross-aggregate config needed |
Annotation Stats Analysis (ExtractionInfo.cs:30-42):
// SessionTallies is a COMPUTED PROPERTY on ExtractionInfo (part of Study)
public IEnumerable<SessionTally> SessionTallies
{
get
{
return Sessions.GroupBy(ans => ans.StageId).Select(gp =>
new SessionTally(gp.Key,
gp.Count(ses => !ses.Reconciliation),
gp.Count(ses => !ses.Reconciliation && ses.Status == SessionStatus.Completed),
gp.Any(ses => ses.Reconciliation),
gp.Any(ses => ses.Reconciliation && ses.Status == SessionStatus.Completed))
);
}
}
This calculation:
- Uses ONLY
Sessionsfrom the Study's ownExtractionInfo - Does NOT need Project config (no AgreementMode, no question count)
- Is computed on read, not stored → no recalculation needed
Contrast with Screening Stats which MUST use service layer:
- "Sufficiently screened" requires Project's
AgreementModeand threshold - "Included/Excluded" aggregates screening decisions using Project rules
- Cannot be computed from Study aggregate alone
Screening Stats Analysis (StudyStats.cs, ScreeningInfo.cs):
Why RecalculateScreeningStatsAsync IS needed:
- Cross-Aggregate Configuration Dependency:
// StudyStats.cs lines 292-295 - Stats use Project thresholds
var agreementThreshold = project.AgreementThreshold;
var minNumberScreened = agreementThreshold.NumberScreened;
var absoluteAgreementRatio = agreementThreshold.AbsoluteAgreementRatio;
- MongoDB Aggregation Uses Project Rules:
// Simplified from AllScreeningPipelineGroupStage (StudyStats.cs:274-276)
{
SufficientlyScreened: {
$sum: {
$cond: [
{$and: [
{$gte: ["$ScreeningInfo.AgreementMeasure.NumberScreened", minNumberScreened]},
{$gt: ["$ScreeningInfo.AgreementMeasure.AbsoluteAgreementRatio", absoluteAgreementRatio]}
]},
1, 0
]
}
}
}
- Cached Thresholds on ScreeningInfo (ScreeningInfo.cs:49-60):
// _trackedProjectAgreementThresholds is CACHED on each Study's ScreeningInfo
private IEnumerable<ProjectAgreementThreshold>? _trackedProjectAgreementThresholds;
// InclusionInfo depends on this cached config
public IEnumerable<TrackedInclusionInfo>? InclusionInfo
{
get => _trackedProjectAgreementThresholds?.Select(
pat => new TrackedInclusionInfo(pat, pat.GetInclusionStatus(this), ...));
}
If Project thresholds change, this cached data becomes stale and needs recalculation.
When RecalculateScreeningStatsAsync is triggered:
- After bulk screening operations
- When Project
AgreementModeor threshold changes - After seed data creation (to initialize
_trackedProjectAgreementThresholds) - Via
IUpdateStudyScreeningStatsCommandmessage bus command
Proposed Architecture (Corrected)¶
public interface IProjectStatsService
{
// === Read Operations (Query Side) ===
Task<ProjectStats> GetProjectStatsAsync(Guid projectId);
Task<StageStats> GetStageStatsAsync(Guid projectId, Guid stageId);
Task<ReviewerStats> GetReviewerStatsAsync(Guid projectId, Guid stageId, Guid reviewerId);
// === Write Operations (Command Side) ===
// ONLY for stats that span aggregates (Project config + Study data)
Task RecalculateScreeningStatsAsync(Guid projectId);
Task RecalculateInclusionInfoAsync(Guid projectId);
// NOT NEEDED for annotation/outcome - these are computed properties or simple aggregations
// Task RecalculateAnnotationStatsAsync(Guid projectId); ← REMOVED
// Task RecalculateOutcomeDataStatsAsync(Guid projectId); ← REMOVED
}
// Unified stats model
public record ProjectStats(
int TotalStudies,
ScreeningStats Screening,
AnnotationStats Annotation, // Read from SessionTallies aggregation
OutcomeDataStats OutcomeData, // Read from OutcomeData count query
DateTime LastUpdated
);
public record ScreeningStats(
int Unscreened,
int PartiallyScreened,
int SufficientlyScreened, // Requires Project AgreementMode
int Included, // Requires Project rules
int Excluded, // Requires Project rules
int Conflicts // Requires Project threshold
);
public record AnnotationStats(
int NotStarted,
int InProgress,
int Complete,
int TotalSessions
// Computed by aggregating SessionTallies across studies - no write operation needed
);
public record OutcomeDataStats(
int StudiesWithData,
int TotalDataPoints,
int ExperimentsCount
// Simple count aggregation - no write operation needed
);
Implementation Steps¶
Phase 1: Create Service Without Breaking Changes
- Create
IProjectStatsServiceinterface inSyRF.ProjectManagement.Core/Interfaces/ - Create
ProjectStatsServiceimplementation inSyRF.ProjectManagement.Core/Services/ - Initially delegate to existing methods in
StageReviewService - Add DI registration
Phase 2: Migrate Callers
- Update
ReviewControllerto callIProjectStatsService - Update
DatabaseSeederto callRecalculateAllStatsAsync()after seeding - Update any Angular API calls that expect stats in responses
Phase 3: Consolidate Implementation
- Move actual stats calculation logic from
StageReviewServicetoProjectStatsService - Keep
StageReviewServicefocused on study retrieval/assignment - Update
StudyRepository.UpdateStudyInclusionInfoForProjectAsync()to be called via stats service
Phase 4: Deprecate Old Methods
- Mark old
StageReviewServicestats methods as[Obsolete] - Eventually remove after all callers migrated
Files to Create/Modify¶
| File | Action | Purpose |
|---|---|---|
SyRF.ProjectManagement.Core/Interfaces/IProjectStatsService.cs |
Create | Service interface |
SyRF.ProjectManagement.Core/Services/ProjectStatsService.cs |
Create | Implementation |
SyRF.ProjectManagement.Core/Model/ValueObjects/ProjectStats.cs |
Create | Stats model |
SyRF.ProjectManagement.Core/Services/StageReviewService.cs |
Modify | Delegate to new service |
SyRF.API.Endpoint/Controllers/ReviewController.cs |
Modify | Use new service |
SyRF.ProjectManagement.Endpoint/Seeding/DatabaseSeeder.cs |
Modify | Call RecalculateAllStatsAsync() |
Seeder Integration¶
// After seeding all data for a project
await _projectStatsService.RecalculateAllStatsAsync(projectId);
_logger.LogDebug("Recalculated all stats for project {ProjectId}", projectId);
// This single call replaces multiple scattered calls:
// - UpdateStudyInclusionInfoForProjectAsync()
// - Manual stats aggregation
// - Annotation tally updates
Priority Assessment¶
| Priority | Rationale |
|---|---|
| Medium-High | Needed for reliable seeding, but can work around by calling existing methods |
| Technical Debt | Reduces code duplication and improves maintainability |
| Risk | Low - new service, no breaking changes initially |
Recommendation: Implement Phase 1-2 before major seeding work. Phase 3-4 can be done as ongoing refactoring.
Part 8: Annotation Seeding - Validation Architecture¶
Production Code Path¶
HTTP PUT /api/projects/{projectId}/stages/{stageId}/studies/{studyId}/session/{sessionId}
↓
ReviewController.SubmitSession() [ReviewController.cs:61-119]
↓
SessionSubmissionWebDto → SessionSubmissionDto (AutoMapper)
↓
study.AddSessionData() [Study.cs:116-127]
↓
ExtractionInfo.AddSessionData() [ExtractionInfo.cs:50-64]
├─ AddAnnotations() - Tree-shakes old annotations, adds new
└─ AddOutcomeData() - Tree-shakes old outcome data, adds new
↓
pmUnitOfWork.SaveAsync(study)
↓
DispatchEvents() - Publishes domain events
Current Seeder Proposal (INCORRECT)¶
// This bypasses the ReviewController and service validation
var session = new AnnotationSession(study.Id, investigatorId, stageId);
foreach (var question in questions) {
session.AddAnswer(question.Id, GenerateSampleAnswer(question));
}
study.AddAnnotationSession(session); // Direct domain model manipulation
Validation Architecture Analysis¶
Question: Where should validation live - domain, application service, or frontend?
Current State: Mixed Validation¶
| Validation Type | Current Location | Problem |
|---|---|---|
| Required fields | Frontend (Angular forms) | Bypassable by seeder |
| Question answer types | Domain model (Annotation) | Good - shared |
| Stage membership | Controller (ReviewController) | Should be in service |
| Study belongs to project | Controller query | Should be in service |
Best Practice: Domain-Driven Validation Layers¶
┌─────────────────────────────────────────────────────────────┐
│ 1. DOMAIN VALIDATION (Innermost - Invariants) │
│ • Entity state must be valid at all times │
│ • Throws exceptions for invalid state transitions │
│ • Example: Annotation answer must match question type │
└─────────────────────────────────────────────────────────────┘
↑
┌─────────────────────────────────────────────────────────────┐
│ 2. APPLICATION SERVICE VALIDATION (Business Rules) │
│ • Cross-aggregate validation │
│ • Authorization checks │
│ • Example: User must be project member to annotate │
└─────────────────────────────────────────────────────────────┘
↑
┌─────────────────────────────────────────────────────────────┐
│ 3. FRONTEND VALIDATION (UX - Optional for Backend) │
│ • Early feedback for users │
│ • NOT security - backend must revalidate │
│ • Example: Show "required" error before submit │
└─────────────────────────────────────────────────────────────┘
Proposed: Application Service for Annotation Submission¶
// New file: SyRF.ProjectManagement.Core/Interfaces/IAnnotationService.cs
public interface IAnnotationService
{
/// <summary>
/// Submit an annotation session for a study.
/// Validates: user membership, stage access, question completeness.
/// </summary>
Task<AnnotationSessionResult> SubmitSessionAsync(
Guid projectId,
Guid stageId,
Guid studyId,
Guid annotatorId,
SessionSubmissionDto session);
/// <summary>
/// Validate session data without saving (for preview/draft).
/// Returns validation errors instead of throwing.
/// </summary>
Task<ValidationResult> ValidateSessionAsync(
Guid projectId,
Guid stageId,
Guid studyId,
SessionSubmissionDto session);
}
// Implementation
public class AnnotationService : IAnnotationService
{
public async Task<AnnotationSessionResult> SubmitSessionAsync(
Guid projectId, Guid stageId, Guid studyId, Guid annotatorId,
SessionSubmissionDto session)
{
// 1. Load aggregates
var project = await _projectRepository.GetAsync(projectId)
?? throw new EntityNotFoundException("Project", projectId);
var study = await _studyRepository.GetAsync(studyId)
?? throw new EntityNotFoundException("Study", studyId);
// 2. APPLICATION VALIDATION (business rules)
ValidateMembership(project, annotatorId); // User must be project member
ValidateStageAccess(project, stageId); // Stage must be active
ValidateStudyBelongsToProject(study, projectId); // Study in this project
ValidateNotAlreadyAnnotated(study, annotatorId, stageId); // No duplicate
// 3. DOMAIN VALIDATION (via domain method)
var stage = project.Stages.Single(s => s.Id == stageId);
study.AddSessionData(
annotatorId,
session,
project.GetStageQuestionIds(stageId),
stage.Extraction); // Throws if invalid
// 4. Persist
await _unitOfWork.SaveAsync(study);
return new AnnotationSessionResult(
Success: true,
SessionId: session.Id,
StudyId: studyId);
}
private void ValidateMembership(Project project, Guid userId)
{
if (!project.Memberships.Any(m => m.InvestigatorId == userId))
throw new UnauthorizedAccessException("User is not a project member");
}
private void ValidateStageAccess(Project project, Guid stageId)
{
var stage = project.Stages.SingleOrDefault(s => s.Id == stageId)
?? throw new EntityNotFoundException("Stage", stageId);
if (!stage.IsActive)
throw new InvalidOperationException("Stage is not active");
}
}
Seeder Uses Same Service¶
// Seeder injects IAnnotationService
public class DatabaseSeeder
{
private readonly IAnnotationService _annotationService;
public async Task SeedAnnotationsAsync(Guid projectId, Guid stageId, IEnumerable<Study> studies)
{
foreach (var study in studies)
{
var session = GenerateSessionForStudy(study, stageId);
// Uses SAME validation as production!
await _annotationService.SubmitSessionAsync(
projectId,
stageId,
study.Id,
SeedDataConstants.SeedBotId,
session);
}
}
}
What Exactly Changes? (Detailed Migration)¶
Answer: Logic and functionality remain IDENTICAL. Only the code location changes.
Current Architecture (Controller-Based)¶
ReviewController.SubmitSessionAsync()
├─ HTTP deserialization (ASP.NET Core)
├─ Authorization check (inline in controller)
├─ Load project (inline)
├─ Validate membership (inline)
├─ Validate stage (inline)
├─ Call study.AddSessionData() (domain method)
└─ Save via UnitOfWork (inline)
Proposed Architecture (Service-Based)¶
ReviewController.SubmitSessionAsync()
├─ HTTP deserialization (ASP.NET Core)
└─ Call IAnnotationService.SubmitSessionAsync() ← SINGLE CHANGE
├─ Authorization check (moved here)
├─ Load project (moved here)
├─ Validate membership (moved here)
├─ Validate stage (moved here)
├─ Call study.AddSessionData() (unchanged)
└─ Save via UnitOfWork (moved here)
Specific Code Migration¶
| Current Location | New Location | Logic Changes? |
|---|---|---|
| ReviewController.cs:61-119 | AnnotationService.SubmitSessionAsync() |
NO - extracted as-is |
| Membership check (controller inline) | ValidateMembership() private method |
NO - same check |
| Stage access check (controller inline) | ValidateStageAccess() private method |
NO - same check |
study.AddSessionData() call |
Same call in service | NO - identical |
_unitOfWork.SaveAsync() call |
Same call in service | NO - identical |
Controller BEFORE (Simplified)¶
// ReviewController.cs - Current
[HttpPost("annotation")]
public async Task<IActionResult> SubmitSessionAsync(SessionSubmissionWebDto dto)
{
var project = await _projectRepo.GetAsync(dto.ProjectId);
if (!project.Memberships.Any(m => m.InvestigatorId == UserId))
return Forbid();
var stage = project.Stages.SingleOrDefault(s => s.Id == dto.StageId);
if (stage == null || !stage.IsActive)
return BadRequest("Invalid stage");
var study = await _studyRepo.GetAsync(dto.StudyId);
study.AddSessionData(UserId, dto.ToDto(), project.GetStageQuestionIds(dto.StageId), stage.Extraction);
await _unitOfWork.SaveAsync(study);
return Ok(new { SessionId = dto.Id });
}
Controller AFTER (Simplified)¶
// ReviewController.cs - After refactor
[HttpPost("annotation")]
public async Task<IActionResult> SubmitSessionAsync(SessionSubmissionWebDto dto)
{
var result = await _annotationService.SubmitSessionAsync(
dto.ProjectId,
dto.StageId,
dto.StudyId,
UserId,
dto.ToDto());
return Ok(new { SessionId = result.SessionId });
}
Functionality Changes: NONE
- Same validation rules
- Same domain method calls
- Same persistence
- Same error conditions
What's Different: Code is now in a testable, reusable service that the seeder can also call.
Files Changed¶
| File | Change Type | Description |
|---|---|---|
IAnnotationService.cs |
NEW | Interface in Core/Interfaces |
AnnotationService.cs |
NEW | Implementation in Core/Services |
ReviewController.cs |
MODIFY | Delegate to service (less code) |
Startup.cs / DI |
MODIFY | Register new service |
Benefits of Service-Layer Validation¶
| Benefit | Explanation |
|---|---|
| Single Source of Truth | Validation logic in one place, not duplicated in controller and seeder |
| Testable | Can unit test validation without HTTP context |
| Seeder Parity | Seeder uses exact same code path as production |
| Future-Proof | New validation rules automatically apply to seeding |
| Separation of Concerns | Controller handles HTTP, service handles business logic |
Frontend Validation Consideration¶
The Angular frontend should:
- Pre-validate for UX (immediate feedback)
- NOT be trusted - backend always re-validates
- Mirror service rules where practical (DRY is nice but security is mandatory)
// Angular component - UX validation only
validateSession(session: SessionSubmission): ValidationError[] {
const errors: ValidationError[] = [];
if (!session.annotations?.length) {
errors.push({ field: 'annotations', message: 'At least one answer required' });
}
// ... more UX checks
return errors;
}
The backend IAnnotationService.ValidateSessionAsync() method can be exposed via API for frontend draft validation if needed.
Part 9: Seeding Configuration as Source of Truth¶
Problem: Duplicated Knowledge¶
Currently, seeding logic and tests both contain hardcoded values:
// In DatabaseSeeder.cs
var quickStartStudyCount = 10;
CreateStudies(project, quickStartStudyCount);
// In Tests
quickStartStats.TotalStudies.Should().Be(10); // Duplicated magic number!
If seeding changes to 15 studies, tests break. This coupling is fragile.
Solution: Centralized Seeding Configuration¶
Create a configuration file that defines seed data structure. Both seeder and tests read from this single source of truth.
Configuration Model¶
// SyRF.ProjectManagement.Core/Seeding/SeedingConfig.cs
public record SeedingConfig
{
public required IReadOnlyList<SeedProjectConfig> Projects { get; init; }
public required IReadOnlyList<SeedInvestigatorConfig> Investigators { get; init; }
}
public record SeedProjectConfig
{
public required string Key { get; init; } // "QuickStart", "DualScreening"
public required string Name { get; init; }
public required AgreementMode AgreementMode { get; init; }
public required int StudyCount { get; init; }
public required IReadOnlyList<SeedStageConfig> Stages { get; init; }
public required SeedScreeningConfig? Screening { get; init; }
public required SeedAnnotationConfig? Annotation { get; init; }
public required SeedOutcomeDataConfig? OutcomeData { get; init; }
public required IReadOnlyList<string> MemberKeys { get; init; } // ["Owner", "Alpha", "Beta"]
}
public record SeedStageConfig(
string Name,
StageType Type,
bool IsActive
);
public record SeedScreeningConfig(
int ScreenedCount,
int PartialCount,
int IncludedCount,
int ExcludedCount
);
public record SeedAnnotationConfig(
int QuestionCount,
int AnnotatedStudyCount,
int CompletedSessionCount
);
public record SeedOutcomeDataConfig(
int StudiesWithData,
int DataPointsPerStudy
);
public record SeedInvestigatorConfig(
string Key,
string FirstName,
string LastName,
string Email,
bool IsSystemAccount
);
Configuration File (YAML)¶
# src/libs/project-management/SyRF.ProjectManagement.Core/Seeding/seed-config.yaml
investigators:
- key: SeedBot
firstName: SyRF
lastName: Seed Bot
email: seedbot@syrf.org.uk
isSystemAccount: true
- key: Alpha
firstName: Alice
lastName: Alpha
email: alice.alpha@example.com
isSystemAccount: false
- key: Beta
firstName: Bob
lastName: Beta
email: bob.beta@example.com
isSystemAccount: false
projects:
- key: QuickStart
name: Quick Start Demo
agreementMode: SingleReviewerDecides
studyCount: 10
stages:
- name: Title/Abstract Screening
type: Screening
isActive: true
- name: Data Extraction
type: Annotation
isActive: false
screening: null # No screening done
annotation: null
outcomeData: null
memberKeys: [SeedBot]
- key: DualScreening
name: Dual Screening Example
agreementMode: AutomatedDualScreening
studyCount: 30
stages:
- name: Title/Abstract Screening
type: Screening
isActive: true
screening:
screenedCount: 15
partialCount: 10
includedCount: 8
excludedCount: 7
annotation: null
outcomeData: null
memberKeys: [SeedBot, Alpha, Beta]
- key: DataExtraction
name: Data Extraction Demo
agreementMode: SingleReviewerDecides
studyCount: 12
stages:
- name: Screening
type: Screening
isActive: false
- name: Data Extraction
type: Annotation
isActive: true
screening:
screenedCount: 12
partialCount: 0
includedCount: 12
excludedCount: 0
annotation:
questionCount: 10
annotatedStudyCount: 12
completedSessionCount: 12
outcomeData:
studiesWithData: 9
dataPointsPerStudy: 3
memberKeys: [SeedBot]
Config Loader¶
public interface ISeedingConfigProvider
{
SeedingConfig GetConfig();
SeedProjectConfig GetProjectConfig(string key);
SeedInvestigatorConfig GetInvestigatorConfig(string key);
}
public class YamlSeedingConfigProvider : ISeedingConfigProvider
{
private readonly Lazy<SeedingConfig> _config;
public YamlSeedingConfigProvider()
{
_config = new Lazy<SeedingConfig>(() =>
{
var yaml = EmbeddedResource.ReadAsString("seed-config.yaml");
return YamlDeserializer.Deserialize<SeedingConfig>(yaml);
});
}
public SeedingConfig GetConfig() => _config.Value;
public SeedProjectConfig GetProjectConfig(string key) =>
GetConfig().Projects.Single(p => p.Key == key);
}
Seeder Uses Config¶
public class DatabaseSeeder
{
private readonly ISeedingConfigProvider _configProvider;
public async Task SeedAsync()
{
var config = _configProvider.GetConfig();
foreach (var projectConfig in config.Projects)
{
var project = await CreateProjectFromConfig(projectConfig);
await SeedStudiesFromConfig(project.Id, projectConfig);
if (projectConfig.Screening != null)
await SeedScreeningsFromConfig(project.Id, projectConfig.Screening);
if (projectConfig.Annotation != null)
await SeedAnnotationsFromConfig(project.Id, projectConfig.Annotation);
if (projectConfig.OutcomeData != null)
await SeedOutcomeDataFromConfig(project.Id, projectConfig.OutcomeData);
}
}
}
Tests Use Same Config¶
public class SeedDataValidationTests : IClassFixture<SeededDatabaseFixture>
{
private readonly ISeedingConfigProvider _configProvider;
[Fact]
public async Task QuickStart_HasCorrectStudyCount()
{
// Config is SOURCE OF TRUTH
var config = _configProvider.GetProjectConfig("QuickStart");
var stats = await _statsService.GetProjectStatsAsync(quickStartProjectId);
stats.TotalStudies.Should().Be(config.StudyCount); // No magic numbers!
}
[Fact]
public async Task DualScreening_HasCorrectScreeningProgress()
{
var config = _configProvider.GetProjectConfig("DualScreening");
var stats = await _statsService.GetProjectStatsAsync(dualProjectId);
stats.Screening.SufficientlyScreened.Should().Be(config.Screening!.ScreenedCount);
stats.Screening.PartiallyScreened.Should().Be(config.Screening!.PartialCount);
}
[Theory]
[MemberData(nameof(AllProjectConfigs))]
public async Task AllProjects_HaveCorrectStudyCounts(SeedProjectConfig config)
{
var projectId = await GetProjectIdByName(config.Name);
var stats = await _statsService.GetProjectStatsAsync(projectId);
stats.TotalStudies.Should().Be(config.StudyCount);
}
public static IEnumerable<object[]> AllProjectConfigs =>
_configProvider.GetConfig().Projects.Select(p => new object[] { p });
}
Benefits of Config-Driven Testing¶
| Benefit | Explanation |
|---|---|
| Single Source of Truth | Config defines both seeding AND expected test values |
| No Magic Numbers | Tests read from config, not hardcoded values |
| Flexible Testing | Theory tests can iterate all projects automatically |
| Documentation | YAML file documents what seed data exists |
| Easy Changes | Change config once, seeder and tests both update |
| Environment Specific | Can have different configs for preview vs staging |
E2E Tests Also Use Config¶
// Load config at test startup (exported as JSON or TypeScript)
import { seedConfig } from './seed-config';
describe('Seed Data Validation', () => {
const quickStart = seedConfig.projects.find(p => p.key === 'QuickStart')!;
it('Quick Start has correct study count', () => {
cy.visit('/projects/' + quickStartId);
cy.get('[data-testid="study-count"]')
.should('contain', quickStart.studyCount.toString());
});
});
Implementation Priority¶
| Phase | Work | Benefit |
|---|---|---|
| Phase 1 | Create config model + YAML file | Documents current/planned seed data |
| Phase 2 | Refactor seeder to use config | Centralized seeding logic |
| Phase 3 | Update integration tests to use config | No more magic numbers |
| Phase 4 | Export config for E2E tests | Frontend tests also use same source |
Recommendation: Start with Phase 1-2 immediately. The config file serves as documentation even before tests are updated.
Part 10: Non-Production Visual Indicator (Enhanced UX)¶
Current State¶
The web app has a Version Info Dialog (version-info-dialog.component.ts) with color-coded environment badges:
| Environment | Background | Text Color |
|---|---|---|
| Production | #e8f5e9 (green) |
#2e7d32 |
| Staging | #fff3e0 (orange) |
#bf360c |
| Preview | #e3f2fd (blue) |
#1565c0 |
| Development | #fce4ec (pink) |
#c2185b |
Gap: Database name (syrftest vs syrfdev) is NOT shown - frontend has no visibility.
Proposed: Banner-Based Warning System¶
Design Principle: Real estate usage should be similar to production app. The collapsed indicator should be minimal and non-intrusive.
User Flow¶
┌────────────────────────────────────────────────────────────────────┐
│ User visits non-production URL │
└─────────────────────────┬──────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────────────┐
│ Check sessionStorage: 'env-banner-collapsed' │
└─────────────────────────┬──────────────────────────────────────────┘
↓
┌──────────────┴──────────────┐
↓ ↓
[Not Collapsed] [Previously Collapsed]
↓ ↓
┌──────────────────────────┐ ┌────────────────────────────────────┐
│ SLIM BANNER (top bar) │ │ MINIMAL INDICATOR (corner badge) │
│ "STAGING | Click for │ │ Small colored pill/badge │
│ more info" │ │ Clicking reopens banner │
│ [Click anywhere] │ └────────────────────────────────────┘
└───────────┬──────────────┘
↓ (user clicks)
┌──────────────────────────┐
│ DETAILED DIALOG │
│ - Environment name │
│ - Database name │
│ - PR number (if preview) │
│ - Important notes │
│ [Dismiss & Collapse] │
└───────────┬──────────────┘
↓ (user dismisses)
┌────────────────────────────────────────────────────────────────────┐
│ Store 'env-banner-collapsed' = true in sessionStorage │
│ Show MINIMAL INDICATOR (same as "Previously Collapsed" state) │
└────────────────────────────────────────────────────────────────────┘
State 1: Slim Banner (Default for Non-Production)¶
A single-line banner at the top of the viewport, similar height to a small toolbar:
<!-- app.component.html -->
<div *ngIf="!isProduction && !bannerCollapsed"
class="env-banner"
[class]="environmentClass"
(click)="showEnvironmentDialog()">
<mat-icon class="env-icon">info</mat-icon>
<span class="env-text">
<strong>{{ environmentName | uppercase }}</strong>
<span class="separator">|</span>
Click for environment details
</span>
</div>
.env-banner {
height: 32px; // Slim - similar to a toolbar
display: flex;
align-items: center;
justify-content: center;
gap: 8px;
cursor: pointer;
font-size: 13px;
position: sticky;
top: 0;
z-index: 1000;
&.preview { background: #2196f3; color: white; }
&.staging { background: #ff9800; color: white; }
&.development { background: #e91e63; color: white; }
.separator { opacity: 0.7; margin: 0 4px; }
.env-icon { font-size: 18px; height: 18px; width: 18px; }
&:hover { filter: brightness(1.1); }
}
State 2: Detailed Dialog (On Banner Click)¶
Opens when user clicks the banner:
@Component({
selector: 'app-environment-dialog',
template: `
<h2 mat-dialog-title>Environment Information</h2>
<mat-dialog-content>
<div class="env-details">
<div class="detail-row">
<span class="label">Environment:</span>
<span class="badge" [class]="environmentClass">{{ environmentName }}</span>
</div>
<div class="detail-row">
<span class="label">Database:</span>
<span class="value">{{ databaseName }}</span>
</div>
<div class="detail-row" *ngIf="prNumber">
<span class="label">PR Preview:</span>
<span class="value">#{{ prNumber }}</span>
</div>
<div class="detail-row" *ngIf="dataSource?.sampledAt">
<span class="label">Data Sampled:</span>
<span class="value">{{ dataSource.sampledAt | date:'medium' }}</span>
</div>
</div>
<mat-divider></mat-divider>
<div class="notes">
<p><strong>Note:</strong> This is not production. Data may be reset at any time.</p>
</div>
</mat-dialog-content>
<mat-dialog-actions align="end">
<button mat-button (click)="dismiss(false)">Keep Banner Visible</button>
<button mat-raised-button color="primary" (click)="dismiss(true)">
Dismiss & Collapse
</button>
</mat-dialog-actions>
`
})
export class EnvironmentDialogComponent {
dismiss(collapse: boolean) {
if (collapse) {
sessionStorage.setItem('env-banner-collapsed', 'true');
}
this.dialogRef.close(collapse);
}
}
State 3: Minimal Indicator (After Collapse)¶
A small badge in the corner - minimal screen real estate (similar to production):
<!-- Collapsed state indicator -->
<div *ngIf="!isProduction && bannerCollapsed"
class="env-indicator"
[class]="environmentClass"
(click)="expandBanner()"
matTooltip="Click to show environment details">
{{ environmentAbbrev }}
</div>
.env-indicator {
position: fixed;
top: 8px;
right: 8px;
padding: 4px 10px;
border-radius: 12px;
font-size: 11px;
font-weight: 600;
text-transform: uppercase;
cursor: pointer;
z-index: 999;
box-shadow: 0 1px 3px rgba(0,0,0,0.2);
&.preview { background: #2196f3; color: white; }
&.staging { background: #ff9800; color: white; }
&.development { background: #e91e63; color: white; }
&:hover {
transform: scale(1.05);
box-shadow: 0 2px 6px rgba(0,0,0,0.3);
}
}
Abbreviations:
| Environment | Abbreviation |
|---|---|
| Development | DEV |
| Staging | STG |
| Preview | PR-123 |
| Production | (not shown) |
Component Logic¶
@Component({ ... })
export class AppComponent implements OnInit {
isProduction = false;
bannerCollapsed = false;
environmentName = '';
environmentClass = '';
environmentAbbrev = '';
ngOnInit() {
this.loadEnvironmentInfo();
this.bannerCollapsed = sessionStorage.getItem('env-banner-collapsed') === 'true';
}
async loadEnvironmentInfo() {
const info = await this.appService.getApplicationInfo();
this.isProduction = info.environment === 'Production';
this.environmentName = info.environment;
this.environmentClass = info.environment.toLowerCase();
this.environmentAbbrev = this.getAbbreviation(info);
}
getAbbreviation(info: AppInfo): string {
if (info.prNumber) return `PR-${info.prNumber}`;
switch (info.environment) {
case 'Staging': return 'STG';
case 'Development': return 'DEV';
default: return info.environment.substring(0, 3).toUpperCase();
}
}
showEnvironmentDialog() {
const dialogRef = this.dialog.open(EnvironmentDialogComponent, {
data: this.environmentInfo
});
dialogRef.afterClosed().subscribe(collapsed => {
if (collapsed) this.bannerCollapsed = true;
});
}
expandBanner() {
sessionStorage.removeItem('env-banner-collapsed');
this.bannerCollapsed = false;
}
}
Visual Comparison¶
| State | Height | Location | Screen Real Estate |
|---|---|---|---|
| Production | 0px | N/A | None (baseline) |
| Banner (expanded) | 32px | Top sticky | Small toolbar |
| Indicator (collapsed) | ~24px | Top-right corner | Badge only |
| Dialog | Modal | Center | Overlay (temporary) |
Goal: Collapsed state uses similar real estate to production (just a small corner badge).
Backend Changes¶
- Add
databaseNameto/api/application/infoendpoint - Add
prNumberfor preview environments (from environment variable) - Return environment classification
// ApplicationController.cs
[HttpGet("info")]
public IActionResult GetApplicationInfo()
{
return Ok(new
{
Version = _configuration["AppVersion"],
Environment = _configuration["ASPNETCORE_ENVIRONMENT"],
DatabaseName = GetDatabaseNameFromConnectionString(),
PrNumber = Environment.GetEnvironmentVariable("PR_NUMBER"),
IsProduction = _configuration["ASPNETCORE_ENVIRONMENT"] == "Production"
});
}
Implementation Files¶
| File | Change |
|---|---|
app.component.html |
Add banner and compact indicator |
app.component.ts |
Add state management, dialog trigger |
app.component.scss |
Add styles |
environment-warning-dialog.component.ts |
New dialog component |
ApplicationController.cs |
Add database name to info endpoint |
MongoContext.cs or config |
Extract database name |
Part 11: Production Data Sampling for Non-Production Environments¶
Context: PII Considerations for SyRF¶
SyRF is a research platform for systematic reviews where:
- User data consists primarily of researcher emails and names (professional context)
- Project data is scientific literature metadata (titles, abstracts, DOIs)
- Studies are references to published academic papers (public information)
- Annotations are structured research assessments (not personal data)
Conclusion: PII is not a major concern for this application. The primary data (study metadata from academic databases) is already public. User data is limited to professional research context.
This enables practical production sampling without complex anonymization.
Recommended Approach: Configurable Production Sampling¶
Architecture Overview¶
┌──────────────────────────────────────────────────────────────────┐
│ Production Database (syrftest) │
│ Projects: ~500 Studies: ~2M Investigators: ~3000 │
└────────────────────────────┬─────────────────────────────────────┘
│
┌──────────────┴──────────────┐
↓ ↓
┌─────────────────────┐ ┌─────────────────────────────┐
│ PR Preview │ │ Staging │
│ (Lightweight) │ │ (Near-Production) │
├─────────────────────┤ ├─────────────────────────────┤
│ • 3-5 sampled │ │ • 10-20 sampled projects │
│ projects │ │ • Full study sets │
│ • Max 500 studies │ │ • All screening data │
│ • Minimal screening │ │ • Full annotation sessions │
│ • No annotations │ │ • Performance testing │
│ • Fast creation │ │ • Weekly refresh │
│ • PR-specific │ │ │
└─────────────────────┘ └─────────────────────────────┘
Configuration System¶
YAML Configuration File¶
# cluster-gitops/syrf/environments/{env}/data-sampling.yaml
# PR Preview - Lightweight for quick testing
preview:
enabled: true
schedule: "on-demand" # Created when PR opens
source:
database: syrftest
connection: ${PROD_MONGODB_URI} # Read-only connection
projects:
strategy: curated # Use specific project IDs
count: 5
selection:
- id: "uuid-of-active-project-1" # Large active project
reason: "High activity, good for testing"
- id: "uuid-of-dual-screening-project"
reason: "Demonstrates dual screening workflow"
- id: "uuid-of-annotation-project"
reason: "Has extraction data"
limits:
maxStudiesPerProject: 500
maxScreeningsPerStudy: 10
includeAnnotations: false # Skip for speed
includeOutcomeData: false
investigators:
strategy: project-members # Only members of sampled projects
additionalIds: [] # No extras needed
# Staging - Comprehensive for integration testing
staging:
enabled: true
schedule: "0 3 * * 0" # Weekly, Sunday 3am
source:
database: syrftest
connection: ${PROD_MONGODB_URI}
projects:
strategy: mixed # Curated + random sample
count: 20
selection:
# Curated essential projects
- id: "uuid-of-flagship-project"
reason: "Largest project, performance testing"
- id: "uuid-of-complex-stages"
reason: "Multi-stage workflow testing"
randomSampleCount: 15 # Additional random projects
randomCriteria:
minStudies: 100
minMembers: 2
hasScreening: true
limits:
maxStudiesPerProject: null # No limit
maxScreeningsPerStudy: null # Full data
includeAnnotations: true
includeOutcomeData: true
investigators:
strategy: all-active # All investigators who logged in last 90 days
activityThresholdDays: 90
Implementation Components¶
1. Sampling Job (Kubernetes CronJob or Manual)¶
// SyRF.DataSampling.Job/SamplingJob.cs
public class ProductionSamplingJob
{
private readonly IMongoClient _sourceClient;
private readonly IMongoClient _targetClient;
private readonly SamplingConfig _config;
public async Task ExecuteSamplingAsync(CancellationToken ct)
{
_logger.LogInformation("Starting production sampling for {Environment}",
_config.EnvironmentName);
// 1. Determine which projects to sample
var projectIds = await GetProjectIdsToSample();
_logger.LogInformation("Sampling {Count} projects", projectIds.Count);
// 2. Sample investigators (members of selected projects)
await SampleInvestigatorsAsync(projectIds, ct);
// 3. Sample projects with their related data
foreach (var projectId in projectIds)
{
await SampleProjectAsync(projectId, ct);
}
// 4. Create sampling metadata record
await RecordSamplingMetadataAsync(projectIds, ct);
}
private async Task SampleProjectAsync(Guid projectId, CancellationToken ct)
{
var project = await _sourceDb.GetCollection<Project>("pmProject")
.Find(p => p.Id == projectId)
.FirstOrDefaultAsync(ct);
if (project == null) return;
// Insert project
await _targetDb.GetCollection<Project>("pmProject")
.InsertOneAsync(project, cancellationToken: ct);
// Sample studies with limit
var studyQuery = _sourceDb.GetCollection<Study>("pmStudy")
.Find(s => s.ProjectId == projectId);
if (_config.Projects.Limits.MaxStudiesPerProject.HasValue)
{
studyQuery = studyQuery.Limit(_config.Projects.Limits.MaxStudiesPerProject.Value);
}
var studies = await studyQuery.ToListAsync(ct);
if (studies.Any())
{
await _targetDb.GetCollection<Study>("pmStudy")
.InsertManyAsync(studies, cancellationToken: ct);
}
// Sample related collections...
await SampleSystematicSearchesAsync(projectId, ct);
if (_config.Projects.Limits.IncludeAnnotations)
{
await SampleAnnotationsAsync(projectId, studies.Select(s => s.Id), ct);
}
}
}
2. Sampling Metadata Tracking¶
// Track what was sampled and when
public record SamplingMetadata
{
public Guid Id { get; init; } = Guid.NewGuid();
public DateTime SampledAt { get; init; } = DateTime.UtcNow;
public string SourceDatabase { get; init; }
public string TargetDatabase { get; init; }
public string Environment { get; init; }
public int ProjectCount { get; init; }
public int StudyCount { get; init; }
public int InvestigatorCount { get; init; }
public List<Guid> SampledProjectIds { get; init; }
public string ConfigHash { get; init; } // Track config version
}
// Query sampling metadata
db.samplingMetadata.find().sort({ sampledAt: -1 }).limit(1)
// Shows when data was last sampled and what's included
3. GitHub Actions Integration (PR Preview)¶
# .github/workflows/pr-preview.yml (addition)
jobs:
sample-production-data:
needs: [deploy-preview]
runs-on: ubuntu-latest
if: ${{ github.event.action == 'opened' }} # Only on PR open
steps:
- name: Run Production Sampling
uses: actions/github-script@v6
with:
script: |
// Trigger sampling job via kubectl or API
const prNumber = context.issue.number;
const targetDb = `syrf_pr_${prNumber}`;
// Call sampling API or kubectl exec
await exec.exec('kubectl', [
'create', 'job',
`--from=cronjob/production-sampler`,
`production-sample-pr-${prNumber}`,
'-n', `pr-${prNumber}`,
'--', '--target-db', targetDb, '--config', 'preview'
]);
# Or simpler: just use seed data for PRs
seed-preview-data:
needs: [deploy-preview]
runs-on: ubuntu-latest
steps:
- name: Run Seeder
run: |
kubectl exec -n pr-${{ github.event.number }} \
deployment/syrf-project-management \
-- dotnet SyRF.Seeder.dll --config preview
Sampling Strategies Explained¶
| Strategy | Use Case | How It Works |
|---|---|---|
| curated | PR Preview | Explicitly listed project IDs, maximum control |
| random | Staging | Random sample meeting criteria (min studies, etc.) |
| mixed | Staging | Curated essentials + random sample for variety |
| all-active | Staging (investigators) | All users active in last N days |
| project-members | PR Preview | Only investigators who are members of sampled projects |
Environment-Specific Recommendations¶
PR Preview Environments¶
Goal: Fast, lightweight, focused on feature testing
| Setting | Value | Rationale |
|---|---|---|
| Projects | 3-5 curated | Known good test projects |
| Max Studies | 500 per project | Fast query performance |
| Screening | Include (limited) | Test screening workflows |
| Annotations | Exclude | Speed, not needed for most PRs |
| Refresh | On PR open | Fresh data each time |
Trigger: Automatic when PR opens, or manual via workflow dispatch
Staging Environment¶
Goal: Comprehensive, near-production testing
| Setting | Value | Rationale |
|---|---|---|
| Projects | 15-20 mixed | Representative variety |
| Max Studies | Unlimited | Full performance testing |
| Screening | Full | Complete workflow testing |
| Annotations | Include | E2E extraction testing |
| Refresh | Weekly | Fresh but stable for testing |
Trigger: Scheduled CronJob (Sunday 3am)
PR-Description Driven Configuration¶
Goal: Developers can easily switch between seeding, sampling, or combo by adding a config block to their PR description - no workflow file changes needed.
PR Description Format¶
Add a YAML block to the PR description to configure data strategy:
## Description
This PR adds new screening workflow improvements...
## Data Strategy
<!-- syrf-data-config -->
```yaml
data:
strategy: seed-only # Options: seed-only, sample-only, hybrid
sampleProjects: 3 # If sampling, how many projects
includeProdData: false # Include production sample?
#### Supported Strategies
| Strategy | Description | Use Case |
|----------|-------------|----------|
| `seed-only` | Only run seeder (default) | Most PRs - fast, predictable |
| `sample-only` | Only sample production | Performance testing |
| `hybrid` | Seed first, then sample | Integration testing |
| `none` | Empty database | Testing fresh install |
#### Workflow Integration
```yaml
# .github/workflows/pr-preview.yml
jobs:
parse-data-config:
runs-on: ubuntu-latest
outputs:
strategy: ${{ steps.parse.outputs.strategy }}
sample_projects: ${{ steps.parse.outputs.sample_projects }}
steps:
- name: Parse PR Description
id: parse
uses: actions/github-script@v7
with:
script: |
const body = context.payload.pull_request?.body || '';
// Extract YAML from <!-- syrf-data-config --> block
const match = body.match(/<!--\s*syrf-data-config\s*-->\s*```yaml\s*([\s\S]*?)```\s*<!--\s*\/syrf-data-config\s*-->/);
let config = { strategy: 'seed-only', sampleProjects: 0 };
if (match) {
const yaml = require('js-yaml');
const parsed = yaml.load(match[1]);
config = { ...config, ...parsed.data };
}
core.setOutput('strategy', config.strategy);
core.setOutput('sample_projects', config.sampleProjects || 0);
seed-data:
needs: [deploy-preview, parse-data-config]
if: needs.parse-data-config.outputs.strategy != 'sample-only' && needs.parse-data-config.outputs.strategy != 'none'
runs-on: ubuntu-latest
steps:
- name: Run Seeder
run: |
kubectl exec -n pr-${{ github.event.number }} \
deployment/syrf-project-management \
-- dotnet SyRF.Seeder.dll --config preview
sample-production:
needs: [deploy-preview, parse-data-config, seed-data]
if: |
needs.parse-data-config.outputs.strategy == 'sample-only' ||
needs.parse-data-config.outputs.strategy == 'hybrid'
runs-on: ubuntu-latest
steps:
- name: Sample Production Data
run: |
kubectl create job \
--from=cronjob/production-sampler \
sample-pr-${{ github.event.number }} \
-n pr-${{ github.event.number }} \
-- --count=${{ needs.parse-data-config.outputs.sample_projects }}
Quick Reference for Developers¶
| Scenario | PR Description Config |
|---|---|
| Default (most PRs) | No config block needed → uses seed-only |
| Integration testing | strategy: hybrid + sampleProjects: 2 |
| Performance testing | strategy: sample-only + sampleProjects: 10 |
| Empty database | strategy: none |
Full Config Options¶
data:
strategy: hybrid # seed-only | sample-only | hybrid | none
sampleProjects: 5 # Number of projects to sample (if sampling)
sampleStrategy: curated # curated | random | mixed
maxStudiesPerProject: 500 # Limit studies per project (faster)
includeAnnotations: false # Include annotation data (slower)
seedConfig: preview # Which seed config to use: preview | minimal | full
Alternative: Label-Based Configuration¶
For simpler use cases, use PR labels instead of YAML config:
| Label | Equivalent Strategy |
|---|---|
data:seed-only |
Default seeding |
data:hybrid |
Seed + sample |
data:sample-only |
Production sample only |
data:none |
Empty database |
data:sample-large |
Sample with 10+ projects |
# Workflow checks for labels
- name: Check Data Strategy Label
id: labels
run: |
if [[ "${{ contains(github.event.pull_request.labels.*.name, 'data:hybrid') }}" == "true" ]]; then
echo "strategy=hybrid" >> $GITHUB_OUTPUT
elif [[ "${{ contains(github.event.pull_request.labels.*.name, 'data:sample-only') }}" == "true" ]]; then
echo "strategy=sample-only" >> $GITHUB_OUTPUT
else
echo "strategy=seed-only" >> $GITHUB_OUTPUT
fi
Recommendation: Use labels for quick toggling, PR description config for fine-grained control.
Data Freshness Indicators¶
Add sampling metadata to the environment info endpoint:
[HttpGet("info")]
public IActionResult GetApplicationInfo()
{
var samplingInfo = await _db.GetCollection<SamplingMetadata>("samplingMetadata")
.Find(_ => true)
.SortByDescending(s => s.SampledAt)
.FirstOrDefaultAsync();
return Ok(new
{
Version = _configuration["AppVersion"],
Environment = _configuration["ASPNETCORE_ENVIRONMENT"],
DatabaseName = GetDatabaseName(),
DataSource = samplingInfo != null ? new
{
SampledAt = samplingInfo.SampledAt,
SourceDatabase = samplingInfo.SourceDatabase,
ProjectCount = samplingInfo.ProjectCount,
StudyCount = samplingInfo.StudyCount
} : null
});
}
Frontend Display (in environment warning dialog):
<div class="detail-row" *ngIf="dataSource">
<span class="label">Data Sampled:</span>
<span class="value">
{{ dataSource.sampledAt | date:'medium' }}
({{ dataSource.projectCount }} projects, {{ dataSource.studyCount }} studies)
</span>
</div>
Implementation Priority¶
| Phase | Work | Environment | Effort |
|---|---|---|---|
| 1 | Enhanced seed data (Part 4 matrix) | All | Medium |
| 2 | Sampling config YAML schema | All | Low |
| 3 | PR preview sampling job | Preview | Medium |
| 4 | Staging weekly sampling CronJob | Staging | Medium |
| 5 | Sampling metadata tracking | All | Low |
| 6 | Environment info API enhancement | All | Low |
Recommendation: Start with Phase 1-2. Seed data alone may be sufficient for most testing. Add production sampling (Phase 3-4) only if specific production data scenarios are needed for testing.
Part 12: Concrete Next Steps Plan¶
Immediate Actions (This Sprint)¶
| # | Task | Files to Change | Effort |
|---|---|---|---|
| 1 | Refactor Project constructor chain | Project.cs | Low |
Add private constructor with ID param, add CreateWithId factory |
|||
| 2 | Apply same pattern to Study | Study.cs | Low |
| Factory method for deterministic study IDs | |||
| 3 | Apply same pattern to Investigator | Investigator.cs | Low |
| Factory method for deterministic investigator IDs | |||
| 4 | Update DatabaseSeeder to use factory methods | DatabaseSeeder.cs | Medium |
| Replace reflection-based ID setting with factory calls |
Short-Term Actions (Next 2 Sprints)¶
| # | Task | Files to Change | Effort |
|---|---|---|---|
| 5 | Create ISeedDataOwnershipTransfer interface | New file: Core/Seeding/ISeedDataOwnershipTransfer.cs |
Low |
| 6 | Implement comprehensive seed project matrix | DatabaseSeeder.cs, SeedDataConstants.cs | Medium |
| Per Part 4 coverage matrix - all AgreementModes, stage states | |||
| 7 | Add screening data seeding | DatabaseSeeder.cs | Medium |
Seed screenings using study.AddScreening() path |
|||
| 8 | Call RecalculateScreeningStatsAsync after seeding | DatabaseSeeder.cs | Low |
Initialize _trackedProjectAgreementThresholds on Studies |
Medium-Term Actions (Future Sprints)¶
| # | Task | Effort | Dependency |
|---|---|---|---|
| 9 | Create IProjectStatsService | Medium | None |
| 10 | Add environment warning banner (Part 10) | Medium | None |
| 11 | Production sampling job (Part 11) | Medium | None |
| 12 | PR description config parsing | Low | Part 11 |
Implementation Order Diagram¶
Sprint N (Foundation):
┌─────────────────────────────────────────────────────────────┐
│ 1. Refactor constructors (Project, Study, Investigator) │
│ ↓ │
│ 4. Update DatabaseSeeder to use factory methods │
└─────────────────────────────────────────────────────────────┘
Sprint N+1 (Coverage):
┌─────────────────────────────────────────────────────────────┐
│ 5. ISeedDataOwnershipTransfer interface │
│ 6. Full project matrix (all AgreementModes) │
│ 7. Screening data seeding │
│ 8. Stats recalculation │
└─────────────────────────────────────────────────────────────┘
Sprint N+2+ (Polish):
┌─────────────────────────────────────────────────────────────┐
│ 9. IProjectStatsService consolidation │
│ 10. Environment warning banner │
│ 11. Production sampling (if needed) │
└─────────────────────────────────────────────────────────────┘
Definition of Done¶
Each task should meet these criteria before being marked complete:
- Code compiles without warnings
- Unit tests added/updated for new code
- Existing tests pass
- Seeded data visible in preview environment
- Documentation updated (this file or CLAUDE.md)
Risks and Mitigations¶
| Risk | Mitigation |
|---|---|
| Constructor refactoring breaks MongoDB deserialization | Keep parameterless public constructor, test deserialization |
| Factory methods duplicate initialization logic | Private constructor with all params, factory delegates to it |
| Stats calculation fails after seeding | Call RecalculateScreeningStatsAsync explicitly after seeding |
| Seed data differs from production data | Use same domain methods as production (AddScreening, etc.) |
Appendix A: AgreementMode Reference¶
| Mode | Reviewers Needed | Agreement Required | Description |
|---|---|---|---|
| SingleReviewerDecides | 1 | N/A | One reviewer makes final decision |
| AutomatedDualScreening | 2 | 100% | Both reviewers must agree |
| ThreeReviewersOneReconciles | 2 of 3 | 67% | Two of three must agree, one reconciles |
| ThreeReviewersTwoReconcile | 3 | 67% | All three review, two must agree |
| Custom | varies | varies | User-defined threshold |
Appendix B: Related Documentation¶
- MongoDB Testing Strategy - Database isolation for preview environments
- PR Preview Environments - Using preview environments
- CLAUDE.md - Project overview and database architecture
Appendix C: API Endpoint Summary¶
Key endpoints that seeding should mirror:
| Endpoint | Purpose | Service Method |
|---|---|---|
| POST /api/projects | Create project | IProjectManagementService.CreateProject() |
| POST /api/application/signin | Create/update investigator | IApplicationService.EnsureInvestigatorCreatedAndUpdatedAsync() |
| POST /api/projects/{id}/stages | Add stage | project.AddStage() |
| PATCH /api/projects/{id}/stages/{stageId} | Update stage | project.UpdateStage() |
| POST /api/projects/{id}/questions | Add annotation question | project.UpsertCustomAnnotationQuestion() |
| POST /api/reviews/screening | Submit screening | study.AddScreening() |
| POST /api/projects/{id}/studies/inclusionInfo | Calculate inclusion | IProjectManagementService.UpdateStudyInclusionInfoForProjectAsync() |