Quantitative Seed Data & Annotation Relationship Validation

Overview

Two pieces of work in PR #2126:

  1. AnnotationRelationshipValidator — Backend validation of annotation graph integrity at submission time
  2. Seed data enhancement — Complete Review project gets rich quantitative outcome data; all seed projects use AddSessionData() for uniform validation

Motivation

  • The Complete Review seed project has annotations but no OutcomeData, so the quantitative export produces empty results
  • The backend does not validate annotation relationship integrity — the frontend enforces constraints through its UI, but the backend accepts any structure it is given
  • Invalid annotation structures only surface during data export, as MISSING_* placeholders in the CSV output

Approach: Validation First, Then Seed Data

Validation is implemented first so the seeder exercises it — the seeder becomes a de facto integration test of the validator.


Part 1: AnnotationRelationshipValidator

Location

SyRF.ProjectManagement.Core/Services/Validation/AnnotationRelationshipValidator.cs

Input

  • Existing annotations on the study (ExtractionInfo.Annotations)
  • Incoming annotations from the session
  • Incoming outcome data from the session (if any)
  • The project's annotation questions (for hierarchy and conditional display validation)
  • Whether the stage has data extraction enabled
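
Taken together, these inputs suggest an entry point along the following lines. This is a sketch only — the interface name and parameter names are illustrative, not confirmed by the PR:

```csharp
// Hypothetical shape of the validator's entry point; names are illustrative.
public interface IAnnotationRelationshipValidator
{
    IReadOnlyList<AnnotationValidationError> Validate(
        IReadOnlyCollection<Annotation> existingAnnotations,    // ExtractionInfo.Annotations
        IReadOnlyCollection<Annotation> incomingAnnotations,    // from the session
        IReadOnlyCollection<OutcomeData> incomingOutcomeData,   // may be empty
        IReadOnlyCollection<AnnotationQuestion> stageQuestions, // hierarchy + conditional display
        bool includesDataExtraction,
        bool collectAll = false);
}
```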

Error Model

AnnotationValidationError (value object):

| Field | Description |
| --- | --- |
| Tier | Which tier failed: TreeIntegrity, LookupLink, OutcomeDataFK |
| Rule | Specific rule (e.g., ParentIdNotFound, ConditionalNotSatisfied, WrongQuestionType) |
| OffendingEntityId | The annotation ID or OutcomeData ID that's invalid |
| InvalidReferenceId | The ID that doesn't resolve or is the wrong type |
| Expected | What was expected (e.g., "annotation with QuestionId == DiseaseModelInductionLabelGuid") |
| Actual | What was found (e.g., "annotation with QuestionId == TreatmentLabelGuid" or "not found") |

Return type: IReadOnlyList<AnnotationValidationError>

The validator is a pure function that returns errors. The caller decides whether to throw:

  • collectAll: false — stops at the first error, returns a list containing that single error
  • collectAll: true — validates everything, returns the complete list of all errors

Exception raising is the caller's responsibility, not the validator's.
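
The error model and the caller-side policy can be sketched as follows. The record fields follow the table above; the enum name and throwing code are assumptions for illustration:

```csharp
// Illustrative value object; field names follow the table above.
public sealed record AnnotationValidationError(
    ValidationTier Tier,     // TreeIntegrity, LookupLink, OutcomeDataFK (assumed enum)
    string Rule,             // e.g. "ParentIdNotFound"
    Guid OffendingEntityId,
    Guid? InvalidReferenceId,
    string Expected,
    string Actual);

// The caller, not the validator, decides the error-handling policy:
var errors = validator.Validate(existing, incoming, outcomeData, questions,
                                includesDataExtraction, collectAll: true);
if (errors.Count > 0)
    throw new InvalidOperationException(string.Join("; ", errors));
```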

Validation Tiers

Tier 1: OutcomeData Foreign Keys

Only validated when the stage has data extraction enabled.

  • OutcomeData.ExperimentId → must reference an annotation with QuestionId == ExperimentLabelGuid
  • OutcomeData.CohortId → must reference an annotation with QuestionId == CohortLabelQuestionGuid
  • OutcomeData.OutcomeId → must reference an annotation with QuestionId == OutcomeAssessmentLabelGuid
  • All three must belong to the same study
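
One of these FK checks might look like the sketch below, assuming an `annotationsById` dictionary built over (existing ∪ incoming) and a `SystemQuestions` constants class — both names are assumptions:

```csharp
// Sketch of the ExperimentId check; the CohortId and OutcomeId checks are analogous.
Annotation? exp = annotationsById.GetValueOrDefault(outcome.ExperimentId);
if (exp is null || exp.QuestionId != SystemQuestions.ExperimentLabelGuid)
{
    errors.Add(new AnnotationValidationError(
        Tier: ValidationTier.OutcomeDataFK,
        Rule: exp is null ? "ExperimentIdNotFound" : "WrongQuestionType",
        OffendingEntityId: outcome.Id,
        InvalidReferenceId: outcome.ExperimentId,
        Expected: "annotation with QuestionId == ExperimentLabelGuid",
        Actual: exp is null ? "not found" : $"annotation with QuestionId == {exp.QuestionId}"));
    if (!collectAll) return errors; // fail-fast mode
}
```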

Tier 2: Annotation Tree Integrity

  • Every ParentId (if set) → must reference an annotation in (existing ∪ incoming)
  • Every ID in Children[] → must reference an annotation in (existing ∪ incoming)
  • No self-references (annotation can't be its own parent/child)
  • Question hierarchy correspondence: if annotation A has QuestionId = Q1 and its parent annotation B has QuestionId = Q2, then Q1's parent question in the annotation question tree must be Q2
  • Conditional display satisfaction: if annotation A's AnnotationQuestion has a non-null Target.ConditionalParentAnswers:
      • BooleanConditionalTargetParentOptions: the parent annotation's BoolAnnotation.Answer must equal TargetParentBoolean
      • OptionConditionalTargetParentOptions: the parent annotation's answer must contain at least one of TargetParentOptions
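
The hierarchy-correspondence rule can be sketched as below, assuming a `questionsById` lookup over the project's annotation questions and a nullable `ParentQuestionId` on the question type (both assumptions):

```csharp
// Sketch: the parent annotation's question must be this annotation's
// question's parent in the question tree.
var question = questionsById[annotation.QuestionId];
if (annotation.ParentId is Guid parentId &&
    annotationsById.TryGetValue(parentId, out var parent) &&
    question.ParentQuestionId != parent.QuestionId)
{
    errors.Add(/* Tier: TreeIntegrity, Rule: "QuestionHierarchyMismatch", ... */);
}
```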

Tier 3: Lookup Links

For system questions that produce StringArrayAnnotation answers linking to other annotations:

| System Question | Answer GUIDs must reference annotations with QuestionId |
| --- | --- |
| CohortModelInduction | DiseaseModelInductionLabelGuid |
| CohortTreatment | TreatmentLabelGuid |
| CohortOutcome | OutcomeAssessmentLabelGuid |
| ExperimentCohort | CohortLabelQuestionGuid |
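
These lookup-link rules reduce to a static map. A sketch, with GUID constant names assumed rather than taken from the codebase:

```csharp
// Each GUID inside a StringArrayAnnotation answer must resolve to an
// annotation whose QuestionId equals the mapped target.
static readonly IReadOnlyDictionary<Guid, Guid> LookupLinkTargets =
    new Dictionary<Guid, Guid>
    {
        [SystemQuestions.CohortModelInductionGuid] = SystemQuestions.DiseaseModelInductionLabelGuid,
        [SystemQuestions.CohortTreatmentGuid]      = SystemQuestions.TreatmentLabelGuid,
        [SystemQuestions.CohortOutcomeGuid]        = SystemQuestions.OutcomeAssessmentLabelGuid,
        [SystemQuestions.ExperimentCohortGuid]     = SystemQuestions.CohortLabelQuestionGuid,
    };
```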

Integration Points

The validator is a Domain Service — a pure function called by orchestrating code, not embedded inside the Study aggregate. This keeps the Study aggregate focused on state transitions and avoids coupling it to the Project aggregate's question hierarchy.

Called from ReviewSubmissionService.AddSessionData() — between the membership guard and study.AddSessionData(). The service derives stage data (question IDs, extraction flag) from the Project aggregate internally using the stageId parameter. Callers no longer pre-compute stage-specific values.

Signature change to IReviewSubmissionService.AddSessionData():

// Before: caller pre-computes stage data, no validation control
void AddSessionData(Project project, Study study, Guid investigatorId,
    SessionSubmissionDto sessionSubmission, IEnumerable<Guid> stageQuestionIds,
    bool includesDataExtraction);

// After: service derives stage data from project, caller controls validation mode
void AddSessionData(Project project, Study study, Guid stageId, Guid investigatorId,
    SessionSubmissionDto sessionSubmission, bool collectAll = false);

Design rationale (DDD principles):

  • Keep Project/Study aggregates, not IDs — domain services accept domain objects; if it accepted IDs it would need repository dependencies, breaking domain layer purity and making it an application service
  • Replace stageQuestionIds + includesDataExtraction with stageId — the service looks up the stage from project.GetStageOrDefault(stageId) and derives both values itself; this keeps domain knowledge (which parts of the Project matter) inside the domain service, not leaked into the controller
  • Add bool collectAll = false — caller controls fail-fast vs full diagnostics without the service dictating error-handling policy
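
Under the new signature, the two call styles look like this (a sketch of caller code, not taken from the PR):

```csharp
// API path: fail fast with the default collectAll: false.
reviewSubmissionService.AddSessionData(
    project, study, stageId, investigatorId, sessionSubmission);

// Seeder / import path: gather full diagnostics before throwing.
reviewSubmissionService.AddSessionData(
    project, study, stageId, investigatorId, sessionSubmission,
    collectAll: true);
```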

Three known consumers:

| Consumer | Call site | collectAll | Behaviour on errors |
| --- | --- | --- | --- |
| Session submission (API) | ReviewSubmissionService.AddSessionData() | false (default) | Throws InvalidOperationException immediately |
| Database seeder | ReviewSubmissionService.AddSessionData() | true | Full diagnostics before throwing |
| Future JSON annotation import | ReviewSubmissionService.AddSessionData() | true | Reports all issues back to the user |

Why not inside ExtractionInfo? ExtractionInfo.AddAnnotations() manages annotation state (tree-shaking, session tracking, outcome data). It has no knowledge of AnnotationQuestion hierarchy, conditional display rules, or lookup link semantics — those are Project aggregate concepts. Threading them in would violate aggregate boundaries for validation that is naturally a cross-aggregate concern, same as the membership check that already lives in ReviewSubmissionService.

Open Question: Stage Setting Changes

When stage settings change (question assignments, extraction enabled/disabled), existing data may become invalid. This is out of scope for this PR but noted as a known gap for future work. Options include:

  • Warning-only re-validation on settings change
  • Blocking settings changes that would invalidate existing data


Part 2: Seed Data Enhancement

Uniform Code Path

The only seed project that submits annotations is Complete Review — it switches from AddAnnotations() to AddSessionData():

| Project | Current | New |
| --- | --- | --- |
| Quick Start Demo | No annotations | No annotations (unchanged) |
| Screening In Progress | Screening only | Screening only (unchanged) |
| Ready for Annotation | Screening only | Screening only (unchanged) |
| Complete Review | AddAnnotations() | AddSessionData() with full hierarchy + outcome data |
| Private Research | No annotations | No annotations (unchanged) |

The validator runs on every AddSessionData() call, regardless of whether outcome data is present.

Complete Review: Annotation Hierarchy

Each study gets the full system annotation tree per reviewer (Alpha and Beta). A builder method constructs the hierarchy from config objects:

Experiment Label
├── ExperimentCohort: [cohortIds...]
├── Custom experiment questions (Sample Size)
├── Cohort Label ("Control group")
│   ├── CohortModelInduction: [diseaseModelIds...]
│   ├── CohortTreatment: [treatmentIds...]
│   ├── CohortOutcome: [outcomeIds...]
│   ├── NumberOfAnimals
│   ├── DiseaseModel Label → custom questions
│   ├── Treatment Label → custom questions
│   └── Outcome Label(s) → custom questions
└── Cohort Label ("Treatment group")
    └── (same structure, different values)

Complete Review: OutcomeData

For each study, OutcomeData entries are created from the Cartesian product of cohorts × outcomes:

  • 2-3 cohorts per study (varies by study index)
  • 1-2 outcomes per study (varies by study index)
  • 2-4 TimePoints per OutcomeData (time, average, error)
  • Treatment groups show better outcomes than control
  • Later timepoints show progression
  • Values vary per study via small offsets from base values
  • AverageType: mostly "Mean", some "Median"
  • ErrorType: mostly "SD", some "SEM"
  • Units: tied to outcome type ("mm³", "points", "seconds", etc.)

Total: ~240-360 OutcomeData entries across 15 studies × 2 reviewers
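
The Cartesian-product generation described above amounts to a nested loop. A minimal sketch, where `BuildOutcomeData` is a hypothetical helper that applies the per-study offsets and timepoint patterns:

```csharp
// One OutcomeData entry per (cohort, outcome) pair; studyIndex drives
// the deterministic variation in values.
foreach (var cohort in cohortConfigs)
    foreach (var outcome in outcomeConfigs)
        outcomeDataList.Add(BuildOutcomeData(cohort, outcome, studyIndex));
```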

Builder Design

The seeder uses a builder method, not inline definitions:

CreateCompleteSessionSubmission(
    study, project, stageId, investigatorId, studyIndex,
    species, diseaseModel,
    cohortConfigs: [...],    // CohortConfig(label, treatment, numberOfAnimals)
    outcomeConfigs: [...]    // OutcomeConfig(label, units, greaterIsWorse, timepoints)
)

Variation comes from lookup tables indexed by studyIndex % n.
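
A sketch of that indexing pattern, with the table contents assumed for illustration (the real tables live in the seeder):

```csharp
// Deterministic variation: no randomness, so the seeder is reproducible.
static readonly string[] AverageTypes = new[] { "Mean", "Mean", "Mean", "Median" };
static readonly string[] ErrorTypes   = new[] { "SD", "SD", "SEM" };

string averageType = AverageTypes[studyIndex % AverageTypes.Length];
string errorType   = ErrorTypes[studyIndex % ErrorTypes.Length];
```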

Seeder Validation

  • ValidateSeedData() asserts expected OutcomeData count on Complete Review studies
  • The validator running inside AddSessionData() guarantees structural integrity
  • If seed data is invalid, the seeder throws during execution

Testing

AnnotationRelationshipValidatorTests (new)

Tier 1:

  • OutcomeData with non-existent ExperimentId → error returned
  • OutcomeData with ExperimentId pointing to the wrong question type → error returned
  • Same for CohortId and OutcomeId
  • Valid OutcomeData → no errors

Tier 2:

  • ParentId referencing a non-existent annotation → error
  • Children[] containing a non-existent ID → error
  • Self-referencing annotation → error
  • Question hierarchy mismatch (annotation's question parent ≠ parent annotation's question) → error
  • Conditional display not satisfied (parent answer doesn't match the condition) → error
  • Conditional display null (unconditional) → passes regardless
  • Cross-batch: incoming annotation references an existing annotation → passes

Tier 3:

  • CohortModelInduction answer referencing a treatment label → error
  • ExperimentCohort answer referencing a non-existent annotation → error
  • Valid lookup links → no errors

Mode behaviour:

  • collectAll: false — returns a single error on the first failure
  • collectAll: true — returns all errors
  • collectAll: true with no errors → empty list
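
One of the mode-behaviour tests might look like the following xUnit sketch (fixture setup and names are assumptions, not taken from the test file):

```csharp
// Given input with two broken references, fail-fast mode should report
// exactly one error.
[Fact]
public void CollectAll_False_StopsAtFirstError()
{
    var errors = validator.Validate(existing, incomingWithTwoBrokenRefs,
        Array.Empty<OutcomeData>(), questions,
        includesDataExtraction: false, collectAll: false);

    Assert.Single(errors);
}
```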

DatabaseSeederTests (updates)

  • Existing tests continue to pass
  • New: Complete Review studies have expected OutcomeData count
  • Seeder exercises validator via AddSessionData() — invalid seed data causes test failure

Deferred / Future Work

Annotation Content Validation (not in scope)

The current validator checks structural relationships (FK references, tree integrity, conditional display, lookup links) but does not validate that annotation answers match their question's expected type or content. This is currently enforced only by the frontend.

Examples of what a future "content validation" tier would cover:

  • Type matching: Does the annotation subclass (BoolAnnotation, StringArrayAnnotation, etc.) match the question's ControlType and expected data type?
  • Option validity: For dropdown/checkbox questions, are the selected values present in the question's Options list?
  • Numeric constraints: For decimal/int questions, are values within expected ranges?
  • Reconciliation scoping: Annotations in a reconciliation session may belong to multiple annotators (by design), while non-reconciliation sessions should contain only the submitting annotator's annotations. The existing ExtractionInfo.AddAnnotations() already enforces this at line 80 (reconciliation || an.AnnotatorId == annotatorId), so this is covered at the data layer but not by the validator.

This would be a separate validation tier with its own test matrix and should be designed as a follow-up feature.