Quantitative Seed Data & Annotation Relationship Validation

Overview

Two pieces of work in PR #2126:

  1. AnnotationRelationshipValidator — Backend validation of annotation graph integrity at submission time
  2. Seed data enhancement — Complete Review project gets rich quantitative outcome data; all seed projects use AddSessionData() for uniform validation

Motivation

  • The Complete Review seed project has annotations but no OutcomeData, so the quantitative export produces empty results
  • The backend does not validate annotation relationship integrity — the frontend enforces constraints through its UI, but the backend accepts any structure it is given
  • Invalid annotation structures only surface during data export, as MISSING_* placeholders in the CSV output

Approach: Validation First, Then Seed Data

Validation is implemented first so the seeder exercises it — the seeder becomes a de facto integration test of the validator.


Part 1: AnnotationRelationshipValidator

Location

SyRF.ProjectManagement.Core/Services/Validation/AnnotationRelationshipValidator.cs

Input

  • Existing annotations on the study (ExtractionInfo.Annotations)
  • Incoming annotations from the session
  • Incoming outcome data from the session (if any)
  • The project's annotation questions (for hierarchy and conditional display validation)
  • Whether the stage has data extraction enabled
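
Taken together, these inputs suggest an entry point along the following lines. This is a sketch only — the interface name and parameter names are illustrative, not confirmed by the PR:

```csharp
// Hypothetical shape of the validator's entry point; names are illustrative.
public interface IAnnotationRelationshipValidator
{
    IReadOnlyList<AnnotationValidationError> Validate(
        IReadOnlyCollection<Annotation> existingAnnotations,    // ExtractionInfo.Annotations
        IReadOnlyCollection<Annotation> incomingAnnotations,    // from the session
        IReadOnlyCollection<OutcomeData> incomingOutcomeData,   // may be empty
        IReadOnlyCollection<AnnotationQuestion> stageQuestions, // hierarchy + conditional display
        bool includesDataExtraction,
        bool collectAll = false);
}
```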

Error Model

AnnotationValidationError (value object):

| Field | Description |
| --- | --- |
| Tier | Which tier failed: TreeIntegrity, LookupLink, OutcomeDataFK |
| Rule | Specific rule (e.g., ParentIdNotFound, ConditionalNotSatisfied, WrongQuestionType) |
| OffendingEntityId | The annotation ID or OutcomeData ID that's invalid |
| InvalidReferenceId | The ID that doesn't resolve or is the wrong type |
| Expected | What was expected (e.g., "annotation with QuestionId == DiseaseModelInductionLabelGuid") |
| Actual | What was found (e.g., "annotation with QuestionId == TreatmentLabelGuid" or "not found") |

Return type: IReadOnlyList<AnnotationValidationError>

The validator is a pure function that returns errors. The caller decides whether to throw:

  • collectAll: false — stops at the first error, returns a list containing that single error
  • collectAll: true — validates everything, returns the complete list of all errors

Exception raising is the caller's responsibility, not the validator's.
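
The error model and the caller-side policy can be sketched as follows. The record fields follow the table above; the enum name and throwing code are assumptions for illustration:

```csharp
// Illustrative value object; field names follow the table above.
public sealed record AnnotationValidationError(
    ValidationTier Tier,     // TreeIntegrity, LookupLink, OutcomeDataFK (assumed enum)
    string Rule,             // e.g. "ParentIdNotFound"
    Guid OffendingEntityId,
    Guid? InvalidReferenceId,
    string Expected,
    string Actual);

// The caller, not the validator, decides the error-handling policy:
var errors = validator.Validate(existing, incoming, outcomeData, questions,
                                includesDataExtraction, collectAll: true);
if (errors.Count > 0)
    throw new InvalidOperationException(string.Join("; ", errors));
```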

Validation Tiers

Tier 1: OutcomeData Foreign Keys

Only validated when the stage has data extraction enabled.

  • OutcomeData.ExperimentId → must reference an annotation with QuestionId == ExperimentLabelGuid
  • OutcomeData.CohortId → must reference an annotation with QuestionId == CohortLabelQuestionGuid
  • OutcomeData.OutcomeId → must reference an annotation with QuestionId == OutcomeAssessmentLabelGuid
  • All three must belong to the same study
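
One of these FK checks might look like the sketch below, assuming an `annotationsById` dictionary built over (existing ∪ incoming) and a `SystemQuestions` constants class — both names are assumptions:

```csharp
// Sketch of the ExperimentId check; the CohortId and OutcomeId checks are analogous.
Annotation? exp = annotationsById.GetValueOrDefault(outcome.ExperimentId);
if (exp is null || exp.QuestionId != SystemQuestions.ExperimentLabelGuid)
{
    errors.Add(new AnnotationValidationError(
        Tier: ValidationTier.OutcomeDataFK,
        Rule: exp is null ? "ExperimentIdNotFound" : "WrongQuestionType",
        OffendingEntityId: outcome.Id,
        InvalidReferenceId: outcome.ExperimentId,
        Expected: "annotation with QuestionId == ExperimentLabelGuid",
        Actual: exp is null ? "not found" : $"annotation with QuestionId == {exp.QuestionId}"));
    if (!collectAll) return errors; // fail-fast mode
}
```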

Tier 2: Annotation Tree Integrity

  • Every ParentId (if set) → must reference an annotation in (existing ∪ incoming)
  • Every ID in Children[] → must reference an annotation in (existing ∪ incoming)
  • No self-references (annotation can't be its own parent/child)
  • Question hierarchy correspondence: if annotation A has QuestionId = Q1 and its parent annotation B has QuestionId = Q2, then Q1's parent question in the annotation question tree must be Q2
  • Conditional display satisfaction: if annotation A's AnnotationQuestion has a non-null Target.ConditionalParentAnswers:
      • BooleanConditionalTargetParentOptions: the parent annotation's BoolAnnotation.Answer must equal TargetParentBoolean
      • OptionConditionalTargetParentOptions: the parent annotation's answer must contain at least one of TargetParentOptions
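
The hierarchy-correspondence rule can be sketched as below, assuming a `questionsById` lookup over the project's annotation questions and a nullable `ParentQuestionId` on the question type (both assumptions):

```csharp
// Sketch: the parent annotation's question must be this annotation's
// question's parent in the question tree.
var question = questionsById[annotation.QuestionId];
if (annotation.ParentId is Guid parentId &&
    annotationsById.TryGetValue(parentId, out var parent) &&
    question.ParentQuestionId != parent.QuestionId)
{
    errors.Add(/* Tier: TreeIntegrity, Rule: "QuestionHierarchyMismatch", ... */);
}
```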

Tier 3: Lookup Links

For system questions that produce StringArrayAnnotation answers linking to other annotations:

| System Question | Answer GUIDs must reference annotations with QuestionId |
| --- | --- |
| CohortModelInduction | DiseaseModelInductionLabelGuid |
| CohortTreatment | TreatmentLabelGuid |
| CohortOutcome | OutcomeAssessmentLabelGuid |
| ExperimentCohort | CohortLabelQuestionGuid |
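
These lookup-link rules reduce to a static map. A sketch, with GUID constant names assumed rather than taken from the codebase:

```csharp
// Each GUID inside a StringArrayAnnotation answer must resolve to an
// annotation whose QuestionId equals the mapped target.
static readonly IReadOnlyDictionary<Guid, Guid> LookupLinkTargets =
    new Dictionary<Guid, Guid>
    {
        [SystemQuestions.CohortModelInductionGuid] = SystemQuestions.DiseaseModelInductionLabelGuid,
        [SystemQuestions.CohortTreatmentGuid]      = SystemQuestions.TreatmentLabelGuid,
        [SystemQuestions.CohortOutcomeGuid]        = SystemQuestions.OutcomeAssessmentLabelGuid,
        [SystemQuestions.ExperimentCohortGuid]     = SystemQuestions.CohortLabelQuestionGuid,
    };
```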

Integration Points

The validator is a Domain Service — a pure function called by orchestrating code, not embedded inside the Study aggregate. This keeps the Study aggregate focused on state transitions and avoids coupling it to the Project aggregate's question hierarchy.

Called from ReviewSubmissionService.AddSessionData() — between the membership guard and study.AddSessionData(). The service derives stage data (question IDs, extraction flag) from the Project aggregate internally using the stageId parameter. Callers no longer pre-compute stage-specific values.

Signature change to IReviewSubmissionService.AddSessionData():

// Before: caller pre-computes stage data, no validation control
void AddSessionData(Project project, Study study, Guid investigatorId,
    SessionSubmissionDto sessionSubmission, IEnumerable<Guid> stageQuestionIds,
    bool includesDataExtraction);

// After: service derives stage data from project, caller controls validation mode
void AddSessionData(Project project, Study study, Guid stageId, Guid investigatorId,
    SessionSubmissionDto sessionSubmission, bool collectAll = false);

Design rationale (DDD principles):

  • Keep Project/Study aggregates, not IDs — domain services accept domain objects; if it accepted IDs it would need repository dependencies, breaking domain layer purity and making it an application service
  • Replace stageQuestionIds + includesDataExtraction with stageId — the service looks up the stage from project.GetStageOrDefault(stageId) and derives both values itself; this keeps domain knowledge (which parts of the Project matter) inside the domain service, not leaked into the controller
  • Add bool collectAll = false — caller controls fail-fast vs full diagnostics without the service dictating error-handling policy
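
Under the new signature, the two call styles look like this (a sketch of caller code, not taken from the PR):

```csharp
// API path: fail fast with the default collectAll: false.
reviewSubmissionService.AddSessionData(
    project, study, stageId, investigatorId, sessionSubmission);

// Seeder / import path: gather full diagnostics before throwing.
reviewSubmissionService.AddSessionData(
    project, study, stageId, investigatorId, sessionSubmission,
    collectAll: true);
```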

Three known consumers:

| Consumer | Call site | collectAll | Behaviour on errors |
| --- | --- | --- | --- |
| Session submission (API) | ReviewSubmissionService.AddSessionData() | false (default) | Throws InvalidOperationException immediately |
| Database seeder | ReviewSubmissionService.AddSessionData() | true | Full diagnostics before throwing |
| Future JSON annotation import | ReviewSubmissionService.AddSessionData() | true | Reports all issues back to the user |

Why not inside ExtractionInfo? ExtractionInfo.AddAnnotations() manages annotation state (tree-shaking, session tracking, outcome data). It has no knowledge of AnnotationQuestion hierarchy, conditional display rules, or lookup link semantics — those are Project aggregate concepts. Threading them in would violate aggregate boundaries for validation that is naturally a cross-aggregate concern, same as the membership check that already lives in ReviewSubmissionService.

Open Question: Stage Setting Changes

When stage settings change (question assignments, extraction enabled/disabled), existing data may become invalid. This is out of scope for this PR but noted as a known gap for future work. Options include:

  • Warning-only re-validation on settings change
  • Blocking settings changes that would invalidate existing data


Part 2: Seed Data Enhancement

Uniform Code Path

The only seed project that submits annotations is Complete Review — it switches from AddAnnotations() to AddSessionData():

| Project | Current | New |
| --- | --- | --- |
| Quick Start Demo | No annotations | No annotations (unchanged) |
| Screening In Progress | Screening only | Screening only (unchanged) |
| Ready for Annotation | Screening only | Screening only (unchanged) |
| Complete Review | AddAnnotations() | AddSessionData() with full hierarchy + outcome data |
| Private Research | No annotations | No annotations (unchanged) |

The validator runs on every AddSessionData() call, regardless of whether outcome data is present.

Complete Review: Annotation Hierarchy

Each study gets the full system annotation tree per reviewer (Alpha and Beta). A builder method constructs the hierarchy from config objects:

Experiment Label
├── ExperimentCohort: [cohortIds...]
├── Custom experiment questions (Sample Size)
├── Cohort Label ("Control group")
│   ├── CohortModelInduction: [diseaseModelIds...]
│   ├── CohortTreatment: [treatmentIds...]
│   ├── CohortOutcome: [outcomeIds...]
│   ├── NumberOfAnimals
│   ├── DiseaseModel Label → custom questions
│   ├── Treatment Label → custom questions
│   └── Outcome Label(s) → custom questions
└── Cohort Label ("Treatment group")
    └── (same structure, different values)

Complete Review: OutcomeData

For each study, OutcomeData entries are created from the Cartesian product of cohorts × outcomes:

  • 2-3 cohorts per study (varies by study index)
  • 1-2 outcomes per study (varies by study index)
  • 2-4 TimePoints per OutcomeData (time, average, error)
  • Treatment groups show better outcomes than control
  • Later timepoints show progression
  • Values vary per study via small offsets from base values
  • AverageType: mostly "Mean", some "Median"
  • ErrorType: mostly "SD", some "SEM"
  • Units: tied to outcome type ("mm³", "points", "seconds", etc.)

Total: ~240-360 OutcomeData entries across 15 studies × 2 reviewers
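
The Cartesian-product generation described above amounts to a nested loop. A minimal sketch, where `BuildOutcomeData` is a hypothetical helper that applies the per-study offsets and timepoint patterns:

```csharp
// One OutcomeData entry per (cohort, outcome) pair; studyIndex drives
// the deterministic variation in values.
foreach (var cohort in cohortConfigs)
    foreach (var outcome in outcomeConfigs)
        outcomeDataList.Add(BuildOutcomeData(cohort, outcome, studyIndex));
```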

Builder Design

The seeder uses a builder method, not inline definitions:

CreateCompleteSessionSubmission(
    study, project, stageId, investigatorId, studyIndex,
    species, diseaseModel,
    cohortConfigs: [...],    // CohortConfig(label, treatment, numberOfAnimals)
    outcomeConfigs: [...]    // OutcomeConfig(label, units, greaterIsWorse, timepoints)
)

Variation comes from lookup tables indexed by studyIndex % n.
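
A sketch of that indexing pattern, with the table contents assumed for illustration (the real tables live in the seeder):

```csharp
// Deterministic variation: no randomness, so the seeder is reproducible.
static readonly string[] AverageTypes = new[] { "Mean", "Mean", "Mean", "Median" };
static readonly string[] ErrorTypes   = new[] { "SD", "SD", "SEM" };

string averageType = AverageTypes[studyIndex % AverageTypes.Length];
string errorType   = ErrorTypes[studyIndex % ErrorTypes.Length];
```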

Seeder Validation

  • ValidateSeedData() asserts expected OutcomeData count on Complete Review studies
  • The validator running inside AddSessionData() guarantees structural integrity
  • If seed data is invalid, the seeder throws during execution

Testing

AnnotationRelationshipValidatorTests (new)

Tier 1:

  • OutcomeData with non-existent ExperimentId → error returned
  • OutcomeData with ExperimentId pointing to the wrong question type → error returned
  • Same for CohortId and OutcomeId
  • Valid OutcomeData → no errors

Tier 2:

  • ParentId referencing a non-existent annotation → error
  • Children[] containing a non-existent ID → error
  • Self-referencing annotation → error
  • Question hierarchy mismatch (annotation's question parent ≠ parent annotation's question) → error
  • Conditional display not satisfied (parent answer doesn't match the condition) → error
  • Conditional display null (unconditional) → passes regardless
  • Cross-batch: incoming annotation references an existing annotation → passes

Tier 3:

  • CohortModelInduction answer referencing a treatment label → error
  • ExperimentCohort answer referencing a non-existent annotation → error
  • Valid lookup links → no errors

Mode behaviour:

  • collectAll: false — returns a single error on the first failure
  • collectAll: true — returns all errors
  • collectAll: true with no errors → empty list
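
One of the mode-behaviour tests might look like the following xUnit sketch (fixture setup and names are assumptions, not taken from the test file):

```csharp
// Given input with two broken references, fail-fast mode should report
// exactly one error.
[Fact]
public void CollectAll_False_StopsAtFirstError()
{
    var errors = validator.Validate(existing, incomingWithTwoBrokenRefs,
        Array.Empty<OutcomeData>(), questions,
        includesDataExtraction: false, collectAll: false);

    Assert.Single(errors);
}
```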

DatabaseSeederTests (updates)

  • Existing tests continue to pass
  • New: Complete Review studies have expected OutcomeData count
  • Seeder exercises validator via AddSessionData() — invalid seed data causes test failure

Deferred / Future Work

Annotation Content Validation (not in scope)

The current validator checks structural relationships (FK references, tree integrity, conditional display, lookup links) but does not validate that annotation answers match their question's expected type or content. This is currently enforced only by the frontend.

Examples of what a future "content validation" tier would cover:

  • Type matching: Does the annotation subclass (BoolAnnotation, StringArrayAnnotation, etc.) match the question's ControlType and expected data type?
  • Option validity: For dropdown/checkbox questions, are the selected values present in the question's Options list?
  • Numeric constraints: For decimal/int questions, are values within expected ranges?
  • Reconciliation scoping: Annotations in a reconciliation session may belong to multiple annotators (by design), while non-reconciliation sessions should contain only the submitting annotator's annotations. The existing ExtractionInfo.AddAnnotations() already enforces this at line 80 (reconciliation || an.AnnotatorId == annotatorId), so this is covered at the data layer but not by the validator.

This would be a separate validation tier with its own test matrix and should be designed as a follow-up feature.