Quantitative Seed Data & Annotation Relationship Validation¶
Overview¶
Two pieces of work in PR #2126:
- **AnnotationRelationshipValidator** — backend validation of annotation graph integrity at submission time
- **Seed data enhancement** — the Complete Review project gets rich quantitative outcome data; all seed projects use `AddSessionData()` for uniform validation
Motivation¶
- The Complete Review seed project has annotations but no `OutcomeData`, so the quantitative export produces empty results
- The backend has no validation of annotation relationship integrity — the frontend enforces constraints through the UI, but the backend blindly accepts anything
- Invalid annotation structures only surface during data export, as `MISSING_*` placeholders in CSV output
Approach: Validation First, Then Seed Data¶
Validation is implemented first so the seeder exercises it — the seeder becomes a de facto integration test of the validator.
Part 1: AnnotationRelationshipValidator¶
Location¶
`SyRF.ProjectManagement.Core/Services/Validation/AnnotationRelationshipValidator.cs`
Input¶
- Existing annotations on the study (`ExtractionInfo.Annotations`)
- Incoming annotations from the session
- Incoming outcome data from the session (if any)
- The project's annotation questions (for hierarchy and conditional display validation)
- Whether the stage has data extraction enabled
Error Model¶
`AnnotationValidationError` (value object):

| Field | Description |
|---|---|
| `Tier` | Which tier failed: `TreeIntegrity`, `LookupLink`, `OutcomeDataFK` |
| `Rule` | Specific rule (e.g., `ParentIdNotFound`, `ConditionalNotSatisfied`, `WrongQuestionType`) |
| `OffendingEntityId` | The annotation ID or `OutcomeData` ID that is invalid |
| `InvalidReferenceId` | The ID that doesn't resolve or is of the wrong type |
| `Expected` | What was expected (e.g., "annotation with `QuestionId == DiseaseModelInductionLabelGuid`") |
| `Actual` | What was found (e.g., "annotation with `QuestionId == TreatmentLabelGuid`" or "not found") |
Return type: `IReadOnlyList<AnnotationValidationError>`

The validator is a pure function that returns errors. The caller decides whether to throw:

- `collectAll: false` — stops at the first error and returns a list containing that single error
- `collectAll: true` — validates everything and returns the complete list of errors

Raising exceptions is the caller's responsibility, not the validator's.
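The error model could be sketched as a C# record along these lines (the `ValidationTier` enum and exact field types are assumptions; only the field names come from the table above):

```csharp
// Sketch only — the real type lives in SyRF.ProjectManagement.Core/Services/Validation/.
public enum ValidationTier { TreeIntegrity, LookupLink, OutcomeDataFK }

public sealed record AnnotationValidationError(
    ValidationTier Tier,      // which tier failed
    string Rule,              // e.g. "ParentIdNotFound", "WrongQuestionType"
    Guid OffendingEntityId,   // the annotation or OutcomeData ID that is invalid
    Guid? InvalidReferenceId, // the ID that doesn't resolve or is the wrong type
    string Expected,          // e.g. "annotation with QuestionId == DiseaseModelInductionLabelGuid"
    string Actual);           // e.g. "not found"
```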
Validation Tiers¶
Tier 1: OutcomeData Foreign Keys¶
Only validated when the stage has data extraction enabled.
- `OutcomeData.ExperimentId` → must reference an annotation with `QuestionId == ExperimentLabelGuid`
- `OutcomeData.CohortId` → must reference an annotation with `QuestionId == CohortLabelQuestionGuid`
- `OutcomeData.OutcomeId` → must reference an annotation with `QuestionId == OutcomeAssessmentLabelGuid`
- All three must belong to the same study
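A minimal sketch of one Tier 1 check, assuming an `annotationsById` lookup and an `Annotation` type exposing `QuestionId` (these names are illustrative, not the PR's actual internals):

```csharp
// Resolve one OutcomeData FK and confirm the target is the right kind of annotation.
AnnotationValidationError? CheckOutcomeDataFk(
    Guid outcomeDataId, Guid referencedId, Guid expectedQuestionId,
    IReadOnlyDictionary<Guid, Annotation> annotationsById)
{
    if (!annotationsById.TryGetValue(referencedId, out var target))
        return new AnnotationValidationError(
            ValidationTier.OutcomeDataFK, "ReferenceNotFound",
            outcomeDataId, referencedId,
            Expected: $"annotation with QuestionId == {expectedQuestionId}",
            Actual: "not found");

    if (target.QuestionId != expectedQuestionId)
        return new AnnotationValidationError(
            ValidationTier.OutcomeDataFK, "WrongQuestionType",
            outcomeDataId, referencedId,
            Expected: $"annotation with QuestionId == {expectedQuestionId}",
            Actual: $"annotation with QuestionId == {target.QuestionId}");

    return null; // FK is valid
}
```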
Tier 2: Annotation Tree Integrity¶
- Every `ParentId` (if set) must reference an annotation in (existing ∪ incoming)
- Every ID in `Children[]` must reference an annotation in (existing ∪ incoming)
- No self-references (an annotation can't be its own parent or child)
- Question hierarchy correspondence: if annotation A has `QuestionId = Q1` and its parent annotation B has `QuestionId = Q2`, then Q1's parent question in the annotation question tree must be Q2
- Conditional display satisfaction: if annotation A's `AnnotationQuestion` has a non-null `Target.ConditionalParentAnswers`:
  - `BooleanConditionalTargetParentOptions`: the parent annotation's `BoolAnnotation.Answer == TargetParentBooleanOption`
  - `ConditionalTargetParentOptions`: the parent annotation's answer contains at least one of `TargetParentOptions`
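The basic tree checks could look roughly like this (the `Annotation` shape and rule-name strings are assumptions; the ID set is built from existing plus incoming annotations as described above):

```csharp
// Tier 2 sketch: parent/child references must resolve within (existing ∪ incoming),
// and an annotation may never reference itself.
IEnumerable<string> CheckTreeIntegrity(Annotation a, ISet<Guid> allAnnotationIds)
{
    if (a.ParentId is Guid parentId)
    {
        if (parentId == a.Id) yield return "SelfReference";
        else if (!allAnnotationIds.Contains(parentId)) yield return "ParentIdNotFound";
    }

    foreach (var childId in a.Children)
    {
        if (childId == a.Id) yield return "SelfReference";
        else if (!allAnnotationIds.Contains(childId)) yield return "ChildIdNotFound";
    }
}
```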
Tier 3: Lookup Link Validity¶
For system questions that produce StringArrayAnnotation answers linking to other annotations:
| System question | Answer GUIDs must reference annotations with `QuestionId` |
|---|---|
| `CohortModelInduction` | `DiseaseModelInductionLabelGuid` |
| `CohortTreatment` | `TreatmentLabelGuid` |
| `CohortOutcome` | `OutcomeAssessmentLabelGuid` |
| `ExperimentCohort` | `CohortLabelQuestionGuid` |
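The table maps naturally onto a lookup the validator can consult; a sketch, assuming `*Guid` constants exist for the four system questions (the dictionary itself is illustrative, not the PR's code):

```csharp
// System question → the QuestionId that every answer GUID must resolve to.
static readonly IReadOnlyDictionary<Guid, Guid> LookupLinkTargets =
    new Dictionary<Guid, Guid>
    {
        [CohortModelInductionGuid] = DiseaseModelInductionLabelGuid,
        [CohortTreatmentGuid]      = TreatmentLabelGuid,
        [CohortOutcomeGuid]        = OutcomeAssessmentLabelGuid,
        [ExperimentCohortGuid]     = CohortLabelQuestionGuid,
    };
```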
Integration Points¶
The validator is a Domain Service — a pure function called by orchestrating code, not embedded inside the Study aggregate. This keeps the Study aggregate focused on state transitions and avoids coupling it to the Project aggregate's question hierarchy.
Called from `ReviewSubmissionService.AddSessionData()` — between the membership guard and `study.AddSessionData()`. The service derives stage data (question IDs, extraction flag) from the Project aggregate internally using the `stageId` parameter. Callers no longer pre-compute stage-specific values.
Signature change to IReviewSubmissionService.AddSessionData():
```csharp
// Before: caller pre-computes stage data, no validation control
void AddSessionData(Project project, Study study, Guid investigatorId,
    SessionSubmissionDto sessionSubmission, IEnumerable<Guid> stageQuestionIds,
    bool includesDataExtraction);

// After: service derives stage data from project, caller controls validation mode
void AddSessionData(Project project, Study study, Guid stageId, Guid investigatorId,
    SessionSubmissionDto sessionSubmission, bool collectAll = false);
```
Design rationale (DDD principles):
- Keep `Project`/`Study` aggregates, not IDs — domain services accept domain objects; if the service accepted IDs it would need repository dependencies, breaking domain-layer purity and turning it into an application service
- Replace `stageQuestionIds` + `includesDataExtraction` with `stageId` — the service looks up the stage via `project.GetStageOrDefault(stageId)` and derives both values itself; this keeps domain knowledge (which parts of the Project matter) inside the domain service rather than leaking into the controller
- Add `bool collectAll = false` — the caller controls fail-fast vs. full diagnostics without the service dictating error-handling policy
Three known consumers:
| Consumer | Call site | `collectAll` | Behaviour on errors |
|---|---|---|---|
| Session submission (API) | `ReviewSubmissionService.AddSessionData()` | `false` (default) | Throws `InvalidOperationException` immediately |
| Database seeder | `ReviewSubmissionService.AddSessionData()` | `true` | Full diagnostics before throwing |
| Future JSON annotation import | `ReviewSubmissionService.AddSessionData()` | `true` | Reports all issues back to the user |
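Hypothetical caller snippets for the two modes (the surrounding field and variable names are illustrative; only `AddSessionData()` and its parameters come from the PR):

```csharp
// API path: fail fast — collectAll defaults to false, so the service throws
// InvalidOperationException on the first invalid relationship.
_reviewSubmissionService.AddSessionData(
    project, study, stageId, investigatorId, sessionSubmission);

// Seeder / import path: validate everything first, so the caller gets full
// diagnostics instead of stopping at the first error.
_reviewSubmissionService.AddSessionData(
    project, study, stageId, investigatorId, sessionSubmission, collectAll: true);
```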
**Why not inside `ExtractionInfo`?** `ExtractionInfo.AddAnnotations()` manages annotation state (tree-shaking, session tracking, outcome data). It has no knowledge of `AnnotationQuestion` hierarchy, conditional display rules, or lookup link semantics — those are Project aggregate concepts. Threading them in would violate aggregate boundaries for validation that is naturally a cross-aggregate concern, just like the membership check that already lives in `ReviewSubmissionService`.
Open Question: Stage Setting Changes¶
When stage settings change (question assignments, extraction enabled/disabled), existing data may become invalid. This is out of scope for this PR but noted as a known gap for future work. Options include:

- Warning-only re-validation on settings change
- Blocking settings changes that would invalidate existing data
Part 2: Seed Data Enhancement¶
Uniform Code Path¶
The only seed project that submits annotations is Complete Review — it switches from `AddAnnotations()` to `AddSessionData()`:
| Project | Current | New |
|---|---|---|
| Quick Start Demo | No annotations | No annotations (unchanged) |
| Screening In Progress | Screening only | Screening only (unchanged) |
| Ready for Annotation | Screening only | Screening only (unchanged) |
| Complete Review | `AddAnnotations()` | `AddSessionData()` with full hierarchy + outcome data |
| Private Research | No annotations | No annotations (unchanged) |

The validator runs on every `AddSessionData()` call, regardless of whether outcome data is present.
Complete Review: Annotation Hierarchy¶
Each study gets the full system annotation tree per reviewer (Alpha and Beta). A builder method constructs the hierarchy from config objects:
```
Experiment Label
├── ExperimentCohort: [cohortIds...]
├── Custom experiment questions (Sample Size)
│
├── Cohort Label ("Control group")
│   ├── CohortModelInduction: [diseaseModelIds...]
│   ├── CohortTreatment: [treatmentIds...]
│   ├── CohortOutcome: [outcomeIds...]
│   ├── NumberOfAnimals
│   ├── DiseaseModel Label → custom questions
│   ├── Treatment Label → custom questions
│   ├── Outcome Label(s) → custom questions
│
└── Cohort Label ("Treatment group")
    └── (same structure, different values)
```
Complete Review: OutcomeData¶
For each study, OutcomeData entries are created from the cartesian product of cohorts × outcomes:
- 2-3 cohorts per study (varies by study index)
- 1-2 outcomes per study (varies by study index)
- 2-4 TimePoints per OutcomeData (time, average, error)
- Treatment groups show better outcomes than control
- Later timepoints show progression
- Values vary per study via small offsets from base values
- AverageType: mostly "Mean", some "Median"
- ErrorType: mostly "SD", some "SEM"
- Units: tied to outcome type ("mm³", "points", "seconds", etc.)
Total: ~240-360 OutcomeData entries across 15 studies × 2 reviewers
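The cartesian product could be sketched as follows (assuming `CohortConfig`/`OutcomeConfig` carry the annotation IDs they produced and a `BuildTimePoints` helper; all names here are illustrative):

```csharp
// One OutcomeData entry per (cohort, outcome) pair.
var outcomeData = new List<OutcomeData>();
foreach (var cohort in cohortConfigs)
foreach (var outcome in outcomeConfigs)
{
    outcomeData.Add(new OutcomeData
    {
        // ExperimentId is set from the study's experiment annotation (omitted here).
        CohortId   = cohort.AnnotationId,
        OutcomeId  = outcome.AnnotationId,
        // 2-4 (time, average, error) points; treatment cohorts trend better
        // than control, and later timepoints show progression.
        TimePoints = BuildTimePoints(studyIndex, cohort, outcome),
    });
}
```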
Builder Design¶
The seeder uses a builder method, not inline definitions:
```csharp
CreateCompleteSessionSubmission(
    study, project, stageId, investigatorId, studyIndex,
    species, diseaseModel,
    cohortConfigs: [...],  // CohortConfig(label, treatment, numberOfAnimals)
    outcomeConfigs: [...]  // OutcomeConfig(label, units, greaterIsWorse, timepoints)
)
```
Variation comes from lookup tables indexed by studyIndex % n.
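A sketch of that variation technique (the table contents are made-up placeholders, not the seeder's actual values):

```csharp
// Lookup tables indexed by studyIndex % n keep values deterministic but varied.
static readonly string[] AverageTypes = { "Mean", "Mean", "Median" };
static readonly string[] ErrorTypes   = { "SD", "SD", "SEM" };

string averageType = AverageTypes[studyIndex % AverageTypes.Length]; // mostly "Mean"
string errorType   = ErrorTypes[studyIndex % ErrorTypes.Length];     // mostly "SD"

// Small per-study offset from the base value so studies don't look identical.
double average = baseAverage + (studyIndex % 5) * 0.1;
```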
Seeder Validation¶
- `ValidateSeedData()` asserts the expected `OutcomeData` count on Complete Review studies
- The validator running inside `AddSessionData()` guarantees structural integrity
- If seed data is invalid, the seeder throws during execution
Testing¶
AnnotationRelationshipValidatorTests (new)¶
Tier 1:

- OutcomeData with a non-existent `ExperimentId` → error returned
- OutcomeData with an `ExperimentId` pointing to the wrong question type → error returned
- Same for `CohortId` and `OutcomeId`
- Valid OutcomeData → no errors

Tier 2:

- `ParentId` referencing a non-existent annotation → error
- `Children[]` containing a non-existent ID → error
- Self-referencing annotation → error
- Question hierarchy mismatch (annotation's question parent ≠ parent annotation's question) → error
- Conditional display not satisfied (parent answer doesn't match the condition) → error
- Conditional display null (unconditional) → passes regardless
- Cross-batch: incoming annotation references an existing annotation → passes

Tier 3:

- `CohortModelInduction` answer referencing a treatment label → error
- `ExperimentCohort` answer referencing a non-existent annotation → error
- Valid lookup links → no errors
Mode behaviour:
- `collectAll: false` — returns a single error on the first failure
- `collectAll: true` — returns all errors
- `collectAll: true` with no errors → empty list
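A hypothetical xUnit-style test for the collect-all mode (the `Validate` signature and the fixture helper are assumptions for illustration, not the actual test code):

```csharp
[Fact]
public void CollectAll_True_Returns_All_Errors()
{
    // Fixture deliberately contains two broken relationships.
    var (incoming, outcomeData, questions) = BuildFixtureWithTwoErrors();

    var errors = AnnotationRelationshipValidator.Validate(
        existingAnnotations: Array.Empty<Annotation>(),
        incomingAnnotations: incoming,
        outcomeData: outcomeData,
        questions: questions,
        includesDataExtraction: true,
        collectAll: true);

    Assert.Equal(2, errors.Count); // full diagnostics, not fail-fast
}
```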
DatabaseSeederTests (updates)¶
- Existing tests continue to pass
- New: Complete Review studies have expected OutcomeData count
- The seeder exercises the validator via `AddSessionData()` — invalid seed data causes test failure
Deferred / Future Work¶
Annotation Content Validation (not in scope)¶
The current validator checks structural relationships (FK references, tree integrity, conditional display, lookup links) but does not validate that annotation answers match their question's expected type or content. This is currently enforced only by the frontend.
Examples of what a future "content validation" tier would cover:
- Type matching: does the annotation subclass (`BoolAnnotation`, `StringArrayAnnotation`, etc.) match the question's `ControlType` and expected data type?
- Option validity: for dropdown/checkbox questions, are the selected values present in the question's `Options` list?
- Numeric constraints: for decimal/int questions, are values within expected ranges?
- Reconciliation scoping: annotations in a reconciliation session may belong to multiple annotators (by design), while non-reconciliation sessions should contain only the submitting annotator's annotations. The existing `ExtractionInfo.AddAnnotations()` already enforces this at line 80 (`reconciliation || an.AnnotatorId == annotatorId`), so this is covered at the data layer but not by the validator
This would be a separate validation tier with its own test matrix and should be designed as a follow-up feature.