Technical Plan: Advanced Screening & Filtering

Audience: Chris (Senior Dev), Nuri (Junior Dev)

Scope: MVP-first, hybrid cloud aware. Incorporates: PRISMA implementation plan, detailed Filter Set model + Angular Material Filter Builder UI, MongoDB query/performance strategy, forward path to annotation-based filtering.

Key Design: No MassTransit consumer for tallies—materialised tallies are updated atomically inside the Study aggregate alongside screening/annotation writes.

Phasing & Constraints

MVP (3–4 sprints): Screening Profiles (immutable-on-use + clone), Stage Settings (mode required), Filter Set v2 (nested groups in storage, simple UI), Selection & Stats, studies endpoint with stageId, Reviewer UI wiring, opt-in Migration Wizard.
Hardening (1–2 sprints): a11y polish, perf/telemetry dashboards, helpdesk SOP.
Phase-2: PRISMA diagram/export, annotation-based filtering, tie-breaker groups, optional materialised pools for very large projects.
Hybrid infrastructure: Angular (Material + ngrx), ASP.NET API (.NET 10), MongoDB (Atlas/GKE), RabbitMQ present but not used for tallies; on-prem file server unchanged.

Architecture Overview

[ Angular SPA ]
  ├─ Stage Settings (mode required)
  ├─ Filter Builder (Material) — JSON v2
  ├─ Reviewer Screen (Screening/Annotation/Reconciliation)
  └─ Stats widgets + Study table (Pool via stageId)
      ▼ REST/JSON
[ ASP.NET API (.NET 10, GKE) ]
  ├─ ProfilesController       ── CRUD/Clone
  ├─ StagesController         ── settings incl. FilterSet
  ├─ StudiesController        ── GET studies?stageId=… (Stage Study Pool)
  ├─ SelectionController      ── POST select_next, GET stats
  ├─ DecisionsController      ── POST decisions (stage context)
  ├─ PrismaController (P2)    ── GET prisma_data
  └─ Domain services: FilterCompiler, SelectionService, StatsService,
                      ReconciliationService, PrismaAggregator, AuditService

[ MongoDB ]  — Project, Study, Reviewing_Audit
  • Project: screeningProfiles[], stages[] (filterSet + settings), prismaMapping
  • Study: screeningOutcomes[], extractionInfo.sessions[],
           extractionInfo.sessionTallies[], reconciledAnnotations{}

[ On-prem FT Files ] — unchanged; fetched via existing endpoints

Service Boundaries & Responsibilities

ProfilesService — immutable-on-use; clone semantics; where-used lookup.
StagesService — stage settings + FilterSet persistence (with schema validation).
FilterCompiler — validate/simplify/compile FilterSet JSON → efficient MongoDB filter(s). Knows array-matching pitfalls and merges conditions per array path.
SelectionService — derives Selection Subset from Stage Study Pool by mode; supports saved-session routing; random selection using index-friendly technique (see §6.5), not $sample for very large pools.
StatsService — fast counts by caller and stage; micro-cache allowed (in-memory/Redis optional).
ReconciliationService — eligibility, self-reconciliation policy, commit reconciled annotations.
AuditService — append-only reviewing_audit.
PrismaAggregator (P2) — compute PRISMA metrics from Screening Outcomes + import metadata.

Consistency boundary: All per-study tallies (screening/annotation session tallies, InclusionInfo) are updated atomically with the Study document write.
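
One way to keep a tally in step with the session write is a single-document update, which MongoDB applies atomically. The sketch below only builds the update specification. The helper name `buildCompleteSessionUpdate`, the use of `arrayFilters`, and the assumption that a tally entry for the stage already exists are illustrative choices, not the plan's confirmed implementation.

```typescript
// Hypothetical helper: builds the pieces of one updateOne call on a Study.
// Field paths follow the Study model; names and shape are assumptions.
interface UpdateSpec {
  filter: Record<string, unknown>;
  update: { $set: Record<string, unknown>; $inc: Record<string, number> };
  arrayFilters: Record<string, unknown>[];
}

function buildCompleteSessionUpdate(
  studyId: string,
  stageId: string,
  reviewerId: string,
  completedAt: string
): UpdateSpec {
  // The session being completed: this reviewer's in-progress candidate session.
  const session = { stageId, reviewerId, reconciliation: false, status: "InProgress" };
  return {
    // Match only if that session exists, so the tally is never bumped on its own.
    filter: { _id: studyId, "extractionInfo.sessions": { $elemMatch: session } },
    update: {
      $set: {
        "extractionInfo.sessions.$[s].status": "Completed",
        "extractionInfo.sessions.$[s].completedAt": completedAt,
      },
      // Assumes a sessionTallies entry for this stage was created earlier.
      $inc: {
        "extractionInfo.sessionTallies.$[t].numberOfCompletedCandidateSessions": 1,
      },
    },
    arrayFilters: [
      {
        "s.stageId": stageId,
        "s.reviewerId": reviewerId,
        "s.reconciliation": false,
        "s.status": "InProgress",
      },
      { "t.stageId": stageId },
    ],
  };
}
```

Because the session change and the tally increment travel in the same update, a reader never sees one without the other.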

Data Model (MongoDB)

Project

{
  screeningProfiles: [
    { id, name, criteriaText, parentId?, createdBy, createdAt, used: bool }
  ],
  stages: [
    { id, name, studySelectionMode, hideExcluded, maxInProgress,
      sessionCountTarget, selfReconciliation, filterSet }
  ],
  prismaMapping: { taProfileId?, ftProfileId?, sourceFields, notes } // Phase-2
}

Study

{
  screeningOutcomes: [
    { profileId, decisions[], status, updatedAt }
  ],
  extractionInfo: {
    sessions: [
      { stageId, reviewerId, reconciliation: bool,
        status: "InProgress"|"Completed", startedAt, completedAt? }
    ],
    sessionTallies: [
      { stageId, numberOfCandidateSessions,
        numberOfCompletedCandidateSessions,
        numberOfReconciliationSessions? }
    ]
  },
  inclusionInfo: [], // optional materialised summary per profile/stage
  reconciledAnnotations: { "<questionId>": "<value|array|object>" }, // Phase-2
  rand: 0.0 // double in [0,1) for index-friendly random selection
}

Indexes (created at startup)

screeningOutcomes.profileId + screeningOutcomes.status (compound, multi-key)
extractionInfo.sessionTallies.stageId
extractionInfo.sessions.{stageId, reviewerId, reconciliation, status}
rand (ascending) for random-by-range selection
Phase-2: wildcard reconciledAnnotations.$** (with partial and sparse strategies)
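
The index list above could be written down as key specifications for the startup job. This is a sketch: the `studyIndexes` array, the `phase` tag, and `indexesFor` are hypothetical names, and only the field paths come from the list.

```typescript
// Index key specifications (1 = ascending). Names here are illustrative.
type IndexKeys = Record<string, 1 | -1>;

const studyIndexes: { keys: IndexKeys; phase: "MVP" | "Phase-2" }[] = [
  { keys: { "screeningOutcomes.profileId": 1, "screeningOutcomes.status": 1 }, phase: "MVP" },
  { keys: { "extractionInfo.sessionTallies.stageId": 1 }, phase: "MVP" },
  {
    keys: {
      "extractionInfo.sessions.stageId": 1,
      "extractionInfo.sessions.reviewerId": 1,
      "extractionInfo.sessions.reconciliation": 1,
      "extractionInfo.sessions.status": 1,
    },
    phase: "MVP",
  },
  { keys: { rand: 1 }, phase: "MVP" },
  { keys: { "reconciledAnnotations.$**": 1 }, phase: "Phase-2" },
];

// Returns the key specs the startup job would ensure for the enabled phase.
function indexesFor(phase2Enabled: boolean): IndexKeys[] {
  return studyIndexes
    .filter((ix) => ix.phase === "MVP" || phase2Enabled)
    .map((ix) => ix.keys);
}
```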

Filter Set v2: Storage, Semantics, Compilation

JSON Schema (backward-compatible & future-proof)

{
  "version": 2,
  "logic": "AND",
  "rules": [
    {
      "type": "group",
      "logic": "AND",
      "rules": [
        { "type": "profileOutcome", "profileId": "<guid>", "op": "in", "values": ["Included","Conflict","Maybe"] },
        { "type": "profileOutcome", "profileId": "<guidB>", "op": "notIn", "values": ["Included"] }
      ]
    },
    { "type": "annotation", "questionId": "ft_reason", "op": "in", "values": ["WrongPopulation","Duplicate"] }
  ]
}

Types: profileOutcome (MVP), annotation (Phase-2). Additional future types (e.g., importSource, studyTag) can slot in without breaking clients.
Nested groups from day one (UI may only allow simple cases in MVP).
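
On the client side the schema could be mirrored as a discriminated union, which also helps with DTO parity. The type names and the `validateFilterSet` helper below are assumptions for illustration, and `Op` is limited to the two operators used in the example JSON.

```typescript
type Logic = "AND" | "OR";
type Op = "in" | "notIn";

interface ProfileOutcomeRule { type: "profileOutcome"; profileId: string; op: Op; values: string[]; }
interface AnnotationRule { type: "annotation"; questionId: string; op: Op; values: string[]; }
interface GroupRule { type: "group"; logic: Logic; rules: FilterRule[]; }
type FilterRule = GroupRule | ProfileOutcomeRule | AnnotationRule;
interface FilterSetV2 { version: 2; logic: Logic; rules: FilterRule[]; }

// Collects readable problems instead of throwing, so a UI could list them.
function validateRules(rules: FilterRule[], path: string): string[] {
  const errors: string[] = [];
  if (rules.length === 0) errors.push(path + ": rules must not be empty");
  rules.forEach((rule, i) => {
    const here = path + ".rules[" + i + "]";
    if (rule.type === "group") {
      validateRules(rule.rules, here).forEach((e) => errors.push(e));
    } else if (rule.values.length === 0) {
      errors.push(here + ": values must not be empty");
    }
  });
  return errors;
}

function validateFilterSet(fs: FilterSetV2): string[] {
  // The runtime check matters when the object was parsed from untyped JSON.
  const versionErrors = (fs.version as number) === 2 ? [] : ["root: unsupported version"];
  return versionErrors.concat(validateRules(fs.rules, "root"));
}
```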

C# Model & Validation

public enum NodeType { Group, ProfileOutcome, Annotation }
public enum Logic { And, Or }
public enum Op { In, NotIn, Eq, Neq, Any, All }

public abstract record Node(NodeType Type);
public sealed record GroupNode(Logic Logic, IReadOnlyList<Node> Rules) : Node(NodeType.Group);
public sealed record ProfileOutcomeNode(string ProfileId, Op Op, IReadOnlyList<string> Values) : Node(NodeType.ProfileOutcome);
public sealed record AnnotationNode(string QuestionId, Op Op, IReadOnlyList<string> Values) : Node(NodeType.Annotation);

public static class FilterValidator {
  public static void Validate(Node n) {
    // validate ids, enum values, non-empty rules,
    // detect circular references if we later allow profile->stage edges
  }
}

Simplifier: Make Queries Cheaper (Idempotent)

simplify(node):
  if node is Group(AND/OR):
    node.rules = [simplify(r) for r in node.rules]
    // Flatten nested groups with same logic
    flatten(node)
    // Merge ProfileOutcome rules targeting SAME profileId
    //   AND + (in A) + (in B)   → in (A ∩ B)
    //   AND + (in A) + (notIn B)→ in (A − B); if empty → FALSE
    //   OR  + (in A) + (in B)   → in (A ∪ B)
    // Remove tautologies and contradictions
    // Drop empty groups; if group becomes empty:
    //   AND → TRUE; OR → FALSE  (apply identities carefully)
  return node

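Why: each $elemMatch is evaluated independently, so two separate $elemMatch clauses on screeningOutcomes can be satisfied by different array elements. Merging all rules for a profileId into one $elemMatch keeps the conditions on the same element, which avoids incorrect matches and leaves fewer predicates for the query planner.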
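
The AND merge rules in the simplifier pseudocode can be sketched as follows. This is a minimal illustration, assuming each study holds at most one screeningOutcomes entry per profileId; `mergeAndGroup` and its types are hypothetical names.

```typescript
type Op = "in" | "notIn";
interface OutcomeRule { profileId: string; op: Op; values: string[]; }
interface ProfileState { allowed?: string[]; denied: string[]; }

// Merges the ProfileOutcome rules of one AND group per profileId.
// Returns "FALSE" when some profile ends up with an empty allowed set.
function mergeAndGroup(rules: OutcomeRule[]): OutcomeRule[] | "FALSE" {
  const order: string[] = [];
  const state = new Map<string, ProfileState>();
  for (const r of rules) {
    let s = state.get(r.profileId);
    if (s === undefined) {
      s = { denied: [] };
      state.set(r.profileId, s);
      order.push(r.profileId);
    }
    if (r.op === "in") {
      // AND of two "in" rules: intersect the allowed sets.
      s.allowed = s.allowed === undefined
        ? r.values.slice()
        : s.allowed.filter((v) => r.values.indexOf(v) >= 0);
    } else {
      // "notIn" rules accumulate into one denied set.
      const denied = s.denied;
      r.values.forEach((v) => { if (denied.indexOf(v) < 0) denied.push(v); });
    }
  }
  const merged: OutcomeRule[] = [];
  for (const profileId of order) {
    const s = state.get(profileId);
    if (s === undefined) continue;
    if (s.allowed === undefined) {
      merged.push({ profileId, op: "notIn", values: s.denied });
      continue;
    }
    const denied = s.denied;
    const values = s.allowed.filter((v) => denied.indexOf(v) < 0); // A minus B
    if (values.length === 0) return "FALSE";
    merged.push({ profileId, op: "in", values });
  }
  return merged;
}
```

Running the function on its own output returns the same rules, which is the idempotence property the heading asks for.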

Compiler: Array-Aware and Index-Friendly

FilterDefinition<BsonDocument> Compile(Node n) {
  var f = Builders<BsonDocument>.Filter;
  return n switch {
    GroupNode g => g.Logic == Logic.And
      ? f.And(g.Rules.Select(Compile))
      : f.Or(g.Rules.Select(Compile)),

    ProfileOutcomeNode p => f.ElemMatch("screeningOutcomes", f.And(
        f.Eq("profileId", p.ProfileId),
        p.Op switch {
          Op.In    => f.In("status", p.Values),
          Op.NotIn => f.Nin("status", p.Values),
          _        => throw new NotSupportedException()
        }
      )),

    AnnotationNode a => a.Op switch {
      Op.In    => f.In($"reconciledAnnotations.{a.QuestionId}", a.Values),
      Op.NotIn => f.Nin($"reconciledAnnotations.{a.QuestionId}", a.Values),
      Op.Any   => f.Exists($"reconciledAnnotations.{a.QuestionId}"),
      Op.All   => f.All($"reconciledAnnotations.{a.QuestionId}", a.Values),
      _        => throw new NotSupportedException()
    },

    _ => throw new NotSupportedException()
  };
}
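
To make the compiler's output concrete, the sketch below produces the plain MongoDB filter document that the builder calls above correspond to. It is written in TypeScript purely for illustration; `compile` and the types are assumptions, and only the `in`/`notIn` operators are covered.

```typescript
type Logic = "AND" | "OR";
type Op = "in" | "notIn";
type FilterRule =
  | { type: "group"; logic: Logic; rules: FilterRule[] }
  | { type: "profileOutcome"; profileId: string; op: Op; values: string[] }
  | { type: "annotation"; questionId: string; op: Op; values: string[] };
type MongoFilter = Record<string, unknown>;

function compile(node: FilterRule): MongoFilter {
  const operator = (op: Op): string => (op === "in" ? "$in" : "$nin");
  switch (node.type) {
    case "group": {
      // MongoDB rejects empty $and/$or arrays, so validation should run first.
      const children = node.rules.map((r) => compile(r));
      return node.logic === "AND" ? { $and: children } : { $or: children };
    }
    case "profileOutcome":
      // One $elemMatch keeps profileId and status on the same array element.
      return {
        screeningOutcomes: {
          $elemMatch: {
            profileId: node.profileId,
            status: { [operator(node.op)]: node.values },
          },
        },
      };
    case "annotation":
      return {
        ["reconciledAnnotations." + node.questionId]: { [operator(node.op)]: node.values },
      };
  }
}
```

Note that a `notIn` rule compiled this way matches only studies that already have an outcome entry for the profile; studies with no entry for that profile are not selected.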

Bottlenecks & Mitigations

B1: Many profiles across a huge corpus → large multi-key fan-out.
Mitigate: pre-filter by profileId with high selectivity; ensure the compound index { screeningOutcomes.profileId, screeningOutcomes.status } exists.
B2: Deep OR trees lead to index intersection and memory pressure.
Mitigate: simplifier flattens/merges; push down most selective branches first.
B3: $sample on large matched sets is CPU heavy.
Mitigate: random-by-range using rand field and range query with wrap-around.
B4: Annotation value cardinality (Phase-2) → poor selectivity for free-text.
Mitigate: restrict to enumerated codes; use flattened reconciledAnnotations.<qid> fields.

Angular Material Filter Builder: UI/State/Contracts

UX Constraints

MVP exposes one pass-forward rule (Profile + outcomes), but backend stores full v2 schema.
Live count preview (debounced) via GET studies?stageId=…&countOnly=true.
Show clear error messages for circular references at save.

Components (Angular 18+/Material)

<app-filter-builder>
  <app-group [logic]="'AND'">
    <app-rule type="profileOutcome"></app-rule>
    <!-- Future: nested groups; annotation rules -->
  </app-group>
  <mat-divider></mat-divider>
  <div class="preview">
    <span>Matches: {{count$ | async}}</span>
    <button mat-stroked-button (click)="reset()">Reset</button>
  </div>
</app-filter-builder>

Material: mat-form-field, mat-select for profile/outcome pickers; mat-button-toggle-group for AND/OR; cdkDragDrop for reordering; mat-tree optional for nested groups.
State: ngrx store for project/stage/global; signals for component local state & derived values.

Reactive Forms + Signals

const ruleForm = this.fb.group({
  type: this.fb.control<'profileOutcome'|'annotation'>('profileOutcome', { nonNullable: true }),
  profileId: this.fb.control<string | null>(null),
  op: this.fb.control<'in'|'notIn'>('in', { nonNullable: true }),
  values: this.fb.control<string[]>([], { nonNullable: true })
});

// derive JSON (signal)
readonly filterJson = computed(() => serializeToJsonV2(this.rootGroup()));

// preview count (debounced): toObservable (from @angular/core/rxjs-interop) turns the signal into an observable so RxJS operators apply
readonly count$ = toObservable(this.filterJson).pipe(
  debounceTime(300),
  switchMap(json => this.api.getPoolCount(projectId, stageId, json))
);
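
The serializer used in the computed signal is not defined in this plan. A minimal sketch of what it might do for the MVP single-rule case follows; the signature (a logic value plus a list of form values) and the decision to skip incomplete rows are assumptions.

```typescript
interface RuleFormValue {
  type: "profileOutcome" | "annotation";
  profileId: string | null;
  op: "in" | "notIn";
  values: string[];
}
interface ProfileOutcomeRule {
  type: "profileOutcome";
  profileId: string;
  op: "in" | "notIn";
  values: string[];
}
interface FilterSetV2 {
  version: 2;
  logic: "AND" | "OR";
  rules: ProfileOutcomeRule[];
}

// Hypothetical MVP serializer: form rows in, Filter Set v2 JSON shape out.
function serializeToJsonV2(logic: "AND" | "OR", forms: RuleFormValue[]): FilterSetV2 {
  const rules: ProfileOutcomeRule[] = [];
  for (const f of forms) {
    // Skip half-filled rows so the preview never sends an invalid rule.
    // Annotation rules are Phase-2 and are ignored in this sketch.
    if (f.type !== "profileOutcome" || f.profileId === null || f.values.length === 0) continue;
    rules.push({ type: "profileOutcome", profileId: f.profileId, op: f.op, values: f.values.slice() });
  }
  return { version: 2, logic, rules };
}
```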

Selection: Efficient, Fair, and Scalable

Modes & Policies

screening, annotation, screeningAndAnnotation, reconciliation
Apply per-reviewer suppression and hideExcluded where relevant
Saved-session routing when restrictToSaved is set or maxInProgress is reached

Random Selection: Index-Friendly Approach

Instead of $sample with size 1 on large sets, use a precomputed rand field ∈ [0,1) with an index:

r = random()
q1: match(candidates & rand >= r) sort(rand ASC) limit 1
if none: q2: match(candidates & rand < r) sort(rand ASC) limit 1

Pros: uses an index; avoids collection scans; selection is approximately uniform as long as rand values are spread evenly over [0,1).
Refresh rand rarely (e.g., when creating a study).
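
The two-query pattern can be simulated in memory to check the wrap-around behaviour. This is an illustrative sketch, not the server code; `pickByRange` and `Candidate` are hypothetical names, and the pivot `r` is passed in so the result is deterministic.

```typescript
interface Candidate { id: string; rand: number; }

// Picks the candidate with the smallest rand >= r, wrapping to the
// smallest rand overall when nothing lies at or above the pivot.
function pickByRange(candidates: Candidate[], r: number): Candidate | undefined {
  const sorted = candidates.slice().sort((a, b) => a.rand - b.rand);
  // q1: first candidate at or above the pivot.
  for (const c of sorted) {
    if (c.rand >= r) return c;
  }
  // q2 (wrap-around): every remaining candidate has rand < r, take the lowest.
  return sorted[0];
}
```

With this scheme each candidate is chosen with probability equal to the gap between its rand and the previous candidate's rand (wrapping around at 1), so evenly spread rand values give a close-to-uniform pick.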

Selection Filter Build (C#)

var builder = Builders<BsonDocument>.Filter;
var pool = FilterCompiler.Compile(stage.FilterSet);
var candidates = builder.And(builder.Eq("projectId", projectId), pool);

if (mode is Screening or ScreeningAndAnnotation && stage.HideExcluded) {
  var myExcluded = builder.ElemMatch("screeningOutcomes", builder.And(
    builder.Eq("profileId", stage.ActiveProfileId),
    builder.ElemMatch("decisions", builder.And(
      builder.Eq("reviewerId", callerId), builder.Eq("outcome", "Excluded")
    ))));
  candidates &= !myExcluded;
}

if (mode == Reconciliation) {
  candidates &= builder.ElemMatch("extractionInfo.sessionTallies",
    builder.And(builder.Eq("stageId", stage.Id),
                builder.Gte("numberOfCandidateSessions", stage.SessionCountTarget)));
  if (!stage.SelfReconciliation) {
    var mine = builder.ElemMatch("extractionInfo.sessions",
      builder.And(builder.Eq("stageId", stage.Id),
                  builder.Eq("reviewerId", callerId),
                  builder.Eq("reconciliation", false)));
    candidates &= !mine;
  }
}

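// random-by-range: first candidate at or above a random pivot, wrapping around if none
var r = Random.Shared.NextDouble();
var byRand = Builders<BsonDocument>.Sort.Ascending("rand");
var result = await _studies.Find(candidates & builder.Gte("rand", r))
    .Sort(byRand)
    .FirstOrDefaultAsync(ct)
  ?? await _studies.Find(candidates & builder.Lt("rand", r))
    .Sort(byRand)
    .FirstOrDefaultAsync(ct);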

API Contracts (Illustrative)

Stage Study Pool

GET /api/projects/{projectId}/studies?stageId=<stageId>&skip&take&sort&countOnly

Returns reviewer-agnostic Stage Study Pool IDs or records.

Selection

POST /api/projects/{projectId}/stages/{stageId}/select_next
Body: { mode: "screening" | "annotation" | "screeningAndAnnotation" | "reconciliation", restrictToSaved?: boolean }
Response: 200 Study | 204 No Content
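
A client could model the 200/204 split explicitly rather than treating an empty body as an error. The sketch below is an assumption about how the SPA might wrap this contract; `selectNextRequest`, `interpretSelectNext`, and the result type are hypothetical.

```typescript
type SelectionMode = "screening" | "annotation" | "screeningAndAnnotation" | "reconciliation";
interface SelectNextBody { mode: SelectionMode; restrictToSaved?: boolean; }
interface SelectNextRequest { method: "POST"; url: string; body: SelectNextBody; }

// Builds the request described by the contract above.
function selectNextRequest(projectId: string, stageId: string, body: SelectNextBody): SelectNextRequest {
  const url =
    "/api/projects/" + encodeURIComponent(projectId) +
    "/stages/" + encodeURIComponent(stageId) + "/select_next";
  return { method: "POST", url, body };
}

type SelectNextResult<TStudy> = { kind: "study"; study: TStudy } | { kind: "empty" };

// Maps the HTTP outcome to a tagged result: 204 means nothing is left to review.
function interpretSelectNext<TStudy>(status: number, body: TStudy | undefined): SelectNextResult<TStudy> {
  if (status === 204) return { kind: "empty" };
  if (status === 200 && body !== undefined) return { kind: "study", study: body };
  throw new Error("Unexpected select_next response: " + status);
}
```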

Stats

GET /api/projects/{projectId}/stages/{stageId}/stats
Response: {
  availableForScreening: number,
  availableForAnnotation: number,
  reconciliationEligible: number,
  inProgress: number,
  completed: number,
  reconciliationInProgress: number
}
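
Since StatsService allows a micro-cache and counts are per caller and stage, a short-lived cache keyed on both is one option. The sketch below is illustrative: `MicroCache`, the key format, and the synchronous loader are assumptions, and the clock is injectable so expiry can be tested.

```typescript
interface StageStats {
  availableForScreening: number;
  availableForAnnotation: number;
  reconciliationEligible: number;
  inProgress: number;
  completed: number;
  reconciliationInProgress: number;
}

// Tiny TTL cache; a real service would likely load asynchronously.
class MicroCache<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();

  constructor(
    private readonly ttlMs: number,
    private readonly now: () => number = () => Date.now()
  ) {}

  getOrLoad(key: string, load: () => T): T {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) return hit.value;
    const value = load();
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}
```

A key such as `stageId + ":" + callerId` keeps one reviewer's counts from being served to another.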

Decisions

POST /api/projects/{projectId}/stages/{stageId}/screening/decisions
Body: { outcome: "Included|Excluded|Conflict|Pending|Maybe?", notes?, ... }

Server infers profileId from stage.

Testing Strategy

Unit Tests

FilterValidator/Simplifier/Compiler
Selection eligibility per mode
Reconciliation policy
Decision status computation
PRISMA aggregator math (Phase-2)

Integration Tests (API+Mongo)

/select_next behaviours
/stats counts
studies?stageId pool correctness
FilterSet round-trip
Decision write atomic updates (tallies & inclusion info)

Contract Tests

DTO parity (TypeScript vs C#)
Versioned schemas

E2E Tests

Stage creation (mode required)
Filter Builder (preview counts)
Reviewer flows (all modes)
Migration Wizard
PRISMA mapping & preview (Phase-2)

Performance Tests

Selection with rand vs $sample
OR-heavy filters
Annotation filters on popular questions (Phase-2)

Property-Based Tests

Simplifier equivalence (random trees → compile → run vs naive eval on sample set)
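
The naive evaluation side of that comparison could look like the sketch below: an in-memory matcher that mirrors the $elemMatch semantics for profileOutcome rules. The `matches` function and the trimmed-down `Study` shape are assumptions for illustration.

```typescript
interface Study { screeningOutcomes: { profileId: string; status: string }[]; }
type Rule =
  | { type: "group"; logic: "AND" | "OR"; rules: Rule[] }
  | { type: "profileOutcome"; profileId: string; op: "in" | "notIn"; values: string[] };

// Reference evaluator: a property test would assert that the original tree
// and the simplified tree give the same answer for every sampled study.
function matches(rule: Rule, study: Study): boolean {
  if (rule.type === "group") {
    const children = rule.rules;
    // Empty AND is TRUE and empty OR is FALSE, matching the simplifier identities.
    return rule.logic === "AND"
      ? children.every((r) => matches(r, study))
      : children.some((r) => matches(r, study));
  }
  const leaf = rule;
  // Like $elemMatch: one outcome entry must satisfy profileId and status together.
  return study.screeningOutcomes.some((o) => {
    if (o.profileId !== leaf.profileId) return false;
    const listed = leaf.values.indexOf(o.status) >= 0;
    return leaf.op === "in" ? listed : !listed;
  });
}
```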

Deployment & Env Config

Feature flag: features.advancedScreeningProfiles per project
Indexes job: ensure all indexes on boot
Config: selection.random.method = randRange (fallback $sample for tiny pools)
K8s: HPA on p95 latency & CPU; probes at /healthz & /livez
CI/CD: build → unit/integration → deploy Dev → smoke → UAT → Prod; canary under feature flag

Migration & Rollback

Migration Wizard (opt-in)

Freeze: set project.migrationStatus = Freezing; block review actions
Snapshot: copy project doc + study IDs into migrationSnapshots
Backfill: create first Screening Profile from legacy criteria text; sweep studies to derive initial outcomes
Verify: counts match; sample QA; write audit log
Unfreeze: set project.migrationStatus = Complete and enable feature

Rollback

If any step fails, set migrationStatus = Failed and offer Revert, which restores the snapshot
Clear partial writes with a job that deletes new fields where safe
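
The wizard's status handling can be sketched as a small step runner. This assumes synchronous steps and uses only the three status values named above; `runMigration`, `Step`, and the result shape are hypothetical.

```typescript
type MigrationStatus = "Freezing" | "Complete" | "Failed";
interface Step { name: string; run: () => void; }
interface MigrationResult { ok: boolean; failedStep?: string; }

// Runs the steps in order; the first failure marks the migration Failed
// and stops, leaving Revert (snapshot restore) to the caller.
function runMigration(steps: Step[], setStatus: (s: MigrationStatus) => void): MigrationResult {
  setStatus("Freezing");
  for (const step of steps) {
    try {
      step.run();
    } catch {
      setStatus("Failed");
      return { ok: false, failedStep: step.name };
    }
  }
  setStatus("Complete");
  return { ok: true };
}
```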

Note: Reasons for exclusion are not backfilled; consider a post-hoc annotation pass in Phase-2.

Security & Access Control

Admin-only create/edit Profiles/FilterSets/Stage Settings/PRISMA mapping
Reviewers can view criteria text on review UI (read-only)
Audit all decisions, session changes, reconciled commits (reviewing_audit)

Developer Checklist

Backend Tasks

Frontend Tasks

Documentation Tasks

Update API documentation
Create user guide for new features
Helpdesk SOP export bundle

Sprint Timeline (Indicative)

Sprint 1: Profiles domain/API; Stage Settings (mode required); ensure indexes; decisions write atomic tallies
Sprint 2: FilterSet storage (v2), Simplifier + Compiler; MVP Filter Builder; studies?stageId pool
Sprint 3: Selection (rand strategy) + Stats; Reviewer UI wiring; telemetry
Sprint 4: Migration Wizard; E2E hardening
Sprint 5 (Phase-2 start): PRISMA mapping + aggregator; annotation filtering storage; wildcard/partial indexes

Open Questions (Carried)

Should HideExcludedStudiesFromReviewers apply in Reconciliation Mode?
PRISMA box breakdowns beyond TA/FT (e.g., multiple phases)?