Technical Plan: Advanced Screening & Filtering

Audience: Chris (Senior Dev), Nuri (Junior Dev)

Scope: MVP-first, hybrid cloud aware. Incorporates: PRISMA implementation plan, detailed Filter Set model + Angular Material Filter Builder UI, MongoDB query/performance strategy, forward path to annotation-based filtering.

Key Design: No MassTransit consumer for tallies—materialised tallies are updated atomically inside the Study aggregate alongside screening/annotation writes.

Phasing & Constraints

  • MVP (3–4 sprints): Screening Profiles (immutable-on-use + clone), Stage Settings (mode required), Filter Set v2 (nested groups in storage, simple UI), Selection & Stats, studies endpoint with stageId, Reviewer UI wiring, opt-in Migration Wizard.
  • Hardening (1–2 sprints): a11y polish, perf/telemetry dashboards, helpdesk SOP.
  • Phase-2: PRISMA diagram/export, annotation-based filtering, tie-breaker groups, optional materialised pools for very large projects.
  • Hybrid infrastructure: Angular (Material + ngrx), ASP.NET API (.NET 10), MongoDB (Atlas/GKE), RabbitMQ present but not used for tallies; on-prem file server unchanged.

Architecture Overview

[ Angular SPA ]
  ├─ Stage Settings (mode required)
  ├─ Filter Builder (Material) — JSON v2
  ├─ Reviewer Screen (Screening/Annotation/Reconciliation)
  └─ Stats widgets + Study table (Pool via stageId)
      ▼ REST/JSON
[ ASP.NET API (.NET 10, GKE) ]
  ├─ ProfilesController       ── CRUD/Clone
  ├─ StagesController         ── settings incl. FilterSet
  ├─ StudiesController        ── GET studies?stageId=… (Stage Study Pool)
  ├─ SelectionController      ── POST select_next, GET stats
  ├─ DecisionsController      ── POST decisions (stage context)
  ├─ PrismaController (P2)    ── GET prisma_data
  └─ Domain services: FilterCompiler, SelectionService, StatsService,
                      ReconciliationService, PrismaAggregator, AuditService

[ MongoDB ]  — Project, Study, Reviewing_Audit
  • Project: screeningProfiles[], stages[] (filterSet + settings), prismaMapping
  • Study: screeningOutcomes[], extractionInfo.sessions[],
           extractionInfo.sessionTallies[], reconciledAnnotations{}

[ On-prem FT Files ] — unchanged; fetched via existing endpoints

Service Boundaries & Responsibilities

  • ProfilesService — immutable-on-use; clone semantics; where-used lookup.
  • StagesService — stage settings + FilterSet persistence (with schema validation).
  • FilterCompiler — validate/simplify/compile FilterSet JSON → efficient MongoDB filter(s). Knows array-matching pitfalls and merges conditions per array path.
  • SelectionService — derives Selection Subset from Stage Study Pool by mode; supports saved-session routing; random selection uses the index-friendly rand-range technique (see "Random Selection: Index-Friendly Approach") rather than $sample for very large pools.
  • StatsService — fast counts by caller and stage; micro-cache allowed (in-memory/Redis optional).
  • ReconciliationService — eligibility, self-reconciliation policy, commit reconciled annotations.
  • AuditService — append-only reviewing_audit.
  • PrismaAggregator (P2) — compute PRISMA metrics from Screening Outcomes + import metadata.

Consistency boundary: All per-study tallies (screening/annotation session tallies, InclusionInfo) are updated atomically with the Study document write.
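
A minimal sketch of that boundary, written in TypeScript for brevity: starting a session pushes the session and increments the matching per-stage tally in one update, so a single-document write carries both changes. The helper name buildSessionStartUpdate and the SessionStart shape are hypothetical; the field paths follow the Study model in this plan. It assumes a sessionTallies entry for the stage already exists, because an arrayFilters condition that matches no element increments nothing.

```typescript
// Hypothetical helper: builds ONE update for a Study document so the new
// session and its per-stage tally change in the same single-document write.
interface SessionStart {
  stageId: string;
  reviewerId: string;
  reconciliation: boolean;
  startedAt: string; // ISO timestamp
}

interface StudyUpdate {
  update: Record<string, Record<string, unknown>>;
  arrayFilters: Array<Record<string, unknown>>;
}

function buildSessionStartUpdate(s: SessionStart): StudyUpdate {
  // Candidate and reconciliation sessions are counted in separate tally fields.
  const counter = s.reconciliation
    ? "numberOfReconciliationSessions"
    : "numberOfCandidateSessions";
  return {
    update: {
      $push: { "extractionInfo.sessions": { ...s, status: "InProgress" } },
      // $[t] targets only the tally element selected by arrayFilters below.
      $inc: { [`extractionInfo.sessionTallies.$[t].${counter}`]: 1 },
    },
    arrayFilters: [{ "t.stageId": s.stageId }],
  };
}
```

The returned update and arrayFilters would be passed together to one update call on the Study document; seeding the sessionTallies entry when a stage is created is one way to satisfy the assumption above.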

Data Model (MongoDB)

Project

{
  screeningProfiles: [
    { id, name, criteriaText, parentId?, createdBy, createdAt, used: bool }
  ],
  stages: [
    { id, name, studySelectionMode, hideExcluded, maxInProgress,
      sessionCountTarget, selfReconciliation, filterSet }
  ],
  prismaMapping: { taProfileId?, ftProfileId?, sourceFields, notes } // Phase-2
}

Study

{
  screeningOutcomes: [
    { profileId, decisions[], status, updatedAt }
  ],
  extractionInfo: {
    sessions: [
      { stageId, reviewerId, reconciliation: bool,
        status: "InProgress"|"Completed", startedAt, completedAt? }
    ],
    sessionTallies: [
      { stageId, numberOfCandidateSessions,
        numberOfCompletedCandidateSessions,
        numberOfReconciliationSessions? }
    ]
  },
  inclusionInfo: [], // optional materialised summary per profile/stage
  reconciledAnnotations: { "<questionId>": "<value|array|object>" }, // Phase-2
  rand: 0.0 // double in [0,1) for index-friendly random selection
}

Indexes (created at startup)

  • screeningOutcomes.profileId + screeningOutcomes.status (compound, multi-key)
  • extractionInfo.sessionTallies.stageId
  • extractionInfo.sessions.{stageId, reviewerId, reconciliation, status}
  • rand (ascending) for random-by-range selection
  • Phase-2: wildcard reconciledAnnotations.$** (with partial and sparse strategies)
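
The index list above can be written down as key specifications, sketched here in TypeScript. The index names are hypothetical and the key order inside the sessions index is an assumption; the field paths come from the Study model. The small check reflects a MongoDB rule worth keeping in mind: a compound multikey index may cover several fields of the same array, but not fields from two different arrays in one document.

```typescript
// Index key specifications (1 = ascending). Names are hypothetical.
type IndexKeys = Record<string, 1 | -1>;
interface IndexSpec { name: string; keys: IndexKeys }

const studyIndexes: IndexSpec[] = [
  { name: "outcome_profile_status",
    keys: { "screeningOutcomes.profileId": 1, "screeningOutcomes.status": 1 } },
  { name: "tally_stage",
    keys: { "extractionInfo.sessionTallies.stageId": 1 } },
  { name: "session_lookup",
    keys: { "extractionInfo.sessions.stageId": 1,
            "extractionInfo.sessions.reviewerId": 1,
            "extractionInfo.sessions.reconciliation": 1,
            "extractionInfo.sessions.status": 1 } },
  { name: "rand_asc", keys: { rand: 1 } },
];

// Phase-2 wildcard index over reconciled annotation values.
const phase2Indexes: IndexSpec[] = [
  { name: "reconciled_wildcard", keys: { "reconciledAnnotations.$**": 1 } },
];

// Array-valued paths in the Study model.
const arrayPaths = [
  "screeningOutcomes",
  "extractionInfo.sessions",
  "extractionInfo.sessionTallies",
];

// Lists the array paths an index touches; more than one would be rejected.
function arraysTouched(spec: IndexSpec): string[] {
  const keys = Object.keys(spec.keys);
  return arrayPaths.filter((p) => keys.some((k) => k.indexOf(p + ".") === 0));
}
```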

Filter Set v2 — Storage, Semantics, Compilation

JSON Schema (backward-compatible & future-proof)

{
  "version": 2,
  "logic": "AND",
  "rules": [
    {
      "type": "group",
      "logic": "AND",
      "rules": [
        { "type": "profileOutcome", "profileId": "<guid>", "op": "in", "values": ["Included","Conflict","Maybe"] },
        { "type": "profileOutcome", "profileId": "<guidB>", "op": "notIn", "values": ["Included"] }
      ]
    },
    { "type": "annotation", "questionId": "ft_reason", "op": "in", "values": ["WrongPopulation","Duplicate"] }
  ]
}
  • Types: profileOutcome (MVP), annotation (Phase-2). Additional future types (e.g., importSource, studyTag) can slot in without breaking clients.
  • Nested groups from day one (UI may only allow simple cases in MVP).
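
A structural check for this shape might look like the TypeScript sketch below. The function names are hypothetical, and the sketch assumes only the "in" and "notIn" operators shown in the example JSON; it rejects unknown rule types, which a client that must tolerate future types would probably relax.

```typescript
// Sketch of a structural validator for Filter Set v2 (hypothetical helper names).
function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function validateRule(rule: unknown, path: string, errors: string[]): void {
  if (!isRecord(rule)) {
    errors.push(`${path}: expected an object`);
    return;
  }
  const kind = rule["type"];
  if (kind === "group") {
    validateGroup(rule, path, errors);
  } else if (kind === "profileOutcome" || kind === "annotation") {
    const idKey = kind === "profileOutcome" ? "profileId" : "questionId";
    const id = rule[idKey];
    const values = rule["values"];
    if (typeof id !== "string" || id.length === 0) errors.push(`${path}.${idKey}: required`);
    if (rule["op"] !== "in" && rule["op"] !== "notIn") errors.push(`${path}.op: unsupported operator`);
    if (!Array.isArray(values) || values.length === 0 || !values.every((v) => typeof v === "string")) {
      errors.push(`${path}.values: expected a non-empty string array`);
    }
  } else {
    errors.push(`${path}.type: unknown rule type`);
  }
}

function validateGroup(group: Record<string, unknown>, path: string, errors: string[]): void {
  const rules = group["rules"];
  if (group["logic"] !== "AND" && group["logic"] !== "OR") errors.push(`${path}.logic: expected "AND" or "OR"`);
  if (!Array.isArray(rules) || rules.length === 0) {
    errors.push(`${path}.rules: expected a non-empty array`);
    return;
  }
  rules.forEach((r, i) => validateRule(r, `${path}.rules[${i}]`, errors));
}

// Returns a list of problems; an empty list means the document is structurally valid.
function validateFilterSet(input: unknown): string[] {
  if (!isRecord(input)) return ["$: expected an object"];
  const errors: string[] = [];
  if (input["version"] !== 2) errors.push("$.version: expected 2");
  validateGroup(input, "$", errors);
  return errors;
}
```

Reporting a path with each problem (for example $.rules[0].rules[1].op) gives the Filter Builder something concrete to highlight.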

C# Model & Validation

public enum NodeType { Group, ProfileOutcome, Annotation }
public enum Logic { And, Or }
public enum Op { In, NotIn, Eq, Neq, Any, All }

public abstract record Node(NodeType Type);
public sealed record GroupNode(Logic Logic, IReadOnlyList<Node> Rules) : Node(NodeType.Group);
public sealed record ProfileOutcomeNode(string ProfileId, Op Op, IReadOnlyList<string> Values) : Node(NodeType.ProfileOutcome);
public sealed record AnnotationNode(string QuestionId, Op Op, IReadOnlyList<string> Values) : Node(NodeType.Annotation);

public static class FilterValidator {
  public static void Validate(Node n) {
    // validate ids, enum values, non-empty rules,
    // detect circular references if we later allow profile->stage edges
  }
}

Simplifier — Make Queries Cheaper (Idempotent)

simplify(node):
  if node is Group(AND/OR):
    node.rules = [simplify(r) for r in node.rules]
    // Flatten nested groups with same logic
    flatten(node)
    // Merge ProfileOutcome rules targeting SAME profileId
    //   AND + (in A) + (in B)   → in (A ∩ B)
    //   AND + (in A) + (notIn B)→ in (A − B); if empty → FALSE
    //   OR  + (in A) + (in B)   → in (A ∪ B)
    // Remove tautologies and contradictions
    // Drop empty groups; if group becomes empty:
    //   AND → TRUE; OR → FALSE  (apply identities carefully)
  return node

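Why: separate $elemMatch clauses on the same array are evaluated independently, so each clause can be satisfied by a different array element. Merging rules per profileId into a single $elemMatch keeps the profileId and status tests on the same element, which avoids incorrect matches and produces a smaller query.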
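
The pseudocode above could be realised roughly as follows, sketched in TypeScript. Two things here are assumptions rather than part of the plan: the "const" node used as a TRUE/FALSE marker, and the premise that a study holds at most one screeningOutcomes entry per profileId (without it the merges are not equivalences). Under AND the sketch also unions multiple notIn lists for the same profile, a merge the pseudocode does not list.

```typescript
type Logic = "AND" | "OR";
type FilterNode =
  | { type: "group"; logic: Logic; rules: FilterNode[] }
  | { type: "profileOutcome"; profileId: string; op: "in" | "notIn"; values: string[] }
  | { type: "const"; value: boolean }; // hypothetical TRUE/FALSE marker

const uniq = (xs: string[]): string[] => xs.filter((x, i) => xs.indexOf(x) === i);

function simplify(node: FilterNode): FilterNode {
  if (node.type !== "group") return node;
  const isAnd = node.logic === "AND";

  // Simplify children first, then flatten child groups that share this logic.
  const flat: FilterNode[] = [];
  for (const child of node.rules.map(simplify)) {
    if (child.type === "group" && child.logic === node.logic) flat.push(...child.rules);
    else flat.push(child);
  }

  // Collect value lists per profileId.
  const allowed = new Map<string, string[]>(); // merged "in" lists
  const denied = new Map<string, string[]>();  // merged "notIn" lists (AND only)
  const others: FilterNode[] = [];
  for (const r of flat) {
    if (r.type === "profileOutcome" && r.op === "in") {
      const prev = allowed.get(r.profileId);
      const next = uniq(r.values);
      allowed.set(r.profileId,
        prev === undefined ? next
        : isAnd ? prev.filter((v) => next.indexOf(v) >= 0) // AND: intersection
        : uniq(prev.concat(next)));                        // OR: union
    } else if (r.type === "profileOutcome" && isAnd) {
      denied.set(r.profileId, uniq((denied.get(r.profileId) ?? []).concat(r.values)));
    } else {
      others.push(r); // nested groups, constants, and notIn rules under OR
    }
  }

  const merged: FilterNode[] = [];
  allowed.forEach((values, profileId) => {
    const minus = denied.get(profileId) ?? [];
    denied.delete(profileId);
    const rest = values.filter((v) => minus.indexOf(v) < 0); // AND: in A minus notIn B
    merged.push(rest.length === 0
      ? { type: "const", value: false }
      : { type: "profileOutcome", profileId, op: "in", values: rest });
  });
  denied.forEach((values, profileId) =>
    merged.push({ type: "profileOutcome", profileId, op: "notIn", values }));

  // Identities: FALSE absorbs AND, TRUE absorbs OR; the neutral constant is dropped.
  const all = merged.concat(others);
  if (all.some((r) => r.type === "const" && r.value !== isAnd)) return { type: "const", value: !isAnd };
  const kept = all.filter((r) => r.type !== "const");
  const first = kept[0];
  if (first === undefined) return { type: "const", value: isAnd }; // empty AND is TRUE, empty OR is FALSE
  return kept.length === 1 ? first : { type: "group", logic: node.logic, rules: kept };
}
```

Because children are simplified before flattening and merged rules are emitted in a stable order, running simplify on its own output should return the same tree, which is the idempotence the heading asks for.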

Compiler — Array-Aware and Index-Friendly

FilterDefinition<BsonDocument> Compile(Node n) {
  var f = Builders<BsonDocument>.Filter;
  return n switch {
    GroupNode g => g.Logic == Logic.And
      ? f.And(g.Rules.Select(Compile))
      : f.Or(g.Rules.Select(Compile)),

    ProfileOutcomeNode p => f.ElemMatch("screeningOutcomes", f.And(
        f.Eq("profileId", p.ProfileId),
        p.Op switch {
          Op.In    => f.In("status", p.Values),
          Op.NotIn => f.Nin("status", p.Values),
          _        => throw new NotSupportedException()
        }
      )),

    AnnotationNode a => a.Op switch {
      Op.In    => f.In($"reconciledAnnotations.{a.QuestionId}", a.Values),
      Op.NotIn => f.Nin($"reconciledAnnotations.{a.QuestionId}", a.Values),
      Op.Any   => f.Exists($"reconciledAnnotations.{a.QuestionId}"),
      Op.All   => f.All($"reconciledAnnotations.{a.QuestionId}", a.Values),
      _        => throw new NotSupportedException()
    },

    _ => throw new NotSupportedException()
  };
}
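
To make the resulting query shape concrete, here is the same compilation sketched in TypeScript, emitting plain MongoDB filter objects instead of driver builders. It is an illustration under assumptions, not the planned C# implementation: only the "in" and "notIn" operators are covered, and the type names are hypothetical.

```typescript
type Logic = "AND" | "OR";
type FilterNode =
  | { type: "group"; logic: Logic; rules: FilterNode[] }
  | { type: "profileOutcome"; profileId: string; op: "in" | "notIn"; values: string[] }
  | { type: "annotation"; questionId: string; op: "in" | "notIn"; values: string[] };

type MongoFilter = { [key: string]: unknown };

function compile(node: FilterNode): MongoFilter {
  switch (node.type) {
    case "group":
      return { [node.logic === "AND" ? "$and" : "$or"]: node.rules.map(compile) };
    case "profileOutcome":
      // One $elemMatch so profileId and status are tested on the same array element.
      return {
        screeningOutcomes: {
          $elemMatch: {
            profileId: node.profileId,
            status: { [node.op === "in" ? "$in" : "$nin"]: node.values },
          },
        },
      };
    case "annotation":
      // Flattened field per question, as in reconciledAnnotations.<questionId>.
      return {
        [`reconciledAnnotations.${node.questionId}`]: {
          [node.op === "in" ? "$in" : "$nin"]: node.values,
        },
      };
  }
}
```

MongoDB rejects $and and $or with an empty array, so empty groups need to be removed (or replaced by a constant) before compilation. Note also that a notIn rule compiled this way matches only studies that already have an outcome entry for the profile.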

Bottlenecks & Mitigations

  • B1: Many profiles across a huge corpus → large multi-key fan-out.
  • Mitigate: lead with the most selective profileId condition; ensure the compound index { screeningOutcomes.profileId, screeningOutcomes.status } exists.
  • B2: Deep OR trees lead to index intersection and memory pressure.
  • Mitigate: simplifier flattens/merges; push down most selective branches first.
  • B3: $sample on large matched sets is CPU heavy.
  • Mitigate: random-by-range using rand field and range query with wrap-around.
  • B4: Annotation value cardinality (Phase-2) → poor selectivity for free-text.
  • Mitigate: restrict to enumerated codes; use flattened reconciledAnnotations.<qid> fields.

Angular Material Filter Builder — UI/State/Contracts

UX Constraints

  • MVP exposes one pass-forward rule (Profile + outcomes), but backend stores full v2 schema.
  • Live count preview (debounced) via GET studies?stageId=…&countOnly=true.
  • Surface circular-reference errors clearly at save.

Components (Angular 18+/Material)

<app-filter-builder>
  <app-group logic="AND">
    <app-rule type="profileOutcome"></app-rule>
    <!-- Future: nested groups; annotation rules -->
  </app-group>
  <mat-divider></mat-divider>
  <div class="preview">
    <span>Matches: {{count$ | async}}</span>
    <button mat-stroked-button (click)="reset()">Reset</button>
  </div>
</app-filter-builder>
  • Material: mat-form-field, mat-select for profile/outcome pickers; mat-button-toggle-group for AND/OR; cdkDragDrop for reordering; mat-tree optional for nested groups.
  • State: ngrx store for project/stage/global; signals for component local state & derived values.

Reactive Forms + Signals

const ruleForm = this.fb.group({
  type: this.fb.control<'profileOutcome'|'annotation'>('profileOutcome', { nonNullable: true }),
  profileId: this.fb.control<string | null>(null),
  op: this.fb.control<'in'|'notIn'>('in', { nonNullable: true }),
  values: this.fb.control<string[]>([], { nonNullable: true })
});

// derive JSON (signal)
readonly filterJson = computed(() => serializeToJsonV2(this.rootGroup()));

// preview count (debounced) — convert signal to observable for RxJS operators
readonly count$ = toObservable(this.filterJson).pipe(
  debounceTime(300),
  switchMap(json => this.api.getPoolCount(projectId, stageId, json))
);

Selection — Efficient, Fair, and Scalable

Modes & Policies

  • screening, annotation, screeningAndAnnotation, reconciliation
  • Apply per-reviewer suppression and hideExcluded where relevant
  • Saved-session routing when restrictToSaved or maxInProgress reached

Random Selection: Index-Friendly Approach

Instead of $sample(1) on large sets, use a precomputed rand field ∈ [0,1) with an index:

r = random()
q1: match(candidates & rand >= r) sort(rand ASC) limit 1
if none: q2: match(candidates & rand < r) sort(rand ASC) limit 1
  • Pros: uses an index; avoids collection scans; stable distribution.
  • Assign rand when the study is created and refresh it rarely afterwards.
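
The two-query logic can be checked against an in-memory stand-in, sketched in TypeScript below. The Candidate shape and the pickByRange name are hypothetical; the sorted array plays the role of the rand index restricted to the candidate set.

```typescript
interface Candidate { id: string; rand: number }

// `sorted` must be ascending by rand, as an index scan on rand would return it.
function pickByRange(sorted: Candidate[], r: number): Candidate | undefined {
  // q1: smallest rand at or above the pivot r.
  const atOrAbove = sorted.find((c) => c.rand >= r);
  // q2 (wrap-around): nothing at or above r, so take the smallest rand overall.
  return atOrAbove ?? sorted[0];
}

// Draws a fresh pivot for each selection.
function pickRandom(sorted: Candidate[]): Candidate | undefined {
  return pickByRange(sorted, Math.random());
}
```

One property to be aware of: a study is picked whenever the pivot falls in the gap just below its rand value, so its chance of selection equals the width of that gap. Over a large pool with independently drawn rand values this is roughly uniform, but for small pools it can be visibly uneven, which is one reason to keep $sample as the fallback there.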

Selection Filter Build (C#)

var pool = FilterCompiler.Compile(stage.FilterSet);
var candidates = builder.And(builder.Eq("projectId", projectId), pool);

if ((mode is Screening or ScreeningAndAnnotation) && stage.HideExcluded) {
  var myExcluded = builder.ElemMatch("screeningOutcomes", builder.And(
    builder.Eq("profileId", stage.ActiveProfileId),
    builder.ElemMatch("decisions", builder.And(
      builder.Eq("reviewerId", callerId), builder.Eq("outcome", "Excluded")
    ))));
  candidates &= !myExcluded;
}

if (mode == Reconciliation) {
  candidates &= builder.ElemMatch("extractionInfo.sessionTallies",
    builder.And(builder.Eq("stageId", stage.Id),
                builder.Gte("numberOfCandidateSessions", stage.SessionCountTarget)));
  if (!stage.SelfReconciliation) {
    var mine = builder.ElemMatch("extractionInfo.sessions",
      builder.And(builder.Eq("stageId", stage.Id),
                  builder.Eq("reviewerId", callerId),
                  builder.Eq("reconciliation", false)));
    candidates &= !mine;
  }
}

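// random-by-range: first candidate at or above a random pivot r; wrap around if none
var r = Random.Shared.NextDouble();
var result = await _studies.Find(candidates & builder.Gte("rand", r))
  .SortBy(x => x["rand"]).FirstOrDefaultAsync(ct);
result ??= await _studies.Find(candidates & builder.Lt("rand", r))
  .SortBy(x => x["rand"]).FirstOrDefaultAsync(ct);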

API Contracts (Illustrative)

Stage Study Pool

GET /api/projects/{projectId}/studies?stageId=<stageId>&skip&take&sort&countOnly

Returns reviewer-agnostic Stage Study Pool IDs or records.

Selection

POST /api/projects/{projectId}/stages/{stageId}/select_next
Body: { mode: "screening" | "annotation" | "screeningAndAnnotation" | "reconciliation", restrictToSaved?: boolean }
Response: 200 Study | 204 No Content

Stats

GET /api/projects/{projectId}/stages/{stageId}/stats
Response: {
  availableForScreening: number,
  availableForAnnotation: number,
  reconciliationEligible: number,
  inProgress: number,
  completed: number,
  reconciliationInProgress: number
}

Decisions

POST /api/projects/{projectId}/stages/{stageId}/screening/decisions
Body: { outcome: "Included|Excluded|Conflict|Pending|Maybe?", notes?, ... }

Server infers profileId from stage.

Testing Strategy

Unit Tests

  • FilterValidator/Simplifier/Compiler
  • Selection eligibility per mode
  • Reconciliation policy
  • Decision status computation
  • PRISMA aggregator math (Phase-2)

Integration Tests (API+Mongo)

  • /select_next behaviours
  • /stats counts
  • studies?stageId pool correctness
  • FilterSet round-trip
  • Decision write atomic updates (tallies & inclusion info)

Contract Tests

  • DTO parity (TypeScript vs C#)
  • Versioned schemas

E2E Tests

  • Stage creation (mode required)
  • Filter Builder (preview counts)
  • Reviewer flows (all modes)
  • Migration Wizard
  • PRISMA mapping & preview (Phase-2)

Performance Tests

  • Selection with rand vs $sample
  • OR-heavy filters
  • Annotation filters on popular questions (Phase-2)

Property-Based Tests

  • Simplifier equivalence (random trees → compile → run vs naive eval on sample set)
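
The "naive eval" side of that property test could be as small as the TypeScript sketch below. The StudyDoc shape is trimmed to the one field the MVP rules read, and matches and agree are hypothetical names. The evaluator mirrors $elemMatch semantics: a profileOutcome rule holds only if some outcome entry for that profile passes the status test.

```typescript
type Logic = "AND" | "OR";
type FilterNode =
  | { type: "group"; logic: Logic; rules: FilterNode[] }
  | { type: "profileOutcome"; profileId: string; op: "in" | "notIn"; values: string[] };

interface StudyDoc {
  screeningOutcomes: Array<{ profileId: string; status: string }>;
}

// Reference evaluator: walks the tree directly, with no simplification.
function matches(node: FilterNode, study: StudyDoc): boolean {
  switch (node.type) {
    case "group":
      return node.logic === "AND"
        ? node.rules.every((r) => matches(r, study))
        : node.rules.some((r) => matches(r, study));
    case "profileOutcome": {
      const { profileId, op, values } = node;
      return study.screeningOutcomes.some((o) =>
        o.profileId === profileId &&
        (values.indexOf(o.status) >= 0) === (op === "in"));
    }
  }
}

// Property under test: two trees select exactly the same studies.
function agree(a: FilterNode, b: FilterNode, studies: StudyDoc[]): boolean {
  return studies.every((s) => matches(a, s) === matches(b, s));
}
```

A property-based run would generate random trees and sample studies, then assert agree(tree, simplified, studies) for each pair, and likewise compare matches against the results of the compiled query.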

Deployment & Env Config

  • Feature flag: features.advancedScreeningProfiles per project
  • Indexes job: ensure all indexes on boot
  • Config: selection.random.method = randRange (fallback $sample for tiny pools)
  • K8s: HPA on p95 latency & CPU; probes at /healthz & /livez
  • CI/CD: build → unit/integration → deploy Dev → smoke → UAT → Prod; canary under feature flag

Migration & Rollback

Migration Wizard (opt-in)

  1. Freeze: set project.migrationStatus = Freezing; block review actions
  2. Snapshot: copy project doc + study IDs into migrationSnapshots
  3. Backfill: create first Screening Profile from legacy criteria text; sweep studies to derive initial outcomes
  4. Verify: counts match; sample QA; write audit log
  5. Unfreeze: set project.migrationStatus = Complete and enable feature

Rollback

  • If any step fails, set migrationStatus = Failed; offer Revert which restores snapshot
  • Clear partial writes with job that deletes new fields where safe
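
The wizard's control flow (run the steps in order, stop at the first failure, record where it stopped so Revert knows what to undo) can be sketched in TypeScript as below. Everything here is illustrative: the MigrationStep shape and runMigration are hypothetical, steps are synchronous for brevity, and the real steps would perform the database work described above.

```typescript
type MigrationStatus = "Freezing" | "Complete" | "Failed";

interface MigrationStep { label: string; run: () => void }

interface MigrationResult {
  status: MigrationStatus;
  completed: string[]; // labels of steps that finished, in order
  failedAt?: string;   // label of the step that threw, if any
}

function runMigration(steps: MigrationStep[]): MigrationResult {
  const completed: string[] = [];
  for (const step of steps) {
    try {
      step.run();
      completed.push(step.label);
    } catch (_err) {
      // Stop immediately; later steps never run after a failure.
      return { status: "Failed", completed, failedAt: step.label };
    }
  }
  return { status: "Complete", completed };
}
```

The completed list is what a Revert job would consult: a failure at backfill, for instance, means the freeze and snapshot exist and partial backfill writes may need clearing.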

Note: Reasons for exclusion not backfilled; consider post-hoc annotation pass in Phase-2.

Security & Access Control

  • Admin-only create/edit Profiles/FilterSets/Stage Settings/PRISMA mapping
  • Reviewers can view criteria text on review UI (read-only)
  • Audit all decisions, session changes, reconciled commits (reviewing_audit)

Developer Checklist

Backend Tasks

  • Create Screening Profile aggregate under Project
  • Implement immutable-once-used + clone semantics
  • Add Stage Settings fields (mode, policies)
  • Implement Filter Set v2 storage + validation
  • Build FilterCompiler (validate/simplify/compile)
  • Create SelectionService with modes
  • Implement StatsService with per-caller counts
  • Add stageId param to studies endpoint
  • Create decisions route with stage context
  • Implement Migration Wizard (freeze/snapshot/backfill/rollback)
  • Add audit entries for all actions
  • Create MongoDB indexes

Frontend Tasks

  • Stage Settings UI (mode required)
  • Filter Builder component (MVP: single pass-forward rule)
  • Live count preview (debounced)
  • Wire "Get next" by mode
  • Show active profile criteria on review UI
  • Add mode banners (Reconciliation Mode)
  • Implement keyboard shortcuts (J/K, 1/2/3)
  • Empty-state guidance
  • Saved-session resume flows
  • Stats widgets

Documentation Tasks

  • Update API documentation
  • Create user guide for new features
  • Helpdesk SOP export bundle

Sprint Timeline (Indicative)

  • Sprint 1: Profiles domain/API; Stage Settings (mode required); ensure indexes; decisions write atomic tallies
  • Sprint 2: FilterSet storage (v2), Simplifier + Compiler; MVP Filter Builder; studies?stageId pool
  • Sprint 3: Selection (rand strategy) + Stats; Reviewer UI wiring; telemetry
  • Sprint 4: Migration Wizard; E2E hardening
  • Sprint 5 (Phase-2 start): PRISMA mapping + aggregator; annotation filtering storage; wildcard/partial indexes

Open Questions (Carried)

  • Should HideExcludedStudiesFromReviewers apply in Reconciliation Mode?
  • PRISMA box breakdowns beyond TA/FT (e.g., multiple phases)?