Skip to content

MongoDB Index Cleanup Analysis

Context

Index usage statistics were collected from the production database (syrftest) on the Cluster0 Atlas cluster. Stats have been tracking since 2026-03-12 (7 days of production usage at time of analysis).

The primary motivation is cost reduction: the Preview cluster (used for staging and PR preview environments) runs at M20 ($0.079/server-hour) largely because snapshot-based PR databases carry the full production index set. The pmStudy collection alone has 4.6 GB of indexes on ~5 GB of data — nearly a 1:1 ratio. Removing unused indexes reduces memory pressure, potentially allowing the Preview cluster to scale down to M10.

Summary of Findings

pmStudy — 31 indexes, 15 unused (0 ops in 7 days)

This is the most critical collection. It holds ~3.1M documents and accounts for the vast majority of index storage.

Indexes to REMOVE (0 ops, created by application code)

These indexes are created in StudyRepository.InitialiseIndexesAsync() but have zero usage on production over 7 days.

Note: The text index (Abstract_text_...) was originally listed here but has been moved to the KEEP section — it is still created by code and used by StudyController.GetTableData() via the $text query in CreateMongoFindQuery().

# Index Name Variable in Code Line Reason Unused
1 ScreeningInfo.Inclusion_1 (standalone call) 1380 Single-field on a computed ratio. All queries that filter on inclusion status use the compound InclusionInfo indexes instead.
2 ExtractionInfo.Annotations.ProjectId_1_...QuestionId_1 annotationIndex 1339-1341 Queries on annotations always go through ProjectId first, making this redundant.
3 ProjectId_1_ExtractionInfo.Annotations.QuestionId_1 projectAnnotationIndex 1342-1344 Zero usage — annotation queries don't filter by QuestionId alone.
4 ExtractionInfo.Annotations.QuestionId_1 questionIndex 1346-1347 Single-field subset of #2 and #3 above, also unused.
5 ProjectId_1_SystematicSearchId_1_ReferenceFileId_1 referenceFileIndex 1327-1330 Three-field compound index. Zero usage. The two-field ProjectId_1_SystematicSearchId_1 (110 ops) handles these queries.

Indexes to REMOVE (0 ops, NOT in application code — manually added)

These indexes exist on production but are not created by InitialiseIndexesAsync(). They were likely added manually via mongosh or Atlas UI at some point and serve no purpose:

# Index Name Ops Notes
7 ScreeningInfo.AgreementMeasure.NumberScreened_1 0 Single-field. Redundant with compound indexes that include this as a prefix.
8 ScreeningInfo.AgreementMeasure.AbsoluteAgreementRatio_1 0 Single-field. Same as above.
9 SystematicSearchId_1 0 Single-field. Covered by ProjectId_1_SystematicSearchId_1.
10 ProjectId_1 14,438 ops USED but manually added. Keep this and add to application code.
11 idx_proj_annProjId (ProjectId_1_ExtractionInfo.Annotations.ProjectId_1) 0 Manual index, never used.
12 idx_proj_scrnProjId (ProjectId_1_ScreeningInfo.Screenings.ProjectId_1) 0 Manual index, never used.
13 AnnotationInfo.Annotations.Id_1 0 Legacy schema — AnnotationInfo was renamed to ExtractionInfo.
14 AnnotationInfo.Sessions.Id_1 0 Legacy schema.
15 AnnotationInfo.Annotations.ParentId_1 0 Legacy schema.
16 AnnotationInfo.Annotations.AnnotatorId_1_ProjectId_1 0 Legacy schema.
17 ProjectId_1__id_1 344 ops Low usage, manually added. Supports paged export cursor. Keep and add to code.

Indexes to KEEP (actively used)

Index Name Ops (7 days) Purpose
_id_ 46,604 Default, cannot remove
ProjectId_1_...ExtractionInfo.Sessions.InvestigatorId_1_...Status_1_SystematicSearchId_1 35,820 Annotation session queries
ProjectId_1_...InclusionInfo...StudyScreenStatus_1_...Overscreened_1 25,127 Screening status filtering
ProjectId_1 14,438 General project filtering (manually added — should be added to code)
ProjectId_1_ScreeningInfo.Screenings.ScreenerId_1 6,670 Screener lookup
ProjectId_1_...SessionTallies...NumberOfCandidateSessions_1 4,846 Annotation allocation
ProjectId_1_...InclusionInfo...Overscreened_1 4,547 Overscreening detection
ProjectId_1_ExtractionInfo.Annotations.AnnotatorId_1 557 Annotator lookup
ProjectId_1_Screenings.ScreenerId_1 383 Legacy screener lookup
ProjectId_1__id_1 344 Paged export cursor
ProjectId_1_...AgreementMeasure...SessionTallies... 179 Compound stats query
Abstract_text_... (text index) 0 KEPT — zero ops in observation window, but actively wired to StudyController.GetTableData() via $text query. Removing requires code change first.
ProjectId_1_SystematicSearchId_1 110 Search-scoped queries
ProjectId_1_...AgreementMeasure...SystematicSearchId_1 114 Search-scoped screening
ProjectId_1_...SessionTallies...NumberOfCandidateSessions_1_SystematicSearchId_1 77 Allocation with search scope
ProjectId_1_...AgreementMeasure...DateTimeCreated_1 38 Date-sorted screening
ProjectId_1_...SessionTallies...ReconciliationStarted_1 36 Reconciliation queries

Special case: ProjectId_1_AllocationBucket_1

  • 0 ops, created 2026-03-19 (today). Brand new index, not yet used. Presumably added by recent code. Leave for now.

pmProject — 6 indexes, 5 unused

Index Name Ops Created By Code Action
_id_ 10,612 Default Keep
Keywords_text_Name_text_IsPublic_1 0 ProjectRepository L387-399 REMOVE — project text search unused
Stages.Id_1 0 ProjectRepository L378-381 REMOVE — stage lookup by ID unused
Memberships.InvestigatorId_1 0 ProjectRepository L374-376 REMOVE — membership queries go through _id
Registrations.InvestigatorId_1 0 ProjectRepository L371-373 REMOVE — registration queries go through _id
SystematicSearchIds_1 0 ProjectRepository L383-385 REMOVE — search lookup goes through _id

pmInvestigator — 6 indexes, 4 unused

Index Name Ops Created By Code Action
_id_ 60,576 Default Keep
Emails_1 534 InvestigatorRepository L35-43 Keep — used for email lookup, also unique constraint
NewEmailVerifiers.VerificationCodes.Code_1 0 InvestigatorRepository L31-34 Keep — sparse, small, supports verification feature
ChangeEmailVerify.Code_1 0 Manual / legacy REMOVE — legacy field name
Email_1 0 Manual / legacy REMOVE — old single-email field, replaced by Emails_1
ChangeEmailVerify.VerificationCodes.Code_1 0 Manual / legacy REMOVE — legacy

pmSystematicSearch — 2 indexes, both used

Index Name Ops Action
_id_ 26 Keep
ProjectIds_1 3,209 Keep

pmRiskOfBiasAiJob — 3 indexes, all unused

Index Name Ops Created By Code Action
_id_ 0 Default Keep (mandatory)
StudyId_1 0 RiskOfBiasAiJobRepository L23-24 REMOVE — feature not in active use
ProjectId_1_StudyId_1 0 RiskOfBiasAiJobRepository L25-27 REMOVE — same

Impact Summary

Indexes to remove from application code

Repository File Indexes to Remove Est. Size Saved (per DB)
StudyRepository StudyRepository.cs 5 (annotationIndex, projectAnnotationIndex, questionIndex, referenceFileIndex, ScreeningInfo.Inclusion) ~2-3 GB
ProjectRepository ProjectRepository.cs 5 (textIndex, stageIdIndex, membershipIndex, registrationIndex, systematicSearchIndex) Minimal
RiskOfBiasAiJobRepository RiskOfBiasAiJobRepository.cs 2 (StudyId, ProjectId+StudyId) Minimal

Indexes to remove manually from databases (not in code)

11 manually-added or legacy indexes on pmStudy and 3 on pmInvestigator need to be dropped directly from the database. See the companion document index-cleanup-commands.md.

Indexes to ADD to application code

Index Ops Notes
ProjectId_1 on pmStudy 14,438 Heavily used but not created by code. Add to StudyRepository.InitialiseIndexesAsync().
ProjectId_1__id_1 on pmStudy 344 Supports paged export. Add to StudyRepository.InitialiseIndexesAsync().

Total estimated savings

  • Per snapshot-based PR database: ~2.5-3 GB index reduction (from ~4.6 GB to ~1.5-2 GB)
  • Production database: ~2.5-3 GB index reduction (from ~5.5 GB to ~2.5-3 GB)
  • Preview cluster impact: With reduced index footprint, the working set may fit within M10's 2 GB RAM, enabling the cluster to scale down from M20 — saving ~$100/month

Risks and Mitigations

  1. Text search feature: The text index on pmStudy is kept in this PR. It is still created by InitialiseIndexesAsync() and used by StudyController.GetTableData() via the $text query in CreateMongoFindQuery(). Dropping the text index requires removing or replacing the $text query code first.

  2. Re-adding indexes: If a removed index turns out to be needed, it can be recreated. On 3.1M documents, index builds take a few minutes and happen in the background on Atlas without downtime.

  3. NewEmailVerifiers.VerificationCodes.Code_1: Zero ops, but this is used for email verification. If no one has verified an email in 7 days, that's expected. Recommendation: Keep this index — it supports a correctness feature, not a performance one. It's also sparse and very small.