MongoDB Index Cleanup Analysis¶

Context¶

Index usage statistics were collected from the production database (syrftest) on the Cluster0 Atlas cluster. Stats have been tracking since 2026-03-12 (7 days of production usage at time of analysis).

The primary motivation is cost reduction: the Preview cluster (used for staging and PR preview environments) runs at M20 ($0.079/server-hour) largely because snapshot-based PR databases carry the full production index set. The pmStudy collection alone has 4.6 GB of indexes on ~5 GB of data — nearly a 1:1 ratio. Removing unused indexes reduces memory pressure, potentially allowing the Preview cluster to scale down to M10.

Summary of Findings¶

pmStudy — 31 indexes, 15 unused (0 ops in 7 days)¶

This is the most critical collection. It holds ~3.1M documents and accounts for the vast majority of index storage.

Indexes to REMOVE (0 ops, created by application code)¶

These indexes are created in StudyRepository.InitialiseIndexesAsync() but have zero usage on production over 7 days.

Note: The text index (Abstract_text_...) was originally listed here but has been moved to the KEEP section — it is still created by code and used by StudyController.GetTableData() via the $text query in CreateMongoFindQuery().

#	Index Name	Variable in Code	Line	Reason Unused
1	`ScreeningInfo.Inclusion_1`	(standalone call)	1380	Single-field on a computed ratio. All queries that filter on inclusion status use the compound `InclusionInfo` indexes instead.
2	`ExtractionInfo.Annotations.ProjectId_1_...QuestionId_1`	`annotationIndex`	1339-1341	Queries on annotations always go through `ProjectId` first, making this redundant.
3	`ProjectId_1_ExtractionInfo.Annotations.QuestionId_1`	`projectAnnotationIndex`	1342-1344	Zero usage — annotation queries don't filter by QuestionId alone.
4	`ExtractionInfo.Annotations.QuestionId_1`	`questionIndex`	1346-1347	Single-field subset of #2 and #3 above, also unused.
5	`ProjectId_1_SystematicSearchId_1_ReferenceFileId_1`	`referenceFileIndex`	1327-1330	Three-field compound index. Zero usage. The two-field `ProjectId_1_SystematicSearchId_1` (110 ops) handles these queries.

Indexes to REMOVE (0 ops, NOT in application code — manually added)¶

These indexes exist on production but are not created by InitialiseIndexesAsync(). They were likely added manually via mongosh or Atlas UI at some point and serve no purpose:

#	Index Name	Ops	Notes
7	`ScreeningInfo.AgreementMeasure.NumberScreened_1`	0	Single-field. Redundant with compound indexes that include this as a prefix.
8	`ScreeningInfo.AgreementMeasure.AbsoluteAgreementRatio_1`	0	Single-field. Same as above.
9	`SystematicSearchId_1`	0	Single-field. Covered by `ProjectId_1_SystematicSearchId_1`.
10	`ProjectId_1`	14,438 ops	USED but manually added. Keep this and add to application code.
11	`idx_proj_annProjId` (`ProjectId_1_ExtractionInfo.Annotations.ProjectId_1`)	0	Manual index, never used.
12	`idx_proj_scrnProjId` (`ProjectId_1_ScreeningInfo.Screenings.ProjectId_1`)	0	Manual index, never used.
13	`AnnotationInfo.Annotations.Id_1`	0	Legacy schema — `AnnotationInfo` was renamed to `ExtractionInfo`.
14	`AnnotationInfo.Sessions.Id_1`	0	Legacy schema.
15	`AnnotationInfo.Annotations.ParentId_1`	0	Legacy schema.
16	`AnnotationInfo.Annotations.AnnotatorId_1_ProjectId_1`	0	Legacy schema.
17	`ProjectId_1__id_1`	344 ops	Low usage, manually added. Supports paged export cursor. Keep and add to code.

Indexes to KEEP (actively used)¶

Index Name	Ops (7 days)	Purpose
`_id_`	46,604	Default, cannot remove
`ProjectId_1_...ExtractionInfo.Sessions.InvestigatorId_1_...Status_1_SystematicSearchId_1`	35,820	Annotation session queries
`ProjectId_1_...InclusionInfo...StudyScreenStatus_1_...Overscreened_1`	25,127	Screening status filtering
`ProjectId_1`	14,438	General project filtering (manually added — should be added to code)
`ProjectId_1_ScreeningInfo.Screenings.ScreenerId_1`	6,670	Screener lookup
`ProjectId_1_...SessionTallies...NumberOfCandidateSessions_1`	4,846	Annotation allocation
`ProjectId_1_...InclusionInfo...Overscreened_1`	4,547	Overscreening detection
`ProjectId_1_ExtractionInfo.Annotations.AnnotatorId_1`	557	Annotator lookup
`ProjectId_1_Screenings.ScreenerId_1`	383	Legacy screener lookup
`ProjectId_1__id_1`	344	Paged export cursor
`ProjectId_1_...AgreementMeasure...SessionTallies...`	179	Compound stats query
`Abstract_text_...` (text index)	0	KEPT — zero ops in observation window, but actively wired to `StudyController.GetTableData()` via `$text` query. Removing requires code change first.
`ProjectId_1_SystematicSearchId_1`	110	Search-scoped queries
`ProjectId_1_...AgreementMeasure...SystematicSearchId_1`	114	Search-scoped screening
`ProjectId_1_...SessionTallies...NumberOfCandidateSessions_1_SystematicSearchId_1`	77	Allocation with search scope
`ProjectId_1_...AgreementMeasure...DateTimeCreated_1`	38	Date-sorted screening
`ProjectId_1_...SessionTallies...ReconciliationStarted_1`	36	Reconciliation queries

Special case: `ProjectId_1_AllocationBucket_1`¶

0 ops, created 2026-03-19 (today). Brand new index, not yet used. Presumably added by recent code. Leave for now.

pmProject — 6 indexes, 5 unused¶

Index Name	Ops	Created By Code	Action
`_id_`	10,612	Default	Keep
`Keywords_text_Name_text_IsPublic_1`	0	`ProjectRepository` L387-399	REMOVE — project text search unused
`Stages.Id_1`	0	`ProjectRepository` L378-381	REMOVE — stage lookup by ID unused
`Memberships.InvestigatorId_1`	0	`ProjectRepository` L374-376	REMOVE — membership queries go through `_id`
`Registrations.InvestigatorId_1`	0	`ProjectRepository` L371-373	REMOVE — registration queries go through `_id`
`SystematicSearchIds_1`	0	`ProjectRepository` L383-385	REMOVE — search lookup goes through `_id`

pmInvestigator — 6 indexes, 4 unused¶

Index Name	Ops	Created By Code	Action
`_id_`	60,576	Default	Keep
`Emails_1`	534	`InvestigatorRepository` L35-43	Keep — used for email lookup, also unique constraint
`NewEmailVerifiers.VerificationCodes.Code_1`	0	`InvestigatorRepository` L31-34	Keep — sparse, small, supports verification feature
`ChangeEmailVerify.Code_1`	0	Manual / legacy	REMOVE — legacy field name
`Email_1`	0	Manual / legacy	REMOVE — old single-email field, replaced by `Emails_1`
`ChangeEmailVerify.VerificationCodes.Code_1`	0	Manual / legacy	REMOVE — legacy

pmSystematicSearch — 2 indexes, both used¶

Index Name	Ops	Action
`_id_`	26	Keep
`ProjectIds_1`	3,209	Keep

pmRiskOfBiasAiJob — 3 indexes, all unused¶

Index Name	Created By Code	Action
`_id_`	Default	Keep (mandatory)
`StudyId_1`	`RiskOfBiasAiJobRepository` L23-24	REMOVE — feature not in active use
`ProjectId_1_StudyId_1`	`RiskOfBiasAiJobRepository` L25-27	REMOVE — same

Impact Summary¶

Indexes to remove from application code¶

Repository	File	Indexes to Remove	Est. Size Saved (per DB)
`StudyRepository`	`StudyRepository.cs`	5 (annotationIndex, projectAnnotationIndex, questionIndex, referenceFileIndex, ScreeningInfo.Inclusion)	~2-3 GB
`ProjectRepository`	`ProjectRepository.cs`	5 (textIndex, stageIdIndex, membershipIndex, registrationIndex, systematicSearchIndex)	Minimal
`RiskOfBiasAiJobRepository`	`RiskOfBiasAiJobRepository.cs`	2 (StudyId, ProjectId+StudyId)	Minimal

Indexes to remove manually from databases (not in code)¶

11 manually-added or legacy indexes on pmStudy and 3 on pmInvestigator need to be dropped directly from the database. See the companion document index-cleanup-commands.md.

Indexes to ADD to application code¶

Index	Ops	Notes
`ProjectId_1` on pmStudy	14,438	Heavily used but not created by code. Add to `StudyRepository.InitialiseIndexesAsync()`.
`ProjectId_1__id_1` on pmStudy	344	Supports paged export. Add to `StudyRepository.InitialiseIndexesAsync()`.

Total estimated savings¶

Per snapshot-based PR database: ~2.5-3 GB index reduction (from ~4.6 GB to ~1.5-2 GB)
Production database: ~2.5-3 GB index reduction (from ~5.5 GB to ~2.5-3 GB)
Preview cluster impact: With reduced index footprint, the working set may fit within M10's 2 GB RAM, enabling the cluster to scale down from M20 — saving ~$100/month

Risks and Mitigations¶

Text search feature: The text index on pmStudy is kept in this PR. It is still created by InitialiseIndexesAsync() and used by StudyController.GetTableData() via the $text query in CreateMongoFindQuery(). Dropping the text index requires removing or replacing the $text query code first.
Re-adding indexes: If a removed index turns out to be needed, it can be recreated. On 3.1M documents, index builds take a few minutes and happen in the background on Atlas without downtime.
NewEmailVerifiers.VerificationCodes.Code_1: Zero ops, but this is used for email verification. If no one has verified an email in 7 days, that's expected. Recommendation: Keep this index — it supports a correctness feature, not a performance one. It's also sparse and very small.