MongoDB Index Cleanup Analysis¶
Context¶
Index usage statistics were collected from the production database (syrftest) on the Cluster0 Atlas cluster. Stats have been tracking since 2026-03-12 (7 days of production usage at time of analysis).
The primary motivation is cost reduction: the Preview cluster (used for staging and PR preview environments) runs at M20 ($0.079/server-hour) largely because snapshot-based PR databases carry the full production index set. The pmStudy collection alone has 4.6 GB of indexes on ~5 GB of data — nearly a 1:1 ratio. Removing unused indexes reduces memory pressure, potentially allowing the Preview cluster to scale down to M10.
Summary of Findings¶
pmStudy — 31 indexes, 15 unused (0 ops in 7 days)¶
This is the most critical collection. It holds ~3.1M documents and accounts for the vast majority of index storage.
Indexes to REMOVE (0 ops, created by application code)¶
These indexes are created in StudyRepository.InitialiseIndexesAsync() but have zero usage on production over 7 days.
Note: The text index (
Abstract_text_...) was originally listed here but has been moved to the KEEP section — it is still created by code and used byStudyController.GetTableData()via the$textquery inCreateMongoFindQuery().
| # | Index Name | Variable in Code | Line | Reason Unused |
|---|---|---|---|---|
| 1 | ScreeningInfo.Inclusion_1 |
(standalone call) | 1380 | Single-field on a computed ratio. All queries that filter on inclusion status use the compound InclusionInfo indexes instead. |
| 2 | ExtractionInfo.Annotations.ProjectId_1_...QuestionId_1 |
annotationIndex |
1339-1341 | Queries on annotations always go through ProjectId first, making this redundant. |
| 3 | ProjectId_1_ExtractionInfo.Annotations.QuestionId_1 |
projectAnnotationIndex |
1342-1344 | Zero usage — annotation queries don't filter by QuestionId alone. |
| 4 | ExtractionInfo.Annotations.QuestionId_1 |
questionIndex |
1346-1347 | Single-field subset of #2 and #3 above, also unused. |
| 5 | ProjectId_1_SystematicSearchId_1_ReferenceFileId_1 |
referenceFileIndex |
1327-1330 | Three-field compound index. Zero usage. The two-field ProjectId_1_SystematicSearchId_1 (110 ops) handles these queries. |
Indexes to REMOVE (0 ops, NOT in application code — manually added)¶
These indexes exist on production but are not created by InitialiseIndexesAsync(). They were likely added manually via mongosh or Atlas UI at some point and serve no purpose:
| # | Index Name | Ops | Notes |
|---|---|---|---|
| 7 | ScreeningInfo.AgreementMeasure.NumberScreened_1 |
0 | Single-field. Redundant with compound indexes that include this as a prefix. |
| 8 | ScreeningInfo.AgreementMeasure.AbsoluteAgreementRatio_1 |
0 | Single-field. Same as above. |
| 9 | SystematicSearchId_1 |
0 | Single-field. Covered by ProjectId_1_SystematicSearchId_1. |
| 10 | ProjectId_1 |
14,438 ops | USED but manually added. Keep this and add to application code. |
| 11 | idx_proj_annProjId (ProjectId_1_ExtractionInfo.Annotations.ProjectId_1) |
0 | Manual index, never used. |
| 12 | idx_proj_scrnProjId (ProjectId_1_ScreeningInfo.Screenings.ProjectId_1) |
0 | Manual index, never used. |
| 13 | AnnotationInfo.Annotations.Id_1 |
0 | Legacy schema — AnnotationInfo was renamed to ExtractionInfo. |
| 14 | AnnotationInfo.Sessions.Id_1 |
0 | Legacy schema. |
| 15 | AnnotationInfo.Annotations.ParentId_1 |
0 | Legacy schema. |
| 16 | AnnotationInfo.Annotations.AnnotatorId_1_ProjectId_1 |
0 | Legacy schema. |
| 17 | ProjectId_1__id_1 |
344 ops | Low usage, manually added. Supports paged export cursor. Keep and add to code. |
Indexes to KEEP (actively used)¶
| Index Name | Ops (7 days) | Purpose |
|---|---|---|
_id_ |
46,604 | Default, cannot remove |
ProjectId_1_...ExtractionInfo.Sessions.InvestigatorId_1_...Status_1_SystematicSearchId_1 |
35,820 | Annotation session queries |
ProjectId_1_...InclusionInfo...StudyScreenStatus_1_...Overscreened_1 |
25,127 | Screening status filtering |
ProjectId_1 |
14,438 | General project filtering (manually added — should be added to code) |
ProjectId_1_ScreeningInfo.Screenings.ScreenerId_1 |
6,670 | Screener lookup |
ProjectId_1_...SessionTallies...NumberOfCandidateSessions_1 |
4,846 | Annotation allocation |
ProjectId_1_...InclusionInfo...Overscreened_1 |
4,547 | Overscreening detection |
ProjectId_1_ExtractionInfo.Annotations.AnnotatorId_1 |
557 | Annotator lookup |
ProjectId_1_Screenings.ScreenerId_1 |
383 | Legacy screener lookup |
ProjectId_1__id_1 |
344 | Paged export cursor |
ProjectId_1_...AgreementMeasure...SessionTallies... |
179 | Compound stats query |
Abstract_text_... (text index) |
0 | KEPT — zero ops in observation window, but actively wired to StudyController.GetTableData() via $text query. Removing requires code change first. |
ProjectId_1_SystematicSearchId_1 |
110 | Search-scoped queries |
ProjectId_1_...AgreementMeasure...SystematicSearchId_1 |
114 | Search-scoped screening |
ProjectId_1_...SessionTallies...NumberOfCandidateSessions_1_SystematicSearchId_1 |
77 | Allocation with search scope |
ProjectId_1_...AgreementMeasure...DateTimeCreated_1 |
38 | Date-sorted screening |
ProjectId_1_...SessionTallies...ReconciliationStarted_1 |
36 | Reconciliation queries |
Special case: ProjectId_1_AllocationBucket_1¶
- 0 ops, created 2026-03-19 (today). Brand new index, not yet used. Presumably added by recent code. Leave for now.
pmProject — 6 indexes, 5 unused¶
| Index Name | Ops | Created By Code | Action |
|---|---|---|---|
_id_ |
10,612 | Default | Keep |
Keywords_text_Name_text_IsPublic_1 |
0 | ProjectRepository L387-399 |
REMOVE — project text search unused |
Stages.Id_1 |
0 | ProjectRepository L378-381 |
REMOVE — stage lookup by ID unused |
Memberships.InvestigatorId_1 |
0 | ProjectRepository L374-376 |
REMOVE — membership queries go through _id |
Registrations.InvestigatorId_1 |
0 | ProjectRepository L371-373 |
REMOVE — registration queries go through _id |
SystematicSearchIds_1 |
0 | ProjectRepository L383-385 |
REMOVE — search lookup goes through _id |
pmInvestigator — 6 indexes, 4 unused¶
| Index Name | Ops | Created By Code | Action |
|---|---|---|---|
_id_ |
60,576 | Default | Keep |
Emails_1 |
534 | InvestigatorRepository L35-43 |
Keep — used for email lookup, also unique constraint |
NewEmailVerifiers.VerificationCodes.Code_1 |
0 | InvestigatorRepository L31-34 |
Keep — sparse, small, supports verification feature |
ChangeEmailVerify.Code_1 |
0 | Manual / legacy | REMOVE — legacy field name |
Email_1 |
0 | Manual / legacy | REMOVE — old single-email field, replaced by Emails_1 |
ChangeEmailVerify.VerificationCodes.Code_1 |
0 | Manual / legacy | REMOVE — legacy |
pmSystematicSearch — 2 indexes, both used¶
| Index Name | Ops | Action |
|---|---|---|
_id_ |
26 | Keep |
ProjectIds_1 |
3,209 | Keep |
pmRiskOfBiasAiJob — 3 indexes, all unused¶
| Index Name | Ops | Created By Code | Action |
|---|---|---|---|
_id_ |
0 | Default | Keep (mandatory) |
StudyId_1 |
0 | RiskOfBiasAiJobRepository L23-24 |
REMOVE — feature not in active use |
ProjectId_1_StudyId_1 |
0 | RiskOfBiasAiJobRepository L25-27 |
REMOVE — same |
Impact Summary¶
Indexes to remove from application code¶
| Repository | File | Indexes to Remove | Est. Size Saved (per DB) |
|---|---|---|---|
StudyRepository |
StudyRepository.cs |
5 (annotationIndex, projectAnnotationIndex, questionIndex, referenceFileIndex, ScreeningInfo.Inclusion) | ~2-3 GB |
ProjectRepository |
ProjectRepository.cs |
5 (textIndex, stageIdIndex, membershipIndex, registrationIndex, systematicSearchIndex) | Minimal |
RiskOfBiasAiJobRepository |
RiskOfBiasAiJobRepository.cs |
2 (StudyId, ProjectId+StudyId) | Minimal |
Indexes to remove manually from databases (not in code)¶
11 manually-added or legacy indexes on pmStudy and 3 on pmInvestigator need to be dropped directly from the database. See the companion document index-cleanup-commands.md.
Indexes to ADD to application code¶
| Index | Ops | Notes |
|---|---|---|
ProjectId_1 on pmStudy |
14,438 | Heavily used but not created by code. Add to StudyRepository.InitialiseIndexesAsync(). |
ProjectId_1__id_1 on pmStudy |
344 | Supports paged export. Add to StudyRepository.InitialiseIndexesAsync(). |
Total estimated savings¶
- Per snapshot-based PR database: ~2.5-3 GB index reduction (from ~4.6 GB to ~1.5-2 GB)
- Production database: ~2.5-3 GB index reduction (from ~5.5 GB to ~2.5-3 GB)
- Preview cluster impact: With reduced index footprint, the working set may fit within M10's 2 GB RAM, enabling the cluster to scale down from M20 — saving ~$100/month
Risks and Mitigations¶
-
Text search feature: The text index on
pmStudyis kept in this PR. It is still created byInitialiseIndexesAsync()and used byStudyController.GetTableData()via the$textquery inCreateMongoFindQuery(). Dropping the text index requires removing or replacing the$textquery code first. -
Re-adding indexes: If a removed index turns out to be needed, it can be recreated. On 3.1M documents, index builds take a few minutes and happen in the background on Atlas without downtime.
-
NewEmailVerifiers.VerificationCodes.Code_1: Zero ops, but this is used for email verification. If no one has verified an email in 7 days, that's expected. Recommendation: Keep this index — it supports a correctness feature, not a performance one. It's also sparse and very small.