Systematic Search Upload and Study Creation Flow

This document describes the complete data flow for uploading systematic search reference files through the SyRF platform, from the Angular frontend through to the creation of Study records in MongoDB.

Overview

The systematic search upload process is a multi-service, event-driven workflow that:

  1. Uploads reference library files (CSV, TSV, Endnote XML, PubMed XML, Living Search JSON) to S3
  2. Triggers a state machine in the Project Management service
  3. Parses the uploaded files to extract study metadata
  4. Creates Study aggregate roots in MongoDB

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                              ANGULAR FRONTEND                                │
│  ┌──────────────────┐    ┌─────────────────┐    ┌────────────────────────┐  │
│  │ CreateSearch     │───▶│ ProjectDetail   │───▶│ S3FileService          │  │
│  │ Component        │    │ Effects         │    │ (presigned upload)     │  │
│  └──────────────────┘    └─────────────────┘    └──────────┬─────────────┘  │
└──────────────────────────────────────────────────────────────┼──────────────┘
                         ┌─────────────────────────────────────┼──────────────┐
                         │              API SERVICE            │              │
                         │  ┌──────────────────┐               │              │
                         │  │ SearchController │               │              │
                         │  │ GetS3Signature() │──────┐        │              │
                         │  └──────────────────┘      │        │              │
                         └────────────────────────────┼────────┼──────────────┘
                                                      │        │
                                                      ▼        ▼
┌─────────────────────────────┐              ┌────────────────────────────────┐
│        RABBITMQ             │◀────────────▶│          AWS S3                │
│  ISearchUploadStartedEvent  │              │   (Reference File Storage)     │
│  ISearchUploadSavedToS3Event│              └────────────────┬───────────────┘
└──────────────┬──────────────┘                               │
               │                                              │
               │                    ┌─────────────────────────┼───────────────┐
               │                    │     S3 NOTIFIER LAMBDA  │               │
               │                    │  ┌─────────────────────────────────┐    │
               │                    │  │ S3FileReceivedHandler           │    │
               │                    │  │ (S3:ObjectCreated:Put trigger)  │────┘
               │                    │  └─────────────────────────────────┘
               │                    └─────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│                      PROJECT MANAGEMENT SERVICE                               │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                   SearchImportJobStateMachine                          │  │
│  │  ┌──────────┐    ┌──────────┐    ┌─────────┐    ┌───────────┐         │  │
│  │  │ Initial  │───▶│Uploading │───▶│Uploaded │───▶│  Parsing  │──┬──────┤  │
│  │  └──────────┘    └──────────┘    └─────────┘    └───────────┘  │      │  │
│  │       │                                                        │      │  │
│  │       │ (S3 event first)                                       │      │  │
│  │       └───────────────────────────────────────▶ (to Parsing)   │      │  │
│  │                                 ┌────────────┐  ┌─────────┐   │      │  │
│  │                                 │ Completed  │◀─┤         │   │      │  │
│  │                                 └────────────┘  │  Error  │◀──┘      │  │
│  │                                                 └─────────┘          │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                   ReferenceFileParseJobConsumer                        │  │
│  │  ┌──────────────────────┐    ┌────────────────────┐                   │  │
│  │  │ ProjectManagement    │───▶│ StudyReferenceFile │───▶ MongoDB       │  │
│  │  │ Service              │    │ Parser             │    (Studies)      │  │
│  │  └──────────────────────┘    └────────────────────┘                   │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────────┘

Sequence Diagram

sequenceDiagram
    participant User
    participant Angular as Angular Frontend
    participant API as API Service
    participant S3
    participant Lambda as S3 Notifier Lambda
    participant RabbitMQ
    participant PM as Project Management
    participant MongoDB

    User->>Angular: Select reference file
    Angular->>Angular: Validate file format
    Angular->>Angular: Calculate SHA-256 hash
    Angular->>API: POST /api/projects/{id}/searches/getSignature

    Note over API: Generate ReferenceFileId (UUID)
    Note over API: Build S3 metadata headers
    Note over API: Sign with AWS Sig V4

    API->>RabbitMQ: Publish ISearchUploadStartedEvent
    API-->>Angular: Return presigned URL + headers

    RabbitMQ->>PM: Deliver ISearchUploadStartedEvent
    PM->>PM: Create SearchImportJob (Uploading state)
    PM->>PM: Schedule 60-min timeout

    Angular->>S3: PUT file (presigned URL)
    S3-->>Angular: 200 OK

    S3->>Lambda: S3:ObjectCreated:Put event
    Lambda->>Lambda: Extract metadata from headers
    Lambda->>RabbitMQ: Publish ISearchUploadSavedToS3Event

    RabbitMQ->>PM: Deliver ISearchUploadSavedToS3Event
    PM->>PM: Transition to Uploaded state
    PM->>PM: Execute StartParseJobsActivity
    PM->>RabbitMQ: Publish IStartParsingReferenceFileCommand

    loop For each reference file
        RabbitMQ->>PM: Deliver parse command
        PM->>S3: Download file
        PM->>PM: Parse records
        PM->>MongoDB: Batch insert Studies (5000/batch)
        PM->>RabbitMQ: Publish IReferenceFileParsingCompletedEvent
    end

    PM->>PM: All files parsed → Completed state
    PM->>MongoDB: Create SystematicSearch
    PM-->>User: Search ready for screening

Detailed Flow

Phase 1: Frontend File Selection and Validation

Component: CreateSearchComponent (src/services/web/src/app/project/project-overview/create-search/create-search.component.ts)

  1. User selects a reference library file in the Create Search dialog
  2. Frontend validates the file format:
     • CSV/TSV: Checks for required column headers (Title, Authors, PublicationName, etc.)
     • XML: Validates Endnote or PubMed XML structure
  3. User optionally configures screening import settings (mapping columns to investigators)
  4. User submits the form, dispatching projectDetailActions.startSearchImport()

Supported File Types (LibraryFileType enum):

| Value | Type | Description |
|---|---|---|
| 0 | CSV | Comma-separated values |
| 1 | TSV | Tab-separated values |
| 2 | EndnoteXml | Endnote XML export |
| 3 | PubmedXml | PubMed XML format |
| 4 | LivingSearchJson | Living search JSON format |

Required CSV/TSV Columns:

| Column | Description |
|---|---|
| title | Study title |
| authors | Author list |
| publicationName | Journal/publication name |
| alternateName | Alternative publication name |
| abstract | Study abstract |
| url | Link to study |
| authorAddress | Author contact information |
| year | Publication year |
| doi | Digital Object Identifier |
| referenceType | Type of reference |
| pdfRelativePath | Path to PDF file |
| keywords | Study keywords |
| customId | User-defined identifier |

Phase 2: S3 Presigning Deep Dive

The presigning process allows the frontend to upload files directly to S3 without the file passing through our API servers. This is critical for large files and reduces server load.

Effect: project-detail.effects.ts (addSearch$ effect) (src/services/web/src/app/core/services/project/project-detail.effects.ts)

  1. Effect generates SHA-256 hash of the file content
  2. Calls SearchService.getS3RequestSignature() with file metadata
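The hashing step can be sketched as follows. This is an illustrative stand-in, not the SyRF implementation: it assumes the frontend hashes the raw file bytes to a lowercase hex digest (the value later sent as x-amz-content-sha256), and it uses Node's createHash where a browser would use the Web Crypto API (crypto.subtle.digest).

```typescript
// Hypothetical sketch of the client-side hash step. In a browser this would
// use crypto.subtle.digest("SHA-256", ...); Node's createHash is equivalent.
import { createHash } from "node:crypto";

// Returns the lowercase hex SHA-256 digest of the file's bytes.
function sha256Hex(fileBytes: Uint8Array): string {
  return createHash("sha256").update(fileBytes).digest("hex");
}
```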

API Endpoint: POST /api/projects/{projectId}/searches/getSignature

Controller: SearchController.cs (lines 92-156) (src/services/api/SyRF.API.Endpoint/Controllers/SearchController.cs)

AWS Signature Version 4 Process

The API uses AWS Signature Version 4 (SigV4) to create presigned URLs. This is implemented in S3PostSigner.cs (src/libs/appservices/SyRF.AppServices/Services/S3PostSigner.cs).

Signing Algorithm (AWS4-HMAC-SHA256):

StringToSign = Algorithm + '\n' +
               RequestDateTime + '\n' +
               CredentialScope + '\n' +
               HashedCanonicalRequest

CredentialScope = Date + '/' + Region + '/' + Service + '/aws4_request'

Signature = HMAC-SHA256(SigningKey, StringToSign)

Key Derivation Chain:

kSecret  = "AWS4" + SecretAccessKey
kDate    = HMAC-SHA256(kSecret, Date)
kRegion  = HMAC-SHA256(kDate, Region)      // "eu-west-1"
kService = HMAC-SHA256(kRegion, Service)   // "s3"
kSigning = HMAC-SHA256(kService, "aws4_request")
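The key derivation chain above translates directly into code. This is a generic SigV4 sketch (function names are illustrative; the real logic lives in S3PostSigner.cs), verifiable against the worked example in the AWS Signature Version 4 documentation.

```typescript
// Sketch of the SigV4 signing-key derivation chain shown above.
import { createHmac } from "node:crypto";

function hmac(key: Buffer | string, data: string): Buffer {
  return createHmac("sha256", key).update(data, "utf8").digest();
}

// Derives kSigning from the secret key, date stamp (yyyymmdd), region, and
// service, following kSecret -> kDate -> kRegion -> kService -> kSigning.
function deriveSigningKey(
  secret: string, date: string, region: string, service: string
): Buffer {
  const kDate = hmac("AWS4" + secret, date);
  const kRegion = hmac(kDate, region);
  const kService = hmac(kRegion, service);
  return hmac(kService, "aws4_request");
}
```

The derived key is then used as the HMAC key when signing the StringToSign.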

Canonical Request Format:

CanonicalRequest = HTTPMethod + '\n' +
                   CanonicalURI + '\n' +
                   CanonicalQueryString + '\n' +
                   CanonicalHeaders + '\n' +
                   SignedHeaders + '\n' +
                   HashedPayload

What the API Does

  1. Validates the search ID and search name
  2. Generates a unique referenceFileId (UUID) for tracking
  3. Constructs S3 key: Projects/{projectId}/Imported Search Libraries/SyRF Library - {searchId}.{ext}
  4. Builds metadata headers (x-amz-meta-*):
     • x-amz-meta-projectid: Project GUID
     • x-amz-meta-searchid: Search GUID
     • x-amz-meta-searchname: URL-encoded search name
     • x-amz-meta-description: URL-encoded description
     • x-amz-meta-referencefiles: JSON-serialized ReferenceFileInfo[]
  5. Signs the request using AWS SigV4
  6. Publishes ISearchUploadStartedEvent to RabbitMQ (starts the state machine)
  7. Returns the presigned URL, headers, and metadata to the frontend
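The key construction and metadata header steps can be sketched as below. The helper names (buildS3Key, buildMetadataHeaders) are hypothetical; the actual logic is in SearchController.cs.

```typescript
// Hypothetical sketch of S3 key and x-amz-meta-* header construction.
function buildS3Key(projectId: string, searchId: string, ext: string): string {
  return `Projects/${projectId}/Imported Search Libraries/SyRF Library - ${searchId}.${ext}`;
}

function buildMetadataHeaders(
  projectId: string, searchId: string, searchName: string,
  description: string, referenceFilesJson: string
): Record<string, string> {
  return {
    "x-amz-meta-projectid": projectId,
    "x-amz-meta-searchid": searchId,
    // Names and descriptions are URL-encoded so they survive as header values.
    "x-amz-meta-searchname": encodeURIComponent(searchName),
    "x-amz-meta-description": encodeURIComponent(description),
    "x-amz-meta-referencefiles": referenceFilesJson,
  };
}
```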

S3 Configuration

  • Bucket: Configured via AWS_S3_BUCKET_NAME environment variable
  • Region: eu-west-1
  • CORS: Must allow PUT from web origins
  • Presign expiry: Configurable; the implementation default applies when not set

Phase 3: Direct S3 Upload

Service: S3FileService (src/services/web/src/app/core/services/s3-file.service.ts)

  1. Frontend performs HTTP PUT directly to S3 using presigned URL
  2. Includes all metadata headers from signature response
  3. Tracks upload progress via HttpEventType.UploadProgress
  4. S3 stores file with metadata in x-amz-meta-* headers

Phase 4: S3 Event Notification (Lambda)

Lambda Function: S3FileReceivedFunction.cs (src/services/s3-notifier/SyRF.S3FileSavedNotifier.Endpoint/S3FileReceivedFunction.cs)

When S3 receives the file:

  1. S3 triggers an S3:ObjectCreated:Put event
  2. The Lambda function syrfAppUploadS3Notifier receives the event
  3. Extracts metadata from the S3 object headers
  4. Publishes ISearchUploadSavedToS3Event to RabbitMQ with:
     • ProjectId, SearchId, SearchName, Description
     • ReferenceFiles (deserialized from the JSON header)
     • DateTimeEventOccurred
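The metadata extraction step can be sketched as below. This is an assumption-laden simplification: it models the metadata as a flat map with the x-amz-meta- prefix already stripped (as AWS SDK metadata maps typically present it), and the function name is illustrative rather than taken from S3FileReceivedFunction.cs.

```typescript
// Hypothetical sketch of the Lambda's metadata-to-event mapping.
interface ReferenceFileInfo {
  ReferenceFileId: string;
  FileUrl: string;
  LibraryFileType: number;
}

function extractUploadEvent(meta: Record<string, string>) {
  return {
    projectId: meta["projectid"],
    searchId: meta["searchid"],
    // Name/description were URL-encoded by the API, so decode them here.
    searchName: decodeURIComponent(meta["searchname"] ?? ""),
    description: decodeURIComponent(meta["description"] ?? ""),
    // The reference file list travels as a JSON-serialized header value.
    referenceFiles: JSON.parse(meta["referencefiles"] ?? "[]") as ReferenceFileInfo[],
  };
}
```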

Phase 5: State Machine Orchestration

State Machine: SearchImportJobStateMachine.cs (src/services/project-management/SyRF.ProjectManagement.Endpoint/Sagas/SearchImportJobStateMachine.cs)

The MassTransit state machine manages the import workflow.

States

| State | Description |
|---|---|
| Initial | State machine not yet started |
| Uploading | API has been notified, waiting for S3 confirmation |
| Uploaded | File successfully stored in S3 |
| Parsing | Reference files being parsed into studies |
| Completed | All parsing complete, SystematicSearch created. Saga persists with CompletedAt timestamp; cleaned up by TTL index after 7 days. |
| Error | Failure occurred at any stage. Saga persists with CompletedAt timestamp; cleaned up by TTL index after 7 days. |

SearchImportJobStatus Enum:

| Value | Status | Description |
|---|---|---|
| 0 | Uploading | Initial upload in progress |
| 1 | Parsing | File parsing underway |
| 2 | Complete | Successfully finished |
| 3 | Error | Failed with error |

Dual-Path Entry (Race Condition Handling)

The state machine handles a potential race condition where events can arrive in either order:

Path A (Normal flow - API event first):

Initial → [ISearchUploadStartedEvent] → Uploading → [ISearchUploadSavedToS3Event] → Uploaded

Path B (S3 event arrives first):

Initial → [ISearchUploadSavedToS3Event] → Parsing

In Path B, SearchUploadSaved arriving first goes directly through Initially to Parsing (via CreateSearchImportJobActivity, SetFileReceivedActivity, and StartParseJobsActivity). When SearchUploadStarted arrives later, it is handled by During(Uploading, Uploaded, Parsing) which sets metadata but does not re-trigger parsing.

This ensures the workflow completes correctly regardless of which event arrives first.
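The convergence of the two paths can be illustrated with a pure transition function. This is a deliberately minimal sketch, not the MassTransit saga: the Uploaded state is collapsed into the Uploading-to-Parsing step, and state/event names are abbreviated.

```typescript
// Minimal model of the dual-path entry: both event orderings end in Parsing.
type State = "Initial" | "Uploading" | "Parsing";
type Evt = "UploadStarted" | "UploadSaved";

function next(state: State, event: Evt): State {
  if (state === "Initial" && event === "UploadStarted") return "Uploading";
  // Path B: the S3 event arrives first and goes straight to Parsing.
  if (state === "Initial" && event === "UploadSaved") return "Parsing";
  // Path A: normal flow (Uploaded is collapsed into this step for brevity).
  if (state === "Uploading" && event === "UploadSaved") return "Parsing";
  // Late or duplicate events set metadata only; the state does not change.
  return state;
}
```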

Events and Transitions

                    ISearchUploadStartedEvent
Initial ──────────────────────────────────────────▶ Uploading
    │                                                   │
    │ ISearchUploadSavedToS3Event                       │ ISearchUploadSavedToS3Event
    │ (S3 event arrives first)                          │ (normal flow)
    │                                                   ▼
    │                                               Uploaded
    │                                                   │
    │                                                   │ StartParseJobsActivity
    │                                                   ▼
    └──────────────────────────────────────────────▶ Parsing
                    ┌───────────────────────────────────┼───────────────────────────────┐
                    │                                   │                               │
        IReferenceFileParsingCompletedEvent             │     IReferenceFileParsingFaultedEvent
                    │                                   │                               │
                    ▼                                   ▼                               ▼
               (all done?)                       (still pending)                  (mark fault)
                    │                                                                   │
        ┌───────────┴───────────┐                                                       │
        │                       │                                                       │
   (no faults)             (has faults)                                                 │
        │                       │                                                       │
        ▼                       ▼                                                       │
   Completed ◀─────────────  Error ◀────────────────────────────────────────────────────┘

Terminal states (Completed, Error) persist with CompletedAt timestamp.
MongoDB TTL index (ttl_CompletedAt_7d) cleans up after 7 days.
All 6 events are Ignore()'d in terminal states to handle late duplicates.

Activities Executed

| Activity | Trigger | Purpose |
|---|---|---|
| CreateSearchImportJobActivity | SearchUploadStarted, SearchUploadSaved | Creates SearchImportJob aggregate with ReferenceFileParseJob entities |
| SetFileReceivedActivity | SearchUploadSaved | Sets upload completion timestamp |
| StartParseJobsActivity | Transition to Parsing | Publishes IStartParsingReferenceFileCommand for each reference file |
| CompleteSearchJobActivity | All parse jobs complete (no faults) | Creates SystematicSearch aggregate |
| FailSearchJobActivity | Any parse job faulted | Marks import job with error status |

Timeout Handling

  • 60-minute timeout scheduled when entering Uploading state
  • If S3 confirmation not received, transitions to Error state
  • Publishes ISearchImportJobErrorEvent with timeout message

Phase 6: Reference File Parsing

Consumer: ReferenceFileParseJobConsumer.cs (src/services/project-management/SyRF.ProjectManagement.Endpoint/Consumers/ReferenceFileParseJobConsumer.cs)

For each reference file, the consumer:

  1. Receives IStartParsingReferenceFileCommand
  2. Calls ProjectManagementService.ParseReferenceFile()
  3. On success: Publishes IReferenceFileParsingCompletedEvent
  4. On failure: Publishes IReferenceFileParsingFaultedEvent

Configuration:

  • Job timeout: 5 minutes
  • Retry policy: 4 incremental retries (1 min + 1 min increments)
  • Concurrent job limit: 5
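The incremental retry schedule above (4 retries, 1-minute initial interval, 1-minute increment, matching MassTransit's Incremental policy) yields waits of 1, 2, 3, and 4 minutes. A sketch of that arithmetic, with an illustrative function name:

```typescript
// Computes incremental retry delays: initial, initial+increment, ...
function incrementalDelaysMinutes(
  retryLimit: number, initial: number, increment: number
): number[] {
  return Array.from({ length: retryLimit }, (_, i) => initial + i * increment);
}
```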

Service: ProjectManagementService.cs (lines 129-199) (src/libs/project-management/SyRF.ProjectManagement.Core/Services/ProjectManagementService.cs)

The ParseReferenceFile method:

  1. Retrieves the ReferenceFileParseJob from the project
  2. Sets up progress tracking with throttled updates (every 2 seconds)
  3. Calls StudyReferenceFileParser.ParseStudiesAsync()
  4. Handles progress updates and final completion/error states

Phase 7: Study Creation

Parser Router: StudyReferenceFileParser.cs (src/libs/project-management/SyRF.ProjectManagement.Core/Services/StudyReferenceFileParser.cs)

  1. Selects appropriate parser implementation based on LibraryFileType
  2. Creates new SyRF reference file URL for storing parsed output
  3. Invokes parser's ParseStudiesAsync() method

Parser Implementations:

| Parser | File Types | Location |
|---|---|---|
| SpreadsheetParseImplementation | CSV, TSV | SpreadsheetParseImplementation.cs |
| EndnoteXmlParseImplementation | Endnote XML | EndnoteXmlParseImplementation.cs |
| PubmedXmlParseImplementation | PubMed XML | PubmedXmlParseImplementation.cs |

Record Processors:

| Processor | Purpose |
|---|---|
| StudySpreadsheetRecordProcessor | Converts CSV/TSV rows to Study objects |
| StudyEndnoteRecordProcessor | Converts Endnote XML records to Study objects |
| StudyPubmedRecordProcessor | Converts PubMed XML records to Study objects |

Each parser:

  1. Downloads the original file from S3
  2. Parses records using the appropriate record processor
  3. Creates Study aggregate roots with metadata:
     • Title, Authors, Abstract, Year
     • DOI, URL, Keywords
     • PublicationName, ReferenceType
     • Project and Search associations
  4. Batches studies (5,000 at a time) for MongoDB insertion
  5. Creates a new SyRF reference file with Study IDs added
  6. Reports progress back to the state machine
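The batching step can be sketched with a generic chunking helper. This is illustrative only (the real parser streams records and flushes a batch every 5,000 studies rather than materializing the whole list):

```typescript
// Splits a list of parsed studies into batches for MongoDB insertMany calls.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

With the documented batch size of 5,000, a 12,001-study file produces three inserts: two full batches and one of 2,001.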

Phase 8: Systematic Search Creation

Domain Model: SearchImportJob.cs (line 41) (src/libs/project-management/SyRF.ProjectManagement.Core/Model/ProjectAggregate/SearchImportJob.cs)

When all ReferenceFileParseJob entities complete:

public SystematicSearch CompleteSearchImportJob()
{
    Status = SearchImportJobStatus.Complete;
    var search = new SystematicSearch(Id, Name, Description,
        ReferenceFileParseJobs.Select(rfj => rfj.CreateStudyReferenceFile()),
        ProjectId, LivingSearchId);
    return search;
}

Result: A SystematicSearch aggregate is created containing:

  • Search ID (same as SearchImportJob ID)
  • Name and Description
  • Collection of StudyReferenceFile objects
  • Project association
  • Optional Living Search association

Message Contracts

Events

| Event | Publisher | Subscribers | Purpose |
|---|---|---|---|
| ISearchUploadStartedEvent | API Service | PM State Machine | Notifies upload has begun |
| ISearchUploadSavedToS3Event | S3 Notifier Lambda | PM State Machine | Confirms file stored in S3 |
| IReferenceFileParsingCompletedEvent | ReferenceFileParseJobConsumer | PM State Machine | Parse job succeeded |
| IReferenceFileParsingFaultedEvent | ReferenceFileParseJobConsumer | PM State Machine | Parse job failed |
| ISearchImportJobErrorEvent | PM State Machine | Error handlers | Import job failed |

Commands

| Command | Publisher | Consumer | Purpose |
|---|---|---|---|
| IStartParsingReferenceFileCommand | StartParseJobsActivity | ReferenceFileParseJobConsumer | Triggers parsing of individual file |

Shared Interfaces

ICanStartSearchImportJob (src/libs/kernel/SyRF.SharedKernel/Interfaces/ICanStartSearchImportJob.cs)

Base interface implemented by ISearchUploadStartedEvent and ISearchUploadSavedToS3Event (the two events that can initiate a search import job). Note: IStartSearchImportJobCommand was removed as it had no producer anywhere in the monorepo.

public interface ICanStartSearchImportJob
{
    Guid ProjectId { get; }
    Guid SearchId { get; }
    string SearchName { get; }
    string Description { get; }
    IEnumerable<ReferenceFileInfo> ReferenceFiles { get; }
}

ReferenceFileInfo (src/libs/kernel/SyRF.SharedKernel/ValueObjects/ReferenceFileInfo.cs)

Value object containing file metadata:

public record ReferenceFileInfo(
    Guid ReferenceFileId,
    string FileUrl,
    LibraryFileType LibraryFileType,
    ScreeningImportSettings? ScreeningImportSettings
);

Domain Models

SearchImportJob

Location: SearchImportJob.cs

MongoDB Collection: Projects (embedded in Project aggregate)

Aggregate root tracking the import process:

| Property | Type | Description |
|---|---|---|
| Id | Guid | Same as SearchId |
| Name | string | User-provided search name |
| Description | string | User-provided description |
| Status | SearchImportJobStatus | Uploading, Parsing, Complete, Error |
| ReferenceFileParseJobs | IEnumerable<ReferenceFileParseJob> | Individual file parsing jobs |
| TotalNumberOfStudies | int? | Sum across all parse jobs |
| NumberOfParsedStudies | int | Progress counter |
| Errors | List<string> | Error messages |

ReferenceFileParseJob

Location: ReferenceFileParseJob.cs

MongoDB Collection: Embedded in SearchImportJob

Entity tracking individual file parsing:

| Property | Type | Description |
|---|---|---|
| Id | Guid | ReferenceFileId |
| LibraryType | LibraryFileType | File format |
| OriginalFileUrl | string | S3 location of uploaded file |
| SyrfReferenceFileUrl | string? | S3 location of parsed file with IDs |
| TotalNumberOfStudies | int? | Total studies in file |
| NumberOfParsedStudies | int | Parsed count |
| ScreeningImportSettings | ScreeningImportSettings? | Column-to-investigator mapping |
| IsComplete | bool | Parsed == Total |

Study

Location: Study.cs

MongoDB Collection: Studies

Aggregate root for individual studies:

| Property | Type | Description |
|---|---|---|
| Id | Guid | Unique study identifier |
| Title | string | Study title |
| Authors | IEnumerable<Author> | Author list |
| Abstract | string? | Study abstract |
| Year | int? | Publication year |
| DOI | string? | Digital Object Identifier |
| PublicationName | PublicationName | Journal information |
| Keywords | List<string> | Study keywords |
| ProjectId | Guid | Owning project |
| SystematicSearchId | Guid | Owning search |
| ReferenceFileId | Guid | Source file |
| ScreeningInfo | ScreeningInfo | Screening decisions |
| ExtractionInfo | ExtractionInfo | Data extraction |

SystematicSearch

Location: SystematicSearch.cs

MongoDB Collection: SystematicSearches

Aggregate root representing a completed search:

| Property | Type | Description |
|---|---|---|
| Id | Guid | Search identifier |
| Name | string | Search name |
| Description | string | Search description |
| SyrfReferenceFiles | IEnumerable<StudyReferenceFile> | Parsed file references |
| ProjectId | Guid? | Owning project |
| NumberOfStudies | int | Total studies across all files |

Example Payloads

Presigning Request

Request: POST /api/projects/{projectId}/searches/getSignature

{
  "searchId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "searchName": "PubMed Search 2024",
  "description": "Search results from PubMed for cardiac intervention studies",
  "referenceFiles": [
    {
      "referenceFileId": "f1e2d3c4-b5a6-7890-1234-567890abcdef",
      "fileUrl": "Projects/project-123/Imported Search Libraries/SyRF Library - a1b2c3d4.csv",
      "libraryFileType": 0,
      "screeningImportSettings": null
    }
  ],
  "fileHash": "abc123def456...",
  "contentType": "text/csv"
}

Presigning Response

{
  "uploadUrl": "https://syrf-bucket.s3.eu-west-1.amazonaws.com/Projects/...",
  "headers": {
    "x-amz-meta-projectid": "project-123",
    "x-amz-meta-searchid": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "x-amz-meta-searchname": "PubMed%20Search%202024",
    "x-amz-meta-description": "Search%20results%20from%20PubMed...",
    "x-amz-meta-referencefiles": "[{\"ReferenceFileId\":\"f1e2d3c4...\",\"FileUrl\":\"...\",\"LibraryFileType\":0}]",
    "x-amz-content-sha256": "abc123def456...",
    "x-amz-date": "20241207T120000Z",
    "Authorization": "AWS4-HMAC-SHA256 Credential=AKIAIOSFODNN7EXAMPLE/20241207/eu-west-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-meta-projectid;x-amz-meta-referencefiles;x-amz-meta-searchid;x-amz-meta-searchname, Signature=..."
  },
  "referenceFileId": "f1e2d3c4-b5a6-7890-1234-567890abcdef"
}

ISearchUploadStartedEvent

{
  "messageId": "msg-uuid-123",
  "messageType": ["urn:message:SyRF.ProjectManagement.Messages.Events:ISearchUploadStartedEvent"],
  "message": {
    "projectId": "project-123",
    "searchId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "searchName": "PubMed Search 2024",
    "description": "Search results from PubMed...",
    "referenceFiles": [
      {
        "referenceFileId": "f1e2d3c4-b5a6-7890-1234-567890abcdef",
        "fileUrl": "Projects/project-123/Imported Search Libraries/SyRF Library - a1b2c3d4.csv",
        "libraryFileType": 0,
        "screeningImportSettings": null
      }
    ],
    "dateTimeEventOccurred": "2024-12-07T12:00:00Z"
  }
}

ISearchUploadSavedToS3Event

{
  "messageId": "msg-uuid-456",
  "messageType": ["urn:message:SyRF.S3FileSavedNotifier.Messages:ISearchUploadSavedToS3Event"],
  "message": {
    "projectId": "project-123",
    "searchId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "searchName": "PubMed Search 2024",
    "description": "Search results from PubMed...",
    "referenceFiles": [
      {
        "referenceFileId": "f1e2d3c4-b5a6-7890-1234-567890abcdef",
        "fileUrl": "Projects/project-123/Imported Search Libraries/SyRF Library - a1b2c3d4.csv",
        "libraryFileType": 0
      }
    ],
    "dateTimeEventOccurred": "2024-12-07T12:00:05Z"
  }
}

Error Handling

Upload Timeout

  • 60-minute timeout from upload start
  • State machine transitions to Error state
  • ISearchImportJobErrorEvent published with "Timed out waiting for file upload"

Parsing Failures

  • Individual file parsing has 5-minute timeout
  • 4 automatic retries with incremental backoff
  • On final failure, IReferenceFileParsingFaultedEvent published
  • If any file fails, entire import marked as Error

Rollback on Parse Error

If parsing fails mid-stream:

  1. All studies created for that file are deleted
  2. FileParseResult contains error details
  3. ReferenceFileParseJob marked with errors
  4. SearchImportJob status set to Error

Troubleshooting

Common Issues

| Symptom | Likely Cause | Resolution |
|---|---|---|
| Upload stuck at 0% | CORS misconfiguration on S3 bucket | Verify S3 CORS allows PUT from web origin |
| "Timed out waiting for file upload" | Lambda not triggered or RabbitMQ connectivity | Check Lambda CloudWatch logs, verify S3 event notification |
| Search stuck in "Uploading" | S3 event never reached PM service | Check S3 bucket event notifications, Lambda invocations |
| Parsing errors | Malformed input file | Validate CSV columns, check XML structure |
| Studies not appearing | MongoDB connectivity or batch insert failure | Check PM service logs, MongoDB connection |

Diagnostic Steps

  1. Check RabbitMQ queues: Verify messages are being delivered
     • search-import-job-state queue for state machine events
     • start-parsing-reference-file-command queue for parse jobs

  2. Check Lambda CloudWatch logs: /aws/lambda/syrfAppUploadS3Notifier
     • Look for S3 event processing errors
     • Verify metadata extraction succeeded

  3. Check PM service logs: Search for SearchImportJobStateMachine
     • State transitions logged with correlation ID
     • Activity execution results

  4. MongoDB queries:

// Check SearchImportJob status
db.Projects.find({"SearchImportJobs.Id": UUID("search-id")})

// Count studies for a search
db.Studies.countDocuments({SystematicSearchId: UUID("search-id")})

Debug Mode

Enable verbose logging in Project Management service:

{
  "Logging": {
    "LogLevel": {
      "MassTransit": "Debug",
      "SyRF.ProjectManagement": "Debug"
    }
  }
}

Screening Import Integration

When ScreeningImportSettings is provided:

  1. CSV/TSV columns are mapped to investigator IDs
  2. Screening decisions are extracted during parsing
  3. Studies are created with pre-populated ScreeningInfo
  4. Allows bulk import of existing screening decisions

ScreeningImportSettings Structure:

public record ScreeningImportSettings(
    Guid StageId,
    Dictionary<string, Guid> UserColumnMap  // column name -> investigator ID
);
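A hedged sketch of how the column map could be applied during parsing. The TypeScript shape below is a loose translation of the C# record, and extractDecisions is a hypothetical helper, not the real parser API:

```typescript
// Loose TypeScript rendering of ScreeningImportSettings.
interface ScreeningImportSettings {
  stageId: string;
  userColumnMap: Record<string, string>; // column name -> investigator ID
}

// Hypothetical: maps a CSV row's screening columns to per-investigator
// decisions, skipping columns with no recorded decision.
function extractDecisions(
  row: Record<string, string>, settings: ScreeningImportSettings
): Array<{ investigatorId: string; decision: string }> {
  return Object.entries(settings.userColumnMap)
    .filter(([column]) => row[column] !== undefined && row[column] !== "")
    .map(([column, investigatorId]) => ({ investigatorId, decision: row[column] }));
}
```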

Performance Characteristics

| Operation | Typical Duration | Bottleneck |
|---|---|---|
| S3 signature generation | < 100 ms | API response |
| S3 upload | Variable (file size) | Network bandwidth |
| S3 notification | 1-5 seconds | Lambda cold start |
| Parse job (CSV, 10K studies) | 30-60 seconds | MongoDB batch writes |
| Parse job (XML, 10K studies) | 60-120 seconds | XML parsing + MongoDB |

Note: Performance numbers are estimates based on typical workloads and may vary.

Optimizations:

  • Studies batched in groups of 5,000 for MongoDB insertion
  • Progress updates throttled to every 2 seconds
  • Up to 5 concurrent parse jobs per instance
  • Incremental retry with backoff on transient failures
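The 2-second progress throttle can be sketched as a small stateful guard. This is a simplified stand-in for the throttling inside ProjectManagementService.ParseReferenceFile; the clock is injected so the behaviour is testable:

```typescript
// Returns a function that answers "should we report progress now?",
// allowing at most one report per intervalMs.
function makeThrottle(intervalMs: number, now: () => number) {
  let last = -Infinity; // first call always passes
  return (): boolean => {
    const t = now();
    if (t - last >= intervalMs) {
      last = t;
      return true;
    }
    return false;
  };
}
```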

Implementation Critique

This section documents known architectural concerns and potential improvements for the systematic search upload flow.

Strengths

  1. Robust Race Condition Handling: The dual-path state machine entry handles the case where S3 events can arrive before API events. Both SearchUploadStarted and SearchUploadSaved can initiate the saga via Initially, and late-arriving events are handled by explicit During blocks.

  2. Good Separation of Concerns: The architecture cleanly separates:
     • API (presigning + event publishing)
     • Lambda (S3 event notification)
     • State machine (orchestration)
     • Consumers (parsing work)

  3. Proper Use of MassTransit Sagas: Using state machines for workflow orchestration is the right pattern for this use case.

  4. Batching Strategy: The 5,000 study batch size for MongoDB inserts is a reasonable trade-off between memory usage and database round-trips.

Known Issues and Technical Debt

1. Presigned URL Security Gap

Location: SearchController.cs:92-156

The presigning endpoint publishes ISearchUploadStartedEvent before the file is actually uploaded. If a user requests a signature but never uploads:

  • A SearchImportJob gets created in "Uploading" state
  • It will time out after 60 minutes and transition to "Error"
  • This creates orphaned/failed jobs in the database

Recommendation: Consider a two-phase approach where the job isn't created until the S3 event confirms the upload, or implement cleanup for abandoned uploads.

2. Silent Exception Swallowing

Location: ReferenceFileParseJobConsumer.cs:29-32

catch (Exception e)
{
    await PublishFaultEvent(context, command);
}

The exception e is caught but never logged. This makes debugging production issues difficult. The fault event doesn't contain the exception details.

Recommendation: Log the exception with correlation ID before publishing the fault event.

3. State Machine Duplicate Event Handling (Resolved)

Location: SearchImportJobStateMachine.cs

Previously, duplicate events could create duplicate jobs or trigger duplicate parse commands. This has been resolved with a three-layer defense: scoped UseMessageRetry (only MongoDbConcurrencyException), UseInMemoryOutbox, and Ignore() handlers for all 6 events in terminal states. Saga instances persist in terminal states and are cleaned up by a MongoDB TTL index after 7 days. See SearchImportJob Saga Duplicate Event Handling for full details.

4. Tight Coupling to S3 Path Convention

Location: SearchController.cs

The S3 key pattern Projects/{projectId}/Imported Search Libraries/SyRF Library - {searchId}.{ext} is hard-coded. If this convention changes, both the API and Lambda need coordinated updates.

Recommendation: Extract path construction to a shared utility, or store the full URL in metadata rather than reconstructing it.

5. SystematicSearch Schema Migration Complexity

Location: SystematicSearch.cs:72-105

The SyrfReferenceFiles property has complex getter/setter logic handling schema version 0 vs newer versions. This backward compatibility logic makes the domain model harder to understand and maintain.

Recommendation: Consider a one-time data migration script to upgrade all v0 documents, then remove the backward compatibility code.

6. Missing Validation in ReferenceFileParseJob.UpdateProgress

Location: ReferenceFileParseJob.cs:95-133

The UpdateProgress method silently returns false when HasError || IsComplete. Callers may not check this return value, leading to silent failures where progress updates are ignored.

Recommendation: Either throw an exception or log a warning when progress updates are attempted on completed/errored jobs.

7. No Circuit Breaker for MongoDB Operations

The parsing flow does batch inserts of 5,000 studies without circuit breaker protection. If MongoDB is under load:

  • All 5 concurrent parse jobs will pile up
  • Retries will compound the problem
  • No backpressure mechanism exists

Recommendation: Implement a circuit breaker pattern (Polly) for MongoDB operations, and consider adding backpressure through RabbitMQ prefetch limits.

8. Metadata Size Limit Risk

Location: SearchController.cs

S3 object metadata has a 2KB limit. The x-amz-meta-referencefiles header contains JSON-serialized file info. With multiple files or long descriptions, this could exceed the limit.

Recommendation: Store minimal identifiers in metadata and have the Lambda look up full details from a database or use the S3 key itself to encode the correlation ID.
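A cheap guard for this risk would be to measure the serialized payload before signing. The helper below is hypothetical (not present in SearchController.cs) and assumes the 2 KB user-metadata limit applies to the UTF-8 byte length:

```typescript
// S3 limits user-defined object metadata to 2 KB total.
const S3_USER_METADATA_LIMIT_BYTES = 2048;

// Hypothetical pre-flight check: does the serialized reference-file
// metadata fit within the S3 user-metadata limit?
function metadataWithinLimit(referenceFilesJson: string): boolean {
  return Buffer.byteLength(referenceFilesJson, "utf8") <= S3_USER_METADATA_LIMIT_BYTES;
}
```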

9. Inconsistent Error Handling Strategy

  • Some errors delete created studies (rollback)
  • Some errors just mark the job as failed
  • The state machine transitions to Error but existing partial data may remain

Recommendation: Define and document a consistent error handling contract - either always roll back partial data or always keep it for manual review.

10. No Dead Letter Queue Strategy

If messages permanently fail, they'll exhaust retries and be moved to error queues, but there's no documented strategy for:

  • Monitoring dead letters
  • Alerting on accumulated failures
  • Manual retry or resolution procedures

Recommendation: Add observability and runbooks for DLQ handling.

Minor Improvements

  • Progress throttling (2 seconds) could be configurable
  • Concurrent job limit (5) should be based on resource constraints, not a magic number
  • Add correlation IDs to all log statements for distributed tracing
  • Consider using CloudEvents format for better observability tooling compatibility

Operational Notes

Saga State Collection Growth

The pmSearchImportJobState MongoDB collection stores saga instances for the SearchImportJobStateMachine. With TTL-based cleanup (replacing immediate SetCompleted deletion), saga instances persist for 7 days after reaching a terminal state (Completed or Error). The TTL index ttl_CompletedAt_7d on the CompletedAt field handles automatic cleanup.

Monitor: Collection size should remain proportional to the volume of search imports over the past 7 days. If the collection grows unexpectedly, check:

  • TTL index exists: db.pmSearchImportJobState.getIndexes() should show ttl_CompletedAt_7d
  • CompletedAt is being set: documents in terminal states should have a non-null CompletedAt value
  • MongoDB TTL monitor thread is running (runs every 60 seconds by default)