Search Upload Process Improvements

Overview

This document outlines proposed improvements to the systematic search upload process, with particular focus on addressing the poor progress reporting experience where parsing progress updates are not reaching the frontend regularly enough to be useful.

Current State Analysis

Upload Flow Summary

The search upload process follows this path:

  1. Frontend uploads file to S3 via presigned URL
  2. S3 Lambda (syrfAppUploadS3Notifier) detects upload and publishes SearchUploadSavedToS3Event
  3. PM Service state machine orchestrates parsing via SearchImportJobStateMachine
  4. ReferenceFileParseJobConsumer processes files using StudyReferenceFileParser
  5. Progress updates are saved to MongoDB and flow through change streams to SignalR

Root Causes of Poor Progress Reporting

Investigation identified several issues that prevent progress updates from reaching the frontend regularly:

1. Backend 2-Second Throttle (Primary Issue)

Location: ProjectManagementService.cs:137-138

var updateTimer = new Timer((_) => readyForUpdate = true, null,
    TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(2));

Progress updates can only be sent every 2 seconds, regardless of actual parsing progress.
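The effect of this gate can be sketched as a simplified model (in TypeScript for illustration, not the actual C# service code): updates that arrive between timer ticks are silently dropped, so at most one update per interval ever reaches SignalR.

```typescript
// Simplified model of the timer-gated throttle: a flag becomes true on each
// timer tick and is consumed when an update is sent, so progress reports
// arriving between ticks are dropped.
function simulateThrottle(updateTimesMs: number[], intervalMs: number): number[] {
  let nextReadyAt = intervalMs; // first tick fires one interval after start
  const sent: number[] = [];
  for (const t of updateTimesMs) {
    if (t >= nextReadyAt) {
      sent.push(t); // update goes out
      // gate closes until the next timer tick after this send
      nextReadyAt = (Math.floor(t / intervalMs) + 1) * intervalMs;
    }
    // otherwise the update is dropped
  }
  return sent;
}
```

With a 2-second interval, five updates in the first 4.1 seconds collapse to two deliveries, which is exactly the "stuck progress bar" symptom.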

2. Heavy MongoDB Change Stream Approach

Location: MongoContext.cs:130-238

Each progress update requires:

  • Saving the entire Project aggregate to MongoDB
  • MongoDB emitting a change stream event
  • Change stream triggering SignalR notification

This is inefficient for lightweight progress updates.

3. Full Project Entity in Notifications

Location: NotificationHub.cs:128

The frontend receives the entire ProjectWithRelatedInvestigatorsDto for every progress update. Search import progress is a tiny fraction of this large payload.

4. No Dedicated Progress Channel

Frontend: signal-r.service.ts:197-200

this.projectNotifications$ = fromEvent(
  hubConnection,
  'ProjectNotification'
);

There is no dedicated SearchImportProgressNotification channel - progress is embedded in generic project updates.

5. Version-Based Filtering May Drop Updates

Frontend: signal-r.service.ts:404-406

if (storedVersion === undefined ||
    storedVersion <= (updatedVersion ?? Infinity) || ...

Version comparison logic may inadvertently skip rapid progress updates if versions don't increment as expected.
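One way to avoid dropping rapid updates is to exempt progress-only payloads from version gating, since parsing progress is monotonic and safe to apply even when the aggregate version has not advanced. A sketch (field names hypothetical; the real DTO shapes differ):

```typescript
// Illustrative notification shape; the real ProjectNotification is larger.
interface ProjectNotification {
  version?: number;
  searchImportProgress?: { parsed: number; total: number };
}

// Accept the notification if its version advances (mirroring the existing
// check), OR if it carries search-import progress, which should never be
// discarded as "stale".
function shouldApply(storedVersion: number | undefined, incoming: ProjectNotification): boolean {
  const incomingVersion = incoming.version ?? Infinity;
  if (storedVersion === undefined || storedVersion <= incomingVersion) return true;
  // Stale aggregate version, but progress updates are still worth applying.
  return incoming.searchImportProgress !== undefined;
}
```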

Proposed Improvements

Phase 1: Quick Wins (Low Effort, High Impact)

1.1 Reduce Progress Throttle Interval

Current: 2-second throttle
Proposed: 500ms throttle (or configurable)

Change:

// In ProjectManagementService.ParseReferenceFile
var updateInterval = TimeSpan.FromMilliseconds(500);
var updateTimer = new Timer((_) => readyForUpdate = true, null,
    updateInterval, updateInterval);

Impact: More frequent updates with minimal additional load
Risk: Low - still throttled, just more frequently

1.2 Add Progress Percentage to Logs

Add structured logging for progress updates to help diagnose issues:

if (totalStudiesInFile > 0)  // guard against division by zero before counting starts
{
    _logger.LogInformation(
        "Search {SearchId} parsing progress: {Parsed}/{Total} ({Percent}%)",
        searchId, numberOfParsedStudies, totalStudiesInFile,
        (numberOfParsedStudies * 100) / totalStudiesInFile);
}

Phase 2: Dedicated Progress Channel (Medium Effort)

2.1 Create Lightweight Progress Event

Create a dedicated MassTransit event for progress updates:

public interface ISearchImportProgressEvent
{
    Guid ProjectId { get; }
    Guid SearchId { get; }
    Guid ReferenceFileId { get; }
    int NumberOfParsedStudies { get; }
    int TotalStudiesInFile { get; }
    string? SyrfReferenceFileUrl { get; }
    DateTime Timestamp { get; }
}

2.2 Add SignalR Progress Channel

Backend (NotificationHub.cs):

public interface INotificationHubClient
{
    // Existing...
    Task SearchImportProgressNotification(SearchImportProgressDto notification);
}

Frontend (signal-r.service.ts):

this.searchImportProgressNotifications$ = fromEvent(
  hubConnection,
  'SearchImportProgressNotification'
);
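On the consuming side, a component can derive a progress-bar value from the new stream. A minimal sketch, with a DTO shape assumed to mirror the proposed ISearchImportProgressEvent fields:

```typescript
// Assumed client-side mirror of the proposed SearchImportProgressDto.
interface SearchImportProgressDto {
  projectId: string;
  searchId: string;
  referenceFileId: string;
  numberOfParsedStudies: number;
  totalStudiesInFile: number;
}

// Map a progress notification to a 0-100 integer for the progress bar,
// guarding against a zero total before any studies have been counted.
function toPercent(p: SearchImportProgressDto): number {
  if (p.totalStudiesInFile <= 0) return 0;
  return Math.min(100, Math.floor((p.numberOfParsedStudies * 100) / p.totalStudiesInFile));
}
```

Usage would then be a plain subscription, e.g. `this.searchImportProgressNotifications$.subscribe(p => this.progress = toPercent(p));`.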

2.3 Direct SignalR Push (Bypass Change Streams)

Instead of relying on MongoDB change streams, publish progress directly to SignalR:

// In ProjectManagementService or a dedicated service
await _hubContext.Clients.Group($"project-{projectId}")
    .SearchImportProgressNotification(new SearchImportProgressDto
    {
        ProjectId = projectId,
        SearchId = searchId,
        ReferenceFileId = referenceFileId,
        NumberOfParsedStudies = numberOfParsedStudies,
        TotalStudiesInFile = totalStudiesInFile,
        Timestamp = DateTime.UtcNow
    });

This bypasses the heavyweight MongoDB save + change stream path.
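If direct pushes keep any throttling, the final update (100% / parsing complete) must never be the one that gets dropped. A coalescing throttle that flushes the latest pending value on each timer tick guarantees this; the sketch below is deterministic (the caller drives `tick()` from a timer) and illustrative only:

```typescript
// Coalescing throttle: at most one push per interval, but the most recent
// value offered during a closed interval is flushed on the next tick, so
// the final "complete" update is never lost.
class ThrottledPusher<T> {
  private pending: T | undefined;
  private gateOpen = true;

  constructor(private push: (value: T) => void) {}

  offer(value: T): void {
    if (this.gateOpen) {
      this.gateOpen = false;
      this.push(value); // leading edge: send immediately
    } else {
      this.pending = value; // coalesce intermediate values
    }
  }

  // Called by a timer every interval (e.g. 500ms).
  tick(): void {
    if (this.pending !== undefined) {
      const v = this.pending;
      this.pending = undefined;
      this.push(v); // trailing flush consumes this interval
    } else {
      this.gateOpen = true; // idle interval: reopen the gate
    }
  }
}
```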

Phase 3: Architecture Improvements (Higher Effort)

3.1 Separate Progress State from Project Aggregate

Move transient parsing progress to a separate, lightweight collection:

public class SearchImportProgress
{
    public Guid Id { get; set; }  // = SearchId
    public Guid ProjectId { get; set; }
    public Dictionary<Guid, ReferenceFileProgress> FileProgress { get; set; }
    public DateTime LastUpdated { get; set; }
}

Benefits:

  • Frequent updates don't bloat Project document history
  • Can use different change stream subscription
  • Easier to implement TTL/cleanup
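The TTL/cleanup behavior can be modeled in memory before committing to a MongoDB TTL index. The sketch below (TypeScript for illustration; field names follow the proposed SearchImportProgress shape) shows the intended semantics: each upsert refreshes lastUpdated, and a sweep drops entries idle longer than the TTL:

```typescript
interface ReferenceFileProgress { parsed: number; total: number; }

interface SearchImportProgressEntry {
  id: string;        // = SearchId
  projectId: string;
  fileProgress: Map<string, ReferenceFileProgress>;
  lastUpdated: number; // epoch ms
}

// In-memory model of the separate progress collection with TTL cleanup.
class ProgressStore {
  private entries = new Map<string, SearchImportProgressEntry>();
  constructor(private ttlMs: number) {}

  upsert(searchId: string, projectId: string, fileId: string,
         parsed: number, total: number, now: number): void {
    const e = this.entries.get(searchId)
      ?? { id: searchId, projectId, fileProgress: new Map(), lastUpdated: now };
    e.fileProgress.set(fileId, { parsed, total });
    e.lastUpdated = now; // every update refreshes the TTL window
    this.entries.set(searchId, e);
  }

  // What a MongoDB TTL index on lastUpdated would do server-side.
  sweep(now: number): number {
    let removed = 0;
    for (const [key, entry] of this.entries) {
      if (now - entry.lastUpdated > this.ttlMs) {
        this.entries.delete(key);
        removed++;
      }
    }
    return removed;
  }

  get(searchId: string): SearchImportProgressEntry | undefined {
    return this.entries.get(searchId);
  }
}
```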

3.2 WebSocket Streaming for Large Uploads

For very large files (>10,000 studies), consider WebSocket streaming:

// Frontend subscribes to progress stream
const progressStream = signalR.stream('StreamSearchProgress', searchId);
progressStream.subscribe({
  next: (progress) => updateProgressBar(progress),
  complete: () => markComplete(),
  error: (err) => handleError(err)
});

3.3 Frontend Progress State Management

Add dedicated NgRx state for search import progress:

// search-import-progress.state.ts
interface SearchImportProgressState {
  [searchId: string]: {
    files: {
      [fileId: string]: {
        parsed: number;
        total: number;
        status: 'parsing' | 'complete' | 'error';
      }
    };
    overallProgress: number;  // 0-100
  }
}
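A selector over this state can derive overallProgress by weighting each file by its study count rather than averaging per-file percentages, so one large file does not get the same weight as a tiny one. A sketch against the state shape above:

```typescript
interface FileProgress {
  parsed: number;
  total: number;
  status: 'parsing' | 'complete' | 'error';
}

// Derive a 0-100 overall value across all files of a search, weighted by
// study count (sum of parsed over sum of total).
function overallProgress(files: { [fileId: string]: FileProgress }): number {
  let parsed = 0;
  let total = 0;
  for (const f of Object.values(files)) {
    parsed += f.parsed;
    total += f.total;
  }
  return total > 0 ? Math.floor((parsed * 100) / total) : 0;
}
```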

Implementation Priority

| Improvement                    | Effort | Impact | Priority |
|--------------------------------|--------|--------|----------|
| 1.1 Reduce throttle interval   | Low    | Medium | P1       |
| 1.2 Add progress logging       | Low    | Low    | P1       |
| 2.1 Dedicated progress event   | Medium | High   | P2       |
| 2.2 SignalR progress channel   | Medium | High   | P2       |
| 2.3 Direct SignalR push        | Medium | High   | P2       |
| 3.1 Separate progress state    | High   | Medium | P3       |
| 3.2 WebSocket streaming        | High   | Medium | P3       |
| 3.3 Frontend state management  | Medium | Medium | P3       |
Recommended rollout:

  1. Sprint 1: Phase 1 - Quick wins (1.1, 1.2)
     • Immediate improvement with minimal risk
     • Provides better diagnostics for further investigation
  2. Sprint 2: Phase 2 - Dedicated channel (2.1, 2.2, 2.3)
     • Core architectural fix
     • Decouples progress from Project entity updates
  3. Future: Phase 3 - As needed
     • Only if Phase 2 doesn't fully resolve issues
     • Consider for very large file uploads

Testing Considerations

Manual Testing

  1. Upload file with 1,000+ studies
  2. Observe frontend progress bar updates
  3. Verify updates occur at expected intervals
  4. Check browser DevTools for SignalR messages

Automated Testing

  1. Unit tests for throttle behavior
  2. Integration tests for progress event publishing
  3. E2E tests for frontend progress display

Success Metrics

  • Progress updates reach frontend at least every 1 second during parsing
  • Frontend progress bar shows smooth, continuous progress
  • No "stuck" progress bars during active parsing
  • Reduced SignalR payload size for progress updates (Phase 2+)

Appendix: Code References

Key Files

| File                             | Purpose                         |
|----------------------------------|---------------------------------|
| ProjectManagementService.cs      | Progress throttle, update logic |
| SearchImportJobStateMachine.cs   | State machine orchestration     |
| ReferenceFileParseJobConsumer.cs | File parsing consumer           |
| NotificationHub.cs               | SignalR hub                     |
| signal-r.service.ts              | Frontend SignalR service        |
| MongoContext.cs                  | MongoDB change streams          |