# Search Upload Process Improvements

## Overview

This document outlines proposed improvements to the systematic search upload process, focusing on the poor progress reporting experience: parsing progress updates do not reach the frontend often enough to be useful.
## Current State Analysis

### Upload Flow Summary

The search upload process follows this path:

1. The frontend uploads the file to S3 via a presigned URL
2. The S3 Lambda (`syrfAppUploadS3Notifier`) detects the upload and publishes `SearchUploadSavedToS3Event`
3. The PM Service state machine (`SearchImportJobStateMachine`) orchestrates parsing
4. `ReferenceFileParseJobConsumer` processes files using `StudyReferenceFileParser`
5. Progress updates are saved to MongoDB and flow through change streams to SignalR
### Root Causes of Poor Progress Reporting

Investigation identified several issues that prevent progress updates from reaching the frontend regularly:

#### 1. Backend 2-Second Throttle (Primary Issue)

Location: `ProjectManagementService.cs:137-138`

```csharp
var updateTimer = new Timer((_) => readyForUpdate = true, null,
    TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(2));
```

Progress updates can only be sent every 2 seconds, regardless of actual parsing progress.
#### 2. Heavy MongoDB Change Stream Approach

Location: `MongoContext.cs:130-238`

Each progress update requires:

- Saving the entire Project aggregate to MongoDB
- MongoDB emitting a change stream event
- The change stream triggering a SignalR notification

This is inefficient for lightweight progress updates.
#### 3. Full Project Entity in Notifications

Location: `NotificationHub.cs:128`

The frontend receives the entire `ProjectWithRelatedInvestigatorsDto` for every progress update, even though search import progress is a tiny fraction of this large payload.
#### 4. No Dedicated Progress Channel

Frontend: `signal-r.service.ts:197-200`

There is no dedicated `SearchImportProgressNotification` channel; progress is embedded in generic project updates.
#### 5. Version-Based Filtering May Drop Updates

Frontend: `signal-r.service.ts:404-406`

The version comparison logic may inadvertently skip rapid progress updates if versions do not increment as expected.
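One way to make the filtering robust against rapid updates is to key staleness checks on a monotonically increasing progress counter rather than the project entity version. A minimal TypeScript sketch of that idea (all names hypothetical, not the current `signal-r.service.ts` code):

```typescript
// Hypothetical sketch: decide whether to apply a progress update by
// comparing a monotonic counter (studies parsed so far) per search,
// instead of the project entity version.
interface ProgressUpdate {
  searchId: string;
  numberOfParsedStudies: number;
}

class ProgressFilter {
  // Highest study count seen per search so far.
  private latest = new Map<string, number>();

  // Returns true when the update advances the counter; duplicates and
  // out-of-order (stale) updates are skipped.
  shouldApply(update: ProgressUpdate): boolean {
    const seen = this.latest.get(update.searchId) ?? -1;
    if (update.numberOfParsedStudies <= seen) {
      return false;
    }
    this.latest.set(update.searchId, update.numberOfParsedStudies);
    return true;
  }
}
```

Because the counter only moves forward, no legitimate progress update can be dropped by a version mismatch.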
## Proposed Improvements

### Phase 1: Quick Wins (Low Effort, High Impact)

#### 1.1 Reduce Progress Throttle Interval

- Current: 2-second throttle
- Proposed: 500 ms throttle (or configurable)

Change:

```csharp
// In ProjectManagementService.ParseReferenceFile
var updateInterval = TimeSpan.FromMilliseconds(500);
var updateTimer = new Timer((_) => readyForUpdate = true, null,
    updateInterval, updateInterval);
```

- Impact: More frequent updates with minimal additional load
- Risk: Low; updates are still throttled, just more frequently
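The throttle semantics are simple: emit at most once per interval. A deterministic TypeScript sketch of the same logic (caller supplies the clock, so it is testable without real timers; names hypothetical):

```typescript
// Hypothetical sketch of the throttle semantics used by the backend:
// an emit is allowed only when at least intervalMs has elapsed since
// the previous allowed emit.
class ProgressThrottle {
  private lastEmit = Number.NEGATIVE_INFINITY;

  constructor(private readonly intervalMs: number) {}

  // nowMs is a caller-supplied timestamp in milliseconds.
  shouldEmit(nowMs: number): boolean {
    if (nowMs - this.lastEmit < this.intervalMs) {
      return false;
    }
    this.lastEmit = nowMs;
    return true;
  }
}
```

With a 500 ms interval this allows at most roughly two updates per second per file, which bounds SignalR load while keeping the progress bar responsive.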
#### 1.2 Add Progress Percentage to Logs

Add structured logging for progress updates to help diagnose issues (guarding against a zero `totalStudiesInFile` before dividing):

```csharp
_logger.LogInformation(
    "Search {SearchId} parsing progress: {Parsed}/{Total} ({Percent}%)",
    searchId, numberOfParsedStudies, totalStudiesInFile,
    totalStudiesInFile > 0 ? (numberOfParsedStudies * 100) / totalStudiesInFile : 0);
```
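The frontend can format the same figures for display with the same zero-total guard. A small hypothetical helper (not existing code):

```typescript
// Hypothetical frontend counterpart: render "parsed/total (percent%)"
// for a progress label, guarding against a zero total.
function formatProgress(parsed: number, total: number): string {
  const percent = total > 0 ? Math.floor((parsed * 100) / total) : 0;
  return `${parsed}/${total} (${percent}%)`;
}
```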
### Phase 2: Dedicated Progress Channel (Medium Effort)

#### 2.1 Create Lightweight Progress Event

Create a dedicated MassTransit event for progress updates:

```csharp
public interface ISearchImportProgressEvent
{
    Guid ProjectId { get; }
    Guid SearchId { get; }
    Guid ReferenceFileId { get; }
    int NumberOfParsedStudies { get; }
    int TotalStudiesInFile { get; }
    string? SyrfReferenceFileUrl { get; }
    DateTime Timestamp { get; }
}
```
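The frontend will receive a JSON-serialized version of this payload. An assumed TypeScript mirror, with a runtime guard for messages arriving over the wire (field names assume SignalR's default camelCase JSON serialization; this DTO does not exist yet):

```typescript
// Assumed frontend mirror of the progress event payload.
interface SearchImportProgressDto {
  projectId: string;
  searchId: string;
  referenceFileId: string;
  numberOfParsedStudies: number;
  totalStudiesInFile: number;
  syrfReferenceFileUrl?: string;
  timestamp: string; // ISO-8601
}

// Runtime type guard: checks the fields the progress bar depends on.
function isSearchImportProgressDto(value: unknown): value is SearchImportProgressDto {
  const v = value as Partial<SearchImportProgressDto> | null;
  return (
    typeof v === 'object' && v !== null &&
    typeof v.searchId === 'string' &&
    typeof v.numberOfParsedStudies === 'number' &&
    typeof v.totalStudiesInFile === 'number'
  );
}
```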
#### 2.2 Add SignalR Progress Channel

Backend (`NotificationHub.cs`):

```csharp
public interface INotificationHubClient
{
    // Existing members...
    Task SearchImportProgressNotification(SearchImportProgressDto notification);
}
```

Frontend (`signal-r.service.ts`):

```typescript
this.searchImportProgressNotifications$ = fromEvent(
  hubConnection,
  'SearchImportProgressNotification'
);
```
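A sketch of the handler such a subscription could feed: it keeps the latest progress per (search, file) pair so the progress bar always renders the freshest numbers (all names hypothetical):

```typescript
// Hypothetical in-memory tracker fed by the progress subscription.
interface ProgressEvent {
  searchId: string;
  referenceFileId: string;
  numberOfParsedStudies: number;
  totalStudiesInFile: number;
}

class ProgressTracker {
  private readonly byFile = new Map<string, ProgressEvent>();

  // Store the latest event per (searchId, fileId) pair.
  handle(event: ProgressEvent): void {
    this.byFile.set(`${event.searchId}:${event.referenceFileId}`, event);
  }

  // Percentage for one file; 0 when unknown or total is zero.
  percentFor(searchId: string, fileId: string): number {
    const e = this.byFile.get(`${searchId}:${fileId}`);
    if (!e || e.totalStudiesInFile === 0) return 0;
    return Math.round((e.numberOfParsedStudies * 100) / e.totalStudiesInFile);
  }
}
```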
#### 2.3 Direct SignalR Push (Bypass Change Streams)

Instead of relying on MongoDB change streams, publish progress directly to SignalR:

```csharp
// In ProjectManagementService or a dedicated service
await _hubContext.Clients.Group($"project-{projectId}")
    .SearchImportProgressNotification(new SearchImportProgressDto
    {
        ProjectId = projectId,
        SearchId = searchId,
        ReferenceFileId = referenceFileId,
        NumberOfParsedStudies = numberOfParsedStudies,
        TotalStudiesInFile = totalStudiesInFile,
        Timestamp = DateTime.UtcNow
    });
```

This bypasses the heavyweight MongoDB save + change stream path entirely.
### Phase 3: Architecture Improvements (Higher Effort)

#### 3.1 Separate Progress State from Project Aggregate

Move transient parsing progress to a separate, lightweight collection:

```csharp
public class SearchImportProgress
{
    public Guid Id { get; set; } // = SearchId
    public Guid ProjectId { get; set; }
    public Dictionary<Guid, ReferenceFileProgress> FileProgress { get; set; }
    public DateTime LastUpdated { get; set; }
}
```
Benefits:
- Frequent updates don't bloat Project document history
- Can use a different change stream subscription
- Easier to implement TTL/cleanup
#### 3.2 WebSocket Streaming for Large Uploads

For very large files (>10,000 studies), consider SignalR streaming over the WebSocket connection:

```typescript
// Frontend subscribes to a server-to-client progress stream
const progressStream = hubConnection.stream('StreamSearchProgress', searchId);
progressStream.subscribe({
  next: (progress) => updateProgressBar(progress),
  complete: () => markComplete(),
  error: (err) => handleError(err)
});
```
#### 3.3 Frontend Progress State Management

Add dedicated NgRx state for search import progress:

```typescript
// search-import-progress.state.ts
interface SearchImportProgressState {
  [searchId: string]: {
    files: {
      [fileId: string]: {
        parsed: number;
        total: number;
        status: 'parsing' | 'complete' | 'error';
      };
    };
    overallProgress: number; // 0-100
  };
}
```
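The `overallProgress` field could be derived from the per-file counts. A sketch assuming it means the percentage of studies parsed across all files of a search (a pure function, suitable for an NgRx selector):

```typescript
// Hypothetical derivation of overallProgress: total parsed studies over
// total studies across every file of a search, as a 0-100 integer.
interface FileProgress {
  parsed: number;
  total: number;
}

function overallProgress(files: Record<string, FileProgress>): number {
  let parsed = 0;
  let total = 0;
  for (const f of Object.values(files)) {
    parsed += f.parsed;
    total += f.total;
  }
  return total === 0 ? 0 : Math.floor((parsed * 100) / total);
}
```

Keeping this as a derived value avoids the stored percentage drifting out of sync with the per-file counts.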
## Implementation Priority
| Improvement | Effort | Impact | Priority |
|---|---|---|---|
| 1.1 Reduce throttle interval | Low | Medium | P1 |
| 1.2 Add progress logging | Low | Low | P1 |
| 2.1 Dedicated progress event | Medium | High | P2 |
| 2.2 SignalR progress channel | Medium | High | P2 |
| 2.3 Direct SignalR push | Medium | High | P2 |
| 3.1 Separate progress state | High | Medium | P3 |
| 3.2 WebSocket streaming | High | Medium | P3 |
| 3.3 Frontend state management | Medium | Medium | P3 |
## Recommended Implementation Order

1. Sprint 1: Phase 1 - Quick wins (1.1, 1.2)
   - Immediate improvement with minimal risk
   - Provides better diagnostics for further investigation
2. Sprint 2: Phase 2 - Dedicated channel (2.1, 2.2, 2.3)
   - Core architectural fix
   - Decouples progress from Project entity updates
3. Future: Phase 3 - As needed
   - Only if Phase 2 doesn't fully resolve issues
   - Consider for very large file uploads
## Testing Considerations

### Manual Testing
- Upload file with 1,000+ studies
- Observe frontend progress bar updates
- Verify updates occur at expected intervals
- Check browser DevTools for SignalR messages
### Automated Testing
- Unit tests for throttle behavior
- Integration tests for progress event publishing
- E2E tests for frontend progress display
## Success Metrics

- Progress updates reach the frontend at least once per second during parsing
- Frontend progress bar shows smooth, continuous progress
- No "stuck" progress bars during active parsing
- Reduced SignalR payload size for progress updates (Phase 2+)
## Related Documentation
## Appendix: Code References

### Key Files

| File | Purpose |
|---|---|
| `ProjectManagementService.cs` | Progress throttle, update logic |
| `SearchImportJobStateMachine.cs` | State machine orchestration |
| `ReferenceFileParseJobConsumer.cs` | File parsing consumer |
| `NotificationHub.cs` | SignalR hub |
| `signal-r.service.ts` | Frontend SignalR service |
| `MongoContext.cs` | MongoDB change streams |