
SyRF Data Export Feature - Comprehensive Analysis

Executive Summary

The data export feature is a substantial, well-architected system spanning the full stack: Angular frontend with NgRx state management, ASP.NET API controller, domain models, CSV generation engine, real-time progress tracking via SignalR, and intelligent background progress reporting with MongoDB persistence. The feature handles 6 export formats, 3 blinding levels, 6 annotation categories, and streams CSV data directly to the HTTP response.

Overall assessment: The architecture is strong and the error handling model is sophisticated. However, there is virtually zero test coverage, significant code duplication in error handling, some questionable control flow decisions, and several areas where debuggability and maintainability could be improved. Below is the detailed analysis.


1. Architecture Overview

User clicks "Download Data"
Angular Component (generates correlationId, dispatches action)
NgRx Effects → DataExportService.createDataExportJob()
POST /api/projects/{id}/data-export-jobs  →  returns 202 + jobId
NgRx Effects → DataExportService.getDataExport()  (starts file download)
    │                                              ┌──────────────────────────┐
    ▼                                              │ SignalR (parallel)       │
GET /api/projects/{id}/data-export/{jobId}         │ DataExportJobNotification│
    │                                              │ → real-time progress     │
    ▼                                              └──────────────────────────┘
DataExportController.GetDataExport()
StudyDataService → CsvDataExportWriter → StudyWriter → Row Writers
    │                    │                    │
    ▼                    ▼                    ▼
IAsyncEnumerable<Study> → IAsyncEnumerable<IEnumerable<string>> → CSV stream
OptimizedDataExportProgressReporter (channel-based batching → MongoDB updates)

Key design decisions:

  • Export is synchronous-streaming: the GET request streams CSV directly, with no intermediate file storage
  • Progress is tracked via a channel-based producer-consumer pattern running alongside the stream
  • A two-step flow (POST to create the job, GET to download) allows the job to be tracked before the download begins
  • The frontend uses correlation IDs to link requests to jobs independently of backend job IDs
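
The correlation-ID decision can be sketched in plain TypeScript (names are illustrative; the real implementation lives in NgRx state rather than a Map):

```typescript
// Hypothetical sketch of the correlation-ID pattern: the frontend keys each
// export request by a client-generated UUID, then attaches the backend jobId
// once the 202 response arrives. The UI never has to wait for a job ID.
type ExportRequest = {
  correlationId: string;
  jobId: string | null; // filled in once POST returns 202 + jobId
  status: 'requested' | 'downloading' | 'completed' | 'failed';
};

const requests = new Map<string, ExportRequest>();

function startExport(correlationId: string): ExportRequest {
  const req: ExportRequest = { correlationId, jobId: null, status: 'requested' };
  requests.set(correlationId, req);
  return req;
}

function attachJobId(correlationId: string, jobId: string): void {
  const req = requests.get(correlationId);
  if (req) {
    req.jobId = jobId;
    req.status = 'downloading';
  }
}

// Usage: the UI can look up progress by correlationId even before the
// backend has assigned a job ID.
startExport('abc-123');
attachJobId('abc-123', 'job-42');
console.log(requests.get('abc-123')?.jobId); // 'job-42'
```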


2. Maintainability

Strengths

  • Clean separation of concerns: Controller (DataExportController) handles HTTP, StudyDataService orchestrates, CsvDataExportWriter generates CSV, StudyWriter handles per-study rows, FormatOptions subclasses handle format-specific logic
  • Immutable context objects: ExportContext, StudyContext, WriterConfig are records/immutable, preventing accidental mutation
  • Strategy pattern for formats: FormatOptions abstract class with 6 concrete implementations (LongFormatOptions, WideFormatOptions, etc.) makes adding new formats straightforward
  • Frontend component structure: Shared DataExportLayoutComponent provides consistent UX across all 4 export types via content projection

Concerns

2a. Massive error handling duplication - This is the most significant maintainability issue.

CsvDataExportWriter.CsvRows() (lines 202-359) has 7 catch blocks, each ~20 lines, creating nearly identical DataExportError objects. StudyWriter.GetRowsForStudy() (lines 66-229) has 8 catch blocks doing the same thing. That's ~450 lines of nearly identical error handling boilerplate across two files.

// This pattern repeats 15 times across the two files:
catch (SomeException ex)
{
    var structuredError = new DataExportError(
        Message: $"...",
        ErrorType: DataExportErrorType.SomeType,
        Category: DataExportErrorCategory.SomeCategory,
        Scope: DataExportErrorScope.SingleStudy,
        CanContinue: true,
        Timestamp: DateTime.UtcNow,
        StudyId: study.Id,
        Metadata: new Dictionary<string, string> { ... }
    );
    _progressReporter.Report(new(...));
    rows = new List<List<string>> { new() { study.Id.ToString(), "CODE", "message" } };
}

Recommendation: Extract a single error-mapping method:

private (DataExportError error, IEnumerable<string> errorRow) HandleStudyError(
    Exception ex, Guid studyId, string? studyTitle)
{
    var (errorType, category, errorCode) = ex switch
    {
        UnauthorizedAccessException => (DataExportErrorType.StudyAccessDenied, 
            DataExportErrorCategory.PermissionDenied, "ACCESS_DENIED"),
        ArgumentNullException ane when ane.ParamName?.Contains("annotation") == true =>
            (DataExportErrorType.MissingRequiredField, DataExportErrorCategory.MissingData, "MISSING_DATA"),
        // ... etc
    };
    // Build error + report + return error row
}

This would reduce ~450 lines to ~50, making the actual control flow of CsvRows() and GetRowsForStudy() much easier to follow.

2b. CsvDataExportWriter constructor is doing too much (lines 95-158). It performs complex filtering logic (stage question filtering, category question filtering, label exclusion) that should arguably live in a dedicated factory or builder. The constructor is 60 lines of dense initialization logic.

2c. Console.WriteLine for logging - Both DataExportProgressReporter (lines 149, 198, 220) and CsvDataExportWriter (lines 396-397, 410-411) use Console.WriteLine for logging. This bypasses the structured logging infrastructure (ILogger), making log filtering, correlation, and production monitoring difficult.

2d. Magic numbers scattered throughout:

  • pagingThreshold = 25000 (controller:80)
  • SnapshotInterval = 50 (CsvDataExportWriter:163)
  • 500 error memory limit (reporter:173)
  • 1000 channel capacity (reporter:82)
  • desiredUpdateCount = 100 (reporter:70)

These should be in a configuration class or at minimum const declarations at class level with documentation.


3. Structure & Organization

Strengths

  • Logical file organization: Domain models in Model/DataExportJobAggregate/, services in Services/DataExportServices/, contexts in Contexts/, options in Options/, performance tracking in Performance/
  • Frontend follows feature-module pattern: data-export/ folder contains all routes, with shared components in shared/
  • NgRx follows entity/UI state separation: Job entities normalized separately from UI request status, selectors compose them

Concerns

3a. The controller is doing data access strategy work - DataExportController.GetDataExport() (lines 77-94) contains study retrieval strategy logic (paging threshold decisions, choosing between paged/non-paged queries). This should live in the service layer. The controller should just call something like _studyDataService.GetDataExport(job) and let the service handle how studies are fetched.

3b. Dual completion reporting - Both CsvDataExportWriter.CsvRows() (line 403) AND DataExportController (line 128) report DataExportJobStatus.Completed. This creates a race condition where two "Completed" signals hit the progress reporter. While the reporter handles this gracefully (deduplication), it indicates unclear ownership of the "mark complete" responsibility.

3c. ExportContext is a mutable record - Despite being a record, ExportContext has mutable state (ExpectedColumnCount via SetExpectedColumnCount()). This is surprising for a record type and undermines the immutability contract that records typically imply. Consider using a separate validation context or making this explicit.

3d. Frontend: DataExportState type is complex - The discriminated union in data-export.selectors.ts (lines 50-75) uses TypeScript intersection types with string literal types. While type-safe, this makes the type harder to work with than a simpler status enum + nullable fields approach.


4. Ease of Reasoning

Strengths

  • Correlation ID pattern is well-implemented: The flow from frontend UUID generation through NgRx state to backend job is traceable
  • Progress reporter has clear producer-consumer semantics: The channel-based design is well-documented with comments explaining why each decision was made
  • Domain model is self-documenting: DataExportError, DataExportJobSummary, DataExportJobStatus have clear, descriptive names

Concerns

4a. The streaming execution model is hard to reason about - The GET endpoint creates a FileCallbackResult whose callback executes asynchronously after the controller method returns. This means:

  • The progress reporter's lifecycle spans beyond the controller's scope
  • Errors during streaming can't return proper HTTP status codes (headers already sent)
  • The catch block at controller level (line 132) re-throws, but the response is already streaming - the client may receive a partial CSV followed by a connection drop

This is a fundamental architectural tension. The user gets a 200 OK when the download starts, but the export can still fail midway. The current design handles this via SignalR notifications and the job summary, but it's a source of confusion.
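
One way to make that reconciliation explicit on the client is to treat the SignalR job status as authoritative over the HTTP download outcome. A minimal sketch in plain TypeScript (function and field names are illustrative, not the actual frontend code):

```typescript
// Hypothetical reconciliation of the two signals the client receives: the
// HTTP download (which may die mid-stream after a 200 OK) and the SignalR
// job notification (which carries the authoritative status).
type JobNotification = {
  jobId: string;
  status: 'InProgress' | 'Completed' | 'Failed';
  message?: string;
};

function resolveOutcome(
  downloadFinished: boolean,
  lastNotification: JobNotification | null
): 'completed' | 'failed' | 'pending' {
  // The notification wins: a finished download paired with a Failed
  // notification means the CSV on disk is partial and should be discarded.
  if (lastNotification?.status === 'Failed') return 'failed';
  if (lastNotification?.status === 'Completed' && downloadFinished) return 'completed';
  return 'pending';
}

console.log(resolveOutcome(true, { jobId: 'j1', status: 'Failed' }));    // 'failed'
console.log(resolveOutcome(true, { jobId: 'j1', status: 'Completed' })); // 'completed'
```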

4b. Error rows break CSV semantics - When a study fails, an "error row" with 3 columns ([studyId, ERROR_CODE, message]) is inserted into the CSV. But the CSV header has N columns. The column count validation should catch this, but only if ExportContext.ExpectedColumnCount is set. There's an implicit assumption that error rows will fail column validation and stop the export, but the code also says "we'll continue for now" (line 334). This ambiguity makes it hard to predict what the output CSV looks like when errors occur.

4c. OutOfMemoryException handling is contradictory - In CsvDataExportWriter (line 332), CanContinue is set to false, but the code says "we'll continue for now" and produces an error row. In the DataExportController (line 139), the same exception stops the entire export. The handling is inconsistent between layers.


5. Debuggability

Strengths

  • Rich error metadata: DataExportError.Metadata dictionaries include StudyTitle, ExceptionType, StackTrace, ProjectId, JobId, and format-specific context
  • StudyContext.BuildAnnotationLookupMetadata() provides extensive debugging context including question hierarchy paths
  • Progress reporter tracks statistics: Max queue depth, total updates received vs. DB updates performed - useful for performance debugging
  • PerformanceMetrics tracks per-study and per-outcome-row timing with Stopwatch precision

Concerns

5a. Console.WriteLine instead of structured logging - This is both a maintainability and debuggability issue. Progress reporter logging (lines 149, 198, 220) goes to stdout rather than structured logs. In a Kubernetes environment, these will mix with all other stdout output without correlation IDs, log levels, or structured fields.

Recommendation: Inject ILogger<OptimizedDataExportProgressReporter> and use structured logging:

_logger.LogInformation("Processing update for Job {JobId}: Status={Status}, Processed={Processed}",
    _jobId, status, numberProcessed);

5b. Error deduplication can hide problems - The HashSet<DataExportError> deduplication (reporter:159) and the 500-error memory cap (reporter:173-188) mean that in a large export with many errors, older errors are silently discarded. The summary still counts them, but the detailed error list may be incomplete. This could make debugging a failing export harder. Consider logging a warning when errors are evicted.

5c. No request/response logging at the API level - The DataExportController has no logging at all. Adding structured logs for job creation (with correlation ID, project ID, export format) and download initiation would significantly help with production debugging.

5d. Frontend error display is thorough - The ExportErrorDetailsDialogComponent groups errors by category and type, shows affected study counts, and has copy-to-clipboard for error reports. This is genuinely helpful for end users.


6. Error Handling & User Feedback

Strengths

  • Structured error model is excellent: DataExportError with typed enums (DataExportErrorType, DataExportErrorCategory, DataExportErrorScope) plus CanContinue flag and metadata is far better than unstructured strings
  • DataExportJobSummary provides a clear completion summary with success rate, error breakdowns by type/category, and a human-readable message
  • Frontend displays errors progressively: During export, the layout shows "Z errors encountered" in real-time alongside the progress bar
  • Error details dialog groups errors intelligently and allows copying for bug reports
  • Non-critical errors don't stop the export: Individual study failures produce error rows while the export continues, which is the right trade-off for large exports

Concerns

6a. Validation messages could be more user-friendly - The BadRequest responses in CreateDataExportJob (controller:52-54) say things like "SplitBySession is must not be true in Long Format" (note the grammar error: "is must not be true"). These are raw developer messages exposed to the UI.

Recommendation: Return structured validation errors:

return BadRequest(new { 
    field = "splitBySession", 
    message = "Session splitting is only available with Wide format exports." 
});

6b. Missing authorization checks - The controller doesn't verify that the requesting user is a member of the project. CreateDataExportJob saves the job with CurrentUserId but never checks project membership. GetDataExport fetches the job by ID without verifying the requesting user has access. A user could potentially download exports from projects they don't belong to if they know the job ID.

6c. Error rows corrupt CSV output - As mentioned, when a study fails processing, a 3-column error row ([studyId, ERROR_CODE, message]) is inserted into a CSV that may have 50+ columns. This will either fail column validation (stopping the export) or produce an invalid CSV. The behavior depends on whether ExpectedColumnCount is set, creating unpredictable outcomes.

Better approach: Either skip the study entirely (omitting it from output), or pad error rows to match the expected column count:

var errorRow = Enumerable.Repeat("", ExportContext.ExpectedColumnCount ?? 3).ToList();
errorRow[0] = study.Id.ToString();
errorRow[1] = "ERROR: " + errorCode;
errorRow[2] = message;  // keep the message so the padded row still carries the original three fields

6d. No cancellation support - There's no CancellationToken threading through the export pipeline. A user who navigates away or closes the tab has no way to cancel a long-running export. The export continues consuming server resources (MongoDB cursor, memory, CPU) until completion. For exports of 25,000+ studies, this could be significant.

6e. No concurrent export limiting - There's no check to prevent a user from starting multiple simultaneous exports, which could overwhelm the database with parallel full-collection scans.


7. Testing

This is the critical weakness of the feature.

Layer    | Component                           | Tests | Assessment
API      | DataExportController                | 0     | No controller tests
Service  | StudyDataService                    | 0     | No service tests
Core     | CsvDataExportWriter                 | 0     | No CSV generation tests
Core     | StudyWriter                         | 0     | No row generation tests
Core     | DataExportJob aggregate             | 0     | No domain model tests
Core     | OptimizedDataExportProgressReporter | 0     | No concurrency tests
Data     | DataExportJobRepository             | 0     | No repository tests
Frontend | 10 component specs                  | ~12   | All are "should create" stubs
Frontend | Effects/Reducers/Selectors          | 0     | No state management tests
Frontend | 2 component specs                   | 0     | Broken with expect(true).toBe(true)

Overall: ~0% meaningful test coverage.

What this means

  • No regression safety net: Any change to the export pipeline could break it silently
  • No format correctness verification: Nobody has verified that Wide format produces correct column counts, or that Long format generates the right number of rows per study
  • No progress reporter verification: The channel-based batching logic (376 lines) with error deduplication, memory management, and summary generation has zero tests. This is complex concurrent code that is hard to get right.
  • No blinding verification: The investigator blinding logic is untested - a bug could expose investigator identities in "blinded" exports

Minimum viable test coverage I'd recommend (prioritized)

Priority 1 - High-value, moderate effort:

  1. CsvDataExportWriter unit tests - Test each format produces correct headers, correct row counts, handles missing data gracefully. This is the core value of the feature.
  2. OptimizedDataExportProgressReporter tests - Test batching, deduplication, memory limits, summary generation. This is complex concurrent code.
  3. DataExportController integration tests - Test validation, job creation, and the streaming callback.

Priority 2 - Medium-value:

  1. StudyWriter tests - Per-format row generation, blinding levels, bibliographic data inclusion
  2. NgRx selector tests - selectDataExportStateFromCorrelationId computes complex derived state
  3. NgRx effects tests - Verify the create-then-download flow
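
Selector tests in particular can often bypass the store entirely by exercising the derivation logic as a pure function. A minimal sketch in plain TypeScript (the deriveProgress helper is hypothetical, mirroring the kind of math selectDataExportStateFromCorrelationId performs):

```typescript
// Hypothetical pure helper mirroring the derived-state computation: a
// progress fraction guarded against a missing or zero total. Testing it
// directly pins down edge cases without spinning up the NgRx store.
function deriveProgress(processed: number, total: number | null): number | null {
  if (total === null || total <= 0) return null;
  return Math.min(1, processed / total);
}

// Unit tests then become trivial assertions:
console.assert(deriveProgress(50, 100) === 0.5);
console.assert(deriveProgress(0, null) === null);  // total unknown until job starts
console.assert(deriveProgress(120, 100) === 1);    // never exceeds 100%
```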

Priority 3 - Lower priority but still important:

  1. Fix broken Angular specs (QuantitativeDataComponent, QuestionSelectionDialogComponent)
  2. Frontend form validation tests
  3. Error display component tests

8. Summary of Recommendations

Quick Wins (low effort, high impact)

# | Recommendation                                                                   | Impact
1 | Replace Console.WriteLine with ILogger in progress reporter and CSV writer       | Debuggability
2 | Fix grammar in validation messages ("is must not be true")                       | User experience
3 | Extract error-mapping helper to eliminate ~400 lines of duplicated catch blocks  | Maintainability
4 | Add project membership authorization check to both controller endpoints          | Security
5 | Fix error rows to pad to expected column count (or skip study entirely)          | Correctness

Medium-Term Improvements

#  | Recommendation                                                                    | Impact
6  | Move study retrieval strategy from controller into StudyDataService               | Structure
7  | Add CancellationToken support through the export pipeline                         | Resource management
8  | Consolidate "mark complete" to single location (remove dual completion reporting) | Clarity
9  | Add concurrent export limiting (per-user or per-project)                          | Reliability
10 | Extract CsvDataExportWriter constructor logic into a builder                      | Maintainability

Strategic Investment (high effort, high impact)

#  | Recommendation                                                  | Impact
11 | Add CsvDataExportWriter unit tests (Priority 1)                 | Regression safety
12 | Add OptimizedDataExportProgressReporter unit tests (Priority 1) | Concurrency correctness
13 | Add controller integration tests (Priority 1)                   | End-to-end confidence
14 | Consider background job architecture for very large exports     | Scalability

What NOT to change

  • The DataExportError / DataExportJobSummary model is well-designed - keep it
  • The channel-based progress reporter architecture is sound - just add tests
  • The Angular signal-based component architecture is modern and clean
  • The ExportContext / StudyContext immutable context pattern is good
  • The SignalR real-time notification integration is valuable

The feature works well as a whole system, with particularly strong error modeling and real-time UX. The critical gap is testing - the level of sophistication in the code (channel-based batching, error deduplication, concurrent contexts, streaming CSV) demands automated tests to maintain confidence. The error handling duplication is the most impactful maintainability issue and should be the first refactoring target.

You're right - several issues I flagged are diagnosis without prescription. Let me fix that with concrete, actionable recommendations for each.


Recommendations for Issues Without Them

2b. CsvDataExportWriter constructor doing too much

The constructor (60 lines) mixes configuration, filtering, and context assembly. Extract a static factory:

// New: CsvDataExportWriterFactory.cs
public static class CsvDataExportWriterFactory
{
    public static CsvDataExportWriter Create(
        Project project,
        Dictionary<Guid, SystematicSearch> systematicSearchesMap,
        IEnumerable<Investigator> projectInvestigators,
        WriterConfig writerConfig,
        Func<Guid, Investigator> globalInvestigatorLookup)
    {
        var options = new StudyWriterOptions(writerConfig);
        var stageIds = (writerConfig.StageIds ?? project.Stages.Select(s => s.Id)).ToList();
        var categoryInfo = ResolveCategoryInfo(options);
        var stageQuestionIds = ResolveStageQuestionIds(project, stageIds);
        var annotationContext = BuildAnnotationContext(project, categoryInfo, stageQuestionIds);

        return new CsvDataExportWriter(project, options, annotationContext, ...);
    }

    private static CategoryInfo? ResolveCategoryInfo(StudyWriterOptions options) { ... }
    private static List<Guid> ResolveStageQuestionIds(Project project, List<Guid> stageIds) { ... }
    private static AnnotationContext BuildAnnotationContext(...) { ... }
}

The constructor then becomes a simple field assignment. Each helper method is independently testable.


3a. Move study retrieval strategy from controller into service layer

The controller currently decides paging strategy (DataExportController.cs:77-94). Move this into StudyDataService:

// In IStudyDataService - simplify the signature:
IAsyncEnumerable<IEnumerable<string>> GetDataExport(
    DataExportJob job,
    IProgress<DataExportJobUpdate> reporter);
    // No longer needs studies or count passed in - it fetches them itself

// In StudyDataService implementation:
public async IAsyncEnumerable<IEnumerable<string>> GetDataExport(
    DataExportJob job,
    IProgress<DataExportJobUpdate> reporter)
{
    var (count, studies) = await ResolveStudySource(job);
    // ... existing logic using count and studies
}

private async Task<(int count, IAsyncEnumerable<Study> studies)> ResolveStudySource(
    DataExportJob job)
{
    const int pagingThreshold = 25000;
    var projectId = job.ProjectId;
    // ... paging decision logic currently in controller
}

The controller becomes:

[HttpGet("api/projects/{projectId:guid}/data-export/{jobId:guid}")]
public async Task<IActionResult> GetDataExport(Guid projectId, Guid jobId)
{
    var job = await _pmUnitOfWork.DataExportJobs.GetOrDefaultAsync(jobId);
    if (job is null) return NotFound(...);

    return new FileCallbackResult(
        new MediaTypeHeaderValue("text/csv"),
        async (outputStream, _) =>
        {
            await using var reporter = new OptimizedDataExportProgressReporter(job.Id, _pmUnitOfWork);
            var rows = _studyDataService.GetDataExport(job, reporter);
            await WriteCsvToStream(outputStream, rows, reporter);
        })
    { FileDownloadName = BuildFileName(job), LastModified = DateTimeOffset.Now };
}

3b. Consolidate dual completion reporting

Currently both CsvDataExportWriter.CsvRows() (line 403) AND DataExportController (line 128) report Completed. Pick one owner - the writer, since it knows the actual study count:

// CsvDataExportWriter.CsvRows() - KEEP this completion report (line 403)
_progressReporter.Report(new(
    NumberProcessed: studyCount, 
    Status: DataExportJobStatus.Completed, 
    ForceUpdate: true));
// DataExportController.cs - REMOVE this (lines 128-130):
// progressReporter.Report(
//     new(Convert.ToInt32(count), DataExportJobStatus.Completed, true)
// );

The writer already reports completion at the end of CsvRows(). The controller's duplicate report is redundant and creates a race.


3c. ExportContext is a mutable record

Replace the mutable SetExpectedColumnCount with an immutable approach - set it during construction or use a wrapper:

// Option A: Make it a separate validation concern
public class CsvColumnValidator
{
    private int? _expectedCount;

    public void SetExpectedColumnCount(int count) { ... }
    public DataExportError? Validate(IEnumerable<string> row, ...) { ... }
}

// ExportContext stays immutable, CsvColumnValidator is owned by CsvDataExportWriter

This removes the mutation from ExportContext and makes the validation lifecycle explicit.


3d. Simplify frontend DataExportState type

Replace the discriminated union with a single flat type using nullable fields:

// Instead of the intersection + union type:
export type DataExportState = {
  correlationId: string;
  jobId: string | null;
  status: DataExportRequestStatus;
  processedStudyCount: number;
  progress: number | null;
  totalStudyCount: number | null;
  resolved: boolean;
  errors: DataExportError[] | null;
  summary: DataExportJobSummary | null;
};

The current discriminated union adds complexity for marginal type safety gains - callers already check status before accessing errors/summary, so the narrow typing doesn't prevent real bugs. The flat type is easier to construct, destructure, and reason about.


4a. Streaming model - partial response problem

The fundamental issue: headers are sent (200 OK) before the export can fail. Two approaches:

Option A (pragmatic, recommended): Accept the current model but improve error communication. When the export fails mid-stream, write a clearly-marked error trailer row at the end of the CSV:

catch (Exception e)
{
    // Write error marker that tools/users can detect
    var errorRow = new List<string> 
    { 
        "### EXPORT FAILED ###", 
        e.Message, 
        DateTime.UtcNow.ToString("O") 
    };
    // Pad to column count for valid CSV
    while (errorRow.Count < expectedColumnCount)
        errorRow.Add("");

    errorRow.ForEach(field => csv.WriteField(field));  // write each field individually, matching the normal row path
    await csv.NextRecordAsync();
    await writer.FlushAsync();

    // Report failure via progress reporter (for SignalR notification)
    progressReporter.Report(new(...Failed...));
}

Option B (future, more robust): Write CSV to a temporary blob/file, only return the download link once complete. This is a bigger architectural change (background job model) but eliminates the partial-response problem entirely.


4c. OutOfMemoryException contradictory handling

Pick one behavior and be consistent. Since CanContinue: false is correct (memory issues are system-wide), the handler should not produce an error row and continue. Instead, re-throw to let the controller-level handler stop the export:

// CsvDataExportWriter.CsvRows() - change the OOM handler:
catch (OutOfMemoryException ex)
{
    var structuredError = new DataExportError(
        Message: $"Insufficient memory processing study {study.Id}",
        ErrorType: DataExportErrorType.InsufficientMemory,
        Category: DataExportErrorCategory.SystemError,
        Scope: DataExportErrorScope.EntireExport,
        CanContinue: false,  // This is correct
        Timestamp: DateTime.UtcNow,
        StudyId: study.Id
    );
    _progressReporter.Report(new(studyCount, DataExportJobStatus.Failed, true, 
        new List<DataExportError> { structuredError }));

    throw;  // Let it propagate - don't continue with error row
}

This aligns the behavior with what CanContinue: false promises.


5b. Error deduplication eviction - log when errors are discarded

// OptimizedDataExportProgressReporter.cs, around line 173:
if (_accumulatedErrors.Count > 500)
{
    var errorsToRemove = _accumulatedErrors.Count - 500;

    // ADD: Log that errors are being evicted
    Console.WriteLine(  // or better, ILogger
        $"[Reporter] WARNING: Error memory limit reached for Job {_jobId}. " +
        $"Evicting {errorsToRemove} oldest errors. Total errors so far: {_accumulatedErrors.Count}");

    // Also track eviction count for the final summary
    _totalEvictedErrors += errorsToRemove;

    // ... existing eviction logic
}

Then include _totalEvictedErrors in the DataExportJobSummary message so users know errors were lost:

// In GenerateFinalSummary():
var evictionNote = _totalEvictedErrors > 0 
    ? $" ({_totalEvictedErrors} older errors were evicted due to memory limits.)" 
    : "";
// Append to summaryMessage

6d. Cancellation support

Thread CancellationToken through the pipeline:

// Controller - pass HttpContext.RequestAborted:
return new FileCallbackResult(
    new MediaTypeHeaderValue("text/csv"),
    async (outputStream, httpContext) =>
    {
        var cancellationToken = httpContext.RequestAborted;
        await using var progressReporter = new OptimizedDataExportProgressReporter(...);

        var studyWideData = _studyDataService.GetDataExport(
            job, progressReporter, studies, count, cancellationToken);

        await foreach (var record in studyWideData.WithCancellation(cancellationToken))
        {
            record.ForEach(field => csv.WriteField(field));
            await csv.NextRecordAsync();
        }
    });

// StudyDataService - accept and pass through:
IAsyncEnumerable<IEnumerable<string>> GetDataExport(
    DataExportJob job,
    IProgress<DataExportJobUpdate> reporter,
    IAsyncEnumerable<Study> studies,
    int count,
    CancellationToken cancellationToken = default);

// CsvDataExportWriter.CsvRows() - check periodically:
await foreach (var study in studies.WithCancellation(cancellationToken))
{
    cancellationToken.ThrowIfCancellationRequested();
    // ... existing logic
}

When cancelled, the progress reporter's DisposeAsync will flush the final state, and the job will show as failed with a cancellation message.


6e. Concurrent export limiting

Add a simple per-project concurrency check at job creation:

// DataExportController.CreateDataExportJob():
[HttpPost("api/projects/{projectId:guid}/data-export-jobs")]
public async Task<IActionResult> CreateDataExportJob(
    Guid projectId, DataExportRequest dataExportRequest)
{
    // ADD: Check for active exports
    var activeJobs = await _pmUnitOfWork.DataExportJobs
        .GetActiveJobsForProjectAsync(projectId);
    if (activeJobs.Count >= 2)  // Allow 2 concurrent exports per project
    {
        return Conflict(new {
            message = "This project already has active exports in progress. " +
                      "Please wait for them to complete before starting a new one.",
            activeJobIds = activeJobs.Select(j => j.Id)
        });
    }

    // ... existing logic
}

// New repository method:
public interface IDataExportJobRepository : ICrudRepository<DataExportJob, Guid>
{
    Task<List<DataExportJob>> GetActiveJobsForProjectAsync(Guid projectId);
    // ... existing methods
}

// Implementation queries for Pending or InProgress jobs:
public async Task<List<DataExportJob>> GetActiveJobsForProjectAsync(Guid projectId)
{
    return await Collection
        .Find(j => j.ProjectId == projectId && 
              (j.Status == DataExportJobStatus.Pending || 
               j.Status == DataExportJobStatus.InProgress))
        .ToListAsync();
}

These should give concrete implementation paths for every issue I flagged.


2b. Is it bad practice to call many methods in a constructor?

Not inherently - it depends on what the methods are doing. The general guidance:

Acceptable in constructors:

  • Computing derived fields from constructor parameters (pure transformations)
  • Building lookup dictionaries, filtering lists
  • Setting up internal state

Problematic in constructors:

  • I/O (database calls, HTTP requests, file reads)
  • Side effects (logging, event publishing)
  • Work that can fail in complex ways requiring error handling
  • Work that makes the object partially constructed if it throws

The CsvDataExportWriter constructor is in a grey area. It's all pure computation (filtering questions, building maps, resolving categories), so there's no I/O concern. The real problem isn't that it calls many methods - it's that the logic is inline rather than in named methods, making it a 60-line wall of dense LINQ. If the same logic were extracted into well-named private methods called from the constructor, it would be perfectly fine:

public CsvDataExportWriter(Project project, ..., WriterConfig writerConfig, ...)
{
    Project = project;
    _progressReporter = writerConfig.ProgressReporter;

    var options = new StudyWriterOptions(writerConfig);
    var stageIds = ResolveStageIds(writerConfig, project);
    var categoryInfo = ResolveCategoryInfo(options);
    var stageQuestionIds = GetStageQuestionIds(project, stageIds);

    CategorySystemQuestions = GetCategorySystemQuestions(options, categoryInfo, stageQuestionIds);
    AllAnnotationQuestionsInProject = GetAllAnnotationQuestions(project);

    ExportContext = BuildExportContext(options, categoryInfo, stageQuestionIds, ...);
}

Same work, same constructor, but each step is named and independently testable. A factory is only warranted if the construction logic needs to be async or needs to be shared across multiple creation paths - neither applies here. So I'd revise my earlier recommendation: just extract private methods, don't add a factory.


4a Option B. Resources required for background job export

If you moved to a "write to temp storage, then serve" model:

What changes:

1. The GET endpoint no longer streams in-request. Instead, POST creates the job, a background worker processes it, and a new GET endpoint serves the completed file.
2. You need temporary file storage for completed CSVs.

Resource requirements on the API pod: actually less than today. Currently, the API pod holds a MongoDB cursor and streams CSV for the entire duration of the download. Under the background model, the API pod just serves a pre-built file (or a redirect).

What you'd need instead:

| Resource | Options | Cost |
|---|---|---|
| Worker process | MassTransit consumer in the existing PM service (or a dedicated worker pod) | Minimal - reuses existing infrastructure. Could even be a new consumer in the API service itself |
| Temp file storage | GCS bucket with lifecycle policy (auto-delete after 24h) | Negligible - CSVs are small (typically < 100MB). A lifecycle rule handles cleanup automatically |
| In-memory alternative | Redis or in-pod tmpfs for small exports | Riskier - pod restarts lose data |

Recommended approach using what you already have:

POST /data-export-jobs  →  Creates job (Pending)
                        MassTransit command: IStartDataExportJobCommand
                        Consumer processes export, writes CSV to GCS
                        Updates job status → Completed, sets downloadUrl
                        SignalR notifies client (already wired up)
GET /data-export/{jobId}/download  →  Redirects to signed GCS URL (or streams from GCS)

This fits naturally into the existing MassTransit + job pattern you already use for SearchImportJob, RobProcessingJob, etc. The export consumer would live in the PM service alongside those existing consumers. GCS signed URLs with 1-hour expiry handle the download, so no long-lived API connections. A GCS lifecycle rule (delete objects older than 24h) prevents storage accumulation.

Net infrastructure cost: One GCS bucket with a lifecycle rule. Everything else reuses existing patterns.


6e. How to prevent stale jobs from blocking future exports

Good catch - if a job gets stuck in Pending or InProgress (pod crash, OOM kill, network partition), it would permanently block new exports under the concurrency limit. Two complementary defenses:

Defense 1: TTL-based staleness check at query time

Don't just check status - check age:

public async Task<List<DataExportJob>> GetActiveJobsForProjectAsync(Guid projectId)
{
    var staleThreshold = DateTime.UtcNow.AddMinutes(-30); // Jobs older than 30min are stale

    var filter = Builders<DataExportJob>.Filter.And(
        Builders<DataExportJob>.Filter.Eq(j => j.ProjectId, projectId),
        Builders<DataExportJob>.Filter.In(j => j.Status, 
            new[] { DataExportJobStatus.Pending, DataExportJobStatus.InProgress }),
        Builders<DataExportJob>.Filter.Gt(j => j.Audit.Created, staleThreshold) // Ignore stale
    );

    return await Collection.Find(filter).ToListAsync();
}

Jobs older than 30 minutes are ignored regardless of status. A legitimately running 30+ minute export is rare enough that this is a safe cutoff (and you could tune it). This means stale jobs never block new exports - they simply fall out of the active window.

Defense 2: Mark stale jobs as failed when discovered

While checking for active jobs, clean up any stale ones you find:

// In the controller, before the concurrency check:
await _pmUnitOfWork.DataExportJobs.FailStaleJobsAsync(projectId, 
    staleThreshold: TimeSpan.FromMinutes(30));
// Repository implementation:
public async Task FailStaleJobsAsync(Guid projectId, TimeSpan staleThreshold)
{
    var cutoff = DateTime.UtcNow - staleThreshold;

    var filter = Builders<DataExportJob>.Filter.And(
        Builders<DataExportJob>.Filter.Eq(j => j.ProjectId, projectId),
        Builders<DataExportJob>.Filter.In(j => j.Status,
            new[] { DataExportJobStatus.Pending, DataExportJobStatus.InProgress }),
        Builders<DataExportJob>.Filter.Lt(j => j.Audit.Created, cutoff)
    );

    var update = Builders<DataExportJob>.Update
        .Set(j => j.Status, DataExportJobStatus.Failed)
        .Set(j => j.Summary, new DataExportJobSummary(
            Message: "Export job timed out - the server may have restarted during processing.",
            FinalStatus: DataExportJobStatus.Failed,
            TotalStudiesProcessed: 0,
            TotalErrors: 1,
            HasCriticalErrors: true,
            CompletedAt: DateTime.UtcNow,
            ErrorsByType: new() { ["TimeoutError"] = 1 },
            ErrorsByCategory: new() { ["SystemError"] = 1 }
        ));

    await Collection.UpdateManyAsync(filter, update);
}

This way, stale jobs get a proper Failed status with an explanatory summary, so users who check the job status later understand what happened. And the concurrency slot is freed for new exports.

Together: Defense 1 ensures stale jobs never block (even without cleanup). Defense 2 gives users a clean explanation and keeps the data tidy. Both are cheap operations that piggyback on the existing create-job request.
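A toy in-memory sketch of how the two defenses compose on the create-job path (statuses as strings and jobs as tuples purely for illustration, not the real DataExportJob shape):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var now = DateTime.UtcNow;
var staleCutoff = now - TimeSpan.FromMinutes(30);

// Two jobs for the same project: one stuck from a crashed pod, one genuine.
var jobs = new List<(string Status, DateTime Created)>
{
    ("InProgress", now - TimeSpan.FromHours(2)),   // stale: pod died mid-export
    ("Pending",    now - TimeSpan.FromMinutes(5))  // genuinely active
};

// Defense 2: mark anything active-but-old as failed when discovered.
jobs = jobs.Select(j =>
    j.Status is "Pending" or "InProgress" && j.Created < staleCutoff
        ? (Status: "Failed", Created: j.Created)
        : j).ToList();

// Defense 1: only recent active jobs count toward the concurrency limit.
var activeCount = jobs.Count(j =>
    j.Status is "Pending" or "InProgress" && j.Created >= staleCutoff);

Console.WriteLine(activeCount); // 1 - the stale job no longer blocks
```

In the real controller this would be the FailStaleJobsAsync call followed by GetActiveJobsForProjectAsync, both against MongoDB rather than an in-memory list.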



Architectural Analysis: Class Structure & Annotation Handling

Current Class Relationships

Here's what the current architecture looks like:

DataExportController
  └→ StudyDataService (thin orchestrator)
       ├→ WriterConfig (13-param flat config from DataExportJob fields)
       │    └→ FormatOptions (created by WriterConfig constructor AND StudyWriterOptions constructor)
       ├→ CsvDataExportWriter (the central engine)
       │    ├→ StudyWriterOptions (wraps WriterConfig, re-creates FormatOptions)
       │    ├→ ExportContext (record carrying ~15 fields: project, questions, maps, options, reporter...)
       │    │    └→ Delegates to FormatOptions for GetRowWriters() and GetHeaderWriter()
       │    └→ CsvRows() iterates studies → creates StudyWriter per study
       │         └→ StudyWriter
       │              ├→ StudyContext (study + annotation map + error recording)
       │              ├→ Gets DataFormatRowWriterBase[] from FormatOptions.GetRowWriters()
       │              └→ Calls each row writer's AddDataFormatColumnsToInfo()
       └→ Row writer hierarchy (per format):
            ├→ WideDataFormatRowWriter  ←→ AnnotationsGroup / LabeledAnnotationsGroup
            ├→ LongDataFormatRowWriter  (simple: one row per annotation)
            ├→ OutcomeDataFormatRowWriter ←→ ReviewContext ←→ CohortAnnotationsGroup
            │                                                ←→ ExperimentAnnotationsGroup
            │                                                ←→ LabeledAnnotationsGroup
            ├→ LongScreeningDataFormatRowWriter
            ├→ WideScreeningDataFormatRowWriter
            └→ BibliographicDataFormatRowWriter

What's Well-Structured

The strategy pattern for formats is the right call. Six export formats with genuinely different semantics (Wide produces N rows per study grouped by investigator+label, Long produces one row per annotation, OutcomeData produces one row per timepoint per outcome per cohort/experiment combination) need polymorphism. The FormatOptions / DataFormatRowWriterBase / DataFormatHeaderWriterBase trio is a clean Strategy implementation.

The AnnotationsGroup hierarchy models a real domain concept well. The annotation hierarchy in SyRF is:

Study
  └→ Cohort (has label, has DiseaseModel + Treatment lookups)
       ├→ DiseaseModel (has label)
       └→ Treatment (has label)
  └→ Experiment (has label, has Cohort lookups)
       └→ Cohort (reuses CohortAnnotationsGroup)
  └→ Outcome (has label, via OutcomeData)

This is a genuinely complex domain relationship, and the class hierarchy (AnnotationsGroup → LabeledAnnotationsGroup → CohortAnnotationsGroup / ExperimentAnnotationsGroup) mirrors the domain accurately. The Lazy<LabeledAnnotationsGroup> caching via LabeledAnnotationsGroupCache in ReviewContext is a sensible optimisation since the same cohort/experiment annotation can be referenced by multiple outcome rows within a study.

StudyAnnotationsGroup with its answer cache is a good performance optimisation - pre-computing the grouped answers avoids repeated LINQ Where().Select() chains per question per row.
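The caching idea in miniature (data shapes hypothetical, not the real Annotation type): a single GroupBy pass builds a question-to-answers dictionary, after which each row render is an O(1) lookup rather than a fresh LINQ scan.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Flat answers as (QuestionId, Value) pairs standing in for annotation data.
var answers = new[]
{
    (QuestionId: 1, Value: "yes"),
    (QuestionId: 1, Value: "no"),
    (QuestionId: 2, Value: "10 mg/kg")
};

// One pass builds the cache...
var answersByQuestion = answers
    .GroupBy(a => a.QuestionId)
    .ToDictionary(g => g.Key, g => g.Select(a => a.Value).ToList());

// ...then every per-row lookup is a dictionary hit, not a repeated Where().Select().
var q1 = answersByQuestion.TryGetValue(1, out var values) ? values : new List<string>();
Console.WriteLine(string.Join(";", q1)); // yes;no
```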

Structural Problems

Problem 1: FormatOptions is created twice.

WriterConfig constructor creates FormatOptions (lines 32-41). Then the StudyWriterOptions constructor also creates FormatOptions from the same WriterConfig (lines 13-22). Both store it. ExportContext then uses the one from StudyWriterOptions.FormatOptions. The one created in WriterConfig is accessed via WriterConfig.FormatOptions in a few places but is mostly unused.

This is confusing and wasteful. There should be one creation point.

Recommendation: Remove the FormatOptions creation from WriterConfig. Make WriterConfig purely a data carrier (which is what its name suggests). Let StudyWriterOptions be the single point that creates FormatOptions:

// WriterConfig becomes a pure data record - no FormatOptions property
public record WriterConfig(
    ExportDataFormat ExportDataFormat,
    bool IncludeBibliographicData,
    BlindingLevel BlindingLevel,
    // ... other flat config fields
);

// StudyWriterOptions creates FormatOptions once
public record StudyWriterOptions(WriterConfig WriterConfig)
{
    public FormatOptions FormatOptions { get; } = WriterConfig.ExportDataFormat switch
    {
        ExportDataFormat.Long => new LongFormatOptions(WriterConfig),
        // ...
    };
}

Problem 2: ExportContext is a god object.

ExportContext carries 15 fields: Options, CategoryInfo, LookupQuestions, NonLabelCategoryQuestions, GeneralAnnotationQuestions, StudyAnnotationQuestionIds, SystematicSearchesMap, BlindedInvestigatorLookup, AnnotationQuestionMap, StageQuestionIds, CategoryQuestionIds, Project, StageIds, ProgressReporter, plus ExpectedColumnCount (mutable).

This is the "pass-everything" pattern. Every class that needs any context takes the whole ExportContext. This creates hidden coupling - you can't tell from a class signature what data it actually needs.

Recommendation: Don't refactor this into many small context objects (that would create a different kind of mess). Instead, recognize that ExportContext is effectively a composition root for the export pipeline. Make that explicit:

  1. Make ExportContext internal and document it as the pipeline's shared state container
  2. Extract ValidateRowColumnCount and SetExpectedColumnCount into a separate CsvValidator (as discussed earlier) - these are write-time concerns, not context
  3. Split the question-filtering logic (currently in CsvDataExportWriter constructor) into named factory methods on ExportContext.Create(...) - this puts the complex filtering near the data it operates on
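A sketch of the shape this could take, with question IDs as ints and a tuple return purely for illustration - the point is that the filtering lives in one named Create-style factory next to the context it populates, rather than inline in the CsvDataExportWriter constructor:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A Create-style factory as a local function: all the question filtering
// happens in one named place, next to the "context" it populates.
(int[] StageQuestions, int[] GeneralQuestions) CreateContext(int[] all, HashSet<int> stageIds)
    => (all.Where(stageIds.Contains).ToArray(),
        all.Where(id => !stageIds.Contains(id)).ToArray());

var allQuestionIds = new[] { 1, 2, 3, 4 };
var stageQuestionIds = new HashSet<int> { 2, 3 };

var ctx = CreateContext(allQuestionIds, stageQuestionIds);
Console.WriteLine(string.Join(",", ctx.StageQuestions));   // 2,3
Console.WriteLine(string.Join(",", ctx.GeneralQuestions)); // 1,4
```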

Problem 3: WideFormatOptions.GetRowWriters() has 4 branches doing similar-but-different things.

The method (lines 42-113) has a 2x2 matrix of (splitBySession, hasCategory) combinations, each producing WideDataFormatRowWriter instances differently. The branches share a pattern but the details differ enough that they were written as four separate LINQ chains.

The issue is that the branching logic and the data assembly are interleaved. Each branch creates ReviewContext, selects sessions/annotations, filters by label question, and constructs row writers, but the order and grouping differ.

Recommendation: Extract two clear steps - (1) determine what to iterate and how to group, (2) produce row writers from groups:

public override IEnumerable<DataFormatRowWriterBase> GetRowWriters(
    StudyContext studyContext, IEnumerable<Guid> stageIds, 
    bool? splitBySession, CategoryInfo? categoryInfo)
{
    var split = splitBySession ?? throw new ArgumentException(...);
    var hasCategory = categoryInfo is not null && categoryInfo.Category != Category.Study;

    // Step 1: Get annotation groups per investigator
    var investigatorGroups = split
        ? GetSessionBasedGroups(studyContext, stageIds)
        : GetAnnotationBasedGroups(studyContext);

    // Step 2: If category-level, expand each group into per-label rows
    return hasCategory
        ? investigatorGroups.SelectMany(g => ExpandToLabelRows(g, studyContext, categoryInfo))
        : investigatorGroups.Select(g => CreateStudyLevelRow(g, studyContext, categoryInfo));
}

This makes the 2x2 branching explicit and separable rather than four interleaved LINQ chains.


Problem 4: The annotation group handling in OutcomeDataFormatRowWriter.AddDataFormatColumnsToInfo() is doing too much.

This single method (lines 79-163) builds the entire outcome row by:

1. Querying 6 different annotation categories for lookup questions
2. Expanding DiseaseModel and Treatment groups with inner loops
3. Assembling ~20+ fixed columns (experiment, cohort, outcome labels, control questions, timepoint values)
4. Handling null groups with ?? "MISSING_*" fallbacks throughout

It's 85 lines of tightly coupled column assembly where changing the order or count of any column breaks the CSV.

Recommendation: The header writer (OutcomeDataFormatHeaderWriter) defines the column order via FixedFormatHeaders + category questions. But the row writer has no reference to the header order - it just adds columns in an order that must manually match. This is the root cause of column mismatch bugs.

Introduce a shared column definition that both header and row writer reference:

// Shared definition
public static class OutcomeColumns
{
    public static readonly ColumnDefinition[] FixedColumns = 
    {
        new("DateTimeDataEntered", (row, ctx) => ctx.OutcomeData.DateTimeCreated.ToString("...")),
        new("ExperimentId", (row, ctx) => ctx.OutcomeData.ExperimentId.ToString()),
        new("ExperimentLabel", (row, ctx) => ctx.ExperimentGroup?.LabelAnnotation?.GetAnswer() ?? "MISSING"),
        // ... all 21 fixed columns defined once
    };
}

// Header writer reads column names
protected override List<string> FormatStandardHeaders => 
    categoryHeaders.Concat(OutcomeColumns.FixedColumns.Select(c => c.Name)).ToList();

// Row writer reads column values
public override void AddDataFormatColumnsToInfo(List<string> info)
{
    foreach (var col in OutcomeColumns.FixedColumns)
        info.Add(col.GetValue(this));
}

This eliminates the possibility of header/row column mismatch entirely. The same principle applies to WideDataFormatRowWriter (where AddAnswerQuadFromGroup must produce exactly 4 columns per question to match the header's AnnotationHeaderQuad).
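The principle in miniature, with ColumnDefinition approximated by a (Name, GetValue) pair and the row writer's state by a named tuple (the real type would be a small record with those two members):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Each column is a name paired with a value extractor over the row context.
var columns = new (string Name, Func<(string ExperimentLabel, string? CohortLabel), string> GetValue)[]
{
    ("ExperimentLabel", ctx => ctx.ExperimentLabel),
    ("CohortLabel",     ctx => ctx.CohortLabel ?? "MISSING")
};

// Header and row both iterate the same array, so they cannot disagree
// on column count or order.
var headers = columns.Select(c => c.Name).ToList();
var row = columns.Select(c => c.GetValue(("Exp A", null))).ToList();

Console.WriteLine(string.Join(",", headers)); // ExperimentLabel,CohortLabel
Console.WriteLine(string.Join(",", row));     // Exp A,MISSING
```

Because both sides iterate the same array, adding, removing, or reordering a column changes header and row in lockstep.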


Problem 5: ReviewContext and StudyContext have overlapping responsibilities.

Both have TryGetAnnotation(), both have ExportErrors, both have AnnotationMap. ReviewContext delegates ExportErrors to StudyContext.ExportErrors and delegates RecordAnnotationLookupError to StudyContext. But ReviewContext.TryGetAnnotation() adds an investigator-ownership check that StudyContext.TryGetAnnotation() doesn't have.

The relationship is: ReviewContext = StudyContext + investigator filter. But the inheritance isn't formal - it's a wrapper with delegation.

Recommendation: Make the relationship explicit. ReviewContext should compose StudyContext, not duplicate its API:

public record ReviewContext(StudyContext StudyContext, Guid InvestigatorId, bool Reconciled)
{
    // ReviewContext-specific: annotation lookup with investigator ownership check
    public Annotation? GetAnnotationForInvestigator(Guid annotationId, string type, AnnotationLookupContext ctx)
    {
        var annotation = StudyContext.TryGetAnnotation(annotationId, type, ctx);
        if (annotation == null) return null;

        if (annotation.AnnotatorId != InvestigatorId)
        {
            RecordInvestigatorMismatchError(...);
            return null;
        }
        return annotation;
    }

    // Remove: TryGetAnnotation() - callers use GetAnnotationForInvestigator() explicitly
    // Remove: ExportErrors (access via StudyContext.ExportErrors)
    // Remove: AnnotationMap (access via StudyContext.AnnotationMap)
}

This makes it clear when you're doing an investigator-scoped lookup vs a study-wide lookup, rather than having two TryGetAnnotation methods that behave subtly differently.


Problem 6: LabeledAnnotationsGroup.GetDescendentAnnotations() uses recursion without depth limits.

This method (lines 71-93) walks the annotation tree recursively via Children references. If annotation data is corrupted (circular reference), this stack overflows. Since annotations come from user-entered data in MongoDB, this isn't hypothetical.

Recommendation: Add a depth guard:

private IEnumerable<Annotation> GetDescendentAnnotations(Annotation root, int maxDepth = 10)
{
    var result = new List<Annotation>();
    var stack = new Stack<(Annotation annotation, int depth)>();
    stack.Push((root, 0));

    while (stack.Count > 0)
    {
        var (current, depth) = stack.Pop();
        if (depth >= maxDepth) continue;  // Safety limit

        foreach (var childId in current.Children)
        {
            var child = ReviewContext.TryGetAnnotation(childId, "Child", ...);
            if (child == null) continue;
            result.Add(child);
            stack.Push((child, depth + 1));
        }
    }
    return result;
}

Iterative with explicit depth limit instead of recursive - eliminates the stack overflow risk.
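If you want to catch cycles exactly rather than just bound their effect, a visited set is a complementary guard to the depth limit. A toy adjacency-list version (IDs as ints for illustration, with a deliberate cycle in the data):

```csharp
using System;
using System.Collections.Generic;

// Toy annotation graph keyed by ID, with a deliberate cycle (2 -> 1).
var children = new Dictionary<int, int[]>
{
    [1] = new[] { 2, 3 },
    [2] = new[] { 1 },          // corrupted data: cycle back to the root
    [3] = Array.Empty<int>()
};

var visited = new HashSet<int> { 1 };   // the root counts as seen
var result = new List<int>();
var stack = new Stack<int>();
stack.Push(1);

while (stack.Count > 0)
{
    var current = stack.Pop();
    foreach (var childId in children[current])
    {
        if (!visited.Add(childId)) continue; // already seen: a cycle, skip it
        result.Add(childId);
        stack.Push(childId);
    }
}

Console.WriteLine(string.Join(",", result)); // 2,3
```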


Summary: What I'd Prioritise

| Priority | Change | Why |
|---|---|---|
| 1 | Shared column definitions between header/row writers | Eliminates entire class of column mismatch bugs |
| 2 | Remove duplicate FormatOptions creation | Confusing, creates two divergent objects |
| 3 | Clarify ReviewContext vs StudyContext API boundary | Currently easy to call the wrong TryGetAnnotation |
| 4 | Depth-limit the annotation tree walk | Defensive against data corruption |
| 5 | Refactor WideFormatOptions.GetRowWriters() into two-step grouping | The 4-branch LINQ is hard to reason about |
| 6 | Document ExportContext as composition root, extract validation | Makes the "god object" intentional rather than accidental |

The domain modelling (AnnotationsGroup hierarchy, CategoryInfo/LabelInfo/LookupQuestion types) is sound - it accurately reflects a genuinely complex domain. The problems are in the plumbing between these well-modelled concepts, not the concepts themselves.

My response above covers the architectural analysis you asked about. Let me know if you want me to go deeper on any specific area, or if you'd like me to start implementing any of the recommended changes.