Skip to content

Data Export Refactoring Plan

Overview

The data export feature is a substantial system spanning Angular frontend, ASP.NET API controller, domain models, CSV generation engine, real-time progress tracking, and background reporting. A comprehensive analysis was completed in docs/planning/data-export-analysis.md identifying architectural strengths and improvement areas.

PR #2126 delivered test coverage for the core export writers (header, row, integration) and quick code quality wins (grammar fixes, structured logging, named constants). This document captures all remaining refactoring work as a prioritized backlog for future PRs.

Completed in PR #2126

  • Test builders (ExportContextBuilder, WriterConfigBuilder, AnnotationBuilder)
  • OutcomeDataFormatHeaderWriter unit tests (column ordering, IncludeComments interleaving)
  • OutcomeDataFormatRowWriter unit tests (value formatting, sentinels, comments, multi-group joining)
  • Column count consistency integration tests
  • StudyDataService unit tests with mocked IPmUnitOfWork
  • Full pipeline integration tests with error recording assertions
  • Grammar fix in validation message
  • Console.WriteLine replaced with ILogger
  • Magic numbers extracted to named constants
  • SonarCloud coverage exclusions removed

Refactoring Items

REFAC-01: Extract Error-Mapping Helper

Priority: High Effort: 1-2 hours Source: data-export-analysis.md, section 2a

Description: CsvDataExportWriter.CsvRows() has 7 catch blocks (~20 lines each) and StudyWriter.GetRowsForStudy() has 8 catch blocks creating nearly identical DataExportError objects. This totals approximately 450 lines of duplicated error handling boilerplate.

Rationale: The repetition makes the actual control flow of both methods hard to follow. A bug fix to error handling must be applied 15 times instead of once.

Approach: Extract a single error-mapping method using pattern matching:

private (DataExportError error, IEnumerable<string> errorRow) HandleStudyError(
    Exception ex, Guid studyId, string? studyTitle)
{
    var (errorType, category, errorCode) = ex switch
    {
        UnauthorizedAccessException => (DataExportErrorType.StudyAccessDenied,
            DataExportErrorCategory.PermissionDenied, "ACCESS_DENIED"),
        ArgumentNullException ane when ane.ParamName?.Contains("annotation") == true =>
            (DataExportErrorType.MissingRequiredField, DataExportErrorCategory.MissingData, "MISSING_DATA"),
        // ... remaining exception types
    };
    // Build error, report, return error row
}

This reduces ~450 lines to ~50, making the control flow of CsvRows() and GetRowsForStudy() clear.


REFAC-02: Move Study Retrieval Strategy to Service Layer

Priority: Medium Effort: 1-2 hours Source: data-export-analysis.md, section 3a

Description: DataExportController.GetDataExport() (lines 77-94) contains study retrieval strategy logic: paging threshold decisions, choosing between paged/non-paged queries. This is data access logic that belongs in the service layer.

Rationale: The controller should coordinate HTTP concerns, not data access strategy. Moving this logic makes it testable and keeps the controller thin.

Approach: Simplify IStudyDataService.GetDataExport signature to accept just the DataExportJob and IProgress<DataExportJobUpdate>. The service internally handles study retrieval strategy, count fetching, and paging decisions.


REFAC-03: Consolidate Dual Completion Reporting

Priority: Medium Effort: 30 minutes Source: data-export-analysis.md, section 3b

Description: Both CsvDataExportWriter.CsvRows() and DataExportController report DataExportJobStatus.Completed. This creates a race condition where two "Completed" signals hit the progress reporter.

Rationale: While the reporter handles this gracefully via deduplication, unclear ownership of "mark complete" responsibility makes the code harder to reason about.

Approach: Remove the completion report from DataExportController and let CsvDataExportWriter be the sole owner of signaling completion. The writer has the most accurate study count.


REFAC-04: Fix Error Rows to Match Column Count

Priority: High Effort: 1 hour Source: data-export-analysis.md, sections 4b, 6c

Description: When a study fails processing, an error row with 3 columns ([studyId, ERROR_CODE, message]) is inserted into a CSV that has N columns (potentially 50+). This creates an invalid CSV.

Rationale: The column count validation catches this and stops the export, but the behavior is unpredictable and the error message misleading (it reports a "column count mismatch" rather than "study processing failed").

Approach: Either skip the study entirely (omitting from output) or pad error rows to match the expected column count:

var errorRow = Enumerable.Repeat("", ExportContext.ExpectedColumnCount ?? 3).ToList();
errorRow[0] = study.Id.ToString();
errorRow[1] = "ERROR: " + errorCode;

REFAC-05: Add CancellationToken Support

Priority: Medium Effort: 2-3 hours Source: data-export-analysis.md, section 6d

Description: No CancellationToken threading exists through the export pipeline. A user who navigates away or closes the tab has no way to cancel a long-running export.

Rationale: Exports of 25,000+ studies consume significant server resources (MongoDB cursor, memory, CPU). Without cancellation, abandoned exports continue consuming resources until completion.

Approach: Thread CancellationToken from the controller's HttpContext.RequestAborted through StudyDataService, CsvDataExportWriter.CsvRows(), and OptimizedDataExportProgressReporter. Check cancellation in the study processing loop.


REFAC-06: Add Concurrent Export Limiting

Priority: Low Effort: 2-3 hours Source: data-export-analysis.md, section 6e

Description: No check prevents a user from starting multiple simultaneous exports, which could overwhelm the database with parallel full-collection scans.

Rationale: Each export opens a MongoDB cursor that scans potentially the entire studies collection. Multiple concurrent exports multiply the database load.

Approach: Add a check in CreateDataExportJob that queries for existing in-progress jobs for the same user or project. Return a 429 (Too Many Requests) if limit exceeded. Alternatively, use a SemaphoreSlim at the service level.


REFAC-07: Add Project Membership Authorization

Priority: High Effort: 1 hour Source: data-export-analysis.md, section 6b

Description: The controller doesn't verify that the requesting user is a member of the project. CreateDataExportJob saves the job with CurrentUserId but never checks project membership. GetDataExport fetches the job by ID without verifying access.

Rationale: A user could potentially download exports from projects they don't belong to if they know the job ID. This is a security gap.

Approach: Add project membership verification in both endpoints:

var project = await _pmUnitOfWork.Projects.GetAsync(projectId);
if (!project.Memberships.Any(m => m.InvestigatorId == CurrentUserId))
    return Forbid();

REFAC-08: Make ExportContext Truly Immutable

Priority: Low Effort: 1-2 hours Source: data-export-analysis.md, section 3c

Description: Despite being a record, ExportContext has mutable state (ExpectedColumnCount via SetExpectedColumnCount()). This undermines the immutability contract that records imply.

Rationale: Mutable records are surprising and can lead to subtle bugs, especially in concurrent scenarios.

Approach: Move ExpectedColumnCount to a separate CsvValidationContext class that wraps ExportContext, or compute the column count at ExportContext construction time by running the header writer logic eagerly.


REFAC-09: Extract CsvDataExportWriter Constructor Logic

Priority: Medium Effort: 2 hours Source: data-export-analysis.md, section 2b

Description: The 60-line constructor performs complex filtering logic (stage question filtering, category question filtering, label exclusion) that mixes configuration, filtering, and context assembly.

Rationale: Long constructors are hard to test and understand. Each filtering step should be independently verifiable.

Approach: Extract a CsvDataExportWriterFactory static class with named helper methods:

public static class CsvDataExportWriterFactory
{
    public static CsvDataExportWriter Create(...) { ... }
    private static CategoryInfo? ResolveCategoryInfo(StudyWriterOptions options) { ... }
    private static List<Guid> ResolveStageQuestionIds(Project project, List<Guid> stageIds) { ... }
    private static AnnotationContext BuildAnnotationContext(...) { ... }
}

Suggested PR Grouping

PR Items Theme Effort
PR A (High priority) REFAC-01, REFAC-04, REFAC-07 Error handling + security 3-4 hours
PR B (Medium priority) REFAC-02, REFAC-03, REFAC-05, REFAC-09 Architecture cleanup 5-7 hours
PR C (Low priority) REFAC-06, REFAC-08 Polish 3-5 hours

Created: 2026-03-11 Source analysis: docs/planning/data-export-analysis.md