# Data Export Refactoring Plan

## Overview
The data export feature is a substantial system spanning Angular frontend, ASP.NET API controller, domain models, CSV generation engine, real-time progress tracking, and background reporting. A comprehensive analysis was completed in docs/planning/data-export-analysis.md identifying architectural strengths and improvement areas.
PR #2126 delivered test coverage for the core export writers (header, row, integration) and quick code quality wins (grammar fixes, structured logging, named constants). This document captures all remaining refactoring work as a prioritized backlog for future PRs.
## Completed in PR #2126
- Test builders (ExportContextBuilder, WriterConfigBuilder, AnnotationBuilder)
- OutcomeDataFormatHeaderWriter unit tests (column ordering, IncludeComments interleaving)
- OutcomeDataFormatRowWriter unit tests (value formatting, sentinels, comments, multi-group joining)
- Column count consistency integration tests
- StudyDataService unit tests with mocked IPmUnitOfWork
- Full pipeline integration tests with error recording assertions
- Grammar fix in validation message
- Console.WriteLine replaced with ILogger
- Magic numbers extracted to named constants
- SonarCloud coverage exclusions removed
## Refactoring Items
### REFAC-01: Extract Error-Mapping Helper

Priority: High · Effort: 1-2 hours · Source: data-export-analysis.md, section 2a
Description: CsvDataExportWriter.CsvRows() has 7 catch blocks (~20 lines each) and StudyWriter.GetRowsForStudy() has 8 more, each creating a nearly identical DataExportError object. Together this amounts to approximately 450 lines of duplicated error-handling boilerplate.
Rationale: The repetition makes the actual control flow of both methods hard to follow. A bug fix to error handling must be applied 15 times instead of once.
Approach: Extract a single error-mapping method using pattern matching:
```csharp
private (DataExportError error, IEnumerable<string> errorRow) HandleStudyError(
    Exception ex, Guid studyId, string? studyTitle)
{
    var (errorType, category, errorCode) = ex switch
    {
        UnauthorizedAccessException => (DataExportErrorType.StudyAccessDenied,
            DataExportErrorCategory.PermissionDenied, "ACCESS_DENIED"),
        ArgumentNullException ane when ane.ParamName?.Contains("annotation") == true =>
            (DataExportErrorType.MissingRequiredField,
             DataExportErrorCategory.MissingData, "MISSING_DATA"),
        // ... remaining exception types
    };
    // Build error, report, return error row
}
```
This reduces ~450 lines to ~50, making the control flow of CsvRows() and GetRowsForStudy() clear.
### REFAC-02: Move Study Retrieval Strategy to Service Layer

Priority: Medium · Effort: 1-2 hours · Source: data-export-analysis.md, section 3a
Description: DataExportController.GetDataExport() (lines 77-94) contains study retrieval strategy logic: paging threshold decisions, choosing between paged/non-paged queries. This is data access logic that belongs in the service layer.
Rationale: The controller should coordinate HTTP concerns, not data access strategy. Moving this logic makes it testable and keeps the controller thin.
Approach: Simplify IStudyDataService.GetDataExport signature to accept just the DataExportJob and IProgress<DataExportJobUpdate>. The service internally handles study retrieval strategy, count fetching, and paging decisions.
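A minimal sketch of the narrowed contract, assuming the member names above; the return type and `Study` element type are illustrative, not confirmed from the codebase:

```csharp
public interface IStudyDataService
{
    // The controller passes only the job and a progress sink. Paging
    // thresholds, count fetching, and the paged/non-paged query choice
    // become internal concerns of the service implementation.
    Task<IEnumerable<Study>> GetDataExport(
        DataExportJob job,
        IProgress<DataExportJobUpdate> progress);
}
```

With this shape, the controller's only responsibilities are model binding, authorization, and streaming the response.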
### REFAC-03: Consolidate Dual Completion Reporting

Priority: Medium · Effort: 30 minutes · Source: data-export-analysis.md, section 3b
Description: Both CsvDataExportWriter.CsvRows() and DataExportController report DataExportJobStatus.Completed. This creates a race condition where two "Completed" signals hit the progress reporter.
Rationale: While the reporter handles this gracefully via deduplication, unclear ownership of "mark complete" responsibility makes the code harder to reason about.
Approach: Remove the completion report from DataExportController and let CsvDataExportWriter be the sole owner of signaling completion. The writer has the most accurate study count.
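A sketch of the single-owner version; the exact shape of DataExportJobUpdate and the reporter field name are assumptions:

```csharp
// CsvDataExportWriter.CsvRows(), after the final study row has been yielded.
// The controller no longer reports Completed itself.
_progressReporter.Report(new DataExportJobUpdate
{
    Status = DataExportJobStatus.Completed,
    ProcessedStudyCount = processedCount, // writer-side count is authoritative
});
```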
### REFAC-04: Fix Error Rows to Match Column Count

Priority: High · Effort: 1 hour · Source: data-export-analysis.md, sections 4b, 6c
Description: When a study fails processing, an error row with 3 columns ([studyId, ERROR_CODE, message]) is inserted into a CSV that has N columns (potentially 50+). This creates an invalid CSV.
Rationale: The column count validation catches this and stops the export, but the behavior is unpredictable and the error message is misleading: it reports a "column count mismatch" rather than "study processing failed".
Approach: Either skip the study entirely (omitting from output) or pad error rows to match the expected column count:
```csharp
var errorRow = Enumerable.Repeat("", ExportContext.ExpectedColumnCount ?? 3).ToList();
errorRow[0] = study.Id.ToString();
errorRow[1] = "ERROR: " + errorCode;
```
### REFAC-05: Add CancellationToken Support

Priority: Medium · Effort: 2-3 hours · Source: data-export-analysis.md, section 6d
Description: No CancellationToken threading exists through the export pipeline. A user who navigates away or closes the tab has no way to cancel a long-running export.
Rationale: Exports of 25,000+ studies consume significant server resources (MongoDB cursor, memory, CPU). Without cancellation, abandoned exports continue consuming resources until completion.
Approach: Thread CancellationToken from the controller's HttpContext.RequestAborted through StudyDataService, CsvDataExportWriter.CsvRows(), and OptimizedDataExportProgressReporter. Check cancellation in the study processing loop.
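A minimal sketch of the threading, assuming the methods named above each gain a CancellationToken parameter:

```csharp
// Controller: RequestAborted fires when the client disconnects or
// navigates away, so it is the natural token to propagate.
var cancellationToken = HttpContext.RequestAborted;
var rows = await _studyDataService.GetDataExport(job, progress, cancellationToken);

// CsvDataExportWriter.CsvRows(): check at the top of the per-study loop
// so an abandoned export stops before processing the next study.
foreach (var study in studies)
{
    cancellationToken.ThrowIfCancellationRequested();
    // ... process study ...
}
```

An OperationCanceledException thrown here should be allowed to surface rather than mapped to a DataExportError, since cancellation is not a failure.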
### REFAC-06: Add Concurrent Export Limiting

Priority: Low · Effort: 2-3 hours · Source: data-export-analysis.md, section 6e
Description: No check prevents a user from starting multiple simultaneous exports, which could overwhelm the database with parallel full-collection scans.
Rationale: Each export opens a MongoDB cursor that scans potentially the entire studies collection. Multiple concurrent exports multiply the database load.
Approach: Add a check in CreateDataExportJob that queries for existing in-progress jobs for the same user or project. Return a 429 (Too Many Requests) if limit exceeded. Alternatively, use a SemaphoreSlim at the service level.
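A sketch of the first option; the repository accessor, status enum value, and limit constant are assumptions for illustration:

```csharp
// CreateDataExportJob: reject a new job if the user already has one running.
var activeJobs = await _pmUnitOfWork.DataExportJobs.CountAsync(
    j => j.CreatedBy == CurrentUserId && j.Status == DataExportJobStatus.InProgress);

if (activeJobs >= MaxConcurrentExportsPerUser)
    return StatusCode(StatusCodes.Status429TooManyRequests,
        "Another export is already in progress. Please wait for it to finish.");
```

The database-backed check works across multiple server instances, whereas a SemaphoreSlim only limits concurrency within a single process.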
### REFAC-07: Add Project Membership Authorization

Priority: High · Effort: 1 hour · Source: data-export-analysis.md, section 6b
Description: The controller doesn't verify that the requesting user is a member of the project. CreateDataExportJob saves the job with CurrentUserId but never checks project membership. GetDataExport fetches the job by ID without verifying access.
Rationale: A user could potentially download exports from projects they don't belong to if they know the job ID. This is a security gap.
Approach: Add project membership verification in both endpoints:
```csharp
var project = await _pmUnitOfWork.Projects.GetAsync(projectId);
// Treat a missing project the same as denied access to avoid both a
// null dereference and leaking project existence to non-members.
if (project is null || !project.Memberships.Any(m => m.InvestigatorId == CurrentUserId))
    return Forbid();
```
### REFAC-08: Make ExportContext Truly Immutable

Priority: Low · Effort: 1-2 hours · Source: data-export-analysis.md, section 3c
Description: Despite being a record, ExportContext has mutable state (ExpectedColumnCount via SetExpectedColumnCount()). This undermines the immutability contract that records imply.
Rationale: Mutable records are surprising and can lead to subtle bugs, especially in concurrent scenarios.
Approach: Move ExpectedColumnCount to a separate CsvValidationContext class that wraps ExportContext, or compute the column count at ExportContext construction time by running the header writer logic eagerly.
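A sketch of the wrapper option; the class and member names beyond ExportContext and ExpectedColumnCount are illustrative:

```csharp
// Mutable validation state lives outside the record, so ExportContext
// itself keeps the immutability contract that records imply.
public sealed class CsvValidationContext
{
    public ExportContext Context { get; }
    public int? ExpectedColumnCount { get; private set; }

    public CsvValidationContext(ExportContext context) => Context = context;

    // First writer to record a count wins; later calls are no-ops,
    // mirroring the set-once behavior of SetExpectedColumnCount().
    public void RecordColumnCount(int count) => ExpectedColumnCount ??= count;
}
```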
### REFAC-09: Extract CsvDataExportWriter Constructor Logic

Priority: Medium · Effort: 2 hours · Source: data-export-analysis.md, section 2b
Description: The 60-line constructor performs complex filtering logic (stage question filtering, category question filtering, label exclusion) that mixes configuration, filtering, and context assembly.
Rationale: Long constructors are hard to test and understand. Each filtering step should be independently verifiable.
Approach: Extract a CsvDataExportWriterFactory static class with named helper methods:
```csharp
public static class CsvDataExportWriterFactory
{
    public static CsvDataExportWriter Create(...) { ... }

    private static CategoryInfo? ResolveCategoryInfo(StudyWriterOptions options) { ... }
    private static List<Guid> ResolveStageQuestionIds(Project project, List<Guid> stageIds) { ... }
    private static AnnotationContext BuildAnnotationContext(...) { ... }
}
```
## Suggested PR Grouping
| PR | Items | Theme | Effort |
|---|---|---|---|
| PR A (High priority) | REFAC-01, REFAC-04, REFAC-07 | Error handling + security | 3-4 hours |
| PR B (Medium priority) | REFAC-02, REFAC-03, REFAC-05, REFAC-09 | Architecture cleanup | 5-7 hours |
| PR C (Low priority) | REFAC-06, REFAC-08 | Polish | 3-5 hours |
Created: 2026-03-11 · Source analysis: docs/planning/data-export-analysis.md