SignalR Multi-Replica Analysis

Executive Summary

Conclusion: The current SignalR implementation is replica-safe and works correctly with multiple API pods.

The architecture uses MongoDB change streams as an implicit backplane, avoiding the typical SignalR scaling problems. However, this comes with trade-offs that should be understood.

Current Production Configuration

Service              Staging Replicas   Production Replicas
API (SignalR host)   2                  3
Project Management   1                  2
Web                  2                  3

Architecture Overview

Traditional SignalR Scaling Problem

In typical SignalR deployments, scaling to multiple replicas causes issues because:

┌─────────────────────────────────────────────────────────────┐
│ Traditional SignalR (PROBLEMATIC)                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Pod-1                      Pod-2                           │
│  ┌──────────────┐          ┌──────────────┐                │
│  │ SignalR Hub  │          │ SignalR Hub  │                │
│  │ Groups:      │          │ Groups:      │                │
│  │  - Project-X │          │  - Project-X │                │
│  │ Clients:     │          │ Clients:     │                │
│  │  - Client-A  │          │  - Client-B  │                │
│  └──────────────┘          └──────────────┘                │
│         │                         │                         │
│         ▼                         ▼                         │
│  Server broadcasts to         Client-B never                │
│  Group("Project-X")           receives message!             │
│  Only Client-A gets it                                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

SyRF's Change Stream Pattern (REPLICA-SAFE)

SyRF uses a fundamentally different approach:

┌─────────────────────────────────────────────────────────────┐
│ SyRF SignalR Architecture                                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│                    ┌─────────────────┐                      │
│                    │  MongoDB Atlas  │                      │
│                    │  Change Streams │                      │
│                    └────────┬────────┘                      │
│              ┌──────────────┼──────────────┐                │
│              ▼              ▼              ▼                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Pod-1      │  │   Pod-2      │  │   Pod-3      │      │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │      │
│  │ │ Change   │ │  │ │ Change   │ │  │ │ Change   │ │      │
│  │ │ Stream   │ │  │ │ Stream   │ │  │ │ Stream   │ │      │
│  │ │ Watcher  │ │  │ │ Watcher  │ │  │ │ Watcher  │ │      │
│  │ └────┬─────┘ │  │ └────┬─────┘ │  │ └────┬─────┘ │      │
│  │      ▼       │  │      ▼       │  │      ▼       │      │
│  │ Client-A ✓   │  │ Client-B ✓   │  │ Client-C ✓   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│                                                             │
│  All clients receive notifications independently!           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Detailed Flow Analysis

Subscription Flow

  1. Client connects to the SignalR hub (sticky session routes to one pod)
  2. Client subscribes by calling a hub method (e.g., SubscribeToProject)
  3. Server creates a MongoDB change stream watching that entity/collection
  4. The change stream subscription is stored with the client's connectionId
  5. When a document changes in MongoDB:
     • MongoDB sends the change event to ALL pods watching that collection
     • Each pod processes the event for its own connected clients
     • Each pod notifies its clients via Clients.Client(connectionId)
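Steps 3 and 4 amount to per-connection bookkeeping on each pod. A minimal TypeScript sketch of the idea (hypothetical names, not the actual SubscriptionManager):

```typescript
// Toy subscription manager: maps connectionId -> per-entity callbacks, so a
// pod only notifies the clients connected to it (illustrative sketch only).
type Callback = (change: { entityId: string; version: number }) => void;

class SubscriptionManager {
  private byConnection = new Map<string, Map<string, Callback>>();

  subscribe(connectionId: string, entityId: string, cb: Callback): void {
    const subs = this.byConnection.get(connectionId) ?? new Map<string, Callback>();
    subs.set(entityId, cb);
    this.byConnection.set(connectionId, subs);
  }

  // Called when this pod's change stream emits an event.
  onChange(change: { entityId: string; version: number }): void {
    for (const subs of this.byConnection.values()) {
      subs.get(change.entityId)?.(change);
    }
  }
}

const manager = new SubscriptionManager();
const delivered: number[] = [];
manager.subscribe('conn-A', 'project-X', change => delivered.push(change.version));
manager.onChange({ entityId: 'project-X', version: 6 });
console.log(delivered); // versions delivered to conn-A's callback
```

Because every pod holds only its own connection map, a change event fanned out to all pods still reaches every client exactly once.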

Key Code Paths

Hub subscription (NotificationHub.cs:99-136):

public async Task SubscribeToProject(Guid projectId)
{
    var connectionId = Context.ConnectionId;
    _projectsSubscriptionManager.SubscribeToEntity(connectionId, projectId,
        notificationDto =>
        {
            // This callback fires on the SAME pod that created the subscription
            _hubContext.Clients.Client(connectionId).ProjectNotification(notificationDto);
        }
    );
}

Change stream source (MongoRepositoryBase.cs:87-88):

public IObservable<EntityNotification<TAggregateRoot>> GetEntityNotificationStream(Guid id) =>
    _GetEntityNotificationStream(id);  // Uses MongoCachedChangeStream

Cached change stream (MongoContext.cs:142-249):

public IObservable<ChangeStreamDocument<TAggregateRoot>> GetCachedCollectionChangeStream<...>()
{
    return _changeStreamCache.GetOrAdd(() =>
        Observable.Create<ChangeStreamDocument<TAggregateRoot>>(async (obs, ct) =>
        {
            var cursor = await GetCollection<TAggregateRoot, TId>()
                .WatchAsync(pipeline, opts, ct);
            // Each pod opens its own cursor to MongoDB
            while (!ct.IsCancellationRequested)
            {
                await cursor.MoveNextAsync(ct);
                foreach (var doc in cursor.Current)
                    obs.OnNext(doc);  // Emits to all subscribers on this pod
            }
        })
        .Publish()
        .RefCount()  // Hot observable - shared across all subscriptions on this pod
    );
}
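The Publish().RefCount() combination means each pod opens the underlying cursor once and shares it across all local subscribers. A toy TypeScript sketch of the same ref-counted sharing (illustrative only, not the server code):

```typescript
// Toy RefCount: the underlying "cursor" is opened on first subscribe, shared
// by every local subscriber, and closed when the last subscriber leaves.
class RefCountedStream<T> {
  private listeners = new Set<(v: T) => void>();
  private open = false;
  openCount = 0; // how many times the underlying cursor was opened

  subscribe(cb: (v: T) => void): () => void {
    if (!this.open) { this.open = true; this.openCount++; } // lazy open
    this.listeners.add(cb);
    return () => {
      this.listeners.delete(cb);
      if (this.listeners.size === 0) this.open = false; // close when unused
    };
  }

  emit(v: T): void { for (const cb of this.listeners) cb(v); }
}

const stream = new RefCountedStream<number>();
const seen: number[] = [];
stream.subscribe(v => seen.push(v));
stream.subscribe(v => seen.push(v * 10));
stream.emit(1); // both subscribers receive it; the cursor was opened only once
console.log(stream.openCount, seen);
```

This is why the cursor count scales with replicas, not with connected clients: N subscribers on one pod still share a single MongoDB cursor.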

Scenario Analysis

Scenario 1: Project Update Notification

Step     What Happens                                               Replica Impact
1        User A (Pod-1) and User B (Pod-2) both viewing Project-X   Each has a subscription
2        User A saves a change to Project-X                         MongoDB document updated
3        MongoDB emits change event                                 Sent to ALL change stream cursors
4        Pod-1 receives event                                       Notifies User A via Clients.Client(A)
5        Pod-2 receives event                                       Notifies User B via Clients.Client(B)
Result   Both users see the update                                  ✅ Works correctly

Scenario 2: File Upload Progress

Step     What Happens                                  Replica Impact
1        User uploads file, connected to Pod-1         Subscription created on Pod-1
2        S3 notifier processes file                    Publishes to RabbitMQ
3        Project-Management service updates MongoDB    Progress field updated
4        MongoDB emits change event                    Pod-1's change stream receives it
5        Pod-1 notifies user                           Via Clients.Client(connectionId)
Result   User sees progress updates                    ✅ Works correctly

Scenario 3: Pod Restart

Step     What Happens                               Replica Impact
1        Pod-2 restarts (rolling update)            Connections to Pod-2 drop
2        Clients reconnect                          May land on any pod
3        Client runs .withAutomaticReconnect()      SignalR handles reconnection
4        Client re-subscribes                       allReconnected$ triggers reload
5        New change stream subscriptions created    On whichever pod the client landed
Result   Brief interruption, then normal            ✅ Self-healing

Scenario 4: Version Check (UiVersionCheck)

Step     What Happens                      Replica Impact
1        Client connects to Pod-1          OnConnectedAsync fires
2        Hub sends version check           Clients.Client(Context.ConnectionId)
3        Target is the SAME connection     On this pod, guaranteed
Result   Client receives version check     ✅ Works correctly

What Would Break (Hypothetical)

The following patterns are NOT used but would cause problems:

❌ Group Broadcasting (NOT USED)

// If this existed, only 1/3 of clients would receive it
await Clients.Group("project-X").SomeNotification(...);

❌ User-Targeted Messages (NOT USED)

// If this existed, would fail if user connected to different pod
await Clients.User(userId).SomeNotification(...);

❌ All-Client Broadcast (NOT USED)

// If this existed, only 1/3 of clients would receive it
await Clients.All.SomeNotification(...);

Verification: grepping the codebase for Clients.(Group|User|All)( returns no matches.

Vestigial Group Code

Groups ARE added but never used for broadcasting:

// NotificationHub.cs:57-58 - Added on connect
await Groups.AddToGroupAsync(Context.ConnectionId, UserGroupName(userId.ToString()));

// NotificationHub.cs:132 - Added on project subscription
await Groups.AddToGroupAsync(Context.ConnectionId, ProjectGroupName(projectId));

These groups add minor overhead but cause no correctness issues. They may have been intended for future use or are remnants of an older design.

Sticky Sessions Analysis

Current Configuration

_ingress.tpl:

annotations:
  nginx.ingress.kubernetes.io/affinity: "cookie"
  nginx.ingress.kubernetes.io/session-cookie-name: "route"
  nginx.ingress.kubernetes.io/session-cookie-expires: "172800"  # 2 days

Are They Required?

No, but they help. Without sticky sessions:

Aspect                    With Sticky Sessions    Without Sticky Sessions
Initial HTTP handshake    Same pod                Any pod
WebSocket upgrade         Same pod                Any pod
Subscription creation     Predictable             Works, but subscription on any pod
Reconnection              Returns to same pod     May land on a different pod
Subscription recreation   Avoided if same pod     Always recreated on reconnect

Recommendation: Keep sticky sessions to reduce subscription churn, but they're not required for correctness.

Resume Token Management

Current Implementation

InMemoryResumePointRepository.cs:

public class InMemoryResumePointRepository : IResumePointRepository
{
    private readonly ConcurrentDictionary<string, ResumePoint> _store = new();
    // ...
}

Impact

  • Each pod stores resume tokens in memory (not shared)
  • On pod restart, resume tokens are lost
  • Change streams restart from current time (safe, no missed events for new subscriptions)
  • This is acceptable because subscriptions are connection-scoped

Resource Implications

MongoDB Change Stream Cursors

Configuration   Cursor Count per Collection
1 replica       1 cursor
3 replicas      3 cursors
6 replicas      6 cursors

MongoDB Atlas handles this well - change streams are designed for multiple consumers.

Memory Per Pod

Each pod stores:

  • Connection → Subscription mappings (ConcurrentDictionary)
  • Rx subscriptions (IDisposable for each active subscription)
  • Resume tokens (ConcurrentDictionary)

Memory usage scales with connections per pod, not total connections.

Potential Improvements (Not Required)

1. Remove Unused Group Code

The Groups.AddToGroupAsync calls could be removed to reduce overhead.

2. Shared Resume Token Storage

Could use Redis/MongoDB for resume tokens to survive pod restarts without briefly re-streaming. Low priority - current behavior is correct.

3. Connection Metrics

Add observability for:

  • Connections per pod
  • Subscriptions per pod
  • Change stream lag

Comparison with Redis Backplane

Aspect                      SyRF (Change Streams)    Traditional (Redis Backplane)
Cross-pod messaging         Via MongoDB              Via Redis
Additional infrastructure   None                     Redis cluster required
Latency                     Database → All pods      Hub → Redis → All pods
Notification trigger        Database change          Explicit broadcast
Use case fit                CRUD apps                Chat, live collaboration

SyRF's pattern is well-suited because:

  • All notifications stem from database changes
  • No need for arbitrary server-initiated broadcasts
  • Avoids Redis infrastructure complexity

Potential Missed Events Analysis

While the architecture is replica-safe, there are edge cases where events can be missed. This section analyzes each scenario, its impact, and mitigation options.

Overview

Scenario                       Can Miss Events?    Self-Heals?                      Time to Heal      Severity
Initial load race condition    ⚠️ Yes              Next update/refresh/reconnect    Seconds–minutes   Medium
Disconnection period           Yes, but handled    ✅ Auto reload on reconnect      Immediate         Low
WebSocket send failure         ⚠️ Yes              Next update/refresh/reconnect    Seconds–minutes   Low
Out-of-order delivery          No (filtered)       N/A                              N/A               None
Pod restart                    Yes, but handled    ✅ Client reconnects + reloads   Immediate         Low
Change stream cursor failure   Rare                ✅ Retry with backoff            Seconds           Low

Scenario 1: Initial Load Race Condition ⚠️

The Problem

There's a timing gap between when HTTP data loads and when the SignalR subscription is created:

Timeline:
─────────────────────────────────────────────────────────────────────>
     │                      │                      │
     │ HTTP Response        │ Another user saves   │ Subscription Created
     │ (Project v5)         │ (Project v6)         │ (watching from now)
     │                      │                      │
     └──────────────────────┴──────────────────────┘
                      VERSION 6 MISSED!

Code Path

  1. Route guard calls loadProject() → HTTP request (project-guard.service.ts:107)
  2. HTTP response returns version N
  3. Component renders
  4. SignalR service detects _currentProjectId$ changed (signal-r.service.ts:530-548)
  5. Subscription created via SubscribeToProject()

Any change between steps 2 and 5 is not delivered.
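The gap can be reproduced with a toy simulation (hypothetical names; it assumes only that the HTTP load and the subscription are independent asynchronous steps, as in the code path above):

```typescript
// Minimal simulation of the initial-load race (illustrative, not SyRF code).
type ChangeNotification = { version: number };

class FakeHub {
  private subscribers: Array<(n: ChangeNotification) => void> = [];
  subscribe(cb: (n: ChangeNotification) => void): void { this.subscribers.push(cb); }
  emit(n: ChangeNotification): void { this.subscribers.forEach(cb => cb(n)); }
}

const hub = new FakeHub();
let storeVersion = 0;

// Steps 1-2: the HTTP response returns v5 and populates the store
storeVersion = 5;

// Between steps 2 and 5: another user saves v6, before the subscription exists
hub.emit({ version: 6 }); // nobody is listening yet, so v6 is lost

// Steps 4-5: the subscription is created, watching only from this point on
hub.subscribe(n => { if (n.version > storeVersion) storeVersion = n.version; });

console.log(storeVersion); // still 5: the v6 update was missed
```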

Impact

  • User sees stale data until next update, refresh, or reconnection
  • For most SyRF use cases (screening, annotation), this is a minor issue
  • For progress updates during file upload, user might miss intermediate progress

Self-Healing Triggers

  • Another database change triggers version N+2 notification
  • User refreshes the page
  • Connection drops and reconnects (triggers full reload)
  • User navigates away and back

Mitigation Options

Option                             Complexity   Effectiveness   Recommendation
A. Subscribe-then-Load             Medium       High            ⭐ Recommended
B. Version Check After Subscribe   Low          Medium          Good quick fix
C. Periodic Version Polling        Low          Medium          Simple fallback
D. Accept Current Behavior         None         N/A             Acceptable for SyRF

Option A: Subscribe-then-Load Pattern (with HTTP version filtering)

Reverse the order: create SignalR subscription BEFORE making the HTTP request, and add version filtering to HTTP responses.

// Current (problematic):
// 1. HTTP load project (gets v5)
// 2. Subscribe to project (misses v6 that happened between 1 and 2)

// Improved:
// 1. Subscribe to project (captures all changes from this point)
// 2. HTTP load project (gets v5 or v6 depending on timing)
// 3. Version filtering handles any out-of-order delivery

Critical Caveat: This pattern only works if HTTP responses also go through version filtering.

Current problem: Version filtering only exists in SignalR path, NOT in HTTP path:

Path                    Version Filtering   Code Location
SignalR notifications   ✅ Yes              signal-r.service.ts:422-432
HTTP responses          ❌ No               requests.ts:47-55 → reducer spreads directly

What happens WITHOUT HTTP version filtering:

1. Subscribe created
2. SignalR v6 arrives → version filter passes → store updated to v6
3. HTTP returns v5 → NO filter → dispatches detailLoaded → OVERWRITES with v5!

User ends up with stale v5 even though v6 was correctly received.

Implementation changes required:

  1. Add version filtering to HTTP response handling (effect or reducer level)
  2. Modify projectCanMatchGuard to subscribe before loading
  3. Ensure subscription is created synchronously or awaited
  4. Add subscription cleanup on navigation away

Example HTTP version filter (in effect):

map((projectWithRelatedInvestigatorsDto) => {
  const incomingVersion = projectWithRelatedInvestigatorsDto.project.audit?.version;
  const storedVersion = store.selectSignal(selectProjectVersion(props.projectId))();

  // Only dispatch if incoming is newer or same
  if (storedVersion === undefined || incomingVersion >= storedVersion) {
    return projectDetailActions.detailLoaded({...});
  } else {
    console.warn(`[HTTP] Dropping stale response v${incomingVersion}, have v${storedVersion}`);
    return { type: '[Project] Stale HTTP Response Ignored' };
  }
})

Complexity: Medium-High (requires changes to effects and careful testing)

Option B: Version Check After Subscribe

After subscription is confirmed, make a lightweight version-check call:

async _subscribeToProject(projectId: string) {
  await this._hubConnection.invoke('SubscribeToProject', projectId);

  // New: Check if we missed anything
  const serverVersion = await this._hubConnection.invoke('GetProjectVersion', projectId);
  const clientVersion = this._store.selectSignal(selectProjectVersion(projectId))();

  if (serverVersion > clientVersion) {
    this._loadProjectRequest.dispatchRequest.loadProject({ projectId });
  }
}

Pros: Simple, explicit check
Cons: Adds an extra round-trip; requires a new hub method

Option C: Periodic Version Polling

Extend UiVersionCheck to include entity versions:

// Server sends periodically (e.g., every 30 seconds):
{
  minUiVersion: "1.2.3",
  entityVersions: {
    "project:abc123": 42,
    "project:def456": 17
  }
}

// Client compares and reloads stale entities

Pros: Catches all gaps, low implementation effort
Cons: Adds periodic load; slight delay in detection
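The client-side comparison step could look like the following sketch (hypothetical helper, not SyRF code; entity ids and versions match the example payload above):

```typescript
// Compare server-sent entity versions against the client's store and collect
// the entity ids that need a reload (illustrative sketch only).
type EntityVersions = Record<string, number>;

function findStaleEntities(server: EntityVersions, client: EntityVersions): string[] {
  return Object.entries(server)
    .filter(([id, serverVersion]) => serverVersion > (client[id] ?? 0))
    .map(([id]) => id);
}

const stale = findStaleEntities(
  { 'project:abc123': 42, 'project:def456': 17 },  // from the heartbeat
  { 'project:abc123': 42, 'project:def456': 15 },  // from the store
);
console.log(stale); // only the entity whose server version is newer
```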

Option D: Accept Current Behavior

Document the behavior and rely on natural self-healing.

When this is acceptable:

  • Users rarely edit the same project simultaneously
  • Stale data doesn't cause data loss (optimistic locking on save)
  • Most gaps self-heal within seconds

Scenario 2: Disconnection Period ✅ Handled

The Problem

During disconnection, change events are not delivered to the client.

Current Mitigation (Already Implemented)

signal-r.service.ts:641-653:

this.allReconnected$
  .pipe(
    switchMap(() => this._currentProjectId$),
    withLatestFrom(this._isMemberOfProject$),
    filter(([projectId, isMember]) => !!projectId && !!isMember),
    takeUntil(this._destroy$)
  )
  .subscribe(([projectId]) =>
    this._loadProjectRequest.dispatchRequest.loadProject({ projectId })
  );

On reconnection:

  1. allReconnected$ fires
  2. Full project data is reloaded via HTTP
  3. State is synchronized

Status: ✅ No additional mitigation needed


Scenario 3: WebSocket Send Failure ⚠️

The Problem

If the WebSocket connection breaks during Clients.Client(connectionId).ProjectNotification(...), that specific message is lost. SignalR uses fire-and-forget semantics with no application-level acknowledgments.

How It Happens

┌─────────────────────────────────────────────────────────────────────────┐
│ Server (API Pod)                              │ Client (Browser)        │
├─────────────────────────────────────────────────────────────────────────┤
│                                               │                         │
│ 1. MongoDB emits change event (v6)            │                         │
│         │                                     │                         │
│         ▼                                     │                         │
│ 2. Rx subscription callback fires             │                         │
│         │                                     │                         │
│         ▼                                     │                         │
│ 3. _hubContext.Clients.Client(connId)         │                         │
│      .ProjectNotification(dto)                │                         │
│         │                                     │                         │
│         ▼                                     │                         │
│ 4. SignalR serializes message                 │                         │
│         │                                     │                         │
│         └────────── WebSocket ────────X       │ Connection breaks!      │
│                                        │      │ (network hiccup,        │
│                                        │      │  WiFi briefly drops,    │
│                                        │      │  mobile tower switch)   │
│                                        ▼      │                         │
│                                    MESSAGE    │ Client never receives   │
│                                    LOST!      │ anything                │
│                                               │                         │
│ Server doesn't know it failed ───────────────│─ Client doesn't know    │
│ (fire-and-forget semantics)                  │  it missed anything     │
│                                               │                         │
└─────────────────────────────────────────────────────────────────────────┘

Why Fire-and-Forget?

The notification callback doesn't await delivery confirmation (NotificationHub.cs:99-136):

_projectsSubscriptionManager.SubscribeToEntity(Context.ConnectionId, projectId,
    notificationDto =>
    {
        // Fire-and-forget - no await, no acknowledgment
        _hubContext.Clients.Client(connectionId).ProjectNotification(notificationDto);
    }
);

When Does This Happen?

Cause                                       Likelihood   Duration
Mobile network handoff (WiFi ↔ cellular)    Medium       1-5 seconds
Brief network congestion                    Low          Milliseconds
Browser tab backgrounded aggressively       Medium       Varies
VPN reconnection                            Medium       1-10 seconds

Difference from Full Disconnection (Scenario 2)

Full Disconnection                Send Failure
Client knows it disconnected      Client may not know
onclose / onreconnected fires     No event fires
allReconnected$ triggers reload   Nothing triggers
Self-heals immediately            Waits for next update

Impact

  • Single notification lost
  • User sees stale state until next update
  • No error visible to user or server
  • No data loss - optimistic locking prevents conflicts (see below)

Self-Healing Triggers

  • Next database change delivers newer version
  • User refreshes
  • Connection fully drops → reconnection → reload

Mitigation Options

Option                               Complexity   Effectiveness   Recommendation
A. Heartbeat with Versions           Low          High            ⭐ Recommended
B. Client-side Staleness Detection   Low          Medium          Good supplement
C. Application-level ACKs            High         Very High       Overkill for SyRF

Option A: Heartbeat with Versions

Extend the existing UiVersionCheck mechanism to periodically sync versions:

Server-side (new hub method):

public async Task SendVersionHeartbeat()
{
    var subscribedProjectIds = _projectsSubscriptionManager.GetSubscribedEntityIds(Context.ConnectionId);

    var versions = subscribedProjectIds.ToDictionary(
        id => id,
        id => _pmUnitOfWork.Projects.Get(id)?.Audit?.Version ?? 0
    );

    await Clients.Caller.VersionHeartbeat(versions);
}

Client-side:

this.versionHeartbeat$.pipe(
  switchMap(versions => {
    const staleProjects = Object.entries(versions)
      .filter(([id, serverVersion]) => {
        const clientVersion = this._store.selectSignal(selectProjectVersion(id))();
        return serverVersion > (clientVersion ?? 0);
      });

    return from(staleProjects).pipe(
      tap(([projectId]) =>
        this._loadProjectRequest.dispatchRequest.loadProject({ projectId })
      )
    );
  })
).subscribe();

Trigger: Could be periodic (every 30s) or on specific actions (focus window, complete operation).

Option B: Client-side Staleness Detection

Track when data was last updated and trigger refresh if stale:

// In project state (hypothetical fields)
interface ProjectSyncState {
  lastUpdated: Date;
  lastSignalRNotification: Date;
}

// Detect potential gaps: no notification for a while on an actively viewed entity
const timeSinceLastNotification = Date.now() - state.lastSignalRNotification.getTime();
if (timeSinceLastNotification > 60_000 && isActivelyViewing) {
  // Trigger a lightweight version check or full reload
}

Scenario 4: Out-of-Order Delivery ✅ Handled

The Problem (Hypothetical)

Network conditions could cause notifications to arrive out of order:

  1. Version 6 emitted
  2. Version 7 emitted
  3. Version 7 arrives at client first
  4. Version 6 arrives at client second

Current Mitigation (Already Implemented)

Version filtering (signal-r.service.ts:422-432) prevents older versions from overwriting newer:

const shouldUpdate =
  storedVersion === undefined ||
  storedVersion <= (updatedVersion ?? Infinity) || ...

if (!shouldUpdate) {
  console.warn(`[SignalR] Dropping notification - storedVersion=${storedVersion}, updatedVersion=${updatedVersion}`);
}
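Reduced to its essence, the rule is: apply a notification only if it is at least as new as what the store holds. A standalone sketch of that rule (hypothetical function name; assumes integer audit versions):

```typescript
// Drops notifications whose version is older than what the store already has.
function shouldApply(storedVersion: number | undefined, incomingVersion: number): boolean {
  // The first notification always applies; otherwise require incoming >= stored.
  return storedVersion === undefined || incomingVersion >= storedVersion;
}

console.log(shouldApply(undefined, 6)); // nothing stored yet: apply
console.log(shouldApply(7, 6));         // v6 arrives after v7 was applied: drop
console.log(shouldApply(6, 7));         // normal in-order delivery: apply
```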

Result: Version 6 is correctly dropped if version 7 was already applied.

Status: ✅ No additional mitigation needed


Scenario 5: Pod Restart ✅ Handled

The Problem

When a pod restarts:

  1. All connections to that pod are broken
  2. In-memory resume tokens are lost
  3. Change stream starts from "now" on new pod

Why It's Handled

For SignalR clients, this is equivalent to a disconnection:

  1. Client detects connection loss
  2. Auto-reconnection kicks in (may land on different pod)
  3. allReconnected$ fires
  4. Full data reload occurs

Edge Case: Resume Token Loss

The in-memory InMemoryResumePointRepository means each pod loses resume tokens on restart. However:

  • New subscriptions start fresh change streams
  • Client reload gets current state
  • No events are actually "missed" from the client's perspective

Potential Issue: If you added server-side consumers of change streams (beyond SignalR), they would need persistent resume tokens. Currently not applicable.

Status: ✅ No additional mitigation needed for SignalR use case


Scenario 6: Rapid Updates ✅ Handled

Concern: Could rapid updates with the same version number cause issues?

Analysis: This can't happen in practice because:

  1. MongoDB uses optimistic concurrency with version field
  2. Each successful save increments version atomically
  3. Concurrent saves with same base version → one fails with concurrency exception

The actual flow:

User A: Read v5 → Save → Success (v6)
User B: Read v5 → Save → Concurrency Error (v5 already changed)
User B: Retry → Read v6 → Save → Success (v7)
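The flow above is a compare-and-swap on the version field. A toy in-memory model of the same behavior (illustrative only, not the MongoDB implementation):

```typescript
// Toy optimistic-concurrency store: a save succeeds only if the caller's base
// version matches the current version; the version increments on each save.
class VersionedStore {
  private version = 5;

  read(): number { return this.version; }

  save(baseVersion: number): { ok: boolean; version: number } {
    if (baseVersion !== this.version) {
      return { ok: false, version: this.version }; // concurrency conflict
    }
    this.version += 1; // atomic increment on successful save
    return { ok: true, version: this.version };
  }
}

const store = new VersionedStore();
const userA = store.save(5);                 // User A: read v5, save succeeds (now v6)
const userBStale = store.save(5);            // User B: read v5, save rejected (DB is at v6)
const userBRetry = store.save(store.read()); // User B: re-read v6, save succeeds (now v7)
console.log(userA, userBStale, userBRetry);
```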

Status: ✅ Not a real concern due to optimistic concurrency


Why Missed Notifications Don't Cause Data Loss

Even when notifications are missed, optimistic locking prevents data corruption or overwrites. Here's how it works:

How Optimistic Locking Works in SyRF

Every document in MongoDB has a version field in its audit property. This version is used for optimistic concurrency control:

┌─────────────────────────────────────────────────────────────────────────┐
│ Optimistic Locking Flow                                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ 1. Client reads document                                                │
│    GET /api/projects/abc123                                             │
│    Response: { id: "abc123", name: "My Project", audit: { version: 5 }} │
│                                                                         │
│ 2. Client makes local edits                                             │
│    User changes name to "Updated Project"                               │
│                                                                         │
│ 3. Client saves with version                                            │
│    PATCH /api/projects/abc123                                           │
│    Body: { name: "Updated Project" }                                    │
│    Header: If-Match: 5  (or version in body)                            │
│                                                                         │
│ 4. Server checks version                                                │
│    Current DB version == 5?                                             │
│    ├─ YES → Save succeeds, version becomes 6                            │
│    └─ NO  → Reject with 409 Conflict                                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Scenario: User B Misses Notification, Then Saves

Timeline:
─────────────────────────────────────────────────────────────────────────>
     │                    │                    │                    │
     │ User A & B both    │ User A saves       │ User B tries       │
     │ load project v5    │ (v5 → v6)          │ to save            │
     │                    │                    │                    │
     │                    │ SignalR notifies   │                    │
     │                    │ User B... but      │                    │
     │                    │ message LOST!      │                    │
     │                    │                    │                    │
     ▼                    ▼                    ▼                    ▼
   User B has v5       User B still         User B sends:        Server rejects:
   in browser          shows v5             "Update from v5"     "Current is v6,
                       (stale)                                    not v5 - 409"

What happens next:

  1. User B sees a conflict error in the UI
  2. Client refreshes to get latest data (v6)
  3. User B can now see User A's changes
  4. User B reapplies their changes on top of v6
  5. User B saves successfully (v6 → v7)

Code Implementation

Version field in Audit (Audit.cs:28-34):

public void OnSaving(int schemaVersion, string lastAppVersion, Guid? userId = null)
{
    LastModified = DateTime.UtcNow;
    LastModifiedBy = userId;
    LastAppVersion = lastAppVersion;
    Version++;  // Increment version on each save
}

Server-side version check (MongoExtensions.cs:217-244):

public static async Task<ReplaceOneResult> SaveAsync<TAggregateRoot, TId>(
    this IMongoCollection<TAggregateRoot> collection, TAggregateRoot aggregateRoot, ...)
{
    var currentVersion = aggregateRoot.Version;  // Capture version before save
    aggregateRoot.OnSaving(schemaVersion, appVersion, userId);  // Increments version

    // Filter includes version - won't match if document was modified
    var filter = GetFilter<TAggregateRoot, TId>(aggregateRoot.Id, currentVersion);

    return await collection.ReplaceOneAsync(filter, aggregateRoot,
        new ReplaceOptions { IsUpsert = true });
}

Filter with version check (MongoExtensions.cs:417-441):

public static FilterDefinition<TAggregateRoot> GetFilter<TAggregateRoot, TId>(TId id, int? currentVersion)
{
    var builder = new FilterDefinitionBuilder<TAggregateRoot>();
    var eqId = builder.Eq(a => a.Id, id);
    var versionIsCurrent = builder.Eq(ag => ag.Audit.Version, currentVersion);

    // Filter requires both ID AND version match
    return currentVersion == null ? eqId : builder.And(eqId, versionIsCurrent);
}

How conflicts are detected: When User B tries to save with version 5 but the document is now at version 6, the filter doesn't match. With IsUpsert = true, MongoDB attempts to insert a new document, but since the _id already exists, it throws a DuplicateKeyException. This bubbles up as an error to the client.

Client-side conflict handling (conceptual - actual implementation varies by endpoint):

// When save fails due to version conflict
this.projectService.updateProject(projectId, changes).pipe(
  catchError((error) => {
    // Conflict detected - refresh and let user retry
    this.snackBar.open('Project was modified by another user. Refreshing...', 'OK');
    this.loadProjectRequest.dispatchRequest.loadProject({ projectId });
    return throwError(() => error);
  })
);

Why This Matters

Without Optimistic Locking                    With Optimistic Locking
User B overwrites User A's changes silently   User B gets a conflict error
Data loss occurs                              No data loss
Last write wins (bad)                         Conflicts are detected (good)
Missed notifications = corruption             Missed notifications = slightly worse UX

The Trade-off

Missed SignalR notifications cause:

  • ❌ User sees stale data temporarily
  • ❌ User might get a conflict error when saving
  • ✅ No silent data loss
  • ✅ All changes are preserved
  • ✅ User can merge changes manually if needed

This is why "eventual consistency" is acceptable for SyRF - the worst case is a minor UX inconvenience, not data corruption.


Mitigation Priority Matrix

Mitigation                                                         Effort        Impact   Priority
Subscribe-then-Load + HTTP version filter (Scenario 1, Option A)   Medium-High   High     P2
Version Heartbeat (Scenario 3, Option A)                           Low           Medium   P3
Version Check After Subscribe (Scenario 1, Option B)               Low           Medium   ⭐ P2
Document Current Behavior                                          Very Low      Low      P4

Current Recommendation: The existing implementation provides acceptable consistency for SyRF's use case. The self-healing mechanisms cover most gaps within seconds.

If tighter consistency is needed:

  1. Quick win: Option B (Version Check After Subscribe) - low effort, catches the gap explicitly
  2. Comprehensive fix: Option A (Subscribe-then-Load) requires adding HTTP version filtering first, which is more invasive but provides a cleaner architecture

Important Finding: HTTP responses currently bypass version filtering entirely (requests.ts:47-55 → entity.helpers.ts:68). This means HTTP always wins over SignalR in a race, regardless of which side has newer data. Consider adding version filtering to HTTP responses as a standalone improvement.


Conclusion

The SignalR implementation in SyRF is correctly designed for multi-replica deployment. The MongoDB change stream pattern provides implicit backplane functionality without requiring Redis or Azure SignalR Service.

Key Takeaways

  1. No action required - the current implementation is replica-safe
  2. Sticky sessions are helpful but not critical - they reduce reconnection overhead
  3. MongoDB change streams ARE the backplane - each pod independently watches MongoDB
  4. All notifications use Clients.Client(connectionId) - no group or user broadcasts exist
  5. Groups are vestigial - added but never used, minor overhead only

Risk Assessment

Risk                                Likelihood   Impact   Mitigation
MongoDB change stream failure       Low          High     Retry with exponential backoff (implemented)
Pod restart during active work      Medium       Low      Auto-reconnection with subscription recreation
Future code adds group broadcasts   Low          High     Code review; add this doc to the review checklist