SignalR Multi-Replica Analysis

Executive Summary

Conclusion: The current SignalR implementation is replica-safe and works correctly with multiple API pods.

The architecture uses MongoDB change streams as an implicit backplane, avoiding the typical SignalR scaling problems. However, this comes with trade-offs that should be understood.

Current Production Configuration

Service              Staging Replicas   Production Replicas
API (SignalR host)   2                  3
Project Management   1                  2
Web                  2                  3

Architecture Overview

Traditional SignalR Scaling Problem

In typical SignalR deployments, scaling to multiple replicas causes issues because:

┌─────────────────────────────────────────────────────────────┐
│ Traditional SignalR (PROBLEMATIC)                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Pod-1                      Pod-2                           │
│  ┌──────────────┐          ┌──────────────┐                │
│  │ SignalR Hub  │          │ SignalR Hub  │                │
│  │ Groups:      │          │ Groups:      │                │
│  │  - Project-X │          │  - Project-X │                │
│  │ Clients:     │          │ Clients:     │                │
│  │  - Client-A  │          │  - Client-B  │                │
│  └──────────────┘          └──────────────┘                │
│         │                         │                         │
│         ▼                         ▼                         │
│  Server broadcasts to         Client-B never                │
│  Group("Project-X")           receives message!             │
│  Only Client-A gets it                                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

SyRF's Change Stream Pattern (REPLICA-SAFE)

SyRF uses a fundamentally different approach:

┌─────────────────────────────────────────────────────────────┐
│ SyRF SignalR Architecture                                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│                    ┌─────────────────┐                      │
│                    │  MongoDB Atlas  │                      │
│                    │  Change Streams │                      │
│                    └────────┬────────┘                      │
│              ┌──────────────┼──────────────┐                │
│              ▼              ▼              ▼                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Pod-1      │  │   Pod-2      │  │   Pod-3      │      │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │      │
│  │ │ Change   │ │  │ │ Change   │ │  │ │ Change   │ │      │
│  │ │ Stream   │ │  │ │ Stream   │ │  │ │ Stream   │ │      │
│  │ │ Watcher  │ │  │ │ Watcher  │ │  │ │ Watcher  │ │      │
│  │ └────┬─────┘ │  │ └────┬─────┘ │  │ └────┬─────┘ │      │
│  │      ▼       │  │      ▼       │  │      ▼       │      │
│  │ Client-A ✓   │  │ Client-B ✓   │  │ Client-C ✓   │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│                                                             │
│  All clients receive notifications independently!           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Detailed Flow Analysis

Subscription Flow

  1. Client connects to the SignalR hub (sticky session routes to one pod)
  2. Client subscribes by calling a hub method (e.g., SubscribeToProject)
  3. Server creates a MongoDB change stream watching that entity/collection
  4. The change stream subscription is stored with the client's connectionId
  5. When a document changes in MongoDB:
     • MongoDB sends the change event to ALL pods watching that collection
     • Each pod processes the event for its own connected clients
     • Each pod notifies its clients via Clients.Client(connectionId)
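Steps 3 and 4 amount to per-connection bookkeeping on each pod. A minimal TypeScript sketch of the idea (hypothetical names, not the actual SubscriptionManager):

```typescript
// Toy subscription manager: maps connectionId -> per-entity callbacks, so a
// pod only notifies the clients connected to it (illustrative sketch only).
type Callback = (change: { entityId: string; version: number }) => void;

class SubscriptionManager {
  private byConnection = new Map<string, Map<string, Callback>>();

  subscribe(connectionId: string, entityId: string, cb: Callback): void {
    const subs = this.byConnection.get(connectionId) ?? new Map<string, Callback>();
    subs.set(entityId, cb);
    this.byConnection.set(connectionId, subs);
  }

  // Called when this pod's change stream emits an event.
  onChange(change: { entityId: string; version: number }): void {
    for (const subs of this.byConnection.values()) {
      subs.get(change.entityId)?.(change);
    }
  }
}

const manager = new SubscriptionManager();
const delivered: number[] = [];
manager.subscribe('conn-A', 'project-X', change => delivered.push(change.version));
manager.onChange({ entityId: 'project-X', version: 6 });
console.log(delivered); // versions delivered to conn-A's callback
```

Because every pod holds only its own connection map, a change event fanned out to all pods still reaches every client exactly once.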

Key Code Paths

Hub subscription (NotificationHub.cs:99-136):

public async Task SubscribeToProject(Guid projectId)
{
    var connectionId = Context.ConnectionId;
    _projectsSubscriptionManager.SubscribeToEntity(connectionId, projectId,
        notificationDto =>
        {
            // This callback fires on the SAME pod that created the subscription
            _hubContext.Clients.Client(connectionId).ProjectNotification(notificationDto);
        }
    );
}

Change stream source (MongoRepositoryBase.cs:87-88):

public IObservable<EntityNotification<TAggregateRoot>> GetEntityNotificationStream(Guid id) =>
    _GetEntityNotificationStream(id);  // Uses MongoCachedChangeStream

Cached change stream (MongoContext.cs:142-249):

public IObservable<ChangeStreamDocument<TAggregateRoot>> GetCachedCollectionChangeStream<...>()
{
    return _changeStreamCache.GetOrAdd(() =>
        Observable.Create<ChangeStreamDocument<TAggregateRoot>>(async (obs, ct) =>
        {
            var cursor = await GetCollection<TAggregateRoot, TId>()
                .WatchAsync(pipeline, opts, ct);
            // Each pod opens its own cursor to MongoDB
            while (!ct.IsCancellationRequested)
            {
                await cursor.MoveNextAsync(ct);
                foreach (var doc in cursor.Current)
                    obs.OnNext(doc);  // Emits to all subscribers on this pod
            }
        })
        .Publish()
        .RefCount()  // Hot observable - shared across all subscriptions on this pod
    );
}
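The Publish().RefCount() combination means each pod opens the underlying cursor once and shares it across all local subscribers. A toy TypeScript sketch of the same ref-counted sharing (illustrative only, not the server code):

```typescript
// Toy RefCount: the underlying "cursor" is opened on first subscribe, shared
// by every local subscriber, and closed when the last subscriber leaves.
class RefCountedStream<T> {
  private listeners = new Set<(v: T) => void>();
  private open = false;
  openCount = 0; // how many times the underlying cursor was opened

  subscribe(cb: (v: T) => void): () => void {
    if (!this.open) { this.open = true; this.openCount++; } // lazy open
    this.listeners.add(cb);
    return () => {
      this.listeners.delete(cb);
      if (this.listeners.size === 0) this.open = false; // close when unused
    };
  }

  emit(v: T): void { for (const cb of this.listeners) cb(v); }
}

const stream = new RefCountedStream<number>();
const seen: number[] = [];
stream.subscribe(v => seen.push(v));
stream.subscribe(v => seen.push(v * 10));
stream.emit(1); // both subscribers receive it; the cursor was opened only once
console.log(stream.openCount, seen);
```

This is why the cursor count scales with replicas, not with connected clients: N subscribers on one pod still share a single MongoDB cursor.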

Scenario Analysis

Scenario 1: Project Update Notification

Step     What Happens                                               Replica Impact
1        User A (Pod-1) and User B (Pod-2) both viewing Project-X   Each has a subscription
2        User A saves a change to Project-X                         MongoDB document updated
3        MongoDB emits change event                                 Sent to ALL change stream cursors
4        Pod-1 receives event                                       Notifies User A via Clients.Client(A)
5        Pod-2 receives event                                       Notifies User B via Clients.Client(B)
Result   Both users see the update                                  ✅ Works correctly

Scenario 2: File Upload Progress

Step     What Happens                                  Replica Impact
1        User uploads file, connected to Pod-1         Subscription created on Pod-1
2        S3 notifier processes file                    Publishes to RabbitMQ
3        Project-Management service updates MongoDB    Progress field updated
4        MongoDB emits change event                    Pod-1's change stream receives it
5        Pod-1 notifies user                           Via Clients.Client(connectionId)
Result   User sees progress updates                    ✅ Works correctly

Scenario 3: Pod Restart

Step     What Happens                               Replica Impact
1        Pod-2 restarts (rolling update)            Connections to Pod-2 drop
2        Clients reconnect                          May land on any pod
3        Client runs .withAutomaticReconnect()      SignalR handles reconnection
4        Client re-subscribes                       allReconnected$ triggers reload
5        New change stream subscriptions created    On whichever pod the client landed
Result   Brief interruption, then normal            ✅ Self-healing

Scenario 4: Version Check (UiVersionCheck)

Step     What Happens                      Replica Impact
1        Client connects to Pod-1          OnConnectedAsync fires
2        Hub sends version check           Clients.Client(Context.ConnectionId)
3        Target is the SAME connection     On this pod, guaranteed
Result   Client receives version check     ✅ Works correctly

What Would Break (Hypothetical)

The following patterns are NOT used but would cause problems:

❌ Group Broadcasting (NOT USED)

// If this existed, only 1/3 of clients would receive it
await Clients.Group("project-X").SomeNotification(...);

❌ User-Targeted Messages (NOT USED)

// If this existed, would fail if user connected to different pod
await Clients.User(userId).SomeNotification(...);

❌ All-Client Broadcast (NOT USED)

// If this existed, only 1/3 of clients would receive it
await Clients.All.SomeNotification(...);

Verification: grepping the codebase for Clients.(Group|User|All)( returns no matches.

Vestigial Group Code

Groups ARE added but never used for broadcasting:

// NotificationHub.cs:57-58 - Added on connect
await Groups.AddToGroupAsync(Context.ConnectionId, UserGroupName(userId.ToString()));

// NotificationHub.cs:132 - Added on project subscription
await Groups.AddToGroupAsync(Context.ConnectionId, ProjectGroupName(projectId));

These groups add minor overhead but cause no correctness issues. They may have been intended for future use or are remnants of an older design.

Sticky Sessions Analysis

Current Configuration

_ingress.tpl:

annotations:
  nginx.ingress.kubernetes.io/affinity: "cookie"
  nginx.ingress.kubernetes.io/session-cookie-name: "route"
  nginx.ingress.kubernetes.io/session-cookie-expires: "172800"  # 2 days

Are They Required?

No, but they help. Without sticky sessions:

Aspect                    With Sticky Sessions    Without Sticky Sessions
Initial HTTP handshake    Same pod                Any pod
WebSocket upgrade         Same pod                Any pod
Subscription creation     Predictable             Works, but subscription on any pod
Reconnection              Returns to same pod     May land on a different pod
Subscription recreation   Avoided if same pod     Always recreated on reconnect

Recommendation: Keep sticky sessions to reduce subscription churn, but they're not required for correctness.

Resume Token Management

Current Implementation

InMemoryResumePointRepository.cs:

public class InMemoryResumePointRepository : IResumePointRepository
{
    private readonly ConcurrentDictionary<string, ResumePoint> _store = new();
    // ...
}

Impact

  • Each pod stores resume tokens in memory (not shared)
  • On pod restart, resume tokens are lost
  • Change streams restart from current time (safe, no missed events for new subscriptions)
  • This is acceptable because subscriptions are connection-scoped

Resource Implications

MongoDB Change Stream Cursors

Configuration   Cursor Count per Collection
1 replica       1 cursor
3 replicas      3 cursors
6 replicas      6 cursors

MongoDB Atlas handles this well - change streams are designed for multiple consumers.

Memory Per Pod

Each pod stores:

  • Connection → Subscription mappings (ConcurrentDictionary)
  • Rx subscriptions (IDisposable for each active subscription)
  • Resume tokens (ConcurrentDictionary)

Memory usage scales with connections per pod, not total connections.

Potential Improvements (Not Required)

1. Remove Unused Group Code

The Groups.AddToGroupAsync calls could be removed to reduce overhead.

2. Shared Resume Token Storage

Could use Redis/MongoDB for resume tokens to survive pod restarts without briefly re-streaming. Low priority - current behavior is correct.

3. Connection Metrics

Add observability for:

  • Connections per pod
  • Subscriptions per pod
  • Change stream lag

Comparison with Redis Backplane

Aspect                      SyRF (Change Streams)    Traditional (Redis Backplane)
Cross-pod messaging         Via MongoDB              Via Redis
Additional infrastructure   None                     Redis cluster required
Latency                     Database → All pods      Hub → Redis → All pods
Notification trigger        Database change          Explicit broadcast
Use case fit                CRUD apps                Chat, live collaboration

SyRF's pattern is well-suited because:

  • All notifications stem from database changes
  • No need for arbitrary server-initiated broadcasts
  • Avoids Redis infrastructure complexity

Potential Missed Events Analysis

While the architecture is replica-safe, there are edge cases where events can be missed. This section analyzes each scenario, its impact, and mitigation options.

Overview

Scenario                       Can Miss Events?    Self-Heals?                      Time to Heal      Severity
Initial load race condition    ⚠️ Yes              Next update/refresh/reconnect    Seconds–minutes   Medium
Disconnection period           Yes, but handled    ✅ Auto reload on reconnect      Immediate         Low
WebSocket send failure         ⚠️ Yes              Next update/refresh/reconnect    Seconds–minutes   Low
Out-of-order delivery          No (filtered)       N/A                              N/A               None
Pod restart                    Yes, but handled    ✅ Client reconnects + reloads   Immediate         Low
Change stream cursor failure   Rare                ✅ Retry with backoff            Seconds           Low

Scenario 1: Initial Load Race Condition ⚠️

The Problem

There's a timing gap between when HTTP data loads and when the SignalR subscription is created:

Timeline:
─────────────────────────────────────────────────────────────────────>
     │                      │                      │
     │ HTTP Response        │ Another user saves   │ Subscription Created
     │ (Project v5)         │ (Project v6)         │ (watching from now)
     │                      │                      │
     └──────────────────────┴──────────────────────┘
                      VERSION 6 MISSED!

Code Path

  1. Route guard calls loadProject() → HTTP request (project-guard.service.ts:107)
  2. HTTP response returns version N
  3. Component renders
  4. SignalR service detects _currentProjectId$ changed (signal-r.service.ts:530-548)
  5. Subscription created via SubscribeToProject()

Any change between steps 2 and 5 is not delivered.
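The gap can be reproduced with a toy simulation (hypothetical names; it assumes only that the HTTP load and the subscription are independent asynchronous steps, as in the code path above):

```typescript
// Minimal simulation of the initial-load race (illustrative, not SyRF code).
type ChangeNotification = { version: number };

class FakeHub {
  private subscribers: Array<(n: ChangeNotification) => void> = [];
  subscribe(cb: (n: ChangeNotification) => void): void { this.subscribers.push(cb); }
  emit(n: ChangeNotification): void { this.subscribers.forEach(cb => cb(n)); }
}

const hub = new FakeHub();
let storeVersion = 0;

// Steps 1-2: the HTTP response returns v5 and populates the store
storeVersion = 5;

// Between steps 2 and 5: another user saves v6, before the subscription exists
hub.emit({ version: 6 }); // nobody is listening yet, so v6 is lost

// Steps 4-5: the subscription is created, watching only from this point on
hub.subscribe(n => { if (n.version > storeVersion) storeVersion = n.version; });

console.log(storeVersion); // still 5: the v6 update was missed
```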

Impact

  • User sees stale data until next update, refresh, or reconnection
  • For most SyRF use cases (screening, annotation), this is a minor issue
  • For progress updates during file upload, user might miss intermediate progress

Self-Healing Triggers

  • Another database change triggers version N+2 notification
  • User refreshes the page
  • Connection drops and reconnects (triggers full reload)
  • User navigates away and back

Mitigation Options

Option                             Complexity   Effectiveness   Recommendation
A. Subscribe-then-Load             Medium       High            ⭐ Recommended
B. Version Check After Subscribe   Low          Medium          Good quick fix
C. Periodic Version Polling        Low          Medium          Simple fallback
D. Accept Current Behavior         None         N/A             Acceptable for SyRF

Option A: Subscribe-then-Load Pattern (with HTTP version filtering)

Reverse the order: create SignalR subscription BEFORE making the HTTP request, and add version filtering to HTTP responses.

// Current (problematic):
// 1. HTTP load project (gets v5)
// 2. Subscribe to project (misses v6 that happened between 1 and 2)

// Improved:
// 1. Subscribe to project (captures all changes from this point)
// 2. HTTP load project (gets v5 or v6 depending on timing)
// 3. Version filtering handles any out-of-order delivery

Critical Caveat: This pattern only works if HTTP responses also go through version filtering.

Current problem: Version filtering only exists in SignalR path, NOT in HTTP path:

Path                    Version Filtering   Code Location
SignalR notifications   ✅ Yes              signal-r.service.ts:422-432
HTTP responses          ❌ No               requests.ts:47-55 → reducer spreads directly

What happens WITHOUT HTTP version filtering:

1. Subscribe created
2. SignalR v6 arrives → version filter passes → store updated to v6
3. HTTP returns v5 → NO filter → dispatches detailLoaded → OVERWRITES with v5!

User ends up with stale v5 even though v6 was correctly received.

Implementation changes required:

  1. Add version filtering to HTTP response handling (effect or reducer level)
  2. Modify projectCanMatchGuard to subscribe before loading
  3. Ensure subscription is created synchronously or awaited
  4. Add subscription cleanup on navigation away

Example HTTP version filter (in effect):

map((projectWithRelatedInvestigatorsDto) => {
  const incomingVersion = projectWithRelatedInvestigatorsDto.project.audit?.version;
  const storedVersion = store.selectSignal(selectProjectVersion(props.projectId))();

  // Only dispatch if incoming is newer or same
  if (storedVersion === undefined || incomingVersion >= storedVersion) {
    return projectDetailActions.detailLoaded({...});
  } else {
    console.warn(`[HTTP] Dropping stale response v${incomingVersion}, have v${storedVersion}`);
    return { type: '[Project] Stale HTTP Response Ignored' };
  }
})

Complexity: Medium-High (requires changes to effects and careful testing)

Option B: Version Check After Subscribe

After subscription is confirmed, make a lightweight version-check call:

async _subscribeToProject(projectId: string) {
  await this._hubConnection.invoke('SubscribeToProject', projectId);

  // New: Check if we missed anything
  const serverVersion = await this._hubConnection.invoke('GetProjectVersion', projectId);
  const clientVersion = this._store.selectSignal(selectProjectVersion(projectId))();

  if (serverVersion > clientVersion) {
    this._loadProjectRequest.dispatchRequest.loadProject({ projectId });
  }
}

Pros: Simple, explicit check
Cons: Adds an extra round-trip; requires a new hub method

Option C: Periodic Version Polling

Extend UiVersionCheck to include entity versions:

// Server sends periodically (e.g., every 30 seconds):
{
  minUiVersion: "1.2.3",
  entityVersions: {
    "project:abc123": 42,
    "project:def456": 17
  }
}

// Client compares and reloads stale entities

Pros: Catches all gaps, low implementation effort
Cons: Adds periodic load; slight delay in detection
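The client-side comparison step could look like the following sketch (hypothetical helper, not SyRF code; entity ids and versions match the example payload above):

```typescript
// Compare server-sent entity versions against the client's store and collect
// the entity ids that need a reload (illustrative sketch only).
type EntityVersions = Record<string, number>;

function findStaleEntities(server: EntityVersions, client: EntityVersions): string[] {
  return Object.entries(server)
    .filter(([id, serverVersion]) => serverVersion > (client[id] ?? 0))
    .map(([id]) => id);
}

const stale = findStaleEntities(
  { 'project:abc123': 42, 'project:def456': 17 },  // from the heartbeat
  { 'project:abc123': 42, 'project:def456': 15 },  // from the store
);
console.log(stale); // only the entity whose server version is newer
```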

Option D: Accept Current Behavior

Document the behavior and rely on natural self-healing.

When this is acceptable:

  • Users rarely edit the same project simultaneously
  • Stale data doesn't cause data loss (optimistic locking on save)
  • Most gaps self-heal within seconds

Scenario 2: Disconnection Period ✅ Handled

The Problem

During disconnection, change events are not delivered to the client.

Current Mitigation (Already Implemented)

signal-r.service.ts:641-653:

this.allReconnected$
  .pipe(
    switchMap(() => this._currentProjectId$),
    withLatestFrom(this._isMemberOfProject$),
    filter(([projectId, isMember]) => !!projectId && !!isMember),
    takeUntil(this._destroy$)
  )
  .subscribe(([projectId]) =>
    this._loadProjectRequest.dispatchRequest.loadProject({ projectId })
  );

On reconnection:

  1. allReconnected$ fires
  2. Full project data is reloaded via HTTP
  3. State is synchronized

Status: ✅ No additional mitigation needed


Scenario 3: WebSocket Send Failure ⚠️

The Problem

If the WebSocket connection breaks during Clients.Client(connectionId).ProjectNotification(...), that specific message is lost. SignalR uses fire-and-forget semantics with no application-level acknowledgments.

How It Happens

┌─────────────────────────────────────────────────────────────────────────┐
│ Server (API Pod)                              │ Client (Browser)        │
├─────────────────────────────────────────────────────────────────────────┤
│                                               │                         │
│ 1. MongoDB emits change event (v6)            │                         │
│         │                                     │                         │
│         ▼                                     │                         │
│ 2. Rx subscription callback fires             │                         │
│         │                                     │                         │
│         ▼                                     │                         │
│ 3. _hubContext.Clients.Client(connId)         │                         │
│      .ProjectNotification(dto)                │                         │
│         │                                     │                         │
│         ▼                                     │                         │
│ 4. SignalR serializes message                 │                         │
│         │                                     │                         │
│         └────────── WebSocket ────────X       │ Connection breaks!      │
│                                        │      │ (network hiccup,        │
│                                        │      │  WiFi briefly drops,    │
│                                        │      │  mobile tower switch)   │
│                                        ▼      │                         │
│                                    MESSAGE    │ Client never receives   │
│                                    LOST!      │ anything                │
│                                               │                         │
│ Server doesn't know it failed ───────────────│─ Client doesn't know    │
│ (fire-and-forget semantics)                  │  it missed anything     │
│                                               │                         │
└─────────────────────────────────────────────────────────────────────────┘

Why Fire-and-Forget?

The notification callback doesn't await delivery confirmation (NotificationHub.cs:99-136):

_projectsSubscriptionManager.SubscribeToEntity(Context.ConnectionId, projectId,
    notificationDto =>
    {
        // Fire-and-forget - no await, no acknowledgment
        _hubContext.Clients.Client(connectionId).ProjectNotification(notificationDto);
    }
);

When Does This Happen?

Cause                                       Likelihood   Duration
Mobile network handoff (WiFi ↔ cellular)    Medium       1-5 seconds
Brief network congestion                    Low          Milliseconds
Browser tab backgrounded aggressively       Medium       Varies
VPN reconnection                            Medium       1-10 seconds

Difference from Full Disconnection (Scenario 2)

Full Disconnection                Send Failure
Client knows it disconnected      Client may not know
onclose / onreconnected fires     No event fires
allReconnected$ triggers reload   Nothing triggers
Self-heals immediately            Waits for next update

Impact

  • Single notification lost
  • User sees stale state until next update
  • No error visible to user or server
  • No data loss - optimistic locking prevents conflicts (see below)

Self-Healing Triggers

  • Next database change delivers newer version
  • User refreshes
  • Connection fully drops → reconnection → reload

Mitigation Options

Option                               Complexity   Effectiveness   Recommendation
A. Heartbeat with Versions           Low          High            ⭐ Recommended
B. Client-side Staleness Detection   Low          Medium          Good supplement
C. Application-level ACKs            High         Very High       Overkill for SyRF

Option A: Heartbeat with Versions

Extend the existing UiVersionCheck mechanism to periodically sync versions:

Server-side (new hub method):

public async Task SendVersionHeartbeat()
{
    var subscribedProjectIds = _projectsSubscriptionManager.GetSubscribedEntityIds(Context.ConnectionId);

    var versions = subscribedProjectIds.ToDictionary(
        id => id,
        id => _pmUnitOfWork.Projects.Get(id)?.Audit?.Version ?? 0
    );

    await Clients.Caller.VersionHeartbeat(versions);
}

Client-side:

this.versionHeartbeat$.pipe(
  switchMap(versions => {
    const staleProjects = Object.entries(versions)
      .filter(([id, serverVersion]) => {
        const clientVersion = this._store.selectSignal(selectProjectVersion(id))();
        return serverVersion > (clientVersion ?? 0);
      });

    return from(staleProjects).pipe(
      tap(([projectId]) =>
        this._loadProjectRequest.dispatchRequest.loadProject({ projectId })
      )
    );
  })
).subscribe();

Trigger: Could be periodic (every 30s) or on specific actions (focus window, complete operation).

Option B: Client-side Staleness Detection

Track when data was last updated and trigger refresh if stale:

// In project state (hypothetical fields)
interface ProjectSyncState {
  lastUpdated: Date;
  lastSignalRNotification: Date;
}

// Detect potential gaps: no notification for a while on an actively viewed entity
const timeSinceLastNotification = Date.now() - state.lastSignalRNotification.getTime();
if (timeSinceLastNotification > 60_000 && isActivelyViewing) {
  // Trigger a lightweight version check or full reload
}

Scenario 4: Out-of-Order Delivery ✅ Handled

The Problem (Hypothetical)

Network conditions could cause notifications to arrive out of order:

  1. Version 6 emitted
  2. Version 7 emitted
  3. Version 7 arrives at client first
  4. Version 6 arrives at client second

Current Mitigation (Already Implemented)

Version filtering (signal-r.service.ts:422-432) prevents older versions from overwriting newer:

const shouldUpdate =
  storedVersion === undefined ||
  storedVersion <= (updatedVersion ?? Infinity) || ...

if (!shouldUpdate) {
  console.warn(`[SignalR] Dropping notification - storedVersion=${storedVersion}, updatedVersion=${updatedVersion}`);
}
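Reduced to its essence, the rule is: apply a notification only if it is at least as new as what the store holds. A standalone sketch of that rule (hypothetical function name; assumes integer audit versions):

```typescript
// Drops notifications whose version is older than what the store already has.
function shouldApply(storedVersion: number | undefined, incomingVersion: number): boolean {
  // The first notification always applies; otherwise require incoming >= stored.
  return storedVersion === undefined || incomingVersion >= storedVersion;
}

console.log(shouldApply(undefined, 6)); // nothing stored yet: apply
console.log(shouldApply(7, 6));         // v6 arrives after v7 was applied: drop
console.log(shouldApply(6, 7));         // normal in-order delivery: apply
```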

Result: Version 6 is correctly dropped if version 7 was already applied.

Status: ✅ No additional mitigation needed


Scenario 5: Pod Restart ✅ Handled

The Problem

When a pod restarts:

  1. All connections to that pod are broken
  2. In-memory resume tokens are lost
  3. Change stream starts from "now" on new pod

Why It's Handled

For SignalR clients, this is equivalent to a disconnection:

  1. Client detects connection loss
  2. Auto-reconnection kicks in (may land on different pod)
  3. allReconnected$ fires
  4. Full data reload occurs

Edge Case: Resume Token Loss

The in-memory InMemoryResumePointRepository means each pod loses resume tokens on restart. However:

  • New subscriptions start fresh change streams
  • Client reload gets current state
  • No events are actually "missed" from the client's perspective

Potential Issue: If you added server-side consumers of change streams (beyond SignalR), they would need persistent resume tokens. Currently not applicable.

Status: ✅ No additional mitigation needed for SignalR use case


Scenario 6: Rapid Updates ✅ Handled

Concern: Could rapid updates with the same version number cause issues?

Analysis: This can't happen in practice because:

  1. MongoDB uses optimistic concurrency with version field
  2. Each successful save increments version atomically
  3. Concurrent saves with same base version → one fails with concurrency exception

The actual flow:

User A: Read v5 → Save → Success (v6)
User B: Read v5 → Save → Concurrency Error (v5 already changed)
User B: Retry → Read v6 → Save → Success (v7)
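The flow above is a compare-and-swap on the version field. A toy in-memory model of the same behavior (illustrative only, not the MongoDB implementation):

```typescript
// Toy optimistic-concurrency store: a save succeeds only if the caller's base
// version matches the current version; the version increments on each save.
class VersionedStore {
  private version = 5;

  read(): number { return this.version; }

  save(baseVersion: number): { ok: boolean; version: number } {
    if (baseVersion !== this.version) {
      return { ok: false, version: this.version }; // concurrency conflict
    }
    this.version += 1; // atomic increment on successful save
    return { ok: true, version: this.version };
  }
}

const store = new VersionedStore();
const userA = store.save(5);                 // User A: read v5, save succeeds (now v6)
const userBStale = store.save(5);            // User B: read v5, save rejected (DB is at v6)
const userBRetry = store.save(store.read()); // User B: re-read v6, save succeeds (now v7)
console.log(userA, userBStale, userBRetry);
```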

Status: ✅ Not a real concern due to optimistic concurrency


Why Missed Notifications Don't Cause Data Loss

Even when notifications are missed, optimistic locking prevents data corruption or overwrites. Here's how it works:

How Optimistic Locking Works in SyRF

Every document in MongoDB has a version field in its audit property. This version is used for optimistic concurrency control:

┌─────────────────────────────────────────────────────────────────────────┐
│ Optimistic Locking Flow                                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│ 1. Client reads document                                                │
│    GET /api/projects/abc123                                             │
│    Response: { id: "abc123", name: "My Project", audit: { version: 5 }} │
│                                                                         │
│ 2. Client makes local edits                                             │
│    User changes name to "Updated Project"                               │
│                                                                         │
│ 3. Client saves with version                                            │
│    PATCH /api/projects/abc123                                           │
│    Body: { name: "Updated Project" }                                    │
│    Header: If-Match: 5  (or version in body)                            │
│                                                                         │
│ 4. Server checks version                                                │
│    Current DB version == 5?                                             │
│    ├─ YES → Save succeeds, version becomes 6                            │
│    └─ NO  → Reject with 409 Conflict                                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Scenario: User B Misses Notification, Then Saves

Timeline:
─────────────────────────────────────────────────────────────────────────>
     │                    │                    │                    │
     │ User A & B both    │ User A saves       │ User B tries       │
     │ load project v5    │ (v5 → v6)          │ to save            │
     │                    │                    │                    │
     │                    │ SignalR notifies   │                    │
     │                    │ User B... but      │                    │
     │                    │ message LOST!      │                    │
     │                    │                    │                    │
     ▼                    ▼                    ▼                    ▼
   User B has v5       User B still         User B sends:        Server rejects:
   in browser          shows v5             "Update from v5"     "Current is v6,
                       (stale)                                    not v5 - 409"

What happens next:

  1. User B sees a conflict error in the UI
  2. Client refreshes to get latest data (v6)
  3. User B can now see User A's changes
  4. User B reapplies their changes on top of v6
  5. User B saves successfully (v6 → v7)

Code Implementation

Version field in Audit (Audit.cs:28-34):

public void OnSaving(int schemaVersion, string lastAppVersion, Guid? userId = null)
{
    LastModified = DateTime.UtcNow;
    LastModifiedBy = userId;
    LastAppVersion = lastAppVersion;
    Version++;  // Increment version on each save
}

Server-side version check (MongoExtensions.cs:217-244):

public static async Task<ReplaceOneResult> SaveAsync<TAggregateRoot, TId>(
    this IMongoCollection<TAggregateRoot> collection, TAggregateRoot aggregateRoot, ...)
{
    var currentVersion = aggregateRoot.Version;  // Capture version before save
    aggregateRoot.OnSaving(schemaVersion, appVersion, userId);  // Increments version

    // Filter includes version - won't match if document was modified
    var filter = GetFilter<TAggregateRoot, TId>(aggregateRoot.Id, currentVersion);

    return await collection.ReplaceOneAsync(filter, aggregateRoot,
        new ReplaceOptions { IsUpsert = true });
}

Filter with version check (MongoExtensions.cs:417-441):

public static FilterDefinition<TAggregateRoot> GetFilter<TAggregateRoot, TId>(TId id, int? currentVersion)
{
    var builder = new FilterDefinitionBuilder<TAggregateRoot>();
    var eqId = builder.Eq(a => a.Id, id);
    var versionIsCurrent = builder.Eq(ag => ag.Audit.Version, currentVersion);

    // Filter requires both ID AND version match
    return currentVersion == null ? eqId : builder.And(eqId, versionIsCurrent);
}

How conflicts are detected: When User B tries to save with version 5 but the document is now at version 6, the filter doesn't match. With IsUpsert = true, MongoDB attempts to insert a new document, but since the _id already exists, it throws a DuplicateKeyException. This bubbles up as an error to the client.

Client-side conflict handling (conceptual - actual implementation varies by endpoint):

// When save fails due to version conflict
this.projectService.updateProject(projectId, changes).pipe(
  catchError((error) => {
    // Conflict detected - refresh and let user retry
    this.snackBar.open('Project was modified by another user. Refreshing...', 'OK');
    this.loadProjectRequest.dispatchRequest.loadProject({ projectId });
    return throwError(() => error);
  })
);

Why This Matters

Without Optimistic Locking                    With Optimistic Locking
User B overwrites User A's changes silently   User B gets a conflict error
Data loss occurs                              No data loss
Last write wins (bad)                         Conflicts are detected (good)
Missed notifications = corruption             Missed notifications = slightly worse UX

The Trade-off

Missed SignalR notifications cause:

  • ❌ User sees stale data temporarily
  • ❌ User might get a conflict error when saving
  • ✅ No silent data loss
  • ✅ All changes are preserved
  • ✅ User can merge changes manually if needed

This is why "eventual consistency" is acceptable for SyRF - the worst case is a minor UX inconvenience, not data corruption.


Mitigation Priority Matrix

Mitigation                                                         Effort        Impact   Priority
Subscribe-then-Load + HTTP version filter (Scenario 1, Option A)   Medium-High   High     P2
Version Heartbeat (Scenario 3, Option A)                           Low           Medium   P3
Version Check After Subscribe (Scenario 1, Option B)               Low           Medium   ⭐ P2
Document Current Behavior                                          Very Low      Low      P4

Current Recommendation: The existing implementation provides acceptable consistency for SyRF's use case. The self-healing mechanisms cover most gaps within seconds.

If tighter consistency is needed:

  1. Quick win: Option B (Version Check After Subscribe) - low effort, catches the gap explicitly
  2. Comprehensive fix: Option A (Subscribe-then-Load) requires adding HTTP version filtering first, which is more invasive but provides a cleaner architecture

Important Finding: HTTP responses currently bypass version filtering entirely (requests.ts:47-55 → entity.helpers.ts:68). This means HTTP always wins over SignalR in a race, regardless of which side has newer data. Consider adding version filtering to HTTP responses as a standalone improvement.


Conclusion

The SignalR implementation in SyRF is correctly designed for multi-replica deployment. The MongoDB change stream pattern provides implicit backplane functionality without requiring Redis or Azure SignalR Service.

Key Takeaways

  1. No action required - the current implementation is replica-safe
  2. Sticky sessions are helpful but not critical - they reduce reconnection overhead
  3. MongoDB change streams ARE the backplane - each pod independently watches MongoDB
  4. All notifications use Clients.Client(connectionId) - no group or user broadcasts exist
  5. Groups are vestigial - added but never used, minor overhead only

Risk Assessment

Risk                                Likelihood   Impact   Mitigation
MongoDB change stream failure       Low          High     Retry with exponential backoff (implemented)
Pod restart during active work      Medium       Low      Auto-reconnection with subscription recreation
Future code adds group broadcasts   Low          High     Code review; add this doc to the review checklist