SignalR Multi-Replica Analysis¶
Executive Summary¶
Conclusion: The current SignalR implementation is replica-safe and works correctly with multiple API pods.
The architecture uses MongoDB change streams as an implicit backplane, avoiding the typical SignalR scaling problems. However, this comes with trade-offs that should be understood.
Current Production Configuration¶
| Service | Staging Replicas | Production Replicas |
|---|---|---|
| API (SignalR host) | 2 | 3 |
| Project Management | 1 | 2 |
| Web | 2 | 3 |
Architecture Overview¶
Traditional SignalR Scaling Problem¶
In typical SignalR deployments, scaling to multiple replicas causes issues because:
```
┌─────────────────────────────────────────────────────────────┐
│              Traditional SignalR (PROBLEMATIC)              │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Pod-1                         Pod-2                       │
│   ┌──────────────┐              ┌──────────────┐            │
│   │ SignalR Hub  │              │ SignalR Hub  │            │
│   │ Groups:      │              │ Groups:      │            │
│   │  - Project-X │              │  - Project-X │            │
│   │ Clients:     │              │ Clients:     │            │
│   │  - Client-A  │              │  - Client-B  │            │
│   └──────────────┘              └──────────────┘            │
│          │                             │                    │
│          ▼                             ▼                    │
│   Server broadcasts to          Client-B never              │
│   Group("Project-X")            receives message!           │
│   Only Client-A gets it                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
SyRF's Change Stream Pattern (REPLICA-SAFE)¶
SyRF uses a fundamentally different approach:
```
┌─────────────────────────────────────────────────────────────┐
│                  SyRF SignalR Architecture                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│                    ┌─────────────────┐                      │
│                    │  MongoDB Atlas  │                      │
│                    │  Change Streams │                      │
│                    └────────┬────────┘                      │
│           ┌─────────────────┼─────────────────┐             │
│           ▼                 ▼                 ▼             │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│   │    Pod-1     │  │    Pod-2     │  │    Pod-3     │      │
│   │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │      │
│   │ │  Change  │ │  │ │  Change  │ │  │ │  Change  │ │      │
│   │ │  Stream  │ │  │ │  Stream  │ │  │ │  Stream  │ │      │
│   │ │ Watcher  │ │  │ │ Watcher  │ │  │ │ Watcher  │ │      │
│   │ └────┬─────┘ │  │ └────┬─────┘ │  │ └────┬─────┘ │      │
│   │      ▼       │  │      ▼       │  │      ▼       │      │
│   │  Client-A ✓  │  │  Client-B ✓  │  │  Client-C ✓  │      │
│   └──────────────┘  └──────────────┘  └──────────────┘      │
│                                                             │
│      All clients receive notifications independently!       │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
Detailed Flow Analysis¶
Subscription Flow¶
- Client connects to SignalR hub (sticky session routes to one pod)
- Client subscribes by calling a hub method (e.g., `SubscribeToProject`)
- Server creates a MongoDB change stream watching that entity/collection
- The change stream subscription is stored with the client's `connectionId`
- When a document changes in MongoDB:
  - MongoDB sends the change event to ALL pods watching that collection
  - Each pod processes the event for its own connected clients
  - The notification is sent via `Clients.Client(connectionId)`
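The per-pod delivery model can be sketched in a few lines. This is an illustrative TypeScript simulation, not SyRF code; the `Pod` class and `mongoEmits` helper are invented names standing in for a pod's connection map and MongoDB's fan-out to every cursor:

```typescript
// Illustrative sketch: each pod keeps only ITS OWN connectionId -> callback map.
// MongoDB delivers the same change event to every pod's cursor, and each pod
// notifies just the clients connected to it (the Clients.Client(id) analogue).
type ChangeEvent = { projectId: string; version: number };

class Pod {
  // Connections terminated on THIS pod only
  private subscriptions = new Map<string, (e: ChangeEvent) => void>();

  subscribe(connectionId: string, cb: (e: ChangeEvent) => void): void {
    this.subscriptions.set(connectionId, cb);
  }

  // Called when this pod's change-stream cursor emits an event
  onChangeStreamEvent(e: ChangeEvent): void {
    for (const cb of this.subscriptions.values()) cb(e);
  }
}

// MongoDB delivers the same event to every pod's cursor independently
function mongoEmits(pods: Pod[], e: ChangeEvent): void {
  for (const pod of pods) pod.onChangeStreamEvent(e);
}

// Demo: Client-A on pod1, Client-B on pod2; both receive the event
const received: string[] = [];
const pod1 = new Pod();
const pod2 = new Pod();
pod1.subscribe("conn-A", () => received.push("Client-A"));
pod2.subscribe("conn-B", () => received.push("Client-B"));
mongoEmits([pod1, pod2], { projectId: "project-X", version: 6 });
```

No pod needs to know about connections on any other pod, which is why group broadcasts are unnecessary in this design.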
Key Code Paths¶
Hub subscription (NotificationHub.cs:99-136):
```csharp
public async Task SubscribeToProject(Guid projectId)
{
    var connectionId = Context.ConnectionId;
    _projectsSubscriptionManager.SubscribeToEntity(Context.ConnectionId, projectId,
        notification =>
        {
            // This callback fires on the SAME pod that created the subscription
            // (mapping from notification to notificationDto elided in this excerpt)
            _hubContext.Clients.Client(connectionId).ProjectNotification(notificationDto);
        }
    );
}
```
Change stream source (MongoRepositoryBase.cs:87-88):
```csharp
public IObservable<EntityNotification<TAggregateRoot>> GetEntityNotificationStream(Guid id) =>
    _GetEntityNotificationStream(id); // Uses MongoCachedChangeStream
```
Cached change stream (MongoContext.cs:142-249):
```csharp
public IObservable<ChangeStreamDocument<TAggregateRoot>> GetCachedCollectionChangeStream<...>()
{
    return _changeStreamCache.GetOrAdd(() =>
        Observable.Create<ChangeStreamDocument<TAggregateRoot>>(async (obs, ct) =>
        {
            var cursor = await GetCollection<TAggregateRoot, TId>()
                .WatchAsync(pipeline, opts, ct);
            // Each pod opens its own cursor to MongoDB
            while (!ct.IsCancellationRequested)
            {
                await cursor.MoveNextAsync(ct);
                foreach (var doc in cursor.Current)
                    obs.OnNext(doc); // Emits to all subscribers on this pod
            }
        })
        .Publish()
        .RefCount() // Hot observable - shared across all subscriptions on this pod
    );
}
```
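The `.Publish().RefCount()` pattern above means each pod opens at most one MongoDB cursor per collection, shared by all of that pod's subscribers. The reference-counting behavior can be sketched without any Rx dependency; the `RefCountedStream` class below is a hypothetical TypeScript analogue, not SyRF code:

```typescript
// Reference-counted shared resource: the underlying "cursor" opens on the first
// subscriber and closes when the last one unsubscribes - the essence of
// Rx's Publish().RefCount(). All names here are illustrative.
class RefCountedStream {
  private refCount = 0;
  public cursorOpen = false; // stands in for the MongoDB change-stream cursor

  subscribe(): () => void {
    if (this.refCount === 0) this.cursorOpen = true; // first subscriber opens the cursor
    this.refCount++;
    let disposed = false;
    return () => {
      if (disposed) return; // idempotent dispose
      disposed = true;
      this.refCount--;
      if (this.refCount === 0) this.cursorOpen = false; // last one closes it
    };
  }
}

const stream = new RefCountedStream();
const unsubA = stream.subscribe(); // cursor opens
const unsubB = stream.subscribe(); // shared - still one cursor
const openWhileShared = stream.cursorOpen;
unsubA();
const openAfterFirstUnsub = stream.cursorOpen; // still open - B remains
unsubB();
const openAfterLastUnsub = stream.cursorOpen; // closed - no subscribers left
```

This is why the cursor count in the "Resource Implications" section scales with replicas, not with connected clients.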
Scenario Analysis¶
Scenario 1: Project Update Notification¶
| Step | What Happens | Replica Impact |
|---|---|---|
| 1 | User A (Pod-1) and User B (Pod-2) both viewing Project-X | Each has subscription |
| 2 | User A saves a change to Project-X | MongoDB document updated |
| 3 | MongoDB emits change event | Sent to ALL change stream cursors |
| 4 | Pod-1 receives event | Notifies User A via Clients.Client(A) |
| 5 | Pod-2 receives event | Notifies User B via Clients.Client(B) |
| Result | Both users see the update | ✅ Works correctly |
Scenario 2: File Upload Progress¶
| Step | What Happens | Replica Impact |
|---|---|---|
| 1 | User uploads file, connected to Pod-1 | Subscription created on Pod-1 |
| 2 | S3 notifier processes file | Publishes to RabbitMQ |
| 3 | Project-Management service updates MongoDB | Progress field updated |
| 4 | MongoDB emits change event | Pod-1's change stream receives it |
| 5 | Pod-1 notifies user | Via Clients.Client(connectionId) |
| Result | User sees progress updates | ✅ Works correctly |
Scenario 3: Pod Restart¶
| Step | What Happens | Replica Impact |
|---|---|---|
| 1 | Pod-2 restarts (rolling update) | Connections to Pod-2 drop |
| 2 | Clients reconnect | May land on any pod |
| 3 | Client runs `.withAutomaticReconnect()` | SignalR handles reconnection |
| 4 | Client re-subscribes | allReconnected$ triggers reload |
| 5 | New change stream subscriptions created | On whichever pod client landed |
| Result | Brief interruption, then normal | ✅ Self-healing |
Scenario 4: Version Check (UiVersionCheck)¶
| Step | What Happens | Replica Impact |
|---|---|---|
| 1 | Client connects to Pod-1 | OnConnectedAsync fires |
| 2 | Hub sends version check | Clients.Client(Context.ConnectionId) |
| 3 | Target is the SAME connection | On this pod, guaranteed |
| Result | Client receives version check | ✅ Works correctly |
What Would Break (Hypothetical)¶
The following patterns are NOT used but would cause problems:
❌ Group Broadcasting (NOT USED)¶
```csharp
// If this existed, only 1/3 of clients would receive it
await Clients.Group("project-X").SomeNotification(...);
```
❌ User-Targeted Messages (NOT USED)¶
```csharp
// If this existed, would fail if user connected to different pod
await Clients.User(userId).SomeNotification(...);
```
❌ All-Client Broadcast (NOT USED)¶
Verification: Grep for `Clients.(Group|User|All)(` returns no matches in the codebase.
Vestigial Group Code¶
Groups ARE added but never used for broadcasting:
```csharp
// NotificationHub.cs:57-58 - Added on connect
await Groups.AddToGroupAsync(Context.ConnectionId, UserGroupName(userId.ToString()));

// NotificationHub.cs:132 - Added on project subscription
await Groups.AddToGroupAsync(Context.ConnectionId, ProjectGroupName(projectId));
```
These groups add minor overhead but cause no correctness issues. They may have been intended for future use or are remnants of an older design.
Sticky Sessions Analysis¶
Current Configuration¶
```yaml
annotations:
  nginx.ingress.kubernetes.io/affinity: "cookie"
  nginx.ingress.kubernetes.io/session-cookie-name: "route"
  nginx.ingress.kubernetes.io/session-cookie-expires: "172800"  # 2 days
```
Are They Required?¶
No, but they help. Without sticky sessions:
| Aspect | With Sticky Sessions | Without Sticky Sessions |
|---|---|---|
| Initial HTTP handshake | Same pod | Any pod |
| WebSocket upgrade | Same pod | Any pod |
| Subscription creation | Predictable | Works, but subscription on any pod |
| Reconnection | Returns to same pod | May land on different pod |
| Subscription recreation | Avoided if same pod | Always recreated on reconnect |
Recommendation: Keep sticky sessions to reduce subscription churn, but they're not required for correctness.
Resume Token Management¶
Current Implementation¶
InMemoryResumePointRepository.cs:
```csharp
public class InMemoryResumePointRepository : IResumePointRepository
{
    private readonly ConcurrentDictionary<string, ResumePoint> _store = new();
    // ...
}
```
Impact¶
- Each pod stores resume tokens in memory (not shared)
- On pod restart, resume tokens are lost
- Change streams restart from current time (safe, no missed events for new subscriptions)
- This is acceptable because subscriptions are connection-scoped
Resource Implications¶
MongoDB Change Stream Cursors¶
| Configuration | Cursor Count per Collection |
|---|---|
| 1 replica | 1 cursor |
| 3 replicas | 3 cursors |
| 6 replicas | 6 cursors |
MongoDB Atlas handles this well - change streams are designed for multiple consumers.
Memory Per Pod¶
Each pod stores:

- Connection → Subscription mappings (`ConcurrentDictionary`)
- Rx subscriptions (an `IDisposable` for each active subscription)
- Resume tokens (`ConcurrentDictionary`)
Memory usage scales with connections per pod, not total connections.
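The connection-scoped bookkeeping can be sketched as a map from `connectionId` to disposables that is torn down when the connection drops. This is a hypothetical TypeScript analogue of the `ConcurrentDictionary` + `IDisposable` pattern, not the SyRF implementation:

```typescript
// Hypothetical sketch of connection-scoped subscription tracking: memory is
// bounded by the connections on THIS pod, and everything is disposed on disconnect.
type Disposable = () => void;

class SubscriptionManager {
  private byConnection = new Map<string, Disposable[]>();

  add(connectionId: string, dispose: Disposable): void {
    const list = this.byConnection.get(connectionId) ?? [];
    list.push(dispose);
    this.byConnection.set(connectionId, list);
  }

  // Called on disconnect: tear down every Rx subscription for this connection
  disconnect(connectionId: string): void {
    for (const dispose of this.byConnection.get(connectionId) ?? []) dispose();
    this.byConnection.delete(connectionId);
  }

  activeConnections(): number {
    return this.byConnection.size;
  }
}

// Demo: two subscriptions on one connection, both disposed when it drops
const mgr = new SubscriptionManager();
let disposed = 0;
mgr.add("conn-A", () => disposed++);
mgr.add("conn-A", () => disposed++);
mgr.disconnect("conn-A");
```

Because cleanup is keyed by connection, a pod restart or client disconnect cannot leak subscriptions belonging to other pods.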
Potential Improvements (Not Required)¶
1. Remove Unused Group Code¶
The Groups.AddToGroupAsync calls could be removed to reduce overhead.
2. Shared Resume Token Storage¶
Could use Redis/MongoDB for resume tokens to survive pod restarts without briefly re-streaming. Low priority - current behavior is correct.
3. Connection Metrics¶
Add observability for:
- Connections per pod
- Subscriptions per pod
- Change stream lag
Comparison with Redis Backplane¶
| Aspect | SyRF (Change Streams) | Traditional (Redis Backplane) |
|---|---|---|
| Cross-pod messaging | Via MongoDB | Via Redis |
| Additional infrastructure | None | Redis cluster required |
| Latency | Database → All pods | Hub → Redis → All pods |
| Notification trigger | Database change | Explicit broadcast |
| Use case fit | CRUD apps | Chat, live collaboration |
SyRF's pattern is well-suited because:
- All notifications stem from database changes
- No need for arbitrary server-initiated broadcasts
- Avoids Redis infrastructure complexity
Potential Missed Events Analysis¶
While the architecture is replica-safe, there are edge cases where events can be missed. This section analyzes each scenario, its impact, and mitigation options.
Overview¶
| Scenario | Can Miss Events? | Self-Heals? | Time to Heal | Severity |
|---|---|---|---|---|
| Initial load race condition | ⚠️ Yes | Next update/refresh/reconnect | Seconds–minutes | Medium |
| Disconnection period | Yes, but handled | ✅ Auto reload on reconnect | Immediate | Low |
| WebSocket send failure | ⚠️ Yes | Next update/refresh/reconnect | Seconds–minutes | Low |
| Out-of-order delivery | No (filtered) | N/A | N/A | None |
| Pod restart | Yes, but handled | ✅ Client reconnects + reloads | Immediate | Low |
| Change stream cursor failure | Rare | ✅ Retry with backoff | Seconds | Low |
Scenario 1: Initial Load Race Condition ⚠️¶
The Problem
There's a timing gap between when HTTP data loads and when the SignalR subscription is created:
```
Timeline:
─────────────────────────────────────────────────────────────────────>
   │                      │                      │
   │ HTTP Response        │ Another user saves   │ Subscription Created
   │ (Project v5)         │ (Project v6)         │ (watching from now)
   │                      │                      │
   └──────────────────────┴──────────────────────┘
                          │
                   VERSION 6 MISSED!
```
Code Path
1. Route guard calls `loadProject()` → HTTP request (project-guard.service.ts:107)
2. HTTP response returns version N
3. Component renders
4. SignalR service detects `_currentProjectId$` changed (signal-r.service.ts:530-548)
5. Subscription created via `SubscribeToProject()`
Any change between steps 2 and 5 is not delivered.
Impact
- User sees stale data until next update, refresh, or reconnection
- For most SyRF use cases (screening, annotation), this is a minor issue
- For progress updates during file upload, user might miss intermediate progress
Self-Healing Triggers
- Another database change triggers version N+2 notification
- User refreshes the page
- Connection drops and reconnects (triggers full reload)
- User navigates away and back
Mitigation Options
| Option | Complexity | Effectiveness | Recommendation |
|---|---|---|---|
| A. Subscribe-then-Load | Medium | High | ⭐ Recommended |
| B. Version Check After Subscribe | Low | Medium | Good quick fix |
| C. Periodic Version Polling | Low | Medium | Simple fallback |
| D. Accept Current Behavior | None | N/A | Acceptable for SyRF |
Option A: Subscribe-then-Load Pattern (with HTTP version filtering)¶
Reverse the order: create SignalR subscription BEFORE making the HTTP request, and add version filtering to HTTP responses.
```typescript
// Current (problematic):
// 1. HTTP load project (gets v5)
// 2. Subscribe to project (misses v6 that happened between 1 and 2)

// Improved:
// 1. Subscribe to project (captures all changes from this point)
// 2. HTTP load project (gets v5 or v6 depending on timing)
// 3. Version filtering handles any out-of-order delivery
```
Critical Caveat: This pattern only works if HTTP responses also go through version filtering.
Current problem: Version filtering only exists in SignalR path, NOT in HTTP path:
| Path | Version Filtering | Code Location |
|---|---|---|
| SignalR notifications | ✅ Yes | signal-r.service.ts:422-432 |
| HTTP responses | ❌ No | requests.ts:47-55 → reducer spreads directly |
What happens WITHOUT HTTP version filtering:
1. Subscribe created
2. SignalR v6 arrives → version filter passes → store updated to v6
3. HTTP returns v5 → NO filter → dispatches detailLoaded → OVERWRITES with v5!
User ends up with stale v5 even though v6 was correctly received.
Implementation changes required:

- Add version filtering to HTTP response handling (effect or reducer level)
- Modify `projectCanMatchGuard` to subscribe before loading
- Ensure the subscription is created synchronously or awaited
- Add subscription cleanup on navigation away
Example HTTP version filter (in effect):
```typescript
map((projectWithRelatedInvestigatorsDto) => {
  const incomingVersion = projectWithRelatedInvestigatorsDto.project.audit?.version;
  const storedVersion = store.selectSignal(selectProjectVersion(props.projectId))();
  // Only dispatch if incoming is newer or same
  if (storedVersion === undefined || incomingVersion >= storedVersion) {
    return projectDetailActions.detailLoaded({...});
  } else {
    console.warn(`[HTTP] Dropping stale response v${incomingVersion}, have v${storedVersion}`);
    return { type: '[Project] Stale HTTP Response Ignored' };
  }
})
```
Complexity: Medium-High (requires changes to effects and careful testing)
Option B: Version Check After Subscribe¶
After subscription is confirmed, make a lightweight version-check call:
```typescript
async _subscribeToProject(projectId: string) {
  await this._hubConnection.invoke('SubscribeToProject', projectId);

  // New: check if we missed anything between the HTTP load and the subscription
  const serverVersion = await this._hubConnection.invoke('GetProjectVersion', projectId);
  const clientVersion = this._store.selectSignal(selectProjectVersion(projectId))();
  if (serverVersion > clientVersion) {
    this._loadProjectRequest.dispatchRequest.loadProject({ projectId });
  }
}
```
Pros: Simple, explicit check

Cons: Adds an extra round-trip, requires a new hub method
Option C: Periodic Version Polling¶
Extend UiVersionCheck to include entity versions:
```
// Server sends periodically (e.g., every 30 seconds):
{
  minUiVersion: "1.2.3",
  entityVersions: {
    "project:abc123": 42,
    "project:def456": 17
  }
}
// Client compares and reloads stale entities
```
Pros: Catches all gaps, low implementation effort

Cons: Adds periodic load, slight delay in detection
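Client-side, the heartbeat comparison reduces to: reload any entity whose server version is ahead of the locally stored one. A hedged sketch of that check (the `findStaleEntities` helper and its shapes are invented for illustration):

```typescript
// Hypothetical helper: given the server's entityVersions payload and the
// versions currently in the client store, return the entity keys to reload.
function findStaleEntities(
  serverVersions: Record<string, number>,
  localVersions: Map<string, number>
): string[] {
  return Object.entries(serverVersions)
    .filter(([key, serverVersion]) => serverVersion > (localVersions.get(key) ?? 0))
    .map(([key]) => key);
}

// "project:abc123" is stale (server 42 > local 41); "project:def456" is current
const stale = findStaleEntities(
  { "project:abc123": 42, "project:def456": 17 },
  new Map([["project:abc123", 41], ["project:def456", 17]])
);
```

Treating a missing local entry as version 0 makes the check safe for entities the client has subscribed to but not yet loaded.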
Option D: Accept Current Behavior¶
Document the behavior and rely on natural self-healing.
When this is acceptable:
- Users rarely edit the same project simultaneously
- Stale data doesn't cause data loss (optimistic locking on save)
- Most gaps self-heal within seconds
Scenario 2: Disconnection Period ✅ Handled¶
The Problem
During disconnection, change events are not delivered to the client.
Current Mitigation (Already Implemented)
```typescript
this.allReconnected$
  .pipe(
    switchMap(() => this._currentProjectId$),
    withLatestFrom(this._isMemberOfProject$),
    filter(([projectId, isMember]) => !!projectId && !!isMember),
    takeUntil(this._destroy$)
  )
  .subscribe(([projectId]) =>
    this._loadProjectRequest.dispatchRequest.loadProject({ projectId })
  );
```
On reconnection:

- `allReconnected$` fires
- Full project data is reloaded via HTTP
- State is synchronized
Status: ✅ No additional mitigation needed
Scenario 3: WebSocket Send Failure ⚠️¶
The Problem
If the WebSocket connection breaks during Clients.Client(connectionId).ProjectNotification(...), that specific message is lost. SignalR uses fire-and-forget semantics with no application-level acknowledgments.
How It Happens
```
┌─────────────────────────────────────────────────────────────────────────┐
│  Server (API Pod)                      │  Client (Browser)              │
├─────────────────────────────────────────────────────────────────────────┤
│                                        │                                │
│  1. MongoDB emits change event (v6)    │                                │
│            │                           │                                │
│            ▼                           │                                │
│  2. Rx subscription callback fires     │                                │
│            │                           │                                │
│            ▼                           │                                │
│  3. _hubContext.Clients.Client(connId) │                                │
│       .ProjectNotification(dto)        │                                │
│            │                           │                                │
│            ▼                           │                                │
│  4. SignalR serializes message         │                                │
│            │                           │                                │
│            └──────── WebSocket ──────X │  Connection breaks!            │
│                                        │  (network hiccup,              │
│                                        │   WiFi briefly drops,          │
│                                        │   mobile tower switch)         │
│                                        │                                │
│       MESSAGE LOST!                    │  Client never receives         │
│                                        │  anything                      │
│                                        │                                │
│  Server doesn't know it failed ────────│─ Client doesn't know           │
│  (fire-and-forget semantics)           │  it missed anything            │
│                                        │                                │
└─────────────────────────────────────────────────────────────────────────┘
```
Why Fire-and-Forget?
The notification callback doesn't await delivery confirmation (NotificationHub.cs:99-136):
```csharp
_projectsSubscriptionManager.SubscribeToEntity(Context.ConnectionId, projectId,
    notification =>
    {
        // Fire-and-forget - no await, no acknowledgment
        // (mapping from notification to notificationDto elided in this excerpt)
        _hubContext.Clients.Client(connectionId).ProjectNotification(notificationDto);
    }
);
```
When Does This Happen?
| Cause | Likelihood | Duration |
|---|---|---|
| Mobile network handoff (WiFi ↔ cellular) | Medium | 1-5 seconds |
| Brief network congestion | Low | Milliseconds |
| Browser tab backgrounded aggressively | Medium | Varies |
| VPN reconnection | Medium | 1-10 seconds |
Difference from Full Disconnection (Scenario 2)
| Full Disconnection | Send Failure |
|---|---|
| Client knows it disconnected | Client may not know |
| `onclose` / `onreconnected` fires | No event fires |
| `allReconnected$` triggers reload | Nothing triggers |
| Self-heals immediately | Waits for next update |
Impact
- Single notification lost
- User sees stale state until next update
- No error visible to user or server
- No data loss - optimistic locking prevents conflicts (see below)
Self-Healing Triggers
- Next database change delivers newer version
- User refreshes
- Connection fully drops → reconnection → reload
Mitigation Options
| Option | Complexity | Effectiveness | Recommendation |
|---|---|---|---|
| A. Heartbeat with Versions | Low | High | ⭐ Recommended |
| B. Client-side Staleness Detection | Low | Medium | Good supplement |
| C. Application-level ACKs | High | Very High | Overkill for SyRF |
Option A: Heartbeat with Versions (Recommended)¶
Extend the existing UiVersionCheck mechanism to periodically sync versions:
Server-side (new hub method):
```csharp
public async Task SendVersionHeartbeat()
{
    var subscribedProjectIds = _projectsSubscriptionManager.GetSubscribedEntityIds(Context.ConnectionId);
    var versions = subscribedProjectIds.ToDictionary(
        id => id,
        id => _pmUnitOfWork.Projects.Get(id)?.Audit?.Version ?? 0
    );
    await Clients.Caller.VersionHeartbeat(versions);
}
```
Client-side:
```typescript
this.versionHeartbeat$.pipe(
  switchMap(versions => {
    const staleProjects = Object.entries(versions)
      .filter(([id, serverVersion]) => {
        const clientVersion = this._store.selectSignal(selectProjectVersion(id))();
        return serverVersion > (clientVersion ?? 0);
      });
    return from(staleProjects).pipe(
      tap(([projectId]) =>
        this._loadProjectRequest.dispatchRequest.loadProject({ projectId })
      )
    );
  })
).subscribe();
```
Trigger: Could be periodic (every 30s) or on specific actions (focus window, complete operation).
Option B: Client-side Staleness Detection¶
Track when data was last updated and trigger refresh if stale:
```typescript
// In project state
lastUpdated: Date;
lastSignalRNotification: Date;

// Detect potential gaps
const timeSinceLastNotification = Date.now() - lastSignalRNotification.getTime();
if (timeSinceLastNotification > 60000 && isActivelyViewing) {
  // Trigger a lightweight version check or full reload
}
```
Scenario 4: Out-of-Order Delivery ✅ Handled¶
The Problem (Hypothetical)
Network conditions could cause notifications to arrive out of order:
- Version 6 emitted
- Version 7 emitted
- Version 7 arrives at client first
- Version 6 arrives at client second
Current Mitigation (Already Implemented)
Version filtering (signal-r.service.ts:422-432) prevents older versions from overwriting newer:
```typescript
const shouldUpdate =
  storedVersion === undefined ||
  storedVersion <= (updatedVersion ?? Infinity) || ...;

if (!shouldUpdate) {
  console.warn(`[SignalR] Dropping notification - storedVersion=${storedVersion}, updatedVersion=${updatedVersion}`);
}
```
Result: Version 6 is correctly dropped if version 7 was already applied.
Status: ✅ No additional mitigation needed
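The filtering rule can be isolated into a small predicate: apply a notification only when nothing strictly newer is already stored. This is a simplified sketch of the condition shown above (the real `signal-r.service.ts` check includes additional clauses that are elided there):

```typescript
// Simplified version filter: accept a notification unless the store already
// holds a strictly newer version. Mirrors the shouldUpdate condition above,
// minus the elided clauses.
function shouldApply(
  storedVersion: number | undefined,
  updatedVersion: number | undefined
): boolean {
  return storedVersion === undefined || storedVersion <= (updatedVersion ?? Infinity);
}

const applyFirstLoad = shouldApply(undefined, 6); // nothing stored yet -> apply
const applyNewer = shouldApply(6, 7);             // newer version arrives -> apply
const dropStale = shouldApply(7, 6);              // v6 arrives after v7 -> drop
```

Because the predicate is monotone in the stored version, applying notifications in any arrival order converges to the same final state.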
Scenario 5: Pod Restart ✅ Handled¶
The Problem
When a pod restarts:
- All connections to that pod are broken
- In-memory resume tokens are lost
- Change stream starts from "now" on new pod
Why It's Handled
For SignalR clients, this is equivalent to a disconnection:

- Client detects connection loss
- Auto-reconnection kicks in (may land on a different pod)
- `allReconnected$` fires
- Full data reload occurs
Edge Case: Resume Token Loss
The in-memory InMemoryResumePointRepository means each pod loses resume tokens on restart. However:
- New subscriptions start fresh change streams
- Client reload gets current state
- No events are actually "missed" from the client's perspective
Potential Issue: If you added server-side consumers of change streams (beyond SignalR), they would need persistent resume tokens. Currently not applicable.
Status: ✅ No additional mitigation needed for SignalR use case
Scenario 6: Rapid Updates ✅ Handled¶
Concern: Could rapid updates with the same version number cause issues?
Analysis: This can't happen in practice because:
- MongoDB uses optimistic concurrency with version field
- Each successful save increments version atomically
- Concurrent saves with same base version → one fails with concurrency exception
The actual flow:
```
User A: Read v5 → Save → Success (v6)
User B: Read v5 → Save → Concurrency Error (v5 already changed)
User B: Retry → Read v6 → Save → Success (v7)
```
Status: ✅ Not a real concern due to optimistic concurrency
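The version-checked save can be simulated end to end. The `VersionedStore` below is a hypothetical in-memory TypeScript sketch of the optimistic-concurrency flow above, not the SyRF MongoDB code:

```typescript
// Hypothetical in-memory store with optimistic locking: a save succeeds only
// if the caller's base version matches the current one, then increments it.
class VersionedStore {
  private version = 5;
  private name = "My Project";

  read(): { name: string; version: number } {
    return { name: this.name, version: this.version };
  }

  // Throws on a version mismatch, like the 409 / duplicate-key path described above
  save(name: string, baseVersion: number): number {
    if (baseVersion !== this.version) {
      throw new Error(`Conflict: current is v${this.version}, not v${baseVersion}`);
    }
    this.name = name;
    this.version++;
    return this.version;
  }
}

const store = new VersionedStore();
const a = store.read(); // User A reads v5
const b = store.read(); // User B reads v5
const afterA = store.save("A's change", a.version); // v5 -> v6
let bConflicted = false;
try {
  store.save("B's change", b.version); // base v5, current v6 -> conflict
} catch {
  bConflicted = true;
}
const afterB = store.save("B's change", store.read().version); // retry on v6 -> v7
```

User B's stale write is rejected rather than silently overwriting User A's change, which is exactly why missed notifications degrade UX but never lose data.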
Why Missed Notifications Don't Cause Data Loss¶
Even when notifications are missed, optimistic locking prevents data corruption or overwrites. Here's how it works:
How Optimistic Locking Works in SyRF¶
Every document in MongoDB has a version field in its audit property. This version is used for optimistic concurrency control:
```
┌─────────────────────────────────────────────────────────────────────────┐
│                         Optimistic Locking Flow                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  1. Client reads document                                               │
│     GET /api/projects/abc123                                            │
│     Response: { id: "abc123", name: "My Project", audit: { version: 5 }}│
│                                                                         │
│  2. Client makes local edits                                            │
│     User changes name to "Updated Project"                              │
│                                                                         │
│  3. Client saves with version                                           │
│     PATCH /api/projects/abc123                                          │
│     Body: { name: "Updated Project" }                                   │
│     Header: If-Match: 5 (or version in body)                            │
│                                                                         │
│  4. Server checks version                                               │
│     Current DB version == 5?                                            │
│     ├─ YES → Save succeeds, version becomes 6                           │
│     └─ NO  → Reject with 409 Conflict                                   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```
Scenario: User B Misses Notification, Then Saves¶
```
Timeline:
──────────────────────────────────────────────────────────────────────>
   │                   │                     │
   │ User A & B both   │ User A saves        │ User B tries
   │ load project v5   │ (v5 → v6)           │ to save
   │                   │                     │
   │                   │ SignalR notifies    │
   │                   │ User B... but       │
   │                   │ message LOST!       │
   │                   │                     │
   ▼                   ▼                     ▼
User B has v5       User B still         User B sends:      Server rejects:
in browser          shows v5             "Update from v5"   "Current is v6,
                    (stale)                                  not v5 - 409"
```
What happens next:
- User B sees a conflict error in the UI
- Client refreshes to get latest data (v6)
- User B can now see User A's changes
- User B reapplies their changes on top of v6
- User B saves successfully (v6 → v7)
Code Implementation¶
Version field in Audit (Audit.cs:28-34):
```csharp
public void OnSaving(int schemaVersion, string lastAppVersion, Guid? userId = null)
{
    LastModified = DateTime.UtcNow;
    LastModifiedBy = userId;
    LastAppVersion = lastAppVersion;
    Version++; // Increment version on each save
}
```
Server-side version check (MongoExtensions.cs:217-244):
```csharp
public static async Task<ReplaceOneResult> SaveAsync<TAggregateRoot, TId>(
    this IMongoCollection<TAggregateRoot> collection, TAggregateRoot aggregateRoot, ...)
{
    var currentVersion = aggregateRoot.Version; // Capture version before save
    aggregateRoot.OnSaving(schemaVersion, appVersion, userId); // Increments version

    // Filter includes version - won't match if document was modified
    var filter = GetFilter<TAggregateRoot, TId>(aggregateRoot.Id, currentVersion);
    return await collection.ReplaceOneAsync(filter, aggregateRoot,
        new ReplaceOptions { IsUpsert = true });
}
```
Filter with version check (MongoExtensions.cs:417-441):
```csharp
public static FilterDefinition<TAggregateRoot> GetFilter<TAggregateRoot, TId>(TId id, int? currentVersion)
{
    var builder = new FilterDefinitionBuilder<TAggregateRoot>();
    var eqId = builder.Eq(a => a.Id, id);
    var versionIsCurrent = builder.Eq(ag => ag.Audit.Version, currentVersion);

    // Filter requires both ID AND version to match
    return currentVersion == null ? eqId : builder.And(eqId, versionIsCurrent);
}
```
How conflicts are detected: When User B tries to save with version 5 but the document is now at version 6, the filter doesn't match. With IsUpsert = true, MongoDB attempts to insert a new document, but since the _id already exists, it throws a DuplicateKeyException. This bubbles up as an error to the client.
Client-side conflict handling (conceptual - actual implementation varies by endpoint):
```typescript
// When a save fails due to a version conflict
this.projectService.updateProject(projectId, changes).pipe(
  catchError((error) => {
    // Conflict detected - refresh and let the user retry
    this.snackBar.open('Project was modified by another user. Refreshing...', 'OK');
    this.loadProjectRequest.dispatchRequest.loadProject({ projectId });
    return throwError(() => error);
  })
);
```
Why This Matters¶
| Without Optimistic Locking | With Optimistic Locking |
|---|---|
| User B overwrites User A's changes silently | User B gets conflict error |
| Data loss occurs | No data loss |
| Last write wins (bad) | Conflicts are detected (good) |
| Missed notifications = corruption | Missed notifications = slightly worse UX |
The Trade-off¶
Missed SignalR notifications cause:
- ❌ User sees stale data temporarily
- ❌ User might get a conflict error when saving
- ✅ No silent data loss
- ✅ All changes are preserved
- ✅ User can merge changes manually if needed
This is why "eventual consistency" is acceptable for SyRF - the worst case is a minor UX inconvenience, not data corruption.
Mitigation Priority Matrix¶
| Mitigation | Effort | Impact | Priority |
|---|---|---|---|
| Subscribe-then-Load + HTTP version filter (Scenario 1, Option A) | Medium-High | High | P2 |
| Version Heartbeat (Scenario 3, Option A) | Low | Medium | P3 |
| Version Check After Subscribe (Scenario 1, Option B) | Low | Medium | ⭐ P2 |
| Document Current Behavior | Very Low | Low | P4 |
Current Recommendation: The existing implementation provides acceptable consistency for SyRF's use case. The self-healing mechanisms cover most gaps within seconds.
If tighter consistency is needed:
- Quick win: Option B (Version Check After Subscribe) - low effort, catches the gap explicitly
- Comprehensive fix: Option A (Subscribe-then-Load) requires adding HTTP version filtering first, which is more invasive but provides a cleaner architecture
Important Finding: HTTP responses currently bypass version filtering entirely (requests.ts:47-55 → entity.helpers.ts:68). This means HTTP always wins over SignalR in a race, regardless of which has newer data. Consider adding version filtering to HTTP responses as a standalone improvement.
Conclusion¶
The SignalR implementation in SyRF is correctly designed for multi-replica deployment. The MongoDB change stream pattern provides implicit backplane functionality without requiring Redis or Azure SignalR Service.
Key Takeaways¶
- No action required - the current implementation is replica-safe
- Sticky sessions are helpful but not critical - they reduce reconnection overhead
- MongoDB change streams ARE the backplane - each pod independently watches MongoDB
- All notifications use `Clients.Client(connectionId)` - no group or user broadcasts exist
- Groups are vestigial - added but never used, minor overhead only
Risk Assessment¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| MongoDB change stream failure | Low | High | Retry with exponential backoff (implemented) |
| Pod restart during active work | Medium | Low | Auto-reconnection with subscription recreation |
| Future code adds group broadcasts | Low | High | Code review, add this doc to review checklist |