Reputation: 343
I have a system design question that I'm looking for some guidance on. I have two different systems that need to have a basic level of communication. This is abstracted via message queues.
For example, System A is responsible for user account registration. A user accesses System B and enters the user registration flow. After the user submits their details, they are sent to System B's loading/waiting screen, and the client registers a long-polling request with System B, awaiting the completion of their account. On the back end, System B places a message with the registration details onto a message queue. System A consumes this message, creates the account, and then places a message onto a separate queue indicating that registration is complete. System B consumes this message and needs to notify the user that the registration is complete. This is where the problem lies: System B runs a cluster of servers, and there is no guarantee that the server holding the long-polling request is the same server that processes the registration-completed message.
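To make the flow concrete, here is a minimal sketch of the two-queue exchange. The queue names, payload fields, and `create_account` step are all hypothetical, and in-process `queue.Queue` objects stand in for the real broker:

```python
import queue

# Hypothetical queues standing in for the real message broker.
registration_requests = queue.Queue()   # System B -> System A
registration_completed = queue.Queue()  # System A -> System B

def system_b_submit(user_details):
    """System B: the user submitted the form; enqueue a registration request."""
    registration_requests.put({"type": "register", "details": user_details})

def create_account(details):
    # Placeholder for System A's real account-creation logic.
    return "acct-" + details["email"]

def system_a_worker():
    """System A: consume a request, create the account, publish completion."""
    msg = registration_requests.get()
    account_id = create_account(msg["details"])
    registration_completed.put({"type": "registered", "account_id": account_id})

system_b_submit({"email": "alice@example.com"})
system_a_worker()

# Any System B server may consume this message -- not necessarily the one
# holding the user's long-polling request, which is exactly the problem.
done = registration_completed.get()
```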
This seems like it should be a common problem, but I haven't been able to find much discussion on it. I've explored some options, but they all seem to come with some level of downside.
Potential Solutions:

- Keep track of which server the request is on and find a way to route the message to the appropriate server
- When System B consumes the registration completed message, use something like JGroups to broadcast this to all other servers within the cluster
- Each instance dynamically creates its own SQS queue and uses the SNS fan-out pattern to deliver the message to all of these queues, so all servers receive a copy of the message
- Just use standard polling to eliminate the aspect of server-state
Any insight or suggestions on this problem are greatly appreciated. Thank you.
Upvotes: 0
Views: 113
Reputation: 456322
One consideration with any long-lived connections (including both long-polling and WebSockets) is that the server pool needs to be able to recycle. E.g., during a rolling update or server replacement, both client and server apps need to be able to handle losing long-lived connections.
Examples in more detail:
This is typically done via silent retries on long-lived connection loss. In some cases there's a "logical connection" abstraction that can span more than one literal connection over time; Microsoft's SignalR, for example, takes that approach.
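A client-side retry loop of that kind can be sketched as follows. The `transport` callable, its `ConnectionLost` exception, and the fake transport at the bottom are hypothetical stand-ins for whatever HTTP client is actually in use:

```python
import time

class ConnectionLost(Exception):
    """Raised by the transport when the long-lived connection drops."""

def wait_for_event(transport, retry_delay=1.0):
    """Silently re-establish the long-poll until an event arrives.

    `transport` performs one long-poll request and either returns an event,
    returns None on a server-side timeout, or raises ConnectionLost when the
    connection drops (e.g. during a rolling update). The 'logical connection'
    is this loop: it can span many literal connections.
    """
    while True:
        try:
            event = transport()
            if event is not None:
                return event
        except ConnectionLost:
            time.sleep(retry_delay)  # back off, then silently reconnect

# Hypothetical transport: drops twice and times out once before delivering.
attempts = iter([ConnectionLost(), None, ConnectionLost(),
                 {"status": "registered"}])

def fake_transport():
    result = next(attempts)
    if isinstance(result, Exception):
        raise result
    return result

event = wait_for_event(fake_transport, retry_delay=0)
```

From the user's point of view the reconnects are invisible; the loop only returns once an actual event arrives.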
This approach is the only one I would reject:
Keep track of which server the request is on and find a way to route the message to the appropriate server
because it treats servers individually instead of as a cluster.
Personally, I prefer the polling approach in most cases. I have also used fan-out per-server queues, but only for situations where every server actually has something to do for every message (i.e., some state is copied to all servers), and building an API as a single point of truth for those messages is too inefficient. That's a rare situation, and I almost always use polling instead.
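The fan-out shape described above can be sketched as one subscriber queue per server, each receiving its own copy of every message. The plumbing here is hypothetical and in-process; in practice this would be, e.g., SNS fanning out to per-instance SQS queues:

```python
import queue

class FanOut:
    """Deliver a copy of every published message to each subscriber queue."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self):
        """Called by each server on startup; returns that server's own queue."""
        q = queue.Queue()
        self._subscribers.append(q)
        return q

    def publish(self, message):
        for q in self._subscribers:
            q.put(message)  # every server receives its own copy

topic = FanOut()
server_queues = [topic.subscribe() for _ in range(3)]  # a three-server cluster
topic.publish({"event": "registered", "user": "alice"})
copies = [q.get_nowait() for q in server_queues]
```

Note that every server does work for every message here, which is why this only pays off when all servers genuinely need the state.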
Regarding the other solutions:
When System B consumes the registration completed message, use something like JGroups to broadcast this to all other servers within the cluster
Each instance dynamically creates its own SQS queue and uses the SNS fan-out pattern to deliver the message to all of these queues, so all servers receive a copy of the message
I'm not familiar with JGroups, but if the broadcast can be lost, then I wouldn't take the first approach.
With any kind of messaging (both broadcast and per-server queues), there's also a timing concern: suppose the client disconnects (e.g., because its server is taken out of the pool), the message is delivered and handled by all servers, and then the client reconnects. Would the message be lost in that case?
You could handle that case by either caching all messages on all servers for some time (and disallowing client auto-reconnect after a related timeout), or by having System B call System A - in which case you've just implemented the API you need for polling anyway.
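The first option — caching recently handled messages on every server so a reconnecting client can catch up — could be sketched like this. The retention window, message shape, and injectable clock are assumptions:

```python
import time

RETENTION_SECONDS = 60  # clients must auto-reconnect within this window

class RecentMessageCache:
    """Keep recently handled messages so a reconnecting client can replay
    anything delivered while it was between connections."""

    def __init__(self, retention=RETENTION_SECONDS, clock=time.monotonic):
        self._entries = []  # list of (timestamp, message)
        self._retention = retention
        self._clock = clock

    def add(self, message):
        self._evict()
        self._entries.append((self._clock(), message))

    def since(self, last_seen_ts):
        """Messages the client missed after `last_seen_ts`."""
        self._evict()
        return [m for ts, m in self._entries if ts > last_seen_ts]

    def _evict(self):
        cutoff = self._clock() - self._retention
        self._entries = [(ts, m) for ts, m in self._entries if ts >= cutoff]

# Usage with a fake clock (hypothetical timestamps):
now = [100.0]
cache = RecentMessageCache(retention=60, clock=lambda: now[0])
cache.add({"event": "registered"})  # handled at t=100, client was away
now[0] = 120.0
missed = cache.since(90.0)          # client reconnects, last saw t=90
```

A client that has been away longer than the retention window would be refused auto-reconnect and forced into a full status check — which, as noted, is the polling API anyway.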
Just use standard polling to eliminate the aspect of server-state
My preferred solution. Servers (at this level) are a cluster, not individual servers.
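A minimal sketch of the stateless polling flow, assuming a shared store (here a plain dict standing in for System A's database or a status API); the endpoint path and function names are hypothetical:

```python
import time

# Stands in for System A's authoritative store / status API. Any System B
# server can read it, so it doesn't matter which server answers a poll.
registration_status = {}

def start_registration(user_id):
    registration_status[user_id] = "pending"

def complete_registration(user_id):
    # Done by whichever server consumes the completion message -- it only
    # writes shared state, so no routing to a specific server is needed.
    registration_status[user_id] = "registered"

def poll_status(user_id):
    """Handler for e.g. GET /registrations/<user_id>/status on ANY server."""
    return registration_status.get(user_id, "unknown")

def wait_until_registered(user_id, interval=2.0, timeout=60.0):
    """Client-side poll loop; each request may hit a different server."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if poll_status(user_id) == "registered":
            return True
        time.sleep(interval)
    return False

start_registration("alice")
complete_registration("alice")
status = poll_status("alice")
```

Because every server answers from the same shared state, the "which server holds the connection" problem disappears entirely.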
Upvotes: 0