Reputation: 11
I have two applications that need to communicate over a ZMQ pub/sub socket using Python. The publisher only runs for a couple of hours and then shuts down. The subscriber should always be ready and waiting for messages. The system worked well during development, but now that it's in production the subscriber sits for long periods with no messages coming in. I suspect this makes it "go to sleep": it stops accepting new messages from the publisher, and the publisher reports no errors when sending these undelivered messages.
My subscriber is set up as follows:
import zmq

context = zmq.Context()
socket = context.socket(zmq.SUB)
socket.connect("tcp://localhost:10001")
socket.setsockopt_string(zmq.SUBSCRIBE, "")  # subscribe to every topic
while True:
    message = socket.recv_string()  # blocks until a message arrives
    do_something(message)
Messages can arrive at any time and at any interval (typically no less than 0.5 seconds apart), so I don't really want to use a non-blocking recv_string, as missing a message can break the rest of the system. I'm on Windows 7 and I think this has something to do with the idle TCP connection timing out, but I have no insight into this. Ideally, I'd like the connection to never time out, or at least to be able to tell when it has so I can re-establish the socket.
Another option would be to send empty messages to the subscriber from another thread, but there has to be a cleaner way to detect when the socket is suddenly unavailable. I don't want to use a broker, as that complicates my system and removes some of the portability I'm aiming for.
Upvotes: 1
Views: 968
Reputation: 8434
There are socket options ZMQ_RECONNECT_IVL and ZMQ_RECONNECT_IVL_MAX that control reconnection. By default, the socket retries once every 100 ms. What you describe sounds as if you had set these to give an exponentially increasing reconnect interval, but I think you'd have noticed that in your code and mentioned it in your question! So it's something else.
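To rule the reconnect settings out, you could set them explicitly. This is only a minimal sketch with pyzmq: `make_subscriber` is a made-up helper name, and the values shown are simply the documented defaults (100 ms, and 0 meaning no exponential backoff).

```python
import zmq


def make_subscriber(context, endpoint, reconnect_ms=100, reconnect_max_ms=0):
    """Create a SUB socket with explicit reconnect intervals (milliseconds).

    make_subscriber is a hypothetical helper, not part of pyzmq.
    """
    sock = context.socket(zmq.SUB)
    # Set the options before connect() so they apply to this connection.
    sock.setsockopt(zmq.RECONNECT_IVL, reconnect_ms)
    # 0 means "no exponential backoff": only RECONNECT_IVL is used.
    sock.setsockopt(zmq.RECONNECT_IVL_MAX, reconnect_max_ms)
    sock.setsockopt_string(zmq.SUBSCRIBE, "")
    sock.connect(endpoint)
    return sock


if __name__ == "__main__":
    ctx = zmq.Context()
    sub = make_subscriber(ctx, "tcp://localhost:10001")
    print(sub.getsockopt(zmq.RECONNECT_IVL), sub.getsockopt(zmq.RECONNECT_IVL_MAX))
    sub.close(linger=0)
    ctx.term()
```

If the values read back match what you expect, the reconnect interval is unlikely to be the cause.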
There is zmq_socket_monitor, and pyzmq exposes it as socket.get_monitor_socket(). Set this up in your subscriber with the event mask covering all events, and log the events that happen within the socket. Your code would need to repeatedly call zmq_poll on both the sub socket and the pair socket you create to work with the monitor, and read whichever socket becomes ready. This lets you see the internal finite state machine of the socket as it connects, disconnects, and so on, while still receiving messages from the sub socket.
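In Python that could look roughly like the sketch below. It assumes pyzmq, whose `get_monitor_socket()` and `zmq.utils.monitor.recv_monitor_message()` wrap the monitor API; `run_subscriber`, `event_names`, and the callback parameters are names invented for this example, and the endpoint is the one from the question.

```python
import zmq
from zmq.utils.monitor import recv_monitor_message


def event_names():
    """Map numeric event codes to names such as 'EVENT_CONNECT_RETRIED'."""
    names = {}
    for name in dir(zmq):
        if name.startswith("EVENT_") and not name.startswith("EVENT_ALL"):
            names[int(getattr(zmq, name))] = name
    return names


def run_subscriber(endpoint, on_message, on_event,
                   should_stop=lambda: False, poll_ms=1000):
    """Poll a SUB socket and its monitor socket until should_stop() is true."""
    context = zmq.Context.instance()
    sub = context.socket(zmq.SUB)
    # With no arguments the monitor reports all events on a PAIR socket.
    monitor = sub.get_monitor_socket()
    sub.setsockopt_string(zmq.SUBSCRIBE, "")
    sub.connect(endpoint)

    poller = zmq.Poller()
    poller.register(sub, zmq.POLLIN)
    poller.register(monitor, zmq.POLLIN)
    names = event_names()
    try:
        while not should_stop():
            # Wake up at least every poll_ms to re-check should_stop().
            ready = dict(poller.poll(poll_ms))
            if monitor in ready:
                evt = recv_monitor_message(monitor)
                code = int(evt["event"])
                on_event(names.get(code, str(code)), evt)
            if sub in ready:
                on_message(sub.recv_string())
    finally:
        sub.disable_monitor()
        monitor.close(linger=0)
        sub.close(linger=0)


if __name__ == "__main__":
    run_subscriber(
        "tcp://localhost:10001",
        on_message=print,  # stand-in for do_something(message)
        on_event=lambda name, evt: print(name, evt["endpoint"]),
    )
```

The receive is still blocking in effect: the loop only reads a socket once the poller says it has data, so no message is skipped.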
You could then see if your sub socket is repeatedly going round a loop of ZMQ_EVENT_CONNECT_RETRIED and ZMQ_EVENT_CONNECT_DELAYED; I think (I don't know for sure) that this is what you should be seeing in your subscriber, in the situation you have. If so, that'd confirm that the subscriber is attempting to connect, getting rebuffed, trying again after some delay, and the fault may well indeed lie somewhere deep down in the OS / network.
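If the events do point at the OS or network, one thing you could try is enabling TCP keepalive on the subscriber, since you suspect the idle connection is being dropped. This is a sketch only: `enable_keepalive` is a made-up helper, the timing values are arbitrary guesses rather than recommendations, and how far the platform honours each option (the probe count in particular) is something I'm assuming rather than have checked on Windows 7.

```python
import zmq


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive for a socket; call this before connect().

    enable_keepalive is a hypothetical helper; the defaults are illustrative.
    """
    sock.setsockopt(zmq.TCP_KEEPALIVE, 1)               # 1 = enabled
    sock.setsockopt(zmq.TCP_KEEPALIVE_IDLE, idle_s)     # idle seconds before first probe
    sock.setsockopt(zmq.TCP_KEEPALIVE_INTVL, interval_s)  # seconds between probes
    sock.setsockopt(zmq.TCP_KEEPALIVE_CNT, probes)      # failed probes before giving up
    return sock


if __name__ == "__main__":
    ctx = zmq.Context()
    sub = enable_keepalive(ctx.socket(zmq.SUB))
    sub.setsockopt_string(zmq.SUBSCRIBE, "")
    sub.connect("tcp://localhost:10001")
    sub.close(linger=0)
    ctx.term()
```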
If you don't see such events, I'd guess that points to some issue inside ZMQ itself.
Upvotes: 1