Hot.PxL

Reputation: 1980

How does ZeroMQ connect and bind work internally

I am experimenting with ZeroMQ, and I found it really interesting that it does not matter whether connect or bind happens first. I tried looking into the ZeroMQ source code, but it was too big for me to find anything.

The code is as follows.

# client side
import zmq
ctx = zmq.Context()
socket = ctx.socket(zmq.PAIR)
socket.connect('tcp://*:2345') # line [1]
# make it wait here

# server side
import zmq
ctx = zmq.Context()
socket = ctx.socket(zmq.PAIR)
socket.bind('tcp://localhost:2345')
# make it wait here

If I start the client side first, before the server has been started, the code magically does not block at line [1]. At this point, I checked with ss and made sure that the client is not listening on any port, nor does it have any open connection. Then I start the server. Now the server is listening on port 2345, and magically the client is connected to it. My question is: how does the client know the server is now online?

Upvotes: 3

Views: 1890

Answers (3)

JSON

Reputation: 1835

When you call socket.connect('tcp://*:2345') or socket.bind('tcp://localhost:2345') you are not calling these methods directly on an underlying TCP socket. All of ZMQ's IO - including connecting/binding underlying TCP sockets - happens in threads that are abstracted away from the user.

When these methods are called on a ZMQ socket, the request is essentially queued as an event for the IO threads. Once the IO threads begin to process it, they will not return an error unless the request is truly impossible. Otherwise they will keep attempting to connect or reconnect.

This means that a call on a ZMQ socket may return without an error even if socket.connect has not succeeded. In your example, the first connection attempt would likely fail silently, then be retried and succeed once you run the server side of the script.

It may also allow you to send messages while in this state (depending on the state of the queue rather than the state of the network) and will then attempt to transmit the queued messages once the IO threads are able to connect. The same applies if a working TCP connection is later lost: the queues may continue to accept messages for the unconnected socket while the IO threads try to restore the connection in the background. If the endpoint takes a while to come back online, it should still receive its messages.

To better explain, here's another example:

<?php


$pid = pcntl_fork();


if($pid)
{
    $context = new ZMQContext();

    $client = new ZMQSocket($context, ZMQ::SOCKET_REQ);
    
    try
    {
        $client->connect("tcp://0.0.0.0:9000");
    
    }catch (ZMQSocketException $e)
    {
        var_dump($e);
    }
    
    
    $client->send("request");
    $msg = $client->recv();

    var_dump($msg);

}else
{
    // in spawned process
    echo "waiting 2 seconds\n";
    sleep(2);

    $context = new ZMQContext();
    
    $server = new ZMQSocket($context, ZMQ::SOCKET_REP);

    try
    {
        $server->bind("tcp://0.0.0.0:9000");
    
    }catch (ZMQSocketException $e)
    {
        var_dump($e);
    }

    $msg = $server->recv();
    $server->send("response");

    var_dump($msg);
}

The binding process does not bind until 2 seconds after the connecting process has connected. But once the child process wakes up and binds successfully, the req/rep transaction takes place without error.

jason@jason-VirtualBox:~/php-dev$ php play.php 
waiting 2 seconds
string(7) "request"
string(8) "response"

If I were to replace tcp://0.0.0.0:9000 on the binding socket with tcp://0.0.0.0:2345, the script would hang, because the client is still trying to connect to tcp://0.0.0.0:9000, yet there would still be no error.

But if I replace both with tcp://localhost:2345, I get an error on my system, because it can't bind on localhost, which makes the call truly impossible.

object(ZMQSocketException)#3 (7) {
  ["message":protected]=>
  string(38) "Failed to bind the ZMQ: No such device"
  ["string":"Exception":private]=>
  string(0) ""
  ["code":protected]=>
  int(19)
  ["file":protected]=>
  string(28) "/home/jason/php-dev/play.php"
  ["line":protected]=>
  int(40)
  ["trace":"Exception":private]=>
  array(1) {
    [0]=>
    array(6) {
      ["file"]=>
      string(28) "/home/jason/php-dev/play.php"
      ["line"]=>
      int(40)
      ["function"]=>
      string(4) "bind"
      ["class"]=>
      string(9) "ZMQSocket"
      ["type"]=>
      string(2) "->"
      ["args"]=>
      array(1) {
        [0]=>
        string(20) "tcp://localhost:2345"
      }
    }
  }
  ["previous":"Exception":private]=>
  NULL
}

If you need real-time information about the state of the underlying sockets, you should look into socket monitors. Using socket monitors along with ZMQ poll allows you to poll for both socket events and queue events.

Keep in mind that polling a monitor socket using ZMQ poll is not the same as polling a ZMQ_FD resource via select, epoll, etc. The ZMQ_FD is edge triggered and therefore doesn't behave the way you would expect when polling network resources, whereas a monitor socket within ZMQ poll is level triggered. Also, monitor sockets are very lightweight, and the latency between the system event and the resulting monitor event is typically sub-microsecond.
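As a rough illustration of the monitor approach in Python, here is a minimal sketch that assumes pyzmq is installed. It connects a PAIR socket before anything is bound, records the monitor events, then binds and waits for the connected event. The helpers free_tcp_port and collect, and the 50 ms reconnect interval, are made up for this demo and are not part of ZeroMQ.

```python
import socket
import time

import zmq
from zmq.utils.monitor import recv_monitor_message


def free_tcp_port():
    """Hypothetical helper: ask the OS for an unused port (small race, fine for a demo)."""
    with socket.socket() as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


def collect(monitor, seen, seconds, stop_on=None):
    """Hypothetical helper: record monitor events for up to `seconds`."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        if monitor.poll(50):  # wait up to 50 ms for the next event
            event = recv_monitor_message(monitor)["event"]
            seen.append(event)
            if event == stop_on:
                return


def connect_then_bind():
    endpoint = f"tcp://127.0.0.1:{free_tcp_port()}"
    ctx = zmq.Context()
    client = ctx.socket(zmq.PAIR)
    client.setsockopt(zmq.LINGER, 0)
    client.setsockopt(zmq.RECONNECT_IVL, 50)  # assumed value: retry every 50 ms
    monitor = client.get_monitor_socket()     # PAIR socket that delivers events
    seen = []

    client.connect(endpoint)    # returns at once, nobody is listening yet
    collect(monitor, seen, 0.3)  # failed attempts show up here as events

    server = ctx.socket(zmq.PAIR)
    server.setsockopt(zmq.LINGER, 0)
    server.bind(endpoint)        # now a background retry can succeed
    collect(monitor, seen, 5, stop_on=zmq.EVENT_CONNECTED)

    client.disable_monitor()
    monitor.close()
    client.close()
    server.close()
    ctx.term()
    return seen
```

Calling connect_then_bind() returns the list of event codes in arrival order, which can be compared against constants such as zmq.EVENT_CONNECT_DELAYED, zmq.EVENT_CONNECT_RETRIED and zmq.EVENT_CONNECTED.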

Upvotes: 0

pktiuk

Reputation: 270

I think the best answer is in the ZeroMQ wiki:

When should I use bind and when connect?

As very general advice: use bind on the most stable points in your architecture and connect from the more volatile endpoints. For request/reply the service provider might be the point where you bind and the client uses connect. Like plain old TCP.

If you can't figure out which parts are more stable (i.e. peer-to-peer) think about a stable device in the middle, where both sides can connect to.

The question of bind or connect is often overemphasized. It's really just a matter of what the endpoints do and if they live long — or not. And this depends on your architecture. So build your architecture to fit your problem, not to fit the tool.

And

Why do I see different behavior when I bind a socket versus connect a socket?

ZeroMQ creates queues per underlying connection, e.g. if your socket is connected to 3 peer sockets there are 3 message queues.

With bind, you allow peers to connect to you, thus you don't know how many peers there will be in the future and you cannot create the queues in advance. Instead, queues are created as individual peers connect to the bound socket.

With connect, ZeroMQ knows that there's going to be at least a single peer and thus it can create a single queue immediately. This applies to all socket types except ROUTER, where queues are only created after the peer we connect to has acknowledged our connection.

Consequently, when sending a message to bound socket with no peers, or a ROUTER with no live connections, there's no queue to store the message to.
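The rule quoted above can be pictured with a toy model in Python. This is only a sketch of the bookkeeping, not ZeroMQ's API or implementation: the ToySocket class and its method names are hypothetical, the ROUTER exception is not modelled, and send simply reports whether there was a queue to store the message in.

```python
from collections import deque


class ToySocket:
    """Hypothetical model of per-connection queues (not real ZeroMQ code)."""

    def __init__(self):
        self.queues = {}  # peer or endpoint name -> pending messages

    def connect(self, endpoint):
        # connect: at least one peer is expected, so its queue exists at once
        self.queues[endpoint] = deque()

    def bind(self, endpoint):
        # bind: the number of future peers is unknown, so no queue yet
        pass

    def peer_connected(self, peer):
        # a peer reached our bound endpoint: its queue is created now
        self.queues.setdefault(peer, deque())

    def send(self, message):
        if not self.queues:
            return False  # no queue to store the message in
        # simplification: always use the first queue
        next(iter(self.queues.values())).append(message)
        return True


connector = ToySocket()
connector.connect("tcp://127.0.0.1:2345")
binder = ToySocket()
binder.bind("tcp://127.0.0.1:2345")
```

In this model connector.send("hello") is stored even though no peer is online, while binder.send("hello") has nowhere to go until binder.peer_connected("peer-1") has been called.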

Upvotes: 1

Jason

Reputation: 13766

The best place to ask your question is the ZMQ mailing list, as many of the developers (and founders!) of the library are active there and can answer your question directly, but I'll give it a try. I'll admit that I'm not a C developer so my understanding of the source is limited, but here's what I gather, mostly from src/tcp_connector.cpp (other transports are covered in their respective files and may behave differently).

Line 214 starts the open() method, and this looks to be the meat of what's going on.

To answer your question about why the code is not blocked at line [1], see line 258. It specifically calls a method to make the socket behave asynchronously. For specifics on how unblock_socket() works you'll have to talk to someone more versed in C; it's defined here.

On line 278, it attempts to make the connection to the remote peer. If it's successful immediately, you're good, the bound socket was there and we've connected. If it wasn't, on line 294 it sets the error code to EINPROGRESS and fails.

To see what happens then, we go back to the start_connecting() method on line 161. This is where the open() method is called from, and where the EINPROGRESS error is used. My best understanding of what's happening here is that if at first it does not succeed, it tries again, asynchronously, until it finds its peer.
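The pattern described above (a non-blocking socket, a connect attempt that reports EINPROGRESS, and a retry until the peer appears) can be sketched with plain Python sockets. This is only an illustration of the general technique, not a translation of tcp_connector.cpp: the function names, the 50 ms retry interval and the attempt limit are assumptions made for the example.

```python
import errno
import select
import socket
import time


def attempt_connect(host, port, wait=0.2):
    """Make one non-blocking connect attempt; return the socket or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)  # plays the role of unblock_socket()
    rc = sock.connect_ex((host, port))  # returns an errno instead of raising
    if rc in (errno.EINPROGRESS, errno.EWOULDBLOCK):
        # Handshake still pending: wait until the socket is writable,
        # then read SO_ERROR to learn whether the attempt succeeded.
        _, writable, _ = select.select([], [sock], [], wait)
        if writable:
            rc = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        else:
            rc = errno.ETIMEDOUT
    if rc == 0:
        return sock
    sock.close()
    return None


def connect_with_retry(host, port, interval=0.05, max_attempts=200):
    """Retry until a listener appears; return (socket, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        sock = attempt_connect(host, port)
        if sock is not None:
            return sock, attempt
        time.sleep(interval)  # assumed fixed delay between attempts
    raise ConnectionError(f"no listener on {host}:{port}")
```

If connect_with_retry("127.0.0.1", 2345) is started before any server is listening, it keeps failing quietly and returns a connected socket shortly after a listener binds to that port. ZeroMQ does the equivalent work inside its IO threads, which is why the caller of connect never blocks.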

Upvotes: 4
