Boost Asio async_read sometimes hangs while reading but not always

Question

I am implementing a small distributed system that consists N machines. Each of them receives some data from some remote server and then propagates the data to other n-1 fellow machines. I am using the Boost Asio async_read and async_write to implement this. I set up a test cluster of N=30 machines. When I tried smaller datesets (receiving 75KB to 750KB per machine), the program always worked. But when I moved on to just a slightly larger dataset (7.5MB), I observed strange behavior: at the beginning, reads and writes happened as expected, but after a while, some machines hanged while others finished, the number of machines that hanged varied with each run. I tried to print out some messages in each handler and found that for those machines that hanged, async_read basically could not successfully read after a while, therefore nothing could proceed afterwards. I checked the remote servers, and they all finished writing. I have tried out using strand to control the order of execution of async reads and writes, and I also tried using different io_services for read and write. None of them solved the problem. I am pretty desperate. Can anyone help me?

Here is the code for the class that does the read and propagation:

const int TRANS_TUPLE_SIZE=15;
const int TRANS_BUFFER_SIZE=5120/TRANS_TUPLE_SIZE*TRANS_TUPLE_SIZE;
class Asio_Trans_Broadcaster
{
private:
   char buffer[TRANS_BUFFER_SIZE];
   int node_id;
   int mpi_size;
   int mpi_rank;
   boost::asio::ip::tcp::socket* dbsocket;
   boost::asio::ip::tcp::socket** sender_sockets;
   int n_send;
   boost::mutex mutex;
   bool done;
public:
   Asio_Trans_Broadcaster(boost::asio::ip::tcp::socket* dbskt, boost::asio::ip::tcp::socket** senderskts,
        int msize, int mrank, int id)
{
    dbsocket=dbskt;
    count=0;
    node_id=id;
    mpi_size=mpi_rank=-1;
    sender_sockets=senderskts;
    mpi_size=msize;
    mpi_rank=mrank;
    n_send=-1;
    done=false;
}

static std::size_t completion_condition(const boost::system::error_code& error, std::size_t bytes_transferred)
{
    int remain=bytes_transferred%TRANS_TUPLE_SIZE;
    if(remain==0 && bytes_transferred>0)
        return 0;
    else
        return TRANS_BUFFER_SIZE-bytes_transferred;
}


void write_handler(const boost::system::error_code &ec, std::size_t bytes_transferred)
{
    int n=-1;
    mutex.lock();
    n_send--;
    n=n_send;
    mutex.unlock();
    fprintf(stdout, "~~~~~~ @%d, write_handler: %d bytes, copies_to_send: %d
",
                                    node_id, bytes_transferred, n);
    if(n==0 && !done)
        boost::asio::async_read(*dbsocket,
            boost::asio::buffer(buffer, TRANS_BUFFER_SIZE),
            Asio_Trans_Broadcaster::completion_condition, boost::bind(&Asio_Trans_Broadcaster::broadcast_handler, this,
            boost::asio::placeholders::error,
            boost::asio::placeholders::bytes_transferred));
}

void broadcast_handler(const boost::system::error_code &ec, std::size_t bytes_transferred)
{
    fprintf(stdout, "@%d, broadcast_handler: %d bytes, mpi_size:%d, mpi_rank: %d
", node_id, bytes_transferred, mpi_size, mpi_rank);
    if (!ec)
    {
        int pos=0;
        while(pos



Here is the main code running on each machine:

int N=30;
boost::asio::io_service* sender_io_service=new boost::asio::io_service();
boost::asio::io_service::work* p_work=new boost::asio::io_service::work(*sender_io_service);
boost::thread_group send_thread_pool;
for(int i=0; i p_work2(new boost::asio::io_service::work(*receiver_io_service));
boost::thread_group thread_pool2;
thread_pool2.create_thread( boost::bind( & boost::asio::io_service::run, receiver_io_service) );

boost::asio::ip::tcp::socket* receiver_socket;
    //establish nonblocking connection with remote server
AsioConnectToRemote(5000, 1, receiver_io_service, receiver_socket, true);

boost::asio::ip::tcp::socket* send_sockets[N];
    //establish blocking connection with other machines
hadoopNodes = SetupAsioConnectionsWIthOthers(sender_io_service, send_sockets, hostFileName, mpi_rank, mpi_size, 3000, false);

Asio_Trans_Broadcaster* db_receiver=new Asio_Trans_Broadcaster(receiver_socket, send_sockets,
mpi_size,  mpi_rank, mpi_rank);

db_receiver->broadcast();
  p_work2.reset();
  thread_pool2.join_all();
  delete p_work;
send_thread_pool.join_all();

Boost Asio async_read sometimes hangs while reading but not always

Answers (1)

Related Questions