Sergey Kolesnik

Reputation: 3630

ASIO: getting deadlock when using several strands and threads with one io_context

I am compiling the latest version of ASIO with MinGW-w64 on Windows.

I have sandbox code for accepting TCP connections. I use one io_context, a strand each for the acceptor and the socket, and 2 threads (I have read in the documentation that posting to two different strands does not guarantee concurrent invocation). For some reason I get a deadlock at the end of execution and I don't know why it happens.

The deadlock does not happen if I post the cancel onto the acceptor's strand, and it also does not happen if I close the acceptor (see the commented-out lines in stop() below).

I have not found anything in the documentation that explains this behavior, and I don't want to ignore it since it might lead to unpredictable problems.

Here is my sandbox code:


#include <asio.hpp>
#include <iostream>
#include <sstream>
#include <functional>
#include <future>   // std::async, std::future
#include <thread>   // std::this_thread
#include <cstdio>   // setvbuf

constexpr const char localhost[] = "127.0.0.1";
constexpr unsigned short port = 12000;

void runContext(asio::io_context &io_context)
{
    std::string threadId{};
    std::stringstream ss;
    ss << std::this_thread::get_id();
    ss >> threadId;
    std::cout << std::string("New thread for asio context: ")
                 + threadId + "\n";
    std::cout.flush();

    io_context.run();

    std::cout << std::string("Stopping thread: ")
                 + threadId + "\n";
    std::cout.flush();
}

class server
{
public:

    template<typename Executor>
    explicit server(Executor &executor)
            : acceptor_(executor)
    {
        using asio::ip::tcp;
        auto endpoint = tcp::endpoint(asio::ip::make_address_v4(localhost),
                                      port);
        acceptor_.open(endpoint.protocol());
        acceptor_.set_option(tcp::acceptor::reuse_address(true));
        acceptor_.bind(endpoint);
        acceptor_.listen();
    }

    void startAccepting()
    {
        acceptor_.async_accept(
                [this](const asio::error_code &errorCode,
                       asio::ip::tcp::socket peer)
                {
                    if (!errorCode)
                    {
                        startAccepting();
//                        std::cout << "Connection accepted\n";
                    }
                    if (errorCode == asio::error::operation_aborted)
                    {
//                        std::cout << "Stopped accepting connections\n";
                        return;
                    }
                });
    }

    void startRejecting()
    {
        // TODO: how to reject?
    }

    void stop()
    {
        // asio::post(acceptor_.get_executor(), [this](){acceptor_.cancel();}); // this line fixes deadlock
        acceptor_.cancel();
        // acceptor_.close(); // this line also fixes deadlock
    }

private:
    asio::ip::tcp::acceptor acceptor_;
};

int main()
{
    setvbuf(stdout, NULL, _IONBF, 0);
    asio::io_context context;

    // run server
    auto serverStrand = asio::make_strand(context);
    server server{serverStrand};
    server.startAccepting();

    // run client
    auto clientStrand = asio::make_strand(context);
    asio::ip::tcp::socket socket{clientStrand};

    size_t attempts = 1;
    auto endpoint = asio::ip::tcp::endpoint(
            asio::ip::make_address_v4(localhost), port);

    std::future<void> res = socket.async_connect(endpoint, asio::use_future);

    std::future<void> runningContexts[] = {
            std::async(std::launch::async, runContext, std::ref(context)),
            std::async(std::launch::async, runContext, std::ref(context))
    };

    res.get();
    server.stop();
    std::cout << "Server has been requested to stop" << std::endl;

    return 0;
}

UPDATE
According to sehe's answer, I am getting a deadlock because, when server.stop() is invoked, the completion handler for a successful accept has already been posted but, due to the cancellation, is never invoked, which leaves the context with pending work and hence the deadlock at the end (if I understood correctly). The things I still don't understand are:

  1. There is a separate strand for the server which (according to the specification) ensures that the acceptor's handlers are invoked non-concurrently and in FIFO order. Handlers with no associated executor are also supposed to run on that strand. The documentation says nothing about the thread safety of the acceptor::cancel() method (only that distinct acceptor objects are safe), so I assumed it is thread safe (no data races possible within one strand). @sehe's code does not deadlock when the cancel is explicitly posted onto the acceptor's strand via asio::post. For 500 invocations there were no deadlocks:
test 499
Awaiting client
New thread 3
New thread 2
Completed client
Server stopped
Accept: 127.0.0.1:14475
Accept: The I/O operation has been aborted because of either a thread exit or an application request.
Stopping thread: 2
Stopping thread: 3
Everyting shutdown

However, if I remove the printing code before the synchronization and stop() (the printing introduces a delay), a deadlock is easily reproduced:

PS C:\dev\builds\asio_connection_logic\Release-MinGW-w64\bin> for ($i = 0; $i -lt 500; $i++){
>> Write-Output "
>> test $i"
>> .\sb.sf_example.exe}

test 0
New thread 2
New thread 3
Server stopped
Accept: 127.0.0.1:15160
PS C:\dev\builds\asio_connection_logic\Release-MinGW-w64\bin>
PS C:\dev\builds\asio_connection_logic\Release-MinGW-w64\bin> for ($i = 0; $i -lt 500; $i++){
>> Write-Output "
>> test $i"
>> .\sb.sf_example.exe}

test 0
New thread 2New thread 3

Server stopped
Accept: 127.0.0.1:15174
PS C:\dev\builds\asio_connection_logic\Release-MinGW-w64\bin> ^C

So, the conclusion is that no matter how you invoke acceptor.cancel(), you will get a deadlock.

  2. Is there even a way to avoid the deadlock for the acceptor?

Upvotes: 1

Views: 1037

Answers (1)

sehe

Reputation: 392833

I did some liposuction on the code and added some tracing:

Live On Wandbox

#include <boost/asio.hpp>
#include <iostream>
#include <functional>
#include <future>   // std::future from asio::use_future
#include <thread>   // std::thread

constexpr unsigned short port = 12000;
namespace asio = boost::asio;
using boost::system::error_code;
using asio::ip::tcp;

void runContext(asio::io_context& io_context) {
    std::cout << "New thread " << std::this_thread::get_id() << std::endl;
    io_context.run();
    std::cout << "Stopping thread: " << std::this_thread::get_id() << std::endl;
}

class server {
  public:
    template <typename Executor>
    explicit server(Executor executor) : acceptor_(executor, {{}, port}) {
        acceptor_.set_option(tcp::acceptor::reuse_address(true));
    }

    void startAccepting() {
        acceptor_.listen();
        acceptLoop();
    }

    void stop()
    {
        //asio::post(acceptor_.get_executor(), [this]() {
            //acceptor_.cancel();
        //}); // this line fixes deadlock
        acceptor_.cancel();
        // acceptor_.close(); // this line also fixes deadlock
    }

  private:
    void acceptLoop() {
        acceptor_.async_accept([this](error_code errorCode, tcp::socket peer) {
            if (!errorCode) {
                std::cout << "Accept: " << peer.remote_endpoint() << std::endl;
                acceptLoop();
            } else {
                std::cout << "Accept: " << errorCode.message() << std::endl;
            }
        });
    }

    tcp::acceptor acceptor_;
};

int main() {
    setvbuf(stdout, NULL, _IONBF, 0);
    asio::io_context context;

    // run server
    server server{make_strand(context)};
    server.startAccepting();

    // run client
    tcp::socket socket{make_strand(context)};
    std::future<void> res = socket.async_connect({ {}, port}, asio::use_future);

    std::thread t1(runContext, std::ref(context));
    std::thread t2(runContext, std::ref(context));

    std::cout << "Awaiting client " << std::endl;

    res.get();

    std::cout << "Completed client" << std::endl;

    server.stop();

    std::cout << "Server stopped" << std::endl;

    t1.join();
    t2.join();
    std::cout << "Everyting shutdown" << std::endl;
}

As you can see a "correct" run prints:

Awaiting client 
New thread 140712013797120
Accept: New thread 140712005404416
127.0.0.1:57500
Completed client
Server stopped
Accept: Operation canceled
Stopping thread: 140712013797120
Stopping thread: 140712005404416
Everyting shutdown

However, an "incorrect run" prints:

New thread 140544269350656
Awaiting client 
New thread 140544260957952
Completed client
Server stopped
Accept: 127.0.0.1:48580
^C

The key is here:

Server stopped
Accept: 127.0.0.1:48580

The cancel comes before the accept, which means there was a race: the completion handler for the async_accept was already in flight with a non-failed errorCode. (In other words, async_connect returned a bit earlier than the server was able to handle its async_accept completion.)

Indeed, posting on the strand is one way to fix it: any handlers already in flight will run before the cancellation, and if the async operation is still pending it will be cancelled.
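
For reference, the posted variant of stop() would look roughly like this (it is simply the commented-out line from the listings above, enabled; asio::post serializes the cancel behind any accept handler that is already queued on the strand, which at least removes the data race on the acceptor):

void stop() {
    // run the cancel on the acceptor's strand so it cannot touch
    // acceptor_ concurrently with an accept completion handler
    asio::post(acceptor_.get_executor(), [this] { acceptor_.cancel(); });
}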

NOTE: The other approach, with acceptor_.close(), invokes undefined behavior because of the data race on acceptor_ itself (the acceptor object is not thread-safe).


Pet Peeve

One "problem" is std::launch::async. I don't use it. I think the behaviour of it is implementation-defined to an extent that makes it not very useful. Perhaps, use std::thread instead, since that's what you're after, here. On recent boost, use asio::thread_pool(2).

The answers here shed some light: Why should I use std::async?
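
A minimal sketch of that suggestion, assuming the server class from the listing above and Boost.Asio's asio::thread_pool (the pool owns its worker threads, so the hand-rolled runContext/std::thread pair goes away):

asio::thread_pool pool(2);                            // two worker threads run the handlers

server server{make_strand(pool.get_executor())};      // same server class as above
server.startAccepting();

tcp::socket socket{make_strand(pool.get_executor())};
std::future<void> res = socket.async_connect({{}, port}, asio::use_future);

res.get();
server.stop();
pool.join();                                          // replaces t1.join()/t2.join()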

UPDATE TO EDITED QUESTION

You're right, it has the same race condition, although without the UB, so that's nice.

Sidenote: you should probably stop calling it a "deadlock", because there is none. It's a soft-lock: you're just waiting for something that never happens (io_context::run() returning). Deadlock is when multiple parties contend for locks in a way that can never be satisfied. This is just a soft-lock in the sense that an incoming connection, or even a network failure, would allow the system to proceed.

So I also stripped the output but added BOOST_ASIO_ENABLE_HANDLER_TRACKING. The resulting picture confirms the exact explanation above:

(image: Boost.Asio handler tracking output)

From here the only obvious solutions would seem to be:

  1. don't cancel the server from the client strand. By extension this would mean either:

    • use the same strand for client and server (not feasible in inter-process comms), or
    • signal closure inside a connection, so that the message is received on the server strand anyway. This is the "make the shutdown command part of the protocol" way.
  2. post a close instead of a cancel; this actively makes any subsequent async_accept an error

  3. alternatively, have a state machine in the server that checks whether we should still be accepting before initiating a new async_accept (a minimal sketch follows below)
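
A minimal sketch of option 3, building on the server class above. The accepting_ flag is an illustration (not part of the original code); it is only ever touched on the acceptor's strand, so whichever order the posted stop and the accept completion run in, the loop is not re-armed once we have decided to stop:

void stop() {
    asio::post(acceptor_.get_executor(), [this] {
        accepting_ = false;  // flip the state on the strand
        acceptor_.cancel();  // and cancel whatever is still pending
    });
}

void acceptLoop() {
    acceptor_.async_accept([this](error_code ec, tcp::socket peer) {
        if (ec)
            return;          // e.g. operation_aborted after stop()
        std::cout << "Accept: " << peer.remote_endpoint() << std::endl;
        if (accepting_)      // checked on the strand, so no race
            acceptLoop();    // only re-arm while we still want connections
    });
}

bool accepting_ = true;      // hypothetical member, only accessed on the strand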

Note that

  • the .close() approach, in code that uses the native handle, can lead to its own race condition (where another thread immediately opens a new file/socket with the reused file descriptor and the native code doesn't notice it's talking to the wrong socket). To be honest this mostly seems to be a problem with stream sockets (not acceptors), and there it's very easy to fix with .shutdown(), so just be aware.

  • I've never run into this problem in a lot of production use of ASIO. I guess in practice your exact use case (.cancel() perfectly timed with a new accept completion) doesn't come up a lot.

  • A similar use case that does come up quite a lot is with timers, which can also be tricky to handle race-free. There, the disambiguating factor is also extra state, in the form of basic_waitable_timer::expiry(). See e.g. Cancelling boost asio deadline timer safely
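
For illustration, a rough sketch of that timer pattern (steady_timer, the strand variable, and the 5-second deadline are just for the example; the expiry() check is the extra state that distinguishes "the deadline really passed" from "someone moved the deadline and this handler is stale"):

asio::steady_timer timer(strand);
timer.expires_after(std::chrono::seconds(5));
timer.async_wait([&timer](error_code ec) {
    if (ec == asio::error::operation_aborted)
        return;  // wait was cancelled outright
    if (timer.expiry() > asio::steady_timer::clock_type::now())
        return;  // deadline was pushed back: this completion is stale
    // the deadline really passed: act on it
});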

Upvotes: 3
