Nazareth

Reputation: 33

boost::asio::ip::tcp::socket.read_some() stops working. No exception or errors detected

I am currently debugging a server (Win32/64) that uses Boost.Asio 1.78.

The code is a blend of legacy, older legacy, and some newer code. None of it is mine, so I can't answer for why something is done in a certain way. I'm just trying to understand why this is happening and hopefully fix it without rewriting it from scratch. This code has been running for years on 50+ servers with no errors; only these 2 servers misbehave.

I have one client (.NET) connected to two servers. The client sends the same data to both servers. The servers run the same code, shown in the sections below.

Everything works well, but now and then communication halts. There are no errors or exceptions on either end; it just halts, and never on both servers at the same time. This happens very seldom, roughly every 3 months or less often. I have no way of reproducing it in a debugger because I don't know where to look for this behavior.

On the client side the socket appears to be open and working but does not accept new data. No errors are detected on the socket.

Here's shortened code showing the relevant functions. I want to stress that I can't detect any errors or exceptions during these failures. The code just stops at "m_socket->read_some()".

The only way to "unblock" it right now is to close the socket manually and restart the acceptor. When I manually close the socket, the read_some method returns with an error code, so I know that is where it is stuck.

Questions:

  1. What may go wrong here and give this behavior?
  2. What should I log to determine what is happening, and where it originates?

main code:

std::shared_ptr<boost::asio::io_service> io_service_is = std::make_shared<boost::asio::io_service>();
auto is_work = std::make_shared<boost::asio::io_service::work>(*io_service_is.get());

auto acceptor = std::make_shared<TcpAcceptorWrapper>(*io_service_is.get(), port);
acceptor->start();

auto threadhandle = std::thread([&io_service_is]() {io_service_is->run();});

TcpAcceptorWrapper:

void start(){
    m_asio_tcp_acceptor.open(boost::asio::ip::tcp::v4());
    m_asio_tcp_acceptor.bind(boost::asio::ip::tcp::endpoint(boost::asio::ip::tcp::v4(), m_port));
    m_asio_tcp_acceptor.listen();
    start_internal();
}
void start_internal(){
    m_asio_tcp_acceptor.async_accept(m_socket, [this](boost::system::error_code error) { /* Handler code */ });
}

Handler code:

m_current_session = std::make_shared<TcpSession>(&m_socket);
std::condition_variable condition;
std::mutex mutex;
bool stopped(false);

m_current_session->run(condition, mutex, stopped);              
{
    std::unique_lock<std::mutex> lock(mutex);
    condition.wait(lock, [&stopped] { return stopped; });
}

TcpSession runner:

void run(std::condition_variable& complete, std::mutex& mutex, bool& stopped){
    auto self(shared_from_this());
    
    std::thread([this, self, &complete, &mutex, &stopped]() {
        { // mutex scope

            // Lock and hold mutex from tcp_acceptor scope
            std::lock_guard<std::mutex> lock(mutex);

            while (true) {
                std::array<char, M_BUFFER_SIZE> buffer;

                try {
                    boost::system::error_code error;

                    /* Next call just hangs/blocks but only rarely. like once every 3 months or more seldom */
                    std::size_t read = m_socket->read_some(boost::asio::buffer(buffer, M_BUFFER_SIZE), error);

                    if (error || read == -1) {
                        // This never happens
                        break;
                    }
                    // inside this all is working
                    process(buffer);

                } catch (std::exception& ex) {
                    // This never happens
                    break;
                } catch (...) {
                    // Neither does this
                    break;
                }
            }
            stopped = true;
        } // mutex released
        complete.notify_one();
    }).detach();
}

Upvotes: 2

Views: 705

Answers (1)

sehe

Reputation: 392833

This:

m_acceptor.async_accept(m_socket, [this](boost::system::error_code error) { /* Handler code */ });

Handler code:

std::condition_variable condition;
std::mutex mutex;
bool stopped(false);
m_current_session->run(condition, mutex, stopped);
{
  std::unique_lock<std::mutex> lock(mutex);
  condition.wait(lock, [&stopped] { return stopped; });
}

Is strange. It suggests you are using an "async" accept, but the handler blocks unconditionally until the session completes. That's the opposite of asynchrony. You could write the same code much more simply without the asynchrony, and also without the thread and the synchronization around it.
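For illustration, here is a minimal sketch of what that synchronous equivalent could look like. It is an assumption-laden sketch, not code from the question: the io_context setup, buffer size, and the signature of process are invented for the example.

#include <boost/asio.hpp>
#include <array>
#include <iostream>

namespace asio = boost::asio;
using asio::ip::tcp;

// Hypothetical stand-in for the question's process(); signature assumed
void process(std::array<char, 1024> const& buffer, std::size_t bytes);

void serve(unsigned short port) {
    asio::io_context io;
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), port));

    for (;;) {
        tcp::socket socket = acceptor.accept(); // blocks until a client connects

        for (;;) {
            std::array<char, 1024> buffer;
            boost::system::error_code ec;
            std::size_t n = socket.read_some(asio::buffer(buffer), ec);
            if (ec) { // includes EOF when the peer closes the connection
                std::cerr << "read: " << ec.message() << "\n";
                break;
            }
            process(buffer, n);
        }
    }
}

With only blocking calls there is no io_service, no work guard, no detached thread, and no condition variable left to reason about.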

My intuition says something is blocking on the mutex. Have you established that the session thread's stack is actually inside the read_some frame, e.g. by breaking in a debugger during a "lock-up"?
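Until that is confirmed, bracketing the call with log lines is a cheap way to record it. A sketch only, assuming a plain stream logger is acceptable; the helper name is made up:

#include <boost/asio.hpp>
#include <chrono>
#include <iostream>

// Hypothetical helper: wraps read_some with before/after log lines so a
// post-mortem log shows whether the thread entered the call and never returned.
std::size_t logged_read(boost::asio::ip::tcp::socket& socket,
                        boost::asio::mutable_buffer buf,
                        boost::system::error_code& ec)
{
    auto stamp = [] {
        return std::chrono::system_clock::now().time_since_epoch().count();
    };

    boost::system::error_code ep_ec;
    std::cerr << stamp() << " entering read_some, remote="
              << socket.remote_endpoint(ep_ec) << "\n";

    std::size_t n = socket.read_some(buf, ec);

    std::cerr << stamp() << " read_some returned " << n
              << " bytes, error=" << ec.message() << "\n";
    return n;
}

In the session loop the call would become logged_read(*m_socket, boost::asio::buffer(buffer, M_BUFFER_SIZE), error); the remote endpoint, the byte count and the error code are the parameters worth having in the log.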

You write: "When I manually close the socket the read_some method returns with error code so I know it is inside there I have an issue."

You can't legally do that. Your socket is in use on another thread - in a blocking read - and you are necessarily closing it from a separate thread. That's a race condition (see the docs). If you want cancellable operations, use async_read*.
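A sketch of what a cancellable read could look like, assuming a steady_timer on the same executor and a single thread running the io_context; the 30-second timeout and the class shape are assumptions, not anything from the question:

#include <boost/asio.hpp>
#include <array>
#include <chrono>
#include <iostream>
#include <memory>

namespace asio = boost::asio;
using asio::ip::tcp;

struct Session : std::enable_shared_from_this<Session> {
    explicit Session(tcp::socket socket)
        : socket_(std::move(socket)), timer_(socket_.get_executor()) {}

    void start() { do_read(); }

  private:
    void do_read() {
        auto self = shared_from_this();

        timer_.expires_after(std::chrono::seconds(30)); // assumed timeout
        timer_.async_wait([self](boost::system::error_code ec) {
            if (!ec) self->socket_.cancel(); // deadline reached: cancel the pending read
        });

        socket_.async_read_some(asio::buffer(buffer_),
            [self](boost::system::error_code ec, std::size_t n) {
                self->timer_.cancel();
                if (ec) { // operation_aborted when cancelled, EOF when the peer closes
                    std::cerr << "read: " << ec.message() << "\n";
                    return;
                }
                // process(buffer_, n) would go here
                self->do_read();
            });
    }

    tcp::socket socket_;
    asio::steady_timer timer_;
    std::array<char, 1024> buffer_{};
};

Because both completion handlers run on the io_context, the cancel is no longer issued concurrently with a blocking read on another thread, which removes the race described above.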

There are more code smells (read_some is a low-level primitive that is rarely what you want at the application level; detached threads with manual synchronization on termination could be packaged tasks; shared boolean flags could be std::atomic; notify_one outside the mutex could lead to thread starvation on some platforms; etc.).
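For instance, if the accept handler really must wait for the session to finish, a std::future would replace the detached thread, the mutex, the condition variable and the shared bool in one go. A sketch under that assumption; run_session is a hypothetical helper standing in for the read loop:

#include <future>
#include <memory>

struct TcpSession; // as in the question
void run_session(std::shared_ptr<TcpSession> session); // hypothetical: runs the read loop inline

// Inside the accept handler, instead of condition_variable + mutex + bool:
void handle_accept(std::shared_ptr<TcpSession> session) {
    auto done = std::async(std::launch::async,
                           [session] { run_session(session); });
    done.wait(); // same "block until the session ends" semantics, no manual signalling
}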

If you can share more code I'll be happy to sketch simplified solutions that remove the problems.

Upvotes: 1
