Reputation: 231

Why do many libraries not detect dead TCP connections?

TCP has a keep-alive mechanism to detect dead connections, but I was surprised to find that it is turned off by default and that many libraries and tools do not use it.

If I understand correctly, a program blocked in a recv call cannot detect that the connection has been aborted by the peer if all of the peer's FIN/RST packets were lost.

A timeout parameter on the client side may alleviate the issue, but many libraries do not have an option to set a timeout either. One example is the mysql-python connector, which does not have a recv timeout option. Another example: when an Nginx server talks to a gunicorn backend via proxy_pass, gunicorn workers may stop responding because of dead connections, but there is no way for the workers to detect this.
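For reference, at the raw socket level a receive timeout looks roughly like the sketch below. The helper name recv_with_timeout is made up for illustration, and a local socketpair stands in for a real client connection; libraries that wrap the socket may or may not expose an equivalent setting.

```python
import socket


def recv_with_timeout(sock, nbytes, seconds):
    """Return received bytes, or None if nothing arrived within `seconds`.

    Hypothetical helper: it only shows what a library-level "recv timeout"
    option would boil down to on a plain socket.
    """
    sock.settimeout(seconds)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # The peer sent nothing in time. It may be dead or merely idle;
        # a timeout alone cannot tell the two apart.
        return None


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(recv_with_timeout(a, 1024, 0.2))  # peer is silent
    b.sendall(b"hello")
    print(recv_with_timeout(a, 1024, 0.2))  # data is available
    a.close()
    b.close()
```

Note that a timeout only bounds how long the caller waits; deciding what to do afterwards (retry, reconnect, give up) is still up to the application.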

Could anyone explain the reason, or correct me if I am wrong?

Upvotes: 1

Answers (2)

Jeremy Friesner

Reputation: 73294

The term "dead connection" is a bit ambiguous -- it could mean any of the following:

1. The peer program closed its socket (or the peer program exited or crashed, and the peer computer's OS closed the socket as part of its standard process-cleanup)
2. Connectivity to the peer computer has suddenly been lost (this could happen because the peer computer lost power, or somebody pulled out the Ethernet cord that was connecting the peer computer to the router, or the peer's ISP had a router failure, or your ISP had a router failure, etc.)
3. The peer program is still running but simply decided (for some reason, probably due to a bug) to stop calling recv() on its TCP socket.
4. The packet-path between your program and the remote peer still exists, sort of, but something along that path is dropping so many packets that the effective transmission rate of the TCP connection has dropped to approximately zero.

So the first question to answer is, which of the above conditions will the TCP layer detect on its own?

Condition (1) is the easy case -- the peer's TCP stack will send you the FIN packets, and when your program's network stack receives them, it will know for sure that the TCP connection is closed and act accordingly, and therefore your recv() call will return 0 very quickly.
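A minimal sketch of condition (1), using a loopback TCP connection so that both ends live in one process. The function name recv_after_peer_close is invented for this example, and the 5-second timeout is only a safety net so the demo cannot hang.

```python
import socket


def recv_after_peer_close():
    """Open a loopback TCP connection, close the peer side, then recv()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    server_side.close()  # orderly close: the peer's stack sends a FIN

    client.settimeout(5)       # safety net for the demo only
    data = client.recv(1024)   # empty bytes signal end-of-stream

    client.close()
    listener.close()
    return data


if __name__ == "__main__":
    print(repr(recv_after_peer_close()))
```

In Python the "returns 0" case shows up as recv() returning an empty bytes object.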

In condition (2), the answer is "sometimes" -- in particular, if your program has any TCP data in the socket's output buffer that it is trying to send to the peer, and it never gets any ACK packets back regarding that data, then after a certain number of timeouts (and subsequent packet-resend attempts), your computer's TCP stack will give up, declare the connection dead, and unilaterally close the TCP connection; at which point recv() will stop blocking and report an error (typically ETIMEDOUT). If there is no outgoing TCP data waiting to be sent, on the other hand, then the local TCP stack won't be waiting for any ACKs to come back, and therefore it won't time out when it doesn't get them, and therefore it won't ever give up and close the TCP connection. In this scenario, your recv() call could well block indefinitely, because the TCP connection is idle and the TCP stack has no way of knowing that the peer is gone (as opposed to simply not sending any data right now).

It is this scenario that the SO_KEEPALIVE option was meant to handle, but since the designers of the SO_KEEPALIVE option wanted to conserve bandwidth by default, and sending automatic keepalive packets uses up additional bandwidth, they decided to make the keepalive option disabled by default. Also, the default send-a-keepalive interval is often quite long by modern standards (e.g. hours) and on some OSes it is difficult to change except on a system-wide basis, which makes SO_KEEPALIVE of limited usefulness for many applications.
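For what it's worth, turning keepalive on for a single socket might look like the sketch below. SO_KEEPALIVE itself is widely available; the per-socket tuning options (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT) are platform-specific, so the code only sets the ones that Python's socket module exposes on the current platform. The helper name enable_keepalive and the interval values are illustrative assumptions, not recommendations.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Enable TCP keepalive on `sock` and, where supported, tune it.

    idle:     seconds of inactivity before the first probe
    interval: seconds between unanswered probes
    count:    unanswered probes before the connection is declared dead
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # These constants exist only on some platforms (e.g. Linux); elsewhere
    # the system-wide defaults remain in effect.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With values like these, an idle connection to a vanished peer would be given up on after roughly idle + interval * count seconds on platforms that honor all three options.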

For conditions (3) and (4), the TCP connection isn't really "dead", it's just that some device (either the peer program, or a piece of networking gear somewhere between your program and the peer) is being uncooperative. Since the TCP layer can't know what the applications that are using it are trying to achieve, it wisely doesn't try to second-guess them in this regard, and it leaves the TCP connection open unless you explicitly tell it to close() the connection.

So now that we've described the TCP layer's behavior, what about the applications and APIs that use it? i.e. why don't they try to improve on the basic TCP-stack behavior by offering better detection? The answer is that some of them do; e.g. by periodically sending dummy "ping" messages across any socket that would otherwise be idle, simply to "stimulate" the TCP stack into detecting when no ACKs are coming back as described in the paragraph about condition (2), above. Some go even further and expect the remote peer to send a corresponding "pong" message back on the same socket within (so many) seconds, and if it doesn't arrive, the program will unilaterally close the socket. This sort-of works, but it also makes assumptions about the performance of your network, and that can lead to false positives and therefore unwanted disconnections when the peer is connecting via a slow or unreliable network, which is why many applications/libraries don't implement this (or at least don't enable it by default).
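A rough sketch of the ping/pong idea, assuming a made-up wire format in which the literal messages PING and PONG are reserved for liveness checks. The names PING, PONG and peer_is_responsive are hypothetical; a real protocol would need proper message framing, since recv() may return a partial message.

```python
import socket

# Hypothetical heartbeat messages; any real protocol defines its own.
PING = b"PING\n"
PONG = b"PONG\n"


def peer_is_responsive(sock, timeout=5.0):
    """Send a ping and wait up to `timeout` seconds for the matching pong."""
    sock.settimeout(timeout)
    try:
        sock.sendall(PING)
        # Simplification: assumes the short reply arrives in one piece.
        reply = sock.recv(len(PONG))
    except OSError:
        # Covers socket.timeout as well as resets and other socket errors.
        return False
    return reply == PONG


if __name__ == "__main__":
    a, b = socket.socketpair()
    # The peer never answers, so the check fails after the timeout.
    print(peer_is_responsive(a, timeout=0.2))
    a.close()
    b.close()
```

The timeout value is exactly the network-performance assumption mentioned above: set it too low and a slow but healthy peer gets disconnected.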

Upvotes: 6

cshu

Reputation: 5954

It's not surprising to me that keep-alive is turned off by default.

It's always possible for the peer program to freeze due to a bug or some other error. In that case recv also blocks forever, even though the TCP connection is alive. So keep-alive may not be so useful after all (except to prevent a router from dropping the connection). Various other things might cause your recv to block forever anyway.

Besides, a general-purpose low-level protocol should probably be kept as simple as possible.

In addition, I'm not surprised by your examples of libraries that don't let you set a timeout either. Look at the most popular software tools in the world. They have been polished, evolved, optimized, and used for a long time. Yet many of them still freeze, crash, or misbehave rather frequently. Writing correct code is meticulous work, not to mention further requirements like security, cross-platform support, and backward compatibility. A programmer's life is not easy.

Upvotes: 0
