syko

Reputation: 3637

What is the proper mechanism for handling TCP failure?

I am writing a socket program in C++. The program runs on a set of cluster machines.

I am new to socket programming and have only just learned how to send and receive. I expect that, over a long run of the program, some TCP connections will get lost, in which case the server and client need to reconnect smoothly.

I wonder if there is a well-known basic mechanism (or algorithm? protocol?) to achieve this. I found that there are a great many socket error codes with different semantics, which makes it hard for me to know where to start.

Can anyone suggest any reference code that I can learn from?

Thanks,

Upvotes: 0

Views: 2074

Answers (2)

Sam Varshavchik

Reputation: 118292

The actual, specific error code is irrelevant. If you have an active socket connection, a failed read or write indicates that the connection is gone. The error code perhaps gives you some explanation, but it's a bit too late now. The socket is gone. It is no more. It ceased to exist. It's an ex-socket. You can use the error code to come up with a colorful explanation, but it would be little more than some minor consolation. Whatever the specific reason, your socket is gone and you have to deal with it.

When using non-blocking sockets there are certain specific return codes and errno values that indicate that the socket is still fine, but just is not ready to read or write anything, that you'll have to specifically check for, and handle. This would be the only exception to this.

Also, EINTR does not mean that the socket is broken, only that a signal interrupted the call; that is another exception to check for.

Once you have a broken socket, the only general design principle, if there is one, is that you have to close() it as the first order of business. The file descriptor is completely useless. After that point, it's entirely up to you what to do next. There are no rules, etched in stone, for this situation. Typically, applications would log an error, in some form or fashion, or attempt to make another connection. It's generally up to you to figure out what to do.
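As a rough illustration of "close() it, then try another connection", here is a minimal sketch. The name `close_and_reconnect`, the `try_connect` callback, and the doubling delay between attempts are all assumptions made for this example, not an established API or rule. `try_connect` stands for whatever your application does to open a new connection (typically `socket()` plus `connect()`) and is expected to return a descriptor, or -1 on failure.

```cpp
#include <unistd.h>   // close()
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: close the broken descriptor first, then call
// try_connect() until it yields a descriptor >= 0 or attempts run out.
// Returns the new descriptor, or -1 if every attempt failed.
template <class Connect>
int close_and_reconnect(int broken_fd, Connect try_connect,
                        int max_attempts,
                        std::chrono::milliseconds delay)
{
    assert(max_attempts > 0);

    if (broken_fd >= 0)
        ::close(broken_fd);              // the old descriptor is useless now

    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        int fd = try_connect();
        if (fd >= 0)
            return fd;                   // connected again
        if (attempt < max_attempts) {
            std::this_thread::sleep_for(delay);
            delay *= 2;                  // back off; an assumed policy, tune freely
        }
    }
    return -1;                           // give up; the caller decides what next
}
```

Passing the connection step in as a callback keeps the retry policy separate from the details of how the connection is made, which also makes the loop easy to exercise without a network.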

About the only "well-known basic mechanism" in socket programming is explicit timeouts. Network errors and failures are not always detected immediately by the underlying operating system. It can take many minutes before the protocol stack declares a socket broken and gives you an error indication.

So, if you're coding a particular application, and you know that you should expect to read or write something within some prescribed time frame, a common design pattern is to code an explicit timeout, and if nothing happens when the timeout expires, assume that the socket is broken -- even if you have no explicit error indication otherwise -- close() it, then proceed to the next step.
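One possible way to code such a timeout is to wait with `poll()` before reading. The sketch below is only an illustration: `read_with_timeout` and the `kTimedOut` sentinel are names invented here, and restarting the full timeout after EINTR is a simplification.

```cpp
#include <poll.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <cstddef>

// Hypothetical sentinel meaning "nothing arrived in time".
constexpr ssize_t kTimedOut = -2;

// Hypothetical helper. Returns:
//   > 0        number of bytes read
//   0          the peer closed the connection
//   -1         error, with errno set by poll() or read()
//   kTimedOut  nothing became readable within timeout_ms
ssize_t read_with_timeout(int fd, void* buf, std::size_t len, int timeout_ms)
{
    assert(timeout_ms >= 0);

    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int rc;
    do {
        rc = ::poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);  // interrupted by a signal: wait again

    if (rc == 0)
        return kTimedOut;                // treat the socket as broken
    if (rc < 0)
        return -1;
    return ::read(fd, buf, len);         // readable, hung up, or in error
}
```

On `kTimedOut` the caller would close() the descriptor exactly as it would after an explicit error.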

Upvotes: -1

user207421

Reputation: 310860

It's not complicated. The only two error codes that aren't fatal to the connection are:

  • EAGAIN/EWOULDBLOCK, which on Linux are two names for the same number (POSIX allows them to differ, so portable code checks both), and mean that it is OK to re-attempt the operation after a period, or after select()/poll()/epoll() has so indicated;
  • EINTR, which just means 'interrupted system call' - try again.

All others are fatal to the connection and should cause you to close it.
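The rule above could be captured in a small helper along these lines. This is a sketch only; `IoError` and `classify_errno` are names made up for the example.

```cpp
#include <cerrno>

// Hypothetical three-way classification of a failed read()/write().
enum class IoError { Retry, WouldBlock, Fatal };

// Classify the errno value left behind by a failed socket call.
IoError classify_errno(int err)
{
    if (err == EINTR)
        return IoError::Retry;        // interrupted system call: just try again
    if (err == EAGAIN || err == EWOULDBLOCK)
        return IoError::WouldBlock;   // wait for select()/poll()/epoll(), then retry
    return IoError::Fatal;            // anything else: close() the socket
}
```

A caller would typically save `errno` immediately after the failed call, pass it to `classify_errno`, and close the descriptor only on `IoError::Fatal`.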

Upvotes: 3
