syko

Reputation: 3637

How can I handle failover when a TCP connection is lost while a program is running?

I am writing a distributed program in C++ that uses TCP and runs on Linux (CentOS 7, kernel 3.1.0).

The program is built for high performance with high CPU, disk and network usage.

The program may run for several days, for example 4 days. I am worried about a TCP connection being lost during the computation for any reason other than one of the machines dying.

Can this happen? (That is, can the TCP connection be lost while all machines are alive and nobody has called close on the socket?)

If so, what can I as the programmer do about it? Can I detect the lost connection and try to reconnect?

Thanks,

Upvotes: 0

Views: 557

Answers (1)

stefaanv

Reputation: 14392

Ideally, connection management is part of the protocol. This way the management is documented and client and server know what is expected.

Some strategies:

  • Use UDP: there is no connection to lose. The application handles request/reply and possible failures itself, typically with a timeout on each reply.
  • Short TCP connections: connect only when needed and disconnect after each "transaction" (e.g. HTTP).
  • Long TCP connections with keep-alive checks and connection retries: detect connection failures, have the client reconnect, and have the server wait for the reconnection.
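The third strategy could look roughly like the sketch below on Linux. It is only an illustration, not a complete solution: the helper names (enable_keepalive, connect_once, connect_with_retry) and the keep-alive values (30 s idle, 10 s between probes, 3 probes) are assumptions chosen for the example, and it handles IPv4 addresses only. With keep-alive enabled, a blocked recv() on a dead path eventually fails with ETIMEDOUT instead of hanging forever, and the caller can then close the socket and reconnect.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Enable kernel keep-alive probes on a connected socket (Linux-specific options).
// Returns true only if every option was applied.
bool enable_keepalive(int fd, int idle_s, int interval_s, int probes) {
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == 0 &&
           setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) == 0 &&
           setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) == 0 &&
           setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) == 0;
}

// Try once to connect to an IPv4 address. Returns a socket fd, or -1 on failure.
int connect_once(const char* ip, std::uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);

    // Illustrative values: probe after 30 s idle, every 10 s, give up after 3 misses.
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1 ||
        connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        !enable_keepalive(fd, 30, 10, 3)) {
        close(fd);
        return -1;
    }
    return fd;
}

// Retry connect_once with exponential backoff between attempts.
// Returns a connected fd, or -1 if all attempts failed.
int connect_with_retry(const char* ip, std::uint16_t port, int max_attempts,
                       std::chrono::milliseconds initial_delay) {
    assert(max_attempts > 0);
    auto delay = initial_delay;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        int fd = connect_once(ip, port);
        if (fd >= 0) return fd;
        if (attempt + 1 < max_attempts) {
            std::this_thread::sleep_for(delay);
            delay *= 2;  // back off: wait twice as long before the next attempt
        }
    }
    return -1;
}
```

A client would call connect_with_retry at startup and again whenever recv() returns 0 (peer closed) or fails with an error such as ETIMEDOUT or ECONNRESET. Reconnecting only restores the transport; resuming the computation (for example, resending unacknowledged requests) still has to be defined by the application protocol, which is why connection management is best treated as part of the protocol.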

Upvotes: 1
