Adjusting HTTP Timeout versus backoff during retries

Question

I'm wondering about the trade-offs between two approaches to handling HTTP timeouts between two services. Service A is trying to implement retry functionality when calling service B.

Approach 1: This is the typical approach (e.g. the Ethernet protocol). Perform a request with a fixed timeout T. If a timeout occurs, sleep for X and retry the request. Increase X exponentially.
Approach 2: Instead of sleeping between retries, increase the actual HTTP timeout value (say, exponentially). In both cases, assume an upper bound on the value being increased.
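
To make the comparison concrete, here is a minimal Python sketch of the two approaches as schedules of (timeout, sleep) pairs, plus a small driver that consumes either one. All names are hypothetical, the doubling factor is just one choice of "exponentially", and `send` is assumed to be a caller-supplied function that raises `TimeoutError` when the request times out.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def backoff_schedule(timeout: float, first_sleep: float, max_sleep: float,
                     attempts: int) -> Iterator[Tuple[float, float]]:
    """Approach 1: fixed timeout T, sleep X doubles after each failure (capped)."""
    sleep = first_sleep
    for _ in range(attempts):
        yield timeout, sleep
        sleep = min(sleep * 2, max_sleep)


def growing_timeout_schedule(first_timeout: float, max_timeout: float,
                             attempts: int) -> Iterator[Tuple[float, float]]:
    """Approach 2: never sleep, but double the timeout after each failure (capped)."""
    timeout = first_timeout
    for _ in range(attempts):
        yield timeout, 0.0
        timeout = min(timeout * 2, max_timeout)


def call_with_retries(send: Callable[[float], object],
                      schedule: Iterable[Tuple[float, float]],
                      sleep: Callable[[float], None] = time.sleep) -> object:
    """Try send(timeout) once per schedule entry, pausing between failed attempts.

    Assumption: send raises TimeoutError when the request times out.
    """
    pending_pause = 0.0
    last_error = None
    for timeout, pause in schedule:
        if pending_pause > 0:
            sleep(pending_pause)  # pause only before a retry, never after the last try
        try:
            return send(timeout)
        except TimeoutError as exc:
            last_error = exc
            pending_pause = pause
    if last_error is None:
        raise ValueError("schedule is empty")
    raise last_error
```

With the same driver, the only difference between the approaches is which of the two numbers grows, which is the trade-off being asked about.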

For Ethernet, approach 1 makes sense because of its low-level location in the network stack. However, for an application-level retry mechanism, would approach 2 be more appropriate? In a situation where there are high levels of network congestion, I would think #2 is better for a couple of reasons:

Sending additional TCP connection requests will only flood the network more.
You're basically guaranteed not to receive a response while you're sleeping (because you already timed out and/or tore down the socket), whereas if you just allow the TCP request to remain outstanding (or keep the socket open if the connection has at least been established), there is still a chance the request succeeds.

Any thoughts on this?

dgatwood · Accepted Answer

On a high-packet-loss network (e.g. cellular, or wi-fi near the limits of its range), there's a distinct possibility that your requests will continue to time out forever if the timeout is too short. So increasing the timeout is often a good idea.

Retrying the request immediately often works, and if it doesn't, waiting a while might make no difference (e.g. if you no longer have a network connection). For example, on iOS, your best bet is to use reachability, and if reachability determines that the network is down, there's no reason to retry until it is back up.

My general thoughts are that for short requests (i.e. not uploading/downloading large files), if you haven't received any response from the server at all after 3-5 seconds, start a second request in parallel. Whichever request returns a header first wins. Cancel the other one. Keep the timeout at 90 seconds. If that fails, see if you can reach a generate_204 endpoint (a lightweight connectivity-check URL that returns an empty HTTP 204 response).
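
The "second request in parallel" idea can be sketched with asyncio, which lets the losing request be cancelled. This is only an illustration under assumptions: `hedged_request` and `make_request` are hypothetical names, `make_request` is assumed to be a coroutine function that finishes once the response header has arrived, and the 4-second hedge delay is just a value picked from the 3-5 second range above.

```python
import asyncio
from typing import Any, Callable, Coroutine, TypeVar

T = TypeVar("T")


async def hedged_request(make_request: Callable[[], Coroutine[Any, Any, T]],
                         hedge_after: float = 4.0,
                         overall_timeout: float = 90.0) -> T:
    """Start one request; if it is silent for hedge_after seconds, start a second.

    The first request to succeed wins and the other is cancelled.
    Assumption: make_request() completes when the response header is received.
    """

    async def race() -> T:
        tasks = [asyncio.create_task(make_request())]
        try:
            # Give the first request a head start before hedging.
            done, _ = await asyncio.wait(tasks, timeout=hedge_after)
            if not done:
                tasks.append(asyncio.create_task(make_request()))
            pending = set(tasks)
            last_error = None
            while pending:
                done, pending = await asyncio.wait(
                    pending, return_when=asyncio.FIRST_COMPLETED)
                for task in done:
                    if task.exception() is None:
                        return task.result()
                    last_error = task.exception()
            raise last_error  # every started request failed
        finally:
            for task in tasks:
                task.cancel()  # no-op for finished tasks, cancels the loser

    # The overall timeout stays long (90 seconds in the answer above).
    return await asyncio.wait_for(race(), timeout=overall_timeout)
```

Note that a request which fails quickly does not trigger a hedge in this sketch; the second request is only started when the first one has produced nothing at all within the hedge delay.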

If generate_204 works, the problem could be a server issue. Retry immediately, but flag the server as suspect. If that retry fails a second time (after a successful generate_204 response), start your exponential backoff waiting for the server (with a cap on the maximum interval).
If the generate_204 request doesn't respond, your network is dead. Wait for a network change, trying only very occasionally (e.g. every couple of minutes minimum).
If the network connectivity changes (e.g. if you suddenly have Wi-Fi), restart any waiting connections after a few seconds. There's no reason to wait the full time at that point, because everything has changed.
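
The three cases above can be written as a small decision function. This is a rough sketch rather than a definitive implementation: `RetryState`, `next_delay`, and `on_network_change` are hypothetical names, and the concrete numbers (1-second initial backoff, 5-minute cap, 2-minute dead-network poll, 3-second settle delay) are assumptions chosen only to match the rough magnitudes mentioned. Clearing the "suspect" flag on a network change is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class RetryState:
    server_suspect: bool = False  # set after the first failure with a working network
    backoff: float = 0.0          # current server backoff interval in seconds


def next_delay(state: RetryState, probe_ok: bool,
               first_backoff: float = 1.0, max_backoff: float = 300.0,
               dead_network_poll: float = 120.0) -> float:
    """Return how long to wait before retrying after a failed request.

    probe_ok is the result of the generate_204 connectivity check.
    """
    if not probe_ok:
        # Network is dead: wait for a network change, polling only occasionally.
        return dead_network_poll
    if not state.server_suspect:
        # Network works, so the server may be at fault: retry immediately once.
        state.server_suspect = True
        return 0.0
    # The server failed again with a working network: capped exponential backoff.
    if state.backoff == 0.0:
        state.backoff = first_backoff
    else:
        state.backoff = min(state.backoff * 2, max_backoff)
    return state.backoff


def on_network_change(state: RetryState, settle: float = 3.0) -> float:
    """Connectivity changed: forget old conclusions and retry after a short settle time."""
    state.server_suspect = False
    state.backoff = 0.0
    return settle
```

A caller would run the connectivity probe after each failed request, pass the result to `next_delay`, and call `on_network_change` from whatever notification the platform provides for connectivity changes.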

But obviously there's no correct answer. This approach is fairly aggressive. Others might take the opposite approach. It all depends on what your goals are.
