Reputation: 3314
I'm wondering about the trade-offs between two approaches to handling HTTP timeouts between two services. Service A is trying to implement retry functionality when calling service B.
Approach 1: This is the typical approach (e.g. the Ethernet protocol). Perform a request with a fixed timeout T. If a timeout occurs, sleep for X and retry the request. Increase X exponentially.
Approach 2: Instead of sleeping between retries, increase the actual HTTP timeout value (say, exponentially). In both cases, consider a max-bound.
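To make the comparison concrete, here is a minimal sketch of both schedules in Python. Every name in it (`fixed_timeout_schedule`, `growing_timeout_schedule`, `call_with_retries`, and the `send` callable) is hypothetical and only for illustration. It assumes `send(timeout)` performs one request and raises the built-in `TimeoutError` when it times out; a real HTTP client may raise a different exception type.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple

Attempt = Tuple[float, float]  # (request timeout, sleep before the next attempt)


def fixed_timeout_schedule(timeout: float, first_sleep: float, max_sleep: float,
                           attempts: int, factor: float = 2.0) -> Iterator[Attempt]:
    """Approach 1: fixed timeout T, sleep X grows exponentially up to max_sleep."""
    pause = first_sleep
    for _ in range(attempts):
        yield timeout, pause
        pause = min(pause * factor, max_sleep)


def growing_timeout_schedule(first_timeout: float, max_timeout: float,
                             attempts: int, factor: float = 2.0) -> Iterator[Attempt]:
    """Approach 2: no sleep, the timeout itself grows exponentially up to max_timeout."""
    timeout = first_timeout
    for _ in range(attempts):
        yield timeout, 0.0
        timeout = min(timeout * factor, max_timeout)


def call_with_retries(send: Callable[[float], object], schedule: Iterable[Attempt],
                      sleep: Callable[[float], None] = time.sleep) -> object:
    """Try send(timeout) once per schedule entry; re-raise the last TimeoutError."""
    plan = list(schedule)
    for index, (timeout, pause) in enumerate(plan):
        try:
            return send(timeout)
        except TimeoutError:
            if index == len(plan) - 1:
                raise  # out of attempts
            if pause > 0:
                sleep(pause)
    raise ValueError("schedule must contain at least one attempt")
```

With a 2-second starting timeout and a 10-second cap, approach 2 would try timeouts of 2, 4, 8 and 10 seconds back to back, while approach 1 would keep the timeout fixed and spend the growing interval idle.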
For Ethernet, this makes sense because of its low-level location in the network stack. However, for an application-level retry mechanism, would approach 2 be more appropriate? In a situation where there are high levels of network congestion, I would think #2 is better for a couple of reasons:
Any thoughts on this?
Upvotes: 1
Views: 2244
Reputation: 311018
There's not much point in sleeping when you could be doing useful work, or in using a shorter timeout than you can really tolerate. I would use (2).
The idea that Ethernet or indeed anything uses (1) seems fanciful. Do you have a citation?
Upvotes: 0
Reputation: 10417
On a high-packet-loss network (e.g. cellular, or wi-fi near the limits of its range), there's a distinct possibility that your requests will continue to time out forever if the timeout is too short. So increasing the timeout is often a good idea.
And retrying the request immediately often works, and if it doesn't, waiting a while might make no difference (e.g. if you no longer have a network connection). For example, on iOS, your best bet is to use reachability, and if reachability determines that the network is down, there's no reason to retry until it isn't.
My general thoughts are that for short requests (i.e. not uploading/downloading large files), if you haven't received any response from the server at all after 3-5 seconds, start a second request in parallel. Whichever request returns a header first wins. Cancel the other one. Keep the timeout at 90 seconds. If that fails, see if you can reach a generate_204 endpoint (a connectivity check that returns an empty HTTP 204 response).
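The "second request in parallel" idea could be sketched with `asyncio` roughly as follows. This is an illustration under assumptions, not a definitive implementation: `hedged_request` and `make_request` are hypothetical names, `make_request` is assumed to be a coroutine function that performs one request, and "returns a header first" is simplified to "finishes first". If the first task to finish failed, its exception is propagated as-is.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def hedged_request(make_request: Callable[[], Awaitable[T]],
                         hedge_after: float = 4.0,
                         overall_timeout: float = 90.0) -> T:
    """Start one request; if it is silent for hedge_after seconds, start a second.

    The first task to finish wins and any task still running is cancelled.
    """
    tasks = [asyncio.create_task(make_request())]
    try:
        # Give the first request a head start.
        done, _ = await asyncio.wait(tasks, timeout=hedge_after)
        if not done:
            # No answer yet: race a second request against the first.
            tasks.append(asyncio.create_task(make_request()))
            done, _ = await asyncio.wait(
                tasks,
                timeout=max(0.0, overall_timeout - hedge_after),
                return_when=asyncio.FIRST_COMPLETED,
            )
        if not done:
            raise TimeoutError("no response within the overall timeout")
        return next(iter(done)).result()
    finally:
        # Cancel whichever request lost the race (or both, on timeout).
        for task in tasks:
            if not task.done():
                task.cancel()
```

It would be called as `await hedged_request(fetch)` where `fetch` is whatever coroutine function issues the request. Note that this doubles the load on the server in the slow case, which is part of why the approach is described as aggressive.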
If generate_204 works, the problem could be a server issue. Retry immediately, but flag the server as suspect. If that retry fails a second time (after a successful generate_204 response), start your exponential backoff waiting for the server (with a cap on the maximum interval).
If the generate_204 request doesn't respond, your network is dead. Wait for a network change, trying only very occasionally (e.g. every couple of minutes minimum).
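The decision logic in the last two paragraphs can be written down as a small pure function. This is only a sketch: `Decision`, `decide`, and all parameter names and default values are hypothetical, and the result of the generate_204 probe is assumed to be supplied by the caller as a boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str   # "retry_now", "backoff", or "wait_for_network"
    delay: float  # seconds to wait before acting


def decide(probe_ok: bool, server_failures: int, base_delay: float = 1.0,
           max_delay: float = 60.0, dead_network_poll: float = 120.0) -> Decision:
    """Pick the next step after a failed request.

    probe_ok: whether the generate_204 connectivity check succeeded.
    server_failures: failed requests so far while the probe was succeeding.
    """
    if not probe_ok:
        # Network looks dead: poll only occasionally until connectivity changes.
        return Decision("wait_for_network", dead_network_poll)
    if server_failures <= 1:
        # First failure with a working network: retry at once, server is suspect.
        return Decision("retry_now", 0.0)
    # Repeated failures: capped exponential backoff (base, 2x base, 4x base, ...).
    delay = min(base_delay * 2 ** (server_failures - 2), max_delay)
    return Decision("backoff", delay)
```

With these assumed defaults, a second failure waits 1 second, a third waits 2, and the delay stops growing once it reaches the 60-second cap.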
If the network connectivity changes (e.g. if you suddenly have Wi-Fi), restart any waiting connections after a few seconds. There's no reason to wait out the full backoff interval at that point, because everything has changed.
But obviously there's no correct answer. This approach is fairly aggressive. Others might take the opposite approach. It all depends on what your goals are.
Upvotes: 1