user1724763


Handling failures in Thrift in general

I read through the official documentation and the official whitepaper, but I couldn't find a satisfying answer to how Thrift handles failures in the following scenario:

Say you have a client sending a method call to a server to insert an entry in some data structure residing in that server (it doesn't really matter what it is). Suppose the server has processed the call and inserted the entry but the client couldn't receive a response due to a network failure. In such a case, how should the client handle this? A simple retry of sending the call would possibly result in a duplicate entry being inserted. Does the Thrift library persist the response somewhere so that it can resend to the client when it is back online? Or is it the application's responsibility to do so?

Would appreciate it if someone could point out the details of how it works, besides directing to its source code.

Upvotes: 3

Views: 1693

Answers (2)

King.Zevin

Reputation: 71

Let me try to give a straight answer.

... is it the application's responsibility to do so?

Yes.

There are four types of exceptions involved in Thrift RPC: TTransportException, TProtocolException, TApplicationException, and user-defined exceptions.

According to the book Programmer's Guide to Apache Thrift, the former two are local exceptions, while the latter two are not.

As the names imply, TTransportException covers conditions like NOT_OPEN and TIMED_OUT, and TProtocolException covers INVALID_DATA, BAD_VERSION, etc. These exceptions are not propagated from the server to the client and act much like normal language exceptions.

TApplicationException covers problems such as calling a method that isn't implemented or failing to provide the necessary arguments to a method.

User-defined exceptions are declared in IDL files and raised by the user code.

For all of these exceptions, no retry operations are performed by the Thrift RPC framework itself. Instead, they must be handled appropriately by the application code.
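To make this concrete, here is a minimal client-side sketch of such application-level handling. The stub exception classes stand in for the real `thrift.transport.TTransport.TTransportException` and `thrift.Thrift.TApplicationException`, and `flaky_insert` stands in for a generated client method; only transport-level failures are retried, since an application-level exception means the request reached the server and a blind retry could duplicate work:

```python
import time

class TTransportException(Exception):
    """Stand-in for thrift.transport.TTransport.TTransportException."""

class TApplicationException(Exception):
    """Stand-in for thrift.Thrift.TApplicationException."""

def call_with_retry(fn, retries=3, delay=0.01):
    """Retry only on transport-level failures; application-level
    errors propagate straight to the caller."""
    for attempt in range(retries):
        try:
            return fn()
        except TTransportException:
            if attempt == retries - 1:
                raise          # out of retries, give up
            time.sleep(delay)  # back off before retrying
        # TApplicationException and user-defined exceptions are deliberately
        # NOT caught here: retrying those may duplicate server-side work.

# Simulated flaky call: fails twice at the transport level, then succeeds.
attempts = []
def flaky_insert():
    attempts.append(1)
    if len(attempts) < 3:
        raise TTransportException("TIMED_OUT")
    return "inserted"

print(call_with_retry(flaky_insert))  # prints: inserted
```

Note that even this retry loop does not solve the asker's duplicate-insert problem by itself; it only distinguishes "the request may never have arrived" from "the server definitely saw it".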

Upvotes: 2

JensG

Reputation: 13411

The question is an interesting one, but it is by no means limited to Thrift. A better name would be

Handling failures in asynchronous or remote calls in general

because that is, in essence, what it is. Although in the specific case of an RPC-style API such as a Thrift service the client blocks and the call appears synchronous, it really isn't.

The whole problem can be rephrased to the more general question about

Designing robust distributed systems

So what is the main problem we have to deal with? We have to assume that every call we make may fail. In particular, it can fail in three ways:

  • request died
  • request sent, server processing successful, response died
  • request sent, server processing failed, response died

In some cases this is not a big deal, regardless of which of these cases occurred. If the client just wants to retrieve some values, it can simply re-query and will eventually get results if it retries often enough.

In other cases, especially when the client modifies data on the server, it becomes more problematic. The general recommendation in such cases is to make the service calls idempotent, meaning: no matter how often I make the same call, the end result is always the same. This can be achieved by various means and depends on the use case.
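The difference idempotency makes can be shown in a few lines (a toy illustration, not Thrift-specific; the names are invented):

```python
# Retrying a non-idempotent call changes the result each time,
# while retrying an idempotent one converges to the same state.
balance = {"acct": 100}

def add_interest(amount):
    """Non-idempotent: each duplicated retry adds again."""
    balance["acct"] += amount

def set_balance(value):
    """Idempotent: duplicated retries leave the same state."""
    balance["acct"] = value

add_interest(10)
add_interest(10)               # a network-level retry: 120, not 110
assert balance["acct"] == 120

set_balance(110)
set_balance(110)               # a network-level retry: still 110
assert balance["acct"] == 110
```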

For example, one method is to send a logical "ticket" value along with each request so that duplicated or outdated requests can be filtered out on the server. The server keeps track of and/or checks these tickets before processing starts. Again, whether that method suits your needs depends on your use case.
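A minimal sketch of that ticket idea, assuming the client attaches a unique id to each logical call (the class and method names here are illustrative, not part of the Thrift API): the server stores the response per ticket, so a retried request replays the stored response instead of inserting twice.

```python
import uuid

class Server:
    """Toy server that deduplicates requests by ticket."""
    def __init__(self):
        self.entries = []
        self.seen = {}  # ticket -> stored response

    def insert(self, ticket, value):
        if ticket in self.seen:
            # Duplicate request: replay the stored response, do not insert again.
            return self.seen[ticket]
        self.entries.append(value)
        response = len(self.entries) - 1  # e.g. index of the new entry
        self.seen[ticket] = response
        return response

server = Server()
t = str(uuid.uuid4())              # client generates one ticket per logical call
first = server.insert(t, "hello")
second = server.insert(t, "hello") # network-level retry of the same call
assert first == second
assert server.entries == ["hello"] # inserted exactly once
```

In a real system the `seen` map would need an eviction policy and, for durability across server restarts, persistent storage; that is exactly the kind of bookkeeping the question asks about, and it lives in application code, not in the Thrift framework.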

The Command and Query Responsibility Segregation (CQRS) pattern is another approach for dealing with this complexity. It basically splits the API into commands (setters) and queries (getters). I'd recommend looking into that topic, although it is not useful for every scenario. I'd also recommend the Data Consistency Primer article, and last but not least, the CAP theorem is always a good read.

Good service/API design is not simple, and the fact that we have to deal with a distributed, parallel system does not make it easier, quite the opposite.

Upvotes: 3
