Reputation: 10311
I have a Thrift API served from a Java application running on Linux. I'm using a .NET client to connect to the API and execute operations.
The first few calls to the service work fine without errors, but then (seemingly at random) a call will "hang." If I force-quit my client and try to reconnect, the service either hangs again, or my client has the following error:
Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
at Thrift.Transport.TStreamTransport.Read(Byte[] buf, Int32 off, Int32 len)
(etc.)
When I use JConsole to get a thread dump, the server thread is sitting in accept():
"Thread-1" prio=10 tid=0x00002aaad457a800 nid=0x79c7 runnable [0x00000000434af000]
java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
- locked <0x00000005c0fef470> (a java.net.SocksSocketImpl)
at java.net.ServerSocket.implAccept(ServerSocket.java:462)
at java.net.ServerSocket.accept(ServerSocket.java:430)
at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:113)
at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
at org.apache.thrift.server.TSimpleServer.serve(TSimpleServer.java:63)
netstat on the server shows connections to the service port in TIME_WAIT, which eventually disappear several minutes after I force-quit the client (as would be expected).
The code that sets up the Thrift service is as follows:
int port = thriftServicePort;
String host = thriftServiceHost;
InetAddress adr = InetAddress.getByName(host);
InetSocketAddress address = new InetSocketAddress(adr, port);
TServerTransport serverTransport = new TServerSocket(address);
TServer server = new TSimpleServer(new TServer.Args(serverTransport).processor((org.apache.thrift.TProcessor)processor));
server.serve();
Note that we're using the TServerSocket constructor that takes an explicit hostname or IP address. I suspect that I should switch to the constructor that only specifies a port (ultimately binding to InetAddress.anyLocalAddress()). Alternatively, I suppose I could configure the service to bind to the "wildcard" address ("0.0.0.0").
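To make the difference between the two bind styles concrete, here is a minimal sketch using only java.net (no Thrift). The class name, method names, and port 9090 are made up for illustration. It only shows that a port-only InetSocketAddress carries the wildcard address, while one built from an explicit InetAddress does not; the assumption is that passing the port-only address to the same TServerSocket constructor used above would bind the service to all interfaces.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

class BindAddressDemo {
    // Port-only constructor: the address part is the wildcard (0.0.0.0 or ::).
    static InetSocketAddress wildcard(int port) {
        return new InetSocketAddress(port);
    }

    // Explicit address: a socket bound to it only accepts connections
    // that arrive on that specific interface (loopback here).
    static InetSocketAddress loopbackOnly(int port) {
        return new InetSocketAddress(InetAddress.getLoopbackAddress(), port);
    }

    public static void main(String[] args) {
        int port = 9090; // hypothetical port, not taken from the question
        System.out.println("wildcard?      " + wildcard(port).getAddress().isAnyLocalAddress());
        System.out.println("loopback only? " + loopbackOnly(port).getAddress().isAnyLocalAddress());
    }
}
```

With an SSH tunnel, the forwarded connection arrives on whichever interface the tunnel endpoint targets, so a bind to one specific address can miss it while a wildcard bind would not.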
I should mention that the service is not hosted on the open Internet. It is hosted in a private network and I am using SSH tunneling to reach it. Hence, the hostname that the service is bound to does not resolve in my local network (although I can make the initial connection via tunneling). I wonder if this is something similar to the RMI TCP callback problem?
Is there a technical explanation for what's going on (if this is a common issue), or are there additional troubleshooting steps that I can take?
UPDATE
Had the same problem today, but this time jstack shows that the Thrift server is blocking forever reading from the input stream:
"Thread-1" prio=10 tid=0x00002aaad43fc000 nid=0x60b3 runnable [0x0000000041741000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:22)
at org.apache.thrift.server.TSimpleServer.serve(TSimpleServer.java:70)
So we need to set a "client timeout" in the TServerSocket constructor. But why would that cause the application to also refuse connections when blocking on accept()?
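As a rough illustration of what a client timeout buys you, the sketch below uses plain java.net sockets rather than Thrift. A client connects over loopback and never sends anything; the server side sets a read timeout on the accepted socket, so the blocked read gives up with SocketTimeoutException instead of waiting forever as in the jstack output above. The class and method names are hypothetical, and the assumption is that the TServerSocket client timeout ends up as this kind of socket read timeout.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ReadTimeoutDemo {
    // Returns true if reading from a silent client gives up after timeoutMillis.
    // timeoutMillis must be > 0; a value of 0 means "wait forever".
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        try (ServerSocket listener = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket silentClient = new Socket(listener.getInetAddress(), listener.getLocalPort());
             Socket accepted = listener.accept()) {
            accepted.setSoTimeout(timeoutMillis); // the "client timeout"
            try {
                accepted.getInputStream().read(); // the client never writes
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the read was abandoned instead of hanging
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

Without the setSoTimeout call, the read in this sketch would block until the client closes its end, which matches the socketRead0 frame in the thread dump.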
Upvotes: 8
Views: 6130
Reputation: 819
I have a similar C++ server/client environment.
The C++ client calls a method (attributeDefinitionsAliases) and waits for a response.
The C++ server starts writing to the socket but then blocks.
[Wireshark capture]
After I close the C++ client, the C++ server reports an exception:
Thrift internal message: TSocket::write_partial() send() : errno = 10054
Thrift internal message: TConnectedClient died: write() send(): errno = 10054
EDIT 1: It is not a Thrift problem. It seems to be a problem with the way the server is launched. I have an application (launcher-app) that starts the server with QProcess (https://doc.qt.io/archives/qt-4.8/qprocess.html); launching it with popen works fine.
Upvotes: 0
Reputation: 10311
Binding the Thrift service to the wildcard address ("0.0.0.0") solved the problem; no more hanging.
Using the multithreaded server would make the application more responsive, but would still result in hung / incomplete requests.
If someone stumbles across this question and can provide a more complete explanation and how it relates to the Java RMI TCP callbacks issue (which I linked to in my question), upvotes for you.
Upvotes: 0
Reputation: 1571
I have some suggestions. You mentioned that the first few calls to the server work and then there are hangs. That's a clue. One scenario where this happens is when the client does not send all of its bytes to the server. I am not familiar with TSimpleServer, but I assume it listens on a port, speaks some binary protocol, and expects any client to talk to it in that protocol. Your .NET client talks to this server by sending bytes. If it's not correctly flushing its output buffer, it may not be sending all the bytes to the server, which leaves the server waiting for the rest.
In Java this could happen on the client side, like this:
BufferedOutputStream stream = new BufferedOutputStream(socket.getOutputStream()); // get the socket stream to write to
stream.write(content); // write everything that needs to be written
stream.flush(); // if flush() is not called, the server may receive an incomplete message and hang
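The buffering effect can be seen without any network at all. In this small sketch a ByteArrayOutputStream stands in for the socket's output stream (an assumption made so that the example is self-contained), and the class and method names are made up. It writes a 4-byte message through a BufferedOutputStream and reports how many bytes actually reached the underlying stream.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

class FlushDemo {
    // Returns how many bytes reached the underlying stream after a 4-byte write.
    static int bytesDelivered(boolean flush) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for the socket stream
        BufferedOutputStream out = new BufferedOutputStream(sink);
        out.write(new byte[] {0, 0, 0, 42}); // small message, stays in the buffer
        if (flush) {
            out.flush(); // pushes the buffered bytes to the underlying stream
        }
        return sink.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("without flush: " + bytesDelivered(false));
        System.out.println("with flush:    " + bytesDelivered(true));
    }
}
```

A server that is waiting for those four bytes would keep blocking in read until the client flushes or closes the stream.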
Suggestions:
a) Go through your .NET client code. Check that every part of the code that actually communicates with the server properly calls the equivalent of flush() or the cleanup methods. Note: I saw from the Thrift documentation that its transport layer defines a flush(). You should scan your .NET code and see whether it's using the transport methods. http://thrift.apache.org/docs/concepts/
b) For further debugging, you could try writing a small Java client that simulates your .NET client. Run the Java client on your Linux machine (the same machine where TSimpleServer runs) and see if it causes the same issue. If it does, you can debug your Java client and find the root cause. If it doesn't, you can then run it where your .NET client runs, see if there are any issues, and take it from there.
Edit: c) I was able to find sample Thrift client code in Java here: https://chamibuddhika.wordpress.com/2011/10/02/apache-thrift-quickstart-tutorial/ I noticed transport.open(); //do some code transport.close(); As suggested in a), you could go through your .NET client code and see whether you are calling the transport methods flush() and close() on completion.
Upvotes: 3
Reputation: 25150
From your stack trace it seems you are using TSimpleServer, whose Javadoc says:
Simple singlethreaded server for testing.
What you probably want to use is TThreadPoolServer.
Most likely the single thread of TSimpleServer is blocked waiting for the dead client to respond or time out. And because TSimpleServer is single-threaded, no thread is available to process other requests.
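Here is a rough sketch of why a worker pool helps, written with plain java.net sockets and an ExecutorService rather than Thrift itself (all class and method names are made up for illustration). One client connects and stays silent, which ties up one worker in a blocking read; a second client is still served because another worker picks it up. A single-threaded loop would be stuck on the first client.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

class PooledEchoDemo {
    // Reads one byte from the connection and echoes it back.
    static void handle(Socket conn) {
        try (Socket c = conn) {
            int b = c.getInputStream().read(); // blocks until data or EOF
            if (b >= 0) {
                c.getOutputStream().write(b);
                c.getOutputStream().flush();
            }
        } catch (IOException ignored) {
            // connection dropped; nothing to echo
        }
    }

    // Returns the byte echoed to an active client while a silent client is still connected.
    static int echoWhileAnotherClientIsSilent(int value) throws IOException {
        ExecutorService workers = Executors.newFixedThreadPool(4);
        try (ServerSocket listener = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            Thread acceptor = new Thread(() -> {
                try {
                    while (true) {
                        Socket conn = listener.accept();
                        workers.execute(() -> handle(conn)); // one worker per connection
                    }
                } catch (IOException | RejectedExecutionException stop) {
                    // listener closed or pool shut down: stop accepting
                }
            });
            acceptor.setDaemon(true);
            acceptor.start();
            try (Socket silent = new Socket(listener.getInetAddress(), listener.getLocalPort());
                 Socket active = new Socket(listener.getInetAddress(), listener.getLocalPort())) {
                active.setSoTimeout(5000); // fail rather than hang if no echo arrives
                active.getOutputStream().write(value);
                active.getOutputStream().flush();
                return active.getInputStream().read();
            }
        } finally {
            workers.shutdownNow();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("echoed: " + echoWhileAnotherClientIsSilent(42));
    }
}
```

In Thrift terms, and assuming the usual Args builder pattern applies here as it does for TSimpleServer in the question, the equivalent change would be constructing a TThreadPoolServer in place of the TSimpleServer. Note that a pool only limits the damage: the worker stuck on the dead client stays stuck until a read timeout or a connection reset frees it.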
Upvotes: 5