MAFA

Reputation: 81

Linux - Too many closed connections

I'm coding an application that opens 1800 connections/minute on a single Linux machine using Netty (async NIO). A connection lives for a few seconds and is then closed, or it times out after 20 secs if no answer is received. In addition, the read/write timeout is 30 secs and the request header contains connection=close. After a while (2-3 hours) I get a lot of exceptions in the logs because Netty is unable to create new connections due to a lack of resources. I increased the max number of open files in limits.conf as:

root            hard    nofile           200000
root            soft    nofile           200000
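
For reference, the client setup described above looks roughly like this (a simplified sketch, not the actual code; the class name, handler wiring and HTTP codec usage are illustrative):

    import io.netty.bootstrap.Bootstrap;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.ChannelOption;
    import io.netty.channel.SimpleChannelInboundHandler;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioSocketChannel;
    import io.netty.handler.codec.http.*;
    import io.netty.handler.timeout.ReadTimeoutHandler;
    import io.netty.handler.timeout.WriteTimeoutHandler;

    public class ClientSketch {
        public static void main(String[] args) {
            Bootstrap b = new Bootstrap();
            b.group(new NioEventLoopGroup())
             .channel(NioSocketChannel.class)
             .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 20_000)   // connect timeout: 20 s
             .handler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     ch.pipeline().addLast(
                         new ReadTimeoutHandler(30),                  // read timeout: 30 s
                         new WriteTimeoutHandler(30),                 // write timeout: 30 s
                         new HttpClientCodec(),
                         new HttpObjectAggregator(1048576),
                         new SimpleChannelInboundHandler<FullHttpResponse>() {
                             @Override
                             protected void channelRead0(ChannelHandlerContext ctx, FullHttpResponse msg) {
                                 ctx.close();                         // done with this connection
                             }
                         });
                 }
             });

            // every request carries "Connection: close"
            FullHttpRequest req = new DefaultFullHttpRequest(HttpVersion.HTTP_1_1, HttpMethod.GET, "/");
            req.headers().set(HttpHeaderNames.CONNECTION, HttpHeaderValues.CLOSE);
            // b.connect(host, port) is then called for each outgoing request
        }
    }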

Here is the output of netstat -nat | awk '{print $6}' | sort | uniq -c | sort -n:

   1 established)
   1 FIN_WAIT2
   1 Foreign
   2 TIME_WAIT
   6 LISTEN
  739 SYN_SENT
 6439 LAST_ACK
 6705 CLOSE_WAIT
12484 ESTABLISHED

This is the output of the ss -s command:

Total: 194738 (kernel 194975)
TCP:   201128 (estab 13052, closed 174321, orphaned 6477, synrecv 0, timewait 3/0), ports 0

Transport Total     IP        IPv6
*         194975    -         -
RAW       0         0         0
UDP       17        12        5
TCP       26807     8         26799
INET      26824     20        26804
FRAG      0         0         0

Also, ls -l /proc/2448/fd | wc -l gives about 199K.
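
As a side note, the same per-process count can also be polled from inside the JVM on Linux by listing /proc/self/fd (a small illustrative sketch, not part of the application itself):

    import java.io.File;

    public class FdMonitor {
        // Counts the open file descriptors of the current process (Linux only).
        static int openFdCount() {
            String[] fds = new File("/proc/self/fd").list();
            return fds == null ? -1 : fds.length;
        }

        public static void main(String[] args) {
            System.out.println("open fds: " + openFdCount());
        }
    }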

That said, the questions are about the closed connections reported in the ss -s command output:

1) What are they exactly?

2) Why do they keep dangling without being destroyed?

3) Is there any setting (timeout or whatever) which can help to keep them under a reasonable limit?

Upvotes: 4

Views: 9454

Answers (3)

MAFA

Reputation: 81

As Roman correctly pointed out, closed connections do exist and are sockets that were never closed properly. In my case I had some clues about what was going wrong, which I report below:

1) ss -s showed strange values, in particular a lot of closed connections

2) ls -l /proc/pid/fd | wc -l showed a lot of open descriptors

3) The numbers in netstat -nat | awk '{print $6}' | sort | uniq -c | sort -n did not match the previous ones

4) sudo lsof -n -p pid (Roman's suggestion) showed a lot of entries with can't identify protocol.

Looking around on the web I found an interesting post (https://idea.popcount.org/2012-12-09-lsof-cant-identify-protocol/) which explains what point 4 might really mean and why the netstat numbers do not match (see also https://serverfault.com/questions/153983/sockets-found-by-lsof-but-not-by-netstat).

I was quite surprised, since I was using Netty 4.1.x (with Spring) with a common pattern where every connection was supposed to be properly closed, so it took me a few days to understand what was really wrong.

The subtle problem was in the Netty IO thread, where (as part of my code) the message body was copied and put into a blocking queue. When the queue was full, this slowed things down, introducing some latency and causing connection timeouts that were not detected on my end and, consequently, a leak of FDs.

My solution was to introduce a sort of pooled queue that prevents new Netty requests from being issued while the queue is full.
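
A simplified sketch of the idea (not my actual code; the class name and queue size are made up):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class RequestGate {
        private final BlockingQueue<byte[]> bodies = new ArrayBlockingQueue<>(10_000);

        // Called from the Netty IO thread: must never block.
        public boolean handOff(byte[] body) {
            return bodies.offer(body);            // returns false when full, instead of blocking
        }

        // Checked before issuing a new request: skip it while the queue is saturated.
        public boolean mayIssueRequest() {
            return bodies.remainingCapacity() > 0;
        }

        // Worker threads drain the queue as usual.
        public byte[] take() throws InterruptedException {
            return bodies.take();
        }
    }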

Upvotes: 2

Roman

Reputation: 6657

1) What are they exactly?

They are sockets that were either never connected or were disconnected and weren't closed.

In Linux, an outgoing TCP socket goes through the following stages (roughly):

  1. You create the socket (unconnected), and kernel allocates a file descriptor for it.
  2. You connect() it to the remote side, establishing a network connection.
  3. You do data transfer (read/write).
  4. When you are done with reading/writing, you shutdown() the socket for both reading and writing, closing the network connection.
  5. You close() the socket, and kernel frees the file descriptor.

So those 174K connections ss reports as closed are sockets that either never got past stage 1 (maybe connect() failed or was never even called) or went through stage 4 but not 5. Effectively, they are sockets with underlying open file descriptors but without any network binding (so the netstat / ss listings don't show them).
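
In plain Java the stages map roughly onto the following (a minimal sketch; Java exposes shutdown() as shutdownInput()/shutdownOutput(), and the host is just a placeholder):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class SocketLifecycle {
        public static void main(String[] args) throws Exception {
            try (Socket s = new Socket()) {                                   // stage 1: fd allocated
                s.connect(new InetSocketAddress("example.com", 80), 20_000);  // stage 2: connect
                OutputStream out = s.getOutputStream();                       // stage 3: transfer
                out.write("HEAD / HTTP/1.0\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
                s.shutdownOutput();                                           // stage 4: no more writes
                InputStream in = s.getInputStream();
                while (in.read() != -1) { /* drain until the peer closes */ }
            }                                                                 // stage 5: close() frees the fd
        }
    }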

2) Why do they keep dangling without being destroyed?

Because nobody called close() on them. I would call it a "file descriptor leak" or a "socket descriptor leak".

3) Is there any setting (timeout or whatever) which can help to keep them under a reasonable limit?

From the Linux point of view, no. You have to explicitly call close() on them (or terminate the process that owns them so the kernel knows they aren't used anymore).

From the Netty/Java point of view, maybe, I don't know.

Essentially, it's a bug in your code, or in Netty code (less likely), or in JRE code (much less likely). You are not releasing the resources when you should. If you show the code, maybe somebody can spot the error.
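
For illustration only, explicit release on the Netty side usually boils down to something like this (a sketch, not taken from your code; the handler name is made up):

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.SimpleChannelInboundHandler;
    import io.netty.handler.codec.http.FullHttpResponse;

    public class ResponseHandler extends SimpleChannelInboundHandler<FullHttpResponse> {
        @Override
        protected void channelRead0(ChannelHandlerContext ctx, FullHttpResponse msg) {
            // ... consume msg.content() here ...
            ctx.close();                 // releases the channel and its file descriptor
        }

        @Override
        public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
            ctx.close();                 // close on errors too, otherwise the fd leaks
        }
    }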

Upvotes: 4

user207421

Reputation: 310985

You still haven't provided the exact error message I asked for, but as far as I can see the question should be about the six and a half thousand connections in CLOSE_WAIT state, not 'those closed connections'.

You're not closing sockets that the peer has disconnected.
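
CLOSE_WAIT means the peer has already sent its FIN and you have received it, but your application hasn't called close() yet. With a plain socket the usual handling looks roughly like this (illustrative only, not your code):

    import java.io.InputStream;
    import java.net.Socket;

    class DrainAndClose {
        // read() returning -1 means the peer has closed its side; closing promptly
        // moves the socket out of CLOSE_WAIT instead of leaking the descriptor.
        static void drainAndClose(Socket socket) throws Exception {
            try (Socket s = socket) {
                InputStream in = s.getInputStream();
                byte[] buf = new byte[4096];
                while (in.read(buf) != -1) {
                    // handle incoming data
                }
            } // close() happens here
        }
    }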

That said, the questions are about those closed connections.

What closed connections? Your netstat display doesn't show any closed connections. And there is no evidence that your resource exhaustion problem has anything to do with closed connections.

1) What are they exactly?

They aren't.

2) Why do they keep dangling without being destroyed?

They don't.

3) Is there any setting (timeout or whatever) which can help to keep them under a reasonable limit?

As they don't exist, the question is meaningless.

Upvotes: 0
