Multiple TCP sockets, one stalled

Question

I'm looking for a starting point for understanding what could cause a socket stall, and would appreciate any insights you might have.

The server is a modern dual-socket Xeon (2 x 6 cores @ 3.5 GHz) running Windows Server 2012. In a single process there are 6 blocking TCP sockets with default options, each running on its own thread (no NUMA node or core affinity specified). Five of them are connected to the same remote server and receive very heavy loads (hundreds of thousands of small ~75-byte messages per second). The last socket is connected to a different server and carries a very light send/receive load for administrative messaging.

The problem I ran into was a 5-second stall on the admin messaging socket. Multiple send calls on the socket returned successfully, but for 5 seconds nothing was received from the remote server (a protocol ack should arrive within milliseconds) and nothing was received BY the remote admin server. It was as if that socket just turned off for a bit. After the 5-second stall, all of the acks came in a burst, and everything continued normally afterwards. During the stall the other sockets were receiving far more messages than normal, but there was no indication of any interruption or stall on them: the data logs showed nothing unusual (light logging, maybe 500 msgs/sec).

From what I understand, a successful socket send call does not guarantee that the data has gone out on the wire, only that it was handed over to the TCP stack. So I'm trying to understand the different scenarios that could cause a 5-second stall on the admin socket. Is it possible that, because of the large amount of data being received, the TCP stack was essentially overwhelmed and prioritized the sockets that were most heavily used? What other situations could have caused this?

Thank you!

rodolk · Accepted Answer

If the sockets are receiving hundreds of thousands of 75-byte messages per second, the server may be at maximum capacity for some resource. Probably not bandwidth: 100K messages of 75 bytes per second is only about 60 Mbps of payload. But it could be CPU utilization.
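As a rough check of the bandwidth side, the arithmetic can be sketched in a few lines. This is only an estimate under stated assumptions: the helper name `estimated_mbps` is made up for illustration, the 40-byte overhead assumes plain IPv4 and TCP headers with no options, and the "one message per segment" case is a pessimistic assumption, since TCP may pack several small messages into one segment.

```python
def estimated_mbps(msgs_per_sec, payload_bytes, overhead_bytes=0):
    """Approximate bit rate in megabits per second (hypothetical helper)."""
    bits_per_sec = msgs_per_sec * (payload_bytes + overhead_bytes) * 8
    return bits_per_sec / 1_000_000

# Payload only: 100,000 messages of 75 bytes each per second.
payload_only = estimated_mbps(100_000, 75)

# Pessimistic case: every message travels in its own segment and carries
# 20 bytes of IPv4 header plus 20 bytes of TCP header (assumed values).
one_msg_per_segment = estimated_mbps(100_000, 75, overhead_bytes=40)

print(f"payload only:        {payload_only:.0f} Mbps")
print(f"one msg per segment: {one_msg_per_segment:.0f} Mbps")
```

Both figures are tens of Mbps per 100K messages, so even several hundred thousand messages per second would stay below what a gigabit link can carry (the link speed is not stated in the question). That is why CPU is the more likely suspect.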

You should use two tools to understand your problem:

perfmon, to see CPU utilization (user and privileged time, https://technet.microsoft.com/en-us/library/aa173932(v=sql.80).aspx), memory, bandwidth, and disk queue length. You can also check the number of interrupts and context switches with perfmon.
A sniffer like Wireshark, to see whether data is actually being transmitted and responses received at the TCP level.

Something else I would do is write a timestamp right after the send call, and right before and after the read call, in the thread in charge of the admin socket. Maybe it is a coding problem.
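The timestamping idea could look roughly like the following. This is a minimal sketch, not the asker's code: `timed_send`, `timed_recv`, and the `events` list are hypothetical names, and a local `socket.socketpair()` stands in for the real admin connection so the example runs on its own.

```python
import socket
import time

def timed_send(sock, data, events):
    """Send data and record the time at which the send call returned."""
    sock.sendall(data)
    events.append(("send_returned", time.perf_counter()))

def timed_recv(sock, bufsize, events):
    """Record the time right before and right after a blocking read."""
    events.append(("recv_start", time.perf_counter()))
    data = sock.recv(bufsize)
    events.append(("recv_end", time.perf_counter()))
    return data

# Demonstration on a local socket pair (stand-in for the admin socket).
admin, remote = socket.socketpair()
events = []

timed_send(admin, b"admin-msg", events)
remote.sendall(b"ack")              # stand-in for the remote server's ack
reply = timed_recv(admin, 1024, events)

admin.close()
remote.close()

for name, stamp in events:
    print(f"{stamp:.6f} {name}")
```

If the gap between `recv_start` and `recv_end` is about 5 seconds while the sniffer shows the ack arriving on time, the delay is on the local machine or in the application. If the sniffer shows the ack itself arriving late, the delay is upstream of the read call.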

The fact that send calls return successfully doesn't mean the data was sent immediately. With TCP, the data is first copied into the socket's send buffer, and the TCP stack transmits it to the other end from there.
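A small loopback experiment illustrates this buffering. It is a sketch under assumptions: everything runs on 127.0.0.1, the accepting side deliberately never reads, and the 64 MiB cap is an arbitrary safety limit. It shows that send keeps reporting success while the stack has buffer space, whether or not the peer application has consumed anything.

```python
import socket

# Local listener that accepts the connection but never calls recv().
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))     # port 0: let the OS pick a free port
listener.listen(1)

client = socket.create_connection(listener.getsockname())
peer, _ = listener.accept()

# Non-blocking mode so a full buffer raises instead of blocking forever.
client.setblocking(False)

message = b"x" * 75                 # same size as the messages in the question
accepted = 0                        # bytes the TCP stack has taken from us
try:
    while accepted < 64 * 1024 * 1024:      # arbitrary safety cap
        accepted += client.send(message)
except BlockingIOError:
    pass                            # send and receive buffers are now full

print(f"send() accepted {accepted} bytes; the peer application read 0")

client.close()
peer.close()
listener.close()
```

The exact byte count depends on the operating system's buffer sizes and tuning, so it will differ between machines. The point is only that a successful send means "buffered by the local stack", not "delivered to the remote application".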

If your system is CPU bound (perfmon will show whether this is true), then you should pay attention to the comments written by @EJP; this is something that can happen when the machine is under heavy load. With the tools I mentioned, you can see whether the receive window of the admin socket is closed or whether the read call on the admin socket is simply taking a long time.
