Matt

Reputation: 27001

Gracefully shutting down a TcpListener, waiting for existing TcpClients to finish

I am attempting to shutdown a TCPListener gracefully - meaning, if any clients have connected, wait for that request to be served, then disconnect gracefully.

using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

namespace Server {
    class Program {
        static void Main(string[] args) {
            Console.WriteLine("Starting Server");
            CancellationTokenSource cts = new CancellationTokenSource();
            Server cc = new Server();
            // In production, this will not be a task, but its own thread
            var t = cc.StartListener(cts.Token);
            Console.WriteLine("Server Started - Press any key to exit to stop the listener");
            Console.ReadKey(true);
            Console.WriteLine("\r\nStopping the listener");
            cts.Cancel();
            // Wait for the task (which should be a thread) to finish
            Task[] ts = new Task[1] { t };
            Task.WaitAll(ts);
            Console.WriteLine("\r\nListener stopped - exiting");
        }
    }

    public class Server{
        public async Task StartListener(CancellationToken cts) {
            tcpListener = new TcpListener(IPAddress.Any, 4321);
            tcpListener.Start();

            Console.WriteLine();
            // Keep accepting clients until the cancellation token
            while (!cts.IsCancellationRequested) {
                var tcpClient = await tcpListener.AcceptTcpClientAsync().ConfigureAwait(false);
                // Increment the count of outstanding clients
                Interlocked.Increment(ref c);
                // When we are done, use a continuation to decrement the count
                ProcessClient(tcpClient).ContinueWith((_t) => Interlocked.Decrement(ref c));
                Console.Write("\b\b\b\b" + c);
            }

            Console.WriteLine($"\r\nWaiting for {c} connections to finish");
            // Stop the listener
            tcpListener.Stop();

            // Stick around until all clients are done (poll instead of hot-spinning)
            while (c > 0) { await Task.Delay(10); }
            Console.WriteLine("Done");
        }

        int c = 0;

        public TcpListener tcpListener;

        static Random random = new Random();

        private async Task ProcessClient(TcpClient tcpClient) {
            var ns = tcpClient.GetStream();
            try {
                byte[] b = new byte[16];
                await ns.ReadAsync(b, 0, 16);
                // Introduce a random delay to simulate 'real world' conditions
                await Task.Delay(random.Next(100, 500)); 
                // Write back the payload we receive (should be a guid, i.e 16-bytes)
                await ns.WriteAsync(b, 0, 16);
            } catch (Exception ex) {
                Console.WriteLine(ex.Message);
            }
            finally {
                tcpClient.Close();
            }
        }
    }
}

And here is my client

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Sockets;
using System.Threading.Tasks;

namespace client {
    class Program {
        static void Main(string[] args) {
            List<Task> ts = new List<Task>();

            for (int i = 0; i < 5000; i++) {
                var t = Connect(i);
                ts.Add(t);
            }

            Task.WaitAll(ts.ToArray());
            Console.WriteLine("done - exiting, but first \r\n");

            // Group all the messages so they are only output once
            foreach(var m in messages.GroupBy(x => x).Select(x => (x.Count() + " " + x.Key))) {
                Console.WriteLine(m);
            }
        }

        static object o = new Object();

        static Random r = new Random();
        static List<string> messages = new List<string>();

        static async Task Connect(int i) {
            try {
                // Delay below will simulate requests coming in over time
                await Task.Delay(r.Next(0, 10000));
                TcpClient tcpClient = new TcpClient();
                await tcpClient.ConnectAsync("127.0.0.1", 4321);
                using(var ns = tcpClient.GetStream()) {
                    var g = Guid.NewGuid();
                    // Send a guid
                    var bytes = g.ToByteArray();
                    await ns.WriteAsync(bytes, 0, 16);
                    // Read guid back out
                    var outputBytes = new byte[16];
                    await ns.ReadAsync(outputBytes, 0, 16);
                    // Did we get the right value back?
                    var og = new Guid(outputBytes);
                }
            } catch (Exception ex) {
                lock(o) {
                    var message = ex.Message.Length <= 150 ? ex.Message + "..." : ex.Message.Substring(0, 149);
                    messages.Add(message);
                }
            }
        }
    }
}

If I stop the server, but the client keeps going, obviously, I get a bunch of

No connection could be made because the target machine actively refused it. [::ffff:127.0.0.1]:4321...

That's expected - what I don't understand is why the client still reports some connections (very few) as forcibly closed.

Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.....

Upvotes: 1

Views: 622

Answers (1)

Peter Duniho

Reputation: 70701

That's expected - what I don't understand is why the client still reports some connections (very few) as forcibly closed.

There are two things stymieing your expectations: you aren't using graceful closure on the connections, and (much more important) client connections are accepted by the network driver on your behalf and put into a "backlog".

I modified your code example to display the current system time for key points:

  • For the client, the earliest time any given error message was produced
  • For the server, the time the user pressed a key, and the time when the server believed all clients had been successfully closed

I also added two-second waits after each of those reports in the server, so that the actual closing of the listening socket and the closing of the process both happen significantly later than the reports. I did this to make it easier to relate the output of the client process to that of the server process and its operations. The first delay also allows the backlog to fill completely, so it's easier to see its influence (see below). The second delay helps ensure that the closing of the process itself doesn't affect the way the connections are handled.
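A minimal sketch of what that server-side instrumentation might look like (the timestamp format, delay lengths, and exact placement are my assumptions, not the exact modification):

```csharp
// Hypothetical sketch of the timing instrumentation described above,
// dropped into the end of StartListener in place of the original reports.
Console.WriteLine($"\r\nWaiting for {c} connections to finish (current time is {DateTime.Now:HH:mm:ss.fff})");
await Task.Delay(2000);   // let the backlog fill before actually closing the listener
tcpListener.Stop();

while (c > 0) { await Task.Delay(10); }
Console.WriteLine($"Done (exiting at {DateTime.Now:HH:mm:ss.fff})");
await Task.Delay(2000);   // keep the process alive a little longer after the report
```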

The output from the server looks like this:

Starting Server

Server Started - Press any key to exit to stop the listener
163
Stopping the listener
156
Waiting for 156 connections to finish (current time is 04:10:27.040)
Done (exiting at 04:10:29.048)

Listener stopped - exiting

The client output looks like this:

done - exiting, but first

200: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host... (earliest: 04:10:29.047)
2514: No connection could be made because the target machine actively refused it 127.0.0.1:4321... (earliest: 04:10:29.202)

Note that:

  • The earliest time that a client socket failed with "unable to read" was two seconds after the server's first report. I.e. almost exactly when the server actually closes the listening socket.
  • Likewise, the earliest time that the client socket failed with "no connection could be made" was just a little later than the earliest "unable to read" error. Importantly, it's also just after the message that the server displays right after the last established client is done.

Sidebar: I made some other modifications to your code as well. In particular, I added explicit graceful shutdown, to eliminate that as a potential issue. Indeed, you'll notice that with my version of the code, you only ever get read or connection errors. Without the graceful shutdown logic, every now and then (but rarely) some of the errors will also be write errors, which is a consequence of a race between the client and server.
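Graceful closure on the server side might look something like this (a sketch under my own assumptions, not the exact modification): after the last write, shut down the sending half of the socket, then drain the stream until the peer closes its side.

```csharp
private async Task ProcessClientGracefully(TcpClient tcpClient) {
    var ns = tcpClient.GetStream();
    try {
        byte[] b = new byte[16];
        await ns.ReadAsync(b, 0, 16);
        await ns.WriteAsync(b, 0, 16);
        // Graceful closure: signal we are done sending, then read until
        // the client closes its side (ReadAsync returns 0).
        tcpClient.Client.Shutdown(SocketShutdown.Send);
        while (await ns.ReadAsync(b, 0, b.Length) > 0) { }
    } catch (Exception ex) {
        Console.WriteLine(ex.Message);
    } finally {
        tcpClient.Close();
    }
}
```

This removes the race the sidebar mentions: the server never closes the socket while the client might still be mid-write.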

It's also instructive to see that the number of read error messages is exactly 200.

So, what's going on?

  • The read errors are a result of clients whose connections were accepted by the network driver on the server's behalf, but never by the server program itself. I.e. these are connections in the listening socket's backlog. As far as the client knew, its connection was accepted, but in reality the server process never accepted it. So the network driver has to forcibly close these connections.

    This is also why it's interesting that the number of read errors is exactly 200. In the past, the default (and maximum) on Windows for non-server OS's was a backlog of only 5, while the server OS's allowed a backlog of 200. I'm running Windows 10 "Pro", so from this I surmise that Microsoft has also increased the default backlog for "Pro" versions of the OS. I'd guess that "Home" versions still have the old default of 5, but I could be wrong about that.

    Note that in reality, due to thread scheduling and timing issues, the actual number may sometimes be a little higher than the configured backlog value. But it will always be very close.
  • The connection errors are a result of clients which attempt to connect after the listening socket is actually closed. This is the normal thing you expected on closing the socket.
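The backlog described above can also be requested explicitly; TcpListener.Start has an overload that takes the backlog size (the OS may still cap whatever you ask for). A sketch:

```csharp
// Request an explicit backlog size for the listening socket.
var listener = new TcpListener(IPAddress.Any, 4321);
listener.Start(200);   // up to ~200 pending connections queued by the driver
// Connections sitting in that queue already look "established" to the
// client. If the listener stops before accepting them, the driver resets
// them, and the client sees "forcibly closed by the remote host".
listener.Stop();
```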

The bottom line: none of the connections which ever actually were completely established with the server had an error. The only errors which ever occurred, happened once the server started the process of shutting everything down. The exact error that occurs depends on how far into the connection process the client was able to get.

I will note that in addition to the lack of graceful closure, your code also has a couple of other problems: the Random class is not thread-safe, and there is no guarantee that a read operation will return the number of bytes you expect (if there is any data to be received, the read operation may return as little as just a single byte).
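Sketches of fixes for those two problems (the helper names are mine, not from the original code):

```csharp
// Loop until exactly `count` bytes have arrived, or the peer closes.
static async Task<bool> ReadExactAsync(NetworkStream ns, byte[] buffer, int count) {
    int total = 0;
    while (total < count) {
        int n = await ns.ReadAsync(buffer, total, count - total);
        if (n == 0) return false;   // connection closed before all bytes arrived
        total += n;
    }
    return true;
}

// Random is not thread-safe; one simple fix is to serialize access to it.
static readonly Random random = new Random();
static readonly object randomLock = new object();
static int NextDelay() { lock (randomLock) return random.Next(100, 500); }
```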

I think for the purpose of the proof-of-concept example you provided, those two other problems probably don't matter. Unsynchronized use of Random can corrupt its internal state (after which it may simply keep returning 0), but your code example didn't actually care about the exact delays, nor about whether the correct bytes were received. I ran a version of your code with those issues fixed, and it didn't seem to affect the overall behavior, nor would I have expected it to.

But at the end of the day, when dealing with network I/O, and especially where clients and servers are not very closely coordinated (e.g. the server doesn't shut down until it has some confirmation that no more client connections will be requested; in real life, this pretty much never happens :) ), errors are a fact of life, and the code must be robust in the face of them.

Upvotes: 2
