tivvit

Reputation: 36

Aerospike node connections overload - global connection problems

I am trying to use Aerospike DB in a forked server. My cluster consists of 6 nodes. I use the Python client, but that shouldn't matter.

First I created the client, connected it, and then forked the server. The problem with this solution is that when the cluster state changes (a node dies), is_connected() still returns True, but no read or write operation can be executed successfully. With this approach I had around 700 connections per node, which was fine for the cluster.

My second try was to create the client, then fork the server, and connect in each fork. This solution handles cluster changes correctly, but I ended up with around 7k connections per node, which makes the nodes unstable.

Any ideas on how to solve this better?

Upvotes: 1

Views: 685

Answers (2)

tivvit

Reputation: 36

We studied this issue more carefully and figured out that each process does not need its own client connection.

When forking, it is essential to use shared memory (shm). A working example is shown here: https://gist.github.com/tivvit/c3652fdb6208752188fc

Shm is essential because in this mode the parent process (the one that creates the connection) spawns a service thread which keeps the cluster state in the connection up to date.

If shm is omitted, nothing keeps the cluster state up to date. When the cluster state changes (a node dies), the processes keep using connections to the old cluster setup, which results in CONNECTION TIMEOUTS and other problems.

Another obvious issue is that you have to keep the parent process alive. If the parent process is killed, the service thread dies with it. So keep in mind that process daemonization has to happen before connecting to the cluster.
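
For reference, here is a minimal sketch of the pattern the gist demonstrates, assuming the standard aerospike Python client. The host address, worker count, and keys are placeholders, and as far as I know an empty 'shm' sub-dict in the client config enables shared-memory cluster tending with default settings:

    import os
    import aerospike

    config = {
        'hosts': [('127.0.0.1', 3000)],
        # The 'shm' sub-dict turns on shared-memory cluster tending;
        # an empty dict uses the client's defaults.
        'shm': {},
    }

    # Daemonize (if needed) BEFORE this point, then connect in the parent,
    # whose service thread keeps the shm cluster state current.
    client = aerospike.client(config).connect()

    for _ in range(4):
        if os.fork() == 0:
            # Child: reuses the parent's client; it reads the current
            # cluster state from shm instead of tending on its own.
            client.put(('test', 'demo', 'k-%d' % os.getpid()),
                       {'pid': os.getpid()})
            os._exit(0)

    for _ in range(4):
        os.wait()  # the parent must outlive the workers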

Maybe this should be mentioned in the documentation?

Upvotes: 0

Ronen Botzer

Reputation: 7117

In Python, each process will need to have its own client connecting to the server. As you suggested, after forking you should close and re-open the connection.
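
A minimal sketch of that pattern, again assuming the standard aerospike Python client; the host address and worker count are placeholders:

    import os
    import aerospike

    config = {'hosts': [('127.0.0.1', 3000)]}  # placeholder address

    for _ in range(4):
        if os.fork() == 0:
            # Child: open its own client after the fork, so every process
            # gets its own sockets and its own cluster-tending thread.
            client = aerospike.client(config).connect()
            client.put(('test', 'demo', 'k-%d' % os.getpid()),
                       {'pid': os.getpid()})
            client.close()  # close cleanly so the sockets are released
            os._exit(0)

    for _ in range(4):
        os.wait()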

I would guess that you are running into the CLOSE_WAIT issue. Long-running apps written in Java, C#, C, etc., connect to the cluster once and keep running and sending operations over that connection. Most dynamic languages in a server context work differently: processes are designed to accept a limited number of requests and are then terminated, with a process manager constantly forking new ones. This is common for Python apps behind WSGI, PHP inside Apache (mod_php) or behind FastCGI (PHP-FPM), Ruby apps behind Passenger, etc. Each time a process is terminated, the sockets it was using become unavailable for a period of time (usually around 4 minutes). If processes are configured to take a small number of requests, you will see the number of connections stuck in CLOSE_WAIT rise.
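
One way to check whether this is what is happening (psutil is my addition, not something from the thread) is to count the sockets stuck in CLOSE_WAIT toward the Aerospike service port:

    import psutil

    AEROSPIKE_PORT = 3000  # assumed service port

    # Count TCP connections to Aerospike that are stuck in CLOSE_WAIT.
    stuck = [c for c in psutil.net_connections(kind='tcp')
             if c.status == psutil.CONN_CLOSE_WAIT
             and c.raddr and c.raddr.port == AEROSPIKE_PORT]
    print("connections to Aerospike stuck in CLOSE_WAIT: %d" % len(stuck))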

The solution is to have each process take as large a number of requests as possible. Monitor your application processes to see whether their size stays stable, and as long as no memory leaks bloat them, raise the max-requests limit accordingly. The similar case in PHP comes from outdated server config recommendations that suggest a few hundred requests per process. PHP is now much more stable in terms of memory use, so that assumption is invalid and hurts performance, as well as having the side effect you described. Since both the Python and PHP clients can handle thousands of requests per second (~4.5Ktps per process in a recent test), a max-requests limit of 500 means the process is killed within a fraction of a second at peak.
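
A hedged sketch of that monitoring idea, again using psutil (my addition); the PID and sampling values are placeholders:

    import time
    import psutil

    def watch_rss(pid, samples=10, interval=5):
        """Print a worker's resident memory every `interval` seconds.

        If the RSS stays flat over many requests, the per-process
        max-requests limit can safely be raised.
        """
        proc = psutil.Process(pid)
        for _ in range(samples):
            rss_mb = proc.memory_info().rss / (1024.0 * 1024.0)
            print("pid %d rss: %.1f MB" % (pid, rss_mb))
            time.sleep(interval)

    watch_rss(12345)  # placeholder worker PID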

Upvotes: 2
