dtj

Reputation: 281

Aerospike losing documents when node goes down

I've been doing some tests with Aerospike and I noticed behavior different from what is advertised.

I have a cluster of 4 nodes running on AWS in the same AZ. The instances are t2.micro (1 CPU, 1 GB RAM, 25 GB SSD), running Amazon Linux with the Aerospike AMI.

aerospike.conf:

heartbeat {
        mode mesh
        port 3002
        mesh-seed-address-port XXX.XX.XXX.164 3002
        mesh-seed-address-port XXX.XX.XXX.167 3002
        mesh-seed-address-port XXX.XX.XXX.165 3002
        # internal AWS IPs
...
namespace teste2 {
        replication-factor 2
        memory-size 650M
        default-ttl 365d
        storage-engine device {
                file /opt/aerospike/data/bar.dat
                filesize 22G
                data-in-memory false
        }
}

What I did was a test to see if I would lose documents when a node goes down. For that I wrote a little Python code:

from __future__ import print_function
import aerospike
import pandas as pd
import numpy as np
import time
import sys

config = {
  'hosts': [ ('XX.XX.XX.XX', 3000), ('XX.XX.XX.XX', 3000),
             ('XX.XX.XX.XX', 3000), ('XX.XX.XX.XX', 3000)]
}  # external AWS IPs
client = aerospike.client(config).connect()

# Write one record per second with a sequential id, so gaps can be detected later
for i in range(1, 10000):
  key = ('teste2', 'setTest3', ''.join(('p', str(i))))
  try:
    client.put(key, {'id11': i})
    print(i)
  except Exception as e:
    print("error: {0}".format(e), file=sys.stderr)
  time.sleep(1)

I used this code just to insert a sequence of integers that I could check afterwards. I ran the code, and after a few seconds I stopped the Aerospike service on one node for 10 seconds, using sudo service aerospike stop and sudo service aerospike coldstart to restart it.

I waited a few seconds until the nodes finished all migrations and then executed the following Python script:

# pandas and numpy were already imported in the script above
query = client.query('teste2', 'setTest3')
query.select('id11')
te = []

def save_result(result):
    key, metadata, record = result   # each result is a (key, metadata, record) tuple
    te.append(record)

query.foreach(save_result)

d = pd.DataFrame(te)
d2 = d.sort_values(by='id11')        # d.sort(columns='id11') on older pandas versions
te2 = np.array(d2.id11)

# Print every id that is missing from the sequence
for i in range(1, len(te2)):
    if te2[i] != (te2[i-1] + 1):
        print('no %d' % int(te2[i-1] + 1))
print(te2)

And got the following response:

no 3
no 6
no 8
no 11
no 13
no 17
no 20
no 22
no 24
no 26
no 30
no 34
no 39
no 41
no 48
no 53
[ 1  2  5  7 10 12 16 19 21 23 25 27 28 29 33 35 36 37 38 40 43 44 45 46 47 51 52 54]

Is my cluster configured wrong, or is this normal?

PS: I tried to include as much information as I could. If you can suggest more information to include, I would appreciate it.

Upvotes: 3

Views: 331

Answers (1)

dtj

Reputation: 281

Actually I found a solution, and to be honest it is a pretty simple one.

In the configuration file there are some parameters for network communication between nodes, such as:

interval 150                    # Number of milliseconds between heartbeats
timeout 10                      # Number of heartbeat intervals to wait
                                # before timing out a node

These two parameters set the time it takes for the cluster to realize a node is down and out of the cluster (in this case 1.5 seconds).

What we found useful was to tune the write policies at the client to work along with these parameters.

Depending on the client, you will have policies such as the number of tries before the operation fails, the timeout for the operation, and the time between tries.

You just need to adapt the client parameters. For example, set the number of retries to 4 (each executed after 500 ms) and the timeout to 2 seconds. Doing that, the client will recognize the node is down and redirect the operation to another node, as in the sketch below.
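As an illustration, here is a minimal sketch of what that could look like with the Aerospike Python client used in the question. The policy field names are an assumption based on recent client versions (older versions use timeout and retry instead of total_timeout, max_retries and sleep_between_retries), so check the documentation for the client version you run:

import aerospike

# Hypothetical values matching the example above: 4 retries, 500 ms between
# attempts, 2 s overall timeout. Field names assume a recent Python client.
write_policy = {
    'max_retries': 4,
    'sleep_between_retries': 500,   # milliseconds between attempts
    'total_timeout': 2000,          # give up after 2 seconds overall
}

config = {
    'hosts': [('XX.XX.XX.XX', 3000)],
    'policies': {'write': write_policy},   # default policy for all writes
}
client = aerospike.client(config).connect()

# The same policy can also be passed per operation:
client.put(('teste2', 'setTest3', 'p1'), {'id11': 1}, policy=write_policy)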

This setup can put a lot of extra load on the cluster, but it worked for us.

Upvotes: 3
