Reputation: 281
I've been doing some tests using Aerospike and I noticed behavior different from what is advertised.
I have a cluster of 4 nodes running on AWS in the same AZ. The instances are t2.micro (1 CPU, 1 GB RAM, 25 GB SSD) running Amazon Linux from the Aerospike AMI.
aerospike.conf:
heartbeat {
    mode mesh
    port 3002
    mesh-seed-address-port XXX.XX.XXX.164 3002
    mesh-seed-address-port XXX.XX.XXX.167 3002
    mesh-seed-address-port XXX.XX.XXX.165 3002
    # internal AWS IPs
...
namespace teste2 {
    replication-factor 2
    memory-size 650M
    default-ttl 365d
    storage-engine device {
        file /opt/aerospike/data/bar.dat
        filesize 22G
        data-in-memory false
    }
}
I ran a test to see whether I would lose documents when a node goes down. For that I wrote a small script in Python:
from __future__ import print_function
import aerospike
import pandas as pd
import numpy as np
import time
import sys

config = {
    'hosts': [('XX.XX.XX.XX', 3000), ('XX.XX.XX.XX', 3000),
              ('XX.XX.XX.XX', 3000), ('XX.XX.XX.XX', 3000)]
}  # external AWS IPs
client = aerospike.client(config).connect()

for i in range(1, 10000):
    key = ('teste2', 'setTest3', ''.join(('p', str(i))))
    try:
        client.put(key, {'id11': i})
        print(i)
    except Exception as e:
        print("error: {0}".format(e), file=sys.stderr)
    time.sleep(1)
I used this code just to insert a sequence of integers that I could check afterwards. I ran it, and after a few seconds I stopped the Aerospike service on one node for 10 seconds, using sudo service aerospike stop
and sudo service aerospike coldstart
to restart it.
I waited a few seconds until the nodes had finished migrating and then ran the following Python script:
query = client.query('teste2', 'setTest3')
query.select('id11')
te = []

def save_result(result):
    # Each result is a (key, metadata, record) tuple.
    key, metadata, record = result
    te.append(record)

query.foreach(save_result)
d = pd.DataFrame(te)
d2 = d.sort_values(by='id11')
te2 = np.array(d2.id11)
# Report the first missing id of every gap in the sorted sequence.
for i in range(1, len(te2)):
    if te2[i] != te2[i-1] + 1:
        print('no %d' % int(te2[i-1] + 1))
print(te2)
And got as response:
no 3
no 6
no 8
no 11
no 13
no 17
no 20
no 22
no 24
no 26
no 30
no 34
no 39
no 41
no 48
no 53
[ 1 2 5 7 10 12 16 19 21 23 25 27 28 29 33 35 36 37 38 40 43 44 45 46 47 51 52 54]
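Note that the check above prints only the first missing id of each gap, so a gap such as 3, 4 shows up as a single "no 3" line. A minimal sketch that lists every missing id instead is given here; the helper name missing_ids and its arguments are made up for illustration and are not part of the Aerospike client:

```python
def missing_ids(found, first, last):
    """Return every id in [first, last] that does not appear in `found`."""
    present = set(int(x) for x in found)
    return [i for i in range(first, last + 1) if i not in present]


# The first ten ids from the query output above.
found = [1, 2, 5, 7, 10, 12, 16, 19, 21, 23]
print(missing_ids(found, 1, 23))
```

Passing te2 together with the first and last id that the insert script actually printed gives the full list of lost writes.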
Is my cluster misconfigured, or is this normal?
PS: I tried to include as much as I could. If you can suggest more information to include, I would appreciate it.
Upvotes: 3
Views: 331
Reputation: 281
Actually I found a solution, and it is pretty simple and foolish to be honest.
In the configuration file we have some parameters for network communication between nodes, such as:
interval 150 # Number of milliseconds between heartbeats
timeout 10 # Number of heartbeat intervals to wait
# before timing out a node
These two parameters set how long it takes the cluster to realize that a node is down and remove it from the cluster (in this case, 150 ms x 10 = 1.5 seconds).
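The arithmetic behind that figure, as a small sketch (the function name is hypothetical; only the two values come from the configuration above):

```python
def detection_window_ms(interval_ms, timeout_intervals):
    """Time the cluster needs to declare a node dead, in milliseconds."""
    return interval_ms * timeout_intervals


# interval 150, timeout 10, as in the heartbeat settings above.
window = detection_window_ms(150, 10)
print("{0} ms".format(window))
```

Any write sent to the stopped node during this window can fail, which is the interval the client settings have to cover.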
What we found useful was to tune the write policies on the client to work with these parameters.
Depending on the client, you will have policies such as the number of retries before the operation fails, the timeout for the operation, and the time between retries.
You just need to adapt the client parameters. For example, set the number of retries to 4 (each executed after 500 ms) and the timeout to 2 seconds. That way the retries continue past the 1.5 seconds the cluster needs to drop the node, so the client recognizes that the node is down and redirects the operation to another node.
This setup can put a heavy extra load on the cluster, but it worked for us.
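The exact policy names differ between clients and client versions, so check the documentation for the one you use. As a rough illustration of the idea only, here is an application-level sketch that retries a write with a pause between attempts and an overall deadline. The helper put_with_retry and its parameters are hypothetical and are not part of the Aerospike API; the client's built-in policies are the preferable place to set this.

```python
import time


def put_with_retry(put, key, bins, retries=4, sleep_s=0.5, deadline_s=2.0):
    """Call put(key, bins), retrying on failure.

    Gives up after `retries` retries, or when the next retry would start
    after `deadline_s` seconds. Returns the number of retries that were used.
    """
    start = time.time()
    attempt = 0
    while True:
        try:
            put(key, bins)
            return attempt
        except Exception:
            attempt += 1
            out_of_time = (time.time() - start) + sleep_s > deadline_s
            if attempt > retries or out_of_time:
                raise  # re-raise the last error to the caller
            time.sleep(sleep_s)


# Usage with the insert loop from the question (assumes a connected client):
# put_with_retry(client.put, key, {'id11': i})
```

With these defaults the retries are spread over more than the 1.5 seconds the cluster needs to drop the node, so a later attempt has a chance of being routed to a live node.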
Upvotes: 3