Reputation: 444
I have an Aerospike cluster of 15 nodes. The cluster performs fairly well under a normal load of about 10k TPS. Today I ran some tests at a higher throughput, raising the load to around 130k-150k TPS. I observed that some nodes intermittently went down and automatically came back up after a few seconds. Because of these nodes going down, we are getting heartbeat timeouts and, consequently, read timeouts.
Each cluster node has 8 cores and 120 GB of RAM. I am storing data in memory. All nodes have plenty of space remaining: out of a total cluster capacity of 1.8 TB (15 × 120 GB), only 275 GB is in use. Also, the network is not at all flaky. All of these machines are in a data centre and have high bandwidth.
Some observations made by monitoring AMC:
Some error logs in cluster nodes:
Sep 15 2020 17:00:43 GMT: WARNING (hb): (hb.c:4864) (repeated:5) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:00:43 GMT: WARNING (socket): (socket.c:808) (repeated:5) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:00:53 GMT: WARNING (socket): (socket.c:740) (repeated:3) Timeout while connecting
Sep 15 2020 17:00:53 GMT: WARNING (hb): (hb.c:4864) (repeated:3) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:00:53 GMT: WARNING (socket): (socket.c:808) (repeated:3) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:01:03 GMT: WARNING (socket): (socket.c:740) (repeated:1) Timeout while connecting
Sep 15 2020 17:01:03 GMT: WARNING (hb): (hb.c:4864) (repeated:1) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:01:03 GMT: WARNING (socket): (socket.c:808) (repeated:1) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:01:13 GMT: WARNING (socket): (socket.c:740) (repeated:2) Timeout while connecting
Sep 15 2020 17:01:13 GMT: WARNING (hb): (hb.c:4864) (repeated:2) could not create heartbeat connection to node {10.33.162.134:2057}
Sep 15 2020 17:01:13 GMT: WARNING (socket): (socket.c:808) (repeated:2) Error while connecting socket to 10.33.162.134:2057
Sep 15 2020 17:02:44 GMT: WARNING (socket): (socket.c:740) Timeout while connecting
Sep 15 2020 17:02:44 GMT: WARNING (socket): (socket.c:808) Error while connecting socket to 10.33.54.144:2057
Sep 15 2020 17:02:44 GMT: WARNING (hb): (hb.c:4864) could not create heartbeat connection to node {10.33.54.144:2057}
Sep 15 2020 17:02:53 GMT: WARNING (socket): (socket.c:740) (repeated:1) Timeout while connecting
We also saw error logs like the following on the nodes that were going down:
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9280f220a0102 on fd 4155 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9b676220a0102 on fd 4149 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9fbd6200a0102 on fd 42 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb96d3d220a0102 on fd 4444 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb99036210a0102 on fd 4278 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9f102220a0102 on fd 4143 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb91822210a0102 on fd 4515 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9e5ff200a0102 on fd 4173 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb93f65200a0102 on fd 38 failed : Broken pipe
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9132f220a0102 on fd 4414 failed : Connection reset by peer
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb939be210a0102 on fd 4567 failed : Connection reset by peer
Sep 15 2020 17:08:58 GMT: WARNING (hb): (hb.c:5122) sending mesh message to bb9b19a220a0102 on fd 4165 failed : Broken pipe
Attaching the aerospike.conf file here:
service {
    user root
    group root
    service-threads 12
    transaction-queues 12
    transaction-threads-per-queue 4
    proto-fd-max 50000
    migrate-threads 1
    pidfile /var/run/aerospike/asd.pid
}

logging {
    file /var/log/aerospike/aerospike.log {
        context any info
        context migrate debug
    }
}

network {
    service {
        address any
        port 3000
    }

    heartbeat {
        mode mesh
        port 2057
        mesh-seed-address-port 10.34.154.177 2057
        mesh-seed-address-port 10.34.15.40 2057
        mesh-seed-address-port 10.32.255.229 2057
        mesh-seed-address-port 10.33.54.144 2057
        mesh-seed-address-port 10.32.190.157 2057
        mesh-seed-address-port 10.32.101.63 2057
        mesh-seed-address-port 10.34.2.241 2057
        mesh-seed-address-port 10.32.214.251 2057
        mesh-seed-address-port 10.34.30.114 2057
        mesh-seed-address-port 10.33.162.134 2057
        mesh-seed-address-port 10.33.190.57 2057
        mesh-seed-address-port 10.34.61.109 2057
        mesh-seed-address-port 10.34.47.19 2057
        mesh-seed-address-port 10.33.34.24 2057
        mesh-seed-address-port 10.34.118.182 2057
        interval 150
        timeout 20
    }

    fabric {
        port 3001
    }

    info {
        port 3003
    }
}

namespace PS1 {
    replication-factor 2
    memory-size 70G
    single-bin false
    data-in-index false
    storage-engine memory
    stop-writes-pct 90
    high-water-memory-pct 75
}

namespace LS1 {
    replication-factor 2
    memory-size 30G
    single-bin false
    data-in-index false
    storage-engine memory
    stop-writes-pct 90
    high-water-memory-pct 75
}
Any possible explanations for this?
Upvotes: 0
Views: 895
Reputation: 2939
It seems like the nodes are having network connectivity issues at such a high throughput. This can have different root causes, from a simple network-related bottleneck (bandwidth, packets per second) to something on the node itself getting in the way of interfacing properly with the network (a surge in soft interrupts, improper distribution of network queues, CPU thrashing). This would prevent heartbeat connections/messages from going through, resulting in nodes leaving the cluster until the situation recovers. If running in a cloud/virtualized environment, some hosts may also have noisier neighbors than others.
The increase in the number of connections is a symptom: any slowdown on a node causes the client to compensate in order to sustain throughput, which increases the number of connections and can lead to a downward spiral.
Finally, a single node leaving or joining the cluster shouldn't impact read transactions much. Check your client policy and make sure socketTimeout / totalTimeout / maxRetries, etc. are set correctly so that a read can quickly retry against a different replica.
This article can help on this last point: https://discuss.aerospike.com/t/understanding-timeout-and-retry-policies/2852/3
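For reference, here is a minimal sketch of such a read policy using the Aerospike Java client. The timeout values, the seed host, and the set/key names ("demo-set", "some-key") are illustrative assumptions, not tuned recommendations; the namespace PS1 and port 3000 come from the configuration in the question.

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Host;
import com.aerospike.client.Key;
import com.aerospike.client.Record;
import com.aerospike.client.policy.ClientPolicy;
import com.aerospike.client.policy.Policy;
import com.aerospike.client.policy.Replica;

public class ReadPolicyExample {
    public static void main(String[] args) {
        ClientPolicy clientPolicy = new ClientPolicy();

        // Read policy: fail a single attempt quickly and retry on another node.
        Policy readPolicy = new Policy();
        readPolicy.socketTimeout = 100;        // ms per network attempt (illustrative value)
        readPolicy.totalTimeout = 500;         // ms across all attempts (illustrative value)
        readPolicy.maxRetries = 2;             // retries after the initial attempt
        readPolicy.sleepBetweenRetries = 0;    // retry immediately
        readPolicy.replica = Replica.SEQUENCE; // try master first, then replica, in sequence

        // Make this the default read policy for the client.
        clientPolicy.readPolicyDefault = readPolicy;

        // 10.34.154.177:3000 is one of the nodes from the question's config.
        AerospikeClient client = new AerospikeClient(clientPolicy, new Host("10.34.154.177", 3000));
        try {
            // "demo-set" and "some-key" are hypothetical names used only for illustration.
            Record record = client.get(readPolicy, new Key("PS1", "demo-set", "some-key"));
            System.out.println(record);
        } finally {
            client.close();
        }
    }
}

With Replica.SEQUENCE and a socketTimeout well below totalTimeout, a read that times out against a slow or departing node is retried against the other replica instead of failing the whole transaction.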
Upvotes: 2