Sandeepan Nath

Reputation: 10294

Aerospike - No improvements in latency on moving to in-memory cluster from on-disk cluster

To begin with, we had an Aerospike cluster of 5 i2.2xlarge nodes in AWS, which our production fleet of around 200 servers was using to store and retrieve data. The Aerospike config of the cluster was as follows -

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        service-threads 8
        transaction-queues 8
        transaction-threads-per-queue 4
        fabric-workers 8
        transaction-pending-limit 100
        proto-fd-max 25000
}

logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode mesh
                port 3002 # Heartbeat port for this node.

                # List one or more other nodes, one ip-address & port per line:
                mesh-seed-address-port <IP> 3002
                mesh-seed-address-port <IP> 3002
                mesh-seed-address-port <IP> 3002
                mesh-seed-address-port <IP> 3002
               # mesh-seed-address-port <IP> 3002
                interval 250
                timeout 10
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

namespace FC {
        replication-factor 2
        memory-size 7G
        default-ttl 30d # 30 days, use 0 to never expire/evict.


        high-water-disk-pct 80    # How full may the disk become before the server begins eviction
        high-water-memory-pct 70  # Evict non-zero TTL data if capacity exceeds 70% of 7GB
        stop-writes-pct 90        # Stop writes if capacity exceeds 90% of 7GB

        storage-engine device {
                device /dev/xvdb1
                write-block-size 256K
        }

}

It was handling the traffic for the "FC" namespace properly, with latencies within 14 ms, as shown in the following graph plotted using Graphite - [Graphite latency graph]

However, when we turned on another namespace with much higher traffic on the same cluster, it started giving a lot of timeouts and higher latencies as we scaled up the number of servers using the same cluster of 5 nodes (increasing the number of servers step by step from 20 to 40 to 60), with the following namespace configuration -

namespace HEAVYNAMESPACE {
        replication-factor 2
        memory-size 35G
        default-ttl 30d # 30 days, use 0 to never expire/evict.

        high-water-disk-pct 80    # How full may the disk become before the server begins eviction
        high-water-memory-pct 70  # Evict non-zero TTL data if capacity exceeds 70% of 35GB
        stop-writes-pct 90        # Stop writes if capacity exceeds 90% of 35GB

        storage-engine device {
                device /dev/xvdb8
                write-block-size 256K
        }

}

Following were the observations -

----FC Namespace----

20 - servers, 6k Write TPS, 16K Read TPS
set latency = 10ms
set timeouts = 1
get latency = 15ms
get timeouts = 3

40 - servers, 12k Write TPS, 17K Read TPS
set latency = 12ms
set timeouts = 1 
get latency = 20ms
get timeouts = 5

60 - servers, 17k Write TPS, 18K Read TPS
set latency = 25ms
set timeouts = 5
get latency = 30ms
get timeouts = 10-50 (fluctuating)

----HEAVYNAMESPACE----

20 - del servers, 6k Write TPS, 16K Read TPS
set latency = 7ms
set timeouts = 1
get latency = 5ms
get timeouts = 0
no of keys = 47 million x 2
disk usage = 121 gb
ram usage = 5.62 gb

40 - del servers, 12k Write TPS, 17K Read TPS
set latency = 15ms
set timeouts = 5
get latency = 12ms
get timeouts = 2

60 - del servers, 17k Write TPS, 18K Read TPS
set latency = 25ms
set timeouts = 25-75 (fluctuating)
get latency = 25ms
get timeouts = 2-15 (fluctuating)

* Set latency refers to the latency in setting Aerospike cache keys; get latency refers to the latency in getting keys.
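For reference, the per-namespace figures above (number of keys, disk usage, RAM usage) were taken from the cluster itself. A rough sketch of how they can be pulled, assuming the Aerospike tools (asadm/asinfo) are installed on a node - exact stat names vary by server version -

# Per-namespace summary (objects, memory and disk usage) across the cluster
asadm -e "info namespace"

# Raw statistics for the FC namespace from a single node
asinfo -v "namespace/FC" | tr ';' '\n' | grep -E 'objects|used'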

We had to turn off the namespace "HEAVYNAMESPACE" after reaching 60 servers.

We then started a fresh POC with a cluster whose nodes were r3.4xlarge AWS instances (details here: https://aws.amazon.com/ec2/instance-types/), the key difference in the Aerospike configuration being the use of memory only for caching, hoping that it would give better performance. Here is the aerospike.conf file -

service {
        user root
        group root
        paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
        pidfile /var/run/aerospike/asd.pid
        service-threads 16
        transaction-queues 16
        transaction-threads-per-queue 4
        proto-fd-max 15000
}

logging {
        # Log file must be an absolute path.
        file /var/log/aerospike/aerospike.log {
                context any info
        }
}

network {
        service {
                address any
                port 3000
        }

        heartbeat {
                mode mesh
                port 3002 # Heartbeat port for this node.

                # List one or more other nodes, one ip-address & port per line:
                mesh-seed-address-port <IP> 3002
                mesh-seed-address-port <IP> 3002
                mesh-seed-address-port <IP> 3002
                mesh-seed-address-port <IP> 3002
                mesh-seed-address-port <IP> 3002

                interval 250
                timeout 10
        }

        fabric {
                port 3001
        }

        info {
                port 3003
        }
}

namespace FC {
        replication-factor 2
        memory-size 30G
        storage-engine memory
        default-ttl 30d # 30 days, use 0 to never expire/evict.

        high-water-memory-pct 80 # Evict non-zero TTL data if capacity exceeds 80% of 30GB
        stop-writes-pct 90       # Stop writes if capacity exceeds 90% of 30GB

}

We began with the FC namespace only, and decided to move HEAVYNAMESPACE over as well only if we saw significant improvements with FC, but we didn't. Here are the current observations with different combinations of node count and server count -

Current stats

Observation points:
Point 1 - 4 nodes serving 130 servers.
Point 2 - 5 nodes serving 80 servers.
Point 3 - 5 nodes serving 100 servers.

These observation points are highlighted in the graphs below -

Get latency - [Graphite graph]

Set successes (giving a measure of the load handled by the cluster) - [Graphite graph]

We also observed that -

Conclusion

We are not seeing the improvements we had expected from the memory-only configuration. We would like some pointers on scaling up at the same cost - either by tweaking the Aerospike configuration, or by using a more suitable AWS instance type (even if that leads to cost cutting).

Update

Output of the top command on one of the Aerospike servers, to show si (as pointed out by @Sunil in his answer) -

$ top
top - 08:02:21 up 188 days, 48 min,  1 user,  load average: 0.07, 0.07, 0.02
Tasks: 179 total,   1 running, 178 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  0.1%sy,  0.0%ni, 99.4%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  125904196k total,  2726964k used, 123177232k free,   148612k buffers
Swap:        0k total,        0k used,        0k free,   445968k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 63421 root      20   0 5217m 1.6g 4340 S  6.3  1.3 461:08.83 asd

If I am not wrong, the si appears to be 0.2%. I checked the same on all the nodes of the cluster; it is 0.2% on one node and 0.1% on the other three.
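Note that the Cpu(s) line in top is an average over all cores, so a single core saturated with softirqs for the NIC queue could hide behind a low overall si. As a sketch (assuming the sysstat package is installed), per-core softirq load can be checked with -

# Per-CPU utilisation, 5 samples at 1-second intervals;
# look for one core with a much higher %soft than the rest.
mpstat -P ALL 1 5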

Also, here is the output of the network stats on the same node -

$ sar -n DEV 10 10
Linux 4.4.30-32.54.amzn1.x86_64 (ip-10-111-215-72)      07/10/17        _x86_64_        (16 CPU)

08:09:16        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
08:09:26           lo     12.20     12.20      5.61      5.61      0.00      0.00      0.00      0.00
08:09:26         eth0   2763.60   1471.60    299.24    233.08      0.00      0.00      0.00      0.00

08:09:26        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
08:09:36           lo     12.00     12.00      5.60      5.60      0.00      0.00      0.00      0.00
08:09:36         eth0   2772.60   1474.50    300.08    233.48      0.00      0.00      0.00      0.00

08:09:36        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
08:09:46           lo     17.90     17.90     15.21     15.21      0.00      0.00      0.00      0.00
08:09:46         eth0   2802.80   1491.90    304.63    245.33      0.00      0.00      0.00      0.00

08:09:46        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
08:09:56           lo     12.00     12.00      5.60      5.60      0.00      0.00      0.00      0.00
08:09:56         eth0   2805.20   1494.30    304.37    237.51      0.00      0.00      0.00      0.00

08:09:56        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
08:10:06           lo      9.40      9.40      5.05      5.05      0.00      0.00      0.00      0.00
08:10:06         eth0   3144.10   1702.30    342.54    255.34      0.00      0.00      0.00      0.00

08:10:06        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
08:10:16           lo     12.00     12.00      5.60      5.60      0.00      0.00      0.00      0.00
08:10:16         eth0   2862.70   1522.20    310.15    238.32      0.00      0.00      0.00      0.00

08:10:16        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
08:10:26           lo     12.00     12.00      5.60      5.60      0.00      0.00      0.00      0.00
08:10:26         eth0   2738.40   1453.80    295.85    231.47      0.00      0.00      0.00      0.00

08:10:26        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
08:10:36           lo     11.79     11.79      5.59      5.59      0.00      0.00      0.00      0.00
08:10:36         eth0   2758.14   1464.14    297.59    231.47      0.00      0.00      0.00      0.00

08:10:36        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
08:10:46           lo     12.00     12.00      5.60      5.60      0.00      0.00      0.00      0.00
08:10:46         eth0   3100.40   1811.30    328.31    289.92      0.00      0.00      0.00      0.00

08:10:46        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
08:10:56           lo      9.40      9.40      5.05      5.05      0.00      0.00      0.00      0.00
08:10:56         eth0   2753.40   1460.80    297.15    231.98      0.00      0.00      0.00      0.00

Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
Average:           lo     12.07     12.07      6.45      6.45      0.00      0.00      0.00      0.00
Average:         eth0   2850.12   1534.68    307.99    242.79      0.00      0.00      0.00      0.00

From the above, I think the total number of packets handled per second should be 2850.12 + 1534.68 = 4384.8 (sum of rxpck/s and txpck/s), which is well within the 250K packets per second mentioned in the Amazon EC2 deployment guide on the Aerospike site, referred to in @RonenBotzer's answer.
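For convenience, the same total can be computed directly from the sar averages; a small sketch, assuming the interface is eth0 -

# Sum rxpck/s and txpck/s from the Average line for eth0
sar -n DEV 10 1 | awk '/^Average:/ && $2 == "eth0" {print $3 + $4, "packets/s total"}'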

Update 2

I ran the asadm command followed by show latency on one of the nodes of the cluster, and from the output it appears that there is no latency beyond 1 ms for either reads or writes -

Admin> show latency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~read Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                               Node                 Time   Ops/Sec   >1Ms   >8Ms   >64Ms
                                  .                 Span         .      .      .       .
ip-10-111-215-72.ec2.internal:3000    11:35:01->11:35:11    1242.1    0.0    0.0     0.0
ip-10-13-215-20.ec2.internal:3000     11:34:57->11:35:07    1297.5    0.0    0.0     0.0
ip-10-150-147-167.ec2.internal:3000   11:35:04->11:35:14    1147.7    0.0    0.0     0.0
ip-10-165-168-246.ec2.internal:3000   11:34:59->11:35:09    1342.2    0.0    0.0     0.0
ip-10-233-158-213.ec2.internal:3000   11:35:00->11:35:10    1218.0    0.0    0.0     0.0
Number of rows: 5

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~write Latency~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                               Node                 Time   Ops/Sec   >1Ms   >8Ms   >64Ms
                                  .                 Span         .      .      .       .
ip-10-111-215-72.ec2.internal:3000    11:35:01->11:35:11      33.0    0.0    0.0     0.0
ip-10-13-215-20.ec2.internal:3000     11:34:57->11:35:07      37.2    0.0    0.0     0.0
ip-10-150-147-167.ec2.internal:3000   11:35:04->11:35:14      36.4    0.0    0.0     0.0
ip-10-165-168-246.ec2.internal:3000   11:34:59->11:35:09      36.9    0.0    0.0     0.0
ip-10-233-158-213.ec2.internal:3000   11:35:00->11:35:10      33.9    0.0    0.0     0.0
Number of rows: 5
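For completeness, the same histograms can also be captured non-interactively, e.g. sampled in a loop while the clients are running (a sketch, assuming asadm is on the PATH) -

# Sample the server-side latency histograms every 10 seconds
while true; do
    asadm -e "show latency" >> /tmp/aerospike_latency.log
    sleep 10
done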

Upvotes: 1

Views: 975

Answers (2)

sunil

Reputation: 3567

Your title is misleading. Please consider changing it. You moved from on-disk to in-memory. mem+disk means data is both on disk and in memory (using data-in-memory true).

My best guess is that one CPU is bottlenecked doing network I/O. You can take a look at the top output and check the si (software interrupts) per CPU. If one CPU is showing a much higher value than the others, the simplest thing you can try is RPS (Receive Packet Steering):

echo f|sudo tee  /sys/class/net/eth0/queues/rx-0/rps_cpus

Once you confirm that it's a network bottleneck, you can try ENA as suggested by @Ronen.
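To see whether all of eth0's interrupts land on a single core, and whether the RPS mask above took effect, something like this can be used (a sketch; the queue layout depends on the driver) -

# Which CPU(s) service the NIC's hardware interrupts
grep eth0 /proc/interrupts

# Current RPS CPU mask for the first receive queue
# ("f" spreads softirq processing across CPUs 0-3)
cat /sys/class/net/eth0/queues/rx-0/rps_cpus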

Going into details: when you had 15ms latency with only FC, I assume the TPS was low. But when you added high load via HEAVYNAMESPACE in prod, the latency kept increasing as you added more client nodes and hence more TPS.

Similarly, in your POC the latency also increased with client nodes. The latency is under 15ms even with 130 servers, which is partly good. I am not sure I understood your set_success graph; I am assuming it is in ktps.

Update:

After looking at the server-side latency histogram, it looks like the server is doing fine. Most likely it is a client issue. Check CPU and network on the client machine(s).

Upvotes: 3

Ronen Botzer

Reputation: 7117

Aerospike has several modes for storage that you can configure:

  • Data in memory with no persistence
  • Data in memory, persisted to disk
  • Data on SSD, primary index in memory (AKA Hybrid Memory architecture)

In-Memory Optimizations

Release 3.11 and release 3.12 of Aerospike include several big performance improvements for in-memory namespaces.

Among these is a change to how partitions are represented, from a single red-black tree to sprigs (many sub-trees). The new config parameters partition-tree-sprigs and partition-tree-locks should be used appropriately. In your case, as r3.4xlarge instances have 122G of DRAM, you can afford the 311M of overhead associated with setting partition-tree-sprigs to the max value of 4096.

You should also consider the auto-pin=cpu setting. This option requires a Linux kernel >= 3.19, which is part of Ubuntu >= 15.04 (but not many other distributions yet).
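A rough sketch of how these could look in aerospike.conf (the partition-tree-locks value here is an assumption, and both namespace parameters need a restart to take effect; check the documentation for your exact server version) -

service {
        # ... existing service settings ...
        auto-pin cpu                 # pin service threads to CPUs (needs kernel >= 3.19)
}

namespace FC {
        # ... existing namespace settings ...
        partition-tree-sprigs 4096   # many sub-trees per partition instead of one red-black tree
        partition-tree-locks 256     # more index locks to reduce contention (assumed value)
}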

Clustering Improvements

The recent releases 3.13 and 3.14 include a rewrite of the cluster manager. In general you should consider using the latest version, but I'm pointing out the aspects that will directly affect your performance.

EC2 Networking and Aerospike

You don't show the latency numbers of the cluster itself, so I suspect the problem is with the networking, rather than the nodes.

Older instance family types, such as the r3, c3, i2, come with ENIs - NICs which have a single transmit/receive queue. The software interrupts of cores accessing this queue may become a bottleneck as the number of CPUs increases, all of which need to wait for their turn to use the NIC. There's a knowledge base article in the Aerospike community discussion forum on using multiple ENIs with Aerospike to get around the limited performance capacity of the single ENI you initially get with such an instance. The Amazon EC2 deployment guide on the Aerospike site talks about using RPS to maximize TPS when you're in an instance that uses ENIs.

Alternatively, you should consider moving to the newer instances (r4, i3, etc) which come with multiqueue ENAs. These do not require RPS, and support higher TPS without adding extra cards. They also happen to have better chipsets, and cost significantly less than their older siblings (r4 is roughly 30% cheaper than r3, i3 is about 1/3 the price of the i2).
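To confirm which NIC type an instance actually exposes to the guest, the driver name is a quick check (a sketch; requires ethtool) -

# "ena" means an Elastic Network Adapter (r4/i3 generation);
# "ixgbevf" or "vif" indicates the older single-queue path.
ethtool -i eth0 | grep driver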

Upvotes: 4
