Reputation: 233
I have a use case where I continuously ingest data in batches into Scylla using the gocql driver. During a heavy write test I observed that Scylla's write response latency increases over time, and sometimes this even leads to a restart of the Scylla node, whereas with Cassandra the latency stays constant. I would like to know the proper configuration for this use case so that I can achieve constant latency throughout the test.
The configs used for the Scylla cluster are listed further below.
Details of the writer process: it is basically a Kafka consumer. The flow of the consumer is as follows (a rough Go sketch of this flow is shown after the list):
1. Read 500 messages from Kafka.
2. 500 workers (goroutines) start writing them into Scylla (or Cassandra) in batches; a single batch contains data for a single partition and holds ~3k records on average (max ~20k). The replication factor of the keyspace is 1.
3. Update the state of the batch in a counter table in Scylla.
4. Commit these 500 messages to Kafka.
5. Go back to step 1.
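For reference, here is a minimal Go sketch of that flow using gocql. The keyspace `demo`, the table names (`events`, `batch_state`), and `readBatchFromKafka` are placeholders for illustration only, not the actual code from my consumer:

```go
package main

import (
	"log"
	"sync"

	"github.com/gocql/gocql"
)

// message stands in for a decoded Kafka message; the real consumer and its
// commit logic are omitted here.
type message struct {
	PartitionKey string
	Records      [][]interface{} // values for one INSERT each
}

// readBatchFromKafka is a placeholder for the real Kafka consumer.
func readBatchFromKafka(n int) []message { return nil }

func main() {
	cluster := gocql.NewCluster("scylla-node-1")
	cluster.Keyspace = "demo" // placeholder keyspace
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Step 1: read 500 messages from Kafka.
	msgs := readBatchFromKafka(500)

	// Step 2: one goroutine per message; each writes a single-partition
	// unlogged batch (avg ~3k rows, up to ~20k in the real workload).
	var wg sync.WaitGroup
	for _, m := range msgs {
		wg.Add(1)
		go func(m message) {
			defer wg.Done()
			b := session.NewBatch(gocql.UnloggedBatch)
			for _, rec := range m.Records {
				b.Query(`INSERT INTO events (pk, ts, payload) VALUES (?, ?, ?)`, rec...)
			}
			if err := session.ExecuteBatch(b); err != nil {
				log.Println("batch write failed:", err)
				return
			}
			// Step 3: counter updates cannot share a batch with regular
			// writes, so the state update is a separate statement.
			if err := session.Query(
				`UPDATE batch_state SET written = written + ? WHERE pk = ?`,
				len(m.Records), m.PartitionKey).Exec(); err != nil {
				log.Println("counter update failed:", err)
			}
		}(m)
	}
	wg.Wait()
	// Step 4: commit the 500 messages back to Kafka (omitted), then loop.
}
```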
So, basically, in the test I am using 3 consumers. Scylla is not able to keep up with the inject rate from Kafka, while Cassandra matches the inject rate.
I have shared the Grafana dashboards from the load test below; please let me know if anything else is required.
[![inject vs drain rate ][1]][1]
[![scylla memory dashboard][2]][2]
[![scyllaIOqueue][3]][3]
[![ScyllaIo][4]][4]
[![scyllaDiskDetails][5]][5]
[![latencies][6]][6]
[![load][7]][7]
smp 16
cpuset 0-15
memory 80G
iops
[root@ip /]# cat /etc/scylla.d/io_properties.yaml
disks:
- mountpoint: /var/lib/scylla
read_iops: 265
read_bandwidth: 99796024
write_iops: 1177
write_bandwidth: 130168192
Is there any other config I have missed that would let me achieve constant write latency?
[1]: https://i.sstatic.net/o0yQc.png
[2]: https://i.sstatic.net/i0RhS.png
[3]: https://i.sstatic.net/sA4WY.png
[4]: https://i.sstatic.net/5QAob.png
[5]: https://i.sstatic.net/6U5UM.png
[6]: https://i.sstatic.net/DG2my.png
[7]: https://i.sstatic.net/TOtuQ.png
I also saw these logs in the Scylla container:
WARN 2020-02-05 11:07:54,409 [shard 12] seastar_memory - oversized allocation: 1081344 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x2cf31dd
0x2a1d0c4
0x2a21e8b
0x103d7d2
0x103e298
0x10070c0
0x100cd14
0x10289b8
0x1028057
0x1028f59
0x2a003ac
0x2a50491
0x2a5069f
0x2aba615
0x2acedac
0x2a330ed
/opt/scylladb/libreloc/libpthread.so.0+0x85a1
/opt/scylladb/libreloc/libc.so.6+0xfb302
Upvotes: 2
Views: 1509
Reputation: 13771
You reported that "write response latency increases over time", but you didn't explain how you measured this or by how much it increases. Does the latency increase from 1 ms to 2 ms, or from 1 ms to 500 ms? Does the mean latency increase, or the tail latency (e.g., the 99th percentile)?
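As an illustration of that distinction, here is a minimal Go sketch (not part of the question or of any Scylla tooling) showing how the mean and the 99th percentile of the same per-write timings can tell very different stories:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// latencyStats reports both the mean and the 99th-percentile latency from
// per-request durations (e.g. time.Since(start) recorded around each write),
// so the two can be tracked separately over the course of the test.
func latencyStats(samples []time.Duration) (mean, p99 time.Duration) {
	if len(samples) == 0 {
		return 0, 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	var total time.Duration
	for _, d := range sorted {
		total += d
	}
	mean = total / time.Duration(len(sorted))
	p99 = sorted[len(sorted)*99/100]
	return mean, p99
}

func main() {
	// 99 fast writes plus one slow one: the mean barely moves, p99 jumps.
	samples := make([]time.Duration, 0, 100)
	for i := 0; i < 99; i++ {
		samples = append(samples, 2*time.Millisecond)
	}
	samples = append(samples, 500*time.Millisecond)
	mean, p99 := latencyStats(samples)
	fmt.Println("mean:", mean, "p99:", p99) // mean ~7ms, p99 = 500ms
}
```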
Some of the ideas raised in other responses mostly explain an increase in tail latency. But in batch workloads like the one you are describing, you usually do not care about tail latency; you only care about getting a reasonable (not even low) mean latency, because in batch workloads the more important measure is throughput. However, if you are seeing the mean latency continuously growing and becoming unreasonable, what is usually happening is that your client's concurrency is increasing: in other words, it is starting too many new writes without waiting for previous requests to finish (see Little's Law). You didn't say how you are doing your "batch writes". Are you using a client with a fixed number of threads, or can your write concurrency grow uncontrollably?
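One common way to keep the client's concurrency fixed in Go is a semaphore around the writes, so that no more than a fixed number of batches is ever in flight. This is only a sketch; `boundedWriter`, `maxInFlight`, and the placeholder work inside the closure are illustrative names, not anything from the question:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// boundedWriter caps the number of writes in flight at maxInFlight, so the
// client cannot keep starting new writes faster than the server finishes
// them (Little's Law: requests in flight = throughput x latency).
type boundedWriter struct {
	sem chan struct{}
	wg  sync.WaitGroup
}

func newBoundedWriter(maxInFlight int) *boundedWriter {
	return &boundedWriter{sem: make(chan struct{}, maxInFlight)}
}

// Go blocks until a slot is free, then runs writeBatch in a goroutine.
func (b *boundedWriter) Go(writeBatch func()) {
	b.sem <- struct{}{} // acquire a slot; blocks at the limit
	b.wg.Add(1)
	go func() {
		defer func() { <-b.sem; b.wg.Done() }()
		writeBatch()
	}()
}

func (b *boundedWriter) Wait() { b.wg.Wait() }

func main() {
	w := newBoundedWriter(64) // at most 64 batches in flight, regardless of poll size
	for i := 0; i < 500; i++ {
		i := i
		w.Go(func() {
			// placeholder for session.ExecuteBatch(...) on batch i
			time.Sleep(time.Millisecond)
			_ = i
		})
	}
	w.Wait()
	fmt.Println("all batches written")
}
```

With the 500-goroutines-per-poll pattern from the question, `w.Go` would replace each unbounded `go` call, and the in-flight limit (64 here) becomes the knob that keeps mean latency from growing.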
When your client correctly has fixed concurrency, Scylla still has to be careful not to make the client believe previous work has completed while there is in fact still a lot of background work; I explained this issue and how Scylla solves it in a blog post a year ago.
Of course it's always possible that Scylla has a bug in this area, so if you suspect it please report your problem - with more details - on the Scylla mailing list or bug tracker.
Upvotes: 2
Reputation: 870
There is too little data here; the best option is to discuss this on the mailing list or Slack. Use the Grafana monitoring dashboards and observe whether you are hitting a limit. Compaction runs in parallel, but the Scylla scheduler gives it a lower priority.
Could it be that you are running something else on the machine besides Scylla?
Upvotes: 0