Paul

Reputation: 243

Cassandra 1.2 huge read latency

I'm working on a 4-node Cassandra 1.2.6 cluster with a single keyspace, a replication factor of 2 (3 originally, but dropped to 2), and 10 or so column families. It is running the Oracle 1.7 JVM. It has a mix of reads and writes, with probably two to three times as many writes as reads.
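
For reference, dropping the replication factor is a one-line ALTER KEYSPACE. Here is a minimal sketch of how such a change can be made through the Java driver (the contact point and keyspace name are placeholders, not my real ones); note that after lowering RF, the now-extra replicas stay on disk until nodetool cleanup is run on each node:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class LowerReplicationFactor {
        public static void main(String[] args) {
            // Contact point and keyspace name are placeholders.
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            try {
                // Drop the keyspace's replication factor from 3 to 2. The extra
                // replicas remain on each node until `nodetool cleanup` is run.
                session.execute(
                    "ALTER KEYSPACE myks WITH replication = " +
                    "{'class': 'SimpleStrategy', 'replication_factor': 2}");
            } finally {
                cluster.shutdown();
            }
        }
    }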

Even under a small amount of load, I am seeing very large read latencies, and I get quite a few read timeouts (using the DataStax Java driver). Here is example output of nodetool cfstats for one of the column families:

    Column Family: user_scores
    SSTable count: 1
    SSTables in each level: [1, 0, 0, 0, 0, 0, 0, 0, 0]
    Space used (live): 7539098
    Space used (total): 7549091
    Number of Keys (estimate): 42112
    Memtable Columns Count: 2267
    Memtable Data Size: 1048576
    Memtable Switch Count: 2
    Read Count: 2101
    **Read Latency: 272334.202 ms.**
    Write Count: 24947
    Write Latency: NaN ms.
    Pending Tasks: 0
    Bloom Filter False Positives: 0
    Bloom Filter False Ratio: 0.00000
    Bloom Filter Space Used: 55376
    Compacted row minimum size: 447
    Compacted row maximum size: 219342
    Compacted row mean size: 1051

As you can see, I tried using the leveled compaction strategy to improve read latency, but as you can also see, the latency is huge. I'm a bit stumped. I had a Cassandra 1.1.6 cluster working beautifully, but no luck so far with 1.2.
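
For anyone wanting to try the same thing: switching a table to leveled compaction is a single ALTER TABLE. A minimal sketch through the Java driver (contact point and keyspace name are placeholders):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class SwitchToLeveledCompaction {
        public static void main(String[] args) {
            // Contact point and keyspace name are placeholders.
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("myks");
            try {
                // Existing SSTables get rewritten into levels by background
                // compaction after this change takes effect.
                session.execute(
                    "ALTER TABLE user_scores WITH compaction = " +
                    "{'class': 'LeveledCompactionStrategy'}");
            } finally {
                cluster.shutdown();
            }
        }
    }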

The cluster is running on VMs with 4 CPUs and 7 GB of RAM. The data drive is set up as a striped RAID across 4 disks. The machines don't seem to be IO bound.

I'm running a pretty vanilla configuration, with all the defaults.

I do see strange CPU behavior, with spikes even under smaller load. Sometimes I see compactions running, but they are niced, so I don't think they are the culprit.

I'm trying to figure out where to go next. Any help appreciated!

[update with rpc_timeout trace] Still playing with this. Here is an example trace. It looks like the merge step is taking way too long.

 activity                                                                | timestamp    | source        | source_elapsed
-------------------------------------------------------------------------+--------------+---------------+----------------
                                                      execute_cql3_query | 04:57:18,882 | 100.69.176.51 |              0
 Parsing select * from user_scores where user_id='26257166' LIMIT 10000; | 04:57:18,884 | 100.69.176.51 |           1981
                                                      Peparing statement | 04:57:18,885 | 100.69.176.51 |           2997
                         Executing single-partition query on user_scores | 04:57:18,885 | 100.69.176.51 |           3657
                                            Acquiring sstable references | 04:57:18,885 | 100.69.176.51 |           3724
                                             Merging memtable tombstones | 04:57:18,885 | 100.69.176.51 |           3779
                                            Key cache hit for sstable 32 | 04:57:18,886 | 100.69.176.51 |           3910
                             Seeking to partition beginning in data file | 04:57:18,886 | 100.69.176.51 |           3930
                              Merging data from memtables and 1 sstables | 04:57:18,886 | 100.69.176.51 |           4211
                                                        Request complete | 04:57:38,891 | 100.69.176.51 |       20009870

Older traces below:

[newer trace] After addressing the problem noted in the logs by completely rebuilding the cluster data repository, I still ran into the problem, although it took quite a bit longer. Here is a trace I grabbed while in the bad state:

Tracing session: a6dbefc0-ea49-11e2-84bb-ef447a7d9a48

 activity                                                                                        | timestamp    | source         | source_elapsed
-------------------------------------------------------------------------------------------------+--------------+----------------+----------------
                                                                              execute_cql3_query | 16:48:02,755 | 100.69.196.124 |              0
                                                      Parsing select * from user_scores limit 1; | 16:48:02,756 | 100.69.196.124 |           1774
                                                                              Peparing statement | 16:48:02,759 | 100.69.196.124 |           4006
                                                                   Determining replicas to query | 16:48:02,759 | 100.69.196.124 |           4286
                                                             Enqueuing request to /100.69.176.51 | 16:48:02,763 | 100.69.196.124 |           8849
                                                         Sending message to cdb002/100.69.176.51 | 16:48:02,764 | 100.69.196.124 |           9456
                                                           Message received from /100.69.196.124 | 16:48:03,449 |  100.69.176.51 |            160
                                                            Message received from /100.69.176.51 | 16:48:09,646 | 100.69.196.124 |        6891860
                                                         Processing response from /100.69.176.51 | 16:48:09,647 | 100.69.196.124 |        6892426
 Executing seq scan across 1 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 16:48:10,288 |  100.69.176.51 |        6838754
                                                     Seeking to partition beginning in data file | 16:48:10,289 |  100.69.176.51 |        6839689
                                                              Read 1 live and 0 tombstoned cells | 16:48:10,289 |  100.69.176.51 |        6839927
                                                     Seeking to partition beginning in data file | 16:48:10,289 |  100.69.176.51 |        6839998
                                                              Read 1 live and 0 tombstoned cells | 16:48:10,289 |  100.69.176.51 |        6840082
                                                                    Scanned 1 rows and matched 1 | 16:48:10,289 |  100.69.176.51 |        6840162
                                                           Enqueuing response to /100.69.196.124 | 16:48:10,289 |  100.69.176.51 |        6840229
                                                              Sending message to /100.69.196.124 | 16:48:10,299 |  100.69.176.51 |        6850072
                                                                                Request complete | 16:48:09,648 | 100.69.196.124 |        6893029

[update] I should add that things work just dandy with a solo Cassandra instance on my MacBook Pro. AKA works on my machine... :)

[update with trace data]

Here is some trace data, collected from the Java driver. The downside is that I can only trace the queries that succeed. I make it a total of 67 queries before every query starts timing out. What is weird is that the traces don't look that bad. Then at query 68, I no longer get a response, and two of the servers are running hot.
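
In case it helps, here is roughly how traces like the ones below can be collected with the 1.x Java driver (contact point and keyspace name are placeholders; the query is one from the traces above). The trace only becomes available from the ResultSet after the query completes, which is why timed-out queries can't be traced this way:

    import com.datastax.driver.core.*;

    public class TraceQuery {
        public static void main(String[] args) {
            // Contact point and keyspace name are placeholders.
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("myks");
            try {
                // enableTracing() asks the coordinator to record a session trace.
                SimpleStatement stmt = new SimpleStatement(
                    "select * from user_scores where user_id='26257166' LIMIT 10000");
                stmt.enableTracing();
                ResultSet rs = session.execute(stmt);

                // Fetch the server-side trace events for the completed query.
                QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
                System.out.println("Trace id: " + trace.getTraceId());
                for (QueryTrace.Event e : trace.getEvents()) {
                    System.out.printf("%60s | %15s | %12d%n",
                        e.getDescription(), e.getSource(), e.getSourceElapsedMicros());
                }
            } finally {
                cluster.shutdown();
            }
        }
    }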

2013-07-11 02:15:45 STDIO [INFO] ***************************************
66:Host (queried): cdb003/100.69.198.47
66:Host (tried): cdb003/100.69.198.47
66:Trace id: c95e51c0-e9cf-11e2-b9a9-5b3c0946787b

66:-----------------------------------------------------+--------------+-----------------+--------------
66:            Enqueuing data request to /100.69.176.51 | 02:15:42.045 |  /100.69.198.47 |          200
66:          Enqueuing digest request to /100.69.176.51 | 02:15:42.045 |  /100.69.198.47 |          265
66:                  Sending message to /100.69.196.124 | 02:15:42.045 |  /100.69.198.47 |          570
66:                   Sending message to /100.69.176.51 | 02:15:42.045 |  /100.69.198.47 |          574
66:                Message received from /100.69.176.51 | 02:15:42.107 |  /100.69.198.47 |        62492
66:             Processing response from /100.69.176.51 | 02:15:42.107 |  /100.69.198.47 |        62780
66:                Message received from /100.69.198.47 | 02:15:42.508 | /100.69.196.124 |           31
66:     Executing single-partition query on user_scores | 02:15:42.508 | /100.69.196.124 |          406
66:                        Acquiring sstable references | 02:15:42.508 | /100.69.196.124 |          473
66:                         Merging memtable tombstones | 02:15:42.508 | /100.69.196.124 |          577
66:                        Key cache hit for sstable 11 | 02:15:42.508 | /100.69.196.124 |          807
66:         Seeking to partition beginning in data file | 02:15:42.508 | /100.69.196.124 |          849
66:          Merging data from memtables and 1 sstables | 02:15:42.509 | /100.69.196.124 |         1500
66:                Message received from /100.69.198.47 | 02:15:43.379 |  /100.69.176.51 |           60
66:     Executing single-partition query on user_scores | 02:15:43.379 |  /100.69.176.51 |          399
66:                        Acquiring sstable references | 02:15:43.379 |  /100.69.176.51 |          490
66:                         Merging memtable tombstones | 02:15:43.379 |  /100.69.176.51 |          593
66:                         Key cache hit for sstable 7 | 02:15:43.380 |  /100.69.176.51 |         1098
66:         Seeking to partition beginning in data file | 02:15:43.380 |  /100.69.176.51 |         1141
66:          Merging data from memtables and 1 sstables | 02:15:43.380 |  /100.69.176.51 |         1912
66:                  Read 1 live and 0 tombstoned cells | 02:15:43.438 |  /100.69.176.51 |        59094
66:                Enqueuing response to /100.69.198.47 | 02:15:43.438 |  /100.69.176.51 |        59225
66:                   Sending message to /100.69.198.47 | 02:15:43.438 |  /100.69.176.51 |        59373
66:Started at: 02:15:42.044
66:Elapsed time in micros: 63105
2013-07-11 02:15:45 STDIO [INFO] ***************************************
67:Host (queried): cdb004/100.69.184.134
67:Host (tried): cdb004/100.69.184.134
67:Trace id: c9f365d0-e9cf-11e2-a4e5-7f3170333ff5

67:-----------------------------------------------------+--------------+-----------------+--------------
67:               Message received from /100.69.184.134 | 02:15:42.536 |  /100.69.198.47 |           36
67:     Executing single-partition query on user_scores | 02:15:42.536 |  /100.69.198.47 |          273
67:                        Acquiring sstable references | 02:15:42.536 |  /100.69.198.47 |          311
67:                         Merging memtable tombstones | 02:15:42.536 |  /100.69.198.47 |          353
67:                         Key cache hit for sstable 8 | 02:15:42.536 |  /100.69.198.47 |          436
67:         Seeking to partition beginning in data file | 02:15:42.536 |  /100.69.198.47 |          455
67:          Merging data from memtables and 1 sstables | 02:15:42.537 |  /100.69.198.47 |          811
67:                  Read 1 live and 0 tombstoned cells | 02:15:42.550 |  /100.69.198.47 |        14242
67:               Enqueuing response to /100.69.184.134 | 02:15:42.550 |  /100.69.198.47 |        14456
67:                  Sending message to /100.69.184.134 | 02:15:42.551 |  /100.69.198.47 |        14694
67:            Enqueuing data request to /100.69.198.47 | 02:15:43.021 | /100.69.184.134 |          323
67:                   Sending message to /100.69.198.47 | 02:15:43.021 | /100.69.184.134 |          565
67:                Message received from /100.69.198.47 | 02:15:43.038 | /100.69.184.134 |        17029
67:             Processing response from /100.69.198.47 | 02:15:43.038 | /100.69.184.134 |        17230
67:Started at: 02:15:43.021
67:Elapsed time in micros: 17622

And here is a trace using cqlsh:

Tracing session: d0f845d0-e9cf-11e2-8882-ef447a7d9a48

 activity                                                                | timestamp    | source         | source_elapsed
-------------------------------------------------------------------------+--------------+----------------+----------------
                                                      execute_cql3_query | 19:15:54,833 | 100.69.196.124 |              0
 Parsing select * from user_scores where user_id='39333433' LIMIT 10000; | 19:15:54,833 | 100.69.196.124 |            103
                                                      Peparing statement | 19:15:54,833 | 100.69.196.124 |            455
                         Executing single-partition query on user_scores | 19:15:54,834 | 100.69.196.124 |           1400
                                            Acquiring sstable references | 19:15:54,834 | 100.69.196.124 |           1468
                                             Merging memtable tombstones | 19:15:54,835 | 100.69.196.124 |           1575
                                            Key cache hit for sstable 11 | 19:15:54,835 | 100.69.196.124 |           1780
                             Seeking to partition beginning in data file | 19:15:54,835 | 100.69.196.124 |           1822
                              Merging data from memtables and 1 sstables | 19:15:54,836 | 100.69.196.124 |           2562
                                      Read 1 live and 0 tombstoned cells | 19:15:54,838 | 100.69.196.124 |           4808
                                                        Request complete | 19:15:54,838 | 100.69.196.124 |           5810

Upvotes: 2

Views: 3085

Answers (2)

Paul

Reputation: 243

Looks like I was running into a performance issue with 1.2. Fortunately, a patch had just been applied to the 1.2 branch, so when I built from source, my problem went away.

See https://issues.apache.org/jira/browse/CASSANDRA-5677 for a detailed explanation.

Thanks all!

Upvotes: 1

Raedwald

Reputation: 48616

The trace seems to show that much of the time is spent doing, or waiting for, network operations. Perhaps your network has problems?

If only some operations fail, perhaps you have a problem with only one of your nodes. When that node is not needed, things work, but when it is needed, things go badly. It might be worth looking at the log files on the other nodes.

Upvotes: 1
