Reputation: 542
I'm seeing really odd behavior from a Cassandra 2.x cluster.
If I execute the following query (with CL ONE or LOCAL_ONE):

    select count(*) from api_cache;
The query can randomly take 20ms or 800ms on the same node (the table has only around 600 rows). In the fast case the trace shows around 100 sequential scans being executed; in the slow case, around 2,000. I have no idea why the system should exhibit such non-deterministic behavior.
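A minimal sketch of how the measurement can be reproduced in cqlsh; the keyspace name mykeyspace is a placeholder, since the post doesn't name it:

    -- run in cqlsh on one of the nodes; "mykeyspace" is a placeholder
    USE mykeyspace;
    CONSISTENCY LOCAL_ONE;  -- or CONSISTENCY ONE
    TRACING ON;             -- prints the per-step trace, including each sequential scan
    SELECT count(*) FROM api_cache;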
This problem only occurs in DC2 (5 nodes, RF=5), whereas in DC1 (3 nodes, RF=3) the same query always returns in around 15ms. Other points worth mentioning: DC2 was created by streaming from DC1, and the keyspace uses size-tiered compaction (SizeTieredCompactionStrategy). All the nodes have 16GB of RAM, a dedicated SSD for /var/lib/cassandra, and are under very light load.
I have tried compacting, repairing, scrubbing, etc., all with no effect; the sort of commands I mean are sketched below.
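For reference (again, the keyspace name is assumed):

    # run on each affected node; "mykeyspace" is a placeholder
    nodetool compact mykeyspace api_cache
    nodetool repair mykeyspace api_cache
    nodetool scrub mykeyspace api_cache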
Any help would be massively appreciated,
Thanks.
Here are the CQL tracings for both cases (edit: each from DC1)
UPDATE:
This is a good clue: if I run the query from DC1 with CL=ALL, it returns in around 30ms every time. In other words, querying DC2 from DC1 is an order of magnitude quicker than querying DC2 from DC2!
Bearing in mind the DCs are 100 miles apart, at least we can discount network and disk issues on any of the nodes...
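For concreteness, the CL=ALL run looks like this from a cqlsh session on a DC1 node (the session framing is illustrative):

    -- with RF=3 in DC1 and RF=5 in DC2, CL=ALL requires all 8 replicas to respond
    CONSISTENCY ALL;
    SELECT count(*) FROM api_cache;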
The key difference is that DC1 only ever scans a single range, e.g. [min(-9223372036854775808), min(-9223372036854775808)], whereas DC2 scans many separate ranges, e.g. (max(4772134901608081021), max(4787315709362044510)]. There is also the fact that one uses min while the other uses multiple max values...
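If anyone wants to dig into the ranges themselves, the token ranges each node owns can be listed with nodetool (keyspace name assumed); with vnodes enabled a node owns many small ranges rather than a single contiguous span:

    # lists every token range in the keyspace and which endpoints own it
    nodetool describering mykeyspace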
This is clearly the heart of the problem, but I don't know how to fix it!
Upvotes: 0
Views: 717
Reputation: 542
This turned out to be a regression introduced somewhere between 2.0.4 and 2.0.9.
The query planner is no longer able to compute the contiguous token ranges owned by a given node, so instead of issuing one scan per contiguous span it frequently scans each range individually.
See CASSANDRA-7535 for more details.
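To check whether a given node is running an affected release, the standard version command suffices:

    # releases between roughly 2.0.4 and 2.0.9 exhibit the regression
    nodetool version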
Upvotes: 1
Reputation: 491
count(*) performance depends on how many SSTables you have. I looked at your trace output, and it's clear that DC1 has few SSTables while DC2 has many; each one must be checked for rows to arrive at a total. Having vnodes enabled amplifies the problem by splitting the data across many more token ranges.
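This is easy to verify; assuming the keyspace is called mykeyspace, compare the "SSTable count" line on a DC1 node with one from a DC2 node:

    # prints per-table statistics, including "SSTable count"
    nodetool cfstats mykeyspace.api_cache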
Keeping an accurate count would require a read-modify-write and extra locking on every mutation, so Cassandra does not do it. count(*) is really an estimation based on the per-SSTable/memtable row counts, which are themselves estimates.
For more information, please read this: http://www.wentnet.com/blog/?p=24
Upvotes: 2