Reputation: 602
My use case is this: I am inserting 10 million rows into a table defined as follows:
keyval bigint, rangef bigint, arrayval blob, PRIMARY KEY (rangef, keyval)
The input data is as follows:
keyval - some timestamp
rangef - a random number
arrayval - a byte array
I made the primary key composite because, after inserting the 10 million rows, I want to perform range scans on keyval. Since keyval contains a timestamp, my queries will be of the form "give me all the rows between this time and that time". That is why the primary key is a composite key.
Ingestion performance was good and satisfactory. But when I ran the query described above, performance was very poor. When I asked for all the rows between t1 and t1 + 3 minutes, almost 500k records were returned in 160 seconds.
My query looks like this:
Statement s = QueryBuilder.select().all().from(keySpace, tableName).allowFiltering()
        .where(QueryBuilder.gte("keyval", 1411516800))
        .and(QueryBuilder.lte("keyval", 1411516980));
s.setFetchSize(10000);
ResultSet rs = sess.execute(s);
for (Row row : rs) {
    count++;
}
System.out.println("Batch2 count = " + count);
I am using the default partitioner, that is, the Murmur3 partitioner.
My cluster configuration is:
No. of nodes - 4
No. of seed nodes - 1
No. of disks - 6
MAX_HEAP_SIZE for each node = 8G
The rest of the configuration is default.
How can I improve my range scan performance?
Upvotes: 3
Views: 2724
Reputation: 57798
RussS is correct that your problems are caused both by the use of ALLOW FILTERING and by not limiting your query to a single partition.
How can I improve my range scan performance?
By limiting your query with a value for your partitioning key.
PRIMARY KEY (rangef, keyval)
If the above is indeed correct, then rangef is your partitioning key. Alter your query to first restrict for a specific value of rangef (the "single partition", as RussS suggested). Then your current range query on your clustering key keyval should work.
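As a rough illustration of the query shape described above, here is a minimal sketch that only builds the CQL text. The class name, method name, and the keyspace and table arguments are hypothetical; the column names come from the table definition in the question. With an equality restriction on rangef and a range on keyval, ALLOW FILTERING should no longer be needed.

```java
// Sketch only: builds the CQL text for a range scan confined to one partition.
// Class and method names are illustrative; keyspace/table are caller-supplied placeholders.
final class SinglePartitionQuery {
    private SinglePartitionQuery() {}

    /**
     * Returns a parameterized CQL statement. Bind values in this order:
     * the rangef partition value, the lower keyval bound, the upper keyval bound.
     */
    static String rangeWithinPartition(String keyspace, String table) {
        return "SELECT * FROM " + keyspace + "." + table
                + " WHERE rangef = ? AND keyval >= ? AND keyval <= ?";
    }
}
```

The resulting string could then be prepared and bound with whatever driver version you use; the point is only that the partition key appears with an equality restriction before the clustering-key range.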
Now, that query may not return anything useful to you. Or you might have to iterate through many rangef values on the application side, and that could be cumbersome. This is where you need to re-evaluate your data model and come up with an appropriate key to partition your data by.
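One common way to pick such a key for time-series data is to partition by a time bucket. This is not something the question's schema does; it is an assumption for illustration, as are the one-hour bucket size, the class and method names, and the idea of a table keyed like PRIMARY KEY (bucket, keyval). The sketch assumes keyval holds epoch seconds, as the literal values in the question's query suggest.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: maps timestamps to hypothetical hour-sized partition buckets.
final class TimeBuckets {
    // Assumption: one-hour buckets. Tune so each partition stays a manageable size.
    static final long BUCKET_SECONDS = 3600L;

    private TimeBuckets() {}

    /** Bucket id for a timestamp given in epoch seconds. */
    static long bucketOf(long epochSeconds) {
        return Math.floorDiv(epochSeconds, BUCKET_SECONDS);
    }

    /** All bucket ids that a time range [fromInclusive, toInclusive] touches, in ascending order. */
    static List<Long> bucketsCovering(long fromInclusive, long toInclusive) {
        if (toInclusive < fromInclusive) {
            throw new IllegalArgumentException("range end precedes range start");
        }
        List<Long> buckets = new ArrayList<>();
        long last = bucketOf(toInclusive);
        for (long b = bucketOf(fromInclusive); b <= last; b++) {
            buckets.add(b);
        }
        return buckets;
    }
}
```

With this scheme, a three-minute window touches one or two buckets, so the application would issue one single-partition range query per bucket instead of iterating over random rangef values.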
I created a secondary index on Keyval, and my query performance improved: it dropped from 160 seconds to 40 seconds. So does indexing the Keyval field make sense?
The problem with relying on secondary indexes is that they may seem fast at first but get slow over time. Especially with a high-cardinality column like a timestamp (Keyval), a secondary index query has to go out to each node and ultimately scan a large number of rows to get a small number of results. It's always better to duplicate your data in a new query table than to rely on a secondary index query.
Upvotes: 2
Reputation: 16576
You are actually performing a full table scan, not a range scan. This is one of the slowest queries possible in Cassandra and is usually only used by analytics workloads. If your queries ever require ALLOW FILTERING for an OLTP workload, something is most likely wrong. Cassandra was designed with the knowledge that queries which need to access the entire dataset will not scale, so a great deal of effort has gone into making it simple to partition data and to access data within a partition quickly.
To fix this, you need to rethink your data model and work out how to restrict your queries to a single partition.
Upvotes: 4