abi_pat

Reputation: 602

Range Scan in Cassandra-2.1.2 taking a lot of time

My use case is like this: I am inserting 10 million rows in a table described as follows:

keyval bigint, rangef bigint, arrayval blob, PRIMARY KEY (rangef, keyval)

and input data is like follows -

keyval - some timestamp
rangef - a random number
arrayval - a byte array

I chose a composite primary key because, after inserting 10 million rows, I want to perform range scans on keyval. Since keyval contains a timestamp, my queries will be of the form: give me all the rows between this time and that time. To support these kinds of SELECT queries, I made my primary key composite.

Ingestion performance was very good and satisfactory. But when I ran the query described above, performance was very poor: when I asked for all the rows between t1 and t1 + 3 minutes, almost 500k records were returned in 160 seconds.

My query looks like this:

    Statement s = QueryBuilder.select().all().from(keySpace, tableName)
            .allowFiltering()
            .where(QueryBuilder.gte("keyval", 1411516800))
            .and(QueryBuilder.lte("keyval", 1411516980));
    s.setFetchSize(10000);
    ResultSet rs = sess.execute(s);
    int count = 0;
    for (Row row : rs)
    {
        count++;
    }
    System.out.println("Batch2 count = " + count);

I am using the default partitioner, Murmur3Partitioner.

My cluster configuration is -

No. of nodes - 4
No. of seed nodes - 1
No. of disks - 6
MAX_HEAP_SIZE for each node - 8G

The rest of the configuration is at defaults.

How can I improve my range scan performance?

Upvotes: 3

Views: 2724

Answers (2)

Aaron

Reputation: 57798

RussS is correct that your problems are caused both by the use of ALLOW FILTERING and by the fact that you are not limiting your query to a single partition.

How can I improve my range scan performance?

By limiting your query with a value for your partitioning key.

PRIMARY KEY (rangef, keyval)

If the above is indeed correct, then rangef is your partitioning key. Alter your query to first restrict for a specific value of rangef (the "single partition", as RussS suggested). Then your current range query on your clustering key keyval should work.
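As a sketch, a partition-restricted version of that query could look like this in CQL. The keyspace/table names and the rangef value here are placeholders, not values from your data:

```cql
-- Hypothetical names and partition value; substitute your own.
-- With rangef fixed, the range on the clustering key keyval is a
-- single-partition slice and needs no ALLOW FILTERING.
SELECT * FROM mykeyspace.mytable
WHERE rangef = 123456789
  AND keyval >= 1411516800
  AND keyval <= 1411516980;
```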

Now, that query may not return anything useful to you. Or you might have to iterate through many rangef values on the application side, and that could be cumbersome. This is where you need to re-evaluate your data model and come up with an appropriate key to partition your data by.
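One common re-partitioning pattern (an assumption about your workload, not your current model) is to bucket the timestamp into coarse intervals, such as one partition per hour, use the bucket as the partition key, and keep keyval as the clustering key. A 3-minute window then touches at most two known partitions. The bucket is easy to compute on the application side; `BucketDemo`/`hourBucket` below are hypothetical names for illustration:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class BucketDemo {
    // Hypothetical helper: derive an hourly bucket string (UTC, yyyyMMddHH)
    // from an epoch-seconds timestamp, to use as the partition key.
    static String hourBucket(long epochSeconds) {
        return DateTimeFormatter.ofPattern("yyyyMMddHH")
                .withZone(ZoneOffset.UTC)
                .format(Instant.ofEpochSecond(epochSeconds));
    }

    public static void main(String[] args) {
        // 1411516800 is the example timestamp from the question.
        System.out.println(hourBucket(1411516800L)); // 2014092400
    }
}
```

For the example window in the question (1411516800 to 1411516980), both endpoints fall in the same hourly bucket, so the whole query becomes a single-partition read.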

I made a secondary index on Keyval, and my query performance improved: from 160 seconds, it dropped to 40 seconds. So does indexing the Keyval field make sense?

The problem with relying on secondary indexes is that they may seem fast at first, but they get slow over time. Especially with a high-cardinality column like a timestamp (Keyval), a secondary index query has to go out to each node and ultimately scan a large number of rows to get a small number of results. It's always better to duplicate your data in a new query table than to rely on a secondary index query.
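A sketch of what such a query table could look like; the table/column names and the hourly bucket granularity are illustrative assumptions, not part of your current schema:

```cql
-- Hypothetical query table: the same data duplicated from the base
-- table, but partitioned by a coarse time bucket instead of rangef.
CREATE TABLE events_by_hour (
    hourbucket text,    -- e.g. '2014092400' (UTC, yyyyMMddHH)
    keyval     bigint,  -- timestamp; clustering key for range slices
    rangef     bigint,
    arrayval   blob,
    PRIMARY KEY (hourbucket, keyval)
);

-- The original 3-minute window becomes a single-partition slice:
SELECT * FROM events_by_hour
WHERE hourbucket = '2014092400'
  AND keyval >= 1411516800 AND keyval <= 1411516980;
```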

Upvotes: 2

RussS

Reputation: 16576

You are actually performing a full table scan, not a range scan. This is one of the slowest queries possible in Cassandra and is usually only used by analytics workloads. If your queries ever require ALLOW FILTERING in an OLTP workload, something is most likely wrong. Basically, Cassandra was designed with the knowledge that queries which require accessing the entire dataset will not scale, so a great deal of effort has gone into making it simple to partition data and to access data within a partition quickly.

To fix this, you need to rethink your data model and think about how you can restrict your queries to a single partition.

Upvotes: 4
