Javadroider

Reputation: 2450

Cassandra - get all data for a certain time range

Is it possible to query a Cassandra database to get records for a certain range?

I have a table definition like this

CREATE TABLE domain(
    domain_name text,
    status int,
    last_scanned_date bigint,
    PRIMARY KEY(domain_name, last_scanned_date)
);

My requirement is to get all the domains that have not been scanned in the last 24 hours. I wrote the following query, but it is not efficient, as Cassandra tries to fetch the entire dataset because of ALLOW FILTERING:

SELECT * FROM domain where last_scanned_date<=<last24hourstimeinmillis> ALLOW FILTERING;

Then I decided to do it in two queries

1st query:

SELECT DISTINCT domain_name FROM domain;

2nd query: Use the IN operator to query the domains which were not scanned in the last 24 hours

SELECT * FROM domain
WHERE domain_name IN ('domain1', 'domain2')
AND last_scanned_date <= <last24hourstimeinmillis>;

My second approach works, but it comes with the extra overhead of first querying for the distinct values.

Is there any better approach than this?

Upvotes: 1

Views: 1377

Answers (2)

Alex Ott

Reputation: 87359

Cassandra can perform range queries efficiently only inside a single partition. The same applies to aggregations such as DISTINCT. So in your case you would need a single partition that contains all the data, but that is bad design.

You may try to split this big partition into smaller ones by using TLDs as separate partition keys and fetching from every partition in parallel, but this will also lead to imbalance, as some TLDs have far more sites than others.
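A minimal sketch of that layout, assuming an illustrative table domain_by_tld with the TLD added as the partition key (the names are not from the original post):

CREATE TABLE domain_by_tld(
    tld text,
    domain_name text,
    status int,
    last_scanned_date bigint,
    PRIMARY KEY(tld, last_scanned_date, domain_name)
);

-- one query per TLD, issued in parallel from the application
SELECT * FROM domain_by_tld
WHERE tld = 'com'
AND last_scanned_date <= <last24hourstimeinmillis>;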

Another issue with your schema is that you have last_scanned_date as a clustering column, which means that when you update last_scanned_date you effectively insert a new row into the database. You'll need to explicitly remove the row for the previous last_scanned_date, otherwise the query last_scanned_date<=<last24hourstimeinmillis> will keep fetching old rows that you already scanned.
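A hedged sketch of that bookkeeping, assuming the application remembers the previous scan time (the placeholders follow the style used in the question):

BEGIN BATCH
    -- remove the row stored under the old scan time
    DELETE FROM domain
    WHERE domain_name = 'domain1'
    AND last_scanned_date = <previousScanTimeInMillis>;
    -- write the row under the new scan time
    INSERT INTO domain (domain_name, status, last_scanned_date)
    VALUES ('domain1', 1, <currentScanTimeInMillis>);
APPLY BATCH;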

Part of the problem with your current design could be solved by using Spark, which can scan the full table efficiently via a token range scan plus a range scan for every individual partition, returning only the data in the given time range. Or, if you don't want to use Spark, you can perform the token range scan in your own code, something like this.
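As a rough illustration of the token range idea, the token bounds below are placeholders you would compute for each sub-range of the ring:

-- one query per token sub-range, issued in parallel;
-- filter on last_scanned_date in the application, since each
-- query now only covers a small slice of the ring
SELECT domain_name, status, last_scanned_date FROM domain
WHERE token(domain_name) > <rangeStartToken>
AND token(domain_name) <= <rangeEndToken>;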

Upvotes: 1

Vikram Patil

Reputation: 661

You should update your table definition. Currently you are using the domain name as your partition key; keep in mind that you cannot have more than 2 billion records in a single Cassandra partition.

I would suggest using time as part of your partition key. If you are not going to receive more than 2 billion records per day, try using the day since epoch as the partition key. You could use a composite partition key, but it won't be helpful for your query.
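A minimal sketch of that layout, assuming an illustrative scan_day column holding the number of days since the epoch:

CREATE TABLE domain_by_day(
    scan_day int,
    last_scanned_date bigint,
    domain_name text,
    status int,
    PRIMARY KEY(scan_day, last_scanned_date, domain_name)
);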

While querying, you have to scan at most two partitions, with an additional filter in the query or in your application to drop results that do not belong to the range you have specified.
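As described above, a query touching at most two day partitions might look like this (the day values are placeholders computed from the current time):

SELECT * FROM domain_by_day
WHERE scan_day IN (<yesterday>, <today>)
AND last_scanned_date <= <last24hourstimeinmillis>;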

Go over the following concepts before finalizing your design.

https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCompositePartitionKeyConcept.html

https://docs.datastax.com/en/dse-planning/doc/planning/planningPartitionSize.html

Upvotes: 1
