Subodh Nijsure

Reputation: 3453

Performing range queries on a Cassandra table

I am trying to store data with the following schema:

 CREATE TABLE temp_humidity_data (
                asset_id text, 
                date text, 
                event_time timestamp, 
                temprature int, 
                humidity int,
                PRIMARY KEY((asset_id, date),event_time)
            )

I have followed the DataStax article 'Getting Started with Time Series Data Modeling' - http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/

However, with this data model, one thing that is not working is a query that returns data between two dates. How do I do that?

If I do this:

select * from temp_humidity_data 
where asset_id='1234' AND date >= '2010-04-02' AND date <= '2011-04-03';

It gives me following error:

code=2200 [Invalid query] message="Only EQ and IN relation are supported on the partition key (unless you use the token() function)"

I understand there is a way to use the IN operator, but I don't want to put all those dates in an IN clause. Is there a way, using the above table definition, to query data between two dates?

Upvotes: 3

Views: 4317

Answers (2)

ashic

Reputation: 6495

The first key in the primary key (a composite in your case) is responsible for scattering the data across different partitions. Each partition is held in its entirety on a single node (and its replicas). Even if the query you request were possible (it would be if you had only the date as the partition key and used ByteOrderedPartitioner - the default is Murmur3Partitioner), it would effectively do a full scan across your cluster. Think of this as similar to a full table scan in an RDBMS on a column without an index, except that now the full table scan spans multiple machines.

The goal of the composite partition key here is to ensure no partition gets unmanageably big. It also takes away your ability to do the range query across dates.

If you think your data for an asset can fit in a single partition, you can make the date the first clustering key instead. That would enable the query. However, all rows for an asset would then be on a single partition. This may be an issue (it's typically good to target around 100MB as a max partition size - though there are exceptions), and hotspots may arise in your cluster (nodes holding partitions for very busy assets will be busy, while other nodes less so).

Another way around this is to maintain manual buckets - add a bucket_id int as part of the partition key [i.e. (asset_id, bucket_id)], have date as the first clustering key, and maintain the bucketing from application code. This distributes the data for an asset across multiple partitions that you control. It needs a bit of calculation, and you'll have to query each bucket yourself - but it prevents hotspots and allows your date range queries. You'd obviously only do this if the data for a particular asset is beyond single-partition size, but manageable via buckets.
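As a minimal sketch of the application-side bucketing described above (the one-bucket-per-calendar-month scheme and the helper names are assumptions, not part of the answer): given a date range, compute which bucket ids cover it, then issue one query per bucket.

```python
from datetime import date

def bucket_id(d):
    """Map a date to a bucket id - here one bucket per calendar month (an assumed scheme)."""
    return d.year * 12 + (d.month - 1)

def buckets_for_range(start, end):
    """All bucket ids covering [start, end]; the application queries each bucket separately."""
    return list(range(bucket_id(start), bucket_id(end) + 1))

# Each bucket id would then be used in a per-bucket query against a table
# keyed on PRIMARY KEY ((asset_id, bucket_id), date, event_time), e.g.:
#   SELECT * FROM temp_humidity_data
#   WHERE asset_id = ? AND bucket_id = ? AND date >= ? AND date <= ?;
print(buckets_for_range(date(2010, 4, 2), date(2010, 6, 3)))
```

The bucket granularity (month, week, day) is a tuning knob: smaller buckets mean smaller partitions but more queries per date range.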

If you absolutely must partition based on date, consider things like Spark and Shark to do efficient post aggregation.

Hope that helps.

Upvotes: 6

RussS

Reputation: 16576

No - only between two times within a single partition. The partition key (asset_id, date) makes different dates live in different partitions. Since this combination of values is hashed together, there is no way for C* to know where all the dates within a particular range live without searching every row.

If you want to do range scans based on date, then date must be part of the clustering key, not the partition key.
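As a concrete sketch of that change (the table name is assumed; column names follow the asker's original schema), moving date into the clustering columns makes the range query valid within a single asset partition - with the caveat that all rows for an asset then share one partition:

```sql
CREATE TABLE temp_humidity_by_asset (
    asset_id text,
    date text,
    event_time timestamp,
    temprature int,
    humidity int,
    PRIMARY KEY (asset_id, date, event_time)
);

SELECT * FROM temp_humidity_by_asset
WHERE asset_id = '1234'
  AND date >= '2010-04-02' AND date <= '2011-04-03';
```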

Upvotes: 2
