Cassandra SELECT on 2ndary index always sorted on partition key?

Question

Say I have the following table and secondary indices defined:

CREATE TABLE ItemUpdates (
    time         timestamp,
    item_name    text,
    item_context text,
    item_descr   text,
    tags         map,
    PRIMARY KEY ((time, item_name, item_context))
);

CREATE INDEX ItemUpdateByName
    ON ItemUpdates(item_name);

CREATE INDEX ItemUpdateByContext
    ON ItemUpdates(item_context);

CREATE INDEX ItemUpdateByTag
    ON ItemUpdates(KEYS(tags));

General background information on the data model: an item has a unique name within a context, so (item_name, item_context) is a natural key for items. Tags have some value associated with them.

A natural query in my application is "show me all updates on item X with a certain tag". This translates to:

SELECT * FROM ItemUpdates
    WHERE item_name='x'
        AND item_context='a'
        AND tags CONTAINS KEY 't';

When I try some queries I notice that although the cluster uses the Murmur3Partitioner, the results come ordered by time. This makes sense when you consider that Cassandra stores secondary indices as wide rows, and that colums are ordered by their name.

(1) Does Cassandra always return rows sorted by partition key when selecting on a(n) (set of) indexed column(s)?

The reason I find this interesting is that other natural queries in my application include:

fetch all updates on item X, since date D
fetch the 300 most recent updates on item X

What surprises me is that adding a clause ORDER BY time DESC to my select statement on ItemUpdates results in an error message "ORDER BY with 2ndary indexes is not supported."

(2) (How) can I do a range query on the partition key when I narrow the query by selecting on an indexed column?

ashic · Accepted Answer

The only natural "auto" sorting that you should get on cassandra is for columns in a wide row. partitions when using murmur3 are not "sorted" as that would mess up the random distribution (afaik). Indexes are stored on each node in a "hidden" table as wide rows. When on filter on an index, it's hitting that "partition" "on the node" and the values are the rows in that partition (which correspond to matching rows on that node). Try your query using different data sets and different columns. Maybe the data you have cause the results to be sorted.

(2) As it stands, you can only do range queries on clustering keys, not on the partition key. In general, for efficient querying, you should attempt to hit one (or a few) partitions, and filter on indexes / filter on clustering keys / range query on the clustering key. If you attempt to not hit a partition, it becomes a clusterwide operation, which isn't usually great. If you are looking to do cluster wide analysis (ala map reduce style), take a look at Apache Spark. Spark cassandra integration is quite good and is getting better.

Cassandra SELECT on 2ndary index always sorted on partition key?

Answers (1)

Related Questions