stack user
stack user

Reputation: 863

Cassandra filtering by date with a secondary index

I have a requirement to answer the following queries:

I've created the following table in Cassandra to deal with this:

CREATE TABLE first_purchase_by_shopper_date
(
    shop_id                     uuid,
    shopper_id                  uuid,
    dt_first_purchase           timestamp,

    ... (some text fields)

    PRIMARY KEY ((shop_id, shopper_id))
);

In order to be able to answer this query in Cassandra, I need to be able to filter this data on the dt_first_purchase field.

But if I add dt_first_purchase to the primary key, then it makes the row non-unique to a shopper - and therefore we get multiple entries in the table - but we only ever want one entry per shopper.

so my insert statement would look like

Insert into first first_purchase_by_shopper_date (shop_id, shopper_id, dt_first_purchase, ... ) Values(...) If Not Exists;

The if not exists at the end ensures that the entry is only written if none exists already (e.g. no update is performed on an existing record.)

How can I filter by date on this table - is a secondary index on the dt_first_purchase column my only option - and isn't this undesirable?

Upvotes: 4

Views: 8955

Answers (1)

Aaron
Aaron

Reputation: 57788

How can I filter by date on this table - is a secondary index on the dt_first_purchase column my only option - and isn't this undesirable?

You could certainly try a secondary index on dt_first_purchase (and querying by range on that would also require the use of the ALLOW FILTERING directive). For performance (especially with a large cluster), I do not recommend that.

But first and foremost, understand that Cassandra is designed around returning data for a specific key on a specific data partition. This means that the best way for you to query your data by a date range, is to first partition your data by a key that makes sense for your model. For instance, if you had a primary key defined like this:

PRIMARY KEY ((shop_id), dt_first_purchase, shopper_id)

basically, a record of which shop (shop_id) recorded a first purchase (dt_first_purchase) for a particular shopper (shopper_id)

With your data partitioned by shop (shop_id) you could then query the first purchases of new shoppers for a particular shop_id like this:

aploetz@cqlsh:stackoverflow> SELECT * 
FROM first_purchase_by_shopper_date 
WHERE shop_id=ce1089f6-c613-4d5b-a975-5dfd677b46f9 
AND dt_first_purchase >= '2014-01-01 00:00:00' 
AND dt_first_purchase < '2014-04-01 00:00:00';

 shop_id                              | dt_first_purchase        | shopper_id                           | value
--------------------------------------+--------------------------+--------------------------------------+-------
 ce1089f6-c613-4d5b-a975-5dfd677b46f9 | 2014-02-12 18:33:22-0600 | a7480417-aaf8-42b1-85dd-5d9a4a30c204 | shopper1
 ce1089f6-c613-4d5b-a975-5dfd677b46f9 | 2014-03-13 11:33:22-0500 | 07db2b71-2dc7-421d-bf73-82a5f6c55f89 | shopper2

(2 rows)

Additionally, you could then count the number of first purchases (new shoppers) for that particular shop and date range, like this:

aploetz@cqlsh:stackoverflow> SELECT COUNT(*) FROM first_purchase_by_shopper_date
WHERE shop_id=ce1089f6-c613-4d5b-a975-5dfd677b46f9 
AND dt_first_purchase >= '2014-01-01 00:00:00' 
AND dt_first_purchase < '2014-04-01 00:00:00';

 count
-------
     2

(1 rows)

Please note that this specific example may not work for your use case. Take it for what it is: a demonstration of how partitioning and querying work in Cassandra.

For more information, check out Patrick McFadin's article on Getting Started With Time Series Data Modeling. He discusses ways to solve a use case that is similar to yours.

Upvotes: 5

Related Questions