Reputation: 863
I have a requirement to answer the following queries:
I've created the following table in Cassandra to deal with this:
CREATE TABLE first_purchase_by_shopper_date
(
shop_id uuid,
shopper_id uuid,
dt_first_purchase timestamp,
... (some text fields)
PRIMARY KEY ((shop_id, shopper_id))
);
In order to be able to answer this query in Cassandra, I need to be able to filter this data on the dt_first_purchase field.
But if I add dt_first_purchase to the primary key, then it makes the row non-unique to a shopper - and therefore we get multiple entries in the table - but we only ever want one entry per shopper.
so my insert statement would look like
Insert into first first_purchase_by_shopper_date (shop_id, shopper_id, dt_first_purchase, ... ) Values(...) If Not Exists;
The if not exists at the end ensures that the entry is only written if none exists already (e.g. no update is performed on an existing record.)
How can I filter by date on this table - is a secondary index on the dt_first_purchase column my only option - and isn't this undesirable?
Upvotes: 4
Views: 8955
Reputation: 57788
How can I filter by date on this table - is a secondary index on the dt_first_purchase column my only option - and isn't this undesirable?
You could certainly try a secondary index on dt_first_purchase
(and querying by range on that would also require the use of the ALLOW FILTERING
directive). For performance (especially with a large cluster), I do not recommend that.
But first and foremost, understand that Cassandra is designed around returning data for a specific key on a specific data partition. This means that the best way for you to query your data by a date range, is to first partition your data by a key that makes sense for your model. For instance, if you had a primary key defined like this:
PRIMARY KEY ((shop_id), dt_first_purchase, shopper_id)
basically, a record of which shop (shop_id) recorded a first purchase (dt_first_purchase) for a particular shopper (shopper_id)
With your data partitioned by shop (shop_id) you could then query the first purchases of new shoppers for a particular shop_id like this:
aploetz@cqlsh:stackoverflow> SELECT *
FROM first_purchase_by_shopper_date
WHERE shop_id=ce1089f6-c613-4d5b-a975-5dfd677b46f9
AND dt_first_purchase >= '2014-01-01 00:00:00'
AND dt_first_purchase < '2014-04-01 00:00:00';
shop_id | dt_first_purchase | shopper_id | value
--------------------------------------+--------------------------+--------------------------------------+-------
ce1089f6-c613-4d5b-a975-5dfd677b46f9 | 2014-02-12 18:33:22-0600 | a7480417-aaf8-42b1-85dd-5d9a4a30c204 | shopper1
ce1089f6-c613-4d5b-a975-5dfd677b46f9 | 2014-03-13 11:33:22-0500 | 07db2b71-2dc7-421d-bf73-82a5f6c55f89 | shopper2
(2 rows)
Additionally, you could then count the number of first purchases (new shoppers) for that particular shop and date range, like this:
aploetz@cqlsh:stackoverflow> SELECT COUNT(*) FROM first_purchase_by_shopper_date
WHERE shop_id=ce1089f6-c613-4d5b-a975-5dfd677b46f9
AND dt_first_purchase >= '2014-01-01 00:00:00'
AND dt_first_purchase < '2014-04-01 00:00:00';
count
-------
2
(1 rows)
Please note that this specific example may not work for your use case. Take it for what it is: a demonstration of how partitioning and querying work in Cassandra.
For more information, check out Patrick McFadin's article on Getting Started With Time Series Data Modeling. He discusses ways to solve a use case that is similar to yours.
Upvotes: 5