Reputation: 943
Consider the following table:
CREATE TABLE routes (
start text,
end text,
validFrom timestamp,
validTo timestamp,
PRIMARY KEY (start, end, validFrom, validTo)
);
How would one write a cql query to find all routes that go from point A to point B and are valid between date x and date y. Essentially the equivalent to the following SQL statement:
SELECT * from routes where start = 'A' and end = 'B' and validFrom <= x and validTo >= y.
I've been reading through the Cassandra Docs and my impression is that such a query is not possible with the given table. If this is the case then how might one model the data in order to enable such a query.
Upvotes: 1
Views: 412
Reputation: 57748
SELECT * from routes where start = 'A' and end = 'B' and validFrom <= x and validTo >= y
Due to the way in which Cassandra stores its data on-disk, it can't pull a consecutive range of rows for a query like that. So what you need to do is adjust your model slightly.
CREATE TABLE routes (
start text,
end text,
valid timestamp,
toFrom text,
name text,
PRIMARY KEY (start, end, valid, toFrom)
);
By consolidating the to/from valid
times into a single column, you are now allowed to execute a range query on it, checking for a beginning and an end. Each row will need to be stored twice. Once with the "Valid To" time and once with the "Valid From" time. The toFrom
column helps differentiate this.
Now I can run this query:
> SELECT * FROM routes
WHERE start='A' AND end='B'
AND valid>='2016-02-01'AND valid<='2016-02-20';
start | end | valid | tofrom | name
-------+-----+--------------------------+--------+---------
A | B | 2016-02-01 06:00:00+0000 | F | combo 1
A | B | 2016-02-14 06:00:00+0000 | T | combo 1
A | B | 2016-02-15 06:00:00+0000 | F | combo 2
(3 rows)
Just a thought, but it might make sense to use a composite partition key (like PRIMARY KEY ((start, end), valid, toFrom)
) for better data distribution. Although that really depends on your query patterns....so that won't work if you'd ever need to query for all rows that start with "A" (for example).
Edit 20160221
I accepted your answer but now I thought about it some more I don't think this will work. E.g. I think the query: SELECT * FROM routes WHERE start='A' AND end='B' AND valid>='2016-02-01' AND valid<='2016-02-13'; will yield no results even though the "Combo 1" route should be valid.
Actually, that does work:
> SELECT * FROM routes
WHERE start='A' AND end='B'
AND valid>='2016-02-01'AND valid<='2016-02-13';
start | end | valid | tofrom | name
-------+-----+--------------------------+--------+---------
A | B | 2016-02-01 06:00:00+0000 | F | combo 1
(1 rows)
That being said, I understand what you're getting at. If you should select a time period that happens between the valid to/from (not inclusive), it will yield zero rows...like this:
> SELECT * FROM routes
WHERE start='A' AND end='B'
AND valid>='2016-02-02'AND valid<='2016-02-13';
start | end | valid | tofrom | name
-------+-----+--------------------------+--------+---------
(0 rows)
That's a valid concern. Two points about that:
In terms of your question which asked for the "equivalent to the following SQL statement", you would still have this same problem in a relational database (tried it on my MariaDB test instance). Actually, it'd be worse, as a query for valid>='2016-02-01'AND valid<='2016-02-13'
returns one row in Cassandra, but validFrom>='2016-02-01' AND validTo<='2016-02-13'
returns nothing in MariaDB.
This really becomes a business logic problem, inconsequential of your datastore. To avoid this issue, you should take a look at your business logic/requirements, and determine the maximum time between valid
to/from. You can then adjust your query so that it always exceeds that time. Example: if routes are only valid for a maximum of two or three weeks, then querying for one month at a time should suffice.
Upvotes: 2