Reputation: 331
I’m designing a new table in Cassandra
create table student ( studentid text PRIMARY KEY, department text, major text, updatedon timestamp)
I would need to to perform three queries on this table
I can achieve this using a composite primary key, however I also want rows to be uniquely identifiable based on id only. For example:
Row 1 : 1| engineering | electrical | 01-01-2021
If the student were to a different major?
1| engineeering | mechanical | 02-02-2021
I would like to perform an upsert where only the major and updated on columns would change.
My conundrum is I don’t understand what I should have as my primary key if I want to perform range queries on updatedon, where rows a uniquely identified by id only.
I came across a bucketing approach but wasn’t sure if that would add additional complexity to my simple/minimal design.
Upvotes: 2
Views: 160
Reputation: 16303
It looks like you're approaching it backwards by starting with the table design. When modelling your data in Cassandra, it sounds counter-intuitive but you need to start with the application queries first and design tables against those queries.
Let me illustrate by listing all your app queries and designing a table for each of them.
APP QUERY 1 - Query all data (findByAll)
If your intention is to retrieve all the records to display them, this is a bad idea in Cassandra since it will require a full table scan. I'm aware that developers are used to doing this on toy applications with a small amount of data but in Cassandra, data is distributed across nodes so full table scans don't scale.
Think of situations where you have a million or more records with hundreds of nodes in the cluster. It doesn't make sense for an app to wait for the query to finish retrieving all records.
APP QUERY 2 - Query all data based on major, planning on adding a secondary index on this column
Adding an index on major
isn't a good idea if performance matters to you. You should design a table specifically optimised for this query. For example:
CREATE TABLE students_by_major (
major text,
studentid text,
department text,
updatedon timestamp,
PRIMARY KEY (major, studentid)
)
In this table, each major
partition has 1 or more rows of studentid
. For example:
major | studentid | department | updatedon
------------------------+-----------+-------------+---------------------------------
computer science | 321 | science | 2020-01-23 00:00:00.000000+0000
electrical engineering | 321 | engineering | 2020-02-24 00:00:00.000000+0000
electrical engineering | 654 | engineering | 2019-05-06 00:00:00.000000+0000
chemical engineering | 654 | engineering | 2019-07-08 00:00:00.000000+0000
arts | 987 | law | 2020-09-12 00:00:00.000000+0000
civil engineering | 654 | engineering | 2019-02-04 00:00:00.000000+0000
APP QUERY 3 - Query data based on time range I.e updated on column
You'll only be able to do a range query on updatedon
if the column is defined in the primary key.
APP QUERY 4 - If the student were to do a different major?
You can have a table where each student has multiple rows of majors:
CREATE TABLE majors_by_student (
studentid text,
major text,
department text,
updatedon timestamp,
PRIMARY KEY (studentid, major)
)
For example, student ID 654
has updated their major 3 times:
cqlsh> SELECT * FROM majors_by_student WHERE studentid = '654';
studentid | updatedon | department | major
-----------+---------------------------------+-------------+------------------------
654 | 2019-07-08 00:00:00.000000+0000 | engineering | chemical engineering
654 | 2019-05-06 00:00:00.000000+0000 | engineering | electrical engineering
654 | 2019-02-04 00:00:00.000000+0000 | engineering | civil engineering
QUERY 5 - You want to perform range queries on updatedon
where rows are uniquely identified by studentid
only.
CREATE TABLE community.updated_majors_by_student (
studentid text,
updatedon timestamp,
department text,
major text,
PRIMARY KEY (studentid, updatedon)
)
Using student 654
above as an example, you can do a range query for any updates made after April 30 with:
SELECT * FROM updated_majors_by_student WHERE studentid = '654' AND updatedon > '2019-04-30 +0000';
Note that since updatedon
is a timestamp, you need to specify the timezone for precision and +0000
is the TZ for UTC.
studentid | updatedon | department | major
-----------+---------------------------------+-------------+------------------------
654 | 2019-07-08 00:00:00.000000+0000 | engineering | chemical engineering
654 | 2019-05-06 00:00:00.000000+0000 | engineering | electrical engineering
To keep the tables above in sync, you need to use CQL BATCH
statements as I've described in this article -- https://community.datastax.com/articles/2744/. Cheers!
Upvotes: 3