Light Yagami
Light Yagami

Reputation: 331

Data Modeling for range queries in Cassandra

I’m designing a new table in Cassandra

create table student ( studentid text PRIMARY KEY, department text, major text, updatedon timestamp)

I would need to to perform three queries on this table

  1. Query all data (findByAll)
  2. Query all data based on major, planning on adding a secondary index on this column
  3. Query data based on time range I.e updated on column

I can achieve this using a composite primary key, however I also want rows to be uniquely identifiable based on id only. For example:

Row 1 : 1| engineering | electrical | 01-01-2021

If the student were to a different major?

1| engineeering | mechanical | 02-02-2021

I would like to perform an upsert where only the major and updated on columns would change.

My conundrum is I don’t understand what I should have as my primary key if I want to perform range queries on updatedon, where rows a uniquely identified by id only.

I came across a bucketing approach but wasn’t sure if that would add additional complexity to my simple/minimal design.

Upvotes: 2

Views: 160

Answers (1)

Erick Ramirez
Erick Ramirez

Reputation: 16303

It looks like you're approaching it backwards by starting with the table design. When modelling your data in Cassandra, it sounds counter-intuitive but you need to start with the application queries first and design tables against those queries.

Let me illustrate by listing all your app queries and designing a table for each of them.

APP QUERY 1 - Query all data (findByAll)

If your intention is to retrieve all the records to display them, this is a bad idea in Cassandra since it will require a full table scan. I'm aware that developers are used to doing this on toy applications with a small amount of data but in Cassandra, data is distributed across nodes so full table scans don't scale.

Think of situations where you have a million or more records with hundreds of nodes in the cluster. It doesn't make sense for an app to wait for the query to finish retrieving all records.

APP QUERY 2 - Query all data based on major, planning on adding a secondary index on this column

Adding an index on major isn't a good idea if performance matters to you. You should design a table specifically optimised for this query. For example:

CREATE TABLE students_by_major (
    major text,
    studentid text,
    department text,
    updatedon timestamp,
    PRIMARY KEY (major, studentid)
)

In this table, each major partition has 1 or more rows of studentid. For example:

 major                  | studentid | department  | updatedon
------------------------+-----------+-------------+---------------------------------
       computer science |       321 |     science | 2020-01-23 00:00:00.000000+0000
 electrical engineering |       321 | engineering | 2020-02-24 00:00:00.000000+0000
 electrical engineering |       654 | engineering | 2019-05-06 00:00:00.000000+0000
   chemical engineering |       654 | engineering | 2019-07-08 00:00:00.000000+0000
                   arts |       987 |         law | 2020-09-12 00:00:00.000000+0000
      civil engineering |       654 | engineering | 2019-02-04 00:00:00.000000+0000

APP QUERY 3 - Query data based on time range I.e updated on column

You'll only be able to do a range query on updatedon if the column is defined in the primary key.

APP QUERY 4 - If the student were to do a different major?

You can have a table where each student has multiple rows of majors:

CREATE TABLE majors_by_student (
    studentid text,
    major text,
    department text,
    updatedon timestamp,
    PRIMARY KEY (studentid, major)
)

For example, student ID 654 has updated their major 3 times:

cqlsh> SELECT * FROM majors_by_student WHERE studentid = '654';

 studentid | updatedon                       | department  | major
-----------+---------------------------------+-------------+------------------------
       654 | 2019-07-08 00:00:00.000000+0000 | engineering |   chemical engineering
       654 | 2019-05-06 00:00:00.000000+0000 | engineering | electrical engineering
       654 | 2019-02-04 00:00:00.000000+0000 | engineering |      civil engineering

QUERY 5 - You want to perform range queries on updatedon where rows are uniquely identified by studentid only.

CREATE TABLE community.updated_majors_by_student (
    studentid text,
    updatedon timestamp,
    department text,
    major text,
    PRIMARY KEY (studentid, updatedon)
)

Using student 654 above as an example, you can do a range query for any updates made after April 30 with:

SELECT * FROM updated_majors_by_student WHERE studentid = '654' AND updatedon > '2019-04-30 +0000';

Note that since updatedon is a timestamp, you need to specify the timezone for precision and +0000 is the TZ for UTC.

 studentid | updatedon                       | department  | major
-----------+---------------------------------+-------------+------------------------
       654 | 2019-07-08 00:00:00.000000+0000 | engineering |   chemical engineering
       654 | 2019-05-06 00:00:00.000000+0000 | engineering | electrical engineering

To keep the tables above in sync, you need to use CQL BATCH statements as I've described in this article -- https://community.datastax.com/articles/2744/. Cheers!

Upvotes: 3

Related Questions