colesico
colesico

Reputation: 105

Cassandra/Scylla DB. Filtering topics by multiple categories with end-to-end sorting by topics creation time

Suppose there is a table of topics:

CREATE TABLE topic(
  group text,
  ctegory text,
  created_at timestamp
)

Topics grouped into groups. Each topic belongs to one category. Topic has a creation date.

It is required to select topics (in given group) that match the given categories and the list of results must be sorted by creation date only (excluding category).

If I make the key like this (see below), the list of categories will be sorted not only by created_at , but first by category and then by created_at

PRIMARY KEY (group , category, created_at )

select * from topic where group =... and category in ('A','B','C') 

If I make the key like this (see below), it will be impossible to perform the request without specifying created_at

PRIMARY KEY (group,created_at,category )

select * from topic where group =... and  category in ('A','B','C')

If I make a secondary index on the category field, it won't be able to use the IN operator to specify multiple categories in select startment...

How should I organize the data model for the given selection scenario (table structure, primary key, possibly secondary key)?

Upvotes: 0

Views: 206

Answers (3)

Nadav Har'El
Nadav Har'El

Reputation: 13761

I think the best schema for you would be PRIMARY KEY (group,category,created_at) (the opposite order of what you suggested) or even ((group,category),created_at). In other words, for each group/category combination, you can efficiently retrieve a list of results sorted by date for this particular category.

You can then fetch the results for each category 'A', 'B', and 'C', each will be sorted by data, and you can efficiently merge these three sorted streams of results to get one sorted stream of results.

If you do unpaged request (only makes sense if you know you're expecting, say, 1000 results, but not one million), you can even ask Cassandra or Scylla to resort the results for you with a SELECT with IN and ORDER BY (this isn't supported if you do paging). But it's also very easy to do this merging in the client because it requires a simple merge, not even a full sort.

Upvotes: 2

Madhavan
Madhavan

Reputation: 649

@colesico, I leveraged Paul's data model schema here. Ignore the column and table names used.

But if use IN operator (i.e. category IN ('A','B') ) - sorting by created_at is not performed

It will be sorted correctly within a given partition. Here is a demonstration. I've used Astra DB to quickly come up with this example.

Created the table:

token@cqlsh:payloadtest> create table so20230427(g text,c text, ct timestamp, primary key((g,c),ct)) with clustering order by (ct desc);

Inserted a bunch of records:

token@cqlsh:payloadtest> select * from so20230427 ;

 g | c | ct
---+---+---------------------------------
 1 | 2 | 2023-04-27 16:49:30.172000+0000
 1 | 2 | 2023-04-27 16:49:29.407000+0000
 1 | 2 | 2023-04-27 16:49:27.480000+0000
 1 | 1 | 2023-04-27 16:49:15.788000+0000
 1 | 1 | 2023-04-27 16:49:14.605000+0000
 1 | 1 | 2023-04-27 16:49:02.751000+0000

(6 rows)

When I query, I can very well see that the ct column is sorted in the order that I need, i.e. in descending order.

token@cqlsh:payloadtest> select * from so20230427 where g='1' and c='1';

 g | c | ct
---+---+---------------------------------
 1 | 1 | 2023-04-27 16:49:15.788000+0000
 1 | 1 | 2023-04-27 16:49:14.605000+0000
 1 | 1 | 2023-04-27 16:49:02.751000+0000

(3 rows)
token@cqlsh:payloadtest> select * from so20230427 where g='1' and c in ('1','2');

 g | c | ct
---+---+---------------------------------
 1 | 1 | 2023-04-27 16:49:15.788000+0000
 1 | 1 | 2023-04-27 16:49:14.605000+0000
 1 | 1 | 2023-04-27 16:49:02.751000+0000
^^^^ The records are sorted within the partition of ('1','1') ^^^
 1 | 2 | 2023-04-27 16:49:30.172000+0000
 1 | 2 | 2023-04-27 16:49:29.407000+0000
 1 | 2 | 2023-04-27 16:49:27.480000+0000
^^^^ The records are sorted within the partition of ('1','2') ^^^

(6 rows)

What else are you missing here or needed to accomplish? Are you wanting to accomplish the sorting across all of the partitions (i.e. the group and category partition combination in your case)? If so, you will have to do that in your application.

Upvotes: 1

Paul
Paul

Reputation: 416

Have you already tried using created_at as a clustering column so it sorts on insert? You can use ascending or descending depending on what you need, for example:

    CREATE TABLE topic(
    group text,
    category text,
    created_at timestamp
    PRIMARY KEY (group , category), created_at )
  )WITH CLUSTERING ORDER BY (created_at DESC);

Upvotes: 2

Related Questions