Gleydson S. Tavares
Gleydson S. Tavares

Reputation: 600

Distinct users in Cassandra

I have the following problem: In my Cassandra database I have several messages sent by several users. My messages table has the following structure:

CREATE TABLE messages (
  recipient bigint,
  sender bigint,
  created_at text,
  content text,
  PRIMARY KEY((recipient, sender),created_at)
);

I need to count the number of messages sent by a user in a day. For example, between the date 2017-01-01 and 2017-01-05

sender | created_at
  1       2017-01-01
  1       2017-01-01
  2       2017-01-01
  3       2017-01-02  
  3       2017-01-02
  4       2017-01-03
  4       2017-01-04
  5       2017-01-04

I would have the result

2017-01-01 = 2
2017-01-02 = 1
2017-01-03 = 1  
2017-01-04 = 2

Upvotes: 1

Views: 68

Answers (2)

xmas79
xmas79

Reputation: 5180

From what I can see, you cannot do that with your table structure because your partition key contains the recipient. To be told, you should not count at all, because counting keys in cassandra is hard.

However, if you insist on counting these keys, I suggest you two approaches:


Create a new counter table

CREATE TABLE counters_by_user (
    sender bigint,
    ts timestamp,
    messages counter,
    PRIMARY KEY (sender, ts)
)

This table will allow you to directly fetch the value you are looking for. It allows you to select the appropriate "granularity" of the counters, that is if you want a day-by-day counter simply store the timestamp in the ts field in the yyyy-mm-dd format. If you want hourly-based counting, store it in yyyy-mm-dd HH:00 format, etc... You'll need the exact sender only to fetch the results, and can range query by specifying the ts component of the primary key. Have a look at the Counters page documentation on how to use them, and beware that the main drawback of this approach is that Cassandra can over/under count, so watch your steps if you need to be pedantic on counting.


Create a new message table, aka denormalize your data

CREATE TABLE messages_by_sender (
    sender bigint,
    created_at timestamp,
    PRIMARY KEY (sender, created_at)
);

Every time you insert a row in the messages table you'll insert a row here, and when you need to count the messages sent you simply run a SELECT COUNT(*) FROM messages_by_sender WHERE sender=? to count them all, or SELECT COUNT(*) FROM messages_by_sender WHERE sender=? AND created_at > ? AND created_at < ?; to specify a range. If you have a lot of messages per sender, however this will lead to inefficiency, since counting keys in Cassandra requires a partition scan.

Upvotes: 3

blr
blr

Reputation: 968

AFAIK, cassandra doesn't allow do SQL-like aggregations. You either need to pre-compute & keep into another database OR do the counting as the results are being scanned/queried.

Another option it to integrate cassandra with hadoop & do some map-reduce, but that seems like an overkill for what you're trying here.

Upvotes: 1

Related Questions