awfm9

Reputation: 197

Cassandra Primary Key Selection to Bound Partition Growth

We are currently testing Cassandra as a database for a large amount of metadata on communication events. As most queries will be limited to a single customer, it makes sense to shard by customer ID. However, this would mean the partitions keep growing indefinitely over time. I'm struggling a bit to come up with a solution that seems clean enough.

The first idea is to use a composite key of customer ID and some time interval. Are there other options that might be better and grow more organically?

As we want to have as few partition reads as possible, I was thinking of simply using the year to put an upper bound on the data per customer per partition. However, this would distribute the data rather unevenly, if I'm not mistaken. Could this be solved by moving to months or even weeks/days?
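For concreteness, here is a rough sketch of the bucket computation I have in mind (pure Python; the granularities and names are just placeholders, not an existing schema):

from datetime import datetime, timezone

def time_bucket(ts: datetime, granularity: str = "month") -> str:
    """Map an event timestamp to a partition bucket string.
    Finer granularity means smaller but more numerous partitions:
    year -> '2015', month -> '2015-06', day -> '2015-06-17'."""
    fmt = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}[granularity]
    return ts.strftime(fmt)

print(time_bucket(datetime(2015, 6, 17, tzinfo=timezone.utc)))         # 2015-06
print(time_bucket(datetime(2015, 6, 17, tzinfo=timezone.utc), "day"))  # 2015-06-17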

I'm sure that this is a problem that often comes up and I'm interested in hearing the various solutions that people put into place.

EDIT: To be more clear about the type of queries: they will calculate aggregates over large time slices, per customer. Ideally, we would only have this:

PRIMARY KEY ((customer_id), timestamp)

However, as I've mentioned, this would lead to unbounded growth per partition over the years.
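If we bucketed by month instead, a large time-slice aggregate would have to fan out over every bucket partition in the range, roughly like this (a sketch against a hypothetical events table with a timestamp column ts, using the DataStax Python driver; all names are assumptions):

from datetime import datetime

def month_buckets(start: datetime, end: datetime):
    """Yield 'YYYY-MM' bucket keys covering [start, end]."""
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        yield "%04d-%02d" % (y, m)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

def count_events(session, customer_id, start, end):
    """Client-side aggregate: one partition read per bucket in the range."""
    total = 0
    for bucket in month_buckets(start, end):
        row = session.execute(
            "SELECT COUNT(*) FROM events "
            "WHERE customer_id=%s AND bucket=%s AND ts>=%s AND ts<=%s",
            (customer_id, bucket, start, end)).one()
        total += row[0]
    return total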

Upvotes: 4

Views: 476

Answers (1)

Jim Meyer

Reputation: 9475

Well, a partition can hold a ton of rows, but if your volume over the years is a concern, you could borrow an idea from hash tables. When more than one key hashes to the same slot, the extra values are stored in an overflow linked list.

We can extend the same idea to partitions. When the partition for a high-volume customer "fills up", we add overflow partitions and track them in a list.

So you could define your table like this:

CREATE TABLE events (
    cust_id int,
    bucket int,                 -- which overflow partition this row lives in
    ts int,
    overflow list<int> static,  -- per-partition list of extra buckets in use
    PRIMARY KEY ((cust_id, bucket), ts));

For most customers, you would just set bucket to zero and use a single partition. But if the zero partition gets too big, you add a 1 to the static list to indicate that you are now also storing data in bucket 1. You can then add more partitions to the list as needed.

For example:

INSERT INTO events (cust_id, bucket, ts) VALUES (123, 0, 1);
INSERT INTO events (cust_id, bucket, ts) VALUES (123, 0, 2);

SELECT * from events;

 cust_id | bucket | ts | overflow
---------+--------+----+----------
     123 |      0 |  1 |     null
     123 |      0 |  2 |     null

Now imagine you want to start using a second partition for this customer; just add it to the static list:

UPDATE events SET overflow = overflow + [1] WHERE cust_id=123 AND bucket=0;
INSERT INTO events (cust_id, bucket, ts) VALUES (123, 1, 3);
INSERT INTO events (cust_id, bucket, ts) VALUES (123, 1, 4);

So to check if a customer is using any overflow bucket partitions:

SELECT overflow FROM events WHERE cust_id=123 AND bucket=0 LIMIT 1;

 overflow
----------
      [1]

Now you can do range queries over the partitions:

SELECT * FROM events WHERE cust_id=123 AND bucket IN (0,1) AND ts>1 AND ts<4;

 cust_id | bucket | ts | overflow
---------+--------+----+----------
     123 |      0 |  2 |      [1]
     123 |      1 |  3 |     null

You could define "bucket" to have whatever meaning you want, such as the year. Note that the overflow list is defined as static, so it is stored only once per partition rather than with each event row.

Probably the more conventional approach would be to partition by cust_id and year, but then you need to know the start and end years somehow in order to do queries. With the overflow approach, the first bucket is the master and has a standard, known value (like 0) to start reads from. A drawback is that you need to do a read to find out which bucket to write to, but if each customer generates a large batch of events during a communication session, the overhead of that read might not be too much.
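Roughly, the write path would look like this (a sketch using the DataStax Python driver; the helper names are mine, and the overflow read could be cached per session to amortize it):

def current_bucket(session, cust_id):
    """One read of the static overflow list in bucket 0 finds the newest bucket."""
    row = session.execute(
        "SELECT overflow FROM events WHERE cust_id=%s AND bucket=0 LIMIT 1",
        (cust_id,)).one()
    if row is None or not row.overflow:
        return 0
    return max(row.overflow)

def write_event(session, cust_id, ts):
    bucket = current_bucket(session, cust_id)  # the extra read before each write
    session.execute(
        "INSERT INTO events (cust_id, bucket, ts) VALUES (%s, %s, %s)",
        (cust_id, bucket, ts))

def open_new_bucket(session, cust_id, new_bucket):
    """Call when the current bucket fills up, e.g. from a periodic size check."""
    session.execute(
        "UPDATE events SET overflow = overflow + %s WHERE cust_id=%s AND bucket=0",
        ([new_bucket], cust_id))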

Upvotes: 3
