Cassandra modelling, I have a billion of some kind of digital code to store, should I use wide row (CQL with cluster key)?

Question

I am currently doing Cassandra modelling, I have billions of some kind of digital code hnm_code to store, like this:

create table hnm (
    create_batch_id int, // A creation batch can generate up to 1 million code.
    hnm_code text,       // Cardinality: billions
    product_name text,
    primary key (hnm_code)
);

The cardinality of create_batch_id is relatively small as compared to of hnm_code. However, what I want is that I should be able to use a value of a single hnm_code column to inquire that record (the create_batch_id is unknown at the time of query). should I use wide row (CQL with cluster key), like this?:

create table hnm_with_cluster_key (
    create_batch_id int,
    hnm_code text,
    product_name text,
    primary key (create_batch_id, hnm_code)
);

Thanks! It would be nice if you could advise me on how can I achieve good performance on massive this query, and evenly distribution of hnm_code?

Aaron · Accepted Answer

what I want is that I should be able to use a value of a single hnm_code column to inquire that record

In Cassandra, you should design your models to match your query patterns. So this case says it all. The first solution with a partition key on hnm_code will fulfill this.

the create_batch_id is unknown at the time of query

If you were to use the second solution with PRIMARY KEY (create_batch_id, hnm_code), you would need to know (and provide) create_batch_id at query time.

It would be nice if you could advise me on how can I achieve good performance on massive this query, and evenly distribution of hnm_code?

Cassandra rows are distributed by the hashed value of the partition key. So the higher the cardinality of that key, the more even distribution you will have in your cluster. Also, Cassandra is designed to perform well with lookup by partition key, so your queries should be quite fast.

In addition, with the 2nd table definition, my query looks like this: select * from hnm_with_cluster_key where hnm_code='1234' allow filtering;

With a CQL rowcount in the billions, using the ALLOW FILTERING directive will not perform well. I strongly recommend against that.

Now I suppose maybe I just need these 2 tables both, One for select a single hnm_code row by a single condition hnm_code = $hnm_code, one for select a creation batch of hnm_codes by create_batch_id = $batch_id, but I resent this duplication, considering that billions of rows is doubled.

And therein lies the crux of your problem. Cassandra simply does not support the type of query flexibility to allow this. It is often not feasible to support multiple queries from a single table design. If you need to support querying by create_batch_id, then you will need both of your tables. Each model simply will not support well-performing queries for the other.

Yes, data duplication/redundancy may violate everything we were taught in school about normalization. But Cassandra is just not designed to work with fully normalized models. I wrote an article last year for Planet Cassandra that discusses some of these trade-offs: Escaping Disco-Era Data Modeling.

Essentially while massive data duplication is not something that anyone really wants to do, it can be a necessary trade-off in designing high-performing Cassandra models.

Cassandra modelling, I have a billion of some kind of digital code to store, should I use wide row (CQL with cluster key)?

Answers (2)

Related Questions