Reputation: 490
I am currently doing Cassandra modelling, I have billions of some kind of digital code hnm_code
to store, like this:
create table hnm (
create_batch_id int, // A creation batch can generate up to 1 million code.
hnm_code text, // Cardinality: billions
product_name text,
primary key (hnm_code)
);
The cardinality of create_batch_id
is relatively small as compared to of hnm_code
. However, what I want is that I should be able to use a value of a single hnm_code
column to inquire that record (the create_batch_id
is unknown at the time of query).
should I use wide row (CQL with cluster key), like this?:
create table hnm_with_cluster_key (
create_batch_id int,
hnm_code text,
product_name text,
primary key (create_batch_id, hnm_code)
);
Thanks! It would be nice if you could advise me on how can I achieve good performance on massive this query, and evenly distribution of hnm_code?
Upvotes: 2
Views: 168
Reputation: 57798
what I want is that I should be able to use a value of a single hnm_code column to inquire that record
In Cassandra, you should design your models to match your query patterns. So this case says it all. The first solution with a partition key on hnm_code
will fulfill this.
the create_batch_id is unknown at the time of query
If you were to use the second solution with PRIMARY KEY (create_batch_id, hnm_code)
, you would need to know (and provide) create_batch_id
at query time.
It would be nice if you could advise me on how can I achieve good performance on massive this query, and evenly distribution of hnm_code?
Cassandra rows are distributed by the hashed value of the partition key. So the higher the cardinality of that key, the more even distribution you will have in your cluster. Also, Cassandra is designed to perform well with lookup by partition key, so your queries should be quite fast.
In addition, with the 2nd table definition, my query looks like this:
select * from hnm_with_cluster_key where hnm_code='1234' allow filtering;
With a CQL rowcount in the billions, using the ALLOW FILTERING
directive will not perform well. I strongly recommend against that.
Now I suppose maybe I just need these 2 tables both, One for select a single hnm_code row by a single condition
hnm_code = $hnm_code
, one for select a creation batch of hnm_codes bycreate_batch_id = $batch_id
, but I resent this duplication, considering that billions of rows is doubled.
And therein lies the crux of your problem. Cassandra simply does not support the type of query flexibility to allow this. It is often not feasible to support multiple queries from a single table design. If you need to support querying by create_batch_id
, then you will need both of your tables. Each model simply will not support well-performing queries for the other.
Yes, data duplication/redundancy may violate everything we were taught in school about normalization. But Cassandra is just not designed to work with fully normalized models. I wrote an article last year for Planet Cassandra that discusses some of these trade-offs: Escaping Disco-Era Data Modeling.
Essentially while massive data duplication is not something that anyone really wants to do, it can be a necessary trade-off in designing high-performing Cassandra models.
Upvotes: 2
Reputation: 784
Cassandra is different from another SQL, it used the first primary key as partition key. On my opinion, the partition key is better not uniqued. So the second design is better.
Upvotes: 0