Shashikant Kulkarni

Reputation: 177

Cassandra data model confusion

Looking for some help in Cassandra data modeling.

I'll use a dummy example here. Say I have devices and I collect data from them. I have queries like the following (status can be "published" or "unpublished"):

  1. Select device data where device status="published";

  2. Select device data where device status="published" and enabled=true;

If I want to create a column family to satisfy the above queries, I can do the following:

CREATE TABLE devices (
    device_id text,
    device_name text,
    status text,      -- 'published' or 'unpublished'
    enabled boolean,
    -- ... other device information columns
    PRIMARY KEY (status, enabled, device_id)
);

Now my question is

  1. Can I create a column family like this? If yes, are there any potential problems with this?

  2. The status and enabled values may change for the device so will it create new row because the primary key will be different? If it inserts new row then how to remove the old records? How to refer the new record if old record cannot be deleted by keeping all the other device information same?

Upvotes: 2

Views: 93

Answers (2)

Gautam

Reputation: 564

As @undefined_variable mentioned, this kind of table will lead to hotspots in the cluster. All of your data will end up on at most two nodes (plus their replicas). The first question to ask yourself is how many devices there will be, and whether the queries above really make sense. If you have, say, 100000 devices, would you really read 100000 rows at a time? Wouldn't there be more filters? Based on that, you need to decide how to model this.

Upvotes: 0

undefined_variable

Reputation: 6218

Can I create a column family like this? If yes, are there any potential problems with this?

No. Cassandra will let you create such a table, but I would suggest not to.

The table design has one big problem: data distribution. Since status can only be published or unpublished, only 2 partitions (rows, in the internal storage sense) will ever be created. This will eventually lead to wide rows, which will degrade performance.
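To make the distribution problem concrete, here is roughly how the two queries from the question would look against that table. This is only a sketch; the literal values are illustrative. With PRIMARY KEY (status, enabled, device_id), status alone is the partition key, while enabled and device_id are clustering columns.

```cql
-- Query 1: reads the entire 'published' partition,
-- i.e. every published device in the system.
SELECT * FROM devices WHERE status = 'published';

-- Query 2: reads a slice of that same partition,
-- narrowed by the first clustering column.
SELECT * FROM devices WHERE status = 'published' AND enabled = true;
```

Both queries are valid CQL for this schema, but every read and write for a given status goes to the same replica set, which is where the hotspot comes from.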

CQL to Internal data structure

The status and enabled values may change for the device so will it create new row because the primary key will be different? If it inserts new row then how to remove the old records? How to refer the new record if old record cannot be deleted by keeping all the other device information same?

Based on the above, a particular device can have only 4 distinct combinations: (status=published, enabled=true/false) and (status=unpublished, enabled=true/false). Internally these are not separate rows, though; they are cells within the partition. Deleting a record in Cassandra creates tombstones, so if the status changes frequently and you delete the old record each time, you will accumulate many tombstones and will have to run frequent compactions, otherwise queries will start failing.
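Since primary key columns cannot be updated in place, changing status or enabled with the table from the question means deleting the old record and inserting a new one. A minimal sketch, assuming a hypothetical device 'dev-001' named 'Sensor A' that moves from unpublished/disabled to published/enabled:

```cql
-- Primary key columns cannot be changed with UPDATE, so the
-- old record is deleted and a new one is written. The logged
-- batch ensures that both statements are eventually applied.
BEGIN BATCH
  -- This DELETE leaves a tombstone behind.
  DELETE FROM devices
    WHERE status = 'unpublished' AND enabled = false AND device_id = 'dev-001';

  -- All other device columns have to be written again here.
  INSERT INTO devices (status, enabled, device_id, device_name)
    VALUES ('published', true, 'dev-001', 'Sensor A');
APPLY BATCH;
```

Every status change therefore costs one tombstone plus a full rewrite of the device's other columns, which is the overhead described above.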

I would suggest using a different primary key.
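As one possible direction (a sketch under assumptions, not the only valid model), the device data could be keyed by device_id so that status and enabled become ordinary, updatable columns. A separate lookup table could then serve the status queries. The table names devices_by_id and devices_by_status, the bucket column, and the choice of 16 buckets are all hypothetical and would need to be sized for the real workload.

```cql
-- Main table: one partition per device, evenly distributed.
CREATE TABLE devices_by_id (
    device_id text PRIMARY KEY,
    device_name text,
    status text,
    enabled boolean
    -- ... other device information columns
);

-- status and enabled are regular columns here, so a change
-- is a plain UPDATE with no delete and no new row.
UPDATE devices_by_id
   SET status = 'published', enabled = true
 WHERE device_id = 'dev-001';

-- Hypothetical lookup table for the status queries. The bucket
-- (for example, a hash of device_id modulo 16, computed by the
-- application) spreads each status/enabled pair over several partitions.
CREATE TABLE devices_by_status (
    status text,
    enabled boolean,
    bucket int,
    device_id text,
    PRIMARY KEY ((status, enabled, bucket), device_id)
);

-- The application reads one bucket at a time.
SELECT device_id FROM devices_by_status
 WHERE status = 'published' AND enabled = true AND bucket = 0;
```

The lookup table still needs a delete and an insert when a device changes status, but each of its rows only holds the device_id, and the full device data lives in one place.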

Upvotes: 1
