Klun

Reputation: 54

Cassandra: TTL vs dynamic tables vs large amount of deletes

I basically have a data table like this (a partition key id, along with a serialized value serialized_value):

CREATE TABLE keyspace.data (
    id bigint,
    serialized_value blob,
    PRIMARY KEY (id)
) WITH caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
  AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy', 'enabled': 'true'}
    AND compression = { 'class' : 'LZ4Compressor'};

The use case involves maintaining multiple versions of the data (the serialized_value for a given id).

Every day, I have to send a fresh version of the data into Cassandra. That means 100 million rows/partitions each time.

Of course, I don't need to keep ALL versions of the data, only the last 4 days (so the four most recent version_ids).

I have identified three solutions to do that:

solution 1: TTL

the idea is to set a TTL at insert time. That way, the oldest versions of the data are dropped automatically, without any tombstone-related problems.
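Such an insert with a 4-day TTL (4 × 86400 = 345600 seconds, matching the retention requirement) could look like this sketch (the id and blob values are placeholders):

-- the row expires automatically 4 days after insertion
INSERT INTO keyspace.data (id, serialized_value)
VALUES (42, 0xCAFEBABE)
USING TTL 345600;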

pros :

  • no read performance penalty (?)
  • no problems related to tombstones

cons :

  • if the ingestion fails for several days in a row, I may lose all the data from the Cassandra cluster because of the TTL's automatic expiry

solution 2: dynamic tables

the table creation becomes:

CREATE TABLE keyspace.data_{version_id} (
    id bigint,
    serialized_value blob,
    PRIMARY KEY (id)
) ...;

the table name includes the version_id.

pros :

  • the table (corresponding to a version) is easy to delete
  • no read performance penalty
  • no problems related to tombstones

cons :

  • dynamically adding a table to the cluster may require all nodes to be up every time (to reach schema agreement)
  • a bit more difficult to handle client side (queries must target a version-specific table name instead of a fixed one)
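With this scheme, purging an old version is a single schema statement, for example (the version_id suffix below is hypothetical):

DROP TABLE IF EXISTS keyspace.data_20240101;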

solution 3: large amount of deletes

in that case, all the data stays in a single table, and a version_id is added to the partition key.

CREATE TABLE keyspace.data (
    version_id int,
    id bigint,
    serialized_value blob,
    PRIMARY KEY ((version_id, id))
) ...;

pros :

  • only one single table to create and maintain, for the entire application lifecycle

cons :

  • a read performance penalty may occur because of the large number of tombstones
  • tombstone-related problems, because a large amount of data needs to be deleted in order to purge everything belonging to an old version_id

the delete will only match the exact partition key, so it will generate partition tombstones and NOT cell tombstones. Still, I'm worried about the performance of doing that.
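A purge of an old version would then issue one partition delete per (version_id, id) pair, something like the sketch below. Note that the ids would have to come from your ingestion pipeline, since with this composite partition key CQL cannot delete by version_id alone:

-- one partition tombstone per (version_id, id) pair
DELETE FROM keyspace.data WHERE version_id = 3 AND id = 42;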

What is the best way for you to achieve that? :-)

Upvotes: 0

Views: 465

Answers (1)

Erick Ramirez

Reputation: 16353

It would be preferable to cluster your data on a date or timestamp sorted in reverse order, still with a TTL set. For example:

CREATE TABLE ks.blobs_by_id (
    id bigint,
    version timestamp,
    serialized_value blob,
    PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC);

If you set a default TTL on the table, older versions will expire automatically, so when you retrieve the rows with:

SELECT ... FROM blobs_by_id WHERE id = ? LIMIT 4

only the 4 most recent rows will be returned (in descending order), and you won't be iterating over deleted rows. Cheers!
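The default TTL mentioned above is a regular table option; a 4-day retention (345600 seconds) matching the question would be set like this:

ALTER TABLE ks.blobs_by_id WITH default_time_to_live = 345600;

Any INSERT without an explicit TTL then inherits this expiry.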

Upvotes: 2
