Reputation: 54
I basically have a data table like this (a partition key id, along with a serialized value serialized_value):
CREATE TABLE keyspace.data (
id bigint,
serialized_value blob,
PRIMARY KEY (id)
) WITH caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy', 'enabled': 'true'}
AND compression = { 'class' : 'LZ4Compressor'};
The use case involves maintaining multiple versions of the data (the serialized_value for a given id).
Every day, I have to send a fresh version of the data into Cassandra. This involves 100 million rows/partitions each time.
Of course, I don't need to keep ALL versions of the data, only the last 4 days (so the four most recent version_id values).
I have identified three solutions to do that:
solution 1 : TTL
the idea is to set a TTL at insert time. That way, the oldest versions of the data are automatically dropped, without the problems related to tombstones.
pros :
- no read performance penalty (?)
- no problems related to tombstones
cons :
- if ingestion fails for several days in a row, I may lose all the data from the Cassandra cluster because of the automatic TTL deletes
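A minimal sketch of what solution 1 looks like in CQL, assuming 4 days of retention (4 × 86400 = 345600 seconds); the id and blob values are just placeholders:

```sql
-- Each insert carries its own TTL; once 345600 seconds have passed,
-- the row expires and is eventually purged without an explicit DELETE.
INSERT INTO keyspace.data (id, serialized_value)
VALUES (42, 0xcafebabe)
USING TTL 345600;
```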
solution 2 : dynamic tables
the table creation becomes :
CREATE TABLE keyspace.data_{version_id} (
id bigint,
serialized_value blob,
PRIMARY KEY (id)
) ...;
The table name includes the version_id.
pros :
- the table (corresponding to a version) is easy to delete
- no read performance penalty
- no problems related to tombstones
cons :
- dynamically adding a table to the cluster may require all the nodes to be up every time (for schema agreement)
- a bit more difficult to handle client-side (queries must target a version-specific table name instead of a fixed one)
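For reference, a sketch of solution 2 in CQL, with a hypothetical version_id of 20230104 baked into the table name by the client:

```sql
-- Each daily version gets its own table (name built client-side).
CREATE TABLE keyspace.data_20230104 (
    id bigint,
    serialized_value blob,
    PRIMARY KEY (id)
);

-- Purging an old version is a cheap schema operation, no tombstones:
DROP TABLE keyspace.data_20230101;
```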
solution 3 : large amount of deletes
in that case, all the data stay in a single table, and a version_id
is added to the primary key.
CREATE TABLE keyspace.data (
version_id int,
id bigint,
serialized_value blob,
PRIMARY KEY ((version_id,id))
) ...;
pros :
- only one single table to create and maintain, for the entire application lifecycle
cons :
- a read performance penalty may occur because of the large number of tombstones
- problems related to tombstones, because a large amount of data needs to be deleted in order to purge everything related to old version_ids.
The deletes will match exact partition keys only, so they will generate partition tombstones and NOT cell tombstones. Still, I'm worried about the performance of doing that.
What do you think is the best way to achieve that? :-)
Upvotes: 0
Views: 465
Reputation: 16353
It would be preferable to cluster your data by a date or timestamp sorted in reverse order, still with a TTL set. For example:
CREATE TABLE ks.blobs_by_id (
id bigint,
version timestamp,
serialized_value blob,
PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC);
If you set a default TTL on the table, older versions will automatically expire, so when you retrieve the rows with:
SELECT ... FROM blobs_by_id WHERE id = ? LIMIT 4
only the 4 most recent rows will be returned (in descending order) and you won't be iterating over expired rows. Cheers!
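The default TTL can be set directly on the table so clients don't have to pass USING TTL on every insert; a sketch, assuming 4 days of retention (345600 seconds):

```sql
CREATE TABLE ks.blobs_by_id (
    id bigint,
    version timestamp,
    serialized_value blob,
    PRIMARY KEY (id, version)
) WITH CLUSTERING ORDER BY (version DESC)
  AND default_time_to_live = 345600;  -- 4 days, applied to every write
```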
Upvotes: 2