user1394569

Reputation: 115

Cassandra hard vs soft delete

I have multiple tables whose deleted data I want to keep.

I thought of two options to achieve that:

  1. Create a new table called deleted_x and, when deleting from x, immediately insert the row into deleted_x (see the first sketch after this list).

    Advantage: querying from only one table.

    Disadvantages:

    • An extra insert for each delete
    • When the original table structure changes, I will have to change the deleted table too.
  2. Have a column called is_deleted, put it in the partition key of each of these tables, and set it to true when deleting a row (see the second sketch after this list).

    Advantage: one table structure

    Disadvantage: is_deleted has to be mentioned in every query against the table
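
A rough CQL sketch of option 1 (table and column names like x, deleted_x, and payload are just placeholders):

    -- Original table and its mirror; deleted_x must track x's structure
    CREATE TABLE x (
        id uuid PRIMARY KEY,
        payload text
    );

    CREATE TABLE deleted_x (
        id uuid PRIMARY KEY,
        payload text
    );

    -- Deleting a row becomes two mutations (? are bind markers)
    INSERT INTO deleted_x (id, payload) VALUES (?, ?);
    DELETE FROM x WHERE id = ?;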
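
And a sketch of option 2. Since key columns can't be updated in Cassandra, "setting" the flag would itself mean deleting the old row and inserting a new one:

    -- is_deleted as part of the partition key (hypothetical schema)
    CREATE TABLE x (
        id uuid,
        is_deleted boolean,
        payload text,
        PRIMARY KEY ((id, is_deleted))
    );

    -- Every query must now mention the flag
    SELECT payload FROM x WHERE id = ? AND is_deleted = false;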

Are there any performance considerations I should think of additionally? Which is the better way?

Upvotes: 6

Views: 1965

Answers (2)

Stefan Podkowinski

Reputation: 5249

Are there any performance considerations I should think of additionally?

You will effectively double the write load and storage size of your cluster by inserting your data twice. That extra load carries through to compactions, repairs, bootstrapping new nodes, and backups.

Which is the better way?

Let me suggest a 3rd option instead.

  1. Create a table all_data that contains every row and is never deleted from.
  2. Create a table active_data using the same partition key. This table contains only the key of each non-deleted row (no other data at all, just the key).
  3. Checking whether the key exists in active_data before reading from all_data lets you read only non-deleted rows (sketched below).
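
A minimal CQL sketch of this layout (names are placeholders):

    -- all_data keeps every row forever; nothing is ever deleted from it
    CREATE TABLE all_data (
        id uuid PRIMARY KEY,
        payload text
    );

    -- active_data stores only the key of each non-deleted row
    CREATE TABLE active_data (
        id uuid PRIMARY KEY
    );

    -- Deleting a row touches only active_data
    DELETE FROM active_data WHERE id = ?;

    -- Reading: check the key first, then fetch the row
    SELECT id FROM active_data WHERE id = ?;  -- row found => not deleted
    SELECT * FROM all_data WHERE id = ?;      -- issue only if the key was found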

Upvotes: 0

Jeff Jirsa

Reputation: 4426

Option #1 is awkward, but it's probably the right way to do things in Cassandra. You could issue the two mutations (one DELETE and one INSERT) in a single logged batch, which guarantees that both are eventually written.
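
For example (a logged batch, with x and deleted_x as in the question; ? are bind markers):

    BEGIN BATCH
        INSERT INTO deleted_x (id, payload) VALUES (?, ?);
        DELETE FROM x WHERE id = ?;
    APPLY BATCH;

Note that a logged batch guarantees both mutations are eventually applied, not that they become visible at exactly the same instant.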

Option #2 isn't as easy as you may expect if you're coming from a relational background, because adding an is_deleted column to a Cassandra table and expecting to be able to query against it isn't trivial. The primary reason is that Cassandra performs significantly better when querying against the primary key (partition key(s) plus optional clustering key(s)) than against secondary indexes. Therefore, for maximum performance, you'd need to model is_deleted as a clustering key, which prohibits you from simply issuing an update: you'd need to delete + insert anyway.
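
Concretely, with is_deleted as a clustering column (hypothetical schema), "flipping" the flag is itself a delete plus an insert:

    CREATE TABLE x (
        id uuid,
        is_deleted boolean,
        payload text,
        PRIMARY KEY (id, is_deleted)
    );

    -- Primary-key columns can't be UPDATEd, so marking a row deleted means:
    BEGIN BATCH
        DELETE FROM x WHERE id = ? AND is_deleted = false;
        INSERT INTO x (id, is_deleted, payload) VALUES (?, true, ?);
    APPLY BATCH;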

Option #2 becomes somewhat more viable with Materialized Views; if you're looking at Cassandra 3.0+, they may be worth considering.
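
A sketch of how that could look with 3.0+ materialized views (schema is hypothetical; note that a boolean partition key as shown puts all rows with the same flag value in a single partition, so this only illustrates the mechanics):

    -- Base table: is_deleted is a regular column, updated in place
    CREATE TABLE x (
        id uuid PRIMARY KEY,
        is_deleted boolean,
        payload text
    );

    -- Cassandra maintains the view automatically as x changes
    CREATE MATERIALIZED VIEW x_by_deleted AS
        SELECT * FROM x
        WHERE is_deleted IS NOT NULL AND id IS NOT NULL
        PRIMARY KEY (is_deleted, id);

    -- Only live rows:
    SELECT * FROM x_by_deleted WHERE is_deleted = false;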

Upvotes: 3
