Reputation: 132
We have a redundant column that we'd like to delete from our Cassandra database (version 2.1.15). This text column represents the majority of the data on disk (15 nodes × 1.8 TB per node).
The easiest option just seems to be an ALTER TABLE to remove that column, and then let Cassandra compaction take care of things (we also run Cassandra Reaper to manage repairs). However, given the size of the dataset, I'm concerned I will knock over the cluster with a massive delete.
Another option I've considered is a process that runs through the keyspace setting the value to null. I think this would have the same effect as removing the column, but it would be more under our control (though it also requires writing something to do it).
Would anyone have any advice on how to approach this?
Thanks!
Upvotes: 4
Views: 3020
Reputation: 422
Dropping a column does mark the deleted values as tombstones. The column value becomes unavailable immediately and the column data is removed in the next compaction cycle.
If you want to expedite the removal of the column data before compaction occurs, you can run nodetool upgradesstables after you use the ALTER TABLE command to change the metadata for the column.
See Documentation: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/alter_table_r.html
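The two steps above can be sketched as follows (the keyspace, table, and column names are placeholders; note that `upgradesstables` without `-a` skips SSTables already on the current format version, so `-a` is needed to force a rewrite of everything):

```shell
# Step 1: drop the column. This only changes table metadata, so it is fast.
cqlsh -e "ALTER TABLE my_keyspace.my_table DROP redundant_col;"

# Step 2: on each node, force-rewrite all SSTables so the dropped column's
# data is physically removed without waiting for a normal compaction cycle.
nodetool upgradesstables -a my_keyspace my_table
```

Since the rewrite is I/O-heavy at this data size, you would typically run step 2 node by node rather than across the whole cluster at once.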
Upvotes: 4
Reputation: 87234
If I remember correctly, dropping a column doesn't actually mark the deleted values with tombstones. Instead, it inserts a corresponding entry into the system.dropped_columns table, and then code such as SerializationHelper and BTreeRow filters the dropped values out on the fly at read time. The data itself is only deleted when compaction happens.
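You can see this bookkeeping directly on a Cassandra 3.0+ cluster, where the system.dropped_columns table exists (in 2.1 the dropped-column timestamps are kept in the schema tables instead); a quick way to check, assuming cqlsh can reach a node:

```shell
# List every column that has been dropped, with the drop timestamp that
# Cassandra uses to filter out stale values at read time.
cqlsh -e "SELECT keyspace_name, table_name, column_name, dropped_time
          FROM system.dropped_columns;"
```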
Explicitly setting the value to null won't make the situation better, because each null write just adds a tombstone to the table.
I would recommend testing the drop on a small cluster and checking how it behaves.
Upvotes: 1