Reputation: 3064
In a C* 1.2.x cluster we have 7 keyspaces, and each keyspace contains a column family that uses wide rows. The CF uses LCS. I periodically do deletes in the rows: initially each row may contain at most 1 entry per day; entries older than 3 months are deleted, and at most 1 entry per week is kept.
I have been running this for a few months, but disk space isn't really being reclaimed, and I need to investigate why. To me it looks like the tombstones are not purged.
Each keyspace has around 1300 sstable files (*-Data.db) and each file is around 130 MB in size (sstable_size_in_mb is 128). gc_grace_seconds is 864000 in each CF. tombstone_threshold is not specified, so it should default to 0.2.
What should I look at to find out why disk space isn't reclaimed?
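For reference, one quick check I can do is to see whether the oldest data files are ever rewritten by compaction (the path below is from my install; keyspace/CF names are placeholders):
$ # If the oldest *-Data.db files are never touched, tombstones never reach them
$ ls -lt /var/lib/cassandra/data/my_keyspace/my_cf/*-Data.db | tail -5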
Upvotes: 3
Views: 2087
Reputation: 427
I was hoping for magic sauce here.
We are going to do a JMX-triggered LCS -> STCS -> LCS switch in a rolling fashion through the cluster. Switching the compaction strategy forces the LCS-structured sstables to be rewritten, which applies the tombstones (in our version of cassandra we can't force an LCS compaction).
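For reference, a minimal sketch of one per-node switch using the jmxterm CLI. The keyspace/CF names are placeholders, and the bean/attribute names are an assumption based on the 1.2-era ColumnFamilyStoreMBean, so verify them against your version:
$ # Per-node, temporary switch to STCS via JMX; the change is not persisted,
$ # so a restart reverts the node to the schema-defined strategy
$ echo "set -b org.apache.cassandra.db:type=ColumnFamilies,keyspace=my_keyspace,columnfamily=my_cf CompactionStrategyClass org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy" | java -jar jmxterm-1.0.2-uber.jar -l localhost:7199 -n
$ # ...wait for the resulting compactions to drain, then switch back:
$ echo "set -b org.apache.cassandra.db:type=ColumnFamilies,keyspace=my_keyspace,columnfamily=my_cf CompactionStrategyClass org.apache.cassandra.db.compaction.LeveledCompactionStrategy" | java -jar jmxterm-1.0.2-uber.jar -l localhost:7199 -n
Because the JMX change is node-local and non-persistent, it can be rolled through the cluster one node at a time; an ALTER TABLE would flip the strategy everywhere at once.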
There are nodetool commands to force a compaction of a table (sketched below), but that might screw up LCS. There are also ways to reset the levels of sstables (in 1.2 the level layout lives in a JSON manifest), but again, that might foobar LCS if you muck with its structure.
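For completeness, the forced (major) compaction mentioned above is just this (keyspace/CF names are placeholders):
$ # Major compaction of a single CF -- on STCS this folds everything together;
$ # on LCS it may disturb the level structure, so use with care
$ nodetool compact my_keyspace my_cf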
What probably should happen is that row tombstones get placed in a separate sstable type that can be processed independently against the "data" sstables to make the purge occur. Processing a tombstone sstable against a data sstable wouldn't remove the tombstone sstable; it would just drop the tombstones that are no longer needed once the data sstable has been processed/pared/pruned. Perhaps these could be classified as "PURGE" tombstones for large-scale data removals, as opposed to the more ad-hoc "DELETE" tombstones that are intermingled with data. But who knows when that would be added to Cassandra.
Upvotes: 1
Reputation: 29
Thanks for the great explanation of LCS, @minaguib. I think this statement from Datastax is misleading, at least to me:
"at most 10% of space will be wasted by obsolete rows."
It depends on how we define "obsolete rows". If "obsolete rows" means ALL the row versions that are waiting to be compacted away, then in your example those would be the age=30, age=29 and age=28 versions. We can end up wasting close to (N-1)/N of the space, because each of those stale versions can sit in a different one of the N levels: e.g. with 4 populated levels, the live age=31 version in L4 can coexist with stale copies in L1, L2 and L3, so 3 of the 4 copies on disk are obsolete.
Upvotes: 0
Reputation: 121
I've answered a similar question before on the cassandra mailing list here.
To elaborate a bit further, it's crucial that you understand how the Leveled Compaction Strategy, and leveldb in general, behave under normal write load.
To summarize the above:
- sstables in LCS have a fixed target size and are organized into levels L0, L1, L2, ...
- each level holds roughly 10x as much data as the previous one
- within each level (L0 aside), sstables cover non-overlapping key ranges, so a read touches at most one sstable per level
- when a level exceeds its size target, sstables from it are compacted with the overlapping sstables of the next level up
The layout of your LCS tree in cassandra is stored in a json file that you can easily check - you can find it in the same directory as the sstables for the keyspace+ColumnFamily. Here's an example of one of my nodes (coupled with the jq tool + awk to summarize):
$ cat users.json | jq ".generations[].members|length" | awk '{print "Level", NR-1, ":", $0, "sstables"}'
Level 0 : 1 sstables
Level 1 : 10 sstables
Level 2 : 109 sstables
Level 3 : 1065 sstables
Level 4 : 2717 sstables
Level 5 : 0 sstables
Level 6 : 0 sstables
Level 7 : 0 sstables
As you've noted, the sstables are all roughly equal in size, so you can see that each level holds roughly 10x the data of the previous one. On the node above I would expect the majority of read operations to be satisfied in ~5 sstable reads. Once I add enough data for Level 4 to reach 10000 sstables and Level 5 starts getting populated, my read latency will increase slightly, as each read will incur one more sstable read to satisfy. (On a tangent, cassandra provides bucketed histograms for you to check all these stats.)
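Those histograms are exposed through nodetool; the SSTables column shows how many sstables each recent read had to touch (keyspace/CF names are placeholders):
$ # Bucketed histograms: sstables per read, read/write latency, partition sizes
$ nodetool cfhistograms my_keyspace my_cf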
With the above out of the way, let's walk through some operations:
- You delete a row: the tombstone is written like any other write and lands in a new sstable in Level 0.
- Compaction slowly promotes that sstable upward, L0 -> L1 -> L2 -> ..., and each step only happens when the lower level fills up, i.e. it is driven by new writes.
- The tombstone shadows the data it meets along the way, but it can only be discarded itself once it has reached the highest populated level (so no older copy of the row can survive below it) and gc_grace_seconds has passed.
I hope this answers your question of why deletes in cassandra, especially with LCS, actually consume space instead of freeing it up (at least initially). The rows and columns that the tombstones are attached to have a size of their own, which might actually be larger than the size of the value you're trying to delete if your values are small.
The key point here is that tombstones must traverse all the levels, up to the highest populated level, before cassandra will actually discard them, and the primary driver of that bubbling up is total write volume.
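If you want to watch this on disk: later Cassandra versions ship an sstablemetadata tool that reports an estimated droppable-tombstone ratio per sstable. Its availability and output format vary by version, so treat this as an assumption for 1.2; the path and names are placeholders:
$ # Print the estimated droppable tombstone ratio for each sstable
$ for f in /var/lib/cassandra/data/my_keyspace/my_cf/*-Data.db; do echo "$f"; sstablemetadata "$f" | grep -i droppable; done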
Upvotes: 7