Reputation: 3064
In a C* 1.2.x cluster we have 7 keyspaces, and each keyspace contains a column family that uses wide rows. The CF uses LCS. I periodically do deletes in the rows: initially each row may contain at most 1 entry per day; entries older than 3 months are deleted, and at most 1 entry per week is kept.
I have been running this for a few months, but disk space isn't really being reclaimed, and I need to investigate why. To me it looks like the tombstones are not purged.
Each keyspace has around 1300 sstable files (*-Data.db) and each file is around 130 MB in size (sstable_size_in_mb is 128). gc_grace_seconds is 864000 in each CF. tombstone_threshold is not specified, so it should default to 0.2.
What should I look at to find out why disk space isn't reclaimed?
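For reference, one quick check I can do is to see whether the oldest data files are ever rewritten by compaction (the path below is from my install; keyspace/CF names are placeholders):
$ # If the oldest *-Data.db files are never touched, tombstones never reach them
$ ls -lt /var/lib/cassandra/data/my_keyspace/my_cf/*-Data.db | tail -5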
Upvotes: 3
Views: 2087
Reputation: 427
I was hoping for magic sauce here.
We are going to do a JMX-triggered LCS -> STCS -> LCS switch in a rolling fashion through the cluster. Switching the compaction strategy forces the LCS-structured sstables to be rewritten, which applies the tombstones (in our version of cassandra we can't force an LCS compaction).
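For reference, a minimal sketch of one per-node switch using the jmxterm CLI. The keyspace/CF names are placeholders, and the bean/attribute names are an assumption based on the 1.2-era ColumnFamilyStoreMBean, so verify them against your version:
$ # Per-node, temporary switch to STCS via JMX; the change is not persisted,
$ # so a restart reverts the node to the schema-defined strategy
$ echo "set -b org.apache.cassandra.db:type=ColumnFamilies,keyspace=my_keyspace,columnfamily=my_cf CompactionStrategyClass org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy" | java -jar jmxterm-1.0.2-uber.jar -l localhost:7199 -n
$ # ...wait for the resulting compactions to drain, then switch back:
$ echo "set -b org.apache.cassandra.db:type=ColumnFamilies,keyspace=my_keyspace,columnfamily=my_cf CompactionStrategyClass org.apache.cassandra.db.compaction.LeveledCompactionStrategy" | java -jar jmxterm-1.0.2-uber.jar -l localhost:7199 -n
Because the JMX change is node-local and non-persistent, it can be rolled through the cluster one node at a time; an ALTER TABLE would flip the strategy everywhere at once.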
There are nodetool commands to force a compaction of a table (sketched below), but that might screw up LCS. There are also ways to reset the levels of sstables (in 1.2 the level layout lives in a JSON manifest), but again, that might foobar LCS if you muck with its structure.
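For completeness, the forced (major) compaction mentioned above is just this (keyspace/CF names are placeholders):
$ # Major compaction of a single CF -- on STCS this folds everything together;
$ # on LCS it may disturb the level structure, so use with care
$ nodetool compact my_keyspace my_cf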
What probably should happen is that row tombstones get placed in a separate sstable type that can be processed independently against the "data" sstables to make the purge occur. Processing a tombstone sstable against a data sstable wouldn't remove the tombstone sstable; it would just drop the tombstones that are no longer needed once the data sstable has been processed/pared/pruned. Perhaps these could be classified as "PURGE" tombstones for large-scale data removals, as opposed to the more ad-hoc "DELETE" tombstones that are intermingled with data. But who knows when that would be added to Cassandra.
Upvotes: 1
Reputation: 29
Thanks for the great explanation of LCS, @minaguib. I think this statement from Datastax is misleading, at least to me:
"at most 10% of space will be wasted by obsolete rows."
It depends on how we define "obsolete rows". If "obsolete rows" means ALL the row versions that are waiting to be compacted away, then in your example those would be the age=30, age=29 and age=28 versions. We can end up wasting close to (N-1)/N of the space, because each of those stale versions can sit in a different one of the N levels: e.g. with 4 populated levels, the live age=31 version in L4 can coexist with stale copies in L1, L2 and L3, so 3 of the 4 copies on disk are obsolete.
Upvotes: 0
Reputation: 121
I've answered a similar question before on the cassandra mailing list here.
To elaborate a bit further, it's crucial that you understand how the Leveled Compaction Strategy, and leveldb in general, behave under normal write load.
To summarize the above:
- sstables in LCS have a fixed target size and are organized into levels L0, L1, L2, ...
- each level holds roughly 10x as much data as the previous one
- within each level (L0 aside), sstables cover non-overlapping key ranges, so a read touches at most one sstable per level
- when a level exceeds its size target, sstables from it are compacted with the overlapping sstables of the next level up
The layout of your LCS tree in cassandra is stored in a json file that you can easily check - you can find it in the same directory as the sstables for the keyspace+ColumnFamily. Here's an example of one of my nodes (coupled with the jq tool + awk to summarize):
$ cat users.json | jq ".generations[].members|length" | awk '{print "Level", NR-1, ":", $0, "sstables"}'
Level 0 : 1 sstables
Level 1 : 10 sstables
Level 2 : 109 sstables
Level 3 : 1065 sstables
Level 4 : 2717 sstables
Level 5 : 0 sstables
Level 6 : 0 sstables
Level 7 : 0 sstables
As you've noted, the sstables are all roughly equal in size, so you can see that each level holds roughly 10x the data of the previous one. On the node above I would expect the majority of read operations to be satisfied in ~5 sstable reads. Once I add enough data for Level 4 to reach 10000 sstables and Level 5 starts getting populated, my read latency will increase slightly, as each read will incur one more sstable read to satisfy. (On a tangent, cassandra provides bucketed histograms for you to check all these stats.)
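Those histograms are exposed through nodetool; the SSTables column shows how many sstables each recent read had to touch (keyspace/CF names are placeholders):
$ # Bucketed histograms: sstables per read, read/write latency, partition sizes
$ nodetool cfhistograms my_keyspace my_cf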
With the above out of the way, let's walk through some operations:
- You delete a row: the tombstone is written like any other write and lands in a new sstable in Level 0.
- Compaction slowly promotes that sstable upward, L0 -> L1 -> L2 -> ..., and each step only happens when the lower level fills up, i.e. it is driven by new writes.
- The tombstone shadows the data it meets along the way, but it can only be discarded itself once it has reached the highest populated level (so no older copy of the row can survive below it) and gc_grace_seconds has passed.
I hope this answers your question of why deletes in cassandra, especially with LCS, actually consume space instead of freeing it up (at least initially). The rows and columns that the tombstones are attached to have a size of their own, which might actually be larger than the size of the value you're trying to delete if your values are small.
The key point here is that tombstones must traverse all the levels, up to the highest populated level, before cassandra will actually discard them, and the primary driver of that bubbling up is total write volume.
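If you want to watch this on disk: later Cassandra versions ship an sstablemetadata tool that reports an estimated droppable-tombstone ratio per sstable. Its availability and output format vary by version, so treat this as an assumption for 1.2; the path and names are placeholders:
$ # Print the estimated droppable tombstone ratio for each sstable
$ for f in /var/lib/cassandra/data/my_keyspace/my_cf/*-Data.db; do echo "$f"; sstablemetadata "$f" | grep -i droppable; done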
Upvotes: 7