wspeirs

Reputation: 1413

Cassandra TWCS Merges SSTables from Different Buckets

I created the following table on Cassandra 3.11 for storing metrics using the TimeWindowCompactionStrategy:

CREATE TABLE metrics.my_test (
    metric_name text,
    metric_week text,
    metric_time timestamp,
    tags map<text, text>,
    value double,
    PRIMARY KEY ((metric_name, metric_week), metric_time)
) WITH CLUSTERING ORDER BY (metric_time DESC)
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'MINUTES'}
    AND default_time_to_live = 7776000
    AND gc_grace_seconds = 60;

Following the TLP blog post about TWCS, I thought I could issue a compaction and only SSTables in the same bucket (1-minute window) would be compacted together. However, it seems this is not true, and everything gets compacted together. Before compaction:

# for f in *Data.db; do ls -l $f && java -jar /root/sstable-tools-3.11.0-alpha11.jar describe $f | grep timestamp; done
-rw-r--r-- 1 cassandra cassandra 1431 Mar 22 17:29 mc-10-big-Data.db
Minimum timestamp: 1521739701309280 (03/22/2018 17:28:21)
Maximum timestamp: 1521739777814859 (03/22/2018 17:29:37)
-rw-r--r-- 1 cassandra cassandra 619 Mar 22 17:30 mc-11-big-Data.db
Minimum timestamp: 1521739787241285 (03/22/2018 17:29:47)
Maximum timestamp: 1521739810545148 (03/22/2018 17:30:10)
-rw-r--r-- 1 cassandra cassandra 654 Mar 22 17:20 mc-1-big-Data.db
Minimum timestamp: 1521739189529560 (03/22/2018 17:19:49)
Maximum timestamp: 1521739216248636 (03/22/2018 17:20:16)
-rw-r--r-- 1 cassandra cassandra 1154 Mar 22 17:21 mc-2-big-Data.db
Minimum timestamp: 1521739217033715 (03/22/2018 17:20:17)
Maximum timestamp: 1521739277579629 (03/22/2018 17:21:17)
-rw-r--r-- 1 cassandra cassandra 855 Mar 22 17:22 mc-3-big-Data.db
Minimum timestamp: 1521739283859916 (03/22/2018 17:21:23)
Maximum timestamp: 1521739326037634 (03/22/2018 17:22:06)
-rw-r--r-- 1 cassandra cassandra 1047 Mar 22 17:23 mc-4-big-Data.db
Minimum timestamp: 1521739327868930 (03/22/2018 17:22:07)
Maximum timestamp: 1521739387131847 (03/22/2018 17:23:07)
-rw-r--r-- 1 cassandra cassandra 1288 Mar 22 17:24 mc-5-big-Data.db
Minimum timestamp: 1521739391318240 (03/22/2018 17:23:11)
Maximum timestamp: 1521739459713561 (03/22/2018 17:24:19)
-rw-r--r-- 1 cassandra cassandra 767 Mar 22 17:25 mc-6-big-Data.db
Minimum timestamp: 1521739461284097 (03/22/2018 17:24:21)
Maximum timestamp: 1521739505132186 (03/22/2018 17:25:05)
-rw-r--r-- 1 cassandra cassandra 1216 Mar 22 17:26 mc-7-big-Data.db
Minimum timestamp: 1521739507504019 (03/22/2018 17:25:07)
Maximum timestamp: 1521739583459167 (03/22/2018 17:26:23)
-rw-r--r-- 1 cassandra cassandra 749 Mar 22 17:27 mc-8-big-Data.db
Minimum timestamp: 1521739587644109 (03/22/2018 17:26:27)
Maximum timestamp: 1521739625351120 (03/22/2018 17:27:05)
-rw-r--r-- 1 cassandra cassandra 1259 Mar 22 17:28 mc-9-big-Data.db
Minimum timestamp: 1521739627983733 (03/22/2018 17:27:07)
Maximum timestamp: 1521739698691870 (03/22/2018 17:28:18)

After issuing nodetool compact metrics my_test:

# for f in *Data.db; do ls -l $f && java -jar /root/sstable-tools-3.11.0-alpha11.jar describe $f | grep timestamp; done
-rw-r--r-- 1 cassandra cassandra 8677 Mar 22 17:30 mc-12-big-Data.db
Minimum timestamp: 1521739189529561 (03/22/2018 17:19:49)
Maximum timestamp: 1521739810545148 (03/22/2018 17:30:10)

It's clear that SSTables from multiple time windows were merged together, as the only SSTable remaining after the compaction covers 17:19:49 to 17:30:10.

What can I do to prevent this from happening? I have a large-ish (12 nodes, ~550GB/node) table implemented with TWCS that has multiple overlapping SSTables. I'd like to compact away any tombstones and merge those overlapping SSTables; however, I'm worried I'll be left with a single 550GB SSTable per node. My concern is that a single SSTable that large will be slow to read from... is that a valid concern?

Upvotes: 0

Views: 398

Answers (1)

Chris Lohfink

Reputation: 16410

Don't manually issue nodetool compact; that explicitly merges everything together into one SSTable.
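
If the goal is just to purge droppable tombstones without collapsing all of the windows together, one option worth checking against your version (it was added in Cassandra 3.10) is single-SSTable garbage collection, sketched here with your keyspace and table:

# rewrites each SSTable individually, dropping deletable data,
# without merging SSTables across time windows
nodetool garbagecollect metrics my_test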

TWCS behaves like STCS within the current time window; once the window has passed, it compacts that window down into a single SSTable. A 1-minute window is crazy aggressive and probably not something that will realistically work, since data will be delivered across window boundaries. Flushes can be (and likely will be) more than 1 minute apart, so the data won't even be in an SSTable by the time its window passes, meaning almost everything lands out of window. Some overlapping SSTables are OK, so don't worry too much about that, but you will need a larger window than 1 minute. I'd be careful of anything less than 1 day.
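
As a sketch, assuming a 1-day window suits your workload (tune the unit and size to your own read patterns), the window can be widened in place with an ALTER:

ALTER TABLE metrics.my_test
    WITH compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '1', 'compaction_window_unit': 'DAYS'};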

Especially with the partition key bucketed at 1 week and a 3-month TTL, a 1-minute window would leave you with tens of thousands of SSTables, which isn't maintainable for streaming. Repairs will simply break.
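
To put a rough number on that, assuming roughly one SSTable per expired window, your 90-day TTL (default_time_to_live = 7776000 seconds) works out to:

90 days x 24 hours x 60 minutes = 129,600 one-minute windows over the life of the data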

Upvotes: 0
