Reputation: 1865
If after a long period of time, STCS produced a really big SSTable, and later on we received a read request for a partition key that only exists in that big SSTable (i.e. it's unique across all the SSTables for that table), would the read latency be increased because we're dealing with a big SSTable, or is the read latency NOT influenced by the size of a partition index?
On a side note, I suppose that having the help of the partition summary and then using the partition index with pointers for just one big SSTable is still better than seeking a lot of smaller SSTables.
Upvotes: 0
Views: 368
Reputation: 8812
First, there is a single instance of the partition key cache per Cassandra process, shared by all tables and all of their SSTables. Its size limit is defined in cassandra.yaml:
# Default value is empty to make it "auto" (min(5% of Heap (in MB), 100MB)).
# Set to 0 to disable key cache.
key_cache_size_in_mb:
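As a rough illustration of the "auto" rule quoted in the comment above, here is a minimal Python sketch. The function name `auto_key_cache_size_mb` is hypothetical, and rounding down to a whole number of MB is an assumption rather than something the config comment specifies.

```python
def auto_key_cache_size_mb(heap_mb: int) -> int:
    """Auto key cache size as described in cassandra.yaml: min(5% of heap, 100 MB).

    Integer (floor) division is an assumption made for this sketch.
    """
    five_percent_of_heap = heap_mb * 5 // 100
    return min(five_percent_of_heap, 100)


print(auto_key_cache_size_mb(1024))  # 51
print(auto_key_cache_size_mb(8192))  # 100
```

In other words, with any heap of 2 GB or more the automatic size is capped at 100 MB, regardless of how large the SSTables are.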
The index summary is used to binary-search for the nearest sampled partition offset, from which the partition index is then scanned. By default every 128th partition key is sampled (min_index_interval), but for SSTables that hold a lot of partition keys the sampling interval can be increased, up to max_index_interval, to save memory. Both bounds are table options:
CREATE TABLE music.example (
id int PRIMARY KEY
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
...
AND max_index_interval = 2048
AND min_index_interval = 128
...;
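To make the role of the sampling interval concrete, here is a simplified Python sketch of a sampled summary over a sorted index. It is not Cassandra's implementation: the names `build_summary` and `lookup` are hypothetical, plain integers stand in for partition tokens, and list positions stand in for file offsets.

```python
from bisect import bisect_right


def build_summary(sorted_keys, interval=128):
    """Sample every `interval`-th key together with its position in the full index."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), interval)]


def lookup(sorted_keys, summary, key, interval=128):
    """Binary-search the summary, then scan at most `interval` index entries."""
    sampled = [k for k, _ in summary]
    slot = bisect_right(sampled, key) - 1  # last sampled key <= key
    if slot < 0:
        return None  # key sorts before the first index entry
    start = summary[slot][1]
    for pos in range(start, min(start + interval, len(sorted_keys))):
        if sorted_keys[pos] == key:
            return pos
    return None


keys = list(range(0, 1_000_000, 2))  # 500,000 sorted keys (even numbers only)
summary = build_summary(keys)
print(len(summary))                   # 3907
print(lookup(keys, summary, 123456))  # 61728
print(lookup(keys, summary, 123457))  # None
```

In this sketch the binary search cost grows only logarithmically with the number of sampled entries, and the scan that follows is bounded by the interval. Raising the interval (for example toward 2048) shrinks the in-memory summary at the price of a longer scan per lookup.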
The total memory used by all index summaries can be configured in cassandra.yaml:
# A fixed memory pool size in MB for SSTable index summaries. If left
# empty, this will default to 5% of the heap size. If the memory usage of
# all index summaries exceeds this limit, SSTables with low read rates will
# shrink their index summaries in order to meet this limit. However, this
# is a best-effort process. In extreme conditions Cassandra may need to use
# more than this amount of memory.
index_summary_capacity_in_mb:
# How frequently index summaries should be resampled. This is done
# periodically to redistribute memory from the fixed-size pool to sstables
# proportional to their recent read rates. Setting to -1 will disable this
# process, leaving existing index summaries at their current sampling level.
index_summary_resize_interval_in_minutes: 60
See CASSANDRA-6379.
So, to answer your question, the read performance for a big SSTable:
Upvotes: 2