Reputation: 145
I am designing my cluster and want to set proper sizes for key_cache and row_cache depending on the size of the tables/column families. Do we have something in Cassandra/CQL similar to this MySQL query?
SELECT table_name AS "Tables",
round(((data_length + index_length) / 1024 / 1024), 2) "Size in MB"
FROM information_schema.TABLES
WHERE table_schema = "$DB_NAME";
Or is there any other way to look up the data size and the index size separately?
Or, what configuration would each node need to hold my table completely in memory, without considering any replication factor?
Upvotes: 3
Views: 4118
Reputation: 11100
The key cache and the row cache work rather differently, and it's important to understand the difference when calculating sizes.
The key cache is a cache of the offsets within SSTable files where rows are located. It is basically a map from (key, file) to offset. Scaling the key cache therefore depends on the number of rows, not the overall data size. You can find the number of rows from the 'Number of keys' parameter in 'nodetool cfstats'. Note this is per node, not a total, but that is what you want for deciding on cache sizes. The default size is min(5% of the heap (in MB), 100 MB), which is probably sufficient for most applications. A subtlety here is that a row may exist in multiple files (SSTables), the number depending on your write pattern; however, this duplication is accounted for (approximately) in the estimated count from nodetool.
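As a rough sizing check, something like the following should work (the keyspace and table names here are placeholders, and on newer Cassandra versions the command is 'nodetool tablestats' rather than 'cfstats'; on some older versions you may need to run it without arguments and grep the full output):

# Row count estimate on this node; this, not data volume, drives key cache sizing
nodetool cfstats my_keyspace.my_table | grep 'Number of keys'

# The global cap lives in cassandra.yaml; leaving it blank gives the
# default of min(5% of the heap, 100 MB)
#   key_cache_size_in_mb: 100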
The row cache caches the actual row. To get a size estimate for this you can use the 'Space used' parameter in 'nodetool cfstats'. However, the row cache stores deserialized data, and only the latest copy, so the in-memory size could be quite different (higher or lower).
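If you do want to try the row cache, a minimal sketch might look like this (placeholder names again; the map-style caching syntax is the one used from Cassandra 2.1 onward, while older versions take a single string such as 'rows_only'):

# On-disk size per table on this node; the deserialized in-memory size can differ
nodetool cfstats my_keyspace.my_table | grep 'Space used'

# Give the row cache some capacity in cassandra.yaml (0 disables it):
#   row_cache_size_in_mb: 512

# Then opt the table in, since caching is configured per table
cqlsh -e "ALTER TABLE my_keyspace.my_table WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'};"

Note that the cassandra.yaml change only takes effect after a node restart.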
There is also a third, less configurable cache: your OS filesystem cache. In most cases this is actually better than the row cache. It avoids duplicating data in memory, because when using the row cache the data will most likely be in the filesystem cache too. And reading from an SSTable in the filesystem cache was only about 30% slower than the row cache in my experiments (done a while ago, so probably not exact any more, but unlikely to be significantly different). The main use case for the row cache is when you have one relatively small column family that you want to ensure is cached. Otherwise, relying on the filesystem cache is probably best.
In conclusion, the Cassandra defaults of a large key cache and no row cache are the best for most setups. You should only play with the caches if you know your access pattern won't work with the defaults or if you're having performance issues.
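If you do need to investigate, the cache hit rates reported by nodetool are a reasonable starting point (the exact output fields vary a little between versions):

# Node-wide key cache and row cache statistics, including hit rates
nodetool info | grep -E 'Key Cache|Row Cache'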
Upvotes: 1