Does Cassandra Optimize Storage For Repeating Strings?

Question

We are trying to reduce our disk usage. In doing so, we noticed that most of what we store is metadata consisting of the same strings repeated across multiple tables and rows.

As strings tend to occupy more space than integers, we thought we could replace these strings with integers to cut down our disk usage. We did this and noticed little difference in disk consumption.

We only noticed a substantial difference when there was a larger variance in the metadata strings, i.e. when the strings varied more.

So now I am wondering if Cassandra 2.1 employs some clever means of storing repetitive information and if so, can someone point me to some sources about how it does this? I have been unable to find anything on the matter.

Thanks.

xmas79 · Accepted Answer

Cassandra never mixes up data belonging to different tables, so if your strings repeat across multiple tables, C* can't optimize them jointly in any way.

The only thing is that C* (unless you disabled it) uses compression during SSTable flushes. Depending on how you design your table, the compression ratio C* achieves will vary greatly. As an example, the compression algorithm will greatly benefit from having your string column as a clustering key. And having too many small memtables could hurt the compression ratio of each SSTable.
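To get a feel for why swapping strings for integers barely moved your disk usage, here is a rough sketch in Python. It is not how C* stores data: it uses `zlib` from the standard library as a stand-in for the SSTable compressor (C* 2.1 defaults to LZ4, as far as I know, so real ratios will differ), and the labels, row count, and one-value-per-line layout are made up for illustration. Sorting the values is only a crude imitation of what a clustering key does to the on-disk order.

```python
import random
import zlib


def compressed_size(values):
    """Return the zlib-compressed size, in bytes, of the values joined by newlines."""
    blob = "\n".join(values).encode("utf-8")
    return len(zlib.compress(blob))


rng = random.Random(42)  # fixed seed so repeated runs use the same data
ROWS = 10_000            # hypothetical row count

# Hypothetical metadata: four distinct strings repeated across many rows.
labels = ["sensor-temperature", "sensor-humidity", "sensor-pressure", "sensor-voltage"]
repeated_strings = [rng.choice(labels) for _ in range(ROWS)]

# The same rows with every string swapped for a small integer id.
ids = {label: str(i) for i, label in enumerate(labels)}
repeated_ints = [ids[s] for s in repeated_strings]

# High-variance strings: 32 random hex characters per row.
varied_strings = ["%032x" % rng.getrandbits(128) for _ in range(ROWS)]

raw_repeated = len("\n".join(repeated_strings))
size_repeated = compressed_size(repeated_strings)
size_ints = compressed_size(repeated_ints)
size_varied = compressed_size(varied_strings)
# Sorted input groups equal values together, loosely like a clustering key would.
size_sorted = compressed_size(sorted(repeated_strings))

print("repeated strings, raw bytes:       ", raw_repeated)
print("repeated strings, compressed bytes:", size_repeated)
print("integer ids, compressed bytes:     ", size_ints)
print("varied strings, compressed bytes:  ", size_varied)
print("sorted strings, compressed bytes:  ", size_sorted)
```

The repeated strings should shrink to a small fraction of their raw size, which leaves the integer version little room to do better, while the varied strings stay large because the compressor finds almost nothing to reuse. The sorted run should come out smallest of the string cases, which is the same effect I would expect from making the string column a clustering key.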
