Reputation: 1054
I am considering the use of Cassandra as a time-series store. I have millions of series and each series have around 10K of sequential points with uniform intervals. Some series though have a few thousands points or less. They may start and end at different points but all share the same times. I access the data series
I am considering two options. First I could just have a column per time as it is recommended for monitoring systems for example (I have a different access pattern though). Second, using list columns one per partition.
I am worried about read performance (second use case is more critical) and storage overhead. I did finf the following formula:
total_column_size = column_name_size + column_value_size + 15
here
I think that would make the first option quite expensive in terms of storage. I could not find any documentation for a list storage layout. Do you know of any? Have other recommendations?
BTW, I am using python as a client for cassandra if that makes any difference.
Upvotes: 1
Views: 279
Reputation: 1769
"Storage is cheap" is generally the philosophy here. If you have 2 query patterns, which you seem to, then store everything twice: once partitioned by your desired verticals (days by the looks), and once again by your chosen series. If you don't know how to partition your series in advance (it wasn't clear from the question) then it becomes more complicated. Cassandra reads are sequential when reading in order - and this is the only way you should be using it anyway.
You have in the region of X0bn points which is larger than your average DB but is not bordering on ridiculous, particularly when distributed over a cluster. It's hard to put an exact figure given that I don't know the width of your data points, but if these are just scalar values then this is only going to be 2TB or so of data.
Upvotes: 3