Reputation: 2895
I am in the process of learning Cassandra as an alternative to SQL databases for one of the projects I am working for, that involves Big Data.
For the purpose of learning, I've been watching the videos offered by DataStax, more specifically DS220 which covers modeling data in Cassandra.
While watching one of the videos in the course series I was introduced to the concept of splitting partitions to manage partition size.
My current understanding is that Cassandra has a max logical capacity of 2B entries per partition, but a suggested max of a couple 100s MB per partition.
I'm currently dealing with large amounts of real-time financial data that I must store (time series), meaning I can easily fill out GBs worth of data in a day.
The video course talks about introducing an additional partition key in order to split a partition with the purpose or reducing the size per partition requirement. The video pointed out to using either a time based key or an arbitrary "bucket" key that gets incremented when the number of manageable rows has been reached.
With that in mind, this led me to the following problem: given that partition keys are only used as equality criteria (ie. point to the partition to find records), how do I find all the records that end up being spread across multiple partitions without having to specify either the bucket or timestamp key?
For example, I may receive 1M records in a single day, which would likely go over the 100-500Mb partition limit, so I wouldn't be able to set a partition on a per date basis, that means that my daily data would be broken down into hourly partitions, or alternatively, into "bucketed" partitions (for balanced partition sizes). This means that all my daily data would be spread across multiple partitions splits.
Given this scenario, how do I go about querying for all records for a given day? (additional clustering keys could include a symbol for which I want to have the results for, or I want all the records for that specific day)
Any help would be greatly appreciated.
Thank you.
Upvotes: 1
Views: 801
Reputation: 3784
Basically this goes down to choosing right resolution for your data. I would say first step for you would be to determinate what is best fit for your data. Lets for sake of example take 1 hour as something that is good and question is how to fetch all records for particular date.
Your application logic will be slightly more complicated since you are trading simplicity for ability to store large amounts of data in distributed fashion. You take date which you need and issue 24 queries in a loop and glue data on application level. However when you glue that in can be huge (I do not know your presentation or export requirements so this can pull 1M to memory).
Other idea can be having one table as simple lookup table which has key of date and values of partition keys having financial data for that date. Than when you read you go first to lookup table to get keys and then to partitions having results. You can also store counter of values per partition key so you know what amount of data you expect.
All in all it is best to figure out some natural bucket in your data set and add it to date (organization, zip code etc.) and you can use trick with additional lookup table. This approach can be used for symbol you mentioned. You can have symbols as partition keys, clustering per date and values of partitions having results for that date as values. Than you query for symbol #
on 29-10-2015
and you see partitions A
, D
and Z
have results so you go to those partitions and get financial data from them and glue it together on application level.
Upvotes: 1