Reputation: 165
Currently I am using cassandra for storing data for my functional use cases (display time-series and consolidated data to users). Cassandra is very good at it, if you design correctly your data model (query driven)
Basically, data are ingested from RabbitMQ by Storm and save to Cassandra
Lambda architecture is just a design-pattern for big-data architect and technology independent, the layers can be combined :
Cassandra is a database that can be used as serving layer & batch layer : I'm using it for my analytics purpose with spark too (because data are already well formatted, like time-series, in cassandra)
As far as I know, one huge thing to consider is STORING your raw data before any processing. You need to do this in order to recover for any problem, human-based (algorithm problem, DROP TABLE in PROD, stuff like that this can happen..) or for future use or mainly for batch aggregation
And here I'm facing a choice :
Currently I'm storing it in cassandra, but i'm consider switching storing the raw data in HDFS for different reason : raw data are "dead", using cassandra token, using resource (mainly disk space) in cassandra cluster.
Can someone help me in that choice ?
Upvotes: 1
Views: 232
Reputation: 1810
HDFS makes perfect sense. Some considerations :
Upvotes: 2
Reputation: 2101
hdfs is better idea for binary files. Cassandra is o.k. for storing locations where the files are etc etc but just pure files need to be modelled really really well so most of the people just give up on cassandra and complain that it sucks. It still can be done, if you want to do it there are some examples like: https://academy.datastax.com/resources/datastax-reference-application-killrvideo
that might help you to get started.
Also the question is more material for quora or even http://www.mail-archive.com/[email protected]/ this question has been asked there a lot of time.
Upvotes: 0