Eldinea

Reputation: 165

Big Data - Lambda Architecture and Storing Raw Data

Currently I am using Cassandra to store data for my functional use cases (displaying time series and consolidated data to users). Cassandra is very good at this if you design your data model correctly (query-driven).

Basically, data is ingested from RabbitMQ by Storm and saved to Cassandra.

Lambda architecture is just a technology-independent design pattern for big-data architectures, and its layers can be combined:

Cassandra is a database that can be used as both the serving layer and the batch layer: I also use it with Spark for analytics (because the data is already well formatted, as time series, in Cassandra).

As far as I know, one important thing to consider is STORING your raw data before any processing. You need this in order to recover from any problem, including human-caused ones (an algorithm bug, a DROP TABLE in PROD, that kind of thing can happen), for future use, and mainly for batch aggregation.

And here I'm facing a choice:

Currently I'm storing it in Cassandra, but I'm considering moving the raw data to HDFS for several reasons: raw data is "dead", yet it still occupies Cassandra tokens and uses resources (mainly disk space) in the Cassandra cluster.

Can someone help me with this choice?

Upvotes: 1

Views: 232

Answers (2)

Venkat

Reputation: 1810

HDFS makes perfect sense. Some considerations:

  • Serialization of data - Use ORC or Parquet, or Avro if the format is variable
  • Compression of data - Always compress
  • HDFS does not like too many small files - When streaming, have a job that aggregates events and writes a single large file at a regular interval
  • Have a good partitioning scheme so you can get to the data you want on HDFS without wasting resources
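
To illustrate the last three points together, here is a minimal sketch, not a definitive implementation. It assumes events arrive as dicts with a source name and a timestamp, and it writes to a local directory as a stand-in for HDFS. Gzip-compressed JSON lines stand in for Avro/Parquet so the example needs only the standard library. The names `partition_path`, `RawBatchWriter` and `max_events` are hypothetical.

```python
import gzip
import json
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path


def partition_path(source: str, ts: datetime) -> str:
    """Build a Hive-style partition directory from the event time (UTC)."""
    ts = ts.astimezone(timezone.utc)
    return f"source={source}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}"


class RawBatchWriter:
    """Buffers raw events per partition and writes each buffer as ONE
    compressed file, instead of one tiny file per event."""

    def __init__(self, root, max_events: int = 10_000):
        self.root = Path(root)          # local stand-in for an HDFS base path
        self.max_events = max_events    # hypothetical flush threshold
        self.buffers = defaultdict(list)
        self.file_seq = 0

    def add(self, source: str, ts: datetime, payload: dict) -> None:
        key = partition_path(source, ts)
        self.buffers[key].append(payload)
        if len(self.buffers[key]) >= self.max_events:
            self.flush(key)

    def flush(self, key: str):
        """Write the buffered events of one partition; return the file path."""
        events = self.buffers.pop(key, [])
        if not events:
            return None
        target_dir = self.root / key
        target_dir.mkdir(parents=True, exist_ok=True)
        self.file_seq += 1
        path = target_dir / f"part-{self.file_seq:05d}.jsonl.gz"
        with gzip.open(path, "wt", encoding="utf-8") as f:
            for event in events:
                f.write(json.dumps(event) + "\n")
        return path

    def flush_all(self):
        """Flush every partition, e.g. from a timer at a regular interval."""
        written = []
        for key in list(self.buffers):
            path = self.flush(key)
            if path is not None:
                written.append(path)
        return written
```

In a real pipeline the flush would typically be triggered by both a size threshold and a timer, and a batch job reading one hour of one source would only have to list a single directory.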

Upvotes: 2

Marko Švaljek

Reputation: 2101

HDFS is a better idea for binary files. Cassandra is OK for storing things like the locations of the files, but storing the files themselves needs to be modelled really well, so most people just give up on Cassandra and complain that it sucks. It can still be done; if you want to do it, there are examples such as https://academy.datastax.com/resources/datastax-reference-application-killrvideo that might help you get started.

Also, this question is better suited to Quora or even http://www.mail-archive.com/[email protected]/, where it has been asked many times.

Upvotes: 0
