Big Data - Lambda Architecture and Storing Raw Data

Question

Currently I am using Cassandra to store data for my functional use cases (displaying time-series and consolidated data to users). Cassandra is very good at this, provided you design your data model correctly (query-driven).

Basically, data is ingested from RabbitMQ by Storm and saved to Cassandra.

Lambda architecture is just a design pattern for big-data architectures. It is technology independent, and the layers can be combined:

Cassandra is a database that can be used as both the serving layer and the batch layer: I also use it for analytics with Spark (because the data is already well formatted, as time series, in Cassandra).

As far as I know, one important thing to consider is STORING your raw data before any processing. You need this in order to recover from any problem, including human error (a bug in an algorithm, a DROP TABLE in PROD, that kind of thing can happen), for future use, and mainly for batch aggregation.
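To make the recovery argument concrete, here is a minimal sketch of "store raw first, then process". It is not my actual Storm topology: the names `ingest`, `replay`, `raw_log`, and `process` are hypothetical, a local append-only file stands in for the raw store, and messages are assumed to be single-line JSON.

```python
import json
from pathlib import Path


def ingest(message: bytes, raw_log: Path, process) -> None:
    """Append the untouched message to the raw log, then process it."""
    # Assumption: each message is JSON with no embedded newline.
    with raw_log.open("ab") as fh:
        fh.write(message + b"\n")
    process(json.loads(message))


def replay(raw_log: Path, process) -> int:
    """Re-run processing over every stored raw message; return the count."""
    count = 0
    with raw_log.open("rb") as fh:
        for line in fh:
            process(json.loads(line))
            count += 1
    return count
```

If the processing logic turns out to be wrong, or a derived table is dropped, calling `replay` with the corrected `process` function rebuilds the derived data from the raw log.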

And here I'm facing a choice:

Currently I'm storing it in Cassandra, but I'm considering moving the raw data to HDFS for several reasons: raw data is "dead", yet it uses Cassandra tokens and consumes resources (mainly disk space) in the Cassandra cluster.

Can someone help me with that choice?

Venkat · Accepted Answer

HDFS makes perfect sense. Some considerations:

Serialization of data - Use ORC/Parquet, or Avro if the format is variable
Compression of data - Always compress
HDFS does not like too many small files - When streaming, have a job that aggregates the data and writes a single large file at a regular interval
Have a good partitioning scheme so you can get to the data you want on HDFS without wasting resources
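
The compaction and partitioning points above can be sketched roughly as follows. This is only an illustration under stated assumptions: a local directory stands in for HDFS, gzip-compressed JSON lines stand in for ORC/Parquet/Avro, the hourly `source=/year=/month=/day=/hour=` layout is one possible scheme, and the names `partition_path` and `compact` plus the `ts` and `source` event fields are hypothetical.

```python
import gzip
import json
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path


def partition_path(source: str, ts: datetime) -> str:
    """Hive-style partition directory for one hour of data from one source."""
    return f"source={source}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}"


def compact(events, root: Path) -> list:
    """Group events by partition and write ONE compressed file per partition."""
    buckets = defaultdict(list)
    for event in events:
        # Assumption: "ts" is a Unix timestamp in seconds, interpreted as UTC.
        ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
        buckets[partition_path(event["source"], ts)].append(event)

    written = []
    for part, rows in sorted(buckets.items()):
        out_dir = root / part
        out_dir.mkdir(parents=True, exist_ok=True)
        # One large file per partition instead of one small file per event.
        out_file = out_dir / "part-00000.jsonl.gz"
        with gzip.open(out_file, "wt", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row) + "\n")
        written.append(out_file)
    return written
```

Run on a schedule over the events buffered since the last run, a job like this keeps the file count low, and a batch job that needs a single day only has to read the directories under that day's prefix.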
