Reputation: 512
What are the most common industry approaches to choosing a file format for storing data in HDFS, for better performance and better utilization of the cluster?
Storing data in the Parquet file format seems to give better performance numbers than plain text files. Using Parquet with Snappy compression gives good performance and also makes better use of the cluster's storage space.
So my question is whether to use Parquet alone or Parquet plus Snappy compression for storing data on HDFS. What are the industry-standard approaches, and why? Any help is highly appreciated.
Upvotes: -2
Views: 228
Reputation: 20969
It certainly depends on your use case.
Do you want to use a query engine (Hive, Impala) on top of these files? Go for a columnar format like ORC or Parquet. Columnar formats are much more efficient for queries, as you usually project only a subset of the columns into your result. As a plus, they compress really well.
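To make the projection argument concrete, here is a minimal sketch using only the Python standard library. It is not how ORC or Parquet are implemented; it just compares how many bytes a single-field query has to scan when records are stored row by row versus column by column. The record fields (`user_id`, `country`, `amount`) and the helper names are made up for illustration.

```python
import json

# Hypothetical sample records; the field names are assumptions for this sketch.
records = [
    {"user_id": i, "country": "DE" if i % 2 else "US", "amount": i * 1.5}
    for i in range(1000)
]

# Row-oriented layout: one serialized record after another, like a text file.
row_layout = [json.dumps(r).encode("utf-8") for r in records]

# Column-oriented layout: all values of one field are stored together.
column_layout = {
    name: json.dumps([r[name] for r in records]).encode("utf-8")
    for name in records[0]
}


def bytes_scanned_row(field):
    """A row store has to read every full record, whichever field is wanted."""
    return sum(len(blob) for blob in row_layout)


def bytes_scanned_column(field):
    """A column store reads only the chunk that holds the requested field."""
    return len(column_layout[field])


row_bytes = bytes_scanned_row("amount")
col_bytes = bytes_scanned_column("amount")
print(f"row layout scans {row_bytes} bytes, column layout scans {col_bytes} bytes")
```

Real columnar formats add encodings, statistics, and compression on top of this idea, so the gap in practice depends on the data and the query.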
Do you plan on using mostly MapReduce/batch operations on all the fields of your data?
Again, that depends on your use case. Human-readable? Use JSON or CSV. Binary? Use sequence files.
Upvotes: 1
Reputation: 2255
Keep in mind that the different distributions recommend different approaches:
Hortonworks will tell you to use ORC, as this is the format Hortonworks supports. You can use it with Snappy.
Cloudera will tell you to use Parquet as it is their preferred format.
MapR will tell you that HDFS is file storage and not a file system, that MapRFS is the only real file system on Hadoop, and that you should go for that.
Following the advice of your distributor is definitely a good choice. Most likely you will not choose a distribution based on file storage parameters alone.
Upvotes: 0
Reputation: 3421
As far as I know, the Parquet format with Snappy compression is very efficient and widely used in industry. You can also use Avro, but it depends on your use case. A size comparison found on the internet:
Uncompressed CSV : 1.8 GB
Avro : 1.5 GB
Avro w/ Snappy Compression : 750 MB
Parquet w/ Snappy Compression : 300 MB
You can have a look at this document for more details.
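To put the quoted sizes in relative terms, the short sketch below computes the space saved against the uncompressed CSV. It assumes 1 GB = 1000 MB, since the comparison does not say which convention it used, and the helper name `space_saved_percent` is made up for this example.

```python
# Sizes from the comparison above, in MB.
# Assumption: 1 GB is treated as 1000 MB; the source does not state the convention.
sizes_mb = {
    "Uncompressed CSV": 1800,
    "Avro": 1500,
    "Avro w/ Snappy Compression": 750,
    "Parquet w/ Snappy Compression": 300,
}

baseline = sizes_mb["Uncompressed CSV"]


def space_saved_percent(size_mb, baseline_mb=baseline):
    """Percentage of storage saved relative to the uncompressed CSV."""
    return 100.0 * (1 - size_mb / baseline_mb)


for name, size in sizes_mb.items():
    saved = space_saved_percent(size)
    factor = baseline / size
    print(f"{name}: {size} MB, {saved:.1f}% smaller, {factor:.1f}x reduction")
```

By these numbers, Parquet with Snappy takes one sixth of the space of the CSV. The figures come from a single dataset, so your own ratios will differ with the data.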
Upvotes: 2