Reputation: 495
I have seen different file formats used when importing and storing data in HDFS, and data processing engines also rely on these formats for their own operations. What difference do these file formats make, and how do you choose one for a given use case? As a newbie I find this confusing. Any help would be appreciated.
Upvotes: 1
Views: 989
Reputation: 575
The choice depends on your use case: the type of data you have, compatibility with your processing tools, whether the schema needs to evolve, the file size, the type of queries you run, and the read performance you need.
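To make the "type of query" point concrete, here is a minimal sketch in plain Python (standard library only, no Hadoop involved). It assumes the usual distinction that Avro and SequenceFile store data row by row while Parquet stores it column by column, and it uses JSON strings as a stand-in for the real binary encodings. The sample records and the function names are invented for this illustration.

```python
import json

# Hypothetical sample records, invented for this illustration.
records = [
    {"user_id": i, "country": "FR" if i % 2 else "US", "amount": i * 10}
    for i in range(1000)
]

# Row-oriented layout (the idea behind Avro and SequenceFile):
# each record is stored whole, one after another.
row_layout = [json.dumps(r) for r in records]

# Column-oriented layout (the idea behind Parquet):
# all values of one column are stored together.
column_layout = {
    name: json.dumps([r[name] for r in records]) for name in records[0]
}


def sum_amount_rows(rows):
    """Sum one field from the row layout: every full record is decoded."""
    return sum(json.loads(line)["amount"] for line in rows)


def sum_amount_columns(columns):
    """Sum one field from the column layout: only that column is decoded."""
    return sum(json.loads(columns["amount"]))


# Characters that must be scanned to answer "total amount" in each layout.
row_chars_read = sum(len(line) for line in row_layout)
col_chars_read = len(column_layout["amount"])

print("rows:   ", sum_amount_rows(row_layout), "chars read:", row_chars_read)
print("columns:", sum_amount_columns(column_layout), "chars read:", col_chars_read)
```

Both layouts give the same answer, but the column layout touches far less data when a query needs only one field. The trade-off runs the other way when you write or read whole records at a time, which is where row-oriented formats tend to fit better.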
In general:
Here are some keys that can help you:
Write performance (the more + signs, the faster)
Read performance (the more + signs, the faster)
File size (the more + signs, the smaller the file)
And here are some facts about each file type:
Avro :
Parquet :
SequenceFile :
I hope my answer helps.
Upvotes: 6