Utkarsh Saraf

Reputation: 495

Avro, Parquet and SequenceFile: their position in the Hadoop ecosystem and their utility

I have seen different file formats being used when importing data into HDFS and storing it there, and data processing engines also rely on these formats while performing their own procedures. What difference do these file formats make, and how is the choice between them made for different use cases? Being a newbie, I find this confusing. Kindly help.

Upvotes: 1

Views: 989

Answers (1)

Mehdi TAZI

Reputation: 575

The choice depends on the use case you are facing: the type of data you have, compatibility with your processing tools, schema evolution, file size, the type of queries, and read performance.

In general:

  • Avro is more suitable for event data that can change over time
  • SequenceFile is for intermediate datasets shared between MR jobs
  • Parquet is more suitable for analytics due to its columnar format (see the sketch after this list)
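As a rough illustration of that choice, here is a minimal Spark (Java) sketch, assuming Spark with the external spark-avro module on the classpath; the HDFS paths and the `events` dataset are invented for the example:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class FormatChoiceSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("format-choice")
                .getOrCreate();

        // Hypothetical raw input; replace with your own source.
        Dataset<Row> events = spark.read().json("hdfs:///raw/events");

        // Row-oriented Avro for evolving event data
        // (requires the external spark-avro module on the classpath).
        events.write().format("avro").mode(SaveMode.Overwrite)
                .save("hdfs:///landing/events_avro");

        // Columnar Parquet for the copy that analytical queries will read column by column.
        events.write().mode(SaveMode.Overwrite)
                .parquet("hdfs:///warehouse/events_parquet");

        spark.stop();
    }
}
```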

Here are some pointers that may help you.

Write performance (more + means faster):

  • SequenceFile: +++
  • Avro: ++
  • Parquet: +

Read performance (more + means faster):

  • SequenceFile: +
  • Avro: +++
  • Parquet: +++++

File size (more + means smaller):

  • SequenceFile: +
  • Avro: ++
  • Parquet: +++

And here are some facts about each file type.

Avro:

  • Row-oriented binary format, compact and fast
  • Has a schema; the file contains the schema in addition to the data
  • Supports schema evolution, and is the best of the three at it (see the sketch after this list)
  • Can be compressed
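Here is a minimal Java sketch of what "the file contains the schema" and schema evolution look like in practice; the `Event` record and its fields are invented for the example. Adding a field with a default value keeps old and new schema versions compatible: readers using the older schema simply ignore the extra field, and readers using the newer schema fill in the default when it is missing.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical "Event" schema. The "source" field has a default, so data
        // written with this schema still resolves against an older schema without it.
        String schemaJson = "{"
                + "\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"payload\",\"type\":\"string\"},"
                + "{\"name\":\"source\",\"type\":\"string\",\"default\":\"unknown\"}"
                + "]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord event = new GenericData.Record(schema);
        event.put("id", 1L);
        event.put("payload", "clicked");
        event.put("source", "web");

        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            // create() writes the schema into the file header, so the data file
            // is self-describing; compression could be enabled via setCodec().
            writer.create(schema, new File("events.avro"));
            writer.append(event);
        }
    }
}
```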

Parquet:

  • Slow to write but fast to read
  • Column-oriented binary format
  • Supports compression
  • Optimized and efficient in terms of disk I/O when only specific columns need to be queried (see the sketch after this list)
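A minimal Spark (Java) sketch of why the columnar layout pays off on reads; the path and column names are hypothetical. Selecting only the columns you need lets Parquet skip the other column chunks on disk, which is where the I/O saving comes from.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetColumnPruningSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("parquet-column-pruning")
                .getOrCreate();

        // Hypothetical Parquet dataset; only the requested columns are read from disk.
        Dataset<Row> users = spark.read().parquet("hdfs:///warehouse/users_parquet");
        users.select("user_id", "country").show();

        spark.stop();
    }
}
```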

SequenceFile:

  • Row-oriented format
  • Supports splitting even when the data is compressed
  • Can be used to pack many small files in Hadoop (see the sketch after this list)
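And here is a minimal Java sketch of the small-files use case, with an invented output path: many small files are written as key/value records (file name → file bytes) of one block-compressed SequenceFile, giving a single splittable file instead of thousands of tiny ones.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFilesSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/data/packed.seq"); // hypothetical output path

        // Key = original file name, value = file bytes: many small files become
        // one splittable, block-compressed SequenceFile.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            for (String name : args) {
                Path p = new Path(name);
                byte[] contents = new byte[(int) fs.getFileStatus(p).getLen()];
                try (FSDataInputStream in = fs.open(p)) {
                    in.readFully(contents);
                }
                writer.append(new Text(name), new BytesWritable(contents));
            }
        }
    }
}
```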

I hope this answer helps you.

Upvotes: 6
