Rex

Reputation: 159

When to use Parquet over ORC, or ORC over Parquet?

I have gone through many Stack Overflow links and other blogs, and the responses are mixed. Most answers seem to be driven by favoritism, and I can't find any specific data point for choosing one over the other. Whether it is data structure complexity, compression, performance, or compatibility, both file formats are claimed to be good in different blogs.

Please help with a specific use case or area in which one is clearly better than the other.

Upvotes: 1

Views: 4125

Answers (1)

Harjeet Kumar

Reputation: 524

ORC and Parquet are very similar file formats. They have more similarities than differences.

  1. Both are columnar file formats.
  2. Both support block-level compression.
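To make those two shared properties concrete, here is a toy sketch in plain Python. It is not how either format is actually encoded; the `to_columns`, `compress_blocks`, and `read_column` helpers, the sample rows, and the block size of 250 are all made up for illustration. It only shows the idea: values are stored per column, and each column is compressed in independent blocks.

```python
import json
import zlib

# Hypothetical sample data: 1000 rows with three fields.
rows = [{"id": i, "country": "IN" if i % 2 else "US", "amount": i * 10}
        for i in range(1000)]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [r[name] for r in rows] for name in rows[0]}


def compress_blocks(values, block_size=250):
    """Compress one column as independent blocks of block_size values."""
    return [
        zlib.compress(json.dumps(values[i:i + block_size]).encode())
        for i in range(0, len(values), block_size)
    ]


def read_column(blocks):
    """Decompress every block of a single column and concatenate the values."""
    out = []
    for block in blocks:
        out.extend(json.loads(zlib.decompress(block)))
    return out


columns = to_columns(rows)
stored = {name: compress_blocks(vals) for name, vals in columns.items()}

# A query that needs only "amount" decompresses only that column's blocks.
amounts = read_column(stored["amount"])
```

Because similar values sit next to each other within a column, general-purpose codecs tend to compress this layout well, and a reader can skip columns it does not need.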

However, the following points can help you choose between them:

  1. Parquet was originally developed by Twitter and Cloudera and is now an Apache project. Its design is inspired by the record shredding and assembly approach described in Google's Dremel paper. Cloudera products and distributions tend to prefer Parquet, so if you are planning to use Impala with your data, prefer Parquet.

  2. The ORC format evolved from the RCFile format in the Hive ecosystem. It works very well when your data includes complex data types.

  3. ORC can often give you better compression, though the result depends on your data and the codec you choose.

  4. ORC is more mature than Parquet when it comes to predicate pushdown, although Parquet has recently added this feature as well.
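For readers unfamiliar with predicate pushdown, the sketch below shows the general idea in plain Python. It is a toy model, not ORC or Parquet code: the `Chunk` class and both helper functions are hypothetical names. Both formats keep min/max statistics for groups of rows, and a reader can use them to skip groups that cannot match a filter.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A group of values from one column, with min/max statistics."""
    values: list
    lo: int
    hi: int


def build_chunks(values, size):
    """Split a column into fixed-size chunks and record min/max per chunk."""
    chunks = []
    for i in range(0, len(values), size):
        part = values[i:i + size]
        chunks.append(Chunk(part, min(part), max(part)))
    return chunks


def scan_greater_than(chunks, threshold):
    """Return the values > threshold and the number of chunks actually read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk.hi <= threshold:
            continue  # statistics prove nothing in this chunk can match
        chunks_read += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, chunks_read


chunks = build_chunks(list(range(100)), size=10)
matches, chunks_read = scan_greater_than(chunks, 84)
print(len(matches), chunks_read)  # prints "15 2"
```

Only 2 of the 10 chunks are read here. The benefit is largest when the data is sorted or clustered on the filtered column, which holds for either format.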

You can watch this video on YouTube; it covers this topic well.

Upvotes: 3
