karthik

Reputation: 33

Difference between compression codecs and file formats in Hadoop?

I want to know how compression codecs and file formats differ in Hadoop. For example, the Parquet file format also reduces the size of the original file and supports file splitting. BZip2Codec does the same. Please help me understand the difference between the two better.

Upvotes: 0

Views: 1000

Answers (1)

Erik Schmiegelow

Reputation: 2759

Compression and file formats are completely different things.

A file format describes the structure of the data stored in a file. An Avro file contains Avro-serialized objects; a SequenceFile contains a key (usually a number) and a value (the original data). Parquet is a special file format that allows columnar storage and as such is quite space efficient.
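To make that concrete, here is a minimal sketch that writes a plain, uncompressed SequenceFile. It assumes the Hadoop client libraries are on the classpath, and the output path is just a placeholder. Note that no compression is involved here at all; the key/value layout is purely a property of the format:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder output path; point it at your own filesystem.
            Path path = new Path("/tmp/demo.seq");

            // The file format dictates this key/value record layout.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(LongWritable.class),
                    SequenceFile.Writer.valueClass(Text.class))) {
                writer.append(new LongWritable(1L), new Text("first record"));
                writer.append(new LongWritable(2L), new Text("second record"));
            }
        }
    }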

Some formats are more space-efficient than others (e.g. TIFF and JPG for images), and some less so (PSD).

On top of that, you may choose to compress the files in storage with different compression codecs. BZip2, Snappy and gzip are common choices. In the image example above, this would correspond to compressing your image with Zip.
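To show the layering, here is the same hypothetical writer with a BZip2 codec option added. The format is still a SequenceFile; only how the bytes are encoded on disk changes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class CompressedSequenceFileDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder output path, as before.
            Path path = new Path("/tmp/demo-bzip2.seq");

            // ReflectionUtils wires the Configuration into the codec instance.
            CompressionCodec codec =
                    ReflectionUtils.newInstance(BZip2Codec.class, conf);

            // Same file format as before; only the compression option is new.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(LongWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.compression(
                            SequenceFile.CompressionType.BLOCK, codec))) {
                writer.append(new LongWritable(1L), new Text("first record"));
                writer.append(new LongWritable(2L), new Text("second record"));
            }
        }
    }

Swapping BZip2Codec for, say, GzipCodec or SnappyCodec changes only the byte-level encoding; a reader still sees the same key/value records, because that is the format's job, not the codec's.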

Hope this provides some clarity.

Upvotes: 3
