darkownage

Reputation: 938

Amazon EMR: best compression/file format

We currently have some files stored in S3. The files are log files (.log extension, plain text content) that have been gzipped to reduce disk space. But gzip isn't splittable, so we are now looking for good alternatives for storing and processing our files on Amazon EMR.

So what is the best compression or file format to use for log files? I came across Avro, SequenceFile, bzip2, LZO and Snappy. It's a lot of options and I am a bit overwhelmed.

I would appreciate any insight on this matter.

The data will be used for Pig jobs (MapReduce jobs).

Kind regards

Upvotes: 1

Views: 1010

Answers (2)

Manish Poddar

Reputation: 21

We can choose among the following algorithms depending on the use case (a Pig configuration sketch follows the list):

  1. gzip: splittable: no; compression ratio: high; compress/decompress speed: medium
  2. Snappy: splittable: no; compression ratio: low; compress/decompress speed: very fast
  3. bzip2: splittable: yes; compression ratio: very high; compress/decompress speed: slow
  4. LZO: splittable: yes (only when indexed); compression ratio: low; compress/decompress speed: fast
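Since the asker's data feeds Pig jobs, here is a minimal sketch of writing splittable bzip2 output from Pig, using the output.compression.* properties that PigStorage honours; the S3 paths are placeholders:

    -- Compress PigStorage output with bzip2, which is splittable,
    -- so downstream jobs can read one output file with many mappers.
    SET output.compression.enabled true;
    SET output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';

    -- Placeholder paths; Pig transparently decompresses .gz/.bz2 input.
    logs = LOAD 's3://my-bucket/logs/' USING PigStorage('\t');
    STORE logs INTO 's3://my-bucket/logs-bz2/' USING PigStorage('\t');

The existing gzipped logs still load fine, but each .gz file is consumed by a single mapper; rewriting them as bzip2 (or indexed LZO) restores splittability for later jobs.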

Upvotes: 0

Paulo Fidalgo

Reputation: 22296

If you check the Best Practices for Amazon EMR whitepaper, there is a section on compressing mapper outputs:

Compress mapper outputs: Compression means less data written to disk, which improves disk I/O. You can monitor how much data is written to disk by looking at the FILE_BYTES_WRITTEN Hadoop metric. Compression can also help with the shuffle phase, where reducers pull data, and it can benefit your cluster's HDFS data replication as well. Enable compression by setting mapred.compress.map.output to true. When you enable compression, you can also choose the compression algorithm; LZO has better performance and is faster to compress and decompress.
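In a Pig script those Hadoop properties can be set inline. A minimal sketch, assuming the hadoop-lzo codec (com.hadoop.compression.lzo.LzoCodec) is installed on the cluster; the bundled org.apache.hadoop.io.compress.SnappyCodec is a drop-in alternative if it is not:

    -- Compress intermediate map output to cut disk and shuffle I/O
    -- (legacy mapred.* property names, matching the guide quoted above).
    SET mapred.compress.map.output true;
    SET mapred.map.output.compression.codec 'com.hadoop.compression.lzo.LzoCodec';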

Upvotes: 0
