Punit

Reputation: 11

Hadoop Mapreduce with compressed/encrypted files (file of large size)

I have an HDFS cluster which stores large CSV files in compressed/encrypted form, as selected by the end user. For compression and encryption, I have created a wrapper input stream which feeds data to HDFS in compressed/encrypted form. The compression format used is GZIP and the encryption format is AES-256. A 4.4 GB CSV file compresses down to 40 MB on HDFS.
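For reference, here is a minimal sketch of how a write-side wrapper like this could be assembled from standard Java APIs (java.util.zip + javax.crypto) on top of an HDFS output stream. This is not my actual wrapper; the class name, the key/IV handling and the compress-then-encrypt order are assumptions for illustration only.

    import java.io.OutputStream;
    import java.util.zip.GZIPOutputStream;

    import javax.crypto.Cipher;
    import javax.crypto.CipherOutputStream;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical write-side helper: plaintext -> GZIP -> AES-256 -> HDFS.
    // Reading back must unwrap in reverse order (decrypt, then gunzip).
    public class HdfsCompressEncryptWriter {

        public static OutputStream open(Configuration conf, Path path,
                                        byte[] aesKey256, byte[] iv) throws Exception {
            FileSystem fs = path.getFileSystem(conf);

            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE,
                    new SecretKeySpec(aesKey256, "AES"),   // 32-byte key for AES-256
                    new IvParameterSpec(iv));

            // Compress first (GZIP on plaintext), then encrypt the compressed bytes.
            return new GZIPOutputStream(
                    new CipherOutputStream(fs.create(path), cipher));
        }
    }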

Now I have a MapReduce job (Java) which processes multiple compressed files together. The MR job uses FileInputFormat. When input splits are calculated, the 4.4 GB compressed file (40 MB on HDFS) is allocated only one mapper, with split start 0 and split length equal to 40 MB.

How do I process such a large compressed file? One option I found was to implement a custom RecordReader and use the wrapper input stream to read uncompressed data and process it. But since I don't have the actual (uncompressed) length of the file, I don't know how much data to read from the input stream.

If I read up to the end of the InputStream, how do I handle the case where 2 mappers are allocated to the same file? If the compressed file size is larger than 64 MB, then 2 mappers will be allocated to the same file. How do I handle this scenario?
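To make the question concrete, this is roughly what I had in mind: a minimal sketch of a custom FileInputFormat whose isSplitable() returns false (so each file gets exactly one mapper regardless of size) and a RecordReader that reads lines from the decoded stream until EOF. WrapperInputStream stands in for my decrypt/decompress wrapper and is hypothetical here.

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.util.LineReader;

    public class EncryptedCsvInputFormat extends FileInputFormat<LongWritable, Text> {

        // Returning false forces one split (and therefore one mapper) per file,
        // so a file larger than the block size is never given to two mappers.
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new EncryptedCsvRecordReader();
        }

        public static class EncryptedCsvRecordReader extends RecordReader<LongWritable, Text> {
            private LineReader reader;
            private final LongWritable key = new LongWritable(0);
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                    throws IOException {
                FileSplit split = (FileSplit) genericSplit;
                FileSystem fs = split.getPath().getFileSystem(context.getConfiguration());
                FSDataInputStream raw = fs.open(split.getPath());
                // WrapperInputStream is hypothetical: plug in whatever stream
                // undoes the AES-256 + GZIP wrapping used when writing to HDFS.
                InputStream decoded = new WrapperInputStream(raw);
                reader = new LineReader(decoded, context.getConfiguration());
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                // Read until EOF of the decoded stream; no split-length check is
                // needed because the whole file is a single split.
                if (reader.readLine(value) == 0) {
                    return false;
                }
                key.set(key.get() + 1);
                return true;
            }

            @Override
            public LongWritable getCurrentKey() { return key; }

            @Override
            public Text getCurrentValue() { return value; }

            @Override
            public float getProgress() { return 0.0f; } // uncompressed length unknown

            @Override
            public void close() throws IOException {
                if (reader != null) {
                    reader.close();
                }
            }
        }
    }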

Hadoop Version - 2.7.1

Upvotes: 1

Views: 799

Answers (1)

Ramzy

Reputation: 7138

The compression format should be decided keeping in mind whether the file will be processed by MapReduce, because if the compression format is splittable, then MapReduce works normally.

However, if it is not splittable (in your case gzip is not splittable, and MapReduce will know it), then the entire file is processed in one mapper. This will serve the purpose, but it will have data locality issues, since a single mapper performs the whole job and has to fetch the data from other blocks.
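As a rough illustration of how MapReduce "knows" this: the built-in text input format resolves a codec from the file extension and only allows splitting when that codec implements SplittableCompressionCodec. A small sketch along those lines (not part of your job, just for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittabilityCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);

            // The codec is resolved from the file extension, e.g. ".gz" -> GzipCodec,
            // ".bz2" -> BZip2Codec. A file with no recognised extension resolves to
            // null and is treated as splittable plain text.
            CompressionCodec codec = factory.getCodec(new Path(args[0]));

            // Mirrors TextInputFormat.isSplitable(): no codec means splittable;
            // otherwise only a SplittableCompressionCodec (e.g. bzip2) allows
            // more than one split per file.
            boolean splittable =
                    (codec == null) || (codec instanceof SplittableCompressionCodec);
            System.out.println("codec=" + codec + ", splittable=" + splittable);
        }
    }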

From the Hadoop Definitive Guide: "For large files, you should not use a compression format that does not support splitting on the whole file, because you lose locality and make MapReduce applications very inefficient".

You can refer to the compression section of the Hadoop I/O chapter for more information.

Upvotes: 0
