Raj

Reputation: 155

In Hadoop, how can you give a whole file as input to a single mapper?

An interviewer recently asked me this question:

I said you could do it by configuring the block size or the split size to be equal to the file size.

He said that was wrong.

Upvotes: 1

Views: 1480

Answers (2)

Esquive

Reputation: 181

Well, if you phrased it like that, I think he didn't like the "configuring block size" part.

EDIT: In any case, I think changing the block size is a bad idea, because it is a global HDFS setting.

A simpler way to prevent splitting is to set the minimum split size larger than the largest file you need to map.
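For example, a minimal sketch of that idea with the Hadoop 2.x (mapreduce) API; the driver class and job name are just placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileDriver 
{
    public static void main(String[] args) throws Exception 
    {
        Job job = Job.getInstance(new Configuration(), "whole-file-input");
        // Raising the minimum split size above any input file size forces
        // one split (and therefore one mapper) per file.
        FileInputFormat.setMinInputSplitSize(job, Long.MAX_VALUE);
        // Equivalent property in Hadoop 2.x: mapreduce.input.fileinputformat.split.minsize
    }
}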

A cleaner solution is to subclass the InputFormat implementation you are using and override its isSplitable() method to return false. In your case you could do something like this with a FileInputFormat subclass such as TextInputFormat:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NoSplitFileInputFormat extends TextInputFormat 
{
    // Returning false tells the framework never to split files handled by
    // this InputFormat, so each input file goes to exactly one mapper.
    @Override
    protected boolean isSplitable(JobContext context, Path file) 
    {
        return false;
    }
}
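
You would then register the format on your job (assuming the usual Job setup):

job.setInputFormatClass(NoSplitFileInputFormat.class);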

Upvotes: 3

Evgeny Benediktov

Reputation: 1399

The interviewer wanted to hear that you can make isSplitable() return false by gzip-compressing the input file.

In this case, MapReduce will do the right thing and not try to split the gzipped file, since it knows that the input is gzip-compressed (by looking at the filename extension) and that gzip does not support splitting.
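With a gzipped input no custom InputFormat is needed; a rough sketch of the job setup (the input path is hypothetical) would just point a normal job at the compressed file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class GzipWholeFileDriver 
{
    public static void main(String[] args) throws Exception 
    {
        Job job = Job.getInstance(new Configuration(), "gzip-whole-file");
        // TextInputFormat sees the .gz extension, reports the file as not
        // splitable, and the whole file goes to a single mapper.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/input/data.txt.gz")); // hypothetical path
    }
}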

This will work, but at the expense of locality: a single map will process all HDFS blocks, most of which will not be local to the map. Also, with fewer maps, the job is less granular, and so may take longer to run.

Upvotes: 3
