Reputation: 155
An interviewer recently asked me this question:
I said: by configuring the block size or the split size to be equal to the file size.
He said that was wrong.
Upvotes: 1
Views: 1480
Reputation: 181
Well, if you put it like that, I think he didn't like the "configuring block size" part.
EDIT: On reflection, changing the block size is a bad idea, because it is a setting that is global to HDFS.
A solution that does prevent splitting, on the other hand, is to set the minimum split size to be larger than the largest file you want to map.
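For example, with the new MapReduce API you can do this via FileInputFormat.setMinInputSplitSize() in the job driver (a minimal sketch; the driver class and job name are placeholders, and the rest of the job setup is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileJobDriver
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "whole-file-job");
        // The effective split size is max(minSize, min(maxSize, blockSize)),
        // so a huge minimum split size forces each file into a single split.
        FileInputFormat.setMinInputSplitSize(job, Long.MAX_VALUE);
        // ... set mapper, input/output paths, etc. ...
    }
}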
A cleaner solution would be to subclass the concerned InputFormat implementation and override the isSplitable() method to return false. In your case you could do something like this with TextInputFormat (a concrete FileInputFormat subclass):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NoSplitFileInputFormat extends TextInputFormat
{
    @Override
    protected boolean isSplitable(JobContext context, Path file)
    {
        // Never split: each input file goes to exactly one mapper.
        return false;
    }
}
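You then register the format on the job in your driver (one line, assuming a new-API Job instance named job):

// Read all input through the non-splitting format.
job.setInputFormatClass(NoSplitFileInputFormat.class);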
Upvotes: 3
Reputation: 1399
The interviewer wanted to hear that you can make isSplitable() return false by gzip-compressing the input file.
In this case, MapReduce will do the right thing and not try to split the gzipped file, since it knows that the input is gzip-compressed (by looking at the filename extension) and that gzip does not support splitting.
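That decision is made by the input format itself. The following is a sketch of the logic, paraphrasing what TextInputFormat.isSplitable() does in Hadoop 2.x rather than quoting the source (the class name here is made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck
{
    public static boolean isSplittable(Configuration conf, Path file)
    {
        // Resolve the codec from the file name extension (.gz -> GzipCodec).
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        if (codec == null)
        {
            // No codec: plain text, safe to split.
            return true;
        }
        // Compressed input is splittable only if the codec supports it
        // (bzip2 does, gzip does not).
        return codec instanceof SplittableCompressionCodec;
    }
}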
Compressing the input this way works, but at the expense of data locality: a single map will process all of the file's HDFS blocks, most of which will not be local to the node running the map. Also, with fewer maps the job is less granular, and so may take longer to run.
Upvotes: 3