Reputation: 13446
I was just going through the word count example in MapReduce. The map function is very straightforward. Is there a higher-level function that decides which parts of the file go to which mapper? Suppose you rely on a function (such as SHA1) that requires the entire file as input; is there any way to tell the framework not to split files?
Upvotes: 0
Views: 167
Reputation: 33495
Is there a higher-level function that decides which parts of the file go to which mapper?
When a map slot becomes free on a node, the scheduler picks the split nearest to that node to avoid data transfer as much as possible. If an unprocessed input split resides on the same node as the free map slot, that split is processed; otherwise a split on the same rack is chosen, or, failing that, a split from another rack.
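As a rough illustration of the locality information the scheduler works from, here is a small sketch using the classic mapred API; the file path and host names are made up, and a real job would get its splits from the InputFormat rather than constructing them by hand:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;
public class SplitLocalityDemo {
    public static void main(String[] args) throws IOException {
        // A 64 MB split whose blocks live on two (made-up) hosts; the scheduler
        // consults these locations to prefer a free map slot on one of them.
        FileSplit split = new FileSplit(new Path("/data/words.txt"),
                0L, 64L * 1024 * 1024, new String[] { "node1.rack1", "node2.rack1" });
        for (String host : split.getLocations()) {
            System.out.println(host);
        }
    }
}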
is there any to tell the framework not to split files?
Override FileInputFormat#isSplitable() to return false. Then the input files are not split, and each file is processed by a single map task.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;
public class NonSplittableTextInputFormat extends TextInputFormat {
    // Returning false tells the framework never to split a file into multiple
    // input splits, so each file goes to exactly one mapper.
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
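One way to wire this into a job, as a rough sketch in the classic JobConf driver style (the driver class name and the omitted mapper/reducer setup are placeholders):
import org.apache.hadoop.mapred.JobConf;
public class WordCountDriver {
    public static void main(String[] args) {
        JobConf conf = new JobConf(WordCountDriver.class);
        // Each input file now becomes exactly one split, so one mapper sees the whole file.
        conf.setInputFormat(NonSplittableTextInputFormat.class);
        // ... set mapper, reducer, input/output paths, then run with JobClient.runJob(conf)
    }
}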
Upvotes: 2
Reputation: 1270
You can write a custom InputFormat with its own InputSplit and RecordReader in Hadoop, and implement these classes to tell the framework to split input files the way you want; a rough skeleton is sketched below the link.
Please check out: http://developer.yahoo.com/hadoop/tutorial/module5.html
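A minimal skeleton, assuming the old mapred API; the class name is made up, and the reuse of LineRecordReader is only a placeholder for whatever record logic you actually need:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
public class CustomInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        // Return a RecordReader that turns the split into key/value records in
        // whatever way your job needs; LineRecordReader is just a stand-in here.
        return new LineRecordReader(job, (FileSplit) split);
    }
}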
Upvotes: 1