Reputation: 257
I am new to Hadoop and trying to run the word-count example. I want to set the number of map tasks equal to the number of input files. I pass a directory containing a total of 10 files to the Hadoop wordcount example, but more than 10 map tasks are created. Can I limit the number of map tasks to the number of files, so that each map task takes exactly one file as its input?
I am using Hadoop version 1.
Upvotes: 0
Views: 151
Reputation: 1411
You will get a mapper for each split. You can work around this in a few ways, the first being to set mapred.min.split.size large enough that none of the files meets the split criteria.
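A minimal sketch of that first approach, assuming a new-API (org.apache.hadoop.mapreduce) word-count driver; the WordCountDriver class name, the argument handling, and the 10 GB value are only placeholders, so pick a value larger than your biggest input file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Make the minimum split size larger than any input file so that
        // each file ends up in exactly one split (value is in bytes;
        // 10 GB here is just an illustration).
        conf.setLong("mapred.min.split.size", 10L * 1024 * 1024 * 1024);

        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);
        // ... set mapper, combiner, reducer and key/value classes as in the stock example ...

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}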
Another option is to implement your own InputFormat, as Praveen suggests. A few implementations already exist, though I don't know how well they hold up against current versions of Hadoop. https://gist.github.com/sritchie/808035 and http://lordjoesoftware.blogspot.com/2010/08/customized-splitters-and-readers.html are two examples, though they are old.
Another simple option would be to store your files in a format that is not splittable. GZip comes to mind, although it adds a little overhead because the files must be decompressed. There is more overhead if a gzipped file is larger than the block size, because its blocks will be placed on different nodes and have to be combined BEFORE the file can be put through the map task.
Upvotes: 2
Reputation: 33495
If the files are huge, then a single map per file will be a bottleneck. Create a new InputFormat and start using it. Here is the code for it:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Returning false stops Hadoop from splitting a file across
        // multiple mappers, so each input file becomes exactly one split.
        return false;
    }
}
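To have the job pick it up, set it as the input format in your driver (assuming the new-API Job class):

job.setInputFormatClass(NonSplittableTextInputFormat.class);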
Upvotes: 2