Reputation: 95
I am running MapReduce on Hadoop 2.9.0.
My problem:
I have a number of text files (about 10-100). Each file is very small in terms of size, but because of my application logic, I need one mapper to handle one text file. The results of these mappers will be aggregated by my reducers.
I need to design it so that the number of mappers always equals the number of files. How do I do that in Java code? What kind of class do I need to extend?
Thanks a lot.
Upvotes: 0
Views: 367
Reputation: 279
I've had to do something very similar and faced similar problems to you. The way I achieved this was to feed in a text file containing the paths to each file; for example, the text file would contain this kind of information:
/path/to/filea
/path/to/fileb
/a/different/path/to/filec
/a/different/path/to/another/called/filed
I'm not sure exactly what you want your mappers to do, but when creating your job, you want to do the following:
public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "My Map reduce application");
    job.setJarByClass(Main.class);
    job.setMapperClass(CustomMapper.class);
    // NLineInputFormat hands each line of the input (i.e. each file path) to its own mapper
    job.setInputFormatClass(NLineInputFormat.class);
    ...
}
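For completeness, here is a rough sketch of how the rest of the driver could be wired up. The listing file location (/path/to/files.txt), the output path, the CustomReducer class, and the output types are my own placeholders, not something from the original setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "My Map reduce application");
        job.setJarByClass(Main.class);
        job.setMapperClass(CustomMapper.class);
        job.setReducerClass(CustomReducer.class);      // hypothetical reducer doing the aggregation
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);  // one line of the listing file per map task

        // The single input is the listing file, one HDFS path per line
        FileInputFormat.addInputPath(job, new Path("/path/to/files.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/path/to/output"));

        // Placeholder output types; use whatever your reducer actually emits
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}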
Your CustomMapper.class will want to extend Mapper like so:
public class CustomMapper extends Mapper<LongWritable, Text, <Reducer Key>, <Reducer Value>> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Configuration configuration = context.getConfiguration();
        // The value is one line of the listing file, i.e. the path of the file this mapper owns
        ObjectTool tool = new ObjectTool(configuration, new Path(value.toString()));
        context.write(<reducer key>, <reducer value>);
    }
}
Where ObjectTool is another class which deals with what you actually want to do with your files.
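The answer doesn't show ObjectTool's internals, but just to illustrate the idea, a helper like that might look roughly like this. Only the constructor signature comes from the snippet above; the body is an assumption geared towards small text files:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical sketch of an ObjectTool-style helper: it opens the given path
// on HDFS so the mapper can do whatever per-file work it needs.
public class ObjectTool {
    private final FileSystem fs;
    private final Path path;

    public ObjectTool(Configuration conf, Path path) throws IOException {
        this.fs = FileSystem.get(conf);
        this.path = path;
    }

    // Example operation: read the whole (small) file from HDFS into a String
    public String readContents() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
        return out.toString(StandardCharsets.UTF_8.name());
    }
}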
So let me explain broadly what this is doing. The magic here is job.setInputFormatClass(NLineInputFormat.class), but what is it doing exactly?
It is essentially taking your input and splitting the data by line, sending each line to a mapper. By having a text file that lists each file on a new line, you create a 1:1 relationship between mappers and files. A great addition to this setup is that it allows you to create advanced tooling for the files you want to deal with.
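As far as I know, one line per split is already the default for the new-API NLineInputFormat, but it can be made explicit (or changed) through the job configuration; this snippet is just illustrative and assumes the job object from the driver above:

// One mapper per line is the default; set the lines-per-split explicitly if you want
job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 1);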
I used this to create a compression tool in HDFS. When I was researching approaches to this, a lot of people were essentially reading the file to stdout and compressing it that way; however, when it came to doing a checksum on the original file versus the file after being compressed and decompressed, the results were different. This was due to the type of data in these files, and there was no easy way to implement BytesWritable. (Information on cat'ing files to stdout can be seen here.)
That link also quotes the following:
org.apache.hadoop.mapred.lib.NLineInputFormat is the magic here. It basically tells the job to feed one file per maptask
Hope this helps!
Upvotes: 1