Java Mapreduce - getting names of files with matches & printing to output file

Question

Hi, I've been trying to write a modified version of the standard WordCount v1.0 that reads all files from an input directory (args[0]) and writes to an output directory (args[1]). The output should contain not just each word and its number of occurrences, but also a list of the files in which the word occurred.

So for example I have 2 text files:

//1.txt
I love hadoop
and big data

//2.txt
I like programming
hate big data

The output would be:

//Output.txt
I       2   1.txt 2.txt
love    1   1.txt
hadoop  1   1.txt
and     1   1.txt
big     2   1.txt 2.txt
data    2   1.txt 2.txt
like    1   2.txt
programming  1  2.txt
hate    1   2.txt

At this stage I'm not sure how to extract the name of the file where the match occurred. I'm also not sure how to store the file name: should I use a Triple, or nested maps, so perhaps Map (K1, Map (K2, v))? I don't know which of these is possible in a MapReduce program, so any tips would be greatly appreciated.

gudok · Accepted Answer

Getting file names is generally not encouraged. Different input formats have different ways of doing this, and some of them may not provide such functionality at all.

Assuming that you are working with the plain TextInputFormat, you can use the mapper's context to retrieve the input split:

// FileSplit here is org.apache.hadoop.mapreduce.lib.input.FileSplit (new API)
FileSplit split = (FileSplit) context.getInputSplit();
String filename = split.getPath().getName();

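To produce the desired format, let the mapper emit (word, filename) tuples. For each word, the reducer should collect the filenames it receives into a Map keyed by filename; the total count and the list of files both follow from that map. This assumes no combiner is used.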
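To make the data flow concrete, here is a rough sketch of the map and reduce steps in plain Java, without any Hadoop classes. The class name WordFileIndex and both method signatures are made up for illustration. In a real job, the filename would come from the input split as shown above, and the framework would do the grouping by word that main does by hand here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the idea; WordFileIndex is a hypothetical name
// and none of this is Hadoop API.
final class WordFileIndex {

    // "Map" step: emit one (word, filename) pair per whitespace-separated token.
    static List<String[]> map(String filename, String contents) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : contents.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new String[] {word, filename});
            }
        }
        return pairs;
    }

    // "Reduce" step for one word: count occurrences per file, then format
    // the total followed by the distinct file names in sorted order.
    static String reduce(String word, Iterable<String> filenames) {
        Map<String, Integer> perFile = new TreeMap<>();
        int total = 0;
        for (String filename : filenames) {
            perFile.merge(filename, 1, Integer::sum);
            total++;
        }
        return word + "\t" + total + "\t" + String.join(" ", perFile.keySet());
    }

    public static void main(String[] args) {
        // Emit pairs for the two sample files from the question.
        List<String[]> pairs = new ArrayList<>();
        pairs.addAll(map("1.txt", "I love hadoop\nand big data"));
        pairs.addAll(map("2.txt", "I like programming\nhate big data"));

        // Stand-in for the shuffle phase: group emitted filenames by word.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }

        // One reduce call per word, one output line per word.
        grouped.forEach((word, files) -> System.out.println(reduce(word, files)));
    }
}
```

The reducer has to see one filename per occurrence to get the total right, which is why a combiner that merges or drops values before the reduce step would break this approach.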
