Reputation: 3
Hi, I've been trying to write a modified version of the standard WordCount v1.0 that reads all files from an input directory (args[0]) and writes to an output directory (args[1]). The output should contain not just each word and its number of occurrences, but also the list of files in which the word appears.
So for example I have 2 text files:
//1.txt
I love hadoop
and big data
//2.txt
I like programming
hate big data
The output would be:
//Output.txt
I 2 1.txt 2.txt
love 1 1.txt
hadoop 1 1.txt
and 1 1.txt
big 2 1.txt 2.txt
data 2 1.txt 2.txt
like 1 2.txt
programming 1 2.txt
hate 1 2.txt
At this stage I'm not sure how to extract the name of the file where the match occurred. I'm also not sure how to store the file name: would I use a Triple, or would I need nested maps, so perhaps Map (K1, Map (K2, v))? I don't know which of these is possible in a MapReduce program, so any tips would be greatly appreciated.
Upvotes: 0
Views: 368
Reputation: 4179
Getting file names is generally not encouraged. Different input formats have different ways of doing this, and some of them may not provide such functionality at all.
Assuming that you are working with the simple TextInputFormat, you can use the mapper context to retrieve the split:
// Inside Mapper.map(); needs: import org.apache.hadoop.mapreduce.lib.input.FileSplit;
FileSplit split = (FileSplit) context.getInputSplit();
String filename = split.getPath().getName();
To produce the desired format, let the mapper emit <Text(word), Text(filename)> pairs. The reducer should collect them into a Map<String(word), Set<String>(filename)>. This assumes no combiner is used.
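To make the data flow concrete, here is a minimal plain-Java sketch of the idea. It deliberately uses no Hadoop classes: the class WordFileIndexSketch and its methods map, reduce and run are hypothetical names made up for illustration, and run is only a stand-in for the shuffle phase. In a real job the map logic would sit in Mapper.map() and write (word, filename) pairs through the context, and the reduce logic would sit in Reducer.reduce(), which receives the values as an Iterable<Text>.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java sketch of the data flow; no Hadoop classes are used here.
// All names in this class are hypothetical and exist only for illustration.
final class WordFileIndexSketch {

    // "Map" step: emit one (word, filename) pair per token in the line.
    static List<String[]> map(String filename, String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new String[] {word, filename});
            }
        }
        return pairs;
    }

    // "Reduce" step for one word: the number of values is the occurrence
    // count, and the distinct values are the files containing the word.
    static String reduce(String word, List<String> filenames) {
        Set<String> distinct = new TreeSet<>(filenames);
        return word + " " + filenames.size() + " " + String.join(" ", distinct);
    }

    // Stand-in for the shuffle: group emitted pairs by word, then reduce each group.
    static List<String> run(Map<String, List<String>> linesByFile) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> file : linesByFile.entrySet()) {
            for (String line : file.getValue()) {
                for (String[] pair : map(file.getKey(), line)) {
                    grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
                }
            }
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            output.add(reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }
}
```

Note that the count comes from the number of values the reducer sees for a word, which is why the no-combiner assumption matters: a combiner that collapsed duplicate file names would make the count too low. The sketch also keeps words in first-seen order, whereas a real job would output them sorted by key.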
Upvotes: 1