Reputation: 4013
I'm kind of new to MapReduce in Hadoop. I'm trying to process entries from many log files. The mapper process is quite similar to the one in the WordCount tutorial.
// word and one are declared as in the WordCount example:
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
    }
}
The thing is, instead of putting the word as the key for the reducer, I want to use related data from a table in an RDBMS. For example, the processed text is like this:
apple orange duck apple giraffe horse lion, lion grape
And there is a table
name      type
apple     fruit
duck      animal
giraffe   animal
grape     fruit
orange    fruit
lion      animal
So, instead of counting the words, I want to count the types. The output would be like this:
fruit 4
animal 5
In the previous code, it would become something like this:
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        String object = tokenizer.nextToken();
        //========================================
        String type = SomeClass.translate(object);
        //========================================
        word.set(type);
        output.collect(word, one);
    }
}
SomeClass.translate translates the object name to its type by querying an RDBMS.
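For concreteness, a naive translate could look roughly like the following (the JDBC URL, credentials, and table/column names are made up). It opens a connection and runs one query per token, which is exactly the cost my question below is about.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SomeClass {
    // Hypothetical connection details; in my setup this would point at Amazon RDS/MySQL.
    private static final String URL = "jdbc:mysql://mydb.example.com:3306/lookup";

    public static String translate(String name) throws IOException {
        try (Connection conn = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT type FROM name_type WHERE name = ?")) {
            stmt.setString(1, name);
            try (ResultSet rs = stmt.executeQuery()) {
                // Fall back to "unknown" when the word is not in the table.
                return rs.next() ? rs.getString(1) : "unknown";
            }
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}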
My question: considering that there will be multiple apple words spread across more than one machine, how can I reduce the number of database look-ups for apple?

UPDATE
I'm implementing it using Apache Hadoop on Amazon Elastic MapReduce, and the translation table is stored in Amazon RDS/MySQL. I would really appreciate it if you could provide some sample code or links.
Upvotes: 0
Views: 240
Reputation: 18424
If you're worried about minimizing DB queries, you could do this in two MR jobs: first do a standard word count, then use the output of that job to translate each word to its type and re-sum the counts. That way each word is looked up once per distinct word, no matter how many times it occurs.
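As a rough sketch of that second job (class name made up, using the same old mapred API as the question), its mapper parses the word-count output lines and re-keys each count by type, so the DB is hit once per distinct word rather than once per occurrence; a standard summing reducer then totals the counts per type.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TypeMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text type = new Text();
    private final IntWritable count = new IntWritable();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // The word-count job's TextOutputFormat writes lines of the form "word<TAB>count".
        String[] parts = value.toString().split("\t");
        if (parts.length != 2) {
            return; // skip malformed lines
        }
        type.set(SomeClass.translate(parts[0])); // one DB look-up per distinct word
        count.set(Integer.parseInt(parts[1]));
        output.collect(type, count);             // the reducer sums these per type
    }
}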
Alternatively, if your mapping table is small enough to fit in memory, you could start by serializing it, adding it to the DistributedCache, and then loading it into memory as part of the Mapper's setup method. Then there's no need to worry about doing the translation too many times, as it's just a cheap memory lookup.
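Here is a minimal sketch of that approach, assuming the table has been exported to a tab-separated text file (one name<TAB>type pair per line) that is added to the DistributedCache at job-submission time. The class and file names are made up, and it uses the old mapred API from the question, whose configure() plays the role of setup():

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TypeCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Map<String, String> nameToType = new HashMap<String, String>();
    private final Text type = new Text();

    @Override
    public void configure(JobConf conf) {
        // Load the cached name->type table once per task, before any map() call.
        try {
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t"); // each line: name<TAB>type
                if (parts.length == 2) {
                    nameToType.put(parts[0], parts[1]);
                }
            }
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException("Could not load lookup table from DistributedCache", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            String t = nameToType.get(tokenizer.nextToken()); // cheap in-memory look-up
            if (t != null) {
                type.set(t);
                output.collect(type, one);
            }
        }
    }
}

At submission time the file would be registered with something like DistributedCache.addCacheFile(new URI("s3://my-bucket/name_type.txt"), conf) (the S3 path is just a placeholder).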
Upvotes: 1
Reputation: 33495
To summarize the requirement: a join is done between the data in the table and the file, and a count is done on the joined data. Depending on the input data sizes, the join can be done in different ways (map-side only, or with a reduce phase). For more details on joining, go through Data-Intensive Text Processing with MapReduce, Section 3.5.
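For the reduce-side variant, a rough sketch of the joining reducer (class name made up; the two mappers that feed it are described in the comments):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Two mappers feed this reducer, both keyed by the word:
//   - a mapper over the table export emits (name, "T:<type>"), one record per table row
//   - a mapper over the log text emits (name, "W:1"), one record per word occurrence
// The reducer joins the two sides per word and emits (type, numberOfOccurrences);
// a second summing pass then totals the counts per type.
public class JoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, IntWritable> {

    public void reduce(Text name, Iterator<Text> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String type = null;
        int occurrences = 0;
        while (values.hasNext()) {
            String v = values.next().toString();
            if (v.startsWith("T:")) {
                type = v.substring(2);   // table side of the join
            } else {
                occurrences++;           // one "W:1" record per occurrence
            }
        }
        if (type != null && occurrences > 0) {
            output.collect(new Text(type), new IntWritable(occurrences));
        }
    }
}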
Upvotes: 1