Reputation: 4013
I'm kind of new to MapReduce in Hadoop. I'm trying to process entries from many log files. The mapper process is quite similar to the one in the WordCount tutorial.
// word and one are declared as in the WordCount example:
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
    }
}
The thing is, instead of putting the word as the key for the reducer, I want to use related data from a table in an RDBMS. For example, the processed text is like this:
apple orange duck apple giraffe horse lion, lion grape
And there is a table
name      type
apple     fruit
duck      animal
giraffe   animal
grape     fruit
orange    fruit
lion      animal
So, instead of counting the words, I want to count the types. The output would be like this:
fruit 4
animal 5
In the previous code, it would become something like this:
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        String object = tokenizer.nextToken();
        //========================================
        String type = SomeClass.translate(object);
        //========================================
        word.set(type);
        output.collect(word, one);
    }
}
SomeClass.translate translates the object name to its type by querying an RDBMS.
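For concreteness, a naive translate could look roughly like the following (the JDBC URL, credentials, and table/column names are made up). It opens a connection and runs one query per token, which is exactly the cost my question below is about.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SomeClass {
    // Hypothetical connection details; in my setup this would point at Amazon RDS/MySQL.
    private static final String URL = "jdbc:mysql://mydb.example.com:3306/lookup";

    public static String translate(String name) throws IOException {
        try (Connection conn = DriverManager.getConnection(URL, "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT type FROM name_type WHERE name = ?")) {
            stmt.setString(1, name);
            try (ResultSet rs = stmt.executeQuery()) {
                // Fall back to "unknown" when the word is not in the table.
                return rs.next() ? rs.getString(1) : "unknown";
            }
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}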
My question: considering that there will be multiple apple words spread across more than one machine, how can I reduce the number of database look-ups for apple?

UPDATE
I'm implementing it using Apache Hadoop on Amazon Elastic MapReduce, and the translation table is stored in Amazon RDS/MySQL. I would really appreciate it if you could provide some sample code or links.
Upvotes: 0
Views: 240
Reputation: 18424
If you're worried about minimizing DB queries, you could do this in two MR jobs: first do a standard word count, then use the output of that job to translate each word to its type and re-sum the counts. That way each word is looked up once per distinct word, no matter how many times it occurs.
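As a rough sketch of that second job (class name made up, using the same old mapred API as the question), its mapper parses the word-count output lines and re-keys each count by type, so the DB is hit once per distinct word rather than once per occurrence; a standard summing reducer then totals the counts per type.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TypeMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text type = new Text();
    private final IntWritable count = new IntWritable();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // The word-count job's TextOutputFormat writes lines of the form "word<TAB>count".
        String[] parts = value.toString().split("\t");
        if (parts.length != 2) {
            return; // skip malformed lines
        }
        type.set(SomeClass.translate(parts[0])); // one DB look-up per distinct word
        count.set(Integer.parseInt(parts[1]));
        output.collect(type, count);             // the reducer sums these per type
    }
}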
Alternatively, if your mapping table is small enough to fit in memory, you could start by serializing it, adding it to the DistributedCache, and then loading it into memory as part of the Mapper's setup method. Then there's no need to worry about doing the translation too many times, as it's just a cheap memory lookup.
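Here is a minimal sketch of that approach, assuming the table has been exported to a tab-separated text file (one name<TAB>type pair per line) that is added to the DistributedCache at job-submission time. The class and file names are made up, and it uses the old mapred API from the question, whose configure() plays the role of setup():

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TypeCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Map<String, String> nameToType = new HashMap<String, String>();
    private final Text type = new Text();

    @Override
    public void configure(JobConf conf) {
        // Load the cached name->type table once per task, before any map() call.
        try {
            Path[] cached = DistributedCache.getLocalCacheFiles(conf);
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t"); // each line: name<TAB>type
                if (parts.length == 2) {
                    nameToType.put(parts[0], parts[1]);
                }
            }
            reader.close();
        } catch (IOException e) {
            throw new RuntimeException("Could not load lookup table from DistributedCache", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            String t = nameToType.get(tokenizer.nextToken()); // cheap in-memory look-up
            if (t != null) {
                type.set(t);
                output.collect(type, one);
            }
        }
    }
}

At submission time the file would be registered with something like DistributedCache.addCacheFile(new URI("s3://my-bucket/name_type.txt"), conf) (the S3 path is just a placeholder).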
Upvotes: 1
Reputation: 33495
To summarize the requirement: a join is done between the data in the table and the file, and a count is done on the joined data. Depending on the input data sizes, the join can be done in different ways (map-side only, or with a reduce phase). For more details on joining, go through Data-Intensive Text Processing with MapReduce, Section 3.5.
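For the reduce-side variant, a rough sketch of the joining reducer (class name made up; the two mappers that feed it are described in the comments):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Two mappers feed this reducer, both keyed by the word:
//   - a mapper over the table export emits (name, "T:<type>"), one record per table row
//   - a mapper over the log text emits (name, "W:1"), one record per word occurrence
// The reducer joins the two sides per word and emits (type, numberOfOccurrences);
// a second summing pass then totals the counts per type.
public class JoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, IntWritable> {

    public void reduce(Text name, Iterator<Text> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String type = null;
        int occurrences = 0;
        while (values.hasNext()) {
            String v = values.next().toString();
            if (v.startsWith("T:")) {
                type = v.substring(2);   // table side of the join
            } else {
                occurrences++;           // one "W:1" record per occurrence
            }
        }
        if (type != null && occurrences > 0) {
            output.collect(new Text(type), new IntWritable(occurrences));
        }
    }
}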
Upvotes: 1