further processing after reducer

Question

Probably a very lame question. I have two documents, and I want to find the overlap between them in MapReduce fashion and then compare the overlap (let's say I have some measure to do that).

So this is what I am thinking:

1) Run the normal wordcount job on one document (https://sites.google.com/site/hadoopandhive/home/hadoop-how-to-count-number-of-times-a-word-appeared-in-a-file-using-map-reduce-framework)
2) But rather than saving the output to a file, save everything in a HashMap(word, true)
3) Pass that HashMap along to the second wordcount MapReduce program and then, as I am processing the second document, check each word against the HashMap to find whether it is present or not.

So, something like this:

 1) HashMap hm = runStepOne(); <-- map reduce job
 2) runStepTwo(hm)

How do I do this in Hadoop?

ryanbwork · Accepted Answer

Sounds like you could use some form of DistributedCache to store your intermediate results after the initial wordcount job, then run another job that uses these intermediate results to test whether each word occurs in the second document. You may be able to encapsulate both of these steps in a single MR job, but off the top of my head I'm not sure how.
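
To make the two-step flow concrete, here is a minimal plain-Java sketch of the logic only. It deliberately uses no Hadoop classes: `OverlapSketch`, `tokenize`, `buildVocabulary`, and `findOverlap` are hypothetical names invented for illustration, and the in-memory lists stand in for the two documents. In a real setup, the word set would presumably be written out by the first job, shipped to the second job through the distributed cache, and loaded once per mapper before any records are processed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java sketch of the two-step idea; all names here are hypothetical
// and no Hadoop API is involved.
final class OverlapSketch {

    // Simplistic tokenizer (an assumption): lower-case, split on non-word characters.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Step one: stands in for the first job's output, the distinct words of document one.
    static Set<String> buildVocabulary(List<String> firstDocumentLines) {
        Set<String> vocabulary = new HashSet<>();
        for (String line : firstDocumentLines) {
            vocabulary.addAll(tokenize(line));
        }
        return vocabulary;
    }

    // Step two: stands in for the second job, which checks each word of
    // document two against the word set produced by step one.
    static Set<String> findOverlap(Set<String> vocabulary, List<String> secondDocumentLines) {
        Set<String> overlap = new TreeSet<>(); // sorted so the output is deterministic
        for (String line : secondDocumentLines) {
            for (String word : tokenize(line)) {
                if (vocabulary.contains(word)) {
                    overlap.add(word);
                }
            }
        }
        return overlap;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("the quick brown fox", "jumps over the lazy dog");
        List<String> second = Arrays.asList("a lazy fox sleeps", "the end");

        Set<String> vocabulary = buildVocabulary(first);
        Set<String> overlap = findOverlap(vocabulary, second);

        System.out.println(overlap); // prints [fox, lazy, the]
    }
}
```

A set is used instead of the HashMap(word, true) from the question because only membership matters here. Note that this only works if the first document's vocabulary fits in each mapper's memory.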
