Reputation: 33293
Probably a very basic question. I have two documents, and I want to find the words they have in common in MapReduce fashion and then compare that overlap (let's say I already have some measure for that).
So this is what I am thinking:
1) Run the normal wordcount job on one document (https://sites.google.com/site/hadoopandhive/home/hadoop-how-to-count-number-of-times-a-word-appeared-in-a-file-using-map-reduce-framework)
2) But rather than saving the output to a file, save everything in a HashMap (word -> true)
3) Pass that HashMap along to the second wordcount MapReduce program, and while processing the second document, check each word against the HashMap to see whether it is present.
So, something like this:
1) HashMap<String, Boolean> hm = runStepOne(); <-- map reduce job
2) runStepTwo(hm);
How do I do this in Hadoop?
Upvotes: 0
Views: 286
Reputation: 33545
Check Section 3.5 of Data-Intensive Text Processing with MapReduce, which covers how to do joins. The same book also describes other MapReduce algorithms.
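To make the join idea concrete, here is a minimal sketch of a reduce-side join applied to this problem, written as a plain Java simulation with no Hadoop dependency. Every name in it (OverlapJoinSketch, map, shuffle, reduce, overlap, and the document ids "A" and "B") is made up for illustration, and the tokenization is an assumption. In a real job the map and reduce steps would live in Mapper and Reducer classes and the framework would do the shuffle.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java simulation of a reduce-side join; no Hadoop classes are used.
// All names in this class are hypothetical and chosen only for illustration.
class OverlapJoinSketch {

    // "Map" step: emit one (word, docId) pair per token of the document.
    static List<String[]> map(String docId, String text) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new String[] {token, docId});
            }
        }
        return pairs;
    }

    // "Shuffle" step: group the document ids by word, as the framework would.
    static Map<String, Set<String>> shuffle(List<String[]> pairs) {
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new HashSet<>()).add(pair[1]);
        }
        return grouped;
    }

    // "Reduce" step: keep a word only if both document ids were seen for it.
    static Set<String> reduce(Map<String, Set<String>> grouped) {
        Set<String> overlap = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : grouped.entrySet()) {
            if (entry.getValue().size() == 2) {
                overlap.add(entry.getKey());
            }
        }
        return overlap;
    }

    // Runs the three steps over two documents tagged "A" and "B".
    static Set<String> overlap(String docA, String docB) {
        List<String[]> pairs = new ArrayList<>(map("A", docA));
        pairs.addAll(map("B", docB));
        return reduce(shuffle(pairs));
    }
}
```

Calling OverlapJoinSketch.overlap(textOfDocA, textOfDocB) returns the sorted set of shared words. The point of this layout is that a single job handles both documents, so no HashMap has to be passed between jobs.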
Upvotes: 1
Reputation: 2153
Sounds like you could use some form of DistributedCache to store your intermediate results after the initial wordcount job, then run a second job that reads those intermediate results and tests whether each word also occurs in the second document. You may be able to fold both steps into a single MR job, but off the top of my head I'm not sure how.
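Roughly, the two-job idea looks like the sketch below. It is plain Java with no Hadoop dependency, and the names (CachedLookupSketch, distinctWords, wordsAlsoIn) are hypothetical. The in-memory Set stands in for the cached output of the first job, which in a real setup each mapper of the second job would load once before processing its input. The tokenization is also just an assumption.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java stand-in for the "cache the first job's output" approach.
// All names in this class are hypothetical; no Hadoop classes are used.
class CachedLookupSketch {

    // Stand-in for job one: the distinct words of the first document.
    // In Hadoop this result would be written out and shipped via the cache.
    static Set<String> distinctWords(String text) {
        Set<String> words = new HashSet<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Stand-in for job two's mapper: emit a word of the second document
    // only if it is present in the cached set from job one.
    static Set<String> wordsAlsoIn(Set<String> cachedWords, String secondDoc) {
        Set<String> overlap = new TreeSet<>();
        for (String token : secondDoc.toLowerCase().split("\\W+")) {
            if (cachedWords.contains(token)) {
                overlap.add(token);
            }
        }
        return overlap;
    }
}
```

Usage would be something like wordsAlsoIn(distinctWords(firstDoc), secondDoc). Note that a Set is enough here; the HashMap(word, true) from the question carries no extra information. This only works well if the first document's vocabulary fits in each mapper's memory.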
Upvotes: 3