Hadoop join with String key

Question

I'm implementing a reduce-side join to find matches between databases A and B. Each file in both datasets contains one JSON object per line. The join key is the name attribute of each record, so the mapper extracts the name from the JSON and emits it as the key, with the JSON itself as the value. The reducer must merge the JSON objects that belong to the same or a similar person name.

The problem is that I need to group keys using a string similarity matching algorithm, e.g., John White must be considered equal to John White Lennon.

I've tried to do that using a grouping comparator but it is not working as expected.

How can this be implemented?

Thanks in advance!

vefthym · Accepted Answer

What you are asking for could be described as a set similarity join, where the sets are, e.g., the tokens or the n-grams of each line. Here is a research paper that describes how you can achieve that in MapReduce. I hope you find it useful.
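To make the idea of comparing token sets concrete, here is a minimal sketch in plain Java (no Hadoop dependencies). It is not the algorithm from the paper. The class name `NameSimilarity`, the whitespace tokenization, and the use of Jaccard similarity are all assumptions chosen for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Hadoop or the paper.
final class NameSimilarity {

    // Splits a name into a set of lowercase, whitespace-separated tokens.
    static Set<String> tokens(String name) {
        Set<String> result = new HashSet<>();
        for (String t : name.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Jaccard similarity of the two token sets: |A ∩ B| / |A ∪ B|.
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0; // two empty names are treated as identical (assumption)
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }
}
```

With this sketch, `NameSimilarity.jaccard("John White", "John White Lennon")` shares two of three distinct tokens, so the score is 2/3. One simple, unoptimized way to use it in a job would be to have the mapper emit each token of the name as a key, so that records sharing at least one token meet in the same reducer, and then keep only the pairs whose score passes a threshold you choose. A pair that shares several tokens would then show up in several reducers, so the output would need deduplication.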
