Reputation: 36474
I am trying to match two lists of products by name.
Products come from different websites, and their names may vary from one website to the other in many subtle ways, e.g. "iPhone 128 GB" vs "Apple iPhone 128GB".
The product lists intersect, but they are not equal and neither is a superset of the other; i.e. some products from list A are not in list B, and vice versa.
Given an algorithm that compares two strings (product names) and returns a similarity score between 0 and 1 (I already have a satisfactory implementation here), I'm looking for an algorithm that performs an optimal match of list A to list B.
In other words, I think I'm looking for an algorithm that maximizes the sum of the similarity scores across all matches.
Note that a product from one list must be matched to at most one product from the other list.
My first idea was a naive implementation:
1. Take each product from A, compute its similarity with each product in B, and retain the product that yields the highest score, provided that it exceeds a certain threshold, such as 0.75. Match these products.
2. If the highest-scoring product from B has already been matched to another product from A earlier in the loop, take the second-to-highest, provided that it exceeds the threshold above. Match to this one instead.
3. Etc.
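For illustration, here is a minimal Python sketch of this naive approach; similarity(a, b) stands for the existing scoring function (returning a value in [0, 1]), and the function and variable names are just placeholders:

def naive_match(list_a, list_b, similarity, threshold=0.75):
    # similarity(a, b) is the existing scoring function, assumed to return a float in [0, 1]
    matches = {}   # product from A -> matched product from B
    taken = set()  # products from B that are already assigned
    for a in list_a:
        # Score every still-unassigned candidate from B, best first.
        candidates = sorted(
            ((similarity(a, b), b) for b in list_b if b not in taken),
            key=lambda pair: pair[0],
            reverse=True,
        )
        for score, b in candidates:
            if score < threshold:
                break  # all remaining candidates score even lower
            matches[a] = b
            taken.add(b)
            break
    return matches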
My worry with this naive implementation is that if there is a better match later in the loop, but the product from B has already been assigned to another product from A in a previous iteration, the matching is not optimal.
To ensure that the product is matched to its highest similarity counterpart, I thought of the following implementation:
1. Pre-compute the similarity scores of all A-B pairs.
2. Sort the pairs by descending similarity.
3. Iterate over the sorted pairs; if neither product A nor product B has already been matched, match these products.
This algorithm should optimally match product pairs, ensuring that each pair gets the highest similarity.
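A sketch of this improved version in Python, again assuming similarity(a, b) is the existing scoring function and using placeholder names:

def greedy_global_match(list_a, list_b, similarity, threshold=0.75):
    # 1. Pre-compute all A-B pair scores, keeping only those above the threshold.
    pairs = []
    for a in list_a:
        for b in list_b:
            score = similarity(a, b)
            if score >= threshold:
                pairs.append((score, a, b))

    # 2. Sort by descending similarity.
    pairs.sort(key=lambda p: p[0], reverse=True)

    # 3. Walk the sorted pairs; match only when both sides are still free.
    matched_a, matched_b, matches = set(), set(), []
    for score, a, b in pairs:
        if a not in matched_a and b not in matched_b:
            matches.append((a, b, score))
            matched_a.add(a)
            matched_b.add(b)
    return matches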
My worry is that this is very compute- and memory-intensive: say I have 5,000 products in both lists; that is 25,000,000 similarity scores to pre-compute and potentially store in memory (or in a database). In reality the number will be lower due to the minimum required threshold, but it can still get very large, and computing the scores is still CPU-intensive.
Is there a more efficient algorithm that gives the same output as this improved version?
Upvotes: 4
Views: 5068
Reputation: 798
Your model can be reformulated in graph terms: consider a complete weighted bipartite graph, where the vertices of the first part are the names from list A, the vertices of the second part are the names from list B, and the edges are weighted with the precomputed similarity scores.
Now your problem looks really close to the dense assignment problem, whose optimal solution can be found with the Hungarian algorithm (O(n³) complexity).
If an optimal solution is not your final goal and a good approximation to the optimum would also satisfy your requirements, try heuristic algorithms for the assignment problem; there is another topic with a brief overview of them.
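If you happen to be working in Python, SciPy ships an optimal solver for this: scipy.optimize.linear_sum_assignment. A small sketch, assuming sim is a precomputed matrix of similarity scores (the names and values here are purely illustrative):

import numpy as np
from scipy.optimize import linear_sum_assignment

# sim[i, j] is assumed to hold the precomputed similarity between
# product i of list A and product j of list B; the matrix may be
# rectangular if the lists have different lengths.
sim = np.array([
    [0.9, 0.2, 0.4],
    [0.1, 0.8, 0.3],
])

# maximize=True makes SciPy maximize the total similarity rather than
# minimize a cost; it returns the optimal row/column pairing.
rows, cols = linear_sum_assignment(sim, maximize=True)
print(list(zip(rows.tolist(), cols.tolist())))  # [(0, 0), (1, 1)]
print(round(float(sim[rows, cols].sum()), 2))   # 1.7

Pairs whose score falls below your threshold can simply be dropped from the result afterwards.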
Upvotes: 3
Reputation: 2307
Your second algorithm should provide a decent output, but it's not optimal. Check the following case:
Set0 Set1
A C
B D
Similarities:
A-C = 900
A-D = 850
B-C = 850
B-D = 0
Your algorithm's output: [(A,C), (B,D)]. Value 900.
Optimal output: [(A,D), (B,C)]. Value 1700.
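For an instance this small, the optimum is easy to verify by brute force; a quick Python sketch using the values above:

from itertools import permutations

# Try every one-to-one assignment of {A, B} onto {C, D} and keep the best total.
sim = {("A", "C"): 900, ("A", "D"): 850, ("B", "C"): 850, ("B", "D"): 0}
left, right = ["A", "B"], ["C", "D"]

best = max(
    permutations(right),
    key=lambda perm: sum(sim[a, b] for a, b in zip(left, perm)),
)
print(list(zip(left, best)))                       # [('A', 'D'), ('B', 'C')]
print(sum(sim[a, b] for a, b in zip(left, best)))  # 1700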
The problem you are working with is exactly the Assignment Problem, which is "finding, in a weighted bipartite graph, a matching in which the sum of weights of the edges is as large as possible". You can find many ways to optimally and efficiently solve this problem.
Upvotes: 2