BenMorel
BenMorel

Reputation: 36474

Algorithm to best match item pairs given similarity score

I am trying to match two lists of products by name.

Products come from different websites, and theirs names may vary from one website to the other in many subtle ways, e.g. "iPhone 128 GB" vs "Apple iPhone 128GB".

The product lists intersect, but are not equal and one is not a superset of the other; i.e. some products from list A are not in list B, and vice versa.

Given an algorithm that compares two strings (product names) and returns a similarity score between 0 and 1 (I already have a satisfactory implementation here), I'm looking for an algorithm that performs an optimal match of list A to list B.

In other words, I think I'm looking for an algorithm that maximizes the sum of all similary scores in the matches.

Note that a product from one list must be matched to at most one product from the other list.

My initial idea

etc.

My worry with this native implementation is that if there's a better match later in the loop, but the product from B has already been assigned to another product from A in a previous iteration, the matching is not optimal.

An improved version

To ensure that the product is matched to its highest similarity counterpart, I thought of the following implementation:

This algorithm should optimally match product pairs, ensuring that each pair got the highest similarity.

My worry is that it's very compute- and memory-intensive: say I have 5,000 products in both lists, that is 25,000,000 similarity scores to pre-compute and potentially store in memory (or database); in reality it will be lower due to the minimum required threshold, but it can still get very large and is still CPU intensive.

Did I miss something?

Is there a more efficient algorithm that gives the same output as this improved version?

Upvotes: 4

Views: 5068

Answers (2)

gimme_danger
gimme_danger

Reputation: 798

Your model could be reformulated in graph terms: consider a complete weighted bipartite graph, where vertices of the first part are names from list A, vertices of the second part are names from list B and edges are weighted with precomputed scores of similarity.

enter image description here

Now your problem looks really close to the dense Assignment_problem, which optimal solution can be found with Hungarian algorithm (O(n³) complexity).

If optimal solution is not your final goal and some good approximations to optimum can also satisfy your requirements, try heuristic algorithms for assignment problem, here is another topic with a brief overview of them.

Upvotes: 3

Cihan
Cihan

Reputation: 2307

Your second algorithm should provide a decent output, but it's not optimal. Check the following case:

Set0 Set1 
A    C
B    D

Similarities:
A-C = 900
A-D = 850
B-C = 850
B-D = 0

Your algorithm's output: [(A,C), (B,D)]. Value 900.
Optimal output: [(A,D), (B,C)]. Value 1700.  

The problem you are working with is exactly the Assigment Problem, which is "finding, in a weighted bipartite graph, a matching in which the sum of weights of the edges is as large as possible". You can find many ways to optimally and efficiently solve this problem.

Upvotes: 2

Related Questions