Reputation: 36474
I am trying to match two lists of products by name.
Products come from different websites, and their names may vary from one website to the other in many subtle ways, e.g. "iPhone 128 GB" vs "Apple iPhone 128GB".
The product lists intersect, but they are not equal and neither is a superset of the other; i.e. some products from list A are not in list B, and vice versa.
Given an algorithm that compares two strings (product names) and returns a similarity score between 0 and 1 (I already have a satisfactory implementation here), I'm looking for an algorithm that performs an optimal match of list A to list B.
In other words, I think I'm looking for an algorithm that maximizes the sum of the similarity scores across all matches.
Note that a product from one list must be matched to at most one product from the other list.
My first idea was a naive implementation:
1. Take each product from A, compute its similarity with each product in B, and retain the product that yields the highest score, provided that it exceeds a certain threshold, such as 0.75. Match these products.
2. If the highest-scoring product from B has already been matched to another product from A earlier in the loop, take the second-to-highest, provided that it exceeds the threshold above. Match to this one instead.
3. Etc.
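For illustration, here is a minimal Python sketch of this naive approach; similarity(a, b) stands for the existing scoring function (returning a value in [0, 1]), and the function and variable names are just placeholders:

def naive_match(list_a, list_b, similarity, threshold=0.75):
    # similarity(a, b) is the existing scoring function, assumed to return a float in [0, 1]
    matches = {}   # product from A -> matched product from B
    taken = set()  # products from B that are already assigned
    for a in list_a:
        # Score every still-unassigned candidate from B, best first.
        candidates = sorted(
            ((similarity(a, b), b) for b in list_b if b not in taken),
            key=lambda pair: pair[0],
            reverse=True,
        )
        for score, b in candidates:
            if score < threshold:
                break  # all remaining candidates score even lower
            matches[a] = b
            taken.add(b)
            break
    return matches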
My worry with this naive implementation is that if there is a better match later in the loop, but the product from B has already been assigned to another product from A in a previous iteration, the matching is not optimal.
To ensure that the product is matched to its highest similarity counterpart, I thought of the following implementation:
1. Pre-compute the similarity scores of all A-B pairs.
2. Sort the pairs by descending similarity.
3. Iterate over the sorted pairs; if neither product A nor product B has already been matched, match these products.
This algorithm should optimally match product pairs, ensuring that each pair gets the highest similarity.
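A sketch of this improved version in Python, again assuming similarity(a, b) is the existing scoring function and using placeholder names:

def greedy_global_match(list_a, list_b, similarity, threshold=0.75):
    # 1. Pre-compute all A-B pair scores, keeping only those above the threshold.
    pairs = []
    for a in list_a:
        for b in list_b:
            score = similarity(a, b)
            if score >= threshold:
                pairs.append((score, a, b))

    # 2. Sort by descending similarity.
    pairs.sort(key=lambda p: p[0], reverse=True)

    # 3. Walk the sorted pairs; match only when both sides are still free.
    matched_a, matched_b, matches = set(), set(), []
    for score, a, b in pairs:
        if a not in matched_a and b not in matched_b:
            matches.append((a, b, score))
            matched_a.add(a)
            matched_b.add(b)
    return matches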
My worry is that this is very compute- and memory-intensive: say I have 5,000 products in both lists; that is 25,000,000 similarity scores to pre-compute and potentially store in memory (or in a database). In reality the number will be lower due to the minimum required threshold, but it can still get very large, and computing the scores is still CPU-intensive.
Is there a more efficient algorithm that gives the same output as this improved version?
Upvotes: 4
Views: 5068
Reputation: 798
Your model can be reformulated in graph terms: consider a complete weighted bipartite graph, where the vertices of the first part are the names from list A, the vertices of the second part are the names from list B, and the edges are weighted with the precomputed similarity scores.
Now your problem looks really close to the dense assignment problem, whose optimal solution can be found with the Hungarian algorithm (O(n³) complexity).
If an optimal solution is not your final goal and a good approximation to the optimum would also satisfy your requirements, try heuristic algorithms for the assignment problem; there is another topic with a brief overview of them.
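If you happen to be working in Python, SciPy ships an optimal solver for this: scipy.optimize.linear_sum_assignment. A small sketch, assuming sim is a precomputed matrix of similarity scores (the names and values here are purely illustrative):

import numpy as np
from scipy.optimize import linear_sum_assignment

# sim[i, j] is assumed to hold the precomputed similarity between
# product i of list A and product j of list B; the matrix may be
# rectangular if the lists have different lengths.
sim = np.array([
    [0.9, 0.2, 0.4],
    [0.1, 0.8, 0.3],
])

# maximize=True makes SciPy maximize the total similarity rather than
# minimize a cost; it returns the optimal row/column pairing.
rows, cols = linear_sum_assignment(sim, maximize=True)
print(list(zip(rows.tolist(), cols.tolist())))  # [(0, 0), (1, 1)]
print(round(float(sim[rows, cols].sum()), 2))   # 1.7

Pairs whose score falls below your threshold can simply be dropped from the result afterwards.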
Upvotes: 3
Reputation: 2307
Your second algorithm should provide a decent output, but it's not optimal. Check the following case:
Set0 Set1
A C
B D
Similarities:
A-C = 900
A-D = 850
B-C = 850
B-D = 0
Your algorithm's output: [(A,C), (B,D)]. Value 900.
Optimal output: [(A,D), (B,C)]. Value 1700.
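For an instance this small, the optimum is easy to verify by brute force; a quick Python sketch using the values above:

from itertools import permutations

# Try every one-to-one assignment of {A, B} onto {C, D} and keep the best total.
sim = {("A", "C"): 900, ("A", "D"): 850, ("B", "C"): 850, ("B", "D"): 0}
left, right = ["A", "B"], ["C", "D"]

best = max(
    permutations(right),
    key=lambda perm: sum(sim[a, b] for a, b in zip(left, perm)),
)
print(list(zip(left, best)))                       # [('A', 'D'), ('B', 'C')]
print(sum(sim[a, b] for a, b in zip(left, best)))  # 1700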
The problem you are working with is exactly the Assignment Problem, which is "finding, in a weighted bipartite graph, a matching in which the sum of weights of the edges is as large as possible". You can find many ways to optimally and efficiently solve this problem.
Upvotes: 2