Reputation: 17295
I’m trying to write a simple program to compare prices for products from different suppliers. Different suppliers may call the same product different things.
For example, the following three strings refer to the same product:
Or the following two strings are the same product:
Furthermore, some products are not identical but are similar enough to be grouped together (for example, "Full Cream 2L Milk" may encompass several similar products).
The only bits of information I have on each product are the title, and a price.
What are currently recommended techniques for matching product strings like this?
From my Googling and reading other SO threads, I found:
Would you use one of the above techniques, or would you use a different technique?
Also, does anybody know of any example code, or even libraries for this sort of problem? I couldn't seem to find any.
(For example, I saw that some people were having performance problems with calculating the Jaro-Winkler distance for large data-sets. I was hoping there might be a distributed implementation of the algorithm (e.g. with Mahout), but wasn’t able to find anything concrete.)
Upvotes: 21
Views: 6399
Reputation: 6544
Would you use one of the above techniques, or would you use a different technique?
If I were doing this for real, I wouldn't use much machine learning. I'm sure most big companies have a database of brand and product names, and use that to match things up fairly easily. Some data cleaning might be needed, but it's not much of an ML problem.
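To illustrate the database route, here is a minimal sketch of the cleaning-plus-lookup step. The `CANONICAL` table, its product IDs, and the `normalize` and `lookup` helpers are all hypothetical names made up for this example; a real table would come from whatever brand and product database you have, and the cleaning rules would need tuning to your suppliers' naming habits.

```python
import re

# Hypothetical hand-maintained table: normalized title -> canonical product ID.
CANONICAL = {
    "a2 whole milk 2l": "milk-full-cream-2l",
    "a2 full cream milk 2l": "milk-full-cream-2l",
}

def normalize(title: str) -> str:
    """Lowercase, unify litre spellings, and collapse punctuation and spaces."""
    t = title.lower()
    # "2 Litre", "2 ltr", "2 L" -> "2l" (assumed unit spellings, extend as needed)
    t = re.sub(r"(\d+(?:\.\d+)?)\s*(?:litres?|liters?|ltr|lt|l)\b", r"\1l", t)
    # Replace anything other than letters, digits and dots with a space.
    t = re.sub(r"[^a-z0-9.]+", " ", t)
    return " ".join(t.split())

def lookup(title: str):
    """Return the canonical product ID, or None if the title is not in the table."""
    return CANONICAL.get(normalize(title))
```

Titles that come back as `None` are the ones that still need a human (or the nearest-neighbour approach) to resolve.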
If you don't have that database, I'd say go simple. Convert every title to a feature vector and do a nearest-neighbor search. Use that to build a tool that helps you create the database. For example, you mark the first "A2 Whole Milk 2L" as "milk" yourself, and then check whether its nearest neighbors are milk too. Give yourself a way to quickly mark each suggestion as "yes" or "needs review", or some similar set of options.
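A rough sketch of that idea, using only the standard library. The choice of character trigrams as features, cosine similarity as the distance, the brute-force search, and the `0.6` threshold are all assumptions made for illustration, as are the function names; the answer does not prescribe any of them, and word-level or TF-IDF features would slot in the same way.

```python
import math
from collections import Counter

def ngrams(title, n=3):
    """Feature vector: counts of character n-grams in the padded, lowercased title."""
    s = f" {' '.join(title.lower().split())} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

def nearest(query, labelled, k=3):
    """Brute-force k nearest labelled titles as (score, title, label) tuples."""
    q = ngrams(query)
    scored = [(cosine(q, ngrams(t)), t, label) for t, label in labelled.items()]
    return sorted(scored, reverse=True)[:k]

def suggest(query, labelled, threshold=0.6):
    """Propose the nearest label, or flag the title for manual review."""
    best = nearest(query, labelled, k=1)
    if best and best[0][0] >= threshold:
        return best[0][2], "yes"
    return None, "needs review"

# Titles you have already labelled by hand (made-up examples).
labelled = {
    "A2 Whole Milk 2L": "milk",
    "Sourdough Bread Loaf 700g": "bread",
}
print(suggest("A2 Whole Milk 2 L", labelled))
print(suggest("Garden Hose 15m", labelled))
```

Brute force is fine for a few thousand titles; for much larger catalogues you would want an index or blocking step so that each title is only compared against plausible candidates.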
For simple data like your examples, where this will work about 90% of the time, you should be able to get through the data with ease. I've done something similar to label several thousand documents in a day.
Once you have your own database, resolving these should be pretty straightforward. You could reuse the code you wrote to build the database to handle "unseen" data.
Upvotes: 10