victorhooi

Reputation: 17295

Comparing and matching product names from different stores/suppliers

I’m trying to write a simple program to compare prices for products from different suppliers. Different suppliers may call the same product different things.

For example, the following three strings refer to the same product:

Or the following two strings are the same product:

Furthermore, some products are not identical but are similar (for example, Full Cream 2L Milk may encompass various similar products).

The only pieces of information I have on each product are the title and a price.

What are currently recommended techniques for matching product strings like this?

From my Googling and reading other SO threads, I found:

Would you use one of the above techniques, or would you use a different technique?

Also, does anybody know of any example code, or even libraries for this sort of problem? I couldn't seem to find any.

(For example, I saw that some people were having performance problems calculating the Jaro-Winkler distance for large datasets. I was hoping there might be a distributed implementation of the algorithm (e.g. with Mahout), but wasn’t able to find anything concrete.)

Upvotes: 21

Views: 6399

Answers (1)

Raff.Edward

Reputation: 6544

Would you use one of the above techniques, or would you use a different technique?

If I were doing this for real, I wouldn't use much machine learning. I'm sure most big companies have a database of brand and product names, and use that to match things up fairly easily. Some data cleaning might be needed, but it's not much of an ML problem.

If you don't have that database, I'd say go simple. Convert every title to a feature vector and do a nearest-neighbor search. Use that to build a tool that helps you create the database. For example: you mark the first "A2 Whole Milk 2L" as "milk" yourself, and then check whether its nearest neighbors are milk too. Give yourself a way to quickly mark each one as "yes" or "needs review", or some similar pair of options.
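To make the feature-vector idea concrete, here is a minimal sketch using only the Python standard library. It assumes character trigram counts as the features and cosine similarity as the distance, which is just one reasonable choice and not something the approach depends on. The function names and every sample title other than "A2 Whole Milk 2L" are made up for illustration.

```python
import math
from collections import Counter


def trigram_vector(title, n=3):
    """Turn a title into a sparse bag of character n-grams (assumed feature choice)."""
    text = " ".join(title.lower().split())  # lowercase and collapse whitespace
    padded = f" {text} "                    # pad so word boundaries produce n-grams too
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbors(query, titles, k=3):
    """Brute-force search: return the k (score, title) pairs most similar to query."""
    query_vec = trigram_vector(query)
    scored = [(cosine(query_vec, trigram_vector(t)), t) for t in titles]
    return sorted(scored, reverse=True)[:k]


# Hypothetical supplier titles, for illustration only.
titles = [
    "A2 Milk Whole 2 Litre",
    "Brand X Dish Soap 500ml",
]

for score, title in nearest_neighbors("A2 Whole Milk 2L", titles, k=2):
    print(f"{score:.2f}  {title}")
```

The brute-force loop is fine for a few thousand titles. For much larger catalogs you would probably want an index instead of comparing against every title.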

For simple data like yours, where this will work 90% of the time, you should be able to get through the data with ease. I've done something similar to label several thousand documents in a day.

Once you have your own database, resolving these should be pretty straightforward. You could reuse the code you wrote for building the database to handle "unseen" data.
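As a rough sketch of what handling "unseen" data might look like once the database exists: look up each new title against the titles you have already labeled, and fall back to "needs review" when nothing is close enough. This version swaps in `difflib.get_close_matches` from the standard library as the similarity measure, purely to keep the example short. The `labeled` dictionary, the `resolve` helper, and the 0.6 cutoff are assumptions for illustration.

```python
from difflib import get_close_matches

# Hypothetical hand-labeled database: normalized title -> product label.
labeled = {
    "a2 whole milk 2l": "milk",
    "brand x dish soap 500ml": "dish soap",
}


def resolve(title, database, cutoff=0.6):
    """Return the label of the closest known title, or "needs review" if none is close."""
    key = " ".join(title.lower().split())  # same normalization as the database keys
    matches = get_close_matches(key, list(database), n=1, cutoff=cutoff)
    return database[matches[0]] if matches else "needs review"


print(resolve("A2 Whole Milk 2 L", labeled))
print(resolve("zzzz 9999", labeled))
```

Anything that comes back as "needs review" goes into the same manual labeling tool, so the database keeps growing as new suppliers show up.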

Upvotes: 10
