Reputation: 1
I need to automatically match product names (food). The problem is similar to Fuzzy matching of product names
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which keywords are relevant. Consider for example three product names: Lenovo T400, Lenovo R400, and New Lenovo T-400, Core 2 Duo.
The first two are ridiculously similar strings by any standard (OK, soundex might help to distinguish the T and R in this case, but the names might just as well be 400T and 400R), while the first and the third are quite far from each other as strings, yet are the same product.
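The gap described here is easy to reproduce with a plain string-similarity measure; a minimal sketch using Python's difflib shows two different products scoring higher than two names for the same product:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two different products look near-identical as strings...
different_products = similarity("Lenovo T400", "Lenovo R400")
# ...while two names for the same product look far less alike.
same_product = similarity("Lenovo T400", "New Lenovo T-400, Core 2 Duo")

print(different_products, same_product)
```

The first pair scores around 0.91, the second only around 0.56, which is exactly backwards from what a matcher should conclude.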
Obviously, the matching algorithm cannot be 100% precise; my goal is to automatically match around 80% of the names with high confidence.
But there's a complication: My strings have mistakes because the files I want to search are the result from image recognition. The product titles don't have spaces in those files.
For example, I want to find the product name cookiesoreovarianta, and I have the strings:
cookiesoreovariantb (another real product)
cookiesoreovariamtq (another real product; "a" and "q" are similar symbols in some fonts)
cookiesoreovariamta (just a mistake)
I do not have a full database of canonical names.
How would I approach this? Any ideas?
Upvotes: 0
Views: 3292
Reputation: 1
For product data, I found I needed to use a combination of fuzzy matching algorithms to be effective, as each individual technique has weaknesses.
For your particular case dealing with model numbers, you could adjust the final similarity metric to be much less forgiving where both words are non-dictionary words or where both words contain numeric digits, because model numbers are more precise than normal English words.
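One way to make the metric less forgiving, as described above, is to demand an exact match whenever both tokens contain digits or fall outside a dictionary. A sketch, where the dictionary contents and the 0.5 penalty factor are illustrative assumptions:

```python
from difflib import SequenceMatcher

def token_similarity(a: str, b: str, dictionary=frozenset()) -> float:
    """String similarity that is stricter for model-number-like tokens."""
    base = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    both_numeric = any(c.isdigit() for c in a) and any(c.isdigit() for c in b)
    both_non_dictionary = a.lower() not in dictionary and b.lower() not in dictionary
    if both_numeric or both_non_dictionary:
        # "T400" vs "R400" should not count as a near-match: model
        # numbers are precise, so halve the score unless identical.
        return 1.0 if a.lower() == b.lower() else base * 0.5
    return base

words = frozenset({"cookies", "oreo", "variant"})
print(token_similarity("T400", "R400"))               # penalized
print(token_similarity("cookies", "cookiez", words))  # ordinary fuzzy score
```

With this penalty, T400 vs R400 drops well below an ordinary dictionary-word typo like "cookiez" vs "cookies".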
If your data really looks like "cookiesoreovariantb", your biggest problem is actually tokenization. Once the words are divided correctly into "cookies oreo variant b", you can do a lot more to control the necessary degree of similarity needed to conclude that a match is found.
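To illustrate why tokenization matters here: compared as whole strings, two different products are almost indistinguishable, but compared token by token (segmentations assumed given), the differing variant letter scores zero. A minimal sketch with difflib:

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

# Whole strings: one character out of nineteen differs, so similarity is high.
whole = ratio("cookiesoreovarianta", "cookiesoreovariantb")

# Token by token: the variant letters "a" and "b" share nothing.
tokens_a = "cookies oreo variant a".split()
tokens_b = "cookies oreo variant b".split()
per_token = [ratio(x, y) for x, y in zip(tokens_a, tokens_b)]

print(whole)      # ~0.95
print(per_token)  # [1.0, 1.0, 1.0, 0.0]
```

The zero on the variant token is the signal a whole-string comparison washes out.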
I wrote a post detailing weaknesses I found trying to use each individual similarity metric by itself on product data. https://saas.findwatt.com/blog/post/confused-people-dont-buy-how-fuzzy-matching-helps
Upvotes: 0
Reputation: 3249
Ideally, you could split the strings into separate tokens and then identify what tokens are the brand, what tokens are the model name, what tokens are the model number, etc.
A good way to do that would be to use conditional random fields to train a part-of-speech-style classifier. We made a toolkit called parserator to help do that.
However, your problem is harder than normal because you also have to do what's called word segmentation.
This Stack Overflow question has a pretty good introduction to word segmentation: How to split text without spaces into list of words?
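As a rough illustration of word segmentation (the linked question covers frequency-based approaches in depth), here is a minimal dynamic-programming sketch that splits a spaceless string against a small vocabulary, preferring segmentations with fewer words; the vocabulary is a made-up example:

```python
from functools import lru_cache

def make_segmenter(vocab):
    """Return a function that splits a spaceless string into vocabulary words."""
    max_len = max(map(len, vocab))

    def segment(text):
        @lru_cache(maxsize=None)
        def best(i):
            # Best (word_count, words) segmentation of text[i:], or None.
            if i == len(text):
                return (0, ())
            options = []
            for j in range(i + 1, min(i + max_len, len(text)) + 1):
                if text[i:j] in vocab:
                    rest = best(j)
                    if rest is not None:
                        options.append((rest[0] + 1, (text[i:j],) + rest[1]))
            return min(options) if options else None

        result = best(0)
        return list(result[1]) if result else None

    return segment

vocab = {"cookies", "cookie", "so", "oreo", "variant", "a", "b", "q"}
segment = make_segmenter(vocab)
print(segment("cookiesoreovarianta"))  # ['cookies', 'oreo', 'variant', 'a']
```

A production segmenter would score candidates by word frequency (as in the linked answers) rather than just word count, and would need a fallback for out-of-vocabulary fragments like OCR errors.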
Once you have your titles segmented and labeled, when you compare two product titles you will want to compare the different parts of the title differently. For example, you find the Levenshtein distance between the brand names, then the distance between the model names, and then the distance between the model numbers.
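Once the fields are labeled, the per-field comparisons can be combined into one score. A sketch where the field names and weights are illustrative assumptions (a real setup would tune them, and would likely demand exact matches on model numbers):

```python
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def title_similarity(a: dict, b: dict, weights=None) -> float:
    """Weighted average of per-field string similarities."""
    weights = weights or {"brand": 0.2, "model_name": 0.3, "model_number": 0.5}
    return sum(w * field_similarity(a[f], b[f]) for f, w in weights.items())

t400 = {"brand": "Lenovo", "model_name": "ThinkPad", "model_number": "T400"}
r400 = {"brand": "Lenovo", "model_name": "ThinkPad", "model_number": "R400"}
print(title_similarity(t400, t400))  # 1.0
print(title_similarity(t400, r400))  # lower: the model numbers disagree
```

Weighting the model-number field most heavily is what lets a one-character difference there outweigh perfect matches on brand and model name.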
To do these multiple comparisons effectively and efficiently, use a package for record linkage like dedupe.
Upvotes: 1