Reputation: 1
I need to automatically match product names (food). The problem is similar to Fuzzy matching of product names
The main problem is that even single-letter changes in relevant keywords can make a huge difference, but it's not easy to detect which keywords are relevant. Consider for example three product names: Lenovo T400, Lenovo R400, and New Lenovo T-400, Core 2 Duo.
The first two are ridiculously similar strings by any standard (OK, soundex might help to distinguish the T and R in this case, but the names might just as well be 400T and 400R), while the first and the third are quite far from each other as strings, yet are the same product.
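The gap described here is easy to reproduce with a plain string-similarity measure; a minimal sketch using Python's difflib shows two different products scoring higher than two names for the same product:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two different products look near-identical as strings...
different_products = similarity("Lenovo T400", "Lenovo R400")
# ...while two names for the same product look far less alike.
same_product = similarity("Lenovo T400", "New Lenovo T-400, Core 2 Duo")

print(different_products, same_product)
```

The first pair scores around 0.91, the second only around 0.56, which is exactly backwards from what a matcher should conclude.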
Obviously, the matching algorithm cannot be 100% precise; my goal is to automatically match around 80% of the names with high confidence.
But there's a complication: My strings have mistakes because the files I want to search are the result from image recognition. The product titles don't have spaces in those files.
For example, I want to find the product name cookiesoreovarianta, and I have the strings:
cookiesoreovariantb (another real product)
cookiesoreovariamtq (another real product; "a" and "q" are similar symbols in some fonts)
cookiesoreovariamta (just a mistake)
I do not have a full database of canonical names.
How would I approach this? Any ideas?
Upvotes: 0
Views: 3292
Reputation: 1
For product data, I found I needed to use a combination of fuzzy matching algorithms to be effective, as each individual technique has weaknesses.
For your particular case dealing with model numbers, you could adjust the final similarity metric to be much less forgiving where both words are non-dictionary words or where both words contain numeric digits, because model numbers are more precise than normal English words.
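One way to make the metric less forgiving, as described above, is to demand an exact match whenever both tokens contain digits or fall outside a dictionary. A sketch, where the dictionary contents and the 0.5 penalty factor are illustrative assumptions:

```python
from difflib import SequenceMatcher

def token_similarity(a: str, b: str, dictionary=frozenset()) -> float:
    """String similarity that is stricter for model-number-like tokens."""
    base = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    both_numeric = any(c.isdigit() for c in a) and any(c.isdigit() for c in b)
    both_non_dictionary = a.lower() not in dictionary and b.lower() not in dictionary
    if both_numeric or both_non_dictionary:
        # "T400" vs "R400" should not count as a near-match: model
        # numbers are precise, so halve the score unless identical.
        return 1.0 if a.lower() == b.lower() else base * 0.5
    return base

words = frozenset({"cookies", "oreo", "variant"})
print(token_similarity("T400", "R400"))               # penalized
print(token_similarity("cookies", "cookiez", words))  # ordinary fuzzy score
```

With this penalty, T400 vs R400 drops well below an ordinary dictionary-word typo like "cookiez" vs "cookies".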
If your data really looks like "cookiesoreovariantb", your biggest problem is actually tokenization. Once the words are divided correctly into "cookies oreo variant b", you can do a lot more to control the necessary degree of similarity needed to conclude that a match is found.
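To illustrate why tokenization matters here: compared as whole strings, two different products are almost indistinguishable, but compared token by token (segmentations assumed given), the differing variant letter scores zero. A minimal sketch with difflib:

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

# Whole strings: one character out of nineteen differs, so similarity is high.
whole = ratio("cookiesoreovarianta", "cookiesoreovariantb")

# Token by token: the variant letters "a" and "b" share nothing.
tokens_a = "cookies oreo variant a".split()
tokens_b = "cookies oreo variant b".split()
per_token = [ratio(x, y) for x, y in zip(tokens_a, tokens_b)]

print(whole)      # ~0.95
print(per_token)  # [1.0, 1.0, 1.0, 0.0]
```

The zero on the variant token is the signal a whole-string comparison washes out.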
I wrote a post detailing weaknesses I found trying to use each individual similarity metric by itself on product data. https://saas.findwatt.com/blog/post/confused-people-dont-buy-how-fuzzy-matching-helps
Upvotes: 0
Reputation: 3249
Ideally, you could split the strings into separate tokens and then identify what tokens are the brand, what tokens are the model name, what tokens are the model number, etc.
A good way to do that would be to use conditional random fields to train a part-of-speech-style classifier. We made a toolkit called parserator to help do that.
However, your problem is harder than normal because you also have to do what's called word segmentation.
This Stack Overflow question has a pretty good introduction to word segmentation: How to split text without spaces into list of words?
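As a rough illustration of word segmentation (the linked question covers frequency-based approaches in depth), here is a minimal dynamic-programming sketch that splits a spaceless string against a small vocabulary, preferring segmentations with fewer words; the vocabulary is a made-up example:

```python
from functools import lru_cache

def make_segmenter(vocab):
    """Return a function that splits a spaceless string into vocabulary words."""
    max_len = max(map(len, vocab))

    def segment(text):
        @lru_cache(maxsize=None)
        def best(i):
            # Best (word_count, words) segmentation of text[i:], or None.
            if i == len(text):
                return (0, ())
            options = []
            for j in range(i + 1, min(i + max_len, len(text)) + 1):
                if text[i:j] in vocab:
                    rest = best(j)
                    if rest is not None:
                        options.append((rest[0] + 1, (text[i:j],) + rest[1]))
            return min(options) if options else None

        result = best(0)
        return list(result[1]) if result else None

    return segment

vocab = {"cookies", "cookie", "so", "oreo", "variant", "a", "b", "q"}
segment = make_segmenter(vocab)
print(segment("cookiesoreovarianta"))  # ['cookies', 'oreo', 'variant', 'a']
```

A production segmenter would score candidates by word frequency (as in the linked answers) rather than just word count, and would need a fallback for out-of-vocabulary fragments like OCR errors.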
Once you have your titles segmented and labeled, when you compare two product titles you will want to compare the different parts of the title differently. For example, you find the Levenshtein distance between the brand names, then the distance between the model names, and then the distance between the model numbers.
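Once the fields are labeled, the per-field comparisons can be combined into one score. A sketch where the field names and weights are illustrative assumptions (a real setup would tune them, and would likely demand exact matches on model numbers):

```python
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def title_similarity(a: dict, b: dict, weights=None) -> float:
    """Weighted average of per-field string similarities."""
    weights = weights or {"brand": 0.2, "model_name": 0.3, "model_number": 0.5}
    return sum(w * field_similarity(a[f], b[f]) for f, w in weights.items())

t400 = {"brand": "Lenovo", "model_name": "ThinkPad", "model_number": "T400"}
r400 = {"brand": "Lenovo", "model_name": "ThinkPad", "model_number": "R400"}
print(title_similarity(t400, t400))  # 1.0
print(title_similarity(t400, r400))  # lower: the model numbers disagree
```

Weighting the model-number field most heavily is what lets a one-character difference there outweigh perfect matches on brand and model name.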
To do these multiple comparisons effectively and efficiently, use a package for record linkage like dedupe.
Upvotes: 1