samson
samson

Reputation: 1152

Fuzzy substring search from a list of strings

Okay, I've seen lots of posts about fuzzy string matching, Levenstein distance, longest common substring, and so on. None of them seem to fit exactly for what I want to do. I'm pulling product results from a variety of web services, and from those I can build a big list of names for the product. These names may include a bunch of variable junk. Here are some examples, from SearchUPC:

Apple 60W magsafe adapter L-shape with extension cord
Original Apple 60W Power Adapter (L-shaped Connector) for MacBook MC461LL/A with AC Extension Wall Cord (Bulk Packaging)
Current Apple MagSafe 60W Power Adapter for MacBook MC461LL/A with AC Extension Wall Cord (Bulk Packaging)
Apple 60W MagSafe Power Adapter - Apple Mac Accessories
Apple - MagSafe 60W Power Adapter for MacBook and 13\" MacBook Pro
MagSafe - power adapter - 60 Watt

etc. What I'd like to do is pull the common product name (which to my heuristic human eye is obviously Apple 60W MagSafe Power Adapter), but none of the aforementioned methods seem likely to work. My main problem is that I don't know what to search the list of strings for... At first, I was thinking of trying longest common substring, but it seems like that will fail as a bunch of the strings have things out of order, which might yield a product name of power adapter, which is not terribly useful to the user.

Note: the vast majority of the records returned from the SearchUPC API (mostly omitted here) do include the literal string "Apply 60W MagSafe Power Adapter".

I'm implementing this in Objective-C, for iOS, but I'm really interested in the algorithm more than the implementation, so any language is acceptable.

Upvotes: 2

Views: 1317

Answers (2)

Yves V.
Yves V.

Reputation: 773

If you want to compare strings but need something more robust than longest common substring with respect to changed order of substrings, you might look into a technique called string tiling. Simplified, the principle is as follows:

  1. Find the largest common substring (larger than a minimum length) in both strings
  2. Remove that substring from both strings
  3. Repeat until there are no more substrings larger than your minimal length

In practice, the ratio between the remaining (unmatched) stringparts to the initial lengths is a excellent indicator on how well the strings match. And the technique is extremely robust against reordering of substrings. You can find the scientific paper by M. Wise describing this technique here. I have implemented the algorithm myself in the past (it's not hard), but apparantly free implementations are readily available (for example here). While I have used the algorithm in a variety of fuzzy matching scenarios, I can't vouch for the implementation which I have never used myself.

String tiling does not solve the problem of finding the largest common productname in itself, but it seems to me that by slightly modifying the algorithm, you can do so. You're probably less interested in calculating the matchpercentage than in the keeping the similar parts?

As the previous poster remarked, such fuzzy matching almost always needs some kind of manual validation, as both false positives and false negatives are unavoidable.

Upvotes: 1

SBaha
SBaha

Reputation: 83

Sounds like your approach needs to be two fold. One should be matching records with one another. The other part is pulling the "canonical name" out of those matching records. In your case it is a product, but same concept. I'm not sure how you will go about associating groups of matching records with the standardized product name, but I would suggest trying to pull the important information out of the records and trying to match that to some resources on the Internet. For instance, for your example, maybe you compare your data to an apple product list. Or you could try and have a robot that crawls google and pulls the top result to try and associate. Bottom line, at the end of the day, if your text is really dirty, you will need to include some human intervention. What I mean is that you may set a threshold for match, no match, and needs review. Good luck.

Upvotes: 0

Related Questions