Daniel Cook
Daniel Cook

Reputation: 1043

Comparing string similarity with a large amount of records

The data is camera and lens product names.

I have 55,000 records in my product table and I want compare them each against a clean 3500 master record set, so I know what they are to provide additional information.

The product table gets updated daily when it loses and gains several thousand records, performance matters.

Here an example of the data I'm working with, these 5 records

Canon 45MM 2.8 TSE    
Canon 45mm F2.8 TS-E
Canon 45mm F/2.8L Tilt-Shift - Boxed
Canon EF TS-E 45mm f/2.8 Tilt-shift Black Lens
Canon TS-e 45mm f2.8 Lens - Unboxed

should all be matched to the master record

Canon TS-E 45mm f/2.8

I tried full-text search to compare strings it was very fast, but results were poor.

Next I tried this Levenshtein distance function https://lucidar.me/en/web-dev/levenshtein-distance-in-mysql/

Each comparison (1 record against 3500 master records) can take 30-60 seconds, the results are better. Some examples.

Canon 85mm 1.2 MK II L - number 7

    M_PRODUCTNAME   SCORE
1   Canon EOS 5D Mark II    14
2   Canon EOS 6D Mark II    14
3   Canon EOS-1D Mark II N  14
4   Canon EF 85mm F1.2  14
5   Canon EF 50mm F1.8 II   14
6   Canon EOS 7D Mark II    14
7   Canon EF 85mm F1.2L II USM  14
8   Canon EOS 5D Mark III   14
9   Canon EOS-1D Mark II    14
10  Canon EOS M6 Mark II    14

Canon EF 80-200mm f4-5.6 II Lens - number 1 (actual error in record should be f4.5 not f4!)

    M_PRODUCTNAME   SCORE
1   Canon EF 80-200mm f/4.5-5.6 II  12
2   Canon EF 70-300mm f/4-5.6L IS USM   13
3   Canon EF 70-300mm f/4-5.6 IS USM    13
4   Canon EF 70-200mm F4L IS II USM 14
5   Canon EF 55-200mm f/4.5-5.6 II USM  14
6   Canon EF 70-300 F4-5.6 IS II USM    15
7   Canon EF 70-200mm f/2.8L USM    15
8   Canon EF 70-200mm F4L IS USM    15
9   Canon EF 70-200mm f/2.8L IS USM 15
10  Canon EF 70-200mm F4L USM   15

Canon fit zenitar c lens16 mm f2.8 - no match

    M_PRODUCTNAME   SCORE
1   7artisans 12mm F2.8 22
2   Canon TS-E 45mm f/2.8   22
3   Canon TS-E 90mm f/2.8   22
4   7artisans 25mm F1.8 23
5   Canon TS-E 17mm f/4L    23
6   Canon EF 28mm f/2.8 23
7   Canon Extender EF 1.4x III  23
8   Canon Extender EF 1.4x II   23
9   Canon EF 24mm f/2.8 23
10  Canon EF 35mm F2.0  23

CANON EOS IX APS Film Autofocus & Manual SLR EF/EFS Mount Camera Body - TESTED - no match

    M_PRODUCTNAME   SCORE
1   Minolta Maxxum 7 35mm SLR Camera (Body Only)    60
2   Canon EOS 400D (EOS Digital Rebel XTi / EOS Kiss Digital X) 61
3   Canon EOS 300D (EOS Digital Rebel / EOS Kiss Digital)   61
4   Canon EOS 350D (EOS Digital Rebel XT / EOS Kiss Digital N)  61
5   Holga 120FN Medium Format Plastic Camera with Flash 62
6   Canon EOS 1100D (EOS Rebel T3 / EOS Kiss X50)   62
7   Canon EOS 1200D (EOS Rebel T5 / EOS Kiss X70)   62
8   Canon EF-S 35mm F2.8 Macro IS STM   62
9   Canon EF-M 28mm F3.5 Macro IS STM   62
10  Canon EF-S 60mm f/2.8 Macro USM 62

I believe I may be able to adjust the scoring for removal/change/addition of characters, but even then it still takes too long to run.

e.g. query that took 35 seconds.

SELECT m_productname, levenshtein(m_productname, 'Tamron SP 45mm f/1.8 Di VC USD, Canon EF Fit') AS score FROM m_product ORDER by score

Also I'm still unsure on how to action the data when a low scoring match is incorrect, there may need to be some manual input at some point - but maybe this is a headache for another day.

Either I need Levenshtein to give me better scoring and run much faster, or I need an alternative approach, any ideas?

I need to be able to run the query 55,000 times in the first instance and then about 3000 times every day for new entries. So 30 seconds per query is no good.

I'm using ColdFusion if that opens up other options.

Upvotes: 2

Views: 850

Answers (2)

Adam Nathaniel Davis
Adam Nathaniel Davis

Reputation: 425

Your data seems to have a lot of info that could be broken out into more succinct columns. Before I worked more on the Levenshtein approach, I'd spend time writing some transformation scripts that would parse that data into additional columns in the same table. In other words, instead of having a table like this:

M_PRODUCTNAME
Canon EF 70-200mm f/2.8L USM

I'd have a table like this:

M_PRODUCTNAME                   BRAND   APERTURE  FSTOP
Canon EF 70-200mm f/2.8L USM    Canon   70-200    2.8L

How would you get that data broken out? Personally, I'd take each full product name and split it by spaces into an array. Then I'd look at each item in the array to apply transformation logic. Does the item contain any of your known brands? Then add that brand to the BRAND column. Does the string end with "mm"? Then I'd add that item to the APERTURE column. Does the item start with "f/" or "F/"? Then I'd add that item to FSTOP column.

This approach wouldn't catch EVERY quirk in your M_PRODUCTNAME data. But you could probably tweak it to extract a large volume of meaningful metadata. And once the data is extracted into additional columns, it's far faster and easier to search for those elements.

Also, the algorithm that extracts that data into multiple columns would probably be the same one used to find the elements to be searched on from the target string.

Upvotes: 3

Gordon Linoff
Gordon Linoff

Reputation: 1269873

My simplest suggestion is to parse out the name of the maker in both tables and use that to limit the comparison space for Levenshtein distance. The code would look something like this:

select p.*, m.*, levenshtein(m.name, p.name)
from product p join
     master m
     on p.maker = m.maker;

This can use an index on (maker) on the two tables.

You may also be able to distinguish between cameras and lenses. If so, include that as another description column.

If there are other attributes you can filter out, they would also help. In other words, you don't want to compare 50,000 records in one table to 3,500 in another. If, instead, you are comparing each of the 50,000 records to -- say -- 300, then your code will be much, much faster.

Upvotes: 2

Related Questions