Reputation: 1043
The data is camera and lens product names.
I have 55,000 records in my product table and I want compare them each against a clean 3500 master record set, so I know what they are to provide additional information.
The product table gets updated daily when it loses and gains several thousand records, performance matters.
Here an example of the data I'm working with, these 5 records
Canon 45MM 2.8 TSE
Canon 45mm F2.8 TS-E
Canon 45mm F/2.8L Tilt-Shift - Boxed
Canon EF TS-E 45mm f/2.8 Tilt-shift Black Lens
Canon TS-e 45mm f2.8 Lens - Unboxed
should all be matched to the master record
Canon TS-E 45mm f/2.8
I tried full-text search to compare strings it was very fast, but results were poor.
Next I tried this Levenshtein distance function https://lucidar.me/en/web-dev/levenshtein-distance-in-mysql/
Each comparison (1 record against 3500 master records) can take 30-60 seconds, the results are better. Some examples.
Canon 85mm 1.2 MK II L - number 7
M_PRODUCTNAME SCORE
1 Canon EOS 5D Mark II 14
2 Canon EOS 6D Mark II 14
3 Canon EOS-1D Mark II N 14
4 Canon EF 85mm F1.2 14
5 Canon EF 50mm F1.8 II 14
6 Canon EOS 7D Mark II 14
7 Canon EF 85mm F1.2L II USM 14
8 Canon EOS 5D Mark III 14
9 Canon EOS-1D Mark II 14
10 Canon EOS M6 Mark II 14
Canon EF 80-200mm f4-5.6 II Lens - number 1 (actual error in record should be f4.5 not f4!)
M_PRODUCTNAME SCORE
1 Canon EF 80-200mm f/4.5-5.6 II 12
2 Canon EF 70-300mm f/4-5.6L IS USM 13
3 Canon EF 70-300mm f/4-5.6 IS USM 13
4 Canon EF 70-200mm F4L IS II USM 14
5 Canon EF 55-200mm f/4.5-5.6 II USM 14
6 Canon EF 70-300 F4-5.6 IS II USM 15
7 Canon EF 70-200mm f/2.8L USM 15
8 Canon EF 70-200mm F4L IS USM 15
9 Canon EF 70-200mm f/2.8L IS USM 15
10 Canon EF 70-200mm F4L USM 15
Canon fit zenitar c lens16 mm f2.8 - no match
M_PRODUCTNAME SCORE
1 7artisans 12mm F2.8 22
2 Canon TS-E 45mm f/2.8 22
3 Canon TS-E 90mm f/2.8 22
4 7artisans 25mm F1.8 23
5 Canon TS-E 17mm f/4L 23
6 Canon EF 28mm f/2.8 23
7 Canon Extender EF 1.4x III 23
8 Canon Extender EF 1.4x II 23
9 Canon EF 24mm f/2.8 23
10 Canon EF 35mm F2.0 23
CANON EOS IX APS Film Autofocus & Manual SLR EF/EFS Mount Camera Body - TESTED - no match
M_PRODUCTNAME SCORE
1 Minolta Maxxum 7 35mm SLR Camera (Body Only) 60
2 Canon EOS 400D (EOS Digital Rebel XTi / EOS Kiss Digital X) 61
3 Canon EOS 300D (EOS Digital Rebel / EOS Kiss Digital) 61
4 Canon EOS 350D (EOS Digital Rebel XT / EOS Kiss Digital N) 61
5 Holga 120FN Medium Format Plastic Camera with Flash 62
6 Canon EOS 1100D (EOS Rebel T3 / EOS Kiss X50) 62
7 Canon EOS 1200D (EOS Rebel T5 / EOS Kiss X70) 62
8 Canon EF-S 35mm F2.8 Macro IS STM 62
9 Canon EF-M 28mm F3.5 Macro IS STM 62
10 Canon EF-S 60mm f/2.8 Macro USM 62
I believe I may be able to adjust the scoring for removal/change/addition of characters, but even then it still takes too long to run.
e.g. query that took 35 seconds.
SELECT m_productname, levenshtein(m_productname, 'Tamron SP 45mm f/1.8 Di VC USD, Canon EF Fit') AS score FROM m_product ORDER by score
Also I'm still unsure on how to action the data when a low scoring match is incorrect, there may need to be some manual input at some point - but maybe this is a headache for another day.
Either I need Levenshtein to give me better scoring and run much faster, or I need an alternative approach, any ideas?
I need to be able to run the query 55,000 times in the first instance and then about 3000 times every day for new entries. So 30 seconds per query is no good.
I'm using ColdFusion if that opens up other options.
Upvotes: 2
Views: 850
Reputation: 425
Your data seems to have a lot of info that could be broken out into more succinct columns. Before I worked more on the Levenshtein approach, I'd spend time writing some transformation scripts that would parse that data into additional columns in the same table. In other words, instead of having a table like this:
M_PRODUCTNAME
Canon EF 70-200mm f/2.8L USM
I'd have a table like this:
M_PRODUCTNAME BRAND APERTURE FSTOP
Canon EF 70-200mm f/2.8L USM Canon 70-200 2.8L
How would you get that data broken out? Personally, I'd take each full product name and split it by spaces into an array. Then I'd look at each item in the array to apply transformation logic. Does the item contain any of your known brands? Then add that brand to the BRAND column. Does the string end with "mm"? Then I'd add that item to the APERTURE column. Does the item start with "f/" or "F/"? Then I'd add that item to FSTOP column.
This approach wouldn't catch EVERY quirk in your M_PRODUCTNAME data. But you could probably tweak it to extract a large volume of meaningful metadata. And once the data is extracted into additional columns, it's far faster and easier to search for those elements.
Also, the algorithm that extracts that data into multiple columns would probably be the same one used to find the elements to be searched on from the target string.
Upvotes: 3
Reputation: 1269873
My simplest suggestion is to parse out the name of the maker in both tables and use that to limit the comparison space for Levenshtein distance. The code would look something like this:
select p.*, m.*, levenshtein(m.name, p.name)
from product p join
master m
on p.maker = m.maker;
This can use an index on (maker)
on the two tables.
You may also be able to distinguish between cameras and lenses. If so, include that as another description column.
If there are other attributes you can filter out, they would also help. In other words, you don't want to compare 50,000 records in one table to 3,500 in another. If, instead, you are comparing each of the 50,000 records to -- say -- 300, then your code will be much, much faster.
Upvotes: 2