Reputation: 2056
Consider following are my two sql tables:
Table 1 Table 2
+-------+-------------------------+ +-------+------------------------------+ | USD | Model | | USD | Model | +-------+-------------------------+ +-------+------------------------------+ | 700 | iPad 2 WiFi 16GB | | 710 | iPad2 WiFi 16GB | | 400 | iPhone 4S 16GB | | 450 | iPhone4S 16GB | | 250 | iPod Touch(4th Gen)8GB | | 200 |iPod Touch 4th Generation 8GB | +-------+-------------------------+ +-------+------------------------------+
I am stuck in comparing the data present in two different tables intelligently. I dug alot on the context of searching or comparing and I found
...in PHP and MySQL but they all are not efficient. Because similar_text
and LEVENSHTEIN
are really good, but the worst drawback is that they are extremely slow for 1000 rows, soundex()
and metaphones
return the same sound for such items which are not alike, like "iphone" and "ipad", both are not same etc. All I want to do is compare efficiently two rows that are alike like from the above example "iPhone 4S 16GB" and "iPhone4S 16GB" is the same or the like mentioned above and my solution should work quickly to compare such rows. Kindly let me know what are my options for comparing so I can solve my query. I would really appreciate any idea, any hint.
Note: My one table contains around ~900 rows.
This is a continuation of:
Compare two arrays and sort WRT USD
Pattern comparing with mysql between two tables column
Upvotes: 2
Views: 461
Reputation: 1626
I covered this kind of thing when doing a spam detector (loads of research, and then ditched the idea later, but moving on...).
Basically, do not use like, it's slow on large text and indexes are limited for example:
LIKE '%hello' can not use an index, but, LIKE 'hello%' can. Also, large fields will result in large indexs to make the work as you intend (they are ok for say email addresses which tend to be short).
Use = which will also be case insensitive which you must have for this.
Next, add a new field to the tables which contains the already parsed metaphone() representation (this means that it only has to be calculated ONCE).
Now you have a table of say 1000 records, each with their metaphone version as well as the original. You MUST do this to get the efficiency you require. When you want to see if some text already exists, you just convert the new text to it's metephone version, then search the db tables for it (searching on the metephone parsed field). Much quicker ;)
To improve accuracy, you may want to delete all the common words and remove punctuation such as:
Then combine all multiple whitespace such as 5 spaces, into just 1 space.
The nature of what you are doing will have hundreds of little tweaks you can do to perfect it for what you need it for.
Upvotes: 1
Reputation: 739
From a mysql point of view a solution is this:
SELECT *
FROM tb1
WHERE (USD, Model) NOT IN (SELECT USD, Model FROM tb2)
Upvotes: 0
Reputation: 4239
One way would be to create a function/stored procedure which strips a string of spaces '(',')', replaces ' Generation ' with 'Gen' and so on. After that you could create two temporary tables which are identical to your current tables except for the "Model" field where you apply your 'strip_unnecessary()' function. Now it should just be a matter of JOIN:ing the tables together or comparing them in any other way.
The trick with temporary tables saves some excecution time but if you are going to perform this query often, consider using views instead for increased performance.
This solution is highly dependent on how good the 'strip_unnecessary()' function is and on how much you know in advance of the contents of the "Model" field.
Upvotes: 1
Reputation: 16677
if you don't like the complicated-but-more-likely-to-produce-good-results solutions, then maybe you like to just remove the spaces from your text strings and try simple string compares.. or maybe convert to all UPPER then compare with no spaces.
that would at least solve your stated 'similarity' example.
Upvotes: 2