Reputation: 1016
I am currently trying to figure out how to calculate the similarity between two records. My first record would be from a deactivated advertisement - so I want to find e.g. the 10 most similar advertisement regarding to some VARCHAR-fields equalness.
The thing, I can't figure out is, if there is any MySQL function, that can help me in any way - or if I need to compare the strings in some weird way?
Similarity would be defined by these fields:
Title (weight: 50 %)
Content (weight: 40 %)
Category (weight: 10 %)
I want the calculation to be like this:
Title: Words that match in the title field (only words >2 letters are matched).
Description: Words that match in the title field (only words >2 letters are matched).
Catgory: Match the category and if that doesn't match match the parent category with less weight :)
An equation of this could be:
#1 is the old, inactive post, #2 is the active post:
#2 title matches #1 title in 3 words out of #2's total of 10 words. That gives 30 % match = 30 points.
#2 description matches #1 description in 10 words out of #2's total of 400 words. That gives a 4 % match = 4 points.
#2 category doesn't match #1's category, therefore 0 % match. That gives 0 points.
Then the sum would be 34 points for #2. :)
Here's my query - but it doesn't return different rows, but a lot of the same row.
SELECT
a.AdvertisementID as A_AdvertisementID,
IF(a.Topic LIKE a2.Topic, 50, 0) + IF(a.Description LIKE a2.Description, 40, 0) + IF(a.Cate_CategoryID LIKE a2.Cate_CategoryID, 10, 0) as A_Score,
a.AdvertisementID as A_AdvertisementID,
a.Topic as A_Topic,
LEFT(a.Description, 300) as A_Description,
a.Price as A_Price,
a.Type as A_Type
FROM
".DB_PREFIX."A_Advertisements a2,
".DB_PREFIX."A_Advertisements a
WHERE
a2.AdvertisementID <> a.AdvertisementID
AND
a.AdvertisementID = :a_id
ORDER BY
A_Score DESC
Upvotes: 0
Views: 1555
Reputation: 300845
If you can literally compare the fields you are interested in, you could have MySQL perform a simple scoring calculation using the IF() function, for example
select
foo.id,
if (foo.title='wantedtitle', 50, 0) +
if (foo.content='wantedcontent', 40, 0) +
if (foo.category='wantedcategory', 10, 0) as score
from foo
order by score desc
limit 10
A basic 'find a fragment' could be achieved using like
select
foo.id,
if (foo.title like '%wantedtitlefragment%', 50, 0) +
if (foo.content like '%wantedcontentfragment%', 40, 0) +
if (foo.category like '%wantedcategoryfragment%', 10, 0) as score
from foo
order by score desc
limit 10
There are other techniques, but they might be slow to implement in MySQL. For example, you could calculate the Levenstein distance between two string - see this post for an example implementation.
Upvotes: 2