Reputation: 259
It seems easy to find the similarity percentage between two strings in PHP; I just use
int similar_text ( string $first , string $second [, float &$percent ] )
but assume that I have two strings for example:
1- Sponsors back away from Sharapova after failed drug test
2- Maria Sharapova failed drugs test at Australian Open
With similar_text I got 53.7%, but that doesn't make sense to me: both strings are about a "failed drug test" for "Sharapova", so the percentage should be higher than 53.7%.
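For reference, here is a minimal sketch of the kind of call I mean. The variable names are only for illustration; the third argument is passed by reference and receives the percentage.

```php
<?php
$first  = 'Sponsors back away from Sharapova after failed drug test';
$second = 'Maria Sharapova failed drugs test at Australian Open';

// similar_text() returns the number of matching characters and writes
// the similarity percentage (0-100) into the by-reference third argument.
$matchingChars = similar_text($first, $second, $percent);

printf("%d matching characters, %.1f%% similar\n", $matchingChars, $percent);
```

Note that similar_text compares characters, not words, and is case-sensitive, which may be part of why the score feels low here.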
My question is: is there any way to find the real similarity percent between two strings?
Upvotes: 1
Views: 2473
Reputation: 1498
I have implemented several algorithms that will search for duplicates and they can be quite similar.
The approach I usually use is the following:
It appears to me that implementing step 1) alone would already improve your results drastically.
Example of normalization algorithm (I use "Sponsors back away from Sharapova after failed drug test" for the details):
1) lowercase the string
-> "sponsors back away from sharapova after failed drug test"
2) explode the string into words
-> [sponsors, back, away, from, sharapova, after, failed, drug, test]
3) remove noisy words (like prepositions, e.g. in, for, that, this, etc.). This step can be customized to your needs
-> [sponsors, sharapova, failed, drug, test]
4) sort the array alphabetically (optional, but this can help implementing the algorithm...)
-> [drug, failed, sharapova, sponsors, test]
Applying the very same algorithm to your other string, you would obtain:
[australian, drugs, failed, maria, open, sharapova, test]
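The four normalization steps could be sketched in PHP roughly as follows. The function name normalizeWords and the STOP_WORDS list are hypothetical: the list was chosen only so that the two example headlines reduce to the word lists shown above, and it should be adapted to your own data. The sketch also assumes plain ASCII text; for other alphabets you would probably want mb_strtolower and a Unicode-aware split pattern.

```php
<?php
// Hypothetical stop-word list (an assumption, not a standard list).
const STOP_WORDS = ['after', 'at', 'away', 'back', 'for', 'from', 'in', 'that', 'this'];

function normalizeWords(string $text): array
{
    // 1) lowercase the string
    $text = strtolower($text);

    // 2) explode the string into words, splitting on anything that is
    //    not a lowercase letter or a digit
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // 3) remove noisy words
    $words = array_diff($words, STOP_WORDS);

    // 4) sort alphabetically; sort() also reindexes the array from 0
    sort($words, SORT_STRING);

    return $words;
}

print_r(normalizeWords('Sponsors back away from Sharapova after failed drug test'));
print_r(normalizeWords('Maria Sharapova failed drugs test at Australian Open'));
```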
This will help you elaborate a clever algorithm. For example:
$words1 = ['drug', 'failed', 'sharapova', 'sponsors', 'test'];
$words2 = ['australian', 'drugs', 'failed', 'maria', 'open', 'sharapova', 'test'];
$nbWords1 = count($words1);
$stringSimilarity = 0;
foreach ($words1 as $word1) {
    $max = 0.0;
    $similarity = 0.0;
    foreach ($words2 as $word2) {
        // similar_text() writes the percentage (0-100) into $similarity
        similar_text($word1, $word2, $similarity);
        if ($similarity > $max) { // 1) keep the best match for this word
            $max = $similarity;
        }
    }
    $stringSimilarity += $max; // 2) add up the best match of each word
}
var_dump($stringSimilarity / $nbWords1); // 3) average over the words of the first string
Running this code will give you 84.83660130719. Not bad, I think ^^. I am sure this algorithm can be refined further, but it is a good start. Note that here we are basically computing the average, over the words of the first string, of each word's best similarity percentage; you may want a different final approach, so tune it to your needs ;-)
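As one possible refinement, the scoring could be wrapped in a reusable function and made symmetric, since the average above only runs over the words of the first string. This is only a sketch: the names averageBestMatch and symmetricSimilarity are hypothetical, and averaging both directions is just one option among many.

```php
<?php
/**
 * Average, over the words of $words1, of the best similar_text()
 * percentage found among $words2. Returns 0.0 if $words1 is empty.
 */
function averageBestMatch(array $words1, array $words2): float
{
    if (count($words1) === 0) {
        return 0.0; // avoid a division by zero
    }
    $total = 0.0;
    foreach ($words1 as $word1) {
        $best = 0.0;
        foreach ($words2 as $word2) {
            similar_text($word1, $word2, $percent);
            $best = max($best, $percent);
        }
        $total += $best;
    }
    return $total / count($words1);
}

// Hypothetical symmetric variant: average both directions so the
// result does not depend on which string comes first.
function symmetricSimilarity(array $words1, array $words2): float
{
    return (averageBestMatch($words1, $words2)
          + averageBestMatch($words2, $words1)) / 2;
}

$words1 = ['drug', 'failed', 'sharapova', 'sponsors', 'test'];
$words2 = ['australian', 'drugs', 'failed', 'maria', 'open', 'sharapova', 'test'];

printf("one direction: %.2f\n", averageBestMatch($words1, $words2));
printf("symmetric:     %.2f\n", symmetricSimilarity($words1, $words2));
```

With the two example word lists, the one-direction score should match the figure quoted above; the symmetric score will generally differ, because words that appear in only one of the strings pull the average down.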
Upvotes: 3