Reputation: 259

Similar text percentage in php

Finding the similarity percentage between two strings in PHP seems easy. I just use

int similar_text ( string $first , string $second [, float &$percent ] )

But suppose I have two strings, for example:

1- Sponsors back away from Sharapova after failed drug test

2- Maria Sharapova failed drugs test at Australian Open

With similar_text I get 53.7%, which doesn't make sense to me: both strings are about the "failed drug test" for "Sharapova", so the percentage should be higher than 53.7%.
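For reference, a minimal sketch of the call (the variable names are only illustrative). similar_text returns the number of matching characters, and the percentage comes back through the third argument, which is passed by reference:

```php
<?php
$first  = 'Sponsors back away from Sharapova after failed drug test';
$second = 'Maria Sharapova failed drugs test at Australian Open';

// The return value is the count of matching characters;
// $percent is filled in by reference with the similarity percentage.
$matching = similar_text($first, $second, $percent);

printf("%d matching characters, %.1f%% similar\n", $matching, $percent);
```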

My question is: is there any way to find the real similarity percent between two strings?

Upvotes: 1

Answers (1)

Vincent Pazeller

Reputation: 1498

I have implemented several algorithms that search for duplicates, and they tend to be quite similar to one another.

The approach I usually use is the following:

1) normalize the strings
2) use a comparison algorithm (e.g. similar_text, levenshtein, etc.)

It seems to me that implementing step 1) alone will improve your results drastically.
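Both comparison functions mentioned above ship with PHP, but only similar_text reports a percentage directly. As a rough sketch, a Levenshtein distance can be turned into a comparable 0-100 score like this (levenshteinPercent is a hypothetical helper, and dividing by the longer string's length is just one possible convention):

```php
<?php
// Hypothetical helper: convert a Levenshtein edit distance into a 0-100 score.
function levenshteinPercent(string $a, string $b): float
{
    $longest = max(strlen($a), strlen($b));
    if ($longest === 0) {
        return 100.0; // two empty strings are identical
    }
    // A distance of 0 gives 100; a distance equal to the longer length gives 0.
    return (1 - levenshtein($a, $b) / $longest) * 100;
}

similar_text('drug', 'drugs', $percent);
printf("similar_text: %.2f%%\n", $percent);                            // 88.89%
printf("levenshtein:  %.2f%%\n", levenshteinPercent('drug', 'drugs')); // 80.00%
```

The two functions score the same pair differently, so it is worth trying both on your own data.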

Example of normalization algorithm (I use "Sponsors back away from Sharapova after failed drug test" for the details):

1) lowercase the string

-> "sponsors back away from sharapova after failed drug test"

2) explode string in words

-> [sponsors, back, away, from, sharapova, after, failed, drug, test]

3) remove noisy words (stop words such as prepositions, e.g. in, for, that, this, etc.). This step can be customized to your needs

-> [sponsors, sharapova, failed, drug, test]

4) sort the array alphabetically (optional, but this can help when implementing the comparison)

-> [drug, failed, sharapova, sponsors, test]

Applying the very same algorithm to your other string, you would obtain:

[australian, drugs, failed, maria, open, sharapova, test]
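The four steps can be sketched in a few lines of PHP. The function name normalizeTitle and the stop-word list are assumptions made for this example; the list contains only what is needed to reproduce the two arrays above, so extend it to suit your data:

```php
<?php
// Illustrative stop-word list (an assumption, not a standard list).
const STOP_WORDS = ['after', 'at', 'away', 'back', 'for', 'from', 'in', 'that', 'this'];

function normalizeTitle(string $text): array
{
    // 1) lowercase, 2) split on anything that is not an ASCII letter or digit
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    // 3) drop the noisy words
    $words = array_diff($words, STOP_WORDS);
    // 4) sort alphabetically; sort() also reindexes the array from 0
    sort($words);
    return $words;
}

print_r(normalizeTitle('Sponsors back away from Sharapova after failed drug test'));
print_r(normalizeTitle('Maria Sharapova failed drugs test at Australian Open'));
```

Note that strtolower and the regular expression here only handle ASCII text; strings with accented characters would need mb_strtolower and a Unicode-aware pattern.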

This will help you build a smarter comparison algorithm. For example:

1) for each word in the first string, search for the highest similarity among the words of the second string
2) accumulate the highest similarity
3) divide the accumulated similarity by the number of words


    $words1 = ['drug', 'failed', 'sharapova', 'sponsors', 'test'];
    $words2 = ['australian', 'drugs', 'failed', 'maria', 'open', 'sharapova', 'test'];
    $nbWords1 = count($words1);
    $stringSimilarity = 0.0;

    foreach ($words1 as $word1) {
        // 1) best similarity (in percent) between $word1 and any word of the second string
        $max = 0.0;
        foreach ($words2 as $word2) {
            similar_text($word1, $word2, $similarity);
            if ($similarity > $max) {
                $max = $similarity;
            }
        }
        // 2) accumulate the best match found for this word
        $stringSimilarity += $max;
    }
    // 3) average over the number of words in the first string
    var_dump($stringSimilarity / $nbWords1);

Running this code will give you 84.83660130719. Not bad, I think ^^. I am sure this algorithm can be refined further, but it is a good start. Note that here we are simply computing the average of the best similarity percentage found for each word of the first string; you may want a different final step, so tune it to your needs ;-)
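One possible refinement, offered only as a sketch: the loop above averages over the words of the first string only, so swapping the two strings can change the score. Wrapping the loop in a function and averaging both directions removes that dependence on argument order. The names directedSimilarity and symmetricSimilarity are made up for this example:

```php
<?php
// Hypothetical helper: average, over the words of $from, of the best
// similar_text() percentage found among the words of $to.
function directedSimilarity(array $from, array $to): float
{
    if ($from === [] || $to === []) {
        return 0.0; // nothing to compare
    }
    $total = 0.0;
    foreach ($from as $word) {
        $best = 0.0;
        foreach ($to as $candidate) {
            similar_text($word, $candidate, $percent);
            $best = max($best, $percent);
        }
        $total += $best;
    }
    return $total / count($from);
}

// Averaging both directions makes the score independent of argument order.
function symmetricSimilarity(array $a, array $b): float
{
    return (directedSimilarity($a, $b) + directedSimilarity($b, $a)) / 2;
}

$words1 = ['drug', 'failed', 'sharapova', 'sponsors', 'test'];
$words2 = ['australian', 'drugs', 'failed', 'maria', 'open', 'sharapova', 'test'];

printf("first -> second: %.2f\n", directedSimilarity($words1, $words2));
printf("second -> first: %.2f\n", directedSimilarity($words2, $words1));
printf("symmetric:       %.2f\n", symmetricSimilarity($words1, $words2));
```

Words that appear in only one string (australian, maria, open) pull the second direction down, so the symmetric score penalizes extra content that the one-directional average ignores.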

Upvotes: 3
