Sami

Reputation: 1491

PHP - Keyword matching in text strings - How to enhance the accuracy of returned keywords?

I have a piece of PHP code as follows:

$words = array(
    'Art' => '1',
    'Sport' => '2',
    'Big Animals' => '3',
    'World Cup' => '4',
    'David Fincher' => '5',
    'Torrentino' => '6',
    'Shakes' => '7',
    'William Shakespeare' => '8'
    );
$text = "I like artists, and I like sports. Can you call the name of a big animal? Brazil World Cup matchers are very good. William Shakespeare is very famous in the world.";
$all_keywords = $all_keys = array();
foreach ($words as $word => $key) {
    if (strpos(strtolower($text), strtolower($word)) !== false) {
        $all_keywords[] = $word;
        $all_keys[] = $key;
    }
}
echo $keywords_list = implode(',', $all_keywords) . "<br>";
echo $keys_list = implode(',', $all_keys) . "<br>";

The code echoes Art,Sport,World Cup,Shakes,William Shakespeare and 1,2,4,7,8. However, it is very simple and not accurate enough to return the right keywords. For example, it returns 'Shakes' => '7' because of the word "Shakespeare" in $text, but "Shakes" is not a proper keyword for "Shakespeare". Basically, I want to get Art,Sport,World Cup,William Shakespeare and 1,2,4,8 instead of Art,Sport,World Cup,Shakes,William Shakespeare and 1,2,4,7,8. Could you please help me develop better code that extracts the keywords without this kind of problem? Thanks for your help.

Upvotes: 4

Views: 1791

Answers (4)

J A

Reputation: 1766

Off the top of my head, I think there are two additional steps that would make this function a bit more robust.

  • If we sort the $words array by strlen (descending, so longer keywords come first and shorter ones last), the desired match is more likely to be found first.
  • In the foreach loop, when a word matches (strpos does not return false), we can remove the matched word from the string to avoid further unnecessary matches (e.g. Shakes will always match wherever William Shakespeare matches).
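
The two steps above could be sketched roughly as follows. This is only an illustration under assumptions: the function name match_longest_first is hypothetical (it is not in the original code), and the sample data is trimmed to the two keywords that collide.

```php
<?php
// Hypothetical helper, not part of the original post: try the longest
// keywords first and remove each matched keyword from a working copy of the text.
function match_longest_first(array $words, $text)
{
    // Sort by keyword length, longest first, keeping the keyword => id pairs.
    uksort($words, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    $remaining = strtolower($text);
    $found = array();
    foreach ($words as $word => $key) {
        $needle = strtolower($word);
        if (strpos($remaining, $needle) !== false) {
            $found[$word] = $key;
            // Blank out the matched phrase so shorter keywords cannot match inside it.
            $remaining = str_replace($needle, ' ', $remaining);
        }
    }
    return $found;
}

$words = array('Shakes' => '7', 'William Shakespeare' => '8');
$text = "William Shakespeare is very famous in the world.";
$result = match_longest_first($words, $text);
print_r($result);
```

Note that this only stops a short keyword from matching inside a longer keyword that was also found; it is still a substring search, so 'Art' would continue to match inside "artists".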

P.S. The SO iOS app rocks! But it's still not easy to write code in it (bloody autocorrect!).

Upvotes: 0

didierc

Reputation: 14730

Replace

strpos(strtolower($text), strtolower($word)) !== false

With

preg_match('/\b'.$word.'\b/',$text)

Or, since you don't seem to care about capital letters:

preg_match('/\b'.strtolower($word).'\b/', strtolower($text))

In that case, I suggest performing strtolower($text) once beforehand, for instance just before the foreach loop, rather than on every iteration.
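
Put together, the loop from the question might look something like the sketch below. The variable $lower_text and the preg_quote() call are additions made here as assumptions: preg_quote() escapes any regex metacharacters in a keyword, which the one-line replacement above does not do. The sample data is a shortened version of the question's.

```php
<?php
$words = array(
    'Art' => '1',
    'Sport' => '2',
    'World Cup' => '4',
    'Shakes' => '7',
    'William Shakespeare' => '8',
);
$text = "I like artists, and I like sports. Brazil World Cup matchers are very good. William Shakespeare is very famous in the world.";

// Lowercase the text once, outside the loop.
$lower_text = strtolower($text);

$all_keywords = $all_keys = array();
foreach ($words as $word => $key) {
    // \b on both sides means the keyword must appear as a whole word or phrase.
    $pattern = '/\b' . preg_quote(strtolower($word), '/') . '\b/';
    if (preg_match($pattern, $lower_text) === 1) {
        $all_keywords[] = $word;
        $all_keys[] = $key;
    }
}

echo implode(',', $all_keywords), "\n";
echo implode(',', $all_keys), "\n";
```

With this text, only World Cup and William Shakespeare are matched. Art and Sport drop out as well, because the text contains "artists" and "sports" rather than the bare words.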

Upvotes: 0

Trick

Reputation: 646

You're better off using regular expressions if you want accurate matches. I modified your original code to use them instead of strpos(), because strpos() also returns partial matches, as happened with your code.
There's room for improvement, but hopefully you get the basic gist of it.

Let me know if you have any questions.

The code is written as a command-line script, so save it as demo.php and run chmod +x demo.php && ./demo.php


#!/usr/bin/php
<?php
// Array of regular expressions to match your words/phrases
$words = array(
    '/\b[Aa]rt\b/',
    '/\bI\b/',
    '/\bSport\b/',
    '/\bBig Animals\b/',
    '/\bWorld Cup\b/',
    '/\bDavid Fincher\b/',
    '/\bTorrentino\b/',
    '/\bShakes\b/',
    '/\b[sS]ports?\b/',
    '/\bWilliam Shakespeare\b/',
);

$text = "I like artists and art, and I like sports. Can you call the name of a big animal? Brazil World Cup matchers are very good. William Shakespeare is very famous in the world.";

$all_keywords = array();  // matched words/phrases
$all_keys     = array();  // offset in $text where each match begins
foreach ($words as $regex) {
    $m = array();
    if (preg_match_all($regex, $text, $m, PREG_OFFSET_CAPTURE) >= 1) {
        // $m[0] holds every full-pattern match as a [matched text, offset] pair
        foreach ($m[0] as $mm) {
            $all_keywords[] = $mm[0];  // the matched word/phrase
            $all_keys[]     = $mm[1];  // offset in $text where the match begins
        }
    }
}

echo "\$text = \"$text\"\n";
echo $keywords_list = implode(',', $all_keywords) . "<br>\n";
echo $keys_list = implode(',', $all_keys) . "<br>\n";
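
The script above records match offsets as the keys. If the numeric ids from the question are wanted instead, one possible variation (an assumption, not part of the answer as posted) is to store pattern => id pairs. The name $patterns and the shortened sample data are illustrative only.

```php
<?php
// Hypothetical variation: each regular expression maps to the question's id.
$patterns = array(
    '/\bart\b/i' => '1',
    '/\bsports?\b/i' => '2',
    '/\bShakes\b/i' => '7',
    '/\bWilliam Shakespeare\b/i' => '8',
);
$text = "I like artists and art, and I like sports. William Shakespeare is very famous in the world.";

$all_keys = array();
foreach ($patterns as $regex => $id) {
    // One match is enough to record the id, so preg_match() suffices here.
    if (preg_match($regex, $text) === 1) {
        $all_keys[] = $id;
    }
}

echo implode(',', $all_keys), "\n";
```

Here 'Shakes' is skipped because \b requires a word boundary after it, while "art" and "sports" are found as whole words.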

Upvotes: 3

Ja͢ck

Reputation: 173562

You may want to look at regular expressions to weed out partial matches:

// create regular expression by using alternation
// of all given words
$re = '/\b(?:' . join('|', array_map(function($keyword) {
    return preg_quote($keyword, '/');
}, array_keys($words))) . ')\b/i';

preg_match_all($re, $text, $matches);
foreach ($matches[0] as $keyword) {
    echo $keyword, " ", $words[$keyword], "\n";
}

The expression uses the \b assertion to match word boundaries, i.e. the word must be on its own.

Output

World Cup 4
William Shakespeare 8
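
One caveat with the /i flag: the matched text may differ in case from the array key (for example "world cup" in the text versus 'World Cup' in $words), in which case $words[$keyword] would not find the id. A minimal sketch of one way around this, assuming a lowercased lookup table (the names $ids and $found are introduced here for illustration):

```php
<?php
$words = array('World Cup' => '4', 'William Shakespeare' => '8', 'Shakes' => '7');
$text = "The world cup is on. william shakespeare is famous.";

// Lookup table keyed by the lowercased keyword, so a match in any case resolves.
$ids = array_change_key_case($words, CASE_LOWER);

// Same alternation-based expression as above.
$re = '/\b(?:' . implode('|', array_map(function ($keyword) {
    return preg_quote($keyword, '/');
}, array_keys($words))) . ')\b/i';

preg_match_all($re, $text, $matches);
$found = array();
foreach ($matches[0] as $keyword) {
    // Normalize the matched text before looking up its id.
    $found[] = $keyword . ' ' . $ids[strtolower($keyword)];
}

echo implode("\n", $found), "\n";
```

The matched text is printed exactly as it appears in $text, while the id comes from the case-normalized table.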

Upvotes: 4
