Reputation: 538
I'm trying to build a tag system for my website. It checks a written article (anywhere from 400 to 1,000 words) for specific words from an array and builds a string of all the keywords found.
The version I wrote works, but there are some problems I would like to fix.
$a = "This is my article and it's about apples and pears. I like strawberries as well though.";
$targets = array('apple', 'apples','pear','pears','strawberry','strawberries','grape','grapes');
foreach($targets as $t)
{
if (preg_match("/\b" . $t . "\b/i", $a)) {
$b[] = $t;
}
}
echo $b[0].",".$b[1].",".$b[2].",".$b[3];
$tags = $b[0].",".$b[1].",".$b[2].",".$b[3];
First of all, I would like to know whether there is any way to make this more efficient. I have a database with around 5,000 keywords, and it is growing day by day.
As you can see, I don't know how to get ALL the matches. I'm writing $b[0], $b[1], etc.
I would like it to build a string with ALL the matches, but each match only once. If apples is mentioned 5 times, only 1 should go in the string.
As I said, this works. But I don't feel that this is the best solution.
EDIT:
I'm now trying this, but I can't get it to work at all.
$a = "This is my article and it's about apples and pears. I like strawberries as well though.";
$targets = array('apple', 'apples','pear','pears','strawberry','strawberries','grape','grapes');
$targets = implode('|', $targets);
$b = [];
preg_match("/\b(" . $targets . ")\b/i", $a, $b);
echo $b;
Upvotes: 0
Views: 878
Reputation: 47894
First, I'd like to provide a non-regex method, then I'll get into some long-winded regex considerations.
Because your search "needles" are whole words, you can leverage the magic of str_word_count() like so:
Code: (Demo)
$targets=['apple','apples','pear','pears','strawberry','strawberries','grape','grapes']; // all lowercase
$input="Apples, pears, and strawberries are delicious. I probably favor the flavor of strawberries most. My brother's favorites are crabapples and grapes.";
$lowercase_input=strtolower($input); // eliminate case-sensitive issue
$words=str_word_count($lowercase_input,1); // split into array of words, permitting: ' and -
$unique_words=array_flip(array_flip($words)); // faster than array_unique()
$targeted_words=array_intersect($targets,$unique_words); // retain matches
$tags=implode(',',$targeted_words); // glue together with commas
echo $tags;
echo "\n\n";
// or as a one-liner
echo implode(',',array_intersect($targets,array_flip(array_flip(str_word_count(strtolower($input),1)))));
Output:
apples,pears,strawberries,grapes
apples,pears,strawberries,grapes
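With around 5,000 keywords, the same whole-word idea can also be written as a hash lookup, so that each article word is checked with isset() rather than compared against the whole tag list. This is only a sketch of an alternative, not the method above: find_tags() is a hypothetical helper name, and it returns tags in order of first appearance in the article rather than in tag-list order.

```php
<?php
// Hypothetical helper (not from the answer above): whole-word tag lookup.
function find_tags(string $article, array $targets): array
{
    // Flip the lowercase keywords into array keys so isset() is a hash lookup.
    $lookup = array_flip(array_map('strtolower', $targets));
    $found = [];
    foreach (str_word_count(strtolower($article), 1) as $word) {
        if (isset($lookup[$word])) {
            $found[$word] = true; // array keys deduplicate repeated words
        }
    }
    return array_keys($found);
}

$targets = ['apple', 'apples', 'pear', 'pears', 'strawberry', 'strawberries', 'grape', 'grapes'];
$input = "Apples, pears, and strawberries are delicious. I probably favor the flavor of strawberries most. My brother's favorites are crabapples and grapes.";

$tags = implode(',', find_tags($input, $targets));
echo $tags, "\n"; // apples,pears,strawberries,grapes
```

The keyword list only needs to be flipped once per request, so the per-article cost is roughly one lookup per word.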
Now about the regex...
While matiaslauriti's answer may get you a correct result, it makes very little attempt to provide any big gains in efficiency.
I'll make two points:
Do NOT use preg_match() in a loop when preg_match_all() was specifically designed to capture multiple occurrences in a single call. (code to be supplied later in answer)
Condense your pattern logic as much as possible...
Let's say you have an input like this:
$input="Today I ate an apple, then a pear, then a strawberry. This is my article and it's about apples and pears. I like strawberries as well though.";
If you use this array of tags:
$targets=['apple','apples','pear','pears','strawberry','strawberries','grape','grapes'];
to generate a simple piped regex pattern like:
/\b(?:apple|apples|pear|pears|strawberry|strawberries|grape|grapes)\b/i
It will take the regex engine 677 steps to match all of the fruit in $input. (Demo)
In contrast, if you condense the tag elements using the ? quantifier like this:
\b(?:apples?|pears?|strawberry|strawberries|grapes?)\b
Your pattern gains brevity AND efficiency, giving the same expected result in just 501 steps. (Demo)
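If you want to convince yourself that the two patterns are interchangeable before swapping them, a small check like the following can help. It is a sketch that only compares the matches on the sample input; it does not measure steps, and the variable names ($long, $short, $longMatches, $shortMatches) are illustrative.

```php
<?php
$input = "Today I ate an apple, then a pear, then a strawberry. This is my article and it's about apples and pears. I like strawberries as well though.";

// The simple piped pattern and the condensed pattern from the text above.
$long  = '/\b(?:apple|apples|pear|pears|strawberry|strawberries|grape|grapes)\b/i';
$short = '/\b(?:apples?|pears?|strawberry|strawberries|grapes?)\b/i';

preg_match_all($long, $input, $longOut);
preg_match_all($short, $input, $shortOut);

$longMatches  = $longOut[0];  // full-pattern matches from the long pattern
$shortMatches = $shortOut[0]; // full-pattern matches from the condensed pattern

var_export($longMatches === $shortMatches); // true
```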
Generating this condensed pattern can be done programmatically for simple associations (including pluralization and verb conjugations).
Here is a method for handling singular/plural relationships:
foreach($targets as $v){
if(substr($v,-1)=='s'){ // if tag ends in 's'
if(in_array(substr($v,0,-1),$targets)){ // if same words without trailing 's' exists in tag list
$condensed_targets[]=$v.'?'; // add '?' quantifier to end of tag
}else{
$condensed_targets[]=$v; // add tag that is not plural (e.g. 'dress')
}
}elseif(!in_array($v.'s',$targets)){ // if tag doesn't end in 's' and no regular plural form
$condensed_targets[]=$v; // add tag with irregular pluralization (e.g. 'strawberry')
}
}
echo '/\b(?:',implode('|',$condensed_targets),")\b/i\n";
// /\b(?:apples?|pears?|strawberry|strawberries|grapes?)\b/i
This technique will only handle the simplest cases. You can really ramp up performance by scrutinizing the tag list and identifying related tags and condensing them.
Performing my above method to condense the piped pattern on every page load is going to cost your users load time. My very strong recommendation is to keep a database table of your ever-growing tags which are stored as regex-ified tags. When new tags are encountered/generated, add them individually to the table automatically. You should periodically review the ~5000 keywords and seek out tags that can be merged without losing accuracy.
It may even help you to maintain database table logic, if you have one column for regex patterns, and another column which shows a csv of what the row's regex pattern includes:
---------------------------------------------------------------
| Pattern | Tags |
---------------------------------------------------------------
| apples? | apple,apples |
---------------------------------------------------------------
| walk(?:s|er|ed|ing)? | walk,walks,walker,walked,walking |
---------------------------------------------------------------
| strawberry | strawberry |
---------------------------------------------------------------
| strawberries | strawberries |
---------------------------------------------------------------
To improve efficiency, you can update your table data by merging the strawberry and strawberries rows like this:
---------------------------------------------------------------
| strawberr(?:y|ies) | strawberry,strawberries |
---------------------------------------------------------------
With such a simple improvement, if you only check $input for these two tags, the number of steps required drops from 59 to 40.
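As a rough illustration of how the stored patterns could be used, the sketch below pipes a few Pattern values together and runs a single search. The $patterns array is a hypothetical stand-in for rows fetched from the database (the actual query is omitted), and the 'pears?' row is an assumed addition that does not appear in the table above.

```php
<?php
// Hypothetical stand-in for the Pattern column values pulled from the database.
$patterns = ['apples?', 'pears?', 'walk(?:s|er|ed|ing)?', 'strawberr(?:y|ies)'];

$input = "Today I ate an apple, then a pear, then a strawberry. This is my article and it's about apples and pears. I like strawberries as well though.";

// Wrap the piped patterns in one non-capturing group with word boundaries.
$regex = '/\b(?:' . implode('|', $patterns) . ')\b/i';

preg_match_all($regex, $input, $out);
$found = $out[0]; // every full match, in order of appearance

var_export($found);
```

The merged strawberr(?:y|ies) row matches both the singular and the plural, so no tag is lost by combining the two rows.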
Because you are dealing with >5000 tags, the performance improvement will be very noticeable. This kind of refinement is best handled on a human level, but you might use some programmatic techniques to identify tags that share an internal substring.
When you want to use your Pattern column values, just pull them from your database, pipe them together, and place them inside preg_match_all().
*Keep in mind you should use non-capturing groups when condensing tags into a single pattern because my code to follow will reduce memory usage by avoiding capture groups.
Code (Demo Link):
$input="Today I ate an apple, then a pear, then a strawberry. This is my article and it's about apples and pears. I like strawberries as well though.";
$targets=['apple','apples','pear','pears','strawberry','strawberries','grape','grapes'];
//echo '/\b(?:',implode('|',$targets),")\b/i\n";
// condense singulars & plurals forms using ? quantifier
foreach($targets as $v){
if(substr($v,-1)=='s'){ // if tag ends in 's'
if(in_array(substr($v,0,-1),$targets)){ // if same words without trailing 's' exists in tag list
$condensed_targets[]=$v.'?'; // add '?' quantifier to end of tag
}else{
$condensed_targets[]=$v; // add tag that is not plural (e.g. 'dress')
}
}elseif(!in_array($v.'s',$targets)){ // if tag doesn't end in 's' and no regular plural form
$condensed_targets[]=$v; // add tag with irregular pluralization (e.g. 'strawberry')
}
}
echo '/\b(?:',implode('|',$condensed_targets),")\b/i\n\n";
// use preg_match_all and call it just once without looping!
$tags=preg_match_all("/\b(?:".implode('|',$condensed_targets).")\b/i",$input,$out)?$out[0]:null;
echo "Found tags: ";
var_export($tags);
Output:
/\b(?:apples?|pears?|strawberry|strawberries|grapes?)\b/i
Found tags: array ( 0 => 'apple', 1 => 'pear', 2 => 'strawberry', 3 => 'apples', 4 => 'pears', 5 => 'strawberries', )
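The question also asks for each match to appear only once in a comma-separated string. One way to get there from the preg_match_all() result is sketched below; it assumes that matches differing only in letter case should count as the same tag, and it uses a shorter sample input and pattern purely for illustration.

```php
<?php
// Shortened sample input with a repeated, mixed-case match.
$input = "Apples are great. I like apples and pears, but apples most.";

preg_match_all('/\b(?:apples?|pears?)\b/i', $input, $out);

// Lowercase first so "Apples" and "apples" collapse into one entry,
// then drop duplicates and glue the survivors together with commas.
$unique = array_unique(array_map('strtolower', $out[0]));
$tags = implode(',', $unique);

echo $tags; // apples,pears
```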
...if you've managed to read this far down my post, you've likely got a problem like the OP's and you want to move forward without regrets/mistakes. Please go to my related Code Review post for more information about fringe case considerations and method logic.
Upvotes: 1
Reputation: 8082
preg_match already saves the matches. So:
int preg_match ( string $pattern , string $subject [, array &$matches [, int $flags = 0 [, int $offset = 0 ]]] )
The third parameter already stores the matches, so change this:
if (preg_match("/\b" . $t . "\b/i", $a)) {
$b[] = $t;
}
To this:
// $b must already be an array (e.g. $b = []; before the loop) for array_merge() to work
$matches = [];
preg_match("/\b" . $t . "\b/i", $a, $matches);
$b = array_merge($b, $matches);
But if you are only checking whether the word is contained in the text, the documentation recommends using strpos().
Tip
Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() instead as it will be faster.
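One caveat worth checking before switching: strpos() and its case-insensitive sibling stripos() test for substrings, not whole words, so they do not behave like the \b pattern from the question. A small sketch of the difference, using an illustrative sentence and variable names:

```php
<?php
$text = "My brother's favorites are crabapples and grapes.";

// Substring test: "apple" is found inside "crabapples".
$substringHit = stripos($text, 'apple') !== false;

// Whole-word test: \b requires a word boundary on both sides, so no match here.
$wholeWordHit = preg_match('/\bapple\b/i', $text) === 1;

var_export([$substringHit, $wholeWordHit]); // true, false
```

If tags must match whole words only, the substring check would report false positives such as this one.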
EDIT
You could improve the performance of your code, if you still want to use preg_match, by replacing this:
$targets = array('apple', 'apples','pear','pears','strawberry','strawberries','grape','grapes');
foreach($targets as $t)
{
if (preg_match("/\b" . $t . "\b/i", $a)) {
$b[] = $t;
}
}
With this:
$targets = array('apple', 'apples','pear','pears','strawberry','strawberries','grape','grapes');
$targets = implode('|', $targets);
preg_match("/\b(" . $targets . ")\b/i", $a, $matches); // note: preg_match() stops at the first match; preg_match_all() collects every match
Here you are joining all your $targets with | (pipe), so your regex looks like this: (target1|target2|target3|targetN). That way you do only one search instead of the foreach loop.
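A sketch of the single-search idea that collects every match rather than only the first. It swaps in preg_match_all() and adds preg_quote(), on the assumption that keywords coming from a database might contain characters that are special in a regex; neither is part of the snippet above, and the $quoted variable is illustrative.

```php
<?php
$a = "This is my article and it's about apples and pears. I like strawberries as well though.";
$targets = array('apple', 'apples', 'pear', 'pears', 'strawberry', 'strawberries', 'grape', 'grapes');

// Escape each keyword so regex metacharacters and the "/" delimiter are treated literally.
$quoted = array_map(function ($t) {
    return preg_quote($t, '/');
}, $targets);

// One pattern, one call: $matches[0] holds every whole-word hit in order.
preg_match_all('/\b(?:' . implode('|', $quoted) . ')\b/i', $a, $matches);

var_export($matches[0]);
```

For the sample text this finds apples, pears, and strawberries in a single pass.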
Upvotes: 0