Reputation: 147
I have a small search engine doing its thing, and want to highlight the results. I thought I had it all worked out till a set of keywords I used today blew it out of the water.
The issue is that preg_replace() is looping through the replacements, and later replacements are replacing the text I inserted into previous ones. Confused? Here is my pseudo function:
public function highlightKeywords ($data, $keywords = array()) {
    $find = array();
    $replace = array();
    $begin = "<span class=\"keywordHighlight\">";
    $end = "</span>";
    foreach ($keywords as $kw) {
        $find[] = '/' . str_replace("/", "\/", $kw) . '/iu';
        $replace[] = $begin . "\$0" . $end;
    }
    return preg_replace($find, $replace, $data);
}
OK, so it works when searching for "fred" and "dagg", but when searching for "class", "lass" and "as" it mangles the markup while highlighting "Joseph's Class Group":
Joseph's <span class="keywordHighlight">Cl</span><span <span c<span <span class="keywordHighlight">cl</span>ass="keywordHighlight">lass</span>="keywordHighlight">c<span <span class="keywordHighlight">cl</span>ass="keywordHighlight">lass</span></span>="keywordHighlight">ass</span> Group
How would I get the later replacements to work only on the non-HTML text, while still tagging the whole match? For example, if I searched for "cla" and "lass", I would want "class" highlighted in full, since both search terms are in it even though they overlap. The markup inserted for the first match also contains the word "class", but that must not be highlighted.
Sigh.
I would rather use a PHP solution than a jQuery (or any client-side) one.
Note: I have tried to sort the keywords by length, doing the long ones first, but that means the cross-over searches do not highlight, meaning with "cla" and "lass" only part of the word "class" would highlight, and it still murdered the replacement tags :(
EDIT: I have messed about, starting with pencil & paper, and wild ramblings, and come up with some very unglamorous code to solve this issue. It's not great, so suggestions to trim/speed this up would still be greatly appreciated :)
public function highlightKeywords ($data, $keywords = array()) {
    $find = array();
    $replace = array();
    $begin = "<span class=\"keywordHighlight\">";
    $end = "</span>";
    $hits = array();
    foreach ($keywords as $kw) {
        $offset = 0;
        while (($pos = stripos($data, $kw, $offset)) !== false) {
            $hits[] = array($pos, $pos + strlen($kw));
            $offset = $pos + 1;
        }
    }
    if ($hits) {
        usort($hits, function($a, $b) {
            if ($a[0] == $b[0]) {
                return 0;
            }
            return ($a[0] < $b[0]) ? -1 : 1;
        });
        $thisthat = array(0 => $begin, 1 => $end);
        for ($i = 0; $i < count($hits); $i++) {
            foreach ($thisthat as $key => $val) {
                $pos = $hits[$i][$key];
                $data = substr($data, 0, $pos) . $val . substr($data, $pos);
                for ($j = 0; $j < count($hits); $j++) {
                    if ($hits[$j][0] >= $pos) {
                        $hits[$j][0] += strlen($val);
                    }
                    if ($hits[$j][1] >= $pos) {
                        $hits[$j][1] += strlen($val);
                    }
                }
            }
        }
    }
    return $data;
}
Upvotes: 4
Views: 1097
Reputation: 762
OP - something that's not clear in the question is whether $data can contain HTML from the get-go. Can you clarify this?
If $data can contain HTML itself, you are getting into the realm of trying to parse a non-regular language with a regular-language parser, and that's not going to work out well.
In that case, I would suggest loading the $data HTML into a PHP DOMDocument, getting hold of all of the text nodes, and running one of the other perfectly good answers on the contents of each text node in turn.
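To make the DOMDocument suggestion concrete, here is a minimal sketch. The function name `highlightHtmlText`, the wrapper `id` `hl-root` and the single-pass alternation regex are my own assumptions, not something from the question; the regex does not merge overlapping keywords, so you may prefer to run one of the offset-based functions from the other answers on each text node instead.

```php
<?php
// Hypothetical helper (not from the question): highlight keywords in text nodes only.
function highlightHtmlText($html, array $keywords, $class = 'keywordHighlight') {
    // Drop empty keywords; longest first so "class" is tried before "as".
    $keywords = array_filter($keywords, 'strlen');
    if (!$keywords) {
        return $html;
    }
    usort($keywords, function ($a, $b) { return strlen($b) - strlen($a); });
    $pattern = '/(' . implode('|', array_map(function ($kw) {
        return preg_quote($kw, '/');
    }, $keywords)) . ')/iu';

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    // The meta tag is an assumption-laden but common way to make loadHTML() read UTF-8.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<div id="hl-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="hl-root"]')->item(0);
    $query = './/text()[not(ancestor::script) and not(ancestor::style)]';

    // Copy the node list first, because the loop modifies the tree.
    foreach (iterator_to_array($xpath->query($query, $root)) as $textNode) {
        $parts = preg_split($pattern, $textNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue; // no keyword in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2) { // odd indexes hold the captured matches
                $span = $doc->createElement('span');
                $span->setAttribute('class', $class);
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Because only text nodes are touched, an existing attribute such as `class="x"` is never matched, even when "class" is one of the keywords.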
Upvotes: 0
Reputation: 7108
I had to revisit this subject myself today and wrote a better version of my other answer. I'll include it here. It's the same idea, only easier to read, and it should perform better since it collects substrings in an array and joins them once instead of concatenating repeatedly.
<?php
// Sort by absolute offset; on ties, start offsets (non-negative)
// come before end offsets (negative) so touching ranges merge
function highlight_range_sort($a, $b) {
    $A = abs($a);
    $B = abs($b);
    if ($A == $B) {
        if ($a == $b)
            return 0;
        return $a < $b ? 1 : -1;
    }
    return $A < $B ? -1 : 1;
}

function highlightKeywords($data, $keywords = array(),
        $prefix = '<span class="highlight">', $suffix = '</span>') {
    $datacopy = strtolower($data);
    $keywords = array_map('strtolower', $keywords);
    // this will contain offset ranges to be highlighted
    // non-negative offset indicates start
    // negative offset indicates end
    $ranges = array();
    // find start/end offsets for each keyword
    foreach ($keywords as $keyword) {
        $offset = 0;
        $length = strlen($keyword);
        if ($length == 0)
            continue; // an empty keyword would never advance $offset
        while (($pos = strpos($datacopy, $keyword, $offset)) !== false) {
            $ranges[] = $pos;
            $ranges[] = -($offset = $pos + $length);
        }
    }
    if (!count($ranges))
        return $data;
    // sort offsets by abs(), starts first on ties
    usort($ranges, 'highlight_range_sort');
    // combine overlapping ranges: keep a start only when no range is
    // open, and an end only when it closes the last open range
    // (this also covers keywords nested inside a longer keyword)
    $merged = array();
    $depth = 0;
    foreach ($ranges as $r) {
        if ($r >= 0) {
            if ($depth++ == 0)
                $merged[] = $r;
        } else if (--$depth == 0) {
            $merged[] = $r;
        }
    }
    $ranges = $merged;
    // create substrings
    $ranges[] = strlen($data);
    $substrings = array(substr($data, 0, $ranges[0]));
    for ($i = 0, $n = count($ranges) - 1; $i < $n; $i += 2) {
        // prefix + highlighted_text + suffix + regular_text
        $substrings[] = $prefix;
        $substrings[] = substr($data, $ranges[$i], -$ranges[$i + 1] - $ranges[$i]);
        $substrings[] = $suffix;
        $substrings[] = substr($data, -$ranges[$i + 1], $ranges[$i + 2] + $ranges[$i + 1]);
    }
    // join and return substrings
    return implode('', $substrings);
}

// Example usage:
echo highlightKeywords("This is a test.\n", array("is"), '(', ')');
echo highlightKeywords("Classes are as hard as they say.\n", array("as", "class"), '(', ')');
// Output:
// Th(is) (is) a test.
// (Class)es are (as) hard (as) they say.
Upvotes: 0
Reputation: 7108
I've used the following to address this problem:
<?php
$protected_matches = array();

// First pass: stash each match and leave a NUL-wrapped index in its place
function protect($matches) {
    global $protected_matches;
    return "\0" . array_push($protected_matches, $matches[0]) . "\0";
}

// Second pass: swap each placeholder for the highlighted original text
function restore($matches) {
    global $protected_matches;
    return '<span class="keywordHighlight">' .
        $protected_matches[$matches[1] - 1] . '</span>';
}

$result = preg_replace_callback('/\x00(\d+)\x00/', 'restore',
    preg_replace_callback($patterns, 'protect', $target_string));
The first preg_replace_callback pulls out all matches and replaces them with nul-byte-wrapped placeholders; the second pass replaces them with the span tags.
Edit: Forgot to mention that $patterns was sorted by string length, longest to shortest.
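For completeness, here is a self-contained sketch of the same two-pass idea wrapped in one function, with closures instead of a global. The name `highlightWithPlaceholders` is made up for illustration, and it assumes the keywords contain no NUL bytes and are not made up of digits only (a digits-only keyword could match inside a placeholder).

```php
<?php
// Hypothetical wrapper around the two-pass placeholder technique.
// Assumes keywords contain no NUL bytes and are not digits-only.
function highlightWithPlaceholders($text, array $keywords) {
    $keywords = array_filter($keywords, 'strlen');
    if (!$keywords) {
        return $text;
    }
    // Longest first, so "class" is protected before "lass" or "as" can match inside it.
    usort($keywords, function ($a, $b) { return strlen($b) - strlen($a); });
    $patterns = array_map(function ($kw) {
        return '/' . preg_quote($kw, '/') . '/iu';
    }, $keywords);

    $protected = array();
    // Pass 1: replace every match with "\0<index>\0" and remember the original text.
    $masked = preg_replace_callback($patterns, function ($m) use (&$protected) {
        return "\0" . array_push($protected, $m[0]) . "\0";
    }, $text);
    // Pass 2: turn each placeholder into the highlighted original text.
    return preg_replace_callback('/\x00(\d+)\x00/', function ($m) use ($protected) {
        return '<span class="keywordHighlight">' . $protected[$m[1] - 1] . '</span>';
    }, $masked);
}

echo highlightWithPlaceholders("Joseph's Class Group", array('as', 'lass', 'class'));
// Joseph's <span class="keywordHighlight">Class</span> Group
```

Note that this does not join overlapping keywords such as "cla" and "lass" into one highlight; whichever is protected first wins.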
Edit: another solution
<?php
function highlightKeywords($data, $keywords = array(),
        $prefix = '<span class="hilite">', $suffix = '</span>') {
    $datacopy = strtolower($data);
    $keywords = array_map('strtolower', $keywords);
    $start = array();
    $end = array();
    foreach ($keywords as $keyword) {
        $offset = 0;
        $length = strlen($keyword);
        if ($length == 0)
            continue; // an empty keyword would never advance $offset
        while (($pos = strpos($datacopy, $keyword, $offset)) !== false) {
            $start[] = $pos;
            $end[] = $offset = $pos + $length;
        }
    }
    if (!count($start)) return $data;
    sort($start);
    sort($end);
    // Merge and sort start/end using negative values to identify endpoints
    $zipper = array();
    $i = 0;
    $n = count($end);
    while ($i < $n)
        $zipper[] = count($start) && $start[0] <= $end[$i]
            ? array_shift($start)
            : -$end[$i++];
    // EXAMPLE:
    // [ 9, 10, -14, -14, 81, 82, 86, -86, -86, -90, 99, -103 ]
    // open at 9, nested 10, close -14, close -14 (depth 0), create pair,
    // open at 81, nested 82, nested 86, close -86, -86, -90 (depth 0), create pair
    // open at 99, close -103 (depth 0), create pair
    // result: [9,14], [81,90], [99,103]
    // Generate non-overlapping start/end pairs by tracking how many
    // matches are currently open (also handles a start offset of 0)
    $spans = array();
    $depth = 0;
    $a = 0;
    foreach ($zipper as $x) {
        if ($x >= 0) {
            if ($depth++ == 0)
                $a = $x;
        } else if (--$depth == 0) {
            $spans[] = array($a, -$x);
        }
    }
    // Insert the prefix/suffix in the start/end locations,
    // working backwards so earlier offsets stay valid
    $n = count($spans);
    while ($n--)
        $data = substr($data, 0, $spans[$n][0])
            . $prefix
            . substr($data, $spans[$n][0], $spans[$n][1] - $spans[$n][0])
            . $suffix
            . substr($data, $spans[$n][1]);
    return $data;
}
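The pair-generation step above boils down to merging overlapping intervals. If you collect matches as [start, end] pairs instead of signed offsets, a standalone merge could look like the sketch below. `mergeSpans` is a hypothetical name, and treating touching spans as one highlight is my assumption.

```php
<?php
// Hypothetical helper: merge overlapping or touching [start, end] spans.
function mergeSpans(array $spans) {
    // Sort by start offset, then by end offset.
    usort($spans, function ($a, $b) {
        return $a[0] != $b[0] ? $a[0] - $b[0] : $a[1] - $b[1];
    });
    $merged = array();
    foreach ($spans as $span) {
        $last = count($merged) - 1;
        if ($last >= 0 && $span[0] <= $merged[$last][1]) {
            // Overlaps or touches the previous span: extend it if needed.
            $merged[$last][1] = max($merged[$last][1], $span[1]);
        } else {
            $merged[] = $span;
        }
    }
    return $merged;
}

// "abcde" searched for "b", "abcde" and "d": the short matches sit inside the long one.
echo json_encode(mergeSpans(array(array(1, 2), array(0, 5), array(3, 4))));
// [[0,5]]
```

The nested case in the usage line is worth keeping as a test for any of the functions in this thread, since short keywords inside a longer match are easy to get wrong.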
Upvotes: 0