CrazyChris

Reputation: 147

keyword highlight is highlighting the highlights in PHP preg_replace()

I have a small search engine doing its thing, and want to highlight the results. I thought I had it all worked out till a set of keywords I used today blew it out of the water.

The issue is that preg_replace() is looping through the replacements, and later replacements are replacing the text I inserted into previous ones. Confused? Here is my pseudo function:

public function highlightKeywords ($data, $keywords = array()) {
    $find = array();
    $replace = array();
    $begin = "<span class=\"keywordHighlight\">";
    $end = "</span>";
    foreach ($keywords as $kw) {
        // preg_quote() escapes every regex metacharacter, not just the delimiter
        $find[] = '/' . preg_quote($kw, '/') . '/iu';
        $replace[] = $begin . "\$0" . $end;
    }
    return preg_replace($find, $replace, $data);
}

OK, so it works when searching for "fred" and "dagg", but sadly, when searching for "class", "lass" and "as", it falls apart when highlighting "Joseph's Class Group":

Joseph's <span class="keywordHighlight">Cl</span><span <span c<span <span class="keywordHighlight">cl</span>ass="keywordHighlight">lass</span>="keywordHighlight">c<span <span class="keywordHighlight">cl</span>ass="keywordHighlight">lass</span></span>="keywordHighlight">ass</span> Group

How would I get the later replacements to work only on the non-HTML parts, while still tagging the whole match? For example, if I searched for "cla" and "lass", I would want "class" highlighted in full, since both search terms are in it even though they overlap. The markup inserted for the first match also contains "class", but that should not be highlighted.

Sigh.

I would rather use a PHP solution than a jQuery (or any client-side) one.

Note: I have tried to sort the keywords by length, doing the long ones first, but that means the cross-over searches do not highlight, meaning with "cla" and "lass" only part of the word "class" would highlight, and it still murdered the replacement tags :(

EDIT: I have messed about, starting with pencil & paper, and wild ramblings, and come up with some very unglamorous code to solve this issue. It's not great, so suggestions to trim/speed this up would still be greatly appreciated :)

public function highlightKeywords ($data, $keywords = array()) {
    $begin = "<span class=\"keywordHighlight\">";
    $end = "</span>";
    $hits = array();
    foreach ($keywords as $kw) {
        $offset = 0;
        while (($pos = stripos($data, $kw, $offset)) !== false) {
            $hits[] = array($pos, $pos + strlen($kw));
            $offset = $pos + 1;
        }
    }
    if ($hits) {
        usort($hits, function($a, $b) {
            if ($a[0] == $b[0]) {
                return 0;
            }
            return ($a[0] < $b[0]) ? -1 : 1;
        });
        $thisthat = array(0 => $begin, 1 => $end);
        for ($i = 0; $i < count($hits); $i++) {
            foreach ($thisthat as $key => $val) {
                $pos = $hits[$i][$key];
                $data = substr($data, 0, $pos) . $val . substr($data, $pos);
                for ($j = 0; $j < count($hits); $j++) {
                    if ($hits[$j][0] >= $pos) {
                        $hits[$j][0] += strlen($val);
                    }
                    if ($hits[$j][1] >= $pos) {
                        $hits[$j][1] += strlen($val);
                    }
                }
            }
        }
    }
    return $data;
}

Upvotes: 4

Views: 1097

Answers (3)

sanbikinoraion

Reputation: 762

OP - something that's not clear in the question is whether $data can contain HTML from the get-go. Can you clarify this?

If $data can contain HTML itself, you are getting into the realm of attempting to parse a non-regular language with a regular-language parser, and that's not going to work out well.

In such a case, I would suggest loading the $data HTML into a PHP DOMDocument, getting hold of all of the textNodes and running one of the other perfectly good answers on the contents of each text block in turn.
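A minimal sketch of that idea, assuming the input is a UTF-8 HTML fragment and that the highlighter accepts ($text, $prefix, $suffix) like the functions in the other answers. The names highlightHtmlTextNodes() and $markClass are made up for illustration, and the control-character markers are just one way to keep the inserted tags apart from the escaped text:

```php
<?php
// Hypothetical helper: runs a plain-text highlighter over text nodes only,
// so tag names and attribute values are never touched.
function highlightHtmlTextNodes(string $html, callable $highlighter,
        string $class = 'keywordHighlight'): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    // The XML declaration hints at UTF-8; the wrapper <div> lets us
    // serialize only our own fragment afterwards.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);
    $xpath = new DOMXPath($doc);
    // Copy the node list first, because nodes are replaced inside the loop.
    $textNodes = iterator_to_array($xpath->query(
        './/text()[not(ancestor::script) and not(ancestor::style)]', $root));

    foreach ($textNodes as $node) {
        // Mark matches with control characters (assumed absent from the text).
        $marked = $highlighter($node->nodeValue, "\x01", "\x02");
        if ($marked === $node->nodeValue) {
            continue; // nothing matched in this text node
        }
        // Escape the text, then turn the markers into real tags.
        $xml = str_replace(
            array("\x01", "\x02"),
            array('<span class="' . $class . '">', '</span>'),
            htmlspecialchars($marked, ENT_NOQUOTES | ENT_XML1, 'UTF-8')
        );
        $fragment = $doc->createDocumentFragment();
        $fragment->appendXML($xml);
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Stand-in highlighter for the demo: marks every case-insensitive "class".
$markClass = function (string $text, string $prefix, string $suffix): string {
    return preg_replace('/class/i', $prefix . '$0' . $suffix, $text);
};

echo highlightHtmlTextNodes('<p class="intro">Class notes</p>', $markClass);
```

With this input the class attribute on the paragraph should be left alone, while the word "Class" in the text gets wrapped in a span.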

Upvotes: 0

Steve

Reputation: 7108

I had to revisit this subject myself today and wrote a better version of the above. I'll include it here. It's the same idea, only easier to read, and it should perform better since it collects the pieces in an array and joins them once instead of concatenating repeatedly.

<?php

function highlight_range_sort($a, $b) {
    $A = abs($a);
    $B = abs($b);
    if ($A == $B)
        // same offset: a start (positive) sorts before an end (negative)
        return $a < $b ? 1 : ($a > $b ? -1 : 0);
    else
        return $A < $B ? -1 : 1;
}

function highlightKeywords($data, $keywords = array(),
       $prefix = '<span class="highlight">', $suffix = '</span>') {

        $datacopy = strtolower($data);
        $keywords = array_map('strtolower', $keywords);
        // this will contain offset ranges to be highlighted
        // positive offset indicates start
        // negative offset indicates end
        $ranges = array();

        // find start/end offsets for each keyword
        foreach ($keywords as $keyword) {
            $offset = 0;
            $length = strlen($keyword);
            while (($pos = strpos($datacopy, $keyword, $offset)) !== false) {
                $ranges[] = $pos;
                $ranges[] = -($offset = $pos + $length);
            }
        }

        if (!count($ranges))
            return $data;

        // sort offsets by abs(); on ties, positive (start) before negative (end)
        usort($ranges, 'highlight_range_sort');

        // combine overlapping ranges by keeping lesser
        // positive and negative numbers
        $i = 0;
        while ($i < count($ranges) - 1) {
            if ($ranges[$i] < 0) {
                if ($ranges[$i + 1] < 0)
                    array_splice($ranges, $i, 1);
                else
                    $i++;
            } else if ($ranges[$i + 1] < 0)
                $i++;
            else
                array_splice($ranges, $i + 1, 1);
        }

        // create substrings
        $ranges[] = strlen($data);
        $substrings = array(substr($data, 0, $ranges[0]));
        for ($i = 0, $n = count($ranges) - 1; $i < $n; $i += 2) {
            // prefix + highlighted_text + suffix + regular_text
            $substrings[] = $prefix;
            $substrings[] = substr($data, $ranges[$i], -$ranges[$i + 1] - $ranges[$i]);
            $substrings[] = $suffix;
            $substrings[] = substr($data, -$ranges[$i + 1], $ranges[$i + 2] + $ranges[$i + 1]);
        }

        // join and return substrings
        return implode('', $substrings);
}

// Example usage:
echo highlightKeywords("This is a test.\n", array("is"), '(', ')');
echo highlightKeywords("Classes are as hard as they say.\n", array("as", "class"), '(', ')');
// Output:
// Th(is) (is) a test.
// (Class)es are (as) hard (as) they say.

Upvotes: 0

Steve

Reputation: 7108

I've used the following to address this problem:

<?php

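// $patterns (array of regex strings) and $target_string (the text to
// highlight) are assumed to be defined by the caller.
$protected_matches = array();
function protect($matches) {
    global $protected_matches;
    // array_push() returns the new element count, used here as a 1-based id
    return "\0" . array_push($protected_matches, $matches[0]) . "\0";
}
function restore($matches) {
    global $protected_matches;
    return '<span class="keywordHighlight">' .
              $protected_matches[$matches[1] - 1] . '</span>';
}

$highlighted = preg_replace_callback('/\x00(\d+)\x00/', 'restore',
    preg_replace_callback($patterns, 'protect', $target_string));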

The first preg_replace_callback pulls out all matches and replaces them with nul-byte-wrapped placeholders; the second pass replaces them with the span tags.

Edit: Forgot to mention that $patterns was sorted by string length, longest to shortest.
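For reference, a self-contained sketch of the same two-pass idea with the length sort included. The function name highlightWithPlaceholders() is hypothetical, closures stand in for the global array, and keywords are assumed to be literal text (valid UTF-8) rather than ready-made patterns:

```php
<?php
// Hypothetical wrapper around the protect/restore technique.
function highlightWithPlaceholders(string $text, array $keywords,
        string $prefix = '<span class="keywordHighlight">',
        string $suffix = '</span>'): string
{
    // Longest keywords first, so "class" is consumed before "lass" or "as".
    usort($keywords, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    $protected = array();
    foreach ($keywords as $kw) {
        if ($kw === '') {
            continue; // an empty pattern would match at every position
        }
        $pattern = '/' . preg_quote($kw, '/') . '/iu';
        // Pass 1: swap each match for a NUL-wrapped index into $protected.
        // Caveat: a keyword made only of digits could match inside an
        // earlier placeholder; this sketch does not guard against that.
        $text = preg_replace_callback($pattern,
            function ($m) use (&$protected) {
                $protected[] = $m[0];
                return "\0" . (count($protected) - 1) . "\0";
            }, $text);
    }

    // Pass 2: replace every placeholder with the wrapped original match.
    return preg_replace_callback('/\x00(\d+)\x00/',
        function ($m) use ($protected, $prefix, $suffix) {
            return $prefix . $protected[(int) $m[1]] . $suffix;
        }, $text);
}

echo highlightWithPlaceholders("Joseph's Class Group", array('as', 'class', 'lass'));
// Joseph's <span class="keywordHighlight">Class</span> Group
```

Because a matched stretch is hidden from later patterns, overlapping keywords such as "cla" and "lass" are not merged here: only the longer one is highlighted.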

Edit: another solution

<?php
    function highlightKeywords($data, $keywords = array(),
        $prefix = '<span class="hilite">', $suffix = '</span>') {

        $datacopy = strtolower($data);
        $keywords = array_map('strtolower', $keywords);
        $start = array();
        $end   = array();

        foreach ($keywords as $keyword) {
            $offset = 0;
            $length = strlen($keyword);
            while (($pos = strpos($datacopy, $keyword, $offset)) !== false) {
                $start[] = $pos;
                $end[]   = $offset = $pos + $length;
            }
        }

        if (!count($start)) return $data;

        sort($start);
        sort($end);

        // Merge and sort start/end using negative values to identify endpoints
        $zipper = array();
        $i = 0;
        $n = count($end);

        while ($i < $n)
            $zipper[] = count($start) && $start[0] <= $end[$i]
                ? array_shift($start)
                : -$end[$i++];

        // EXAMPLE:
        // [ 9, 10, -14, -14, 81, 82, 86, -86, -86, -90, 99, -103 ]
        // take 9, discard 10, take -14, take -14, create pair,
        // take 81, discard 82, discard 86, take -86, take -86, take -90, create pair
        // take 99, take -103, create pair
        // result: [9,14], [81,90], [99,103]

        // Generate non-overlapping start/end pairs
        $spans = array();
        $a = array_shift($zipper);
        $z = null;
        // compare against null explicitly: a repeated start offset of 0 is falsy
        while (($x = array_shift($zipper)) !== null) {
            if ($x < 0)
                $z = $x;
            else if ($z !== null) {
                $spans[] = array($a, -$z);
                $a = $x;
                $z = null;
            }
        }
        $spans[] = array($a, -$z);

        // Insert the prefix/suffix in the start/end locations
        $n = count($spans);
        while ($n--)
            $data = substr($data, 0, $spans[$n][0])
            . $prefix
            . substr($data, $spans[$n][0], $spans[$n][1] - $spans[$n][0])
            . $suffix
            . substr($data, $spans[$n][1]);

        return $data;
    }
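The pair-building loops in these answers close a region as soon as a start follows an end. As far as I can tell, a keyword that bridges two other matches (say "ab", "de" and "bcd" in "abcde") can then leave the region split in two. A possible variation, sketched here under the same byte-offset assumptions, counts how many ranges are open and only closes the highlight when the count drops back to zero. The name highlightMergedRanges() is made up for illustration:

```php
<?php
// Hypothetical name; a sweep over start/end events with a depth counter.
function highlightMergedRanges(string $data, array $keywords,
        string $prefix = '<span class="keywordHighlight">',
        string $suffix = '</span>'): string
{
    $events = array(); // each event: array(offset, +1 for start / -1 for end)
    foreach ($keywords as $kw) {
        if ($kw === '') {
            continue; // an empty needle would match everywhere
        }
        $offset = 0;
        while (($pos = stripos($data, $kw, $offset)) !== false) {
            $events[] = array($pos, 1);
            $events[] = array($pos + strlen($kw), -1);
            $offset = $pos + 1; // also catch overlapping repeats of one keyword
        }
    }

    // Sort by offset; on ties put starts first so touching ranges merge.
    usort($events, function ($a, $b) {
        if ($a[0] !== $b[0]) {
            return $a[0] < $b[0] ? -1 : 1;
        }
        return $b[1] - $a[1];
    });

    $out = '';
    $cursor = 0; // end of the text already copied into $out
    $depth = 0;  // number of keyword ranges currently open
    foreach ($events as list($pos, $delta)) {
        if ($delta === 1 && $depth === 0) {
            // entering a highlighted region: flush plain text, open the tag
            $out .= substr($data, $cursor, $pos - $cursor) . $prefix;
            $cursor = $pos;
        }
        $depth += $delta;
        if ($delta === -1 && $depth === 0) {
            // last open range just ended: flush matched text, close the tag
            $out .= substr($data, $cursor, $pos - $cursor) . $suffix;
            $cursor = $pos;
        }
    }
    return $out . substr($data, $cursor);
}

echo highlightMergedRanges("class\n", array('cla', 'lass'), '(', ')');
// (class)
echo highlightMergedRanges("abcde\n", array('ab', 'de', 'bcd'), '(', ')');
// (abcde)
```

Building the output in one left-to-right pass also avoids having to adjust offsets after each insertion.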

Upvotes: 0
