Complex RegEx replacement

Question

I'm trying to write a highlighting feature. There are two types of highlighting: positive and negative. Positive is done first. Highlighting itself is very simple: it just wraps a keyword/phrase in a span with a specific class, which depends on the type of highlighting.

Problem:
Sometimes a negative phrase can contain a positive one.

Example:
Original text:

some data from blahblah test was not statistically valid

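After the text passes through the positive highlighting "filter", it ends up like this:

some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically valid</span>

or

some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically</span> <span class="positive">valid</span>

Then, in the negative list, we have the phrase not statistically valid.

In both cases, the resulting text after passing through both "filters" should look like:

some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span>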

Conditions:
- The number of span tags, and their location within a keyword/phrase from the negative "filter" list, is unknown.
- A keyword/phrase must be matched even if it includes span tags (including right before and right after the keyword/phrase). These span tags have to be removed.
- If any span tags are detected, the number of opening and closing span tags removed has to be equal.

Questions:
- How to detect these span tags if there are any?
- Is this even possible with RegEx alone?

revo · Accepted Answer

I don't think this can be done with a single regular expression, and if it can, honestly I'm too lazy to rack my brain building it.

In what way do you think?

I came up with a solution that takes 4 steps to achieve what you want:

1. Extract all words from the HTML and store them with their corresponding positions.
2. Explode each value of the negative list into words.
3. Match each word consecutively against the in-order words from (1) and store the matches.
4. Replace the matches just found with their new HTML wrapper (...), using their positions.

I have also made a detailed flowchart (I'm not good at flowcharts, sorry) to make things easier to follow. It may help to look at the code first.

Here is what we have:

$HTML = <<< HTML
some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically <span class="positive">valid</span></span>
HTML;

$listOfNegatives = ['not statistically valid'];

To extract words (actual text words, not tag names or attribute values) I used a RegEx that covers what we need at this step:

~\b(?<![</])\w+\b(?![^<>]+>)~

To get the position of each word as well, pass the PREG_OFFSET_CAPTURE flag to preg_match_all().
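To make the flag's effect concrete, here is a minimal sketch. The sample string and the deliberately naive `\w+` pattern are assumptions for illustration only, not part of the solution. With PREG_OFFSET_CAPTURE each match comes back as a `[text, byteOffset]` pair.

```php
<?php
// Hypothetical sample input, not the HTML used in this answer.
$text = 'was not <b>valid</b>';

// With PREG_OFFSET_CAPTURE every match is a [matchedText, byteOffset] pair.
preg_match_all('~\w+~', $text, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$word, $offset]) {
    echo "{$offset}: {$word}\n";
}
```

Note that the naive pattern also picks up the tag name b twice, which is why the extraction step needs a stricter pattern that skips anything inside < and >.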

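/**
 * Extract all words and their corresponding positions
 * @param  [string] $HTML
 * @return [array]  $HTMLWords  word text keyed by its byte offset in $HTML
 */
function extractWords($HTML) {
    $HTMLWords = [];
    // Match words that are neither a tag name nor part of a tag's attributes
    preg_match_all('~\b(?<![</])\w+\b(?![^<>]+>)~', $HTML, $words, PREG_OFFSET_CAPTURE);
    foreach ($words[0] as $word) {
        // $word[0] is the matched text, $word[1] is its offset
        $HTMLWords[$word[1]] = $word[0];
    }
    return $HTMLWords;
}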

This function's output is something like this:

Array
(
    [0] => some
    [5] => data
    [10] => from
    [38] => blahblah
    [47] => test
    [59] => was
    [63] => not
    [90] => statistically
    [127] => valid
)

What we should do here is match each word of a list value, consecutively, against the words we just extracted. Our first list value, not statistically valid, has three words: not, statistically and valid. These words should appear one after another in the extracted words array (which they do).

To handle this I wrote a function:

/**
 * Check if any of our defined list values can be found in an ordered array of extracted words
 * @param  [array] $HTMLWords
 * @param  [array] $listOfNegatives
 * @return [array] $subStrings
 */
function checkNegativesExistence($HTMLWords, $listOfNegatives) {
    $counter = 0;
    $previousWordOffset = null;
    $subStrings = [];

    foreach ($listOfNegatives as $i => $string) {
        $stringWords = explode(" ", $string);
        $wordIndex = 0;
        foreach ($HTMLWords as $offset => $HTMLWord) {
            if ($wordIndex > count($stringWords) - 1) {
                $wordIndex = 0;
                $counter++;
            }

            if ($stringWords[$wordIndex] == $HTMLWord) {
                $subStrings[$counter][] = [$HTMLWord, $offset, $previousWordOffset];
                $wordIndex++;
            } elseif (isset($subStrings[$counter]) && count($subStrings[$counter]) > 0) {
                unset($subStrings[$counter]);
                $wordIndex = 0;
            }
            $previousWordOffset = $offset + strlen($HTMLWord);
        }
        $counter++;
    }

    return $subStrings;
}

Which has an output like below:

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => not
                    [1] => 63
                    [2] => 62
                )

            [1] => Array
                (
                    [0] => statistically
                    [1] => 90
                    [2] => 66
                )

            [2] => Array
                (
                    [0] => valid
                    [1] => 127
                    [2] => 103
                )
        )
)

As you can see, we have the complete phrase split into words along with their offsets (there are two offsets per word: the first is the word's own offset, the second is the end offset of the previous word). We need them later.

The next step is to replace this occurrence, from offset 62 to 127 + strlen(valid), with not statistically valid inside its negative wrapper span, and forget about everything else.
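The arithmetic behind that range can be sketched on its own. The input string below is an assumption chosen to be consistent with the word offsets listed above, and the class name in the wrapper is hypothetical.

```php
<?php
// Assumed input: a string consistent with the word offsets listed above.
$html = 'some data from <span class="positive">blahblah test</span> was not '
      . '<span class="positive">statistically <span class="positive">valid</span></span>';

// [word, offset, end offset of the previous word], as in the array above.
$match = [
    ['not', 63, 62],
    ['statistically', 90, 66],
    ['valid', 127, 103],
];

$first  = $match[0];
$last   = $match[count($match) - 1];
$start  = $first[2];                             // 62
$length = $last[1] + strlen($last[0]) - $start;  // 127 + 5 - 62 = 70

// Hypothetical wrapper markup for the negative highlight.
$wrapped = '<span class="negative">not statistically valid</span>';
$result  = substr_replace($html, $wrapped, $start, $length);

echo $result, "\n";
```

Because the range starts at the end of the previous word, any opening tags right before the phrase are swallowed too, along with the space at offset 62. The closing tags after valid fall outside the range and are left behind.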

/**
 * Substitute newly matched strings with negative HTML wrapper
 * @param  [array]  $subStrings
 * @param  [string] $HTML
 * @return [string] $HTML
 */
function negativeHighlight($subStrings, $HTML) {
    $offset = 0;
    $HTMLLength = strlen($HTML);

    foreach ($subStrings as $key => $value) {
        $arrayOfWords = [];
        foreach ($value as $word) {
            $arrayOfWords[] = $word[0];
            if (current($value) == $value[0]) {
                $start = substr($HTML, $word[1], strlen($word[0])) == $word[0] ? $word[2] : $word[2] + $offset;
            }
            if (current($value) == end($value)) {
                $defaultLength = $word[1] + strlen($word[0]) - $start;
                $length = substr($HTML, $word[1], strlen($word[0])) === $word[0] ? $defaultLength : $defaultLength + $offset;
            }
        }
        $string = implode(" ", $arrayOfWords);
        // Wrap the matched phrase in the negative highlight span
        $HTML = substr_replace($HTML, "<span class=\"negative\">{$string}</span>", $start, $length);

        if ($HTMLLength > strlen($HTML)) {
            $offset = -($HTMLLength - strlen($HTML));
        } elseif ($HTMLLength < strlen($HTML)) {
            $offset = strlen($HTML) - $HTMLLength;
        }
    }

    return $HTML;
}

An important thing to note here is that the first substitution may shift the offsets of other extracted values (we don't have any here). So the change in HTML length has to be calculated:

if ($HTMLLength > strlen($HTML)) {
    $offset = -($HTMLLength - strlen($HTML));
} elseif ($HTMLLength < strlen($HTML)) {
    $offset = strlen($HTML) - $HTMLLength;
}

and then we should check how this change of length affected our offsets:

- The word's offset is intact (characters were added/removed after this word)
- The word's offset has changed (characters were added/removed before this word)

This check is done by the following block (we only need to check the first and last words):

if (current($value) == $value[0]) {
    $start = substr($HTML, $word[1], strlen($word[0])) == $word[0] ? $word[2] : $word[2] + $offset;
}
if (current($value) == end($value)) {
    $defaultLength = $word[1] + strlen($word[0]) - $start;
    $length = substr($HTML, $word[1], strlen($word[0])) === $word[0] ? $defaultLength : $defaultLength + $offset;
}
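As an aside, a common way to avoid offset bookkeeping altogether is to apply the replacements from the end of the string towards the start, so that no pending offset is ever shifted. This is an alternative technique, not what the function above does. The helper name and the sample data are hypothetical, and the sketch assumes the ranges do not overlap.

```php
<?php
/**
 * Apply [start, length, replacement] edits from the highest start offset
 * to the lowest, so earlier offsets are never shifted by a later edit.
 * Hypothetical helper; assumes the ranges do not overlap.
 */
function applyEditsBackwards(string $text, array $edits): string
{
    // Sort by start offset, descending.
    usort($edits, function (array $a, array $b): int {
        return $b[0] <=> $a[0];
    });

    foreach ($edits as [$start, $length, $replacement]) {
        $text = substr_replace($text, $replacement, $start, $length);
    }

    return $text;
}

$edits = [
    [0, 3, '<i>aaa</i>'],
    [8, 3, '<i>ccc</i>'],
];
echo applyEditsBackwards('aaa bbb ccc', $edits), "\n";
```

Both edits use offsets measured against the original string, and neither needs adjusting.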

Doing all together:

$newHTML = negativeHighlight(checkNegativesExistence(extractWords($HTML), $listOfNegatives), $HTML);

Output:

some data from blahblah test was not statistically valid

But there are problems with our last output: unmatched tags.

I'm sorry, I lied when I said this takes 4 steps; there is one more. Here I wrote another RegEx that matches all properly nested tags as well as those left over by mistake:

~(]+>([^<]*+<(?!/)(?:([a-zA-Z0-9]++)[^>]*>[^<]*|(?2)))*[^<]*|(?'single']+>|<[^>]+>))~

With preg_replace_callback() I replace only the tags in the group named single with nothing:

echo preg_replace_callback("~(]+>([^<]*+<(?!/)(?:([a-zA-Z0-9]++)[^>]*>[^<]*|(?2)))*[^<]*|(?'single']+>|<[^>]+>))~",
    function ($match) {
        if (isset($match['single'])) {
            return null;
        }
        return $match[1];
    },
    $newHTML
);

and we have the right output:

some data from blahblah test was not statistically valid
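For comparison, the clean-up of stray closing tags can also be sketched with a simple depth counter instead of a recursive pattern. This is a simplified alternative under stated assumptions: the helper name is hypothetical, it only looks at span tags, and it does not remove opening tags that are never closed.

```php
<?php
/**
 * Remove </span> tags that have no matching opening <span ...>.
 * Hypothetical helper; only handles span tags.
 */
function dropUnmatchedClosingSpans(string $html): string
{
    $depth = 0; // number of currently open span tags

    return preg_replace_callback(
        '~<span\b[^>]*>|</span>~i',
        function (array $m) use (&$depth): string {
            if ($m[0][1] !== '/') { // opening tag
                $depth++;
                return $m[0];
            }
            if ($depth > 0) {       // closing tag with an open partner
                $depth--;
                return $m[0];
            }
            return '';              // stray closing tag: drop it
        },
        $html
    );
}

echo dropUnmatchedClosingSpans('a <span class="x">b</span></span> c</span>'), "\n";
```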

Failing cases

My solution does not output the right HTML in the situations below:

1- If a word like is between other words:

blahblah test  not

Why?

Because my last RegEx spots it as an unmatched tag, so it will remove it.

2- If a word like not (which is part of a negative list's value in our list) is enclosed in angle brackets, i.e. <not>. This outputs:

some data from blahblah test was  statistically valid

Why?

Because my first RegEx only treats as words those tokens that are not enclosed in the tag delimiters < and >.

3- If the list has values where one is a substring of another:

$listOfNegatives = ['not statistically valid', 'not statistically'];

Why?

Because they overlap.
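One possible way to deal with this case, which the code above does not attempt, is to filter the matched ranges before replacing and keep only the longest of any overlapping group. The helper below is a hypothetical sketch; ranges are [start, end) byte offsets.

```php
<?php
/**
 * Keep the longest ranges and drop any range overlapping one already kept.
 * Each range is [start, end) in byte offsets. Hypothetical helper.
 */
function dropOverlappingRanges(array $ranges): array
{
    // Longest first, so 'not statistically valid' wins over 'not statistically'.
    usort($ranges, function (array $a, array $b): int {
        return ($b[1] - $b[0]) <=> ($a[1] - $a[0]);
    });

    $kept = [];
    foreach ($ranges as [$start, $end]) {
        $overlaps = false;
        foreach ($kept as [$keptStart, $keptEnd]) {
            if ($start < $keptEnd && $keptStart < $end) {
                $overlaps = true;
                break;
            }
        }
        if (!$overlaps) {
            $kept[] = [$start, $end];
        }
    }

    // Return the survivors in document order.
    usort($kept, function (array $a, array $b): int {
        return $a[0] <=> $b[0];
    });

    return $kept;
}

print_r(dropOverlappingRanges([[62, 103], [62, 132], [0, 4]]));
```

Here the shorter range starting at 62 is discarded because the longer one covers it, while the unrelated range at 0 is kept.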

Working demo
