Reputation: 4244
I'm trying to write highlighting functionality. There are two types of highlighting: positive and negative. Positive is done first. The highlighting itself is very simple: it just wraps a keyword/phrase in a span with a specific class, which depends on the type of highlighting.
Problem:
Sometimes a negative phrase can contain a positive one.
Example:
Original text:
some data from blahblah test was not statistically valid
After the text passes through the positive highlighting "filter", it ends up like this:
some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically valid</span>
or
some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically <span class="positive">valid</span></span>
Then, in the negative list, we have the phrase not statistically valid.
In both cases, the resulting text after passing through both "filters" should look like:
some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span>
Conditions:
- The number of span tags and their location within a keyword/phrase from the negative "filter" list are unknown.
- The keyword/phrase must be matched even if it includes span tags (including right before and right after the keyword/phrase). These span tags have to be removed.
- If any span tags are detected, the number of opening and closing span tags removed has to be equal.
Questions:
- How can these span tags be detected, if there are any?
- Is this even possible with RegEx alone?
Upvotes: 2
Views: 153
Reputation: 48711
I don't think it can be done with a single Regular Expression, and if it is possible, then honestly I'm too lazy to rack my brain building it.
I came up with a solution that takes 4 steps to achieve what you want; one of them replaces the matched words (with <span class="negative">...</span>) by their positions.
I have, however, made a detailed flowchart (I'm not good at flowcharts, sorry) to help you understand things better. It may help to look at the code first.
Here is what we have:
$HTML = <<< HTML
some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically <span class="positive">valid</span></span>
HTML;
$listOfNegatives = ['not statistically valid'];
To extract words (real words) I used a RegEx which will fulfill our needs at this step:
~\b(?<![</])\w+\b(?![^<>]+>)~
To get the position of each word too, a flag should be used with preg_match_all(): PREG_OFFSET_CAPTURE
/**
 * Extract all words and their corresponding positions
 * @param [string] $HTML
 * @return [array] $HTMLWords
 */
function extractWords($HTML) {
    $HTMLWords = [];
    preg_match_all("~\b(?<![</])\w+\b(?![^<>]+>)~", $HTML, $words, PREG_OFFSET_CAPTURE);
    // Key each word by its byte offset in $HTML
    foreach ($words[0] as $word) {
        $HTMLWords[$word[1]] = $word[0];
    }
    return $HTMLWords;
}
This function's output is something like this:
Array
(
[0] => some
[5] => data
[10] => from
[38] => blahblah
[47] => test
[59] => was
[63] => not
[90] => statistically
[127] => valid
)
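To see which parts of the markup the pattern skips, here is a minimal sketch on a shorter, made-up input (the $sample string and the $found variable are illustrative assumptions, not part of the original code). The lookbehind rejects anything right after < or /, and the lookahead rejects anything that is still inside a tag:

```php
<?php
// Same pattern as in extractWords(), written in single quotes so the
// backslashes reach PCRE untouched.
$pattern = '~\b(?<![</])\w+\b(?![^<>]+>)~';

// Hypothetical sample input, not taken from the question.
$sample = 'was <span class="positive">valid</span>';

preg_match_all($pattern, $sample, $m, PREG_OFFSET_CAPTURE);

// Build the same offset => word map that extractWords() returns.
$found = [];
foreach ($m[0] as [$word, $offset]) {
    $found[$offset] = $word;
}
// Only the text words survive: "span", "class" and "positive" are rejected
// by the lookbehind and lookahead.
```

With this input, $found should contain just was (offset 0) and valid (offset 27).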
What we should do here is match the words of each list value, consecutively, against the words we just extracted. Our first list value, not statistically valid, has three words: not, statistically and valid. These words should appear consecutively in the extracted words array (which they do).
To handle this I wrote a function:
/**
 * Check if any of our defined list values can be found in an ordered array of extracted words
 * @param [array] $HTMLWords
 * @param [array] $listOfNegatives
 * @return [array] $subStrings
 */
function checkNegativesExistence($HTMLWords, $listOfNegatives) {
    $counter = 0;
    $previousWordOffset = null;
    $subStrings = [];
    foreach ($listOfNegatives as $i => $string) {
        $stringWords = explode(" ", $string);
        $wordIndex = 0;
        foreach ($HTMLWords as $offset => $HTMLWord) {
            // A full phrase was matched: start collecting the next occurrence
            if ($wordIndex > count($stringWords) - 1) {
                $wordIndex = 0;
                $counter++;
            }
            if ($stringWords[$wordIndex] == $HTMLWord) {
                $subStrings[$counter][] = [$HTMLWord, $offset, $previousWordOffset];
                $wordIndex++;
            } elseif (isset($subStrings[$counter]) && count($subStrings[$counter]) > 0) {
                // The run of consecutive words broke: discard the partial match
                unset($subStrings[$counter]);
                $wordIndex = 0;
            }
            // Remember where this word ends, for the next iteration
            $previousWordOffset = $offset + strlen($HTMLWord);
        }
        $counter++;
    }
    return $subStrings;
}
Its output looks like this:
Array
(
[0] => Array
(
[0] => Array
(
[0] => not
[1] => 63
[2] => 62
)
[1] => Array
(
[0] => statistically
[1] => 90
[2] => 66
)
[2] => Array
(
[0] => valid
[1] => 127
[2] => 103
)
)
)
As you can see, we have the complete string split into words along with their offsets (there are two offsets: the first is the word's real offset, the second is the offset where the previous word ends). We need them later.
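For comparison, the same "consecutive words" idea can be sketched as a sliding window over the extracted words. This is a simplified alternative, not the function above: findPhrase() and its 'start'/'end' keys are hypothetical names, and it does not record the previous word's offset.

```php
<?php
/**
 * Hypothetical helper: find every run of consecutive extracted words that
 * equals the words of $phrase. $HTMLWords is an offset => word map like the
 * one extractWords() returns.
 */
function findPhrase(array $HTMLWords, string $phrase): array
{
    $needle  = explode(' ', $phrase);
    $offsets = array_keys($HTMLWords);
    $words   = array_values($HTMLWords);
    $n       = count($needle);
    $hits    = [];

    // Compare each window of $n consecutive words with the phrase words.
    for ($i = 0; $i + $n <= count($words); $i++) {
        if (array_slice($words, $i, $n) === $needle) {
            $last   = $i + $n - 1;
            $hits[] = [
                'start' => $offsets[$i],                            // offset of first word
                'end'   => $offsets[$last] + strlen($words[$last]), // one past last word
            ];
        }
    }
    return $hits;
}

$hits = findPhrase(
    [59 => 'was', 63 => 'not', 90 => 'statistically', 127 => 'valid'],
    'not statistically valid'
);
```

Because every window is checked independently, a phrase that starts right after a failed partial match is still found.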
The next thing to do is replace this occurrence, from offset 62 to 127 + strlen(valid), with <span class="negative">not statistically valid</span> and forget about everything else.
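The arithmetic can be checked in isolation with substr_replace(). This is only a worked example with the offsets hard-coded from the array above; the variable names $start, $end, $length and $result are chosen for illustration.

```php
<?php
$HTML = 'some data from <span class="positive">blahblah test</span> was not '
      . '<span class="positive">statistically <span class="positive">valid</span></span>';

$start  = 62;                      // where the previous word ("was") ends
$end    = 127 + strlen('valid');   // 132, one past the last matched word
$length = $end - $start;           // 70 bytes are replaced

$result = substr_replace(
    $HTML,
    '<span class="negative">not statistically valid</span>',
    $start,
    $length
);
```

Everything between the end of was and the end of valid is replaced, so the two closing tags that followed valid are left over. Offset 62 is the space before not, so that space falls inside the replaced range as well.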
/**
* Substitute newly matched strings with negative HTML wrapper
* @param [array] $subStrings
* @param [string] $HTML
* @return [string] $HTML
*/
function negativeHighlight($subStrings, $HTML) {
    $offset = 0;
    $HTMLLength = strlen($HTML);
    foreach ($subStrings as $key => $value) {
        $arrayOfWords = [];
        foreach ($value as $word) {
            $arrayOfWords[] = $word[0];
            if (current($value) == $value[0]) {
                $start = substr($HTML, $word[1], strlen($word[0])) == $word[0] ? $word[2] : $word[2] + $offset;
            }
            if (current($value) == end($value)) {
                $defaultLength = $word[1] + strlen($word[0]) - $start;
                $length = substr($HTML, $word[1], strlen($word[0])) === $word[0] ? $defaultLength : $defaultLength + $offset;
            }
        }
        $string = implode(" ", $arrayOfWords);
        $HTML = substr_replace($HTML, "<span class=\"negative\">{$string}</span>", $start, $length);
        // Track how much the HTML grew or shrank compared with the original
        if ($HTMLLength > strlen($HTML)) {
            $offset = -($HTMLLength - strlen($HTML));
        } elseif ($HTMLLength < strlen($HTML)) {
            $offset = strlen($HTML) - $HTMLLength;
        }
    }
    return $HTML;
}
An important thing to note here is that doing the first substitution may affect the offsets of other extracted values (which we don't have here). So calculating the new HTML length is required:
if ($HTMLLength > strlen($HTML)) {
$offset = -($HTMLLength - strlen($HTML));
} elseif ($HTMLLength < strlen($HTML)) {
$offset = strlen($HTML) - $HTMLLength;
}
and then we should check how this change of length affected our offsets. This check is done by the following block (we need to check the first and last word only):
if (current($value) == $value[0]) {
$start = substr($HTML, $word[1], strlen($word[0])) == $word[0] ? $word[2] : $word[2] + $offset;
}
if (current($value) == end($value)) {
$defaultLength = $word[1] + strlen($word[0]) - $start;
$length = substr($HTML, $word[1], strlen($word[0])) === $word[0] ? $defaultLength : $defaultLength + $offset;
}
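A different way to sidestep the offset bookkeeping (this is an alternative technique, not what the code above does) is to apply the replacements from the highest offset to the lowest, so that each substitution leaves the offsets of the remaining ones untouched. A minimal sketch, assuming the edits do not overlap; applyEdits() and its [start, length, replacement] triples are hypothetical:

```php
<?php
/**
 * Hypothetical helper: apply [start, length, replacement] edits to a string.
 * Edits are assumed not to overlap.
 */
function applyEdits(string $text, array $edits): string
{
    // Highest start offset first: text before an edit never moves.
    usort($edits, fn(array $a, array $b) => $b[0] <=> $a[0]);
    foreach ($edits as [$start, $length, $replacement]) {
        $text = substr_replace($text, $replacement, $start, $length);
    }
    return $text;
}

$out = applyEdits('aaa bbb ccc', [
    [0, 3, '<x>aaa</x>'],
    [8, 3, '<x>ccc</x>'],
]);
```

Both offsets refer to the original string, and no adjustment is needed because the edit at offset 8 is applied before the one at offset 0.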
Putting it all together:
$newHTML = negativeHighlight(checkNegativesExistence(extractWords($HTML), $listOfNegatives), $HTML);
Output:
some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span></span></span>
But there is a problem with this output: unmatched tags.
I'm sorry, I lied: I said I solved this problem in 4 steps, but there is one more. Here I made another RegEx that matches all properly nested tags as well as those left behind by mistake:
~(<span[^>]+>([^<]*+<(?!/)(?:([a-zA-Z0-9]++)[^>]*>[^<]*</\3>|(?2)))*[^<]*</span>|(?'single'</[^>]+>|<[^>]+>))~
With preg_replace_callback() I only replace tags in the group named single with nothing:
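echo preg_replace_callback(
    // "\\3" keeps the backreference intact: a bare "\3" inside double quotes is an octal escape in PHP
    "~(<span[^>]+>([^<]*+<(?!/)(?:([a-zA-Z0-9]++)[^>]*>[^<]*</\\3>|(?2)))*[^<]*</span>|(?'single'</[^>]+>|<[^>]+>))~",
    function ($match) {
        // Unmatched (single) tags are dropped
        if (isset($match['single'])) {
            return '';
        }
        // Properly nested spans are kept as they are
        return $match[1];
    },
    $newHTML
);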
and we get the right output:
some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span>
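If only span tags need balancing, the same cleanup can be sketched with a depth counter instead of a recursive pattern. This is a swapped-in alternative under that assumption, not the RegEx above; balanceSpans() is a hypothetical name and other tags are left untouched.

```php
<?php
/**
 * Hypothetical helper: drop closing </span> tags that have no matching
 * opener and close any spans still open at the end of the string.
 * Only <span> tags are considered.
 */
function balanceSpans(string $html): string
{
    $depth = 0;
    $out = preg_replace_callback(
        '~<span\b[^>]*>|</span>~',
        function (array $m) use (&$depth): string {
            if ($m[0] === '</span>') {
                if ($depth === 0) {
                    return '';   // surplus closing tag: remove it
                }
                $depth--;        // closes the innermost open span
                return $m[0];
            }
            $depth++;            // opening tag
            return $m[0];
        },
        $html
    );
    // Close whatever is still open.
    return $out . str_repeat('</span>', $depth);
}

$clean = balanceSpans('was <span class="negative">not valid</span></span></span>');
```

Here the two trailing closing tags have no opener, so they are removed and the negative span is kept.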
My solution does not output the right HTML in the following situations:
1- If a word like <was> is between other words:
<span class="positive">blahblah test</span> <was> not
Why?
<was> is treated as an unmatched tag, so it will be removed.
2- If a word like not (which is part of a negative list value in our list) is enclosed in <>, i.e. <not>. Which outputs:
some data from <span class="positive">blahblah test</span> was <not> <span class="positive">statistically <span class="positive">valid</span></span>
Why?
Words enclosed in <> are not extracted as real words.
3- If the list has values where one is a substring of another:
$listOfNegatives = ['not statistically valid', 'not statistically'];
Why?
Upvotes: 3
Reputation: 6742
Here is what I've come up with. I honestly can't say whether it will cope with the full range of the requirements, but it might help a bit.
$s = 'some data from blahblah test was not statistically valid';
$replaced = highlight($s);
var_dump($replaced);

function highlight($s) {
    // split the string on the negative parts, capturing the full negative string each time
    $parts = preg_split('/(not statistically valid)/', $s, -1, PREG_SPLIT_DELIM_CAPTURE);
    $output = '';
    $negativePart = 0; // keep track of whether we're dealing with a negative part or the remainder - they will alternate.
    foreach ($parts as $part) {
        if ($negativePart) {
            $output .= negativeHighlight($part);
        } else {
            $output .= positiveHighlight($part);
        }
        $negativePart = !$negativePart;
    }
    return $output;
}

// only deals with a single negative part at a time, so just wraps with a span
function negativeHighlight($part) {
    return "<span class='negative'>$part</span>";
}

// potentially deals with several replacements at once
function positiveHighlight($part) {
    // $0 is the whole match; $1 would be empty whenever the second alternative matches
    return preg_replace('/(blahblah test)|(statistically valid)/', "<span class='positive'>$0</span>", $part);
}
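The patterns above are hard-coded. A minimal sketch of building one from a list of phrases, assuming the phrases are plain text; phrasesToPattern() is a hypothetical helper, and sorting by length is an added design choice so that a longer phrase wins over a shorter one it starts with:

```php
<?php
/**
 * Hypothetical helper: turn a list of plain-text phrases into a single
 * capturing alternation pattern. Longer phrases are placed first.
 */
function phrasesToPattern(array $phrases): string
{
    usort($phrases, fn(string $a, string $b) => strlen($b) <=> strlen($a));
    // preg_quote() escapes regex metacharacters and the "/" delimiter.
    $quoted = array_map(fn(string $p) => preg_quote($p, '/'), $phrases);
    return '/(' . implode('|', $quoted) . ')/';
}

$pattern = phrasesToPattern(['not statistically', 'not statistically valid']);
$parts   = preg_split($pattern, 'was not statistically valid', -1, PREG_SPLIT_DELIM_CAPTURE);
```

The resulting pattern can be passed to preg_split() with PREG_SPLIT_DELIM_CAPTURE in the same way as the literal one in highlight().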
Upvotes: 1