Reputation: 14123
For a given string (typically a paragraph) I want to replace some words/phrases but ignore them if they happen to be surrounded by tags in some way. This also needs to be case-insensitive.
Using this as an example:
You can find a link here <a href="#">link</a> and a lot
of things in different styles. Public platform can appear in bold:
<b>public platform</b>, and we also have italics here too: <i>italics</i>.
While I like soft pillows I am picky about soft <i>pillows</i>.
While I want to find fox, I din't want foxes to show up.
The text "shiny fruits" is in a span tag: one of the <span>shiny fruits</span>.
Let's say I want to replace these words:
link: 2 occurrences. The first is in plain text (match), the second is in an A tag (ignore).
public platform: the first is in plain text (match, case-insensitive), the second is in a B tag (ignore).
soft pillows: 1 plain-text match.
fox: 1 plain-text match. It looks at complete words.
fruits: the first is in plain text (match), the second is in a span tag (ignore) with other text.
For background: I am searching for phrase matches (not single words) and linking matches to a relevant page.
I want to avoid nesting HTML (no links within bold tags or vice versa) or other mistakes (example: the <a href="#">phrase <b>goes</a> here</b>).
I tried a few approaches, such as searching a sanitized copy of the text with the HTML content removed. That told me there was a match, but it left me with a whole new problem: mapping the match back to the original.
Reputation: 563
I found mentions of regex negative lookaheads and, after racking my brain over them, arrived at this regex (it assumes the HTML tags are VALIDLY paired):
// The function is deliberately verbose to show how the pattern comes together.
function replaceTextOutsideTags($sourceText = null, $toReplace = 'inner text', $dummyText = '(REPLACED TEXT HERE)')
{
    // Fall back to a built-in sample when no text is passed in.
    $string = $sourceText ?? "Inner text
You can find a link here <a href=\"#\">link</a> and a lot
of things in different styles. Public platform can appear in bold:
<b>public platform</b>, and we also have italics here too: <i>italics</i>.
While I like soft pillows I am picky about soft <i>pillows</i>.
While I want to find fox, I din't want foxes to show up.
The text \"shiny fruits\" is in a span tag: one of the <span>shiny fruits</span>.
The inner text like this <a>inner <b>inner text </b> here to test too</b>, event inner text
<a inner text>omg thats sad... or not</a>
";
    // [[:punct:]] would be nicer, but it also matches < and >, so the allowed marks are listed by hand.
    $punctuation = "\.,!\?:;\\|\/=\"#"; // this list may need extending for your content
    // Whole-word match; preg_quote() escapes regex metacharacters (and the / delimiter) in the phrase.
    $stringPart = "\b" . preg_quote($toReplace, '/') . "\b";
    // Start of a negative lookahead over a run of characters that never includes <.
    $excludeSequence = "(?![\w\n\s>$punctuation]*?";
    $excludeOutside = "$excludeSequence<\/)"; // the ) closes the lookahead: no closing tag right after the run
    $excludeTag = "$excludeSequence>)"; // the ) closes the lookahead: not inside a tag itself
    $pattern = "/" . $stringPart . $excludeOutside . $excludeTag . "/im";
    return preg_replace($pattern, $dummyText, $string);
}
Example output with the default parameters:
"""
(REPLACED TEXT HERE)\r\n
You can find a link here <a href="#">link</a> and a lot \r\n
of things in different styles. Public platform can appear in bold: \r\n
<b>public platform</b>, and we also have italics here too: <i>italics</i>. \r\n
While I like soft pillows I am picky about soft <i>pillows</i>. \r\n
While I want to find fox, I din't want foxes to show up.\r\n
The text "shiny fruits" is in a span tag: one of the <span>shiny fruits</span>.\r\n
The (REPLACED TEXT HERE) like this <a>inner <b>inner text </b> here to test too</b>, event (REPLACED TEXT HERE)\r\n
<a inner text>omg thats sad... or not</a>
"""
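To tie this back to the question (linking phrases to a relevant page), the same pattern can be applied once per phrase. The following is only a minimal sketch: the helper name linkPhrases and the URLs are made up for illustration, and it assumes validly paired tags just like the function above. Because every inserted link is itself a tag, later phrases that fall inside an already inserted link are skipped by the same lookaheads.

```php
<?php
// Hypothetical helper (not part of the answer's code): wraps each phrase in a link
// unless the phrase already sits inside a tag or inside an element's text.
function linkPhrases(string $html, array $links): string
{
    // Same character run and lookaheads as in replaceTextOutsideTags().
    $punctuation = "\.,!\?:;\\|\/=\"#";
    $skip = "(?![\w\n\s>$punctuation]*?";

    foreach ($links as $phrase => $url) {
        $pattern = "/\b" . preg_quote($phrase, '/') . "\b" . $skip . "<\/)" . $skip . ">)/i";
        // The callback keeps the matched text, so the original casing survives.
        $html = preg_replace_callback(
            $pattern,
            fn(array $m) => '<a href="' . htmlspecialchars($url) . '">' . $m[0] . '</a>',
            $html
        );
    }
    return $html;
}

$text = 'Public platform can appear in bold: <b>public platform</b>. '
      . 'I want to find fox, not foxes.';

// The URLs below are placeholders.
$linked = linkPhrases($text, [
    'public platform' => '/pages/platform',
    'fox'             => '/pages/fox',
]);

echo $linked, PHP_EOL;
```

With this input, only the plain-text "Public platform" and the standalone "fox" are wrapped; the bold occurrence and "foxes" are left alone.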
Now step by step:
\b$toReplace\b matches whole words only (no match on pillow if the text has only pillowS).
[\w\n\s>$punctuation]*? is a run of \w word symbols, \s whitespace or \n new lines and the allowed punctuation. If that run ends with the opening of a closing tag, </, we don't need this match, and this is where the negative lookahead (?![\w\n\s>$punctuation]*?<\/) comes in. Here we can be sure that the match won't run into a new tag, because < is not in the described sequence ($excludeOutside variable).
The $excludeTag variable is basically the same as $excludeOutside, but for cases where $toReplace can be part of an HTML tag itself, for example a.
[[:punct:]] is not used for the punctuation because it also matches < or >, and having those symbols in the sequence can lead to unexpected behavior.
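The two lookaheads can be checked in isolation on small inputs. This is a sketch with test strings invented for illustration; the phrase fox is hard-coded. The last case shows a caveat that follows from the character class: a character that is not in the allowed sequence (here an apostrophe) stops the run before it reaches </, so the phrase is matched even though it sits inside a tag pair. That is one reason the punctuation list may need extending.

```php
<?php
// Same building blocks as in the answer, with the phrase fixed to "fox".
$punctuation = "\.,!\?:;\\|\/=\"#";
$skip = "(?![\w\n\s>$punctuation]*?";
$pattern = "/\bfox\b" . $skip . "<\/)" . $skip . ">)/i";

// Invented inputs mapped to the number of matches each one should produce.
$cases = [
    'a fox, then <b>bold</b>'  => 1, // next tag is an opening tag: match
    '<i>a fox</i>'             => 0, // run reaches "</": first lookahead rejects
    '<img alt="fox"> and more' => 0, // run reaches ">": second lookahead rejects
    "<b>fox isn't here</b>"    => 1, // apostrophe is outside the class: matched anyway
];

$counts = [];
foreach ($cases as $input => $expected) {
    // preg_match_all() returns the number of full pattern matches.
    $counts[$input] = preg_match_all($pattern, $input);
    printf("%-28s matches: %d (expected %d)\n", $input, $counts[$input], $expected);
}
```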