donohoe

Reputation: 14123

Replace text in a string and ignore matches within HTML tags

For a given string (typically a paragraph), I want to replace some words or phrases but ignore them if they are surrounded by tags in any way. The matching also needs to be case-insensitive.

Using this as an example:

You can find a link here <a href="#">link</a> and a lot 
of things in different styles. Public platform can appear in bold: 
<b>public platform</b>, and we also have italics here too: <i>italics</i>. 
While I like soft pillows I am picky about soft <i>pillows</i>. 
While I want to find fox, I din't want foxes to show up.
The text "shiny fruits" is in a span tag:  one of the <span>shiny fruits</span>.

Let's say I want to replace these words:

For background: I am searching for phrase matches (not single words) and linking each match to a relevant page.

I want to avoid nesting HTML (no links within bold tags or vice versa) and other mistakes (for example: the <a href="#">phrase <b>goes</a> here</b>).

I tried a few approaches, such as searching against a sanitized copy of the text with the HTML removed. While that told me there was a match, I then had a whole new problem: mapping the match back to the original text.

Upvotes: 1

Views: 674

Answers (1)

Ol D. Castor

Reputation: 563

I found mentions of regex negative lookaheads and, after racking my brain over them, arrived at this regex (assuming your HTML tags are validly paired):

// The function is deliberately verbose to show how the pattern comes together.
// Add `public` in front of `function` if you place it inside a class.
function replaceTextOutsideTags($sourceText = null, $toReplace = 'inner text', $dummyText = '(REPLACED TEXT HERE)')
{
  $string = $sourceText ?? "Inner text
  You can find a link here <a href=\"#\">link</a> and a lot 
  of things in different styles. Public platform can appear in bold: 
  <b>public platform</b>, and we also have italics here too: <i>italics</i>. 
  While I like soft pillows I am picky about soft <i>pillows</i>. 
  While I want to find fox, I din't want foxes to show up.
  The text \"shiny fruits\" is in a span tag:  one of the <span>shiny fruits</span>.
  The inner text like this <a>inner <b>inner text </b> here to test too</b>, event inner text
  <a inner text>omg thats sad... or not</a>
  ";
  // [[:punct:]] would be convenient, but it also matches < and >, which must stay out of the class
  $punctuation = "\.,!\?:;\\|\/=\"#"; // extend this list to suit your text
  $stringPart = "\b" . preg_quote($toReplace, '/') . "\b"; // escape regex metacharacters in the phrase
  $excludeSequence = "(?![\w\n\s>$punctuation]*?"; // shared start of both negative lookaheads
  $excludeOutside = "$excludeSequence<\/)"; // skip text that runs into a closing tag; note the closing )
  $excludeTag = "$excludeSequence>)"; // skip text inside a tag itself; note the closing )
  $pattern = "/" . $stringPart . $excludeOutside . $excludeTag . "/im";
  
  return preg_replace($pattern, $dummyText, $string);
}

Example output with the default parameters:

     """
     (REPLACED TEXT HERE)\r\n
     You can find a link here <a href="#">link</a> and a lot \r\n
     of things in different styles. Public platform can appear in bold: \r\n
     <b>public platform</b>, and we also have italics here too: <i>italics</i>. \r\n
     While I like soft pillows I am picky about soft <i>pillows</i>. \r\n
     While I want to find fox, I din't want foxes to show up.\r\n
     The text "shiny fruits" is in a span tag:  one of the <span>shiny fruits</span>.\r\n
     The (REPLACED TEXT HERE) like this <a>inner <b>inner text </b> here to test too</b>, event (REPLACED TEXT HERE)\r\n
     <a inner text>omg thats sad... or not</a>     
     """

Now, step by step:

  1. The \b word boundaries rule out partial matches (we don't want "pillow" to match when the text only contains "pillows").
  2. If the match is followed by a sequence of any length made up of \w word characters, \s whitespace, \n newlines, and the allowed punctuation, and that sequence ends at a closing tag </, we don't want the match. That is the job of the negative lookahead (?![\w\n\s>$punctuation]*?<\/) stored in $excludeOutside. The sequence cannot run into a new tag, because < is not part of the character class.
  3. $excludeTag is essentially the same as $excludeOutside, but it covers cases where $toReplace could be an HTML tag itself, for example a.
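
The three steps above can be checked one at a time on small inputs. This is only a sketch: buildPattern is a hypothetical helper that assembles the same pattern as the function above, and the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper for illustration: builds the same pattern as
// replaceTextOutsideTags(), minus the "m" flag, which has no effect here.
function buildPattern(string $phrase): string
{
    $punctuation = "\.,!\?:;\\|\/=\"#";
    $sequence = "(?![\w\n\s>$punctuation]*?";
    return "/\b" . preg_quote($phrase, '/') . "\b"
        . $sequence . "<\/)"   // same as $excludeOutside
        . $sequence . ">)/i";  // same as $excludeTag
}

$pattern = buildPattern('fox');

// Step 1: \b keeps "fox" from matching inside "foxes".
echo preg_replace($pattern, '[X]', 'a fox and two foxes'), "\n";
// prints: a [X] and two foxes

// Step 2: a match that runs straight into a closing tag is skipped.
echo preg_replace($pattern, '[X]', 'a fox and a <b>fox</b>'), "\n";
// prints: a [X] and a <b>fox</b>

// Step 3: a match inside the tag itself (here an attribute value) is skipped.
echo preg_replace($pattern, '[X]', '<a title="fox"><b>link</b></a> and a fox'), "\n";
// prints: <a title="fox"><b>link</b></a> and a [X]
```

In the third call the first lookahead does not fire, because the next tag after the attribute is an opening <b>, so the second lookahead is the one that rejects the match.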
Note that this code cannot handle text that contains a literal < or >; those characters can lead to unexpected behavior.
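
If the lookahead pattern turns out to be too fragile, one possible alternative is to split the string into tag and text chunks and replace only in text that sits outside every element. This is a different technique from the regex above, shown only as a sketch: replaceOutsideElements and its short list of void elements are assumptions for illustration, and it still relies on validly paired tags and on the text containing no stray < or >.

```php
<?php
// Hypothetical helper (not from the answer above): replace $phrase only in
// text that sits outside every element. Assumes validly paired tags.
function replaceOutsideElements(string $html, string $phrase, string $replacement): string
{
    // Void elements have no closing tag, so they must not change the depth.
    // This list is illustrative, not exhaustive.
    $void = ['br', 'hr', 'img', 'input', 'wbr'];
    // With PREG_SPLIT_DELIM_CAPTURE, even indexes hold text and odd indexes hold tags.
    $chunks = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
    $depth = 0;

    foreach ($chunks as $i => $chunk) {
        if ($i % 2 === 0) {
            // Text chunk: replace only when no element is currently open.
            if ($depth === 0) {
                $chunks[$i] = preg_replace($pattern, $replacement, $chunk);
            }
            continue;
        }
        if (!preg_match('/^<(\/?)([a-z][a-z0-9]*)/i', $chunk, $m)) {
            continue; // comments, doctype, and similar: depth stays unchanged
        }
        if (in_array(strtolower($m[2]), $void, true) || substr($chunk, -2) === '/>') {
            continue; // void or self-closing tag
        }
        $depth = ($m[1] === '/') ? max(0, $depth - 1) : $depth + 1;
    }

    return implode('', $chunks);
}

// $0 in the replacement re-inserts the matched text with its original casing.
echo replaceOutsideElements(
    'Soft pillows are nice,<br> but not <i>soft pillows</i>.',
    'soft pillows',
    '<a href="#">$0</a>'
), "\n";
// prints: <a href="#">Soft pillows</a> are nice,<br> but not <i>soft pillows</i>.
```

Because each text chunk is replaced in place, there is no need to map offsets from a stripped copy back to the original string. Phrases that span a tag boundary are not matched by this sketch.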

Upvotes: 1
