Reputation: 1843
I am trying to replace special symbols/placeholders with HTML tags in a block of text. So far it is working well, but I now need to skip the replacements when the replaceable text is inside <pre> or <code> tags.
My current code:
function parse($text) {
    $search = array(
        '/\*\*(.*?)\*\*/is', // bold
        '/\/\/(.*?)\/\//is', // italic
        '/__(.*?)__/is', // underline
    ); #search
    $replace = array(
        '<b>$1</b>',
        '<i>$1</i>',
        '<u>$1</u>',
    ); #replace
    return preg_replace($search, $replace, $text);
} #parse
Sample input:
<pre>
** Bold Text **
// Italic Text //
__ Underline Text __
</pre>
<code>
** Bold Text **
// Italic Text //
__ Underline Text __
</code>
How do I exclude the text inside of these specific tags while parsing?
Upvotes: 0
Views: 1257
Reputation: 47900
Not only will my solution exclude text between <pre> and <code> tags, it will also consolidate your pre-existing patterns into one.
First, match each opening <pre> or <code> tag through to its identically named closing tag in an ungreedy manner. Discard such matched substrings with (*SKIP)(*FAIL).
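As a minimal sketch of (*SKIP)(*FAIL) in isolation, the snippet below protects a <pre> block while converting only the bold marker. The sample string and the variable names are assumptions made for illustration, not part of the question.

```php
<?php
// Hypothetical sample input: one protected block, one unprotected phrase.
$sample = "<pre>**keep**</pre> and **change**";

// The first alternative consumes a whole <pre>...</pre> block, then
// (*SKIP)(*FAIL) rejects that match and prevents the engine from retrying
// at any position inside it. Only the second alternative can succeed.
$result = preg_replace(
    '#<pre>.*?</pre>(*SKIP)(*FAIL)|\*\*(.*?)\*\*#s',
    '<b>$1</b>',
    $sample
);

echo $result, PHP_EOL;
// <pre>**keep**</pre> and <b>change</b>
```

Because the skipped alternative contains no capture group here, $1 refers to the text between the ** markers.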
Next, match the two-character substrings that should be replaced through to their identically fashioned end in an ungreedy manner. Capture group #2 will be the first of the paired symbols at the start of the match, and capture group #3 will be the text to be wrapped in the new HTML tags.
In the callback function, $m represents the payload of matches. [0] is the fullstring match and [1] is only relevant to discarded chunks of text. The [2] element of $m will contain the first of the matched symbols -- use strtr() to translate these symbols to their respective HTML tag. The [3] element of $m will contain the inner text -- this doesn't need to be changed, but merely inserted between the new opening and closing HTML tags.
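To make the shape of $m concrete, here is a small sketch that records what the callback receives for a single italic match. The input string '//hi//' and the $seen array are hypothetical, added only for inspection.

```php
<?php
$seen = [];
$out = preg_replace_callback(
    '#<(pre|code)>.*?</\1>(*SKIP)(*FAIL)|([*_/])\2(.*?)\2\2#si',
    function (array $m) use (&$seen) {
        $seen[] = $m;                       // keep a copy for inspection
        $tag = strtr($m[2], '*/_', 'biu');  // '*' => 'b', '/' => 'i', '_' => 'u'
        return "<$tag>{$m[3]}</$tag>";
    },
    '//hi//'
);

var_export($seen[0]);
echo PHP_EOL, $out, PHP_EOL;
// $seen[0] is ['//hi//', '', '/', 'hi'] and $out is <i>hi</i>
```

Element [1] is an empty string here because the <pre>/<code> alternative did not take part in the successful match.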
As an additional feature (not requested), I recommend executing the replacement process inside of a do-while() loop so that nested BBtags are all converted.
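The effect of the loop can be seen by comparing a single pass with repeated passes on a nested input. In this sketch, convertOnce() is a hypothetical helper wrapping the same pattern on one line (so the x modifier is dropped), and '__**x**__' is an assumed test string.

```php
<?php
// Hypothetical helper: runs one replacement pass and reports, through
// $count, how many replacements were made.
function convertOnce(string $text, ?int &$count = null): string
{
    return preg_replace_callback(
        '#<(pre|code)>.*?</\1>(*SKIP)(*FAIL)|([*_/])\2(.*?)\2\2#si',
        fn($m) => sprintf('<%1$s>%2$s</%1$s>', strtr($m[2], '*/_', 'biu'), $m[3]),
        $text,
        -1,
        $count
    );
}

$nested = '__**x**__';

// One pass: the outer __ pair consumes the inner ** pair as plain text.
$single = convertOnce($nested);

// Repeat until a pass makes no replacements.
$looped = $nested;
do {
    $looped = convertOnce($looped, $count);
} while ($count);

echo $single, PHP_EOL;  // <u>**x**</u>
echo $looped, PHP_EOL;  // <u><b>x</b></u>
```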
Regarding the s, i, and x pattern modifiers:
s tells the regex engine that the dot (.) in the pattern should also match newline characters (which it does not do by default).
i tells the regex engine to match case-insensitively -- this only affects the literal strings pre and code, because every other part of the pattern consists of symbols that have no letter case.
x tells the regex engine to ignore literal whitespace in the pattern -- this is done to enable the pattern to be written on multiple lines with indentation for easier human readability.
Code:
do {
    $text = preg_replace_callback(
        '#
            <(pre|code)>.*?</\1>(*SKIP)(*FAIL)
            |([*_/])\2(.*?)\2\2
        #six',
        fn($m) => sprintf(
            '<%1$s>%2$s</%1$s>',
            strtr($m[2], '*/_', 'biu'),
            $m[3]
        ),
        $text,
        -1,
        $count
    );
} while ($count);
var_export($text);
More challenging input:
$text = <<<TEXT
<pre>
** Bold Text **
// Italic Text //
__ Underline Text __
</pre>
** Bold Text **
// Italic Text //
__ Underline Text __
__** Bold Text **
Nested Text __
<code>
** Bold Text **
// Italic Text //
__ Underline Text __
</code>
TEXT;
Output:
'<pre>
** Bold Text **
// Italic Text //
__ Underline Text __
</pre>
<b> Bold Text </b>
<i> Italic Text </i>
<u> Underline Text </u>
<u><b> Bold Text </b>
Nested Text </u>
<code>
** Bold Text **
// Italic Text //
__ Underline Text __
</code>'
Upvotes: 0
Reputation: 17831
First of all, that's not BBCode. BBCode uses [ and ] as delimiters to mimic common HTML markup tags. What you have there is something similar to Markdown or reStructuredText.
Secondly, your replacement algorithm is extremely simple and likely to give you much trouble in the future. If you are not merely doing this to learn how to code in PHP, I'd suggest you use existing parsers that already do what you want to do, like PHP Markdown, PHP reStructuredText or PHP BBCode Parser.
Now, as for your actual question: This will not be easy, but you can start with altering your regexes so they only apply if they are not inside <pre> tags like this: (untested)
'/(?<!<pre>).*?\*\*(.*?)\*\*.*?(?!<\/pre>)/is', // bold (the / in </pre> must be escaped because / is the delimiter)
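A different way to keep the question's own parse() function, sketched here as an assumption rather than as this answer's method, is to split the text on <pre>/<code> blocks and run parse() only on the pieces in between. parseOutsideBlocks() is a hypothetical name, and the sketch assumes the protected blocks carry no attributes and are not mismatched.

```php
<?php
// parse() as posted in the question.
function parse($text) {
    $search = array(
        '/\*\*(.*?)\*\*/is', // bold
        '/\/\/(.*?)\/\//is', // italic
        '/__(.*?)__/is',     // underline
    );
    $replace = array('<b>$1</b>', '<i>$1</i>', '<u>$1</u>');
    return preg_replace($search, $replace, $text);
}

// Hypothetical wrapper: split on <pre>/<code> blocks, keeping the blocks
// themselves in the result thanks to PREG_SPLIT_DELIM_CAPTURE.
function parseOutsideBlocks($text) {
    $chunks = preg_split(
        '#(<(?:pre|code)>.*?</(?:pre|code)>)#is',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    foreach ($chunks as $i => $chunk) {
        // With a single capture group, the protected blocks sit at odd
        // indexes; only the even-indexed chunks are parsed.
        if ($i % 2 === 0) {
            $chunks[$i] = parse($chunk);
        }
    }
    return implode('', $chunks);
}

echo parseOutsideBlocks('<pre>**a**</pre> **b** <code>__c__</code>'), PHP_EOL;
// <pre>**a**</pre> <b>b</b> <code>__c__</code>
```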
Upvotes: 1