Reputation: 713
I'm writing a simple Markdown parser to output HTML for pages that also use some LaTeX equations. For example, for italics:
// italic
$content = preg_replace_callback(
'/(\*|_)(.+)\1/',
function ($m) {
return "<i>" . $m[2] . "</i>";
},
$content
);
Unfortunately, a lot of Markdown formatting clashes with LaTeX symbols (and also with code blocks), so I need to escape the LaTeX sections first and parse Markdown only outside of them. The LaTeX bits are delimited by $ and $$, so they are easy to spot:
preg_match('/\$+(.*?)\$+/', $content) // single quotes: in a double-quoted PHP string, \$ collapses to a bare $
For example, here is a sample of such a page:
## Section title
Lorem ipsum *dolores* sic amet. $E = mc^2$, and since :
$$
\cos(3*\pi*\sqrt{2}) = \delta
$$
So… clash between italics and multiplication.
My first guess is that I should split the content into two arrays: one containing the LaTeX bits with their index, and one containing the non-LaTeX bits located between them. I would then process the second array on its own and merge the two back together.
preg_split() breaks on said patterns and returns the intermediate substrings, but ditches the substrings matching the patterns. It seems it can be tweaked with the PREG_SPLIT_DELIM_CAPTURE flag to return all substrings, including the breaking points matching the regexp, but the documentation doesn't show the output data structure when this flag is used, so I don't see how to iterate over the output array and only work on the parts not matching the pattern.
What does this function output, and/or is there a better/faster way to perform pattern detection outside of regions matching some other pattern?
Upvotes: 0
Views: 45
Reputation: 163632
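One option is to exclude the blocks that start and end with a line containing only $ from the match, using (*SKIP)(*FAIL). Note that this protects only blocks whose $ delimiters sit alone on their own lines; inline $...$ is not skipped.
Then capture either * or _ in a capture group and use the backreference \1 to match the same character as the closing delimiter, without allowing that character in between.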
^\$+(?:\R(?!\$+$).*)*\R\$+$(*SKIP)(*FAIL)|([*_])((?:(?!\1).)+)\1
The pattern matches:
- ^ : start of the string (or of a line, with the /m flag used in the example)
- \$+ : match 1+ occurrences of $
- (?:\R(?!\$+$).*)* : match all following lines that do not consist only of $
- \R\$+$ : match a line consisting only of $
- (*SKIP)(*FAIL)| : skip what has been matched so far, or
- ([*_]) : capture either * or _ in group 1
- ((?:(?!\1).)+) : capture in group 2 one or more characters other than the one captured in group 1
- \1 : backreference to group 1, matching the same character as captured
Example
$content = <<<'DATA'
## Section title
Lorem ipsum *dolores* sic amet. $E = mc^2$, and since :
$$
\cos(3*\pi*\sqrt{2}) = \delta
$$
DATA;
$content = preg_replace_callback(
'/^\$+(?:\R(?!\$+$).*)*\R\$+$(*SKIP)(*FAIL)|([*_])((?:(?!\1).)+)\1/m',
function ($m) {
return "<i>" . $m[2] . "</i>";
},
$content
);
echo $content;
Output
## Section title
Lorem ipsum <i>dolores</i> sic amet. $E = mc^2$, and since :
$$
\cos(3*\pi*\sqrt{2}) = \delta
$$
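As for the preg_split() part of the question, here is a minimal sketch of the split-and-merge approach, not a definitive implementation. It assumes the whole LaTeX pattern is wrapped in a single capture group; in that case PREG_SPLIT_DELIM_CAPTURE returns a flat array in which even indexes hold the text between matches and odd indexes hold the captured LaTeX bits. The specific LaTeX pattern and the variable names ($parts, $html) are my own assumptions for illustration.

```php
<?php
$content = <<<'DATA'
Lorem ipsum *dolores* sic amet. $E = mc^2$, and since :
$$
\cos(3*\pi*\sqrt{2}) = \delta
$$
DATA;

// One capture group around the whole delimiter pattern:
// display math ($$ ... $$, may span lines) first, then inline math ($ ... $ on one line).
$parts = preg_split(
    '/(\$\$.*?\$\$|\$[^$\n]+\$)/s',
    $content,
    -1,
    PREG_SPLIT_DELIM_CAPTURE
);

// Even indexes: plain text between LaTeX bits. Odd indexes: the LaTeX bits themselves.
foreach ($parts as $i => $part) {
    if ($i % 2 === 0) {
        // Apply the italic rule to the non-LaTeX chunks only.
        $parts[$i] = preg_replace('/([*_])((?:(?!\1).)+)\1/', '<i>$2</i>', $part);
    }
}

// Merge everything back in the original order.
$html = implode('', $parts);
echo $html;
```

Do not combine this with PREG_SPLIT_NO_EMPTY: dropping empty chunks (for example when the content starts or ends with a LaTeX bit) would break the even/odd alternation.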
Note that parsing markup with a regex is brittle and has edge cases.
You could make the italic part of the pattern more specific, for example by asserting whitespace boundaries around the delimiters:
^\$+(?:\R(?!\$+$).*)*\R\$+$(*SKIP)(*FAIL)|(?<!\S)([*_])((?:(?!\1).)+)\1(?!\S)
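To illustrate what the whitespace boundaries change, here is a small sketch using that last pattern. The function name italicize() is hypothetical, and the sample strings are my own. Identifiers such as snake_case_name and products such as 2*3*4 are left alone, but so is emphasis directly followed by punctuation, which is the trade-off of this stricter variant.

```php
<?php
// Hypothetical helper wrapping the boundary-asserting pattern shown above.
function italicize(string $text): string
{
    $pattern = '/^\$+(?:\R(?!\$+$).*)*\R\$+$(*SKIP)(*FAIL)'
             . '|(?<!\S)([*_])((?:(?!\1).)+)\1(?!\S)/m';

    $result = preg_replace_callback(
        $pattern,
        function ($m) {
            return '<i>' . $m[2] . '</i>';
        },
        $text
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $text;
}

// Delimiters glued to other characters are ignored; only the spaced emphasis is converted.
echo italicize('snake_case_name and 2*3*4 but *real emphasis* here'), "\n";
// prints: snake_case_name and 2*3*4 but <i>real emphasis</i> here

// The closing * is followed by ".", so (?!\S) fails and nothing is converted.
echo italicize('Trailing punctuation: *dolores*.'), "\n";
// prints: Trailing punctuation: *dolores*.
```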
Upvotes: 1