PHP regexp parse avoiding substrings

Question

I'm writing a simple Markdown parser to output HTML for pages that also use some LaTeX equations. For example, for italics:

  // italic
  $content = preg_replace_callback(
    '/(\*|_)(.+)\1/',
    function ($m) {
      return "<em>" . $m[2] . "</em>";
    },
    $content
  );

Unfortunately, a lot of Markdown formatting clashes with LaTeX symbols (and also with code blocks), so I need to escape the LaTeX sections first, and parse Markdown only outside of these sections. The LaTeX bits are delimited by $ and $$, so it's easy to spot them:

preg_match('/\$+(.*?)\$+/', $content)

For example, this is a sample of such a page:


## Section title

Lorem ipsum *dolores* sic amet. $E = mc^2$, and since :

$$
\cos(3*\pi*\sqrt{2}) = \delta
$$

So… clash between italics and multiplication.

My first guess is that I should split the content into two arrays: one containing the LaTeX bits with their index, and one containing the non-LaTeX bits located between the LaTeX ones. I would then process the second array separately and merge the two back together.

preg_split() breaks on said patterns and returns the intermediate substrings, but ditches the substrings matching the patterns. It seems it can be tweaked with a PREG_SPLIT_DELIM_CAPTURE flag to return all substrings including the breaking points matching the regexp, but the documentation doesn't show the output data structure when this flag is used, so I don't get how to iterate on the output array and only work on the parts not matching the pattern.

What does this function output, and/or is there a better/faster way to perform pattern detection outside of regions matching some other pattern?

The fourth bird · Accepted Answer

One option is to use (*SKIP)(*FAIL) so that the blocks that start and end with a line containing only $ cannot be part of the match.

Then capture either * or _ in a capture group and use the backreference \1 to match the same char without matching the same char in between.

^\$+(?:\R(?!\$+$).*)*\R\$+$(*SKIP)(*FAIL)|([*_])((?:(?!\1).)+)\1

The pattern matches:

^ Start of a line (the example below uses the /m flag)
\$+ Match 1+ occurrences of $
(?:\R(?!\$+$).*)* Match all following lines that do not consist of only $
\R\$+$ Match a newline and then a line with only $
(*SKIP)(*FAIL)| Discard what is matched so far, so it cannot be part of a result; or
([*_]) Capture either * or _ in group 1
((?:(?!\1).)+) Capture group 2, match 1+ chars other than the char captured in group 1
\1 Backreference to group 1, matching the same char as captured


Example

$content = <<<'DATA'
## Section title

Lorem ipsum *dolores* sic amet. $E = mc^2$, and since :

$$
\cos(3*\pi*\sqrt{2}) = \delta
$$
DATA;

$content = preg_replace_callback(
    '/^\$+(?:\R(?!\$+$).*)*\R\$+$(*SKIP)(*FAIL)|([*_])((?:(?!\1).)+)\1/m',
    function ($m) {
        return "<em>" . $m[2] . "</em>";
    },
    $content
);

echo $content;

Output

## Section title

Lorem ipsum <em>dolores</em> sic amet. $E = mc^2$, and since :

$$
\cos(3*\pi*\sqrt{2}) = \delta
$$
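
As for the preg_split() part of the question: with PREG_SPLIT_DELIM_CAPTURE and a single capture group around the whole delimiter, the returned array alternates between the text between delimiters (even indexes) and the captured delimiters (odd indexes). Below is a minimal sketch of the split-and-merge idea. The helper name transformOutsideLatex, the delimiter pattern and the italic callback are assumptions made for illustration, and the sketch does not handle a lone $ such as a currency sign.

```php
<?php
// Hypothetical helper (not from the original post): applies $transform only
// to the text outside $...$ and $$...$$ spans.
function transformOutsideLatex(string $content, callable $transform): string
{
    // The single capture group makes preg_split() return the LaTeX spans too.
    // Assumption: $$...$$ may span several lines, $...$ stays on one line.
    $parts = preg_split(
        '/(\$\$.*?\$\$|\$[^$\n]+\$)/s',
        $content,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    foreach ($parts as $i => $part) {
        // Even indexes: text between LaTeX spans. Odd indexes: the captured
        // LaTeX spans, which are left untouched.
        if ($i % 2 === 0) {
            $parts[$i] = $transform($part);
        }
    }

    return implode('', $parts);
}

// Illustrative italic rule, reusing the emphasis part of the pattern above.
$italic = function (string $text): string {
    return preg_replace('/([*_])((?:(?!\1).)+)\1/', '<em>$2</em>', $text);
};

echo transformOutsideLatex('Lorem *dolores* sic $a*b*c$ amet.', $italic);
// Lorem <em>dolores</em> sic $a*b*c$ amet.
```

Do not pass PREG_SPLIT_NO_EMPTY here: dropping empty pieces would break the even/odd alternation whenever the content starts or ends with a LaTeX span.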

Note that parsing markup with a regex can be brittle and has edge cases.

You might make the pattern more specific, for example by asserting whitespace boundaries around the emphasis markers.

^\$+(?:\R(?!\$+$).*)*\R\$+$(*SKIP)(*FAIL)|(?
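
A minimal sketch of the whitespace-boundary idea, assuming the boundaries are expressed as a negative lookbehind (?<!\S) before the opening marker and a negative lookahead (?!\S) after the closing one. These two assertions and the helper name italicizeOutsideBlocks are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the original post): wraps *text* or _text_
// in <em> tags, skipping blocks delimited by lines that contain only $.
function italicizeOutsideBlocks(string $content): string
{
    // Assumed boundaries: (?<!\S) requires whitespace or the start of the
    // string before the opening marker, (?!\S) requires whitespace or the
    // end of the string after the closing marker.
    $pattern = '/^\$+(?:\R(?!\$+$).*)*\R\$+$(*SKIP)(*FAIL)'
             . '|(?<!\S)([*_])((?:(?!\1).)+)\1(?!\S)/m';

    return preg_replace_callback(
        $pattern,
        function ($m) {
            // $m[2] is the text between the two markers.
            return "<em>" . $m[2] . "</em>";
        },
        $content
    );
}

echo italicizeOutsideBlocks('A snake_case_name and *word* here.');
// A snake_case_name and <em>word</em> here.
```

The trade-off is that a marker directly followed by punctuation, such as *word*. at the end of a sentence, no longer matches, because the dot is not whitespace.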

