Aurélien Pierre

Reputation: 713

PHP regexp parse avoiding substrings

I'm writing a simple Markdown parser that outputs HTML for pages that also contain some LaTeX equations. For example, for italics:

  // italic
  $content = preg_replace_callback(
    '/(\*|_)(.+)\1/',
    function ($m) {
      return "<i>" . $m[2] . "</i>";
    },
    $content
  );

Unfortunately, a lot of Markdown formatting clashes with LaTeX symbols (and also with code blocks), so I need to escape the LaTeX sections first and parse Markdown only outside of these sections. The LaTeX bits are delimited by $ and $$, so they are easy to spot:

preg_match('/\$+(.*?)\$+/s', $content)

For example, this is a sample of such a page:


## Section title

Lorem ipsum *dolores* sic amet. $E = mc^2$, and since :

$$
\cos(3*\pi*\sqrt{2}) = \delta
$$

So… clash between italics and multiplication.

My first guess is that I should split the content into two arrays: one containing the LaTeX bits with their index, and one containing the non-LaTeX bits located between them. I would then process the second array on its own and merge the two back together.

preg_split() splits on the given pattern and returns the substrings in between, but drops the substrings that match the pattern. It seems this can be changed with the PREG_SPLIT_DELIM_CAPTURE flag so that the matching delimiters are returned as well, but the documentation doesn't show the output data structure when this flag is used, so I don't see how to iterate over the output array and work only on the parts that don't match the pattern.
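For reference, here is a minimal sketch of the array that flag produces and how the parts could be iterated. It assumes the whole LaTeX span is wrapped in a single capture group, in which case the result alternates between outside text (even indexes) and captured LaTeX (odd indexes). The helper name transformOutsideLatex and the delimiter pattern are illustrative assumptions, not part of the original code, and the pattern treats any $...$ pair on one line as LaTeX.

```php
<?php
// Hypothetical helper (not from the original post): applies $transform
// only to the text located outside $...$ and $$...$$ spans.
function transformOutsideLatex(string $content, callable $transform): string
{
    // The whole LaTeX span is one capture group, so with
    // PREG_SPLIT_DELIM_CAPTURE it comes back as its own array element
    // between the surrounding text parts.
    $parts = preg_split(
        '/(\$\$.*?\$\$|\$[^$\n]+\$)/s',
        $content,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    foreach ($parts as $i => $part) {
        // Even indexes: text between delimiters. Odd indexes: captured LaTeX.
        if ($i % 2 === 0) {
            $parts[$i] = $transform($part);
        }
    }

    return implode('', $parts);
}

$sample = <<<'DATA'
Lorem ipsum *dolores* sic amet. $a*b*c$, and since :

$$
\cos(3*\pi*\sqrt{2}) = \delta
$$
DATA;

// Only the non-LaTeX parts go through the italics replacement.
$html = transformOutsideLatex($sample, function (string $text): string {
    return preg_replace('/([*_])((?:(?!\1).)+)\1/', '<i>$2</i>', $text);
});

echo $html;
```

The alternation relies on PREG_SPLIT_NO_EMPTY not being set: empty text parts are kept, so the even/odd positions stay aligned even when two LaTeX spans are adjacent.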

What does this function output, and is there a better or faster way to perform pattern detection outside of regions that match some other pattern?

Upvotes: 0

Views: 45

Answers (1)

The fourth bird

Reputation: 163632

One option is to exclude the blocks that start and end with a line containing only $ characters from the match, using (*SKIP)(*FAIL). Note that this covers only such display blocks; inline $...$ spans are not excluded by this pattern.

Then capture either * or _ in a capture group and use the backreference \1 to match the same char as the closing marker, without allowing that char in between.

^\$+(?:\R(?!\$+$).*)*\R\$+$(*SKIP)(*FAIL)|([*_])((?:(?!\1).)+)\1

The pattern matches:

  • ^ Start of a line (the pattern is used with the m flag)
  • \$+ Match 1+ occurrences of $
  • (?:\R(?!\$+$).*)* Match all following lines that do not consist only of $
  • \R\$+$ Match a newline, then a line consisting only of $
  • (*SKIP)(*FAIL)| Discard what is matched so far and do not retry from inside it; otherwise try the alternative
  • ([*_]) Capture either * or _ in group 1
  • ((?:(?!\1).)+) Capture in group 2 one or more chars (excluding newlines) other than the char captured in group 1
  • \1 Backreference to group 1, matching the same char as captured


Example

$content= <<<'DATA'
## Section title

Lorem ipsum *dolores* sic amet. $E = mc^2$, and since :

$$
\cos(3*\pi*\sqrt{2}) = \delta
$$
DATA;

$content = preg_replace_callback(
    '/^\$+(?:\R(?!\$+$).*)*\R\$+$(*SKIP)(*FAIL)|([*_])((?:(?!\1).)+)\1/m',
    function ($m) {
        return "<i>" . $m[2] . "</i>";
    },
    $content
);

echo $content;

Output

## Section title

Lorem ipsum <i>dolores</i> sic amet. $E = mc^2$, and since :

$$
\cos(3*\pi*\sqrt{2}) = \delta
$$

Note that parsing markup with a regex is brittle and prone to edge cases.

You can make the pattern more specific, for example by asserting whitespace boundaries around the emphasis markers:

^\$+(?:\R(?!\$+$).*)*\R\$+$(*SKIP)(*FAIL)|(?<!\S)([*_])((?:(?!\1).)+)\1(?!\S)
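To illustrate what the whitespace boundaries change, here is a small sketch that compares the emphasis part of the pattern with and without the (?<!\S) and (?!\S) assertions. The LaTeX-skipping branch is left out to keep the comparison short, and the variable names and sample sentence are made up for the illustration.

```php
<?php
// Sample sentence made up for this comparison (not from the original post).
$text = 'Compute 2*3*4, which is *not* emphasis, unlike *this* one.';

// Emphasis part of the pattern, without and with whitespace boundaries.
$loose  = '/([*_])((?:(?!\1).)+)\1/';
$strict = '/(?<!\S)([*_])((?:(?!\1).)+)\1(?!\S)/';

$looseOut  = preg_replace($loose, '<i>$2</i>', $text);
$strictOut = preg_replace($strict, '<i>$2</i>', $text);

echo $looseOut, PHP_EOL;
// Compute 2<i>3</i>4, which is <i>not</i> emphasis, unlike <i>this</i> one.

echo $strictOut, PHP_EOL;
// Compute 2*3*4, which is <i>not</i> emphasis, unlike <i>this</i> one.
```

The trade-off is that emphasis directly followed by punctuation, such as *word*, with a trailing comma, is no longer matched, because (?!\S) requires whitespace or the end of the string after the closing marker.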


Upvotes: 1
