Nostromo

Reputation: 1264

Regex to find markup

I'm sure someone has already asked this question, but I don't know what to search for on Google to find the answers.

I have to "translate" a text with markup to HTML (or RTF or XAML). The markup for "bold" is *. If I'd like the bold text to contain a literal *, I have to escape it with a backslash.

So, the marked-up text...

This is *ju\*st* a test.

...should translate to "This is ju*st a test.", with "ju*st" in bold.

I'm looking for a regex pattern to get all the matches to "translate" to bold inside my marked-up text.

Right now I'm stuck with this one (a literal star, followed by one or more characters that are not a star (as few as possible), followed by a literal star):

\*[^*]+?\*

But how can I change the "one or more characters that are not a star" part so that it doesn't stop at stars that are preceded by a backslash?

I want to use this regex in a .NET project, in case there are differences between the languages.
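
To make the problem concrete, here is a minimal sketch in Python (not .NET, but this pattern behaves the same way in both engines). The names `text` and `naive` are only illustrative. It shows the current pattern stopping at the escaped star:

```python
import re

# The marked-up sample from the question.
text = r"This is *ju\*st* a test."

# The pattern from the question: star, one or more non-stars (lazy), star.
naive = re.compile(r"\*[^*]+?\*")

# The match ends at the escaped star instead of the closing markup star.
print(naive.findall(text))  # ['*ju\\*']
```

The desired match would be the whole span `*ju\*st*`.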

Upvotes: 4

Views: 1219

Answers (2)

Wiktor Stribiżew

Reputation: 627082

You may use

(?<=(?<!\\)(?:\\{2})*)\*[^\\*]*(?:\\.[^\\*]*)*\*

See the .NET regex demo.

Details

  • (?<=(?<!\\)(?:\\{2})*) - a positive lookbehind that makes sure the * at the current location is not escaped, that is, that it is preceded by an even number (possibly zero) of \ chars. In other words, it matches a location that is immediately preceded by:
    • (?<!\\) - no \ char, followed by
    • (?:\\{2})* - zero or more repetitions of a double backslash
  • \* - a * char
  • [^\\*]* - zero or more chars other than \ and *
  • (?: - start of a non-capturing group matching...
    • \\. - a \ char followed by any char other than a newline (compile the pattern with RegexOptions.Singleline to allow a newline to be escaped as well)
    • [^\\*]* - zero or more chars other than \ and *
  • )* - zero or more times
  • \* - a * char.
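
As a rough illustration of how the matched spans could then be turned into HTML, here is a sketch in Python. Python's `re` module does not support the variable-width lookbehind used above, so this sketch swaps in a different technique: an alternation that consumes escaped characters outside bold spans first, followed by the same unrolled body `[^\\*]*(?:\\.[^\\*]*)*`. The names `TOKEN`, `UNESCAPE` and `to_html` are hypothetical, and the `<b>` output tag is an assumption. No HTML escaping of the text is attempted.

```python
import re

TOKEN = re.compile(
    r"\\(.)"                            # group 1: an escaped char outside bold
    r"|\*([^\\*]*(?:\\.[^\\*]*)*)\*",   # group 2: content of a bold span
    re.DOTALL,                          # let "." also match an escaped newline
)
UNESCAPE = re.compile(r"\\(.)", re.DOTALL)

def to_html(text):
    """Replace *bold* spans with <b>...</b> and drop escaping backslashes."""
    def repl(m):
        if m.group(1) is not None:
            # Escaped character outside a bold span: keep the character only.
            return m.group(1)
        # Bold span: unescape its content and wrap it.
        inner = UNESCAPE.sub(lambda e: e.group(1), m.group(2))
        return "<b>" + inner + "</b>"
    return TOKEN.sub(repl, text)

print(to_html(r"This is *ju\*st* a test."))  # This is <b>ju*st</b> a test.
```

In .NET the original pattern can be used directly with Regex.Replace and a match evaluator; the alternation is only needed here because of the lookbehind limitation.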

Upvotes: 1

Right leg

Reputation: 16730

You want to match from one markup star to the next markup star. In your markup language, a literal star is not *, but \*. In regex, this translates to \\\*: a backslash, which must be escaped, then a star, which must be escaped too.

Therefore, you need to specify in your pattern that you're looking for a markup star, as opposed to a literal star.

\*.*[^\\]\*

\*             a markup star
  .*           followed by any number of characters
    [^\\]\*    then a markup star, that is, one not escaped by a backslash

This is a little off though, because .* is greedy, so in "*ju\*st* *ju\*st*", it's going to match the whole string, from the first star to the last.

You can use the lazy (non-greedy) version of the star quantifier: *? in most engines. So it becomes:

\*.*?[^\\]\*

\*             a markup star
  .*?          followed by any number of characters, but as few as possible
     [^\\]\*   then a markup star, that is, one not escaped by a backslash

A quick test in Python:

>>> import re
>>> s = r"*ju\*st* *ju\*st*"
>>> re.match(r"\*.*[^\\]\*", s)
<re.Match object; span=(0, 17), match='*ju\\*st* *ju\\*st*'>
>>> re.match(r"\*.*?[^\\]\*", s)
<re.Match object; span=(0, 8), match='*ju\\*st*'>
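
Since the question asks for all matches, the same lazy pattern can be run over the whole string. The sketch below uses hypothetical names (`LAZY`, `bold_spans`). Its second call probes an edge case: if the markup also allows a backslash to escape itself (an assumption, the question does not say so), the `[^\\]` check rejects a closing star that follows an escaped backslash.

```python
import re

# Lazy pattern from above: star, as few chars as possible,
# then a non-backslash char followed by a star.
LAZY = re.compile(r"\*.*?[^\\]\*")

def bold_spans(text):
    """Return every substring that the lazy pattern treats as a bold span."""
    return [m.group(0) for m in LAZY.finditer(text)]

# Both spans are found separately.
print(bold_spans(r"*ju\*st* *ju\*st*"))  # ['*ju\\*st*', '*ju\\*st*']

# Bold content ending in an escaped backslash: the closing star is not
# recognised, so nothing matches.
print(bold_spans(r"*a\\* b"))  # []
```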

If your regex engine does not support lazy quantifiers, you'll need to make this behaviour explicit:

\*([^*]|\\\*)*[^\\]\*

\*                       a markup star
  (                      then either...
   [^*]                  ...any character but a star...
       |                 ...or...
        \\\*             ...a star prefixed by a backslash, i.e. a literal star
            )*           ...any number of times
              [^\\]\*    then a markup star
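
A quick check of this explicit version, again as a Python sketch with hypothetical names (`EXPLICIT`, `spans`). `finditer` with `group(0)` is used because the pattern contains a capturing group, so `findall` would return only the group's last repetition rather than the whole match:

```python
import re

# Explicit version: no lazy quantifier, escaped stars allowed inside the span.
EXPLICIT = re.compile(r"\*([^*]|\\\*)*[^\\]\*")

s = r"*ju\*st* *ju\*st*"

# Collect the full text of each match.
spans = [m.group(0) for m in EXPLICIT.finditer(s)]
print(spans)  # ['*ju\\*st*', '*ju\\*st*']
```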

Upvotes: 1
