thatidiotguy

Reputation: 9011

Regular Expression Without Lookbehind for Markdown Bolding

So I am trying to write a regular expression for JavaScript that will allow me to replace ** with <strong> tags, as a sort of self-rolled Markdown-to-HTML converter.

e.g.

**bold** -> <strong>bold</strong>

but

\**not** -> **not** because * was escaped.

I have the following regular expression which seems to work well:

/(?<!\\)(?:\\\\)*(\*\*)([^\\\*]+)(\*\*)/g

However, JS does not support lookbehinds! I rewrote it using lookaheads:

/(\*\*)([^\\\*]+)*(\*\*)(?!\\)(?:\\\\)*/g

but this would require me to reverse the string which is undesirable because I need to support multibyte characters (see here). I am not completely opposed to using the library mentioned in that answer, but I would prefer a solution that does not require me to add one if possible.

Is there a way to rewrite my regular expression without using look behinds?

EDIT:

After thinking about this a little more, I'm starting to question whether regular expressions are even the best way to approach this problem, but I will leave the question up out of interest.

Upvotes: 3

Views: 196

Answers (3)

Jordan Running

Reputation: 106147

Consider the following regular expression:

/(.*?)(\\\\|\\\*|\*\*)/g

You can think of this as a tokenizer. It does a non-greedy match of some (or no) text followed by one of the special character sequences \\, \*, or **, tried in that order. Matching in this order ensures that weird edge cases like **foo \** bar\\** are handled correctly (<strong>foo ** bar\</strong>). This makes for a very simple String.prototype.replace with a switch in its replacement function. A boolean bold flag decides whether ** should be replaced with <strong> or </strong>.

const TOKENIZER = /(.*?)(\\\\|\\\*|\*\*)/g;

function render(str) {
  let bold = false;
  return str.replace(TOKENIZER, (_, text, special) => {
    switch (special) {
      case '\\\\':
        return text + '\\';
      case '\\*':
        return text + '*';
      case '**':
        bold = !bold;
        return text + (bold ? '<strong>' : '</strong>');
      default:
        return text + special;
    }
  });
}

Here I'm assuming that \\ should become \ and \* should become *, as in normal Markdown parsers. It's not dissimilar to Dmitry's solution, but simpler. See it in action in the snippet below:

const TOKENIZER = /(.*?)(\\\\|\\\*|\*\*)/g;

function render(str) {
  let bold = false;
  return str.replace(TOKENIZER, (_, text, special) => {
    switch (special) {
      case '\\\\':
        return text + '\\';
      case '\\*':
        return text + '*';
      case '**':
        bold = !bold;
        return text + (bold ? '<strong>' : '</strong>');
      default:
        return text + special;
    }
  });
}

// Test
const input = document.getElementById('input');
const outputText = document.getElementById('output-text');
const outputHtml = document.getElementById('output-html');

function makeOutput(str) {
  // Render once and reuse the result for both output views.
  const result = render(str);
  outputText.value = result;
  outputHtml.innerHTML = result;
}

input.addEventListener('input', evt => makeOutput(evt.target.value));
makeOutput(input.value);
body{font-family:'Helvetica Neue',Helvetica,sans-serif}
textarea{display:block;font-family:monospace;width:100%;margin-bottom:1em}
div{padding:2px;background-color:lightgoldenrodyellow}
<label for="input">Input</label>
<textarea id="input" rows="3">aaa **BBB** ccc \**ddd** EEE \\**fff \**ggg** HHH**</textarea>

Output HTML:
<textarea id="output-text" rows="3" disabled></textarea>

Rendered HTML:
<div id="output-html"></div>
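
For readers who want to check the tokenizer outside a browser, here is a minimal DOM-free sketch. The render function is the same as above; the sample inputs and the expected strings in the comments are my own illustrations, not part of the original answer.

```javascript
// Same tokenizer as above: lazy text followed by \\, \*, or **.
const TOKENIZER = /(.*?)(\\\\|\\\*|\*\*)/g;

function render(str) {
  let bold = false; // toggled on every unescaped **
  return str.replace(TOKENIZER, (_, text, special) => {
    switch (special) {
      case '\\\\': // escaped backslash becomes a single backslash
        return text + '\\';
      case '\\*': // escaped asterisk becomes a literal asterisk
        return text + '*';
      case '**': // unescaped ** opens or closes bold
        bold = !bold;
        return text + (bold ? '<strong>' : '</strong>');
      default:
        return text + special;
    }
  });
}

console.log(render('a **b** c'));            // a <strong>b</strong> c
console.log(render('\\\\**bold**'));         // \<strong>bold</strong>
console.log(render('**foo \\** bar\\\\**')); // <strong>foo ** bar\</strong>
console.log(render('\\**not**'));            // **not<strong>
```

Note the last case: because the flag simply toggles, an unpaired ** still emits an opening <strong> with no matching close. If that matters for your input, you would need an extra pass (or a check of the flag after replace) to handle the dangling tag.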

Upvotes: 0

Dmitry Egorov

Reputation: 9650

One way to work around missing lookbehinds is to match the undesired patterns first and then, using alternation, match the desired pattern. Then apply a conditional replace, substituting the undesired patterns with themselves and the desired ones with what you actually want.

In your particular case this means matching \* first and **<something>** only after that. Then use

// The g flag is needed so that every occurrence is processed, not just the first.
input.replace(/\\\*|\*\*(.*?)\*\*/g, function(m, p1) {
    return m === '\\*' ? m : '<strong>' + p1 + '</strong>';
})

to do the conditional replace.
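
As a quick illustration of this simplified version, here is a small self-contained sketch. The function name boldSimple and the sample inputs are hypothetical, chosen only for this example, and the pattern carries the g flag so that every occurrence is processed.

```javascript
// Simplified approach: match the escaped asterisk first and keep it as-is,
// otherwise wrap the text between two ** pairs in <strong>.
function boldSimple(input) {
  return input.replace(/\\\*|\*\*(.*?)\*\*/g, function (m, p1) {
    return m === '\\*' ? m : '<strong>' + p1 + '</strong>';
  });
}

console.log(boldSimple('**bold**'));     // <strong>bold</strong>
console.log(boldSimple('\\**not**'));    // \**not**   (left untouched)
console.log(boldSimple('\\\\**bold**')); // \\**bold** (also left untouched)
```

The last call shows the limitation: the second backslash of \\ is read as the start of \*, so the bold after an escaped backslash is suppressed.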

The real regex is more complex though. First, you need to guard against an escaped backslash (i.e. \\**bold** should become \\<strong>bold</strong>). So you need to match \\ separately, the same way as you do for \*.

Second, the expression between ** and ** may also contain escaped asterisks and backslashes. To cope with this you need to match \\ and \** explicitly and (using alternation) only after that anything else, non-greedily. This may be represented as (?:\\\\|\\\*\*|\*(?!\*)|[\S\s])*?.

Therefore the final regex becomes

\\\\|\\\*|\*\*((?:\\\\|\\\*\*|\*(?!\*)|[\S\s])*?)\*\*

Demo: https://regex101.com/r/Da35r5/1

JavaScript replace demo:

function convert() {
  var md = document.getElementById("md").value;
  var re = /\\\\|\\\*|\*\*((?:\\\\|\\\*\*|\*(?!\*)|[\S\s])*?)\*\*/g;
  var html = md.replace(re, function(match, p1) {
    return match.startsWith('\\') ? match : '<strong>' + p1 + '</strong>';
  });
  document.getElementById("html").value = html;
}
<span style="display:inline-block">
MD
<textarea id="md" cols="20" rows="10" style="display:block">
**bold**
**foo * bar **
**foo \** bar**
**fo\\\\** bar** **
\**bold** **
\\**bold**
** multi
line**
</textarea>
</span>

<span style="display:inline-block">
HTML
<textarea id="html" cols="50" rows="10" style="display:block">
</textarea>
</span>

<button onclick="convert()" style="display:block">Convert</button>
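
The same conversion can be exercised without the DOM. The following is a minimal sketch using the final regex from above; the names BOLD_RE and convertString are hypothetical, and the sample inputs are mine.

```javascript
// Final regex from the answer: escaped backslash, escaped asterisk,
// or a **...** span whose body may itself contain \\ and \** sequences.
const BOLD_RE = /\\\\|\\\*|\*\*((?:\\\\|\\\*\*|\*(?!\*)|[\S\s])*?)\*\*/g;

function convertString(md) {
  return md.replace(BOLD_RE, function (match, p1) {
    // Escape sequences are put back unchanged; only real bold spans are wrapped.
    return match.startsWith('\\') ? match : '<strong>' + p1 + '</strong>';
  });
}

console.log(convertString('**bold**'));        // <strong>bold</strong>
console.log(convertString('\\**bold**'));      // \**bold**
console.log(convertString('\\\\**bold**'));    // \\<strong>bold</strong>
console.log(convertString('**foo \\** bar**')); // <strong>foo \** bar</strong>
```

Unlike a tokenizer that unescapes, this version leaves the backslashes in the output, so a separate unescaping step would be needed if the final HTML should show plain * and \ characters.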

Upvotes: 3

Agnius Vasiliauskas

Reputation: 11277

Try this pattern, which uses no lookahead or lookbehind at all:

(?:(?:[\\])\*\*(?:.+?)\*\*|(?:[^\\\n]|^)\*\*(.+)\*\*)

Demo

Upvotes: 0
