Reputation: 29326
I'm trying to write a basic Markdown parser, and I want to build a regular expression that can detect links and emphasis.
In Markdown, links look like [text](URL), and emphasis/italics look like *text* or _text_.
I have no problem detecting emphasis or links on their own, but when a link has underscores in it, such as http://example.com/link_to_article, my parser detects _to_ as an attempt at emphasis.
How do I stop this?
My first attempt was to make sure there were no characters immediately before the first underscore or after the second, but mid-word emphasis is valid (it works here on Stack Overflow), so examples like intere_stin_g are legitimate, which rules that idea out.
So how would I accomplish this?
Upvotes: 4
Views: 421
Reputation: 45
There are three main ways to do this.
A big, fancy regex, which'll look something like this:
(?<!\(\s*\S+)_([^_]+)_(?!\S+(?:\s+"[^"]*")?\s*\))
I strongly recommend against this approach, because even that monstrosity isn't fully standard-compliant, and... I mean, who wants to try to decipher that? Even splitting it over multiple lines only makes it a little better. Also, that lookbehind might not even be accepted, depending on your regex engine.
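To illustrate the engine caveat with one concrete case: Python's built-in re module only accepts fixed-width lookbehind, so the lookbehind part of the regex above is rejected at compile time. The helper name compiles below is hypothetical, and this is only a quick check, not a statement about other engines.

```python
import re

# Lookbehind portion of the big regex above; \s* and \S+ make it variable-width.
VARIABLE_LOOKBEHIND = r'(?<!\(\s*\S+)_([^_]+)_'

# A fixed-width lookbehind (exactly one character) for comparison.
FIXED_LOOKBEHIND = r'(?<!\\)_'


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


if __name__ == "__main__":
    print(compiles(VARIABLE_LOOKBEHIND))
    print(compiles(FIXED_LOOKBEHIND))
```

Engines that do allow variable-width lookbehind exist, so whether this approach is even available depends on what you are targeting.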
Disallow mid-word italics using _. This makes your regex a whole lot simpler:
\b_[^_]+_\b
Stack Overflow does this.
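A minimal Python sketch of this second option, assuming you want the emphasized text in a capture group (the group, the italicize name and the <em> output are additions for illustration). It relies on _ counting as a word character, so \b only matches next to an underscore when the character on the other side is a non-word character or the edge of the string.

```python
import re

# Same pattern as above, with a capture group around the emphasized text.
EMPHASIS = re.compile(r'\b_([^_]+)_\b')


def italicize(text: str) -> str:
    """Wrap stand-alone _text_ in <em> tags; mid-word underscores are left alone."""
    return EMPHASIS.sub(r'<em>\1</em>', text)


if __name__ == "__main__":
    print(italicize("An _important_ note"))
    print(italicize("http://example.com/link_to_article"))
    print(italicize("intere_stin_g"))
```

One limitation to be aware of: a URL path segment that itself starts and ends with an underscore, with non-word characters around it, would still be treated as emphasis.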
Orient your entire program around a stream-based design, where you match fragments and parse them as you work through the string. This is a bit harder to code, but would basically be:
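One possible sketch of such a scanner in Python follows. It is an illustration under assumptions, not a reference implementation: the TOKEN pattern, the render function, the bare-URL rule and the HTML output are all hypothetical choices. The idea is that alternatives are tried in order at each position, so a link or URL is consumed whole before the emphasis rule ever sees its underscores.

```python
import re

# One combined pattern; earlier alternatives win at any given position.
TOKEN = re.compile(
    r"\[(?P<text>[^\]]+)\]\((?P<url>[^)\s]+)\)"  # [text](URL)
    r"|(?P<bare>https?://[^\s)]+)"               # bare URL, copied through unchanged
    r"|\*(?P<star>[^*]+)\*"                      # *text*
    r"|_(?P<under>[^_]+)_"                       # _text_ (mid-word allowed)
)


def render(src: str) -> str:
    """Walk the string left to right, emitting plain text and converted tokens."""
    out = []
    pos = 0
    for m in TOKEN.finditer(src):
        out.append(src[pos:m.start()])  # plain text between tokens
        if m.group("url") is not None:
            out.append(f'<a href="{m.group("url")}">{m.group("text")}</a>')
        elif m.group("bare") is not None:
            out.append(m.group("bare"))
        else:
            inner = m.group("star") or m.group("under")
            out.append(f"<em>{inner}</em>")
        pos = m.end()
    out.append(src[pos:])  # trailing plain text
    return "".join(out)


if __name__ == "__main__":
    print(render("intere_stin_g: see [the article](http://example.com/link_to_article)"))
```

This sketch does not escape HTML or parse emphasis inside link text; a real parser would handle both.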
NB: I put [^_] in a few places when that's not strictly accurate; more accurate would be (?:(?<!\\)(\\\\)*\\_|[^_])+; i.e. a series of escaped _ or non-_ characters. Alternatively, you could do something roughly like _.*?(?<!\\)(\\\\)*_; i.e. match from _ until the very next unescaped _.
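As a rough check of the second pattern, here is a small Python sketch; the find_emphasis helper is a hypothetical name. The lookbehind is a single character wide, so Python's re module accepts it. An underscore preceded by one backslash is skipped, while an underscore preceded by an escaped backslash (two backslashes) still closes the emphasis.

```python
import re

# Match from _ to the next _ that is not escaped by an odd number of backslashes.
EMPHASIS = re.compile(r'_.*?(?<!\\)(\\\\)*_')


def find_emphasis(text: str) -> list[str]:
    """Return every span delimited by unescaped underscores."""
    return [m.group(0) for m in EMPHASIS.finditer(text)]


if __name__ == "__main__":
    print(find_emphasis(r'_a\_b_ c'))   # escaped underscore in the middle
    print(find_emphasis(r'_a\\_b_'))    # escaped backslash, then a real underscore
```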
P.S. If you want to learn more about regex, there are a lot of handy tools to help you, like online parsers and tutorials.
Upvotes: 3