Reputation: 5425
I'm trying to write a Markdown parser in Python, not because it's useful but because it's fun and because I'm trying to learn regular expressions.
#! /usr/bin/env python
#-*- coding: utf-8 -*-
import re
class Converter:
    def markdown2html(self, string):
        string = re.sub('\*{3}(.+)\*{3}', '<strong>\\1</strong>', string)
        string = re.sub('\*{2}(.+)\*{2}', '<i>\\1</i>', string)
        string = re.sub('^#{1}(.+)$', '<h1>\\1</h1>', string, flags=re.MULTILINE)
        string = re.sub('^#{2}(.+)$', '<h2>\\1</h2>', string, flags=re.MULTILINE)
        return string
markdown_sting = """
##h2 heading
#H1 heading
This should be a ***bold*** char
#anohter h1
anohter ***bold***
this is a **italic** string
"""
converter = Converter()
print converter.markdown2html(markdown_sting)
It prints
<h1>#h2 heading</h1>
<h1>H1 heading</h1>
This should be a <strong>bold</strong> char
<h1>anohter h1</h1>
anohter <strong>bold</strong>
this is a <i>italic</i> string
As you can see, it does not parse the h2 heading. Where did I go wrong?
Upvotes: 1
Views: 1887
Reputation: 6257
You could make sure to match only the wanted number of hash signs by requiring that the first character of the heading text isn't a hash sign. This can be done by using [^#], like this:
string = re.sub('^#{1}([^#].*)$', '<h1>\\1</h1>', string, flags=re.MULTILINE)
string = re.sub('^#{2}([^#].*)$', '<h2>\\1</h2>', string, flags=re.MULTILINE)
This way the order of the rules won't matter, making the rules more robust.
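As a quick check of that claim, here is a minimal sketch that applies the two patterns above in both orders and compares the results. The names `H1`, `H2`, and `apply_rules` are made up for this illustration and are not part of the original code.

```python
import re

# Hypothetical (pattern, replacement) pairs built from the two rules above.
H1 = (r'^#{1}([^#].*)$', r'<h1>\1</h1>')
H2 = (r'^#{2}([^#].*)$', r'<h2>\1</h2>')

def apply_rules(text, rules):
    """Apply each (pattern, replacement) pair in the given order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text

sample = "##h2 heading\n#H1 heading"
h1_first = apply_rules(sample, [H1, H2])
h2_first = apply_rules(sample, [H2, H1])

print(h1_first)
print(h1_first == h2_first)  # both orders give the same result
```

Because the h1 pattern refuses a second hash right after the first, it never touches the `##` line, so either order leaves that line for the h2 rule.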
Upvotes: 4
Reputation: 310069
When your parser sees #, it does the substitution for h1. Then it tries to do the substitution for h2, but there are no lines starting with ## left, since one of the hashes ('#') was already consumed when parsing the h1 portion.
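To make the intermediate state visible, the following small sketch runs the question's two heading substitutions one at a time on a single line. The variable names are chosen only for this illustration.

```python
import re

line = "##h2 heading"

# First pass: the h1 rule matches because the line starts with at least one '#'.
after_h1 = re.sub(r'^#{1}(.+)$', r'<h1>\1</h1>', line, flags=re.MULTILINE)
print(after_h1)  # <h1>#h2 heading</h1>

# Second pass: the line now starts with '<', so the h2 rule finds nothing.
after_h2 = re.sub(r'^#{2}(.+)$', r'<h2>\1</h2>', after_h1, flags=re.MULTILINE)
print(after_h2 == after_h1)  # True
```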
A simple fix is to exchange the order:
string = re.sub('^#{2}(.+)$', '<h2>\\1</h2>', string, flags=re.MULTILINE)
string = re.sub('^#{1}(.+)$', '<h1>\\1</h1>', string, flags=re.MULTILINE)
In general, when you're applying a series of transforms to data, you should order them from most restrictive to least restrictive in order to avoid these problems.
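One way to apply the "most restrictive first" ordering without writing each rule by hand is to loop from the longest hash prefix down to the shortest. This is only a sketch: the function name `convert_headings` and the `max_level=6` default (HTML defines h1 through h6) are assumptions, not something from the question.

```python
import re

def convert_headings(text, max_level=6):
    """Replace '#'-prefixed lines with <hN> tags, longest prefix first."""
    for level in range(max_level, 0, -1):
        pattern = r'^#{%d}(.+)$' % level
        replacement = r'<h%d>\1</h%d>' % (level, level)
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text

converted = convert_headings("###deep\n##h2 heading\n#H1 heading")
print(converted)
```

Each line is claimed by the most specific rule that fits it, and by the time the shorter patterns run, that line already starts with `<` and is left alone.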
Upvotes: 4
Reputation: 344
A more appropriate and more efficient method may be to compare the first characters of the string, and then perform a simple string replacement:
def markdown2html(self, string):
    # Check the more specific prefix ("##") before the shorter one ("#").
    if string.startswith("##"):
        string = "<h2>" + string[2:] + "</h2>"
    elif string.startswith("#"):
        string = "<h1>" + string[1:] + "</h1>"
    return string
That way you are doing simple string manipulation rather than regex, though it handles only one line at a time. In all cases, order matters.
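Since a prefix check like this only looks at the start of the string it is given, a multi-line document has to be split into lines first. A minimal sketch of that, with the hypothetical helper `convert_line` and a standalone function instead of the question's `Converter` class:

```python
def convert_line(line):
    """Convert one line; the longer prefix is tested first."""
    if line.startswith("##"):
        return "<h2>" + line[2:] + "</h2>"
    if line.startswith("#"):
        return "<h1>" + line[1:] + "</h1>"
    return line

def markdown2html(text):
    """Apply convert_line to every line and join the results back together."""
    return "\n".join(convert_line(line) for line in text.split("\n"))

html = markdown2html("##h2 heading\n#H1 heading\nplain text")
print(html)
```

Splitting on "\n" and joining with the same separator keeps blank lines and any trailing newline as they were in the input.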
Upvotes: 1
Reputation: 599788
Those regexes are evaluated in order. The h1 regex will grab any line beginning with a # and convert it to <h1>. So by the time it gets to the h2 regex, the line no longer begins with ##. Swap those two expressions around.
Upvotes: 1