gaggina

Reputation: 5425

regex issue in making a markdown parser

I'm trying to write a Markdown parser in Python, not because it's useful but because it's fun and because I'm trying to learn regular expressions.

#! /usr/bin/env python
#-*- coding: utf-8 -*-

import re

class Converter:

    def markdown2html(self, string):

        string = re.sub(r'\*{3}(.+)\*{3}', '<strong>\\1</strong>', string)
        string = re.sub(r'\*{2}(.+)\*{2}', '<i>\\1</i>', string)
        string = re.sub('^#{1}(.+)$', '<h1>\\1</h1>', string, flags=re.MULTILINE)
        string = re.sub('^#{2}(.+)$', '<h2>\\1</h2>', string, flags=re.MULTILINE)

        return string

markdown_sting = """
##h2 heading
#H1 heading
This should be a ***bold*** char
#anohter h1
anohter ***bold***
this is a **italic** string
"""

converter = Converter()
print(converter.markdown2html(markdown_sting))

It prints

<h1>#h2 heading</h1>
<h1>H1 heading</h1>
This should be a <strong>bold</strong> char
<h1>anohter h1</h1>
anohter <strong>bold</strong>
this is a <i>italic</i> string

As you can see, it does not parse the h2 heading. Where did I go wrong?

Upvotes: 1

Views: 1887

Answers (4)

David Pärsson

Reputation: 6257

You can match exactly the wanted number of hash signs by requiring that the first character of the heading text isn't a hash sign. This can be done with [^#], like this:

string = re.sub('^#{1}([^#].*)$', '<h1>\\1</h1>', string, flags=re.MULTILINE)
string = re.sub('^#{2}([^#].*)$', '<h2>\\1</h2>', string, flags=re.MULTILINE)

This way the order of the rules won't matter, making the rules more robust.
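As a quick check of the order-independence claim, the sketch below applies the two patterns above in both orders and compares the results. It is written for Python 3; the helper name `apply_rules` and the sample text are made up for illustration.

```python
import re

# Patterns from the answer above: the heading text must not start with '#'.
H1 = (r'^#{1}([^#].*)$', r'<h1>\1</h1>')
H2 = (r'^#{2}([^#].*)$', r'<h2>\1</h2>')


def apply_rules(text, rules):
    """Apply each (pattern, replacement) rule in the given order (hypothetical helper)."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text


sample = "##h2 heading\n#H1 heading"

h1_first = apply_rules(sample, [H1, H2])
h2_first = apply_rules(sample, [H2, H1])
```

Both `h1_first` and `h2_first` should hold the same converted text, since neither pattern can match a line intended for the other.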

Upvotes: 4

mgilson

Reputation: 310069

When your parser sees a line starting with #, it does the substitution for h1. Then it tries the substitution for h2, but no line starts with ## any more, since the first hash ('#') was already consumed when the h1 rule wrapped the line in <h1> tags.

A simple fix is to exchange the order:

string = re.sub('^#{2}(.+)$', '<h2>\\1</h2>', string, flags=re.MULTILINE)
string = re.sub('^#{1}(.+)$', '<h1>\\1</h1>', string, flags=re.MULTILINE)

In general, when you're applying transforms to data, you should order them from most restrictive to least restrictive to avoid these problems.
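The same ordering idea extends to more heading levels: try the longest run of hashes first. This is only a minimal sketch, assuming Python 3 and six heading levels as in HTML; `convert_all_headings` and `max_level` are illustrative names, not part of the question's code.

```python
import re


def convert_all_headings(text, max_level=6):
    """Replace '#'-prefixed lines, most restrictive (longest) prefix first."""
    for level in range(max_level, 0, -1):
        pattern = r'^#{%d}(.+)$' % level
        replacement = r'<h%d>\1</h%d>' % (level, level)
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text


result = convert_all_headings("###h3\n##h2\n#h1")
```

Once a line has been wrapped in tags it no longer starts with a hash, so the shorter patterns that run later leave it alone.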

Upvotes: 4

Tyndyll

Reputation: 344

A more appropriate and more efficient method may be to compare the first characters of the string and then perform a simple string replace:

def markdown2html(self, string):

    # Check "##" first: a line starting with "##" also starts with "#".
    if string[0:2] == "##":
        string = string.replace("##", "<h2>", 1) + "</h2>"
    elif string[0:1] == "#":
        string = string.replace("#", "<h1>", 1) + "</h1>"
    return string

That way you are doing simple string manipulation rather than regex, applied to one line at a time. In all cases, though, order matters.
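Because this approach inspects only the start of a single string, multi-line input has to be split into lines first. A rough sketch of how that might look in Python 3, using standalone functions instead of the question's class; `heading_to_html` is a hypothetical helper name.

```python
def heading_to_html(line):
    """Convert one line; check '##' before '#' because order matters."""
    if line.startswith("##"):
        return "<h2>" + line[2:] + "</h2>"
    if line.startswith("#"):
        return "<h1>" + line[1:] + "</h1>"
    return line


def markdown2html(text):
    # Process the input one line at a time and rejoin with newlines.
    return "\n".join(heading_to_html(line) for line in text.splitlines())


html = markdown2html("##h2 heading\n#H1 heading\nplain text")
```

Slicing off the prefix avoids touching any hash signs that appear later in the line.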

Upvotes: 1

Daniel Roseman

Reputation: 599788

Those regexes are evaluated in order. The h1 regex will grab any line beginning with a # and convert it to <h1>. So by the time it gets to the h2 regex, the line no longer begins with ##. Swap those two expressions around.
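To see this concretely, the snippet below runs only the h1 substitution from the question on a `##` line and then checks whether the h2 pattern can still match. It assumes Python 3, and the variable names are illustrative.

```python
import re

line = "##h2 heading"

# Step 1: the h1 rule from the question runs first and consumes one '#'.
after_h1 = re.sub(r'^#{1}(.+)$', r'<h1>\1</h1>', line, flags=re.MULTILINE)

# Step 2: the h2 rule now finds nothing, because the line starts with '<'.
h2_match = re.search(r'^#{2}(.+)$', after_h1, flags=re.MULTILINE)
```

Here `after_h1` is `<h1>#h2 heading</h1>`, matching the first line of the question's output, and `h2_match` is `None`.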

Upvotes: 1
