Bing Sun

Reputation: 245

How to convert Markdown-style tags to HTML in Python?

I want to replace some Markdown-style tags with HTML tags.

For example:

#Title1#
##title2##
Welcome to **My Home Page**

should be turned into

<h1>Title1</h1>
<h2>title2</h2>
Welcome to <b>My Home Page</b>

I don't know how to do that. For Title1, I tried this:

#!/usr/bin/env python3
import re
text = '''
        #Title1#
        ##title2##
'''
p = re.compile('^#\w*#\n$')
print(p.sub('<h1>\w*</h1>',text))

but nothing happens; the text is printed unchanged:

 #Title1#
 ##title2##

How can BBCode/Markdown-style markup like this be converted into HTML tags?

Upvotes: 3

Views: 2415

Answers (2)

Jongware

Reputation: 22447

Your regular expression does not work because, in the default mode, ^ and $ match the beginning and the end of the whole string, respectively.

'^'

(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline (my emph.)

'$'

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.

(7.2.1. Regular Expression Syntax)
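
The difference described in the quoted documentation can be checked directly. This is a small sketch using the docs' own 'foo1\nfoo2\n' sample plus a string shaped like the question's input (without the leading indentation, which is an assumption here):

```python
import re

sample = 'foo1\nfoo2\n'

# Default mode: $ matches only at the end of the string
# (or just before a final newline), so the last line is found.
default_match = re.search(r'foo.$', sample).group()

# MULTILINE mode: $ also matches before every newline,
# so the first line is found first.
multiline_match = re.search(r'foo.$', sample, re.MULTILINE).group()

# ^ behaves the same way for line starts.
text = '\n#Title1#\n##title2##\n'
default_starts = re.findall(r'^#+', text)                  # ^ only at position 0
multiline_starts = re.findall(r'^#+', text, re.MULTILINE)  # ^ after each newline

print(default_match, multiline_match)
print(default_starts, multiline_starts)
```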

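Add the flag re.MULTILINE to your compile call, and drop the \n before $. In MULTILINE mode $ already matches just before each newline, whereas \n$ would only match when the line is followed by a blank line or by the end of the string:

p = re.compile(r'^#(\w*)#$', re.MULTILINE)

and it should work, at least for single words such as your example. A better check would be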

p = re.compile(r'^#([^#\n]*)#$', re.MULTILINE)

– any sequence on a single line that does not contain a #.

In both expressions, you need to add parentheses around the part you want to copy so you can use that text in your replacement code. See the official documentation on Grouping for that.
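
Putting the pieces together, a minimal sketch of the whole substitution might look like this. It assumes the input lines are not indented (the leading spaces inside the question's triple-quoted string would stop ^# from matching), and it uses $ without a preceding \n, since in MULTILINE mode $ already matches just before a newline. The captured group is referred to as \1 in the replacement:

```python
import re

# Sample input, assumed to have no leading whitespace on each line.
text = '#Title1#\n##title2##\n'

# The parentheses capture the heading text; [^#\n]* keeps the match
# on one line and rejects lines that start with a second '#'.
p = re.compile(r'^#([^#\n]*)#$', re.MULTILINE)

# \1 in the replacement inserts whatever the group captured.
result = p.sub(r'<h1>\1</h1>', text)
print(result)
```

Only the single-# line is rewritten; the ##title2## line is left alone because the character after the first # is another #.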

Upvotes: 1

Asunez

Reputation: 2347

Check this regex: demo

Here you can see how I substituted #...# with <h1>...</h1>. I believe you can get this to work with double # and so on to cover other markdown features, but you should still heed the comments from @Thomas and @nhahtdh and use a markdown parser. Using regexes in such cases is unreliable, slow and unsafe.

As for inline text like **...** to <b>...</b>, you can try this regex with substitution: demo. I hope you can tweak this for other features like underlining and so on.
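
For illustration only, here is one way the two substitutions could be combined. This is a sketch, not the regexes from the linked demos; the function name mini_markup_to_html is hypothetical, and it assumes headings occupy a whole unindented line with the same number of # on both sides:

```python
import re

def mini_markup_to_html(text):
    """Convert #...# headings and **...** bold to HTML (illustrative only)."""

    def heading(match):
        # The number of leading '#' characters picks the heading level.
        level = len(match.group(1))
        return f'<h{level}>{match.group(2)}</h{level}>'

    # \1 requires the closing run of '#' to equal the opening run.
    text = re.sub(r'^(#{1,6})([^#\n]+)\1$', heading, text, flags=re.MULTILINE)

    # Non-greedy match so two bold spans on one line stay separate.
    text = re.sub(r'\*\*(.+?)\*\*', r'<b>\1</b>', text)
    return text

source = '#Title1#\n##title2##\nWelcome to **My Home Page**'
html = mini_markup_to_html(source)
print(html)
```

Note that this does not escape HTML already present in the input or handle nesting, which is part of why a real markdown parser is the safer choice.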

Upvotes: 4
