Alex S

Reputation: 4884

Making a (hopefully simple) wiki parser with Python

With the help of joksnet's programs here I've managed to get plaintext Wikipedia articles that I'm looking for.

The text returned includes Wiki markup for the headings, so for example, the sections of the Albert Einstein article are returned like this:

==Biography==

===Early life and education===
blah blah blah

What I'd really like to do is feed the retrieved text to a function that wraps all the top-level section headings in bold HTML tags and the second-level section headings in italics, like this:

<b>Biography</b>

<i>Early life and education</i>
blah blah blah

But I'm afraid I don't know how to even start, at least not without making the function dangerously naive. Do I need to use regular expressions? Any suggestions greatly appreciated.

PS Sorry if "parsing" is too strong a word for what I'm trying to do here.

Upvotes: 1

Views: 2140

Answers (3)

Alex S

Reputation: 4884

I ended up doing this:

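def parseWikiTitles(x):
    # Handle '===' first: it contains '==', so the '==' pass must run afterwards.
    counter = 1

    while '===' in x:
        if counter == 1:
            # Odd occurrences open a second-level heading.
            x = x.replace('===', '<i>', 1)
            counter = 2

        else:
            # Even occurrences close it.
            x = x.replace('===', '</i>', 1)
            counter = 1

    counter = 1

    while '==' in x:
        if counter == 1:
            # Odd occurrences open a top-level heading.
            x = x.replace('==', '<b>', 1)
            counter = 2

        else:
            # Even occurrences close it.
            x = x.replace('==', '</b>', 1)
            counter = 1

    # Strip the spaces just inside the tags, left by headings like '== title =='.
    x = x.replace('<b> ', '<b>')
    x = x.replace(' </b>', '</b>')
    x = x.replace('<i> ', '<i>')
    x = x.replace(' </i>', '</i>')

    return x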

I pass the text containing the wiki headings to that function, and it returns the same text with == and === replaced by bold and italic HTML tags. The last four replace calls strip the spaces just inside the tags, so == title == becomes <b>title</b> instead of <b> title </b>.

Has worked without problem so far.
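
For comparison, here is a rough regex-based sketch of the same idea. It is not what I actually ran; the function name wiki_headings_to_html and both patterns are made up for illustration. It only touches lines that consist entirely of a heading, so a stray == in the body text is left alone, and deeper levels such as ==== are not converted at all.

```python
import re

# Hypothetical patterns: a heading must fill the whole line. [ \t]* is used
# instead of \s* so the match never swallows the newline after the heading.
H3 = re.compile(r'^===[ \t]*([^=\s].*?)[ \t]*===[ \t]*$', re.MULTILINE)
H2 = re.compile(r'^==[ \t]*([^=\s].*?)[ \t]*==[ \t]*$', re.MULTILINE)


def wiki_headings_to_html(text):
    """Wrap ==level 2== headings in <b> and ===level 3=== headings in <i>."""
    text = H3.sub(r'<i>\1</i>', text)
    text = H2.sub(r'<b>\1</b>', text)
    return text


if __name__ == '__main__':
    sample = '==Biography==\n\n===Early life and education===\nblah blah blah\n'
    print(wiki_headings_to_html(sample))
```

The [^=\s] at the start of each group is what keeps the == pattern from matching a === line, so the order of the two substitutions should not matter here.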

Thanks for the help guys, Alex

Upvotes: 1

svick

Reputation: 244757

I think the best way here would be to let MediaWiki take care of the parsing. I don't know the library you're using, but basically this is the difference between

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Albert%20Einstein&rvprop=content

which returns the raw wikitext and

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Albert%20Einstein&rvprop=content&rvparse

which returns the parsed HTML.
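
If you are building the request yourself, a minimal sketch of assembling those two URLs with the standard library could look like the following. The helper name build_revisions_url is hypothetical, and the parameters are simply the ones from the URLs above; the sketch does not send the request or check how a given MediaWiki version responds to them.

```python
from urllib.parse import quote, urlencode

API_ENDPOINT = 'http://en.wikipedia.org/w/api.php'


def build_revisions_url(title, parsed=False):
    """Build the query URL for a page's latest revision content.

    With parsed=True the bare 'rvparse' flag from the second URL is appended.
    """
    params = {
        'action': 'query',
        'prop': 'revisions',
        'titles': title,
        'rvprop': 'content',
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    url = API_ENDPOINT + '?' + urlencode(params, quote_via=quote)
    if parsed:
        url += '&rvparse'
    return url


if __name__ == '__main__':
    print(build_revisions_url('Albert Einstein'))
    print(build_revisions_url('Albert Einstein', parsed=True))
```

The resulting string can then be handed to whatever HTTP client you already use.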

Upvotes: 2

Rapture

Reputation: 113

You can use regex and scraping modules like Scrapy and Beautiful Soup to parse and scrape wiki pages. Now that you have clarified your question, I suggest the py-wikimarkup module, which is hosted on GitHub (https://github.com/dcramer/py-wikimarkup/). I hope that helps.

Upvotes: 1
