Reputation: 4884
With the help of joksnet's programs here I've managed to get plaintext Wikipedia articles that I'm looking for.
The text returned includes Wiki markup for the headings, so for example, the sections of the Albert Einstein article are returned like this:
==Biography==
===Early life and education===
blah blah blah
What I'd really like to do is feed the retrieved text to a function that wraps all the top-level section headings in bold HTML tags and the second-level headings in italics, like this:
<b>Biography</b>
<i>Early life and education</i>
blah blah blah
But I'm afraid I don't know how to even start, at least not without making the function dangerously naive. Do I need to use regular expressions? Any suggestions greatly appreciated.
PS Sorry if "parsing" is too strong a word for what I'm trying to do here.
Upvotes: 1
Views: 2140
Reputation: 4884
I ended up doing this:
def parseWikiTitles(x):
    # Handle '===' first so the '==' pass does not consume part of a
    # second-level marker. Markers alternate: opening tag, then closing tag.
    counter = 1
    while '===' in x:
        if counter == 1:
            x = x.replace('===', '<i>', 1)
            counter = 2
        else:
            x = x.replace('===', '</i>', 1)
            counter = 1
    counter = 1
    while '==' in x:
        if counter == 1:
            x = x.replace('==', '<b>', 1)
            counter = 2
        else:
            x = x.replace('==', '</b>', 1)
            counter = 1
    # Strip the space just inside each tag: '<b> title </b>' -> '<b>title</b>'.
    x = x.replace('<b> ', '<b>')
    x = x.replace(' </b>', '</b>')
    x = x.replace('<i> ', '<i>')
    x = x.replace(' </i>', '</i>')
    return x
I pass the string of text with wiki titles to that function and it returns the same text with the == and === markers replaced by bold and italic HTML tags. The last step strips the spaces just inside the tags, so for example == title ==
gets converted to <b>title</b>
instead of <b> title </b>
Has worked without problem so far.
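For comparison, here is a rough regex-based sketch of the same idea. It is not what I actually ran; the names HEADING_RE and wrap_wiki_headings are made up for illustration, and it assumes every heading sits on its own line with the same number of = signs on both sides. Level 4 and deeper headings are deliberately left untouched.

```python
import re

# One heading per line: "== Title ==" or "===Title===".
# \1 forces the closing run of "=" to match the opening run; the lookarounds
# reject runs longer than three, so deeper headings are not rewritten.
HEADING_RE = re.compile(
    r"^(={2,3})(?!=)[ \t]*(.+?)[ \t]*(?<!=)\1[ \t]*$",
    re.MULTILINE,
)


def wrap_wiki_headings(text):
    """Wrap level-2 headings in <b> and level-3 headings in <i>."""
    def repl(match):
        tag = "b" if len(match.group(1)) == 2 else "i"
        return "<{0}>{1}</{0}>".format(tag, match.group(2))
    return HEADING_RE.sub(repl, text)


if __name__ == "__main__":
    sample = "==Biography==\n=== Early life and education ===\nblah blah blah"
    print(wrap_wiki_headings(sample))
```

Because the pattern is anchored to whole lines, a stray == inside body text is not converted, which is the main thing the replace loop above cannot guarantee.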
Thanks for the help guys, Alex
Upvotes: 1
Reputation: 244757
I think the best way here would be to let MediaWiki take care of the parsing. I don't know the library you're using, but basically this is the difference between a request that returns the raw wikitext and one that returns the parsed HTML.
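As a hedged illustration of that difference, the sketch below builds two MediaWiki Action API URLs: one using action=query with prop=revisions for the raw wikitext, and one using action=parse for rendered HTML. These are assumptions about which requests are meant, not necessarily the exact ones referred to above; the helper names and the English Wikipedia endpoint are chosen only for the example, and the code builds the URLs without fetching them.

```python
from urllib.parse import urlencode

# Assumed endpoint: English Wikipedia's Action API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def raw_wikitext_url(title):
    """URL for a request that returns the page's raw wikitext."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def parsed_html_url(title):
    """URL for a request that returns the page rendered to HTML."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(raw_wikitext_url("Albert Einstein"))
    print(parsed_html_url("Albert Einstein"))
```

With the second form the headings already arrive as HTML heading elements, so no wiki markup has to be converted by hand.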
Upvotes: 2
Reputation: 113
You can use regular expressions and scraping modules such as Scrapy and Beautiful Soup to parse and scrape wiki pages. Now that you have clarified your question, I suggest the py-wikimarkup module, which is hosted on GitHub at https://github.com/dcramer/py-wikimarkup/ . I hope that helps.
Upvotes: 1