Reputation: 2229
I would like to use Python to extract content formatted in MediaWiki markup that follows a particular string. For example, the 2012 U.S. presidential election article contains fields called "nominee1" and "nominee2". Toy example:
In [1]: markup = get_wikipedia_markup('United States presidential election, 2012')
In [2]: markup
Out[2]:
u"{{
| nominee1 = '''[[Barack Obama]]'''\n
| party1 = Democratic Party (United States)\n
| home_state1 = [[Illinois]]\n
| running_mate1 = '''[[Joe Biden]]'''\n
| nominee2 = [[Mitt Romney]]\n
| party2 = Republican Party (United States)\n
| home_state2 = [[Massachusetts]]\n
| running_mate2 = [[Paul Ryan]]\n
}}"
Using the election article above as an example, I would like to extract the information that immediately follows the "nomineeN" field but comes before the next field (demarcated by a pipe "|"). Thus, given the example above, I would ideally like to extract "Barack Obama" and "Mitt Romney" -- or at least the syntax in which they're embedded ('''[[Barack Obama]]''' and [[Mitt Romney]]). Other regexes I've tried have extracted links from the wiki markup, but my (failed) attempts at using a positive lookbehind assertion have been something of the flavor of:
nominees = re.findall(r'(?<=\|nominee\d\=)\S+',markup)
My thinking is that it should find strings like "|nominee1=" and "|nominee2=" with some whitespace possible between "|", "nominee", "=" and then return the content following it like "Barack Obama" and "Mitt Romney".
Upvotes: 2
Views: 1089
Reputation: 1108
Use mwparserfromhell! It condenses your code and is more reliable for capturing the result than a hand-written regex. For usage with this example:
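import mwparserfromhell as mw

# get_wikipedia_markup is the asker's own helper that returns the article's wikitext
text = get_wikipedia_markup('United States presidential election, 2012')
code = mw.parse(text)
templates = code.filter_templates()
for template in templates:
    # matches() ignores surrounding whitespace and first-letter case in the template name
    if template.name.matches('Infobox election'):
        nominee1 = template.get('nominee1').value
        nominee2 = template.get('nominee2').value
        print(nominee1)
        print(nominee2)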
Very simple thing to do to capture the result.
Upvotes: 3
Reputation: 11
For infobox data like this, it's best to use DBpedia. They've done all the extraction work for you :)
http://wiki.dbpedia.org/Downloads38
See the "Ontology Infobox Properties" file. You don't have to be an ontology expert here. Just use a simple TSV parser to find the info you need!
Upvotes: 1
Reputation: 65903
First of all, you're missing a space after nominee\d. You probably want nominee\d\s*\=. In addition, you really don't want to be parsing markup with regex. Try using one of the suggestions here instead.
If you must do it with regex, why not a slightly more readable multi-line solution?
import re
markup_string = """{{
| nominee1 = '''[[Barack Obama]]'''
| party1 = Democratic Party (United States)
| home_state1 = [[Illinois]]
| running_mate1 = '''[[Joe Biden]]'''
| nominee2 = [[Mitt Romney]]
| party2 = Republican Party (United States)
| home_state2 = [[Massachusetts]]
| running_mate2 = [[Paul Ryan]]<br>
}}"""
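for match in re.finditer(r'(nominee\d\s*\=)[^|]*', markup_string, re.S):
    # group 1 ends just after the "=", group 0 ends just before the next "|"
    end_nominee, end_line = match.end(1), match.end(0)
    print(end_nominee, end_line)
    # strip() drops the whitespace and newline surrounding the field value
    print(markup_string[end_nominee:end_line].strip())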
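If you'd rather pull every field at once, here is a rough sketch that goes line by line instead. It assumes each field sits on its own line as "| name = value", which holds for the toy markup but not for every infobox. The names FIELD and strip_wiki are made up for this example, and the sample markup is a shortened copy of the one above.

```python
import re

markup = """{{
| nominee1 = '''[[Barack Obama]]'''
| party1 = Democratic Party (United States)
| home_state1 = [[Illinois]]
| nominee2 = [[Mitt Romney]]
| home_state2 = [[Massachusetts]]
}}"""

# Assumption: one field per line, written as "| name = value".
# [ \t]* is used instead of \s* so a match never runs onto the next line.
FIELD = re.compile(r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)

# Map each field name to its raw (still marked-up) value.
fields = dict(FIELD.findall(markup))


def strip_wiki(value):
    """Drop bold quotes and link brackets; keep the label of a piped link."""
    value = value.replace("'''", "")
    return re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", value)


nominees = [strip_wiki(fields[key])
            for key in sorted(fields) if key.startswith("nominee")]
print(nominees)
```

This only copes with flat, one-line values. Nested templates or values that wrap across lines are exactly where a real wikitext parser earns its keep.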
Upvotes: 0
Reputation: 25582
Lookbehinds aren't necessary here—it's much easier to use matching groups to specify exactly what should be extracted from the string. (In fact, lookbehinds can't work here with Python's regular expression engine, since the optional spaces make the expression variable-width.)
Try this regex:
\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?
Results:
re.findall(r"\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?", markup)
# => ['Barack Obama', 'Mitt Romney']
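To see the variable-width point in practice, here is a small sketch. The lookbehind pattern is my guess at what a whitespace-tolerant version of the asker's attempt would look like, and the markup is a shortened copy of the toy example.

```python
import re

markup = (
    "{{\n"
    "| nominee1 = '''[[Barack Obama]]'''\n"
    "| party1 = Democratic Party (United States)\n"
    "| nominee2 = [[Mitt Romney]]\n"
    "| party2 = Republican Party (United States)\n"
    "}}"
)

# A lookbehind containing \s* has no fixed width, so the re module
# refuses to compile it.
try:
    re.compile(r"(?<=\|\s*nominee\d\s*=)\S+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# A capturing group has no such restriction: the optional whitespace sits
# outside the group and only the link target is returned.
pattern = re.compile(r"\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?")
nominees = pattern.findall(markup)

print(lookbehind_ok)
print(nominees)
```

Note that for a piped link such as [[Target|label]] the group would capture "Target|label", so you may want to split on "|" afterwards.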
Upvotes: 1