Brian Keegan

Reputation: 2229

Regular expression for extracting fields from wiki template markup

I would like to use Python to extract content formatted in MediaWiki markup that follows a particular string. For example, the 2012 U.S. presidential election article contains fields called "nominee1" and "nominee2". Toy example:

In [1]: markup = get_wikipedia_markup('United States presidential election, 2012')
In [2]: markup
Out[2]:
u"{{
| nominee1 = '''[[Barack Obama]]'''\n
| party1 = Democratic Party (United States)\n
| home_state1 = [[Illinois]]\n
| running_mate1 = '''[[Joe Biden]]'''\n
| nominee2 = [[Mitt Romney]]\n
| party2 = Republican Party (United States)\n
| home_state2 = [[Massachusetts]]\n
| running_mate2 = [[Paul Ryan]]\n
}}"

Using the election article above as an example, I would like to extract the information that immediately follows the "nomineeN" field and precedes the next field (demarcated by a pipe, "|"). Given the example above, I would ideally like to extract "Barack Obama" and "Mitt Romney", or at least the syntax in which they're embedded ('''[[Barack Obama]]''' and [[Mitt Romney]]). Other regexes have extracted links from the wiki markup, but my (failed) attempts at using a positive lookbehind assertion have been along the lines of:

nominees = re.findall(r'(?<=\|nominee\d\=)\S+',markup)

My thinking is that it should find strings like "|nominee1=" and "|nominee2=" with some whitespace possible between "|", "nominee", "=" and then return the content following it like "Barack Obama" and "Mitt Romney".

Upvotes: 2

Views: 1089

Answers (4)

Hairr

Reputation: 1108

Use mwparserfromhell! It parses the wikitext into templates and parameters for you, so the code is shorter and more reliable than a hand-written regex. For this example:

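import mwparserfromhell as mw

text = get_wikipedia_markup('United States presidential election, 2012')
code = mw.parse(text)
for template in code.filter_templates():
    # matches() tolerates surrounding whitespace in the template name,
    # which a plain == comparison against the name does not
    if template.name.matches('Infobox election'):
        # .value still contains the wiki markup; call .strip_code() on it for plain text
        nominee1 = template.get('nominee1').value
        nominee2 = template.get('nominee2').value
        print(nominee1)
        print(nominee2)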

Very simple thing to do to capture the result.

Upvotes: 3

BJH

Reputation: 11

For infobox data like this, it's best to use DBpedia. They've done all the extraction work for you :)

http://wiki.dbpedia.org/Downloads38

See the "Ontology Infobox Properties" file. You don't have to be an ontology expert here. Just use a simple TSV parser to find the info you need!
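
As a rough illustration of the "simple TSV parser" idea, here is a minimal sketch using Python's csv module. The three-column layout (subject, property, value), the header names, the sample rows, and the find_values helper are all assumptions made for the example; check the actual DBpedia file for its real column layout before relying on this.

```python
import csv
import io

# Hypothetical excerpt: the real DBpedia file's columns and values may differ.
sample_tsv = (
    "subject\tproperty\tvalue\n"
    "United_States_presidential_election,_2012\tnominee1\tBarack Obama\n"
    "United_States_presidential_election,_2012\tparty1\tDemocratic Party (United States)\n"
    "United_States_presidential_election,_2012\tnominee2\tMitt Romney\n"
)


def find_values(tsv_file, subject, prefix):
    """Return the values of every property of `subject` whose name starts with `prefix`."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [
        row["value"]
        for row in reader
        if row["subject"] == subject and row["property"].startswith(prefix)
    ]


nominees = find_values(
    io.StringIO(sample_tsv),
    "United_States_presidential_election,_2012",
    "nominee",
)
print(nominees)
```

For a real download you would pass an open file object instead of the io.StringIO wrapper.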

Upvotes: 1

Chinmay Kanchi

Reputation: 65903

First of all, your pattern doesn't allow for the space after nominee\d. You probably want nominee\d\s*\=. In addition, you really don't want to be parsing markup with regex. Try using one of the suggestions here instead.

If you must do it with regex, why not a slightly more readable multi line solution?

import re

markup_string = """{{
| nominee1 = '''[[Barack Obama]]'''
| party1 = Democratic Party (United States)
| home_state1 = [[Illinois]]
| running_mate1 = '''[[Joe Biden]]'''
| nominee2 = [[Mitt Romney]]
| party2 = Republican Party (United States)
| home_state2 = [[Massachusetts]]
| running_mate2 = [[Paul Ryan]]
}}"""

for match in re.finditer(r'(nominee\d\s*\=)[^|]*', markup_string, re.S):
    # group 1 ends right after "nomineeN ="; the whole match ends before the next "|"
    end_nominee, end_line = match.end(1), match.end(0)
    print(end_nominee, end_line)
    print(markup_string[end_nominee:end_line])
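
One limitation of stopping at the next "|" is that a piped link such as [[Page|label]] would cut the value short. A line-based variant avoids that. The sketch below is only an illustration: it assumes every field sits on its own line in the form "| key = value", and parse_fields is a made-up helper name.

```python
import re

# Sample markup shortened from the question's toy example.
markup = """{{
| nominee1 = '''[[Barack Obama]]'''
| party1 = Democratic Party (United States)
| home_state1 = [[Illinois]]
| nominee2 = [[Mitt Romney]]
| party2 = Republican Party (United States)
}}"""

# One "| key = value" pair per line; re.M lets ^ and $ match at line boundaries.
# [ \t]* is used instead of \s* so the pattern never runs across a newline.
FIELD = re.compile(r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.M)


def parse_fields(text):
    """Return a dict mapping each field name to its raw (still marked-up) value."""
    return dict(FIELD.findall(text))


fields = parse_fields(markup)
# Keep only the nomineeN fields, in the order they appear in the markup.
nominees = [value for key, value in fields.items() if re.fullmatch(r"nominee\d+", key)]
print(nominees)
```

The values still carry the ''' and [[ ]] markup, so they need a further cleanup pass if you want plain names.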

Upvotes: 0

Jon Gauthier

Reputation: 25582

Lookbehinds aren't necessary here—it's much easier to use matching groups to specify exactly what should be extracted from the string. (In fact, lookbehinds can't work here with Python's regular expression engine, since the optional spaces make the expression variable-width.)
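
To see the variable-width restriction for yourself, here is a small sketch against the standard re module. The variable names are just for illustration.

```python
import re

# Fixed-width look-behind: compiles, but only matches the exact text "|nomineeN=".
fixed = re.compile(r"(?<=\|nominee\d=)\S+")

# Variable-width look-behind (note the \s*): Python's re module refuses to compile it.
try:
    re.compile(r"(?<=\|\s*nominee\d\s*=\s*)\S+")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(variable_width_ok)
# The fixed-width version finds nothing once spaces appear, as in the real markup.
print(fixed.findall("| nominee1 = [[Barack Obama]]"))
print(fixed.findall("|nominee1=Obama"))
```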

Try this regex:

\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?

Results:

re.findall(r"\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?", markup)
# => ['Barack Obama', 'Mitt Romney']
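
If the article uses piped links of the form [[Page|label]], the capture above would include the "|label" part. Here is a hedged sketch of one way to handle that. The piped link in the sample string is invented for the example, and nominee_names is a made-up helper name.

```python
import re

# Same idea as the regex above, but the link target stops at "|" and an
# optional second group captures the display label of a piped link.
NOMINEE = re.compile(
    r"\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]|]+)(?:\|([^]]*))?\]\](?:''')?"
)


def nominee_names(markup):
    """Return (target, label) pairs; the label falls back to the target."""
    return [(target, label or target) for target, label in NOMINEE.findall(markup)]


# The "|Romney" label is hypothetical, added only to exercise the piped-link case.
sample = "| nominee1 = '''[[Barack Obama]]'''\n| nominee2 = [[Mitt Romney|Romney]]\n"
names = nominee_names(sample)
print(names)
```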

Upvotes: 1
