Jeremy

Reputation: 2000

Extracting page titles and contributors from MediaWiki XML

I have a very large (7 GB) MediaWiki XML dump, which contains a record of every change made to every page of the wiki. I want to extract from the XML which users have contributed to each page.

The XML looks something like:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
 <page>
  <title>Unique Page title</title>
  <id>11</id>
  <restrictions>sysop</restrictions>
  <revision>
    <id>11</id>
    <timestamp>2005-10-26T02:23:03Z</timestamp>
    <contributor>
      <ip>MediaWiki default</ip>
    </contributor>
    <text xml:space="preserve">i</text>
  </revision>
 </page>
 <page> ... </page>
 <page> ... </page>
 ...
</mediawiki>

For a file this size, I believe I need to use iterparse. For now, I'm just trying to print out the title, but when I run the following code, it prints "None".

NS = '{http://www.mediawiki.org/xml/export-0.3/}'
from xml.etree.ElementTree import iterparse
with open('XMLFile.xml') as f:
    for event, elem in iterparse(f):
        if elem.tag == NS + 'page':
            for node in elem:
                if node.tag == NS + 'title':
                    print node.text()
        elem.clear()

Upvotes: 3

Views: 488

Answers (3)

mzjn

Reputation: 50967

You get None when printing the text content of the title element because you are using elem.clear() "too early". By default, iterparse() only generates "end" events. When the "end" event for page is emitted, all its subelements, including title, have already been cleared (emptied).

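If elem.clear() in the code in the question is moved just one indentation level (four spaces) to the right, so that it runs only inside the if block for page, it will work as expected. Another way to make your code work is to change iterparse(f) to iterparse(f, events=["start"]), but that is less reliable: on a "start" event the element's text and children are not guaranteed to have been parsed yet.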

Also, node.text() should be node.text: text is an attribute, not a method.
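
To make the first fix concrete, here is a minimal sketch of the question's loop with elem.clear() moved inside the page branch and node.text read as an attribute. The SAMPLE bytes and the page_titles helper are stand-ins made up for illustration (the real dump would be opened in binary mode instead), and the code is written for Python 3.

```python
import io
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Hypothetical stand-in for the real dump, same structure as in the question.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
 <page>
  <title>Unique Page title</title>
  <id>11</id>
  <revision>
    <id>11</id>
    <contributor><ip>MediaWiki default</ip></contributor>
    <text xml:space="preserve">i</text>
  </revision>
 </page>
</mediawiki>"""


def page_titles(source):
    """Yield the title of each page, clearing a page only after reading it."""
    for event, elem in iterparse(source):  # default: "end" events only
        if elem.tag == NS + 'page':
            for node in elem:
                if node.tag == NS + 'title':
                    yield node.text  # attribute access, not a call
            elem.clear()  # now runs only once the whole page has been seen


titles = list(page_titles(io.BytesIO(SAMPLE)))
print(titles)
```

With a real file, something like page_titles(open('XMLFile.xml', 'rb')) should behave the same way, assuming the dump uses the 0.3 export namespace.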

See http://effbot.org/zone/element-iterparse.htm for more details on iterparse().


Assume that the XML dump (mw.xml) looks like this:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Unique Page title 1</title>
    <id>11</id>
    <restrictions>sysop</restrictions>
    <revision>
      <id>11</id>
      <timestamp>2005-10-26T02:23:03Z</timestamp>
      <contributor>
       <username>Alice</username>
      </contributor>
      <text xml:space="preserve">i</text>
    </revision>
  </page>

  <page>
    <title>Unique Page title 2</title>
    <id>11</id>
    <restrictions>sysop</restrictions>
    <revision>
      <id>11</id>
      <timestamp>2005-10-26T02:23:03Z</timestamp>
      <contributor>
       <username>Bob</username>
      </contributor>
      <text xml:space="preserve">j</text>
    </revision>
  </page>
</mediawiki>

Here is a suggestion on how you can get the title and contributor:

from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Binary mode lets the parser handle the encoding itself
with open('mw.xml', 'rb') as f:
    for event, elem in iterparse(f):
        # Only "end" events are generated, so the page is complete here
        if elem.tag == '{0}page'.format(NS):
            title = elem.find("{0}title".format(NS))
            contr = elem.find(".//{0}username".format(NS))

            if title is not None:
                print(title.text)
            if contr is not None:
                print(contr.text)

            # Free the page's subtree once it has been processed
            elem.clear()

Output:

Unique Page title 1
Alice
Unique Page title 2
Bob

I'm assuming that you want the username of the contributor. According to the latest XML Schema, contributor can contain username, ip, and/or id child elements (this is also true for the 0.3 version of the schema).
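
Since a page can have many revisions and an anonymous contributor has ip rather than username, a possible extension is sketched below. It collects every contributor of each page, falling back to ip when username is missing. The SAMPLE data and the contributors_by_page helper are hypothetical names introduced here, not part of the dump or of ElementTree, and the code targets Python 3.

```python
import io
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Hypothetical sample: page 1 has a registered and an anonymous contributor.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Unique Page title 1</title>
    <revision>
      <contributor><username>Alice</username></contributor>
      <text xml:space="preserve">i</text>
    </revision>
    <revision>
      <contributor><ip>192.0.2.1</ip></contributor>
      <text xml:space="preserve">ii</text>
    </revision>
  </page>
  <page>
    <title>Unique Page title 2</title>
    <revision>
      <contributor><username>Bob</username></contributor>
      <text xml:space="preserve">j</text>
    </revision>
  </page>
</mediawiki>"""


def contributors_by_page(source):
    """Map each page title to the set of its contributors (username or ip)."""
    pages = {}
    for event, elem in iterparse(source):  # "end" events only
        if elem.tag != NS + 'page':
            continue
        title = elem.findtext(NS + 'title')
        names = set()
        for contr in elem.iter(NS + 'contributor'):
            # Prefer the username; fall back to the ip of anonymous edits.
            name = contr.findtext(NS + 'username') or contr.findtext(NS + 'ip')
            if name:
                names.add(name)
        pages[title] = names
        elem.clear()  # release the page once it has been summarised
    return pages


result = contributors_by_page(io.BytesIO(SAMPLE))
print(result)
```

Note that this keeps a whole page, including the text of all its revisions, in memory until the closing page tag is reached, which may matter for pages with very long histories.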

Upvotes: 1

Ilmari Karonen

Reputation: 50338

I have no experience in using Python and iterparse, but generally, the way you'd do this with an iterative XML parser would be like this:

  • Outside the parsing loop, set up variables to store the current page title and list of contributors.
  • Inside the loop, whenever a page tag is opened, reset the variables.
  • When you encounter a title tag, set the page title variable to its contents.
  • When you encounter a contributor tag, add its contents to the list of contributors.
  • When the page tag is closed, output the collected title and the list of contributors.
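
In Python, the steps above could be sketched roughly as follows, using iterparse with both "start" and "end" events. The iter_pages generator and the SAMPLE data are hypothetical names for illustration, the ip fallback for anonymous edits is an added assumption, and the code targets Python 3.

```python
import io
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Hypothetical sample: Alice edits page 1 twice, plus one anonymous edit.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Unique Page title 1</title>
    <revision>
      <contributor><username>Alice</username></contributor>
      <text xml:space="preserve">i</text>
    </revision>
    <revision>
      <contributor><ip>192.0.2.1</ip></contributor>
      <text xml:space="preserve">ii</text>
    </revision>
    <revision>
      <contributor><username>Alice</username></contributor>
      <text xml:space="preserve">iii</text>
    </revision>
  </page>
  <page>
    <title>Unique Page title 2</title>
    <revision>
      <contributor><username>Bob</username></contributor>
      <text xml:space="preserve">j</text>
    </revision>
  </page>
</mediawiki>"""


def iter_pages(source):
    """Yield (title, contributors) for each page as its closing tag is seen."""
    title, contributors = None, []
    for event, elem in iterparse(source, events=('start', 'end')):
        if event == 'start':
            if elem.tag == NS + 'page':
                title, contributors = None, []  # page opened: reset state
            continue
        # From here on the event is "end", so the element is complete.
        if elem.tag == NS + 'title':
            title = elem.text
        elif elem.tag == NS + 'contributor':
            name = elem.findtext(NS + 'username') or elem.findtext(NS + 'ip')
            if name and name not in contributors:
                contributors.append(name)
        elif elem.tag == NS + 'revision':
            elem.clear()  # contributor already recorded; drop revision text
        elif elem.tag == NS + 'page':
            yield title, contributors
            elem.clear()


pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

Clearing each revision as soon as it ends means the text of earlier revisions does not have to stay in memory until the end of the page.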

Upvotes: 1

Brion

Reputation: 1041

Try pulling the 'title' elements directly out during iterative parsing instead of doing a secondary loop:

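from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

with open('XMLFile.xml', 'rb') as f:
    for event, elem in iterparse(f):
        # Print each title as soon as its closing tag has been parsed
        if elem.tag == NS + 'title':
            print(elem.text)
        # Clear every finished element to keep memory use low
        elem.clear()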

This seems to work for me.

Upvotes: 3
