Reputation: 2000
I have a very large (7GB) MediaWiki XML dump, which consists of records of each change made to each page of the Wiki. I am trying to record which users have contributed to each page, and so I want to extract that from the XML.
The XML looks something like:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Unique Page title</title>
    <id>11</id>
    <restrictions>sysop</restrictions>
    <revision>
      <id>11</id>
      <timestamp>2005-10-26T02:23:03Z</timestamp>
      <contributor>
        <ip>MediaWiki default</ip>
      </contributor>
      <text xml:space="preserve">i</text>
    </revision>
  </page>
  <page> ... </page>
  <page> ... </page>
  ...
</mediawiki>
For a file this size, I believe I need to use iterparse. For now, I'm just trying to print out the title, but when I run the following code, it prints "None".
NS = '{http://www.mediawiki.org/xml/export-0.3/}'
from xml.etree.ElementTree import iterparse
with open('XMLFile.xml') as f:
    for event, elem in iterparse(f):
        if elem.tag == NS + 'page':
            for node in elem:
                if node.tag == NS + 'title':
                    print node.text()
        elem.clear()
Upvotes: 3
Views: 488
Reputation: 50967
You get None when printing the text content of the title element because you are calling elem.clear() "too early". By default, iterparse() only generates "end" events, and the code in the question clears every element as soon as its "end" event arrives. When the "end" event for page is emitted, all its subelements, including title, have already been cleared (emptied).
If elem.clear() in the code in the question is moved just one indentation level (four spaces) to the right, so that it runs only for page elements, it will work as expected. Another way to make your code work is to change iterparse(f) to iterparse(f, events=["start"]), although on a "start" event the element's children and text are not guaranteed to have been parsed yet, so this is less reliable for a large file.
Also, node.text() should be node.text, because text is an attribute, not a method.
See http://effbot.org/zone/element-iterparse.htm for more details on iterparse().
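As a minimal sketch of the first fix, here is the loop from the question with elem.clear() moved inside the page branch and node.text used as an attribute. The SAMPLE bytes and the page_titles helper are illustrative names, not part of the question; the sample stands in for the real dump so the snippet runs on its own.

```python
import io
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Tiny stand-in for the real dump (assumption: same namespace and layout).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page><title>Unique Page title</title><id>11</id></page>
  <page><title>Another title</title><id>12</id></page>
</mediawiki>"""


def page_titles(source):
    """Yield the title of every page, clearing each page once it is complete."""
    for event, elem in iterparse(source):  # default: "end" events only
        if elem.tag == NS + 'page':
            for node in elem:
                if node.tag == NS + 'title':
                    yield node.text  # attribute access, not a call
            # Cleared only after the whole page has been read.
            elem.clear()


titles = list(page_titles(io.BytesIO(SAMPLE)))
print(titles)
```

With the real file, the same function could be called on a file object opened in binary mode instead of the BytesIO wrapper.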
Assume that the XML dump (mw.xml) looks like this:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Unique Page title 1</title>
    <id>11</id>
    <restrictions>sysop</restrictions>
    <revision>
      <id>11</id>
      <timestamp>2005-10-26T02:23:03Z</timestamp>
      <contributor>
        <username>Alice</username>
      </contributor>
      <text xml:space="preserve">i</text>
    </revision>
  </page>
  <page>
    <title>Unique Page title 2</title>
    <id>11</id>
    <restrictions>sysop</restrictions>
    <revision>
      <id>11</id>
      <timestamp>2005-10-26T02:23:03Z</timestamp>
      <contributor>
        <username>Bob</username>
      </contributor>
      <text xml:space="preserve">j</text>
    </revision>
  </page>
</mediawiki>
Here is a suggestion on how you can get the title and contributor:
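from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Binary mode lets the XML parser handle the file's encoding itself.
with open('mw.xml', 'rb') as f:
    for event, elem in iterparse(f):
        # Only "end" events arrive by default, so the page subtree is complete here.
        if elem.tag == '{0}page'.format(NS):
            title = elem.find("{0}title".format(NS))
            contr = elem.find(".//{0}username".format(NS))
            if title is not None:
                print(title.text)
            if contr is not None:
                print(contr.text)
            # Free the finished page to keep memory use bounded.
            elem.clear()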
Output:
Unique Page title 1
Alice
Unique Page title 2
Bob
I'm assuming that you want the username of the contributor. According to the latest XML Schema, contributor can contain username, ip, and/or id child elements (this is true also for the 0.3 version of the schema).
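Since a contributor may carry a username or only an ip, a page with several revisions can be summarized by collecting whichever of the two is present. The following is a rough sketch under that assumption; contributors_by_page and SAMPLE are hypothetical names, and falling back from username to ip is one possible policy, not something the schema prescribes.

```python
import io
from collections import defaultdict
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Illustrative sample: page 1 has one named and one anonymous contributor.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Unique Page title 1</title>
    <revision><contributor><username>Alice</username></contributor></revision>
    <revision><contributor><ip>MediaWiki default</ip></contributor></revision>
  </page>
  <page>
    <title>Unique Page title 2</title>
    <revision><contributor><username>Bob</username></contributor></revision>
  </page>
</mediawiki>"""


def contributors_by_page(source):
    """Map each page title to the set of its contributors' names or IPs."""
    result = defaultdict(set)
    for event, elem in iterparse(source):
        if elem.tag != NS + 'page':
            continue
        title = elem.findtext(NS + 'title')
        for contr in elem.iter(NS + 'contributor'):
            # Prefer the username; fall back to the ip for anonymous edits.
            name = contr.findtext(NS + 'username') or contr.findtext(NS + 'ip')
            if name:
                result[title].add(name)
        elem.clear()  # the page is fully processed at this point
    return dict(result)


pages = contributors_by_page(io.BytesIO(SAMPLE))
for title in sorted(pages):
    print(title, sorted(pages[title]))
```

A set is used so that a user who edited the same page many times is recorded once.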
Upvotes: 1
Reputation: 50338
I have no experience in using Python and iterparse, but generally, the way you'd do this with an iterative XML parser would be like this:
1. When a page tag is opened, reset the variables.
2. When you reach a title tag, set the page title variable to its contents.
3. When you reach a contributor tag, add its contents to the list of contributors.
4. When the page tag is closed, output the collected title and the list of contributors.
Upvotes: 1
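The event-driven steps in the answer above can be sketched in Python with iterparse by asking for both "start" and "end" events. This is only one way to arrange it: collect_pages and SAMPLE are made-up names, title and contributor are read on their "end" events (when their contents are complete), and each revision is cleared as soon as it ends so that its text does not stay in memory.

```python
import io
from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Illustrative sample: Alice edits the first page twice.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Unique Page title 1</title>
    <revision><contributor><username>Alice</username></contributor></revision>
    <revision><contributor><ip>MediaWiki default</ip></contributor></revision>
    <revision><contributor><username>Alice</username></contributor></revision>
  </page>
  <page>
    <title>Unique Page title 2</title>
    <revision><contributor><username>Bob</username></contributor></revision>
  </page>
</mediawiki>"""


def collect_pages(source):
    """Return a list of (title, contributors) tuples, one per page."""
    results = []
    title, contributors = None, []
    for event, elem in iterparse(source, events=('start', 'end')):
        if event == 'start':
            if elem.tag == NS + 'page':
                title, contributors = None, []  # page opened: reset state
            continue
        # From here on the event is "end", so the element is complete.
        if elem.tag == NS + 'title':
            title = elem.text
        elif elem.tag == NS + 'contributor':
            name = elem.findtext(NS + 'username') or elem.findtext(NS + 'ip')
            if name and name not in contributors:
                contributors.append(name)
        elif elem.tag == NS + 'revision':
            elem.clear()  # contributor already recorded; drop the revision text
        elif elem.tag == NS + 'page':
            results.append((title, contributors))  # page closed: emit
            elem.clear()
    return results


records = collect_pages(io.BytesIO(SAMPLE))
for page_title, names in records:
    print(page_title, names)
```

For a 7GB dump it would probably be better to write each record out as the page closes instead of accumulating the results list.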
Reputation: 1041
Try pulling the 'title' elements directly out during iterative parsing instead of doing a secondary loop:
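from xml.etree.ElementTree import iterparse

NS = '{http://www.mediawiki.org/xml/export-0.3/}'

# Binary mode lets the XML parser handle the file's encoding itself.
with open('XMLFile.xml', 'rb') as f:
    for event, elem in iterparse(f):
        # Each title is printed on its own "end" event, before it is cleared.
        if elem.tag == NS + 'title':
            print(elem.text)
        elem.clear()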
seems to work for me.
Upvotes: 3