Reputation: 139
I have a Wikipedia XML dump created by exporting all pages of a certain category. You can see the exact structure of this XML file by generating one for yourself at https://en.wikipedia.org/wiki/Special:Export. Now I would like to make a list, in Python, of the titles of each article. I have tried using:
import xml.etree.ElementTree as ET
tree = ET.parse('./comp_sci_wiki.xml')
root = tree.getroot()
for element in root:
    for sub in element:
        print sub.find("title")
Nothing is printed. This seems like it should be a relatively straightforward task. Any help you could provide would be much appreciated. Thanks!
Upvotes: 0
Views: 1667
Reputation: 311750
If you look at the beginning of the exported file, you'll see that the document declares a default XML namespace:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLo
That means there is no un-namespaced "title" element in the document, which is one reason why your sub.find("title") statement is failing.
You can see this if you print out your root element:
>>> print root
<Element '{http://www.mediawiki.org/xml/export-0.10/}mediawiki' at 0x7f2a45df6c10>
Note that it doesn't say <Element 'mediawiki'>. The identifier includes the full namespace. The ElementTree documentation describes in detail how to work with namespaces in XML documents, but the tl;dr version is that you need:
>>> from xml.etree import ElementTree as ET
>>> tree=ET.parse('/home/lars/Downloads/Wikipedia-20160405005142.xml')
>>> root = tree.getroot()
>>> ns = 'http://www.mediawiki.org/xml/export-0.10/'
>>> for page in root.findall('{%s}page' % ns):
...     print(page.find('{%s}title' % ns).text)
...
Category:Wikipedia books on computer science
Computer science in sport
Outline of computer science
Category:Unsolved problems in computer science
Category:Philosophy of computer science
[...etc...]
>>>
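The same lookup can also be written with a prefix map, which ElementTree's find(), findtext() and findall() accept as a namespaces argument. Below is a minimal sketch in Python 3 syntax, assuming a tiny inline document in place of the real export. The prefix mw, the helper page_titles and the SAMPLE string are made up for illustration; only the namespace URI comes from the dump above. It also reads the namespace from the root tag rather than hard-coding it, which may help if a dump uses a different export schema version.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real export file; only the namespace URI
# is taken from the dump discussed above.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Computer science in sport</title></page>
  <page><title>Outline of computer science</title></page>
</mediawiki>"""


def page_titles(root):
    """Return the title text of every <page> directly under the root."""
    # root.tag looks like '{namespace-uri}mediawiki'. This assumes the
    # root element is namespaced, as it is in a MediaWiki export.
    uri = root.tag[1:root.tag.index('}')]
    ns = {'mw': uri}  # 'mw' is an arbitrary prefix chosen for this sketch
    return [page.findtext('mw:title', namespaces=ns)
            for page in root.findall('mw:page', ns)]


titles = page_titles(ET.fromstring(SAMPLE))
print(titles)
```

For a real dump you would pass ET.parse(path).getroot() to page_titles instead of parsing the inline string.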
Note that your life would probably be easier if you were to install the lxml module, which includes full XPath support, allowing you to do something like this:
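>>> from lxml import etree
>>> tree = etree.parse('/home/lars/Downloads/Wikipedia-20160405005142.xml')
>>> nsmap = {'x': 'http://www.mediawiki.org/xml/export-0.10/'}
>>> for title in tree.xpath('//x:title', namespaces=nsmap):
...     print(title.text)
...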
Category:Wikipedia books on computer science
Computer science in sport
Outline of computer science
Category:Unsolved problems in computer science
Category:Philosophy of computer science
Category:Computer science organizations
[...etc...]
Anyway, read through the docs on namespace support and hopefully that plus these examples will point you in the right direction. The takeaway should be that XML namespaces are important, and title in one namespace is not the same as title in another namespace.
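To make that last point concrete, here is a small sketch (Python 3, standard library only) using a made-up document in which two title elements live in different namespaces. The example.org URIs and the element contents are invented for illustration and have nothing to do with the MediaWiki schema.

```python
import xml.etree.ElementTree as ET

# Invented document: two <title> elements, each in its own namespace.
DOC = """<root xmlns:a="http://example.org/a" xmlns:b="http://example.org/b">
  <a:title>from namespace a</a:title>
  <b:title>from namespace b</b:title>
</root>"""

root = ET.fromstring(DOC)

# A bare name only matches elements that have no namespace at all.
plain = root.find('title')

# The qualified names are different tags, so each matches exactly one element.
in_a = root.find('{http://example.org/a}title')
in_b = root.find('{http://example.org/b}title')

print(plain)      # None: there is no un-namespaced title
print(in_a.text)
print(in_b.text)
```

This is the same situation as in the question: the un-namespaced search finds nothing, while the fully qualified one does.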
Upvotes: 3