user2585945

Reputation: 139

Retrieving All Article Titles from an XML Wiki Dump - Python

I have a Wikipedia XML dump created by exporting all pages of a certain category. You can see the exact structure of this XML file by generating one for yourself at https://en.wikipedia.org/wiki/Special:Export. Now I would like to make a list, in Python, of the titles of each article. I have tried using:

import xml.etree.ElementTree as ET

tree = ET.parse('./comp_sci_wiki.xml')
root = tree.getroot()

for element in root:
    for sub in element:
        print sub.find("title")

Nothing is printed. This seems like it should be a relatively straightforward task. Any help you could provide would be much appreciated. Thanks!

Upvotes: 0

Views: 1667

Answers (1)

larsks

Reputation: 311750

If you look at the beginning of the exported file, you'll see that the document declares a default XML namespace:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLo

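That means there is no un-namespaced "title" element in the document, which is one reason why your sub.find("title") call fails. (The other is that your nested loop goes one level too deep: each sub is already a child of a page, such as title, id or revision, so find is looking for a title nested inside those.) You can see the namespace by printing your root element: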

>>> print(root)
<Element '{http://www.mediawiki.org/xml/export-0.10/}mediawiki' at 0x7f2a45df6c10>

Note that it doesn't say <Element 'mediawiki'>. The identifier includes the full namespace. This document describes in detail how to work with namespaces in XML documents, but the tl;dr version is that you need:

>>> from xml.etree import ElementTree as ET
>>> tree=ET.parse('/home/lars/Downloads/Wikipedia-20160405005142.xml')
>>> root = tree.getroot()
>>> ns = 'http://www.mediawiki.org/xml/export-0.10/'
>>> for page in root.findall('{%s}page' % ns):
...   print (page.find('{%s}title' % ns).text)
... 
Category:Wikipedia books on computer science
Computer science in sport
Outline of computer science
Category:Unsolved problems in computer science
Category:Philosophy of computer science
[...etc...]
>>> 
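The same lookup can also be written with a prefix-to-URI mapping instead of string formatting. What follows is only a sketch: the prefix "mw", the helper name list_titles and the inline SAMPLE document are made up for illustration, and only the namespace URI comes from the export shown above.

```python
import xml.etree.ElementTree as ET

# "mw" is an arbitrary prefix chosen here; only the URI has to match the document.
MEDIAWIKI_NS = {"mw": "http://www.mediawiki.org/xml/export-0.10/"}

# Tiny stand-in for a real export file (hypothetical content).
SAMPLE = """\
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo><sitename>Wikipedia</sitename></siteinfo>
  <page><title>Computer science in sport</title></page>
  <page><title>Outline of computer science</title></page>
</mediawiki>"""


def list_titles(root):
    """Return the title text of every <page> directly under the root."""
    return [
        page.findtext("mw:title", namespaces=MEDIAWIKI_NS)
        for page in root.findall("mw:page", MEDIAWIKI_NS)
    ]


root = ET.fromstring(SAMPLE)
titles = list_titles(root)
print(titles)
```

For a file on disk, ET.parse(path).getroot() would take the place of ET.fromstring(SAMPLE).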

Note that your life would probably be easier if you were to install the lxml module, which includes full xpath support, allowing you to do something like this:

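>>> from lxml import etree
>>> tree = etree.parse('/home/lars/Downloads/Wikipedia-20160405005142.xml')
>>> nsmap={'x': 'http://www.mediawiki.org/xml/export-0.10/'}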
>>> for title in tree.xpath('//x:title', namespaces=nsmap):
...   print (title.text)
... 
Category:Wikipedia books on computer science
Computer science in sport
Outline of computer science
Category:Unsolved problems in computer science
Category:Philosophy of computer science
Category:Computer science organizations
[...etc...]
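If installing lxml is not an option, the standard library's ElementTree supports a limited XPath subset that covers this particular descendant search. A minimal sketch, assuming a small inline document (SAMPLE is hypothetical) in place of the real export:

```python
import xml.etree.ElementTree as ET

nsmap = {"x": "http://www.mediawiki.org/xml/export-0.10/"}

# Hypothetical stand-in for the exported file.
SAMPLE = """\
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo><sitename>Wikipedia</sitename></siteinfo>
  <page><title>Computer science in sport</title><revision><id>1</id></revision></page>
  <page><title>Outline of computer science</title></page>
</mediawiki>"""

root = ET.fromstring(SAMPLE)

# ".//x:title" matches title elements at any depth below the root.
descendant_titles = [t.text for t in root.iterfind(".//x:title", nsmap)]
print(descendant_titles)
```

ElementTree wants the relative form ".//x:title" here, whereas lxml accepts the absolute "//x:title" used above.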

Anyway, read through the docs on namespace support and hopefully that plus these examples will point you in the right direction. The takeaway should be that XML namespaces are important, and title in one namespace is not the same as title in another namespace.
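To make that last point concrete, here is a small sketch with a made-up document that puts two different title elements inside one page. The second namespace URI (http://example.com/other) and the prefixes "mw" and "other" are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only the first namespace URI comes from the export above.
SAMPLE = """\
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
           xmlns:other="http://example.com/other">
  <page>
    <title>Outline of computer science</title>
    <other:title>Not an article title</other:title>
  </page>
</mediawiki>"""

ns = {
    "mw": "http://www.mediawiki.org/xml/export-0.10/",
    "other": "http://example.com/other",
}

root = ET.fromstring(SAMPLE)
page = root.find("mw:page", ns)

bare = page.find("title")                        # no un-namespaced title exists
mw_title = page.find("mw:title", ns).text        # title in the MediaWiki namespace
other_title = page.find("other:title", ns).text  # same local name, different namespace

print(bare, mw_title, other_title, sep=" | ")
```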

Upvotes: 3
