Reputation: 355
The goal is to read all … stuff from a Wikipedia DUMP (70Gb file). This is not possible to load in memory, therefore I tried to parse the file incrementally and get some values from it. However the script I just wrote does not print anything and immediately occupies all my memory.
Here is the code:
from lxml import etree
def fast_iter(context, func, *args, **kwargs):
for event, elem in context:
func(elem, *args, **kwargs)
elem.clear()
for ancestor in elem.xpath('ancestor-or-self::*'):
while ancestor.getprevious() is not None:
del ancestor.getparent()[0]
del context
def process_element(elem):
#print(elem)
print (elem.xpath( './revision/text/text( )' ))
context = etree.iterparse( 'enwiki-latest-pages-articles-multistream.xml', tag='page' )
fast_iter(context,process_element)
When this script is applied in a small xml file, it prints the values from the requested xpath.
However when applied on the full file, nothing happens.
Here are a same lines from the Wikipedia dump
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
<base>https://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.33.0-wmf.19</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="-1" case="first-letter">Special</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2" case="first-letter">User</namespace>
<namespace key="3" case="first-letter">User talk</namespace>
<namespace key="4" case="first-letter">Wikipedia</namespace>
<namespace key="5" case="first-letter">Wikipedia talk</namespace>
<namespace key="6" case="first-letter">File</namespace>
<namespace key="7" case="first-letter">File talk</namespace>
<namespace key="8" case="first-letter">MediaWiki</namespace>
<namespace key="9" case="first-letter">MediaWiki talk</namespace>
<namespace key="10" case="first-letter">Template</namespace>
<namespace key="11" case="first-letter">Template talk</namespace>
<namespace key="12" case="first-letter">Help</namespace>
<namespace key="13" case="first-letter">Help talk</namespace>
<namespace key="14" case="first-letter">Category</namespace>
<namespace key="15" case="first-letter">Category talk</namespace>
<namespace key="100" case="first-letter">Portal</namespace>
<namespace key="101" case="first-letter">Portal talk</namespace>
<namespace key="108" case="first-letter">Book</namespace>
<namespace key="109" case="first-letter">Book talk</namespace>
<namespace key="118" case="first-letter">Draft</namespace>
<namespace key="119" case="first-letter">Draft talk</namespace>
<namespace key="446" case="first-letter">Education Program</namespace>
<namespace key="447" case="first-letter">Education Program talk</namespace>
<namespace key="710" case="first-letter">TimedText</namespace>
<namespace key="711" case="first-letter">TimedText talk</namespace>
<namespace key="828" case="first-letter">Module</namespace>
<namespace key="829" case="first-letter">Module talk</namespace>
<namespace key="2300" case="first-letter">Gadget</namespace>
<namespace key="2301" case="first-letter">Gadget talk</namespace>
<namespace key="2302" case="case-sensitive">Gadget definition</namespace>
<namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
</namespaces>
</siteinfo>
<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
<redirect title="Computer accessibility" />
<revision>
<id>854851586</id>
<parentid>834079434</parentid>
<timestamp>2018-08-14T06:47:24Z</timestamp>
<contributor>
<username>Godsy</username>
<id>23257138</id>
</contributor>
<comment>remove from category for seeking instructions on rcats</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{R from move}}
{{R from CamelCase}}
{{R unprintworthy}}</text>
<sha1>42l0cvblwtb4nnupxm6wo000d27t6kf</sha1>
</revision>
</page>
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>
<id>885648527</id>
<parentid>885645378</parentid>
<timestamp>2019-03-01T11:16:23Z</timestamp>
<contributor>
<username>Jarnsax</username>
<id>33627956</id>
</contributor>
<comment>improve citation metadata</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">{{redirect2|Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}
{{pp-move-indef}}
{{short description|Political philosophy that advocates self-governed societies}}
{{Use dmy dates|date=July 2018}}
{{use British English|date=January 2014}}
{{Anarchism sidebar}}
{{Basic forms of government}}
'''Anarchism''' is an [[anti-authoritarian]] [[political philosophy]]{{sfn|McLaughlin|2007|p=59}}{{sfn|Flint|2009|p=27}} that advocates [[Self-governance|self-governed]] societies based on voluntary, [[cooperative]] institutions and the rejection of coercive [[Hierarchy|hierarchies]] those societies view as unjust. These institutions are often described as [[Stateless society|stateless societies]],{{r|group=note|Note01}}{{sfn|Sheehan|2003|p=85}} although several authors have defined them more specifically as distinct institutions based on non-hierarchical or [[Free association (communism and anarchism)|free associations]].{{r|group=note|Note02}} Anarchism holds the [[State (polity)|state]] to be undesirable, unnecessary, and harmful.{{r|group=note|Note03}}<ref name=definition /> Any philosophy consistent with statelessness, that is, principled opposition to the State, is anarchist, thus anarchist schools of thought range from [[anarcho-communism]] to [[anarcho-capitalism]].{{sfn|Fiala|2018}}
While [[Anti-statism|opposition to the state]] is central,{{r|group=note|Note04}} many forms of anarchism specifically entail opposing authority or hierarchical organisation based on authority in the conduct of all human relations.{{r|group=note|Note05}} Anarchism is often considered a [[Far-left politics|far-left]] ideology,{{r|group=note|Note06}}{{sfn|Kahn|2000}}{{sfn|Moyihan|2007}} and much of [[anarchist economics]] and [[Anarchist law|anarchist legal philosophy]] reflect [[Libertarian socialism|anti-authoritarian interpretations]] of [[Anarcho-communism|communism]], [[Collectivist anarchism|collectivism]], [[Anarcho-syndicalism|syndicalism]], [[Mutualism (economic theory)|mutualism]], or [[participatory economics]].{{r|group=note|Note07}}
Anarchism does not offer a fixed body of doctrine from a single particular world view, instead fluxing and flowing as a philosophy.{{sfn|Marshall|2010|p=16}} Many types and traditions of anarchism exist, not all of which are mutually exclusive.{{sfn|Sylvan|2007|p=262}} [[Anarchist schools of thought]] can differ fundamentally, supporting anything from extreme [[individualism]] to complete [[collectivism]].{{sfn|McLean|McMillan|2003|loc= Anarchism}} Strains of anarchism have often been divided into the categories of [[Social anarchism|social]] and [[individualist anarchism]] or similar dual classifications.{{sfn|Ostergaard|p=14|loc=Anarchism}}{{sfn|Kropotkin|2002|p=5}}{{sfn|Fowler|1972}}
</text>
</revision>
</page>
</mediawiki>
Has anybody did this before? Any idea how to efficient parse this huge dump? Is there any package/lib that has done it before? I don't want to reinvent the wheel.
Upvotes: 2
Views: 2142
Reputation: 15523
Question: Parsing incrementally a large wikipedia dump XML file
When this (the Questions) script is applied in a small xml file, it prints the values from the requested xpath.
However when applied on the full file, nothing happens.
I wonder, you get anything from the small file, as you don't use a namespace
parameter.
The Wikipedia
xml file uses the following default namespace
:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
This example is using lxml
:
from lxml import etree
class Wikipedia:
def __init__(self, fh, tag):
"""
Initialize 'iterparse' to only generate 'end' events on tag '<entity>'
:param fh: File Handle from the XML File to parse
:param tag: The tag to process
"""
# Prepend the default Namespace {*} to get anything.
self.context = etree.iterparse(fh, events=("end",), tag=['{*}' + tag])
def _parse(self):
"""
Parse the XML File for all '<tag>...</tag>' Elements
Clear/Delete the Element Tree after processing
:return: Yield the current 'Event, Element Tree'
"""
for event, elem in self.context:
yield event, elem
elem.clear()
while elem.getprevious() is not None:
del elem.getparent()[0]
def __iter__(self):
"""
Iterate all '<tag>...</tag>' Element Trees yielded from self._parse()
:return: Dict var 'entity' {tag1, value, tag2, value, ... ,tagn, value}}
"""
for event, elem in self._parse():
entity = {}
# Assign the 'elem.namespace' to the 'xpath'
entity['revision'] = elem.xpath('./xmlns:revision/xmlns:text/text( )',
namespaces={'xmlns':etree.QName(elem).namespace})
yield entity
if __name__ == "__main__":
XML = b""""""<?xml version='1.0' encoding='UTF-8'?>
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/
http://www.mediawiki.org/xml/export-0.10.xsd"
version="0.10" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
... (omitted for brevity)""""""
#with open('.\\FILE.XML', 'rb') as in_xml_
with io.BytesIO(XML) as in_xml:
for record in Wikipedia(in_xml, tag='page'):
print("record:{}".format(record))
Output:
record:{'revision': ['#REDIRECT [[Computer accessi... (omitted for brevity) record:{'revision': ["{{redirect2|Anarchist|Anarch... (omitted for brevity)
Tested with Python: 3.5 - lxml.etree: 3.7.1
Upvotes: 3
Reputation: 23815
Use SAX. See example (https://www.tutorialspoint.com/python3/python_xml_processing.htm) below.
Simple API for XML (SAX) − Here, you register callbacks for events of interest and then let the parser proceed through the document. This is useful when your documents are large or you have memory limitations, it parses the file as it reads it from the disk and the entire file is never stored in the memory.
SAX is a standard interface for event-driven XML parsing. Parsing XML with SAX generally requires you to create your own ContentHandler by subclassing xml.sax.ContentHandler.
import xml.sax
class MovieHandler( xml.sax.ContentHandler ):
def __init__(self):
self.CurrentData = ""
self.type = ""
self.format = ""
self.year = ""
self.rating = ""
self.stars = ""
self.description = ""
# Call when an element starts
def startElement(self, tag, attributes):
self.CurrentData = tag
if tag == "movie":
print ("*****Movie*****")
title = attributes["title"]
print ("Title:", title)
# Call when an elements ends
def endElement(self, tag):
if self.CurrentData == "type":
print ("Type:", self.type)
elif self.CurrentData == "format":
print ("Format:", self.format)
elif self.CurrentData == "year":
print ("Year:", self.year)
elif self.CurrentData == "rating":
print ("Rating:", self.rating)
elif self.CurrentData == "stars":
print ("Stars:", self.stars)
elif self.CurrentData == "description":
print ("Description:", self.description)
self.CurrentData = ""
# Call when a character is read
def characters(self, content):
if self.CurrentData == "type":
self.type = content
elif self.CurrentData == "format":
self.format = content
elif self.CurrentData == "year":
self.year = content
elif self.CurrentData == "rating":
self.rating = content
elif self.CurrentData == "stars":
self.stars = content
elif self.CurrentData == "description":
self.description = content
if ( __name__ == "__main__"):
# create an XMLReader
parser = xml.sax.make_parser()
# turn off namepsaces
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
# override the default ContextHandler
Handler = MovieHandler()
parser.setContentHandler( Handler )
parser.parse("c:\\temp\\movies.xml")
movies.xml
<collection shelf = "New Arrivals">
<movie title = "Enemy Behind">
<type>War, Thriller</type>
<format>DVD</format>
<year>2003</year>
<rating>PG</rating>
<stars>10</stars>
<description>Talk about a US-Japan war</description>
</movie>
<movie title = "Transformers">
<type>Anime, Science Fiction</type>
<format>DVD</format>
<year>1989</year>
<rating>R</rating>
<stars>8</stars>
<description>A schientific fiction</description>
</movie>
<movie title = "Trigun">
<type>Anime, Action</type>
<format>DVD</format>
<episodes>4</episodes>
<rating>PG</rating>
<stars>10</stars>
<description>Vash the Stampede!</description>
</movie>
<movie title = "Ishtar">
<type>Comedy</type>
<format>VHS</format>
<rating>PG</rating>
<stars>2</stars>
<description>Viewable boredom</description>
</movie>
</collection>
Upvotes: 1