How to process XML files in Python

Question

I have a ~1GB XML file that has XML tags that I need to fetch data from. I have the XML file in the following format (I'm only pasting sample data because the actual file is about a gigabyte in size).

report.xml

What is the best way to parse/process XML files and fetch the data from XML tags in Python?
Are there any frameworks that can process XML files?
The method needs to be fast; it needs to finish in less than 100 seconds.

I've been using Hadoop with Python to process XML files and it usually takes nearly 200 seconds just to process the data... So I'm looking for an alternative solution that parses the above XML tags and fetches data from the tags.

Here is the kind of data the tags carry:

 campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content" avgPosition="1.16" cost="0" clicks="0" ...

After processing the XML file, I will store the data and values (79057390,3451305670 ...) in a MySQL database. All I need is to be able to process XML files about 1GB in size and save the processed data to a MySQL database in less than 100 seconds.

Juan Antonio Gomez Moriano · Accepted Answer

I recently faced a similar problem, and what solved it for me was the iterparse function from lxml (the standard library's ElementTree provides it too). In the end it comes down to using a SAX-like parser instead of a DOM-like one. A DOM parser builds the whole document in memory, while a SAX-style parser is event-driven and hands you elements as they are read, so you save a ton of memory. That saves time too, because you do not have to wait for the entire document to load before you start processing it.

I think you can use something like this:

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

file_path = "/path/to/your/test.xml"

# "end" events are enough: an element is only fully parsed once its end tag is seen
context = ET.iterparse(file_path, events=("end",))

for event, elem in context:
    if elem.tag == "row":
        attribs = elem.attrib
        print("This is the campaignID %s and this is the adGroupID %s"
              % (attribs["campaignID"], attribs["adGroupID"]))

    elem.clear()  # Save memory!
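
For the MySQL part of the question, one possible shape is to wrap the streaming loop in a generator and hand the rows to the database in batches instead of one at a time. The following is only a sketch under assumptions: the "row" tag name follows the snippet above, the column list is taken from the attribute names in the question, and `iter_rows`, `insert_in_batches`, `COLUMNS` and `SAMPLE` are names made up for illustration. The cursor can be any DB-API cursor, since `executemany` is part of the DB-API specification.

```python
import io
import xml.etree.ElementTree as ET
from itertools import islice

# Hypothetical column list, taken from the attribute names in the question.
COLUMNS = ("campaignID", "adGroupID", "keywordID", "keyword",
           "avgPosition", "cost", "clicks")


def iter_rows(source, tag="row"):
    """Yield one tuple of attribute values per matching element, streaming."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Missing attributes come back as None.
            yield tuple(elem.attrib.get(name) for name in COLUMNS)
        elem.clear()  # release the element's contents once handled


def insert_in_batches(cursor, sql, rows, batch_size=10000):
    """Send rows to cursor.executemany() in chunks; return the row count."""
    rows = iter(rows)
    total = 0
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return total
        cursor.executemany(sql, batch)
        total += len(batch)


# Tiny in-memory stand-in for the real file; the layout and the last two
# rows are invented for this example.
SAMPLE = b"""<report><table>
<row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content" avgPosition="1.16" cost="0" clicks="0"/>
<row campaignID="1" adGroupID="2" keywordID="3" keyword="a" avgPosition="1.0" cost="0" clicks="5"/>
<row campaignID="4" adGroupID="5" keywordID="6" keyword="b" avgPosition="2.0" cost="1" clicks="7"/>
</table></report>"""

rows = list(iter_rows(io.BytesIO(SAMPLE)))
```

With a MySQL driver the call would look roughly like insert_in_batches(cursor, "INSERT INTO keywords (...) VALUES (%s, %s, %s, %s, %s, %s, %s)", iter_rows(file_path)), followed by a single commit on the connection. The table name here is a placeholder. Whether this fits in the 100 second budget depends on the machine and the database, so it is worth timing the parsing and the inserts separately.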
