numaroth

Reputation: 1313

How do I parse some of the data from a large xml file?

I need to extract the location and radius data from a large xml file that is formatted as below and store the data in 2-dimensional ndarray. This is my first time using Python and I can't find anything about the best way to do this.

<species name="MyHeterotrophEPS" header="family,genealogy,generation,birthday,biomass,inert,capsule,growthRate,volumeRate,locationX,locationY,locationZ,radius,totalRadius">
0,0,0,0.0,0.0,0.0,77.0645361927206,-0.1001871531330136,-0.0013358287084401814,4.523853439106942,234.14575280979898,123.92820420047076,0.0,0.6259920275663835;
0,0,0,0.0,0.0,0.0,108.5705297969604,-0.1411462759900182,-0.001881950346533576,1.0429122163754276,144.1066875513379,72.24884428367467,0.0,0.7017581019907897;
.
.
.
</species>

Edit: I mean "large" by human standards; I am not having any memory issues with it.

Upvotes: 3

Views: 3615

Answers (3)

Martijn Pieters

Reputation: 1124558

You essentially have CSV data in the XML text value.

Use ElementTree to parse the XML, then use numpy.genfromtxt() to load that text into an array:

from xml.etree import ElementTree as ET
import numpy

tree = ET.parse('yourxmlfilename.xml')
species = tree.find(".//species[@name='MyHeterotrophEPS']")
names = species.attrib['header']
array = numpy.genfromtxt(
    (line.rstrip(';') for line in species.text.splitlines()),
    delimiter=',', names=names)

Note the generator expression with the str.splitlines() call; it turns the text of the XML element into a sequence of lines, which genfromtxt() is quite happy to receive, while also stripping the trailing ; character from each line.

For your sample input (minus the . lines), this results in:

array([ (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 77.0645361927206, -0.1001871531330136, -0.0013358287084401814, 4.523853439106942, 234.14575280979898, 123.92820420047076, 0.0, 0.6259920275663835),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 108.5705297969604, -0.1411462759900182, -0.001881950346533576, 1.0429122163754276, 144.1066875513379, 72.24884428367467, 0.0, 0.7017581019907897)], 
      dtype=[('family', '<f8'), ('genealogy', '<f8'), ('generation', '<f8'), ('birthday', '<f8'), ('biomass', '<f8'), ('inert', '<f8'), ('capsule', '<f8'), ('growthRate', '<f8'), ('volumeRate', '<f8'), ('locationX', '<f8'), ('locationY', '<f8'), ('locationZ', '<f8'), ('radius', '<f8'), ('totalRadius', '<f8')])
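Since the question asks specifically for the location and radius data as a 2-D ndarray, it's worth noting that a structured array like the one above can be indexed by field name and the selected columns stacked into a plain array. A small sketch (the values below are just the location/radius fields from the sample rows, hard-coded for illustration):

```python
import numpy as np

# Structured array with the same field layout genfromtxt() produces
# (only the fields of interest, values taken from the sample rows).
array = np.array(
    [(4.523853439106942, 234.14575280979898, 123.92820420047076, 0.0),
     (1.0429122163754276, 144.1066875513379, 72.24884428367467, 0.0)],
    dtype=[('locationX', '<f8'), ('locationY', '<f8'),
           ('locationZ', '<f8'), ('radius', '<f8')])

# Select fields by name and stack them into a plain 2-D ndarray:
# one row per individual, columns locationX, locationY, locationZ, radius.
coords = np.column_stack([array['locationX'], array['locationY'],
                          array['locationZ'], array['radius']])
print(coords.shape)  # (2, 4)
```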

Upvotes: 4

abarnert

Reputation: 366073

If your XML is just that species node, it's pretty simple, and Martijn Pieters has already explained it better than I can.

But if you've got a ton of species nodes in the document, and it's too large to fit the whole thing into memory, you can use iterparse instead of parse:

import numpy as np
import xml.etree.ElementTree as ET

for event, node in ET.iterparse('species.xml'):
    if node.tag == 'species':
        name = node.attrib['name']
        names = node.attrib['header']
        csvdata = (line.rstrip(';') for line in node.text.splitlines())
        array = np.genfromtxt(csvdata, delimiter=',', names=names)
        # do something with the array, then free the node:
        node.clear()

This won't help if you just have one super-gigantic species node, because even iterparse (or similar solutions like a SAX parser) parses one entire node at a time. You'd need to find an XML library that lets you stream the text of large nodes, and off the top of my head I can't think of any stdlib or popular third-party parsers that can do that.

Upvotes: 2

kirelagin

Reputation: 13626

If the file is really large, use ElementTree or SAX.

If the file is not that large (i.e. fits into memory), minidom might be easier to work with.

Each line seems to be a simple string of comma-separated numbers, so you can simply do line.split(',').
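As a sketch of that approach, assuming the file layout from the question (the XML here is an inline sample with a shortened header, just for illustration):

```python
from xml.dom import minidom
import numpy as np

# Inline sample in the same shape as the file from the question.
xml = """<species name="MyHeterotrophEPS" header="locationX,locationY,locationZ,radius">
1.0,2.0,3.0,0.5;
4.0,5.0,6.0,0.7;
</species>"""

species = minidom.parseString(xml).documentElement
text = species.firstChild.data          # the text node holding the CSV rows

rows = []
for line in text.splitlines():
    line = line.strip().rstrip(';')     # drop the trailing ';'
    if line:
        rows.append([float(v) for v in line.split(',')])

array = np.array(rows)                  # 2-D ndarray, one row per individual
print(array.shape)  # (2, 4)
```

For a real file you would use minidom.parse(filename) instead of parseString.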

Upvotes: 0
