Reputation: 1313
I need to extract the location and radius data from a large XML file that is formatted as below and store the data in a 2-dimensional ndarray. This is my first time using Python and I can't find anything about the best way to do this.
<species name="MyHeterotrophEPS" header="family,genealogy,generation,birthday,biomass,inert,capsule,growthRate,volumeRate,locationX,locationY,locationZ,radius,totalRadius">
0,0,0,0.0,0.0,0.0,77.0645361927206,-0.1001871531330136,-0.0013358287084401814,4.523853439106942,234.14575280979898,123.92820420047076,0.0,0.6259920275663835;
0,0,0,0.0,0.0,0.0,108.5705297969604,-0.1411462759900182,-0.001881950346533576,1.0429122163754276,144.1066875513379,72.24884428367467,0.0,0.7017581019907897;
.
.
.
</species>
Edit: I mean "large" by human standards. I am not having any memory issues with it.
Upvotes: 3
Views: 3615
Reputation: 1124558
You essentially have CSV data in the XML text value.
Use ElementTree
to parse the XML, then use numpy.genfromtxt()
to load that text into an array:
import numpy
from xml.etree import ElementTree as ET

tree = ET.parse('yourxmlfilename.xml')
species = tree.find(".//species[@name='MyHeterotrophEPS']")
names = species.attrib['header']
array = numpy.genfromtxt((line.rstrip(';') for line in species.text.splitlines()),
                         delimiter=',', names=names)
Note the generator expression, with a str.splitlines()
call; this turns the text of the XML element into a sequence of lines, which .genfromtxt()
is quite happy to receive. We do remove the trailing ;
character from each line.
For your sample input (minus the .
lines), this results in:
array([ (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 77.0645361927206, -0.1001871531330136, -0.0013358287084401814, 4.523853439106942, 234.14575280979898, 123.92820420047076, 0.0, 0.6259920275663835),
(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 108.5705297969604, -0.1411462759900182, -0.001881950346533576, 1.0429122163754276, 144.1066875513379, 72.24884428367467, 0.0, 0.7017581019907897)],
dtype=[('family', '<f8'), ('genealogy', '<f8'), ('generation', '<f8'), ('birthday', '<f8'), ('biomass', '<f8'), ('inert', '<f8'), ('capsule', '<f8'), ('growthRate', '<f8'), ('volumeRate', '<f8'), ('locationX', '<f8'), ('locationY', '<f8'), ('locationZ', '<f8'), ('radius', '<f8'), ('totalRadius', '<f8')])
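Since you only asked for the location and radius values, you can pull those fields out of the structured array and stack them into a plain 2-D array. A minimal sketch, assuming the array produced above and the column names from your header attribute:
import numpy
# pick out just the location and radius columns by field name
fields = ['locationX', 'locationY', 'locationZ', 'radius']
locations = numpy.column_stack([array[name] for name in fields])
# locations is now an ordinary (n, 4) float64 ndarray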
Upvotes: 4
Reputation: 366073
If your XML is just that species
node, it's pretty simple, and Martijn Pieters has already explained it better than I can.
But if you've got a ton of species
nodes in the document, and it's too large to fit the whole thing into memory, you can use iterparse
instead of parse
:
import numpy as np
import xml.etree.ElementTree as ET

for event, node in ET.iterparse('species.xml'):
    if node.tag == 'species':
        name = node.attrib['name']
        names = node.attrib['header']
        csvdata = (line.rstrip(';') for line in node.text.splitlines())
        array = np.genfromtxt(csvdata, delimiter=',', names=names)
        # do something with the array.
        node.clear()  # discard the processed element so memory stays bounded
This won't help if you just have one super-gigantic species
node, because even iterparse
(or similar solutions like a SAX parser) parses one entire node at a time. You'd need to find an XML library that lets you stream the text of large nodes, and off the top of my head, I can't think of any stdlib or popular third-party parsers that can do that.
Upvotes: 2
Reputation: 13626
If the file is really large, use ElementTree
or SAX
.
If the file is not that large (i.e. fits into memory), minidom
might be easier to work with.
Each line seems to be a simple string of comma-separated numbers, so you can simply do line.split(',')
.
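For example, a rough sketch of that approach, assuming the filename and single species element from the question:
from xml.dom import minidom
import numpy

dom = minidom.parse('yourxmlfilename.xml')           # filename assumed from the question
species = dom.getElementsByTagName('species')[0]
rows = []
for line in species.firstChild.data.splitlines():    # text content of the element
    line = line.strip().rstrip(';')
    if line:                                          # skip blank lines around the data
        rows.append([float(value) for value in line.split(',')])
array = numpy.array(rows)                             # 2-D ndarray, one row per record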
Upvotes: 0