Reputation: 73
For an assignment, I need to parse a 2-million-line XML file and load the data into a MySQL database. Since the class uses a Python environment with SQLite, I am attempting to parse the file with Python. Keep in mind that I am just learning Python, so everything is new!
I have made a few attempts, but I keep failing and getting frustrated. To keep things quick, I am testing my code on just a small piece of the full XML:
<pub>
<ID>7</ID>
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title>
<year>2003</year>
<booktitle>AVBPA</booktitle>
<pages>895-902</pages>
<authors>
<author>J. K. Schneider</author>
<author>C. E. Richardson</author>
<author>F. W. Kiefer</author>
<author>Venu Govindaraju</author>
</authors>
</pub>
First attempt
Here I successfully pulled out the data from each tag, except when there are multiple authors under the <authors> tag. I am trying to loop through each node in the authors tag, count them, build a temporary array of those authors, and then insert them into my database with SQL. I am getting "15" for the number of authors, but clearly there are only 4! How do I solve this?
from xml.dom import minidom
xmldoc= minidom.parse("test.xml")
pub = xmldoc.getElementsByTagName("pub")[0]
ID = pub.getElementsByTagName("ID")[0].firstChild.data
title = pub.getElementsByTagName("title")[0].firstChild.data
year = pub.getElementsByTagName("year")[0].firstChild.data
booktitle = pub.getElementsByTagName("booktitle")[0].firstChild.data
pages = pub.getElementsByTagName("pages")[0].firstChild.data
authors = pub.getElementsByTagName("authors")[0]
author = authors.getElementsByTagName("author")[0].firstChild.data
num_authors = len(author)
print("Number of authors: ", num_authors )
print(ID)
print(title)
print(year)
print(booktitle)
print(pages)
print(author)
Upvotes: 0
Views: 269
Reputation: 89305
Notice that you are getting the number of characters in the first author's name ("J. K. Schneider" is 15 characters long), because the code limits the result to the first author (index 0) and then takes the length of that string:
author = authors.getElementsByTagName("author")[0].firstChild.data
num_authors = len(author)
print("Number of authors: ", num_authors )
To count all the authors, don't limit the result to the first element:
author = authors.getElementsByTagName("author")
num_authors = len(author)
print("Number of authors: ", num_authors )
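To see the difference in isolation, here is a small self-contained sketch. It assumes the sample is parsed from an inline string instead of test.xml; the name `SAMPLE` and the variable names are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical inline copy of the sample record, trimmed to the authors.
SAMPLE = """<pub>
<authors>
<author>J. K. Schneider</author>
<author>C. E. Richardson</author>
<author>F. W. Kiefer</author>
<author>Venu Govindaraju</author>
</authors>
</pub>"""

authors = minidom.parseString(SAMPLE).getElementsByTagName("authors")[0]
author_nodes = authors.getElementsByTagName("author")

# Length of the text of the first <author>: a character count.
first_name = author_nodes[0].firstChild.data
chars_in_first_name = len(first_name)

# Length of the node list: the number of <author> elements.
num_authors = len(author_nodes)

print(chars_in_first_name)  # 15
print(num_authors)          # 4
```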
You can use a list comprehension to get all the author names, rather than the author elements, in a list:
author = [a.firstChild.data for a in authors.getElementsByTagName("author")]
print(author)
# [u'J. K. Schneider', u'C. E. Richardson', u'F. W. Kiefer', u'Venu Govindaraju'] (Python 2; Python 3 prints the same list without the u prefixes)
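Putting this together for the database step, the following is a minimal sketch, not a definitive solution. It assumes the <pub> records sit under a single root element (called <pubs> here, which is a guess about the full file), uses an in-memory SQLite database, and invents the table names `pub` and `author` and the helper `text_of` purely for illustration.

```python
import sqlite3
from xml.dom import minidom

# Hypothetical wrapper: the <pubs> root element is an assumption.
SAMPLE = """<pubs>
<pub>
<ID>7</ID>
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title>
<year>2003</year>
<booktitle>AVBPA</booktitle>
<pages>895-902</pages>
<authors>
<author>J. K. Schneider</author>
<author>C. E. Richardson</author>
<author>F. W. Kiefer</author>
<author>Venu Govindaraju</author>
</authors>
</pub>
</pubs>"""


def text_of(node, tag):
    """Return the text of the first <tag> element under node (illustrative helper)."""
    return node.getElementsByTagName(tag)[0].firstChild.data


doc = minidom.parseString(SAMPLE)

# In-memory database with made-up table names; a real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pub (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.execute("CREATE TABLE author (pub_id INTEGER, name TEXT)")

for pub in doc.getElementsByTagName("pub"):
    pub_id = int(text_of(pub, "ID"))
    names = [a.firstChild.data for a in pub.getElementsByTagName("author")]
    conn.execute(
        "INSERT INTO pub VALUES (?, ?, ?)",
        (pub_id, text_of(pub, "title"), int(text_of(pub, "year"))),
    )
    # One row per author, linked back to the publication by its ID.
    conn.executemany(
        "INSERT INTO author VALUES (?, ?)",
        [(pub_id, name) for name in names],
    )
conn.commit()

author_count = conn.execute(
    "SELECT COUNT(*) FROM author WHERE pub_id = 7"
).fetchone()[0]
print(author_count)  # 4
```

Note that minidom loads the whole document into memory. For a file with millions of lines, an incremental parser such as xml.etree.ElementTree.iterparse from the standard library may be a better fit.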
Upvotes: 1