Handling multiple nodes when parsing XML with Python

Question

For an assignment, I need to parse a 2-million-line XML file and load the data into a MySQL database. Since we are using a Python environment with SQLite for the class, I am attempting to use Python to parse the file. Keep in mind that I am just learning Python, so everything is new!

I have made a few attempts, but keep failing and getting frustrated. For efficiency, I am testing my code on just a small part of the full XML, here:


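<pub>
<ID>7</ID>
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title>
<year>2003</year>
<booktitle>AVBPA</booktitle>
<pages>895-902</pages>
<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>
</pub>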

First attempt

Here I successfully pulled out the data from each tag, except when there are multiple authors under the authors tag. I am trying to loop through each node in the authors tag, count them, build a temporary array of those authors, and then insert them into my database with SQL. I am getting "15" for the number of authors, but clearly there are only 4! How do I solve this?

from xml.dom import minidom

xmldoc= minidom.parse("test.xml")

pub = xmldoc.getElementsByTagName("pub")[0]
ID = pub.getElementsByTagName("ID")[0].firstChild.data
title = pub.getElementsByTagName("title")[0].firstChild.data
year = pub.getElementsByTagName("year")[0].firstChild.data
booktitle = pub.getElementsByTagName("booktitle")[0].firstChild.data
pages = pub.getElementsByTagName("pages")[0].firstChild.data
authors = pub.getElementsByTagName("authors")[0]
author = authors.getElementsByTagName("author")[0].firstChild.data
num_authors = len(author)
print("Number of authors: ", num_authors )

print(ID)
print(title)
print(year)
print(booktitle)
print(pages)
print(author)

har07 · Accepted Answer

Notice that you were getting the number of characters in the first author's name here: the code limits the result to the first author (index 0), takes its text, and then gets the length of that string ("J. K. Schneider" is 15 characters long):

author = authors.getElementsByTagName("author")[0].firstChild.data
num_authors = len(author)
print("Number of authors: ", num_authors )

To count all the authors, don't index into the result:

author = authors.getElementsByTagName("author")
num_authors = len(author)
print("Number of authors: ", num_authors )
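To see the difference between the two lengths side by side, here is a minimal self-contained sketch. It parses the record from a string with minidom.parseString instead of reading test.xml; the tag names come from the question's code, while the exact layout of the sample and the variable names are assumptions made for illustration.

```python
from xml.dom import minidom

# Sample record; tag names follow the question's code, the layout is assumed.
SAMPLE = """<pub>
<ID>7</ID>
<year>2003</year>
<booktitle>AVBPA</booktitle>
<pages>895-902</pages>
<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>
</pub>"""

doc = minidom.parseString(SAMPLE)
authors = doc.getElementsByTagName("authors")[0]

# Indexing with [0] and reading .data yields a string: the first author's name.
first_name = authors.getElementsByTagName("author")[0].firstChild.data
chars_in_first = len(first_name)

# Without the index, the result is the list of all <author> elements.
author_nodes = authors.getElementsByTagName("author")
num_authors = len(author_nodes)

print("Characters in first author's name:", chars_in_first)
print("Number of authors:", num_authors)
```

The first length counts the characters of a string, the second counts the elements in a node list, which is the number the question was after.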

You can use a list comprehension to get all the author names, instead of the author elements, in a list:

author = [a.firstChild.data for a in authors.getElementsByTagName("author")]
print(author)
# [u'J. K. Schneider', u'C. E. Richardson', u'F. W. Kiefer', u'Venu Govindaraju']
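Since the goal in the question is to get the authors into a database, the same list comprehension can feed an insert, one row per author. The following is only a sketch under stated assumptions: the wrapping pubs root element, the pub_author table and its columns, and the parse_pubs helper are hypothetical names chosen for illustration, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3
from xml.dom import minidom

# Hypothetical input: a <pubs> root wrapping one or more <pub> records.
SAMPLE = """<pubs>
<pub>
  <ID>7</ID>
  <year>2003</year>
  <authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
  </authors>
</pub>
</pubs>"""


def parse_pubs(xml_text):
    """Return a list of (pub_id, [author names]) tuples, one per <pub>."""
    doc = minidom.parseString(xml_text)
    records = []
    for pub in doc.getElementsByTagName("pub"):
        pub_id = int(pub.getElementsByTagName("ID")[0].firstChild.data)
        names = [a.firstChild.data.strip()
                 for a in pub.getElementsByTagName("author")]
        records.append((pub_id, names))
    return records


# In-memory database; the table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pub_author (pub_id INTEGER, author TEXT)")

records = parse_pubs(SAMPLE)
for pub_id, names in records:
    # One row per (publication, author) pair, using bound parameters.
    conn.executemany(
        "INSERT INTO pub_author (pub_id, author) VALUES (?, ?)",
        [(pub_id, name) for name in names],
    )
conn.commit()

rows = conn.execute(
    "SELECT author FROM pub_author WHERE pub_id = 7 ORDER BY rowid"
).fetchall()
print([row[0] for row in rows])
```

Note that minidom builds the whole document in memory. For the full 2-million-line file, a streaming parser such as xml.etree.ElementTree.iterparse from the standard library may be a better fit.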
