Reputation: 349
I just started learning how to parse XML using minidom. I tried to get the authors' names (XML data is below) using the following code:
from xml.dom import minidom
xmldoc = minidom.parse("cora.xml")
author = xmldoc.getElementsByTagName('author')
for author in author:
    authorID = author.getElementsByTagName('author id')
    print authorID
I got empty brackets ([]) every time. Can someone please help me out? I will also need the title and venue. Thanks in advance. See the XML data below:
<?xml version="1.0" encoding="UTF-8"?>
<coraRADD>
  <publication id="ahlskog1994a">
    <author id="199">M. Ahlskog</author>
    <author id="74"> J. Paloheimo</author>
    <author id="64"> H. Stubb</author>
    <author id="103"> P. Dyreklev</author>
    <author id="54"> M. Fahlman</author>
    <title>Inganas</title>
    <title>and</title>
    <title>M.R.</title>
    <venue>
      <venue pubid="ahlskog1994a" id="1">
        <name>Andersson</name>
        <name> J Appl. Phys.</name>
        <vol>76</vol>
        <date> (1994). </date>
      </venue>
Upvotes: 1
Views: 516
Reputation: 1125168
You can only find tags with getElementsByTagName(), not attributes. You'll need to access those through the Element.getAttribute() method instead:
for author in author:
    authorID = author.getAttribute('id')
    print(authorID)
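To also pull the author names, titles, and venue names with minidom, something along these lines may work. It is only a sketch: SAMPLE is a shortened, well-formed stand-in for the question's data, and node_text, extract_authors, and extract_texts are hypothetical helper names, not part of minidom.

```python
from xml.dom import minidom

# Assumption: a shortened, well-formed excerpt modelled on the question's data.
SAMPLE = """<coraRADD>
  <publication id="ahlskog1994a">
    <author id="199">M. Ahlskog</author>
    <author id="74"> J. Paloheimo</author>
    <title>Inganas</title>
    <title>and</title>
    <venue pubid="ahlskog1994a" id="1">
      <name> J Appl. Phys.</name>
      <vol>76</vol>
    </venue>
  </publication>
</coraRADD>"""


def node_text(element):
    """Join the text nodes directly under an element and trim whitespace."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def extract_authors(xml_string):
    """Return (id attribute, name) pairs for every <author> element."""
    doc = minidom.parseString(xml_string)
    return [
        (author.getAttribute("id"), node_text(author))
        for author in doc.getElementsByTagName("author")
    ]


def extract_texts(xml_string, tag):
    """Return the trimmed text of every element with the given tag name."""
    doc = minidom.parseString(xml_string)
    return [node_text(el) for el in doc.getElementsByTagName(tag)]


authors = extract_authors(SAMPLE)
titles = extract_texts(SAMPLE, "title")
venue_names = extract_texts(SAMPLE, "name")
print(authors)
print(titles)
print(venue_names)
```

Note that minidom stores an element's text in child text nodes, which is why a small helper is needed just to read a name.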
If you are still learning about parsing XML, you really want to stay away from the DOM. The DOM API is overly verbose because it was designed to fit many different programming languages rather than Python specifically.
The ElementTree API would be easier to use:
import xml.etree.ElementTree as ET

tree = ET.parse('cora.xml')
root = tree.getroot()

# loop over all publications
for pub in root.findall('publication'):
    # a publication's title is split across several <title> tags
    print(' '.join([t.text for t in pub.findall('title')]))
    for author in pub.findall('author'):
        print('Author id: {}'.format(author.attrib['id']))
        print('Author name: {}'.format(author.text))
    for venue in pub.findall('.//venue[@id]'):  # all venue tags with id attribute
        print(', '.join([name.text for name in venue.findall('name')]))
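If you would rather collect the values than print them, the same ElementTree calls can fill one dictionary per publication. This is a rough sketch under assumptions: SAMPLE is a shortened, well-formed stand-in for cora.xml, and summarize is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: a shortened, well-formed excerpt modelled on the question's data.
SAMPLE = """<coraRADD>
  <publication id="ahlskog1994a">
    <author id="199">M. Ahlskog</author>
    <author id="74"> J. Paloheimo</author>
    <title>Inganas</title>
    <title>and</title>
    <venue>
      <venue pubid="ahlskog1994a" id="1">
        <name>Andersson</name>
        <name> J Appl. Phys.</name>
      </venue>
    </venue>
  </publication>
</coraRADD>"""


def summarize(xml_string):
    """Build one dict per <publication> with its id, title, authors, venues."""
    root = ET.fromstring(xml_string)
    records = []
    for pub in root.findall("publication"):
        records.append({
            "id": pub.get("id"),
            # join the split <title> pieces; guard against empty elements
            "title": " ".join(
                (t.text or "").strip() for t in pub.findall("title")
            ),
            "authors": [
                (a.get("id"), (a.text or "").strip())
                for a in pub.findall("author")
            ],
            # only venue elements that carry an id attribute
            "venues": [
                (n.text or "").strip()
                for v in pub.findall(".//venue[@id]")
                for n in v.findall("name")
            ],
        })
    return records


records = summarize(SAMPLE)
for record in records:
    print(record)
```

Using pub.get("id") instead of pub.attrib["id"] returns None rather than raising KeyError when the attribute is missing, which can be handy for uneven data.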
Upvotes: 1