user2274879
user2274879

Reputation: 349

Python xml parsing using minidom

I just started learning how to parse xml using minidom. I tried to get the author's names (xml data is down below) using the following code:

from xml.dom import minidom

xmldoc = minidom.parse("cora.xml")

author = xmldoc.getElementsByTagName ('author')

for author in author:
    authorID=author.getElementsByTagName('author id')
    print authorID

I got empty brackets([]) all the way. Can someone please help me out? I will also need the title and venue. Thanks in advance. See xml data below:

<?xml version="1.0" encoding="UTF-8"?>
<coraRADD>
   <publication id="ahlskog1994a">
      <author id="199">M. Ahlskog</author>
      <author id="74"> J. Paloheimo</author>
      <author id="64"> H. Stubb</author>
      <author id="103"> P. Dyreklev</author>
      <author id="54"> M. Fahlman</author>
      <title>Inganas</title>
      <title>and</title>
      <title>M.R.</title>
      <venue>
         <venue pubid="ahlskog1994a" id="1">
                  <name>Andersson</name>
                  <name> J Appl. Phys.</name>
                  <vol>76</vol>
                  <date> (1994). </date>
            </venue>

Upvotes: 1

Views: 516

Answers (1)

Martijn Pieters
Martijn Pieters

Reputation: 1125168

You can only find tags with getElementsByTagName(), not attributes. You'll need to access those through the Element.getAttribute() method instead:

for author in author:
    authorID = author.getAttribute('id')
    print authorID

If you are still learning about parsing XML, you really want to stay away from the DOM. The DOM API is overly verbose to fit many different programming languages.

The ElementTree API would be easier to use:

import xml.etree.ElementTree as ET

tree = ET.parse('cora.xml')
root = tree.getroot()

# loop over all publications
for pub in root.findall('publication'):
    print ' '.join([t.text for t in pub.findall('title')])
    for author in pub.findall('author'):
        print 'Author id: {}'.format(author.attrib['id'])
        print 'Author name: {}'.format(author.text)
    for venue in pub.findall('.//venue[@id]'):  # all venue tags with id attribute
        print ', '.join([name.text for name in venue.findall('name')])

Upvotes: 1

Related Questions