JCB

Reputation: 13

Using Python 3.6 to parse XML, how can I determine if an XML tag contains no data?

I am trying to learn Python by writing a script that extracts data from multiple records in an XML file. I have found answers to most of my questions by searching the web, but I have not found a way to determine whether an XML tag contains no data before calling getElementsByTagName("tagname")[0].firstChild.data, which throws an AttributeError when no data is present. I realize that I could wrap the code in a try block and handle the AttributeError, but I would rather know that the tag is empty before I try to extract the data and not have to handle the exception. Here is an example of an XML file that contains two records, one with data in every tag and one with an empty tag.

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<records>
  <rec>
    <name>ZYSRQPO</name>
    <state>Washington</state>
    <country>United States</country>
  </rec>
  <rec>
    <name>ZYXWVUT</name>
    <state></state>
    <country>Mexico</country>
  </rec>
</records>

Here is a sample of the code that I might use to extract the data:

from xml.dom import minidom
import sys

mydoc  = minidom.parse('mydataFile.xml')
records = mydoc.getElementsByTagName("rec")

for rec in records:
    try:
        name = rec.getElementsByTagName("name")[0].firstChild.data
        state = rec.getElementsByTagName("state")[0].firstChild.data
        country = rec.getElementsByTagName("country")[0].firstChild.data
        print('{}\t{}\t{}'.format(name, state, country))

    except (AttributeError):
        print('AttributeError encountered in record {}'.format(name), file=sys.stderr)
        continue

When this file is processed, nothing is printed for the record named ZYXWVUT except the message that an exception was encountered. I would like a null value to be used for the state name and the rest of the record's information to be printed. Is there a way to check, with an if statement, whether the tag contains no data before accessing firstChild.data and hitting the error?
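To make the cause of the error concrete, here is a minimal sketch. It parses an inline string with minidom.parseString instead of reading mydataFile.xml, which is an assumption made only so the snippet runs on its own; the variable names are illustrative.

```python
from xml.dom import minidom

# Inline XML standing in for one record of mydataFile.xml (illustrative only).
doc = minidom.parseString("<rec><name>ZYXWVUT</name><state></state></rec>")

name_node = doc.getElementsByTagName("name")[0]
state_node = doc.getElementsByTagName("state")[0]

# A populated element has a Text node as its first child.
name_has_text = name_node.firstChild is not None

# An empty element has no child nodes, so firstChild is None and
# firstChild.data raises AttributeError.
state_is_empty = state_node.firstChild is None

print(name_has_text, state_is_empty)
```

So the element itself is always found by getElementsByTagName; it is the missing child Text node that triggers the exception.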

Upvotes: 1

Views: 1464

Answers (2)

JCB

Reputation: 13

I tried reedcourty's second suggestion and it worked great. However, I decided that I did not want None returned when an element is empty. Here is what I came up with:

from xml.dom import minidom
import sys

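def get_node_data(node):
    # An empty element such as <state></state> has no child nodes.
    if len(node.childNodes) == 0:
        # Return a visible placeholder instead of None.
        result = '*->No ' + node.nodeName + '<-*'
    else:
        result = node.firstChild.data
    return result

dataFileSpec = 'mydataFile.xml'  # path to the XML file; assumed name, as in the question
mydoc  = minidom.parse(dataFileSpec)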
records = mydoc.getElementsByTagName("rec")

for rec in records:
    name = get_node_data(rec.getElementsByTagName("name")[0])
    state = get_node_data(rec.getElementsByTagName("state")[0])
    country = get_node_data(rec.getElementsByTagName("country")[0])
    print('{}\t{}\t{}'.format(name, state, country))

When this is run against this XML:

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<records>
  <rec>
    <name>ZYSRQPO</name>
    <country>United States</country>
    <state>Washington</state>
  </rec>
  <rec>
    <name></name>
    <country>United States</country>
    <state>Washington</state>
  </rec>
  <rec>
    <name>ZYXWVUT</name>
    <country>Mexico</country>
    <state></state>
  </rec>
  <rec>
    <name>ZYNMLKJ</name>
    <country></country>
    <state>Washington</state>
  </rec>
</records>

It produces this output:

ZYSRQPO Washington      United States
*->No name<-*   Washington      United States
ZYXWVUT *->No state<-*  Mexico
ZYNMLKJ Washington      *->No country<-*

Upvotes: 0

reedcourty

Reputation: 1010

from xml.dom import minidom
import sys

mydoc  = minidom.parse('mydataFile.xml')
records = mydoc.getElementsByTagName("rec")

for rec in records:
    name = rec.getElementsByTagName("name")[0].firstChild.data
    state_node = rec.getElementsByTagName("state")[0]
    state = None if len(state_node.childNodes) == 0 else state_node.firstChild.data
    country = rec.getElementsByTagName("country")[0].firstChild.data
    print('{}\t{}\t{}'.format(name, state, country))

Or, if there is any chance that name and country are empty too:

from xml.dom import minidom
import sys


def get_node_data(node):
    if len(node.childNodes) == 0:
        result = None
    else:
        result = node.firstChild.data
    return result


mydoc  = minidom.parse('mydataFile.xml')
records = mydoc.getElementsByTagName("rec")

for rec in records:
    name = get_node_data(rec.getElementsByTagName("name")[0])
    state = get_node_data(rec.getElementsByTagName("state")[0])
    country = get_node_data(rec.getElementsByTagName("country")[0])
    print('{}\t{}\t{}'.format(name, state, country))
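For anyone who wants to try this without a file on disk, here is a self-contained sketch of the same idea. The helper name get_node_text, its default parameter, and the SAMPLE string are hypothetical additions for illustration; hasChildNodes() is the standard DOM method and is equivalent to the len(node.childNodes) == 0 check above.

```python
from xml.dom import minidom


def get_node_text(node, default=None):
    """Return the text of the node's first child, or default if the element is empty.

    Assumes the first child, when present, is a Text node.
    """
    if not node.hasChildNodes():
        return default
    return node.firstChild.data


# Inline copy of the question's two records (stands in for mydataFile.xml).
SAMPLE = """<records>
  <rec><name>ZYSRQPO</name><state>Washington</state><country>United States</country></rec>
  <rec><name>ZYXWVUT</name><state></state><country>Mexico</country></rec>
</records>"""

doc = minidom.parseString(SAMPLE)
rows = []
for rec in doc.getElementsByTagName("rec"):
    # Use an empty string for missing data so the row can still be joined.
    row = tuple(
        get_node_text(rec.getElementsByTagName(tag)[0], default="")
        for tag in ("name", "state", "country")
    )
    rows.append(row)
    print("\t".join(row))
```

Passing the fallback as a parameter lets the caller choose between None, an empty string, or a visible placeholder without editing the helper.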

Upvotes: 1
