Reputation: 13
I am trying to learn Python by writing a script that extracts data from multiple records in an XML file. I have found answers to most of my questions by searching the web, but I have not found a way to determine whether an XML tag contains no data before calling getElementsByTagName("tagname")[0].firstChild.data, which throws an AttributeError when no data is present. I realize that I could wrap the call in a try block and handle the AttributeError, but I would rather know that the tag is empty before I try to extract the data and not have to handle the exception. Here is an example of an XML file that contains two records, one with data in all of its tags and one with an empty tag.
<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<records>
<rec>
<name>ZYSRQPO</name>
<state>Washington</state>
<country>United States</country>
</rec>
<rec>
<name>ZYXWVUT</name>
<state></state>
<country>Mexico</country>
</rec>
</records>
Here is a sample of the code that I might use to extract the data:
from xml.dom import minidom
import sys
mydoc = minidom.parse('mydataFile.xml')
records = mydoc.getElementsByTagName("rec")
for rec in records:
    try:
        name = rec.getElementsByTagName("name")[0].firstChild.data
        state = rec.getElementsByTagName("state")[0].firstChild.data
        country = rec.getElementsByTagName("country")[0].firstChild.data
        print('{}\t{}\t{}'.format(name, state, country))
    except AttributeError:
        print('AttributeError encountered in record {}'.format(name), file=sys.stderr)
        continue
When this file is processed, nothing is printed for the record named ZYXWVUT except a message that an exception was encountered. I would like a null value to be used for the state name and the rest of the information about this record to be printed. Is there a method I could use in an if statement to determine whether the tag contains no data before I try to read it and encounter the error?
Upvotes: 1
Views: 1464
Reputation: 13
I tried reedcourty's second suggestion and found that it worked great. But I decided that I really did not want None to be returned if the element was empty. Here is what I came up with:
from xml.dom import minidom
import sys
def get_node_data(node):
    # Return a visible placeholder when the element has no child nodes
    if len(node.childNodes) == 0:
        result = '*->No ' + node.nodeName + '<-*'
    else:
        result = node.firstChild.data
    return result
# dataFileSpec is assumed to hold the path to the XML file
mydoc = minidom.parse(dataFileSpec)
records = mydoc.getElementsByTagName("rec")
for rec in records:
    name = get_node_data(rec.getElementsByTagName("name")[0])
    state = get_node_data(rec.getElementsByTagName("state")[0])
    country = get_node_data(rec.getElementsByTagName("country")[0])
    print('{}\t{}\t{}'.format(name, state, country))
When this is run against this XML:
<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<records>
<rec>
<name>ZYSRQPO</name>
<country>United States</country>
<state>Washington</state>
</rec>
<rec>
<name></name>
<country>United States</country>
<state>Washington</state>
</rec>
<rec>
<name>ZYXWVUT</name>
<country>Mexico</country>
<state></state>
</rec>
<rec>
<name>ZYNMLKJ</name>
<country></country>
<state>Washington</state>
</rec>
</records>
It produces this output:
ZYSRQPO Washington United States
*->No name<-* Washington United States
ZYXWVUT *->No state<-* Mexico
ZYNMLKJ Washington *->No country<-*
Upvotes: 0
Reputation: 1010
from xml.dom import minidom
import sys
mydoc = minidom.parse('mydataFile.xml')
records = mydoc.getElementsByTagName("rec")
for rec in records:
    name = rec.getElementsByTagName("name")[0].firstChild.data
    state = None if len(rec.getElementsByTagName("state")[0].childNodes) == 0 else rec.getElementsByTagName("state")[0].firstChild.data
    country = rec.getElementsByTagName("country")[0].firstChild.data
    print('{}\t{}\t{}'.format(name, state, country))
Or, if there is any chance that name and country are empty too:
from xml.dom import minidom
import sys
def get_node_data(node):
    if len(node.childNodes) == 0:
        result = None
    else:
        result = node.firstChild.data
    return result
mydoc = minidom.parse('mydataFile.xml')
records = mydoc.getElementsByTagName("rec")
for rec in records:
    name = get_node_data(rec.getElementsByTagName("name")[0])
    state = get_node_data(rec.getElementsByTagName("state")[0])
    country = get_node_data(rec.getElementsByTagName("country")[0])
    print('{}\t{}\t{}'.format(name, state, country))
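
The helper above assumes that each tag exists in every record and that its first child is a text node. If that might not hold (a tag missing entirely, or an element containing only whitespace or a comment), a slightly more defensive variant could look like the sketch below. The names get_text and extract_rows and the default parameter are made up for this illustration, and the XML is parsed from a string only so that the example is self-contained.

```python
from xml.dom import minidom

# Sample data from the question, inlined so the sketch runs on its own.
XML = """<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<records>
<rec>
<name>ZYSRQPO</name>
<state>Washington</state>
<country>United States</country>
</rec>
<rec>
<name>ZYXWVUT</name>
<state></state>
<country>Mexico</country>
</rec>
</records>"""


def get_text(parent, tag, default=None):
    """Return the text of the first <tag> under parent, or default if empty or missing."""
    elements = parent.getElementsByTagName(tag)
    if not elements:
        return default  # the tag does not appear in this record at all
    # Collect only text and CDATA children, skipping comments and nested elements.
    parts = [child.data for child in elements[0].childNodes
             if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)]
    text = ''.join(parts).strip()
    return text if text else default


def extract_rows(xml_text):
    """Parse the XML text and return one (name, state, country) tuple per <rec>."""
    doc = minidom.parseString(xml_text)
    return [(get_text(rec, 'name'), get_text(rec, 'state'), get_text(rec, 'country'))
            for rec in doc.getElementsByTagName('rec')]


rows = extract_rows(XML)
for row in rows:
    print('{}\t{}\t{}'.format(*row))
```

Passing a different default, for example get_text(rec, 'state', default=''), would give an empty string instead of None for empty tags.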
Upvotes: 1