Reputation: 67
I'm trying to open an XML file and parse it, looking through its tags and finding the text within each specific tag. If the text within the tag matches a string, I want to remove part of the string or substitute it with something else.
My question: I'm not sure whether start = x.find('start_char').text actually gets the text inside the "start_char" tag and saves it to the "start" variable. (Does x.find('tag_name').text actually get the text inside the tag?)
The XML file has the following data:
<?xml version="1.0" encoding="utf-8"?>
<metadata>
    <filter>
        <regex>ATL|LAX|DFW</regex>
        <start_char>3</start_char>
        <end_char></end_char>
        <action>remove</action>
    </filter>
    <filter>
        <regex>DFW.+\.$</regex>
        <start_char>3</start_char>
        <end_char>-1</end_char>
        <action>remove</action>
    </filter>
    <filter>
        <regex>\-</regex>
        <replacement></replacement>
        <action>substitute</action>
    </filter>
    <filter>
        <regex>\s</regex>
        <replacement></replacement>
        <action>substitute</action>
    </filter>
    <filter>
        <regex> T&amp;R$</regex>
        <start_char></start_char>
        <end_char>-4</end_char>
        <action>remove</action>
    </filter>
</metadata>
The Python code I'm using is:
from xml.etree.ElementTree import ElementTree
# filters.xml is the file that holds the things to be filtered
tree = ElementTree()
tree.parse("filters.xml")
# Get the data in the XML file
root = tree.getroot()
# Loop through filters
for x in root.findall('filter'):
    # Find the text inside the regex tag
    regex = x.find('regex').text
    # Find the text inside the start_char tag
    start = x.find('start_char').text
    # Find the text inside the end_char tag
    end = x.find('end_char').text
    # Find the text inside the replacement tag
    #replace = x.find('replacement')
    # Find the text inside the action tag
    action = x.find('action').text
    if action == 'remove':
        if re.match(r'regex', mfn_pn, re.IGNORECASE):
            mfn_pn = mfn_pn[start:end]
    elif action == 'substitute':
        mfn_pn = re.sub(r'regex', '', mfn_pn)
return mfn_pn
Upvotes: 2
Views: 1397
Reputation: 4462
The code start = x.find('start_char').text works only when the filter element has a start_char child; otherwise find() returns None and the line raises AttributeError: 'NoneType' object has no attribute 'text'.
This can be avoided with, for example, the following approach:
# find element
start_el = x.find('start_char')
# check if element exist and assign its text to the variable, None (or another default value) otherwise
start = start_el.text if start_el is not None else None
The same applies to the end variable.
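As an alternative sketch, Element.findtext() from the standard library folds the lookup and the default into one call. It returns the default when the child is missing and an empty string when the child exists but has no text, so the "or None" below is one way to treat both cases alike. The SAMPLE document and the read_bounds helper are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one filter with an empty <end_char>,
# one filter with no start_char/end_char children at all.
SAMPLE = """<metadata>
  <filter><start_char>3</start_char><end_char></end_char></filter>
  <filter><action>substitute</action></filter>
</metadata>"""


def read_bounds(xml_text):
    """Return a (start, end) text pair per filter, None when absent or empty."""
    root = ET.fromstring(xml_text)
    bounds = []
    for x in root.findall('filter'):
        # findtext gives the default for a missing child and '' for an
        # empty one; "or None" maps both cases to None.
        start = x.findtext('start_char', default=None) or None
        end = x.findtext('end_char', default=None) or None
        bounds.append((start, end))
    return bounds


print(read_bounds(SAMPLE))  # [('3', None), (None, None)]
```

Whether an empty element and a missing element should be treated the same depends on what the filters are meant to express; here both are read as "no bound".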
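With this check applied to both start and end, the following values are retrieved for your example document (an empty element such as <end_char></end_char> also has .text equal to None):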
3 None
3 -1
None None
None None
None -4
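Note that these values are strings (or None), so they cannot be used as slice bounds directly. The question's snippet also passes the literal pattern r'regex' to re.match and re.sub instead of the regex variable. A minimal sketch of the whole loop with those points addressed could look like the following; the apply_filters and to_index names, the shortened FILTERS_XML sample, and the choice to use the replacement text when present are assumptions made for illustration, not part of the original code.

```python
import re
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the filters document.
FILTERS_XML = r"""<metadata>
  <filter>
    <regex>ATL|LAX|DFW</regex>
    <start_char>3</start_char>
    <end_char></end_char>
    <action>remove</action>
  </filter>
  <filter>
    <regex>\-</regex>
    <replacement></replacement>
    <action>substitute</action>
  </filter>
</metadata>"""


def to_index(text):
    """Convert element text to an int slice bound; None or '' means no bound."""
    return int(text) if text else None


def apply_filters(mfn_pn, xml_text=FILTERS_XML):
    """Apply each filter in document order and return the filtered string."""
    root = ET.fromstring(xml_text)
    for x in root.findall('filter'):
        regex = x.findtext('regex')
        action = x.findtext('action')
        if action == 'remove':
            start = to_index(x.findtext('start_char'))
            end = to_index(x.findtext('end_char'))
            # Pass the variable holding the pattern, not the string 'regex'.
            if re.match(regex, mfn_pn, re.IGNORECASE):
                mfn_pn = mfn_pn[start:end]
        elif action == 'substitute':
            # An empty or missing <replacement> means "delete the match".
            replacement = x.findtext('replacement') or ''
            mfn_pn = re.sub(regex, replacement, mfn_pn)
    return mfn_pn


print(apply_filters('DFW12-34'))  # 1234
```

Here 'DFW12-34' matches the first filter, so the first three characters are dropped, and the second filter then removes the hyphen.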
Upvotes: 1