Sophia

Reputation: 67

How to get text inside specific tag using tag name with Python

I'm trying to open an XML file and parse it, going through its tags and reading the text within each specific tag. If the text within a tag matches a string, I want to remove part of the string or substitute it with something else.

My question: I'm not sure whether start = x.find('start_char').text actually gets the text inside the "start_char" tag and saves it to the "start" variable. (Does x.find('tag_name').text actually get the text inside the tag?)

The XML file has the following data:

<?xml version="1.0" encoding="utf-8"?>
<metadata>
    <filter>
        <regex>ATL|LAX|DFW</regex >
        <start_char>3</start_char>
        <end_char></end_char>
        <action>remove</action>
    </filter>
    <filter>
        <regex>DFW.+\.$</regex >
        <start_char>3</start_char>
        <end_char>-1</end_char>
        <action>remove</action>
    </filter>
    <filter>
        <regex>\-</regex >
        <replacement></replacement>
        <action>substitute</action>
    </filter>
    <filter>
        <regex>\s</regex >
        <replacement></replacement>
        <action>substitute</action>
    </filter>
    <filter>
        <regex> T&amp;R$</regex >
        <start_char></start_char>
        <end_char>-4</end_char>
        <action>remove</action>
    </filter>
</metadata>

The Python code I'm using is:

import re
from xml.etree.ElementTree import ElementTree

# filters.xml is the file that holds the things to be filtered
tree = ElementTree()
tree.parse("filters.xml")

# Get the data in the XML file 
root = tree.getroot()

# Loop through filters
for x in root.findall('filter'):

    # Find the text inside the regex tag
    regex = x.find('regex').text

    # Find the text inside the start_char tag
    start = x.find('start_char').text

    # Find the text inside the end_char tag
    end = x.find('end_char').text

    # Find the text inside the replacement tag
    #replace = x.find('replacement')

    # Find the text inside the action tag
    action = x.find('action').text

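    if action == 'remove':
        # Pass the regex variable itself, not the literal string 'regex'
        if re.match(regex, mfn_pn, re.IGNORECASE):
            # Element text is a string (or None for an empty tag), so convert before slicing
            start_idx = int(start) if start else None
            end_idx = int(end) if end else None
            mfn_pn = mfn_pn[start_idx:end_idx]

    elif action == 'substitute':
        mfn_pn = re.sub(regex, '', mfn_pn)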

    return mfn_pn

Upvotes: 2

Views: 1397

Answers (1)

Alexandra Dudkina

Reputation: 4462

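Yes, x.find('start_char').text returns the text inside the tag, but only when the filter element has a start_char child. When the child is missing, find() returns None and the code raises AttributeError: 'NoneType' object has no attribute 'text'. An empty tag such as <end_char></end_char> is found, but its text is None.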
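To see the three cases side by side, here is a small sketch. The inline sample element is made up for illustration and is not read from filters.xml:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one child with text, one empty child, one child absent
sample = ET.fromstring(
    "<filter>"
    "<start_char>3</start_char>"
    "<end_char></end_char>"
    "</filter>"
)

present = sample.find("start_char")   # element that has text
empty = sample.find("end_char")       # element that exists but has no text
missing = sample.find("replacement")  # no such child, so find() returns None

print(present.text)  # 3
print(empty.text)    # None
print(missing)       # None

# Accessing .text on the missing child is what raises the error
try:
    missing.text
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```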

The error can be avoided, for example, with the following approach:

# find element
start_el = x.find('start_char')
# if the element exists, assign its text to the variable; otherwise use None (or another default value)
start = start_el.text if start_el is not None else None

The same applies to the end variable.
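As an alternative sketch, Element.findtext() from the standard library combines the lookup and the None check: it returns the given default when the child is missing and an empty string when the child exists but has no text. The helper name char_index below is hypothetical, introduced only for illustration; it also converts the text to an int so the value can be used as a slice bound.

```python
import xml.etree.ElementTree as ET


def char_index(filter_el, tag):
    """Return the child's text as an int, or None if it is missing or empty.

    Hypothetical helper, not part of the original code.
    """
    text = filter_el.findtext(tag, default="").strip()
    return int(text) if text else None


# Made-up sample element mirroring the first filter in the question
x = ET.fromstring(
    "<filter><start_char>3</start_char><end_char></end_char></filter>"
)

start = char_index(x, "start_char")
end = char_index(x, "end_char")
missing = char_index(x, "replacement")

print(start, end, missing)    # 3 None None
print("ATL12345"[start:end])  # 12345
```

A None slice bound means "no limit", so an empty or missing start_char or end_char simply leaves that side of the string untouched.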

With the explicit None check shown above, the following (start, end) values are retrieved for your example document:

3 None
3 -1
None None
None None
None -4
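Putting the pieces together, here is a self-contained sketch that reproduces the values above and then applies the filters to a string. The XML is inlined instead of read from filters.xml, and the names text_or_none, to_index, apply_filters and the sample part number are assumptions made for this example, not part of the original code.

```python
import re
import xml.etree.ElementTree as ET

# The filters from the question, inlined so the sketch runs on its own
FILTERS_XML = r"""
<metadata>
    <filter>
        <regex>ATL|LAX|DFW</regex>
        <start_char>3</start_char>
        <end_char></end_char>
        <action>remove</action>
    </filter>
    <filter>
        <regex>DFW.+\.$</regex>
        <start_char>3</start_char>
        <end_char>-1</end_char>
        <action>remove</action>
    </filter>
    <filter>
        <regex>\-</regex>
        <replacement></replacement>
        <action>substitute</action>
    </filter>
    <filter>
        <regex>\s</regex>
        <replacement></replacement>
        <action>substitute</action>
    </filter>
    <filter>
        <regex> T&amp;R$</regex>
        <start_char></start_char>
        <end_char>-4</end_char>
        <action>remove</action>
    </filter>
</metadata>
"""


def text_or_none(parent, tag):
    """Text of the child element, or None if the child is missing or empty."""
    el = parent.find(tag)
    return el.text if el is not None else None


def to_index(text):
    """Convert element text to an int slice bound; None means no bound."""
    return int(text) if text else None


def apply_filters(root, mfn_pn):
    """Apply every filter in document order and return the result."""
    for f in root.findall("filter"):
        regex = text_or_none(f, "regex")
        action = text_or_none(f, "action")
        if action == "remove":
            # re.match anchors at the start of the string, as in the question
            if re.match(regex, mfn_pn, re.IGNORECASE):
                start = to_index(text_or_none(f, "start_char"))
                end = to_index(text_or_none(f, "end_char"))
                mfn_pn = mfn_pn[start:end]
        elif action == "substitute":
            # An empty <replacement> tag has text None, so fall back to ""
            replacement = text_or_none(f, "replacement") or ""
            mfn_pn = re.sub(regex, replacement, mfn_pn)
    return mfn_pn


root = ET.fromstring(FILTERS_XML)

# Reproduce the (start, end) pairs listed above
pairs = [
    (text_or_none(f, "start_char"), text_or_none(f, "end_char"))
    for f in root.findall("filter")
]
for start, end in pairs:
    print(start, end)

# Hypothetical part number, chosen only to exercise the filters
result = apply_filters(root, "ATL12-34 56")
print(result)  # 123456
```

Note that the return sits after the loop, so all filters are applied before the value is handed back.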

Upvotes: 1
