x89

Reputation: 3460

filter non-nested tag values from XML

I have an XML file that looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" parent_id="12">
        <name>Alpha</name>
        <pos>697</pos>
        <kat_pis>
            <pos kat="2">112</pos>
        </kat_pis>
    </offer>
    <offer id="12" parent_id="31">
        <name>Beta</name>
        <pos>099</pos>
        <kat_pis>
            <pos kat="2">113</pos>
        </kat_pis>
    </offer>
</details>
</main_heading>

I am parsing it using BeautifulSoup. When I do this:

from bs4 import BeautifulSoup

# `file` is the XML document (an open file handle or a string)
soup = BeautifulSoup(file, 'xml')

pos = []
for i in (soup.find_all('pos')):
    pos.append(i.text)

I get a list of all pos tag values, including the ones nested inside the kat_pis tag.

So I get (697, 112, 099, 113).

However, I only want the pos values of the non-nested tags.

The desired result is (697, 099).

How can I achieve this?

Upvotes: 0

Views: 45

Answers (2)

Barry the Platipus

Reputation: 10460

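Here is one way of getting those first-level pos elements, using a CSS child selector (offer > pos) so that only pos tags directly under offer are matched: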

from bs4 import BeautifulSoup as bs

xml_doc = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" parent_id="12">
        <name>Alpha</name>
        <pos>697</pos>
        <kat_pis>
            <pos kat="2">112</pos>
        </kat_pis>
    </offer>
    <offer id="12" parent_id="31">
        <name>Beta</name>
        <pos>099</pos>
        <kat_pis>
            <pos kat="2">113</pos>
        </kat_pis>
    </offer>
</details>
</main_heading>'''

soup = bs(xml_doc, 'xml')

pos = []
for i in (soup.select('offer > pos')):
    pos.append(i.text)

print(pos)

Result in terminal:

['697', '099']
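
If you would rather stay with find_all than use a CSS selector, a rough equivalent is to loop over each offer and search only its direct children with recursive=False. This is a minimal sketch, assuming the same structure as the sample XML (pos sits directly under offer); the trimmed xml_doc string here is just for illustration:

```python
from bs4 import BeautifulSoup

# Same sample data as in the question, without the XML declaration
xml_doc = """<main_heading timestamp="20220113">
<details>
    <offer id="11" parent_id="12">
        <name>Alpha</name>
        <pos>697</pos>
        <kat_pis>
            <pos kat="2">112</pos>
        </kat_pis>
    </offer>
    <offer id="12" parent_id="31">
        <name>Beta</name>
        <pos>099</pos>
        <kat_pis>
            <pos kat="2">113</pos>
        </kat_pis>
    </offer>
</details>
</main_heading>"""

soup = BeautifulSoup(xml_doc, 'xml')

pos = []
for offer in soup.find_all('offer'):
    # recursive=False restricts the search to direct children of this offer,
    # so the pos tags inside kat_pis are skipped
    for tag in offer.find_all('pos', recursive=False):
        pos.append(tag.text)

print(pos)
```

The values stay strings, so the leading zero in 099 is preserved.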

Upvotes: 2

larsks

Reputation: 311645

I think the best solution would be to abandon BeautifulSoup for an XML parser with XPath support, like lxml. Using XPath expressions, you can ask for only those pos elements that are children of offer elements:

from lxml import etree

# Passing the filename lets lxml open the file and handle the declared encoding
doc = etree.parse('data.xml')

pos = []
for ele in (doc.xpath('//offer/pos')):
    pos.append(ele.text)

print(pos)

Given your example input, the above code prints:

['697', '099']
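
If installing lxml is not an option, the same idea can probably be expressed with the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. This is only a sketch, not a drop-in replacement for full XPath; it parses the sample data from a string (xml_doc is an assumption here, standing in for data.xml):

```python
import xml.etree.ElementTree as ET

# Same sample data as in the question, without the XML declaration
xml_doc = """<main_heading timestamp="20220113">
<details>
    <offer id="11" parent_id="12">
        <name>Alpha</name>
        <pos>697</pos>
        <kat_pis>
            <pos kat="2">112</pos>
        </kat_pis>
    </offer>
    <offer id="12" parent_id="31">
        <name>Beta</name>
        <pos>099</pos>
        <kat_pis>
            <pos kat="2">113</pos>
        </kat_pis>
    </offer>
</details>
</main_heading>"""

root = ET.fromstring(xml_doc)

# './/offer/pos' matches pos elements that are direct children of any offer,
# so the pos elements nested under kat_pis are not selected
pos = [ele.text for ele in root.findall('.//offer/pos')]

print(pos)
```

For the sample input this should also print ['697', '099'].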

Upvotes: 1
