alphazwest

Reputation: 4450

In BeautifulSoup4, Get All SubElements of an Element, but Not SubElements of the SubElements

I've got the following html:

<div class="what-im-after">
    <p>
        "content I want"
    </p>
    <p>
        "content I want"
    </p>
    <p>
        "content I want"
    </p>
    <div class="not-what-im-after">
        <p>
            "content I don't want"
        </p>
    </div>
    <p>
        "content I want"
    </p><p>
        "content I want"
    </p>
</div>

I'm trying to extract all the content from the paragraph tags that are SubElements of the <div class="what-im-after"> container, but not the ones that are found within the <div class="not-what-im-after"> container.

When I do this:

soup = BeautifulSoup(html.text, 'lxml')
content = soup.find('div', class_='what-im-after').findAll('p')

I get back all the <p> tags, including those within the <div class="not-what-im-after">, which makes complete sense to me; that's what I'm asking it for.

My question is: how do I tell BeautifulSoup to get all the <p> tags that are direct children of the container, skipping any that sit inside another SubElement?

Upvotes: 1

Views: 1265

Answers (3)

Padraic Cunningham

Reputation: 180401

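Set recursive=False if you only want the p tags that sit directly under the what-im-after div and are not nested inside any other tag:

# html is the response object from the question; 'lxml' matches the parser used there
soup = BeautifulSoup(html.text, 'lxml')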

print(soup.find('div', class_='what-im-after').find_all("p", recursive=False))

With recursive=False, find_all only looks at the direct children of the div, so it gives the same result as your logic of checking each tag's parent.
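
As a self-contained illustration of the recursive=False approach, here is a minimal sketch. The sample markup is a shortened version of the HTML in the question with the attribute quotes normalised, the numbered paragraph text is made up for the demo, the helper name direct_paragraphs is hypothetical, and the built-in 'html.parser' is used only so the sketch does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Shortened sample markup modelled on the question (an assumption for this demo).
HTML = """
<div class="what-im-after">
    <p>content I want 1</p>
    <p>content I want 2</p>
    <div class="not-what-im-after">
        <p>content I don't want</p>
    </div>
    <p>content I want 3</p>
</div>
"""


def direct_paragraphs(markup):
    """Return the text of <p> tags that are direct children of the target div."""
    soup = BeautifulSoup(markup, "html.parser")
    container = soup.find("div", class_="what-im-after")
    # recursive=False restricts the search to direct children of the container
    direct = container.find_all("p", recursive=False)
    return [p.get_text(strip=True) for p in direct]


if __name__ == "__main__":
    print(direct_paragraphs(HTML))
```

The paragraph inside the nested div is skipped because it is a grandchild of the container, not a direct child.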

Upvotes: 2

Find

Reputation: 75

from bs4 import BeautifulSoup

htmltxt = """<div class="what-im-after">
    <p>
        "content I want"
    </p>
    <p>
        "content I want"
    </p>
    <p>
        "content I want"
    </p>
    <div class="not-what-im-after">
        <p>
            "content I don't want"
        </p>
    </div>
    <p>
        "content I want"
    </p><p>
        "content I want"
    </p>
</div>"""

soup = BeautifulSoup(htmltxt, 'lxml')


def filter_p(container):
    items = container.contents
    ans = []
    for item in items:
        if item.name == 'p':
            ans.append(item)
    return ans

print(filter_p(soup.div))

This may be what you want: it keeps only the first-level <p> children of the div.

Upvotes: -1

alphazwest

Reputation: 4450

In the course of writing this question, an approach came to mind which seems to work fine.

Basically, I'm checking each <p> element to see if the parent element is <div class="what-im-after"> which, in effect, excludes any <p> tags nested within subelements.

My code is as follows:

filter_list = []

parent = soup.find('div', class_='what-im-after')
content = parent.findAll('p')

# findAll returns a list of tags, so check each tag's parent individually
for p in content:
    if p.parent is parent:
        filter_list.append(p)

filter_list then contains all of the <p> tags that aren't nested within other SubElements.
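
The parent check described above can also be sketched as a list comprehension and compared against recursive=False. This is only an illustration: the sample markup is a shortened version of the question's HTML with made-up numbered text, the two helper names are hypothetical, and 'html.parser' is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Shortened sample markup modelled on the question (an assumption for this demo).
HTML = """
<div class="what-im-after">
    <p>content I want 1</p>
    <div class="not-what-im-after">
        <p>content I don't want</p>
    </div>
    <p>content I want 2</p>
</div>
"""


def texts_by_parent_check(markup):
    """Keep <p> tags whose direct parent is the target div."""
    soup = BeautifulSoup(markup, "html.parser")
    parent = soup.find("div", class_="what-im-after")
    # find_all searches all descendants; the identity check drops nested ones
    kept = [p for p in parent.find_all("p") if p.parent is parent]
    return [p.get_text(strip=True) for p in kept]


def texts_by_recursive_false(markup):
    """Same selection, letting find_all look at direct children only."""
    soup = BeautifulSoup(markup, "html.parser")
    parent = soup.find("div", class_="what-im-after")
    return [p.get_text(strip=True) for p in parent.find_all("p", recursive=False)]


if __name__ == "__main__":
    print(texts_by_parent_check(HTML))
    print(texts_by_recursive_false(HTML))
```

Both helpers select the same paragraphs on this sample; the recursive=False version avoids visiting the nested tags in the first place.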

Upvotes: 0
