Reputation: 4450
I've got the following HTML:
<div class="what-im-after">
<p>
"content I want"
</p>
<p>
"content I want"
</p>
<p>
"content I want"
</p>
<div class="not-what-im-after">
<p>
"content I don't want"
</p>
</div>
<p>
"content I want"
</p><p>
"content I want"
</p>
</div>
I'm trying to extract the content of all the paragraph tags that are direct children of the <div class="what-im-after">
container, but not the ones nested inside the <div class="not-what-im-after">
container.
When I do this:
soup = BeautifulSoup(html.text, 'lxml')
content = soup.find('div', class_='what-im-after').findAll('p')
I get back all the <p>
tags, including those within the <div class="not-what-im-after">
container, which makes complete sense to me; that's what I'm asking it for.
My question is: how do I instruct Python to get all the <p>
tags unless they are nested inside another child element?
Upvotes: 1
Views: 1265
Reputation: 180401
Set recursive=False if you only want the p tags directly under the what-im-after
div, and not those nested inside any other tags:
soup = BeautifulSoup(html.text, 'lxml')
print(soup.find('div', class_='what-im-after').find_all("p", recursive=False))
That is exactly the same as your loop logic checking the parent.
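To make the equivalence concrete, here is a minimal sketch that runs both approaches on a trimmed-down version of the markup from the question. The sample HTML, the variable names, and the use of the built-in 'html.parser' (instead of lxml) are assumptions made only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question (hypothetical sample).
html = """<div class="what-im-after">
<p>wanted 1</p>
<p>wanted 2</p>
<div class="not-what-im-after">
<p>unwanted</p>
</div>
<p>wanted 3</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="what-im-after")

# recursive=False restricts the search to direct children of the container.
direct = container.find_all("p", recursive=False)

# Same result by searching all descendants and keeping only those
# whose immediate parent is the container itself.
by_parent = [p for p in container.find_all("p") if p.parent is container]

direct_texts = [p.get_text(strip=True) for p in direct]
parent_texts = [p.get_text(strip=True) for p in by_parent]
print(direct_texts)
print(parent_texts)
```

Both lists should contain only the three top-level paragraphs; the paragraph inside the nested div is skipped in each case.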
Upvotes: 2
Reputation: 75
from bs4 import BeautifulSoup
htmltxt = """<div class="what-im-after">
<p>
"content I want"
</p>
<p>
"content I want"
</p>
<p>
"content I want"
</p>
<div class="not-what-im-after">
<p>
"content I don't want"
</p>
</div>
<p>
"content I want"
</p><p>
"content I want"
</p>
</div>"""
soup = BeautifulSoup(htmltxt, 'lxml')
def filter_p(container):
    # .contents holds only the direct children of the container
    items = container.contents
    ans = []
    for item in items:
        if item.name == 'p':
            ans.append(item)
    return ans
print(filter_p(soup.div))
This may be what you want: it keeps only the first-level p children of the div.
Upvotes: -1
Reputation: 4450
In the course of writing this question, an approach came to mind which seems to work fine.
Basically, I'm checking each <p>
element to see if the parent element is <div class="what-im-after">
which, in effect, excludes any <p>
tags nested within subelements.
My code is as follows:
filter_list = []
parent = soup.find('div', class_='what-im-after')
# findAll returns a list of tags, so check each tag's parent individually
for p in parent.findAll('p'):
    if p.parent is parent:
        filter_list.append(p)
filter_list
then contains all of the <p>
tags that aren't nested within other SubElements.
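As an alternative not mentioned in this thread, the same "direct children only" filter can be expressed with a CSS child combinator via select(). This is a sketch under assumptions: the sample markup and variable names are hypothetical, and it relies on a Beautiful Soup version that ships CSS selector support.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question (hypothetical sample).
html = """<div class="what-im-after">
<p>wanted 1</p>
<div class="not-what-im-after">
<p>unwanted</p>
</div>
<p>wanted 2</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# "A > B" matches B only when it is a direct child of A, so <p> tags
# nested inside the inner div are not selected.
children = soup.select("div.what-im-after > p")
child_texts = [p.get_text(strip=True) for p in children]
print(child_texts)
```

The class selector matches whole class names, so div.what-im-after does not match the inner not-what-im-after div.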
Upvotes: 0