Reputation: 45
I tried to get all the text out of the following HTML structure:
<div class="header"><h1>Header</h1></div>
<div class="container">
<div class="header"><h1>Sub Header</h1></div>
<p>Target_2</p>
<p>Target_3</p>
<p>Target_4</p>
</div>
My approach was something like this:
targets = soup.find_all("div", class_=["header", "container"])
for html_row in targets:
    for row in html_row.strings:
        print(row)
Output:
Header
Sub Header
Target_2
Target_3
Target_4
Sub Header
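My problem is that "Sub Header" is found twice: once as part of the container div, and once more because the nested div also has the header class.
How can I exclude the header class when it appears inside the container class?
I have to grab everything that has either of these classes.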
Upvotes: 2
Views: 150
Reputation: 20038
You can set the recursive argument to False, so that find_all only considers direct children of the object it is called on:
from bs4 import BeautifulSoup
html = """
<div class="header"><h1>Header</h1></div>
<div class="container">
<div class="header"><h1>Sub Header</h1></div>
<p>Target_2</p>
<p>Target_3</p>
<p>Target_4</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")
targets = soup.find_all("div", class_=["header", "container"], recursive=False)
for tag in targets:
    print(tag.text.strip())
Output:
Header
Sub Header
Target_2
Target_3
Target_4
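One caveat worth checking with recursive=False: it looks only at direct children of whatever object find_all is called on. The sketch below assumes the same markup is wrapped in <html><body> (that wrapper is an assumption for illustration, not part of the question). In that case the search has to start from soup.body rather than from soup.

```python
from bs4 import BeautifulSoup

# Same kind of fragment as above, but wrapped in <html><body>
# (the wrapper is an assumption made for this illustration).
html = """<html><body>
<div class="header"><h1>Header</h1></div>
<div class="container">
<div class="header"><h1>Sub Header</h1></div>
<p>Target_2</p>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
classes = ["header", "container"]

# The only direct child of the soup object is <html>, so no div matches here.
at_root = soup.find_all("div", class_=classes, recursive=False)

# Searching from <body> sees the two top-level divs and skips the nested header.
at_body = soup.body.find_all("div", class_=classes, recursive=False)

print(len(at_root), len(at_body))
```

Parsers that add <html> and <body> on their own would likely need the same adjustment.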
Upvotes: 2
Reputation: 195438
You can add a condition inside the loop that skips any tag nested inside another tag with class="container", using .find_parent():
from bs4 import BeautifulSoup
html_doc = '''<div class="header"><h1>Header</h1></div>
<div class="container">
<div class="header"><h1>Sub Header</h1></div>
<p>Target_2</p>
<p>Target_3</p>
<p>Target_4</p>
</div>'''
soup = BeautifulSoup(html_doc, 'html.parser')
targets = soup.find_all("div", class_=["header", "container"])
for tag in targets:
    if tag.find_parent(attrs={'class':'container'}):
        continue
    print(tag.text.strip())
Prints:
Header
Sub Header
Target_2
Target_3
Target_4
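The same check can be generalized a little. The following is only a sketch: outermost_strings is a hypothetical helper name, and it assumes the goal is to skip any match that sits inside another matching div, not just inside the container. It also collects the strings one by one, as in the question's loop.

```python
from bs4 import BeautifulSoup

html_doc = '''<div class="header"><h1>Header</h1></div>
<div class="container">
<div class="header"><h1>Sub Header</h1></div>
<p>Target_2</p>
<p>Target_3</p>
<p>Target_4</p>
</div>'''


def outermost_strings(soup, classes):
    """Hypothetical helper: text of matching divs, ignoring nested matches."""
    lines = []
    for tag in soup.find_all("div", class_=classes):
        # A match that has a matching ancestor is already covered by that ancestor.
        if tag.find_parent("div", class_=classes):
            continue
        # stripped_strings yields each text node without surrounding whitespace.
        lines.extend(tag.stripped_strings)
    return lines


soup = BeautifulSoup(html_doc, "html.parser")
result = outermost_strings(soup, ["header", "container"])
print(result)
```

Unlike recursive=False, this does not depend on the matching divs being direct children of the object being searched.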
Upvotes: 1