Jonah Ho

Reputation: 45

BeautifulSoup4 exclude div that is in wrapper

I am trying to extract all of the text from the following HTML structure:

<div class="header"><h1>Header</h1></div>
<div class="container">
    <div class="header"><h1>Sub Header</h1></div>
    <p>Target_2</p>
    <p>Target_3</p>
    <p>Target_4</p>
</div>

My approach was something like this:

targets = soup.find_all("div", class_=["header", "container"])

for html_row in targets:
    for row in html_row.strings:
        print(row)

Output:

Header
Sub Header
Target_2
Target_3
Target_4
Sub Header

My problem is that "Sub Header" is printed twice: once as part of the container, and once more because the nested div also has the header class. How can I exclude a header div that sits inside a container div? I have to select the elements by these classes.

Upvotes: 2

Views: 150

Answers (2)

MendelG

Reputation: 20038

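You can set the recursive argument to False, so that find_all() only considers direct children of the object it is called on (here, the top level of the soup) and skips the nested header: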

from bs4 import BeautifulSoup


html = """
<div class="header"><h1>Header</h1></div>
<div class="container">
    <div class="header"><h1>Sub Header</h1></div>
    <p>Target_2</p>
    <p>Target_3</p>
    <p>Target_4</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
targets = soup.find_all("div", class_=["header", "container"], recursive=False)

for tag in targets:
    print(tag.text.strip())

Output:

Header
Sub Header
Target_2
Target_3
Target_4
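One caveat worth sketching: recursive=False is relative to the object you call find_all() on. The snippet below assumes a variant of the question's HTML that is wrapped in <html> and <body> tags (that wrapper is my assumption, not part of the original markup). In that case the outer divs are no longer direct children of the soup, so the search has to start from soup.body instead:

```python
from bs4 import BeautifulSoup

# Assumed variant of the question's markup, wrapped in <html>/<body>.
html = """<html><body>
<div class="header"><h1>Header</h1></div>
<div class="container">
    <div class="header"><h1>Sub Header</h1></div>
    <p>Target_2</p>
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
classes = ["header", "container"]

# The only direct child of the soup is <html>, so nothing matches here.
top_level_from_soup = soup.find_all("div", class_=classes, recursive=False)

# The two outer divs are direct children of <body>, so both match here,
# while the nested header div is skipped.
top_level_from_body = soup.body.find_all("div", class_=classes, recursive=False)
names = [tag["class"][0] for tag in top_level_from_body]
```

The same thing can happen with a fragment if you switch to a parser that adds <html> and <body> itself, such as lxml.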

Upvotes: 2

Andrej Kesely

Reputation: 195438

You can put a condition inside the loop that uses .find_parent() to skip any tag that is nested inside another tag with class="container":

from bs4 import BeautifulSoup


html_doc = '''<div class="header"><h1>Header</h1></div>
<div class="container">
    <div class="header"><h1>Sub Header</h1></div>
    <p>Target_2</p>
    <p>Target_3</p>
    <p>Target_4</p>
</div>'''

soup = BeautifulSoup(html_doc, 'html.parser')
targets = soup.find_all("div", class_=["header", "container"])

for tag in targets:
    if tag.find_parent(attrs={'class':'container'}):
        continue
    print(tag.text.strip())

Prints:

Header
Sub Header
Target_2
Target_3
Target_4
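If you would rather not hard-code the "container" class, a possible generalisation is to drop every match that is nested inside another match. This is only a sketch: outermost() is a hypothetical helper name of mine, not something BeautifulSoup provides, and it compares tags by identity rather than by content.

```python
from bs4 import BeautifulSoup


def outermost(tags):
    """Keep only the tags that are not descendants of another tag in `tags`.

    Hypothetical helper, not part of bs4.
    """
    # Compare by id() so two tags with identical markup are not confused.
    matched = {id(tag) for tag in tags}
    return [
        tag for tag in tags
        if not any(id(parent) in matched for parent in tag.parents)
    ]


html_doc = '''<div class="header"><h1>Header</h1></div>
<div class="container">
    <div class="header"><h1>Sub Header</h1></div>
    <p>Target_2</p>
    <p>Target_3</p>
    <p>Target_4</p>
</div>'''

soup = BeautifulSoup(html_doc, "html.parser")
targets = soup.find_all("div", class_=["header", "container"])

# One text entry per outermost match; inner strings are joined with a space.
texts = [tag.get_text(" ", strip=True) for tag in outermost(targets)]
```

With the question's HTML, the nested header div is dropped because its parent container is also in the result set, so its text appears only once, as part of the container.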

Upvotes: 1
