BeautifulSoup4 exclude div that is in wrapper

Question

I tried to get all the text out of the following HTML structure:

Header

    Sub Header
    Target_2
    Target_3
    Target_4

My approach was something like this:

targets = soup.find_all("div", class_=["header", "container"])

for html_row in targets:
    for row in html_row.strings:
        print(row)

Output:

Header
Sub Header
Target_2
Target_3
Target_4
Sub Header

My problem is that "Sub Header" is printed twice: once as part of the container, and once more because its own div also has the header class. How can I exclude the header div that is nested inside the container? I still have to select the elements by these classes.

MendelG · Accepted Answer

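You can pass recursive=False to find_all(), so that it only considers direct children of the object it is called on (here, the top-level soup object) and skips the header div nested inside the container: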

from bs4 import BeautifulSoup


html = """
Header

    Sub Header
    Target_2
    Target_3
    Target_4
"""

soup = BeautifulSoup(html, "html.parser")
targets = soup.find_all("div", class_=["header", "container"], recursive=False)

for tag in targets:
    print(tag.text.strip())

Output:

Header
Sub Header
Target_2
Target_3
Target_4
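Because the markup in the snippets above did not survive, here is a minimal self-contained sketch using an assumed structure (the tag names and nesting are guesses; only the class names header and container come from the question). It compares the default recursive search with recursive=False, and also shows a possible fallback for when the divs are not direct children of the object you search from: keeping only matches that have no matching ancestor, via find_parent().

```python
from bs4 import BeautifulSoup

# ASSUMPTION: the original HTML was not preserved; this structure is a guess
# that matches the class names and the output described in the question.
html = """
<div class="header">Header</div>
<div class="container">
    <div class="header">Sub Header</div>
    <div>Target_2</div>
    <div>Target_3</div>
    <div>Target_4</div>
</div>
"""

wanted = ["header", "container"]
soup = BeautifulSoup(html, "html.parser")

# Default search descends into the container, so the nested header matches too.
recursive_hits = soup.find_all("div", class_=wanted)

# recursive=False only looks at direct children of `soup`.
top_level_hits = soup.find_all("div", class_=wanted, recursive=False)

# Fallback when the divs sit deeper in the tree: drop any match that is
# itself nested inside another matching div.
outermost_hits = [
    tag for tag in recursive_hits
    if tag.find_parent("div", class_=wanted) is None
]

# Collect the text of each selected tag, one whitespace-stripped string per entry.
lines = [text for tag in top_level_hits for text in tag.stripped_strings]
for line in lines:
    print(line)
```

With this assumed markup, both top_level_hits and outermost_hits select the same two outer divs, so "Sub Header" is only reached once, through its container.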
