rdk

Reputation: 449

Beautiful Soup - Ignore child divs with same name as parent div

The HTML is structured like this:

  <div class="my_class">
       <div>important text</div>
       <div class="my_class">
            <div>not important</div>
       </div>
   </div>
   <div class="my_class">
       <div>important text</div>
       <div class="my_class">
            <div>not important</div>
       </div>
   </div>
   ...

Basically, many divs share a class name with their child divs, and I want to find the "important text", which sits directly under the parent div only.

When I search for all divs with class="my_class", I get both the parents and the children. How can I get only the parent divs?

Here is my code for getting all divs with class = "my_class" and finding the important text:

my_div_list = soup.find_all('div', attrs={'class': 'my_class'})
for my_div in my_div_list:
    text_item = my_div.find('div') # to get to the div that contains the important text
    print(text_item.getText())

Obviously, the output is:

important text
not important
important text
not important
...

But what I want is:

 important text
 important text
 ...

Upvotes: 0

Views: 836

Answers (3)

QHarr

Reputation: 84465

With bs4 4.7.1+ you can use the :has and :first-child CSS pseudo-classes:

from bs4 import BeautifulSoup as bs

html = '''<div class="my_class">
       <div>important text</div>
       <div class="my_class">
            <div>not important</div>
       </div>
   </div>
   <div class="my_class">
       <div>important text</div>
       <div class="my_class">
            <div>not important</div>
       </div>
   </div>'''

soup = bs(html, 'lxml')
print([i.text for i in soup.select('.my_class:has(>.my_class) > div:first-child')])
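
To make the selector easier to follow, here is a rough sketch that splits it into its two parts. It assumes the same markup as above and uses html.parser instead of lxml only to avoid the extra dependency; the names parents and texts are just for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="my_class">
  <div>important text</div>
  <div class="my_class"><div>not important</div></div>
</div>
<div class="my_class">
  <div>important text</div>
  <div class="my_class"><div>not important</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Part 1: keep only the .my_class divs that have a direct .my_class child,
# i.e. the parents. The nested divs have no such child, so they drop out.
parents = soup.select(".my_class:has(> .my_class)")

# Part 2: inside each parent, take the div that is its first element child.
texts = [p.select_one(":scope > div:first-child").get_text() for p in parents]
print(texts)
```

This only works when every parent really contains a nested .my_class div; a parent without one would not match :has.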

Upvotes: 1

Julia K

Reputation: 407

From the find_all() documentation:

recursive is a boolean argument (defaulting to True) which tells Beautiful Soup whether to go all the way down the parse tree, or whether to only look at the immediate children of the Tag or the parser object.

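So, assuming the first level of divs sits directly under <body> (inside <html>, which the lxml parser adds if missing), you can fetch only those divs and then read the first div inside each:

parents = soup.html.body.find_all('div', attrs={'class': 'my_class'}, recursive=False)
print([d.find('div').get_text() for d in parents])

Output:

 ['important text', 'important text']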
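
As a self-contained sketch of the recursive flag, assuming the markup from the question and html.parser (which does not wrap the fragment in <html>/<body>, so the top-level divs are direct children of the soup object itself). The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<div class="my_class">
  <div>important text</div>
  <div class="my_class"><div>not important</div></div>
</div>
<div class="my_class">
  <div>important text</div>
  <div class="my_class"><div>not important</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Default (recursive=True): walks the whole tree, so nested divs match too.
all_divs = soup.find_all("div", class_="my_class")

# recursive=False: only looks at the direct children of the soup object.
top_level = soup.find_all("div", class_="my_class", recursive=False)

# The first div inside each top-level div holds the text we want.
texts = [d.find("div").get_text() for d in top_level]
print(len(all_divs), len(top_level), texts)
```

If the divs are nested deeper in a real page, call find_all with recursive=False on whatever element is their direct parent.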

Upvotes: 1

Ajax1234

Reputation: 71451

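You can iterate over the top-level nodes in soup.contents, keeping only the tags:

from bs4 import BeautifulSoup as soup
from bs4.element import Tag
# html is the markup string from the question; whitespace between tags is skipped
r = [i.div.text for i in soup(html, 'html.parser').contents if isinstance(i, Tag)]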

Output:

['important text', 'important text']
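
For reference, a runnable sketch of the same idea, assuming the question's markup. Note that .contents holds the text between tags as well as the tags themselves, so some filter is needed before using .div; here the filter is an isinstance check against Tag, and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<div class="my_class">
  <div>important text</div>
  <div class="my_class"><div>not important</div></div>
</div>
<div class="my_class">
  <div>important text</div>
  <div class="my_class"><div>not important</div></div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# .contents lists direct children only: the two outer divs plus the
# whitespace string that separates them.
top_level_tags = [node for node in soup.contents if isinstance(node, Tag)]

# tag.div is the first div found inside each outer div.
texts = [tag.div.get_text() for tag in top_level_tags]
print(texts)
```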

Upvotes: 1
