Reputation: 449
The HTML is structured like so:
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
...
Basically, there are many divs with the same class name as their child divs, and ultimately I want to find the "important text", which is found under the parent div only.
When I try to find all divs with class="my_class", I obviously get both the parents and the children. How can I get only the parent divs?
Here is my code for getting all divs with class = "my_class" and finding the important text:
my_div_list = soup.find_all('div', attrs={'class': 'my_class'})
for my_div in my_div_list:
    text_item = my_div.find('div')  # to get to the div that contains the important text
    print(text_item.getText())
Obviously, the output is:
important text
not important
important text
not important
...
When I want:
important text
important text
...
Upvotes: 0
Views: 836
Reputation: 84465
With bs4 4.7.1 or later you can use the :has and :first-child CSS pseudo-classes:
from bs4 import BeautifulSoup as bs
html = '''<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>'''
soup = bs(html, 'lxml')
print([i.text for i in soup.select('.my_class:has(>.my_class) > div:first-child')])
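In that selector, `.my_class:has(>.my_class)` matches only the outer divs (those with a direct `.my_class` child), and `> div:first-child` then picks the first child div of each. If `:has` is not available in your bs4 version, one possible fallback is to keep only the `my_class` divs that have no `my_class` ancestor. This is a minimal sketch, not the answerer's method; the helper name `outer_texts` is hypothetical, and it assumes the "important" div is always the first div inside each outer div.

```python
from bs4 import BeautifulSoup

html = '''<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')


def outer_texts(soup, cls='my_class'):
    """Hypothetical helper: text of the first div inside each outermost div of class `cls`."""
    texts = []
    for tag in soup.find_all('div', class_=cls):
        # Skip any div that is nested inside another div of the same class.
        if tag.find_parent('div', class_=cls) is not None:
            continue
        first = tag.find('div')  # first descendant div in document order
        if first is not None:
            texts.append(first.get_text(strip=True))
    return texts


result = outer_texts(soup)
print(result)
```

Unlike the selector, this also keeps outer divs that happen to have no nested `my_class` child, which may or may not be what you want.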
Upvotes: 1
Reputation: 407
From the find_all() documentation:
recursive is a boolean argument (defaulting to True) which tells Beautiful Soup whether to go all the way down the parse tree, or whether to only look at the immediate children of the Tag or the parser object.
So, if the first-level divs sit directly under <body> (for example, when the parser wraps the fragment in <html> and <body>, as lxml does), you can set
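# assumes soup was built with a parser that adds <html> and <body>, e.g. 'lxml'
divs = soup.html.body.find_all('div', attrs={'class': 'my_class'}, recursive=False)
print([d.div.text for d in divs])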
Output:
['important text', 'important text']
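For a self-contained version of the same idea, here is a sketch that assumes the built-in 'html.parser' instead. That parser does not add <html> or <body> wrappers, so the top-level divs are direct children of the soup object itself and `recursive=False` can be applied there. The variable names `top_level` and `texts` are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>
<div class="my_class">
<div>important text</div>
<div class="my_class">
<div>not important</div>
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# recursive=False limits the search to direct children of the soup object,
# so the nested my_class divs are never visited.
top_level = soup.find_all('div', class_='my_class', recursive=False)

# Within each outer div, take its first direct child div.
texts = [d.find('div', recursive=False).get_text() for d in top_level]
print(texts)
```

If the real page nests these divs deeper, call `find_all(..., recursive=False)` on whichever tag is their immediate parent instead of on `soup`.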
Upvotes: 1
Reputation: 71451
You can iterate over soup.contents:
from bs4 import BeautifulSoup as soup
r = [i.div.text for i in soup(html, 'html.parser').contents if i != '\n']
Output:
['important text', 'important text']
Upvotes: 1