Reputation: 4450
I have html as follows:
<html>
<div class="maindiv">
text data here
<br>
continued text data
<br>
<div class="somename">
text & data I want to omit
</div>
</div>
</html>
I am trying to get only the text found in the maindiv
element, without getting the text found in the somename
element. In my experience, text data is usually contained within some child element, which makes it easy to target. In this case, however, the text sits loose inside the parent next to a child div, which makes it harder to filter.
My approach is as follows:
textdata = soup.find('div', class_='maindiv').get_text()
This gets all the text data found within the maindiv
element, as well as the text data found in the somename
div element.
The logic I'd like to use is more along the lines of:
textdata = soup.find('div', class_='maindiv').get_text(recursive=False)
which would omit any text data found within the somename
element.
I know the recursive=False
argument works for locating only direct child elements when searching the DOM structure using BeautifulSoup, but it can't be used with the .get_text()
method.
I've considered finding all the text and then subtracting the string data found in the somename
element from the string data found in the maindiv
element, but I'm looking for something a little more efficient.
Upvotes: 5
Views: 8098
Reputation: 12168
from bs4 import BeautifulSoup
html ='''
<html>
<div class="maindiv">
text data here
<br>
continued text data
<br>
<div class="somename">
text & data I want to omit
</div>
</div>
</html>'''
soup = BeautifulSoup(html, 'lxml')
# next_element is the node parsed right after the opening tag, here the first text node
soup.find('div', class_="maindiv").next_element
out:
'\n text data here \n '
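Note that next_element only returns the first text node, so "continued text data" is not included. If you want every piece of text that sits directly inside maindiv, one option is to ask find_all for text nodes only and turn recursion off. Here is a minimal sketch of that idea. It assumes the same markup as in the question, and it uses the built-in 'html.parser' instead of 'lxml' only so that it runs without extra dependencies. The names main, direct_strings and textdata are just illustrative.

```python
from bs4 import BeautifulSoup

html = """
<html>
<div class="maindiv">
text data here
<br>
continued text data
<br>
<div class="somename">
text & data I want to omit
</div>
</div>
</html>"""

# Assumption: 'html.parser' is used here; 'lxml' should behave the same for this markup
soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", class_="maindiv")

# string=True matches text nodes; recursive=False limits the search
# to direct children, so text nested inside somename is never visited
direct_strings = main.find_all(string=True, recursive=False)

# Drop whitespace-only nodes and join the rest with a single space
textdata = " ".join(s.strip() for s in direct_strings if s.strip())
print(textdata)  # text data here continued text data
```

This leaves the parsed tree untouched, which may matter if you still need the somename element later.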
Upvotes: 2
Reputation: 3265
Not that far from your subtracting method, but one way to do it (at least in Python 3) is to discard all child divs.
s = soup.find('div', class_='maindiv')
# Remove every nested div from the tree (this modifies soup in place)
for child in s.find_all("div"):
    child.decompose()
print(s.get_text())
Would print something like:
text data here
continued text data
That might be a bit more efficient and flexible than subtracting the strings, though it still needs to go through the children first.
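One thing to keep in mind is that decompose() removes the child divs from soup itself. If you need the somename element afterwards, a possible variation is to run the same loop on a copy of the tag. The sketch below assumes the markup from the question and uses 'html.parser' so it runs without extra dependencies. The variable names clone and textdata are only for illustration.

```python
import copy

from bs4 import BeautifulSoup

html = """
<html>
<div class="maindiv">
text data here
<br>
continued text data
<br>
<div class="somename">
text & data I want to omit
</div>
</div>
</html>"""

# Assumption: 'html.parser' is used here so no extra parser needs to be installed
soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", class_="maindiv")

# copy.copy() on a Tag gives a detached copy of the tag and its contents
clone = copy.copy(main)

# Discard the nested divs from the copy only; soup stays intact
for child in clone.find_all("div"):
    child.decompose()

# Join the remaining text nodes with a space and strip surrounding whitespace
textdata = clone.get_text(" ", strip=True)
print(textdata)  # text data here continued text data
```

The copy costs a bit of extra work, so for a one-off extraction the in-place version above is probably fine.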
Upvotes: 6