Reputation: 75
I have a BeautifulSoup document in this format:
<div class='text'>
<h3> text </h3>
<p> some more text </p>
"text here <b> is </b> important"
</div>
How do I extract just the string "text here is important", leaving out the h3 and p elements but keeping the text of the bold tag in the output?
Thanks a ton
Upvotes: 3
Views: 2969
Reputation: 719
from bs4 import BeautifulSoup

html = """<div class='text'>
<h3> text </h3>
<p> some more text </p>
"text here <b> is </b> important"
</div>"""
soup = BeautifulSoup(html, 'lxml')
for b_element in soup.find_all('b'):
    b_element.unwrap()  # removes the <b> </b> tags while keeping the text intact
soup.smooth()  # merges adjacent text nodes left by unwrap() into one string
# finally, get only the parent's own text using the code below
for element in soup.find_all('div'):
    # text of the element including all of its descendants
    fulltextofelement = element.get_text()
    # only the strings that are direct children of the element
    onlyparenttext = ''.join(element.find_all(string=True, recursive=False)).strip()
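The same unwrap-and-smooth idea can be packaged for reuse. This is only a sketch: the helper name direct_text is hypothetical (it is not part of BeautifulSoup), html.parser is used here just to avoid the lxml dependency, and smooth() assumes a reasonably recent bs4 (4.8.0 or later).

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Return the text owned directly by `tag`, with whitespace collapsed.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    # recursive=False limits the search to direct children of `tag`
    parts = tag.find_all(string=True, recursive=False)
    # split()/join() collapses newlines and doubled spaces into single spaces
    return " ".join("".join(parts).split())


html = """<div class='text'>
<h3> text </h3>
<p> some more text </p>
"text here <b> is </b> important"
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")
for b in div.find_all("b"):
    b.unwrap()   # drop the <b> tag but keep its text inside the div
div.smooth()     # merge the neighbouring text nodes left behind by unwrap()

result = direct_text(div)
print(result)
```

Because h3 and p are child tags, their text never appears in the div's direct strings, so nothing has to be deleted from the tree.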
Upvotes: 1
Reputation: 81
Well, for this specific format, try using .next_siblings on the p element:
import bs4
from bs4 import BeautifulSoup

text = '''<div class='text'>
<h3> text </h3>
<p> some more text </p>
"text here <b> is </b> important"</div>'''
response = BeautifulSoup(text, 'html.parser')
str_list = []
for x in response.p.next_siblings:
    if isinstance(x, bs4.element.Tag):
        # a tag such as <b>: take its text
        str_list.append(x.get_text().strip())
    else:
        # a plain string node
        str_list.append(x.strip())
output = " ".join(str_list)
print(output)
This gave me the following output:
"text here is important"
Upvotes: 3
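A variation on the .next_siblings answer above that does not depend on the wanted text coming right after the p element: walk the div's direct children and skip the tags you do not want. This is a rough sketch; the skip set and the variable names are assumptions chosen for this example.

```python
from bs4 import BeautifulSoup, Tag

html = '''<div class='text'>
<h3> text </h3>
<p> some more text </p>
"text here <b> is </b> important"</div>'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="text")

skip = {"h3", "p"}  # tag names to leave out (assumed for this example)
pieces = []
for child in div.children:
    if isinstance(child, Tag):
        if child.name in skip:
            continue                      # ignore the unwanted elements
        pieces.append(child.get_text())   # e.g. the <b> tag: keep its text
    else:
        pieces.append(str(child))         # plain string node

# collapse newlines and repeated spaces
result = " ".join("".join(pieces).split())
print(result)
```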
Reputation: 14273
You can use tag.decompose() to remove the unwanted tags and then extract the remaining text.
from bs4 import BeautifulSoup
spam = """<div class='text'>
<h3> text </h3>
<p> some more text </p>
"text here <b> is </b> important"
</div>"""
soup = BeautifulSoup(spam, 'html.parser')
div = soup.find('div')
for tag in ('h3', 'p'):
    div.find(tag).decompose()
print(div.text.strip())
Output:
"text here is important"
Upvotes: 1
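One detail about the decompose() approach above: div.text keeps the spaces on both sides of the b tag, so the raw string can contain doubled spaces even after strip(). A minimal sketch of collapsing them, assuming single spaces are what you want (the variable names raw and result are mine):

```python
from bs4 import BeautifulSoup

spam = """<div class='text'>
<h3> text </h3>
<p> some more text </p>
"text here <b> is </b> important"
</div>"""

soup = BeautifulSoup(spam, "html.parser")
div = soup.find("div")

# find_all accepts a list of tag names; decompose() removes each match
for tag in div.find_all(["h3", "p"]):
    tag.decompose()

raw = div.get_text()             # still has newlines and doubled spaces
result = " ".join(raw.split())   # collapse every whitespace run to one space
print(result)
```

Note that decompose() modifies the tree in place, so parse the document again (or work on a copy) if the h3 and p elements are needed later.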