Aadhiraj Nayar

Reputation: 75

BeautifulSoup: get only the string directly inside a tag

I am parsing HTML of this format with BeautifulSoup:

<div class='text'>
<h3> text </h3>
<p> some more text </p>
"text here <b> is </b> important"
</div>

How do I extract just the string "text here is important", leaving out the h3 and p elements but keeping the text of the bold tag in the output?

Thanks a ton

Upvotes: 3

Views: 2969

Answers (3)

Shreyesh Desai

Reputation: 719

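from bs4 import BeautifulSoup

html = """<div class='text'>
<h3> text </h3>
<p> some more text </p>
"text here <b> is </b> important"
</div>"""

soup = BeautifulSoup(html, 'lxml')  # requires the lxml package

for b_element in soup.find_all('b'):
    b_element.unwrap()  # removes the <b> </b> tags while keeping their text

soup.smooth()  # merges the adjacent text nodes left behind by unwrap()

# finally, collect only the text that sits directly inside each div
for element in soup.find_all('div'):
    full_text = element.get_text()  # all text, including h3 and p
    # recursive=False limits the search to direct children, so h3 and p text is skipped;
    # find_all is needed because find() would return only the first (whitespace) node
    direct_strings = element.find_all(string=True, recursive=False)
    only_parent_text = ' '.join(''.join(direct_strings).split())
    print(only_parent_text)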

Upvotes: 1

Somitra Gupta

Reputation: 81

For this specific format, try iterating over .next_siblings of the p element:

import bs4
from bs4 import BeautifulSoup

text = '''<div class='text'>
<h3> text </h3>
<p> some more text </p>
"text here <b> is </b> important"</div>'''

response = BeautifulSoup(text, 'html.parser')  # name the parser explicitly to avoid the bs4 warning

str_list = []
for x in response.p.next_siblings:
    # <b> is a Tag, so take its text; every other sibling is already a string
    if isinstance(x, bs4.element.Tag):
        str_list.append(x.get_text().strip())
    else:
        str_list.append(x.strip())

output = " ".join(str_list)
print(output)

This gives the output:

"text here is important"
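
If the wanted text does not always come right after the p element, a slightly more general variant is to walk the children of the div and skip the unwanted tags by name. This is only a sketch: direct_text and its skip parameter are hypothetical names, not part of bs4, and it assumes you know in advance which tags to leave out.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

text = '''<div class='text'>
<h3> text </h3>
<p> some more text </p>
"text here <b> is </b> important"</div>'''


def direct_text(parent, skip=("h3", "p")):
    """Hypothetical helper: text of parent's children, ignoring tags named in skip."""
    parts = []
    for child in parent.children:
        if isinstance(child, Tag):
            if child.name in skip:
                continue  # leave out h3, p, ...
            parts.append(child.get_text())  # e.g. the <b> tag
        else:
            parts.append(str(child))  # plain text node
    # collapse runs of whitespace (newlines, double spaces) into single spaces
    return " ".join("".join(parts).split())


soup = BeautifulSoup(text, "html.parser")
print(direct_text(soup.div))
```

For the sample markup this should print the same string as above, and it does not depend on where the p element sits inside the div.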

Upvotes: 3

buran

Reputation: 14273

You can use tag.decompose() to remove the unwanted tags and then extract the remaining text.

from bs4 import BeautifulSoup
spam = """<div class='text'>
<h3> text </h3>
<p> some more text </p>
"text here <b> is </b> important"
</div>"""

soup = BeautifulSoup(spam, 'html.parser')
div = soup.find('div')
for tag in ('h3', 'p'):
    div.find(tag).decompose()
print(div.text.strip())

Output:

"text here  is  important"
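
Two things to be aware of: the double spaces come from the whitespace on both sides of the b tag, and decompose() changes the parsed tree for good. A possible variation, assuming you want single spaces and need the original soup later, is to work on a copy of the div and normalise the whitespace afterwards. The variable names here are only illustrative.

```python
import copy

from bs4 import BeautifulSoup

spam = """<div class='text'>
<h3> text </h3>
<p> some more text </p>
"text here <b> is </b> important"
</div>"""

soup = BeautifulSoup(spam, 'html.parser')

# copy.copy() on a Tag gives a detached copy, so soup itself stays untouched
div = copy.copy(soup.find('div'))
for tag in div.find_all(['h3', 'p']):
    tag.decompose()

# split() + join() collapses newlines and repeated spaces into single spaces
cleaned = ' '.join(div.get_text().split())
print(cleaned)
```

With the sample input this should print "text here is important" with single spaces, and soup.find('h3') still returns the heading afterwards.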

Upvotes: 1
