Amit Singh

Reputation: 87

BeautifulSoup: do not insert a separator with soup.get_text for certain tags like <b>

I am trying to extract the text from some HTML using Beautiful Soup. I have selected the node, but when I try to get the text, the output is split at the inline tags:

HTML element

<div class="_1dcffiq">
   <div class="_1d9ug36">Join the Adventure</div>
   <div class="_doc79r">If you're <b>passionate</b> about <b>improving the health</b> of others and working on a <b>big problem</b>.</div>
   <div class="_1d9ug36">Thank you</div>
</div>

Python code (element holds the HTML div shown above)

element.get_text(" | ")

Current output is

Join the Adventure | If you're  | passionate |  about  | improving the health |  of others and working on a  | big problem | . | Thank you

So get_text(' | ') inserts the separator at every tag boundary, including inline tags such as <b>. My requirement is to not break on inline tags and get the text as:

Expected output

Join the Adventure | If you're passionate about  improving the health of others and working on a big problem . | Thank you

I am looking for a generic solution as my div is not fixed.

Upvotes: 3

Views: 689

Answers (1)

Andrej Kesely

Reputation: 195543

You can .unwrap() the <b> tags (which replaces each tag with its contents) and then call .smooth() on the element to merge the adjacent text nodes:

from bs4 import BeautifulSoup


html_doc = '''<div class="_1dcffiq">
   <div class="_1d9ug36">Join the Adventure</div>
   <div class="_doc79r">If you're <b>passionate</b> about <b>improving the health</b> of others and working on a <b>big problem</b>.</div>
   <div class="_1d9ug36">Thank you</div>
</div>'''

soup = BeautifulSoup(html_doc, 'html.parser')

element = soup.select_one('._1dcffiq')

for b in soup.select('b'):
    b.unwrap()
element.smooth()

print(element.get_text(strip=True, separator=' | '))

Prints:

Join the Adventure | If you're passionate about improving the health of others and working on a big problem. | Thank you
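Since the question asks for a generic solution, the same unwrap-and-smooth idea can be extended to a whole set of inline tags. The following is only a sketch: INLINE_TAGS and block_text are hypothetical names, the tag list is an assumption you should adjust to your markup, and the sample HTML adds a nested <i> to show that nesting is handled. Note that .smooth() is only available in reasonably recent Beautiful Soup 4 releases.

```python
from bs4 import BeautifulSoup

# Hypothetical list of tags to treat as inline; extend it to match your markup.
INLINE_TAGS = ["b", "i", "em", "strong", "span", "a", "u", "small"]


def block_text(element, separator=" | "):
    """Return the element's text with separators only between non-inline nodes.

    This modifies the tree in place: inline tags are removed, their text stays.
    """
    # find_all() accepts a list of tag names and returns a list-like result,
    # so unwrapping while looping over it is safe.
    for tag in element.find_all(INLINE_TAGS):
        tag.unwrap()
    # Merge the adjacent text nodes left behind by unwrap().
    element.smooth()
    return element.get_text(separator=separator, strip=True)


html_doc = """<div class="_1dcffiq">
   <div class="_1d9ug36">Join the Adventure</div>
   <div class="_doc79r">If you're <b>passionate</b> about <i>improving <b>the health</b></i> of others.</div>
   <div class="_1d9ug36">Thank you</div>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")
result = block_text(soup.select_one("._1dcffiq"))
print(result)
```

Because unwrap() changes the parsed tree, parse the HTML again (or work on a copy) if the original structure is still needed afterwards.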

Or:

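Use .find_all() with recursive=False to get only the direct children of the element, and then join their text (this assumes each block you want to separate is a direct child):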

text = ' | '.join(tag.text for tag in element.find_all(recursive=False))
print(text)

Prints:

Join the Adventure | If you're passionate about improving the health of others and working on a big problem. | Thank you

Upvotes: 2
