Reputation: 87
I am trying to extract the text from some HTML using Beautiful Soup. I have selected the node, but when I try to get the text, it is split at the inline tags as well:
HTML element
<div class="_1dcffiq">
<div class="_1d9ug36">Join the Adventure</div>
<div class="_doc79r">If you're <b>passionate</b> about <b>improving the health</b> of others and working on a <b>big problem</b>.</div>
<div class="_1d9ug36">Thank you</div>
</div>
Python code (element holds the HTML div shown above)
element.get_text(" | ")
Current output is
Join the Adventure | If you're | passionate | about | improving the health | of others and working on a | big problem | . | Thank you
So get_text(' | ') inserts the separator at every tag boundary, including inline tags such as <b>. My requirement is to not break on the inline tags and get the text as:
Expected output
Join the Adventure | If you're passionate about improving the health of others and working on a big problem . | Thank you
I am looking for a generic solution, as the structure of my div is not fixed.
Upvotes: 3
Views: 689
Reputation: 195543
You can .unwrap() the <b> tags from the element and then .smooth() the text:
from bs4 import BeautifulSoup
html_doc = '''<div class="_1dcffiq">
<div class="_1d9ug36">Join the Adventure</div>
<div class="_doc79r">If you're <b>passionate</b> about <b>improving the health</b> of others and working on a <b>big problem</b>.</div>
<div class="_1d9ug36">Thank you</div>
</div>'''
soup = BeautifulSoup(html_doc, 'html.parser')
element = soup.select_one('._1dcffiq')
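# Replace each <b> tag with its contents so its text joins the parent
for b in soup.select('b'):
    b.unwrap()
# Merge the now-adjacent text nodes into single strings
element.smooth()
print(element.get_text(strip=True, separator=' | '))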
Prints:
Join the Adventure | If you're passionate about improving the health of others and working on a big problem. | Thank you
Or:
Use .find_all() with recursive=False and then join the text:
text = ' | '.join(tag.text for tag in element.find_all(recursive=False))
print(text)
Prints:
Join the Adventure | If you're passionate about improving the health of others and working on a big problem. | Thank you
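Since the question asks for a generic solution, the .unwrap() approach above could be extended from <b> to a whole set of inline tags. The following is only a sketch under stated assumptions: the INLINE_TAGS list and the block_text helper are hypothetical names introduced for illustration, and the list of inline tags is a guess that you would adapt to your own markup. It relies on .smooth(), which needs Beautiful Soup 4.8.0 or later, and it modifies the parsed tree in place.

```python
from bs4 import BeautifulSoup

# Assumption: the tags treated as inline; extend or trim for your markup.
INLINE_TAGS = ["b", "i", "em", "strong", "span", "a", "u", "small"]


def block_text(element, separator=" | "):
    """Join an element's text, breaking only at non-inline tag boundaries.

    Hypothetical helper; note that it mutates `element`.
    """
    # Replace every inline tag with its contents so its text joins the parent.
    for tag in element.find_all(INLINE_TAGS):
        tag.unwrap()
    # Merge the now-adjacent text nodes into single strings.
    element.smooth()
    return element.get_text(separator=separator, strip=True)


html_doc = '''<div class="_1dcffiq">
<div class="_1d9ug36">Join the Adventure</div>
<div class="_doc79r">If you're <b>passionate</b> about <b>improving the health</b> of others and working on a <b>big problem</b>.</div>
<div class="_1d9ug36">Thank you</div>
</div>'''

soup = BeautifulSoup(html_doc, "html.parser")
print(block_text(soup.select_one("._1dcffiq")))
```

Unlike the recursive=False variant, this does not depend on each block being a direct child of the selected element, so it should cope with block-level wrappers nested at different depths. If you need the original tree afterwards, parse the HTML again or work on a copy.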
Upvotes: 2