Reputation: 559
The following code grabs continuous segments of text from within HTML:
for text in soup.find_all_next(text=True):
    if isinstance(text, Comment):
        # We found a comment, ignore
        continue
    if not text.strip():
        # We found a blank text, ignore
        continue
    # Whatever is left must be good
    print(text)
Text items are broken up by structure tags like <div> or <br>, but also by formatting tags like <em> and <strong>. This causes me some inconvenience in further parsing of the text, and I would like to be able to grab continuous text items while ignoring any formatting tags interior to the text.
For example, I would like soup.find_all_next(text=True) to take the HTML code <div>This is <em>important</em> text</div> and return a single string, This is important text, instead of three strings: This is, important, and text.
I'm not sure if that's clear... Let me know if it's not.
EDIT: The reason I'm walking through the html text item by text item is that I'm only beginning the walk after I see a specific "begin" comment tag and I'm stopping when I reach a specific "end" comment tag. Are there any solutions that work within this context of needing to walk item by item? The full code I'm using is below.
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(page)
for instanceBegin in soup.find_all(text=isBeginText):
    # We found a start comment, look at all text and comments:
    for text in instanceBegin.find_all_next(text=True):
        # We found a text or comment, examine it closely
        if isEndText(text):
            # We found the end comment, everybody out of the pool
            break
        if isinstance(text, Comment):
            # We found a comment, ignore
            continue
        if not text.strip():
            # We found a blank text, ignore
            continue
        # Whatever is left must be good
        print(text)
Here the two functions isBeginText(text) and isEndText(text) return True if the string passed to them matches my starting or ending comment tag, respectively.
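For anyone who wants to run the loop above, here is a minimal sketch of what the two predicates might look like. The marker strings BEGIN and END and the sample page are placeholders made up for illustration, not the ones from the real page; string= is the newer spelling of the text= argument in bs4.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical marker strings; the real page uses its own comment text.
BEGIN_MARKER = "BEGIN"
END_MARKER = "END"


def isBeginText(text):
    # True only for a comment whose trimmed content is the begin marker
    return isinstance(text, Comment) and text.strip() == BEGIN_MARKER


def isEndText(text):
    # True only for a comment whose trimmed content is the end marker
    return isinstance(text, Comment) and text.strip() == END_MARKER


# Made-up sample page, for illustration only
page = "<p>skip</p><!-- BEGIN --><p>keep <b>this</b></p><!-- END --><p>skip</p>"
soup = BeautifulSoup(page, "html.parser")

kept = []
for instanceBegin in soup.find_all(string=isBeginText):
    for text in instanceBegin.find_all_next(string=True):
        if isEndText(text):
            break
        if isinstance(text, Comment) or not text.strip():
            continue
        kept.append(str(text))

print(kept)  # ['keep ', 'this']
```

Note that the <b> tag still splits the text into two items, which is exactly the behaviour the question is about.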
Upvotes: 3
Views: 3510
Reputation: 13459
How about using find_all_next twice, once for each of the beginning and ending tags, and taking the difference of the two generated lists?
As an example, I'll use a modified version of the html_doc from the BeautifulSoup documentation:
import bs4
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<!-- START--><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><!-- END -->
<p class="story">...</p>
"""
soup = bs4.BeautifulSoup(html_doc, 'html.parser')
comments = soup.find_all(text=lambda text: isinstance(text, bs4.Comment))
# Step 1: find the beginning and ending markers
node_start = [ cmt for cmt in comments if cmt.string == " START" ][0]
node_end = [ cmt for cmt in comments if cmt.string == " END " ][0]
# Step 2: drop everything from the END marker onwards from the first list
all_text = node_start.find_all_next(text=True)
all_after_text = node_end.find_all_next(text=True)
subset = all_text[:-(len(all_after_text) + 1)]
print(subset)
# ['Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.']
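The list above is still split at every inline tag. One possible way to merge those pieces into continuous segments is to walk next_elements from the start marker and only break a segment at tags that are not considered inline. This is a rough sketch, not a tested general solution: the INLINE set, the helper names block_parent and segments_between, and the sample HTML are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Assumption: tags treated as pure formatting; extend to taste.
INLINE = {"a", "b", "em", "i", "span", "strong"}


def block_parent(node):
    # Nearest ancestor that is not an inline formatting tag
    parent = node.parent
    while parent is not None and parent.name in INLINE:
        parent = parent.parent
    return parent


def segments_between(start, is_end):
    # Collect merged text segments after `start` until is_end(comment) is true
    segments, buffer, current = [], [], None

    def flush():
        text = "".join(buffer).strip()
        if text:
            segments.append(text)
        buffer.clear()

    for node in start.next_elements:
        if isinstance(node, Tag):
            if node.name not in INLINE:
                flush()  # a structural tag such as <div> or <br> ends the segment
            continue
        if isinstance(node, Comment):
            if is_end(node):
                break
            continue
        owner = block_parent(node)
        if owner is not current:
            flush()  # the text moved to a different enclosing block
            current = owner
        buffer.append(node)
    flush()
    return segments


html = (
    "<body><!-- START -->"
    "<div>This is <em>important</em> text</div>"
    "<p>Second <strong>block</strong><br>after break</p>"
    "<!-- END --><p>ignored</p></body>"
)
soup = BeautifulSoup(html, "html.parser")
start = next(n for n in soup.descendants
             if isinstance(n, Comment) and n.strip() == "START")
segments = segments_between(start, lambda c: c.strip() == "END")
print(segments)
# ['This is important text', 'Second block', 'after break']
```

The check on the enclosing block is there because next_elements only reports where a tag opens, so text that follows a closing </div> would otherwise be glued to the text inside it.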
Upvotes: 2
Reputation: 90
If you grab the parent element that contains your child elements and call get_text() on it, BeautifulSoup strips out all the HTML tags and returns a single continuous string of text.
For example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())
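Applied to the snippet from the question, a small sketch of the difference between walking the text nodes and calling get_text() on the parent (string= is the newer spelling of the text= argument):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>This is <em>important</em> text</div>", "html.parser")

# Walking the text nodes splits at the <em> boundaries
pieces = [str(s) for s in soup.div.find_all(string=True)]
print(pieces)  # ['This is ', 'important', ' text']

# get_text() on the parent concatenates them into one string
joined = soup.div.get_text()
print(joined)  # This is important text
```

Keep in mind that get_text() merges everything under the element it is called on, so it does not by itself stop at a begin or end comment.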
Upvotes: 4