How does one get the text from html while ignoring formatting tags using BeautifulSoup?

Question

The following code is used to grab continuous segments of text from within html.

    for text in soup.find_all_next(text=True):
        if isinstance(text, Comment):
            # We found a comment, ignore
            continue
        if not text.strip():
            # We found a blank text, ignore
            continue
        # Whatever is left must be good
        print(text)

Text items are broken up by structure tags but also by inline formatting tags. This causes me some inconvenience in further parsing of the text, and I would like to be able to grab continuous text items while ignoring any formatting tags interior to the text.

For example, I would like soup.find_all_next(text=True) to take the HTML for the sentence "This is important text", where the word "important" is wrapped in a formatting tag, and return a single string, "This is important text", instead of three strings: "This is", "important", and "text".
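To make the splitting concrete, here is a minimal sketch. The `<b>` tag is an assumption chosen for illustration; any inline formatting tag would behave the same way. Note that `string=` is the current name for the `text=` argument used elsewhere in this post.

```python
from bs4 import BeautifulSoup

# Assumption: <b> stands in for whichever formatting tag wraps "important".
html = "<p>This is <b>important</b> text</p>"
soup = BeautifulSoup(html, "html.parser")

# Walking string by string splits the sentence at the formatting tag.
pieces = soup.find_all(string=True)
print(pieces)   # ['This is ', 'important', ' text']

# Asking the enclosing tag for its text joins the pieces back together.
joined = soup.p.get_text()
print(joined)   # This is important text
```

`get_text()` works here because the whole sentence lives inside one tag; it does not by itself handle a walk that has to start and stop at specific nodes.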

I'm not sure if that's clear... Let me know if it's not.

EDIT: The reason I'm walking through the html text item by text item is that I'm only beginning the walk after I see a specific "begin" comment tag and I'm stopping when I reach a specific "end" comment tag. Are there any solutions that work within this context of needing to walk item by item? The full code I'm using is below.

    soup = BeautifulSoup(page)
    for instanceBegin in soup.find_all(text=isBeginText):
        # We found a start comment, look at all text and comments:
        for text in instanceBegin.find_all_next(text=True):
            # We found a text or comment, examine it closely
            if isEndText(text):
                # We found the end comment, everybody out of the pool
                break
            if isinstance(text, Comment):
                # We found a comment, ignore
                continue
            if not text.strip():
                # We found a blank text, ignore
                continue
            # Whatever is left must be good
            print(text)

Here, the two functions isBeginText(text) and isEndText(text) return True if the string passed to them matches my starting or ending comment tag, respectively.
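The two helpers are not shown in the question. A minimal sketch of what they might look like, assuming the markers are HTML comments whose text is `begin` and `end` (both marker strings, and the sample `page`, are hypothetical):

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical marker texts; the real ones are not given in the question.
BEGIN_MARKER = "begin"
END_MARKER = "end"

def isBeginText(text):
    """Return True if text is a comment matching the begin marker."""
    return isinstance(text, Comment) and text.strip() == BEGIN_MARKER

def isEndText(text):
    """Return True if text is a comment matching the end marker."""
    return isinstance(text, Comment) and text.strip() == END_MARKER

# Hypothetical page, used only to exercise the helpers.
page = "<p>skip</p><!-- begin --><p>keep <i>this</i></p><!-- end --><p>skip</p>"
soup = BeautifulSoup(page, "html.parser")

# A callable passed as string= receives each text node and acts as a filter.
begin_nodes = soup.find_all(string=isBeginText)

# Same walk as in the question, collecting results instead of printing them.
found = []
for begin in begin_nodes:
    for text in begin.find_all_next(string=True):
        if isEndText(text):
            break
        if isinstance(text, Comment) or not text.strip():
            continue
        found.append(text)

print(found)   # ['keep ', 'this']
```

The output shows the problem being asked about: the `<i>` tag splits "keep this" into two items.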

Oliver W. · Accepted Answer

How about using find_all_next twice, once from the beginning marker and once from the ending marker, and taking the difference of the two generated lists?

As an example, I'll use a modified version of the html_doc from the documentation of BeautifulSoup:

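    import bs4

    # Sample markup from the BeautifulSoup documentation, with START and END
    # comments added as markers (their exact positions are inferred from the
    # output shown at the bottom).
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <!-- START--><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.<!-- END --></p>

    <p class="story">...</p>
    """

    soup = bs4.BeautifulSoup(html_doc, 'html.parser')
    comments = soup.find_all(text=lambda text: isinstance(text, bs4.Comment))

    # Step 1: find the beginning and ending markers
    node_start = [cmt for cmt in comments if cmt.string == " START"][0]
    node_end = [cmt for cmt in comments if cmt.string == " END "][0]

    # Step 2: subtract the second list of strings from the first
    all_text = node_start.find_all_next(text=True)
    all_after_text = node_end.find_all_next(text=True)

    # The extra 1 drops the END comment itself
    subset = all_text[:-(len(all_after_text) + 1)]
    print(subset)

    # ['Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.']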
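The list above is still split wherever a tag intervenes ('Lacie' and ' and\n' arrive as separate items). Going beyond the answer as written, one possible way to merge those pieces is to group consecutive strings by their nearest ancestor that is not a formatting tag. This is only a sketch: `INLINE_TAGS`, `block_parent`, and `merged_text_between` are hypothetical names, and which tags count as "formatting" is an assumption you would adjust for your own pages.

```python
from itertools import groupby

import bs4

# Assumption: the set of tags treated as inline formatting for this sketch.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "u"}

def block_parent(node):
    """Return the nearest ancestor that is not an inline formatting tag."""
    parent = node.parent
    while parent is not None and parent.name in INLINE_TAGS:
        parent = parent.parent
    return parent

def merged_text_between(start, end):
    """Join the strings between two marker nodes, one item per block parent."""
    strings = []
    for node in start.find_all_next(string=True):
        if node is end:
            # Reached the end marker: stop walking.
            break
        if isinstance(node, bs4.Comment):
            continue
        strings.append(node)

    merged = []
    # Consecutive strings sharing the same block-level ancestor form one run.
    for _, run in groupby(strings, key=lambda s: id(block_parent(s))):
        text = "".join(run).strip()
        if text:
            merged.append(text)
    return merged

# Hypothetical markup, used only to exercise the helper.
html = ("<div><!-- START--><p>This is <b>important</b> text</p>"
        "<p>Second <i>paragraph</i></p><!-- END --><p>ignored</p></div>")
soup = bs4.BeautifulSoup(html, "html.parser")
comments = soup.find_all(string=lambda s: isinstance(s, bs4.Comment))
result = merged_text_between(comments[0], comments[1])
print(result)   # ['This is important text', 'Second paragraph']
```

The end marker is compared with `is` rather than `==` so that an ordinary string with the same content as the comment cannot end the walk early. Text in the same block that is interrupted by a nested block element would still come out as separate items.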
