wrkyle

Reputation: 559

How does one get the text from html while ignoring formatting tags using BeautifulSoup?

The following code is used to grab continuous segments of text from within html.

    for text in soup.find_all_next(text=True):
        if isinstance(text, Comment):
            # We found a comment, ignore
            continue
        if not text.strip():
            # We found a blank text, ignore
            continue
        # Whatever is left must be good
        print(text)

Text items are broken up not only by structural tags like <div> or <br> but also by formatting tags like <em> and <strong>. This makes further parsing of the text inconvenient, so I would like to grab continuous text items while ignoring any formatting tags inside the text.

For example, I would like soup.find_all_next(text=True) to take the html code <div>This is <em>important</em> text</div> and return a single string, This is important text, instead of the three strings This is, important, and text.

I'm not sure if that's clear... Let me know if it's not.

EDIT: The reason I'm walking through the html text item by text item is that I'm only beginning the walk after I see a specific "begin" comment tag and I'm stopping when I reach a specific "end" comment tag. Are there any solutions that work within this context of needing to walk item by item? The full code I'm using is below.

from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(page)
for instanceBegin in soup.find_all(text=isBeginText):
    # We found a start comment, look at all text and comments:
    for text in instanceBegin.find_all_next(text=True):
        # We found a text or comment, examine it closely
        if isEndText(text):
            # We found the end comment, everybody out of the pool
            break
        if isinstance(text, Comment):
            # We found a comment, ignore
            continue
        if not text.strip():
            # We found a blank text, ignore
            continue
        # Whatever is left must be good
        print(text)

Here, the two functions isBeginText(text) and isEndText(text) return True if the string passed to them matches my starting or ending comment, respectively.

Upvotes: 3

Views: 3510

Answers (2)

Oliver W.

Reputation: 13459

How about calling find_all_next twice, once from the beginning marker and once from the ending marker, and taking the difference of the two resulting lists?

As an example, I'll use a modified version of the html_doc from the documentation of BeautifulSoup:

import bs4

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<!-- START--><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p><!-- END -->

<p class="story">...</p>
"""

soup = bs4.BeautifulSoup(html_doc, 'html.parser')
comments = soup.find_all(text=lambda text: isinstance(text, bs4.Comment))

# Step 1: find the beginning and ending markers
node_start = [ cmt for cmt in comments if cmt.string == " START" ][0]
node_end = [ cmt for cmt in comments if cmt.string == " END " ][0]

# Step 2: drop the END marker and everything after it from the first list
all_text = node_start.find_all_next(text=True)
all_after_text = node_end.find_all_next(text=True)

subset = all_text[:-(len(all_after_text) + 1)]
print(subset)

# ['Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.']
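
The strings in subset are still split wherever a formatting tag such as <a> begins or ends. If one continuous string per block is wanted, one possible sketch (an extension, not part of the approach above) is to walk next_elements from the start marker and merge consecutive strings that share the same nearest non-formatting ancestor. The names INLINE_TAGS, block_parent and merged_text, the choice of tags treated as formatting, and the sample HTML are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Assumption: tags treated as pure formatting; extend the set as needed.
INLINE_TAGS = {"a", "b", "em", "i", "span", "strong", "u"}


def block_parent(node):
    """Return the nearest ancestor that is not a formatting tag."""
    parent = node.parent
    while parent is not None and parent.name in INLINE_TAGS:
        parent = parent.parent
    return parent


def merged_text(start, is_end):
    """Collect text after `start`, merging strings split only by inline tags."""
    runs, buffer, current = [], [], None

    def flush():
        # Join the buffered pieces and normalise internal whitespace.
        text = " ".join("".join(buffer).split())
        if text:
            runs.append(text)
        buffer.clear()

    for node in start.next_elements:
        if isinstance(node, Comment):
            if is_end(node):
                break      # end marker reached, stop walking
            continue       # any other comment is ignored
        if isinstance(node, Tag):
            if node.name not in INLINE_TAGS:
                flush()    # a structural tag (<div>, <p>, <br>, ...) ends the run
            continue
        if isinstance(node, NavigableString):
            parent = block_parent(node)
            if parent is not current:
                flush()    # the text moved to a different block
                current = parent
            buffer.append(str(node))
    flush()
    return runs


html = (
    "<div>skipped</div><!-- START -->"
    "<div>This is <em>important</em> text</div>"
    "<p>Second<br>third <strong>bold</strong></p>"
    "<!-- END --><p>ignored</p>"
)
soup = BeautifulSoup(html, "html.parser")
start = next(
    n for n in soup.descendants
    if isinstance(n, Comment) and n.strip() == "START"
)
runs = merged_text(start, lambda c: c.strip() == "END")
print(runs)  # ['This is important text', 'Second', 'third bold']
```

Grouping by ancestor rather than only by opening tags means that text following a closed block element starts a new run instead of being glued to the text inside that element.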

Upvotes: 2

chafreaky

Reputation: 90

If you grab the parent element that contains your child elements and call get_text() on it, BeautifulSoup will strip out all HTML tags and return one continuous string of text.

For example:

from bs4 import BeautifulSoup

# html_doc is any string of HTML, e.g. the sample document from the BeautifulSoup docs
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())
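
Applied to the snippet from the question, a minimal sketch might look like the following. It assumes the text of interest lives inside a single <div>, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = "<div>This is <em>important</em> text</div>"
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")

# get_text() concatenates every string beneath the tag, so inline
# formatting tags such as <em> no longer split the text.
joined = div.get_text()
print(joined)  # This is important text

# For comparison, iterating over the individual strings yields three pieces.
pieces = list(div.stripped_strings)
print(pieces)  # ['This is', 'important', 'text']
```

Note that get_text() works on a whole element, so on its own it does not stop at a comment marker partway through that element.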

Upvotes: 4
