user10873885

Get text from specific blocks excluding some nested tags

I have been trying to write a Python script that extracts the text from a specific element while excluding the text inside some of its nested child tags.

This is the HTML I'm trying to scrape:

<div class="article_body">
    <div id="articleBodyContents">
        Stack Overflow
        <br/>
        Is love
        <br/>
        <a href="https://example_site1.com" target="_blank">Ad</a>
        <br/>
        <a href="https://example_site2.com" target="_blank">Ad2</a>
    </div>
</div>

Here is what I have so far:

from bs4 import BeautifulSoup

# html holds the markup shown above
soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all('div', {'id': 'articleBodyContents'})
for ops in divs:
    print(ops.text.replace('\n', '').strip())

However, this prints:

Stack Overflow
Is love
Ad
Ad2

What I want is only:

Stack Overflow
Is love

Upvotes: 0

Views: 186

Answers (1)

0xInfection

Reputation: 2919

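You are nearly there. You need NavigableString for this. Find the parent div, iterate over the children of its nested div, and keep only the children that are instances of NavigableString. Those are the bare text nodes; the nested tags such as <a> are Tag objects, so they are left out. Here is your code, adjusted: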

from bs4 import BeautifulSoup, NavigableString

html = """
<div class="article_body">
    <div id="articleBodyContents">
        Stack Overflow
        <br/>
        Is love
        <br/>
        <a href="https://example_site1.com" target="_blank">Ad</a>
        <br/>
        <a href="https://example_site2.com" target="_blank">Ad2</a>
    </div>
</div>
"""

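soup = BeautifulSoup(html, "html.parser")
# Find the outer div; .div then gives its first nested div (articleBodyContents)
divs = soup.find('div', {'class': 'article_body'})
# Keep only the direct text nodes; text inside child tags such as <a> is skipped
ops = [element for element in divs.div if isinstance(element, NavigableString)]
for op in ops:
    text = op.strip()
    if text:  # skip whitespace-only strings that sit between tags
        print(text)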

Output:

Stack Overflow
Is love
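
If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the function name direct_text and the variable names are made up for illustration, and it assumes the text you want always sits directly inside the target div rather than in deeper tags.

```python
from bs4 import BeautifulSoup, NavigableString


def direct_text(tag):
    """Return the stripped, non-empty text nodes that are direct children of tag.

    direct_text is a hypothetical helper name, not part of Beautiful Soup.
    """
    lines = []
    for child in tag.children:
        # NavigableString children are bare text; Tag children (<a>, <br/>) are skipped
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:  # ignore whitespace-only strings between tags
                lines.append(text)
    return lines


html = """
<div class="article_body">
    <div id="articleBodyContents">
        Stack Overflow
        <br/>
        Is love
        <br/>
        <a href="https://example_site1.com" target="_blank">Ad</a>
        <br/>
        <a href="https://example_site2.com" target="_blank">Ad2</a>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="articleBodyContents")
result = direct_text(container)
print("\n".join(result))
```

Looking the div up by its id avoids depending on it being the first nested div, which is what .div relies on.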

Upvotes: 1