Get text from specific blocks excluding some nested tags

Question

I am trying to write a Python script that extracts text from a specific element but excludes the text inside its nested child tags.

This is the HTML I'm trying to scrape:


    
        Stack Overflow
        

        Is love
        

        Ad
        

        Ad2

Here is what I have so far:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
divs = soup.findAll('div', {'id':'articleBodyContents'})
for ops in divs:
    print(ops.text.replace('\n', '').strip())

However, this prints:

Stack Overflow
Is love
Ad
Ad2

What I want is only:

Stack Overflow
Is love

0xInfection · Accepted Answer

You are nearly there. You need NavigableString to achieve this. Grab the parent element, iterate over its direct children, and keep only those that are instances of NavigableString, that is, the bare text nodes rather than the nested tags. Here is the code:

from bs4 import BeautifulSoup, NavigableString

html = """

    
        Stack Overflow
        

        Is love
        

        Ad
        

        Ad2
    

"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.find('div', {'class':'article_body'})
ops = [element for element in divs.div if isinstance(element, NavigableString)]
for op in ops:
    print(op.strip().replace('\n', ''))

Output:

Stack Overflow
Is love
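
The same idea can be sketched as a self-contained script. The tags in the HTML sample above were not preserved, so the markup below is an assumption: the `<br/>` and `<a>` elements are placeholders for whatever wraps the unwanted text, and `direct_text` is a hypothetical helper name, not part of BeautifulSoup. It also drops whitespace-only text nodes, which the list comprehension above would otherwise print as blank lines.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumed markup: the original tags were lost, so <br/> and <a> are
# placeholders for whatever elements wrap the unwanted text.
html = """
<div class="article_body">
    <div id="articleBodyContents">
        Stack Overflow
        <br/>
        Is love
        <br/>
        <a href="#">Ad</a>
        <br/>
        <a href="#">Ad2</a>
    </div>
</div>
"""


def direct_text(tag):
    """Return the non-empty text nodes that are direct children of tag."""
    return [
        child.strip()
        for child in tag.children
        # Keep bare strings only; nested tags such as <a> are skipped.
        if isinstance(child, NavigableString) and child.strip()
    ]


soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", {"id": "articleBodyContents"})
lines = direct_text(container)
print("\n".join(lines))
```

With this assumed markup the script prints only `Stack Overflow` and `Is love`, because `Ad` and `Ad2` live inside child tags rather than directly in the container.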
