I'm trying to write a Python script that extracts the text from a specific element but excludes the text inside its nested child elements.
This is the HTML I'm trying to scrape:
<div class="article_body">
<div id="articleBodyContents">
Stack Overflow
<br/>
Is love
<br/>
<a href="https://example_site1.com" target="_blank">Ad</a>
<br/>
<a href="https://example_site2.com" target="_blank">Ad2</a>
</div>
</div>
Here is what I have so far:
from bs4 import BeautifulSoup
# `html` holds the markup shown above
soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all('div', {'id':'articleBodyContents'})
for ops in divs:
    print(ops.text.replace('\n', '').strip())
However this prints out:
Stack Overflow
Is love
Ad
Ad2
What I want is only:
Stack Overflow
Is love
Upvotes: 0
Views: 186
Reputation: 2919
You are nearly there. You need NavigableString to achieve this. Select the outer div, then iterate over the children of its inner div and keep only the ones that are instances of NavigableString, since those are the bare text nodes rather than tags. Here is your code:
from bs4 import BeautifulSoup, NavigableString
html = """
<div class="article_body">
<div id="articleBodyContents">
Stack Overflow
<br/>
Is love
<br/>
<a href="https://example_site1.com" target="_blank">Ad</a>
<br/>
<a href="https://example_site2.com" target="_blank">Ad2</a>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
divs = soup.find('div', {'class':'article_body'})
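# Keep only the bare text nodes of the inner div, skipping whitespace-only ones
ops = [element for element in divs.div if isinstance(element, NavigableString) and element.strip()]
for op in ops:
    print(op.strip().replace('\n', ''))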
Output:
Stack Overflow
Is love
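If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the function name direct_text and the shortened sample markup are my own, not part of your code, and it assumes the target element can be looked up by its id.

```python
from bs4 import BeautifulSoup, NavigableString


def direct_text(html, element_id):
    """Return the non-empty text nodes that are direct children of the element.

    `direct_text` is a hypothetical helper name used for illustration.
    """
    soup = BeautifulSoup(html, "html.parser")
    target = soup.find(id=element_id)
    if target is None:
        # No element with that id: nothing to extract
        return []
    return [
        child.strip()
        for child in target.children
        # Tags such as <a> and <br/> are not NavigableString, so they are skipped;
        # whitespace-only strings between tags are dropped as well
        if isinstance(child, NavigableString) and child.strip()
    ]


# Shortened copy of the markup from the question (assumed sample input)
sample = """
<div class="article_body">
<div id="articleBodyContents">
Stack Overflow
<br/>
Is love
<br/>
<a href="https://example_site1.com" target="_blank">Ad</a>
</div>
</div>
"""

print(direct_text(sample, "articleBodyContents"))
```

Looking the element up by id avoids depending on the wrapper div, and returning a list lets the caller decide how to join or print the lines.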
Upvotes: 1