Reputation: 325
I'm working on an assignment for class. We have to scrape information from an online book list that looks something like this:
<p class="css-38z03z"><strong>1. <a data-link-name="in body link" href="https://www.theguardian.com/books/2016/feb/01/100-best-nonfiction-books-of-all-time-the-sixth-extinction-elizabeth-kolbert">The Sixth Extinction by Elizabeth Kolbert (2014)</a> </strong><br/> An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.</p>
What I need to do is use Beautiful Soup to extract the second half of that HTML blurb. My output needs to be: An engrossing account of the looming catastrophe caused by ecology's "neighbours from hell" – mankind.
Here's the closest I can get (which isn't very close):
soup_doc.find('p').strong
plz_work = soup_doc.strong.next_sibling
plz_work.get_text
I've tried other variants of the sibling tags but no luck. What should I do?
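For reference, here's a minimal self-contained version of what I'm running (soup_doc is just that one blurb parsed on its own):

from bs4 import BeautifulSoup

html = '<p class="css-38z03z"><strong>1. <a data-link-name="in body link" href="https://www.theguardian.com/books/2016/feb/01/100-best-nonfiction-books-of-all-time-the-sixth-extinction-elizabeth-kolbert">The Sixth Extinction by Elizabeth Kolbert (2014)</a> </strong><br/> An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.</p>'
soup_doc = BeautifulSoup(html, 'html.parser')

plz_work = soup_doc.strong.next_sibling  # this turns out to be the <br/> tag, not the blurb
print(plz_work.get_text)                 # prints a bound method object, not text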
Upvotes: 1
Views: 599
Reputation: 1228
This works for this particular example, but I'm not sure whether it's stable across the whole list you're working with.
from bs4 import BeautifulSoup
html = """
<p class="css-38z03z">
<strong>1.
<a data-link-name="in body link" href="https://www.theguardian.com/books/2016/feb/01/100-best-nonfiction-books-of-all-time-the-sixth-extinction-elizabeth-kolbert">The Sixth Extinction by Elizabeth Kolbert (2014)
</a>
</strong>
<br/> An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.
</p>"""
soup = BeautifulSoup(html, 'html.parser')
# Full paragraph text, including the numbered title inside <strong>
element_all = soup.find('p').text
# The title portion we don't want
element_unwanted = soup.find('strong').text
if element_unwanted in element_all:
    element = element_all.replace(element_unwanted, '').strip()
    print(element)
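If the real page has one of these <p class="css-38z03z"> blocks per book (an assumption on my part, I only have this one snippet to go on), the same subtraction idea should extend to the whole list:

for p in soup.find_all('p', class_='css-38z03z'):
    title = p.find('strong')
    # remove the numbered <strong> title from the paragraph text, leaving the blurb
    blurb = p.text.replace(title.text, '').strip() if title else p.text.strip()
    print(blurb)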
Upvotes: 1
Reputation: 20018
Simply use .next:
from bs4 import BeautifulSoup
html = '''<p class="css-38z03z"><strong>1. <a data-link-name="in body link" href="https://www.theguardian.com/books/2016/feb/01/100-best-nonfiction-books-of-all-time-the-sixth-extinction-elizabeth-kolbert">The Sixth Extinction by Elizabeth Kolbert (2014)</a> </strong><br/> An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.</p>
'''
soup = BeautifulSoup(html, "html.parser")
# .next of the <br/> tag is the text node that follows it, i.e. the blurb
print(soup.select_one('.css-38z03z br').next)
Output:
An engrossing account of the looming catastrophe caused by ecology’s “neighbours from hell” – mankind.
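Note that the text node after the <br/> may keep its leading space, so you can strip it if that matters:

print(soup.select_one('.css-38z03z br').next.strip())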
Upvotes: 1