Reputation:
(Picture is small, here is another link: https://i.sstatic.net/gO9jb.png)
I'm trying to extract the text of the review at the bottom. I've tried this:
y = soup.find_all("div", style = "margin-left:0.5em;")
review = y[0].text
The problem is that the unexpanded div tags contain unwanted text that is tedious to strip out of the review's content. For the life of me, I can't figure this out. Could someone please help me?
Edit: The HTML is:
<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;"> 9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
The div tag above the text is as follows:
<div class="tiny" style="margin-bottom:0.5em;">
<b>
<span class="h3color tiny">This review is from: </span>
<a href="https://rads.stackoverflow.com/amzn/click/com/B005C7QVUE" rel="nofollow noreferrer">A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
Upvotes: 2
Views: 1718
Reputation: 414825
To get the text in the tail of div.tiny:
review = soup.find("div", "tiny").find_next_sibling(string=True)
Full example:
#!/usr/bin/env python
from bs4 import BeautifulSoup
html = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
<span class="h3color tiny">This review is from: </span>
<a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">
A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few."""
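# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")
# The first text node that follows div.tiny is the review itself
review = soup.find("div", "tiny").find_next_sibling(string=True)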
print(review)
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
Here's equivalent lxml code that produces the same output:
import lxml.html
doc = lxml.html.fromstring(html)
# .tail is the text that directly follows the element's closing tag
print(doc.find(".//div[@class='tiny']").tail)
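If the review is split into several text nodes (for example by <br/> tags), taking only the first following text node would drop the rest. Here is a minimal sketch of one way to handle that, assuming the review consists of the bare text nodes that follow div.tiny inside the same parent. The HTML below is a made-up, simplified stand-in for the page, not the real markup:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical, simplified markup: the review is split by a <br/> tag
html = """<div style="margin-left:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;"><b>This review is from: <a href="#">Some title</a></b></div>
First line of the review.<br/>
Second line of the review.
</div>"""

soup = BeautifulSoup(html, "html.parser")
tiny = soup.find("div", class_="tiny")

# Keep only the bare text nodes that follow div.tiny; tags such as <br/> are skipped
parts = [s.strip() for s in tiny.next_siblings if isinstance(s, NavigableString)]

# Drop whitespace-only pieces and join the rest into a single string
review = " ".join(p for p in parts if p)
print(review)
```

Whether this is needed depends on the real page; if the review is always a single text node, the one-liner above is enough.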
Upvotes: 2
Reputation: 1146
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings suggests that the .strings generator is what you want - it yields each string within the object. So if you turn that generator into a list and take the last item, you should get what you want. For example:
$ python
>>> import bs4
>>> text = '<div style="mine"><div>unwanted</div>wanted</div>'
>>> soup = bs4.BeautifulSoup(text)
>>> soup.find_all("div", style="mine")[0].text
u'unwantedwanted'
>>> list(soup.find_all("div", style="mine")[0].strings)[-1]
u'wanted'
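One caveat: on a real page the last string often carries surrounding newlines or indentation. The same documentation section also describes .stripped_strings, which strips each string and skips whitespace-only ones. A small sketch, using made-up markup with extra whitespace around the wanted text:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the wanted text is surrounded by newlines and indentation
text = '<div style="mine"><div>unwanted</div>\n  wanted\n</div>'
soup = BeautifulSoup(text, "html.parser")

outer = soup.find("div", style="mine")

# stripped_strings yields each string with leading/trailing whitespace removed
last = list(outer.stripped_strings)[-1]
print(last)
```

This still assumes the text you want is the last string inside the outer div, which may not hold if the page appends anything after the review.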
Upvotes: 2