Reputation: 1526
First, I should say that I have already found the same question with answers, but I couldn't get them to work. I am trying to extract data from reviews: for now, each review's content and its usefulness. I am new to BeautifulSoup and to Python in general.
For now, I use the findAll method to get a list of divs containing the reviews. For example, on a site with opinions about a product:
import urllib2
from BeautifulSoup import BeautifulSoup
turl = ""
page = urllib2.urlopen(turl)
soup = BeautifulSoup(page)
products = soup.findAll("div", style="margin-left:0.5em;")
print products[0]
This gives output like the following:
<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
335 of 368 people found the following review helpful
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;"><span class="swSprite s_star_5_0 " title="5.0 out of 5 stars"><span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;"><b>Decades later, still great but on different terms.</b>, <nobr>August 24, 2001</nobr></span>
<div style="margin-bottom:0.5em;">
<div><div style="float:left;">By </div><div style="float:left;"><a href=""><span style="font-weight: bold;">mirope "mirope"</span></a> - <a href="">See all my reviews</a><br />
<a href="" target="AmazonHelp" onclick="return amz_js_PopWin(this.href,'AmazonHelp','width=340,height=340,resizable=1,scrollbars=1,toolbar=1,status=1');"><span class="cmtySprite s_BadgeVineVoice "><span>(VINE VOICE)</span></span></a>
</div></div><div style="clear:both;"></div>
<div class="tiny" style="margin-bottom:0.5em;">
<span class="crVerifiedStripe"><b class="h3color tiny" style="margin-right: 0.5em;">Amazon Verified Purchase</b><span class="tiny verifyWhatsThis">(<a href="" target="AmazonHelp" onclick="amz_js_PopWin('', 'AmazonHelp', 'width=400,height=500,resizable=1,scrollbars=1,toolbar=0,status=1');return false; ">What's this?</a>)</span></span>
<div class="tiny" style="margin-bottom:0.5em;">
<b><span class="h3color tiny">This review is from: </span><a href="" rel="nofollow noreferrer">The Great Gatsby (Paperback)</a></b>
Having reread this book for the first time in 20 years, I can confirm that there's a reason that it's considered one of the very best American novels. However, my reaction to the story was different than when I first read it in high school. I recall that back then I was hoping that Daisy and Gatsby's love story would ultimately yield a happy ending. Now, I found them both to be such shallow creatures that they inspired no pity. While I considered the characters to be emotionally stunted, that dooesn't mean I was not impressed with Fitzergerald's skillful rendering. As in most forms of art, in literature it is more difficult to accurately and interestingly portray nothingness than to describe a richly endowed subject. At this more cynical age, I found Daisy to be a remarkable emotional void, and Gatsby's quest to pour all of his hopes and dreams into such a shallow cauldron only confirmed his own vapidity. One thing that hasn't changed in all these years is my amazement at Fitzgerald's ability to set a scene. His descriptive passages are truly poetic, and his command of word choice in unparalleled. All this made for a stimulating and delightful read.
<div style="padding-top: 10px; clear: both; width: 100%;">
<div class="reviews-voting-stripe" style="float:left; padding-right:15px; border-right:1px solid #CCCCCC"><div style="padding-bottom:5px;"><b class="tiny" style="color:#666666;white-space:nowrap;">Help other customers find the most helpful reviews</b> </div><div style="width:300px;">
<a name="R3KCIEAV000FPG.2115.Helpful.Reviews" style="font-size:1px;"> </a><span class="crVotingButtons"><nobr><span class="votingPrompt">Was this review helpful to you? </span><a rel="nofollow" class="votingButtonReviews votingButton-yes" href=""><span class="cmtySprite s_largeYes "><span>Yes</span></span></a>
<a rel="nofollow" class="votingButtonReviews votingButton-no" href=""><span class="cmtySprite s_largeNo "><span>No</span></span></a></nobr> <span class="votingMessage"></span></span>
</div></div><div style="float:left;"><div style="padding-left:15px;"><div style="white-space:nowrap;"><span class="tiny">
<a name="R3KCIEAV000FPG.2115.Inappropriate.Reviews" style="font-size:1px;"> </a><span class="reportingButton"><nobr><a rel="nofollow" class="reportingButton" href="">Report abuse</a></nobr></span>
</span> <span style="color:#CCCCCC;">|</span> <span class="tiny"><a href="">Permalink</a></span></div><div style="white-space:nowrap;padding-left:-5px;padding-top:5px;"><a href=""><span class="swSprite s_comment "><span>Comment</span></span></a> <a href="">Comments (19)</a></div></div></div><div style="clear:both;"></div>
<br />
From this output I would like to extract two integers, 335 and 368 (how many people found the review useful), and a string containing the review's text (just the words, without tags or newlines), which sits in the main div after five sub-divs. How can I get just this part of the div, without the rest, by working with the tags?
I converted the object returned by BeautifulSoup to a string and loaded it back into a soup. Is there another way to do this? It doesn't seem very clean. Then I used your method, but I get a lot of empty lines. I tried to remove them by stripping and with a condition, but they are still there:
import urllib2
from BeautifulSoup import BeautifulSoup
turl = ""
toppage = urllib2.urlopen(turl)
soup = BeautifulSoup(toppage)
products = soup.findAll("div", style="margin-left:0.5em;")
for (counter, i) in enumerate(products):
    soup2 = BeautifulSoup(str(products[counter]))
    for (counter2, x) in enumerate(soup2.div):
        if x.string:
            if x.string.isspace:
                print "empty string"
            print "string number " + str(counter) + " " + x.string.strip().lstrip()
Upvotes: 1
Views: 9767
Reputation: 88218
Using your source webpage, here is a complete example:
import urllib2, re
from BeautifulSoup import BeautifulSoup
turl = ""
toppage = urllib2.urlopen(turl)
soup = BeautifulSoup(toppage)
review_tag = {'class':re.compile("mt9 reviewText")}
helpful_tag = {'class':re.compile("hlp")}
all_reviews = soup.findAll(attrs=review_tag)
all_helpful = soup.findAll(attrs=helpful_tag)
for text, info in zip(all_reviews, all_helpful):
    print info.string.strip()
    print '\n'.join(text.findAll(text=True)).strip()
    print "*******************************************"
This gives:
337 of 370 people found the following review helpful
Having reread this book for the first time in 20 years, I can confirm that there's a reason that it's [...]
114 of 123 people found the following review helpful
It's difficult to give any even-handed critique F. Scott Fitzgerald's standard-setting Jazz Age [...]
54 of 60 people found the following review helpful
Scott Fitzgerald, a monumental talent who only occasionally got things working right, made Gatsby great by the extraordinary invention of Nick Carraway. Carraway as
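The question also asks for the two vote counts as integers. Here is a minimal sketch of one way to pull them out of the "N of M people found..." line with a regular expression. `parse_helpful` and `HELPFUL_RE` are hypothetical names, not part of BeautifulSoup, and the pattern assumes the wording stays as in the output above and that the counts contain no thousands separators.

```python
import re

# Hypothetical helper; assumes the line looks like
# "337 of 370 people found the following review helpful".
HELPFUL_RE = re.compile(r"(\d+)\s+of\s+(\d+)\s+people")


def parse_helpful(line):
    """Return (helpful, total) as ints, or None if the line does not match."""
    match = HELPFUL_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Inside the loop above, something like `parse_helpful(info.string)` should then give the pair for each review.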
The following was written before the question was edited:
Assuming you've loaded the data into a soup unimaginatively called soup:
for x in soup.body.div:
    if x.string:
        print x.string.strip()
335 of 368 people found the following review helpful
Having reread this book for the first time in 20 years, [... more here]
These are the strings you are looking for.
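If a loop like this prints blank lines, the likely cause is that whitespace-only strings such as "\n" are still truthy, so `if x.string:` lets them through. Note also that `isspace` is a method and has to be called as `x.string.isspace()`; without the parentheses the test is always true. Below is a rough sketch of filtering and flattening the fragments. It works on plain strings so it runs without BeautifulSoup, and `clean_fragments` is a made-up name.

```python
def clean_fragments(fragments):
    """Drop None and whitespace-only fragments; collapse inner whitespace."""
    cleaned = []
    for fragment in fragments:
        if fragment is None:
            continue
        # split() with no arguments splits on any run of whitespace,
        # so joining with a single space removes newlines and padding.
        text = " ".join(fragment.split())
        if text:
            cleaned.append(text)
    return cleaned
```

The values of `x.string` collected in the loop could be passed to it as a list, which should leave only the helpfulness line and the review text.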
The HTML may be a mess, so here are some tips for finding your way around an unfamiliar page. First, I located the text:
import re
x = soup.find(text=re.compile('Having reread this book'))
Then I stepped through the parents to find out what I was looking into:
print x.parent
print x.parent.parent
print x.parent.parent.parent
From there I saw that everything was contained inside the main divs as strings. After that, it was simple to loop over what I was looking for.
Upvotes: 4