sirVir

Reputation: 1526

Extracting content of div with BeautifulSoup

First, I should say that I have already found the same question with answers, but I couldn't get them to work. I am trying to extract data from reviews: for now, each review's content and its usefulness. I am new to BeautifulSoup and to Python in general.

For now, I use the findAll method to get a list of the divs containing the reviews. For example, on a site with opinions about a product:

import urllib2
from BeautifulSoup import BeautifulSoup
turl = "http://www.amazon.com/The-Great-Gatsby-Scott-Fitzgerald/product-reviews/0743273567/ref=cm_cr_pr_hist_5?ie=UTF8&filterBy=addFiveStar&showViewpoints=0"
page = urllib2.urlopen(turl)
soup = BeautifulSoup(page)
products = soup.findAll("div", style="margin-left:0.5em;")
print products[0]

This gives output like the following:

<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
        335 of 368 people found the following review helpful
      </div>
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;"><span class="swSprite s_star_5_0 " title="5.0 out of 5 stars"><span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;"><b>Decades later, still great but on different terms.</b>, <nobr>August 24, 2001</nobr></span>
</div>
<div style="margin-bottom:0.5em;">
<div><div style="float:left;">By&nbsp;</div><div style="float:left;"><a href="http://www.amazon.com/gp/pdp/profile/A1IKD6BDEE18CI"><span style="font-weight: bold;">mirope "mirope"</span></a>  - <a href="http://www.amazon.com/gp/cdp/member-reviews/A1IKD6BDEE18CI?ie=UTF8&amp;sort_by=MostRecentReview">See all my reviews</a><br />
<a href="http://www.amazon.com/gp/help/customer/display.html?ie=UTF8&amp;nodeId=14279681&amp;pop-up=1#VN" target="AmazonHelp" onclick="return amz_js_PopWin(this.href,'AmazonHelp','width=340,height=340,resizable=1,scrollbars=1,toolbar=1,status=1');"><span class="cmtySprite s_BadgeVineVoice "><span>(VINE VOICE)</span></span></a>
&nbsp;&nbsp;


</div></div><div style="clear:both;"></div>
</div>
<div class="tiny" style="margin-bottom:0.5em;">
<span class="crVerifiedStripe"><b class="h3color tiny" style="margin-right: 0.5em;">Amazon Verified Purchase</b><span class="tiny verifyWhatsThis">(<a href="http://www.amazon.com/gp/community-help/amazon-verified-purchase" target="AmazonHelp" onclick="amz_js_PopWin('http://www.amazon.com/gp/community-help/amazon-verified-purchase', 'AmazonHelp', 'width=400,height=500,resizable=1,scrollbars=1,toolbar=0,status=1');return false; ">What's this?</a>)</span></span>
</div>
<div class="tiny" style="margin-bottom:0.5em;">
<b><span class="h3color tiny">This review is from: </span><a href="https://rads.stackoverflow.com/amzn/click/com/0684801523" rel="nofollow noreferrer">The Great Gatsby (Paperback)</a></b>
</div>

Having reread this book for the first time in 20 years, I can confirm that there's a reason that it's considered one of the very best American novels. However, my reaction to the story was different than when I first read it in high school. I recall that back then I was hoping that Daisy and Gatsby's love story would ultimately yield a happy ending. Now, I found them both to be such shallow creatures that they inspired no pity. While I considered the characters to be emotionally stunted, that dooesn't mean I was not impressed with Fitzergerald's skillful rendering. As in most forms of art, in literature it is more difficult to accurately and interestingly portray nothingness than to describe a richly endowed subject. At this more cynical age, I found Daisy to be a remarkable emotional void, and Gatsby's quest to pour all of his hopes and dreams into such a shallow cauldron only confirmed his own vapidity. One thing that hasn't changed in all these years is my amazement at Fitzgerald's ability to set a scene. His descriptive passages are truly poetic, and his command of word choice in unparalleled. All this made for a stimulating and delightful read.
      <div style="padding-top: 10px; clear: both; width: 100%;">
<div class="reviews-voting-stripe" style="float:left; padding-right:15px; border-right:1px solid #CCCCCC"><div style="padding-bottom:5px;"><b class="tiny" style="color:#666666;white-space:nowrap;">Help other customers find the most helpful reviews</b>&nbsp;</div><div style="width:300px;">
<a name="R3KCIEAV000FPG.2115.Helpful.Reviews" style="font-size:1px;"> </a><span class="crVotingButtons"><nobr><span class="votingPrompt">Was this review helpful to you?&nbsp;</span><a rel="nofollow" class="votingButtonReviews votingButton-yes" href="http://www.amazon.com/gp/voting/cast/Reviews/2115/R3KCIEAV000FPG/Helpful/1?ie=UTF8&amp;target=aHR0cDovL3d3dy5hbWF6b24uY29tL3Jldmlldy8wNzQzMjczNTY3&amp;token=9BE8627F650F9D873DB4042D67CB37FA98AFD161&amp;voteAnchorName=R3KCIEAV000FPG.2115.Helpful.Reviews&amp;voteSessionID=000-0000000-0000000"><span class="cmtySprite s_largeYes "><span>Yes</span></span></a>
<a rel="nofollow" class="votingButtonReviews votingButton-no" href="http://www.amazon.com/gp/voting/cast/Reviews/2115/R3KCIEAV000FPG/Helpful/-1?ie=UTF8&amp;target=aHR0cDovL3d3dy5hbWF6b24uY29tL3Jldmlldy8wNzQzMjczNTY3&amp;token=B35087155FEB75AC5155B500CE8518AEFD4ADBAC&amp;voteAnchorName=R3KCIEAV000FPG.2115.Helpful.Reviews&amp;voteSessionID=000-0000000-0000000"><span class="cmtySprite s_largeNo "><span>No</span></span></a></nobr> <span class="votingMessage"></span></span>
</div></div><div style="float:left;"><div style="padding-left:15px;"><div style="white-space:nowrap;"><span class="tiny">
<a name="R3KCIEAV000FPG.2115.Inappropriate.Reviews" style="font-size:1px;"> </a><span class="reportingButton"><nobr><a rel="nofollow" class="reportingButton" href="http://www.amazon.com/gp/voting/cast/Reviews/2115/R3KCIEAV000FPG/Inappropriate/1?ie=UTF8&amp;target=aHR0cDovL3d3dy5hbWF6b24uY29tL3Jldmlldy8wNzQzMjczNTY3&amp;token=414B10F161A63A55D269D6EE7DC174FF22482F7E&amp;voteAnchorName=R3KCIEAV000FPG.2115.Inappropriate.Reviews&amp;voteSessionID=000-0000000-0000000">Report abuse</a></nobr></span>
</span> <span style="color:#CCCCCC;">|</span> <span class="tiny"><a href="http://www.amazon.com/review/R3KCIEAV000FPG">Permalink</a></span></div><div style="white-space:nowrap;padding-left:-5px;padding-top:5px;"><a href="http://www.amazon.com/review/R3KCIEAV000FPG"><span class="swSprite s_comment "><span>Comment</span></span></a>&nbsp;<a href="http://www.amazon.com/review/R3KCIEAV000FPG">Comments (19)</a></div></div></div><div style="clear:both;"></div>
</div>
<br />
</div>

From this output I would like to extract two integers, 335 and 368 (how many people found the review helpful), and a string containing the review's text (just the words, without tags or newlines). The text sits directly in the main div, after five sub-divs. How can I get just that part of the div, without the rest, by working with the tags?

EDIT:

I converted the object returned by BeautifulSoup to a string and loaded it back into a soup. Is there another way to do this? It doesn't seem very clean. Then I used your method, but I get a lot of empty lines. I tried to remove them by stripping and using a condition, but they are still there:

import urllib2
from BeautifulSoup import BeautifulSoup
turl = "http://www.amazon.com/The-Great-Gatsby-Scott-Fitzgerald/product-reviews/0743273567/ref=cm_cr_pr_hist_5?ie=UTF8&filterBy=addFiveStar&showViewpoints=0"
toppage = urllib2.urlopen(turl)
soup = BeautifulSoup(toppage)
products = soup.findAll("div", style="margin-left:0.5em;")

for (counter,i) in enumerate(products):
    soup2 = BeautifulSoup(str(products[counter]))
    for (counter2,x) in enumerate(soup2.div):
        if x.string:
            if x.string.isspace:
                print "empty string"
            else:
                print "string number " + str(counter) + " " + x.string.strip().lstrip() 

Upvotes: 1

Views: 9767

Answers (1)

Hooked

Reputation: 88218

Complete minimal working example

Using your source web page, here is a complete example:

import urllib2, re
from BeautifulSoup import BeautifulSoup   

turl = "http://rads.stackoverflow.com/amzn/click/0743273567"
toppage = urllib2.urlopen(turl)
soup = BeautifulSoup(toppage)

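# Match elements whose class attribute matches these patterns.
review_tag  = {'class': re.compile("mt9 reviewText")}
helpful_tag = {'class': re.compile("hlp")}

all_reviews = soup.findAll(attrs=review_tag)
all_helpful = soup.findAll(attrs=helpful_tag)

# Pair each review body with its "N of M people found ..." line.
for text, info in zip(all_reviews, all_helpful):
    print info.string.strip()
    # Collect every text node inside the review, dropping the tags.
    print '\n'.join(text.findAll(text=True)).strip()
    print "*******************************************"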

This gives:

337 of 370 people found the following review helpful
Having reread this book for the first time in 20 years, I can confirm that there's a reason that it's [...]
*******************************************
114 of 123 people found the following review helpful
It's difficult to give any even-handed critique F. Scott Fitzgerald's standard-setting Jazz Age [...]
*******************************************
54 of 60 people found the following review helpful
Scott Fitzgerald, a monumental talent who only occasionally got things working right, made Gatsby great by the extraordinary invention of Nick Carraway.  Carraway as
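
The question asked for the two counts as integers and for the review text on a single line, while the example above prints the whole helpfulness sentence and newline-joined text. Here is a minimal sketch of that last step. It assumes the helpfulness line always has the form "N of M people ..." as in the output above, and it assumes counts may contain thousands separators. The helper names parse_helpful and flatten_text are hypothetical and not part of BeautifulSoup. It uses only the standard re module, so it can be applied to info.string and to the list returned by text.findAll(text=True):

```python
import re

# Matches lines such as "335 of 368 people found the following review helpful".
# Allowing commas inside the numbers is an assumption, not something the page confirms.
HELPFUL_RE = re.compile(r"(\d[\d,]*)\s+of\s+(\d[\d,]*)\s+people")


def parse_helpful(line):
    """Return (helpful, total) as integers, or None if the line does not match."""
    match = HELPFUL_RE.search(line)
    if match is None:
        return None
    helpful, total = (int(group.replace(",", "")) for group in match.groups())
    return helpful, total


def flatten_text(fragments):
    """Join text fragments into one line, collapsing newlines and repeated spaces."""
    return " ".join(" ".join(fragments).split())


print(parse_helpful("337 of 370 people found the following review helpful"))
print(flatten_text(["Having reread this book\n", "  for the first time in 20 years"]))
```

Returning None for a non-matching line keeps the caller in control of what to do with reviews that have no votes yet.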

Old version

This was made before the edit to the post:

Assuming you've loaded the data into a soup, unimaginatively called soup:

for x in soup.body.div:
    if x.string:
        print x.string.strip()

Gives:

335 of 368 people found the following review helpful

Having reread this book for the first time in 20 years, [... more here]

Which are the strings you are looking for.
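
The loop above may also print blank lines, because a whitespace-only string is still truthy and strip() turns it into an empty string. Here is a minimal sketch of filtering those out. A plain list of strings stands in for the children of the div, and both the sample data and the helper name non_blank are illustrative assumptions. Note that isspace is a method and has to be called: without parentheses, s.isspace is a bound method object, which is truthy for every string.

```python
def non_blank(strings):
    """Return the stripped strings, skipping None and whitespace-only entries."""
    cleaned = []
    for s in strings:
        if s is None:
            continue
        s = s.strip()
        if s:  # an empty string is falsy, so blank entries are dropped here
            cleaned.append(s)
    return cleaned


# Stand-in for the values of x.string while looping over the div's children.
sample = [
    "\n",
    "  335 of 368 people found the following review helpful\n",
    None,
    "   ",
    "Having reread this book\n",
]

for line in non_blank(sample):
    print(line)

# Forgetting the parentheses: the method object itself is always truthy.
print(bool("text".isspace), "text".isspace())
```

Stripping first and then testing the result avoids needing isspace at all.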

Is it that easy?

The HTML may be a mess, so here are some tips on spidering your way through a new web page. First, I found the text:

import re
x = soup.find(text=re.compile('Having reread this book'))

Then I stepped through the parents to find out what I was looking into:

print x.parent
print x.parent.parent
print x.parent.parent.parent

From there I saw that everything was contained inside the main divs as strings. It was then simple to loop over what I was looking for!

Upvotes: 4
