Reputation:
I'm trying to collect some reviews of books from Amazon. Here's what I have so far:
import requests
from bs4 import BeautifulSoup

def data(site):
    url = site
    r = requests.get(url)
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(r.text, "html.parser")
    y = soup.find_all("div", style="margin-left:0.5em;")
    words = []
    for item in y:
        # split on whitespace so runs of spaces and newlines collapse when rejoined
        item = str(item.text).split()
        words.append(item)
    reviews = [" ".join(x) for x in words]
    return reviews

f = data('http://www.amazon.com/Dance-Dragons-Song-Fire-Book/product-reviews/0553801473/ref=cm_cr_pr_top_link_11?ie=UTF8&pageNumber=11&showViewpoints=0&sortBy=bySubmissionDateDescending')
In addition to the review, I get some extraneous information, such as author, title, and number of people who found the review helpful. Is there a way to use BeautifulSoup to exclude everything but the text of the reviews? The text of the reviews doesn't have class or style attributes and the other bits of text do (I think...), but I haven't found a way to filter my soup.find_all results. I would really appreciate any help.
Upvotes: 0
Views: 3904
Reputation: 13356
All the reviews are enclosed in a table with the id productReviews, so you can find the table first, then extract the review text from each matching div inside it.
Changing this line should do it:
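...
# find the reviews table first, then the review divs inside it
# (parentheses replace the backslashes, since a comment after "\" is a syntax error)
y = (
    soup.find('table', {'id': 'productReviews'})
        .find_all("div", style="margin-left:0.5em;")
)
...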
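If the extra bits (title, author, helpful votes) are still mixed in after narrowing to the table, one possible approach builds on the observation in the question that the review text itself is not wrapped in a tag of its own: keep only the text nodes that are direct children of each review div. This is a minimal sketch, not a tested scraper. The SAMPLE_HTML markup and the review_texts name are made up for illustration and only assume the structure described above; the real page may differ.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup that mimics the structure described in the question;
# it is NOT copied from the real Amazon page.
SAMPLE_HTML = """
<table id="productReviews"><tr><td>
<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">3 of 4 people found the following review helpful</div>
  <span class="title">Great book</span>
  First line of the review.
  <br/>
  Second line of the review.
</div>
</td></tr></table>
"""

def review_texts(html):
    """Return the bare review text of each review div (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "productReviews"})
    if table is None:
        # no reviews table on this page
        return []
    reviews = []
    for div in table.find_all("div", style="margin-left:0.5em;"):
        # keep only strings that are direct children of the div;
        # text inside nested tags (title, votes, ...) is skipped
        parts = [child for child in div.children
                 if isinstance(child, NavigableString)]
        # collapse whitespace the same way the original code does
        text = " ".join(" ".join(parts).split())
        if text:
            reviews.append(text)
    return reviews

print(review_texts(SAMPLE_HTML))
```

In data() this would replace the loop over y. If part of the review turns out to be wrapped in inline tags such as b or i, that part would be dropped by this filter, so check the output against a few real reviews.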
Upvotes: 1