user1709173

Filtering BeautifulSoup find_all results

I'm trying to collect some reviews of books from Amazon. Here's what I have so far:

import requests
from bs4 import BeautifulSoup

def data(site):
    url = site
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")  # name the parser explicitly
    y = soup.find_all("div", style = "margin-left:0.5em;")
    words = []
    for item in y:
        item = str(item.text).split()
        words.append(item)
    reviews = [" ".join(x) for x in words]
    return reviews

f = data('http://www.amazon.com/Dance-Dragons-Song-Fire-Book/product-reviews/0553801473/ref=cm_cr_pr_top_link_11?ie=UTF8&pageNumber=11&showViewpoints=0&sortBy=bySubmissionDateDescending')

In addition to the review, I get some extraneous information, such as author, title, and number of people who found the review helpful. Is there a way to use BeautifulSoup to exclude everything but the text of the reviews? The text of the reviews doesn't have class or style attributes and the other bits of text do (I think...), but I haven't found a way to filter my soup.find_all results. I would really appreciate any help.

Upvotes: 0

Views: 3904

Answers (1)

Sufian Latif

Reputation: 13356

All the reviews are enclosed in a table, so you can find the table first and then extract the review text from each matching div inside it.

Changing this line should do it:

...
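# Find the reviews table first, then search for the review divs inside it.
# (A comment placed after a line-continuation backslash is a syntax error,
# so the lookup is split into two statements.)
table = soup.find('table', {'id': 'productReviews'})
y = table.find_all("div", style="margin-left:0.5em;")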
...
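
Scoping the search to the table removes matches elsewhere on the page, but each matched div may still contain nested elements (helpful-vote counts, author lines). Here is a minimal sketch of one way to drop those, assuming the review text sits as bare text directly inside the div while the extra bits are wrapped in their own tags, as the question suggests. The SAMPLE_HTML markup and the review_texts helper are made up for illustration; the real page structure may differ.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup, invented for illustration; the real page may differ.
SAMPLE_HTML = """
<div style="margin-left:0.5em;">Outside the table, should be ignored</div>
<table id="productReviews"><tr><td>
  <div style="margin-left:0.5em;">
    <div style="margin-bottom:0.5em;">3 of 4 people found the following review helpful</div>
    <div class="tiny">By <a href="#">Some Reader</a></div>
    Great book, although the pacing is slow.
    <br/>
    Worth the wait.
  </div>
</td></tr></table>
"""


def review_texts(html):
    """Return the bare text of each review div inside the reviews table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="productReviews")
    if table is None:
        # No reviews table on this page
        return []
    reviews = []
    for div in table.find_all("div", style="margin-left:0.5em;"):
        # Keep only text nodes that are direct children of the div;
        # text inside nested tags (author, vote count) is skipped.
        pieces = [
            str(child)
            for child in div.children
            if isinstance(child, NavigableString)
        ]
        # Collapse runs of whitespace and newlines into single spaces
        text = " ".join(" ".join(pieces).split())
        if text:
            reviews.append(text)
    return reviews


print(review_texts(SAMPLE_HTML))
```

If the review text turns out to be wrapped in its own tag on the real page, this filter would return nothing for that div, so inspect the page source before relying on it.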

Upvotes: 1
