Filtering BeautifulSoup find_all results

Question

I'm trying to collect some reviews of books from Amazon. Here's what I have so far:

import requests
from bs4 import BeautifulSoup

def data(site):
    # Download the page and parse it. Naming the parser explicitly keeps
    # the result independent of which parsers happen to be installed.
    r = requests.get(site)
    soup = BeautifulSoup(r.text, "html.parser")
    # Each review sits inside a div with this exact inline style.
    y = soup.find_all("div", style="margin-left:0.5em;")
    # Collapse runs of whitespace in each div's text.
    reviews = [" ".join(item.text.split()) for item in y]
    return reviews

f = data('http://www.amazon.com/Dance-Dragons-Song-Fire-Book/product-reviews/0553801473/ref=cm_cr_pr_top_link_11?ie=UTF8&pageNumber=11&showViewpoints=0&sortBy=bySubmissionDateDescending')

In addition to the review text, I get some extraneous information, such as the author, the title, and the number of people who found the review helpful. Is there a way to use BeautifulSoup to exclude everything but the text of the reviews? The review text itself doesn't have class or style attributes, while the other bits of text do (I think...), but I haven't found a way to filter my soup.find_all results. I would really appreciate any help.
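
For illustration, here is a minimal sketch of one possible way to do that filtering. It assumes the guess above holds: the review text sits directly inside the div, and the extraneous bits are direct child tags carrying a class or style attribute. SAMPLE_HTML and review_texts are hypothetical names, and the markup is made up to stand in for the real page, whose structure may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the Amazon page (an assumption,
# not the real page source).
SAMPLE_HTML = """
<div style="margin-left:0.5em;">
  <div style="margin-bottom:0.5em;">3 of 4 people found the following review helpful</div>
  <span class="swSprite">5.0 out of 5 stars</span>
  <div class="tiny">By A. Reader</div>
  Great book, but far too long.
  <br/>
  Still worth reading.
</div>
"""

def has_class_or_style(tag):
    # True for any tag that carries a class or style attribute.
    return tag.has_attr("class") or tag.has_attr("style")

def review_texts(html):
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for div in soup.find_all("div", style="margin-left:0.5em;"):
        # Remove direct children that carry class/style (the metadata),
        # leaving the bare review text and plain tags such as <br/>.
        for tag in div.find_all(has_class_or_style, recursive=False):
            tag.decompose()
        # Join the remaining text and collapse whitespace.
        reviews.append(" ".join(div.get_text(" ").split()))
    return reviews

print(review_texts(SAMPLE_HTML))
```

Note that decompose() modifies the parsed tree in place, so this only suits cases where the removed elements aren't needed afterwards. If the real page nests the review text differently, the filter function would need adjusting.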
