Reputation: 49
I am trying to scrape a website for a university project. The website is https://www.bonprix.it/prodotto/leggings-a-pinocchietto-pacco-da-2-leggings-a-pinocchietto-pacco-da-2-bianco-nero-956015/?itemOptionId=12211813. I have a problem with my Python code: I want to get all the reviews from pages 1 to 5, but instead I only get empty lists ([]). Any help would be appreciated!
Here is the code:
import csv
from bs4 import BeautifulSoup
import urllib.request
import re
import pandas as pd
import requests
reviewlist = []
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"
opener = AppURLopener()
response = opener.open('https://www.bonprix.it/prodotto/leggings-a-pinocchietto-pacco-da-2-leggings-a-pinocchietto-pacco-da-2-bianco-nero-956015/?itemOptionId=12211813')
soup = BeautifulSoup(response,'html.parser')
reviews = soup.find_all('div',{'class':'reviewContent'})
for i in reviews:
    review = {
        'per_review_name' : i.find('span',{'itemprop':'name'}).text.strip(),
        'per_review' : i.find('p',{'class':'reviewText'}).text.strip(),
        'per_review_taglia' : i.find('p',{'class':'singleReviewSizeDescr'}).text.strip(),
    }
    reviewlist.append(review)
for page in range (1,5):
    prova = soup.find_all('div',{'data-page': '{page}'})
    print(prova)
    print(len(reviewlist))
df = pd.DataFrame(reviewlist)
df.to_csv('list.csv',index=False)
print('Fine.')
And here is the output that I get:
[]
5
[]
5
[]
5
[]
5
Fine.
Upvotes: 2
Views: 136
Reputation: 843
The website only loads the first page of reviews in the initial request. If you inspect the network requests, you can see that it requests additional data whenever you change the review page. You can rewrite your code as follows to get the reviews from all pages:
import requests
from bs4 import BeautifulSoup

# First loop: fetch each page of reviews and collect the review <div> elements.
reviews_dom = []
for page in range(1,6):
    url = f"https://www.bonprix.it/reviews/list/?styleId=31436999&sortby=date&page={page}&rating=0&variant=0&size=0&bodyHeight=0&showOldReviews=true&xxl=false&variantFilters="
    r = requests.request("GET", url)
    soup = BeautifulSoup(r.text, "html.parser")
    reviews_dom += soup.find_all("div", attrs={"class": "reviewContent"})
# Second loop: extract the fields we need from each collected review element.
reviews = []
for review_item in reviews_dom:
    review = {
        'per_review_name' : review_item.find('span', attrs={'itemprop':'name'}).text.strip(),
        'per_review' : review_item.find('p', attrs={'class':'reviewText'}).text.strip(),
        'per_review_taglia' : review_item.find('p', attrs={'class':'singleReviewSizeDescr'}).text.strip(),
    }
    reviews.append(review)
print(len(reviews))
print(reviews)
The first loop requests the data for each page of reviews (the first 5 pages in the example above).
The second loop parses the collected review elements and extracts the data we need.
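A side note on the loop in the question, as a minimal sketch: the string '{page}' there has no f prefix, so Python keeps the braces literally and the search looks for an attribute value of exactly "{page}" on every pass. The variable names below are only illustrative. Adding the prefix alone would probably not fix the scrape, since the other review pages are not part of the initial HTML, but it explains why the same empty result was printed each time.

```python
page = 3

# Without the f prefix the braces are ordinary characters.
plain = '{page}'

# With the f prefix the current value of page is substituted.
formatted = f'{page}'

print(repr(plain))
print(repr(formatted))
```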
Upvotes: 0
Reputation: 91
As I understand it, the site uses JavaScript to load most of its content, so that data isn't in the initial HTML and you can't scrape it from the product page. You can, however, use the rating backend for your product page. The link is:
You can go through the pages by changing the page parameter in the URL of the GET request. The link returns an HTML document of the rating page, and you can get the rating from the rating value meta tag.
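The two steps described above could be sketched roughly as follows, using only the standard library. Everything here is an assumption for illustration: the helper names with_page and extract_ratings are made up, the example.com URL is a placeholder for the real backend link, and the parser assumes the rating is exposed as a meta tag with itemprop="ratingValue" and a content attribute, which should be checked against the actual response.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return url with its 'page' query parameter set to the given page."""
    parts = urlsplit(url)
    # keep_blank_values preserves parameters such as "variantFilters=".
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["page"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


class RatingMetaParser(HTMLParser):
    """Collect the content of <meta itemprop="ratingValue"> tags (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("itemprop") == "ratingValue":
            self.ratings.append(attrs.get("content"))


def extract_ratings(html):
    """Return the rating values found in one HTML document, in page order."""
    parser = RatingMetaParser()
    parser.feed(html)
    return parser.ratings


# Placeholder URL, not the real backend link.
base_url = "https://example.com/reviews/list/?sortby=date&page=1"
page_urls = [with_page(base_url, page) for page in range(1, 6)]

# Hand-written sample standing in for one downloaded rating page.
sample_html = '<div><meta itemprop="ratingValue" content="4"></div>'
print(page_urls)
print(extract_ratings(sample_html))
```

In a real run, each URL in page_urls would be downloaded (for example with requests.get) and the response text passed to extract_ratings.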
Upvotes: 2