Web Scraping in Python

Question

I am trying to scrape a website for a university project. The website is https://www.bonprix.it/prodotto/leggings-a-pinocchietto-pacco-da-2-leggings-a-pinocchietto-pacco-da-2-bianco-nero-956015/?itemOptionId=12211813. I have a problem with my Python code: I want to collect all the reviews from pages 1 to 5, but instead I only get empty lists ([]). Any help would be appreciated!

Here is the code:

import csv
from bs4 import BeautifulSoup
import urllib.request
import re
import pandas as pd
import requests
reviewlist = []
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.open('https://www.bonprix.it/prodotto/leggings-a-pinocchietto-pacco-da-2-leggings-a-pinocchietto-pacco-da-2-bianco-nero-956015/?itemOptionId=12211813')

soup = BeautifulSoup(response,'html.parser')

reviews = soup.find_all('div',{'class':'reviewContent'})


for i in reviews:
    review = {

        'per_review_name' : i.find('span',{'itemprop':'name'}).text.strip(),
        'per_review' : i.find('p',{'class':'reviewText'}).text.strip(),
        'per_review_taglia' : i.find('p',{'class':'singleReviewSizeDescr'}).text.strip(),
        
    }
    reviewlist.append(review)
   
for page in range (1,5):
    prova = soup.find_all('div',{'data-page': '{page}'})
    print(prova)
    print(len(reviewlist))
        
df = pd.DataFrame(reviewlist)
df.to_csv('list.csv',index=False)
print('Fine.')

And here is the output that I get:

[]
5
[]
5
[]
5
[]
5
Fine.

Saeed Esmaili · Accepted Answer

The website loads only the first page of reviews in the initial request. If you inspect the network requests in your browser, you can see that the page requests additional data each time you switch to another page of reviews. You can rewrite your code as follows to get the reviews from all pages:

import requests
from bs4 import BeautifulSoup

# Step 1: fetch each page of reviews and collect the review <div> elements.
reviews_dom = []
for page in range(1, 6):  # pages 1 through 5 (the end value is excluded)
    url = f"https://www.bonprix.it/reviews/list/?styleId=31436999&sortby=date&page={page}&rating=0&variant=0&size=0&bodyHeight=0&showOldReviews=true&xxl=false&variantFilters="
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    reviews_dom += soup.find_all("div", attrs={"class": "reviewContent"})

# Step 2: extract the fields we need from each review element.
reviews = []
for review_item in reviews_dom:
    review = {
        'per_review_name': review_item.find('span', attrs={'itemprop': 'name'}).text.strip(),
        'per_review': review_item.find('p', attrs={'class': 'reviewText'}).text.strip(),
        'per_review_taglia': review_item.find('p', attrs={'class': 'singleReviewSizeDescr'}).text.strip(),
    }
    reviews.append(review)

print(len(reviews))
print(reviews)

What happens in the code?

In the first loop, we request the data for each page of reviews (the first 5 pages in the example above).
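
If the long hand-written URL is hard to read, one option is to let the standard library build the query string for each page. This is only a sketch: the helper name build_reviews_url is hypothetical, and the endpoint and parameter values are simply copied from the URL in the answer, so they may stop working if the site changes.

```python
from urllib.parse import urlencode

# Endpoint and parameters copied from the URL used in the answer above.
BASE_URL = "https://www.bonprix.it/reviews/list/"


def build_reviews_url(page, style_id="31436999"):
    """Return the review-list URL for one page (hypothetical helper)."""
    params = {
        "styleId": style_id,
        "sortby": "date",
        "page": page,
        "rating": 0,
        "variant": 0,
        "size": 0,
        "bodyHeight": 0,
        "showOldReviews": "true",
        "xxl": "false",
        "variantFilters": "",
    }
    return f"{BASE_URL}?{urlencode(params)}"


# range(1, 6) yields pages 1 through 5, because the end value is excluded.
urls = [build_reviews_url(page) for page in range(1, 6)]
```

Each entry in urls can then be passed to requests.get in place of the f-string. Note that the range(1, 5) used in the question would only cover pages 1 to 4.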

In the second loop, we parse the collected review elements and extract the data we need.
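
The question also saved the results to list.csv. A minimal sketch of that last step with the standard csv module could look like the following; write_reviews_csv is a hypothetical helper name, and the sample rows are invented placeholders standing in for the list of review dictionaries, not real data from the site.

```python
import csv
import io

# Column names match the dictionary keys used in the question and answer.
FIELDNAMES = ["per_review_name", "per_review", "per_review_taglia"]


def write_reviews_csv(reviews, out_file):
    """Write a list of review dicts to an open text file (hypothetical helper)."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(reviews)


# Placeholder rows for illustration only, not real reviews.
sample_reviews = [
    {"per_review_name": "Anna", "per_review": "Comodi", "per_review_taglia": "Taglia giusta"},
    {"per_review_name": "Marta", "per_review": "Ottimo", "per_review_taglia": "Un po' grande"},
]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_reviews_csv(sample_reviews, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, open the target with open("list.csv", "w", newline="", encoding="utf-8") and pass that file object to the helper. pd.DataFrame(reviews).to_csv("list.csv", index=False), as in the question, should give an equivalent file.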
