Mr. Robot

Reputation: 159

Scraping Customer Reviews from DM.de

I have been trying to scrape user reviews from the DM website (dm.de) without any luck. An example page: https://www.dm.de/l-oreal-men-expert-men-expert-vita-lift-vitalisierende-feuchtigkeitspflege-p3600523606276.html

I have tried to load the product-detail pages with beautifulsoup4 and scrapy.

from bs4 import BeautifulSoup
import requests
url = "https://www.dm.de/l-oreal-men-expert-men-expert-vita-lift-vitalisierende-feuchtigkeitspflege-p3600523606276.html"
response = requests.get(url)
print(response.text)  

Running this code does not return any review content, unlike a page from amazon.de. The response contains only the site's scripts.

EDIT: The browser's developer tools show that the reviews are stored as JSON at the following location. This is exactly what I am trying to extract.

JSON file to Extract

Upvotes: 1

Views: 649

Answers (3)

Mr. Robot

Reputation: 159

I tried hard to scrape DM product detail pages reliably with scrapy and bs4, but could not get an accurate scraper. That is why I moved to Selenium. It is slow, but it scrapes the rendered page accurately.

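    # Assumes `driver` is a Selenium WebDriver, `url` is the product page, and
    # reviews, stars, titles, users, hauttyps, dates, helpUp and helpDown are lists.
    # Requires: import time, random; from bs4 import BeautifulSoup
    try:
        driver.get(url)
        print("Current URL is Valid --> OK")
        print("Current URL : ", url)
    except Exception as e:
        print("URL : ", url, " -->> is Invalid!!!")
        print("Error Occurred : ", e)
        driver.quit()
        raise  # nothing below can run without a loaded page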

    driver.maximize_window()
    driver.set_page_load_timeout(10)

    ## close overlay and cookies
    time.sleep(round(random.uniform(1.0,1.5),2))  # give time to properly load the page initially
    try:
        driver.find_element_by_xpath('//*[@id="custom-layer-wrapper"]/section/header/button').click()
        driver.find_element_by_xpath('//*[@id="overlays"]/div[2]/div/div/div[2]/button').click()
    except Exception as e:
        print(e)

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight*0.65);") # scroll down to next review page button
    time.sleep(round(random.uniform(4.5,5.5),2))  # give the reviews time to load after scrolling

    while True:
        try:
            # iterate through each comment page
            response = driver.execute_script("return document.documentElement.outerHTML")  # Export rendered HTML
            # now extract the reviews
            soup = BeautifulSoup(response, 'lxml')
            soup = soup.find('ol', {'class': 'bv-content-list-reviews'})
            # product_title = product_title + soup.find('div',{'data-dmid' : 'detail-page-headline'}).text

            reviews += soup.find_all('div', {'class': 'bv-content-summary-body-text'})
            stars += soup.find_all('span', {'class': 'bv-content-rating bv-rating-ratio'})
            titles += soup.find_all('div', {'class': 'bv-content-title-container'})
            # the attribute filter must be a dict ('class': ...), not a set ('class', ...)
            users += soup.find_all('div', {'class': 'bv-content-author-name'})
            hauttyps += soup.find_all('div', {'class': 'bv-content-tag-dimensions'})
            dates += soup.find_all('div', {'class': 'bv-content-datetime'})
            # for item in driver.find_elements_by_css_selector('[itemprop="dateCreated"]'):
            #     dates.append(item.get_attribute('content'))

            helpUp += soup.find_all('button', {'class': 'bv-content-btn-feedback-yes'})
            helpDown += soup.find_all('button', {'class': 'bv-content-btn-feedback-no'})

            ## Go to next Review page
            # button_next = driver.find_element_by_xpath('//*[@id="BVRRContainer"]/div/div/div/div/div[3]/div/ul/li[2]/a/span[2]')
            # button_next = driver.find_element_by_css_selector('#BVRRContainer > div > div > div > div > div.bv-content-pagination > div > ul > li.bv-content-pagination-buttons-item.bv-content-pagination-buttons-item-next > a > span.bv-content-btn-pages-next')
            button_next = driver.find_element_by_partial_link_text('►')
            button_next.location_once_scrolled_into_view
            button_next.click()
            time.sleep(round(random.uniform(2.5,3.0),2))  # give the next review page time to load
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight*0.90);") # scroll down to next review page button
            time.sleep(round(random.uniform(4.5,5.0),2))  # give the page time to settle after scrolling

        except Exception as e:
            print(e)
            print("----REACHED THE LAST PAGE-----")
            break

    time.sleep(3)  #
    driver.quit()

Upvotes: 0

chitown88

Reputation: 28565

I don't have time to play around with the params, but it's all there in the request url to get back that json.

import requests
import json

url = "https://api.bazaarvoice.com/data/batch.json?"
num_reviews = 100

query = 'passkey=caYXUVe0XKMhOqt6PdkxGKvbfJUwOPDhKaZoAyUqWu2KE&apiversion=5.5&displaycode=18357-de_de&resource.q0=reviews&filter.q0=isratingsonly%3Aeq%3Afalse&filter.q0=productid%3Aeq%3A596141&filter.q0=contentlocale%3Aeq%3Ade*%2Cde_DE&sort.q0=submissiontime%3Adesc&stats.q0=reviews&filteredstats.q0=reviews&include.q0=authors%2Cproducts%2Ccomments&filter_reviews.q0=contentlocale%3Aeq%3Ade*%2Cde_DE&filter_reviewcomments.q0=contentlocale%3Aeq%3Ade*%2Cde_DE&filter_comments.q0=contentlocale%3Aeq%3Ade*%2Cde_DE&limit.q0=' +str(num_reviews) + '&offset.q0=0&limit_comments.q0=3&callback=bv_1111_19110'

request_url = url + query

response = requests.get(request_url)
# The response is JSONP: strip the callback wrapper "bv_1111_19110(...)" before parsing
jsonStr = response.text.split('(',1)[-1].rsplit(')',1)[0]
jsonData = json.loads(jsonStr)

reviews = jsonData['BatchedResults']['q0']['Results']

for each in reviews:
    print ('Rating: %s\n%s\n' %(each['Rating'], each['ReviewText']))

Output:

Rating: 5
Immer wieder zufrieden

Rating: 5
ich bin mit dem Produkt sehr zufrieden und kann es nur weiterempfehlen.

Rating: 5
Super Creme - zieht schnell ein - angenehmer Geruch - hält lange vor - nicht fettend - ich hatte schon das Gefühl, dass meine Falten weniger geworden sind. Sehr zu empfehlen

Rating: 5
Das Produkt erfüllt meine Erwärtungen in jeder Hinsicht-ich kaufe es gerne immer wieder

Rating: 5
riecht super, zieht schnell ein und hinterlsst ein tolles Hautgefhl

Rating: 3
ganz ok...die Creme fühlt sich nur etwas seltsam an auf der Haut...ich konnte auch nicht wirklich eine Verbesserung des Hautbildes erkennen

Rating: 4
Für meinen Geschmack ist das Produkt zu fettig/dick zum auftauen.

Rating: 1
Ich bin seit mehreren Jahren treuer Benutzer von L'oreal Produkten und habe bis jetzt immer das blaue Gesichtsgel verwendet. Mit dem war ich mehr als zufrieden. Jetzt habe ich die rote Creme gekauft und bin total enttäuscht. Nach ca. einer Stunde entwickelt sich ein sehr seltsamer Geruch, es riecht nach ranssigem Öl! Das ist im Gesicht nicht zu ertragen.

....
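The callback stripping and the field lookup above can be wrapped in two small helpers. This is only a sketch: the function names `unwrap_jsonp` and `extract_reviews` and the sample string are made up for illustration, and the sample is merely shaped like the response shown above, not real API output.

```python
import json
import re


def unwrap_jsonp(text):
    """Return the parsed JSON payload from a JSONP string such as 'cb({...})'."""
    match = re.match(r'^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    if match is None:
        # Not wrapped in a callback: treat the text as plain JSON
        return json.loads(text)
    return json.loads(match.group(1))


def extract_reviews(data, query="q0"):
    """Return (rating, text) pairs from one batched query result."""
    results = data["BatchedResults"][query]["Results"]
    return [(r["Rating"], r["ReviewText"]) for r in results]


# Hypothetical sample shaped like the response described above
sample = ('bv_1111_19110({"BatchedResults": {"q0": {"Results": '
          '[{"Rating": 5, "ReviewText": "Immer wieder zufrieden"}]}}})')
pairs = extract_reviews(unwrap_jsonp(sample))
print(pairs)
```

With a real response you would pass `response.text` to `unwrap_jsonp` instead of the sample string.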

Edit:

Ton of cleaning up to do to make this more compact, but here's the basic query:

import requests
import json

url = "https://api.bazaarvoice.com/data/batch.json"
num_reviews = 100

payload = {
'passkey': 'caYXUVe0XKMhOqt6PdkxGKvbfJUwOPDhKaZoAyUqWu2KE',
'apiversion': '5.5',
'displaycode': '18357-de_de',
'resource.q0': 'reviews',
'filter.q0': 'productid:eq:596141',
'sort.q0': 'submissiontime:desc',
'stats.q0': 'reviews',
'filteredstats.q0': 'reviews',
'include.q0': 'authors,products,comments',
'filter_reviews.q0': 'contentlocale:eq:de*,de_DE',
'filter_reviewcomments.q0': 'contentlocale:eq:de*,de_DE',
'filter_comments.q0': 'contentlocale:eq:de*,de_DE',
'limit.q0': str(num_reviews),
'offset.q0': '0',
'limit_comments.q0': '3',

'resource.q1': 'reviews',
'filter.q1': 'productid:eq:596141',
'sort.q1': 'submissiontime:desc',
'stats.q1': 'reviews',
'filteredstats.q1': 'reviews',
'include.q1': 'authors,products,comments',
'filter_reviews.q1': 'contentlocale:eq:de*,de_DE',
'filter_reviewcomments.q1': 'contentlocale:eq:de*,de_DE',
'filter_comments.q1': 'contentlocale:eq:de*,de_DE',
'limit.q1': str(num_reviews),
'offset.q1': str(num_reviews),  # assumed fix: start after q0's page instead of repeating it
'limit_comments.q1': '3',

'resource.q2': 'reviews',
'filter.q2': 'productid:eq:596141',
'sort.q2': 'submissiontime:desc',
'stats.q2': 'reviews',
'filteredstats.q2': 'reviews',
'include.q2': 'authors,products,comments',
'filter_reviews.q2': 'contentlocale:eq:de*,de_DE',
'filter_reviewcomments.q2': 'contentlocale:eq:de*,de_DE',
'filter_comments.q2': 'contentlocale:eq:de*,de_DE',
'limit.q2': str(num_reviews),
'offset.q2': str(2 * num_reviews),  # assumed fix: start after q1's page
'limit_comments.q2': '3',

'callback': 'bv_1111_19110'}


response = requests.get(url, params = payload)

# The response is JSONP: strip the callback wrapper before parsing
jsonStr = response.text.split('(',1)[-1].rsplit(')',1)[0]
jsonData = json.loads(jsonStr)

# Print the reviews from every batched query (q0, q1, q2)
for k, v in jsonData['BatchedResults'].items():
    for each in v['Results']:
        print('Rating: %s\n%s\n' %(each['Rating'], each['ReviewText']))
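As for making the payload more compact, the repeated q0/q1/q2 entries could be generated in a loop. The sketch below is one way to do that, not a tested client: `build_batch_params` is a hypothetical helper, `"YOUR_PASSKEY"` is a placeholder, and giving each query its own offset (so the queries return consecutive pages) is an assumption about what the batching is meant to achieve.

```python
def build_batch_params(passkey, product_id, page_size=100, pages=3):
    """Build query parameters for `pages` batched review queries (q0, q1, ...)."""
    params = {
        "passkey": passkey,
        "apiversion": "5.5",
        "displaycode": "18357-de_de",
        "callback": "bv_1111_19110",
    }
    locale = "contentlocale:eq:de*,de_DE"
    for i in range(pages):
        q = "q%d" % i
        params.update({
            "resource." + q: "reviews",
            "filter." + q: "productid:eq:%s" % product_id,
            "sort." + q: "submissiontime:desc",
            "stats." + q: "reviews",
            "filteredstats." + q: "reviews",
            "include." + q: "authors,products,comments",
            "filter_reviews." + q: locale,
            "filter_reviewcomments." + q: locale,
            "filter_comments." + q: locale,
            "limit." + q: str(page_size),
            # Assumption: each query starts where the previous one ended
            "offset." + q: str(i * page_size),
            "limit_comments." + q: "3",
        })
    return params


params = build_batch_params("YOUR_PASSKEY", "596141", page_size=100, pages=3)
print(params["offset.q0"], params["offset.q1"], params["offset.q2"])
```

The resulting dict can be passed to `requests.get(url, params=params)` in the same way as the hand-written payload.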

Upvotes: 2

xXliolauXx

Reputation: 1313

Like most modern websites, dm.de seems to load its content through JavaScript after the initial page load. This is a problem because Python's requests library and scrapy only speak HTTP and do not execute any JavaScript.

The same thing happens on Amazon, but there it is detected and you get a JavaScript-free version.

You can try this for yourself by disabling JavaScript in your browser and then opening the site you want to scrape.

Solutions include using a scraper that supports JavaScript or scraping with an automated browser (a full browser supports JavaScript, of course). Selenium with Chromium worked well for me.
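A quick way to check this in code is to look for the review markup in whatever HTML you got back. The sketch below uses only the standard library. The two HTML strings are invented stand-ins for a static response and a browser-rendered page, and the class name `bv-content-list-reviews` is taken from the Selenium answer in this thread, so treat it as an assumption about the current markup.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Record whether any tag carries the given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.found = True


def has_class(html, css_class):
    """Return True if any element in `html` has `css_class`."""
    finder = ClassFinder(css_class)
    finder.feed(html)
    return finder.found


# Hypothetical stand-ins: a script-only response and a rendered page
static_html = '<html><body><div id="app"></div><script src="main.js"></script></body></html>'
rendered_html = '<ol class="bv-content-list-reviews"><li>...</li></ol>'

print(has_class(static_html, "bv-content-list-reviews"))
print(has_class(rendered_html, "bv-content-list-reviews"))
```

If the check fails on `requests.get(url).text` but succeeds on the HTML exported from a browser, the content is being added by JavaScript.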

Upvotes: 3
