Replace scrapy response.body with selenium response

Question

I am trying to crawl the following product page from an online shop with Scrapy: https://www.mediamarkt.de/de/product/_lg-65uk6470plc-2391592.html

The product's properties are listed in a normal HTML table, and some of them are shown only after the "Alle Details einblenden" ("Show all details") button has been clicked. The properties are stored in a JS variable and are preloaded from the beginning. When the button is pressed, a JS function adds the rest of the properties to the table.

Now I am trying to get the full content of the web page and then scrape it completely.

Because I need to use Scrapy's SitemapSpider, I decided to use Selenium to load the page, simulate clicking the button, and then replace the Scrapy response.body with the full rendered content. Afterwards, when the data gets parsed, Scrapy should parse the new properties from the table too. But it doesn't work and I really don't know why. The properties that are shown from the beginning are parsed successfully.

from scrapy.loader import ItemLoader
from selenium import webdriver

chromeDriver = webdriver.Chrome('C:/***/***/chromedriver.exe')  # only for testing

def parse(self, response):
    # Load the page in Chrome and click the "Alle Details einblenden" button
    chromeDriver.get(response.url)
    moreContentButton = chromeDriver.find_element_by_xpath('//div[@class="mms-product-features__more"]/span[@class="mms-link underline"]')
    chromeDriver.execute_script('arguments[0].click();', moreContentButton)

    # Overwrite the Scrapy response body with the rendered page source
    newHTMLBody = chromeDriver.page_source.encode('utf-8')
    response._set_body(newHTMLBody)

    scrapyProductLoader = ItemLoader(item=Product(), response=response)
    scrapyProductLoader.add_xpath('propertiesKeys', '//tr[@class="mms-feature-list__row"]/th[@class="mms-feature-list__dt"]')
    scrapyProductLoader.add_xpath('propertiesValues', '//tr[@class="mms-feature-list__row"]/td[@class="mms-feature-list__dd"]')

I tried response.replace(body=chromeDriver.page_source) instead of response._set_body(newHTMLBody), but that didn't work either; it changes nothing. I know that response.body contains all properties of the product (I checked by writing response.body to an HTML file), but Scrapy only picks up the properties that were visible before the button was clicked (in this example, Betriebssystem: webOS 4.0 (AI ThinQ) is the last entry).

But I need all properties.

Here is a part of the response.body before the ItemLoader is initialized:

Betriebssystem: webOS 4.0 (AI ThinQ)
Prozessor: Quad Core-Prozessor
Energieeffizienzklasse: A

Bild
Bildschirmauflösung: 3.840 x 2.160 Pixel
Bildwiederholungsfrequenz: True Motion 100

Thanks for your attention and your help.

Michael Savchenko · Accepted Answer

You don't need Selenium or anything else to get the desired data from that page. The properties are already embedded in the HTML as a JSON object assigned to window.__PRELOADED_STATE__.

import json

# Capture the JSON literal assigned to window.__PRELOADED_STATE__ in the inline script
text_data = response.css('script').re(r'window\.__PRELOADED_STATE__ = (.+);')[0]

# This dict will contain everything you need.
data = json.loads(text_data)

Selenium is a testing tool. Avoid using it for scraping.
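The same idea can be tried outside a spider with only the standard library. The sketch below is an alternative to the regex, not the code from the answer: it locates the assignment and lets json.JSONDecoder.raw_decode parse exactly one JSON value, ignoring whatever follows it. The function name extract_preloaded_state, the sample HTML, and the "features" key are made up for illustration; the real page's structure will differ, and the marker is assumed to be written with single spaces around the equals sign.

```python
import json

# Assumed exact spelling of the assignment in the page's inline script.
MARKER = "window.__PRELOADED_STATE__ = "


def extract_preloaded_state(html):
    """Return the object assigned to window.__PRELOADED_STATE__, or None."""
    start = html.find(MARKER)
    if start == -1:
        return None
    # raw_decode parses a single JSON value starting at the given index and
    # stops there, so the trailing ';' and later script code are ignored.
    data, _end = json.JSONDecoder().raw_decode(html, start + len(MARKER))
    return data


# Hypothetical stand-in for response.text; keys are invented for the example.
sample_html = (
    "<html><body><script>"
    'window.__PRELOADED_STATE__ = {"features": [{"name": "Prozessor", '
    '"value": "Quad Core-Prozessor"}]}; window.other = 1;'
    "</script></body></html>"
)

state = extract_preloaded_state(sample_html)
```

Inside a Scrapy callback, response.text would take the place of sample_html.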
