Reputation: 3
I try to crawl following product-site from an online-shop with scrapy: https://www.mediamarkt.de/de/product/_lg-65uk6470plc-2391592.html'
The properties of the product are listed in a normal html-table and some of them are getting showed only when the "Alle Details einblenden"-button was clicked. The properties are safed in a js-var and are preloaded from the begining. By pressing the button, a js-function adds the rest of the properties to the table.
Now I try to get the full content of the webpage and then to crawl it completly.
By the reason, that I need to use the SitemapSpider by scrapy, I decided to use selenium to get the content of this site, then to simulate clicking the button and replace the full content with the scrapy response.body. Afterwards, when the data gets parsed, scrapy should parse the new properties from the table too. But it doesn't work and I really don't know why. The properties, which are shown from the beginning, are getting parsed sucessfully.
chromeDriver = webdriver.Chrome('C:/***/***/chromedriver.exe') #only for testing
def parse(self,response):
chromeDriver.get(response.url)
moreContentButton = chromeDriver.find_element_by_xpath('//div[@class="mms-product-features__more"]/span[@class="mms-link underline"]')
chromeDriver.execute_script('arguments[0].click();', moreContentButton)
newHTMLBody = chromeDriver.page_source.encode('utf-8')
response._set_body(newHTMLBody)
scrapyProductLoader = ItemLoader(item=Product(), response=response)
scrapyProductLoader.add_xpath('propertiesKeys', '//tr[@class="mms-feature-list__row"]/th[@class="mms-feature-list__dt"]')
scrapyProductLoader.add_xpath('propertiesValues', '//tr[@class="mms-feature-list__row"]/td[@class="mms-feature-list__dd"]')
I tried the response.replace(body=chromeDriver.page_source) method instead of response._set_body(newHTMLBody), but that doesn't worked. It changes nothing. I know that response.body contains all properties of the product (by creating a html-file containing the response.body), but scrapy adds only the properties of the product before the button was clicked (in this example: Betriebssystem: webOS 4.0 (AI ThinQ) is the last entry).
But I need all properties.
Here is a part of the reponse.body before the ItemLoader got initialized:
<tr class="mms-feature-list__row"><th scope="row" class="mms-feature-list__dt">Betriebssystem</th>
<td class="mms-feature-list__dd">webOS 4.0 (AI ThinQ)</td></tr>
<tr class="mms-feature-list__row"><th scope="row" class="mms-feature-list__dt">Prozessor</th>
<td class="mms-feature-list__dd">Quad Core-Prozessor</td></tr><tr class="mms-feature-list__row">
<th scope="row" class="mms-feature-list__dt">Energieeffizienzklasse</th>
<td class="mms-feature-list__dd">A</td></tr>
</tbody></table></div>
<div class="mms-feature-list mms-feature-list--rich">
<h3 class="mms-headline">Bild</h3>
<table class="mms-feature-list__container">
<tbody><tr class="mms-feature-list__row"><th scope="row" class="mms-feature-list__dt">Bildschirmauflösung</th>
<td class="mms-feature-list__dd">3.840 x 2.160 Pixel</td></tr>
<tr class="mms-feature-list__row"><th scope="row" class="mms-feature-list__dt">Bildwiederholungsfrequenz</th>
<td class="mms-feature-list__dd">True Motion 100</td></tr>
Thanks for your attention and your help.
Upvotes: 0
Views: 961
Reputation: 21261
You could proabley do this
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="Any URL HERE", body=BODY_STRING_HERE, encoding='utf-8')
>>> response.xpath('xpath_here').extract()
Upvotes: 1
Reputation: 1445
You don't need selenium or anything else to get the desired data from the mentioned page.
import json
text_data = response.css('script').re('window.__PRELOADED_STATE__ = (.+);')[0]
# This dict will contain everything you need.
data = json.loads(text_data)
Selenium is a testing tool. Avoid using it for scraping.
Upvotes: 1