Reputation: 390
Intro
I have to add the "Others also bought" items of a certain product link to my crawler. The markup is really confusing me, because there are divs like "open-on-mobile" and "inner generated". What do these mean for me?
Goal
I have already scraped every important piece of information I need, except for the "others also bought" products. After hours of trying, I decided to ask here before I waste more time and get more frustrated.
HTML structure
My code
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import DuifcsvItem
import csv


class DuifSpider(scrapy.Spider):
    name = "duif"
    allowed_domains = ['duif.nl']
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Title_small', 'NL_PL_PC', 'Description']}

    with open("duifonlylinks.csv", "r") as f:
        reader = csv.DictReader(f)
        start_urls = [row['Link'] for row in reader]

    rules = (
        Rule(LinkExtractor(), callback='parse'),
    )

    def parse(self, response):
        card = response.xpath('//div[@class="heading"]')
        if not card:
            print('No productlink', response.url)
            return
        items = DuifcsvItem()
        items['Link'] = response.url
        items['SKU'] = response.xpath('//p[@class="desc"]/text()').get().strip()
        items['Title'] = response.xpath('//h1[@class="product-title"]/text()').get()
        items['Title_small'] = response.xpath('//div[@class="left"]/p/text()').get()
        items['NL_PL_PC'] = response.xpath('//div[@class="desc"]/ul/li/em/text()').getall()
        items['Description'] = response.xpath('//div[@class="item"]/p/text()').getall()
        yield items
Actual Webpage: https://www.duif.nl/product/pot-seal-matt-finish-light-pink-large
It would be perfect if I could access this href with XPath.
XPaths I already tried
>>> response.xpath('//div[@class="title"]/h3/text()').get()
>>> response.xpath('//div[@class="inner generated"]/div//h3/text()').get()
>>> response.xpath('//div[@class="wrap-products"]/div/div/a/@href').get()
>>> response.xpath('/div[@class="description"]/div/h3/text()').get()
>>> response.xpath('//div[@class="open-on-mobile"]/div/div/div/a/@href').get()
>>> response.xpath('//div[@class="product cross-square white"]/a/@href').get()
>>> response.xpath('//a[@class="product-link"]').get()
>>> response.xpath('//a[@class="product-link"]').getall()
Upvotes: 1
Views: 108
Reputation: 10666
You can find the "Others also bought" product ids in this part of the HTML (see the createCrossSellItems call):
<script>
    $(function () {
        createUpsellItems("885034747 | 885034800 | 885034900 |")
        createCrossSellItems("885034347 | 480010600 | 480010700 | 010046700 | 500061967 | 480011000 |")
    })
</script>
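For instance, pulling the ids out of that snippet is a plain regular-expression job. A minimal sketch on the raw script text (the spider below runs the same pattern against the page response instead):

```python
import re

# The <script> content as extracted from the product page
script_text = '''
$(function () {
    createUpsellItems("885034747 | 885034800 | 885034900 |")
    createCrossSellItems("885034347 | 480010600 | 480010700 | 010046700 | 500061967 | 480011000 |")
})
'''

# Grab the argument of createCrossSellItems(...), then split out the numeric ids
raw = re.search(r'createCrossSellItems\("([^"]+)', script_text).group(1)
cross_sell_ids = re.findall(r"\d+", raw)
print(cross_sell_ids)
# ['885034347', '480010600', '480010700', '010046700', '500061967', '480011000']
```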
But adding the details for all these products to your main item is a bit tricky. First you need to decide how you want to save this information (it is one-to-many). It may be a single field OtherAlsoBought where you save a JSON-like structure, for example. Or you can use many fields like OtherAlsoBought_Product_1_Title, OtherAlsoBought_Product_1_Link, OtherAlsoBought_Product_2_Title, OtherAlsoBought_Product_2_Link, etc.
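The two shapes side by side, with placeholder titles and links just for illustration:

```python
# Option 1: one field holding a JSON-like list of dicts.
# Easy to iterate over, but needs serialization for CSV export.
item_nested = {
    'Title': 'Pot Seal matt finish light pink large',
    'OtherAlsoBought': [
        {'Title': 'Related product A', 'Link': 'https://www.duif.nl/product/related-a'},
        {'Title': 'Related product B', 'Link': 'https://www.duif.nl/product/related-b'},
    ],
}

# Option 2: flat numbered fields.
# Maps directly onto CSV columns, but the column set grows with the item count.
item_flat = {
    'Title': 'Pot Seal matt finish light pink large',
    'OtherAlsoBought_Product_1_Title': 'Related product A',
    'OtherAlsoBought_Product_1_Link': 'https://www.duif.nl/product/related-a',
    'OtherAlsoBought_Product_2_Title': 'Related product B',
    'OtherAlsoBought_Product_2_Link': 'https://www.duif.nl/product/related-b',
}

print(len(item_nested['OtherAlsoBought']))
# 2
```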
One possible way to collect these details is to save all the product ids into a list and then yield one id at a time (a simple GET to https://www.duif.nl/api/v2/catalog/product?itemcode=885034347_Parent should work, but with a correct Referer header), while also passing the remaining ids along (using meta or cb_kwargs) so you can get the next id. Of course you also need to pass your main item with each request, so you can add the current product's details to it and yield everything at the end.
UPDATE: You need to add the fields you want to the code below:
import scrapy
import json
import re


class DuifSpider(scrapy.Spider):
    name = "duif"
    start_urls = ['https://www.duif.nl/product/pot-seal-matt-finish-light-pink-large']

    def parse(self, response):
        item = {}
        item['title'] = response.xpath('//h1[@class="product-title"]/text()').get()
        item['url'] = response.url
        item['cross_sell'] = []
        cross_sell_items_raw = response.xpath('//script[contains(., "createCrossSellItems(")]/text()').re_first(r'createCrossSellItems\("([^"]+)')
        # Guard against pages where the script block is missing
        cross_sell_items = re.findall(r"\d+", cross_sell_items_raw) if cross_sell_items_raw else []
        if cross_sell_items:
            cross_sell_item_id = cross_sell_items.pop(0)
            yield scrapy.Request(
                f"https://www.duif.nl/api/v2/catalog/product?itemcode={cross_sell_item_id}_Parent",
                headers={
                    'referer': response.url,
                    'Content-type': 'application/json',
                    'Authorization': 'bearer null',
                    'Accept': '*/*',
                },
                callback=self.parse_cross_sell,
                meta={
                    'item': item,
                    'referer': response.url,
                    'cross_sell_items': cross_sell_items,
                },
            )
        else:
            # There are no "Others also bought" items for this page, just save the main item
            yield item

    def parse_cross_sell(self, response):
        main_item = response.meta["item"]
        cross_sell_items = response.meta["cross_sell_items"]
        data = json.loads(response.text)
        current_cross_sell_item = {}
        current_cross_sell_item['title'] = data["_embedded"]["products"][0]["name"]
        current_cross_sell_item['url'] = data["_embedded"]["products"][0]["url"]
        current_cross_sell_item['description'] = data["_embedded"]["products"][0]["description"]
        main_item['cross_sell'].append(current_cross_sell_item)
        if cross_sell_items:
            cross_sell_item_id = cross_sell_items.pop(0)
            yield scrapy.Request(
                f"https://www.duif.nl/api/v2/catalog/product?itemcode={cross_sell_item_id}_Parent",
                headers={
                    'referer': response.meta['referer'],
                    'Content-type': 'application/json',
                    'Authorization': 'bearer null',
                    'Accept': '*/*',
                },
                callback=self.parse_cross_sell,
                meta={
                    'item': main_item,
                    'referer': response.meta['referer'],
                    'cross_sell_items': cross_sell_items,
                },
            )
        else:
            # No more cross-sell items to process, save the output
            yield main_item
Upvotes: 1