Reputation: 390
Intro
I have to add the "Others also bought" items of a certain product link to my crawler. The markup is really confusing me, because there are divs like "open-on-mobile" and "inner generated". What do these mean for me?
Goal
I have already scraped every important piece of information I need, except for the "others also bought" products. After hours of trying, I decided to ask here before I waste more time and get more frustrated.
HTML structure
My code
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import DuifcsvItem
import csv


class DuifSpider(scrapy.Spider):
    name = "duif"
    allowed_domains = ['duif.nl']
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Title_small', 'NL_PL_PC', 'Description']}

    with open("duifonlylinks.csv", "r") as f:
        reader = csv.DictReader(f)
        start_urls = [row['Link'] for row in reader]

    rules = (
        Rule(LinkExtractor(), callback='parse'),
    )

    def parse(self, response):
        card = response.xpath('//div[@class="heading"]')
        if not card:
            print('No productlink', response.url)
            return
        items = DuifcsvItem()
        items['Link'] = response.url
        items['SKU'] = response.xpath('//p[@class="desc"]/text()').get().strip()
        items['Title'] = response.xpath('//h1[@class="product-title"]/text()').get()
        items['Title_small'] = response.xpath('//div[@class="left"]/p/text()').get()
        items['NL_PL_PC'] = response.xpath('//div[@class="desc"]/ul/li/em/text()').getall()
        items['Description'] = response.xpath('//div[@class="item"]/p/text()').getall()
        yield items
Actual Webpage: https://www.duif.nl/product/pot-seal-matt-finish-light-pink-large
It would be perfect if I could access this href with XPath.
XPaths I already tried
>>> response.xpath('//div[@class="title"]/h3/text()').get()
>>> response.xpath('//div[@class="inner generated"]/div//h3/text()').get()
>>> response.xpath('//div[@class="wrap-products"]/div/div/a/@href').get()
>>> response.xpath('/div[@class="description"]/div/h3/text()').get()
>>> response.xpath('//div[@class="open-on-mobile"]/div/div/div/a/@href').get()
>>> response.xpath('//div[@class="product cross-square white"]/a/@href').get()
>>> response.xpath('//a[@class="product-link"]').get()
>>> response.xpath('//a[@class="product-link"]').getall()
Upvotes: 1
Views: 108
Reputation: 10666
You can find the "Others also bought" product ids in this part of the HTML (see the createCrossSellItems call):
<script>
    $(function () {
        createUpsellItems("885034747 | 885034800 | 885034900 |")
        createCrossSellItems("885034347 | 480010600 | 480010700 | 010046700 | 500061967 | 480011000 |")
    })
</script>
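For instance, pulling the ids out of that snippet is a plain regular-expression job. A minimal sketch on the raw script text (the spider below runs the same pattern against the page response instead):

```python
import re

# The <script> content as extracted from the product page
script_text = '''
$(function () {
    createUpsellItems("885034747 | 885034800 | 885034900 |")
    createCrossSellItems("885034347 | 480010600 | 480010700 | 010046700 | 500061967 | 480011000 |")
})
'''

# Grab the argument of createCrossSellItems(...), then split out the numeric ids
raw = re.search(r'createCrossSellItems\("([^"]+)', script_text).group(1)
cross_sell_ids = re.findall(r"\d+", raw)
print(cross_sell_ids)
# ['885034347', '480010600', '480010700', '010046700', '500061967', '480011000']
```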
But adding the details for all these products to your main item is a bit tricky. First you need to decide how you want to save this information (it is one-to-many). It may be a single field OtherAlsoBought where you save a JSON-like structure, for example. Or you can use many fields like OtherAlsoBought_Product_1_Title, OtherAlsoBought_Product_1_Link, OtherAlsoBought_Product_2_Title, OtherAlsoBought_Product_2_Link, etc.
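The two shapes side by side, with placeholder titles and links just for illustration:

```python
# Option 1: one field holding a JSON-like list of dicts.
# Easy to iterate over, but needs serialization for CSV export.
item_nested = {
    'Title': 'Pot Seal matt finish light pink large',
    'OtherAlsoBought': [
        {'Title': 'Related product A', 'Link': 'https://www.duif.nl/product/related-a'},
        {'Title': 'Related product B', 'Link': 'https://www.duif.nl/product/related-b'},
    ],
}

# Option 2: flat numbered fields.
# Maps directly onto CSV columns, but the column set grows with the item count.
item_flat = {
    'Title': 'Pot Seal matt finish light pink large',
    'OtherAlsoBought_Product_1_Title': 'Related product A',
    'OtherAlsoBought_Product_1_Link': 'https://www.duif.nl/product/related-a',
    'OtherAlsoBought_Product_2_Title': 'Related product B',
    'OtherAlsoBought_Product_2_Link': 'https://www.duif.nl/product/related-b',
}

print(len(item_nested['OtherAlsoBought']))
# 2
```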
One possible way to collect these details is to save all the product ids into a list and then yield one id at a time (a simple GET to https://www.duif.nl/api/v2/catalog/product?itemcode=885034347_Parent should work, but with a correct Referer header), while also passing the remaining ids along (using meta or cb_kwargs) so you can get the next id. Of course you also need to pass your main item with each request, so you can add the current product's details to it and yield everything at the end.
UPDATE: You need to add the fields you want to the code below:
import scrapy
import json
import re


class DuifSpider(scrapy.Spider):
    name = "duif"
    start_urls = ['https://www.duif.nl/product/pot-seal-matt-finish-light-pink-large']

    def parse(self, response):
        item = {}
        item['title'] = response.xpath('//h1[@class="product-title"]/text()').get()
        item['url'] = response.url
        item['cross_sell'] = []
        cross_sell_items_raw = response.xpath('//script[contains(., "createCrossSellItems(")]/text()').re_first(r'createCrossSellItems\("([^"]+)')
        # Guard against pages where the script block is missing
        cross_sell_items = re.findall(r"\d+", cross_sell_items_raw) if cross_sell_items_raw else []
        if cross_sell_items:
            cross_sell_item_id = cross_sell_items.pop(0)
            yield scrapy.Request(
                f"https://www.duif.nl/api/v2/catalog/product?itemcode={cross_sell_item_id}_Parent",
                headers={
                    'referer': response.url,
                    'Content-type': 'application/json',
                    'Authorization': 'bearer null',
                    'Accept': '*/*',
                },
                callback=self.parse_cross_sell,
                meta={
                    'item': item,
                    'referer': response.url,
                    'cross_sell_items': cross_sell_items,
                },
            )
        else:
            # There are no "Others also bought" items for this page, just save the main item
            yield item

    def parse_cross_sell(self, response):
        main_item = response.meta["item"]
        cross_sell_items = response.meta["cross_sell_items"]
        data = json.loads(response.text)
        current_cross_sell_item = {}
        current_cross_sell_item['title'] = data["_embedded"]["products"][0]["name"]
        current_cross_sell_item['url'] = data["_embedded"]["products"][0]["url"]
        current_cross_sell_item['description'] = data["_embedded"]["products"][0]["description"]
        main_item['cross_sell'].append(current_cross_sell_item)
        if cross_sell_items:
            cross_sell_item_id = cross_sell_items.pop(0)
            yield scrapy.Request(
                f"https://www.duif.nl/api/v2/catalog/product?itemcode={cross_sell_item_id}_Parent",
                headers={
                    'referer': response.meta['referer'],
                    'Content-type': 'application/json',
                    'Authorization': 'bearer null',
                    'Accept': '*/*',
                },
                callback=self.parse_cross_sell,
                meta={
                    'item': main_item,
                    'referer': response.meta['referer'],
                    'cross_sell_items': cross_sell_items,
                },
            )
        else:
            # No more cross-sell items to process, save the output
            yield main_item
Upvotes: 1