Nini

Reputation: 13

SCRAPY: Every time my spider crawls, it scrapes the same (first) page

I have written code to scrape a page using Scrapy in Python; the main.py code is pasted below. But whenever I run my spider, it scrapes only the first page (DEBUG: Scraped from <200 https://www.tuscc.si/produkti/instant-juhe>), which is also the Referer in the request headers (when inspected).

I have tried adding the data from the "Request Payload" field, which is: {"action":"loadList","skip":64,"filter":{"1005":[],"1006":[],"1007":[],"1009":[],"1013":[]}}. When I open the page with it rewritten like this:

https://www.tuscc.si/produkti/instant-juhe#32;'action':'loadList';'skip':'32';'sort':'none'

), the browser opens it, but the scrapy shell doesn't. I have also tried adding the number from the Request URL https://www.tuscc.si/cache/script/tuscc.js?1563872492384 (the query string parameter 1563872492384), but it still won't scrape the requested page.

I have also tried many variations and additions that I read about online, just to see if there would be any progress, but with no luck so far.
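For context, everything after the # in those URLs is a fragment, which browsers never send to the server, so the server always sees plain https://www.tuscc.si/produkti/instant-juhe; the extra products come from a POST request carrying the JSON payload above. A minimal sketch of replaying that request in the scrapy shell (the headers are an assumption, taken from the browser request):

import json, scrapy

# run inside `scrapy shell`; fetch() is a shell helper
payload = {"action": "loadList", "skip": 64,
           "filter": {"1005": [], "1006": [], "1007": [], "1009": [], "1013": []}}
req = scrapy.Request(
    url='https://www.tuscc.si/produkti/instant-juhe',
    method='POST',
    body=json.dumps(payload),
    headers={'Content-Type': 'application/json; charset=UTF-8',
             'X-Requested-With': 'XMLHttpRequest'},
)
fetch(req)  # response.text should now hold the JSON skipping the first 64 products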

The code is:

from scrapy.spiders import CrawlSpider
from tus_pomos.items import TusPomosItem
from tus_pomos.scrapy_splash import SplashRequest


class TusPomosSpider(CrawlSpider):
    name = 'TUSP'
    allowed_domains = ['www.tuscc.si']
    start_urls = ["https://www.tuscc.si/produkti/instant-juhe#0;1563872492384;",
                  "https://www.tuscc.si/produkti/instant-juhe#64;1563872492384;", ]
    download_delay = 5.0

    def start_requests(self):
        # payload = [
        #     {"action": "loadList",
        #      "skip": 0,
        #      "filter": {
        #          "1005": [],
        #          "1006": [],
        #          "1007": [],
        #          "1009": [],
        #          "1013": []}
        #      }]
        for url in self.start_urls:
            r = SplashRequest(url, self.parse, magic_response=False, dont_filter=True, endpoint='render.json', meta={
                'original_url': url,
                'dont_redirect': True},
                              args={
                                  'wait': 2,
                                  'html': 1
                              })
            r.meta['dont_redirect'] = True
            yield r

    def parse(self, response):
        items = TusPomosItem()
        pro = response.css(".thumb-box")
        for p in pro:
            pro_link = p.css("a::attr(href)").extract_first()
            pro_name = p.css(".description::text").extract_first()
            items['pro_link'] = pro_link
            items['pro_name'] = pro_name
            yield items

In conclusion, I want to crawl all the pages in the pagination, for example this page (I also tried it with the command scrapy shell url):

https://www.tuscc.si/produkti/instant-juhe#64;1563872492384;

But the response is always the first page, which gets scraped repeatedly:

https://www.tuscc.si/produkti/instant-juhe

I would be grateful if you could help me. Thanks!


THE PARSE_DETAIL GENERATOR FUNCTION

def parse_detail(self, response):
    items = TusPomosItem()
    pro = response.css(".thumb-box")
    for p in pro:
        pro_link = p.css("a::attr(href)").extract_first()
        pro_name = p.css(".description::text").extract_first()
        items['pro_link'] = pro_link
        items['pro_name'] = pro_name
        my_details = {
            'pro_link': pro_link,
            'pro_name': pro_name
        }
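        # NB: 'w' mode truncates pro_file.json on every iteration, so only the
        # last product is kept; this also needs `import json` at module level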
        with open('pro_file.json', 'w') as json_file:
            json.dump(my_details, json_file)

        yield items
        # yield scrapy.FormRequest(
        #     url='https://www.tuscc.si/produkti/instant-juhe',
        #     callback=self.parse_detail,
        #     method='POST',
        #     headers=self.headers
        #     )

Here I am not sure whether I should assign my "items" variable the way it is, or build it from response.body. Also, should the yield stay as it is, or should I change it to a Request (the commented-out part is largely copied from the answer code given)?
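On the json.dump question: rather than writing the file by hand inside the loop, Scrapy's built-in feed exports will collect every yielded item into one file, for example:

scrapy crawl TUSP -o pro_file.json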

I am new here, so thanks for the understanding!

Upvotes: 1

Views: 652

Answers (1)

Wim Hermans

Reputation: 2116

Instead of using Splash to render the pages, it's probably more efficient to get the data from the underlying requests the page makes. The code below goes through all pages of articles. In parse_detail, the response is loaded as JSON, in which you can find the 'pro_link' and 'pro_name' of each product.

import scrapy
import json
from scrapy.spiders import Spider
from ..items import TusPomosItem


class TusPomosSpider(Spider):
    name = 'TUSP'
    allowed_domains = ['tuscc.si']
    start_urls = ["https://www.tuscc.si/produkti/instant-juhe"]
    download_delay = 5.0

    headers = {
        'Origin': 'https://www.tuscc.si',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-GB,en;q=0.9,nl-BE;q=0.8,nl;q=0.7,ro-RO;q=0.6,ro;q=0.5,en-US;q=0.4',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
        'Content-Type': 'application/json; charset=UTF-8',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'X-Requested-With': 'XMLHttpRequest',
        'Connection': 'keep-alive',
        'Referer': 'https://www.tuscc.si/produkti/instant-juhe',
    }

    def parse(self, response):
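        # read the total page count and the page size from the pagination widget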
        number_of_pages = int(response.xpath(
            '//*[@class="paginationHolder"]//@data-size').extract_first())
        number_per_page = int(response.xpath(
            '//*[@name="pageSize"]/*[@selected="selected"]/text()').extract_first())

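        # replay the site's 'loadList' XHR once per page, increasing 'skip' each time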
        for page_number in range(0, number_of_pages):
            skip = number_per_page * page_number
            data = {"action": "loadList",
                    "filter": {"1005": [], "1006": [], "1007": [], "1009": [],
                               "1013": []},
                    "skip": str(skip),
                    "sort": "none"
                    }
            yield scrapy.Request(
                url='https://www.tuscc.si/produkti/instant-juhe',
                callback=self.parse_detail,
                method='POST',
                body=json.dumps(data),
                headers=self.headers
                )

    def parse_detail(self, response):
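        # the endpoint answers with JSON; each entry in 'docs' is one product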
        detail_page = json.loads(response.text)
        for product in detail_page['docs']:
            item = TusPomosItem()
            item['pro_link'] = product['url']
            item['pro_name'] = product['title']
            yield item
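
For completeness, a minimal items.py matching the two fields used above would look something like this:

import scrapy


class TusPomosItem(scrapy.Item):
    # fields populated in parse_detail
    pro_link = scrapy.Field()
    pro_name = scrapy.Field()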

Upvotes: 1
