H.B.

Reputation: 63

No internal method call with scrapy

I'm using Scrapy to crawl a website. The first request works fine and collects some data. Every subsequent request needs information from an earlier response. To simplify the program, I separated the different requests into different method calls. But it seems that Scrapy does not execute these internal method calls: none of the sub-calls ever runs.

I tried already a few different things:

  1. Called an instance method with self.sendQueryHash(response, tagName, afterHash)

  2. Called a static method with sendQueryHash(response, tagName, afterHash) and adjusted the indentation accordingly

  3. Removed the method call and inlined the code; that worked, and I saw the sendQueryHash output in the logger.

import scrapy
import re
import json
import logging
import time  # needed for time.time() below

class TestpostSpider(scrapy.Spider):
    name = 'testPost'
    allowed_domains = ['test.com']

    tags = [
        "this",
        "that",
    ]

    def start_requests(self):
        requests = []
        for i, value in enumerate(self.tags):
            url = "https://www.test.com/{}/".format(value)
            requests.append(scrapy.Request(
                        url,
                        meta={'cookiejar': i},
                        callback=self.parsefirstAccess))
        return requests

    def parsefirstAccess(self, response):
        self.logger.info("parsefirstAccess")
        jsonData = response.text

        # That call works fine
        tagName, hasNext, afterHash = self.extractFirstNextPageData(jsonData)
        yield {
                'json':jsonData,
                'requestTime':int(round(time.time() * 1000)),
                'requestNumber':0
        }

        if not hasNext:
            self.logger.info("hasNext is false")
            # No more data available stop processing
            return
        else:
            self.logger.info("hasNext is true")
            # Send request to get the query hash of the current tag
            self.sendQueryHash(response, tagName, afterHash) # Problem occurs here

    ## 3.
    def sendQueryHash(self, response, tagName, afterHash):
        self.logger.info("sendQueryHash")
        request = scrapy.Request(
            "https://www.test.com/static/bundles/es6/TagPageContainer.js/21d3cb18e725.js",
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parseQueryHash,
            dont_filter=True)
        request.cb_kwargs['tagName'] = tagName
        request.cb_kwargs['afterHash'] = afterHash
        yield request

    def extractFirstNextPageData(self, json):
        return "data1", True, "data3"

I expect the sendQueryHash output to be shown, but it never happens. It only appears when I comment out the self.sendQueryHash call and the def sendQueryHash line (so the code runs inline).

That's only one example of this behavior that I don't expect.

Upvotes: 0

Views: 248

Answers (1)

tomjn

Reputation: 5389

self.sendQueryHash(response, tagName, afterHash) # Problem occurs here

will just create a generator object that you then do nothing with. You need to make sure the Request is yielded back to the Scrapy engine. Since sendQueryHash produces only a single request, you can use return instead of yield inside it, and then directly yield the Request by replacing the line above with

yield self.sendQueryHash(response, tagName, afterHash)
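The underlying Python rule: calling a function that contains `yield` never runs its body; it only creates a generator object, which does nothing until iterated. A minimal sketch of the pitfall, independent of Scrapy (the function name and string are made up for illustration):

```python
import types

def send_query(tag):
    # Because this function contains `yield`, calling it only
    # creates a generator object; the body does not execute yet.
    yield "request for " + tag

result = send_query("this")
print(isinstance(result, types.GeneratorType))  # True: nothing has run

# Iterating the generator is what actually executes the body:
items = list(send_query("this"))
print(items)  # ['request for this']
```

Alternatively, `yield from self.sendQueryHash(response, tagName, afterHash)` in parsefirstAccess would also work without changing sendQueryHash at all, since `yield from` delegates iteration to the inner generator and passes its items on to the Scrapy engine.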

Upvotes: 1
