Reputation: 63
I'm using Scrapy to crawl a website. The first request seems fine and collects some data. For every subsequent request I need some information from another request. To simplify the code, I split the different requests into separate methods. But it seems that Scrapy does not support method calls with extra parameters: none of the sub-calls is ever executed.
I have already tried a few different things:
Called an instance method with self.sendQueryHash(response, tagName, afterHash)
Called a static method with sendQueryHash(response, tagName, afterHash) and changed the indentation
Removed the method call, and then it worked: I saw the sendQueryHash output in the log.
import scrapy
import re
import json
import logging
import time  # needed for time.time() in parsefirstAccess


class TestpostSpider(scrapy.Spider):
    name = 'testPost'
    allowed_domains = ['test.com']
    tags = [
        "this",
        "that",
    ]

    def start_requests(self):
        requests = []
        for i, value in enumerate(self.tags):
            url = "https://www.test.com/{}/".format(value)
            requests.append(scrapy.Request(
                url,
                meta={'cookiejar': i},
                callback=self.parsefirstAccess))
        return requests

    def parsefirstAccess(self, response):
        self.logger.info("parsefirstAccess")
        jsonData = response.text
        # That call works fine
        tagName, hasNext, afterHash = self.extractFirstNextPageData(jsonData)
        yield {
            'json': jsonData,
            'requestTime': int(round(time.time() * 1000)),
            'requestNumber': 0
        }
        if not hasNext:
            self.logger.info("hasNext is false")
            # No more data available, stop processing
            return
        else:
            self.logger.info("hasNext is true")
            # Send request to get the query hash of the current tag
            self.sendQueryHash(response, tagName, afterHash)  # Problem occurs here

    ## 3.
    def sendQueryHash(self, response, tagName, afterHash):
        self.logger.info("sendQueryHash")
        request = scrapy.Request(
            "https://www.test.com/static/bundles/es6/TagPageContainer.js/21d3cb18e725.js",
            meta={'cookiejar': response.meta['cookiejar']},
            callback=self.parseQueryHash,
            dont_filter=True)
        request.cb_kwargs['tagName'] = tagName
        request.cb_kwargs['afterHash'] = afterHash
        yield request

    def extractFirstNextPageData(self, json):
        return "data1", True, "data3"
I expect the sendQueryHash output to be shown, but it never is. It only appears when I comment out the lines self.sendQueryHash and def sendQueryHash.
That's only one example of the behavior that I don't expect.
Upvotes: 0
Views: 248
Reputation: 5389
self.sendQueryHash(response, tagName, afterHash) # Problem occurs here
will just create a generator that you do nothing with. Because sendQueryHash contains a yield, calling it only builds a generator object; its body, including the logger call, does not run until something iterates over it. You need to make sure you yield your Request back to the Scrapy engine. Since only a single request is returned, you should be able to use return instead of yield in sendQueryHash and then directly yield the Request by replacing the above line with
yield self.sendQueryHash(response, tagName, afterHash)
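The generator behavior described above can be reproduced without Scrapy. The following is a minimal plain-Python sketch, not the asker's spider: the helper and callback names are hypothetical, and strings stand in for Request objects. It also shows `yield from` as an alternative that keeps the helper as a generator.

```python
# Plain-Python sketch; all names are hypothetical and strings stand in for scrapy.Request.
log = []

def helper_gen(tag):
    # Contains `yield`, so calling it only builds a generator object.
    log.append("helper_gen ran for " + tag)
    yield "request:" + tag

def helper_ret(tag):
    # Ordinary function: the body runs immediately and returns one value.
    log.append("helper_ret ran for " + tag)
    return "request:" + tag

def callback_broken(tag):
    yield "item"
    helper_gen(tag)  # generator is created and discarded; its body never runs

def callback_yield_from(tag):
    yield "item"
    yield from helper_gen(tag)  # delegate: re-yield everything the helper yields

def callback_return(tag):
    yield "item"
    yield helper_ret(tag)  # the answer's approach: helper returns, caller yields

broken = list(callback_broken("this"))
fixed_a = list(callback_yield_from("this"))
fixed_b = list(callback_return("this"))
```

In the broken variant the helper's log line is never recorded, which matches the missing sendQueryHash output in the question. If a helper might later yield more than one request, `yield from` is probably the more flexible of the two fixes.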
Upvotes: 1