Reputation: 36
I'm using scrapy.Spider to scrape, and I want to make a request from inside my callback function (the callback set in start_requests). That request doesn't work: it should return a response, but it only returns a Request object.
Stepping through with the debugger, I found that in class Request(object_ref) the request only finishes initialization. Execution never reaches request = next(slot.start_requests), where I expected the request to actually start, so I only get a Request back.
Here is my code in brief:
class ProjSpider(scrapy.Spider):
    name = 'Proj'
    allowed_domains = ['mashable.com']

    def start_requests(self):
        # pages
        pages = 10
        for i in range(1, pages):
            url = "https://mashable.com/channeldatafeed/Tech/new/page/"+str(i)
            yield scrapy.Request(url, callback=self.parse_mashable)
This request works fine. The callback follows:
    def parse_mashable(self, response):
        item = Item()
        json2parse = response.text
        json_response = json.loads(json2parse)
        d = json_response['dataFeed'] # a list containing dicts, in which there is url for detailed article
        for data in d:
            item_url = data['url'] # the url for detailed article
            item_response = self.get_response_mashable(item_url)
            # here I want to parse the item_response to get detail
            item['content'] = item_response.xpath("//body").get
            yield item

    def get_response_mashable(self,url):
        response = scrapy.Request(url)
        # using self.parser. I've also defined my own parser and yield an item
        # but the problem is it never got to callback
        return response # tried yield also but failed
This is where the Request doesn't work. The URL is in allowed_domains, and it's not a duplicate URL. I'm guessing it's because of Scrapy's asynchronous request mechanism, but how could that affect the request in self.parse_mashable? By then the Request from start_requests has already finished. I managed to make the second request with Python's requests-html instead, but I still couldn't figure out why the Scrapy version fails.
Could anyone help point out what I'm doing wrong? Thanks in advance!
Upvotes: 0
Views: 579
Reputation: 28246
Scrapy doesn't really expect you to do this the way you're trying to, so it doesn't have a simple way to do it.
What you should do instead is yield a new request from the callback and pass the data you've scraped from the original page to the new callback using the request's meta dict.
For details, check "Passing additional data to callback functions" in the Scrapy documentation.
Upvotes: 1