Reputation: 74
I am working on a project that is divided into two parts:
For the second point, and to follow Scrapy's asynchronous philosophy, where should such code be placed? (I hesitate between the spider and a pipeline.) Do we have to use different libraries like asyncio and aiohttp to achieve this asynchronously? (I love aiohttp, so using it is not a problem.)
Thank you
Upvotes: 0
Views: 887
Reputation: 74
I recently ran into the same problem (again) and found an elegant solution using the Twisted decorator twisted.internet.defer.inlineCallbacks.
# -*- coding: utf-8 -*-
import scrapy
import re

from twisted.internet.defer import inlineCallbacks

from sherlock import utils, items, regex


class PagesSpider(scrapy.spiders.SitemapSpider):
    name = 'pages'
    allowed_domains = ['thing.com']
    sitemap_follow = [r'sitemap_page']

    def __init__(self, site=None, *args, **kwargs):
        super(PagesSpider, self).__init__(*args, **kwargs)

    @inlineCallbacks
    def parse(self, response):
        # things
        request = scrapy.Request("https://google.com")
        # Twisted executes the request and resumes the generator here
        # once the response is available
        response = yield self.crawler.engine.download(request, self)
        print(response.text)
Upvotes: 0
Reputation: 28236
Since you're doing this to fetch additional information about an item, I'd just yield a request from the parsing method, passing the already scraped information in the meta attribute.
You can see an example of this at https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
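For illustration, here is a minimal sketch of that pattern; the spider name, selectors, and URLs are hypothetical placeholders, not part of the original answer:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/items']

    def parse(self, response):
        # Scrape whatever is available on the first page
        item = {'title': response.css('h1::text').get()}
        # Fetch the additional information, carrying the partial item along in meta
        yield scrapy.Request(
            'https://example.com/extra-info',
            callback=self.parse_extra,
            meta={'item': item},
        )

    def parse_extra(self, response):
        # Pick the partial item back up, complete it, and yield it
        item = response.meta['item']
        item['extra'] = response.css('.extra::text').get()
        yield item

In newer Scrapy versions, cb_kwargs can be used instead of meta to pass scraped data to the callback, which is what the linked documentation section now describes.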
This can also be done in a pipeline (either using Scrapy's engine API or a different library, e.g. treq).
I do, however, think that doing it "the normal way" from the spider makes more sense in this instance.
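For completeness, a rough sketch of the pipeline variant using treq; the pipeline class, the endpoint URL, and the 'extra' field are made-up placeholders, and Scrapy waits for the Deferred returned by process_item before passing the item on:

import treq
from twisted.internet.defer import inlineCallbacks, returnValue


class ExtraInfoPipeline:
    @inlineCallbacks
    def process_item(self, item, spider):
        # Hypothetical endpoint; treq issues the request without blocking the reactor
        response = yield treq.get('https://example.com/api/extra')
        item['extra'] = yield response.text()
        returnValue(item)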
Upvotes: 1