Reputation: 74
I am working on a project that is divided into two parts:
For the second point, and to follow Scrapy's asynchronous philosophy, where should such code be placed? (I hesitate between the spider and a pipeline.) Do we have to use different libraries like asyncio and aiohttp to achieve this asynchronously? (I love aiohttp, so using it is not a problem.)
Thank you
Upvotes: 0
Views: 887
Reputation: 74
I recently ran into the same problem (again) and found an elegant solution using the Twisted decorator twisted.internet.defer.inlineCallbacks.
# -*- coding: utf-8 -*-
import scrapy
import re

from twisted.internet.defer import inlineCallbacks

from sherlock import utils, items, regex


class PagesSpider(scrapy.spiders.SitemapSpider):
    name = 'pages'
    allowed_domains = ['thing.com']
    sitemap_follow = [r'sitemap_page']

    def __init__(self, site=None, *args, **kwargs):
        super(PagesSpider, self).__init__(*args, **kwargs)

    @inlineCallbacks
    def parse(self, response):
        # things
        request = scrapy.Request("https://google.com")
        # Twisted executes the request and resumes the generator here
        # once the response is available
        response = yield self.crawler.engine.download(request, self)
        print(response.text)
Upvotes: 0
Reputation: 28236
Since you're doing this to fetch additional information about an item, I'd just yield a request from the parsing method, passing the already scraped information in the meta attribute.
You can see an example of this at https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
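For illustration, here is a minimal sketch of that pattern; the spider name, selectors, and URLs are hypothetical placeholders, not part of the original answer:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/items']

    def parse(self, response):
        # Scrape whatever is available on the first page
        item = {'title': response.css('h1::text').get()}
        # Fetch the additional information, carrying the partial item along in meta
        yield scrapy.Request(
            'https://example.com/extra-info',
            callback=self.parse_extra,
            meta={'item': item},
        )

    def parse_extra(self, response):
        # Pick the partial item back up, complete it, and yield it
        item = response.meta['item']
        item['extra'] = response.css('.extra::text').get()
        yield item

In newer Scrapy versions, cb_kwargs can be used instead of meta to pass scraped data to the callback, which is what the linked documentation section now describes.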
This can also be done in a pipeline (either using Scrapy's engine API or a different library, e.g. treq).
I do, however, think that doing it "the normal way" from the spider makes more sense in this instance.
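For completeness, a rough sketch of the pipeline variant using treq; the pipeline class, the endpoint URL, and the 'extra' field are made-up placeholders, and Scrapy waits for the Deferred returned by process_item before passing the item on:

import treq
from twisted.internet.defer import inlineCallbacks, returnValue


class ExtraInfoPipeline:
    @inlineCallbacks
    def process_item(self, item, spider):
        # Hypothetical endpoint; treq issues the request without blocking the reactor
        response = yield treq.get('https://example.com/api/extra')
        item['extra'] = yield response.text()
        returnValue(item)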
Upvotes: 1