DjangoPy

Reputation: 865

Scrapy: Pass data between 2 spiders

I need to create a spider that crawls some data from a web site. Part of the data is an external URL.

I already created the spider that crawls the data from the root site, and now I want to write the spider for the external web pages.

I was thinking of creating a CrawlSpider that uses SgmlLinkExtractor to follow some specific links on each external web page.
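For example, a minimal sketch of such a CrawlSpider, assuming the old-style scrapy.contrib import paths and a purely hypothetical /details/ link pattern:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class ExternalSpider(CrawlSpider):
    name = "external"

    # hypothetical rule: follow only links whose URL matches /details/
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/details/',)), callback='parse_page'),
    )

    def parse_page(self, response):
        # extract whatever data you need from each followed page
        ...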

What is the recommended way to communicate the list of start_urls to the second spider?

My idea is to export the items to a JSON file and to read the URL attribute from it in start_requests of the second spider.
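For example, if the first spider's items are exported as JSON Lines (e.g. scrapy crawl first -o items.jl -t jsonlines) and each item carries a hypothetical external_url field, the second spider's start_requests could look like this sketch:

import json

from scrapy.spider import BaseSpider
from scrapy.http import Request


class ExternalSpider(BaseSpider):
    name = "external"

    def start_requests(self):
        # one JSON item per line, written by the first spider's run
        with open('items.jl') as f:
            for line in f:
                item = json.loads(line)
                yield Request(item['external_url'], callback=self.parse)

    def parse(self, response):
        # parse the external page here
        ...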

Upvotes: 1

Views: 1548

Answers (2)

warvariuc

Reputation: 59664

I already created the spider that crawls the data from the root site, and now I want to write the spider for the external web pages.

Save these external page URLs to a DB.
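A minimal sketch of such a pipeline, assuming SQLite and a hypothetical external_url item field (it would still need to be enabled in ITEM_PIPELINES):

import sqlite3


class ExternalUrlPipeline(object):

    def open_spider(self, spider):
        self.conn = sqlite3.connect('urls.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS ext_urls (url TEXT UNIQUE)')

    def process_item(self, item, spider):
        # store each item's external URL, skipping duplicates
        self.conn.execute('INSERT OR IGNORE INTO ext_urls VALUES (?)',
                          (item['external_url'],))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()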

What is the recommended way to communicate the list of start_urls to the second spider?

Override BaseSpider.start_requests in your other spider and create requests from the URLs you get from the DB.
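Something along these lines, reading back from the same hypothetical SQLite table:

import sqlite3

from scrapy.spider import BaseSpider
from scrapy.http import Request


class ExternalSpider(BaseSpider):
    name = "external"

    def start_requests(self):
        conn = sqlite3.connect('urls.db')  # the DB the first run wrote
        for (url,) in conn.execute('SELECT url FROM ext_urls'):
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # parse the external page here
        ...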

Upvotes: 2

user1460015

Reputation: 2003

The question is pretty vague, but here is one way to do it:

from scrapy.spider import BaseSpider
from scrapy.http import Request


class PracticeSpider(BaseSpider):
    name = "project_name"
    allowed_domains = ["my_domain.org"]

    def start_requests(self):
        start_url = "The First Page URL"
        return [Request(start_url, callback=self.parse)]

    def parse(self, response):
        # parse the first page and yield its item
        yield self.pageParser(response)

        # grab the external URLs you want to follow
        ext_urls = ...

        for url in ext_urls:
            yield Request(url, callback=self.pageParser)

    def pageParser(self, response):
        # parse the page and return the populated item
        return item

There is also a meta attribute on Request that might be of some help.
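For instance, meta can carry a partly filled item from the root page into the callback that parses the external page (MyItem and its fields are hypothetical):

    def parse(self, response):
        item = MyItem()
        item['root_data'] = ...   # data scraped from the root site
        external_url = ...        # URL extracted from the root page
        # hand the partly filled item to the next callback via meta
        yield Request(external_url, callback=self.parse_external,
                      meta={'item': item})

    def parse_external(self, response):
        # retrieve the item and fill in the rest
        item = response.meta['item']
        item['external_data'] = ...
        return item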

Upvotes: 0
