sphinks

Reputation: 3128

Get results of Scrapy spiders in variable

I'm trying to run a Scrapy spider and an SDK call to another resource inside Django. The main idea is to collect the results from both of them in one list once they are ready and output it to a view. The SDK works synchronously, so there are no issues there. But I cannot get the results out of the spider. Could anyone point me to the correct solution?

My code to run the parsers looks like this:

from multiprocessing import Process

from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks, returnValue


class scrapyParser(Parser):
    def __init__(self, keywords=None, n_items=None):
        super().__init__(keywords, n_items)

    def parse(self):
        result = []
        if not super().parse():
            return False

        # Run the crawl in a child process and block until it finishes.
        crawler = UrlCrawlerScript(Parser1, result, [BASE_PATH + self.keywords])
        crawler.start()
        crawler.join()

        print(crawler.outputResponse)

        return result[:self.n_items]


class UrlCrawlerScript(Process):
    def __init__(self, spider, result, urls):
        Process.__init__(self)
        settings = get_project_settings()

        self.crawler = Crawler(spider, settings=settings)
        # Stop the reactor once the spider closes.
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider
        self.urls = urls
        self.outputResponse = result

    @inlineCallbacks
    def cycle_run(self):
        # The spider class was already given to Crawler(), so crawl()
        # only needs the spider arguments.
        yield self.crawler.crawl(outputResponse=self.outputResponse, start_urls=self.urls)
        returnValue(self.outputResponse)

    def run(self):
        result = self.cycle_run()
        result.addCallback(print)
        reactor.run()

The spider code is very simple and follows this template:

import scrapy

class Parser1(scrapy.Spider):
    name = 'items'
    allowed_domains = ['domain.com']

    def parse(self, response):
        ...
        # parsing page
        for item in row_data:
            scraped_info = {
                ...
            }
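            # outputResponse arrives via the crawl() kwargs and becomes
            # a spider attribute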
            self.outputResponse.append(scraped_info)

So I cannot get anything in the output of parse: it returns an empty list. However, I'm at the very beginning of my journey with async calls in Python and the Twisted framework, so it's highly possible I've just messed something up.
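My guess (I haven't verified this) is that UrlCrawlerScript runs the crawl in a child process, so the spider appends to the child's own copy of result and the parent's list stays empty. A minimal sketch of that effect, independent of Scrapy, contrasting a plain list with a multiprocessing.Manager proxy list (all names here are illustrative, not from my project):

import multiprocessing

def work(shared, plain):
    shared.append('item')   # proxy list: the append reaches the parent
    plain.append('item')    # plain list: only the child's copy changes

if __name__ == '__main__':
    with multiprocessing.Manager() as manager:
        shared = manager.list()
        plain = []
        p = multiprocessing.Process(target=work, args=(shared, plain))
        p.start()
        p.join()
        print(list(shared))  # ['item']
        print(plain)         # [] - the parent's copy was never touched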

Upvotes: 1

Views: 861

Answers (1)

sphinks

Reputation: 3128

After trying a lot of different code snippets and looking through SO answers, I finally found an easy and elegant solution: the scrapyscript library (pip install scrapyscript).

from scrapyscript import Job, Processor


class scrapyParser(Parser):
    def __init__(self, keywords=None, n_items=None):
        super().__init__(keywords, n_items)

    def parse(self):
        result = []
        if not super().parse():
            return False

        # Each Job wraps a spider class plus its constructor kwargs;
        # Processor.run() blocks until all jobs finish and returns the
        # scraped items as a single list.
        processor = Processor(settings=None)

        job1 = Job(Parser1, url=URL1 + self.keywords)
        job2 = Job(Parser2, url=URL2 + self.keywords)
        return processor.run([job1, job2])
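
One thing to note: scrapyscript collects the items the spiders yield, so Parser1 and Parser2 should yield their scraped dicts instead of appending to an outputResponse attribute. A rough sketch of what such a spider could look like (the url handling and the selector are placeholders, not my actual parsing code):

import scrapy

class Parser1(scrapy.Spider):
    name = 'parser1'

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # scrapyscript passes the Job's kwargs to the spider constructor
        self.start_urls = [url] if url else []

    def parse(self, response):
        for row in response.css('div.item'):  # placeholder selector
            yield {'title': row.css('::text').get()}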

Source: https://stackoverflow.com/a/62902603/1345788

Upvotes: 1
