Reputation: 3128
I'm trying to run a Scrapy spider and an SDK call to another resource inside Django. The main idea is to collect results from both of them in one list once they are ready and output it to a view. The SDK works synchronously, so there are no issues with it. But I cannot get any results out of the spider. Could anyone point me to the correct solution?
My code to run the parsers looks like this:
from multiprocessing import Process

from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks, returnValue

class scrapyParser(Parser):
    def __init__(self, keywords=None, n_items=None):
        super().__init__(keywords, n_items)

    def parse(self):
        result = []
        if not super().parse():
            return False

        # Run the crawler in a separate process and wait for it to finish.
        crawler = UrlCrawlerScript(Parser1, result, [BASE_PATH + self.keywords])
        crawler.start()
        crawler.join()
        print(crawler.outputResponse)
        return result[:self.n_items]

class UrlCrawlerScript(Process):
    def __init__(self, spider, result, urls):
        Process.__init__(self)
        settings = get_project_settings()
        self.crawler = Crawler(spider, settings=settings)
        # Stop the reactor once the spider closes.
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider
        self.urls = urls
        self.outputResponse = result

    @inlineCallbacks
    def cycle_run(self):
        # The Crawler already knows the spider class; pass only the spider kwargs.
        yield self.crawler.crawl(outputResponse=self.outputResponse, start_urls=self.urls)
        returnValue(self.outputResponse)

    def run(self):
        result = self.cycle_run()
        result.addCallback(print)
        reactor.run()
The spider code is very simple and follows this template:
import scrapy

class Parser1(scrapy.Spider):
    name = 'items'
    allowed_domains = ['domain.com']

    def parse(self, response):
        ...
        # parsing page
        for item in row_data:
            scraped_info = {
                ...
            }
            # outputResponse is the list passed in as a spider kwarg
            self.outputResponse.append(scraped_info)
So I cannot get anything in the output of parse; it returns an empty list. However, I'm at the very beginning of my journey with async calls in Python and the Twisted framework, so it's quite possible I've just messed something up.
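One likely culprit, independent of Scrapy itself: `multiprocessing.Process` runs the crawler in a separate Python process with its own memory, so appends made to the list inside the child never show up in the parent's copy. A minimal stdlib sketch that reproduces the symptom (the `append_item` helper is hypothetical, just for illustration):

```python
from multiprocessing import Process

def append_item(bucket):
    # Runs in the child process: it mutates the child's copy of the list.
    bucket.append("scraped")

if __name__ == "__main__":
    result = []
    p = Process(target=append_item, args=(result,))
    p.start()
    p.join()
    print(result)  # stays [] in the parent process
```

Sharing results across processes would need something like `multiprocessing.Manager().list()` or a `Queue` instead of a plain list.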
Upvotes: 1
Views: 861
Reputation: 3128
After trying a lot of different code snippets and looking through SO answers, I finally found an easy and elegant solution: scrapyscript.
from scrapyscript import Job, Processor

class scrapyParser(Parser):
    def __init__(self, keywords=None, n_items=None):
        super().__init__(keywords, n_items)

    def parse(self):
        result = []
        if not super().parse():
            return False

        # Each Job wraps a spider class; Processor.run blocks until
        # all jobs finish and returns the scraped items as a list.
        processor = Processor(settings=None)
        job1 = Job(Parser1, url=URL1 + self.keywords)
        job2 = Job(Parser2, url=URL2 + self.keywords)
        return processor.run([job1, job2])
Source: https://stackoverflow.com/a/62902603/1345788
Upvotes: 1