Reputation: 3959
In a similar vein to this question: stackoverflow: running-multiple-spiders-in-scrapy
I am wondering: can I run an entire Scrapy project from within another Python program? Let's say I want to build an entire program that requires scraping several different sites, and I build an entire Scrapy project for each site.
Instead of running them from the command line as one-offs, I want to run these spiders and acquire the information from them.
I can use MongoDB in Python fine, and I can already build Scrapy projects that contain spiders; the problem now is just merging it all into one application.
I want to run the application once and have the ability to control multiple spiders from my own program.
Why do this? Well, this application may also connect to other sites using an API and needs to compare results from the API site to the scraped site in real time. I never want to have to call Scrapy from the command line; it's all meant to be self-contained.
(I have been asking lots of questions about scraping recently because I am trying to find the right solution to build on.)
Thanks :)
Upvotes: 7
Views: 1874
Reputation: 1238
The answer by Maxime Lorant finally solved the issues I had with building a Scrapy spider in my own script. It solves two problems I had:
It allows calling the spider twice in a row (in the simple example from the Scrapy tutorial this leads to a crash, since you cannot start the Twisted reactor twice).
It allows returning variables from the spider back into the script.
There is only one thing: the example does not work with the Scrapy version I am now using (Scrapy 1.5.2) and Python 3.7.
After some playing with the code I got a working example which I would like to share. I also have a question; see below the script. It is a standalone script, so I have added a spider as well:
import logging
import multiprocessing as mp

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.signals import item_passed
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher


class CrawlerWorker(mp.Process):
    name = "crawlerworker"

    def __init__(self, spider, result_queue):
        mp.Process.__init__(self)
        self.result_queue = result_queue
        self.items = list()
        self.spider = spider
        self.logger = logging.getLogger(self.name)

        self.settings = get_project_settings()
        self.logger.setLevel(logging.DEBUG)
        self.logger.debug("Create CrawlerProcess with settings {}".format(self.settings))
        self.crawler = CrawlerProcess(self.settings)
        dispatcher.connect(self._item_passed, item_passed)

    def _item_passed(self, item):
        self.logger.debug("Adding Item {} to {}".format(item, self.items))
        self.items.append(item)

    def run(self):
        self.logger.info("Start here with {}".format(self.spider.urls))
        self.crawler.crawl(self.spider, urls=self.spider.urls)
        self.crawler.start()
        self.crawler.stop()
        self.result_queue.put(self.items)


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def __init__(self, **kw):
        super(QuotesSpider, self).__init__(**kw)
        self.urls = kw.get("urls", [])
    def start_requests(self):
        # log only when there is nothing to iterate (a bare for/else would
        # run the else branch every time, since the loop never breaks)
        if not self.urls:
            self.log('Nothing to scrape. Please pass the urls')
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        """ Count number of The's on the page """
        the_count = len(response.xpath("//body//text()").re(r"The\s"))
        self.log("found {} time 'The'".format(the_count))
        yield {response.url: the_count}


def report_items(message, item_list):
    print(message)
    if item_list:
        for cnt, item in enumerate(item_list):
            print("item {:2d}: {}".format(cnt, item))
    else:
        print("No items found")


url_list = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/page/3/',
    'http://quotes.toscrape.com/page/4/',
]

result_queue1 = mp.Queue()
crawler = CrawlerWorker(QuotesSpider(urls=url_list[:2]), result_queue1)
crawler.start()
# wait until we are done with the crawl
crawler.join()

# crawl again
result_queue2 = mp.Queue()
crawler = CrawlerWorker(QuotesSpider(urls=url_list[2:]), result_queue2)
crawler.start()
crawler.join()

report_items("First result", result_queue1.get())
report_items("Second result", result_queue2.get())
As you can see, the code is almost identical, except that some imports have changed due to changes in the Scrapy API.
There is one thing: I got a deprecation warning with the pydispatch import saying:
ScrapyDeprecationWarning: Importing from scrapy.xlib.pydispatch is deprecated and will no longer be supported in future Scrapy versions. If you just want to connect signals use the from_crawler class method, otherwise import pydispatch directly if needed. See: https://github.com/scrapy/scrapy/issues/1762
module = self._system_import(name, *args, **kwargs)
I found here how to solve this. However, I cannot get it working. Does anybody know how to apply the from_crawler class method to get rid of the deprecation warning?
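For reference, the direction I understand it should go, based on the current signals API, is roughly the following (an untested sketch, and strictly speaking it connects through the Crawler's SignalManager via create_crawler() rather than through from_crawler itself; the two deprecated imports and the dispatcher.connect() call in __init__ would be dropped, and item_scraped is the current name of the old item_passed signal):
# Untested sketch: register the handler on the Crawler's own SignalManager
# instead of going through scrapy.xlib.pydispatch; only run() changes.
from scrapy import signals

class CrawlerWorker(mp.Process):
    # ... name, __init__ (without dispatcher.connect) and _item_passed as above ...

    def run(self):
        self.logger.info("Start here with {}".format(self.spider.urls))
        # create_crawler() returns a Crawler object before crawling starts,
        # so its signals manager is available for connecting the item handler
        crawler = self.crawler.create_crawler(type(self.spider))
        crawler.signals.connect(self._item_passed, signal=signals.item_scraped)
        self.crawler.crawl(crawler, urls=self.spider.urls)
        self.crawler.start()
        self.crawler.stop()
        self.result_queue.put(self.items)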
Upvotes: 0
Reputation: 36161
Yep, of course you can ;)
The idea (inspired by this blog post) is to create a worker and then use it in your own Python script:
from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing.queues import Queue
import multiprocessing


class CrawlerWorker(multiprocessing.Process):

    def __init__(self, spider, result_queue):
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue

        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.result_queue.put(self.items)
Example of use:
result_queue = Queue()
crawler = CrawlerWorker(MySpider(myArgs), result_queue)
crawler.start()
for item in result_queue.get():
    yield item
Another way would be to execute the scrapy crawl command with system()
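For example, a rough sketch of that route (using subprocess rather than os.system so the exit status can be checked, and assuming a project containing a spider named quotes plus a JSON Lines feed export to pass the items back):
# Rough sketch: run "scrapy crawl" as a child process and read the items
# back from a feed export file. Assumes a spider called "quotes" exists in
# the project; note that -o appends to the file if it already exists.
import json
import subprocess

def crawl_with_cli(spider_name, output_file="items.jl"):
    subprocess.run(
        ["scrapy", "crawl", spider_name, "-o", output_file],
        check=True,  # raise CalledProcessError if the crawl fails
    )
    with open(output_file) as f:
        return [json.loads(line) for line in f]

items = crawl_with_cli("quotes")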
Upvotes: 7