Reputation: 993
I want to build a web crawler based on Scrapy to grab news pictures from several news portal websites. I want this crawler to:
Run forever
It will periodically re-visit some portal pages to get updates.
Schedule priorities
Give different priorities to different types of URLs.
Multi-threaded fetching
I've read the Scrapy documentation but have not found anything related to what I listed (maybe I am not looking carefully enough). Does anyone here know how to do that, or can you give some ideas/examples? Thanks!
Upvotes: 11
Views: 5951
Reputation: 41
About the requirement on running forever, here are some details.

You need to catch the signals.spider_idle signal and, in the method connected to it, raise a DontCloseSpider exception. The spider_idle signal is sent to the Scrapy engine when there are no pending requests; by default the spider will then shut down, but you can intercept this process.

See the code below:
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider
from scrapy.xlib.pydispatch import dispatcher

class FooSpider(scrapy.Spider):
    name = 'foo'

    def __init__(self, *args, **kwargs):
        super(FooSpider, self).__init__(*args, **kwargs)
        # Connect the handler below to the spider_idle signal
        dispatcher.connect(self.spider_idle, signals.spider_idle)

    def spider_idle(self):
        # You can re-visit your portal URLs in this method
        raise DontCloseSpider
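As a concrete sketch of the re-visiting part, assuming the same pydispatch-era Scrapy as above (newer releases changed the engine.crawl signature, so adjust for your version), the idle handler could re-queue the portal pages before raising the exception:

    def spider_idle(self):
        for url in self.start_urls:
            # dont_filter=True bypasses the duplicate filter, so the
            # portal pages are fetched again on every idle cycle
            self.crawler.engine.crawl(
                scrapy.Request(url, dont_filter=True), self)
        raise DontCloseSpider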
Upvotes: 0
Reputation: 43487
Scrapy is a framework for spidering websites; as such, it is intended to support your criteria, but it isn't going to dance for you out of the box. You will probably have to get relatively familiar with the module for some tasks.

Scrapy is a library, not an application. There is a non-trivial amount of work (code) that a user of the module needs to do.
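For the priority and multi-threading requirements specifically, here is a minimal sketch (the spider name, URLs, and the '/breaking/' pattern are made-up examples). Scrapy lets you attach a priority to each Request, and concurrency is tuned through settings rather than threads, since fetching runs concurrently on Twisted's asynchronous I/O:

import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['http://example.com/news']  # hypothetical portal page

    # Concurrency is configured via settings; Scrapy fetches many
    # pages at once using async I/O instead of OS threads.
    custom_settings = {
        'CONCURRENT_REQUESTS': 32,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
    }

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            # Requests with a higher priority value are scheduled first
            priority = 10 if '/breaking/' in href else 0
            yield response.follow(href, callback=self.parse_article,
                                  priority=priority)

    def parse_article(self, response):
        yield {'url': response.url}

This can be combined with the spider_idle approach from the other answer to get the periodic re-visiting as well.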
Upvotes: 12