no1

Reputation: 945

Force my scrapy spider to stop crawling

Is there a way to stop crawling when a specific condition is true (e.g. scrap_item_id == predefined_value)? My problem is similar to Scrapy - how to identify already scraped urls, but I want to 'force' my Scrapy spider to stop crawling once it discovers the last scraped item.
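For illustration, a rough sketch of what I am after (extract_id, predefined_value and build_item are placeholders for my real logic):

def parse_item(self, response):
    scrap_item_id = extract_id(response)  # hypothetical extraction helper
    if scrap_item_id == predefined_value:
        # here I want the whole spider to stop crawling,
        # not just skip this one item
        return
    yield build_item(response)  # hypothetical item builder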

Upvotes: 37

Views: 34445

Answers (5)

Kpax7

Reputation: 11

I found a solution based on @alukach's answer. In the latest versions of Scrapy, this piece of code:

from scrapy.project import crawler
crawler._signal_shutdown(9,0) #Run this if the cnxn fails.

has two problems:

  1. There is no scrapy.project module to import any more
  2. There is no signal number 9 among the valid signals

Valid signals are:

{<Signals.SIGINT: 2>: 'SIGINT',
 <Signals.SIGILL: 4>: 'SIGILL',
 <Signals.SIGFPE: 8>: 'SIGFPE',
 <Signals.SIGSEGV: 11>: 'SIGSEGV',
 <Signals.SIGTERM: 15>: 'SIGTERM',
 <Signals.SIGBREAK: 21>: 'SIGBREAK',
 <Signals.SIGABRT: 22>: 'SIGABRT'}
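That listing looks like the members of Python's signal.Signals enum on Windows; a small sketch that prints a similar mapping (the exact members vary by platform and Python version):

import signal

# Print the signals available on this platform; on Windows this resembles
# the mapping shown above, on Linux the set is larger.
print({s: s.name for s in signal.Signals})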

so the way to send the shutdown signal is:

from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()
process._signal_shutdown(22,0)

Upvotes: 0

Alex

Reputation: 1031

I tried lots of options and nothing worked. This dirty hack does the trick on Linux:

import os
import signal
os.kill(os.getpid(), signal.SIGINT)
os.kill(os.getpid(), signal.SIGINT)

This sends the SIGINT signal to Scrapy twice; the second signal forces an unclean shutdown.
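For context, a hedged sketch of the hack inside a spider callback, tied to the kind of condition from the question (extract_id, predefined_value and build_item are placeholders):

import os
import signal

def parse_item(self, response):
    if extract_id(response) == predefined_value:  # hypothetical stop condition
        os.kill(os.getpid(), signal.SIGINT)  # first SIGINT: graceful shutdown
        os.kill(os.getpid(), signal.SIGINT)  # second SIGINT: force shutdown
        return
    yield build_item(response)  # hypothetical item builder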

Upvotes: 0

Macbric

Reputation: 482

From a pipeline, I prefer the following solution.

class MongoDBPipeline(object):

    def process_item(self, item, spider):
        # pass the spider object (not the pipeline) to the engine
        spider.crawler.engine.close_spider(spider, reason='duplicate')
        return item

Source: Force spider to stop in scrapy
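The same engine call can also be made from inside the spider itself; a hedged sketch (found_last_scraped_item and build_item are hypothetical helpers):

def parse_item(self, response):
    if found_last_scraped_item(response):
        # self.crawler is set on the spider by Scrapy; close with a custom reason
        self.crawler.engine.close_spider(self, reason='finished')
        return
    yield build_item(response)  # hypothetical item builder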

Upvotes: 4

alukach

Reputation: 6298

This question was asked 8 months ago, but I was wondering the same thing and found another (not great) solution. Hopefully this can help future readers.

I'm connecting to a database in my Pipeline file; if the database connection is unsuccessful, I want the Spider to stop crawling (there's no point in collecting data if there's nowhere to send it). What I ended up doing was using:

from scrapy.project import crawler
crawler._signal_shutdown(9,0) #Run this if the cnxn fails.

This causes the Spider to do the following:

[scrapy] INFO: Received SIGKILL, shutting down gracefully. Send again to force unclean shutdown.

I just kind of pieced this together after reading your comment and looking through the "/usr/local/lib/python2.7/dist-packages/Scrapy-0.12.0.2543-py2.7.egg/scrapy/crawler.py" file. I'm not totally sure what it's doing; the first number passed to the function is the signame (for example, using 3,0 instead of 9,0 returns the error [scrapy] INFO: Received SIGKILL...).

Seems to work well enough though. Happy scraping.
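For context, a hedged sketch of the sort of pipeline this sat in (Scrapy 0.12-era API; connect_to_database is a hypothetical stand-in for the real client call):

from scrapy.project import crawler  # old singleton, removed in later Scrapy versions

class DatabasePipeline(object):
    def __init__(self):
        try:
            self.cnxn = connect_to_database()  # hypothetical connection helper
        except Exception:
            # no point crawling if there is nowhere to send the data
            crawler._signal_shutdown(9, 0)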

EDIT: I also suppose that you could just force your program to shut down with something like:

import sys
sys.exit("SHUT DOWN EVERYTHING!")

Upvotes: 12

Sjaak Trekhaak

Reputation: 4966

In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider.

The 0.14 release notes mention: "Added CloseSpider exception to manually close spiders (r2691)"

Example as per the docs:

def parse_page(self, response):
  if 'Bandwidth exceeded' in response.body:
    raise CloseSpider('bandwidth_exceeded')

See also: http://readthedocs.org/docs/scrapy/en/latest/topics/exceptions.html?highlight=closeSpider
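Applied to the question's condition, a hedged sketch might look like this (extract_id, predefined_value and build_item are placeholders):

from scrapy.exceptions import CloseSpider

def parse_item(self, response):
    if extract_id(response) == predefined_value:  # hypothetical stop condition
        raise CloseSpider('last_scraped_item_reached')
    yield build_item(response)  # hypothetical item builder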

Upvotes: 45
