How to stop the Crawler

I am trying to write a crawler that visits a website and searches for a list of keywords, with a max depth of 2. The crawler is supposed to stop once any of the keywords appears on any page. The problem I am facing right now is that the crawler does not stop when it first sees any of the keywords.

I have already tried an early return, a break, CloseSpider, and even Python exit calls, without success.

My crawler class:

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    allowed_domains = ["www.roomtoread.org"]
    start_urls = ["https://" + "www.roomtoread.org"]
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    crawl_count = 0
    words_found = 0

    def check_buzzwords(self, response):
        self.__class__.crawl_count += 1
        crawl_count = self.__class__.crawl_count

        wordlist = [
            "sfdc",
            "pardot",
            "Web-to-Lead",
            "salesforce",
        ]

        url = response.url
        # headers.get returns bytes, so the default must be bytes too
        contenttype = response.headers.get("content-type", b"").decode("utf-8").lower()
        data = response.body.decode("utf-8")

        for word in wordlist:
            substrings = find_all_substrings(data, word)
            for pos in substrings:
                ok = False
                if not ok:
                    if self.__class__.words_found == 0:
                        self.__class__.words_found += 1
                        print(word + "," + url + ";")
                        # STOP!  <- I want the whole crawl to stop here

        return Item()

    def _requests_to_follow(self, response):
        if getattr(response, "encoding", None) is not None:
            return CrawlSpider._requests_to_follow(self, response)
        else:
            return []

I want it to stop execution when the if not ok: condition is True.

Upvotes: 0

Views: 149

Answers (1)

Pt Fisch

Reputation: 23

When I want to stop a spider, I usually raise the exception scrapy.exceptions.CloseSpider(reason='cancelled') from the Scrapy docs.

The example there shows how you can use it:

if b'Bandwidth exceeded' in response.body:
    raise CloseSpider('bandwidth_exceeded')

In your case something like

if not ok:
    raise CloseSpider('keyword_found')

Or is that what you meant with

CloseSpider Commands

and already tried it?

Upvotes: 1
