Miguel Febres

Reputation: 2183

Scrapy: Stop previous parse function on condition

I have a very specific situation with a scraper I am developing right now. The first function, parse_posts_pages, iterates through all the pages of a specific forum thread and, for each page, calls the second function, parse_posts.

def parse_posts_pages(self, response):
    thread_id = response.meta['thread_id']
    thread_link = response.meta['thread_link']
    thread_name = response.meta['thread_name']
    pages = 1  # fall back to a single page if the stats element is missing
    stats = response.xpath('//*[@id="postpagestats_above"]/text()').re(r'(\d+)')
    if len(stats) == 3:
        posts_per_page = int(stats[1])
        total_posts = int(stats[2])
        if posts_per_page > 0:
            # Round up when the last page is only partially filled
            pages = total_posts // posts_per_page
            if total_posts % posts_per_page > 0:
                pages += 1

    for page in range(pages, 0, -1):
        cur_page = '' if page == 1 else '/page' + str(page)
        post_page_link = thread_link + cur_page
        yield scrapy.Request(post_page_link, self.parse_posts,
                             meta={'thread_id': thread_id, 'thread_name': thread_name})


def parse_posts(self, response):
    global maxPostIDByThread, executeFullSpider
    thread_id = response.meta['thread_id']
    thread_name = response.meta['thread_name']
    for post in response.xpath('//*[@id="posts"]/li'):
        post_id = post.xpath('@id').re(r'(\d.*)')[0]
        if not executeFullSpider and post_id in maxPostIDByThread:
            break  # <- I need this break to also cancel the for loop in parse_posts_pages
        ...

In the second function there is an if condition. When this condition resolves to true, I need to break the current for loop AND the for loop in parse_posts_pages, as there is no need to continue the pagination.

Is there any way to stop the for loop in the first function from the second function?

Upvotes: 2

Views: 583

Answers (2)

asduj

Reputation: 351

Just raise CloseSpider, as described in the Scrapy FAQ:

How can I instruct a spider to stop itself?

Raise the CloseSpider exception from a callback:

from scrapy.exceptions import CloseSpider

def parse_page(self, response):
    if 'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')

http://doc.scrapy.org/en/latest/faq.html#how-can-i-instruct-a-spider-to-stop-itself
http://doc.scrapy.org/en/latest/topics/exceptions.html#scrapy.exceptions.CloseSpider

Note that requests that are still in progress (HTTP request sent, response not yet received) will still be parsed. No new request will be processed though.

https://stackoverflow.com/a/23895143/5041915
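Applied to the spider in the question, a minimal sketch might look like this (assuming the globals maxPostIDByThread and executeFullSpider from the question are in scope; the reason string passed to CloseSpider is just a label that shows up in the log):

from scrapy.exceptions import CloseSpider

def parse_posts(self, response):
    thread_id = response.meta['thread_id']
    thread_name = response.meta['thread_name']
    for post in response.xpath('//*[@id="posts"]/li'):
        post_id = post.xpath('@id').re(r'(\d.*)')[0]
        if not executeFullSpider and post_id in maxPostIDByThread:
            # Shuts down the whole spider, so the pagination loop in
            # parse_posts_pages never schedules any further pages.
            raise CloseSpider('reached_already_scraped_posts')
        # ... normal post extraction continues here ...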

Update: Actually, I found out something interesting: if you stop the spider in the main function, it may happen that a new, still-valid request does not get a chance to start, because raising the exception takes effect first.

I suggest checking the condition in a callback function and raising the exception as early as possible.

Upvotes: 1

Bhanu prathap

Reputation: 94

Declare a parse_status flag on the spider with a default value of False. If the required condition is met in the second function, set parse_status to True and break the loop in the first function:

parse_status = False  # spider attribute, default False; set to True in parse_posts

def parse_posts_pages(self, response):
    thread_id = response.meta['thread_id']
    thread_link = response.meta['thread_link']
    thread_name = response.meta['thread_name']
    pages = 1  # fall back to a single page if the stats element is missing
    stats = response.xpath('//*[@id="postpagestats_above"]/text()').re(r'(\d+)')
    if len(stats) == 3:
        posts_per_page = int(stats[1])
        total_posts = int(stats[2])
        if posts_per_page > 0:
            pages = total_posts // posts_per_page
            if total_posts % posts_per_page > 0:
                pages += 1

    for page in range(pages, 0, -1):
        if self.parse_status:
            break  # parse_posts hit the stop condition; stop paginating
        cur_page = '' if page == 1 else '/page' + str(page)
        post_page_link = thread_link + cur_page
        yield scrapy.Request(post_page_link, self.parse_posts,
                             meta={'thread_id': thread_id, 'thread_name': thread_name})


def parse_posts(self, response):
    global maxPostIDByThread, executeFullSpider
    thread_id = response.meta['thread_id']
    thread_name = response.meta['thread_name']
    for post in response.xpath('//*[@id="posts"]/li'):
        post_id = post.xpath('@id').re(r'(\d.*)')[0]
        if not executeFullSpider and post_id in maxPostIDByThread:
            self.parse_status = True
            break  # stop this page and signal parse_posts_pages to stop too
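Note that this works because Scrapy runs all callbacks in a single thread, so a plain spider attribute is safe to share between them. However, any page requests that were scheduled before parse_status flipped to True will still be downloaded and parsed; the flag only prevents new pages from being scheduled.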

Upvotes: 0
