Arman Avetisyan

Reputation: 401

Scrapy CrawlProcess crawl() function sometimes hangs indefinitely

I'm using CrawlerProcess to run a Scrapy spider with a Selenium downloader. Most of the time the code works as expected, but sometimes it hangs indefinitely after executing the crawl() function.

process = CrawlerProcess(project_settings)
process.crawl(MySpider, urls=urls)  # This is where it hangs (but most of the time it works fine)
process.start()

It's not throwing any exceptions, so I'm trying to understand how to debug it. Also, is it possible to set up a timeout exception for the crawl() function?
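
One thing I'm planning to try for the debugging part (untested sketch, and the 60-second value is arbitrary) is the stdlib faulthandler module, which can dump every thread's stack trace after a delay so I can at least see where the process is stuck:

    import faulthandler
    import sys

    # Dump all thread stack traces to stderr if we're still running after 60s,
    # without killing the process, so the hang location becomes visible.
    faulthandler.dump_traceback_later(60, exit=False, file=sys.stderr)

    process = CrawlerProcess(project_settings)
    process.crawl(MySpider, urls=urls)
    process.start()

    # Crawl finished normally, so cancel the pending dump.
    faulthandler.cancel_dump_traceback_later()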

UPDATE

I was able to set up the spider_idle signal. The signal is connected properly, but the handler is still not being executed. I guess this is not the proper signal for this task.

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    print("-- Initiating signals from_crawler function")
    spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.handleSpiderIdle, signal=signals.spider_idle)
    return spider

And below is the signal handler, but it's not being called.

    def handleSpiderIdle(self, spider):
        '''Handle spider idle event.'''
        print(f'\nSpider idle: {spider.name}. Restarting it... ')
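
To narrow down how far the engine actually gets, this is a variant I'm thinking of trying next (untested sketch; the log* handlers are just probes I made up, but engine_started, spider_opened and request_scheduled are standard Scrapy signals):

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handleSpiderIdle, signal=signals.spider_idle)
        # Probes on earlier signals, to see how far the crawl gets before stalling
        crawler.signals.connect(spider.logEngineStarted, signal=signals.engine_started)
        crawler.signals.connect(spider.logSpiderOpened, signal=signals.spider_opened)
        crawler.signals.connect(spider.logRequestScheduled, signal=signals.request_scheduled)
        return spider

    def logEngineStarted(self):
        print("-- signal fired: engine_started")

    def logSpiderOpened(self, spider):
        print("-- signal fired: spider_opened")

    def logRequestScheduled(self, request, spider):
        print(f"-- signal fired: request_scheduled for {request.url}")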

UPDATE 2

The code hangs specifically in the start_requests function:

    def start_requests(self):
        print("-- Initial Request Started --")  # This print statement is not being executed
        yield SeleniumRequest(
            url=self.start_urls[0],
            callback=self.parse,
            wait_time=2,
        )
        print("-- Initial Request Passed --")
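
Since start_requests is a generator, its body only runs once the engine starts pulling requests from it, so my current guess is that the hang happens earlier, possibly while the Selenium downloader middleware starts the webdriver. A standalone check I want to run to rule that out (untested sketch; the Chrome options and URL are placeholders, not my actual settings):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")

    # If this line hangs, the problem is the driver startup, not Scrapy itself
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_page_load_timeout(30)  # fail fast instead of waiting forever
        driver.get("https://example.com")
        print(driver.title)
    finally:
        driver.quit()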

Upvotes: 1

Views: 152

Answers (0)
