Reputation: 401
I'm using CrawlerProcess to run a Scrapy spider with a Selenium downloader. Most of the time the code works as expected, but sometimes it hangs indefinitely after calling the crawl() function.
process = CrawlerProcess(project_settings)
process.crawl(MySpider, urls=urls)  # This is where it hangs (but most of the time it works fine)
process.start()
It doesn't throw any exceptions, so I'm trying to figure out how to debug it. Also, is it possible to set a timeout on the crawl() function so it raises an exception instead of hanging?
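For example, something along these lines is what I had in mind for the timeout, if that's even a valid approach (a rough sketch, assuming the default Twisted reactor; CRAWL_TIMEOUT is just a placeholder, and I realize a watchdog like this can only fire once process.start() actually gets the reactor running):
from twisted.internet import reactor

CRAWL_TIMEOUT = 600  # placeholder value, in seconds

process = CrawlerProcess(project_settings)
d = process.crawl(MySpider, urls=urls)

# Watchdog: ask Scrapy to stop all running crawls if nothing has finished in time.
watchdog = reactor.callLater(CRAWL_TIMEOUT, process.stop)

def cancel_watchdog(result):
    # The crawl finished (or failed) on its own, so the watchdog is no longer needed.
    if watchdog.active():
        watchdog.cancel()
    return result

d.addBoth(cancel_watchdog)
process.start()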
UPDATE
I was able to set up the spider_idle signal. The signal is connected properly, but the handler is still never executed. I guess this is not the right signal for this task.
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    print("-- Initiating signals from_crawler function")
    spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
    crawler.signals.connect(spider.handleSpiderIdle, signal=signals.spider_idle)
    return spider
And below is the signal handler, but it's never called.
def handleSpiderIdle(self, spider):
    '''Handle spider idle event.'''
    print(f'\nSpider idle: {spider.name}. Restarting it... ')
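From what I understand, spider_idle only fires once the spider is open and has run out of requests, so it probably never gets a chance to fire here. To see how far startup gets, I'm planning to hook some earlier lifecycle signals as well, something like this (handleEngineStarted and handleSpiderOpened are just names I made up, using the same from_crawler pattern as above):
@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
    spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
    # Earlier lifecycle signals, just to see how far startup gets before the hang.
    crawler.signals.connect(spider.handleEngineStarted, signal=signals.engine_started)
    crawler.signals.connect(spider.handleSpiderOpened, signal=signals.spider_opened)
    crawler.signals.connect(spider.handleSpiderIdle, signal=signals.spider_idle)
    return spider

def handleEngineStarted(self):
    print("-- engine_started fired")

def handleSpiderOpened(self, spider):
    print(f"-- spider_opened fired for {spider.name}")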
UPDATE 2
The code hangs specifically in the start_requests function:
def start_requests(self):
    print("-- Initial Request Started --")  # This print statement is never executed
    yield SeleniumRequest(
        url=self.start_urls[0],
        callback=self.parse,
        wait_time=2,
    )
    print("-- Initial Request Passed --")
Upvotes: 1
Views: 152