Reputation: 507
Hi, I'm wondering how I could pass the scraping result, which is a pandas DataFrame, back to the module that created the spider.
from scrapy.crawler import CrawlerProcess

import mySpider as mspider

def main():
    spider1 = mspider.MySpider()
    process = CrawlerProcess()
    process.crawl(spider1)
    process.start()
    print(len(spider1.result))
Spider:
import pandas as pd
import scrapy
from scrapy import Request

import config

class MySpider(scrapy.Spider):
    name = 'MySpider'
    allowed_domains = config.ALLOWED_DOMAINS
    result = pd.DataFrame(columns=...)

    def start_requests(self):
        yield Request(url=..., headers=config.HEADERS, callback=self.parse)

    def parse(self, response):
        # ...some code adding values to result...
        print("size: " + str(len(self.result)))
The value printed in the main method is 0, while in the parse method it is 1005. Could you tell me how I should pass the value between them?
I would like to do that because I'm running multiple spiders. After they finish scraping, I'll merge the results and save them to a file, roughly as in the sketch below.
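What I'm aiming for once all the spiders are done is something like this simplified, self-contained sketch (the DataFrames, column names and file name are just placeholders for the real spider results):

import pandas as pd

# stand-ins for the result DataFrames of two finished spiders
df1 = pd.DataFrame({"url": ["a"], "price": [10]})
df2 = pd.DataFrame({"url": ["b"], "price": [20]})

merged = pd.concat([df1, df2], ignore_index=True)
merged.to_csv("results.csv", index=False)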
SOLUTION
from datetime import datetime

from scrapy import signals
from scrapy.crawler import CrawlerProcess

import mySpider as mspider

def spider_closed(spider, reason):
    print("Size: " + str(len(spider.result)))

def main():
    now = datetime.now()
    crawler_process = CrawlerProcess()
    crawler = crawler_process.create_crawler(mspider.MySpider)
    crawler.signals.connect(spider_closed, signals.spider_closed)
    crawler_process.crawl(crawler)
    crawler_process.start()
Upvotes: 2
Views: 753
Reputation: 474003
The main reason for this behavior is the asynchronous nature of Scrapy itself. The print(len(spider1.result)) line would be executed before the .parse() method is called.
There are multiple ways to wait for the spider to finish. I would use the spider_closed signal:
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

import mySpider as mspider

def spider_closed(spider, reason):
    # the finished spider instance is passed in, so its DataFrame is available here
    print(len(spider.result))

crawler_process = CrawlerProcess(get_project_settings())
crawler = crawler_process.create_crawler(mspider.MySpider)
crawler.signals.connect(spider_closed, signals.spider_closed)
crawler_process.crawl(crawler)
crawler_process.start()
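Since you mentioned running several spiders and merging their results afterwards, the same signal wiring scales: connect one handler that appends each spider's DataFrame to a list, crawl all spiders in one CrawlerProcess, and concatenate after start() returns. Below is a self-contained sketch; the demo spiders, column names and output file are placeholders, not your real code:

import pandas as pd
import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess

class DemoSpiderA(scrapy.Spider):
    # placeholder spider that records the URLs it visits in a DataFrame,
    # mimicking the `result` attribute of MySpider
    name = "demo_a"
    start_urls = ["https://example.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.result = pd.DataFrame(columns=["url"])

    def parse(self, response):
        self.result.loc[len(self.result)] = [response.url]

class DemoSpiderB(DemoSpiderA):
    name = "demo_b"

collected = []

def spider_closed(spider, reason):
    # called once per spider when it finishes; grab its DataFrame
    collected.append(spider.result)

process = CrawlerProcess()
for spider_cls in (DemoSpiderA, DemoSpiderB):
    crawler = process.create_crawler(spider_cls)
    crawler.signals.connect(spider_closed, signals.spider_closed)
    process.crawl(crawler)
process.start()  # blocks until every crawler has finished

merged = pd.concat(collected, ignore_index=True)
merged.to_csv("merged.csv", index=False)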
Upvotes: 2