mouch

Reputation: 347

Running a spider from the crawl command and from CrawlerProcess does not produce the same output

I implemented a scrapy spider that I used to call using

scrapy crawl myspider -a start_url='http://www.google.com'

Now I need to run that spider from a script (from a Django app, using django-rq, but that should not matter for this question).

So I followed the CrawlerProcess documentation and ended up with a script like this:

crawler_settings = Settings()
crawler_settings.setmodule(cotextractor_settings)

process = CrawlerProcess(settings=crawler_settings)
process.crawl(MySpider(start_url='http://www.google.com'))
process.start()

The issue is that, from the script, my crawl fails due to a missing start_url argument. After digging into both spiders' output, I noticed that the second run (from the script) printed the debug message from my spider constructor twice.

Here is the constructor:

def __init__(self, *args, **kwargs): 
    super(MySpider, self).__init__(*args, **kwargs)
    logger.debug(kwargs)
    self.start_urls = [kwargs.get('start_url')]

Here is the crawl command output; notice there is only one DEBUG line from the constructor:

2017-07-11 21:53:12 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: cotextractor)
2017-07-11 21:53:12 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'cotextractor', 'DUPEFILTER_CLASS': 'cotextractor.dupefilters.PersistentDupeFilter', 'NEWSPIDER_MODULE': 'cotextractor.spiders', 'SPIDER_MODULES': ['cotextractor.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36'}
2017-07-11 21:53:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2017-07-11 21:53:12 [cotextractor.spiders.spiders] DEBUG: {'start_url': 'http://www.google.com'}
2017-07-11 21:53:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'cotextractor.middlewares.RotatingProxyMiddleware',
 'cotextractor.middlewares.BanDetectionMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-11 21:53:13 [scrapy.middleware] INFO: Enabled spider middlewares:
...

And finally, the output from the script (the django-rq worker); notice the constructor DEBUG line appears twice, once with the expected argument and once empty:

2017-07-11 21:59:27 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: cotextractor)
2017-07-11 21:59:27 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'cotextractor', 'DUPEFILTER_CLASS': 'cotextractor.dupefilters.PersistentDupeFilter', 'NEWSPIDER_MODULE': 'cotextractor.spiders', 'SPIDER_MODULES': ['cotextractor.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36'}
2017-07-11 21:59:27 [cotextractor.spiders.spiders] DEBUG: {'start_url': 'http://www.google.com'}
2017-07-11 21:59:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2017-07-11 21:59:27 [cotextractor.spiders.spiders] DEBUG: {}
2017-07-11 21:59:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'cotextractor.middlewares.RotatingProxyMiddleware',
 'cotextractor.middlewares.BanDetectionMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-11 21:59:27 [scrapy.middleware] INFO: Enabled spider middlewares:
...

My guess is that the spider run from the script fails because its constructor is called twice: once with the parameters and once without. Yet I cannot figure out why CrawlerProcess would call the spider constructor twice.

Thank you for your support,

Upvotes: 0

Views: 1064

Answers (1)

mouch

Reputation: 347

All right.

As already explained in Passing arguments to process.crawl in Scrapy python:

I was actually not using the crawl method properly. I do not need to pass a spider instance, only the spider class (or its name), with the spider arguments passed to crawl itself. So here is the script I have to use:

from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

# MySpider and cotextractor_settings are imported from the project, as before
crawler_settings = Settings()
crawler_settings.setmodule(cotextractor_settings)

process = CrawlerProcess(settings=crawler_settings)
process.crawl(MySpider, start_url='http://www.google.com')
process.start()
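
For the record, here is my understanding of why the worker log showed the constructor debug line twice. Spider.from_crawler is a classmethod, so Scrapy builds the spider it actually runs from the class, using the arguments given to crawl(). Since I had passed an already-built instance and no arguments to crawl(), my instance was discarded and a second spider was created with empty kwargs. A minimal sketch of that behaviour (FakeSpider is just a stand-in I wrote to illustrate the point, not Scrapy's real code):

class FakeSpider:
    def __init__(self, *args, **kwargs):
        # stands in for logger.debug(kwargs) in my real constructor
        print('__init__ called with', kwargs)
        self.start_urls = [kwargs.get('start_url')]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Scrapy always builds the spider it runs here, from the class
        return cls(*args, **kwargs)

spider_i_built = FakeSpider(start_url='http://www.google.com')  # first DEBUG line
spider_scrapy_runs = spider_i_built.from_crawler(None)          # second DEBUG line: {}
print(spider_scrapy_runs.start_urls)                            # [None] -> missing start_url

Running this prints the kwargs once with start_url and once empty, then [None], which matches the two DEBUG lines and the missing start_url error.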

The documentation is actually straightforward; crawl receives as its first parameter either a Crawler instance, a Spider subclass, or a spider name...

https://doc.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess
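
So either of these calls should work (the string form assumes the spider's name attribute is 'myspider', as in my original crawl command, and that it can be found through SPIDER_MODULES):

process.crawl(MySpider, start_url='http://www.google.com')    # spider class
process.crawl('myspider', start_url='http://www.google.com')  # spider name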

Too bad I spent hours on this.

Hope this helps someone ;)

Upvotes: 1
