Louis Thibault

Reputation: 21450

Why is scrapy dumping thousands of `ERROR` log messages without any description of the error?

I'm in the process of writing a CrawlSpider that will parse Google search results. The search query changes each time, so the spider must first connect to a database to gather information about the query it needs to parse. Here is my annotated CrawlSpider class:

from urllib import quote

from scrapy import log, signals
from scrapy.contrib.spiders import CrawlSpider
from scrapy.item import Item, Field
from scrapy.selector import Selector
from scrapy.xlib.pydispatch import dispatcher
from twisted.internet import defer


class GoogleItem(Item):
    # Item fields must be declared before parse_item can assign them;
    # a bare Item() raises KeyError on assignment.
    urls = Field()
    keyword = Field()
    exactrequest = Field()


class GoogleSpider(CrawlSpider):
    name = 'googlespider'
    allowed_domains = ['google.com', 'google.ca', 'google.fr']
    logger = log

    _google_query = "http://www.google.{0}/search?q={1}"

    def __init__(self, *args, **kwargs):
        super(GoogleSpider, self).__init__(*args, **kwargs)
        dispatcher.connect(self.get_startup_params, signals.spider_opened)

    @defer.inlineCallbacks
    def get_startup_params(self, spider, **kw):

        # Get the exact requests to issue to google
        # (get_exactrequests is defined elsewhere and returns a Deferred)
        exreqs = yield get_exactrequests()

        # Create the google query (i.e. url to scrape) and store associated information
        start_urls = []
        self.item_lookup = {}
        for keyword, exact_request, lang in exreqs['res']:
            url = self.mk_google_query(lang, exact_request)
            start_urls.append(url)
            self.item_lookup[url] = (keyword, exact_request)

        # Assign the google query URLs to `start_urls`
        self.start_urls = tuple(start_urls)

    def mk_google_query(self, lang, search_terms):
        return self._google_query.format(lang, quote(search_terms))

    def parse_item(self, response):
        sel = Selector(response)
        item = GoogleItem()
        keyword, exact_request = self.item_lookup[response.request.url]
        item['urls'] = map(lambda r: r.extract(),
                           sel.xpath('//h3[@class="r"]/a/@href'))
        item['keyword'] = keyword
        item['exactrequest'] = exact_request
        return item

When I run `scrapy crawl googlespider`, I get a MASSIVE log output that looks like this:

[-] ERROR: 2014-02-17 00:24:38+0100 [-] ERROR: 2014-02-17 00:24:38+0100 [-] ERROR: 2014-02-17 00:24:38+0100 [-] ERROR: 2014-02-17 00:24:38+0100 [-] ERROR: 2014-02-17 00:24:38+0100 [-] ERROR: 2014-02-17 00:24:38+0100 ...

This output goes on for (I would estimate) a good 10,000 lines -- far beyond my terminal's scrollback.

Does anybody know what the issue might be and how I should go about diagnosing/fixing it?

Thank you!

Upvotes: 1

Views: 96

Answers (2)

Louis Thibault

Reputation: 21450

It turns out @Rho was correct: the problem stems from the fact that I called `log.start()`. Removing that call restored sanity.
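For anyone hitting the same wall: when a crawl is launched with `scrapy crawl`, logging is already started, and calling `log.start()` a second time appears to attach a duplicate log observer, so every line gets re-emitted and the output snowballs. A minimal sketch of the fix (no `log.start()` anywhere in the spider; emit messages through `scrapy.log` instead):

from scrapy import log
from scrapy.contrib.spiders import CrawlSpider


class GoogleSpider(CrawlSpider):
    name = 'googlespider'

    # No log.start() here: the `scrapy crawl` command has already
    # started the log, and starting it again duplicates the output.

    def parse_item(self, response):
        # Log through scrapy.log, which uses the already-running observer.
        log.msg("parsing %s" % response.url, level=log.DEBUG, spider=self)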

Upvotes: 1

Guy Gavriely

Reputation: 11396

Hard to tell, as your log says practically nothing, but the following is recommended:

  1. the way you load start_urls seems unnecessarily complex; scrapy has a ready-made start_requests function you can override if your URL generation requires extra work (see the sketch after this list)
  2. do you inherit from CrawlSpider on purpose? Since you don't seem to declare any rules, I assume you should inherit from Spider instead
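To illustrate point 1, here is a minimal sketch of the start_requests approach, assuming the per-query data can be fetched with a blocking call before the crawl begins (get_exactrequests_sync is a hypothetical synchronous variant of your get_exactrequests helper):

from urllib import quote

from scrapy.http import Request
from scrapy.spider import Spider


class GoogleSpider(Spider):
    name = 'googlespider'
    allowed_domains = ['google.com', 'google.ca', 'google.fr']

    _google_query = "http://www.google.{0}/search?q={1}"

    def start_requests(self):
        # get_exactrequests_sync() is hypothetical: a blocking variant of
        # get_exactrequests() returning (keyword, exact_request, lang) rows.
        for keyword, exact_request, lang in get_exactrequests_sync():
            url = self._google_query.format(lang, quote(exact_request))
            # Carry the lookup data in request.meta rather than a dict
            # keyed by URL; meta survives redirects and retries.
            yield Request(url, callback=self.parse_item,
                          meta={'keyword': keyword,
                                'exact_request': exact_request})

This also removes the need for the spider_opened signal handler and the item_lookup dict: parse_item can read response.meta['keyword'] and response.meta['exact_request'] directly.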

Upvotes: 1
