the_interest_seeker

Reputation: 94

Getting twisted.defer.CancelledError when using Scrapy

Whenever I run the scrapy crawl command, the following error pops up:

2016-03-12 00:16:56 [scrapy] ERROR: Error downloading <GET http://XXXXXXX/rnd/sites/default/files/Agreement%20of%20FFCCA(1).pdf>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 246, in _cb_bodyready
    raise defer.CancelledError()
CancelledError
2016-03-12 00:16:56 [scrapy] ERROR: Error downloading <GET http://XXXXXX/rnd/sites/default/files/S&P_Chemicals,etc.20150903.doc>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 246, in _cb_bodyready
    raise defer.CancelledError()
CancelledError

I have tried searching the internet for this error, but to no avail.

My crawler code is given below:

import StringIO

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class IntSpider(CrawlSpider):
    name = "intranetspidey"
    allowed_domains = ["*****"]
    start_urls = [
        "******"
    ]
    rules = (
        Rule(LinkExtractor(deny_extensions=["ppt","pptx"],deny=(r'.*\?.*') ),
             follow=True,
             callback='parse_webpage'),
    )


    def get_pdf_text(self, response):
        """Peek inside the PDF to check for possible violations.

        @return: PDF content as a searchable plain-text string
        """
        try:
            from pyPdf import PdfFileReader
        except ImportError:
            print "Needed: easy_install pyPdf"
            raise

        stream = StringIO.StringIO(response.body)
        reader = PdfFileReader(stream)
        text = u""

        if reader.getDocumentInfo().title:
            # The title is optional and may be None
            text += reader.getDocumentInfo().title

        for page in reader.pages:
            # XXX: Does this handle unicode properly?
            text += page.extractText()

        return text

    def parse_webpage(self, response):
        ct = response.headers.get("content-type", "").lower()
        if "pdf" in ct or ".pdf" in response.url:
            data = self.get_pdf_text(response)

        elif "html" in ct:
            # do something with the HTML page
            pass

I am just starting out with Scrapy and would be very grateful for your help.

Upvotes: 0

Views: 965

Answers (2)

neverlastn

Reputation: 2204

Ah - simple! :)

Just open the source code where the error is thrown (http11.py, in _cb_bodyready, right in your traceback): the response is bigger than maxsize, which leads us to the DOWNLOAD_MAXSIZE setting.

So, the problem is that you're trying to get large documents. Increase the DOWNLOAD_MAXSIZE limit in settings and you should be fine.
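For example, in your project's settings.py (a minimal sketch; the values here are illustrative, not recommendations):

# settings.py
# DOWNLOAD_MAXSIZE defaults to 1073741824 bytes (1 GB); bigger responses
# are cancelled, which is exactly the CancelledError in your traceback.
DOWNLOAD_MAXSIZE = 2147483648    # allow responses up to 2 GB

# DOWNLOAD_WARNSIZE (default 33554432, i.e. 32 MB) only logs a warning.
DOWNLOAD_WARNSIZE = 134217728    # warn at 128 MB instead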

Note: your performance will suffer, because PDF decoding blocks the CPU, and while that happens no further requests are issued; Scrapy's architecture is strictly single-threaded. Here are two (out of many) solutions, each sketched after this list:

a) Use the Files Pipeline to download the files and then batch-process them with some other system.

b) Use reactor.spawnProcess() and separate processes for PDF decoding (see the Twisted documentation). This lets you use Python or any other command-line tool to do the PDF decoding.
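For option (a), a minimal sketch assuming Scrapy 1.x paths and the stock Files Pipeline; the storage directory is hypothetical:

# settings.py -- enable the built-in Files Pipeline
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/data/crawl/files'   # hypothetical storage directory

# In the spider callback, extract document links from HTML pages and hand
# them to the pipeline instead of decoding PDFs in-process. Note that
# LinkExtractor skips .pdf links by default, hence deny_extensions=[].
def parse_webpage(self, response):
    links = LinkExtractor(allow=r'\.pdf$', deny_extensions=[]).extract_links(response)
    if links:
        yield {'file_urls': [link.url for link in links]}

For option (b), a sketch of reactor.spawnProcess() with a ProcessProtocol, assuming the pdftotext tool from poppler-utils is installed; the paths are hypothetical:

from twisted.internet import protocol, reactor

class PdfTextProtocol(protocol.ProcessProtocol):
    """Collect pdftotext's stdout so the reactor never blocks on decoding."""
    def __init__(self):
        self.chunks = []

    def outReceived(self, data):
        self.chunks.append(data)

    def processEnded(self, reason):
        text = "".join(self.chunks)
        # hand `text` off to your own processing here

# 'pdftotext <file> -' writes the extracted text to stdout.
reactor.spawnProcess(PdfTextProtocol(), '/usr/bin/pdftotext',
                     args=['pdftotext', '/tmp/file.pdf', '-'])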

Upvotes: 1

Steven Almeroth

Reputation: 8202

Do you get a line like this in your output/log:

Expected response size X larger than download max size Y.

It sounds like the response you are requesting is larger than 1 GB. The error is coming from the download handler, whose limit defaults to one gigabyte but can easily be overridden: project-wide with DOWNLOAD_MAXSIZE in your settings, per spider with the download_maxsize attribute, or per request with the download_maxsize key in Request.meta.
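A minimal sketch of the spider-level and request-level overrides (the spider name, URLs and the 2 GB value are hypothetical, chosen only for illustration):

import scrapy

class BigFileSpider(scrapy.Spider):
    name = 'bigfiles'                        # hypothetical spider
    start_urls = ['http://example.com/']     # hypothetical URL
    download_maxsize = 2 * 1024 ** 3         # per-spider limit: 2 GB

    def parse(self, response):
        # Per-request override, if only a few URLs are huge
        yield scrapy.Request(
            'http://example.com/huge.pdf',   # hypothetical URL
            meta={'download_maxsize': 2 * 1024 ** 3},
            callback=self.parse_document)

    def parse_document(self, response):
        self.logger.info('downloaded %d bytes', len(response.body))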

Upvotes: 0
