the_interest_seeker

Reputation: 94

Getting twisted.defer.CancelledError when using Scrapy

Whenever I run the scrapy crawl command, the following error pops up:

2016-03-12 00:16:56 [scrapy] ERROR: Error downloading <GET http://XXXXXXX/rnd/sites/default/files/Agreement%20of%20FFCCA(1).pdf>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 246, in _cb_bodyready
    raise defer.CancelledError()
CancelledError
2016-03-12 00:16:56 [scrapy] ERROR: Error downloading <GET http://XXXXXX/rnd/sites/default/files/S&P_Chemicals,etc.20150903.doc>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 246, in _cb_bodyready
    raise defer.CancelledError()
CancelledError

I have tried searching the internet for this error, but to no avail.

My crawler code is given below:

import StringIO

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class IntSpider(CrawlSpider):
    name = "intranetspidey"
    allowed_domains = ["*****"]
    start_urls = [
        "******"
    ]
    rules = (
        Rule(LinkExtractor(deny_extensions=["ppt","pptx"],deny=(r'.*\?.*') ),
             follow=True,
             callback='parse_webpage'),
    )


    def get_pdf_text(self, response):
        """Peek inside the PDF to check for possible violations.

        @return: PDF content as a searchable plain-text string
        """
        try:
            from pyPdf import PdfFileReader
        except ImportError:
            print "Needed: easy_install pyPdf"
            raise

        stream = StringIO.StringIO(response.body)
        reader = PdfFileReader(stream)
        text = u""

        if reader.getDocumentInfo().title:
            # The title is optional and may be None
            text += reader.getDocumentInfo().title

        for page in reader.pages:
            # XXX: Does this handle unicode properly?
            text += page.extractText()

        return text

    def parse_webpage(self, response):
        ct = response.headers.get("content-type", "").lower()
        if "pdf" in ct or ".pdf" in response.url:
            data = self.get_pdf_text(response)

        elif "html" in ct:
            # do something with the HTML page
            pass

I am just starting out with Scrapy and would be very grateful for your help.

Upvotes: 0

Views: 965

Answers (2)

neverlastn

Reputation: 2204

Ah - simple! :)

Just open the source code where the error is thrown (http11.py, in _cb_bodyready, right in your traceback): the response is bigger than maxsize, which leads us to the DOWNLOAD_MAXSIZE setting.

So, the problem is that you're trying to get large documents. Increase the DOWNLOAD_MAXSIZE limit in settings and you should be fine.
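For example, in your project's settings.py (a minimal sketch; the values here are illustrative, not recommendations):

# settings.py
# DOWNLOAD_MAXSIZE defaults to 1073741824 bytes (1 GB); bigger responses
# are cancelled, which is exactly the CancelledError in your traceback.
DOWNLOAD_MAXSIZE = 2147483648    # allow responses up to 2 GB

# DOWNLOAD_WARNSIZE (default 33554432, i.e. 32 MB) only logs a warning.
DOWNLOAD_WARNSIZE = 134217728    # warn at 128 MB instead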

Note: your performance will suffer, because PDF decoding blocks the CPU, and while that happens no further requests are issued; Scrapy's architecture is strictly single-threaded. Here are two (out of many) solutions, each sketched after this list:

a) Use the Files Pipeline to download the files and then batch-process them with some other system.

b) Use reactor.spawnProcess() and separate processes for PDF decoding (see the Twisted documentation). This lets you use Python or any other command-line tool to do the PDF decoding.
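For option (a), a minimal sketch assuming Scrapy 1.x paths and the stock Files Pipeline; the storage directory is hypothetical:

# settings.py -- enable the built-in Files Pipeline
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/data/crawl/files'   # hypothetical storage directory

# In the spider callback, extract document links from HTML pages and hand
# them to the pipeline instead of decoding PDFs in-process. Note that
# LinkExtractor skips .pdf links by default, hence deny_extensions=[].
def parse_webpage(self, response):
    links = LinkExtractor(allow=r'\.pdf$', deny_extensions=[]).extract_links(response)
    if links:
        yield {'file_urls': [link.url for link in links]}

For option (b), a sketch of reactor.spawnProcess() with a ProcessProtocol, assuming the pdftotext tool from poppler-utils is installed; the paths are hypothetical:

from twisted.internet import protocol, reactor

class PdfTextProtocol(protocol.ProcessProtocol):
    """Collect pdftotext's stdout so the reactor never blocks on decoding."""
    def __init__(self):
        self.chunks = []

    def outReceived(self, data):
        self.chunks.append(data)

    def processEnded(self, reason):
        text = "".join(self.chunks)
        # hand `text` off to your own processing here

# 'pdftotext <file> -' writes the extracted text to stdout.
reactor.spawnProcess(PdfTextProtocol(), '/usr/bin/pdftotext',
                     args=['pdftotext', '/tmp/file.pdf', '-'])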

Upvotes: 1

Steven Almeroth

Reputation: 8202

Do you get a line like this in your output/log:

Expected response size X larger than download max size Y.

It sounds like the response you are requesting is larger than 1 GB. The error is coming from the download handler, whose limit defaults to one gigabyte but can easily be overridden: project-wide with DOWNLOAD_MAXSIZE in your settings, per spider with the download_maxsize attribute, or per request with the download_maxsize key in Request.meta.
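A minimal sketch of the spider-level and request-level overrides (the spider name, URLs and the 2 GB value are hypothetical, chosen only for illustration):

import scrapy

class BigFileSpider(scrapy.Spider):
    name = 'bigfiles'                        # hypothetical spider
    start_urls = ['http://example.com/']     # hypothetical URL
    download_maxsize = 2 * 1024 ** 3         # per-spider limit: 2 GB

    def parse(self, response):
        # Per-request override, if only a few URLs are huge
        yield scrapy.Request(
            'http://example.com/huge.pdf',   # hypothetical URL
            meta={'download_maxsize': 2 * 1024 ** 3},
            callback=self.parse_document)

    def parse_document(self, response):
        self.logger.info('downloaded %d bytes', len(response.body))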

Upvotes: 0
