user1592380

Reputation: 36317

Sending content directly to a Scrapy pipeline

I'm working with Scrapy. In my current project I am capturing the text from PDF files and want to send it to a pipeline for parsing. Right now I have:

from io import BytesIO
import slate

def get_pdf_text(self, response):
    # read the downloaded PDF from memory and extract its text with slate
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)
    doc = slate.PDF(in_memory_pdf)
    item = OveItem()
    item['pdf_text'] = doc
    return item
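
For context, get_pdf_text is reached by yielding a Request for each PDF link, roughly like this (the spider skeleton, start URL and link selector are only placeholders here, not my real code):

import scrapy

class OveSpider(scrapy.Spider):
    name = 'ove'
    start_urls = ['http://example.com/reports']  # placeholder URL

    def parse(self, response):
        # follow every PDF link and hand the downloaded response to get_pdf_text
        for href in response.css('a[href$=".pdf"]::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.get_pdf_text)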

pipelines.py

class OvePipeline(object):
    def process_item(self, item, spider):
        .......
        return item

This works, but I think it would be cleaner to just yield the result directly, without having to attach it to an item to get it into a pipeline, like:

def get_pdf_text(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)
    yield slate.PDF(in_memory_pdf)

Is this possible?

Upvotes: 1

Views: 349

Answers (1)

alecxe

Reputation: 474021

According to the Scrapy documentation, a spider callback has to return Request instances, dictionaries, or Item instances:

This method, as well as any other Request callback, must return an iterable of Request and/or dicts or Item objects.

So, if you don't want to define a special "item" for the pdf content, simply wrap it into a dict:

def get_pdf_text(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)

    doc = slate.PDF(in_memory_pdf)

    return {'pdf_text': doc}
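
The pipeline side doesn't change: Scrapy passes dict items to process_item just like Item instances, so the text is still available under the same key. A minimal sketch, reusing the OvePipeline name from the question:

class OvePipeline(object):
    def process_item(self, item, spider):
        # item is the plain dict yielded by the spider: {'pdf_text': doc}
        text = item['pdf_text']
        # ... parse/clean the extracted text here ...
        return item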

Upvotes: 2
