Reputation: 36317
I'm working with Scrapy. In my current project I am extracting the text from PDF files, and I want to send that text to a pipeline for parsing. Right now I have:
def get_pdf_text(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)
    doc = slate.PDF(in_memory_pdf)
    item = OveItem()
    item['pdf_text'] = doc
    return item
pipelines.py
class OvePipeline(object):
    def process_item(self, item, spider):
        .......
        return item
This works, but I think it would be cleaner to yield the result directly, without having to attach it to an item to get it to a pipeline, like:
def get_pdf_text(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)
    yield slate.PDF(in_memory_pdf)
Is this possible?
Reputation: 474021
According to the Scrapy documentation, a spider callback must return Request instance(s), dict(s), or Item instance(s):
This method, as well as any other Request callback, must return an iterable of Request and/or dicts or Item objects.
So, if you don't want to define a special "item" for the PDF content, simply wrap it in a dict:
def get_pdf_text(self, response):
    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)
    doc = slate.PDF(in_memory_pdf)
    return {'pdf_text': doc}
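On the pipeline side, a dict item can be read like any other item. The following is only a rough sketch, not the asker's actual pipeline: the class name PdfTextPipeline and the keys full_text and page_count are hypothetical, and it assumes that pdf_text is a sequence of per-page strings.

```python
class PdfTextPipeline:
    """Hypothetical pipeline that post-processes dict items carrying PDF text."""

    def process_item(self, item, spider):
        pages = item.get("pdf_text")
        if pages is None:
            # Not a PDF item: pass it through unchanged.
            return item

        # Assumption: pdf_text is a sequence of per-page strings.
        pages = list(pages)
        item["full_text"] = "\n".join(str(page).strip() for page in pages)
        item["page_count"] = len(pages)
        return item
```

To enable such a pipeline it would have to be listed in the ITEM_PIPELINES setting, for example under a key like 'myproject.pipelines.PdfTextPipeline', where the module path is again an assumption.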