Scrapy - generating items outside of parse callback

Question

This might be a bit of an odd one. I have a Scrapy project with a few spiders that inherit from CrawlSpider. Aside from their normal execution (crawling the intended website), I also want to be able to push items from outside the scope of the original callback.

I have a thread that goes over files in a folder and passes them on to parse_items, as if they were content downloaded by Scrapy. Is there any way I can get the items generated from that through the pipelines and middlewares I have, as if each file were just another downloaded page?

I know that's not the architecture they had in mind, but I'm wondering if I can work around this. I'm familiar with Scrapy's architecture, and am basically looking for a good way to push items to the Engine.

import time
from threading import Thread

from scrapy.http import HtmlResponse
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule


class SomeSpider(CrawlSpider):
    name = "generic_spider"

    def __init__(self):
        CrawlSpider.__init__(self, instance_config)
        self.file_thread = Thread(target=self._file_thread_loop)
        self.file_thread.daemon = True
        self.file_thread.start()

        # callback and follow belong to Rule, not to the link extractor
        self.rules += (Rule(LxmlLinkExtractor(allow=['/somepath/'], deny=[]), callback=self.parse_items, follow=True),)

    def _file_thread_loop(self):
        while True:
            # ... read files ...
            for file in files:
                response = HtmlResponse(url=file['url'], body=file['body'])
                for item in self.parse_items(response):
                    yield item  # <-- I want this to go to the pipelines and middlewares

            time.sleep(10)

    def parse_items(self, response):
        hxs = Selector(response)

        # ... parse page ...
        for item in resulting_items:
            yield item

marven · Accepted Answer

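I'm not sure if there is a way to push items directly to the engine, but you could schedule dummy requests that carry the items in their meta dict and then yield those items from the request's callback. That way they pass through the pipelines and middlewares like any other scraped item.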

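from scrapy.http import HtmlResponse, Request


def _file_thread_loop(self):
    while True:
        # ... read files ...
        for file in files:
            response = HtmlResponse(url=file['url'], body=file['body'])
            req = Request(
                url='http://example.com',
                # parse_items returns a generator; it is consumed in yield_item
                meta={'items': self.parse_items(response)},
                callback=self.yield_item,
                # the URL is always the same, so skip the duplicate filter
                dont_filter=True,
            )
            # Newer Scrapy releases may expect engine.crawl(req) without the spider argument
            self.crawler.engine.crawl(req, spider=self)

        time.sleep(10)


def yield_item(self, response):
    # Hand the pre-parsed items to Scrapy as if they came from this response
    for item in response.meta['items']:
        yield item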
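
As a side note on why the loop in the question never produced anything: a function that contains yield is a generator function, so when Thread calls it as its target it only gets a generator object back and the body never runs. Below is a minimal, Scrapy-free sketch of that behaviour, plus one possible alternative hand-off through a queue.Queue. The names parse_items (a toy stand-in here), loop_that_yields, loop_that_enqueues and drain are hypothetical and only for illustration.

```python
import queue
import threading


def parse_items(body):
    """Toy stand-in for the spider's parse_items generator."""
    for word in body.split():
        yield {"word": word}


def loop_that_yields(bodies, ran):
    # Because of the `yield` below, calling this function only builds a
    # generator object; none of these lines run until someone iterates it.
    ran.append(True)
    for body in bodies:
        for item in parse_items(body):
            yield item


def loop_that_enqueues(bodies, out):
    # A plain function: the thread really executes it and hands each item
    # over through the thread-safe queue.
    for body in bodies:
        for item in parse_items(body):
            out.put(item)


def drain(out):
    """Yield everything currently in the queue, then stop."""
    while True:
        try:
            yield out.get_nowait()
        except queue.Empty:
            return


bodies = ["a b", "c"]

# Generator function as thread target: the body is never executed.
ran = []
t = threading.Thread(target=loop_that_yields, args=(bodies, ran))
t.start()
t.join()

# Plain function as thread target: items end up in the queue.
pending = queue.Queue()
t = threading.Thread(target=loop_that_enqueues, args=(bodies, pending))
t.start()
t.join()
items = list(drain(pending))
```

After this runs, ran is still empty while items holds the three parsed dicts. In the spider, a callback such as yield_item could presumably drain a queue like this instead of carrying the generator in meta, which keeps the parsing work in the file thread and leaves only the yielding to Scrapy.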
