Don't wait for files to download with Scrapy

Question

I have an item pipeline that takes a URL from the item and downloads it. The problem is that I have another pipeline in which I manually check the file and add some info about it, and I really need to do that before the file is downloaded.

import os
import urllib2

from scrapy.exceptions import DropItem

# VIDEOS_DIR is assumed to be defined elsewhere (e.g. imported from settings).

class VideoCommentPipeline(object):

    def process_item(self, item, spider):
        # Open the video in VLC in the background, then ask for a comment.
        os.system("vlc -vvv %s > /dev/null 2>&1 &" % item['file'])
        item['comment'] = raw_input('Your comment:')
        return item

class VideoDownloadPipeline(object):

    def process_item(self, item, spider):
        video_basename = item['file'].split('/')[-1]
        new_filename = os.path.join(VIDEOS_DIR, video_basename)
        downloaded = False
        # Retry the download up to five times.
        for i in range(5):
            try:
                video = urllib2.urlopen(item['file']).read()
                downloaded = True
                break
            except Exception:
                continue
        if not downloaded:
            raise DropItem("Couldn't download file from %s" % item['file'])
        with open(new_filename, 'wb') as f:
            f.write(video)
        item['file'] = video_basename
        return item

But now I always have to wait before reviewing the next item, because the file from the previous item hasn't finished downloading yet. I'd rather check all the items first and let the files download afterwards. How can I do that?

Pawel Miech · Accepted Answer

Scrapy provides a media pipeline that fits your use case. It is not well documented, but it exists and can be used, at least in the most recent Scrapy version. To understand how it works you need to read the code, which is quite intuitive IMO. The image pipeline's interface is a good guide to how the media pipeline works.
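
To illustrate why handing downloads off helps, here is a rough standard-library analogy, not Scrapy's implementation: every item is reviewed first, approved downloads are submitted to a thread pool and run in the background, and results are collected only at the end. It is written for Python 3; `fake_download`, `review_then_download` and the example.com URLs are hypothetical names made up for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Hypothetical stand-in for a real network fetch."""
    return ("bytes of " + url).encode("utf-8")


def review_then_download(urls, approve, download=fake_download, workers=4):
    """Review every URL; approved ones are downloaded in the background."""
    pending = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url in urls:
            if approve(url):  # the manual check would happen here
                # submit() returns immediately, so the next review is not delayed
                pending[url] = pool.submit(download, url)
        # All reviews are done; wait for the downloads that were queued.
        return {url: future.result() for url, future in pending.items()}


results = review_then_download(
    ["http://example.com/a.mp4", "http://example.com/b.mp4"],
    approve=lambda url: url.endswith("a.mp4"),
)
```

In the media pipeline the role of the thread pool is played by Scrapy's own downloader, which receives the requests you return from the pipeline.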

To check each video before downloading it, you can write something like this (adjust it to match your item field names):

import os

from scrapy.http import Request
# On newer Scrapy versions this module lives at scrapy.pipelines.media.
from scrapy.contrib.pipeline.media import MediaPipeline

class VideoPipeline(MediaPipeline):
    VIDEOS_DIR = "/stack/scrapy/video/video/store"

    def get_media_requests(self, item, info):
        """
        Evaluate the file and, if you like it, download it.
        """
        os.system("vlc -vvv %s > /dev/null 2>&1 &" % item['video_url'][0])
        your_opinion = raw_input("how does it look?")
        item["comment"] = your_opinion
        if your_opinion == "hot":
            # Issue a request to download the video.
            return Request(item["video_url"][0], meta={"item": item})

    def media_downloaded(self, response, request, info):
        """
        The file has been downloaded and is available as response.body; save it.
        """
        item = response.meta.get("item")
        video_basename = item['title'][0]
        new_filename = os.path.join(self.VIDEOS_DIR, video_basename)
        with open(new_filename, 'wb') as f:
            f.write(response.body)
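
The pipeline also has to be enabled in the project settings before Scrapy will call it. A minimal sketch, assuming the class lives in a hypothetical module `myproject.pipelines` (adjust the path to your project; very old Scrapy versions expect a plain list here instead of a dict):

```python
# settings.py (sketch): "myproject.pipelines" is a hypothetical module path.
# The number sets the order in which pipelines run (lower runs first).
ITEM_PIPELINES = {
    "myproject.pipelines.VideoPipeline": 300,
}
```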
