Reputation: 5221
I have an item pipeline that takes a URL from an item and downloads the file. The problem is that I have another pipeline in which I manually check the file and add a comment about it, and I really need that check to happen before the file is downloaded.
    class VideoCommentPipeline(object):
        def process_item(self, item, spider):
            os.system("vlc -vvv %s > /dev/null 2>&1 &" % item['file'])
            item['comment'] = raw_input('Your comment:')
            return item

    class VideoDownloadPipeline(object):
        def process_item(self, item, spider):
            video_basename = item['file'].split('/')[-1]
            new_filename = os.path.join(VIDEOS_DIR, video_basename)
            downloaded = False
            for i in range(5):
                try:
                    video = urllib2.urlopen(item['file']).read()
                    downloaded = True
                    break
                except:
                    continue
            if not downloaded:
                raise DropItem("Couldn't download file from %s" % item)
            f = open(new_filename, 'wb')
            f.write(video)
            f.close()
            item['file'] = video_basename
            return item
But now I always have to wait before I can review the next item, because the file from the previous item hasn't finished downloading yet. I'd rather review all the items first and let the downloads happen afterwards. How can I do that?
Upvotes: 3
Views: 2096
Reputation: 7822
Scrapy provides a media pipeline that you can use here. It is not well documented, but it exists and can be used, at least in the most recent Scrapy version. To understand how it works you need to read the code, which is quite intuitive IMO. The image pipeline's interface is a good place to see how the media pipeline works.
To check each video before downloading it, you can write something like this (adjust it to match your item field names):
    import os

    from scrapy.http import Request
    from scrapy.contrib.pipeline.media import MediaPipeline

    class VideoPipeline(MediaPipeline):
        VIDEOS_DIR = "/stack/scrapy/video/video/store"

        def get_media_requests(self, item, info):
            """
            Evaluate the file and, if you like it, download it.
            """
            # Preview the video in VLC in the background.
            os.system("vlc -vvv %s > /dev/null 2>&1 &" % item['video_url'][0])
            your_opinion = raw_input("how does it look?")
            item["comment"] = your_opinion
            if your_opinion == "hot":
                # Issue a request to download the video.
                return Request(item["video_url"][0], meta={"item": item})

        def media_downloaded(self, response, request, info):
            """
            The file has been downloaded and is available as response.body; save it.
            """
            item = response.meta.get("item")
            video = response.body
            video_basename = item['title'][0]
            new_filename = os.path.join(self.VIDEOS_DIR, video_basename)
            with open(new_filename, 'wb') as f:
                f.write(video)
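The underlying idea is to separate the review step from the download step. If you want to see that separation without Scrapy, here is a minimal Python 3 sketch using only the standard library. It is not Scrapy's API: `fetch_with_retries`, `review_items` and `download_all` are hypothetical names, the `ask` and `fetch` callables are assumptions added so the prompt and the network call can be swapped out, and the `file` and `comment` keys are taken from the question's items.

```python
import os
import urllib.request


def fetch_with_retries(url, fetch=None, attempts=5):
    """Return the body at `url`, trying up to `attempts` times; None on failure."""
    if fetch is None:
        # Default fetcher (assumption): a plain blocking HTTP GET.
        fetch = lambda u: urllib.request.urlopen(u).read()
    for _ in range(attempts):
        try:
            return fetch(url)
        except OSError:
            # urllib.error.URLError is a subclass of OSError in Python 3.
            continue
    return None


def review_items(items, ask=input):
    """Phase 1: collect a comment for every item; nothing is downloaded here."""
    for item in items:
        item["comment"] = ask("Your comment for %s: " % item["file"])
    return items


def download_all(items, target_dir, fetch=None):
    """Phase 2: download every reviewed item; return the items that were saved."""
    saved = []
    for item in items:
        body = fetch_with_retries(item["file"], fetch=fetch)
        if body is None:
            continue  # skip items that never downloaded
        basename = item["file"].split("/")[-1]
        with open(os.path.join(target_dir, basename), "wb") as f:
            f.write(body)
        item["file"] = basename
        saved.append(item)
    return saved
```

Calling `review_items(items)` first and `download_all(items, some_dir)` afterwards means no prompt ever waits on a transfer. The trade-off is that all items must be collected before any download starts, which is what the question asks for.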
Upvotes: 3