Reputation: 197
I'm running a Scrapy spider in Python to scrape images from a website. After trying some other methods, I'm now attempting to use an ImagesPipeline for this.
items.py:
class NHTSAItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
settings.py:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'C:\Users\me\Desktop'
myspider.py:
def parse_photo_page(self, response):
    item = NHTSAItem()
    for sel in response.xpath('//table[@id="tblData"]/tr'):
        url = sel.xpath('td/font/a/@href').extract()
        table_fields = sel.xpath('td/font/text()').extract()
        if url:
            base_url_photo = "http://www-nrd.nhtsa.dot.gov"
            full_url = base_url_photo + url[0]
            if not item:
                item['image_urls'] = [full_url]
            else:
                item['image_urls'].append(full_url)
    return item
No errors come up; the images just don't get downloaded. The debug output even says "Scraped". Here's the log:
DEBUG: Scraped from <200 http://www-nrd.nhtsa.dot.gov/database/VSR/veh/../SearchMedia.aspx?database=v&tstno=4000&mediatype=p&p_tstno=4000>
{'image_urls': [u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=1&database=V&type=P',
u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=2&database=V&type=P',
u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=3&database=V&type=P',
u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=4&database=V&type=P',
u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=5&database=V&type=P']}
I don't care about extending the pipeline (making a custom pipeline); the default ImagesPipeline is fine. The images are nowhere to be found. Any ideas what I'm doing wrong?
Upvotes: 4
Views: 3325
Reputation: 41
If you have followed every step described in https://docs.scrapy.org/en/latest/topics/media-pipeline.html, the last thing to do is install the Pillow library.
This is a 5-step process for downloading images in Scrapy:
1- Define the image_urls and images fields inside items.py:
image_urls = scrapy.Field()
images = scrapy.Field()
2- Activate the Scrapy images pipeline inside the settings.py file:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
3- Set the images download folder path inside the settings.py file:
IMAGES_STORE = 'path_to_your_folder'
4- Install the Pillow library:
pip install pillow
5- Inside your spider file, assign the image URLs to the item's image_urls field:
item = SpiderItem()
item['image_urls'] = ['set_images_urls_here']
# do other stuff if needed....
yield item
Once all 5 steps are in place, Scrapy should download the images.
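Step 4 is easy to miss because a missing Pillow does not necessarily show up as a crash. As far as I know, recent Scrapy versions disable ImagesPipeline with a logged warning when PIL cannot be imported, so the spider can still report items as scraped. A minimal sketch for checking this from the same interpreter that runs Scrapy (the helper name pillow_available is my own, not part of Scrapy or Pillow):

```python
import importlib.util


def pillow_available():
    """Return True if the PIL package (installed by `pip install pillow`) is importable."""
    return importlib.util.find_spec("PIL") is not None


print("Pillow importable:", pillow_available())
```

If this prints False, the pipeline from step 2 most likely never ran, whatever the other settings say.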
Upvotes: 2
Reputation: 197
Here's the solution, which came to me from this parallel question: Scrapy: Error 10054 after retrying image download (Thanks to @neverlastn)
I simply added this snippet to my actual spider.py file.
custom_settings = {
    "ITEM_PIPELINES": {'scrapy.pipelines.images.ImagesPipeline': 1},
    "IMAGES_STORE": saveLocation
}
I think it wasn't properly referencing my settings.py file, and therefore didn't activate the image pipeline. I'm not sure how to get it to accurately reference my settings file, but this solution is good enough for me!
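One way to investigate why settings.py was ignored: as I understand it, Scrapy finds a project's settings through the scrapy.cfg file, looking in the current directory and then its parents, and reads the settings module name from the [settings] section there. If the spider is launched from a directory outside the project, that lookup can miss the project's settings.py. A rough stdlib-only sketch of the same lookup; find_scrapy_cfg and settings_module are hypothetical helper names, not Scrapy APIs:

```python
import configparser
from pathlib import Path


def find_scrapy_cfg(start="."):
    """Walk upward from `start` and return the first scrapy.cfg found, or None."""
    current = Path(start).resolve()
    for folder in [current, *current.parents]:
        candidate = folder / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


def settings_module(cfg_path):
    """Return the module named by `default` under [settings], or None if absent."""
    parser = configparser.ConfigParser()
    parser.read(cfg_path)
    return parser.get("settings", "default", fallback=None)
```

Calling find_scrapy_cfg() from the directory where the spider is started shows whether a scrapy.cfg is reachable at all, and settings_module() on the result shows which settings module it points to.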
Upvotes: 3
Reputation: 2204
Try replacing this in your settings.py:
IMAGES_STORE = 'C:\Users\me\Desktop'
with:
import os
IMAGES_STORE = os.getcwd()
If that works, the problem is the format of the absolute path, and either of these should work:
IMAGES_STORE = 'C:\\Users\\me\\Desktop'
or
IMAGES_STORE = 'C:/Users/me/Desktop'
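A small sketch of why those two spellings are equivalent, using only the standard library. The raw-string variant is my addition, not part of the suggestion above. Note also that on Python 3 the original 'C:\Users\me\Desktop' literal is a syntax error, because \U starts a Unicode escape; the u'...' strings in the question's log suggest Python 2, where a plain string accepts it.

```python
from pathlib import PureWindowsPath

escaped = 'C:\\Users\\me\\Desktop'   # backslashes escaped
raw = r'C:\Users\me\Desktop'         # raw string: backslashes taken literally
forward = 'C:/Users/me/Desktop'      # forward slashes, also understood on Windows

# The escaped and raw literals are the same string.
print(escaped == raw)

# Parsed as Windows paths, the backslash and forward-slash spellings compare equal.
print(PureWindowsPath(escaped) == PureWindowsPath(forward))
```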
P.S. This all goes in settings.py. The relative XPaths issue from the other question/answer also applies here.
Upvotes: -1