J-Krush

Reputation: 197

Scrapy ImagesPipeline not downloading images

I'm running a Scrapy spider in Python to scrape images from a website. After trying some other methods, I'm now trying to use an ImagesPipeline for this.

items.py

class NHTSAItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

settings.py:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 'C:\Users\me\Desktop'

myspider.py

def parse_photo_page(self, response):
    item = NHTSAItem()
    for sel in response.xpath('//table[@id="tblData"]/tr'):
        url = sel.xpath('td/font/a/@href').extract()
        table_fields = sel.xpath('td/font/text()').extract()
        if url:
            base_url_photo = "http://www-nrd.nhtsa.dot.gov"
            full_url = base_url_photo + url[0]
            if not item:
                item['image_urls'] = [full_url]
            else: 
                item['image_urls'].append(full_url)
    return item

No errors come up; the images just don't get downloaded. The debug output even says "Scraped". Here's the log:

DEBUG: Scraped from <200 http://www-nrd.nhtsa.dot.gov/database/VSR/veh/../SearchMedia.aspx?database=v&tstno=4000&mediatype=p&p_tstno=4000>
{'image_urls': [u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=1&database=V&type=P',
            u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=2&database=V&type=P',
            u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=3&database=V&type=P',
            u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=4&database=V&type=P',
            u'http://www-nrd.nhtsa.dot.gov/database/MEDIA/GetMedia.aspx?tstno=4000&index=5&database=V&type=P']}

I don't need to extend the pipeline (i.e. write a custom pipeline); the default ImagesPipeline is fine. The images are nowhere to be found. Any ideas what I'm doing wrong?

Upvotes: 4

Views: 3325

Answers (3)

Shah Muhammad

Reputation: 41

If you have followed every step described in https://docs.scrapy.org/en/latest/topics/media-pipeline.html, the last thing to do is install the Pillow library.

Downloading images properly in Scrapy is a 5-step process:

1- Define image_urls and images fields inside items.py

 image_urls = scrapy.Field()
 images = scrapy.Field()

2- Activate the Scrapy images pipeline inside the settings.py file:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

3- Set the image download folder path inside the settings.py file:

IMAGES_STORE = 'path_to_your_folder'

4- Install the Pillow library:

pip install pillow

5- Inside your spider file, assign the image URLs to the item's image_urls field:

item = SpiderItem()
item['image_urls'] = ['set_images_urls_here']
# do other stuff if needed....

yield item

Once you have followed these 5 steps, Scrapy should download the images.
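Step 4 is easy to miss because Pillow may be installed in a different environment from the one Scrapy runs in. Here is a minimal sketch for checking this from the same interpreter that runs the spider. It assumes only that Pillow is imported under the name PIL; is_installed is a hypothetical helper name, not part of Scrapy.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by this interpreter.

    Hypothetical helper for illustration; pass top-level names only
    (e.g. "PIL"), not dotted submodule paths.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Pillow is imported as "PIL", even though it is installed as "pillow".
    if is_installed("PIL"):
        print("Pillow is available to this interpreter")
    else:
        print("Pillow is missing here; try: pip install pillow")
```

Run it with the same python executable (or virtualenv) that you use for scrapy crawl, otherwise the result says nothing about the environment the spider runs in.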

Upvotes: 2

J-Krush

Reputation: 197

Here's the solution, which came to me from this parallel question: "Scrapy: Error 10054 after retrying image download" (thanks to @neverlastn).

I simply added this snippet to my actual spider.py file.

custom_settings = {
    "ITEM_PIPELINES": {'scrapy.pipelines.images.ImagesPipeline': 1},
    "IMAGES_STORE": saveLocation
}

I think the spider wasn't picking up my settings.py file, so the image pipeline was never activated. I'm not sure how to make it read my settings file correctly, but this solution is good enough for me!

Upvotes: 3

neverlastn

Reputation: 2204

Try replacing this line in your settings.py:

IMAGES_STORE = 'C:\Users\me\Desktop'

with:

import os
IMAGES_STORE = os.getcwd()

If that works, the problem is the format of the absolute path. In that case, either of these should work:

IMAGES_STORE = 'C:\\Users\\me\\Desktop'

or

IMAGES_STORE = 'C:/Users/me/Desktop'
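To see why the path format can matter, here is a small standard-library sketch (it does not touch Scrapy). same_windows_path is a hypothetical helper name, and the raw-string form r'...' is a third option that I am adding here as an assumption that it suits your setup. In a normal string literal a backslash starts an escape sequence, so some Windows paths change silently; on Python 3, 'C:\Users\me\Desktop' does not even compile, because \U starts a Unicode escape.

```python
from pathlib import PureWindowsPath

# Three spellings of the same folder that are safe in a Python string literal.
escaped = 'C:\\Users\\me\\Desktop'   # backslashes doubled
raw = r'C:\Users\me\Desktop'         # raw string: backslashes taken literally
forward = 'C:/Users/me/Desktop'      # forward slashes, also accepted on Windows


def same_windows_path(a, b):
    """Compare two path strings using Windows path rules (hypothetical helper)."""
    return PureWindowsPath(a) == PureWindowsPath(b)


# An unescaped backslash can change the string: "\t" below is a tab character.
broken = 'C:\temp'

if __name__ == "__main__":
    print(escaped == raw)                   # identical strings
    print(same_windows_path(raw, forward))  # same path under Windows rules
    print('\t' in broken)                   # the "\t" became a tab
```

Any of the three safe spellings can be assigned to IMAGES_STORE in settings.py.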

P.S. This goes in settings.py. The relative XPaths issue from the other question/answer also applies here.

Upvotes: -1
