Reputation: 319
So I have been trying to learn Python by creating a pretty basic crawler. At the moment, all of my scraping works as expected, with the exception of images:
I have added an image store to my settings.py, and the spider successfully extracts the image URLs, but no images are being saved.
The line for extracting the image URL is as follows:
snowboard['image_URL'] = ''.join(item.xpath('li[@class="productImage"]/a/img/@data-original').extract())
This will produce something along the lines of this:
"image_URL": "/zoom/858553/230"
in my items.json.
So far everything looks OK, except that no images are being saved to my image store. For reference, this is my item pipeline:
class SnowboardPipeline(object):
    def process_item(self, item, spider):
        return item

    def get_media_requests(self, item, info):
        for imageURL in item['image_URL']:
            yield Request(imageURL)
I am wondering whether it has something to do with the image URLs not having a file extension, or whether I've overlooked something glaringly obvious in the documentation when it comes to pulling down images.
Upvotes: 1
Views: 190
Reputation: 319
So for those who are curious, my issue was essentially that the image pipeline needs full (absolute) URLs rather than just relative paths. In hindsight, this is obvious.
We can resolve this by importing urlparse (urllib.parse in Python 3) into the spider, then joining the relative image URL with the response URL as follows:
snowboard['image_urls'] = [urlparse.urljoin(response.url, snowboard['image_URL'])]
This yields a full URL to the image. I then ran into a missing JPEG decoder, but that was fixed by installing the relevant dependencies and reinstalling PIL.
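For anyone who wants to see the URL join in isolation, here is a minimal sketch using only the standard library. The page URL is a made-up placeholder (the real site is not shown here), and the import is the Python 3 location of the function; on Python 2 it lives in the urlparse module as used above.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# Hypothetical page URL, standing in for response.url in the spider.
page_url = "http://www.example.com/snowboards?page=1"

# Relative image path of the kind extracted from @data-original.
relative_image_url = "/zoom/858553/230"

# A path starting with "/" replaces the page's path and query,
# keeping only the scheme and host.
full_image_url = urljoin(page_url, relative_image_url)
print(full_image_url)  # http://www.example.com/zoom/858553/230

# The field is wrapped in a list because the pipeline loops over it;
# looping over a bare string would yield one character at a time.
image_urls = [full_image_url]
for url in image_urls:
    print(url)
```

Wrapping the value in a list matters just as much as making it absolute: a for loop over a plain string such as "/zoom/858553/230" would hand the pipeline "/", "z", "o" and so on, one request per character.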
Upvotes: 2