Scrape image data with scrapy

Question

I am using Scrapy to scrape the images related to a product on amazon.com. How would I parse the image data?

I typically use XPath, but I was not able to find an XPath expression that matches the images (other than the thumbnails). For example, this is how I parse the title:

title = response.xpath('//h1[@id="title"]/span/text()').extract()

The link to the item is: https://www.amazon.com/dp/B01N068GIX?psc=1

Tomáš Linhart · Accepted Answer

It seems the image URLs can be extracted from JavaScript that is present in the page source. I used the js2xml library to convert the JavaScript source code to XML (you can learn more about it in Scrapinghub's blog post). The XML can then be used to create a Selector, from which you can extract data with XPath as usual. Take a look at this example spider:

# -*- coding: utf-8 -*-
import js2xml
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/dp/B01N068GIX?psc=1']

    def parse(self, response):
        # The image data lives in the inline script that registers "ImageBlockATF".
        js = response.xpath("//script[contains(text(), 'register(\"ImageBlockATF\"')]/text()").extract_first()
        if js is None:
            # Script not found (page layout changed or the request was blocked).
            return
        # Convert the JavaScript source to an XML tree that XPath can query.
        xml = js2xml.parse(js)
        selector = scrapy.Selector(root=xml)
        item = dict()
        item['image_urls'] = selector.xpath('//property[@name="colorImages"]//property[@name="hiRes"]/string/text()').extract()
        yield item

To test the spider, save it as example.py and run it with a browser-like user agent:

scrapy runspider example.py -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36"

The custom user agent is needed because Amazon seems to block Scrapy based on its user agent string.
