PiccolMan

Reputation: 5366

Scrape image data with Scrapy

I am using Scrapy to scrape the images related to a product on amazon.com. How would I parse the image data?

I typically use XPath, but I could not find an XPath expression that matches the product images (other than the thumbnails). For example, this is how I parse the title:

title = response.xpath('//h1[@id="title"]/span/text()').extract()

The link to the item is: https://www.amazon.com/dp/B01N068GIX?psc=1

Upvotes: 9

Views: 2844

Answers (2)

Dan Temkin

Reputation: 1605

I know the question asks for Scrapy, but here is a version of what you want using BeautifulSoup, requests, and urllib. With this approach I did not need to set a user agent.

from urllib import request

import requests
from bs4 import BeautifulSoup


def load_image(url):
    # Fetch the product page and locate the main image URL
    resp = requests.get(url)
    img_url = _find_image_url(resp.content)
    # urlopen returns a file-like response; call .read() on it for the image bytes
    img_resp = request.urlopen(img_url)
    print(img_resp.url)


def _find_image_url(html_block):
    # The "html5lib" parser requires the html5lib package to be installed
    soup = BeautifulSoup(html_block, "html5lib")
    img_tag = soup.find("img", {"id": "landingImage"})
    if img_tag is None:
        raise ValueError("no <img id='landingImage'> found in the page")
    return img_tag["src"]


load_image("https://www.amazon.com/dp/B01N068GIX?psc=1")
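
For comparison, the same `landingImage` lookup can be sketched with only the standard library's `html.parser`. This is a minimal sketch rather than a drop-in replacement: `LandingImageParser`, `find_landing_image`, and the sample markup are made up for illustration, and it assumes the page really does contain an `img` tag with `id="landingImage"`, as the code above does.

```python
from html.parser import HTMLParser


class LandingImageParser(HTMLParser):
    """Remembers the src of the first <img id="landingImage"> tag seen."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        if tag == "img" and attrs.get("id") == "landingImage" and self.src is None:
            self.src = attrs.get("src")


def find_landing_image(html):
    """Return the landing image URL, or None if the tag is absent."""
    parser = LandingImageParser()
    parser.feed(html)
    return parser.src


# Made-up sample markup; a real product page is far larger.
SAMPLE_HTML = """
<html><body>
  <img id="thumb" src="https://example.com/thumb.jpg">
  <img id="landingImage" src="https://example.com/main.jpg">
</body></html>
"""

print(find_landing_image(SAMPLE_HTML))
```

Returning `None` instead of raising lets the caller decide how to handle a page where the tag is missing, for example when the request was blocked.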

Upvotes: 1

Tomáš Linhart

Reputation: 10210

The image URLs can be extracted from JavaScript embedded in the page source. I used the js2xml library to convert the JavaScript source code to XML (Scrapinghub's blog post covers it in more detail). You can then build a Selector from that XML and extract data as usual. Take a look at this example spider:

# -*- coding: utf-8 -*-
import js2xml
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/dp/B01N068GIX?psc=1']

    def parse(self, response):
        item = dict()
        # Inline script that registers the "ImageBlockATF" image data
        js = response.xpath("//script[contains(text(), 'register(\"ImageBlockATF\"')]/text()").extract_first()
        # Convert the JavaScript source to an XML tree and wrap it in a Selector
        xml = js2xml.parse(js)
        selector = scrapy.Selector(root=xml)
        item['image_urls'] = selector.xpath('//property[@name="colorImages"]//property[@name="hiRes"]/string/text()').extract()
        yield item

To test it, save the spider as example.py and run:

scrapy runspider example.py -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36"

The USER_AGENT setting is needed because Amazon seems to block Scrapy based on its user agent string.
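
If installing js2xml is not an option, a rougher alternative is to pull the `hiRes` values out of the script text with a regular expression. The sketch below is only an illustration under assumptions: `extract_hires_urls` is a hypothetical helper, the sample script is made up, and it assumes the URLs appear as JSON-style `"hiRes":"..."` pairs. The real page may be laid out differently, and a regex is more fragile than parsing the JavaScript properly.

```python
import json
import re

# Matches "hiRes":"<url>" pairs; entries such as "hiRes":null are skipped
# because the pattern requires a quoted string value.
HIRES_RE = re.compile(r'"hiRes"\s*:\s*("(?:[^"\\]|\\.)*")')


def extract_hires_urls(script_text):
    """Return every quoted hiRes value found in the script text."""
    # json.loads decodes any escape sequences inside the matched literal
    return [json.loads(match) for match in HIRES_RE.findall(script_text)]


# Made-up stand-in for the inline script; not copied from a real page.
SAMPLE_JS = '''
P.when('A').register("ImageBlockATF", function(A){
    var data = {
        'colorImages': { 'initial': [
            {"hiRes":"https://example.com/img1.jpg","thumb":"https://example.com/t1.jpg"},
            {"hiRes":null,"thumb":"https://example.com/t2.jpg"},
            {"hiRes":"https://example.com/img3.jpg","thumb":"https://example.com/t3.jpg"}
        ]}
    };
    return data;
});
'''

print(extract_hires_urls(SAMPLE_JS))
```

Inside the spider, the `js` string extracted in `parse` could be passed to such a helper in place of the js2xml step.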

Upvotes: 9
