Reputation: 5366
I am using Scrapy to scrape the images related to a product on amazon.com. How would I parse the image data?
I typically use XPath; however, I could not locate an XPath that matches the full-size images (only the thumbnails). For example, this is how I parse the title:
title = response.xpath('//h1[@id="title"]/span/text()').extract()
The link to the item is: https://www.amazon.com/dp/B01N068GIX?psc=1
Upvotes: 9
Views: 2844
Reputation: 1605
I know the question asks for Scrapy, but here is a version of what you want using BeautifulSoup, requests, and urllib. This method also avoids having to set the user agent.
from bs4 import BeautifulSoup as bsoup
import requests
from urllib import request

def load_image(url):
    resp1 = requests.get(url)
    imgurl = _find_image_url(resp1.content)
    resp2 = request.urlopen(imgurl)  # treats the URL as a file-like object
    print(resp2.url)

def _find_image_url(html_block):
    soup = bsoup(html_block, "html5lib")
    # The main product image carries the id "landingImage"
    imgtag = soup.find("img", {"id": "landingImage"})
    return imgtag.attrs["src"]

load_image("https://www.amazon.com/dp/B01N068GIX?psc=1")
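The find-by-id pattern in _find_image_url can be sanity-checked on a tiny standalone snippet (the HTML below is made up, reusing only the id Amazon puts on the main product image):

```python
from bs4 import BeautifulSoup

# Made-up HTML with the same id Amazon uses for the main product image
html = '<html><body><img id="landingImage" src="https://example.com/img.jpg"></body></html>'
soup = BeautifulSoup(html, "html.parser")  # html.parser avoids the html5lib dependency
print(soup.find("img", {"id": "landingImage"})["src"])  # https://example.com/img.jpg
```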
Upvotes: 1
Reputation: 10210
Seems like the images can be extracted from the JavaScript that's present in the page source. I used the js2xml library to convert the JavaScript source code to XML (you can learn more about it in Scrapinghub's blog post). The XML can then be used to create a Selector with which you can extract data as usual. Take a look at this example spider:
# -*- coding: utf-8 -*-
import js2xml
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/dp/B01N068GIX?psc=1']

    def parse(self, response):
        item = dict()
        # The image data lives in the script block that registers "ImageBlockATF"
        js = response.xpath("//script[contains(text(), 'register(\"ImageBlockATF\"')]/text()").extract_first()
        xml = js2xml.parse(js)
        selector = scrapy.Selector(root=xml)
        item['image_urls'] = selector.xpath('//property[@name="colorImages"]//property[@name="hiRes"]/string/text()').extract()
        yield item
If you'd like to test it out, run it like this:
scrapy runspider example.py -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36"
since Amazon seems to block Scrapy's default user agent string.
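Alternatively, the user agent can be baked into the spider itself via custom_settings (a standard per-spider Scrapy setting), so no command-line flag is needed:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    # Same effect as the -s USER_AGENT=... flag, but set per spider
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/28.0.1500.52 Safari/537.36'),
    }
```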
Upvotes: 9