taphos

Reputation: 33

How to extract the src of dynamically loaded images using scrapy

I'm currently trying to scrape the site https://www.bloomingdales.com with Scrapy.

In this project I want to extract the URL of the main image on each product page, e.g.:

https://www.bloomingdales.com/shop/product/free-people-over-the-rainbow-beanie?ID=1791385&CategoryID=1006048#fn=ppp%3D%26spp%3D1%26sp%3D1%26rid%3D83%26spc%3D94%26rsid%3Dundefined%26pn%3D1|2|1|94

However, each picture is loaded through a separate image request on the website, so I can't simply use an XPath expression to locate the image URL. How do I extract the image URLs using Scrapy?

Here's a screenshot of the requests I see in my chrome developer tools:

Upvotes: 3

Views: 1121

Answers (1)

Granitosaurus

Reputation: 21436

It's quite common for e-commerce websites to store JSON data in the HTML body and then have the user's browser unpack it into a full page.

For this particular page, if you copy the image URL and search for it in the page source, you can see that all of the product data is stored in:

<script id="pdp_data" type="application/json">some_json</script>

You can grab this data with Scrapy and decode the JSON into a Python dictionary:

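import json

# grab the raw JSON text from the <script id="pdp_data"> tag
data = response.xpath("//script[@id='pdp_data']/text()").extract_first()
# decode it into a Python dictionary
data = json.loads(data)
# then you can parse the data, e.g. the main image path
data['product']['imageSource']
# '8/optimized/9216988_fpx.tif'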
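
If the script tag is missing or the JSON layout changes, the snippet above raises an exception. Below is a minimal sketch of a more defensive version. The helper name `extract_image_source` and the `sample` string are hypothetical, and the `product` / `imageSource` keys are simply the ones shown above; it only uses the standard library so it can be tried outside a spider.

```python
import json


def extract_image_source(script_text):
    """Return product.imageSource from the pdp_data JSON text, or None.

    Hypothetical helper: the key names are taken from the answer above.
    """
    if not script_text:
        # extract_first() returns None when the script tag is not found
        return None
    try:
        data = json.loads(script_text)
    except json.JSONDecodeError:
        # the script body was not valid JSON
        return None
    # guard against an unexpected JSON layout instead of raising
    product = data.get("product") if isinstance(data, dict) else None
    if not isinstance(product, dict):
        return None
    return product.get("imageSource")


# Hypothetical sample standing in for the real script contents
sample = '{"product": {"imageSource": "8/optimized/9216988_fpx.tif"}}'
print(extract_image_source(sample))
```

Inside a spider callback you would pass it the result of `response.xpath("//script[@id='pdp_data']/text()").extract_first()`. Note that the value is a relative path, not a full URL; the host and prefix it is served from would have to be read from the image requests shown in the browser's developer tools.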

Upvotes: 4