Extract image source from lazy loading content with Scrapy

Question

I'm trying to extract the value of the src attribute of an img tag using Scrapy.

For example:

I want to extract the URL:

https://media.rightmove.co.uk/map/_generate?width=768&height=347&zoomLevel=15&latitude=53.803485&longitude=-1.561766&signature=rq2YsiaRQTXqZ2ilgvbFF3fdWfU=

When I open the response from the Scrapy shell in Chrome, I can see the data I want to extract (via developer tools), but when I try to extract it with XPath, it returns nothing.

e.g.

response.xpath("""//*[@id="root"]/div/div[3]/main/div[15]/div/a/img""").get()

I'm guessing loading="lazy" has something to do with it. However, the response returned by Scrapy shows the data I want when viewed in a browser (with JavaScript disabled).

Steps to reproduce:

$ scrapy shell https://www.rightmove.co.uk/properties/91448747#/
>>> view(response)

Anyone know how I can extract the URL from the map? I'm interested in doing this in order to extract the lat-long of the property.

Marcos · Accepted Answer

This HTML tag is generated by JavaScript when you open the page in the browser. When inspecting with view(response), I suggest setting the tab to Offline in the DevTools Network tab and reloading the page.

This prevents the tab from downloading other content, which mirrors what the Scrapy shell does. Indeed, after doing this we can see that the tag does not exist at this point.

However, the data seems to be available in one of the script tags. You can check this by running the following commands.

$ scrapy shell https://www.rightmove.co.uk/properties/91448747#/
>>> import json
>>> from pprint import pprint as pp
>>> # Capture the JSON assigned to window.PAGE_MODEL inside a script tag
>>> jdata = json.loads(response.xpath('//script').re_first(r'window\.PAGE_MODEL = (.*)'))
>>> pp(jdata)
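
Since the goal is the lat-long of the property, here is a minimal sketch of two ways to get it once you have the data. It assumes the parsed JSON contains a dict with both "latitude" and "longitude" keys somewhere in its nesting; the sample dict below is a hypothetical stand-in, not the real PAGE_MODEL layout, and find_coordinates and coordinates_from_map_url are helper names made up for illustration. The second helper reads the same values from the query string of the map URL quoted in the question.

```python
from urllib.parse import urlparse, parse_qs


def find_coordinates(node):
    """Depth-first search for the first dict holding both coordinate keys."""
    if isinstance(node, dict):
        if "latitude" in node and "longitude" in node:
            return float(node["latitude"]), float(node["longitude"])
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        # Scalars (str, int, None, ...) cannot contain coordinates.
        return None
    for child in children:
        found = find_coordinates(child)
        if found is not None:
            return found
    return None


def coordinates_from_map_url(url):
    """Read latitude and longitude from the query string of a map image URL."""
    query = parse_qs(urlparse(url).query)
    return float(query["latitude"][0]), float(query["longitude"][0])


# Hypothetical stand-in for jdata; the real structure may differ.
sample = {
    "propertyData": {
        "id": "91448747",
        "location": {"latitude": 53.803485, "longitude": -1.561766},
    }
}

# Map URL taken from the question.
map_url = (
    "https://media.rightmove.co.uk/map/_generate?width=768&height=347"
    "&zoomLevel=15&latitude=53.803485&longitude=-1.561766"
    "&signature=rq2YsiaRQTXqZ2ilgvbFF3fdWfU="
)

print(find_coordinates(sample))
print(coordinates_from_map_url(map_url))
```

In the Scrapy shell you would call find_coordinates(jdata) instead of using the sample dict. Searching by key name avoids hard-coding a path into the JSON, which may change when the site is updated.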
