Lucien S.

Reputation: 5345

Extracting content of <script> with Scrapy

I'm trying to extract the latitude and longitude from this page: https://www.realestate.com.kh/buy/nirouth/4-bed-5-bath-twin-villa-143957/

It can be found in this part of the page (the XPath of this element is /html/head/script[8]):

<script type="application/ld+json">{"@context":"http://schema.org","@type":"Residence","address":{"@type":"PostalAddress","addressRegion":"Phnom Penh","addressLocality":"Chbar Ampov"},"geo":{"@type":"GeoCoordinates","latitude":11.52,"longitude":104.95,"address":{"@type":"PostalAddress","addressRegion":"Phnom Penh","addressLocality":"Chbar Ampov"}}}</script>

Here's my script:

import scrapy

class ScrapingSpider(scrapy.Spider):
    name = 'scraping'
    # allowed_domains = ['https://www.realestate.com.kh/buy/']
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        lat = response.xpath('/html/head/script[8]')
        print('----------------',lat)

        yield {
           'lat': lat
        }

However, this XPath yields an empty list. Is it because the content I'm looking for is in a JS script?

Upvotes: 0

Views: 274

Answers (1)

renatodvc

Reputation: 2564

Since Scrapy doesn't execute JavaScript, some <script> tags may not be loaded into the page, so the position of the tag you want can differ from what you see in the browser. For this reason, using an index to pinpoint the element isn't a good idea. It's better to search for something specific; my suggestion would be:

response.xpath('//head/script[contains(text(), "latitude")]')

Edit:

The selector above returns a SelectorList, which you can then parse however you like. If you want to extract the whole text of the script, you can use:

response.xpath('//head/script[contains(text(), "latitude")]/text()').get()

If you want only the latitude value, you can use a regex:

response.xpath('//head/script[contains(text(), "latitude")]/text()').re_first(r'"latitude":(\d{1,3}\.\d{1,2})')
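Note that this pattern assumes a positive value with one or two decimal places, so a coordinate such as 11.5234 would be cut to 11.52 and a negative one would not match. A more tolerant pattern could look like the sketch below, which uses the standard re module so it can be tried outside a spider. The names COORD_RE and find_coord and the sample string are hypothetical and only for illustration.

```python
import re

# Hypothetical pattern: optional whitespace around the colon, optional minus
# sign, and any number of decimal places (or none at all).
COORD_RE = r'"{key}"\s*:\s*(-?\d+(?:\.\d+)?)'


def find_coord(script_text, key):
    """Return the numeric value stored under `key` as a float, or None if absent."""
    match = re.search(COORD_RE.format(key=re.escape(key)), script_text)
    return float(match.group(1)) if match else None


# Shortened sample modelled on the JSON-LD snippet from the question.
sample = '{"geo":{"@type":"GeoCoordinates","latitude":11.52,"longitude":104.95}}'
print(find_coord(sample, "latitude"), find_coord(sample, "longitude"))
```

Inside the spider, the same pattern string should also work with .re_first() in place of the one above.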

See the Scrapy docs on the regex methods of Selectors.
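As an alternative to a regex: the body of this script tag is JSON-LD, so the text returned by .get() can probably be parsed with the json module, which gives you both coordinates at once. This is a minimal sketch, assuming the top-level JSON value is an object with a "geo" key as in the snippet from the question. extract_geo is a hypothetical helper name, and sample is a shortened copy of that snippet.

```python
import json


def extract_geo(script_text):
    """Parse JSON-LD text and return (latitude, longitude); None for missing values."""
    data = json.loads(script_text)
    # Assumption: the top-level value is a JSON object that may carry a "geo" key.
    geo = data.get("geo") or {}
    return geo.get("latitude"), geo.get("longitude")


# Shortened copy of the <script> content shown in the question.
sample = (
    '{"@context":"http://schema.org","@type":"Residence",'
    '"geo":{"@type":"GeoCoordinates","latitude":11.52,"longitude":104.95}}'
)

lat, lon = extract_geo(sample)
print(lat, lon)  # prints: 11.52 104.95
```

In the spider you would pass the result of the /text()').get() call above to extract_geo, after checking that it isn't None.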

Upvotes: 1
