ybb

Reputation: 147

Using Scrapy to extract dynamic data - store locations based on postcodes

I'm new to Scrapy. With some tutorials I was able to scrape a few simple websites, but now I'm facing an issue with a new website where I have to fill in a search form and extract the results. The response I get doesn't contain the results.

Let's say for example, for the following site: http://www.beaurepaires.com.au/store-locator/

I want to provide a list of postcodes and extract information about stores in each postcode (store name and address).

I'm using the following code, but it's not working, and I'm not sure where to go from here.

from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class BeaurepairesSpider(BaseSpider):
    name = "beaurepaires"
    allowed_domains = ["http://www.beaurepaires.com.au"]
    start_urls = ["http://www.beaurepaires.com.au/store-locator/"]
    #start_urls = ["http://www.beaurepaires.com.au/"]

    def parse(self, response):
        yield FormRequest.from_response(response, formname='frm_dealer_locator',
                                        formdata={'dealer_postcode_textfield': '2115'},
                                        callback=self.parseBeaurepaires)

    def parseBeaurepaires(self, response):
        hxs = HtmlXPathSelector(response)
        filename = "postcodetest3.txt"
        open(filename, 'wb').write(response.body)
        table = hxs.select("//div[@id='jl_results']/table/tbody")
        headers = table.select("tr[position()<=1]")
        data_rows = table.select("tr[position()>1]")

Thanks!!

Upvotes: 2

Views: 1218

Answers (1)

alecxe

Reputation: 474241

The page here relies heavily on JavaScript to load its content and is too complex for Scrapy alone. Here's an example of what I've come up with:

import re
from scrapy.http import FormRequest, Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class BeaurepairesSpider(BaseSpider):
    name = "beaurepaires"
    allowed_domains = ["beaurepaires.com.au", "gdt.rightthere.com.au"]
    start_urls = ["http://www.beaurepaires.com.au/store-locator/"]

    def parse(self, response):
        yield FormRequest.from_response(response, formname='frm_dealer_locator',
                                        formdata={'dealer_postcode_textfield':'2115'},
                                        callback=self.parseBeaurepaires)

    def parseBeaurepaires(self, response):
        hxs = HtmlXPathSelector(response)

        script = str(hxs.select("//div[@id='jl_container']/script[4]/text()").extract()[0])
        url, script_name = re.findall(r'LoadScripts\("([a-zA-Z:/\.]+)", "(\w+)"', script)[0]
        url = "%s/locator/js/data/%s.js" % (url, script_name)
        yield Request(url=url, callback=self.parse_js)

    def parse_js(self, response):
        print response.body  # here are your locations - right, inside the js file

Note that regular expressions and hardcoded URLs are involved, and you'd still have to parse the JS file in order to get your locations - this is too fragile, even if you finish it and get the locations out.
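To illustrate what that last parsing step would involve: a minimal sketch, assuming the fetched `.js` file embeds the store data as a JavaScript array literal that happens to be valid JSON (the variable name `locations` and the record fields here are hypothetical, not taken from the actual site - you'd have to inspect the real file):

    import json
    import re

    # Hypothetical payload standing in for the downloaded .js file.
    sample_js = 'var locations = [{"name": "Store A", "address": "1 Main St"}];'

    # Pull out the array literal and parse it as JSON.
    match = re.search(r'=\s*(\[.*\])\s*;', sample_js, re.DOTALL)
    if match:
        stores = json.loads(match.group(1))
        for store in stores:
            print(store["name"], store["address"])

If the real file wraps the data in function calls or uses non-JSON syntax (unquoted keys, single quotes), `json.loads` will fail and you'd need heavier regex surgery - which is exactly why this approach is fragile.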

Just switch to browser-automation tools like Selenium (or combine Scrapy with it).
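For example, a minimal Selenium sketch (requires a local browser, so it can't run headlessly as-is; the element ids `dealer_postcode_textfield` and `jl_results` are taken from the question and answer above - verify them against the live page before relying on this):

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Firefox()
    try:
        driver.get("http://www.beaurepaires.com.au/store-locator/")
        # Fill in the postcode field and submit the search form.
        field = driver.find_element_by_id("dealer_postcode_textfield")
        field.send_keys("2115")
        field.send_keys(Keys.RETURN)
        # Let the javascript render the results before reading them.
        driver.implicitly_wait(10)
        results = driver.find_element_by_id("jl_results")
        print(results.text)
    finally:
        driver.quit()

Since the browser executes the javascript for you, the rendered results are right there in the DOM - no regexes, no hardcoded data URLs.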

Upvotes: 3
