Cartucho

Reputation: 3329

Change website delivery country with Scrapy

I need to scrape the website http://www.yellowkorner.com/. Choosing a different country changes all the prices. There are 40+ countries listed, and each of them must be scraped.

My current spider is pretty simple:

# coding=utf-8

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']

    def parse(self, response):
        for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
            yield scrapy.Request(response.urljoin(url), self.parse_prices)

    def parse_prices(self, response):
        yield None

How can I scrape price information for all countries?


Upvotes: 0

Views: 1764

Answers (1)

guilhermerama

Reputation: 750

Open the page with Firebug and refresh it. In the Network panel, under the Cookies sub-panel, you will see that the page stores the country information in cookies (see image below).

enter image description here

So you have to force the LANGUAGE and COUNTRY values of the "YellowKornerCulture" cookie on each request. Based on your code, I wrote an example that reads the available countries from the site and loops over them to get all the prices. See the code below:

# coding=utf-8

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']

    def parse(self, response):
        countries = self.get_countries(response)
        # countries = ['BR', 'US']  # try this if you only need some countries
        for country in countries:
            # The expression re(r'/photos/\d\d\d\d/.*$') only matches photos
            # with 4-digit ids. I think this is not your goal.
            for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
                yield scrapy.Request(
                    response.urljoin(url),
                    # Force the country and blank out the other site cookies
                    cookies={
                        'YellowKornerCulture': 'Language=US&Country=' + str(country),
                        'YellowKornerHistory': '',
                        'ASP.NET_SessionId': '',
                    },
                    callback=self.parse_prices,
                    dont_filter=True,
                    meta={'country': country},
                )

    def parse_prices(self, response):
        yield {
            'name': response.xpath('//h1[@itemprop="name"]/text()').extract()[0],
            'price': response.xpath('//span[@itemprop="price"]/text()').extract()[0],
            'country': response.meta['country'],
        }

    # Returns the country codes available on the site
    def get_countries(self, response):
        return response.xpath('//select[@id="ctl00_languageSelection_ddlCountry"]/option/attribute::value').extract()

It took a while to figure this out, but you also have to clear the other cookies that the site uses to choose the language page. I also fixed the language value to English (US). The parameter dont_filter=True is needed because each loop iteration requests a URL that has already been requested, and by default Scrapy does not repeat a request to the same URL, for performance reasons.
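To make the cookie handling a bit more concrete, here is a minimal sketch in plain Python (no Scrapy needed) of how the cookies and the url/country pairs could be built. The helper names `culture_cookies` and `request_plan` and the sample URLs are made up for illustration; only the cookie names and the `Language=US&Country=...` value format come from the spider above.

```python
from itertools import product


def culture_cookies(country, language='US'):
    """Cookies that force the country and blank out the other site cookies."""
    return {
        'YellowKornerCulture': 'Language={}&Country={}'.format(language, country),
        'YellowKornerHistory': '',
        'ASP.NET_SessionId': '',
    }


def request_plan(urls, countries):
    """Return one (url, cookies, meta) tuple per url/country pair."""
    plan = []
    for country, url in product(countries, urls):
        plan.append((url, culture_cookies(country), {'country': country}))
    return plan


# Hypothetical sample data: two photo URLs, two country codes
plan = request_plan(['/photos/1234/a.aspx', '/photos/5678/b.aspx'], ['BR', 'US'])
for url, cookies, meta in plan:
    print(url, cookies['YellowKornerCulture'], meta)
```

Each tuple would correspond to one scrapy.Request in the spider. Every URL shows up once per country, which is exactly why dont_filter=True is needed there.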

PS: The xpath expressions provided can be improved.
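The same goes for the link pattern. As the comment in the spider notes, `\d\d\d\d` only matches 4-digit photo ids. The sketch below compares it with `\d+` using the standard `re` module. The sample hrefs are invented, and I am assuming (not certain) that the site also has ids of other lengths.

```python
import re

# Invented sample hrefs, only to show the difference between the patterns
hrefs = [
    '/photos/1234/sunset.aspx',
    '/photos/123/dune.aspx',
    '/photos/12345/city.aspx',
    '/artists/index.aspx',
]

four_digit = re.compile(r'/photos/\d\d\d\d/.*$')  # pattern used in the spider
any_digits = re.compile(r'/photos/\d+/.*$')       # accepts ids of any length

strict = [h for h in hrefs if four_digit.search(h)]
relaxed = [h for h in hrefs if any_digits.search(h)]

print(strict)
print(relaxed)
```

If all ids on the site really have four digits, the original pattern is fine as it is.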

Hope this helps.

Upvotes: 2
