Reputation: 3329
I need to scrape the website http://www.yellowkorner.com/. Choosing a different country changes all the prices. There are 40+ countries listed, and each of them must be scraped.
My current spider is pretty simple:
# coding=utf-8
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']

    def parse(self, response):
        for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
            yield scrapy.Request(response.urljoin(url), self.parse_prices)

    def parse_prices(self, response):
        yield None
How can I scrape price information for all countries?
Upvotes: 0
Views: 1764
Reputation: 750
Open the page with Firebug and refresh. In the Network panel, under the Cookies sub-panel, you will see that the page stores the country information in cookies.
So you have to force the LANGUAGE and COUNTRY values of the "YellowKornerCulture" cookie on each request. I made an example based on your code that gets the countries available on the site and loops over them to get all the prices. See the code below:
# coding=utf-8
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']

    def parse(self, response):
        countries = self.get_countries(response)
        # countries = ['BR', 'US']  # try this if you only need some countries
        for country in countries:
            # The expression re(r'/photos/\d\d\d\d/.*$') only matches photos
            # with 4-digit ids. I think this is not your goal.
            for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
                yield scrapy.Request(
                    response.urljoin(url),
                    # Force the country and clear the other cookies the site uses
                    cookies={
                        'YellowKornerCulture': 'Language=US&Country=' + str(country),
                        'YellowKornerHistory': '',
                        'ASP.NET_SessionId': '',
                    },
                    callback=self.parse_prices,
                    dont_filter=True,
                    meta={'country': country},
                )

    def parse_prices(self, response):
        yield {
            'name': response.xpath('//h1[@itemprop="name"]/text()').extract()[0],
            'price': response.xpath('//span[@itemprop="price"]/text()').extract()[0],
            'country': response.meta['country'],
        }

    # Returns the country codes available on the site
    def get_countries(self, response):
        return response.xpath('//select[@id="ctl00_languageSelection_ddlCountry"]/option/attribute::value').extract()
It took a while to figure this out, but you also have to clear the other cookies that the site uses to choose the page language. I also fixed the language value to English (US). The parameter dont_filter=True is needed because each loop iteration requests a URL that has already been requested, and by default Scrapy filters out repeated requests to the same URL.
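If it helps readability, the cookie values used in the spider above could be built in a small helper. This is only a sketch: the function name build_country_cookies is hypothetical, and the cookie names and the 'Language=...&Country=...' format are simply copied from the code in this answer, not from any documentation of the site.

```python
def build_country_cookies(country, language='US'):
    """Return the cookies that force a given country on the site.

    Assumption: the cookie names and value format are the ones observed
    in the browser and used in the spider above; the site may change them.
    """
    return {
        # Carries the language and the country used to compute prices
        'YellowKornerCulture': 'Language={}&Country={}'.format(language, country),
        # Emptied so a previous selection does not override the forced one
        'YellowKornerHistory': '',
        'ASP.NET_SessionId': '',
    }


# Example: one cookie dict per country code
cookies_by_country = {code: build_country_cookies(code) for code in ['BR', 'US']}
```

The result can then be passed as the cookies argument of scrapy.Request, together with dont_filter=True and meta={'country': country} as in the spider.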
PS: The xpath expressions provided can be improved.
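The same goes for the link pattern mentioned in the code comment. Below is a minimal sketch, using only the standard re module, of how the 4-digit pattern compares with a looser one that accepts any number of digits. The sample hrefs are made up for illustration and I have not checked which id lengths the site actually uses.

```python
import re

# Pattern from the question: exactly four digits after /photos/
FOUR_DIGIT = re.compile(r'/photos/\d\d\d\d/.*$')
# Looser alternative: one or more digits after /photos/
ANY_DIGITS = re.compile(r'/photos/\d+/.*$')

# Hypothetical links, only to show the difference between the patterns
sample_hrefs = [
    '/photos/1234/some-photo.aspx',
    '/photos/123/another-photo.aspx',
    '/photos/12345/third-photo.aspx',
    '/artists/1234/someone.aspx',
]


def matching(pattern, hrefs):
    """Return the hrefs in which the pattern is found."""
    return [href for href in hrefs if pattern.search(href)]


four_digit_links = matching(FOUR_DIGIT, sample_hrefs)
any_digit_links = matching(ANY_DIGITS, sample_hrefs)
```

With these samples, only the first link matches the 4-digit pattern, while the looser one also picks up the 3-digit and 5-digit ids.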
Hope this helps.
Upvotes: 2